我需要帮助对跨不同ID集的用户列表(20 million+)进行重复数据消除。
它看起来是这样的:
举个例子:

所有这些ID都显示一个用户。
我想我可以添加一个fourthID (GroupID),这将是对它们进行重复数据删除的那个。有点像这样:

问题是:我知道如何通过CURSOR / OPEN / FETCH / NEXT命令在SQL Server上执行此操作,但我的环境中只有Hive QL、Impala和Python可用。
有没有人知道最好的方法是什么?
万分感谢,
雨果
发布于 2018-03-21 19:20:42
根据您的示例,假设id2始终存在,您可以聚合行,按id2分组:
select max(id1) id1, id2, max(id3) id3 from
( --your dataset as in example
select 'A' as id1, 1 as id2, null as id3 union all
select null as id1, 1 as id2, 'Alpha' as id3 union all
select null as id1, 2 as id2, 'Beta' as id3 union all
select 'A' as id1, 2 as id2, null as id3
)s
group by id2;
OK
A 1 Alpha
A 2 Beta
Time taken: 58.739 seconds, Fetched: 2 row(s)现在我尝试实现你的逻辑,就像你描述的那样:
select --pass2
id1, id2, id3,
case when lag(id2) over (order by id2, GroupId) = id2 then lag(GroupId) over (order by id2, GroupId) else GroupId end GroupId2
from
(
select --pass1
id1, id2, id3,
case when
lag(id1) over(order by id1, NVL(ID1,ID3)) =id1 then lag(NVL(ID1,ID3)) over(order by id1, NVL(ID1,ID3)) else NVL(ID1,ID3) end GroupId
from
( --your dataset as in example
select 'A' as id1, 1 as id2, null as id3 union all
select null as id1, 1 as id2, 'Alpha' as id3 union all
select null as id1, 2 as id2, 'Beta' as id3 union all
select 'A' as id1, 2 as id2, null as id3
)s
)s --pass1
;
OK
id1 id2 id3 groupid2
A 1 NULL A
NULL 1 Alpha A
A 2 NULL A
NULL 2 Beta A
Time taken: 106.944 seconds, Fetched: 4 row(s)https://stackoverflow.com/questions/49402019
复制相似问题