Hive窗口函数GROUPING SETS、GROUPING__ID、CUBE、ROLLUP入门
现有 hive 表 cookie5, 内容如下:
hive> select * from cookie5;
cookie5.month cookie5.day cookie5.cookieid
2015-03 2015-03-10 cookie1
2015-03 2015-03-10 cookie5
2015-03 2015-03-12 cookie7
2015-04 2015-04-12 cookie3
2015-04 2015-04-13 cookie2
2015-04 2015-04-13 cookie4
2015-04 2015-04-16 cookie4
2015-03 2015-03-10 cookie2
2015-03 2015-03-10 cookie3
2015-04 2015-04-12 cookie5
2015-04 2015-04-13 cookie6
2015-04 2015-04-15 cookie3
2015-04 2015-04-15 cookie2
2015-04 2015-04-16 cookie1
在一个group by
查询中,根据不同的维度组合进行聚合,等价于将不同维度的group by
结果集进行union all
grouping__id
,表示结果属于哪一个分组集合
注意:
grouping__id
中的__
是两个下划线
grouping__id
需要与grouping sets()或cube或rollup一起使用,否则会报错
select month, day, count(distinct cookieid) as uv
from cookie5
group by month,day
grouping sets (month,day);
结果:
month day uv
NULL 2015-03-10 4
NULL 2015-03-12 1
NULL 2015-04-12 2
NULL 2015-04-13 3
NULL 2015-04-15 2
NULL 2015-04-16 2
2015-03 NULL 5
2015-04 NULL 6
结果分析:
select month, day, count(distinct cookieid) as uv, grouping__id
from cookie5
group by month,day
grouping sets (month,day,(month,day))
order by grouping__id;
结果:
month day uv grouping__id
2015-03 2015-03-10 4 0
2015-04 2015-04-16 2 0
2015-04 2015-04-13 3 0
2015-04 2015-04-12 2 0
2015-04 2015-04-15 2 0
2015-03 2015-03-12 1 0
2015-03 NULL 5 1
2015-04 NULL 6 1
NULL 2015-04-16 2 2
NULL 2015-04-15 2 2
NULL 2015-04-13 3 2
NULL 2015-04-12 2 2
NULL 2015-03-12 1 2
NULL 2015-03-10 4 2
根据group by的维度的所有组合进行聚合
select month, day, count(distinct cookieid) as uv, grouping__id
from cookie5
group by month,day
with cube
order by grouping__id;
结果:
month day uv grouping__id
2015-03 2015-03-10 4 0
2015-04 2015-04-16 2 0
2015-04 2015-04-13 3 0
2015-04 2015-04-12 2 0
2015-04 2015-04-15 2 0
2015-03 2015-03-12 1 0
2015-03 NULL 5 1
2015-04 NULL 6 1
NULL 2015-04-16 2 2
NULL 2015-04-15 2 2
NULL 2015-04-13 3 2
NULL 2015-04-12 2 2
NULL 2015-03-12 1 2
NULL 2015-03-10 4 2
NULL NULL 7 3
等价于:
select month, day, count(distinct cookieid) as uv, 0 as grouping__id from cookie5 group by month, day
union all
select month, null, count(distinct cookieid) as uv, 1 as grouping__id from cookie5 group by month
union all
select null, day, count(distinct cookieid) as uv, 2 as grouping__id from cookie5 group by day
union all
select null, null, count(distinct cookieid) as uv, 3 as grouping__id from cookie5;
是cube的子集,以最左侧的维度为主,从该维度进行层级聚合
--(1) 以month维度进行层级聚合(月天的uv、月的uv、总uv)
select month, day, count(distinct cookieid) as uv, grouping__id
from cookie5
group by month,day
with rollup
order by grouping__id;
结果:
month day uv grouping__id
2015-04 2015-04-16 2 0
2015-04 2015-04-15 2 0
2015-04 2015-04-13 3 0
2015-04 2015-04-12 2 0
2015-03 2015-03-12 1 0
2015-03 2015-03-10 4 0
2015-04 NULL 6 1
2015-03 NULL 5 1
NULL NULL 7 3
--(2) 以day维度进行层级聚合(天月的uv、天的uv、总uv)
select day, month, count(distinct cookieid) as uv, grouping__id
from cookie5
group by day,month
with rollup
order by grouping__id;
结果:
day month uv grouping__id
2015-04-16 2015-04 2 0
2015-04-15 2015-04 2 0
2015-04-13 2015-04 3 0
2015-04-12 2015-04 2 0
2015-03-12 2015-03 1 0
2015-03-10 2015-03 4 0
2015-04-12 NULL 2 1
2015-04-16 NULL 2 1
2015-03-10 NULL 4 1
2015-04-15 NULL 2 1
2015-03-12 NULL 1 1
2015-04-13 NULL 3 1
NULL NULL 7 3