前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Apache Hive Cube

Apache Hive Cube

作者头像
DataScience
发布2019-12-30 17:35:16
9050
发布2019-12-30 17:35:16
举报
文章被收录于专栏:A2DataA2Data

Apache Hive Cube

(维度组合聚合)

Apache Hive GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

参数设置:

set hive.execution.engine=tez;

set tez.queue.name=normal;

set hive.new.job.grouping.set.cardinality =100;

默认为 30

这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。

Hive版本为 apache-hive-0.13.1

数据准备:

代码语言:javascript
复制
2019-03,2019-03-10,cookie1
2019-03,2019-03-10,cookie5
2019-03,2019-03-12,cookie7
2019-04,2019-04-12,cookie3
2019-04,2019-04-13,cookie2
2019-04,2019-04-13,cookie4
2019-04,2019-04-16,cookie4
2019-03,2019-03-10,cookie2
2019-03,2019-03-10,cookie3
2019-04,2019-04-12,cookie5
2019-04,2019-04-13,cookie6
2019-04,2019-04-15,cookie3
2019-04,2019-04-15,cookie2
2019-04,2019-04-16,cookie1

CREATE EXTERNAL TABLE a2data.test (
month STRING,
day STRING, 
cookieid STRING 
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
stored as textfile location '/tmp/test/';


hive> select * from a2data.test;
OK
2019-03 2019-03-10      cookie1
2019-03 2019-03-10      cookie5
2019-03 2019-03-12      cookie7
2019-04 2019-04-12      cookie3
2019-04 2019-04-13      cookie2
2019-04 2019-04-13      cookie4
2019-04 2019-04-16      cookie4
2019-03 2019-03-10      cookie2
2019-03 2019-03-10      cookie3
2019-04 2019-04-12      cookie5
2019-04 2019-04-13      cookie6
2019-04 2019-04-15      cookie3
2019-04 2019-04-15      cookie2
2019-04 2019-04-16      cookie1

GROUPING SETS

在一个GROUP BY查询中,根据不同的维度组合进行聚合,

等价于将不同维度的GROUP BY结果集进行UNION ALL

Apache Hive GROUPING SETS

代码语言:javascript
复制
SELECT 
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID 
FROM a2data.test 
GROUP BY month,day 
GROUPING SETS (month,day) 
ORDER BY GROUPING__ID;

month      day            uv      GROUPING__ID
------------------------------------------------
2019-03    NULL            5       1
2019-04    NULL            6       1
NULL       2019-03-10      4       2
NULL       2019-03-12      1       2
NULL       2019-04-12      2       2
NULL       2019-04-13      3       2
NULL       2019-04-15      2       2
NULL       2019-04-16      2       2


等价于 
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM a2data.test GROUP BY month 
UNION ALL 
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM a2data.test GROUP BY day

再如

代码语言:javascript
复制
SELECT 
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID 
FROM a2data.test 
GROUP BY month,day 
GROUPING SETS (month,day,(month,day)) 
ORDER BY GROUPING__ID;

month         day             uv      GROUPING__ID
------------------------------------------------
2019-03       NULL            5       1
2019-04       NULL            6       1
NULL          2019-03-10      4       2
NULL          2019-03-12      1       2
NULL          2019-04-12      2       2
NULL          2019-04-13      3       2
NULL          2019-04-15      2       2
NULL          2019-04-16      2       2
2019-03       2019-03-10      4       3
2019-03       2019-03-12      1       3
2019-04       2019-04-12      2       3
2019-04       2019-04-13      3       3
2019-04       2019-04-15      2       3
2019-04       2019-04-16      2       3


等价于
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM a2data.test GROUP BY month 
UNION ALL 
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM a2data.test GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM a2data.test GROUP BY month,day
其中的 GROUPING__ID,表示结果属于哪一个分组集合。

CUBE

根据GROUP BY的维度的所有组合进行聚合。

代码语言:javascript
复制
SELECT 
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID 
FROM a2data.test 
GROUP BY month,day 
WITH CUBE 
ORDER BY GROUPING__ID;


month                  day             uv     GROUPING__ID
--------------------------------------------
NULL            NULL            7       0
2019-03         NULL            5       1
2019-04         NULL            6       1
NULL            2019-04-12      2       2
NULL            2019-04-13      3       2
NULL            2019-04-15      2       2
NULL            2019-04-16      2       2
NULL            2019-03-10      4       2
NULL            2019-03-12      1       2
2019-03         2019-03-10      4       3
2019-03         2019-03-12      1       3
2019-04         2019-04-16      2       3
2019-04         2019-04-12      2       3
2019-04         2019-04-13      3       3
2019-04         2019-04-15      2       3



等价于
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM a2data.test
UNION ALL 
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM a2data.test GROUP BY month 
UNION ALL 
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM a2data.test GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM a2data.test GROUP BY month,day

ROLLUP (是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。)

代码语言:javascript
复制
比如,以month维度进行层级聚合:
SELECT 
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID  
FROM a2data.test 
GROUP BY month,day
WITH ROLLUP 
ORDER BY GROUPING__ID;

month                  day             uv     GROUPING__ID
---------------------------------------------------
NULL             NULL            7       0
2019-03          NULL            5       1
2019-04          NULL            6       1
2019-03          2019-03-10      4       3
2019-03          2019-03-12      1       3
2019-04          2019-04-12      2       3
2019-04          2019-04-13      3       3
2019-04          2019-04-15      2       3
2019-04          2019-04-16      2       3

可以实现这样的上钻过程:

⊙月天的UV->月的UV->总UV

--把month和day调换顺序,则以day维度进行层级聚合:

Apache Hive

代码语言:javascript
复制
SELECT 
day,
month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID  
FROM a2data.test 
GROUP BY day,month 
WITH ROLLUP 
ORDER BY GROUPING__ID;


day                    month              uv     GROUPING__ID
-------------------------------------------------------
NULL            NULL               7       0
2019-04-13      NULL               3       1
2019-03-12      NULL               1       1
2019-04-15      NULL               2       1
2019-03-10      NULL               4       1
2019-04-16      NULL               2       1
2019-04-12      NULL               2       1
2019-04-12      2019-04            2       3
2019-03-10      2019-03            4       3
2019-03-12      2019-03            1       3
2019-04-13      2019-04            3       3
2019-04-15      2019-04            2       3
2019-04-16      2019-04            2       3

可以实现这样的上钻过程:
天月的UV->天的UV->总UV
(这里,根据天和月进行聚合,和根据天聚合结果一样,因为有父子关系,如果是其他维度组合的话,就会不一样)
这种函数,需要结合实际场景和数据去使用和研究,只看说明的话,很难理解。
代码语言:javascript
复制
官网的介绍: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2019-05-27,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 DataScience 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档