在日常工作进行数据的ETL或者面试时,经常遇到类似的问题,比如"统计连续N天交易额超过100万的店铺"、"统计连续登录天数超过3天的用户"等。对于这类问题,思路基本都是一样的。本文将介绍常用的两种解决方案。
以"统计连续登录天数超过3天的用户"为需求。
注:以下用到的SQL函数,建议参考《Hive的利器:强大而实用的开窗函数》
数据准备
+---+----------+
|id |login_date|
+---+----------+
|01 |2021-02-28|
|01 |2021-03-01|
|01 |2021-03-02|
|01 |2021-03-04|
|01 |2021-03-05|
|01 |2021-03-06|
|01 |2021-03-08|
|02 |2021-03-01|
|02 |2021-03-02|
|02 |2021-03-03|
|02 |2021-03-06|
|03 |2021-03-06|
+---+----------+
方案一
1.先把数据按照用户id分组,根据登录日期排序
SQL:
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data;
结果:
+---+----------+---+
|id |login_date|rn |
+---+----------+---+
|01 |2021-02-28|1 |
|01 |2021-03-01|2 |
|01 |2021-03-02|3 |
|01 |2021-03-04|4 |
|01 |2021-03-05|5 |
|01 |2021-03-06|6 |
|01 |2021-03-08|7 |
|02 |2021-03-01|1 |
|02 |2021-03-02|2 |
|02 |2021-03-03|3 |
|02 |2021-03-06|4 |
|03 |2021-03-06|1 |
+---+----------+---+
2.用登录日期与rn求date_sub,得到的差值日期如果是相等的,则说明这两天肯定是连续的
SQL:
SELECT
t1.id,
t1.login_date,
date_sub(t1.login_date, rn) as diff_date
FROM
(
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data
) t1;
结果:
+---+----------+----------+
|id |login_date|diff_date |
+---+----------+----------+
|01 |2021-02-28|2021-02-27|
|01 |2021-03-01|2021-02-27|
|01 |2021-03-02|2021-02-27|
|01 |2021-03-04|2021-02-28|
|01 |2021-03-05|2021-02-28|
|01 |2021-03-06|2021-02-28|
|01 |2021-03-08|2021-03-01|
|02 |2021-03-01|2021-02-28|
|02 |2021-03-02|2021-02-28|
|02 |2021-03-03|2021-02-28|
|02 |2021-03-06|2021-03-02|
|03 |2021-03-06|2021-03-05|
+---+----------+----------+
3.根据id和日期差date_diff分组,登录次数即为分组后的count(1)
SQL:
SELECT
t2.id,
count(1) as login_times,
min(t2.login_date) as start_date,
max(t2.login_date) as end_date
FROM
(
SELECT
t1.id,
t1.login_date,
date_sub(t1.login_date,rn) as diff_date
FROM
(
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data
) t1
) t2
group by t2.id, t2.diff_date
having login_times >= 3;
结果:
+---+-----------+----------+----------+
|id |login_times|start_date|end_date |
+---+-----------+----------+----------+
| 01|3 |2021-02-28|2021-03-02|
| 01|3 |2021-03-04|2021-03-06|
| 02|3 |2021-03-01|2021-03-03|
+---+-----------+----------+----------+
方案二
方案二利用lag和lead函数进行处理,思路类似。
SQL:
SELECT
id,
lag_login_date,
login_date,lead_login_date
FROM
(SELECT
id,
login_date,
lag(login_date,1,login_date) over(partition by id order by login_date) as lag_login_date,
lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
FROM data
) t1
where datediff(login_date,lag_login_date) =1 and datediff(lead_login_date,login_date) =1;
结果:
+---+--------------+----------+---------------+
|id |lag_login_date|login_date|lead_login_date|
+---+--------------+----------+---------------+
|01 |2018-02-28 |2018-03-01|2018-03-02 |
|01 |2018-03-04 |2018-03-05|2018-03-06 |
|02 |2018-03-01 |2018-03-02|2018-03-03 |
+---+--------------+----------+---------------+
行转列和列转行
以"连续登录"中的数据为例:
select id,
concat_ws(',',collect_list(login_date)) cw
from data
group by id;
结果:
+---+----------------------------------------------------------------------------+
|id |cw |
+---+----------------------------------------------------------------------------+
|01 |2018-02-28,2018-03-01,2018-03-02,2018-03-04,2018-03-05,2018-03-06,2018-03-08|
|02 |2018-03-01,2018-03-02,2018-03-03,2018-03-06 |
|03 |2018-03-06 |
+---+----------------------------------------------------------------------------+
以上面SQL生成的数据为基准,执行下列SQL:
select id, login_date
from t
lateral view explode(split(cw,',')) b AS login_date;
结果:
+---+----------+
|id |login_date|
+---+----------+
|01 |2018-02-28|
|01 |2018-03-01|
|01 |2018-03-02|
|01 |2018-03-04|
|01 |2018-03-05|
|01 |2018-03-06|
|01 |2018-03-08|
|02 |2018-03-01|
|02 |2018-03-02|
|02 |2018-03-03|
|02 |2018-03-06|
|03 |2018-03-06|
+---+----------+