首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >使用hql-统计连续登陆的三天及以上的用户

使用hql-统计连续登陆的三天及以上的用户

作者头像
孙晨c
发布2021-02-25 17:01:29
9710
发布2021-02-25 17:01:29
举报
文章被收录于专栏:无题~无题~

这个问题可以扩展到很多相似的问题:连续几个月充值会员、连续天数有商品卖出、连续打车、连续逾期……

数据提供

 用户ID、登入日期
 user01,2018-02-28
 user01,2018-03-01
 user01,2018-03-02
 user01,2018-03-04
 user01,2018-03-05
 user01,2018-03-06
 user01,2018-03-07
 user02,2018-03-01
 user02,2018-03-02
 user02,2018-03-03
 user02,2018-03-06

输出字段

+---------+--------+-------------+-------------+--+
|   uid   | times  | start_date  |  end_date   |
+---------+--------+-------------+-------------+--+

解法一

先对每个用户的登录日期排序,然后拿第n行的日期,减第n-2行的日期,如果等于2,就说明连续三天登录了。

解法二

开窗,窗囗内部排序然后做差

rownumber() oover

建表

create table wedw_dw.t_login_info(
 user_id string  COMMENT '用户ID'
,login_date date COMMENT '登录日期'
)
row format delimited fields terminated by ',';

导数据

hdfs dfs -put /test/login.txt /data/hive/test/wedw/dw/t_login_info/

验证数据

select * from wedw_dw.t_login_info;
+----------+-------------+--+
| user_id  | login_date  |
+----------+-------------+--+
| user01   | 2018-02-28  |
| user01   | 2018-03-01  |
| user01   | 2018-03-02  |
| user01   | 2018-03-04  |
| user01   | 2018-03-05  |
| user01   | 2018-03-06  |
| user01   | 2018-03-07  |
| user02   | 2018-03-01  |
| user02   | 2018-03-02  |
| user02   | 2018-03-03  |
| user02   | 2018-03-06  |
+----------+-------------+--+

解决方案-使用解法二

select
 t2.user_id         as user_id,
 count(1)           as times,
 min(t2.login_date) as start_date,
 max(t2.login_date) as end_date
from
(
    select
     t1.user_id,
     t1.login_date,
     date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id,
         login_date,
         row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
) t2
group by 
 t2.user_id, t2.date_diff
having times >= 3;

结果

+----------+--------+-------------+-------------+--+
| user_id  | times  | start_date  |  end_date   |
+----------+--------+-------------+-------------+--+
| user01   | 3      | 2018-02-28   | 2018-03-02  |
| user01   | 4      | 2018-03-04  | 2018-03-07   |
| user02   | 3      | 2018-03-01   | 2018-03-03  |
+----------+--------+-------------+-------------+--+

思路

  1. 先把数据按照用户id分组,根据登录日期排序
select
	user_id
	,login_date
	,row_number() over(partition by user_id order by login_date asc) as rn 
	from
	wedw_dw.t_login_info

+----------+-------------+-----+--+
| user_id  | login_date  | rn  |
+----------+-------------+-----+--+
| user01   | 2018-02-28  | 1   |
| user01   | 2018-03-01  | 2   |
| user01   | 2018-03-02  | 3   |
| user01   | 2018-03-04  | 4   |
| user01   | 2018-03-05  | 5   |
| user01   | 2018-03-06  | 6   |
| user01   | 2018-03-07  | 7   |
| user02   | 2018-03-01  | 1   |
| user02   | 2018-03-02  | 2   |
| user02   | 2018-03-03  | 3   |
| user02   | 2018-03-06  | 4   |
+----------+-------------+-----+--+
  1. 用登录日期减去排序数字rn,得到的差值日期如果是相等的,则说明这两天肯定是连续的
select
     t1.user_id
    ,t1.login_date
    ,date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id
        ,login_date
        ,row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
    ;


+----------+-------------+-------------+--+
| user_id  | login_date  |  date_diff  |
+----------+-------------+-------------+--+
| user01   | 2018-02-28  | 2018-02-27  |
| user01   | 2018-03-01  | 2018-02-27  |
| user01   | 2018-03-02  | 2018-02-27  |
| user01   | 2018-03-04  | 2018-02-28  |
| user01   | 2018-03-05  | 2018-02-28  |
| user01   | 2018-03-06  | 2018-02-28  |
| user01   | 2018-03-07  | 2018-02-28  |
| user02   | 2018-03-01  | 2018-02-28  |
| user02   | 2018-03-02  | 2018-02-28  |
| user02   | 2018-03-03  | 2018-02-28  |
| user02   | 2018-03-06  | 2018-03-02  |
+----------+-------------+-------------+--+
  1. 根据user_id和日期差date_diff 分组,最小登录日期即为此次连续登录的开始日期start_date,最大登录日期即为结束日期end_date,登录次数即为分组后的count(1)
select
 t2.user_id         as user_id
,count(1)           as times
,min(t2.login_date) as start_date
,max(t2.login_date) as end_date
from
(
    select
     t1.user_id
    ,t1.login_date
    ,date_sub(t1.login_date,rn) as date_diff
    from
    (
        select
         user_id
        ,login_date
        ,row_number() over(partition by user_id order by login_date asc) as rn 
        from
        wedw_dw.t_login_info
    ) t1
) t2
group by 
 t2.user_id
,t2.date_diff
having times >= 3
;

+----------+--------+-------------+-------------+--+
| user_id  | times  | start_date  |  end_date   |
+----------+--------+-------------+-------------+--+
| user01   | 3      | 2018-02-28   | 2018-03-02  |
| user01    | 4      | 2018-03-04  | 2018-03-07  |
| user02   | 3      | 2018-03-01   | 2018-03-03  |
+----------+--------+-------------+-------------+--+
本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2021-02-21 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 数据提供
  • 输出字段
  • 解法一
  • 解法二
  • 建表
  • 导数据
  • 验证数据
  • 解决方案-使用解法二
  • 结果
  • 思路
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档