我们接着上一题大数据面试SQL046-泳池问题(上)继续讨论泳池问题。现有一份数据记录了用户进入和离开游泳池的时间,请找出一天中泳池最多人数持续时长,如有出现多次最高人数,对时间求和
--样例数据
+----------+-----------+----------------------+
| user_id | log_type | log_time |
+----------+-----------+----------------------+
| 1 | enter | 2024-04-07 08:01:00 |
| 2 | enter | 2024-04-07 08:20:00 |
| 1 | leave | 2024-04-07 09:30:00 |
| 3 | enter | 2024-04-07 10:01:00 |
| 4 | enter | 2024-04-07 10:25:00 |
| 2 | leave | 2024-04-07 11:22:00 |
| 4 | leave | 2024-04-07 12:20:00 |
| 5 | enter | 2024-04-07 12:30:00 |
| 6 | enter | 2024-04-07 13:05:00 |
| 3 | leave | 2024-04-07 14:20:00 |
| 5 | leave | 2024-04-07 14:30:00 |
| 7 | enter | 2024-04-07 14:45:00 |
| 8 | enter | 2024-04-07 15:18:00 |
| 7 | leave | 2024-04-07 15:32:00 |
| 6 | leave | 2024-04-07 15:45:00 |
| 9 | enter | 2024-04-07 16:01:00 |
| 10 | enter | 2024-04-07 16:10:00 |
| 9 | leave | 2024-04-07 16:25:00 |
| 8 | leave | 2024-04-07 16:35:00 |
| 10 | leave | 2024-04-07 16:55:00 |
+----------+-----------+----------------------+
这个题目是在上一个题目上的升级版,想要计算泳池内的最大人数持续时间,我们给每行记录添加持续时间,这里考察的是开窗函数lead()。然后我们再对数据进行排序,这里考察的是排名函数,如果满足多个需要求和,即考察row_number 和rank、dense_rank的差异性。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
1)使用上一个题目的结果,我们除了累积求和外,再增加一个字段为下一行记录的时间next_log_time,该时间与当前日志时间log_time的差值即为当前状态的持续时间。我们假设游泳池开放时间为08:00~17:00,使用lead()函数,当没有下一行数据时,给定默认值为17:00。
select
user_id,
log_type,
log_time,
enter_cnt,
sum(enter_cnt)over(order by log_time asc) as pool_user_cnt,
lead(log_time,1,'2024-04-07 17:00:00')over(order by log_time asc) as next_log_time
from
(
select
user_id,
log_type,
log_time,
if(log_type ='enter',1,-1) as enter_cnt
from t_user_pool_log
) t
查询结果
2)使用next_log_time - log_time计算出每行持续时间,这里需要将日期格式转换为时间戳,然后计算出s的差值。上一篇我们只要最大值,所以直接使用max,这次需要取出最大值对应的行,所以需要使用排序函数。如果存在多个最大值,需要取出多行,所以我们可以选择rank或者dense_rank函数。排序函数讲解请看:
select
user_id,
log_type,
log_time,
enter_cnt,
pool_user_cnt,
next_log_time,
unix_timestamp(next_log_time) - unix_timestamp(log_time) as keep_sec,
rank()over(order by pool_user_cnt desc) as rn
from
(select
user_id,
log_type,
log_time,
enter_cnt,
sum(enter_cnt)over(order by log_time asc) as pool_user_cnt,
lead(log_time,1,'2024-04-07 17:00:00')over(order by log_time asc) as next_log_time
from
(
select
user_id,
log_type,
log_time,
if(log_type ='enter',1,-1) as enter_cnt
from t_user_pool_log
) t
) tt
查询结果
3)如上图,我们取出rn=1的行,对其keep_sec求和即计算出持续时长,这里时间单位为s。
-- 持续时间
select
sum(keep_sec)
from
(select
user_id,
log_type,
log_time,
enter_cnt,
pool_user_cnt,
next_log_time,
unix_timestamp(next_log_time) - unix_timestamp(log_time) as keep_sec,
rank()over(order by pool_user_cnt desc) as rn
from
(select
user_id,
log_type,
log_time,
enter_cnt,
sum(enter_cnt)over(order by log_time asc) as pool_user_cnt,
lead(log_time,1,'2024-04-07 17:00:00')over(order by log_time asc) as next_log_time
from
(
select
user_id,
log_type,
log_time,
if(log_type ='enter',1,-1) as enter_cnt
from t_user_pool_log
) t
) tt
) ttt
where ttt.rn = 1
查询结果
--建表语句
CREATE TABLE t_user_pool_log (
user_id INT,
log_type STRING,
log_time STRING
)
COMMENT '泳池进出日志表'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- 样例数据插入
insert into t_user_pool_log values
(1,'enter','2024-04-07 08:01:00'),
(2,'enter','2024-04-07 08:20:00'),
(1,'leave','2024-04-07 09:30:00'),
(3,'enter','2024-04-07 10:01:00'),
(4,'enter','2024-04-07 10:25:00'),
(2,'leave','2024-04-07 11:22:00'),
(4,'leave','2024-04-07 12:20:00'),
(5,'enter','2024-04-07 12:30:00'),
(6,'enter','2024-04-07 13:05:00'),
(3,'leave','2024-04-07 14:20:00'),
(5,'leave','2024-04-07 14:30:00'),
(7,'enter','2024-04-07 14:45:00'),
(8,'enter','2024-04-07 15:18:00'),
(7,'leave','2024-04-07 15:32:00'),
(6,'leave','2024-04-07 15:45:00'),
(9,'enter','2024-04-07 16:01:00'),
(10,'enter','2024-04-07 16:10:00'),
(9,'leave','2024-04-07 16:25:00'),
(8,'leave','2024-04-07 16:35:00'),
(10,'leave','2024-04-07 16:55:00');