我们接着上两题继续讨论泳池问题,还是相同的数据。现有一份数据记录了用户进入和离开游泳池的时间,请计算出泳池内的平均人数
--样例数据
+----------+-----------+----------------------+
| user_id | log_type | log_time |
+----------+-----------+----------------------+
| 1 | enter | 2024-04-07 08:01:00 |
| 2 | enter | 2024-04-07 08:20:00 |
| 1 | leave | 2024-04-07 09:30:00 |
| 3 | enter | 2024-04-07 10:01:00 |
| 4 | enter | 2024-04-07 10:25:00 |
| 2 | leave | 2024-04-07 11:22:00 |
| 4 | leave | 2024-04-07 12:20:00 |
| 5 | enter | 2024-04-07 12:30:00 |
| 6 | enter | 2024-04-07 13:05:00 |
| 3 | leave | 2024-04-07 14:20:00 |
| 5 | leave | 2024-04-07 14:30:00 |
| 7 | enter | 2024-04-07 14:45:00 |
| 8 | enter | 2024-04-07 15:18:00 |
| 7 | leave | 2024-04-07 15:32:00 |
| 6 | leave | 2024-04-07 15:45:00 |
| 9 | enter | 2024-04-07 16:01:00 |
| 10 | enter | 2024-04-07 16:10:00 |
| 9 | leave | 2024-04-07 16:25:00 |
| 8 | leave | 2024-04-07 16:35:00 |
| 10 | leave | 2024-04-07 16:55:00 |
+----------+-----------+----------------------+
这个题目是在前两个题目的进一步升级,这里并没有再进一步考察什么函数,而是考察业务知识。想要计算平均值,我们可以一个全天泳池内所有人累计时长,然后除以泳池的开放时长,就是泳池内的平均人数;
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
1)使用之前的结果,我们计算出泳池中的人数和持续时间,然后对人数*时间结果求和,即为一天内所有人的累计时长。使用上一个题目的中间结果,先计算出每行记录的人数pool_user_cnt和持续时间keep_sec。
select
user_id,
log_type,
log_time,
enter_cnt,
pool_user_cnt,
next_log_time,
unix_timestamp(next_log_time) - unix_timestamp(log_time) as keep_sec
from
(select
user_id,
log_type,
log_time,
enter_cnt,
sum(enter_cnt)over(order by log_time asc) as pool_user_cnt,
lead(log_time,1,'2024-04-07 17:00:00')over(order by log_time asc) as next_log_time
from
(
select
user_id,
log_type,
log_time,
if(log_type ='enter',1,-1) as enter_cnt
from t_user_pool_log
) t
) tt
查询结果
2)计算出pool_user_cnt*keep_sec 的结果
select
user_id,
log_type,
log_time,
pool_user_cnt,
next_log_time,
keep_sec,
pool_user_cnt*keep_sec as user_keep_sec
from
(select
user_id,
log_type,
log_time,
enter_cnt,
pool_user_cnt,
next_log_time,
unix_timestamp(next_log_time) - unix_timestamp(log_time) as keep_sec
from
(select
user_id,
log_type,
log_time,
enter_cnt,
sum(enter_cnt)over(order by log_time asc) as pool_user_cnt,
lead(log_time,1,'2024-04-07 17:00:00')over(order by log_time asc) as next_log_time
from
(
select
user_id,
log_type,
log_time,
if(log_type ='enter',1,-1) as enter_cnt
from t_user_pool_log
) t
) tt
)ttt
查询结果
3)我们假设泳池开放时间为8:00~17:00 计算出总的时间长度total_sec.对user_keep_sec 求和之后除以total_sec 得到平均人数。
select
sum(user_keep_sec)/(unix_timestamp('2024-04-07 17:00:00')-unix_timestamp('2024-04-07 08:00:00')) as avg_num
from
(select
user_id,
log_type,
log_time,
pool_user_cnt,
next_log_time,
keep_sec,
pool_user_cnt*keep_sec as user_keep_sec
from
(select
user_id,
log_type,
log_time,
enter_cnt,
pool_user_cnt,
next_log_time,
unix_timestamp(next_log_time) - unix_timestamp(log_time) as keep_sec
from
(select
user_id,
log_type,
log_time,
enter_cnt,
sum(enter_cnt)over(order by log_time asc) as pool_user_cnt,
lead(log_time,1,'2024-04-07 17:00:00')over(order by log_time asc) as next_log_time
from
(
select
user_id,
log_type,
log_time,
if(log_type ='enter',1,-1) as enter_cnt
from t_user_pool_log
) t
) tt
)ttt
)tttt
查询结果
4)简化SQL
select
sum(pool_user_cnt*keep_sec)/(unix_timestamp('2024-04-07 17:00:00')-unix_timestamp('2024-04-07 08:00:00')) as avg_num
from
(select
user_id,
log_type,
log_time,
enter_cnt,
sum(enter_cnt)over(order by log_time asc) as pool_user_cnt,
unix_timestamp(lead(log_time,1,'2024-04-07 17:00:00')over(order by log_time asc))-unix_timestamp(log_time) as keep_sec
from
(
select
user_id,
log_type,
log_time,
if(log_type ='enter',1,-1) as enter_cnt
from t_user_pool_log
) t
) tt
查询结果
四、建表语句和数据插入
--建表语句
CREATE TABLE t_user_pool_log (
user_id INT,
log_type STRING,
log_time STRING
)
COMMENT '泳池进出日志表'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- 样例数据插入
insert into t_user_pool_log values
(1,'enter','2024-04-07 08:01:00'),
(2,'enter','2024-04-07 08:20:00'),
(1,'leave','2024-04-07 09:30:00'),
(3,'enter','2024-04-07 10:01:00'),
(4,'enter','2024-04-07 10:25:00'),
(2,'leave','2024-04-07 11:22:00'),
(4,'leave','2024-04-07 12:20:00'),
(5,'enter','2024-04-07 12:30:00'),
(6,'enter','2024-04-07 13:05:00'),
(3,'leave','2024-04-07 14:20:00'),
(5,'leave','2024-04-07 14:30:00'),
(7,'enter','2024-04-07 14:45:00'),
(8,'enter','2024-04-07 15:18:00'),
(7,'leave','2024-04-07 15:32:00'),
(6,'leave','2024-04-07 15:45:00'),
(9,'enter','2024-04-07 16:01:00'),
(10,'enter','2024-04-07 16:10:00'),
(9,'leave','2024-04-07 16:25:00'),
(8,'leave','2024-04-07 16:35:00'),
(10,'leave','2024-04-07 16:55:00');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术“发表。原文:www.dwsql.com