系统设计题(1) 连续5天登录用户(快手)

程序员小王

发布于 2020-07-14 14:30:59

9160

发布于 2020-07-14 14:30:59

文章被收录于专栏：架构说

早安心语

听到别人生活不如意，就自我感觉良好；

见到他人晒幸福晒成绩，又觉得自己失败迷茫。

把时间都用来关注别人，哪还有功夫提升自己？

记住，你的注意力在哪里，成长就在哪里。

向着自己心中的愿景勇敢前进，踏实走好每一步，终有一天生活会垂青于你

题目：连续5天登录用户

现在我们有一张用户登录日志表，记录用户每天的登录时间，

我们想要统计一下，用户每次连续登录的开始日期和结束日期，以及连续登录天数。

用户ID	登录日期
1001	2017-01-01
1001	2017-01-02
1001	2017-01-04
1001	2017-01-06
1002	2017-01-02
1002	2017-01-03

同学们先思考下，整理下思路，如果没有思路或者某几个点不了解，就可以继续往下看了。

思路

分组
计算个数必须大于5，并且按照等差数列最大值和最小相差大于5

答案

mysql> select userid,count(*) as total,max(logindate)-min(logindate) as diff from test.tmloginlog group by userid HAVING total >5 and diff >5 ;
+---------+-------+------+
| user_id | total | diff |
+---------+-------+------+
| 1001 | 13 | 15 |
| 1002 | 15 | 16 |
+---------+-------+------+
2 rows in set (0.01 sec)

CREATE TABLE test.tmloginlog
(
userid int,
logindate date
)ENGINE=InnoDB;
-- 这里的数据是最简化的情况，每个用户每天只有一条登录信息，
insert into test.tmloginlog values(1001,'2017-01-01');
insert into test.tmloginlog values(1001,'2017-01-02');
insert into test.tmloginlog values(1001,'2017-01-04');
insert into test.tmloginlog values(1001,'2017-01-05');
insert into test.tmloginlog values(1001,'2017-01-06');
insert into test.tmloginlog values(1001,'2017-01-07');
insert into test.tmloginlog values(1001,'2017-01-08');
insert into test.tmloginlog values(1001,'2017-01-09');
insert into test.tmloginlog values(1001,'2017-01-10');
insert into test.tmloginlog values(1001,'2017-01-12');
insert into test.tmloginlog values(1001,'2017-01-13');
insert into test.tmloginlog values(1001,'2017-01-15');
insert into test.tmloginlog values(1001,'2017-01-16');
insert into test.tmloginlog values(1002,'2017-01-01');
insert into test.tmloginlog values(1002,'2017-01-02');
insert into test.tmloginlog values(1002,'2017-01-03');
insert into test.tmloginlog values(1002,'2017-01-04');
insert into test.tmloginlog values(1002,'2017-01-05');
insert into test.tmloginlog values(1002,'2017-01-06');
insert into test.tmloginlog values(1002,'2017-01-07');
insert into test.tmloginlog values(1002,'2017-01-08');
insert into test.tmloginlog values(1002,'2017-01-09');
insert into test.tmloginlog values(1002,'2017-01-10');
insert into test.tmloginlog values(1002,'2017-01-11');
insert into test.tmloginlog values(1002,'2017-01-12');
insert into test.tmloginlog values(1002,'2017-01-13');
insert into test.tmloginlog values(1002,'2017-01-16');
insert into test.tmloginlog values(1002,'2017-01-17');
insert into test.tmloginlog values(1003,'2017-01-18');

如果大量数据怎么办

如果这样问，我也不知道怎办? 不管百万，千万，还是一个记录我思路查看 EXPLAIN执行结果

观察如下

Using temporary就意味着使用了内部临时表 ----->内存临时表和“磁盘临时表
Using filesort -就意味着MySQL 线程需要排序 - --->排序的时候用到了 sort buffer

画外音：

什么是临时表

场景：union去重复

explain select id%10 as m, count(*) as c from t1 group by m;

explain select id%10 as m, count(*) as c from t1 group by m order by null; ExtraUsing index; Using temporary

要解决 group by 语句的优化问题，你可以先想一下这个问题：

执行 group by 语句为什么需要临时表？

group by 的语义逻辑，是统计不同的值出现的个数。

但是，由于每一行的 id%100 的结果是无序的，所以我们就需要有一个临时表，来记录并统计结果。

那么，如果扫描过程中可以保证出现的数据是有序的，是不是就简单了呢？

explain select SQLBIGRESULT id%100 as m, count(*) as c from t1 group by m; Using index; Using filesort select z, count(*) as c from t1 group by z; explain select id%10 as m, count(*) as c from t1 group by m;

区别是什么：字段 id%10 是无序的，字段 z是有序的。

什么是sort_buffer？

sortbuffersize，就是 MySQL 为排序开辟的内存（sort_buffer）的大小。
如果要排序的数据量小于 sortbuffersize，排序就在内存中完成。
但如果排序数据量太大，内存放不下，则不得不利用磁盘临时文件辅助排序

SELECT * FROM information_schema. OPTIMIZER_TRACE\G

ALTER TABLE test.tmloginlog ADD INDEX useridlog(userid,login_date)

优化后

explain select userid,count(*) as total,max(logindate)-min(logindate) as diff from test.tmloginlog group by userid HAVING total >5 and diff >5 ;
Extra 字段里面多了“Using index”，表示的就是使用了覆盖索引，性能上会
快很多

思考与行动

基于上面的 union、union all 和 group by 语句的执行过程的分析，MySQL 什么时候会使用内部临时表？

如果语句执行过程可以一边读数据，一边直接得到结果，是不需要额外内存的，否则就需要额外的内存，来保存中间结果；
joinbuffer 是无序数组，sortbuffer 是有序数组，临时表是二维表结构；
如果执行逻辑需要用到二维表特性，就会优先考虑使用临时表。比如我们的例子中， union 需要用到唯一索引约束， group by 还需要用到另外一个字段来存累积计数

验证

案例1 count()虽然相当于另外一个字段。但是不需要获取内容 比较特殊
explain select user_id,count() from test.tmloginlog group by user_id;
explain select userid,count(logindate) from test.tmloginlog group by user_id;
Using index
案例2 count(*)不变，但是 user_id%10 结果发生变化
explain select userid%10 as m,count(*) from test.tmlogin_log group by m;
Using index;
Using temporary;
Using filesort
案例3 :Using index for group-by
explain select userid,max(logindate) from test.tmloginlog group by user_id;
案例4：select userid,logindate,count(*) from test.tmloginlog group by userid,logindate;
请使用一条语句 统计各班的男生和女生人数
SELECT classid, gender, COUNT(*) num FROM students GROUP BY classid, gender;