这是一个实际业务需求中的问题。某一直播业务表中记录了如下格式的用户进出直播间日志数据:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-03 01:11:01 |
| 2 | 4 | 2018-01-01 01:03:02 | 2018-01-01 01:12:05 |
| 2 | 5 | 2018-01-01 01:11:02 | 2018-01-01 01:12:05 |
| 2 | 6 | 2018-01-01 01:15:02 | 2018-01-01 01:16:05 |
| 2 | 7 | 2018-01-01 01:01:03 | 2018-01-01 01:11:05 |
| 2 | 8 | 2018-01-01 23:01:03 | 2018-01-02 01:11:01 |
| 3 | 1 | 2018-01-05 01:01:01 | 2018-01-10 01:01:01 |
| 3 | 2 | 2018-01-05 01:01:01 | 2018-01-06 01:01:01 |
| 3 | 3 | 2018-01-06 01:01:01 | 2018-01-06 02:01:01 |
...
四个字段分别表示直播间ID、用户ID、进入时间和退出时间。求每天每个活跃房间的峰值人数和总时长。活跃房间的定义是:以每秒为时间粒度,如果在某一时刻同时有两个及其以上的用户在房间内,该房间当天即为活跃房间。峰值人数是指一天内同时在一个活跃房间的最大人数。总活跃时长是指一天内活跃时长的总和。
这是一个典型的重叠时间段的统计问题。具体来说,该需求可以细分为这样几个需要解决的问题:
(1)一个房间内同一用户的重叠时段问题 理论上同一用户进出房间的时间段是不存在重叠的。但表数据是移动端程序上报的,做过移动应用的开发者应该都理解,类似数据统计类的需求不能直接依赖端上报的数据,因为有各种原因造成上报数据不准确。此案例中,任意给定的一个房间,用户在其内的时间存在重叠部分,而重叠又分同一用户的重叠与不同用户之间重叠两种情况。对于前一种情况,在判断房间是否活跃时,不应该对用户重复计数,因此这部分的重叠时段需要进行合并。例如,2018-01-01日,用户1在房间1有四条日志记录:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:10:01 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
为了判断房间1在'2018-01-01 01:01:01'和'2018-01-01 01:11:05'之间是否存在活跃时间段,需要将四条记录合并为如下两条记录:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
(2)起止时段跨天的问题 由于是按天进行统计,对于进出时间点跨天的情况,要进行拆分。例如,用户1在房间2的进出时间跨越了三天:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-03 01:11:01 |
+--------+--------+---------------------+---------------------+
为了统计'2018-01-01'、'2018-01-02'、'2018-01-03'三天的数据,需要将这条记录拆分为如下三条记录:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
|--------+--------+---------------------+---------------------|
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-01 23:59:59 |
| 2 | 1 | 2018-01-02 00:00:00 | 2018-01-02 23:59:59 |
| 2 | 1 | 2018-01-03 00:00:00 | 2018-01-03 01:11:01 |
+--------+--------+---------------------+---------------------+
拆分的起止时间相差一秒,不能相同。在后面介绍计算活跃时间段内的不同用户数及其活跃时长的算法时,会看到这点非常重要。
(3)统计活跃时段 经过了前两步的数据预处理便可以统计活跃时段。这步是一个令人头疼的问题,关键在于如何高效地获取活跃时段。我们尝试了多种解决方案,后面将介绍其中两种,它们的性能有着天壤之别。 下面建立测试表并生成数据,用于演示各种SQL的执行结果。
create table test1 (roomid int, userid int, s datetime, e datetime);
insert into test1 values
(1, 1, '2018-01-01 01:01:01', '2018-01-01 01:10:01'),
(1, 2, '2018-01-01 01:01:02', '2018-01-01 01:01:05'),
(1, 3, '2018-01-01 01:01:05', '2018-01-01 01:02:05'),
(2, 4, '2018-01-01 01:03:02', '2018-01-01 01:12:05'),
(2, 5, '2018-01-01 01:11:02', '2018-01-01 01:12:05'),
(2, 6, '2018-01-01 01:15:02', '2018-01-01 01:16:05'),
(2, 7, '2018-01-01 01:01:03', '2018-01-01 01:11:05'),
(1, 1, '2018-01-01 01:01:05', '2018-01-01 01:10:01'),
(1, 1, '2018-01-01 01:01:02', '2018-01-01 01:11:01'),
(1, 1, '2018-01-01 01:11:02', '2018-01-01 01:11:05'),
(2, 1, '2018-01-01 01:01:03', '2018-01-03 01:11:01'),
(2, 8, '2018-01-01 23:01:03', '2018-01-02 01:11:01'),
(3, 1, '2018-01-05 01:01:01', '2018-01-10 01:01:01'),
(3, 2, '2018-01-05 01:01:01', '2018-01-06 01:01:01'),
(3, 3, '2018-01-06 01:01:01', '2018-01-06 02:01:01');
commit;
为了验证不同方案的在实际数据集上的执行性能,采集了三天的2505495条业务数据,存储在u_room_log表中。u_room_log与test1表结构相同,并且都没有任何索引。
如前所述,我们需要解决的第一个问题时合并一个房间内同一用户的重叠时间段。下面讨论两种自关联和游标实现方案。
重叠问题的SQL解决方案中,最容易想到的是自关联。先求出每个分组的开始时间,并用DISTINCT返回去重,然后用同样的方法得到每组结束的时间,最后把前两步的结果集合并,并通过MIN函数取得结束的时间。完整的SQL解决方案如下面的代码所示:
select distinct roomid, userid,
if(date(s)!=date(e) and id>1,date(s+interval id-1 day),s) s,
if(date(s+interval id-1 day)=date(e),e,date_format(s+interval id-1 day,'%Y-%m-%d 23:59:59')) e
from (select distinct s.roomid, s.userid, s.s,
(select min(e) -- 合并后每个区间的结束时间
from (select distinct roomid, userid, e
from test1 a
where not exists (select * from test1 b
where a.roomid = b.roomid
and a.userid = b.userid
and a.e >= b.s
and a.e < b.e)) s2
where s2.e > s.s
and s.roomid = s2.roomid
and s.userid = s2.userid) e
from (select distinct roomid, userid, s -- 每个房间每个用户的开始时间
from test1 a
where not exists (select * from test1 b
where a.roomid = b.roomid
and a.userid = b.userid
and a.s > b.s
and a.s <= b.e)) s,
(select distinct roomid, userid, e -- 每个房间每个用户的结束时间
from test1 a
where not exists (select * from test1 b
where a.roomid = b.roomid
and a.userid = b.userid
and a.e >= b.s
and a.e < b.e)) e
where s.roomid = e.roomid
and s.userid = e.userid) t1,
(select id from nums where id<=100) nums
where nums.id<=datediff(e,s)+1;
最外层的查询用于处理跨天时段。关联数字辅助表将单行数据分解为多行,id<=100表示单个时段跨越的天数最多是100。对于按天统计的直播业务,这个跨度足够了。为了提高查询性能,该值应该为满足需求的最小值。下面是该查询的执行结果:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:11:01 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 2 | 4 | 2018-01-01 01:03:02 | 2018-01-01 01:12:05 |
| 2 | 5 | 2018-01-01 01:11:02 | 2018-01-01 01:12:05 |
| 2 | 6 | 2018-01-01 01:15:02 | 2018-01-01 01:16:05 |
| 2 | 7 | 2018-01-01 01:01:03 | 2018-01-01 01:11:05 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-01 23:59:59 |
| 2 | 8 | 2018-01-01 23:01:03 | 2018-01-01 23:59:59 |
| 3 | 1 | 2018-01-05 01:01:01 | 2018-01-05 23:59:59 |
| 3 | 2 | 2018-01-05 01:01:01 | 2018-01-05 23:59:59 |
| 3 | 3 | 2018-01-06 01:01:01 | 2018-01-06 02:01:01 |
| 2 | 1 | 2018-01-02 00:00:00 | 2018-01-02 23:59:59 |
| 2 | 8 | 2018-01-02 00:00:00 | 2018-01-02 01:11:01 |
| 3 | 1 | 2018-01-06 00:00:00 | 2018-01-06 23:59:59 |
| 3 | 2 | 2018-01-06 00:00:00 | 2018-01-06 01:01:01 |
| 2 | 1 | 2018-01-03 00:00:00 | 2018-01-03 01:11:01 |
| 3 | 1 | 2018-01-07 00:00:00 | 2018-01-07 23:59:59 |
| 3 | 1 | 2018-01-08 00:00:00 | 2018-01-08 23:59:59 |
| 3 | 1 | 2018-01-09 00:00:00 | 2018-01-09 23:59:59 |
| 3 | 1 | 2018-01-10 00:00:00 | 2018-01-10 01:01:01 |
+--------+--------+---------------------+---------------------+
22 rows in set (0.01 sec)
原表的15行数据,经过重叠合并与跨天拆分后变为22条数据。自关联的写法比较易懂,在小数据集上的性能尚可,但如果表很大,这种写法就会凸显性能问题。将查询中的test1表改为u_room_log表,没有等到出结果。慢的原因从查询计划中就可得到直观反映:
+----+--------------------+------------+------------+-------+---------------+-------------+---------+-------------------+----------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+------------+------------+-------+---------------+-------------+---------+-------------------+----------+----------+----------------------------------------------------+
| 1 | PRIMARY | nums | NULL | range | PRIMARY | PRIMARY | 8 | NULL | 100 | 100.00 | Using where; Using index; Using temporary |
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 24980980 | 100.00 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | <derived6> | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 2 | DERIVED | <derived8> | NULL | ref | <auto_key0> | <auto_key0> | 14 | s.roomid,s.userid | 10 | 100.00 | Distinct |
| 8 | DERIVED | a | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 9 | DEPENDENT SUBQUERY | b | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 0.11 | Using where |
| 6 | DERIVED | a | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 7 | DEPENDENT SUBQUERY | b | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 0.11 | Using where |
| 3 | DEPENDENT SUBQUERY | <derived4> | NULL | ref | <auto_key0> | <auto_key0> | 14 | s.roomid,s.userid | 249809 | 33.33 | Using where; Using index |
| 4 | DERIVED | a | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 100.00 | Using where; Using temporary |
| 5 | DEPENDENT SUBQUERY | b | NULL | ALL | NULL | NULL | NULL | NULL | 2498089 | 0.11 | Using where |
+----+--------------------+------------+------------+-------+---------------+-------------+---------+-------------------+----------+----------+----------------------------------------------------+
要对一个250万行的表多次进行相关子查询,总计要扫描的行数是多个250万的乘积,从执行时间看基本没有意义,因此这个写法被否定了。我们希望找到只扫描一遍表的实现方法,这是最优的解决方案,因为无论如何也要扫描一遍表。
在数据库优化中有一条基本原则,就是尽量使用集合操作而避免使用游标,来看一个最简单的例子。nums是单列100万行的数字辅助表,select查询时间为0.41秒。
mysql> select @id:=id from nums;
...
1000000 rows in set, 1 warning (0.41 sec)
而游标遍历的时间为3.05秒,比单条select语句慢了7.4倍。
mysql> delimiter //
mysql> create procedure p_cursor()
-> begin
-> declare done int default 0;
-> declare v_id bigint;
->
-> declare cur_nums cursor for select id from nums;
-> declare continue handler for not found set done = 1;
->
-> open cur_nums;
-> repeat
-> fetch cur_nums into v_id;
-> until done end repeat;
-> close cur_nums;
-> end//
Query OK, 0 rows affected (0.01 sec)
mysql>
mysql> call p_cursor()//
Query OK, 0 rows affected (3.05 sec)
此案例中情况却有所不同。有可能通过业务数据表上的游标,在逐行遍历表时编写复杂的应用逻辑,避免大表之间的关联,极大减少扫描行数,性能会比表关联好很多。下面是用游标合并重叠时间段的存储过程。
drop procedure if exists sp_overlap;
delimiter //
create procedure sp_overlap()
begin
declare done int default 0;
declare v_roomid bigint;
declare v_userid bigint;
declare v_start datetime;
declare v_end datetime;
declare v_prev_roomid int;
declare v_prev_userid bigint;
declare v_max_end datetime;
declare cur_t1 cursor for select roomid,userid,s,e from test1 order by roomid,userid,s,e;
declare continue handler for not found set done = 1;
drop table if exists t;
drop table if exists t1;
drop table if exists tmp_s;
create temporary table t(
roomid bigint,
userid bigint,
s datetime,
e datetime,
broken int
) engine=memory;
create temporary table t1 (
roomid int,
userid bigint,
s datetime,
e datetime
) engine=memory;
create temporary table tmp_s(
roomid bigint,
userid bigint,
s datetime,
e datetime,
i int
) engine=memory;
open cur_t1;
repeat
fetch cur_t1 into v_roomid,v_userid,v_start,v_end;
if done !=1 then
if(v_roomid=v_prev_roomid and v_userid=v_prev_userid) then
if(v_start<=v_max_end) then
insert into t values(v_roomid,v_userid,v_start,v_end,0);
else
insert into t values(v_roomid,v_userid,v_start,v_end,1);
end if;
if(v_end>=v_max_end) then
set v_max_end:=v_end;
end if;
set v_prev_roomid:=v_roomid;
set v_userid:=v_userid;
else
set v_max_end:=v_end;
set v_prev_roomid:=v_roomid;
set v_prev_userid:=v_userid;
insert into t values(v_roomid,v_userid,v_start,v_end,1);
end if;
end if;
until done end repeat;
close cur_t1;
insert into tmp_s
select roomid,userid,min(s) s,max(e) e,datediff(max(e),min(s))+1 i
from (select roomid,userid,s,e,case when @flag=flag then @rn:=@rn+broken when @flag:=flag then @rn:=broken end ran
from (select roomid,userid,s,e,broken,concat(roomid,',',userid) flag from t,(select @flag:='',@rn:=0) vars) a
order by roomid,userid,s,e) b
group by roomid,userid,ran;
select max(i) into @c from tmp_s;
insert into t1(roomid,userid,s,e)
select roomid, userid,
if(date(s)!=date(e) and id>1,date(s+interval id-1 day),s) s,
if(date(s+interval id-1 day)=date(e) ,e,date_format(s+interval id-1 day,'%y-%m-%d 23:59:59')) e
from tmp_s t1,
(select id from nums where id<=@c) nums
where (nums.id<=t1.i);
end
//
定义游标的查询需要按房间ID、用户ID、起始时间、终止时间排序。v_roomid、v_userid、v_start、v_end四个变量存储游标当前行四个字段的数据。由于要按房间和用户分组,v_prev_roomid与v_prev_userid分别存储前一行的房间ID和用户ID,用于和当前行进行比较,判断哪些行属于同一组。
v_max_end变量存储同一分组中当前最大的结束时间。在当前行的开始时间小于等于v_max_end时,说明当前行与同组中前面的时间段存在重叠,用0标识该行,否则表示当前行与同组中前面的时间段不存在重叠,用1标识该行。将游标遍历结果存储在临时表t中,t只比原表多了broken字段,用于存储所在行是否需要合并的标识:
+--------+--------+---------------------+---------------------+--------+
| roomid | userid | s | e | broken |
+--------+--------+---------------------+---------------------+--------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:10:01 | 1 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:11:01 | 0 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:10:01 | 0 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 | 1 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 | 1 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 | 1 |
| 2 | 1 | 2018-01-01 01:01:03 | 2018-01-03 01:11:01 | 1 |
| 2 | 4 | 2018-01-01 01:03:02 | 2018-01-01 01:12:05 | 1 |
| 2 | 5 | 2018-01-01 01:11:02 | 2018-01-01 01:12:05 | 1 |
| 2 | 6 | 2018-01-01 01:15:02 | 2018-01-01 01:16:05 | 1 |
| 2 | 7 | 2018-01-01 01:01:03 | 2018-01-01 01:11:05 | 1 |
| 2 | 8 | 2018-01-01 23:01:03 | 2018-01-02 01:11:01 | 1 |
| 3 | 1 | 2018-01-05 01:01:01 | 2018-01-10 01:01:01 | 1 |
| 3 | 2 | 2018-01-05 01:01:01 | 2018-01-06 01:01:01 | 1 |
| 3 | 3 | 2018-01-06 01:01:01 | 2018-01-06 02:01:01 | 1 |
+--------+--------+---------------------+---------------------+--------+
15 rows in set (0.00 sec)
临时表tmp_s存储合并行后的结果。除了原有的四列外,该表还增加了表示开始时间和结束时间之间跨越天数的一列。在生成该表数据的查询语句中:
case when @flag=flag then @rn:=@rn+broken when @flag:=flag then @rn:=broken end
这句的含义是按房间和用户分组(@flag相同的表示为同一组),并且累加同一组中的broken,因为需要合并行的broken=0,所以所有需要合并行的累加broken都是1。外层查询就按这三列group by,min(s)、max(e)、datediff(max(e),min(s))+1 分别得到合并后的开始时间、结束时间和跨越天数。
然后用下面的查询取得最大跨越天数:
select max(i) from tmp_s;
最后将tmp_s与数字辅助表连接,进行跨天时间段的拆分,并将拆分后的结果存入临时表t1。
本过程使用游标仅扫描一遍原始数据表,将中间处理结果存储到内存临时表中,对于处理重叠问题具有一定的通用性。之所以用到了三个临时表,是为了增加代码的可读性。每步产生的中间结果都存储于内存临时表,逻辑比较清晰。在性能优化时也要进行可读性、灵活性、易维护性等多方面权衡,避免“优化强迫症”。本例是可以不用写三个临时表的,去掉一个临时表可能提高些许性能,但若将此复杂的处理步骤合并为单一查询,必然使SQL语句变得极为晦涩难懂,更不易维护,最终结果是得不偿失。
此存储过程在u_room_log表上执行,生成2557836行数据,用时2分26秒,这是一个可以接受的性能度量。
mysql> set max_heap_table_size=268435456;
Query OK, 0 rows affected (0.00 sec)
mysql> set tmp_table_size=268435456;
Query OK, 0 rows affected (0.00 sec)
mysql> call sp_overlap();
Query OK, 2557836 rows affected (2 min 26.36 sec)
经过了前两步的数据处理,得到了结果集 t1,其中同一房间同一用户不存在重叠时间段,包括开始和结束的两个时间点也不重合,并且每行的开始时间和结束时间都不跨天。下面要依据活跃时段的定义,以 t1 作为输入,找到不同用户的重叠时间段。这里使用了“最小范围”和“正负计数器”两种不同算法来实现,但在大数据量的生产环境中,只有后者在性能上是可行的。
该算法步骤如下: (1)将进出同一房间的所有时间点(不分用户)统一排序。例如,roomid=1的进出房间记录如下:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:11:01 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
这步处理完成后的输出为:
+--------+---------------------+
| roomid | timepoint |
+--------+---------------------+
| 1 | 2018-01-01 01:01:01 |
| 1 | 2018-01-01 01:01:02 |
| 1 | 2018-01-01 01:01:05 |
| 1 | 2018-01-01 01:01:05 |
| 1 | 2018-01-01 01:02:05 |
| 1 | 2018-01-01 01:11:01 |
| 1 | 2018-01-01 01:11:02 |
| 1 | 2018-01-01 01:11:05 |
+--------+---------------------+
(2)对于上一步输出中同一roomid的数据,将当前行的时间点作为结束时间,前一行的时间点作为开始时间,并且过滤掉开始时间为空或开始时间等于结束时间的数据。输出为每个房间的最小时间范围间隔。例如,roomid=1的最小时间范围间隔为:
+--------+---------------------+---------------------+
| roomid | starttime | endtime |
+--------+---------------------+---------------------+
| 1 | 2018-01-01 01:01:01 | 2018-01-01 01:01:02 |
| 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 2018-01-01 01:02:05 | 2018-01-01 01:11:01 |
| 1 | 2018-01-01 01:11:01 | 2018-01-01 01:11:02 |
| 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+---------------------+---------------------+
这步是算法的核心,实际上就是把同一房间的所有进出时间点串行化到一个连续的时间轴上,输出的每个时间段首尾相接但不重叠。
(3)将上一步的输出与 t1 表做内连接。如果用户的在线时间和最小范围重叠,就将重叠的最小范围和userid、roomid输出。结果包含了某个房间某个用户一个或者多个的最小范围。例如,roomid=1的房间,每个用户对应的最小时间范围间隔为:
+--------+--------+---------------------+---------------------+
| roomid | userid | s | e |
+--------+--------+---------------------+---------------------+
| 1 | 1 | 2018-01-01 01:01:01 | 2018-01-01 01:01:02 |
| 1 | 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 2 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 |
| 1 | 3 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 |
| 1 | 1 | 2018-01-01 01:02:05 | 2018-01-01 01:11:01 |
| 1 | 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 |
+--------+--------+---------------------+---------------------+
(4)按上一步输出中的roomid和最小时间范围分组,过滤出每组中userid个数大于1的数据,结果为每个房间对应的活跃时间段。例如,roomid=1的房间输出为:
+--------+---------------------+---------------------+---+
| roomid | s | e | c |
+--------+---------------------+---------------------+---+
| 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 | 2 |
| 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 | 2 |
+--------+---------------------+---------------------+---+
(5)统计每个房间每天活跃时段内的最大人数,并汇总活跃时长(舍入到分钟)。例如,roomid=1的房间输出为:
+--------+------------+------+------+
| roomid | dt | ts | c |
+--------+------------+------+------+
| 1 | 2018-01-01 | 1 | 2 |
+--------+------------+------+------+
下面是实现最小范围算法的存储过程:
drop procedure if exists sp_active_duration;
delimiter //
create procedure sp_active_duration()
begin
declare done int default 0;
declare v_roomid bigint;
declare v_start datetime;
declare v_end datetime;
drop table if exists tmp_time_point;
create temporary table tmp_time_point(
roomid bigint,
timepoint datetime
) engine=memory;
insert into tmp_time_point select roomid,s from t1;
insert into tmp_time_point select roomid,e from t1;
select roomid,date(s) dt,round(sum(timestampdiff(second,s,e))/60) ts,max(c) c
from (select roomid,s,e ,count(distinct userid) c
from (select distinct v6.roomid,v6.userid,starttime s,endtime e
from (select distinct roomid,cast(starttime as datetime) starttime,cast(endtime as datetime) endtime
from (select if(@roomid=roomid,@d,'') as starttime,@d:=timepoint,@roomid:=roomid,p.roomid,p.timepoint endtime
from tmp_time_point p,(select @d:='',@roomid:=-1) vars
order by roomid,timepoint) v4
where starttime!='' and date(starttime)=date(endtime) and starttime <> endtime) v5
inner join t1 v6 on(v5.starttime between v6.s and v6.e and v5.endtime between v6.s and v6.e and v5.roomid=v6.roomid)) v6
group by roomid,s,e having count(distinct userid)>1) v7
group by roomid,date(s);
end
//
delimiter ;
tmp_time_point表即为步骤(1)的输出结果。mysql限制在一条查询中只能引用临时表一次,否则会报 ERROR 1137 (HY000): Can't reopen table: 't1' 错误,所以生成tmp_time_point表数据时执行了两次insert语句。中间结果集 v5、v6、v7 分别为步骤(2)、步骤(3)和步骤(4)的输出结果。
最小范围算法获取活跃时段的逻辑没问题,但在第(3)步骤中需要表关联,当数据量很大时,这步需要花费非常多的时间,因为要扫描大量数据行。存储过程中最后的select语句在u_room_log表上的执行计划如下:
mysql> explain select roomid,date(s) dt,round(sum(timestampdiff(second,s,e))/60) ts,max(c) c
-> from (select roomid,s,e ,count(distinct userid) c
-> from (select distinct v6.roomid,v6.userid,greatest(s,starttime) s,least(e,endtime) e
-> from (select distinct roomid,cast(starttime as datetime) starttime,cast(endtime as datetime) endtime
-> from (select if(@roomid=roomid,@d,'') as starttime,@d:=timepoint,@roomid:=roomid,p.roomid,p.timepoint endtime
-> from tmp_time_point p,(select @d:='',@roomid:=-1) vars
-> order by roomid,timepoint) v4
-> where starttime!='' and date(starttime)=date(endtime) and starttime <> endtime) v5
-> inner join t1 v6 on(v5.starttime between v6.s and v6.e and v5.endtime between v6.s and v6.e and v5.roomid=v6.roomid)) v6
-> group by roomid,s,e having count(distinct userid)>1) v7
-> group by roomid,date(s);
+----+-------------+------------+------------+--------+---------------+-------------+---------+----------------+------------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+--------+---------------+-------------+---------+----------------+------------+----------+------------------------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 1308213650 | 100.00 | Using temporary |
| 2 | DERIVED | <derived3> | NULL | ALL | NULL | NULL | NULL | NULL | 1308213650 | 100.00 | Using filesort |
| 3 | DERIVED | v6 | NULL | ALL | roomid | NULL | NULL | NULL | 2557836 | 100.00 | Using where; Using temporary |
| 3 | DERIVED | <derived4> | NULL | ref | <auto_key0> | <auto_key0> | 9 | test.v6.roomid | 41436 | 1.23 | Using where; Using index |
| 4 | DERIVED | <derived5> | NULL | ALL | NULL | NULL | NULL | NULL | 5115672 | 81.00 | Using where; Using temporary |
| 5 | DERIVED | <derived6> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | Using filesort |
| 5 | DERIVED | p | NULL | ALL | NULL | NULL | NULL | NULL | 5115672 | 100.00 | NULL |
| 6 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+------------+------------+--------+---------------+-------------+---------+----------------+------------+----------+------------------------------+
8 rows in set, 5 warnings (0.01 sec)
可有看到,步骤(3)需要关联两个几百万行的大表,因此在u_room_log表上执行sp_active_duration()过程没有等到出结果。
与重叠时间段优化思想类似,我们希望只扫描一遍表数据,去掉表关联以提高性能。实际上,经过sp_overlap过程处理后,可以用一种高效的方式得到活跃时段。该算法的核心思想是:将所有的进出时间点统一排序,同时记录每个时间点的进出用户数。这样我们可以将在线时间分成多个互斥的时间段,并且利用当前时间点前面的所有累计进出用户数,作为前一个时间点到当前时间点的重叠度,也即不同用户数。用户进入房间标记为+1,离开房间标记为-1,因此不妨称之为正负计数器算法,具体步骤如下。(1)将同一房间的所有进入时间点和退出时间点合并成一列,将进入时间标记为1,退出时间标记为-1。实际上,1表示在对应的时间点有一个用户进入,-1表示在对应的时间点有一个用户退出。这步处理后roomid=1的记录变为:
+--------+---------------------+------+
| roomid | timepoint | type |
+--------+---------------------+------+
| 1 | 2018-01-01 01:01:01 | 1 |
| 1 | 2018-01-01 01:01:02 | 1 |
| 1 | 2018-01-01 01:01:05 | -1 |
| 1 | 2018-01-01 01:01:05 | 1 |
| 1 | 2018-01-01 01:02:05 | -1 |
| 1 | 2018-01-01 01:11:01 | -1 |
| 1 | 2018-01-01 01:11:02 | 1 |
| 1 | 2018-01-01 01:11:05 | -1 |
+--------+---------------------+------+
(2)按房间和时间点分组,对标志位汇总聚合,目的是去除重复的时间点。重复时间点表示在同一秒有多个用户进入、或者退出、或者进入退出同一个房间。汇总的目的就是确定在该时间点,最终进出的用户数。这一步是必须的,原因有两个:1. 我们必须保证对于一个房间每个时间点是唯一的;2. 必须确定某一时间点的进出方向和进出数量。这两个点是保证算法成立的充要条件。出于同样的理由,在拆分跨天记录时,为保持时间点的唯一性,起止时间相差一秒。这步处理后roomid=1的记录变为:
+--------+---------------------+------+
| roomid | timepoint | type |
+--------+---------------------+------+
| 1 | 2018-01-01 01:01:01 | 1 |
| 1 | 2018-01-01 01:01:02 | 1 |
| 1 | 2018-01-01 01:01:05 | 0 |
| 1 | 2018-01-01 01:02:05 | -1 |
| 1 | 2018-01-01 01:11:01 | -1 |
| 1 | 2018-01-01 01:11:02 | 1 |
| 1 | 2018-01-01 01:11:05 | -1 |
+--------+---------------------+------+
(3)按房间分组,时间点排序,取得当前时间点的前一个时间点对应的进出用户数。如果没有前一个时间点,说明是该房间的第一次进入,前一个时间点对应的进出用户数设为0。这步处理后的记录roomid=1变为:
+--------+---------------------+------+----------+
| roomid | timepoint | type | prevType |
+--------+---------------------+------+----------+
| 1 | 2018-01-01 01:01:01 | 1 | 0 |
| 1 | 2018-01-01 01:01:02 | 1 | 1 |
| 1 | 2018-01-01 01:01:05 | 0 | 1 |
| 1 | 2018-01-01 01:02:05 | -1 | 0 |
| 1 | 2018-01-01 01:11:01 | -1 | -1 |
| 1 | 2018-01-01 01:11:02 | 1 | -1 |
| 1 | 2018-01-01 01:11:05 | -1 | 1 |
+--------+---------------------+------+----------+
(4)取当前时间点的前一个时间点作为起始时间,当前时间点作为终止时间,将房间的在线时间区间划分成互斥时段。用当前时间点前面的所有累计进出用户数,作为该时段的重叠度。这步处理后roomid=1的记录如下,rn即为starttime和endtime这段时间内的不同用户数:
+--------+---------------------+---------------------+------+
| roomid | starttime | endtime | rn |
+--------+---------------------+---------------------+------+
| 1 | NULL | 2018-01-01 01:01:01 | 0 |
| 1 | 2018-01-01 01:01:01 | 2018-01-01 01:01:02 | 1 |
| 1 | 2018-01-01 01:01:02 | 2018-01-01 01:01:05 | 2 |
| 1 | 2018-01-01 01:01:05 | 2018-01-01 01:02:05 | 2 |
| 1 | 2018-01-01 01:02:05 | 2018-01-01 01:11:01 | 1 |
| 1 | 2018-01-01 01:11:01 | 2018-01-01 01:11:02 | 0 |
| 1 | 2018-01-01 01:11:02 | 2018-01-01 01:11:05 | 1 |
+--------+---------------------+---------------------+------+
(5)按天统计每个房间活跃时长(重叠度大于1的时段汇总),并求出活跃时段的峰值人数(最大重叠度)。最终roomid=1的结果如下,其中dur为活跃时长(单位舍入为分钟),c是峰值人数:
+--------+------------+------+------+
| roomid | dt | dur | c |
+--------+------------+------+------+
| 1 | 2018-01-01 | 1 | 2 |
+--------+------------+------+------+
采用正负计数器算法后的sp_active_duration如下:
drop procedure if exists sp_active_duration;
delimiter //
create procedure sp_active_duration()
begin
declare done int default 0;
declare v_roomid bigint;
declare v_start datetime;
declare v_end datetime;
declare cur_test cursor for select roomid,s,e from t1;
declare continue handler for not found set done = 1;
drop table if exists tmp_time_point;
create temporary table tmp_time_point(
roomid bigint,
timepoint datetime,
type smallint
) engine=memory;
-- 开始点+1, 结束点-1
insert into tmp_time_point(roomid,timepoint,type) select roomid,s,1 from t1;
insert into tmp_time_point(roomid,timepoint,type) select roomid,e,-1 from t1;
select roomid,date(s) dt,round(sum(timestampdiff(second,date_format(s,'%Y-%m-%d %H:%i:%s'),date_format(e,'%Y-%m-%d %H:%i:%s')))/60) ts,max(rn) c
from (select if(@roomid=roomid,@d,'') as s,
@d:=str_to_date(timepoint,'%Y-%m-%d %H:%i:%s.%f'),
@roomid:=roomid,
p.roomid,
str_to_date(timepoint,'%Y-%m-%d %H:%i:%s.%f') e,
rn
from (select round(case when @roomid=roomid then @rn:=@rn+prevType when @roomid:=roomid then @rn:=prevType end) rn,b.prevType,roomid,timepoint,type
from (select a.roomid,timepoint,type,if(@roomid=roomid,@type,0) prevType, @roomid:=roomid, @type:=type
from (select *
from (select roomid,timepoint,sum(type) type from tmp_time_point group by roomid,timepoint) tmp_time_point,
(select @roomid:=-1,@rn:=0,@type:=0) vars
order by roomid ,timepoint) a) b
order by roomid ,timepoint) p,
(select @d:='',@roomid:=-1) vars
order by roomid,timepoint) v4
where rn>=2
group by roomid,date(s);
end
//
delimiter ;
tmp_time_point表存储步骤(1)的结果。b、v4分别是步骤(3)和步骤(4)的输出结果。过程中最后的查询只扫描一遍tmp_time_point表,处理速度大为提高。u_room_log表上sp_active_duration过程的执行时间为1分13秒。
为满足原始需求,只需要在一个会话中连续调用两个存储过程即可。250万的业务日志数据,总执行时间约为3分40秒。
set max_heap_table_size=268435456;
set tmp_table_size=268435456;
call sp_overlap();
call sp_active_duration();
MySQL 8提供了丰富的窗口函数,使复杂分析查询成为可能。更进一步,老版MySQL的行级变量用法已经不再推荐使用:
mysql> select @a:=id from nums limit 1;
+--------+
| @a:=id |
+--------+
| 1 |
+--------+
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1287 | Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'. |
+---------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
没有提供窗口函数前,为了处理复杂逻辑,使用行级变量也是不得已而为之。本身就不是标准SQL,可读性很差,如果需要换RDBMS,比重做一遍还麻烦。而MySQL 8在SQL功能上已经接近Oracle,重叠时间段问题用一句查询即可解决:
with c1 as -- 合并同一房间同一用户的重叠时间段,用于统计峰值人数
(
select distinct roomid,userid,min(s) s,max(e) e
from (select roomid,userid,s,e,
sum(broken) over (partition by roomid, userid order by s,e) flag
from (select *,
(case when s <= max(e) over (partition by roomid, userid order by s,e rows between unbounded preceding and 1 preceding) then 0
else 1
end) as broken
from test1
) t
) t
group by roomid,userid,flag
),
c2 as -- 拆分跨天的时间段
(
select *
from (select roomid,userid,s,e
from c1
where date(s) = date(e) -- 不跨天
union all
select roomid,userid,
case when id = 1 then s else date_add(date(s),interval id-1 day) end s,
case when id = m2 then e else date_add(date(s),interval id*3600*24 -1 second) end e
from (select roomid,userid,s,e,id,
max(id) over (partition by roomid,userid,s) m2
from c1,(select id from nums where id<=100) n
where date(s) <> date(e) -- 跨天
and id <= date(e)-date(s)+1) t1) t1
),
c3 as -- 在计算最小范围的同时,计算区间用户数
(
select roomid,ts endtime,sum(prevtype) over(partition by roomid order by ts) rn,
lag(ts) over (partition by roomid order by ts) starttime
from (
select a.*,ifnull(lag(type) over (partition by roomid order by ts),0) prevtype
from (
select
roomid,ts,sum(type) type
from (
select roomid,e ts, -1 type
from c2
union all
select roomid,s ts, 1 type
from c2
) t1 group by roomid,ts
) a
) c
)
select roomid,dt,round(sum(dur)/60) ts,max(rn) c from (
select roomid,date(starttime) dt,timestampdiff(second,starttime,endtime) dur,rn
from c3 where rn>=2
) t
group by roomid,dt
order by roomid,dt;
该查询处理逻辑和存储过程完全相同,只是大部分复杂工作都交给窗口函数完成了,写法更简练,但执行时间没有存储过程快。相同环境下,with查询在u_room_log上的执行时间为4分10秒左右,比自定义的存储过程执行还慢半分钟。