SQL优化案例解析：MINUS改写为标量子查询后提升5倍，但还可以再快近百倍

老叶茶馆

发布于 2024-04-03 18:09:56

4290

* GreatSQL社区原创内容未经授权不得随意使用，转载请联系小编并注明来源。 写在前面(By老叶)从GreatSQL 8.0.32-25版本开始支持Rapid引擎，该引擎使得GreatSQL能满足联机分析（OLAP）查询请求。老叶尝试利用Rapid引擎优化本案例，结果是相当可喜的，对比如下： -SQL执行耗时(秒)Rows_examinedRead_keyRead_nextRead_rnd_nextTmp_tablesTmp_disk_tablesTmp_table_sizesInnoDB_pages_distinct原始SQL9.943230200036368909070210877403112372448181标量子查询改写优化后2.294906728836617308100036382011284368182走Rapid引擎0.11763300001090000 上述测试结果表明：

用标量子查询优化后，扫描数据量大幅减小，也不产生磁盘临时表了，最终SQL性能提升约5倍。
改走Rapid引擎后，SQL性能提升约85倍。

测试用例相关背景信息：

表数据量：1000万行数据，用mysql_random_load随机生成
其中subscriber_id列的基数为9976840（99.77%）
innodb_buffer_pool_size = 8G
t1表空间大小：1.2G
rapid_memory_limit = 1G
rapid_worker_threads = 16
tmp_table_size = 32M

综上，利用Rapid引擎可一步到位完成SQL优化，实现降维打击，省去了繁杂的SQL优化过程。由于Rapid引擎的限制，本案中的SQL还需要简单修改，详见下方内容。背景 minus 指令运用在两个 SQL 语句上，取两个语句查询结果集的差集。它先找出第一个 SQL 所产生的结果，然后看这些结果有没有在第二个 SQL 的结果中，如果在，那这些数据就被去除，不会在最后的结果中出现，第二个 SQL 结果集比第一个SQL结果集多的数据也会被抛弃。这两天的优化工作中遇到这样一种案例，第一个SQL语句结果集很小（这个前提很重要，否则不能用标量子查询改写来优化），第二个SQL语句结果集很大，这种情况下我们怎么来优化处理呢？实验 创建测试表 CREATE TABLE t1(id int primary key auto_increment, subscriber_id decimal(20, 0) not null, member_num varchar(20) not null, effectdate datetime, expirydate datetime, create_date datetime, key idx_subscriber(subscriber_id)); 创建存储过程，向测试插入50万数据。（实际生产案例中表中数据有几千万）注意下面的存储过程中，是GreatSQL在Oracle模式下创建的，GreatSQL实现了大量的Oracle语法兼容，比如下面存储过程中遇到的日期加减，add_months函数，while loop循环等，数据库由Oracle向GreatSQL迁移时，会节省很多代码改造工作。 set sql_mode=oracle; delimiter // create or replace procedure p1() as p1 int :=1; n1 int; d1 datetime; begin while p1<=500000 loop n1:=round(rand()*500000); d1:=to_date('2016-01-01','yyyy-mm-dd') + round(rand()*3000); insert into t1(subscriber_id,member_num,effectdate,expirydate,create_date) values(n1,concat('m_',n1),last_day(d1)+1,add_months(last_day(d1)+1,100),d1); set p1=p1+1; end loop; end; // delimiter ; 这个表create_date列的数据是从2016年1月到2024年3月的数据，使用了随机值，保证每个月的数据量相近，subscriber_id也是随机值生成的，选择性很好（这个也很重要，如果subscriber_id列基数很小，则改写后的SQL执行效率会很差），这个模型数据与生产环境差不多。执行下面这个SQL语句： SELECT DISTINCT subscriber_id, member_num FROM t1 WHERE create_date >= '2024-02-01' AND create_date < '2024-03-01' AND to_char(effectdate, 'yyyymm') > '202402' minus SELECT DISTINCT subscriber_id, member_num FROM t1 WHERE 202402 BETWEEN to_char(effectdate, 'yyyymm') AND to_char(expirydate, 'yyyymm'); 这条SQL是根据生产环境使用的语句简化而来的，只为突出本文要说明的知识点。此SQL的执行计划如下： greatsql> explain analyze -> select distinct subscriber_id, member_num -> from t1 -> where create_date >= '2024-02-01' -> and create_date < '2024-03-01' -> and to_char(effectdate, 'yyyymm') > '202402' -> minus -> select distinct subscriber_id, member_num -> from t1 -> where 202402 between to_char(effectdate, 'yyyymm') and -> to_char(expirydate, 'yyyymm')\G *************************** 1. row *************************** EXPLAIN: -> Table scan on <except temporary> (cost=168492.31..169186.99 rows=55375) (actual time=2420.123..2420.896 rows=1758 loops=1) -> Except materialize with deduplication (cost=168492.30..168492.30 rows=55375) (actual time=2420.121..2420.121 rows=4855 loops=1) -> Table scan on <temporary> (cost=55858.24..56552.91 rows=55375) (actual time=221.965..223.384 rows=4855 loops=1) -> Temporary table with deduplication (cost=55858.23..55858.23 rows=55375) (actual time=221.962..221.962 rows=4855 loops=1) -> Filter: ((t1.create_date >= TIMESTAMP'2024-02-01 00:00:00') and (t1.create_date < TIMESTAMP'2024-03-01 00:00:00') and (to_char(t1.effectdate,'yyyymm') > '202402')) (cost=50320.70 rows=55375) (actual time=0.118..217.497 rows=4875 loops=1) -> Table scan on t1 (cost=50320.70 rows=498477) (actual time=0.084..179.826 rows=500000 loops=1) -> Table scan on <temporary> (cost=100168.41..106401.86 rows=498477) (actual time=1520.965..1571.682 rows=307431 loops=1) -> Temporary table with deduplication (cost=100168.40..100168.40 rows=498477) (actual time=1520.963..1520.963 rows=307431 loops=1) -> Filter: (202402 between to_char(t1.effectdate,'yyyymm') and to_char(t1.expirydate,'yyyymm')) (cost=50320.70 rows=498477) (actual time=0.123..934.617 rows=492082 loops=1) -> Table scan on t1 (cost=50320.70 rows=498477) (actual time=0.104..716.919 rows=500000 loops=1) 1 row in set (2.47 sec) 从执行计划看出，SQL总体耗时2.47s。第一部分的查询结果集有4855条，耗时221.962ms，第二部分的查询结果集有307431条，耗时1571.682ms。 优化分析： 首先第一部分create_date加上索引会提升查询效率，因为只需要查询一个月的数据，而此SQL耗时最多的是第二部分，重在第二部分的优化处理。第二部分查询结果集在做minus运算时大部分记录都是要被抛弃的，查询出来再被抛弃相当于做了无用功，而SQL优化的核心思想就是在于减少IO，那我们要做的就是想办法省去第二部分SQL的全面查询，只需要验证第一部分的查询结果集是否在第二部分查询结果中存在就好了。那如何验证呢? 把第一部分select的列值传给第二部分作为where条件去查找，只要能查到，无论几条都算在第二部分存在，这部分数据就要被去除，查不到就是在第二部分不存在，数据保留在最终结果集。根据这个逻辑我想到了标量子查询的妙用。标量子查询改写参考： select distinct subscriber_id, member_num from (select a.subscriber_id, a.member_num, (select count(*) cnt from t1 b where a.subscriber_id = b.subscriber_id and a.member_num = b.member_num and 202402 between to_char(effectdate, 'yyyymm') and to_char(expirydate, 'yyyymm')) as cnt from t1 a where create_date >= '2024-02-01' and create_date < '2024-03-01' and to_char(effectdate, 'yyyymm') > '202402') where cnt = 0 改后SQL的执行计划如下： greatsql> explain analyze -> select distinct subscriber_id, member_num -> from (select a.subscriber_id, -> a.member_num, -> (select count(*) cnt -> from t1 b -> where a.subscriber_id = b.subscriber_id -> and a.member_num = b.member_num -> and 202402 between to_char(effectdate, 'yyyymm') and -> to_char(expirydate, 'yyyymm')) as cnt -> from t1 a -> where create_date >= '2024-02-01' -> and create_date < '2024-03-01' -> and to_char(effectdate, 'yyyymm') > '202402') -> where cnt = 0\G *************************** 1. row *************************** EXPLAIN: -> Table scan on <temporary> (cost=3172.53..3235.95 rows=4875) (actual time=168.555..168.775 rows=1758 loops=1) -> Temporary table with deduplication (cost=3172.51..3172.51 rows=4875) (actual time=168.553..168.553 rows=1758 loops=1) -> Index lookup on alias_temp_-1556603461854822391 using <auto_key0> (cnt=0) (cost=2681.86..2685.01 rows=10) (actual time=166.656..167.178 rows=1765 loops=1) -> Materialize (cost=2681.51..2681.51 rows=4875) (actual time=166.649..166.649 rows=4875 loops=1) -> Filter: (to_char(a.effectdate,'yyyymm') > '202402') (cost=2194.01 rows=4875) (actual time=0.380..45.477 rows=4875 loops=1) -> Index range scan on a using idx_creatdate over ('2024-02-01 00:00:00' <= create_date < '2024-03-01 00:00:00'), with index condition: ((a.create_date >= TIMESTAMP'2024-02-01 00:00:00') and (a.create_date < TIMESTAMP'2024-03-01 00:00:00')) (cost=2194.01 rows=4875) (actual time=0.344..43.143 rows=4875 loops=1) -> Select #3 (subquery in projection; dependent) -> Aggregate: count(0) (cost=0.42 rows=1) (actual time=0.022..0.022 rows=1 loops=4875) -> Filter: ((a.member_num = b.member_num) and (202402 between to_char(b.effectdate,'yyyymm') and to_char(b.expirydate,'yyyymm'))) (cost=0.40 rows=0.2) (actual time=0.019..0.021 rows=1 loops=4875) -> Index lookup on b using idx_subscriber (subscriber_id=a.subscriber_id) (cost=0.40 rows=2) (actual time=0.018..0.019 rows=2 loops=4875) 1 row in set, 2 warnings (0.26 sec) 从执行计划可以看出，子查询执行次数依赖于主查询，执行了4875次，因为subscriber_id列选择性很好，所以每次查询效率很高。SQL总体耗时0.26秒，而原SQL耗时2.47s，性能提升了将近10倍。在实际生产案例中第二部分结果集有5000万左右，第一部分结果集只有几十条，SQL执行半天都跑不出结果，改造后几乎秒出。提醒一点，注意NULL值比较，当select 列表中的部分列存在NULL值时就不能直接用等号（=）关联来判断了，得用is NULL来判断，本案例不涉及此问题，语句是否等价有时需要结合业务，具体情况具体分析。 结论： 本文提供了一种minus语句的优化方法，将minus转化为标量子查询表达，这种优化方式适用于第一部分查询结果集比较小，查询的列比较少的情况，且要结合业务确认是否需要对NULL值进行判断。优化时一般避免使用标量子查询，因为标量子查询会构造天然的嵌套循环连接，但也并不是说标量子查询一定不可用，还是要从根儿上考虑，优化核心思想，减少IO是要点。补充由于GreatSQL Rapid引擎的约束，上述SQL还需要做简单改写调整以适配，修改成下面这样即可： rapid: SELECT DISTINCT subscriber_id, member_num FROM t1 WHERE create_date >= '2024-02-01' AND create_date < '2024-03-01' AND effectdate >= '2024-03-01' minus SELECT DISTINCT subscriber_id, member_num FROM t1 where effectdate < '2024-03-01' and expirydate > '2024-03-01';