一个线上的update操作问题 01
问题发现
今天在处理线上业务的时候,发现了一个比较有意思的问题,是一个表数据的操作,这里将这个操作过程写出来,以供大家参考。
首先,我们来看这个表数据操作的具体SQL:
UPDATE XXXXX_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE XXXXX_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
.......
UPDATE XXXXX_business_detail_20190412 SET dvalue= WHERE id= AND userid=;
UPDATE XXXXX_business_detail_20190412 SET dvalue= WHERE id= AND userid=;
......
UPDATE XXXXX_business_detail_20190413 SET dvalue= WHERE id= AND userid=;
UPDATE XXXXX_business_detail_20190413 SET dvalue= WHERE id= AND userid=;
UPDATE XXXXX_business_detail_20190413 SET dvalue= WHERE id= AND userid=;
.......
UPDATE XXXXX_business_detail_20190414 SET dvalue= WHERE id= AND userid=;
UPDATE XXXXX_business_detail_20190414 SET dvalue= WHERE id= AND userid=;
如上述代码所示,这样的update操作一共影响4个日表,分别是xxxx0411、xxxx0412、xxxx0413、xxxx0414,
SQL一共有10万行左右,也就是说有大约10万个update语句,这个操作如果直接执行的话,具体时间可能会很长,目前不好评估,于是我首先看了看这个日表的表结构:
CREATE TABLE `payment_business_detail_20190414` (
-> `id` int() unsigned NOT NULL AUTO_INCREMENT COMMENT '自增主键',
-> `gameid` int() NOT NULL DEFAULT '0' COMMENT 'XXXX类型',
-> `mpid` int() NOT NULL DEFAULT '0' COMMENT '产品',
-> `userid` int() NOT NULL DEFAULT '0' COMMENT '用户ID',
-> `business_typeid` int() NOT NULL DEFAULT '0' COMMENT 'XXXX类型',
-> `business_subid` int() NOT NULL DEFAULT '0' COMMENT 'XXXX类型',
-> `dcid` int() NOT NULL DEFAULT '0' COMMENT 'XXXX类型',
-> `acctype` int() NOT NULL DEFAULT '0' COMMENT 'XXXX类型',
-> `dtid` int() NOT NULL DEFAULT '0' COMMENT 'XXXX类型',
-> `dvalue` int() NOT NULL DEFAULT '0' COMMENT '数值',
-> `ctime` datetime NOT NULL DEFAULT '1000-01-01 00:00:00' COMMENT '生成时间',
-> `business_planid` int() NOT NULL DEFAULT '0' COMMENT 'XXXX方案id',
-> `svr_ip` varchar() NOT NULL DEFAULT '' COMMENT 'XXXXip',
-> `svr_port` int() NOT NULL DEFAULT '0' COMMENT 'XXXX端口',
-> `req_svr_ip` varchar() NOT NULL DEFAULT '' COMMENT 'XXXXip',
-> `req_svr_port` int() NOT NULL DEFAULT '0' COMMENT 'XXXX端口',
-> `param1` int() NOT NULL DEFAULT '0' COMMENT '附加参数1',
-> `param2` int() NOT NULL DEFAULT '0' COMMENT '附加参数2',
-> `param3` int() NOT NULL DEFAULT '0' COMMENT '附加参数3',
-> `param4` int() NOT NULL DEFAULT '0' COMMENT '附加参数4',
-> `param5` int() NOT NULL DEFAULT '0' COMMENT '附加参数5',
-> PRIMARY KEY (`id`)
-> ) ENGINE=InnoDB AUTO_INCREMENT= DEFAULT CHARSET=utf8 COMMENT='业务明细表';
可以看到,字段相当的多,大概21个字段,其中id是主键,没有其他的二级索引,然后我看了下其中一个表的数据量:
mysql--root 12:45:20>>select count(*) from XXXXXX_business_detail_20190413;
+----------+
| count(*) |
+----------+
| 7368395 |
+----------+
1 row in set (4.15 sec)
大概700万数据,对这个情况进行评估,在这个表上进行10000次update操作,耗费的时间可能会比较长,预估时间是5分钟,为了避免直接操作影响业务,这里使用另外一种方法进行处理。
02
处理过程
这个问题,我们通过分析,可以得出结论,这些update操作更新的列是唯一的,也就是dvalue列,而且过滤条件是唯一的,也就是where后面的条件都是id和userid,表的名称虽然不统一,但是整体来看,只有4个表,这里,我们采用的方法如下:
1.将这个10万个update的语句,抽象成一个表data,列名称分别是table_name,dvalue,id,userid;(过程在最后将会进行补充!!!)
2.利用条件关联,将这10w个update的SQL,整合成4个SQL:
update xxx0411 t1,data t2
set t1.dvalue=t2.dvalue
where
t1.id=t2.id
and
t1.userid=t2.userid
and
t2.table_name=xxx0411
update xxx0412 t1,data t2
set t1.dvalue=t2.dvalue
where
t1.id=t2.id
and
t1.userid=t2.userid
and
t2.table_name=xxx0412
update xxx0413 t1,data t2
set t1.dvalue=t2.dvalue
where
t1.id=t2.id
and
t1.userid=t2.userid
and
t2.table_name=xxx0413
update xxx0414 t1,data t2
set t1.dvalue=t2.dvalue
where
t1.id=t2.id
and
t1.userid=t2.userid
and
t2.table_name=xxx0414
然后在执行这4个SQL,他们的作用相当于执行上面那10W行SQL。
我们看下这个操作总的耗时:
mysql 12:20:54>>update xxxxx_business_detail_20190411 t1,tkdata t2 set t1.dvalue=t2.dvalue where t1.id=t2.id and t1.userid=t2.userid and t2.table_name='payment_business_detail_20190411';
Query OK, 5323 rows affected (0.06 sec)
Rows matched: 5478 Changed: 5323 Warnings: 0
mysql 12:22:00>>update xxxxx_business_detail_20190412 t1,tkdata t2 set t1.dvalue=t2.dvalue where t1.id=t2.id and t1.userid=t2.userid and t2.table_name='payment_business_detail_20190412';
Query OK, 10805 rows affected (0.11 sec)
Rows matched: 11036 Changed: 10805 Warnings: 0
mysql 12:22:31>>update xxxxx_business_detail_20190413 t1,tkdata t2 set t1.dvalue=t2.dvalue where t1.id=t2.id and t1.userid=t2.userid and t2.table_name='payment_business_detail_20190413';
Query OK, 46429 rows affected (0.40 sec)
Rows matched: 47689 Changed: 46429 Warnings: 0
mysql 12:22:51>>update xxxxx_business_detail_20190414 t1,tkdata t2 set t1.dvalue=t2.dvalue where t1.id=t2.id and t1.userid=t2.userid and t2.table_name='payment_business_detail_20190414';
Query OK, 38155 rows affected (0.40 sec)
Rows matched: 38830 Changed: 38155 Warnings: 0
总的耗时也就1s多,但是他们实际影响的行数有10w多行数据,这样执行的速度就相当之快了。
再来看看实际测试过程中的执行速度,也就是直接执行这10w个update的速度,我在测试环境做了一个测试:
[root ~]# time mysql --socket=/data/mysql_5720/tmp/mysql.sock --port=5720 -D test < /tmp/data.txt
real 0m39.308s
user 0m2.288s
sys 0m0.931s
时间证明,大约需要39s,比我们预估的时间要端,但是执行时间总体来讲还是比较长的,我们的方法差不多节约了20倍的时间,如果数据量足够大,这个节约的时间一定非常可观了。
03
抽象表的命令
首先,我们拿到的update文本是:
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
UPDATE xxxxxx_business_detail_20190411 SET dvalue= WHERE id= AND userid=;
先对它做一个awk的操作,不懂的同学可以参考之前awk命令一章:
cat tkcommitteedd.sql |awk '{print $2 $4 $6 $8}'|less
结果:
xxxxxx_business_detail_20190411dvalue=6579id=1910349userid=390080512;
xxxxxx_business_detail_20190411dvalue=6579id=1910354userid=511169467;
xxxxxx_business_detail_20190411dvalue=6579id=1910356userid=280945901;
xxxxxx_business_detail_20190411dvalue=6579id=1910357userid=601858387;
这个结果不利于我们分析,我们给他添加上相关的空格来分隔开:
cat data.sql |awk '{print $2 " " $4 " " $6 " " $8}'|less
结果:
xxxxxx_business_detail_20190411 dvalue= id= userid=;
xxxxxx_business_detail_20190411 dvalue= id= userid=;
xxxxxx_business_detail_20190411 dvalue= id= userid=;
xxxxxx_business_detail_20190411 dvalue= id= userid=;
因为我们只想得到里面的表名和其他字段的数值,因此我们用'='将这4个结果连接起来,然后再用'='进行分隔,这样就能使用awk的print功能了,操作如下:
cat data.sql |awk '{print $2 "=" $4 "=" $6 "=" $8}'|less
结果:
xxxxxx_business_detail_20190411=dvalue=6579=id=1910349=userid=390080512;
xxxxxx_business_detail_20190411=dvalue=6579=id=1910354=userid=511169467;
xxxxxx_business_detail_20190411=dvalue=6579=id=1910356=userid=280945901;
xxxxxx_business_detail_20190411=dvalue=6579=id=1910357=userid=601858387;
最终分割完的结果如下:
cat data.sql ||awk '{print $2 "=" $4 "=" $6 "=" $8}'
|awk -F= '{print $1 " " $3 " " $5 " " $7}'|less
结果:
xxxxxx_business_detail_20190411 ;
xxxxxx_business_detail_20190411 ;
xxxxxx_business_detail_20190411 ;
xxxxxx_business_detail_20190411 ;
这里,我们最后用sed命令去除掉每一行末尾的;符号就行。如下:
cat tkcommitteedd.sql |awk '{print $2 "=" $4 "=" $6 "=" $8}'
|awk -F= '{print $1 " " $3 " " $5 " " $7}'
|sed 's/;//g'|less
结果:
payment_business_detail_20190411
payment_business_detail_20190411
payment_business_detail_20190411
payment_business_detail_20190411
得到了这个结果之后,再使用load data infile方法,导入我们的目标表之中,这样,我们的data表就创建成功了,这套操作怎么说也值5毛钱了。。。