前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >DB一次卡顿的事后优化

DB一次卡顿的事后优化

作者头像
richard.xia_志培
发布2022-06-14 14:34:12
1.2K0
发布2022-06-14 14:34:12
举报
文章被收录于专栏:PT运维技术PT运维技术

背景:

3月30日下班时间,一条业务线突发业务故障,业务方反馈用户无法访问。由于时间点比较特殊,DBA/开发/运维都在回家的途中,很难第一时间处理DB故障。20-30分钟后,DBA到家后,在抓取MySQL/OS等相关信息后,重启了数据库,问题得到解决。

事后分析:

第二天工作时间,将DBA在现场抓取到的现场相关文件复制到测试服务器上。具体如下:

代码语言:javascript
复制
1. MySQL进程的strace 信息
2. iotrace(获取不正确), 
3. lsof
4. top
5. engine_inodb_status(未获取), 
6. ps
7. pstack(没有抓取到)
8. /proc/`pidof mysql`/(没有抓取出来).

仔细分析strace的抓取信息:

代码语言:javascript
复制
 绝大部分线程处于锁等待状态:
[pid 379735] futex(0x135dac4, FUTEX_WAIT_PRIVATE, 7235605, NULL <unfinished ...>
[pid 379721] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379689] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48732, NULL <unfinished ...>
[pid 379352] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48731, NULL <unfinished ...>
[pid 379689] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 379352] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 379689] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379352] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379243] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48730, NULL <unfinished ...>
[pid 379188] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48729, NULL <unfinished ...>
[pid 379243] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 379188] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 379243] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379188] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379164] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48728, NULL <unfinished ...>
[pid 379144] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48727, NULL <unfinished ...>
[pid 379164] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 379144] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 379164] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379144] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379137] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48726, NULL <unfinishe
统计次数后:
    179  futex(0x9acb6250, FUTEX_WAKE_PRIVATE, 1) = 0
    991  futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733

绝大部分线程全部都在等待0x9acc00f4 锁, DBA当时应该执行

代码语言:javascript
复制
gdb -p `pidof mysql` 

然后执行

代码语言:javascript
复制
p *(pthread_mutex_t*)0x9acc00f4

即可找到哪个线程持有锁(根据__owner),也可以获知futex锁的类型(__kind), 然后拿到OS_thread_id到/proc/`pidof mysqld`/smap中或者mysql performance_schema.threads的thread_os_id获取, 从而定位到具体的功能线程。

由于DBA重启了MySQL, 所以当前后无法重现了。最最重要的,居然没有通过pstack抓取到mysql当时的进程堆栈的信息。

代码语言:javascript
复制
##业务进程发起的select查询
[pid 381101] poll([{fd=2378, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2378, revents=POLLIN}])
[pid 381101] recvfrom(2378, "P\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 381101] recvfrom(2378, "\3SELECT * FROM t_call_comrade WHERE callee_guid = 12400596 AND state > 0 LIMIT 1", 80, MSG_DONTWAIT, NULL, NULL) = 80
[pid 381101] sched_yield()              = 0
[pid 381101] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48741, NULL <unfinished ...>
###业务进程发起insert
[pid 380158] poll([{fd=2375, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2375, revents=POLLIN}])
[pid 380158] recvfrom(2375, "\23\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 380158] recvfrom(2375, "\3SET AUTOCOMMIT = 0", 19, MSG_DONTWAIT, NULL, NULL) = 19
[pid 380158] sendto(2375, "\7\0\0\1\0\0\0\0\0\0\0", 11, MSG_DONTWAIT, NULL, 0) = 11
[pid 380158] recvfrom(2375, 0xaa68ed00, 4, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 380158] poll([{fd=2375, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2375, revents=POLLIN}])
[pid 380158] recvfrom(2375, "\16\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 380158] recvfrom(2375, "\3select NOW();", 14, MSG_DONTWAIT, NULL, NULL) = 14
[pid 380158] sendto(2375, "\1\0\0\1\1\33\0\0\2\3def\0\0\0\5NOW()\0\f?\0\23\0\0\0\f\201\0\0\0\0\5\0\0\3\376\0\0\0\0\24\0\0\4\0232021-03-30 19:20:02\5\0\0\5\376\0\0\0\0", 78, MSG_DONTWAIT, NULL, 0) = 78
[pid 380158] recvfrom(2375, 0xaa68ed00, 4, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 380158] poll([{fd=2375, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2375, revents=POLLIN}])
[pid 380158] recvfrom(2375, "G\1\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 380158] recvfrom(2375, "\3insert into dj_global.t_account_purchase_reforged_info (bn_account,first_buy_time, buy_type, buy_time, update_time) values(\"528756761@battlenet.oauth\",\"2021-03-30 18:44:56\", 1, \"2021-03-30 18:44:56\", \"2021-03-30 19:20:02\")  ON DUPLICATE KEY UPDATE buy_type=1 ,buy_time=\"2021-03-30 18:44:56\", update_time=\"2021-03-30 19:20:02\";", 327, MSG_DONTWAIT, NULL, NULL) = 327
[pid 380158] sched_yield()              = 0
[pid 380158] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48738, NULL <unfinished ...>

从strace里可以看到,业务方发起的SQL语句,紧跟着发生了线程调度sched_yield(), 然后等待0x9acc00f4地址上持有futex释放,381101 线程和380158线程表现出一样的行为(收到业务SQL后yield(), futex wait), 从目前strace日志来看,发生了线程间的死锁,但原因目前不好界定。

DBA通过命令获取到的系统层的指标:

代码语言:javascript
复制
ps 获取mysqld进程的信息
#################################################################################################################################################################################
148547 root      20   0    4340   1500   1468 S   0.0  0.0   0:00.06 /bin/sh /home/opt/mysql/bin/mysqld_safe --datadir=/data/mysql/var --pid-file=/data/mysql/tmp/mys+
150853 mysql     20   0  0.118t 0.094t   8172 S   0.0 76.4   1388:21 /home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql --datadir=/data/mysql/var --plugin-dir=/hom+

top获取到mysqld进程的信息
#################################################################################################################################################################################
root     148547  0.0  0.0   4340  1500 ?        S    Feb16   0:00 /bin/sh /home/opt/mysql/bin/mysqld_safe --datadir=/data/mysql/var --pid-file=/data/mysql/tmp/mysql.pid
mysql    150853  2.2 76.3 126431396 100598400 ? Sl   Feb16 1388:21 /home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql --datadir=/data/mysql/var --plugin-dir=/home/opt/mysql/lib/plugin --user=mysql 

pidstat获取到mysql线程切换的相关信息
#################################################################################################################################################################################
#      Time   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  minflt/s  majflt/s     VSZ    RSS   %MEM   kB_rd/s   kB_wr/s kB_ccwr/s iodelay   cswch/s nvcswch/s  Command
1617103000  8001         0    375925    0.00    0.00    0.00    0.00    12      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql 
 1617103000  8001         0    376083    0.00    0.00    0.00    0.00    12      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql 
 1617103000  8001         0    376529    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql 
 1617103000  8001         0    376939    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql 
 1617103000  8001         0    377270    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql
 1617103000  8001         0    377473    0.00    0.00    0.00    0.00    12      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql 

#################################################################################################################################################################################
lsof获取mysql当时打开的regular file的fd数量
# grep ^mysqld -w  lsof.txt |grep REG -w  -c
1439386

#################################################################################################################################################################################
iostat获取当时磁盘IO情况(但io调度队列没有抓取到)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.09    0.00    0.10    0.04    0.00   99.77
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.01     0.79    0.06    0.45     0.44     5.45    23.16     0.00    2.42   17.11    0.44   2.07   0.11
sdb               0.00   116.86    1.96  188.16    19.64  2733.96    28.97     0.03    0.18    0.21    0.18   0.15   2.82
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.15    0.15    0.00   99.67

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    2.00    1.00    12.00     4.00    10.67     0.06   20.00   30.00    0.00  20.00   6.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.27    0.00    0.00   99.72

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 1607.00     0.00 410804.00   511.27     4.47    2.79    0.00    2.79   0.05   8.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

#      Time   UID      TGID       TID    %usr %system  %guest    %CPU   CPU  minflt/s  majflt/s     VSZ    RSS   %MEM   kB_rd/s   kB_wr/s kB_ccwr/s iodelay   cswch/s nvcswch/s  Command
 1617103000  8001         0    375860    0.00    0.00    0.00    0.00    14      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    375861    0.00    0.00    0.00    0.00    14      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    375862    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    375869    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    375925    0.00    0.00    0.00    0.00    12      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    376083    0.00    0.00    0.00    0.00    12      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    376529    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    376939    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    377270    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    377473    0.00    0.00    0.00    0.00    12      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld 
 1617103000  8001         0    377530    0.00    0.00    0.00    0.00     0      0.00      0.00 126431136 100598412  76.38      0.00      0.00      0.00       0      0.00      0.00  |__/home/opt/mysql/bin/mysqld

OS层面的指标分析:

  1. 故障时间点:磁盘可以正常读写,IO并无压力, CPU并无开销。
  2. 故障时间点:MySQL的绝大部分线程无切换( 主动cswch,被动nvcswch), 推断为futex锁争抢所致。

DBA抓取到show proccesslist:

代码语言:javascript
复制
3618376 dz_webdevelop   10.170.24.77:50416      dj_global       Query   1727    Opening tables  SELECT * FROM t_call_comrade WHERE callee_guid = 12254381 AND state > 0 LIMIT 1
3618337 dz_platform     10.200.58.206:57055     NULL    Query   1767    Opening tables  insert into dj_global.t_account_purchase_reforged_info (bn_account,first_buy_time, buy_type, buy_tim
3618303 dz_webdevelop   10.170.24.77:50406      dj_global       Query   1797    Opening tables  SELECT * FROM t_call_comrade WHERE callee_guid = 2180351 AND state > 0 LIMIT 1
3618302 dz_webdevelop   10.192.11.201:40860     dj_global       Query   1798    Opening tables  SELECT * FROM t_call_comrade WHERE callee_guid = 2180351 AND state > 0 LIMIT 1
3618285 dj_web  10.170.24.77:50402      quartzcluster   Query   1816    Opening tables  SELECT COUNT(TRIGGER_NAME) FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'quartzScheduler' AND NOT (MISFIRE_
3618287 dj_web  10.192.11.201:40857     quartzcluster   Query   1816    Opening tables  SELECT COUNT(TRIGGER_NAME) FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'quartzScheduler' AND NOT (MISFIRE_
3618273 dz_platform     10.200.58.206:57048     NULL    Query   1827    Opening tables  insert into dj_global.t_account_purchase_reforged_info (bn_account,first_buy_time, buy_type, buy_tim
3618263 dj_web  10.170.24.69:40034      quartzcluster   Query   1837    Opening tables  SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = '
3618258 dj_web  10.170.24.65:24319      quartzcluster   Query   1840    Opening tables  SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = '
3618251 dj_web  10.170.24.64:2225       quartzcluster   Query   1846    Opening tables  SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = '
3574297 dz_platform     10.200.112.158:32998    sample       Query   1860    Sending data    SELECT id, tag, src_type, src_value FROM sample.t_classify_list WHERE target_type=1 and ((src_typ
3574307 dz_platform     10.200.112.158:33045    sample       Query   1860    update  INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574310 dz_platform     10.200.112.158:33053    sample       Query   1860    update  INSERT INTO sample.t_im_operation_log SET player_guid=4956735,target_guid=4528631,operation_type=
3574318 dz_platform     10.200.112.158:33092    sample       Query   1860    updating        UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:28" WHERE guid=14674769
3574323 dz_platform     10.200.112.158:33107    sample       Query   1860    Sending data    SELECT id, tag, src_type, src_value FROM sample.t_classify_list WHERE target_type=1 and ((src_typ
3583444 dz_platform     10.200.112.162:46546    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=250575866908835456,json_deta
3584388 dz_platform     10.200.112.163:17470    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=8881174844195141777,json_det
3584689 dz_platform     10.200.112.163:17498    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6276502896882417449,json_det
3584709 dz_platform     10.200.112.163:17501    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=9109374814637982141,json_det
3584717 dz_platform     10.200.112.163:17502    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=10213742152050212746,json_de
3584731 dz_platform     10.200.112.163:17503    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=5122262309034721223,json_det
3584739 dz_platform     10.200.112.163:17504    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6588339744732801910,json_det
3584744 dz_platform     10.200.112.163:17506    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7366914210512698611,json_det
3584746 dz_platform     10.200.112.162:46631    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=11886193238767501583,json_de
3584748 dz_platform     10.200.112.162:46632    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=3605642505651157383,json_det
3584749 dz_platform     10.200.112.162:46633    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=3815382960058859966,json_det
3584760 dz_platform     10.200.112.162:50635    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=17043244837117294637,json_de
3584767 dz_platform     10.200.112.162:50636    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12951179862128919047,json_de
3584774 dz_platform     10.200.112.162:50637    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=828818081452327429,json_deta
3584822 dz_platform     10.200.112.162:50647    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7181366928926244039,json_det
3584831 dz_platform     10.200.112.163:17527    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7554396207362277886,json_det
3584840 dz_platform     10.200.112.163:17528    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=14956419192627331508,json_de
3584842 dz_platform     10.200.112.163:17529    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7996456081375296351,json_det
3584846 dz_platform     10.200.112.162:50653    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=14021279291868905690,json_de
3584850 dz_platform     10.200.112.163:17531    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=11628538000356606163,json_de
3584855 dz_platform     10.200.112.163:17534    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15036078569541534150,json_de
3584856 dz_platform     10.200.112.163:17535    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=1278899597164609392,json_det
3584859 dz_platform     10.200.112.163:17537    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15043363989421752776,json_de
3584882 dz_platform     10.200.112.162:50661    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12716984950564520400,json_de
3584889 dz_platform     10.200.112.162:50662    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=17146115119242214679,json_de
3584899 dz_platform     10.200.112.162:50664    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=4066619825112416761,json_det
3584901 dz_platform     10.200.112.162:50666    sample Query   1860    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12516628510072308130,json_de
3574083 dz_platform     10.200.112.159:38523    sample       Query   1861    updating        UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=2747787
3574140 dz_platform     10.200.112.159:38759    sample       Query   1861    update  INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574143 dz_platform     10.200.112.158:32576    sample       Query   1861    update  INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574168 dz_platform     10.200.112.158:32654    sample       Query   1861    update  INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574169 dz_platform     10.200.112.158:32655    sample       Query   1861    updating        UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=1717147
3574225 dz_platform     10.200.112.158:32813    sample       Query   1861    update  INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574241 dz_platform     10.200.112.159:38986    sample       Query   1861    updating        UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=12178924
3574285 dz_platform     10.200.112.158:32925    sample       Query   1861    update  INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574294 dz_platform     10.200.112.158:32966    sample       Query   1861    updating        UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=9287147
3584040 dz_platform     10.200.112.163:17450    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=13716119511846024487,json_de
3584213 dz_platform     10.200.112.163:17461    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=5722385290302258856,json_det
3584823 dz_platform     10.200.112.163:17523    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15396974823064402816,json_de
3584827 dz_platform     10.200.112.163:17525    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6378134879806353018,json_det
3584864 dz_platform     10.200.112.162:50659    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12738415664845488047,json_de
3584881 dz_platform     10.200.112.162:50660    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6307006536302657328,json_det
3584950 dz_platform     10.200.112.162:50681    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15311403621235753478,json_de
3584953 dz_platform     10.200.112.162:50684    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=14492173806780744008,json_de
3584955 dz_platform     10.200.112.162:50685    sample Query   1861    update  INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15259713926051594746,json_de

从收集到的指标来看,可以看出:

  1. show processlist中出现大量的opening table状态的任务, 推断为table_cache_manager 的锁所致, 结合strace 中观察的futex锁到情况,比较吻合。 此处有一个重要参数table_open_cache_instances(出现故障的DB上该参数配置为1)需要重点描述: MySQL根据线程的 thread_id 确定线程将要使用的 table_cache,即 thread_id % table_cache_instances,从该instance的 tabel_cache 元素中查找相关连的 table_cache_element,如果查找到,则从table_cache_element 的 free_tables 中出取一个并返回,并调整 table_cache_element 中的 free_tables & used_tables 元素;如果不存在,则重新创建一个 table, 并加入对应的 table_cache_element 的 used_tables的列表; 因此: 从以上过程可以看出,MySQL 在打开表的过程中会首先从 table_cache 中进行查找有没有可以直接使用的表句柄,有则直接使用,没有则会创建并加入到对应的 table_cache 中对应的 table_cache_element 中,从刚才提取的现场信息来看, 有大量的线程在查找 table_cache 的过程中阻塞了,而 table_open_cache_instances 的个数为 1, 因此,此参数的设置需要调整,可以大大缓解线程死锁问题。
  1. 从SQL的执行时间来看 ,1861秒前出现大量的insert/update阻塞,到了1846秒前,出现大量Opening tables 在状态,中间相隔15秒左右,可以推断出table_cache_instances不足引起的锁等待的事件源,是发生在1861秒前的时间点。
  2. MySQL在1861秒前后的时间段,应用层没有出现慢日志,也没有出现错误日志。

总体推论如下:

  1. 1861秒前(2021-03-30 18:49:27)发生的数据库卡顿,15秒后业务的重试引起的并发使得table_open_cache_instances锁争抢加剧从而影响到MySQL的可用性,DBA人员在19:18分抓取到线程信息,仅仅是故障之后的现场。 虽然IO性能问题可以触发此现象,但是引起IO性能问题很多,同时受限于当前的监控频率和监控盲区,以及OS层和硬件层并无报错日志,所以很难准确地定位问题所在。

优化措施:

一 .DB层/OS层/硬件层 完善监控盲区:

1. 故障时刻raid相关的监控并没有(仅依赖机房5分钟一次的巡检),例如:

代码语言:javascript
复制
Raid卡电池低电量监控
Raid卡电池状态监控
LD Current Cache Policy监控
LD state 监控
磁盘预坏监控
磁盘介质错误数监控

2. SSD可用性以及性能监控缺失

代码语言:javascript
复制
SSD磁盘寿命监控
SSD温度监控(温度引起的衰减)
SSD可写监控

3. DB层面缺乏MySQL内部线程状态的监控(特别是功能线程),同时增加关键指标的同比监控:(DB故障时刻无告警)

代码语言:javascript
复制
Table_open_cache_hits
Innodb_rows_updated
Innodb_rows_inserted
Innodb_row_lock_current_waits
Table_open_cache_overflows
Innodb_buffer_pool_pages_flushed
#####################################
当前会话数(show processlist获取)

4. 调整table_open_cache_instances 的大小(当前为1)缓解锁竞争问题。

参看监控图中Table_open_cache_overflows值较大,

表明该table_open_cache_instances值明显偏小。

5.MySQL层面的HA方案,以便failover保留现场。(第二期优化)

二. DBA人员技术预案:

制定第三方公司DBA人员抓取现场的技术预案:

1.相同规格的MySQL服务器有10多台, 任意一台DB出现同样卡顿问题,首先确保服务器上MySQLD二进制程序没有被strip掉符号表, 然后再执行

代码语言:javascript
复制
pstack `pidof mysqld` 

如果pstack执行失败,则执行:

代码语言:javascript
复制
 gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld`  > /tmp/gdb.log 2>&1

2. 执行

代码语言:javascript
复制
strace -ff -T -s 4096 -p `pidof mysqld` >/tmp/trace.log 2>&1

使用前文描述的方法获取持有锁的线程id

3. 确保数据库进程使用的块设备的设备名,然后执行

代码语言:javascript
复制
blktrace -d /dev/DB所在磁盘的设备名  -o - | blkparse -i - -f "%D %2c %8s %5T.%9t %5p %2a %3d\n"  > iotrace.txt

整理为以下的脚本:

代码语言:javascript
复制
##########################################################
cat /proc/slabinfo > slabinfo.txt
###获取进程task_struct相关信息
cp -r /proc/`pidof mysqld`/   /opt
##获取线程切换的相关情况
pidstat -d -r -u -w -l -t -h -p ALL 5 1 > pidstat.txt
ps axuw > ps_aux.txt
top -b -n 1 -c > top.txt
##获取线程打开文件描述情况 
lsof >   lsof.txt
vmstat 1 8 > vmstat.txt
iostat -dx 1 -c 3 > iostat.txt
##获取io队列相关信息
blktrace -d /dev/sdb -o - | blkparse -i - -f "%D %2c %8s %5T.%9t %5p %2a %3d\n"  > iotrace.txt
##获取mysql线程堆栈信息
pstack `pidof mysqld`  > pstack.txt
##获取mysql线程d d系统调用信息 
strace -ff -s 4096 -p `pidof mysqld` > mysqld_trace.txt 2>&1
##获取mysql线程堆栈信息
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld`  > /tmp/gdb.log 2>&1

#############################################################
###获取当前MysQL执行耗时的函数 
echo "call sys.ps_setup_enable_consumer('waits');
select *,latency/1000/1000/1000 as ms from sys.x$latest_file_io order by latency desc limit 20;
call sys.ps_setup_disable_consumer('waits');"|mysql
 
mysqladmin -S /var/lib/mysql/mysql.sock  debug
mysql -e "show engine innodb status\G" > innodb_status.txt
mysql -e "show full processlist"  > process.list

4. 收集完成后,重启MySQL应用

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2021-04-13,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 PT运维技术 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
云数据库 SQL Server
腾讯云数据库 SQL Server (TencentDB for SQL Server)是业界最常用的商用数据库之一,对基于 Windows 架构的应用程序具有完美的支持。TencentDB for SQL Server 拥有微软正版授权,可持续为用户提供最新的功能,避免未授权使用软件的风险。具有即开即用、稳定可靠、安全运行、弹性扩缩等特点。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档