背景:
3月30日下班时间,一条业务线突发业务故障,业务方反馈用户无法访问。由于时间点比较特殊,DBA/开发/运维都在回家的途中,很难第一时间处理DB故障。20-30分钟后,DBA到家后,在抓取MySQL/OS等相关信息后,重启了数据库,问题得到解决。
事后分析:
第二天工作时间,将DBA在现场抓取到的现场相关文件复制到测试服务器上。具体如下:
1. MySQL进程的strace 信息
2. iotrace(获取不正确),
3. lsof
4. top
5. engine_inodb_status(未获取),
6. ps
7. pstack(没有抓取到)
8. /proc/`pidof mysql`/(没有抓取出来).
仔细分析strace的抓取信息:
绝大部分线程处于锁等待状态:
[pid 379735] futex(0x135dac4, FUTEX_WAIT_PRIVATE, 7235605, NULL <unfinished ...>
[pid 379721] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379689] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48732, NULL <unfinished ...>
[pid 379352] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48731, NULL <unfinished ...>
[pid 379689] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 379352] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 379689] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379352] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379243] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48730, NULL <unfinished ...>
[pid 379188] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48729, NULL <unfinished ...>
[pid 379243] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 379188] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 379243] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379188] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379164] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48728, NULL <unfinished ...>
[pid 379144] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48727, NULL <unfinished ...>
[pid 379164] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 379144] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 379164] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379144] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733, NULL <unfinished ...>
[pid 379137] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48726, NULL <unfinishe
统计次数后:
179 futex(0x9acb6250, FUTEX_WAKE_PRIVATE, 1) = 0
991 futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48733
绝大部分线程全部都在等待0x9acc00f4 锁, DBA当时应该执行
gdb -p `pidof mysql`
然后执行
p *(pthread_mutex_t*)0x9acc00f4
即可找到哪个线程持有锁(根据__owner),也可以获知futex锁的类型(__kind), 然后拿到OS_thread_id到/proc/`pidof mysqld`/smap中或者mysql performance_schema.threads的thread_os_id获取, 从而定位到具体的功能线程。
由于DBA重启了MySQL, 所以当前后无法重现了。最最重要的,居然没有通过pstack抓取到mysql当时的进程堆栈的信息。
##业务进程发起的select查询
[pid 381101] poll([{fd=2378, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2378, revents=POLLIN}])
[pid 381101] recvfrom(2378, "P\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 381101] recvfrom(2378, "\3SELECT * FROM t_call_comrade WHERE callee_guid = 12400596 AND state > 0 LIMIT 1", 80, MSG_DONTWAIT, NULL, NULL) = 80
[pid 381101] sched_yield() = 0
[pid 381101] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48741, NULL <unfinished ...>
###业务进程发起insert
[pid 380158] poll([{fd=2375, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2375, revents=POLLIN}])
[pid 380158] recvfrom(2375, "\23\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 380158] recvfrom(2375, "\3SET AUTOCOMMIT = 0", 19, MSG_DONTWAIT, NULL, NULL) = 19
[pid 380158] sendto(2375, "\7\0\0\1\0\0\0\0\0\0\0", 11, MSG_DONTWAIT, NULL, 0) = 11
[pid 380158] recvfrom(2375, 0xaa68ed00, 4, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 380158] poll([{fd=2375, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2375, revents=POLLIN}])
[pid 380158] recvfrom(2375, "\16\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 380158] recvfrom(2375, "\3select NOW();", 14, MSG_DONTWAIT, NULL, NULL) = 14
[pid 380158] sendto(2375, "\1\0\0\1\1\33\0\0\2\3def\0\0\0\5NOW()\0\f?\0\23\0\0\0\f\201\0\0\0\0\5\0\0\3\376\0\0\0\0\24\0\0\4\0232021-03-30 19:20:02\5\0\0\5\376\0\0\0\0", 78, MSG_DONTWAIT, NULL, 0) = 78
[pid 380158] recvfrom(2375, 0xaa68ed00, 4, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid 380158] poll([{fd=2375, events=POLLIN|POLLPRI}], 1, 1500000) = 1 ([{fd=2375, revents=POLLIN}])
[pid 380158] recvfrom(2375, "G\1\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4
[pid 380158] recvfrom(2375, "\3insert into dj_global.t_account_purchase_reforged_info (bn_account,first_buy_time, buy_type, buy_time, update_time) values(\"528756761@battlenet.oauth\",\"2021-03-30 18:44:56\", 1, \"2021-03-30 18:44:56\", \"2021-03-30 19:20:02\") ON DUPLICATE KEY UPDATE buy_type=1 ,buy_time=\"2021-03-30 18:44:56\", update_time=\"2021-03-30 19:20:02\";", 327, MSG_DONTWAIT, NULL, NULL) = 327
[pid 380158] sched_yield() = 0
[pid 380158] futex(0x9acc00f4, FUTEX_WAIT_PRIVATE, 48738, NULL <unfinished ...>
从strace里可以看到,业务方发起的SQL语句,紧跟着发生了线程调度sched_yield(), 然后等待0x9acc00f4地址上持有futex释放,381101 线程和380158线程表现出一样的行为(收到业务SQL后yield(), futex wait), 从目前strace日志来看,发生了线程间的死锁,但原因目前不好界定。
DBA通过命令获取到的系统层的指标:
ps 获取mysqld进程的信息
#################################################################################################################################################################################
148547 root 20 0 4340 1500 1468 S 0.0 0.0 0:00.06 /bin/sh /home/opt/mysql/bin/mysqld_safe --datadir=/data/mysql/var --pid-file=/data/mysql/tmp/mys+
150853 mysql 20 0 0.118t 0.094t 8172 S 0.0 76.4 1388:21 /home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql --datadir=/data/mysql/var --plugin-dir=/hom+
top获取到mysqld进程的信息
#################################################################################################################################################################################
root 148547 0.0 0.0 4340 1500 ? S Feb16 0:00 /bin/sh /home/opt/mysql/bin/mysqld_safe --datadir=/data/mysql/var --pid-file=/data/mysql/tmp/mysql.pid
mysql 150853 2.2 76.3 126431396 100598400 ? Sl Feb16 1388:21 /home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql --datadir=/data/mysql/var --plugin-dir=/home/opt/mysql/lib/plugin --user=mysql
pidstat获取到mysql线程切换的相关信息
#################################################################################################################################################################################
# Time UID TGID TID %usr %system %guest %CPU CPU minflt/s majflt/s VSZ RSS %MEM kB_rd/s kB_wr/s kB_ccwr/s iodelay cswch/s nvcswch/s Command
1617103000 8001 0 375925 0.00 0.00 0.00 0.00 12 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql
1617103000 8001 0 376083 0.00 0.00 0.00 0.00 12 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql
1617103000 8001 0 376529 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql
1617103000 8001 0 376939 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql
1617103000 8001 0 377270 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql
1617103000 8001 0 377473 0.00 0.00 0.00 0.00 12 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld --basedir=/home/opt/mysql
#################################################################################################################################################################################
lsof获取mysql当时打开的regular file的fd数量
# grep ^mysqld -w lsof.txt |grep REG -w -c
1439386
#################################################################################################################################################################################
iostat获取当时磁盘IO情况(但io调度队列没有抓取到)
avg-cpu: %user %nice %system %iowait %steal %idle
0.09 0.00 0.10 0.04 0.00 99.77
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.01 0.79 0.06 0.45 0.44 5.45 23.16 0.00 2.42 17.11 0.44 2.07 0.11
sdb 0.00 116.86 1.96 188.16 19.64 2733.96 28.97 0.03 0.18 0.21 0.18 0.15 2.82
avg-cpu: %user %nice %system %iowait %steal %idle
0.03 0.00 0.15 0.15 0.00 99.67
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 2.00 1.00 12.00 4.00 10.67 0.06 20.00 30.00 0.00 20.00 6.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.27 0.00 0.00 99.72
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 1607.00 0.00 410804.00 511.27 4.47 2.79 0.00 2.79 0.05 8.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
# Time UID TGID TID %usr %system %guest %CPU CPU minflt/s majflt/s VSZ RSS %MEM kB_rd/s kB_wr/s kB_ccwr/s iodelay cswch/s nvcswch/s Command
1617103000 8001 0 375860 0.00 0.00 0.00 0.00 14 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 375861 0.00 0.00 0.00 0.00 14 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 375862 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 375869 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 375925 0.00 0.00 0.00 0.00 12 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 376083 0.00 0.00 0.00 0.00 12 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 376529 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 376939 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 377270 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 377473 0.00 0.00 0.00 0.00 12 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
1617103000 8001 0 377530 0.00 0.00 0.00 0.00 0 0.00 0.00 126431136 100598412 76.38 0.00 0.00 0.00 0 0.00 0.00 |__/home/opt/mysql/bin/mysqld
OS层面的指标分析:
DBA抓取到show proccesslist:
3618376 dz_webdevelop 10.170.24.77:50416 dj_global Query 1727 Opening tables SELECT * FROM t_call_comrade WHERE callee_guid = 12254381 AND state > 0 LIMIT 1
3618337 dz_platform 10.200.58.206:57055 NULL Query 1767 Opening tables insert into dj_global.t_account_purchase_reforged_info (bn_account,first_buy_time, buy_type, buy_tim
3618303 dz_webdevelop 10.170.24.77:50406 dj_global Query 1797 Opening tables SELECT * FROM t_call_comrade WHERE callee_guid = 2180351 AND state > 0 LIMIT 1
3618302 dz_webdevelop 10.192.11.201:40860 dj_global Query 1798 Opening tables SELECT * FROM t_call_comrade WHERE callee_guid = 2180351 AND state > 0 LIMIT 1
3618285 dj_web 10.170.24.77:50402 quartzcluster Query 1816 Opening tables SELECT COUNT(TRIGGER_NAME) FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'quartzScheduler' AND NOT (MISFIRE_
3618287 dj_web 10.192.11.201:40857 quartzcluster Query 1816 Opening tables SELECT COUNT(TRIGGER_NAME) FROM QRTZ_TRIGGERS WHERE SCHED_NAME = 'quartzScheduler' AND NOT (MISFIRE_
3618273 dz_platform 10.200.58.206:57048 NULL Query 1827 Opening tables insert into dj_global.t_account_purchase_reforged_info (bn_account,first_buy_time, buy_type, buy_tim
3618263 dj_web 10.170.24.69:40034 quartzcluster Query 1837 Opening tables SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = '
3618258 dj_web 10.170.24.65:24319 quartzcluster Query 1840 Opening tables SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = '
3618251 dj_web 10.170.24.64:2225 quartzcluster Query 1846 Opening tables SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM QRTZ_TRIGGERS WHERE SCHED_NAME = '
3574297 dz_platform 10.200.112.158:32998 sample Query 1860 Sending data SELECT id, tag, src_type, src_value FROM sample.t_classify_list WHERE target_type=1 and ((src_typ
3574307 dz_platform 10.200.112.158:33045 sample Query 1860 update INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574310 dz_platform 10.200.112.158:33053 sample Query 1860 update INSERT INTO sample.t_im_operation_log SET player_guid=4956735,target_guid=4528631,operation_type=
3574318 dz_platform 10.200.112.158:33092 sample Query 1860 updating UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:28" WHERE guid=14674769
3574323 dz_platform 10.200.112.158:33107 sample Query 1860 Sending data SELECT id, tag, src_type, src_value FROM sample.t_classify_list WHERE target_type=1 and ((src_typ
3583444 dz_platform 10.200.112.162:46546 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=250575866908835456,json_deta
3584388 dz_platform 10.200.112.163:17470 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=8881174844195141777,json_det
3584689 dz_platform 10.200.112.163:17498 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6276502896882417449,json_det
3584709 dz_platform 10.200.112.163:17501 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=9109374814637982141,json_det
3584717 dz_platform 10.200.112.163:17502 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=10213742152050212746,json_de
3584731 dz_platform 10.200.112.163:17503 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=5122262309034721223,json_det
3584739 dz_platform 10.200.112.163:17504 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6588339744732801910,json_det
3584744 dz_platform 10.200.112.163:17506 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7366914210512698611,json_det
3584746 dz_platform 10.200.112.162:46631 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=11886193238767501583,json_de
3584748 dz_platform 10.200.112.162:46632 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=3605642505651157383,json_det
3584749 dz_platform 10.200.112.162:46633 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=3815382960058859966,json_det
3584760 dz_platform 10.200.112.162:50635 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=17043244837117294637,json_de
3584767 dz_platform 10.200.112.162:50636 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12951179862128919047,json_de
3584774 dz_platform 10.200.112.162:50637 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=828818081452327429,json_deta
3584822 dz_platform 10.200.112.162:50647 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7181366928926244039,json_det
3584831 dz_platform 10.200.112.163:17527 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7554396207362277886,json_det
3584840 dz_platform 10.200.112.163:17528 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=14956419192627331508,json_de
3584842 dz_platform 10.200.112.163:17529 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=7996456081375296351,json_det
3584846 dz_platform 10.200.112.162:50653 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=14021279291868905690,json_de
3584850 dz_platform 10.200.112.163:17531 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=11628538000356606163,json_de
3584855 dz_platform 10.200.112.163:17534 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15036078569541534150,json_de
3584856 dz_platform 10.200.112.163:17535 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=1278899597164609392,json_det
3584859 dz_platform 10.200.112.163:17537 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15043363989421752776,json_de
3584882 dz_platform 10.200.112.162:50661 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12716984950564520400,json_de
3584889 dz_platform 10.200.112.162:50662 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=17146115119242214679,json_de
3584899 dz_platform 10.200.112.162:50664 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=4066619825112416761,json_det
3584901 dz_platform 10.200.112.162:50666 sample Query 1860 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12516628510072308130,json_de
3574083 dz_platform 10.200.112.159:38523 sample Query 1861 updating UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=2747787
3574140 dz_platform 10.200.112.159:38759 sample Query 1861 update INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574143 dz_platform 10.200.112.158:32576 sample Query 1861 update INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574168 dz_platform 10.200.112.158:32654 sample Query 1861 update INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574169 dz_platform 10.200.112.158:32655 sample Query 1861 updating UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=1717147
3574225 dz_platform 10.200.112.158:32813 sample Query 1861 update INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574241 dz_platform 10.200.112.159:38986 sample Query 1861 updating UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=12178924
3574285 dz_platform 10.200.112.158:32925 sample Query 1861 update INSERT INTO sample.t_install_lang_conclude SET language="zh_CN",total_count=1,conclude_date=CURRE
3574294 dz_platform 10.200.112.158:32966 sample Query 1861 updating UPDATE sample.t_global_guid SET last_logout_time="2021-03-30 18:49:27" WHERE guid=9287147
3584040 dz_platform 10.200.112.163:17450 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=13716119511846024487,json_de
3584213 dz_platform 10.200.112.163:17461 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=5722385290302258856,json_det
3584823 dz_platform 10.200.112.163:17523 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15396974823064402816,json_de
3584827 dz_platform 10.200.112.163:17525 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6378134879806353018,json_det
3584864 dz_platform 10.200.112.162:50659 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=12738415664845488047,json_de
3584881 dz_platform 10.200.112.162:50660 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=6307006536302657328,json_det
3584950 dz_platform 10.200.112.162:50681 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15311403621235753478,json_de
3584953 dz_platform 10.200.112.162:50684 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=14492173806780744008,json_de
3584955 dz_platform 10.200.112.162:50685 sample Query 1861 update INSERT INTO sample.t_war3rpg_battle_detail_roll_20210330 SET room_guid=15259713926051594746,json_de
从收集到的指标来看,可以看出:
总体推论如下:
优化措施:
一 .DB层/OS层/硬件层 完善监控盲区:
1. 故障时刻raid相关的监控并没有(仅依赖机房5分钟一次的巡检),例如:
Raid卡电池低电量监控
Raid卡电池状态监控
LD Current Cache Policy监控
LD state 监控
磁盘预坏监控
磁盘介质错误数监控
2. SSD可用性以及性能监控缺失
SSD磁盘寿命监控
SSD温度监控(温度引起的衰减)
SSD可写监控
3. DB层面缺乏MySQL内部线程状态的监控(特别是功能线程),同时增加关键指标的同比监控:(DB故障时刻无告警)
Table_open_cache_hits
Innodb_rows_updated
Innodb_rows_inserted
Innodb_row_lock_current_waits
Table_open_cache_overflows
Innodb_buffer_pool_pages_flushed
#####################################
当前会话数(show processlist获取)
4. 调整table_open_cache_instances 的大小(当前为1)缓解锁竞争问题。
参看监控图中Table_open_cache_overflows值较大,
表明该table_open_cache_instances值明显偏小。
5.MySQL层面的HA方案,以便failover保留现场。(第二期优化)
二. DBA人员技术预案:
制定第三方公司DBA人员抓取现场的技术预案:
1.相同规格的MySQL服务器有10多台, 任意一台DB出现同样卡顿问题,首先确保服务器上MySQLD二进制程序没有被strip掉符号表, 然后再执行
pstack `pidof mysqld`
如果pstack执行失败,则执行:
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld` > /tmp/gdb.log 2>&1
2. 执行
strace -ff -T -s 4096 -p `pidof mysqld` >/tmp/trace.log 2>&1
使用前文描述的方法获取持有锁的线程id
3. 确保数据库进程使用的块设备的设备名,然后执行
blktrace -d /dev/DB所在磁盘的设备名 -o - | blkparse -i - -f "%D %2c %8s %5T.%9t %5p %2a %3d\n" > iotrace.txt
整理为以下的脚本:
##########################################################
cat /proc/slabinfo > slabinfo.txt
###获取进程task_struct相关信息
cp -r /proc/`pidof mysqld`/ /opt
##获取线程切换的相关情况
pidstat -d -r -u -w -l -t -h -p ALL 5 1 > pidstat.txt
ps axuw > ps_aux.txt
top -b -n 1 -c > top.txt
##获取线程打开文件描述情况
lsof > lsof.txt
vmstat 1 8 > vmstat.txt
iostat -dx 1 -c 3 > iostat.txt
##获取io队列相关信息
blktrace -d /dev/sdb -o - | blkparse -i - -f "%D %2c %8s %5T.%9t %5p %2a %3d\n" > iotrace.txt
##获取mysql线程堆栈信息
pstack `pidof mysqld` > pstack.txt
##获取mysql线程d d系统调用信息
strace -ff -s 4096 -p `pidof mysqld` > mysqld_trace.txt 2>&1
##获取mysql线程堆栈信息
gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld` > /tmp/gdb.log 2>&1
#############################################################
###获取当前MysQL执行耗时的函数
echo "call sys.ps_setup_enable_consumer('waits');
select *,latency/1000/1000/1000 as ms from sys.x$latest_file_io order by latency desc limit 20;
call sys.ps_setup_disable_consumer('waits');"|mysql
mysqladmin -S /var/lib/mysql/mysql.sock debug
mysql -e "show engine innodb status\G" > innodb_status.txt
mysql -e "show full processlist" > process.list
4. 收集完成后,重启MySQL应用