前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >MySQL PXC 集群死锁分析案例-2

MySQL PXC 集群死锁分析案例-2

原创
作者头像
雪人
修改2024-07-26 10:40:11
1350
修改2024-07-26 10:40:11

接前文

二、节点日志分析108节点

■■ 以下日志显示,3:18分109节点出现问题(问题详见"109节点日志分析"),继而109节点被驱逐出去集群。

代码语言:shell
复制
2024-07-15T03:18:34.562042Z 0 [Note] [MY-000000] [Galera] Member 1(pxc-cluster-node-2) initiates vote on 2a37cad4-f834-11ee-916e-da042b
ca7bc3:9338777,892e7c82cf62c56b:  Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase thi
s mysqld variable and try again, Error_code: 1197; Writing one row to the row-based binary log failed, Error_code: 1534;
2024-07-15T03:18:34.562391Z 0 [Note] [MY-000000] [Galera] Recomputed vote based on error codes: 1197, 1534. New vote d659be586fd355cc w
ill be used for further steps. Old Vote: 892e7c82cf62c56b
2024-07-15T03:18:34.562448Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
   0000000000000000:   1/3
   d659be586fd355cc:   1/3
Waiting for more votes.
2024-07-15T03:18:34.562568Z 15 [Note] [MY-000000] [Galera] Got vote request for seqno 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777
2024-07-15T03:18:34.564109Z 0 [Note] [MY-000000] [Galera] Member 0(pxc-cluster-node-3) responds to vote on 2a37cad4-f834-11ee-916e-da04
2bca7bc3:9338777,0000000000000000: Success
2024-07-15T03:18:34.564159Z 0 [Warning] [MY-000000] [Galera] Received bogus VOTE message: 9338777.0, from node 2a36a321-f834-11ee-bc44-
bb66609ee19b, expected > 9338860. Ignoring.
2024-07-15T03:18:34.564188Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
   0000000000000000:   1/3
   d659be586fd355cc:   1/3
Waiting for more votes.
2024-07-15T03:18:34.564212Z 0 [Note] [MY-000000] [Galera] Member 2(pxc-cluster-node-1) responds to vote on 2a37cad4-f834-11ee-916e-da04
2bca7bc3:9338777,0000000000000000: Success
2024-07-15T03:18:34.564229Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
   0000000000000000:   2/3
   d659be586fd355cc:   1/3
Winner: 0000000000000000
Winner: 0000000000000000
2024-07-15T03:18:34.564279Z 15 [Note] [MY-000000] [Galera] Vote 0 (success) on 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777 is consiste
nt with group. Continue.
2024-07-15T03:18:34.571056Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T03:18:34.571138Z 0 [Note] [MY-000000] [Galera] forgetting 601c1fce-8a87 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T03:18:34.572400Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T03:18:34.578145Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,10)
memb {
        99999999-9910,0
        99999999-9908,0
        }
joined {
        }
left {
        }
partitioned {
        601c1fce-8a87,0
        }
)
2024-07-15T03:18:34.578202Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T03:18:34.579951Z 0 [Note] [MY-000000] [Galera] forgetting 601c1fce-8a87 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T03:18:34.579977Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2024-07-15T03:18:34.580032Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T03:18:34.581317Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3
2024-07-15T03:18:34.582510Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3 from 0 (p
xc-cluster-node-3)
2024-07-15T03:18:34.582561Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3 from 1 (p
xc-cluster-node-1)
2024-07-15T03:18:34.582581Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 9,
        members    = 2/2 (primary/total),
        act_id     = 9338911,
        last_appl. = 9338766,
        protocols  = 2/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T03:18:34.582642Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [141, 141]
2024-07-15T03:18:34.582782Z 13 [Note] [MY-000000] [Galera] ####### processing CC 9338912, local, ordered
2024-07-15T03:18:34.582833Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 9338911 upto current CC event 9338912 upto:9338911
2024-07-15T03:18:34.582853Z 13 [Note] [MY-000000] [Galera] Drain monitors from 9338911 up to 9338911
2024-07-15T03:18:34.582875Z 13 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T03:18:34.582890Z 13 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T03:18:34.582925Z 13 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T03:18:34.582951Z 13 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9338911 -> 9338912
2024-07-15T03:18:34.583030Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T03:18:34.586904Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9338912
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREOR
DERED, STREAMING, NBO
  final: no
  own_index: 1
  members(2):
        0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
        1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
=================================================
2024-07-15T03:18:34.587017Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-15T03:18:34.595121Z 13 [Note] [MY-000000] [Galera] Recording CC from group: 9338912
2024-07-15T03:18:34.595181Z 13 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 9338767
2024-07-15T03:18:34.595204Z 13 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 9337022

■■ 但随后的日志显示,108节点处于正常服务状态。

■■ 随后的日志显示,04:00有节点重启加入集群:

代码语言:shell
复制
2024-07-15T04:00:07.309528Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') connection established
 to 99999999-9909 tcp://[9999:9999:9999:9999::6d]:4567
2024-07-15T04:00:07.573846Z 116353 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:00:07.812276Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T04:00:07.812392Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9909 at tcp://[9999:9999:9999:9999::6d]:4567 stable
2024-07-15T04:00:07.813733Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T04:00:07.815269Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,11)
memb {
        99999999-9910,0
        99999999-9908,0
        99999999-9909,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2024-07-15T04:00:07.815311Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T04:00:07.817183Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2024-07-15T04:00:07.817262Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T04:00:07.821664Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3
2024-07-15T04:00:07.822896Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 0 (px
c-cluster-node-3)
2024-07-15T04:00:07.822956Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 1 (px
c-cluster-node-1)
2024-07-15T04:00:08.310999Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 2 (px
c-cluster-node-2)
2024-07-15T04:00:08.311090Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 10,
        members    = 2/3 (primary/total),
        act_id     = 9339054,
        last_appl. = 9338766,
        protocols  = 2/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T04:00:08.311280Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [173, 173]
2024-07-15T04:00:08.311566Z 13 [Note] [MY-000000] [Galera] ####### processing CC 9339055, local, ordered
2024-07-15T04:00:08.311730Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 9339054 upto current CC event 9339055 upto:9339054
2024-07-15T04:00:08.311758Z 13 [Note] [MY-000000] [Galera] Drain monitors from 9339054 up to 9339054
2024-07-15T04:00:08.311777Z 13 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T04:00:08.311791Z 13 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T04:00:08.311804Z 13 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T04:00:08.311820Z 13 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9339054 -> 9339055
2024-07-15T04:00:08.311910Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T04:00:08.315209Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9339055
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORD
ERED, STREAMING, NBO
  final: no
  own_index: 1
  members(3):
        0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
        1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
        2: b93005c2-425e-11ef-8b79-73a96136cc10, pxc-cluster-node-2
=================================================

■■ 但紧接着是较多的锁表等待,根据以往测试情况,每次冲突也是总有线程获胜,而同时必然有线程受害:

代码语言:shell
复制
TRANSACTION 114209457, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT , undo log entries 3
MySQL thread id 16, OS thread handle 139734347618048, query id 6745542 Applying batch of row changes (update)
TRANSACTION 114208702, ACTIVE 1338 sec
, undo log entries 6
MySQL thread id 114856, OS thread handle 139691406636800, query id 6676148 2409:807c:5a06:1::104:1705 tm_xj wsrep: replicating and certi
fying write set
commit
2024-07-15T04:00:08.333081Z 16 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.333098Z 16 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2024-07-15T04:00:08.333111Z 16 [Note] [MY-000000] [WSREP] Winning thread:
   THD: 16, mode: high priority, state: exec, conflict: executing, seqno: 9339058
   SQL: (null)

2024-07-15T04:00:08.333122Z 16 [Note] [MY-000000] [WSREP] Victim thread:
   THD: 114856, mode: local, state: exec, conflict: certifying, seqno: -1
   SQL: commit

TRANSACTION 114209458, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT , undo log entries 3
MySQL thread id 12, OS thread handle 139734349731584, query id 6745554 Applying batch of row changes (update)
TRANSACTION 114208878, ACTIVE 994 sec
, undo log entries 10
MySQL thread id 115190, OS thread handle 139671641110272, query id 6694341 2409:807c:5a06:1::104:1702 tm_xj wsrep: replicating and certi
fying write set
commit
2024-07-15T04:00:08.334316Z 12 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.334343Z 12 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2024-07-15T04:00:08.334363Z 12 [Note] [MY-000000] [WSREP] Winning thread:
   THD: 12, mode: high priority, state: exec, conflict: executing, seqno: 9339060
   SQL: (null)

2024-07-15T04:00:08.334379Z 12 [Note] [MY-000000] [WSREP] Victim thread:
   THD: 115190, mode: local, state: exec, conflict: certifying, seqno: -1
   SQL: commit

连续发生了188次冲突

root@node1:0 /u01/mysqld.log# grep "CONFLICT DETECTED" mysqld.log.108|grep "2024-07-15T04:00"|wc -l

188

最后一次冲突如下:

代码语言:shell
复制
2024-07-15T04:00:08.729167Z 15 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.729202Z 15 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2024-07-15T04:00:08.729221Z 15 [Note] [MY-000000] [WSREP] Winning thread:
   THD: 15, mode: high priority, state: exec, conflict: executing, seqno: 9339273
   SQL: (null)

2024-07-15T04:00:08.729236Z 15 [Note] [MY-000000] [WSREP] Victim thread:
   THD: 115577, mode: local, state: exec, conflict: must_replay, seqno: 9339387
   SQL: commit

■■ 如下日志显示,在这些冲突发生110秒后,108节点试图连接109节点:

代码语言:shell
复制
2024-07-15T04:01:50.417379Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') turning message relay
requesting on, nonlive peers: tcp://[9999:9999:9999:9999::6d]:4567
2024-07-15T04:01:51.894449Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') reconnecting to b93005
c2-8b79 (tcp://[9999:9999:9999:9999::6d]:4567), attempt 0
2024-07-15T04:01:52.574837Z 116695 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:01:53.277070Z 116696 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:01:54.435874Z 116697 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'

+++++++++++ 以下日志显示,3秒钟没能连接成功,怀疑超时,断定109节点非活动,该节点被驱逐

代码语言:shell
复制
2024-07-15T04:01:54.847941Z 0 [Note] [MY-000000] [Galera] declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2024-07-15T04:01:54.848077Z 0 [Note] [MY-000000] [Galera] evs::proto(99999999-9908, OPERATIONAL, view_id(REG,99999999-9910,11)) suspecti
ng node: 99999999-9909
2024-07-15T04:01:54.848102Z 0 [Note] [MY-000000] [Galera] evs::proto(99999999-9908, OPERATIONAL, view_id(REG,99999999-9910,11)) suspecte
d node without join message, declaring inactive
2024-07-15T04:01:55.348302Z 0 [Note] [MY-000000] [Galera] declaring node with index 2 inactive (evs.inactive_timeout)

2024-07-15T04:01:55.851247Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T04:01:55.852703Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T04:01:55.858696Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,12)
memb {
        99999999-9910,0
        99999999-9908,0
        }
joined {
        }
left {
        }
partitioned {
        99999999-9909,0
        }
)
2024-07-15T04:01:55.858739Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T04:01:55.860637Z 0 [Note] [MY-000000] [Galera] forgetting 99999999-9909 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T04:01:55.860717Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2024-07-15T04:01:55.860857Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T04:01:55.860748Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') turning message relay 
requesting off

■■ 之后交换状态信息,获取到仲裁结果,确认了2个活动节点:108、110,并清理了109节点

代码语言:shell
复制
2024-07-15T04:01:55.862455Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: f9e37a10-425e-11ef-86d1-67533f4bae37
2024-07-15T04:01:55.863749Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: f9e37a10-425e-11ef-86d1-67533f4bae37 from 0 (px
c-cluster-node-3)
2024-07-15T04:01:55.863809Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: f9e37a10-425e-11ef-86d1-67533f4bae37 from 1 (px
c-cluster-node-1)
2024-07-15T04:01:55.863837Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 11,
        members    = 2/2 (primary/total),
        act_id     = 9339389,
        last_appl. = 9338766,
        protocols  = 2/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T04:01:55.864077Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [141, 141]
2024-07-15T04:01:55.864321Z 14 [Note] [MY-000000] [Galera] ####### processing CC 9339390, local, ordered
2024-07-15T04:01:55.864384Z 14 [Note] [MY-000000] [Galera] Maybe drain monitors from 9339389 upto current CC event 9339390 upto:9339389
2024-07-15T04:01:55.864411Z 14 [Note] [MY-000000] [Galera] Drain monitors from 9339389 up to 9339389
2024-07-15T04:01:55.864436Z 14 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T04:01:55.864456Z 14 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T04:01:55.864475Z 14 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T04:01:55.864521Z 14 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9339389 -> 9339390
2024-07-15T04:01:55.864632Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T04:01:55.867920Z 14 [Note] [MY-000000] [Galera] ================================================
View:
  id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9339390
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORD
ERED, STREAMING, NBO
  final: no
  own_index: 1
  members(2):
        0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
        1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
=================================================
2024-07-15T04:01:55.868021Z 14 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-15T04:01:55.874762Z 14 [Note] [MY-000000] [Galera] Recording CC from group: 9339390
2024-07-15T04:01:55.874837Z 14 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 9338767
2024-07-15T04:01:55.874866Z 14 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 9337022
2024-07-15T04:01:58.396276Z 0 [Note] [MY-000000] [Galera]  cleaning up 99999999-9909 (tcp://[9999:9999:9999:9999::6d]:4567)

■■ 之后的日志显示,109节点再次加入未成,继续被驱逐出集群。

三、节点日志分析110节点

■■ 前面的日志同108节点

■■ 4:03的日志显示,109节点再次加入未成,继续被驱逐出集群。

四、总结

根据现场反馈的信息,109节点于11:18分被驱逐后,系统缓慢,业务阻断,此时数据库任务积压,SQL运行慢,大量的锁表,数据库连接数1100左右,大量的wrsp同步线程,杀掉变成killed状态后怀疑资源并未释放。此时即使停掉2个节点,仅保留1个节点,仍然无法提交事务,进而继续大量锁表,此时只能重启数据库解决问题。

总结以上分析过程,以下按照时间顺序客观描述一些症状、现象,以利于找到问题的根本原因。

1、11:16

节点109 第二次发生大事务,全局缓存页达到了 570MB,报错,导致节点109 11:18 停止服务,被驱逐出集群。

2、11:23

一线人员反馈登录时报服务异常,然后自动跳转回登陆页面,导致登录不上系统。项目经理收到问题反馈后开始排查系统问题。登录ELK系统,查看系统监控发现家客异步导出微服务延迟异常。进一步查看详情,发现数据库出现积压,SQL执行变慢。

3、11:47

其他系统侧人员电话联系反馈该系统接口调用失败,导致相关工单无法办理。

4、12时

系统恢复正常登陆后,其他侧反馈接口正常,但是仍有小部分接口出现异常情况,需多次提交,才能成功。

5、12:03

运维人员后台监控发现出入库异常,"Error updating database. Caus"

运维人员根据异常信息定位到有锁表进程并对锁表进程进行释放

6、12:35左右

出入库接口依然存在无法正常回单问题,经核查发现数据库进程状态虽然变成“kill”,但是并没有释放。然后立即将现场最新情况反馈给公司专家,寻求远程协助。

7、13:30

停掉了两个备库节点,只保留主数据库节点。然后滚动重启相关服务及接口。重启完后,验证系统登陆正常,观察相关接口也正常。

8、14:00

支撑群里面一线装维又反馈回单出现报错如下:

Could not commit JDBC transaction; nested exception is com.mysql.cj.jdbc.exceptions.MySQLTra Deadlock found when trying to getlock; try restarting transaction

经与公司专家紧急开会讨论,初步估计是PXC集群异常,虽然已经下线了两个备库节点,但是主节点还在做强一致性同步,导致事务无法提交,从而出现大量锁表的情况。

9、16:15

电话联系系统管理员XXX,申请重启数据库。于16:25完成数据库重启,然后滚动重启相关服务及接口。重启完后,验证系统登陆正常,系统接口也正常。

10、16:44

在支撑群中通知一线人员系统恢复正常,重新进行回单。

一线分析参考:

本次故障是一线异步导出全网数据时,后台会同时更新XX平台割接过来的XXX关系状态,导出数据量过大时,出现了大事务提交,导致备库节点异常,从而整个PXC集群各节点数据同步卡死,最终导致事务提交不了,出现锁表的情况。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 接前文
  • 二、节点日志分析108节点
    • ■■ 以下日志显示,3:18分109节点出现问题(问题详见"109节点日志分析"),继而109节点被驱逐出去集群。
      • ■■ 但随后的日志显示,108节点处于正常服务状态。
        • ■■ 随后的日志显示,04:00有节点重启加入集群:
          • ■■ 但紧接着是较多的锁表等待,根据以往测试情况,每次冲突也是总有线程获胜,而同时必然有线程受害:
            • 连续发生了188次冲突
              • 最后一次冲突如下:
                • ■■ 如下日志显示,在这些冲突发生110秒后,108节点试图连接109节点:
                  • ■■ 之后交换状态信息,获取到仲裁结果,确认了2个活动节点:108、110,并清理了109节点
                    • ■■ 之后的日志显示,109节点再次加入未成,继续被驱逐出集群。
                    • 三、节点日志分析110节点
                      • ■■ 前面的日志同108节点
                        • ■■ 4:03的日志显示,109节点再次加入未成,继续被驱逐出集群。
                        • 四、总结
                          • 1、11:16
                            • 2、11:23
                              • 3、11:47
                                • 4、12时
                                  • 5、12:03
                                    • 6、12:35左右
                                      • 7、13:30
                                        • 8、14:00
                                          • 9、16:15
                                            • 10、16:44
                                              • 一线分析参考:
                                              相关产品与服务
                                              云数据库 MySQL
                                              腾讯云数据库 MySQL(TencentDB for MySQL)为用户提供安全可靠,性能卓越、易于维护的企业级云数据库服务。其具备6大企业级特性,包括企业级定制内核、企业级高可用、企业级高可靠、企业级安全、企业级扩展以及企业级智能运维。通过使用腾讯云数据库 MySQL,可实现分钟级别的数据库部署、弹性扩展以及全自动化的运维管理,不仅经济实惠,而且稳定可靠,易于运维。
                                              领券
                                              问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档