前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >通过Wireshark和arthas排查由DNS引发的Ignite生产故障案例

通过Wireshark和arthas排查由DNS引发的Ignite生产故障案例

作者头像
lovelife110
发布2021-12-08 17:14:20
2.7K0
发布2021-12-08 17:14:20
举报
文章被收录于专栏:爱生活爱编程爱生活爱编程

通过Wireshark和arthas排查由DNS引发的Ignite生产故障案例

故障背景

一次维护人员在上完线后,发现在分布式内存数据网格apache Ignite集群上通过客户端执行加载数据任务时,出现客户端节点连不上服务节点的问题。

生产环境情况: 分为A、B两中心 A中心 x86架构主机8台,为ignite服务节点 B中心 power架构主机8台,为ignite服务节点 Ignite版本为1.10,jdk为1.8

问题反馈: 当用x86 客户端连power架构的服务节点时,出现客户端节点连不上问题。而power 客户端连power服务端没有问题;x86 或者power客户端连x86节点也没有问题。

故障分析

由于开发人员log4j2日志配置不对,导致生产上没有看到错误日志

目前从维护人员得到的信息来看,猜测可能有2个原因导致故障: 1、跨架构导致故障,因为客户端x86架构,服务端power架构 2、网络问题导致故障,怀疑是跨中心,跨了网段导致故障

第一次定位问题

跨架构假设

针对第一点的假设,通过并行环境模拟,模拟不出故障,因此只能到生产环境重现故障,再做下一步定位。

生产重现故障

修改log4j2日志,得到报错信息(以下是后续本地环境模拟的,报错内容和生产一致): 服务端节点错误:

代码语言:javascript
复制
[2021-11-25T18:13:28,887][ERROR][tcp-disco-sock-reader-[045905d0 192.168.139.129:44809 client]-#5-#58][TcpDiscoverySpi] Caught exception on message read [sock=Socket[addr=/192.168.139.129,port=44809,localport=47500], locNodeId=5be3f5a4-08a1-4f46-aa14-e009963dc83c, rmtNodeId=045905d0-bcaa-4631-bd70-7cc1922a8422]
java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.write(SocketOutputStream.java:134) ~[?:1.8.0_211]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1807) ~[ignite-core-2.10.0.jar:2.10.0]
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.processJoinRequestMessage(ServerImpl.java:7581) ~[ignite-core-2.10.0.jar:2.10.0]
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7092) [ignite-core-2.10.0.jar:2.10.0]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-core-2.10.0.jar:2.10.0]

客户端报错日志:

代码语言:javascript
复制
[2021-11-25T18:13:23,792][ERROR][tcp-client-disco-msg-worker-#4-#42][TcpDiscoverySpi] Exception on joining: Read timed out
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_211]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_211]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_211]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_211]
        at java.net.SocketInputStream.read(SocketInputStream.java:224) ~[?:1.8.0_211]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.readReceipt(TcpDiscoverySpi.java:1904) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:806) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]
[18:13:23] Failed to connect to any address from IP finder (will retry to join topology every 2000 ms; change 'reconnectDelay' to configure the frequency of retries): [/192.168.139.128:47500]
[2021-11-25T18:13:27,042][ERROR][tcp-client-disco-msg-worker-#4-#42][TcpDiscoverySpi] Exception on joining: Failed to serialize object: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=045905d0-bcaa-4631-bd70-7cc1922a8422, consistentId=045905d0-bcaa-4631-bd70-7cc1922a8422, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.139.129], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.139.129:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1637835172514, loc=true, ver=2.10.0#20211125-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@29f3a5ed, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=8e765965d71-045905d0-bcaa-4631-bd70-7cc1922a8422, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
org.apache.ignite.IgniteCheckedException: Failed to serialize object: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=045905d0-bcaa-4631-bd70-7cc1922a8422, consistentId=045905d0-bcaa-4631-bd70-7cc1922a8422, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.139.129], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.139.129:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1637835172514, loc=true, ver=2.10.0#20211125-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@29f3a5ed, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=8e765965d71-045905d0-bcaa-4631-bd70-7cc1922a8422, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
        at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:102) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.marshal(AbstractNodeNameAwareMarshaller.java:68) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.util.IgniteUtils.marshal(IgniteUtils.java:10597) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1764) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1701) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:797) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]
Caused by: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_211]
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_211]
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_211]
        at org.apache.ignite.marshaller.jdk.JdkMarshallerOutputStreamWrapper.flush(JdkMarshallerOutputStreamWrapper.java:58) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1823) ~[?:1.8.0_211]
        at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) ~[?:1.8.0_211]
        at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:99) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        ... 12 more
        Suppressed: java.net.SocketException: Broken pipe (Write failed)
                at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
                at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
                at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_211]
                at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_211]
                at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_211]
                at org.apache.ignite.marshaller.jdk.JdkMarshallerOutputStreamWrapper.flush(JdkMarshallerOutputStreamWrapper.java:58) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1823) ~[?:1.8.0_211]
                at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) ~[?:1.8.0_211]
                at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) ~[?:1.8.0_211]
                at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:100) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.marshal(AbstractNodeNameAwareMarshaller.java:68) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.internal.util.IgniteUtils.marshal(IgniteUtils.java:10597) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1764) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1701) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:797) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]

Wireshark抓包分析

通过tcpdump分别对客户端,服务端进行抓包,来分析网络情况。 命令如下: tcpdump -i ens192 host 10.1.12.XXX and port 47500 -w server.pcap

47500 为Ignite客户端与服务器建立连接的端口号 ens192 指定网卡

服务端抓包如下:

在这里插入图片描述
在这里插入图片描述

TCP层,有个FLAGS字段,这个字段有以下几个标识:SYN, FIN, ACK, PSH, RST, URG。 SYN表示建立连接 FIN表示关闭连接 ACK表示响应 PSH表示有 DATA数据传输 RST表示连接重置

以下重点分析几处: 第10行结合源码分析,客户端发TcpDiscoveryJoinRequestMessage包到服务端,seq=440,ack=434,Len=14480

服务端顺利接收到,为13行,seq=434,ack=14920,Len=0

备注: 在已经建立好连接的TCP上(只考虑数据包和ack包),seq和ack的计算规则为 本次要发送的包的 seq = 上一个发送的包的seq + 上一个发送的包的长度(不含包头) 本次要发送的包的 ack = 上一个接收到的包的seq + 上一个接收到的包的长度(不含包头)

上面图中画圈的部分发现,第19行和第20行相差了10秒,而且客户端(10.48.131.13)发起[RST,ACK]重连。

接着重试后往下看:

在这里插入图片描述
在这里插入图片描述

在客户端重试的TCP包中,37行发现发送len为1的应答包(RES_OK),但是重试前服务端没有发送这个应答包,结合处理TcpDiscoveryJoinRequest源码看,代码为:

代码语言:javascript
复制
 private boolean processJoinRequestMessage(TcpDiscoveryJoinRequestMessage msg,
            @Nullable ClientMessageWorker clientMsgWrk) throws IOException {
....
 spi.writeToSocket(msg, sock, RES_OK, sockTimeout);
....
}

这里得到结论:服务端在第一次处理TcpDiscoveryJoinRequest包等待了很久,客户端socket到了超时时间10秒后,重连了,在重连后,后续都正常。

客户端10秒超时源码跟踪

ClientImpl 805行:

代码语言:javascript
复制
 return new T3<>(new SocketStream(sock),
     spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0)),
     res.clientAck());

timeoutHelper.nextTimeoutChunk继续往下跟IgniteSpiOperationTimeoutHelper :

代码语言:javascript
复制
    public long nextTimeoutChunk(long dfltTimeout) throws IgniteSpiOperationTimeoutException {
        long now = System.nanoTime();
        long left;
        // 因为timeoutEnabled为true,走到这一行
        if (timeoutEnabled)
            left = timeoutThreshold - now;
            ...

timeoutThreshold的值源于ClientImpl 722:

代码语言:javascript
复制
IgniteSpiOperationTimeoutHelper timeoutHelper = new IgniteSpiOperationTimeoutHelper(spi, true);  

再跟到IgniteSpiOperationTimeoutHelper 58行

代码语言:javascript
复制
long timeout = (lastRelatedOperationTime > 0 ? lastRelatedOperationTime : System.nanoTime()) +
    U.millisToNanos(srvOp ? adapter.failureDetectionTimeout() : adapter.clientFailureDetectionTimeout());

srvOp为true ,即System.nanoTime() + failureDetectionTimeout,failureDetectionTimeout默认10秒

找到源码后,客户端修改超时时间setFailureDetectionTimeout示例如下:

代码语言:javascript
复制
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);
        cfg.setFailureDetectionTimeout(30000L);

结合日志查看服务端卡点

结合日志来看,11:05:50,288 服务端收到Initialized connection with remote client node ,11:06:05,325后,服务端打印日志Message has been received,这里得到结论,服务端卡住15秒

代码语言:javascript
复制
2021-11-08 11:05:50,288 INFO  tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:478) - Initialized connection with remote client node [nodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, rmtAddr=/10.45.179.99:24449]
2021-11-08 11:06:05,325 DEBUG tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:465) - Message has been received: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, consistentId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, addrs=ArrayList [10.45.179.99], sockAddrs=HashSet [/10.45.179.99:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1636340750309, loc=false, ver=2.10.0#20211108-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@637f47b5, super=TcpDiscoveryAbstractMessage [sndNodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, id=3b2528dfc71-a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
2021-11-08 11:06:05,326 ERROR tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:498) - Caught exception on message read [sock=Socket[addr=/10.45.179.99,port=24449,localport=47500], locNodeId=faa30201-a16e-4035-bf56-b2901c6a1298, rmtNodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7]
java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0]
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:122) ~[?:1.8.0]
	at java.net.SocketOutputStream.write(SocketOutputStream.java:145) ~[?:1.8.0]
	at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1807) ~[ignite-core-2.10.0.jar:2.10.0]
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.processJoinRequestMessage(ServerImpl.java:7581) ~[ignite-core-2.10.0.jar:2.10.0]
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7092) [ignite-core-2.10.0.jar:2.10.0]
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-core-2.10.0.jar:2.10.0]

第二次故障定位

发现服务端有超时15秒的问题后,决定通过arthas继续跟踪生产哪个方法导致耗时这么久。

arthas定位

通过trace查看耗时方法

根据服务端报错信息发现org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body这个方法有问题

执行trace命令跟踪: trace org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader body ‘#cost>10000’

代码语言:javascript
复制
`---ts=2021-11-25 10:55:55;thread_name=tcp-disco-sock-reader-[]-#49-#468;id=1fa;is_daemon=false;priority=10;TCCL=sun.misc.Launcher$AppClassLoader@18b4aac2
    `---[35092.814909ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:body()
        +---[0.029574ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:getConfiguredNodeId() #6750
        +---[0.013376ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #6756
        +---[0.417309ms] org.apache.ignite.IgniteLogger:info() #6757
        +---[0.081638ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:configureSocketOptions() #6767
        +---[2.029919ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:readMessage() #6821
        +---[0.024223ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6863
        +---[0.006424ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:creatorNodeId() #6865
        +---[0.047115ms] org.apache.ignite.internal.util.typedef.internal.U:id8() #6869
        +---[0.005641ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6870
        +---[0.200221ms] org.apache.ignite.internal.util.typedef.internal.U:enhanceThreadName() #6869
        +---[0.008674ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode:internalOrder() #6873
        +---[0.012978ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:<init>() #6873
        +---[0.075802ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:allNodesSupport() #6875
        +---[0.060064ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:setDiscoveryDataPacketCompression() #6875
        +---[0.00566ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6877
        +---[0.013879ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:clientAck() #6878
        +---[0.005648ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #6941
        +---[0.026675ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #6943
        +---[0.049792ms] org.apache.ignite.IgniteLogger:debug() #6942
        +---[0.004571ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #6946
        +---[0.48207ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:writeToSocket() #6946
        +---[0.007472ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6959
        +---[0.030182ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$ClientMessageWorker:<init>() #6960
        +---[0.007268ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$3900() #6963
        +---[0.002032ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #6992
        +---[0.023339ms] org.apache.ignite.IgniteLogger:debug() #6993
        +---[0.002474ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #7001
        +---[0.010884ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #7003
        +---[0.021883ms] org.apache.ignite.IgniteLogger:info() #7002
        +---[0.005246ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #7060
        +---[0.007627ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:isInterrupted() #7062
        +---[0.010837ms] org.apache.ignite.internal.processors.security.SecurityUtils:serializeVersion() #7064
        +---[0.006864ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:marshaller() #7066
        +---[0.001182ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:ignite() #7067
        +---[0.005046ms] org.apache.ignite.Ignite:configuration() #7067
        +---[0.004567ms] org.apache.ignite.internal.util.typedef.internal.U:resolveClassLoader() #7067
        +---[35086.430093ms] org.apache.ignite.internal.util.typedef.internal.U:unmarshal() #7066
        +---[0.021627ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryAbstractMessage:senderNodeId() #7069
        +---[0.025284ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:messageLogger() #7071
        +---[0.026901ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoveryImpl$DebugLogger:isDebugEnabled() #7073
        +---[0.073315ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoveryImpl$DebugLogger:debug() #7074
        +---[0.042802ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryStatistics:onMessageReceived() #7076
        +---[0.010485ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryJoinRequestMessage:responded() #7091
        +---[0.272454ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:processJoinRequestMessage() #7092 [throws Exception]
        +---[0.002176ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #7331
        +---[0.61067ms] org.apache.ignite.internal.util.typedef.internal.U:error() #7332
        +---[0.002346ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:isInterrupted() #7335
        +---[0.008792ms] org.apache.ignite.internal.util.typedef.X:hasCause() #7339
        +---[0.017524ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$9200() #7340
        +---[0.025243ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:onException() #7346
        +---[0.013936ms] org.apache.ignite.internal.processors.security.SecurityUtils:restoreDefaultSerializeVersion() #7352
        +---[0.001579ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #7358
        +---[0.013629ms] org.apache.ignite.IgniteLogger:debug() #7359
        +---[0.004428ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$3900() #7362
        +---[0.002056ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$ClientMessageWorker:runner() #7364
        +---[0.004352ms] org.apache.ignite.internal.util.typedef.internal.U:interrupt() #7364
        +---[0.025334ms] org.apache.ignite.internal.util.typedef.internal.U:closeQuiet() #7367
        +---[0.00144ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #7369
        +---[0.012279ms] org.apache.ignite.IgniteLogger:info() #7370
        +---[0.039086ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:isLocalNodeCoordinator() #7373
        +---[0.0024ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$1500() #7373
        +---[0.002276ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing:hasRemoteServerNodes() #7373
        +---[0.005297ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$10200() #7374
        `---[0.006973ms] org.apache.ignite.internal.util.typedef.internal.U:enhanceThreadName() #7374

发现unmarshal() 这个方法最耗时,继续trace跟踪: trace --skipJDKMethod false org.apache.ignite.marshaller.jdk.JdkMarshaller unmarshal0 ‘#cost>10000’ 备注: --skipJDKMethod false 开启这个可以跟踪jdk方法,以及开启options unsafe true(系统级别的类(即java.*)默认不能进行增强,需要增强是请参考这里的unsafe开关,增强系统类时请谨慎操作)

代码语言:javascript
复制
`---ts=2021-11-25 10:31:36;thread_name=tcp-disco-sock-reader-[5f6cbd0a 192.168.139.129:35461 client]-#45-#298;id=150;is_daemon=false;priority=10;TCCL=sun.misc.Launcher$AppClassLoader@18b4aac2
    `---[35081.943106ms] org.apache.ignite.marshaller.jdk.JdkMarshaller:unmarshal0()
        +---[0.001963ms] org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper:<init>() #122
        +---[50.965952ms] org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream:<init>() #122
        +---[35030.958673ms] java.io.ObjectInputStream:readObject() #124
        `---[0.002752ms] java.io.ObjectInputStream:close() #125

这里通过trace和结合查看源码,大体明白服务节点卡在反序列化代码

通过thread查看线程

本次也可以通过日志"ERROR tcp-disco-sock-reader-"得知发生报错的线程,从而通过arthas的thread查看tcp-disco-sock-reader线程卡在什么位置: 执行命令:thread -all | grep tcp-disco-sock-reader- 得到结果:

代码语言:javascript
复制
2362   tcp-disco-sock-reader-[0e1a224e 192.168.1 main                 10            RUNNABLE     0.0           0.000         0:0.017       false         false   

根据线程ID继续查此线程卡在什么位置: 执行命令:thread 2362 得到结果:

代码语言:javascript
复制
"tcp-disco-sock-reader-[0e1a224e 192.168.139.129:39423 client]-#180-#2312" Id=2362 RUNNABLE (in native)
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at java.net.InetAddress.getByName(InetAddress.java:1077)
    at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
    at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
    at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
    at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
    at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
    at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

现在查到这里,已经确认hostname导致本次问题

jstack

jstack也可以导出卡住线程的堆栈信息,命令如下: jstack -l PID >> output.log ,PID可以通过JPS得到

得到结果后,过滤出tcp-disco-sock-reader线程信息

代码语言:javascript
复制
"tcp-disco-sock-reader-[1d3a8137 192.168.139.129:52379 client]-#9-#70" #95 prio=10 os_prio=0 tid=0x0000000000a20000 nid=0x251b runnable [0x00007f71d0c65000]
   java.lang.Thread.State: RUNNABLE
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
	at java.net.InetAddress.getByName(InetAddress.java:1077)
	at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
	at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
	at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

   Locked ownable synchronizers:
	- None

kill -3 pid

由于power架构下 arthas或者jstack 不能使用,所以也可以通过kill -3 加上pid 导出卡住线程堆栈 命令如下: kill -3 pid ,pid通过jps可以查到 结果会输出到控制台,过滤出tcp-disco-sock-reader线程信息

代码语言:javascript
复制
"tcp-disco-sock-reader-[4eeee5c0 192.168.139.129:36905 client]-#4-#45" #65 prio=10 os_prio=0 tid=0x00000000010d6000 nid=0x2444 runnable [0x00007f20b869e000]
   java.lang.Thread.State: RUNNABLE
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
	at java.net.InetAddress.getByName(InetAddress.java:1077)
	at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
	at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
	at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

最终问题定位

由上面分析,已经分析到是因为hostname导致本地故障,对照生产配置发现: power服务节点在/etc/hosts 配置了power客户端的hostname,但是没有配置x86客户端节点的hostname, 而x86服务节点服务端都配置了x86和power的客户端的hostname,所以会导致x86 客户端连power架构的服务节点时,出现客户端节点连不上问题。而x86 或者power客户端连x86节点是没有问题的。

对于这个问题,有个前提条件:客户端节点在自己这台的/etc/hosts 配置上了的hostname,如果没配置,是不会引发问题。

至于和DNS的关系,原因如下: 导致问题的代码为InetSocketAddress(String hostname, int port),InetSocketAddress会自动去解析hostname,也就是DNS解析,获取到IP,然后连接;

之前生产环境为什么没发现问题,是因为dns没有开启,本次由于其他业务上线,开启dns

相关配置如下: 原来没开启DNS: cat /etc/nsswitch.conf hosts: files 本次上线,维护在files后加上dns就会开启DNS,导致问题产生 hosts: files dns

15秒配置如下: 位于/etc/resolv.conf options timeout:5 attempts:3 rotate 超时时间5秒,重试次数为3

总结:这次引起bug的原因有2个,一是power服务端节点遗漏配置x86客户端的hostname,二是维护开启DNS,导致DNS解析hostname找不到,卡住15秒

结合ignite源代码回顾

客户端收集本地信息

首先客户端在启动加入节点前,会收集本地信息,这时候就会带上hostname,源码跟踪如下:

在这里插入图片描述
在这里插入图片描述

在IgniteUtils类里,

代码语言:javascript
复制
    public static IgniteBiTuple<Collection<String>, Collection<String>> resolveLocalAddresses(InetAddress locAddr,
        boolean allHostNames) throws IOException {
...
 for (InetAddress addr : locAddrs)
                    addresses(addr, addrs, hostNames, allHostNames);
...
}

在addresses(addr, addrs, hostNames, allHostNames)方法里,获得hostname 即InetAddress getHostName方法

代码语言:javascript
复制
    private static void addresses(InetAddress addr, Collection<String> addrs, Collection<String> hostNames,
        boolean allHostNames) {
...
        String hostName = addr.getHostName();
        String ipAddr = addr.getHostAddress();
...

如果客户端节点在自己这台的/etc/hosts 配置上了自己的hostname,这时代码就会把hostname存在加入节点里(locNode)。

最后,ClientImpl 把要加入的客户端节点(包含上面的hostname),包装为TcpDiscoveryJoinRequestMessage,传给服务端

代码语言:javascript
复制
 ...
 TcpDiscoveryNode node = locNode;
 ...
 TcpDiscoveryJoinRequestMessage joinReqMsg = new TcpDiscoveryJoinRequestMessage(node, discoveryData);
 ...
 msg = joinReqMsg;
 ...
 spi.writeToSocket(sock, msg, timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout()));

服务端反序列化解析hostname

客户端把TcpDiscoveryJoinRequestMessage发给服务端,服务端进行反序列化,代码为ServerImpl 7066行,

代码语言:javascript
复制
  TcpDiscoveryAbstractMessage msg = U.unmarshal(spi.marshaller(), in,
      U.resolveClassLoader(spi.ignite().configuration()));

U.unmarshal底层调用ObjectInputStream的readObject方法,ObjectInputStream 里会调用readExternal接口,ignite里TcpDiscoveryJoinRequestMessage 重写了readExternal,在重写的readExternal里会初始化InetSocketAddress。

TcpDiscoveryJoinRequestMessage 主要包含TcpDiscoveryNode和DiscoveryDataPacket

在这里插入图片描述
在这里插入图片描述

TcpDiscoveryNode里包含writeExternal和readExternal

查看readExternal源码,包含初始化InetSocketAddress:

代码语言:javascript
复制
    @Override public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    ...
    sockAddrs = U.toSocketAddresses(this, discPort);
    ...

继续跟入U.toSocketAddresses源码,找到InetSocketAddress:

代码语言:javascript
复制
    public static Collection<InetSocketAddress> toSocketAddresses(Collection<String> addrs,
        Collection<String> hostNames, int port) {
        Set<InetSocketAddress> res = new HashSet<>(addrs.size());

        Iterator<String> hostNamesIt = hostNames.iterator();

        for (String addr : addrs) {
            String hostName = hostNamesIt.hasNext() ? hostNamesIt.next() : null;

            if (!F.isEmpty(hostName)) {
                InetSocketAddress inetSockAddr = new InetSocketAddress(hostName, port);
                ...
本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2021-11-29 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 通过Wireshark和arthas排查由DNS引发的Ignite生产故障案例
  • 故障背景
  • 故障分析
  • 第一次定位问题
    • 跨架构假设
      • 生产重现故障
        • Wireshark抓包分析
          • 客户端10秒超时源码跟踪
            • 结合日志查看服务端卡点
            • 第二次故障定位
              • arthas定位
                • 通过trace查看耗时方法
                • 通过thread查看线程
              • jstack
                • kill -3 pid
                • 最终问题定位
                • 结合ignite源代码回顾
                  • 客户端收集本地信息
                    • 服务端反序列化解析hostname
                    领券
                    问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档