A Record of a Redis Cluster Failure - Ops Notes

One of the company's business systems reported a failure, with errors indicating that Redis queries were failing. The relevant logs were pulled, and the analysis process is recorded below:

Redis Cluster layout (3 masters and 3 slaves, each node holding about 6 GB of data, all node machines on the same internal network):
10.0.22.207:7001(971d05cd7b9bb3634ad024e6aac3dff158c52eee) master
10.0.22.208:7003(e7592314869c29375599d781721ad76675645c4c) slave

10.0.22.209:7005(52b8d27838244657d9b01a233578f24d287979fe) master
10.0.22.209:7002(213bde6296c36b5f31b958c7730ff1629125a204) slave

10.0.22.208:7000(e936d5b4c95b6cae57f994e95805aef87ea4a7a5) master
10.0.22.207:7004(0060012d749167d3f72833d916e53b3445b66c62) slave
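
For reference, the topology and slot ownership above can be confirmed from any node with redis-cli; these are just illustrative checks and were not part of the original troubleshooting:

redis-cli -h 10.0.22.207 -p 7001 cluster nodes   # node IDs, roles, slot ranges
redis-cli -h 10.0.22.207 -p 7001 cluster info    # cluster_state, known nodes, current epoch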

The log investigation went as follows:
Analysis 1:
Master 7001 loses its connection to slave 7003:
46590:M 24 Oct 21:57:51.379 # Connection with slave 10.0.22.208:7003 lost.

Analysis 2:
Masters 7000/7005 mark node 7001 as failing (quorum reached):
42645:M 24 Oct 21:57:50.117 * Marking node 971d05cd7b9bb3634ad024e6aac3dff158c52eee as failing (quorum reached).

Analysis 3:
Slaves 7002/7003/7004 receive a FAIL message from master 7005 about node 7001:
46986:S 24 Oct 21:57:50.120 * FAIL message received from 52b8d27838244657d9b01a233578f24d287979fe about 971d05cd7b9bb3634ad024e6aac3dff158c52eee

Analysis 4:
Masters 7000/7005 grant failover authorization to 7003, allowing it to be promoted to master:
42645:M 24 Oct 21:57:51.055 # Failover auth granted to e7592314869c29375599d781721ad76675645c4c for epoch 16

Analysis 5:
The old master 7001 changes its own configuration and becomes a slave of 7003:
46590:M 24 Oct 21:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of e7592314869c29375599d781721ad76675645c4c

Analysis 6:
Masters 7000/7005/7003 clear the FAIL state for node 7001 (a master without slots that is reachable again):
42645:M 24 Oct 21:57:51.522 * Clear FAIL state for node 971d05cd7b9bb3634ad024e6aac3dff158c52eee: master without slots is reachable again.

Analysis 7:
The new slave 7001 starts its first full resync from the new master 7003:
7003 log:
4255:M 24 Oct 21:57:51.906 * Full resync requested by slave 10.0.22.207:7001
4255:M 24 Oct 21:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 24 Oct 21:57:51.941 * Background saving started by pid 5230

7001 log:
46590:S 24 Oct 21:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993
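
While a full resync like this is running, its progress can be watched from the replica side; this is a hedged, illustrative check (standard INFO replication fields, not output captured during this incident):

redis-cli -h 10.0.22.207 -p 7001 info replication   # master_link_status, master_sync_in_progress, master_sync_left_bytes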

Analysis 8:
Masters 7000/7005 mark node 7003 (the new master) as failing:
42645:M 24 Oct 21:58:00.320 * Marking node e7592314869c29375599d781721ad76675645c4c as failing (quorum reached).

Analysis 9:
Masters 7000/7005 determine that 7003 (the new master) is reachable again:
60295:M 24 Oct 21:58:18.181 * Clear FAIL state for node e7592314869c29375599d781721ad76675645c4c: is reachable again and nobody is serving its slots after some time.

Analysis 10:
Master 7003 completes the BGSAVE needed for the full resync:
5230:C 24 Oct 21:59:01.474 * DB saved on disk
5230:C 24 Oct 21:59:01.491 * RDB: 7112 MB of memory used by copy-on-write
4255:M 24 Oct 21:59:01.877 * Background saving terminated with success
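
The 7112 MB copy-on-write figure is memory duplicated while the BGSAVE child process was running, which is part of why a full resync on a heavily written 6 GB node is so expensive. A hedged way to keep an eye on fork/BGSAVE cost (illustrative commands, not from the incident):

redis-cli -h 10.0.22.208 -p 7003 info persistence   # rdb_bgsave_in_progress, rdb_last_bgsave_time_sec
redis-cli -h 10.0.22.208 -p 7003 info stats         # latest_fork_usec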

Analysis 11:
Slave 7001 starts receiving data from master 7003:
46590:S 24 Oct 21:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master

Analysis 12:
Master 7003 detects that slave 7001's connection has exceeded the output buffer limit and closes it:
4255:M 24 Oct 22:00:19.014 # Client id=14259015 addr=10.0.22.207:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 24 Oct 22:00:19.015 # Connection with slave 10.0.22.207:7001 lost.
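
The disconnect above is the master enforcing the slave class of client-output-buffer-limit: while the replica is still receiving the RDB file, new writes are buffered for it on the master, and once that buffer passes the hard limit (or stays above the soft limit for too long) the connection is closed. A hedged example of checking and raising the limit; the values are only illustrative and must be sized to the write traffic expected during a full resync:

redis-cli -h 10.0.22.208 -p 7003 config get client-output-buffer-limit
# default is "slave 256mb 64mb 60"; illustrative larger limit:
redis-cli -h 10.0.22.208 -p 7003 config set client-output-buffer-limit "slave 512mb 256mb 120"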

Analysis 13:
Slave 7001's sync from master 7003 fails because the connection was lost; the first full resync has failed:
46590:S 24 Oct 22:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 24 Oct 22:00:20.102 * Connecting to MASTER 10.0.22.208:7003
46590:S 24 Oct 22:00:20.102 * MASTER <-> SLAVE sync started

Analysis 14:
Slave 7001 retries the sync, but the connection fails: master 7003 has hit its maximum number of client connections:
46590:S 24 Oct 22:00:21.103 * Connecting to MASTER 10.0.22.208:7003
46590:S 24 Oct 22:00:21.103 * MASTER <-> SLAVE sync started
46590:S 24 Oct 22:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 24 Oct 22:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'
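
The '-ERR max number of clients reached' reply means master 7003 had exhausted its maxclients budget (10,000 by default) because of the reconnection storm. Illustrative checks, not output from the incident:

redis-cli -h 10.0.22.208 -p 7003 config get maxclients   # configured connection ceiling
redis-cli -h 10.0.22.208 -p 7003 info clients            # connected_clients, blocked_clients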

Analysis 15:
Slave 7001 reconnects to master 7003 and the second full resync begins:
7001 log:
46590:S 24 Oct 22:00:49.175 * Connecting to MASTER 10.0.22.208:7003
46590:S 24 Oct 22:00:49.175 * MASTER <-> SLAVE sync started
46590:S 24 Oct 22:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 24 Oct 22:00:49.176 * Master replied to PING, replication can continue...
46590:S 24 Oct 22:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 24 Oct 22:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454

7003 log:
4255:M 24 Oct 22:00:49.176 * Slave 10.0.22.207:7001 asks for synchronization
4255:M 24 Oct 22:00:49.176 * Full resync requested by slave 10.0.22.207:7001
4255:M 24 Oct 22:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 24 Oct 22:00:49.498 * Background saving started by pid 18413
18413:C 24 Oct 22:01:52.466 * DB saved on disk
18413:C 24 Oct 22:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 24 Oct 22:01:53.186 * Background saving terminated with success

Analysis 16:
Slave 7001 receives the sync data successfully and starts loading it into memory:
46590:S 24 Oct 22:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
46590:S 24 Oct 22:04:51.485 * MASTER <-> SLAVE sync: Flushing old data
46590:S 24 Oct 22:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory

Analysis 17:
The cluster returns to normal:
42645:M 24 Oct 22:05:58.786 * Clear FAIL state for node 971d05cd7b9bb3634ad024e6aac3dff158c52eee: slave is reachable again.

Analysis 18:
Slave 7001 completes the sync successfully, taking about 7 minutes:
46590:S 24 Oct 22:08:19.303 * MASTER <-> SLAVE sync: Finished with success


Analysis of why 7001 was marked as failing:
Since the node machines are all on the same internal network, an actual network outage is unlikely.
We therefore checked the slow query log with the SLOWLOG GET command and found that a KEYS command had been executed, taking 8.3 seconds.
We then checked the cluster node timeout setting and found it was only 5 s (cluster-node-timeout 5000).
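
For completeness, the checks described above look roughly like this with redis-cli (illustrative commands; the host/port is node 7001 from the topology above):

redis-cli -h 10.0.22.207 -p 7001 slowlog get 10                      # the 10 most recent slow log entries
redis-cli -h 10.0.22.207 -p 7001 config get slowlog-log-slower-than  # slow log threshold in microseconds
redis-cli -h 10.0.22.207 -p 7001 config get cluster-node-timeout     # node timeout in milliseconds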


Why the node was marked as failing:
A client executed a single command that took 8.3 s:
2018/10/24 21:57:43 the KEYS command starts executing
2018/10/24 21:57:50 7001 is judged to have failed (Redis log)
2018/10/24 21:57:51 the KEYS command finishes executing

To sum up, there were several problems:
1) Because cluster-node-timeout was set quite short, the slow KEYS query caused the cluster to judge node 7001 as failing.
2) Because 7001 was considered down, 7003 was promoted to master and a master-slave sync began.
3) Because of the client-output-buffer-limit setting, the first full resync failed.
4) On top of that, the PHP client's connection pool misbehaved and reconnected to the server aggressively, producing an effect similar to a SYN flood.
5) After the first full resync failed, it took the slave 30 seconds to reconnect to the master (the maximum of 10,000 client connections had been exceeded).

Measures taken:
1) Split instances so that each one stays below 4 GB; otherwise a master-slave switchover takes a very long time.
2) Adjust the client-output-buffer-limit parameter so that a full sync does not fail halfway through (see the example configuration after this list).
3) Adjust cluster-node-timeout; it should not be lower than 15 s.
4) Forbid any slow query that takes longer than cluster-node-timeout, because it will trigger a master-slave switchover.
5) Fix the client's SYN-flood-like aggressive reconnection behavior.
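
A sketch of the configuration side of these measures; the exact numbers are assumptions and should be tuned to the actual data size and write rate, then applied consistently on every node (via redis.conf plus CONFIG SET, or a restart):

# redis.conf (illustrative values)
cluster-node-timeout 15000
client-output-buffer-limit slave 512mb 256mb 120
# on the application side, iterate keys with SCAN instead of KEYS, e.g.:
# redis-cli --scan --pattern 'some:prefix:*'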
