
MongoDB 4.4: Bugs Related to Read/Write Splitting and Replica Sets

徐靖
Published 2022-09-22 11:45:02
Collected in the column: DB说

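【Background】

A MongoDB 4.4.4 cluster had been running stably for nearly six months. Because of an operating-system security vulnerability, the OS had to be upgraded, which meant shutting down the MongoDB instances, upgrading the system, and rebooting the servers. When the instance being shut down is a primary, a primary/secondary switchover is all that is needed (run rs.stepDown() or change the member priorities). This should have been simple (we had done it many times on versions before 4.4), but we ran into two bugs. The first affects read/write splitting on sharded clusters. The second caused every instance in the replica set to crash during a switchover (this was unexpected, and it is not triggered every time). To get the fixes for both bugs, use MongoDB 4.4.7 or later. If you do not use read/write splitting, 4.4.6 is recommended (4.4.5 is not recommended).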

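Read/write splitting bug: verified as fixed after upgrading to 4.4.8

【Trigger conditions】

  • A MongoDB 4.4.0-4.4.6 sharded cluster
  • The URI uses "maxStalenessSeconds=xxx" and "readPreference=secondary/secondaryPreferred/nearest"
  • The application's query reaches shard X (whether it is broadcast or targeted at a single shard)
  • A secondary in shard X goes down

When read/write splitting is in use and all of the conditions above are met, queries fail with: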

MongoError: Encountered non-retryable error during query :: caused by :: Incompatible wire version
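
The trigger conditions above can be turned into a quick pre-check. The following is a minimal sketch, assuming plain numeric version strings such as "4.4.4"; the helper names `parse_version` and `is_affected` are hypothetical and not part of any driver. It only inspects the server version and the URI options, so it cannot tell whether the deployment is actually sharded.

```python
from urllib.parse import parse_qs

# Read preference modes that can route reads to secondaries (from the list above).
SECONDARY_MODES = {"secondary", "secondarypreferred", "nearest"}


def parse_version(version):
    """Turn a plain version string such as '4.4.4' into a tuple of ints: (4, 4, 4)."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_affected(server_version, uri):
    """Return True if the version and the URI options match the trigger conditions."""
    if not (4, 4, 0) <= parse_version(server_version) <= (4, 4, 6):
        return False
    # Everything after the first '?' is the option string of the URI.
    query = uri.partition("?")[2]
    options = {key.lower(): values[-1] for key, values in parse_qs(query).items()}
    mode = options.get("readpreference", "primary").lower()
    staleness = options.get("maxstalenessseconds")
    # Assumption: -1 means "no maximum staleness", the same as leaving it unset.
    return mode in SECONDARY_MODES and staleness not in (None, "-1")
```

For example, `is_affected("4.4.4", uri)` is true for a URI that sets both `readPreference=secondaryPreferred` and `maxStalenessSeconds=120`, and false once the version is 4.4.7.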

【Fix versions】

https://jira.mongodb.org/browse/SERVER-57136

Fix Version/s: 5.1.0, 4.4.7, 5.0.0-rc1

【Application connection string】

mongodb://admin:***@mongoprd1.com:31051,mongoprd2.com:31051,mongoprd3.com:31051/xiaoxu?&readPreference=secondaryPreferred&maxStalenessSeconds=120

【Application error】

Incompatible wire version' on server mongoprd1.com:31051. The full response is {"ok": 0.0, "errmsg": "Incompatible wire version", "code": 188, "codeName": "IncompatibleServerVersion", "operationTime": {"timestamp": {"t": 1628055478, "i": 7}}, "signature": {"hash": {"timestamp": {"t": 1628055478, "i": 7}}, "timestamp": {"t": 1628055478, "i": 7}}, "signature": {"hash": {"

【mongos log】(no similar error was found in the shard logs)

{"t":{"$date":"2021-08-04T14:14:31.992+08:00"},"s":"I", "c":"QUERY", "id":4625501, "ctx":"conn535","msg":"Unable to establish remote cursors","attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion","errmsg":"Incompatible wire version"},"nRemotes":2}}

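【Reproducing the error with a Python program】

【Python program】

The mongos URL must include both readPreference=secondaryPreferred and maxStalenessSeconds; otherwise the error does not occur. On a non-sharded cluster there is no error either.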

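from pymongo import MongoClient
import pprint

# Connect through mongos with a secondary read preference and maxStalenessSeconds;
# both options are needed to trigger the bug.
client = MongoClient(
    'mongodb://admin:admin@mongodbtest.com:41051/'
    '?readPreference=secondaryPreferred&maxStalenessSeconds=90'
)
db = client.xiaoxu
coll = db.xiaoxu

# Insert a document, then read it back; the read may be routed to a secondary.
for i in range(100000):
    doc = {'no': 100 + i}
    pprint.pprint(coll.insert_one(doc))
    pprint.pprint(coll.find_one(doc))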

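【Check which shard is the primary shard for the xiaoxu database】

Note: the output below shows that the primary shard of the xiaoxu database is shard2. We need this in order to simulate the impact of a secondary going down in that shard. At this point a node going down in shard1 has no effect. For a sharded collection, where queries are broadcast, an instance going down in any shard has an impact.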

mongos> sh.status();
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("5fc608cefcbcbec36d4f785d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/mongodbtest1.com:27017,mongodbtest2.com:27017,mongodbtest3.com:27017",  "state" : 1 }
        {  "_id" : "shard2",  "host" : "shard2/mongodbtest1.com:27018,mongodbtest2.com:27018,mongodbtest3.com:27018",  "state" : 1 }
  active mongoses:
        "4.4.4" : 3
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Migration Results for the last 24 hours:
                No recent migrations
  databases:
        ......
        {  "_id" : "xiaoxu",  "primary" : "shard2",  "partitioned" : false,  "version" : {  "uuid" : UUID("6593e368-e82a-4a8c-a184-ccf57b1773e9"),  "lastMod" : 1 } }
mongos>

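【Simulate any secondary in shard2 going down (an abnormal crash or a clean shutdown both work)】

Note: if it is a secondary in shard1 that goes down instead, queries are unaffected.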

mongod -f /data/mongodb/mongodb44/mongod27018/conf/shard2.conf --shutdown

【Log in to any remaining node to verify】

mongo 127.0.0.1:27018/admin -uadmin --eval "rs.status()" |egrep "name|stateStr"
Enter password:
"name" : "mongodbtest1.com:27018",
"stateStr" : "(not reachable/healthy)",
"name" : "mongodbtest2.com:27018",
"stateStr" : "SECONDARY",
"name" : "mongodbtest3.com:27018",
"stateStr" : "PRIMARY",
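
The filtered rs.status() output above can also be checked programmatically. The following is a minimal sketch that pairs each "name" line with the "stateStr" line after it; the helper names `parse_members` and `unhealthy_members` are hypothetical, and treating PRIMARY, SECONDARY and ARBITER as the only healthy states is an assumption.

```python
import re

# The filtered rs.status() output shown above.
SAMPLE_OUTPUT = '''
"name" : "mongodbtest1.com:27018",
"stateStr" : "(not reachable/healthy)",
"name" : "mongodbtest2.com:27018",
"stateStr" : "SECONDARY",
"name" : "mongodbtest3.com:27018",
"stateStr" : "PRIMARY",
'''

# Assumption: any state outside this set counts as unhealthy.
HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def parse_members(text):
    """Map each member name to its state, relying on name/stateStr alternating."""
    names = re.findall(r'"name"\s*:\s*"([^"]+)"', text)
    states = re.findall(r'"stateStr"\s*:\s*"([^"]+)"', text)
    return dict(zip(names, states))


def unhealthy_members(members):
    """Return the names of members whose state is not in HEALTHY_STATES."""
    return [name for name, state in members.items() if state not in HEALTHY_STATES]


members = parse_members(SAMPLE_OUTPUT)
print(unhealthy_members(members))
```

With the sample output, the only member reported is mongodbtest1.com:27018, the secondary that was shut down.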

【Exception raised by the Python program】

pymongo.errors.OperationFailure: Encountered non-retryable error during query :: caused by :: Incompatible wire version, full error: {'ok': 0.0, 'errmsg': 'Encountered non-retryable error during query :: caused by :: Incompatible wire version', 'code': 188, 'codeName': 'IncompatibleServerVersion', 'operationTime': Timestamp(1628219161, 4), '$clusterTime': {'clusterTime': Timestamp(1628219162, 8), 'signature': {'hash': b'-D\xe2~\xe3\x14;\xe6bqa>\x14\xad\xf30<\xf7U\xd0', 'keyId': 6937532038958284801}}}

【Check the mongos and shard logs: mongos logs the error, the shards do not】

{"t":{"$date":"2021-08-06T11:06:02.048+08:00"},"s":"I", "c":"QUERY", "id":4625501, "ctx":"conn564","msg":"Unable to establish remote cursors","attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion","errmsg":"Incompatible wire version"},"nRemotes":0}}
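
Because the mongos log is structured JSON, entries like the one above can be picked out with a short script. The following is a minimal sketch using only the fields visible in that entry (`attr.error.code` equal to 188); the function name `is_wire_version_error` is hypothetical.

```python
import json

WIRE_VERSION_ERROR_CODE = 188  # codeName "IncompatibleServerVersion" in the log above

# The mongos log entry shown above, as a single line.
SAMPLE_LINE = r'{"t":{"$date":"2021-08-06T11:06:02.048+08:00"},"s":"I", "c":"QUERY", "id":4625501, "ctx":"conn564","msg":"Unable to establish remote cursors","attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion","errmsg":"Incompatible wire version"},"nRemotes":0}}'


def is_wire_version_error(log_line):
    """Return True if a JSON log line carries an error with code 188."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # truncated or non-JSON line
    if not isinstance(entry, dict):
        return False
    attr = entry.get("attr")
    error = attr.get("error") if isinstance(attr, dict) else None
    return isinstance(error, dict) and error.get("code") == WIRE_VERSION_ERROR_CODE


print(is_wire_version_error(SAMPLE_LINE))
```

Running the same check over the shard logs should find nothing, which matches the observation that only mongos reports the error.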

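【Upgrade the cluster to 4.4.8】

After the upgrade, a secondary going down no longer affects front-end queries. The new version fixes the bug with the change "Skip maxStaleness wire version check when server is down". If you cannot upgrade, you can work around the problem by turning off read/write splitting.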
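
If upgrading is not possible, the workaround is to stop using read/write splitting. The following is a minimal sketch that removes the read preference options from a connection string so that reads fall back to the driver default (the primary); the function name `disable_read_splitting` is hypothetical, and also dropping `readPreferenceTags` is an assumption about what else should go.

```python
# URI options to remove, compared case-insensitively.
READ_SPLIT_OPTIONS = {"readpreference", "maxstalenessseconds", "readpreferencetags"}


def disable_read_splitting(uri):
    """Return the URI without read preference options; other options are kept as-is."""
    base, _, query = uri.partition("?")
    kept = [
        pair for pair in query.split("&")
        if pair and pair.partition("=")[0].lower() not in READ_SPLIT_OPTIONS
    ]
    # Re-attach the remaining options only if there are any.
    return base + "?" + "&".join(kept) if kept else base


uri = (
    "mongodb://admin:***@mongoprd1.com:31051,mongoprd2.com:31051,"
    "mongoprd3.com:31051/xiaoxu?&readPreference=secondaryPreferred&maxStalenessSeconds=120"
)
print(disable_read_splitting(uri))
```

The option pairs are filtered as raw text rather than decoded and re-encoded, so any unrelated options keep their original spelling.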

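Bug: all instances crash during a primary/secondary switchover

【Trigger conditions】

After rs.stepDown() was run on the primary, and after a new primary had been elected and was accepting writes, every member of the replica set crashed (we were not able to reproduce this). According to the Jira ticket, a change in replica set state can trigger the bug: for example adding a member, setting the compatibility version when upgrading to 4.4, stepping down the primary, or a network partition can all produce the "Invariant failure" error.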

【Original primary】

The "replSetStepDown command completed" log entry shows that the step-down had already finished before the node crashed.

{"t":{"

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"I", "c":"REPL", "id":2903000, "ctx":"conn208969","msg":"Restarting heartbeats after learning of a new primary","attr":{"myPrimaryId":"none","senderAndPrimaryId":4,"senderTerm":3}}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}

{"t":{"$date":"2021-07-28T15:39:49.071+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

【New primary】

The "Transition to primary complete; database writes are now permitted" log entry shows that the new primary had been elected and was accepting writes before it crashed.

{"t":{"$date":"2021-07-28T15:39:49.069+08:00"},"s":"I", "c":"STORAGE", "id":20657, "ctx":"OplogApplier-0","msg":"IndexBuildsCoordinator::onStepUp - this node is stepping up to primary"}

{"t":{"$date":"2021-07-28T15:39:49.069+08:00"},"s":"I", "c":"REPL", "id":21331, "ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted"}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"waitForMajority","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"waitForMajority","msg":"\n\n***aborting after invariant() failure\n\n"}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"waitForMajority","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

{"t":{"$date":"2021-07-28T15:39:49.071+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:39:49.072+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}

【Third node】

{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}

{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
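
To confirm the same crash on several nodes, the fatal entries can be pulled out of each mongod log. The following is a minimal sketch that keeps only entries with severity "F", as seen in the logs above; the function name `fatal_entries` is hypothetical, and the sample reuses lines quoted in this article.

```python
import json

# A few log lines quoted above: one fatal, one informational, one truncated.
SAMPLE_LOG = [
    r'{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}',
    r'{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"I", "c":"REPL", "id":2903000, "ctx":"conn208969","msg":"Restarting heartbeats after learning of a new primary","attr":{"myPrimaryId":"none","senderAndPrimaryId":4,"senderTerm":3}}',
    r'{"t":{"',
]


def fatal_entries(log_lines):
    """Yield (timestamp, ctx, msg) for every JSON log entry with severity "F"."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if isinstance(entry, dict) and entry.get("s") == "F":
            timestamp = entry.get("t", {}).get("$date")
            yield timestamp, entry.get("ctx"), entry.get("msg")


for timestamp, ctx, msg in fatal_entries(SAMPLE_LOG):
    print(timestamp, ctx, msg)
```

In practice the function would be fed the lines of each node's log file, for example `fatal_entries(open(path))`, where `path` is wherever that node writes its log.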

【Corresponding bug and affected versions: fixed by upgrading to 4.4.5, but 4.4.5 itself is not recommended】

https://jira.mongodb.org/browse/SERVER-53566

MongoDB version 4.4.5 is not recommended for production use due to a critical issue, WT-7426. The issue is fixed in version 4.4.6.

Originally published 2021-08-09 on the DB说 WeChat official account.
