[Mongo] Handling the failure of one shard in a MongoDB sharded cluster

[toc]

Scenario overview

| IP | Port | Role | Port | Role | Port | Role | Port | Role |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.59.140 | 27000 | mongos | 27100 | config | 27101 | shard1-primary | 27102 | shard2-secondary |
| 192.168.59.141 | 27000 | mongos | 27100 | config | 27101 | shard1-secondary | 27102 | shard2-primary |
| 192.168.59.142 | 27000 | mongos | 27100 | config | 27101 | shard1-arbiter | 27102 | shard2-arbiter |

| Scenario | Failure | Impact |
| --- | --- | --- |
| Scenario 1 | shard2 secondary node down | No business impact |
| Scenario 2 | shard2 primary node down | No business impact |
| Scenario 3 | shard2 arbiter node down | No business impact |
| Scenario 4 | Two shard1 nodes down, one of them the arbiter | Business impact: the cluster as a whole can no longer serve reads or writes |
| Scenario 5 | Two shard1 nodes down: primary + secondary | Business impact: the cluster as a whole can no longer serve reads or writes |

Recovery procedures

Scenario 1: shard2 secondary node failure

No business impact. Recovery procedure:

1. Deploy a new mongod instance
(1) On the new instance's first startup, comment out the following settings in its config file:
#security:
#  keyFile: /data/mongodb/auth/keyfile.key
#  authorization: enabled
 
#replication:
#  oplogSizeMB: 512
#  replSetName: shard2
#sharding:
#  clusterRole: shardsvr

(2) Create the accounts
After the instance starts, create:
admin account: chjroot
monitoring account: monitor

(3) Uncomment those settings and restart the new instance
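
A minimal sketch of step (2); the exact roles are assumptions (the original does not state them), using root for the admin account and clusterMonitor for the monitoring account:

// Connect to the new instance while auth is still commented out, e.g.: mongo --port 27103
use admin
db.createUser({ user: "chjroot", pwd: "CHANGE_ME", roles: [{ role: "root", db: "admin" }] })            // admin account; role assumed
db.createUser({ user: "monitor", pwd: "CHANGE_ME", roles: [{ role: "clusterMonitor", db: "admin" }] })  // monitoring account; role assumed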
2. Remove the failed node from the replica set
shard2:PRIMARY> rs.remove("192.168.59.140:27102")
{
    "ok" : 1,
    "operationTime" : Timestamp(1588216157, 1),
    "$gleStats" : {
        "lastOpTime" : {
            "ts" : Timestamp(1588216157, 1),
            "t" : NumberLong(2)
        },
        "electionId" : ObjectId("7fffffff0000000000000002")
    },
    "lastCommittedOpTime" : Timestamp(1588216156, 1),
    "$configServerState" : {
        "opTime" : {
            "ts" : Timestamp(1588216153, 2),
            "t" : NumberLong(1)
        }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588216157, 1),
        "signature" : {
            "hash" : BinData(0,"Z+IrFefwfA638bKEBEqp6mJVEnc="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}
3. Add a new data node to shard2
shard2:PRIMARY> rs.add("192.168.59.142:27103")
{
    "ok" : 1,
    "operationTime" : Timestamp(1588216039, 1),
    "$gleStats" : {
        "lastOpTime" : {
            "ts" : Timestamp(1588216039, 1),
            "t" : NumberLong(2)
        },
        "electionId" : ObjectId("7fffffff0000000000000002")
    },
    "lastCommittedOpTime" : Timestamp(1588215749, 4),
    "$configServerState" : {
        "opTime" : {
            "ts" : Timestamp(1588216037, 2),
            "t" : NumberLong(1)
        }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588216039, 1),
        "signature" : {
            "hash" : BinData(0,"OlhhP3xyV9/Ye8n6nX9hWmu3RU8="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}
4. Check the replica set status
shard2:PRIMARY> rs.status()
#Confirm the newly added node has reached SECONDARY state
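
To watch the new member's initial sync, a one-liner like this can help (illustrative, not from the original; the member states progress through STARTUP2 and RECOVERING before SECONDARY):

shard2:PRIMARY> rs.status().members.filter(m => m.name == "192.168.59.142:27103").forEach(m => print(m.name, m.stateStr))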
5. Check the cluster status from mongos
mongos> sh.status()
--- Sharding Status --- 
  sharding version: {
      "_id" : 1,
      "minCompatibleVersion" : 5,
      "currentVersion" : 6,
      "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101",  "state" : 1 }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.141:27102,192.168.59.142:27103",  "state" : 1 }

#shard2's host string is automatically updated to "shard2/192.168.59.141:27102,192.168.59.142:27103"

Scenario 2: shard2 primary node failure

No business impact: a surviving secondary is automatically elected as the new primary. The recovery procedure is the same as in Scenario 1.
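
To confirm the failover, ask any surviving member who the new primary is (an illustrative check, not part of the original procedure):

shard2:SECONDARY> rs.isMaster().primary
#should report the promoted node, e.g. 192.168.59.140:27102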

Scenario 3: shard2 arbiter node failure

No business impact. The recovery procedure is the same as in Scenario 1, with one difference: instead of adding a data node, add an arbiter:

shard2:PRIMARY> rs.addArb("192.168.59.142:27103")

Scenario 4: two shard1 nodes fail, one of them the arbiter

Business impact: once the shard1 replica set is left with a single data node, that node automatically steps down to SECONDARY. shard1 itself is still running, but new reads and writes through mongos fail, and existing connections time out and log errors, as shown below:

mongos> show dbs
2020-04-30T14:55:04.110+0800 E QUERY    [js] Error: listDatabases failed:{
    "ok" : 0,
    "errmsg" : "Could not find host matching read preference { mode: \"primary\" } for set shard1",
    "code" : 133,
    "codeName" : "FailedToSatisfyReadPreference",
    "operationTime" : Timestamp(1588229692, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588229692, 2),
        "signature" : {
            "hash" : BinData(0,"2BYNFCHN8dZgE8E1J6AluDOVNZM="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:124:1
shellHelper.show@src/mongo/shell/utils.js:876:19
shellHelper@src/mongo/shell/utils.js:766:15
@(shellhelp2):1:1

The corresponding log entry:
2020-04-30T14:56:23.257+0800 I COMMAND  [conn914] command lcl_szy.mycol01 appName: "MongoDB Shell" command: insert { insert: "mycol01", ordered: true, lsid: { id: UUID("b9e81b56-3cd9-4f7e-9713-2bc2705e6181") }, $clusterTime: { clusterTime: Timestamp(1588229755, 2), signature: { hash: BinData(0, 199FBD05D97079A444636107558992C60AB4D77D), keyId: 6819186238046601246 } }, $db: "lcl_szy" } nShards:1 ninserted:0 numYields:0 reslen:353 protocol:op_msg 19684ms

At this point the cluster's balancer is running and keeps trying to migrate chunks from shard2 to shard1, but because shard1 is left with only a secondary, the migrations never succeed and the chunk counts never change.

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101",  "state" : 1 }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102",  "state" : 1 }
  active mongoses:
        "4.0.4-62-g7e345a7" : 3
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  yes
        Failed balancer rounds in last 5 attempts:  5
        Last reported error:  Could not find host matching read preference { mode: "primary" } for set shard1
        Time of Reported error:  Thu Apr 30 2020 15:06:39 GMT+0800 (CST)
        Migration Results for the last 24 hours:
                196 : Success
                7622 : Failed with error 'aborted', from shard2 to shard1
  databases:
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                shard1  1
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard1 Timestamp(1, 0)
        {  "_id" : "iot_test",  "primary" : "shard2",  "partitioned" : true,  "version" : {  "uuid" : UUID("d628fb8e-c88e-4548-9421-45862f6ade21"),  "lastMod" : 1 } }
                iot_test.vehicle_signal
                        shard key: { "deviceId" : "hashed" }
                        unique: false
                        balancing: true
                        chunks:
                                shard1  196
                                shard2  197
                        too many chunks to print, use verbose if you want to force print
        {  "_id" : "lcl_szy",  "primary" : "shard1",  "partitioned" : false,  "version" : {  "uuid" : UUID("635e1d10-b035-41ea-9a78-85ff4fdbadc0"),  "lastMod" : 1 } }
 
mongos> sh.isBalancerRunning()
true
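
While rebuilding shard1, it may also help to pause the balancer so it stops queueing doomed migrations. This is an optional extra step not taken in the original write-up:

mongos> sh.stopBalancer()
mongos> sh.getBalancerState()
false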

Recovery steps

1. Force-reconfigure shard1's surviving secondary into a single-node replica set to restore service
shard1:SECONDARY> config = rs.conf()
#Keep only the surviving member (index 0 here; use the index that matches your survivor)
shard1:SECONDARY> config.members = [config.members[0]]
shard1:SECONDARY> rs.reconfig(config, {force: true})
{
    "ok" : 1,
    "operationTime" : Timestamp(1588230485, 58),
    "$gleStats" : {
        "lastOpTime" : Timestamp(0, 0),
        "electionId" : ObjectId("7fffffff0000000000000004")
    },
    "lastCommittedOpTime" : Timestamp(0, 0),
    "$configServerState" : {
        "opTime" : {
            "ts" : Timestamp(1588232958, 2),
            "t" : NumberLong(1)
        }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588232958, 2),
        "signature" : {
            "hash" : BinData(0,"yeqvk5rGUesVN67DY5+0RojTM7I="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}
 
shard1:PRIMARY>
shard1:PRIMARY> show dbs
admin     0.000GB
config    0.000GB
iot_test  1.660GB
lcl_szy   2.965GB
local     0.714GB
2. Check the cluster status from mongos

shard1 now shows up as a single-node shard:

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101",  "state" : 1,  "draining" : true }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102",  "state" : 1 }
3. Add a new arbiter to shard1

Same as Scenario 3.

4. Add a new data node to shard1

Same as Scenario 1.

5. Verify the status of the rebuilt replica set
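
A quick health check for the rebuilt set (the one-liner is illustrative, not from the original):

shard1:PRIMARY> rs.status().members.forEach(m => print(m.name, m.stateStr, m.health))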

Scenario 5: two shard1 nodes fail: the primary and the secondary

The entire cluster is now unavailable, so the priority is to restore service as quickly as possible.

1. Remove shard1 from the sharded cluster

Delete all shard1 metadata from the config servers:

repl_config:PRIMARY> use config
repl_config:PRIMARY> db.shards.find()
{ "_id" : "shard1", "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101", "state" : 1, "draining" : true }
{ "_id" : "shard2", "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102", "state" : 1 }
repl_config:PRIMARY> db.shards.remove({'_id':"shard1"})
repl_config:PRIMARY> db.shards.find()
{ "_id" : "shard2", "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102", "state" : 1 }
repl_config:PRIMARY> db.collections.find()
#Remove the entry for any sharded collection whose data lived on shard1:
repl_config:PRIMARY> db.collections.remove({"_id":"iot_test.vehicle_signal"})
WriteResult({ "nRemoved" : 1 })
repl_config:PRIMARY> db.databases.remove({"_id":"lcl_szy"})
WriteResult({ "nRemoved" : 1 })
repl_config:PRIMARY> db.databases.find()
{ "_id" : "iot_test", "primary" : "shard2", "partitioned" : true, "version" : { "uuid" : UUID("d628fb8e-c88e-4548-9421-45862f6ade21"), "lastMod" : 1 } }
repl_config:PRIMARY>

mongos now accepts writes normally again; reads simply no longer return the data that lived on shard1.

2. Redeploy the shard1 replica set
3. Add shard1 back into the MongoDB sharded cluster
mongos> sh.addShard('shard1/192.168.59.140:27101,192.168.59.141:27101,192.168.59.142:27101')
{
    "shardAdded" : "shard1",
    "ok" : 1,
    "operationTime" : Timestamp(1588236561, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588236561, 2),
        "signature" : {
            "hash" : BinData(0,"3It1W8WjI0mMPuYN6wlCfS8M8fo="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101",  "state" : 1 }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102",  "state" : 1 }
  active mongoses:
        "4.0.4-62-g7e345a7" : 3
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  5
        Last reported error:  Could not find host matching read preference { mode: "primary" } for set shard1
        Time of Reported error:  Thu Apr 30 2020 16:14:06 GMT+0800 (CST)
        Migration Results for the last 24 hours:
                259 : Success
                7267 : Failed with error 'aborted', from shard2 to shard1
  databases:
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                shard2  1
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard2 Timestamp(2, 0)
        {  "_id" : "iot_test",  "primary" : "shard2",  "partitioned" : true,  "version" : {  "uuid" : UUID("d628fb8e-c88e-4548-9421-45862f6ade21"),  "lastMod" : 1 } }
 
mongos>
4. Re-enable sharding for iot_test.vehicle_signal

On the config servers, delete the stale chunk metadata:

repl_config:PRIMARY> use config
repl_config:PRIMARY> db.chunks.remove({})
WriteResult({ "nRemoved" : 394 })

Then, from mongos, shard the collection again:

mongos> db.runCommand({"shardCollection":"iot_test.vehicle_signal","key":{"deviceId":"hashed"}})
{
    "collectionsharded" : "iot_test.vehicle_signal",
    "collectionUUID" : UUID("ecffe19f-1cd9-48ab-b92b-fa676d5b9e0a"),
    "ok" : 1,
    "operationTime" : Timestamp(1588236901, 131),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588236901, 131),
        "signature" : {
            "hash" : BinData(0,"WqQbQEzdGESF+A+J7qZaDHJYQXw="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}
5. Restore shard1's backup data into the sharded cluster
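
A sketch of the restore, assuming the backup was taken with mongodump and sits at /backup/shard1 (the path, credentials, and options below are assumptions):

# Restore through a mongos so documents are routed to the correct shards
mongorestore --host 192.168.59.140 --port 27000 \
  -u chjroot -p 'CHANGE_ME' --authenticationDatabase admin \
  /backup/shard1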