前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >MongoDB触发oom-killer的简单处理(一)(r7笔记第54天)

MongoDB触发oom-killer的简单处理(一)(r7笔记第54天)

作者头像
jeanron100
发布2018-03-16 17:50:09
1.8K0
发布2018-03-16 17:50:09
举报

今天正在处理问题,突然收到一封邮件,提示mongo的一个副本集出现了问题。邮件内容如下: 当前状态:PROBLEM 事件编号:11187790 发生时间:2015.12.21 13:30:01 持续时长:1h 20m 监控节点:mongo_mobile_gl_comment_sec.161 报警级别:高危(High) 事件信息:One or more replication members need attention 事件详情:Replication: Count of ReplicaSet Members Needing Attention: 1 DEBUG信息: Replication: Count of ReplicaSet Members Needing Attention (mongo_mobile_gl_comment_sec.161:repl_member_attention_state_count): 1 对于这类问题,最开始以为是网络延时,结果邮件还是持续报警,这个时候感觉应该是哪里出了问题。连接到环境中,发现mongo进程已经消失了。 尝试连接,提示连接失败。 [root@mongo_mobile_gl_comment_sec ~]# mongo --port=27010 MongoDB shell version: 2.6.3 connecting to: 127.0.0.1:27010/test 2015-12-21T12:21:28.407+0530 warning: Failed to connect to 127.0.0.1:27010, reason: errno:111 Connection refused 2015-12-21T12:21:28.407+0530 Error: couldn't connect to server 127.0.0.1:27010 (127.0.0.1), connection attempt failed at src/mongo/shell/mongo.js:148 exception: connect failed 这个时候准备重启mongoDB,这个时候提示重启失败。 [root@mongo_mobile_gl_comment_sec ~]# /etc/init.d/mongod start Starting mongod: [FAILED] 这个时候只能查看日志了,得到的日志信息如下,可以看到服务器重启的痕迹,但是mongo重启失败。 less /storage/mongodb/log/mongod.log 2015-12-21T11:06:02.893+0530 [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 8ms 2015-12-21T11:06:33.976+0530 [conn1675793] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 numYields:0 reslen:434 1094ms 2015-12-21T11:06:33.976+0530 [conn1675786] command local.$cmd command: getnonce { getnonce: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:65 30244ms 2015-12-21T11:06:33.976+0530 [conn1675795] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 numYields:0 reslen:434 1096ms 2015-12-21T11:06:33.976+0530 [conn1675794] command admin.$cmd command: buildInfo { buildinfo: 1 } keyUpdates:0 numYields:0 reslen:700 1039ms 2015-12-21T11:06:33.982+0530 [conn1675789] command local.$cmd command: getnonce { getnonce: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:65 31111ms 2015-12-21T11:06:33.982+0530 [conn1675788] command local.$cmd command: getnonce { getnonce: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:65 30028ms 2015-12-21T11:06:33.982+0530 [conn1675790] command local.$cmd command: getnonce { getnonce: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:65 31079ms 2015-12-21T11:06:33.982+0530 [conn1675791] command local.$cmd command: getnonce { getnonce: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:65 31095ms 2015-12-21T11:06:34.715+0530 [conn1675792] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 numYields:0 reslen:434 30769ms 2015-12-21T11:06:34.769+0530 [conn271045] query mobile_UCENTER.scoreRecord query: { $and: [ { uid: 2093542 }, { createTime: { $lte: new Date(1450675608270) } }, { createTime: { $gte: new Date(1450636200270) } } ] } planSummary: IXSCAN { uid: 1 } ntoreturn:0 ntoskip:0 nscanned:125 nscannedObjects:125 keyUpdates:0 numYields:109 locks(micros) r:275071880 nreturned:0 reslen:20 576812ms 2015-12-21T11:06:43.649+0530 [PeriodicTaskRunner] task: WriteBackManager::cleaner took: 15ms 2015-12-21T12:25:31.762+0530 ***** SERVER RESTARTED ***** 2015-12-21T12:25:31.775+0530 [initandlisten] MongoDB starting : pid=11668 port=27010 dbpath=/storage/mongodb/data/db/ 64-bit host=mongo_mobile_gl_comment_sec.126 2015-12-21T12:25:31.775+0530 [initandlisten] db version v2.6.3 2015-12-21T12:25:31.775+0530 [initandlisten] git version: 255f67a66f9603c59380b2a389e386910bbb52cb 2015-12-21T12:25:31.775+0530 [initandlisten] build info: Linux build12.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49 2015-12-21T12:25:31.775+0530 [initandlisten] allocator: tcmalloc 2015-12-21T12:25:31.775+0530 [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1,172.16.3.126", http: { enabled: false }, maxIncomingConnections: 5000, port: 27010 }, processManagement: { fork: true, pidFilePath: "/storage/mongodb/mongod.pid" }, replication: { oplogSizeMB: 30960, replSet: "rs_comment" }, security: { authorization: "enabled", keyFile: "/home/mongodb/key/rs_key" }, storage: { dbPath: "/storage/mongodb/data/db/", directoryPerDB: true, journal: { enabled: false } }, systemLog: { destination: "file", logAppend: true, path: "/storage/mongodb/log/mongod.log", quiet: true } } 2015-12-21T12:25:31.786+0530 [initandlisten] exception in initAndListen: 12596 old lock file, terminating 2015-12-21T12:25:31.786+0530 [initandlisten] dbexit: 2015-12-21T12:25:31.786+0530 [initandlisten] shutdown: going to close listening sockets... 2015-12-21T12:25:31.786+0530 [initandlisten] shutdown: going to flush diaglog... 2015-12-21T12:25:31.786+0530 [initandlisten] shutdown: going to close sockets... 2015-12-21T12:25:31.786+0530 [initandlisten] shutdown: waiting for fs preallocator... 2015-12-21T12:25:31.786+0530 [initandlisten] shutdown: closing all files... 2015-12-21T12:25:31.786+0530 [initandlisten] closeAllFiles() finished 2015-12-21T12:25:31.786+0530 [initandlisten] dbexit: really exiting now 对于这类问题,因为也算是菜鸟,所以有些不知所措,不过看了下日志错误,提示是存在一个锁文件,这个看起来和weblogic启动也有点类似,查了一下百度,也得到了印证,我就不继续靠谷歌了。开始备份这个锁文件 [root@mongo_mobile_gl_comment_sec db]# mv mongod.lock mongod.lock.bak 然后继续尝试重启,这次就没有问题了。 [root@mongo_mobile_gl_comment_sec db]# /etc/init.d/mongod start Starting mongod: [ OK ] 再次尝试登录就可以看到是成功了。 [root@mongo_mobile_gl_comment_sec db]# mongo --port=27010 MongoDB shell version: 2.6.3 connecting to: 127.0.0.1:27010/test 当然回过头来,为什么服务器会重启,重启之后可以看到,服务器的负载极高。几乎没有任何喘气的空隙了。 top - 15:29:37 up 364 days, 6:50, 1 user, load average: 92.52, 59.74, 34.39 Tasks: 73 total, 1 running, 72 sleeping, 0 stopped, 0 zombie Cpu(s): 26.7%us, 7.8%sy, 0.0%ni, 0.7%id, 60.7%wa, 2.8%hi, 1.2%si, 0.2%st Mem: 15408212k total, 15199732k used, 208480k free, 2732k buffers Swap: 16777212k total, 9540k used, 16767672k free, 13722616k cached 使用sar得到的结果如下,可以看到idle没有什么空间了。 [root@mongo_mobile_gl_comment_sec db]# sar 1 3 Linux 3.10.42-52.145.amzn1.x86_64 (mongo_mobile_gl_comment_sec.126) 12/21/2015 _x86_64_ (2 CPU) 03:30:12 PM CPU %user %nice %system %iowait %steal %idle 03:30:13 PM all 46.84 0.00 8.95 43.68 0.53 0.00 03:30:14 PM all 23.16 0.00 9.47 66.32 0.00 1.05 03:30:15 PM all 18.18 0.00 5.88 75.40 0.53 0.00 Average: all 29.45 0.00 8.11 61.73 0.35 0.35 查看系统日志,看看奔溃前的服务器情况。然后又看到了熟悉的oom-killer,又是out of memory,又是swap不足了。 Dec 21 10:47:51 template-mongodb-2-6-3 dhclient[1504]: bound to 172.16.3.126 -- renewal in 1415 seconds. Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.279409] mongod invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.286302] mongod cpuset=/ mems_allowed=0 Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.288577] CPU: 0 PID: 11565 Comm: mongod Not tainted 3.10.42-52.145.amzn1.x86_64 #1 Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.292344] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/03/2014 。。。 Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.503711] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.508107] 37117 total pagecache pages Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.510198] 34175 pages in swap cache Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.512098] Swap cache stats: add 5987691, delete 5953516, find 100670521/100858969 Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.515633] Free swap = 8kB Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.517084] Total swap = 16777212kB Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.554801] 3999999 pages RAM Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.556423] 82410 pages reserved Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.558234] 526732 pages shared Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.559786] 3881988 pages non-shared 。。。 Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.678772] [ 8407] 0 8407 63087 900 75 0 0 php Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.682439] Out of memory: Kill process 23821 (mongod) score 982 or sacrifice child Dec 21 11:09:07 template-mongodb-2-6-3 kernel: [31458594.686761] Killed process 23821 (mongod) total-vm:794240260kB, anon-rss:13553192kB, file-rss:0kB 可以在重启之后看到swap的进程情况。 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 11830 mongod 20 0 728g 10g 9.9g S 82.2 70.0 106:53.68 /usr/bin/mongod -f /etc/mongod.conf 326 root 20 0 0 0 0 S 2.7 0.0 4989:16 [kswapd0] 以10秒为间隔,可以看到产生了大量的读操作。 [root@mongo_mobile_gl_comment_sec ~]# mongotop -h 127.0.0.1 --port 27010 -u admin -p'xxxxxx' 10 connected to: 127.0.0.1:27010 ns total read write 2015-12-21T10:33:56 mobile_UCENTER.scoreRecord 1986786ms 1986786ms 0ms social.t_user 117490ms 117490ms 0ms mobile_UCENTER.T_USER_INVITE_CODE 97416ms 97416ms 0ms social.t_comment 94913ms 94913ms 0ms social.t_topic 74259ms 74259ms 0ms mobile_UCENTER.expRecord 48979ms 48979ms 0ms social.t_praise 7286ms 7286ms 0ms social.t_topic_sub 4003ms 4003ms 0ms mobile_UCENTER.boughtRecord 1561ms 1561ms 0ms 使用mongostat查看统计信息。可以看到不同的db的负载情况,连dml的情况都能够统计的非常详细。

代码语言:javascript
复制

# mongostat -h 127.0.0.1 --port 27010 -u admin -p'xxxxxxx' 5
connected to: 127.0.0.1:27010
insert  query update delete getmore command flushes mapped  vsize    res  faults   locked db idx miss %     qr|qw   ar|aw  netIn netOut   conn        set repl       time 
    *9    955     *8     *0       0    11|0       0   726g   728g   10.1g   2371 social:0.3%          0     116|0     1|0   166k   308k   1527 rs_comment  SEC   16:05:19 
   *20   2080     *7     *0       0    10|0       0   726g   728g   10.1g   2315 mobile_UCENTER:0.4%          0     700|0     2|0   342k    449k  1527 rs_comment  SEC   16:05:25 
   *22   1963    *13     *0       0    14|0       0   726g   728g   10.2g   2909 mobile_UCENTER:0.2%          0       0|0   300|0   352k    426k  1527 rs_comment  SEC   16:05:30 
   *18   1583     *5     *0       0    11|0       0   726g   728g   10.2g   2495 mobile_UCENTER:0.3%          0     282|0    10|0   275k    346k  1527 rs_comment  SEC   16:05:35 

后续的就是持续深化,优化好之后再来收尾,看看为什么会负载突然提升,从系统层面怎么提升了。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2015-12-21,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 杨建荣的学习笔记 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
数据库
云数据库为企业提供了完善的关系型数据库、非关系型数据库、分析型数据库和数据库生态工具。您可以通过产品选择和组合搭建,轻松实现高可靠、高可用性、高性能等数据库需求。云数据库服务也可大幅减少您的运维工作量,更专注于业务发展,让企业一站式享受数据上云及分布式架构的技术红利!
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档