首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >linux服务器开发实战(一)——排查Flamingo服务端一个崩溃的问题

linux服务器开发实战(一)——排查Flamingo服务端一个崩溃的问题

作者头像
范蠡
发布2018-09-29 17:59:58
1.5K0
发布2018-09-29 17:59:58
举报

我的flamingo服务器(关于flamingo可以参看这里)最近在杀掉进程(如使用Ctrl + C或者kill + 程序pid)偶尔会出现崩溃问题,虽然这个问题没多大影响,因为进程本来就马上要死了,在退出的过程中崩溃也就无所谓了,但是本着严谨和求知的态度,我还是排查了一下。下面记录一下debug的过程,希望对读者有所启发。

正常情况下,我的程序处理了Ctrl+C信号时,会走正常的退出流程,预想的程序不会崩溃的,但实际还是崩溃了。

主线程是一个EventLoop无限循环,同时程序接收到Ctrl+C信号时,设置主线程退出标志。代码如下:

 int main(int argc, char* argv[])
 2  {  
 3    //设置信号处理
 4    signal(SIGCHLD, SIG_DFL);
 5    signal(SIGPIPE, SIG_IGN);
 6    signal(SIGINT, prog_exit);
 7    signal(SIGTERM, prog_exit);
 8
 9    //省略无关代码...
10
11    g_mainLoop.loop();
12
13    return 0;
14}

信号处理程序如下:

1void prog_exit(int signo)
2   {
3    std::cout << "program recv signal [" << signo << "] to exit." << std::endl;
4
5    g_mainLoop.quit();
6
7}

通过日志也看不到对于排查崩溃现象的有任何帮助的日志信息,于是启用linux的linux的coredump文件生成机制,某次产生了如下coredump文件:

于是使用gdb调试查看一下崩溃调用堆栈(第一步使用命令gdb 可执行文件名称 core文件名,第二步使用bt命令查看崩溃堆栈):

 1[zhangyl@iZ238vnojlyZ myimserver]$ gdb mychatserver core.9798 
 2GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
 3   Copyright (C) 2013 Free Software Foundation, Inc.
 4   License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
 5   This is free software: you are free to change and redistribute it.
 6   There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
 7   and "show warranty" for details.
 8   This GDB was configured as "x86_64-redhat-linux-gnu".
 9   For bug reporting instructions, please see:
10   <http://www.gnu.org/software/gdb/bugs/>...
11   Reading symbols from /home/zhangyl/myimserver/mychatserver...done.
12   [New LWP 9798]
13   [New LWP 9802]
14   [New LWP 9804]
15   [New LWP 9800]
16   [New LWP 9803]
17   [New LWP 9801]
18   [New LWP 9805]
19   [Thread debugging using libthread_db enabled]
20   Using host libthread_db library "/lib64/libthread_db.so.1".
21   Core was generated by `./mychatserver -d'.
22   Program terminated with signal 11, Segmentation fault.
23   #0  0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
24   45              int avail() const { return static_cast<int>(end() - cur_); }
25Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.6.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64
26(gdb) bt
27#0  0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
28#1  0x00000000004f4e20 in AsyncLogging::append (this=0x7ffd5d58ad50, 
29    logline=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
30    at /home/zhangyl/myimserver/base/asynclogging.cpp:98
31#2  0x000000000053bfc8 in asyncOutput (msg=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
32    at /home/zhangyl/myimserver/chatserversrc/main.cpp:33
33#3  0x00000000004f205d in Logger::~Logger (this=0x7ffd5d58fe60, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/base/logging.cpp:149
34#4  0x0000000000504bf1 in net::EventLoop::~EventLoop (this=0x7d1c00 <g_mainLoop>, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/net/eventloop.cpp:87
35#5  0x00007f7067d79e69 in __run_exit_handlers () from /lib64/libc.so.6
36#6  0x00007f7067d79eb5 in exit () from /lib64/libc.so.6
37#7  0x00007f7067d62b1c in __libc_start_main () from /lib64/libc.so.6
38#8  0x00000000004f0ba9 in _start ()
39(gdb)

崩溃的根源在于堆栈#0,使用frame命令(frame 堆栈编号,如frame 0,表示切换到堆栈#0)切换到堆栈#0

1(gdb) frame 0
2#0  0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
345              int avail() const { return static_cast<int>(end() - cur_); }
4(gdb) p this
5$1 = (const FixedBuffer<4000000> * const) 0x7f7067564010
6(gdb)

使用print命令(简写成p)查看下当前对象的this指针,this指针时一个地址值,但是不代表这个对象有效,再使用print命令查看一下这个对象本身的数据(p *this),发现这个对象已经失效:

1(gdb) p *this
2Cannot access memory at address 0x7f7067564010
3(gdb)

这个对象为啥失效呢?而且堆栈#3堆栈#4显示的是在EventLoop里面调用日志类Logger的析构函数。先看下堆栈#4处的函数代码:

 1EventLoop::~EventLoop()
 2{    
 3    LOG_DEBUG << "EventLoop destructs in other thread";
 4
 5    //std::stringstream ss;
 6    //ss << "eventloop destructs threadid = " << threadId_;
 7    //std::cout << ss.str() << std::endl;
 8
 9    wakeupChannel_->disableAll();
10    wakeupChannel_->remove();
11    ::close(wakeupFd_);
12    t_loopInThisThread = NULL;
13}

除了打印一行日志,也没有和日志Logger类相关的代码,且EventLoop也没有与Logger类相关的成员变量(也就是无隐式析构)。我们再看下堆栈#3处的代码:

 1Logger::~Logger()
 2{
 3    impl_.finish();
 4    const LogStream::Buffer& buf(stream().buffer());
 5    g_output(buf.data(), buf.length());
 6    if (impl_.level_ == FATAL)
 7    {
 8        g_flush();
 9        abort();
10    }
11}

崩溃的地方就是g_output(buf.data(), buf.length()); 而这个buf的内容正好就是EventLoop析构函数里面打印的一行日志:

EventLoop destructs in other thread

那么这个g_output为啥会调用AsyncLogging::append函数呢?(参考堆栈#1)。我们通过全局搜索发现g_output是一个全局变量,并且在定义时已经有初始值:

Logger::OutputFunc g_output = defaultOutput;

而初始值defaultOutput的行为就是将日志输出到控制台:

1void defaultOutput(const char* msg, int len)
2  {
3    size_t n = fwrite(msg, 1, len, stdout);
4    //FIXME check n
5    (void)n;
6}

fwrite中的第二个参数是1在linux中0代表标准输入控制台,1代表标准输出控制台,2代表错误输出控制台

我们其实在main函数中,改变了日志的输出行为,让日志即可以输出到控制台也能输出到日志文件中:

 1EventLoop g_mainLoop;
 2AsyncLogging* g_asyncLog = NULL;
 3void asyncOutput(const char* msg, int len)
 4  {
 5    if (g_asyncLog != NULL)
 6    {
 7        g_asyncLog->append(msg, len);
 8        std::cout << msg << std::endl;
 9    }
10}
11
12void asyncOutput(const char* msg, int len)
13   {
14    if (g_asyncLog != NULL)
15    {
16        g_asyncLog->append(msg, len);
17        std::cout << msg << std::endl;
18    }
19}
20
21int main(int argc, char* argv[])
22  {
23    //省略无关代码
24    std::string strLogFileFullPath(logfilepath);
25    strLogFileFullPath += logfilename;
26    Logger::setLogLevel(Logger::DEBUG);
27    int kRollSize = 500 * 1000 * 1000;
28    AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
29    log.start();
30    g_asyncLog = &log;
31    Logger::setOutput(asyncOutput);
32
33    //省略无关代码
34
35    g_mainLoop.loop();
36
37    return 0;
38}

看到这里问题就来了,当main函数执行完毕,即出了main函数的作用域,log对象已经销毁。此时正好正在销毁全局对象g_mainLoop,此时调用EventLoop的析构函数,在其析构函数中打印“EventLoop destructs in other thread”这行日志,仍然会调用asyncOutput函数,但是此时g_asyncLog 已经是一个无效的指针了,且不是空指针(野指针),所以你调用它的append方法必然会引起内存问题。

如何改正这个问题呢? 方法有多种,方法一就是在g_mainLoop.loop()后面加上一行,将全局指针对象g_asyncLog显式设置为空指针,这样EventLoop里面的析构函数打印的日志就不会打印了,因为asyncOutput函数中已经做了空指针判断。代码改成如下:

 1int main(int argc, char* argv[])
 2  {
 3    //省略无关代码
 4    std::string strLogFileFullPath(logfilepath);
 5    strLogFileFullPath += logfilename;
 6    Logger::setLogLevel(Logger::DEBUG);
 7    int kRollSize = 500 * 1000 * 1000;
 8    AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
 9    log.start();
10    g_asyncLog = &log;
11    Logger::setOutput(asyncOutput);
12
13    //省略无关代码
14
15    g_mainLoop.loop();
16    g_asyncLog = NULL;
17
18    return 0;
19}

这种解决方案缺点是丢了部分日志,方法二就是将这个log对象也申明成全局变量,即将g_asyncLog改成对象而不是指针我们这里采用方法二,修改后的代码如下:

 1EventLoop g_mainLoop;
 2
 3AsyncLogging g_asyncLog;
 4
 5void asyncOutput(const char* msg, int len)
 6  {
 7    g_asyncLog.append(msg, len);
 8    std::cout << msg << std::endl;
 9}
10
11int main(int argc, char* argv[])
12  {
13    //无关代码省略...
14    std::string strLogFileFullPath(logfilepath);
15    strLogFileFullPath += logfilename;
16    Logger::setLogLevel(Logger::DEBUG);
17    int kRollSize = 1024 * 1024 * 1024;
18    //AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
19    g_asyncLog.setBaseName(strLogFileFullPath.c_str());
20    g_asyncLog.setRollSize(kRollSize);
21    g_asyncLog.start();
22    Logger::setOutput(asyncOutput);
23
24    //无关代码省略...
25
26    g_mainLoop.loop();
27
28
29    return 0;
30}

改完之后,重新编译程序,我们要验证一下,我们让程序接收Ctrl + C信号,由于gdb默认会自己处理Ctrl + C指令(行为是让gdb中断下来接收用户输入),我们可以修改gdb的设置让gdb不要处理这个信号,而是把这个信号传给我们的程序,在gdb中执行如下指令:

handle SIGINT nostop print pass 

这样程序就能响应Ctrl + C了SIGINT是Ctrl + C产生的信号值)。

 1(gdb)  handle SIGINT nostop print pass 
 2SIGINT is used by the debugger.
 3Are you sure you want to change it? (y or n) y
 4Signal        Stop      Print   Pass to program Description
 5SIGINT        No        Yes     Yes             Interrupt
 6(gdb) r
 7Starting program: /home/zhangyl/flamingoserver/chatserver 
 8[Thread debugging using libthread_db enabled]
 9Using host libthread_db library "/lib64/libthread_db.so.1".
10[New Thread 0x7ffff442b700 (LWP 11460)]
11[New Thread 0x7ffff3c2a700 (LWP 11461)]
12[Thread 0x7ffff3c2a700 (LWP 11461) exited]
1320180820 22:00:27 INFO  show databases - DatabaseMysql.cpp:93
14
1520180820 22:00:27 INFO  CMysqlManager::_IsDBExist, find database(flamingo) - MysqlManager.cpp:195
16
1720180820 22:00:27 INFO  desc t_user - DatabaseMysql.cpp:93
18
1920180820 22:00:27 INFO  desc t_user_relationship - DatabaseMysql.cpp:93
20
2120180820 22:00:27 INFO  desc t_chatmsg - DatabaseMysql.cpp:93
22
2320180820 22:00:27 INFO  SELECT f_user_id, f_username, f_nickname, f_password,  f_facetype, f_customface, f_gender, f_birthday, f_signature, f_address, f_phonenumber, f_mail, f_teaminfo FROM t_user ORDER BY  f_user_id DESC - DatabaseMysql.cpp:93
24
2520180820 22:00:27 INFO  current base userid: 0, current base group id: 268435455 - UserManager.cpp:111
26
27[New Thread 0x7ffff3c2a700 (LWP 11462)]
28[New Thread 0x7ffff2a74700 (LWP 11463)]
29[New Thread 0x7ffff2273700 (LWP 11464)]
30[New Thread 0x7ffff1a72700 (LWP 11465)]
31[New Thread 0x7ffff1271700 (LWP 11466)]
32[New Thread 0x7ffff0a70700 (LWP 11467)]
33[New Thread 0x7fffd3fff700 (LWP 11468)]
34[New Thread 0x7fffd37fe700 (LWP 11469)]
3520180820 22:00:27 INFO  chatserver initialization completed, now you can use client to connect it. - main.cpp:139
36
37^C
38Program received signal SIGINT, Interrupt.
39program recv signal [2] to exit.
40Exit loop...
41Exit chatserver....
42[Thread 0x7ffff442b700 (LWP 11460) exited]
43[Thread 0x7fffd3fff700 (LWP 11468) exited]
44[Thread 0x7fffd37fe700 (LWP 11469) exited]
45[Thread 0x7ffff0a70700 (LWP 11467) exited]
46[Thread 0x7ffff1271700 (LWP 11466) exited]
47[Thread 0x7ffff1a72700 (LWP 11465) exited]
48[Thread 0x7ffff2273700 (LWP 11464) exited]
49[Thread 0x7ffff3c2a700 (LWP 11462) exited]
50[Thread 0x7ffff7fe5a80 (LWP 11459) exited]
51[Inferior 1 (process 11459) exited normally]
52(gdb) 

至此,我们验证并修复了该bug,这样程序在响应Ctrl + C或者用kill + pid杀死进程就可以走正常退出流程了,而不再崩溃。不知道你学到了没有?

小结一下: 通过上面的例子我们可以发现,作为一个合格的linux后台开发人员,我们不仅要熟悉业务代码本身,还要熟练适用gdb各种命令,同时对操作系统的一些机制也要了解(例如:如何设置程序崩溃以后产生core文件)。如果你还没有掌握,建议一定要好好练习一下gdb的使用。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2018-08-21,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 高性能服务器开发 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档