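Birth of a Bug: A Deadlock Caused by Code with No Call Relationship

Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/breaksoftware/article/details/100567271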

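This bug comes from a strange phenomenon in one of our projects: the code shows no obvious locking problem, yet at run time it behaves as if it were deadlocked. I have reduced the business logic to this: a parent process keeps one child process alive at all times. (When reposting, please credit breaksoftware's CSDN blog.)

First we define a struct, ProcessGuard, which holds the child process ID and the mutex that protects it. This lets multiple threads operate on the struct safely.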

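#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
#include <signal.h>
#include <pthread.h>

struct ProcessGuard {
    pthread_mutex_t pids_mutex;  /* protects pid */
    pid_t pid;                   /* child process ID; 0 means no child exists */
};

/* Shared by the monitoring thread and the signal handler; allocated in main(). */
struct ProcessGuard *g_guard;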

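The main thread of the parent process starts a thread that keeps checking whether the pid in ProcessGuard is 0 (that is, no child process exists). If there is no child, the thread creates one and records its process ID in pid.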

void child_process(void) {
    /* The child only reports that it is alive, once per second. */
    while (1) {
        printf("This is the child process. My PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());
        sleep(1);
    }
}

void *create_process_routine(void *arg) {
    (void)arg;
    printf("This is the child thread of parent process. My PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());
    while (1) {
        int child = 0;
        if (child == 0) {
            pthread_mutex_lock(&g_guard->pids_mutex);
        }

        if (g_guard->pid != 0) {
            /* A child already exists. This path loops back while still
               holding pids_mutex (the unlock below is skipped). */
            continue;
        }

        pid_t pid = fork();
        sleep(1);
        /* Runs in both parent and child, so it is printed twice. */
        printf("Create child process %d.\n", pid);

        if (pid < 0) {
            perror("fork failed");
        }
        else if (pid == 0) {
            /* child process: child_process() never returns */
            child_process();
            child = 1;
            break;
        }
        else {
            /* parent process: remember the child's PID */
            g_guard->pid = pid;
            printf("dispatch task to process. pid is %d.\n", pid);
        }

        if (child == 0) {
            pthread_mutex_unlock(&g_guard->pids_mutex);
        }
        else {
            break;
        }
    }
    return NULL;
}

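In the main thread of the parent process we register a signal handler. When the child process is killed, the handler sets pid in ProcessGuard to 0 so that the parent's monitoring thread starts a new child.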

void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex); 
}

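Finally, the parent process: after initializing its data structures, it registers the signal handler and starts the thread that creates the child process.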

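int main(void) {
    pthread_t creat_process_tid;

    g_guard = malloc(sizeof(struct ProcessGuard));
    if (g_guard == NULL) {
        perror("malloc failed");
        exit(1);
    }
    /* pthread_mutex_init returns an error number instead of setting errno. */
    int err = pthread_mutex_init(&g_guard->pids_mutex, NULL);
    if (err != 0) {
        fprintf(stderr, "init pids_mutex error: %s\n", strerror(err));
        exit(1);
    }
    g_guard->pid = 0;

    printf("This is the Main thread of parent process.PID is %d.My thread_id is %lu.\n", getpid(), pthread_self());

    /* SIGCHLD is sent to the parent when the child terminates. */
    signal(SIGCHLD, sighandler);

    pthread_create(&creat_process_tid, NULL, (void*)create_process_routine, NULL);

    /* Stand-in for the main thread's real work. */
    while(1)  {
        printf("Get task from network.\n");
        sleep(1);
    }

    /* Not reached: the loop above never exits. */
    pthread_mutex_destroy(&g_guard->pids_mutex);

    return 0;
}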

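In the code above, the lock is used only in the thread function create_process_routine and in the signal handler sighandler. Neither function calls the other anywhere in the code, so a deadlock should not be possible. In practice, that is not what happens.

We run the program and kill the child process, and find that the main process does not start a new child.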

$ ./test      
This is the Main thread of parent process.PID is 17641.My thread_id is 140014057678656.
Get task from network.
This is the child thread of parent process. My PID is 17641.My thread_id is 140014049122048.
Create child process 17643.
dispatch task to process. pid is 17643.
Create child process 0.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the parent process.Catch signal 17.My PID is 17641.My thread_id is 140014049122048.
Get task from network.
Get task from network.
Get task from network.
Get task from network.
Get task from network.

This does not match the design of our code, and it does not seem logical either. So we attach gdb to the main process.

Attaching to process 17641
[New LWP 17642]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28      ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f57902be740 (LWP 17641) "test" 0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  2    Thread 0x7f578fa95700 (LWP 17642) "test" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
(gdb) t 2
[Switching to thread 2 (Thread 0x7f578fa95700 (LWP 17642))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#2  0x000055c512c29a9d in sighandler ()
#3  <signal handler called>
#4  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
#5  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#6  0x000055c512c29b42 in create_process_routine ()
#7  0x00007f578fe8e6db in start_thread (arg=0x7f578fa95700) at pthread_create.c:463
#8  0x00007f578fbb788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

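Looking at thread 2's call stack, we find that frame 5 and frame 1 are both locking the same mutex (0x55c51383e260). The thread function contains one lock call and one matching unlock call, so where does the second lock come from?

Frame 1's lock comes from the function in frame 2, sighandler, which is the following code: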
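To see why two lock attempts on one mutex from one thread are a problem at all, here is a minimal sketch. It assumes an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), which reports the second attempt as EDEADLK instead of hanging. The demo program uses a default mutex, which on Linux/glibc typically just blocks forever at that point, as the backtrace shows. The function name relock_same_thread is made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Locks an error-checking mutex twice from the calling thread and
 * returns the result of the second pthread_mutex_lock call. */
int relock_same_thread(void) {
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int second;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);           /* first lock succeeds */
    second = pthread_mutex_lock(&m);  /* same thread, same mutex: refused */

    pthread_mutex_unlock(&m);         /* release the first lock */
    pthread_mutex_destroy(&m);
    return second;                    /* EDEADLK for this mutex type */
}
```

With a default mutex there is no error return to inspect, so the only visible symptom is a thread parked in __lll_lock_wait.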

void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex); 
}

So here is the question: the thread function create_process_routine never calls sighandler, so where does this call come from?

In the Linux documentation at http://man7.org/linux/man-pages/man7/signal.7.html we find this passage about signals:

A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.

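In other words, a process-directed signal is delivered to any one thread that has not blocked that signal, and the kernel decides which one. This means our sighandler may run in the main thread, or it may run in the child thread. When it runs in the child thread, it interrupts create_process_routine and tries to lock a mutex that the interrupted code on that same thread already holds or is waiting for. The handler can never acquire the lock, and the interrupted code cannot resume until the handler returns. That is the deadlock we saw above.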
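The phrase "does not currently have the signal blocked" refers to a per-thread signal mask. The sketch below is only an illustration of that mechanism, not part of the original program: a worker thread blocks SIGCHLD for itself with pthread_sigmask, and the calling thread's mask is left unchanged. The helper names is_blocked_here, worker_blocking_sigchld and worker_blocks_sigchld are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Returns 1 if signo is blocked in the calling thread, 0 if it is not,
 * and -1 on error. Passing NULL as the new set only queries the mask. */
int is_blocked_here(int signo) {
    sigset_t current;
    if (pthread_sigmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, signo);
}

/* Hypothetical worker: blocks SIGCHLD in its own mask, then reports it. */
static void *worker_blocking_sigchld(void *arg) {
    int *blocked = arg;
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    pthread_sigmask(SIG_BLOCK, &set, NULL);  /* affects this thread only */

    *blocked = is_blocked_here(SIGCHLD);
    return NULL;
}

/* Runs the worker and returns what it observed about its own mask. */
int worker_blocks_sigchld(void) {
    pthread_t tid;
    int blocked = -1;

    if (pthread_create(&tid, NULL, worker_blocking_sigchld, &blocked) != 0)
        return -1;
    pthread_join(tid, NULL);
    return blocked;
}
```

In the demo program neither thread blocks SIGCHLD, so both are eligible to run sighandler. A new thread starts with a copy of its creator's mask, which is why masks are usually set before pthread_create.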

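So how do we fix it? The official approach is to use sigprocmask (pthread_sigmask in a multithreaded program) so that threads with a potential deadlock relationship do not receive these signals. In a complex system, however, this approach has a weakness: our projects usually pull in open-source or third-party libraries, and we cannot control the threads they start. So my suggestion is this: inside a signal handler, use lock-free structures wherever possible, and design intermediate data that isolates complex business code from the signal handler.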
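One possible shape for such intermediate data, as a rough sketch and not the fix used in the original project: the handler only sets an atomic flag, and the monitoring thread later consumes the flag in normal (non-handler) context, where taking the mutex is safe. The names g_child_exited, flag_handler, install_flag_handler and consume_child_exit are hypothetical, and the sketch assumes atomic_int is lock-free on the target platform.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/types.h>

/* Same layout as the ProcessGuard in the article. */
struct ProcessGuard {
    pthread_mutex_t pids_mutex;
    pid_t pid;
};

/* Assumption: atomic_int is lock-free, so it is usable in a handler. */
_Static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic_int must be lock-free");

static atomic_int g_child_exited = 0;

/* The handler takes no lock and calls nothing else: it sets a flag. */
static void flag_handler(int signum) {
    (void)signum;
    atomic_store(&g_child_exited, 1);
}

/* Installs flag_handler for signo. Returns 0 on success, -1 on error. */
int install_flag_handler(int signo) {
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = flag_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}

/* Called from the monitoring thread's loop, never from a handler.
 * Returns 1 if a pending exit was consumed and pid was reset, else 0. */
int consume_child_exit(struct ProcessGuard *guard) {
    if (atomic_exchange(&g_child_exited, 0) == 0)
        return 0;
    pthread_mutex_lock(&guard->pids_mutex);
    guard->pid = 0;
    pthread_mutex_unlock(&guard->pids_mutex);
    return 1;
}
```

In the demo, the monitoring loop would call consume_child_exit(g_guard) at the top of each iteration, and main would call install_flag_handler(SIGCHLD) in place of signal(SIGCHLD, sighandler). Whichever thread the kernel picks, the handler no longer touches pids_mutex.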
