当运行节点(容器中的一个简单节点进程)时,我在我的内核日志中看到了这一点,进程就失效了:
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745710] INFO: task v8:SweeperThrea:2569 blocked for more than 120 seconds.
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745717] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745723] v8:SweeperThrea D 0000000000000000     0  2569   2470 0x00000002
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745727]  ffff8801d228fca8 0000000000000246 ffff8801d0fb1740 ffff8801d228ffd8
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745730]  ffff8801d228ffd8 ffff8801d228ffd8 ffffffff81c15440 ffff8801d0fb1740
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745737]  ffff8801d0fb1740 ffff8801d02f8878 0000000000000002 ffff8801d0fb1740
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745741] Call Trace:
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745745]  [<ffffffff816ca029>] schedule+0x29/0x70
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745748]  [<ffffffff810d1365>] zap_pid_ns_processes+0x125/0x180
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745752]  [<ffffffff8105e91c>] do_exit+0x85c/0x9d0
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745755]  [<ffffffff8105eb0f>] do_group_exit+0x3f/0xa0
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745759]  [<ffffffff8106e571>] get_signal_to_deliver+0x1c1/0x610
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745764]  [<ffffffff8101439f>] do_signal+0x3f/0x8d0
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745767]  [<ffffffff810f3427>] ? call_rcu_sched+0x17/0x20
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745771]  [<ffffffff8108429f>] ? __put_cred+0x3f/0x50
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745775]  [<ffffffff810843b9>] ? abort_creds+0x29/0x30
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745779]  [<ffffffff81014cb0>] do_notify_resume+0x80/0xb0
Mar 21 19:07:08 ip-10-0-2-233 kernel: [26336450.745781]  [<ffffffff816d3a9a>] int_signal+0x12/0x17更多信息: docker正在对进程执行SIGTERM然后SIGKILL。
docker -d -D日志:
[debug] server.go:924 Calling POST /containers/create
2014/03/21 19:04:11 POST /v1.10/containers/create
[/var/lib/docker|90a3fa34] +job create()
[/var/lib/docker|90a3fa34] -job create() = OK (0)
[debug] server.go:924 Calling POST /containers/{name:.*}/start
2014/03/21 19:04:11 POST /v1.10/containers/074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766/start
[/var/lib/docker|90a3fa34] +job start(074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766)
[/var/lib/docker|90a3fa34] +job allocate_interface(074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766)
[/var/lib/docker|90a3fa34] -job allocate_interface(074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766) = OK (0)
[/var/lib/docker|90a3fa34] -job start(074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766) = OK (0)
[debug] server.go:924 Calling POST /containers/{name:.*}/stop
2014/03/21 19:04:16 POST /v1.10/containers/074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766/stop?t=10
[/var/lib/docker|90a3fa34] +job stop(074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766)
2014/03/21 19:04:26 Container 074dda12cb039f688777e2cb5115dd0c5088f6b93ed21586782cfe4e57533766 failed to exit within 10 seconds of SIGTERM - using the force
2014/03/21 19:04:36 Container SIGKILL failed to exit within 10 seconds of lxc-kill 074dda12cb03 - trying direct SIGKILL伙伴关系:
docker(2387)─node(2532)─{node}(2569)ps辅助性
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0  2387  2386  0  80   0 - 153471 futex_ pts/0   00:00:07 docker
4 Z     0  2532  2387  0  80   0 -     0 exit   ?        00:00:00 node <defunct>/proc/2532/fd中没有文件,/proc/2532//task/2569/fd/中没有文件
/proc/2532/stack堆栈
[<ffffffff8105e763>] do_exit+0x6a3/0x9d0
[<ffffffff8105eb0f>] do_group_exit+0x3f/0xa0
[<ffffffff8105eb87>] sys_exit_group+0x17/0x20
[<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff/proc/2532/task/2569/stack堆栈
[<ffffffff810d1365>] zap_pid_ns_processes+0x125/0x180
[<ffffffff8105e91c>] do_exit+0x85c/0x9d0
[<ffffffff8105eb0f>] do_group_exit+0x3f/0xa0
[<ffffffff8106e571>] get_signal_to_deliver+0x1c1/0x610
[<ffffffff8101439f>] do_signal+0x3f/0x8d0
[<ffffffff81014cb0>] do_notify_resume+0x80/0xb0
[<ffffffff816d3a9a>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff复制脚本:
CNT=0
while true
do 
  echo $CNT
  DOCK=$(sudo docker run -d -t anandkumarpatel/zombie_bug ./node index.js)
  sleep 60 && sudo docker stop $DOCK > out.log &
  sleep 1
  CNT=$(($CNT+1))
  if [[ "$CNT" == "50" ]]; then
    exit
  fi
done在一次失败的捕杀过程中,可以在巴斯泰宾上找到停靠者的碎片:
http://pastebin.com/HxDwiRBW我的系统信息和版本。我是使用自定义构建码头,但它是分叉形式0.9.0发布与2个小补丁。但这也将复制干净的0.9.0版本。
$  lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 13.04
Release:    13.04
Codename:   raring
$ sudo docker version
Client version: 0.9.0
Go version (client): go1.2.1
Git commit (client): 70f72ea
Server version: 0.9.0
Git commit (server): 70f72ea
Go version (server): go1.2.1
Last stable version: 0.9.0
$ uname -a
Linux ip-10-0-2-233 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux但是,这并不是在每个系统上都有回报!出于某种原因,它只在我们的生产服务器和另一台服务器上进行复制。他们都有相似的吐露,但同样的ubuntu版本。
如果有更多的信息要收集,让我知道,我会抓住它。我有接近100%的复配率在一个测试系统,所以我可以收集任何需要。
相关:https://github.com/dotcloud/docker/issues/4811
Docker container refuses to get killed after run command turns into a zombie
更改为较新的内核不能解决此问题:不复制:linux映像-3.8.0-35-泛型与linux-映像-3.8.0-19-泛型
将进行搜索,查看何时修复,以确定是否有助于找到根本原因。
发布于 2014-03-24 04:24:01
更改为最新内核解决了这个问题
发现了确切的内核差异: REPRO: linux-映像-3.8.0-31-泛型无重复:linux-映像-3.8.0-32-泛型
我想这就是解决办法:
+++ linux-3.8.0/kernel/pid_namespace.c
@@ -181,6 +181,7 @@
    int nr;
    int rc;
    struct task_struct *task, *me = current;
+   int init_pids = thread_group_leader(me) ? 1 : 2;
    /* Don't allow any more processes into the pid namespace */
    disable_pid_allocation(pid_ns);
@@ -230,7 +231,7 @@
     */
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
-       if (pid_ns->nr_hashed == 1)
+       if (pid_ns->nr_hashed == init_pids)
            break;
        schedule();
    }从这里来的:https://groups.google.com/forum/#!msg/fa.linux.kernel/u4b3n4oYDQ4/GuLrXfDIYggJ
将升级我们所有的服务器,以复制这一点,看看它是否还在发生。
https://stackoverflow.com/questions/22588254
复制相似问题