We have been running into intermittent connection/DNS problems in a Kubernetes 1.10 cluster on Ubuntu. We have gone through all of the bug reports and similar threads, and recently we were able to trace it to a process holding /run/xtables.lock, which is causing trouble in a kube-proxy pod.
One of the kube-proxy pods bound to a worker keeps repeating this error in its logs:
E0920 13:39:42.758280 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 13:46:46.193919 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:05:45.185720 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:11:52.455183 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:38:36.213967 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:44:43.442933 1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
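For reference, log excerpts like the one above can be pulled straight from the affected pod; a minimal sketch, using one of the kube-proxy pod names from the listing further down purely as an example:

# tail the kube-proxy logs and filter for the xtables lock errors
kubectl -n kube-system logs kube-proxy-d7d8g --tail=200 | grep xtables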
These errors started roughly three weeks ago and we have not been able to fix the problem so far; because it is intermittent, we only now managed to trace it this far.
We believe this is also what leaves one of the kube-flannel-ds pods stuck in a permanent CrashLoopBackOff state:
NAME                                 READY   STATUS             RESTARTS   AGE
coredns-78fcdf6894-6z6rs             1/1     Running            0          40d
coredns-78fcdf6894-dddqd             1/1     Running            0          40d
etcd-k8smaster1                      1/1     Running            0          40d
kube-apiserver-k8smaster1            1/1     Running            0          40d
kube-controller-manager-k8smaster1   1/1     Running            0          40d
kube-flannel-ds-amd64-sh5gc          1/1     Running            0          40d
kube-flannel-ds-amd64-szkxt          0/1     CrashLoopBackOff   7077       40d
kube-proxy-6pmhs                     1/1     Running            0          40d
kube-proxy-d7d8g                     1/1     Running            0          40d
kube-scheduler-k8smaster1            1/1     Running            0          40d
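(The listing above is the standard kube-system pod overview; something along these lines produces it and lets you watch the restart counter climb:)

kubectl -n kube-system get pods
kubectl -n kube-system get pods -w    # watch mode, to see restarts as they happen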
Most of the bug reports around /run/xtables.lock seem to indicate the issue was resolved back in July 2017, yet we are seeing it on a fresh setup. The chains in iptables appear to be configured correctly.
Running fuser /run/xtables.lock returns nothing.
Does anyone have any insight into this? It is causing a lot of pain.
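For anyone retracing the same steps, a sketch of that lock check, plus an lsof variant that should show the same thing (assuming lsof is installed; run these on the worker node itself, not inside a pod):

sudo fuser -v /run/xtables.lock    # verbose: prints PID and user of any holder
sudo lsof /run/xtables.lock        # alternative view of processes with the file open

Since iptables normally holds the lock only for the duration of a single command, catching the holder this way is largely a matter of timing, which may be why the output is empty.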
Posted on 2018-09-27 21:23:05
So, after digging a little further, we were able to find the reason code with the following command:
kubectl -n kube-system describe pods kube-flannel-ds-amd64-szkxt
The pod name will of course differ between installations, but the termination reason in the output was:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
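If you would rather not scan the full describe output, the same field can be pulled out with a jsonpath query; a minimal sketch, assuming the pod name from our cluster and that the flannel container is the first (and only) container in the pod:

kubectl -n kube-system get pod kube-flannel-ds-amd64-szkxt \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'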
We had previously missed this reason code (we were mostly focused on the exit code 137); it means out of memory, killed.
By default, kube-flannel-ds gets a maximum memory allocation of 100Mi, which is apparently too low. Changing this default is also the subject of other issues filed against the reference configuration, but our fix was to raise the maximum limit to 256Mi.
Changing the configuration is a single step; just run:
kubectl -n kube-system edit ds kube-flannel-ds-amd64
and change the 100Mi value under limits -> memory to something higher; we went with 256Mi.
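After the edit, the container's resources section in the DaemonSet spec should end up looking roughly like this (the cpu values and the request are taken from the stock flannel manifest and may differ in your setup):

resources:
  requests:
    cpu: "100m"
    memory: "50Mi"
  limits:
    cpu: "100m"
    memory: "256Mi"    # raised from the default 100Mi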
By default these pods are only updated OnDelete, so you then need to delete the pod that is stuck in CrashLoopBackOff; it will be recreated with the updated values.
I suppose you could also roll through and delete the ones on the other nodes, but we only deleted the one that kept failing.
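A minimal sketch of that last step, using the failing pod name from the listing above and assuming the app=flannel label from the reference manifest:

# delete only the pod stuck in CrashLoopBackOff; the DaemonSet recreates it with the new limit
kubectl -n kube-system delete pod kube-flannel-ds-amd64-szkxt

# or roll every flannel pod across the cluster
kubectl -n kube-system delete pod -l app=flannel

# watch the replacements come back up
kubectl -n kube-system get pods -w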
Here are some references to the issues that helped us track this down:
https://github.com/coreos/flannel/issues/963 https://github.com/coreos/flannel/issues/1012
https://stackoverflow.com/questions/52428740