[TOC]
Description: when learning any new technology you inevitably hit bumps along the way. If you record each pitfall as you encounter it, you can deal with the same problem immediately the next time it comes up.
Summary of startup parameters:
--register-node [Boolean] # whether the node registers itself with the apiserver automatically
Configuration file: /etc/kubernetes/kubelet.conf
About the build environment
You can separate the build environment from the deployment environment according to your situation, for example:
- While learning (following this tutorial): build and push images on the kubernetes master node
- While developing: build and push images on your own laptop
- At work: build and push images with a Jenkins Pipeline or a gitlab-runner Pipeline
K8s/containerd image garbage-collection (GC) parameter configuration. Reference: https://kubernetes.io/docs/concepts/architecture/garbage-collection/
$ vim /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
maxPods: 180 # maximum number of pods on this node
After any change to the kubelet configuration, apply it with:
systemctl daemon-reload
systemctl restart kubelet.service
Description: APISERVER_NAME must not be the master's hostname, and may contain only lowercase letters, digits, and dots (no hyphens), e.g. export APISERVER_NAME=apiserver.weiyi.
The subnet used for POD_SUBNET must not overlap with the subnet the master/worker nodes are on (CIDR: Classless Inter-Domain Routing), e.g. export POD_SUBNET=10.100.0.1/16.
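Both constraints can be sanity-checked in the shell before running kubeadm. The following is a minimal sketch that checks only the format (not the subnet-overlap requirement); the helper function names are mine, not part of any tooling:

```shell
#!/usr/bin/env sh
# Sketch: pre-flight format checks for the two variables described above.
# APISERVER_NAME: only lowercase letters, digits and dots, no hyphens.
check_apiserver_name() {
  case "$1" in
    *[!a-z0-9.]*) echo "invalid" ;;
    *)            echo "ok" ;;
  esac
}
# POD_SUBNET: must look like CIDR notation, e.g. 10.100.0.1/16.
check_pod_subnet() {
  echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$' \
    && echo "ok" || echo "invalid"
}

check_apiserver_name "apiserver.weiyi"   # → ok
check_apiserver_name "api-server.weiyi"  # → invalid (contains a hyphen)
check_pod_subnet "10.100.0.1/16"         # → ok
```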
Solution:
# 1. If the kubernetes docker images cannot be downloaded, switch to a mirror repository and initialize manually
# --image-repository= mirrorgcrio
# --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers
~$ kubeadm config images list --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.2
registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.13-0
registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.7.0
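If kubeadm itself is not run with --image-repository, the pulled images must carry the k8s.gcr.io names kubeadm expects, so a common follow-up is to pull from the mirror and retag. The sketch below only prints the docker commands for the v1.19.6 list above; pipe its output to sh to actually execute them:

```shell
#!/usr/bin/env sh
# Sketch: print pull/retag commands mapping the aliyun mirror names
# back to the k8s.gcr.io names that kubeadm expects.
MIRROR=registry.cn-hangzhou.aliyuncs.com/google_containers
for img in kube-apiserver:v1.19.6 kube-controller-manager:v1.19.6 \
           kube-scheduler:v1.19.6 kube-proxy:v1.19.6 \
           pause:3.2 etcd:3.4.13-0 coredns:1.7.0; do
  echo "docker pull $MIRROR/$img"
  echo "docker tag $MIRROR/$img k8s.gcr.io/$img"
done
```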
# 2. Check the environment variables
echo MASTER_IP=${MASTER_IP} && echo APISERVER_NAME=${APISERVER_NAME} && echo POD_SUBNET=${POD_SUBNET}
Tip: before re-initializing the master node, run kubeadm reset -f first.
Pending / ImagePullBackOff
Problem description:
1. The output shows ImagePullBackOff, or a Pod stays Pending for a long time
$ kubectl get pods calico-node-4vql2 -n kube-system -o wide
NAME                READY   STATUS             RESTARTS   AGE     IP       NODE   NOMINATED NODE   READINESS GATES
calico-node-4vql2   0/1     ImagePullBackOff   0          7m22s   <none>   node   <none>           <none>
NAME                     READY   STATUS             RESTARTS   AGE
coredns-94d74667-6dj45   0/1     ImagePullBackOff   0          12m
calico-node-4vql2        0/1     Pending            0          12m
Solution:
# (1) Use get pods to find which node the pod was scheduled to, and determine the container images the Pod uses:
kubectl get pods calico-node-4vql2 -n kube-system -o yaml | grep image:
- image: calico/node:v3.13.1
- image: calico/cni:v3.13.1
- image: calico/pod2daemon-flexvol:v3.13.1
kubectl get pods coredns-94d74667-6dj45 -n kube-system -o yaml | grep image:
- image: registry.aliyuncs.com/google_containers/coredns:1.3.1
# (2) Run docker pull on the node the Pod is on (this also works when the Node is NotReady, though it is not the only way)
docker pull calico/node:v3.13.1
docker pull calico/cni:v3.13.1
docker pull calico/pod2daemon-flexvol:v3.13.1
docker pull registry.aliyuncs.com/google_containers/coredns:1.3.1
# (3) Then check on the master node that the status is back to normal
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-node-4vql2 1/1 Running 0 36m 10.10.107.192 node <none> <none>
2. A Pod in the output stays in the ContainerCreating, PodInitializing, or Init:0/3 state for a long time.
Solution:
# (1) Check the Pod's status
kubectl describe pods -n kube-system calico-node-4vql2
kubectl describe pods -n kube-system coredns-8567978547-bmd9f
# (2) If the last line of the output shows Pulling image, just wait patiently
Normal Pulling 44s kubelet, k8s-worker-02 Pulling image "calico/pod2daemon-flexvol:v3.13.1"
# (3) Otherwise delete the Pod; the system will automatically create a new one
kubectl delete pod kube-flannel-ds-amd64-8l25c -n kube-system
1. A worker node cannot reach the apiserver
If the master node can reach the apiserver but a worker node cannot, check your network settings: is /etc/hosts configured correctly? Is a security group or firewall in the way?
# verify on the master node
curl -ik https://localhost:6443
# verify on a worker node
curl -ik https://apiserver.weiyi:6443
# normal output looks like this:
HTTP/1.1 403 Forbidden
Cache-Control: no-cache, private
Content-Type: application/json
X-Content-Type-Options: nosniff
Date: Fri, 15 Nov 2019 04:34:40 GMT
Content-Length: 233
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
...
2. The worker node's default network interface
3. The token generated on the master has expired (tokens are valid for 2 hours); create a new one with kubeadm token create
Error message:
kubectl apply -f calico-3.13.1.yaml
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Cause: after initialization, /etc/kubernetes/admin.conf was not copied into the user's home directory as /root/.kube/config.
Solution:
# (1) Set up the cluster-access config file for a regular user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# (2) Set KUBECONFIG automatically on login and enable k8s command completion
grep -q "export KUBECONFIG" ~/.profile || echo "export KUBECONFIG=$HOME/.kube/config" >> ~/.profile  # append only if missing
tee -a ~/.profile <<'EOF'
source <(kubectl completion bash)
source <(kubeadm completion bash)
# source <(helm completion bash)
EOF
source ~/.profile
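Appending the export line to ~/.profile should be idempotent, otherwise each run adds a duplicate line. A minimal sketch of such an append-once guard (the function name and the temp file are stand-ins):

```shell
#!/usr/bin/env sh
# Sketch: append a line to a file only if an identical line is not
# already present (-x: whole-line match, -F: fixed string, not regex).
append_once() {
  line=$1; file=$2
  grep -qxF "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

f=$(mktemp)
append_once 'export KUBECONFIG=$HOME/.kube/config' "$f"
append_once 'export KUBECONFIG=$HOME/.kube/config' "$f"  # no-op the second time
grep -c "KUBECONFIG" "$f"  # → 1
rm -f "$f"
```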
PS: if you run kubeadm as a regular user when initializing or joining the cluster, prefix the command with sudo (sudo kubeadm init ...) to elevate privileges; otherwise you will hit the error [ERROR IsPrivilegedUser]: user is not running as root.
Error message:
systemctl status kubelet
6月 23 09:04:02 master-01 kubelet[8085]: E0623 09:04:02.186893 8085 kubelet.go:2187] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady mes...ninitialized
6月 23 09:04:04 master-01 kubelet[8085]: W0623 09:04:04.938700 8085 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
Cause: after the master initialization failed, init was run again without a reset (or the reset was incomplete); another possibility is that no network add-on (e.g. flannel or calico) has been installed.
Solution: run the following commands to reset the initialization state, then initialize again:
systemctl stop kubelet
docker stop $(docker ps -aq)
docker rm -f $(docker ps -aq)
systemctl stop docker
kubeadm reset
rm -rf $HOME/.kube /etc/kubernetes
rm -rf /var/lib/cni/ /etc/cni/ /var/lib/kubelet/*
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
systemctl start docker
systemctl start kubelet
# Install the calico network add-on (non-HA)
rm -f calico-3.13.1.yaml
wget -L https://kuboard.cn/install-script/calico/calico-3.13.1.yaml
kubectl apply -f calico-3.13.1.yaml
kubeadm reset cannot reset the node and reports retrying of unary invoker failed
Error message:
[reset] Removing info for node "master-01" from the ConfigMap "kubeadm-config" in the "kube-system" Namespace
{"level":"warn","ts":"2020-06-23T09:10:30.074+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-174bf993-5731-4b29-9b30-7e958ade79a4/10.10.107.191:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
Cause: the etcd container was still running before the reset, which blocked the node reset.
Solution: stop all containers and the docker service, then run the node reset again:
docker stop $(docker ps -aq) && systemctl stop docker
Problem description:
kubeadm init --config=kubeadm-config.yaml --upload-certs
[init] Using Kubernetes version: v1.18.4
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR ImagePull]: failed to pull image registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.4: output: Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.4 not found: manifest unknown: manifest unknown
, error: exit status 1
Cause: images cannot be downloaded from the official k8s.gcr.io site, and the mirror registry.cn-hangzhou.aliyuncs.com/google_containers/ does not carry the dependency images for the requested k8s version.
Solution: try another mirror, or import the image bundle into docker offline (see the earlier note 2-Kubernetes入门手动安装部署). Before running the command above, it is advisable to run kubeadm config images pull --image-repository mirrorgcrio --kubernetes-version=1.18.4 to check whether the images can be pulled.
# common mirrors of k8s.gcr.io
# gcr.azk8s.cn/google_containers/ # no longer available
registry.aliyuncs.com/google_containers/
registry.cn-hangzhou.aliyuncs.com/google_containers/
# mirror of k8s.gcr.io in a harbor registry
mirrorgcrio
Problem description:
PING gateway-example.example.svc.cluster.local (10.105.141.232) 56(84) bytes of data.
From 172.17.76.171 (172.17.76.171) icmp_seq=1 Time to live exceeded
From 172.17.76.171 (172.17.76.171) icmp_seq=2 Time to live exceeded
Cause: in Kubernetes networking a Service simply cannot be pinged. Kubernetes only generates a virtual IP for the Service, implemented through one of three proxy modes: userspace, iptables, or IPVS.
Whichever proxy mode is used, there is no real entity behind the Service IP to answer ICMP (Internet Control Message Protocol); the Service can, however, be reached with curl or telnet.
Resolution:
$ kubectl cluster-info
# Kubernetes master is running at https://k8s.weiyigeek.top:6443
# KubeDNS is running at https://k8s.weiyigeek.top:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Error message:
$ sudo kubeadm init --config=/home/weiyigeek/k8s-init/kubeadm-init-config.yaml --upload-certs | tee kubeadm_init.log
[init] Using Kubernetes version: v1.19.6
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
Cause: the machine joining the cluster has not disabled its swap partition
weiyigeek@weiyigeek-107:~$ free
total used free shared buff/cache available
Mem: 8151908 299900 7270588 956 581420 7600492
Swap: 4194300 0 4194300
Solution: disable the swap partition
sudo swapoff -a && sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab && free # CentOS
sudo swapoff -a && sudo sed -i 's/^\/swap.img\(.*\)$/#\/swap.img \1/g' /etc/fstab && free # Ubuntu
total used free shared buff/cache available
Mem: 8151908 304428 7260196 956 587284 7595204
Swap: 0 0 0
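The sed expressions above rewrite /etc/fstab in place, so it is worth previewing what they match first. A dry run of the CentOS-style expression against a made-up sample fstab:

```shell
#!/usr/bin/env sh
# Sketch: dry-run the swap-commenting sed against a sample fstab
# (the UUIDs are invented) to confirm only the swap line is touched.
cat > /tmp/fstab.sample <<'EOF'
UUID=aaaa-bbbb /    ext4 defaults 0 1
UUID=cccc-dddd swap swap defaults 0 0
EOF
sed '/ swap / s/^\(.*\)$/#\1/g' /tmp/fstab.sample
# → UUID=aaaa-bbbb /    ext4 defaults 0 1
# → #UUID=cccc-dddd swap swap defaults 0 0
rm -f /tmp/fstab.sample
```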
Environment: OS: Ubuntu 20.04 / K8s: 1.19.3 / docker: 19.03.13 / flannel: v0.13.0
Error message: 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
Cause: the flannel network add-on was not installed after kubeadm init finished.
Resolution: install and deploy the flannel network add-on:
$ kubectl get pod --all-namespaces
# NAMESPACE NAME READY STATUS RESTARTS AGE
# kube-system coredns-6c76c8bb89-8cgjz 0/1 Pending 0 99s
# kube-system coredns-6c76c8bb89-wgbs9 0/1 Pending 0 99s
$ kubectl describe pod -n kube-system coredns-6c76c8bb89-8cgjz
...
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Warning FailedScheduling 39s (x2 over 39s) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
# podsecuritypolicy.policy/psp.flannel.unprivileged created
# clusterrole.rbac.authorization.k8s.io/flannel created
# clusterrolebinding.rbac.authorization.k8s.io/flannel created
# serviceaccount/flannel created
# configmap/kube-flannel-cfg created
# daemonset.apps/kube-flannel-ds created
$ kubectl get pod --all-namespaces
# NAMESPACE NAME READY STATUS RESTARTS AGE
# kube-system coredns-6c76c8bb89-8cgjz 1/1 Running 0 5m12s
# kube-system coredns-6c76c8bb89-wgbs9 1/1 Running 0 5m12s
$ kubectl get node
# NAME STATUS ROLES AGE VERSION
# ubuntu Ready master 30m v1.19.3
Environment: OS: Ubuntu 20.04 / K8s: 1.19.3 / docker: 19.03.13 / flannel: v0.13.0
Error message: rpc error: code = Unknown desc = [failed to set up sandbox container "355.....4ec7" network for pod "coredns-": networkPlugin cni failed to set up pod "coredns-6c76c8bb89-6xgjl_kube-system"
Cause: the CNI network settings used at kubeadm init were wrong.
Resolution: re-run kubeadm init and verify that serviceSubnet is 10.96.0.0/12.
# Resource status
weiyigeek@ubuntu:~$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6c76c8bb89-87zh7 0/1 ContainerCreating 0 18h
coredns-6c76c8bb89-p68x8 0/1 ContainerCreating 0 18h
etcd-ubuntu 1/1 Running 0 18h
kube-apiserver-ubuntu 1/1 Running 0 18h
kube-controller-manager-ubuntu 1/1 Running 0 18h
kube-proxy-22t2f 1/1 Running 0 17h
kube-proxy-wcjrv 1/1 Running 0 18h
kube-scheduler-ubuntu 1/1 Running 0 18h
# Delete the pods so they get rebuilt
weiyigeek@ubuntu:~$ kubectl delete pod -n kube-system coredns-6c76c8bb89-87zh7 coredns-6c76c8bb89-p68x8
pod "coredns-6c76c8bb89-87zh7" deleted
pod "coredns-6c76c8bb89-p68x8" deleted
dial tcp 10.96.0.1:443: connect: no route to host
Error message:
dial tcp 10.96.0.1:443: i/o timeout
dial tcp 10.96.0.1:443: connect: no route to host
Causes:
- the coredns Pod did not start properly
- the calico network add-on is not installed, or the calico-kube-controllers Pod did not start properly
Solution: check the corresponding error messages and fix accordingly:
~$ kubectl get pod -n kube-system | grep -E "calico|coredns"
~$ curl http://10.96.0.1:443
Client sent an HTTP request to an HTTPS server.
Error message:
sudo kubeadm init
[init] Using Kubernetes version: v1.21.0
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
Cause: ERROR Swap means the operating system has not turned off its swap partition; the WARNING IsDockerSystemdCheck warning means the docker cgroup driver is not systemd.
Solution:
# 1. Turn off the swap partition
sudo swapoff -a && sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab && free # CentOS
sudo swapoff -a && sudo sed -i 's/^\/swap.img\(.*\)$/#\/swap.img \1/g' /etc/fstab && free #Ubuntu
# 2. Change docker's cgroup driver to systemd
cat /etc/docker/daemon.json
{
"registry-mirrors": [
"https://registry.cn-hangzhou.aliyuncs.com"
],
"max-concurrent-downloads": 10,
"log-driver": "json-file",
"log-level": "warn",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"exec-opts": ["native.cgroupdriver=systemd"],
"storage-driver": "overlay2",
"insecure-registries": ["harbor.weiyigeek", "harbor.weiyi", "harbor.cloud"],
"data-root":"/home/data/docker/"
}
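A stray comma or quote in /etc/docker/daemon.json will stop dockerd from starting at all, so it pays to syntax-check the file before restarting docker. A sketch that validates a sample copy (not the live file):

```shell
#!/usr/bin/env sh
# Sketch: JSON-syntax-check a daemon.json before `systemctl restart docker`.
cat > /tmp/daemon.json.sample <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2"
}
EOF
if python3 -m json.tool /tmp/daemon.json.sample > /dev/null 2>&1; then
  echo "daemon.json: valid JSON"
else
  echo "daemon.json: SYNTAX ERROR, fix it before restarting docker"
fi
rm -f /tmp/daemon.json.sample
```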
Error message:
Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0 not found: manifest unknown: manifest unknown
Error response from daemon: No such image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Error: No such image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Solution: search for and pull the image from Docker Hub (https://hub.docker.com/), then retag it:
docker pull coredns/coredns:1.8.0
docker tag coredns/coredns:1.8.0 registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Description: Pending means the Pod has not been scheduled onto any Node yet. Use kubectl describe pod <pod-name> to view the Pod's events and work out why it has not been scheduled.
Error message:
$ kubectl describe pod mypod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12s (x6 over 27s) default-scheduler 0/4 nodes are available: 2 Insufficient cpu.
Possible causes:
- Insufficient resources: no Node in the cluster satisfies the CPU, memory, GPU, or ephemeral-storage requests of the Pod. The fix is to delete unused Pods or add new Nodes.
- The HostPort is already in use; exposing the service port via a Service is usually recommended instead.
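For the insufficient-resources case, what the scheduler matches against node allocatable is the Pod's requests; a sketch of such a spec (every name and value below is an example):

```yaml
# Sketch: a Pod stays Pending if no node can satisfy these requests.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
  - name: app
    image: nginx:1.19
    resources:
      requests:
        cpu: "250m"     # must fit within some node's allocatable CPU
        memory: "128Mi"
```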
Description: as before, start with kubectl describe pod to view the Pod's events.
Error message:
$ kubectl -n kube-system describe pod nginx-pod
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Normal Scheduled 1m default-scheduler Successfully assigned nginx-pod to node1
# Normal SuccessfulMountVolume 1m kubelet, gpu13 MountVolume.SetUp succeeded for volume "config-volume"
# Normal SuccessfulMountVolume 1m kubelet, gpu13 MountVolume.SetUp succeeded for volume "coredns-token-sxdmc"
# Warning FailedSync 2s (x4 over 46s) kubelet, gpu13 Error syncing pod
# Normal SandboxChanged 1s (x4 over 46s) kubelet, gpu13 Pod sandbox changed, it will be killed and re-created.
Cause: the cni0 bridge was configured with an IP address from a different subnet; deleting the bridge (the network add-on recreates it automatically) fixes the problem.
# The Pod's sandbox container cannot start normally; the concrete reason is in the kubelet logs:
$ journalctl -u kubelet
...
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649912 29801 cni.go:294] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649941 29801 cni.go:243] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Mar 14 04:22:04 node1 kubelet[29801]: W0314 04:22:04.891337 29801 cni.go:258] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "c4fd616cde0e7052c240173541b8543f746e75c17744872aa04fe06f52b5141c"
Mar 14 04:22:05 node1 kubelet[29801]: E0314 04:22:05.965801 29801 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "nginx-pod" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
Solution:
$ ip link set cni0 down
$ brctl delbr cni0
Other possible causes:
- Image pull failure, e.g.:
  - a wrong image name was configured
  - the kubelet cannot reach the registry (pulling from gcr.io inside China needs special handling)
  - the pull secret for a private registry is misconfigured
  - the image is too large and the pull times out (consider raising the kubelet's --image-pull-progress-deadline and --runtime-request-timeout options)
- CNI network errors; check the CNI add-on's configuration, e.g.:
  - the Pod network cannot be configured
  - no IP address can be allocated
- The container itself cannot start; check that the right image was packaged and the right container arguments are configured
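For the pull-timeout case, the two kubelet options mentioned above can be supplied through a systemd drop-in; a sketch (the drop-in path and the values are examples, and these flags are deprecated in newer kubelet releases in favor of the config file):

```ini
# /etc/systemd/system/kubelet.service.d/20-pull-timeout.conf (example path)
[Service]
Environment="KUBELET_EXTRA_ARGS=--image-pull-progress-deadline=5m --runtime-request-timeout=10m"
```

Apply with systemctl daemon-reload && systemctl restart kubelet.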
Description: this is usually caused by a wrong image name or a misconfigured pull secret for a private registry. Use docker pull <image> to verify whether the image can be pulled normally.
Error message:
$ kubectl describe pod mypod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 36s default-scheduler
Normal Pulling 17s (x2 over 33s) kubelet, k8s-agentpool1-38622806-0 pulling image "a1pine"
Warning Failed 14s (x2 over 29s) kubelet, k8s-agentpool1-38622806-0 Failed to pull image "a1pine": rpc error: code = Unknown desc = Error response from daemon: repository a1pine not found: does not exist or no pull access
Warning Failed 14s (x2 over 29s) kubelet, k8s-agentpool1-38622806-0 Error: ErrImagePull
Normal SandboxChanged 4s (x7 over 28s) kubelet, k8s-agentpool1-38622806-0 Pod sandbox changed, it will be killed and re-created.
Normal BackOff 4s (x5 over 25s) kubelet, k8s-agentpool1-38622806-0 Back-off pulling image "a1pine"
Warning Failed 1s (x6 over 25s) kubelet, k8s-agentpool1-38622806-0 Error: ImagePullBackOff
Solution:
# 1. For a private image, first create a Secret of type docker-registry
kubectl create secret docker-registry my-secret --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
# 2. Then reference this Secret in the Pod spec
spec:
  containers:
  - name: private-reg-container
    image: <your-private-image>
  imagePullSecrets:
  - name: my-secret
Description: the CrashLoopBackOff state means the container started but then exited abnormally. The Pod's restart count is usually greater than 0; start by checking the container logs.
Possible causes:
* the container process exited
* a failed health check caused the exit
* OOMKilled
$ kubectl describe pod mypod
...
Containers:
sh:
Container ID: docker://3f7a2ee0e7e0e16c22090a25f9b6e42b5c06ec049405bc34d3aa183060eb4906
Image: alpine
Image ID: docker-pullable://alpine@sha256:7b848083f93822dd21b0a2f14a110bd99f6efb4b838d499df6d04a49d0debf8b
Port: <none>
Host Port: <none>
State: Terminated
Reason: OOMKilled
Exit Code: 2
Last State: Terminated
Reason: OOMKilled
Exit Code: 2
Ready: False
Restart Count: 3
Limits:
cpu: 1
memory: 1G
Requests:
cpu: 100m
memory: 500M
...
* If there is still no clue at this point, you can run commands inside the container to investigate the exit further
kubectl exec cassandra -- cat /var/log/cassandra/system.log
* If that still turns up nothing, SSH into the Node the Pod runs on and go through the kubelet or Docker logs
# Query Node
kubectl get pod <pod-name> -o wide
# SSH to Node
ssh <username>@<node-name>
A Pod in the Error state usually means something went wrong while it was starting. Common causes include:
- a ConfigMap, Secret, or PV the Pod depends on does not exist
- the requested resources exceed limits set by the administrator, e.g. a LimitRange
- the Pod violates a cluster security policy, e.g. a PodSecurityPolicy
- the container lacks permission to operate on cluster resources; e.g. with RBAC enabled, the ServiceAccount needs a role binding
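For the RBAC case, binding a Role to the Pod's ServiceAccount looks roughly like this (every name below is a placeholder):

```yaml
# Sketch: grant a ServiceAccount read-only access to Pods.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-app           # the Pod's serviceAccountName
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```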
Since v1.5, Kubernetes no longer deletes Pods running on a Node that has lost contact; it marks them Terminating or Unknown instead. There are three ways to remove Pods in these states:
- Delete the Node from the cluster. On public clouds, kube-controller-manager deletes the corresponding Node automatically once the VM is removed; on bare-metal clusters the administrator must delete the Node manually (kubectl delete node <node-name>).
- Let the Node recover. The kubelet re-syncs with kube-apiserver to confirm the expected state of these Pods, then decides whether to delete them or keep them running.
- Force-delete them with kubectl delete pods <pod> --grace-period=0 --force. Unless you know for certain that the Pod really is stopped (e.g. the VM or physical machine is powered off), this is not recommended; for Pods managed by a StatefulSet in particular, force deletion can cause split-brain or data loss.
If the kubelet itself runs as a Docker container, its log may contain errors like the following:
{"log":"E0926 19:59:39.977461 54420 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\\\" (\\\"30f3ffec-a29f-11e7-b693-246e9607517c\\\")\" failed. No retries permitted until 2017-09-26 19:59:41.977419403 +0800 CST (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") : remove /var/lib/kubelet/pods/30f3ffec-a29f-11e7-b693-246e9607517c/volumes/kubernetes.io~secret/default-token-6tpnm: device or resource busy\n","stream":"stderr","time":"2017-09-26T11:59:39.977728079Z"}
In that case, pass the kubelet container the --containerized flag and mount the following volumes:
# example for the calico network add-on
-v /:/rootfs:ro,shared \
-v /sys:/sys:ro \
-v /dev:/dev:rw \
-v /var/log:/var/log:rw \
-v /run/calico/:/run/calico/:rw \
-v /run/docker/:/run/docker/:rw \
-v /run/docker.sock:/run/docker.sock:rw \
-v /usr/lib/os-release:/etc/os-release \
-v /usr/share/ca-certificates/:/etc/ssl/certs \
-v /var/lib/docker/:/var/lib/docker:rw,shared \
-v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
-v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
-v /etc/kubernetes/config/:/etc/kubernetes/config/ \
-v /etc/cni/net.d/:/etc/cni/net.d/ \
-v /opt/cni/bin/:/opt/cni/bin/ \
A Pod stuck in Terminating is normally removed automatically once the kubelet is running again. Sometimes, though, it cannot be deleted even with kubectl delete pods <pod> --grace-period=0 --force. That is usually caused by finalizers; removing them via kubectl edit resolves it:
"finalizers": [
"foregroundDeletion"
]
The kubelet uses inotify to watch the /etc/kubernetes/manifests directory (configurable with the kubelet's --pod-manifest-path option) for static Pod changes and recreates the corresponding Pod when a file changes. Occasionally, editing a static Pod's manifest does not trigger a new Pod; a simple fix is to restart the kubelet.
A Namespace stuck in Terminating usually has one of two causes:
- resources in the Namespace are still being deleted
- the Namespace's Finalizer was not cleaned up properly
For the first case, list every remaining resource with:
kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n $NAMESPACE
For the second case, clear the Namespace's finalizer list manually:
kubectl get namespaces $NAMESPACE -o json | jq '.spec.finalizers=[]' > /tmp/ns.json
kubectl proxy &
curl -k -H "Content-Type: application/json" -X PUT --data-binary @/tmp/ns.json http://127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize
Error message:
Warning Failed 2m19s (x4 over 3m4s) kubelet Error: failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "107374182400000": write /sys/fs/cgroup/cpu,cpuacct/system.slice/containerd.service/kubepods-burstable-pod6e586bca_1fd9_412a_892c_a77b38d7f3ec.slice:cri-containerd:app/cpu.cfs_quota_us: invalid argument: unknown
resources:
requests:
memory: "512Mi"
cpu: "100m"
limits:
memory: "2048Mi"
cpu: "1000m"
Cause: cpu has no Gi unit; a cpu limit written with a Gi suffix produces the invalid cfs_quota value seen in the error. Use millicores (m) or whole cores, as in the resources block above.
Problem 1. MountVolume.SetUp failed for volume "default-token-zglkd" : failed to sync secret cache: timed out waiting for the condition
Reproduction:
~/K8s/Day8/demo7$ kubectl get pod
NAME READY STATUS RESTARTS AGE
web-pvc-demo-0 0/1 ContainerCreating 0 58s
~/K8s/Day8/demo7$ kubectl describe pod web-pvc-demo-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m11s default-scheduler Successfully assigned default/web-pvc-demo-0 to k8s-node-5
Warning FailedMount 2m11s kubelet MountVolume.SetUp failed for volume "default-token-zglkd" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 9s kubelet Unable to attach or mount volumes: unmounted volumes=[diskpv], unattached volumes=[diskpv default-token-zglkd]: timed out waiting for the condition
Cause: Kubernetes caches MountVolume state, so a PV whose binding was deleted cannot simply be mounted again.
Solution: delete the PV and PVC that fail to mount; if that does not help, restart the cluster.
Problem 2. With NFS dynamic provisioning, a PVC stays Pending after creation, waiting for a volume to be created either by the external provisioner "fuseim.pri/ifs" or manually by the system administrator.
Environment: k8s (v1.23.1)
Reproduction: kubectl describe shows the message waiting for a volume to be created, either by external provisioner "fuseim.pri/ifs" or manually created by system administrator; kubectl logs on the nfs-client-provisioner pod shows unexpected error getting claim reference: selfLink was empty, can't make reference.
Cause: the SelfLink field on ObjectMeta and ListMeta was deprecated in v1.16 and has been disabled by default since v1.20 (it can still be restored via a feature gate).
Resolution: on the k8s master, find kube-apiserver.yaml and add - --feature-gates=RemoveSelfLink=false to its command list (or add the flag to the corresponding systemd unit), then restart:
/etc/kubernetes/manifests/kube-apiserver.yaml
- --feature-gates=RemoveSelfLink=false
Problem 3. Kubelet fails to start with Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
Reproduction:
E0704 15:20:03.875017 7912 kubelet.go:1292] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
E0704 15:20:03.920105 7912 kubelet.go:1853] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
Solution: since the cgroupdriver in use is systemd, upgrade systemd:
yum update -y
Joining an extra host to the k8s cluster fails with failure loading certificate for CA: couldn't load the certificate file.
Error message: when building an HA cluster and joining another master to the current one, the following error appears:
failure loading certificate for CA: couldn't load the certificate file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no such file or directory
Cause: the new node is missing the CA certificates from the cluster's pki directory.
Solution: copy them from an existing master:
scp -rp /etc/kubernetes/pki/ca.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/sa.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/front-proxy-ca.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/etcd/ca.* master02:/etc/kubernetes/pki/etcd
scp -rp /etc/kubernetes/admin.conf master02:/etc/kubernetes