This post records the problems encountered after kubeadm reset, while reinstalling the cluster and joining and managing nodes. The cleanup commands typically run right after kubeadm reset are:
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
ipvsadm --clear
systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/*
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/*
rm -rf $HOME/.kube/config
systemctl start docker
Flush iptables or IPVS according to your setup. If a later kubeadm join complains that some other directory still contains files, you can simply rm -rf the directory it names.
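For example, the kubeadm join preflight checks often flag leftovers under /etc/kubernetes; the exact paths depend on the error you get, so treat the following as a hedged sketch and only delete what the preflight output actually names:
# leftovers commonly flagged by kubeadm join preflight (adjust to your error message)
rm -rf /etc/kubernetes/manifests/*
rm -f /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/pki/ca.crt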
After kubeadm reset, when you start reinstalling the cluster, you will hit many calico-related errors, including but not limited to the logs below.
The following logs appeared after running kubectl apply -f calico.yaml.
Calico BGP errors
Warning Unhealthy pod/calico-node-k6tz5 Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
Warning Unhealthy pod/calico-node-k6tz5 Liveness probe failed: calico/node is not ready: bird/confd is not live: exit status 1
Warning BackOff pod/calico-node-k6tz5 Back-off restarting failed container
One suggested fix is to pin calico to a specific network interface (see the linked reference), but the error persisted even after changing the interface; go to the kube-proxy fix in 2.2 instead (the root cause is kube-proxy).
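Interface pinning is normally done through the IP_AUTODETECTION_METHOD environment variable on the calico-node DaemonSet; the interface pattern ens.* below is only an assumed example and must be adjusted to your hosts:
# tell calico-node which interface to auto-detect its IP from (pattern is an assumption)
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=ens.*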
dial tcp 10.96.0.1:443: connect: connection refused error
Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: connection refused
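10.96.0.1:443 is the ClusterIP of the default kubernetes Service, and reaching it depends on kube-proxy having programmed the iptables/IPVS rules on each node; a quick sanity check (assuming a standard kubeadm layout) is:
kubectl get svc kubernetes                                       # should list the 10.96.0.1 ClusterIP
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide    # kube-proxy should be Running on every node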
Calico liveness and readiness probe errors
Warning Unhealthy 69m (x2936 over 12d) kubelet Readiness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Warning Unhealthy 57m (x2938 over 12d) kubelet Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Warning Unhealthy 12m (x6 over 13m) kubelet Liveness probe failed: container is not running
Normal SandboxChanged 11m (x2 over 13m) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 11m (x2 over 13m) kubelet Stopping container calico-node
Warning Unhealthy 8m3s (x32 over 13m) kubelet Readiness probe failed: container is not running
Warning Unhealthy 4m45s (x6 over 5m35s) kubelet Liveness probe failed: container is not running
Normal SandboxChanged 3m42s (x2 over 5m42s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 3m42s (x2 over 5m42s) kubelet Stopping container calico-node
Warning Unhealthy 42s (x31 over 5m42s) kubelet Readiness probe failed: container is not running
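Events like the ones above come from the pod's event stream; they can be inspected with kubectl describe or kubectl get events (the pod name here is simply the one from the earlier log and should be replaced with yours):
kubectl -n kube-system describe pod calico-node-k6tz5
kubectl -n kube-system get events --field-selector involvedObject.name=calico-node-k6tz5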
kube-proxy errors include Failed to retrieve node info: Unauthorized, Failed to list *v1.Endpoints: Unauthorized, and so on; the full log is as follows:
W0430 12:33:28.887260 1 server_others.go:267] Flag proxy-mode="" unknown, assuming iptables proxy
W0430 12:33:28.913671 1 node.go:113] Failed to retrieve node info: Unauthorized
I0430 12:33:28.915780 1 server_others.go:147] Using iptables Proxier.
W0430 12:33:28.916065 1 proxier.go:314] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
W0430 12:33:28.916089 1 proxier.go:319] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0430 12:33:28.917555 1 server.go:555] Version: v1.14.1
I0430 12:33:28.959345 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0430 12:33:28.960392 1 config.go:202] Starting service config controller
I0430 12:33:28.960444 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I0430 12:33:28.960572 1 config.go:102] Starting endpoints config controller
I0430 12:33:28.960609 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
E0430 12:33:28.970720 1 event.go:191] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fh-ubuntu01.159a40901fa85264", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"fh-ubuntu01", UID:"fh-ubuntu01", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"fh-ubuntu01"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf2a2e0639406264, ext:334442672, loc:(*time.Location)(0x2703080)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Unauthorized' (will not retry!)
E0430 12:33:28.970939 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:28.971106 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:29.977038 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0430 12:33:29.979890 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0430 12:33:30.980098 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
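The log above comes from the kube-proxy pods; assuming the standard kubeadm label, it can be pulled with:
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=100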
Delete the kube-proxy secret (see the linked reference).
After kubeadm reset, the kube-proxy pods are still using the secret generated by the previous cluster. Delete that secret and a new one is generated automatically, then delete the kube-proxy pods so they restart and pick it up. The root cause is that, once kubeadm has been re-run, the certificates saved in the cluster no longer match the newly generated ones.
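A hedged sketch of that procedure; the token secret name ends in a random suffix, so look it up first instead of copying the placeholder below:
kubectl -n kube-system get secret | grep kube-proxy             # find the kube-proxy token secret
kubectl -n kube-system delete secret kube-proxy-token-XXXXX     # placeholder name, use the one listed above
kubectl -n kube-system delete pod -l k8s-app=kube-proxy         # pods restart and mount the regenerated secret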
Once the kube-proxy problem is fixed, you will find that the calico problems in 2.1 have resolved themselves.
When joining the cluster, kube-apiserver reports an x509 certificate error
kube-apiserver[16692]: E0211 14:34:11.507411 16692 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kubernetes\")
Delete $HOME/.kube (see the linked reference).
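After removing the stale directory, the usual follow-up on a control-plane node (per the standard kubeadm setup steps) is to copy the freshly generated admin.conf back, sketched here:
rm -rf $HOME/.kube
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config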
The problems in 2.1 are essentially caused by 2.2: deal with the stale secret inside the cluster (2.2) first, and the calico problems in 2.1 will resolve themselves.
The notes above cover only one angle on these problems and are recorded for reference only.