Kubernetes for R&D Engineers: An Experiment on Recovery After Node Failure

方亮
Published 2023-06-04 14:25:50

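In "Kubernetes for R&D Engineers: An Experiment on Pod Scheduling After Node Failure" we saw that Kubernetes keeps waiting for a failed Node to come back. In this installment we look at how Kubernetes behaves once the Node has recovered, reusing the setup from that earlier experiment.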

Before the Node fails

Before the Node is stopped, the components are in the following state:

kubectl get deployments.apps -o wide --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS        IMAGES   SELECTOR
nginx-deployment   1/1     1            1           20s   nginx-container   nginx    app=nginx
kubectl get pod --watch -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
nginx-deployment-8f788645b-qszpv   1/1     Running   0          10s   10.1.209.134   ubuntub   <none>           <none>
kubectl get node --watch -o wide
NAME      STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
ubuntub   Ready    <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntua   Ready    <none>   2d7h   v1.26.4   172.23.71.113   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntud   Ready    <none>   24h    v1.26.4   172.23.74.199   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntuc   Ready    <none>   2d5h   v1.26.4   172.23.67.151   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntue   Ready    <none>   24h    v1.26.4   172.23.73.117   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15

The Pod is deployed on UbuntuB.

While the Node is down

Now we shut down Node UbuntuB. Log in to that machine and run:

sudo poweroff

Then check the state of the same components again:

kubectl get deployments.apps -o wide --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS        IMAGES   SELECTOR
nginx-deployment   1/1     1            1           20s   nginx-container   nginx    app=nginx
nginx-deployment   0/1     1            0           5m38s   nginx-container   nginx    app=nginx

The output shows that the Deployment has noticed the Pod is unavailable: AVAILABLE has dropped to 0 and READY to 0/1.

kubectl get pod --watch -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
nginx-deployment-8f788645b-qszpv   1/1     Running   0          10s   10.1.209.134   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   1/1     Running   0          5m38s   10.1.209.134   ubuntub   <none>           <none>

The Pod listing, however, still reports the Pod as Running.

kubectl get node --watch -o wide
NAME      STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
ubuntub   Ready    <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntua   Ready    <none>   2d7h   v1.26.4   172.23.71.113   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntud   Ready    <none>   24h    v1.26.4   172.23.74.199   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntuc   Ready    <none>   2d5h   v1.26.4   172.23.67.151   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntue   Ready    <none>   24h    v1.26.4   172.23.73.117   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   Ready    <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntua   Ready    <none>   2d7h   v1.26.4   172.23.71.113   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntud   Ready    <none>   24h    v1.26.4   172.23.74.199   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntuc   Ready    <none>   2d5h   v1.26.4   172.23.67.151   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntue   Ready    <none>   24h    v1.26.4   172.23.73.117   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15

The Node listing keeps reporting UbuntuB as NotReady.

After the Node recovers

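The Deployment waits quite a while (about five minutes in the runs below) before the Pod on the failed Node is migrated. Node recovery therefore falls into several scenarios: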

  1. Recovery before Pod migration has started
  2. Recovery while Pod migration is in progress
  3. Recovery after Pod migration has completed
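
One likely explanation for the long wait, offered here as an assumption rather than something this experiment verifies: when a Node stops reporting, Kubernetes taints it with `node.kubernetes.io/unreachable`, and Pods normally carry a default toleration for that taint with `tolerationSeconds: 300`, so eviction only begins about five minutes later. The sketch below shows one way to inspect both sides on your own cluster. The function names `node_taints` and `pod_tolerations` are hypothetical helpers, not kubectl features.

```shell
# Print the taints currently set on a Node, one "key<TAB>effect" pair per line.
node_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}{"\t"}{.effect}{"\n"}{end}'
}

# Print each toleration of a Pod as "key<TAB>effect<TAB>tolerationSeconds".
pod_tolerations() {
  kubectl get pod "$1" \
    -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.effect}{"\t"}{.tolerationSeconds}{"\n"}{end}'
}
```

For example, `node_taints ubuntub` can be run while UbuntuB is NotReady, and `pod_tolerations nginx-deployment-8f788645b-qszpv` against the Pod from this experiment.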

Recovery before Pod migration has started

We start UbuntuB again. This has to happen quickly, so that the machine is back up before Kubernetes gives up on the Pod and migrates it to another Node.

kubectl get deployments.apps -o wide --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS        IMAGES   SELECTOR
nginx-deployment   1/1     1            1           20s   nginx-container   nginx    app=nginx
nginx-deployment   0/1     1            0           5m38s   nginx-container   nginx    app=nginx
nginx-deployment   1/1     1            1           9m6s    nginx-container   nginx    app=nginx

The Deployment returns to normal shortly after the Node is back (READY becomes 1/1 again).

kubectl get pod --watch -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
nginx-deployment-8f788645b-qszpv   1/1     Running   0          10s   10.1.209.134   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   1/1     Running   0          5m38s   10.1.209.134   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Unknown   0          8m14s   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Unknown   0          8m57s   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Unknown   0          8m57s   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   1/1     Running   1 (53s ago)   9m3s    10.1.209.135   ubuntub   <none>           <none>

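After three Unknown status updates the Pod recovers as well, but its container has been restarted once (RESTARTS is now 1) and its IP has changed (from 10.1.209.134 to 10.1.209.135).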

kubectl get node --watch -o wide
NAME      STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
ubuntub   Ready    <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntua   Ready    <none>   2d7h   v1.26.4   172.23.71.113   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntud   Ready    <none>   24h    v1.26.4   172.23.74.199   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntuc   Ready    <none>   2d5h   v1.26.4   172.23.67.151   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntue   Ready    <none>   24h    v1.26.4   172.23.73.117   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   Ready    <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntua   Ready    <none>   2d7h   v1.26.4   172.23.71.113   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntud   Ready    <none>   24h    v1.26.4   172.23.74.199   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntuc   Ready    <none>   2d5h   v1.26.4   172.23.67.151   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntue   Ready    <none>   24h    v1.26.4   172.23.73.117   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.79.69    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntua   Ready      <none>   2d7h   v1.26.4   172.23.71.113   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntud   Ready      <none>   24h    v1.26.4   172.23.74.199   <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.69.88    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.69.88    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   NotReady   <none>   2d5h   v1.26.4   172.23.69.88    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   Ready      <none>   2d5h   v1.26.4   172.23.69.88    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   Ready      <none>   2d5h   v1.26.4   172.23.69.88    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15
ubuntub   Ready      <none>   2d5h   v1.26.4   172.23.69.88    <none>        Ubuntu 22.04.2 LTS   5.15.0-73-generic   containerd://1.6.15

After rebooting, the failed Node (UbuntuB) rejoins the Kubernetes cluster automatically, but its IP may change (here from 172.23.79.69 to 172.23.69.88).

Recovery after Pod migration has completed

kubectl get deployments.apps -o wide --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS        IMAGES   SELECTOR
nginx-deployment   1/1     1            1           24m   nginx-container   nginx    app=nginx
nginx-deployment   0/1     1            0           26m   nginx-container   nginx    app=nginx
nginx-deployment   0/1     0            0           31m   nginx-container   nginx    app=nginx
nginx-deployment   0/1     1            0           31m   nginx-container   nginx    app=nginx
nginx-deployment   1/1     1            1           32m   nginx-container   nginx    app=nginx

At an AGE of 31m the Deployment starts migrating the Pod, and by 32m the migration is complete.

kubectl get pod --watch -o wide
NAME                               READY   STATUS    RESTARTS      AGE   IP             NODE      NOMINATED NODE   READINESS GATES
nginx-deployment-8f788645b-qszpv   1/1     Running   1 (16m ago)   24m   10.1.209.135   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   1/1     Running   1 (17m ago)   26m   10.1.209.135   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   1/1     Running   1 (23m ago)   31m   10.1.209.135   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   1/1     Terminating   1 (23m ago)   31m   10.1.209.135   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qxxj2   0/1     Pending       0             0s    <none>         <none>    <none>           <none>
nginx-deployment-8f788645b-qxxj2   0/1     Pending       0             0s    <none>         ubuntua   <none>           <none>
nginx-deployment-8f788645b-qxxj2   0/1     ContainerCreating   0             0s    <none>         ubuntua   <none>           <none>
nginx-deployment-8f788645b-qxxj2   0/1     ContainerCreating   0             0s    <none>         ubuntua   <none>           <none>
nginx-deployment-8f788645b-qxxj2   1/1     Running             0             58s   10.1.94.69     ubuntua   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Terminating         1 (24m ago)   32m   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Terminating         1             33m   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Terminating         1             33m   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Terminating         1             33m   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qszpv   0/1     Terminating         1             33m   <none>         ubuntub   <none>           <none>

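The output above shows that Pod nginx-deployment-8f788645b-qszpv on the failed Node starts terminating (Terminating) at 31m. A replacement Pod with a new name, nginx-deployment-8f788645b-qxxj2, is then created on UbuntuA. Later, although the failed Node UbuntuB rejoins the cluster, the old Pod on it never comes back: it stays in Terminating in every later update shown above. The Node output is the same as in the previous scenario, so it is omitted here.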

Recovery while Pod migration is in progress

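In the previous run the new Pod was placed on the master Node UbuntuA, which we cannot shut down. So we delete the earlier Deployment and use the taint feature to keep subsequent Pods off UbuntuA before running this experiment. While the Deployment is deploying the replacement Pod on UbuntuC, we start the failed Node UbuntuB, so that it rejoins the cluster before the new Pod has finished deploying.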
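
The exact taint is not shown in this write-up, so the following is only a sketch of how the master could be kept out of scheduling. The key/value pair `experiment=node-recovery` and the helper names `exclude_node` and `include_node` are made up for illustration; `kubectl taint` with the `NoSchedule` effect is the standard mechanism.

```shell
# Hypothetical taint used only for this experiment (not from the original article).
TAINT="experiment=node-recovery:NoSchedule"

# Keep new Pods that lack a matching toleration off the given Node.
exclude_node() {
  kubectl taint nodes "$1" "${TAINT}"
}

# Remove the taint again once the experiment is finished (note the trailing "-").
include_node() {
  kubectl taint nodes "$1" "${TAINT}-"
}
```

Running `exclude_node ubuntua` before recreating the Deployment would keep the new Pods on the worker Nodes; `include_node ubuntua` undoes it afterwards.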

kubectl get deployments.apps -o wide --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS        IMAGES   SELECTOR
nginx-deployment   0/1     0            0           1s    nginx-container   nginx    app=nginx
nginx-deployment   0/1     0            0           1s    nginx-container   nginx    app=nginx
nginx-deployment   0/1     0            0           1s    nginx-container   nginx    app=nginx
nginx-deployment   0/1     1            0           1s    nginx-container   nginx    app=nginx
nginx-deployment   1/1     1            1           4s    nginx-container   nginx    app=nginx
nginx-deployment   0/1     1            0           68s   nginx-container   nginx    app=nginx
nginx-deployment   0/1     0            0           6m13s   nginx-container   nginx    app=nginx
nginx-deployment   0/1     1            0           6m13s   nginx-container   nginx    app=nginx
nginx-deployment   1/1     1            1           7m12s   nginx-container   nginx    app=nginx

At 6m13s the Deployment starts terminating the Pod on the failed Node and begins deploying a new Pod on another Node. At 7m12s the new Pod is fully deployed.
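
The gap between the Deployment losing its Pod (AGE 68s) and the eviction (AGE 6m13s) can be worked out from the AGE column. The sketch below converts kubectl-style ages to seconds and takes the difference; `to_seconds` and `eviction_delay` are illustrative helper names, not part of kubectl.

```shell
# Convert a kubectl-style age such as "6m13s", "68s" or "2d5h" to seconds.
to_seconds() {
  printf '%s\n' "$1" | awk '{
    total = 0; s = $0
    # Consume one <number><unit> group at a time from the front of the string.
    while (match(s, /^[0-9]+[dhms]/)) {
      n = substr(s, 1, RLENGTH - 1) + 0
      u = substr(s, RLENGTH, 1)
      if (u == "d") total += n * 86400
      else if (u == "h") total += n * 3600
      else if (u == "m") total += n * 60
      else total += n
      s = substr(s, RLENGTH + 1)
    }
    print total
  }'
}

# Seconds elapsed between two ages taken from the same watch output.
eviction_delay() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}
```

`eviction_delay 68s 6m13s` prints 305. That is close to five minutes, which would be consistent with Kubernetes' default 300-second toleration for unreachable Nodes, although this experiment does not confirm that directly.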

kubectl get pod --watch -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx-deployment-8f788645b-qd8ct   0/1     Pending   0          0s    <none>   <none>   <none>           <none>
nginx-deployment-8f788645b-qd8ct   0/1     Pending   0          0s    <none>   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   0/1     ContainerCreating   0          0s    <none>   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   0/1     ContainerCreating   0          0s    <none>   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   1/1     Running             0          3s    10.1.209.138   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   1/1     Running             0          67s   10.1.209.138   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   1/1     Running             0          6m12s   10.1.209.138   ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   1/1     Terminating         0          6m12s   10.1.209.138   ubuntub   <none>           <none>
nginx-deployment-8f788645b-p9ntd   0/1     Pending             0          0s      <none>         <none>    <none>           <none>
nginx-deployment-8f788645b-p9ntd   0/1     Pending             0          0s      <none>         ubuntuc   <none>           <none>
nginx-deployment-8f788645b-p9ntd   0/1     ContainerCreating   0          1s      <none>         ubuntuc   <none>           <none>
nginx-deployment-8f788645b-p9ntd   0/1     ContainerCreating   0          2s      <none>         ubuntuc   <none>           <none>
nginx-deployment-8f788645b-qd8ct   0/1     Terminating         0          6m49s   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-p9ntd   1/1     Running             0          59s     10.1.43.194    ubuntuc   <none>           <none>
nginx-deployment-8f788645b-qd8ct   0/1     Terminating         0          7m31s   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   0/1     Terminating         0          7m38s   <none>         ubuntub   <none>           <none>
nginx-deployment-8f788645b-qd8ct   0/1     Terminating         0          7m38s   <none>         ubuntub   <none>           <none>

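At 6m12s, Pod nginx-deployment-8f788645b-qd8ct on the failed Node UbuntuB enters the Terminating state, and a new Pod, nginx-deployment-8f788645b-p9ntd, starts deploying on Node UbuntuC. At 6m49s the recovered Node UbuntuB tries to have the Kubernetes master add its Pod nginx-deployment-8f788645b-qd8ct back to the replicas maintained by the Deployment, but by then the master has already marked that Pod for termination, so the old Pod fails to recover. At 7m11s the new Pod nginx-deployment-8f788645b-p9ntd is running.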
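
Reading a long `--watch` log for the final state of each Pod is error-prone. As a small aid, the sketch below reduces such a log to one line per Pod with its last reported status. It is a hypothetical helper that assumes the default `kubectl get pod` column order (NAME, READY, STATUS); only the first three columns are read, so a RESTARTS value such as `1 (53s ago)` does not disturb it.

```shell
# Read `kubectl get pod` watch output on stdin and print, for each Pod in
# first-seen order, its name and the last STATUS reported for it.
last_status_per_pod() {
  awk '
    $1 == "NAME" { next }                   # skip header rows
    NF >= 3 {
      if (!($1 in last)) order[++n] = $1    # remember first-seen order
      last[$1] = $3                         # keep only the latest STATUS
    }
    END { for (i = 1; i <= n; i++) print order[i], last[order[i]] }
  '
}
```

Because a watch never ends on its own, one way to use this is to save the output to a file first (say `pods.log`, a made-up name) and then run `last_status_per_pod < pods.log`.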

Summary

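To summarize: after a Node fails, Kubernetes does not remove it automatically but keeps waiting for it to become available again. Once rebooted, the failed Node automatically asks the Kubernetes master Node (UbuntuA in this example) to rejoin. If, at that point, the Deployment has not yet put the Pod from the failed Node into the Terminating state, the Pod recovers in place (its IP changes); if it has, the old Pod is discarded and the Deployment uses the newly started Pod.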

Originally published 2023-06-03 on the author's personal site/blog.
