
Recovering a k8s master after filesystem corruption

The dev team reported that a master in one of their clusters had a corrupted filesystem and would not boot. The cluster is three VMs on OpenStack, and a fault on the virtualization host corrupted the VM's filesystem. All three machines serve as both master and node. I walked them through the repair and got the machine booting again; the repair steps were the same as in my earlier article on an openSUSE rescue.

Once it was up I logged in to take a look. Since the cluster is HA, only this node was affected and the cluster itself was fine:

[root@k8s-m1 ~]# kubectl get node -o wide
NAME             STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   NotReady   <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready      <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready      <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11

Try starting docker:

[root@k8s-m1 ~]# systemctl start docker
Job for docker.service canceled.

It would not start, so list the failed services:

[root@k8s-m1 ~]# systemctl --failed
  UNIT               LOAD   ACTIVE SUB    DESCRIPTION
● containerd.service loaded failed failed containerd container runtime

Check containerd's logs:

[root@k8s-m1 ~]# journalctl -xe -u containerd
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481459735+08:00" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481472223+08:00" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481517630+08:00" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481562176+08:00" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481964349+08:00" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.481996158+08:00" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482048208+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482081110+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482096598+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482112263+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482123307+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482133477+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482142943+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482151644+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482160741+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482184201+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482194643+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482206871+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482215454+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482365838+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:20:11 k8s-m1 containerd[9186]: time="2020-07-23T11:20:11.482404139+08:00" level=info msg="containerd successfully booted in 0.003611s"
Jul 23 11:20:11 k8s-m1 containerd[9186]: panic: runtime error: invalid memory address or nil pointer dereference
Jul 23 11:20:11 k8s-m1 containerd[9186]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5626b983c259]
Jul 23 11:20:11 k8s-m1 containerd[9186]: goroutine 55 [running]:
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Cursor(...)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:84
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).Get(0x0, 0x5626bb7e3f10, 0xb, 0xb, 0x0, 0x2, 0x4)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:260 +0x39
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots.func6(0x7fe557c63020, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x5626b95eec72)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/gc.go:222 +0xcb
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*Bucket).ForEach(0xc0003d1780, 0xc00057b640, 0xa, 0xa)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bucket.go:388 +0x100
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.scanRoots(0x5626bacedde0, 0xc0003d1680, 0xc0002ee2a0, 0xc00031a3c0, 0xc000527a60, 0x7fe586a43fff)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/gc.go:216 +0x4df
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked.func1(0xc0002ee2a0, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:359 +0x165
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/vendor/go.etcd.io/bbolt.(*DB).View(0xc00000c1e0, 0xc00008b860, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:701 +0x92
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.

Judging from the panic stack trace, this looks very similar to the docker startup panic in my earlier article: in both cases a boltdb file is corrupted. Find the git revision so we can locate the code path:

[root@k8s-m1 ~]# systemctl cat containerd | grep ExecStart
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd

[root@k8s-m1 ~]# /usr/bin/containerd --version
containerd containerd.io 1.2.13 7ad184331fa3e55e52b890ea95e65ba581ae3429

Browsing GitHub by that commit hash returns a 404, so the code has to be viewed by the release tag instead. From the relevant code, the boltdb file is named meta.db:

https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L257
https://github.com/containerd/containerd/blob/v1.2.13/metadata/db.go#L79
https://github.com/containerd/containerd/blob/v1.2.13/services/server/server.go#L261-L268

Find out what the ic.Root path is:

[root@k8s-m1 ~]# /usr/bin/containerd --help | grep config
     config    information on the containerd config
   --config value, -c value     path to the configuration file (default: "/etc/containerd/config.toml")

[root@k8s-m1 ~]# grep root /etc/containerd/config.toml
#root = "/var/lib/containerd"
[root@k8s-m1 ~]# find /var/lib/containerd -type f -name meta.db
/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
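If the bbolt CLI (from go.etcd.io/bbolt) happens to be available, the file can also be sanity-checked before deciding to discard it. This is optional and the tool is not installed by default, so treat it as a sketch:

```shell
# Optional integrity check of a suspect boltdb file before renaming it away.
# Assumes the bbolt CLI (go install go.etcd.io/bbolt/cmd/bbolt@latest) is on PATH.
check_bolt() {
  bbolt check "$1"   # reports errors if the page structure is damaged
}
# Usage: check_bolt /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
```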

Found the boltdb file; rename it and restart the service:

[root@k8s-m1 ~]# mv /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db{,.bak}
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-07-23 11:20:11 CST; 17min ago
     Docs: https://containerd.io
  Process: 9186 ExecStart=/usr/bin/containerd (code=exited, status=2)
  Process: 9182 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 9186 (code=exited, status=2)

Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).getMarked(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x203000, 0x203000, 0x400)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:342 +0x7e
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/metadata.(*DB).GarbageCollect(0xc0000a0a80, 0x5626bacede20, 0xc0000d6010, 0x0, 0x1, 0x0, 0x0)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/metadata/db.go:257 +0xa3
Jul 23 11:20:11 k8s-m1 containerd[9186]: github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0000a0b40, 0x5626bacede20, 0xc0000d6010)
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:310 +0x511
Jul 23 11:20:11 k8s-m1 containerd[9186]: created by github.com/containerd/containerd/gc/scheduler.init.0.func1
Jul 23 11:20:11 k8s-m1 containerd[9186]:         /go/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x462
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 23 11:20:11 k8s-m1 systemd[1]: containerd.service: Failed with result 'exit-code'.
[root@k8s-m1 ~]# systemctl restart containerd.service
[root@k8s-m1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-07-23 11:25:37 CST; 1s ago
     Docs: https://containerd.io
  Process: 15661 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 15663 (containerd)
    Tasks: 16
   Memory: 28.6M
   CGroup: /system.slice/containerd.service
           └─15663 /usr/bin/containerd

Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496725460+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496734129+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496742793+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496751740+08:00" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496775185+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496785498+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496794873+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496803178+08:00" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496944458+08:00" level=info msg=serving... address="/run/containerd/containerd.sock"
Jul 23 11:25:37 k8s-m1 containerd[15663]: time="2020-07-23T11:25:37.496958031+08:00" level=info msg="containerd successfully booted in 0.003994s"
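As an aside, the meta.db{,.bak} rename above is plain bash brace expansion; a quick self-contained demonstration:

```shell
# mv path/meta.db{,.bak} expands to: mv path/meta.db path/meta.db.bak
tmpdir=$(mktemp -d)
touch "$tmpdir/meta.db"
mv "$tmpdir/meta.db"{,.bak}
ls "$tmpdir"          # meta.db.bak
```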

With containerd up, start docker:

[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: inactive (dead) since Thu 2020-07-23 11:20:13 CST; 18min ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 9187 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
 Main PID: 9187 (code=exited, status=0/SUCCESS)

Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956503485+08:00" level=error msg="Stop container error: Failed to stop container 68860c8d16b9ce7e74e8efd9db00e70a57eef1b752c2e6c703073c0bce5517d3 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954347116+08:00" level=error msg="Stop container error: Failed to stop container 5ec9922beed1276989f1866c3fd911f37cc26aae4e4b27c7ce78183a9a4725cc with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.953615411+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956557179+08:00" level=error msg="Stop container error: Failed to stop container 6d0096fbcd4055f8bafb6b38f502a0186cd1dfca34219e9dd6050f512971aef5 with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.954601191+08:00" level=info msg="Container failed to stop after sending signal 15 to the process, force killing"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.956600790+08:00" level=error msg="Stop container error: Failed to stop container 6d1175ba6c55cb05ad89f4134ba8e9d3495c5acb5f07938dc16339b7cca013bf with error: Cannot kill c>
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957188989+08:00" level=info msg="Daemon shutdown complete"
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957212655+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Jul 23 11:20:13 k8s-m1 dockerd[9187]: time="2020-07-23T11:20:13.957209679+08:00" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
Jul 23 11:20:13 k8s-m1 systemd[1]: Stopped Docker Application Container Engine.
[root@k8s-m1 ~]# systemctl start docker
[root@k8s-m1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─10-docker.conf
   Active: active (running) since Thu 2020-07-23 11:26:11 CST; 1s ago
     Docs: https://docs.docker.com
  Process: 9398 ExecStopPost=/bin/bash -c /sbin/iptables -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
  Process: 16156 ExecStartPost=/sbin/iptables -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
 Main PID: 15974 (dockerd)
    Tasks: 62
   Memory: 89.1M
   CGroup: /system.slice/docker.service
           └─15974 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.851106564+08:00" level=error msg="cb4e16249cd8eac48ed734c71237195f04d63c56c55c0199b3cdf3d49461903d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.860456898+08:00" level=error msg="d9bbcab186ccb59f96c95fc886ec1b66a52aa96e45b117cf7d12e3ff9b95db9f cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.872405757+08:00" level=error msg="07eb7a09bc8589abcb4d79af4b46798327bfb00624a7b9ceea457de392ad8f3d cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:10 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:10.877896618+08:00" level=error msg="f5867657025bd7c3951cbd3e08ad97338cf69df2a97967a419e0e78eda869b73 cleanup: failed to delete container from containerd: no such container"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.143661583+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.198200760+08:00" level=info msg="Loading containers: done."
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.219959208+08:00" level=info msg="Docker daemon" commit=42e35e61f3 graphdriver(s)=overlay2 version=19.03.11
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.220049865+08:00" level=info msg="Daemon has completed initialization"
Jul 23 11:26:11 k8s-m1 dockerd[15974]: time="2020-07-23T11:26:11.232373131+08:00" level=info msg="API listen on /var/run/docker.sock"
Jul 23 11:26:11 k8s-m1 systemd[1]: Started Docker Application Container Engine.

etcd also failed to start; check its status in the journal:

[root@k8s-m1 ~]# journalctl -xe -u etcd
Jul 23 11:26:15 k8s-m1 etcd[18129]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:26:15 k8s-m1 etcd[18129]: etcd Version: 3.3.20
Jul 23 11:26:15 k8s-m1 etcd[18129]: Git SHA: 9fd7e2b80
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go Version: go1.12.17
Jul 23 11:26:15 k8s-m1 etcd[18129]: Go OS/Arch: linux/amd64
Jul 23 11:26:15 k8s-m1 etcd[18129]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:26:15 k8s-m1 etcd[18129]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:26:15 k8s-m1 etcd[18129]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring peer auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for peers on https://10.252.146.104:2380
Jul 23 11:26:15 k8s-m1 etcd[18129]: ignoring client auto TLS since certs given
Jul 23 11:26:15 k8s-m1 etcd[18129]: pprof is enabled under /debug/pprof
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:26:15 k8s-m1 etcd[18129]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 127.0.0.1:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: listening for client requests on 10.252.146.104:2379
Jul 23 11:26:15 k8s-m1 etcd[18129]: skipped unexpected non snapshot file 000000000000002e-000000000052f2be.snap.broken
Jul 23 11:26:15 k8s-m1 etcd[18129]: recovered store from snapshot at index 5426092
Jul 23 11:26:15 k8s-m1 etcd[18129]: restore compact to 3967425
Jul 23 11:26:15 k8s-m1 etcd[18129]: cannot unmarshal event: proto: KeyValue: illegal tag 0 (wire type 0)
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:26:15 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:26:15 k8s-m1 systemd[1]: Failed to start Etcd Service.
[root@k8s-m1 ~]# ll /var/lib/etcd/member/snap/
total 8560
-rw-r--r-- 1 root root   13499 Jul 20 13:36 000000000000002e-000000000052cbac.snap
-rw-r--r-- 2 root root  128360 Jul 20 13:01 000000000000002e-000000000052f2be.snap.broken
-rw------- 1 root root 8617984 Jul 23 11:26 db

This cluster was deployed with my ansible playbooks (stars welcome), which ship with a backup script, but the machine had gone down three days earlier, so the newest local backup was from Jul 20:

[root@k8s-m1 ~]# ll /opt/etcd_bak/
total 41524
-rw-r--r-- 1 root root 8618016 Jul 17 02:00 etcd-2020-07-17-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 18 02:00 etcd-2020-07-18-02:00:01.db
-rw-r--r-- 1 root root 8323104 Jul 19 02:00 etcd-2020-07-19-02:00:01.db
-rw-r--r-- 1 root root 8618016 Jul 20 02:00 etcd-2020-07-20-02:00:01.db
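The backup job behind those files boils down to an etcdctl v3 snapshot. A hedged sketch of the equivalent nightly job (the filename pattern matches the listing above; everything else is an assumption about this cluster):

```shell
# Nightly etcd backup sketch: take a v3 snapshot into /opt/etcd_bak.
# Assumes etcdctl is on PATH and can reach the local member without extra flags.
backup_etcd() {
  ETCDCTL_API=3 etcdctl snapshot save \
    "/opt/etcd_bak/etcd-$(date +%Y-%m-%d-%H:%M:%S).db"
}
# cron: 0 2 * * * /usr/local/bin/etcd-backup.sh   (02:00, matching the timestamps above)
```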

There is a restore playbook, but it assumes etcd v2 and v3 data do not coexist, otherwise the backup cannot be restored; on our production clusters the v2 store is disabled. The work is mainly lines 26 to 42 of its tasks. I copied the Jul 23 etcd backup file over from another master, adjusted the host, and ran it:
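For reference, roughly what those tasks do by hand is the following sketch; the member name, peer URL and cluster list below are placeholders filled by the playbook from its inventory, and member names other than etcd-001 are assumptions:

```shell
# Rough manual equivalent of the restoreETCD tasks (sketch, not the playbook itself).
restore_etcd() {
  local db=$1            # path to the snapshot .db file
  systemctl stop etcd
  rm -rf /var/lib/etcd   # the playbook wipes the old data dir first
  ETCDCTL_API=3 etcdctl snapshot restore "$db" \
    --name etcd-001 \
    --data-dir /var/lib/etcd \
    --initial-advertise-peer-urls https://10.252.146.104:2380 \
    --initial-cluster 'etcd-001=https://10.252.146.104:2380,etcd-002=https://10.252.146.105:2380,etcd-003=https://10.252.146.106:2380'
  systemctl start etcd
}
```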

[root@k8s-m1 ~]# cd Kubernetes-ansible
[root@k8s-m1 Kubernetes-ansible]# ansible-playbook restoreETCD.yml -e 'db=/opt/etcd_bak/etcd-bak.db'

PLAY [10.252.146.104] **********************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *********************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : 检测备份文件存在否] *************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : fail] ******************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
skipping: [10.252.146.104]

TASK [restoreETCD : set_fact] **************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 停止etcd] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 删除etcd数据目录] ************************************************************************************************************************************************************************************************************
ok: [10.252.146.104] => (item=/var/lib/etcd)

TASK [restoreETCD : 分发备份文件] ****************************************************************************************************************************************************************************************************************
ok: [10.252.146.104]

TASK [restoreETCD : 恢复备份] ******************************************************************************************************************************************************************************************************************
changed: [10.252.146.104]

TASK [restoreETCD : 启动etcd] ****************************************************************************************************************************************************************************************************************
fatal: [10.252.146.104]: FAILED! => {"changed": false, "msg": "Unable to start service etcd: Job for etcd.service failed because the control process exited with error code.\nSee \"systemctl status etcd.service\" and \"journalctl -xe\" for details.\n"}

PLAY RECAP *********************************************************************************************************************************************************************************************************************************
10.252.146.104             : ok=7    changed=1    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0

Check the logs:

[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:46 k8s-m1 etcd[58954]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:46 k8s-m1 etcd[58954]: etcd Version: 3.3.20
Jul 23 11:27:46 k8s-m1 etcd[58954]: Git SHA: 9fd7e2b80
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go Version: go1.12.17
Jul 23 11:27:46 k8s-m1 etcd[58954]: Go OS/Arch: linux/amd64
Jul 23 11:27:46 k8s-m1 etcd[58954]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:46 k8s-m1 etcd[58954]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring peer auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:46 k8s-m1 etcd[58954]: ignoring client auto TLS since certs given
Jul 23 11:27:46 k8s-m1 etcd[58954]: pprof is enabled under /debug/pprof
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:46 k8s-m1 etcd[58954]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:46 k8s-m1 etcd[58954]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:47 k8s-m1 etcd[58954]: member ac2dcf6aed12e8f1 has already been bootstrapped
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 11:27:47 k8s-m1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Jul 23 11:27:47 k8s-m1 systemd[1]: Failed to start Etcd Service.

The fix for "member xxxx has already been bootstrapped" is to make the change below in the config file; remember to change it back once startup succeeds:

change initial-cluster-state: 'new' to initial-cluster-state: 'existing'
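The flip can be done with sed; demonstrated here on a scratch copy rather than the live /etc/etcd/etcd.config.yml:

```shell
# Flip initial-cluster-state from 'new' to 'existing' (scratch copy for safety).
cfg=$(mktemp)
echo "initial-cluster-state: 'new'" > "$cfg"
sed -i "s/state: 'new'/state: 'existing'/" "$cfg"
cat "$cfg"    # initial-cluster-state: 'existing'
```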

Then etcd started successfully:

[root@k8s-m1 Kubernetes-ansible]# systemctl start etcd
[root@k8s-m1 Kubernetes-ansible]# journalctl -xe -u etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: Loading server configuration from "/etc/etcd/etcd.config.yml"
Jul 23 11:27:55 k8s-m1 etcd[59889]: etcd Version: 3.3.20
Jul 23 11:27:55 k8s-m1 etcd[59889]: Git SHA: 9fd7e2b80
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go Version: go1.12.17
Jul 23 11:27:55 k8s-m1 etcd[59889]: Go OS/Arch: linux/amd64
Jul 23 11:27:55 k8s-m1 etcd[59889]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jul 23 11:27:55 k8s-m1 etcd[59889]: found invalid file/dir wal under data dir /var/lib/etcd (Ignore this if you are upgrading etcd)
Jul 23 11:27:55 k8s-m1 etcd[59889]: the server is already initialized as member before, starting as etcd member...
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring peer auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, ca = /etc/kubernetes/pki/etcd/ca.crt, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = fals>
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for peers on https://10.252.146.104:2380
Jul 23 11:27:55 k8s-m1 etcd[59889]: ignoring client auto TLS since certs given
Jul 23 11:27:55 k8s-m1 etcd[59889]: pprof is enabled under /debug/pprof
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Jul 23 11:27:55 k8s-m1 etcd[59889]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 127.0.0.1:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: listening for client requests on 10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: recovered store from snapshot at index 5952463
Jul 23 11:27:55 k8s-m1 etcd[59889]: restore compact to 4369703
Jul 23 11:27:55 k8s-m1 etcd[59889]: name = etcd-001
Jul 23 11:27:55 k8s-m1 etcd[59889]: data dir = /var/lib/etcd
Jul 23 11:27:55 k8s-m1 etcd[59889]: member dir = /var/lib/etcd/member
Jul 23 11:27:55 k8s-m1 etcd[59889]: dedicated WAL dir = /var/lib/etcd/wal
Jul 23 11:27:55 k8s-m1 etcd[59889]: heartbeat = 100ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: election = 1000ms
Jul 23 11:27:55 k8s-m1 etcd[59889]: snapshot count = 5000
Jul 23 11:27:55 k8s-m1 etcd[59889]: advertise client URLs = https://10.252.146.104:2379
Jul 23 11:27:55 k8s-m1 etcd[59889]: restarting member ac2dcf6aed12e8f1 in cluster 367e2aebc6430cbe at commit index 5952491
Jul 23 11:27:55 k8s-m1 etcd[59889]: ac2dcf6aed12e8f1 became follower at term 47
Jul 23 11:27:55 k8s-m1 etcd[59889]: newRaft ac2dcf6aed12e8f1 [peers: [1e713be314744d53,8b1621b475555fd9,ac2dcf6aed12e8f1], term: 47, commit: 5952491, applied: 5952463, lastindex: 5952491, lastterm: 47]
Jul 23 11:27:55 k8s-m1 etcd[59889]: enabled capabilities for version 3.3
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 1e713be314744d53 [https://10.252.146.105:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member 8b1621b475555fd9 [https://10.252.146.106:2380] to cluster 367e2aebc6430cbe from store
Jul 23 11:27:55 k8s-m1 etcd[59889]: added member ac2dcf6aed12e8f1 [https://10.252.146.104:2380] to cluster 367e2aebc6430cbe from store

Check the cluster state:

[root@k8s-m1 Kubernetes-ansible]# etcd-ha
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.252.146.104:2379 | ac2dcf6aed12e8f1 |  3.3.20 |  8.3 MB |     false |        47 |    5953557 |
| https://10.252.146.105:2379 | 1e713be314744d53 |  3.3.20 |  8.6 MB |     false |        47 |    5953557 |
| https://10.252.146.106:2379 | 8b1621b475555fd9 |  3.3.20 |  8.3 MB |      true |        47 |    5953557 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
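etcd-ha here is a small helper from the same playbooks; an equivalent raw invocation would be roughly the following (endpoints are from the table above, cert paths are an assumption based on the etcd config shown earlier):

```shell
# Approximate expansion of the etcd-ha helper: member status as a table.
etcd_ha() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://10.252.146.104:2379,https://10.252.146.105:2379,https://10.252.146.106:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    endpoint status -w table
}
```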

Then, after bringing the three control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) and the kubelet back up:

[root@k8s-m1 Kubernetes-ansible]# kubectl get node -o wide
NAME             STATUS   ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.252.146.104   Ready    <none>   30d   v1.16.9   10.252.146.104   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.105   Ready    <none>   30d   v1.16.9   10.252.146.105   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
10.252.146.106   Ready    <none>   30d   v1.16.9   10.252.146.106   <none>        CentOS Linux 8 (Core)   4.18.0-193.6.3.el8_2.x86_64   docker://19.3.11
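Bringing those units back is just systemctl again; the unit names below assume a binary (non-static-pod) install like this one:

```shell
# Start the control-plane components and kubelet in order (sketch).
start_control_plane() {
  for unit in kube-apiserver kube-controller-manager kube-scheduler kubelet; do
    systemctl start "$unit"
  done
}
```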

The pods are gradually self-healing as well.

Original post:

http://zhangguanzhang.github.io/2020/07/23/fs-error-fix-k8s-master/

Reposted from Zhangguanzhang.

Bonus

Hands-on open-source code for a microservice architecture on k8s:

https://gitee.com/damon_one/spring-cloud-k8s
https://gitee.com/damon_one/spring-cloud-oauth2

Stars and feedback are welcome.

About the author

Pen name Damon: technology enthusiast, long engaged in Java development and Spring Cloud microservice architecture design, as well as containerizing microservices with docker and k8s and automating one-stop project deployment; currently learning Go, studying k8s, and exploring the edge-computing framework KubeEdge. Founder of the 程序猿Damon and 天山六路折梅手 public accounts. Personal WeChat: MrNull008; website: damon8.cn.


This article is shared from the WeChat official account 程序猿Damon (Damon4X).


Originally published: 2020-08-26
