When deploying or running your business, you may trigger high-risk operations at different levels and cause service failures of varying severity. To help you estimate and avoid operational risks, this document describes the consequences of these high-risk operations and the corresponding solutions for clusters, networking and load balancing, logs, and cloud disks.
Clusters
| Type | High-Risk Operation | Consequence | Solution |
|------|---------------------|-------------|----------|
| Master and etcd nodes | Modifying the security group of nodes in the cluster | The master node may become unavailable | Configure the security group as recommended by Tencent Cloud |
| | The node expires or is terminated | The master node becomes unavailable | Unrecoverable |
| | Reinstalling the operating system | Master components are deleted | Unrecoverable |
| | Upgrading the master or etcd component version on your own | The cluster may become unavailable | Roll back to the original version |
| | Deleting or formatting core directory data such as /etc/kubernetes on the node | The master node becomes unavailable | Unrecoverable |
| | Changing the node IP | The master node becomes unavailable | Change back to the original IP |
| | Modifying the parameters of core components (such as etcd, kube-apiserver, or docker) on your own | The master node may become unavailable | Configure the parameters as recommended by Tencent Cloud |
| | Changing the master or etcd certificate on your own | The cluster may become unavailable | Unrecoverable |
| Worker nodes | Modifying the security group of nodes in the cluster | The node may become unavailable | Configure the security group as recommended by Tencent Cloud |
| | Modifying the node instance specification | The server is forcibly shut down and the node becomes unavailable | Remove the node and add it back to the cluster (see the sketch after this table) |
| | The node expires or is terminated | The node becomes unavailable | Unrecoverable |
| | Reinstalling the operating system | Node components are deleted | Remove the node and add it back to the cluster (see the sketch after this table) |
| | Upgrading the node component version on your own | The node may become unavailable | Roll back to the original version |
| | Changing the node IP | The node becomes unavailable | Change back to the original IP |
| | Modifying the parameters of core components (such as etcd, kube-apiserver, or docker) on your own | The node may become unavailable | Configure the parameters as recommended by Tencent Cloud |
| | Modifying the operating system configuration | The node may become unavailable | Try to restore the configuration, or delete the node and purchase a new one |
| Others | Modifying permissions in CAM | Some cluster resources, such as cloud load balancers, may fail to be created | Restore the permissions |
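For the solutions above that ask you to remove the node and add it back to the cluster, the removal half can be done with standard kubectl commands, sketched below. The node name is a placeholder; re-adding the server to the cluster is then done through the TKE console or API.

```bash
# Placeholder node name; use the name shown by `kubectl get nodes`.
NODE=10.0.0.8

# Evict workloads from the node first (use --delete-local-data on older kubectl versions).
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Remove the node object from the cluster, then re-add the server through the TKE console.
kubectl delete node "$NODE"
```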
Networking and Load Balancing
| High-Risk Operation | Consequence | Solution |
|---------------------|-------------|----------|
| Modifying the kernel parameter net.ipv4.ip_forward to 0 | The network is disconnected | Set the kernel parameter net.ipv4.ip_forward back to 1 (see the sketch after this table) |
| Modifying the kernel parameter net.ipv4.tcp_tw_recycle to 1 | NAT exceptions occur | Set the kernel parameter net.ipv4.tcp_tw_recycle back to 0 |
| The node's security group does not open UDP port 53 to the container CIDR block | DNS inside the cluster cannot work normally | Configure the security group as recommended by Tencent Cloud |
| Modifying or deleting the LB tags added by TKE | A new LB is purchased | Restore the LB tags |
| Creating custom listeners on a TKE-managed LB through the LB console | The modification is reset by TKE | Create listeners automatically through the Service YAML (see the sketch after this table) |
| Binding custom backend real servers (RS) to a TKE-managed LB through the LB console | The modification is reset by TKE | Do not manually bind backend real servers |
| Modifying the certificate of a TKE-managed LB through the LB console | The modification is reset by TKE | Manage the certificate automatically through the Ingress YAML (see the sketch after this table) |
| Modifying the listener name of a TKE-managed LB through the LB console | The modification is reset by TKE | Do not modify the listener name of a TKE-managed LB |
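For the two kernel-parameter rows above, the values can be restored on the affected node roughly as follows. This is a sketch; run it as root, and note that net.ipv4.tcp_tw_recycle was removed in Linux kernel 4.12 and later.

```bash
# Restore IP forwarding and undo the NAT-breaking setting immediately.
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.tcp_tw_recycle=0   # skip on kernels >= 4.12, where this parameter no longer exists

# Persist the values across reboots.
cat <<'EOF' >> /etc/sysctl.conf
net.ipv4.ip_forward = 1
net.ipv4.tcp_tw_recycle = 0
EOF
sysctl -p
```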
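For TKE-managed LB listeners, declare the ports in the Service manifest instead of creating listeners in the CLB console. A minimal sketch follows; the Service name, selector, and ports are placeholders for illustration.

```bash
# TKE creates and manages the CLB listeners from the ports declared in the Service,
# so nothing needs to be added manually in the CLB console.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web-svc            # placeholder name
spec:
  type: LoadBalancer
  selector:
    app: web               # placeholder selector
  ports:
    - name: http
      port: 80             # listener port on the CLB
      targetPort: 8080     # container port
      protocol: TCP
EOF
```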
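Similarly, certificates on a TKE-managed LB should be declared in the Ingress manifest rather than changed in the CLB console. Below is a generic Kubernetes sketch using spec.tls; the host, secret name, and backend Service are placeholders, and TKE may use its own certificate annotations or secret format, so check the TKE Ingress documentation for the exact fields.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress                  # placeholder name
spec:
  tls:
    - hosts:
        - www.example.com            # placeholder host
      secretName: web-tls-secret     # secret holding the certificate material
  rules:
    - host: www.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-svc        # placeholder backend Service
                port:
                  number: 80
EOF
```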
Logs
| High-Risk Operation | Consequence | Solution | Remarks |
|---------------------|-------------|----------|---------|
| Deleting the /tmp/ccs-log-collector/pos directory on the host | Logs are collected again | - | The pos file records how far each log file has been collected. |
| Deleting the /tmp/ccs-log-collector/buffer directory on the host | Logs are lost | - | The buffer directory contains cached log files waiting to be consumed. |
Cloud Disk
| High-Risk Operation | Consequence | Solution |
|---------------------|-------------|----------|
| Manually unmounting a cloud disk through the console | The Pod encounters an I/O error when writing data | Delete the mount directory on the node and reschedule the Pod (see the sketch after this table) |
| Unmounting the disk's mount path on the node | The Pod writes data to the local disk | Remount the corresponding directory to the Pod |
| Directly operating the CBS block device on the node | The Pod writes data to the local disk | - |
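For the first row above, after cleaning up the stale mount directory on the node, the Pod can be rescheduled so the volume is attached and mounted again. A minimal sketch, assuming the Pod is managed by a controller such as a Deployment or StatefulSet; the Pod name and namespace are placeholders.

```bash
# Deleting the Pod makes its controller recreate it, which re-attaches and re-mounts the cloud disk.
kubectl delete pod web-0 -n default
```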