This document answers frequently asked questions about TKE Serverless clusters and describes the causes of and solutions to common issues.
Why is the Pod specification inconsistent with the set Request/Limit?
TKE Serverless does not allocate resources to a Pod directly according to the Request and Limit values set in the workload. Instead, it uses these values to calculate the amount of resources required to run the Pod, so the resulting Pod specification can differ from the values you set. For more information, see CPU specifications calculation methods for pods and GPU specification calculation methods for pods.
How do I create or modify the container network of a TKE Serverless cluster?
When creating a cluster, you need to select a VPC as the cluster network and specify a subnet as the container network. For more information, see Notes on the Container Network. Each Pod in a TKE Serverless cluster directly occupies an IP address in the container network subnet. After the cluster is created, you can add or remove container networks by creating or removing super nodes. The detailed instructions are shown below.
Step 1. Create a super node to add a container network
1. Log in to the TKE console and click Cluster in the left sidebar.
2. Click the ID of the cluster for which you need to modify the container network to go to the cluster details page.
3. Click Super node in the left sidebar. On the Super node page, click Create.
4. On the Create super node page, select the container network with sufficient IP addresses and click OK.
Step 2. Remove the super node to delete the container network
Note
Make sure that at least one super node remains in the TKE Serverless cluster after the removal. If there is only one super node, you cannot remove it.
Before removing a super node, you need to drain all Pods on it (excluding those managed by DaemonSet) to other super nodes. After the draining is completed, you can remove the super node; otherwise, the removal will fail. The detailed directions are as shown below.
1. Log in to the TKE console and click Cluster in the left sidebar.
2. Click the ID of the cluster for which you need to modify the container network to go to the cluster details page.
3. Click Super node in the left sidebar. On the Super node page, choose More > Drain on the right of the node name.
4. On the Drain node page, check the node information and click OK. After the node is drained, it will enter the "Blocked" status, and no more Pods can be scheduled to it.
Note
Note that Pods will be rebuilt once the node is drained.
5. On the Super node page, click Remove on the right of the node name.
6. On the Delete node page, click OK.
What should I do if the Pod fails to schedule because of insufficient subnet IPs?
When a Pod fails to be scheduled due to insufficient subnet IP addresses, you can find two events in the node logs.
Event 1:
Event 2:
You can view the YAML of the super node by accessing the TKE console or executing the following command in the command line.
kubectl get nodes -oyaml
The returned result is as follows:
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    timeAdded: "2021-04-20T07:00:16Z"
status:
  conditions:
  - lastHeartbeatTime: "2021-04-20T07:55:28Z"
    lastTransitionTime: "2021-04-20T07:00:16Z"
    message: eklet node has insufficient IP available of subnet subnet-bok73g4c
    reason: EKLetHasInsufficientSubnetIP
    status: "True"
    type: NetworkUnavailable
It shows that the Pod fails to be scheduled due to insufficient subnet IP addresses of the container network. In this case, you need to create super nodes to add subnets and available IP ranges. For how to create a super node, see Creating Super Node.
How do I use security groups in a TKE Serverless cluster?
If you do not specify a security group when creating a Pod in a TKE Serverless cluster, the default security group is used. To specify one, add the annotation eks.tke.cloud.tencent.com/security-group-id: <security group ID> to the Pod. Make sure that the security group ID already exists in the region where the workload resides. For more information about this annotation, see Annotation.
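As a minimal sketch of the annotation described above, the following Pod manifest attaches a security group at creation time. The Pod name, the nginx image, and the value sg-xxxxxxxx are placeholders chosen for illustration; replace the value with a security group ID that exists in your region.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sg-demo                 # hypothetical name, for illustration only
  annotations:
    # Placeholder ID: use a security group that already exists in the
    # region where the workload runs.
    eks.tke.cloud.tencent.com/security-group-id: "sg-xxxxxxxx"
spec:
  containers:
  - name: nginx
    image: nginx
```

For a Deployment or other workload, the annotation would go under spec.template.metadata.annotations so that it is applied to each Pod the workload creates.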
How do I set the container termination message?
Kubernetes lets you set the source of a container's termination message through the terminationMessagePath field. When a container exits, Kubernetes reads the file specified by terminationMessagePath inside the container and uses its content to populate the container's termination message. The default path is /dev/termination-log.
Additionally, you can set the terminationMessagePolicy field for the container to further customize the container termination message. The default value for this field is File, which retrieves the termination message only from the termination message file. You can set it to FallbackToLogsOnError according to your needs, which means that if the container exits with an error and the termination message file is empty, the last part of the container log output will be used as the termination message.
Sample code:
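apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx   # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources:
          limits:
            cpu: 500m
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 256Mi
        terminationMessagePath: /dev/termination-log
        # Fall back to the tail of the container log if the file is empty
        terminationMessagePolicy: FallbackToLogsOnError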
With the above configuration, when the container exits with an error and the termination message file is empty, running kubectl get pod <pod-name> -o yaml shows the last part of the container log output (for example, its stderr output) as the termination message in containerStatuses.
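To illustrate the default File policy, here is a small sketch adapted from the pattern used in the upstream Kubernetes documentation: the container writes a message to the termination message file and then exits with an error. The Pod name, container name, busybox image, and message text are illustrative assumptions, not part of the TKE Serverless documentation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: termination-demo        # hypothetical name
spec:
  restartPolicy: Never          # keep the terminated state visible
  containers:
  - name: termination-demo-container
    image: busybox
    command: ["/bin/sh", "-c"]
    # Write the message to the termination message file, then fail.
    args: ["echo 'Sleep expired' > /dev/termination-log; exit 1"]
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
```

After the container exits, the message should appear in the Pod status under status.containerStatuses[].state.terminated.message.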
How do I use Host parameters?
Note the following when using TKE Serverless clusters:
TKE Serverless clusters do not have nodes but are compatible with Host parameters, such as hostPath, hostNetwork: true, and dnsPolicy: ClusterFirstWithHostNet. Because there is no node, these parameters cannot deliver the full capabilities they have in standard Kubernetes.
For example, you may want to use hostPath to share data, but two Pods scheduled to the same super node see the hostPath of different hosts. In addition, hostPath files are deleted when the Pod is rebuilt.
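The following sketch shows where these Host parameters sit in a Pod manifest. The Pod name, image, volume name, and paths are hypothetical. Per the notes above, the hostPath directory here is private to this Pod's own host and its contents are lost when the Pod is rebuilt, so treat it as scratch space rather than shared or persistent storage.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: host-params-demo        # hypothetical name
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    hostPath:
      path: /tmp/scratch        # illustrative path; not shared across Pods
      type: DirectoryOrCreate
```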
How do I mount CFS/NFS?
In TKE Serverless clusters, you can use Tencent Cloud's Cloud File Storage (CFS) or mount an external NFS as a volume to a Pod for persistent data storage. A sample YAML to mount CFS/NFS to a Pod is as shown below:
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /cache
      name: nfs   # must match the volume name under spec.volumes
  volumes:
  - name: nfs
    nfs:
      path: /dir
      server: 127.0.0.1
spec.volumes: Set the name, type, and parameters of the volume.
spec.volumes.nfs: Set the NFS/CFS disk.
spec.containers.volumeMounts: Set the mount point of the volume in the Pod.
How do I speed up container startup through image reuse?
TKE Serverless supports caching container images to speed up the next startup of the container with the same images.
Conditions for reuse:
1. For Pods of the same workload, if a new Pod is created within the cache time after a previous Pod was terminated in the same AZ, the new Pod does not pull the same image again by default.
2. If you want to reuse images for Pods of different workloads (including Deployment, StatefulSet, and Job), use the following annotation:
eks.tke.cloud.tencent.com/cbs-reuse-key
For Pods under the same user account that have the same annotation value, the startup image is reused within the cache time whenever possible. We recommend using the image name as the annotation value: eks.tke.cloud.tencent.com/cbs-reuse-key: "image-name".
Cache time: 2 hours.
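As a sketch of the second condition above, the two manifests below belong to different workloads (a Deployment and a Job) but carry the same cbs-reuse-key value on their Pod templates. The workload names, labels, the nginx image, and the Job command are hypothetical; only the annotation key comes from this document.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # hypothetical name
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
      annotations:
        # Same value as the Job below, so the cached image can be shared
        eks.tke.cloud.tencent.com/cbs-reuse-key: "nginx"
    spec:
      containers:
      - name: nginx
        image: nginx
---
apiVersion: batch/v1
kind: Job
metadata:
  name: warmup                  # hypothetical name
spec:
  template:
    metadata:
      annotations:
        eks.tke.cloud.tencent.com/cbs-reuse-key: "nginx"
    spec:
      restartPolicy: Never
      containers:
      - name: nginx
        image: nginx
        command: ["nginx", "-v"]
```

Using the image name as the key follows the recommendation above and makes it easy to see which workloads are expected to share a cached image.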
How do I solve image reuse exceptions?
When the image reuse feature is enabled, running kubectl describe pod on a newly created Pod may show the following errors:
no space left on device: unknown
Warning FreeDiskSpaceFailed 26m eklet, eklet-subnet-xxx failed to garbage collect required amount of images. Wanted to free 4220828057 bytes, but freed 3889267064 bytes
Recovery Method:
No action is required. Wait for a few minutes and the Pod will run automatically.
Reason:
no space left on device: unknown
When Pods reuse the system disk by default, the existing images on the system disk occupy all the disk space, causing insufficient space for downloading new images, resulting in the error "no space left on device: unknown". TKE Serverless supports a scheduled image recycling mechanism. When the disk space is full, it will automatically delete the redundant images in the system disk to free up available space. (This process may take several minutes.)
Warning FreeDiskSpaceFailed 26m eklet, eklet-subnet-xxx failed to garbage collect required amount of images. Wanted to free 4220828057 bytes, but freed 3889267064 bytes
This log indicates that the current Pod requires 4220828057 bytes of space to download the image, but only 3889267064 bytes of data have been cleared. The event is generated because there are multiple images on the disk, and only some of them have been cleaned up. TKE Serverless's scheduled image recycling mechanism will continue to clean up images until the new image can be successfully pulled.
What should I do if Operation not permitted is reported when I mount an external NFS?
If you encounter an "Operation not permitted" error when using a self-built NFS for persistent storage, you need to modify the /etc/exports file of your self-built NFS by adding an entry in the format /<path> <ip-range>(rw,insecure). An example is shown below:
/data/ 10.0.0.0/16(rw,insecure)
How do I free up a full Pod disk (ImageGCFailed)?
TKE Serverless Pods provide 20 GB of free system disk space by default. If the disk is full, you can free it up in the following ways.
1. Free up unused container images
If 80% of the space is used, the TKE Serverless backend will trigger the container image repossession process to recover the unused images and free up the space. If this process fails, the ImageGCFailed: failed to garbage collect required amount of images message will be reported to remind you of the insufficient disk space.
Common causes of insufficient disk space include:
The business has a lot of temporary outputs. You can confirm this with the du command.
The business holds deleted file descriptors, so disk space is not freed up. You can confirm this with the lsof command.
If you want to adjust the threshold for the container image repossession, set the following annotation:
If your business has been upgraded in-place or a container has abnormally exited, the exited container will be retained until the disk utilization reaches 85%. The cleanup threshold can be adjusted with the following annotation:
If you don't want to have the exited container automatically cleaned up (for example, you need the exit information for further troubleshooting), you can disable the automatic cleanup with the following annotation; however, the disk space cannot be automatically freed up in this case.
Only the Pod is restarted, but the host will not be rebuilt. Normal gracestop, prestop, and health checks are performed for the exit and startup.
Note
This feature was launched on April 27, 2022 (UTC +8) and can be enabled on Pods created earlier only after they are rebuilt.
9100 port issue
TKE Serverless Pods expose monitoring data via port 9100 by default, and you can access 9100/metrics to get the data by running the following commands:
Get all metrics:
curl -g "http://<pod-ip>:9100/metrics"
We recommend you remove the ipvs metric for large clusters:
If your business requires listening on port 9100, you can avoid conflicts by using other ports to collect monitoring data when creating a Pod. The configuration is as shown below:
eks.tke.cloud.tencent.com/metrics-port: "9110"
If the port for monitoring data exposure is not changed and the business listens on port 9100 directly, an error will be reported in the new TKE Serverless network scheme, indicating that port 9100 is already in use:
listen() to 0.0.0.0:9100, backlog 511 failed (1: Operation not permitted)
When this error is reported, you need to add the annotation metrics-port to the Pod to change the monitoring port and then rebuild the Pod.
Note
If the Pod has a public EIP, you need to configure a security group for it. Take port 9100 into account and open the ports you need.
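To show where the metrics-port annotation from this section goes, here is a minimal sketch of a Pod whose application listens on port 9100 while monitoring data is moved to port 9110. The Pod name, container name, and image are placeholders for your own workload; the annotation key and value are the ones given above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-on-9100             # hypothetical name
  annotations:
    # Move monitoring data exposure away from the default port 9100
    eks.tke.cloud.tencent.com/metrics-port: "9110"
spec:
  containers:
  - name: app
    image: nginx                # placeholder for a workload that listens on 9100
    ports:
    - containerPort: 9100
```

Because the annotation takes effect when the Pod is created, an existing Pod needs to be rebuilt after the annotation is added, as noted above.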