
Description

Last updated: 2023-09-26 14:12:55

Feature Overview

In typical scenarios, qGPU Pods share physical GPU resources fairly: the qGPU kernel driver allocates equivalent GPU time slices to each task. However, different GPU computing tasks vary in characteristics and importance, and therefore in their GPU resource usage and requirements. For example, real-time inference is latency-sensitive and needs quick access to GPU resources for computation, but its GPU utilization is usually not high. Model training consumes far more GPU resources but is less latency-sensitive and can tolerate some suppression.

To address this, Tencent Cloud has introduced the qGPU online/offline hybrid deployment feature. qGPU online/offline hybrid deployment is an innovative GPU scheduling technology launched by Tencent Cloud that supports mixed deployment of online (high-priority) tasks and offline (low-priority) tasks on the same GPU card. At the kernel and driver level, it lets low-priority tasks use 100% of the idle computing power and lets high-priority tasks preempt 100% of the computing power. Relying on qGPU online/offline hybrid scheduling technology, users can further exploit their GPU resources, driving GPU utilization toward 100% and minimizing GPU usage costs.



Strengths

qGPU online/offline hybrid deployment keeps GPU computing power under absolute control and pushes utilization to the limit:
100% utilization of idle computing power: whenever GPU computing power is not occupied by high-priority tasks, low-priority tasks can use all of it.
100% preemption of the computing power of low-priority tasks: when busy, high-priority tasks preempt GPU computing power from low-priority tasks.

Typical Use Cases

Hybrid deployment of online and offline inference

Search/recommendation inference tasks support online services and require real-time GPU computing power. Data preprocessing inference tasks support offline data cleaning and processing and have lower real-time requirements for GPU computing power. By setting online inference tasks as high-priority and offline inference tasks as low-priority, they can be hybrid deployed on the same GPU card.

Hybrid deployment of online inference and offline training

Real-time inference is sensitive to the availability of GPU computing power and uses a relatively small amount of resources, while model training consumes a large amount of resources and is insensitive to the availability of GPU computing power. Therefore, the former can be set as a high-priority task and the latter as a low-priority one for deployment on the same GPU card.
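As an illustrative sketch, the two workloads above might be declared as Pods scheduled to the same qGPU card. The annotation key `tke.cloud.tencent.com/app-class`, its values, the `qgpu-memory` resource name, and the images are assumptions for illustration only; confirm the exact identifiers for your cluster version in the TKE console documentation.

```yaml
# High-priority (online) inference Pod -- annotation key, values, and
# resource names below are illustrative assumptions, not confirmed API.
apiVersion: v1
kind: Pod
metadata:
  name: online-inference
  annotations:
    tke.cloud.tencent.com/app-class: online    # high priority (assumed key)
spec:
  containers:
  - name: inference
    image: my-inference:latest                 # placeholder image
    resources:
      limits:
        tke.cloud.tencent.com/qgpu-memory: "4" # GPU memory in GiB (assumed)
---
# Low-priority (offline) training Pod sharing the same GPU card
apiVersion: v1
kind: Pod
metadata:
  name: offline-training
  annotations:
    tke.cloud.tencent.com/app-class: offline   # low priority (assumed key)
spec:
  containers:
  - name: training
    image: my-training:latest                  # placeholder image
    resources:
      limits:
        tke.cloud.tencent.com/qgpu-memory: "8"
```

With these priorities set, the training Pod consumes whatever computing power the inference Pod leaves idle, and is paused the moment the inference Pod submits work.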

Technical Principles



With the online/offline scheduling strategy provided by TKE clusters, qGPU online/offline hybrid deployment capability can be enabled, helping online tasks (high-priority) and offline tasks (low-priority) to share physical GPU resources more efficiently. qGPU online/offline hybrid deployment technology mainly includes two features:

Feature 1: 100% utilization of the idle computing power by low-priority Pods

After low-priority Pods are scheduled to the node GPU, they can use all of the GPU computing power whenever it is not occupied by high-priority Pods. When multiple low-priority Pods share the GPU computing power, the configured qGPU scheduling policy applies. When there are multiple high-priority Pods, they compete for resources; no specific policy governs their allocation.

Feature 2: 100% preemption of the computing power of low-priority Pods

qGPU online/offline hybrid deployment provides a priority preemption capability, ensuring that busy high-priority Pods can immediately and fully utilize GPU computing resources. This absolute preemption is implemented at the qGPU driver level in two parts:

First, the qGPU driver senses the demand for GPU computing power from high-priority Pods. As soon as a high-priority Pod submits a computing task involving GPU power, the driver provides all the computing power to that Pod in the shortest time possible, with a response time within 1 ms. When the high-priority Pod has no tasks running, the driver releases the occupied computing power within 100 ms and reallocates it to the offline Pods.

Second, the qGPU driver supports pausing and resuming computing tasks. When a high-priority Pod has a computing task running, the low-priority Pod that originally occupied the GPU is immediately paused, releasing the GPU computing power for the high-priority Pod. When the high-priority Pod's task completes, the low-priority Pod is promptly awakened and continues computing from the point of interruption.

The timing diagram of computing tasks running at each priority level is shown below:



Scheduling policy

On a general qGPU node, you can set the policy for scheduling Pods on the same card. In the online/offline hybrid deployment feature, the policy affects only the scheduling of low-priority Pods.
Low-priority Pods: When high-priority Pods are dormant, low-priority Pods run, and GPU sharing among low-priority Pods still follows the configured policy. As soon as high-priority Pods start using GPU computing power, all low-priority Pods are paused immediately until the high-priority Pods' computing tasks complete; low-priority tasks then resume according to the policy.
High-priority Pods: High-priority Pods immediately preempt GPU computing power when they have computing tasks. Their relationship with low-priority Pods is absolute preemption, unaffected by the configured policy. GPU computing power among multiple high-priority Pods is allocated competitively, not governed by a specific policy.
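As a sketch, the card-level scheduling policy is typically selected through a node label applied when the qGPU node pool is created. The label key and the policy value names below are assumptions for illustration; verify them against the qGPU documentation for your TKE version.

```yaml
# Node label selecting the qGPU scheduling policy for same-card Pods.
# The key and values are assumed names -- confirm before use.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    tke.cloud.tencent.com/qgpu-schedule-policy: fixed-share  # assumed key; values such as fixed-share or best-effort are assumed
```

Under online/offline hybrid deployment, whichever policy is set here governs only how low-priority Pods share the card; high-priority Pods always preempt regardless of it.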