Introduction to the Training Acceleration Feature of Angel

Last updated: 2026-02-26 15:27:38

Introduction to the Training Acceleration Feature

Tilearn-Angel, which evolved from tiacc_training, provides LLM training acceleration that is compatible with the Hugging Face ecosystem. It offers computational optimization that combines custom CUDA operators with automatic compilation optimization, 3D hybrid parallelism (tensor, pipeline, and data parallelism) that works with Hugging Face models, and communication acceleration that is compatible with native Distributed Data Parallel (DDP). You can use these capabilities directly, without modifying your original code or converting your models. It also supports general training acceleration capabilities such as optimizer fusion, CPU/GPU affinity optimization, and adaptive mixed precision, as well as model compression. You can enable these capabilities by adding just a few lines of code.


1. Tilearn-Angel LLM Training Acceleration Images

It is recommended to use the latest built-in image of the platform:
tilearn-llm0.9-torch2.3-py3.10-cuda12.4-gpu
(Optional) Update the tilearn.llm and tilearn.ops packages in the image to the latest versions:
# tilearn-llm>=0.9.3
# tilearn.ops>=0.2.1.172
pip3 uninstall -y tilearn.llm tilearn.ops
pip3 install tilearn-llm==0.9.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install tilearn.ops==0.2.1.172 -i https://g-bnvx3728-pypi.pkg.coding.net/tione/tilearn/simple
wget https://tione-public-cos-1308945662.cos.ap-shanghai.myqcloud.com/tilearn/hybrid_parallel/colossalai-0.3.4.1-cp310-cp310-linux_x86_64.whl
pip3 install colossalai-0.3.4.1-cp310-cp310-linux_x86_64.whl
When a custom image is used, it must meet one of the following conditions:
The custom image is built based on pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel, with torch.__version__ == '2.1.2'.
The custom image is built based on the platform image tilearn-llm1.0-torch2.1-angel-vllm1.0-py3.10-cuda12.1-gpu.
For other cases, contact the acceleration team for support.
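Before launching a job in a custom image, it can help to check the installed packages against the versions quoted above. The following is a minimal sketch and not part of Tilearn itself: it assumes the pip distribution names tilearn-llm and tilearn.ops from the install commands above, and the helper names parse_version and meets_minimum are made up for illustration.

```python
from importlib import metadata


def parse_version(text):
    """Turn a dotted version such as '0.2.1.172' into a tuple of ints.

    A local suffix after '+' (for example '2.1.2+cu121') is ignored,
    and non-numeric parts are skipped.
    """
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def meets_minimum(installed, required):
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


# Minimum versions quoted in this document.
MINIMUMS = {"tilearn-llm": "0.9.3", "tilearn.ops": "0.2.1.172"}

if __name__ == "__main__":
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
        print(f"{name} {installed}: {status}")
```

The torch requirement can be checked the same way by passing torch.__version__ and '2.1.2' to meets_minimum, or by comparing the parsed tuples for equality when an exact match is needed.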

2. Tilearn-Angel Use Instructions

For more details, see Tilearn.llm Use Instructions.

2.1. Tilearn-Angel Computation Acceleration

This section primarily introduces the computational optimization capabilities for LLMs. For more information about how to use general training acceleration capabilities (communication optimization, optimizer fusion, CPU/GPU affinity optimization, and adaptive mixed precision), see Tilearn.llm Use Instructions.
Taking the Llama model as an example, the code modifications for using computational optimization are as follows:
### TILEARN.LLM
from tilearn.llm.transformers import LlamaForCausalLM

### The model API is consistent with the standard Hugging Face.
model = LlamaForCausalLM.from_pretrained(...)
Alternatively, use the AutoModelForCausalLM API:
### TILEARN.LLM
from tilearn.llm.transformers import AutoModelForCausalLM

### The model API is consistent with the standard Hugging Face.
model = AutoModelForCausalLM.from_pretrained(...)
Note:
Because Baichuan1-13B and Baichuan2-13B can conflict, tilearn.llm.transformers.AutoModelForCausalLM enables Baichuan1-13B by default. To use Baichuan2-13B, set the following environment variable in your training startup script:
export TILEARN_LLM_BAICHUAN_13B=2
The following models support acceleration: Llama, Bloom, Baichuan1, and Baichuan2. For more details, see Tilearn.llm Use Instructions.
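If the training job is started from a Python wrapper rather than a shell script, the same switch can be passed through the child process environment. The sketch below is illustrative only: build_training_env is a hypothetical helper, and it assumes, as the note above implies, that tilearn.llm reads TILEARN_LLM_BAICHUAN_13B from the environment of the training process.

```python
import os


def build_training_env(baichuan_13b_version=1, base_env=None):
    """Return a copy of the environment with the Baichuan-13B switch applied.

    Version 1 is the documented default, so the variable is removed in
    that case and set to "2" only when Baichuan2-13B is requested.
    """
    if baichuan_13b_version not in (1, 2):
        raise ValueError("baichuan_13b_version must be 1 or 2")
    env = dict(os.environ if base_env is None else base_env)
    if baichuan_13b_version == 2:
        env["TILEARN_LLM_BAICHUAN_13B"] = "2"
    else:
        env.pop("TILEARN_LLM_BAICHUAN_13B", None)
    return env


env = build_training_env(baichuan_13b_version=2, base_env={})
print(env)
```

The returned dictionary can be handed to subprocess.run(..., env=env) when starting the training script.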


2.2. Tilearn-Angel 3D Hybrid Parallelism Acceleration

Tilearn-Angel supports 3D hybrid parallelism (TensorParallel, PipelineParallel, and DataParallel) compatible with the Hugging Face ecosystem. No model conversion is required: you can run 3D parallel training directly with the Hugging Face Trainer, as shown below. For more information, see 3D Hybrid Parallelism Notebook Examples.

Environment variable configuration
export TILEARN_HYBRID_TP_SIZE=1
export TILEARN_HYBRID_PP_SIZE=2
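These two variables fix the tensor-parallel and pipeline-parallel degrees. Under the usual 3D-parallel convention, the remaining GPUs form the data-parallel dimension. The sketch below assumes that this convention (world size = TP x PP x DP) applies to Tilearn-Angel, which this document does not state explicitly, and that an unset variable means a degree of 1. data_parallel_size is a hypothetical helper.

```python
def data_parallel_size(world_size, env):
    """Derive the data-parallel degree from the hybrid-parallel variables.

    Assumes world_size = TP * PP * DP, the usual 3D-parallel layout.
    """
    tp = int(env.get("TILEARN_HYBRID_TP_SIZE", "1"))
    pp = int(env.get("TILEARN_HYBRID_PP_SIZE", "1"))
    if tp < 1 or pp < 1:
        raise ValueError("TP and PP sizes must be positive")
    if world_size % (tp * pp) != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP = {tp * pp}"
        )
    return world_size // (tp * pp)


# The values exported above, on a single 8-GPU machine.
env = {"TILEARN_HYBRID_TP_SIZE": "1", "TILEARN_HYBRID_PP_SIZE": "2"}
print(data_parallel_size(8, env))  # prints 4
```

In a real launch script, os.environ would be passed in place of the hand-written dictionary, so that a mismatch between GPU count and parallel sizes is caught before training starts.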
Training code configuration
### Computational optimization.
from tilearn.llm.transformers import LlamaForCausalLM
from tilearn.llm.transformers import AutoModelForCausalLM
### 3D parallelism.
import tilearn.llm.hybrid_parallel

def main():
    ### The model API is consistent with the standard Hugging Face.
    model = AutoModelForCausalLM.from_pretrained(...)
    run_exp()
Recommended parameter configurations for LLM fine-tuning
Llama-3-8B model (seqlength=4096)
# Default parameters for 8 x A100 40G.
GradienAccumulationSteps=64
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=2
TilearnHybridPPSize=2
TilearnHybridZeroStage=1
# Default parameters for 8 x A800 80G.
GradienAccumulationSteps=32
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=1
TilearnHybridPPSize=2
TilearnHybridZeroStage=1

3. Tilearn-Angel Training Acceleration Performance

4. Appendix: Tiacc_training Legacy Training Acceleration

tilearn.llm 0.7.12 and later versions as well as tilearn.ops 0.2.0.1 and later versions support the legacy tiacc_training acceleration capabilities. For more information about general training acceleration capabilities, see Section 2. Tilearn-Angel Use Instructions in this document.
If you use the latest platform image tilearn-llm0.4.2-torch2.1-deepspeed0.10.0-py3.10-cuda12.1-gpu, you can directly use legacy features.

Environment Setup

It is recommended to use the built-in PyTorch image and TensorFlow image of the platform:
PyTorch image: ti-acc2.0-torch1.9-py3.8-cuda11.1-gpu

TensorFlow image: ti-acc1.0-tf1.15-py3.6-cuda10.0-gpu


Using DDP Distributed Training Communication Optimization (PyTorch + DDP)

Start the training script in a way compatible with native DDP without modifying the training code. Sample startup command:
python3 -u -m tiacc_training.distributed.launch --nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT main.py
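When the launch is driven from a Python wrapper, the same command can be assembled from the environment. This is a sketch only: build_launch_command is a hypothetical helper, the fallback values are assumptions for a single-node test run, and only the flags shown in the command above are used.

```python
import shlex


def build_launch_command(env, script="main.py"):
    """Assemble the tiacc_training launcher command from environment values."""
    return [
        "python3", "-u", "-m", "tiacc_training.distributed.launch",
        "--nproc_per_node", env.get("GPUS_PER_NODE", "1"),
        "--nnodes", env.get("NNODES", "1"),
        "--node_rank", env.get("NODE_RANK", "0"),
        "--master_addr", env.get("MASTER_ADDR", "127.0.0.1"),
        "--master_port", env.get("MASTER_PORT", "29500"),
        script,
    ]


cmd = build_launch_command({"GPUS_PER_NODE": "8", "NNODES": "2", "NODE_RANK": "0"})
print(shlex.join(cmd))
```

Keeping the command as a list means it can be passed to subprocess.run(cmd, check=True) without shell quoting concerns.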
Measured performance of DDP distributed training communication optimization. (The speedup is noticeable only in multi-machine, multi-GPU scenarios. In single-machine, multi-GPU scenarios, performance is identical to native DDP.)

| Hardware Environment | Model | Number of GPU Cards | Native DDP (examples/sec per V100) | TI-ACC Communication Optimization (examples/sec per V100) |
|---|---|---|---|---|
| Tencent Cloud GN10Xp.20XLARGE320 | resnext50_32x4d | 1 (single-machine) | 227 | 227 |
| Tencent Cloud GN10Xp.20XLARGE320 | resnext50_32x4d | 8 (single-machine) | 215 | 215 |
| Tencent Cloud GN10Xp.20XLARGE320 | resnext50_32x4d | 16 (two-machine) | 116 | 158.6 |
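To put the two-machine numbers in perspective, the short calculation below derives the speedup over native DDP and the per-GPU scaling efficiency relative to the single-GPU run. It uses only the figures in the table above; the variable names are illustrative.

```python
# Throughput per V100 (examples/sec) from the table above.
single_gpu = 227.0   # 1 GPU, single machine
native_16 = 116.0    # 16 GPUs, two machines, native DDP
tiacc_16 = 158.6     # 16 GPUs, two machines, TI-ACC communication optimization

speedup = tiacc_16 / native_16
native_efficiency = native_16 / single_gpu
tiacc_efficiency = tiacc_16 / single_gpu

print(f"speedup over native DDP: {speedup:.2f}x")
print(f"scaling efficiency, native DDP: {native_efficiency:.1%}")
print(f"scaling efficiency, TI-ACC: {tiacc_efficiency:.1%}")
```

On these figures, the optimization is about 1.37x faster than native DDP on 16 GPUs and raises per-GPU efficiency from roughly 51% to roughly 70% of the single-GPU rate.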

Using Adaptive Mixed Precision Optimization (PyTorch)

import torch.cuda.amp as amp
import tiacc_training.torch

scaler = amp.GradScaler()
# Instantiate an object of the adaptive mixed precision policy class.
policy = tiacc_training.torch.tiacc_torch_warp.MixedPrecision_TrainingPolicy(policy, start_step, hold_step, end_step, interval_time, interval_hold_time)
# Determine whether to enable adaptive mixed precision for the current epoch based on the input parameters.
mixed_precision = policy.enable_mixed_precision(epoch, lr=lr, loss=loss, scaler=scaler)
with amp.autocast(enabled=mixed_precision):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Measured performance of adaptive mixed precision optimization:

| Hardware Environment | Model | Number of GPU Cards | Native PyTorch (examples/sec per V100) | TI-ACC Data I/O Optimization (examples/sec per V100) | TI-ACC Data I/O + Adaptive Mixed Precision Optimization (examples/sec per V100) |
|---|---|---|---|---|---|
| Tencent Cloud GN10Xp.20XLARGE320 | resnet50 mmcls | 8 (single-machine) | 70.8 | 350.5 | 379.2 |
| Tencent Cloud GN10Xp.20XLARGE320 | centernet mmdet | 8 (single-machine) | 26.4 | 28.6 | 30.6 |
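The table combines two optimizations, so it is useful to separate the gain from data I/O optimization from the additional gain contributed by adaptive mixed precision. The calculation below uses only the measured values above; the helper name gains is illustrative.

```python
# Throughput per V100 (examples/sec) from the table above:
# (native PyTorch, data I/O optimization, data I/O + adaptive mixed precision)
results = {
    "resnet50 mmcls": (70.8, 350.5, 379.2),
    "centernet mmdet": (26.4, 28.6, 30.6),
}


def gains(native, io_only, io_plus_amp):
    """Return (I/O speedup, extra speedup from mixed precision, total speedup)."""
    return io_only / native, io_plus_amp / io_only, io_plus_amp / native


for model, row in results.items():
    io_gain, amp_gain, total = gains(*row)
    print(f"{model}: I/O {io_gain:.2f}x, AMP {amp_gain:.2f}x, total {total:.2f}x")
```

For resnet50, most of the improvement comes from data I/O (about 4.95x), with adaptive mixed precision adding about 8% on top. For centernet, the two steps contribute similar amounts (about 8% and 7%).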

Using Optimized Embedding Variable Construction (TensorFlow + PS)

# Start the container.
docker run -itd --name tiacc-rec-fm --network=host --ipc=host ccr.ccs.tencentyun.com/ti-platform/tensorflow:1.15.5-py3-rec-0121
# Access the container.
docker exec -it tiacc-rec-fm bash
# Method to use native TensorFlow embedding.
cd wideanddeep && bash start_all.sh --fm
# Method to use TI-ACC lookup optimization.
cd wideanddeep && bash start_all.sh --tiacc --fm
Measured performance of optimized embedding variable construction and lookup computation:

| Hardware Environment | Model | Number of GPU Cards | Native TensorFlow (global_steps/sec per V100) | TI-ACC After Optimization (global_steps/sec per V100) |
|---|---|---|---|---|
| Tencent Cloud GN10Xp.20XLARGE320 | DeepFM | 16 (two-machine) | 41.9 - 56 | 96.1 - 103.3 |
| Tencent Cloud GN10Xp.20XLARGE320 | Wide & Deep | 16 (two-machine) | 49.9 - 69 | 120 - 128 |

Acknowledgement

The Tilearn-Angel acceleration engine and the related demo case studies build on DeepSpeed, ColossalAI, Transformers, LLaMA-Factory, flash-attention, and PyTorch. We thank the authors of these projects for their contributions.


Training Acceleration Class/Function Description

1. Tilearn.llm LLM Training Acceleration

For more information about how to use the training acceleration capabilities (computational optimization, communication optimization, optimizer fusion, CPU/GPU affinity optimization, and adaptive mixed precision), see Tilearn.llm Use Instructions.

1.1. Computational Optimization Related APIs

The computational optimization API is fully compatible with Hugging Face. Taking the Llama model as an example, the usage is as follows:
### TILEARN.LLM
from tilearn.llm.transformers import LlamaForCausalLM

### The model API is consistent with the standard Hugging Face.
model = LlamaForCausalLM.from_pretrained(...)

1.2. 3D Hybrid Parallelism Optimization Related APIs

The 3D hybrid parallelism optimization feature is fully compatible with the Hugging Face ecosystem. You can use the Hugging Face Trainer without model conversion. Taking the Llama 3 model as an example, the usage is as follows:
Environment variable configuration
export TILEARN_HYBRID_TP_SIZE=1
export TILEARN_HYBRID_PP_SIZE=2
Training code configuration
### Computational optimization.
from tilearn.llm.transformers import LlamaForCausalLM
from tilearn.llm.transformers import AutoModelForCausalLM
### 3D parallelism.
import tilearn.llm.hybrid_parallel

def main():
    ### The model API is consistent with the standard Hugging Face.
    model = AutoModelForCausalLM.from_pretrained(...)
    run_exp()
Recommended parameter configurations for LLM fine-tuning
Llama-3-8B model (seqlength=4096)
# Default parameters for 8 x A100 40G.
GradienAccumulationSteps=64
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=2
TilearnHybridPPSize=2
TilearnHybridZeroStage=1
# Default parameters for 8 x A800 80G.
GradienAccumulationSteps=32
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=1
TilearnHybridPPSize=2
TilearnHybridZeroStage=1
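A quick way to compare the two recommended configurations is to compute the global batch size each one implies. The arithmetic below is a sketch under two assumptions that this document does not state: BatchSize is the per-replica micro-batch, and the data-parallel degree is the GPU count divided by TP x PP. global_batch_size is a hypothetical helper.

```python
def global_batch_size(num_gpus, tp, pp, batch_size, grad_accum_steps):
    """Samples consumed per optimizer step, assuming DP = num_gpus / (TP * PP)."""
    dp = num_gpus // (tp * pp)
    return batch_size * grad_accum_steps * dp


a100_40g = global_batch_size(num_gpus=8, tp=2, pp=2, batch_size=1, grad_accum_steps=64)
a800_80g = global_batch_size(num_gpus=8, tp=1, pp=2, batch_size=1, grad_accum_steps=32)
print(a100_40g, a800_80g)  # prints 128 128
```

Under this reading, both configurations train with the same global batch of 128 sequences: the A800 80G setup halves the tensor-parallel degree, which doubles the number of data-parallel replicas, and halves the gradient accumulation steps to compensate.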

2. Tiacc_training Legacy Training Acceleration

tilearn.llm 0.7.12 and later versions as well as tilearn.ops 0.2.0.1 and later versions support the legacy tiacc_training acceleration capabilities. For more information about general training acceleration capabilities, see Section 2. Tilearn-Angel Use Instructions in this document.
If you use the latest platform image tilearn-llm0.4.2-torch2.1-deepspeed0.10.0-py3.10-cuda12.1-gpu, you can directly use legacy features.

tiacc_training.distributed.launch Function

Initializes DDP communication acceleration optimization. The API is fully consistent with native torch.distributed.launch. It automatically adapts native DDP-related functions to call TI-ACC communication acceleration capabilities. The main native DDP-related modules/classes include torch.distributed and torch.nn.parallel.DistributedDataParallel().

adaptfp16.MixedPrecision_TrainingPolicy Class

Instantiates an adaptive policy for adaptive mixed precision during training. Adaptive policies include time-based mixed precision, time-learning rate mixed precision policy, and loss function mixed precision policy. Initialization parameters:
| Parameter | Type | Required | Parameter Description | Example | Default Value |
|---|---|---|---|---|---|
| policy | INT | Yes | Adaptive mixed precision policy. 0: time-based mixed precision, suitable for general adaptive scenarios. 1: time-learning rate mixed precision policy, suitable for scenarios where abnormal loss fluctuations occur during a specific training phase. 2: loss function mixed precision policy, suitable for scenarios where the loss decreases too rapidly or too slowly during training. | 0 | None |
| start_time | INT | No | Start time for enabling adaptive mixed precision. Generally, it is recommended to set the start time to 10. This parameter is required when the policy value is set to 0 or 1, and is optional when the policy value is set to 2. | 10 | 10 |
| end_time | INT | No | End time for enabling adaptive mixed precision. Generally, it is recommended to set the end time to the last epoch time. This parameter is required when the policy value is set to 0 or 1, and is optional when the policy value is set to 2. | 1000 | None |
| hold_time | INT | No | Hold time for enabling policy 1. During the hold time, a unified policy is used: either enabled or disabled. Generally, it is recommended to set the hold time to the duration of abnormal loss fluctuations during training. This parameter is required when the policy value is set to 1, and is optional when the policy value is set to 0 or 2. | 20 | None |
| interval_time | INT | No | Interval for enabling policy 2. The default value is 1000, which indicates that policy 2 is enabled every 1,000 epochs. This parameter is required when the policy value is set to 2, and is optional when the policy value is set to 0 or 1. | 1000 | 1000 |
| interval_hold_time | INT | No | Hold time after policy 2 is enabled within the interval_time. The default value is 100. For example, if interval_time is 1000, policy 2 is enabled during 1000-1100, 2000-2100, and so on. This parameter is required when the policy value is set to 2, and is optional when the policy value is set to 0 or 1. | 100 | 100 |
Instantiated object:

| Object | Type | Object Description |
|---|---|---|
| policy | MixedPrecision_TrainingPolicy class | An instantiated object of the adaptive policy for automatic mixed precision during training. |
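The interaction of interval_time and interval_hold_time is easiest to see with a small example. The function below is an illustration of the windows described in the parameter table (1000-1100, 2000-2100, and so on), not the library's implementation. in_policy2_window is a hypothetical name, and treating both window ends as inclusive is an assumption.

```python
def in_policy2_window(epoch, interval_time=1000, interval_hold_time=100):
    """Return True when `epoch` falls inside a policy 2 window.

    With the defaults, the windows are 1000-1100, 2000-2100, and so on,
    matching the documented example. No window starts before the first
    full interval has elapsed.
    """
    if epoch < interval_time:
        return False
    return epoch % interval_time <= interval_hold_time


for epoch in (50, 1000, 1100, 1101, 2050):
    print(epoch, in_policy2_window(epoch))
```

With the default values, epochs 1000, 1100, and 2050 fall inside a window, while 50 and 1101 do not.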

enable_mixed_precision Method

This function belongs to the MixedPrecision_TrainingPolicy class and determines whether to enable adaptive mixed precision for the current epoch based on the input parameters. Input parameters:
| Parameter | Type | Required | Parameter Description | Example | Default Value |
|---|---|---|---|---|---|
| epoch | INT | Yes | Current epoch. | 20 | None |
| scaler | torch.cuda.amp.GradScaler | Yes | Instantiated object for gradient scaling. | scaler | None |
| lr | float | No | Learning rate for the current epoch. | 0.01 | None |
| loss | float | No | Loss value from the previous epoch. | 0.1 | None |
Output parameters:
| Output Parameter | Type | Parameter Description |
|---|---|---|
| mixed_precision | BOOL | Determines whether to enable adaptive mixed precision for the current epoch based on the input parameters. Returns TRUE if enabling is required; otherwise, returns FALSE. |