Inference Acceleration Based on TensorRT-LLM

Last updated: 2026-03-03 16:58:39

Overview

TensorRT-LLM is a Large Language Model (LLM) inference acceleration framework developed by NVIDIA. It provides user-friendly Python APIs and uses the latest optimization techniques to build LLMs into TensorRT engine files, enabling efficient inference on NVIDIA GPUs.
TensorRT-LLM also provides a backend that integrates with NVIDIA Triton Inference Server for deploying models as online inference services. The backend supports in-flight batching, which can significantly improve inference throughput and reduce latency.
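In-flight batching can be pictured with a toy scheduler: finished sequences free their batch slots between decoding steps, and queued requests take those slots immediately instead of waiting for the whole batch to drain. The sketch below is an illustrative simulation only, not TensorRT-LLM's actual scheduler:

```python
# Toy illustration of in-flight (continuous) batching: finished sequences
# leave the batch between decoding steps and queued requests fill the
# freed slots, instead of waiting for the whole static batch to finish.
from collections import deque

def simulate_inflight_batching(request_lengths, max_batch_size):
    """Return the number of decoding steps needed when requests with the
    given output lengths are scheduled with in-flight batching."""
    queue = deque(request_lengths)
    active = []          # remaining tokens per in-flight request
    steps = 0
    while queue or active:
        # Admit queued requests into free slots before each step.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        steps += 1
        # One decoding step: every active request emits one token;
        # finished requests are dropped, freeing their slots.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps
```

For example, output lengths [4, 1, 1, 1] with a batch size of 2 finish in 4 steps, whereas static batching would need 5 (4 steps for the batch [4, 1], then 1 step for [1, 1]), because the short requests release their slots as soon as they complete.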
This document uses the Baichuan2-13B-Chat model as an example to demonstrate how to perform inference acceleration on an LLM using TensorRT-LLM and deploy it.

Operation Steps

Creating a Dev Machine for Model Conversion

You can pull the TensorRT-LLM image provided by Tencent Cloud TI-ONE Platform (TI-ONE) and save it to your image repository instance of Tencent Container Registry (TCR) Individual Edition or Enterprise Edition:
MY_IMAGE="<Your repository address>"
docker pull tione-public-hub.tencentcloudcr.com/qcloud-ti-platform/tritonserver:23.10-py3-trtllm-0.7.1
docker tag tione-public-hub.tencentcloudcr.com/qcloud-ti-platform/tritonserver:23.10-py3-trtllm-0.7.1 ${MY_IMAGE}
docker push ${MY_IMAGE}
Using the custom image above, launch a dev machine and mount the Cloud File Storage (CFS) or Data Accelerator Goose FileSystem (GooseFS) storage you have applied for. The dev machine needs at least 1 GPU card, because building TensorRT engine files involves running inference.


Building a TensorRT-LLM Model

After you access the dev machine, you will find built-in sample code for model conversion in the /workspace/TensorRT-LLM-examples directory of the image. You can follow the examples to proceed:
1. Download the Baichuan2-13B-Chat model.
You can download the model yourself and save it to the CFS path. Below is a reference method:
apt update && apt install git-lfs
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/baichuan-inc/Baichuan2-13B-Chat.git
cd Baichuan2-13B-Chat
git lfs pull
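Before converting the model, it is worth confirming that git lfs pull actually materialized the weight files: a file that still begins with a "version https://git-lfs" header is an un-downloaded pointer stub. The helper below is an illustrative check, not part of the official workflow:

```python
# Sanity check: scan a downloaded model directory for files that are
# still Git LFS pointer stubs (i.e., the real content was never pulled).
from pathlib import Path

def find_lfs_pointer_stubs(model_dir):
    """Return paths of files that are still Git LFS pointer files."""
    stubs = []
    for path in Path(model_dir).rglob("*"):
        # Pointer stubs are tiny text files, so only inspect small files.
        if path.is_file() and path.stat().st_size < 1024:
            head = path.read_bytes()[:40]
            if head.startswith(b"version https://git-lfs"):
                stubs.append(path)
    return stubs
```

If the returned list is non-empty, rerun git lfs pull inside the repository before proceeding.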
2. Follow the comments to modify the content of the build_triton_repo_baichuan2_13b.sh file, and then run the script:
#!/bin/bash
set -ex
# Specify the tensor parallelism (TP) degree.
TP=1
# Modify it. Specify the local directory of the original Hugging Face model.
HF_MODEL=/home/tione/notebook/triton-example/hf_model/Baichuan2-13B-Chat
# Modify it. Specify the output directory of the Triton model package. It is recommended to create a directory in the CFS instance.
TRITON_REPO=/home/tione/notebook/triton-example/triton_model/Baichuan2-13B-Chat/trt-${TP}-gpu
# Specify the path to the TensorRT-LLM Engine build script.
BUILD_SCRIPT=tensorrtllm_backend/tensorrt_llm/examples/baichuan/build.py

# Create an output directory.
mkdir -p ${TRITON_REPO}
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm/* ${TRITON_REPO}/
# Copy Tokenizer relevant files to the output directory.
cp ${HF_MODEL}/*token* ${TRITON_REPO}/tensorrt_llm/1/

# Build the TensorRT-LLM Engine file. For details of the parameters, see tensorrt_llm/examples/baichuan/README.md.
# Example 1: Baichuan V2 13B parameter scale model, using FP16 and enabling in-flight batching.
#python3 $BUILD_SCRIPT --model_version v2_13b \
# --model_dir ${HF_MODEL} \
# --output_dir ${TRITON_REPO}/tensorrt_llm/1/ \
# --world_size ${TP} \
# --max_batch_size 32 \
# --dtype float16 \
# --use_gemm_plugin float16 \
# --use_gpt_attention_plugin float16 \
# --remove_input_padding \
# --paged_kv_cache

# Example 2: Baichuan V2 13B parameter scale model, using INT8 weight-only quantization and enabling in-flight batching.
python3 $BUILD_SCRIPT --model_version v2_13b \
--model_dir ${HF_MODEL} \
--output_dir ${TRITON_REPO}/tensorrt_llm/1/ \
--world_size ${TP} \
--max_batch_size 32 \
--dtype float16 \
--use_weight_only \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--remove_input_padding \
--paged_kv_cache


# Modify the Triton config.pbtxt configuration file.
# The options.txt file can be modified as needed, but it is generally recommended to use the default values.
OPTIONS=options.txt
python3 tensorrtllm_backend/tools/fill_template.py -i ${TRITON_REPO}/preprocessing/config.pbtxt ${OPTIONS}
python3 tensorrtllm_backend/tools/fill_template.py -i ${TRITON_REPO}/postprocessing/config.pbtxt ${OPTIONS}
python3 tensorrtllm_backend/tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm_bls/config.pbtxt ${OPTIONS}
python3 tensorrtllm_backend/tools/fill_template.py -i ${TRITON_REPO}/ensemble/config.pbtxt ${OPTIONS}
python3 tensorrtllm_backend/tools/fill_template.py -i ${TRITON_REPO}/tensorrt_llm/config.pbtxt ${OPTIONS}

# Create a symbolic link to /data/model. In TI-ONE Online Services, models are mounted to this path by default.
mkdir -p /data
ln -s ${TRITON_REPO} /data/model

# Start the Triton inference service locally for debugging.
# launch_triton_server
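The --use_weight_only flag in the script above stores the weights as INT8 and dequantizes them on the fly inside the GEMMs, trading a small accuracy loss for roughly halved weight memory versus FP16. The sketch below illustrates the numeric idea with per-row absmax quantization; it is an illustration only, not TensorRT-LLM's actual kernel or scheme:

```python
# Illustrative INT8 weight-only quantization: each row of the weight
# matrix is scaled so its largest absolute value maps to 127, stored as
# INT8, and dequantized by multiplying back with the per-row scale.
def quantize_int8_per_channel(weights):
    """Quantize a weight matrix (list of rows). Returns (int8_rows, scales)."""
    q_rows, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid 0 for all-zero rows
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from INT8 values."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Because only the weights are quantized, activations stay in FP16, which is why the flag is called "weight-only" quantization.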
The directory structure of the converted model is as follows:
# tree
.
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
├── tensorrt_llm
│   ├── 1
│   │   ├── baichuan_float16_tp1_rank0.engine
│   │   ├── config.json
│   │   ├── model.cache
│   │   ├── special_tokens_map.json
│   │   ├── tokenization_baichuan.py
│   │   ├── tokenizer_config.json
│   │   └── tokenizer.model
│   └── config.pbtxt
└── tensorrt_llm_bls
    ├── 1
    │   └── model.py
    └── config.pbtxt
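Before starting the server, you can sanity-check that the converted repository matches this layout. The helper below is an illustrative check, not part of the TensorRT-LLM tooling; it verifies that each Triton model directory contains a config.pbtxt and a version directory named 1:

```python
# Illustrative structural check of a converted Triton model repository.
import os

EXPECTED = ["ensemble", "preprocessing", "postprocessing",
            "tensorrt_llm", "tensorrt_llm_bls"]

def check_triton_repo(repo):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for model in EXPECTED:
        mdir = os.path.join(repo, model)
        if not os.path.isdir(mdir):
            problems.append(f"missing model directory: {model}")
            continue
        if not os.path.isfile(os.path.join(mdir, "config.pbtxt")):
            problems.append(f"missing {model}/config.pbtxt")
        if not os.path.isdir(os.path.join(mdir, "1")):
            problems.append(f"missing version directory: {model}/1")
    return problems
```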
You can directly run the launch_triton_server command on the dev machine to start the Triton Inference Server and refer to api_test.sh for local calls. If you want to release a formal inference service that can be called from the public network or within the Virtual Private Cloud (VPC) network, see the section below.

Creating an Online Service

When a service is created, set the model source to CFS or GooseFS, and select a model from the path on the CFS or GooseFS instance where the converted Triton model package is stored.
For the running environment, select the custom image from the previous step or the built-in image Built-in/TRITON(1.0.0)/23.10-py3-trtllm-0.7.1.
Select computing resources according to actual availability: at least 8 CPU cores and 40 GB of memory, preferably with an A100 or A800 GPU.
Logs similar to the following indicate that the service has started up successfully:




Calling an API

For the text generation API, see Triton's documentation, with an example as follows:
# The public access address can be obtained from the Service Call tab on the frontend page of the online service instance.
SERVER_URL=https://service-********.sh.tencentapigw.com:443/tione

# Non-streaming request.
curl -X POST ${SERVER_URL}/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "<reserved_106>Who are you?<reserved_107>", "max_tokens": 64}'

# Streaming request.
curl -X POST ${SERVER_URL}/v2/models/tensorrt_llm_bls/generate_stream -d '{"text_input": "<reserved_106>Who are you?<reserved_107>", "max_tokens": 64, "stream": true}'
Non-streaming returned results:
{"cum_log_probs":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"text_output":"I am Baichuan LLM, an LLM created by engineers at Baichuan AI. I can engage in natural conversations with humans, answer questions, assist with creative tasks, and help the public easily and inclusively access world knowledge and professional services. If you have any questions, feel free to ask me anytime."}
Streaming returned results:
data: {"cum_log_probs":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":0.0,"text_output":"I am"}
data: {"cum_log_probs":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0],"text_output":"I am Baichuan"}
data: {"cum_log_probs":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0],"text_output":"I am Baichuan LLM"}
data: {"cum_log_probs":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"text_output":"I am Baichuan LLM"}
data: {"cum_log_probs":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0],"text_output":"I am Baichuan LLM,"}

... multiple lines omitted ...

data: {"cum_log_probs":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"text_output":"I am Baichuan LLM, an LLM created by engineers at Baichuan AI. I can engage in natural conversations with humans, answer questions, assist with creative tasks, and help the public easily and inclusively access world knowledge and professional services. If you have any questions, feel free to ask me anytime."}
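The same calls can be made from Python with only the standard library. This is a hedged sketch: the endpoint paths follow Triton's generate extension, while the server URL, the model name tensorrt_llm_bls, and the <reserved_106>/<reserved_107> prompt markers are taken from the examples above and may differ in your deployment. parse_sse_chunks is exercised against canned lines so it runs without a live server:

```python
import json
import urllib.request

def build_generate_request(server_url, model, prompt, max_tokens=64, stream=False):
    """Build (url, body) for Triton's generate / generate_stream endpoints."""
    suffix = "generate_stream" if stream else "generate"
    url = f"{server_url}/v2/models/{model}/{suffix}"
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    if stream:
        payload["stream"] = True
    return url, json.dumps(payload).encode("utf-8")

def generate(server_url, model, prompt, max_tokens=64):
    """Blocking, non-streaming call; requires a reachable service."""
    url, body = build_generate_request(server_url, model, prompt, max_tokens)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]

def parse_sse_chunks(lines):
    """Extract text_output from each 'data: {...}' line of a streaming reply."""
    outputs = []
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            outputs.append(json.loads(line[len("data:"):])["text_output"])
    return outputs
```

As the streaming results above show, each chunk carries the cumulative text so far, so the last parsed chunk is the complete reply.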