Overview
This document takes qwen2-7b-instruct as an example to demonstrate how to use Tencent Cloud TI-ONE Platform (TI-ONE) to deploy and perform inference acceleration on custom LLMs. It also provides usage instructions for accelerated quantization performance testing through test scripts. You may also use the scripts provided in this document to test the inference acceleration effects achieved through TI-ONE's built-in inference image Angel-vLLM.
Angel-vLLM is an LLM inference acceleration framework deeply optimized by the Tencent Cloud AI acceleration team based on open-source vLLM. While using the same API as community vLLM and remaining fully compatible with its features, it offers the following advantages:
1. Richer features. Compared to the community open-source edition of vLLM, Angel-vLLM provides INT8, NF4, and FP8 online quantization, lookahead parallel decoding, and other features.
2. More powerful performance. Compared to the community open-source edition of vLLM, Angel-vLLM quantization not only saves GPU memory but also reduces latency and increases throughput. Lookahead parallel decoding has been refined through practical business applications and shows significant improvement in RAG scenarios.
3. Better aligned precision. Angel-vLLM generation results have been extensively tested in online business operations, achieving complete alignment with Hugging Face generation results or keeping the precision basically unchanged.
TI-ONE has a built-in inference image integrated with the Angel-vLLM inference acceleration framework, enabling one-click deployment of LLM inference services compatible with the OpenAI API protocol.
This document will introduce how to use the built-in image Angel-vLLM on TI-ONE to deploy LLM inference services, enable Angel-vLLM acceleration capabilities, and call APIs.
Prerequisites
Before deploying and using your own model via TI-ONE and its built-in inference acceleration image Angel-vLLM, you need to prepare the following resources (including Cloud File Storage (CFS) for storing model files and GPU computing resources for the inference model).
Cloud File Storage
Apply for CFS: During the deployment and inference of custom LLMs, CFS is used for model file storage. Therefore, you need to apply for CFS first. For differences between CFS instances, see CFS > Storage Types and Specifications. Ensure network connectivity between the purchased CFS instance and the computing resource Cloud Virtual Machine (CVM) instances mentioned above. For details about CFS usage, see CFS > Creating File Systems and Mount Points.
Note: If your underlying storage service is Data Accelerator Goose FileSystem extreme (GooseFSx), the operations are similar to CFS described in this document, so you can still use this document as a reference.
Computing Resources
Self-purchased computing resources: If you are using TI-ONE for the first time, see Resource Group Management to purchase appropriate computing resources based on the inference model parameters.
Select resources based on the actual model size or your available resources. The CVM instance resources required for LLM inference depend on the model parameter scale and context length. We recommend referring to the following table to quickly estimate the resources required for inference services. For detailed GPU memory usage, see Inference Resource Requirements.
Model Parameter Scale | GPU Card Type and Quantity |
6 ~ 8B | PNV5b * 1 / A10 * 1 / A100 * 1 / V100 * 1 |
12 ~ 14B | PNV5b * 1 / A10 * 2 / A100 * 1 / V100 * 2 |
65 ~ 72B | PNV5b * 8 / A100 * 8 |
Operation Steps
Model Preparations
Choose Training Workshop > Dev Machines > Create to create a dev machine:
Image: You may choose any built-in image, as this dev machine is only used for downloading model files.
Source of Machine: If you have not purchased any computing resources previously, you can select Purchase from the TI-ONE platform and pay-as-you-go billing mode. To select from CVM instances or select the yearly/monthly subscription billing mode, see the computing resource purchase instructions in the Prerequisites section.
Billing Mode: You can select either pay-as-you-go or yearly/monthly subscription billing mode.
Storage Path Configuration: Select CFS. The name should be the CFS instance you applied for and configured in the Prerequisites section. The path is the root directory by default and is used to specify the storage location for your custom LLM.
Other Settings: Not required by default.
Note: We download or upload the required model files through the dev machine. The resources used in the example are 8 CPU cores and 16 G of memory (they are used only for the dev machine and are unrelated to subsequent inference services; resources can be appropriately reduced).
The final configuration is as follows:

Model Files
After the dev machine is successfully created, start it. Then, click the Python3 kernel under the dev machine to create an ipynb page for downloading the required models via scripts. You can search for the LLM you need in ModelScope or Hugging Face, and use the Python scripts provided in these communities to download the model and save it to CFS. This document takes the Qwen2-7b-Instruct model as an example, with the download code as follows:
!pip install modelscope
from modelscope import snapshot_download

# qwen/Qwen2-7B-Instruct is the name of the model to be downloaded, and cache_dir is the address where the downloaded model will be saved.
# Here, /home/tione/notebook indicates saving the downloaded model in the root directory of the CFS mount directory.
model_dir = snapshot_download('qwen/Qwen2-7b-Instruct', cache_dir='/home/tione/notebook')
Note: The path mapping relationship is as follows: Taking the container mount path /home/tione/notebook when CFS is mounted as an example, if the CFS source path is / and cache_dir is /home/tione/notebook, the model is actually saved at the root path of CFS; if the CFS path is /dir and cache_dir is /home/tione/notebook/model, the model is actually saved in the /dir/model directory of CFS.
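The mapping described in the note can be sketched as a small helper. This is only an illustration of the rule stated above, not platform code; the function name cfs_location and the MOUNT_PATH constant are hypothetical names introduced here.

```python
import posixpath

# Container mount path used in this document's example (an assumption for other setups).
MOUNT_PATH = "/home/tione/notebook"


def cfs_location(cfs_source_path: str, cache_dir: str, mount_path: str = MOUNT_PATH) -> str:
    """Return the directory on CFS that corresponds to cache_dir inside the container."""
    mount = posixpath.normpath(mount_path)
    target = posixpath.normpath(cache_dir)
    # cache_dir must be the mount path itself or a directory below it.
    if target != mount and not target.startswith(mount + "/"):
        raise ValueError(f"{cache_dir} is not under the mount path {mount_path}")
    # relpath returns "." when cache_dir is the mount path itself.
    relative = posixpath.relpath(target, mount)
    return posixpath.normpath(posixpath.join(cfs_source_path, relative))


print(cfs_location("/", "/home/tione/notebook"))
print(cfs_location("/dir", "/home/tione/notebook/model"))
```

The two calls reproduce the two cases in the note: the CFS root path and the /dir/model directory.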
Copy the snapshot_download script above, replace qwen/Qwen2-7b-Instruct with the model you need to download, paste it into the new ipynb file, and click the run icon (or click the cell and press Shift+Enter) to start downloading the model.

Additionally, you can upload existing LLM files from your local machine, as shown in the figure below:
Note: The bandwidth for this upload method is limited, so use it only for files smaller than 10 MB. For larger models, upload via Tencent Cloud Object Storage (COS), preferably to a COS bucket in the same region as CFS. For desktop COS usage, see COS > User Instructions for Desktop Version. For command-line COS usage, see COS > COSCMD or COS > COSCLI. Then, install the coscmd or coscli tool on the dev machine to download the files to CFS.

Deployment Process
Service name: Defined by you.
Source of Machine: Choose either Select from CVM instances or Purchase from the TI-ONE platform, and then select the corresponding resource group or use pay-as-you-go mode to select CVM instance specifications on demand.
Deployment mode: If the model can be deployed using the resources of a single machine, standard deployment is recommended. If you need to deploy very large models where a single machine's resources are insufficient, you can select the multi-machine distributed deployment mode.
Replica settings:
Model Source: Choose Cloud Storage > CFS.
Model: Select the CFS instance where the model is located, then enter the source path of the model file on CFS. If you need to select a specific checkpoint from the training output, enter the actual training output path and ensure to specify up to the checkpoint-level directory.
Image: Choose Built-in/LLM/Angel-vLLM(2.1).
Resource: Select the required computing resources for the model on demand. For resource requirements, see the appendix Inference Resource Requirements.
Advanced Settings: For the starting command and environment variables, see Service Deployment Parameters.

Click Start Service, and the service will enter the creation process. The service status may remain in the Preparing status for a period, indicating that the service startup is not yet completed. You can view the startup progress by clicking the Events or Logs tabs of a specific service.
Service Deployment Parameters
Startup Command Description
The image's default startup command (that is, the command used when no startup command is specified) is run. It is essentially equivalent in functionality to the startup commands vllm serve and python3 -m vllm.entrypoints.openai.api_server provided by vLLM.
The main differences between the default startup command and the startup commands provided by vLLM are as follows:
Added the feature for setting certain parameters via environment variables. For details, see Description of Startup Command Parameters and Environment Variables below.
Adjusted the default values of certain parameters and added support for automatic configuration of certain parameters. For details, see Description of Startup Command Parameters and Environment Variables below.
When the multi-machine distributed deployment mode is used, the multi-machine Ray communication environment is automatically established.
Added support for quick startup of Low-Rank Adaptation (LoRA) models from built-in LLM training on TI-ONE (automatically enabling multi-LoRA or performing LoRA weight merging).
Optimized the memory pre-reading loading speed of models on CFS Turbo storage media.
All the following startup command parameters are compatible with the three startup commands run, vllm serve, and python3 -m vllm.entrypoints.openai.api_server, while environment variables are supported only by the default run startup command.
Description of Startup Command Parameters and Environment Variables
Parameter Details
Angel-vLLM core acceleration feature parameters:
Startup Command Parameter | Environment Variable | Meaning |
--quantization | QUANTIZATION | Quantization method, not set by default. Compared to the open-source vLLM 0.6.2 version, the following quantization methods are added: (1) FP8: W8A8 FP8 precision quantization, supported on NVIDIA GPU series with a compute capability of 8.9 or higher (Ada Lovelace and Hopper); reduces GPU memory usage for model weights by 50%, improves inference speed, and maintains near-lossless precision. (2) ifq_nf4: W4A16 NF4 precision quantization, supported on NVIDIA GPU series with a compute capability of 8.0 or higher (Ampere, Ada Lovelace, and Hopper); reduces GPU memory usage for model weights by 75%, improves inference speed, and offers better precision compared to INT4 quantization (this quantization method only supports quantization based on float16 precision, that is, --dtype float16). (3) ifq: W8A16 INT8 precision quantization, supported on NVIDIA GPU series with a compute capability of 7.0 or higher (Volta, Turing, Ampere, Ada Lovelace, and Hopper); reduces GPU memory usage for model weights by 50%, improves inference speed, and maintains near-lossless precision. All three quantization methods support one-click online quantization, without the need for prior model conversion or offline calibration. GPU memory usage: ifq_nf4 < FP8 = ifq < non-quantized. Inference speed: FP8 >= ifq_nf4 > ifq > non-quantized. Model precision: non-quantized >= FP8 >= ifq > ifq_nf4. |
--use-lookahead | USE_LOOKAHEAD | The default value is 0. Setting it to 1 enables lookahead parallel decoding, which can significantly speed up decoding in many scenarios. Compared to the open-source implementation, lookahead parallel decoding does not require additional small models or model heads and can achieve results consistent with non-parallel decoding. It shows noticeable acceleration effects when a large portion of the output text has appeared in the input text (such as RAG scenarios), or when there are similar requests or answers in a large number of requests (for acceleration effects, see Appendix). |
--num-speculative-tokens | NUM_SPECULATIVE_TOKENS | The default value is 6. This parameter defines the number of tokens decoded at a time during lookahead parallel decoding. If the required concurrency is high, this value can be decreased. If the concurrency is low, it can be increased. |
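To make the weight-memory reductions in the table concrete, the following is a rough sketch that estimates the GPU memory taken by model weights alone. It assumes a 16-bit (2 bytes per parameter) baseline and applies the 50% and 75% reductions stated above; the function name and the nominal parameter count of 7 billion are illustrative assumptions, and KV cache, activations, and quantization metadata are deliberately ignored.

```python
BYTES_PER_PARAM_16BIT = 2  # float16 / bfloat16 baseline

# Fraction of the 16-bit weight memory that remains, per the table above.
WEIGHT_FRACTION = {
    None: 1.0,        # no quantization
    "fp8": 0.5,       # W8A8 FP8: weights reduced by 50%
    "ifq": 0.5,       # W8A16 INT8: weights reduced by 50%
    "ifq_nf4": 0.25,  # W4A16 NF4: weights reduced by 75%
}


def weight_memory_gib(num_params_billion, quantization=None):
    """Estimate GPU memory (GiB) used by model weights only."""
    full_bytes = num_params_billion * 1e9 * BYTES_PER_PARAM_16BIT
    return full_bytes * WEIGHT_FRACTION[quantization] / 2**30


for method in (None, "fp8", "ifq", "ifq_nf4"):
    print(method, round(weight_memory_gib(7, method), 2), "GiB")
```

The actual footprint of a deployed service is larger, because vLLM also reserves memory for the KV cache according to --gpu-memory-utilization.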
To be compatible with more scenarios, the platform's built-in image has adjusted the default values of certain startup parameters. The following startup parameters differ from vLLM's default startup parameters:
Startup Command Parameter | Environment Variable | Meaning |
--max-model-len | MAX_MODEL_LEN | Maximum context length of the model. To be compatible with instance types with smaller GPU memory, the platform modifies the parameter value to 8192 for models with a context length greater than 8K by default. If you need to support a longer context, you can manually configure this parameter. |
--dtype | DTYPE | The default value is float16. If you want to perform inference with bfloat16 precision, manually change the value to bfloat16. |
--enable-prefix-caching | ENABLE_PREFIX_CACHING | The default value is true, which means the prefix caching feature is enabled. This significantly accelerates processing for long-text inputs with repeated prefixes and reduces Time to First Token (TTFT) in continuous multi-turn chat sessions. You can manually disable it by setting the environment variable ENABLE_PREFIX_CACHING=false. (For V100 GPUs, as this feature is not yet supported, the default value is false). |
--tensor-parallel-size | TP | The default value is the number of GPU cards specified when the online service is created, indicating the model parallelism size. |
--trust-remote-code | TRUST_REMOTE_CODE | In order to be compatible with models that require this option to be enabled, the default value is true. You can set the environment variable TRUST_REMOTE_CODE to false to manually disable it. |
--model | MODEL | Model name or path. The default value is /data/model, which corresponds to the default model mount path within the platform's container. |
--port | - | Service port. The default value is 8501, which corresponds to the default inference service port on the platform. |
--use-v2-block-manager | - | The default value is true, indicating that the v2 block manager is used. |
--chat-template | - | Hugging Face chat template. It can be a .jinja chat template file path or a template string. (If you do not set this parameter but have specified the MODEL_ID or CONV_TEMPLATE environment variables, or if the model directory contains a ti_model_config.json file, the platform will attempt to perform automatic matching. For details, see Chat Templates). |
--tool-call-parser | - | Tool call parser. Valid values: llama3_json, hermes, mistral, and internlm. (If you do not set this parameter but have specified the MODEL_ID or CONV_TEMPLATE environment variables, or if the model directory contains a ti_model_config.json file, the platform will attempt to perform automatic matching. For details, see Chat Templates). |
--enable-auto-tool-choice | - | Enables the automatic tool call capability, which must be used in conjunction with --tool-call-parser. (If you do not set this parameter but have specified the MODEL_ID or CONV_TEMPLATE environment variables, or if the model directory contains a ti_model_config.json file, the platform will attempt to perform automatic matching. For details, see Chat Templates). |
Other parameters that can be configured via environment variables supported by the platform image:
Startup Command Parameter | Environment Variable | Meaning |
--gpu-memory-utilization | GPU_MEMORY_UTILIZATION | GPU memory utilization. The default value is 0.9. |
--enforce-eager | ENFORCE_EAGER | Whether to forcibly enable PyTorch's eager mode. The default value is false, in which case CUDA Graph is additionally used for further acceleration, but this consumes extra GPU memory and increases service startup time. |
- | MODEL_ID | Model name. The default value is model. It is recommended to set it to the model's name on Hugging Face, which will trigger automatic matching for the chat template. |
- | CONV_TEMPLATE | Chat template name. It is not set by default and primarily used for automatic matching of the chat template. |
- | DISABLE_MEM_CACHE | Disables memory cache pre-reading. The default value is false. |
For the remaining startup parameters supported by the service, see vLLM's official documentation OpenAI Compatible Server.
Use Instructions
To specify the startup command: Advanced Settings > Startup Command. For example, to use ifq_nf4 quantization, the configuration is as follows (to adjust other parameters, see Description of Startup Command Parameters and Environment Variables above):

To specify environment variables: Advanced Settings > Environment Variables. For example, to use ifq_nf4 quantization, the configuration is as follows (the method for using other parameters via environment variables is the same; note that environment variable names must be in uppercase):

Recommended Parameters for Different Scenarios
In most cases, you can start the inference service without entering any startup commands or environment variables. For specific scenarios, we provide configuration guidance for certain startup parameters here.
Case - Model Experience - Minimum Resource Requirements
Scenario: The computing resources for inference are limited. Insufficient GPU memory is reported when you deploy the model with default parameters. The goal is only for a quick trial of the model, with no specific requirements for model precision or inference speed.
Recommended startup command:
run --quantization ifq_nf4 --enforce-eager --gpu-memory-utilization 0.95 --max-model-len 2048
Description:
--quantization ifq_nf4: enables 4-bit quantization, reducing GPU memory occupied by model weights by 75%. (This feature is supported on Ampere, Ada Lovelace, or Hopper series GPUs. For V100 or T4 GPUs, enable the ifq quantization mode.)
--enforce-eager: disables CUDA Graph, reducing additional GPU memory usage and allowing for a higher value for --gpu-memory-utilization.
--gpu-memory-utilization 0.95: increases GPU memory utilization to 95%, enhancing GPU memory usage efficiency.
--max-model-len 2048: reduces the context length to 2048, decreasing the GPU memory required for KV cache.
Case - Production Environment Deployment - Resume Parsing
Scenario: There are higher requirements for both model precision and inference speed. In resume parsing scenarios, the input text is long, and the generated content is highly likely to overlap with the input, making it suitable to enable parallel decoding acceleration.
Recommended startup command:
run --dtype auto --quantization fp8 --use-lookahead --num-speculative-tokens 4 --max-model-len 32768 --enable-prefix-caching --disable-log-requests --api-keys mock_api_key --served-model-name model_123
Description:
--dtype auto: automatically uses the precision configured in the model's config.json file. (For certain models, the effects of fp16 and bf16 may be different; adjust based on actual testing conditions.)
--quantization fp8: enables FP8 W8A8 quantization, significantly accelerating inference while maintaining precision and reducing GPU memory occupied by model weights by 50%. (This feature is supported on Ada Lovelace and Hopper series GPUs. For other GPU models, enable the ifq quantization mode.)
--use-lookahead: enables parallel decoding to accelerate generation in this scenario and increase throughput.
--num-speculative-tokens 4: specifies the number of tokens for parallel decoding. When support for high concurrency is required, it is recommended to set a smaller value. This value can be adjusted based on actual testing to evaluate throughput.
--max-model-len 32768: sets the maximum supported context length as needed.
--enable-prefix-caching: enables the prefix caching feature. This option is enabled by default in the run startup command and can be omitted.
--disable-log-requests: disables the logging of request details to prevent excessive and overly lengthy request information in logs.
--api-keys mock_api_key: sets an authentication token compatible with OpenAI API calls to prevent unauthorized access to public network APIs.
--served-model-name model_123: specifies the model name, making it easier for clients to identify which model returns the inference result.
Case - Using the vLLM Native Startup Command - Throughput Optimization
Scenario: You want to deploy the service using vLLM's native startup command and optimize service throughput as much as possible.
Recommended startup command:
vllm serve /data/model --port 8501 -tp 2 --quantization fp8 --max-model-len 32768 --enable-prefix-caching --disable-log-requests --served-model-name model_123 --num-scheduler-steps 8
Description:
/data/model: sets the model path to the default container mount path.
--port 8501: sets the service port to the default platform port 8501.
-tp 2: sets the model tensor parallelism size to 2. This value should be adjusted based on the actual number of GPU cards used by the service, and can be omitted for single-GPU card inference.
--quantization fp8: enables FP8 W8A8 quantization, significantly accelerating inference while maintaining precision and reducing GPU memory occupied by model weights by 50%. (This feature is supported on Ada Lovelace and Hopper series GPUs. For other GPU models, enable the ifq quantization mode.)
--max-model-len 32768: sets the maximum supported context length as needed.
--enable-prefix-caching: enables the prefix caching feature to reduce TTFT.
--disable-log-requests: disables the logging of request details to prevent excessive and overly lengthy request information in logs.
--served-model-name model_123: specifies the model name, making it easier for clients to identify which model returns the inference result.
--num-scheduler-steps 8: enables multi-step scheduling to optimize GPU utilization and improve service throughput (this option conflicts with parallel decoding; it is recommended to choose either this or parallel decoding based on actual business needs).
Chat Experience
After the service is ready, you can see the Chat Experience tab. Click the tab to go to the page for an online chat experience.
Chat API Call
The service also supports direct requests via the HTTP/HTTPS protocol. Click the Service Call page, and enter the chat API endpoint /v1/chat/completions in the API calling address field. For the API format, see the format of the OpenAI Chat Completions API.
Below are several common call scenarios:
Plain Text Chat
Example request:
{"messages":[{"role":"user","content":"Hello"}],"max_tokens":128}
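The same request can also be sent from a script. The following is a minimal sketch using only the Python standard library; the function names and the placeholder base address are assumptions for illustration, and the Authorization header is only relevant if the service was started with an API key.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=128, api_key=None):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {"messages": messages, "max_tokens": max_tokens}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Bearer token, as used by OpenAI-compatible servers.
        headers["Authorization"] = f"Bearer {api_key}"
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def chat(base_url, messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, messages, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (replace the placeholder with your service calling address):
# print(chat("https://<your-service-address>/", [{"role": "user", "content": "Hello"}]))
```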
Tool Call
This is supported by the API only when the tool call capability is enabled during service deployment. For details, see FAQs.
Example request:
{"tools": [{"type": "function","function": {"name": "get_current_weather","description": "Get the current weather in a given location","parameters": {"type": "object","properties": {"city": {"type": "string","description": "The city to find the weather for, e.g. 'Beijing'"}},"required": ["city"]}}}],"messages": [{"role": "user", "content": "what's the weather of Shanghai today?"}],"max_tokens": 512}
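When the model decides to call a tool, the response follows the OpenAI Chat Completions format: the assistant message carries a tool_calls list whose function.arguments field is a JSON string. The sketch below shows one possible way to dispatch such calls on the client side; get_current_weather is a hypothetical local stand-in that returns fixed data, and the sample response is hand-written for illustration rather than captured from a real service.

```python
import json


def get_current_weather(city):
    # Hypothetical local implementation; a real one would query a weather service.
    return {"city": city, "forecast": "sunny"}


LOCAL_TOOLS = {"get_current_weather": get_current_weather}


def run_tool_calls(response):
    """Execute each tool call in a chat completion response and build tool messages."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        output = LOCAL_TOOLS[function["name"]](**arguments)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results


# Hand-written sample response in the OpenAI format, for illustration only.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_current_weather", "arguments": "{\"city\": \"Shanghai\"}"},
            }],
        }
    }]
}
print(run_tool_calls(sample_response))
```

The returned tool messages can be appended to the messages list, after the assistant message, and sent in a follow-up request so that the model can produce the final answer.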
Multimodal
This is supported only when the model supports multimodal capabilities, such as the Qwen/Qwen2-VL-7B-Instruct and meta-llama/Llama-3.2-11B-Vision-Instruct models.
Example request:
{"messages": [{"role": "user","content": [{"type": "text","text": "What's in this image?"},{"type": "image_url","image_url": {"url": "https://tione-public-cos-1308945662.cos.ap-shanghai.myqcloud.com/test/cat.jpg"}}]}],"max_tokens": 512}
Third-Party LLM Application Integration
Taking Dify as an example, you can add a model of type OpenAI-API-compatible, and then enter the service's online calling address in the API endpoint URL field, appending /v1 to the end.
Server Performance Testing
Note: The following test uses random data. If you need to test the parallel decoding acceleration capability, it is recommended to use real business data for testing.
Start the dev machine you prepared in the model preparation step above (ensure that the CFS instance containing the LLM is mounted). Download the test scripts from the official vLLM project at vllm/benchmarks at main · vllm-project/vllm · GitHub (including benchmark_serving.py and backend_request_func.py). Create a working directory on the dev machine, and then create a startup script named run.sh as follows (you can modify the parameters according to your needs).
#!/bin/bash
RESULTS_FOLDER="results"
HOST="https://ms-cxxxxxxx.ti.tencentcs.com/ms-czzfp5c9/"
MODEL="../Qwen2-7B-Instruct"
QPS="inf"
INPUT_LEN=128
OUTPUT_LEN=128
NUM_PROMPTS=20
CONCURRENCY=16
mkdir -p "$RESULTS_FOLDER"
client_command="python3 benchmark_serving.py \
--port 8501 \
--base-url $HOST \
--endpoint /v1/chat/completions \
--backend openai-chat \
--model $MODEL \
--dataset-name random \
--random-input-len $INPUT_LEN \
--random-output-len $OUTPUT_LEN \
--ignore-eos \
--request-rate $QPS \
--max-concurrency $CONCURRENCY \
--num-prompts $NUM_PROMPTS \
--save-result \
--result-dir $RESULTS_FOLDER \
--metadata input_len=$INPUT_LEN output_len=$OUTPUT_LEN qps=$QPS concurrency=$CONCURRENCY"
eval $client_command
The meaning of each parameter is described as follows:
Parameter Name | Parameter Meaning |
MODEL | Path of the model mounted in CFS |
RESULTS_FOLDER | Path to save the final result |
INPUT_LEN | Number of tokens input into the model |
OUTPUT_LEN | Maximum number of tokens output by the model |
NUM_PROMPTS | Total number of requests |
CONCURRENCY | Number of concurrent requests |
HOST | Calling address |
Note: Conventional service calling addresses are not recommended because the time it takes to call linkages may fluctuate due to potential impacts from Web Application Firewall (WAF) and other factors. It is recommended that you use high-speed service calling addresses to stress-test service performance. That is, configure HOST as a high-speed service calling address. You can add the VPC network where the dev machine resides to the high-speed service call IP range of the online service. The operation guide is as follows:
First, on the service call page, click Add High-Speed Service Call IP Range:

Next, copy the dev machine's network information (the network it belongs to and the subnet it resides in) as follows:

Fill in the corresponding information and confirm the addition:

The final calling address is as follows:

The final test file directory structure is as follows:

Create a terminal and run bash run.sh to start the test script. You can view the performance test results from the output on the terminal:

The meanings of the metrics are as follows:
Metric Name | Metric Meaning |
Successful requests | Total number of completed requests |
Benchmark duration (s) | Total time taken to complete all requests |
Total input tokens | Number of input tokens in all requests |
Total generated tokens | Number of output tokens in all requests |
Request throughput (req/s) | Request throughput |
Output token throughput (tok/s) | Generated result throughput |
Time to First Token (ms) | Time for generating the first token |
Time per Output Token (ms) | Average time per output token for a request, excluding the first token |
Inter-token Latency (ms) | Time between the generation of two consecutive tokens |
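The relationship between the three latency metrics can be illustrated with a small sketch. It assumes that the client records the time at which the request was sent and the arrival time of each streamed token; the function name and the sample timestamps are made up for illustration and do not come from the benchmark script itself.

```python
def latency_metrics(send_time_ms, token_times_ms):
    """Compute TTFT, TPOT, and ITL (all in ms) for one streamed request."""
    # Time to First Token: delay until the first token arrives.
    ttft = token_times_ms[0] - send_time_ms
    # Inter-token Latency: gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    # Time per Output Token: remaining latency averaged over the tokens after the first.
    total_latency = token_times_ms[-1] - send_time_ms
    remaining_tokens = len(token_times_ms) - 1
    tpot = (total_latency - ttft) / remaining_tokens if remaining_tokens else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


# Made-up timestamps: request sent at 0 ms, four tokens received.
print(latency_metrics(0, [100, 120, 150, 170]))
```

For a single request, TPOT equals the mean of the ITL values; the two metrics differ in the benchmark report because TPOT statistics are taken per request while ITL statistics are taken over all individual token gaps.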
For a comparison of inference acceleration results on actual data sets, see Inference Acceleration Performance Test Results.
FAQs
Question: What are the limitations on using Angel-vLLM's parallel decoding capability?
Answer: Parallel decoding conflicts with vLLM's --enable-chunked-prefill feature. By default, vLLM automatically enables --enable-chunked-prefill when --max-model-len exceeds 32K. To use lookahead parallel decoding acceleration in this scenario, manually disable it by adding --enable-chunked-prefill false to the startup command. Additionally, parallel decoding is incompatible with starting vLLM with the --num-scheduler-steps parameter set to a value greater than 1.
Question: What are the differences between Angel-vLLM's quantization capability and open-source GPTQ and AWQ quantization?
Answer: Angel-vLLM's INT8, NF4, and FP8 quantization all support one-click online quantization, without requiring offline calibration with a data set. Compared to other open-source quantization, it offers faster implementations. Among these, INT8 and FP8 achieve near-lossless precision, while NF4 quantization yields better precision than INT4 in experimental results. Open-source GPTQ and AWQ are offline calibration algorithms, primarily used for INT4 and INT8 quantization. Angel-vLLM does not support using ifq, ifq_nf4, or FP8 quantization methods to deploy models that have been calibrated by GPTQ or AWQ. For such models, use vLLM's native quantization methods, such as gptq, gptq_marlin, awq, and awq_marlin.
Note:
To run inference on the GPTQ or AWQ quantized model of qwen2 or qwen2.5, add the environment variable VLLM_ALLOW_QUANT_LM_HEAD=0.
Question: Why does the GPU memory utilization reach 90% when I deploy a 7B model on a GPU with 40 G of memory?
Answer: The vLLM framework pre-allocates GPU memory. Specifically, it computes GPU memory x gpu_memory_utilization - peak GPU memory consumption for a single inference (including model weights + KV cache), and allocates the resulting remaining GPU memory to GPU KV cache blocks to increase the batch processing size supported by the service. With the default gpu_memory_utilization of 0.9, the service therefore occupies about 90% of GPU memory regardless of model size. To estimate the minimum GPU memory required for model weights and KV cache, see Inference Resource Requirements.
Question: How do I enable the function call capability?
Answer: To enable tool call, add the parameters --enable-auto-tool-choice and --tool-call-parser <parser_name>. For usage, see vLLM official documentation. Alternatively, you can set MODEL_ID to the model's name on Hugging Face. For models supporting tool calls, these parameters are enabled by default. For details, see Chat Templates.
Appendix
Inference Acceleration Performance Test Results
FP8 Quantization Acceleration Test
Test environment:
Configuration Item | Configuration Details |
GPU type | PNV5b |
Test model name | Qwen2-7B-Instruct |
Test data set | ShareGPT_V3_unfiltered_cleaned_split.json |
Test code | vllm/benchmark |
Test item | Comparison between no acceleration and FP8 acceleration |
Test script:
python benchmark_serving.py \
--port xxxx \
--base-url http://xxxxxxxx/ms-rpbpr92c/ \
--backend openai-chat \
--model ../Qwen2-7B-Instruct \
--dataset-name sharegpt \
--endpoint /v1/chat/completions \
--tokenizer ../Qwen2-7B-Instruct \
--max-concurrency 10 \
--save-result \
--result-dir results \
--dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json
Test results based on Angel-vLLM inference acceleration (concurrency=10, all other test conditions consistent):
Monitoring Metric | No Angel-vLLM Feature | With FP8 Quantization | Performance Improvement |
Successful requests | 1000 | 1000 | - |
Total input tokens | 217393 | 217393 | - |
Request throughput (req/s) | 0.68 | 1.08 | +58.82% |
Output token throughput (tok/s) | 383.75 | 597.04 | +55.58% |
Total Token throughput (tok/s) | 531.74 | 832.25 | +56.51% |
Mean TTFT (ms) | 113.27 | 86.90 | +23.28% |
Median TTFT (ms) | 76.52 | 55.64 | +27.29% |
P99 TTFT (ms) | 874.30 | 708.10 | +19.01% |
Mean TPOT (ms) | 24.33 | 16.56 | +31.94% |
Median TPOT (ms) | 24.17 | 16.46 | +31.90% |
P99 TPOT (ms) | 27.34 | 18.68 | +31.68% |
Mean ITL (ms) | 24.22 | 16.52 | +31.79% |
Median ITL (ms) | 23.45 | 15.83 | +32.49% |
P99 ITL (ms) | 69.92 | 50.20 | +28.20% |
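The Performance Improvement column can be reproduced from the two measurement columns. The sketch below assumes the improvement is the relative change against the baseline, counted as an increase for throughput metrics and as a reduction for latency metrics; the function name is an illustrative choice.

```python
def improvement_percent(baseline, accelerated, higher_is_better):
    """Relative improvement over the baseline, in percent, rounded to two decimals."""
    if higher_is_better:
        change = (accelerated - baseline) / baseline  # throughput: larger is better
    else:
        change = (baseline - accelerated) / baseline  # latency: smaller is better
    return round(change * 100, 2)


# Values taken from the table above.
print(improvement_percent(0.68, 1.08, higher_is_better=True))      # request throughput
print(improvement_percent(113.27, 86.90, higher_is_better=False))  # mean TTFT
print(improvement_percent(24.33, 16.56, higher_is_better=False))   # mean TPOT
```

Note that for the latency rows (TTFT, TPOT, and ITL), a positive percentage in the table therefore denotes a reduction in latency.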
Lookahead Parallel Decoding Acceleration Performance
Since the acceleration principle of lookahead parallel decoding is that tokens that have been encountered by the service will be accelerated, generally, the acceleration effect improves as the number of sessions increases. The figure below shows an intuitive comparison of the generation speed for the second identical request with the default parameters versus with lookahead parallel decoding enabled.

Chat Templates
The chat template is used to convert user-input chat into prompt text for the LLM. For details, see the introductions on Hugging Face. Chat templates are generally described using Jinja templates. For new open-source chat models (typically with Instruct or Chat suffixes), you can usually find the chat_template field in the tokenizer_config.json file or the chat_template.json file within the model directory. This field represents the chat template for the model. If no special settings are configured, this chat template will be used by default.
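To check whether a model directory already ships with a chat template, you can inspect the two files mentioned above. The following is a minimal sketch, assuming the chat_template field is a single top-level string in either file (some models store a list of named templates instead, which this sketch skips). The helper name find_chat_template and the lookup order are illustrative and are not part of the inference image.

```python
import json
from pathlib import Path
from typing import Optional


def find_chat_template(model_dir: str) -> Optional[str]:
    """Return the chat_template string found in the model directory, or None."""
    # Assumption: check tokenizer_config.json first, then chat_template.json.
    for name in ("tokenizer_config.json", "chat_template.json"):
        path = Path(model_dir) / name
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as f:
            config = json.load(f)
        template = config.get("chat_template") if isinstance(config, dict) else None
        # Only accept a non-empty string; other shapes are skipped in this sketch.
        if isinstance(template, str) and template:
            return template
    return None


# Example: the path below is the model directory used elsewhere in this document.
template = find_chat_template("../Qwen2-7B-Instruct")
print("chat template found" if template else "no chat template found")
```

If the function returns None, the model is a candidate for the automatic template matching described next.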
To facilitate the deployment of models that do not come with a chat_template field, and to enable function call capabilities for models that support them, the inference framework automatically matches the model's chat template when the image is started with the default startup command and either the MODEL_ID or CONV_TEMPLATE environment variable is specified or the model directory contains a ti_model_config.json file. Matching follows the rules and order below (the matching rules for CONV_TEMPLATE and MODEL_ID are in an OR relationship).
Model Series | CONV_TEMPLATE | MODEL_ID (Case-Insensitive) | Default Added Startup Command | Default Additional stop_token_ids |
Non-chat model | generate | - | --chat-template examples/template_generate.jinja | - |
Hunyuan Large | hunyuan | Contains hunyuan-large. | --chat-template examples/template_hunyuan.jinja | [127960, 127967] |
Tencent industry LLMs | shennong_chat | Contains shennong or sn-. | --chat-template examples/template_shennong.jinja | - |
Llama-3.2 Vision Instruct model | llama-3.2-vision | Contains llama-3.2, with vision and instruct or chat. | --chat-template examples/tool_chat_template_llama3.2_vision.jinja --enable-auto-tool-choice --tool-call-parser llama3_json --enforce-eager --max-num-seqs 8 | - |
Llama-3.2 Instruct model | llama-3.2 | Contains llama-3.2 and instruct or chat; does not contain vision. | --chat-template examples/tool_chat_template_llama3.2_json.jinja --enable-auto-tool-choice --tool-call-parser llama3_json | - |
Llama-3.1 Instruct model | llama-3.1 | Contains llama-3.1 and instruct or chat. | --chat-template examples/tool_chat_template_llama3.1_json.jinja --enable-auto-tool-choice --tool-call-parser llama3_json | - |
Qwen2.5 Instruct model | qwen2.5 | Contains qwen2.5 and instruct or chat. | --enable-auto-tool-choice --tool-call-parser hermes | - |
Baichuan2 Chat model | baichuan2-chat | Contains baichuan2 and instruct or chat. | --chat-template examples/template_baichuan.jinja | - |
Llama2 Chat model | llama-2 | Contains llama-2 and instruct or chat. | --chat-template examples/template_llama2.jinja | - |
Llama3 Chat model | llama-3 | Contains llama-3- and instruct or chat. | --chat-template examples/template_llama3.jinja | [128001, 128009] |
Qwen Chat model | qwen or qwen-7b-chat | Contains qwen and instruct or chat; does not contain vl. | --chat-template examples/template_qwen.jinja | [151643, 151644, 151645] |
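The MODEL_ID column of the table can be read as an ordered list of case-insensitive substring checks. The following is a rough sketch of that reading, not the image's actual implementation. It assumes that rows are evaluated top to bottom and that the first match wins, and the function name match_conv_template is hypothetical. The Non-chat model row has no MODEL_ID rule, so it can only be selected through CONV_TEMPLATE and is not covered here.

```python
from typing import Callable, List, Optional, Tuple


def _is_chat(m: str) -> bool:
    # "instruct or chat" condition shared by most rows of the table.
    return "instruct" in m or "chat" in m


# (CONV_TEMPLATE value, predicate on the lower-cased MODEL_ID), in table order.
RULES: List[Tuple[str, Callable[[str], bool]]] = [
    ("hunyuan", lambda m: "hunyuan-large" in m),
    ("shennong_chat", lambda m: "shennong" in m or "sn-" in m),
    ("llama-3.2-vision", lambda m: "llama-3.2" in m and "vision" in m and _is_chat(m)),
    ("llama-3.2", lambda m: "llama-3.2" in m and _is_chat(m) and "vision" not in m),
    ("llama-3.1", lambda m: "llama-3.1" in m and _is_chat(m)),
    ("qwen2.5", lambda m: "qwen2.5" in m and _is_chat(m)),
    ("baichuan2-chat", lambda m: "baichuan2" in m and _is_chat(m)),
    ("llama-2", lambda m: "llama-2" in m and _is_chat(m)),
    ("llama-3", lambda m: "llama-3-" in m and _is_chat(m)),
    ("qwen", lambda m: "qwen" in m and _is_chat(m) and "vl" not in m),
]


def match_conv_template(model_id: str) -> Optional[str]:
    """Return the CONV_TEMPLATE value of the first matching row, or None."""
    lowered = model_id.lower()  # matching is case-insensitive
    for conv_template, predicate in RULES:
        if predicate(lowered):
            return conv_template
    return None


print(match_conv_template("Qwen2-7B-Instruct"))           # qwen
print(match_conv_template("Meta-Llama-3.1-8B-Instruct"))  # llama-3.1
```

Order matters in this reading: a Qwen2.5 model ID also satisfies the generic Qwen rule, so the more specific row must be checked first.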
Inference Resource Requirements
Inference resource requirements primarily include CPU, memory, and GPU. We mainly calculate the required computing resources based on GPU memory. It is recommended that CPU and memory be allocated proportionally according to the instance type and the number of GPU cards, with memory larger than the storage space occupied by model weights.
LLM inference requires a large amount of GPU memory. The required GPU memory size and the minimum number of GPUs for running LLM inference with this image can be calculated as follows:
GPU memory required for model weights (G) ≈ Model weight parameter scale (B) x Model weight precision (byte)
GPU memory required for KV cache (G) = Model layer size (hidden_size) x Number of model layers (num_hidden_layers) x Number of model KV heads (num_key_value_heads) / Number of model attention heads (num_attention_heads) x Model context length (max_position_embeddings) x Model KV-cache precision (byte) x 2 / 1024 / 1024 / 1024
Required GPU memory > (GPU memory required for model weights + GPU memory required for KV cache) / GPU memory utilization (gpu_memory_utilization)
----
Minimum number of GPUs required = ceil(Required GPU memory / GPU memory per GPU card), with the requirement num_attention_heads % number of GPUs == 0
Generally, the model weight precision and KV cache precision are float16 or bfloat16 (that is, 2 bytes) before quantization. If INT8 or FP8 quantization is applied, they become 1 byte; if INT4 or NF4 quantization is applied, they become 0.5 bytes. Other key parameters can be found in the config.json file within the model directory.
Example:
Open-Source Model Name | Meta-Llama-3.1-8B-Instruct | Qwen2-72B-Instruct |
Model Weight Parameter Scale (Billion) | 8 | 72 |
Model Weight Precision (Byte) (Adjustable by Enabling --quantization) | 2 | 2 |
Total GPU Memory Occupied by Model Weights (G) | 16 | 144 |
Model Layer Size (hidden_size) | 4096 | 8192 |
Number of Model Layers (num_hidden_layers) | 32 | 80 |
Number of Model KV Heads (num_key_value_heads) | 8 | 8 |
Number of Model Attention Heads (num_attention_heads) | 32 | 64 |
Model Context Length (max_position_embeddings) (Adjustable via the --max-model-len Parameter) | 131072 | 32768 |
Model KV Cache Precision (Byte) | 2 | 2 |
Total GPU Memory Occupied by Model KV Cache (G) | 16 | 10 |
GPU Memory Utilization (Adjustable via the --gpu-memory-utilization Parameter) | 0.9 | 0.9 |
Minimum GPU Memory Required (G) | 35.56 | 171.11 |
Minimum Number of GPUs with 24 G GPU Memory | 2 | 8 |
Minimum Number of GPUs with 40 G GPU Memory | 1 | 8 |
Minimum Number of GPUs with 48 G GPU Memory | 1 | 4 |
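The calculation behind the example table can be sketched in a few lines of Python. This is an illustrative helper, not a tool shipped with the image, and the function names are assumptions. It also assumes that when the ceil result does not evenly divide num_attention_heads, the GPU count is raised to the next value that does, which is consistent with the table (for example, 8 GPUs with 40 G for Qwen2-72B-Instruct).

```python
import math

GIB = 1024 ** 3


def weights_mem_g(params_billion: float, weight_bytes: float) -> float:
    """GPU memory for model weights (G): parameter scale (B) x precision (byte)."""
    return params_billion * weight_bytes


def kv_cache_mem_g(hidden_size: int, num_layers: int, num_kv_heads: int,
                   num_attention_heads: int, context_len: int, kv_bytes: float) -> float:
    """GPU memory for the KV cache (G); the factor 2 covers keys and values."""
    return (hidden_size * num_layers * num_kv_heads / num_attention_heads
            * context_len * kv_bytes * 2 / GIB)


def required_mem_g(weights_g: float, kv_g: float, gpu_memory_utilization: float = 0.9) -> float:
    """Lower bound on the total GPU memory needed (G)."""
    return (weights_g + kv_g) / gpu_memory_utilization


def min_gpus(required_g: float, per_gpu_g: float, num_attention_heads: int) -> int:
    """Smallest GPU count that fits the memory and evenly divides the attention heads."""
    start = max(1, math.ceil(required_g / per_gpu_g))
    for n in range(start, num_attention_heads + 1):
        if num_attention_heads % n == 0:
            return n
    raise ValueError("no GPU count satisfies num_attention_heads % number of GPUs == 0")


# Qwen2-72B-Instruct, using the values from the example table.
w = weights_mem_g(72, 2)                            # 144
kv = kv_cache_mem_g(8192, 80, 8, 64, 32768, 2)      # 10.0
need = required_mem_g(w, kv, 0.9)
print(round(need, 2))                               # 171.11
print([min_gpus(need, g, 64) for g in (24, 40, 48)])  # [8, 8, 4]
```

To estimate the effect of quantization or a shorter context, change weight_bytes, kv_bytes, or context_len and rerun the calculation.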