
Inference Acceleration Based on the Built-in Image Angel-vLLM

Last updated: 2026-03-12 09:41:51

Overview

This document uses qwen2-7b-instruct as an example to show how to deploy a custom LLM on Tencent Cloud TI-ONE Platform (TI-ONE) and accelerate its inference. It also explains how to use test scripts to measure the performance of quantization acceleration. You can use the same scripts to evaluate the acceleration achieved by TI-ONE's built-in inference image Angel-vLLM.
Angel-vLLM is an LLM inference acceleration framework deeply optimized by the Tencent Cloud AI acceleration team based on open-source vLLM. While using the same API as the community vLLM and maintaining full compatibility with the community vLLM features, it has the following features:
1. Richer features. Compared to the community open-source edition of vLLM, Angel-vLLM provides INT8, NF4, and FP8 online quantization, lookahead parallel decoding, and other features.
2. More powerful performance. Compared to the community open-source edition of vLLM, Angel-vLLM quantization not only saves GPU memory but also reduces latency and increases throughput. Lookahead parallel decoding has been refined in production workloads and delivers significant improvements in RAG scenarios.
3. Better aligned precision. Angel-vLLM generation results have been extensively tested in online business operations. They either match Hugging Face generation results exactly or keep precision essentially unchanged.
TI-ONE has a built-in inference image integrated with the Angel-vLLM inference acceleration framework, enabling one-click deployment of LLM inference services compatible with the OpenAI API protocol.
This document will introduce how to use the built-in image Angel-vLLM on TI-ONE to deploy LLM inference services, enable Angel-vLLM acceleration capabilities, and call APIs.

Prerequisites

Before deploying and using your own model via TI-ONE and its built-in inference acceleration image Angel-vLLM, you need to prepare the following resources (including Cloud File Storage (CFS) for storing model files and GPU computing resources for the inference model).

Cloud File Storage

Apply for CFS: During the deployment and inference of custom LLMs, CFS is used for model file storage. Therefore, you need to apply for CFS first. For differences between CFS instances, see CFS > Storage Types and Specifications. Ensure network connectivity between the purchased CFS instance and the Cloud Virtual Machine (CVM) instances that provide the computing resources described in Computing Resources below. For details about CFS usage, see CFS > Creating File Systems and Mount Points.
Note: If your underlying storage service is Data Accelerator Goose FileSystem extreme (GooseFSx), the operations are similar to CFS described in this document, so you can still use this document as a reference.

Computing Resources

Self-purchased computing resources: If you are using TI-ONE for the first time, see Resource Group Management to purchase appropriate computing resources based on the inference model parameters.
Select resources based on the actual model size or your available resources. The CVM instance resources required for LLM inference depend on the model parameter scale and context length. We recommend referring to the following table to quickly estimate the resources required for inference services. For detailed GPU memory usage, see Inference Resource Requirements.
| Model Parameter Scale | GPU Card Type and Quantity |
| --- | --- |
| 6 ~ 8B | PNV5b * 1 / A10 * 1 / A100 * 1 / V100 * 1 |
| 12 ~ 14B | PNV5b * 1 / A10 * 2 / A100 * 1 / V100 * 2 |
| 65 ~ 72B | PNV5b * 8 / A100 * 8 |

Operation Steps

Model Preparations

Choose Training Workshop > Dev Machines > Create to create a dev machine:
Image: You may choose any built-in image, as this dev machine is only used for downloading model files.
Source of Machine: If you have not purchased any computing resources previously, you can select Purchase from the TI-ONE platform and pay-as-you-go billing mode. To select from CVM instances or select the yearly/monthly subscription billing mode, see the computing resource purchase instructions in the Prerequisites section.
Billing Mode: You can select either pay-as-you-go or yearly/monthly subscription billing mode.
Storage Path Configuration: Select CFS. The name should be the CFS instance you applied for and configured in the Prerequisites section. The path is the root directory by default and is used to specify the storage location for your custom LLM.
Other Settings: Not required by default.
Note: The required model files are downloaded or uploaded through the dev machine. The resources used in the example are 8 CPU cores and 16 GB of memory. These resources are used only by the dev machine and are unrelated to the subsequent inference service, so you can reduce them as appropriate.
The final configuration is as follows:


Model Files

After the dev machine is successfully created, start it. Then, click the Python3 kernel under the dev machine to create an ipynb page for downloading the required models via scripts. You can search for the LLM you need in ModelScope or Hugging Face, and use the Python scripts provided in these communities to download the model and save it to CFS. This document takes the Qwen2-7b-Instruct model as an example, with the download code as follows:
!pip install modelscope

from modelscope import snapshot_download

# 'qwen/Qwen2-7B-Instruct' is the name of the model to download, and cache_dir is the
# directory where the downloaded model is saved. Here, /home/tione/notebook saves the
# model in the root directory of the CFS mount directory.
model_dir = snapshot_download('qwen/Qwen2-7B-Instruct', cache_dir='/home/tione/notebook')
Note: The path mapping relationship is as follows: Taking the container mount path /home/tione/notebook when CFS is mounted as an example, if the CFS source path is / and cache_dir is /home/tione/notebook, the model is actually saved at the root path of CFS; if the CFS path is /dir and cache_dir is /home/tione/notebook/model, the model is actually saved in the /dir/model directory of CFS.
Copy the above download script, replace qwen/Qwen2-7B-Instruct with the model you need to download, paste it into the new ipynb file, and click the run icon (or click the cell and press Shift+Enter) to start downloading the model.
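The path mapping described in the note above can be checked with a short sketch. The function name cfs_location is hypothetical and is not part of TI-ONE or ModelScope. It only assumes that the CFS source path is mounted at /home/tione/notebook inside the container, as in this example.

```python
import posixpath


def cfs_location(cfs_source_path, cache_dir, mount_path="/home/tione/notebook"):
    """Return the CFS path that a container path under the mount maps to."""
    mount = posixpath.normpath(mount_path)
    target = posixpath.normpath(cache_dir)
    if target != mount and not target.startswith(mount + "/"):
        raise ValueError(f"{cache_dir} is not under the CFS mount {mount_path}")
    # relpath yields "." when cache_dir is the mount directory itself.
    relative = posixpath.relpath(target, mount)
    return posixpath.normpath(posixpath.join(cfs_source_path, relative))


print(cfs_location("/", "/home/tione/notebook"))           # /
print(cfs_location("/dir", "/home/tione/notebook/model"))  # /dir/model
```

The two printed values correspond to the two cases given in the note.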

Additionally, you can upload your existing LLM files locally, as shown in the figure below:
Note: The bandwidth for this upload method is limited, so it is recommended only for files smaller than 10 MB. For larger models, upload the files to Tencent Cloud Object Storage (COS) first, preferably to a COS bucket in the same region as CFS, and then install the coscmd or coscli tool on the dev machine to download the files to CFS. For desktop COS usage, see COS > User Instructions for Desktop Version. For command-line COS usage, see COS > COSCMD or COS > COSCLI.


Deployment Process

Go to the Model Services > Online Services page in the TI-ONE console, and click Create Service:
Service name: Defined by you.
Source of Machine: Choose either Select from CVM instances or Purchase from the TI-ONE platform, and then select the corresponding resource group or use pay-as-you-go mode to select CVM instance specifications on demand.
Deployment mode: If the model can be deployed using the resources of a single machine, standard deployment is recommended. If you need to deploy very large models where a single machine's resources are insufficient, you can select the multi-machine distributed deployment mode.
Replica settings:
Model Source: Choose Cloud Storage > CFS.
Model: Select the CFS instance where the model is located, then enter the source path of the model file on CFS. If you need to select a specific checkpoint from the training output, enter the actual training output path and make sure that it goes down to the checkpoint-level directory.
Image: Choose Built-in/LLM/Angel-vLLM(2.1).
Resource: Select the required computing resources for the model on demand. For resource requirements, see the appendix Inference Resource Requirements.
Advanced Settings: For the starting command and environment variables, see Service Deployment Parameters.

Click Start Service, and the service will enter the creation process. The service status may remain in the Preparing status for a period, indicating that the service startup is not yet completed. You can view the startup progress by clicking the Events or Logs tabs of a specific service.

Service Deployment Parameters

Startup Command Description

The image's default startup command (that is, the command used when no startup command is specified) is run. Its overall behavior is essentially equivalent to the startup commands vllm serve and python3 -m vllm.entrypoints.openai.api_server provided by vLLM.
The main differences between the default startup command and the startup command provided by vLLM are as follows:
Added the feature for setting certain parameters via environment variables. For details, see Description of Startup Command Parameters and Environment Variables below.
Adjusted the default values of certain parameters and added support for automatic configuration of certain parameters. For details, see Description of Startup Command Parameters and Environment Variables below.
When the multi-machine distributed deployment mode is used, the multi-machine Ray communication environment is automatically established.
Added support for quick startup of Low-Rank Adaptation (LoRA) models from built-in LLM training on TI-ONE (automatically enabling multi-LoRA or performing LoRA weight merging).
Optimized the memory pre-reading loading speed of models on CFS Turbo storage media.
All the following startup command parameters are compatible with the three startup commands: run, vllm serve, and python3 -m vllm.entrypoints.openai.api_server, while environment variables are only supported by the default run startup command.

Description of Startup Command Parameters and Environment Variables

Parameter Details
Angel-vLLM core acceleration feature parameters:
| Startup Command Parameter | Environment Variable | Meaning |
| --- | --- | --- |
| --quantization | QUANTIZATION | Quantization method, not set by default. Compared to the open-source vLLM 0.6.2 version, the following quantization methods are added. fp8: W8A8 FP8 precision quantization, supported on NVIDIA GPU series with a compute capability of 8.9 or higher (Ada Lovelace and Hopper); reduces GPU memory usage for model weights by 50%, improves inference speed, and maintains near-lossless precision. ifq_nf4: W4A16 NF4 precision quantization, supported on NVIDIA GPU series with a compute capability of 8.0 or higher (Ampere, Ada Lovelace, and Hopper); reduces GPU memory usage for model weights by 75%, improves inference speed, and offers better precision compared to INT4 quantization (this quantization method only supports quantization based on float16 precision, that is, --dtype float16). ifq: W8A16 INT8 precision quantization, supported on NVIDIA GPU series with a compute capability of 7.0 or higher (Volta, Turing, Ampere, Ada Lovelace, and Hopper); reduces GPU memory usage for model weights by 50%, improves inference speed, and maintains near-lossless precision. All 3 quantization methods support one-click online quantization, without the need for prior model conversion or offline calibration. GPU memory usage: ifq_nf4 < fp8 = ifq < non-quantized. Inference speed: fp8 >= ifq_nf4 > ifq > non-quantized. Model precision: non-quantized >= fp8 >= ifq > ifq_nf4. |
| --use-lookahead | USE_LOOKAHEAD | The default value is 0. Setting it to 1 enables lookahead parallel decoding, which can significantly speed up decoding in many scenarios. Compared to the open-source implementation, lookahead parallel decoding does not require additional small models or model heads and can achieve results consistent with non-parallel decoding. It shows noticeable acceleration effects when a large portion of the output text has appeared in the input text (such as RAG scenarios), or when there are similar requests or answers in a large number of requests (for acceleration effects, see Appendix). |
| --num-speculative-tokens | NUM_SPECULATIVE_TOKENS | The default value is 6. This parameter defines the number of tokens decoded at a time during lookahead parallel decoding. If the required concurrency is high, this value can be decreased. If the concurrency is low, it can be increased. |
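The weight-memory reductions quoted for the quantization methods (50% for fp8 and ifq, 75% for ifq_nf4) follow from the number of bytes stored per weight. The sketch below is a rough estimate under stated assumptions: a float16 baseline of 2 bytes per weight, a round figure of 7 billion parameters, and no allowance for quantization scales or layers that stay unquantized. The helper name weight_memory_gib is illustrative only.

```python
# Approximate bytes stored per weight for each --quantization value.
BYTES_PER_WEIGHT = {
    "none": 2.0,     # float16 / bfloat16 baseline
    "fp8": 1.0,      # W8A8 FP8
    "ifq": 1.0,      # W8A16 INT8
    "ifq_nf4": 0.5,  # W4A16 NF4
}


def weight_memory_gib(num_params, quantization="none"):
    """Estimate GPU memory for model weights only, in GiB."""
    return num_params * BYTES_PER_WEIGHT[quantization] / 1024**3


for method in BYTES_PER_WEIGHT:
    # 7e9 is an assumed round parameter count, not the exact size of any model.
    print(f"{method:8s} {weight_memory_gib(7e9, method):5.1f} GiB")
```

Real usage is somewhat higher than these figures because the KV cache and activations also need GPU memory.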
To be compatible with more scenarios, the platform's built-in image has adjusted the default values of certain startup parameters. The following startup parameters differ from vLLM's default startup parameters:
| Startup Command Parameter | Environment Variable | Meaning |
| --- | --- | --- |
| --max-model-len | MAX_MODEL_LEN | Maximum context length of the model. To be compatible with instance types with smaller GPU memory, the platform modifies the parameter value to 8192 for models with a context length greater than 8K by default. If you need to support a longer context, you can manually configure this parameter. |
| --dtype | DTYPE | The default value is float16. If you want to perform inference with bfloat16 precision, manually change the value to bfloat16. |
| --enable-prefix-caching | ENABLE_PREFIX_CACHING | The default value is true, which means the prefix caching feature is enabled. This significantly accelerates processing for long-text inputs with repeated prefixes and reduces Time to First Token (TTFT) in continuous multi-turn chat sessions. You can manually disable it by setting the environment variable ENABLE_PREFIX_CACHING=false. (For V100 GPUs, as this feature is not yet supported, the default value is false). |
| --tensor-parallel-size | TP | The default value is the number of GPU cards specified when the online service is created, indicating the model parallelism size. |
| --trust-remote-code | TRUST_REMOTE_CODE | In order to be compatible with models that require this option to be enabled, the default value is true. You can set the environment variable TRUST_REMOTE_CODE to false to manually disable it. |
| --model | MODEL | Model name or path. The default value is /data/model, which corresponds to the default model mount path within the platform's container. |
| --port | - | Service port. The default value is 8501, which corresponds to the default inference service port on the platform. |
| --use-v2-block-manager | - | The default value is true, indicating that the v2 block manager is used. |
| --chat-template | - | Hugging Face chat template. It can be a .jinja chat template file path or template string. (If you do not set this parameter but have specified the MODEL_ID or CONV_TEMPLATE environment variables, or if the model directory contains a ti_model_config.json file, the platform will attempt to perform automatic matching. For details, see Chat Templates). |
| --tool-call-parser | - | Tool call parser. Valid values: llama3_json, hermes, mistral, and internlm. (If you do not set this parameter but have specified the MODEL_ID or CONV_TEMPLATE environment variables, or if the model directory contains a ti_model_config.json file, the platform will attempt to perform automatic matching. For details, see Chat Templates). |
| --enable-auto-tool-choice | - | Enables the automatic tool call capability, which must be used in conjunction with --tool-call-parser. (If you do not set this parameter but have specified the MODEL_ID or CONV_TEMPLATE environment variables, or if the model directory contains a ti_model_config.json file, the platform will attempt to perform automatic matching. For details, see Chat Templates). |
Other parameters that can be configured via environment variables supported by the platform image:
| Startup Command Parameter | Environment Variable | Meaning |
| --- | --- | --- |
| --gpu-memory-utilization | GPU_MEMORY_UTILIZATION | GPU memory utilization. The default value is 0.9. |
| --enforce-eager | ENFORCE_EAGER | Whether to forcibly enable PyTorch's eager mode. The default value is false, in which case CUDA Graph is additionally used for further acceleration, but this consumes extra GPU memory and increases service startup time. |
| - | MODEL_ID | Model name. The default value is model. It is recommended to set it to the model's name on Hugging Face, which will trigger automatic matching for the chat template. |
| - | CONV_TEMPLATE | Chat template name. It is not set by default and primarily used for automatic matching of the chat template. |
| - | DISABLE_MEM_CACHE | Disables memory cache pre-reading. The default value is false. |
For the remaining startup parameters supported by the service, see vLLM's official documentation OpenAI Compatible Server.

Use Instructions

To specify the startup command: Advanced Settings > Startup Command. For example, to use ifq_nf4 quantization, the configuration is as follows (to adjust other parameters, see Service Deployment Parameters above):

To specify environment variables: Advanced Settings > Environment Variables. For example, to use ifq_nf4 quantization, the configuration is as follows (the method for using other parameters via environment variables is the same; note that environment variable names must be in uppercase):


Recommended Parameters for Different Scenarios

In most cases, you can start the inference service without entering any startup commands or environment variables. For specific scenarios, we provide configuration guidance for certain startup parameters here.

Case - Model Experience - Minimum Resource Requirements

Scenario: The computing resources for inference are limited. Insufficient GPU memory is reported when you deploy the model with default parameters. The goal is only for a quick trial of the model, with no specific requirements for model precision or inference speed.
Recommended startup command: run --quantization ifq_nf4 --enforce-eager --gpu-memory-utilization 0.95 --max-model-len 2048
Description:
--quantization ifq_nf4: enables 4-bit quantization, reducing GPU memory occupied by model weights by 75%. (This feature is supported on Ampere, Ada Lovelace, or Hopper series GPUs. For V100 or T4 GPUs, enable the ifq quantization mode.)
--enforce-eager: disables CUDA Graph, reducing additional GPU memory usage and allowing for a higher value for --gpu-memory-utilization.
--gpu-memory-utilization 0.95: increases GPU memory utilization to 95%, enhancing GPU memory usage efficiency.
--max-model-len 2048: reduces the context length to 2048, decreasing the GPU memory required for KV cache.
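To see why lowering --max-model-len reduces the GPU memory needed for the KV cache, the per-token cache size can be estimated from the model shape. This is a minimal sketch that assumes a float16 KV cache and the commonly published Qwen2-7B-Instruct shape (28 layers, 4 KV heads, head dimension 128). Verify these values against the config.json of the model you deploy. The function name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # One key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Qwen2-7B-Instruct shape; check the model's config.json.
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)

for max_model_len in (2048, 8192, 32768):
    mib = per_token * max_model_len / 1024**2
    print(f"--max-model-len {max_model_len:>5}: {mib:.0f} MiB for one full-length sequence")
```

Under these assumptions, one full-length sequence needs 112 MiB of cache at 2048 tokens, and the requirement grows linearly with the context length.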

Case - Production Environment Deployment - Resume Parsing

Scenario: There are higher requirements for both model precision and inference speed. In resume parsing scenarios, the input text is long, and the generated content is highly likely to overlap with the input, making it suitable to enable parallel decoding acceleration.
Recommended startup command: run --dtype auto --quantization fp8 --use-lookahead --num-speculative-tokens 4 --max-model-len 32768 --enable-prefix-caching --disable-log-requests --api-keys mock_api_key --served-model-name model_123
Description:
--dtype auto: automatically uses the precision configured in the model's config.json file. (For certain models, the effects of fp16 and bf16 may be different; adjust based on actual testing conditions.)
--quantization fp8: enables FP8 W8A8 quantization, significantly accelerating inference while maintaining precision and reducing GPU memory occupied by model weights by 50%. (This feature is supported on Ada Lovelace and Hopper series GPUs. For other GPU models, enable the ifq quantization mode.)
--use-lookahead: enables parallel decoding to accelerate generation in this scenario and increase throughput.
--num-speculative-tokens 4: specifies the number of tokens for parallel decoding. When support for high concurrency is required, it is recommended to set a smaller value. This value can be adjusted based on actual testing to evaluate throughput.
--max-model-len 32768: sets the maximum supported context length as needed.
--enable-prefix-caching: enables the prefix caching feature. This option is enabled by default in the run startup command and can be omitted.
--disable-log-requests: disables the logging of request details to prevent excessive and overly lengthy request information in logs.
--api-keys mock_api_key: sets an authentication token compatible with OpenAI API calls to prevent unauthorized access to public network APIs.
--served-model-name model_123: specifies the model name, making it easier for clients to identify which model returns the inference result.

Case - Using the vLLM Native Startup Command - Throughput Optimization

Scenario: You want to deploy the service using vLLM's native startup command and optimize service throughput as much as possible.
Recommended startup command: vllm serve /data/model --port 8501 -tp 2 --quantization fp8 --max-model-len 32768 --enable-prefix-caching --disable-log-requests --served-model-name model_123 --num-scheduler-steps 8
Description:
/data/model: sets the model path to the default container mount path.
--port 8501: sets the service port to the default platform port 8501.
-tp 2: sets the model tensor parallelism size to 2. This value should be adjusted based on the actual number of GPU cards used by the service, and can be omitted for single-GPU card inference.
--quantization fp8: enables FP8 W8A8 quantization, significantly accelerating inference while maintaining precision and reducing GPU memory occupied by model weights by 50%. (This feature is supported on Ada Lovelace and Hopper series GPUs. For other GPU models, enable the ifq quantization mode.)
--max-model-len 32768: sets the maximum supported context length as needed.
--enable-prefix-caching: enables the prefix caching feature to reduce TTFT.
--disable-log-requests: disables the logging of request details to prevent excessive and overly lengthy request information in logs.
--served-model-name model_123: specifies the model name, making it easier for clients to identify which model returns the inference result.
--num-scheduler-steps 8: enables multi-step scheduling to optimize GPU utilization and improve service throughput (this option conflicts with parallel decoding; it is recommended to choose either this or parallel decoding based on actual business needs).

Chat Experience

After the service is ready, you can see the Chat Experience tab. Click the tab to go to the page for an online chat experience.

Chat API Call

The service also supports direct requests via HTTP/HTTPS protocol. Click the Service Call page, and enter the chat API endpoint /v1/chat/completions in the API calling address field. For the API format, see the format of the OpenAI Chat Completions API.
Below are several common call scenarios:

Plain Text Chat

Example request:
{"messages":[{"role":"user","content":"Hello"}],"max_tokens":128}
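As a minimal sketch, the same plain text request can be built with the Python standard library. The base URL below is a placeholder, not a real calling address, and the helper name build_chat_request is hypothetical. The Authorization header is only an assumption about how an API key would be passed, following the OpenAI convention. Omit it if your service has no key configured.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=128, api_key=None):
    """Build a POST request for the /v1/chat/completions endpoint."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_chat_request(
    "https://service.example.invalid/ms-xxxxxxxx",  # placeholder calling address
    [{"role": "user", "content": "Hello"}],
)
print(request.full_url)

# To send the request once base_url points at a running service:
# with urllib.request.urlopen(request, timeout=60) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```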

Tool Call

This is supported by the API only when the tool call capability is enabled during service deployment. For details, see FAQs.
Example request:
{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "The city to find the weather for, e.g. 'Beijing'"
            }
          },
          "required": [
            "city"
          ]
        }
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "what's the weather of Shanghai today?"}
  ],
  "max_tokens": 512
}
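When the model decides to call a tool, a response in the OpenAI Chat Completions format carries the call under choices[0].message.tool_calls, with the arguments encoded as a JSON string. The sketch below shows one way a client might dispatch such a call. The sample response is hand-written for illustration and was not captured from a real service, and get_current_weather is a hypothetical stand-in for a real lookup.

```python
import json


def get_current_weather(city):
    # Hypothetical stand-in for a real weather lookup.
    return {"city": city, "forecast": "unknown"}


TOOLS = {"get_current_weather": get_current_weather}


def run_tool_calls(response):
    """Execute each requested tool and return the 'tool' messages to send back."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        output = TOOLS[function["name"]](**arguments)
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(output)}
        )
    return results


# Illustrative response shape only; not real service output.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_current_weather", "arguments": "{\"city\": \"Shanghai\"}"},
            }],
        }
    }]
}
print(run_tool_calls(sample))
```

The returned messages would be appended to the conversation, after the assistant message, in a follow-up request so that the model can produce the final answer.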

Multimodal

This is supported only when the model supports multimodal capabilities, such as the Qwen/Qwen2-VL-7B-Instruct and meta-llama/Llama-3.2-11B-Vision-Instruct models.
Example request:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://tione-public-cos-1308945662.cos.ap-shanghai.myqcloud.com/test/cat.jpg"
          }
        }
      ]
    }
  ],
  "max_tokens": 512
}
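If the image is a local file rather than a public URL, community vLLM also accepts a base64 data URL in the image_url field. The sketch below assumes that Angel-vLLM behaves the same way, which this document does not state explicitly. The helper name image_message is illustrative, and the three bytes used here are only a placeholder for real image data.

```python
import base64
import json


def image_message(prompt, image_bytes, mime_type="image/jpeg"):
    """Build a user message that embeds an image as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


# Placeholder bytes; in practice use open("cat.jpg", "rb").read().
payload = {
    "messages": [image_message("What's in this image?", b"\xff\xd8\xff")],
    "max_tokens": 512,
}
print(json.dumps(payload, indent=2))
```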

Third-Party LLM Application Integration

Taking Dify as an example, you can add a model of type OpenAI-API-compatible, and then enter the service's online calling address in the API endpoint URL field, appending /v1 to the end.

Server Performance Testing

Note: The following test uses random data. If you need to test the parallel decoding acceleration capability, it is recommended to use real business data for testing.
Start the dev machine you prepared in the model preparation step above (ensure that the CFS instance containing the LLM is mounted). Download the test scripts benchmark_serving.py and backend_request_func.py from the benchmarks directory of the official vLLM project on GitHub (vllm-project/vllm, main branch). Create a working directory on the dev machine, and in it create a startup script, for example run.sh, as follows (you can modify the parameters according to your needs).
#!/bin/bash
RESULTS_FOLDER="results"
HOST="https://ms-cxxxxxxx.ti.tencentcs.com/ms-czzfp5c9/"
MODEL="../Qwen2-7B-Instruct"

QPS="inf"
INPUT_LEN=128
OUTPUT_LEN=128
NUM_PROMPTS=20
CONCURRENCY=16

mkdir -p $RESULTS_FOLDER

# --max-concurrency applies the CONCURRENCY limit to the benchmark client;
# without it, the value would only be recorded in the result metadata.
client_command="python3 benchmark_serving.py \
--port 8501 \
--base-url $HOST \
--endpoint /v1/chat/completions \
--backend openai-chat \
--model $MODEL \
--dataset-name random \
--random-input-len $INPUT_LEN \
--random-output-len $OUTPUT_LEN \
--ignore-eos \
--request-rate $QPS \
--max-concurrency $CONCURRENCY \
--num-prompts $NUM_PROMPTS \
--save-result \
--result-dir $RESULTS_FOLDER \
--metadata input_len=$INPUT_LEN output_len=$OUTPUT_LEN qps=$QPS concurrency=$CONCURRENCY"

eval $client_command
The meaning of each parameter is described as follows:
| Parameter Name | Parameter Meaning |
| --- | --- |
| MODEL | Path of the model mounted in CFS |
| RESULTS_FOLDER | Path to save the final result |
| INPUT_LEN | Number of tokens input into the model |
| OUTPUT_LEN | Maximum number of tokens output by the model |
| NUM_PROMPTS | Total number of requests |
| CONCURRENCY | Number of concurrent requests |
| HOST | Calling address |
Note: Conventional service calling addresses are not recommended for stress testing, because latency along the call path may fluctuate due to Web Application Firewall (WAF) and other factors. It is recommended that you stress-test service performance through a high-speed service calling address, that is, configure HOST as a high-speed service calling address. To do so, add the VPC network where the dev machine resides to the high-speed service call IP range of the online service. The operation guide is as follows:
First, on the service call page, click Add High-Speed Service Call IP Range:

Next, copy the dev machine's network information (the network it belongs to and the subnet it resides in) as follows:

Fill in the corresponding information and confirm the addition:

The final calling address is as follows:

The final test file directory structure is as follows:

Create a terminal and run bash run.sh to start the test script. You can view the performance test results from the output on the terminal:

The meanings of the metrics are as follows:
| Metric Name | Metric Meaning |
| --- | --- |
| Successful requests | Total number of completed requests |
| Benchmark duration (s) | Total time taken to complete all requests |
| Total input tokens | Number of input tokens in all requests |
| Total generated tokens | Number of output tokens in all requests |
| Request throughput (req/s) | Number of requests completed per second |
| Output token throughput (tok/s) | Number of tokens generated per second |
| Time to First Token (ms) | Time from sending a request to receiving the first generated token |
| Time per Output Token (ms) | Average time per generated token after the first one, calculated per request |
| Inter-token Latency (ms) | Time between two consecutive generated tokens |
For a comparison of inference acceleration results on actual data sets, see Inference Acceleration Performance Test Results.
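The three latency metrics are easy to confuse, so the following sketch shows how each could be derived from the arrival times of a single streamed response. It follows the usual definitions (TPOT as the decode time averaged over the tokens after the first, ITL as each individual gap) and is not a copy of the benchmark script's code. The timestamps are made-up values in milliseconds.

```python
def latency_metrics(request_start, token_times):
    """Return (TTFT, TPOT, list of ITL values) for one streamed response."""
    ttft = token_times[0] - request_start
    # ITL: gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # TPOT: decode time after the first token, averaged over the remaining tokens.
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot, itl


# Made-up arrival times (ms) for a four-token response.
ttft, tpot, itl = latency_metrics(0, [80, 100, 125, 140])
print(ttft, tpot, itl)  # 80 20.0 [20, 25, 15]
```

For a single request, TPOT equals the mean of the ITL values. The two metrics differ once values are aggregated across requests, because TPOT contributes one value per request while ITL contributes one value per token gap.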

FAQs

Question: What are the limitations on using Angel-vLLM's parallel decoding capability?
Answer: Parallel decoding conflicts with vLLM's --enable-chunked-prefill feature. By default, vLLM automatically enables --enable-chunked-prefill when --max-model-len exceeds 32K tokens. To use lookahead parallel decoding in this scenario, disable the feature manually by adding --enable-chunked-prefill false to the startup command. Parallel decoding is also incompatible with multi-step scheduling, that is, starting vLLM with --num-scheduler-steps set to a value greater than 1.
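The two restrictions in this answer can be turned into a quick pre-deployment check on a startup command. This is only a sketch: the function is hypothetical, it reads the 32K threshold as 32768 tokens (an assumption), it handles the "--flag value" form only, and it treats the mere presence of --use-lookahead as enabling the feature.

```python
import shlex

CHUNKED_PREFILL_AUTO_THRESHOLD = 32768  # assumption: "32K" read as 32768 tokens


def lookahead_conflicts(command):
    """List the reasons lookahead decoding would conflict with this startup command."""
    args = shlex.split(command)
    if "--use-lookahead" not in args:
        return []

    def value_of(flag):
        # Handles "--flag value" only, not "--flag=value".
        if flag not in args:
            return None
        index = args.index(flag) + 1
        return args[index] if index < len(args) else ""

    problems = []
    steps = value_of("--num-scheduler-steps")
    if steps and int(steps) > 1:
        problems.append("--num-scheduler-steps is greater than 1")

    chunked = value_of("--enable-chunked-prefill")
    max_len = value_of("--max-model-len")
    if chunked is not None:
        if chunked.lower() != "false":
            problems.append("--enable-chunked-prefill is enabled explicitly")
    elif max_len and int(max_len) > CHUNKED_PREFILL_AUTO_THRESHOLD:
        problems.append("--max-model-len exceeds 32K, so chunked prefill is enabled automatically")
    return problems


print(lookahead_conflicts("run --use-lookahead --max-model-len 65536"))
print(lookahead_conflicts("run --use-lookahead --max-model-len 65536 --enable-chunked-prefill false"))
```

The first command is flagged because chunked prefill would be enabled automatically. The second returns an empty list because the feature is disabled explicitly.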
Question: What are the differences between Angel-vLLM's quantization capability and open-source GPTQ and AWQ quantization?
Answer: Angel-vLLM's INT8, NF4, and FP8 quantization all support one-click online quantization, without requiring offline calibration with a data set. Compared to other open-source quantization, it offers faster implementations. Among these, INT8 and FP8 achieve near-lossless precision, while NF4 quantization yields better precision than INT4 in experimental results. Open-source GPTQ and AWQ are offline calibration algorithms, primarily used for INT4 and INT8 quantization. Angel-vLLM does not support using ifq, ifq_nf4, or FP8 quantization methods to deploy models that have been calibrated by GPTQ or AWQ. For such models, use vLLM's native quantization methods, such as gptq, gptq_marlin, awq, and awq_marlin.
Note:
To run inference on the GPTQ or AWQ quantized model of qwen2 or qwen2.5, add the environment variable VLLM_ALLOW_QUANT_LM_HEAD=0.
Question: Why does the GPU memory utilization reach 90% when I deploy a 7B model on a GPU with 40 GB of memory?
Answer: The vLLM framework pre-allocates GPU memory. It reserves total GPU memory x gpu_memory_utilization, which is 90% of the GPU memory with the default value of 0.9. From that budget, it subtracts the peak GPU memory consumed by a single inference (including model weights + KV cache) and allocates all the remaining memory to GPU KV cache blocks, which increases the batch size the service can support. To estimate the minimum GPU memory required for model weights and KV cache, see Inference Resource Requirements.
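The arithmetic in this answer can be worked through with concrete numbers. In the sketch below, the 16 GB peak for a single inference is an illustrative assumption for a 7B model in float16, not a measured value, and the function name is hypothetical.

```python
def kv_cache_budget(total_gb, gpu_memory_utilization, peak_single_inference_gb):
    """Return (memory reserved by vLLM, memory left for KV cache blocks), in GB."""
    reserved = total_gb * gpu_memory_utilization
    return reserved, reserved - peak_single_inference_gb


# 40 GB card, default utilization 0.9, assumed 16 GB peak for one inference.
reserved, kv_budget = kv_cache_budget(40, 0.9, 16)
print(f"reserved: {reserved:.1f} GB, left for KV cache blocks: {kv_budget:.1f} GB")
```

Under these assumptions, vLLM reserves 36 GB regardless of the model size, which is why monitoring shows about 90% utilization, and roughly 20 GB of that reservation backs KV cache blocks.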
Question: How to enable the function call capability?
Answer: To enable tool call, add the parameters --enable-auto-tool-choice and --tool-call-parser <parser_name>. For usage, see vLLM official documentation. Alternatively, you can set MODEL_ID to the model's name on Hugging Face. For models supporting tool calls, these parameters are enabled by default. For details, see Chat Templates.

Appendix

Inference Acceleration Performance Test Results

FP8 Quantization Acceleration Test

Test environment:
| Configuration Item | Configuration Details |
| --- | --- |
| GPU type | PNV5b |
| Test model name | Qwen2-7B-Instruct |
| Test data set | ShareGPT_V3_unfiltered_cleaned_split.json |
| Test code | vllm/benchmarks |
| Test item | Comparison between no acceleration and FP8 acceleration |
Test script:
python benchmark_serving.py \
--port xxxx \
--base-url http://xxxxxxxx/ms-rpbpr92c/ \
--backend openai-chat \
--model ../Qwen2-7B-Instruct \
--dataset-name sharegpt \
--endpoint /v1/chat/completions \
--tokenizer ../Qwen2-7B-Instruct \
--max-concurrency 10 \
--save-result \
--result-dir results \
--dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json
Test results based on Angel-vLLM inference acceleration (concurrency=10, all other test conditions consistent):
| Monitoring Metric | No Angel-vLLM Feature | With FP8 Quantization | Performance Improvement |
| --- | --- | --- | --- |
| Successful requests | 1000 | 1000 | - |
| Total input tokens | 217393 | 217393 | - |
| Request throughput (req/s) | 0.68 | 1.08 | +58.82% |
| Output token throughput (tok/s) | 383.75 | 597.04 | +55.58% |
| Total Token throughput (tok/s) | 531.74 | 832.25 | +56.51% |
| Mean TTFT (ms) | 113.27 | 86.90 | +23.28% |
| Median TTFT (ms) | 76.52 | 55.64 | +27.29% |
| P99 TTFT (ms) | 874.30 | 708.10 | +19.01% |
| Mean TPOT (ms) | 24.33 | 16.56 | +31.94% |
| Median TPOT (ms) | 24.17 | 16.46 | +31.90% |
| P99 TPOT (ms) | 27.34 | 18.68 | +31.68% |
| Mean ITL (ms) | 24.22 | 16.52 | +31.79% |
| Median ITL (ms) | 23.45 | 15.83 | +32.49% |
| P99 ITL (ms) | 69.92 | 50.20 | +28.20% |
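The Performance Improvement column appears to be the relative change against the baseline: an increase for throughput metrics and a reduction for latency metrics. The small sketch below reproduces the published percentages under that reading. The function name improvement_pct is illustrative.

```python
def improvement_pct(baseline, accelerated, lower_is_better=False):
    """Relative improvement over the baseline, in percent, rounded to 2 decimals."""
    change = (baseline - accelerated) if lower_is_better else (accelerated - baseline)
    return round(change / baseline * 100, 2)


print(improvement_pct(0.68, 1.08))                          # 58.82 (request throughput)
print(improvement_pct(24.33, 16.56, lower_is_better=True))  # 31.94 (mean TPOT)
```

For latency metrics such as TTFT, TPOT, and ITL, a positive percentage therefore means the latency dropped by that share of the baseline.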

Lookahead Parallel Decoding Acceleration Performance

Lookahead parallel decoding accelerates the generation of token sequences that the service has already encountered, so the acceleration effect generally improves as the number of sessions increases. The figure below compares the generation speed of a second, identical request with the default parameters versus with lookahead parallel decoding enabled.


Chat Templates

A chat template converts the chat messages in a request into the prompt text for the LLM. For details, see the chat template introduction on Hugging Face.
Chat templates are generally described using Jinja templates. For new open-source chat models (typically with Instruct or Chat suffixes), you can usually find the chat_template field in the tokenizer_config.json file or the chat_template.json file within the model directory. This field represents the chat template for the model. If no special settings are configured, this chat template will be used by default.
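To check whether a model directory ships with a chat template, you can look for the chat_template field yourself. The following is a minimal sketch: the helper name find_chat_template is hypothetical, the lookup order (tokenizer_config.json first, then chat_template.json) is an assumption, and only string-valued chat_template fields are handled.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_chat_template(model_dir: str) -> Optional[str]:
    """Return the chat_template string found in the model directory, or None."""
    # Assumption: tokenizer_config.json is checked first, then chat_template.json.
    for filename in ("tokenizer_config.json", "chat_template.json"):
        path = Path(model_dir) / filename
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as f:
            config = json.load(f)
        template = config.get("chat_template")
        # Only plain string templates are handled in this sketch.
        if isinstance(template, str) and template:
            return template
    return None


# Usage with a throwaway directory and a toy Jinja template (illustrative only).
with tempfile.TemporaryDirectory() as model_dir:
    toy_template = "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}"
    config_path = Path(model_dir) / "tokenizer_config.json"
    config_path.write_text(json.dumps({"chat_template": toy_template}), encoding="utf-8")
    found = find_chat_template(model_dir)
    print(found is not None)
```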

To simplify the deployment of models that do not come with a chat_template field, and to enable function call capabilities for models that support them, the inference framework can match the model's chat template automatically. Automatic matching applies when the image is started with the default startup command and at least one of the following is true: the MODEL_ID or CONV_TEMPLATE environment variable is specified, or the model directory contains a ti_model_config.json file. The framework then applies the rules below in the order listed. Within each row, the CONV_TEMPLATE and MODEL_ID matching rules are in an OR relationship, so matching either one is sufficient.
| Model Series | CONV_TEMPLATE | MODEL_ID (Case-Insensitive) | Default Added Startup Command | Default Additional stop_token_ids |
| --- | --- | --- | --- | --- |
| Non-chat model | generate | - | --chat-template examples/template_generate.jinja | - |
| Hunyuan Large | hunyuan | Contains hunyuan-large. | --chat-template examples/template_hunyuan.jinja | [127960, 127967] |
| Tencent industry LLMs | shennong_chat | Contains shennong or sn-. | --chat-template examples/template_shennong.jinja | - |
| Llama-3.2 Vision Instruct model | llama-3.2-vision | Contains llama-3.2, with vision and instruct or chat. | --chat-template examples/tool_chat_template_llama3.2_vision.jinja --enable-auto-tool-choice --tool-call-parser llama3_json --enforce-eager --max-num-seqs 8 | - |
| Llama-3.2 Instruct model | llama-3.2 | Contains llama-3.2 and instruct or chat; does not contain vision. | --chat-template examples/tool_chat_template_llama3.2_json.jinja --enable-auto-tool-choice --tool-call-parser llama3_json | - |
| Llama-3.1 Instruct model | llama-3.1 | Contains llama-3.1 and instruct or chat. | --chat-template examples/tool_chat_template_llama3.1_json.jinja --enable-auto-tool-choice --tool-call-parser llama3_json | - |
| Qwen2.5 Instruct model | qwen2.5 | Contains qwen2.5 and instruct or chat. | --enable-auto-tool-choice --tool-call-parser hermes | - |
| Baichuan2 Chat model | baichuan2-chat | Contains baichuan2 and instruct or chat. | --chat-template examples/template_baichuan.jinja | - |
| Llama2 Chat model | llama-2 | Contains llama-2 and instruct or chat. | --chat-template examples/template_llama2.jinja | - |
| Llama3 Chat model | llama-3 | Contains llama-3- and instruct or chat. | --chat-template examples/template_llama3.jinja | [128001, 128009] |
| Qwen Chat model | qwen or qwen-7b-chat | Contains qwen and instruct or chat; does not contain vl. | --chat-template examples/template_qwen.jinja | [151643, 151644, 151645] |
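The MODEL_ID matching rules above can be read as an ordered list of substring checks. The sketch below re-expresses them in Python for illustration only; it is not the framework's actual implementation. The function name match_conv_template is hypothetical, and "contains X and instruct or chat" is assumed to mean "contains X, and contains instruct or chat".

```python
from typing import Optional


def _is_chat(name: str) -> bool:
    """True if the lowercased model name contains 'instruct' or 'chat'."""
    return "instruct" in name or "chat" in name


def match_conv_template(model_id: str) -> Optional[str]:
    """Return the CONV_TEMPLATE value the table suggests for a MODEL_ID, or None."""
    name = model_id.lower()  # MODEL_ID matching is case-insensitive
    # Rules are checked in the order in which the table lists them.
    if "hunyuan-large" in name:
        return "hunyuan"
    if "shennong" in name or "sn-" in name:
        return "shennong_chat"
    if "llama-3.2" in name and "vision" in name and _is_chat(name):
        return "llama-3.2-vision"
    if "llama-3.2" in name and _is_chat(name) and "vision" not in name:
        return "llama-3.2"
    if "llama-3.1" in name and _is_chat(name):
        return "llama-3.1"
    if "qwen2.5" in name and _is_chat(name):
        return "qwen2.5"
    if "baichuan2" in name and _is_chat(name):
        return "baichuan2-chat"
    if "llama-2" in name and _is_chat(name):
        return "llama-2"
    if "llama-3-" in name and _is_chat(name):
        return "llama-3"
    if "qwen" in name and _is_chat(name) and "vl" not in name:
        return "qwen"
    # No rule matched (the "generate" template is selected via CONV_TEMPLATE only).
    return None


for model_id in ("Qwen2.5-7B-Instruct", "Qwen2-7B-Instruct", "Meta-Llama-3.1-8B-Instruct"):
    print(model_id, "->", match_conv_template(model_id))
```

Because the rules are ordered, the more specific series (for example qwen2.5) must be checked before the more general ones (qwen).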

Inference Resource Requirements

Inference resource requirements primarily include CPU, memory, and GPU. We mainly calculate the required computing resources based on GPU memory. It is recommended that CPU and memory be allocated proportionally according to the instance type and the number of GPU cards, with memory larger than the storage space occupied by model weights.
LLM inference demands high GPU memory. We can calculate the required GPU memory size and the minimum number of GPUs required for running LLM inference using this image through the following methods:
GPU memory required for model weights (G) ≈ Model weight parameter scale (B) x Model weight precision (byte)
GPU memory required for KV cache (G) = Model layer size (hidden_size) x Number of model layers (num_hidden_layers) x Number of model KV heads (num_key_value_heads)/Number of model attention heads (num_attention_heads) x Model context length (max_position_embeddings) x Model KV-cache precision (byte) x 2/1024/1024/1024
Required GPU memory > (GPU memory required for model weights + GPU memory required for KV cache)/GPU memory utilization (gpu_memory_utilization)
----
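Minimum number of GPUs required = ceil(Required GPU memory / GPU memory per GPU card)
subject to num_attention_heads % number of GPUs == 0. If the ceil result does not divide num_attention_heads evenly, use the next larger number of GPUs that does.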
Generally, the model weight precision and KV cache precision are float16 or bfloat16 (that is, 2 bytes) before quantization. If INT8 or FP8 quantization is applied, they become 1 byte; if INT4 or NF4 quantization is applied, they become 0.5 bytes. Other key parameters can be found in the config.json file within the model directory.
Example:
| Open-Source Model Name | Meta-Llama-3.1-8B-Instruct | Qwen2-72B-Instruct |
| --- | --- | --- |
| Model Weight Parameter Scale (Billion) | 8 | 72 |
| Model Weight Precision (Byte) (Adjustable by Enabling --quantization) | 2 | 2 |
| Total GPU Memory Occupied by Model Weights (G) | 16 | 144 |
| Model Layer Size (hidden_size) | 4096 | 8192 |
| Number of Model Layers (num_hidden_layers) | 32 | 80 |
| Number of Model KV Heads (num_key_value_heads) | 8 | 8 |
| Number of Model Attention Heads (num_attention_heads) | 32 | 64 |
| Model Context Length (max_position_embeddings) (Adjustable via the --max-model-len Parameter) | 131072 | 32768 |
| Model KV Cache Precision (Byte) | 2 | 2 |
| Total GPU Memory Occupied by Model KV Cache (G) | 16 | 10 |
| GPU Memory Utilization (Adjustable via the --gpu-memory-utilization Parameter) | 0.9 | 0.9 |
| Minimum GPU Memory Required (G) | 35.56 | 171.11 |
| Minimum Number of GPUs with 24 G GPU Memory | 2 | 8 |
| Minimum Number of GPUs with 40 G GPU Memory | 1 | 8 |
| Minimum Number of GPUs with 48 G GPU Memory | 1 | 4 |
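The example values above can be reproduced with a short script. This is a sketch that applies the formulas in this section directly; the function names are illustrative, and the GPU count is raised past the ceil result until it divides num_attention_heads evenly, which matches the 40 G column for Qwen2-72B-Instruct (5 is not a divisor of 64, so 8 is used).

```python
import math


def weights_mem_g(params_billion: float, weight_bytes: float) -> float:
    """GPU memory for model weights (G): parameter scale (B) x precision (byte)."""
    return params_billion * weight_bytes


def kv_cache_mem_g(hidden_size: int, num_hidden_layers: int, num_key_value_heads: int,
                   num_attention_heads: int, max_position_embeddings: int,
                   kv_bytes: float) -> float:
    """GPU memory for the KV cache (G), following the formula in this section."""
    return (hidden_size * num_hidden_layers * num_key_value_heads / num_attention_heads
            * max_position_embeddings * kv_bytes * 2 / 1024 ** 3)


def min_gpus(required_mem_g: float, mem_per_gpu_g: float, num_attention_heads: int) -> int:
    """Smallest GPU count that fits the memory and divides num_attention_heads."""
    n = math.ceil(required_mem_g / mem_per_gpu_g)
    if n > num_attention_heads:
        raise ValueError("no GPU count satisfies the divisibility requirement")
    while num_attention_heads % n != 0:
        n += 1
    return n


# Qwen2-72B-Instruct, values from config.json as listed in the example table.
weights = weights_mem_g(72, 2)
kv_cache = kv_cache_mem_g(8192, 80, 8, 64, 32768, 2)
required = (weights + kv_cache) / 0.9  # gpu_memory_utilization = 0.9
print(weights, kv_cache, round(required, 2))
print([min_gpus(required, size, 64) for size in (24, 40, 48)])
```

Changing the precision arguments to 1 or 0.5 gives a rough estimate of the effect of INT8/FP8 or INT4/NF4 quantization on the same model.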