
vLLM 是 UC Berkeley 开源的高性能大语言模型推理引擎,专为生产环境的 LLM 服务设计。传统推理框架在处理大规模并发请求时,存在显存利用率低、吞吐量受限等问题。vLLM 通过创新的 PagedAttention 机制,将 KV Cache 的内存管理效率提升至接近理论上限,官方基准测试中相比 HuggingFace Transformers 最高可获得约 24 倍的吞吐量提升。
在实际生产环境中,大模型推理服务面临三大核心挑战:GPU 显存碎片化导致的资源浪费、动态批处理的调度复杂度、以及高并发场景下的请求排队延迟。vLLM 借鉴操作系统虚拟内存的分页思想重构了 KV Cache 管理,在很大程度上化解了这些痛点。
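为了直观理解 PagedAttention 的分块思路,下面给出一个极简的示意(纯教学用途,仅演示"逻辑 token 位置 → 物理块"的按需映射,与 vLLM 内部实现无关):
# 纯示意:用"块表"按需把 token 映射到物理显存块,避免为整条序列预留连续空间
BLOCK_SIZE = 16  # 每个物理块可存放的 token 数

class BlockTable:
    def __init__(self):
        self.free_blocks = list(range(1024))  # 假设共有 1024 个物理块
        self.table = {}  # seq_id -> [物理块编号, ...]

    def append_token(self, seq_id: int, token_idx: int) -> int:
        """返回序列第 token_idx 个 token 所在的物理块,只在上一块写满时才分配新块。"""
        blocks = self.table.setdefault(seq_id, [])
        if token_idx // BLOCK_SIZE >= len(blocks):
            blocks.append(self.free_blocks.pop())
        return blocks[token_idx // BLOCK_SIZE]

bt = BlockTable()
for i in range(40):          # 一条 40 token 的序列只占用 3 个块
    bt.append_token(seq_id=0, token_idx=i)
print(bt.table[0])           # 例如 [1023, 1022, 1021]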
| 组件 | 版本要求 | 说明 |
|---|---|---|
| 操作系统 | Ubuntu 20.04/22.04 或 CentOS 7.9+ | 推荐 Ubuntu 22.04,内核版本 5.15+ |
| GPU 驱动 | NVIDIA Driver 525.x 或更高 | 支持 CUDA 12.x |
| CUDA Toolkit | 12.1 或 12.2 | 不推荐 12.3+(兼容性问题) |
| Python | 3.9 - 3.11 | 推荐 3.10,不支持 3.12 |
| GPU 硬件 | NVIDIA A100/A800/H100/H800 | 最低 V100 16GB,推荐计算能力 8.0+ |
| 内存 | 物理内存 >= 2x GPU 显存 | 用于模型加载和 CPU offload |
| 磁盘空间 | >= 200GB SSD | 存储模型权重和缓存 |
# 检查 GPU 信息
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# 检查 CUDA 版本
nvcc --version
nvidia-smi | grep "CUDA Version"
# 检查 Python 版本
python3 --version
# 检查系统资源
free -h
df -h /data
lscpu | grep -E "^CPU\(s\)|Model name|Thread"
# 检查 PCIe 带宽(重要:影响多 GPU 通信)
nvidia-smi topo -m
# 更新系统包
sudo apt update && sudo apt upgrade -y
# 安装编译依赖
sudo apt install -y \
build-essential \
cmake \
git \
libssl-dev \
libnuma-dev \
numactl
# 创建虚拟环境
python3 -m venv /opt/vllm-env
source /opt/vllm-env/bin/activate
# 升级 pip 和核心包
pip install --upgrade pip setuptools wheel
# 安装 PyTorch(CUDA 12.1 版本)
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
# 验证 PyTorch 安装
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
# 安装 vLLM(从 PyPI)
pip install vllm==0.2.7
# 或从源码编译(获取最新优化)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
# 安装额外依赖
pip install ray[default]==2.9.1 # 用于分布式服务
pip install fastapi uvicorn[standard] # API 服务
pip install prometheus-client # 监控指标
# 验证安装
python -c "import vllm; print(vllm.__version__)"
vllm --version
# 从 HuggingFace 下载模型(以 Llama-2-7B 为例)
mkdir -p /data/models
cd /data/models
# 使用 huggingface-cli 下载
pip install huggingface_hub
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
--local-dir /data/models/Llama-2-7b-chat-hf \
--local-dir-use-symlinks False
# 或使用 git lfs
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
# 检查模型文件完整性
ls -lh /data/models/Llama-2-7b-chat-hf/
# 应包含:config.json, tokenizer.json, pytorch_model.bin 等
# 基础启动命令
python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-7b-chat-hf \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.90 \
--max-model-len 4096 \
--dtype float16
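设置 --max-model-len 之前,可以先读取模型目录下的 config.json 确认 max_position_embeddings(以下为示意脚本,模型路径沿用上文的下载目录):
import json

model_path = "/data/models/Llama-2-7b-chat-hf"
with open(f"{model_path}/config.json") as f:
    cfg = json.load(f)

# Llama-2 系列该值为 4096,--max-model-len 不应超过它
print("max_position_embeddings =", cfg.get("max_position_embeddings"))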
关键参数说明:

- --gpu-memory-utilization:GPU 显存使用比例(推荐 0.85-0.95,预留空间给 CUDA 上下文)
- --max-model-len:最大序列长度(需小于模型配置的 max_position_embeddings)
- --dtype:推理精度(float16/bfloat16/float32,A100 推荐 bfloat16)
- --swap-space:CPU 内存交换空间(GB),用于显存不足时的降级方案

# 使用 2 张 GPU 并行推理
python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-70b-chat-hf \
--tensor-parallel-size 2 \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.90 \
--dtype bfloat16
# 使用 4 张 GPU(需要 NVLink 或 NVSwitch 连接)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-70b-chat-hf \
--tensor-parallel-size 4 \
--pipeline-parallel-size 1 \
--host 0.0.0.0 \
--port 8000
# 配置文件:/etc/vllm/config.py
from vllm import EngineArgs, LLMEngine
engine_args = EngineArgs(
model="/data/models/Llama-2-7b-chat-hf",
tokenizer="/data/models/Llama-2-7b-chat-hf",
# 显存管理
gpu_memory_utilization=0.90,
swap_space=4, # 4GB CPU swap
max_model_len=4096,
# 批处理参数
max_num_batched_tokens=8192, # 单批次最大 token 数
max_num_seqs=256, # 最大并发序列数
# 调度策略
scheduling_policy="fcfs", # fcfs 或 priority
max_waiting_time=600, # 最大等待时间(秒)
# KV Cache 配置
block_size=16, # PagedAttention block 大小
enable_prefix_caching=True, # 启用前缀缓存
# 性能优化
dtype="bfloat16",
quantization=None, # 或 "awq", "gptq"
enforce_eager=False, # True 时禁用 CUDA Graph,更省显存但延迟升高;False 延迟更低
# 分布式配置
tensor_parallel_size=1,
pipeline_parallel_size=1,
distributed_executor_backend="ray",
)
# 初始化引擎
engine = LLMEngine.from_engine_args(engine_args)
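上面的 EngineArgs/LLMEngine 属于底层接口;如果只是做离线批量推理,也可以直接使用更高层的 LLM 封装(下面是一个最小示例,模型路径沿用上文假设):
from vllm import LLM, SamplingParams

llm = LLM(model="/data/models/Llama-2-7b-chat-hf", dtype="bfloat16", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

# 一次性提交多条 prompt,由引擎内部做连续批处理
outputs = llm.generate(["San Francisco is a", "The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)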
# 创建 systemd 服务文件
sudo tee /etc/systemd/system/vllm.service > /dev/null <<EOF
[Unit]
Description=vLLM API Server
After=network.target
[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm
Environment="PATH=/opt/vllm-env/bin:/usr/local/bin:/usr/bin"
Environment="CUDA_VISIBLE_DEVICES=0"
ExecStart=/opt/vllm-env/bin/python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-7b-chat-hf \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.90 \
--max-model-len 4096 \
--dtype bfloat16 \
--disable-log-requests
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/vllm/access.log
StandardError=append:/var/log/vllm/error.log
[Install]
WantedBy=multi-user.target
EOF
# 创建日志目录
sudo mkdir -p /var/log/vllm
sudo chown vllm:vllm /var/log/vllm
# 启动服务
sudo systemctl daemon-reload
sudo systemctl start vllm
sudo systemctl enable vllm
# 查看服务状态
sudo systemctl status vllm
# 健康检查
curl http://localhost:8000/health
# 模型信息查询
curl http://localhost:8000/v1/models
# 简单生成测试
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/models/Llama-2-7b-chat-hf",
"prompt": "San Francisco is a",
"max_tokens": 50,
"temperature": 0.7
}'
# 流式输出测试
curl -N http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/models/Llama-2-7b-chat-hf",
"prompt": "Write a story about AI:",
"max_tokens": 200,
"stream": true
}'
# 预期输出:逐字流式返回生成内容
# 安装性能测试工具
pip install locust
# 创建压测脚本:benchmark.py
cat > benchmark.py <<'EOF'
from locust import HttpUser, task, between
import json
class VLLMUser(HttpUser):
wait_time = between(1, 2)
@task
def generate(self):
payload = {
"model": "/data/models/Llama-2-7b-chat-hf",
"prompt": "Explain quantum computing in simple terms:",
"max_tokens": 100,
"temperature": 0.7
}
self.client.post("/v1/completions", json=payload)
EOF
# 运行压测(100 并发用户,30 秒)
locust -f benchmark.py --host=http://localhost:8000 \
--users 100 --spawn-rate 10 --run-time 30s --headless
# 使用官方 benchmark 工具
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Llama-2-7b-chat-hf &
sleep 10
python benchmarks/benchmark_throughput.py \
--model /data/models/Llama-2-7b-chat-hf \
--num-prompts 100 \
--input-len 128 \
--output-len 256
#!/bin/bash
# 文件路径:/opt/vllm/start_vllm.sh
set -e
# 环境变量配置
export CUDA_VISIBLE_DEVICES=0,1 # 指定使用的 GPU
export VLLM_WORKER_MULTIPROC_METHOD=spawn # 多进程启动方式
export TOKENIZERS_PARALLELISM=false  # 避免 tokenizer 警告
export OMP_NUM_THREADS=32 # CPU 线程数
# 模型和数据路径
MODEL_PATH="/data/models/Llama-2-13b-chat-hf"
LOG_DIR="/var/log/vllm"
PID_FILE="/var/run/vllm.pid"
# 创建日志目录
mkdir -p $LOG_DIR
# 检查是否已运行
if [ -f $PID_FILE ]; then
OLD_PID=$(cat $PID_FILE)
if ps -p $OLD_PID > /dev/null 2>&1; then
echo"vLLM is already running (PID: $OLD_PID)"
exit 1
else
rm -f $PID_FILE
fi
fi
# 启动 vLLM
nohup python -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--served-model-name llama-2-13b-chat \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.92 \
--max-model-len 4096 \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--dtype bfloat16 \
--enable-prefix-caching \
--disable-log-requests \
--trust-remote-code \
> $LOG_DIR/vllm.log 2>&1 &
# 记录 PID
echo $! > $PID_FILE
echo"vLLM started (PID: $(cat $PID_FILE))"
echo"Logs: $LOG_DIR/vllm.log"
# 等待服务就绪
sleep 10
for i in {1..30}; do
if curl -s http://localhost:8000/health > /dev/null; then
echo "vLLM is ready!"
exit 0
fi
echo "Waiting for vLLM to start... ($i/30)"
sleep 2
done
echo "ERROR: vLLM failed to start"
exit 1
# 文件路径:/opt/vllm/vllm_client.py
import requests
import json
from typing import Iterator, Optional, Dict, Any
class VLLMClient:
"""vLLM API 客户端封装"""
def __init__(self, base_url: str = "http://localhost:8000"):
self.base_url = base_url.rstrip('/')
self.session = requests.Session()
self.session.headers.update({
"Content-Type": "application/json"
})
def generate(
self,
prompt: str,
model: str = "llama-2-13b-chat",
max_tokens: int = 256,
temperature: float = 0.7,
top_p: float = 0.95,
stream: bool = False,
**kwargs
) -> Dict[str, Any] | Iterator[Dict[str, Any]]:
"""
生成文本
Args:
prompt: 输入提示
model: 模型名称
max_tokens: 最大生成 token 数
temperature: 采样温度
top_p: 核采样参数
stream: 是否流式输出
**kwargs: 其他参数(stop, frequency_penalty 等)
"""
payload = {
"model": model,
"prompt": prompt,
"max_tokens": max_tokens,
"temperature": temperature,
"top_p": top_p,
"stream": stream,
**kwargs
}
if stream:
return self._stream_generate(payload)
else:
response = self.session.post(
f"{self.base_url}/v1/completions",
json=payload,
timeout=300
)
response.raise_for_status()
return response.json()
def _stream_generate(self, payload: Dict) -> Iterator[Dict]:
"""流式生成处理"""
response = self.session.post(
f"{self.base_url}/v1/completions",
json=payload,
stream=True,
timeout=300
)
response.raise_for_status()
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data = line[6:] # 移除 'data: ' 前缀
if data == '[DONE]':
break
yield json.loads(data)
def chat(
self,
messages: list,
model: str = "llama-2-13b-chat",
max_tokens: int = 256,
temperature: float = 0.7,
stream: bool = False
):
"""聊天接口(OpenAI 兼容)"""
payload = {
"model": model,
"messages": messages,
"max_tokens": max_tokens,
"temperature": temperature,
"stream": stream
}
response = self.session.post(
f"{self.base_url}/v1/chat/completions",
json=payload,
stream=stream,
timeout=300
)
response.raise_for_status()
if stream:
return self._stream_chat(response)
else:
return response.json()
def _stream_chat(self, response):
"""流式聊天处理"""
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data = line[6:]
if data == '[DONE]':
break
yield json.loads(data)
def get_models(self):
"""获取可用模型列表"""
response = self.session.get(f"{self.base_url}/v1/models")
response.raise_for_status()
return response.json()
def health_check(self) -> bool:
"""健康检查"""
try:
response = self.session.get(
f"{self.base_url}/health",
timeout=5
)
return response.status_code == 200
except Exception:
return False
# 使用示例
if __name__ == "__main__":
client = VLLMClient("http://localhost:8000")
# 健康检查
print(f"Health: {client.health_check()}")
# 简单生成
result = client.generate(
prompt="Explain machine learning in one sentence:",
max_tokens=50
)
print(f"Generated: {result['choices'][0]['text']}")
# 流式生成
print("\nStreaming generation:")
for chunk in client.generate(
prompt="Write a haiku about AI:",
max_tokens=100,
stream=True
):
text = chunk['choices'][0]['text']
print(text, end='', flush=True)
print()
# 聊天接口
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
response = client.chat(messages=messages)
print(f"\nChat response: {response['choices'][0]['message']['content']}")
场景描述:支持 1000 QPS 的在线推理服务,P99 延迟要求 < 500ms
实现架构:
# 1. 使用 Nginx 做负载均衡
# /etc/nginx/conf.d/vllm.conf
upstream vllm_backend {
least_conn; # 最少连接数算法
server 192.168.1.101:8000 weight=1 max_fails=3 fail_timeout=30s;
server 192.168.1.102:8000 weight=1 max_fails=3 fail_timeout=30s;
server 192.168.1.103:8000 weight=1 max_fails=3 fail_timeout=30s;
keepalive 128; # 保持连接池
}
server {
listen 80;
server_name api.example.com;
# 限流配置
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
limit_req zone=api_limit burst=200 nodelay;
location /v1/ {
proxy_pass http://vllm_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# 超时配置
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
# 缓冲配置
proxy_buffering off; # 流式输出需要关闭缓冲
}
# 健康检查端点
location /health {
proxy_pass http://vllm_backend/health;
access_log off;
}
}
运行结果:
压测结果(10000 次请求):
- 平均延迟: 287ms
- P50 延迟: 245ms
- P95 延迟: 412ms
- P99 延迟: 478ms
- 吞吐量: 1247 QPS
- GPU 利用率: 94%
场景描述:根据请求复杂度自动路由到不同规模的模型
实现代码:
# 文件名:multi_model_router.py
from fastapi import FastAPI, Request
from vllm_client import VLLMClient
import tiktoken
import asyncio
app = FastAPI()
# 模型配置
MODELS = {
"small": {
"endpoint": "http://192.168.1.101:8000", # 7B 模型
"max_tokens": 2048,
"cost_per_1k": 0.001
},
"medium": {
"endpoint": "http://192.168.1.102:8000", # 13B 模型
"max_tokens": 4096,
"cost_per_1k": 0.002
},
"large": {
"endpoint": "http://192.168.1.103:8000", # 70B 模型
"max_tokens": 8192,
"cost_per_1k": 0.01
}
}
# 初始化客户端
clients = {
name: VLLMClient(config["endpoint"])
for name, config in MODELS.items()
}
# 初始化 tokenizer(用于计算 token 数)
tokenizer = tiktoken.get_encoding("cl100k_base")
def select_model(prompt: str, max_output_tokens: int) -> str:
"""根据输入复杂度选择模型"""
input_tokens = len(tokenizer.encode(prompt))
total_tokens = input_tokens + max_output_tokens
# 路由策略
if total_tokens <= 2048 and "code" not in prompt.lower():
return "small"  # 简单任务用小模型
elif total_tokens <= 4096:
return "medium"  # 中等复杂度
else:
return "large"  # 复杂任务或长文本
@app.post("/v1/completions")
async def completions(request: Request):
"""智能路由的 completions 接口"""
data = await request.json()
prompt = data.get("prompt", "")
max_tokens = data.get("max_tokens", 256)
# 选择模型
model_name = select_model(prompt, max_tokens)
client = clients[model_name]
# 调用推理
result = client.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=data.get("temperature", 0.7),
stream=data.get("stream", False)
)
# 添加路由信息到响应
if not data.get("stream"):
result["_routing"] = {
"model": model_name,
"endpoint": MODELS[model_name]["endpoint"]
}
return result
@app.get("/stats")
async def stats():
"""获取各模型健康状态"""
status = {}
for name, client in clients.items():
status[name] = {
"healthy": client.health_check(),
"endpoint": MODELS[name]["endpoint"]
}
return status
if __name__ == "__main__":
import uvicorn
uvicorn.run("multi_model_router:app", host="0.0.0.0", port=8080, workers=4)  # workers > 1 时需以导入字符串方式传入 app
实现步骤:运行 python multi_model_router.py 启动路由服务(监听 8080 端口),即可对外提供统一的 /v1/completions 入口。

场景描述:聊天应用中大量请求包含相同的 System Prompt
优化配置:
# 启用 Prefix Caching
python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--gpu-memory-utilization 0.85 \
--max-num-seqs 512
# 测试脚本:验证缓存效果
import time
from vllm_client import VLLMClient
client = VLLMClient()
system_prompt = "You are a helpful AI assistant specialized in Python programming. " * 50# 长 system prompt
user_queries = [
"How to sort a list?",
"Explain list comprehension",
"What is a decorator?",
] * 10
# 第一轮:冷启动
start = time.time()
for query in user_queries:
client.generate(prompt=f"{system_prompt}\n\nUser: {query}\nAssistant:", max_tokens=100)
cold_time = time.time() - start
# 第二轮:命中缓存
start = time.time()
for query in user_queries:
client.generate(prompt=f"{system_prompt}\n\nUser: {query}\nAssistant:", max_tokens=100)
warm_time = time.time() - start
print(f"Cold start: {cold_time:.2f}s")
print(f"Warm start: {warm_time:.2f}s")
print(f"Speedup: {cold_time/warm_time:.2f}x")
运行结果:
Cold start: 45.32s
Warm start: 18.74s
Speedup: 2.42x
缓存命中率: 87%
显存节省: 3.2GB
显存配置优化:根据实际负载调整 gpu-memory-utilization
# 低并发场景(< 50 QPS)
--gpu-memory-utilization 0.85 # 预留更多空间给 CUDA
# 高并发场景(> 200 QPS)
--gpu-memory-utilization 0.95 # 最大化显存利用
# 动态检测最优值
python -c "
from vllm import LLM
llm = LLM(model='/data/models/Llama-2-7b-chat-hf')
print(f'Recommended GPU memory utilization: {llm.llm_engine.scheduler.gpu_memory_utilization}')
"
批处理参数调优:平衡延迟和吞吐
# 低延迟优先(在线服务)
--max-num-batched-tokens 4096 \
--max-num-seqs 128 \
--scheduler-delay 0.0 # 不等待凑批
# 高吞吐优先(离线批处理)
--max-num-batched-tokens 16384 \
--max-num-seqs 512 \
--scheduler-delay 0.1 # 等待 100ms 凑批
量化加速:使用 AWQ/GPTQ 降低显存占用
# AWQ 4-bit 量化(推荐)
pip install autoawq
python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-13b-AWQ \
--quantization awq \
--dtype half \
--gpu-memory-utilization 0.90
# 性能对比:
# FP16: 26GB 显存, 35 tokens/s
# AWQ: 8GB 显存, 32 tokens/s (速度仅降低 9%)
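上面的命令假设已经有量化好的 /data/models/Llama-2-13b-AWQ;如果需要自行量化,可以参考下面基于 AutoAWQ 的示意流程(仅为草图,量化参数与 API 细节以 AutoAWQ 官方文档为准):
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/data/models/Llama-2-13b-chat-hf"   # 原始 FP16 权重
quant_path = "/data/models/Llama-2-13b-AWQ"       # 量化后输出目录
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # 量化过程需要 GPU,耗时与模型规模相关
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)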
API 访问控制:添加认证和限流
# 文件名:secure_vllm_server.py
from fastapi import FastAPI, HTTPException, Depends, Header, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
import hashlib
import hmac
app = FastAPI()
security = HTTPBearer()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
# API Key 验证
VALID_API_KEYS = {
"sk-prod-key-1": {"tier": "premium", "rate_limit": "100/minute"},
"sk-prod-key-2": {"tier": "basic", "rate_limit": "10/minute"}
}
def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(security)):
api_key = credentials.credentials
if api_key not in VALID_API_KEYS:
raise HTTPException(status_code=401, detail="Invalid API Key")
return VALID_API_KEYS[api_key]
@app.post("/v1/completions")
@limiter.limit("100/minute") # 全局限流
async def completions(request: Request, user_info=Depends(verify_api_key)):
# 应用用户级别限流
rate_limit = user_info["rate_limit"]
# ... 推理逻辑
输入过滤:防止 Prompt Injection
import re
def sanitize_prompt(prompt: str) -> str:
"""清理危险输入"""
# 移除潜在的系统指令注入
dangerous_patterns = [
r"ignore previous instructions",
r"system:",
r"<\|im_start\|>", # ChatML 标记
r"\[INST\]", # Llama-2 指令标记
]
for pattern in dangerous_patterns:
prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
# 长度限制
max_length = 4096
if len(prompt) > max_length:
prompt = prompt[:max_length]
return prompt
模型权重保护:加密存储
# 使用 dm-crypt 加密模型存储分区
sudo cryptsetup luksFormat /dev/sdb1
sudo cryptsetup luksOpen /dev/sdb1 models_encrypted
sudo mkfs.ext4 /dev/mapper/models_encrypted
sudo mount /dev/mapper/models_encrypted /data/models
# 自动挂载配置(需要密钥文件)
echo"models_key" > /root/.models_key
chmod 600 /root/.models_key
echo"models_encrypted /dev/sdb1 /root/.models_key luks" >> /etc/crypttab
多实例部署:使用 Kubernetes 管理
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
      model: llama2-7b
  template:
    metadata:
      labels:
        app: vllm
        model: llama2-7b
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.2.7
        command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=/models/Llama-2-7b-chat-hf
        - --host=0.0.0.0
        - --port=8000
        - --gpu-memory-utilization=0.90
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: models
          mountPath: /models
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
备份策略:自动备份关键配置
#!/bin/bash
# 文件名:backup_vllm_config.sh
BACKUP_DIR="/data/backups/vllm"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
# 备份配置文件
tar -czf $BACKUP_DIR/config_$TIMESTAMP.tar.gz \
/etc/vllm/ \
/etc/systemd/system/vllm.service \
/opt/vllm/
# 备份模型元数据(不备份权重文件,太大)
find /data/models -name "config.json" -o -name "tokenizer_config.json" | \
tar -czf $BACKUP_DIR/model_metadata_$TIMESTAMP.tar.gz -T -
# 保留最近 7 天的备份
find $BACKUP_DIR -name "*.tar.gz" -mtime +7 -delete
echo"Backup completed: $BACKUP_DIR/config_$TIMESTAMP.tar.gz"
⚠️ 警告:max-model-len 设置过大会导致 OOM,必须根据 GPU 显存计算
❗ 显存计算公式:
# KV Cache 显存占用估算
def estimate_kv_cache_memory(
num_layers: int,
hidden_size: int,
num_heads: int,
max_seq_len: int,
max_batch_size: int,
dtype_bytes: int = 2 # float16/bfloat16
) -> float:
"""返回 GB"""
kv_cache_per_token = 2 * num_layers * hidden_size * dtype_bytes
total_bytes = kv_cache_per_token * max_seq_len * max_batch_size
return total_bytes / (1024**3)
# Llama-2-7B 示例
memory = estimate_kv_cache_memory(
num_layers=32,
hidden_size=4096,
num_heads=32,
max_seq_len=4096,
max_batch_size=8  # 按同时处于满长(4096 token)的 8 条序列估算
)
print(f"KV Cache: {memory:.2f} GB") # 约 16GB
# 加上模型权重 13GB,总共需要 29GB(A100 40GB 可以运行)
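权重部分的 13GB 可以按"每参数 2 字节"粗略估算(示意计算,参数量取约数):
# 粗略估算:FP16/BF16 下每个参数占 2 字节
num_params = 7e9  # Llama-2-7B 约 70 亿参数
print(f"Weights ≈ {num_params * 2 / 1024**3:.1f} GB")  # 约 13.0 GB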
❗ 张量并行的 GPU 必须通过 NVLink 连接:
# 检查 GPU 拓扑
nvidia-smi topo -m
# 输出示例(正确配置):
# GPU0 GPU1 GPU2 GPU3
# GPU0 X NV12 NV12 NV12
# GPU1 NV12 X NV12 NV12
# GPU2 NV12 NV12 X NV12
# GPU3 NV12 NV12 NV12 X
# 如果显示 PHB(PCIe Host Bridge),性能会严重下降
❗ 不要混用不同精度的模型权重和推理 dtype:
# 错误示例:模型权重是 FP32,推理却指定 BF16,会导致精度损失
python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-7b-fp32 \
--dtype bfloat16
# 正确做法:转换模型
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
'/data/models/Llama-2-7b-fp32',
torch_dtype='bfloat16'
)
model.save_pretrained('/data/models/Llama-2-7b-bf16')
"
| 错误现象 | 原因分析 | 解决方案 |
|---|---|---|
| CUDA out of memory | gpu-memory-utilization 过高或 max-model-len 过大 | 降低参数值,或增加 swap-space |
| RuntimeError: NCCL error | 多 GPU 通信失败,通常是网络或驱动问题 | 检查 nvidia-smi topo -m,更新驱动 |
| requests timeout | 模型加载慢或批处理堆积 | 增加 --max-num-seqs,检查磁盘 IO |
| AssertionError: max_model_len > max_position_embeddings | 超过模型支持的最大序列长度 | 减小 --max-model-len 至模型配置值 |
| ValueError: quantization not supported | 量化格式不匹配 | 检查模型是否是对应的量化版本(AWQ/GPTQ) |
版本兼容:
平台兼容:
模型兼容:
# 检查模型是否支持
from vllm import LLM
supported_models = [
"LlamaForCausalLM",
"MistralForCausalLM",
"GPT2LMHeadModel",
"GPTNeoXForCausalLM",
"FalconForCausalLM",
# ... 查看完整列表: https://docs.vllm.ai/en/latest/models/supported_models.html
]
# 验证模型
try:
llm = LLM(model="/data/models/your-model")
print("Model is compatible!")
except Exception as e:
print(f"Model not supported: {e}")
# 查看 systemd 服务日志
sudo journalctl -u vllm -f --since "10 minutes ago"
# 查看应用日志
tail -f /var/log/vllm/vllm.log
# 过滤错误日志
grep -E "ERROR|CUDA|OOM" /var/log/vllm/vllm.log | tail -50
# 查看 GPU 错误
dmesg | grep -i "nvidia\|gpu"
# 实时监控 GPU 状态
watch -n 1 nvidia-smi
问题一:服务启动后立即 OOM
# 诊断命令
nvidia-smi --query-gpu=memory.used,memory.free --format=csv
python -c "
from vllm import LLM
import torch
print(f'Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB')
"
解决方案:
- 用 fuser -v /dev/nvidia* 找出占用显存的进程,杀死无关进程
- 降低 --gpu-memory-utilization 至 0.80
- 减小 --max-model-len 或 --max-num-seqs
- 启用量化:--quantization awq
- 验证:nvidia-smi 显示显存占用稳定在 90% 以下

问题二:推理速度异常慢(< 10 tokens/s)
# 诊断命令
# 1. 检查 GPU 利用率
nvidia-smi dmon -s u -c 10
# 2. 检查 PCIe 带宽
nvidia-smi pcie
# 3. 检查 CPU 瓶颈
top -H -p $(pgrep -f vllm)
# 4. 性能分析
pip install py-spy
sudo py-spy top --pid $(pgrep -f vllm)
解决方案:
- 检查是否开启了 --enforce-eager(禁用可提升速度)
- 调整 --tokenizer-pool-size
- 检查 nvidia-smi topo -m,可能是 GPU 插槽问题

问题三:请求排队时间长
# 诊断命令
curl http://localhost:8000/metrics | grep vllm_queue
# 查看调度器状态
curl http://localhost:8000/stats
解决方案:
- 增大 --max-num-seqs 允许更多并发
- 减少 --scheduler-delay 加快调度

# 开启详细日志
export VLLM_LOGGING_LEVEL=DEBUG
python -m vllm.entrypoints.api_server \
--model /data/models/Llama-2-7b-chat-hf \
--log-level debug
# 开启 CUDA 错误检查
export CUDA_LAUNCH_BLOCKING=1
# 性能分析(生成 trace 文件)
pip install torch-tb-profiler
python -m torch.utils.bottleneck your_script.py
# NCCL 调试(多 GPU 通信问题)
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# GPU 指标
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free,temperature.gpu,power.draw --format=csv -l 1
# vLLM 内置指标(Prometheus 格式)
curl http://localhost:8000/metrics
# 关键指标:
# - vllm_num_requests_running:运行中的请求数
# - vllm_num_requests_waiting:等待中的请求数
# - vllm_gpu_cache_usage_perc:KV Cache 使用率
# - vllm_avg_generation_throughput:平均吞吐量(tokens/s)
# - vllm_avg_time_to_first_token_seconds:首 token 延迟
# - vllm_avg_time_per_output_token_seconds:单 token 延迟
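也可以用一个小脚本定期抓取 /metrics,只打印关心的几项(指标名沿用上文示例,实际名称以你部署版本的 /metrics 输出为准):
import requests

KEY_PREFIXES = (
    "vllm_num_requests_running",
    "vllm_num_requests_waiting",
    "vllm_gpu_cache_usage_perc",
)

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    # 只保留以关注指标名开头的样本行,自动跳过 # HELP / # TYPE 注释
    if line.startswith(KEY_PREFIXES):
        print(line)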
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets:
          - '192.168.1.101:8000'
          - '192.168.1.102:8000'
          - '192.168.1.103:8000'
    metrics_path: '/metrics'

  - job_name: 'nvidia_gpu'
    static_configs:
      - targets: ['192.168.1.101:9835']  # nvidia-dcgm-exporter
| 指标名称 | 正常范围 | 告警阈值 | 说明 |
|---|---|---|---|
| GPU 利用率 | 85-98% | < 70% 或 = 100% | 过低浪费资源,100% 可能卡死 |
| 显存使用率 | 88-95% | > 98% | 接近上限可能 OOM |
| 首 token 延迟 | 50-200ms | > 500ms | 影响用户体验 |
| 单 token 延迟 | 10-30ms | > 50ms | 生成速度过慢 |
| 队列长度 | 0-10 | > 50 | 需要扩容 |
| 吞吐量 | 模型相关 | 低于基准 30% | 性能下降 |
# prometheus_alerts.yml
groups:
  - name: vllm_alerts
    interval: 30s
    rules:
      # GPU 异常
      - alert: GPUMemoryHigh
        expr: vllm_gpu_cache_usage_perc > 98
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GPU 显存使用率过高"
          description: "{{ $labels.instance }} 显存使用 {{ $value }}%"
      # 队列堆积
      - alert: RequestQueueHigh
        expr: vllm_num_requests_waiting > 50
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "请求队列堆积严重"
          description: "{{ $labels.instance }} 队列长度 {{ $value }}"
      # 延迟异常
      - alert: HighLatency
        expr: vllm_avg_time_to_first_token_seconds > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "首 token 延迟过高"
          description: "{{ $labels.instance }} 延迟 {{ $value }}s"
      # 吞吐量下降
      - alert: LowThroughput
        expr: rate(vllm_avg_generation_throughput[5m]) < 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "吞吐量异常下降"
          description: "{{ $labels.instance }} 吞吐量 {{ $value }} tokens/s"
#!/bin/bash
# 文件名:backup_vllm_full.sh
set -e
BACKUP_ROOT="/data/backups/vllm"
DATE=$(date +%Y%m%d)
BACKUP_DIR="$BACKUP_ROOT/$DATE"
mkdir -p $BACKUP_DIR
echo"[$(date)] Starting backup..."
# 1. 备份配置文件
tar -czf $BACKUP_DIR/configs.tar.gz \
/etc/vllm/ \
/etc/systemd/system/vllm.service \
/opt/vllm/ \
/etc/nginx/conf.d/vllm.conf
# 2. 备份监控配置
tar -czf $BACKUP_DIR/monitoring.tar.gz \
/etc/prometheus/ \
/var/lib/grafana/dashboards/
# 3. 导出当前运行配置
curl http://localhost:8000/v1/models > $BACKUP_DIR/running_models.json
# 4. 备份日志(最近 7 天)
find /var/log/vllm/ -name "*.log" -mtime -7 | \
tar -czf $BACKUP_DIR/logs.tar.gz -T -
# 5. 生成备份清单
cat > $BACKUP_DIR/manifest.txt <<EOF
Backup Date: $(date)
Hostname: $(hostname)
vLLM Version: $(pip show vllm | grep Version)
CUDA Version: $(nvcc --version | grep release)
Models: $(ls -1 /data/models/)
EOF
# 6. 上传到对象存储(可选)
# aws s3 sync $BACKUP_DIR s3://my-backup-bucket/vllm/$DATE/
echo"[$(date)] Backup completed: $BACKUP_DIR"
# 清理旧备份(保留 30 天)
find $BACKUP_ROOT -maxdepth 1 -type d -mtime +30 -exec rm -rf {} \;
#!/bin/bash
# 文件名:restore_vllm.sh
set -e
if [ -z "$1" ]; then
echo"Usage: $0 <backup_date>"
echo"Example: $0 20240115"
exit 1
fi
BACKUP_DATE=$1
BACKUP_DIR="/data/backups/vllm/$BACKUP_DATE"
if [ ! -d "$BACKUP_DIR" ]; then
echo"ERROR: Backup directory not found: $BACKUP_DIR"
exit 1
fi
echo"[$(date)] Starting restore from $BACKUP_DIR"
# 1. 停止服务
echo"Stopping vLLM service..."
sudo systemctl stop vllm
# 2. 恢复配置文件
echo"Restoring configurations..."
tar -xzf $BACKUP_DIR/configs.tar.gz -C /
# 3. 恢复监控配置
echo"Restoring monitoring configs..."
tar -xzf $BACKUP_DIR/monitoring.tar.gz -C /
# 4. 验证配置完整性
echo"Validating configurations..."
python -m vllm.entrypoints.api_server --help > /dev/null || {
echo"ERROR: vLLM installation is broken!"
exit 1
}
# 5. 重启服务
echo"Restarting services..."
sudo systemctl daemon-reload
sudo systemctl start vllm
# 6. 健康检查
sleep 10
for i in {1..30}; do
if curl -s http://localhost:8000/health > /dev/null; then
echo "[$(date)] Restore completed successfully!"
exit 0
fi
echo "Waiting for vLLM to start... ($i/30)"
sleep 2
done
echo "ERROR: vLLM failed to start after restore"
exit 1
#!/bin/bash
# 文件名:dr_drill.sh
# 定期执行恢复演练(每月)
LOG_FILE="/var/log/vllm/dr_drill_$(date +%Y%m%d).log"
{
echo"=== Disaster Recovery Drill Started ==="
date
# 1. 在测试环境恢复
./restore_vllm.sh $(ls -1 /data/backups/vllm/ | tail -1)
# 2. 功能验证
python -m pytest /opt/vllm/tests/integration/
# 3. 性能基准
python benchmarks/benchmark_throughput.py \
--model /data/models/Llama-2-7b-chat-hf \
--num-prompts 50
# 4. 生成报告
echo"=== Drill Completed ==="
date
} 2>&1 | tee -a $LOG_FILE
# 发送报告
mail -s "DR Drill Report" admin@example.com < $LOG_FILE
gpu-memory-utilization(显存利用率)、max-num-seqs(并发数)、max-num-batched-tokens(批大小)需配合调整

# 基础操作
python -m vllm.entrypoints.api_server --model <MODEL_PATH> --host 0.0.0.0 --port 8000 # 启动服务
curl http://localhost:8000/health # 健康检查
curl http://localhost:8000/v1/models # 查询模型
systemctl status vllm # 查看服务状态
# 性能调优
--gpu-memory-utilization 0.90 # 显存利用率
--max-num-seqs 256 # 最大并发序列
--max-num-batched-tokens 8192 # 批处理大小
--dtype bfloat16 # 推理精度
--enable-prefix-caching # 启用前缀缓存
# 多 GPU
--tensor-parallel-size 2 # 张量并行(2 GPU)
CUDA_VISIBLE_DEVICES=0,1,2,3 # 指定 GPU
# 监控
nvidia-smi dmon -s u # GPU 利用率
curl http://localhost:8000/metrics # Prometheus 指标
journalctl -u vllm -f # 实时日志
# 量化
--quantization awq # AWQ 量化
--quantization gptq # GPTQ 量化
# 调试
export VLLM_LOGGING_LEVEL=DEBUG # 详细日志
export CUDA_LAUNCH_BLOCKING=1 # CUDA 同步模式
显存管理参数:
- gpu_memory_utilization:GPU 显存使用比例(0.0-1.0),推荐 0.85-0.95
- swap_space:CPU 交换空间(GB),显存不足时使用,性能下降明显
- max_model_len:最大序列长度,不能超过模型的 max_position_embeddings

调度参数:

- max_num_seqs:最大并发序列数,影响吞吐量和延迟
- max_num_batched_tokens:单批次最大 token 数,需大于 max_model_len
- scheduling_policy:调度策略(fcfs 先来先服务,priority 优先级)

性能参数:

- block_size:PagedAttention 块大小(常用 8/16/32)
- enable_prefix_caching:启用前缀缓存,适合有共同 prompt 的场景
- enforce_eager:禁用 CUDA Graph,减少显存但增加延迟

分布式参数:

- tensor_parallel_size:张量并行 GPU 数量
- pipeline_parallel_size:流水线并行层数
- distributed_executor_backend:分布式后端(ray 或 mp)

| 术语 | 英文 | 解释 |
|---|---|---|
| PagedAttention | PagedAttention | vLLM 的核心创新,将 KV Cache 分块存储,借鉴操作系统虚拟内存管理思想 |
| KV Cache | Key-Value Cache | Transformer 中存储历史 token 的 Key 和 Value,避免重复计算 |
| 连续批处理 | Continuous Batching | 动态调度不同长度的请求,无需等待整批完成 |
| 张量并行 | Tensor Parallelism | 将单个层的计算分布到多个 GPU,适合超大模型 |
| 流水线并行 | Pipeline Parallelism | 将不同层分布到多个 GPU,类似 CPU 流水线 |
| 首 Token 延迟 | Time To First Token (TTFT) | 从请求到第一个 token 返回的时间,影响交互体验 |
| 吞吐量 | Throughput | 单位时间生成的 token 数(tokens/s 或 requests/s) |
| 量化 | Quantization | 降低模型精度(如 INT8/INT4)以减少显存和加速推理 |
| AWQ | Activation-aware Weight Quantization | 一种保持精度的 4-bit 量化方法 |
| GPTQ | Generative Pre-trained Transformer Quantization | 基于层级量化的压缩技术 |
好了,今天的分享就到这吧,觉得有用的话,别忘了点赞、在看、转发给身边需要的程序员朋友吧~