TencentOS Server LLaMA

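This guide applies to running LLaMA models with the vLLM inference framework on TencentOS Server 3, launched via Docker.
Prerequisites
Make sure you have completed every step in the OPT guide that precedes running the model, and that the vLLM environment is fully set up.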
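Running the Model
Switching the Model Weight Download Source
1. Hugging Face models generally cannot be downloaded directly from mainland China, so first switch the download source to the domestic mirror site HF-Mirror.
Note:
If you passed -e HF_ENDPOINT="https://hf-mirror.com" to docker run, you can skip this step.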
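# Valid for the current session only. The setting is lost once you exit the container and the container stops, so run this command again after re-entering.
export HF_ENDPOINT="https://hf-mirror.com"

# Persist as the default. The setting survives exiting and stopping the container, so the model can be run directly after re-entering (recommended).
echo 'export HF_ENDPOINT="https://hf-mirror.com"' >> ~/.bashrc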
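If you prefer to set the mirror from inside the Python script rather than in the shell, the following is a minimal sketch. It assumes the Hugging Face client libraries read HF_ENDPOINT when they are first imported, so the variable has to be set before vllm is imported. The helper name use_hf_mirror is hypothetical and not part of vLLM.

```python
import os

HF_MIRROR = "https://hf-mirror.com"  # mirror URL used in this guide


def use_hf_mirror(endpoint: str = HF_MIRROR) -> str:
    """Set HF_ENDPOINT unless it is already set; return the effective value."""
    # setdefault keeps any value already provided via `docker run -e` or `export`.
    return os.environ.setdefault("HF_ENDPOINT", endpoint)


# Call this before `from vllm import LLM`, because the endpoint is
# assumed to be read when the Hugging Face libraries are imported.
effective = use_hf_mirror()
print(f"HF_ENDPOINT = {effective}")
```

Because the helper never overwrites an existing value, it is harmless to keep in the script even when the variable was already set through docker run or ~/.bashrc.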
The official demo is located at examples/offline_inference.py. The code of offline_inference.py is as follows:
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
2. Change the line that loads the model, then save the file:
# Original code
llm = LLM(model="facebook/opt-125m")

# Change to:
llm = LLM(model="openlm-research/open_llama_7b")
3. Run the Python file. It automatically downloads the LLaMA_7b model (openlm-research/open_llama_7b) and then starts inference.
# Run the demo
python examples/offline_inference.py
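As an alternative to editing the file each time you switch models, the model ID could be read from an environment variable. This is only a sketch: the variable name VLLM_DEMO_MODEL and the helper pick_model are hypothetical and are not defined by vLLM or by the official demo.

```python
import os

DEFAULT_MODEL = "openlm-research/open_llama_7b"  # model ID used in this guide


def pick_model(env=os.environ) -> str:
    """Return the model ID from VLLM_DEMO_MODEL (hypothetical), else the default."""
    value = env.get("VLLM_DEMO_MODEL", "").strip()
    return value or DEFAULT_MODEL


model_id = pick_model()
print(f"Using model: {model_id}")
# In offline_inference.py the load line would then read: llm = LLM(model=model_id)
```

With this change in place, switching back to the original demo model would only require setting VLLM_DEMO_MODEL=facebook/opt-125m before the run command, without touching the file.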
The demo feeds 4 prompts to the LLaMA_7b model under the vLLM inference framework and prints the generated text for each prompt.
Sample output (for reference only):
Prompt: 'Hello, my name is', Generated text: ' Dario and I work for the National Geographic Society. I have been a National'
Prompt: 'The president of the United States is', Generated text: ' elected to serve as the head of the executive branch of the federal government. The'
Prompt: 'The capital of France is', Generated text: ' Paris. It is situated on the river Seine. It is the 4'
Prompt: 'The future of AI is', Generated text: ' now: how it is changing the recruitment industry\\nAuthor: Jonny Moran'
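The output above is only a reference because the demo samples with temperature=0.8 and top_p=0.95, so the generated text usually differs from run to run. The toy sketch below illustrates the general idea of temperature scaling followed by top-p (nucleus) filtering in pure Python. It is not vLLM's implementation, and the logits and the function name sample_top_p are made up for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample an index from logits using temperature scaling and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -3.0]  # made-up scores for four candidate tokens
draws = [sample_top_p(toy_logits, rng=random.Random(seed)) for seed in range(5)]
print(draws)
```

In this toy example the least likely token (index 3) falls outside the top-p set and is never drawn, while the remaining three can each appear depending on the random seed.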
Notes
Note:
Because OpenCloudOS is the open-source edition of TencentOS Server, all operations in this document should, in principle, also apply to OpenCloudOS.
References
vLLM GitHub
vLLM installation guide
Hugging Face mirror site
Hugging Face open_llama_7b model