This guide describes how to run the official Baichuan model demo with the HuggingFace TGI inference framework on TencentOS Server 3, launched via Docker.
HuggingFace TGI Environment Preparation
Pull the HuggingFace TGI image and, at the same time, configure the mirror source used to download the model. Here we test the baichuan-13b-chat model:
docker run -it --name HFTGI_baichuan_13b_chat --gpus all -e HF_ENDPOINT="https://hf-mirror.com" -e HF_HUB_OFFLINE=1 -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id baichuan-inc/Baichuan-13B-Chat --trust-remote-code
Once the image and model are ready, you will see output like the following (for reference):
...
INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
INFO text_generation_launcher: Default `max_input_tokens` to 4095
INFO text_generation_launcher: Default `max_total_tokens` to 4096
INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `baichuan-inc/Baichuan-13B-Chat` do not contain malicious code.
INFO download: text_generation_launcher: Starting check and download process for baichuan-inc/Baichuan-13B-Chat
INFO text_generation_launcher: Detected system cuda
WARN text_generation_launcher: No safetensors weights found for model baichuan-inc/Baichuan-13B-Chat at revision None. Converting PyTorch weights to safetensors.
INFO text_generation_launcher: Convert: [1/3] -- Took: 0:00:10.643016
INFO text_generation_launcher: Convert: [2/3] -- Took: 0:00:09.686510
INFO text_generation_launcher: Convert: [3/3] -- Took: 0:00:06.149757
INFO download: text_generation_launcher: Successfully downloaded weights for baichuan-inc/Baichuan-13B-Chat
INFO shard-manager: text_generation_launcher: Starting shard rank=0
INFO text_generation_launcher: Detected system cuda
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
INFO shard-manager: text_generation_launcher: Shard ready in 12.109051544s rank=0
INFO text_generation_launcher: Starting Webserver
WARN text_generation_router: router/src/main.rs:218: Offline mode active using cache defaults
INFO text_generation_router: router/src/main.rs:349: Using config Some(Baichuan)
WARN text_generation_router: router/src/main.rs:351: Could not find a fast tokenizer implementation for baichuan-inc/Baichuan-13B-Chat
WARN text_generation_router: router/src/main.rs:352: Rust input length validation and truncation is disabled
WARN text_generation_router: router/src/main.rs:358: no pipeline tag found for model baichuan-inc/Baichuan-13B-Chat
WARN text_generation_router: router/src/main.rs:376: Invalid hostname, defaulting to 0.0.0.0
INFO text_generation_router::server: router/src/server.rs:1577: Warming up model
INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
INFO text_generation_router::server: router/src/server.rs:1604: Using scheduler V3
INFO text_generation_router::server: router/src/server.rs:1656: Setting max batch total tokens to 21872
INFO text_generation_router::server: router/src/server.rs:1894: Connected
HuggingFace TGI is accessed through a web server, so the final Connected line in the log indicates that the Webserver started successfully.
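To confirm from another terminal that the web server is reachable, you can query TGI's /info endpoint, which reports the loaded model and its token limits. The following is only a minimal sketch, assuming the requests package is installed and the port mapping from the docker run command above:

import requests

# Query TGI's /info endpoint to confirm the server is up and see which model is loaded
resp = requests.get("http://localhost:8080/info", timeout=5)
resp.raise_for_status()
info = resp.json()
print(info.get("model_id"), info.get("max_total_tokens"))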
Restarting the Web Server After Exiting the Container
If you exit the container and it stops running, simply restart the container to bring the Web Server back up. The commands are as follows:
# Restart the container
docker start HFTGI_baichuan_13b_chat
# Follow the Web Server logs in real time
docker logs -f HFTGI_baichuan_13b_chat
After the container restarts, the Web Server process is started automatically, and you can once again run the model by sending requests locally on the server.
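The Web Server needs a short warm-up period after docker start. Below is a minimal sketch that polls TGI's /health endpoint until it answers, assuming the requests package is installed (/health returns HTTP 200 once the shard is ready):

import time
import requests

# Poll /health until the restarted Web Server is ready to accept requests
for _ in range(60):
    try:
        if requests.get("http://localhost:8080/health", timeout=2).status_code == 200:
            print("Web Server is ready")
            break
    except requests.exceptions.RequestException:
        pass  # server is not accepting connections yet
    time.sleep(5)
else:
    print("Web Server did not become ready in time")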
Running the Model
After the Web Server has connected successfully in the current window, do not close the window or do anything that might stop the Web Server (checking GPU memory usage tells you whether the Web Server is still running).
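If you prefer to check this from a script, GPU memory usage can also be read by calling nvidia-smi from Python. This is only a sketch and assumes the NVIDIA driver utilities are on the PATH:

import subprocess

# Print per-GPU memory usage; substantial usage indicates the TGI shard is still resident
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)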
Running Interactively via bash
Open another window and enter the following command:
curl -v http://localhost:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":68}}' -H 'Content-Type: application/json'
The inputs field is the prompt passed to the model. The window then returns output like the following (for reference):
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> POST /generate HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.61.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 70
>
* upload completely sent off: 70 out of 70 bytes
< HTTP/1.1 200 OK
< content-type: application/json
< x-compute-type: 1-nvidia-l40
< x-compute-time: 2.758545962
< x-compute-characters: 22
< x-total-time: 2758
< x-validation-time: 0
< x-queue-time: 0
< x-inference-time: 2758
< x-time-per-token: 40
< x-prompt-tokens: 4027
< x-generated-tokens: 68
< content-length: 333
< vary: origin, access-control-request-method, access-control-request-headers
< access-control-allow-origin: *
< date: Thu, 18 Jul 2024 02:40:43 GMT
<
* Connection #0 to host localhost left intact
{"generated_text":"\nDeep learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms that can learn automatically from data. It can be used to learn from data and improve the performance of machines. It can be used to learn from data and improve the performance of artificial intelligence.\n"}
This shows the model is running successfully under the HuggingFace TGI inference framework. Meanwhile, the Web Server window shows the following:
INFO generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(68), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None } total_time="2.758545962s" validation_time="34.56µs" queue_time="67.668µs" inference_time="2.758443884s" time_per_token="40.565351ms" seed="None"}: text_generation_router::server: router/src/server.rs:322: Success
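The same /generate request can also be sent from Python instead of curl; optional sampling parameters such as temperature and do_sample go in the parameters object (these field names match the GenerateParameters shown in the log above). A minimal sketch, assuming the requests package is installed:

import requests

# Equivalent of the curl call above, with a couple of optional sampling parameters
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 68, "temperature": 0.7, "do_sample": True},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])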
Running via the TGI Client
Open another window and first install the packages required to run the TGI Client locally on the server, using pip:
# Switch pip to the Tsinghua mirror to speed up downloads
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Install the text-generation package
pip install text-generation
After installation, create a file named TGI_Client.py locally on the server with the following code:
from text_generation import Client

# Generate
client = Client("http://localhost:8080")
output = client.generate("Why is the sky blue?", max_new_tokens=92).generated_text
print(f"Generate Output: {output}")

# Generate stream
text = ""
for response in client.generate_stream("Why is the sky blue?", max_new_tokens=92):
    if not response.token.special:
        text += response.token.text
print(f"Generate stream Output: {text}")
Run the script:
python TGI_Client.py
The code above prints the output returned by client.generate, and also builds the same answer token by token via client.generate_stream. The output is as follows (for reference):
Generate Output: The sky is blue because of the Rayleigh scattering of sunlight. This is due to the Rayleigh scattering of light by atmospheric gases. This scattering of sunlight by gases in the atmosphere, such as ozone, which is responsible for the blue color of the sky. The shorter wavelengths of light.
Generate stream Output: The sky is blue because of the Rayleigh scattering of sunlight. This is due to the Rayleigh scattering of light by atmospheric gases. This scattering of sunlight by gases in the atmosphere, such as ozone, which is responsible for the blue color of the sky. The shorter wavelengths of light.
Meanwhile, the Web Server window outputs:
INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-l40"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(72), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None } total_time="2.920170273s" validation_time="37.26µs" queue_time="65.668µs" inference_time="2.920067525s" time_per_token="40.556493ms" seed="None"}: text_generation_router::server: router/src/server.rs:511: Success
This indicates the run succeeded.
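The text-generation package also provides an AsyncClient with the same generate/generate_stream interface, which is convenient for issuing several requests concurrently. This is only a sketch, assuming the same package installed above:

import asyncio
from text_generation import AsyncClient

async def main():
    client = AsyncClient("http://localhost:8080")
    # Fire two generation requests concurrently against the same Web Server
    responses = await asyncio.gather(
        client.generate("Why is the sky blue?", max_new_tokens=92),
        client.generate("What is Deep Learning?", max_new_tokens=68),
    )
    for response in responses:
        print(response.generated_text)

asyncio.run(main())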
Running LangChain
Open another window and first install the packages required to run LangChain locally on the server, using pip:
# Switch pip to the Tsinghua mirror to speed up downloads
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Install the required packages
pip install langchain transformers langchain_community
After installation, create a file named LangChain.py locally on the server with the following code:
# Wrapper to TGI client with langchain
from langchain.llms import HuggingFaceTextGenInference
from langchain import PromptTemplate, LLMChain

inference_server_url_local = "http://localhost:8080"

llm_local = HuggingFaceTextGenInference(
    inference_server_url=inference_server_url_local,
    max_new_tokens=400,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.7,
    repetition_penalty=1.03,
)

question = "whats 2 * (1 + 2)"
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain_local = LLMChain(prompt=prompt, llm=llm_local)

# Pass the question string directly; the chain maps it to the "question" variable
output = llm_chain_local(question)
print(output)
Run the script:
python LangChain.py
We ask the model to compute 2 * (1 + 2) and to reason step by step. The model output is as follows (for reference):
{'question': 'whats 2 * (1 + 2)', 'text': ' 1. What is the result of multiplication? Multiplication can be done in two steps or not? The answer to this question will help you understand whether it works, and if there are any solutions for a given solution. In order to evaluate its efficiency when solving problems with an addition operation.\n\nProblem-solving approach based on problem transformation methodology{together.}'}
The model produces output as expected, indicating a successful run. Meanwhile, the Web Server window outputs:
INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-l40"))}:generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.03), frequency_penalty: None, top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: false, max_new_tokens: Some(400), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None } total_time="3.446902913s" validation_time="24.82µs" queue_time="80.588µs" inference_time="3.446797695s" time_per_token="41.033305ms" seed="Some(6173130784801218210)"}: text_generation_router::server: router/src/server.rs:322: Success
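Recent LangChain releases deprecate LLMChain and the langchain.llms import path in favor of imports from langchain_community and pipe-style (LCEL) composition. The following is only a sketch of the equivalent chain in that style, assuming the langchain_community package installed above still provides HuggingFaceTextGenInference:

from langchain_community.llms import HuggingFaceTextGenInference
from langchain_core.prompts import PromptTemplate

llm_local = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080",
    max_new_tokens=400,
    temperature=0.7,
)

prompt = PromptTemplate.from_template(
    "Question: {question}\n\nAnswer: Let's think step by step."
)

# Pipe-style (LCEL) composition: prompt -> LLM
chain = prompt | llm_local
print(chain.invoke({"question": "whats 2 * (1 + 2)"}))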
Notes
Note:
Since OpenCloudOS is the open-source edition of TencentOS Server, in principle all of the operations in this document also apply to OpenCloudOS.
Reference Documentation