ChatGLM

This guide covers running the official demo of the ChatGLM model with the TensorRT-LLM inference framework on TencentOS Server 3, launched via Docker.

Prerequisites

Make sure you have followed the Baichuan guide and completed every step prior to running a model, so that the full TensorRT-LLM environment is already in place.

Run the Model

Switch the Model Download Source to a Mirror

1. Because models on the Hugging Face website cannot be downloaded from the Chinese mainland, first switch the download source to HF-Mirror, a domestic mirror site.
Note:
If you added -e HF_ENDPOINT="https://hf-mirror.com" to docker run, this step can be skipped.
# Effective for the current session only; the setting is lost after you exit the container and stop it. Re-run this command each time you restart and re-enter the container.
export HF_ENDPOINT="https://hf-mirror.com"
Caution:
Appending the setting with echo 'export HF_ENDPOINT="https://hf-mirror.com"' >> ~/.bashrc still leads to download failures; do not use it.
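If you prefer to set the mirror when starting the container, you can pass the variable to docker run instead; a minimal sketch, where <tensorrt-llm-image> is a placeholder for whatever image you prepared in the Baichuan guide:
# Hypothetical launch command; substitute your actual image name and options
docker run --gpus all -it -e HF_ENDPOINT="https://hf-mirror.com" <tensorrt-llm-image> /bin/bash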

Install the Environment Required to Run ChatGLM

Installing TensorRT-LLM does not install every package you need; some model-specific packages still have to be installed separately before running the model.
cd workspace/examples/chatglm
pip install -r requirements.txt
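If pip itself times out from the Chinese mainland, you can point it at a domestic PyPI mirror; the Tsinghua mirror below is one common choice, not a requirement of this guide:
# Optional: install from a domestic PyPI mirror if the default index is unreachable
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple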

Download the Model and Build the TensorRT-LLM Engine(s)

This guide runs the ChatGLM3-6B model with single-GPU inference and FP16.
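Before converting, you can confirm that the GPU is visible inside the container:
# Confirm GPU visibility inside the container
nvidia-smi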
1. Run convert_checkpoint.py to convert the downloaded model weights from the Hugging Face (HF) Transformers format to the TensorRT-LLM format.
# ChatGLM3-6B: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir THUDM/chatglm3-6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu
Running the script starts downloading the model and then converts its format.
Note:
Besides ChatGLM3, models such as ChatGLM2, ChatGLM, and GLM can also be run.
More example commands for reference:
# ChatGLM3-6B: 2-way tensor parallelism
python3 convert_checkpoint.py --model_dir THUDM/chatglm3-6b --tp_size 2 --output_dir trt_ckpt/chatglm3_6b/fp16/2-gpu
# ChatGLM2-6B: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir THUDM/chatglm2-6b --output_dir trt_ckpt/chatglm2_6b/fp16/1-gpu
# ChatGLM-6B: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir THUDM/chatglm-6b --output_dir trt_ckpt/chatglm_6b/fp16/1-gpu
# GLM-10B: single gpu, dtype float16
python3 convert_checkpoint.py --model_dir THUDM/glm-10b --output_dir trt_ckpt/glm_10b/fp16/1-gpu
For more examples, see the NVIDIA TensorRT-LLM ChatGLM Demo.
Caution:
The --model_dir format in the official documentation differs from this guide. The official documentation downloads models by pulling them from a GitHub repository and renaming them, whereas here the models are downloaded directly from Hugging Face, so --model_dir must match the model's canonical name on the Hugging Face website.
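If you would rather fetch the weights ahead of time and convert from a local copy, one option is the huggingface_hub CLI; a sketch, assuming it is installed in the container and using ./chatglm3-6b as an arbitrary local path:
# Optional pre-download to a local directory, then convert from that path
huggingface-cli download THUDM/chatglm3-6b --local-dir ./chatglm3-6b
python3 convert_checkpoint.py --model_dir ./chatglm3-6b --output_dir trt_ckpt/chatglm3_6b/fp16/1-gpu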
2. After the model has been downloaded, build the TensorRT-LLM engine:
# ChatGLM3-6B: single-gpu engine
trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/1-gpu \
--gemm_plugin float16 \
--output_dir trt_engines/chatglm3_6b/fp16/1-gpu
Note:
More example commands for building TensorRT-LLM engines:
# ChatGLM3-6B: 2-way tensor parallelism
trtllm-build --checkpoint_dir trt_ckpt/chatglm3_6b/fp16/2-gpu \
--gemm_plugin float16 \
--output_dir trt_engines/chatglm3_6b/fp16/2-gpu
# ChatGLM2-6B: single-gpu engine with dtype float16, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/chatglm2_6b/fp16/1-gpu \
--gemm_plugin float16 \
--output_dir trt_engines/chatglm2_6b/fp16/1-gpu
# ChatGLM-6B: single-gpu engine with dtype float16, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/chatglm_6b/fp16/1-gpu \
--gemm_plugin float16 \
--output_dir trt_engines/chatglm_6b/fp16/1-gpu
# GLM-10B: single-gpu engine with dtype float16, GPT Attention plugin, Gemm plugin
trtllm-build --checkpoint_dir trt_ckpt/glm_10b/fp16/1-gpu \
--gemm_plugin float16 \
--output_dir trt_engines/glm_10b/fp16/1-gpu
When the build completes, you should see output like the following (for reference):
...
[07/11/2024-03:30:54] [TRT] [I] Engine generation completed in 10.4344 seconds.
[07/11/2024-03:30:54] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1016 MiB, GPU 11911 MiB
[07/11/2024-03:30:55] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 28426 MiB
[07/11/2024-03:30:55] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:12
[07/11/2024-03:30:55] [TRT] [I] Serialized 26 bytes of code generator cache.
[07/11/2024-03:30:55] [TRT] [I] Serialized 148708 bytes of compilation cache.
[07/11/2024-03:30:55] [TRT] [I] Serialized 27 timing cache entries
[07/11/2024-03:30:55] [TRT-LLM] [I] Timing cache serialized to model.cache
[07/11/2024-03:30:55] [TRT-LLM] [I] Serializing engine to trt_engines/chatglm3_6b/fp16/1-gpu/rank0.engine...
[07/11/2024-03:31:00] [TRT-LLM] [I] Engine serialized. Total time: 00:00:04
[07/11/2024-03:31:00] [TRT-LLM] [I] Total time of building all engines: 00:00:18
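As a quick sanity check, you can list the engine directory; for a single-GPU build it typically contains a rank0.engine file and a config.json:
# Verify the engine files were written
ls trt_engines/chatglm3_6b/fp16/1-gpu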

Run Inference

Text Generation (run.py)

Feed the model a prompt; it generates text and returns it.
# Run the default ChatGLM3-6B engine on a single GPU; other model names work if their engines are built.
python3 ../run.py --input_text "What's new between ChatGLM3-6B and ChatGLM2-6B?" \
--max_output_len 50 \
--tokenizer_dir THUDM/chatglm3-6b \
--engine_dir trt_engines/chatglm3_6b/fp16/1-gpu
When the run finishes, the answer is generated as Output (for reference):
...
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 11913 MiB
[TensorRT-LLM][INFO] Allocated 118.50 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11910 (MiB)
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 1082496. Allocating 31037325312 bytes.
Input [Text 0]: "[gMASK] sop What's new between ChatGLM3-6B and ChatGLM2-6B?"
Output [Text 0 Beam 0]: "There is no new information about ChatGLM3-6B, but I heard that ChatGLM2-6B has been updated. Can you tell me more about the updates to ChatGLM2-6B?"
Note:
More example commands for reference:
# Run the default ChatGLM3-6B engine on a single GPU with streaming output; other model names work if their engines are built.
python3 ../run.py --input_text "What's new between ChatGLM3-6B and ChatGLM2-6B?" \
--max_output_len 50 \
--tokenizer_dir THUDM/chatglm3-6b \
--engine_dir trt_engines/chatglm3_6b/fp16/1-gpu \
--streaming
The GLM model can also perform cloze-style completion (generating the words for the [MASK] portion):
# Run the default GLM-10B engine on a single GPU; other model names work if their engines are built.
# The token "[MASK]", "[sMASK]", or "[gMASK]" must appear in the prompt, as the original model requires.
python3 ../run.py --input_text "Peking University is [MASK] than Tsinghua University." \
--max_output_len 50 \
--tokenizer_dir THUDM/glm-10b \
--engine_dir trt_engines/glm_10b/fp16/1-gpu
As in the examples above, note the format of the --tokenizer_dir argument: it must match the model's canonical name on the Hugging Face website.
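If you built the 2-way tensor parallel engine shown earlier, the run must be launched with one MPI rank per GPU; a minimal sketch, assuming mpirun is available in the container as in the TensorRT-LLM examples:
# Run the 2-gpu engine with 2 MPI ranks
mpirun -n 2 --allow-run-as-root \
python3 ../run.py --input_text "What's new between ChatGLM3-6B and ChatGLM2-6B?" \
--max_output_len 50 \
--tokenizer_dir THUDM/chatglm3-6b \
--engine_dir trt_engines/chatglm3_6b/fp16/2-gpu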

Text Summarization (summarize.py)

Summarizes the given text and returns the result.
# Run the summarization task with ChatGLM3-6B; other model names work if their engines are built.
python3 ../summarize.py --test_trt_llm \
--hf_model_dir THUDM/chatglm3-6b \
--engine_dir trt_engines/chatglm3_6b/fp16/1-gpu
The script first downloads the cnn_dailymail dataset and returns its results when it finishes.
Below is the summary the model produced for one input article and its reference (for reference):
...
[07/11/2024-03:57:49] [TRT-LLM] [I] TensorRT-LLM Generated :
[07/11/2024-03:57:49] [TRT-LLM] [I] Input : ['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff \'em and stuff \'em!" upon making an arrest. Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best\'s "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life\'s many passions." Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent \'Return of the Killer Shrews,\' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we\'ve lost in 2015 . CNN\'s Stella Chan contributed to this story.']
[07/11/2024-03:57:49] [TRT-LLM] [I]
Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[07/11/2024-03:57:49] [TRT-LLM] [I]
Output : [['James Best, the actor best known for his portrayal of bumbling sheriff Rosco P. Coltrane on "The Dukes of Hazzard," has died at 88 after a brief illness. He was a busy actor for decades in theater and in Hollywood, but became famous in 1979 when the show began running every week. Best\'s Rosco P. Coltrane became known for his distinctive "kew-kew-kew" chuck']]
[07/11/2024-03:57:49] [TRT-LLM] [I] ---------------------------------------------------------
[07/11/2024-03:58:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 26.520079374313354 sec)
[07/11/2024-03:58:15] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1435)
[07/11/2024-03:58:15] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 54.1099436297277)
[07/11/2024-03:58:15] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[07/11/2024-03:58:15] [TRT-LLM] [I] rouge1 : 23.372917363794926
[07/11/2024-03:58:15] [TRT-LLM] [I] rouge2 : 7.036605777466268
[07/11/2024-03:58:15] [TRT-LLM] [I] rougeL : 17.62796112937931
[07/11/2024-03:58:15] [TRT-LLM] [I] rougeLsum : 20.722932087556057
Note:
As in the examples above, note the format of the --hf_model_dir argument: it must match the model's canonical name on the Hugging Face website.
For more examples, see the NVIDIA TensorRT-LLM ChatGLM Demo.
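summarize.py can also evaluate the original Hugging Face model alongside the TensorRT-LLM engine for a side-by-side ROUGE comparison; a sketch using the --test_hf flag from the TensorRT-LLM example scripts:
# Score both the TensorRT-LLM engine and the original HF model
python3 ../summarize.py --test_trt_llm --test_hf \
--hf_model_dir THUDM/chatglm3-6b \
--engine_dir trt_engines/chatglm3_6b/fp16/1-gpu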

Notes

Note:
Because OpenCloudOS is the open-source edition of TencentOS Server, all operations in this document should in principle also work on OpenCloudOS.
