This guide applies to running the official BERT demo with the TensorRT inference framework on TencentOS Server 3, launched via Docker.
Preparing the TensorRT environment
1. Download the TensorRT open-source repository from GitHub.
#Download the TensorRT open-source repository (version 8.6.1)
git clone git@github.com:NVIDIA/TensorRT.git -b release/8.6 --single-branch
cd TensorRT
2. Start the TensorRT NGC container.
docker run -it --gpus all --name=tensorrt -e HF_ENDPOINT="https://hf-mirror.com" -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.06-py3 /bin/bash
This pulls the Docker image from nvcr. Make sure your network connection is good and wait until all layers of the image have been downloaded. Once the pull succeeds, you will be placed directly inside the container.
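Optionally, you can run a quick sanity check inside the container to confirm that the GPUs are visible (this step is not part of the original demo; it assumes the NVIDIA driver and container toolkit are correctly installed on the host):
#Optional check: confirm the GPUs are visible inside the container
nvidia-smi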
Checking the versions of required packages
1. Make sure the pip and TensorRT packages are reasonably new. Update pip with the following command.
#Upgrade pip
python3 -m pip install --upgrade pip
2. Check the TensorRT version; the minimum required version is 8.6.0.
#Check the TensorRT version
python3 -c 'import tensorrt;print(tensorrt.__version__)'
The TensorRT version used in this guide is 8.6.1. If your version is lower than 8.6.0, uninstall TensorRT and reinstall it.
#Uninstall the TensorRT package
python3 -m pip uninstall tensorrt
#Install a new TensorRT package with a pinned version; here we pin 8.6.1
python3 -m pip install tensorrt==8.6.1
python3 -c 'import tensorrt;print(tensorrt.__version__)'
Installing the packages required to run the framework and the model
Switch pip to the Tsinghua mirror (a mirror in mainland China) to speed up downloads.
#Switch pip to the Tsinghua mirror
#Set it as the default so the change persists
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
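Optionally, you can confirm that the new index URL is in effect (not part of the original steps):
#Optional check: show the current pip configuration
pip config list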
Then install the packages required to run the model:
#Install the packages required by the framework and the model
pip install pycuda onnx tensorflow torch
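As an optional sanity check (not part of the original steps), you can verify that the freshly installed packages import correctly:
#Optional check: confirm the packages can be imported
python3 -c 'import pycuda.driver, onnx, tensorflow, torch; print("all packages imported")'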
Installing the NGC CLI
The demo downloads its model from NVIDIA NGC, so we first need to install the NGC CLI.
Run the following commands in order to install it:
cd /usr/local/bin
wget https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip
unzip ngccli_cat_linux.zip
chmod u+x ngc-cli/ngc
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
ngc config set
After all the commands complete, a prompt like the following appears:
Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']:
Press Enter twice; if output like the following appears, the NGC CLI has been installed successfully:
...
Successfully validated configuration.
Saving configuration...
Successfully saved NGC configuration to /root/.ngc/config
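As an optional check (not part of the original steps), you can confirm the NGC CLI is on your PATH and runs correctly:
#Optional check: print the NGC CLI version
ngc --version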
Running the model
1. First, download the SQuAD v1.1 dataset:
export TRT_OSSPATH=/workspace
cd $TRT_OSSPATH/demo/BERT
bash ./scripts/download_squad.sh
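You can optionally list the downloaded dataset to confirm it is in place. The path below is an assumption based on the script's default output directory; check download_squad.sh if your layout differs:
#Optional check: list the downloaded SQuAD files (path assumed)
ls -lh squad/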
2. Then download the BERT-Large model with a sequence length of 128:
bash scripts/download_model.sh
Note:
Make sure your network connection is good. The model is about 3.7 GB; if the download is slow, just wait patiently.
After the model download completes, output like the following appears (for reference):
Download status: COMPLETED
Downloaded local path model: /workspace/demo/BERT/models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1
Total files downloaded: 6
Total transferred: 3.75 GB
Started at: 2024-07-22 12:30:15
Completed at: 2024-07-22 12:53:34
Duration taken: 23m 18s
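Optionally, you can list the downloaded checkpoint directory (the path is taken from the download output above) to confirm the files are present:
#Optional check: list the downloaded checkpoint files
ls -lh models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/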
3. Build the TensorRT engine:
mkdir -p engines
python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/model.ckpt -o engines/bert_large_128.engine -b 1 -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1
This engine uses a maximum batch size of 1, a sequence length of 128, and FP16 mixed precision, and is built from the BERT Large SQuAD v2 FP16 Sequence Length 128 checkpoint.
4. Once the engine has been built, the bert_large_128.engine file appears in the engines directory, and output like the following is printed to the terminal (for reference):
...
[07/23/2024-07:21:16] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[07/23/2024-07:21:16] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[07/23/2024-07:21:16] [TRT] [W] Check verbose logs for the list of affected weights.
[07/23/2024-07:21:16] [TRT] [W] - 165 weights are affected by this issue: Detected subnormal FP16 values.
[07/23/2024-07:21:16] [TRT] [W] - 92 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[07/23/2024-07:21:16] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +576, GPU +577, now: CPU 576, GPU 577 (MiB)
[07/23/2024-07:21:16] [TRT] [I] build engine in 34.906 Sec
[07/23/2024-07:21:17] [TRT] [I] Saving Engine to engines/bert_large_128.engine
[07/23/2024-07:21:17] [TRT] [I] Done.
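If you want to confirm that the serialized engine can be loaded outside the demo scripts, the following minimal Python sketch deserializes it and lists its I/O tensors. This is not part of the original demo; it assumes the plugin layers used by the BERT engine are registered by init_libnvinfer_plugins (if deserialization still reports missing plugins, simply rely on inference.py in the next step, which handles plugin loading itself):
#Minimal sketch: deserialize the engine built above and list its I/O tensors
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
#Register TensorRT's bundled plugins; the BERT engine contains plugin layers
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with open("engines/bert_large_128.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    #Print the name and shape of every input/output tensor
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        print(name, engine.get_tensor_shape(name))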
5. Run the model:
python3 inference.py -e engines/bert_large_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/vocab.txt
Here we provide a passage:
"TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps."
and ask the question:
What is TensorRT?
We expect the model to answer the question well based on the given passage. After running the script, the result is as follows (for reference):
...
Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.
Question: What is TensorRT?
------------------------
Running inference in 16.378 Sentences/Sec
------------------------
Answer: 'a high performance deep learning inference platform'
With probability: 46.077
The model returns an answer extracted from the passage, which shows that it ran successfully.
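You can reuse the same command to ask other questions about the passage; for example, with a different (hypothetical) question while keeping the same engine, passage, and vocabulary file:
python3 inference.py -e engines/bert_large_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is NVIDIA open-sourcing?" -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/vocab.txt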
Notes
Note:
Since OpenCloudOS is the open-source edition of TencentOS Server, in principle all of the operations in this document also apply to OpenCloudOS.
References