This guide applies to running the official BERT demo with the TensorRT inference framework on TencentOS Server 3, launched via Docker.
Preparing the TensorRT environment
1. Download the TensorRT open-source repository from GitHub.
#Download the TensorRT open-source repository (version 8.6.1)
git clone git@github.com:NVIDIA/TensorRT.git -b release/8.6 --single-branch
cd TensorRT
2. Start the TensorRT NGC container.
docker run -it --gpus all --name=tensorrt -e HF_ENDPOINT="https://hf-mirror.com" -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.06-py3 /bin/bash
This pulls the Docker image from nvcr. Make sure your network connection is good and wait until all layers of the image have been downloaded. Once the pull succeeds, you will be placed directly inside the container.
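Optionally, you can run a quick sanity check inside the container to confirm that the GPUs are visible (this step is not part of the original demo; it assumes the NVIDIA driver and container toolkit are correctly installed on the host):
#Optional check: confirm the GPUs are visible inside the container
nvidia-smi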
Checking the versions of required packages
1. Make sure the pip and TensorRT packages are reasonably new. Update pip with the following command.
#Upgrade pip
python3 -m pip install --upgrade pip
2. Check the TensorRT version; the minimum required version is 8.6.0.
#Check the TensorRT version
python3 -c 'import tensorrt;print(tensorrt.__version__)'
The TensorRT version used in this guide is 8.6.1. If your version is lower than 8.6.0, uninstall TensorRT and reinstall it.
#Uninstall the TensorRT package
python3 -m pip uninstall tensorrt
#Install a new TensorRT package with a pinned version; here we pin 8.6.1
python3 -m pip install tensorrt==8.6.1
python3 -c 'import tensorrt;print(tensorrt.__version__)'
Installing the packages required to run the framework and the model
Switch pip to the Tsinghua mirror (a mirror in mainland China) to speed up downloads.
#Switch pip to the Tsinghua mirror
#Set it as the default so the change persists
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
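Optionally, you can confirm that the new index URL is in effect (not part of the original steps):
#Optional check: show the current pip configuration
pip config list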
Then install the packages required to run the model:
#Install the packages required by the framework and the model
pip install pycuda onnx tensorflow torch
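As an optional sanity check (not part of the original steps), you can verify that the freshly installed packages import correctly:
#Optional check: confirm the packages can be imported
python3 -c 'import pycuda.driver, onnx, tensorflow, torch; print("all packages imported")'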
Installing the NGC CLI
The demo downloads its model from NVIDIA NGC, so we first need to install the NGC CLI.
Run the following commands in order to install it:
cd /usr/local/bin
wget https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip
unzip ngccli_cat_linux.zip
chmod u+x ngc-cli/ngc
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
ngc config set
After all the commands complete, a prompt like the following appears:
Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']:
Press Enter twice; if output like the following appears, the NGC CLI has been installed successfully:
...
Successfully validated configuration.
Saving configuration...
Successfully saved NGC configuration to /root/.ngc/config
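As an optional check (not part of the original steps), you can confirm the NGC CLI is on your PATH and runs correctly:
#Optional check: print the NGC CLI version
ngc --version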
Running the model
1. First, download the SQuAD v1.1 dataset:
export TRT_OSSPATH=/workspace
cd $TRT_OSSPATH/demo/BERT
bash ./scripts/download_squad.sh
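You can optionally list the downloaded dataset to confirm it is in place. The path below is an assumption based on the script's default output directory; check download_squad.sh if your layout differs:
#Optional check: list the downloaded SQuAD files (path assumed)
ls -lh squad/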
2. Then download the BERT-Large model with a sequence length of 128:
bash scripts/download_model.sh
Note:
Make sure your network connection is good. The model is about 3.7 GB; if the download is slow, just wait patiently.
After the model download completes, output like the following appears (for reference):
Download status: COMPLETED
Downloaded local path model: /workspace/demo/BERT/models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1
Total files downloaded: 6
Total transferred: 3.75 GB
Started at: 2024-07-22 12:30:15
Completed at: 2024-07-22 12:53:34
Duration taken: 23m 18s
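Optionally, you can list the downloaded checkpoint directory (the path is taken from the download output above) to confirm the files are present:
#Optional check: list the downloaded checkpoint files
ls -lh models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/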
3. Build the TensorRT engine:
mkdir -p engines
python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/model.ckpt -o engines/bert_large_128.engine -b 1 -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1
This engine uses a maximum batch size of 1, a sequence length of 128, and FP16 mixed precision, and is built from the BERT Large SQuAD v2 FP16 Sequence Length 128 checkpoint.
4. Once the engine has been built, the bert_large_128.engine file appears in the engines directory, and output like the following is printed to the terminal (for reference):
...
[07/23/2024-07:21:16] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[07/23/2024-07:21:16] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[07/23/2024-07:21:16] [TRT] [W] Check verbose logs for the list of affected weights.
[07/23/2024-07:21:16] [TRT] [W] - 165 weights are affected by this issue: Detected subnormal FP16 values.
[07/23/2024-07:21:16] [TRT] [W] - 92 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[07/23/2024-07:21:16] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +576, GPU +577, now: CPU 576, GPU 577 (MiB)
[07/23/2024-07:21:16] [TRT] [I] build engine in 34.906 Sec
[07/23/2024-07:21:17] [TRT] [I] Saving Engine to engines/bert_large_128.engine
[07/23/2024-07:21:17] [TRT] [I] Done.
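If you want to confirm that the serialized engine can be loaded outside the demo scripts, the following minimal Python sketch deserializes it and lists its I/O tensors. This is not part of the original demo; it assumes the plugin layers used by the BERT engine are registered by init_libnvinfer_plugins (if deserialization still reports missing plugins, simply rely on inference.py in the next step, which handles plugin loading itself):
#Minimal sketch: deserialize the engine built above and list its I/O tensors
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
#Register TensorRT's bundled plugins; the BERT engine contains plugin layers
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with open("engines/bert_large_128.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    #Print the name and shape of every input/output tensor
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        print(name, engine.get_tensor_shape(name))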
5. Run the model:
python3 inference.py -e engines/bert_large_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/vocab.txt
Here we provide a passage:
"TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps."
and ask the question:
What is TensorRT?
We expect the model to answer the question well based on the given passage. After running the script, the result is as follows (for reference):
...
Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.
Question: What is TensorRT?
------------------------
Running inference in 16.378 Sentences/Sec
------------------------
Answer: 'a high performance deep learning inference platform'
With probability: 46.077
The model returns an answer extracted from the passage, which shows that it ran successfully.
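You can reuse the same command to ask other questions about the passage; for example, with a different (hypothetical) question while keeping the same engine, passage, and vocabulary file:
python3 inference.py -e engines/bert_large_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is NVIDIA open-sourcing?" -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/vocab.txt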
Notes
Note:
Since OpenCloudOS is the open-source edition of TencentOS Server, in principle all of the operations in this document also apply to OpenCloudOS.
References