Deep Learning Deployment Architecture: Triton Inference Server (TensorRT) as an Example

一个会写诗的程序员 (A Programmer Who Writes Poetry)

Published 2022-09-28 09:09:16

Included in the column: 一个会写诗的程序员的博客

What is model deployment?

Model training is only a small part of a deep learning system, as the paper "Hidden Technical Debt in Machine Learning Systems" points out.

Several serving stacks are in common use:

Python: TF + Flask + Gunicorn + Nginx
Framework: TF Serving, TorchServe, ONNX Runtime
Intel: OpenVINO, NVNN, QNNPACK (from Facebook)
NVIDIA: TensorRT Inference Server (Triton), DeepStream

Introduction to Triton Inference Server

NVIDIA Triton Inference Server

NVIDIA Triton™ Inference Server, part of the NVIDIA AI platform, is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production.

What Is NVIDIA Triton?

Triton Inference Server, part of the NVIDIA AI platform, streamlines AI inference by enabling teams to deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure. It provides AI researchers and data scientists the freedom to choose the right framework for their projects without impacting production deployment. It also helps developers deliver high-performance inference across cloud, on-prem, edge, and embedded devices.

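What is the Triton Inference Server?

Triton Inference Server: https://github.com/triton-inference-server/server

Triton Inference Server (NVIDIA Triton Inference Server) is an open-source inference framework released by NVIDIA together with other companies. It gives users a solution for deploying inference in the cloud and at the edge.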

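Triton Inference Server features

So what characterizes an inference server? The points below describe inference server hardware in general.

1. Inference servers offer high compute density and high energy efficiency. They are already widely used for precision marketing, video analytics, deep learning models, text recognition, and medical image analysis. By supplying compute for AI workloads, they speed up AI development.

2. An inference server from 微云网络 supports up to 20 inference accelerator cards, and this acceleration capacity can meet the inference needs of different scenarios.

3. In operation, an inference server draws on its many-core, low-power architecture to provide an energy-efficient platform for inference. A single inference accelerator card draws only 70 W, so it adds compute to the server while also improving performance per watt.

4. Inference servers are promoted as some of the highest-performing servers available today and are expected to play an important role in oil exploration, astronomy, and autonomous driving. Their compute capacity is expected to speed up the adoption of AI across industries.

In summary, inference servers are powerful machines with high compute capacity and a compute density quoted at twice the industry level. They also provide a high-speed interface that reduces the latency of chip-to-chip interconnect across servers by 10% to 70%.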

Triton inference service architecture

Triton Server exposes its service in two ways:

1. HTTP, returning JSON; 2. gRPC, for internal calls.
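To make the HTTP/JSON option concrete, here is a minimal sketch of the request body that Triton's HTTP endpoint accepts at `POST /v2/models/<model_name>/infer` (the KServe v2 inference protocol). The helper name `build_infer_request` is hypothetical; the tensor names `INPUT__0` and `OUTPUT__0` are the ones used later in this article.

```python
import json


def build_infer_request(input_name, shape, values, output_name):
    """Build a v2-protocol inference request body as a plain dict (hypothetical helper)."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(values) != expected:
        raise ValueError(f"shape {shape} needs {expected} values, got {len(values)}")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": "FP32",
                "data": list(values),  # flattened, row-major tensor contents
            }
        ],
        "outputs": [{"name": output_name}],
    }


if __name__ == "__main__":
    # A tiny 1x3x1x1 tensor keeps the printed JSON readable.
    body = build_infer_request("INPUT__0", (1, 3, 1, 1), [0.1, 0.2, 0.3], "OUTPUT__0")
    print(json.dumps(body, indent=2))
```

The `tritonclient.http` package used in the rest of the article builds an equivalent body when `binary_data=False` is passed, so hand-written JSON like this is mainly useful for debugging with tools such as curl.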

A sample client that calls the server:

import numpy as np
import tritonclient.http as httpclient
from PIL import Image


if __name__ == '__main__':
    triton_client = httpclient.InferenceServerClient(url='127.0.0.1:8000')
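    # Force 3 channels so grayscale or RGBA files still produce an HWC array with C = 3
    image = Image.open('./cat.jpg').convert('RGB')

    # Image.LANCZOS is the filter formerly aliased as Image.ANTIALIAS (alias removed in Pillow 10)
    image = image.resize((224, 224), Image.LANCZOS)
    image = np.asarray(image)
    image = image / 255                             # scale pixel values to [0, 1]
    image = np.expand_dims(image, axis=0)           # HWC -> NHWC
    image = np.transpose(image, axes=[0, 3, 1, 2])  # NHWC -> NCHW
    image = image.astype(np.float32)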

    inputs = []
    inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
    inputs[0].set_data_from_numpy(image, binary_data=False)
    outputs = []
    outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False, class_count=3))  # class_count requests the top-N classes
    # outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False))

    results = triton_client.infer('resnet50_pytorch', inputs=inputs, outputs=outputs)
    output_data0 = results.as_numpy('OUTPUT__0')
    print(output_data0.shape)
    print(output_data0)
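The preprocessing in the client above can be factored into a function. The sketch below works on an in-memory array instead of a file, so it runs without an image. The optional mean/std normalization is an assumption that is not in the original script: torchvision's pretrained ResNet50 weights were trained on inputs normalized with these ImageNet statistics, so applying them would likely improve accuracy. The function name `to_nchw_batch` is hypothetical.

```python
import numpy as np

# ImageNet channel statistics used by torchvision's pretrained models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_nchw_batch(rgb, normalize=False):
    """Convert one HWC uint8 RGB image into a float32 batch of shape [1, 3, H, W]."""
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError(f"expected an HxWx3 array, got shape {rgb.shape}")
    image = rgb.astype(np.float32) / 255.0  # scale to [0, 1]
    if normalize:
        # Assumption, not in the original client: per-channel ImageNet normalization.
        image = (image - IMAGENET_MEAN) / IMAGENET_STD
    image = np.expand_dims(image, axis=0)      # HWC -> NHWC
    image = np.transpose(image, (0, 3, 1, 2))  # NHWC -> NCHW
    return np.ascontiguousarray(image)


if __name__ == "__main__":
    fake_image = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a decoded JPEG
    batch = to_nchw_batch(fake_image)
    print(batch.shape, batch.dtype)
```

The returned array can be passed directly to `set_data_from_numpy`, since it already has the float32 dtype that the `FP32` input declares.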

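Model deployment in practice

Server-side deployment

First we need a model. We use ResNet50 as the example.

Exporting the model from PyTorch

We first export a ResNet50 model with PyTorch. ResNet50 is an image classification model with 1000 classes. The model is saved as TorchScript.

The code for saving the model is shown below and is very short. The model accepts variable image sizes, so inputs do not have to be the 244 x 244 used for tracing.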

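import torch
import torchvision.models as models

# Load ImageNet-pretrained weights (newer torchvision releases prefer the `weights=` argument)
resnet50 = models.resnet50(pretrained=True)
resnet50.eval()  # inference mode: freezes BatchNorm statistics before tracing
image = torch.randn(1, 3, 244, 244)  # dummy input used only to trace the graph
resnet50_traced = torch.jit.trace(resnet50, image)
resnet50(image)  # smoke test of the eager model; the result is not used
resnet50_traced.save('model.pt')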

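Deploying the model with Triton

Now that we have the model file, we deploy it on Triton.

Step 1: pull the Triton image

Using release r21.10 as the example, run the command below to pull the image.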

docker pull nvcr.io/nvidia/tritonserver:21.10-py3

Step 2: configure the model

Organize the directory as follows.

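quick/
└── resnet50_pytorch            # model name; must match the name in config.pbtxt
    ├── 1                       # model version number
    │   └── model.pt            # the model saved above
    ├── config.pbtxt            # model configuration file
    ├── labels.txt              # optional; class labels, mind the format (one label per line)
    ├── resnet_client.py        # client script; does not have to live here
    └── resnet_pytorch.py       # script that generates model.pt; does not have to live here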
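The layout above can be created with a few lines of Python. This is a minimal sketch under stated assumptions: the helper name `create_model_repository` is hypothetical, and the file contents in the usage example are placeholders rather than a real model or a complete config.

```python
import shutil
import tempfile
from pathlib import Path


def create_model_repository(root, model_name, model_file, config_text, version=1):
    """Create <root>/<model_name>/<version>/model.pt plus config.pbtxt (hypothetical helper)."""
    model_dir = Path(root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # The PyTorch backend looks for a file named model.pt in the version directory by default.
    shutil.copyfile(model_file, version_dir / "model.pt")
    (model_dir / "config.pbtxt").write_text(config_text, encoding="utf-8")
    return model_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fake_model = Path(tmp) / "exported.pt"
        fake_model.write_bytes(b"placeholder")  # stands in for the real TorchScript file
        repo = create_model_repository(
            Path(tmp) / "quick", "resnet50_pytorch", fake_model, 'name: "resnet50_pytorch"\n'
        )
        for path in sorted(repo.rglob("*")):
            print(path.relative_to(repo.parent))
```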

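The model configuration, config.pbtxt, is given below. The model input is an image batch of shape [ N, 3, -1, -1 ], and the output is a classification vector of shape [ N, 1000 ]. The config also names the labels file, which is used to return classification results.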

name: "resnet50_pytorch"
platform: "pytorch_libtorch"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "labels.txt"
  }
]
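Because `max_batch_size` is greater than zero, Triton treats the first dimension of every request as the batch dimension and matches the remaining dimensions against `dims`, where `-1` accepts any size. The sketch below mirrors that rule in plain Python to show which request shapes this config accepts; `shape_matches_config` is a hypothetical helper, not part of Triton.

```python
def shape_matches_config(shape, dims, max_batch_size):
    """Return True if a request tensor shape is compatible with a config entry.

    Assumes max_batch_size > 0, so shape[0] is the batch dimension and the
    remaining dimensions are compared against dims (-1 is a wildcard).
    """
    if max_batch_size <= 0:
        raise ValueError("this sketch only covers configs with batching enabled")
    if len(shape) != len(dims) + 1:
        return False
    batch, rest = shape[0], shape[1:]
    if not 1 <= batch <= max_batch_size:
        return False
    return all(d == -1 or d == s for d, s in zip(dims, rest))


if __name__ == "__main__":
    dims = [3, -1, -1]  # taken from the config.pbtxt above
    for shape in [(1, 3, 224, 224), (8, 3, 244, 244), (1, 1, 224, 224), (256, 3, 224, 224)]:
        print(shape, shape_matches_config(shape, dims, max_batch_size=128))
```

This is why the client sends a four-dimensional array even though the config lists only three dimensions.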

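Step 3: start the server

There are two ways to start the server: let docker run start it directly, or enter the container and run the command by hand.

Option 1, start the container and run the command in one step: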

docker run --gpus=all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/percent1/triton/triton/quick:/models nvcr.io/nvidia/tritonserver:21.10-py3 tritonserver --model-repository=/models

Option 2, enter the container and then run the command:

docker run --gpus=all --network=host --shm-size=2g -v/home/percent1/triton/triton/quick:/models -it nvcr.io/nvidia/tritonserver:21.10-py3  # enter the container
./bin/tritonserver --model-store=/models  # start Triton

Once startup completes, the server log shows that the services are running.

Use the command below to check whether the server is ready; an HTTP 200 response means it is.

root@oneflow-15:/workspace# curl -v localhost:8000/v2/health/ready
*   Trying 127.0.0.1:8000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host localhost left intact
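The curl check above can also be scripted, for example to hold a deployment script until Triton is up. The following is a minimal sketch using only the standard library; the helper names `http_ready` and `wait_until_ready` and the retry settings are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8000/v2/health/ready", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not reachable yet, or it reported not ready


def wait_until_ready(check=http_ready, attempts=30, interval=1.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out (hypothetical helper)."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready(attempts=3) else "not ready")
```

Passing `check` and `sleep` as parameters keeps the retry loop testable without a running server.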

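Client requests

Request example

The client script is shown below. It reads an image, converts it to the [ N, 3, -1, -1 ] layout, and sends a classification request. Because the model configuration names a labels file, the request can choose whether to return classification results: passing class_count to InferRequestedOutput sets how many top-N classes to return.

The script consists of these steps: create the client, read the data, set up the inputs and outputs of one request, send the request, fetch the result, and print it.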

import numpy as np
import tritonclient.http as httpclient
from PIL import Image

if __name__ == '__main__':
    triton_client = httpclient.InferenceServerClient(url='127.0.0.1:8000')

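    # Force 3 channels so grayscale or RGBA files still produce an HWC array with C = 3
    image = Image.open('./cat.jpg').convert('RGB')

    # Image.LANCZOS is the filter formerly aliased as Image.ANTIALIAS (alias removed in Pillow 10)
    image = image.resize((224, 224), Image.LANCZOS)
    image = np.asarray(image)
    image = image / 255                             # scale pixel values to [0, 1]
    image = np.expand_dims(image, axis=0)           # HWC -> NHWC
    image = np.transpose(image, axes=[0, 3, 1, 2])  # NHWC -> NCHW
    image = image.astype(np.float32)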

    inputs = []
    inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
    inputs[0].set_data_from_numpy(image, binary_data=False)
    outputs = []
    # outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False, class_count=3))  # class_count requests the top-N classes
    outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False))  # fetch the raw 1000-dimensional vector

    results = triton_client.infer('resnet50_pytorch', inputs=inputs, outputs=outputs)
    output_data0 = results.as_numpy('OUTPUT__0')
    print(output_data0.shape)
    print(output_data0)
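When `class_count` is set, Triton's classification extension returns each result as a string of the form `<score>:<class index>:<label>`, where the label part is present only when a labels file is configured. A minimal sketch of parsing such entries follows; the helper name `parse_classification` is hypothetical and the sample values are made up for illustration, not real model output.

```python
from typing import NamedTuple, Optional, Union


class Classification(NamedTuple):
    score: float
    index: int
    label: Optional[str]


def parse_classification(entry: Union[bytes, str]) -> Classification:
    """Parse one '<score>:<index>[:<label>]' entry (hypothetical helper)."""
    text = entry.decode("utf-8") if isinstance(entry, bytes) else entry
    # Split at most twice so a label that itself contains ':' stays intact.
    parts = text.split(":", 2)
    if len(parts) < 2:
        raise ValueError(f"unexpected classification entry: {text!r}")
    label = parts[2] if len(parts) == 3 else None
    return Classification(float(parts[0]), int(parts[1]), label)


if __name__ == "__main__":
    # Made-up entries, shaped like the byte strings held by results.as_numpy('OUTPUT__0').
    sample = [b"12.47:281:tabby", b"11.03:282:tiger cat", b"9.85:285"]
    for item in sample:
        print(parse_classification(item))
```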

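Client API examples

Triton's documentation for the Python client API was sparse at the time of writing, so we had to read the source code.

The script below shows how to use other client interfaces.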

import tritonclient.http as httpclient

if __name__ == '__main__':
    triton_client = httpclient.InferenceServerClient(url='127.0.0.1:8000')

    model_repository_index = triton_client.get_model_repository_index()
    server_meta = triton_client.get_server_metadata()
    model_meta = triton_client.get_model_metadata('resnet50_pytorch')
    model_config = triton_client.get_model_config('resnet50_pytorch')
    statistics = triton_client.get_inference_statistics()
    shm_status = triton_client.get_cuda_shared_memory_status()
    sshm_status = triton_client.get_system_shared_memory_status()

    server_live = triton_client.is_server_live()
    server_ready = triton_client.is_server_ready()
    model_ready = triton_client.is_model_ready('resnet50_pytorch')

    # Launch command: ./bin/tritonserver --model-store=/models --model-control-mode explicit --load-model resnet50_pytorch
    # With explicit model control, Triton lets the client load and unload models
    triton_client.unload_model('resnet50_pytorch')
    triton_client.load_model('resnet50_pytorch')

    triton_client.close()

    with httpclient.InferenceServerClient(url='127.0.0.1:8000'):
        pass
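Each of the client methods above is a thin wrapper over an HTTP endpoint of the v2 protocol and its Triton extensions, which helps when reading the client source or debugging with curl. The mapping below is a sketch for orientation rather than an exhaustive reference; the helper `endpoint_for` is hypothetical.

```python
# (HTTP method, path template) for some of the tritonclient.http calls used above.
ENDPOINTS = {
    "is_server_live": ("GET", "v2/health/live"),
    "is_server_ready": ("GET", "v2/health/ready"),
    "is_model_ready": ("GET", "v2/models/{model}/ready"),
    "get_server_metadata": ("GET", "v2"),
    "get_model_metadata": ("GET", "v2/models/{model}"),
    "get_model_config": ("GET", "v2/models/{model}/config"),
    "get_model_repository_index": ("POST", "v2/repository/index"),
    "load_model": ("POST", "v2/repository/models/{model}/load"),
    "unload_model": ("POST", "v2/repository/models/{model}/unload"),
}


def endpoint_for(call, base_url="http://127.0.0.1:8000", model=""):
    """Return (HTTP method, full URL) for a client call name (hypothetical helper)."""
    method, template = ENDPOINTS[call]
    if "{model}" in template and not model:
        raise ValueError(f"{call} needs a model name")
    return method, f"{base_url.rstrip('/')}/{template.format(model=model)}"


if __name__ == "__main__":
    print(endpoint_for("is_server_ready"))
    print(endpoint_for("load_model", model="resnet50_pytorch"))
```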