
TensorRT Development

Author: 算法之名 | Published 2023-10-16

TensorRT Basics

  1. The core of TensorRT is optimizing a model's operators (fusing operators, using kernels tailored to the GPU's characteristics, and other strategies); with TensorRT you can get the best performance out of NVIDIA GPUs.
  2. To do this, TensorRT selects the best algorithms and configurations by actually running candidate kernels on the target GPU.
  3. As a consequence, a model generated by TensorRT can only run under the specific conditions it was built for: the TensorRT version it was compiled with, the CUDA version, and the GPU model used at build time (see the sketch below).
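Because of point 3, it is worth verifying at startup that the TensorRT runtime loading an engine matches the version the engine was built with. This check is my addition rather than something the original code does; a minimal sketch using the standard TensorRT headers:

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <stdio.h>

// NV_TENSORRT_VERSION is the compile-time header version;
// getInferLibVersion() returns the version of the runtime library that
// was actually loaded. If they disagree, an engine serialized by this
// build may fail to deserialize.
bool trt_version_matches() {
    int32_t runtime_version = getInferLibVersion();
    if (runtime_version != NV_TENSORRT_VERSION) {
        printf("TensorRT mismatch: headers %d vs runtime %d\n",
               NV_TENSORRT_VERSION, runtime_version);
        return false;
    }
    return true;
}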

The main topics covered are: how to define the model structure, how to configure the build process, how to implement inference, how to implement plugins, and how to understand ONNX.

The figure in the original post (not reproduced here) illustrates TensorRT's optimization process. The left side shows an unoptimized network graph; TensorRT notices that the three layers inside the large ellipse share the same structure, so it fuses them into the single CBR block shown in the optimized graph on the right. The three convolution layers in the middle of the left graph (3*3, 5*5, 1*1) can likewise be optimized into 3 CBR blocks. This kind of fusion is a direct optimization and speedup: it reduces the amount of data movement between layers and yields higher performance.

Let's now build a simple neural network: image -> fully connected layer -> sigmoid activation -> output prob.

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <cuda_runtime.h>
#include <stdio.h>

using namespace nvinfer1;
// Logger class through which TensorRT prints and reports its messages.
class TRTLogger: public ILogger {
public:
    virtual void log(Severity severity, const char *msg) noexcept override {
        // kVERBOSE is the least severe level, so this prints everything.
        if (severity <= Severity::kVERBOSE) {
            printf("%d: %s\n", static_cast<int>(severity), msg);
        }
    }
};

// Wrap a raw float array as TensorRT's Weights struct.
Weights make_weights(float* ptr, int n) {
    Weights w;
    w.count = n;
    w.type = nvinfer1::DataType::kFLOAT;
    w.values = ptr;
    return w;
}

int main() {
    TRTLogger logger;
    // The builder creates the network; it is bound to the logger.
    IBuilder *builder = createInferBuilder(logger);
    // Build configuration: tells TensorRT how to optimize the model.
    // The generated model only runs under this specific configuration.
    IBuilderConfig *config = builder->createBuilderConfig();
    // The network definition, created by the builder (flag 1 = explicit batch).
    INetworkDefinition *network = builder->createNetworkV2(1);
    // The input image has 3 channels.
    const int num_input = 3;
    // The output prob is a two-class prediction.
    const int num_output = 2;
    // Network weights: the first 3 values are w1's RGB weights, the last 3 are w2's.
    float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5};
    // Bias values.
    float layer1_bias_values[] = {0.3, 0.8};
    // Add the input node: a 3-channel image of shape 1x3x1x1.
    ITensor *input = network->addInput("image", nvinfer1::DataType::kFLOAT, Dims4(1, num_input, 1, 1));
    // Create TensorRT's own weight and bias structures.
    Weights layer1_weight = make_weights(layer1_weight_values, 6);
    Weights layer1_bias = make_weights(layer1_bias_values, 2);
    // Add a fully connected layer: 3-channel input, 2-channel prob output.
    auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
    // Add an activation layer fed by the fully connected output; type sigmoid.
    auto prob = network->addActivation(*layer1->getOutput(0), ActivationType::kSIGMOID);
    // Mark prob as the network's output.
    network->markOutput(*prob->getOutput(0));

    printf("workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
    config->setMaxWorkspaceSize(1 << 28); // 256 MB
    builder->setMaxBatchSize(1);  // inference batch size is 1
    // Build the inference engine from the network and config.
    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
    if (engine == nullptr) {
        printf("Build engine failed.\n");
        return -1;
    }
    // Serialize the model and write it to a file.
    IHostMemory *model_data = engine->serialize();
    FILE *f = fopen("engine.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);
    // Destroy objects in the reverse order of construction.
    model_data->destroy();
    engine->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    printf("Done.\n");

    return 0;
}
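
The post stops at serializing the engine; the matching inference step would deserialize engine.trtmodel and execute it. The following is only a minimal sketch of that step, assuming the same TensorRT 7 API and the TRTLogger class defined above (the input values are made up for illustration):

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <vector>

using namespace nvinfer1;

// TRTLogger is the same logger class defined in the build code above.

int main() {
    TRTLogger logger;

    // Read the serialized engine back from disk.
    FILE *f = fopen("engine.trtmodel", "rb");
    if (f == nullptr) { printf("Open engine.trtmodel failed.\n"); return -1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector<char> model_data(size);
    fread(model_data.data(), 1, size, f);
    fclose(f);

    // Deserialize the engine with a runtime and create an execution context.
    IRuntime *runtime = createInferRuntime(logger);
    ICudaEngine *engine = runtime->deserializeCudaEngine(model_data.data(), size, nullptr);
    if (engine == nullptr) { printf("Deserialize engine failed.\n"); return -1; }
    IExecutionContext *context = engine->createExecutionContext();

    // One 1x3x1x1 input, matching the "image" tensor defined at build time.
    float input_host[3] = {1.0f, 2.0f, 3.0f}; // made-up pixel values
    float output_host[2] = {0};
    float *input_device = nullptr, *output_device = nullptr;
    cudaMalloc(&input_device, sizeof(input_host));
    cudaMalloc(&output_device, sizeof(output_host));
    cudaMemcpy(input_device, input_host, sizeof(input_host), cudaMemcpyHostToDevice);

    // Look up the input's binding slot by the name given at build time;
    // the only other binding of this two-tensor engine is the output.
    void *bindings[2] = {nullptr, nullptr};
    int input_index = engine->getBindingIndex("image");
    bindings[input_index] = input_device;
    bindings[1 - input_index] = output_device;

    // Execute the explicit-batch network and copy the result back.
    context->executeV2(bindings);
    cudaMemcpy(output_host, output_device, sizeof(output_host), cudaMemcpyDeviceToHost);
    printf("prob = %f, %f\n", output_host[0], output_host[1]);

    cudaFree(input_device);
    cudaFree(output_device);
    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}

You can sanity-check the first output by hand: it should be sigmoid(1.0*1 + 2.0*2 + 0.5*3 + 0.3) = sigmoid(6.8) ≈ 0.9989. Note that if you place this next to the build code as a second .cpp, rename one of the main functions or build them separately, since the Makefile below links every .cpp it finds.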

Makefile (I am developing on an NVIDIA Jetson Nano with JetPack 4.5; the TensorRT version is 7.1.1):

EXE=main

INCLUDE=/usr/include/aarch64-linux-gnu/
LIBPATH=/usr/lib/aarch64-linux-gnu/
CFLAGS= -I$(INCLUDE) -I/usr/local/cuda-10.2/include
LIBS= -L$(LIBPATH) -lnvinfer -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -lstdc++fs

CXX_OBJECTS := $(patsubst %.cpp,%.o,$(shell find . -name "*.cpp"))
DEP_FILES =$(patsubst %.o,%.d,$(CXX_OBJECTS))

$(EXE): $(CXX_OBJECTS)
	$(CXX) $(CXX_OBJECTS) -o $(EXE) $(LIBS)

%.o: %.cpp
	$(CXX) -c -o $@ $(CFLAGS) $<

clean:
	rm -rf $(CXX_OBJECTS) $(DEP_FILES) $(EXE)

test:
	echo $(CXX_OBJECTS)
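
Assuming the include and library paths above match your JetPack 4.5 installation, running make produces the main executable; ./main then builds the engine, serializes it to engine.trtmodel, and prints the verbose build log shown below.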

The run output:

workspace Size = 256.00 MB
4: Applying generic optimizations to the graph for inference.
4: Original: 2 layers
4: After dead-layer removal: 2 layers
4: After Myelin optimization: 2 layers
4: After scale fusion: 2 layers
4: After vertical fusions: 2 layers
4: After final dead-layer removal: 2 layers
4: After tensor merging: 2 layers
4: After concat removal: 2 layers
4: Graph construction and optimization completed in 0.0724424 seconds.
4: Constructing optimization profile number 0 [1/1].
4: *************** Autotuning format combination: Float(1,1,1,3) -> Float(1,1,1,2) ***************
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: --------------- Timing Runner: (Unnamed Layer* 0) [Fully Connected] (CaskFullyConnected)
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: Tactic: 8883888914904656451 time 0.0325
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: Tactic: 5453137127347942357 time 0.028385
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: Tactic: 5373503982740029499 time 0.028333
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: Tactic: 4133936625481774016 time 0.016875
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: Tactic: 1933552664043962183 time 0.016927
4: Fastest Tactic: 4133936625481774016 Time: 0.016875
4: --------------- Timing Runner: (Unnamed Layer* 0) [Fully Connected] (CudaFullyConnected)
4: Tactic: 0 time 0.01974
4: Tactic: 1 time 0.023021
4: Tactic: 9 time 0.026927
4: Tactic: 26 time 0.019167
4: Tactic: 27 time 0.018907
4: Tactic: 48 time 0.019167
4: Tactic: 49 time 0.019844
4: Fastest Tactic: 27 Time: 0.018907
4: >>>>>>>>>>>>>>> Chose Runner Type: CaskFullyConnected Tactic: 4133936625481774016
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: 
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: *************** Autotuning format combination: Float(1,1,1,2) -> Float(1,1,1,2) ***************
4: --------------- Timing Runner: (Unnamed Layer* 1) [Activation] (Activation)
4: Tactic: 0 is the only option, timing skipped
4: Fastest Tactic: 0 Time: 0
4: Formats and tactics selection completed in 0.281916 seconds.
4: After reformat layers: 2 layers
4: Block size 268435456
4: Block size 512
4: Total Activation Memory: 268435968
3: Detected 1 inputs and 1 output network tensors.
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: Layer: (Unnamed Layer* 0) [Fully Connected] Weights: 24 HostPersistent: 384 DevicePersistent: 1536
4: Layer: (Unnamed Layer* 1) [Activation] Weights: 0 HostPersistent: 0 DevicePersistent: 0
4: Total Host Persistent Memory: 384
4: Total Device Persistent Memory: 1536
4: Total Weight Memory: 24
4: Builder timing cache: created 1 entries, 0 hit(s)
4: Engine generation completed in 11.306 seconds.
4: Engine Layer Information:
4: Layer(caskFullyConnectedFP32): (Unnamed Layer* 0) [Fully Connected], Tactic: 4133936625481774016, image[Float(3,1,1)] -> (Unnamed Layer* 0) [Fully Connected]_output[Float(2,1,1)]
4: Layer(Activation): (Unnamed Layer* 1) [Activation], Tactic: 0, (Unnamed Layer* 0) [Fully Connected]_output[Float(2,1,1)] -> (Unnamed Layer* 1) [Activation]_output[Float(2,1,1)]
Done.
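
The leading numbers are the severity values printed by our TRTLogger (kINTERNAL_ERROR=0, kERROR=1, kWARNING=2, kINFO=3, kVERBOSE=4), so the "4:" lines are verbose messages. Note how the builder times several candidate kernels ("tactics") for the fully connected layer and picks the fastest one (Tactic 4133936625481774016): this is exactly the on-device autotuning described in point 2 at the top, and it is why the resulting engine is tied to the GPU it was built on.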