
TensorRT Development

Author: 算法之名 | Published 2023-10-16

TensorRT Basics

  1. The core of TensorRT is optimizing a model's operators (fusing operators, using kernels tailored to the GPU's characteristics, and other strategies); with TensorRT you can get the best performance out of NVIDIA GPUs.
  2. To do this, TensorRT selects the best algorithms and configurations by actually running candidate kernels on the target GPU.
  3. As a consequence, a model generated by TensorRT can only run under the specific conditions it was built for: the TensorRT version it was compiled with, the CUDA version, and the GPU model used at build time (see the sketch below).
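Because of point 3, it is worth verifying at startup that the TensorRT runtime loading an engine matches the version the engine was built with. This check is my addition rather than something the original code does; a minimal sketch using the standard TensorRT headers:

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <stdio.h>

// NV_TENSORRT_VERSION is the compile-time header version;
// getInferLibVersion() returns the version of the runtime library that
// was actually loaded. If they disagree, an engine serialized by this
// build may fail to deserialize.
bool trt_version_matches() {
    int32_t runtime_version = getInferLibVersion();
    if (runtime_version != NV_TENSORRT_VERSION) {
        printf("TensorRT mismatch: headers %d vs runtime %d\n",
               NV_TENSORRT_VERSION, runtime_version);
        return false;
    }
    return true;
}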

The main topics covered are: how to define the model structure, how to configure the build process, how to implement inference, how to implement plugins, and how to understand ONNX.

The figure in the original post (not reproduced here) illustrates TensorRT's optimization process. The left side shows an unoptimized network graph; TensorRT notices that the three layers inside the large ellipse share the same structure, so it fuses them into the single CBR block shown in the optimized graph on the right. The three convolution layers in the middle of the left graph (3*3, 5*5, 1*1) can likewise be optimized into 3 CBR blocks. This kind of fusion is a direct optimization and speedup: it reduces the amount of data movement between layers and yields higher performance.

Let's now build a simple neural network: image -> fully connected layer -> sigmoid activation -> output prob.

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <cuda_runtime.h>
#include <stdio.h>

using namespace nvinfer1;
// Logger class through which TensorRT prints and reports its messages.
class TRTLogger: public ILogger {
public:
    virtual void log(Severity severity, const char *msg) noexcept override {
        // kVERBOSE is the least severe level, so this prints everything.
        if (severity <= Severity::kVERBOSE) {
            printf("%d: %s\n", static_cast<int>(severity), msg);
        }
    }
};

// Wrap a raw float array as TensorRT's Weights struct.
Weights make_weights(float* ptr, int n) {
    Weights w;
    w.count = n;
    w.type = nvinfer1::DataType::kFLOAT;
    w.values = ptr;
    return w;
}

int main() {
    TRTLogger logger;
    // The builder creates the network; it is bound to the logger.
    IBuilder *builder = createInferBuilder(logger);
    // Build configuration: tells TensorRT how to optimize the model.
    // The generated model only runs under this specific configuration.
    IBuilderConfig *config = builder->createBuilderConfig();
    // The network definition, created by the builder (flag 1 = explicit batch).
    INetworkDefinition *network = builder->createNetworkV2(1);
    // The input image has 3 channels.
    const int num_input = 3;
    // The output prob is a two-class prediction.
    const int num_output = 2;
    // Network weights: the first 3 values are w1's RGB weights, the last 3 are w2's.
    float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5};
    // Bias values.
    float layer1_bias_values[] = {0.3, 0.8};
    // Add the input node: a 3-channel image of shape 1x3x1x1.
    ITensor *input = network->addInput("image", nvinfer1::DataType::kFLOAT, Dims4(1, num_input, 1, 1));
    // Create TensorRT's own weight and bias structures.
    Weights layer1_weight = make_weights(layer1_weight_values, 6);
    Weights layer1_bias = make_weights(layer1_bias_values, 2);
    // Add a fully connected layer: 3-channel input, 2-channel prob output.
    auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
    // Add an activation layer fed by the fully connected output; type sigmoid.
    auto prob = network->addActivation(*layer1->getOutput(0), ActivationType::kSIGMOID);
    // Mark prob as the network's output.
    network->markOutput(*prob->getOutput(0));

    printf("workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
    config->setMaxWorkspaceSize(1 << 28); // 256 MB
    builder->setMaxBatchSize(1);  // inference batch size is 1
    // Build the inference engine from the network and config.
    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
    if (engine == nullptr) {
        printf("Build engine failed.\n");
        return -1;
    }
    // Serialize the model and write it to a file.
    IHostMemory *model_data = engine->serialize();
    FILE *f = fopen("engine.trtmodel", "wb");
    fwrite(model_data->data(), 1, model_data->size(), f);
    fclose(f);
    // Destroy objects in the reverse order of construction.
    model_data->destroy();
    engine->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
    printf("Done.\n");

    return 0;
}
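
The post stops at serializing the engine; the matching inference step would deserialize engine.trtmodel and execute it. The following is only a minimal sketch of that step, assuming the same TensorRT 7 API and the TRTLogger class defined above (the input values are made up for illustration):

#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <vector>

using namespace nvinfer1;

// TRTLogger is the same logger class defined in the build code above.

int main() {
    TRTLogger logger;

    // Read the serialized engine back from disk.
    FILE *f = fopen("engine.trtmodel", "rb");
    if (f == nullptr) { printf("Open engine.trtmodel failed.\n"); return -1; }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector<char> model_data(size);
    fread(model_data.data(), 1, size, f);
    fclose(f);

    // Deserialize the engine with a runtime and create an execution context.
    IRuntime *runtime = createInferRuntime(logger);
    ICudaEngine *engine = runtime->deserializeCudaEngine(model_data.data(), size, nullptr);
    if (engine == nullptr) { printf("Deserialize engine failed.\n"); return -1; }
    IExecutionContext *context = engine->createExecutionContext();

    // One 1x3x1x1 input, matching the "image" tensor defined at build time.
    float input_host[3] = {1.0f, 2.0f, 3.0f}; // made-up pixel values
    float output_host[2] = {0};
    float *input_device = nullptr, *output_device = nullptr;
    cudaMalloc(&input_device, sizeof(input_host));
    cudaMalloc(&output_device, sizeof(output_host));
    cudaMemcpy(input_device, input_host, sizeof(input_host), cudaMemcpyHostToDevice);

    // Look up the input's binding slot by the name given at build time;
    // the only other binding of this two-tensor engine is the output.
    void *bindings[2] = {nullptr, nullptr};
    int input_index = engine->getBindingIndex("image");
    bindings[input_index] = input_device;
    bindings[1 - input_index] = output_device;

    // Execute the explicit-batch network and copy the result back.
    context->executeV2(bindings);
    cudaMemcpy(output_host, output_device, sizeof(output_host), cudaMemcpyDeviceToHost);
    printf("prob = %f, %f\n", output_host[0], output_host[1]);

    cudaFree(input_device);
    cudaFree(output_device);
    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}

You can sanity-check the first output by hand: it should be sigmoid(1.0*1 + 2.0*2 + 0.5*3 + 0.3) = sigmoid(6.8) ≈ 0.9989. Note that if you place this next to the build code as a second .cpp, rename one of the main functions or build them separately, since the Makefile below links every .cpp it finds.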

Makefile (I am developing on an NVIDIA Jetson Nano with JetPack 4.5; the TensorRT version is 7.1.1):

EXE=main

INCLUDE=/usr/include/aarch64-linux-gnu/
LIBPATH=/usr/lib/aarch64-linux-gnu/
CFLAGS= -I$(INCLUDE) -I/usr/local/cuda-10.2/include
LIBS= -L$(LIBPATH) -lnvinfer -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -lstdc++fs

CXX_OBJECTS := $(patsubst %.cpp,%.o,$(shell find . -name "*.cpp"))
DEP_FILES =$(patsubst %.o,%.d,$(CXX_OBJECTS))

$(EXE): $(CXX_OBJECTS)
	$(CXX) $(CXX_OBJECTS) -o $(EXE) $(LIBS)

%.o: %.cpp
	$(CXX) -c -o $@ $(CFLAGS) $<

clean:
	rm -rf $(CXX_OBJECTS) $(DEP_FILES) $(EXE)

test:
	echo $(CXX_OBJECTS)
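
Assuming the include and library paths above match your JetPack 4.5 installation, running make produces the main executable; ./main then builds the engine, serializes it to engine.trtmodel, and prints the verbose build log shown below.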

The run output:

workspace Size = 256.00 MB
4: Applying generic optimizations to the graph for inference.
4: Original: 2 layers
4: After dead-layer removal: 2 layers
4: After Myelin optimization: 2 layers
4: After scale fusion: 2 layers
4: After vertical fusions: 2 layers
4: After final dead-layer removal: 2 layers
4: After tensor merging: 2 layers
4: After concat removal: 2 layers
4: Graph construction and optimization completed in 0.0724424 seconds.
4: Constructing optimization profile number 0 [1/1].
4: *************** Autotuning format combination: Float(1,1,1,3) -> Float(1,1,1,2) ***************
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: --------------- Timing Runner: (Unnamed Layer* 0) [Fully Connected] (CaskFullyConnected)
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: Tactic: 8883888914904656451 time 0.0325
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: Tactic: 5453137127347942357 time 0.028385
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: Tactic: 5373503982740029499 time 0.028333
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: Tactic: 4133936625481774016 time 0.016875
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: Tactic: 1933552664043962183 time 0.016927
4: Fastest Tactic: 4133936625481774016 Time: 0.016875
4: --------------- Timing Runner: (Unnamed Layer* 0) [Fully Connected] (CudaFullyConnected)
4: Tactic: 0 time 0.01974
4: Tactic: 1 time 0.023021
4: Tactic: 9 time 0.026927
4: Tactic: 26 time 0.019167
4: Tactic: 27 time 0.018907
4: Tactic: 48 time 0.019167
4: Tactic: 49 time 0.019844
4: Fastest Tactic: 27 Time: 0.018907
4: >>>>>>>>>>>>>>> Chose Runner Type: CaskFullyConnected Tactic: 4133936625481774016
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: 
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_64x64_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_128x32_relu_nn_v1
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: *************** Autotuning format combination: Float(1,1,1,2) -> Float(1,1,1,2) ***************
4: --------------- Timing Runner: (Unnamed Layer* 1) [Activation] (Activation)
4: Tactic: 0 is the only option, timing skipped
4: Fastest Tactic: 0 Time: 0
4: Formats and tactics selection completed in 0.281916 seconds.
4: After reformat layers: 2 layers
4: Block size 268435456
4: Block size 512
4: Total Activation Memory: 268435968
3: Detected 1 inputs and 1 output network tensors.
4: (Unnamed Layer* 0) [Fully Connected] (caskFullyConnectedFP32) Set Tactic Name: maxwell_sgemm_32x128_relu_nn_v1
4: Layer: (Unnamed Layer* 0) [Fully Connected] Weights: 24 HostPersistent: 384 DevicePersistent: 1536
4: Layer: (Unnamed Layer* 1) [Activation] Weights: 0 HostPersistent: 0 DevicePersistent: 0
4: Total Host Persistent Memory: 384
4: Total Device Persistent Memory: 1536
4: Total Weight Memory: 24
4: Builder timing cache: created 1 entries, 0 hit(s)
4: Engine generation completed in 11.306 seconds.
4: Engine Layer Information:
4: Layer(caskFullyConnectedFP32): (Unnamed Layer* 0) [Fully Connected], Tactic: 4133936625481774016, image[Float(3,1,1)] -> (Unnamed Layer* 0) [Fully Connected]_output[Float(2,1,1)]
4: Layer(Activation): (Unnamed Layer* 1) [Activation], Tactic: 0, (Unnamed Layer* 0) [Fully Connected]_output[Float(2,1,1)] -> (Unnamed Layer* 1) [Activation]_output[Float(2,1,1)]
Done.
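
The leading numbers are the severity values printed by our TRTLogger (kINTERNAL_ERROR=0, kERROR=1, kWARNING=2, kINFO=3, kVERBOSE=4), so the "4:" lines are verbose messages. Note how the builder times several candidate kernels ("tactics") for the fully connected layer and picks the fastest one (Tactic 4133936625481774016): this is exactly the on-device autotuning described in point 2 at the top, and it is why the resulting engine is tied to the GPU it was built on.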