Optimizing NVIDIA GPU Performance for Efficient Model Inference

代码医生工作室

Published 2019-07-23 10:53:36

Included in the column: 相约机器人

Author | 钱林亮

Source | Medium

Editor | 代码医生团队

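GPUs have proven to be an effective way to accelerate deep learning and AI workloads such as computer vision and natural language processing (NLP). Many deep-learning applications now run on GPU devices in production, for example NVIDIA Tesla in data centers and Jetson on embedded platforms. This raises a question: how do you get the best inference performance out of an NVIDIA GPU device?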

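In this article, we show step by step how to optimize a pre-trained TensorFlow model to reduce inference latency on a CUDA-enabled GPU. Our experiments use SSD MobileNet V2 for object detection and run on Colab. All source code and instructions for reproducing the results are in the notebook:

https://colab.research.google.com/drive/10ah6t0I2-MV_3uPqw6J_WhMHlfLflrr8?source=post_page---------------------------

The article is structured as follows:

Download and run the original model in TensorFlow
Optimize the model by cooperating with the CPU
Optimize the model with TensorRT
Comparison and conclusion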

On a Colab GPU instance, we speed up inference by:

1.3x, by placing control-flow operations on the CPU
4.0x, by converting the pre-trained TensorFlow model and running it in TensorRT

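Step 0: Download and run the original model in TensorFlow

We first download the pre-trained SSD MobileNet V2 model from the TensorFlow Detection Model Zoo, which provides a collection of models pre-trained on the COCO dataset.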

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md?source=post_page---------------------------

# Download SSD MobileNet V2 model
wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
tar -zxf ssd_mobilenet_v2_coco_2018_03_29.tar.gz

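The extracted folder contains the following files:

frozen_inference_graph.pb is the frozen inference graph for arbitrary image and batch sizes
pipeline.config contains the configuration used to generate the model
model.ckpt.* contains the pre-trained model variables
The saved_model folder contains the TensorFlow SavedModel files

We then export the model with the TensorFlow Object Detection API, which lets us fix the batch size and image size. In this experiment we use 300x300 images as input with a batch size of 1, so the input shape is [1, 300, 300, 3].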

https://github.com/tensorflow/models/tree/master/research/object_detection?source=post_page---------------------------

from pathlib import Path

import tensorflow as tf
from google.protobuf import text_format
from object_detection import exporter
from object_detection.protos import pipeline_pb2
 
# Define some helper functions here
def export_detection_model(model_dir: str):
    """
    Export model given model directory
    Args:
        model_dir: model directory
    Returns:
    """
    model_dir = Path(model_dir)
 
    if not model_dir.exists():
        raise RuntimeError("model directory {dir} does not exist".format(dir=model_dir))
 
    config_path = model_dir/"pipeline.config"
    checkpoint_path = model_dir/"model.ckpt"
    export_dir = model_dir/"exported_model"
 
    config = pipeline_pb2.TrainEvalPipelineConfig()
    with open(str(config_path), 'r') as f:
        text_format.Merge(f.read(), config, allow_unknown_extension=True)
 
    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True
 
    with tf.Session(config=tf_config):
        with tf.Graph().as_default():
            exporter.export_inference_graph(
                "image_tensor",
                config,
                str(checkpoint_path),
                str(export_dir),
                input_shape=[1, 300, 300, 3]) # fix input shape
            
# Export model to ssd_mobilenet_v2_coco_2018_03_29/exported_model
model_dir = "ssd_mobilenet_v2_coco_2018_03_29"
export_detection_model(model_dir)

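We are now ready to run the model. We first download an input image from the internet and preprocess it to the required shape. We then load the model with TensorFlow and run inference. Note that we pass options and run_metadata to record profiling data for later analysis.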

import time

import numpy as np
import tensorflow as tf
from tensorflow.python.client import timeline

# get_image_by_url and load_graph_def are helper functions, and frozen_model_path
# and trace_filename are variables, that are not shown in this article.
image = get_image_by_url("https://www.rover.com/blog/wp-content/uploads/2017/05/pug-tilt-960x540.jpg")
image_tensor = image.resize((300, 300))
image_tensor = np.array(image_tensor)
image_tensor = np.expand_dims(image_tensor, axis=0)  # add batch dimension: [1, 300, 300, 3]

input_tensor_name = "image_tensor:0"
output_tensor_names = ['detection_boxes:0', 'detection_classes:0', 'detection_scores:0', 'num_detections:0']
ssd_mobilenet_v2_graph_def = load_graph_def(frozen_model_path)

with tf.Graph().as_default() as g:
    tf.import_graph_def(ssd_mobilenet_v2_graph_def, name='')
    input_tensor = g.get_tensor_by_name(input_tensor_name)
    output_tensors = [g.get_tensor_by_name(name) for name in output_tensor_names]

with tf.Session(graph=g) as sess:
    # Request a full trace so that per-op timings are recorded
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    start = time.time()
    outputs = sess.run(output_tensors, feed_dict={input_tensor: image_tensor},
                       options=options, run_metadata=run_metadata)
    inference_time = (time.time()-start)*1000. # in ms

    # Convert the collected step stats to Chrome trace format
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()

# Write the trace so it can be loaded in chrome://tracing
with open('ssd_mobilenet_v2_coco_2018_03_29/exported_model/' + trace_filename, 'w') as f:
    f.write(chrome_trace)

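Finally, we run a sanity check to make sure the model produces meaningful predictions. Note that SSD MobileNet V2 takes an image array as input and outputs a bounding box for each detected object; the TensorFlow Object Detection API returns each box as [ymin, xmin, ymax, xmax] in normalized coordinates. We use the outputs to draw the bounding boxes and get the following result.

Detection result from the previous run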
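
For the sanity check it helps to look only at confident detections. The sketch below assumes the four outputs are fetched in the order given in output_tensor_names and have the usual batch-first shapes (boxes [1, N, 4], classes [1, N], scores [1, N], num_detections [1]). The function name and the 0.5 threshold are illustrative choices and are not part of the original code.

```python
def filter_detections(boxes, classes, scores, num_detections, min_score=0.5):
    """Return (box, class_id, score) tuples for the first image in the batch."""
    count = int(num_detections[0])  # num_detections is returned as a float
    kept = []
    for i in range(count):
        score = float(scores[0][i])
        if score >= min_score:
            kept.append((list(boxes[0][i]), int(classes[0][i]), score))
    return kept


# Toy outputs with the same nesting as the real tensors (values are made up).
boxes = [[[0.1, 0.2, 0.8, 0.9], [0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]]]
classes = [[18.0, 1.0, 1.0]]
scores = [[0.97, 0.30, 0.0]]
num_detections = [2.0]

for box, class_id, score in filter_detections(boxes, classes, scores, num_detections):
    print(class_id, round(score, 2), box)
```

With the real outputs, the call would be filter_detections(*outputs), since indexing works the same way on NumPy arrays as on nested lists.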

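The result looks reasonable, so we can trust that the model works correctly. We are now ready to analyze its performance using Chrome's tracing tool. Open Chrome and enter the URL chrome://tracing. Drag in the timeline JSON file produced by the previous script, and you will see the following interface.

Inference timeline trace of the original SSD MobileNet V2

From the trace above, you may notice that some operations run on the CPU even though we told TensorFlow to run all of them on the GPU. This is because TensorFlow has no GPU kernel registered for these operations (for example, NonMaxSuppressionV3). Since they cannot be processed on the GPU, TensorFlow has to transfer the intermediate output from GPU memory to CPU memory, process it on the CPU, and transfer the result back to the GPU before continuing. The trace shows this happening many times, so the program spends too much time on data transfer and slows down.

In addition, the bottom of the chart shows the time cost of each type of operation. The three most expensive are GatherV2, NonMaxSuppressionV3, and Conv2D. This makes sense for Conv2D, because MobileNet V2 relies on it heavily and it is computationally expensive, but not for the other two. We address these issues and optimize the model's inference performance in the next section.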
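
The per-op totals can also be computed from the trace file instead of being read off the Chrome UI. The sketch below assumes the JSON follows the Chrome trace event format, with complete events marked "ph": "X", a duration "dur" in microseconds, and the op type stored under args.op (falling back to "name"). That layout is an assumption about what generate_chrome_trace_format() emits, so check it against your own file; the function name is illustrative.

```python
import json
from collections import defaultdict


def total_ms_by_op(trace):
    """Sum the duration of complete ("X") events per op type, in milliseconds."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X":
            continue  # skip metadata and flow events
        op = event.get("args", {}).get("op", event.get("name", "unknown"))
        totals[op] += event.get("dur", 0) / 1000.0  # microseconds -> milliseconds
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Tiny hand-written trace for illustration; a real one would be read from the
# JSON file written by the profiling script.
sample = json.loads("""
{"traceEvents": [
  {"ph": "M", "name": "process_name", "pid": 0, "args": {"name": "GPU"}},
  {"ph": "X", "name": "Conv2D", "ts": 0, "dur": 1500, "args": {"op": "Conv2D"}},
  {"ph": "X", "name": "Conv2D", "ts": 2000, "dur": 500, "args": {"op": "Conv2D"}},
  {"ph": "X", "name": "GatherV2", "ts": 3000, "dur": 250, "args": {"op": "GatherV2"}}
]}
""")

for op, ms in total_ms_by_op(sample):
    print(f"{op}: {ms:.3f} ms")
```

The same op may appear in more than one lane of a GPU trace, so these numbers are sums over events rather than wall-clock time.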

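Step 1: Optimize the model by cooperating with the CPU

Many people believe GPUs are faster than CPUs, which is why GPUs are used to accelerate programs. This is only partly true. To see why, we need to understand how a GPU works.

The difference in floating-point capability between the CPU and the GPU arises because the GPU is specialized for compute-intensive, highly parallel computation, which is exactly what graphics rendering requires. It is therefore designed so that more transistors are devoted to data processing rather than to data caching and flow control, as illustrated below:

CPU vs. GPU architecture

As a result, for operations that can be processed in parallel, such as matrix multiplication, the GPU is significantly faster than the CPU. However, because the GPU has fewer transistors for flow control and caching, this may not hold for flow-control operations (for example if, where, while, and so on).

We ran the five most time-consuming operations on both the CPU and the GPU (except NonMaxSuppressionV3, which can only run on the CPU) and compared their performance, with the following results:

We can see that Conv2D, which performs matrix multiplications and additions on the input data, runs about 10x faster on the GPU, as expected. However, for GatherV2, ConcatV2, and Select, which access memory at given indices, the CPU outperforms the GPU. We can therefore improve inference performance simply by placing these operations on the CPU: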

for node in ssd_mobilenet_v2_optimized_graph_def.node:
  if 'NonMaxSuppression' in node.name:
    node.device = '/device:CPU:0'

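The code above places every operation in the NonMaxSuppression block on the CPU, because most of the flow-control operations occur in this block. We then test the modified model with the same code and record the timeline trace, with the following result:

Inference timeline trace of the optimized model

Note that total inference time drops from about 50 ms to about 30 ms. GatherV2 now costs 2.140 ms, down from 5.458 ms in the original model, and ConcatV2 drops from 3.588 ms to 1.422 ms. The modified model also transfers less data between the GPU and the CPU, so operations such as NonMaxSuppressionV3 that already ran on the CPU benefit as well.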
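
The placement step in this section can be wrapped in a small helper that also reports how many nodes ended up on each device. This is a sketch rather than the original notebook code: the function names are made up, and matching by op type is a variation that the article does not benchmark, so any change should be measured. It relies only on the name, op, and device fields that graph_def.node entries expose.

```python
from collections import Counter
from types import SimpleNamespace


def pin_to_cpu(nodes, name_substrings=(), op_types=()):
    """Set device to CPU for nodes matched by name substring or op type."""
    pinned = 0
    for node in nodes:
        by_name = any(s in node.name for s in name_substrings)
        by_op = node.op in op_types
        if by_name or by_op:
            node.device = "/device:CPU:0"
            pinned += 1
    return pinned


def device_histogram(nodes):
    """Count nodes per device string; '' means TensorFlow decides placement."""
    return Counter(node.device for node in nodes)


# Stand-ins for graph_def.node entries (made-up names, same fields as NodeDef).
nodes = [
    SimpleNamespace(name="FeatureExtractor/Conv", op="Conv2D", device=""),
    SimpleNamespace(name="Postprocessor/NonMaxSuppression/Gather", op="GatherV2", device=""),
    SimpleNamespace(name="Postprocessor/Select", op="Select", device=""),
]

print(pin_to_cpu(nodes, name_substrings=("NonMaxSuppression",)))
print(device_histogram(nodes))
```

Calling pin_to_cpu(graph_def.node, name_substrings=("NonMaxSuppression",)) reproduces the name-based rule used above.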

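Step 2: Optimize the model with TensorRT

In this section, we show how to speed up inference further with NVIDIA TensorRT.

What is TensorRT

NVIDIA TensorRT™ is a high-performance deep learning inference platform. It includes a deep learning inference optimizer and a runtime that deliver low latency and high throughput for deep learning inference applications.

TensorRT overview, from NVIDIA TensorRT

Why use TensorRT

TensorRT provides a set of tools for optimizing deep learning models, such as precision calibration and layer fusion. You can use these tools without knowing the details of the underlying algorithms. TensorRT also selects kernels specifically for your GPU device, which optimizes performance further. The pros and cons of using TensorRT are summarized below: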

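Pros:

Convenient optimization tools let users optimize production models easily and efficiently
Platform-specific kernel selection maximizes performance on the device
Supports major frameworks such as TensorFlow and Caffe

Cons:

TensorRT supports only a subset of operations, so layers must be chosen carefully when building a model to keep it compatible with TensorRT

To run a pre-trained TensorFlow model in TensorRT, we need to perform the following steps:

Convert the TensorFlow model to UFF format
Build the TensorRT inference engine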

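Convert the TensorFlow model to UFF format

First, we convert the frozen SSD MobileNet V2 TensorFlow model to UFF format, which TensorRT can parse, using Graph Surgeon and the UFF converter. Some simple models (for example MobileNet V2 or Inception v4 for image classification) can be converted directly with the UFF converter. However, for models that contain operations TensorRT does not support (such as NonMaxSuppression in SSD MobileNet V2), some preprocessing is required. The trick is to use Graph Surgeon to replace the unsupported operations with supported ones.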
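
Before doing any graph surgery, it can help to list which nodes use operations outside a known-good set. The sketch below is only an illustration: SUPPORTED_OPS is a hypothetical allow-list, not TensorRT's actual support matrix, and the function name is made up. It relies only on the name and op fields of graph_def.node entries.

```python
from collections import defaultdict
from types import SimpleNamespace


def nodes_with_unlisted_ops(nodes, supported_ops):
    """Map each op type missing from supported_ops to the node names using it."""
    missing = defaultdict(list)
    for node in nodes:
        if node.op not in supported_ops:
            missing[node.op].append(node.name)
    return dict(missing)


# Hypothetical allow-list for illustration only; consult the TensorRT
# documentation for the operations your version actually supports.
SUPPORTED_OPS = {"Conv2D", "Relu6", "BiasAdd"}

# Stand-ins for graph_def.node entries (made-up names, same fields as NodeDef).
nodes = [
    SimpleNamespace(name="FeatureExtractor/Conv", op="Conv2D"),
    SimpleNamespace(name="FeatureExtractor/Relu6", op="Relu6"),
    SimpleNamespace(name="Postprocessor/nms", op="NonMaxSuppressionV3"),
]

print(nodes_with_unlisted_ops(nodes, SUPPORTED_OPS))
```

The namespaces that show up in such a listing are candidates for replacement with plugin nodes.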

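The following script provides the preprocessing function and modifies the original graph. The key step is replacing the NonMaxSuppression operation in the original graph with the NMS_TRT operation, a TensorRT kernel for non-maximum suppression. The script then passes the modified graph to the UFF converter and writes out the final UFF model.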

import uff
import tensorflow as tf
import tensorrt as trt
import graphsurgeon as gs

# Preprocess function to convert TF model to UFF
def ssd_mobilenet_v2_unsupported_nodes_to_plugin_nodes(ssd_graph, input_shape):
    """Makes ssd_graph TensorRT compatible using graphsurgeon.
    Note: This specific implementation works only for
    ssd_mobilenet_v2_coco_2018_03_29 network.
    Args:
        ssd_graph (gs.DynamicGraph): graph to convert
        input_shape: input shape in CHW format
    Returns:
        gs.DynamicGraph: UffParser compatible SSD graph
    """
 
    channels, height, width = input_shape
 
    Input = gs.create_plugin_node(name="Input",
        op="Placeholder",
        dtype=tf.float32,
        shape=[1, channels, height, width])
    PriorBox = gs.create_plugin_node(name="GridAnchor", op="GridAnchor_TRT",
        minSize=0.2,
        maxSize=0.95,
        aspectRatios=[1.0, 2.0, 0.5, 3.0, 0.33],
        variance=[0.1,0.1,0.2,0.2],
        featureMapShapes=[19, 10, 5, 3, 2, 1],
        numLayers=6
    )
    NMS = gs.create_plugin_node(
        name="NMS",
        op="NMS_TRT",
        shareLocation=1,
        varianceEncodedInTarget=0,
        backgroundLabelId=0,
        confidenceThreshold=1e-8,
        nmsThreshold=0.6,
        topK=100,
        keepTopK=100,
        numClasses=91,
        inputOrder=[1, 0, 2],
        confSigmoid=1,
        isNormalized=1
    )
    concat_priorbox = gs.create_node(
        "concat_priorbox",
        op="ConcatV2",
        dtype=tf.float32,
        axis=2
    )
    concat_box_loc = gs.create_plugin_node(
        "concat_box_loc",
        op="FlattenConcat_TRT",
        dtype=tf.float32,
        axis=1,
        ignoreBatch=0
    )
    concat_box_conf = gs.create_plugin_node(
        "concat_box_conf",
        op="FlattenConcat_TRT",
        dtype=tf.float32,
        axis=1,
        ignoreBatch=0
    )
 
    # Create a mapping of namespace names -> plugin nodes.
    namespace_plugin_map = {
        "MultipleGridAnchorGenerator": PriorBox,
        "Postprocessor": NMS,
        "Preprocessor/map": Input,
        "ToFloat": Input,
        # "image_tensor": Input,
        "Concatenate": concat_priorbox,
        "concat": concat_box_loc,
        "concat_1": concat_box_conf
    }
    for node in ssd_graph.graph_inputs:
        namespace_plugin_map[node.name] = Input
 
    # Create a new graph by collapsing namespaces
    ssd_graph.collapse_namespaces(namespace_plugin_map)
    # Remove the outputs, so we just have a single output node (NMS).
    # If remove_exclusive_dependencies is True, the whole graph will be removed!
    ssd_graph.remove(ssd_graph.graph_outputs, remove_exclusive_dependencies=False)
    # Disconnect the Input node from NMS, as it expects to have only 3 inputs.
    ssd_graph.find_nodes_by_op("NMS_TRT")[0].input.remove("Input")
    
    return ssd_graph
 
# Export UFF model file
ssd_mobilenet_v2_pb_path = "ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.pb"
output_uff_filename = "ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.uff"
input_shape = (3, 300, 300)
 
dynamic_graph = gs.DynamicGraph(ssd_mobilenet_v2_pb_path)
dynamic_graph = ssd_mobilenet_v2_unsupported_nodes_to_plugin_nodes(dynamic_graph, input_shape)
 
uff.from_tensorflow(dynamic_graph.as_graph_def(), output_nodes=["NMS"], output_filename=output_uff_filename)

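Build the TensorRT inference engine

We now have the UFF model file and are ready to build the TensorRT engine. You can build the engine once and deploy it to different devices. However, because the engine is optimized for the device it is built on, we recommend rebuilding it for each device to maximize performance.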

# Build TensorRT engine
uff_model_path = "ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.uff"
engine_path = "ssd_mobilenet_v2_coco_2018_03_29/ssd_mobilenet_v2_bs_1.engine"
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
 
trt_runtime = trt.Runtime(TRT_LOGGER)
 
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
  builder.max_workspace_size = 1 << 30
  builder.fp16_mode = True
  builder.max_batch_size = 1
  parser.register_input("Input", (3, 300, 300))
  parser.register_output("MarkOutput_0")
  parser.parse(uff_model_path, network)
  
  print("Building TensorRT engine, this may take a few minutes...")
  trt_engine = builder.build_cuda_engine(network)

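We now have the TensorRT engine and are ready to run the model on TensorRT. Note that TensorRT expects input images in NCHW format, so the input shape should be [1, 3, 300, 300] rather than the [1, 300, 300, 3] used in TensorFlow.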
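
To make the layout difference concrete, the sketch below reorders a flat NHWC buffer into NCHW order using plain Python. It is written for illustration rather than speed, and the function name is made up.

```python
def nhwc_to_nchw(flat, n, h, w, c):
    """Reorder a flat NHWC buffer into NCHW order."""
    if len(flat) != n * h * w * c:
        raise ValueError("buffer length does not match the given shape")
    out = [None] * len(flat)
    for b in range(n):
        for y in range(h):
            for x in range(w):
                for ch in range(c):
                    src = ((b * h + y) * w + x) * c + ch  # offset in NHWC
                    dst = ((b * c + ch) * h + y) * w + x  # offset in NCHW
                    out[dst] = flat[src]
    return out


# A 1x2x2x3 "image": pixel k holds the channel values (3k, 3k+1, 3k+2).
nhwc = list(range(12))
print(nhwc_to_nchw(nhwc, n=1, h=2, w=2, c=3))
# -> [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```

With NumPy arrays the same reordering is np.transpose(image_tensor, (0, 3, 1, 2)).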

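import time

import numpy as np
import pycuda.driver as cuda

# exec_context, inputs, outputs, bindings and stream are assumed to be created
# from trt_engine beforehand (not shown in this article).
def run_trt_model(image_tensor, exec_context, inputs, outputs, bindings, stream):
  # Copy input to appropriate place; image_tensor must be broadcastable to the host buffer's shape
  np.copyto(inputs[0].host, image_tensor)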
 
  start = time.time()
  # Copy input from host memory to GPU memory
  for inp in inputs:
    cuda.memcpy_htod_async(inp.device, inp.host, stream)
 
  # Run inference
  exec_context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
 
  # Copy result from GPU memory to host memory
  for out in outputs:
    cuda.memcpy_dtoh_async(out.host, out.device, stream)
 
  stream.synchronize()
  inference_time = (time.time()-start)*1000.
  
  res = [out.host for out in outputs]
  return res, inference_time
 
inference_time = []
for i in range(30):
  res, t = run_trt_model(image_tensor, exec_context, inputs, outputs, bindings, stream)
  inference_time.append(t)
 
print("TensorRT inference time: %.2f ms" % np.mean(inference_time))

In our experiments, the average inference time for this run was about 4.9 ms.
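
A plain mean can be skewed by the first few runs, which are often slower while the device warms up. The sketch below is one way to summarize a list of latencies after discarding warm-up iterations. The function name, the warm-up count, and the sample timings are illustrative and are not measurements from the article.

```python
import statistics


def summarize_latencies(times_ms, warmup=3):
    """Summarize latencies after dropping the first `warmup` runs."""
    steady = times_ms[warmup:]
    if not steady:
        raise ValueError("need more runs than warm-up iterations")
    return {
        "runs": len(steady),
        "mean_ms": statistics.mean(steady),
        "median_ms": statistics.median(steady),
        "max_ms": max(steady),
    }


# Made-up timings in milliseconds; the first three are treated as warm-up.
example = [40.0, 12.0, 6.0, 5.0, 7.0, 6.0, 5.0, 7.0]
print(summarize_latencies(example))
```

The inference_time list collected in the loop above could be passed to this function directly.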

Comparison and conclusion

We compared the inference times across our experiments and obtained the following chart: