I am trying to convert a model from PyTorch (1.6) to TensorRT (7) through ONNX (opset 11); the model uses torch.nn.functional.grid_sample, which opset 11 does not support. The custom alternative I found (https://github.com/pytorch/pytorch/issues/27212) is very slow when run in PyTorch, and I ran into problems converting its main loop to TRT.
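For reference, here is a minimal sketch of the kind of export that fails; the wrapper module and shapes are my own illustration, not the actual model:

import torch
import torch.nn.functional as F

class GridSampleNet(torch.nn.Module):
    # hypothetical minimal module wrapping grid_sample
    def forward(self, image, grid):
        # grid_sample expects normalized coordinates in [-1, 1]
        return F.grid_sample(image, grid, align_corners=True)

image = torch.randn(1, 1, 64, 64)
grid = torch.rand(1, 32, 32, 2) * 2 - 1
# fails at opset 11 with an "operator grid_sampler ... is not supported" error
torch.onnx.export(GridSampleNet(), (image, grid), "net.onnx", opset_version=11)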
My own implementation of bilinear sampling (not just grid_sample, but the entire original sampling built on top of it) runs much faster in PyTorch and converts to TRT successfully. However, my custom bilinear sampling is slower in TRT than in PyTorch (5.6 ms vs 2.0 ms). Profiling shows that each Gather layer generated by the indexing image[:, ind, y0, x0] takes about 0.97 ms, and there are four such layers in the TRT version of this bilinear sampling.
So the question is: why are the Gather layers so slow in TRT, and is there a way to speed up (or avoid) this indexing?
Here is the code of the bilinear sampling function:
import torch

def bilinear_sample_noloop(image, grid):
    """
    :param image: sampling source of shape [N, C, H, W]
    :param grid: pixel sampling coordinates (x, y) of shape [N, grid_H, grid_W, 2]
    :return: sampling result of shape [N, C, grid_H, grid_W] and
             in-bounds mask of shape [N, grid_H, grid_W, 1]
    """
    Nt, C, H, W = image.shape
    grid_H = grid.shape[1]
    grid_W = grid.shape[2]
    xgrid, ygrid = grid.split([1, 1], dim=-1)
    mask = ((xgrid >= 0) & (ygrid >= 0) & (xgrid < W - 1) & (ygrid < H - 1)).float()
    x0 = torch.floor(xgrid)
    x1 = x0 + 1
    y0 = torch.floor(ygrid)
    y1 = y0 + 1
    # bilinear interpolation weights, moved to shape [1, N, grid_H, grid_W]
    wa = ((x1 - xgrid) * (y1 - ygrid)).permute(3, 0, 1, 2)
    wb = ((x1 - xgrid) * (ygrid - y0)).permute(3, 0, 1, 2)
    wc = ((xgrid - x0) * (y1 - ygrid)).permute(3, 0, 1, 2)
    wd = ((xgrid - x0) * (ygrid - y0)).permute(3, 0, 1, 2)
    # zero out-of-bounds indices so the gathers below stay valid
    x0 = (x0 * mask).view(Nt, grid_H, grid_W).long()
    y0 = (y0 * mask).view(Nt, grid_H, grid_W).long()
    x1 = (x1 * mask).view(Nt, grid_H, grid_W).long()
    y1 = (y1 * mask).view(Nt, grid_H, grid_W).long()
    # batch index for every sampling point
    ind = torch.arange(Nt, device=image.device)  # torch.linspace(0, Nt - 1, Nt, device=image.device)
    ind = ind.view(Nt, 1).expand(-1, grid_H).view(Nt, grid_H, 1).expand(-1, -1, grid_W).long()
    image = image.permute(1, 0, 2, 3)
    # each advanced-indexing expression below becomes one Gather layer in TRT
    output_tensor = (image[:, ind, y0, x0] * wa + image[:, ind, y1, x0] * wb +
                     image[:, ind, y0, x1] * wc + image[:, ind, y1, x1] * wd).permute(1, 0, 2, 3)
    output_tensor *= mask.permute(0, 3, 1, 2).expand(-1, C, -1, -1)
    image = image.permute(1, 0, 2, 3)
    return output_tensor, mask
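A quick usage sketch under assumed shapes (note that, unlike grid_sample, the grid here holds pixel coordinates, not normalized [-1, 1] coordinates):

import torch

N, C, H, W = 2, 3, 64, 64
grid_H, grid_W = 32, 32
image = torch.randn(N, C, H, W)
# random (x, y) pixel coordinates inside the valid sampling range
grid = torch.stack([torch.rand(N, grid_H, grid_W) * (W - 1),
                    torch.rand(N, grid_H, grid_W) * (H - 1)], dim=-1)
out, mask = bilinear_sample_noloop(image, grid)
print(out.shape)   # torch.Size([2, 3, 32, 32])
print(mask.shape)  # torch.Size([2, 32, 32, 1])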
Time profiling setup:
Experiments were run on a Dell G3 15 laptop (Core i7-8750H, 2.2 GHz x12, 16 GB RAM (2666 MHz), NVIDIA GeForce GTX 1050 Ti).
Part of the TRT model profile obtained with trtexec:
Layer        Time (ms)   Avg. Time (ms)   Time %
...
Mul_146           5.82             0.03      0.5
Add_147           8.50             0.04      0.7
Gather_148      214.39             0.97     17.3
Gather_174      214.25             0.97     17.3
Gather_201      213.88             0.97     17.3
Gather_228      214.48             0.97     17.3
Add_237          25.01             0.11      2.0
Mul_251           7.84             0.04      0.6
Total          1238.40             5.60    100.0
Additionally, I tried treating image as a linear array over all dimensions except C, building a linear index so that elements are addressed in the form image[:, p0]. In this case the gather becomes even slower (about 1.07 ms). I then assumed C = 1 (which is always true in the original model) and addressed tensor elements as image[p0]. This time the gather takes about 0.92 ms, which is still too slow.
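For illustration, a sketch of those two linear-index variants (the exact index construction here is my reconstruction; the timings are the TRT Gather timings quoted above):

import torch

N, C, H, W = 2, 1, 64, 64
grid_H, grid_W = 32, 32
image = torch.randn(N, C, H, W)
ind = torch.arange(N).view(N, 1, 1).expand(N, grid_H, grid_W)
y0 = torch.randint(0, H - 1, (N, grid_H, grid_W))
x0 = torch.randint(0, W - 1, (N, grid_H, grid_W))
# linear index over all dimensions except C
p0 = (ind * H + y0) * W + x0                     # [N, grid_H, grid_W]

# variant 1: gather in the form image[:, p0]  (TRT Gather: ~1.07 ms)
flat = image.permute(1, 0, 2, 3).reshape(C, -1)  # [C, N*H*W]
Ia = flat[:, p0]                                 # [C, N, grid_H, grid_W]

# variant 2: with C = 1, address elements as image[p0]  (TRT Gather: ~0.92 ms)
flat1 = image.reshape(-1)                        # [N*H*W]
Ia1 = flat1[p0]                                  # [N, grid_H, grid_W]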
Answered on 2021-09-22 14:31:34
The code below can be used when converting from Torch to TensorRT, as a bilinear_interpolate over the image:
import torch

def bilinear_interpolate_torch(im, y, x):
    '''
    im : B, C, H, W
    y  : 1, numPoints -- pixel location y, float
    x  : 1, numPoints -- pixel location x, float
    '''
    x0 = torch.floor(x).type(torch.cuda.LongTensor)
    x1 = x0 + 1
    y0 = torch.floor(y).type(torch.cuda.LongTensor)
    y1 = y0 + 1
    # bilinear interpolation weights
    wa = (x1.type(torch.cuda.FloatTensor) - x) * (y1.type(torch.cuda.FloatTensor) - y)
    wb = (x1.type(torch.cuda.FloatTensor) - x) * (y - y0.type(torch.cuda.FloatTensor))
    wc = (x - x0.type(torch.cuda.FloatTensor)) * (y1.type(torch.cuda.FloatTensor) - y)
    wd = (x - x0.type(torch.cuda.FloatTensor)) * (y - y0.type(torch.cuda.FloatTensor))
    # Instead of clamp: wrap indices that fall one past the border back inside
    x1 = x1 - torch.floor(x1 / im.shape[3]).int()
    y1 = y1 - torch.floor(y1 / im.shape[2]).int()
    Ia = im[:, :, y0, x0]
    Ib = im[:, :, y1, x0]
    Ic = im[:, :, y0, x1]
    Id = im[:, :, y1, x1]
    return Ia * wa + Ib * wb + Ic * wc + Id * wd
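A usage sketch under assumed shapes (the function as written requires a CUDA device because of the torch.cuda.* tensor types):

import torch

B, C, H, W = 1, 1, 64, 64
numPoints = 100
im = torch.randn(B, C, H, W, device='cuda')
y = torch.rand(1, numPoints, device='cuda') * (H - 1)
x = torch.rand(1, numPoints, device='cuda') * (W - 1)
out = bilinear_interpolate_torch(im, y, x)
print(out.shape)  # torch.Size([1, 1, 1, 100])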
Source: https://stackoverflow.com/questions/67687813