I want to move my custom camera video pipeline to use video memory with a combination of Numba and CuPy, and avoid passing data back to host memory if at all possible. As part of this, I need to port my sharpness detection routine to CUDA. The easiest way seemed to be to use CuPy as the basis, since all I do is compute the variance of the Laplacian of each image. The trouble I'm running into is that the CuPy variance calculation appears to be about 8x slower than NumPy. This includes the time needed to copy the device ndarray to the host and perform the variance calculation on the CPU with NumPy. I'm hoping to better understand why the ReductionKernel that CuPy uses for the variance calculation on the GPU is so much slower. I'll start by including my test below.
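For context, the sharpness metric described above (variance of the Laplacian) can be sketched in pure NumPy with a simple 5-point stencil. This is illustrative only, not the exact routine from the pipeline, which could equally use cv2.Laplacian or a GPU convolution:

```python
import numpy as np

def laplacian_variance(img):
    """Sharpness metric: variance of a 5-point-stencil Laplacian.

    Illustrative sketch -- a real pipeline might use cv2.Laplacian
    or a cupyx.scipy.ndimage filter instead of explicit slicing.
    """
    img = img.astype(np.float64)
    lap = (img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return lap.var()

# A constant (perfectly blurred) image has zero sharpness:
flat = np.full((64, 64), 128, dtype=np.uint16)
print(laplacian_variance(flat))  # 0.0
```

A sharper (noisier) image yields a strictly larger value, which is what makes the metric usable for focus detection.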
import cupy as cp
import numpy as np
from numba import cuda
import cv2
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(np.uint16)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(np.uint16)
        x_nb = cuda.to_device(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        x_cp = cp.asarray(x_nb)
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')

The output of this script is as follows:
NumPy: 100x100 0.00006 +- 0.00000
NumPy: 500x500 0.00053 +- 0.00008
NumPy: 1000x1000 0.00243 +- 0.00029
NumPy: 2000x2000 0.01073 +- 0.00136
NumPy: 5000x5000 0.07470 +- 0.00264
NumPy: 10000x10000 0.28578 +- 0.00313
...
CuPy: 100x100 0.00026 +- 0.00002
CuPy: 500x500 0.00463 +- 0.00068
CuPy: 1000x1000 0.02308 +- 0.00060
CuPy: 2000x2000 0.07876 +- 0.00172
CuPy: 5000x5000 0.50830 +- 0.02237
CuPy: 10000x10000 1.98131 +- 0.03540

I also tried using float16 and float32 instead of uint16 (since the reduction kernel CuPy uses appears to work with floats), but it didn't change the difference in any meaningful way.
Here are some relevant versions from my working environment:
>>> numpy.__version__
'1.18.5'
>>> cupy.__version__
'9.6.0'
>>> numba.__version__
'0.53.1'
Python 3.6.9
Driver Version: 470.82.01
CUDA Version: 11.4

Any hints as to what might be causing CuPy to perform so poorly would be appreciated. Also, if there is anything that can be done to improve the performance, I would love to know. I tried reading up on ReductionKernels to understand how to optimize them, but it's over my head. https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
Update: results based on feedback from @myrtlecat
with dtype np.float32
CuPy (1 axis at a time): 100x100 0.00017 +- 0.00002
CuPy (1 axis at a time): 500x500 0.00041 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00097 +- 0.00003
CuPy (1 axis at a time): 2000x2000 0.00278 +- 0.00016
CuPy (1 axis at a time): 5000x5000 0.01381 +- 0.00041
CuPy (1 axis at a time): 10000x10000 0.04313 +- 0.00355
CuPy: 100x100 0.00013 +- 0.00000
CuPy: 500x500 0.00432 +- 0.00013
CuPy: 1000x1000 0.01713 +- 0.00070
CuPy: 2000x2000 0.06713 +- 0.00079
CuPy: 5000x5000 0.41975 +- 0.00259
CuPy: 10000x10000 1.69374 +- 0.01938
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00022 +- 0.00001
NumPy: 1000x1000 0.00121 +- 0.00018
NumPy: 2000x2000 0.00530 +- 0.00047
NumPy: 5000x5000 0.03432 +- 0.00179
NumPy: 10000x10000 0.14227 +- 0.00503
with dtype np.uint16
CuPy (1 axis at a time): 100x100 0.00018 +- 0.00000
CuPy (1 axis at a time): 500x500 0.00124 +- 0.00003
CuPy (1 axis at a time): 1000x1000 0.00430 +- 0.00020
CuPy (1 axis at a time): 2000x2000 0.01537 +- 0.00008
CuPy (1 axis at a time): 5000x5000 0.07413 +- 0.00330
CuPy (1 axis at a time): 10000x10000 0.29842 +- 0.01291
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00359 +- 0.00053
CuPy: 1000x1000 0.01952 +- 0.00058
CuPy: 2000x2000 0.07719 +- 0.00076
CuPy: 5000x5000 0.48284 +- 0.00169
CuPy: 10000x10000 1.96746 +- 0.04353
NumPy: 100x100 0.00006 +- 0.00002
NumPy: 500x500 0.00053 +- 0.00010
NumPy: 1000x1000 0.00224 +- 0.00016
NumPy: 2000x2000 0.00956 +- 0.00034
NumPy: 5000x5000 0.06818 +- 0.00210
NumPy: 10000x10000 0.27071 +- 0.00747

Updated script based on @myrtlecat's feedback:
import cupy as cp
import numpy as np
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)
dtype = np.uint16
# dtype = np.float32

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

# CuPy (1 axis at a time)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (1 axis at a time): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

Update: the script and results above are based on feedback from @myrtlecat, who said it might be an issue with 2d arrays, so I also tried flattening the array with reshape before calling "var":
# CuPy (flattened)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp.reshape((s * s,)))
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (flattened): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

Results:
for uint16
CuPy (flattened): 100x100 0.00018 +- 0.00006
CuPy (flattened): 500x500 0.00107 +- 0.00002
CuPy (flattened): 1000x1000 0.00414 +- 0.00020
CuPy (flattened): 2000x2000 0.01550 +- 0.00036
CuPy (flattened): 5000x5000 0.10017 +- 0.00525
CuPy (flattened): 10000x10000 0.39470 +- 0.01606
CuPy (1 axis at a time): 100x100 0.00026 +- 0.00008
CuPy (1 axis at a time): 500x500 0.00104 +- 0.00005
CuPy (1 axis at a time): 1000x1000 0.00368 +- 0.00028
CuPy (1 axis at a time): 2000x2000 0.01364 +- 0.00055
CuPy (1 axis at a time): 5000x5000 0.07639 +- 0.00311
CuPy (1 axis at a time): 10000x10000 0.29405 +- 0.00419
for float32
CuPy (flattened): 100x100 0.00015 +- 0.00007
CuPy (flattened): 500x500 0.00043 +- 0.00001
CuPy (flattened): 1000x1000 0.00159 +- 0.00003
CuPy (flattened): 2000x2000 0.00631 +- 0.00030
CuPy (flattened): 5000x5000 0.03827 +- 0.00135
CuPy (flattened): 10000x10000 0.13173 +- 0.00579
CuPy (1 axis at a time): 100x100 0.00020 +- 0.00005
CuPy (1 axis at a time): 500x500 0.00030 +- 0.00004
CuPy (1 axis at a time): 1000x1000 0.00077 +- 0.00002
CuPy (1 axis at a time): 2000x2000 0.00215 +- 0.00002
CuPy (1 axis at a time): 5000x5000 0.01387 +- 0.00049
CuPy (1 axis at a time): 10000x10000 0.04099 +- 0.00142

So for both uint16 and float32, reducing one axis at a time with the suggested technique is faster than flattening the 2d array with reshape, although flattening is still much faster than passing in the 2d array directly.
Posted 2022-01-14 01:38:21
I have a partial hypothesis for this problem (not a full explanation) and a workaround. Maybe someone can fill in the gaps. For brevity, I've used a quicker-and-dirtier benchmark.
Workaround: reduce one axis at a time
CuPy is much faster when the reduction is performed over one axis at a time. Instead of:
x.sum()
prefer this:
x.sum(-1).sum(-1).sum(-1)...
Note that the results of these computations may differ due to rounding errors.
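A quick NumPy illustration of why the two reduction orders can round differently (the values here are chosen purely for demonstration):

```python
import numpy as np

# float32 has ~7 significant digits, so adding 1.0 to 1e8 is lost
# entirely, while regrouping the large terms preserves it:
a = np.float32(1e8)
b = np.float32(1.0)

left_to_right = (a + b) - a   # 1e8 + 1 rounds back to 1e8, so this is 0.0
regrouped = (a - a) + b       # the large terms cancel first, so this is 1.0
print(left_to_right, regrouped)  # 0.0 1.0
```

Reducing one axis at a time changes the grouping of the additions in exactly this way, so small differences in the final variance are expected.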
Here are faster mean and var functions:
def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

Benchmark script:
import numpy as np
import cupy as cp
from timeit import timeit

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

def benchmark(label, f):
    t = timeit(f, number=10) / 10
    print(f"{label.ljust(25)}{t:0.4f}s")

x = np.random.rand(5000, 5000).astype('f4')
x_cp = cp.array(x)

benchmark("Numpy", lambda: x.var())
benchmark("Cupy", lambda: float(x_cp.var()))
benchmark("Cupy (1 axis at a time)", lambda: float(var(x_cp)))

Producing a speed-up of over 100x:
Numpy 0.0469s
Cupy 0.3805s
Cupy (1 axis at a time) 0.0013s

The answers are close but not identical due to rounding:
> x.var(), float(x_cp.var()), float(var(x_cp))
(0.08334004, 0.08333975821733475, 0.08334004133939743)

Why is it faster?
When operating on a single thread, a CUDA-capable GPU is usually slower than a modern CPU, but GPUs achieve their high throughput by performing many identical operations in parallel.
A reduction like sum or var can be parallelized by:
- splitting the input into chunks,
- reducing each chunk in parallel, and then
- performing a second reduction over the per-chunk results.

For sufficiently large input arrays, if the chunk size is chosen well, this will come close to optimal performance. (Note: for small arrays, or to squeeze out every last drop of performance, more advanced techniques are needed, such as the slides OP linked.)
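A minimal NumPy sketch of that two-stage scheme (the chunk count here is arbitrary; on a GPU, stage 1 would run roughly one chunk per thread block):

```python
import numpy as np

def chunked_sum(x, n_chunks=100):
    """Two-stage reduction: per-chunk partial sums, then a final sum.

    Assumes x.size is divisible by n_chunks, to keep the sketch short.
    """
    partial = x.ravel().reshape(n_chunks, -1).sum(axis=1)  # stage 1: chunks
    return partial.sum()                                   # stage 2: combine

x = np.arange(10_000, dtype=np.float64)
print(chunked_sum(x))  # 49995000.0, same as x.sum()
```

The one-axis-at-a-time trick effectively forces CuPy into this shape: each row is a "chunk" reduced in parallel, and the second call reduces the per-row results.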
I think it's a bug...
I think CuPy should be applying this technique to every reduction (my reading of the CuPy code is that it does). But, for whatever reason, it does a good job of parallelizing reductions over a single axis, yet not over an entire array. I doubt this is intended behavior, but I'm not a CuPy developer, so perhaps it is. The generated CUDA code and the parameters used to invoke the reduction kernels are hard to follow, so I'm not sure exactly what is happening in each case.
Updated benchmarks
Results from running the updated benchmark script with uint16:
CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00029 +- 0.00000
CuPy (1 axis at a time): 1000x1000 0.00070 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00242 +- 0.00001
CuPy (1 axis at a time): 5000x5000 0.01410 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.05145 +- 0.00149
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00316 +- 0.00000
CuPy: 1000x1000 0.01250 +- 0.00001
CuPy: 2000x2000 0.07283 +- 0.00290
CuPy: 5000x5000 0.44025 +- 0.00012
CuPy: 10000x10000 1.76455 +- 0.00190
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00056 +- 0.00001
NumPy: 1000x1000 0.00201 +- 0.00001
NumPy: 2000x2000 0.01066 +- 0.00005
NumPy: 5000x5000 0.08828 +- 0.00007
NumPy: 10000x10000 0.35403 +- 0.00064

So about a 7x speed-up over NumPy. And the same for float32:
CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00018 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00021 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00052 +- 0.00000
CuPy (1 axis at a time): 5000x5000 0.00232 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.00837 +- 0.00002
CuPy: 100x100 0.00015 +- 0.00000
CuPy: 500x500 0.00281 +- 0.00001
CuPy: 1000x1000 0.01657 +- 0.00026
CuPy: 2000x2000 0.06557 +- 0.00003
CuPy: 5000x5000 0.37905 +- 0.00007
CuPy: 10000x10000 1.51899 +- 0.01084
NumPy: 100x100 0.00003 +- 0.00000
NumPy: 500x500 0.00032 +- 0.00000
NumPy: 1000x1000 0.00115 +- 0.00001
NumPy: 2000x2000 0.00581 +- 0.00010
NumPy: 5000x5000 0.04730 +- 0.00009
NumPy: 10000x10000 0.19188 +- 0.00024

About a 40x speed-up over NumPy.
Results will depend on the platform: for comparison, my CPU is an Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz and my GPU is an NVIDIA GeForce GTX 1080 Ti.
https://stackoverflow.com/questions/70677346