
cupy.var (variance) performance much slower than numpy.var — trying to understand why

Stack Overflow user
Asked 2022-01-12 06:45:56
1 answer · 272 views · 0 followers · 0 votes

I am hoping to move my custom camera video pipeline to use video memory with a combination of numba and cupy, and to avoid passing data back to host memory if possible. As part of doing this I need to port my sharpness detection routine to cuda. The easiest way seemed to be to use cupy, since basically all I do is compute the variance of the Laplacian of each image. The trouble I have run into is that the cupy variance calculation appears to be about 8x slower than numpy — and that includes the time needed to copy the device ndarray to the host and perform the variance calculation on the cpu with numpy. I am hoping to get a better understanding of why the variance-calculation ReductionKernel that cupy uses on the GPU is so much slower. I'll start by including my test below.
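(Aside: the variance-of-the-Laplacian sharpness score mentioned above can be sketched on the CPU with plain numpy. The 5-point stencil and the function names here are illustrative only, not the OP's actual pipeline code.)

```python
import numpy as np

def laplacian(img):
    # 5-point Laplacian stencil computed with shifted copies;
    # crop the border so wrap-around from np.roll does not leak in
    lap = (np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0)
           + np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1)
           - 4 * img)
    return lap[1:-1, 1:-1]

def sharpness(img):
    # variance of the Laplacian: higher means more high-frequency detail
    return float(np.var(laplacian(img.astype(np.float32))))
```

A perfectly flat image scores 0.0, while an image with more high-frequency detail scores higher.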

import cupy as cp
import numpy as np
from numba import cuda
import cv2  # used by the full camera pipeline; unused in this benchmark
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s,s)).astype(np.uint16)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')
    

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s,s)).astype(np.uint16)
        x_nb = cuda.to_device(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        x_cp = cp.asarray(x_nb)
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')

The output of this script is as follows:

NumPy: 100x100 0.00006 +- 0.00000
NumPy: 500x500 0.00053 +- 0.00008
NumPy: 1000x1000 0.00243 +- 0.00029
NumPy: 2000x2000 0.01073 +- 0.00136
NumPy: 5000x5000 0.07470 +- 0.00264
NumPy: 10000x10000 0.28578 +- 0.00313
...
CuPy: 100x100 0.00026 +- 0.00002
CuPy: 500x500 0.00463 +- 0.00068
CuPy: 1000x1000 0.02308 +- 0.00060
CuPy: 2000x2000 0.07876 +- 0.00172
CuPy: 5000x5000 0.50830 +- 0.02237
CuPy: 10000x10000 1.98131 +- 0.03540

I also tried using float16 and float32 instead of uint16 (since it looks like the reduction kernel cupy uses operates on floats), but it didn't change the difference in any meaningful way.

https://github.com/cupy/cupy/blob/04bf15706474a5e79ba196e70784a147cad6e26e/cupy/_core/_routines_statistics.pyx#L542

Here are some of the versions relevant to my working environment:

>>> numpy.__version__
'1.18.5'
>>> cupy.__version__
'9.6.0'
>>> numba.__version__
'0.53.1'
Python 3.6.9
Driver Version: 470.82.01 
CUDA Version: 11.4

Any hints about what could be causing cupy to perform so poorly would be appreciated. Also, if there is anything that can be done to improve the performance, I would love to know. I tried reading about ReductionKernels to understand how to optimize them, but it is over my head. https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Updated results based on feedback from @myrtlecat

with dtype np.float32
CuPy (1 axis at a time): 100x100 0.00017 +- 0.00002
CuPy (1 axis at a time): 500x500 0.00041 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00097 +- 0.00003
CuPy (1 axis at a time): 2000x2000 0.00278 +- 0.00016
CuPy (1 axis at a time): 5000x5000 0.01381 +- 0.00041
CuPy (1 axis at a time): 10000x10000 0.04313 +- 0.00355
CuPy: 100x100 0.00013 +- 0.00000
CuPy: 500x500 0.00432 +- 0.00013
CuPy: 1000x1000 0.01713 +- 0.00070
CuPy: 2000x2000 0.06713 +- 0.00079
CuPy: 5000x5000 0.41975 +- 0.00259
CuPy: 10000x10000 1.69374 +- 0.01938
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00022 +- 0.00001
NumPy: 1000x1000 0.00121 +- 0.00018
NumPy: 2000x2000 0.00530 +- 0.00047
NumPy: 5000x5000 0.03432 +- 0.00179
NumPy: 10000x10000 0.14227 +- 0.00503

with dtype np.uint16
CuPy (1 axis at a time): 100x100 0.00018 +- 0.00000
CuPy (1 axis at a time): 500x500 0.00124 +- 0.00003
CuPy (1 axis at a time): 1000x1000 0.00430 +- 0.00020
CuPy (1 axis at a time): 2000x2000 0.01537 +- 0.00008
CuPy (1 axis at a time): 5000x5000 0.07413 +- 0.00330
CuPy (1 axis at a time): 10000x10000 0.29842 +- 0.01291
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00359 +- 0.00053
CuPy: 1000x1000 0.01952 +- 0.00058
CuPy: 2000x2000 0.07719 +- 0.00076
CuPy: 5000x5000 0.48284 +- 0.00169
CuPy: 10000x10000 1.96746 +- 0.04353
NumPy: 100x100 0.00006 +- 0.00002
NumPy: 500x500 0.00053 +- 0.00010
NumPy: 1000x1000 0.00224 +- 0.00016
NumPy: 2000x2000 0.00956 +- 0.00034
NumPy: 5000x5000 0.06818 +- 0.00210
NumPy: 10000x10000 0.27071 +- 0.00747

Updated script based on feedback from @myrtlecat

import cupy as cp
import numpy as np
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)


dtype = np.uint16
# dtype = np.float32


def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x


def var(x):
    return mean((x - mean(x)) ** 2)


# CuPy (1 axis at a time)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (1 axis at a time): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

One more updated script and set of results, based on feedback from @myrtlecat, who said the problem might be specific to 2d arrays — so I tried flattening the array with reshape before calling "var".

# CuPy (flattened)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp.reshape((s * s,)))
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (flattened): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

Results:

for uint16
CuPy (flattened): 100x100 0.00018 +- 0.00006
CuPy (flattened): 500x500 0.00107 +- 0.00002
CuPy (flattened): 1000x1000 0.00414 +- 0.00020
CuPy (flattened): 2000x2000 0.01550 +- 0.00036
CuPy (flattened): 5000x5000 0.10017 +- 0.00525
CuPy (flattened): 10000x10000 0.39470 +- 0.01606
CuPy (1 axis at a time): 100x100 0.00026 +- 0.00008
CuPy (1 axis at a time): 500x500 0.00104 +- 0.00005
CuPy (1 axis at a time): 1000x1000 0.00368 +- 0.00028
CuPy (1 axis at a time): 2000x2000 0.01364 +- 0.00055
CuPy (1 axis at a time): 5000x5000 0.07639 +- 0.00311
CuPy (1 axis at a time): 10000x10000 0.29405 +- 0.00419

for float32
CuPy (flattened): 100x100 0.00015 +- 0.00007
CuPy (flattened): 500x500 0.00043 +- 0.00001
CuPy (flattened): 1000x1000 0.00159 +- 0.00003
CuPy (flattened): 2000x2000 0.00631 +- 0.00030
CuPy (flattened): 5000x5000 0.03827 +- 0.00135
CuPy (flattened): 10000x10000 0.13173 +- 0.00579
CuPy (1 axis at a time): 100x100 0.00020 +- 0.00005
CuPy (1 axis at a time): 500x500 0.00030 +- 0.00004
CuPy (1 axis at a time): 1000x1000 0.00077 +- 0.00002
CuPy (1 axis at a time): 2000x2000 0.00215 +- 0.00002
CuPy (1 axis at a time): 5000x5000 0.01387 +- 0.00049
CuPy (1 axis at a time): 10000x10000 0.04099 +- 0.00142

So for both uint16 and float32, the suggested one-axis-at-a-time technique appears to be faster than flattening the 2d array via reshape — although reshaping is still much faster than passing the 2d array directly.

Stack Overflow user

Answer accepted

Posted 2022-01-14 01:38:21

I have a partial hypothesis for the problem (not a full explanation), plus a workaround. Perhaps someone can fill in the gaps. For brevity, I have used a quicker-and-dirtier benchmark.

Workaround: reduce one axis at a time.

Cupy is much faster when the reduction is performed over one axis at a time. Instead of:

x.sum()

prefer this:

x.sum(-1).sum(-1).sum(-1)...

Note that the results of these computations may differ due to rounding error.
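The size of that rounding difference can be seen on the CPU alone. This sketch (numpy only, array shape chosen arbitrarily) computes a float32 variance once with a single global reduction and once one axis at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 1000)).astype(np.float32)

# one global reduction over the whole array
v_full = float(x.var())

# one axis at a time, mirroring the workaround above
m = x.mean(-1).mean(-1)
v_axis = float(((x - m) ** 2).mean(-1).mean(-1))

# the two results agree to float32 precision, but need not be bit-identical
print(v_full, v_axis, abs(v_full - v_axis))
```

Both values are close to the true variance of 1.0; only the last few digits differ because the floating-point additions happen in a different order.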

Here are faster mean and var functions:

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

Benchmark:

import numpy as np
import cupy as cp
from timeit import timeit

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

def benchmark(label, f):
    t = timeit(f, number=10) / 10
    print(f"{label.ljust(25)}{t:0.4f}s")

x = np.random.rand(5000, 5000).astype('f4')
x_cp = cp.array(x)

benchmark("Numpy", lambda: x.var())
benchmark("Cupy", lambda: float(x_cp.var()))
benchmark("Cupy (1 axis at a time)", lambda: float(var(x_cp)))

This produces a more than 100x speed-up:

Numpy                    0.0469s
Cupy                     0.3805s
Cupy (1 axis at a time)  0.0013s

Due to rounding, the answers are close but not identical:

> x.var(), float(x_cp.var()), float(var(x_cp))
(0.08334004, 0.08333975821733475, 0.08334004133939743)

Why is it faster?

A CUDA-capable GPU is generally slower than a modern CPU when operating on a single thread, but it achieves high throughput by performing many identical operations in parallel.

Reductions like sum and var can be parallelized by:

  • dividing the input into chunks

  • performing the reduction of each chunk in parallel

  • performing a second reduction over the per-chunk results

For sufficiently large input arrays, this should be close to optimal performance if the chunk size is chosen well. (Note: for small arrays, or to squeeze out every last drop of performance, more advanced techniques are needed, such as the slides the OP linked.)
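The chunked strategy above can be sketched on the CPU in plain numpy (the chunk size here is arbitrary; a real GPU kernel would choose it to match block and warp sizes, and pass 1 would run one chunk per thread block):

```python
import numpy as np

def chunked_sum(x, chunk_size=256):
    # pass 1: split the flat input into fixed-size chunks and reduce each
    #         chunk independently (these are the per-block reductions a
    #         GPU would execute in parallel)
    flat = np.asarray(x, dtype=np.float64).ravel()
    pad = (-flat.size) % chunk_size          # zero-pad to a whole number of chunks
    flat = np.concatenate([flat, np.zeros(pad)])
    partial = flat.reshape(-1, chunk_size).sum(axis=1)
    # pass 2: a second, much smaller reduction over the per-chunk results
    return partial.sum()
```

For example, `chunked_sum(np.arange(1000))` matches `np.arange(1000).sum()` exactly, since float64 addition is exact for these integers.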

I think it's a bug...

I think cupy should be applying this technique to every reduction (and my reading of the cupy code is that it does). However, for whatever reason, it does a good job of parallelizing a reduction over a single axis, but not over the whole array. I doubt this is intended behavior, but I am not a cupy developer, so maybe it is. I had a look at the generated CUDA code and the parameters used to invoke the reduction kernel, but I am not sure exactly what is happening in each case.

Updated benchmarks

Results of running the updated benchmark script with uint16:

CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00029 +- 0.00000
CuPy (1 axis at a time): 1000x1000 0.00070 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00242 +- 0.00001
CuPy (1 axis at a time): 5000x5000 0.01410 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.05145 +- 0.00149
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00316 +- 0.00000
CuPy: 1000x1000 0.01250 +- 0.00001
CuPy: 2000x2000 0.07283 +- 0.00290
CuPy: 5000x5000 0.44025 +- 0.00012
CuPy: 10000x10000 1.76455 +- 0.00190
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00056 +- 0.00001
NumPy: 1000x1000 0.00201 +- 0.00001
NumPy: 2000x2000 0.01066 +- 0.00005
NumPy: 5000x5000 0.08828 +- 0.00007
NumPy: 10000x10000 0.35403 +- 0.00064

So about a 7x speed-up over numpy. And for float32:

CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00018 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00021 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00052 +- 0.00000
CuPy (1 axis at a time): 5000x5000 0.00232 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.00837 +- 0.00002
CuPy: 100x100 0.00015 +- 0.00000
CuPy: 500x500 0.00281 +- 0.00001
CuPy: 1000x1000 0.01657 +- 0.00026
CuPy: 2000x2000 0.06557 +- 0.00003
CuPy: 5000x5000 0.37905 +- 0.00007
CuPy: 10000x10000 1.51899 +- 0.01084
NumPy: 100x100 0.00003 +- 0.00000
NumPy: 500x500 0.00032 +- 0.00000
NumPy: 1000x1000 0.00115 +- 0.00001
NumPy: 2000x2000 0.00581 +- 0.00010
NumPy: 5000x5000 0.04730 +- 0.00009
NumPy: 10000x10000 0.19188 +- 0.00024

About a 40x speed-up over numpy.

Results will depend on the platform: for comparison, my CPU is an Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz and my GPU is an NVIDIA GeForce GTX 1080 Ti.

Votes: 2
Original page content provided by Stack Overflow; translation supported by Tencent Cloud.
Original link: https://stackoverflow.com/questions/70677346
