Q: Multi-threaded integer matrix multiplication in NumPy/SciPy

Stack Overflow user

Asked 2016-01-30 19:39:35

2 answers, 6.4K views, 0 followers, 25 votes

Doing something like

import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)

uses multiple cores, and it runs nicely.

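The elements of a, though, are 64-bit floats (or are they 32-bit on 32-bit platforms?), and I'd like to multiply 8-bit integer arrays. Trying the following, though: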

a = np.random.randint(2, size=(n, n)).astype(np.int8)

results in the dot product not using multiple cores, and thus running about 1000 times slower on my PC.

array: np.random.randint(2, size=shape).astype(dtype)

dtype    shape          %time (average)

float32 (2000, 2000)    62.5 ms
float32 (3000, 3000)    219 ms
float32 (4000, 4000)    328 ms
float32 (10000, 10000)  4.09 s

int8    (2000, 2000)    13 seconds
int8    (3000, 3000)    3min 26s
int8    (4000, 4000)    12min 20s
int8    (10000, 10000)  It didn't finish in 6 hours

float16 (2000, 2000)    2min 25s
float16 (3000, 3000)    Not tested
float16 (4000, 4000)    Not tested
float16 (10000, 10000)  Not tested
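
Measurements of this kind can be reproduced with a small timing harness. The following is only a sketch: the helper name `time_dot`, the matrix size of 200, and the best-of-three policy are assumptions made for illustration, not the setup used for the table above.

```python
import time
import numpy as np

def time_dot(dtype, n, repeats=3):
    """Return the best wall-clock time (seconds) of np.dot on an n x n 0/1 matrix."""
    a = np.random.randint(2, size=(n, n)).astype(dtype)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.dot(a, a)
        best = min(best, time.perf_counter() - start)
    return best

# Small size so the slow dtypes also finish quickly; raise n to see the gap grow.
for dtype in (np.float32, np.int8, np.float16):
    print(np.dtype(dtype).name, time_dot(dtype, 200))
```

Absolute numbers depend heavily on the BLAS build and the core count, so only the relative ordering between dtypes is meaningful.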

I know NumPy uses BLAS, which doesn't support integers, but if I use the SciPy BLAS wrappers, i.e.

import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)

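the computation is multi-threaded. Now, blas.sgemm runs with exactly the same timing as np.dot for float32 arrays, but for non-float inputs it converts everything to float32 and outputs floats, which is something np.dot doesn't do. (In addition, b is now in F_CONTIGUOUS order, which is a lesser issue.)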

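So, if I want to do integer matrix multiplication, I have to do one of the following:

1. Use NumPy's painfully slow np.dot and be glad I get to keep the 8-bit integers.
2. Use SciPy's sgemm and use up 4x the memory.
3. Use NumPy's np.float16 and use up only 2x the memory, with the caveat that np.dot is much slower on float16 arrays than on float32 arrays, and slower even than on int8 arrays.
4. Find an optimized library for multi-threaded integer matrix multiplication (Mathematica actually does this, but I'd prefer a Python solution), ideally one supporting 1-bit arrays, although 8-bit arrays are fine too. (I'm actually aiming to multiply matrices over the finite field Z/2Z. I know I can do this with Sage, which is quite Pythonic, but again, is there something strictly Python?)

Can I follow option 4? Does such a library exist?

Disclaimer: I'm actually running NumPy + MKL, but I've tried a similar test on vanilla NumPy, with similar results.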
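
One caveat when keeping the 8-bit dtype with np.dot: the result has the same dtype as the inputs, so sums outside the int8 range (-128..127) wrap around. A minimal sketch of the effect, with array shapes chosen arbitrarily for illustration:

```python
import numpy as np

# 200 ones along the inner dimension: the true dot product is 200,
# which does not fit in int8.
a = np.ones((1, 200), dtype=np.int8)
b = np.ones((200, 1), dtype=np.int8)

narrow = np.dot(a, b)                                  # result dtype stays int8
wide = np.dot(a.astype(np.int32), b.astype(np.int32))  # widen before multiplying

print(narrow.dtype, wide.dtype)
print(wide[0, 0])  # 200
```

Widening the inputs avoids the wraparound at the cost of the extra memory the question is trying to save.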

Tags: python, multithreading, numpy, matrix-multiplication, blas

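2 Answers

Stack Overflow user

Accepted answer

Posted 2016-02-17 20:31:51

Note that as this answer ages, NumPy may gain optimized integer support. Please verify that this approach is still faster on your setup.

Option 5 - Roll a custom solution: partition the matrix product into a few sub-products and perform these in parallel. This is relatively easy to implement with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. It is therefore possible to use threads, which are relatively lightweight and can access the arrays of the main thread directly, which keeps memory use low.

Implementation: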

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    # np.dot(a, b, out) does not work here: np.dot requires a C-contiguous
    # `out`, and a block view of the output generally is not.
    out[:] = np.dot(a, b)  # costs one temporary array per block


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func, 
                                  args=(a_blocks[i, 0, :, :], 
                                        b_blocks[0, j, :, :], 
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))

With this implementation I get a speedup of approximately 4x, which matches the number of physical cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
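
A variant of the same idea, sketched with the standard library's ThreadPoolExecutor. The function name `pardot_rows` and the choice to split only the rows of `a` are assumptions for illustration, not part of the answer above. Splitting by rows keeps every output block a contiguous slice and also handles row counts that are not a multiple of the worker count. Any speedup still relies on np.dot releasing the global interpreter lock for the dtype in question.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def pardot_rows(a, b, n_workers=4):
    """Compute the matrix product of a and b, splitting the rows of `a` across threads."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    # array_split tolerates row counts that are not a multiple of n_workers.
    row_chunks = np.array_split(np.arange(a.shape[0]), n_workers)

    def work(rows):
        if rows.size:  # chunks are empty when n_workers exceeds the row count
            lo, hi = rows[0], rows[-1] + 1
            out[lo:hi] = np.dot(a[lo:hi], b)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, row_chunks))  # list() re-raises worker exceptions
    return out
```

For example, `pardot_rows(a, a, n_workers=4)` should return the same array as `np.dot(a, a)`.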

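6 votes

Stack Overflow user

Posted 2018-02-21 14:20:09

"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: first, CPUs have high-throughput floating-point pipelines; second, BLAS has no integer types.

Workaround: converting the matrices to float32 gives a big speedup. How does 90x on a 2015 MacBook Pro sound? (Using float64 is only half as good.)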

import numpy as np
import time

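def timeit(func):
    start = time.time()
    func()
    end = time.time()
    return end - start

# np.random.random_integers is deprecated; randint's upper bound is exclusive,
# so (0, 10) draws from 0..9.
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

timeit(lambda: a.dot(a))  # ≈0.9 sec
# Note: both results end up as int8, so entries above 127 do not survive.
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8) )  # ≈0.01 sec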
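
The float32 detour gives exact integer results only while every entry of the product, and every partial sum, stays below 2**24 in magnitude, the limit up to which float32 represents all integers exactly. Below is a hedged sketch of a wrapper that checks this bound and returns int32 instead of int8. The names `int_matmul_via_float32` and `matmul_gf2` are hypothetical, and the Z/2Z helper simply reduces the ordinary product modulo 2, which suits the 0/1 matrices from the question.

```python
import numpy as np

def int_matmul_via_float32(a, b):
    """Multiply integer matrices through float32 and return an int32 result."""
    k = a.shape[1]
    # Largest magnitudes, computed in Python ints to avoid int8 overflow on -128.
    amax = max(abs(int(a.min())), abs(int(a.max())))
    bmax = max(abs(int(b.min())), abs(int(b.max())))
    # Worst-case magnitude of any entry (and any partial sum) of the product.
    if k * amax * bmax >= 2**24:
        raise ValueError("product may not be exactly representable in float32")
    c = np.dot(a.astype(np.float32), b.astype(np.float32))
    return c.astype(np.int32)

def matmul_gf2(a, b):
    """Matrix product over Z/2Z for 0/1 matrices: ordinary product reduced mod 2."""
    return (int_matmul_via_float32(a, b) % 2).astype(np.int8)
```

For 0/1 matrices the bound reduces to the inner dimension being smaller than 2**24, so the 10000 x 10000 case from the question fits comfortably, at the cost of the float32 copies.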

2 votes

Page content originally provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/35101312
