My end goal is to accelerate the computation of a matrix-vector product in Python, potentially by using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~0.05), and the vector x is 15k elements and dense; I am computing Ax. I have to perform this computation many times, so making it as fast as possible would be ideal.
My current non-GPU "optimization" is to represent A as a scipy.sparse.csc_matrix object and simply compute A.dot(x), but I was hoping to speed this up on a VM with a couple of NVIDIA GPUs attached, using only Python if possible (i.e. not writing kernel functions by hand). I've succeeded in accelerating dense matrix-vector products using the cudamat library, but not for the sparse case. There are a few suggestions for the sparse case online, such as using pycuda, or scikit-cuda, or anaconda's accelerate package, but there's not a ton of information, so it's hard to know where to begin.
I don't need greatly detailed instructions, but if anyone has solved this before and could provide a "big picture" roadmap of the simplest way to do it, or has a sense of the kind of speedup a sparse GPU-based matrix-vector product would offer over scipy's sparse algorithms, that would be very helpful.
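For concreteness, here is a minimal sketch of the CPU baseline described above; the matrix and vector are random stand-ins for the real data:
import numpy as np
from scipy.sparse import random as sparse_random

# random stand-ins: ~15k x 15k sparse A (density 0.05), dense 15k-element x
A = sparse_random(15000, 15000, density=0.05, format='csc', dtype=np.float64)
x = np.random.rand(15000)

# the repeated sparse matrix-vector product to be accelerated
for _ in range(10):
    y = A.dot(x)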
Posted on 2019-07-24 18:30:35
Another option is to use the CuPy package. It has the same interface as numpy/scipy (which is nice), and (at least for me) it turned out to be much easier to install than pycuda. The new code would look like this:
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu
A = some_sparse_matrix #(scipy.sparse.csr_matrix)
x = some_dense_vector #(numpy.ndarray)
A_gpu = csr_gpu(A) #moving A to the gpu
x_gpu = cp.array(x) #moving x to the gpu
for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu) #back to numpy object for fast indexing
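One caveat when benchmarking the above: CuPy launches kernels asynchronously, so the device should be synchronized before stopping the clock. A minimal timing sketch under that assumption, with a random test matrix standing in for the real data:
import numpy as np
import cupy as cp
from scipy.sparse import random as sparse_random
from cupyx.scipy.sparse import csr_matrix as csr_gpu
from time import time

# random stand-in data
A = sparse_random(15000, 15000, density=0.05, format='csr', dtype=np.float32)
x = np.random.rand(15000).astype(np.float32)

A_gpu = csr_gpu(A)   # one-time host-to-device transfer
x_gpu = cp.array(x)

tic = time()
for _ in range(100):
    y_gpu = A_gpu.dot(x_gpu)
cp.cuda.Device().synchronize()  # wait for all queued kernels to finish
toc = time()
print('100 GPU SpMV calls took {:.4f} seconds'.format(toc - tic))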
Posted on 2018-03-03 02:49:25
Thanks for the suggestions.
I managed to get pyculib's csrmm (matrix multiplication for compressed-sparse-row-formatted matrices) operation working using the following (on Google Cloud Platform with two NVIDIA K80 GPUs), but unfortunately wasn't able to achieve a speedup.
I assume this is because most of the time in the csrmm function is spent transferring data to/from the GPU, as opposed to actually doing the computation. Unfortunately, I couldn't figure out any straightforward way in pyculib to get the arrays onto the GPU up front and keep them there over iterations. The code I used is:
import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time
def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""
    # args to csrmm call
    trans_a = 'N'  # non-transpose; use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0]  # num rows in a
    n = b.shape[1]  # num cols in b, c
    k = a_sparse.shape[1]  # num cols in a
    nnz = len(a_sparse.data)  # num nonzeros in a
    alpha = 1  # no scaling
    descr_a = sp.matdescr(  # matrix descriptor
        indexbase=0,  # 0-based indexing
        matrixtype='G',  # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data  # csr data
    csr_row_ptr_a = a_sparse.indptr  # csr row pointers
    csr_col_ind_a = a_sparse.indices  # csr col idxs
    ldb = b.shape[0]
    beta = 0
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = m  # leading dimension of c (equal to b.shape[0] here since a is square)
    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()
    return c, toc - tic
# run benchmark
COUNT = 20
N = 5000
P = 0.1
print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(np.float32)
sp = Sparse()
# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()
print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))
# pyculib sparse
print('Testing pyculib sparse matrix multiplication...')
c, t = spmv_cuda(a_sparse, b, sp, COUNT)
print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))
This produces the output:
Constructing objects...
scipy sparse matrix multiplication took 0.05158638954162598 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
Testing pyculib sparse matrix multiplication...
pyculib sparse matrix multiplication took 0.12598299980163574 seconds
c = [ 122.29483032 127.83659363 128.75003052 130.6912384 124.98326111]
As you can see, pyculib is more than twice as slow, even though the matrix multiplication takes place on the GPU. Again, this is probably due to the overhead of transferring data to/from the GPU on each iteration.
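One way to sanity-check that hypothesis is to time the host-to-device transfer separately from the multiply itself. A rough sketch using CuPy (mentioned in the other answer; this is illustrative and not part of the original benchmark):
import numpy as np
import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu
from scipy.sparse import random as sparse_random
from time import time

a_sparse = sparse_random(5000, 5000, density=0.1, format='csr', dtype=np.float32)
b = np.random.rand(5000, 1).astype(np.float32)

tic = time()
a_gpu = csr_gpu(a_sparse)  # host-to-device copy of A
b_gpu = cp.array(b)        # host-to-device copy of b
cp.cuda.Device().synchronize()
print('transfer took {:.4f} seconds'.format(time() - tic))

tic = time()
c_gpu = a_gpu.dot(b_gpu)   # the SpMV itself
cp.cuda.Device().synchronize()
print('multiply took {:.4f} seconds'.format(time() - tic))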
However, an alternative solution I found was to use Andreas Kloeckner's pycuda library, which yielded a ~50x speedup!
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time
def spmv_cuda(a_sparse, b, count):
    dtype = a_sparse.dtype
    m = a_sparse.shape[0]
    print('moving objects to GPU...')
    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)
    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)
    print('executing spmv operation...\n')
    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    toc = time()
    return d_c.get(), toc - tic
# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32
print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)
b = np.random.rand(N, 1).astype(DTYPE)
# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()
print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))
# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()
print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))
# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))
This produces the following output:
numpy dense matrix multiplication took 0.2290663719177246 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
scipy sparse matrix multiplication took 0.24468040466308594 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
moving objects to GPU...
executing spmv operation...
pycuda sparse matrix multiplication took 0.004545450210571289 seconds
c = [ 122.29484558 127.83656311 128.75004578 130.69120789 124.98323059]
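As far as I can tell, the key difference from the pyculib attempt above is that PacketedSpMV does all of its preprocessing (presumably including the graph partitioning that the pymetis dependency noted below provides) once at construction time, and d_b and d_c stay resident on the GPU across iterations, so the timed loop contains no host-device transfers at all.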
Note 1: pycuda requires the following dependencies:
pip install pymetis
sudo apt install nvidia-cuda-toolkit
Note 2: for some reason, pip install pycuda fails to install the file pkt_build_cython.pyx, so you'll need to download/copy that file yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.
https://stackoverflow.com/questions/49019189