Question: Possible SciPy sparse array memory leak in Python

Stack Overflow user

Asked on 2022-03-06 23:36:13

1 answer · 243 views · 0 followers · 0 votes

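Edit 3: TL;DR: my problem was caused by my matrix not being sparse enough, and by calculating the size of the sparse array incorrectly.

I'm hoping someone can explain why this happens. I'm using Colab with 51 GB of RAM, and I need to load float32 data from an H5 file. I am able to load a test H5 file as a numpy array, with RAM at ~45 GB. I load it in batches (21 in total) and stack them. When I instead load each batch into numpy, convert it to sparse, and hstack the results, memory explodes and I get an OOM after batch 12 or so.

This code simulates it; you can change the data size to test it on your machine. Process memory keeps growing even though the sizes I measure for the variables in memory do not account for it. What is going on? What am I doing wrong?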

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')
for k in range(8):
  if all_x is None:
    all_x = x2
  else:
    all_x = sparse.hstack([all_x, x2])
  print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
  print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
  gc.collect()
  print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
  print('_____________________')

GB on Memory SPARSE  0.481035332
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 0.6028389760464576
_____________________
GB on Memory ALL SPARSE  0.481035332
GB USED BEFORE GC 4.62065664
GB USED AFTER GC 4.6206976
_____________________
GB on Memory ALL SPARSE  0.962070664
GB USED BEFORE GC 8.473133056
GB USED AFTER GC 8.473133056
_____________________
GB on Memory ALL SPARSE  1.443105996
GB USED BEFORE GC 12.325183488
GB USED AFTER GC 12.325183488
_____________________
GB on Memory ALL SPARSE  1.924141328
GB USED BEFORE GC 17.140740096
GB USED AFTER GC 17.140740096
_____________________
GB on Memory ALL SPARSE  2.40517666
GB USED BEFORE GC 20.512710656
GB USED AFTER GC 20.512710656
_____________________
GB on Memory ALL SPARSE  2.886211992
GB USED BEFORE GC 22.920142848
GB USED AFTER GC 22.920142848
_____________________
GB on Memory ALL SPARSE  3.367247324
GB USED BEFORE GC 29.660889088
GB USED AFTER GC 29.660889088
_____________________
GB on Memory ALL SPARSE  3.848282656
GB USED BEFORE GC 33.99727104
GB USED AFTER GC 33.99727104
_____________________

Edit: I stacked a list with numpy hstack and it works fine.

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')

all_x = np.hstack([x]*21)

print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
print('_____________________')

Output

GB on Memory SPARSE  0.480956104
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 0.6027396866113227
_____________________
GB on Memory ALL SPARSE  16.756948992
GB USED BEFORE GC 38.169387008
GB USED AFTER GC 38.169411584
_____________________

However, when I do the same thing with the sparse matrix, I get an OOM. Going by the byte counts, the sparse matrix should be smaller.

import os, psutil
import gc
gc.enable()
from scipy import sparse
import numpy as np
all_x = None
x = (1*(np.random.rand(97406, 2048)>0.39721115241072164)).astype('float32')
x2 = sparse.csr_matrix(x)
print('GB on Memory SPARSE ', x2.data.nbytes/ 10**9)
print('GB on Memory NUMPY ', x.nbytes/ 10**9)
print('sparse to dense mat ratio', x2.data.nbytes/ x.nbytes)
print('_____________________')

all_x = sparse.hstack([x2]*21)

print('GB on Memory ALL SPARSE ', all_x.data.nbytes/ 10**9)
print('GB USED BEFORE GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
gc.collect()
print('GB USED AFTER GC', psutil.Process(os.getpid()).memory_info().rss/ 10**9) 
print('_____________________')

But when I run this, it fails with an OOM error.

Edit 2: it seems I was calculating the true size of the sparse matrix incorrectly. It can be calculated with

def bytes_in_sparse(a):
  return  a.data.nbytes + a.indptr.nbytes + a.indices.nbytes

The true comparison between the dense and sparse arrays is

GB on Memory SPARSE  0.962395268
GB on Memory NUMPY  0.797949952
sparse to dense mat ratio 1.2060847495357703
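The size helper above only covers CSR-style attributes, and a COO matrix stores `row` and `col` instead of `indices` and `indptr`. Here is a minimal sketch of a helper that handles both layouts. The name `sparse_nbytes` and the small test shape are my own choices for illustration, not something from the original code.

```python
import numpy as np
from scipy import sparse


def sparse_nbytes(a):
    """Sum the bytes of the arrays backing a CSR, CSC, or COO matrix."""
    if a.format in ("csr", "csc"):
        return a.data.nbytes + a.indices.nbytes + a.indptr.nbytes
    if a.format == "coo":
        return a.data.nbytes + a.row.nbytes + a.col.nbytes
    raise ValueError(f"unsupported sparse format: {a.format}")


# Small stand-in for the real data, with a density of roughly 0.6.
rng = np.random.default_rng(0)
x = (rng.random((974, 204)) > 0.4).astype("float32")
m_csr = sparse.csr_matrix(x)
m_coo = m_csr.tocoo()

print("dense bytes:", x.nbytes)
print("csr bytes:  ", sparse_nbytes(m_csr))
print("coo bytes:  ", sparse_nbytes(m_coo))
```

At this density both sparse layouts should come out larger than the dense array, with COO the largest of the three.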

Once I use sparse.hstack, the two variables become different types of sparse matrices.

all_x, x2

Output

(<97406x4096 sparse matrix of type '<class 'numpy.float32'>'
    with 240476696 stored elements in COOrdinate format>,
 <97406x2048 sparse matrix of type '<class 'numpy.float32'>'
    with 120238348 stored elements in Compressed Sparse Row format>)

scipy

sparse-matrix

python

numpy

1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-03-07 01:51:59

With smaller dimensions, so I don't hang my computer:

In [50]: x = (1 * (np.random.rand(974, 204) > 0.39721115241072164)).astype("float32")
In [51]: x.nbytes
Out[51]: 794784

The csr and its approximate memory use:

In [52]: M = sparse.csr_matrix(x)
In [53]: M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
Out[53]: 960308

hstack actually uses the coo format:

In [54]: Mo = M.tocoo()
In [55]: Mo.data.nbytes + Mo.row.nbytes + Mo.col.nbytes
Out[55]: 1434612
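Since COO costs more per nonzero than CSR, one thing worth trying is to collect the batches in a list, stack once, and ask `sparse.hstack` for CSR output through its `format` argument. This is only a sketch on the small test shape; the `batches` list stands in for the real H5 batches, and requesting CSR does not necessarily lower the peak memory used while the stack is being built.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
x = (rng.random((974, 204)) > 0.4).astype("float32")
M = sparse.csr_matrix(x)

# Collect the batches in a list and stack once, instead of re-stacking
# the growing result on every loop iteration.
batches = [M for _ in range(10)]
stacked = sparse.hstack(batches, format="csr")

print(stacked.format, stacked.shape, stacked.nnz)
```

Stacking once avoids repeatedly copying the accumulated matrix, which is what the loop in the question does on each pass.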

Combining 10 copies, nbytes increases by a factor of 10:

In [56]: xx = np.hstack([x]*10)
In [57]: xx.shape
Out[57]: (974, 2040)

The same holds for the sparse case:

In [58]: MM = sparse.hstack([M] * 10)
In [59]: MM.shape
Out[59]: (974, 2040)
In [60]: xx.nbytes
Out[60]: 7947840
In [61]: MM
Out[61]: 
<974x2040 sparse matrix of type '<class 'numpy.float32'>'
    with 1195510 stored elements in Compressed Sparse Row format>
In [62]: M
Out[62]: 
<974x204 sparse matrix of type '<class 'numpy.float32'>'
    with 119551 stored elements in Compressed Sparse Row format>
In [63]: MM.data.nbytes + MM.indices.nbytes + MM.indptr.nbytes
Out[63]: 9567980

A sparse density of

In [65]: M.nnz / np.prod(M.shape)
Out[65]: 0.6016779401699078

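does not save memory. If you want to save both memory and calculation time (especially for matrix multiplication), a density of 0.1 or smaller is a good working density.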
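The byte counts in this session can be reproduced with plain arithmetic, which also shows where the break-even lies. With 4-byte values and 4-byte indices, CSR stores about 8 bytes per nonzero against 4 bytes per dense element, so CSR only becomes smaller below a density of roughly 0.5. The function names below are mine, and the 4-byte index width is an assumption that matches the numbers above; very large matrices may need 8-byte indices, which would shift the estimate.

```python
def csr_nbytes(n_rows, nnz, value_bytes=4, index_bytes=4):
    # data + indices hold one entry per nonzero; indptr holds n_rows + 1 entries.
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def coo_nbytes(nnz, value_bytes=4, index_bytes=4):
    # data, row, and col each hold one entry per nonzero.
    return nnz * (value_bytes + 2 * index_bytes)


def dense_nbytes(n_rows, n_cols, value_bytes=4):
    return n_rows * n_cols * value_bytes


n_rows, n_cols, nnz = 974, 204, 119551  # shape and nnz of M above
print(dense_nbytes(n_rows, n_cols))  # 794784, as in Out[51]
print(csr_nbytes(n_rows, nnz))       # 960308, as in Out[53]
print(coo_nbytes(nnz))               # 1434612, as in Out[55]
```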

In [66]: (x@x.T).shape
Out[66]: (974, 974)
In [67]: timeit(x@x.T).shape
10.1 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [68]: (M@M.T).shape
Out[68]: (974, 974)
In [69]: timeit(M@M.T).shape
220 ms ± 91.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
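To see how the timing gap changes with density on your own machine, something like the following could be used. It is a rough harness built on the standard `timeit` module rather than the IPython magic, and `time_matmul` is a name I made up for it. The results depend on hardware and library builds, so no expected timings are given.

```python
import timeit

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)


def time_matmul(density, shape=(974, 204), repeats=3):
    """Return the best-of-N seconds for dense and sparse x @ x.T."""
    x = (rng.random(shape) < density).astype("float32")
    M = sparse.csr_matrix(x)
    dense_t = min(timeit.repeat(lambda: x @ x.T, number=1, repeat=repeats))
    sparse_t = min(timeit.repeat(lambda: M @ M.T, number=1, repeat=repeats))
    return dense_t, sparse_t


for density in (0.6, 0.1, 0.01):
    dense_t, sparse_t = time_matmul(density)
    print(f"density={density}: dense {dense_t * 1e3:.2f} ms, "
          f"sparse {sparse_t * 1e3:.2f} ms")
```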

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/71375061
