I need to write a very "tall" two-column array to a text file, and it is very slow. I found that if I reshape the array into a wider one, the write is much faster. For example:
import time
import numpy as np
dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)
start = time.perf_counter()
with open('test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)
start = time.perf_counter()
with open('test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

Since the three data matrices contain the same number of elements, why does the last one take so much longer to write than the other two? And is there any way to speed up writing a "tall" array?
Posted on 2018-12-17 19:44:41
As hpaulj pointed out, savetxt loops over the rows of X and formats each row individually:
for row in X:
    try:
        v = format % tuple(row) + newline
    except TypeError:
        raise TypeError("Mismatch between array dtype ('%s') and "
                        "format specifier ('%s')"
                        % (str(X.dtype), format))
    fh.write(v)

I think the main time sink is all of the string-interpolation calls. If we pack all of the string interpolation into a single call, things go much faster:
with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)

The full timing script:

import io
import time
import numpy as np
dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('/tmp/test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)
start = time.perf_counter()
with open('/tmp/test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)
start = time.perf_counter()
with open('/tmp/test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)
start = time.perf_counter()
with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)
end = time.perf_counter()
print(end-start)

This reports:
0.1604848340011813
0.17416274400056864
0.6634929459996783
0.16207673999997496

Posted on 2018-12-17 18:19:35
The code for savetxt is Python and accessible. Basically, it performs a formatted write for each row/line. In effect it does:
for row in arr:
    f.write(fmt % tuple(row))

where fmt is derived from your fmt parameter and the shape of the array, e.g.:
'%g %g %g ...'

So it performs one file write per row of the array. Formatting each row also takes some time, but that part is done in memory by Python code.
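The row-wise loop described above can be sketched in a few lines (this is a simplified illustration of the behavior, not savetxt's actual source; the file path and array size are arbitrary):

```python
import numpy as np

arr = np.random.rand(5, 2)
delimiter = ' '

# Build the per-row format string the way savetxt does: repeat the user's
# fmt ('%g') once per column, joined by the delimiter -> '%g %g' here.
row_fmt = delimiter.join(['%g'] * arr.shape[1])

with open('/tmp/rowwise.txt', 'w') as f:
    for row in arr:
        # One string interpolation and one file write per row; for a
        # 500000-row array that is 500000 separate calls.
        f.write(row_fmt % tuple(row) + '\n')
```

This makes the cost model clear: the number of format-and-write calls scales with the number of rows, not the number of elements, which is why a tall, narrow array is the worst case.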
I expect loadtxt/genfromtxt to show the same time pattern: reading many rows takes longer.
pandas has a faster csv load. I haven't seen any discussion of its write speed.
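Since the answer leaves pandas' write speed as an open question, here is a minimal sketch of how one could time it for the same tall array, assuming pandas is installed (the output path and use of DataFrame.to_csv with float_format='%g' are my choices, not part of the original answers):

```python
import time
import numpy as np
import pandas as pd

dataMat3 = np.random.rand(500000, 2)

# Time writing the tall array via pandas; to_csv handles the
# formatting and writing internally rather than row by row in Python.
start = time.perf_counter()
pd.DataFrame(dataMat3).to_csv('/tmp/test5.txt', sep=' ',
                              header=False, index=False,
                              float_format='%g')
end = time.perf_counter()
print(end - start)
```

The resulting file has the same space-delimited '%g' layout as the savetxt versions, so the timings are directly comparable on your machine.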
https://stackoverflow.com/questions/53820891