Q: Pandas sparse data is larger on disk than the dense version

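Stack Overflow user

Asked 2014-02-06 18:16:00

Answers 1, Views 4K, Followers 0, Votes 5

I am finding that the sparse version of a DataFrame is actually much larger when saved to disk than the dense version. What am I doing wrong?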

test = pd.DataFrame(ones((4,4000)))
test.ix[:,:] = nan
test.ix[0,0] = 47

test.to_hdf('test3', 'df')
test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

test.to_pickle('test5')
test.to_sparse(fill_value=nan).to_pickle('test6')

....
ls -sh test*
200K test3   16M test4  164K test5  516K test6

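Using pandas version 0.12.0.

Ultimately I want to efficiently store 10^7 by 60 arrays with a density of about 10%, then pull them into Pandas DataFrames and play with them.

Edit: Thanks to Jeff for answering the original question. Follow-up question: this appears to give savings only for pickling, not when using other formats such as HDF5. Is pickling my best route?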

print shape(array_activity) #This is just 0s and 1s
(1020000, 60)

test = pd.DataFrame(array_activity)
test_sparse = test.to_sparse()
print test_sparse.density
0.0832333496732

test.to_hdf('1', 'df')
test_sparse.to_hdf('2', 'df')
test.to_pickle('3')
test_sparse.to_pickle('4')
!ls -sh 1 2 3 4
477M 1  544M 2  477M 3   83M 4

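This data, stored as a list of indices in a Matlab .mat file, is less than 12M. I am eager to get it into an HDF5/PyTables format so that I can grab just specific indices (other files are much larger and take much longer to load into memory), and then readily do Pandasy things with them. Perhaps I am not going about this the right way?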
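For reference, the "list of indices" layout mentioned above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `to_index_list` and `to_dense` are hypothetical, the tiny `activity` matrix is made up, and nothing here reflects how the .mat file is actually structured.

```python
from array import array


def to_index_list(rows):
    """Return (n_rows, n_cols, row_ids, col_ids) for the nonzero cells of a 0/1 matrix."""
    row_ids = array("l")
    col_ids = array("l")
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell:
                row_ids.append(r)
                col_ids.append(c)
    n_cols = len(rows[0]) if rows else 0
    return len(rows), n_cols, row_ids, col_ids


def to_dense(n_rows, n_cols, row_ids, col_ids):
    """Rebuild the dense 0/1 matrix from the stored coordinates."""
    rows = [[0] * n_cols for _ in range(n_rows)]
    for r, c in zip(row_ids, col_ids):
        rows[r][c] = 1
    return rows


# Made-up example data: 3 rows, 4 columns, 3 nonzero cells.
activity = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
shape_and_ids = to_index_list(activity)
assert to_dense(*shape_and_ids) == activity  # the round trip is lossless
print(list(shape_and_ids[2]), list(shape_and_ids[3]))  # [0, 2, 2] [1, 0, 3]
```

With this layout, storage grows with the number of nonzero cells rather than with rows times columns.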

python

pandas

sparse-matrix

sparse-array

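1 Answer

Stack Overflow user

Accepted answer

Answered 2014-02-06 18:29:58

You are creating a frame that has 4000 columns and only 4 rows; sparse is dealt with row-wise, so reverse the dimensions.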

In [2]: from numpy import *

In [3]: test = pd.DataFrame(ones((4000,4)))

In [4]: test.ix[:,:] = nan

In [5]: test.ix[0,0] = 47

In [6]: test.to_hdf('test3', 'df')

In [7]: test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

In [8]: test.to_pickle('test5')

In [9]: test.to_sparse(fill_value=nan).to_pickle('test6')

In [11]: !ls -sh test3 test4 test5 test6
164K test3  148K test4  160K test5   36K test6
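Why the orientation matters can be illustrated without pandas. The sketch below is a simplified model, not pandas' actual on-disk layout: it assumes each column is stored as its own sparse record (a length plus position and value arrays), so every column carries a fixed overhead no matter how few values it holds. The helper names `make_frame`, `sparse_columns` and `pickled_size` are hypothetical.

```python
import math
import pickle
from array import array


def make_frame(n_rows, n_cols):
    """Build a column-major all-NaN frame with a single value at [0][0]."""
    cols = [[math.nan] * n_rows for _ in range(n_cols)]
    cols[0][0] = 47.0
    return cols


def sparse_columns(cols):
    """Model each column as its own sparse record: (length, positions, values)."""
    records = []
    for col in cols:
        positions = array("l", (i for i, v in enumerate(col) if not math.isnan(v)))
        values = array("d", (v for v in col if not math.isnan(v)))
        records.append((len(col), positions, values))
    return records


def pickled_size(obj):
    """Size in bytes of the pickled object."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


wide = make_frame(n_rows=4, n_cols=4000)   # the question's shape
tall = make_frame(n_rows=4000, n_cols=4)   # the answer's shape

for name, frame in (("wide", wide), ("tall", tall)):
    dense = [array("d", col) for col in frame]
    print(name, "dense:", pickled_size(dense),
          "sparse:", pickled_size(sparse_columns(frame)))
```

In this model the wide layout pays the per-column overhead 4000 times, while the tall layout pays it only 4 times, which is why the tall sparse frame ends up far smaller than its dense counterpart.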

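Follow-up: the store you supplied was written in table format, and as a result it saved the dense version (sparse is not supported for the table format, which is very flexible and queryable); see the docs.

Furthermore, you may want to experiment with saving your file using the two different representations of the sparse format.

Here is a sample session: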

In [1]: df = pd.read_hdf('store_compressed.h5','test')

In [2]: type(df)
Out[2]: pandas.core.frame.DataFrame

In [3]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [4]: df.to_sparse(kind='integer').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [5]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [6]: df.to_sparse(kind='integer').to_hdf('test_integer.h5','test',mode='w',complib='blosc',complevel=9)

In [7]: df.to_hdf('test_dense_fixed.h5','test',mode='w',complib='blosc',complevel=9)

In [8]: df.to_hdf('test_dense_table.h5','test',mode='w',format='table',complib='blosc',complevel=9)

In [9]: !ls -ltr *.h5
-rwxrwxr-x 1 jreback users 57015522 Feb  6 18:19 store_compressed.h5
-rw-rw-r-- 1 jreback users 30335044 Feb  6 19:01 test_block.h5
-rw-rw-r-- 1 jreback users 28547220 Feb  6 19:02 test_integer.h5
-rw-rw-r-- 1 jreback users 44540381 Feb  6 19:02 test_dense_fixed.h5
-rw-rw-r-- 1 jreback users 57744418 Feb  6 19:03 test_dense_table.h5

IIRC there is a bug in 0.12 in that to_hdf does not pass all of its arguments through, so you need to use:

with get_store('test.h5',mode='w',complib='blosc',complevel=9) as store:
    store.put('test',df)

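These are stored basically as a collection of SparseSeries, so if the density is low and the values are non-contiguous, the result will not be as small as it could be. The pandas sparse suite deals better with a smaller number of contiguous blocks, though YMMV. Other sparse handling tools are available as well.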
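The difference between the two `kind` options can be sketched in plain Python. This is a rough model under assumptions, not pandas' implementation: the "integer" encoding is taken to record the position of every stored value, and the "block" encoding a (start, length) pair per contiguous run. The function names are hypothetical.

```python
def int_index(mask):
    """'integer' style: record the position of every stored value."""
    return [i for i, present in enumerate(mask) if present]


def block_index(mask):
    """'block' style: record (start, length) for each contiguous run of stored values."""
    blocks = []
    start = None
    for i, present in enumerate(mask):
        if present and start is None:
            start = i                          # a new run begins
        elif not present and start is not None:
            blocks.append((start, i - start))  # the current run ends
            start = None
    if start is not None:
        blocks.append((start, len(mask) - start))
    return blocks


def index_cost(mask):
    """Number of integers each encoding needs for the same mask."""
    return {"integer": len(int_index(mask)), "block": 2 * len(block_index(mask))}


contiguous = [False] * 10 + [True] * 10 + [False] * 80   # one run of 10 values
scattered = [i % 10 == 0 for i in range(100)]             # 10 isolated values

print(index_cost(contiguous))  # {'integer': 10, 'block': 2}
print(index_cost(scattered))   # {'integer': 10, 'block': 20}
```

Under these assumptions, block indexing wins when the stored values cluster into a few runs and loses when they are scattered, which matches the advice to try both kinds on your own data.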

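IMHO, though, these are pretty trivial sizes for HDF5 files anyway; you can handle a gigantic number of rows, and file sizes in the tens or hundreds of gigabytes can easily be handled (though recommended).

Furthermore, if this is indeed a lookup table that you want to query, you might consider using the table format.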

Votes 6

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/21610804
