Q: Pandas "Group By" query on large data in HDFStore?

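Stack Overflow user

Asked 2013-04-04 05:11:04

Answers: 1, Views: 8.9K, Followers: 0, Votes: 21

I have about 7 million rows in an HDFStore with more than 60 columns. The data is more than I can fit into memory. I want to aggregate the data into groups based on the value of column "A". The pandas documentation for splitting/aggregating/combining assumes that I already have all the data in a DataFrame, but I can't read the entire store into an in-memory DataFrame. What is the correct approach for grouping data in an HDFStore?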

pytables

python

pandas

1 Answer

Stack Overflow user

Accepted answer

Answered 2013-04-04 08:00:20

Here is a complete example.

import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})


# create the store and append, declaring as data_columns the columns
# I might want to group or filter on
with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])
    print("store:\n%s" % store.info())

    print("\ndf:\n%s" % store['df'])

    # get the groups
    groups = store.select_column('df', 'A').unique()
    print("\ngroups:%s" % groups)

    # iterate over the groups and apply my operations
    results = []
    for g in groups:

        # %r quotes the group label so the where clause reads A == 'foo'
        grp = store.select('df', where='A == %r' % g)

        # this is a regular frame, aggregate however you would like
        results.append(grp[['D', 'E', 'F']].sum())

    print("\nresult:\n%s" % pd.concat(results, keys=groups))

os.remove(fname)

Output

store:
<class 'pandas.io.pytables.HDFStore'>
File path: groupby.h5
/df            frame_table  (typ->appendable,nrows->11,ncols->6,indexers->[index],dc->[A,B,C])

df:
      A    B      C         D         E         F
0   foo  one   dull -0.815212 -1.195488 -1.346980
1   foo  one   dull -1.111686 -1.814385 -0.974327
2   foo  one  shiny -1.069152 -1.926265  0.360318
3   foo  two   dull -0.472180  0.698369 -1.007010
4   bar  one   dull  1.329867  0.709621  1.877898
5   bar  one  shiny -0.962906  0.489594 -0.663068
6   bar  one  shiny -0.657922 -0.377705  0.065790
7   bar  two   dull -0.172245  1.694245  1.374189
8   foo  two  shiny -0.780877 -2.334895 -2.747404
9   foo  two  shiny -0.257413  0.577804 -0.159316
10  foo  one  shiny  0.737597  1.979373 -0.236070

groups:Index([bar, foo], dtype=object)

result:
bar  D   -0.463206
     E    2.515754
     F    2.654810
foo  D   -3.768923
     E   -4.015488
     F   -6.110789
dtype: float64

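Some caveats:

1) This approach makes sense when the number of groups is relatively low, on the order of hundreds or thousands. With more groups than that, there are more efficient (but more complicated) methods, and the function you apply (sum in this case) becomes more restricted.

Essentially, you would iterate over the entire store in chunks, grouping as you go, but keeping the groups only semi-collapsed (imagine computing a mean: you need to keep a running total plus a running count, then divide at the end). Some operations are therefore a bit trickier, but this can potentially handle MANY groups, and it is really fast.

2) Efficiency could be improved by saving the coordinates (e.g. the group locations), but this is a bit more complicated.

3) Multi-grouping is not possible with this scheme (it is possible, but requires an approach more like 2) above).

4) The columns you want to group on MUST be data_columns!

5) You can combine any other filter you like in the select, by the way. This is a sneaky way of doing multi-grouping: form two lists of unique group values and iterate over their product. It is not very efficient with lots of groups, but it can work.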
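The chunked approach described under caveat 1 could look roughly like the sketch below. It is only an illustration under some assumptions: the helper name `chunked_group_mean` is made up for this example, and the chunks come from a small in-memory frame so the snippet runs without an HDF5 file. With a real store, the chunks could instead come from something like `store.select('df', columns=['A', 'D', 'E'], chunksize=500000)`, where the chunk size is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def chunked_group_mean(chunks, key, value_cols):
    """Per-group mean from an iterable of DataFrame chunks.

    Hypothetical helper: keeps only a running sum and a running count
    per group, then divides once at the end.
    """
    sums = None
    counts = None
    for chunk in chunks:
        grouped = chunk.groupby(key)[value_cols]
        chunk_sums = grouped.sum()
        chunk_counts = grouped.count()
        if sums is None:
            sums, counts = chunk_sums, chunk_counts
        else:
            # add() aligns on the group label; fill_value=0 means a group
            # missing from one side contributes nothing.
            sums = sums.add(chunk_sums, fill_value=0)
            counts = counts.add(chunk_counts, fill_value=0)
    if sums is None:
        raise ValueError("no chunks to aggregate")
    return sums / counts


# In-memory stand-in for the store, split into chunks of 3 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar', 'foo', 'baz', 'bar'],
    'D': rng.standard_normal(7),
    'E': rng.standard_normal(7),
})
chunks = (df.iloc[i:i + 3] for i in range(0, len(df), 3))

result = chunked_group_mean(chunks, 'A', ['D', 'E'])
print(result)
```

Only the per-group sums and counts stay in memory, so the footprint grows with the number of groups rather than the number of rows. The same pattern works for any aggregate that can be rebuilt from partial results (sum, count, min, max, mean), but not directly for ones like the median.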

HTH

Let me know if this works for you.

Votes: 20

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/15798209
