在之前的前面几篇博客中,详细介绍了Pandas的一些基础和高级特性。今天博主继续介绍一个Pandas的进阶之数据聚合。
pandas可以支持像sql语句那样,对数据进行聚合操作。比如:groupby,combine等等。
我们可以将一个Pandas的DataFrame结构进行拆分-应用-合并操作。比如:
In [6]: df = DataFrame({'key1':'aabba','key2':["one","two","one","two
...: ","one"],"data1":np.random.randn(5),'data2':np.random.randn(5
...: )})
In [7]: df
Out[7]:
data1 data2 key1 key2
0 -0.228886 0.167925 a one
1 -1.556655 0.640775 a two
2 -0.508787 -0.578234 b one
3 -0.279801 0.089786 b two
4 -0.823011 1.359131 a one
然后我们可以根据按照key1或者key2在某一列上进行分组,比如:
In [13]: df.groupby('key1')
Out[13]: <pandas.core.groupby.DataFrameGroupBy object at 0x1065c2510>
In [14]: df.groupby('key1').mean()
Out[14]:
data1 data2
key1
a -0.869517 0.722611
b -0.394294 -0.244224
In [18]: df.groupby(['key1','key2']).mean()
Out[18]:
data1 data2
key1 key2
a one -0.525948 0.763528
two -1.556655 0.640775
b one -0.508787 -0.578234
two -0.279801 0.089786
In [19]: df.groupby(['key1','key2']).size()
Out[19]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
In [27]: df.groupby('key1')['data1'].mean()
Out[27]:
key1
a -0.869517
b -0.394294
Name: data1, dtype: float64
GroupBy对象实际上并没有进行任何计算,只是保留了一份中间数据而已,当执行mean()才会进行将数据分组聚合应用。 如果我们已经对数据进行了分组,然后想对分组的数据进行可定制化的操作那么如何迭代?
In [21]: for name,group in df.groupby('key1'):
...: print name
...: print group
...:
a
data1 data2 key1 key2
0 -0.228886 0.167925 a one
1 -1.556655 0.640775 a two
4 -0.823011 1.359131 a one
b
data1 data2 key1 key2
2 -0.508787 -0.578234 b one
3 -0.279801 0.089786 b two
In [22]: for (name1,name2),group in df.groupby(['key1','key2']):
...: print (name1,name2)
...: print group
...:
...:
('a', 'one')
data1 data2 key1 key2
0 -0.228886 0.167925 a one
4 -0.823011 1.359131 a one
('a', 'two')
data1 data2 key1 key2
1 -1.556655 0.640775 a two
('b', 'one')
data1 data2 key1 key2
2 -0.508787 -0.578234 b one
('b', 'two')
data1 data2 key1 key2
3 -0.279801 0.089786 b two
groupby默认是在axis=0(行)分组,也可以在列上分组。我们可以通过传入已分组数据进行进行分组,比如:
In [29]: df = DataFrame(np.random.rand(5,5),columns=list('abcde'),index=['Jo','St','We','Ji','Tr'])
In [30]: df
Out[30]:
a b c d e
Jo 0.223651 0.374765 0.039368 0.742583 0.083534
St 0.389264 0.938515 0.495707 0.785569 0.605133
We 0.536896 0.326160 0.557036 0.262893 0.367037
Ji 0.590111 0.801715 0.111961 0.046658 0.474800
Tr 0.365418 0.885797 0.397164 0.642887 0.760261
In [37]: df.loc[2:3,['b','c']]=np.nan
In [38]: df
Out[38]:
a b c d e
Jo 0.223651 0.374765 0.039368 0.742583 0.083534
St 0.389264 0.938515 0.495707 0.785569 0.605133
We 0.536896 NaN NaN 0.262893 0.367037
Ji 0.590111 0.801715 0.111961 0.046658 0.474800
Tr 0.365418 0.885797 0.397164 0.642887 0.760261
In [51]: mapping={'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
In [52]: by = df.groupby(mapping,axis=1)
In [53]: by
Out[53]: <pandas.core.groupby.DataFrameGroupBy object at 0x106876850>
In [54]: by.sum()
Out[54]:
blue red
Jo 0.781950 0.681950
St 1.281276 1.932913
We 0.262893 0.903934
Ji 0.158619 1.866626
Tr 1.040051 2.011476
In [55]: by.count()
Out[55]:
blue red
Jo 2 3
St 2 3
We 1 2
Ji 2 3
Tr 2 3
groupby还可以通过函数和索引级别来分组,比如:
In [59]: df.groupby(len).sum()
Out[59]:
a b c d e
2 2.10534 3.000792 1.0442 2.480589 2.290765
In [73]: df.groupby([len,key_list]).min()
Out[73]:
a b c d e
2 one 0.223651 0.374765 0.039368 0.262893 0.083534
two 0.365418 0.801715 0.111961 0.046658 0.474800
上述都可以通过聚合之后的对对象操作。
In [82]: df
Out[82]:
data1 data2 key1 key2
0 2.346840 1.574532 a one
1 -0.225204 -0.555727 a two
2 0.129150 -0.219370 b one
3 -0.856133 -1.382830 b two
4 -0.300663 -2.461491 a one
In [83]: grouped = df.groupby('key1')
In [84]: def custom(arr):
...: return arr.max()-arr.min()
...:
In [85]: grouped.agg(custom)
Out[85]:
data1 data2
key1
a 2.647503 4.036022
b 0.985283 1.163460
In [90]: grouped.agg('mean')
Out[90]:
data1 data2
key1
a 0.606991 -0.480895
b -0.363492 -0.801100
In [91]: grouped.agg(['std','mean',custom])
Out[91]:
data1 data2
std mean custom std mean custom
key1
a 1.507226 0.606991 2.647503 2.019051 -0.480895 4.036022
b 0.696700 -0.363492 0.985283 0.822690 -0.801100 1.163460
#如果传入的是由元组构成的列表
In [93]: grouped.agg([('s','std'),('m','mean'),('c',custom)])
Out[93]:
data1 data2
s m c s m c
key1
a 1.507226 0.606991 2.647503 2.019051 -0.480895 4.036022
b 0.696700 -0.363492 0.985283 0.822690 -0.801100 1.163460
分组级别的运算可以通过transform进行广播运算,比如:
In [104]: df
Out[104]:
a b c d e
Jo 0.416052 0.627895 0.862025 0.059178 0.679535
St 0.411298 0.364009 0.977781 0.028793 0.513383
We 0.067403 0.843924 0.419891 0.603240 0.787490
Ji 0.268642 0.722305 0.235712 0.647945 0.399922
Tr 0.413175 0.099380 0.966640 0.700351 0.321278
In [102]: def custom(arr):
...: return arr-arr.mean()
...:
In [103]: df.groupby(['one','two','one','two','one']).transform(custom)
Out[103]:
a b c d e
Jo 0.117176 0.104162 0.112506 -0.395078 0.083434
St 0.071328 -0.179148 0.371034 -0.309576 0.056731
We -0.231474 0.320191 -0.329628 0.148984 0.191389
Ji -0.071328 0.179148 -0.371034 0.309576 -0.056731
Tr 0.114298 -0.424353 0.217121 0.246094 -0.274823
transform会将一个函数应用到各个分组,然后将结果放置到适当的位置上。但是它只能产生两种结果:要么产生一个可以广播的标量(np.mean),要么产生一个相同大小的结果数组。
In [119]: def top(df,n=5,column='tip_pct'):
...: return df.sort_values(by=column,axis='index')[-n:]
In [120]: top(tips,n=6)
Out[120]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
1.《Python 数据分析》 2.《Python数据挖掘与分析》 3.《利用Python进行数据分析》