This article is the author's translation; the original is at https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
Recap
pandas primarily uses np.nan to represent missing data.
Reindexing allows you to change, add, and delete the index on a specified axis. It returns a copy of the data.
In [9]: import pandas as pd
...: import numpy as np
...:
...:
...: s = pd.Series([1, 3, 5, np.nan, 6, 8])
...: #s
...: dates = pd.date_range('20130101', periods=6)
...: #dates
...: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
...:
...: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
...: df['F'] = s1
...: df.at[dates[0], 'A'] = 0
...: df.iat[0, 1] = 0
...: df.loc[:, 'D'] = np.array([5] * len(df))
In [10]: df
Out[10]:
A B C D F
2013-01-01 0.000000 0.000000 1.199418 5 NaN
2013-01-02 1.534227 0.769859 -0.674221 5 1.0
2013-01-03 0.874073 -0.389967 -0.670989 5 2.0
2013-01-04 0.942903 0.778283 -0.700779 5 3.0
2013-01-05 1.607993 -0.032569 1.991723 5 4.0
2013-01-06 0.345273 0.558341 0.792286 5 5.0
In [11]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [12]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [13]: df1
Out[13]:
A B C D F E
2013-01-01 0.000000 0.000000 1.199418 5 NaN 1.0
2013-01-02 1.534227 0.769859 -0.674221 5 1.0 1.0
2013-01-03 0.874073 -0.389967 -0.670989 5 2.0 NaN
2013-01-04 0.942903 0.778283 -0.700779 5 3.0 NaN
Drop any rows that contain missing data:
In [14]: df1.dropna(how='any')
Out[14]:
A B C D F E
2013-01-02 1.534227 0.769859 -0.674221 5 1.0 1.0
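Beyond how='any', dropna() also accepts thresh and subset parameters for finer control. A minimal sketch on a small hypothetical frame (not the tutorial's df1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})

# how='all' drops a row only when every value in it is missing
kept_any = df.dropna(how='all')

# thresh=2 keeps rows with at least 2 non-missing values
kept_thresh = df.dropna(thresh=2)

# subset restricts the missing-value check to the given columns
kept_subset = df.dropna(subset=['A'])
```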
Fill in missing values:
In [15]: df1.fillna(value=5)
Out[15]:
A B C D F E
2013-01-01 0.000000 0.000000 1.199418 5 5.0 1.0
2013-01-02 1.534227 0.769859 -0.674221 5 1.0 1.0
2013-01-03 0.874073 -0.389967 -0.670989 5 2.0 5.0
2013-01-04 0.942903 0.778283 -0.700779 5 3.0 5.0
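fillna() is not limited to a single scalar: a per-column fill value can be passed as a dict, and ffill() propagates the last valid observation forward. A small sketch (the frame here is illustrative, not the tutorial's df1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, np.nan]})

# A per-column fill value can be given as a dict
filled = df.fillna(value={'A': 0, 'B': -1})

# ffill propagates the last valid observation forward;
# a leading NaN has nothing to copy from and stays missing
propagated = df.ffill()
```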
Get a boolean mask of where values are NaN:
In [16]: pd.isna(df1)
Out[16]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
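The boolean mask returned by isna() composes with the usual reductions, which gives a quick per-column count of missing values. A brief sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, np.nan]})

# Summing the boolean mask counts missing values column by column
missing_per_col = df.isna().sum()

# notna() is the complement; all(axis=1) flags fully populated rows
complete_rows = df.notna().all(axis=1).sum()
```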
Statistics. Operations in general exclude missing data. Performing a descriptive statistic:
In [17]: df.mean() #axis=0
Out[17]:
A 0.884078
B 0.280658
C 0.322906
D 5.000000
F 3.000000
dtype: float64
The same operation on the other axis:
In [18]: df.mean(1) # axis=1
Out[18]:
2013-01-01 1.549854
2013-01-02 1.525973
2013-01-03 1.362623
2013-01-04 1.804082
2013-01-05 2.513429
2013-01-06 2.339180
Freq: D, dtype: float64
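The "excludes missing data" behaviour is controlled by the skipna flag, which the reduction methods accept. A minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = s.mean(skipna=False)  # NaN propagates into the result
```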
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
In [19]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [20]: s
Out[20]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [21]: df.sub(s, axis='index')
Out[21]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.125927 -1.389967 -1.670989 4.0 1.0
2013-01-04 -2.057097 -2.221717 -3.700779 2.0 0.0
2013-01-05 -3.392007 -5.032569 -3.008277 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
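The axis argument of sub() decides which axis the Series is aligned against: 'index' matches it to the rows (as above), 'columns' to the column labels. A small sketch with deterministic numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
s = pd.Series([10, 20], index=['A', 'B'])

# axis='columns' aligns the Series index with the DataFrame columns
by_cols = df.sub(s, axis='columns')

# axis='index' aligns with the row index instead
row_s = pd.Series([1, 1, 1], index=df.index)
by_rows = df.sub(row_s, axis='index')
```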
Apply. Applying functions to the data:
In [22]: df.apply(np.cumsum)  # cumulative sum down each column
Out[22]:
A B C D F
2013-01-01 0.000000 0.000000 1.199418 5 NaN
2013-01-02 1.534227 0.769859 0.525197 10 1.0
2013-01-03 2.408300 0.379892 -0.145793 15 3.0
2013-01-04 3.351203 1.158175 -0.846572 20 6.0
2013-01-05 4.959196 1.125606 1.145151 25 10.0
2013-01-06 5.304469 1.683947 1.937437 30 15.0
In [25]: df.apply(lambda x: x.max() - x.min())  # difference between the max and min of each column
Out[25]:
A 1.607993
B 1.168250
C 2.692502
D 0.000000
F 4.000000
dtype: float64
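apply() works row-wise as well: passing axis=1 hands each row to the function instead of each column. A quick illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# axis=1 applies the function to each row rather than each column
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
```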
Histogramming:
In [26]: s = pd.Series(np.random.randint(0, 7, size=10))
In [27]: s
Out[27]:
0 5
1 1
2 3
3 6
4 6
5 6
6 1
7 5
8 4
9 4
dtype: int32
In [28]: s.value_counts()
Out[28]:
6 3
5 2
4 2
1 2
3 1
dtype: int64
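value_counts() sorts by count in descending order; passing normalize=True (a detail not shown above) returns relative frequencies instead of raw counts:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

counts = s.value_counts()               # raw counts, largest first
freqs = s.value_counts(normalize=True)  # relative frequencies, sum to 1
```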
String methods. Series is equipped with a set of string processing methods that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always does).
In [32]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [33]: s.str.lower()
Out[33]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
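To see the regular-expression default in action: str.contains() treats its pattern as a regex unless regex=False is passed. A short sketch on a subset of the strings above:

```python
import pandas as pd

s = pd.Series(['Aaba', 'Baca', 'CABA', 'dog'])

# '^[AB]' is a regex: strings beginning with a capital A or B
has_prefix = s.str.contains('^[AB]')

# regex=False switches to a plain (case-sensitive) substring test
has_literal = s.str.contains('AB', regex=False)
```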
Concat
pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations.
Concatenating pandas objects together with concat():
In [34]: df = pd.DataFrame(np.random.randn(10, 4))
In [35]: df
Out[35]:
0 1 2 3
0 2.173213 -0.598088 1.706941 1.058619
1 0.050132 0.772387 -0.375141 0.653398
2 1.903610 -0.840157 0.923137 0.598493
3 0.437401 -1.832521 0.535153 0.894782
4 -0.807621 0.280262 0.994054 0.402910
5 -0.717370 -1.312313 -1.041917 0.949461
6 0.923145 0.723936 -1.682883 1.406243
7 1.270020 1.309892 0.765172 1.322076
8 -1.432259 1.566898 0.186364 0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310
# break the DataFrame into pieces
In [36]: pieces = [df[:2], df[4:7], df[8:]]
# concatenate the pieces back together (row-wise)
In [37]: pd.concat(pieces)
Out[37]:
0 1 2 3
0 2.173213 -0.598088 1.706941 1.058619
1 0.050132 0.772387 -0.375141 0.653398
4 -0.807621 0.280262 0.994054 0.402910
5 -0.717370 -1.312313 -1.041917 0.949461
6 0.923145 0.723936 -1.682883 1.406243
8 -1.432259 1.566898 0.186364 0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310
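concat() has two commonly used options not exercised above: axis=1 glues frames together side by side (aligning on the row index), and keys labels each piece with an extra outer index level. A minimal sketch:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# axis=1 concatenates side by side, aligning on the row index
wide = pd.concat([a, b], axis=1)

# keys adds an outer index level marking which piece each row came from
tall = pd.concat([a, a], keys=['first', 'second'])
```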
Join. SQL-style merges:
In [38]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [39]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [40]: left
Out[40]:
key lval
0 foo 1
1 foo 2
In [41]: right
Out[41]:
key rval
0 foo 4
1 foo 5
In [42]: pd.merge(left, right, on='key')
Out[42]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example:
In [43]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [44]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [45]: left
Out[45]:
key lval
0 foo 1
1 bar 2
In [46]: right
Out[46]:
key rval
0 foo 4
1 bar 5
In [47]: pd.merge(left, right, on='key')
Out[47]:
key lval rval
0 foo 1 4
1 bar 2 5
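merge() defaults to an inner join (the intersection of keys); how='outer' keeps the union, filling the gaps with NaN. A small sketch with partially overlapping, hypothetical keys:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # keys in both: foo
outer = pd.merge(left, right, on='key', how='outer')  # union: foo, bar, baz
```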
Append
Append rows to a DataFrame:
In [48]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
In [49]: df
Out[49]:
A B C D
0 -0.541936 0.304257 -0.666548 1.448552
1 0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3 0.690052 -0.605041 -0.465595 1.114489
4 0.913118 -0.605127 0.366342 0.393939
5 1.444903 0.274196 1.650371 -0.598488
6 0.063602 1.016262 0.493207 -0.814833
7 -0.042801 -1.752630 0.159843 0.645406
In [50]: s = df.iloc[3]  # select the row at integer position 3
In [51]: df.append(s, ignore_index=True)
Out[51]:
A B C D
0 -0.541936 0.304257 -0.666548 1.448552
1 0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3 0.690052 -0.605041 -0.465595 1.114489
4 0.913118 -0.605127 0.366342 0.393939
5 1.444903 0.274196 1.650371 -0.598488
6 0.063602 1.016262 0.493207 -0.814833
7 -0.042801 -1.752630 0.159843 0.645406
8 0.690052 -0.605041 -0.465595 1.114489
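A caveat for readers on current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. The same result is obtained with concat(); a sketch of the equivalent call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[1]

# Equivalent of the removed df.append(s, ignore_index=True):
# turn the row back into a one-row frame, then concatenate
appended = pd.concat([df, s.to_frame().T], ignore_index=True)
```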
Grouping. By "group by" we are referring to a process involving one or more of the following steps: splitting the data into groups based on some criteria; applying a function to each group independently; and combining the results into a data structure.
In [52]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B': ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C': np.random.randn(8),
...: 'D': np.random.randn(8)})
In [53]: df
Out[53]:
A B C D
0 foo one 0.214543 -0.743767
1 bar one 0.933127 -0.380128
2 foo two 0.106048 0.527558
3 bar three 0.905622 -0.892613
4 foo two -0.289133 1.136363
5 bar two -0.323147 -0.013935
6 foo one 1.070073 -0.038362
7 foo three 0.130030 0.455695
Group by a column and then apply sum() to the resulting groups:
In [54]: df.groupby('A').sum()
Out[54]:
C D
A
bar 1.515602 -1.286676
foo 1.231561 1.337486
Grouping by multiple columns forms a hierarchical index, and again we can apply sum():
In [55]: df.groupby(['A', 'B']).sum()
Out[55]:
C D
A B
bar one 0.933127 -0.380128
three 0.905622 -0.892613
two -0.323147 -0.013935
foo one 1.284616 -0.782129
three 0.130030 0.455695
two -0.183085 1.663921
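groupby() is not limited to sum(): agg() applies several reductions at once, yielding one column per function. A compact sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# agg applies several reductions at once to each group
stats = df.groupby('A')['C'].agg(['sum', 'mean'])
```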
Stacking
In [56]: tuples = list(zip( * [['bar', 'bar', 'baz', 'baz',
...: 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two',
...: 'one', 'two', 'one', 'two']]))
In [57]: tuples
Out[57]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [58]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [59]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [60]: df
Out[60]:
A B
first second
bar one -0.236559 -1.052202
two 0.663142 -0.426109
baz one 0.903963 -0.326161
two -0.640158 1.695094
foo one 0.684574 -0.352153
two -0.156945 -0.282294
qux one 0.853196 0.920496
two -0.131899 1.526732
In [61]: df2 = df[:4]
In [62]: df2
Out[62]:
A B
first second
bar one -0.236559 -1.052202
two 0.663142 -0.426109
baz one 0.903963 -0.326161
two -0.640158 1.695094
The stack() method "compresses" a level in the DataFrame's columns, reducing it by one level (columns move into the row index):
In [63]: stacked = df2.stack()
In [64]: stacked
Out[64]:
first second
bar one A -0.236559
B -1.052202
two A 0.663142
B -0.426109
baz one A 0.903963
B -0.326161
two A -0.640158
B 1.695094
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
In [66]: stacked.unstack()  # inverse of stack
Out[66]:
A B
first second
bar one -0.236559 -1.052202
two 0.663142 -0.426109
baz one 0.903963 -0.326161
two -0.640158 1.695094
In [67]: stacked.unstack(1)
Out[67]:
second one two
first
bar A -0.236559 0.663142
B -1.052202 -0.426109
baz A 0.903963 -0.640158
B -0.326161 1.695094
In [68]: stacked.unstack(0)
Out[68]:
first bar baz
second
one A -0.236559 0.903963
B -1.052202 -0.326161
two A 0.663142 -0.640158
B -0.426109 1.695094
Pivot tables
In [69]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three']*3,
...: 'B': ['A', 'B', 'C']*4,
...: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar']*2,
...: 'D': np.random.randn(12),
...: 'E': np.random.randn(12)})
In [70]: df
Out[70]:
A B C D E
0 one A foo 0.356812 0.508259
1 one B foo -0.223585 2.090394
2 two C foo -0.366076 0.835971
3 three A bar -0.107184 0.377065
4 one B bar 0.335025 1.876482
5 one C bar 1.300853 -0.437791
6 two A foo -1.127656 1.240071
7 three B foo -0.719621 -0.541175
8 one C foo -0.464972 0.662851
9 one A bar -1.283444 -0.518707
10 two B bar -0.118690 -1.065746
11 three C bar -0.605458 0.908448
We can produce pivot tables from this data very easily:
In [71]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[71]:
C bar foo
A B
one A -1.283444 0.356812
B 0.335025 -0.223585
C 1.300853 -0.464972
three A -0.107184 NaN
B NaN -0.719621
C -0.605458 NaN
two A NaN -1.127656
B -0.118690 NaN
C NaN -0.366076
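pivot_table() aggregates with 'mean' by default; the aggfunc parameter swaps in another reduction. A small sketch with one observation per cell, so the result is easy to verify:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['x', 'y', 'x', 'y'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# aggfunc defaults to 'mean'; other reductions such as 'sum' can be requested
table = pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum')
```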
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
In [72]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [73]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [74]: ts.resample('5Min').sum()
Out[74]:
2012-01-01 26018
Freq: 5T, dtype: int32
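To make the resampling arithmetic easy to check, here is a deterministic sketch: 100 one-second observations fall inside a single 5-minute bucket, so the sum is 0 + 1 + ... + 99 = 4950. (The lowercase 's'/'5min' aliases are used for compatibility with recent pandas releases.)

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-01-01', periods=100, freq='s')  # one point per second
ts = pd.Series(np.arange(100), index=rng)

# 100 seconds < 5 minutes, so everything lands in one bucket
per_5min = ts.resample('5min').sum()
```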
Time zone representation:
In [80]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [81]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [82]: ts
Out[82]:
2012-03-06 0.874556
2012-03-07 3.030650
2012-03-08 -0.286538
2012-03-09 0.154653
2012-03-10 -0.513948
Freq: D, dtype: float64
In [83]: ts_utc = ts.tz_localize('UTC')
In [84]: ts_utc
Out[84]:
2012-03-06 00:00:00+00:00 0.874556
2012-03-07 00:00:00+00:00 3.030650
2012-03-08 00:00:00+00:00 -0.286538
2012-03-09 00:00:00+00:00 0.154653
2012-03-10 00:00:00+00:00 -0.513948
Freq: D, dtype: float64
Converting to another time zone:
In [85]: ts_utc.tz_convert('US/Eastern')
Out[85]:
2012-03-05 19:00:00-05:00 0.874556
2012-03-06 19:00:00-05:00 3.030650
2012-03-07 19:00:00-05:00 -0.286538
2012-03-08 19:00:00-05:00 0.154653
2012-03-09 19:00:00-05:00 -0.513948
Freq: D, dtype: float64
Converting between time span representations:
In [86]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [87]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [88]: ts
Out[88]:
2012-01-31 -0.135612
2012-02-29 2.293425
2012-03-31 0.789431
2012-04-30 -0.525814
2012-05-31 1.558781
Freq: M, dtype: float64
In [89]: ps = ts.to_period()
# Convert to periods; by default the frequency is inferred from the timestamps,
# but it can also be given explicitly: ts.to_period('M') / ts.to_period('Y')
In [90]: ps
Out[90]:
2012-01 -0.135612
2012-02 2.293425
2012-03 0.789431
2012-04 -0.525814
2012-05 1.558781
Freq: M, dtype: float64
In [91]: ps.to_timestamp()
Out[91]:
2012-01-01 -0.135612
2012-02-01 2.293425
2012-03-01 0.789431
2012-04-01 -0.525814
2012-05-01 1.558781
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we take a quarterly frequency whose fiscal year ends in November and convert to 9am of the first day of the month following each quarter end.
In [95]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
# pandas supports 12 quarterly frequencies, Q-JAN through Q-DEC; Q-NOV means the fiscal year ends in November
In [96]: prng
Out[96]:
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='period[Q-NOV]', freq='Q-NOV')
In [97]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [98]: ts.head()
Out[98]:
1990Q1 -0.394540
1990Q2 0.380933
1990Q3 0.108135
1990Q4 0.874617
1991Q1 -0.560237
Freq: Q-NOV, dtype: float64
In [99]: ts.index = prng.asfreq('M', 'e')
# Convert to the last month of each quarter; equivalent to prng.asfreq('M', how='e')
In [100]: ts
Out[100]:
1990-02 -0.394540
1990-05 0.380933
1990-08 0.108135
1990-11 0.874617
1991-02 -0.560237
1991-05 -0.912935
Freq: M, dtype: float64
In [101]: ts.index = prng.asfreq('M', 'e') + 1
# Convert to the month following each quarter end
In [103]: ts.head()
Out[103]:
1990-03 -0.394540
1990-06 0.380933
1990-09 0.108135
1990-12 0.874617
1991-03 -0.560237
Freq: M, dtype: float64
In [104]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
# Convert to 9am of the first day of the month following each quarter end;
# equivalent to ts.index = (prng.asfreq('M', how='end') + 1).asfreq('H', how='start') + 9
In [107]: ts.head()
Out[107]:
1990-03-01 09:00 -0.394540
1990-06-01 09:00 0.380933
1990-09-01 09:00 0.108135
1990-12-01 09:00 0.874617
1991-03-01 09:00 -0.560237
Freq: H, dtype: float64