This article is the author's translation; the original is at https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
Recap
pandas primarily uses np.nan to represent missing data.
Reindexing allows you to change, add, and delete the index on a specified axis. It returns a copy of the data.
In [9]: import pandas as pd
...: import numpy as np
...:
...:
...: s = pd.Series([1, 3, 5, np.nan, 6, 8])
...: #s
...: dates = pd.date_range('20130101', periods=6)
...: #dates
...: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
...:
...: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
...: df['F'] = s1
...: df.at[dates[0], 'A'] = 0
...: df.iat[0, 1] = 0
...: df.loc[:, 'D'] = np.array([5] * len(df))
In [10]: df
Out[10]:
A B C D F
2013-01-01 0.000000 0.000000 1.199418 5 NaN
2013-01-02 1.534227 0.769859 -0.674221 5 1.0
2013-01-03 0.874073 -0.389967 -0.670989 5 2.0
2013-01-04 0.942903 0.778283 -0.700779 5 3.0
2013-01-05 1.607993 -0.032569 1.991723 5 4.0
2013-01-06 0.345273 0.558341 0.792286 5 5.0
In [11]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [12]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [13]: df1
Out[13]:
A B C D F E
2013-01-01 0.000000 0.000000 1.199418 5 NaN 1.0
2013-01-02 1.534227 0.769859 -0.674221 5 1.0 1.0
2013-01-03 0.874073 -0.389967 -0.670989 5 2.0 NaN
2013-01-04 0.942903 0.778283 -0.700779 5 3.0 NaN
Drop any rows that contain missing data:
In [14]: df1.dropna(how='any')
Out[14]:
A B C D F E
2013-01-02 1.534227 0.769859 -0.674221 5 1.0 1.0
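Beyond how='any', dropna() also accepts thresh and subset parameters for finer control. A minimal sketch on a small hypothetical frame (not the tutorial's df1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})

# how='all' drops a row only when every value in it is missing
kept_any = df.dropna(how='all')

# thresh=2 keeps rows with at least 2 non-missing values
kept_thresh = df.dropna(thresh=2)

# subset restricts the missing-value check to the given columns
kept_subset = df.dropna(subset=['A'])
```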
Fill in missing values:
In [15]: df1.fillna(value=5)
Out[15]:
A B C D F E
2013-01-01 0.000000 0.000000 1.199418 5 5.0 1.0
2013-01-02 1.534227 0.769859 -0.674221 5 1.0 1.0
2013-01-03 0.874073 -0.389967 -0.670989 5 2.0 5.0
2013-01-04 0.942903 0.778283 -0.700779 5 3.0 5.0
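fillna() is not limited to a single scalar: a per-column fill value can be passed as a dict, and ffill() propagates the last valid observation forward. A small sketch (the frame here is illustrative, not the tutorial's df1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, np.nan]})

# A per-column fill value can be given as a dict
filled = df.fillna(value={'A': 0, 'B': -1})

# ffill propagates the last valid observation forward;
# a leading NaN has nothing to copy from and stays missing
propagated = df.ffill()
```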
Get a boolean mask of where values are NaN:
In [16]: pd.isna(df1)
Out[16]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
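The boolean mask returned by isna() composes with the usual reductions, which gives a quick per-column count of missing values. A brief sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, np.nan]})

# Summing the boolean mask counts missing values column by column
missing_per_col = df.isna().sum()

# notna() is the complement; all(axis=1) flags fully populated rows
complete_rows = df.notna().all(axis=1).sum()
```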
Statistics. Operations in general exclude missing data. Performing a descriptive statistic:
In [17]: df.mean() #axis=0
Out[17]:
A 0.884078
B 0.280658
C 0.322906
D 5.000000
F 3.000000
dtype: float64
The same operation on the other axis:
In [18]: df.mean(1) # axis=1
Out[18]:
2013-01-01 1.549854
2013-01-02 1.525973
2013-01-03 1.362623
2013-01-04 1.804082
2013-01-05 2.513429
2013-01-06 2.339180
Freq: D, dtype: float64
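The "excludes missing data" behaviour is controlled by the skipna flag, which the reduction methods accept. A minimal illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = s.mean(skipna=False)  # NaN propagates into the result
```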
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
In [19]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [20]: s
Out[20]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [21]: df.sub(s, axis='index')
Out[21]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.125927 -1.389967 -1.670989 4.0 1.0
2013-01-04 -2.057097 -2.221717 -3.700779 2.0 0.0
2013-01-05 -3.392007 -5.032569 -3.008277 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
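The axis argument of sub() decides which axis the Series is aligned against: 'index' matches it to the rows (as above), 'columns' to the column labels. A small sketch with deterministic numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
s = pd.Series([10, 20], index=['A', 'B'])

# axis='columns' aligns the Series index with the DataFrame columns
by_cols = df.sub(s, axis='columns')

# axis='index' aligns with the row index instead
row_s = pd.Series([1, 1, 1], index=df.index)
by_rows = df.sub(row_s, axis='index')
```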
Apply. Applying functions to the data:
In [22]: df.apply(np.cumsum)  # cumulative sum down each column
Out[22]:
A B C D F
2013-01-01 0.000000 0.000000 1.199418 5 NaN
2013-01-02 1.534227 0.769859 0.525197 10 1.0
2013-01-03 2.408300 0.379892 -0.145793 15 3.0
2013-01-04 3.351203 1.158175 -0.846572 20 6.0
2013-01-05 4.959196 1.125606 1.145151 25 10.0
2013-01-06 5.304469 1.683947 1.937437 30 15.0
In [25]: df.apply(lambda x: x.max() - x.min())  # difference between the max and min of each column
Out[25]:
A 1.607993
B 1.168250
C 2.692502
D 0.000000
F 4.000000
dtype: float64
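apply() works row-wise as well: passing axis=1 hands each row to the function instead of each column. A quick illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# axis=1 applies the function to each row rather than each column
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
```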
Histogramming:
In [26]: s = pd.Series(np.random.randint(0, 7, size=10))
In [27]: s
Out[27]:
0 5
1 1
2 3
3 6
4 6
5 6
6 1
7 5
8 4
9 4
dtype: int32
In [28]: s.value_counts()
Out[28]:
6 3
5 2
4 2
1 2
3 1
dtype: int64
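value_counts() sorts by count in descending order; passing normalize=True (a detail not shown above) returns relative frequencies instead of raw counts:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

counts = s.value_counts()               # raw counts, largest first
freqs = s.value_counts(normalize=True)  # relative frequencies, sum to 1
```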
String methods. Series is equipped with a set of string processing methods that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always does).
In [32]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [33]: s.str.lower()
Out[33]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
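To see the regular-expression default in action: str.contains() treats its pattern as a regex unless regex=False is passed. A short sketch on a subset of the strings above:

```python
import pandas as pd

s = pd.Series(['Aaba', 'Baca', 'CABA', 'dog'])

# '^[AB]' is a regex: strings beginning with a capital A or B
has_prefix = s.str.contains('^[AB]')

# regex=False switches to a plain (case-sensitive) substring test
has_literal = s.str.contains('AB', regex=False)
```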
Concat
pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations.
Concatenating pandas objects together with concat():
In [34]: df = pd.DataFrame(np.random.randn(10, 4))
In [35]: df
Out[35]:
0 1 2 3
0 2.173213 -0.598088 1.706941 1.058619
1 0.050132 0.772387 -0.375141 0.653398
2 1.903610 -0.840157 0.923137 0.598493
3 0.437401 -1.832521 0.535153 0.894782
4 -0.807621 0.280262 0.994054 0.402910
5 -0.717370 -1.312313 -1.041917 0.949461
6 0.923145 0.723936 -1.682883 1.406243
7 1.270020 1.309892 0.765172 1.322076
8 -1.432259 1.566898 0.186364 0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310
# break the DataFrame into pieces
In [36]: pieces = [df[:2], df[4:7], df[8:]]
# concatenate the pieces back together (row-wise)
In [37]: pd.concat(pieces)
Out[37]:
0 1 2 3
0 2.173213 -0.598088 1.706941 1.058619
1 0.050132 0.772387 -0.375141 0.653398
4 -0.807621 0.280262 0.994054 0.402910
5 -0.717370 -1.312313 -1.041917 0.949461
6 0.923145 0.723936 -1.682883 1.406243
8 -1.432259 1.566898 0.186364 0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310
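concat() has two commonly used options not exercised above: axis=1 glues frames together side by side (aligning on the row index), and keys labels each piece with an extra outer index level. A minimal sketch:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# axis=1 concatenates side by side, aligning on the row index
wide = pd.concat([a, b], axis=1)

# keys adds an outer index level marking which piece each row came from
tall = pd.concat([a, a], keys=['first', 'second'])
```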
Join. SQL-style merges:
In [38]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [39]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [40]: left
Out[40]:
key lval
0 foo 1
1 foo 2
In [41]: right
Out[41]:
key rval
0 foo 4
1 foo 5
In [42]: pd.merge(left, right, on='key')
Out[42]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
Another example:
In [43]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [44]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [45]: left
Out[45]:
key lval
0 foo 1
1 bar 2
In [46]: right
Out[46]:
key rval
0 foo 4
1 bar 5
In [47]: pd.merge(left, right, on='key')
Out[47]:
key lval rval
0 foo 1 4
1 bar 2 5
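merge() defaults to an inner join (the intersection of keys); how='outer' keeps the union, filling the gaps with NaN. A small sketch with partially overlapping, hypothetical keys:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # keys in both: foo
outer = pd.merge(left, right, on='key', how='outer')  # union: foo, bar, baz
```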
Append
Append rows to a DataFrame:
In [48]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
In [49]: df
Out[49]:
A B C D
0 -0.541936 0.304257 -0.666548 1.448552
1 0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3 0.690052 -0.605041 -0.465595 1.114489
4 0.913118 -0.605127 0.366342 0.393939
5 1.444903 0.274196 1.650371 -0.598488
6 0.063602 1.016262 0.493207 -0.814833
7 -0.042801 -1.752630 0.159843 0.645406
In [50]: s = df.iloc[3]  # select the row at integer position 3
In [51]: df.append(s, ignore_index=True)
Out[51]:
A B C D
0 -0.541936 0.304257 -0.666548 1.448552
1 0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3 0.690052 -0.605041 -0.465595 1.114489
4 0.913118 -0.605127 0.366342 0.393939
5 1.444903 0.274196 1.650371 -0.598488
6 0.063602 1.016262 0.493207 -0.814833
7 -0.042801 -1.752630 0.159843 0.645406
8 0.690052 -0.605041 -0.465595 1.114489
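A caveat for readers on current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. The same result is obtained with concat(); a sketch of the equivalent call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[1]

# Equivalent of the removed df.append(s, ignore_index=True):
# turn the row back into a one-row frame, then concatenate
appended = pd.concat([df, s.to_frame().T], ignore_index=True)
```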
Grouping. By "group by" we are referring to a process involving one or more of the following steps: splitting the data into groups based on some criteria; applying a function to each group independently; and combining the results into a data structure.
In [52]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B': ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C': np.random.randn(8),
...: 'D': np.random.randn(8)})
In [53]: df
Out[53]:
A B C D
0 foo one 0.214543 -0.743767
1 bar one 0.933127 -0.380128
2 foo two 0.106048 0.527558
3 bar three 0.905622 -0.892613
4 foo two -0.289133 1.136363
5 bar two -0.323147 -0.013935
6 foo one 1.070073 -0.038362
7 foo three 0.130030 0.455695
Group by a column and then apply sum() to the resulting groups:
In [54]: df.groupby('A').sum()
Out[54]:
C D
A
bar 1.515602 -1.286676
foo 1.231561 1.337486
Grouping by multiple columns forms a hierarchical index, and again we can apply sum():
In [55]: df.groupby(['A', 'B']).sum()
Out[55]:
C D
A B
bar one 0.933127 -0.380128
three 0.905622 -0.892613
two -0.323147 -0.013935
foo one 1.284616 -0.782129
three 0.130030 0.455695
two -0.183085 1.663921
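groupby() is not limited to sum(): agg() applies several reductions at once, yielding one column per function. A compact sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# agg applies several reductions at once to each group
stats = df.groupby('A')['C'].agg(['sum', 'mean'])
```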
Stacking
In [56]: tuples = list(zip( * [['bar', 'bar', 'baz', 'baz',
...: 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two',
...: 'one', 'two', 'one', 'two']]))
In [57]: tuples
Out[57]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [58]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [59]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [60]: df
Out[60]:
A B
first second
bar one -0.236559 -1.052202
two 0.663142 -0.426109
baz one 0.903963 -0.326161
two -0.640158 1.695094
foo one 0.684574 -0.352153
two -0.156945 -0.282294
qux one 0.853196 0.920496
two -0.131899 1.526732
In [61]: df2 = df[:4]
In [62]: df2
Out[62]:
A B
first second
bar one -0.236559 -1.052202
two 0.663142 -0.426109
baz one 0.903963 -0.326161
two -0.640158 1.695094
The stack() method "compresses" a level in the DataFrame's columns, reducing it by one level (columns move into the row index):
In [63]: stacked = df2.stack()
In [64]: stacked
Out[64]:
first second
bar one A -0.236559
B -1.052202
two A 0.663142
B -0.426109
baz one A 0.903963
B -0.326161
two A -0.640158
B 1.695094
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
In [66]: stacked.unstack()  # inverse of stack
Out[66]:
A B
first second
bar one -0.236559 -1.052202
two 0.663142 -0.426109
baz one 0.903963 -0.326161
two -0.640158 1.695094
In [67]: stacked.unstack(1)
Out[67]:
second one two
first
bar A -0.236559 0.663142
B -1.052202 -0.426109
baz A 0.903963 -0.640158
B -0.326161 1.695094
In [68]: stacked.unstack(0)
Out[68]:
first bar baz
second
one A -0.236559 0.903963
B -1.052202 -0.326161
two A 0.663142 -0.640158
B -0.426109 1.695094
Pivot tables
In [69]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three']*3,
...: 'B': ['A', 'B', 'C']*4,
...: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar']*2,
...: 'D': np.random.randn(12),
...: 'E': np.random.randn(12)})
In [70]: df
Out[70]:
A B C D E
0 one A foo 0.356812 0.508259
1 one B foo -0.223585 2.090394
2 two C foo -0.366076 0.835971
3 three A bar -0.107184 0.377065
4 one B bar 0.335025 1.876482
5 one C bar 1.300853 -0.437791
6 two A foo -1.127656 1.240071
7 three B foo -0.719621 -0.541175
8 one C foo -0.464972 0.662851
9 one A bar -1.283444 -0.518707
10 two B bar -0.118690 -1.065746
11 three C bar -0.605458 0.908448
We can produce pivot tables from this data very easily:
In [71]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[71]:
C bar foo
A B
one A -1.283444 0.356812
B 0.335025 -0.223585
C 1.300853 -0.464972
three A -0.107184 NaN
B NaN -0.719621
C -0.605458 NaN
two A NaN -1.127656
B -0.118690 NaN
C NaN -0.366076
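pivot_table() aggregates with 'mean' by default; the aggfunc parameter swaps in another reduction. A small sketch with one observation per cell, so the result is easy to verify:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['x', 'y', 'x', 'y'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# aggfunc defaults to 'mean'; other reductions such as 'sum' can be requested
table = pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum')
```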
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
In [72]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [73]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [74]: ts.resample('5Min').sum()
Out[74]:
2012-01-01 26018
Freq: 5T, dtype: int32
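To make the resampling arithmetic easy to check, here is a deterministic sketch: 100 one-second observations fall inside a single 5-minute bucket, so the sum is 0 + 1 + ... + 99 = 4950. (The lowercase 's'/'5min' aliases are used for compatibility with recent pandas releases.)

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-01-01', periods=100, freq='s')  # one point per second
ts = pd.Series(np.arange(100), index=rng)

# 100 seconds < 5 minutes, so everything lands in one bucket
per_5min = ts.resample('5min').sum()
```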
Time zone representation:
In [80]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [81]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [82]: ts
Out[82]:
2012-03-06 0.874556
2012-03-07 3.030650
2012-03-08 -0.286538
2012-03-09 0.154653
2012-03-10 -0.513948
Freq: D, dtype: float64
In [83]: ts_utc = ts.tz_localize('UTC')
In [84]: ts_utc
Out[84]:
2012-03-06 00:00:00+00:00 0.874556
2012-03-07 00:00:00+00:00 3.030650
2012-03-08 00:00:00+00:00 -0.286538
2012-03-09 00:00:00+00:00 0.154653
2012-03-10 00:00:00+00:00 -0.513948
Freq: D, dtype: float64
Converting to another time zone:
In [85]: ts_utc.tz_convert('US/Eastern')
Out[85]:
2012-03-05 19:00:00-05:00 0.874556
2012-03-06 19:00:00-05:00 3.030650
2012-03-07 19:00:00-05:00 -0.286538
2012-03-08 19:00:00-05:00 0.154653
2012-03-09 19:00:00-05:00 -0.513948
Freq: D, dtype: float64
Converting between time span representations:
In [86]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [87]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [88]: ts
Out[88]:
2012-01-31 -0.135612
2012-02-29 2.293425
2012-03-31 0.789431
2012-04-30 -0.525814
2012-05-31 1.558781
Freq: M, dtype: float64
In [89]: ps = ts.to_period()
# Convert to periods; by default the frequency is inferred from the timestamps,
# but it can also be given explicitly: ts.to_period('M') / ts.to_period('Y')
In [90]: ps
Out[90]:
2012-01 -0.135612
2012-02 2.293425
2012-03 0.789431
2012-04 -0.525814
2012-05 1.558781
Freq: M, dtype: float64
In [91]: ps.to_timestamp()
Out[91]:
2012-01-01 -0.135612
2012-02-01 2.293425
2012-03-01 0.789431
2012-04-01 -0.525814
2012-05-01 1.558781
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we take a quarterly frequency whose fiscal year ends in November and convert to 9am of the first day of the month following each quarter end.
In [95]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
# pandas supports 12 quarterly frequencies, Q-JAN through Q-DEC; Q-NOV means the fiscal year ends in November
In [96]: prng
Out[96]:
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='period[Q-NOV]', freq='Q-NOV')
In [97]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [98]: ts.head()
Out[98]:
1990Q1 -0.394540
1990Q2 0.380933
1990Q3 0.108135
1990Q4 0.874617
1991Q1 -0.560237
Freq: Q-NOV, dtype: float64
In [99]: ts.index = prng.asfreq('M', 'e')
# Convert to the last month of each quarter; equivalent to prng.asfreq('M', how='e')
In [100]: ts
Out[100]:
1990-02 -0.394540
1990-05 0.380933
1990-08 0.108135
1990-11 0.874617
1991-02 -0.560237
1991-05 -0.912935
Freq: M, dtype: float64
In [101]: ts.index = prng.asfreq('M', 'e') + 1
# Convert to the month following each quarter end
In [103]: ts.head()
Out[103]:
1990-03 -0.394540
1990-06 0.380933
1990-09 0.108135
1990-12 0.874617
1991-03 -0.560237
Freq: M, dtype: float64
In [104]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
# Convert to 9am of the first day of the month following each quarter end;
# equivalent to ts.index = (prng.asfreq('M', how='end') + 1).asfreq('H', how='start') + 9
In [107]: ts.head()
Out[107]:
1990-03-01 09:00 -0.394540
1990-06-01 09:00 0.380933
1990-09-01 09:00 0.108135
1990-12-01 09:00 0.874617
1991-03-01 09:00 -0.560237
Freq: H, dtype: float64