Pandas 中文官档 ~ 基础用法4

1480

发布于 2019-11-07 11:41:02

3K0

发布于 2019-11-07 11:41:02

文章被收录于专栏：数据分析1480

重置索引与更换标签

reindex() 是 pandas 里实现数据对齐的基本方法，该方法执行几乎所有功能都要用到的标签对齐功能。 reindex 指的是沿着指定轴，让数据与给定的一组标签进行匹配。该功能完成以下几项操作：

让现有数据匹配一组新标签，并重新排序；
在无数据但有标签的位置插入缺失值（NA）标记；
如果指定，则按逻辑填充无标签的数据，该操作多见于时间序列数据。

示例如下：

In [196]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [197]: s
Out[197]: 
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
e   -1.326508
dtype: float64

In [198]: s.reindex(['e', 'b', 'f', 'd'])
Out[198]: 
e   -1.326508
b    1.328614
f         NaN
d   -0.385845
dtype: float64

本例中，原 Series 里没有标签 f ，因此，输出结果里 f 对应的值为 NaN。

DataFrame 支持同时 reindex 索引与列：

In [199]: df
Out[199]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [200]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[200]: 
      three       two       one
c  1.227435  1.478369  0.695246
f       NaN       NaN       NaN
b -0.050390  1.912123  0.343054

reindex 还支持 axis 关键字：

In [201]: df.reindex(['c', 'f', 'b'], axis='index')
Out[201]: 
        one       two     three
c  0.695246  1.478369  1.227435
f       NaN       NaN       NaN
b  0.343054  1.912123 -0.050390

注意：不同对象可以共享 Index 包含的轴标签。比如，有一个 Series，还有一个 DataFrame，可以执行下列操作：

In [202]: rs = s.reindex(df.index)

In [203]: rs
Out[203]: 
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
dtype: float64

In [204]: rs.index is df.index
Out[204]: True

这里指的是，重置后，Series 的索引与 DataFrame 的索引是同一个 Python 对象。

0.21.0 版新增。

DataFrame.reindex() 还支持 “轴样式”调用习语，可以指定单个 labels 参数，并指定应用于哪个 axis。

In [205]: df.reindex(['c', 'f', 'b'], axis='index')
Out[205]: 
        one       two     three
c  0.695246  1.478369  1.227435
f       NaN       NaN       NaN
b  0.343054  1.912123 -0.050390

In [206]: df.reindex(['three', 'two', 'one'], axis='columns')
Out[206]: 
      three       two       one
a       NaN  1.772517  1.394981
b -0.050390  1.912123  0.343054
c  1.227435  1.478369  0.695246
d -0.613172  0.279344       NaN

::: tip 注意

多重索引与高级索引介绍了怎样用更简洁的方式重置索引。

:::

::: tip 注意

编写注重性能的代码时，最好花些时间深入理解 reindex：预对齐数据后，操作会更快。两个未对齐的 DataFrame 相加，后台操作会执行 reindex。探索性分析时很难注意到这点有什么不同，这是因为 reindex 已经进行了高度优化，但需要注重 CPU 周期时，显式调用 reindex 还是有一些影响的。

:::

重置索引，并与其它对象对齐

提取一个对象，并用另一个具有相同标签的对象 reindex 该对象的轴。这种操作的语法虽然简单，但未免有些啰嗦。这时，最好用 reindex_like() 方法，这是一种既有效，又简单的方式：

In [207]: df2
Out[207]: 
        one       two
a  1.394981  1.772517
b  0.343054  1.912123
c  0.695246  1.478369

In [208]: df3
Out[208]: 
        one       two
a  0.583888  0.051514
b -0.468040  0.191120
c -0.115848 -0.242634

In [209]: df.reindex_like(df2)
Out[209]: 
        one       two
a  1.394981  1.772517
b  0.343054  1.912123
c  0.695246  1.478369

用 align 对齐多个对象

align() 方法是对齐两个对象最快的方式，该方法支持 join 参数（请参阅 joining 与 merging）：

join='outer'：使用两个对象索引的合集，默认值
join='left'：使用左侧调用对象的索引
join='right'：使用右侧传递对象的索引
join='inner'：使用两个对象索引的交集

该方法返回重置索引后的两个 Series 元组：

In [210]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [211]: s1 = s[:4]

In [212]: s2 = s[1:]

In [213]: s1.align(s2)
Out[213]: 
(a   -0.186646
 b   -1.692424
 c   -0.303893
 d   -1.425662
 e         NaN
 dtype: float64, a         NaN
 b   -1.692424
 c   -0.303893
 d   -1.425662
 e    1.114285
 dtype: float64)

In [214]: s1.align(s2, join='inner')
Out[214]: 
(b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64, b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64)

In [215]: s1.align(s2, join='left')
Out[215]: 
(a   -0.186646
 b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64, a         NaN
 b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64)

默认条件下， join 方法既应用于索引，也应用于列：

In [216]: df.align(df2, join='inner')
Out[216]: 
(        one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369,         one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369)

align 方法还支持 axis 选项，用来指定要对齐的轴：

In [217]: df.align(df2, join='inner', axis=0)
Out[217]: 
(        one       two     three
 a  1.394981  1.772517       NaN
 b  0.343054  1.912123 -0.050390
 c  0.695246  1.478369  1.227435,         one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369)

如果把 Series 传递给 DataFrame.align()，可以用 axis 参数选择是在 DataFrame 的索引，还是列上对齐两个对象：

In [218]: df.align(df2.iloc[0], axis=1)
Out[218]: 
(        one     three       two
 a  1.394981       NaN  1.772517
 b  0.343054 -0.050390  1.912123
 c  0.695246  1.227435  1.478369
 d       NaN -0.613172  0.279344, one      1.394981
 three         NaN
 two      1.772517
 Name: a, dtype: float64)

方法	动作
pad / ffill	先前填充
bfill / backfill	向后填充
nearest	从最近的索引值填充

下面用一个简单的 Series 展示 fill 方法：

In [219]: rng = pd.date_range('1/3/2000', periods=8)

In [220]: ts = pd.Series(np.random.randn(8), index=rng)

In [221]: ts2 = ts[[0, 3, 6]]

In [222]: ts
Out[222]: 
2000-01-03    0.183051
2000-01-04    0.400528
2000-01-05   -0.015083
2000-01-06    2.395489
2000-01-07    1.414806
2000-01-08    0.118428
2000-01-09    0.733639
2000-01-10   -0.936077
Freq: D, dtype: float64

In [223]: ts2
Out[223]: 
2000-01-03    0.183051
2000-01-06    2.395489
2000-01-09    0.733639
dtype: float64

In [224]: ts2.reindex(ts.index)
Out[224]: 
2000-01-03    0.183051
2000-01-04         NaN
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07         NaN
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10         NaN
Freq: D, dtype: float64

In [225]: ts2.reindex(ts.index, method='ffill')
Out[225]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    0.183051
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    2.395489
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

In [226]: ts2.reindex(ts.index, method='bfill')
Out[226]: 
2000-01-03    0.183051
2000-01-04    2.395489
2000-01-05    2.395489
2000-01-06    2.395489
2000-01-07    0.733639
2000-01-08    0.733639
2000-01-09    0.733639
2000-01-10         NaN
Freq: D, dtype: float64

In [227]: ts2.reindex(ts.index, method='nearest')
Out[227]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    2.395489
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    0.733639
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

上述操作要求索引按递增或递减排序。

注意：除了 method='nearest'，用 fillna 或 interpolate 也能实现同样的效果：

In [228]: ts2.reindex(ts.index).fillna(method='ffill')
Out[228]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    0.183051
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    2.395489
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

如果索引不是按递增或递减排序，reindex() 会触发 ValueError 错误。fillna() 与 interpolate() 则不检查索引的排序。

重置索引填充的限制

limit 与 tolerance 参数可以控制 reindex 的填充操作。limit 限定了连续匹配的最大数量：

In [229]: ts2.reindex(ts.index, method='ffill', limit=1)
Out[229]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

反之，tolerance 限定了索引与索引器值之间的最大距离：

In [230]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')
Out[230]: 
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

注意：索引为 DatetimeIndex、TimedeltaIndex 或 PeriodIndex 时，tolerance 会尽可能将这些索引强制转换为 Timedelta，这里要求用户用恰当的字符串设定 tolerance 参数。

去掉轴上的标签

drop() 函数与 reindex 经常配合使用，该函数用于删除轴上的一组标签：

In [231]: df
Out[231]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [232]: df.drop(['a', 'd'], axis=0)
Out[232]: 
        one       two     three
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435

In [233]: df.drop(['one'], axis=1)
Out[233]: 
        two     three
a  1.772517       NaN
b  1.912123 -0.050390
c  1.478369  1.227435
d  0.279344 -0.613172

注意：下面的代码可以运行，但不够清晰：

In [234]: df.reindex(df.index.difference(['a', 'd']))
Out[234]: 
        one       two     three
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435

重命名或映射标签

rename() 方法支持按不同的轴基于映射（字典或 Series）调整标签。

In [235]: s
Out[235]: 
a   -0.186646
b   -1.692424
c   -0.303893
d   -1.425662
e    1.114285
dtype: float64

In [236]: s.rename(str.upper)
Out[236]: 
A   -0.186646
B   -1.692424
C   -0.303893
D   -1.425662
E    1.114285
dtype: float64

如果调用的是函数，该函数在处理标签时，必须返回一个值，而且生成的必须是一组唯一值。此外，rename() 还可以调用字典或 Series。

In [237]: df.rename(columns={'one': 'foo', 'two': 'bar'},
   .....:           index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
   .....: 
Out[237]: 
             foo       bar     three
apple   1.394981  1.772517       NaN
banana  0.343054  1.912123 -0.050390
c       0.695246  1.478369  1.227435
durian       NaN  0.279344 -0.613172

pandas 不会重命名标签未包含在映射里的列或索引。注意，映射里多出的标签不会触发错误。

0.21.0 版新增。

DataFrame.rename() 还支持“轴式”习语，用这种方式可以指定单个 mapper，及执行映射的 axis。

In [238]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
Out[238]: 
        foo       bar     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [239]: df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')
Out[239]: 
             one       two     three
apple   1.394981  1.772517       NaN
banana  0.343054  1.912123 -0.050390
c       0.695246  1.478369  1.227435
durian       NaN  0.279344 -0.613172

rename() 方法还提供了 inplace 命名参数，默认为 False，并会复制底层数据。inplace=True 时，会直接在原数据上重命名。

0.18.0 版新增。

rename() 还支持用标量或列表更改 Series.name 属性。

In [240]: s.rename("scalar-name")
Out[240]: 
a   -0.186646
b   -1.692424
c   -0.303893
d   -1.425662
e    1.114285
Name: scalar-name, dtype: float64

0.24.0 版新增。

rename_axis() 方法支持指定 多重索引 名称，与标签相对应。

In [241]: df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
   .....:                    'y': [10, 20, 30, 40, 50, 60]},
   .....:                   index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
   .....:                   names=['let', 'num']))
   .....: 

In [242]: df
Out[242]: 
         x   y
let num       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [243]: df.rename_axis(index={'let': 'abc'})
Out[243]: 
         x   y
abc num       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [244]: df.rename_axis(index=str.upper)
Out[244]: 
         x   y
LET NUM       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

迭代

pandas 对象基于类型进行迭代操作。Series 迭代时被视为数组，基础迭代生成值。DataFrame 则遵循字典式习语，用对象的 key 实现迭代操作。

简言之，基础迭代（for i in object）生成：

Series ：值
DataFrame：列标签

例如，DataFrame 迭代时输出列名：

In [245]: df = pd.DataFrame({'col1': np.random.randn(3),
   .....:                    'col2': np.random.randn(3)}, index=['a', 'b', 'c'])
   .....: 

In [246]: for col in df:
   .....:     print(col)
   .....: 
col1
col2

Pandas 对象还支持字典式的 items() 方法，通过键值对迭代。

用下列方法可以迭代 DataFrame 里的行：

iterrows()：把 DataFrame 里的行当作（index， Series）对进行迭代。该操作把行转为 Series，同时改变数据类型，并对性能有影响。
`itertuples()` 把 DataFrame 的行当作值的命名元组进行迭代。该操作比 `iterrows()` 快的多，建议尽量用这种方法迭代 DataFrame 的值。

::: danger 警告

Pandas 对象迭代的速度较慢。大部分情况下，没必要对行执行迭代操作，建议用以下几种替代方式：

矢量化：很多操作可以用内置方法或 Numpy 函数，布尔索引……
调用的函数不能在完整的 DataFrame / Series 上运行时，最好用 `apply()`，不要对值进行迭代操作。请参阅函数应用文档。
如果必须对值进行迭代，请务必注意代码的性能，建议在 cython 或 numba 环境下实现内循环。参阅增强性能一节，查看这种操作方法的示例。

:::

::: danger 警告

永远不要修改迭代的内容，这种方式不能确保所有操作都能正常运作。基于数据类型，迭代器返回的是复制（copy）的结果，不是视图（view），这种写入可能不会生效！

下例中的赋值就不会生效：

In [247]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [248]: for index, row in df.iterrows():
.....:     row['a'] = 10
.....: 

In [249]: df
Out[249]: 
a  b
0  1  a
1  2  b
2  3  c

:::

项目（items）

与字典型接口类似，items() 通过键值对进行迭代：

Series：（Index，标量值）对
DataFrame：（列，Series）对

示例如下：

In [250]: for label, ser in df.items():
   .....:     print(label)
   .....:     print(ser)
   .....: 
a
0    1
1    2
2    3
Name: a, dtype: int64
b
0    a
1    b
2    c
Name: b, dtype: object

iterrows

iterrows() 迭代 DataFrame 或 Series 里的每一行数据。这个操作返回一个迭代器，生成索引值及包含每行数据的 Series：

In [251]: for row_index, row in df.iterrows():
   .....:     print(row_index, row, sep='\n')
   .....: 
0
a    1
b    a
Name: 0, dtype: object
1
a    2
b    b
Name: 1, dtype: object
2
a    3
b    c
Name: 2, dtype: object

::: tip 注意

iterrows() 返回的是 Series 里的每一行数据，该操作不会保留每行数据的数据类型，因为数据类型是通过 DataFrame 的列界定的。

示例如下：

In [252]: df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

In [253]: df_orig.dtypes
Out[253]: 
int        int64
float    float64
dtype: object

In [254]: row = next(df_orig.iterrows())[1]

In [255]: row
Out[255]: 
int      1.0
float    1.5
Name: 0, dtype: float64

row 里的值以 Series 形式返回，并被转换为浮点数，原始的整数值则在列 X：

In [256]: row['int'].dtype
Out[256]: dtype('float64')

In [257]: df_orig['int'].dtype
Out[257]: dtype('int64')

要想在行迭代时保存数据类型，最好用 itertuples()，这个函数返回值的命名元组，总的来说，该操作比 iterrows() 速度更快。

:::

下例展示了怎样转置 DataFrame：

In [258]: df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

In [259]: print(df2)
   x  y
0  1  4
1  2  5
2  3  6

In [260]: print(df2.T)
   0  1  2
x  1  2  3
y  4  5  6

In [261]: df2_t = pd.DataFrame({idx: values for idx, values in df2.iterrows()})

In [262]: print(df2_t)
   0  1  2
x  1  2  3
y  4  5  6

itertuples

itertuples() 方法返回为 DataFrame 里每行数据生成命名元组的迭代器。该元组的第一个元素是行的索引值，其余的值则是行的值。

示例如下：

In [263]: for row in df.itertuples():
   .....:     print(row)
   .....: 
Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')

该方法不会把行转换为 Series，只是返回命名元组里的值。itertuples() 保存值的数据类型，而且比 iterrows() 快。

::: tip 注意

包含无效 Python 识别符的列名、重复的列名及以下划线开头的列名，会被重命名为位置名称。如果列数较大，比如大于 255 列，则返回正则元组。

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2019-11-05，如有侵权请联系 cloudcommunity@tencent.com 删除

python

编程算法

本文分享自数据分析1480 微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

python

编程算法

登录后参与评论

0 条评论

热度