Pandas 2.2 Official Tutorials and Guides, Chinese Edition (Part 9.2)

ApacheCN_飞龙

Published 2024-05-24 16:25:52

Included in the column: 信数据得永生

Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [65]: pd.Series(["foo", "bar", "baz"]) == "foo"
Out[65]: 
0     True
1    False
2    False
dtype: bool

In [66]: pd.Index(["foo", "bar", "baz"]) == "foo"
Out[66]: array([ True, False, False])

pandas also handles element-wise comparisons between different array-like objects of the same length:

In [67]: pd.Series(["foo", "bar", "baz"]) == pd.Index(["foo", "bar", "qux"])
Out[67]: 
0     True
1     True
2    False
dtype: bool

In [68]: pd.Series(["foo", "bar", "baz"]) == np.array(["foo", "bar", "qux"])
Out[68]: 
0     True
1     True
2    False
dtype: bool

Trying to compare Index or Series objects of different lengths will raise a ValueError:

In [69]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
Cell In[69], line 1
----> 1 pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])

File ~/work/pandas/pandas/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
  72             return NotImplemented
  74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File ~/work/pandas/pandas/pandas/core/arraylike.py:40, in OpsMixin.__eq__(self, other)
  38 @unpack_zerodim_and_defer("__eq__")
  39 def __eq__(self, other):
---> 40     return self._cmp_method(other, operator.eq)

File ~/work/pandas/pandas/pandas/core/series.py:6114, in Series._cmp_method(self, other, op)
  6111 res_name = ops.get_op_result_name(self, other)
  6113 if isinstance(other, Series) and not self._indexed_same(other):
-> 6114     raise ValueError("Can only compare identically-labeled Series objects")
  6116 lvalues = self._values
  6117 rvalues = extract_array(other, extract_numpy=True, extract_range=True)

ValueError: Can only compare identically-labeled Series objects

In [70]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
Cell In[70], line 1
----> 1 pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])

File ~/work/pandas/pandas/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
  72             return NotImplemented
  74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File ~/work/pandas/pandas/pandas/core/arraylike.py:40, in OpsMixin.__eq__(self, other)
  38 @unpack_zerodim_and_defer("__eq__")
  39 def __eq__(self, other):
---> 40     return self._cmp_method(other, operator.eq)

File ~/work/pandas/pandas/pandas/core/series.py:6114, in Series._cmp_method(self, other, op)
  6111 res_name = ops.get_op_result_name(self, other)
  6113 if isinstance(other, Series) and not self._indexed_same(other):
-> 6114     raise ValueError("Can only compare identically-labeled Series objects")
  6116 lvalues = self._values
  6117 rvalues = extract_array(other, extract_numpy=True, extract_range=True)

ValueError: Can only compare identically-labeled Series objects
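
The == operator refuses to compare Series whose labels differ, as the tracebacks above show. The following is a minimal sketch of one alternative, assuming that aligning on labels is what you actually want: the flexible eq() method aligns both operands on the union of their labels before comparing, so a label present on only one side compares as False. The variable names are illustrative only.

```python
import pandas as pd

left = pd.Series(["foo", "bar", "baz"])
right = pd.Series(["foo", "bar"])

# eq() aligns on the union of index labels; label 2 exists only in `left`,
# so the comparison at that position is False instead of raising.
aligned_eq = left.eq(right)
print(aligned_eq)

# The == operator, by contrast, insists on identically labeled operands.
try:
    left == right
    raised = False
except ValueError:
    raised = True
print(raised)
```

Whether a silent False is preferable to an error depends on the use case; the strict operator is the safer default when differing labels indicate a bug.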

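Combining overlapping data sets

A problem that arises occasionally is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator where one is considered to be of "higher quality". However, the lower-quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is combine_first(), which we illustrate: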

In [71]: df1 = pd.DataFrame(
 ....:    {"A": [1.0, np.nan, 3.0, 5.0, np.nan], "B": [np.nan, 2.0, 3.0, np.nan, 6.0]}
 ....: )
 ....: 

In [72]: df2 = pd.DataFrame(
 ....:    {
 ....:        "A": [5.0, 2.0, 4.0, np.nan, 3.0, 7.0],
 ....:        "B": [np.nan, np.nan, 3.0, 4.0, 6.0, 8.0],
 ....:    }
 ....: )
 ....: 

In [73]: df1
Out[73]: 
 A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [74]: df2
Out[74]: 
 A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [75]: df1.combine_first(df2)
Out[75]: 
 A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

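General DataFrame combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the input DataFrames, and then passes the combiner function pairs of Series (i.e., columns whose names are the same).

So, for instance, to reproduce combine_first() as above: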

In [76]: def combiner(x, y):
 ....:    return np.where(pd.isna(x), y, x)
 ....: 

In [77]: df1.combine(df2, combiner)
Out[77]: 
 A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0
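
Because combine() accepts any function of two aligned Series, other merge rules are possible. The sketch below is an illustration rather than part of the original guide: it keeps the element-wise larger value and uses np.fmax so that a lone NaN does not win. The frames a and b and the helper take_larger are hypothetical names.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"A": [1.0, np.nan], "B": [5.0, 2.0]})
b = pd.DataFrame({"A": [3.0, 4.0], "B": [np.nan, 6.0]})


def take_larger(x, y):
    # np.fmax returns the element-wise maximum and ignores a NaN
    # whenever the other operand is a number.
    return np.fmax(x, y)


# combine() aligns a and b, then calls take_larger once per column pair.
larger = a.combine(b, take_larger)
print(larger)
```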

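Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and the related functions radd(), rsub(), … for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions, you can match on either the index or the columns via the axis keyword: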

In [18]: df = pd.DataFrame(
 ....:    {
 ....:        "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
 ....:        "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
 ....:        "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
 ....:    }
 ....: )
 ....: 

In [19]: df
Out[19]: 
 one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [20]: row = df.iloc[1]

In [21]: column = df["two"]

In [22]: df.sub(row, axis="columns")
Out[22]: 
 one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [23]: df.sub(row, axis=1)
Out[23]: 
 one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [24]: df.sub(column, axis="index")
Out[24]: 
 one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

In [25]: df.sub(column, axis=0)
Out[25]: 
 one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516
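
Since the frames above are random, the effect of axis can be easier to verify on fixed numbers. This is a small sketch with made-up data (the name scores is illustrative) showing which labels the Series is matched against in each case.

```python
import pandas as pd

scores = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"])

first_row = scores.loc["a"]  # Series labeled by column names: x, y
x_column = scores["x"]       # Series labeled by row labels: a, b, c

# Match the Series labels against the columns: first_row is subtracted from every row.
by_columns = scores.sub(first_row, axis="columns")

# Match the Series labels against the index: x_column is subtracted from every column.
by_index = scores.sub(x_column, axis="index")

# The plain operator matches on columns, so it agrees with axis="columns".
same_as_operator = by_columns.equals(scores - first_row)

print(by_columns)
print(by_index)
```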

Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.

In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples(
 ....:    [(1, "a"), (1, "b"), (1, "c"), (2, "a")], names=["first", "second"]
 ....: )
 ....: 

In [28]: dfmi.sub(column, axis=0, level="second")
Out[28]: 
                   one       two     three
first second                              
1     a      -0.377535  0.000000       NaN
      b      -1.569069  0.000000 -1.962513
      c      -0.783123  0.000000 -0.250933
2     a            NaN -1.493173 -2.385688

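Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple whose elements have the same type as the left-hand side. For example: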

In [29]: s = pd.Series(np.arange(10))

In [30]: s
Out[30]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div
Out[32]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [33]: rem
Out[33]: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx
Out[35]: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div
Out[37]: Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem
Out[38]: Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

We can also do element-wise divmod():

In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [40]: div
Out[40]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

In [41]: rem
Out[41]: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64
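
As a quick sanity check on the output above, the sketch below confirms that divmod() is shorthand for the pair (s // d, s % d) and that quotient and remainder reconstruct the input. The names divisors and reconstructed are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))
divisors = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6]

div, rem = divmod(s, divisors)

# divmod is equivalent to computing floor division and modulo separately ...
matches_floor_div = div.equals(s // divisors)
matches_modulo = rem.equals(s % divisors)

# ... and the two parts reconstruct the original values.
reconstructed = div * divisors + rem
print(reconstructed.equals(s))
```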

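Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions accept a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).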

In [42]: df2 = df.copy()

In [43]: df2.loc["a", "three"] = 1.0

In [44]: df
Out[44]: 
 one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [45]: df2
Out[45]: 
 one       two     three
a  1.394981  1.772517  1.000000
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [46]: df + df2
Out[46]: 
 one       two     three
a  2.789963  3.545034       NaN
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343

In [47]: df.add(df2, fill_value=0)
Out[47]: 
 one       two     three
a  2.789963  3.545034  1.000000
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343
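
The three possible cases (no value missing, one missing, both missing) can be isolated in a tiny example. This is a sketch with made-up numbers; left and right are illustrative names.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"v": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"v": [10.0, 20.0, np.nan]})

# Plain addition: the result is NaN wherever either side is missing.
plain = left + right

# With fill_value=0, a lone NaN is treated as 0; a NaN on both sides stays NaN.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```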

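Flexible comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge, whose behavior is analogous to the binary arithmetic operations described above: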

In [48]: df.gt(df2)
Out[48]: 
 one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False

In [49]: df2.ne(df)
Out[49]: 
 one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False

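These operations produce a pandas object of the same type as the left-hand-side input, with dtype bool. These boolean objects can be used in indexing operations; see the section on Boolean indexing.

Boolean reductions

You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result. Note that bool() is deprecated as of pandas 2.1.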

In [50]: (df > 0).all()
Out[50]: 
one      False
two       True
three    False
dtype: bool

In [51]: (df > 0).any()
Out[51]: 
one      True
two      True
three    True
dtype: bool

You can reduce to a final boolean value.

In [52]: (df > 0).any().any()
Out[52]: True

You can test whether a pandas object is empty via the empty property.

In [53]: df.empty
Out[53]: False

In [54]: pd.DataFrame(columns=list("ABC")).empty
Out[54]: True

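Warning

Asserting the truthiness of a pandas object raises an error, because it is ambiguous whether the test is for emptiness or for the values.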

In [55]: if df:
 ....:    print(True)
 ....: 
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-55-318d08b2571a> in ?()
----> 1 if df:
  2     print(True)

~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [56]: df and df2
---------------------------------------------------------------------------
ValueError  Traceback (most recent call last)
<ipython-input-56-b241b64bb471> in ?()
----> 1 df and df2

~/work/pandas/pandas/pandas/core/generic.py in ?(self)
  1575     @final
  1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
  1578             f"The truth value of a {type(self).__name__} is ambiguous. "
  1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
  1580         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

See gotchas for a more detailed discussion.
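
The error message lists the explicit alternatives. As a minimal sketch, assuming a small frame with made-up values, each question a bare `if df:` might have meant can be spelled out like this:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1], "b": [0, 0]})

# Does the frame contain any rows and columns at all?
has_data = not df.empty

# Is at least one value truthy anywhere in the frame?
any_truthy = bool(df.any().any())

# Are all values truthy?
all_truthy = bool(df.all().all())

if has_data and any_truthy:
    print("non-empty frame with at least one non-zero value")
```

Wrapping the reduction in bool() converts the NumPy boolean to a plain Python bool, which is convenient when the result is stored or serialized.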

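Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example, consider df + df and df * 2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df + df == df * 2).all(). But in fact, the result is not True for every column: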

In [57]: df + df == df * 2
Out[57]: 
 one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True

In [58]: (df + df == df * 2).all()
Out[58]: 
one      False
two       True
three    False
dtype: bool

Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs do not compare as equal:

In [59]: np.nan == np.nan
Out[59]: False

So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corresponding locations treated as equal.

In [60]: (df + df).equals(df * 2)
Out[60]: True

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

In [61]: df1 = pd.DataFrame({"col": ["foo", 0, np.nan]})

In [62]: df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])

In [63]: df1.equals(df2)
Out[63]: False

In [64]: df1.equals(df2.sort_index())
Out[64]: True
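
To make the NaN handling concrete, the sketch below spells out by hand the rule "positions where both sides are NaN count as equal" and compares it with equals(). It is an illustration only: equals() also checks shape and dtypes, which this manual version does not.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 4.0]})
b = a * 1  # same values, computed another way

# Plain element-wise comparison fails at the NaN position.
elementwise = bool((a == b).all().all())

# Treat positions where both sides are NaN as equal.
nan_aware = bool(((a == b) | (a.isna() & b.isna())).all().all())

with_equals = a.equals(b)
print(elementwise, nan_aware, with_equals)
```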

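Descriptive statistics

There exists a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer:

Series: no axis argument needed
DataFrame: "index" (axis=0, default), "columns" (axis=1)

For example: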

In [78]: df
Out[78]: 
 one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [79]: df.mean(0)
Out[79]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [80]: df.mean(1)
Out[80]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [81]: df.sum(0, skipna=False)
Out[81]: 
one           NaN
two      5.442353
three         NaN
dtype: float64

In [82]: df.sum(axis=1, skipna=True)
Out[82]: 
a    3.167498
b    2.204786
c    3.401050
d   -0.333828
dtype: float64

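Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation of 1), very concisely: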

In [83]: ts_stand = (df - df.mean()) / df.std()

In [84]: ts_stand.std()
Out[84]: 
one      1.0
two      1.0
three    1.0
dtype: float64

In [85]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [86]: xs_stand.std(1)
Out[86]: 
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
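
The same column-wise standardization can be checked on fixed numbers. This is a sketch with made-up data; it relies on the pandas default of a sample standard deviation (ddof=1) for std().

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"u": [1.0, 2.0, 3.0, 4.0], "v": [10.0, 20.0, 40.0, 80.0]})

# mean() and std() return one value per column; the subtraction and
# division broadcast those values down each column.
z = (data - data.mean()) / data.std()

col_means = z.mean()
col_stds = z.std()
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```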

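Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling(), since NaN behavior there is additionally dictated by a min_periods parameter.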

In [87]: df.cumsum()
Out[87]: 
 one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873

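Here is a quick reference summary table of common functions. Each also takes an optional level parameter which applies only if the object has a hierarchical index.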

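Function	Description
count	Number of non-NA observations
sum	Sum of values
mean	Mean of values
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute value
prod	Product of values
std	Bessel-corrected sample standard deviation
var	Unbiased variance
sem	Standard error of the mean
skew	Sample skewness (3rd moment)
kurt	Sample kurtosis (4th moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum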

Note that some NumPy methods, like mean, std, and sum, happen to exclude NAs on Series input by default:

In [88]: np.mean(df["one"])
Out[88]: 0.8110935116651192

In [89]: np.mean(df["one"].to_numpy())
Out[89]: nan

Series.nunique() will return the number of unique non-NA values in a Series:

In [90]: series = pd.Series(np.random.randn(500))

In [91]: series[20:500] = np.nan

In [92]: series[10:20] = 5

In [93]: series.nunique()
Out[93]: 11
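
If missing values should count as a distinct value, nunique() accepts dropna=False. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

values = pd.Series([5.0, 5.0, 1.5, np.nan, np.nan])

without_na = values.nunique()            # counts 5.0 and 1.5
with_na = values.nunique(dropna=False)   # additionally counts NaN as one value
print(without_na, with_na)
```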

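Summarizing data: describe

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs, of course):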

In [94]: series = pd.Series(np.random.randn(1000))

In [95]: series[::2] = np.nan

In [96]: series.describe()
Out[96]: 
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
25%       -0.699070
50%       -0.069718
75%        0.714483
max        3.160915
dtype: float64

In [97]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [98]: frame.iloc[::2] = np.nan

In [99]: frame.describe()
Out[99]: 
 a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
std      1.017152    0.978743    1.025270    1.015988    1.006695
min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
75%      0.729907    0.775880    0.618896    0.670047    0.649748
max      2.740139    2.752332    3.004229    2.728702    3.240991

You can select specific percentiles to include in the output:

In [100]: series.describe(percentiles=[0.05, 0.25, 0.75, 0.95])
Out[100]: 
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
5%        -1.645423
25%       -0.699070
50%       -0.069718
75%        0.714483
95%        1.711409
max        3.160915
dtype: float64

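By default, the median is always included.

For a non-numerical Series object, describe() will give a simple summary of the number of unique values and the most frequently occurring value: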

In [101]: s = pd.Series(["a", "a", "b", "b", "a", "a", np.nan, "c", "d", "a"])

In [102]: s.describe()
Out[102]: 
count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if there are none, only categorical columns:

In [103]: frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

In [104]: frame.describe()
Out[104]: 
 b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

This behavior can be controlled by providing a list of types as include/exclude arguments. The special value all can also be used:

In [105]: frame.describe(include=["object"])
Out[105]: 
 a
count     4
unique    2
top     Yes
freq      2

In [106]: frame.describe(include=["number"])
Out[106]: 
 b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [107]: frame.describe(include="all")
Out[107]: 
 a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000

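That feature relies on select_dtypes. Refer to it for details about accepted inputs.

Index of min/max values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values: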

In [108]: s1 = pd.Series(np.random.randn(5))

In [109]: s1
Out[109]: 
0    1.118076
1   -0.352051
2   -1.242883
3   -1.277155
4   -0.641184
dtype: float64

In [110]: s1.idxmin(), s1.idxmax()
Out[110]: (3, 0)

In [111]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])

In [112]: df1
Out[112]: 
 A         B         C
0 -0.327863 -0.946180 -0.137570
1 -0.186235 -0.257213 -0.486567
2 -0.507027 -0.871259 -0.111110
3  2.000339 -2.430505  0.089759
4 -0.321434 -0.033695  0.096271

In [113]: df1.idxmin(axis=0)
Out[113]: 
A    2
B    3
C    1
dtype: int64

In [114]: df1.idxmax(axis=1)
Out[114]: 
0    C
1    A
2    C
3    A
4    C
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [115]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=["A"], index=list("edcba"))

In [116]: df3
Out[116]: 
 A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [117]: df3["A"].idxmin()
Out[117]: 'd'
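
The distinction between a label and a position is easy to miss when the index is the default 0..n-1. The sketch below, using the same values as df3, contrasts idxmin() with Series.argmin(), which returns the integer position instead.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 1.0, 1.0, 3.0, np.nan], index=list("edcba"))

label = s.idxmin()      # index label of the first minimum
position = s.argmin()   # integer position of the first minimum
print(label, position)

# Both refer to the same element.
same_element = s.loc[label] == s.iloc[position]
```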

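Note

idxmin and idxmax are called argmin and argmax in NumPy.

Value counts (histogramming) / mode

The value_counts() Series method computes a histogram of a 1D array of values. It can also be used as a function on regular arrays: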

In [118]: data = np.random.randint(0, 7, size=50)

In [119]: data
Out[119]: 
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
 2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
 6, 2, 6, 1, 5, 4])

In [120]: s = pd.Series(data)

In [121]: s.value_counts()
Out[121]: 
6    10
2    10
4     9
3     8
5     8
0     3
1     2
Name: count, dtype: int64

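The value_counts() method can be used to count combinations across multiple columns. By default all columns are used, but a subset can be selected using the subset argument.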

In [122]: data = {"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]}

In [123]: frame = pd.DataFrame(data)

In [124]: frame.value_counts()
Out[124]: 
a  b
1  x    1
2  x    1
3  y    1
4  y    1
Name: count, dtype: int64
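
The example above uses all columns. As a brief sketch of the subset argument on the same data, the counts can be restricted to column "b", and normalize=True turns them into relative frequencies.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]})

# Count only the values of column "b" instead of (a, b) combinations.
counts_b = frame.value_counts(subset=["b"])

# normalize=True returns relative frequencies rather than raw counts.
shares_b = frame.value_counts(subset=["b"], normalize=True)

print(counts_b)
print(shares_b)
```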

Similarly, you can get the most frequently occurring value(s), i.e. the mode, of the values in a Series or DataFrame:

In [125]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [126]: s5.mode()
Out[126]: 
0    3
1    7
dtype: int64

In [127]: df5 = pd.DataFrame(
 .....:    {
 .....:        "A": np.random.randint(0, 7, size=50),
 .....:        "B": np.random.randint(-10, 15, size=50),
 .....:    }
 .....: )
 .....: 

In [128]: df5.mode()
Out[128]: 
 A   B
0  1.0  -9
1  NaN  10
2  NaN  13

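Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions: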

In [129]: arr = np.random.randn(20)

In [130]: factor = pd.cut(arr, 4)

In [131]: factor
Out[131]: 
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64, right]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
 (1.179, 1.893]]

In [132]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [133]: factor
Out[133]: 
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64, right]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [134]: arr = np.random.randn(30)

In [135]: factor = pd.qcut(arr, [0, 0.25, 0.5, 0.75, 1])

In [136]: factor
Out[136]: 
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64, right]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
 (1.184, 2.346]]

We can also pass infinite values to define the bins:

In [137]: arr = np.random.randn(20)

In [138]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [139]: factor
Out[139]: 
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64, right]): [(-inf, 0.0] < (0.0, inf]]
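
Both functions also accept a labels argument, which replaces the interval notation with names. The following sketch uses fixed, made-up values so the difference between value-based and quantile-based bins is reproducible.

```python
import numpy as np
import pandas as pd

values = np.array([-2.5, -0.5, 0.3, 0.9, 1.7, 4.0])

# cut(): fixed bin edges, with a readable label for each interval.
sign_band = pd.cut(values, [-np.inf, -1, 1, np.inf], labels=["low", "mid", "high"])

# qcut(): edges are derived from the data so that each bin holds a similar count.
halves = pd.qcut(values, 2, labels=["lower", "upper"])

print(list(sign_band))
print(list(halves))
```

With cut() the bin sizes depend on where the data happens to fall (one, three and two values here), whereas qcut() splits the six values evenly.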

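Function application

To apply your own or another library's functions to pandas objects, you should be aware of the methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or element-wise.

Tablewise function application: pipe()
Row- or column-wise function application: apply()
Aggregation API: agg() and transform()
Applying element-wise functions: map()

Tablewise function application

DataFrames and Series can be passed into functions. However, if the function needs to be called in a chain, consider using the pipe() method.

First some setup: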

In [140]: def extract_city_name(df):
 .....:    """
 .....:    Chicago, IL -> Chicago for city_name column
 .....:    """
 .....:    df["city_name"] = df["city_and_code"].str.split(",").str.get(0)
 .....:    return df
 .....: 

In [141]: def add_country_name(df, country_name=None):
 .....:    """
 .....:    Chicago -> ChicagoUS for city_and_country column
 .....:    """
 .....:    col = "city_name"
 .....:    df["city_and_country"] = df[col] + country_name
 .....:    return df
 .....: 

In [142]: df_p = pd.DataFrame({"city_and_code": ["Chicago, IL"]})

extract_city_name and add_country_name are functions taking and returning DataFrames.

Now compare the following:

In [143]: add_country_name(extract_city_name(df_p), country_name="US")
Out[143]: 
 city_and_code city_name city_and_country
0   Chicago, IL   Chicago        ChicagoUS

is equivalent to:

In [144]: df_p.pipe(extract_city_name).pipe(add_country_name, country_name="US")
Out[144]: 
 city_and_code city_name city_and_country
0   Chicago, IL   Chicago        ChicagoUS

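pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or another library's functions in method chains, alongside pandas' methods.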

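In the example above, the functions extract_city_name and add_country_name each expect the DataFrame as the first positional argument. What if the function you wish to apply takes its data as, say, the second argument? In this case, provide pipe with a tuple of (callable, data_keyword). .pipe will route the DataFrame to the argument named in the tuple.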
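
The tuple form can be tried without any third-party library. In the sketch below, tag_city is a hypothetical function that takes its DataFrame through a keyword argument named data; the column name and the suffix are likewise made up.

```python
import pandas as pd

def tag_city(suffix, data):
    # Hypothetical function: the DataFrame is the second parameter, named "data".
    return data.assign(city=data["city"] + suffix)

df = pd.DataFrame({"city": ["Chicago", "Boston"]})

# pipe routes df to the "data" keyword, so this calls tag_city("-US", data=df).
tagged = df.pipe((tag_city, "data"), "-US")
print(tagged)
```

The remaining positional and keyword arguments given to pipe are forwarded unchanged, so "-US" ends up as the suffix argument.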

For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame as the second argument, data. We pass the (function, keyword) pair (sm.ols, 'data') to pipe:

In [147]: import statsmodels.formula.api as sm

In [148]: bb = pd.read_csv("data/baseball.csv", index_col="id")

In [149]: (
 .....:    bb.query("h > 0")
 .....:    .assign(ln_h=lambda df: np.log(df.h))
 .....:    .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
 .....:    .fit()
 .....:    .summary()
 .....: )
 .....:
Out[149]:
<class 'statsmodels.iolib.summary.Summary'>
"""
 OLS Regression Results
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Tue, 22 Nov 2022   Prob (F-statistic):           3.48e-15
Time:                        05:34:17   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
 coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

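The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which introduced the popular (%>%) (read as "pipe") operator for R. The implementation of pipe here is quite clean and feels right at home in Python. We encourage you to view the source code of pipe().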

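Row or column-wise function application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument: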

In [145]: df.apply(lambda x: np.mean(x))
Out[145]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [146]: df.apply(lambda x: np.mean(x), axis=1)
Out[146]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

In [147]: df.apply(lambda x: x.max() - x.min())
Out[147]: 
one      1.051928
two      1.632779
three    1.840607
dtype: float64

In [148]: df.apply(np.cumsum)
Out[148]: 
 one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873

In [149]: df.apply(np.exp)
Out[149]: 
 one       two     three
a  4.034899  5.885648       NaN
b  1.409244  6.767440  0.950858
c  2.004201  4.385785  3.412466
d       NaN  1.322262  0.541630

The apply() method will also dispatch on a string method name.

In [150]: df.apply("mean")
Out[150]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [151]: df.apply("mean", axis=1)
Out[151]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

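The return type of the function passed to apply() affects the type of the final output from DataFrame.apply. The default behaviour is:

If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.
If the applied function returns any other type, the final output is a Series.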

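This default behaviour can be overridden using result_type, which accepts three options: reduce, broadcast, and expand. These determine how list-like return values expand (or do not expand) to a DataFrame.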
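
A minimal sketch of the difference, using a made-up two-column frame and a hypothetical row_stats function that returns a plain list for each row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def row_stats(row):
    # Hypothetical helper returning a list-like: [min, max, sum] of the row.
    return [row.min(), row.max(), row.sum()]

# Default behaviour: each list is kept as one element of a Series.
as_series = df.apply(row_stats, axis=1)

# result_type="expand": each list element becomes its own column.
expanded = df.apply(row_stats, axis=1, result_type="expand")

print(as_series)
print(expanded)
```

With expand, the new columns are labelled by position (0, 1, 2); return a Series from the function instead if you want named columns.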

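apply() combined with some cleverness can be used to answer many questions about a data set. For example, suppose we wanted to extract the date on which the maximum value of each column occurred: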

In [152]: tsdf = pd.DataFrame(
 .....:    np.random.randn(1000, 3),
 .....:    columns=["A", "B", "C"],
 .....:    index=pd.date_range("1/1/2000", periods=1000),
 .....: )
 .....: 

In [153]: tsdf.apply(lambda x: x.idxmax())
Out[153]: 
A   2000-08-06
B   2001-01-18
C   2001-07-18
dtype: datetime64[ns]

You may also pass additional positional and keyword arguments to the apply() method.

In [154]: def subtract_and_divide(x, sub, divide=1):
 .....:    return (x - sub) / divide
 .....: 

In [155]: df_udf = pd.DataFrame(np.ones((2, 2)))

In [156]: df_udf.apply(subtract_and_divide, args=(5,), divide=3)
Out[156]: 
 0         1
0 -1.333333 -1.333333
1 -1.333333 -1.333333

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [157]: tsdf = pd.DataFrame(
 .....:    np.random.randn(10, 3),
 .....:    columns=["A", "B", "C"],
 .....:    index=pd.date_range("1/1/2000", periods=10),
 .....: )
 .....: 

In [158]: tsdf.iloc[3:7] = np.nan

In [159]: tsdf
Out[159]: 
 A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

In [160]: tsdf.apply(pd.Series.interpolate)
Out[160]: 
 A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659  0.092225
2000-01-05 -0.987349 -0.622526  0.321243
2000-01-06 -0.876100 -0.355392  0.550262
2000-01-07 -0.764851 -0.088259  0.779280
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

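Finally, apply() takes an argument raw, which is False by default; in that case each row or column is converted into a Series before the function is applied. When raw is set to True, the passed function instead receives an ndarray object, which has positive performance implications if you do not need the indexing functionality.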
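
The following sketch records which type the applied function actually receives in each mode. The frame and the column_range helper are made up for illustration; no timing is claimed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])
seen_types = []

def column_range(col):
    # Remember the type of the object handed in, then reduce it.
    seen_types.append(type(col).__name__)
    return col.max() - col.min()

with_series = df.apply(column_range)             # receives Series objects
with_ndarray = df.apply(column_range, raw=True)  # receives NumPy arrays

print(sorted(set(seen_types)))
print(with_series)
print(with_ndarray)
```

column_range only uses max() and min(), which exist on both Series and ndarray, so the same function works in either mode. A function that relies on labels (for example col.index or col.idxmax()) would need raw=False.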

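Aggregation API

The aggregation API lets you express possibly multiple aggregation operations in a single concise way. This API is similar across pandas objects; see the groupby API, the window API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or the alias DataFrame.agg().

We will use a starting frame similar to the one above: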

In [161]: tsdf = pd.DataFrame(
 .....:    np.random.randn(10, 3),
 .....:    columns=["A", "B", "C"],
 .....:    index=pd.date_range("1/1/2000", periods=10),
 .....: )
 .....: 

In [162]: tsdf.iloc[3:7] = np.nan

In [163]: tsdf
Out[163]: 
 A         B         C
2000-01-01  1.257606  1.004194  0.167574
2000-01-02 -0.749892  0.288112 -0.757304
2000-01-03 -0.207550 -0.298599  0.116018
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.814347 -0.257623  0.869226
2000-01-09 -0.250663 -1.206601  0.896839
2000-01-10  2.169758 -1.333363  0.283157

Using a single function is equivalent to apply(). You can also pass named methods as strings. These will return a Series of the aggregated output:

In [164]: tsdf.agg(lambda x: np.sum(x))
Out[164]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

In [165]: tsdf.agg("sum")
Out[165]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

# these are equivalent to a ``.sum()`` because we are aggregating
# on a single function
In [166]: tsdf.sum()
Out[166]: 
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

A single aggregation on a Series will return a scalar value:

In [167]: tsdf["A"].agg("sum")
Out[167]: 3.033606102414146

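Aggregating with multiple functions

You can pass multiple aggregation arguments as a list. The result of each passed function will be a row in the resulting DataFrame. The rows are named after the aggregation functions.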

In [168]: tsdf.agg(["sum"])
Out[168]: 
 A         B        C
sum  3.033606 -1.803879  1.57551

Multiple functions yield multiple rows:

In [169]: tsdf.agg(["sum", "mean"])
Out[169]: 
 A         B         C
sum   3.033606 -1.803879  1.575510
mean  0.505601 -0.300647  0.262585

On a Series, multiple functions return a Series indexed by the function names:

In [170]: tsdf["A"].agg(["sum", "mean"])
Out[170]: 
sum     3.033606
mean    0.505601
Name: A, dtype: float64

Passing a lambda function will yield a row named <lambda>:

In [171]: tsdf["A"].agg(["sum", lambda x: x.mean()])
Out[171]: 
sum         3.033606
<lambda>    0.505601
Name: A, dtype: float64

Passing a named function will yield that name for the row:

In [172]: def mymean(x):
 .....:    return x.mean()
 .....: 

In [173]: tsdf["A"].agg(["sum", mymean])
Out[173]: 
sum       3.033606
mymean    0.505601
Name: A, dtype: float64

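Aggregating with a dict

Passing DataFrame.agg a dictionary that maps column names to a function or a list of functions lets you customize which functions are applied to which columns. Note that the results are not in any particular order; you can use an OrderedDict instead to guarantee ordering.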

In [174]: tsdf.agg({"A": "mean", "B": "sum"})
Out[174]: 
A    0.505601
B   -1.803879
dtype: float64

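Passing a list-like will generate a DataFrame output. You will get a matrix-like output of all of the aggregators. The output will consist of all unique functions. Those that are not specified for a particular column will be NaN: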

In [175]: tsdf.agg({"A": ["mean", "min"], "B": "sum"})
Out[175]: 
 A         B
mean  0.505601       NaN
min  -0.749892       NaN
sum        NaN -1.803879
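
To see where those NaN entries come from, here is a small sketch on a made-up frame whose numbers are easy to verify by hand. The frame and the variable name summary are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [10.0, np.nan, 30.0]})

# "mean" and "min" are requested only for A, "sum" only for B.
summary = df.agg({"A": ["mean", "min"], "B": "sum"})
print(summary)

# Cells for (function, column) pairs that were never requested are NaN.
print(summary.isna())
```

A NaN in this table therefore means "not computed", which is different from a statistic that was computed and happened to be missing.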

Custom describe

With .agg() it is easy to create a custom describe function, similar to the built-in describe function.

In [176]: from functools import partial

In [177]: q_25 = partial(pd.Series.quantile, q=0.25)

In [178]: q_25.__name__ = "25%"

In [179]: q_75 = partial(pd.Series.quantile, q=0.75)

In [180]: q_75.__name__ = "75%"

In [181]: tsdf.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])
Out[181]: 
 A         B         C
count   6.000000  6.000000  6.000000
mean    0.505601 -0.300647  0.262585
std     1.103362  0.887508  0.606860
min    -0.749892 -1.333363 -0.757304
25%    -0.239885 -0.979600  0.128907
median  0.303398 -0.278111  0.225365
75%     1.146791  0.151678  0.722709
max     2.169758  1.004194  0.896839 
```

### Transform API

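The `transform()` method returns an object that is indexed the same (same size) as the original. This API allows you to provide *multiple* operations at the same time rather than one by one. Its API is quite similar to the `.agg` API.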
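
A minimal sketch of the contrast with aggregation, using a made-up three-element Series: `agg` reduces the data, while `transform` keeps the original index.

```python
import pandas as pd

s = pd.Series([1.0, -2.0, 3.0], index=["x", "y", "z"])

reduced = s.agg("sum")           # a single scalar
same_shape = s.transform("abs")  # a Series indexed like s

print(reduced)
print(same_shape)
```

The variable names `reduced` and `same_shape` are illustrative only.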

We create a frame similar to the one used in the sections above.

```py
In [182]: tsdf = pd.DataFrame(
 .....:    np.random.randn(10, 3),
 .....:    columns=["A", "B", "C"],
 .....:    index=pd.date_range("1/1/2000", periods=10),
 .....: )
 .....: 

In [183]: tsdf.iloc[3:7] = np.nan

In [184]: tsdf
Out[184]: 
 A         B         C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731  1.338144 -1.279321
2000-01-03 -1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -1.240447 -0.201052
2000-01-09 -0.157795  0.791197 -1.144209
2000-01-10 -0.030876  0.371900  0.061932

Transform the entire frame. .transform() accepts a NumPy function, a string function name, or a user-defined function as input.

In [185]: tsdf.transform(np.abs)
Out[185]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [186]: tsdf.transform("abs")
Out[186]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [187]: tsdf.transform(lambda x: x.abs())
Out[187]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Here transform() received a single function; this is equivalent to a ufunc application.

In [188]: np.abs(tsdf)
Out[188]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Passing a single function to .transform() on a Series will yield a single Series in return.

In [189]: tsdf["A"].transform(np.abs)
Out[189]: 
2000-01-01    0.428759
2000-01-02    0.168731
2000-01-03    1.621034
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.254374
2000-01-09    0.157795
2000-01-10    0.030876
Freq: D, Name: A, dtype: float64

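Transform with multiple functions

Passing multiple functions will yield a DataFrame with MultiIndexed columns. The first level will be the original frame's column names; the second level will be the names of the transforming functions.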

In [190]: tsdf.transform([np.abs, lambda x: x + 1])
Out[190]: 
 A                   B                   C 
 absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
2000-01-01  0.428759  0.571241  0.864890  0.135110  0.675341  0.324659
2000-01-02  0.168731  0.831269  1.338144  2.338144  1.279321 -0.279321
2000-01-03  1.621034 -0.621034  0.438107  1.438107  0.903794  1.903794
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.254374  1.254374  1.240447 -0.240447  0.201052  0.798948
2000-01-09  0.157795  0.842205  0.791197  1.791197  1.144209 -0.144209
2000-01-10  0.030876  0.969124  0.371900  1.371900  0.061932  1.061932

Passing multiple functions to a Series will yield a DataFrame. The resulting column names will be the names of the transforming functions.

In [191]: tsdf["A"].transform([np.abs, lambda x: x + 1])
Out[191]: 
 absolute  <lambda>
2000-01-01  0.428759  0.571241
2000-01-02  0.168731  0.831269
2000-01-03  1.621034 -0.621034
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374  1.254374
2000-01-09  0.157795  0.842205
2000-01-10  0.030876  0.969124

Transforming with a dict

Passing a dict of functions allows selective transforming per column.

In [192]: tsdf.transform({"A": np.abs, "B": lambda x: x + 1})
Out[192]: 
 A         B
2000-01-01  0.428759  0.135110
2000-01-02  0.168731  2.338144
2000-01-03  1.621034  1.438107
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374 -0.240447
2000-01-09  0.157795  1.791197
2000-01-10  0.030876  1.371900

Passing a dict of lists will generate a MultiIndexed DataFrame with these selective transforms.

In [193]: tsdf.transform({"A": np.abs, "B": [lambda x: x + 1, "sqrt"]})
Out[193]: 
 A         B 
 absolute  <lambda>      sqrt
2000-01-01  0.428759  0.135110       NaN
2000-01-02  0.168731  2.338144  1.156782
2000-01-03  1.621034  1.438107  0.661897
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -0.240447       NaN
2000-01-09  0.157795  1.791197  0.889493
2000-01-10  0.030876  1.371900  0.609836 
```

### Applying elementwise functions

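Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the method `map()` on DataFrame and, analogously, `map()` on Series accept any Python function that takes a single value and returns a single value.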
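
One detail worth knowing is how missing values are handled. The sketch below uses a made-up Series and the `na_action="ignore"` option of `Series.map()`, which passes NaN through without calling the function.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, 3.25], index=["a", "b", "c"])

# By default the function is also called on NaN, so "nan" has length 3.
lengths = s.map(lambda v: len(str(v)))

# With na_action="ignore", NaN is propagated instead of being passed to the function.
lengths_skip_nan = s.map(lambda v: len(str(v)), na_action="ignore")

print(lengths)
print(lengths_skip_nan)
```

The second result holds a NaN, so pandas stores it as floats rather than integers.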

```py
In [194]: df4 = df.copy()

In [195]: df4
Out[195]: 
 one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [196]: def f(x):
 .....:    return len(str(x))
 .....: 

In [197]: df4["one"].map(f)
Out[197]: 
a    18
b    19
c    18
d     3
Name: one, dtype: int64

In [198]: df4.map(f)
Out[198]: 
 one  two  three
a   18   17      3
b   19   18     20
c   18   18     16
d    3   19     19

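Series.map() has an additional feature: it can be used to easily "link" or "map" values defined by a secondary series. This is closely related to merging/joining functionality: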

In [199]: s = pd.Series(
 .....:    ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
 .....: )
 .....: 

In [200]: t = pd.Series({"six": 6.0, "seven": 7.0})

In [201]: s
Out[201]: 
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [202]: s.map(t)
Out[202]: 
a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64 
```
```py
转换 API

transform() 方法返回一个与原始对象（大小相同）索引相同的对象。该 API 允许您一次性提供多个操作，而不是一个接一个的操作。其 API 与 .agg API 非常相似。

我们创建了一个类似于上述部分中使用的框架。

In [182]: tsdf = pd.DataFrame(
 .....:    np.random.randn(10, 3),
 .....:    columns=["A", "B", "C"],
 .....:    index=pd.date_range("1/1/2000", periods=10),
 .....: )
 .....: 

In [183]: tsdf.iloc[3:7] = np.nan

In [184]: tsdf
Out[184]: 
 A         B         C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731  1.338144 -1.279321
2000-01-03 -1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -1.240447 -0.201052
2000-01-09 -0.157795  0.791197 -1.144209
2000-01-10 -0.030876  0.371900  0.061932

转换整个框架。.transform() 允许输入函数为：NumPy 函数、字符串函数名称或用户定义的函数。

In [185]: tsdf.transform(np.abs)
Out[185]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [186]: tsdf.transform("abs")
Out[186]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [187]: tsdf.transform(lambda x: x.abs())
Out[187]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

这里 transform() 接收到一个单个函数；这相当于应用 ufunc。

In [188]: np.abs(tsdf)
Out[188]: 
 A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

将单个函数传递给 .transform() 与 Series 将返回单个 Series。

In [189]: tsdf["A"].transform(np.abs)
Out[189]: 
2000-01-01    0.428759
2000-01-02    0.168731
2000-01-03    1.621034
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.254374
2000-01-09    0.157795
2000-01-10    0.030876
Freq: D, Name: A, dtype: float64

使用多个函数进行转换

传递多个函数将产生一个列 MultiIndexed DataFrame。第一级将是原始框架列名称；第二级将是转换函数的名称。

In [190]: tsdf.transform([np.abs, lambda x: x + 1])
Out[190]: 
                   A                   B                   C
            absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
2000-01-01  0.428759  0.571241  0.864890  0.135110  0.675341  0.324659
2000-01-02  0.168731  0.831269  1.338144  2.338144  1.279321 -0.279321
2000-01-03  1.621034 -0.621034  0.438107  1.438107  0.903794  1.903794
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.254374  1.254374  1.240447 -0.240447  0.201052  0.798948
2000-01-09  0.157795  0.842205  0.791197  1.791197  1.144209 -0.144209
2000-01-10  0.030876  0.969124  0.371900  1.371900  0.061932  1.061932

Passing multiple functions to a Series yields a DataFrame. The resulting column names are the names of the transforming functions.

In [191]: tsdf["A"].transform([np.abs, lambda x: x + 1])
Out[191]: 
            absolute  <lambda>
2000-01-01  0.428759  0.571241
2000-01-02  0.168731  0.831269
2000-01-03  1.621034 -0.621034
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374  1.254374
2000-01-09  0.157795  0.842205
2000-01-10  0.030876  0.969124
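
Because the second column level comes from each function's name, anonymous functions all end up labelled `<lambda>`. A minimal sketch, assuming two hypothetical named helpers (`absolute_value` and `plus_one`, which are not part of pandas), shows how named functions give readable labels and how to select a single transformed column:

```python
import pandas as pd

# Illustrative data; not the tsdf used in the surrounding examples.
frame = pd.DataFrame({"A": [-1.0, 2.0], "B": [0.5, -0.25]})


def absolute_value(x):
    # Works on both scalars and Series.
    return abs(x)


def plus_one(x):
    return x + 1


wide = frame.transform([absolute_value, plus_one])

# Columns are (original column, function name) pairs.
column_pairs = list(wide.columns)

# Select one transformed column with a tuple key.
a_plus_one = wide[("A", "plus_one")]

# On a Series the result is a DataFrame keyed by function name only.
narrow = frame["A"].transform([absolute_value, plus_one])
```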

Transforming with a dict

Passing a dict of functions allows selective transforming per column.

In [192]: tsdf.transform({"A": np.abs, "B": lambda x: x + 1})
Out[192]: 
                   A         B
2000-01-01  0.428759  0.135110
2000-01-02  0.168731  2.338144
2000-01-03  1.621034  1.438107
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374 -0.240447
2000-01-09  0.157795  1.791197
2000-01-10  0.030876  1.371900

Passing a dict of lists generates a MultiIndexed DataFrame with these selective transforms.

In [193]: tsdf.transform({"A": np.abs, "B": [lambda x: x + 1, "sqrt"]})
Out[193]: 
                   A         B
            absolute  <lambda>      sqrt
2000-01-01  0.428759  0.135110       NaN
2000-01-02  0.168731  2.338144  1.156782
2000-01-03  1.621034  1.438107  0.661897
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -0.240447       NaN
2000-01-09  0.157795  1.791197  0.889493
2000-01-10  0.030876  1.371900  0.609836
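
One consequence of the dict form that is easy to miss in the output above is that columns not named in the dict are dropped from the result. A small sketch with made-up data (the frame and the helper `plus_one` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative data with non-negative values in B so that sqrt is defined.
frame = pd.DataFrame({"A": [-1.0, 2.0], "B": [4.0, 9.0], "C": [7.0, 8.0]})


def plus_one(x):
    return x + 1


# Only the dict keys appear in the result; column C is left out.
selected = frame.transform({"A": np.abs, "B": plus_one})

# A list as a dict value produces two-level columns for the whole result.
nested = frame.transform({"A": np.abs, "B": [plus_one, "sqrt"]})
b_sqrt = nested[("B", "sqrt")]
```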

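Applying elementwise functions

Since not all functions can be vectorized (that is, accept a NumPy array and return another array or value), the map() method on DataFrame and, analogously, map() on Series accept any Python function that takes a single value and returns a single value. For example: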

In [194]: df4 = df.copy()

In [195]: df4
Out[195]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [196]: def f(x):
   .....:     return len(str(x))
   .....: 

In [197]: df4["one"].map(f)
Out[197]: 
a    18
b    19
c    18
d     3
Name: one, dtype: int64

In [198]: df4.map(f)
Out[198]: 
   one  two  three
a   18   17      3
b   19   18     20
c   18   18     16
d    3   19     19
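
Note that in the outputs above the NaN entries are also passed to f, which is why they come back as 3 (the length of the string "nan"). A hedged sketch of how the `na_action="ignore"` option of Series.map() leaves missing values untouched instead; the Series `values` is made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
values = pd.Series([1.5, np.nan, 22.25], index=["a", "b", "c"])


def f(x):
    return len(str(x))


# By default NaN is passed to f like any other value: str(nan) == "nan".
lengths = values.map(f)

# With na_action="ignore", NaN is propagated without calling f.
skipped = values.map(f, na_action="ignore")
```

Because the second result contains NaN, its dtype is floating point rather than integer.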

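Series.map() has an additional feature: it can be used to easily "link" or "map" values defined by a secondary Series. This is closely related to merging/joining functionality: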

In [199]: s = pd.Series(
   .....:     ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
   .....: )
   .....: 

In [200]: t = pd.Series({"six": 6.0, "seven": 7.0})

In [201]: s
Out[201]: 
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [202]: s.map(t)
Out[202]: 
a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64
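
The lookup behaviour can be sketched as follows. Each value of the caller is looked up in the index of the secondary Series, and a value with no match becomes NaN. The names `words` and `lookup` and the extra "eight" entry are assumptions added for illustration:

```python
import pandas as pd

words = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])
lookup = pd.Series({"six": 6.0, "seven": 7.0})

# Values of `words` are matched against the index of `lookup`;
# "eight" has no match and therefore maps to NaN.
via_series = words.map(lookup)

# A plain dict behaves the same way for this purpose.
via_dict = words.map({"six": 6.0, "seven": 7.0})
```

The result keeps the index of the caller (`words`), which is what makes this resemble a left join on the values.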

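Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features that rely on label alignment. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

Reorders the existing data to match a new set of labels
Inserts missing value (NA) markers at label locations where no data for that label existed
If specified, fills data for missing labels using logic (highly relevant to working with time series data)

Here is a simple example: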

In [203]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [204]: s
Out[204]: 
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
e   -1.326508
dtype: float64

In [205]: s.reindex(["e", "b", "f", "d"])
Out[205]: 
e   -1.326508
b    1.328614
f         NaN
d   -0.385845
dtype: float64

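Here, the f label was not contained in the Series and hence appears as NaN in the result.

With a DataFrame, you can simultaneously reindex the index and the columns: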
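
The filling behaviour mentioned in the list above is not shown in the example, so here is a minimal sketch using the `fill_value` and `method` parameters of reindex(). The data and names (`labelled`, `ts`) are made up for illustration:

```python
import pandas as pd

labelled = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# The missing label "z" gets NaN by default, or a constant via fill_value.
with_nan = labelled.reindex(["c", "a", "z"])
with_zero = labelled.reindex(["c", "a", "z"], fill_value=0.0)

# Forward fill on a time series: each new date takes the last earlier value.
ts = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2000-01-01", "2000-01-03"]),
)
daily = ts.reindex(pd.date_range("2000-01-01", periods=4, freq="D"), method="ffill")
```

Filling with `method` requires the original index to be monotonic, which is why it is typically used with sorted time series.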

In [206]: df
Out[206]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [207]: df.reindex(index=["c", "f", "b"], columns=["three", "two", "one"])
Out[207]: 
      three       two       one
c  1.227435  1.478369  0.695246
f       NaN       NaN       NaN
b -0.050390  1.912123  0.343054

Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:

In [208]: rs = s.reindex(df.index)

In [209]: rs
Out[209]: 
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
dtype: float64

In [210]: rs.index is df.index
Out[210]: True

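This means that the reindexed Series's index is the same Python object as the DataFrame's index.

DataFrame.reindex() also supports an "axis-style" calling convention, where you specify a single labels argument and the axis it applies to.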

In [211]: df.reindex(["c", "f", "b"], axis="index")
Out[211]: 
        one       two     three
c  0.695246  1.478369  1.227435
f       NaN       NaN       NaN
b  0.343054  1.912123 -0.050390

In [212]: df.reindex(["three", "two", "one"], axis="columns")
Out[212]: 
      three       two       one
a       NaN  1.772517  1.394981
b -0.050390  1.912123  0.343054
c  1.227435  1.478369  0.695246
d -0.613172  0.279344       NaN
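
As a quick check that the two calling conventions are interchangeable, the sketch below reindexes a small made-up frame both ways and compares the results:

```python
import pandas as pd

# Illustrative data; not the df used in the surrounding examples.
frame = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Keyword style: both axes in a single call.
keyword_style = frame.reindex(index=["c", "f"], columns=["two", "one"])

# Axis style: one labels argument per call, with the axis named explicitly.
axis_style = frame.reindex(["c", "f"], axis="index").reindex(
    ["two", "one"], axis="columns"
)

same = keyword_style.equals(axis_style)
```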

See also

MultiIndex / Advanced Indexing is an even more concise way of doing reindexing.

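Note

When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily optimized), but when CPU cycles matter, sprinkling a few explicit reindex calls here and there can have an impact.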
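
The sketch below illustrates what "pre-aligned" means in practice; it does not measure performance. It uses made-up frames `left` and `right`, and shows that aligning once up front with align(), or conforming one operand with reindex(), gives the same values as the implicit alignment done by the addition:

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# Implicit: the addition first aligns both operands on the union of labels.
implicit = left + right

# Explicit: align once, then reuse the aligned frames in later operations.
left_aligned, right_aligned = left.align(right, join="outer")
explicit = left_aligned + right_aligned

# Alternative: conform one operand to the other's labels with reindex().
right_on_left = right.reindex(left.index)
restricted = left + right_on_left
```

If the same pair of frames takes part in many arithmetic operations, doing the alignment once and reusing `left_aligned` and `right_aligned` avoids repeating that step each time.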


Originally published: 2024-04-24. In case of infringement, contact cloudcommunity@tencent.com for removal.
