前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >《Pandas 1.x Cookbook · 第二版》第08章 索引对齐

《Pandas 1.x Cookbook · 第二版》第08章 索引对齐

作者头像
SeanCheney
发布2021-03-02 15:28:40
6430
发布2021-03-02 15:28:40
举报
文章被收录于专栏:SeanCheney的专栏SeanCheney的专栏

8.1 检查索引对象

读取大学数据集,用变量columns存储列索引:

代码语言:javascript
复制
>>> import pandas as pd
>>> import numpy as np
>>> college = pd.read_csv("data/college.csv")
>>> columns = college.columns
>>> columns
Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

.values属性获取底层的NumPy数组:

代码语言:javascript
复制
>>> columns.values
array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

用标量、列表和切片从columns提取数据:

代码语言:javascript
复制
>>> columns[5]
'WOMENONLY'
>>> columns[[1, 8, 10]]
Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')
>>> columns[-7:-4]
Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')

索引对象和Series和DataFrames有公用的方法:

代码语言:javascript
复制
>>> columns.min(), columns.max(), columns.isnull().sum()
('CITY', 'WOMENONLY', 0)

索引对象支持运算和比较:

代码语言:javascript
复制
>>> columns + "_A"
Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A',
'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')
>>> columns > "G"
array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

索引是不可变对象:

代码语言:javascript
复制
>>> columns[1] = "city"
Traceback (most recent call last):
  ...
TypeError: Index does not support mutable operations
更多

索引支持集合运算:

代码语言:javascript
复制
>>> c1 = columns[:4]
>>> c1
Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')
>>> c2 = columns[2:6]
>>> c2
Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')
>>> c1.union(c2)  # or 'c1 | c2'
Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'],
dtype='object')
>>> c1.symmetric_difference(c2)  # or 'c1 ^ c2'
Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')

8.2 生成笛卡尔积

创建两个部分相同的Series:

代码语言:javascript
复制
>>> s1 = pd.Series(index=list("aaab"), data=np.arange(4))
>>> s1
a    0
a    1
a    2
b    3
dtype: int64
>>> s2 = pd.Series(index=list("cababb"), data=np.arange(6))
>>> s2
c    0
a    1
b    2
a    3
b    4
b    5
dtype: int64

将这两个Series相加,就生成了笛卡尔积:

代码语言:javascript
复制
>>> s1 + s2
a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64
更多

如果索引相同,顺序也一致,就不会生成笛卡尔积:

代码语言:javascript
复制
>>> s1 = pd.Series(index=list("aaabb"), data=np.arange(5))
>>> s2 = pd.Series(index=list("aaabb"), data=np.arange(5))
>>> s1 + s2
a    0
a    2
a    4
b    6
b    8
dtype: int64

如果索引中的元素相同,但顺序不一致,也会产生笛卡尔积:

代码语言:javascript
复制
>>> s1 = pd.Series(index=list("aaabb"), data=np.arange(5))
>>> s2 = pd.Series(index=list("bbaaa"), data=np.arange(5))
>>> s1 + s2
a    2
a    3
a    4
a    3
a    4
    ..
a    6
b    3
b    4
b    4
b    5
Length: 13, dtype: int64

如果索引的顺序不同,但没有重复对象,则不会生成笛卡尔积:

代码语言:javascript
复制
>>> s3 = pd.Series(index=list("ab"), data=np.arange(2))
>>> s4 = pd.Series(index=list("ba"), data=np.arange(2))
>>> s3 + s4
a    1
b    1
dtype: int64

8.3 索引爆炸

读取员工数据集:

代码语言:javascript
复制
>>> employee = pd.read_csv(
...     "data/employee.csv", index_col="RACE"
... )
>>> employee.head()
              UNIQUE_ID POSITION_TITLE  ...   HIRE_DATE    JOB_DATE
RACE                                    ...                        
Hispanic/...          0  ASSISTAN...    ...  2006-06-12  2012-10-13
Hispanic/...          1  LIBRARY ...    ...  2000-07-19  2010-09-18
White                 2  POLICE O...    ...  2015-02-03  2015-02-03
White                 3  ENGINEER...    ...  1982-02-08  1991-05-25
White                 4  ELECTRICIAN    ...  1989-06-19  1994-10-22

判断下面两个对象是否等价:

代码语言:javascript
复制
>>> salary1 = employee["BASE_SALARY"]
>>> salary2 = employee["BASE_SALARY"]
>>> salary1 is salary2
True

这两个对象指向相同,如果要做一个全新的对象,需要使用.copy方法:

代码语言:javascript
复制
>>> salary2 = employee["BASE_SALARY"].copy()
>>> salary1 is salary2
False

改变一个索引的顺序:

代码语言:javascript
复制
>>> salary1 = salary1.sort_index()
>>> salary1.head()
RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
Name: BASE_SALARY, dtype: float64
>>> salary2.head()
RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
White               63166.0
White               56347.0
Name: BASE_SALARY, dtype: float64

将两个Series相加:

代码语言:javascript
复制
>>> salary_add = salary1 + salary2
>>> salary_add.head()
RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64

为了对比,再创建一个salary_add1,比较这几个Series的长度:

代码语言:javascript
复制
# 索引顺序不同时,产生了笛卡尔积
>>> salary_add1 = salary1 + salary1
>>> len(salary1), len(salary2), len(salary_add), len(
...     salary_add1
... )
(2000, 2000, 1175424, 2000)
更多

笛卡尔积的数量是可以计算的:

代码语言:javascript
复制
>>> index_vc = salary1.index.value_counts(dropna=False)
>>> index_vc
Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64
>>> index_vc.pow(2).sum()
1175424

8.4 用不等索引填充值

读取三个棒球数据集:

代码语言:javascript
复制
>>> baseball_14 = pd.read_csv(
...     "data/baseball14.csv", index_col="playerID"
... )
>>> baseball_15 = pd.read_csv(
...     "data/baseball15.csv", index_col="playerID"
... )
>>> baseball_16 = pd.read_csv(
...     "data/baseball16.csv", index_col="playerID"
... )
>>> baseball_14.head()
           yearID  stint teamID lgID  ...  HBP   SH   SF  GIDP
playerID                              ...
altuvjo01    2014      1    HOU   AL  ...  5.0  1.0  5.0  20.0
cartech02    2014      1    HOU   AL  ...  5.0  0.0  4.0  12.0
castrja01    2014      1    HOU   AL  ...  9.0  1.0  3.0  11.0
corpoca01    2014      1    HOU   AL  ...  3.0  1.0  2.0   3.0
dominma01    2014      1    HOU   AL  ...  5.0  2.0  7.0  23.0

检查哪些索引位于baseball_14而不在baseball_15中:

代码语言:javascript
复制
>>> baseball_14.index.difference(baseball_15.index)
Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')
>>> baseball_15.index.difference(baseball_14.index)
Index(['congeha01', 'correca01', 'gattiev01', 'gomezca01', 'lowrije01',
       'rasmuco01', 'tuckepr01', 'valbulu01'],
      dtype='object', name='playerID')

查询每名选手的击球数:

代码语言:javascript
复制
>>> hits_14 = baseball_14["H"]
>>> hits_15 = baseball_15["H"]
>>> hits_16 = baseball_16["H"]
>>> hits_14.head()
playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64

将两列相加:

代码语言:javascript
复制
>>> (hits_14 + hits_15).head()
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64

发现有缺失值,使用fill_value来填充:

代码语言:javascript
复制
>>> hits_14.add(hits_15, fill_value=0).head()
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

在加上hits_16

代码语言:javascript
复制
>>> hits_total = hits_14.add(hits_15, fill_value=0).add(
...     hits_16, fill_value=0
... )
>>> hits_total.head()
playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

检查是否有缺失值:

代码语言:javascript
复制
>>> hits_total.hasnans
False
更多

DataFrame在相加时,也支持填充:

代码语言:javascript
复制
>>> df_14 = baseball_14[["G", "AB", "R", "H"]]
>>> df_14.head()
             G   AB   R    H
playerID                    
altuvjo01  158  660  85  225
cartech02  145  507  68  115
castrja01  126  465  43  103
corpoca01   55  170  22   40
dominma01  157  564  51  121

>>> df_15 = baseball_15[["AB", "R", "H", "HR"]]
>>> df_15.head()
            AB   R    H  HR
playerID                   
altuvjo01  638  86  200  15
cartech02  391  50   78  24
castrja01  337  38   71  11
congeha01  201  25   46  11
correca01  387  52  108  22

8.5 从不同的DataFrame添加列

读取数据集:

代码语言:javascript
复制
>>> employee = pd.read_csv("data/employee.csv")
>>> dept_sal = employee[["DEPARTMENT", "BASE_SALARY"]]

对值进行排序:

代码语言:javascript
复制
>>> dept_sal = dept_sal.sort_values(
...     ["DEPARTMENT", "BASE_SALARY"],
...     ascending=[True, False],
... )

DEPARTMENT列进行去重:

代码语言:javascript
复制
>>> max_dept_sal = dept_sal.drop_duplicates(
...     subset="DEPARTMENT"
... )
>>> max_dept_sal.head()
       DEPARTMENT  BASE_SALARY
                        DEPARTMENT  BASE_SALARY
1494    Admn. & Regulatory Affairs     140416.0
149       City Controller's Office      64251.0
236                   City Council     100000.0
647   Convention and Entertainment      38397.0
1500   Dept of Neighborhoods (DON)      89221.0

将列DEPARTMENT作为行索引:

代码语言:javascript
复制
>>> max_dept_sal = max_dept_sal.set_index("DEPARTMENT")
>>> employee = employee.set_index("DEPARTMENT")

employee添加一个新列:

代码语言:javascript
复制
>>> employee = employee.assign(
...     MAX_DEPT_SALARY=max_dept_sal["BASE_SALARY"]
... )
>>> employee
                               UNIQUE_ID  ... MAX_D/ALARY
DEPARTMENT                                ...
Municipal Courts Department            0  ...    121862.0
Library                                1  ...    107763.0
Houston Police Department-HPD          2  ...    199596.0
Houston Fire Department (HFD)          3  ...    210588.0
General Services Department            4  ...     89194.0
...                                  ...  ...         ...
Houston Police Department-HPD       1995  ...    199596.0
Houston Fire Department (HFD)       1996  ...    210588.0
Houston Police Department-HPD       1997  ...    199596.0
Houston Police Department-HPD       1998  ...    199596.0
Houston Fire Department (HFD)       1999  ...    210588.0

用query方法检查是否存在BASE_SALARY高于MAX_DEPT_SALARY的行:

代码语言:javascript
复制
>>> employee.query("BASE_SALARY > MAX_DEPT_SALARY")
Empty DataFrame
Columns: [UNIQUE_ID, POSITION_TITLE, BASE_SALARY, RACE, EMPLOYMENT_TYPE, GENDER, EMPLOYMENT_STATUS, HIRE_DATE, JOB_DATE, MAX_DEPT_SALARY]
Index: []

将前面的方法组合成链式方法:

代码语言:javascript
复制
>>> employee = pd.read_csv("data/employee.csv")
>>> max_dept_sal = (
...     employee
...     [["DEPARTMENT", "BASE_SALARY"]]
...     .sort_values(
...         ["DEPARTMENT", "BASE_SALARY"],
...         ascending=[True, False],
...     )
...     .drop_duplicates(subset="DEPARTMENT")
...     .set_index("DEPARTMENT")
... )
>>> (
...     employee
...     .set_index("DEPARTMENT")
...     .assign(
...         MAX_DEPT_SALARY=max_dept_sal["BASE_SALARY"]
...     )
... )
              UNIQUE_ID POSITION_TITLE  ...    JOB_DATE MAX_DEPT_SALARY
DEPARTMENT                              ...
Municipal...          0  ASSISTAN...    ...  2012-10-13     121862.0
Library               1  LIBRARY ...    ...  2010-09-18     107763.0
Houston P...          2  POLICE O...    ...  2015-02-03     199596.0
Houston F...          3  ENGINEER...    ...  1991-05-25     210588.0
General S...          4  ELECTRICIAN    ...  1994-10-22      89194.0
...                 ...          ...    ...         ...          ...
Houston P...       1995  POLICE O...    ...  2015-06-09     199596.0
Houston F...       1996  COMMUNIC...    ...  2013-10-06     210588.0
Houston P...       1997  POLICE O...    ...  2015-10-13     199596.0
Houston P...       1998  POLICE O...    ...  2011-07-02     199596.0
Houston F...       1999  FIRE FIG...    ...  2010-07-12     210588.0
更多

在索引对齐的过程中,如果索引不能对齐,就会产生缺失值。只用max_dept_sal的前三行做新列:

代码语言:javascript
复制
>>> (
...     employee
...     .set_index("DEPARTMENT")
...     .assign(
...         MAX_SALARY2=max_dept_sal["BASE_SALARY"].head(3)
...     )
...     .MAX_SALARY2
...     .value_counts(dropna=False)
... )
NaN         1955
140416.0      29
100000.0      11
64251.0        5
Name: MAX_SALARY2, dtype: int64

我的方法是使用groupbytransform,后面章节会详细讨论:

代码语言:javascript
复制
>>> max_sal = (
...     employee
...     .groupby("DEPARTMENT")
...     .BASE_SALARY
...     .transform("max")
... )
>>> (employee.assign(MAX_DEPT_SALARY=max_sal))
UNIQUE_ID POSITION_TITLE  ...    JOB_DATE  MAX_DEPT_SALARY
0             0  ASSISTAN...    ...  2012-10-13     121862.0
1             1  LIBRARY ...    ...  2010-09-18     107763.0
2             2  POLICE O...    ...  2015-02-03     199596.0
3             3  ENGINEER...    ...  1991-05-25     210588.0
4             4  ELECTRICIAN    ...  1994-10-22      89194.0
...         ...          ...    ...         ...          ...
1995       1995  POLICE O...    ...  2015-06-09     199596.0
1996       1996  COMMUNIC...    ...  2013-10-06     210588.0
1997       1997  POLICE O...    ...  2015-10-13     199596.0
1998       1998  POLICE O...    ...  2011-07-02     199596.0
1999       1999  FIRE FIG...    ...  2010-07-12     210588.0

下面的方法是将groupbymerge联用:

代码语言:javascript
复制
>>> max_sal = (
...     employee
...     .groupby("DEPARTMENT")
...     .BASE_SALARY
...     .max()
... )
>>> (
...     employee.merge(
...         max_sal.rename("MAX_DEPT_SALARY"),
...         how="left",
...         left_on="DEPARTMENT",
...         right_index=True,
...     )
... )
UNIQUE_ID POSITION_TITLE  ...    JOB_DATE  MAX_DEPT_SALARY
0             0  ASSISTAN...    ...  2012-10-13     121862.0
1             1  LIBRARY ...    ...  2010-09-18     107763.0
2             2  POLICE O...    ...  2015-02-03     199596.0
3             3  ENGINEER...    ...  1991-05-25     210588.0
4             4  ELECTRICIAN    ...  1994-10-22      89194.0
...         ...          ...    ...         ...          ...
1995       1995  POLICE O...    ...  2015-06-09     199596.0
1996       1996  COMMUNIC...    ...  2013-10-06     210588.0
1997       1997  POLICE O...    ...  2015-10-13     199596.0
1998       1998  POLICE O...    ...  2011-07-02     199596.0
1999       1999  FIRE FIG...    ...  2010-07-12     210588.0

8.6 高亮每列的最大值

读取数据集:

代码语言:javascript
复制
>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college.dtypes
CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
WOMENONLY             float64
                       ...
PCTPELL               float64
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
Length: 26, dtype: object

随机检查数据:

代码语言:javascript
复制
>>> college.MD_EARN_WNE_P10.sample(10, random_state=42)
INSTNM
Career Point College                                      20700
Ner Israel Rabbinical College                       PrivacyS...
Reflections Academy of Beauty                               NaN
Capital Area Technical College                            26400
West Virginia University Institute of Technology          43400
Mid-State Technical College                               32000
Strayer University-Huntsville Campus                      49200
National Aviation Academy of Tampa Bay                    45000
University of California-Santa Cruz                       43000
Lexington Theological Seminary                              NaN
Name: MD_EARN_WNE_P10, dtype: object
>>> college.GRAD_DEBT_MDN_SUPP.sample(10, random_state=42)
INSTNM
Career Point College                                      14977
Ner Israel Rabbinical College                       PrivacyS...
Reflections Academy of Beauty                       PrivacyS...
Capital Area Technical College                      PrivacyS...
West Virginia University Institute of Technology          23969
Mid-State Technical College   

.value_counts查看为什么是字符串:

代码语言:javascript
复制
>>> college.MD_EARN_WNE_P10.value_counts()
PrivacySuppressed    822
38800                151
21500                 97
49200                 78
27400                 46
                    ...
66700                  1
163900                 1
64400                  1
58700                  1
64100                  1
Name: MD_EARN_WNE_P10, Length: 598, dtype: int64
>>> set(college.MD_EARN_WNE_P10.apply(type))
{<class 'float'>, <class 'str'>}
>>> college.GRAD_DEBT_MDN_SUPP.value_counts()
PrivacySuppressed    1510
9500                  514
27000                 306
25827.5               136
25000                 124
                     ...
16078.5                 1
27763.5                 1
6382                    1
27625                   1
11300                   1
Name: GRAD_DEBT_MDN_SUPP, Length: 2038, dtype: int64

使用to_numeric将其转化为数值类型,参数errors='coerce'可以将字符串转换为NaN

代码语言:javascript
复制
>>> cols = ["MD_EARN_WNE_P10", "GRAD_DEBT_MDN_SUPP"]
>>> for col in cols:
...     college[col] = pd.to_numeric(
...         college[col], errors="coerce"
...     )
>>> college.dtypes.loc[cols]
MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object

选取数值类型数据:

代码语言:javascript
复制
>>> college_n = college.select_dtypes("number")
>>> college_n.head()
              HBCU  MENONLY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM                       ...
Alabama A...   1.0      0.0  ...      30300.0          33888.0
Universit...   0.0      0.0  ...      39700.0          21941.5
Amridge U...   0.0      0.0  ...      40100.0          23370.0
Universit...   0.0      0.0  ...      45500.0          24097.0
Alabama S...   1.0      0.0  ...      26600.0          33118.5

有的二分列只有1和0两种数值,用.nunique进行判断:

代码语言:javascript
复制
>>> binary_only = college_n.nunique() == 2
>>> binary_only.head()
HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool

用布尔数组创建二分列表:

代码语言:javascript
复制
>>> binary_cols = binary_only[binary_only].index
>>> binary_cols
Index(['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER'], dtype='object')

使用drop方法删除这些二分列表:

代码语言:javascript
复制
>>> college_n2 = college_n.drop(columns=binary_cols)
>>> college_n2.head()
              SATVRMID  SATMTMID  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM                            ...
Alabama A...     424.0     420.0  ...      30300.0          33888.0
Universit...     570.0     565.0  ...      39700.0          21941.5
Amridge U...       NaN       NaN  ...      40100.0          23370.0
Universit...     595.0     590.0  ...      45500.0          24097.0
Alabama S...     425.0     430.0  ...      26600.0          33118.5

idxmax方法找到每列最大值对应的行索引标签:

代码语言:javascript
复制
>>> max_cols = college_n2.idxmax()
>>> max_cols
SATVRMID                      California Institute of Technology
SATMTMID                      California Institute of Technology
UGDS                               University of Phoenix-Arizona
UGDS_WHITE                Mr Leon's School of Hair Design-Moscow
UGDS_BLACK                    Velvatex College of Beauty Culture
                                         ...
PCTPELL                                 MTI Business College Inc
PCTFLOAN                                  ABC Beauty College Inc
UG25ABV                           Dongguk University-Los Angeles
MD_EARN_WNE_P10                     Medical College of Wisconsin
GRAD_DEBT_MDN_SUPP    Southwest University of Visual Arts-Tucson
Length: 18, dtype: object

max_cols上调用unique方法,能得到所有行索引的值:

代码语言:javascript
复制
>>> unique_max_cols = max_cols.unique()
>>> unique_max_cols[:5]
array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management'], dtype=object)

使用.style高亮所有最大值:

代码语言:javascript
复制
college_n2.loc[unique_max_cols].style.highlight_max()

重写上面的代码,提高可读性:

代码语言:javascript
复制
>>> def remove_binary_cols(df):
...     binary_only = df.nunique() == 2
...     cols = binary_only[binary_only].index.tolist()
...     return df.drop(columns=cols)
>>> def select_rows_with_max_cols(df):
...     max_cols = df.idxmax()
...     unique = max_cols.unique()
...     return df.loc[unique]
>>> (
...     college
...     .assign(
...         MD_EARN_WNE_P10=pd.to_numeric(
...             college.MD_EARN_WNE_P10, errors="coerce"
...         ),
...         GRAD_DEBT_MDN_SUPP=pd.to_numeric(
...             college.GRAD_DEBT_MDN_SUPP, errors="coerce"
...         ),
...     )
...     .select_dtypes("number")
...     .pipe(remove_binary_cols)
...     .pipe(select_rows_with_max_cols)
... )
              SATVRMID  SATMTMID  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM                            ...
Californi...     765.0     785.0  ...      77800.0          11812.5
Universit...       NaN       NaN  ...          NaN          33000.0
Mr Leon's...       NaN       NaN  ...          NaN          15710.0
Velvatex ...       NaN       NaN  ...          NaN              NaN
Thunderbi...       NaN       NaN  ...     118900.0              NaN
...                ...       ...  ...          ...              ...
MTI Busin...       NaN       NaN  ...      23000.0           9500.0
ABC Beaut...       NaN       NaN  ...          NaN          16500.0
Dongguk U...       NaN       NaN  ...          NaN              NaN
Medical C...       NaN       NaN  ...     233100.0              NaN
Southwest...       NaN       NaN  ...      27200.0          49750.0
更多

可以用axis参数,高亮每行的最大值:

代码语言:javascript
复制
>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_").head()

8.7 用链式方法替换idxmax

和上一节一样,读取数据集,只选取目标数值列:

代码语言:javascript
复制
>>> def remove_binary_cols(df):
...     binary_only = df.nunique() == 2
...     cols = binary_only[binary_only].index.tolist()
...     return df.drop(columns=cols)
>>> college_n = (
...     college
...     .assign(
...         MD_EARN_WNE_P10=pd.to_numeric(
...             college.MD_EARN_WNE_P10, errors="coerce"
...         ),
...         GRAD_DEBT_MDN_SUPP=pd.to_numeric(
...             college.GRAD_DEBT_MDN_SUPP, errors="coerce"
...         ),
...     )
...     .select_dtypes("number")
...     .pipe(remove_binary_cols)
... )

使用max方法,找到每列的最大值:

代码语言:javascript
复制
>>> college_n.max().head()
SATVRMID         765.0
SATMTMID         785.0
UGDS          151558.0
UGDS_WHITE         1.0
UGDS_BLACK         1.0
dtype: float64

使用eq方法,将DataFrame中的每个值和列的最大值比较:

代码语言:javascript
复制
>>> college_n.eq(college_n.max()).head()
              SATVRMID  SATMTMID  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM                            ...
Alabama A...     False     False  ...        False            False
Universit...     False     False  ...        False            False
Amridge U...     False     False  ...        False            False
Universit...     False     False  ...        False            False
Alabama S...     False     False  ...        False            False

使用.any找到包含True值的行:

代码语言:javascript
复制
>>> has_row_max = (
...     college_n
...     .eq(college_n.max())
...     .any(axis="columns")
... )
>>> has_row_max.head()
INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

检查有多少个最大值:

代码语言:javascript
复制
>>> college_n.shape
(7535, 18)
>>> has_row_max.sum()
401

这说明,有的列存在多个最大值。回到上面的步骤,用cumsum方法检查:

代码语言:javascript
复制
>>> college_n.eq(college_n.max()).cumsum()
              SATVRMID  SATMTMID  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM                            ...
Alabama A...         0         0  ...            0                0
Universit...         0         0  ...            0                0
Amridge U...         0         0  ...            0                0
Universit...         0         0  ...            0                0
Alabama S...         0         0  ...            0                0
...                ...       ...  ...          ...              ...
SAE Insti...         1         1  ...            1                2
Rasmussen...         1         1  ...            1                2
National ...         1         1  ...            1                2
Bay Area ...         1         1  ...            1                2
Excel Lea...         1         1  ...            1                2

cumsum方法再重复一下,找到1出现的地方:

代码语言:javascript
复制
>>> (college_n.eq(college_n.max()).cumsum().cumsum())
              SATVRMID  SATMTMID  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM                            ...
Alabama A...         0         0  ...            0                0
Universit...         0         0  ...            0                0
Amridge U...         0         0  ...            0                0
Universit...         0         0  ...            0                0
Alabama S...         0         0  ...            0                0
...                ...       ...  ...          ...              ...
SAE Insti...      7305      7305  ...         3445            10266
Rasmussen...      7306      7306  ...         3446            10268
National ...      7307      7307  ...         3447            10270
Bay Area ...      7308      7308  ...         3448            10272
Excel Lea...      7309      7309  ...         3449            10274

现在再用any方法找到至少有一个True值的列:

代码语言:javascript
复制
>>> has_row_max2 = (
...     college_n.eq(college_n.max())
...     .cumsum()
...     .cumsum()
...     .eq(1)
...     .any(axis="columns")
... )
>>> has_row_max2.head()
INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

现在,has_row_max2的True值就不必列数多了:

代码语言:javascript
复制
>>> has_row_max2.sum()
16

计算最大值对应的行索引标签:

代码语言:javascript
复制
>>> idxmax_cols = has_row_max2[has_row_max2].index
>>> idxmax_cols
Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       'Mr Leon's School of Hair Design-Moscow',
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')
代码语言:javascript
复制
>>> set(college_n.idxmax().unique()) == set(idxmax_cols)
True

将其写成idx_max函数:

代码语言:javascript
复制
>>> def idx_max(df):
...     has_row_max = (
...         df
...         .eq(df.max())
...         .cumsum()
...         .cumsum()
...         .eq(1)
...         .any(axis="columns")
...     )
...     return has_row_max[has_row_max].index
>>> idx_max(college_n)
Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       'Mr Leon's School of Hair Design-Moscow',
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')
更多

比较两种方法的速度:

代码语言:javascript
复制
>>> def idx_max(df):
...     has_row_max = (
...         df
...         .eq(df.max())
...         .cumsum()
...         .cumsum()
...         .eq(1)
...         .any(axis="columns")
...         [lambda df_: df_]
...         .index
...     )
...     return has_row_max
>>> %timeit college_n.idxmax().values
1.12 ms ± 28.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit idx_max(college_n)
5.35 ms ± 55.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

8.8 发现列的最常见最大值

读取数据集:

代码语言:javascript
复制
>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

.idxmax在列上,找到最高比例种族对应的学校:

代码语言:javascript
复制
>>> highest_percentage_race = college_ugds.idxmax(
...     axis="columns"
... )
>>> highest_percentage_race.head()
INSTNM
Alabama A & M University
University of Alabama at Birmingham
Amridge University
University of Alabama in Huntsville
Alabama State University
dtype: object

查看最大值的分布情况:

代码语言:javascript
复制
>>> highest_percentage_race.value_counts(normalize=True)
UGDS_WHITE    0.670352
UGDS_BLACK    0.151586
UGDS_HISP     0.129473
UGDS_UNKN     0.023422
UGDS_ASIAN    0.012074
UGDS_AIAN     0.006110
UGDS_NRA      0.004073
UGDS_NHPI     0.001746
UGDS_2MOR     0.001164
dtype: float64
更多

对于黑人占多数的学校,其它族裔是如何分布的:

代码语言:javascript
复制
>>> (
...     college_ugds
...     [highest_percentage_race == "UGDS_BLACK"]
...     .drop(columns="UGDS_BLACK")
...     .idxmax(axis="columns")
...     .value_counts(normalize=True)
... )
UGDS_WHITE    0.661228
UGDS_HISP     0.230326
UGDS_UNKN     0.071977
UGDS_NRA      0.018234
UGDS_ASIAN    0.009597
UGDS_2MOR     0.006718
UGDS_AIAN     0.000960
UGDS_NHPI     0.000960
dtype: float64
本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 8.1 检查索引对象
  • 8.2 生成笛卡尔积
  • 8.3 索引爆炸
  • 8.4 用不等索引填充值
  • 8.5 从不同的DataFrame添加列
  • 8.6 高亮每列的最大值
  • 8.7 用链式方法替换idxmax
  • 8.8 发现列的最常见最大值
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档