《Pandas Cookbook》第04章 选取数据子集1. 选取Series数据2. 选取DataFrame的行3. 同时选取DataFrame的行和列4. 用整数和标签选取数据5. 快速选取标量6


第01章 Pandas基础 第02章 DataFrame运算 第03章 数据分析入门 第04章 选取数据子集 第05章 布尔索引 第06章 索引对齐 第07章 分组聚合、过滤、转换 第08章 数据清理 第09章 合并Pandas对象 第10章 时间序列分析 第11章 用Matplotlib、Pandas、Seaborn进行可视化


In[1]: import pandas as pd
       import numpy as np

1. 选取Series数据

# 读取college数据集,查看CITY的前5行
 In[2]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        city = college['CITY']
        city.head()
Out[2]: INSTNM
        Alabama A & M University                   Normal
        University of Alabama at Birmingham    Birmingham
        Amridge University                     Montgomery
        University of Alabama in Huntsville    Huntsville
        Alabama State University               Montgomery
        Name: CITY, dtype: object
# iloc可以通过整数选取
 In[3]: city.iloc[3]
Out[3]: 'Huntsville'
# iloc通过整数列表选取多行,返回结果是Series
 In[4]: city.iloc[[10,20,30]]
Out[4]: INSTNM
        Birmingham Southern College                            Birmingham
        George C Wallace State Community College-Hanceville    Hanceville
        Judson College                                             Marion
        Name: CITY, dtype: object
# 选择等分的数据,可以使用切片语法
 In[5]: city.iloc[4:50:10]
Out[5]: INSTNM
        Alabama State University              Montgomery
        Enterprise State Community College    Enterprise
        Heritage Christian University           Florence
        Marion Military Institute                 Marion
        Reid State Technical College           Evergreen
        Name: CITY, dtype: object
# loc只接收行索引标签
 In[6]: city.loc['Heritage Christian University']
Out[6]: 'Florence'
# 随机选择4个标签
 In[7]: np.random.seed(1)
        labels = list(np.random.choice(city.index, 4))
        labels
Out[7]: ['Northwest HVAC/R Training Center',
         'California State University-Dominguez Hills',
         'Lower Columbia College',
         'Southwest Acupuncture College-Boulder']
# 通过标签列表选择多行
 In[8]: city.loc[labels]
Out[8]: INSTNM
        Northwest HVAC/R Training Center                Spokane
        California State University-Dominguez Hills      Carson
        Lower Columbia College                         Longview
        Southwest Acupuncture College-Boulder           Boulder
        Name: CITY, dtype: object
# 也可以通过切片语法均匀选择多个
 In[9]: city.loc['Alabama State University':'Reid State Technical College':10]
Out[9]: INSTNM
        Alabama State University              Montgomery
        Enterprise State Community College    Enterprise
        Heritage Christian University           Florence
        Marion Military Institute                 Marion
        Reid State Technical College           Evergreen
        Name: CITY, dtype: object
# 也可以不使用loc,直接使用类似Python的语法
 In[10]: city['Alabama State University':'Reid State Technical College':10]
Out[10]: INSTNM
         Alabama State University              Montgomery
         Enterprise State Community College    Enterprise
         Heritage Christian University           Florence
         Marion Military Institute                 Marion
         Reid State Technical College           Evergreen
         Name: CITY, dtype: object

更多

# 要想只选取一项,并保留其Series类型,则传入一个只包含一项的列表
 In[11]: city.iloc[[3]]
Out[11]: INSTNM
         University of Alabama in Huntsville    Huntsville
         Name: CITY, dtype: object
# 使用loc切片时要注意,如果start索引再stop索引之后,则会返回空,并且不会报警
 In[12]: city.loc['Reid State Technical College':'Alabama State University':10]
Out[12]: Series([], Name: CITY, dtype: object)
# 也可以切片逆序选取
 In[13]: city.loc['Reid State Technical College':'Alabama State University':-10]
Out[13]: INSTNM
         Reid State Technical College           Evergreen
         Marion Military Institute                 Marion
         Heritage Christian University           Florence
         Enterprise State Community College    Enterprise
         Alabama State University              Montgomery
         Name: CITY, dtype: object

2. 选取DataFrame的行

# 还是读取college数据集
 In[14]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college.head()
Out[14]: 
# 选取第61行
 In[15]: pd.options.display.max_rows = 6
 In[16]: college.iloc[60]
Out[16]: 
# 也可以通过行标签选取
 In[17]: college.loc['University of Alaska Anchorage']
Out[17]: CITY                  Anchorage
         STABBR                       AK
         HBCU                          0
                                     ...    
         UG25ABV                  0.4386
         MD_EARN_WNE_P10           42500
         GRAD_DEBT_MDN_SUPP      19449.5
         Name: University of Alaska Anchorage, Length: 26, dtype: object
# 选取多个不连续的行
 In[18]: college.iloc[[60, 99, 3]]
Out[18]: 
# 也可以用loc加列表来选取
 In[19]: labels = ['University of Alaska Anchorage',
                   'International Academy of Hair Design',
                   'University of Alabama in Huntsville']
         college.loc[labels]
Out[19]: 
# iloc可以用切片连续选取
 In[20]: college.iloc[99:102]
Out[20]: 
# loc可以用标签连续选取
 In[21]: start = 'International Academy of Hair Design'
         stop = 'Mesa Community College'
         college.loc[start:stop]
Out[21]: 

更多

# .index.tolist()可以直接提取索引标签,生成一个列表
 In[22]: college.iloc[[60, 99, 3]].index.tolist()
Out[22]: ['University of Alaska Anchorage',
          'International Academy of Hair Design',
          'University of Alabama in Huntsville']

3. 同时选取DataFrame的行和列

# 读取college数据集,给行索引命名为INSTNM;选取前3行和前4列
 In[23]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college.iloc[:3, :4]
Out[23]: 
# 用loc实现同上功能
 In[24]: college.loc[:'Amridge University', :'MENONLY']
Out[24]: 
# 选取两列的所有的行
 In[25]: college.iloc[:, [4,6]].head()
Out[25]: 
# loc实现同上功能
 In[26]: college.loc[:, ['WOMENONLY', 'SATVRMID']]
Out[26]: 
# 选取不连续的行和列
 In[27]: college.iloc[[100, 200], [7, 15]]
Out[27]: 
# 用loc和列表,选取不连续的行和列
 In[28]: rows = ['GateWay Community College', 'American Baptist Seminary of the West']
         columns = ['SATMTMID', 'UGDS_NHPI']
         college.loc[rows, columns]
Out[28]: 
# iloc选取一个标量值
 In[29]: college.iloc[5, -4]
Out[29]: 0.40100000000000002
# loc选取一个标量值
 In[30]: college.loc['The University of Alabama', 'PCTFLOAN']
Out[30]: 0.40100000000000002
# iloc对行切片,并只选取一列
 In[31]: college.iloc[90:80:-2, 5]
Out[31]: INSTNM
         Empire Beauty School-Flagstaff     0
         Charles of Italy Beauty College    0
         Central Arizona College            0
         University of Arizona              0
         Arizona State University-Tempe     0
         Name: RELAFFIL, dtype: int64
# loc对行切片,并只选取一列
 In[32]: start = 'Empire Beauty School-Flagstaff'
         stop = 'Arizona State University-Tempe'
         college.loc[start:stop:-2, 'RELAFFIL']
Out[32]: INSTNM
         Empire Beauty School-Flagstaff     0
         Charles of Italy Beauty College    0
         Central Arizona College            0
         University of Arizona              0
         Arizona State University-Tempe     0
         Name: RELAFFIL, dtype: int64

4. 用整数和标签选取数据

# 读取college数据集,行索引命名为INSTNM
 In[33]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
# 用索引方法get_loc,找到指定列的整数位置
 In[34]: col_start = college.columns.get_loc('UGDS_WHITE')
         col_end = college.columns.get_loc('UGDS_UNKN') + 1
         col_start, col_end
Out[34]: (10, 19)
# 用切片选取连续的列
 In[35]: college.iloc[:5, col_start:col_end]
Out[35]:

更多

# index()方法可以获得整数行对应的标签名
 In[36]: row_start = college.index[10]
         row_end = college.index[15]
         college.loc[row_start:row_end, 'UGDS_WHITE':'UGDS_UNKN']
Out[36]: 

5. 快速选取标量

# 通过将行标签赋值给一个变量,用loc选取
 In[37]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         cn = 'Texas A & M University-College Station'
         college.loc[cn, 'UGDS_WHITE']
Out[37]: 0.66099999999999992
# at可以实现同样的功能
 In[38]: college.at[cn, 'UGDS_WHITE']
Out[38]: 0.66099999999999992
# 用魔术方法%timeit,对速度进行比较
 In[39]: %timeit college.loc[cn, 'UGDS_WHITE']
Out[39]: 9.93 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
 In[40]: %timeit college.at[cn, 'UGDS_WHITE']
Out[40]: 6.69 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

.iat.at只接收标量值,是专门用来取代.iloc.loc选取标量的,可以节省大概2.5微秒。

# 用get_loc找到整数位置,再进行速度比较
 In[41]: row_num = college.index.get_loc(cn)
         col_num = college.columns.get_loc('UGDS_WHITE')
 In[42]: row_num, col_num
Out[42]: (3765, 10)

 In[43]: %timeit college.iloc[row_num, col_num]
Out[43]: 11.1 µs ± 426 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

 In[44]: %timeit college.iat[row_num, col_num]
Out[44]: 7.47 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

 In[45]: %timeit college.iloc[5, col_num]
Out[45]: 10.8 µs ± 467 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

 In[46]: %timeit college.iat[5, col_num]
Out[46]: 7.12 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

更多

# Series对象也可以使用.iat和.at选取标量
 In[47]: state = college['STABBR']
 In[48]: state.iat[1000]
Out[48]: 'IL'

 In[49]: state.at['Stanford University']
Out[49]: 'CA'

6. 惰性行切片

# 读取college数据集;从行索引10到20,每隔一个取一行
 In[50]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college[10:20:2]
Out[50]: 
# Series也可以进行同样的切片
 In[51]: city = college['CITY']
         city[10:20:2]
Out[51]: INSTNM
         Birmingham Southern College              Birmingham
         Concordia College Alabama                     Selma
         Enterprise State Community College       Enterprise
         Faulkner University                      Montgomery
         New Beginning College of Cosmetology    Albertville
         Name: CITY, dtype: object
# 查看第4002个行索引标签
 In[52]: college.index[4001]
Out[52]: 'Spokane Community College'
# Series和DataFrame都可以用标签进行切片。下面是对DataFrame用标签切片
 In[53]: start = 'Mesa Community College'
         stop = 'Spokane Community College'
         college[start:stop:1500]
Out[53]: 
# 下面是对Series用标签切片
 In[54]: city[start:stop:1500]
Out[54]: INSTNM
         Mesa Community College                            Mesa
         Hair Academy Inc-New Carrollton         New Carrollton
         National College of Natural Medicine          Portland
         Name: CITY, dtype: object

更多

惰性切片不能用于列,只能用于DataFrame的行和Series,也不能同时选取行和列。

# 下面尝试选取两列,导致错误
 In[55]: college[:10, ['CITY', 'STABBR']]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-55-92538c61bdfa> in <module>()
----> 1 college[:10, ['CITY', 'STABBR']]

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1641         """Return the cached item, item represents a label indexer."""
   1642         cache = self._item_cache
-> 1643         res = cache.get(item)
   1644         if res is None:
   1645             values = self._data.get(item)

TypeError: unhashable type: 'slice'
# 只能用.loc和.iloc选取
 In[56]: first_ten_instnm = college.index[:10]
         college.loc[first_ten_instnm, ['CITY', 'STABBR']]
Out[56]: 

7. 按照字母切片

# 读取college数据集;尝试选取字母顺序在‘Sp’和‘Su’之间的学校
 In[57]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college.loc['Sp':'Su']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   3483             try:
-> 3484                 return self._searchsorted_monotonic(label, side)
   3485             except ValueError:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in _searchsorted_monotonic(self, label, side)
   3442 
-> 3443         raise ValueError('index must be monotonic increasing or decreasing')
   3444 

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-57-c9f1c69a918b> in <module>()
      1 college = pd.read_csv('data/college.csv', index_col='INSTNM')
----> 2 college.loc['Sp':'Su']

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328             return self._getitem_axis(key, axis=0)
   1329 
   1330     def _is_scalar_access(self, key):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1504         if isinstance(key, slice):
   1505             self._has_valid_type(key, axis)
-> 1506             return self._get_slice_axis(key, axis=axis)
   1507         elif is_bool_indexer(key):
   1508             return self._getbool_axis(key, axis=axis)

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
   1354         labels = obj._get_axis(axis)
   1355         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1356                                        slice_obj.step, kind=self.name)
   1357 
   1358         if isinstance(indexer, slice):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
   3348         """
   3349         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 3350                                                  kind=kind)
   3351 
   3352         # return a slice

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
   3536         start_slice = None
   3537         if start is not None:
-> 3538             start_slice = self.get_slice_bound(start, 'left', kind)
   3539         if start_slice is None:
   3540             start_slice = 0

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   3485             except ValueError:
   3486                 # raise the original KeyError
-> 3487                 raise err
   3488 
   3489         if isinstance(slc, np.ndarray):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   3479         # we need to look up the label
   3480         try:
-> 3481             slc = self._get_loc_only_exact_matches(label)
   3482         except KeyError as err:
   3483             try:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in _get_loc_only_exact_matches(self, key)
   3448         get_slice_bound.
   3449         """
-> 3450         return self.get_loc(key)
   3451 
   3452     def get_slice_bound(self, label, side, kind):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'Sp'
# 对college进行排序
 In[58]: college = college.sort_index()
 In[59]: college = college.head()
Out[59]:
# 再尝试选取字母顺序在‘Sp’和‘Su’之间的学校
 In[60]: pd.options.display.max_rows = 6
 In[61]: college.loc['Sp':'Su']
Out[61]:
# 可以用is_monotonic_increasing或is_monotonic_decreasing检测字母排序的顺序
 In[62]: college = college.sort_index(ascending=False)
         college.index.is_monotonic_decreasing
Out[62]: True
# 字母逆序选取
 In[63]: college.loc['E':'B']
Out[63]: 


本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏视觉求索无尽也

算法:动态规划(DP)入门实践

31320
来自专栏程序生活

Leetcode-Easy 575. Distribute Candies

728. Self Dividing Numbers 描述: 有偶数个糖,需要分给弟弟和妹妹,要求最终两个人分到的糖数目一样,返回妹妹获得糖的种类数目最大值 ...

37140
来自专栏SeanCheney的专栏

《Pandas Cookbook》第11章 用Matplotlib、Pandas、Seaborn进行可视化

20930
来自专栏包子铺里聊IT

[Google最新面试题] continental divider

给一个矩阵,其中0代表海洋,其他数字代表高度, 秉着水往低处流的原则,求出能够 流向任意海洋的点。 比如说 0 0 0 1 2 3 0 0 1 2 2 4 3 ...

28840
来自专栏小樱的经验随笔

Codeforces Round #412 (rated, Div. 2, base on VK Cup 2017 Round 3)(A.B.C,3道暴力题,C可二分求解)

A. Is it rated? time limit per test:2 seconds memory limit per test:256 megabyte...

37970
来自专栏技术沉淀

Matplotlib可视化Pyplot Tutorial

MATLAB, and pyplot, have the concept of the current figure and the current axes....

17740
来自专栏计算机视觉与深度学习基础

Leetcode 84 Largest Rectangle in Histogram

Given n non-negative integers representing the histogram's bar height where the...

18390
来自专栏python3

习题3:数字和数学计算

数小鸡! 母鸡 30.0 公鸡 97 数鸡蛋 6.75 False what is 3 + 2 ? 5 what is 5 - 7? -2 True True ...

7010
来自专栏Bingo的深度学习杂货店

Q172 Factorial Trailing Zeroes

Given an integer n, return the number of trailing zeroes in n!. Note: Your solut...

33350
来自专栏逍遥剑客的游戏开发

D3D深度测试和Alpha混合

27250

扫码关注云+社区

领取腾讯云代金券