Datawhale干货
作者:闫钟峰,Datawhale优秀学习者
寄语:本文对单级索引中的loc、iloc、[]三种方法进行了详细的阐述。同时,对布尔索引,快速标量索引方式、区间索引方式做了详细介绍。
读取csv数据的时候, 使用参数index_col指定表中的列作为索引
import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv',index_col='ID')
df.head()
效果等同于读取数据后, 使用set_index方法指定某一列为索引,但index_col的方式更简洁。
df1 = pd.read_csv('data/table.csv')
df2=df1.set_index(['ID'])
df2.head()
df.loc[1103]
多行索引时,需传入一个list,而不是多个索引
df.loc[[1102,2304]]
#以下索引报错
# TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2304] of <class 'int'>
df.loc[1102,2304]
df.loc[1304:].head()
df.loc[2402::-1].head()
df.loc[:,'Height'].head()
df['Height'].head()
df.Height.head()
df.loc[:,['Height','Math']].head()
# 等价于df[['Height','Math']].head()
df[['Height','Math']].head()
df.loc[:,'Height':'Math'].head()
df.iloc[:,4:7].head()
# 以下语句报错
# TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [Height] of <class 'str'>
df['Height':'Math'].head()
df.loc[1102:2401:3,'Height':'Math']#.head()
df.loc[lambda x:x['Gender']=='M'].head()
def f(x):
return [1101,1103]
df.loc[f]
df.loc[df['Address'].isin(['street_7','street_4'])].head()
# 类似的sql语句为 select * from df where Address in ('street_7','street_4')
df['Address'].isin(['street_7','street_4'])
df.loc[[True if i[-1]=='4' or i[-1]=='7' else False for i in df['Address'].values]].head()
# 实现相同筛选的sql代码类似于 select * from df where substr(Address,-1,1) in ('4','7')
① 单行索引
df.iloc[3]
df.iloc[3:5]
③ 单列索引
df.iloc[:,3].head()
df.iloc[:,7::-2].head()
df.iloc[3::4,7::-2]#.head()
df.iloc[lambda x:[3]].head()
df.iloc[lambda x:[30000]].head()
df.iloc[lambda x:range(3)]
df.iloc[lambda x:np.arange(3)]
s = pd.Series(df['Math'],index=df.index)
s[1101]
s.loc[1101]
s.head()
s[1]
s.iloc[1]
② 多行索引
s[0:4]
s[lambda x: x.index[16::-6]]
s[lambda x: 16::-6]
#TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [<function <lambda> at 0x00000000083FFCA8>] of <class 'function'>
s[16::-6]
def f(x):
return x[16::-6]
f(s)
s[lambda x: x[16::-6]]
s[lambda x: x[16::-6].index]
s[16::-6].index
s[16::-6]
s.apply??
④ 布尔索引
s[s>80]
# 类似的sql语句是 select * from s where s.value>80
s[(s>80)&(s<95)] # 没有括号的时候会报错--s[(s>80&s<95)]
df[1:2]
df[1102:]
row = df.index.get_loc(1102) # df.index.get_loc 将标签索引转换为默认整数索引
df[row:row+1]
df.index.get_loc??
df.loc[1102,:]
df.loc[1102:1102,:]
② 多行索引
df[3:5]
df.iloc[3:5]
③ 单列索引
df['School'].head()
pd.DataFrame(df['School']).head()
df['School'].to_frame().head()
df.iloc[:,0].head()
df.loc[:,'School'].head()
df1=pd.DataFrame(np.random.randint(1,10,12).reshape(3,4))
df1.iloc[:,0]
df1.loc[:,0]
df1[0:2]
df1.loc[0:2]
df.School.head()
df.columns=['School Name', 'Class', 'Gender', 'Address', 'Height', 'Weight', 'Math', 'Physics']
df.School Name
df['School Name'].head()
df.columns=['School', 'Class', 'Gender', 'Address', 'Height', 'Weight', 'Math', 'Physics']
df[['School','Math']].head()
df.loc[:,['School','Math']].head()
df.iloc[:,[0,6]].head()
⑤函数式索引
df[lambda x:['Math','Physics']].head()
df[lambda x: [x for x in df.columns if len(x)>5]].head()
df[lambda x: [a for a in x.columns if len(a)>5]].head()
df1[lambda x: [a for a in x.columns if len(str(a))>5]].head()
df[[ x for x in df.columns if len(x)>5]].head()
⑥ 布尔索引
df[df['Gender']=='F'].head()
# 等价的sql语句 select * from df where Gender='F'
df[df.Gender=='F'].head()
小节:一般来说,[]操作符常用于列选择或布尔选择,尽量避免行的选择
df[(df['Gender']=='F')&(df['Address']=='street_2')].head()
# 等价的sql语句 select * from df where Gender='F' and Address='street_2'
df[(df['Math']>85)|(df['Address']=='street_7')].head()
# select * from df where df.Math>85 or df.Address='street_7'
df[~((df['Math']>75)|(df['Address']=='street_1'))].head()
# 等价的sql语句 select * from df where not (math>75 or Address='street_1')
#df[~((df['Math']>75)|(df['Address']=='street_1'))].head()
df[~(df['Math']>75)][~(df['Address']=='street_1')].head()
df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head()
df[:8]['Address']=='street_6'
(df[:8]['Address']=='street_6').values
df.loc[df['Math']>60,(df[:8]['Address']=='street_6')].head()
# IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
df[df['Address'].isin(['street_1','street_4'])&df['Physics'].isin(['A','A+'])]
# select * from df where Address in ('street_1','street_4') and Physic in ('A','A+')
df[df[['Address','Physics']].isin({'Address':['street_1','street_4'],'Physics':['A','A+']}).all(1)
display(df.at[1101,'School'])
display(df.loc[1101,'School'])
display(df.iat[0,0])
display(df.iloc[0,0])
#可尝试去掉注释对比时间
%timeit df.at[1101,'School']
%timeit df.loc[1101,'School']
%timeit df.iat[0,0]
%timeit df.iloc[0,0]
#当数据集更大的时候,差别更明显
df.at[1101,:]
# at方法只能选择单元格?
# TypeError: unhashable type: 'slice'
df.at[1101,df.columns[:3]]
# TypeError: unhashable type: 'Index'
df.at??
# Access a single value for a row/column label pair.
pd.interval_range(start=0,end=5)
pd.interval_range(start=0,periods=8,freq=5)
pd.interval_range??
math_interval = pd.cut(df['Math'],bins=[0,40,60,80,100])
math_interval.head()
math_interval.values
df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']].reset_index().set_index('Math_interval')
df_i.head()
df_i.index
df_i.loc[65].head()
df_i.loc[[65,90]]
df_i.loc[[65,30]]
df_i.loc[pd.Interval(70,75)].head()
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()
df_i.index
tmp=df_i.index.astype('interval')
tmp.overlaps??
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85),pd.Interval(20, 35))].head()
tmp.overlaps(pd.Interval(20, 35))
pd.Interval(70, 85)
pd.Interval??
# left, right, close 三个参数