Created: 2020-08-05 15:20 · Last modified: 2020-08-05 19:27 · Python version: Python 3.7
Project introduction: I have long wanted to write a Python tutorial for academic research aimed at readers with a background in economics and other social sciences. Since economics students usually know some Stata and already have some experience writing code commands, this tutorial is designed with that starting point in mind.
After thinking about the structure for a while, I happened upon Ties de Kok's Get started with Python for research tutorial, which already provides the framework I had in mind. So I decided to build on that project: first translating its main content into Chinese, then expanding the usage notes and adding typical use cases and examples.
"About the original author: Ties de Kok (Personal Website) is an assistant professor at the Foster School of Business, University of Washington. His work combines computer science with empirical accounting research; his interests include financial accounting, capital markets, computer science, natural language processing, and empirical managerial accounting."
This section is based on the PyCon 2015 tutorial/talk by Brandon Rhodes. To learn more, see:
https://www.youtube.com/watch?v=5JnMutdy6Fw
https://github.com/brandon-rhodes/pycon-pandas-tutorial
import pandas as pd
import numpy as np
import os
from os.path import join
# define the data path
data_path = join(os.getcwd(), 'data')
df_auto = pd.read_csv(join(data_path, 'auto_df.csv'), sep=';', index_col='Unnamed: 0')
Many kinds of data can be passed to `pd.DataFrame()`:
d = {'col1': [1,2,3,4], 'col2': [5,6,7,8]}
df = pd.DataFrame(data=d)
df
col1 | col2 | |
---|---|---|
0 | 1 | 5 |
1 | 2 | 6 |
2 | 3 | 7 |
3 | 4 | 8 |
d = [(1, 2, 3, 4), (5, 6, 7, 8)]
df = pd.DataFrame(data=d)
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1 | 2 | 3 | 4 |
1 | 5 | 6 | 7 | 8 |
d = {'row1': [1,2,3,4], 'row2': [5,6,7,8]}
df = pd.DataFrame.from_dict(d, orient='index')
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
row1 | 1 | 2 | 3 | 4 |
row2 | 5 | 6 | 7 | 8 |
The dictionary can also be passed in directly:
df = pd.DataFrame.from_dict({'row1': [1,2,3,4], 'row2': [5,6,7,8]}, orient='index')
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
row1 | 1 | 2 | 3 | 4 |
row2 | 5 | 6 | 7 | 8 |
df['col5'] = [10, 10]
df
0 | 1 | 2 | 3 | col5 | |
---|---|---|---|---|---|
row1 | 1 | 2 | 3 | 4 | 10 |
row2 | 5 | 6 | 7 | 8 | 10 |
df.loc['row3'] = [11, 12, 13, 14, 15]
df
0 | 1 | 2 | 3 | col5 | |
---|---|---|---|---|---|
row1 | 1 | 2 | 3 | 4 | 10 |
row2 | 5 | 6 | 7 | 8 | 10 |
row3 | 11 | 12 | 13 | 14 | 15 |
df.T
row1 | row2 | row3 | |
---|---|---|---|
0 | 1 | 5 | 11 |
1 | 2 | 6 | 12 |
2 | 3 | 7 | 13 |
3 | 4 | 8 | 14 |
col5 | 10 | 10 | 15 |
df = df.drop('col5', axis=1) # axis=1: drop a column
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
row1 | 1 | 2 | 3 | 4 |
row2 | 5 | 6 | 7 | 8 |
row3 | 11 | 12 | 13 | 14 |
df = df.drop('row1', axis=0)
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
row2 | 5 | 6 | 7 | 8 |
row3 | 11 | 12 | 13 | 14 |
df.set_index(0)
1 | 2 | 3 | |
---|---|---|---|
0 | |||
5 | 6 | 7 | 8 |
11 | 12 | 13 | 14 |
Tip: Pandas allows multiple indexes (a MultiIndex), which is very useful in data analysis.
df.set_index(0, append=True)
1 | 2 | 3 | ||
---|---|---|---|---|
0 | ||||
row2 | 5 | 6 | 7 | 8 |
row3 | 11 | 12 | 13 | 14 |
Use `reset_index()` to turn the index back into a regular column.
df.reset_index()
index | 0 | 1 | 2 | 3 | |
---|---|---|---|---|---|
0 | row2 | 5 | 6 | 7 | 8 |
1 | row3 | 11 | 12 | 13 | 14 |
You can set `df.columns` directly or use `df.rename()`:
df.columns = ['col1','col2','col3','col4']
df
col1 | col2 | col3 | col4 | |
---|---|---|---|---|
row2 | 5 | 6 | 7 | 8 |
row3 | 11 | 12 | 13 | 14 |
df.rename(columns={'col1' : 'column1', 'col2' : 'column2'})
column1 | column2 | col3 | col4 | |
---|---|---|---|---|
row2 | 5 | 6 | 7 | 8 |
row3 | 11 | 12 | 13 | 14 |
`df.rename()` modifies a copy and does not overwrite the original df. To overwrite the original df, either reassign the result or pass the `inplace=True` option.
df = df.rename(columns={'col1' : 'column1', 'col2' : 'column2'})
#or
df.rename(columns={'col1' : 'column1', 'col2' : 'column2'}, inplace=True)
df_auto # only the head and tail are displayed
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic |
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic |
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic |
3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | Domestic |
4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | Domestic |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
69 | VW Dasher | 7140 | 23 | 4.0 | 2.5 | 12 | 2160 | 172 | 36 | 97 | 3.74 | Foreign |
70 | VW Diesel | 5397 | 41 | 5.0 | 3.0 | 15 | 2040 | 155 | 35 | 90 | 3.78 | Foreign |
71 | VW Rabbit | 4697 | 25 | 4.0 | 3.0 | 15 | 1930 | 155 | 35 | 89 | 3.78 | Foreign |
72 | VW Scirocco | 6850 | 25 | 4.0 | 2.0 | 16 | 1990 | 156 | 36 | 97 | 3.78 | Foreign |
73 | Volvo 260 | 11995 | 17 | 5.0 | 2.5 | 14 | 3170 | 193 | 37 | 163 | 2.98 | Foreign |
74 rows × 12 columns
df_auto.head(3) # first 3 rows
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic |
1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic |
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic |
df_auto.tail(5) # last 5 rows
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
69 | VW Dasher | 7140 | 23 | 4.0 | 2.5 | 12 | 2160 | 172 | 36 | 97 | 3.74 | Foreign |
70 | VW Diesel | 5397 | 41 | 5.0 | 3.0 | 15 | 2040 | 155 | 35 | 90 | 3.78 | Foreign |
71 | VW Rabbit | 4697 | 25 | 4.0 | 3.0 | 15 | 1930 | 155 | 35 | 89 | 3.78 | Foreign |
72 | VW Scirocco | 6850 | 25 | 4.0 | 2.0 | 16 | 1990 | 156 | 36 | 97 | 3.78 | Foreign |
73 | Volvo 260 | 11995 | 17 | 5.0 | 2.5 | 14 | 3170 | 193 | 37 | 163 | 2.98 | Foreign |
X = 5
df_auto.sample(X)
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | Chev. Chevette | 3299 | 29 | 3.0 | 2.5 | 9 | 2110 | 163 | 34 | 231 | 2.93 | Domestic |
18 | Chev. Nova | 3955 | 19 | 3.0 | 3.5 | 13 | 3430 | 197 | 43 | 250 | 2.56 | Domestic |
23 | Ford Fiesta | 4389 | 28 | 4.0 | 1.5 | 9 | 1800 | 147 | 33 | 98 | 3.15 | Domestic |
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic |
4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | Domestic |
**Note:** the following returns a Pandas Series object, which is different from a Pandas DataFrame object!
df_auto['make'].head(3)
0 AMC Concord
1 AMC Pacer
2 AMC Spirit
Name: make, dtype: object
If the column name contains no spaces, you can also use dot notation (`.`):
df_auto.make.head(3)
0 AMC Concord
1 AMC Pacer
2 AMC Spirit
Name: make, dtype: object
Use double square brackets to select multiple columns:
df_auto[['make', 'price', 'mpg']].head(10)
make | price | mpg | |
---|---|---|---|
0 | AMC Concord | 4099 | 22 |
1 | AMC Pacer | 4749 | 17 |
2 | AMC Spirit | 3799 | 22 |
3 | Buick Century | 4816 | 20 |
4 | Buick Electra | 7827 | 15 |
5 | Buick LeSabre | 5788 | 18 |
6 | Buick Opel | 4453 | 26 |
7 | Buick Regal | 5189 | 20 |
8 | Buick Riviera | 10372 | 16 |
9 | Buick Skylark | 4082 | 19 |
df = df_auto[['make', 'price', 'mpg', 'trunk', 'headroom']].set_index('make')
df.loc['Buick Riviera']
price 10372.0
mpg 16.0
trunk 17.0
headroom 3.5
Name: Buick Riviera, dtype: float64
Note: this returns a pandas.Series object rather than a pandas.DataFrame object.
df.iloc[2:5] # index location
price | mpg | trunk | headroom | |
---|---|---|---|---|
make | ||||
AMC Spirit | 3799 | 22 | 12 | 3.0 |
Buick Century | 4816 | 20 | 16 | 4.5 |
Buick Electra | 7827 | 15 | 20 | 4.0 |
df.iloc[2:5, 1:3]
mpg | trunk | |
---|---|---|
make | ||
AMC Spirit | 22 | 12 |
Buick Century | 20 | 16 |
Buick Electra | 15 | 20 |
The logic behind conditional selection:

- You ask Pandas to filter the dataframe by putting a condition between square brackets: `df[condition]`.
- The `condition` is a sequence of `True` / `False` values, one per row (so the length of `condition` must equal the number of rows of the dataframe).
- Pandas then returns only the rows where the value is `True`.

For example:
df_auto['price'] < 3800
evaluates `df_auto['price']` row by row and returns `True` or `False` for each row:
0    False
1    False
2     True
3    False
4    False
5    False
When the condition is placed between square brackets, `df_auto[ df_auto['price'] < 3800 ]`, Pandas first generates the `True` / `False` sequence and then displays only the rows whose value is `True`.
df_auto[ df_auto['price'] < 3800 ]
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic |
13 | Chev. Chevette | 3299 | 29 | 3.0 | 2.5 | 9 | 2110 | 163 | 34 | 231 | 2.93 | Domestic |
17 | Chev. Monza | 3667 | 24 | 2.0 | 2.0 | 7 | 2750 | 179 | 40 | 151 | 2.73 | Domestic |
33 | Merc. Zephyr | 3291 | 20 | 3.0 | 3.5 | 17 | 2830 | 195 | 43 | 140 | 3.08 | Domestic |
65 | Subaru | 3798 | 35 | 5.0 | 2.5 | 11 | 2050 | 164 | 36 | 97 | 3.81 | Foreign |
67 | Toyota Corolla | 3748 | 31 | 5.0 | 3.0 | 9 | 2200 | 165 | 35 | 97 | 3.21 | Foreign |
Multiple conditions can be combined by chaining boolean expressions:

- `&` (and)
- `|` (or)
df_auto[(df_auto['price'] < 3800) & (df_auto['foreign'] == 'Foreign')]
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
65 | Subaru | 3798 | 35 | 5.0 | 2.5 | 11 | 2050 | 164 | 36 | 97 | 3.81 | Foreign |
67 | Toyota Corolla | 3748 | 31 | 5.0 | 3.0 | 9 | 2200 | 165 | 35 | 97 | 3.21 | Foreign |
Note: all of the operations above return a new dataframe without assigning it anywhere. To keep the result as a separate dataframe, we have to assign it, like this:
df_auto_small = df_auto[(df_auto.price < 3800) & (df_auto.foreign == 'Foreign')]
df_auto_small
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
65 | Subaru | 3798 | 35 | 5.0 | 2.5 | 11 | 2050 | 164 | 36 | 97 | 3.81 | Foreign |
67 | Toyota Corolla | 3748 | 31 | 5.0 | 3.0 | 9 | 2200 | 165 | 35 | 97 | 3.21 | Foreign |
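As an aside not in the original notebook: the same filter can also be written with `df.query()`, whose string syntax may feel familiar to Stata users. A minimal sketch on a tiny hypothetical frame:

```python
import pandas as pd

# hypothetical small frame mimicking the relevant df_auto columns
df_auto = pd.DataFrame({
    'make': ['Subaru', 'Toyota Corolla', 'AMC Concord'],
    'price': [3798, 3748, 4099],
    'foreign': ['Foreign', 'Foreign', 'Domestic'],
})

# query() takes the condition as a string; column names are used directly
df_small = df_auto.query("price < 3800 and foreign == 'Foreign'")
print(df_small['make'].tolist())  # ['Subaru', 'Toyota Corolla']
```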
df_auto.sort_values(by=['headroom', 'trunk'], inplace=True) # df.sort_values()
df_auto.head()
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
55 | Datsun 200 | 6229 | 23 | 4.0 | 1.5 | 6 | 2370 | 170 | 35 | 119 | 3.89 | Foreign |
47 | Pont. Firebird | 4934 | 18 | 1.0 | 1.5 | 7 | 3470 | 198 | 42 | 231 | 3.08 | Domestic |
44 | Plym. Sapporo | 6486 | 26 | NaN | 1.5 | 8 | 2520 | 182 | 38 | 119 | 3.54 | Domestic |
23 | Ford Fiesta | 4389 | 28 | 4.0 | 1.5 | 9 | 1800 | 147 | 33 | 98 | 3.15 | Domestic |
17 | Chev. Monza | 3667 | 24 | 2.0 | 2.0 | 7 | 2750 | 179 | 40 | 151 | 2.73 | Domestic |
df_auto.dtypes
make object
price int64
mpg int64
rep78 float64
headroom float64
trunk int64
weight int64
length int64
turn int64
displacement int64
gear_ratio float64
foreign object
dtype: object
We can convert the data type of a column in two ways:
df_auto['length'].apply(lambda x: str(x)).dtypes # apply() + lambda function
dtype('O')
Note: `'O'` stands for `'object'`.
df_auto['length'].apply(lambda x: int(x)).dtypes
dtype('int64')
To convert a column to `string`, the recommended way is `.astype(str)`:
df_auto['length'].astype(str).dtypes
dtype('O')
To convert a column to `numeric`, the recommended way is `pd.to_numeric()`:
pd.to_numeric(df_auto['length']).dtypes
dtype('int64')
http://pandas.pydata.org/pandas-docs/stable/missing_data.html
Define missing values as `np.nan`:
df_auto.loc['UvT_Car'] = [np.nan for x in range(0,len(df_auto.columns))]
df_auto.loc['UvT_Bike'] = [np.nan for x in range(0,len(df_auto.columns))]
df_auto.loc[['UvT_Car', 'UvT_Bike']]
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
UvT_Car | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
UvT_Bike | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
It is most reliable to always use `pd.isnull()` or `pd.notnull()`; a check like `df_auto.make == np.nan` will not give the correct result.
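To see why the `== np.nan` comparison fails: NaN never compares equal to anything, not even to itself. A minimal demonstration on a toy Series (not from the original notebook):

```python
import numpy as np
import pandas as pd

s = pd.Series(['AMC Concord', np.nan])

# equality against np.nan silently matches nothing
print((s == np.nan).sum())   # 0 matches
# pd.isnull() finds the missing value as intended
print(pd.isnull(s).sum())    # 1 match
```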
df_auto[pd.isnull(df_auto.make)]
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
UvT_Car | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
UvT_Bike | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df_auto[pd.notnull(df_auto.make)].head()
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
55 | Datsun 200 | 6229.0 | 23.0 | 4.0 | 1.5 | 6.0 | 2370.0 | 170.0 | 35.0 | 119.0 | 3.89 | Foreign |
47 | Pont. Firebird | 4934.0 | 18.0 | 1.0 | 1.5 | 7.0 | 3470.0 | 198.0 | 42.0 | 231.0 | 3.08 | Domestic |
44 | Plym. Sapporo | 6486.0 | 26.0 | NaN | 1.5 | 8.0 | 2520.0 | 182.0 | 38.0 | 119.0 | 3.54 | Domestic |
23 | Ford Fiesta | 4389.0 | 28.0 | 4.0 | 1.5 | 9.0 | 1800.0 | 147.0 | 33.0 | 98.0 | 3.15 | Domestic |
17 | Chev. Monza | 3667.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2750.0 | 179.0 | 40.0 | 151.0 | 2.73 | Domestic |
Fill missing values with `fillna()`:
df = df_auto.fillna('Missing')
df.loc[['UvT_Car', 'UvT_Bike']]
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
UvT_Car | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing |
UvT_Bike | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing |
Drop missing values with `.dropna()`:
df_auto['make'].tail(3)
45 Plym. Volare
UvT_Car NaN
UvT_Bike NaN
Name: make, dtype: object
df = df_auto.dropna(axis=0)
df['make'].tail(3)
36 Olds Cutlass
22 Dodge St. Regis
45 Plym. Volare
Name: make, dtype: object
df_auto['price_trunk_ratio'] = df_auto.price / df_auto.trunk
df_auto[['price', 'trunk', 'price_trunk_ratio']].head()
price | trunk | price_trunk_ratio | |
---|---|---|---|
55 | 6229.0 | 6.0 | 1038.166667 |
47 | 4934.0 | 7.0 | 704.857143 |
44 | 6486.0 | 8.0 | 810.750000 |
23 | 4389.0 | 9.0 | 487.666667 |
17 | 3667.0 | 7.0 | 523.857143 |
Task: if a car is foreign (`Foreign`), multiply its price by 1.5.

Using `apply()` with a `lambda` function:
logic = lambda x: x.price*1.5 if x.foreign == 'Foreign' else x.price
df_auto['new_price'] = df_auto.apply(logic, axis=1)
df_auto[['make', 'price', 'foreign', 'new_price']].head()
make | price | foreign | new_price | |
---|---|---|---|---|
55 | Datsun 200 | 6229.0 | Foreign | 9343.5 |
47 | Pont. Firebird | 4934.0 | Domestic | 4934.0 |
44 | Plym. Sapporo | 6486.0 | Domestic | 6486.0 |
23 | Ford Fiesta | 4389.0 | Domestic | 4389.0 |
17 | Chev. Monza | 3667.0 | Domestic | 3667.0 |
Using `apply()` with a named function: in the example above we used an anonymous lambda function.
For more complex processing, you can define a function and call it in `.apply()`.
This approach is recommended, as it is the most flexible and easier to read.
def new_price_function(x):
if x.foreign == 'Foreign':
return x.price * 1.5
else:
return x.price
df_auto['new_price'] = df_auto.apply(new_price_function, axis=1) # axis =1: iterate over rows
df_auto[['make', 'price', 'foreign', 'new_price']].head()
make | price | foreign | new_price | |
---|---|---|---|---|
55 | Datsun 200 | 6229.0 | Foreign | 9343.5 |
47 | Pont. Firebird | 4934.0 | Domestic | 4934.0 |
44 | Plym. Sapporo | 6486.0 | Domestic | 6486.0 |
23 | Ford Fiesta | 4389.0 | Domestic | 4389.0 |
17 | Chev. Monza | 3667.0 | Domestic | 3667.0 |
A list comprehension also works:

df_auto['new_price'] = [p*1.5 if f == 'Foreign' else p for p, f in zip(df_auto.price, df_auto.foreign)]
df_auto[['price', 'foreign', 'new_price']].sample(5, random_state=1)
price | foreign | new_price | |
---|---|---|---|
58 | 8129.0 | Foreign | 12193.5 |
49 | 4723.0 | Domestic | 4723.0 |
41 | 4647.0 | Domestic | 4647.0 |
36 | 4733.0 | Domestic | 4733.0 |
40 | 10371.0 | Domestic | 10371.0 |
Note: `random_state=1` ensures that we get the same random sample every time.
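Another vectorized option, not shown in the original notebook, is `np.where`, which selects element-wise between two alternatives. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [6229.0, 4934.0],
                   'foreign': ['Foreign', 'Domestic']})

# np.where(condition, value_if_true, value_if_false), evaluated per row
df['new_price'] = np.where(df['foreign'] == 'Foreign',
                           df['price'] * 1.5,
                           df['price'])
print(df['new_price'].tolist())  # [9343.5, 4934.0]
```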
There are three ways to combine dataframes:
# prepare the data
df_auto_p1 = df_auto[['make', 'price', 'mpg']]
df_auto_p2 = df_auto[['make', 'headroom', 'trunk']]
df_auto_p1.head(3)
make | price | mpg | |
---|---|---|---|
55 | Datsun 200 | 6229.0 | 23.0 |
47 | Pont. Firebird | 4934.0 | 18.0 |
44 | Plym. Sapporo | 6486.0 | 26.0 |
df_auto_p2.head(3)
make | headroom | trunk | |
---|---|---|---|
55 | Datsun 200 | 1.5 | 6.0 |
47 | Pont. Firebird | 1.5 | 7.0 |
44 | Plym. Sapporo | 1.5 | 8.0 |
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
merged_auto = pd.merge(df_auto_p1, df_auto_p2, how='left', on='make')
merged_auto.head(3)
make | price | mpg | headroom | trunk | |
---|---|---|---|---|---|
0 | Datsun 200 | 6229.0 | 23.0 | 1.5 | 6.0 |
1 | Pont. Firebird | 4934.0 | 18.0 | 1.5 | 7.0 |
2 | Plym. Sapporo | 6486.0 | 26.0 | 1.5 | 8.0 |
Joining on the index requires both dataframes to have the join column set as their index:
df_auto_p1.set_index('make', inplace=True)
df_auto_p2.set_index('make', inplace=True)
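The `join()` call itself appears to have dropped out of the notebook; assuming it followed the usual pattern, an index-based join looks like this (toy data standing in for the two partial frames):

```python
import pandas as pd

# toy stand-ins for df_auto_p1 / df_auto_p2 with 'make' as index
df_auto_p1 = pd.DataFrame({'make': ['Datsun 200'], 'price': [6229.0],
                           'mpg': [23.0]}).set_index('make')
df_auto_p2 = pd.DataFrame({'make': ['Datsun 200'], 'headroom': [1.5],
                           'trunk': [6.0]}).set_index('make')

# join() combines the two frames on their shared index
joined_auto = df_auto_p1.join(df_auto_p2)
print(joined_auto.columns.tolist())  # ['price', 'mpg', 'headroom', 'trunk']
```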
http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects
df_auto_i1 = df_auto.iloc[0:3]
df_auto_i2 = df_auto.iloc[3:6]
df_auto_i1
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | price_trunk_ratio | new_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55 | Datsun 200 | 6229.0 | 23.0 | 4.0 | 1.5 | 6.0 | 2370.0 | 170.0 | 35.0 | 119.0 | 3.89 | Foreign | 1038.166667 | 9343.5 |
47 | Pont. Firebird | 4934.0 | 18.0 | 1.0 | 1.5 | 7.0 | 3470.0 | 198.0 | 42.0 | 231.0 | 3.08 | Domestic | 704.857143 | 4934.0 |
44 | Plym. Sapporo | 6486.0 | 26.0 | NaN | 1.5 | 8.0 | 2520.0 | 182.0 | 38.0 | 119.0 | 3.54 | Domestic | 810.750000 | 6486.0 |
df_auto_i2
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | price_trunk_ratio | new_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23 | Ford Fiesta | 4389.0 | 28.0 | 4.0 | 1.5 | 9.0 | 1800.0 | 147.0 | 33.0 | 98.0 | 3.15 | Domestic | 487.666667 | 4389.0 |
17 | Chev. Monza | 3667.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2750.0 | 179.0 | 40.0 | 151.0 | 2.73 | Domestic | 523.857143 | 3667.0 |
51 | Pont. Sunbird | 4172.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2690.0 | 179.0 | 41.0 | 151.0 | 2.73 | Domestic | 596.000000 | 4172.0 |
Using the `concat()` function:
pd.concat([df_auto_i1, df_auto_i2])
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | price_trunk_ratio | new_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55 | Datsun 200 | 6229.0 | 23.0 | 4.0 | 1.5 | 6.0 | 2370.0 | 170.0 | 35.0 | 119.0 | 3.89 | Foreign | 1038.166667 | 9343.5 |
47 | Pont. Firebird | 4934.0 | 18.0 | 1.0 | 1.5 | 7.0 | 3470.0 | 198.0 | 42.0 | 231.0 | 3.08 | Domestic | 704.857143 | 4934.0 |
44 | Plym. Sapporo | 6486.0 | 26.0 | NaN | 1.5 | 8.0 | 2520.0 | 182.0 | 38.0 | 119.0 | 3.54 | Domestic | 810.750000 | 6486.0 |
23 | Ford Fiesta | 4389.0 | 28.0 | 4.0 | 1.5 | 9.0 | 1800.0 | 147.0 | 33.0 | 98.0 | 3.15 | Domestic | 487.666667 | 4389.0 |
17 | Chev. Monza | 3667.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2750.0 | 179.0 | 40.0 | 151.0 | 2.73 | Domestic | 523.857143 | 3667.0 |
51 | Pont. Sunbird | 4172.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2690.0 | 179.0 | 41.0 | 151.0 | 2.73 | Domestic | 596.000000 | 4172.0 |
Using the `append()` function (note: `DataFrame.append()` was deprecated and later removed in pandas 2.0; prefer `pd.concat()` in current code):
df_auto_i1.append(df_auto_i2)
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | price_trunk_ratio | new_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55 | Datsun 200 | 6229.0 | 23.0 | 4.0 | 1.5 | 6.0 | 2370.0 | 170.0 | 35.0 | 119.0 | 3.89 | Foreign | 1038.166667 | 9343.5 |
47 | Pont. Firebird | 4934.0 | 18.0 | 1.0 | 1.5 | 7.0 | 3470.0 | 198.0 | 42.0 | 231.0 | 3.08 | Domestic | 704.857143 | 4934.0 |
44 | Plym. Sapporo | 6486.0 | 26.0 | NaN | 1.5 | 8.0 | 2520.0 | 182.0 | 38.0 | 119.0 | 3.54 | Domestic | 810.750000 | 6486.0 |
23 | Ford Fiesta | 4389.0 | 28.0 | 4.0 | 1.5 | 9.0 | 1800.0 | 147.0 | 33.0 | 98.0 | 3.15 | Domestic | 487.666667 | 4389.0 |
17 | Chev. Monza | 3667.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2750.0 | 179.0 | 40.0 | 151.0 | 2.73 | Domestic | 523.857143 | 3667.0 |
51 | Pont. Sunbird | 4172.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2690.0 | 179.0 | 41.0 | 151.0 | 2.73 | Domestic | 596.000000 | 4172.0 |
Use `.groupby()` for within-group operations; the workflow is as follows.

See: http://pandas.pydata.org/pandas-docs/stable/groupby.html
Step 1: create a group object that specifies the groups we want to create.
col_list = ['price', 'mpg', 'headroom', 'trunk', 'weight', 'length']
grouped = df_auto[col_list + ['foreign']].groupby(['foreign'])
grouped.mean()
price | mpg | headroom | trunk | weight | length | |
---|---|---|---|---|---|---|
foreign | ||||||
Domestic | 6072.423077 | 19.826923 | 3.153846 | 14.750000 | 3317.115385 | 196.134615 |
Foreign | 6384.681818 | 24.772727 | 2.613636 | 11.409091 | 2315.909091 | 168.545455 |
grouped.get_group('Domestic').head()
price | mpg | headroom | trunk | weight | length | foreign | |
---|---|---|---|---|---|---|---|
47 | 4934.0 | 18.0 | 1.5 | 7.0 | 3470.0 | 198.0 | Domestic |
44 | 6486.0 | 26.0 | 1.5 | 8.0 | 2520.0 | 182.0 | Domestic |
23 | 4389.0 | 28.0 | 1.5 | 9.0 | 1800.0 | 147.0 | Domestic |
17 | 3667.0 | 24.0 | 2.0 | 7.0 | 2750.0 | 179.0 | Domestic |
51 | 4172.0 | 24.0 | 2.0 | 7.0 | 2690.0 | 179.0 | Domestic |
Iterating over the groups in the `grouped` object:
对象中的组for name, group in grouped:
print(name)
print(group.head())
Domestic
price mpg headroom trunk weight length foreign
47 4934.0 18.0 1.5 7.0 3470.0 198.0 Domestic
44 6486.0 26.0 1.5 8.0 2520.0 182.0 Domestic
23 4389.0 28.0 1.5 9.0 1800.0 147.0 Domestic
17 3667.0 24.0 2.0 7.0 2750.0 179.0 Domestic
51 4172.0 24.0 2.0 7.0 2690.0 179.0 Domestic
Foreign
price mpg headroom trunk weight length foreign
55 6229.0 23.0 1.5 6.0 2370.0 170.0 Foreign
56 4589.0 35.0 2.0 8.0 2020.0 165.0 Foreign
68 5719.0 18.0 2.0 11.0 2670.0 175.0 Foreign
72 6850.0 25.0 2.0 16.0 1990.0 156.0 Foreign
61 4499.0 28.0 2.5 5.0 1760.0 149.0 Foreign
Applying the `.apply()` function to a group object:
Using a `lambda` inside `.apply()` is a great way to iterate over subsets of the data.
For example, suppose we want the cheapest car in each `trunk` size category:
df_auto.groupby('trunk').apply(lambda df: df.sort_values('price').iloc[0]).head()
make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | price_trunk_ratio | new_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
trunk | ||||||||||||||
5.0 | Honda Civic | 4499.0 | 28.0 | 4.0 | 2.5 | 5.0 | 1760.0 | 149.0 | 34.0 | 91.0 | 3.30 | Foreign | 899.800000 | 6748.5 |
6.0 | Datsun 200 | 6229.0 | 23.0 | 4.0 | 1.5 | 6.0 | 2370.0 | 170.0 | 35.0 | 119.0 | 3.89 | Foreign | 1038.166667 | 9343.5 |
7.0 | Chev. Monza | 3667.0 | 24.0 | 2.0 | 2.0 | 7.0 | 2750.0 | 179.0 | 40.0 | 151.0 | 2.73 | Domestic | 523.857143 | 3667.0 |
8.0 | Dodge Colt | 3984.0 | 30.0 | 5.0 | 2.0 | 8.0 | 2120.0 | 163.0 | 35.0 | 98.0 | 3.54 | Domestic | 498.000000 | 3984.0 |
9.0 | Chev. Chevette | 3299.0 | 29.0 | 3.0 | 2.5 | 9.0 | 2110.0 | 163.0 | 34.0 | 231.0 | 2.93 | Domestic | 366.555556 | 3299.0 |
To collapse each group to a single row in a new dataframe, there are many options; the examples below show a few:

`grouped.sum()` and `grouped.mean()`
grouped.sum()
price | mpg | headroom | trunk | weight | length | |
---|---|---|---|---|---|---|
foreign | ||||||
Domestic | 315766.0 | 1031.0 | 164.0 | 767.0 | 172490.0 | 10199.0 |
Foreign | 140463.0 | 545.0 | 57.5 | 251.0 | 50950.0 | 3708.0 |
grouped.mean()
price | mpg | headroom | trunk | weight | length | |
---|---|---|---|---|---|---|
foreign | ||||||
Domestic | 6072.423077 | 19.826923 | 3.153846 | 14.750000 | 3317.115385 | 196.134615 |
Foreign | 6384.681818 | 24.772727 | 2.613636 | 11.409091 | 2315.909091 | 168.545455 |
`grouped.count()` and `grouped.size()`
grouped.count()
price | mpg | headroom | trunk | weight | length | |
---|---|---|---|---|---|---|
foreign | ||||||
Domestic | 52 | 52 | 52 | 52 | 52 | 52 |
Foreign | 22 | 22 | 22 | 22 | 22 | 22 |
grouped.size()
foreign
Domestic 52
Foreign 22
dtype: int64
`grouped.first()` and `grouped.last()`
grouped.first()
price | mpg | headroom | trunk | weight | length | |
---|---|---|---|---|---|---|
foreign | ||||||
Domestic | 4934.0 | 18.0 | 1.5 | 7.0 | 3470.0 | 198.0 |
Foreign | 6229.0 | 23.0 | 1.5 | 6.0 | 2370.0 | 170.0 |
grouped.last()
price | mpg | headroom | trunk | weight | length | |
---|---|---|---|---|---|---|
foreign | ||||||
Domestic | 4060.0 | 18.0 | 5.0 | 16.0 | 3330.0 | 201.0 |
Foreign | 12990.0 | 14.0 | 3.5 | 14.0 | 3420.0 | 192.0 |
`grouped.agg()` allows a different aggregation (or several) per column:

grouped.agg({'price' : 'first', 'mpg' : ['mean', 'median'], 'trunk' : ['mean', (lambda x: 100 * np.mean(x))]})
price | mpg | trunk | |||
---|---|---|---|---|---|
first | mean | median | mean | <lambda_0> | |
foreign | |||||
Domestic | 4934.0 | 19.826923 | 19.0 | 14.750000 | 1475.000000 |
Foreign | 6229.0 | 24.772727 | 24.5 | 11.409091 | 1140.909091 |
For more `.groupby` operations, see:
https://pandas.pydata.org/pandas-docs/stable/groupby.html
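One more groupby tool worth knowing (an addition, not from the original tutorial): `.transform()` returns a result aligned to the original rows, which is handy for operations like demeaning within groups:

```python
import pandas as pd

df = pd.DataFrame({'foreign': ['Domestic', 'Domestic', 'Foreign', 'Foreign'],
                   'price': [4000, 6000, 3000, 5000]})

# transform('mean') broadcasts each group's mean back to every row
df['group_mean'] = df.groupby('foreign')['price'].transform('mean')
df['demeaned'] = df['price'] - df['group_mean']
print(df['demeaned'].tolist())  # [-1000.0, 1000.0, -1000.0, 1000.0]
```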
tuples = [('bar', 'one', 1, 2),
('bar', 'two', 3, 4),
('bar', 'three', 5, 6),
('baz', 'one', 1, 2),
('baz', 'two', 3, 4),
('baz', 'three', 5, 6),
('foo', 'one', 1, 2),
('foo', 'two', 3, 4),
('foo', 'three', 5, 6)
]
df = pd.DataFrame(tuples)
df.columns = ['first', 'second', 'A', 'B']
df
first | second | A | B | |
---|---|---|---|---|
0 | bar | one | 1 | 2 |
1 | bar | two | 3 | 4 |
2 | bar | three | 5 | 6 |
3 | baz | one | 1 | 2 |
4 | baz | two | 3 | 4 |
5 | baz | three | 5 | 6 |
6 | foo | one | 1 | 2 |
7 | foo | two | 3 | 4 |
8 | foo | three | 5 | 6 |
Using the `pivot()` function:
http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects
df.pivot(index='first', columns='second', values='A')
second | one | three | two |
---|---|---|---|
first | |||
bar | 1 | 5 | 3 |
baz | 1 | 5 | 3 |
foo | 1 | 5 | 3 |
Using the `pd.pivot_table()` function: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
pd.pivot_table(df, values=['A', 'B'], index='first', columns='second')
A | B | |||||
---|---|---|---|---|---|---|
second | one | three | two | one | three | two |
first | ||||||
bar | 1 | 5 | 3 | 2 | 6 | 4 |
baz | 1 | 5 | 3 | 2 | 6 | 4 |
foo | 1 | 5 | 3 | 2 | 6 | 4 |
Note 1: the above illustrates that Pandas essentially has two indexes: the usual "row index" and a "column index".

Note 2: Pandas also has a function called `pandas.melt` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) for unpivoting a wide dataframe into long format.
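Since `pandas.melt` is only mentioned above, here is a minimal sketch of what it does: it unpivots wide columns into long form (toy data, not from the notebook):

```python
import pandas as pd

df = pd.DataFrame({'first': ['bar', 'baz'], 'A': [1, 1], 'B': [2, 2]})

# melt() turns the wide columns A and B into (variable, value) rows
long_df = pd.melt(df, id_vars='first', value_vars=['A', 'B'],
                  var_name='variable', value_name='value')
print(long_df)
```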
`stack` and `unstack` are advanced operators for reshaping a dataframe based on a multi-level index.
pivot_df = pd.pivot_table(df, values=['A', 'B'], index='first', columns='second')
pivot_df
A | B | |||||
---|---|---|---|---|---|---|
second | one | three | two | one | three | two |
first | ||||||
bar | 1 | 5 | 3 | 2 | 6 | 4 |
baz | 1 | 5 | 3 | 2 | 6 | 4 |
foo | 1 | 5 | 3 | 2 | 6 | 4 |
pivot_df.stack(level=['second'])
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 1 | 2 |
three | 5 | 6 | |
two | 3 | 4 | |
baz | one | 1 | 2 |
three | 5 | 6 | |
two | 3 | 4 | |
foo | one | 1 | 2 |
three | 5 | 6 | |
two | 3 | 4 |
Note: we could also simply use `pivot_df.stack()`, since by default it selects the "last" level of the index.
df.set_index(['first', 'second'], inplace=True)
df
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 1 | 2 |
two | 3 | 4 | |
three | 5 | 6 | |
baz | one | 1 | 2 |
two | 3 | 4 | |
three | 5 | 6 | |
foo | one | 1 | 2 |
two | 3 | 4 | |
three | 5 | 6 |
df.unstack(level=['first'])
A | B | |||||
---|---|---|---|---|---|---|
first | bar | baz | foo | bar | baz | foo |
second | ||||||
one | 1 | 1 | 1 | 2 | 2 | 2 |
three | 5 | 5 | 5 | 6 | 6 | 6 |
two | 3 | 3 | 3 | 4 | 4 | 4 |
df.unstack(level=['second'])
A | B | |||||
---|---|---|---|---|---|---|
second | one | three | two | one | three | two |
first | ||||||
bar | 1 | 5 | 3 | 2 | 6 | 4 |
baz | 1 | 5 | 3 | 2 | 6 | 4 |
foo | 1 | 5 | 3 | 2 | 6 | 4 |
Pandas has extensive built-in functionality for working with time-series data: http://pandas.pydata.org/pandas-docs/stable/timeseries.html

Overview from the documentation:
Class | Remarks | How to create: |
---|---|---|
Timestamp | Represents a single time stamp | to_datetime, Timestamp |
DatetimeIndex | Index of Timestamp | to_datetime, date_range, DatetimeIndex |
Period | Represents a single time span | Period |
PeriodIndex | Index of Period | period_range, PeriodIndex |
date_index = pd.date_range('1/1/2011', periods=len(df_auto.index), freq='D')
date_index[0:5]
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05'],
dtype='datetime64[ns]', freq='D')
For illustration, add a date column to `df_auto`:
df_ad = df_auto.copy()[['make', 'price']]
df_ad['date'] = date_index
df_ad.head()
make | price | date | |
---|---|---|---|
55 | Datsun 200 | 6229.0 | 2011-01-01 |
47 | Pont. Firebird | 4934.0 | 2011-01-02 |
44 | Plym. Sapporo | 6486.0 | 2011-01-03 |
23 | Ford Fiesta | 4389.0 | 2011-01-04 |
17 | Chev. Monza | 3667.0 | 2011-01-05 |
df_ad.dtypes
make object
price float64
date datetime64[ns]
dtype: object
Converting a `str` column to a `date` column: when importing data from external sources, dates are often read in as strings.
df_ad['date'] = df_ad['date'].astype(str)
df_ad['date'].dtypes
dtype('O')
We can no longer perform any datetime operations on this column because its data type is wrong!
Fortunately, we can fix it like this:
pd.to_datetime(df_ad['date']).dtypes
dtype('<M8[ns]')
Or:
df_ad['date'] = df_ad['date'].apply(lambda x: pd.to_datetime(x))
pd.Timestamp('2011-02-01')
Timestamp('2011-02-01 00:00:00')
pd.to_datetime('01-02-2011', format='%d-%m-%Y')
Timestamp('2011-02-01 00:00:00')
Tip: it is usually best to include the format explicitly, to avoid unexpected behavior.
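To illustrate why the explicit format matters: without it, `'01-02-2011'` is parsed month-first by default, which may not be what you intended:

```python
import pandas as pd

# without a format, the ambiguous string is read month-first
print(pd.to_datetime('01-02-2011'))                     # 2011-01-02
# an explicit day-first format removes the ambiguity
print(pd.to_datetime('01-02-2011', format='%d-%m-%Y'))  # 2011-02-01
```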
df_ad[df_ad.date > pd.to_datetime('07-03-2011', format='%d-%m-%Y')]
make | price | date | |
---|---|---|---|
37 | Olds Delta 88 | 4890.0 | 2011-03-08 |
46 | Pont. Catalina | 5798.0 | 2011-03-09 |
5 | Buick LeSabre | 5788.0 | 2011-03-10 |
38 | Olds Omega | 4181.0 | 2011-03-11 |
3 | Buick Century | 4816.0 | 2011-03-12 |
36 | Olds Cutlass | 4733.0 | 2011-03-13 |
22 | Dodge St. Regis | 6342.0 | 2011-03-14 |
45 | Plym. Volare | 4060.0 | 2011-03-15 |
UvT_Car | NaN | NaN | 2011-03-16 |
UvT_Bike | NaN | NaN | 2011-03-17 |
We can also use Pandas' `.isin()` together with a `date_range` object:
df_ad[df_ad.date.isin(pd.date_range('2/20/2011', '3/11/2011', freq='D'))]
make | price | date | |
---|---|---|---|
11 | Cad. Eldorado | 14500.0 | 2011-02-20 |
29 | Merc. Cougar | 5379.0 | 2011-02-21 |
8 | Buick Riviera | 10372.0 | 2011-02-22 |
15 | Chev. Malibu | 4504.0 | 2011-02-23 |
33 | Merc. Zephyr | 3291.0 | 2011-02-24 |
40 | Olds Toronado | 10371.0 | 2011-02-25 |
49 | Pont. Le Mans | 4723.0 | 2011-02-26 |
25 | Linc. Continental | 11497.0 | 2011-02-27 |
30 | Merc. Marquis | 6165.0 | 2011-02-28 |
20 | Dodge Diplomat | 4010.0 | 2011-03-01 |
21 | Dodge Magnum | 5886.0 | 2011-03-02 |
43 | Plym. Horizon | 4482.0 | 2011-03-03 |
4 | Buick Electra | 7827.0 | 2011-03-04 |
10 | Cad. Deville | 11385.0 | 2011-03-05 |
14 | Chev. Impala | 5705.0 | 2011-03-06 |
34 | Olds 98 | 8814.0 | 2011-03-07 |
37 | Olds Delta 88 | 4890.0 | 2011-03-08 |
46 | Pont. Catalina | 5798.0 | 2011-03-09 |
5 | Buick LeSabre | 5788.0 | 2011-03-10 |
38 | Olds Omega | 4181.0 | 2011-03-11 |
You can extract the `day`, `month`, and `year` from a date:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components
df_ad['day'] = df_ad['date'].apply(lambda x: x.day)
df_ad.head()
make | price | date | day | |
---|---|---|---|---|
55 | Datsun 200 | 6229.0 | 2011-01-01 | 1 |
47 | Pont. Firebird | 4934.0 | 2011-01-02 | 2 |
44 | Plym. Sapporo | 6486.0 | 2011-01-03 | 3 |
23 | Ford Fiesta | 4389.0 | 2011-01-04 | 4 |
17 | Chev. Monza | 3667.0 | 2011-01-05 | 5 |
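A vectorized alternative to the `apply(lambda ...)` above is the `.dt` accessor, which exposes date components directly on a datetime column (a short sketch on toy data):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2011-01-01', '2011-01-02'])})

# .dt gives element-wise access to components without apply()
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
print(df[['day', 'month', 'year']].values.tolist())  # [[1, 1, 2011], [2, 1, 2011]]
```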
See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects
df_ad['new_date'] = df_ad.date.apply(lambda x: x + pd.DateOffset(years=1))
df_ad.head()
make | price | date | day | new_date | |
---|---|---|---|---|---|
55 | Datsun 200 | 6229.0 | 2011-01-01 | 1 | 2012-01-01 |
47 | Pont. Firebird | 4934.0 | 2011-01-02 | 2 | 2012-01-02 |
44 | Plym. Sapporo | 6486.0 | 2011-01-03 | 3 | 2012-01-03 |
23 | Ford Fiesta | 4389.0 | 2011-01-04 | 4 | 2012-01-04 |
17 | Chev. Monza | 3667.0 | 2011-01-05 | 5 | 2012-01-05 |