NumPy excels at vectorized numerical computation. But for more flexible, complex data tasks, such as attaching labels to data, handling missing values, grouping, and building pivot tables, NumPy falls short. Pandas, a library built on top of NumPy, provides high-level data structures and tools that make data analysis faster and easier.
A Series is a one-dimensional array of labeled data.
Creating a Series object
General form: pd.Series(data, index=index, dtype=dtype)
data: the data; can be a list, a dictionary, or a NumPy array
index: the index; optional
dtype: the data type; optional
1、Creating from a list
import pandas as pd
data = pd.Series([1.5, 3, 4.5, 6])
data
0 1.5
1 3.0
2 4.5
3 6.0
dtype: float64
data = pd.Series([1.5, 3, 4.5, 6], index=["a", "b", "c", "d"])
data
a 1.5
b 3.0
c 4.5
d 6.0
dtype: float64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: int64
data = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], dtype="float")
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
Note: the data may mix several types
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"])
data
a 1
b 2
c 3
d 4
dtype: object
data["a"]
1
data["c"]
'3'
The data type can be coerced explicitly
data = pd.Series([1, 2, "3", 4], index=["a", "b", "c", "d"], dtype=float)
data
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
data["c"]
3.0
A value that cannot be converted to float raises an error
data = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"], dtype=float)
data
---------------------------------------------------------------------------
ValueError: could not convert string to float: 'a'
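If unconvertible entries should be tolerated instead of raising an error, pd.to_numeric with errors="coerce" turns them into NaN. A minimal sketch:

```python
import pandas as pd

# "a" cannot be parsed as a number; errors="coerce" replaces it with NaN
raw = pd.Series([1, 2, "a", 4], index=["a", "b", "c", "d"])
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
```

The result is a float64 Series with NaN at label "c".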
2、Creating from a one-dimensional NumPy array
import numpy as np
x = np.arange(5)
pd.Series(x)
0 0
1 1
2 2
3 3
4 4
dtype: int32
3、Creating from a dictionary
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
population
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
dtype: int64
Note: when an explicit index is passed, only matching keys are kept, and labels absent from the dictionary become NaN.
population = pd.Series(population_dict, index=["BeiJing", "HangZhou", "c", "d"])
population
BeiJing 2154.0
HangZhou 981.0
c NaN
d NaN
dtype: float64
4、When data is a scalar
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
A DataFrame is a two-dimensional array of labeled data.
Creating a DataFrame object
General form: pd.DataFrame(data, index=index, columns=columns)
data: the data; can be a list, a dictionary, or a NumPy array
index: the index; optional
columns: the column labels; optional
1、Creating from a Series object
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
pd.DataFrame(population)
0 | |
---|---|
BeiJing | 2154 |
ShangHai | 2424 |
ShenZhen | 1303 |
HangZhou | 981 |
pd.DataFrame(population, columns=["population"])
population | |
---|---|
BeiJing | 2154 |
ShangHai | 2424 |
ShenZhen | 1303 |
HangZhou | 981 |
2、Creating from a dictionary of Series objects
GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
pd.DataFrame({"population": population,
"GDP": GDP})
population | GDP | |
---|---|---|
BeiJing | 2154 | 30320 |
ShangHai | 2424 | 32680 |
ShenZhen | 1303 | 24222 |
HangZhou | 981 | 13468 |
Note: a scalar value is automatically broadcast to fill the whole column
pd.DataFrame({"population": population,
"GDP": GDP,
"country": "China"})
population | GDP | country | |
---|---|---|---|
BeiJing | 2154 | 30320 | China |
ShangHai | 2424 | 32680 | China |
ShenZhen | 1303 | 24222 | China |
HangZhou | 981 | 13468 | China |
3、Creating from a list of dictionaries
import numpy as np
import pandas as pd
data = [{"a": i, "b": 2*i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
data = pd.DataFrame(data)
data
a | b | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 2 |
2 | 2 | 4 |
No row labels were given, so the rows get default integer labels starting at 0, while the column labels come from the dictionary keys.
data1 = data["a"].copy()
data1
0 0
1 1
2 2
Name: a, dtype: int64
data1[0] = 10
data1
0 10
1 1
2 2
Name: a, dtype: int64
data
a | b | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 2 |
2 | 2 | 4 |
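The reason .copy() is used above: plain assignment only binds another name to the same object, while .copy() produces independent data. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2]})
alias = df            # plain assignment: another name for the same object
snapshot = df.copy()  # an independent copy of the data
df.loc[0, "a"] = 10
print(alias.loc[0, "a"])     # the alias sees the change
print(snapshot.loc[0, "a"])  # the copy does not
```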
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data
[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]
pd.DataFrame(data)
a | b | c | |
---|---|---|---|
0 | 1.0 | 1 | NaN |
1 | NaN | 3 | 4.0 |
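Missing entries introduced this way can be filled afterwards with fillna. A minimal sketch:

```python
import pandas as pd

# dictionaries with different keys produce NaN in the missing cells
df = pd.DataFrame([{"a": 1, "b": 1}, {"b": 3, "c": 4}])
filled = df.fillna(0)  # replace every NaN with 0
print(filled)
```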
4、Creating from a two-dimensional NumPy array
data = np.random.randint(10, size=(3, 2))
data
array([[1, 6],
[2, 9],
[4, 0]])
pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])
foo | bar | |
---|---|---|
a | 1 | 6 |
b | 2 | 9 |
c | 4 | 0 |
1、Attributes
data = pd.DataFrame({"pop": population, "GDP": GDP})
data
pop | GDP | |
---|---|---|
BeiJing | 2154 | 30320 |
ShangHai | 2424 | 32680 |
ShenZhen | 1303 | 24222 |
HangZhou | 981 | 13468 |
(1) df.values returns the data as a NumPy array
data.values
array([[ 2154, 30320],
[ 2424, 32680],
[ 1303, 24222],
[ 981, 13468]], dtype=int64)
(2) df.index returns the row index
data.index
Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')
(3) df.columns returns the column labels
data.columns
Index(['pop', 'GDP'], dtype='object')
(4) df.shape returns the shape
data.shape
(4, 2)
(5) df.size returns the number of elements
data.size
8
(6) df.dtypes returns the data type of each column
data.dtypes
pop int64
GDP int64
dtype: object
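Since dtypes reports per-column types, astype can be used to change them. A minimal sketch on a frame like the one above:

```python
import pandas as pd

df = pd.DataFrame({"pop": [2154, 2424], "GDP": [30320, 32680]})
df_float = df.astype("float64")  # convert every column to float64
print(df_float.dtypes)
```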
2、Indexing
data
pop | GDP | |
---|---|---|
BeiJing | 2154 | 30320 |
ShangHai | 2424 | 32680 |
ShenZhen | 1303 | 24222 |
HangZhou | 981 | 13468 |
(1) Selecting columns
data["pop"]
BeiJing 2154
ShangHai 2424
ShenZhen 1303
HangZhou 981
Name: pop, dtype: int64
data[["GDP", "pop"]]
GDP | pop | |
---|---|---|
BeiJing | 30320 | 2154 |
ShangHai | 32680 | 2424 |
ShenZhen | 24222 | 1303 |
HangZhou | 13468 | 981 |
data.GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
Name: GDP, dtype: int64
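Attribute access like data.GDP is convenient but has a pitfall: it only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. The column named "pop" above is exactly such a collision, because DataFrame.pop is a method; bracket indexing always selects the column. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({"pop": [2154, 2424], "GDP": [30320, 32680]},
                    index=["BeiJing", "ShangHai"])
print(callable(data.pop))  # True: this is the DataFrame.pop method, not the column
print(data["pop"])         # bracket indexing reliably selects the column
```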
(2) Selecting rows
data.loc["BeiJing"]
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
data.loc[["BeiJing", "HangZhou"]]
pop | GDP | |
---|---|---|
BeiJing | 2154 | 30320 |
HangZhou | 981 | 13468 |
data
pop | GDP | |
---|---|---|
BeiJing | 2154 | 30320 |
ShangHai | 2424 | 32680 |
ShenZhen | 1303 | 24222 |
HangZhou | 981 | 13468 |
data.iloc[0]
pop 2154
GDP 30320
Name: BeiJing, dtype: int64
data.iloc[[1, 3]]
pop | GDP | |
---|---|---|
ShangHai | 2424 | 32680 |
HangZhou | 981 | 13468 |
(3) Selecting a scalar
data
pop | GDP | |
---|---|---|
BeiJing | 2154 | 30320 |
ShangHai | 2424 | 32680 |
ShenZhen | 1303 | 24222 |
HangZhou | 981 | 13468 |
data.loc["BeiJing", "GDP"]
30320
data.iloc[0, 1]
30320
data.values[0][1]
30320
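For single scalars, pandas also provides the dedicated accessors .at (label-based) and .iat (position-based), which are typically faster than .loc/.iloc. A minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({"pop": [2154, 2424], "GDP": [30320, 32680]},
                    index=["BeiJing", "ShangHai"])
print(data.at["BeiJing", "GDP"])  # label-based scalar access
print(data.iat[0, 1])             # position-based scalar access
```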
(4) Indexing a Series object
type(data.GDP)
pandas.core.series.Series
GDP
BeiJing 30320
ShangHai 32680
ShenZhen 24222
HangZhou 13468
dtype: int64
GDP["BeiJing"]
30320
3、Slicing
dates = pd.date_range(start='2019-01-01', periods=6)
dates
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
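Besides start and periods, date_range also accepts an end date and a step frequency. A minimal sketch:

```python
import pandas as pd

# every second day from 2019-01-01 through 2019-01-11
idx = pd.date_range(start="2019-01-01", end="2019-01-11", freq="2D")
print(idx)
```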
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
(1) Row slicing
df["2019-01-01": "2019-01-03"]
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
df.loc["2019-01-01": "2019-01-03"]
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
df.iloc[0: 3]
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
Note: with iloc the endpoint 3 is excluded, whereas loc slices include both endpoints.
(2) Column slicing
df
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
df.loc[:, "A": "C"]
A | B | C | |
---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 |
df.iloc[:, 0: 3]
A | B | C | |
---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 |
(3) Mixed selections
df
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
df.loc["2019-01-02": "2019-01-03", "C":"D"]
C | D | |
---|---|---|
2019-01-02 | 1.080779 | -2.294395 |
2019-01-03 | 1.102248 | 1.207726 |
df.iloc[1: 3, 2:]
C | D | |
---|---|---|
2019-01-02 | 1.080779 | -2.294395 |
2019-01-03 | 1.102248 | 1.207726 |
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]
A | C | |
---|---|---|
2019-01-04 | 0.305088 | -0.978434 |
2019-01-05 | 0.313383 | 0.163155 |
2019-01-06 | 0.250613 | -0.858240 |
df.iloc[3:, [0, 2]]
A | C | |
---|---|---|
2019-01-04 | 0.305088 | -0.978434 |
2019-01-05 | 0.313383 | 0.163155 |
2019-01-06 | 0.250613 | -0.858240 |
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]
The form above does not work: a list passed to .loc must contain the actual index labels, and these date strings are not matched against the DatetimeIndex the way slices are.
df.iloc[[1, 5], 0: 3]
A | B | C | |
---|---|---|---|
2019-01-02 | -0.234414 | -1.194674 | 1.080779 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 |
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]
Likewise, the form above does not work.
df.iloc[[1, 5], [0, 3]]
A | D | |
---|---|---|
2019-01-02 | -0.234414 | -2.294395 |
2019-01-06 | 0.250613 | -1.573342 |
4、Boolean indexing
This is the counterpart of masking in NumPy.
df
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
df > 0
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | False | False | True | False |
2019-01-02 | False | False | True | False |
2019-01-03 | False | True | True | True |
2019-01-04 | True | True | False | True |
2019-01-05 | True | True | True | False |
2019-01-06 | True | False | False | False |
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | NaN | NaN | 0.925984 | NaN |
2019-01-02 | NaN | NaN | 1.080779 | NaN |
2019-01-03 | NaN | 0.058118 | 1.102248 | 1.207726 |
2019-01-04 | 0.305088 | 0.535920 | NaN | 0.177251 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | NaN |
2019-01-06 | 0.250613 | NaN | NaN | NaN |
As you can see, positions where the mask is True keep their values, while positions where it is False become NaN.
df.A > 0
2019-01-01 False
2019-01-02 False
2019-01-03 False
2019-01-04 True
2019-01-05 True
2019-01-06 True
Freq: D, Name: A, dtype: bool
df[df.A > 0]
A | B | C | D | |
---|---|---|---|---|
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 |
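Multiple conditions can be combined, but with the element-wise operators & | ~ (each condition in parentheses), not Python's and/or. A minimal sketch on a small frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 0.5, 2.0], "B": [1.0, -2.0, 3.0]})
mask = (df.A > 0) & (df.B > 0)  # element-wise AND; parentheses are required
print(df[mask])
```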
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
A | B | C | D | E | |
---|---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 | one |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | one |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | two |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | three |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | four |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | three |
ind = df2["E"].isin(["two", "four"])
ind
2019-01-01 False
2019-01-02 False
2019-01-03 True
2019-01-04 False
2019-01-05 True
2019-01-06 False
Freq: D, Name: E, dtype: bool
df2[ind]
A | B | C | D | E | |
---|---|---|---|---|---|
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | two |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | four |
5、Assignment
df
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1
2019-01-01 1
2019-01-02 2
2019-01-03 3
2019-01-04 4
2019-01-05 5
2019-01-06 6
Freq: D, dtype: int64
df["E"] = s1
df
A | B | C | D | E | |
---|---|---|---|---|---|
2019-01-01 | -0.935378 | -0.190742 | 0.925984 | -0.818969 | 1 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | 2 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | 3 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | 4 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | 5 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | 6 |
df.loc["2019-01-01", "A"] = 0
df
A | B | C | D | E | |
---|---|---|---|---|---|
2019-01-01 | 0.000000 | -0.190742 | 0.925984 | -0.818969 | 1 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | 2 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | 3 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | 4 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | 5 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | 6 |
df.iloc[0, 1] = 0
df
A | B | C | D | E | |
---|---|---|---|---|---|
2019-01-01 | 0.000000 | 0.000000 | 0.925984 | -0.818969 | 1 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | -2.294395 | 2 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 1.207726 | 3 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 0.177251 | 4 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | -0.296649 | 5 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | -1.573342 | 6 |
df["D"] = np.array([5]*len(df)) # could be simplified to df["D"] = 5
df
A | B | C | D | E | |
---|---|---|---|---|---|
2019-01-01 | 0.000000 | 0.000000 | 0.925984 | 5 | 1 |
2019-01-02 | -0.234414 | -1.194674 | 1.080779 | 5 | 2 |
2019-01-03 | -0.141572 | 0.058118 | 1.102248 | 5 | 3 |
2019-01-04 | 0.305088 | 0.535920 | -0.978434 | 5 | 4 |
2019-01-05 | 0.313383 | 0.234041 | 0.163155 | 5 | 5 |
2019-01-06 | 0.250613 | -0.904400 | -0.858240 | 5 | 6 |
df.index = [i for i in range(len(df))]
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.925984 | 5 | 1 |
1 | -0.234414 | -1.194674 | 1.080779 | 5 | 2 |
2 | -0.141572 | 0.058118 | 1.102248 | 5 | 3 |
3 | 0.305088 | 0.535920 | -0.978434 | 5 | 4 |
4 | 0.313383 | 0.234041 | 0.163155 | 5 | 5 |
5 | 0.250613 | -0.904400 | -0.858240 | 5 | 6 |
df.columns = [i for i in range(df.shape[1])]
df
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 0.000000 | 0.000000 | 0.925984 | 5 | 1 |
1 | -0.234414 | -1.194674 | 1.080779 | 5 | 2 |
2 | -0.141572 | 0.058118 | 1.102248 | 5 | 3 |
3 | 0.305088 | 0.535920 | -0.978434 | 5 | 4 |
4 | 0.313383 | 0.234041 | 0.163155 | 5 | 5 |
5 | 0.250613 | -0.904400 | -0.858240 | 5 | 6 |
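Reassigning df.index or df.columns, as above, replaces all labels at once; to rename only some labels, df.rename takes a mapping and returns a new frame. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = df.rename(columns={"A": "alpha"})  # rename only column "A"
print(df2.columns.tolist())
```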
1、Inspecting the data
import pandas as pd
import numpy as np
dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.854043 | 0.412345 | -2.296051 | -0.048964 |
2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
(1) Viewing the first rows
df.head() # defaults to 5 rows; a different number can be passed
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.854043 | 0.412345 | -2.296051 | -0.048964 |
2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
df.head(2)
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.854043 | 0.412345 | -2.296051 | -0.048964 |
2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
(2) Viewing the last rows
df.tail() # defaults to 5 rows
A | B | C | D | |
---|---|---|---|---|
2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
df.tail(3)
A | B | C | D | |
---|---|---|---|---|
2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
(3) Viewing overall information
df.iloc[0, 3] = np.nan
df
A | B | C | D | |
---|---|---|---|---|
2019-01-01 | -0.854043 | 0.412345 | -2.296051 | NaN |
2019-01-02 | 1.371364 | -0.121454 | -0.299653 | 1.095375 |
2019-01-03 | -0.714591 | -1.103224 | 0.979250 | 0.319455 |
2019-01-04 | -1.397557 | 0.426008 | 0.233861 | -1.651887 |
2019-01-05 | 0.434026 | 0.459830 | -0.095444 | 1.220302 |
2019-01-06 | -0.133876 | 0.074500 | -1.028147 | 0.605402 |
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
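Alongside info(), describe() summarizes per-column statistics (count, mean, std, min, quartiles, max), skipping NaN values. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, np.nan, 6.0]})
summary = df.describe()  # NaN values are excluded from each statistic
print(summary)
```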
2、NumPy's universal functions also work on Pandas objects
(1) Vectorized operations
x = pd.DataFrame(np.arange(4).reshape(1, 4))
x
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
x+5
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 5 | 6 | 7 | 8 |
np.exp(x)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 2.718282 | 7.389056 | 20.085537 |
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 4 | 5 | 6 | 7 |
x*y
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 5 | 12 | 21 |
(2) Matrix operations
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 3 | 7 | 4 | 6 | 9 | 2 | 6 | 7 | 4 | ... | 4 | 0 | 9 | 5 | 8 | 0 | 9 | 2 | 6 | 3 |
1 | 8 | 2 | 4 | 2 | 6 | 4 | 8 | 6 | 1 | 3 | ... | 2 | 0 | 3 | 1 | 7 | 3 | 1 | 5 | 5 | 9 |
2 | 3 | 5 | 1 | 9 | 1 | 9 | 3 | 7 | 6 | 8 | ... | 6 | 8 | 7 | 0 | 7 | 7 | 2 | 0 | 7 | 2 |
3 | 2 | 0 | 4 | 9 | 6 | 9 | 8 | 6 | 8 | 7 | ... | 0 | 2 | 4 | 2 | 0 | 4 | 9 | 6 | 6 | 8 |
4 | 9 | 9 | 2 | 6 | 0 | 3 | 3 | 4 | 6 | 6 | ... | 9 | 6 | 8 | 6 | 0 | 0 | 8 | 8 | 3 | 8 |
5 | 2 | 6 | 5 | 7 | 8 | 4 | 0 | 2 | 9 | 7 | ... | 2 | 0 | 4 | 0 | 7 | 0 | 0 | 1 | 1 | 5 |
6 | 6 | 4 | 0 | 0 | 2 | 1 | 4 | 9 | 5 | 6 | ... | 5 | 0 | 8 | 5 | 2 | 3 | 3 | 2 | 9 | 2 |
7 | 2 | 3 | 6 | 3 | 8 | 0 | 7 | 6 | 1 | 7 | ... | 3 | 0 | 1 | 0 | 4 | 4 | 6 | 8 | 8 | 2 |
8 | 2 | 2 | 3 | 7 | 5 | 7 | 0 | 7 | 3 | 0 | ... | 1 | 1 | 5 | 2 | 8 | 3 | 0 | 3 | 0 | 4 |
9 | 3 | 7 | 7 | 6 | 2 | 0 | 0 | 2 | 5 | 6 | ... | 4 | 2 | 3 | 2 | 0 | 0 | 4 | 5 | 2 | 8 |
10 | 4 | 7 | 0 | 4 | 2 | 0 | 3 | 4 | 6 | 0 | ... | 5 | 6 | 1 | 9 | 1 | 9 | 0 | 7 | 0 | 8 |
11 | 5 | 6 | 9 | 6 | 9 | 2 | 1 | 8 | 7 | 9 | ... | 6 | 5 | 2 | 8 | 9 | 5 | 9 | 9 | 5 | 0 |
12 | 3 | 9 | 5 | 5 | 4 | 0 | 7 | 4 | 4 | 6 | ... | 0 | 7 | 2 | 9 | 6 | 9 | 4 | 9 | 4 | 6 |
13 | 8 | 4 | 0 | 9 | 9 | 0 | 1 | 5 | 8 | 7 | ... | 5 | 8 | 4 | 0 | 3 | 4 | 9 | 9 | 4 | 6 |
14 | 3 | 0 | 4 | 6 | 9 | 9 | 5 | 4 | 3 | 1 | ... | 6 | 1 | 0 | 3 | 7 | 1 | 2 | 0 | 0 | 2 |
15 | 4 | 2 | 0 | 0 | 7 | 9 | 1 | 2 | 1 | 2 | ... | 6 | 3 | 9 | 4 | 1 | 7 | 3 | 8 | 4 | 8 |
16 | 3 | 9 | 4 | 8 | 7 | 2 | 0 | 2 | 3 | 1 | ... | 8 | 0 | 0 | 3 | 8 | 5 | 2 | 0 | 3 | 8 |
17 | 2 | 8 | 6 | 3 | 2 | 9 | 4 | 4 | 2 | 8 | ... | 6 | 9 | 4 | 2 | 6 | 1 | 8 | 9 | 9 | 0 |
18 | 5 | 6 | 7 | 9 | 8 | 1 | 9 | 1 | 4 | 4 | ... | 3 | 5 | 2 | 5 | 6 | 9 | 9 | 2 | 6 | 2 |
19 | 1 | 9 | 3 | 7 | 8 | 6 | 0 | 2 | 8 | 0 | ... | 4 | 3 | 2 | 2 | 3 | 8 | 1 | 8 | 0 | 0 |
20 | 4 | 5 | 5 | 2 | 6 | 8 | 9 | 7 | 5 | 7 | ... | 3 | 5 | 0 | 8 | 0 | 4 | 3 | 2 | 5 | 1 |
21 | 2 | 4 | 8 | 1 | 9 | 7 | 1 | 4 | 6 | 7 | ... | 0 | 1 | 8 | 2 | 0 | 4 | 6 | 5 | 0 | 4 |
22 | 4 | 5 | 2 | 4 | 6 | 4 | 4 | 4 | 9 | 9 | ... | 1 | 7 | 6 | 9 | 9 | 1 | 5 | 5 | 2 | 1 |
23 | 0 | 5 | 4 | 8 | 0 | 6 | 4 | 4 | 1 | 2 | ... | 8 | 5 | 0 | 7 | 6 | 9 | 2 | 0 | 4 | 3 |
24 | 9 | 7 | 0 | 9 | 0 | 3 | 7 | 4 | 1 | 5 | ... | 3 | 7 | 8 | 2 | 2 | 1 | 9 | 2 | 2 | 4 |
25 | 4 | 1 | 9 | 5 | 4 | 5 | 0 | 4 | 8 | 9 | ... | 9 | 3 | 0 | 7 | 0 | 2 | 3 | 7 | 5 | 9 |
26 | 6 | 7 | 1 | 9 | 7 | 2 | 6 | 2 | 6 | 1 | ... | 0 | 6 | 5 | 9 | 8 | 0 | 3 | 8 | 3 | 9 |
27 | 2 | 8 | 1 | 3 | 5 | 1 | 7 | 7 | 0 | 2 | ... | 8 | 0 | 4 | 5 | 4 | 5 | 5 | 6 | 3 | 7 |
28 | 6 | 8 | 6 | 2 | 2 | 7 | 4 | 3 | 7 | 5 | ... | 1 | 7 | 9 | 2 | 4 | 5 | 9 | 5 | 3 | 2 |
29 | 3 | 0 | 3 | 0 | 0 | 9 | 5 | 4 | 3 | 2 | ... | 1 | 3 | 0 | 4 | 8 | 0 | 8 | 7 | 5 | 6 |
30 rows × 30 columns
z = x.T
z
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 8 | 3 | 2 | 9 | 2 | 6 | 2 | 2 | 3 | ... | 4 | 2 | 4 | 0 | 9 | 4 | 6 | 2 | 6 | 3 |
1 | 3 | 2 | 5 | 0 | 9 | 6 | 4 | 3 | 2 | 7 | ... | 5 | 4 | 5 | 5 | 7 | 1 | 7 | 8 | 8 | 0 |
2 | 7 | 4 | 1 | 4 | 2 | 5 | 0 | 6 | 3 | 7 | ... | 5 | 8 | 2 | 4 | 0 | 9 | 1 | 1 | 6 | 3 |
3 | 4 | 2 | 9 | 9 | 6 | 7 | 0 | 3 | 7 | 6 | ... | 2 | 1 | 4 | 8 | 9 | 5 | 9 | 3 | 2 | 0 |
4 | 6 | 6 | 1 | 6 | 0 | 8 | 2 | 8 | 5 | 2 | ... | 6 | 9 | 6 | 0 | 0 | 4 | 7 | 5 | 2 | 0 |
5 | 9 | 4 | 9 | 9 | 3 | 4 | 1 | 0 | 7 | 0 | ... | 8 | 7 | 4 | 6 | 3 | 5 | 2 | 1 | 7 | 9 |
6 | 2 | 8 | 3 | 8 | 3 | 0 | 4 | 7 | 0 | 0 | ... | 9 | 1 | 4 | 4 | 7 | 0 | 6 | 7 | 4 | 5 |
7 | 6 | 6 | 7 | 6 | 4 | 2 | 9 | 6 | 7 | 2 | ... | 7 | 4 | 4 | 4 | 4 | 4 | 2 | 7 | 3 | 4 |
8 | 7 | 1 | 6 | 8 | 6 | 9 | 5 | 1 | 3 | 5 | ... | 5 | 6 | 9 | 1 | 1 | 8 | 6 | 0 | 7 | 3 |
9 | 4 | 3 | 8 | 7 | 6 | 7 | 6 | 7 | 0 | 6 | ... | 7 | 7 | 9 | 2 | 5 | 9 | 1 | 2 | 5 | 2 |
10 | 3 | 8 | 7 | 1 | 3 | 5 | 3 | 0 | 7 | 5 | ... | 4 | 0 | 2 | 6 | 4 | 1 | 9 | 9 | 1 | 0 |
11 | 7 | 1 | 4 | 0 | 6 | 7 | 6 | 8 | 3 | 5 | ... | 7 | 5 | 0 | 5 | 1 | 0 | 5 | 8 | 3 | 5 |
12 | 7 | 9 | 1 | 6 | 2 | 8 | 7 | 8 | 5 | 5 | ... | 9 | 0 | 4 | 1 | 2 | 9 | 2 | 4 | 3 | 1 |
13 | 2 | 8 | 4 | 6 | 5 | 3 | 0 | 1 | 7 | 2 | ... | 3 | 1 | 8 | 5 | 8 | 8 | 2 | 5 | 5 | 7 |
14 | 5 | 9 | 7 | 7 | 1 | 0 | 5 | 6 | 3 | 5 | ... | 9 | 0 | 0 | 1 | 6 | 9 | 8 | 3 | 5 | 9 |
15 | 4 | 4 | 9 | 4 | 9 | 0 | 7 | 9 | 2 | 7 | ... | 7 | 4 | 2 | 1 | 6 | 8 | 6 | 9 | 0 | 4 |
16 | 1 | 1 | 8 | 2 | 8 | 9 | 4 | 2 | 8 | 1 | ... | 9 | 9 | 3 | 1 | 5 | 8 | 4 | 1 | 7 | 6 |
17 | 7 | 3 | 8 | 7 | 4 | 3 | 3 | 6 | 2 | 4 | ... | 1 | 8 | 0 | 2 | 7 | 5 | 9 | 7 | 5 | 9 |
18 | 5 | 6 | 0 | 5 | 5 | 6 | 1 | 9 | 8 | 0 | ... | 4 | 5 | 0 | 1 | 3 | 7 | 6 | 5 | 2 | 1 |
19 | 1 | 7 | 8 | 2 | 3 | 1 | 5 | 8 | 1 | 0 | ... | 8 | 0 | 7 | 3 | 7 | 0 | 8 | 4 | 8 | 7 |
20 | 4 | 2 | 6 | 0 | 9 | 2 | 5 | 3 | 1 | 4 | ... | 3 | 0 | 1 | 8 | 3 | 9 | 0 | 8 | 1 | 1 |
21 | 0 | 0 | 8 | 2 | 6 | 0 | 0 | 0 | 1 | 2 | ... | 5 | 1 | 7 | 5 | 7 | 3 | 6 | 0 | 7 | 3 |
22 | 9 | 3 | 7 | 4 | 8 | 4 | 8 | 1 | 5 | 3 | ... | 0 | 8 | 6 | 0 | 8 | 0 | 5 | 4 | 9 | 0 |
23 | 5 | 1 | 0 | 2 | 6 | 0 | 5 | 0 | 2 | 2 | ... | 8 | 2 | 9 | 7 | 2 | 7 | 9 | 5 | 2 | 4 |
24 | 8 | 7 | 7 | 0 | 0 | 7 | 2 | 4 | 8 | 0 | ... | 0 | 0 | 9 | 6 | 2 | 0 | 8 | 4 | 4 | 8 |
25 | 0 | 3 | 7 | 4 | 0 | 0 | 3 | 4 | 3 | 0 | ... | 4 | 4 | 1 | 9 | 1 | 2 | 0 | 5 | 5 | 0 |
26 | 9 | 1 | 2 | 9 | 8 | 0 | 3 | 6 | 0 | 4 | ... | 3 | 6 | 5 | 2 | 9 | 3 | 3 | 5 | 9 | 8 |
27 | 2 | 5 | 0 | 6 | 8 | 1 | 2 | 8 | 3 | 5 | ... | 2 | 5 | 5 | 0 | 2 | 7 | 8 | 6 | 5 | 7 |
28 | 6 | 5 | 7 | 6 | 3 | 1 | 9 | 8 | 0 | 2 | ... | 5 | 0 | 2 | 4 | 2 | 5 | 3 | 3 | 3 | 5 |
29 | 3 | 9 | 2 | 8 | 8 | 5 | 2 | 2 | 4 | 8 | ... | 1 | 4 | 1 | 3 | 4 | 9 | 9 | 7 | 2 | 6 |
30 rows × 30 columns
np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5 | 8 | 9 | 5 | 0 | 0 | 1 | 7 | 6 | 9 | ... | 1 | 7 | 0 | 6 | 9 | 9 | 7 | 6 | 9 | 1 |
1 | 0 | 1 | 8 | 8 | 3 | 9 | 8 | 7 | 3 | 6 | ... | 9 | 2 | 0 | 4 | 9 | 2 | 7 | 7 | 9 | 8 |
2 | 6 | 9 | 3 | 7 | 7 | 4 | 5 | 9 | 3 | 6 | ... | 7 | 7 | 1 | 1 | 3 | 0 | 8 | 6 | 4 | 5 |
3 | 6 | 2 | 5 | 7 | 8 | 4 | 4 | 7 | 7 | 4 | ... | 0 | 1 | 9 | 8 | 2 | 3 | 1 | 2 | 7 | 2 |
4 | 6 | 0 | 9 | 2 | 6 | 6 | 2 | 7 | 7 | 0 | ... | 1 | 5 | 4 | 0 | 7 | 8 | 9 | 5 | 7 | 0 |
5 | 9 | 3 | 9 | 1 | 4 | 4 | 6 | 8 | 8 | 9 | ... | 1 | 8 | 7 | 0 | 3 | 4 | 2 | 0 | 3 | 5 |
6 | 1 | 2 | 4 | 3 | 0 | 6 | 0 | 7 | 2 | 8 | ... | 4 | 3 | 3 | 6 | 7 | 3 | 5 | 3 | 2 | 4 |
7 | 4 | 0 | 3 | 3 | 8 | 3 | 5 | 6 | 7 | 5 | ... | 1 | 7 | 3 | 1 | 6 | 6 | 9 | 6 | 9 | 6 |
8 | 0 | 0 | 2 | 9 | 6 | 0 | 6 | 7 | 0 | 3 | ... | 6 | 7 | 9 | 5 | 4 | 9 | 5 | 2 | 5 | 6 |
9 | 6 | 8 | 7 | 7 | 7 | 2 | 6 | 0 | 5 | 2 | ... | 7 | 0 | 6 | 2 | 4 | 3 | 6 | 7 | 6 | 3 |
10 | 0 | 6 | 4 | 7 | 6 | 2 | 9 | 5 | 9 | 9 | ... | 4 | 9 | 3 | 9 | 1 | 2 | 5 | 4 | 0 | 8 |
11 | 2 | 3 | 9 | 9 | 4 | 4 | 8 | 2 | 1 | 6 | ... | 0 | 5 | 9 | 8 | 6 | 6 | 0 | 4 | 7 | 3 |
12 | 0 | 1 | 6 | 0 | 6 | 1 | 6 | 4 | 2 | 5 | ... | 8 | 8 | 0 | 7 | 2 | 0 | 7 | 1 | 1 | 9 |
13 | 5 | 1 | 5 | 9 | 6 | 4 | 9 | 8 | 7 | 5 | ... | 2 | 4 | 3 | 2 | 0 | 0 | 4 | 2 | 5 | 0 |
14 | 0 | 3 | 8 | 5 | 3 | 1 | 4 | 7 | 3 | 2 | ... | 8 | 5 | 5 | 7 | 5 | 9 | 1 | 3 | 9 | 3 |
15 | 3 | 3 | 6 | 1 | 3 | 0 | 5 | 0 | 5 | 2 | ... | 7 | 1 | 7 | 7 | 3 | 8 | 3 | 0 | 6 | 3 |
16 | 0 | 6 | 5 | 9 | 6 | 4 | 6 | 6 | 2 | 2 | ... | 3 | 6 | 8 | 6 | 5 | 1 | 3 | 2 | 6 | 3 |
17 | 6 | 7 | 2 | 8 | 0 | 1 | 8 | 6 | 0 | 0 | ... | 5 | 6 | 2 | 5 | 4 | 3 | 0 | 6 | 2 | 1 |
18 | 9 | 4 | 4 | 0 | 9 | 8 | 7 | 7 | 6 | 1 | ... | 7 | 9 | 9 | 7 | 1 | 1 | 4 | 6 | 5 | 6 |
19 | 4 | 1 | 1 | 5 | 1 | 2 | 6 | 2 | 3 | 3 | ... | 0 | 0 | 0 | 9 | 8 | 5 | 9 | 3 | 4 | 0 |
20 | 9 | 8 | 6 | 3 | 9 | 9 | 0 | 8 | 1 | 6 | ... | 2 | 9 | 0 | 1 | 3 | 9 | 4 | 8 | 8 | 8 |
21 | 2 | 8 | 6 | 4 | 9 | 0 | 5 | 5 | 6 | 1 | ... | 6 | 7 | 5 | 6 | 8 | 7 | 4 | 2 | 4 | 0 |
22 | 0 | 3 | 5 | 9 | 0 | 3 | 6 | 5 | 1 | 1 | ... | 6 | 2 | 5 | 3 | 9 | 3 | 9 | 5 | 1 | 9 |
23 | 7 | 7 | 0 | 8 | 6 | 1 | 2 | 0 | 4 | 4 | ... | 1 | 9 | 6 | 0 | 2 | 8 | 3 | 7 | 2 | 5 |
24 | 6 | 0 | 4 | 2 | 3 | 1 | 0 | 5 | 7 | 0 | ... | 1 | 1 | 2 | 7 | 5 | 2 | 9 | 4 | 7 | 3 |
25 | 5 | 0 | 2 | 1 | 4 | 9 | 4 | 6 | 9 | 3 | ... | 5 | 5 | 3 | 5 | 9 | 2 | 7 | 4 | 1 | 6 |
26 | 9 | 8 | 1 | 8 | 1 | 6 | 2 | 6 | 1 | 8 | ... | 2 | 5 | 1 | 2 | 5 | 3 | 3 | 6 | 1 | 8 |
27 | 1 | 8 | 6 | 4 | 6 | 9 | 5 | 4 | 7 | 2 | ... | 9 | 3 | 1 | 5 | 1 | 1 | 7 | 1 | 2 | 6 |
28 | 0 | 7 | 7 | 4 | 3 | 2 | 7 | 8 | 5 | 2 | ... | 0 | 2 | 8 | 3 | 7 | 3 | 9 | 2 | 3 | 8 |
29 | 8 | 0 | 2 | 6 | 8 | 3 | 6 | 4 | 9 | 7 | ... | 6 | 7 | 8 | 5 | 7 | 2 | 5 | 3 | 4 | 5 |
30 rows × 30 columns
x.dot(y)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 616 | 560 | 723 | 739 | 612 | 457 | 681 | 799 | 575 | 590 | ... | 523 | 739 | 613 | 580 | 668 | 602 | 733 | 585 | 657 | 700 |
1 | 520 | 438 | 691 | 600 | 612 | 455 | 666 | 764 | 707 | 592 | ... | 555 | 681 | 503 | 679 | 641 | 506 | 779 | 494 | 633 | 590 |
2 | 557 | 570 | 786 | 807 | 690 | 469 | 804 | 828 | 704 | 573 | ... | 563 | 675 | 712 | 758 | 793 | 672 | 754 | 550 | 756 | 638 |
3 | 605 | 507 | 664 | 701 | 660 | 496 | 698 | 806 | 651 | 575 | ... | 582 | 685 | 668 | 586 | 629 | 534 | 678 | 484 | 591 | 626 |
4 | 599 | 681 | 753 | 873 | 721 | 563 | 754 | 770 | 620 | 654 | ... | 633 | 747 | 661 | 677 | 726 | 649 | 716 | 610 | 735 | 706 |
5 | 422 | 354 | 602 | 627 | 613 | 396 | 617 | 627 | 489 | 423 | ... | 456 | 572 | 559 | 537 | 499 | 384 | 589 | 436 | 574 | 507 |
6 | 359 | 446 | 599 | 599 | 481 | 357 | 577 | 572 | 451 | 464 | ... | 449 | 550 | 495 | 532 | 633 | 554 | 663 | 476 | 565 | 602 |
7 | 531 | 520 | 698 | 590 | 607 | 537 | 665 | 696 | 571 | 472 | ... | 576 | 588 | 551 | 665 | 652 | 527 | 742 | 528 | 650 | 599 |
8 | 449 | 322 | 547 | 533 | 593 | 399 | 584 | 638 | 587 | 424 | ... | 402 | 596 | 523 | 523 | 447 | 362 | 561 | 386 | 529 | 484 |
9 | 373 | 433 | 525 | 601 | 522 | 345 | 551 | 521 | 434 | 447 | ... | 508 | 498 | 438 | 478 | 459 | 418 | 488 | 407 | 503 | 496 |
10 | 500 | 427 | 574 | 607 | 667 | 477 | 652 | 656 | 615 | 477 | ... | 622 | 702 | 531 | 610 | 558 | 532 | 598 | 471 | 582 | 561 |
11 | 664 | 694 | 772 | 841 | 779 | 574 | 730 | 810 | 711 | 608 | ... | 591 | 760 | 616 | 638 | 721 | 676 | 846 | 678 | 754 | 708 |
12 | 545 | 547 | 687 | 701 | 721 | 576 | 689 | 724 | 710 | 532 | ... | 674 | 684 | 648 | 694 | 710 | 564 | 757 | 571 | 671 | 656 |
13 | 574 | 586 | 723 | 750 | 691 | 494 | 696 | 787 | 667 | 523 | ... | 618 | 681 | 568 | 682 | 715 | 644 | 756 | 557 | 690 | 604 |
14 | 502 | 382 | 645 | 557 | 570 | 403 | 538 | 677 | 500 | 501 | ... | 369 | 650 | 507 | 576 | 546 | 531 | 554 | 437 | 616 | 463 |
15 | 510 | 505 | 736 | 651 | 649 | 510 | 719 | 733 | 694 | 557 | ... | 605 | 717 | 574 | 642 | 678 | 576 | 755 | 455 | 598 | 654 |
16 | 567 | 376 | 614 | 612 | 643 | 514 | 598 | 724 | 547 | 464 | ... | 456 | 639 | 520 | 560 | 569 | 442 | 596 | 517 | 659 | 532 |
17 | 626 | 716 | 828 | 765 | 740 | 603 | 809 | 852 | 692 | 591 | ... | 664 | 716 | 655 | 721 | 742 | 612 | 819 | 593 | 744 | 712 |
18 | 600 | 559 | 667 | 664 | 641 | 556 | 624 | 815 | 638 | 564 | ... | 581 | 701 | 559 | 677 | 710 | 554 | 748 | 597 | 614 | 657 |
19 | 445 | 431 | 661 | 681 | 641 | 552 | 690 | 719 | 602 | 474 | ... | 515 | 637 | 576 | 620 | 572 | 512 | 599 | 455 | 622 | 538 |
20 | 523 | 569 | 784 | 725 | 713 | 501 | 740 | 772 | 638 | 640 | ... | 589 | 775 | 664 | 686 | 726 | 672 | 747 | 548 | 723 | 645 |
21 | 487 | 465 | 553 | 639 | 517 | 449 | 592 | 609 | 454 | 398 | ... | 492 | 567 | 534 | 404 | 554 | 417 | 561 | 466 | 498 | 492 |
22 | 479 | 449 | 574 | 686 | 583 | 377 | 566 | 614 | 563 | 455 | ... | 453 | 539 | 491 | 501 | 596 | 520 | 722 | 478 | 565 | 501 |
23 | 483 | 386 | 476 | 526 | 550 | 426 | 492 | 585 | 536 | 482 | ... | 322 | 541 | 438 | 456 | 487 | 408 | 502 | 426 | 474 | 481 |
24 | 523 | 551 | 658 | 767 | 537 | 444 | 663 | 731 | 576 | 577 | ... | 522 | 590 | 525 | 664 | 691 | 548 | 635 | 526 | 641 | 538 |
25 | 652 | 656 | 738 | 753 | 853 | 508 | 752 | 815 | 669 | 576 | ... | 694 | 833 | 693 | 606 | 575 | 616 | 704 | 559 | 728 | 672 |
26 | 578 | 577 | 744 | 856 | 699 | 497 | 779 | 800 | 733 | 587 | ... | 630 | 754 | 704 | 834 | 760 | 680 | 765 | 592 | 731 | 629 |
27 | 554 | 494 | 665 | 689 | 630 | 574 | 695 | 703 | 636 | 599 | ... | 554 | 685 | 532 | 658 | 649 | 554 | 693 | 577 | 634 | 668 |
28 | 498 | 552 | 659 | 784 | 552 | 492 | 690 | 775 | 544 | 551 | ... | 567 | 636 | 518 | 599 | 742 | 521 | 733 | 533 | 605 | 604 |
29 | 513 | 491 | 563 | 642 | 477 | 367 | 589 | 647 | 516 | 484 | ... | 428 | 574 | 504 | 548 | 553 | 483 | 540 | 407 | 547 | 455 |
30 rows × 30 columns
%timeit x.dot(y)
218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(x, y)
81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x1 = np.array(x)
x1
y1 = np.array(y)
y1
%timeit x1.dot(y1)
22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x1, y1)
22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(x.values, y.values)
42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
# convert the NumPy rows into plain Python lists of ints
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)

# pure-Python triple-loop matrix multiplication, for timing comparison
def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res
%timeit f(x3, y3)
4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In general, pure computation runs faster in NumPy: NumPy is geared toward numerical computation, while Pandas is geared toward data handling.
(3) Broadcasting
np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x
A | B | C | |
---|---|---|---|
0 | 6 | 3 | 7 |
1 | 4 | 6 | 9 |
2 | 2 | 6 | 7 |
x.iloc[0]
A 6
B 3
C 7
Name: 0, dtype: int32
x/x.iloc[0]
A | B | C | |
---|---|---|---|
0 | 1.000000 | 1.0 | 1.000000 |
1 | 0.666667 | 2.0 | 1.285714 |
2 | 0.333333 | 2.0 | 1.000000 |
x.A
0 6
1 4
2 2
Name: A, dtype: int32
x.div(x.A, axis=0) # add sub div mul
A | B | C | |
---|---|---|---|
0 | 1.0 | 0.5 | 1.166667 |
1 | 1.0 | 1.5 | 2.250000 |
2 | 1.0 | 3.0 | 3.500000 |
x.div(x.iloc[0], axis=1)
A | B | C | |
---|---|---|---|
0 | 1.000000 | 1.0 | 1.000000 |
1 | 0.666667 | 2.0 | 1.285714 |
2 | 0.333333 | 2.0 | 1.000000 |
3、新的用法
(1)索引对齐
A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A
A | B | |
---|---|---|
0 | 3 | 7 |
1 | 2 | 1 |
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B
A | B | C | |
---|---|---|---|
0 | 7 | 5 | 1 |
1 | 4 | 0 | 9 |
2 | 5 | 8 | 0 |
A+B
A | B | C | |
---|---|---|---|
0 | 10.0 | 12.0 | NaN |
1 | 6.0 | 1.0 | NaN |
2 | NaN | NaN | NaN |
A.add(B, fill_value=0)
A | B | C | |
---|---|---|---|
0 | 10.0 | 12.0 | 1.0 |
1 | 6.0 | 1.0 | 9.0 |
2 | 5.0 | 8.0 | 0.0 |
A*B
A | B | C | |
---|---|---|---|
0 | 21.0 | 35.0 | NaN |
1 | 8.0 | 0.0 | NaN |
2 | NaN | NaN | NaN |
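与 add 类似,mul/sub/div 也接受 fill_value 参数:它在索引对齐之后、运算之前,给"只有一侧缺失"的位置补上该值。下面是一个假设性的小例子:

```python
import pandas as pd

A = pd.DataFrame([[1, 2]], columns=list("AB"))      # 只有 A、B 两列
B = pd.DataFrame([[3, 4, 5]], columns=list("ABC"))  # 有 A、B、C 三列

# C 列在 A 中缺失,先用 1 填充,再相乘
res = A.mul(B, fill_value=1)
print(res)
```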
(2)统计相关
y = np.random.randint(3, size=20)
y
array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])
np.unique(y)
array([0, 1, 2])
用Counter方法统计数据
from collections import Counter
Counter(y)
Counter({2: 11, 1: 5, 0: 4})
y1 = pd.DataFrame(y, columns=["A"])
y1
A | |
---|---|
0 | 2 |
1 | 2 |
2 | 2 |
3 | 1 |
4 | 2 |
5 | 1 |
6 | 1 |
7 | 2 |
8 | 1 |
9 | 2 |
10 | 2 |
11 | 0 |
12 | 2 |
13 | 0 |
14 | 2 |
15 | 2 |
16 | 0 |
17 | 0 |
18 | 2 |
19 | 1 |
np.unique(y1)
Pandas 中有 value_counts 方法,可以直接统计各取值出现的次数
y1["A"].value_counts()
2 11
1 5
0 4
Name: A, dtype: int64
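value_counts 还可以通过 normalize=True 直接返回各取值的占比。下面是一个独立的小示例(数据为演示用的假设数据):

```python
import pandas as pd

s = pd.Series([2, 2, 2, 1, 0])
vc = s.value_counts()                    # 计数:2 出现 3 次
frac = s.value_counts(normalize=True)    # 占比:2 占 0.6
print(vc)
print(frac)
```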
population_dict = {"BeiJing": 2154,
"ShangHai": 2424,
"ShenZhen": 1303,
"HangZhou": 981 }
population = pd.Series(population_dict)
GDP_dict = {"BeiJing": 30320,
"ShangHai": 32680,
"ShenZhen": 24222,
"HangZhou": 13468 }
GDP = pd.Series(GDP_dict)
city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info
population | GDP | |
---|---|---|
BeiJing | 2154 | 30320 |
ShangHai | 2424 | 32680 |
ShenZhen | 1303 | 24222 |
HangZhou | 981 | 13468 |
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info
population | GDP | per_GDP | |
---|---|---|---|
BeiJing | 2154 | 30320 | 14.076137 |
ShangHai | 2424 | 32680 | 13.481848 |
ShenZhen | 1303 | 24222 | 18.589409 |
HangZhou | 981 | 13468 | 13.728848 |
递增排序
city_info.sort_values(by="per_GDP")
population | GDP | per_GDP | |
---|---|---|---|
ShangHai | 2424 | 32680 | 13.481848 |
HangZhou | 981 | 13468 | 13.728848 |
BeiJing | 2154 | 30320 | 14.076137 |
ShenZhen | 1303 | 24222 | 18.589409 |
递减排序
city_info.sort_values(by="per_GDP", ascending=False)
population | GDP | per_GDP | |
---|---|---|---|
ShenZhen | 1303 | 24222 | 18.589409 |
BeiJing | 2154 | 30320 | 14.076137 |
HangZhou | 981 | 13468 | 13.728848 |
ShangHai | 2424 | 32680 | 13.481848 |
按轴进行排序
data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data
C | B | A | D | |
---|---|---|---|---|
2 | 3 | 13 | 17 | 8 |
1 | 1 | 19 | 14 | 6 |
0 | 11 | 7 | 14 | 2 |
行排序
data.sort_index()
C | B | A | D | |
---|---|---|---|---|
0 | 11 | 7 | 14 | 2 |
1 | 1 | 19 | 14 | 6 |
2 | 3 | 13 | 17 | 8 |
列排序
data.sort_index(axis=1)
A | B | C | D | |
---|---|---|---|---|
2 | 17 | 13 | 3 | 8 |
1 | 14 | 19 | 1 | 6 |
0 | 14 | 7 | 11 | 2 |
data.sort_index(axis=1, ascending=False)
D | C | B | A | |
---|---|---|---|---|
2 | 8 | 3 | 13 | 17 |
1 | 6 | 1 | 19 | 14 |
0 | 2 | 11 | 7 | 14 |
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
1 | 13.113252 | 6.774559 | 2.874553 | 5.527044 |
2 | -2.036341 | -4.333177 | 5.094802 | -0.152567 |
3 | -3.386712 | -1.522365 | -2.522209 | 2.537716 |
4 | 4.328491 | 5.550994 | 5.577329 | 5.019991 |
5 | 1.171336 | -0.493910 | -4.032613 | 6.398588 |
非空个数
df.count()
A 6
B 6
C 6
D 6
dtype: int64
求和
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum(axis=1)
0 7.947086
1 28.289408
2 -1.427283
3 -4.893571
4 20.476806
5 3.043402
dtype: float64
最大值 最小值
df.min()
A -3.386712
B -4.333177
C -4.032613
D -0.152567
dtype: float64
df.max(axis=1)
0 6.367969
1 13.113252
2 5.094802
3 2.537716
4 5.577329
5 6.398588
dtype: float64
df
A | B | C | D | |
---|---|---|---|---|
0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
1 | 13.113252 | 6.774559 | 2.874553 | 5.527044 |
2 | -2.036341 | -4.333177 | 5.094802 | -0.152567 |
3 | -3.386712 | -1.522365 | -2.522209 | 2.537716 |
4 | 4.328491 | 5.550994 | 5.577329 | 5.019991 |
5 | 1.171336 | -0.493910 | -4.032613 | 6.398588 |
df.idxmax()
A 1
B 1
C 4
D 5
dtype: int64
均值
df.mean()
A 2.378704
B 1.588916
C 0.655231
D 4.283124
dtype: float64
方差
df.var()
A 34.980702
B 19.110656
C 18.948144
D 6.726776
dtype: float64
标准差
df.std()
A 5.914449
B 4.371574
C 4.352947
D 2.593603
dtype: float64
中位数
df.median()
A 1.126767
B 1.531743
C 0.176172
D 5.273518
dtype: float64
众数
data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data
A | B | |
---|---|---|
0 | 4 | 2 |
1 | 3 | 2 |
2 | 2 | 0 |
3 | 2 | 4 |
4 | 2 | 0 |
5 | 4 | 1 |
6 | 2 | 0 |
7 | 1 | 1 |
8 | 3 | 4 |
9 | 2 | 0 |
data.mode()
A | B | |
---|---|---|
0 | 2 | 0 |
75%分位数
df.quantile(0.75)
A 3.539202
B 5.052594
C 4.539740
D 6.157738
Name: 0.75, dtype: float64
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 2.378704 | 1.588916 | 0.655231 | 4.283124 |
std | 5.914449 | 4.371574 | 4.352947 | 2.593603 |
min | -3.386712 | -4.333177 | -4.032613 | -0.152567 |
25% | -1.256706 | -1.265251 | -2.925910 | 3.158284 |
50% | 1.126767 | 1.531743 | 0.176172 | 5.273518 |
75% | 3.539202 | 5.052594 | 4.539740 | 6.157738 |
max | 13.113252 | 6.774559 | 5.577329 | 6.398588 |
data_2 = pd.DataFrame([["a", "a", "c", "d"],
["c", "a", "c", "b"],
["a", "a", "d", "c"]], columns=list("ABCD"))
data_2
A | B | C | D | |
---|---|---|---|---|
0 | a | a | c | d |
1 | c | a | c | b |
2 | a | a | d | c |
data_2.describe()
A | B | C | D | |
---|---|---|---|---|
count | 3 | 3 | 3 | 3 |
unique | 2 | 1 | 2 | 3 |
top | a | a | c | d |
freq | 2 | 3 | 2 | 1 |
相关性系数和协方差
df.corr()
A | B | C | D | |
---|---|---|---|---|
A | 1.000000 | 0.831063 | 0.331060 | 0.510821 |
B | 0.831063 | 1.000000 | 0.179244 | 0.719112 |
C | 0.331060 | 0.179244 | 1.000000 | -0.450365 |
D | 0.510821 | 0.719112 | -0.450365 | 1.000000 |
df.corrwith(df["A"])
A 1.000000
B 0.831063
C 0.331060
D 0.510821
dtype: float64
自定义输出
apply(method) 的用法:默认对每一列应用 method 方法进行相应的操作
df
A | B | C | D | |
---|---|---|---|---|
0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
1 | 13.113252 | 6.774559 | 2.874553 | 5.527044 |
2 | -2.036341 | -4.333177 | 5.094802 | -0.152567 |
3 | -3.386712 | -1.522365 | -2.522209 | 2.537716 |
4 | 4.328491 | 5.550994 | 5.577329 | 5.019991 |
5 | 1.171336 | -0.493910 | -4.032613 | 6.398588 |
df.apply(np.cumsum)
A | B | C | D | |
---|---|---|---|---|
0 | 1.082198 | 3.557396 | -3.060476 | 6.367969 |
1 | 14.195450 | 10.331955 | -0.185923 | 11.895013 |
2 | 12.159109 | 5.998778 | 4.908878 | 11.742447 |
3 | 8.772397 | 4.476413 | 2.386669 | 14.280162 |
4 | 13.100888 | 10.027406 | 7.963999 | 19.300153 |
5 | 14.272224 | 9.533497 | 3.931385 | 25.698741 |
df.apply(np.cumsum, axis=1)
A | B | C | D | |
---|---|---|---|---|
0 | 1.082198 | 4.639594 | 1.579117 | 7.947086 |
1 | 13.113252 | 19.887811 | 22.762364 | 28.289408 |
2 | -2.036341 | -6.369518 | -1.274717 | -1.427283 |
3 | -3.386712 | -4.909077 | -7.431287 | -4.893571 |
4 | 4.328491 | 9.879485 | 15.456814 | 20.476806 |
5 | 1.171336 | 0.677427 | -3.355186 | 3.043402 |
df.apply(sum)
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.sum()
A 14.272224
B 9.533497
C 3.931385
D 25.698741
dtype: float64
df.apply(lambda x: x.max()-x.min())
A 16.499965
B 11.107736
C 9.609942
D 6.551155
dtype: float64
def my_describe(x):
return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)
A | B | C | D | |
---|---|---|---|---|
Count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 2.378704 | 1.588916 | 0.655231 | 4.283124 |
max | 13.113252 | 6.774559 | 5.577329 | 6.398588 |
idxmin | 3.000000 | 2.000000 | 5.000000 | 2.000000 |
std | 5.914449 | 4.371574 | 4.352947 | 2.593603 |
1、发现缺失值
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[1, np.nan, 2],
[np.nan, 3, 4],
[5, 6, None]]), columns=["A", "B", "C"])
data
A | B | C | |
---|---|---|---|
0 | 1 | NaN | 2 |
1 | NaN | 3 | 4 |
2 | 5 | 6 | None |
注意:数据中含有 None、字符串等时,整列的数据类型会变为 object,它比 int 和 float 更消耗资源
np.nan 是一个特殊的浮点数,其类型是浮点类型,所以表示缺失值时最好使用 np.nan。
data.dtypes
A object
B object
C object
dtype: object
data.isnull()
A | B | C | |
---|---|---|---|
0 | False | True | False |
1 | True | False | False |
2 | False | False | True |
data.notnull()
A | B | C | |
---|---|---|---|
0 | True | False | True |
1 | False | True | True |
2 | True | True | False |
2、删除缺失值
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | 3.0 |
1 | NaN | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | NaN | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
注意:np.nan是一种特殊的浮点数
data.dtypes
A float64
B float64
C float64
D float64
dtype: object
(1)删除整行
data.dropna()
A | B | C | D | |
---|---|---|---|---|
3 | 10.0 | 11.0 | 12.0 | 13.0 |
(2)删除整列
data.dropna(axis="columns")
D | |
---|---|
0 | 3.0 |
1 | 6.0 |
2 | 9.0 |
3 | 13.0 |
data["D"] = np.nan
data
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | NaN |
1 | NaN | 4.0 | 5.0 | NaN |
2 | 7.0 | 8.0 | NaN | NaN |
3 | 10.0 | 11.0 | 12.0 | NaN |
data.dropna(axis="columns", how="all")
A | B | C | |
---|---|---|---|
0 | 1.0 | NaN | 2.0 |
1 | NaN | 4.0 | 5.0 |
2 | 7.0 | 8.0 | NaN |
3 | 10.0 | 11.0 | 12.0 |
how="all" 表示整行/整列都是缺失值时才删除;默认的 how="any" 表示只要含有缺失值就删除。
data.dropna(axis="columns", how="any")
0 |
1 |
2 |
3 |
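dropna 还支持 thresh 参数:只保留至少含有 thresh 个非缺失值的行(或列)。下面用一份假设的小数据演示:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, np.nan, 2.0],
                     [np.nan, np.nan, 4.0],
                     [5.0, 6.0, 7.0]], columns=list("ABC"))

# 只保留至少有 2 个非缺失值的行:第 1 行只有 1 个有效值,被删除
kept = data.dropna(thresh=2)
print(kept)
```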
data.loc[3] = np.nan
data
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | NaN |
1 | NaN | 4.0 | 5.0 | NaN |
2 | 7.0 | 8.0 | NaN | NaN |
3 | NaN | NaN | NaN | NaN |
data.dropna(how="all")
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | NaN |
1 | NaN | 4.0 | 5.0 | NaN |
2 | 7.0 | 8.0 | NaN | NaN |
3、填充缺失值
data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
[np.nan, 4, 5, 6],
[7, 8, np.nan, 9],
[10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | NaN | 2.0 | 3.0 |
1 | NaN | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | NaN | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
data.fillna(value=5)
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 5.0 | 2.0 | 3.0 |
1 | 5.0 | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 5.0 | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
fill = data.mean()
fill
A 6.000000
B 7.666667
C 6.333333
D 7.750000
dtype: float64
data.fillna(value=fill)
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 7.666667 | 2.000000 | 3.0 |
1 | 6.0 | 4.000000 | 5.000000 | 6.0 |
2 | 7.0 | 8.000000 | 6.333333 | 9.0 |
3 | 10.0 | 11.000000 | 12.000000 | 13.0 |
若想用全部数据的平均值来填充,可以先用 stack() 把数据摊平成一维,求出均值后再填充。
fill = data.stack().mean()
fill
7.0
data.fillna(value=fill)
A | B | C | D | |
---|---|---|---|---|
0 | 1.0 | 7.0 | 2.0 | 3.0 |
1 | 7.0 | 4.0 | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 7.0 | 9.0 |
3 | 10.0 | 11.0 | 12.0 | 13.0 |
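除了用固定值或均值填充,也可以用相邻的有效值来填充:ffill 用前一个有效值(前向填充),bfill 用后一个有效值(后向填充)。这里给一个独立的小示例:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()    # 前向填充:用前一个有效值
backward = s.bfill()   # 后向填充:用后一个有效值
print(forward)
print(backward)
```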
import pandas as pd
import numpy as np
def make_df(cols, ind):
"一个简单的DataFrame"
data = {c: [str(c)+str(i) for i in ind] for c in cols}
return pd.DataFrame(data, ind)
make_df("ABC", range(3))
A | B | C | |
---|---|---|---|
0 | A0 | B0 | C0 |
1 | A1 | B1 | C1 |
2 | A2 | B2 | C2 |
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
pd.concat([df_1, df_2])
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
3 | A3 | B3 |
4 | A4 | B4 |
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
pd.concat([df_3, df_4], axis=1)
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
行重叠
df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)
A B
1 A1 B1
2 A2 B2
A B
1 A1 B1
2 A2 B2
pd.concat([df_5, df_6])
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
1 | A1 | B1 |
2 | A2 | B2 |
pd.concat([df_5, df_6],ignore_index=True)
A | B | |
---|---|---|
0 | A1 | B1 |
1 | A2 | B2 |
2 | A1 | B1 |
3 | A2 | B2 |
列重叠
df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
1 B1 C1 D1
2 B2 C2 D2
pd.concat([df_7, df_8], axis=1)
A | B | C | B | C | D | |
---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | B2 | C2 | D2 |
pd.concat([df_7, df_8],axis=1, ignore_index=True)
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | B2 | C2 | D2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
B C
1 B1 C1
2 B2 C2
pd.merge(df_9, df_10)
A | B | C | |
---|---|---|---|
0 | A1 | B1 | C1 |
1 | A2 | B2 | C2 |
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)
A B
1 A1 B1
2 A2 B2
C B
2 C2 B2
1 C1 B1
pd.merge(df_9, df_10)
A | B | C | |
---|---|---|---|
0 | A1 | B1 | C1 |
1 | A2 | B2 | C2 |
【例】 合并城市信息
population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
"pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population
city | pop | |
---|---|---|
0 | BeiJing | 2154 |
1 | HangZhou | 981 |
2 | ShenZhen | 1303 |
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
"GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP
city | GDP | |
---|---|---|
0 | BeiJing | 30320 |
1 | ShangHai | 32680 |
2 | HangZhou | 13468 |
city_info = pd.merge(population, GDP)
city_info
city | pop | GDP | |
---|---|---|---|
0 | BeiJing | 2154 | 30320 |
1 | HangZhou | 981 | 13468 |
这里 how="outer" 表示按键求并集(默认的 how="inner" 求交集)
city_info = pd.merge(population, GDP, how="outer")
city_info
city | pop | GDP | |
---|---|---|---|
0 | BeiJing | 2154.0 | 30320.0 |
1 | HangZhou | 981.0 | 13468.0 |
2 | ShenZhen | 1303.0 | NaN |
3 | ShangHai | NaN | 32680.0 |
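how 除了 "outer" 还可以取 "left"/"right",分别以左表或右表的键为准。下面重建上文的城市数据,演示 how="left" 的效果:

```python
import pandas as pd

population = pd.DataFrame({"city": ["BeiJing", "HangZhou", "ShenZhen"],
                           "pop": [2154, 981, 1303]})
GDP = pd.DataFrame({"city": ["BeiJing", "ShangHai", "HangZhou"],
                    "GDP": [30320, 32680, 13468]})

# how="left":保留左表 population 的全部行,GDP 中没有的城市记为 NaN
left_join = pd.merge(population, GDP, how="left")
print(left_join)
```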
df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
"data1": range(6),
"data2": np.random.randint(0, 10, size=6)})
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 1 |
1 | B | 1 | 4 |
2 | C | 2 | 9 |
3 | C | 3 | 9 |
4 | B | 4 | 1 |
5 | A | 5 | 9 |
(1)分组
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>
这说明分组已经完成,返回的是一个 DataFrameGroupBy 对象;只有对它应用聚合等方法之后,才会真正计算并显示结果。
df.groupby("key").sum()
data1 | data2 | |
---|---|---|
key | ||
A | 5 | 10 |
B | 5 | 6 |
C | 5 | 11 |
df.groupby("key").mean()
data1 | data2 | |
---|---|---|
key | ||
A | 2.5 | 5.0 |
B | 2.5 | 3.0 |
C | 2.5 | 5.5 |
可以打印看看这是什么东西:
for i in df.groupby("key"):
print(str(i))
('A', key data1 data2
0 A 0 2
5 A 5 8)
('B', key data1 data2
1 B 1 2
4 B 4 4)
('C', key data1 data2
2 C 2 8
3 C 3 3)
df.groupby("key")["data2"].sum()
key
A 10
B 6
C 11
Name: data2, dtype: int32
for data, group in df.groupby("key"):
print("{0:5} shape={1}".format(data, group.shape))
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
df.groupby("key")["data1"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
key | ||||||||
A | 2.0 | 2.5 | 3.535534 | 0.0 | 1.25 | 2.5 | 3.75 | 5.0 |
B | 2.0 | 2.5 | 2.121320 | 1.0 | 1.75 | 2.5 | 3.25 | 4.0 |
C | 2.0 | 2.5 | 0.707107 | 2.0 | 2.25 | 2.5 | 2.75 | 3.0 |
df.groupby("key").aggregate(["min", "median", "max"])
data1 | data2 | |||||
---|---|---|---|---|---|---|
min | median | max | min | median | max | |
key | ||||||
A | 0 | 2.5 | 5 | 2 | 5.0 | 8 |
B | 1 | 2.5 | 4 | 2 | 3.0 | 4 |
C | 2 | 2.5 | 3 | 3 | 5.5 | 8 |
def filter_func(x):
return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A 4.242641
B 1.414214
C 3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
5 | A | 5 | 8 |
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
1 | B | 1 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
4 | B | 4 | 4 |
5 | A | 5 | 8 |
df.groupby("key").transform(lambda x: x-x.mean())
data1 | data2 | |
---|---|---|
0 | -2.5 | -3.0 |
1 | -1.5 | -1.0 |
2 | -0.5 | 2.5 |
3 | 0.5 | -2.5 |
4 | 1.5 | 1.0 |
5 | 2.5 | 3.0 |
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 1 |
1 | B | 1 | 4 |
2 | C | 2 | 9 |
3 | C | 3 | 9 |
4 | B | 4 | 1 |
5 | A | 5 | 9 |
df.groupby("key").apply(lambda x: x-x.mean())
data1 | data2 | |
---|---|---|
0 | -2.5 | -4.0 |
1 | -1.5 | 1.5 |
2 | -0.5 | 0.0 |
3 | 0.5 | 0.0 |
4 | 1.5 | -1.5 |
5 | 2.5 | 4.0 |
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
1 | B | 1 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
4 | B | 4 | 4 |
5 | A | 5 | 8 |
def norm_by_data2(x):
x["data1"] /= x["data2"].sum()
return x
df.groupby("key").apply(norm_by_data2)
key | data1 | data2 | |
---|---|---|---|
0 | A | 0.000000 | 2 |
1 | B | 0.166667 | 2 |
2 | C | 0.181818 | 8 |
3 | C | 0.272727 | 3 |
4 | B | 0.666667 | 4 |
5 | A | 0.500000 | 8 |
这里的L相当于一个新的标签替代原来的行标签。
L = [0, 1, 0, 1, 2, 0]
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
1 | B | 1 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
4 | B | 4 | 4 |
5 | A | 5 | 8 |
df.groupby(L).sum()
data1 | data2 | |
---|---|---|
0 | 7 | 18 |
1 | 4 | 5 |
2 | 4 | 4 |
df2 = df.set_index("key")
df2
data1 | data2 | |
---|---|---|
key | ||
A | 0 | 2 |
B | 1 | 2 |
C | 2 | 8 |
C | 3 | 3 |
B | 4 | 4 |
A | 5 | 8 |
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
data1 | data2 | |
---|---|---|
constant | 10 | 17 |
first | 5 | 10 |
df2.groupby(str.lower).mean()
data1 | data2 | |
---|---|---|
a | 2.5 | 5.0 |
b | 2.5 | 3.0 |
c | 2.5 | 5.5 |
只有这两个分组键都相同的行,才会被分到同一个组。
df2.groupby([str.lower, mapping]).mean()
data1 | data2 | ||
---|---|---|---|
a | first | 2.5 | 5.0 |
b | constant | 2.5 | 3.0 |
c | constant | 2.5 | 5.5 |
【例1】 行星观测数据处理
import seaborn as sns
planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.describe()
number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|
count | 1035.000000 | 992.000000 | 513.000000 | 808.000000 | 1035.000000 |
mean | 1.785507 | 2002.917596 | 2.638161 | 264.069282 | 2009.070531 |
std | 1.240976 | 26014.728304 | 3.818617 | 733.116493 | 3.972567 |
min | 1.000000 | 0.090706 | 0.003600 | 1.350000 | 1989.000000 |
25% | 1.000000 | 5.442540 | 0.229000 | 32.560000 | 2007.000000 |
50% | 1.000000 | 39.979500 | 1.260000 | 55.250000 | 2010.000000 |
75% | 2.000000 | 526.005000 | 3.040000 | 178.500000 | 2012.000000 |
max | 7.000000 | 730000.000000 | 25.000000 | 8500.000000 | 2014.000000 |
planets.head()
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
decade = 10 * (planets["year"] // 10)
decade.head()
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
planets.head()
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.groupby(["method", decade]).sum()
number | orbital_period | mass | distance | year | ||
---|---|---|---|---|---|---|
method | decade | |||||
Astrometry | 2010s | 2 | 1.262360e+03 | 0.00000 | 35.75 | 4023 |
Eclipse Timing Variations | 2000s | 5 | 1.930800e+04 | 6.05000 | 261.44 | 6025 |
2010s | 10 | 2.345680e+04 | 4.20000 | 1000.00 | 12065 | |
Imaging | 2000s | 29 | 1.350935e+06 | 0.00000 | 956.83 | 40139 |
2010s | 21 | 6.803750e+04 | 0.00000 | 1210.08 | 36208 | |
Microlensing | 2000s | 12 | 1.732500e+04 | 0.00000 | 0.00 | 20070 |
2010s | 15 | 4.750000e+03 | 0.00000 | 41440.00 | 26155 | |
Orbital Brightness Modulation | 2010s | 5 | 2.127920e+00 | 0.00000 | 2360.00 | 6035 |
Pulsar Timing | 1990s | 9 | 1.900153e+02 | 0.00000 | 0.00 | 5978 |
2000s | 1 | 3.652500e+04 | 0.00000 | 0.00 | 2003 | |
2010s | 1 | 9.070629e-02 | 0.00000 | 1200.00 | 2011 | |
Pulsation Timing Variations | 2000s | 1 | 1.170000e+03 | 0.00000 | 0.00 | 2007 |
Radial Velocity | 1980s | 1 | 8.388800e+01 | 11.68000 | 40.57 | 1989 |
1990s | 52 | 1.091561e+04 | 68.17820 | 723.71 | 55943 | |
2000s | 475 | 2.633526e+05 | 945.31928 | 15201.16 | 619775 | |
2010s | 424 | 1.809630e+05 | 316.47890 | 11382.67 | 432451 | |
Transit | 2000s | 64 | 2.897102e+02 | 0.00000 | 31823.31 | 124462 |
2010s | 712 | 8.087813e+03 | 1.47000 | 102419.46 | 673999 | |
Transit Timing Variations | 2010s | 9 | 2.393505e+02 | 0.00000 | 3313.00 | 8050 |
这里使用两个中括号 [[]],取出来的是 DataFrame 类型的数据;而一个中括号 [] 取出来的是 Series 类型的数据,前者显示更美观一点。
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
number | ||||
---|---|---|---|---|
decade | 1980s | 1990s | 2000s | 2010s |
method | ||||
Astrometry | 0.0 | 0.0 | 0.0 | 2.0 |
Eclipse Timing Variations | 0.0 | 0.0 | 5.0 | 10.0 |
Imaging | 0.0 | 0.0 | 29.0 | 21.0 |
Microlensing | 0.0 | 0.0 | 12.0 | 15.0 |
Orbital Brightness Modulation | 0.0 | 0.0 | 0.0 | 5.0 |
Pulsar Timing | 0.0 | 9.0 | 1.0 | 1.0 |
Pulsation Timing Variations | 0.0 | 0.0 | 1.0 | 0.0 |
Radial Velocity | 1.0 | 52.0 | 475.0 | 424.0 |
Transit | 0.0 | 0.0 | 64.0 | 712.0 |
Transit Timing Variations | 0.0 | 0.0 | 0.0 | 9.0 |
(2)数据透视表
【例2】泰坦尼克号乘客数据分析
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
T = titanic[titanic.age.notnull()].copy()
T.age = T.age.apply(lambda x: 60 if x>=60 else x)  # 把 60 岁及以上统一记为 60,注意需要赋值回去
T.age.value_counts()
24.00 30
22.00 27
60.00 26
18.00 26
28.00 25
30.00 25
19.00 25
21.00 24
25.00 23
36.00 22
29.00 20
35.00 18
32.00 18
27.00 18
26.00 18
31.00 17
16.00 17
34.00 15
20.00 15
33.00 15
23.00 15
39.00 14
40.00 13
17.00 13
42.00 13
45.00 12
38.00 11
4.00 10
50.00 10
2.00 10
..
8.00 4
5.00 4
11.00 4
6.00 3
7.00 3
46.00 3
30.50 2
57.00 2
0.83 2
55.00 2
10.00 2
59.00 2
13.00 2
28.50 2
40.50 2
45.50 2
0.75 2
32.50 2
34.50 1
55.50 1
0.92 1
36.50 1
12.00 1
53.00 1
14.50 1
0.67 1
20.50 1
23.50 1
24.50 1
0.42 1
Name: age, Length: 77, dtype: int64
Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()
20 220
30 167
10 102
40 89
0 62
50 48
60 26
Name: age, dtype: int64
Age.astype(str)+"s"
0 20s
1 30s
2 20s
3 30s
4 30s
6 50s
7 0s
8 20s
9 10s
10 0s
11 50s
12 20s
13 30s
14 10s
15 50s
16 0s
18 30s
20 30s
21 30s
22 10s
23 20s
24 0s
25 30s
27 10s
30 40s
33 60s
34 20s
35 40s
37 20s
38 10s
...
856 40s
857 50s
858 20s
860 40s
861 20s
862 40s
864 20s
865 40s
866 20s
867 30s
869 0s
870 20s
871 40s
872 30s
873 40s
874 20s
875 10s
876 20s
877 10s
879 50s
880 20s
881 30s
882 20s
883 20s
884 20s
885 30s
886 20s
887 10s
889 20s
890 30s
Name: age, Length: 714, dtype: object
T.groupby(["sex", Age])["survived"].mean().unstack()
age | 0 | 10 | 20 | 30 | 40 | 50 | 60 |
---|---|---|---|---|---|---|---|
sex | |||||||
female | 0.633333 | 0.755556 | 0.722222 | 0.833333 | 0.687500 | 0.888889 | 1.000000 |
male | 0.593750 | 0.122807 | 0.168919 | 0.214953 | 0.210526 | 0.133333 | 0.136364 |
T.age = Age
T.pivot_table("survived", index="sex", columns="age")
age | 0 | 10 | 20 | 30 | 40 | 50 | 60 |
---|---|---|---|---|---|---|---|
sex | |||||||
female | 0.633333 | 0.755556 | 0.722222 | 0.833333 | 0.687500 | 0.888889 | 1.000000 |
male | 0.593750 | 0.122807 | 0.168919 | 0.214953 | 0.210526 | 0.133333 | 0.136364 |
titanic.describe()
survived | pclass | age | sibsp | parch | fare | |
---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
titanic.groupby("sex")[["survived"]].mean()
survived | |
---|---|
sex | |
female | 0.742038 |
male | 0.188908 |
titanic.groupby("sex")["survived"].mean()
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
class | First | Second | Third |
---|---|---|---|
sex | |||
female | 0.968085 | 0.921053 | 0.500000 |
male | 0.368852 | 0.157407 | 0.135447 |
titanic.pivot_table("survived", index="sex", columns="class") # 默认返回平均值
class | First | Second | Third |
---|---|---|---|
sex | |||
female | 0.968085 | 0.921053 | 0.500000 |
male | 0.368852 | 0.157407 | 0.135447 |
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True) # aggfunc="mean"即为默认值 margins=True 会加一个总的列和总的行。
class | First | Second | Third | All |
---|---|---|---|---|
sex | ||||
female | 0.968085 | 0.921053 | 0.500000 | 0.742038 |
male | 0.368852 | 0.157407 | 0.135447 | 0.188908 |
All | 0.629630 | 0.472826 | 0.242363 | 0.383838 |
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"}) # 要处理的那一列和要处理的方法组成一个键值对。
fare | survived | |||||
---|---|---|---|---|---|---|
class | First | Second | Third | First | Second | Third |
sex | ||||||
female | 106.125798 | 21.970121 | 16.118810 | 91 | 70 | 72 |
male | 67.226127 | 19.741782 | 12.661633 | 45 | 17 | 47 |
(1)向量化字符串操作
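原文此处只列出了标题,这里补充一个简单的示例(数据为演示用的假设数据):通过 Series.str 接口,可以对整列字符串做向量化操作,无需显式循环:

```python
import pandas as pd

names = pd.Series(["BeiJing", "shang hai", "HangZhou"])

upper = names.str.upper()            # 全部转成大写
has_space = names.str.contains(" ")  # 判断是否含有空格
print(upper)
print(has_space)
```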
(2) 处理时间序列
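同样补充一个时间序列的小示例(演示用的假设数据):用 pd.to_datetime 把字符串解析成时间戳作为索引,再用 resample 按时间粒度重采样:

```python
import pandas as pd

# 以时间戳为索引的序列
ts = pd.Series([1, 2, 3],
               index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-02-01"]))

# 按月重采样并求和:1 月为 1+2=3,2 月为 3
monthly = ts.resample("M").sum()
print(monthly)
```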
(3) 多级索引:用于多维数据
base_data = np.array([[1771, 11115 ],
[2154, 30320],
[2141, 14070],
[2424, 32680],
[1077, 7806],
[1303, 24222],
[798, 4789],
[981, 13468]])
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
, [2008, 2018]*4], columns=["population", "GDP"])
data
population | GDP | ||
---|---|---|---|
BeiJing | 2008 | 1771 | 11115 |
2018 | 2154 | 30320 | |
ShangHai | 2008 | 2141 | 14070 |
2018 | 2424 | 32680 | |
ShenZhen | 2008 | 1077 | 7806 |
2018 | 1303 | 24222 | |
HangZhou | 2008 | 798 | 4789 |
2018 | 981 | 13468 |
data.index.names = ["city", "year"]
data
population | GDP | ||
---|---|---|---|
city | year | ||
BeiJing | 2008 | 1771 | 11115 |
2018 | 2154 | 30320 | |
ShangHai | 2008 | 2141 | 14070 |
2018 | 2424 | 32680 | |
ShenZhen | 2008 | 1077 | 7806 |
2018 | 1303 | 24222 | |
HangZhou | 2008 | 798 | 4789 |
2018 | 981 | 13468 |
data["GDP"]
city year
BeiJing 2008 11115
2018 30320
ShangHai 2008 14070
2018 32680
ShenZhen 2008 7806
2018 24222
HangZhou 2008 4789
2018 13468
Name: GDP, dtype: int32
data.loc["ShangHai", "GDP"]
year
2008 14070
2018 32680
Name: GDP, dtype: int32
data.loc["ShangHai", 2018]["GDP"]
32680
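多级索引还可以用 xs 方法按某一层直接切片,例如取出所有城市 2018 年的数据。下面重建一个结构与上文 data 类似的小示例:

```python
import pandas as pd

# 构造 (city, year) 两层索引
idx = pd.MultiIndex.from_product([["BeiJing", "ShangHai"], [2008, 2018]],
                                 names=["city", "year"])
data = pd.DataFrame({"GDP": [11115, 30320, 14070, 32680]}, index=idx)

# xs 按 year 这一层取出 2018 年的所有行
y2018 = data.xs(2018, level="year")
print(y2018)
```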
(4) 高性能的Pandas:eval()
df1, df2, df3, df4 = (pd.DataFrame(np.random.random((10000,100))) for i in range(4))
%timeit (df1+df2)/(df3+df4)
17.6 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.eval("(df1+df2)/(df3+df4)")
10.5 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.allclose((df1+df2)/(df3+df4), pd.eval("(df1+df2)/(df3+df4)"))
True
df = pd.DataFrame(np.random.random((1000, 3)), columns=list("ABC"))
df.head()
A | B | C | |
---|---|---|---|
0 | 0.418071 | 0.381836 | 0.500556 |
1 | 0.059432 | 0.749066 | 0.302429 |
2 | 0.489147 | 0.739153 | 0.777161 |
3 | 0.175441 | 0.016556 | 0.348979 |
4 | 0.766534 | 0.559252 | 0.310635 |
res_1 = pd.eval("(df.A+df.B)/(df.C-1)")
res_2 = df.eval("(A+B)/(C-1)")
np.allclose(res_1, res_2)
True
df["D"] = pd.eval("(df.A+df.B)/(df.C-1)")
df.head()
A | B | C | D | |
---|---|---|---|---|
0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
df.eval("D=(A+B)/(C-1)", inplace=True)
df.head()
A | B | C | D | |
---|---|---|---|---|
0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
column_mean = df.mean(axis=1)
res = df.eval("A+@column_mean")
res.head()
0 0.342788
1 0.047409
2 -0.387501
3 0.236956
4 0.694839
dtype: float64
(5) 高性能的Pandas:query()
df.head()
A | B | C | D | |
---|---|---|---|---|
0 | 0.418071 | 0.381836 | 0.500556 | -1.601593 |
1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
3 | 0.175441 | 0.016556 | 0.348979 | -0.294917 |
4 | 0.766534 | 0.559252 | 0.310635 | -1.923199 |
%timeit df[(df.A < 0.5) & (df.B > 0.5)]
1.11 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.query("(A < 0.5)&(B > 0.5)")
2.55 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.query("(A < 0.5)&(B > 0.5)").head()
A | B | C | D | |
---|---|---|---|---|
1 | 0.059432 | 0.749066 | 0.302429 | -1.159019 |
2 | 0.489147 | 0.739153 | 0.777161 | -5.512052 |
7 | 0.073950 | 0.730144 | 0.646190 | -2.272672 |
10 | 0.393200 | 0.610467 | 0.697096 | -3.313485 |
11 | 0.065734 | 0.764699 | 0.179380 | -1.011958 |
np.allclose(df[(df.A < 0.5) & (df.B > 0.5)], df.query("(A < 0.5)&(B > 0.5)"))
True
(6)eval()和query()的使用时机
小数组时,普通方法反而更快
df.values.nbytes
32000
df1.values.nbytes
8000000