[Data Mining] The Three Musketeers of Machine Learning: A Summary of Common Pandas Usage (Part 1)

接地气的陈老师

Published 2019-12-09 14:27:07


This article is part of the column: 接地气学堂

I. Preface

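Before reading this pandas summary, I suggest reading my numpy summary first, "[Data Mining] The Three Musketeers of Machine Learning: A Summary of Common Numpy Usage"; the two work better together. Roughly speaking, numpy is mainly a tool for generating data and doing computations on it, while pandas is mainly for managing a dataset as a whole: how the data is laid out, plus row and column operations. Of course, the split is not quite that clean.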

II. Download, Installation, and Import

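Installing with anaconda is very convenient. If you have already installed packages such as tf or keras, numpy has in fact already been installed along with them. Otherwise, installation is generally just a pip command.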

pip install pandas #py2
pip3 install pandas #py3

To use it:

import pandas as pd  # conventionally aliased as pd

III. Summary of Common Usage

1. Series

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12).reshape((3, 4)))
print(df1)
"""
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
"""
s = pd.Series([1, 1, 44, 22])  # create a Series
s_type_int_float = pd.Series([1, 1, 44, 22], dtype=np.float32)  # set the dtype explicitly
s_type = pd.Series([1, np.nan, 44, 22])  # np.nan is the NaN missing-value marker
# set a custom index
s_index = pd.Series([1, np.nan, 44, 22], index=["c", "h", "e", "hongshu"])
print("s:")
print(s)
print("s_type_int_float:")
print(s_type_int_float)
print("s_type:")
print(s_type)
print("s_index:")
print(s_index)
"""
s:
0     1
1     1
2    44
3    22
dtype: int64

s_type_int_float:
0     1.0
1     1.0
2    44.0
3    22.0
dtype: float32

s_type:
0     1.0
1     NaN
2    44.0
3    22.0
dtype: float64

s_index:
c           1.0
h           NaN
e          44.0
hongshu    22.0
dtype: float64
"""

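Some notes:

A Series is the building block of a DataFrame, which is the main data type in pandas. One Series corresponds to one row of a DataFrame, and it carries along the DataFrame's column labels and the dtype of its elements. (The same holds for a column: each column is a Series indexed by the row labels.) PS: just remember this relationship for now; it only becomes fully clear later on, and I come back to it.
The left-hand column of a printed Series is its index. Seen as a DataFrame row, that index is the set of column labels: for s above it is [0, 1, 2, 3], which matches the header row at the top of the DataFrame df1 printed above, and it plays the same role as the column names in an Excel sheet.
The Series above were created from lists; in principle any array-like input works, and numpy arrays certainly do (shown below). If every element of the list is an integer, the dtype is inferred as int64 (on most platforms). If the elements also include the NaN missing value, they are treated as floating-point numbers, so the dtype defaults to float64; a list of floats defaults to float64 as well.
If you pass a dtype explicitly, the values are converted as usual, to integers or to floats.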
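
The notes above can be checked directly. This is a minimal sketch, not part of the original post; the names ints, with_nan, explicit, frame and row are made up for illustration. It checks the integer case through dtype.kind rather than a fixed width, since the exact integer width can depend on the platform.

```python
import numpy as np
import pandas as pd

# dtype inference from plain Python lists
ints = pd.Series([1, 1, 44, 22])            # all integers -> an integer dtype
with_nan = pd.Series([1, np.nan, 44, 22])   # NaN forces a float dtype
explicit = pd.Series([1, 1, 44, 22], dtype=np.float32)  # an explicit dtype wins

print(ints.dtype.kind)   # 'i' stands for a signed integer dtype
print(with_nan.dtype)
print(explicit.dtype)

# a row taken out of a DataFrame is a Series indexed by the column labels
frame = pd.DataFrame(np.arange(12).reshape((3, 4)))
row = frame.iloc[0]
print(type(row).__name__)
print(list(row.index) == list(frame.columns))
```

The last line is the relationship described in the first note: the index of a row Series is the column labels of the DataFrame it came from.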

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

s_np1 = pd.Series(np.arange(6))  # create a Series from a numpy array

data_numpy = np.array([1, 2, 3, 45], dtype=np.float32)
s_np2 = pd.Series(data_numpy)

data_numpy1 = np.array([1, 2, 3, 45], dtype=np.int8)
s_np3 = pd.Series(data_numpy1)

data_numpy2 = np.array([1, 2, 3, 45])
s_np4 = pd.Series(data_numpy2)

print(s_np1)
print(s_np2)
print(s_np3)
print(s_np4)
"""
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64
0     1.0
1     2.0
2     3.0
3    45.0
dtype: float32
0     1
1     2
2     3
3    45
dtype: int8
0     1
1     2
2     3
3    45
dtype: int64
"""

The thing to look at above is the dtype: when a Series is created from a numpy array, its dtype follows the dtype of the array.
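
If the inherited dtype is not the one you want, it can be changed afterwards. The following is a small sketch under that assumption (the names small and as_float are illustrative only), using astype to convert and to_numpy to go back to an array.

```python
import numpy as np
import pandas as pd

small = pd.Series(np.array([1, 2, 3, 45], dtype=np.int8))
print(small.dtype)    # the Series keeps the array's int8 dtype

# astype returns a new Series; the original is left unchanged
as_float = small.astype(np.float64)
print(as_float.dtype)
print(small.dtype)

# converting back to a numpy array keeps the dtype as well
print(small.to_numpy().dtype)
```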

2. DataFrame

(1) Working with a DataFrame's index and columns

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np


# DataFrame of random integers in [0, 10) with shape (3, 4), generated with numpy
df_np = pd.DataFrame(np.random.randint(low=0, high=10, size=(3, 4)))
print(df_np)

# DataFrame of random values drawn from the standard normal distribution
# set the index (row labels)
df_index = pd.DataFrame(np.random.randn(3, 4), index=['f', 's', 't'])
print(df_index)

# set the columns (column labels)
df_colums = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=['che', 'hong', 'shu', '24'])
print(df_colums)

"""
   0  1  2  3
0  2  3  0  3
1  7  0  5  8
2  0  5  2  7

          0         1         2         3
f -2.216776 -1.506733  0.870351  1.361973
s  1.104645 -1.538397 -0.616963 -2.101459
t -1.423237 -0.378047 -0.294814 -0.200800

   che  hong  shu  24
0    0     1    2   3
1    4     5    6   7
2    8     9   10  11
"""

# use dict to create dataframe
dates_value = pd.date_range('20181222', periods=3)
# the dict keys become the DataFrame's columns
df_dict = pd.DataFrame({'che': 22.22,
                        'hong': pd.Series(np.array([1, 2, 3], dtype=np.float32)),
                        'shu': dates_value})
print(df_dict)
"""
     che  hong        shu
0  22.22   1.0 2018-12-22
1  22.22   2.0 2018-12-23
2  22.22   3.0 2018-12-24
"""

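One point to note here: the columns argument of a DataFrame is the same thing as the index of each of its row Series.

To summarize:

A DataFrame can be created from a dict or from a numpy array.
The main arguments are index and columns: index holds the name of each row and columns holds the name of each column, which is also the index of every row Series.
When a DataFrame is created from a dict, the dict's keys become the DataFrame's columns.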
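
The three points in this summary fit in one short sketch. It is an illustration rather than the author's code: the names frame and row and the replacement labels "a" and "b" are made up. A plain numpy array is used for the hong column so that passing index= only relabels the rows.

```python
import numpy as np
import pandas as pd

# dict keys become column labels; a scalar value is repeated for every row
frame = pd.DataFrame(
    {"che": 22.22, "hong": np.array([1, 2, 3], dtype=np.float32)},
    index=["f", "s", "t"],
)
print(frame)
print(list(frame.index))
print(list(frame.columns))

# a single row is a Series indexed by the column labels
row = frame.loc["s"]
print(list(row.index))

# labels can also be replaced after construction
frame.columns = ["a", "b"]
print(list(frame.columns))
```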

(2) DataFrame attributes

import pandas as pd
import numpy as np
# pandas.Categorical
#https://blog.csdn.net/weixin_38656890/article/details/81348539


df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3, 6, 9, 12], dtype=np.int32),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'che'})
print(df2)
print(df2.dtypes) #return the data type of each column.
"""
     A          B    C   D      E    F
0  1.0 2013-01-02  1.0   3   test  che
1  1.0 2013-01-02  1.0   6  train  che
2  1.0 2013-01-02  1.0   9   test  che
3  1.0 2013-01-02  1.0  12  train  che
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
"""
print(df2.index)
print(df2.columns)
"""
Int64Index([0, 1, 2, 3], dtype='int64')
Index([u'A', u'B', u'C', u'D', u'E', u'F'], dtype='object')
"""
print(df2.values)
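# .values returns a numpy array, so one way to pull out elements is to convert it to a list
# that is slow, though; pandas has better built-in ways to select values
# (the outputs shown here were recorded under Python 2; Python 3 prints no u'' prefixes
#  and shows <class 'numpy.ndarray'>)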
print(type(df2.values))
"""
[[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'che']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 6 'train' 'che']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 9 'test' 'che']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 12 'train' 'che']]
<type 'numpy.ndarray'>
"""
# summary statistics for the numeric columns:
# count, mean, standard deviation, minimum, and so on
print(df2.describe())
"""
         A    C          D
count  4.0  4.0   4.000000
mean   1.0  1.0   7.500000
std    0.0  0.0   3.872983
min    1.0  1.0   3.000000
25%    1.0  1.0   5.250000
50%    1.0  1.0   7.500000
75%    1.0  1.0   9.750000
max    1.0  1.0  12.000000
"""

"""
Original DataFrame, repeated for comparison
     A          B    C   D      E    F
0  1.0 2013-01-02  1.0   3   test  che
1  1.0 2013-01-02  1.0   6  train  che
2  1.0 2013-01-02  1.0   9   test  che
3  1.0 2013-01-02  1.0  12  train  che
"""
print(df2.T)  # transpose
"""                     0         ...                             3
A                    1         ...                             1
B  2013-01-02 00:00:00         ...           2013-01-02 00:00:00
C                    1         ...                             1
D                    3         ...                            12
E                 test         ...                         train
F                  che         ...                           che
"""
print(df2.sort_index(axis=1, ascending=False))  # axis=1 sorts by the column labels
print(df2.sort_index(axis=0, ascending=False))  # axis=0 sorts by the row index
# the values simply move along with their index or column labels
"""
     F      E   D    C          B    A
0  che   test   3  1.0 2013-01-02  1.0
1  che  train   6  1.0 2013-01-02  1.0
2  che   test   9  1.0 2013-01-02  1.0
3  che  train  12  1.0 2013-01-02  1.0
     A          B    C   D      E    F
3  1.0 2013-01-02  1.0  12  train  che
2  1.0 2013-01-02  1.0   9   test  che
1  1.0 2013-01-02  1.0   6  train  che
0  1.0 2013-01-02  1.0   3   test  che
"""
print(df2.sort_values(by='E'))  # sort rows by the values in column E (numbers by size, strings alphabetically)
"""
     A          B    C   D      E    F
0  1.0 2013-01-02  1.0   3   test  che
2  1.0 2013-01-02  1.0   9   test  che
1  1.0 2013-01-02  1.0   6  train  che
3  1.0 2013-01-02  1.0  12  train  che
"""

3. Selection

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

dates = pd.date_range('20121222', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
#easy selection
print(df)
"""
             A   B   C   D
2012-12-22   0   1   2   3
2012-12-23   4   5   6   7
2012-12-24   8   9  10  11
2012-12-25  12  13  14  15
2012-12-26  16  17  18  19
2012-12-27  20  21  22  23
"""
# select column 'A'
print(df['A'])
print(df.A)
"""
2012-12-22     0
2012-12-23     4
2012-12-24     8
2012-12-25    12
2012-12-26    16
2012-12-27    20
Freq: D, Name: A, dtype: int64
2012-12-22     0
2012-12-23     4
2012-12-24     8
2012-12-25    12
2012-12-26    16
2012-12-27    20
Freq: D, Name: A, dtype: int64
"""
# select the first three rows (positions 0-2)
print(df[0: 3])
print(df['2012-12-22':'2012-12-24'])
"""
            A  B   C   D
2012-12-22  0  1   2   3
2012-12-23  4  5   6   7
2012-12-24  8  9  10  11
            A  B   C   D
2012-12-22  0  1   2   3
2012-12-23  4  5   6   7
2012-12-24  8  9  10  11
"""



"""
Original DataFrame, repeated for comparison
             A   B   C   D
2012-12-22   0   1   2   3
2012-12-23   4   5   6   7
2012-12-24   8   9  10  11
2012-12-25  12  13  14  15
2012-12-26  16  17  18  19
2012-12-27  20  21  22  23
"""
# select by label = loc
# the labels here are the DataFrame's index and columns described earlier
# it works like ordinary 2-D numpy indexing, with positions replaced by label names

print(df.loc['20121224'])  # a single label inside loc[] is a row (index) label
print(df.loc[:, 'A':'C'])  # ':' keeps every row; what follows the comma are column labels
"""
A     8
B     9
C    10
D    11
Name: 2012-12-24 00:00:00, dtype: int64
             A   B   C
2012-12-22   0   1   2
2012-12-23   4   5   6
2012-12-24   8   9  10
2012-12-25  12  13  14
2012-12-26  16  17  18
2012-12-27  20  21  22
"""
print(df.loc[:, ['A', 'C']])
print(df.loc['20121223', ['A', 'C']])
"""
             A   C
2012-12-22   0   2
2012-12-23   4   6
2012-12-24   8  10
2012-12-25  12  14
2012-12-26  16  18
2012-12-27  20  22
A    4
C    6
Name: 2012-12-23 00:00:00, dtype: int64
"""

"""
Original DataFrame, repeated for comparison
             A   B   C   D
2012-12-22   0   1   2   3
2012-12-23   4   5   6   7
2012-12-24   8   9  10  11
2012-12-25  12  13  14  15
2012-12-26  16  17  18  19
2012-12-27  20  21  22  23
"""
# select by position = iloc
# positional selection works just like numpy indexing:
# (row position, column position)
# values are picked by the integer positions of the rows and columns
print(df.iloc[3])
print(df.iloc[3:5, 1:3])
print(df.iloc[[1, 3], 1:3])
"""
A    12
B    13
C    14
D    15
Name: 2012-12-25 00:00:00, dtype: int64
             B   C
2012-12-25  13  14
2012-12-26  17  18
             B   C
2012-12-23   5   6
2012-12-25  13  14
"""

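# mixed selection = ix (label + position selection)
# .ix was deprecated in pandas 0.20 and removed in 1.0, so the original call
# df.ix[1, ['A', 'D']] is written here with .loc and a positional row lookup
print(df.loc[df.index[1], ['A', 'D']])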
"""
A    4
D    7
Name: 2012-12-23 00:00:00, dtype: int64
"""
# Boolean indexing
# use bool to select
print(df[df.B > 9])
"""
             A   B   C   D
2012-12-25  12  13  14  15
2012-12-26  16  17  18  19
2012-12-27  20  21  22  23
"""

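To summarize:

There are five ways to select data in total: simple direct selection, label selection (loc), positional selection (iloc), mixed selection (ix, which was removed in pandas 1.0; use loc or iloc instead), and Boolean selection.
The second to fourth ways follow one pattern: each is a [row, column] combination. One takes label names, one takes integer positions, and the mixed one takes either labels or positions.
The first way selects a single column, or a slice of rows, by label or position; it also has shorthand forms such as df.A.
The last way is mainly used for filtering data.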
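
Since .ix is gone from pandas 1.0 onward, mixed selection has to be spelled with loc or iloc. The sketch below shows one way to do that, together with a combined Boolean filter; the names by_loc, by_iloc and filtered are illustrative, and the DataFrame is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20121222", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# position for the row, labels for the columns (what .ix used to allow)
by_loc = df.loc[df.index[1], ["A", "D"]]
# the same cells, fully positional: translate the column labels to positions first
by_iloc = df.iloc[1, df.columns.get_indexer(["A", "D"])]
print(by_loc.tolist())
print(by_iloc.tolist())

# Boolean masks combine with & and |; each condition needs its own parentheses
filtered = df[(df.B > 9) & (df.D < 23)]
print(filtered)
```

Both lookups return the same two values from the row for 2012-12-23, so which form to use is mostly a matter of whether labels or positions are already at hand.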

4. Reading and Writing Files

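In practice this mostly means excel files and csv files. I personally recommend csv, because it is the format used in many competitions and projects and it is simply more portable; I have run into a lot of problems using excel under linux. That said, pandas handles the two in very similar ways. First, we use a common machine learning dataset, the iris dataset; the link is as follows.

A brief introduction to the dataset: the features of iris flowers are the data source. The dataset contains 150 samples in 3 classes, 50 samples per class, and each sample has 4 attributes. A screenshot of the dataset file iris.csv is shown below.

Here the dataset is simply read in, lightly processed so that it matches what an algorithm expects as input, and written back out.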

import pandas as pd
import numpy as np
# read the csv file
Iris_dataset = pd.read_csv("./Iris_dataset/iris.csv")
# give every column a label
Iris_dataset.columns = ['data_index', 'sepal_len', 'sepal_width', 'petal_len', 'petal_width', 'class']
# drop the first column (not needed; it only holds the row number)
Iris_dataset.drop(columns='data_index', inplace=True)
# check whether there are any NaN values
if Iris_dataset.isnull().values.any():
    print("missing values found")
    # dropna() returns a new DataFrame, so the result has to be assigned back
    Iris_dataset = Iris_dataset.dropna()
else:
    print("no missing values")
# convert the string class names to ints
def fun(x):
    if x == 'setosa':
        return 0
    elif x == 'versicolor':
        return 1
    elif x == 'virginica':
        return 2
Iris_dataset['class'] = Iris_dataset['class'].apply(fun)
# first five rows
print(Iris_dataset.head())
# write the result out as a csv file
Iris_dataset.to_csv('iris_handle_data')
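
The same read, clean, and write steps can be tried without any file on disk. This is a hedged sketch rather than the author's pipeline: the three-row text stands in for iris.csv, the column name idx and the dict label_ids are assumptions made for the example, and io.StringIO replaces the real file paths.

```python
import io

import pandas as pd

# a three-row stand-in for iris.csv, using the column layout assumed above
raw = io.StringIO(
    "idx,sepal_len,sepal_width,petal_len,petal_width,class\n"
    "1,5.1,3.5,1.4,0.2,setosa\n"
    "2,7.0,3.2,4.7,1.4,versicolor\n"
    "3,6.3,3.3,6.0,2.5,virginica\n"
)
iris = pd.read_csv(raw)
iris = iris.drop(columns="idx")

# map() with a dict replaces the if/elif chain; labels missing from the dict become NaN
label_ids = {"setosa": 0, "versicolor": 1, "virginica": 2}
iris["class"] = iris["class"].map(label_ids)

# index=False keeps the row index out of the written file
out = io.StringIO()
iris.to_csv(out, index=False)
print(out.getvalue())
```

Writing with index=False means the file has no extra leading column, so nothing needs to be dropped when it is read back in.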

The output file looks like this: