Data Analysis: pandas Basics (Part 4)

andrew_a

Published 2019-08-16 10:59:55


Included in the column: Python爬虫与数据分析

pandas offers a great many ways to process data. This installment covers handling missing data, grouping data, and using aggregation functions.

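I. Handling Missing Data

In data analysis, most datasets are incomplete. Missing values lower data quality, and a machine learning model trained on such data can lose predictive accuracy, so handling missing data is a necessary step.

1) Missing values in pandas (NA or NaN)

Using reindex, we create a DataFrame that contains missing values.

In the output, NaN stands for "Not a Number".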

import numpy as np
import pandas as pd
# Build a DataFrame, then reindex it so that the new labels become missing rows
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
"""
Output:
        one       two     three
a -0.018164 -0.594016  0.378447
b       NaN       NaN       NaN
c -1.482830  0.909581 -0.431973
d       NaN       NaN       NaN
e -0.797581  0.986172 -1.182949
f -0.514952  1.124808 -1.246717
g       NaN       NaN       NaN
h  1.781893  0.784155 -0.672985
"""

Checking for missing values: pandas provides the isnull() and notnull() functions.

# isnull(): True where column 'one' is missing, False otherwise
print(df['one'].isnull())
"""
Output:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
"""

# notnull() is the inverse: True where a value is present, False where it is missing
print(df['one'].notnull())
"""
Output:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
"""

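The two checks above are often combined with sum() or boolean indexing. The following is a minimal sketch on a small hand-written DataFrame (df_demo and its values are made up for illustration, not taken from the article's random data):

```python
import numpy as np
import pandas as pd

# Hypothetical data, fixed so that the result is reproducible
df_demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# isnull() on a whole DataFrame returns a boolean frame;
# summing it counts the True values (missing entries) per column
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# Boolean indexing: keep only the rows where column 'one' has a value
rows_with_one = df_demo[df_demo["one"].notnull()]
print(rows_with_one)
```

Counting missing values per column is a quick way to decide whether a column is worth filling at all.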
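2) Summing data that contains missing values

sum():

When summing, NA values are treated as 0.
If every value is NA, the result depends on the pandas version: since pandas 0.22 the sum is 0 by default, and passing min_count=1 returns NaN instead (older versions returned NaN).

print(df['one'].sum())
"""
Output:
-1.0316327375313081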
"""

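The behavior of sum() with missing values can be checked on two tiny Series. This is a sketch with made-up values; the variable names are illustrative only, and the default shown for the all-NA case assumes pandas 0.22 or later:

```python
import numpy as np
import pandas as pd

s_mixed = pd.Series([1.5, np.nan, 2.5])   # some values missing
s_all_na = pd.Series([np.nan, np.nan])    # everything missing

total_mixed = s_mixed.sum()               # NaN is skipped: 1.5 + 2.5
total_all_na = s_all_na.sum()             # default for an all-NA Series
total_strict = s_all_na.sum(min_count=1)  # require at least one valid value
total_noskip = s_mixed.sum(skipna=False)  # do not skip NaN at all

print(total_mixed, total_all_na, total_strict, total_noskip)
```

min_count is the parameter to reach for when "no data" must stay distinguishable from "a total of zero".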
3) Dropping/filling missing values

pandas offers several ways to fill missing data through the fillna() function.
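The "dropping" half of this section's title is usually handled with dropna(). A minimal sketch, assuming a small hand-written DataFrame (df_demo and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 'b' is entirely missing, row 'c' is partly missing
df_demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [4.0, np.nan, np.nan]},
    index=["a", "b", "c"],
)

# Default (how='any'): drop every row that contains at least one NaN
dropped_any = df_demo.dropna()

# how='all': drop only the rows in which every value is NaN
dropped_all = df_demo.dropna(how="all")

print(dropped_any)
print(dropped_all)
```

Dropping is the simpler option when few rows are affected; filling keeps the row count intact.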

Filling with a scalar (a fixed value):

# Cleaning: fill missing data
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df, '\n')
# Replace every NaN with 0
print("NaN replaced with '0':")
print(df.fillna(0))
"""
Output:
       one       two     three
a -1.004807  0.727737 -0.481955
b       NaN       NaN       NaN
c  0.284135 -1.066389 -1.725905

NaN replaced with '0':
        one       two     three
a -1.004807  0.727737 -0.481955
b  0.000000  0.000000  0.000000
c  0.284135 -1.066389 -1.725905
"""

Above we filled with 0; any other value can be used in the same way.
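fillna() also accepts a dictionary, so each column can get its own fill value. A minimal sketch with made-up data (df_demo and the chosen fill values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df_demo = pd.DataFrame({"one": [1.0, np.nan], "two": [np.nan, 2.0]})

# Keys are column names, values are the fill value for that column
filled = df_demo.fillna({"one": 0.0, "two": -1.0})
print(filled)

# fillna returns a new DataFrame; the original still has its NaN values
print(df_demo.isnull().sum().sum())
```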

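Forward fill and backward fill:

Each missing value is filled with the value from the previous row (forward fill) or from the next row (backward fill).

# Forward fill and backward fill
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df, '\n')
# Fill with the previous row's value; a leading NaN stays NaN because nothing precedes it.
# ffill() is equivalent to fillna(method='pad'), which recent pandas versions deprecate.
print(df.ffill(), '\n')
# Fill with the next row's value; a trailing NaN stays NaN because nothing follows it.
# bfill() is equivalent to fillna(method='backfill').
print(df.bfill())
"""
Output: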
        one       two     three
a  0.872584 -0.423063 -0.156466
b       NaN       NaN       NaN
c  0.307049  0.292621 -1.684947
d       NaN       NaN       NaN
e  1.276004 -0.444504  0.460022
f -0.180679  1.428129 -0.383163
g       NaN       NaN       NaN
h -0.158751  0.334699 -0.174680

        one       two     three
a  0.872584 -0.423063 -0.156466
b  0.872584 -0.423063 -0.156466
c  0.307049  0.292621 -1.684947
d  0.307049  0.292621 -1.684947
e  1.276004 -0.444504  0.460022
f -0.180679  1.428129 -0.383163
g -0.180679  1.428129 -0.383163
h -0.158751  0.334699 -0.174680

        one       two     three
a  0.872584 -0.423063 -0.156466
b  0.307049  0.292621 -1.684947
c  0.307049  0.292621 -1.684947
d  1.276004 -0.444504  0.460022
e  1.276004 -0.444504  0.460022
f -0.180679  1.428129 -0.383163
g -0.158751  0.334699 -0.174680
h -0.158751  0.334699 -0.174680
"""

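Two edge cases of directional filling are worth seeing on a fixed example: a leading gap has nothing to copy from, and the limit parameter caps how far a value is carried. This is a sketch on a made-up Series (the name s and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical Series: a leading gap at 'a' and a two-row gap at 'c', 'd'
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0], index=list("abcde"))

forward = s.ffill()          # 'a' stays NaN: there is no earlier value
limited = s.ffill(limit=1)   # carry each value forward by at most one row
backward = s.bfill()         # every gap here has a later value to copy

print(forward)
print(limited)
print(backward)
```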
Filling with the mean:

# Fill each column's missing values with that column's mean
df.fillna(df.mean(), inplace=True)
print(df)
"""
Output:
        one       two     three
a -0.325235  1.671434 -0.059426
b -0.213600  0.214624 -0.629093
c -1.070583 -0.142056 -0.046486
d -0.213600  0.214624 -0.629093
e  1.214059 -0.831476 -0.210059
f -1.312524 -0.554252 -1.111779
g -0.213600  0.214624 -0.629093
h  0.426282  0.929469 -1.717717
"""

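Filling with the median:

# Median fill. fillna(..., inplace=True) returns None, so printing its result
# would show None; here fillna returns a new DataFrame instead.
# df was already filled in place above, so no NaN remains and the output is unchanged.
print(df.fillna(df.median()))
"""
Output: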
        one       two     three
a -0.325235  1.671434 -0.059426
b -0.213600  0.214624 -0.629093
c -1.070583 -0.142056 -0.046486
d -0.213600  0.214624 -0.629093
e  1.214059 -0.831476 -0.210059
f -1.312524 -0.554252 -1.111779
g -0.213600  0.214624 -0.629093
h  0.426282  0.929469 -1.717717
"""

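Beyond the approaches above there are others, such as filling with the mode: each column's missing values are filled with that column's most frequent value. A column with many missing values could end up with NaN as its mode, so the NaN values should be excluded before the mode is taken (pandas' mode() skips NaN by default).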
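The mode fill described above can be sketched as follows. The data is made up, and taking the first row of mode() is one possible way to break ties, chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'grade' has mode 3.0, 'score' has mode 2.0
df_demo = pd.DataFrame({
    "grade": [3.0, 3.0, np.nan, 5.0, np.nan],
    "score": [1.0, np.nan, 2.0, 2.0, 7.0],
})

# mode() ignores NaN by default and returns one row per tied mode;
# the first row holds one mode for every column
column_modes = df_demo.mode().iloc[0]

# A Series passed to fillna is matched against the column names
filled = df_demo.fillna(column_modes)
print(filled)
```

The same pattern works for the mean and the median, because df.mean() and df.median() also return a Series indexed by column name.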

II. Grouping Data

Use groupby() to split data into groups.

# groupby: split the data into groups
# Note: 'kings' (lowercase) is a different value from 'Kings' and forms its own group
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
print(df, '\n')
"""
Output:
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017 
"""

Grouping by a column:

print(df.groupby('Rank'), '\n')  # split into groups
"""
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f54f9e6a6d8> 
"""

The result is a pandas DataFrameGroupBy object, not the grouped data itself.

Viewing the groups:

print(df.groupby('Rank').groups, '\n')  # view the groups
"""
Output:
{1: Int64Index([0, 6, 7, 10], dtype='int64'), 
2: Int64Index([1, 2, 8, 11], dtype='int64'),
3: Int64Index([3, 4], dtype='int64'), 
4: Int64Index([5, 9], dtype='int64')}
"""

The result is a dictionary that maps each group key to the index labels of its rows.
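Because .groups behaves like a dictionary, it can be indexed and iterated like one. A minimal sketch on a made-up four-row DataFrame (df_demo is hypothetical and much smaller than the article's data):

```python
import pandas as pd

# Hypothetical data: Rank 1 appears twice, Ranks 2 and 3 once each
df_demo = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Kings"],
    "Rank": [1, 2, 1, 3],
})

groups = df_demo.groupby("Rank").groups

# Look up the row labels that belong to one group key
rank1_rows = list(groups[1])

# Iterate over (key, index) pairs to count the rows per group
group_sizes = {key: len(index) for key, index in groups.items()}

print(rank1_rows)
print(group_sizes)
```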

Grouping by multiple columns:

print(df.groupby(['Team', 'Year']).groups)  # group by multiple columns
"""
Output:
{('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64')}
"""

This is also returned as a dictionary, keyed by (Team, Year) tuples.
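With several grouping columns, a single group is addressed by a tuple of key values. A minimal sketch on made-up data (df_demo and its rows are hypothetical):

```python
import pandas as pd

# Hypothetical data: two rows share the key ('Riders', 2014)
df_demo = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Riders"],
    "Year": [2014, 2015, 2014, 2014],
    "Points": [876, 789, 863, 700],
})

grouped = df_demo.groupby(["Team", "Year"])

# The key is a tuple with one value per grouping column
riders_2014 = grouped.get_group(("Riders", 2014))

# ngroups counts the distinct key combinations that actually occur
n_groups = grouped.ngroups

print(riders_2014)
print(n_groups)
```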

Iterating over groups and selecting one group:

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
groupYear = df.groupby('Year')
# Iterate over the groups: each step yields the group key and its sub-DataFrame
for name, group in groupYear:
    print(name)
    print(group)
print("\n")
print("***********\n")
print(groupYear.get_group(2016))  # select a single group by its key
"""
Output:
    2014
       Points  Rank    Team  Year
    0     876     1  Riders  2014
    2     863     2  Devils  2014
    4     741     3   Kings  2014
    9     701     4  Royals  2014
    2015
        Points  Rank    Team  Year
    1      789     2  Riders  2015
    3      673     3  Devils  2015
    5      812     4   kings  2015
    10     804     1  Royals  2015
    2016
       Points  Rank    Team  Year
    6     756     1   Kings  2016
    8     694     2  Riders  2016
    2017
        Points  Rank    Team  Year
    7      788     1   Kings  2017
    11     690     2  Riders  2017
    
    ***********
       Points  Rank    Team  Year
    6     756     1   Kings  2016
    8     694     2  Riders  2016
"""

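Since each iteration yields an ordinary DataFrame, any per-group logic can be written inside the loop. The sketch below, on made-up data, records the top-scoring team for each year (df_demo and best_per_year are illustrative names):

```python
import pandas as pd

# Hypothetical data: two teams per year
df_demo = pd.DataFrame({
    "Team": ["Riders", "Devils", "Riders", "Royals"],
    "Year": [2014, 2014, 2015, 2015],
    "Points": [876, 863, 789, 804],
})

best_per_year = {}
for year, group in df_demo.groupby("Year"):
    # idxmax() gives the row label of the highest Points within this group
    top_row = group.loc[group["Points"].idxmax()]
    best_per_year[year] = top_row["Team"]

print(best_per_year)
```

For simple statistics the aggregation methods are usually shorter than a loop, but iteration remains useful when the per-group logic does not fit a single function.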
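III. Aggregation Functions

An aggregation function returns a single aggregated value for each group. Once a groupby object has been created, several aggregation operations can be run on the grouped data. pandas performs aggregation through the agg() method.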

import numpy as np
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

groupYear = df.groupby('Year')
# Mean of Points for each year
print(groupYear['Points'].agg('mean'), '\n')
print(groupYear.agg(np.size))  # size of each group

"""
Output:
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64 

      Points  Rank  Team
Year                    
2014       4     4     4
2015       4     4     4
2016       2     2     2
2017       2     2     2
"""

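agg() also accepts a dictionary that maps column names to functions, so different columns can be aggregated differently in one call. A minimal sketch on made-up data (df_demo is hypothetical):

```python
import pandas as pd

# Hypothetical data: two rows per year
df_demo = pd.DataFrame({
    "Year": [2014, 2014, 2015, 2015],
    "Points": [876, 863, 789, 804],
    "Rank": [1, 2, 2, 1],
})

# Mean of Points and best (lowest) Rank for each year
summary = df_demo.groupby("Year").agg({"Points": "mean", "Rank": "min"})
print(summary)
```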
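Applying several aggregation functions at once:

groupTeam = df.groupby('Team')
# Passing a list of function names yields one output column per function
print(groupTeam['Points'].agg(['sum', 'mean', 'std']))
"""
Output: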
         sum        mean         std
Team                                
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
"""

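The NaN in the std column for kings comes from that group having a single row: the sample standard deviation divides by n - 1, which is zero for one observation. The sketch below reproduces this on made-up data and uses named aggregation to choose the output column names (the names total, average, and spread are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Kings' has two rows, 'kings' has only one
df_demo = pd.DataFrame({
    "Team": ["Kings", "Kings", "kings"],
    "Points": [741, 756, 812],
})

# Named aggregation: keyword = output column, value = function name
stats = df_demo.groupby("Team")["Points"].agg(
    total="sum", average="mean", spread="std"
)
print(stats)
```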
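Filtering:

filter() keeps or drops whole groups according to a condition. Here only the teams that appear at least three times are kept.

print(df.groupby('Team').filter(lambda x: len(x) >= 3))
"""
Output: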
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017
"""

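In the output above, the lowercase kings row was dropped because it forms a one-row group of its own. If the different spelling is a data-entry inconsistency (an assumption; the article does not say), normalizing the case before grouping changes which rows survive the filter. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical data: 'kings' differs from 'Kings' only by case
df_demo = pd.DataFrame({
    "Team": ["Kings", "kings", "Kings", "Royals", "Royals"],
    "Points": [741, 812, 756, 701, 804],
})

# As-is: no team reaches three rows, so every group is filtered out
kept_raw = df_demo.groupby("Team").filter(lambda g: len(g) >= 3)

# After title-casing the names, the three Kings rows form one group
normalized = df_demo.assign(Team=df_demo["Team"].str.title())
kept_normalized = normalized.groupby("Team").filter(lambda g: len(g) >= 3)

print(kept_raw)
print(kept_normalized)
```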
Originally published on 2019-08-13 and shared from the WeChat official account Python爬虫scrapy. In case of infringement, contact cloudcommunity@tencent.com for removal.
