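pandas offers a great many ways to process data. This post continues the series with three topics: handling missing data, grouping data, and using aggregation functions.
I. Handling Missing Data
In data analysis, most datasets are incomplete. Missing values lower data quality, and a machine learning model trained on such data can lose a lot of accuracy. Handling missing data properly is therefore well worth the effort.
1) Handling missing values in pandas (NA or NaN)
Using reindex, we create a DataFrame that contains missing values: every index label that did not exist in the original frame gets a row of NaN.
In the output, NaN stands for "Not a Number" and marks a missing value.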
import numpy as np
import pandas as pd
# Build a DataFrame that contains missing values
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
"""
Output:
one two three
a -0.018164 -0.594016 0.378447
b NaN NaN NaN
c -1.482830 0.909581 -0.431973
d NaN NaN NaN
e -0.797581 0.986172 -1.182949
f -0.514952 1.124808 -1.246717
g NaN NaN NaN
h 1.781893 0.784155 -0.672985
"""
Checking for missing values: pandas provides the isnull() and notnull() functions.
# isnull() checks each value in column 'one': True where the value is missing, False otherwise
print(df['one'].isnull())
"""
Output:
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
"""
# notnull() is the inverse: True where a value is present, False where it is missing
print(df['one'].notnull())
"""
Output:
a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool
"""
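To get a count of missing values rather than a row-by-row mask, the boolean result can be summed, since True counts as 1. The following is a minimal sketch on a small hand-built frame; the name demo and its values are made up for illustration and are not from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with known gaps (values chosen for illustration only)
demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# Number of NaN values in each column
missing_per_column = demo.isnull().sum()

# Rows that contain at least one NaN
rows_with_gaps = demo[demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```

The same pattern works on a single column, for example demo['one'].isnull().sum().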
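2) Summing data that contains missing values
sum():
# NaN values are skipped by default, so the sum covers only the values that are present
print(df['one'].sum())
"""
Output:
-1.0316327375313081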
"""
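The sum works despite the gaps because pandas skips NaN by default (skipna=True). The sketch below, on made-up Series values, shows how that default can be switched off, and how min_count makes a sum over nothing but NaN return NaN instead of 0.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.5])  # illustrative values, not from the article

total_skipping_nan = s.sum()            # NaN ignored: 1.0 + 2.5
total_strict = s.sum(skipna=False)      # any NaN makes the result NaN

all_missing = pd.Series([np.nan, np.nan])
sum_default = all_missing.sum()               # 0.0, the sum of no values
sum_min_count = all_missing.sum(min_count=1)  # NaN, fewer than 1 valid value

print(total_skipping_nan, total_strict, sum_default, sum_min_count)
```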
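3) Cleaning and filling missing values
The pandas fillna() function offers several ways to fill in missing data.
Filling with a scalar (a fixed value):
# Build a small frame with one missing row, then fill it
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df, '\n')
# Replace every NaN with 0
print("NaN replaced with '0':")
print(df.fillna(0))
"""
Output: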
one two three
a -1.004807 0.727737 -0.481955
b NaN NaN NaN
c 0.284135 -1.066389 -1.725905
NaN replaced with '0':
one two three
a -1.004807 0.727737 -0.481955
b 0.000000 0.000000 0.000000
c 0.284135 -1.066389 -1.725905
"""
The example above fills with 0, but any other value can be used instead.
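As one illustration of filling with values other than 0, fillna() also accepts a dict that maps column names to replacement values, so each column can get its own fill. This is a minimal sketch; the frame demo and the replacement values -1.0 and 99.0 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame (values chosen for illustration only)
demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Fill each column with its own replacement value.
# fillna returns a new frame; demo itself is left unchanged.
filled = demo.fillna({"one": -1.0, "two": 99.0})

print(filled)
```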
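Forward fill and backward fill:
Each missing value is filled with the value from the row above it (forward fill) or from the row below it (backward fill).
# Forward fill and backward fill
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df, '\n')
# Fill with the previous row's value; a leading NaN stays NaN because nothing precedes it.
# ffill() is equivalent to fillna(method='pad'), which recent pandas versions deprecate.
print(df.ffill(), '\n')
# Fill with the next row's value; a trailing NaN stays NaN because nothing follows it.
print(df.bfill())
"""
Output: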
one two three
a 0.872584 -0.423063 -0.156466
b NaN NaN NaN
c 0.307049 0.292621 -1.684947
d NaN NaN NaN
e 1.276004 -0.444504 0.460022
f -0.180679 1.428129 -0.383163
g NaN NaN NaN
h -0.158751 0.334699 -0.174680
one two three
a 0.872584 -0.423063 -0.156466
b 0.872584 -0.423063 -0.156466
c 0.307049 0.292621 -1.684947
d 0.307049 0.292621 -1.684947
e 1.276004 -0.444504 0.460022
f -0.180679 1.428129 -0.383163
g -0.180679 1.428129 -0.383163
h -0.158751 0.334699 -0.174680
one two three
a 0.872584 -0.423063 -0.156466
b 0.307049 0.292621 -1.684947
c 0.307049 0.292621 -1.684947
d 1.276004 -0.444504 0.460022
e 1.276004 -0.444504 0.460022
f -0.180679 1.428129 -0.383163
g -0.158751 0.334699 -0.174680
h -0.158751 0.334699 -0.174680
"""
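The comments above point out that the neighbouring row may itself have no value. The sketch below shows that case on a made-up Series: a leading NaN survives a forward fill, chaining a backward fill closes the remaining gap, and the limit parameter caps how many consecutive NaN values get filled.

```python
import numpy as np
import pandas as pd

# Illustrative Series with a leading gap and a two-row gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0], index=list("abcde"))

forward = s.ffill()          # 'a' stays NaN: there is no earlier value to copy
both = s.ffill().bfill()     # backward fill handles what forward fill could not
limited = s.ffill(limit=1)   # fill at most one consecutive NaN, so 'd' stays NaN

print(forward, both, limited, sep="\n\n")
```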
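Filling with the mean:
# Replace each NaN with its column mean.
# inplace=True modifies df and returns None, so print df afterwards.
df.fillna(df.mean(), inplace=True)
print(df)
"""
Output: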
one two three
a -0.325235 1.671434 -0.059426
b -0.213600 0.214624 -0.629093
c -1.070583 -0.142056 -0.046486
d -0.213600 0.214624 -0.629093
e 1.214059 -0.831476 -0.210059
f -1.312524 -0.554252 -1.111779
g -0.213600 0.214624 -0.629093
h 0.426282 0.929469 -1.717717
"""
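Filling with the median:
# Median fill. With inplace=True fillna returns None, so print the returned copy instead.
# df was already mean-filled in place above, so no NaN remains and the output is unchanged.
print(df.fillna(df.median()))
"""
Output: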
one two three
a -0.325235 1.671434 -0.059426
b -0.213600 0.214624 -0.629093
c -1.070583 -0.142056 -0.046486
d -0.213600 0.214624 -0.629093
e 1.214059 -0.831476 -0.210059
f -1.312524 -0.554252 -1.111779
g -0.213600 0.214624 -0.629093
h 0.426282 0.929469 -1.717717
"""
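This section covers cleaning as well as filling, so here is a short sketch of the removal side using dropna(). The frame demo and its values are assumptions made up for illustration; how and subset are standard dropna() parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 'b' is entirely missing, 'c' and 'd' are partly missing
demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [4.0, np.nan, np.nan, 7.0]},
    index=["a", "b", "c", "d"],
)

any_missing_dropped = demo.dropna()              # drop rows with at least one NaN
all_missing_dropped = demo.dropna(how="all")     # drop rows where every value is NaN
col_one_required = demo.dropna(subset=["one"])   # drop rows where 'one' is NaN

print(any_missing_dropped)
print(all_missing_dropped)
print(col_one_required)
```

Dropping is the simplest option but discards information, which is why the filling strategies above are often preferred when many rows have gaps.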
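Besides the methods above, there are other ways to fill, such as the mode: each column's missing values are filled with that column's most frequent value. When a column has a large share of missing values, NaN itself may turn out to be the most frequent entry, so drop the NaN values in each column first and take the mode of what remains.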
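A minimal sketch of mode filling follows. The frame demo and its columns grade and score are hypothetical. Note that Series.mode() already ignores NaN by default (dropna=True), so the "drop NaN first" step described above is handled by the call itself.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column
demo = pd.DataFrame(
    {
        "grade": ["A", "B", np.nan, "B", np.nan],
        "score": [90.0, 80.0, 80.0, np.nan, 70.0],
    }
)

# Series.mode() returns a Series because ties are possible; take the first entry.
# NaN is excluded from the candidates by default (dropna=True).
column_modes = {col: demo[col].mode().iloc[0] for col in demo.columns}

# Passing a dict fills each column with its own mode
filled = demo.fillna(column_modes)

print(column_modes)
print(filled)
```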
II. Grouping Data
Use groupby() to split the data into groups.
# groupby: split the data into groups
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df,'\n')
"""
Output:
Points Rank Team Year
0 876 1 Riders 2014
1 789 2 Riders 2015
2 863 2 Devils 2014
3 673 3 Devils 2015
4 741 3 Kings 2014
5 812 4 kings 2015
6 756 1 Kings 2016
7 788 1 Kings 2017
8 694 2 Riders 2016
9 701 4 Royals 2014
10 804 1 Royals 2015
11 690 2 Riders 2017
"""
Grouping by a column:
print(df.groupby('Rank'), '\n')  # split into groups by Rank
"""
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f54f9e6a6d8>
"""
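The output is a pandas DataFrameGroupBy object rather than the grouped data itself.
Viewing the groups:
print(df.groupby('Rank').groups, '\n')  # view the groups
"""
Output: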
{1: Int64Index([0, 6, 7, 10], dtype='int64'),
2: Int64Index([1, 2, 8, 11], dtype='int64'),
3: Int64Index([3, 4], dtype='int64'),
4: Int64Index([5, 9], dtype='int64')}
"""
The result is a dictionary that maps each group key to the row labels belonging to that group.
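Beyond printing the whole dictionary, a grouped object can be queried for the number of groups, the size of each group, and the row labels of one specific key. The sketch below uses a small made-up frame in the same shape as ipl_data; the name demo and its values are assumptions for illustration.

```python
import pandas as pd

# Small hypothetical frame with the same kind of columns as ipl_data
demo = pd.DataFrame(
    {
        "Team": ["Riders", "Devils", "Riders", "Kings"],
        "Rank": [1, 2, 2, 1],
        "Points": [876, 863, 789, 756],
    }
)
grouped = demo.groupby("Rank")

group_count = grouped.ngroups          # number of distinct Rank values
group_sizes = grouped.size()           # rows per Rank value, as a Series
rank1_rows = list(grouped.groups[1])   # row labels where Rank == 1

print(group_count)
print(group_sizes)
print(rank1_rows)
```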
Grouping by multiple columns:
print(df.groupby(['Team', 'Year']).groups)  # group by Team and Year together
"""
Output:
{('Devils', 2014): Int64Index([2], dtype='int64'),
('Devils', 2015): Int64Index([3], dtype='int64'),
('Kings', 2014): Int64Index([4], dtype='int64'),
('Kings', 2016): Int64Index([6], dtype='int64'),
('Kings', 2017): Int64Index([7], dtype='int64'),
('Riders', 2014): Int64Index([0], dtype='int64'),
('Riders', 2015): Int64Index([1], dtype='int64'),
('Riders', 2016): Int64Index([8], dtype='int64'),
('Riders', 2017): Int64Index([11], dtype='int64'),
('Royals', 2014): Int64Index([9], dtype='int64'),
('Royals', 2015): Int64Index([10], dtype='int64'),
('kings', 2015): Int64Index([5], dtype='int64')}
"""
This also returns a dictionary; with several grouping columns, each key is a tuple of values.
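Because the keys are tuples when grouping by several columns, selecting one group takes a tuple as well, and an aggregated result carries a two-level index that reset_index() can flatten. A minimal sketch on made-up data (the frame demo is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame (values chosen for illustration only)
demo = pd.DataFrame(
    {
        "Team": ["Riders", "Riders", "Kings", "Riders"],
        "Year": [2014, 2014, 2014, 2015],
        "Points": [876, 789, 741, 694],
    }
)
by_team_year = demo.groupby(["Team", "Year"])

# Select one group with a tuple key
riders_2014 = by_team_year.get_group(("Riders", 2014))

# Sum Points per (Team, Year) and turn the two index levels back into columns
totals = by_team_year["Points"].sum().reset_index()

print(riders_2014)
print(totals)
```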
Iterating over groups and selecting a group:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
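groupYear = df.groupby('Year')
# Iterate over the groups: each iteration yields the group key and its sub-DataFrame
for name, group in groupYear:
    print(name)
    print(group)
    print("\n")
print("***********\n")
print(groupYear.get_group(2016))  # select a single group by its key
"""
Output: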
2014
Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014
2015
Points Rank Team Year
1 789 2 Riders 2015
3 673 3 Devils 2015
5 812 4 kings 2015
10 804 1 Royals 2015
2016
Points Rank Team Year
6 756 1 Kings 2016
8 694 2 Riders 2016
2017
Points Rank Team Year
7 788 1 Kings 2017
11 690 2 Riders 2017
***********
Points Rank Team Year
6 756 1 Kings 2016
8 694 2 Riders 2016
"""
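III. Aggregation Functions
An aggregation function returns a single aggregated value for each group. Once a grouped object has been created, several aggregation operations can be run on the grouped data. pandas performs aggregation through the agg() method.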
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
groupYear = df.groupby('Year')
print(groupYear['Points'].agg('mean'), '\n')  # mean Points per year
print(groupYear.agg(np.size))  # size of each group
"""
Output:
Year
2014 795.25
2015 769.50
2016 725.00
2017 739.00
Name: Points, dtype: float64
Points Rank Team
Year
2014 4 4 4
2015 4 4 4
2016 2 2 2
2017 2 2 2
"""
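Group size and the count of valid values differ once a group contains missing data, which ties this section back to Part I. The sketch below uses a made-up frame with one NaN to show the difference between size(), count(), and mean() on a grouped column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second 2014 row has no Points value
demo = pd.DataFrame(
    {"Year": [2014, 2014, 2015, 2015], "Points": [876.0, np.nan, 789.0, 812.0]}
)
grouped = demo.groupby("Year")

sizes = grouped.size()              # counts rows, NaN included
counts = grouped["Points"].count()  # counts non-missing values only
means = grouped["Points"].mean()    # NaN is skipped when averaging

print(sizes)
print(counts)
print(means)
```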
Applying several aggregation functions at once:
groupTeam = df.groupby('Team')
print(groupTeam['Points'].agg(['sum', 'mean', 'std']))  # one output column per function
"""
Output:
sum mean std
Team
Devils 1536 768.000000 134.350288
Kings 2285 761.666667 24.006943
Riders 3049 762.250000 88.567771
Royals 1505 752.500000 72.831998
kings 812 812.000000 NaN
"""
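When different columns need different functions, named aggregation lets each output column be declared as a (source column, function) pair. This is a minimal sketch on made-up data; the output names total_points, best_rank, and seasons are hypothetical labels chosen for illustration.

```python
import pandas as pd

# Hypothetical frame (values chosen for illustration only)
demo = pd.DataFrame(
    {
        "Team": ["Riders", "Riders", "Kings", "Kings"],
        "Rank": [1, 2, 3, 1],
        "Points": [876, 789, 741, 756],
    }
)

# Each keyword becomes an output column: name=(source column, aggregation)
summary = demo.groupby("Team").agg(
    total_points=("Points", "sum"),
    best_rank=("Rank", "min"),
    seasons=("Points", "count"),
)

print(summary)
```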
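Filtering: filter() keeps only the groups for which the given function returns True. Here, only teams with at least three records are kept.
print(df.groupby('Team').filter(lambda x: len(x) >= 3))
"""
Output: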
Points Rank Team Year
0 876 1 Riders 2014
1 789 2 Riders 2015
4 741 3 Kings 2014
6 756 1 Kings 2016
7 788 1 Kings 2017
8 694 2 Riders 2016
11 690 2 Riders 2017
"""
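The filter condition does not have to be the group length; any function that returns a single True or False per group works. As a sketch on made-up data, the example below keeps only teams whose average Points exceed a threshold (the value 750 is an arbitrary assumption).

```python
import pandas as pd

# Hypothetical frame (values chosen for illustration only)
demo = pd.DataFrame(
    {
        "Team": ["Riders", "Riders", "Kings", "Kings", "Devils"],
        "Points": [876, 789, 741, 756, 673],
    }
)

# Keep every row of each team whose mean Points is above 750
strong_teams = demo.groupby("Team").filter(lambda g: g["Points"].mean() > 750)

print(strong_teams)
```

Note that filter() returns the original rows of the surviving groups, not one aggregated row per group.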
This article was shared from the WeChat official account Python爬虫scrapy.