Python Essentials: How to Handle Missing Data in Pandas

Source: 企鹅号 (Penguin Account) - 疯狂的python

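Missing data occurs when no information is provided for one or more items or for an entire unit. In real-world datasets, missing data is a major problem, and the lost values usually cannot be recovered.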

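In pandas, missing data is represented by two values:

None: None is a Python singleton object that is often used to represent missing data in Python code.

NaN (short for Not a Number): a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

pandas treats None and NaN as essentially interchangeable for indicating missing or null values.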
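To make the interchangeability concrete, here is a minimal sketch (the Series names `s` and `t` are illustrative, not from the original article). It assumes a recent pandas and shows that None is converted to NaN in a numeric Series, while an object Series keeps None but still reports it as missing.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN on construction
s = pd.Series([1.0, None, np.nan])
print(s.dtype)
print(s.isnull().tolist())

# An object Series keeps None as-is, but isnull() still flags it as missing
t = pd.Series(["a", None, np.nan], dtype=object)
print(t.isnull().tolist())
```

In both cases the second and third entries are reported as missing, regardless of whether they were written as None or np.nan.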

A pandas DataFrame provides several useful functions for detecting, removing, and replacing null values:

isnull()

notnull()

dropna()

fillna()

replace()

interpolate()

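Using isnull() and notnull()

The isnull() and notnull() functions check for missing values in a pandas DataFrame.

Using isnull()

To check for null values in a pandas DataFrame, we use isnull(), which returns a DataFrame of Boolean values that are True wherever the value is NaN.

Code 1: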

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, 45, 56, np.nan],

'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

# using isnull() function

df.isnull()

Output:

Code 2:

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("employees.csv")

# creating bool series True for NaN values

bool_series = pd.isnull(data["Gender"])

# filtering data

# displaying data only with Gender = NaN

data[bool_series]

Output:

As the output shows, only the rows where Gender is null are displayed.
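Beyond filtering, the Boolean result of isnull() can be aggregated. The following is a minimal sketch, reusing the score data from Code 1, that counts missing values per column and selects the rows containing at least one NaN. The variable names `missing_per_column` and `rows_with_missing` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# True counts as 1, so summing the Boolean mask counts NaN per column
missing_per_column = df.isnull().sum()

# any(axis=1) is True for a row if any of its columns is NaN
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Each column here has exactly one missing value, and rows 0, 2, and 3 each contain one.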

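Using notnull()

To check for non-null values in a pandas DataFrame, we use notnull(), which returns a DataFrame of Boolean values that are False wherever the value is NaN.

Code 3: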

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, 45, 56, np.nan],

'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe using dictionary

df = pd.DataFrame(dict)

# using notnull() function

df.notnull()

Output:

Code 4:

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("employees.csv")

# creating bool series False for NaN values

bool_series = pd.notnull(data["Gender"])

# filtering data

# displaying data only with Gender = Not NaN

data[bool_series]

Output:

As the output shows, only the rows where Gender is not null are displayed.
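Since employees.csv is not included with the article, the sketch below uses a small hypothetical stand-in (the names and values are invented for illustration) to show the same filter and to check that notnull() is the element-wise negation of isnull().

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv; values are made up
data = pd.DataFrame({"First Name": ["Ann", "Bob", "Cy"],
                     "Gender": ["Female", np.nan, "Male"]})

# True where Gender is present, False where it is missing
mask = data["Gender"].notnull()
known = data[mask]

# notnull() is the element-wise negation of isnull()
same = mask.equals(~data["Gender"].isnull())

print(known)
print(same)
```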

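Using fillna(), replace() and interpolate()

The fillna(), replace(), and interpolate() functions replace NaN values with values of our choosing, filling in the null values of a DataFrame.

interpolate() fills NA values using various interpolation techniques rather than a hard-coded value.

Code 1: Filling null values with a single value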

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, 45, 56, np.nan],

'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

# filling missing value using fillna()

df.fillna(0)

Output:

Code 2: Filling null values with the previous value

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, 45, 56, np.nan],

'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

# filling a missing value with

# previous ones

# ffill() is the non-deprecated equivalent of fillna(method='pad')

df.ffill()

Output:

Code 3: Filling null values with the next value

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, 45, 56, np.nan],

'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

# filling null values with the next valid value; bfill() is the

# non-deprecated equivalent of fillna(method='bfill')

df.bfill()

Output:
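To make the difference between forward and backward filling concrete, here is a minimal sketch on the same score data. It uses DataFrame.ffill() and DataFrame.bfill(), the method forms of these two fill directions. The names `forward` and `backward` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# Forward fill: each NaN takes the last valid value above it
forward = df.ffill()

# Backward fill: each NaN takes the next valid value below it
backward = df.bfill()

print(forward)
print(backward)
```

Note that a forward fill leaves the NaN in the first row of Third Score untouched, because nothing precedes it, and a backward fill likewise leaves the NaN in the last row of Second Score.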

Code 4: Filling null values in a CSV file

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("employees.csv")

# Printing the first 10 to 24 rows of

# the data frame for visualization

data[10:25]

Now we fill all null values in the Gender column with "No Gender".

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("employees.csv")

# filling null values using fillna(); assigning the result back is

# more reliable than inplace=True on a selected column

data["Gender"] = data["Gender"].fillna("No Gender")

data

Output:

Code 5: Filling null values with the replace() method

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv("employees.csv")

# Printing the first 10 to 24 rows of

# the data frame for visualization

data[10:25]

Output:

Now we replace all NaN values in the DataFrame with -99.

# importing pandas package

import pandas as pd

# importing numpy, needed for np.nan below

import numpy as np

# making data frame from csv file

data = pd.read_csv("employees.csv")

# will replace Nan value in dataframe with value -99

data.replace(to_replace = np.nan, value = -99)

Output:
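For a self-contained check of this replacement, the sketch below uses a small hypothetical numeric frame (the column names and values are invented, not taken from employees.csv) and compares replace() with fillna() for the same constant.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data standing in for the CSV file
data = pd.DataFrame({"Salary": [97308.0, np.nan, 130590.0],
                     "Bonus %": [6.945, 4.170, np.nan]})

# Replace every NaN with -99
replaced = data.replace(to_replace=np.nan, value=-99)

# For a constant fill value, fillna() gives the same result
filled = data.fillna(-99)

print(replaced)
print(replaced.equals(filled))
```

Both calls return a new DataFrame and leave `data` unchanged.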

Code 6: Filling missing values with the interpolate() function using the linear method.

# importing pandas as pd

import pandas as pd

# Creating the dataframe

df = pd.DataFrame({"A":[12, 4, 5, None, 1],

"B":[None, 2, 54, 3, None],

"C":[20, 16, None, 3, 8],

"D":[14, 3, None, None, 6]})

# Print the dataframe

df

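Let us interpolate the missing values using the linear method. Note that the linear method ignores the index and treats the values as equally spaced.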

# to interpolate the missing values

df.interpolate(method ='linear', limit_direction ='forward')

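Output:

As the output shows, the value in the first row of column B could not be filled, because the fill direction is forward and there is no preceding value to interpolate from.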
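The interpolated values can be checked by hand. The sketch below repeats the call on the same data and, as an additional assumption not covered in the original text, also tries limit_direction='both', which lets the leading NaN in column B take the nearest valid value.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# Forward direction: gaps between valid values are filled linearly,
# trailing NaN repeats the last valid value, leading NaN is left alone
result = df.interpolate(method='linear', limit_direction='forward')

# Both directions: the leading NaN is filled as well
both = df.interpolate(method='linear', limit_direction='both')

print(result)
print(both)
```

For example, the gap in column A lies halfway between 5 and 1, so it becomes 3.0, and the two gaps in column D between 3 and 6 become 4.0 and 5.0.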

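Using dropna()

To remove null values from a DataFrame, we use the dropna() function, which can drop the rows or columns that contain null values in several different ways.

Code 1: Dropping rows with at least one null value.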

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, np.nan, 45, 56],

'Third Score':[52, 40, 80, 98],

'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

Now we drop the rows with at least one NaN (null) value.

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, 90, np.nan, 95],

'Second Score': [30, np.nan, 45, 56],

'Third Score':[52, 40, 80, 98],

'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

# using dropna() function

df.dropna()

Output:

Code 2: Dropping rows only if all values in the row are missing.

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, np.nan, np.nan, 95],

'Second Score': [30, np.nan, 45, 56],

'Third Score':[52, np.nan, 80, 98],

'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

Now we drop the rows in which every value is missing (NaN).

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, np.nan, np.nan, 95],

'Second Score': [30, np.nan, 45, 56],

'Third Score':[52, np.nan, 80, 98],

'Fourth Score':[np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(dict)

# using dropna() function

df.dropna(how = 'all')

Output:

Code 3: Dropping columns with at least one null value.

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, np.nan, np.nan, 95],

'Second Score': [30, np.nan, 45, 56],

'Third Score':[52, np.nan, 80, 98],

'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

Now we drop the columns with at least one missing value.

# importing pandas as pd

import pandas as pd

# importing numpy as np

import numpy as np

# dictionary of lists

dict = {'First Score':[100, np.nan, np.nan, 95],

'Second Score': [30, np.nan, 45, 56],

'Third Score':[52, np.nan, 80, 98],

'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary

df = pd.DataFrame(dict)

# using dropna() function

df.dropna(axis = 1)

Output:
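To compare the dropna() variants side by side, here is a minimal sketch that applies them all to the frame from Code 2, whose second row is entirely NaN. The result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, np.nan, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, np.nan, 80, 98],
                   'Fourth Score': [np.nan, np.nan, np.nan, 65]})

rows_any = df.dropna()                    # drop rows with any NaN
rows_all = df.dropna(how='all')           # drop rows that are all NaN
cols_any = df.dropna(axis=1)              # drop columns with any NaN
cols_all = df.dropna(axis=1, how='all')   # drop columns that are all NaN

print(rows_any.index.tolist())
print(rows_all.index.tolist())
print(cols_any.shape)
print(cols_all.shape)
```

Because the all-NaN row puts a NaN into every column, dropping columns with any NaN leaves no columns at all here, while no column is entirely NaN, so how='all' keeps all four.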

Code 4: Dropping rows with at least one null value in a CSV file

# importing pandas module

import pandas as pd

# making data frame from csv file

data = pd.read_csv("employees.csv")

# making new data frame with dropped NA values

new_data = data.dropna(axis = 0, how ='any')

new_data

Output:

Now we compare the lengths of the two DataFrames to find out how many rows had at least one null value.

print("Old data frame length:", len(data))

print("New data frame length:", len(new_data))

print("Number of rows with at least 1 NA value: ", (len(data)-len(new_data)))

Output:

Old data frame length: 1000

New data frame length: 764

Number of rows with at least 1 NA value: 236

Since the difference is 236, there are 236 rows that have at least one null value in some column.
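The same count can be obtained without building a second DataFrame. The sketch below uses a small hypothetical frame (invented values, standing in for employees.csv) and checks that the length difference matches the number of rows flagged by isnull().any(axis=1).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv; values are made up
data = pd.DataFrame({"First Name": ["Ann", None, "Cy", "Dee"],
                     "Gender": ["Female", "Male", None, "Female"],
                     "Salary": [50000.0, 62000.0, np.nan, 71000.0]})

# Approach from the article: drop the rows, then compare lengths
new_data = data.dropna(axis=0, how='any')
n_dropped = len(data) - len(new_data)

# Alternative: count rows where any column is null
n_masked = int(data.isnull().any(axis=1).sum())

print(n_dropped, n_masked)
```

A row with several missing values, such as the third row here, is still counted only once by both approaches.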

Published: 2020-11-16 20:38:27
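Original link: https://kuaibao.qq.com/s/20201116A0GJKH00?refer=cp_1026
The Tencent Cloud Developer Community (腾讯云开发者社区) is one of the distribution channels for Tencent Content Open Platform accounts (企鹅号); this content is republished under the Tencent Content Open Platform Service Agreement.
In case of infringement, please contact cloudcommunity@tencent.com for removal.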
