It was before the Stack Overflow era, so not much help was available online. Some people would print out cheatsheets of different kinds and hang them on the walls around their workstations. Having a couple of pages of frequently used code in front of the desk was an efficient way of correcting syntax errors.
Help is now at our fingertips, only a few clicks away. But an old-fashioned cheatsheet is still a valuable time-saving tool, even more so if you have to juggle multiple programming languages.
Data scientists spend most of their time on data wrangling, so being efficient at it is a valuable skill. The purpose of this article is to show how to build a "cheatsheet" for data wrangling following a typical analytics workflow. I am not going to write down all the code needed at every step of the way; rather, I'll focus on how to compile a cheatsheet that serves your purpose, so you can spend more time coding and less time searching for the right syntax.
Soon after firing up the IDE/text editor, you may very well find yourself staring at a blank screen. Loading the dataset(s) is probably the first thing that comes to mind. So let's start there and import a couple of libraries.
import pandas as pd
import numpy as np
Then comes setting up the working directory.
import os
# show current working directory
os.getcwd()
# list files in the directory
os.listdir()
# change working directory
os.chdir("/PATH/TO/NEW/DIRECTORY")
Now load in your datasets from the repository (desktop, cloud, SQL server, wherever they are stored). It's a good idea to make a copy of the original dataset and work with the copy, because you'll be making a lot of modifications and will want the original intact.
# import from a csv file
data = pd.read_csv("../Datasets/iris.csv")
# copying a dataset
df = data.copy()
# call the head function
df.head()
I'm importing the iris dataset from my local machine. The yellow-highlighted cells are some changes I deliberately made to the original data; you'll see why shortly.
df.info()
I'd call the info() method first to check a few things. The yellow highlight seems odd, as I was expecting petal_length to be a "float" data type rather than an "object". We'll deal with it later on.
Among other things, info() tells you the number of rows and columns in the dataset, but you can also get them with the shape attribute, so that's good to have in the cheatsheet as well.
# number of rows and columns
df.shape
>> (150, 5)
Some datasets may have too many columns to fit on the computer screen. You can check column names with the columns attribute instead.
# column names
df.columns
>> Index(['sepal_length', 'SEpal_Width', 'petal_length', 'petal_width',
'species'],
dtype='object')
For categorical variables you'll want to know how many categories there are;
# number of unique values
df["species"].nunique()
>> 4
and also the names of those categories;
# name of the unique values
df["species"].unique()
>> array(['setosa', 'setosaaa', 'versicolor', 'virginica'], dtype=object)
and the count of rows in each category.
# count of categorical data
df["species"].value_counts()
>>
versicolor 50
virginica 50
setosa 49
setosaaa 1
Name: species, dtype: int64
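The count exposes one of the deliberate errors: a single row is misspelled as "setosaaa". A minimal sketch of one way to fix it with replace():
# fix the misspelled category
df["species"] = df["species"].replace("setosaaa", "setosa")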
Missing values are no surprise. The yellow-highlighted cells in the dataframe above are NaN values. You can count the missing values per feature by typing the following:
# show NaN values per feature
df.isnull().sum()
>>
sepal_length 2
SEpal_Width 0
petal_length 0
petal_width 0
species 0
dtype: int64
You could also obtain missing values as a percentage of total observations (quite useful for large datasets).
# NaN values as % of total observations
df.isnull().sum()*100/len(df)
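And to inspect the offending rows themselves rather than just counting them, one small sketch:
# show only the rows containing at least one NaN
df[df.isnull().any(axis=1)]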
In the iris dataset we have two missing values in the sepal_length column. Now that we've found them, what do we do with them? You can do one of the following:
a ) drop the rows or columns containing null values;
## Drop row/column ##
#####################
# drop all rows containing null
df.dropna()
# drop all columns containing null
df.dropna(axis=1)
# drop columns that have fewer than 5 non-NaN values
df.dropna(axis=1, thresh=5)
b ) or, replace/impute missing cells with some other values;
## Replace values ##
####################
# replace all na values with -9999
df.fillna(-9999)
# additional tip: you can also replace a specific cell value
df.at[1, "sepal_length"] = 9999
# convert placeholder values (e.g. -9999) back to NaN
df.replace(-9999, np.nan)
# fill na values with strings
df.fillna("data missing")
# fill missing values with column means (numeric columns only)
df.fillna(df.mean(numeric_only=True))
# replace na values of specific columns with mean value
df["sepal_length"].fillna(df["sepal_length"].mean())
c ) or, if it’s time-series data, interpolation is a great way to impute data.
## Interpolate ##
#################
# interpolation of missing values (useful in time-series)
df.interpolate()  # whole dataframe
df["sepal_length"].interpolate()  # specific column
Not all columns in the dataset are of interest, sometimes we select specific columns for analytics or building a model. Subsetting allows you to do that.
There are two key ways to select columns: by column name and by column position:
# select a column by column name
df["sepal_length"]
# select multiple columns by column name
df[["sepal_length", "sepal_width", "petal_length", "spp"]]
# select columns by column number (here positions 2 and 3)
df.iloc[:, 2:4]
# select multiple columns by column number
df.iloc[:, [1,3,4]]
But what if you want to subset data by dropping a column?
# drop a column
df.drop("sepal_length", axis=1)
Now let's say you want to create a new calculated column from an existing one, for example converting sepal length from centimeters to millimeters:
sepal_len_mm = sepal_length * 10
Creating new calculated columns is often a big part of feature engineering.
# add new calculated column
df['new'] = df["sepal_length"]*2
# create a conditional calculated column
df['newcol'] = ["short" if i<3 else "long" for i in df["sepal_width"]]
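An equivalent, usually faster way to build the conditional column is numpy's where(), expressing the same short/long logic as the list comprehension above:
# vectorized conditional column with numpy
df["newcol"] = np.where(df["sepal_width"] < 3, "short", "long")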
Sometimes re-coding may be needed to convert categorical string values to numeric values.
df.replace({"Species":{"setosa":1, "versicolor":2, "virginica":3}})
If aggregation of column values is needed (mean, median, etc.), pandas has native aggregation functions that can be applied directly to the dataframe.
# calculate mean of each of two columns
df[["sepal_length", "sepal_width"]].mean()
# calculate sum and mean of each column
df[["sepal_length", "sepal_width"]].agg([np.sum, np.mean])
And finally, some bonus syntax useful to work with columns:
# transposing a dataset
df.T
# create a list of columns
df.columns.tolist()
# sorting values in ascending order
df.sort_values(by = "sepal_width", ascending = True)
# change column name
df.rename(columns={"old_name": "new_name"})
Filtering is an important part of exploratory data analysis, drawing insights and building KPIs.
There are many ways to filter data depending on the analytics needs, such as:
a) using the row index location:
# select rows by position 3 through 9
df.iloc[3:10]
# select rows by index name
df.loc[["index1", "index2"]]
# finding rows with specific strings
df[df["species"].isin(["setosa"])]
b) conditional filtering
# simple conditional filtering to filter rows with sepal_length>=5
df.query('sepal_length>=5')
# or
df[df.sepal_length>= 5]
# filtering rows with multiple values e.g. 0.2, 0.3
df[df["petal_length"].isin([0.2, 0.3])]
# multi-conditional filtering
df[(df.sepal_length>1) & (df.species=="setosa") | (df.sepal_width<3)]
And finally, here’s how you’d get rid of a row if needed.
# drop rows
df.drop(df.index[1]) # 1 is row index to be deleted
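In practice it's often safer to drop rows by condition rather than by positional index; for example, to discard the misspelled rows found earlier:
# keep everything except the misspelled category
df = df[df["species"] != "setosaaa"]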
Like filtering, grouping is another important part of exploratory data analysis and data visualization. The key function for this task is groupby(), which is mainly used for aggregating rows based on categorical features.
# return a dataframe object grouped by "species" column
df.groupby("species")
After the dataframe is grouped, you could apply different functions to it, for example, getting aggregate values of numeric columns:
# return the mean of a column grouped by "species" categories
df["sepal_length"].groupby(df["species"]).mean()
Or you can apply such aggregate function to multiple features:
# group by "species", then apply multiple operations to each feature
df.groupby("species").agg(["sum", "mean", "std"])
If you know SQL, I don't have to explain how important joining is. Python and pandas have functions such as merge(), join(), and concat() for SQL-style joining. If SQL is your primary database tool you probably won't have to do much joining in Python, but you should nevertheless add the following code to your cheatsheet.
# SQL style joining
df1 = df[["sepal_length", "sepal_width"]]
df2 = df[["sepal_length", "petal_length"]]
dfx = pd.concat([df1, df2], axis = 1)
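Note that concat() simply glues the two dataframes together side by side. For a true SQL-style join on a key column, merge() is the closer analogue; a minimal sketch reusing df1 and df2 from above (dfm is just an illustrative name, and joining on the shared sepal_length column makes this a many-to-many join in this toy case):
# SQL-style inner join on a key column
dfm = pd.merge(df1, df2, on="sepal_length", how="inner")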
Cliché to say, but it's true that most data scientists spend 70-80% of their time on data cleanup before doing any advanced analytics. Even experienced programmers spend a significant amount of time searching the internet (e.g., Stack Overflow) for the right code or syntax for their project. Having an old-fashioned cheatsheet, whether printed on paper or written in a Jupyter notebook, is still a valuable asset. Keeping the most frequently used code in one place can save a significant amount of time and energy.
Thanks for reading!