
[008] Not sure where to start with data cleaning? This checklist comes highly recommended

Sam Gor · 2020-12-02

“The author systematically walks through the code used at every step, from receiving the raw data to finishing the clean-up. It's a very good checklist and an article well worth bookmarking! When I find time I'll also put together a Chinese version laid out on a single page, handy to print out and pin up, hahaha!

It was before the Stack Overflow era, so not much help was available online. Some people would print out cheatsheets of various kinds and hang them on the walls around their workstations. Having a couple of pages of frequently used code in front of the desk was an efficient way of correcting syntax errors.

Help is now at your fingertips, only a few clicks away. But an old-fashioned cheatsheet is still a valuable time-saving tool, even more so if you have to juggle multiple programming languages.

Data scientists spend most of their time on data wrangling, so being efficient at it is a valuable skill. The purpose of this article is to show how to build a "cheatsheet" for data wrangling that follows a typical analytics workflow. I am not going to write down all the code needed at every step of the way; rather, I'll focus on how to compile a cheatsheet that serves your purpose, so you can spend more time coding and less time searching for the right syntax.

1) Setup

Soon after firing up the IDE or text editor, you may very well be staring at a blank screen. Loading in the dataset(s) is probably the first thing that comes to mind. So let's start there and import a couple of libraries.

import pandas as pd
import numpy as np

Then comes checking and, if needed, changing the working directory.

import os
# show current working directory
os.getcwd()

# list files in the directory
os.listdir()

# change working directory
os.chdir("/PATH/TO/NEW/DIRECTORY")

2) Loading data

Now load in your dataset from wherever it is stored (desktop, cloud, SQL server). It's a good idea to make a copy of the original dataset and work with the copy, because you'll be making a lot of modifications along the way.

# import from a csv file
data = pd.read_csv("../Datasets/iris.csv")

# copying a dataset
df = data.copy()

# call the head function
df.head()

I'm importing the iris dataset from my local machine. A few cells were deliberately changed in the original data; you'll see why shortly.
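The data won't always be a CSV file on your desktop. For the cloud and SQL-server cases mentioned above, here is a hedged sketch of alternative loaders; the file name, connection string and table name are hypothetical:

# import from an Excel file (hypothetical file name)
data = pd.read_excel("../Datasets/iris.xlsx")

# import from a SQL database (hypothetical connection string and table name)
from sqlalchemy import create_engine
engine = create_engine("sqlite:///../Datasets/iris.db")
data = pd.read_sql("SELECT * FROM iris", engine)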

3) Initial data screening

df.info()

I'd call the info() method first to check a few things. One thing looks odd: petal_length shows up as an "object" data type when I was expecting a "float". We'll deal with it later on.
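For reference, here is a minimal sketch of how that fix could look when we get to it, assuming any entry that can't be parsed as a number should simply become NaN:

# coerce petal_length to numeric; unparseable entries become NaN
df["petal_length"] = pd.to_numeric(df["petal_length"], errors="coerce")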

Among other things, info() tells you the number of rows and columns in the dataset, but you can also get them with the shape attribute, so it's good to have that in the cheatsheet as well.

# number of rows and columns
df.shape

>> (150, 5)

Some datasets have too many columns to fit on the screen; you can check the column names with the columns attribute instead.

# column names
df.columns

>> Index(['sepal_length', 'SEpal_Width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

For categorical variables you'll want to know how many categories there are;

# number of unique values
df["species"].nunique()

>> 4

and also the names of those categories;

# name of the unique values
df["species"].unique()

>> array(['setosa', 'setosaaa', 'versicolor', 'virginica'], dtype=object)

and the count of rows in each category.

# count of categorical data
df["species"].value_counts()

>> 
versicolor    50
virginica     50
setosa        49
setosaaa       1
Name: species, dtype: int64
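That stray "setosaaa" is one of the deliberate edits mentioned earlier. A minimal clean-up sketch, assuming it should read "setosa":

# fix the typo'd category label
df["species"] = df["species"].replace("setosaaa", "setosa")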

4) Missing value treatment

Missing values are no surprise; a couple of the cells that were deliberately blanked out earlier are NaN values. You can count the missing values per feature by typing the following:

# show NaN values per feature
df.isnull().sum()

>>
sepal_length    2
SEpal_Width     0
petal_length    0
petal_width     0
species         0
dtype: int64

You could also obtain missing values as a percentage of total observations (it is quite useful for large datasets).

# NaN values as % of total observations
df.isnull().sum()*100/len(df)

In the iris dataset we have two missing values in the sepal_length column. Now that we've found them, what do we do with them? You can do one of the following:

a) drop the rows or columns containing null values;

## Drop row/column ##
#####################
# drop all rows containing null
df.dropna()

# drop all columns containing null
df.dropna(axis=1)

# drop columns that have fewer than 5 non-NaN values
df.dropna(axis=1, thresh=5)
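A hedged variant of (a) that is often safer: only drop rows where specific columns are null.

# drop only the rows where sepal_length is null
df.dropna(subset=["sepal_length"])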

b) or, replace/impute missing cells with some other values;

## Replace values ##
####################

# replace all na values with -9999
df.fillna(-9999)

# additional tip: you can also replace a specific cell value
df.at[1, "sepal_length"] = 9999

# fill na values with NaN
df.fillna(np.NaN)

# fill na values with strings
df.fillna("data missing")

# fill missing values with mean column values
df.fillna(df.mean())

# replace na values of specific columns with mean value
df["sepal_length"].fillna(df["sepal_length"].mean())

c) or, if it's time-series data, interpolation is a great way to impute data.

## Interpolate ##
#################

# interpolation of missing values (useful in time-series)
df.interpolate()                 # whole dataframe
df["sepal_length"].interpolate() # specific column

5) Subsetting & working with columns

Not all columns in a dataset are of interest; sometimes we select specific columns for analytics or for building a model. Subsetting allows you to do that.

There are two key ways to select columns: by column names and by column positions:

# select a column by column name
df["sepal_length"]

# select multiple columns by column name
df[["sepal_length", "sepal_width", "petal_length", "spp"]]

# select columns by column number (positions 2 and 3)
df.iloc[:, 2:4]

# select multiple columns by column number
df.iloc[:, [1,3,4]]
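One more subsetting trick worth a slot on the cheatsheet: selecting columns by data type.

# select only numeric columns
df.select_dtypes(include="number")

# select only object (string) columns
df.select_dtypes(include="object")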

But what if you want to subset data by dropping a column?

# drop a column 
df.drop("sepal_length", axis=1)

Now let's say you want to create a new column calculated from an existing one, for example:

sepal_len_cm = sepal_length * 10

Creating new calculated columns is often a big part of feature engineering.

# add new calculated column
df['new'] = df["sepal_length"]*2

# create a conditional calculated column
df['newcol'] = ["short" if i<3 else "long" for i in df["sepal_width"]] 
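An equivalent way to build that conditional column, and usually a faster one on large frames, is np.where; a minimal sketch:

# vectorised conditional column with numpy
df["newcol"] = np.where(df["sepal_width"] < 3, "short", "long")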

Sometimes re-coding may be needed to convert categorical string values to numeric values.

df.replace({"Species":{"setosa":1, "versicolor":2, "virginica":3}})

If aggregation of column values is needed (mean, median, etc.), Python and NumPy have native functions that can be applied to the dataframe.

# calculate mean of each of two columns
df[["sepal_length", "sepal_width"]].mean()

# calculate sum and mean of each column
df[["sepal_length", "sepal_width"]].agg([np.sum, np.mean])

And finally, some bonus syntax useful to work with columns:

# transposing a dataset
df.T

# create a list of columns
df.columns.tolist()

# sorting values in ascending order
df.sort_values(by = "sepal_width", ascending = True)

# change column name
df.rename(columns={"old_name": "new_name"})

6) Filtering: working with rows

Filtering is an important part of exploratory data analysis, drawing insights and building KPIs.

There are many ways to filter data depending on the analytics needs, such as:

a) using the row index location:

# select rows by position (3 up to, but not including, 10)
df.iloc[3:10, :]

# select rows by index label
df.loc[["index1", "index2"]]

# finding rows with specific strings
df[df["species"].isin(["setosa"])]

b) conditional filtering

# simple conditional filtering to filter rows with sepal_length>=5
df.query('sepal_length>=5') 

# or
df[df.sepal_length>= 5]

# filtering rows with multiple values e.g. 0.2, 0.3
df[df["petal_length"].isin([0.2, 0.3])]

# multi-conditional filtering
df[(df.sepal_length>1) & (df.species=="setosa") | (df.sepal_width<3)]
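Partial string matches are also a common filter; a short sketch using str.contains:

# rows where the species name contains "setosa" (matches "setosa" and "setosaaa")
df[df["species"].str.contains("setosa")]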

And finally, here’s how you’d get rid of a row if needed.

# drop the row at position 1 (df.index[1] looks up its label)
df.drop(df.index[1])

7) Grouping

Like filtering, grouping is another important part of exploratory data analysis and data visualization. The key function for this task is groupby(), which is mainly used for aggregating rows based on categorical features.

# return a dataframe object grouped by "species" column
df.groupby("species")

After the dataframe is grouped, you could apply different functions to it, for example, getting aggregate values of numeric columns:

# return the mean of a column, grouped by the "species" categories
df["sepal_length"].groupby(df["species"]).mean()

Or you can apply such aggregate functions to multiple features at once:

# group by "species", then apply multiple operations to each feature
df.groupby("species").agg([np.sum, np.mean, np.std])

8) Joining/merging

If you know SQL, I don't have to explain how important joining is. pandas has functions such as merge(), join(), and concat() for SQL-style joining. If SQL is your primary data store you probably won't have to do much joining in Python, but you should still add the following code to your cheatsheet.

# column-wise concatenation of two sub-dataframes
df1 = df[["sepal_length", "sepal_width"]]
df2 = df[["sepal_length", "petal_length"]]
dfx = pd.concat([df1, df2], axis = 1)
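The snippet above is column-wise concatenation rather than a keyed join; for a true SQL-style join, merge() is the closer analogue. A hedged sketch, using a small made-up lookup table as the right-hand side:

# SQL-style left join: attach a lookup table by key (the lookup values are illustrative)
lookup = pd.DataFrame({"species": ["setosa", "versicolor", "virginica"],
                       "species_id": [1, 2, 3]})
dfm = pd.merge(df, lookup, on="species", how="left")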

Final word

Cliché to say, but it's true that most data scientists spend 70–80% of their time on data clean-up before doing any advanced analytics. Even experienced programmers spend a significant amount of time on the internet (e.g. Stack Overflow) searching for the right code or syntax for their project. An old-fashioned cheatsheet, whether printed on paper or kept in a Jupyter notebook, is still a valuable asset. Having the most frequently used code in one place can save a significant amount of time and energy.

Thanks for reading!
