前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >【译】Python中的数据清洗 |Pythonic Data Cleaning With NumPy and Pandas(一)

【译】Python中的数据清洗 |Pythonic Data Cleaning With NumPy and Pandas(一)

作者头像
needrunning
发布2019-10-08 16:35:25
8890
发布2019-10-08 16:35:25
举报
文章被收录于专栏:图南科技图南科技

python中的数据清洗 | Pythonic Data Cleaning With NumPy and Pandas[1]

Python中的数据清洗入门文章,阅读需要一些耐心

生词释意

a handful of columns 少量字段

roughly 初略的 大体的

enforce 强迫实施 执行

github 库

https://github.com/realpython/python-data-cleaning[2]

数据集

  • BL-Flickr-Images-Book.csv[3] – A CSV file containing information about books from the British Library
  • university_towns.txt[4] – A text file containing names of college towns in every US state
  • olympics.csv[5] – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

这篇文章中主要有以下几个 content

“数据集读取 按需删除字段 清理字段

代码语言:javascript
复制
>>> import pandas as pd
>>> import numpy as np

Dropping Columns in a DataFrame

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades.

In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy way of removing unwanted columns or rows from a DataFramewith the `drop()`[6] function. Let’s look at a simple example where we drop a number of columns from a DataFrame.

读取数据集

First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the examples below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:

When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks.

我们使用 head()方法查看数据集的前几列基本信息。只有少量的字段对数据是有用的。

通过以下方式删除

代码语言:javascript
复制
>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']

>>> df.drop(to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object.

我们把需要删除的列,单独以列表的形式,传递给 drop 方法,即可删除

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:

重新查看列数

使用定位函数查看

代码语言:javascript
复制
>>> df.loc[206]
Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

Tidying up Fields in the Data 整理字段

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype[7], which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:

代码语言:javascript
复制
>>> df.get_dtype_counts()
object    6

get_dtype_counts 返回此对象中唯一 dtypes 的计数. 如果一列中含有多个类型,则该列的类型会是 object,同样字符串类型的列也会被当成 object 类型.

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

出版日期应该强制转换为数字型,方便后续做计算

代码语言:javascript
复制
df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905           1888
1929    1839, 38-54
2836        [1897?]
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

一本确定的书,仅有一个确定的出版日期,因此我们需要做以下操作:

Remove the extra dates in square brackets, wherever present: 1879 [1878] 移除中括号内额外的日期

Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54

Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]

完全清除不确定的日期,用 NumPy 的 NaN 类型替代

Convert the string nan to NumPy’s NaN value

转换 string nan 为 NumPy’s NaN

“统计数据每列为空的数据个数的统计

代码语言:javascript
复制
df.isnull().sum()

“查看数据的类型统计

代码语言:javascript
复制
df.get_dtype_counts()

“dataframe 的时候 发现所有 string 类型的 column 都是 object 类型

原文中还有一部分关于数据清理的操作,下篇文章继续翻译和解读。

参考资料

[1]

Pythonic Data Cleaning With NumPy and Pandas: https://realpython.com/python-data-cleaning-numpy-pandas/

[2]

https://github.com/realpython/python-data-cleaning: https://github.com/realpython/python-data-cleaning

[3]

BL-Flickr-Images-Book.csv: https://github.com/realpython/python-data-cleaning/blob/master/Datasets/BL-Flickr-Images-Book.csv

[4]

university_towns.txt: https://github.com/realpython/python-data-cleaning/blob/master/Datasets/university_towns.txt

[5]

olympics.csv: https://github.com/realpython/python-data-cleaning/blob/master/Datasets/olympics.csv

[6]

drop(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

[7]

dtype: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2019-10-05,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 图南科技 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • Python中的数据清洗入门文章,阅读需要一些耐心
  • 生词释意
  • github 库
  • 数据集
  • Dropping Columns in a DataFrame
    • 读取数据集
      • 重新查看列数
        • 使用定位函数查看
        • Tidying up Fields in the Data 整理字段
          • 参考资料
          领券
          问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档