[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 2)

needrunning

Published 2019-10-08 16:34:19


Included in the column: 图南科技

This article is the second part of a translated series on data cleaning with Python. The previously translated part is:

[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 1)

The outline in the figure below lists some common data cleaning tasks. This article focuses on:

“Cleaning the Entire Dataset Using the applymap Function”

[Figure: outline of data cleaning tasks]

Original article

Pythonic Data Cleaning With NumPy and Pandas[1]

Dataset

university_towns.txt[2]

A text file containing names of college towns in every US state

The dataset looks roughly like this:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
...

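Our data cleaning task is to turn the irregular lines above into tidy data. As we can see, apart from some brackets and parentheses, the lines share no other common feature.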

Cleaning the Entire Dataset Using the applymap Function


In certain situations, you will see that the “dirt” is not localized to one column but is more spread out.

There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame.

The pandas .applymap() method is similar to the built-in map() function: it simply applies a function to every element in a DataFrame.
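To make the comparison with map() concrete, here is a minimal sketch on a made-up two-column DataFrame (the column names and values are assumptions for illustration, not part of the dataset). Note that DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so the sketch uses whichever one the installed version provides.

```python
import pandas as pd

# Toy data, invented for this illustration only.
df = pd.DataFrame({"a": [" x ", "y "], "b": ["  z", "w"]})

# DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map;
# pick whichever method this pandas version offers.
elementwise = df.map if hasattr(df, "map") else df.applymap

# Apply str.strip to every cell of the DataFrame.
stripped = elementwise(str.strip)
print(stripped)

# The built-in map() does the same thing for a flat iterable.
flat = list(map(str.strip, [" x ", "y "]))
print(flat)
```

The function is called once per cell, regardless of which row or column the cell belongs to.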

[Figure: data cleaning, transforming rows]

We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2....

If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.

Each US state name is marked by the “[edit]” string, which separates one state's block of towns from the next.

We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

university_towns = []
with open('datasets/python-data-cleaning-master/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))
# Print the first 5 (state, city) tuples
print(university_towns[:5])

[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]
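The same pairing logic can be tried without the file on disk. The following is a minimal sketch, assuming a hypothetical helper named pair_towns_with_states that accepts any iterable of lines. The Alabama lines come from the sample above; the StateB/TownB1 lines are made-up placeholders.

```python
def pair_towns_with_states(lines):
    """Return (state, town) tuples from an iterable of raw text lines."""
    pairs = []
    state = None  # last state header seen so far
    for line in lines:
        if "[edit]" in line:
            # A state header: remember it for the towns that follow.
            state = line
        elif state is not None:
            # A town line: pair it with the most recent state header.
            pairs.append((state, line))
    return pairs


sample = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)[1]\n",
    "Florence (University of North Alabama)\n",
    "StateB[edit]\n",                  # placeholder state, not real data
    "TownB1 (Example College)[1]\n",   # placeholder town, not real data
]
pairs = pair_towns_with_states(sample)
print(pairs)
```

Unlike the loop above, this sketch skips any town line that appears before the first state header instead of raising a NameError.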

Transforming the data: adding column names

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.

The resulting DataFrame looks like this:

import pandas as pd

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
print(towns_df.head())

[Figure: data cleaning, adding columns]

The applymap() method takes each element of the DataFrame, passes it to the function, and replaces the original value with the returned value. It’s that simple!
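The cleaning function this paragraph refers to is not shown in this translation, so here is a minimal sketch of what such a step could look like. The helper name strip_annotations and its exact rules are assumptions for illustration: it keeps only the text before the first " (" or "[" in a cell. The two rows are taken from the sample output above.

```python
import pandas as pd


def strip_annotations(cell):
    """Hypothetical cleaner: drop ' (...)' details and '[...]' markers."""
    for marker in (" (", "["):
        position = cell.find(marker)
        if position != -1:
            return cell[:position]
    # No marker found: just remove surrounding whitespace and newlines.
    return cell.strip()


towns_df = pd.DataFrame(
    [
        ("Alabama[edit]\n", "Auburn (Auburn University)[1]\n"),
        ("Alabama[edit]\n", "Florence (University of North Alabama)\n"),
    ],
    columns=["State", "RegionName"],
)

# applymap was deprecated in pandas 2.1 in favor of DataFrame.map.
elementwise = towns_df.map if hasattr(towns_df, "map") else towns_df.applymap
cleaned = elementwise(strip_annotations)
print(cleaned)
```

Because the same function runs on both columns, one pass cleans the state names and the town names alike.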

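In essence, applymap() traverses the DataFrame element by element, not row by row: every cell is passed to a callback function, so you can define custom logic for processing the data.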

References

[1] Pythonic Data Cleaning With NumPy and Pandas: https://realpython.com/python-data-cleaning-numpy-pandas/

[2] university_towns.txt: https://github.com/realpython/python-data-cleaning/blob/master/Datasets/university_towns.txt

This article was shared from the 图南科技 WeChat official account. Originally published: 2019-10-06.
