
[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 2)

needrunning
Published 2019-10-08 16:34:19
Included in the column: 图南科技

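This post is the second part of a translation of "Data Cleaning with Python".

The translated installments published so far:

[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 1)

The table of contents in the figure below lists some common data-cleaning tasks. This post focuses on:

"Cleaning the Entire Dataset Using the applymap Function"

[Figure: table of contents of data-cleaning topics]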

Original article

Pythonic Data Cleaning With NumPy and Pandas[1]

Dataset

university_towns.txt[2]

A text file containing names of college towns in every US state

The dataset looks roughly like this:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
...

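Our data-cleaning task is to turn the irregular lines above into tidy data. Apart from some parentheses and brackets, the lines share no obvious common structure.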

Cleaning the Entire Dataset Using the applymap Function


In certain situations, you will see that the “dirt” is not localized to one column but is more spread out.

There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame.

The pandas .applymap() method is similar to the built-in map() function: it simply applies a function to every element of a DataFrame.
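As a small illustration of this element-wise behaviour, the sketch below applies a one-argument function to every cell of a toy DataFrame. The frame contents and the helper name `strip_and_upper` are made up for this example and do not come from the article. One caveat: pandas 2.1 introduced `DataFrame.map` as the new name and deprecated `DataFrame.applymap`, so the sketch picks whichever method is available.

```python
import pandas as pd

def strip_and_upper(cell):
    """Hypothetical per-cell cleaner: trim whitespace and upper-case strings."""
    return cell.strip().upper() if isinstance(cell, str) else cell

# Toy data, invented for illustration only
toy_df = pd.DataFrame({"a": [" x ", "y\n"], "b": ["z", " w"]})

# pandas >= 2.1 exposes the element-wise method as DataFrame.map;
# older versions only have DataFrame.applymap.
elementwise = toy_df.map if hasattr(toy_df, "map") else toy_df.applymap
cleaned_toy = elementwise(strip_and_upper)

# The built-in map() does the same job for a flat iterable.
cleaned_list = list(map(strip_and_upper, [" x ", "y\n"]))

print(cleaned_toy)
print(cleaned_list)
```

The function is called once per cell, so it only needs to know how to handle a single value.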

[Figure: data cleaning, transforming the rows]

We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2....

If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.

In other words, every US state name is marked by the "[edit]" string.

We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

university_towns = []
with open('datasets/python-data-cleaning-master/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

# Print the first 5 entries
print(university_towns[:5])

[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]
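To try the parsing logic without downloading the file, a sketch along the following lines feeds a few sample lines through the same loop using an in-memory buffer. The helper name `parse_towns` and the `StateB`/`TownB1` lines are placeholders invented for this illustration; only the Alabama lines come from the sample shown earlier.

```python
import io

def parse_towns(lines):
    """Pair every town line with the most recently seen state line."""
    pairs = []
    state = None
    for line in lines:
        if "[edit]" in line:
            # A state header: remember it until the next header appears
            state = line
        else:
            # A town line: attach it to the last-seen state
            pairs.append((state, line))
    return pairs

sample_text = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "StateB[edit]\n"                 # placeholder state, not from the real file
    "TownB1 (Example College)[2]\n"  # placeholder town, not from the real file
)

# io.StringIO yields lines (with their trailing newline) just like a file object
pairs = parse_towns(io.StringIO(sample_text))
for state, town in pairs:
    print(repr(state), repr(town))
```

Initialising `state` to `None` makes it obvious if a town line ever appears before the first state header.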

Transforming the data and adding column names

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.

The resulting DataFrame looks like this:

import pandas as pd

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
print(towns_df.head())

[Figure: data cleaning, adding columns]

The applymap() method took each element from the DataFrame, passed it to the function, and the original value was replaced by the returned value. It’s that simple!

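In essence, applymap() traverses the DataFrame element by element: every cell is passed to a callback function, so the cleaning logic can be customized freely.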
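The cleaning function itself is not reproduced in this translated excerpt. A minimal sketch of what such a function could look like follows, assuming the goal is to keep only the text before the first " (" or "[" in each cell. The helper name `get_citystate` and the choice to strip leftover whitespace are assumptions made for this illustration. The three sample rows are taken from the Alabama lines shown earlier. Because `DataFrame.applymap` was deprecated in pandas 2.1 in favour of `DataFrame.map`, the sketch uses whichever is available.

```python
import pandas as pd

def get_citystate(item):
    """Illustrative cleaner: keep the part of a cell before ' (' or '['."""
    if " (" in item:
        return item[: item.find(" (")]
    if "[" in item:
        return item[: item.find("[")]
    # Neither marker present: just drop surrounding whitespace/newlines
    return item.strip()

sample_rows = [
    ("Alabama[edit]\n", "Auburn (Auburn University)[1]\n"),
    ("Alabama[edit]\n", "Florence (University of North Alabama)\n"),
    ("Alabama[edit]\n", "Tuskegee (Tuskegee University)[5]\n"),
]
towns_df = pd.DataFrame(sample_rows, columns=["State", "RegionName"])

# Apply the cleaner to every cell of the DataFrame
elementwise = towns_df.map if hasattr(towns_df, "map") else towns_df.applymap
towns_df = elementwise(get_citystate)

print(towns_df)
```

A single function handles both columns here because state cells contain "[" but no " (", while town cells are cut at " (" first.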

References

[1] Pythonic Data Cleaning With NumPy and Pandas: https://realpython.com/python-data-cleaning-numpy-pandas/

[2] university_towns.txt: https://github.com/realpython/python-data-cleaning/blob/master/Datasets/university_towns.txt

Originally published 2019-10-06 on the 图南科技 WeChat official account.
