
[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 3)

needrunning
Published 2019-10-08 17:24:37
Included in the column: 图南科技

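This is the third part of a translated series on data cleaning with Python. The earlier parts are:

[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 1)

[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 2)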

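The outline in the figure below lists some common data cleaning tasks. This part mainly covers:

Renaming Columns and Skipping Rows

Python Data Cleaning: Recap and Resources

[Figure: outline of data cleaning topics (数据清理目录.png)]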

Original article

Pythonic Data Cleaning With NumPy and Pandas[1]

Dataset

olympics.csv[2]

A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

Renaming Columns and Skipping Rows


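First, let's look at the raw dataset. Read with the default settings, the first line of the file is used as the header, but the actual column names are on the second line.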

Therefore, we need to do two things:

Skip one row and set the header as the first (0-indexed) row
Rename the columns

Pass the header parameter to skip the first row:

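import pandas as pd

# header=1 skips the first line and takes the column names from the second line
olympics_df = pd.read_csv('datasets/python-data-cleaning-master/olympics.csv', header=1)
print(olympics_df.head())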
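To see what header=1 changes without downloading the file, here is a minimal sketch on an in-memory CSV. The four-line sample is a hypothetical miniature that only imitates the layout described above (a numeric first line, then the real column names); it is not the actual olympics.csv content.

```python
import io

import pandas as pd

# Hypothetical miniature of the file layout (an assumption for illustration):
# a numeric first line, then the real column names on the second line.
raw = io.StringIO(
    "0,1,2\n"
    ",? Summer,01 !\n"
    "Afghanistan (AFG),13,0\n"
    "Algeria (ALG),12,5\n"
)

# Default behaviour: the first line becomes the header.
default_df = pd.read_csv(raw)

# Rewind the buffer and read again, taking the header from the second line.
raw.seek(0)
skipped_df = pd.read_csv(raw, header=1)

default_columns = list(default_df.columns)
skipped_columns = list(skipped_df.columns)
print(default_columns)
print(skipped_columns)
```

The empty first cell of the header line is what produces the 'Unnamed: 0' column name that the rename step has to deal with.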

Rename the columns by passing a dictionary that maps the old column names to the new ones:

new_names = {'Unnamed: 0': 'Country',
             '? Summer': 'Summer Olympics',
             '01 !': 'Gold',
             '02 !': 'Silver',
             '03 !': 'Bronze',
             '? Winter': 'Winter Olympics',
             '01 !.1': 'Gold.1',
             '02 !.1': 'Silver.1',
             '03 !.1': 'Bronze.1',
             '? Games': '# Games',
             '01 !.2': 'Gold.2',
             '02 !.2': 'Silver.2',
             '03 !.2': 'Bronze.2'}
olympics_df.rename(columns=new_names, inplace=True)
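As a small sketch of how rename() behaves, the example below applies a shortened mapping to a hypothetical one-row frame (the data and the 'not a column' key are made up for illustration). It uses the non-inplace form, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical one-row frame that reuses a few of the messy column names.
df = pd.DataFrame({
    'Unnamed: 0': ['Afghanistan (AFG)'],
    '01 !': [0],
    '01 !.1': [0],
})

# 'not a column' is deliberately absent from df: by default rename()
# skips mapping keys that do not match any existing column.
new_names = {
    'Unnamed: 0': 'Country',
    '01 !': 'Gold',
    '01 !.1': 'Gold.1',
    'not a column': 'Ignored',
}

# Without inplace=True, rename() returns a renamed copy.
renamed = df.rename(columns=new_names)

original_columns = list(df.columns)
renamed_columns = list(renamed.columns)
print(renamed_columns)
```

Because unmatched keys are skipped silently by default, a typo in an old column name leaves that column unrenamed, so it is worth checking the resulting columns after the call.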

Python Data Cleaning: Recap and Resources


In this tutorial, you learned how you can drop unnecessary information from a dataset using the drop() function, as well as how to set an index for your dataset so that items in it can be referenced easily.

Moreover, you learned how to clean object fields with the .str accessor and how to clean the entire dataset using the applymap() method. Lastly, we explored how to skip rows in a CSV file and rename columns using the rename() method.
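The techniques in this recap can be put together in one short sketch. The frame below is hypothetical toy data, not one of the tutorial's datasets, and the column names are assumptions made for illustration. One caveat: recent pandas releases provide the element-wise method as DataFrame.map and deprecate applymap(), so the sketch picks whichever the installed version offers.

```python
import pandas as pd

# Hypothetical toy data (assumption), not one of the tutorial's datasets.
books = pd.DataFrame({
    'Identifier': [206, 216],
    'Place of Publication': ['London; Virtue & Yorston', ' London '],
    'Shelfmarks': ['A 1', 'B 2'],
})

# drop(): remove a column that is not needed.
books = books.drop(columns=['Shelfmarks'])

# set_index(): use a unique column as the row labels.
books = books.set_index('Identifier')

# .str accessor: vectorised string operations on an object column.
books['Place of Publication'] = (
    books['Place of Publication'].str.split(';').str[0]
)

# Element-wise cleaning of every cell: DataFrame.map where available,
# otherwise fall back to applymap() on older pandas versions.
apply_elementwise = getattr(books, 'map', None) or books.applymap
cleaned = apply_elementwise(lambda value: value.strip().upper())

print(cleaned)
```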


Knowing about data cleaning is very important, because it is a big part of data science. You now have a basic understanding of how Pandas and NumPy can be leveraged to clean datasets!

Check out the links below to find additional resources that will help you on your Python data science journey:

  • The Pandas documentation[3]
  • The NumPy documentation[4]
  • Python for Data Analysis[5] by Wes McKinney, the creator of Pandas
  • Pandas Cookbook[6] by Ted Petrou, a data science trainer and consultant

Translator's Notes

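The translation of the full article was split into three parts and took three weeks. The article is an introduction to data processing in Python and covers the basics needed in practice. The translated text can serve as a reference index to come back to when needed.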

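I also found that https://realpython.com[7] is a very good English-language site for learning Python. I plan to keep translating Python articles from it, building up familiarity with Python step by step.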

References

[1] Pythonic Data Cleaning With NumPy and Pandas: https://realpython.com/python-data-cleaning-numpy-pandas/

[2] olympics.csv: https://github.com/realpython/python-data-cleaning/blob/master/Datasets/olympics.csv

[3] The Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html

[4] The NumPy documentation: https://docs.scipy.org/doc/numpy/reference/

[5] Python for Data Analysis: https://realpython.com/asins/1491957662/

[6] Pandas Cookbook: https://realpython.com/asins/B06W2LXLQK/

[7] https://realpython.com: https://realpython.com/

Originally published 2019-10-07 on the 图南科技 WeChat official account and shared through the Tencent Cloud self-media sharing program.