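This post is the second part of a translation of "Data Cleaning with Python". The translated posts in the series are listed below:
[Translation] Data Cleaning in Python | Pythonic Data Cleaning With NumPy and Pandas (Part 1)
The table of contents in the figure below lists some common data-cleaning tasks. This post focuses on "Cleaning the Entire Dataset Using the applymap Function".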
[Figure: table of contents of the data-cleaning topics]
Original article: Pythonic Data Cleaning With NumPy and Pandas[1]
Dataset: university_towns.txt[2], a text file containing the names of college towns in every US state.
The dataset looks roughly like this:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
...
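Our data-cleaning task is to turn the irregular lines above into tidy data. Apart from some parentheses and brackets, the lines share no obvious common features.
Cleaning the Entire Dataset Using the applymap Function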
In certain situations, you will see that the “dirt” is not localized to one column but is more spread out.
There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame.
The pandas .applymap() method is similar to the built-in map() function: it simply applies a function to every element in a DataFrame.
[Figure: data cleaning, transforming the rows]
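To make the comparison with map() concrete, here is a minimal sketch that applies the same function to a plain list and to every cell of a small DataFrame. The sample data and the helper name `apply_elementwise` are made up for illustration. One assumption worth flagging: pandas 2.1 deprecated DataFrame.applymap in favor of DataFrame.map, so the helper uses whichever of the two is available.

```python
import pandas as pd


def apply_elementwise(df, func):
    """Apply func to every cell of df (hypothetical helper for this example)."""
    # DataFrame.map replaced DataFrame.applymap in pandas 2.1;
    # fall back to applymap on older versions.
    if hasattr(pd.DataFrame, "map"):
        return df.map(func)
    return df.applymap(func)


# Built-in map(): applies a function to each item of an iterable.
words = ["  alpha ", "beta  "]
stripped_words = list(map(str.strip, words))

# Element-wise DataFrame counterpart: applies the function to each cell.
sample_df = pd.DataFrame({"a": ["  x ", "y  "], "b": [" z", "w "]})
stripped_df = apply_elementwise(sample_df, str.strip)

print(stripped_words)
print(stripped_df)
```

In both cases the function receives one value at a time. The only difference is the container being traversed.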
We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2....
If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.
In other words, each US state name is marked by the "[edit]" string.
We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:
university_towns = []
with open('datasets/python-data-cleaning-master/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

# Print the first 5 (state, town) pairs
print(university_towns[:5])
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]
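The loop above needs a local copy of the file. For readers who want to try the same idea without it, here is a self-contained sketch that runs the logic on a few in-memory lines. The first three lines come from the sample shown earlier; the `StateB`/`TownB1` lines and the function name `parse_towns` are invented for illustration. Unlike the loop above, this version skips any town line that appears before the first state line instead of failing on an undefined `state`.

```python
import io

# In-memory stand-in for the file (the StateB/TownB1 lines are made up).
sample_text = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
StateB[edit]
TownB1 (University B)
"""


def parse_towns(lines):
    """Pair each town line with the most recently seen state line."""
    pairs = []
    state = None
    for line in lines:
        if "[edit]" in line:
            # A state line: remember it until the next one is found
            state = line
        elif state is not None:
            # A town line: attach it to the last-seen state
            pairs.append((state, line))
    return pairs


with io.StringIO(sample_text) as file:
    sample_pairs = parse_towns(file)

print(sample_pairs)
```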
We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.
The resulting DataFrame looks like this:
import pandas as pd

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
print(towns_df.head())
[Figure: data cleaning, the DataFrame with the added columns]
The applymap() method took each element from the DataFrame, passed it to the function, and the original value was replaced by the returned value. It’s that simple!
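The cleaning function and the applymap() call that this paragraph refers to are not reproduced in this post. The following is a minimal sketch of what such a step could look like, assuming the goal is to keep only the text before the first " (" or "[" in each cell. The function name `clean_cell` and the two-row sample are illustrative and not taken from the original article. Because pandas 2.1 deprecated DataFrame.applymap in favor of DataFrame.map, the sketch calls whichever one exists.

```python
import pandas as pd


def clean_cell(item):
    """Keep only the text before the first ' (' or '[' (illustrative rule)."""
    if " (" in item:
        return item[: item.find(" (")]
    if "[" in item:
        return item[: item.find("[")]
    # No marker found: return the cell unchanged
    return item


# Two rows in the same shape as the (state, town) pairs built earlier
sample_towns_df = pd.DataFrame(
    [
        ("Alabama[edit]\n", "Auburn (Auburn University)[1]\n"),
        ("Alabama[edit]\n", "Florence (University of North Alabama)\n"),
    ],
    columns=["State", "RegionName"],
)

# DataFrame.map replaced DataFrame.applymap in pandas 2.1; use whichever exists.
if hasattr(pd.DataFrame, "map"):
    cleaned_df = sample_towns_df.map(clean_cell)
else:
    cleaned_df = sample_towns_df.applymap(clean_cell)

print(cleaned_df)
```

Slicing at the marker also drops the trailing newline, since it always comes after the marker. A cell with no marker at all would keep its newline under this rule.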
In effect, applymap() traverses the DataFrame element by element: each cell is passed to a callback function that you define to process the data.
[1] Pythonic Data Cleaning With NumPy and Pandas: https://realpython.com/python-data-cleaning-numpy-pandas/
[2] university_towns.txt: https://github.com/realpython/python-data-cleaning/blob/master/Datasets/university_towns.txt