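I have a DataFrame containing information about people, but some rows are duplicates whose addresses differ slightly.
How can I remove duplicates using fuzzy matching (or some other similarity measure), so that a row is dropped only when its address is similar to another row's and the first and last names also match?
Sample data: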
  First name | Last name | Address
0 John         Doe         ABC 9
1 John         Doe         KFT 2
2 Michael      John        ABC 9
3 Mary         Jane        PEP 9/2
4 Mary         Jane        PEP, 9-2
5 Gary         Young       verylongstreetname 1
6 Gary         Young       1 verylongstretname    (the typo in the street name is intentional)
Code to build the sample data:
import pandas as pd

df = pd.DataFrame([
    ['John', 'Doe', 'ABC 9'],
    ['John', 'Doe', 'KFT 2'],
    ['Michael', 'John', 'ABC 9'],
    ['Mary', 'Jane', 'PEP 9/2'],
    ['Mary', 'Jane', 'PEP, 9-2'],
    ['Gary', 'Young', 'verylongstreetname 1'],
    ['Gary', 'Young', '1 verylongstretname']
], columns=['First name', 'Last name', 'Address'])

Expected output:
  First name | Last name | Address
0 John         Doe         ABC 9
1 John         Doe         KFT 2
2 Michael      John        ABC 9
3 Mary         Jane        PEP 9/2
4 Gary         Young       verylongstreetname 1

Posted on 2019-02-22 15:41:20
Solved.
Based on @iamklaus's answer, I wrote the following code:
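# Assumption: fuzz comes from the fuzzywuzzy package (thefuzz exposes the same function).
from fuzzywuzzy import fuzz

def remove_duplicates_inplace(df, groupby, similarity_field, similar_level=85):
    def check_simi(d):
        # Collect the index labels of rows whose value is similar to an earlier row in the group.
        dupl_indexes = []
        for i in range(len(d.values) - 1):
            for j in range(i + 1, len(d.values)):
                if fuzz.token_sort_ratio(d.values[i], d.values[j]) >= similar_level:
                    dupl_indexes.append(d.index[j])
        return dupl_indexes

    # One list of duplicate labels per group of matching names.
    indexes = df.groupby(groupby)[similarity_field].apply(check_simi)
    for index_list in indexes:
        df.drop(index_list, inplace=True)

# The column names must match the DataFrame being cleaned; these are the ones from the sample data.
remove_duplicates_inplace(df, groupby=['First name', 'Last name'], similarity_field='Address')

Output:
  First name   Last name   Address
0 John         Doe         ABC 9
1 John         Doe         KFT 2
2 Michael      John        ABC 9
3 Mary         Jane        PEP 9/2
5 Gary         Young       verylongstreetname 1

https://stackoverflow.com/questions/54827174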
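
To see why a token-sort comparison treats 'PEP 9/2' and 'PEP, 9-2' as the same address, here is a minimal standard-library sketch of the idea. It is not the fuzzywuzzy implementation: `token_sort_similarity` and `drop_similar` are hypothetical helpers, `difflib.SequenceMatcher` stands in for `fuzz.token_sort_ratio`, and the scores may differ slightly from the library's. It also works on plain tuples instead of a DataFrame and compares each row only against rows already kept, which is a small simplification of the pairwise loop above.

```python
import re
from difflib import SequenceMatcher


def token_sort_similarity(a, b):
    """Approximate token-sort similarity on a 0-100 scale (hypothetical helper)."""
    def normalize(text):
        # Lowercase, treat every non-alphanumeric character as a separator,
        # then sort the tokens so that word order does not matter.
        tokens = re.sub(r"[^0-9a-z]+", " ", text.lower()).split()
        return " ".join(sorted(tokens))

    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


def drop_similar(rows, threshold=85):
    """Keep a row unless a previously kept row has the same first and last
    name and an address at least `threshold` similar (hypothetical helper)."""
    kept = []
    for first, last, address in rows:
        is_duplicate = any(
            (first, last) == (kept_first, kept_last)
            and token_sort_similarity(address, kept_address) >= threshold
            for kept_first, kept_last, kept_address in kept
        )
        if not is_duplicate:
            kept.append((first, last, address))
    return kept


rows = [
    ("John", "Doe", "ABC 9"),
    ("John", "Doe", "KFT 2"),
    ("Michael", "John", "ABC 9"),
    ("Mary", "Jane", "PEP 9/2"),
    ("Mary", "Jane", "PEP, 9-2"),
    ("Gary", "Young", "verylongstreetname 1"),
    ("Gary", "Young", "1 verylongstretname"),
]

deduped = drop_similar(rows)
for row in deduped:
    print(row)

# Punctuation is stripped and tokens are sorted, so both strings normalize to "2 9 pep".
print(token_sort_similarity("PEP 9/2", "PEP, 9-2"))  # 100
```

On the sample data this keeps the same five rows as the expected output in the question. The threshold of 85 mirrors the `similar_level` default above and would likely need tuning on real addresses.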