My task is to remove duplicate addresses from a list.
Case 1: a list of 5 addresses, of which only 2 are needed; the other 3 are duplicates.
['3805 Swan House Ct||Burtonsville|MD|20866',
'3805 Swan House Ct||Burtonsville|Md|20866',
'6113 Loventree Rd||Columbia|MD|21044',
'6113 Loventree Rd||Columbia|Md|21044',
'6113 Loventree Road||Columbia|MD|21044']
Here the addresses '3805 Swan House Ct||Burtonsville|MD|20866' and '3805 Swan House Ct||Burtonsville|Md|20866' are similar, so either one may be returned; when length is considered, '3805 Swan House Ct||Burtonsville|MD|20866' is fine.
For the '6113 Loventree' address variants, which are the remaining 3 addresses, the comparison should return '6113 Loventree Road||Columbia|MD|21044'.
Expected output:
['3805 Swan House Ct||Burtonsville|MD|20866','6113 Loventree Road||Columbia|MD|21044']
Case 2: a list of 3 addresses, from which only one address needs to be extracted.
['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216', '4512fairfaxrd|Apt2|Baltimore|Md|21216', '4512 Fairfax Rd|Apt 2|Baltimore|Md|21216']
Expected output: the address with the greatest length.
['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216']
Posted on 2022-09-10 06:10:15
You can use difflib, but I am not sure how well it handles data that only nearly matches.
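One way to get a feel for how difflib treats near matches is to look at the similarity ratio it computes for pairs of addresses. The sketch below is only an illustration: the helper name `similarity` is made up for this example, and the three addresses are taken from the question. `difflib.get_close_matches` keeps candidates whose ratio is at least its `cutoff` argument (0.6 by default) and returns at most `n` of them (3 by default).

```python
import difflib

# Sample addresses from the question; a and b differ only in the case of the state code.
a = '3805 Swan House Ct||Burtonsville|MD|20866'
b = '3805 Swan House Ct||Burtonsville|Md|20866'
c = '6113 Loventree Rd||Columbia|MD|21044'


def similarity(x, y):
    """Return the difflib similarity ratio of two strings, between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, x, y).ratio()


# A near-duplicate pair scores close to 1.0; unrelated addresses score lower.
print(round(similarity(a, b), 3))
print(round(similarity(a, c), 3))
```

Printing the ratios for your own data should show whether the default cutoff separates true duplicates from distinct addresses, or whether a stricter value is needed.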
from collections import OrderedDict
import difflib

data = ['3805 Swan House Ct||Burtonsville|MD|20866',
        '3805 Swan House Ct||Burtonsville|Md|20866',
        '6113 Loventree Rd||Columbia|MD|21044',
        '6113 Loventree Rd||Columbia|Md|21044',
        '6113 Loventree Road||Columbia|MD|21044',
        "123 Cherry Lane Apt 12",
        "123 Cherry Lane Apt 121"]

test = []
for word in data:
    # Addresses in data that difflib considers close to the current one
    new_list = difflib.get_close_matches(word, data)
    # Keep the first address in data that contains any of those matches
    match_data = [i for i in data if any((j in i) for j in new_list)][:1]
    test.append(match_data[0])

# Drop repeated picks while preserving order
remove_dup = list(OrderedDict.fromkeys(test))
print(remove_dup)
>>> ['3805 Swan House Ct||Burtonsville|MD|20866', '6113 Loventree Rd||Columbia|MD|21044', '123 Cherry Lane Apt 12']
If you want the address chosen by its length:
test = []
for word in data:
    new_list = difflib.get_close_matches(word, data)
    match_data = [i for i in data if any((j in i) for j in new_list)]
    test_data = []
    # Keep only the longest matching address in test_data
    for i in match_data:
        if not test_data:
            test_data.append(i)
        elif len(test_data[-1]) < len(i):
            test_data[-1] = i
    test.append(test_data[0])

remove_dup = list(OrderedDict.fromkeys(test))
print(remove_dup)
>>> ['3805 Swan House Ct||Burtonsville|MD|20866', '6113 Loventree Road||Columbia|MD|21044', '123 Cherry Lane Apt 121']
https://stackoverflow.com/questions/73669343
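Case 2 in the question mixes spelling variants such as 'Road' / 'Rd' and '4512fairfaxrd', which plain string similarity may not group reliably. A possible variation, sketched here under assumptions that are not part of the answer above, is to normalize each address before comparing it and then keep the longest original string in each group. The names `ABBREVIATIONS`, `normalize`, `dedupe` and the `cutoff` value of 0.9 are hypothetical choices for illustration only.

```python
import difflib
import re

# Hypothetical abbreviation table (an assumption, not from the question); extend for real data.
ABBREVIATIONS = {"road": "rd", "apartment": "apt"}


def normalize(address):
    """Lowercase, shorten known words, and drop everything except letters and digits."""
    text = address.lower()
    for full, short in ABBREVIATIONS.items():
        # Plain substring replacement: it also hits words that merely contain "road".
        text = text.replace(full, short)
    return re.sub(r"[^a-z0-9]", "", text)


def dedupe(addresses, cutoff=0.9):
    """Group addresses whose normalized forms are similar; keep the longest original per group."""
    groups = []  # list of (normalized key of first member, longest original seen so far)
    for address in addresses:
        key = normalize(address)
        for index, (group_key, best) in enumerate(groups):
            if difflib.SequenceMatcher(None, key, group_key).ratio() >= cutoff:
                if len(address) > len(best):
                    groups[index] = (group_key, address)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((key, address))
    return [best for _, best in groups]


case1 = ['3805 Swan House Ct||Burtonsville|MD|20866',
         '3805 Swan House Ct||Burtonsville|Md|20866',
         '6113 Loventree Rd||Columbia|MD|21044',
         '6113 Loventree Rd||Columbia|Md|21044',
         '6113 Loventree Road||Columbia|MD|21044']
case2 = ['4512 Fairfax Road|Apartment 2|Baltimore|MD|21216',
         '4512fairfaxrd|Apt2|Baltimore|Md|21216',
         '4512 Fairfax Rd|Apt 2|Baltimore|Md|21216']

print(dedupe(case1))
print(dedupe(case2))
```

With the two sample lists from the question, this returns the two expected addresses for Case 1 and the single 'Fairfax Road' address for Case 2. Note that a similarity cutoff still merges strings that are nearly identical but refer to different places, for example '123 Cherry Lane Apt 12' and '123 Cherry Lane Apt 121', so the threshold should be checked against real data.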