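I have a soft-matching function (below) that takes a list of donors and a new entry and checks whether the given donor already exists.
The data is imprecise, so I have to use soft matching to decide whether a given record already exists (for example, Jon at 123 Sesame St. is the same as John Doe at 123 Sesame Street). I implemented this with difflib and it works well, but it is as slow as Christmas.
The program currently takes about two days to process 10 MB of data. The profiler shows that the difflib operations in the soft-matching function are what cause the slowdown.
Is there a way to optimize my matching function so that it runs faster?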
def softMatch(self, new, donors):
    # Takes new, a contribution record from a DG report, and attempts to
    # soft match it with a record in donors.
    # Returns the position of the matched record, or -1 if no match.
    # Methodology:
    #   Name
    #   Address
    #   Affiliation, Occupation, Employer not analyzed
    # Dependencies: cleanAddr(str), re (from cleanAddr), difflib
    match = -1
    name = new[0]
    address = self.cleanAddr(new[1] + " " + new[2] + " " + new[3] + " " + new[4] + " " + new[5][:5])
    # Collapse runs of spaces into a single space.
    while address.find("  ") != -1:
        address = address.replace("  ", " ")
    diff = 0.6
    for x in range(0, len(donors)):
        ratio = difflib.SequenceMatcher(None, name, donors[x].getBestName()[0]).ratio() * difflib.SequenceMatcher(None, address, donors[x].getBestAddr()).ratio()**2
        if ratio > diff:
            diff = ratio
            match = x
    return match
And the profiler output:
Ordered by: internal time
         ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        3091394   53.737    0.000   66.798    0.000  difflib.py:353(find_longest_match)
         368770   10.030    0.000   80.702    0.000  difflib.py:463(get_matching_blocks)
       59454412    7.683    0.000    7.683    0.000  {method 'get' of 'dict' objects}
         368770    6.906    0.000   10.521    0.000  difflib.py:300(__chain_b)
        5210450    4.009    0.000    4.009    0.000  {built-in method __new__ of type object at 0x1001e14a0}
       16895496    3.056    0.000    3.056    0.000  {method 'append' of 'list' objects}
            756    2.863    0.004  104.674    0.138  june12.py:74(softMatch)
        3091394    1.875    0.000    1.875    0.000  {method 'pop' of 'list' objects}
       10299176    1.850    0.000    1.850    0.000  {method 'setdefault' of 'dict' objects}
        2487807    1.808    0.000    4.634    0.000  difflib.py:661(<genexpr>)
        3091394    1.620    0.000    4.345    0.000  <string>:12(__new__)
         368770    1.461    0.000   88.032    0.000  difflib.py:639(ratio)
        2119037    1.258    0.000    2.826    0.000  <string>:16(_make)
        8127502    1.032    0.000    1.032    0.000  {method '__contains__' of 'set' objects}
         184385    0.919    0.000    1.500    0.000  Donor.py:63(getBestAddr)
         368770    0.859    0.000    5.493    0.000  {built-in method sum}
         368770    0.554    0.000   12.131    0.000  difflib.py:154(__init__)
         368817    0.552    0.000    0.552    0.000  {method 'sort' of 'list' objects}
3965153/3965134    0.538    0.000    0.538    0.000  {built-in method len}
         368770    0.421    0.000   11.577    0.000  difflib.py:218(set_seqs)
         368770    0.403    0.000   10.924    0.000  difflib.py:256(set_seq2)
         337141    0.371    0.000    0.371    0.000  {method 'find' of 'str' objects}
         368770    0.281    0.000    0.281    0.000  difflib.py:41(_calculate_ratio)
         368770    0.232    0.000    0.232    0.000  difflib.py:230(set_seq1)
         159562    0.216    0.000    0.216    0.000  {method 'replace' of 'str' objects}
         184385    0.134    0.000    0.134    0.000  Donor.py:59(getBestName)
              1    0.017    0.017  104.718  104.718  june12.py:56(loadDonors)
...
(softMatch is called by loadDonors.)
Posted on 2018-05-07 19:07:40
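I suggest you try two different approaches:
1. Use quick_ratio instead of ratio.
2. Apply .lower().split(' ') to the data before sending it to SequenceMatcher.
The first approach will most likely reduce the overall run time, but it will also reduce accuracy. The second approach should work well because SequenceMatcher knows how to handle lists, so it only has to match two items, ['john', 'doe'] against ['john', 'doe'], rather than every letter in the name.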
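The two suggestions might look roughly like the sketch below. The helper names tokens, token_ratio and char_quick_ratio are hypothetical and not part of the original code, and the sketch uses str.split() rather than split(' ') on the assumption that repeated spaces should not produce empty tokens.

```python
import difflib


def tokens(text):
    # Lowercase and split on whitespace so that words, not characters,
    # are compared. Assumption: split() is used instead of split(' ')
    # to avoid empty tokens when the input contains repeated spaces.
    return text.lower().split()


def token_ratio(a, b):
    # Word-level similarity: SequenceMatcher accepts any sequences of
    # hashable items, so lists of words work as well as strings.
    return difflib.SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def char_quick_ratio(a, b):
    # Character-level estimate: quick_ratio() is documented as an upper
    # bound on ratio() that is cheaper to compute.
    return difflib.SequenceMatcher(None, a, b).quick_ratio()


if __name__ == "__main__":
    print(token_ratio("John Doe", "john doe"))
    print(token_ratio("Jon Doe", "John Doe"))
    print(char_quick_ratio("Jon Doe", "John Doe"))
```

One trade-off to keep in mind: at word level, "jon" and "john" are simply different tokens, so token_ratio("Jon Doe", "John Doe") scores only 0.5 and the 0.6 threshold in softMatch would probably need retuning. Because quick_ratio is an upper bound, it could also serve as a cheap pre-filter before calling ratio, though that is a variation on the suggestion rather than what the answer proposes.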
https://codereview.stackexchange.com/questions/55051