I am teaching myself Python and have finished a basic text summarizer. I am almost happy with the summarized text, but I would like to polish the final product a bit more.
The code correctly performs some standard text processing (tokenization, stop-word removal, etc.). It then scores each sentence based on weighted word frequency. I am using heapq.nlargest() to return the top 7 sentences, which I feel does a good job on my sample text.
The issue I am facing is that the top 7 sentences are returned from highest score to lowest score. I understand why this happens, but I would like to keep the sentences in the same order as they appear in the original text. I have included the relevant code and hope someone can point me toward a solution.
import heapq

import nltk

# tokens, stoplist and sentence_list are built earlier in the script

# remove all stopwords from the text, build a clean list of lower-case words
clean_data = []
for word in tokens:
    if str(word).lower() not in stoplist:
        clean_data.append(word.lower())

# build a dictionary of all words with frequency counts: {word: count}
word_frequencies = {}
for word in clean_data:
    if word not in word_frequencies:
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1

# update the dictionary with a weighted frequency
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

# iterate through each sentence and combine the weighted scores of the underlying words
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
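(As an aside, the frequency-count loop above can be written more compactly with collections.Counter. A minimal sketch, using toy tokens in place of the real clean_data:

```python
from collections import Counter

# toy stand-in for the clean_data list built above
clean_data = ["bank", "inquiry", "bank", "misconduct", "bank"]

# Counter performs the same per-word tally as the explicit loop
word_frequencies = dict(Counter(clean_data))

# normalise by the highest count to get the weighted frequencies
maximum_frequency = max(word_frequencies.values())
word_frequencies = {w: c / maximum_frequency for w, c in word_frequencies.items()}

print(word_frequencies)
```

This produces the same {word: weighted frequency} dictionary as the two loops in the snippet.)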
I am testing with the following article: https://www.bbc.com/news/world-australia-45674716
Current output: "Australia bank inquiry: 'They don't care who they hurt' The inquiry has also heard testimony about corporate fraud, bribery rings inside banks, actions to deceive regulators and reckless practices. A royal commission, the country's highest form of public inquiry, has this year exposed widespread misconduct in the industry. It followed a decade of scandals in Australia's financial sector, the country's largest industry. "[The report] shines a very bright light on the poor behaviour of our financial sector," Treasurer Josh Frydenberg said. "When misconduct was revealed, it either went unpunished or the consequences did not meet the seriousness of what has been done," he said. Bank customers who had lost everything also criticised regulators over misconduct by banks and financial companies. It also received more than 9,300 submissions alleging misconduct by banks, financial advisers, pension funds and insurance companies."
As an example of the expected output: the third sentence above, "A royal commission, the country's highest form of public inquiry, has this year exposed widespread misconduct in the industry.", actually appears in the original article before "Australia bank inquiry: 'They don't care who they hurt'". I would like the output to preserve that original sentence order.
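One way to get this (a minimal sketch, assuming sentence_list holds the sentences in document order and sentence_scores is the dict built above; toy values stand in for both) is to take the nlargest() result and re-sort it by each sentence's position in sentence_list:

```python
import heapq

# toy stand-ins for the real variables from the question
sentence_list = ["First sentence.", "Second sentence.",
                 "Third sentence.", "Fourth sentence."]
sentence_scores = {"First sentence.": 0.2, "Second sentence.": 0.9,
                   "Third sentence.": 0.5, "Fourth sentence.": 0.7}

# pick the highest-scoring sentences, then re-sort them into document order
summary_sentences = heapq.nlargest(3, sentence_scores, key=sentence_scores.get)
summary_sentences.sort(key=sentence_list.index)

summary = ' '.join(summary_sentences)
print(summary)  # Second sentence. Third sentence. Fourth sentence.
```

list.index is O(n) per lookup, which is fine for a handful of summary sentences; for long documents a precomputed {sentence: index} dict would avoid the repeated scans.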
Posted on 2019-03-08 16:12:37
Got it working; leaving this here in case anyone else is curious:
# iterate through each sentence and combine the weighted scores of the underlying words,
# storing [score, original_index] for each sentence
sentence_scores = {}
cnt = 0
for sent in sentence_list:
    sentence_scores[sent] = []
    score = 0
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            if len(sent.split(' ')) < 30:
                score += word_frequencies[word]
    sentence_scores[sent].append(score)
    sentence_scores[sent].append(cnt)
    cnt = cnt + 1

# sort the dictionary by [score, index] in descending order and keep the top 7 sentences
from operator import itemgetter
top7 = dict(sorted(sentence_scores.items(), key=itemgetter(1), reverse=True)[0:7])

# re-sort the top 7 by original index (the second element of each [score, index] pair)
def Sort(sub_li):
    return sorted(sub_li, key=lambda pair: pair[1])

sentence_summary = Sort(top7.values())

# rebuild the summary string in original sentence order
summary_parts = []
for value in sentence_summary:
    for key in top7:
        if top7[key] == value:
            summary_parts.append(key)
summary = ' '.join(summary_parts)
print(summary)
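The same (score, index) idea can be expressed a bit more compactly by pairing each sentence with its position via enumerate and letting tuple comparison do the ranking. A sketch under the same assumptions, with toy scores standing in for the weighted frequencies:

```python
import heapq

# toy stand-ins: sentences in document order and one score per sentence
sentence_list = ["A one.", "B two.", "C three.", "D four."]
scores = [0.1, 0.8, 0.6, 0.9]

# keep each sentence's original position alongside its score
indexed = [(score, i, sent)
           for i, (sent, score) in enumerate(zip(sentence_list, scores))]

# top 3 by score (tuples compare on score first), then back into document order
top = heapq.nlargest(3, indexed)
top.sort(key=lambda t: t[1])

summary = ' '.join(sent for _, _, sent in top)
print(summary)  # B two. C three. D four.
```

This avoids the nested value-matching loops at the end, since each tuple already carries its sentence and index.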
https://stackoverflow.com/questions/55027099