I'm trying to build a simple program that takes a text file and builds a dict() with words as keys and each word's number of occurrences (its frequency) as values.
I've learned that collections.Counter
can do this easily (among other approaches). My problem is that I want the dictionary sorted by frequency, so I can print the Nth most frequent word. Finally, I also need a way to later associate a value of a different type with each entry (a string holding the word's definition).
Basically, I need something that outputs the following:
Number of words: 5
[mostfrequentword: frequency, definition]
[2ndmostfrequentword: frequency, definition]
etc.
Here's what I have so far, but it only counts word frequencies; I don't know how to sort the dictionary by frequency and then print the Nth most frequent word:
wordlist = {}

def cleanedup(string):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    cleantext = ''
    for character in string.lower():
        if character in alphabet:
            cleantext += character
        else:
            cleantext += ' '
    return cleantext

def text_crunch(textfile):
    for line in textfile:
        for word in cleanedup(line).split():
            if word in wordlist:
                wordlist[word] += 1
            else:
                wordlist[word] = 1

with open('DQ.txt') as doc:
    text_crunch(doc)

print(wordlist['todos'])
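For the sorting step the question is stuck on: a plain dict built this way can be ordered by value with sorted(). A minimal sketch, using hypothetical counts in place of the real wordlist:

```python
# Hypothetical counts standing in for the wordlist built by text_crunch().
wordlist = {'de': 9, 'que': 5, 'todos': 3}

# Sort (word, count) pairs by count, descending.
by_freq = sorted(wordlist.items(), key=lambda kv: kv[1], reverse=True)

print("Number of words:", len(by_freq))

nth = 2  # 1-based rank
word, freq = by_freq[nth - 1]
print(word, freq)  # second most frequent word and its count
```

This keeps the existing dict-based counting untouched and only adds the ranking step.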
Posted on 2014-12-10 18:01:07
A simpler version of the code that does what you want nicely :)
import string
import collections

def cleanedup(fh):
    for line in fh:
        word = ''
        for character in line:
            if character in string.ascii_letters:
                word += character
            elif word:
                yield word
                word = ''

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))
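The answer above doesn't cover the definition requirement from the question. One hedged option (the words and the definitions dict here are hypothetical) is to keep a second mapping and combine it with the Counter when printing:

```python
import collections

# Hypothetical word counts and a hypothetical definitions lookup.
wordlist = collections.Counter(['molino', 'molino', 'caballero'])
definitions = {'molino': 'windmill', 'caballero': 'knight'}

print("Number of words:", len(wordlist))
for word, freq in wordlist.most_common():
    # Matches the requested "[word: frequency, definition]" output shape.
    print(f"[{word}: {freq}, {definitions.get(word, 'no definition')}]")
```

Keeping the definitions in a separate dict means the Counter stays usable for most_common(), while definitions can be filled in later without touching the counts.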
An alternative solution with regular expressions:
import re
import collections

def cleanedup(fh):
    for line in fh:
        for word in re.findall('[a-z]+', line.lower()):
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))
Or:
import re
import collections

def cleanedup(fh):
    for line in fh:
        for word in re.split('[^a-z]+', line.lower()):
            if word:  # re.split can produce empty strings at line edges
                yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
    print(wordlist.most_common(5))
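In all three variants, most_common() is also what answers the "print the Nth most frequent word" part of the question: it returns (word, count) pairs in descending order of count, so the Nth pair can be indexed directly. A small sketch with hypothetical data:

```python
import collections

# Hypothetical token stream standing in for cleanedup(doc).
wordlist = collections.Counter(['the', 'the', 'the', 'cat', 'cat', 'sat'])

n = 2  # 1-based rank
word, freq = wordlist.most_common(n)[n - 1]
print(word, freq)  # the second most frequent word and its count
```

Note that for words with equal counts, the relative order returned by most_common() is not meaningful.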
https://stackoverflow.com/questions/27407485