Sample text file:
airport, 2007, 175702
airport, 2008, 173294
request, 2005, 646179
request, 2006, 677820
request, 2007, 697645
request, 2008, 795265
wandered, 2005, 83769
wandered, 2006, 87688
wandered, 2007, 108634
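wandered, 2008, 171015
The text file contains a word (e.g. 'airport'), a year, and the number of times the word was used in that year. What I've done is create a class that uses the word as a key and holds the year and the number of occurrences for that year. What I now want to do is find the number of occurrences of each letter from a to z: for each word, count how many times the letter appears in it, multiply that by the word's total number of occurrences, and add up the results across all words.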
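For illustration, the structure described above could look roughly like the sketch below. This is not the actual `wordData` module, which is not shown here; `YearCount` and `read_word_counts` are hypothetical names, and the sketch assumes every non-blank line has the form `word, year, count`.

```python
from dataclasses import dataclass


@dataclass
class YearCount:
    """Hypothetical record: how often a word was used in one year."""
    year: int
    count: int


def read_word_counts(lines):
    """Map each word to a list of YearCount records (assumed layout: word, year, count)."""
    words = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, year, count = (field.strip() for field in line.split(","))
        words.setdefault(word, []).append(YearCount(int(year), int(count)))
    return words


sample = [
    "airport, 2007, 175702",
    "airport, 2008, 173294",
]
words = read_word_counts(sample)
print(words["airport"][0].count)  # 175702
```

Passing an open file object instead of `sample` should work the same way, since the function only iterates over lines.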
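Example:
'a' appears once in both wandered and airport, so we get 1 × (83769 + 87688 + 108634 + 171015) = 451106 total occurrences of 'a' in wandered, and 1 × (175702 + 173294) = 348996 total occurrences of 'a' in airport, for a total of 800102 occurrences of the letter 'a'. To find the frequency of 'a', we divide 800102 by the total number of letters, 25770183, giving a frequency of about 0.031048. 'b' and 'c' should be 0.0, since no word currently uses those two letters.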
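The arithmetic in this example can be checked with a short sketch. The per-word totals are hard-coded from the sample file, and `letter_frequencies` is a hypothetical helper name used only for this illustration.

```python
import string

# Total occurrences of each word, summed over all years in the sample file.
word_totals = {
    "airport": 175702 + 173294,
    "request": 646179 + 677820 + 697645 + 795265,
    "wandered": 83769 + 87688 + 108634 + 171015,
}


def letter_frequencies(word_totals):
    """Return {letter: share of all letters} for a-z, weighted by word totals."""
    counts = dict.fromkeys(string.ascii_lowercase, 0)
    for word, total in word_totals.items():
        for ch in word:
            if ch in counts:
                # Each occurrence of ch in the word counts `total` times.
                counts[ch] += total
    total_letters = sum(counts.values())
    return {ch: n / total_letters for ch, n in counts.items()}


freqs = letter_frequencies(word_totals)
print(round(freqs["a"], 6))    # 0.031048
print(freqs["b"], freqs["c"])  # 0.0 0.0
```

Starting from a dictionary that already holds every letter from a to z means unused letters such as 'b' and 'c' come out as 0.0 rather than being missing.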
This is what I have so far, but it doesn't work at all and I'm out of ideas:
from wordData import *

def letterFreq(words):
    totalLetters = 0
    letterDict = {'a':0,'b':0,'c':0,'d':0,'e':0,'f':0,'g':0,'h':0,'i':0,'j':0,'k':0,'l':0,'m':0,'n':0,'o':0,'p':0,'q':0,
                  'r':0,'s':0,'t':0,'u':0,'v':0,'w':0,'x':0,'y':0,'z':0}
    for word in words:
        totalLetters += totalOccurances(word, words) * len(word)
        for char in range(0, len(word)):
            for letter in letterDict:
                if letter == word[char]:
                    for year in words[word]:
                        letterDict[letter] += year.count
    for letters in letterDict:
        letterDict[letters] /= totalLetters
    print(letterDict)

def main():
    filename = "data/very_short.csv"
    words = readWordFile(filename)
    letterFreq(words)

if __name__ == '__main__':
    main()

Posted on 2014-11-24 06:12:40
If you want a count of all the letters in the file, use a collections.Counter dict:
from collections import Counter

c = Counter()
with open("input.txt") as f:
    for line in f:
        c.update(line.split(",")[0])
print(c)
Counter({'e': 16, 'r': 12, 'd': 8, 'a': 6, 't': 6, 'n': 4, 'q': 4, 's': 4, 'u': 4, 'w': 4, 'i': 2, 'o': 2, 'p': 2})
To get the totals, just multiply each word by the number of times it appears:
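from collections import Counter

c = Counter()
with open("input.txt") as f:
    for line in f:
        # Fields are comma-separated (word, year, count), so split on ","
        # rather than on whitespace; int() ignores the surrounding spaces.
        word, year, count = line.split(",")
        # Repeat the word `count` times so each letter is weighted by usage.
        c.update(word * int(count))
print(c["a"] / float(sum(c.values())))
https://stackoverflow.com/questions/27094946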