Python: Word Frequency Statistics for Big Data

Ed_Frey

Published 2020-03-31 17:02:28

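This article is part of the column 奔跑的键盘侠

This is the 170th article from 奔跑的键盘侠

Author | 我是奔跑的键盘侠

Source | 奔跑的键盘侠 (ID: runningkeyboardhero)

Please contact the author for permission before reposting (WeChat ID: ctwott)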

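Ta-da, here I am!

Today I'll walk through a way to count word frequencies. Put grandly, it is big data analysis; once you have read it, you will see it takes only a handful of lines of code.

It has plenty of uses: counting how often words occur in an article, tracking trending terms online, or building rankings of popular names or tourist attractions. The same approach applies to all of these.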

coding

#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm

text = '''O, that this too too solid flesh would melt
Thaw and resolve itself into a dew!
Or that the Everlasting had not fix'd
His canon 'gainst self-slaughter! O God! God!
How weary, stale, flat and unprofitable, 
Seem to me all the uses of this world!
Fie on't! ah fie! 'tis an unweeded garden,
That grows to seed; things rank and gross in nature
Possess it merely. That it should come to this!
But two months dead: nay, not so much, not two: 
So excellent a king; that was, to this,
Hyperion to a satyr; so loving to my mother
That he might not beteem the winds of heaven
Visit her face too roughly. Heaven and earth!
Must I remember? why, she would hang on him, 
As if increase of appetite had grown'''

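# Flatten the text to one line, then split on single spaces.
# Punctuation stays attached to words and matching is case-sensitive.
all_strings = text.replace("\n", " ")
words = all_strings.split(" ")

# Count how many times each token occurs.
stat_counter = {}
for word in words:
    if word in stat_counter:
        stat_counter[word] += 1
    else:
        stat_counter[word] = 1

# Sort the keys by their counts, largest first, and keep the top 10.
result = sorted(stat_counter, key=stat_counter.get, reverse=True)[:10]
for key in result:
    print("%s:%d" % (key, stat_counter[key]))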

The test output is as follows:

to:6
and:4
not:4
that:3
too:3
a:3
the:3
of:3
That:3
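One detail about this output: the listing shows nine entries although the slice `[:10]` asks for ten. A likely explanation, assuming the trailing spaces at the end of three lines of `text` are present in the original script, is that replacing each newline with a space leaves two consecutive spaces, and `split(" ")` then produces an empty-string token. That token would print as `:3`, which is easy to lose when copying output. The sketch below uses a made-up `line` to show the difference between `split(" ")` and `split()`:

```python
# Two consecutive spaces, as left behind when "word, \n" becomes "word,  ".
line = "unprofitable,  Seem to me"

by_space = line.split(" ")   # splits on every single space
by_whitespace = line.split() # splits on runs of whitespace

print(by_space)       # ['unprofitable,', '', 'Seem', 'to', 'me']
print(by_whitespace)  # ['unprofitable,', 'Seem', 'to', 'me']
```

Calling `split()` with no argument also handles newlines, so the `replace("\n", " ")` step would no longer be needed.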

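The ranking relies on the built-in sorted function with a key function: iterating over the dictionary yields its keys, key=stat_counter.get orders them by their counts, and reverse=True puts the largest first.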
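To make the sorting step concrete, here is a minimal sketch with a small dictionary. The `counts` data is made up for illustration and is not the full result of the script.

```python
# Hypothetical counts, for illustration only.
counts = {"to": 6, "a": 3, "and": 4, "not": 4}

# sorted() iterates over the dict's keys; key=counts.get ranks each key
# by its value, and reverse=True puts the largest counts first.
top = sorted(counts, key=counts.get, reverse=True)[:3]
print(top)  # ['to', 'and', 'not']
```

Because sorted is stable, keys with equal counts ("and" and "not" here) keep the order in which they were first inserted.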

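A supplementary note on Counter

Python's standard-library module collections provides a Counter class. It is very capable and may come in handy for experimental work, though the example here differs from the word count above. Counter tallies whatever items an iterable yields; given a string, it iterates character by character, so it counts individual letters or individual Chinese characters. The code is also more concise.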

#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm
from collections import Counter

text= '''清明时节雨纷纷，路上行人欲断魂。
借问酒家何处有？牧童遥指杏花村。'''
stat = Counter(text.replace("\n",""))
print(stat.most_common(5))

The output is as follows:

[('纷', 2), ('。', 2), ('清', 1), ('明', 1), ('时', 1)]
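Counter is not limited to characters: it counts whatever items the iterable yields, so passing it a list of words gives word frequencies directly. The following is a minimal sketch rather than the author's method. The helper name `top_words` and the regular expression are assumptions for illustration; lowercasing merges variants such as "That" and "that", and the pattern drops punctuation other than apostrophes.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent words, ignoring case and punctuation."""
    # Lowercase first, then keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

sample = "That it should come to this! But two months dead: nay, not so much, not two"
print(top_words(sample, 3))  # [('two', 2), ('not', 2), ('that', 1)]
```

This pattern only suits English text; Chinese text has no spaces between words and would need a word segmentation step first.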

Finally, here is an excerpt from Counter's built-in help text, for reference:

'''
Help on class Counter in module collections:
class Counter(builtins.dict)
 |  Dict subclass for counting hashable items.  Sometimes called a bag
 |  or multiset.  Elements are stored as dictionary keys and their counts
 |  are stored as dictionary values.
 |
 |  >>> c = Counter('abcdeabcdabcaba')  # count elements from a string
 |
 |  >>> c.most_common(3)                # three most common elements
 |  [('a', 5), ('b', 4), ('c', 3)]
 |  >>> sorted(c)                       # list all unique elements
 |  ['a', 'b', 'c', 'd', 'e']
 |  >>> ''.join(sorted(c.elements()))   # list elements with repetitions
 |  'aaaaabbbbcccdde'
 |  >>> sum(c.values())                 # total of all counts
 |  15
 |
 |  >>> c['a']                          # count of letter 'a'
 |  5
 |  >>> for elem in 'shazam':           # update counts from an iterable
 |  ...     c[elem] += 1                # by adding 1 to each element's count
 |  >>> c['a']                          # now there are seven 'a'
 |  7
 |  >>> del c['b']                      # remove all 'b'
 |  >>> c['b']                          # now there are zero 'b'
 |  0
 |
 |  >>> d = Counter('simsalabim')       # make another counter
 |  >>> c.update(d)                     # add in the second counter
 |  >>> c['a']                          # now there are nine 'a'
 |  9
 |
 |  >>> c.clear()                       # empty the counter
 |  >>> c
 |  Counter()
 |
 |  Note:  If a count is set to zero or reduced to zero, it will remain
 |  in the counter until the entry is deleted or the counter is cleared:
 |
 |  >>> c = Counter('aaabbc')
 |  >>> c['b'] -= 2                     # reduce the count of 'b' by two
 |  >>> c.most_common()                 # 'b' is still in, but its count is zero
 |  [('a', 3), ('c', 1), ('b', 0)]
'''
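The `update` method shown in the excerpt suggests one way to handle text that is too large to load at once: feed the counter one line at a time and let it accumulate. This is a minimal sketch under stated assumptions. The helper name `count_words` is hypothetical, and `io.StringIO` stands in for a real file object.

```python
import io
from collections import Counter

def count_words(lines):
    """Accumulate word counts from an iterable of lines."""
    total = Counter()
    for line in lines:
        # update() adds to existing counts instead of replacing them.
        total.update(line.lower().split())
    return total

# io.StringIO stands in for a large file opened with open(..., encoding="utf-8").
fake_file = io.StringIO("to be or not to be\nthat is the question\n")
counts = count_words(fake_file)
print(counts.most_common(2))  # [('to', 2), ('be', 2)]
```

Only one line and the running totals are held in memory at any moment, so memory use grows with the vocabulary size, not with the length of the text.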

-END-

An original work by 奔跑的键盘侠 | Feel free to share on WeChat Moments | Please contact the author for permission before reposting

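This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2020-03-29. In case of infringement, please contact cloudcommunity@tencent.com for removal.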
