Python: Word Frequency Counting for Big Data

Ed_Frey
Published 2020-03-31 17:02:28
This article is included in the column: 奔跑的键盘侠

This is the 170th article from 奔跑的键盘侠

Author | 我是奔跑的键盘侠

Source | 奔跑的键盘侠 (ID: runningkeyboardhero)

Please contact the author for permission before reposting (WeChat ID: ctwott)

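Ta-da, here I am!

Today's topic is a way to count word frequencies. Put grandly, it is big data analysis; once you have read through it, though, it comes down to a few lines of code.

It has plenty of uses: counting how often words appear in an article or tracking trending terms online, and the same approach carries over to things like baby-name rankings or lists of popular tourist destinations.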

1

coding

#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm

text = '''O, that this too too solid flesh would melt
Thaw and resolve itself into a dew!
Or that the Everlasting had not fix'd
His canon 'gainst self-slaughter! O God! God!
How weary, stale, flat and unprofitable, 
Seem to me all the uses of this world!
Fie on't! ah fie! 'tis an unweeded garden,
That grows to seed; things rank and gross in nature
Possess it merely. That it should come to this!
But two months dead: nay, not so much, not two: 
So excellent a king; that was, to this,
Hyperion to a satyr; so loving to my mother
That he might not beteem the winds of heaven
Visit her face too roughly. Heaven and earth!
Must I remember? why, she would hang on him, 
As if increase of appetite had grown'''

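# Split on any run of whitespace (spaces and newlines alike);
# unlike split(" "), this never produces empty strings
words = text.split()

# Count how often each word appears (case-sensitive, punctuation kept)
stat_counter = {}
for word in words:
    stat_counter[word] = stat_counter.get(word, 0) + 1

# Keep the ten most frequent words; ties stay in first-seen order
result = sorted(stat_counter, key=stat_counter.get, reverse=True)[:10]
for key in result:
    print("%s:%d" % (key, stat_counter[key]))

The test output is as follows:

to:6
and:4
not:4
that:3
too:3
a:3
the:3
of:3
That:3
this:2

The ranking relies on the built-in sorted function: passing key=stat_counter.get sorts the words by their counts, and reverse=True puts the largest counts first.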
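To make the sorting step a little more concrete, here is a minimal sketch of ranking a dictionary by value with sorted. The counts dictionary and the helper name top_n are made up for illustration and are not part of the original script; the alphabetical tie-break is likewise an assumption about what a reader might want, not something the script above does.

```python
# Hypothetical counts, used only to illustrate the sorting step
counts = {"to": 6, "not": 4, "and": 4, "a": 3}


def top_n(counter, n):
    """Return the n (word, count) pairs with the highest counts.

    Ties are broken alphabetically, so the result does not depend on
    the order in which the words were first inserted.
    """
    # Sort by descending count first, then by the word itself
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


for word, count in top_n(counts, 3):
    print("%s:%d" % (word, count))
```

Sorting the (word, count) pairs from items() rather than the bare keys avoids a second dictionary lookup when printing.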

2

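A supplementary note on Counter

The collections module in Python's standard library provides Counter, which is also very powerful and may come in handy for experiment design. The usage below differs a little from the word count above: when a Counter is given a string, it treats each individual letter or character as the item to count (more generally, it counts the elements of whatever iterable it receives), and the code is even more concise.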
#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm
from collections import Counter

text= '''清明时节雨纷纷,路上行人欲断魂。
借问酒家何处有?牧童遥指杏花村。'''
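# Passing a string to Counter counts each character (newlines removed first)
stat = Counter(text.replace("\n", ""))
# most_common(5) returns the five most frequent (character, count) pairs
print(stat.most_common(5))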
The output is as follows:
[('纷', 2), ('。', 2), ('清', 1), ('明', 1), ('时', 1)]
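Counter is not limited to single characters: given a list of words, it counts words. The sketch below combines that with a simple normalization step, so that "That" and "that", or "this" and "this!", are counted together. The sample text is a few lines taken from the passage in the first section, and the regular expression is only one possible assumption about what counts as a word.

```python
import re
from collections import Counter

# A few lines from the passage used earlier, kept short for illustration
sample = """O, that this too too solid flesh would melt
Thaw and resolve itself into a dew!
That it should come to this!"""

# Lowercase the text, then treat runs of letters and apostrophes as words;
# punctuation and whitespace are dropped in the process
words = re.findall(r"[a-z']+", sample.lower())

# Counter accepts any iterable of hashable items, here a list of words
word_counts = Counter(words)
print(word_counts.most_common(3))
```

Whether to fold case or strip punctuation depends on the task; for a ranking of names, for example, the original capitalization may be worth keeping.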
Finally, here is an excerpt from the Counter help text for reference:
'''
Help on class Counter in module collections:
class Counter(builtins.dict)
 |  Dict subclass for counting hashable items.  Sometimes called a bag
 |  or multiset.  Elements are stored as dictionary keys and their counts
 |  are stored as dictionary values.
 |
 |  >>> c = Counter('abcdeabcdabcaba')  # count elements from a string
 |
 |  >>> c.most_common(3)                # three most common elements
 |  [('a', 5), ('b', 4), ('c', 3)]
 |  >>> sorted(c)                       # list all unique elements
 |  ['a', 'b', 'c', 'd', 'e']
 |  >>> ''.join(sorted(c.elements()))   # list elements with repetitions
 |  'aaaaabbbbcccdde'
 |  >>> sum(c.values())                 # total of all counts
 |  15
 |
 |  >>> c['a']                          # count of letter 'a'
 |  5
 |  >>> for elem in 'shazam':           # update counts from an iterable
 |  ...     c[elem] += 1                # by adding 1 to each element's count
 |  >>> c['a']                          # now there are seven 'a'
 |  7
 |  >>> del c['b']                      # remove all 'b'
 |  >>> c['b']                          # now there are zero 'b'
 |  0
 |
 |  >>> d = Counter('simsalabim')       # make another counter
 |  >>> c.update(d)                     # add in the second counter
 |  >>> c['a']                          # now there are nine 'a'
 |  9
 |
 |  >>> c.clear()                       # empty the counter
 |  >>> c
 |  Counter()
 |
 |  Note:  If a count is set to zero or reduced to zero, it will remain
 |  in the counter until the entry is deleted or the counter is cleared:
 |
 |  >>> c = Counter('aaabbc')
 |  >>> c['b'] -= 2                     # reduce the count of 'b' by two
 |  >>> c.most_common()                 # 'b' is still in, but its count is zero
 |  [('a', 3), ('c', 1), ('b', 0)]
'''
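The note at the end of the help text is easy to miss, so here is a small sketch of two ways to get rid of zero counts. It reuses the 'aaabbc' example from the help text; the variable name positive is chosen only for illustration.

```python
from collections import Counter

c = Counter("aaabbc")
c["b"] -= 2  # 'b' stays in the counter, now with a count of zero

# Unary plus builds a new Counter that keeps only the positive counts
positive = +c

# Deleting the key is the other way to drop the entry entirely
del c["b"]

print(positive)
print("b" in c)
```

Looking up a missing key afterwards, as in c["b"], returns 0 rather than raising KeyError, which matches the behaviour shown in the help text.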

-END-

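© Copyright

Original work by 奔跑的键盘侠 | Feel free to share on WeChat Moments | Please contact the author for permission before reposting

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.
Originally published: 2020-03-29. In case of infringement, contact cloudcommunity@tencent.com for removal.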
