Python: Word Frequency Counting for Big Data

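This is the 170th article from 奔跑的键盘侠 (Running Keyboard Hero)

Author | 我是奔跑的键盘侠

Source | 奔跑的键盘侠 (ID: runningkeyboardhero)

Please contact the author for permission to repost (WeChat ID: ctwott)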

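Ta-da, here I am!

Today I'll walk through a way to count word frequencies. Put grandly, it's big data analysis; once you've read it through, it turns out to be only a few lines of code.

It has plenty of uses: counting how often words occur in an article, finding trending terms online, or building things like baby-name rankings and popular tourist-destination rankings. The same approach applies to all of them.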

1

coding

#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm

text = '''O, that this too too solid flesh would melt
Thaw and resolve itself into a dew!
Or that the Everlasting had not fix'd
His canon 'gainst self-slaughter! O God! God!
How weary, stale, flat and unprofitable, 
Seem to me all the uses of this world!
Fie on't! ah fie! 'tis an unweeded garden,
That grows to seed; things rank and gross in nature
Possess it merely. That it should come to this!
But two months dead: nay, not so much, not two: 
So excellent a king; that was, to this,
Hyperion to a satyr; so loving to my mother
That he might not beteem the winds of heaven
Visit her face too roughly. Heaven and earth!
Must I remember? why, she would hang on him, 
As if increase of appetite had grown'''

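# Join the lines into one string by turning each newline into a space
all_strings = text.replace("\n", " ")
# Split on single spaces; lines ending in a space leave empty strings in the list
words = all_strings.split(" ")
# Count occurrences of each word (case-sensitive, punctuation kept)
stat_counter = {}
for word in words:
    if word in stat_counter:
        stat_counter[word] += 1
    else:
        stat_counter[word] = 1

# Sort the words by count, largest first, and keep the top ten
result = sorted(stat_counter, key=stat_counter.get, reverse=True)[:10]
for key in result:
    print("%s:%d" % (key, stat_counter[key]))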
The test output is as follows:

to:6

and:4

not:4

that:3

too:3

a:3

the:3

:3

of:3

That:3
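The entry with an empty key (:3) in the output above most likely comes from the three lines of the sample text that end in a trailing space: once each newline is replaced by a space, two spaces sit side by side, and split(" ") yields an empty string between them. Calling split() with no argument splits on any run of whitespace and does not produce empty strings. A minimal sketch, using a short two-line sample (an assumption for illustration, not the full text):

```python
# Hypothetical two-line sample; the first line ends with a trailing space
sample = "flat and unprofitable, \nSeem to me"

# After the replacement, the trailing space sits next to the new space
joined = sample.replace("\n", " ")

by_single_space = joined.split(" ")  # splits at every single space
by_whitespace = joined.split()       # splits on runs of whitespace

print(by_single_space)  # contains one empty string
print(by_whitespace)    # contains only the six words
```

Swapping split(" ") for split() in the script would drop the empty entry from the ranking, which would then change the tenth line of the output.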

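The ranking relies on the built-in sorted function (a function, not a keyword): passing key=stat_counter.get sorts the dictionary's keys by their counts, reverse=True puts the largest counts first, and the [:10] slice keeps the top ten.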
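To make the sorting step concrete, here is a small sketch on a hand-made dictionary (the counts are invented for illustration). It shows the same kind of call as in the script and, as one possible variation, a key that breaks ties alphabetically. Because sorted is stable, words with equal counts otherwise keep the order in which they were inserted into the dictionary.

```python
# Hypothetical counts, not taken from the sample text
counts = {"to": 6, "not": 4, "and": 4, "of": 3}

# Same pattern as the script: sort the keys by their values, largest first
by_count = sorted(counts, key=counts.get, reverse=True)
print(by_count)  # ties ("not", "and") stay in insertion order

# Variation: sort (word, count) pairs by descending count, then by word
by_count_then_word = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
for word, n in by_count_then_word:
    print("%s:%d" % (word, n))
```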

2

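A supplement: using Counter

The collections module in Python's standard library provides a Counter class that is also very powerful and may come in handy in experiment design. It counts the hashable items of whatever iterable it is given, so passing it a string, as below, counts individual letters or characters rather than words as in the example above. The code is also more concise.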
#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm
from collections import Counter

text= '''清明时节雨纷纷,路上行人欲断魂。
借问酒家何处有?牧童遥指杏花村。'''
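# Remove the line break so it is not counted, then count each character
stat = Counter(text.replace("\n",""))
# most_common(5) returns the five most frequent (character, count) pairs
print(stat.most_common(5))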
The output is:
[('纷', 2), ('。', 2), ('清', 1), ('明', 1), ('时', 1)]
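Counter is not limited to single characters: it counts whatever items the iterable yields, so handing it a list of words gives a word-frequency table directly. The sketch below is one possible way to combine it with light normalization (lowercasing, plus a simple regular expression that keeps only letters and apostrophes). The regular expression and the two-line sample are assumptions for illustration, not part of the original script.

```python
import re
from collections import Counter

# Two lines borrowed from the sample text in section 1
sample = """That it should come to this!
But two months dead: nay, not so much, not two:"""

# Lowercase first, then pull out runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", sample.lower())

# Passing a list makes Counter count whole words instead of characters
stat = Counter(words)
print(stat.most_common(3))  # 'two' and 'not' each occur twice
```

With this normalization "That" and "that" fall into the same bucket and trailing punctuation such as "this!" no longer creates a separate key.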
Finally, here is an excerpt from the Counter help text for reference:
'''
Help on class Counter in module collections:
class Counter(builtins.dict)
 |  Dict subclass for counting hashable items.  Sometimes called a bag
 |  or multiset.  Elements are stored as dictionary keys and their counts
 |  are stored as dictionary values.
 |
 |  >>> c = Counter('abcdeabcdabcaba')  # count elements from a string
 |
 |  >>> c.most_common(3)                # three most common elements
 |  [('a', 5), ('b', 4), ('c', 3)]
 |  >>> sorted(c)                       # list all unique elements
 |  ['a', 'b', 'c', 'd', 'e']
 |  >>> ''.join(sorted(c.elements()))   # list elements with repetitions
 |  'aaaaabbbbcccdde'
 |  >>> sum(c.values())                 # total of all counts
 |  15
 |
 |  >>> c['a']                          # count of letter 'a'
 |  5
 |  >>> for elem in 'shazam':           # update counts from an iterable
 |  ...     c[elem] += 1                # by adding 1 to each element's count
 |  >>> c['a']                          # now there are seven 'a'
 |  7
 |  >>> del c['b']                      # remove all 'b'
 |  >>> c['b']                          # now there are zero 'b'
 |  0
 |
 |  >>> d = Counter('simsalabim')       # make another counter
 |  >>> c.update(d)                     # add in the second counter
 |  >>> c['a']                          # now there are nine 'a'
 |  9
 |
 |  >>> c.clear()                       # empty the counter
 |  >>> c
 |  Counter()
 |
 |  Note:  If a count is set to zero or reduced to zero, it will remain
 |  in the counter until the entry is deleted or the counter is cleared:
 |
 |  >>> c = Counter('aaabbc')
 |  >>> c['b'] -= 2                     # reduce the count of 'b' by two
 |  >>> c.most_common()                 # 'b' is still in, but its count is zero
 |  [('a', 3), ('c', 1), ('b', 0)]
'''
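The note at the end of the excerpt says that an entry whose count drops to zero stays in the counter. As a small sketch of two ways to deal with that: apply unary plus, which returns a new Counter holding only the positive counts, or delete the key explicitly.

```python
from collections import Counter

c = Counter('aaabbc')
c['b'] -= 2               # 'b' drops to zero but stays in the counter
print(c.most_common())

positive = +c             # new Counter without zero or negative counts
print(positive)

del c['b']                # or remove the entry from the original counter
print('b' in c)
```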

-END-

© Copyright

Original work by 奔跑的键盘侠 | Feel free to share on WeChat Moments | Please contact the author for permission to repost

Originally published: 2020-03-29
