Remember the last day of 2020, when Director Guo and screenwriter Yu owned up to plagiarism at the very same moment and offered magnanimous apologies? The two were evenly matched rivals, neither willing to yield an inch. For a while, the melon-eating masses even speculated that the pair would jointly set up an anti-plagiarism fund, to cut off the "trend" of plagiarism (read between the parentheses: their own money-making path)!
So let's see what netizens had to say. There are already plenty of tools for scraping Weibo comments; here I'll introduce a small one I wrote myself.
It is open-sourced on GitHub. Personally I find it quite handy and simple to use — see the screenshot below.
For the backstory of why I wrote this tool, see the link below.
If you ever need to scrape Weibo on a small scale, give it a try.
GitHub: https://github.com/zhouwei713/weibo_spider
After the program finishes, we get two files: one with the Weibo post information and one with the comments.
Post information:
Comments:
A simple analysis snippet follows:
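Before diving into the word cloud, it's worth loading the comment file with pandas for a quick sanity check. A minimal sketch — the column names `user` and `comment` below are assumptions standing in for whatever the spider actually writes:

```python
import io
import pandas as pd

# A tiny in-memory sample standing in for the spider's comment CSV;
# the real file and its column names may differ (assumed: user, comment).
sample = io.StringIO(
    "user,comment\n"
    "张三,支持原创\n"
    "李四,道歉来了\n"
)
df = pd.read_csv(sample)
print(df.shape)             # number of comments x number of columns
print(df["comment"].head()) # peek at the first few comments
```

For the real file you would simply call `pd.read_csv('郭导.csv')` and inspect the result the same way.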
import pandas as pd
import jieba
from pyecharts import options as opts
from pyecharts.charts import WordCloud

guo = pd.read_csv('郭导.csv')
df_copy = guo.copy()
df_copy['comment'] = df_copy['comment'].apply(lambda x: str(x).split())  # strip whitespace inside comments
df_list = df_copy.values.tolist()
comment = jieba.lcut(str(df_list), cut_all=False)  # tokenize with jieba (precise mode)

# count word frequencies, skipping punctuation, digits and common stop words
counts = {}
excludes = {",", ":", "“", "。", "”", "、", ";", " ", "!", "?", " ", ",", "'", "[", "]",
            "1", "2", "3", "4", "5", "6", "7", "8", "9", "0",
            "@", "@ ", "_", "…", "",
            "的", "你", "是", "了", "我", "不", "他", "她", "回复",
            "都", "也", "就", "人", "有", "说", "吗", "在", "啊", "吧",
            "还", "呢", "被", "和", "没", "给", "这", "很", "能"}
for word in comment:
    if word not in excludes:
        counts[word] = counts.get(word, 0) + 1

# print the 15 most frequent words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

# render the word cloud
c = (
    WordCloud()
    .add(
        "",
        items,
        word_size_range=[20, 100],
        textstyle_opts=opts.TextStyleOpts(font_family="cursive"),
    )
    .set_global_opts(title_opts=opts.TitleOpts(title="WordCloud - custom text style"))
)
c.render_notebook()
First, the word cloud built from the comments under Director Guo's post:
Below is the word cloud for screenwriter Yu's comments — compare the two for yourself, haha.
Finally, let's take a quick look at the high-frequency words shared by both comment sections.
Hmm~ the vibe is remarkably consistent. Satisfying!
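The comparison itself can be done with `collections.Counter`: count each tokenized comment set separately, then intersect the top words. A sketch — the token lists below are hard-coded stand-ins for the jieba output above:

```python
from collections import Counter

# Stand-in token lists; in practice these come from jieba.lcut on each CSV
guo_words = ["原创", "道歉", "支持", "原创", "抄袭"]
yu_words = ["道歉", "原创", "吃瓜", "道歉"]

guo_top = Counter(guo_words)
yu_top = Counter(yu_words)

# words that rank among the most frequent under BOTH posts
common = {w for w, _ in guo_top.most_common(3)} & {w for w, _ in yu_top.most_common(3)}
print(common)
```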
Well, that's it for today's share.
As always:
Original content isn't easy to make — give it a "在看" before you go!