Douban short-review scraper for The Eight Hundred (八佰)
Parse the pages with regular expressions to extract the data, then draw a word cloud with wordcloud.
# Data collection
import requests
import re
import csv
import jieba
import wordcloud
# Crawl multiple pages with a loop
# Observe the pattern in the page URLs:
# https://movie.douban.com/subject/26754233/comments?start=0&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=20&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=40&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=60&limit=20&sort=new_score&status=P
# Each page holds 20 comments and `start` advances from 0, so step the loop by 20; originally planned for 1000 pages
# Note: that was an overestimate, there are nowhere near 1000 pages, so the range has been reduced
page = []
for i in range(0, 80, 20):
    page.append(i)
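The same offsets can be built in a single step; a minimal sketch (the helper name `page_offsets` is my own, not from the script):

```python
def page_offsets(n_pages, per_page=20):
    # start = 0, 20, 40, ...: one offset per page of 20 comments
    return list(range(0, n_pages * per_page, per_page))

print(page_offsets(4))  # [0, 20, 40, 60]
```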
with open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\豆瓣八佰爬虫\短评.csv', 'a', newline='', encoding='utf-8') as f:
    # Create the writer once, outside the loop
    writer = csv.writer(f)
    for i in page:
        url = 'https://movie.douban.com/subject/26754233/comments?start=' + str(i) + '&limit=20&sort=new_score&status=P'
        headers = {
            'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'
        }
        resp = requests.get(url, headers=headers)
        html = resp.text
        # Parse the page
        res = re.compile('<span class="short">(.*?)</span>')
        duanpin = re.findall(res, html)
        # Save the data: one review per CSV row
        for duan in duanpin:
            writer.writerow([duan])
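The non-greedy `(.*?)` in the pattern is what makes each review come out as a separate match; a greedy `(.*)` would swallow everything from the first opening tag to the last closing tag. A self-contained check on made-up sample markup:

```python
import re

# Two reviews in one line of (fabricated) page source
sample = ('<span class="short">第一条短评</span>'
          '<span class="short">第二条短评</span>')

pattern = re.compile('<span class="short">(.*?)</span>')
print(pattern.findall(sample))  # ['第一条短评', '第二条短评']

# The greedy variant merges both reviews into a single match
greedy = re.compile('<span class="short">(.*)</span>')
print(greedy.findall(sample))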
# Draw the short-review word cloud
with open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\豆瓣八佰爬虫\短评.csv', encoding='utf-8') as f:
    txt = f.read()
txt_list=jieba.lcut(txt)
string=' '.join(txt_list)
w=wordcloud.WordCloud(
width=1000,
height=700,
background_color='white',
font_path="msyh.ttc",
scale=15,
stopwords={" "},
contour_width=5,
contour_color='red'
)
w.generate(string)
w.to_file(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\豆瓣八佰爬虫\八佰.png')
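WordCloud expects one space-separated string, which is why the jieba tokens are joined with `' '.join(...)` above. A minimal stand-in for that step, assuming the text has already been tokenized (the helper name and the stopword set are my own illustrations):

```python
def build_wordcloud_input(tokens, stopwords=frozenset({'的', '了'})):
    # Drop stopwords and whitespace-only tokens, then join with spaces
    # in the format WordCloud.generate() expects
    return ' '.join(t for t in tokens if t.strip() and t not in stopwords)

print(build_wordcloud_input(['电影', '的', '历史', ' ', '题材']))  # 电影 历史 题材
```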
This crawl returned far fewer short reviews than expected: the page source only contains a handful, which left me puzzled. Perhaps the page needs to be fetched as the mobile version instead, or perhaps that really is all there is; who knows. Judging from the word cloud, The Eight Hundred still markets itself under the banner of history, and a historically nihilist film like that is best skipped; Guan Hu's stance has never been straight.
I've been doing a bit too much scraping and hobby Python lately; time to shift back to data analysis.
love&peace