Douban short-review scraper for The Eight Hundred (八佰)
Parse the pages with regular expressions to extract the data, then draw a word cloud with wordcloud.
# Data collection
import requests
import re
import csv
import jieba
import wordcloud
# Crawl multiple pages with a loop
# Observe the pattern in the page URLs:
# https://movie.douban.com/subject/26754233/comments?start=0&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=20&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=40&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=60&limit=20&sort=new_score&status=P
# Each page holds 20 comments and `start` advances from 0, so step the loop by 20; originally planned for 1000 pages
# Note: that was an overestimate, there are nowhere near 1000 pages, so the range has been reduced
page = []
for i in range(0, 80, 20):
    page.append(i)
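The same offsets can be built in a single step; a minimal sketch (the helper name `page_offsets` is my own, not from the script):

```python
def page_offsets(n_pages, per_page=20):
    # start = 0, 20, 40, ...: one offset per page of 20 comments
    return list(range(0, n_pages * per_page, per_page))

print(page_offsets(4))  # [0, 20, 40, 60]
```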
with open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\豆瓣八佰爬虫\短评.csv', 'a', newline='', encoding='utf-8') as f:
    # Create the writer once, outside the loop
    writer = csv.writer(f)
    for i in page:
        url = 'https://movie.douban.com/subject/26754233/comments?start=' + str(i) + '&limit=20&sort=new_score&status=P'
        headers = {
            'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'
        }
        resp = requests.get(url, headers=headers)
        html = resp.text
        # Parse the page
        res = re.compile('<span class="short">(.*?)</span>')
        duanpin = re.findall(res, html)
        # Save the data: one review per CSV row
        for duan in duanpin:
            writer.writerow([duan])
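The non-greedy `(.*?)` in the pattern is what makes each review come out as a separate match; a greedy `(.*)` would swallow everything from the first opening tag to the last closing tag. A self-contained check on made-up sample markup:

```python
import re

# Two reviews in one line of (fabricated) page source
sample = ('<span class="short">第一条短评</span>'
          '<span class="short">第二条短评</span>')

pattern = re.compile('<span class="short">(.*?)</span>')
print(pattern.findall(sample))  # ['第一条短评', '第二条短评']

# The greedy variant merges both reviews into a single match
greedy = re.compile('<span class="short">(.*)</span>')
print(greedy.findall(sample))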
# Draw the short-review word cloud
with open(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\豆瓣八佰爬虫\短评.csv', encoding='utf-8') as f:
    txt = f.read()
txt_list=jieba.lcut(txt)
string=' '.join(txt_list)
w=wordcloud.WordCloud(
width=1000,
height=700,
background_color='white',
font_path="msyh.ttc",
scale=15,
stopwords={" "},
contour_width=5,
contour_color='red'
)
w.generate(string)
w.to_file(r'D:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\豆瓣八佰爬虫\八佰.png')
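WordCloud expects one space-separated string, which is why the jieba tokens are joined with `' '.join(...)` above. A minimal stand-in for that step, assuming the text has already been tokenized (the helper name and the stopword set are my own illustrations):

```python
def build_wordcloud_input(tokens, stopwords=frozenset({'的', '了'})):
    # Drop stopwords and whitespace-only tokens, then join with spaces
    # in the format WordCloud.generate() expects
    return ' '.join(t for t in tokens if t.strip() and t not in stopwords)

print(build_wordcloud_input(['电影', '的', '历史', ' ', '题材']))  # 电影 历史 题材
```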
This crawl returned far fewer short reviews than expected: the page source only contains a handful, which left me puzzled. Perhaps the page needs to be fetched as the mobile version instead, or perhaps that really is all there is; who knows. Judging from the word cloud, The Eight Hundred still markets itself under the banner of history, and a historically nihilist film like that is best skipped; Guan Hu's stance has never been straight.
I've been doing a bit too much scraping and hobby Python lately; time to shift back to data analysis.
love&peace