Well, today's challenge was scraping movies. I found that entering from a different page no longer gives you the Ajax version, so the steps are pretty much the same as for the book list, haha.
Since I was testing while writing, I kept re-running the script. At some point it stopped returning anything, and after some checking it turned out I'd been sending requests too frequently and got my IP banned!! So it was back to proxies again; the one from a few days ago had expired, so I found a new one.
Enough talk, here's the code:
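In hindsight, the ban was probably avoidable by pacing the requests. Note that a single time.sleep() placed before the loop only delays the first request; the pause has to happen inside the loop, once per page. A tiny helper along these lines might do it (the 3-second base is just a guess, not any documented Douban limit):

```python
import random
import time

def polite_pause(base=3.0, jitter=2.0):
    """Sleep for base..base+jitter seconds so requests don't arrive in a burst."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Randomizing the delay also makes the traffic look less mechanical than a perfectly fixed interval.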
import requests
from lxml import etree
import csv
import time

url = "https://movie.douban.com/top250?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
datas = []
for page in range(0, 226, 25):
    time.sleep(3)  # pause between pages so the IP doesn't get banned again
    params = {
        "start": page,
        "filter": ""
    }
    # fetch the page
    response = requests.get(url=url, headers=headers, params=params, proxies={"https": "134.35.238.207:8080"})
    print("status_code: " + str(response.status_code))
    tree = etree.HTML(response.text)
    # list of movies on this page
    movie_list = tree.xpath("//ol[@class='grid_view']/li")
    # iterate over the movies
    for movies in movie_list:
        data = []
        # movie title
        movie_name = movies.xpath("./div/div/a/img/@alt")[0]  # forgot the index here at first
        data.append(movie_name)
        # rating
        movie_rating = movies.xpath(".//div[@class='star']/span[2]/text()")[0]  # same here!!
        data.append(movie_rating)
        # number of ratings
        remark_number = movies.xpath(".//div[@class='star']/span[4]/text()")[0]
        data.append(remark_number)
        # short quote
        movie_remark = movies.xpath(".//p[@class='quote']/span/text()")[0]
        data.append(movie_remark)
        datas.append(data)
with open("D:/movie.csv", "w", encoding="utf-8-sig", newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(["Title", "Rating", "Number of ratings", "Quote"])
    for i in range(len(datas)):
        writer.writerow(datas[i])
print("all done")
Finally...
But for some reason the final results all come out wrapped in an extra pair of square brackets.
Let me think about it, haha. Each run takes forever (not sure if that's the proxy's fault), so I'm super cautious before every single one, sob.
Hmm... I think I see the problem now: I forgot to index into the results. Waiting on the final run (so afraid it'll error out!!)
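The stray brackets are exactly that: xpath() always returns a list, and writing the list itself into a CSV cell stringifies it, brackets and all. A quick standalone check, using a tiny inline snippet of HTML rather than the real Douban page:

```python
from lxml import etree

html = etree.HTML('<ol class="grid_view"><li><div><div><a>'
                  '<img alt="肖申克的救赎"/></a></div></div></li></ol>')

result = html.xpath("//li/div/div/a/img/@alt")
print(result)      # a list: ['肖申克的救赎']
print(result[0])   # the string itself: 肖申克的救赎
```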
Welp, sure enough...
Time to switch proxies again... sigh.
I've been hunting all evening. It's really hard; there just aren't any usable free proxies, and everything I've tried errors out.
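To avoid burning a whole slow run on a dead proxy, it can help to test each candidate first with a quick, short-timeout request. A rough sketch (the proxy addresses and test URL here are just placeholders):

```python
import requests

def check_proxy(proxy, test_url="https://movie.douban.com/robots.txt", timeout=5):
    """Return True if the proxy answers a simple GET within the timeout."""
    try:
        resp = requests.get(test_url,
                            proxies={"http": proxy, "https": proxy},
                            timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:  # timeouts, refused connections, bad proxies
        return False

# filter a candidate list down to the ones that respond
candidates = ["221.5.80.66:3128", "134.35.238.207:8080"]
working = [p for p in candidates if check_proxy(p, timeout=2)]
```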
------------ much later ------------
Finally!!
The quotes could only be printed as whole lists, because a few movies have no quote at all: indexing an empty result raises an error, and those movies were getting skipped entirely, leaving only 242 of the 250. Let me think of a fix... so tired, orz...
After repeated tweaks it finally worked!!
Exactly 250 movies! I even got to review how try works along the way. I'm so happy!!!! Mission finally accomplished, hehe. It took an entire day, from morning to night.
Posting the code one more time:
import requests
from lxml import etree
import csv

url = "https://movie.douban.com/top250?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
datas = []
for page in range(0, 226, 25):
    params = {
        "start": page,
        "filter": ""
    }
    # fetch the page
    response = requests.get(url=url, headers=headers, params=params, proxies={"https": "221.5.80.66:3128"})
    print("status_code: " + str(response.status_code))
    tree = etree.HTML(response.text)
    # list of movies on this page
    movie_list = tree.xpath("//ol[@class='grid_view']/li")
    # iterate over the movies
    for movies in movie_list:
        data = []
        # movie title
        movie_name = movies.xpath("./div/div/a/img/@alt")[0]
        data.append(movie_name)
        # rating
        movie_rating = movies.xpath(".//div[@class='star']/span[2]/text()")[0]
        data.append(movie_rating)
        # number of ratings
        remark_number = movies.xpath(".//div[@class='star']/span[4]/text()")[0]
        data.append(remark_number)
        # short quote: a few movies have none, so guard the indexing
        try:
            movie_remark = movies.xpath(".//p[@class='quote']/span/text()")[0]
        except IndexError:
            movie_remark = "No quote"
        data.append(movie_remark)
        datas.append(data)
with open("D:/movie.csv", "w", encoding="utf-8-sig", newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(["Title", "Rating", "Number of ratings", "Quote"])
    for i in range(len(datas)):
        writer.writerow(datas[i])
print("all done")
print(datas)
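As an aside, the same missing-quote problem can also be handled without try/except, using the fact that an empty list is falsy in Python. This is just an alternative sketch of the pattern, not a change to the code above:

```python
from lxml import etree

# an entry with no quote at all, and one that has a quote
no_quote = etree.HTML('<li><div class="star"></div></li>')
with_quote = etree.HTML('<li><p class="quote"><span>希望让人自由。</span></p></li>')

def get_remark(node):
    # `or` falls back to the default list when xpath() finds nothing,
    # so the [0] index is always safe
    return (node.xpath(".//p[@class='quote']/span/text()") or ["No quote"])[0]

print(get_remark(no_quote))    # No quote
print(get_remark(with_quote))  # 希望让人自由。
```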
Notes to self:
1. To write rows to a CSV, shape the data as [["xx","xx"],["xx","xx"],["xx","xx"]]: each inner list becomes one row, and its elements become the columns (two columns here).
2. Watch the type of what each call returns; some results are lists and need an index to extract the value.
3. Be patient!!! Don't keep firing off requests, or your IP will get banned!!
4. Keep hunting for usable free proxies (free stuff is always the sweetest).
5. The overall approach matters most: be clear about what each step produces, and write comments where it helps!!
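For note 1, csv.writer also has a writerows() method that takes the whole list of lists in one call, which is a bit tidier than looping over writerow(). Sketched here with io.StringIO standing in for a real file:

```python
import csv
import io

rows = [["肖申克的救赎", "9.7"], ["霸王别姬", "9.6"], ["阿甘正传", "9.5"]]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Title", "Rating"])  # header row
writer.writerows(rows)                # one call for all the data rows

print(buf.getvalue())
```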