Scraping Movie Data from 电影天堂 (Dianying Tiantang)

Python知识大全

Published 2020-02-13 13:57:57


This article is part of the column: Python 知识大全

Reading time: about 2 minutes

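Yours truly has been wanting to unwind lately, and watching a few movies seemed like the best way to do it, so I went to a site everyone knows, 电影天堂 (Dianying Tiantang), to see what was on offer. To make it easier to "pick" movies, I scraped a few dozen pages of data. Enough talk, let's get to work.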

How it works:

Build the target URLs:

def page_urls():
    # List-page template; {} is replaced by the page number
    baseurl = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    for i in range(1, 30):  # pages 1 through 29
        url = baseurl.format(i)
        parse_url(url)

Changing the number substituted into {} is all it takes to move from one page to the next.
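
As a quick sanity check of the paging scheme, the sketch below only builds the list-page URLs and makes no requests. The helper name `list_page_urls` and its parameters are hypothetical; the URL template and the page range (1 through 29, since `range(1, 30)` stops before 30) come from the code above.

```python
# URL template copied from the post; {} takes the page number
BASE_URL = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'


def list_page_urls(first=1, last=29):
    """Yield list-page URLs for pages first..last, inclusive (hypothetical helper)."""
    for page in range(first, last + 1):
        yield BASE_URL.format(page)


urls = list(list_page_urls())
print(len(urls))
print(urls[0])
print(urls[-1])
```

Printing the first and last URL before scraping makes an off-by-one in the page range easy to spot.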

Scrape the movie detail URLs:

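def parse_url(url):
    # headers is assumed to be a module-level dict of request headers (not shown in this post)
    domain = 'http://www.ygdy8.net'  # site root; baseurl from page_urls() is not visible here
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)
    # Collect the detail-page links inside each <table class="tbspan">
    tables = html.xpath('//table[@class="tbspan"]//a/@href')
    for table_url in tables:
        detail_url = domain + table_url
        spider(detail_url)  # assumed next step; the original excerpt stops after building the URL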
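
The hrefs returned by the XPath are presumably site-relative paths, so they need the site root in front of them before they can be requested. Below is a minimal sketch using the standard library's `urljoin`. The names `SITE_ROOT` and `to_absolute` are hypothetical, and the sample path is made up for illustration.

```python
from urllib.parse import urljoin

SITE_ROOT = 'http://www.ygdy8.net'  # assumed site root, taken from the list-page URL


def to_absolute(hrefs, root=SITE_ROOT):
    """Turn relative hrefs into absolute URLs; absolute hrefs pass through unchanged."""
    return [urljoin(root, href) for href in hrefs]


sample = ['/html/gndy/dyzz/example/12345.html']  # made-up path, not a real page
print(to_absolute(sample))
```

Compared with plain string concatenation, `urljoin` also copes with hrefs that are already absolute or that lack a leading slash.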

Required modules:

import time
import random
import requests
from lxml import etree
import csv
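
The snippets in this post pass a `headers` variable to `requests.get`, but its definition is not shown. Here is a minimal sketch of what it might look like, together with a small random pause between requests that uses the `time` and `random` modules. The User-Agent string and the `polite_delay` helper are assumptions, not the author's actual values.

```python
import random
import time

# Assumed value: any ordinary browser User-Agent string would do here
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}


def polite_delay(low=1, high=2, sleep=time.sleep):
    """Sleep a random whole number of seconds in [low, high] and return it.

    The sleep function is injectable so the helper can be exercised without waiting.
    """
    seconds = random.randint(low, high)
    sleep(seconds)
    return seconds
```

Calling `polite_delay()` between requests keeps the crawl from hitting the site too quickly.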

Main program (it is a bit long, so only part of it is shown):

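def spider(page_urls):
    data = {}
    response = requests.get(page_urls, headers=headers)
    # Decode the raw bytes as GBK explicitly instead of relying on requests' guess
    html = etree.HTML(response.content.decode('gbk'))
    title = html.xpath('//div[@class="title_all"]//font[@color="#07519a"]/text()')[0]
    data['名字'] = title  # 名字: title
    # Some pages carry fewer images than expected, so guard each lookup
    try:
        images = html.xpath('//div[@id="Zoom"]//img/@src')[1]
    except IndexError:
        images = None
        print("套路深！")  # roughly "sneaky page!"
    try:
        posters = html.xpath('//div[@id="Zoom"]//img/@src')[0]
    except IndexError:
        posters = None  # avoids a NameError on the next line
        print("套路深！!")
    data['海报'] = posters  # 海报: poster
    # time.sleep(random.randint(1, 2))
    zoom_ = html.xpath('//div[@id="Zoom"]')[0]
    infos = zoom_.xpath('.//text()')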
    for info in infos:
        # Each field line starts with a ◎ label; strip the label and keep the value
        if info.startswith('◎年　　代'):
            info1 = info.replace('◎年　　代', '').strip()
            data['年代'] = info1
        elif info.startswith('◎产　　地'):
            info2 = info.replace('◎产　　地', '').strip()
            data['产地'] = info2
        elif info.startswith('◎类　　别'):
            info3 = info.replace('◎类　　别', '').strip()
            data['类别'] = info3
        elif info.startswith('◎语　　言'):
            info4 = info.replace('◎语　　言', '').strip()
            data['语言'] = info4
        elif info.startswith('◎上映日期'):
            info5 = info.replace('◎上映日期', '').strip()
            data['上映日期'] = info5
        elif info.startswith('◎豆瓣评分'):
            info6 = info.replace('◎豆瓣评分', '').strip()
            info6 = ''.join(info6.split('/')[:1])  # keep only the score before the first "/"
            data['豆瓣评分'] = info6
        elif info.startswith('◎片　　长'):
            info7 = info.replace('◎片　　长', '').strip()
            data['片长'] = info7
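
The chain of `elif` branches above repeats one pattern: match a label, strip it, store the value. The same logic can be sketched with a lookup table. This is an alternative layout, not the author's code. The `parse_infos` name and the sample lines are made up for illustration, while the labels and keys are copied from the spider.

```python
# Site label -> key used in the data dict (both copied from the spider code)
FIELDS = {
    '◎年　　代': '年代',        # year
    '◎产　　地': '产地',        # country
    '◎类　　别': '类别',        # genre
    '◎语　　言': '语言',        # language
    '◎上映日期': '上映日期',    # release date
    '◎豆瓣评分': '豆瓣评分',    # Douban rating
    '◎片　　长': '片长',        # runtime
}


def parse_infos(infos):
    """Extract labelled fields from a list of text lines (hypothetical helper)."""
    data = {}
    for info in infos:
        for label, key in FIELDS.items():
            if info.startswith(label):
                value = info.replace(label, '').strip()
                if key == '豆瓣评分':
                    value = value.split('/')[0]  # keep only the score before "/"
                data[key] = value
                break
    return data


# Made-up sample lines in the shape the spider expects
sample = ['◎年　　代 2018', '◎豆瓣评分 8.7/10 from 100 users', 'unrelated line']
print(parse_infos(sample))
```

Adding a new field then means adding one entry to `FIELDS` rather than another branch.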

Result:

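Browsing movies this way is really convenient! In the end, going by rating and genre, yours truly picked a few: 《头号玩家》 (Ready Player One), 《江湖儿女》 (Ash Is Purest White), and 《调音师》 (Andhadhun). They turned out to be pretty good! The main reason for all this, of course, is that I can't afford a streaming membership.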
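
The excerpt does not show how the rows were saved or filtered. Since `csv` is among the imports, one plausible approach is sketched below: write the scraped dicts to CSV, then pick movies by rating and genre. The rows are made-up placeholders, and `write_csv` and `pick` are hypothetical helpers.

```python
import csv
import io

# Made-up placeholder rows; real rows would come from the scraper
rows = [
    {'名字': 'Movie A', '类别': '剧情/犯罪', '豆瓣评分': '8.1'},
    {'名字': 'Movie B', '类别': '喜剧', '豆瓣评分': '6.4'},
    {'名字': 'Movie C', '类别': '科幻/冒险', '豆瓣评分': '8.7'},
]


def write_csv(rows, fileobj):
    """Write the rows with a header line to any text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=['名字', '类别', '豆瓣评分'])
    writer.writeheader()
    writer.writerows(rows)


def pick(rows, min_score=8.0, genre=None):
    """Return rows rated at least min_score (optionally matching genre), best first."""
    chosen = []
    for row in rows:
        try:
            score = float(row['豆瓣评分'])
        except (KeyError, ValueError):
            continue  # skip rows with a missing or non-numeric rating
        if score >= min_score and (genre is None or genre in row.get('类别', '')):
            chosen.append((score, row))
    chosen.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in chosen]


buffer = io.StringIO()  # stands in for a real file opened with newline=''
write_csv(rows, buffer)
print(buffer.getvalue())
print([row['名字'] for row in pick(rows)])
```

Ratings are scraped as strings, so converting them to `float` before comparing avoids sorting them alphabetically.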

Originally published 2019-06-03 on the Python 知识大全 WeChat official account.
