
Python Web Scraping Primer (9): Saving Data to a Database with the Scrapy Framework

Crawl the Douban Movie Top 250 and save the data to MongoDB.

Douban Movie Top 250 URL: https://movie.douban.com/top250

Requirements:

1. Scrape the title, director/cast info, rating, and one-line summary of each movie in the Douban Top 250.

2. Use a random User-Agent and a random proxy for each request.

3. Save the scraped data to a MongoDB database.
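
The files below assume a Scrapy project named douban containing a spider named doubanmovie; if you are starting from scratch, the standard Scrapy commands create that layout:

scrapy startproject douban
cd douban
scrapy genspider doubanmovie movie.douban.com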

items.py

# -*- coding: utf-8 -*-

import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # title
    title = scrapy.Field()
    # director/cast/year info
    bd = scrapy.Field()
    # rating
    star = scrapy.Field()
    # one-line summary
    quote = scrapy.Field()
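
A Scrapy Item behaves like a dict, which is why the pipeline below can simply call dict(item) before inserting it into MongoDB. A minimal illustration (example values; run from the project root so the import resolves):

from douban.items import DoubanItem

item = DoubanItem(title='肖申克的救赎')
item['star'] = '9.7'
print(dict(item))   # {'title': '肖申克的救赎', 'star': '9.7'}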

doubanmovie.py

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem

class DoubanmovieSpider(scrapy.Spider):
    name = "doubanmovie"
    allowed_domains = ["movie.douban.com"]
    offset = 0
    url = "https://movie.douban.com/top250?start="
    start_urls = (
            url + str(offset),
    )

    def parse(self, response):
        movies = response.xpath("//div[@class='info']")

        for each in movies:
            # create a fresh item per movie; reusing one instance would
            # carry the previous movie's quote over to movies without one
            item = DoubanItem()
            # title
            item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()[0]
            # director/cast/year info
            item['bd'] = each.xpath(".//div[@class='bd']/p/text()").extract()[0]
            # rating
            item['star'] = each.xpath(".//div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            # one-line summary (some movies have none)
            quote = each.xpath(".//p[@class='quote']/span/text()").extract()
            if len(quote) != 0:
                item['quote'] = quote[0]
            yield item

        # each page holds 25 movies; start=0..225 covers all 10 pages
        if self.offset < 225:
            self.offset += 25
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
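
Before launching the full crawl, the XPath expressions can be sanity-checked in scrapy shell; the expected output below assumes the Douban page layout at the time of writing:

scrapy shell "https://movie.douban.com/top250"
>>> info = response.xpath("//div[@class='info']")
>>> len(info)        # 25 movies per page
25
>>> info[0].xpath(".//span[@class='title'][1]/text()").extract()[0]
'肖申克的救赎'

Once the selectors check out, scrapy crawl doubanmovie runs the spider through all ten pages.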

pipelines.py

# -*- coding: utf-8 -*-

import pymongo
from scrapy.utils.project import get_project_settings

class DoubanPipeline(object):
    def __init__(self):
        # scrapy.conf was removed; get_project_settings is the supported way
        # to read settings outside a crawler context
        settings = get_project_settings()
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        sheetname = settings["MONGODB_SHEETNAME"]

        # create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mydb = client[dbname]
        # collection that will hold the scraped data
        self.sheet = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.sheet.insert_one(data)
        return item
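
After a run, a short pymongo session confirms the documents landed; a minimal check, assuming a local MongoDB and pymongo 3.7+ (for count_documents):

import pymongo

client = pymongo.MongoClient("127.0.0.1", port=27017)
sheet = client["Douban"]["doubanmovies"]
# should approach 250 after a complete crawl
print(sheet.count_documents({}))
for doc in sheet.find().limit(3):
    print(doc["title"], doc["star"])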

settings.py

DOWNLOAD_DELAY = 2.5

COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgent': 100,
    'douban.middlewares.RandomProxy': 200,
}

USER_AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)',
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
    'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
    'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; zh-cn; M032 Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13'
]

PROXIES = [
        {"ip_port": "121.42.140.113:16816", "user_passwd": "****"},
        # {"ip_port": "121.42.140.113:16816", "user_passwd": ""},
        # {"ip_port": "121.42.140.113:16816", "user_passwd": ""},
        # {"ip_port": "121.42.140.113:16816", "user_passwd": ""},
]


ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}


# MongoDB host
MONGODB_HOST = "127.0.0.1"

# MongoDB port
MONGODB_PORT = 27017

# database name
MONGODB_DBNAME = "Douban"

# collection that stores the data
MONGODB_SHEETNAME = "doubanmovies"
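
To confirm these values are what the pipeline will see, they can be read back with the same helper the pipeline uses (run from the project directory so Scrapy can locate settings.py):

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings["MONGODB_DBNAME"])        # Douban
print(settings.getint("MONGODB_PORT"))   # 27017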

middlewares.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import random
import base64

from douban.settings import USER_AGENTS
from douban.settings import PROXIES

# pick a random User-Agent for every request
class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        # print(useragent)
        request.headers.setdefault("User-Agent", useragent)

class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)

        if not proxy.get('user_passwd'):
            # proxy without authentication (None check alone would miss
            # empty-string credentials)
            request.meta['proxy'] = "http://" + proxy['ip_port']

        else:
            # base64 works on bytes, so encode the credentials first and
            # decode the result back to str for the header value
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode()).decode()
            # attach the credentials in the format the proxy server expects
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd

            request.meta['proxy'] = "http://" + proxy['ip_port']
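
The Proxy-Authorization value is plain HTTP Basic auth, so the encoding step can be checked standalone (the credentials here are hypothetical):

import base64

creds = "user:pass"  # hypothetical username:password pair
print("Basic " + base64.b64encode(creds.encode()).decode())
# prints: Basic dXNlcjpwYXNz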
