
Bookmark This | Detailed Configuration of Each Scrapy Component

Author: 刘早起 · Published 2020-05-14 · From the 早起Python column

Hi everyone. We have already covered Requests-based crawlers at length. Today we'll go through the detailed configuration of each Scrapy component, to lay the groundwork for the Scrapy crawler case studies coming later.

About Scrapy

Scrapy is a crawler framework implemented in pure Python, and its main selling points are simplicity, ease of use, and high extensibility. I won't spend much time on Scrapy basics here; instead, the focus is on that extensibility, with configuration details for each of the main components. "Detailed" is relative, but it should meet most people's needs :). For anything beyond that, read the official documentation carefully. As a refresher, it helps to keep the Scrapy data-flow (architecture) diagram from the official docs at hand for reference.

Now to the main topic. The concrete examples below use a Douban spider.

Creation commands

scrapy startproject <Project_name>
scrapy genspider <spider_name> <domains>
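
For the Douban project used throughout this post, the concrete commands would look like this (the project and spider names simply match the code further down):

scrapy startproject Douban
cd Douban
scrapy genspider douban douban.com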

To create the convenient whole-site crawling template, a CrawlSpider, use the following command instead:

scrapy genspider -t crawl <spider_name> <domains>

spider.py

Let's start with the core component, spider.py. Without further ado, here is the code; explanations are in the comments.

import scrapy
# Standard-library import for parsing the JSON responses
import json
# Import the item class for persistence (relative import from the project package)
from ..items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # Request headers for this spider only
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        }}

    # Often you do not need to override this method; do so only when you need
    # customized start URLs or per-request headers
    def start_requests(self):
        page = 18
        base_url = 'https://xxxx'  # elided; the real template needs a '{}' placeholder for the offset
        for i in range(page):
            url = base_url.format(i * 20)
            req = scrapy.Request(url=url, callback=self.parse)
            # Set a header on a single request; later requests work the same way
            # req.headers['User-Agent'] = ''
            yield req

    # Nothing special: regular page parsing that hands the detail URLs on to the
    # next callback (clear once you look at the data-flow diagram)
    def parse(self, response):
        json_str = response.body.decode('utf-8')
        res_dict = json.loads(json_str)
        for i in res_dict['subjects']:
            url = i['url']
            yield scrapy.Request(url=url, callback=self.parse_detailed_page)

    # Scrapy's response supports XPath directly; this is basic stuff, not repeated here
    def parse_detailed_page(self, response):
        title = response.xpath('//h1/span[1]/text()').extract_first()
        year = response.xpath('//h1/span[2]/text()').extract()[0]
        image = response.xpath('//img[@rel="v:image"]/@src').extract_first()

        item = DoubanItem()
        item['title'] = title
        item['year'] = year
        item['image'] = image
        # Downloading images requires the ImagesPipeline, plus matching entries
        # in settings.py and pipelines.py
        item['image_urls'] = [image]
        yield item

For whole-site crawling, the beginning of the spider class differs slightly:

rules = (
    Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'),
         callback='parse_item', follow=False),
)
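
For context, here is a minimal CrawlSpider skeleton showing where that rules tuple lives (the class name and start_urls are illustrative assumptions; only the rule itself comes from the example above):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DigimonSpider(CrawlSpider):
    name = 'digimon'
    allowed_domains = ['digimons.net']
    start_urls = ['http://digimons.net/digimon/']  # assumed entry page

    rules = (
        Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # parse the detail page and yield items here
        yield {'url': response.url}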

The key is the follow setting; whether the crawl reaches the intended depth and pages is something you have to judge for yourself. One more note: request headers can be set in three places, and where you set them determines their scope (a compact sketch follows the list below):

  1. In settings.py: the widest scope, affecting every spider in the project
  2. As a spider class attribute (custom_settings): affects every request made by that spider
  3. On an individual request: affects only that request

So the scope runs from the whole project, to a single spider, to a single request. When all three are present, the headers set on the individual request take the highest priority.
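
A compact sketch putting the three levels side by side (the spider name, URL, and user-agent strings are placeholders):

import scrapy

# 1. settings.py -- project-wide default
# DEFAULT_REQUEST_HEADERS = {'user-agent': 'project-wide UA'}

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # 2. spider class attribute -- every request made by this spider
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {'user-agent': 'spider-level UA'}
    }

    def start_requests(self):
        req = scrapy.Request('https://movie.douban.com/', callback=self.parse)
        # 3. single request -- only this request; wins when all three are set
        req.headers['User-Agent'] = 'request-level UA'
        yield req

    def parse(self, response):
        pass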

items.py

import scrapy

class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    image = scrapy.Field()
    # The ImagesPipeline also needs an image_urls field on the item
    image_urls = scrapy.Field()

    # I use MySQL for persistence; not expanded on here
    def get_insert_sql_and_data(self):
        # CREATE TABLE douban(
        #     id int not null auto_increment primary key,
        #     title text, `year` int, image text) ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
        # `year` is a reserved word in MySQL, so it must be wrapped in backticks
        insert_sql = 'INSERT INTO douban(title,`year`,image) ' \
                     'VALUES(%s,%s,%s)'
        data = (self['title'], self['year'], self['image'])
        return (insert_sql, data)

middlewares.py

Middleware is where things get clever. Plenty of people never need it, but in practice it matters a great deal when configuring proxies. For ordinary needs you don't touch the SpiderMiddleware; the changes go into the DownloaderMiddleware.

# Signals: this concept matters a lot for custom Scrapy extensions
from scrapy import signals
# Locally written proxy helper (code below); it can sit on top of your own IP pool
# or a paid proxy service (as in my case)
from proxyhelper import Proxyhelper
# Multiple threads operating on the same object need a lock:
# instantiate it once, then acquire/release around each access
from twisted.internet.defer import DeferredLock

class DoubanSpiderMiddleware(object):  # the spider middleware is left unconfigured
    pass

class DoubanDownloaderMiddleware(object):
    def __init__(self):
        # Instantiate the proxy helper and the lock
        self.helper = Proxyhelper()
        self.lock = DeferredLock()

    @classmethod
    def from_crawler(cls, crawler):  # unchanged from the template
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Triggered when a request reaches the downloader middleware;
        # Scrapy's built-in HttpProxyMiddleware reads the lowercase 'proxy' meta key
        self.lock.acquire()
        request.meta['proxy'] = self.helper.get_proxy()
        self.lock.release()
        return None

    def process_response(self, request, response, spider):
        # Inspect the response; if it is not what we want, switch proxies and retry
        if response.status != 200:
            self.lock.acquire()
            self.helper.update_proxy(request.meta['proxy'])
            self.lock.release()
            return request
        return response

    def process_exception(self, request, exception, spider):
        self.lock.acquire()
        self.helper.update_proxy(request.meta['proxy'])
        self.lock.release()
        return request

    def spider_opened(self, spider):  # unchanged from the template
        spider.logger.info('Spider opened: %s' % spider.name)

Here is the code for the proxyhelper:

import requests

class Proxyhelper(object):
    def __init__(self):
        self.proxy = self._get_proxy_from_xxx()

    def get_proxy(self):
        return self.proxy

    def update_proxy(self, proxy):
        # Only fetch a new proxy if the caller is complaining about the current one
        if proxy == self.proxy:
            print('Updating a proxy')
            self.proxy = self._get_proxy_from_xxx()

    def _get_proxy_from_xxx(self):
        url = ''  # put your proxy API URL here, ideally one that returns a single IP per call
        response = requests.get(url)
        return 'http://' + response.text.strip()

pipelines.py

# Load the locally written MySQL persistence class; write your own as needed
from mysqlhelper import Mysqlhelper
# Import ImagesPipeline so we can subclass it and customize its behaviour
from scrapy.pipelines.images import ImagesPipeline
import hashlib
from scrapy.utils.python import to_bytes
from scrapy.http import Request

class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_lst = []
        for x in item.get(self.images_urls_field, []):
            req = Request(x)
            req.meta['movie_name'] = item['title']  # pass the title along for naming the file
            request_lst.append(req)
        return request_lst

    # Override file_path to control the file name
    def file_path(self, request, response=None, info=None):
        # The default behaviour names the file after a sha1 of the URL
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'full/%s.jpg' % (request.meta['movie_name'])  # use the movie name instead

# Nothing special here: part of the work was already done in items.py,
# keeping pipelines and items functionally separate
class DoubanPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()

    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
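
The Mysqlhelper class above is left for you to write. A minimal sketch, assuming pymysql and placeholder connection parameters, with an execute_sql method matching the call in DoubanPipeline:

import pymysql

class Mysqlhelper(object):
    def __init__(self):
        # placeholder connection parameters; replace with your own
        self.conn = pymysql.connect(host='localhost', user='root', password='',
                                    database='spider', charset='utf8mb4')

    def execute_sql(self, sql, data):
        # run the parametrized INSERT produced by DoubanItem.get_insert_sql_and_data
        with self.conn.cursor() as cursor:
            cursor.execute(sql, data)
        self.conn.commit()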

settings.py

This is an extremely important component; the explanations are in the code comments.

# Bot / project name
BOT_NAME = 'Douban'

SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'

# Client user agent
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Douban (+http://www.yourdomain.com)'

# Obey robots.txt rules (the robots protocol)
ROBOTSTXT_OBEY = False

# Number of concurrent requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32


# Download delay
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# per-domain and per-IP concurrency; these override the global setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# The telnet console lets you monitor a running crawl
#TELNETCONSOLE_ENABLED = False
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# Usage: open a terminal -> telnet 127.0.0.1 6023 -> est()

# Override the default request headers:
# Default request headers, effective for every spider in the project
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#   'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# }


# Spider middlewares
# SPIDER_MIDDLEWARES = {
#    # 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None
#    'Douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Douban.middlewares.DoubanDownloaderMiddleware': 560,
   # 560 because the built-in downloader middlewares are split into many sub-groups,
   # each with its own priority number; these numbers decide the order in which
   # requests and responses pass through them -- see the official docs for details
}
# Per-request download timeout, in seconds
DOWNLOAD_TIMEOUT = 10
# Depth limit
# DEPTH_LIMIT = 1

# Custom extensions
EXTENSIONS = {
   'Douban.extends.MyExtension': 500,
}


# Item pipeline configuration
ITEM_PIPELINES = {
   # 'scrapy.pipelines.images.ImagesPipeline': 1,  # the stock image pipeline must be registered if used directly
   'Douban.pipelines.DoubanImagesPipeline': 300,
   # 'Douban.pipelines.DoubanPipeline': 400,  # register the MySQL pipeline as well if you use it
}

# AutoThrottle: automatic, latency-based rate limiting
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# HTTP caching, rarely needed
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Storage directory for the ImagesPipeline; enable as needed (the pipeline also requires Pillow)
IMAGES_STORE = 'download'

extends.py

Custom extensions. Before configuring this component you should understand signals, i.e. which signals Scrapy fires at which points of a run, which again comes back to a solid grasp of the data flow. In the code I use a class I wrote myself; all it does is fire a notification at particular moments via 喵提醒, a push-notification service (no, they don't pay me). You could just as well hook in logging or other functionality to strengthen the extension, targeting whichever moments you care about via the different signals.

You need to create this file yourself, inside the Douban package directory (next to settings.py), so that it can be referenced in EXTENSIONS as Douban.extends.MyExtension:

from scrapy import signals
from message import Message  # locally written notification class (a sketch follows)

class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        # 'MMMM' is a placeholder settings key; read whatever value you need
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        message = Message('spider finished')
        message.push()
        print('spider closed')
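
The Message class is not shown in the original. A minimal sketch, assuming a generic push-notification endpoint reached by a plain HTTP GET (the URL and parameter name are placeholders; substitute the actual API of 喵提醒 or whatever service you use):

import requests

class Message(object):
    def __init__(self, text):
        self.text = text

    def push(self):
        # placeholder endpoint and parameter; replace with your service's real API
        url = 'https://push.example.com/trigger'
        try:
            requests.get(url, params={'text': self.text}, timeout=10)
        except requests.RequestException as e:
            print('push failed:', e)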

running.py

Finally, a quick word on running.py: it is simply a way of invoking the scrapy command line from inside Python.

from scrapy.cmdline import execute
execute('scrapy crawl douban'.split())
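
An equivalent alternative that skips the command-line wrapper is Scrapy's CrawlerProcess; the spider name 'douban' matches the spider defined above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('douban')  # spider name as defined in spider.py
process.start()          # blocks until the crawl finishes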

That covers a Scrapy component setup that should satisfy most basic needs; refer back to it whenever the framework still feels unfamiliar. Scrapy crawler case studies will follow in later posts.
