Q: How to stop a Scrapy spider after a certain number of requests?

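Stack Overflow user

Asked 2016-03-02 21:05:23

Answers: 5 · Views: 9K · Followers: 0 · Votes: 6

I am developing a simple scraper to fetch 9GAG posts and their images, but because of some technical difficulties I cannot stop the scraper, and it keeps scraping content I don't want. I want to increment a counter and stop after 100 posts. However, the 9GAG page is designed so that each response returns only 10 posts, and on every iteration my counter is reset and only ever reaches 10, so the loop runs indefinitely and never stops.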

# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None
    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count +=1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item


                else:
                    break


        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse) 
        print count

The code for items.py is here:

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

So I wanted to increment a global count, and tried to do that by passing 3 arguments to the parse function, but that gives the error:

TypeError: parse() takes exactly 3 arguments (2 given)

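So, is there a way to pass a global count value, return it after each iteration, and stop after (say) 100 posts?

The whole project is available here. Even when I set POST_LIMIT=100 the infinite loop still happens; here is the command I executed: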

scrapy crawl first -s POST_LIMIT=10 --output=output.json

Tags: python, python-2.7, loops, python-3.x, scrapy

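5 Answers

Stack Overflow user

Accepted answer

Posted 2016-03-02 22:14:10

First: use self.count and initialize it outside of parse. Then, don't stop parsing the items; stop generating new requests instead. See the following code: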

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30
    count = 0

    def parse(self, response):
        # Always emit every post on the page; the counter only decides
        # whether another page gets requested afterwards.
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            if not gag_id:
                # Skip <article> elements that carry no entry id
                continue
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count += 1
            yield ninegag_item

        # Follow the next page only while fewer than COUNT_MAX posts were seen
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
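
To see why moving the counter onto the spider instance makes the crawl stop, here is a minimal, Scrapy-free sketch of the same control flow. FakeSpider, crawl, and the figure of 10 posts per page are illustrative assumptions, not part of Scrapy or of the code above. The only point is that self.count survives across parse() calls, whereas a local count would restart at zero on every response.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider: only the counting logic, no Scrapy."""

    COUNT_MAX = 30       # stop requesting new pages once this many posts are seen
    POSTS_PER_PAGE = 10  # assumption: each response carries 10 posts

    def __init__(self):
        self.count = 0            # instance state persists between parse() calls
        self.pages_requested = 0  # bookkeeping for the demo only

    def parse(self, page_number):
        # Emit every post on the page; the limit only gates the next request.
        for i in range(self.POSTS_PER_PAGE):
            self.count += 1
            yield ("item", page_number * self.POSTS_PER_PAGE + i)
        if self.count < self.COUNT_MAX:
            yield ("request", page_number + 1)


def crawl(spider):
    """Tiny scheduler loop: follow 'request' results until none are left."""
    items, queue = [], [0]
    while queue:
        page = queue.pop(0)
        spider.pages_requested += 1
        for kind, value in spider.parse(page):
            if kind == "item":
                items.append(value)
            else:
                queue.append(value)
    return items


spider = FakeSpider()
items = crawl(spider)
print(len(items), spider.pages_requested)  # 30 3
```

Because the check runs only after a whole page has been emitted, the final total can overshoot the limit by up to one page when the limit is not a multiple of the page size.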

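Votes: 6

Stack Overflow user

Posted 2017-04-01 12:19:40

There is a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in the project settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

One small caveat: if you have caching enabled, cache hits are counted toward the page count as well.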
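
As a rough sketch of the "changed in settings" route, the fragment below shows how the limit might look in a project's settings.py. The values are illustrative assumptions. CLOSESPIDER_ITEMCOUNT is a related built-in setting that counts scraped items rather than responses, which may be closer to "stop after 100 posts"; whichever limit is reached first closes the spider.

```python
# settings.py (fragment; values are illustrative)

# Close the spider once this many responses have been crawled.
CLOSESPIDER_PAGECOUNT = 100

# Alternatively (or additionally), close once this many items were scraped.
# With about 10 posts per response, 100 items is roughly 10 pages.
CLOSESPIDER_ITEMCOUNT = 100

# The HTTP cache is off by default; if it is switched on, responses served
# from the cache count toward CLOSESPIDER_PAGECOUNT as well.
HTTPCACHE_ENABLED = False
```

Neither limit is a hard cut-off: requests that are already in flight when the threshold is hit are still processed, so the final output can contain somewhat more than the configured number.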

Votes: 10

Stack Overflow user

Posted 2018-07-25 06:31:54

You can use custom_settings together with CLOSESPIDER_PAGECOUNT, as shown below.

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )
    last_gag_id = None

    COUNT_MAX = 30

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': COUNT_MAX
    }

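    def parse(self, response):
        # No manual counter here: CLOSESPIDER_PAGECOUNT closes the spider
        # once COUNT_MAX responses have been crawled.
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            if not gag_id:
                # Skip <article> elements that carry no entry id
                continue
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            yield ninegag_item

        # Request the next page once per response, after all of its posts
        # have been yielded (not once per post)
        next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)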
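
Note that CLOSESPIDER_PAGECOUNT counts responses, not posts, so COUNT_MAX = 30 here means about 30 pages rather than 30 posts. A small helper like the one below can convert a post limit into a page limit; pages_needed is a hypothetical name, and the default of 10 posts per page is an assumption taken from the question's description of 9GAG.

```python
def pages_needed(post_limit, posts_per_page=10):
    """Smallest number of responses that yields at least post_limit posts."""
    if post_limit <= 0:
        return 0
    # Integer ceiling division, so a partly needed page still counts as one
    return (post_limit + posts_per_page - 1) // posts_per_page


print(pages_needed(100))  # 10
print(pages_needed(95))   # 10
print(pages_needed(30))   # 3
```

The result could then be used as the CLOSESPIDER_PAGECOUNT value, keeping in mind that the last page may push the total slightly past the post limit.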

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei engine for the IT domain.

Original link:

https://stackoverflow.com/questions/35748061
