
How do I reschedule pages that return 403 with Scrapy?

Stack Overflow user
Asked 2021-02-07 09:24:52
1 answer · 286 views · 0 followers · 0 votes

Occasionally I get 403 responses while crawling pages with Scrapy 2.4.1. The retry downloader middleware is set to 5 attempts and gives up after the fifth:

2021-02-06 01:44:17 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.url...> (failed 5 times): 403 Forbidden
2021-02-06 01:44:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.url...>: HTTP status code is not handled or not allowed

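However, the documentation tells me that failed pages are rescheduled at the end of the crawl, and that is not what happens. Once Scrapy gives up on a page, it never retries it again. The documentation says:

"Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages."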

https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/retry.html

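My question: how can I configure the middleware so that it does not retry these pages immediately after they fail, but moves on to another URL and reschedules them after the other pages have been crawled?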
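One possible way to get this behaviour is a small custom downloader middleware that returns the failed request with a much lower priority, so the scheduler picks it up only after the requests already waiting at normal priority. The sketch below is an assumption, not something from the Scrapy docs or this thread: the class name `DeferredRetryMiddleware`, the meta key `deferred_retry_times`, and both constants are made up for illustration. It relies only on the `process_response` hook, `Request.replace()`, `Request.meta`, and `Request.priority`.

```python
class DeferredRetryMiddleware:
    """Hypothetical downloader middleware: re-queue 403 responses at low priority."""

    MAX_DEFERRED_RETRIES = 3   # assumed limit, tune for your crawl
    PRIORITY_PENALTY = 1000    # assumed value, large enough to sink below normal requests

    def process_response(self, request, response, spider):
        # Anything other than a 403 passes through untouched.
        if response.status != 403:
            return response

        attempts = request.meta.get("deferred_retry_times", 0)
        if attempts >= self.MAX_DEFERRED_RETRIES:
            # Give up and hand the 403 on to the rest of the chain.
            return response

        # Returning a Request from process_response puts it back in the scheduler.
        # dont_filter=True keeps the duplicate filter from dropping the repeat URL.
        retry_request = request.replace(
            priority=request.priority - self.PRIORITY_PENALTY,
            dont_filter=True,
        )
        retry_request.meta["deferred_retry_times"] = attempts + 1
        return retry_request
```

To try it, the class would be registered in the `DOWNLOADER_MIDDLEWARES` setting, and 403 would probably need to be left out of `RETRY_HTTP_CODES` so the built-in retry middleware does not retry the page immediately. Lower priority only orders requests in the scheduler queue, so this approximates "retry at the end" rather than guaranteeing it.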

1 Answer

Stack Overflow user

Answered 2021-02-07 09:32:07

To avoid 403 errors, I rotate between different user agents, like this:

import random


def get_header():
    """Return one randomly chosen set of browser-like request headers."""
    headers_list = [
        # Firefox 77 Mac
        {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Referer": "https://www.google.com/",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        },
        # Firefox 77 Windows
        {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://www.google.com/",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        },
        # Chrome 83 Mac
        {
            "Connection": "keep-alive",
            "DNT": "1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Dest": "document",
            "Referer": "https://www.google.com/",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
        },
        # Chrome 83 Windows
        {
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Sec-Fetch-Site": "same-origin",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-User": "?1",
            "Sec-Fetch-Dest": "document",
            "Referer": "https://www.google.com/",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9",
        },
    ]

    # Pick a different header profile on each call.
    return random.choice(headers_list)

Then, in your main function, just call it like this:

import requests

# url is the page you want to fetch
header = get_header()
response = requests.get(url, headers=header)

For me, this avoids the 403 error most of the time.
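The snippet above uses the `requests` library, while the question is about Scrapy. The same header rotation could be applied inside Scrapy with a downloader middleware, roughly as sketched here. The names `RandomHeadersMiddleware` and `HEADER_PROFILES` are hypothetical, and the list is shortened to two entries from the answer for brevity; the only Scrapy behaviour assumed is the `process_request` hook and dict-style assignment on `request.headers`.

```python
import random

# Shortened, illustrative list; in practice reuse the full profiles from the answer.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept-Language": "en-US,en;q=0.5",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
]


class RandomHeadersMiddleware:
    """Hypothetical downloader middleware: set a random header profile per request."""

    def process_request(self, request, spider):
        profile = random.choice(HEADER_PROFILES)
        for name, value in profile.items():
            request.headers[name] = value
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

As with any downloader middleware, it would be enabled through the `DOWNLOADER_MIDDLEWARES` setting. Rotating headers may reduce 403 responses on some sites, but it does not by itself reschedule pages that still fail.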

Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/66086333
