偶尔,我在使用Scrapy2.4.1抓取页面时得到403个响应。下载中间件设置为5次尝试,并在第5次尝试之后放弃:
2021-02-06 01:44:17 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.url...> (failed 5 times): 403 Forbidden
2021-02-06 01:44:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.url...>: HTTP status code is not handled or not allowed
然而,文档告诉我,失败的页面将在爬行结束时重新安排时间,但情况并非如此。一旦Scrapy放弃了那个页面,它就不会再重试一次。
在抓取过程中收集
失败的页面,并在结束时重新安排,一旦爬行器完成了所有常规(非失败)页面的爬行。
https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/retry.html
我的问题是:如何配置中间件,使其在这些页面失败后不会立即重试,但继续使用另一个URL,并在其他页面被爬行后重新安排它们?
发布于 2021-02-07 09:32:07
为了避免403错误,我使用不同的用户代理,如下所示:
import random
def get_header():
headers_list = [
# Firefox 77 Mac
{
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Referer": "https://www.google.com/",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1"
},
# Firefox 77 Windows
{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.google.com/",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1"
},
# Chrome 83 Mac
{
"Connection": "keep-alive",
"DNT": "1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Sec-Fetch-Site": "none",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Dest": "document",
"Referer": "https://www.google.com/",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
},
# Chrome 83 Windows
{
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Sec-Fetch-Site": "same-origin",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-User": "?1",
"Sec-Fetch-Dest": "document",
"Referer": "https://www.google.com/",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9"
}
]
return random.choice(headers_list)
然后,在您的主要功能中,只需这样称呼它:
header = get_header()
response = requests.get(url, headers=header)
对我来说,这避免了大多数时间的403错误。
https://stackoverflow.com/questions/66086333
复制相似问题