Command line
scrapy crawl jd_search
Launcher script
# Create a new run.py
from scrapy import cmdline
command = "scrapy crawl jd_search".split()
cmdline.execute(command)
ITEM
Items only impose a constraint on the parsed, structured results, so data errors can be caught before the data reaches the pipeline.
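A minimal sketch of such an Item; the class name and the fields (sku_id, title, price) are hypothetical placeholders, not the project's actual definition:

import scrapy

class JdSearchItem(scrapy.Item):
    # Only declared fields can be assigned; setting an undeclared key raises
    # KeyError, so malformed parse results are rejected before the pipeline.
    sku_id = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()

For example, item["color"] = "red" would fail immediately because color was never declared.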
ROBOTSTXT_OBEY
Controls whether the crawler fetches and obeys the target site's robots.txt, i.e. the site's declaration of which data crawlers are allowed to collect.
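In settings.py this is a single switch; whether to leave it on is a project decision, and the value below is only a sketch:

# settings.py
# True: download the site's robots.txt first and skip URLs it disallows
# False: ignore robots.txt entirely
ROBOTSTXT_OBEY = False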
Configure middlewares
The smaller the number, the closer the middleware is to the ENGINE.
DOWNLOADER_MIDDLEWARES = {
    # 'jd_crawler_scrapy.middlewares.JdCrawlerScrapyDownloaderMiddleware': 543,
    'jd_crawler_scrapy.middlewares.UAMiddleware': 100,
}
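As an illustrative sketch of how the numbers behave (the built-in middleware path below is only an example, not something this project requires): a lower number means its process_request runs earlier and its process_response runs later, and mapping an entry to None disables that middleware altogether.

DOWNLOADER_MIDDLEWARES = {
    # lower number: process_request runs earlier (closer to the ENGINE)
    'jd_crawler_scrapy.middlewares.UAMiddleware': 100,
    # setting the value to None disables a middleware, including built-in ones
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}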
Configure the PIPELINE
ITEM_PIPELINES = {
    'jd_crawler_scrapy.pipelines.JdCrawlerScrapyPipeline': 300,
}
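The registered class lives in pipelines.py; a minimal sketch of its shape, with a placeholder body since the real storage logic is not shown here:

class JdCrawlerScrapyPipeline:
    def process_item(self, item, spider):
        # called for every item the spider yields, in ascending order of the
        # numbers configured in ITEM_PIPELINES
        print(item)  # placeholder: real code would clean/persist the item here
        return item  # return the item so any later pipeline still receives it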
LOG
LOG_ENABLED: True, whether logging is enabled.

The UAMiddleware registered in DOWNLOADER_MIDDLEWARES above is defined in middlewares.py:

class UAMiddleware:
    def process_request(self, request, spider):
        request.headers["user-agent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"
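A quick way to confirm the header was actually rewritten is to log it from the spider's callback; parse is assumed here, your spider may use a different callback name.

# inside the jd_search spider
def parse(self, response):
    # the downloader middleware modified the request before it was sent,
    # so the rewritten header is visible on the request behind this response
    self.logger.info(response.request.headers.get("user-agent"))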
Retry middleware (also in middlewares.py):

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class MyRetryMiddleware(RetryMiddleware):
    """
    Handles the case where the server returns a normal 200 status code but,
    based on the IP, forces a captcha check. Since switching IPs gets past
    the captcha, such responses should be retried.
    """
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if "验证码" in response.text:  # the page contains a captcha prompt
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
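To actually use it, register it in DOWNLOADER_MIDDLEWARES; replacing the built-in RetryMiddleware at its default priority (550) is one common choice, and both that value and RETRY_TIMES below are illustrative assumptions rather than something stated above:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'jd_crawler_scrapy.middlewares.UAMiddleware': 100,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,   # turn off the stock retry
    'jd_crawler_scrapy.middlewares.MyRetryMiddleware': 550,       # captcha-aware retry instead
}
RETRY_TIMES = 3  # maximum number of retries _retry() will schedule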