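I am trying to pause a running Scrapy engine (a running spider) from inside a middleware.
When I try to call self.crawler.engine.unpause(), I get the following error:
'cRetry' object has no attribute 'crawler'
Here is my middleware. How can I access the crawler object?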
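import os
import time

# Import path for Scrapy 1.0 and later.
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class cRetry(RetryMiddleware):
    errorCounter = 0

    def process_response(self, request, response, spider):
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif b"error" in response.body:
            # Count responses whose body contains "error".
            self.errorCounter = self.errorCounter + 1
            if self.errorCounter >= 10:
                # This is where the AttributeError is raised:
                # the middleware has no self.crawler attribute yet.
                self.crawler.engine.pause()
                os.system("restart.sh")
                print("Reset")
                time.sleep(10)
                self.crawler.engine.unpause()
                self.errorCounter = 0
            reason = "Restart Required"
            return self._retry(request, reason, spider) or response
        return response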
Answered on 2014-01-14 20:31:50
As I understand it, you can override the __init__ and from_crawler methods, along these lines:
class cRetry(RetryMiddleware):
    errorCounter = 0

    def __init__(self, crawler):
        super(cRetry, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        # ...
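The signature of __init__ does not actually seem to matter, because Scrapy's entry point for building the middleware is always from_crawler(cls, crawler). That is a class method: it receives the class itself as its first argument and uses it to call the constructor, so whatever from_crawler passes to cls(...) is what __init__ must accept.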
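To illustrate why overriding both methods makes self.crawler available, here is a minimal sketch that runs without Scrapy installed. FakeCrawler, BaseRetry, CRetry and build_middleware are hypothetical stand-ins written for this example; BaseRetry is only assumed to mirror how RetryMiddleware keeps the settings and drops the crawler.

```python
class FakeCrawler(object):
    """Hypothetical stand-in for Scrapy's Crawler (settings only)."""

    def __init__(self, settings):
        self.settings = settings


class BaseRetry(object):
    """Hypothetical stand-in for RetryMiddleware: built from settings alone."""

    def __init__(self, settings):
        self.max_retry_times = settings.get("RETRY_TIMES", 2)

    @classmethod
    def from_crawler(cls, crawler):
        # The base class keeps only the settings and discards the crawler.
        return cls(crawler.settings)


class CRetry(BaseRetry):
    def __init__(self, crawler):
        super(CRetry, self).__init__(crawler.settings)
        # Keep the reference that the base class throws away.
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # cls is CRetry here, so this calls CRetry.__init__(crawler).
        return cls(crawler)


def build_middleware(mwcls, crawler):
    """Mimics the framework side, which only calls from_crawler."""
    return mwcls.from_crawler(crawler)


crawler = FakeCrawler({"RETRY_TIMES": 5})
base = build_middleware(BaseRetry, crawler)
custom = build_middleware(CRetry, crawler)
```

In this sketch, base has no crawler attribute, while custom.crawler is the same object that was passed in, and the inherited settings handling still works.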
Answered on 2014-01-15 14:05:59
Thank you :-)
Your suggestion works with one small modification: @classmethod needs to be added, and then it works like a charm.
class cRetry(RetryMiddleware):
    errorCounter = 0

    def __init__(self, crawler):
        super(cRetry, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        # ...
https://stackoverflow.com/questions/21123198