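I have a spider that starts from a sitemap, scrapes on the order of 100 unique URLs, and then does further processing on each of those pages. However, I only get callbacks for the first 10 URLs. According to the spider log, HTTP requests are only made for those first 10 URLs.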
import scrapy

class MySpider(scrapy.spider.BaseSpider):
    # settings ...

    def parse(self, response):
        urls = [...]  # the unique URLs extracted from the sitemap
        for url in urls:
            # Schedule each URL; parse_part2 should run once per response.
            request = scrapy.http.Request(url, callback=self.parse_part2)
            print(url)
            yield request

    def parse_part2(self, response):
        print(response.url)
        # do more parsing here
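I have considered these options:
Is there some mysterious max_branching_factor flag, or something similar, that I don't know about?
Edit: here is the log, which looks completely normal.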
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url1>
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url2>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url3>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url4>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url5>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url6>
yay callback!
yay callback!
yay callback!
yay callback!
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url7>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url8>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url9>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url10>
yay callback!
2015-02-11 02:05:13-0800 [mysite] INFO: Closing spider (finished)
2015-02-11 02:05:13-0800 [mysite] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4590,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 638496,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 11, 10, 5, 13, 260322),
'log_count/DEBUG': 17,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 11,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'start_time': datetime.datetime(2015, 2, 11, 10, 5, 12, 492811)}
2015-02-11 02:05:13-0800 [mysite] INFO: Spider closed (finished)
Posted on 2015-02-11 19:55:09
So I found this property in one of my settings files:
max_requests / MAX_REQUESTS = 10
It was the culprit behind the spider exiting early.
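For anyone hunting a similar cap, here is a minimal sketch of how one might list the settings whose names look like limits. The helper `find_limit_settings`, the name pattern, and the example dictionary are all hypothetical illustrations; the dictionary only stands in for whatever your own settings files define. MAX_REQUESTS does not appear to be one of Scrapy's documented built-in settings, so in this case it was presumably read by custom code in the project.

```python
import re

# Hypothetical heuristic: name fragments that often indicate a cap on crawling.
LIMIT_PATTERN = re.compile(r"MAX|LIMIT|COUNT|DEPTH", re.IGNORECASE)


def find_limit_settings(settings):
    """Return the entries of a settings mapping whose names look like limits."""
    return {
        name: value
        for name, value in settings.items()
        if LIMIT_PATTERN.search(name)
    }


# A plain dict standing in for the project's settings (values are assumptions).
example_settings = {
    "BOT_NAME": "mysite",
    "MAX_REQUESTS": 10,   # the setting named in this answer
    "DEPTH_LIMIT": 0,     # real Scrapy setting; 0 means no depth limit
    "LOG_LEVEL": "DEBUG",
}

suspects = find_limit_settings(example_settings)
print(suspects)
```

Any entry this reports is only a candidate; you still have to check where the value is actually read.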
Posted on 2015-02-11 10:00:47
Try setting LOG_LEVEL to DEBUG; you will see more log output.
Once you have done so, please paste the output into the question above.
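As a sketch of what that looks like, assuming a standard project layout with a settings.py file:

```python
# settings.py (fragment)
# LOG_LEVEL is a standard Scrapy setting; DEBUG is the most verbose level.
LOG_LEVEL = "DEBUG"
```

The same level can also be passed on the command line, for example `scrapy crawl mysite -L DEBUG`, where the spider name `mysite` is taken from the log prefix and is an assumption. Note that DEBUG is already Scrapy's default, so this only changes anything if the level was raised somewhere else.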
https://stackoverflow.com/questions/28451290