I'm having a problem with Scrapy and the Reuters site. Following the example given at https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/, I want to do the same thing for http://www.reuters.com/news/archive/businessNews?view=page&page=1. After downloading the information from the first page, I want to download the information from the following pages, but the LinkExtractor does not work as expected. Here is my code:
from time import gmtime, strftime

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from myproject.items import ReutersItem  # adjust to your project's items module


class ReutersCrawlerSpider(CrawlSpider):
    name = 'reuters_crawler'
    allowed_domains = ['www.reuters.com']
    start_urls = [
        "http://www.reuters.com/news/archive/businessNews?page=1&pageSize=10&view=page",
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=r'\?page=[0-9]&pageSize=10&view=page',
                               restrict_xpaths=('//div[@class="pageNavigation"]',)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Each headline is an <a> inside an <h2> within a div.feature block.
        questions = Selector(response).xpath('//div[@class="feature"]/h2')
        for question in questions:
            item = ReutersItem()
            item['title'] = question.xpath('a/text()').extract()[0]
            item['timeextraction'] = strftime("%Y-%m-%d %H:%M:%S", gmtime())
            yield item
Where did I go wrong? Thanks for your help.
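One detail I noticed while debugging, though it may not be the root cause: the character class `[0-9]` in the `allow` pattern matches exactly one digit, so even when the extractor runs, pagination links from page 10 onward would be filtered out; `[0-9]+` matches multi-digit page numbers too. This can be checked with the stdlib `re` module (the example URLs below assume the query-string format used in the rule):

```python
import re

# Pattern as written in the Rule: one digit only.
single = re.compile(r'\?page=[0-9]&pageSize=10&view=page')
# Variant allowing multi-digit page numbers.
multi = re.compile(r'\?page=[0-9]+&pageSize=10&view=page')

url9 = "http://www.reuters.com/news/archive/businessNews?page=9&pageSize=10&view=page"
url10 = "http://www.reuters.com/news/archive/businessNews?page=10&pageSize=10&view=page"

print(bool(single.search(url9)))   # True
print(bool(single.search(url10)))  # False: '[0-9]' stops after one digit
print(bool(multi.search(url10)))   # True
```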
https://stackoverflow.com/questions/31918374