I'm having a problem with Scrapy and the Reuters site. Following the example given at https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/, I want to do the same thing for http://www.reuters.com/news/archive/businessNews?view=page&page=1. After downloading the information from the first page, I want to download the information from the following pages, but the LinkExtractor does not work as expected. Here is my code:
from time import gmtime, strftime

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from myproject.items import ReutersItem  # adjust to your project's items module


class ReutersCrawlerSpider(CrawlSpider):
    name = 'reuters_crawler'
    allowed_domains = ['www.reuters.com']
    start_urls = [
        "http://www.reuters.com/news/archive/businessNews?page=1&pageSize=10&view=page",
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow=r'\?page=[0-9]&pageSize=10&view=page',
                               restrict_xpaths=('//div[@class="pageNavigation"]',)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Each headline is an <a> inside an <h2> within a div.feature block.
        questions = Selector(response).xpath('//div[@class="feature"]/h2')
        for question in questions:
            item = ReutersItem()
            item['title'] = question.xpath('a/text()').extract()[0]
            item['timeextraction'] = strftime("%Y-%m-%d %H:%M:%S", gmtime())
            yield item
Where did I go wrong? Thanks for your help.
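One detail I noticed while debugging, though it may not be the root cause: the character class `[0-9]` in the `allow` pattern matches exactly one digit, so even when the extractor runs, pagination links from page 10 onward would be filtered out; `[0-9]+` matches multi-digit page numbers too. This can be checked with the stdlib `re` module (the example URLs below assume the query-string format used in the rule):

```python
import re

# Pattern as written in the Rule: one digit only.
single = re.compile(r'\?page=[0-9]&pageSize=10&view=page')
# Variant allowing multi-digit page numbers.
multi = re.compile(r'\?page=[0-9]+&pageSize=10&view=page')

url9 = "http://www.reuters.com/news/archive/businessNews?page=9&pageSize=10&view=page"
url10 = "http://www.reuters.com/news/archive/businessNews?page=10&pageSize=10&view=page"

print(bool(single.search(url9)))   # True
print(bool(single.search(url10)))  # False: '[0-9]' stops after one digit
print(bool(multi.search(url10)))   # True
```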
https://stackoverflow.com/questions/31918374