Q: Python scraper to grab table columns and rows

Stack Overflow user

Asked 2014-01-14 03:19:08

1 answer · 5.7K views · 0 followers · 1 vote

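I'm new to Python, and this is my first time learning scraping. I've done data mining in Perl quite successfully before, but this is a whole different ball game!

I'm trying to scrape a table and grab the columns of each row. My code is below.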

items.py

from scrapy.item import Item, Field
class Cio100Item(Item):
   company = Field()
   person = Field()
   industry = Field()
   url = Field()

scrape.py (spider)

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from cio100.items import Cio100Item

items = []

class MySpider(BaseSpider):
  name = "scrape"
  allowed_domains = ["cio.co.uk"]
  start_urls = ["http://www.cio.co.uk/cio100/2013/cio/"]

  def parse(self, response):
    sel = Selector(response)
    tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')
    for table in tables:
      # print table
      item = Cio100Item()
      item['company'] = table.xpath('a/text()').extract()
      item['person'] = table.xpath('a/text()').extract()
      item['industry'] = table.xpath('a/text()').extract()
      item['url'] = table.xpath('a/@href').extract()
      items.append(item)
    return items

I'm having trouble understanding how to express the XPath selection correctly.

I think the problem is this line:

      tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')

When I run the scraper, I get something like this in the terminal:

2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>

{'company': [u"\nDomino's Pizza\n"],
 'industry': [u"\nDomino's Pizza\n"],
 'person': [u"\nDomino's Pizza\n"],
 'url': [u'/cio100/2013/dominos-pizza/']}

2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nColin Rees\n'],
 'industry': [u'\nColin Rees\n'],
 'person': [u'\nColin Rees\n'],
 'url': [u'/cio100/2013/dominos-pizza/']}

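Ideally I want just one block instead of two, with Domino's in the company field, Colin in the person field, and the industry captured as well, which it currently isn't.

When I inspect the table with Firebug, I see h2 for columns 1 and 2 (company and person), but column 3 is an h3?

When I change the end of the tables line to h3, like this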

      tables = sel.xpath('//table[@class="bgWhite listTable"]//h3')

I get

2014-01-13 22:16:46-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nRetail\n'],
 'industry': [u'\nRetail\n'],
 'person': [u'\nRetail\n'],
 'url': [u'/cio100/2013/dominos-pizza/']}

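Here it produces only one block, and it captures the industry and URL correctly. But it doesn't get the company name or the person.

Any help would be greatly appreciated!

Thanks!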

python

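1 Answer

Stack Overflow user

Accepted answer

Answered 2014-01-14 03:47:58

As for the XPath, consider iterating over the table rows and selecting each column relative to the row: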

$ scrapy shell http://www.cio.co.uk/cio100/2013/cio/
...
>>> for tr in sel.xpath('//table[@class="bgWhite listTable"]/tr'):
...     item = Cio100Item()
...     item['company'] = tr.xpath('td[2]//a/text()').extract()[0].strip()
...     item['person'] = tr.xpath('td[3]//a/text()').extract()[0].strip()
...     item['industry'] = tr.xpath('td[4]//a/text()').extract()[0].strip()
...     item['url'] = tr.xpath('td[4]//a/@href').extract()[0].strip()
...     print item
... 
{'company': u'LOCOG',
 'industry': u'Leisure and entertainment',
 'person': u'Gerry Pennell',
 'url': u'/cio100/2013/locog/'}
{'company': u'Laterooms.com',
 'industry': u'Leisure and entertainment',
 'person': u'Adam Gerrard',
 'url': u'/cio100/2013/lateroomscom/'}
{'company': u'Vodafone',
 'industry': u'Communications and IT services',
 'person': u'Albert Hitchcock',
 'url': u'/cio100/2013/vodafone/'}
...
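To see why looping over rows works where looping over `//h2` does not, here is a minimal sketch in Python 3 using only the standard library. The markup in `SAMPLE` is an assumption reconstructed from the question (h2 for company and person, h3 for industry, and a leading first column), not the real page, and `xml.etree.ElementTree` stands in for Scrapy's selectors; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real page's structure is assumed from
# the question: columns 2 and 3 wrap their links in <h2>, column 4 in <h3>.
SAMPLE = """
<html><body>
<table class="bgWhite listTable">
  <tr>
    <td>1</td>
    <td><h2><a href="/cio100/2013/dominos-pizza/">Domino's Pizza</a></h2></td>
    <td><h2><a href="/cio100/2013/dominos-pizza/">Colin Rees</a></h2></td>
    <td><h3><a href="/cio100/2013/dominos-pizza/">Retail</a></h3></td>
  </tr>
</table>
</body></html>
"""

TABLE = ".//table[@class='bgWhite listTable']"


def items_per_heading(root):
    """Mimic the question's loop: one item per <h2>, every field read from the same link."""
    items = []
    for h2 in root.findall(TABLE + "//h2"):
        text = h2.find("a").text.strip()
        items.append({"company": text, "person": text})
    return items


def items_per_row(root):
    """Mimic the answer's loop: one item per <tr>, each field read from its own column."""
    items = []
    for tr in root.findall(TABLE + "/tr"):
        industry_link = tr.find("td[4]//a")
        items.append({
            "company": tr.find("td[2]//a").text.strip(),
            "person": tr.find("td[3]//a").text.strip(),
            "industry": industry_link.text.strip(),
            "url": industry_link.get("href"),
        })
    return items


root = ET.fromstring(SAMPLE.strip())
print(items_per_heading(root))  # two items, each with duplicated fields
print(items_per_row(root))      # one item with four distinct fields
```

The row is the natural unit of an item, so the outer XPath should select rows and the inner, relative XPaths should pick out the columns.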

Beyond that, you are better off yielding the items one at a time rather than accumulating them in a list.
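One reason to prefer yielding is that the question's spider appends to a module-level `items` list, which would keep growing if `parse` were ever called for more than one response. The sketch below illustrates this in plain Python 3 without Scrapy; `PAGE_ONE`, `PAGE_TWO`, and both function names are hypothetical stand-ins, with values borrowed from the sample output above. Scrapy callbacks may be generators, so the yielding form should carry over to a spider's `parse` method directly.

```python
# Stand-ins for the rows extracted from two responses (hypothetical pages).
PAGE_ONE = [{"company": "LOCOG", "person": "Gerry Pennell"}]
PAGE_TWO = [{"company": "Vodafone", "person": "Albert Hitchcock"}]

items = []  # module-level list, as in the question's spider


def parse_accumulating(rows):
    """Append to the shared list and return it, like the question's parse()."""
    for row in rows:
        items.append(dict(row))
    return items


def parse_yielding(rows):
    """Hand back each item as soon as it is built; no shared state."""
    for row in rows:
        yield dict(row)


# The second call returns the first page's items again along with its own.
first = list(parse_accumulating(PAGE_ONE))
second = list(parse_accumulating(PAGE_TWO))
print(len(first), len(second))  # 1 2

# The generator version returns only the items for the page it was given.
print([len(list(parse_yielding(page))) for page in (PAGE_ONE, PAGE_TWO)])  # [1, 1]
```

With the single start URL in the question the duplication would not show up, but it would as soon as the spider followed more pages.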

Votes: 3

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/21105492
