Q: Scraping un-nested HTML with Scrapy

Stack Overflow user

Asked 2015-01-15 22:01:10

Answers: 1, Views: 358, Followers: 0, Votes: 0

I am using the excellent Scrapy project to try to scrape the following HTML:

<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>, 
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>

Another example (from a different page):

<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=100" target="_blank">United States</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=345" target="_blank">Baltimore</a>, 
    <a href="/tags/?id=190" target="_blank">Maryland</a>,
    <a href="/tags/?id=190" target="_blank">United States</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Black</a><br>
</div>

The output I am looking for is:

["London", "UK"]
["Baltimore", "Maryland", "United States"]

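As you can see, sometimes there is a state or province as well, so it is not as simple as selecting the first two <a> tags.

Solutions I can think of:

Detect a comma immediately after an <a> element, and stop when there is no comma (the last element).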
Find all the <a> tags between the Location <b> element and the next <br> element.
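Get a list of countries that have states/provinces and filter by value (I would rather not do this).

Edit:

To clarify, the two examples above come from different pages. Also, the Ethnicity element is sometimes absent; it may be Birthday or one of several other options instead. The order of the Label: entries is not guaranteed, and the data is very unstructured, which is what makes this difficult.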
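Because the labels can appear in any order, one option is to group every <a> under the most recent <b> label and then read off the Location group. Below is a minimal sketch of that idea using only Python's standard-library html.parser. The names BioParser, parse_bio and SAMPLE are hypothetical, and the sketch assumes the input is just the bio fragment rather than a whole page.

```python
from html.parser import HTMLParser


class BioParser(HTMLParser):
    """Group the text of each <a> under the most recent <b> label."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {}      # label -> list of link texts
        self._label = None    # label currently in effect
        self._in_tag = None   # "b" or "a" while inside one, else None
        self._buffer = []     # text chunks collected inside the current tag

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "a"):
            self._in_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._in_tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._in_tag:
            return
        # &nbsp; arrives as "\xa0"; normalise it before trimming.
        text = "".join(self._buffer).replace("\xa0", " ").strip()
        if tag == "b":
            # A new label starts a new group, e.g. "Location:" -> "Location".
            self._label = text.rstrip(":")
            self.fields.setdefault(self._label, [])
        elif self._label is not None:
            self.fields[self._label].append(text)
        self._in_tag = None


def parse_bio(fragment):
    """Return {label: [link texts]} for one bio fragment (hypothetical helper)."""
    parser = BioParser()
    parser.feed(fragment)
    parser.close()
    return parser.fields


SAMPLE = """
<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=100" target="_blank">United States</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=345" target="_blank">Baltimore</a>,
    <a href="/tags/?id=190" target="_blank">Maryland</a>,
    <a href="/tags/?id=190" target="_blank">United States</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Black</a><br>
</div>
"""

print(parse_bio(SAMPLE)["Location"])
```

Since each link is attached to whichever label precedes it, the result does not depend on Ethnicity being present or on the labels appearing in a fixed order.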

Tags: html, xpath, web-scraping, scrapy, python

1 Answer

Stack Overflow user

Accepted answer

Answered 2015-01-15 22:21:46

The following XPath expression:

//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[contains(.,'Ethnicity')])]/text()

translates to

//b[contains(.,'Location')]       Select `b` elements anywhere in the document and only
                                  if their text content contains "Location"
/following-sibling::a             Of those `b` elements select following sibling
                                  elements `a` 
[not(preceding-sibling::b         but only if they (i.e. the `a` elements) are not
                                  preceded by a `b` element
[contains(.,'Ethnicity')])]       whose text nodes contain "Ethnicity"
/text()                           return all text nodes of those `a` elements

and yields (individual results separated by -------)

London
-----------------------
UK
-----------------------
Baltimore
-----------------------
Maryland
-----------------------
United States

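It relies on the fact that the a elements you are looking for sit between the b element containing Location and the b element containing Ethnicity. Is that always the case?

Edit: In response to your edit, try an expression along these lines instead: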

//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[preceding-sibling::b[contains(.,'Location')]])]/text()
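To try this second expression outside a spider, here is a minimal sketch that evaluates it with lxml (the library choice is an assumption; the question itself uses Scrapy). The names LOCATION_XPATH, extract_location and the sample pages are hypothetical, and PAGE_BIRTHDAY is an invented fragment that stands in for a page where Ethnicity is replaced by another label.

```python
from lxml import html as lxml_html

# The expression from the answer, split across lines for readability.
LOCATION_XPATH = (
    "//b[contains(., 'Location')]"
    "/following-sibling::a"
    "[not(preceding-sibling::b[preceding-sibling::b[contains(., 'Location')]])]"
    "/text()"
)


def extract_location(page_source):
    """Return the link texts that belong to the Location label (hypothetical helper)."""
    tree = lxml_html.fromstring(page_source)
    return [str(text).strip() for text in tree.xpath(LOCATION_XPATH)]


PAGE_LONDON = """
<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>,
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>
"""

# Invented variant: Location comes first and is followed by Birthday.
PAGE_BIRTHDAY = """
<div id="bio">
    <b>Location:&nbsp;</b><a href="#">Baltimore</a>,
    <a href="#">Maryland</a>,
    <a href="#">United States</a>
    <br>
    <b>Birthday:&nbsp;</b><a href="#">May</a><br>
</div>
"""

print(extract_location(PAGE_LONDON))
print(extract_location(PAGE_BIRTHDAY))
```

Inside a Scrapy callback the same expression can be passed to response.xpath(...) and the strings collected with .getall().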

Votes: 1

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/27973802
