I am using the excellent Scrapy project to try to scrape the following HTML:
<div id="bio">
<b>Birthplace: </b><a href="/tags/?id=90" target="_blank">Ireland</a>
<br>
<b>Location: </b><a href="/tags/?id=294" target="_blank">London</a>,
<a href="/tags/?id=64" target="_blank">UK</a>
<br>
<b>Ethnicity: </b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>
Another example (from a different page):
<div id="bio">
<b>Birthplace: </b><a href="/tags/?id=100" target="_blank">United States</a>
<br>
<b>Location: </b><a href="/tags/?id=345" target="_blank">Baltimore</a>,
<a href="/tags/?id=190" target="_blank">Maryland</a>,
<a href="/tags/?id=190" target="_blank">United States</a>
<br>
<b>Ethnicity: </b><a href="/tags/?id=4" target="_blank">Black</a><br>
</div>
The output I am looking for is:
["London", "UK"]
["Baltimore", "Maryland", "United States"]
As you can see, there is sometimes a state or province as well, so it is not as simple as selecting the first two <a> tags.
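Solutions I can think of:
- Detect a comma immediately after each <a> element, and stop when there is no comma (that is the last element).
- Take all the <a> elements between the <b> element and the next <br> tag.
Edit:
To clarify, the two examples above come from different pages. Also, the <b>Ethnicity</b> element is sometimes absent; it may be Birthday or one of several other options instead. The order of the <b>Label:</b> elements is not guaranteed, and the data is very unstructured, which is what makes this difficult.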
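The second idea above (collect the <a> elements between the Location <b> and the next <br>) can be sketched with only the Python standard library. This is a minimal illustration rather than a Scrapy recipe: the class name LocationParser and the helper locations_from_html are made up for this example, and it assumes every label sits in a <b> element and every value in an <a> element, as in the two samples.

```python
from html.parser import HTMLParser


class LocationParser(HTMLParser):
    """Collect the text of <a> elements that follow a <b> label until the next <br> or <b>."""

    def __init__(self, label="Location"):
        super().__init__()
        self.label = label
        self.locations = []
        self._in_b = False        # currently inside a <b> label
        self._in_a = False        # currently inside an <a> element
        self._collecting = False  # between the wanted label and the next <br>/<b>

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_b = True
            self._collecting = False  # any new label ends the current run
        elif tag == "br":
            self._collecting = False  # a line break ends the current run
        elif tag == "a":
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False
        elif tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_b:
            # Start collecting only when the label text matches.
            if self.label in data:
                self._collecting = True
        elif self._in_a and self._collecting:
            text = data.strip()
            if text:
                self.locations.append(text)


def locations_from_html(page_source, label="Location"):
    """Hypothetical helper: return the link texts listed under the given label."""
    parser = LocationParser(label)
    parser.feed(page_source)
    parser.close()
    return parser.locations
```

Because the run ends at the next <br> or <b>, this sketch does not depend on which label follows Location or on the order of the labels.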
Posted on 2015-01-15 22:21:46
The following XPath expression:
//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[contains(.,'Ethnicity')])]/text()
translates to
//b[contains(.,'Location')]      Select `b` elements anywhere in the document, but only
                                 if their text content contains "Location"
/following-sibling::a            Of those `b` elements, select the following sibling
                                 `a` elements
[not(preceding-sibling::b        but only if they (i.e. the `a` elements) are not
                                 preceded by a `b` element
[contains(.,'Ethnicity')])]      whose text content contains "Ethnicity"
/text()                          Return all text nodes of those `a` elements
and yields (individual results separated by -------)
London
-----------------------
UK
-----------------------
Baltimore
-----------------------
Maryland
-----------------------
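United States
It relies on the fact that the a elements you are looking for lie between the b element containing Location and the b element containing Ethnicity. Is that always the case?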
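As a quick way to try the expression outside a spider, here is a minimal sketch that runs it against the two sample pages with lxml (an assumption made so the example runs standalone; the helper name extract_locations is made up). In a Scrapy callback the same string could presumably be passed to response.xpath(...) and read back with .extract().

```python
from lxml import html as lxml_html

# The XPath expression from the answer, split across lines for readability.
LOCATION_XPATH = (
    "//b[contains(.,'Location')]"
    "/following-sibling::a"
    "[not(preceding-sibling::b[contains(.,'Ethnicity')])]"
    "/text()"
)

PAGE_ONE = """<div id="bio">
<b>Birthplace: </b><a href="/tags/?id=90" target="_blank">Ireland</a>
<br>
<b>Location: </b><a href="/tags/?id=294" target="_blank">London</a>,
<a href="/tags/?id=64" target="_blank">UK</a>
<br>
<b>Ethnicity: </b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>"""

PAGE_TWO = """<div id="bio">
<b>Birthplace: </b><a href="/tags/?id=100" target="_blank">United States</a>
<br>
<b>Location: </b><a href="/tags/?id=345" target="_blank">Baltimore</a>,
<a href="/tags/?id=190" target="_blank">Maryland</a>,
<a href="/tags/?id=190" target="_blank">United States</a>
<br>
<b>Ethnicity: </b><a href="/tags/?id=4" target="_blank">Black</a><br>
</div>"""


def extract_locations(page_source):
    """Hypothetical helper: parse one page and return its Location link texts."""
    tree = lxml_html.fromstring(page_source)
    # xpath() returns lxml string results; convert them to plain str.
    return [str(text) for text in tree.xpath(LOCATION_XPATH)]


print(extract_locations(PAGE_ONE))  # ['London', 'UK']
print(extract_locations(PAGE_TWO))  # ['Baltimore', 'Maryland', 'United States']
```

Calling the helper once per page gives one list per page, which matches the output format asked for in the question.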
Edit: In response to your edit, try an expression like the following:
//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[preceding-sibling::b[contains(.,'Location')]])]/text()
https://stackoverflow.com/questions/27973802
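To see why the revised expression does not depend on the Ethnicity label, the sketch below runs both expressions on a made-up page where the label after Location is something else. The "Hair" label, its link, and the helper name run_xpath are hypothetical, and lxml is assumed only so the example runs outside Scrapy.

```python
from lxml import html as lxml_html

# First expression: stops at a <b> containing "Ethnicity".
ETHNICITY_BOUND_XPATH = (
    "//b[contains(.,'Location')]/following-sibling::a"
    "[not(preceding-sibling::b[contains(.,'Ethnicity')])]/text()"
)

# Revised expression: stops at any <b> that comes after the Location <b>.
ANY_LABEL_BOUND_XPATH = (
    "//b[contains(.,'Location')]/following-sibling::a"
    "[not(preceding-sibling::b[preceding-sibling::b[contains(.,'Location')]])]"
    "/text()"
)

# Hypothetical page: same shape as the samples, but with a different trailing label.
HYPOTHETICAL_PAGE = """<div id="bio">
<b>Birthplace: </b><a href="/tags/?id=90" target="_blank">Ireland</a>
<br>
<b>Location: </b><a href="/tags/?id=294" target="_blank">London</a>,
<a href="/tags/?id=64" target="_blank">UK</a>
<br>
<b>Hair: </b><a href="/tags/?id=7" target="_blank">Brown</a><br>
</div>"""


def run_xpath(page_source, expression):
    """Hypothetical helper: evaluate an XPath text() query and return plain strings."""
    tree = lxml_html.fromstring(page_source)
    return [str(text) for text in tree.xpath(expression)]


# The first expression lets the unrelated link through; the revised one does not.
print(run_xpath(HYPOTHETICAL_PAGE, ETHNICITY_BOUND_XPATH))  # ['London', 'UK', 'Brown']
print(run_xpath(HYPOTHETICAL_PAGE, ANY_LABEL_BOUND_XPATH))  # ['London', 'UK']
```

The revised predicate rejects any a element that has a b label between it and the Location label, so it should behave the same whichever label follows.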