I'm trying to extract some data from an HTML page with scrapely.
The page I want to scrape contains tags that hold both text to be extracted and inner tags whose content also needs to be extracted. Because of this, when I try to train the scraper I get a FragmentAlreadyAnnotated
exception: the classifier ends up annotating the same outer HTML tag for both strings.
Does anyone know how to avoid this?
I have put together a minimal working example you can experiment with:
import json

from scrapely import HtmlPage, Scraper

train_html = """<!doctype html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p><span>Example 1</span> * 2018</p>
<p><span>Example 2</span> * 2017</p>
<p><span>Example 3</span> * 2016</p>
</body>
</html>"""

test_html = """<!doctype html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p><span>Example A</span> * 2015</p>
<p><span>Example B</span> * 2014</p>
<p><span>Example C</span> * 2013</p>
</body>
</html>"""

if __name__ == '__main__':
    train_page = HtmlPage(url='http://example.com/', page_id=1, body=train_html)
    train_data = {
        'special': ['Example 1', 'Example 2', 'Example 3'],
        'year': ['2018', '2017', '2016']
    }
    test_page = HtmlPage(url='http://example.com/', page_id=2, body=test_html)

    s = Scraper()
    s.train_from_htmlpage(train_page, train_data)
    matches = s.scrape_page(test_page)

    print(json.dumps(matches, indent=4))
    print('Done.')
When I run this script, I get the following:
Traceback (most recent call last):
  File "/Users/stefano/Workspace/2018/re-searcher/src/main/python/researcher/mwe.py", line 40, in <module>
    s.train_from_htmlpage(train_page, train_data)
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/__init__.py", line 44, in train_from_htmlpage
    tm.annotate(field, best_match(value))
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/template.py", line 44, in annotate
    self.annotate_fragment(i, field)
  File "/Users/stefano/Workspace/2018/re-searcher/.env/lib/python3.5/site-packages/scrapely/template.py", line 83, in annotate_fragment
    raise FragmentAlreadyAnnotated("Fragment already annotated: %s" % fstr)
scrapely.template.FragmentAlreadyAnnotated: Fragment already annotated: <span data-scrapy-annotate="{"annotations": {"content": "year"}}">
whereas I was expecting something like this:
[
    {
        "year": [
            "2015",
            "2014",
            "2013"
        ],
        "special": [
            "Example A",
            "Example B",
            "Example C"
        ]
    }
]
Done.
Thanks a lot in advance!
Bonus question: do you know of a way to associate each special
with the nearest year?
Note that in some cases the year may be missing:
<body>
<p><span>Example D</span> * 2012</p>
<p><span>Example E</span></p>
<p><span>Example F</span> * 2011</p>
</body>
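To make the desired association concrete, here is a minimal sketch of the pairing I have in mind, done with a plain regex over the rows rather than with scrapely (the pattern and the pair_rows name are only for illustration); a missing year should come back as None:

```python
import re

def pair_rows(html: str) -> list:
    # Pair each row's <span> text with the year that follows it inside
    # the same <p>, if any; Example E below pairs with year=None.
    row = re.compile(r'<p><span>([^<]+)</span>(?:\s*\*\s*(\d{4}))?\s*</p>')
    return [{'special': m.group(1), 'year': m.group(2)}
            for m in row.finditer(html)]

body = """<body>
<p><span>Example D</span> * 2012</p>
<p><span>Example E</span></p>
<p><span>Example F</span> * 2011</p>
</body>"""

print(pair_rows(body))
```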
Posted on 2018-06-20 12:53:36
Not a real answer, but a hack.
I wrote a function that uses regexes to remove unnecessary whitespace and newlines, and then looks for patterns like <X><Y>some text</Y>more text</X>
and replaces them with <X><Y>some text</Y><span>more text</span></X>
(the regex may fail on some edge cases; if you find any, please suggest fixes below).
By preprocessing any HTML with this function, the error above never occurs, and the scraper (almost: note the asterisks) produces the expected result, namely:
[
    {
        "year": [
            "* 2015",
            "* 2014",
            "* 2013"
        ],
        "special": [
            "Example A",
            "Example B",
            "Example C"
        ]
    }
]
The revised code looks like this:
import json
import re

from scrapely import HtmlPage, Scraper

# train_html and test_html are the same strings as in the question.

def fix(html: str) -> str:
    # Collapse runs of whitespace, remove gaps between adjacent tags, then
    # wrap the trailing mixed-content text in its own <span>.
    html = re.sub(r'\s+', ' ', html)
    html = re.sub(r'> <', '><', html)
    html = re.sub(
        r'<([^>]+)>([^<]*)<([^>]+)>([^<]+)</([^>]+)>([^<]+)</([^>]+)>',
        r'<\1>\2<\3>\4</\5><span>\6</span></\7>',
        html
    )
    return html

def clean(text: str) -> str:
    # Strip tags (non-greedy, so text between tags survives) and squeeze whitespace.
    return re.sub(r'\s+', ' ', re.sub(r'<.*?>', ' ', text)).strip()

if __name__ == '__main__':
    train_page = HtmlPage(url='http://example.com/', page_id=1, body=fix(train_html))
    train_data = {
        'special': ['Example 1', 'Example 2', 'Example 3'],
        'year': ['2018', '2017', '2016']
    }
    test_page = HtmlPage(url='http://example.com/', page_id=2, body=fix(test_html))

    s = Scraper()
    s.train_from_htmlpage(train_page, train_data)
    matches = s.scrape_page(test_page)

    for match in matches:
        for key in match:
            match[key] = [clean(value) for value in match[key]]

    print(json.dumps(matches, indent=4))
    print('Done.')
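For a quick sanity check of what the substitution actually does, fix can be exercised on a single row (a self-contained copy of the function above); the extra <span> wrapped around the year text is what lets scrapely annotate it separately from the outer tag:

```python
import re

def fix(html: str) -> str:
    # Self-contained copy of fix() from the answer, for a sanity check.
    html = re.sub(r'\s+', ' ', html)   # collapse runs of whitespace
    html = re.sub(r'> <', '><', html)  # drop gaps between adjacent tags
    html = re.sub(
        r'<([^>]+)>([^<]*)<([^>]+)>([^<]+)</([^>]+)>([^<]+)</([^>]+)>',
        r'<\1>\2<\3>\4</\5><span>\6</span></\7>',
        html
    )
    return html

print(fix('<p><span>Example 1</span> * 2018</p>'))
# → <p><span>Example 1</span><span> * 2018</span></p>
```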
https://stackoverflow.com/questions/50933724