I am trying to parse an HTML page with lxml in Python.
The HTML has the following structure:
<html>
<h5>Title</h5>
<p>Some text <b>with</b> <i>other tags</i>.</p>
<p>More text.</p>
<p>More text[2].</p>
<h5>Title[2]</h5>
<p>Description.</p>
<h5>Title[3]</h5>
<p>Description[1].</p>
<p>Description[2].</p>
***
and so on...
***
</html>
I need to parse this HTML into the following JSON:
[
    {
        "title": "Title",
        "text": "Some text with other tags.\nMore text.\nMore text[2]."
    },
    {
        "title": "Title[2]",
        "text": "Description."
    },
    {
        "title": "Title[3]",
        "text": "Description[1].\nDescription[2]."
    }
]
I can read all the h5 tags containing the titles and write them to JSON with the following code:
import io
import json

from lxml import html

tree = html.parse('page.html')  # the HTML shown above

array = []
for title in tree.xpath('//h5/text()'):
    data = {
        "title": title,
        "text": ""
    }
    array.append(data)

with io.open('data.json', 'w', encoding='utf8') as outfile:
    str_ = json.dumps(array,
                      indent=4, sort_keys=True,
                      separators=(',', ' : '), ensure_ascii=False)
    outfile.write(str_)  # json.dumps already returns str on Python 3
The problem is that I don't know how to read all the paragraph content between the <h5> headings and put it into the text field of the JSON.
Posted on 2019-04-21 20:47:41
To get all the text between two elements, e.g. between two headings, there is no way around iterating over the whole tree (we will use .iterwalk(), because we have to distinguish between the start and the end of elements) and keeping a reference to the current heading (current_heading), because every bit of text we encounter must be assigned to it.
Each element in an ElementTree can have a .text and a .tail:
<b>This will be the .text</b> and this will be the .tail
We must collect both, otherwise text will be missing from the output.
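A quick demonstration of the .text/.tail distinction (a minimal sketch; the fragment and variable names are only for illustration):

from lxml import html

frag = html.fromstring('<p><b>This will be the .text</b> and this will be the .tail</p>')
b = frag.find('b')
print(repr(b.text))  # 'This will be the .text'
print(repr(b.tail))  # ' and this will be the .tail'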
The code below uses a stack to keep track of where we are in the HTML tree, so that the .text and .tail of nested elements are collected in the right order.
import lxml.etree as ET

# parse the question's HTML (assuming it is stored in page.html)
tree = ET.parse('page.html', ET.HTMLParser())

collected_text = []
data = []
stack = []
current_heading = {
    'title': '',
    'text': []
}
html_headings = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

def normalize(strings):
    return ''.join(strings)

for event, elem in ET.iterwalk(tree, events=('start', 'end')):
    # when an element starts, collect its .text
    if event == 'start':
        stack.append(elem)

        if elem.tag in html_headings:
            # reset any collected text, b/c now we're starting to collect
            # the heading's text. There might be nested elements in it.
            collected_text = []

        if elem.text:
            collected_text.append(elem.text)

    # ...and when it ends, collect its .tail
    elif event == 'end' and elem == stack[-1]:
        # headings mark the border between data items
        if elem.tag in html_headings:
            # normalize text in the previous data item
            current_heading['text'] = normalize(current_heading['text'])

            # start new data item
            current_heading = {
                'title': normalize(collected_text),
                'text': []
            }
            data.append(current_heading)

            # reset any collected text, b/c now we're starting to collect
            # the text after the heading
            collected_text = []

        if elem.tail:
            collected_text.append(elem.tail)

        current_heading['text'] = collected_text
        stack.pop()

# normalize text in final data item
current_heading['text'] = normalize(current_heading['text'])
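To serialize data as JSON, the json.dumps call from the question works unchanged (a usage sketch with the same formatting options):

import json

print(json.dumps(data, indent=4, sort_keys=True,
                 separators=(',', ' : '), ensure_ascii=False))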
When I run this against your example HTML, I get the following output (as JSON):
[
    {
        "text" : "\n Some text with other tags.\n More text.\n More text[2].\n\n ",
        "title" : "Title"
    },
    {
        "text" : "\n Description.\n\n ",
        "title" : "Title[2]"
    },
    {
        "text" : "\n Description[1].\n Description[2].\n\n ***\n and so on...\n ***\n",
        "title" : "Title[3]"
    }
]
My normalize() function is very simplistic: it keeps all the line breaks and other whitespace that are part of the HTML source code. Write a more sophisticated function if you want a nicer result.
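For example, one possible "more sophisticated" normalizer that collapses the source whitespace (an illustrative sketch, not part of the original answer):

import re

def normalize(strings):
    # join the collected fragments, then collapse runs of whitespace
    # (including newlines from the HTML source) into single spaces
    return re.sub(r'\s+', ' ', ''.join(strings)).strip()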
Posted on 2019-04-22 08:33:32
There is a simpler way to do this: just track where the next h5 is and make sure to select only the p elements that come before it:
from lxml import html

doc = html.parse('page.html')  # the parsed page from the question
data = []
for h5 in doc.xpath('//h5'):
    more_h5s = h5.xpath('./following-sibling::h5')
    # position() below counts within following-sibling::p, so compare
    # <p> counts rather than counts over all preceding sibling elements
    if more_h5s:
        position = (int(more_h5s[0].xpath('count(preceding-sibling::p)'))
                    - int(h5.xpath('count(preceding-sibling::p)')))
    else:
        position = 999  # no further heading: take every following <p>
    ps = h5.xpath('./following-sibling::p[position() <= %d]' % position)
    data.append({
        "title": h5.text,
        "text": "\n".join(p.text_content() for p in ps)
    })
It could be even simpler: just follow following-sibling::* until it is no longer a p, as sketched below.
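A minimal sketch of that sibling-walking idea, reusing doc from above (it relies on lxml's itersiblings(); this code is not part of the original answer):

data = []
for h5 in doc.xpath('//h5'):
    texts = []
    for sibling in h5.itersiblings():
        if sibling.tag != 'p':
            break  # stop at the next heading (or anything that isn't a <p>)
        texts.append(sibling.text_content())
    data.append({"title": h5.text, "text": "\n".join(texts)})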
Posted on 2020-08-06 01:33:15
First, split the element's children into separate sections based on the tag that is passed in.
def split(element, tag):
    sections = [[]]
    for child in element:
        # every occurrence of the tag starts a new section
        if child.tag == tag:
            sections.append([])
        sections[-1].append(child)
    return sections
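To illustrate what split() returns, here is a quick sketch on a reduced version of the question's markup (the printed lists are the child tags of each section):

import lxml.html

doc = lxml.html.fromstring(
    "<div><h5>A</h5><p>a1</p><p>a2</p><h5>B</h5><p>b1</p></div>")
for section in split(doc, "h5"):
    print([child.tag for child in section])
# []                 (children before the first h5 -- none here)
# ['h5', 'p', 'p']   (heading "A" and its paragraphs)
# ['h5', 'p']        (heading "B" and its paragraph)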
From there, it can be reshaped into dictionaries. Something like the following will do:
import lxml.html

# the element whose children are the <h5>/<p> sequence -- with the
# question's markup parsed by lxml.html, that is the <body> element
html = lxml.html.parse('page.html').getroot().body

data = []
for section in split(html, "h5"):
    # skip anything that comes before the first <h5>
    if section and section[0].tag == "h5":
        data.append(
            {
                "title": section[0].text_content(),
                "text": "\n".join(q.text_content() for q in section[1:]),
            }
        )
https://stackoverflow.com/questions/55781847