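I'm trying to parse HTML into a dictionary.
My current code contains a lot of logic.
It smells bad. I'm using lxml to help me parse it. Is there a recommended way to parse this kind of HTML when the DOM isn't well structured?
Thanks a lot.
Original HTML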
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
Expected result
{
Departs: "5:15:00AM, Sat, Nov 28, 2015",
Arrives: "8:00:00AM, Sat, Nov 28, 2015",
Flight duration: "3h 45m"
...
}
Current code (implementation)
from pdb import set_trace  # assumption: set_trace is the standard pdb one

from lxml import html

# resp (the HTTP response) and has_stops() are defined elsewhere in my code.
doc_root = html.document_fromstring(resp.text)
for ele in doc_root.xpath('//ul[@class="tb_body"]'):
    if has_stops(ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')):
        continue
    set_trace()
    from_city = ele.xpath('.//li[@class="tb_body_city"]')[0]
    set_trace()
    sub_ele = ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')
    set_trace()
Posted on 2015-10-29 13:39:55
I put together an example for the HTML you provided. It uses the popular Beautiful Soup library.
from bs4 import BeautifulSoup
data = '<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>\
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>\
<p><strong>Flight duration:</strong> 3h 45m</p>\
<p><strong>Operated by:</strong> NokScoot</p>'
soup = BeautifulSoup(data, 'html.parser')
res = {p.contents[0].text: p.contents[1].split(' - ')[0].strip() for p in soup.find_all('p')}
print(res)
Output:
{
'Departs:': '5:15:00AM, Sat, Nov 28, 2015',
'Flight duration:': '3h 45m',
'Operated by:': 'NokScoot',
'Arrives:': '8:00:00AM, Sat, Nov 28, 2015'
}
I think that if you want to keep your code compact, you should avoid relying on attributes.
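The same idea of selecting by structure rather than by attributes also works with lxml, which the question already uses. The following is only a minimal sketch: the function name `parse_flight_details` and the `SAMPLE` constant are made up for illustration, and it assumes every label sits in a `<strong>` that is a direct child of a `<p>`, and that anything after the first ' - ' in the value is a place name you want to drop. Unlike the Beautiful Soup version, it also strips the trailing colon from each key, as the expected result in the question seems to want.

```python
from lxml import html

# Sample markup copied from the question.
SAMPLE = (
    '<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>'
    '<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>'
    '<p><strong>Flight duration:</strong> 3h 45m</p>'
    '<p><strong>Operated by:</strong> NokScoot</p>'
)


def parse_flight_details(markup):
    """Map each <strong> label to the text that follows it in the same <p>."""
    root = html.document_fromstring(markup)
    details = {}
    # Select by structure only: every <strong> that is a direct child of a <p>.
    for strong in root.xpath('//p/strong'):
        # Hypothetical clean-up: drop surrounding whitespace and the trailing colon.
        label = strong.text_content().strip().rstrip(':')
        # .tail holds the text right after </strong>; it is None if nothing follows.
        value = (strong.tail or '').strip()
        # Assumption: everything after the first ' - ' is a place name to discard.
        details[label] = value.split(' - ')[0]
    return details


print(parse_flight_details(SAMPLE))
```

If the real page wraps these paragraphs in other elements, the XPath would probably need to be scoped to that container first, but the label/tail pairing should stay the same.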
https://stackoverflow.com/questions/33406250