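I am working with some HTML pages that I need to scrape data from. The problem is that the span ids are numbered. For example:
ContentPlaceHolder_0, ContentPlaceHolder_1, ContentPlaceHolder_2 ..... ContentPlaceHolder_n
I need to get the data from all of these span tags on each page. What is the best way to get this data using Beautiful Soup?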
Posted on 2018-08-05 04:18:54
You can try BeautifulSoup's built-in CSS selectors. The following selects every span whose id starts with ContentPlaceHolder:
soup.select('span[id^=ContentPlaceHolder]')
Example:
from bs4 import BeautifulSoup
html = """<span id='ContentPlaceHolder_0'>0</span>
<span id='ContentPlaceHolder_1'>1</span>
<span id='ContentPlaceHolder_2'>2</span>
<span id='ContentPlaceHolder_3'>3</span>
<span id='xxx'>xxx</span>"""
soup = BeautifulSoup(html, 'lxml')
# Print the text of every span whose id starts with ContentPlaceHolder
for s in soup.select('span[id^=ContentPlaceHolder]'):
    print(s.text)
Output:
0
1
2
3
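The prefix selector above also matches any other id that merely begins with ContentPlaceHolder. If only the numbered ids are wanted, one alternative (a different technique from the CSS selector, not part of the original answer) is to pass a compiled regular expression to find_all. The sketch below assumes the ids always have the form ContentPlaceHolder_<number>. The sample HTML, the id ContentPlaceHolderExtra, and the names pattern and values are made up for illustration, and it uses the standard-library 'html.parser' so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; ContentPlaceHolderExtra is an invented
# id that shares the prefix but is not one of the numbered spans.
html = """<span id='ContentPlaceHolder_0'>0</span>
<span id='ContentPlaceHolder_1'>1</span>
<span id='ContentPlaceHolder_10'>10</span>
<span id='ContentPlaceHolderExtra'>extra</span>
<span id='xxx'>xxx</span>"""

soup = BeautifulSoup(html, 'html.parser')

# Assumption: the wanted ids look exactly like ContentPlaceHolder_<number>.
pattern = re.compile(r'^ContentPlaceHolder_(\d+)$')

# Map the numeric suffix of each matching id to the span's text.
values = {}
for span in soup.find_all('span', id=pattern):
    index = int(pattern.match(span['id']).group(1))
    values[index] = span.get_text(strip=True)

# Iterate in numeric order rather than document order.
for index in sorted(values):
    print(index, values[index])
```

Keying the result by the integer suffix keeps the values in numeric order even when the spans appear out of order in the page, and it avoids the string ordering problem where "10" would sort before "2".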
https://stackoverflow.com/questions/51688989