I have some HTML:
html = '''<span class="head">A</span>Explanation <span style="color: red;">1</span><span class="head">B</span>Explanation 2<span class="head">C</span>Explanation <span style="color: red;">3</span>'''
soup = BeautifulSoup(html, "html.parser")
Now I want to split it into:
head = ["A", "B", "C"]
contents = ["Explanation 1", "Explanation 2", "Explanation 3"]
I can get the heads with
head = [i.get_text() for i in soup.select("span.head")]
but I don't know how to extract the contents.
Posted on 2020-11-17 17:00:36
Unfortunately, my Chinese isn't what it should be, but this is what I got:
targets = soup.select('span.head')
heads = []
entries = []
for target in targets:
    entry = []
    heads.append(target.text)
    # the text node that directly follows the head span
    entry.append(target.next_sibling)
    # if a styled span follows that text, its text belongs to the same entry
    if target.next_sibling.next_sibling.has_attr('style'):
        entry.append(target.next_sibling.next_sibling.text)
    entries.append(''.join(entry).strip().replace('\n\t',''))
print(heads)
print(entries)
Output:
['東', '菄', '鶇']
['春方也〾說文曰動...爲人', '東風菜義見上注俗加艹', '鶇鵍鳥名美形出廣雅亦作?']
Is that right?
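The next_sibling idea above can be taken one step further. The following is only a minimal sketch, assuming each explanation is made up of whatever sibling nodes sit between one span.head and the next; the helper name split_heads is hypothetical and the HTML is the simplified sample from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '''<span class="head">A</span>Explanation <span style="color: red;">1</span><span class="head">B</span>Explanation 2<span class="head">C</span>Explanation <span style="color: red;">3</span>'''


def split_heads(markup):
    """Hypothetical helper: return (heads, contents) for span.head entries."""
    soup = BeautifulSoup(markup, "html.parser")
    heads, contents = [], []
    for target in soup.select("span.head"):
        heads.append(target.get_text())
        parts = []
        # Walk the following siblings until the next head span starts.
        for node in target.next_siblings:
            if isinstance(node, Tag) and "head" in node.get("class", []):
                break
            # Tags contribute their text; plain strings are used as-is.
            parts.append(node.get_text() if isinstance(node, Tag) else str(node))
        contents.append("".join(parts).strip())
    return heads, contents


heads, contents = split_heads(html)
print(heads)
print(contents)
```

Because the loop stops at the next head rather than looking a fixed number of siblings ahead, it should not matter whether an entry has zero, one, or several styled spans.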
Posted on 2020-11-17 08:06:49
Try using zip():
from bs4 import BeautifulSoup
html = '''<span class="head">A</span>Explanation <span style="color: red;">1</span><span class="head">B</span>Explanation <span style="color: red;">2</span>
'''
soup = BeautifulSoup(html, "html.parser")
contents = []
for tag1, tag2 in zip(
        soup.select('span.head'), soup.select('span:not(span.head)')):
    # text node after the head's own text, plus the text of the paired red span
    contents.append(tag1.next_element.find_next(string=True) + tag2.text)
print(contents)
Output:
['Explanation 1', 'Explanation 2']
https://stackoverflow.com/questions/64870372
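One thing worth checking with the zip() pairing is that it seems to rely on every head being followed by exactly one non-head span. A minimal sketch, assuming the three-entry HTML from the question (where B has no red span); the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

html = '''<span class="head">A</span>Explanation <span style="color: red;">1</span><span class="head">B</span>Explanation 2<span class="head">C</span>Explanation <span style="color: red;">3</span>'''
soup = BeautifulSoup(html, "html.parser")

head_tags = soup.select("span.head")
other_tags = soup.select("span:not(.head)")

# zip() stops at the shorter list: three heads meet only two red spans.
print(len(head_tags), len(other_tags))

# Same pairing logic as above; B ends up paired with the span that belongs to C.
paired = [
    h.next_element.find_next(string=True) + o.get_text()
    for h, o in zip(head_tags, other_tags)
]
print(paired)
```

On this input the second entry picks up the "3" that belongs to C and the third entry is dropped, so the zip() approach looks safest when every entry has the same shape.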