I'm looking for a way to extract only the tags that contain no other tags.
For example:
from bs4 import BeautifulSoup
html = """
<p><a href='XYZ'>Text1</a></p>
<p>Text2</p>
<p><a href='QWERTY'>Text3</a></p>
<p>Text4</p>
"""
soup = BeautifulSoup(html, 'html.parser')
soup.find_all('p') gives
[<p><a href="XYZ">Text1</a></p>,
<p>Text2</p>,
<p><a href="QWERTY">Text3</a></p>,
<p>Text4</p>]
This is what I want to achieve:
[<p>Text2</p>,
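<p>Text4</p>]
Answered 2022-11-10 13:09:31
You can filter for Tag objects that contain no other tags like this:
for tag in soup.find_all('p'):
    # tag.next is the node parsed right after the opening <p>; a string here
    # means the paragraph starts with text rather than a nested tag
    if isinstance(tag.next, str):
        print(tag)
Output: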
<p>Text2</p>
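<p>Text4</p>
Answered 2022-11-10 12:32:13
I would simply filter on the number of nested tags with an if check: if the p contains only text, find_all() returns an empty list; otherwise the p is skipped:
for x in soup.find_all('p'):
    # find_all() with no arguments returns every tag nested inside x
    if len(x.find_all()) == 0:
        print(x)
This returns only: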
<p>Text2</p>
<p>Text4</p>
Answered 2022-11-10 13:56:59
from bs4 import BeautifulSoup
html = """
<p><a href='XYZ'>Text1</a></p>
<p>Text2</p>
<p><a href='QWERTY'>Text3</a></p>
<p>Text4</p>
<p>Text6: <a href='QWERTY'>Text5</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')
def p_tag_with_only_strings_as_children(tag):
    # True only for <p> tags whose direct children are all strings
    return tag.name == "p" and all(isinstance(x, str) for x in tag.children)
result = soup.find_all(p_tag_with_only_strings_as_children)
print(result)
Result:
[<p>Text2</p>, <p>Text4</p>]
Credit for the way of checking the types in a list:
https://stackoverflow.com/a/32705845/5288820
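The extra <p>Text6: ...</p> paragraph in the sample matters: a check that only looks at the first node inside the paragraph would accept it, because that paragraph starts with text before its link. The following is a minimal sketch comparing the two checks on a shortened version of the sample HTML. The variable names are made up for illustration, and next_element is used as the documented name for the node parsed right after a tag.

```python
from bs4 import BeautifulSoup

html = """
<p><a href='XYZ'>Text1</a></p>
<p>Text2</p>
<p>Text6: <a href='QWERTY'>Text5</a></p>
"""
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# Looks only at the node parsed right after the opening <p> tag.
first_node_is_text = [p for p in paragraphs if isinstance(p.next_element, str)]

# Looks at every direct child, as the predicate function above does.
all_children_are_text = [
    p for p in paragraphs if all(isinstance(child, str) for child in p.children)
]

print([p.get_text() for p in first_node_is_text])     # ['Text2', 'Text6: Text5']
print([p.get_text() for p in all_children_are_text])  # ['Text2']
```

If paragraphs that mix text and tags can occur in your input, checking every child is the safer choice.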
Or use a CSS selector: https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors
from bs4 import BeautifulSoup
html = """
<p><a href='XYZ'>Text1</a></p>
<p>Text2</p>
<p><a href='QWERTY'>Text3</a></p>
<p>Text4</p>
<p>Text6: <a href='QWERTY'>Text5</a></p>
""".replace('\n',"")
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p:not(:has(*))'))
# or, in case you only want to filter out "a" tags:
print(soup.select('p:not(:has(a))'))
Result (both selectors print the same list here):
[<p>Text2</p>, <p>Text4</p>]
https://stackoverflow.com/questions/74388919