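I need to scrape data from a web page that comes in the format below. I only need the inner text of the first child of the h2 and h3 tags (that is, the text of the first span) and the text of all the <p> tags: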
<div class="info">
<h2>
<span>first heading</span>
<span> not required</span>
</h2>
<p> 1 paragraph</p>
<p> 2 paragraph</p>
<div> some tags</div>
<h3>
<span>second heading</span>
<span> not required</span>
</h3>
<p> 3 paragraph</p>
<p> 4 paragraph</p>
</div>
Desired output:
first heading
1 paragraph
2 paragraph
second heading
3 paragraph
4 paragraph
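After trying soup.find_all("h1","p","h2","h3"), I also get the inner text of the second span, which I don't want. I only need the text of the first span inside h2 and h3, plus the text of the p tags. I am new to Python and Beautiful Soup, so any help would be appreciated.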
Posted on 2019-06-02 20:42:44
Try this:
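from bs4 import BeautifulSoup as bs

my_data = """[your html above]"""  # placeholder: paste the HTML from the question here
soup = bs(my_data, "lxml")
for head in ["h2", "h3"]:
    # find() returns the first matching tag in the document
    target = soup.find(head)
    # findChild() returns the first child tag, here the first <span>
    print(target.findChild().text)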
Output:
first heading
second heading
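If it helps, here is a slightly more defensive sketch of the same idea. It assumes the HTML from the question, uses the built-in "html.parser" instead of "lxml" so nothing extra needs installing, and wraps the lookup in a helper called first_child_text (my own name, not part of Beautiful Soup) that returns None when the tag or its child is missing:

```python
from bs4 import BeautifulSoup

# Sample HTML copied from the question.
html = """
<div class="info">
<h2>
<span>first heading</span>
<span> not required</span>
</h2>
<p> 1 paragraph</p>
<p> 2 paragraph</p>
<div> some tags</div>
<h3>
<span>second heading</span>
<span> not required</span>
</h3>
<p> 3 paragraph</p>
<p> 4 paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def first_child_text(soup, tag_name):
    """Return the text of the first child tag of the first <tag_name>, or None."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    child = tag.find(True)  # True matches any tag, so this is the first nested tag
    return child.get_text(strip=True) if child is not None else None


headings = [first_child_text(soup, name) for name in ("h2", "h3")]
print(headings)
```

Note that, like the snippet above, this only covers the headings; the p tags still have to be collected separately.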
Posted on 2019-06-03 00:34:19
You can use find_all() to get all the tags you need, then call findChild() on the elements where you only want the first child:
from bs4 import BeautifulSoup
html = """
<div class="info">
<h2>
<span>first heading</span>
<span> not required</span>
</h2>
<p> 1 paragraph</p>
<p> 2 paragraph</p>
<div> some tags</div>
<h3>
<span>second heading</span>
<span> not required</span>
</h3>
<p> 3 paragraph</p>
<p> 4 paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")
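# Passing a list matches any of the tag names, in document order
for elem in soup.find_all(['h2', 'h3', 'p']):
    if elem.name == 'p':
        # Paragraphs: print the whole text
        print(elem.text)
    else:
        # Headings: print only the first child tag (the first <span>)
        print(elem.findChild().text)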
Output:
first heading
1 paragraph
2 paragraph
second heading
3 paragraph
4 paragraph
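As an alternative sketch, assuming a reasonably recent Beautiful Soup (4.7.0 or later, where select() is backed by the soupsieve package), the same items can be collected into a list with a single CSS selector. The variable names, the use of "html.parser", and stripping whitespace with get_text(strip=True) are my own choices, not something from the question:

```python
from bs4 import BeautifulSoup

# Sample HTML copied from the question.
html = """
<div class="info">
<h2>
<span>first heading</span>
<span> not required</span>
</h2>
<p> 1 paragraph</p>
<p> 2 paragraph</p>
<div> some tags</div>
<h3>
<span>second heading</span>
<span> not required</span>
</h3>
<p> 3 paragraph</p>
<p> 4 paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "h2 > span:first-child" matches a span that is the first child element of an h2;
# the comma combines it with the h3 rule and with every <p>.
selector = "h2 > span:first-child, h3 > span:first-child, p"

# get_text(strip=True) drops the leading space inside the <p> tags.
texts = [elem.get_text(strip=True) for elem in soup.select(selector)]

for text in texts:
    print(text)
```

One difference from the loop above: the selector only picks up a span when it is the very first child element of the heading, whereas findChild() takes whatever tag comes first.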
https://stackoverflow.com/questions/56414317