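I am trying to extract information from a web page using BeautifulSoup and Python. I want to extract the information that sits under specific tags. To decide whether a tag is the right one, I want to compare its text and then extract the text of the immediately following tag.
For example, if the following is part of the HTML page source: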
<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
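</div>
I want to check whether a <p class="title"> has the text value Procurement type and, if so, print Services.
Similarly, if a <p class="title"> has the text value Reference, I want to print the reference value, and if it has the value Countries, I want to print all the countries, i.e. Belgium, France, Luxembourg.
I know I can extract all the text with <p class="data strong">, append it to a list, and then use indexes to get the values. The problem is that the order in which these <p class="title"> tags occur is not fixed: in some places the countries may be mentioned before the procurement type. So I want to check the text value and then extract the text of the next immediate tag. I am still new to BeautifulSoup, so any help is appreciated. Thanks.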
Answered 2019-04-10 11:48:44
There are many ways to do this. Here is one.
from bs4 import BeautifulSoup
htmldata='''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
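soup = BeautifulSoup(htmldata, 'html.parser')
# Collect every <p class="title"> label on the page.
items = soup.find_all('p', class_='title')
for item in items:
    # Match on the label text, then print the text of the next <p> in the document.
    if ('Procurement type' in item.text) or ('Reference' in item.text):
        print(item.find_next('p').text)
Answered 2019-04-10 12:24:17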
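You can also use the :contains pseudo-class with bs4 4.7.1. I have passed the conditions as a single selector list here, but you could split out each condition. Note that newer soupsieve releases deprecate :contains in favor of :-soup-contains.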
from bs4 import BeautifulSoup as bs
import re
html = 'yourHTML'
soup = bs(html, 'lxml')
items=[re.sub(r'\n\s+','', item.text.strip()) for item in soup.select('p.title:contains("Procurement type") + p, p.title:contains(Reference) + p, p.title:contains(Countries) + p')]
print(items)
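Splitting out each condition, as mentioned above, could look something like the following. This is only a minimal sketch: `value_for` is a hypothetical helper name (not part of bs4), the HTML is a trimmed copy of the question's markup, and it assumes a soupsieve version recent enough to support the `:-soup-contains` spelling.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, used only for illustration.
html = '''
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')


def value_for(soup, title):
    """Hypothetical helper: text of the <p> right after the matching title."""
    # "+" is the adjacent-sibling combinator; it only considers elements,
    # so the whitespace between the two <p> tags does not get in the way.
    node = soup.select_one(f'p.title:-soup-contains("{title}") + p')
    return node.get_text(strip=True) if node else None


print(value_for(soup, 'Procurement type'))  # Services
print(value_for(soup, 'Reference'))         # ANAJSKJD23423-Commission
print(value_for(soup, 'Funding Agency'))    # None (not in the trimmed markup)
```

Running one selector per title makes it easy to tell which label produced which value, at the cost of one pass over the document per lookup.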

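Answered 2019-04-10 11:46:51
You can pass an extra argument to .find() or .find_all() to match specific text, and then use .next_sibling or find_next() (findNext() in older code) to get the next tag, which holds the content. The examples here use string=, which is the current name of the older text= argument.
For example:
soup.find('p', {'class': 'title'}, string='Procurement type')
Given: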
html = '''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
You can do this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
alpha = soup.find('p', {'class': 'title'}, string='Procurement type')
# find_next_siblings() yields only tags, so the whitespace strings between tags are skipped.
for sibling in alpha.find_next_siblings():
    print(sibling.text)
Output:
Services
or
ref = soup.find('p', {'class': 'title'}, string='Reference')
# As above, only sibling tags are visited, so no try/except is needed.
for sibling in ref.find_next_siblings():
    print(sibling.text)
Output:
ANAJSKJD23423-Commission
or
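countries = soup.find('p', {'class': 'title'}, string='Countries')
# Drop the literal '", "' separators, then split the remaining text into lines.
names = countries.find_next('p', {'class': 'data strong'}).text.replace('", "', '').strip().split('\n')
# Discard the empty or whitespace-only lines left where the separators were.
names = [name.strip() for name in names if name.strip()]
for country in names:
    print(country)
Output: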
Belgium
France
Luxembourg
https://stackoverflow.com/questions/55611273
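Because the question stresses that the titles can appear in any order, another option is to collect every title/value pair into a dictionary and look values up by name. The following is a minimal sketch under that assumption: `extract_fields` is a hypothetical helper name (not part of bs4), the HTML is a trimmed copy of the question's markup, and values that contain <span> children are returned as a list.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, used only for illustration.
html = '''
<div class="row">
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
</div>
'''


def extract_fields(markup):
    """Hypothetical helper: map each title's text to the value that follows it."""
    soup = BeautifulSoup(markup, 'html.parser')
    fields = {}
    for title in soup.find_all('p', class_='title'):
        # find_next_sibling('p') skips the whitespace between the two tags.
        value = title.find_next_sibling('p')
        if value is None:
            continue
        key = title.get_text(strip=True)
        spans = value.find_all('span')
        if spans:
            # Multi-valued field: take the <span> texts and ignore
            # the literal '", "' separators between them.
            fields[key] = [span.get_text(strip=True) for span in spans]
        else:
            fields[key] = value.get_text(strip=True)
    return fields


fields = extract_fields(html)
print(fields['Procurement type'])       # Services
print(', '.join(fields['Countries']))   # Belgium, France, Luxembourg
```

Since the lookup is by title text, it does not matter whether Countries comes before or after Procurement type in the page.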