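I have many HTML files, and I need to get the complete header (title) from each one. The header text is spread across spans whose classes vary by position: class="c6", class="c7".
I tried BeautifulSoup: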
for head_c6 in soup.find_all('span', attrs={'class': 'c6'}):
    print(head_c6.get_text())
for head_c7 in soup.find_all('span', attrs={'class': 'c7'}):
    print(head_c7.get_text())
But the result is:
Q3 2017 American Express Co Earnings Call - Final LENGTH:
Q2 2016 Akamai Technologies Inc Call - Final Earnings
Here is what the different files look like:
File 1
<div class="c4">
<p class="c5">
<span class="c6">
Q3 2017 American Express Co Earnings Call - Final
</span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
LENGTH:
</span>
<span class="c2">
11051 words
</span>
</p>
</div>
File 2
<div class="c4">
<p class="c5">
<span class="c6">
Q2 2018 Akamai Technologies Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>
</div>
File 3
<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>
What I want is the full text of each title:
Q3 2017 American Express Co Earnings Call - Final
Q2 2018 Akamai Technologies Inc Earnings Call - Final
Q4 2018 Facebook Inc Earnings Call - Final
Posted on 2019-05-14 10:05:41
Use the regular expression module re. I have worked through the HTML of the last file; you can do the same for the remaining files.
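A note on the class_ patterns used in this answer: BeautifulSoup tests a compiled regex against each class name with search semantics, so a loose pattern such as "c" also matches classes like c2. The following is a minimal sketch of that difference; the one-line HTML sample and the anchored pattern ^c[67]$ are illustrative assumptions, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: a title span plus two spans that are not part of the title.
html = ('<p class="c5"><span class="c6">Title</span> '
        '<span class="c7">LENGTH:</span> '
        '<span class="c2">11051 words</span></p>')
soup = BeautifulSoup(html, "html.parser")

# "c" is found inside c6, c7 and c2, so all three spans match.
loose = [s.get_text(strip=True)
         for s in soup.find_all("span", class_=re.compile("c"))]

# Anchoring the pattern limits the match to exactly c6 or c7.
strict = [s.get_text(strip=True)
          for s in soup.find_all("span", class_=re.compile(r"^c[67]$"))]

print(loose)
print(strict)
```

If the files contain other span classes (File 1 has a c2 span holding the word count), the anchored form is probably the safer choice.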
from bs4 import BeautifulSoup
import re
data='''<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>'''
soup=BeautifulSoup(data,'html.parser')
items=[item.text.strip() for item in soup.find_all('span', class_=re.compile("c"))]
stritem=' '.join(items)
print(stritem.replace('\n',''))
Output:
Q4 2018 Facebook Inc Earnings Call - Final
You can also restrict the pattern to the two classes:
items=[item.text.strip() for item in soup.find_all('span', class_=re.compile("c6|c7"))]
stritem=' '.join(items)
print(stritem.replace('\n',''))
Or, to get the text of the parent tag, try this:
from bs4 import BeautifulSoup
import re
data='''<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>'''
soup=BeautifulSoup(data,'html.parser')
childtag=soup.find('span', class_=re.compile("c6|c7"))
parenttag=childtag.parent
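# Collapse every run of whitespace (including newlines) into a single space;
# simply deleting '\n' would glue adjacent words together.
print(' '.join(parenttag.text.split()))
Posted on 2019-05-14 10:21:56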
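Python's built-in strip() string method removes all leading and trailing whitespace from a string.
str.join(iterable) returns a string that is the concatenation of the strings in the iterable.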
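As a small illustration of these two methods together with split(), here is a sketch of whitespace normalization on a made-up string (the sample value is an assumption chosen to resemble the span text):

```python
# Hypothetical raw text, similar to what get_text() returns for nested spans.
raw = "\n  Q4 2018\n\n  Facebook \n Inc\n"

stripped = raw.strip()         # trims only the leading and trailing whitespace
pieces = raw.split()           # splits on any run of whitespace, dropping empties
normalized = " ".join(pieces)  # rejoins the pieces with single spaces

print(repr(stripped))
print(normalized)
```

Note that strip() alone leaves the inner newlines in place, which is why the split-and-join step is needed to get a single clean line.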
from bs4 import BeautifulSoup
html1 = ''' <div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p></div>'''
soup = BeautifulSoup(html1,'lxml')
tag = soup.find('div',{'class':'c4'})
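# split() with no arguments breaks on any whitespace run, so joining the
# pieces with ' ' yields a single clean line.
header = ' '.join(tag.text.split())
print(header)
Output:
Q4 2018 Facebook Inc Earnings Call - Final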
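The snippets so far use only the File 3 markup. In File 1, a second div holds a c7 span with "LENGTH:", so collecting every c6/c7 span in the document would append that label to the title. The following is a rough sketch that handles all three sample files, assuming the title is always the first p.c5 that contains a c6 span; the helper name extract_title is hypothetical.

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the text of the first <p class="c5"> containing a c6 span.

    Assumption: the title paragraph always includes at least one c6 span
    and comes before any other paragraph that does.
    """
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p", class_="c5"):
        if p.find("span", class_="c6") is not None:
            # get_text(" ") puts a space between adjacent spans;
            # split/join then collapses all whitespace runs.
            return " ".join(p.get_text(" ").split())
    return None

file1 = '''<div class="c4"><p class="c5">
<span class="c6">Q3 2017 American Express Co Earnings Call - Final</span>
</p></div>
<div class="c4"><p class="c5">
<span class="c7">LENGTH:</span><span class="c2">11051 words</span>
</p></div>'''

file2 = '''<div class="c4"><p class="c5">
<span class="c6">Q2 2018 Akamai Technologies Inc</span>
<span class="c7">Earnings</span>
<span class="c6">Call - Final</span>
</p></div>'''

file3 = '''<div class="c4"><p class="c5">
<span class="c6">Q4 2018</span><span class="c7">Facebook</span>
<span class="c6">Inc</span><span class="c7">Earnings</span>
<span class="c6">Call - Final</span>
</p>'''

for html in (file1, file2, file3):
    print(extract_title(html))
```

If a real file has a title paragraph made up only of c7 spans, this assumption breaks and the selection rule would need adjusting.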
Posted on 2019-05-14 10:54:03
Passing a comma-separated (OR) list of class selectors to select seems simpler and more efficient.
from bs4 import BeautifulSoup as bs
html = '''<div class="c4">
<p class="c5">
<span class="c6">
Q4 2018
</span>
<span class="c7">
Facebook
</span>
<span class="c6">
Inc
</span>
<span class="c7">
Earnings
</span>
<span class="c6">
Call - Final
</span>
</p>'''
soup= bs(html,'html.parser')
result = ' '.join([item.text.strip() for item in soup.select('.c6,.c7')])
print(result)
https://stackoverflow.com/questions/56127598