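I have a bunch of HTML files that all share the same layout. From these locally stored HTML files I want to extract the fields marked in yellow (example). The relevant part is shown below as text (I am only interested in the div section); the full HTML is available on Dropbox: https://www.dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0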
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>
I don't know much Python, but I think this should be doable with Beautiful Soup. However, I am stuck. What I have so far is:
import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
My output should be a CSV file containing: Name of executive / Function of executive / Symbol ticker / Period
Answered 2020-01-23 23:21:49
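The code below extracts the text from the yellow-marked positions.
I think the simplest approach is XPath. As far as I know, bs4 does not support XPath, so the code uses lxml instead; I hope that difference is fine for you. The output file is named eggs.csv.
To make it work for you, change the directory variable.
* This works on Windows. On other platforms you will have to change the form of the "directory" variable.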
import os
import csv
from lxml import html

directory = r"C:\Users\Anita Pania\Desktop"

# Open the output once, before the loop: opening it with 'w' inside the loop
# would overwrite eggs.csv for every HTML file.
with open('eggs.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for filename in os.listdir(directory):
        if filename.endswith('.html'):
            fname = os.path.join(directory, filename)
            with open(fname, 'r') as f:
                page = f.read()
            tree = html.fromstring(page)
            # Each xpath() call returns a list of the matching text nodes
            y1 = tree.xpath("/html/body/div/p[1]/a/text()")
            y2 = tree.xpath("/html/body/div/p[2]/text()")[0]
            y3 = tree.xpath("/html/body/div/p[5]/text()")
            y4 = tree.xpath("/html/body/div/p[6]/text()")
            y5 = tree.xpath("/html/body/div/p[7]/a/text()")
            filewriter.writerow(['Name of executive', y3])
            filewriter.writerow(['Function of executive', y4])
            filewriter.writerow(['Symbol ticker', y1])
            filewriter.writerow(['Period', y2])
            filewriter.writerow(['Other', y5])
https://stackoverflow.com/questions/59880219
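The question asks for one CSV row per executive (Name of executive / Function of executive / Symbol ticker / Period). The following is a minimal sketch of that layout, not the answerer's method: it swaps in the standard library's html.parser instead of lxml or Beautiful Soup so that it runs without extra installs. The names ParticipantParagraphs, participant_rows, write_csv and SAMPLE are hypothetical. It assumes that the ticker is the symbolSlug attribute of the first link, that the period is the first two words of the second paragraph, that executives are listed as "Name - Function" between the "Executives" and "Analysts" paragraphs, and that the participants div contains no nested div.

```python
import csv
import io
from html.parser import HTMLParser

# Snippet taken from the question; the real files are assumed to follow it.
SAMPLE = """<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>
"""


class ParticipantParagraphs(HTMLParser):
    """Collect the text of each <p> inside <div id="article_participants">."""

    def __init__(self):
        super().__init__()
        self.in_div = False    # True while inside the participants div
        self.in_p = False      # True while inside a <p> of that div
        self.paragraphs = []   # one string per <p>
        self.ticker = None     # symbolSlug of the first link found

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "div" and dict(attrs).get("id") == "article_participants":
            self.in_div = True
        elif tag == "p" and self.in_div:
            self.in_p = True
            self.paragraphs.append("")
        elif tag == "a" and self.in_p and self.ticker is None:
            self.ticker = dict(attrs).get("symbolslug")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "div":
            self.in_div = False  # assumption: no nested div inside the block

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def participant_rows(html_text):
    """Return [name, function, ticker, period] for each listed executive."""
    parser = ParticipantParagraphs()
    parser.feed(html_text)
    parser.close()
    paras = [p.strip() for p in parser.paragraphs]
    if "Executives" not in paras:
        return []
    # Assumption: the period is the first two words, e.g. "Q4 2014".
    period = " ".join(paras[1].split()[:2]) if len(paras) > 1 else ""
    rows = []
    for line in paras[paras.index("Executives") + 1:]:
        if line == "Analysts":
            break
        name, _, function = line.partition(" - ")
        rows.append([name, function, parser.ticker, period])
    return rows


def write_csv(rows, out):
    """Write a header plus the rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["Name of executive", "Function of executive",
                     "Symbol ticker", "Period"])
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(participant_rows(SAMPLE), buffer)
    print(buffer.getvalue())
```

To process a folder, one could call participant_rows on the text of each .html file, collect the rows in a single list, and pass that list to write_csv with a file opened using newline=''.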