Q: Information extraction - HTMLs

Stack Overflow user

Asked 2020-01-23 21:53:31

Answers: 1, Views: 135, Followers: 0, Votes: 0

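I have a bunch of HTML files that all share the same structure. From these locally stored HTML files I want to extract the fields marked in yellow (example). The relevant part is shown below as text (I am only interested in the div section); the complete HTML is available on Dropbox: https://www.dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0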

<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P> 
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
<P><STRONG>Analysts</STRONG></P>

I don't know much about Python, but I think this should be doable with Beautiful Soup. However, I am stuck. This is what I have so far:

import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

My output should be a CSV file containing: Name of executive / Function of executive / Symbol ticker / Period

python

html

pandas

csv

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-01-23 23:21:49

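The code below extracts the text from the positions marked in yellow.

I think the easiest way is to use XPath. As far as I know, bs4 does not support XPath, so the code uses lxml instead. I hope that difference is fine for you. The output file is named eggs.csv.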
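To illustrate the XPath approach on its own, here is a minimal sketch that runs against a trimmed copy of the fragment from the question. It assumes the div keeps the id `article_participants` in every file, and it anchors the expressions on that id rather than on an absolute path from `/html`. The names `SAMPLE`, `ticker` and `period` are only illustrative.

```python
from lxml import html

# Fragment adapted from the question (some attributes trimmed for brevity).
SAMPLE = """
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A href="http://seekingalpha.com/symbol/rdhl">RDHL</A>)</P>
<P>Q4 2014 <SPAN style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN>Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
</DIV>
"""

tree = html.fromstring(SAMPLE)

# The HTML parser lowercases tag names, so "div", "p" and "a" match the
# uppercase tags in the source. The ticker is the link text in paragraph 1.
ticker = tree.xpath('//div[@id="article_participants"]/p[1]/a/text()')[0]

# The first text node of paragraph 2 is the part before the first <span>.
period = tree.xpath('//div[@id="article_participants"]/p[2]/text()')[0].strip()

print(ticker, period)
```

Because `xpath()` returns a list, indexing with `[0]` raises `IndexError` when a file does not match the expected layout, so a real batch run would probably want to check the list length first.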

To make it work for you, change the directory variable.

*This works on Windows. On other platforms you have to change the form of the "directory" variable.
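One way to avoid platform-specific path strings is `pathlib`, which builds paths with the separator of the current operating system. The following is only a sketch: the helper name `html_files` is hypothetical, and unlike `endswith('.html')` it also accepts upper-case extensions such as `.HTML`.

```python
from pathlib import Path


def html_files(directory):
    """Return the .html files directly inside `directory`, sorted by path."""
    # Path() accepts both "C:/some/dir" and "/home/user/dir" style strings.
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".html"
    )
```

It could be called as `html_files(Path.home() / "Desktop")`, where the Desktop folder is just an assumed location; each returned `Path` can be passed directly to `open()`.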

import csv
import os
from lxml import html

directory = r"C:\Users\Anita Pania\Desktop"

# Open the output once, before the loop, so that the values of every
# HTML file are written and not only those of the last file processed.
with open('eggs.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for filename in os.listdir(directory):
        if filename.endswith('.html'):
            fname = os.path.join(directory, filename)
            with open(fname, 'r') as f:
                page = f.read()
            tree = html.fromstring(page)
            # xpath() returns a list of matching text nodes; only y2 is
            # reduced to its first element, the others stay lists.
            y1 = tree.xpath("/html/body/div/p[1]/a/text()")
            y2 = tree.xpath("/html/body/div/p[2]/text()")[0]
            y3 = tree.xpath("/html/body/div/p[5]/text()")
            y4 = tree.xpath("/html/body/div/p[6]/text()")
            y5 = tree.xpath("/html/body/div/p[7]/a/text()")
            filewriter.writerow(['Name of executive', y3])
            filewriter.writerow(['Function of executive', y4])
            filewriter.writerow(['Symbol ticker', y1])
            filewriter.writerow(['Period', y2])
            filewriter.writerow(['Other', y5])
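The question asks for one row per executive with four columns, which differs from writing one labelled row per field. A possible sketch of that final step follows. It assumes the paragraph texts of the div have already been extracted, that executives are listed between the "Executives" and "Analysts" headings, and that name and function are separated by " - ". The function name `executive_rows` and the file name `executives.csv` are hypothetical.

```python
import csv


def executive_rows(paragraphs, ticker, period):
    """Build [name, function, ticker, period] rows from paragraph texts.

    Assumes executives sit between the "Executives" and "Analysts"
    headings and that each entry looks like "Name - Function".
    """
    rows = []
    in_executives = False
    for text in paragraphs:
        text = text.strip()
        if text == "Executives":
            in_executives = True
        elif text == "Analysts":
            break
        elif in_executives and " - " in text:
            # Split only on the first separator so functions may contain " - ".
            name, function = text.split(" - ", 1)
            rows.append([name.strip(), function.strip(), ticker, period])
    return rows


# Paragraph texts taken from the fragment in the question.
paragraphs = [
    "Redhill Biopharma Ltd. (NASDAQ:RDHL)",
    "Q4 2014 Earnings Conference Call",
    "February 26, 2015 9:00 AM ET",
    "Executives",
    "Dror Ben Asher - CEO",
    "Ori Shilo - Deputy CEO, Finance and Operations",
    "Guy Goldberg - Chief Business Officer",
    "Analysts",
]
rows = executive_rows(paragraphs, ticker="RDHL", period="Q4 2014")

with open("executives.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Name of executive", "Function of executive", "Symbol ticker", "Period"])
    writer.writerows(rows)
```

With the default `csv.writer` settings, a function such as "Deputy CEO, Finance and Operations" is quoted automatically because it contains the delimiter, so the comma does not split the column.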

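Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei engine for the IT domain.

Original link: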

https://stackoverflow.com/questions/59880219
