Combining regex with HTML tags

Stack Overflow user
Asked 2019-03-25 19:38:02
Answers: 2, Views: 112, Followers: 0, Votes: 1

I have the following text from an HTML page:

page = """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

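I want to extract the text between "Item 1. Business" and "Item 1A. Risk Factors". Because every page has a different HTML tag structure, I can't rely on Beautiful Soup alone. I use the following code to get the text, but it doesn't work: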

import re
from bs4 import BeautifulSoup

regexs = (r'bold;">\s*Item 1\.(.+?)bold;">\s*Item 1A\.',   # pattern 1: a bold style attribute before the item subtitle
          r'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',             # pattern 2: a <b> tag before the item subtitle
          r'Item 1\.\s*</b>(.+?)Item 1A\.\s*</b>',         # pattern 3: a </b> tag after the item subtitle
          r'Item 1\.\s*Business\.\s*</b(.+?)Item 1A\.\s*Risk Factors\.\s*</b')  # pattern 4: a </b> tag after the item + description subtitle

for regex in regexs:
    # Search the HTML for the pattern, ignoring case and letting "." match newlines.
    match = re.search(regex, page, flags=re.IGNORECASE | re.DOTALL)
    if match:
        # match.group(1) is the text captured by the parentheses (.+?)
        soup = BeautifulSoup(match.group(1), "html.parser")
        # soup.text strips the HTML tags and keeps only the text
        rawText = soup.text
        print(rawText)
        break

The expected output is:

Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

I think the first regex should match the pattern, but it doesn't.

Edit: here is the actual htm page and how I retrieve the text:

# Import the libraries
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
response = requests.get(url, headers=HEADERS)
print(response.status_code)

page = response.text
# Pre-process the HTML: remove extra whitespace and combine everything into one line.
page = page.strip()                  # remove whitespace at the beginning and end
page = page.replace('\n', ' ')       # replace newline characters with a space
page = page.replace('\r', '')        # remove carriage returns (Windows line endings)
page = page.replace('&nbsp;', ' ')   # replace the "&nbsp;" entity (non-breaking space) with a space
page = page.replace('&#160;', ' ')   # replace the "&#160;" entity (non-breaking space) with a space
page = page.replace(u'\xa0', ' ')    # replace the literal non-breaking space character with a space
page = page.replace(u'/s/', ' ')     # replace the literal "/s/" marker with a space
while '  ' in page:
    page = page.replace('  ', ' ')   # collapse runs of spaces into a single space
2 Answers

Stack Overflow user

Answered 2019-03-25 20:24:54

Something like the following?

import re
page =  """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

data = re.search(r'Item 1\. Business/</font> (.*)(<font(.*)">Item 1A\. Risk Factors)', page, flags=re.DOTALL).group(1)
print(data)
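
A likely reason the first pattern in the question never matches this sample: it looks for `bold;">` (with a semicolon), while the sample's style attribute ends in `bold">`. The sketch below compares the original pattern with a tolerant variant in which the semicolon is optional. It assumes the only variation is that semicolon plus some whitespace; `sample`, `strict`, `tolerant`, and `extract_item1` are hypothetical names, and `sample` is a shortened stand-in for the page.

```python
import re

# Shortened stand-in for the sample page from the question.
sample = (
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> '
    'Unless otherwise indicated by the context, we use the terms GE and GECC. '
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'
)

# Pattern 1 from the question: requires a semicolon right after "bold".
strict = r'bold;">\s*Item 1\.(.+?)bold;">\s*Item 1A\.'

# Tolerant variant (an assumption, not the asker's code): the semicolon is optional.
tolerant = r'bold;?\s*">\s*Item 1\.(.+?)bold;?\s*">\s*Item 1A\.'

def extract_item1(page, pattern):
    """Return the raw HTML captured between the two headings, or None if no match."""
    match = re.search(pattern, page, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None

strict_result = extract_item1(sample, strict)
tolerant_result = extract_item1(sample, tolerant)
print(strict_result)    # no match, because the sample contains no 'bold;">'
print(tolerant_result)  # raw HTML between the two headings
```

The captured group still contains the rest of the first heading (`Business/</font>`) and the start of the second `<font` tag, so it needs the same tag-stripping step the question already applies with `BeautifulSoup`.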
Votes: 0

Stack Overflow user

Answered 2019-03-25 20:27:32

If you change your regexes to:

regexs = (r'Item 1\.\s*Business/(.*)',
          r'Item 1\.\s*Business\.\s*</b(.+?)Item 1A\.\s*Risk Factors\.\s*</b')

does it work?
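
Since the tag structure differs from page to page, one alternative is to remove the tags first and match the headings in plain text, so the pattern no longer depends on `<font>`, `<b>`, or style attributes. This is a minimal sketch using only the standard library. The tag stripping is a crude regex rather than a real HTML parser, and `strip_tags`, `between_headings`, and `sample` are hypothetical names introduced for illustration.

```python
import html
import re

def strip_tags(page):
    """Crudely remove tags, decode entities, and collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', page)   # drop anything that looks like a tag
    text = html.unescape(text)             # e.g. &nbsp; becomes a non-breaking space
    return re.sub(r'\s+', ' ', text).strip()

def between_headings(page):
    """Return the plain text between 'Item 1. Business' and 'Item 1A. Risk Factors'."""
    text = strip_tags(page)
    match = re.search(
        r'Item 1\.\s*Business[\s/.]*(.+?)\s*Item 1A\.\s*Risk Factors',
        text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    return match.group(1) if match else None

# Shortened stand-in for the sample page from the question.
sample = (
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> '
    'Unless otherwise indicated by the context, we use the terms GE and GECC. '
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'
)

result = between_headings(sample)
print(result)
```

On a full filing, the same two headings may also appear in the table of contents, in which case the first match could be a table-of-contents entry rather than the section body. Iterating over `re.finditer` and keeping the longest capture is one possible way to handle that.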

Votes: 0

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/55336946