问将regex与html标记组合在一起
EN

Stack Overflow用户

提问于 2019-03-25 19:38:02

回答 2查看 112关注 0票数 1

我有来自html页面的以下文本：

page = 
"""
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

我想找到获取项目1业务和项目1A风险因素之间的文本。因为每个页面都有不同的html标签结构，所以我不能使用漂亮的汤。我使用以下代码来获取文本，但它不起作用：

regexs = ('bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.',   #<===pattern 1: with an attribute bold before the item subtitle
              'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',               #<===pattern 2: with a tag <b> before the item subtitle
              'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>',         #<===pattern 3: with a tag <\b> after the item subtitle          
              'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b') #<===pattern 4: with a tag <\b> after the item+description subtitle 

for regex in regexs:
    match = re.search(regex, page, flags=re.IGNORECASE|re.DOTALL)  #<===search for the pattern in HTML using re.search from the re package. Ignore cases.
    if match:
        soup = BeautifulSoup(match.group(1), "html.parser") #<=== match.group(1) returns the texts inside the parentheses (.*?) 

            #soup.text removes the html tags and only keep the texts
            #rawText = soup.text.encode('utf8') #<=== you have to change the encoding the unicodes
        rawText = soup.text
        print(rawText)
        break

预期输出为：

Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

我认为，第一个正则表达式应该与模式匹配，但它不匹配。

编辑:这是实际的htm页面和检索文本的方式：

# Import the libraries
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
response = requests.get(url, headers=HEADERS)
print(response.status_code)

page = response.text
#Pre-processing the html content by removing extra white space and combining then into one line.
page = page.strip()  #<=== remove white space at the beginning and end
page = page.replace('\n', ' ') #<===replace the \n (new line) character with space
page = page.replace('\r', '') #<===replace the \r (carriage returns -if you're on windows) with space
page = page.replace('&nbsp;', ' ') #<===replace "&nbsp;" (a special character for space in HTML) with space. 
page = page.replace('&#160;', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
page = page.replace(u'\xa0', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
page = page.replace(u'/s/', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
while '  ' in page:
    page = page.replace('  ', ' ') #<===remove extra space

regex

python

html

回答 2

Stack Overflow用户

发布于 2019-03-25 20:24:54

类似于下面的内容？

import re
page =  """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).

General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.

<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""

data = re.search('Item 1\. Business\/<\/font> (.*)(<font(.*)">Item 1A. Risk Factors)', page, flags=re.DOTALL).group(1)
print(data)

票数 0

Stack Overflow用户

发布于 2019-03-25 20:27:32

如果您更改了您的正则表达式：

regexs = ('Item 1\.\s*Business\/(.*)',
          'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b')

它起作用了吗？

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/55336946

复制

相似问题

问将regex与html标记组合在一起
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问将regex与html标记组合在一起EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问将regex与html标记组合在一起
EN