I have the following text from an HTML page:
page = """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).
General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""
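I want to extract the text between "Item 1. Business" and "Item 1A. Risk Factors". Because every page has a different HTML tag structure, I can't rely on BeautifulSoup alone to locate the section. I'm using the following code to get the text, but it doesn't work: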
regexs = ('bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.',   #<===pattern 1: with an attribute bold before the item subtitle
          'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',               #<===pattern 2: with a tag <b> before the item subtitle
          'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>',         #<===pattern 3: with a tag <\b> after the item subtitle
          'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b')   #<===pattern 4: with a tag <\b> after the item+description subtitle
for regex in regexs:
    match = re.search(regex, page, flags=re.IGNORECASE|re.DOTALL)   #<===search for the pattern in HTML using re.search from the re package. Ignore cases.
    if match:
        soup = BeautifulSoup(match.group(1), "html.parser")   #<=== match.group(1) returns the texts inside the parentheses (.*?)
        #soup.text removes the html tags and only keep the texts
        #rawText = soup.text.encode('utf8')   #<=== you have to change the encoding the unicodes
        rawText = soup.text
        print(rawText)
        break
The expected output is:
Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).
General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.
I think the first regex should match, but it doesn't.
Edit: here is the actual .htm page and how I retrieve its text:
# Import the libraries
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.sec.gov/Archives/edgar/data/40545/000004054513000036/geform10k2012.htm"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
response = requests.get(url, headers=HEADERS)
print(response.status_code)
page = response.text
#Pre-processing the html content by removing extra white space and combining then into one line.
page = page.strip() #<=== remove white space at the beginning and end
page = page.replace('\n', ' ') #<===replace the \n (new line) character with space
page = page.replace('\r', '') #<===remove the \r (carriage return, e.g. from Windows line endings)
page = page.replace('&nbsp;', ' ') #<===replace "&nbsp;" (a special character for space in HTML) with space.
page = page.replace('&#160;', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
page = page.replace(u'\xa0', ' ') #<===replace "\xa0" (the non-breaking space character itself) with space.
page = page.replace(u'/s/', ' ') #<===replace the "/s/" marker with space.
while '  ' in page:
    page = page.replace('  ', ' ') #<===collapse runs of spaces into a single space
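The whitespace clean-up above could also be written more compactly. The following is only a sketch: the function name `normalize` is hypothetical, it assumes that collapsing every run of whitespace (including tabs) into one space is acceptable, and it does not reproduce the `/s/` replacement.

```python
import re

def normalize(page):
    """Replace non-breaking-space variants with spaces and collapse whitespace."""
    # Handle the two entity spellings and the decoded character explicitly,
    # so that other entities such as &lt; are left untouched.
    for token in ('&nbsp;', '&#160;', '\xa0'):
        page = page.replace(token, ' ')
    # \s+ covers spaces, tabs, \r and \n in one pass.
    return re.sub(r'\s+', ' ', page).strip()

print(normalize(' a&nbsp;&#160;b\r\n\xa0 c '))
```

A single `re.sub` avoids the `while` loop, at the cost of also flattening tabs and any other whitespace characters.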
Answered 2019-03-25 20:24:54
Something like the following?
import re
page = """
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> Unless otherwise indicated by the context, we use the terms “GE” and “GECC” on the basis of consolidation described in Note 1 to the consolidated financial statements in Part II, Item 8. “Financial Statements and Supplementary Data” of this Form 10-K Report. Also, unless otherwise indicated by the context, “General Electric” means the parent company, General Electric Company (the Company).
General Electric’s address is 1 River Road, Schenectady, NY 12345-6999; we also maintain executive offices at 3135 Easton Turnpike, Fairfield, CT 06828-0001.
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>"""
data = re.search(r'Item 1\. Business/</font> (.*)(<font(.*)">Item 1A\. Risk Factors)', page, flags=re.DOTALL).group(1)
print(data)
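One likely reason the question's first pattern fails on this snippet is that it expects `bold;">`, while the style attribute in the sample ends in `bold">` with no trailing semicolon. Below is a minimal sketch of a more tolerant variant. The names `sample`, `PATTERN` and `extract_item1` are hypothetical, and it assumes the heading is closed by some tag before the body text starts.

```python
import re

# Hypothetical sample that mirrors the structure of the snippet in the question.
sample = (
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1. Business/</font> '
    'Body text of Item 1. '
    '<font style="FONT-SIZE: 10pt; FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'
)

PATTERN = re.compile(
    r'bold;?\s*">\s*Item 1\.'          # heading start; the semicolon is optional
    r'[^<]*</[^>]+>'                   # rest of the heading text and its closing tag
    r'(.+?)'                           # lazy capture of the section body
    r'<[^>]*bold;?\s*">\s*Item 1A\.',  # opening tag of the next heading
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_item1(html):
    """Return the text between the two headings, or None if there is no match."""
    m = PATTERN.search(html)
    return m.group(1).strip() if m else None

print(extract_item1(sample))
```

On a complete filing the same headings may also appear earlier, for example in a table of contents, so the first match is not guaranteed to be the section body; that case is not handled here.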
Answered 2019-03-25 20:27:32
What if you change your regexes to:
regexs = ('Item 1\.\s*Business\/(.*)',
'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b')
Does that work?
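Note that a pattern ending in an unbounded `(.*)` with `re.DOTALL` captures everything up to the end of the page, including the "Item 1A" heading. A rough sketch of a bounded alternative follows. The names `sample`, `BOUNDED` and `item1_text` are hypothetical, and the tag stripping with `re.sub` is a crude stand-in for `BeautifulSoup(...).text`, used only to keep the example dependency-free.

```python
import re

# Hypothetical sample with the same shape as the snippet in the question.
sample = (
    '<font style="FONT-WEIGHT: bold">Item 1. Business/</font> '
    'Body text. '
    '<font style="FONT-WEIGHT: bold">Item 1A. Risk Factors</font>'
)

# Lazy capture that stops just before the tag opening the Item 1A heading.
BOUNDED = re.compile(
    r'Item 1\.\s*Business/?(.+?)(?=<[^>]*>\s*Item 1A\.)',
    flags=re.IGNORECASE | re.DOTALL,
)

def item1_text(html):
    """Return the Item 1 body with tags removed, or None if there is no match."""
    m = BOUNDED.search(html)
    if m is None:
        return None
    # Crude tag removal for illustration only.
    return re.sub(r'<[^>]+>', '', m.group(1)).strip()

print(item1_text(sample))
```

The lookahead `(?=...)` checks for the next heading without consuming it, so the capture ends cleanly before the Item 1A tag.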
https://stackoverflow.com/questions/55336946