Q: Saving SEC 10-K annual report text to a file (decoding problem)

Stack Overflow user

Asked on 2020-04-13 19:12:26

Answers: 1 · Views: 254 · Followers: 0 · Votes: 0

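I am trying to bulk-download the text that is visible to the "end user" from 10-K SEC EDGAR reports (I don't care about tables) and save it to a text file. I found the code below on YouTube, but I am facing two challenges:

1. I am not sure whether I am capturing all of the text, and when I print the result for the URL below I get very strange output (e.g., special characters at the very end of the printout).
2. I cannot seem to save the text to a .txt file, and I am not sure whether this is due to encoding (I am completely new to programming).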

import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''

    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"

# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content,'html5lib')

page_text = page_soup.html.body.get_text(' ',strip = True)

# normalize the text, remove characters. Additionally, restore missing window characters.
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text)) 

# print: this works however gives me weird special characters in the print (e.g., at the very end)
print(page_text_norm)

# save to file: this only gives me an empty text file
with open('testfile.txt','w') as file:
    file.write(page_text_norm)
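
For reference, the helper above can be checked offline on a short string, without downloading a filing. The sketch below repeats the function unchanged and feeds it a made-up sample (the sample text is an assumption for illustration, not taken from the report). It maps stray C1 control characters such as U+0093 and U+0094 to the Windows-1252 characters they were probably meant to be, and drops code points that Windows-1252 leaves undefined. Note that, as written, the pattern stops at U+0099, so U+009A through U+009F pass through unchanged.

```python
import re

def restore_windows_1252_characters(restore_string):
    """Reinterpret stray C1 control characters as Windows-1252 characters."""
    def to_windows_1252(match):
        try:
            # Treat the code point as a single byte and decode it as cp1252.
            return bytes([ord(match.group(0))]).decode("windows-1252")
        except UnicodeDecodeError:
            # Windows-1252 defines no character for this byte: remove it.
            return ""

    return re.sub(r"[\u0080-\u0099]", to_windows_1252, restore_string)

# Hypothetical sample: curly quotes and a dash that arrived as U+0093, U+0094,
# U+0096, plus U+0081, which has no Windows-1252 mapping.
sample = "\u0093Item 1A\u0094 \u0096 Risk Factors\u0081"
restored = restore_windows_1252_characters(sample)

# U+009C lies outside the pattern's range, so it is left untouched.
untouched = restore_windows_1252_characters("\u009c")
```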

edgar

parsing

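1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-04-13 22:39:23

Try this. If you include an example of the data you expect, it will be easier for people to understand what you need.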

from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
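
If staying with the libraries from the question is preferred, the empty file may come from the `open(...)` call relying on the platform default encoding. This is an assumption, since the question shows no traceback: on Windows the default is often cp1252, which cannot encode the combining marks that NFKD normalization produces, so `write` can raise `UnicodeEncodeError` after the file has already been truncated. A minimal sketch of the idea follows; `save_text`, `can_encode`, the sample string, and the demo file name are hypothetical and only for illustration.

```python
import os
import tempfile
import unicodedata

def save_text(path, text):
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)

def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# NFKD splits the accented e into "e" plus the combining accent U+0301,
# which cp1252 cannot represent; UTF-8 can represent any Unicode text.
sample = unicodedata.normalize("NFKD", "Caf\u00e9 \u2014 10-K")

demo_path = os.path.join(tempfile.gettempdir(), "testfile_demo.txt")
save_text(demo_path, sample)

# Read the file back with the same encoding to confirm nothing was lost.
with open(demo_path, encoding="utf-8") as fh:
    round_trip = fh.read()
```

In the question's code, the equivalent change would be passing `encoding="utf-8"` to `open('testfile.txt', 'w')`.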

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/61195015
