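Hello, I am trying to use the R edgar package to read 13F filings from the SEC EDGAR database.
The challenge is that the filings I am looking at are old (around 2000): https://www.sec.gov/edgar/browse/?CIK=1087699
They are in a messy plain-text format, unlike today's 13F filings, and cannot be read with the readtxt function.
An example file is here: https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt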
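library(edgar)
F13 <- getFilings(
  cik.no = "0001087699",
  form.type = "13F-HR",
  1999,
  quarter = c(1, 2, 3),
  useragent = "myname@gmail.com"
)
I tried this, and R keeps telling me it is busy downloading, even though the file is not a large txt file, so something is wrong. Then, when it finally finishes, it says no filing information was found for the given CIK and form type, but I can clearly see the file. If the edgar package is not designed to handle this, how can I process it?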
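My end goal is to get the file into a nice data frame, with columns for the stock symbol and price and one row per stock. Please help.
Is there a scraping tool I could use? I used Chrome's inspect tool to highlight the elements, but they look strange to me (sorry, I am not good at scraping at all).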
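Posted on 2021-10-12 12:59:28
I parsed the file you provided as an example. I first copied the data from the file into a txt file. The file copied.txt needs to be in the current working directory. This should give you an idea of how to proceed.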
library(tidyverse)

df <- read_file("copied.txt") %>%
  # extract only the data that belongs to the table
  (function(x) {
    tbl_beg <- str_locate(x, "Managers Sole")[2] + 1
    tbl_end <- str_locate(x, "\r\n</TABLE>")[1]
    str_sub(x, tbl_beg, tbl_end)
  }) %>%
  # remove some unwanted characters from the beginning and the end of the extracted string
  str_sub(start = 4, end = -3) %>%
  # split into individual lines
  str_split('\"\r\n\"') %>%
  unlist() %>%
  # remove a broken line break
  str_remove("\r\n") %>%
  # replace the multi-word trailing text with a single underscore-joined token,
  # because the rows are split into columns on spaces below
  str_replace_all("Sole Managers Sole", " Managers_Sole") %>%
  # collapse repeated spaces
  str_squish() %>%
  # reverse each line so it can be split from the right: the company name contains spaces,
  # and once it is the last field it is okay for it to keep them
  stringi::stri_reverse() %>%
  str_split(pattern = " ", n = 6, simplify = TRUE) %>%
  # reverse each field back to its original character order
  apply(MARGIN = 2, FUN = stringi::stri_reverse) %>%
  as_tibble() %>%
  # restore the original column order
  select(c(6:1)) %>%
  set_names(nm = c("name_of_issuer", "title_of_cl", "cusip_number", "fair_market_value", "shares", "shares_of_princip_mngrs"))
# A tibble: 47 x 6
   name_of_issuer   title_of_cl cusip_number fair_market_value shares  shares_of_princip_mngrs
   <chr>            <chr>       <chr>        <chr>             <chr>   <chr>
 1 America Online   COM         02364J104    2,940,000         20,000  Managers_Sole
 2 Anheuser Busch   COM         35229103     3,045,000         40,000  Managers_Sole
 3 At Home          COM         45919107     787,500           5,000   Managers_Sole
 4 AT&T             COM         1957109      5,985,937         75,000  Managers_Sole
 5 Bank Toyko       COM         65379109     700,000           50,000  Managers_Sole
 6 Bay View Capital COM         07262L101    14,958,437        792,500 Managers_Sole
 7 Broadcast.com    COM         111310108    2,954,687         25,000  Managers_Sole
 8 Chase Manhattan  COM         16161A108    10,578,750        130,000 Managers_Sole
 9 Chase Manhattan  4/85C       16161A9DQ    59,375            500     Managers_Sole
10 Cisco Systems    COM         17275R102    4,930,312         45,000  Managers_Sole

Posted on 2021-10-11 09:21:25
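You can use the httr package to request the page:
> install.packages("httr")
# follow instructions etc
Then, in the R shell (you may need to restart it):
> httr::GET("https://www.sec.gov/Archives/edgar/data/1087699/000108769999000001/0001087699-99-000001.txt")
This downloads the file successfully. I am not fluent enough in R to parse the text, but it looks simple: split the text on <TABLE>, split the table body into lines on newlines, and split each line on spaces.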
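For what it's worth, here is a rough base R sketch of that split-by-<TABLE>, split-by-line, split-into-fields idea. The sample text is a made-up excerpt built from a few rows shown in the other answer, not the real filing, so the actual layout may differ (extra columns, wrapped lines). The names filing_text, parse_13f_table and row_pattern are hypothetical. One caveat: splitting on spaces alone would break issuer names that contain spaces, so the sketch matches the trailing fields with a regular expression instead.

```r
# Illustrative sample only: a made-up excerpt shaped like an old plain-text 13F table.
# The real filing may use a different layout.
filing_text <- paste(
  "<TABLE>",
  "<CAPTION>",
  "Name of Issuer     Title of Class  CUSIP      Fair Market Value  Shares",
  "<S>                <C>             <C>        <C>                <C>",
  "America Online     COM             02364J104  2,940,000          20,000",
  "Bay View Capital   COM             07262L101  14,958,437         792,500",
  "Cisco Systems      COM             17275R102  4,930,312          45,000",
  "</TABLE>",
  sep = "\n"
)

# Hypothetical helper: turn the text between <TABLE> and </TABLE> into a data frame.
parse_13f_table <- function(txt) {
  # 1. Keep only the text between the first <TABLE> and the next </TABLE>
  body <- sub("(?s).*?<TABLE>(.*?)</TABLE>.*", "\\1", txt, perl = TRUE)

  # 2. Split the table body into trimmed lines
  lines <- trimws(strsplit(body, "\r?\n")[[1]])

  # 3. Keep lines that end in: class, CUSIP-like code, two comma-grouped numbers.
  #    Everything before those fields is treated as the issuer name (assumed layout).
  row_pattern <- "^(.+?)\\s+(\\S+)\\s+([0-9A-Za-z]{7,9})\\s+([0-9,]+)\\s+([0-9,]+)$"
  rows <- lines[grepl(row_pattern, lines, perl = TRUE)]

  # 4. Pull the captured groups out of each matching line
  matches <- regmatches(rows, regexec(row_pattern, rows, perl = TRUE))
  fields <- do.call(rbind, lapply(matches, function(m) m[-1]))

  # 5. Build the data frame, stripping thousands separators from the numbers
  data.frame(
    name_of_issuer    = fields[, 1],
    title_of_class    = fields[, 2],
    cusip             = fields[, 3],
    fair_market_value = as.numeric(gsub(",", "", fields[, 4])),
    shares            = as.numeric(gsub(",", "", fields[, 5])),
    stringsAsFactors  = FALSE
  )
}

holdings <- parse_13f_table(filing_text)
print(holdings)
```

To try it on the downloaded filing, you would pass the response body as a single string in place of filing_text, and you will probably need to adjust row_pattern once you see how the real rows are laid out.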
https://stackoverflow.com/questions/69460581