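I'm trying to scrape a couple of PDFs in R: PDF1 has 9 pages and PDF2 has 12. When I run the code below, it scrapes both PDFs, but only up to page 6, and there is nothing after that. Is there a reason for this? Am I missing something in my code?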
library(tm)
read <- readPDF(engine = "xpdf", control = list(text = "-layout"))
document <- Corpus(URISource("C:\\Users\\Goku\\Documents\\Python Scripts\\PDF Scraping\\123.pdf"), readerControl = list(reader = read))
doc <- content(document[[1]])
head(doc)
You can find the PDF at: https://www.scribd.com/document/396797318/123
Posted on 2019-01-04 21:54:08
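I can't reproduce your problem. Using your document, I read the text of all 12 pages in two different ways. Checking whether the two results are identical also returns TRUE.
tm with the pdftools reader: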
library(tm)
read <- readPDF(engine = "pdftools", control = list(text = "-layout"))
document <- Corpus(URISource("396797318-123.pdf"), readerControl = list(reader = read))
Using pdftools directly:
library(pdftools)
text <- pdf_text("396797318-123.pdf")
Checking that they are identical:
tm_text <- as.vector(sapply(document, as.character))
identical(text, tm_text)
[1] TRUE
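If identical() had returned FALSE instead, a page-by-page comparison would show where the two extractions diverge. A minimal sketch, using two made-up vectors (`a` and `b` are hypothetical stand-ins for `text` and `tm_text`, not data from the PDF):

```r
# Hypothetical stand-ins for two per-page extractions of the same document.
a <- c("page one", "page two", "page three")
b <- c("page one", "page 2", "page three")

same <- identical(a, b)     # FALSE as soon as any element differs
differing <- which(a != b)  # indices of the pages whose text differs

print(same)
print(differing)
```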
str(document)
List of 1
$ 396797318-123.pdf:List of 2
..$ content: chr [1:12] " Training and Development Policy\r\n Contents\r\"| __truncated__ "1.2 As a guiding principle, all staff must be offered appropriate and relevant\r\n development opportunitie"| __truncated__ " • Short Term Externally Provided Training (e.g. one or two day external\r\n courses)\r\n • Longer Term"| __truncated__ " are at no cost) together with periodic training bulletins informing employees of\r\n impending courses"| __truncated__ ...
..$ meta :List of 7
.. ..$ author : chr "leanne.cutts"
.. ..$ datetimestamp: POSIXct[1:1], format: "2014-02-28 17:57:34"
.. ..$ description : chr ""
.. ..$ heading : chr "Training and Development Policy"
.. ..$ id : chr "396797318-123.pdf"
.. ..$ language : chr "en"
.. ..$ origin : chr "PDFCreator Version 1.4.3"
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
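The `chr [1:12]` in the str() output means the content holds twelve elements, one per page. One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis of the question: head() returns only the first six elements by default, so calling head() on a per-page vector displays six pages even when all twelve were read. A minimal sketch with a made-up vector (`pages` is hypothetical, not the actual PDF text):

```r
# Hypothetical stand-in for a 12-page document: one string per page.
pages <- paste("text of page", 1:12)

n_total <- length(pages)        # number of pages actually held in the vector
n_shown <- length(head(pages))  # head() keeps only the first n = 6 elements by default
all_pages <- head(pages, n = 12)  # ask for all twelve explicitly

print(n_total)
print(n_shown)
```

Checking length() on the content, or printing the vector without head(), shows whether the later pages were actually read.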
https://stackoverflow.com/questions/54038102