我需要从PDF中提取文本,我有一个关键字列表,它告诉我我需要提取哪些文本部分。
PDF看起来如下所示:
到目前为止,这是我的代码:
library(pdftools)
library(pdfsearch)
library(tidyverse)
pdf <- pdf_text(dir(pattern = "*.pdf")) %>%
read_lines()
Keyword_list <- c("swDisproportionateCost", `"swDisproportionateCostOtherEULegislation", "swExemptionsTransboundary","swDisproportionateCostAlternativeFinancing","swDisproportionateCostAnalysis","swDisproportionateCostScale")`
然后我尝试使用keyword_search,但它只告诉我关键字在哪一行。
我想将草书中的文本提取为我的keyword_list中的一个新列。我认为可以用regex来完成,使用关键字和粗体中的文本作为开始和停止。
这是一个到pdf的链接。https://www.dropbox.com/s/kyyzr5wnh8z87if/FINAL%20Draft4_WFD_Reporting_Guidance_2022_resource_page.pdf?dl=0
发布于 2020-06-22 05:14:30
这只是一个相当平淡无奇的文本提取工作。做这件事有很多种方法,而且我相信有比这更优雅的方法,但这个方法做的是:
library(pdftools)
library(dplyr)
keywords <- pdf_text("mypdf.pdf") %>%
strsplit("Schema element:") %>%
lapply(function(x) x[-1]) %>%
lapply(function(x) sapply(strsplit(x, "\r\n"), `[`, 1)) %>%
unlist %>%
trimws()
text <- pdf_text("mypdf.pdf") %>%
strsplit("Guidance on completion of schema element:") %>%
lapply(function(x) x[-1]) %>%
lapply(function(x) sapply(strsplit(x, ":"), `[`, 1)) %>%
lapply(function(x) sapply(strsplit(x, "\r\n"),
function(y) paste(y[-length(y)], collapse = ""))) %>%
unlist() %>%
{gsub(" ", " ", .)} %>%
trimws() %>%
strsplit("Guidance on contents") %>%
sapply(`[`, 1)
df <- tibble(keywords, text)
结果如下:
df
#> # A tibble: 15 x 2
#> keywords text
#> <chr> <chr>
#> 1 swExemption44Driver "Required. Select from the enumeration list the driver~
#> 2 swExemption45Impact "Required. Select from the enumeration list the impact~
#> 3 swExemption45Driver "Required. Select from the enumeration list the driver~
#> 4 swDisproportionateCost "Required. Indicate if disproportionate costs have bee~
#> 5 swDisproportionateCostScale "Conditional. Select from the enumeration list the sc~
#> 6 swDisproportionateCostAnalysis "Conditional. Select from the enumeration list the an~
#> 7 swDisproportionateCostAlterna~ "Conditional. Select from the enumeration list the al~
#> 8 swDisproportionateCostOtherEU~ "Conditional. Indicate whether the costs of basic mea~
#> 9 swTechnicalInfeasibility "Required. Report how ‘technical infeasibility’ has be~
#> 10 swNaturalConditions "Required. Select from the enumeration list the eleme~
#> 11 swExemption46 "Required. Select from the enumeration list the reason~
#> 12 swExemption47 "Required. Select from the enumeration list the modif~
#> 13 swExemptionsTransboundary "Required. Indicate whether the application of exempt~
#> 14 swExemptionsReference "Required. Provide references or hyperlinks to the re~
#> 15 driversSWExemptionsReference "Required. Provide references or hyperlinks to the re~
https://stackoverflow.com/questions/62464137
复制相似问题