I want to collect the text of the French newspaper "Ouest-Éclair" (1915). The French text is available from Gallica, the French digital library.
 library(httr)
 library(xml2)
 library(tidyverse)
 # Newspaper issue identifiers are called arks. They are scraped from Gallica (XML) and parsed into a data frame
 r <- GET("https://gallica.bnf.fr/services/Issues?ark=ark:/12148/cb41193663x/date&date=1916")
 ouest_eclair <- r %>%
   content() %>%
   xml_find_all(".//issue") %>%
   map_df(~ c(as.list(xml_attrs(.x)), date_parution = xml_text(.x)))
 # keep only the column with the identifiers
 arks2 <- ouest_eclair[,'ark']
 # The htm2txt library is used to easily extract the text from an HTML page.
 library(htm2txt)
# Here's the loop
 for (i in arks2) {
   url <- paste0("https://gallica.bnf.fr/ark:/12148/", i, ".texteBrut")
   print(url)
   txt <- gettxt(url)
   txt <- paste(txt, txt)
   Sys.sleep(1)
 }
My question is: how can I merge all the texts into one txt object inside the loop, and avoid getting the first text twice?
Posted on 2019-11-25 09:02:05
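The problem is that txt is overwritten on every iteration of the loop. The trick is to start with an (empty) output variable that is defined outside the loop and updated (rather than overwritten) on each iteration, plus a temporary variable that does get overwritten: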
library(htm2txt)
arks2 <-  c("bpt6k5674481", "bpt6k567454v", "bpt6k567462f")
txt.output <- "" # start with an empty string of text before you start the loop
for (i in arks2) {
  url <- paste0("https://gallica.bnf.fr/ark:/12148/", i, ".texteBrut")
  print(url)
  txt.temp <- gettxt(url) 
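  txt.output <- paste(txt.output, txt.temp) # append to the accumulator, not to an undefined txt
  Sys.sleep(1)
}
However, you don't actually need the temporary variable: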
txt.output <- ""
for (i in arks2) {
  url <- paste0("https://gallica.bnf.fr/ark:/12148/", i, ".texteBrut")
  print(url)
  txt.output <- paste(txt.output, gettxt(url)) # append each page to the accumulator
  Sys.sleep(1)
}
https://stackoverflow.com/questions/59027566
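The same accumulation can also be written without growing a string inside the loop: fetch each page into a character vector, then join once at the end. This is only a minimal sketch. `collect_text` and its `fetch` and `pause` arguments are hypothetical names introduced here for illustration, not part of htm2txt. The `fetch` argument defaults to `gettxt` and is there so the function can be tried offline with a stand-in.

```r
# Hypothetical helper (not part of htm2txt): fetch every issue, then join once.
collect_text <- function(arks, fetch = htm2txt::gettxt, pause = 1) {
  urls <- paste0("https://gallica.bnf.fr/ark:/12148/", arks, ".texteBrut")
  pieces <- vapply(urls, function(u) {
    Sys.sleep(pause)                   # stay polite to the server between requests
    paste(fetch(u), collapse = "\n")   # guarantee one string per issue
  }, character(1), USE.NAMES = FALSE)
  paste(pieces, collapse = " ")        # single join at the end
}
```

With the identifiers from the answer, `txt.output <- collect_text(arks2)` should give one string. Unlike the loop that starts from `""`, the result has no leading space, because `paste("", x)` puts a separator before the first text. Keeping `pieces` as a vector instead of collapsing it would also preserve one element per issue, which may be handier for later analysis.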