Q: Merging text inside a for loop from Gallica XML newspaper issues

Stack Overflow user

Asked 2019-11-25 08:18:25

Answers: 2 · Views: 59 · Followers: 0 · Votes: 0

I want to collect the text of the French newspaper Ouest-Éclair (1915). The text is available in French from Gallica, the French digital library.

library(httr)
library(xml2)
library(tidyverse)

# Newspaper issue identifiers are called arks. They are scraped from Gallica (XML) and parsed into a data frame.
r <- GET("https://gallica.bnf.fr/services/Issues?ark=ark:/12148/cb41193663x/date&date=1916")

ouest_eclair <- r %>%
  content() %>%
  xml_find_all(".//issue") %>%
  map_df(~ c(as.list(xml_attrs(.x)), date_parution = xml_text(.x)))

# Keep only the column with the identifiers
arks2 <- ouest_eclair[, 'ark']

# The htm2txt library makes it easy to extract text from an HTML page.
library(htm2txt)

# Here's the loop
for (i in arks2) {
  url <- paste0("https://gallica.bnf.fr/ark:/12148/", i, ".texteBrut")
  print(url)
  txt <- gettxt(url)
  txt <- paste(txt, txt)
  Sys.sleep(1)
}

My question is: how can I merge all the texts into a single txt object inside the loop, without getting the first text twice?

Stack Overflow user

Answered 2019-11-25 09:02:05

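The problem is that you overwrite txt on every iteration of the loop. The trick is to start with an (empty) output variable that is defined outside the loop and updated (rather than overwritten) on each iteration, plus a temporary variable that does get overwritten: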

library(htm2txt)
arks2 <-  c("bpt6k5674481", "bpt6k567454v", "bpt6k567462f")

txt.output <- "" # start with an empty string of text before you start the loop
for (i in arks2) {
  url <- paste0("https://gallica.bnf.fr/ark:/12148/", i, ".texteBrut")
  print(url)
  txt.temp <- gettxt(url) 
  txt.output <- paste(txt.output, txt.temp) # append to the accumulator, not to an undefined txt
  Sys.sleep(1)
}

That said, you don't actually need the temporary variable:

txt.output <- ""
for (i in arks2) {
  url <- paste0("https://gallica.bnf.fr/ark:/12148/", i, ".texteBrut")
  print(url)
  txt.output <- paste(txt.output, gettxt(url))
  Sys.sleep(1)
}
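
As an alternative to growing the string inside the loop, the texts can be collected first and joined once at the end. The following is only a sketch: `fake_gettxt` is a hypothetical stand-in for `gettxt()` so that the example runs offline, and it assumes that each call returns a single character string.

```r
# Hypothetical stand-in for htm2txt::gettxt(), so the sketch runs without network access.
# In real use, replace fake_gettxt with gettxt and keep the Sys.sleep(1) pause.
fake_gettxt <- function(url) paste("text from", url)

arks2 <- c("bpt6k5674481", "bpt6k567454v", "bpt6k567462f")

# paste0 is vectorised, so all URLs can be built in one call
urls <- paste0("https://gallica.bnf.fr/ark:/12148/", arks2, ".texteBrut")

# Fetch each page once; vapply checks that every result is one character string
pages <- vapply(urls, function(u) {
  Sys.sleep(0)  # placeholder for the polite pause between requests
  fake_gettxt(u)
}, character(1), USE.NAMES = FALSE)

# Join all pages into a single string, one page per line
txt.output <- paste(pages, collapse = "\n")
```

Joining with `collapse` also avoids the leading space that `paste("", ...)` adds to the first text. If `arks2` is taken from the data frame in the question, extracting the column as a plain character vector (for example with `ouest_eclair$ark`) is probably needed so that each element is one identifier.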

Votes: 1


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/59027566
