
Web scraping with rvest: how to capture the full `href` URL instead of the shortened URL

Stack Overflow user

Asked 2019-03-19 22:45:00

Answers: 1 · Views: 291 · Followers: 0 · Votes: 0

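I am trying to scrape data from a web page that contains a table and links. I can successfully download the table together with the link text "score". However, I want to capture the full href URL rather than the shortened URL.

I think the URLs I get back from rvest are shortened. I can't work out how to get the full URL, which I could then loop over as shown below to fetch the required data and convert everything into a data frame.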

library(rvest)
    # Load the page
    odi_score_url <- read_html('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2019;type=year')
    
    
    urls <- odi_score_url %>% 
        html_nodes('td:nth-child(7) .data-link') %>%
        html_attr("href")
    
    links <- odi_score_url  %>% 
        html_nodes('td:nth-child(7) .data-link') %>%
        html_text()
    
    # Combine `links` and `urls` into a data.frame
    score_df <- data.frame(links = links, urls = urls, stringsAsFactors = FALSE)
    head(score_df)
       links                          urls
1 ODI # 4074 /ci/engine/match/1153840.html
2 ODI # 4075 /ci/engine/match/1153841.html
3 ODI # 4076 /ci/engine/match/1153842.html
4 ODI # 4077 /ci/engine/match/1144997.html
5 ODI # 4078 /ci/engine/match/1144998.html
6 ODI # 4079 /ci/engine/match/1144999.html

Loop over each row in score_df and fetch the required data:

    for(i in score_df) {
        text <- read_html(score_df$urls[i]) %>% # load the page
            html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 , 
                   .stadium-details+ .match-detail--item span , .stadium-details , 
                   .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
            html_text() # get the text
        ## Create the dataframe
     
    }

Any help is much appreciated!

Thanks in advance.

web-scraping

rvest

1 Answer

Stack Overflow user

Accepted answer

Posted 2019-03-19 23:03:17

These URLs are not shortened; they are relative to the main page. You can therefore get the full URLs by prepending http://stats.espncricinfo.com/ to each link. For example:

urls <- odi_score_url %>% 
  html_nodes('td:nth-child(7) .data-link') %>%
  html_attr("href") %>% 
  paste0("http://stats.espncricinfo.com/", .)
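
Because the scraped hrefs already begin with `/`, pasting them onto a base that ends in `/` produces a double slash, as visible in the `head(text_df)` output further down. As an alternative, here is a minimal sketch that resolves the relative links with `url_absolute()` from xml2 (the package rvest builds on). The names `relative_urls`, `base_url` and `full_urls` are made up for this illustration, and the two hrefs are copied from the question's output rather than scraped live.

```r
library(xml2)

# Relative hrefs, as returned by html_attr("href") in the question
relative_urls <- c("/ci/engine/match/1153840.html",
                   "/ci/engine/match/1153841.html")

# The page the links were scraped from (query string omitted here)
base_url <- "http://stats.espncricinfo.com/ci/engine/records/team/match_results.html"

# Resolve each relative href against the base URL
full_urls <- url_absolute(relative_urls, base_url)
full_urls
```

This avoids hard-coding where the slash goes, since the links are resolved the same way a browser would resolve them.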

Then, once `score_df` has been rebuilt with these full URLs, you can write the loop as:

text_list <- list()
for(i in seq_along(score_df$urls)) {
  text_list[[i]] <- read_html(score_df$urls[i]) %>% # load the page
    html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 , 
                   .stadium-details+ .match-detail--item span , .stadium-details , 
                   .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
    html_text() # get the text
  # give some nice status
  cat("Scraping link", i, "\n")
}

Or, even better, as an apply loop with `lapply`:

text_list <- lapply(score_df$urls, function(x) {
  # give some nice status before the download starts
  cat("Scraping link", x, "\n")
  text <- read_html(x) %>% # load the page
    html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 , 
                   .stadium-details+ .match-detail--item span , .stadium-details , 
                   .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
    html_text() # get the text
  # the data.frame must be the last expression so that it is the return value
  data.frame(url = x, text = text, stringsAsFactors = FALSE)
})
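
When many pages are scraped in a row, a single page that fails to load stops the whole `lapply` call. One possible way to guard against that is to wrap the per-URL work in `tryCatch`. The sketch below is only an illustration: `safe_scrape` and `fake_fetch` are hypothetical helpers, the example.com URLs are placeholders, and `fake_fetch` stands in for the `read_html()` pipeline so the example runs without network access.

```r
# Hypothetical helper: `fetch` is any function that takes a URL and returns
# a character vector (for example, the read_html() pipeline above).
safe_scrape <- function(url, fetch) {
  tryCatch(
    data.frame(url = url, text = fetch(url), stringsAsFactors = FALSE),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL  # a failed page contributes nothing to the result
    }
  )
}

# Stand-in fetcher for demonstration only; it fails on one made-up URL.
fake_fetch <- function(url) {
  if (grepl("bad", url)) stop("could not load page")
  c("Team A", "100/2")
}

results <- lapply(c("http://example.com/good", "http://example.com/bad"),
                  safe_scrape, fetch = fake_fetch)

# Drop the failed pages before combining the rest
results <- Filter(Negate(is.null), results)
```

After filtering, `results` can be combined into one data frame in the same way as `text_list`.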

We can then turn this into a data.frame using dplyr:

text_df <- dplyr::bind_rows(text_list)
head(text_df)
                                                          url           text
1 http://stats.espncricinfo.com//ci/engine/match/1153840.html    New Zealand
2 http://stats.espncricinfo.com//ci/engine/match/1153840.html          371/7
3 http://stats.espncricinfo.com//ci/engine/match/1153840.html      Sri Lanka
4 http://stats.espncricinfo.com//ci/engine/match/1153840.html 326 (49/50 ov)
5 http://stats.espncricinfo.com//ci/engine/match/1153840.html    New Zealand
6 http://stats.espncricinfo.com//ci/engine/match/1153840.html          371/7

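I'm not sure whether this is already what you want. You may want to collapse the text so that there is only one row per URL, but I think that should be easy to figure out if you need it.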
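
For the collapsing step mentioned above, here is a minimal sketch using base R's `aggregate`. `toy_df` is a hypothetical stand-in for `text_df`: the first four rows are copied from the `head(text_df)` output, while the second URL's rows ("Team A", "Team B") are invented for illustration. The ` | ` separator is an arbitrary choice.

```r
# Toy stand-in for text_df, so the example runs without scraping
toy_df <- data.frame(
  url = c(rep("http://stats.espncricinfo.com//ci/engine/match/1153840.html", 4),
          rep("http://stats.espncricinfo.com//ci/engine/match/1153841.html", 2)),
  text = c("New Zealand", "371/7", "Sri Lanka", "326 (49/50 ov)",
           "Team A", "Team B"),
  stringsAsFactors = FALSE
)

# Paste all text pieces that belong to one URL into a single string,
# giving one row per URL
collapsed_df <- aggregate(text ~ url, data = toy_df,
                          FUN = paste, collapse = " | ")
collapsed_df
```

If the scraped text contains repeated entries, as the real output above does, wrapping the values in `unique()` before pasting would remove them.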

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/55243677
