Q: R rvest can't get html_node

Stack Overflow user

Asked 2020-08-30 06:34:05

2 answers · 315 views · 0 followers · 0 votes

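I have some experience using the rvest package to scrape the data I need from the web, but I'm running into a problem with this page:

https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html

If you scroll down a bit, you'll see a section that lists all the schools.

I want the school, cases, and location data. I should note that someone asked on the NYT GitHub for this to be published as a csv, and they replied that the data is all in the page and can just be pulled from there. So I assume it is fine to scrape it from this page.

But I can't get it to work. Say I just want to start with a simple selector for the first school. I used the inspector to find the xpath.

I get no results: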

library(rvest)

URL <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
pg <- read_html(URL)

# xpath copied from inspector
xpath_first_school <- '//*[@id="school100663"]'

node_first_school <- html_node(pg, xpath = xpath_first_school)

> node_first_school
{xml_missing}
<NA>

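I get {xml_missing}.

Obviously I have more work to do to generalize this and collect the data for all schools, but with web scraping I usually try to start simple and specific and then broaden out. Yet even my simple test isn't working. Any ideas?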
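One way to narrow down a result like this is to check whether the element exists in the HTML that read_html() actually receives, or only after the page's JavaScript has run. The sketch below is offline and uses a hypothetical fragment (an assumption, not the real NYT markup) in which the id only occurs inside a script block:

```r
library(rvest)

# Hypothetical stand-in for a JavaScript-driven page: the element with
# id "school100663" would only be created by the script at run time.
html <- '<html><body>
<div id="list"></div>
<script>var NYTG_schools = [{"ipeds_id":"100663"}];</script>
</body></html>'

pg <- read_html(html)

# The node is absent from the static markup, so html_node() finds nothing
node <- html_node(pg, xpath = '//*[@id="school100663"]')
node_missing <- inherits(node, "xml_missing")

# The id still appears in the raw source, inside the <script> text
id_in_source <- grepl("100663", as.character(pg), fixed = TRUE)
```

If node_missing and id_in_source are both TRUE for a real page, the data is probably injected by JavaScript, and an xpath copied from the browser inspector will not match the static source.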

rvest

web-scraping

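2 Answers

Stack Overflow user

Answered 2020-08-31 04:43:20

Setting up RSelenium may take some time. First you have to download chromedriver (https://chromedriver.chromium.org/), choosing the version closest to your current Chrome. Then unzip it into your R working directory.

I tried a package called decapitated, which can scrape JavaScript-rendered websites, but because this site has a "show more" control that has to be physically clicked before all the data is displayed, I had to use RSelenium to "click" it before getting the page source, and then parse the result with rvest.

Code: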

library(rvest)
library(tidyverse)
library(RSelenium)

url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"

driver <- rsDriver(browser = c("chrome"), chromever = "85.0.4183.87", port = 560L)
remote_driver <- driver[["client"]] 
remote_driver$navigate(url)

# Click the "show all" control so every school is rendered before reading the source
showmore <- remote_driver$findElement(using = "xpath", value = "//*[@id=\"showall\"]/p")
showmore$clickElement()

# getPageSource() returns a list; the HTML string is its first element
test <- remote_driver$getPageSource()

# Parse the rendered page once and reuse it for all three columns
page <- read_html(test[[1]])

school <- page %>%
  html_nodes(xpath = "//*[contains(@id, \"school\")]/div[2]/h2") %>%
  html_text() %>%
  as_tibble()

case <- page %>%
  html_nodes(xpath = "//*[contains(@id, \"school\")]/div[3]/p") %>%
  html_text() %>%
  as_tibble()

location <- page %>%
  html_nodes(xpath = "//*[contains(@id, \"school\")]/div[4]/p") %>%
  html_text() %>%
  as_tibble()

# The first match in case and location is dropped (presumably a header row)
# so that the three columns line up
combined_table <- bind_cols(school, case = case[2:nrow(case), ], location = location[2:nrow(location), ])
names(combined_table) <- c("school", "case", "location")

combined_table %>% view()

Output:

# A tibble: 913 x 3
   school                                      case  location              
   <chr>                                       <chr> <chr>                 
 1 University of Alabama at Birmingham*        972   Birmingham, Ala.      
 2 University of North Carolina at Chapel Hill 835   Chapel Hill, N.C.     
 3 University of Central Florida               727   Orlando, Fla.         
 4 University of Alabama                       568   Tuscaloosa, Ala.      
 5 Auburn University                           557   Auburn, Ala.          
 6 North Carolina State University             509   Raleigh, N.C.         
 7 University of Georgia                       504   Athens, Ga.           
 8 Texas A&M University                        500   College Station, Texas
 9 University of Texas at Austin               483   Austin, Texas         
10 University of Notre Dame                    473   Notre Dame, Ind.      
# ... with 903 more rows
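The parsing step can also be tried offline. The sketch below runs on a small hand-written fragment whose structure is only assumed to resemble the rendered page (the ids, names, and values are hypothetical). It selects each school container first and then reads the three fields relative to it, so the columns stay aligned without dropping rows by position:

```r
library(rvest)

# Hypothetical fragment imitating the rendered list (not copied from the NYT page)
html <- '<div id="list">
  <div id="school1"><div>1</div><div><h2>Alpha University</h2></div><div><p>10</p></div><div><p>Town A, Ala.</p></div></div>
  <div id="school2"><div>2</div><div><h2>Beta College</h2></div><div><p>7</p></div><div><p>Town B, Ga.</p></div></div>
</div>'

pg <- read_html(html)

# One node per school container
rows <- html_nodes(pg, xpath = '//*[starts-with(@id, "school")]')

# html_node() on a node set returns exactly one result per container,
# so a missing field becomes NA instead of shifting the column
schools <- data.frame(
  school = html_text(html_node(rows, xpath = "./div[2]/h2")),
  case = html_text(html_node(rows, xpath = "./div[3]/p")),
  location = html_text(html_node(rows, xpath = "./div[4]/p")),
  stringsAsFactors = FALSE
)
```

Whether the real page needs starts-with() or contains() on the id depends on its markup, which this sketch does not verify.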

Hope this works for you!

Votes: 1

Stack Overflow user

Answered 2020-09-01 03:19:43

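So I'm going to offer an answer here that violates a very important rule described here and is generally an ugly solution. But it is a solution that lets us avoid Selenium.

To use html_nodes on this page, we would need to trigger the JS action, which requires Selenium. @KWN's solution seems to work on their machine, but I couldn't get chromedriver to work on mine. I could get as far as running Firefox or Chrome through Docker, but couldn't get the result. So I would check that solution first. If it fails, give this one a try. Basically, the site exposes the data I need as JSON. So I extract the site's text, isolate the JSON with a regular expression, and then parse it with jsonlite.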

library(jsonlite)
library(rvest)
library(tidyverse)

url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"

html_res <- read_html(url)

# get text
text_res <- html_res %>% 
  html_text(trim = TRUE)

# find the area of interest
data1 <- str_extract_all(text_res, "(?<=var NYTG_schools = ).*(?=;)")[[1]]

# get json into data frame
json_res <- fromJSON(data1)

# did it work?
glimpse(json_res)

Rows: 1,515
Columns: 16
$ ipeds_id    <chr> "100663", "199120", "132903", "100751"...
$ nytname     <chr> "University of Alabama at Birmingham",...
$ shortname   <chr> "U.A.B.", "North Carolina", "Central F...
$ city        <chr> "Birmingham", "Chapel Hill", "Orlando"...
$ state       <chr> "Ala.", "N.C.", "Fla.", "Ala.", "Ala."...
$ county      <chr> "Jefferson", "Orange", "Orange", "Tusc...
$ fips        <chr> "01073", "37135", "12095", "01125", "0...
$ lat         <dbl> 33.50199, 35.90491, 28.60258, 33.21402...
$ long        <dbl> -86.80644, -79.04691, -81.20223, -87.5...
$ logo        <chr> "https://static01.nyt.com/newsgraphics...
$ infected    <int> 972, 835, 727, 568, 557, 509, 504, 500...
$ death       <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
$ dateline    <chr> "n", "n", "n", "n", "n", "n", "n", "n"...
$ ranking     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
$ medicalnote <chr> "y", NA, NA, NA, NA, NA, NA, NA, NA, N...
$ coord       <list> [<847052.5, -406444.3>, <1508445.93, ...
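The extraction step can be checked without hitting the site. The sketch below applies the same regular expression to a hypothetical piece of page text (the variable contents are made up; only the NYTG_schools name and the field names come from the output above) and then reshapes the result into school, case, and location columns:

```r
library(stringr)
library(jsonlite)

# Hypothetical page text; the real page embeds a much larger array
page_text <- paste(
  'var other = 1;',
  'var NYTG_schools = [{"nytname":"Alpha University","city":"Town A","state":"Ala.","infected":10},{"nytname":"Beta College","city":"Town B","state":"Ga.","infected":7}];',
  'var more = 2;',
  sep = "\n"
)

# Take everything after the assignment up to the last ";" on that line
json_txt <- str_extract(page_text, "(?<=var NYTG_schools = ).*(?=;)")
json_res <- fromJSON(json_txt)

# Rebuild the three columns asked for in the question
schools <- data.frame(
  school = json_res$nytname,
  case = json_res$infected,
  location = paste(json_res$city, json_res$state, sep = ", "),
  stringsAsFactors = FALSE
)
```

The pattern relies on the whole assignment sitting on a single line, since "." does not match newlines by default in stringr. If the page layout changes, the expression would need adjusting.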

Votes: 1

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/63652388
