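How do I scrape HTML tables using the XML package?
Take this Wikipedia page on the Brazilian soccer team as an example. I would like to read it into R and get the "list of all matches Brazil have played against FIFA-recognised teams" table as a data.frame. How can I do this?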
Posted on 2009-12-05 04:14:56
…or a shorter try:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
The table picked is the longest one on the page:
tables[[which.max(n.rows)]]
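To try the "pick the longest table" idea without hitting the network, here is a minimal sketch. The inline HTML string is a made-up stand-in for the Wikipedia page, and `Filter(Negate(is.null), ...)` is a base-R substitute I am assuming is equivalent to the `list.clean` call above for this purpose.

```r
library(XML)

# Hypothetical page with two tables (illustration only, not the real Wikipedia HTML)
html <- "<html><body>
<table><tr><th>Key</th><th>Value</th></tr>
<tr><td>a</td><td>1</td></tr></table>
<table><tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
<tr><td>Paraguay</td><td>72</td></tr>
<tr><td>Uruguay</td><td>72</td></tr></table>
</body></html>"

# Parse the string as HTML and read every <table> into a list of data frames
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Drop entries that could not be parsed (base-R stand-in for rlist::list.clean)
tables <- Filter(Negate(is.null), tables)

# Count rows per table and keep the one with the most rows
n.rows <- vapply(tables, nrow, integer(1))
longest <- tables[[which.max(n.rows)]]
print(longest)
```

Keep in mind that "longest table" is only a heuristic; if the page gains a longer table later, a different one will be selected.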
Posted on 2009-09-09 18:43:11
Another option is to use XPath.
library(RCurl)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)
# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))
# Clean up the results
content[,1] <- gsub("Â ", "", content[,1])
tablehead <- gsub("Â ", "", tablehead)
names(content) <- tablehead
which produces this result:
> head(content)
   Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina     94  36    24   34       148           150 38.3%
2  Paraguay     72  44    17   11       160            61 61.1%
3   Uruguay     72  33    19   20       127            93 45.8%
4     Chile     64  45    12    7       147            53 70.3%
5      Peru     39  27     9    3        83            27 69.2%
6    Mexico     36  21     6    9        69            34 58.3%
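The XPath approach above hardcodes `ncol = 8`. As a rough sketch of the same idea on a small made-up table, the column count can instead be taken from the number of header cells. The inline HTML and the three-column layout are assumptions for illustration; this only works when the header cells all sit in one row and every data row has the same number of `td` cells.

```r
library(XML)

# Hypothetical table markup, standing in for the Wikipedia page
html <- "<html><body><table class='wikitable sortable'>
<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>
<tr><td>Argentina</td><td>94</td><td>36</td></tr>
<tr><td>Paraguay</td><td>72</td><td>44</td></tr>
</table></body></html>"

pagetree <- htmlParse(html, asText = TRUE)

# Same XPath shape as above: header cells and data cells of the target table
xp <- "//table[@class='wikitable sortable']/tr/"
tablehead <- xpathSApply(pagetree, paste0(xp, "th"), xmlValue)
results <- xpathSApply(pagetree, paste0(xp, "td"), xmlValue)

# Reshape the flat cell vector row by row, one column per header cell
content <- as.data.frame(
  matrix(results, ncol = length(tablehead), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(content) <- tablehead

# Cells come back as text, so numeric columns need explicit conversion
content$Played <- as.integer(content$Played)
content$Won <- as.integer(content$Won)
print(content)
```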
Posted on 2016-05-13 08:55:02
rvest, used together with xml2, is another popular package for parsing HTML web pages.
library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file <- read_html(theurl)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)
The syntax is easier to use than that of the XML package, and for most web pages rvest provides all the options you need.
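One detail worth knowing: indexing the node set with single brackets, as in `tables[4]`, makes `html_table` return a list containing one table, whereas double brackets give the table itself. A minimal offline sketch, assuming a made-up inline HTML string in place of the Wikipedia page:

```r
library(rvest)

# Hypothetical page with a single table (illustration only)
html <- "<html><body>
<table><tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
<tr><td>Paraguay</td><td>72</td></tr></table>
</body></html>"

# read_html also accepts a literal HTML string, not just a URL
page <- read_html(html)
tables <- html_nodes(page, "table")

# Single brackets keep a node set, so the result is a list of tables
as_list <- html_table(tables[1])

# Double brackets select one node, so the result is the table itself
table1 <- html_table(tables[[1]])
print(table1)
```

On the real page the index of the wanted table (4 in the answer above) depends on the current page layout, so it may need adjusting.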
https://stackoverflow.com/questions/1395528