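I want to practice web scraping, and I'm using R with the rvest package for it. I now have a character vector (p_text) with 125 elements and want to convert it into a data frame with 25 rows and 5 columns named q1, opt1, opt2, opt3, opt4.
So elements 1, 6, 11, ... should go into column q1; elements 2, 7, 12, ... into opt1; elements 3, 8, 13, ... into opt2; and so on.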
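library(dplyr)
library(rvest)
url <- 'http://upscfever.com/upsc-fever/en/test/en-test-sci1.html'
# Download and parse the page
webpage <- read_html(url)
# Collect the text of every <label> element (questions and their options)
p_text <- webpage %>%
  html_nodes("label") %>%
  html_text()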
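How can I do this?
Answered on 2017-10-22 14:11:53
Convert the vector to a matrix to get the elements arranged correctly, then convert that matrix to a data frame: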
dat <- as.data.frame(matrix(p_text, ncol = 5, byrow = TRUE), stringsAsFactors = FALSE)
names(dat) <- c("q1", "opt1", "opt2", "opt3", "opt4")
str(dat)
## 'data.frame': 25 obs. of 5 variables:
## $ q1 : chr "Q1: Energy giving foods are " "Q2:Animal fats are categorized as" "Q3: Which is true" "Q4: Trans fats are" ...
## $ opt1: chr "Carbohydrates and fats" "saturated fatty acids" "saturated fatty acids are good for health" "unsaturated fats" ...
## $ opt2: chr "Carbohydrates and Proteins" "unsaturated fatty acids" "unsaturated fatty acids are harmful for health" "saturated fats" ...
## $ opt3: chr "Proteins and fats" "polyunsaturated fatty acids" "unsaturated fatty acids are good for health" "good for health" ...
## $ opt4: chr "carbohydrates, fats and proteins" "trans fats" "Animal fats are good for health" "animal fats" ...
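To see why byrow = TRUE produces the layout asked for, here is a minimal sketch on a made-up 10-element vector. The names toy and toy_df and the sample strings are hypothetical, not taken from the scraped page. The sketch also checks that the length is a multiple of 5 before reshaping, since matrix() would otherwise recycle values and only emit a warning.

```r
# Hypothetical stand-in for p_text: 2 questions, each followed by 4 options
toy <- c("Q1: first?",  "a1", "b1", "c1", "d1",
         "Q2: second?", "a2", "b2", "c2", "d2")

# Guard: every question must contribute exactly 5 elements
stopifnot(length(toy) %% 5 == 0)

# byrow = TRUE fills row by row, so elements 1-5 form row 1 and 6-10 form row 2
toy_df <- as.data.frame(matrix(toy, ncol = 5, byrow = TRUE),
                        stringsAsFactors = FALSE)
names(toy_df) <- c("q1", "opt1", "opt2", "opt3", "opt4")

toy_df$q1    # elements 1 and 6 of toy
toy_df$opt1  # elements 2 and 7 of toy
```

Without byrow = TRUE, matrix() fills column by column, which would put the first question and its first option in the same column instead.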
If you want to clean up the q1 column, you may want to do this:
dat$q1 <- sub("^Q\\d{1,2}:[ ]?", "", dat$q1)
This removes the leading question number, the colon, and the optional space that follows it.
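As a rough illustration of what that pattern matches, the sketch below applies it to a few sample strings. The vector q and the "Q12: Example" entry are hypothetical; the first two strings mimic the q1 values shown in the str() output. The trimws() call is an extra step not in the original answer, added on the assumption that trailing spaces are also unwanted.

```r
# Hypothetical sample strings mimicking the q1 values shown in str(dat)
q <- c("Q1: Energy giving foods are ",
       "Q2:Animal fats are categorized as",
       "Q12: Example")

# ^Q        a literal "Q" at the start of the string
# \\d{1,2}  one or two digits (enough for 25 questions)
# :[ ]?     a colon followed by an optional single space
cleaned <- sub("^Q\\d{1,2}:[ ]?", "", q)

# Optional extra step (an assumption, not part of the answer):
# strip any remaining leading or trailing whitespace
cleaned <- trimws(cleaned)

cleaned
```

Because the pattern is anchored with ^, a "Q" followed by digits later in the question text is left alone.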
https://stackoverflow.com/questions/46874660