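I previously wrote a short post on scraping course information from Zhihu Live. That version only walked the courses displayed on the Zhihu Live home page, which is a small fraction of the total.
This post upgrades that small project: starting from the Live home page, it traverses the second-level pages module by module, which yields much richer course data. This run collected roughly 800 courses.
I will not repeat the packet-capture analysis of the course pages here; see the earlier post if you are interested. This post only organizes the approach to traversing the second-level pages.
Because there are relatively many courses, the requests log in directly with a cookie, so you need to obtain your cookie value first.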
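One way to keep the cookie out of the script itself is to read it from an environment variable. The sketch below is only an illustration: the variable name `ZHIHU_COOKIE` and the helper `get_cookie` are hypothetical names chosen here, not part of the original code.

```r
# Hypothetical helper: read the cookie string from an environment variable.
# The variable name "ZHIHU_COOKIE" is an assumption made for this sketch.
get_cookie <- function(var = "ZHIHU_COOKIE") {
  value <- Sys.getenv(var, unset = "")
  if (!nzchar(value)) {
    # Fail early instead of sending requests without a login cookie
    stop(sprintf("Environment variable %s is not set", var))
  }
  value
}
```

Once the variable is set (for example in `.Renviron`), `Cookie <- get_cookie()` gives the string to place in the request header.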
library("httr")
library("jsonlite")
library("magrittr")
library("plyr")
library("rlist")
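First-level traversal: fetch the topic information for each course module, together with the id values it contains.
Following the usual packet-capture workflow, the function for scraping the first-level course modules looks like this: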
mylive <- function(){
  baseurl <- "https://api.zhihu.com/lives/special_lists"
  header <- c(
    'Content-Type' = 'application/json; charset=utf-8',
    'User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36',
    'Referer' = 'https://www.zhihu.com/lives/specials',
    'Cookie' = "paste the cookie copied from your own browser's Zhihu session here"
  )
  payload <- list(
    'limit' = 10,
    'offset' = 0,
    'subtype' = 'special_list'
  )
  i <- 0
  myresult <- data.frame()
  while (TRUE){
    ### shift offset by 10 per request: page i starts at offset 10*i
    payload[['offset']] <- 10 * i
    ### parse each response once; res stays NULL if the request or parsing fails
    res <- tryCatch({
      r <- GET(baseurl, add_headers(.headers = header), query = payload)
      parsed <- r %>% content(as = "text") %>% fromJSON(flatten = TRUE)
      myresult <- parsed %>% `[[`(3) %>% rbind(myresult, .)
      cat(sprintf("Processing page [%d]!", i), sep = "\n")
      parsed
    }, error = function(e){
      cat(sprintf("Failed to fetch page [%d]!", i), sep = "\n")
      NULL
    })
    ### stop on failure: without a response the end-of-data flag cannot be read
    if (is.null(res)) break
    ### the paging status in the captured response tells us when to leave the loop
    if (res %>% `[[`(2) %>% `[[`(1) == TRUE) break
    Sys.sleep(runif(1, 0.5, 1.5))
    i <- i + 1
  }
  cat("All pages done!", sep = "\n")
  return(myresult)
}
system.time( myresult <- mylive() )
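The paging logic in this function can be checked without touching the network. The following is a minimal sketch under stated assumptions: `paginate` and `fetch_page` are hypothetical names, `fetch_page` stands in for the real request and returns a list with `is_end` and `data` (field names chosen for this sketch; the function above reads the response by position instead), and `fake_fetch` simulates a source with 25 records.

```r
# Generic offset-based pagination: page i is requested at offset limit * i.
paginate <- function(fetch_page, limit = 10, max_pages = 1000) {
  pages <- list()
  for (i in seq_len(max_pages) - 1) {
    res <- fetch_page(offset = limit * i, limit = limit)
    pages[[length(pages) + 1]] <- res$data
    # Stop as soon as the source reports that this was the last page
    if (isTRUE(res$is_end)) break
  }
  do.call(rbind, pages)
}

# Fake data source with 25 records, used only to exercise the loop
fake_fetch <- function(offset, limit) {
  total <- 25
  idx <- seq_len(total)
  idx <- idx[idx > offset & idx <= offset + limit]
  list(is_end = offset + limit >= total,
       data = data.frame(id = idx))
}

result <- paginate(fake_fetch)
```

Here the offset for each page is derived from the page index rather than accumulated across iterations, so the requests land on offsets 0, 10 and 20 and no records are skipped.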
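The step above collected the first-level course module information, which includes every module's id value. With those ids in hand, we use each one to traverse the sub-courses under the corresponding course module.
The process is essentially the same as the first-level traversal above.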
### the same cookie string used in mylive(); copy it from your own browser
Cookie <- "paste the cookie copied from your own browser's Zhihu session here"
outdata <- function(id){
  baseurl <- sprintf("https://api.zhihu.com/lives/special_lists/%s/lives", id)
  header <- c(
    'Content-Type' = 'application/json; charset=utf-8',
    'User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36',
    'Referer' = sprintf('https://www.zhihu.com/lives/specials/%s', id),
    'Cookie' = Cookie
  )
  payload <- list(
    'limit' = 10,
    'offset' = 0,
    'subtype' = 'special_list'
  )
  myresult <- data.frame()
  i <- 0
  while (TRUE){
    ### page i starts at offset 10*i
    payload[['offset']] <- 10 * i
    ### errors are not caught here; they propagate to the caller (fulloutdata)
    r <- GET(baseurl, add_headers(.headers = header), query = payload)
    res <- r %>% content(as = "text") %>% fromJSON(flatten = TRUE)
    myresult <- res %>% `[[`(3) %>% rbind(myresult, .)
    ### leave the loop once the paging status marks the last page
    if (res %>% `[[`(2) %>% `[[`(1) == TRUE) break
    Sys.sleep(runif(1, 0.5, 1.5))
    i <- i + 1
  }
  return(myresult)
}
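### module ids from the first-level result (assumes the id column is named "id")
ids <- myresult$id
fulloutdata <- function(){
  mydatafull <- data.frame()
  ### number of failed tasks
  i <- 0
  for (id in ids){
    tryCatch({
      mydatafull <- outdata(id) %>% rbind(mydatafull, .)
      cat(sprintf("Processed task [%s]", id), sep = "\n")
      Sys.sleep(runif(1, 0.5, 1.5))
    }, error = function(e){
      cat(sprintf("Task [%s] failed!", id), sep = "\n")
      ### <<- updates the counter in fulloutdata(); a plain assignment would only change a local copy
      i <<- i + 1
    })
  }
  cat("have done!", sep = "\n")
  cat(sprintf("A total of [%d] tasks failed!", i), sep = "\n")
  return(mydatafull)
}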
Run the second-level traversal function:
system.time(mydatalast <- fulloutdata())
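The failure handling in the second-level traversal can also be tried in isolation. In this sketch, `run_all` and `fake_task` are hypothetical names; `fake_task` fails for one id on purpose, standing in for a request that errors out.

```r
# Run a task for every id, keep the successful results, and record failures.
run_all <- function(ids, task) {
  results <- list()
  failed <- character(0)
  for (id in ids) {
    tryCatch({
      results[[length(results) + 1]] <- task(id)
    }, error = function(e) {
      # <<- reaches the variable in run_all(); a plain <- would only
      # create a local copy inside the handler function
      failed <<- c(failed, as.character(id))
    })
  }
  list(data = do.call(rbind, results), failed = failed)
}

# Simulated task: fails for id "b", returns a one-row data frame otherwise
fake_task <- function(id) {
  if (id == "b") stop("simulated request failure")
  data.frame(module = id, n = nchar(id))
}

out <- run_all(c("a", "b", "c"), fake_task)
```

Recording the failed ids themselves, instead of only counting them, makes it possible to retry just those modules afterwards.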
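### store the result in MongoDB
library("rmongodb")
mongo <- mongo.create(host = "localhost")
bson <- mongo.bson.from.list(mydatalast)
### insert the converted BSON object rather than the raw data frame;
### the namespace has the form "database.collection", so a collection name is still needed after the dot
mongo.insert(mongo,"rmongo_test.",bson)
### also save a local JSON copy
list.save(mydatalast,"D:/R/File/liveinfo.json")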