Scraping Baijiahao (2)

This post covers scraping the comment counts and read counts of Baijiahao articles.

The comment counts and read counts are returned in a separate JSON response:

https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp&params=%5B%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229683117499664348209%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221113000014175815%22%2C%22feed_id%22%3A%229683117499664348209%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228997120757336896754%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171319%22%2C%22feed_id%22%3A%228997120757336896754%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229442416292259854102%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171220%22%2C%22feed_id%22%3A%229442416292259854102%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228994022518148142722%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221084000014170786%22%2C%22feed_id%22%3A%228994022518148142722%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229180210467318996709%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221110000014181138%22%2C%22feed_id%22%3A%229180210467318996709%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229470100560664750777%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221119000014172446%22%2C%22feed_id%22%3A%229470100560664750777%22%7D%5D&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1
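The response body is JSONP, wrapped in the callback named at the end of the URL (`callback=jsonp1`), so the wrapper has to be stripped before the payload can be parsed as JSON. A minimal sketch, where `strip_jsonp` is a hypothetical helper (not from the original code):

```python
import json
import re

def strip_jsonp(text):
    """Strip a jsonp callback wrapper, e.g. jsonp1({...}), and parse the JSON inside."""
    # match <callback_name>( ... ) with an optional trailing semicolon
    match = re.search(r'^\s*\w+\((.*)\)\s*;?\s*$', text, re.S)
    return json.loads(match.group(1))
```

The stripped payload can then be handled like any other JSON dict.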

Breaking this URL down: the `params` value is assembled from six fields taken from the JSON response obtained in the previous step (see part 1):

"user_type"

"dynamic_id"

"dynamic_type"

"dynamic_sub_type"

"thread_id"

"feed_id"

These are concatenated into the percent-encoded `params` string.
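Percent-decoding the `params` value shows it is simply a JSON array with one object per article. For example, the first entry of the URL above decodes like this:

```python
import json
from urllib.parse import unquote

# first entry of the params value from the interact URL above
params = ('%5B%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229683117499664348209%22'
          '%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22'
          '%2C%22thread_id%22%3A%221113000014175815%22%2C%22feed_id%22%3A%229683117499664348209%22%7D%5D')

# %5B -> [, %7B -> {, %22 -> ", %3A -> :, %2C -> ,, %5D -> ]
decoded = json.loads(unquote(params))
print(decoded[0]["thread_id"])  # 1113000014175815
```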

The code:

for i in range(len(title)):
    # pull the six interact parameters for each article out of the
    # homepage JSON captured earlier (asyncData, from part 1)
    user_type = re.findall(r'"user_type":"(.+?)",', asyncData[i])[0]
    dynamic_id = re.findall(r'"dynamic_id":"(.+?)",', asyncData[i])[0]
    dynamic_type = re.findall(r'"dynamic_type":"(.+?)",', asyncData[i])[0]
    dynamic_sub_type = re.findall(r'"dynamic_sub_type":"(.+?)",', asyncData[i])[0]
    thread_id = re.findall(r'"thread_id":"(.+?)",', asyncData[i])[0]
    feed_id = re.findall(r'"feed_id":"(.+?)"', asyncData[i])[0]
    print(title[i], url[i], date[i], cerate[i], publish[i], updated[i])
    if i < len(title) - 1:
        # every entry except the last ends with "},{" (%22%7D%2C%7B%22)
        readjson += 'user_type%22%3A%22' + user_type + '%22%2C%22' \
            + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22' \
            + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22' \
            + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22' \
            + 'thread_id%22%3A%22' + thread_id + '%22%2C%22' \
            + 'feed_id%22%3A%22' + feed_id + '%22%7D%2C%7B%22'
    else:
        # the last entry closes the JSON array with "}]" (%22%7D%5D)
        readjson += 'user_type%22%3A%22' + user_type + '%22%2C%22' \
            + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22' \
            + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22' \
            + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22' \
            + 'thread_id%22%3A%22' + thread_id + '%22%2C%22' \
            + 'feed_id%22%3A%22' + feed_id + '%22%7D%5D'
# append uk and the millisecond timestamp (b) as the trailing query parameters
readjson += '&uk=D0hHfmuMEVka02HZelKA7g&_=' + str(b)

Note: the last feed_id is followed by %22%7D%5D (i.e. "}]"), not %22%7D%2C%7B%22 (i.e. "},{") as for the earlier entries.
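Instead of concatenating pre-encoded fragments by hand and special-casing the last entry, the same `params` string can be produced with `json.dumps` plus `urllib.parse.quote`. A sketch, where `build_params` is a hypothetical helper and the item values come from the first entry of the URL above:

```python
import json
from urllib.parse import quote

def build_params(items):
    """Percent-encode a list of per-article dicts into the params value.

    Each dict holds the six keys pulled from the homepage JSON
    (user_type, dynamic_id, dynamic_type, dynamic_sub_type, thread_id, feed_id).
    """
    # compact separators match the no-whitespace JSON in the original URL;
    # safe='' forces every reserved character to be percent-encoded
    return quote(json.dumps(items, separators=(',', ':')), safe='')

# first article from the interact URL shown earlier
items = [{
    "user_type": "3",
    "dynamic_id": "9683117499664348209",
    "dynamic_type": "2",
    "dynamic_sub_type": "2001",
    "thread_id": "1113000014175815",
    "feed_id": "9683117499664348209",
}]
encoded = build_params(items)
# encoded starts with %5B%7B%22user_type%22... and ends with %22%7D%5D;
# the closing "}]" is produced automatically, no special case needed
```

With this approach the commas between entries and the closing bracket come from the JSON serializer, so the loop no longer needs an if/else on the last index.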


