写爬小说的爬虫的一些心得

前端GoGoGo

发布于 2018-08-24 16:33:49

6670

发布于 2018-08-24 16:33:49

文章被收录于专栏：九彩拼盘的叨叨叨

最近用 Node.js 写了爬某小说的爬虫，发现坑还是满多的。

网页中文乱码

小说网站的页面内容编码用的 GBK，如果不做处理，中文内容会是乱码。解决方案是用 iconv-lite 来对内容用 GBK 的方式来解码。大概的写法：

var iconv = require('iconv-lite')
request({
  url: BOOK_URL,
  encoding: null // 传 null，可以让 body 的类型是 Buffer。 用 iconv 进行 decode 传入的参数必须是 Buffer类型的。
}, (error, response, body) => {
  body = iconv.decode(body, 'GBK')
})

提取小说正文

发现小说的章节的 HTML 内容乱的超乎我的想象：

内容都在 <head> 标签内。浏览器竟然也能正常的输出。。。
正文外面也没有一个标签来包裹。正文的兄弟节点也没什么标志性的元素。

鉴于第 2 点，我用删除非正文内容来提取正文。下面是我的实现，以供参考

function toTxT(html) {
  html = html
    .replace(/<br ?\/?>/g, '\n')
    .replace(/ ?/g, ' ')
    .replace(/<!--.*-->/g, '') // 注释

  var $ = cheerio.load(html, { decodeEntities: false })
  var $body = $('html')
  $body.find('meta,div,style,link,script,table,h1,center,title').remove()
  var res = $body.html()
  res = res.replace(/<\/?head>/g, '')
    .replace(/^(\n|\r)+/mg, '')// 删除开头的空行
    .replace(/(\n|\r)+/mg, '\n\n')// 多个换行替换成1个
    .replace(/ {4,}/mg, '  ')// 4个以上的空格统一换成2个
  return res
}

同时发大量请求导致的服务器拒绝

那小说有一千多章。开始的做法是，对那小说网站同时发一千多个请求。每个请求请求 1 个章节的内容。尝试多次，发现每次都是只有不到 200 个请求是成功的，剩余的全部超时。

我发现可以通过减少同时发请求的数量来解决这个问题。我想了如下两个策略: 策略1 : 发 1 个请求，若请求成功，则发下 1 个请求。直到发完所有的请求。若中间某个请求返回失败，则重试若干次，若还是失败，则跳过。具体实现如下:

const RETRY_MAX = 3
var SingleQueue = function(queueArr, opts) {
  this.opts = Object.assign({
    retryMax: RETRY_MAX,
    callback: () => {}
  }, opts)
  this.queue = queueArr
  this.retryMax = this.opts.retryMax
  this.retryInfo = {}
  this.failArr = []
  this.execute()
}

SingleQueue.prototype = {
  execute: function(executeIndex) {
    executeIndex = executeIndex || 0
    if (executeIndex >= this.queue.length) {
      console.log(`完成。失败列表:${this.failArr.join()} `)
      this.opts.callback(this.failArr)
      return
    }
    if (this.retryInfo[executeIndex] && this.retryInfo[executeIndex] >= this.retryMax) {
      console.log(`执行${executeIndex+1}失败`)
      this.failArr.push(executeIndex)
      this.execute(executeIndex + 1) // 下一个
      return
    }

    this.queue[executeIndex]().then(function() {
      return this.execute(executeIndex + 1)
    }.bind(this), function() {
      this.retryInfo[executeIndex] = this.retryInfo[executeIndex] || 0
      this.retryInfo[executeIndex]++
        // 重试
        console.log(`第${this.retryInfo[executeIndex]}次 重试${executeIndex}`)
      return this.execute(executeIndex)
    }.bind(this))
  }
}

策略2 : 控制最多同时发若干个请求，没发的请求在等待队列种等待。发的请求成功，则执行等待队列中的第一个请求，若失败，则将该请求放入失败队列。最后，等待队列为空并且之前发的请求都完成时，去检查失败队列，若失败队列不为空，则对失败队列用策略 1 来处理。具体实现如下:

var MultiQueue = function(queueArr, opts) {
  this.opts = Object.assign({
    retryMax: RETRY_MAX,
    parallelNum: 20,
    callback: () => {}
  }, opts)
  this.undoneQueue = queueArr
  this.nowDoingNum = 0
  this.retryQueue = []
  this.executeDone = false
  this.execute()
}

MultiQueue.prototype = {
  execute: function() {
    if(this.executeDone) {
      return
    }
    if (this.undoneQueue.length === 0) {
      if (this.nowDoingNum === 0) {
        this.executeDone = true
        if (this.retryQueue.length > 0) {
          new SingleQueue(this.retryQueue, this.opts)
        } else {
          this.opts.callback()
        }
      } else {
        // 过段时间检查正在执行的有没结束
        setTimeout(function() {
          this.execute()
        }.bind(this), 2000)
      }
      return
    }
    if (this.nowDoingNum < this.opts.parallelNum) {
      var doThing = this.undoneQueue.shift()
      this.nowDoingNum++
      doThing().then(function() {
        this.nowDoingNum--
          this.execute()
      }.bind(this), function() {
        this.nowDoingNum--
          this.retryQueue.push(doThing)
      }.bind(this))
      this.execute()
    }
  }
}

通过实验，发现策略 1 和 2 都能把整本小说爬完，并且策略 2 比 1 的完成速度快很多。其中策略2 中我设置的同时发请求的数量是 20。

完整代码见这里。

本文遵守创作共享CC BY-NC-SA 4.0协议 网络平台如需转载必须与本人联系确认。

本文参与腾讯云自媒体同步曝光计划，分享自作者个人站点/博客。

原始发表：2016.09.24 ，如有侵权请联系 cloudcommunity@tencent.com 删除

其他

本文分享自作者个人站点/博客前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

其他

登录后参与评论

0 条评论

热度