
Scraping Biquge (笔趣阁) Novels with Python

Today let's build a quick Biquge novel scraper.

Scraping novels from Biquge is fairly easy (this scraper doesn't touch the site's search feature).

We'll use requests and XPath to put this little crawler together.

First, naturally, the imports:

```python
import requests
import time
from lxml import etree
```

Next we'll write two helper functions: one to fetch a page and one to run an XPath query against it. They'll keep the rest of the program short and tidy.

```python
def get_tag(response, tag):
    # Parse the HTML text and return every node matching the XPath expression
    html = etree.HTML(response)
    ret = html.xpath(tag)
    return ret


def parse_url(url):
    # Fetch a page; the site serves GBK-encoded HTML, so set the encoding explicitly
    response = requests.get(url)
    response.encoding = 'gbk'
    return response.text
```
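In practice, sites like this sometimes reject bare requests, and a hung connection would stall the whole crawl. Here is a minimal hardened sketch of parse_url; the User-Agent string and the 10-second timeout are illustrative assumptions, not something the original relies on:

```python
def parse_url(url):
    # Assumed browser-like User-Agent and timeout; adjust as needed
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
    response.encoding = 'gbk'    # the site serves GBK-encoded pages
    return response.text
```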

We crawl starting from the table-of-contents page, so the first step is to collect every chapter URL listed on it.

```python
def find_url(response):
    # Every chapter link in the TOC sits under the #list element
    chapter = get_tag(response, '//*[@id="list"]/dl/dd/a/@href')
    # The hrefs are relative, so prepend the TOC URL to get full chapter URLs
    for i in chapter:
        url_list.append(start_url + i)
```
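Note the plain string concatenation only works because the TOC URL ends with a slash and the hrefs are relative. A sketch using urllib.parse.urljoin (not in the original) is more forgiving on both counts:

```python
from urllib.parse import urljoin

def find_url(response):
    # urljoin copes with relative ('123.html') and absolute ('/book_187/123.html') hrefs alike
    for href in get_tag(response, '//*[@id="list"]/dl/dd/a/@href'):
        url_list.append(urljoin(start_url, href))
```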

Then, for each chapter URL, we fetch the chapter title and its text.

```python
def find_content(url):
    response = parse_url(url)
    # Chapter title and body text, located by their XPaths on the chapter page
    chapter = get_tag(response, '//*[@id="box_con"]/div[2]/h1/text()')[0]
    content = get_tag(response, '//*[@id="content"]/text()')
    print('Scraping', chapter)
    # Append each chapter to a single text file named after the novel;
    # the with block closes the file automatically
    with open('{}.txt'.format(title), 'at', encoding='utf-8') as j:
        j.write(chapter)
        for i in content:
            if i == '\r\n':  # skip the bare line breaks between paragraphs
                continue
            j.write(i)
    print(chapter, 'saved')
    time.sleep(2)  # be polite: pause between chapter requests
```
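The '\r\n' check only filters nodes that are exactly a CRLF; chapter bodies on pages like this often also carry non-breaking spaces for indentation. A slightly more thorough cleanup (an assumption about the page markup, same idea otherwise) could look like:

```python
def clean_text(content):
    # Strip whitespace and NBSPs from every text node, keep only non-empty lines
    lines = [i.replace('\xa0', ' ').strip() for i in content]
    return '\n'.join(line for line in lines if line)
```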

Finally, the main driver function:

```python
def main():
    global title, start_url
    start_url = input('Enter the novel TOC page, e.g. https://www.52bqg.com/book_187/ : ')
    response = parse_url(start_url)
    # The novel title sits in the #info header on the TOC page
    title = get_tag(response, '//*[@id="info"]/h1/text()')[0]
    find_url(response)
    for url in url_list:
        find_content(url)
```
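Using global for title and start_url is fine at this size, but threading them through as parameters reads better. A sketch of the same flow without globals; the changed find_url/find_content signatures are hypothetical, not the ones above:

```python
def main():
    start_url = input('Enter the novel TOC page, e.g. https://www.52bqg.com/book_187/ : ')
    response = parse_url(start_url)
    title = get_tag(response, '//*[@id="info"]/h1/text()')[0]
    for url in find_url(response, start_url):  # assumes find_url now returns the list
        find_content(url, title)               # assumes find_content takes the title
```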

The complete code:

```python
import requests
import time
from lxml import etree

url_list = []


def get_tag(response, tag):
    # Parse the HTML text and return every node matching the XPath expression
    html = etree.HTML(response)
    ret = html.xpath(tag)
    return ret


def parse_url(url):
    # Fetch a page; the site serves GBK-encoded HTML, so set the encoding explicitly
    response = requests.get(url)
    response.encoding = 'gbk'
    return response.text


def find_url(response):
    # Every chapter link in the TOC sits under the #list element
    chapter = get_tag(response, '//*[@id="list"]/dl/dd/a/@href')
    # The hrefs are relative, so prepend the TOC URL to get full chapter URLs
    for i in chapter:
        url_list.append(start_url + i)


def find_content(url):
    response = parse_url(url)
    # Chapter title and body text, located by their XPaths on the chapter page
    chapter = get_tag(response, '//*[@id="box_con"]/div[2]/h1/text()')[0]
    content = get_tag(response, '//*[@id="content"]/text()')
    print('Scraping', chapter)
    # Append each chapter to a single text file named after the novel;
    # the with block closes the file automatically
    with open('{}.txt'.format(title), 'at', encoding='utf-8') as j:
        j.write(chapter)
        for i in content:
            if i == '\r\n':  # skip the bare line breaks between paragraphs
                continue
            j.write(i)
    print(chapter, 'saved')
    time.sleep(2)  # be polite: pause between chapter requests


def main():
    global title, start_url
    start_url = input('Enter the novel TOC page, e.g. https://www.52bqg.com/book_187/ : ')
    response = parse_url(start_url)
    # The novel title sits in the #info header on the TOC page
    title = get_tag(response, '//*[@id="info"]/h1/text()')[0]
    find_url(response)
    for url in url_list:
        find_content(url)


if __name__ == '__main__':
    main()
```
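On a long novel, the occasional network hiccup can kill the run halfway through. A small retry wrapper around parse_url (a sketch, not part of the original script) keeps the crawl going:

```python
def fetch_with_retry(url, attempts=3):
    # Hypothetical helper: retry with simple exponential backoff (1s, 2s, 4s)
    for n in range(attempts):
        try:
            return parse_url(url)
        except requests.RequestException:
            time.sleep(2 ** n)
    raise RuntimeError('failed to fetch ' + url)
```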

Sample run (screenshot in the original post):

A Biquge novel scraper really is easy to write.

In the next article we'll add a GUI to this scraper.

Let's learn Python together; for beginner guidance and shared tutorial materials, remember to DM me.

