python 爬虫爬小说

原创

百里丶落云

修改于 2023-11-14 14:41:51

3752

修改于 2023-11-14 14:41:51

文章被收录于专栏：享~方法享~方法

学如逆水行舟,不进则退

今天想看小说..找了半天,没有资源..

只能自己爬了

想了半天.,,,忘记了这个古老的技能

捡了一下

那么什么是爬虫呢。

爬虫是一种自动化程序，用于从网络上抓取信息。它通过模拟人类操作，在网页上获取所需的数据，并将其保存或处理。爬虫可以根据特定规则或策略遍历网页，收集各种类型的数据，例如文字、图片、视频等。这些数据可以被用于分析、建立索引、挖掘有价值的信息等目的。爬虫在许多领域都有应用，如搜索引擎、数据采集、舆情监测等。在使用爬虫时，需要遵守相关的法律法规，不得侵犯他人的合法权益。

今天我爬的是一个小说的网站。可能到大家都看过。。是一个经典的小说网站，笔趣阁。

这里使用的包很简单就是requests 请求包。模拟浏览器请求。

首先是模拟请求cookies 和请求头，打开F12 ,自行按照请求内容复制。

import requests
from bs4 import BeautifulSoup
cookies = {
    'bcolor': 'null',
    'font': 'null',
    'size': 'null',
    'color': 'null',
    'width': 'null',
    'clickbids': '18836',
    'Hm_lvt_30876ba2abc5f5253467ef639ca0ad48': '1571030311,1571030949,1571031218',
    'Hm_lpvt_30876ba2abc5f5253467ef639ca0ad48': '1571031588',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

response = requests.get('http://www.biquku.la/18/18836/', headers=headers, cookies=cookies)

接下来。写下载功能。通过了解HTML 的规则。抓取其中的规则获取对应数据。

# print(response.text)
class downloder(object):
    def __init__(self):
        self.server = 'http://www.biqukan.com'
        self.target = 'http://www.biqukan.com/1_1094/'
        self.names = []  #存放章节名字
        self.urls = [] #存放章节链接
        self.nums = 0 # 章节数量
    def get_download_url(self):
        req = requests.get('http://www.biquku.la/18/18836/', headers=headers, cookies=cookies)

        html = req.text
        # print(html)
        div_bf = BeautifulSoup(html)
        div = div_bf.find_all('div',id='list')
        a_bf = BeautifulSoup(str(div[0]))
        a = a_bf.find_all('a')
        for each in a:
            self.names.append(each.string)
            self.urls.append('http://www.biquku.la/18/18836/'+each.get('href'))
        self.nums = len(a)

    def writer(self, name, path, text):
        write_flag = True
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.writelines(text)
            f.writelines('\n\n')

    def get_contents(self, target):
        req = requests.get(url=target)
        html = req.content
        # print('html',html)
        bf = BeautifulSoup(html)
        texts = bf.find_all('div', id='content')

        texts=str(texts[0]).replace('<br/>','\n')
        # print('texts',texts)
        # texts = texts[0].text.replace('&nbsp', '\n\n')
        # texts = texts[0].text.replace('<br/>', '\n\n')
        # texts = texts[0].text.replace('<br/>', '\n\n')
        # texts = texts[0].text.replace('<br>', '\n\n')

        return texts

最后输出主函数内容。

if __name__ == '__main__':
    dl = downloder()
    dl.get_download_url()
    # print(dl.urls)
    print(dl.nums)
    print('开始下载')
    for i in range(dl.nums):
        dl.writer(dl.names[i], '用点.txt', dl.get_contents(dl.urls[i]))
        print('第'+str(i)+'章下载完')
    print("下载完成")

这样就下载完成了。

我正在参与2023腾讯技术创作特训营第三期有奖征文，组队打卡瓜分大奖！

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

2023腾讯·技术创作特训营第三期

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 cloudcommunity@tencent.com 删除。

2023腾讯·技术创作特训营第三期

登录后参与评论

0 条评论

热度