Scraping Data from a [Novel Website] with a Python Crawler: A Step-by-Step Tutorial

红目香薰

Published 2023-01-13 09:56:53

This article is included in the column: CSDNToQQCode

Preface

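All the prerequisites and fundamentals you need are covered in "Python Basics (suitable for beginners, a complete tutorial, about one week of study, saves you time)". Once you have the basics, set up the basic crawler environment by following "If you still can't set up a Python crawler environment after reading this, the melon is on me". With the basics and the environment in place, you can fetch the data you want fairly freely.

I wrote all of the code myself, piece by piece, and tested it carefully. If the scraping code in one of these posts stops working, send me a private message; there are too many comments for me to notice yours there. This series is meant to save you time at work, and I hope it is of some value to you.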

Example environment

System: win11
IDE: PyCharm Community Edition 2022.3.1
Python version: Python 3.9.6
Resources: link: https://pan.baidu.com/s/1UZA8AAbygpP7Dv0dYFTFFA extraction code: 7m3e
MySQL: 5.7, url=[rm-bp1zq3879r28p726lco.mysql.rds.aliyuncs.com], user=[qwe8403000], pwd=[Qwe8403000]. The server hosts many databases; create your own and do not interfere with anyone else's.

Scraping target

Zongheng Chinese Network (纵横中文网), a free online novel site: https://book.zongheng.com/

Set the URL of the book you want (the bookId variable in the script) and the script downloads its chapters.
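The script in this article does not read chapters from the book page itself; it rewrites the book URL into the URL of the chapter index page. A minimal sketch of that step follows. The helper name chapter_index_url is hypothetical, and the URL pattern is an assumption taken from the replace call in the script, so it may stop working if the site changes its layout.

```python
def chapter_index_url(book_url):
    # Assumption (from the script in this article): the chapter index lives
    # at .../showchapter/<id>.html for a book page at .../book/<id>.html.
    # Only the first "/book/" path segment is replaced.
    return book_url.replace("/book/", "/showchapter/", 1)


book_url = "https://book.zongheng.com/book/1228049.html"
index_url = chapter_index_url(book_url)
print(index_url)  # https://book.zongheng.com/showchapter/1228049.html
```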

Scraping code

Key techniques:

1. Iterating over two lists in a single loop

    for item1, item2 in zip(href, text):
        a_href_list = ["", ""]
        a_href_list[0] = item1
        a_href_list[1] = item2
        a_href_arr.append(a_href_list)
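The pairing above can be tried on its own with sample data. The sketch below uses made-up URLs and titles (they are placeholders, not real chapter data) and builds the same list of [link, title] pairs with a list comprehension. It also shows that zip stops at the shorter input, so an extra link without a matching title is silently dropped.

```python
# Hypothetical sample data standing in for the scraped lists
href = ["https://example.com/chapter/1.html", "https://example.com/chapter/2.html"]
text = ["Chapter 1", "Chapter 2"]

# One loop over both lists at once: each step yields one (link, title) pair
a_href_arr = [[link, title] for link, title in zip(href, text)]

for link, title in a_href_arr:
    print(title, "->", link)

# zip stops at the shorter input, so the third link here has no partner
pairs = list(zip(["u1", "u2", "u3"], ["t1", "t2"]))
print(pairs)  # [('u1', 't1'), ('u2', 't2')]
```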

2. parsel's CSS selector syntax

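Note: as in the earlier posts, the main thing to watch is randomizing the wait time between requests. If you have IP proxies available, this matters much less.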

import requests
import parsel
import uuid
import time
import random
import os

baseUrl = "http://www.zongheng.com/"

bookId = "https://book.zongheng.com/book/1228049.html"

bookIdDir = bookId.replace("book/", "showchapter/")

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

# Book title; filled in by GetUrl (index 0 holds the title)
mTitle = []
# List of [chapter URL, chapter title] pairs; filled in by GetUrl
a_href_arr = []


def GetUrl(url):
    # Fetch the chapter index page and parse it
    html = requests.get(url, headers=headers)
    sel = parsel.Selector(html.text)
    # Book title (first match of the selector)
    mTitle.append(sel.css(".book-meta h1::text").getall()[0])
    # Create the output directory; do not fail if it already exists
    os.makedirs("./" + mTitle[0] + "/", exist_ok=True)
    print(mTitle)
    # Chapter URLs
    href = sel.css(".volume-list ul a::attr(href)").getall()
    # Chapter titles
    text = sel.css(".volume-list ul a::text").getall()
    # Pair each chapter URL with its title
    for item1, item2 in zip(href, text):
        a_href_arr.append([item1, item2])


def GetTxt(url, title):
    print(url)
    print(mTitle)
    print(title)
    html = requests.get(url, headers=headers)
    sel = parsel.Selector(html.text)
    # Chapter body: one entry per paragraph
    infoDate = []
    info = sel.css(".content p::text").getall()
    for item in info:
        # Text mode converts "\n" to the platform line ending on write
        infoDate.append(item + "\n")
    # Build the output path: <book title>/<chapter title>.txt
    title = str(title).replace(" ", "_")
    title = "{0}/{1}.txt".format(mTitle[0], title)
    # The with block closes the file automatically
    with open(title, "w", encoding="utf-8") as f:
        f.write("".join(infoDate))
    print(title, "saved")


GetUrl(bookIdDir)

for item in a_href_arr:
    GetTxt(item[0], item[1])
    time.sleep(random.uniform(0.5, 1.5))
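
The pause at the end of the loop above is the randomized delay mentioned in the note. Below is a minimal sketch of the same idea as a reusable helper. The name polite_pause and its sleep parameter are hypothetical, introduced here only for illustration, and the demo URLs are placeholders; no request is actually sent.

```python
import random
import time


def polite_pause(min_seconds=0.5, max_seconds=1.5, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Demo with very short bounds so it finishes quickly; a real crawl would
# use the defaults (0.5 to 1.5 seconds), as the script above does.
for url in ["https://example.com/1.html", "https://example.com/2.html"]:
    waited = polite_pause(0.01, 0.02)
    print(f"would fetch {url} after waiting {waited:.3f}s")
```

Passing the sleep function as a parameter is a design choice that keeps the helper easy to test without real waiting.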