Building a Scheduled Monitoring System to Crawl the Latest Fanqie Novel Chapters

Original

小白学大数据

Published 2025-10-10 16:48:34

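I. Technical Blueprint: Why These Tools?

A robust automated crawler system is built from the following core modules:

Crawler engine: Requests + BeautifulSoup. This is a classic pairing. Requests sends the HTTP requests and fetches the page source; BeautifulSoup parses the HTML and extracts the chapter links and body text we need.
Task scheduler: APScheduler. This Python library makes Cron-like scheduling easy and supports one-off, interval, and cron-style jobs, which fits our "scheduled monitoring" requirement well.
Data persistence: SQLite. It is lightweight, needs no separate server, and can be driven through sqlite3 from the Python standard library, which makes it a good fit for this project. We use it to store the chapters already crawled so that crawling can be incremental.
Notification (optional): smtplib. With Python's smtplib and email modules we can automatically send an email to a chosen address when a new chapter appears, giving a prompt reminder.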

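II. Reverse Engineering: Dissecting the Fanqie Novel Page Structure

Before writing any code we need to understand how the target site is structured. Using the browser developer tools (F12), we can inspect a book's catalog page and its chapter content pages on Fanqie Novel.

Catalog page: a book's catalog page usually contains links to all of its chapters. We need to find the latest chapter link and compare it with the records in the local database.
Content page: the chapter text usually sits inside a specific HTML element (for example a div with a particular class). Our task is to locate that element and extract its plain text.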
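
To make the two analyses concrete, here is a minimal sketch that runs BeautifulSoup against hand-written HTML snippets. The markup, the class names chapter-list and chapter-content, and the helper names parse_catalog and parse_content are assumptions for illustration only, not the site's real structure; inspect the live pages and adjust the selectors accordingly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real site's tags and class names will differ.
CATALOG_HTML = """
<div class="chapter-list">
  <a href="/reader/1001">Chapter 1: Departure</a>
  <a href="/reader/1002">Chapter 2: Arrival</a>
</div>
"""

CHAPTER_HTML = """
<div class="chapter-content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""


def parse_catalog(html):
    """Return (title, href) pairs for every chapter link in the catalog."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="chapter-list")
    if container is None:
        return []
    return [(a.get_text(strip=True), a["href"])
            for a in container.find_all("a", href=True)]


def parse_content(html):
    """Return the chapter body as plain text, one paragraph per line."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="chapter-content")
    if body is None:
        return None
    return "\n".join(p.get_text(strip=True) for p in body.find_all("p"))
```

Working against a saved copy of the page like this lets you tune selectors without sending repeated requests to the site.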

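Note: this example is meant to convey technical ideas for teaching purposes. When crawling for real, follow the site's robots.txt, respect copyright, use the crawler only for personal study, and avoid putting load on the target site's servers.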
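
One way to act on the robots.txt advice is to check a URL against the rules before fetching it. The sketch below uses urllib.robotparser from the standard library; the rules in SAMPLE_ROBOTS_TXT are made up so the example runs offline and do not reflect the real site's policy, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
Allow: /page/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

Against a live site you would instead call set_url() with the site's robots.txt address followed by read(), then skip any URL for which can_fetch() returns False.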

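III. Implementation: Building the Automated System from Scratch

Below we implement the whole system step by step.

Step 1: Create the database model

We start by creating a SQLite database with a single table that records the chapters already crawled.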

# create_database.py
import sqlite3

def init_database():
    # Create the database file and the chapters table if they do not exist yet
    conn = sqlite3.connect('fanqie_novel.db')
    c = conn.cursor()
    c.execute('''
        CREATE TABLE IF NOT EXISTS chapters (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            chapter_title TEXT NOT NULL UNIQUE,
            chapter_content TEXT,
            created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')
    conn.commit()
    conn.close()
    print("Database initialized successfully!")

if __name__ == '__main__':
    init_database()
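
The UNIQUE constraint on chapter_title is what makes incremental crawling possible: a title can be stored only once. As a minimal sketch of that idea, the code below uses an in-memory database and SQLite's INSERT OR IGNORE. This is a swapped-in variant for illustration, not the approach used by the crawler class in this article, and the helper names open_db and save_if_new are hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database and create the chapters table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chapters (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            chapter_title TEXT NOT NULL UNIQUE,
            chapter_content TEXT,
            created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    return conn


def save_if_new(conn, title, content):
    """Insert a chapter; return True only if the title was not stored yet."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO chapters (chapter_title, chapter_content) "
        "VALUES (?, ?)",
        (title, content),
    )
    conn.commit()
    # rowcount is 1 for a real insert and 0 when the row was ignored
    return cur.rowcount == 1
```

Calling save_if_new twice with the same title inserts one row and reports the second call as a duplicate, so the check and the write happen in a single statement.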

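Step 2: Implement the core crawler class

This class wraps all of the core functionality: fetching pages, parsing the catalog, parsing the chapter text, storing data, and comparing against what is already stored.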

# fanqie_crawler.py
import sqlite3
from datetime import datetime
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

class FanQieNovelCrawler:
    def __init__(self, book_url, db_path='fanqie_novel.db'):
        self.book_url = book_url
        self.db_path = db_path
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        # Proxy configuration (replace with your own proxy host, port, and credentials)
        self.proxyHost = "www.16yun.cn"
        self.proxyPort = "5445"
        self.proxyUser = "16QMSOML"
        self.proxyPass = "280651"
        
        # Build the proxies dictionary used by requests
        self.proxyMeta = f"http://{self.proxyUser}:{self.proxyPass}@{self.proxyHost}:{self.proxyPort}"
        self.proxies = {
            "http": self.proxyMeta,
            "https": self.proxyMeta
        }
        
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def get_soup(self, url):
        """Send a request and return a BeautifulSoup object, or None on failure"""
        try:
            # Route the request through the configured proxy
            response = self.session.get(url, timeout=10, proxies=self.proxies)
            response.raise_for_status() # Raise for 4xx/5xx responses
            response.encoding = 'utf-8'
            return BeautifulSoup(response.text, 'html.parser')
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            return None

    def get_latest_chapter_info(self):
        """Get the title and link of the latest chapter from the catalog page"""
        soup = self.get_soup(self.book_url)
        if not soup:
            return None, None

        # Note: this selector is an example; adjust it to the actual Fanqie Novel page structure
        # The chapter list usually sits inside one specific container element
        chapter_list = soup.find('div', class_='chapter-list') 
        if not chapter_list:
            print("Chapter list not found; check the selector or whether the site structure changed.")
            return None, None

        # Assumes the first link in the list is the latest chapter
        first_link = chapter_list.find('a', href=True)
        if not first_link:
            print("No chapter link found in the chapter list.")
            return None, None

        latest_chapter_title = first_link.get_text().strip()
        # Resolve relative links against the catalog page URL so the result is a full URL
        latest_chapter_link = urljoin(self.book_url, first_link['href'])

        return latest_chapter_title, latest_chapter_link

    def get_chapter_content(self, chapter_url):
        """Parse a chapter page and return its text, or None if it cannot be fetched"""
        soup = self.get_soup(chapter_url)
        if not soup:
            return None

        # Note: this selector is an example; adjust it to the actual Fanqie Novel page structure
        # The text usually sits in a div with a specific class
        content_div = soup.find('div', class_='chapter-content')
        if content_div:
            # Drop the markup and keep plain text, one paragraph per line
            content = '\n'.join([p.get_text().strip() for p in content_div.find_all('p')])
            return content
        else:
            print("Chapter text not found; check the selector.")
            return None

    def is_chapter_exist(self, chapter_title):
        """Check whether the chapter is already in the database"""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute("SELECT 1 FROM chapters WHERE chapter_title = ?", (chapter_title,))
        exists = c.fetchone() is not None
        conn.close()
        return exists

    def save_chapter(self, chapter_title, chapter_content):
        """Save a new chapter to the database"""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        try:
            c.execute("INSERT INTO chapters (chapter_title, chapter_content) VALUES (?, ?)",
                      (chapter_title, chapter_content))
            conn.commit()
            print(f"New chapter saved: {chapter_title}")
        except sqlite3.IntegrityError:
            print(f"Chapter already exists, skipping: {chapter_title}")
        finally:
            conn.close()

    def check_and_crawl(self):
        """Core logic: check for a new chapter and crawl it"""
        print(f"{datetime.now().strftime('%Y-%m-%d %H:%M:%S')} Checking for updates...")
        latest_title, latest_url = self.get_latest_chapter_info()

        if not latest_title or not latest_url:
            print("Failed to get the latest chapter info.")
            return

        if not self.is_chapter_exist(latest_title):
            print(f"New chapter found: {latest_title}")
            content = self.get_chapter_content(latest_url)
            if content is None:
                # Do not store a placeholder, so the next run retries this chapter
                print("Failed to fetch the chapter text; will retry on the next run.")
                return
            self.save_chapter(latest_title, content)
            # A notification can be triggered here, e.g. by calling an email-sending function
            # self.send_notification(latest_title, content)
        else:
            print("No new chapter.")
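
The commented-out send_notification call above is left unimplemented. As one possible sketch using the standard library's smtplib and email modules, the code below separates building the message from sending it. The function signatures, the 200-character preview, and the use of implicit-TLS SMTP are assumptions for illustration; your mail provider may require different connection settings.

```python
import smtplib
from email.message import EmailMessage


def build_notification(chapter_title, chapter_content, sender, recipient):
    """Build the email announcing a new chapter (nothing is sent here)."""
    msg = EmailMessage()
    msg["Subject"] = f"New chapter: {chapter_title}"
    msg["From"] = sender
    msg["To"] = recipient
    # Include only a short preview so the mail stays small.
    preview = chapter_content[:200]
    msg.set_content(f"{chapter_title}\n\n{preview}")
    return msg


def send_notification(msg, host, port, user, password):
    """Send the message over an implicit-TLS SMTP connection."""
    with smtplib.SMTP_SSL(host, port, timeout=10) as server:
        server.login(user, password)
        server.send_message(msg)
```

Keeping message construction separate from delivery makes it possible to check the subject and body without contacting a mail server.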

Step 3: Integrate the scheduled task

Now we use APScheduler to run the crawler on a schedule.

# main_scheduler.py
from apscheduler.schedulers.blocking import BlockingScheduler
from fanqie_crawler import FanQieNovelCrawler
import logging

# Configure logging so the scheduler's job activity is visible
logging.basicConfig()
logging.getLogger('apscheduler').setLevel(logging.DEBUG)

def scheduled_task():
    # Replace with the URL of the Fanqie Novel book you want to monitor
    BOOK_URL = "https://fanqienovel.com/page/your-book-id"
    crawler = FanQieNovelCrawler(book_url=BOOK_URL)
    crawler.check_and_crawl()

if __name__ == '__main__':
    # Create the scheduler
    scheduler = BlockingScheduler()

    # Add the scheduled job
    # Option 1: run at a fixed interval, e.g. every 30 minutes
    scheduler.add_job(scheduled_task, 'interval', minutes=30)

    # Option 2: run Cron-style, e.g. once at 9 a.m. and once at 9 p.m. every day
    # scheduler.add_job(scheduled_task, 'cron', hour='9,21')

    print("Crawler scheduler started. Press Ctrl+C to exit.")

    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        print("\nScheduler stopped.")

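IV. System Optimization and Deployment

Exception handling and logging: the code above includes basic exception handling. In production, add more thorough logging (for example with the logging module) and write run status and errors to a file.
Dealing with anti-crawler measures:
- Random User-Agent: rotate the User-Agent with the fake_useragent library.
- Proxy IPs: if your IP gets banned, introduce a proxy IP pool.
- Request rate control: add time.sleep(random.uniform(1, 3)) between requests to mimic human behavior.
Deployment: you can deploy the script to a cloud server (such as Alibaba Cloud or Tencent Cloud ECS) or a Raspberry Pi and keep it running in the background with nohup python main_scheduler.py &. For more complex production setups, consider managing the process with systemd or Supervisor so that the crawler service restarts automatically after an abnormal exit.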
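
Two of the points above, logging to a file and throttling requests, can be sketched with the standard library alone. The logger name, file name, size limits, and the helper names setup_logger and polite_sleep are assumptions chosen for illustration.

```python
import logging
import random
import time
from logging.handlers import RotatingFileHandler


def setup_logger(log_path="crawler.log"):
    """Log to a size-capped file so a long-running job does not fill the disk."""
    logger = logging.getLogger("fanqie_crawler")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            log_path, maxBytes=1_000_000, backupCount=3, encoding="utf-8"
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random interval between requests; return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

In the crawler, the print calls could be replaced with logger.info and logger.error, and polite_sleep() could be called before each request. The sleep parameter exists only so the pause can be stubbed out in tests.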

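V. Conclusion and Ethical Considerations

In this article we built an automated system that combines monitoring, crawling, and storage. Besides easing the pain of keeping up with new chapters, its technical stack (Requests + BeautifulSoup + APScheduler + SQLite) is broadly reusable and can be adapted to other similar monitoring scenarios with small changes.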

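Original work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.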