Python Web Scraping in Practice: Scraping Every Chapter and Link of 《盗墓笔记》

爱吃西瓜的番茄酱

Published 2018-04-04 11:17:51

This article is part of the column 一个爱吃西瓜的程序员.

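This walkthrough uses a novel-reading site for 《盗墓笔记》 (http://seputu.com) as the example and scrapes the section titles, chapter names, and chapter links.

[Figure: screenshot of the target page]

Prerequisites:

This is a static site: neither the titles nor the chapters are loaded dynamically by JavaScript, and no proxy, login, or verification step is needed.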

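Inspect the HTML structure of the target URL:

The analysis shows:

Both the titles and the chapters sit inside <div class="mulu"> elements. Within each one, the title is in the <h2> tag under <div class="mulu-title">, and the chapters are the <a> tags under <div class="box">.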
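The structure described above can be tried out offline before touching the real site. The following is a minimal sketch, assuming a hand-written HTML sample (SAMPLE_HTML, with made-up titles and example.com links) that only mirrors the described nesting; it is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the described nesting; not the real page.
SAMPLE_HTML = """
<div class="mulu">
  <div class="mulu-title"><h2>Volume One</h2></div>
  <div class="box">
    <ul>
      <li><a href="http://example.com/1.html" title="Chapter 1">Chapter 1</a></li>
      <li><a href="http://example.com/2.html" title="Chapter 2">Chapter 2</a></li>
    </ul>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_="mulu" matches the outer div only; "mulu-title" is a different class name
mulu = soup.find("div", class_="mulu")

# The title is the <h2> under <div class="mulu-title">
volume_title = mulu.find("div", class_="mulu-title").find("h2").get_text()

# The chapters are the <a> tags under <div class="box">
chapters = [(a["title"], a["href"])
            for a in mulu.find("div", class_="box").find_all("a")]

print(volume_title)
print(chapters)
```

Working against a small sample like this makes it easy to confirm the selectors before sending any request.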

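Approach:

requests (HTTP requests)

BeautifulSoup (page parsing)

json, CSV, and txt (data storage)

The code is built up as follows:

Part 1: Saving to a TXT file:

First import the required libraries: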

from bs4 import BeautifulSoup
import requests

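Set the request headers and the target URL, then send the request with the get method:

url = "http://seputu.com"
user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64)"
# The header name is "User-Agent" (with a hyphen); "User_agent" would be sent as an unrelated custom header
headers = {"User-Agent": user_agent}
req = requests.get(url, headers=headers)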
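One way to check which headers a request would carry, without touching the network, is to prepare the request and inspect it. This is a minimal sketch: build_request is a hypothetical helper introduced only for illustration, and the standard header name is assumed to be spelled "User-Agent" with a hyphen.

```python
import requests

url = "http://seputu.com"
user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64)"


def build_request(url, user_agent):
    """Prepare (but do not send) a GET request so its headers can be inspected."""
    request = requests.Request("GET", url, headers={"User-Agent": user_agent})
    return request.prepare()


prepared = build_request(url, user_agent)
print(prepared.method)
print(prepared.headers["User-Agent"])
```

Nothing is sent here; prepare() only builds the request object that requests.get would otherwise send.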

Parse the page with BeautifulSoup:

# Use html.parser as the parser
soup = BeautifulSoup(req.text, "html.parser")
rows = []
for mulu in soup.find_all(class_="mulu"):
    h2 = mulu.find("h2")
    if h2 is not None:
        h2_title = h2.get_text()  # extract the title

        for a in mulu.find(class_="box").find_all("a"):
            href = a["href"]  # extract the link
            box_title = a["title"]  # extract the chapter name
            content = (h2_title, box_title, href)
            rows.append(content)

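Saving to a TXT file:

# Specify UTF-8 explicitly; otherwise the Chinese text may come out garbled
with open("盗墓笔记.txt", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\n" + str(row))  # convert each tuple to a string, one per line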
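Because each line holds the str() form of a tuple, the file can be read back into tuples later. The sketch below assumes hypothetical sample rows and a temporary file, and uses ast.literal_eval from the standard library; this read-back step is an addition for illustration, not part of the original workflow.

```python
import ast
import os
import tempfile

# Hypothetical sample rows in the same (title, chapter name, link) shape
rows = [
    ("Volume One", "Chapter 1", "http://example.com/1.html"),
    ("Volume One", "Chapter 2", "http://example.com/2.html"),
]

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write the rows in the same style: a newline followed by the tuple's string form
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\n" + str(row))

# Read them back, skipping the empty first line that the leading "\n" produces
with open(path, encoding="utf-8") as f:
    restored = [ast.literal_eval(line.strip()) for line in f if line.strip()]

print(restored == rows)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for this kind of file.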

The result is as follows:

[Screenshot of the resulting TXT file]

Part 2: Saving to a JSON file:

First import the json module:

from bs4 import BeautifulSoup
import requests
import json

The HTTP request is the same as before:

url = "http://seputu.com"
user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64)"
# The header name is "User-Agent" (with a hyphen)
headers = {"User-Agent": user_agent}
req = requests.get(url, headers=headers)

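The parsing differs slightly: each chapter goes into a dictionary, and the dictionaries are collected in a list per title:

soup = BeautifulSoup(req.text, "html.parser")
content = []
for mulu in soup.find_all(class_="mulu"):
    h2 = mulu.find("h2")
    if h2 is not None:
        h2_title = h2.string
        # Start a fresh chapter list for each title; a single shared list
        # would make every title carry the chapters of all titles
        _list = []

        for a in mulu.find(class_="box").find_all("a"):
            href = a["href"]
            box_title = a["title"]
            _list.append({"链接": href, "章节名": box_title})
        content.append({"标题": h2_title, "章节列表": _list})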
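The nested shape (one dictionary per title, each with its own chapter list) can be illustrated without any network access. This is a sketch under stated assumptions: group_rows is a hypothetical helper and the sample rows are made up; only the dictionary keys follow the article.

```python
import json


def group_rows(rows):
    """Group (title, chapter_name, href) tuples into one dict per title."""
    grouped = {}
    for title, chapter_name, href in rows:
        # setdefault creates a new, separate list the first time a title is seen
        chapters = grouped.setdefault(title, [])
        chapters.append({"链接": href, "章节名": chapter_name})
    return [{"标题": title, "章节列表": chapters}
            for title, chapters in grouped.items()]


# Hypothetical sample data
rows = [
    ("Volume One", "Chapter 1", "http://example.com/1.html"),
    ("Volume One", "Chapter 2", "http://example.com/2.html"),
    ("Volume Two", "Chapter 1", "http://example.com/3.html"),
]

content = group_rows(rows)
print(json.dumps(content, ensure_ascii=False, indent=4))
```

Each title ends up with only its own chapters because every title gets a separate list object.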

Finally, save the data to a .json file:

with open("盗墓笔记.json", "w", encoding="utf-8") as fp:
    # Specify ensure_ascii=False; otherwise Chinese characters are written as \uXXXX escapes
    json.dump(content, fp=fp, indent=4, ensure_ascii=False)

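Let's see how the result looks:

[Screenshot of the resulting JSON file]

Suppose we had not specified ensure_ascii=False when saving the JSON file:

with open("盗墓笔记.json", "w", encoding="utf-8") as fp:
    # ensure_ascii is left at its default (True), so non-ASCII characters are escaped
    json.dump(content, fp=fp, indent=4)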

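Here is what happens:

[Screenshot of the escaped JSON output]

Every Chinese character is written as a \uXXXX escape sequence (for example, 中 becomes \u4e2d). The file is still valid JSON and loads back to the same text, but it is no longer readable by eye.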
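The difference is easy to reproduce in memory with json.dumps, which accepts the same ensure_ascii parameter as json.dump. A minimal sketch with a made-up one-entry dictionary:

```python
import json

data = {"title": "中"}  # hypothetical sample value

escaped = json.dumps(data)                        # ensure_ascii defaults to True
readable = json.dumps(data, ensure_ascii=False)   # keep non-ASCII characters as-is

print(escaped)
print(readable)

# Both strings decode to the same data; only the on-disk readability differs
print(json.loads(escaped) == json.loads(readable) == data)
```

So the escaped form is a readability problem rather than data corruption: loading either version gives back the original characters.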

Part 3: Saving the data to a CSV file:

First import the csv module:

from bs4 import BeautifulSoup
import requests
import csv

The HTTP request is the same as before:

url = "http://seputu.com"
user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64)"
# The header name is "User-Agent" (with a hyphen)
headers = {"User-Agent": user_agent}
req = requests.get(url, headers=headers)

The parsing is similar to the TXT version:

soup = BeautifulSoup(req.text, "html.parser")
rows = []
for mulu in soup.find_all(class_="mulu"):
    h2 = mulu.find("h2")
    if h2 is not None:
        h2_title = h2.string

        for a in mulu.find(class_="box").find_all("a"):
            href = a["href"]
            box_title = a["title"]
            content = (h2_title, box_title, href)
            rows.append(content)

Saving to a CSV file:

headers_ = ("标题", "章节名", "链接")
# Pass newline='' when opening the file; otherwise a blank line appears between rows in the CSV
with open("盗墓笔记.csv", "w", newline='') as fp:
    f_csv = csv.writer(fp)
    f_csv.writerow(headers_)
    f_csv.writerows(rows)

To read the CSV file back, use csv.reader():

with open("盗墓笔记.csv") as f:
    f_csv = csv.reader(f)
    headers_ = next(f_csv)
    print(headers_)
    for row in f_csv:
        print(row)
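csv.reader returns each row as a list, so columns have to be addressed by position. A possible alternative is csv.DictReader, which keys each row by the header line. The sketch below uses hypothetical sample data written to a temporary file; the explicit encoding="utf-8" is an assumption added here so the example behaves the same on every platform.

```python
import csv
import os
import tempfile

# Hypothetical sample data in the same column layout
fieldnames = ("标题", "章节名", "链接")
rows = [("Volume One", "Chapter 1", "http://example.com/1.html")]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")

with open(path, "w", newline="", encoding="utf-8") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    writer.writerows(rows)

# DictReader uses the first line as keys, so columns can be read by name
with open(path, newline="", encoding="utf-8") as fp:
    records = list(csv.DictReader(fp))

print(records[0]["章节名"])
print(records[0]["链接"])
```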

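The result is as follows:

[Screenshot of the resulting CSV file]

I ran into two main problems:

1: I did not know how to write Chinese characters to a JSON file. After some reading, I learned that ensure_ascii=False must be specified when writing the file:

json.dump(content, fp=fp, indent=4, ensure_ascii=False)

2: After writing the data to a CSV file, I found a blank line between every two rows. After some reading, I found that newline='' must be specified when opening the file:

with open("盗墓笔记.csv", "w", newline='') as fp: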
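The blank lines come from how rows are terminated: csv.writer ends each row with \r\n by default, and a file opened in text mode on Windows translates the \n into \r\n once more, which yields \r\r\n. A small sketch with an in-memory buffer (the sample values are hypothetical) makes the terminator visible:

```python
import csv
import io

# newline="" makes the buffer keep line endings exactly as the writer emits them
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["title", "chapter"])        # hypothetical sample rows
writer.writerow(["Volume One", "Chapter 1"])

raw = buffer.getvalue()
print(repr(raw))
```

Opening the real file with newline='' switches off that second translation, so each row ends with a single \r\n.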

If you run into any problems, feel free to share them so we can compare notes.

Learn a little every day, improve a little every day.

Originally published 2017-11-06 on the 小白客 WeChat official account; shared through Tencent Cloud's self-media syndication program.
