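Python crawler: scraping every novel listed on the front page of the Biquge (笔趣阁) novel site and saving it locally (single-threaded, so it seems a bit slow)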

戈贝尔光和热

Published 2018-12-27 15:15:40


This article is included in the column: HUBU生信

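Over the past few days I have been studying new material and trying out PyCharm, which is said to be the only IDE in the whole universe built specifically for Python development.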

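The software is entirely in English. Localization packs are available online, but on reflection, running such an impressive tool in a translated version felt a bit lame (like installing an English language pack on a Chinese program). So I decided to use it as it is. The screenshot below shows it running (still in the middle of scraping novels).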

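PyCharm requires an activation code, and many of the methods circulating online no longer work. If you need to activate it, leave a comment; I will later update this post with the working methods I have found online.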

Now to the main topic. This is the novel site we will scrape today: the Biquge ranking page (小说排行榜_2017完结小说排行榜_笔趣阁)

Anyone who reads web novels regularly will be familiar with these titles. So how can we download all of them in one go?

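Let us first outline the main idea:

1. Scrape the site's overall ranking page and get the URL of each novel;

2. From each novel's URL, find the URLs of all its chapters;

3. From each chapter's URL, get that chapter's content.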

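First comes the basic crawler routine: press F12 and analyze the page. I will not go into detail here, since I covered it in an earlier article: "(A must-read for Python beginners!) A detailed walkthrough of Python crawlers: scraping a single static page. Target: the Hubei University forum on Baidu Tieba".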

Inspecting the page makes it clear that each ranking list sits inside markup like this:

<div class="index_toplist mright mbottom">
·····
<div class="topbooks" id="con_o1g_3" style="display: none;">
<li><span class="hits">05-08</span><span class="num">1.</span><a href="/book/68/" title="武炼巅峰" target="_blank">武炼巅峰</a></li>
</div>
<div class="clearfix"></div>
</div>

So the code can be written as follows:

import requests
from bs4 import BeautifulSoup

# Send the request and return the page HTML as text
def get_url(url):
    r = requests.get(url)
    r.encoding = r.apparent_encoding  # use the encoding detected from the content
    return r.text

# Parse the ranking page with BeautifulSoup
def get_data(url):
    html = get_url(url)
    comments = []
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.find_all('div', class_="index_toplist mright mbottom")
    for board in tags:
        try:
            head = board.find('div', class_="toptab").span.string
            with open("novel_information.txt", 'a+', encoding='utf-8') as f:
                f.write("\nNovel ranking: {}\n".format(head))
                # Only the list that is currently visible in each ranking box
                novel_list = board.find('div', attrs={"style": "display: block;"})
                for txt in novel_list.find_all('li'):
                    novel_title = txt.a["title"]
                    novel_link = 'https://www.qu.la' + str(txt.a["href"])
                    f.write("Title: {}\t\tURL: {}\n".format(novel_title, novel_link))
                    comments.append(novel_link)
        except (AttributeError, KeyError, TypeError):
            # Skip ranking boxes that do not have the expected structure
            continue
    return comments  # the URL of every novel found
All the novel URLs are now saved in comments.
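
As a quick offline check of the extraction idea, the sample markup shown earlier can be parsed without any network request. The sketch below is not the article's method: it swaps BeautifulSoup for the standard-library html.parser so that it runs without extra installs, and the names SAMPLE and RankingLinkParser are made up for illustration.

```python
from html.parser import HTMLParser

# Sample markup copied from the ranking page excerpt in this article
SAMPLE = """
<div class="index_toplist mright mbottom">
<div class="topbooks" id="con_o1g_3" style="display: none;">
<li><span class="hits">05-08</span><span class="num">1.</span><a href="/book/68/" title="武炼巅峰" target="_blank">武炼巅峰</a></li>
</div>
<div class="clearfix"></div>
</div>
"""

class RankingLinkParser(HTMLParser):
    """Collects (title, href) for every <a> that appears inside an <li>."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            attr = dict(attrs)
            if "href" in attr and "title" in attr:
                self.links.append((attr["title"], attr["href"]))

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1

parser = RankingLinkParser()
parser.feed(SAMPLE)

# Prefix the site root, as get_data does for each href
novel_links = [(title, "https://www.qu.la" + href) for title, href in parser.links]
print(novel_links)
```

Running the same check against a saved copy of the real page is a cheap way to confirm the selectors before starting a long crawl.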

The next step is to take each URL in the comments list and return the URLs of all chapters of that novel.

from urllib.parse import urljoin

# Get the URL of every chapter of one novel
def get_novel_url(url):
    chapter_list = []
    html = get_url(url)
    soup = BeautifulSoup(html, 'lxml')
    novel_name = soup.find('h1').get_text()
    path = r'C:\Users\13016\PycharmProjects\untitled11\{}.txt'.format(novel_name)
    with open(path, 'a+', encoding='utf-8') as f:
        f.write('Novel title: {}\n'.format(novel_name))
        tags = soup.find('div', attrs={"id": "list"})
        for chapter in tags.find_all('dd'):
            try:
                chapter_title = chapter.a.string
                # urljoin resolves both relative and root-relative hrefs
                chapter_link = urljoin(url, str(chapter.a["href"]))
                chapter_list.append(chapter_link)
                f.write("Chapter: {}\t\t".format(chapter_title))
            except (AttributeError, KeyError, TypeError):
                # Skip <dd> entries without a usable link
                continue
    return chapter_list, novel_name
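
How a chapter's href should be combined with the novel URL depends on whether the site emits relative or root-relative links, which is worth checking in the browser first. A minimal sketch of the difference, assuming a hypothetical chapter file name 2065.html (not taken from the site):

```python
from urllib.parse import urljoin

novel_url = "https://www.qu.la/book/68/"

# Hypothetical hrefs, for illustration only
relative_href = "2065.html"
root_relative_href = "/book/68/2065.html"

# Plain string concatenation only works for the relative form
concat_relative = novel_url + relative_href
concat_root = novel_url + root_relative_href

# urljoin resolves both forms against the novel URL
joined_relative = urljoin(novel_url, relative_href)
joined_root = urljoin(novel_url, root_relative_href)

print(concat_relative)
print(concat_root)
print(joined_relative)
print(joined_root)
```

Concatenation duplicates the path when the href already starts with a slash, while urljoin gives the same chapter URL in both cases.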
Finally, scrape the content of each chapter:

# Scrape the content of one chapter of a novel
def get_novel_txt(url, novel_name):  # url is the URL of a single chapter
    html = get_url(url).replace("<br/>", "\n")
    html = html.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "   ")
    soup = BeautifulSoup(html, 'lxml')
    try:
        chapter = soup.find('div', attrs={"id": "content"}).text
        path = r'C:\Users\13016\PycharmProjects\untitled11\{}.txt'.format(novel_name)
        with open(path, 'a', encoding='utf-8') as f:
            f.write("Current novel: {}\nContent:\n{}\n".format(novel_name, chapter))
        print("Scraped successfully")
    except (AttributeError, OSError) as e:
        print("Failed:", e)
    return html
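
The two replace calls only catch the exact strings `<br/>` and a run of four `&nbsp;`. A slightly more tolerant cleanup might look like the sketch below, which assumes the chapter markup may also contain `<br>` or `<br />`; the helper name clean_chapter_html is made up for illustration and is not part of the original script.

```python
import html
import re

def clean_chapter_html(fragment):
    """Turn line-break tags into newlines and HTML entities into plain characters."""
    # Match <br>, <br/> and <br /> in any letter case
    text = re.sub(r"<br\s*/?>", "\n", fragment, flags=re.IGNORECASE)
    # html.unescape turns &nbsp; into a non-breaking space (\xa0);
    # replace that with an ordinary space for plain-text output
    return html.unescape(text).replace("\xa0", " ")

sample = "&nbsp;&nbsp;Line one<br/>Line two<BR />Line three"
cleaned = clean_chapter_html(sample)
print(cleaned)
```

Unlike the original replacement, this keeps one space per `&nbsp;` rather than mapping four entities to three spaces, so indentation in the saved text may differ slightly.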
Finally, combine all the functions to get the complete crawler:

url = 'https://www.qu.la/paihangbang/'
all_novel_list = get_data(url)
for novel_url in all_novel_list:
    # get_novel_url returns the chapter URLs and the novel's name
    chapter_list, novel_name = get_novel_url(novel_url)
    for chapter_url in chapter_list:
        get_novel_txt(chapter_url, novel_name)
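
Since the single-threaded loop is slow, one possible speed-up is to download chapters in a thread pool and write them out afterwards in order. This is only a sketch under assumptions: fetch_chapter is a stand-in that returns fake text so the example runs offline, the chapter URLs are hypothetical, and a real crawler would also need to respect the site's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chapter(url):
    # Stand-in for a real download (for example a function built on
    # requests.get); it returns fake text so the sketch runs offline.
    return "content of " + url

def fetch_all(chapter_urls, max_workers=8):
    # executor.map runs fetch_chapter in worker threads and yields
    # the results in the same order as chapter_urls.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_chapter, chapter_urls))

# Hypothetical chapter URLs, for illustration only
urls = ["https://www.qu.la/book/68/{}.html".format(i) for i in range(1, 6)]
chapters = fetch_all(urls)
for text in chapters:
    print(text)
```

Fetching in parallel but writing sequentially keeps the chapters in reading order in the output file, which concurrent appends to the same file would not guarantee.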

Part of the scraped output: