
A failed spider

BORBER
Published 2019-08-06 17:31:40

A failed spider, a successful attempt

After finishing the comic-site spider, I wondered what else I could do with my limited knowledge. My Python fundamentals are not solid enough, though, and my own requirements were too vague, so in the end I settled on scraping novels from Biquge (笔趣阁), to practice Python and get familiar with bs4 and requests.

Because I wanted to use multiprocessing, the code is structured differently from last time. I was also stuck on a PyCharm quirk: pressing Enter right after typing a URL opens it in the browser, so I pad the input with one extra character (a space by default) and slice it off later.
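For what it's worth, a more robust variant of that workaround (my suggestion, not the original code) is to strip any trailing whitespace instead of slicing off exactly one character:

url = input('Enter the specific link to the novel: ').strip()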

First, scraping the novel's title:

# bqg_get_titles.py
from bs4 import BeautifulSoup
import requests


def get_titles(urlx):
    # fetch the book's index page and pull the novel title from its <h1>
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    return soup.select('h1')[0].get_text()
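A quick usage check (the URL here is a hypothetical Biquge book page, not one from the original post):

if __name__ == '__main__':
    # hypothetical book index URL; substitute a real Biquge link
    print(get_titles('https://www.biquge.com.cn/book/12345/'))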

Next, scraping the chapter list:

# bqg_get_chapters.py
from bs4 import BeautifulSoup
import requests
import pymongo

head = 'https://www.biquge.com.cn'
client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']


def get_chapters(urlx):
    # fetch the book's index page and store every chapter link in MongoDB
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # chapter links sit in <dd><a> entries; their hrefs are site-relative
    local_chapters = soup.select('dd > a')
    for index, each in enumerate(local_chapters):
        local_chapter = {
            'index': index,
            'url': head + each.get('href')
        }
        chapters.insert_one(local_chapter)
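After a run, each stored chapter document should look roughly like the comment below (the URLs are again hypothetical):

if __name__ == '__main__':
    get_chapters('https://www.biquge.com.cn/book/12345/')
    # a stored record looks roughly like:
    # {'index': 0, 'url': 'https://www.biquge.com.cn/book/12345/1.html'}
    print(chapters.count_documents({}))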

Then, scraping the chapter content:

# bqg_get_articles.py
from bs4 import BeautifulSoup
import requests
import pymongo

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
books = BQG['books']
chapters = BQG['chapters']
articles = BQG['articles']


def get_articles(urlx):
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    wb_data.close()
    # 'T|C' is a type tag: 1 marks a chapter title, 0 marks a body paragraph
    title = {'T|C': 1, 'm': soup.select('div.bookname > h1')[0].get_text()}
    articles.insert_one(title)
    # the chapter body lives in <div id="content">: strip the wrapper tags
    # and non-breaking spaces, then split paragraphs on the double <br/>
    raw = str(soup.select_one('#content'))
    raw = raw.replace('<div id="content">', '').replace('</div>', '').replace('\xa0', '')
    content = raw.split('<br/><br/>')
    for each in content:
        paragraph = {
            'T|C': 0,
            'm': each
        }
        articles.insert_one(paragraph)

The main program:

# main.py
import pymongo
from multiprocessing import Pool
from zerox.bqg_get_chapters import get_chapters
from zerox.bqg_get_articles import get_articles
from zerox.bqg_get_titles import get_titles

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
books = BQG['books']
chapters = BQG['chapters']
articles = BQG['articles']
rootpath = '/home/x/BORBER/File/Tmp/novel/'
filend = '.txt'


def write_in(item):
    # titles ('T|C' == 1) get wider spacing in the output file than paragraphs
    if item['T|C'] == 1:
        file.write(item['m'])
        file.write('\n\n\n\n')
    else:
        file.write(item['m'])
        file.write('\n\n')


if __name__ == '__main__':
    # start from a clean database on every run
    BQG.drop_collection('books')
    BQG.drop_collection('chapters')
    BQG.drop_collection('articles')
    pool = Pool()  # created with multiprocessing in mind, but never used below
    print('Enter the specific link to the novel:')
    # slice off the trailing padding character (see the PyCharm note above)
    url = input()[:-1]
    title = get_titles(url)
    file = open(rootpath + title + filend, 'w')
    get_chapters(url)
    for item in chapters.find():
        get_articles(item['url'])
    for item in articles.find():
        write_in(item)
    file.close()
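Note that the Pool above is created but never used, so the chapters are actually fetched one by one. As a minimal sketch (my addition, not the original code), pool.map could parallelize the downloads while keeping chapter order, provided the workers return the parsed content instead of writing to MongoDB themselves; fetch_chapter here is a hypothetical helper:

from multiprocessing import Pool
from bs4 import BeautifulSoup
import requests
import pymongo

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']
articles = BQG['articles']


def fetch_chapter(url):
    # hypothetical helper: like get_articles, but returns the parsed
    # paragraphs instead of inserting them from inside the worker
    wb_data = requests.get(url)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    title = soup.select('div.bookname > h1')[0].get_text()
    raw = str(soup.select_one('#content'))
    raw = raw.replace('<div id="content">', '').replace('</div>', '').replace('\xa0', '')
    return title, raw.split('<br/><br/>')


if __name__ == '__main__':
    urls = [item['url'] for item in chapters.find().sort('index')]
    with Pool() as pool:
        # pool.map returns results in input order, so chapters stay in sequence
        for title, paragraphs in pool.map(fetch_chapter, urls):
            articles.insert_one({'T|C': 1, 'm': title})
            articles.insert_many([{'T|C': 0, 'm': p} for p in paragraphs])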

A small monitoring script is also provided:

# count.py
import time
import os
import pymongo

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']
articles = BQG['articles']

# print a running total of stored records every two seconds
while True:
    os.system('clear')
    # cursor.count() is gone in newer PyMongo; count_documents replaces it
    print(chapters.count_documents({}) + articles.count_documents({}))
    time.sleep(2)

File structure:

-zero
	-zerox  #python package
		-bqg_get_chapters.py
		-bqg_get_articles.py
		-bqg_get_titles.py
		-count.py
		-main.py
		-__init__.py

So why is this post called "a failed spider"? Because it only succeeded twice; every run after that errored out, presumably because of Biquge's anti-scraping measures. As before, once my skills improve I will come back and improve this spider. If you want to experiment, start by adding a headers argument to the get() calls.
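For anyone who wants to try that, here is a minimal sketch of the suggested fix, attaching a browser-like User-Agent header to each request (the UA string is just an example, not something from the original post):

import requests

# an example desktop-browser User-Agent; any realistic one should work
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.142 Safari/537.36'
}

def get_page(urlx):
    # same fetch pattern as the get_* functions above, with headers attached
    wb_data = requests.get(urlx, headers=HEADERS)
    wb_data.encoding = 'UTF-8'
    return wb_data.text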

This spider requires MongoDB.

Time to really put in the hard work on fundamentals (๑•̀ㅂ•́)و✧
