
A failed spider

BORBER
Published 2019-08-06 17:31:40

A failed spider, a successful attempt

After finishing the comic-site spider, I wondered what else I could do with my limited knowledge. My Python fundamentals are not solid enough, though, and my own requirements were too vague, so in the end I settled on scraping novels from Biquge (笔趣阁), to practice Python and get familiar with bs4 and requests.

Because I wanted to use multiprocessing, the code is structured differently from last time. I was also stuck on a PyCharm quirk: pressing Enter right after typing a URL opens it in the browser, so I pad the input with one extra character (a space by default) and slice it off later.
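For what it's worth, a more robust variant of that workaround (my suggestion, not the original code) is to strip any trailing whitespace instead of slicing off exactly one character:

url = input('Enter the specific link to the novel: ').strip()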

First, scraping the novel's title:

# bqg_get_titles.py
from bs4 import BeautifulSoup
import requests


def get_titles(urlx):
    # fetch the book's index page and pull the novel title from its <h1>
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    return soup.select('h1')[0].get_text()
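A quick usage check (the URL here is a hypothetical Biquge book page, not one from the original post):

if __name__ == '__main__':
    # hypothetical book index URL; substitute a real Biquge link
    print(get_titles('https://www.biquge.com.cn/book/12345/'))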

Next, scraping the chapter list:

# bqg_get_chapters.py
from bs4 import BeautifulSoup
import requests
import pymongo

head = 'https://www.biquge.com.cn'
client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']


def get_chapters(urlx):
    # fetch the book's index page and store every chapter link in MongoDB
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # chapter links sit in <dd><a> entries; their hrefs are site-relative
    local_chapters = soup.select('dd > a')
    for index, each in enumerate(local_chapters):
        local_chapter = {
            'index': index,
            'url': head + each.get('href')
        }
        chapters.insert_one(local_chapter)
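After a run, each stored chapter document should look roughly like the comment below (the URLs are again hypothetical):

if __name__ == '__main__':
    get_chapters('https://www.biquge.com.cn/book/12345/')
    # a stored record looks roughly like:
    # {'index': 0, 'url': 'https://www.biquge.com.cn/book/12345/1.html'}
    print(chapters.count_documents({}))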

Then, scraping the chapter content:

# bqg_get_articles.py
from bs4 import BeautifulSoup
import requests
import pymongo

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
books = BQG['books']
chapters = BQG['chapters']
articles = BQG['articles']


def get_articles(urlx):
    wb_data = requests.get(urlx)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    wb_data.close()
    # 'T|C' is a type tag: 1 marks a chapter title, 0 marks a body paragraph
    title = {'T|C': 1, 'm': soup.select('div.bookname > h1')[0].get_text()}
    articles.insert_one(title)
    # the chapter body lives in <div id="content">: strip the wrapper tags
    # and non-breaking spaces, then split paragraphs on the double <br/>
    raw = str(soup.select_one('#content'))
    raw = raw.replace('<div id="content">', '').replace('</div>', '').replace('\xa0', '')
    content = raw.split('<br/><br/>')
    for each in content:
        paragraph = {
            'T|C': 0,
            'm': each
        }
        articles.insert_one(paragraph)

The main program:

# main.py
import pymongo
from multiprocessing import Pool
from zerox.bqg_get_chapters import get_chapters
from zerox.bqg_get_articles import get_articles
from zerox.bqg_get_titles import get_titles

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
books = BQG['books']
chapters = BQG['chapters']
articles = BQG['articles']
rootpath = '/home/x/BORBER/File/Tmp/novel/'
filend = '.txt'


def write_in(item):
    # titles ('T|C' == 1) get wider spacing in the output file than paragraphs
    if item['T|C'] == 1:
        file.write(item['m'])
        file.write('\n\n\n\n')
    else:
        file.write(item['m'])
        file.write('\n\n')


if __name__ == '__main__':
    # start from a clean database on every run
    BQG.drop_collection('books')
    BQG.drop_collection('chapters')
    BQG.drop_collection('articles')
    pool = Pool()  # created with multiprocessing in mind, but never used below
    print('Enter the specific link to the novel:')
    # slice off the trailing padding character (see the PyCharm note above)
    url = input()[:-1]
    title = get_titles(url)
    file = open(rootpath + title + filend, 'w')
    get_chapters(url)
    for item in chapters.find():
        get_articles(item['url'])
    for item in articles.find():
        write_in(item)
    file.close()
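Note that the Pool above is created but never used, so the chapters are actually fetched one by one. As a minimal sketch (my addition, not the original code), pool.map could parallelize the downloads while keeping chapter order, provided the workers return the parsed content instead of writing to MongoDB themselves; fetch_chapter here is a hypothetical helper:

from multiprocessing import Pool
from bs4 import BeautifulSoup
import requests
import pymongo

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']
articles = BQG['articles']


def fetch_chapter(url):
    # hypothetical helper: like get_articles, but returns the parsed
    # paragraphs instead of inserting them from inside the worker
    wb_data = requests.get(url)
    wb_data.encoding = 'UTF-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    title = soup.select('div.bookname > h1')[0].get_text()
    raw = str(soup.select_one('#content'))
    raw = raw.replace('<div id="content">', '').replace('</div>', '').replace('\xa0', '')
    return title, raw.split('<br/><br/>')


if __name__ == '__main__':
    urls = [item['url'] for item in chapters.find().sort('index')]
    with Pool() as pool:
        # pool.map returns results in input order, so chapters stay in sequence
        for title, paragraphs in pool.map(fetch_chapter, urls):
            articles.insert_one({'T|C': 1, 'm': title})
            articles.insert_many([{'T|C': 0, 'm': p} for p in paragraphs])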

A small monitoring script is also provided:

# count.py
import time
import os
import pymongo

client = pymongo.MongoClient('localhost', 27017)
BQG = client['BQG']
chapters = BQG['chapters']
articles = BQG['articles']

# print a running total of stored records every two seconds
while True:
    os.system('clear')
    # cursor.count() is gone in newer PyMongo; count_documents replaces it
    print(chapters.count_documents({}) + articles.count_documents({}))
    time.sleep(2)

File structure:

-zero
	-zerox  #python package
		-bqg_get_chapters.py
		-bqg_get_articles.py
		-bqg_get_titles.py
		-count.py
		-main.py
		-__init__.py

So why is this post called "a failed spider"? Because it only succeeded twice; every run after that errored out, presumably because of Biquge's anti-scraping measures. As before, once my skills improve I will come back and improve this spider. If you want to experiment, start by adding a headers argument to the get() calls.
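For anyone who wants to try that, here is a minimal sketch of the suggested fix, attaching a browser-like User-Agent header to each request (the UA string is just an example, not something from the original post):

import requests

# an example desktop-browser User-Agent; any realistic one should work
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.142 Safari/537.36'
}

def get_page(urlx):
    # same fetch pattern as the get_* functions above, with headers attached
    wb_data = requests.get(urlx, headers=HEADERS)
    wb_data.encoding = 'UTF-8'
    return wb_data.text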

This spider requires MongoDB.

Time to really put in the hard work on fundamentals (๑•̀ㅂ•́)و✧
