A Complete Scrapy Crawler Example [Easy to Understand]

Author: 全栈程序员站长
Published 2022-09-13 15:09:01
Hello everyone, good to see you again. I'm your friend, 全栈君.

Create a new project

```shell
scrapy startproject tutorial
```
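The command generates a project skeleton; in recent Scrapy versions the layout looks roughly like this (file roles noted as comments):

```shell
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code goes here
            __init__.py
```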

Enter the tutorial directory and create quotes_spider.py under the spiders directory:

```python
import scrapy
from ..items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["toscrape.com"]

    def start_requests(self):
        for i in range(1, 2):
            url = "http://quotes.toscrape.com/page/" + str(i) + "/"
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Build a fresh item for every quote block on the page.
        for quote in response.css('div.quote'):
            item = QuotesItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item
```
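To see what those CSS selectors pull out, here is a stdlib-only sketch of the same extraction, using `html.parser` instead of Scrapy's selectors and a hand-written sample of a quotes.toscrape.com quote block (the sample HTML and the `QuoteParser` class are illustrative, not part of Scrapy):

```python
# Stdlib-only sketch of what the spider's CSS selectors extract from one
# quote block (Scrapy itself uses parsel selectors, not html.parser).
from html.parser import HTMLParser

SAMPLE = """
<div class="quote">
  <span class="text">“Truth is beauty.”</span>
  <small class="author">John Keats</small>
  <div class="tags">
    <a class="tag">truth</a>
    <a class="tag">beauty</a>
  </div>
</div>
"""

class QuoteParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None  # which field the next text chunk belongs to
        self.item = {"text": None, "author": None, "tags": []}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if tag == "span" and "text" in classes:
            self.current = "text"
        elif tag == "small" and "author" in classes:
            self.current = "author"
        elif tag == "a" and "tag" in classes:
            self.current = "tags"

    def handle_data(self, data):
        data = data.strip()
        if not data or self.current is None:
            return
        if self.current == "tags":
            self.item["tags"].append(data)
        else:
            self.item[self.current] = data
        self.current = None

parser = QuoteParser()
parser.feed(SAMPLE)
print(parser.item)
```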

Open items.py; the code is as follows:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```

Open pipelines.py and set it up to clean the data:

```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item


class QuotesPipeline(object):
    def process_item(self, item, spider):
        text = item.get('text')
        if isinstance(text, tuple):
            # Guard: a trailing comma in the spider assignment would make this a tuple.
            text = text[0]
        if text:
            # Strip the surrounding typographic quotes from the extracted text.
            item['text'] = text.strip('“”')
        return item
```
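The cleaning step can be seen in isolation with a plain dict standing in for the Scrapy item (the sample quote text below is illustrative):

```python
# Plain-dict sketch of QuotesPipeline's cleaning step: scraped text arrives
# wrapped in typographic quotes, and the pipeline removes them.
def clean_text(item):
    if item.get('text'):
        item['text'] = item['text'].strip('“”')
    return item

item = {'text': '“The world as we have created it is a process of our thinking.”',
        'author': 'Albert Einstein',
        'tags': ['change', 'deep-thoughts']}
print(clean_text(item)['text'])  # quotes gone; author and tags untouched
```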

Open settings.py and update the relevant configuration:

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.QuotesPipeline': 500,
}
FEED_EXPORT_ENCODING = 'utf-8'
```
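The integers in ITEM_PIPELINES are priorities in the 0–1000 range: Scrapy runs pipelines in ascending order, so each item passes through TutorialPipeline (300) before QuotesPipeline (500). A minimal sketch of that ordering (plain strings stand in for the pipeline classes):

```python
# Lower priority numbers run first; this mirrors how Scrapy orders pipelines.
ITEM_PIPELINES = {
    'TutorialPipeline': 300,
    'QuotesPipeline': 500,
}

order = [name for name, prio in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)  # ['TutorialPipeline', 'QuotesPipeline']
```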

Go to the command line and run the spider:

```shell
scrapy crawl quotes -o quotes.jl
```
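The -o quotes.jl option exports in JSON Lines format: one JSON object per line. A stdlib sketch of reading such a file back, with StringIO standing in for the real file and made-up sample lines (with the actual output you would use open('quotes.jl', encoding='utf-8')):

```python
# Reading JSON Lines (.jl) output: each line is an independent JSON object.
import json
from io import StringIO

jl = StringIO('{"text": "Truth is beauty.", "author": "John Keats", "tags": ["truth"]}\n'
              '{"text": "Less is more.", "author": "Robert Browning", "tags": ["style"]}\n')

items = [json.loads(line) for line in jl]
print(len(items), items[0]['author'])
```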
Alternatively, a pipeline can write the JSON output itself. Scrapy calls close_spider on each pipeline when the spider finishes, which is where the file is closed:

```python
import json
import codecs


class JsonPipeline(object):
    def __init__(self):
        self.file = codecs.open('logs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
```
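ensure_ascii=False (like FEED_EXPORT_ENCODING = 'utf-8' above) keeps non-ASCII characters such as Chinese readable in the output file instead of \uXXXX escapes, as a quick stdlib check shows:

```python
# ensure_ascii=False writes non-ASCII characters verbatim instead of \uXXXX escapes.
import json

item = {"text": "你好,世界"}
print(json.dumps(item))                      # escaped: {"text": "\u4f60\u597d..."}
print(json.dumps(item, ensure_ascii=False))  # readable: {"text": "你好,世界"}
```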

Publisher: 全栈程序员栈长. Please credit the source when reposting: https://javaforall.cn/153159.html
