Scrapy (3): Grinding the Spider into the Ground

WeChat official account: 人生代码

Published 2020-05-16 22:13:29

This article is included in the column: 人生代码

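When you see the word "spider", you might think of a disgusting real one, like this. Scary enough, right? It counts among the ten most venomous spiders in the world.

You'd be wrong. That is just the nasty spider of your imagination. You'd never have guessed that spiders can be quite cute, like this one with its big Carslan-style eyes; I couldn't bear to grind it into the ground.

Oh, wait, a sudden flash of inspiration: Spider-Man! Now that is a stirring story. Back before Spider-Man was called Spider-Man, he was bitten by a spider, and that is how he became Spider-Man.

OK, I seem to have drifted off topic, so back to the subject. Today's topic is the spider in Scrapy, which simply means a web crawler.

Today we will work through a complete example: crawling the news list of Huxiu (虎嗅网). Let me open the site and take a look: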

https://www.huxiu.com/

It feels like I have stumbled on some kind of treasure. Maybe I could even pick up writing techniques from the articles there?

Create the project

scrapy startproject coolscrapy

Run that one command and Scrapy obediently generates the project skeleton. Let's look at the directory layout first:

coolscrapy/
    scrapy.cfg            # deployment configuration file

    coolscrapy/           # the Python module; all your code goes in here
        __init__.py

        items.py          # Item definitions

        pipelines.py      # pipeline definitions

        settings.py       # project settings

        spiders/          # all spiders go in this folder
            __init__.py
            ...

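Define our own Items

Because we want to crawl the title, summary, link, and publish time of each entry in Huxiu's news list, we define a subclass of scrapy.Item in coolscrapy/items.py to hold those fields: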

import scrapy

# Putting scrapy.Item in the parentheses means the class inherits from the scrapy.Item base class
class HuXiuItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    posttime = scrapy.Field()

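You may feel that defining this is a bit tedious and unnecessary. But look closely: isn't it just like a base class in Java that declares a set of attributes, which may correspond to the data fields of the model layer? I don't really know Java; my company just uses a Java backend, so I have dabbled in it a little.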

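Next up are our spiders

These spiders are really just crawling tools. At the code level, each one is a class. Scrapy uses them to crawl information from a domain (that is, a website). A spider class defines the initial URLs to download, how to follow links, and how to parse page content.

To define a Spider, just subclass scrapy.Spider and define a few attributes:

name: the Spider's name; it must be unique within the project

start_urls: the initial URLs to download

parse(): parses the downloaded Response object, which is also the method's only argument. It is responsible for parsing the returned page data and extracting Items (returned as Item objects), along with any further URLs to follow (returned as Request objects)

Create huxiu_spider.py under the coolscrapy/spiders folder with the following contents: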

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: sample
Desc :
"""
from coolscrapy.items import HuXiuItem
import scrapy


class HuXiuSpider(scrapy.Spider):
    name = 'huxiu'
    allowed_domains = ['huxiu.com']
    start_urls = [
        'http://www.huxiu.com/index.php'
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath(
                'div[@class="mob-sub"]/text()')[0].extract()
            print(item['title'], item['link'], item['desc'])

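Run the crawler

Amitabha. May heaven keep my crawler safe and bug-free. I'm so nervous.

Run the following command in the project root, where huxiu is the name of the spider you defined: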

scrapy crawl huxiu

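Heaven did not come through; it still threw an error. Since that's how it is, we will have to fix the bug.

For now, though, let's leave this bug alone and get familiar with the overall workflow first. We will fix it later.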

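Follow links

If you want to follow each news link and look at the full article, you can return a Request object from parse() and register a callback function to parse the article detail page: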

from coolscrapy.items import HuXiuItem
import scrapy

class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            # print(item['title'],item['link'],item['desc'])
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuXiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        print(item['title'],item['link'],item['posttime'])
        yield item

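Now parse() only extracts the links of interest and hands the parsing of each linked page to another method. You can build more complex crawlers on top of this pattern.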
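
The call response.urljoin(item['link']) in parse() matters because the href values on a list page are often relative paths. For a page without a base tag, it behaves like the standard library's urljoin applied to the page URL. Here is a minimal sketch of that resolution; the article paths are made up purely for illustration:

```python
from urllib.parse import urljoin

# URL of the list page, as used in start_urls above.
page_url = "http://www.huxiu.com/index.php"

# Hypothetical href values; the real ones depend on the site's markup.
hrefs = [
    "/article/1.html",                      # root-relative path
    "article/2.html",                       # path relative to the current page
    "http://www.huxiu.com/article/3.html",  # already absolute
]


def resolve_links(base, links):
    """Return an absolute URL for each href found on the page at `base`."""
    return [urljoin(base, link) for link in links]


absolute_urls = resolve_links(page_url, hrefs)
for url in absolute_urls:
    print(url)
```

All three forms end up as full URLs under http://www.huxiu.com/, which is what scrapy.Request needs.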

Export data

The simplest way to save the scraped data is to write it to a local JSON file, by running the crawl like this:

scrapy crawl huxiu -o items.json

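For a small demo system this approach is enough. If you are building a more complex crawling system, though, it is better to write your own Item Pipeline.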
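
The file produced by -o items.json is a single JSON array with one object per item. Unless FEED_EXPORT_ENCODING is set, Scrapy's JSON export writes non-ASCII characters as \uXXXX escapes, so Chinese titles look unreadable in the raw file but decode normally once loaded. Below is a small sketch of reading the export back. The sample record is invented so that the snippet runs without a real crawl, and load_items is a hypothetical helper name:

```python
import json
import os
import tempfile


def load_items(path):
    """Load the JSON array written by `scrapy crawl huxiu -o items.json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Invented sample standing in for a real export; ensure_ascii=True mimics
# the escaped form described above.
sample = [{
    "title": "虎嗅",
    "link": "http://www.huxiu.com/article/1.html",
    "posttime": "2020-04-25",
}]
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=True)

items = load_items(path)
print(len(items), "item(s) loaded")
for item in items:
    print(item["posttime"], item["link"])
```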

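Save data to a database

Above we saw that scraped Items can be exported to a JSON file, but the more common practice is to write a Pipeline that stores them in a database. We define it in coolscrapy/pipelines.py: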

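# -*- coding: utf-8 -*-
import datetime
import redis
import json
import logging
from contextlib import contextmanager

from scrapy import signals
from scrapy.exporters import JsonItemExporter
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker
from coolscrapy.models import News, db_connect, create_news_table, Article


# Transactional-scope helper used by process_item below
# (the standard SQLAlchemy commit-or-rollback recipe).
@contextmanager
def session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


class ArticleDataBasePipeline(object):
    """Save articles to the database."""

    def __init__(self):
        engine = db_connect()
        create_news_table(engine)
        self.Session = sessionmaker(bind=engine)

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        pass

    def process_item(self, item, spider):
        # The item must carry these keys. Note that the HuXiuItem defined
        # earlier only declares title, link, desc and posttime.
        a = Article(url=item["url"],
                    title=item["title"].encode("utf-8"),
                    publish_time=item["publish_time"].encode("utf-8"),
                    body=item["body"].encode("utf-8"),
                    source_site=item["source_site"].encode("utf-8"))
        with session_scope(self.Session) as session:
            session.add(a)
        # Return the item so that any later pipeline still receives it.
        return item

    def close_spider(self, spider):
        pass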

Above I used Python's SQLAlchemy to save to the database. It is an excellent ORM library; I wrote an introductory tutorial on it that you can refer to.
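
For readers without a MySQL server at hand, the same open/process/close lifecycle can be tried against SQLite from the standard library. The sketch below is not the pipeline above: the class name SqliteNewsPipeline, the news table, and its columns are assumptions chosen to match the HuXiuItem fields, and it swaps SQLAlchemy for plain sqlite3.

```python
import sqlite3


class SqliteNewsPipeline:
    """Hypothetical pipeline storing title/link/posttime rows in SQLite."""

    def __init__(self, db_path="news.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and ensure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS news ("
            "link TEXT PRIMARY KEY, title TEXT, posttime TEXT)"
        )

    def process_item(self, item, spider):
        # `with self.conn` commits on success and rolls back on an exception.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO news (link, title, posttime) "
                "VALUES (?, ?, ?)",
                (item["link"], item["title"], item.get("posttime")),
            )
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Drive the pipeline by hand, with a plain dict standing in for an item.
pipeline = SqliteNewsPipeline(":memory:")
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"title": "sample", "link": "http://www.huxiu.com/article/1.html",
     "posttime": "2020-04-25"},
    spider=None,
)
rows = pipeline.conn.execute(
    "SELECT title, link, posttime FROM news").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

Using the link as the primary key together with INSERT OR REPLACE means that crawling the same article twice updates the row instead of duplicating it. That is a design choice of this sketch, not something the original pipeline does.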

Then configure this Pipeline in settings.py, along with the database connection details:

ITEM_PIPELINES = {
    'coolscrapy.pipelines.ArticleDataBasePipeline': 5,
}

# linux pip install MySQL-python
DATABASE = {'drivername': 'mysql',
            'host': '192.168.203.95',
            'port': '3306',
            'username': 'root',
            'password': 'mysql',
            'database': 'spider',
            'query': {'charset': 'utf8'}
}
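
The keys in DATABASE mirror the components of a SQLAlchemy database URL (drivername, username, password, host, port, database, query). That is presumably how db_connect() consumes them, but coolscrapy/models.py is not shown here, so treat that as an assumption. To make the mapping concrete, the hypothetical helper build_db_url below assembles the equivalent URL string using only the standard library:

```python
from urllib.parse import quote_plus, urlencode

# Same values as the DATABASE setting above.
DATABASE = {
    'drivername': 'mysql',
    'host': '192.168.203.95',
    'port': '3306',
    'username': 'root',
    'password': 'mysql',
    'database': 'spider',
    'query': {'charset': 'utf8'},
}


def build_db_url(cfg):
    """Assemble a SQLAlchemy-style URL string from a settings dict."""
    # Percent-encode the credentials so special characters stay valid in a URL.
    user = quote_plus(cfg['username'])
    password = quote_plus(cfg['password'])
    url = '{0}://{1}:{2}@{3}:{4}/{5}'.format(
        cfg['drivername'], user, password,
        cfg['host'], cfg['port'], cfg['database'])
    if cfg.get('query'):
        url += '?' + urlencode(cfg['query'])
    return url


db_url = build_db_url(DATABASE)
print(db_url)
```

A string of this shape is what sqlalchemy.create_engine() accepts as its first argument.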

Run the crawler again

Let's fix this bug tomorrow.

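This article is part of the Tencent Cloud self-media sharing program and was shared from a WeChat official account.

Originally published: 2020-04-25. In case of infringement, contact cloudcommunity@tencent.com to request removal.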
