Scrapy Exercise (1): Downloading Wallpapers with ImagesPipeline

Source: 企鹅号 (Tencent Penguin Account) - 耿子blog

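(1) Preparation

The site we are going to crawl: https://alpha.wallhaven.cc/random

Start by inspecting the markup of a single image. Each image on the page is a figure tag, and the only text it renders is the resolution:

1920 x 1280

Working out the XPath:

//figure/@data-wallpaper-id returns the set of image IDs on the page.

With an ID in hand, we can select the whole tag for that image:

//figure[@data-wallpaper-id="316105"] (here 316105 is the image ID)

This returns the entire figure tag, from which we can extract the fields we need. The individual fields are analyzed in the spider section later on.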
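The two XPath expressions above can be tried offline. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) as a stand-in for Scrapy's selectors. The sample markup and the example.com URLs are hypothetical; only the attribute names are taken from the expressions in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the attribute names mirror the XPath
# expressions in the article, but the real page contains more elements.
SAMPLE = """
<section>
  <figure data-wallpaper-id="316105">
    <img data-src="https://example.com/thumb/th-316105.jpg" />
    <div><span>1920 x 1280</span></div>
  </figure>
  <figure data-wallpaper-id="378330">
    <img data-src="https://example.com/thumb/th-378330.png" />
    <div><span>2560 x 1440</span></div>
  </figure>
</section>
"""

root = ET.fromstring(SAMPLE)

# Counterpart of //figure/@data-wallpaper-id: collect every image ID.
image_ids = [fig.get("data-wallpaper-id") for fig in root.findall(".//figure")]

# Counterpart of //figure[@data-wallpaper-id="316105"]: select one figure.
figure = root.find(".//figure[@data-wallpaper-id='316105']")
thumbnail = figure.find("./img").get("data-src")
resolution = figure.find("./div/span").text

print(image_ids, thumbnail, resolution)
```

Inside a spider the same expressions would be passed to response.xpath(...) instead; ElementTree is used here only so the sketch runs without Scrapy installed.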

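(2) Create the Scrapy project

Command: scrapy startproject wallhavenSpider

The command generates the standard Scrapy project layout:

    wallhavenSpider/
        scrapy.cfg
        wallhavenSpider/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

The files you usually need to edit are settings.py, items.py, and pipelines.py.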

1. Configure settings.py

Set USER_AGENT:

    USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

Enable the ITEM_PIPELINES setting. For this exercise the entry generated by the project template is all we need.

    ITEM_PIPELINES = {
        'wallhavenSpider.pipelines.WallhavenspiderPipeline': 300,
    }

Next, configure the directory where the images will be stored.

    # Directory where the downloaded images are stored
    IMAGES_STORE = "H:\\python_workspace\\scrapy\\imgaedownload"

Configure other settings as needed.
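The hard-coded Windows path above only works on the author's machine. As one possible alternative, the sketch below derives the image directory from a base directory with pathlib, so the same settings file works on any operating system. The helper name images_store_under and the folder name "imagedownload" are assumptions made for illustration, not part of the original project.

```python
from pathlib import Path


def images_store_under(base_dir, folder_name="imagedownload"):
    """Return an absolute image directory path as a string (nothing is created on disk)."""
    return str(Path(base_dir).resolve() / folder_name)


# In settings.py this could be used as, for example:
#   IMAGES_STORE = images_store_under(Path(__file__).parent)
# The current working directory is used here so the sketch runs anywhere.
IMAGES_STORE = images_store_under(Path.cwd())
print(IMAGES_STORE)
```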

2. Write items.py

This file declares the fields we want to store.

    import scrapy


    class WallhavenspiderItem(scrapy.Item):
        # define the fields for your item here like:
        # Image ID
        imageId = scrapy.Field()
        # Thumbnail URL
        imageThumbnailUrl = scrapy.Field()
        # Resolution
        imageSize = scrapy.Field()
        # Download URL of the full-size image
        imageDownloadUrl = scrapy.Field()
        # URL of the image's tag page
        imageTagUrl = scrapy.Field()
        # Local path where the image is saved
        # imagePath = scrapy.Field()

We will write pipelines.py once the spider is finished.

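(3) Create the spider

Generating the spider (for example with scrapy genspider imageInfoDownload alpha.wallhaven.cc) creates a spider template under the spiders directory.

1. Working out how to crawl the site

First, fix the start URL: https://alpha.wallhaven.cc/random

Go to the next page and watch how the URL changes to identify the pagination parameter:

https://alpha.wallhaven.cc/random?page=x

Pagination is driven by the page parameter, so we can crawl each page by varying its value.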
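As a small illustration of the pagination scheme, the sketch below builds the listing URL for a given page number and for a range of pages. The function names page_url and page_urls are hypothetical helpers, not part of the original project; only the base URL and the page parameter come from the article.

```python
from urllib.parse import urlencode

BASE_URL = "https://alpha.wallhaven.cc/random"


def page_url(page):
    """Return the listing URL for one page number."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


def page_urls(first, last):
    """Return the listing URLs for pages first..last, inclusive."""
    return [page_url(n) for n in range(first, last + 1)]


for url in page_urls(1, 3):
    print(url)
```

A spider can either pass such a list as start_urls or, as in the spider listing in this article, request the next page from inside parse.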

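The image markup was already analyzed at the start of the article.

Now we use XPath to pull out each piece of image information.

(1) The whole tag for one image:

//figure[@data-wallpaper-id="ID"]

Here ID stands for the image's identifier. Later we will collect every image ID on a page and crawl page by page.

For now, take a single ID to test with.

(2) Thumbnail URL

//figure[@data-wallpaper-id="378330"]/img/@data-src

(3) Resolution

//figure[@data-wallpaper-id="378330"]/div/span/text()

(4) Full-size image URL

The full-size image URLs turn out to share a common prefix; only the image ID changes.

So we can build the full-size URL by concatenating that prefix with the image ID.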
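The concatenation can be sketched as a small helper. The prefix is the one that appears in the spider listing in this article; the function name build_download_url and the example.com thumbnail URL are hypothetical. Like the spider, the sketch assumes the full-size image has the same file extension as its thumbnail, which may not hold for every image.

```python
# Common prefix of the full-size image URLs, as used in the spider listing.
FULL_IMAGE_PREFIX = "https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-"


def build_download_url(image_id, thumbnail_url):
    """Build the full-size image URL from the image ID and the thumbnail's extension."""
    # Assumption: the full-size image shares the thumbnail's file extension.
    suffix = thumbnail_url.rsplit(".", 1)[-1]
    return f"{FULL_IMAGE_PREFIX}{image_id}.{suffix}"


print(build_download_url("634130", "https://example.com/thumb/th-634130.jpg"))
```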

2. Write the spider file

    # -*- coding: utf-8 -*-
    import scrapy

    from wallhavenSpider.items import WallhavenspiderItem


    class ImageinfodownloadSpider(scrapy.Spider):
        """Crawl image information."""

        name = 'imageInfoDownload'
        allowed_domains = ['alpha.wallhaven.cc']

        # Build the URL of the first listing page
        url = 'https://alpha.wallhaven.cc/random?page='
        offset = 1
        # Assumed value: the page limit was lost from the original listing
        max_page = 10
        reqUrl = url + str(offset)
        start_urls = [reqUrl]

        def parse(self, response):
            """
            Parse one listing page.

            :param response: response for a listing page
            :return: one item per image, then a request for the next page
            """
            image_ids = response.xpath("//figure/@data-wallpaper-id").getall()
            for image_id in image_ids:
                # Create a new item for each image
                item = WallhavenspiderItem()
                item['imageId'] = image_id

                # Select the figure tag that belongs to this image ID
                figure = response.xpath('//figure[@data-wallpaper-id="%s"]' % image_id)
                # Thumbnail URL
                item['imageThumbnailUrl'] = figure.xpath("./img/@data-src").get()
                # Resolution
                item['imageSize'] = figure.xpath("./div/span/text()").get()
                # URL of the image's tag page
                item['imageTagUrl'] = figure.xpath('./div/a[@title="Tags"]/@href').get()

                # Full-size URLs look like
                # https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-634130.jpg
                # Reuse the thumbnail's file extension
                img_suffix = item['imageThumbnailUrl'].split('.')[-1]
                item['imageDownloadUrl'] = (
                    'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-'
                    + image_id + '.' + img_suffix
                )

                # Hand the item over to the pipeline
                yield item

            # Pagination: after a page is processed, request the next one
            # until the page limit is reached
            if self.offset < self.max_page:
                self.offset += 1
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

Run the spider to test it: scrapy crawl imageInfoDownload

Check that it crawls correctly.

(4) Write pipelines.py

pipelines.py is where items (the scraped data) are processed.

Our requirement here is to download images.

Scrapy ships with a pipeline for downloading images: ImagesPipeline.

To import it:

    from scrapy.pipelines.images import ImagesPipeline

ImagesPipeline provides two methods that we override to support downloading:

get_media_requests and item_completed

Here is the implementation:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import os

    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.project import get_project_settings


    class WallhavenspiderPipeline(ImagesPipeline):
        # Read the image directory configured in settings.py
        IMAGE_SOURCE = get_project_settings().get('IMAGES_STORE')

        def get_media_requests(self, item, info):
            # Request the full-size image URL built by the spider
            image_url = item['imageDownloadUrl']
            yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            # results is a list of (success, info) tuples;
            # keep the stored paths of the successful downloads
            image_paths = [x["path"] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Image download failed for %s" % item["imageId"])

            # Rename the downloaded file after the image ID
            old_path = os.path.join(self.IMAGE_SOURCE, image_paths[0])
            new_path = os.path.join(self.IMAGE_SOURCE, item["imageId"] + ".jpg")
            os.rename(old_path, new_path)

            # Record where the file now lives
            item['imageDownloadUrl'] = new_path
            return item

The os.rename call renames the downloaded file after its image ID.

These two methods can largely be reused as template code; only a few parameters need to change.
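To make the data handled in item_completed more concrete, here is a minimal sketch that runs without Scrapy. It mimics the list of (success, info) tuples that the method receives and performs the same rename in a temporary directory. The helper names, the fake results list, and the file name 0a1b2c.jpg are all hypothetical; in a real run the failure entry is a Twisted Failure object rather than a plain Exception.

```python
import os
import tempfile


def successful_paths(results):
    """Return the stored paths of the downloads that succeeded."""
    return [info["path"] for ok, info in results if ok]


def rename_to_id(store_dir, relative_path, image_id):
    """Rename store_dir/relative_path to store_dir/<image_id>.jpg and return the new path."""
    source = os.path.join(store_dir, relative_path)
    target = os.path.join(store_dir, image_id + ".jpg")
    os.replace(source, target)
    return target


# Hypothetical results: one successful download and one failure.
fake_results = [
    (True, {"url": "https://example.com/a.jpg", "path": "full/0a1b2c.jpg"}),
    (False, Exception("download failed")),
]

with tempfile.TemporaryDirectory() as store:
    # Create an empty file where the pipeline would have stored the image.
    os.makedirs(os.path.join(store, "full"))
    open(os.path.join(store, "full", "0a1b2c.jpg"), "wb").close()

    paths = successful_paths(fake_results)
    new_path = rename_to_id(store, paths[0], "378330")
    renamed_ok = os.path.exists(new_path)

print(paths, os.path.basename(new_path), renamed_ok)
```

os.replace is used instead of os.rename because it overwrites an existing target on every platform, which avoids an error when the same image ID is downloaded twice.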

(5) Summary

Scrapy's built-in support for downloading images is convenient; you can also write your own download logic.

For details, see: http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/images.html

Source code: https://github.com/gengzi/wallhavenSpider

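Published: 2018-04-18 18:23:30
Original link: http://kuaibao.qq.com/s/20180418G1C0YP00?refer=cp_1026
The Tencent Cloud Developer Community is one of the distribution channels for Tencent Content Open Platform accounts (企鹅号); this content is republished under the Tencent Content Open Platform Service Agreement.
In case of infringement, contact cloudcommunity@tencent.com to request removal.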
