
Scrapy image download

Stack Overflow user
Asked on 2016-08-05 00:22:24
4 answers · 20.5K views · 0 followers · 6 votes

My spider runs without showing any errors, but the images are not being stored in the folder. Here are my scrapy files:

Spider.py:

import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(item, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''

        return item

settings.py:

from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'production'

SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'

ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'

CONCURRENT_REQUESTS = 250

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}

items.py

# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

    pass

My pipelines file is empty, and I am not sure what I am supposed to add to pipelines.py.

Any help is greatly appreciated.


4 Answers

Stack Overflow user

Accepted answer

Posted on 2016-08-05 00:42:41

Since you don't know what to put in the pipeline, I assume you can use the default images pipeline that Scrapy provides, so in your settings.py file declare it like this (note that the current import path is scrapy.pipelines.images.ImagesPipeline, not the old scrapy.contrib path used in your settings):

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

Also, your images path is wrong: a leading / means the absolute root of your machine, so you either put an absolute path to wherever you want to save the images, or a path relative to where you run the crawler (and note that the setting must be named IMAGES_STORE, not IMAGE_STORE). For example, an absolute path:

IMAGES_STORE = '/home/user/Documents/scrapy_project/images'

or a relative path:

IMAGES_STORE = 'images'

Now, in the spider you are extracting the url, but you aren't saving it into the item:

# the default pipeline expects image_urls to be a list of URLs
item['image_urls'] = [sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract_first()]

If you are using the default pipeline, the field has to be named image_urls.

Now, in your items.py file you need to add the following two fields (both have to use these literal names):

image_urls = scrapy.Field()
images = scrapy.Field()
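
As a side note (not part of the original answer): if you ever want the pipeline to read from or write to differently named fields, Scrapy's images pipeline can be pointed at custom fields through two settings. A minimal sketch, shown here with the default names:

# Optional settings for the images pipeline; the values below are the
# defaults, so this only matters if you rename the item fields.
IMAGES_URLS_FIELD = 'image_urls'     # field the pipeline reads URLs from
IMAGES_RESULT_FIELD = 'images'       # field the pipeline writes results to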

That should work.
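
For reference, when the default pipeline runs successfully it fills the images field with one dict per downloaded file; a rough sketch of the shape it takes (the URL and hashes below are made up for illustration):

# Typical content of item['images'] after a successful crawl:
[
    {
        'url': 'http://someurl.com/photos/1.jpg',                      # original image URL
        'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',   # path relative to IMAGES_STORE
        'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',                # MD5 of the downloaded file
    },
]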

Votes: 7

Stack Overflow user

Posted on 2016-08-07 09:07:19

My final working result:

spider.py

import scrapy
import re
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem
from production.items import ImageItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["url"]
    start_urls = [
        "startingurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@idd="followclaslink"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseImages, meta={'item': item})

    def parseImages(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url])

Settings.py

BOT_NAME = 'production'

SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
IMAGES_STORE = '/Users/home/images'

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# Disable cookies (enabled by default)

items.py

# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()
# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
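
One caveat worth flagging here: the ITEM_PIPELINES setting above enables the stock scrapy.pipelines.images.ImagesPipeline, so the custom MyImagesPipeline defined in pipelines.py is never actually invoked, and item['image_paths'] has no matching field on ImageItem. A sketch of the two adjustments that would wire it in, assuming pipelines.py lives in the production package as the imports suggest:

# settings.py -- point ITEM_PIPELINES at the custom pipeline instead
ITEM_PIPELINES = {'production.pipelines.MyImagesPipeline': 1}

# items.py -- declare the extra field that item_completed() assigns
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()   # filled in by MyImagesPipeline.item_completed()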
Votes: 10

Stack Overflow user

Posted on 2018-09-26 15:56:02

In my case it was the IMAGES_STORE path that was causing the problem.

I set IMAGES_STORE = 'images' and it worked like a charm!

Here is the full code:

Settings:

ITEM_PIPELINES = {
   'mutualartproject.pipelines.MyImagesPipeline': 1,
}

IMAGES_STORE = 'images' 
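
With a relative IMAGES_STORE like this, the folder is created relative to the directory the crawl is launched from; assuming it is started from the project root, the resulting layout would look roughly like the sketch below (the hash file name is illustrative, Scrapy derives it from the image URL):

# mutualartproject/            <- project root, where scrapy.cfg lives
# ├── scrapy.cfg
# └── images/
#     └── full/
#         ├── 1d5e7fe9e4bbcadd4d3d0e7c3bfa79cdd2716a64.jpg
#         └── ...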

Pipeline:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
Votes: 1
Original content provided by Stack Overflow: https://stackoverflow.com/questions/38772662