
How do I make this Spider export a JSON file for each list of items?

Stack Overflow user
Asked on 2018-06-17 02:36:56
2 Answers · 462 Views · 0 Followers · 0 Votes

In my file Reddit.py below, I have this spider:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'Reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://old.reddit.com']

    def parse(self, response):

        for link in response.css('li.first a.comments::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        topics = {}
        topics["title"] = response.css('a.title::text').extract_first()
        topics["author"] = response.css('p.tagline a.author::text').extract_first()

        if response.css('div.score.likes::attr(title)').extract_first() is not None:
            topics["score"] = response.css('div.score.likes::attr(title)').extract_first()
        else:
            topics["score"] = "0"

        if int(topics["score"]) > 10000:
            author_url = response.css('p.tagline a.author::attr(href)').extract_first()
            yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
        else:
            yield topics

    def parse_user(self, response):
        topics = response.meta.get('topics')

        users = {}
        users["name"] = topics["author"]
        users["karma"] = response.css('span.karma::text').extract_first()

        yield users
        yield topics

What it does is take all the URLs from the front page of old.reddit, then scrape each URL's title, author, and score.

I have added a second part that checks whether the score is above 10000; if it is, the spider goes to the user's page and scrapes their karma from it.

I do know that I could scrape the karma from the topic's page, but I want to do it this way, because there are other parts of the user page I scrape that do not exist on the topic's page.

What I want to do is export the topics list, which contains title, author, score, to a JSON file named topics.json, and then, if the topic's score is above 10000, export the users list, which contains name, karma, to a JSON file named users.json.

I only know how to use the command line:

scrapy runspider Reddit.py -o Reddit.json

which exports all the lists into a single JSON file named Reddit.json, but its structure is poor, like this:

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
  ....
]

I am completely lost on Scrapy's Item Pipeline, Item Exporters & Feed Exporters, on how to implement them in my spider, and on how to use them in general. I have tried to understand them from the documentation, but I can't seem to figure out how to use them in my spider.

The end result I want is two files:

topics.json

[
 {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
 {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
 {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
 {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
 ....
]

users.json

[
  {"name": "Username", "karma": "00000"},
  {"name": "Username2", "karma": "00000"},
  {"name": "Username3", "karma": "00000"},
  ....
]

while also removing the duplicates from the lists.


2 Answers

Stack Overflow user

Accepted answer

Posted on 2018-06-19 19:19:40

Applying the approach from the following thread:

Export scrapy items to different files

I created a sample scraper:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {"type": "unknown item"}
        yield {"title": "ExampleTitle1", "author": "Username", "score": "11000"}
        yield {"name": "Username", "karma": "00000"}
        yield {"name": "Username2", "karma": "00000"}
        yield {"someothertype": "unknown item"}

        yield {"title": "ExampleTitle2", "author": "Username2", "score": "12000"}
        yield {"title": "ExampleTitle3", "author": "Username3", "score": "13000"}
        yield {"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
        yield {"name": "Username3", "karma": "00000"}

Then in exporters.py:

from scrapy.exporters import JsonItemExporter
from scrapy.extensions.feedexport import FileFeedStorage


class JsonMultiFileItemExporter(JsonItemExporter):
    # Item types that get their own output file (<itemtype>.json).
    types = ["topics", "users"]

    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        self.files = {}
        self.kwargs = kwargs

        # Open one additional exporter per known item type.
        for itemtype in self.types:
            storage = FileFeedStorage(itemtype + ".json")
            file = storage.open(None)
            self.files[itemtype] = JsonItemExporter(file, **self.kwargs)

    def start_exporting(self):
        super().start_exporting()
        for exporter in self.files.values():
            exporter.start_exporting()

    def finish_exporting(self):
        super().finish_exporting()
        for exporter in self.files.values():
            exporter.finish_exporting()
            exporter.file.close()

    def export_item(self, item):
        # Route the item to a file based on the keys it carries.
        if "title" in item:
            itemtype = "topics"
        elif "karma" in item:
            itemtype = "users"
        else:
            itemtype = "self"

        if itemtype == "self" or itemtype not in self.files:
            # Unrecognized items fall through to the default feed file.
            super().export_item(item)
        else:
            self.files[itemtype].export_item(item)

Add the following to settings.py:

FEED_EXPORTERS = {
    'json': 'testing.exporters.JsonMultiFileItemExporter',
}
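As a side note beyond the original answer: if you are on a newer Scrapy release (2.6+, if I recall correctly), the built-in FEEDS setting can route items to separate files without a custom exporter, via per-feed item_classes filters. This sketch assumes the spider yields scrapy.Item subclasses (the names myproject.items.TopicItem and myproject.items.UserItem are hypothetical) instead of plain dicts:

```python
# settings.py -- sketch for Scrapy 2.6+; TopicItem and UserItem are
# hypothetical scrapy.Item subclasses that the spider would yield.
FEEDS = {
    "topics.json": {
        "format": "json",
        # Only items of this class are written to this feed.
        "item_classes": ["myproject.items.TopicItem"],
    },
    "users.json": {
        "format": "json",
        "item_classes": ["myproject.items.UserItem"],
    },
}
```

With this in place, no FEED_EXPORTERS override is needed; Scrapy filters each yielded item into the matching feed on its own.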

Running the scraper, I get three files generated:

example.json

[
{"type": "unknown item"},
{"someothertype": "unknown item"}
]

topics.json

[
{"title": "ExampleTitle1", "author": "Username", "score": "11000"},
{"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
{"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
{"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
]

users.json

[
{"name": "Username", "karma": "00000"},
{"name": "Username2", "karma": "00000"},
{"name": "Username3", "karma": "00000"}
]
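Note that the exporter above does not remove duplicates, which the question also asked about. One Scrapy-free way to handle that is a small post-processing helper that keeps the first item seen for each key (a sketch; the key fields "name" and "title" match the item shapes from the question):

```python
import json


def dedupe(items, key):
    """Keep only the first item for each distinct value of `key`."""
    seen = set()
    unique = []
    for item in items:
        value = item.get(key)
        if value not in seen:
            seen.add(value)
            unique.append(item)
    return unique


users = [
    {"name": "Username", "karma": "00000"},
    {"name": "Username", "karma": "00000"},  # duplicate, will be dropped
    {"name": "Username2", "karma": "00000"},
]

print(json.dumps(dedupe(users, "name"), indent=1))
```

The same helper works for topics with key="title". Running it over each output file after the crawl finishes keeps the exporter itself simple.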
Votes: 1

Stack Overflow user

Posted on 2018-06-17 03:29:07

Your spider yields two items when it crawls a user page. It might work if you do the following:

def parse_user(self, response):
    topics = response.meta.get('topics')

    users = {}
    users["name"] = topics["author"]
    users["karma"] = response.css('span.karma::text').extract_first()
    topics["users"] = users

    yield topics

You can post-process the JSON as needed.
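For instance, a post-processing step that splits the combined Reddit.json output into the two files the question asks for might look like this (a sketch; it classifies items by whether they carry a "title" or a "karma" key, matching the item shapes shown in the question):

```python
import json


def split_items(items):
    """Split a mixed item list into topics and users by their keys."""
    topics = [it for it in items if "title" in it]
    users = [it for it in items if "karma" in it]
    return topics, users


items = [
    {"name": "Username", "karma": "00000"},
    {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
]
topics, users = split_items(items)

with open("topics.json", "w") as f:
    json.dump(topics, f, indent=1)
with open("users.json", "w") as f:
    json.dump(users, f, indent=1)
```

In practice you would load `items` with `json.load(open("Reddit.json"))` after the crawl instead of the inline sample list.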

By the way, I don't understand why you use the plural ("topics") when you are processing a single element (a single "topic").

Votes: 0
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/50890686
