
Q: In Scrapy, how do I use a JSON-loaded item to populate new fields?

Stack Overflow user

Asked 2019-12-11 10:57:30

Answers: 1 · Views: 780 · Followers: 0 · Votes: 0

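I'm trying to generate data columns from the HTML metadata of a job listing site. I use Scrapy Item Loaders to clean the HTML string and convert the metadata into a JSON object. I'd then like to use the information contained in that JSON to populate other fields in my spider.

Here is the spider so far, which crawls the 100 most recent job listings: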

import scrapy, json
from ..items import EthjobsScrapyItem, EthJobsLoader

class EthioJobsSpider(scrapy.Spider):
    name = "EthioJobs"
    allowed_domains = ["ethiojobs.net"]
    start_urls = ["http://www.ethiojobs.net/latest-jobs-in-ethiopia/?searchId=1573067457.6526&action=search&page=1&listings_per_page=100&view=list"]

    def parse(self, response):
        for listing_url in response.xpath('/html/body/div[4]/section/div/div/div/div[4]/div/div[1]/div[4]/div/div/div/table/tbody//@href').getall():
            yield response.follow(listing_url, callback=self.parse_listing)

    def parse_listing(self, response):
        loader = EthJobsLoader(item = EthjobsScrapyItem(), response=response)
        loader.add_xpath('JSON_LD', '//script[@type="application/ld+json"]/text()')

        yield loader.load_item()

where items.py is:

import scrapy, re, json
from scrapy.loader import ItemLoader

class EthjobsScrapyItem(scrapy.Item):
    JSON_LD     = scrapy.Field()
    datePosted  = scrapy.Field() # an example of a field that would populate data from the JSON data


def cleanJsonVar(self, jsonvar): # Clean HTML markup
    for TEXT in jsonvar:
        if jsonvar:
            try:
                jsonvar = re.sub(r"\r+|\n+|\t+|  |&nbsp;|amp;|</?.{,6}>", " ", TEXT).strip()
                jsonvar = re.sub(r"Job\sDescription", "", jsonvar)
                jsonvar = re.sub(r"\A\s+", "", jsonvar) 
                jsonvar = re.sub(r"( ){2,}", r" ", jsonvar)
                jsonvar = re.sub(r"\u2019", r" '", jsonvar)
                jsonvar = re.sub(r"\u25cf", r" -", jsonvar)
                jsonvar = re.sub(r"\\",r"/", jsonvar)

            except Exception as e:
                jsonvar = None
                print("ERROR: ", str(e))
        else:
            pass
        return jsonvar

def intoJsonVar(self, jsonvar): # Convert from string to JSON
    for TEXT in jsonvar: 
        return json.loads(TEXT)


class EthJobsLoader(ItemLoader):
    JSON_LD_in  =  cleanJsonVar
    JSON_LD_out =  intoJsonVar

The JSON_LD output from the spider looks like this:

{'JSON_LD': ["{
    '@context': 'http://schema.org/',
    '@type': 'JobPosting',
    'title': 'Terms of Reference',
    'description': ' Terms of Reference for developing General Management Plan...,'
    'identifier': {
        '@type': 'PropertyValue',
        'name': 'Population Health and Environment â€“ Ethiopia Consortium (PHE EC)',
        'value': '65264'
    },
    'datePosted': '2019-12-10 04:13:31',
    'validThrough': '2019-12-20 23:59:59',
    'employmentType': 'Full Time',
    'hiringOrganization': {
        '@type': 'Organization',
        'name': 'Population Health and Envir...'
    },
    'jobLocation': {
        '@type': 'Place',
        'address': {
            '@type': 'PostalAddress',
            'addressLocality': 'ETH Region',
            'addressRegion': ' Addis Ababa ',
            'addressCountry': 'ETH'
        }
    }
}"]
}

My question is: how do I take the information from the JSON above and use it to populate new fields in my spider?

Any and all input/comments are welcome!

Tags: python, json, scrapy, json-ld

1 Answer

Stack Overflow user

Answered 2019-12-11 15:41:04

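First, you might want to flatten [1] your JSON-LD, because a scrapy.Item should be flat, or at least nested only with other scrapy.Items. You should also drop the "private" keys (the ones starting with @, which are JSON-LD keywords such as @context and @type), since @ is not a valid character in a Python variable name and item fields are declared as class attributes. That is, go from: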

{
    '@context': 'http://schema.org/',
    '@type': 'JobPosting',
    'title': 'Terms of Reference',
    'description': ' Terms of Reference for developing General Management Plan...',
    'identifier': {
        '@type': 'PropertyValue',
        'name': 'Population Health and Environment â€“ Ethiopia Consortium (PHE EC)',
        'value': '65264'
    }
}

to:

{
    'title': 'Terms of Reference',
    'description': ' Terms of Reference for developing General Management Plan...',
    'identifier_name': 'Population Health and Environment â€“ Ethiopia Consortium (PHE EC)',
    'identifier_value': '65264'
}
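
The flattening step described above could be sketched roughly as follows. This is a minimal sketch, not code from the original answer: the helper name `flatten_json_ld`, the underscore separator, and the rule of skipping every key that starts with `@` are assumptions made for illustration. Lists are left untouched, which may or may not suit your data.

```python
def flatten_json_ld(data, parent_key="", sep="_"):
    """Flatten nested dicts into one level, skipping '@'-prefixed keys.

    Hypothetical helper: the name, separator and skipping rule are
    illustrative assumptions, not part of Scrapy.
    """
    flat = {}
    for key, value in data.items():
        if key.startswith("@"):
            continue  # drop JSON-LD keywords such as @context and @type
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse and prefix child keys with the parent key
            flat.update(flatten_json_ld(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


sample = {
    "@context": "http://schema.org/",
    "@type": "JobPosting",
    "title": "Terms of Reference",
    "identifier": {"@type": "PropertyValue", "value": "65264"},
    "datePosted": "2019-12-10 04:13:31",
}
flat = flatten_json_ld(sample)
print(flat)
```

Joining parent and child keys with an underscore produces names like `identifier_value`, which are valid Python identifiers and can therefore be declared as fields on an item.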

You would then have an item:

from scrapy import Item, Field

class MyItem(Item):
    title = Field()
    description = Field()
    identifier_name = Field()
    identifier_value = Field()

Finally, you can merge multiple items by recreating the item object from a merged dictionary [2]:

first = MyItem()
first['title'] = 'foo'
json_ld = {
    'description': 'bar'
}
yield MyItem({**first, **json_ld})
# {'title': 'foo', 'description': 'bar'}
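
One caveat with the merge above: a scrapy.Item raises a KeyError for any key that was not declared as a Field, so the parsed JSON-LD probably needs to be reduced to the declared field names before merging. Below is a minimal standard-library sketch of that idea. `populate_fields` and `declared_fields` are hypothetical names; in a real spider you would presumably pass `MyItem.fields` as `declared_fields` and wrap the result in `MyItem(...)`.

```python
import json


def populate_fields(raw_json_ld, declared_fields, base=None):
    """Parse a JSON-LD string and keep only the keys the item declares.

    Hypothetical helper: returns a plain dict suitable for passing to an
    item constructor.
    """
    data = json.loads(raw_json_ld)
    picked = {k: v for k, v in data.items() if k in declared_fields}
    # Later keys win, so parsed values override anything already in `base`
    return {**(base or {}), **picked}


raw = (
    '{"@type": "JobPosting", "title": "Terms of Reference", '
    '"datePosted": "2019-12-10 04:13:31"}'
)
fields = populate_fields(raw, {"title", "datePosted"}, base={"title": "foo"})
print(fields)
```

Here `datePosted` is the example field from the question; undeclared keys such as `@type` are silently dropped rather than triggering an error.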

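[1] There are many dictionary-flattening functions and explanations on Stack Overflow, for example: Flatten nested dictionaries, compressing keys

[2] scrapy.Item behaves like a Python dictionary (it implements the mapping interface), so it can be merged with any other Python dictionary. For merging dictionaries, see: How do I merge two dictionaries in a single expression?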

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/59278288
