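I am trying to generate data columns from the HTML metadata of a job listing site. I use Scrapy Item Loaders to clean the HTML string and convert the metadata into a JSON object. I then want to use the information contained in that JSON to populate other fields in my spider.
Here is the spider so far, which crawls the 100 most recent job listings: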
    import scrapy, json
    from ..items import EthjobsScrapyItem, EthJobsLoader

    class EthioJobsSpider(scrapy.Spider):
        name = "EthioJobs"
        allowed_domains = ["ethiojobs.net"]
        start_urls = ["http://www.ethiojobs.net/latest-jobs-in-ethiopia/?searchId=1573067457.6526&action=search&page=1&listings_per_page=100&view=list"]

        def parse(self, response):
            for listing_url in response.xpath('/html/body/div[4]/section/div/div/div/div[4]/div/div[1]/div[4]/div/div/div/table/tbody//@href').getall():
                yield response.follow(listing_url, callback=self.parse_listing)

        def parse_listing(self, response):
            loader = EthJobsLoader(item = EthjobsScrapyItem(), response=response)
            loader.add_xpath('JSON_LD', '//script[@type="application/ld+json"]/text()')
            yield loader.load_item()
where items.py is:
    import scrapy, re, json
    from scrapy.loader import ItemLoader

    class EthjobsScrapyItem(scrapy.Item):
        JSON_LD = scrapy.Field()
        datePosted = scrapy.Field() # an example of a field that would populate data from the JSON data

    def cleanJsonVar(self, jsonvar): # Clean HTML markup
        for TEXT in jsonvar:
            if jsonvar:
                try:
                    jsonvar = re.sub(r"\r+|\n+|\t+| | |amp;|</?.{,6}>", " ", TEXT).strip()
                    jsonvar = re.sub(r"Job\sDescription", "", jsonvar)
                    jsonvar = re.sub(r"\A\s+", "", jsonvar)
                    jsonvar = re.sub(r"( ){2,}", r" ", jsonvar)
                    jsonvar = re.sub(r"\u2019", r" '", jsonvar)
                    jsonvar = re.sub(r"\u25cf", r" -", jsonvar)
                    jsonvar = re.sub(r"\\",r"/", jsonvar)
                except Exception as e:
                    jsonvar = None
                    print("ERROR: ", str(e))
            else:
                pass
            return jsonvar

    def intoJsonVar(self, jsonvar): # Convert from string to JSON
        for TEXT in jsonvar:
            return json.loads(TEXT)

    class EthJobsLoader(ItemLoader):
        JSON_LD_in = cleanJsonVar
        JSON_LD_out = intoJsonVar
The JSON_LD output from the spider looks like this:
    {'JSON_LD': ["{
        '@context': 'http://schema.org/',
        '@type': 'JobPosting',
        'title': 'Terms of Reference',
        'description': ' Terms of Reference for developing General Management Plan...,'
        'identifier': {
            '@type': 'PropertyValue',
            'name': 'Population Health and Environment – Ethiopia Consortium (PHE EC)',
            'value': '65264'
        },
        'datePosted': '2019-12-10 04:13:31',
        'validThrough': '2019-12-20 23:59:59',
        'employmentType': 'Full Time',
        'hiringOrganization': {
            '@type': 'Organization',
            'name': 'Population Health and Envir...'
        },
        'jobLocation': {
            '@type': 'Place',
            'address': {
                '@type': 'PostalAddress',
                'addressLocality': 'ETH Region',
                'addressRegion': ' Addis Ababa ',
                'addressCountry': 'ETH'
            }
        }
    }"]
    }
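My question is: how can I take the information from the JSON above and use it to populate new fields in my spider?
Any and all input/comments are more than welcome!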
Answered on 2019-12-11 15:41:04
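First, you will probably want to flatten [1] your JSON-LD data, because a scrapy.Item should be flat, or at least nested only with other scrapy.Items. You should also drop the "private" variables (the keys that start with @), since @ is not a valid character in Python variable names and Item fields are declared as class attributes. That is, go from: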
    {
        '@context': 'http://schema.org/',
        '@type': 'JobPosting',
        'title': 'Terms of Reference',
        'description': ' Terms of Reference for developing General Management Plan...,'
        'identifier': {
            '@type': 'PropertyValue',
            'name': 'Population Health and Environment – Ethiopia Consortium (PHE EC)',
            'value': '65264'
        }
to:
    {
        'title': 'Terms of Reference',
        'description': ' Terms of Reference for developing General Management Plan...,'
        'identifier_name': 'Population Health and Environment – Ethiopia Consortium (PHE EC)',
        'identifier_value': '65264'
    },
You would then have an item:
    from scrapy import Item, Field

    class MyItem(Item):
        # one flat field per flattened JSON-LD key
        title = Field()
        description = Field()
        identifier_name = Field()
        identifier_value = Field()
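The flattening step described above could be sketched roughly as follows. This is a minimal illustration, not the answer author's code: the helper name `flatten_json_ld` and the `_` separator are assumptions, chosen so that the resulting keys match the field names of the item shown above. It uses only the standard library.

```python
def flatten_json_ld(data, parent_key="", sep="_"):
    """Flatten nested dicts, joining keys with `sep` and skipping '@' keys."""
    flat = {}
    for key, value in data.items():
        if key.startswith("@"):
            # Keys such as @type and @context cannot become field names
            continue
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along
            flat.update(flatten_json_ld(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Shortened sample modelled on the JSON-LD shown in the question
json_ld = {
    "@context": "http://schema.org/",
    "@type": "JobPosting",
    "title": "Terms of Reference",
    "identifier": {
        "@type": "PropertyValue",
        "name": "Population Health and Environment – Ethiopia Consortium (PHE EC)",
        "value": "65264",
    },
}

flat = flatten_json_ld(json_ld)
print(flat)
```

With this sample the result contains only the keys title, identifier_name and identifier_value, which line up with the fields declared on the item.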
Finally, you can merge multiple items by re-creating the item object from a merged dictionary [2]:
    # inside a spider callback
    first = MyItem()
    first['title'] = 'foo'
    json_ld = {
        'description': 'bar'
    }
    yield MyItem({**first, **json_ld})
    # {'title': 'foo', 'description': 'bar'}
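One caveat with the merge above: a Scrapy item raises a KeyError when it is given a key that was not declared as a field, so it can help to filter the parsed JSON-LD down to the declared fields first. The sketch below shows that idea with plain dictionaries and the standard library only. The names `DECLARED_FIELDS` and `fields_from_json_ld` are hypothetical; in a real spider the set of names could presumably be taken from `MyItem.fields` instead of being hard-coded.

```python
import json

# Hypothetical stand-in for the field names declared on the item
DECLARED_FIELDS = {"title", "description", "datePosted", "validThrough"}


def fields_from_json_ld(raw_json, declared_fields):
    """Parse a JSON-LD string and keep only the keys the item declares."""
    data = json.loads(raw_json)
    return {key: value for key, value in data.items() if key in declared_fields}


# Sample string modelled on the JSON-LD shown in the question
raw = json.dumps({
    "@type": "JobPosting",
    "title": "Terms of Reference",
    "datePosted": "2019-12-10 04:13:31",
    "validThrough": "2019-12-20 23:59:59",
    "employmentType": "Full Time",
})

first = {"title": "foo", "description": "bar"}  # values scraped elsewhere
# Later dictionaries win on duplicate keys, so the JSON-LD title replaces 'foo'
merged = {**first, **fields_from_json_ld(raw, DECLARED_FIELDS)}
print(merged)

# In a Scrapy callback the merged dict could then be passed to the item, e.g.
# yield MyItem(merged)
```

Here employmentType and @type are dropped because they are not in the declared set, while datePosted and validThrough are carried over into the merged result.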
[1] There are many dictionary-flattening functions and explanations on Stack Overflow, for example: Flatten nested dictionaries, compressing keys
[2] scrapy.Item is essentially a dict-like extension of a Python dictionary, so it can be merged with any other Python dictionary. For merging dictionaries, see: How do I merge two dictionaries in a single expression?
https://stackoverflow.com/questions/59278288