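I'm building a web spider with Scrapy and have run into a problem: I'm scraping a set of HTML data, and it contains the id I need in order to send an AJAX request. However, when I try to combine the data from that request with the other data obtained from the HTML, it comes out wrong. How can I fix this? Here is my code: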

    import json

    import scrapy

    # Review is the asker's scrapy.Item subclass (its definition is not shown).


    class DoubanSpider(scrapy.Spider):
        name = "douban"
        allowed_domains = ["movie.douban.com"]
        start_urls = ["https://movie.douban.com/review/best"]

        def parse(self, response):
            for review in response.css(".review-item"):
                rev = Review()
                rev['reviewer'] = review.css("a[property='v:reviewer']::text").extract_first()
                rev['rating'] = review.css("span[property='v:rating']::attr(class)").extract_first()
                rev['title'] = review.css(".main-bd>h2>a::text").extract_first()
                number = review.css("::attr(id)").extract_first()
                f = scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number,
                                   callback=self.parse_full_passage)
                rev['comment'] = f
                yield rev

        def parse_full_passage(self, response):
            r = json.loads(response.body_as_unicode())
            html = r['html']
            yield html

Answered on 2018-02-09 14:18:36
You need to finish parsing the HTML first, then pass the item on to the JSON callback via meta:

    yield scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number,
                         callback=self.parse_full_passage, meta={'rev': rev})

Then, in the JSON callback:

    def parse_full_passage(self, response):
        rev = response.meta["rev"]
        r = json.loads(response.body_as_unicode())
        .....
        yield rev

Answered on 2018-02-09 09:30:30
I would try this:

    response = scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number)
    jsonresponse = json.loads(response.body_as_unicode())
    rev['comment'] = jsonresponse['html']

You may need to extract the content from the html field, if necessary. Alternatively, work with this URL directly.
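
The `html` field in the JSON response is itself an HTML fragment, so the comment may need its tags stripped before it is stored on the item. The following is a minimal sketch of that step using only the standard library. The helper names `html_to_text` and `fill_comment` are hypothetical, and the sample payload is invented for illustration; the real response may carry other fields.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with runs of whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return " ".join("".join(collector.parts).split())


def fill_comment(item, body_text):
    """Parse the JSON body and store the plain-text review on the item."""
    payload = json.loads(body_text)
    item["comment"] = html_to_text(payload["html"])
    return item


# Invented sample body, shaped like the {"html": ...} payload read above.
sample_body = json.dumps({"html": "<p>Great <b>film</b>, would watch again.</p>"})
rev = fill_comment({"title": "Example"}, sample_body)
```

Inside the spider, a callback such as `parse_full_passage` could call `fill_comment(response.meta["rev"], response.text)` and then yield the returned item. Recent Scrapy versions expose the decoded body as `response.text`, which replaces the older `body_as_unicode()`.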
https://stackoverflow.com/questions/48702370