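I'm trying to scrape a page with Scrapy XPath, but when I use a for loop it doesn't seem to capture tags that have a predicate:
# This package will contain the spiders of your Scrapy project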
from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json


class CunyfristsectionSpider(scrapy.Spider):
    name = "cunyfirst-section-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]

    def parse(self, response):
        url = response.url
        yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        n = -1
        for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
            print(response.xpath("//a[@name ='MTG_CLASSNAME$10']/text()"))
            n += 1
            class_num = section.xpath('text()').extract_first()
            # print(class_num)
            classname = "MTG_CLASSNAME$" + str(n)
            date = "MTG_DAYTIME$" + str(n)
            instr = "MTG_INSTR$" + str(n)
            print(classname)
            class_name = response.xpath("//a[@name = classname]/text()")
I'm looking for a tag whose name is "MTG_CLASSNAME$" + str(n), where n is 0, 1, 2, ..., but the output of my XPath query is empty. I don't know why.
Thanks!
Answered 2018-06-02 06:37:30
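Well... I visited the site you put in the question description, used the element inspector to search for "MTG_CLASSNAME", and got 0 matches.
So here are some tools. Add these to your Scrapy settings: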
LOG_FILE = "log.txt"
LOG_STDOUT = True
Then print the response body (response.body) where it is needed (in this case at the top of the parse_page function) and look for it in log.txt.
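If the body is too long to read through in the log, a small helper can do the search for you. This is only a sketch: `count_marker` is a made-up name, not part of Scrapy, and it assumes the body is the raw bytes that `response.body` returns.

```python
def count_marker(body, marker):
    """Count how often `marker` occurs in a raw response body (bytes)."""
    # response.body is bytes, so the text marker is encoded before searching.
    return body.count(marker.encode("utf-8"))


# Hypothetical use at the top of parse_page:
#     print(count_marker(response.body, "MTG_CLASSNAME"))
# The sample below is made-up markup, used only to show the call.
sample = b'<a name="MTG_CLASSNAME$0">Intro</a><a name="MTG_CLASSNAME$1">Lab</a>'
print(count_marker(sample, "MTG_CLASSNAME"))
```

A count of 0 would mean the downloaded page really does not contain those names, so no XPath could match them.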
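Also, change
for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
to
for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]").extract():
This will raise an error once it gets the data you are looking for, because .extract() returns plain strings and the later section.xpath(...) call no longer works on them.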
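Separately from the debugging steps above, one thing worth checking in the question's code: in "//a[@name = classname]/text()" the word classname is part of the XPath string, so XPath reads it as a child element named classname rather than as the Python variable. The sketch below shows one way to put the value into the query. `name_xpath` is a hypothetical helper, not a Scrapy function, and whether this alone fixes the empty output depends on the page actually containing those names.

```python
def name_xpath(prefix, index):
    """Build an XPath that selects the text of <a name="PREFIX$INDEX">."""
    # The Python values are formatted into the query as a quoted literal.
    # A bare word inside the query string would be treated by XPath as an
    # element name, not as a Python variable.
    return "//a[@name='{}${}']/text()".format(prefix, index)


# Hypothetical use inside parse_page, with n as the loop counter:
#     class_name = response.xpath(name_xpath("MTG_CLASSNAME", n)).extract_first()
print(name_xpath("MTG_CLASSNAME", 0))
```

Scrapy selectors also accept XPath variables, for example response.xpath("//a[@name=$name]/text()", name=classname), which avoids building the string by hand.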
https://stackoverflow.com/questions/50651458