I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the links to each issue follow this URL scheme:
http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number.
I only want issues 928 and up, and my rules are below. I have no problem connecting to the site, crawling links, or extracting items (so I haven't included the rest of my code). The spider nevertheless seems determined to follow links it shouldn't: it tries to scrape issues 377, 398, and more, and it follows the "culture.htm" and "feature.htm" links. This throws a lot of errors and, while not terribly important, means a lot of data cleaning. Any suggestions as to what is going wrong?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),
        Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(deny = ('culture.htm', )), ),
        Rule(SgmlLinkExtractor(deny = ('feature.htm', )), ),
    )
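As a quick sanity check (a standalone sketch with made-up example URLs, not code from the spider itself), the issue-number alternation in the first rule does select 928 and up on its own:

import re

# Issue numbers 928-929, 930-999, or any four digits (i.e. 1000 and above).
issue_pattern = re.compile(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm')

for url in ('http://example.com/2005/377/index.htm',    # too old: no match
            'http://example.com/2009/928/index.htm',    # first wanted issue: match
            'http://example.com/2011/1042/index.htm'):   # four-digit issue: match
    print(url, bool(issue_pattern.search(url)))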
Edit: I fixed this with a simpler regex for 2009, 2010, 2011, but I am still curious why the approach above does not work, if anyone has any suggestions.
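One plausible shape of that simpler, year-based rule (hypothetical; the post does not show the regex actually used) would be:

# Hypothetical reconstruction of the year-based fix; not the poster's actual pattern.
Rule(SgmlLinkExtractor(allow = ('(2009|2010|2011)/\d+/index\.htm', )), follow = True),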
Posted on 2011-12-17 14:09:43
You need to pass the deny argument to the SgmlLinkExtractor that collects the links to follow. And you don't need to create so many Rules if they all call the same function, parse_item. I would write your code as:
rules = (
    Rule(SgmlLinkExtractor(
            allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
            deny = ('culture\.htm', 'feature\.htm'),
        ),
        follow = True
    ),
    Rule(SgmlLinkExtractor(
            allow = (
                'fr[0-9].htm',
                'eg[0-9]*.htm',
                'ec[0-9]*.htm',
                'op[0-9]*.htm',
                'sc[0-9]*.htm',
                're[0-9]*.htm',
                'in[0-9]*.htm',
            )
        ),
        callback = 'parse_item',
    ),
)
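To see why the original stand-alone deny rules did not help: an extractor built with only a deny pattern has no allow restriction, so it matches every other link on the page, and a Rule without a callback follows those links by default. A minimal sketch illustrating this (it uses the modern LinkExtractor in place of the long-deprecated SgmlLinkExtractor, plus made-up example HTML; the allow/deny behaviour is the same):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b"""<html><body>
  <a href="/2005/377/index.htm">old issue</a>
  <a href="/2011/930/index.htm">new issue</a>
  <a href="/2011/930/culture.htm">culture</a>
</body></html>"""
response = HtmlResponse(url="http://example.com/issues.htm", body=html, encoding="utf-8")

# deny with no allow: everything except culture.htm is extracted,
# including the old issue -- this is what the stand-alone deny rules did.
deny_only = LinkExtractor(deny=(r'culture\.htm',))
print([link.url for link in deny_only.extract_links(response)])

# deny combined with allow on the same extractor: only the wanted issue remains.
combined = LinkExtractor(
    allow=(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm',),
    deny=(r'culture\.htm', r'feature\.htm'),
)
print([link.url for link in combined.extract_links(response)])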
If those are the real URL patterns in the rules you use for parse_item, this can be simplified to:
Rule(SgmlLinkExtractor(
        allow = ('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', ),
    ),
    callback = 'parse_item',
),
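A quick check of the combined pattern (a standalone sketch with made-up paths) shows it matches the same article pages as the seven separate patterns. Note that, like the originals, it is unanchored and SgmlLinkExtractor applies it with re.search, so the 're' embedded in 'culture.htm' and 'feature.htm' matches as well; anchoring the pattern, or adding the same deny list to this extractor, avoids that:

import re

article_pattern = re.compile(r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm')
for path in ('/2011/930/fr1.htm', '/2011/930/eg12.htm',
             '/2011/930/culture.htm', '/2011/930/feature.htm'):
    print(path, bool(article_pattern.search(path)))
# culture.htm and feature.htm also match because of the embedded "re";
# an anchored pattern such as r'/(fr|eg|ec|op|sc|re|in)[0-9]*\.htm$' rules them out.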
https://stackoverflow.com/questions/8537687