
Scrapy Traceback 302, list index error
Stack Overflow user
Asked on 2018-06-10 02:14:26
1 answer · 49 views · 0 following · 0 votes

I am trying to scrape articles for a particular tag, e.g. "machine-learning", in Python 2.7. I have the following code:

import scrapy
import codecs
import json
from datetime import datetime
from datetime import timedelta
import os

def writeTofile(fileName,text):
    with codecs.open(fileName,'w','utf-8') as outfile:
        outfile.write(text)

class MediumPost(scrapy.Spider):
    name='medium_scraper'
    handle_httpstatus_list = [401,400]    
    autothrottle_enabled=True


    def start_requests(self):        
        start_urls = ['https://medium.com/tag/'+self.tagSlug.strip("'")+'/archive/']
        print(start_urls)        
        #Header and cookie information can be got from the Network Tab in Developer Tools
        cookie = {'mhj': 'd4c630604c57a104af8bc98218fb3430145',
                                        'nj': '1',
                                        'ko': '1:J0mnan1t5jlHypyliL8GAY1WNfDvtqZBgmBDr+7STp2QSwyWUz6',
                                        'pi': '233',
                                        'mt': '-874'}
        header = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
        startDate=datetime.strptime(self.start_date,"%Y%m%d")
        endDate=datetime.strptime(self.end_date,"%Y%m%d")
        delta=endDate-startDate
        print(delta)
        for i in range(delta.days + 1):
            d=datetime.strftime(startDate+timedelta(days=i),'%Y/%m/%d')
            for url in start_urls:
                print(url+d)
                yield scrapy.Request(url+d, method="GET",headers=header,cookies=cookie,callback=self.parse,meta={'reqDate':d})
    
    def parse(self,response):
        response_data=response.text
        response_split=response_data.split("while(1);</x>")
        response_data=response_split[1]
        date_post=response.meta['reqDate']
        date_post=date_post.replace("/","")
        directory=datetime.now().strftime("%Y%m%d")
        if not os.path.exists(directory):
            os.makedirs(directory)
        writeTofile(directory+"//"+self.tagSlug.replace("-","").strip("'")+"Tag"+date_post+".json",response_data)

A log message says:

scrapy.core.engine] DEBUG: Crawled (200) <GET https://medium.com/tag/machine-learning/archive/2015/07/13> (referer: None)

NotImplementedError: MediumPost.parse callback is not defined

However, I repeatedly get an error like this:

    current.result = callback(current.result, *args, **kw)
  File "/home/mkol/anaconda2/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: MediumPost.parse callback is not defined

When I try to move def parse above def start_requests, I get an indentation error.

Since I am a beginner, I cannot tell where the error is.


1 Answer

Stack Overflow user

Answered on 2018-06-10 05:28:53

I think your editor is the source of the "MediumPost.parse callback is not defined" problem. It looks like the Python interpreter cannot see the parse function; my guess is that you mixed four-space indentation with tabs. I use PyCharm, so I may simply not run into the same issue. After a few changes the code works for me:

- I added tag_slug, start_date and end_date as class attributes (your self.tagSlug, self.start_date and self.end_date).
- I reformatted the code following the PEP-8 recommendations; it looks much better now.
- I removed the prints; during debugging it is better to use breakpoints.
- I converted the variable names to Python style: as far as I remember, PEP-8 recommends sticking to a single naming convention (Python style or Java style, not both).
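If the editor theory is right, you can confirm it instead of guessing: in Python 2, running the file with `python -tt yourfile.py` turns inconsistent tab usage into an error, and a small helper (hypothetical, not part of Scrapy) can list the offending lines:

```python
def find_mixed_indent(source):
    """Return 1-based line numbers whose leading whitespace mixes tabs and spaces."""
    bad = []
    for lineno, line in enumerate(source.splitlines(), 1):
        # Leading whitespace = everything before the first non-space character.
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent and " " in indent:
            bad.append(lineno)
    return bad

# A method body that indents with a tab plus spaces: line 3 is flagged.
snippet = "class Spider:\n    def parse(self, response):\n\t    return response\n"
print(find_mixed_indent(snippet))  # -> [3]
```

Any line number it reports is a place where the interpreter may see a different indentation level than your editor shows.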

import scrapy
import codecs
from datetime import datetime
from datetime import timedelta
import os

def writeTofile(file_name, text):
    with codecs.open(file_name, 'w', 'utf-8') as outfile:
        outfile.write(text)

class MediumPost(scrapy.Spider):
    name = 'medium_scrapper'
    handle_httpstatus_list = [401, 400]
    autothrottle_enabled = True
    tag_slug = 'machine-learning'
    start_date = '20170110'
    end_date = '20181130'

    def start_requests(self):
        start_urls = ['https://medium.com/tag/' + self.tag_slug.strip("'") + '/archive/']

        #Header and cookie information can be got from the Network Tab in Developer Tools
        cookie = {'mhj': 'd4c630604c57a104af8bc98218fb3430145',
                  'nj': '1',
                  'ko': '1:J0mnan1t5jlHypyliL8GAY1WNfDvtqZBgmBDr+7STp2QSwyWUz6',
                  'pi': '233',
                  'mt': '-874'}

        header = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}

        startDate = datetime.strptime(self.start_date, "%Y%m%d")
        endDate = datetime.strptime(self.end_date, "%Y%m%d")
        delta = endDate - startDate

        for i in range(delta.days + 1):
            d = datetime.strftime(startDate + timedelta(days=i), '%Y/%m/%d')

            for url in start_urls:
                print(url + d)
                yield scrapy.Request(url + d, headers=header, cookies=cookie, meta={'req_date': d})

    def parse(self, response):
        response_data = response.text
        response_split = response_data.split("while(1);</x>")
        response_data = response_split[0]
        date_post = response.meta['req_date']
        date_post = date_post.replace("/", "")
        directory = datetime.now().strftime("%Y%m%d")

        if not os.path.exists(directory):
            os.makedirs(directory)

        writeTofile(directory + "//" + self.tag_slug.replace("-", "").strip("'") + "Tag" + date_post + ".json", response_data)
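The date loop in start_requests is the part most worth checking in isolation. A minimal sketch of the same logic, using the same "%Y%m%d" inputs (archive_dates is a hypothetical helper name, not part of the spider):

```python
from datetime import datetime, timedelta

def archive_dates(start, end):
    """Yield every day from start to end (inclusive, '%Y%m%d') as a '%Y/%m/%d' path segment."""
    start_date = datetime.strptime(start, "%Y%m%d")
    end_date = datetime.strptime(end, "%Y%m%d")
    # + 1 makes the range inclusive of the end date.
    for i in range((end_date - start_date).days + 1):
        yield (start_date + timedelta(days=i)).strftime("%Y/%m/%d")

for d in archive_dates("20180101", "20180103"):
    print("https://medium.com/tag/machine-learning/archive/" + d)
```

The `+ 1` matters: without it the last archive day would be skipped, which is easy to miss in the middle of a spider.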
0 votes
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50777226
