How can I crawl every one of these hrefs? I only know how to display them, but I want to follow each of those links. This is our intranet data, so you won't be able to access the links yourself. Also, how do I format the date when the data is written out to the file? Do I need to add a list of urls to start_urls? Do I need to change my InitSpider to a CrawlSpider?
<row>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">14256238845</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">353918053831794</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">310260548400764</cell>
<cell type="href" href="/dis/packages.jsp?view=timeline&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100&date=20130423T020032243">2013-04-23 02:00:32.243</cell>
<cell type="plain">2013-04-23 02:00:32.243</cell>
<cell type="plain">3 - PackageCreation</cell>
<cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="href" href="/dis/sessions.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view sessions</cell>
<cell type="href" href="/dis/errors_agg.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view errors</cell>
</row>
This is what I have so far, and it prints everything:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector
from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'txtUserName': 'myuser', 'txtPassword': 'xxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we
        are successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad password :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        columns = xhs.select('//table[3]/row/cell')
        for column in columns:
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
This is the output I currently get in the CSV file:
14256238845
3.53918E+14
3.10261E+14
00:32.2
00:32.2
3 - PackageCreation
400006
view sessions
view errors
This is the output I would like to get from the CSV:
14256238845
353918053831794
310260548400764
2013-04-23 02:00:32.243
2013-04-23 02:00:32.243
3 - PackageCreation
400006
view sessions
view errors
Posted on 2013-07-15 20:44:01
Whenever you want to follow a URL, you can yield a Request object.
For example: yield Request(extracted_url_link, callback=your_parse_function)
See the second example at the link below:
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example
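For your spider, a minimal sketch of that approach: parse can yield a Request for each extracted href in addition to the item. Here urljoin and the parse_detail callback name are assumptions for illustration, not part of your original code:

    # At the top of the file, alongside the other imports (Python 2):
    # from urlparse import urljoin

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        for column in xhs.select('//table[3]/row/cell'):
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
            # Follow each href found in this cell with a new Request.
            # urljoin turns the relative href into an absolute URL.
            for href in column.select('.//@href').extract():
                yield Request(urljoin(response.url, href),
                              callback=self.parse_detail)

    def parse_detail(self, response):
        # Hypothetical callback: extract whatever you need from the
        # linked page here, e.g. with another XmlXPathSelector.
        self.log("Visited %s" % response.url)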
Another way to specify which urls to crawl is to use an SgmlLinkExtractor. You write rules, and the spider will follow every url on any crawled page that matches a rule. See the example at the following url:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
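A sketch of what the CrawlSpider variant could look like for your site. The allow pattern, spider name, and parse_item callback are illustrative assumptions; the login flow from your InitSpider is not shown and would still need to be handled, e.g. by overriding start_requests:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class CarrierCrawlSpider(CrawlSpider):
    name = 'dis_crawl'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    rules = (
        # Follow every link whose href matches the pattern and pass each
        # fetched page to parse_item; follow=True keeps crawling links
        # found on those pages too.
        Rule(SgmlLinkExtractor(allow=(r'packages\.jsp', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Hypothetical callback: extract items from each followed page.
        self.log("Crawled %s" % response.url)

Note that CrawlSpider uses the parse method internally for its rule handling, so the extraction callback must not be named parse.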
The date is just a string once it has been crawled. You can convert it into a Python datetime object and then use a datetime formatting function such as strftime to display it however you want.
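A minimal sketch, using the timestamp format that appears in your data above:

from datetime import datetime

raw = "2013-04-23 02:00:32.243"
# Parse the crawled string into a datetime object...
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
# ...then render it however you want with strftime.
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # -> 2013-04-23 02:00:32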
Hope this answers your question.
https://stackoverflow.com/questions/17572651