How can I crawl every one of these hrefs? I only know how to display them, but I want to follow each of those links. This is our intranet data, so you won't be able to access the links yourself. Also, how do I format the date when the data is written out to the file? Do I need to add a list of urls to start_urls? Do I need to change my InitSpider to a CrawlSpider?
<row>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">14256238845</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">353918053831794</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">310260548400764</cell>
<cell type="href" href="/dis/packages.jsp?view=timeline&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100&date=20130423T020032243">2013-04-23 02:00:32.243</cell>
<cell type="plain">2013-04-23 02:00:32.243</cell>
<cell type="plain">3 - PackageCreation</cell>
<cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="href" href="/dis/sessions.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view sessions</cell>
<cell type="href" href="/dis/errors_agg.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view errors</cell>
</row>
This is what I have so far, and it prints everything:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector
from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'txtUserName': 'myuser', 'txtPassword': 'xxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we
        are successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad password :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        columns = xhs.select('//table[3]/row/cell')
        for column in columns:
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
This is the output I currently get in the CSV file:
14256238845
3.53918E+14
3.10261E+14
00:32.2
00:32.2
3 - PackageCreation
400006
view sessions
view errors
This is the output I would like to get from the CSV:
14256238845
353918053831794
310260548400764
2013-04-23 02:00:32.243
2013-04-23 02:00:32.243
3 - PackageCreation
400006
view sessions
view errors
Posted on 2013-07-15 20:44:01
Whenever you want to follow a URL, you can yield a Request object.
For example: yield Request(extracted_url_link, callback=your_parse_function)
See the second example at the link below:
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example
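For your spider, a minimal sketch of that approach: parse can yield a Request for each extracted href in addition to the item. Here urljoin and the parse_detail callback name are assumptions for illustration, not part of your original code:

    # At the top of the file, alongside the other imports (Python 2):
    # from urlparse import urljoin

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        for column in xhs.select('//table[3]/row/cell'):
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
            # Follow each href found in this cell with a new Request.
            # urljoin turns the relative href into an absolute URL.
            for href in column.select('.//@href').extract():
                yield Request(urljoin(response.url, href),
                              callback=self.parse_detail)

    def parse_detail(self, response):
        # Hypothetical callback: extract whatever you need from the
        # linked page here, e.g. with another XmlXPathSelector.
        self.log("Visited %s" % response.url)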
Another way to specify which urls to crawl is to use an SgmlLinkExtractor. You write rules, and the spider will follow every url on any crawled page that matches a rule. See the example at the following url:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
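A sketch of what the CrawlSpider variant could look like for your site. The allow pattern, spider name, and parse_item callback are illustrative assumptions; the login flow from your InitSpider is not shown and would still need to be handled, e.g. by overriding start_requests:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class CarrierCrawlSpider(CrawlSpider):
    name = 'dis_crawl'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    rules = (
        # Follow every link whose href matches the pattern and pass each
        # fetched page to parse_item; follow=True keeps crawling links
        # found on those pages too.
        Rule(SgmlLinkExtractor(allow=(r'packages\.jsp', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Hypothetical callback: extract items from each followed page.
        self.log("Crawled %s" % response.url)

Note that CrawlSpider uses the parse method internally for its rule handling, so the extraction callback must not be named parse.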
The date is just a string once it has been crawled. You can convert it into a Python datetime object and then use a datetime formatting function such as strftime to display it however you want.
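A minimal sketch, using the timestamp format that appears in your data above:

from datetime import datetime

raw = "2013-04-23 02:00:32.243"
# Parse the crawled string into a datetime object...
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
# ...then render it however you want with strftime.
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # -> 2013-04-23 02:00:32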
Hope this answers your question.
https://stackoverflow.com/questions/17572651