
Selenium AJAX dynamic pagination with a base spider

Stack Overflow user
Asked on 2014-12-17 12:00:50
2 answers · 1.5K views · 0 followers · 1 vote

I am trying to crawl dynamically paginated content with my base spider, but the crawl is not succeeding. I used Selenium for the AJAX dynamic pagination. The URL I am using is: http://www.demo.com. Here is my code:

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import demoItem
from selenium import webdriver

def removeUnicodes(strData):
    if(strData):
        #strData = strData.decode('unicode_escape').encode('ascii','ignore')
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
        #print 'Output:', strData
    return strData


class demoSpider(scrapy.Spider):
    name = "demourls"
    allowed_domains = ["demo.com"]
    start_urls = ['http://www.demo.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        print "*****************************************************"
        self.driver.get(response.url)
        print response.url
        print "______________________________"

        hxs = Selector(response)
        item = demoItem()
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')

            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
                print '----------urls----------', urls

                for url in urls:
                    print '---------url-------', url
                finalurls.append(url)

                item['urls'] = finalurls

            except:
                break

        self.driver.close()
        return item

My items.py is:

from scrapy.item import Item, Field


class demoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

When I try to crawl it and export it to JSON, the JSON file I get looks like this:

[{"pageurl": "http://www.demo.com", "urls": [], "title": "demo"}]

I cannot crawl all the URLs, because they are loaded dynamically.


2 Answers

Stack Overflow user
Accepted answer
Posted on 2014-12-17 20:20:16

I hope the code below helps.

somespider.py

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import DemoItem
from selenium import webdriver

def removeUnicodes(strData):
        if(strData):
            strData = strData.encode('utf-8').strip() 
            strData = re.sub(r'[\n\r\t]',r' ',strData.strip())
        return strData

class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')

            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] =  removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')

                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))          

                item['urls'] = finalurls

            except:
                break

        self.driver.close()
        return item
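The control flow of the click-until-it-fails loop above is easy to lose inside the Selenium calls. As a rough illustration only, it can be sketched in plain Python, with a list of pages standing in for the browser and an IndexError standing in for the failed `next.click()` (the `collect_pages` helper is hypothetical, not part of the answer's code):

```python
def collect_pages(pages):
    # pages: a list of "pages", each a list of hrefs, simulating what the
    # browser would show after each click of the "show more" link
    urls = []
    i = 0
    while True:
        try:
            page = pages[i]  # plays the role of next.click(): raises when exhausted
        except IndexError:
            break            # like the answer's bare except: stop paginating
        for u in page:
            if u not in urls:
                urls.append(u)  # accumulate hrefs without duplicates
        i += 1
    return urls
```

Successive pages often re-list earlier entries, which is why the sketch deduplicates as it accumulates.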

items.py

from scrapy.item import Item, Field

class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

Note: You need to run the Selenium server, because HTMLUNITWITHJS only works through the Selenium server.

Run your Selenium server by issuing the command:

java -jar selenium-server-standalone-2.44.0.jar

Run your spider with the command:

scrapy crawl domainurls -o someoutput.json
1 vote

Stack Overflow user
Posted on 2014-12-17 15:08:30

First, you don't need to press the showMoreCars button, because it is pressed dynamically after the page loads. Instead, waiting a second or so is enough.
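The "wait a second" advice can also be made explicit with a small polling helper. A minimal sketch (`wait_until` is a hypothetical helper, not a Selenium API; in real Selenium code `driver.implicitly_wait` or `WebDriverWait` from `selenium.webdriver.support.ui` serve the same purpose):

```python
import time

def wait_until(predicate, timeout=5.0, poll=0.25):
    # keep evaluating predicate() until it is truthy or the timeout expires;
    # returns True on success, False on timeout
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    return False
```

With a browser driver in hand, the predicate would be something like checking that the detail links have appeared on the page.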

Besides your Scrapy code, Selenium itself is able to capture all the hrefs for you. Here is an example of what you need to do in Selenium:

from selenium import webdriver

driver = webdriver.Firefox()

driver.get("http://www.carwale.com/used/cars-in-trichy/#city=194&kms=0-&year=0-&budget=0-&pn=2")
driver.implicitly_wait(5)
urls = driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
for url in urls:
    print url.get_attribute("href")
driver.close()

All you need to do is merge this with your scraping part.
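Merging can be as simple as cleaning each href the way the question's `removeUnicodes` does and appending the new ones to the item's URL list. A minimal sketch, where `clean_url` and `merge_hrefs` are hypothetical helper names:

```python
import re

def clean_url(u):
    # same spirit as the question's removeUnicodes: collapse
    # newlines/tabs into spaces and trim surrounding whitespace
    return re.sub(r'[\n\r\t]', ' ', u.strip()) if u else u

def merge_hrefs(finalurls, hrefs):
    # append each cleaned href to the running list, skipping empties and
    # duplicates so repeated pagination passes don't store a detail page twice
    for h in hrefs:
        h = clean_url(h)
        if h and h not in finalurls:
            finalurls.append(h)
    return finalurls
```

In the spider, `hrefs` would be the values returned by `url.get_attribute("href")` from the Selenium snippet above, and `finalurls` would end up in `item['urls']`.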

Output:

http://www.carwale.com/used/cars-in-trichy/renault-pulse-s586981/
http://www.carwale.com/used/cars-in-trichy/marutisuzuki-ritz-2009-2012-s598266/
http://www.carwale.com/used/cars-in-trichy/mahindrarenault-logan-2007-2009-s607757/
http://www.carwale.com/used/cars-in-trichy/marutisuzuki-ritz-2009-2012-s589835/
http://www.carwale.com/used/cars-in-trichy/hyundai-santro-xing-2003-2008-s605866/
http://www.carwale.com/used/cars-in-trichy/chevrolet-captiva-s599023/
http://www.carwale.com/used/cars-in-trichy/chevrolet-enjoy-s595824/
http://www.carwale.com/used/cars-in-trichy/tata-indicav2-s606823/
http://www.carwale.com/used/cars-in-trichy/tata-indicav2-s606617/
http://www.carwale.com/used/cars-in-trichy/marutisuzuki-estilo-2009-2014-s592745/
http://www.carwale.com/used/cars-in-trichy/toyota-etios-2013-2014-s605950/
http://www.carwale.com/used/cars-in-trichy/tata-indica-vista-2008-2011-s599001/
http://www.carwale.com/used/cars-in-trichy/opel-corsa-s591616/
http://www.carwale.com/used/cars-in-trichy/hyundai-i20-2008-2010-s596173/
http://www.carwale.com/used/cars-in-trichy/tata-indica-vista-2012-2014-s600753/
http://www.carwale.com/used/cars-in-trichy/fiat-punto-2009-2011-s606934/
http://www.carwale.com/used/cars-in-trichy/mitsubishi-pajero-s597849/
http://www.carwale.com/used/cars-in-trichy/fiat-linea20082014-s596079/
http://www.carwale.com/used/cars-in-trichy/tata-indicav2-s597390/
http://www.carwale.com/used/cars-in-trichy/mahindra-xylo-2009-2012-s603434/
1 vote
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's translation engine.
Original link: https://stackoverflow.com/questions/27525142
