返回空网页的PhantomJS(python,Selenium),怎么办?

内容来源于 Stack Overflow,并遵循CC BY-SA 3.0许可协议进行翻译与使用

  • 回答 (2)
  • 关注 (0)
  • 查看 (553)

在不需要在python脚本中启动实际浏览器实例的情况下(使用Selenium),就可以在屏幕上扫描网站。我可以用Chrome或Firefox来做这件事--我已经试过了,而且它也能工作--但是我想使用PhantomJS,这样它就无头了。

代码如下:

import sys
import traceback
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)

try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    browser.get("https://www.whatever.com")

    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)

    # PROCESS THE PAGE (code removed)

except Exception, e:
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)

finally:
    browser.close()

结果是:

<html><head></head><body></body></html>

但是,当我使用Chrome或Firefox选项时,它可以正常工作。我想也许网站是根据用户代理返回垃圾,所以我试着装出来。没什么区别。

我遗漏了什么?

我得到的答案如下:

import sys
import traceback
import time
import re

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")

try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    print "getting web page..."
    browser.get("https://www.website.com")

    # Need to wait for the page to load
    timeout = 10
    print "waiting %s seconds..." % timeout
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID,'the_id')))
    print "done waiting. Response:"

    # Rest of code snipped. Fails as "wait" above.
提问于
用户回答回答于
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


# ...
browser.get("https://www.whatever.com")

wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))

html_source = browser.page_source
# ...

若要在获取页源之前变得可见,请执行以下操作。

此外,您可能需要忽略SSL错误:

browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
用户回答回答于

问题是Https网站上的SSL加密,忽略它们就可以了。

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

所属标签

可能回答问题的人

  • Richel

    5 粉丝0 提问6 回答
  • 人生的旅途

    10 粉丝484 提问5 回答
  • 天使的炫翼

    17 粉丝531 提问5 回答
  • 嗨喽你好

    7 粉丝480 提问4 回答

扫码关注云+社区

领取腾讯云代金券