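I wrote a Python script using Selenium to scrape some content from a webpage. The content sits in a box, like a container, in the page's left sidebar. With Selenium I can get it without any trouble. Now I would like to get the same content using the requests module. I experimented in the dev tools and noticed that a POST request is being sent, which produces a JSON response that I have pasted below. However, at this point I can't figure out how to fetch the content using requests.
Link to the webpage
Selenium approach: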
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_content(link):
    # driver and wait are the module-level objects created in the __main__ block below
    driver.get(link)
    # open the "Outline" tab so that the box content gets expanded
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#tab-outline"))).click()
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#pageoutline > [class^='outline_H']"))):
        print(item.text)

if __name__ == '__main__':
    url = "http://wave.webaim.org/report#/www.onewerx.com"
    with webdriver.Chrome() as driver:
        wait = WebDriverWait(driver, 10)
        get_content(url)

Partial output the script produces (as desired):
Marketing Mix Modeling
Programmatic & Modeling
Programmatic is buying digital advertising space automatically, with computers using data to decide which ads to buy and how much to pay for them.
Modern
Efficient
Scalable
Resultative
What is Modeling?
Modeling is an analytical approach that uses historic information, such as syndicated point-of-sale data and companies’ internal data, to quantify the sales impact of various marketing activities.
Programmatic - future of the marketing

When trying with requests:
import requests
url = "http://wave.webaim.org/data/request.php"
headers = {
    'Referer': 'http://wave.webaim.org/report',
    'X-Requested-With': 'XMLHttpRequest'
}
res = requests.post(url, data={'source': 'http://www.onewerx.com'}, headers=headers)
print(res.json())

I get the following output:
{'success': True, 'reportkey': '6520439253ac21885007b52c677b8078', 'contenttype': 'text/html; charset=UTF-8'}

How can I get the same content using requests?
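To be clearer: this is what I'm interested in.
The output above looks different from the image because the Selenium script clicks the following button attached to that box to expand the content: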

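Posted on 2019-07-04 21:16:19
Well, I did some reverse engineering.
The whole process appears to run on the client side. Here is how it works:
wave.engine.statistics contains the results you are looking for: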
// wave.min.js
wave.fn.applyRules = function() {
    var e = {};
    e.statistics = {};
    try {
        e.categories = wave.engine.run(),
        e.statistics = wave.engine.statistics;
        wave.engine.ruleTimes;
        e.statistics.pagetitle = wave.page.title,
        e.statistics.totalelements = wave.allTags.length,
        e.success = !0
    } catch (t) {
        console.log(t)
    }
    return e
}

Here, the wave.engine.run function runs all of the rules on the client side (s is the <body> element) and returns the results:
wave.engine.run = function(e) {
    var t = new Date
      , n = null
      , i = null
      , a = new Date;
    wave.engine.fn.calculateContrast(this.fn.getBody());
    var o = new Date
      , r = wave.rules
      , s = $(wave.page);
    if (e)
        r[e] && r[e](s);
    else
        for (e in r) {
            n = new Date;
            try {
                r[e](s)
            } catch (l) {
                console.log("RULE FAILURE(" + e + "): " + l.stack)
            }
            i = new Date,
            this.ruleTimes[e] = i - n,
            config.debug && console.log("RULE: " + e + " (" + this.ruleTimes[e] + "ms)")
        }
    return EndTimer = new Date,
    config.debug && console.log("TOTAL RULE TIME: " + (EndTimer - t) + "ms"),
    a = new Date,
    wave.engine.fn.structureOutput(),
    o = new Date,
    wave.engine.results
}

So you have two options: port these rules to Python, or keep using Selenium.
wave.rules = {},
wave.rules.text_justified = function(e) {
    e.find("p, div, td").each(function(t, n) {
        var i = e.find(n);
        "justify" == i.css("text-align") && wave.engine.fn.addIcon(n, "text_justified")
    })
}
,
wave.rules.alt_missing = function(e) {
    wave.engine.fn.overrideby("alt_missing", ["alt_link_missing", "alt_map_missing", "alt_spacer_missing"]),
    e.find("img:not([alt])").each(function(e, t) {
        var n = $(t);
        void 0 != n.attr("title") && 0 != n.attr("title").length || wave.engine.fn.addIcon(t, "alt_missing")
    })
}
// ... and many more

Since the tests rely on the browser engine to fully render the page (unfortunately, the report is not generated in the cloud), you will have to use Selenium for this job.
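
To make the trade-off concrete, here is a minimal sketch of what porting a single rule to Python might look like. It approximates wave.rules.alt_missing with the standard library's html.parser, since that rule only inspects attributes. The names AltMissingRule and find_alt_missing are hypothetical, the sketch ignores the wave.engine.fn.overrideby logic, and it works on the raw HTML source rather than on the rendered DOM, so images injected by JavaScript would be missed. A rule such as text_justified, which reads the computed text-align style, cannot be reproduced this way, which is why a real browser is still needed for the full report.

```python
from html.parser import HTMLParser


class AltMissingRule(HTMLParser):
    """Rough approximation of wave.rules.alt_missing (hypothetical name, not part of WAVE)."""

    def __init__(self):
        super().__init__()
        self.flagged = []  # (line, column) of every <img> the rule would flag

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Counterpart of the selector img:not([alt]): an alt attribute, even an empty one, skips the image.
        if "alt" in attributes:
            return
        # Counterpart of the title check: a non-empty title suppresses the icon.
        if attributes.get("title"):
            return
        self.flagged.append(self.getpos())


def find_alt_missing(html):
    """Return the source positions of <img> tags that have neither alt nor a non-empty title."""
    rule = AltMissingRule()
    rule.feed(html)
    rule.close()
    return rule.flagged
```

Calling find_alt_missing(requests.get(some_url).text) would flag the matching images in the static markup only. Every other rule in wave.rules would need the same kind of hand translation, and the ones that depend on layout or styles have no requests-only equivalent.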
https://stackoverflow.com/questions/56893932