
Puppeteer: how to wait for only the first response (HTML)

Stack Overflow user
Asked 2019-09-13 08:12:14
3 answers, 3.7K views, 0 followers, score 2

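I am using puppeteer-cluster to crawl web pages.

If I open many pages per website at once (8-10 pages), the connection slows down and many timeout errors appear, like this:

TimeoutError: Navigation Timeout Exceeded: 3000ms exceeded

I only need the HTML code of each page. I don't need to wait for the DOM content to load and so on.

Is there a way to tell page.goto() to wait only for the first response from the web server? Or do I need to use another technology instead of Puppeteer?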

3 Answers

Stack Overflow user

Accepted answer

Posted 2019-09-13 08:32:35

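domcontentloaded is the event fired for the first HTML content.

The DOMContentLoaded event fires when the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.

The following call resolves as soon as the initial HTML document has been loaded.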

await page.goto(url, {waitUntil: 'domcontentloaded'})
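
Once page.goto resolves on domcontentloaded, the parsed HTML can be read with page.content(). The following is a minimal sketch rather than part of the original answer: the helper name `fetchHtml` and its 3000 ms default timeout are assumptions chosen for illustration, and `page` is expected to be a Puppeteer Page.

```javascript
/**
 * Hypothetical helper: navigate, wait only for DOMContentLoaded, return the HTML.
 * Only page.goto() and page.content() are used.
 */
async function fetchHtml(page, url, timeout = 3000) {
  // Resolves as soon as the initial HTML document is loaded and parsed.
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout });
  // Serialized HTML of the page as it stands at this moment.
  return page.content();
}
```

It could be called as `const html = await fetchHtml(page, url);` inside each crawl task.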

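However, you can also block images or stylesheets to save bandwidth and load even faster when loading 10 pages at once.

Put the code below in the right place (before navigating with page.goto) and it will stop images, stylesheets, fonts and scripts from loading.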

await page.setRequestInterception(true);
page.on('request', (request) => {
    if (['image', 'stylesheet', 'font', 'script'].indexOf(request.resourceType()) !== -1) {
        request.abort();
    } else {
        request.continue();
    }
});
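
The blocking rule can also be pulled out into a small function so it can be checked without launching a browser. This is only a sketch: the names `BLOCKED_TYPES`, `shouldBlock` and `handleRequest` are hypothetical, and the list of types is the same one used in the snippet above.

```javascript
// Resource types to drop; same list as in the snippet above.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'script']);

/** Hypothetical helper: true when a request of this resource type should be aborted. */
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

/** Request handler meant for page.on('request', ...) once interception is enabled. */
function handleRequest(request) {
  if (shouldBlock(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
}
```

It would be wired up with `await page.setRequestInterception(true); page.on('request', handleRequest);`. Keep in mind that with 'script' in the list, any content a page renders through JavaScript will be missing from the HTML.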

Score: 4

Stack Overflow user

Posted 2019-09-13 12:10:16

@user3817605, I have the perfect code for you. :)

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified page `event` happens.
 * 
 * @param {puppeteer.Page} page
 * @param {string} event Can be any event accepted by the method `page.on()`. E.g.: "requestfinished" or "framenavigated".
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForEvent(page, event, timeout) {
  page.once(event, done);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    clearTimeout(timeoutId);
    fulfill();
  }
}

You asked for a function that waits only for the first response, so you can use this function like this:

page.goto(<URL>); // add .catch(() => {}) if you close the page early, to avoid errors being thrown to the console
await waitForEvent(page, 'response'); // after this line the first response has already been received

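This does exactly what you asked for. But be aware that "response received" is not the same as "complete HTML response received". The first is the start of the response and the second is its end. So you may want to use the event "requestfinished" instead of "response". In fact, you can use any event accepted by the Puppeteer page. They are: close, console, dialog, domcontentloaded, error, frameattached, framedetached, framenavigated, load, metrics, pageerror, popup, request, requestfailed, requestfinished, response, workercreated, workerdestroyed.

Try these: requestfinished or framenavigated. Maybe they will fit your needs.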
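
If you pass a timeout, waitForEvent resolves the same way whether the event fired or the time ran out, and on a timeout its listener stays attached. The sketch below is one possible variant, not the function from this answer: the name `waitForEventOrTimeout` and the `{ timedOut, args }` result shape are assumptions, and it relies on the same `once` / `removeListener` emitter methods that the code on this page already uses.

```javascript
/**
 * Hypothetical variant of waitForEvent. Resolves with { timedOut, args } and
 * always detaches its listener, so a timeout leaves nothing attached to the page.
 */
function waitForEventOrTimeout(emitter, event, timeout) {
  return new Promise(resolve => {
    let timeoutId;
    const onEvent = (...args) => {
      clearTimeout(timeoutId);
      resolve({ timedOut: false, args });
    };
    emitter.once(event, onEvent);
    // Without a numeric timeout, wait forever, as the original function does.
    if (typeof timeout === 'number' && timeout >= 0) {
      timeoutId = setTimeout(() => {
        emitter.removeListener(event, onEvent);
        resolve({ timedOut: true, args: [] });
      }, timeout);
    }
  });
}
```

A caller could then write `const { timedOut } = await waitForEventOrTimeout(page, 'requestfinished', 3000);` and skip the page when `timedOut` is true.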

To help you decide which one suits you best, you can set up test code like this:

const puppeteer = require('puppeteer');

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified page `event` happens.
 * 
 * @param {puppeteer.Page} page
 * @param {string} event Can be any event accepted by the method `page.on()`. E.g.: "requestfinished" or "framenavigated".
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForEvent(page, event, timeout) {
  page.once(event, done);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    clearTimeout(timeoutId);
    fulfill();
  }
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cdp = await page.target().createCDPSession();
  await cdp.send('Network.enable');
  await cdp.send('Page.enable');
  const t0 = Date.now();
  page.on('request', req => console.log(`> ${Date.now() - t0} request start: ${req.url()}`));
  page.on('response', res => console.log(`< ${Date.now() - t0} response: ${res.url()}`));
  page.on('requestfinished', req => console.log(`. ${Date.now() - t0} request finished: ${req.url()}`));
  page.on('requestfailed', req => console.log(`E ${Date.now() - t0} request failed: ${req.url()}`));

  page.goto('https://www.google.com').catch(() => { });
  await waitForEvent(page, 'requestfinished');
  console.log(`\nThe page was released after ${Date.now() - t0}ms\n`);
  await page.close();
  await browser.close();
})();

/* The output should be something like this:

> 2 request start: https://www.google.com/
< 355 response: https://www.google.com/
> 387 request start: https://www.google.com/tia/tia.png
> 387 request start: https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
. 389 request finished: https://www.google.com/

The page was released after 389ms

*/

Score: 3

Stack Overflow user

Posted 2019-09-13 20:16:03

I can see two other ways to achieve what you want: using page.waitForResponse and page.waitForFunction. Let's look at both.

With page.waitForResponse, you can do something as simple as this:

page.goto('https://www.google.com/').catch(() => {});
await page.waitForResponse('https://www.google.com/'); // don't forget to put the final slash

Simple, huh? If you don't like it, try page.waitForFunction and wait for the document to be created:

page.goto('https://www.google.com/').catch(() => {});
await page.waitForFunction(() => document); // you can use `window` too. It is almost the same

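This code waits until document exists. That happens when the first chunk of HTML arrives and the browser starts building the DOM tree representation of the document.

Note, however, that although both solutions are simple, neither of them waits until the whole HTML page/document has been downloaded. If you want that, you should modify the waitForEvent function from my other answer to accept a specific URL that must be fully downloaded. Example: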

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified `requestUrl` resource has finished downloading, or `timeout` elapses.
 * 
 * @param {puppeteer.Page} page
 * @param {string} requestUrl pass the exact url of the resource you want to wait for. Paths must be ended with slash "/". Don't forget that.
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForRequestToFinish(page, requestUrl, timeout) {
  page.on('requestfinished', onRequestFinished);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    page.removeListener('requestfinished', onRequestFinished);
    clearTimeout(timeoutId);
    fulfill();
  }
  function onRequestFinished(req) {
    if (req.url() === requestUrl) done();
  }
}

How to use it:

page.goto('https://www.amazon.com/').catch(() => {});
await waitForRequestToFinish(page, 'https://www.amazon.com/', 3000);
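
Because the function compares URLs with `===`, passing 'https://www.amazon.com' without the final slash would never match and the call would simply run into the timeout. One way to guard against that, sketched here with a hypothetical helper name `normalizeUrl`, is to run the string through the standard WHATWG `URL` class (a global in Node.js), which serializes a bare origin with the trailing slash:

```javascript
/**
 * Hypothetical helper: return the URL in its serialized form,
 * e.g. with the trailing slash added to a bare origin.
 */
function normalizeUrl(url) {
  return new URL(url).href;
}

console.log(normalizeUrl('https://www.amazon.com')); // prints "https://www.amazon.com/"
```

It could then be used as `await waitForRequestToFinish(page, normalizeUrl('https://www.amazon.com'), 3000);`.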

A full example with neat console.logs:

const puppeteer = require('puppeteer');

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified `requestUrl` resource has finished downloading, or `timeout` elapses.
 * 
 * @param {puppeteer.Page} page
 * @param {string} requestUrl pass the exact url of the resource you want to wait for. Paths must be ended with slash "/". Don't forget that.
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForRequestToFinish(page, requestUrl, timeout) {
  page.on('requestfinished', onRequestFinished);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    page.removeListener('requestfinished', onRequestFinished);
    clearTimeout(timeoutId);
    fulfill();
  }
  function onRequestFinished(req) {
    if (req.url() === requestUrl) done();
  }
}

(async () => {
  const netMap = new Map();
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cdp = await page.target().createCDPSession();
  await cdp.send('Network.enable');
  await cdp.send('Page.enable');
  const t0 = Date.now();
  cdp.on('Network.requestWillBeSent', ({ requestId, request: { url: requestUrl } }) => {
    netMap.set(requestId, requestUrl);
    console.log(`> ${Date.now() - t0}ms\t requestWillBeSent:\t${requestUrl}`);
  });
  cdp.on('Network.responseReceived', ({ requestId }) => console.log(`< ${Date.now() - t0}ms\t responseReceived:\t${netMap.get(requestId)}`));
  cdp.on('Network.dataReceived', ({ requestId, dataLength }) => console.log(`< ${Date.now() - t0}ms\t dataReceived:\t\t${netMap.get(requestId)} ${dataLength} bytes`));
  cdp.on('Network.loadingFinished', ({ requestId }) => console.log(`. ${Date.now() - t0}ms\t loadingFinished:\t${netMap.get(requestId)}`));
  cdp.on('Network.loadingFailed', ({ requestId }) => console.log(`E ${Date.now() - t0}ms\t loadingFailed:\t${netMap.get(requestId)}`));

  // The magic happens here
  page.goto('https://www.amazon.com').catch(() => { });
  await waitForRequestToFinish(page, 'https://www.amazon.com/', 3000);

  console.log(`\nThe page was released after ${Date.now() - t0}ms\n`);
  await page.close();
  await browser.close();
})();

/* OUTPUT EXAMPLE
[... lots of logs removed ...]
> 574ms  requestWillBeSent:     https://images-na.ssl-images-amazon.com/images/I/71vvXGmdKWL._AC_SY200_.jpg
< 574ms  dataReceived:          https://www.amazon.com/ 65536 bytes
< 624ms  responseReceived:      https://images-na.ssl-images-amazon.com/images/G/01/AmazonExports/Fuji/2019/February/Dashboard/computer120x._CB468850970_SY85_.jpg
> 628ms  requestWillBeSent:     https://images-na.ssl-images-amazon.com/images/I/81Hhc9zh37L._AC_SY200_.jpg
> 629ms  requestWillBeSent:     https://images-na.ssl-images-amazon.com/images/G/01/personalization/ybh/loading-4x-gray._CB317976265_.gif
< 631ms  dataReceived:          https://www.amazon.com/ 58150 bytes
. 631ms  loadingFinished:       https://www.amazon.com/

*/

This code logs lots of requests and responses, but it stops as soon as "https://www.amazon.com/" has been fully downloaded.
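
As written, waitForRequestToFinish resolves the same way on success and on timeout, and a request that fails is only noticed once the timeout elapses. The following sketch shows one way to tell the outcomes apart. The name `waitForRequestOutcome` and the three result strings are assumptions made for illustration; it uses the same `on` / `removeListener` methods and the same 'requestfinished' and 'requestfailed' page events that appear elsewhere on this page.

```javascript
/**
 * Hypothetical variant of waitForRequestToFinish. Resolves with 'finished',
 * 'failed' or 'timeout', so the caller can tell the three outcomes apart.
 */
function waitForRequestOutcome(page, requestUrl, timeout) {
  return new Promise(resolve => {
    let timeoutId;
    // Detach both listeners and the timer before reporting the outcome.
    const settle = outcome => {
      page.removeListener('requestfinished', onFinished);
      page.removeListener('requestfailed', onFailed);
      clearTimeout(timeoutId);
      resolve(outcome);
    };
    const onFinished = req => { if (req.url() === requestUrl) settle('finished'); };
    const onFailed = req => { if (req.url() === requestUrl) settle('failed'); };
    page.on('requestfinished', onFinished);
    page.on('requestfailed', onFailed);
    if (typeof timeout === 'number' && timeout >= 0) {
      timeoutId = setTimeout(() => settle('timeout'), timeout);
    }
  });
}
```

With it, a crawler could read page.content() only when the outcome is 'finished' and retry or skip the URL otherwise.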

Score: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/57919714
