Q: Overcoming pagination when using Puppeteer (library) for web scraping

Stack Overflow user

Asked 2018-09-14 04:45:49

Answers: 1, Views: 9.3K, Followers: 0, Votes: 4

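I am using Puppeteer to build a basic web scraper. So far I can return all the data I need from any given page, but as soon as pagination is involved my scraper comes unstuck: it only returns the first page.

See the example below: it returns the title and price of the first 20 books, but does not look at the other 49 pages of books.

I am just looking for guidance on how to overcome this. I can't see anything about it in the docs.

Thanks!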

const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');

    const result = await page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');

        for (var element of elements) {
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;

            data.push({title, price});
        }

        return data;
    });

    browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value);
});

To be clear: I am following a tutorial here. This code comes from Brandon on codeburst.io: https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921

node.js

web-scraping

pagination

puppeteer

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-11-02 03:10:22

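I followed the same article to teach myself how to use Puppeteer. The short answer to your question is that you need to introduce one more loop to iterate over all the available pages in the online book catalogue. To collect all the book titles and prices, I did the following:

1. Extracted the page.evaluate part into a separate async function that takes page as an argument
2. Introduced a for loop with a hardcoded last catalogue page number (you can extract it with Puppeteer if you wish)
3. Placed the async function from step one inside the loop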
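
The hardcoded last page number mentioned in the steps above could instead be read from the page itself. The following is only a rough sketch: it assumes the catalogue shows pager text of the form "Page 1 of 50" and that this text sits in an element matching `li.current`. Neither assumption is confirmed by the article or by the code in this answer, and `parseLastPageNumber` is a hypothetical helper name.

```javascript
// Hypothetical helper: extracts the total page count from pager text.
// Assumes the pager reads like "Page 1 of 50"; returns null if it does not.
function parseLastPageNumber(pagerText) {
    const match = /of\s+(\d+)/i.exec(pagerText || '');
    return match ? Number(match[1]) : null;
}

// Possible use inside scrape(), assuming the pager element matches 'li.current':
//   const pagerText = await page.$eval('li.current', (el) => el.textContent);
//   const lastPageNumber = parseLastPageNumber(pagerText) || 1;
```

Keeping the parsing separate from the browser call makes it easy to check against sample strings without launching Chrome. Falling back to 1 means the scraper still reads the first page if the pager text is missing or has a different format.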

The code is the same as in Brandon Morelli's article, but now with one extra loop:

const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');

    let results = []; // holds the titles and prices collected from every page
    const lastPageNumber = 50; // hardcoded last catalogue page; you can set it dynamically if you wish
    // simple loop that iterates over the catalogue pages
    for (let index = 0; index < lastPageNumber; index++) {
        // wait 1 second for the page to load; a plain timer is used here because
        // page.waitFor was deprecated in later Puppeteer releases
        await new Promise((resolve) => setTimeout(resolve, 1000));
        // call extractedEvaluateCall, wait for it, and concatenate its results on every iteration.
        // results.push would also work, but would leave you with a collection of collections
        results = results.concat(await extractedEvaluateCall(page));
        // click the "next" button to jump to the following page
        if (index !== lastPageNumber - 1) {
            // there is no "next" button on the last page
            await page.click('#default > div > div > div > div > section > div:nth-child(2) > div > ul > li.next > a');
        }
    }

    await browser.close();
    return results;
};

async function extractedEvaluateCall(page) {
    // the same extraction logic, moved into a separate function
    // that takes page as an argument and returns the promise from page.evaluate
    return page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');

        for (var element of elements) {
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;

            data.push({ title, price });
        }

        return data;
    });
}

scrape().then((value) => {
    console.log(value);
    console.log('Collection length: ' + value.length);
    console.log(value[0]);
    console.log(value[value.length - 1]);
});

Console output:

...
  { title: 'In the Country We ...', price: '£22.00' },
  ... 900 more items ]
Collection length: 1000
{ title: 'A Light in the ...', price: '£51.77' }
{ title: '1,000 Places to See ...', price: '£26.08' }
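
A variant that avoids both the hardcoded page count and the fixed one-second wait is to keep following the "next" link until it no longer exists. The sketch below is not part of the original answer. It assumes the link matches `li.next > a` (a shortened form of the selector used above) and that clicking it triggers a full page navigation. `scrapeAllPages` and `maxPages` are hypothetical names, and the only Puppeteer calls used are `page.$`, `page.click` and `page.waitForNavigation`.

```javascript
// Hypothetical helper: collects results page by page until no "next" link is left.
// `page` is a Puppeteer Page; `extract` is an async function such as extractedEvaluateCall.
async function scrapeAllPages(page, extract, maxPages = 1000) {
    const nextSelector = 'li.next > a'; // assumed selector for the "next" link
    let results = [];

    for (let visited = 0; visited < maxPages; visited++) {
        results = results.concat(await extract(page));

        // page.$ resolves to null when nothing matches, i.e. on the last page
        const nextLink = await page.$(nextSelector);
        if (nextLink === null) {
            break;
        }

        // start waiting for the navigation before clicking, so it cannot be missed
        await Promise.all([
            page.waitForNavigation(),
            page.click(nextSelector),
        ]);
    }

    return results;
}
```

Inside `scrape()`, the `for` loop could then be replaced by `const results = await scrapeAllPages(page, extractedEvaluateCall);`. The `maxPages` cap is only a safety net in case the pager never runs out of "next" links.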

Votes: 12

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/52325114
