Crawlee

阿超

Published on 2024-12-12 09:41:37


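"Among all living beings, who does not love life? Love of life, taken to its fullest, becomes love for one's community." (Qiu Jin)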

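Crawlee is a web scraping and browser automation library for Node.js, used to build reliable crawlers in JavaScript and TypeScript. It can extract data for AI, LLM, RAG, or GPT pipelines and download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.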

Crawlee covers crawling and scraping end to end and helps you build reliable scrapers quickly.

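Even with the default configuration, your crawlers are designed to look close to human traffic and stay under the radar of modern bot protections. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or to the cloud, while staying configurable enough to suit your project's needs.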

Crawlee is available as the crawlee NPM package.

👉 See the Crawlee project website for full documentation, guides, and examples 👈
Crawlee for Python is open to early adopters. 🐍 👉 Check out the source code 👈

Using the Crawlee CLI

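The fastest way to try Crawlee is to use the Crawlee CLI and select the getting-started example. The CLI installs all the necessary dependencies and adds boilerplate code for you to work with.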

    npx crawlee create my-crawler
    cd my-crawler
    npm start

Manual installation

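If you prefer to add Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, we also need to install Playwright. It is not bundled with Crawlee, to keep the install size down.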

    npm install crawlee playwright

    import { PlaywrightCrawler, Dataset } from 'crawlee';

    // PlaywrightCrawler crawls the web using a headless
    // browser controlled by the Playwright library.
    const crawler = new PlaywrightCrawler({
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, page, enqueueLinks, log }) {
            const title = await page.title();
            log.info(`Title of ${request.loadedUrl} is '${title}'`);

            // Save results as JSON to ./storage/datasets/default
            await Dataset.pushData({ title, url: request.loadedUrl });

            // Extract links from the current page
            // and add them to the crawling queue.
            await enqueueLinks();
        },
        // Uncomment this option to see the browser window.
        // headless: false,
    });

    // Add first URL to the queue and start the crawl.
    await crawler.run(['https://crawlee.dev']);
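The request handler only talks to the context object it receives, so its flow (read the title, save a record, enqueue links) can be tried out without launching a browser. The sketch below is an illustration rather than Crawlee code: handlePage is a hypothetical standalone version of the handler, and createFakeContext builds made-up stand-ins for the objects Crawlee would normally pass in. It also assumes the context exposes a pushData helper in place of the Dataset.pushData call used in the example.

```javascript
// Hypothetical standalone handler mirroring the example's requestHandler.
// Assumption: the context provides a pushData helper; everything the
// handler needs is injected through the context argument.
async function handlePage({ request, page, enqueueLinks, log, pushData }) {
    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);

    // Save one record per page.
    await pushData({ title, url: request.loadedUrl });

    // Ask the crawler to queue the links found on this page.
    await enqueueLinks();
}

// Made-up stand-ins for the objects a crawler would pass in.
// They only record what the handler did, so the flow can be inspected.
function createFakeContext(url, title) {
    const calls = { saved: [], logged: [], enqueued: 0 };
    const context = {
        request: { loadedUrl: url },
        page: { title: async () => title },
        log: { info: (message) => calls.logged.push(message) },
        pushData: async (item) => { calls.saved.push(item); },
        enqueueLinks: async () => { calls.enqueued += 1; },
    };
    return { context, calls };
}

// Dry run (inside an async function or an ES module):
//   const { context, calls } = createFakeContext('https://crawlee.dev', 'Crawlee');
//   await handlePage(context);
//   calls.saved then holds the record the handler would have stored.
```

If the assumption about pushData holds for your Crawlee version, the same function could be passed to the crawler as requestHandler: handlePage.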

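By default, Crawlee stores data in ./storage in the current working directory. You can override this directory through the Crawlee configuration. For details, see the Configuration guide, Request storage, and Result storage.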
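To inspect what a crawl saved, the stored records can be read back with plain Node.js. The sketch below rests on two assumptions: that the default dataset is laid out as one numbered JSON file per record under storage/datasets/default, and that the CRAWLEE_STORAGE_DIR environment variable is how the storage directory was overridden. The helper names resolveDatasetDir and readDataset are hypothetical and not part of Crawlee.

```javascript
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

// Hypothetical helper: locate the default dataset directory.
// Assumption: CRAWLEE_STORAGE_DIR, when set, replaces ./storage.
function resolveDatasetDir(env = process.env) {
    const storageDir = env.CRAWLEE_STORAGE_DIR ?? './storage';
    return path.join(storageDir, 'datasets', 'default');
}

// Hypothetical helper: read every stored record, in file-name order.
// Assumption: records are files named with digits only, e.g. 000000001.json;
// anything else in the directory is skipped.
async function readDataset(datasetDir) {
    const names = (await readdir(datasetDir))
        .filter((name) => /^\d+\.json$/.test(name))
        .sort();

    const items = [];
    for (const name of names) {
        const text = await readFile(path.join(datasetDir, name), 'utf8');
        items.push(JSON.parse(text));
    }
    return items;
}

// Usage (after a crawl has finished):
//   const items = await readDataset(resolveDatasetDir());
//   console.log(`Read ${items.length} records`);
```

Reading the files directly is only meant for quick inspection; inside a crawler project, the storage APIs described in the Result storage guide are the more robust route.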