System Design Interview 9: Design a Web Crawler

Published 2024-04-10 14:10:14

In this chapter, we focus on web crawler design: an interesting and classic system design interview question.


A web crawler is also known as a robot or spider. It is widely used by search engines to discover new or updated content on the web. Content can be a web page, an image, a video, a PDF file, etc. A web crawler starts by collecting a few web pages and then follows links on those pages to collect new content. Figure 1 shows a visual example of the crawl process.


A crawler is used for many purposes:


  • Search engine indexing: This is the most common use case. A crawler collects web pages to create a local index for search engines. For example, Googlebot is the web crawler behind the Google search engine.
  • Web archiving: This is the process of collecting information from the web to preserve data for future uses. For instance, many national libraries run crawlers to archive web sites. Notable examples are the US Library of Congress and the EU web archive.
  • Web mining: The explosive growth of the web presents an unprecedented opportunity for data mining. Web mining helps to discover useful knowledge from the internet. For example, top financial firms use crawlers to download shareholder meetings and annual reports to learn key company initiatives.
  • Web monitoring: Crawlers help to monitor copyright and trademark infringements over the Internet. For example, Digimarc utilizes crawlers to discover and report pirated works.

The complexity of developing a web crawler depends on the scale we intend to support. It could be a small school project that takes only a few hours to complete, or a gigantic project that requires continuous improvement from a dedicated engineering team. Thus, we will explore the scale and features to support below.


1 Step 1 - Understand the problem and establish design scope

The basic algorithm of a web crawler is simple:


  1. Given a set of URLs, download all the web pages addressed by the URLs.
  2. Extract URLs from these web pages.
  3. Add new URLs to the list of URLs to be downloaded. Repeat these 3 steps.
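
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `crawl`, the `fetch` callback, the regular expression, and the in-memory `FAKE_WEB` dictionary are all hypothetical names introduced here, and the dictionary stands in for real HTTP downloads so the example runs offline.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Naive link pattern, good enough for this illustration only.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_pages=100):
    """Run the three-step loop. `fetch(url)` returns HTML text or None."""
    to_download = deque(seed_urls)   # URLs waiting to be downloaded
    seen = set(seed_urls)            # URLs already queued or downloaded
    pages = {}
    while to_download and len(pages) < max_pages:
        url = to_download.popleft()
        html = fetch(url)                        # step 1: download the page
        if html is None:
            continue                             # unreachable page, skip it
        pages[url] = html
        for href in HREF_RE.findall(html):       # step 2: extract URLs
            link = urljoin(url, href)
            if link not in seen:                 # step 3: queue new URLs
                seen.add(link)
                to_download.append(link)
    return pages


# A tiny in-memory "web" (hypothetical) replaces real network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": '<a href="/a">A</a> <a href="/c">C</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other; the rest of the chapter is largely about replacing each of these naive pieces with something that scales.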

Is a web crawler really as simple as this basic algorithm? Not exactly. Designing a vastly scalable web crawler is an extremely complex task. It is unlikely that anyone can design a massive web crawler within the interview duration. Before jumping into the design, we must ask questions to understand the requirements and establish design scope:


Candidate: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something else? Interviewer: Search engine indexing.



Candidate: How many web pages does the web crawler collect per month? Interviewer: 1 billion pages.



Candidate: What content types are included? HTML only or other content types such as PDFs and images as well? Interviewer: HTML only.



Candidate: Shall we consider newly added or edited web pages? Interviewer: Yes, we should consider the newly added or edited web pages.



Candidate: Do we need to store HTML pages crawled from the web? Interviewer: Yes, up to 5 years.



Candidate: How do we handle web pages with duplicate content? Interviewer: Pages with duplicate content should be ignored.



Above are some of the sample questions that you can ask your interviewer. It is important to understand the requirements and clarify ambiguities. Even if you are asked to design a straightforward product like a web crawler, you and your interviewer might not have the same assumptions.


Besides the functionalities to clarify with your interviewer, it is also important to note down the following characteristics of a good web crawler:


  • Scalability: The web is very large. There are billions of web pages out there. Web crawling should be extremely efficient using parallelization.
  • Robustness: The web is full of traps. Bad HTML, unresponsive servers, crashes, malicious links, etc. are all common. The crawler must handle all those edge cases.
  • Politeness: The crawler should not make too many requests to a website within a short time interval.
  • Extensibility: The system is flexible so that minimal changes are needed to support new content types. For example, if we want to crawl image files in the future, we should not need to redesign the entire system.

Back of the envelope estimation

The following estimations are based on many assumptions, and it is important to communicate with the interviewer to be on the same page.


  • Assume 1 billion web pages are downloaded every month.
  • QPS: 1,000,000,000 / 30 days / 24 hours / 3600 seconds = ~400 pages per second.
  • Peak QPS = 2 * QPS = 800.
  • Assume the average web page size is 500 KB.
  • 1 billion pages x 500 KB = 500 TB of storage per month. If you are unclear about digital storage units, go through the “Power of 2” section in the "Back-of-the-envelope Estimation" chapter again.
  • Assuming data are stored for five years, 500 TB * 12 months * 5 years = 30 PB. 30 PB of storage is needed to store five years of content.
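
The arithmetic above can be checked with a short script. It uses the same assumptions as the list (a 30-day month, 500 KB per page, decimal storage units); the variable names are introduced here purely for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600        # a 30-day month, as in the estimate
pages_per_month = 1_000_000_000
avg_page_kb = 500                         # assumed average page size in KB

qps = pages_per_month / SECONDS_PER_MONTH     # roughly 386, rounded to ~400 in the text
peak_qps = 2 * qps

# Decimal units: 1 TB = 1,000,000,000 KB and 1 PB = 1,000 TB.
storage_tb_per_month = pages_per_month * avg_page_kb / 1_000_000_000
storage_pb_5_years = storage_tb_per_month * 12 * 5 / 1_000

print(f"QPS ~ {qps:.0f}, peak QPS ~ {peak_qps:.0f}")
print(f"{storage_tb_per_month:.0f} TB per month, {storage_pb_5_years:.0f} PB over 5 years")
```

The exact QPS comes out slightly below 400; the estimate rounds up, which is the safer direction for capacity planning.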

2 Step 2 - Propose high-level design and get buy-in

Once the requirements are clear, we move on to the high-level design. Inspired by previous studies on web crawling, we propose a high-level design as shown in Figure 2.


First, we explore each design component to understand their functionalities. Then, we examine the crawler workflow step-by-step.


Seed URLs


A web crawler uses seed URLs as a starting point for the crawl process. For example, to crawl all web pages from a university’s website, an intuitive way to select seed URLs is to use the university’s domain name.


To crawl the entire web, we need to be creative in selecting seed URLs. A good seed URL serves as a good starting point that a crawler can utilize to traverse as many links as possible. The general strategy is to divide the entire URL space into smaller ones. The first proposed approach is based on locality as different countries may have different popular websites. Another way is to choose seed URLs based on topics; for example, we can divide URL space into shopping, sports, healthcare, etc. Seed URL selection is an open-ended question. You are not expected to give the perfect answer. Just think out loud.


URL Frontier


Most modern web crawlers split the crawl state into two: to be downloaded and already downloaded. The component that stores URLs to be downloaded is called the URL Frontier. You can think of it as a First-in-First-out (FIFO) queue. For detailed information about the URL Frontier, refer to the deep dive.


HTML Downloader


The HTML downloader downloads web pages from the internet. Those URLs are provided by the URL Frontier.


DNS Resolver


To download a web page, a URL must be translated into an IP address. The HTML Downloader calls the DNS Resolver to get the corresponding IP address for the URL. For instance, URL www.wikipedia.org is converted to IP address as of 3/5/2019.


Content Parser


After a web page is downloaded, it must be parsed and validated because malformed web pages could provoke problems and waste storage space. Implementing a content parser in a crawl server will slow down the crawling process. Thus, the content parser is a separate component.


Content Seen?


Online research reveals that 29% of web pages are duplicated content, which may cause the same content to be stored multiple times. We introduce the “Content Seen?” data structure to eliminate data redundancy and shorten processing time. It helps to detect whether newly downloaded content has already been stored in the system. To compare two HTML documents, we can compare them character by character. However, this method is slow and time-consuming, especially when billions of web pages are involved. An efficient way to accomplish this task is to compare the hash values of the two web pages.
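
A minimal sketch of the hash comparison, assuming exact-duplicate detection with SHA-256 from Python's standard library. The `ContentSeen` class and its method name are hypothetical, and the fingerprints are kept in an in-memory set here, whereas a real crawler would need persistent, shared storage.

```python
import hashlib


class ContentSeen:
    """Remembers a fingerprint of every page body it has been shown."""

    def __init__(self):
        self._fingerprints = set()

    def is_duplicate(self, html: str) -> bool:
        """Return True if identical content was seen before; otherwise record it."""
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in self._fingerprints:
            return True
        self._fingerprints.add(digest)
        return False


content_seen = ContentSeen()
first = content_seen.is_duplicate("<html><body>hello</body></html>")
second = content_seen.is_duplicate("<html><body>hello</body></html>")
print(first, second)
```

Comparing fixed-size digests costs the same regardless of page size, but this only catches byte-identical pages; pages that differ by a timestamp or an ad slot would need a similarity hash instead.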


Content Storage


It is a storage system for storing HTML content. The choice of storage system depends on factors such as data type, data size, access frequency, life span, etc. Both disk and memory are used.


  • Most of the content is stored on disk because the data set is too big to fit in memory.
  • Popular content is kept in memory to reduce latency.

URL Extractor


URL Extractor parses and extracts links from HTML pages. Figure 3 shows an example of a link extraction process. Relative paths are converted to absolute URLs by adding the “https://en.wikipedia.org” prefix.
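
Link extraction and the relative-to-absolute conversion can be sketched with Python's standard `html.parser` and `urllib.parse.urljoin`. The `LinkExtractor` class, the `extract_links` helper, and the sample HTML are illustrative assumptions, not the crawler's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = '<a href="/wiki/Web_crawler">crawler</a> <a href="https://www.wikipedia.org/">home</a>'
links = extract_links(sample, "https://en.wikipedia.org")
print(links)
```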


URL Filter


The URL filter excludes certain content types, file extensions, error links and URLs in “blacklisted” sites.
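
As a rough sketch, such a filter can be a single predicate over the parsed URL. The blocklist, the extension list, and the function name `is_allowed` below are hypothetical examples chosen for illustration.

```python
from urllib.parse import urlparse

BLOCKED_HOSTS = {"spam.example"}                      # hypothetical blocklisted site
BLOCKED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip")  # hypothetical excluded file types


def is_allowed(url: str) -> bool:
    """Return True if the URL should be passed on to the next component."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False                                  # e.g. mailto: or javascript: links
    if parts.hostname in BLOCKED_HOSTS:
        return False
    return not parts.path.lower().endswith(BLOCKED_EXTENSIONS)


candidates = [
    "https://en.wikipedia.org/wiki/Web_crawler",
    "https://en.wikipedia.org/logo.PNG",
    "https://spam.example/index.html",
    "mailto:someone@example.com",
]
kept = [u for u in candidates if is_allowed(u)]
print(kept)
```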


URL Seen?


“URL Seen?” is a data structure that keeps track of URLs that have been visited before or are already in the Frontier. “URL Seen?” helps to avoid adding the same URL multiple times, as this can increase server load and cause potential infinite loops.


Bloom filters and hash tables are common techniques to implement the “URL Seen?” component. We will not cover the detailed implementation of the bloom filter and hash table here. For more information, refer to Burton H. Bloom's paper “Space/Time Trade-Offs in Hash Coding with Allowable Errors” and to “Mercator: A Scalable, Extensible Web Crawler” by Allan Heydon and Marc Najork.
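
To give a flavour of the idea without going into detail, here is a deliberately small Bloom filter sketch. The class, its default sizes, and the choice of salted SHA-256 as the hash family are assumptions made for this illustration; production crawlers size the bit array and the number of hashes from the expected URL count and the false-positive rate they can tolerate.

```python
import hashlib


class BloomFilter:
    """Set-membership test that may report false positives but never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


url_seen = BloomFilter()
url_seen.add("https://en.wikipedia.org/wiki/Web_crawler")
print("https://en.wikipedia.org/wiki/Web_crawler" in url_seen)
```

The trade-off compared with a hash table is memory: the filter stores a fixed number of bits per URL instead of the URL itself, at the cost of occasionally skipping a URL that was never actually visited.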

URL Storage


URL Storage stores already visited URLs.


So far, we have discussed every system component. Next, we put them together to explain the workflow.


Web crawler workflow


To better explain the workflow step-by-step, sequence numbers are added in the design diagram as shown in Figure 4.


Step 1: Add seed URLs to the URL Frontier


Step 2: HTML Downloader fetches a list of URLs from URL Frontier.


Step 3: HTML Downloader gets IP addresses of URLs from DNS resolver and starts downloading.


Step 4: Content Parser parses HTML pages and checks if pages are malformed.


Step 5: After content is parsed and validated, it is passed to the “Content Seen?” component.


Step 6: The “Content Seen?” component checks if an HTML page is already in the storage.


  • If it is in the storage, this means the same content in a different URL has already been processed. In this case, the HTML page is discarded.
  • If it is not in the storage, the system has not processed the same content before. The content is passed to Link Extractor.

Step 7: Link extractor extracts links from HTML pages.


Step 8: Extracted links are passed to the URL filter.


Step 9: After links are filtered, they are passed to the “URL Seen?” component.


Step 10: The “URL Seen?” component checks if a URL is already in the storage. If yes, it has been processed before, and nothing needs to be done.


Step 11: If a URL has not been processed before, it is added to the URL Frontier.


3 Step 3 - Design deep dive

Up until now, we have discussed the high-level design. Next, we will discuss the most important building components and techniques in depth:


  • Depth-first search (DFS) vs Breadth-first search (BFS)
  • URL frontier
  • HTML Downloader
  • Robustness
  • Extensibility
  • Detect and avoid problematic content

3.1 DFS vs BFS

You can think of the web as a directed graph where web pages serve as nodes and hyperlinks (URLs) as edges. The crawl process can be seen as traversing a directed graph from one web page to others. Two common graph traversal algorithms are DFS and BFS. However, DFS is usually not a good choice because the depth of DFS can be very deep.


BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:


  • Most links from the same web page are linked back to the same host. In Figure 5, all the links in wikipedia.com are internal links, keeping the crawler busy processing URLs from the same host (wikipedia.com). When the crawler tries to download web pages in parallel, Wikipedia servers will be flooded with requests. This is considered “impolite”.
  • Standard BFS does not take the priority of a URL into consideration. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc.

3.2 URL frontier

The URL frontier helps to address these problems. A URL frontier is a data structure that stores URLs to be downloaded. It is an important component for ensuring politeness, URL prioritization, and freshness. A few noteworthy papers on the URL frontier are mentioned in the reference materials; “Web Crawling” by Christopher Olston and Marc Najork is particularly worth reading. The findings from these papers are as follows:

Politeness

Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered “impolite” and may even be treated as a denial-of-service (DoS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers.


The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue. Figure 6 shows the design that manages politeness.


  • Queue router: It ensures that each queue (b1, b2, … bn) only contains URLs from the same host.
  • Mapping table: It maps each host to a queue (see Table 1).
  • FIFO queues b1, b2 to bn: Each queue contains URLs from the same host.
  • Queue selector: Each worker thread is mapped to a FIFO queue, and it only downloads URLs from that queue. The queue selection logic is done by the Queue selector.
  • Worker threads 1 to N: A worker thread downloads web pages one by one from the same host. A delay can be added between two download tasks.
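
The politeness components listed above can be sketched as one small class. This is a single-threaded simplification under stated assumptions: the `BackQueues` class, its method names, and the fixed per-host delay are hypothetical, and instead of binding one worker thread to each queue, a single selector hands out the next URL whose host is allowed to be contacted again.

```python
import time
from collections import deque
from urllib.parse import urlparse


class BackQueues:
    """Queue router, mapping table, and per-host FIFO queues in one place."""

    def __init__(self, delay_seconds=1.0):
        self.delay_seconds = delay_seconds
        self.queues = {}         # mapping table: host -> FIFO queue of URLs
        self.next_allowed = {}   # host -> earliest time of the next download

    def route(self, url):
        """Queue router: append the URL to the queue that belongs to its host."""
        host = urlparse(url).hostname
        self.queues.setdefault(host, deque()).append(url)

    def next_url(self, now=None):
        """Queue selector: return a URL whose host delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, float("-inf")) <= now:
                self.next_allowed[host] = now + self.delay_seconds
                return queue.popleft()
        return None


back = BackQueues(delay_seconds=2.0)
for url in ["https://a.test/1", "https://a.test/2", "https://b.test/1"]:
    back.route(url)

# Explicit timestamps make the behaviour easy to follow.
schedule = [back.next_url(now=t) for t in (0.0, 0.5, 1.0, 2.0)]
print(schedule)
```

In the run above, the second URL of `a.test` is held back until two seconds after the first one, while `b.test` is served in between; that is the per-host delay at work.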
Priority

A random post from a discussion forum about Apple products carries very different weight than posts on the Apple home page. Even though they both have the “Apple” keyword, it is sensible for a crawler to crawl the Apple home page first.


We prioritize URLs based on usefulness, which can be measured by PageRank, website traffic, update frequency, etc. “Prioritizer” is the component that handles URL prioritization. Refer to the reference materials for in-depth information about this concept.


Figure 7 shows the design that manages URL priority.


  • Prioritizer: It takes URLs as input and computes the priorities.
  • Queue f1 to fn: Each queue has an assigned priority. Queues with high priority are selected with higher probability.
  • Queue selector: Randomly choose a queue with a bias towards queues with higher priority.
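
A biased random selection over priority queues can be sketched with `random.choices`. The `FrontQueues` class, the linear weights, and the convention that level 0 is the highest priority are assumptions made for this example; the prioritizer itself (how a URL gets its level) is left to the caller.

```python
import random
from collections import deque


class FrontQueues:
    """Priority queues f1..fn with a randomly biased queue selector."""

    def __init__(self, num_levels=3, seed=None):
        # queues[0] holds the highest-priority URLs, queues[-1] the lowest.
        self.queues = [deque() for _ in range(num_levels)]
        # Hypothetical weights: with 3 levels, 3:2:1 in favour of level 0.
        self.weights = [num_levels - i for i in range(num_levels)]
        self.rng = random.Random(seed)

    def put(self, url, priority_level):
        self.queues[priority_level].append(url)

    def get(self):
        """Pick a non-empty queue at random, biased towards higher priority."""
        candidates = [i for i, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        weights = [self.weights[i] for i in candidates]
        level = self.rng.choices(candidates, weights=weights, k=1)[0]
        return self.queues[level].popleft()


front = FrontQueues(num_levels=3, seed=42)
front.put("https://www.apple.com/", 0)           # high priority
front.put("https://forum.example/post/1", 2)     # low priority
drained_front = [front.get(), front.get()]
print(drained_front)
```

Random selection with a bias, rather than strict priority order, keeps low-priority queues from starving when high-priority URLs keep arriving.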

Figure 8 presents the URL frontier design, and it contains two modules:


  • Front queues: manage prioritization.
  • Back queues: manage politeness.

Freshness

Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep the data set fresh. Recrawling all the URLs is time-consuming and resource intensive. A few strategies to optimize freshness are listed as follows:


  • Recrawl based on web pages’ update history.
  • Prioritize URLs and recrawl important pages first and more frequently.

Storage for URL Frontier

In a real-world crawl for search engines, the number of URLs in the frontier could be hundreds of millions. Putting everything in memory is neither durable nor scalable. Keeping everything on disk is not desirable either, because the disk is slow and can easily become a bottleneck for the crawl.


We adopt a hybrid approach. The majority of URLs are stored on disk, so the storage space is not a problem. To reduce the cost of reading from the disk and writing to the disk, we maintain buffers in memory for enqueue/dequeue operations. Data in the buffer is periodically written to the disk.
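
One way to picture the hybrid approach is a FIFO queue that keeps small enqueue and dequeue buffers in memory and stores everything else in a file. The sketch below is an assumption-laden illustration: `DiskBackedQueue`, the one-URL-per-line file layout, and flushing when the buffer fills (rather than on a timer) are choices made here for brevity, and the file is never compacted.

```python
import os
import tempfile
from collections import deque


class DiskBackedQueue:
    """FIFO queue of URLs: small in-memory buffers in front of an append-only file."""

    def __init__(self, path, buffer_size=1000):
        self.path = path
        self.buffer_size = buffer_size
        self.write_buf = []        # URLs not yet written to disk
        self.read_buf = deque()    # URLs already read back from disk
        self.read_offset = 0       # byte offset of the next unread line
        open(path, "wb").close()   # start with an empty file

    def enqueue(self, url):
        self.write_buf.append(url)
        if len(self.write_buf) >= self.buffer_size:
            self._flush()

    def dequeue(self):
        if not self.read_buf:
            self._refill()
        return self.read_buf.popleft() if self.read_buf else None

    def _flush(self):
        with open(self.path, "ab") as f:
            for url in self.write_buf:
                f.write(url.encode("utf-8") + b"\n")
        self.write_buf.clear()

    def _refill(self):
        self._flush()              # pending URLs must reach disk to keep FIFO order
        with open(self.path, "rb") as f:
            f.seek(self.read_offset)
            for _ in range(self.buffer_size):
                line = f.readline()
                if not line:
                    break
                self.read_buf.append(line.decode("utf-8").rstrip("\n"))
            self.read_offset = f.tell()


with tempfile.TemporaryDirectory() as tmp:
    frontier = DiskBackedQueue(os.path.join(tmp, "frontier.txt"), buffer_size=2)
    for url in ["https://a.test/1", "https://a.test/2", "https://b.test/1"]:
        frontier.enqueue(url)
    drained = [frontier.dequeue() for _ in range(4)]
print(drained)
```

Each disk access moves up to `buffer_size` URLs at once, which is where the saving over per-URL disk reads and writes comes from.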


3.3 HTML Downloader

The HTML Downloader downloads web pages from the internet using the HTTP protocol. Before discussing the HTML Downloader, we first look at the Robots Exclusion Protocol (robots.txt).

Robots.txt, also called the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies what pages crawlers are allowed to download. Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules.


To avoid repeated downloads of the robots.txt file, we cache its contents. The file is downloaded and saved to the cache periodically. Here is a piece of the robots.txt file taken from https://www.amazon.com/robots.txt. Some of the directories, like creatorhub, are disallowed for Googlebot.


User-agent: Googlebot
Disallow: /creatorhub/*
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
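
Checking a URL against cached rules can be sketched with Python's standard `urllib.robotparser`. The snippet parses rule text that is already in memory instead of fetching it, which mirrors the caching idea. The standard-library parser matches rules as path prefixes, so this sketch only uses the two rules above that contain no wildcard; the `allowed` helper and the example paths being tested are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Cached robots.txt text; only the wildcard-free rules from the excerpt are used.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse the cached copy, no network access


def allowed(user_agent, url):
    """Return True if the cached rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("Googlebot", "https://www.amazon.com/gp/aw/cr/123"))
print(allowed("Googlebot", "https://www.amazon.com/gp/help"))
```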

Besides robots.txt, performance optimization is another important concept we will cover for the HTML downloader.


Performance optimization

Below is a list of performance optimizations for HTML downloader.


  1. Distributed crawl

To achieve high performance, crawl jobs are distributed across multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces, so each downloader is responsible for a subset of the URLs. Figure 9 shows an example of a distributed crawl.


  2. Cache DNS Resolver

DNS Resolver is a bottleneck for crawlers because DNS requests might take time due to the synchronous nature of many DNS interfaces. DNS response time ranges from 10ms to 200ms. Once a request to DNS is carried out by a crawler thread, other threads are blocked until the first request is completed. Maintaining our own DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.


  3. Locality

Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download time. Design locality applies to most of the system components: crawl servers, cache, queue, storage, etc.


  4. Short timeout

Some web servers respond slowly or may not respond at all. To avoid long wait times, a maximal wait time is specified. If a host does not respond within a predefined time, the crawler will stop the job and crawl some other pages.
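
Of the optimizations above, the DNS cache is the easiest to sketch in code. The example below is a simplification under stated assumptions: `DnsCache` and its parameters are hypothetical, entries expire after a time-to-live instead of being refreshed by cron jobs, and a fake resolver replaces real DNS lookups so the example runs offline (in practice `socket.gethostbyname` could be passed in).

```python
import socket
import time


class DnsCache:
    """Caches hostname -> IP lookups for a fixed time-to-live."""

    def __init__(self, resolver=socket.gethostbyname, ttl_seconds=3600.0, clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # hostname -> (ip, expiry time)

    def resolve(self, hostname):
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached and cached[1] > now:
            return cached[0]                       # cache hit, no DNS call
        ip = self._resolver(hostname)              # cache miss or expired entry
        self._entries[hostname] = (ip, now + self._ttl)
        return ip


lookups = []


def fake_resolver(hostname):
    """Stand-in for a real DNS call; records how often it is invoked."""
    lookups.append(hostname)
    return "192.0.2.1"     # address from the documentation-only range


dns_cache = DnsCache(resolver=fake_resolver, ttl_seconds=60.0)
ips = [dns_cache.resolve("www.wikipedia.org") for _ in range(3)]
print(ips, len(lookups))
```

Three lookups for the same host trigger only one call to the resolver; the other two are answered from memory.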


3.4 Robustness

Besides performance optimization, robustness is also an important consideration. We present a few approaches to improve the system robustness:


  • Consistent hashing: This helps to distribute loads among downloaders. A downloader server can be added or removed using consistent hashing. Refer to the "Design consistent hashing" chapter for more details.
  • Save crawl states and data: To guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading saved states and data.
  • Exception handling: Errors are inevitable and common in a large-scale system. The crawler must handle exceptions gracefully without crashing the system.
  • Data validation: This is an important measure to prevent system errors.
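
For the consistent hashing item, a compact hash ring with virtual nodes illustrates why adding a downloader moves only a fraction of the hosts. The `HashRing` class, the server names, and the use of SHA-256 as the ring hash are assumptions for this sketch; the dedicated chapter covers the technique properly.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hashing of hostnames onto downloader servers."""

    def __init__(self, servers=(), virtual_nodes=100):
        self.virtual_nodes = virtual_nodes
        self._ring = []                      # sorted list of (point, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.virtual_nodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def server_for(self, host):
        """Return the first server clockwise from the host's point on the ring."""
        if not self._ring:
            raise LookupError("no downloader servers registered")
        index = bisect.bisect_right(self._ring, (_hash(host), "")) % len(self._ring)
        return self._ring[index][1]


hosts = [f"site{i}.example" for i in range(200)]
ring = HashRing(["crawler-1", "crawler-2", "crawler-3"])
before = {h: ring.server_for(h) for h in hosts}
ring.add("crawler-4")
after = {h: ring.server_for(h) for h in hosts}
moved = [h for h in hosts if before[h] != after[h]]
print(f"{len(moved)} of {len(hosts)} hosts moved to the new server")
```

Every host that changes owner moves to the newly added server; assignments among the existing servers stay untouched, which is the property that makes adding or removing downloaders cheap.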

3.5 Extensibility

As almost every system evolves, one of the design goals is to make the system flexible enough to support new content types. The crawler can be extended by plugging in new modules. Figure 10 shows how to add new modules.


  • A PNG Downloader module is plugged in to download PNG files.
  • A Web Monitor module is added to monitor the web and prevent copyright and trademark infringements.

3.6 Detect and avoid problematic content

This section discusses the detection and prevention of redundant, meaningless, or harmful content.


  1. Redundant content

As discussed previously, nearly 30% of the web pages are duplicates. Hashes or checksums help to detect duplication.


  2. Spider traps

A spider trap is a web page that traps a crawler in an infinite loop. For instance, an infinitely deep directory structure looks as follows: http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/ … Such spider traps can be avoided by setting a maximal length for URLs. However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to an unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters.


  3. Data noise

Some content has little or no value, such as advertisements, code snippets, spam URLs, etc. Such content is not useful for crawlers and should be excluded if possible.
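
The maximal-URL-length guard against spider traps mentioned above can be written as a simple URL check. The thresholds and the function name `looks_like_trap` are hypothetical, and the repeated-path-segment test is an extra heuristic added here for illustration; as the text notes, no automatic rule catches every trap.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_URL_LENGTH = 2000       # hypothetical limit on the total URL length
MAX_SEGMENT_REPEATS = 3     # hypothetical limit on how often one path segment may repeat


def looks_like_trap(url: str) -> bool:
    """Flag URLs that are too long or keep repeating the same path segment."""
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return False
    return max(Counter(segments).values()) > MAX_SEGMENT_REPEATS


print(looks_like_trap("http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/foo/bar/"))
print(looks_like_trap("http://www.spidertrapexample.com/foo/bar/"))
```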


4 Step 4 - Wrap up

In this chapter, we first discussed the characteristics of a good crawler: scalability, politeness, extensibility, and robustness. Then, we proposed a design and discussed key components. Building a scalable web crawler is not a trivial task because the web is enormously large and full of traps. Even though we have covered many topics, we still miss many relevant talking points:


  • Server-side rendering: Numerous websites use scripts like JavaScript, AJAX, etc. to generate links on the fly. If we download and parse web pages directly, we will not be able to retrieve dynamically generated links. To solve this problem, we perform server-side rendering (also called dynamic rendering) first before parsing a page.
  • Filter out unwanted pages: With finite storage capacity and crawl resources, an anti-spam component is beneficial in filtering out low quality and spam pages.
  • Database replication and sharding: Techniques like replication and sharding are used to improve the data layer availability, scalability, and reliability.
  • Horizontal scaling: For a large-scale crawl, hundreds or even thousands of servers are needed to perform download tasks. The key is to keep servers stateless.
  • Availability, consistency, and reliability: These concepts are at the core of any large system’s success.
  • Analytics: Collecting and analyzing data are important parts of any system because data is a key ingredient for fine-tuning.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!


