System Design

Design a Web Crawler

Fetch and index billions of pages politely without revisiting endlessly.

Requirements

  • Discover and download pages across the web at large scale.
  • Avoid duplicate work and respect site rules.
  • Refresh changing pages over time.

High level design

A frontier of URLs to visit feeds workers that fetch and parse pages, then push newly extracted links back into the frontier.

  • URL frontier: a priority queue that orders URLs by importance while enforcing politeness per host.
  • Fetchers: workers that download pages, honoring robots.txt rules and crawl delays.
  • Parser and dedup: extract links and content, then drop already-seen URLs and near-duplicate pages.
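
The loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: FAKE_WEB, LinkExtractor, extract_links, and crawl are hypothetical names, and an in-memory dictionary stands in for real HTTP fetches so the example runs offline. Politeness and robots.txt handling are left out here to keep the frontier, fetch, parse, and dedup steps visible.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://a.example/": '<a href="/one">1</a> <a href="http://b.example/">b</a>',
    "http://a.example/one": '<a href="/">home</a> <a href="/one#top">self</a>',
    "http://b.example/": '<a href="http://a.example/one">back</a>',
}


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop #fragments so equivalent URLs match.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed, fetch=FAKE_WEB.get):
    frontier = deque([seed])  # URLs waiting to be fetched (plain FIFO here)
    seen = {seed}             # every URL ever queued, for URL-level dedup
    order = []                # pages fetched, in fetch order
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # unknown page: nothing to parse
            continue
        order.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl("http://a.example/"))
```

Although the three pages link to each other in a cycle, the seen set ensures each one is fetched once and the loop terminates.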

Bottlenecks

  • Politeness: hitting one host too hard is abusive, so keep a queue per host and space requests to that host by its crawl delay.
  • Duplicate detection: the same content appears at many URLs, so hash page content to catch repeats and check seen URLs in a fast set.
  • Trap avoidance: infinite calendar-style link chains waste effort, so cap crawl depth and detect cyclic URL patterns.
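
One way to combine the three mitigations above is a frontier that keeps a queue per host, a heap of the earliest time each host may be contacted again, a seen-URL set, and a depth cap. The sketch below is an assumption-laden illustration: PoliteFrontier, content_fingerprint, the uniform crawl_delay, and the max_depth value are all hypothetical, and time is passed in explicitly so the behavior is easy to follow. The SHA-256 fingerprint only catches byte-identical content; near-duplicate detection would need a similarity technique that is not shown here.

```python
import hashlib
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host FIFO queues plus a heap of (next_allowed_time, host)."""

    def __init__(self, crawl_delay=1.0, max_depth=5):
        self.crawl_delay = crawl_delay    # assumed uniform delay, in seconds
        self.max_depth = max_depth        # assumed cap against link traps
        self.queues = defaultdict(deque)  # host -> pending (url, depth)
        self.ready = []                   # heap of (next_allowed_time, host)
        self.next_allowed = {}            # host -> earliest next fetch time
        self.seen_urls = set()

    def add(self, url, depth=0, now=0.0):
        # Trap guard and URL dedup happen before anything is queued.
        if depth > self.max_depth or url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        host = urlsplit(url).netloc
        if not self.queues[host]:
            # Host has no pending work, so it has no heap entry yet.
            start = max(now, self.next_allowed.get(host, 0.0))
            heapq.heappush(self.ready, (start, host))
        self.queues[host].append((url, depth))
        return True

    def pop(self, now):
        """Return (url, depth) for a host whose delay has elapsed, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url, depth = self.queues[host].popleft()
        self.next_allowed[host] = now + self.crawl_delay
        if self.queues[host]:
            # More work for this host: reschedule it after the delay.
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        return url, depth


def content_fingerprint(body: bytes) -> str:
    """Exact-duplicate fingerprint of a page body."""
    return hashlib.sha256(body).hexdigest()
```

With a 2-second delay and two URLs queued for the same host, the second URL is only handed out once the caller's clock reaches 2 seconds after the first fetch, while URLs for other hosts remain available in the meantime.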

Store fetched content for indexing and schedule recrawls based on how often a page changes.
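
A simple way to schedule recrawls by change frequency is to shorten a page's revisit interval when its content changed since the last fetch and lengthen it when it did not. The function below is a hypothetical sketch of that idea; the halving and doubling factors and the 1-hour and 720-hour bounds are illustrative assumptions, not values taken from this lesson.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Adapt the revisit interval from whether the page changed."""
    # Changed pages are revisited sooner, stable pages less often.
    factor = 0.5 if changed else 2.0
    # Clamp so no page is hammered or forgotten entirely.
    return min(max_hours, max(min_hours, current_hours * factor))


interval = 24.0
for changed in (False, False, True):
    interval = next_interval(interval, changed)
print(interval)  # 24 -> 48 -> 96 -> 48.0
```

Whether a page changed can be decided by comparing a hash of the newly fetched content with the hash stored from the previous fetch.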

Key idea

A web crawler is a politeness-aware queue of URLs feeding fetch and parse workers, with dedup and trap handling keeping the system from spinning forever.

Check yourself

1. Why queue URLs per host inside the frontier?

2. What stops the crawler from refetching the same content endlessly?