Design a Web Crawler

Requirements

Discover and download pages across the web at large scale.
Avoid duplicate work and respect site rules.
Refresh changing pages over time.

High level design

A frontier of URLs to visit feeds workers that fetch, parse, and extract new links back into the frontier.
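The loop described above can be sketched in a few lines. This is a minimal illustration rather than a real fetcher: FAKE_WEB, fetch_and_parse, and crawl are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://a.example/1", "http://c.example/"],
    "http://c.example/": [],
}


def fetch_and_parse(url):
    """Stand-in for fetch + parse: return the outgoing links of a page."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, max_pages=100):
    """Visit pages breadth-first, feeding newly found links back into the frontier."""
    frontier = deque(seeds)
    seen = set(seeds)  # URLs already queued, so each is visited at most once
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_and_parse(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Calling crawl(["http://a.example/"]) visits each of the four pages once, even though the pages link back to each other. The plain FIFO queue here ignores importance and politeness.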

URL frontier: a prioritized queue that orders by importance and politeness per host.
Fetchers: workers download pages, honoring robots rules and crawl delays.
Parser and dedup: extract links and content, then drop already seen URLs and near duplicate pages.
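One possible shape for a politeness-aware frontier is a queue per host plus a heap of hosts keyed by the earliest time each may be fetched again. The sketch below assumes a single fixed delay for every host and leaves out importance ordering; the class name PoliteFrontier and its methods are illustrative, and the caller passes the current time in so the behavior is easy to test.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host FIFO queues plus a heap of hosts ordered by next allowed fetch time."""

    def __init__(self, delay=1.0):
        self.delay = delay                # assumed fixed crawl delay, in seconds
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready_heap = []              # (earliest_fetch_time, host)
        self.scheduled = set()            # hosts that currently have a heap entry
        self.next_allowed = {}            # host -> earliest time of its next fetch

    def add(self, url, now):
        host = urlsplit(url).netloc
        self.queues[host].append(url)
        if host not in self.scheduled:
            # Respect the delay even if the host's queue had drained earlier.
            ready_at = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready_heap, (ready_at, host))
            self.scheduled.add(host)

    def pop(self, now):
        """Return a URL whose host may be fetched at `now`, or None if every host must wait."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_heap)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready_heap, (self.next_allowed[host], host))
        else:
            self.scheduled.discard(host)
        return url
```

With delay=2.0 and two URLs queued for the same host at time 0, the first pop at time 0 returns one of them and the second is held back until time 2. URLs on other hosts are not blocked in the meantime.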

Bottlenecks

Politeness: hitting one host too hard is abusive, so queue per host and space requests by the crawl delay.
Duplicate detection: the same content appears at many URLs, so hash content and check seen URLs in a fast set.
Trap avoidance: infinite calendar style links waste effort, so cap depth and detect cyclic patterns.
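The duplicate and trap checks above might look roughly like this. The thresholds MAX_DEPTH and MAX_SEGMENT_REPEATS are arbitrary assumptions, the helper names are hypothetical, and the content hash only catches exact duplicates after whitespace and case normalization; near-duplicate detection would need a similarity fingerprint, which is not shown.

```python
import hashlib
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

MAX_DEPTH = 8            # assumed cap on link depth from a seed
MAX_SEGMENT_REPEATS = 3  # assumed cap on repeats of one path segment


def canonicalize(url):
    """Normalize a URL so trivially different spellings map to one key."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lowercase scheme and host, drop the fragment, keep the query unchanged.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_fingerprint(html):
    """Hash whitespace- and case-normalized text (exact duplicates only)."""
    normalized = " ".join(html.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_like_trap(url, depth):
    """Flag URLs that are too deep or repeat one path segment suspiciously often."""
    if depth > MAX_DEPTH:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(n >= MAX_SEGMENT_REPEATS for n in Counter(segments).values())


class Deduper:
    """In-memory seen-sets; a large crawl would need a shared or on-disk store."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_content = set()

    def is_new_url(self, url):
        key = canonicalize(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, html):
        digest = content_fingerprint(html)
        if digest in self.seen_content:
            return False
        self.seen_content.add(digest)
        return True
```

A worker would call looks_like_trap and is_new_url before enqueueing a link, and is_new_content after fetching, before storing the page.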

Store fetched content for indexing and schedule recrawls based on how often a page changes.
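One simple way to tie the recrawl schedule to change frequency is to shorten a page's interval whenever a fetch finds new content and lengthen it otherwise. The halve-or-double rule and the bounds below are illustrative assumptions, not a prescribed policy.

```python
MIN_INTERVAL = 3600.0        # assumed floor: one hour, in seconds
MAX_INTERVAL = 30 * 86400.0  # assumed ceiling: thirty days, in seconds


def next_interval(current, changed):
    """Halve the recrawl interval if the page changed since the last fetch, else double it."""
    proposed = current / 2 if changed else current * 2
    # Clamp so hot pages are not hammered and cold pages are still revisited.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))
```

For example, a page on a one-day interval moves to half a day after a detected change and to two days after an unchanged fetch. Whether a page "changed" could be decided by comparing content hashes between fetches.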

Key idea

A web crawler is a politeness-aware queue of URLs feeding fetch and parse workers, with dedup and trap handling keeping the system from spinning forever.
