Requirements
- Discover and download pages across the web at large scale.
- Avoid duplicate work and respect site rules.
- Refresh changing pages over time.
High-level design
A frontier of URLs to visit feeds workers that fetch, parse, and extract new links back into the frontier.
- URL frontier: a priority queue that orders URLs by importance while enforcing per-host politeness.
- Fetchers: workers download pages, honoring robots rules and crawl delays.
- Parser and dedup: extract links and content, then drop already-seen URLs and near-duplicate pages.
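
One way to picture the frontier is the minimal sketch below. It is an illustration under stated assumptions, not a reference design: the `Frontier` class, its method names, and the fixed `delay` value are all hypothetical. Each host gets its own priority heap, and a second heap orders hosts by the earliest time they may be contacted again, so importance decides the order within a host and politeness decides when a host is eligible at all.

```python
import heapq
from urllib.parse import urlsplit


class Frontier:
    """Hypothetical frontier: one priority heap per host, plus a heap of
    hosts ordered by the earliest time each may be fetched again."""

    def __init__(self, delay=1.0):
        self.delay = delay    # assumed seconds between requests to one host
        self.queues = {}      # host -> heap of (-priority, seq, url)
        self.ready = []       # heap of (earliest_fetch_time, host)
        self.next_ok = {}     # host -> earliest time the next fetch is allowed
        self.seq = 0          # tie-breaker so equal priorities stay FIFO

    def add(self, url, priority=0.0, now=0.0):
        host = urlsplit(url).netloc
        if host not in self.queues:
            # Host was idle: schedule it, but never before its
            # politeness window has passed.
            self.queues[host] = []
            start = max(now, self.next_ok.get(host, now))
            heapq.heappush(self.ready, (start, host))
        heapq.heappush(self.queues[host], (-priority, self.seq, url))
        self.seq += 1

    def pop(self, now):
        """Return the best URL from a host that may be fetched at `now`,
        or None if every known host is still inside its delay window."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        _, _, url = heapq.heappop(self.queues[host])
        self.next_ok[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            del self.queues[host]
        return url
```

A fetch worker would call `pop(now)` in a loop, download the returned URL, and pass each extracted link back through `add`. In a real crawler the per-host delay would presumably come from the site's robots rules rather than a single constant.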
Bottlenecks
- Politeness: hitting one host too hard is abusive, so queue per host and space requests by the crawl delay.
- Duplicate detection: the same content appears at many URLs, so hash content and check seen URLs in a fast set.
- Trap avoidance: endlessly generated links, such as a calendar with a "next month" link on every page, waste effort, so cap depth and detect cyclic patterns.
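
The duplicate and trap checks can be sketched roughly as follows. Everything named here is an assumption for illustration: `normalize_url`, `SeenFilter`, `looks_like_trap`, and both thresholds are hypothetical. An exact SHA-256 hash only catches byte-identical content; near-duplicate detection would need a similarity fingerprint (SimHash is one known option), which is not shown.

```python
import hashlib
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

MAX_DEPTH = 10            # assumed cap on link depth from a seed URL
MAX_SEGMENT_REPEATS = 3   # assumed limit on one path segment repeating


def normalize_url(url):
    """Lowercase the host, drop the fragment, and default an empty path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class SeenFilter:
    """In-memory sets of normalized URLs and content hashes."""

    def __init__(self):
        self.urls = set()
        self.hashes = set()

    def is_new_url(self, url):
        key = normalize_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_new_content(self, body):
        # Exact-match check only: identical bytes give an identical digest.
        digest = hashlib.sha256(body).digest()
        if digest in self.hashes:
            return False
        self.hashes.add(digest)
        return True


def looks_like_trap(url, depth):
    """Flag URLs that are too deep or that repeat one path segment many times."""
    if depth > MAX_DEPTH:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(n > MAX_SEGMENT_REPEATS for n in Counter(segments).values())
```

At web scale the plain sets would likely be replaced by something more compact or distributed, but the control flow stays the same: check the URL before enqueueing it, and check the content hash after fetching.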
Store fetched content for indexing and schedule recrawls based on how often a page changes.
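
As one possible recrawl policy, the sketch below adapts each page's revisit interval to whether the last fetch found a change. The function name, the halve-or-double rule, and the bounds are assumptions chosen for illustration; the notes above do not prescribe a particular scheme.

```python
MIN_INTERVAL = 3600.0          # assumed floor: one hour, in seconds
MAX_INTERVAL = 30 * 86400.0    # assumed ceiling: thirty days, in seconds


def next_interval(current, changed):
    """Halve the interval when the page changed, double it when it did not,
    and clamp the result to [MIN_INTERVAL, MAX_INTERVAL]."""
    proposed = current / 2 if changed else current * 2
    return min(MAX_INTERVAL, max(MIN_INTERVAL, proposed))
```

Whether a page changed can be decided by comparing the stored content hash with the hash of the newly fetched copy, so the scheduler needs no extra signal beyond what dedup already computes.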
Key idea
A web crawler is a politeness-aware queue of URLs feeding fetch and parse workers, with dedup and trap handling keeping the system from spinning forever.