System Design

The Web Crawler Design

Fetching the web at scale with a frontier, politeness, and duplicate detection.

What a crawler does

A web crawler systematically fetches pages, extracts links, and follows them to discover more pages. At its core it is one giant loop: take a url, download it, parse out its links, and add any new urls to a queue.
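The loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `crawl` and `extract_links` names, the `max_pages` cap, and the in-memory `FAKE_WEB` dictionary that stands in for real HTTP fetching are all assumptions made so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page url and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text or None."""
    queue = deque(seed_urls)   # urls waiting to be downloaded
    seen = set(seed_urls)      # urls already queued, so we never loop
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()            # take a url
        html = fetch(url)                # download it
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):   # parse out links
            if link not in seen:                # add only new urls
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/a#top">A</a>',
}

order = crawl(["http://example.com/"], FAKE_WEB.get)
print(order)
```

Even though the three pages link to each other in a cycle, each one is fetched exactly once, because a url is only queued the first time it is discovered.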

The url frontier

The queue of urls to visit is the frontier. It is more than a simple list:

  • It enforces politeness, limiting how often any one host is hit so the crawler does not overload a site.
  • It applies priority, crawling important or fresh pages sooner.
  • It is partitioned across many worker machines for scale.
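One way to picture these three responsibilities is the sketch below. It is only an illustration under assumptions: the `Frontier` class, its `delay_seconds` parameter, the "lower number means higher priority" convention, and the `partition_for` helper are hypothetical names, and the linear scan over hosts is chosen for clarity rather than speed.

```python
import heapq
import itertools
import zlib
from urllib.parse import urlsplit


class Frontier:
    """Per-host priority queues plus a per-host politeness delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.per_host = {}       # host -> heap of (priority, seq, url)
        self.next_allowed = {}   # host -> earliest time it may be hit again
        self.counter = itertools.count()  # tie-breaker: keeps insertion order

    def add(self, url, priority=10):
        # Assumption: a lower priority number means "crawl sooner".
        host = urlsplit(url).netloc
        heap = self.per_host.setdefault(host, [])
        heapq.heappush(heap, (priority, next(self.counter), url))

    def pop(self, now):
        """Return the best url whose host is ready at time `now`, else None."""
        best = None
        for host, heap in self.per_host.items():
            if heap and self.next_allowed.get(host, 0.0) <= now:
                if best is None or heap[0] < self.per_host[best][0]:
                    best = host
        if best is None:
            return None
        _, _, url = heapq.heappop(self.per_host[best])
        self.next_allowed[best] = now + self.delay   # politeness window
        return url


def partition_for(url, num_workers):
    """Send every url of a host to the same worker, so that one worker
    owns that host's politeness state."""
    host = urlsplit(url).netloc
    return zlib.crc32(host.encode("utf-8")) % num_workers


frontier = Frontier(delay_seconds=5.0)
frontier.add("http://a.example/1", priority=1)
frontier.add("http://a.example/2", priority=2)
frontier.add("http://b.example/1", priority=3)

first = frontier.pop(now=0.0)    # highest priority url overall
second = frontier.pop(now=0.0)   # a.example is cooling down, so b.example
third = frontier.pop(now=0.0)    # every host is cooling down
fourth = frontier.pop(now=5.0)   # a.example is allowed again
```

Passing `now` in explicitly, rather than reading the clock inside `pop`, keeps the sketch deterministic. Partitioning by host instead of by full url is a common choice because rate limiting is tracked per host.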

Avoiding duplicate work

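  • A seen set tracks already visited urls so the crawler does not loop forever. A hash set gives exact answers, while a Bloom filter holds the huge url space far more compactly at the cost of occasional false positives, which cause a few unvisited urls to be skipped.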
  • Content fingerprints detect pages that differ in url but share identical content.
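Both ideas can be sketched with the standard library. The code below is a toy under stated assumptions: the `BloomFilter` class, its default sizes, and the `content_fingerprint` helper are hypothetical, and collapsing whitespace before hashing is one possible normalization choice, not something the lesson prescribes.

```python
import hashlib


class BloomFilter:
    """Compact set membership: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def content_fingerprint(html):
    """Hash of the page body, ignoring differences in whitespace."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


seen_urls = BloomFilter()
seen_urls.add("http://example.com/a")
already_seen = "http://example.com/a" in seen_urls   # always True once added

# Two different urls that serve the same page body.
seen_content = set()
page_one = "<p>hello</p>\n"        # e.g. served at /page
page_two = "<p>hello</p>"          # e.g. served at /page?ref=home
seen_content.add(content_fingerprint(page_one))
is_duplicate = content_fingerprint(page_two) in seen_content
```

An exact hash like this only catches pages whose normalized content is identical. The Bloom filter never forgets a url it has been given, but it can occasionally claim to have seen a url it has not, so a crawler using it may skip a small fraction of new pages.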

Being a good citizen

  • The crawler reads each site's robots.txt rules to learn what it may fetch.
  • It rate limits per host and identifies itself with a user agent.
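Python's standard library includes a robots.txt parser, so the checks above can be sketched as follows. The crawler name, the contact url in the header, and the sample robots.txt content are hypothetical. The rules are parsed from a string here so the example runs offline, whereas a real crawler would download `/robots.txt` from each host.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical identity for this crawler.
USER_AGENT = "ExampleCrawler"
USER_AGENT_HEADER = "ExampleCrawler/1.0 (+http://crawler.example/about)"

# Sample rules standing in for a site's real robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_fetch(url):
    """True if the parsed robots rules allow this crawler to fetch url."""
    return rules.can_fetch(USER_AGENT, url)


def build_request(url):
    """Build (but do not send) a request that identifies the crawler."""
    return Request(url, headers={"User-Agent": USER_AGENT_HEADER})


allowed = may_fetch("http://example.com/public/page")
blocked = may_fetch("http://example.com/private/data")
delay = rules.crawl_delay(USER_AGENT)   # seconds to wait between requests
request = build_request("http://example.com/public/page")
```

The `Crawl-delay` value, when a site provides one, is a natural input for the per-host rate limit. It is not part of every robots.txt file, so `crawl_delay` can return `None` and the crawler then needs its own default.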

Key idea

A web crawler loops over a url frontier that enforces politeness and priority, downloads and parses pages, and uses a seen set and fingerprints to avoid duplicate and runaway crawling.

Check yourself

1. What is the url frontier in a crawler?

2. Why does a crawler keep a seen set?

3. What does politeness in a crawler mean?