System Design

The Crawler and Indexer

How content is discovered, fetched, and handed to the index pipeline.

Discovering content

A crawler fetches documents and follows links to discover more. It keeps a frontier of URLs to visit, prioritized by importance and freshness.
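A frontier like the one described above can be sketched as a priority queue. This is a minimal illustration, not a production design: the Frontier class, the importance and staleness inputs, and the equal-weight score are all assumptions made for the example.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs to visit; each URL is enqueued at most once."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add(self, url: str, importance: float, staleness: float) -> bool:
        """Enqueue url unless it was already added. Returns True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # Hypothetical scoring: an equal-weight sum of the two signals.
        score = importance + staleness
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def pop(self) -> str:
        """Return the highest-priority URL still waiting."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

Usage: call add() for every link extracted from a fetched page and pop() to choose the next URL to fetch. A real crawler would likely re-score URLs over time rather than fix the priority at insertion.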

Politeness and dedup

  • Politeness limits the request rate per host so you do not overload a site.
  • Deduplication drops pages whose content you already have, often using a content hash.
  • Robots rules (a site's robots.txt file) tell the crawler which paths it may fetch.
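The three checks above can be sketched with the Python standard library. The names HostRateLimiter, ContentDeduper, and allowed_by_robots are hypothetical helpers invented for this example, and the fixed per-host delay is an assumption; real crawlers often adapt the delay to each site.

```python
import hashlib
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class HostRateLimiter:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, min_delay_seconds: float = 1.0):
        self.min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> earliest allowed fetch time

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before fetching url; 0.0 means fetch now."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Note that url's host was contacted at time now."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = now + self.min_delay


class ContentDeduper:
    """Remembers a SHA-256 hash of every page body seen so far."""

    def __init__(self):
        self._hashes = set()

    def is_duplicate(self, body: bytes) -> bool:
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._hashes:
            return True
        self._hashes.add(digest)
        return False


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check url against the already-downloaded text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Passing the current time in as an argument keeps the limiter easy to test; a running crawler would pass time.monotonic(). Hashing the raw bytes catches only byte-identical pages, so near-duplicates would need a different technique.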

From fetch to index

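After fetching, the indexer parses the page, extracts text and fields, and emits records for the index builder. Crawling and indexing are decoupled by a queue: the crawler enqueues fetched pages and the indexer consumes them at its own pace, so a slow index build does not directly stall fetching.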
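The hand-off described above can be sketched with two threads and an in-memory queue. Everything here is illustrative: run_pipeline, to_record, and the record fields (url, title, text) are assumptions, and the pages are passed in directly instead of being fetched over the network.

```python
import queue
import threading
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> text and all other text found in a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def to_record(url: str, html: str) -> dict:
    """Parse one fetched page into a record for the index builder."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return {
        "url": url,
        "title": extractor.title.strip(),
        "text": " ".join(extractor.text_parts),
    }


_DONE = object()  # sentinel telling the indexer that crawling has finished


def run_pipeline(pages, queue_size: int = 2) -> list:
    """Run a crawler thread and an indexer thread joined by a bounded queue."""
    fetched = queue.Queue(maxsize=queue_size)
    records = []

    def crawler():
        for url, html in pages:  # stand-in for real HTTP fetching
            fetched.put((url, html))  # blocks only while the queue is full
        fetched.put(_DONE)

    def indexer():
        while True:
            item = fetched.get()
            if item is _DONE:
                break
            records.append(to_record(*item))

    threads = [threading.Thread(target=crawler), threading.Thread(target=indexer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return records
```

In this sketch the queue is bounded, so the crawler does pause if the indexer falls far enough behind. A larger or durable queue would absorb a longer backlog at the cost of more storage.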

Recrawl

Pages change, so the crawler revisits them. Frequently changing pages are recrawled often; stable pages less so. This balances freshness against load.
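One simple way to realize this policy is to adapt each page's recrawl interval to what the last visit found. The sketch below is an assumption, not a prescribed algorithm: PageState, update_after_recrawl, the halve-or-double rule, and the one hour and thirty day bounds are all chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PageState:
    content_hash: str      # SHA-256 of the body seen on the last crawl
    interval_hours: float  # how long to wait before the next crawl


def update_after_recrawl(
    state: PageState,
    new_body: bytes,
    min_hours: float = 1.0,
    max_hours: float = 24.0 * 30,
) -> PageState:
    """Shorten the interval if the page changed, lengthen it otherwise."""
    new_hash = hashlib.sha256(new_body).hexdigest()
    changed = new_hash != state.content_hash
    # Hypothetical rule: halve on change, double when unchanged.
    factor = 0.5 if changed else 2.0
    interval = min(max_hours, max(min_hours, state.interval_hours * factor))
    return PageState(content_hash=new_hash, interval_hours=interval)
```

The bounds keep a very volatile page from being fetched continuously and guarantee that a stable page is still revisited eventually.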

Key idea

A crawler discovers and refreshes content politely, while a decoupled indexer turns fetched pages into index records.

Check yourself

1. What does crawler politeness control?

2. Why decouple crawling from indexing with a queue?