The Crawler and Indexer

Discovering content

A crawler fetches documents and follows links to discover more. It keeps a frontier of URLs to visit, prioritized by importance and freshness.
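
As a rough illustration of the frontier, here is a minimal sketch built on a binary heap. The class name Frontier, the importance and staleness inputs (both assumed to lie in [0, 1]), and the 0.7/0.3 weights are hypothetical choices for this example, not values taken from any particular crawler.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs to visit; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: earlier insert wins

    def add(self, url, importance, staleness):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        score = 0.7 * importance + 0.3 * staleness  # hypothetical weights
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def pop(self):
        """Return the highest-priority URL, or None if the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


frontier = Frontier()
frontier.add("https://example.com/", importance=0.9, staleness=0.5)
frontier.add("https://example.com/archive", importance=0.2, staleness=0.1)
frontier.add("https://example.com/news", importance=0.6, staleness=1.0)
order = [frontier.pop() for _ in range(3)]
```

In this run the stale news page outranks the archive page even though neither is the most important URL, which is the kind of trade-off a combined score is meant to express.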

Politeness and dedup

Politeness limits request rate per host so you do not overload a site.
Deduplication drops pages whose content you already have, often using a content hash.
Robots rules (a site's robots.txt file) tell the crawler which paths it may fetch.
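
The three checks above can be sketched with the Python standard library. HostRateLimiter, ContentDeduper, the 2-second delay, and the "example-crawler" user agent are hypothetical names and values for illustration. The robots rules are parsed from an inline list rather than fetched, and the hash check only catches byte-identical content, not near-duplicates.

```python
import hashlib
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class HostRateLimiter:
    """Spaces out requests to the same host by a minimum delay."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> earliest time of the next request

    def reserve(self, url, now):
        """Return the time at which url may be fetched and book that slot."""
        host = urlsplit(url).netloc
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start


class ContentDeduper:
    """Remembers a SHA-256 digest for every page body seen so far."""

    def __init__(self):
        self._digests = set()

    def is_duplicate(self, body):
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._digests:
            return True
        self._digests.add(digest)
        return False


limiter = HostRateLimiter(min_delay_seconds=2.0)
first = limiter.reserve("https://a.example/1", now=0.0)
second = limiter.reserve("https://a.example/2", now=0.5)  # same host: delayed
other = limiter.reserve("https://b.example/", now=0.5)    # new host: immediate

deduper = ContentDeduper()
seen_before = deduper.is_duplicate(b"<p>hello</p>")
seen_again = deduper.is_duplicate(b"<p>hello</p>")

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
blocked = robots.can_fetch("example-crawler", "https://example.com/private/x")
allowed = robots.can_fetch("example-crawler", "https://example.com/public")
```

Passing the clock in as `now` keeps the limiter easy to test; a real crawler would more likely read a monotonic clock and sleep until the reserved time.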

From fetch to index

After fetching, the indexer parses the page, extracts text and fields, and emits records for the index builder. Crawl and index are decoupled by a queue, so a slow build does not stall fetching unless the queue fills up.
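
A minimal sketch of this hand-off, assuming an in-process queue and a record with url, title, and text fields. The names TextExtractor, to_record, and index_builder, and the queue size of 100, are made up for this example. A production system would more likely use a durable message queue between separate services.

```python
import queue
import threading
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def to_record(url, html):
    """Parse one fetched page into a record for the index builder."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {"url": url, "title": parser.title.strip(), "text": text}


records = queue.Queue(maxsize=100)  # bounded buffer between crawl and index
built = []


def index_builder():
    """Consume records until the None sentinel arrives."""
    while True:
        record = records.get()
        if record is None:
            break
        built.append(record)


worker = threading.Thread(target=index_builder)
worker.start()
page = ("<html><head><title>Home</title><style>p{color:red}</style></head>"
        "<body><p>Welcome</p><p>Second paragraph</p></body></html>")
records.put(to_record("https://example.com/", page))
records.put(None)  # tell the builder there is nothing more to index
worker.join()
```

Because the queue is bounded, `put` blocks once 100 records are waiting, so the buffer absorbs a slow build only up to its capacity.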

Recrawl

Pages change, so the crawler revisits them. Frequently changing pages are recrawled often; stable pages less so. This balances freshness against load.
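
One simple way to express this policy is to adapt each page's revisit interval to what the last visit found. The sketch below halves the interval when the page changed and doubles it when it did not. The function name next_interval, the halve/double rule, and the 1-hour and 720-hour bounds are assumptions chosen for illustration, not a scheme the text prescribes.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Halve the interval after a change, double it otherwise, within bounds."""
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


interval = 24.0
history = []
# Simulated visit outcomes: the page changed twice, then stayed stable.
for changed in [True, True, False, False, False]:
    interval = next_interval(interval, changed)
    history.append(interval)
```

The bounds keep a volatile page from being fetched continuously and keep a stable page from never being revisited.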

Key idea

A crawler discovers and refreshes content politely, while a decoupled indexer turns fetched pages into index records.