Discovering content
A crawler fetches documents and follows links to discover more. It keeps a frontier of URLs to visit, prioritized by importance and freshness.
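One way to picture the frontier is as a priority queue keyed on a combined score. The sketch below is a minimal illustration, not a prescribed design: the Frontier class, the importance and staleness scores (assumed to be numbers between 0 and 1), and the choice to simply add them are all assumptions made for the example.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs to visit; the highest combined score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: preserves insertion order

    def add(self, url, importance, staleness):
        """Queue a URL once. Both scores are hypothetical values in the range 0..1."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the largest first.
        priority = -(importance + staleness)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def pop(self):
        """Return the highest-priority URL, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


frontier = Frontier()
frontier.add("https://example.com/", importance=0.9, staleness=0.5)
frontier.add("https://example.com/about", importance=0.2, staleness=0.1)
frontier.add("https://example.com/news", importance=0.6, staleness=0.9)
```

With these example scores, the news page is popped first because its combined score is highest, and the about page comes last. A real crawler would likely weight the two signals differently.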
Politeness and dedup
- Politeness limits request rate per host so you do not overload a site.
- Deduplication drops pages whose content you already have, often using a content hash.
- Robots rules (a site's robots.txt file) tell the crawler which paths it may fetch.
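The three checks above can be sketched together using only the Python standard library. This is a simplified illustration under stated assumptions: HostRateLimiter, ContentDeduper, and build_robots are hypothetical names, the two-second delay is arbitrary, and the SHA-256 content hash only catches byte-for-byte identical pages. The robots check uses the standard urllib.robotparser module.

```python
import hashlib
import urllib.robotparser
from urllib.parse import urlparse


class HostRateLimiter:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching url, and reserve that slot."""
        host = urlparse(url).netloc
        ready_at = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = ready_at + self.min_delay
        return ready_at - now


class ContentDeduper:
    """Remembers a hash of every page body seen so far."""

    def __init__(self):
        self._hashes = set()

    def is_duplicate(self, body):
        """Return True if these exact bytes were seen before; otherwise record them."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._hashes:
            return True
        self._hashes.add(digest)
        return False


def build_robots(lines):
    """Parse robots.txt lines that have already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(lines)
    return parser


limiter = HostRateLimiter(min_delay_seconds=2.0)  # assumed delay, not a standard value
deduper = ContentDeduper()
robots = build_robots(["User-agent: *", "Disallow: /private/"])
```

In use, the crawler would call robots.can_fetch(user_agent, url) before fetching, sleep for limiter.wait_time(url, now) seconds (passing something like time.monotonic() as now), and skip indexing when deduper.is_duplicate(body) returns True.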
From fetch to index
After fetching, the indexer parses the page, extracts text and fields, and emits records for the index builder. Crawl and index are decoupled by a queue, so a slow index build does not stall fetching as long as the queue has room to absorb the backlog.
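The decoupling can be illustrated with an in-process queue and a worker thread. This is only a sketch: a production system would more likely use a durable message queue, and the names TextExtractor, parse_page, and indexer_worker, along with the record fields url, title, and text, are assumptions made for the example.

```python
import queue
import threading
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def parse_page(url, html):
    """Turn one fetched page into an index record (hypothetical field names)."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return {
        "url": url,
        "title": extractor.title.strip(),
        "text": " ".join(extractor.text_parts),
    }


def indexer_worker(pages, records):
    """Consume fetched pages until the None sentinel arrives."""
    while True:
        item = pages.get()
        if item is None:
            break
        url, html = item
        records.append(parse_page(url, html))


pages = queue.Queue()  # unbounded: the fetcher's put() never blocks
records = []
worker = threading.Thread(target=indexer_worker, args=(pages, records))
worker.start()

# The fetcher side: enqueue a page and move on without waiting for the indexer.
pages.put((
    "https://example.com/",
    "<html><head><title>Home</title></head><body><p>Hello crawler</p></body></html>",
))
pages.put(None)  # signal that no more pages are coming
worker.join()
```

An unbounded queue keeps the fetcher from ever blocking, at the cost of memory if the indexer falls far behind. Passing a maxsize to queue.Queue would instead apply backpressure once the queue fills.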
Recrawl
Pages change, so the crawler revisits them. Frequently changing pages are recrawled often; stable pages less so. This balances freshness against load.
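One simple way to implement this is to adapt each page's revisit interval to what the last visit found. The heuristic below (halve the interval when the page changed, double it when it did not) and the one-hour and thirty-day bounds are assumptions chosen for illustration, not a policy taken from the text.

```python
MIN_INTERVAL_HOURS = 1.0          # assumed lower bound, protects the site from over-fetching
MAX_INTERVAL_HOURS = 24.0 * 30    # assumed upper bound, so stable pages are still checked


def next_interval(current_hours, changed):
    """Return the next recrawl interval in hours for one page.

    changed is True when the latest fetch differed from the stored copy.
    """
    if changed:
        proposed = current_hours / 2   # page is active: come back sooner
    else:
        proposed = current_hours * 2   # page is stable: back off
    # Clamp so the interval never leaves the configured bounds.
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, proposed))
```

For example, a page on a 24-hour cycle moves to 12 hours after a detected change and to 48 hours after an unchanged visit. The clamps keep frequently changing pages from driving the interval below the politeness floor.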
Key idea
A crawler discovers and refreshes content politely, while a decoupled indexer turns fetched pages into index records.