What a crawler does
A web crawler systematically fetches pages, extracts links, and follows them to discover more pages. At its core it is one large loop: take a url, download it, parse out the links, and add any new urls to a queue.
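The loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the names `crawl`, `extract_links`, and `FAKE_WEB` are made up for this example, and `fetch` is any function that returns a page's HTML (or `None` on failure). An in-memory dictionary stands in for the network so the sketch runs without touching a real site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute urls for all links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl starting at seed; returns urls in visit order."""
    queue = deque([seed])  # urls waiting to be fetched
    seen = {seed}          # urls already queued, so nothing is queued twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:   # fetch failed or page does not exist
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">b</a>',
    "http://example.test/b": '<a href="/c">c</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
```

Here `/c` is linked but missing from `FAKE_WEB`, so it is queued, fails to fetch, and is skipped. The link from `/a` back to the home page is ignored because that url is already in `seen`.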
The url frontier
The queue of urls to visit is the frontier. It is more than a simple list:
- It enforces politeness, limiting how often any one host is hit so the crawler does not overload a site.
- It applies priority, crawling important or fresh pages sooner.
- It is partitioned across many worker machines for scale.
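A rough sketch of these three properties, under several assumptions of this example rather than of any particular crawler: priority is a number where lower means sooner, politeness is a fixed per-host delay, and partitioning hashes the host name so that all urls for one host land on the same worker. `Frontier` and `worker_for` are hypothetical names. The caller passes in the current time so the behaviour is easy to follow and test.

```python
import hashlib
import heapq
import itertools
from urllib.parse import urlsplit


class Frontier:
    """Priority queue of urls with a minimum delay between hits per host."""

    def __init__(self, delay=1.0):
        self.delay = delay                 # seconds between requests to one host
        self._heap = []                    # entries: (priority, sequence, url)
        self._counter = itertools.count()  # tie-breaker keeps insertion order
        self._next_ok = {}                 # host -> earliest allowed fetch time

    def add(self, url, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self, now):
        """Return the best url whose host may be hit at time now, else None."""
        deferred = []
        result = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).netloc
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.delay
                result = entry[2]
                break
            deferred.append(entry)         # host still cooling down
        for entry in deferred:             # put skipped entries back
            heapq.heappush(self._heap, entry)
        return result


def worker_for(url, num_workers):
    """Assign a url to a worker by hashing its host (assumed scheme)."""
    host = urlsplit(url).netloc
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


frontier = Frontier(delay=2.0)
frontier.add("http://a.test/1", priority=0)
frontier.add("http://a.test/2", priority=1)
frontier.add("http://b.test/1", priority=5)

first = frontier.pop(now=0.0)   # highest priority: http://a.test/1
second = frontier.pop(now=0.5)  # a.test is cooling down, so http://b.test/1
third = frontier.pop(now=1.0)   # both hosts cooling down: None
fourth = frontier.pop(now=2.0)  # a.test is allowed again: http://a.test/2
```

Partitioning by host is one possible choice. It keeps each host's politeness state on a single machine, which avoids coordinating rate limits across workers.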
Avoiding duplicate work
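- A seen set tracks already-visited urls so the crawler does not loop forever. A hash set is exact; a bloom filter stores the huge url space far more compactly, at the cost of occasional false positives (a few new urls may be wrongly treated as seen).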
- Content fingerprints detect pages that differ in url but share identical content.
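Both checks can be sketched together. This is an illustrative toy, assuming SHA-256 as the hash for both the bloom filter and the content fingerprint; `BloomFilter`, `Deduper`, and `fingerprint` are hypothetical names, and the filter size is arbitrary. An exact hash only catches byte-identical pages, which matches the case described above.

```python
import hashlib


class BloomFilter:
    """Compact set membership: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting the hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def fingerprint(body: bytes) -> str:
    """Exact content fingerprint: equal bytes give equal fingerprints."""
    return hashlib.sha256(body).hexdigest()


class Deduper:
    """Tracks seen urls (bloom filter) and seen page content (hash set)."""

    def __init__(self):
        self.seen_urls = BloomFilter()
        self.seen_content = set()

    def is_new_url(self, url):
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

    def is_new_content(self, body):
        fp = fingerprint(body)
        if fp in self.seen_content:
            return False
        self.seen_content.add(fp)
        return True


dedup = Deduper()
url_first = dedup.is_new_url("http://example.test/page")       # True
url_again = dedup.is_new_url("http://example.test/page")       # False
body_first = dedup.is_new_content(b"<html>same page</html>")   # True
body_mirror = dedup.is_new_content(b"<html>same page</html>")  # False
```

The last call stands for a second url that serves the same bytes: the url check would let it through, but the fingerprint check reports it as a duplicate.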
Being a good citizen
- The crawler reads each site's robots.txt rules to learn what it may fetch.
- It rate limits per host and identifies itself with a user agent.
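As a sketch of these two habits, Python's standard `urllib.robotparser` can evaluate robots rules, and `urllib.request.Request` can carry an identifying header. The crawler name, contact url, and sample rules below are invented for illustration, and the rules are parsed from a string so the example makes no network request. `Crawl-delay` is a non-standard directive that some sites include; treating it as the per-host rate limit is an assumption of this example.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; a real one should point to a contact page.
USER_AGENT = "ExampleCrawler/0.1 (+http://example.test/bot)"

# Sample rules standing in for a site's robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_text):
    """Parse robots rules from text already downloaded from the site."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def build_request(url):
    """Build a request that identifies the crawler to the server."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


rules = load_rules(ROBOTS_TXT)

allowed = rules.can_fetch(USER_AGENT, "http://example.test/public/page")
blocked = rules.can_fetch(USER_AGENT, "http://example.test/private/page")
delay = rules.crawl_delay(USER_AGENT)  # seconds to wait between requests, or None

request = build_request("http://example.test/public/page")
```

In a real crawler the rules would be fetched once per host and cached, and the reported delay (when present) would feed the per-host rate limit.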
Key idea
A web crawler loops over a url frontier that enforces politeness and priority, downloads and parses pages, and uses a seen set and fingerprints to avoid duplicate and runaway crawling.