What an inverted index is
An inverted index maps each term to a posting list of documents that contain it. Instead of scanning every document for a word, you jump straight to the list for that word.
Building it
- Tokenize each document into terms.
- Normalize terms by lowercasing and stemming.
- Emit pairs of term and document id.
- Sort and group pairs by term to form posting lists.
This is a large sort, often done with a map and reduce style pipeline so it scales across machines.
What postings hold
A posting can store more than a document id. It often includes term frequency and positions so the ranker can compute relevance and support phrase queries.
Compression
Posting lists are huge, so document ids are stored as gaps between sorted ids and compressed. Smaller lists mean fewer bytes read per query.
Diagram
Key idea
An inverted index trades a big offline sort for fast online lookups from terms to the documents that contain them.