The Inverted Index Build

How documents become posting lists that map terms to the documents containing them.

What an inverted index is

An inverted index maps each term to a posting list of documents that contain it. Instead of scanning every document for a word, you jump straight to the list for that word.

Building it

Tokenize each document into terms.
Normalize terms by lowercasing and stemming.
Emit pairs of term and document id.
Sort and group pairs by term to form posting lists.

This is a large sort, often done with a map and reduce style pipeline so it scales across machines.

What postings hold

A posting can store more than a document id. It often includes term frequency and positions so the ranker can compute relevance and support phrase queries.

Compression

Posting lists are huge, so document ids are stored as gaps between sorted ids and compressed. Smaller lists mean fewer bytes read per query.

Diagram

Key idea

An inverted index trades a big offline sort for fast online lookups from terms to the documents that contain them.