← Lessons

quiz vs the machine

Gold1400

System Design

The Inverted Index Build

How documents become posting lists that map terms to the documents containing them.

5 min read · core · beat Gold to climb

What an inverted index is

An inverted index maps each term to a posting list of documents that contain it. Instead of scanning every document for a word, you jump straight to the list for that word.

Building it

  • Tokenize each document into terms.
  • Normalize terms by lowercasing and stemming.
  • Emit pairs of term and document id.
  • Sort and group pairs by term to form posting lists.

This is a large sort, often done with a map and reduce style pipeline so it scales across machines.

What postings hold

A posting can store more than a document id. It often includes term frequency and positions so the ranker can compute relevance and support phrase queries.

Compression

Posting lists are huge, so document ids are stored as gaps between sorted ids and compressed. Smaller lists mean fewer bytes read per query.

Diagram

Key idea

An inverted index trades a big offline sort for fast online lookups from terms to the documents that contain them.

Check yourself

Answer to earn rating on the learn ladder.

1. What does an inverted index map from and to?

2. Why are document ids stored as gaps in posting lists?

3. Why store positions inside a posting?