The Search Engine Inverted Index
A search engine database answers full text queries like find all documents containing crux. Scanning every document for the word would be far too slow, so these systems build an inverted index.
Forward versus inverted
A forward index maps a document to the words it contains. An inverted index flips that around: it maps each term to the list of documents that contain it. That list is called a postings list.
- A query for one word becomes a direct lookup of its postings list.
- A query for two words intersects two postings lists.
- Postings often store positions so the engine can match exact phrases.
Analysis
Before indexing, text passes through analysis: lowercasing, splitting into tokens, and stemming so running and run share a term. The same analysis runs on the query so the two sides match.
Ranking
Matching is not enough; results must be ranked. Scores like term frequency weighted by how rare a term is across documents push the most relevant documents to the top.
Key idea
An inverted index maps each term to the documents that contain it, turning slow text scans into fast list lookups that are then ranked.