Full Text Indexes Deep Dive

How tokenizing and normalizing text powers fast natural language search.

Beyond LIKE

A LIKE pattern with a leading wildcard, such as '%fox%', cannot use a normal B-tree index, because the index is ordered by the start of the value. The database has to scan every row instead. A full text index is built specifically for searching words inside documents, and it supports relevance ranking and language-aware matching.
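As a small illustration, the sketch below asks a database for its query plan. It assumes SQLite through Python's built-in sqlite3 module, which the lesson does not name; the table docs and the index idx_body are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

# Assumption: SQLite stands in for "a database"; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_body ON docs (body)")

def plan(sql):
    # The human-readable plan step is the last column of each result row.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

exact = plan("SELECT id FROM docs WHERE body = 'quick brown fox'")
wildcard = plan("SELECT id FROM docs WHERE body LIKE '%fox%'")

print(exact)     # a SEARCH step that seeks into idx_body
print(wildcard)  # a SCAN step that visits every entry
```

The equality lookup can seek directly into the index, while the leading wildcard leaves the engine nothing to seek on.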

The Build Pipeline

Indexing text passes each document through several steps:

  • Tokenization splits text into individual words or terms.
  • Normalization lowercases and may strip accents so case and diacritics do not matter.
  • Stop word removal drops very common words, such as "the" or "and", that carry little meaning.
  • Stemming reduces words to a root, so "running" and "runs" both match "run".

The processed terms feed an inverted index mapping each term to the documents containing it.
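The four steps and the resulting inverted index can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine does it: the stop word list is tiny, naive_stem is a toy suffix stripper standing in for a real stemmer, and all function names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

STOP_WORDS = {"the", "and", "a", "an", "of", "in", "to", "is"}  # tiny illustrative list

def strip_accents(text):
    # Decompose characters (NFKD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(token):
    # Toy rules only; real stemmers handle far more cases.
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if token[-1] == token[-2]:  # "runn" -> "run"
            token = token[:-1]
    elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                     # tokenization
    tokens = [strip_accents(t).lower() for t in tokens]   # normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [naive_stem(t) for t in tokens]                # stemming

def build_index(docs):
    # Inverted index: term -> set of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {
    1: "The dog is running in the park",
    2: "She runs to the Caf\u00e9",
    3: "Dogs and cats",
}
index = build_index(docs)
print(sorted(index["run"]))   # [1, 2]
print(sorted(index["dog"]))   # [1, 3]
print(sorted(index["cafe"]))  # [2]
```

Note that "running" and "runs" land under the same key, and that stop words such as "the" never enter the index at all.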

Querying

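A search query runs through the same pipeline, so the user's terms align with the stored terms. The engine looks up each term, combines the matching document lists, and ranks the results by a relevance score. The score typically rewards terms that appear often in a document (term frequency) and terms that are rare across the whole collection, as in TF-IDF or BM25.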
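A rough sketch of the query side follows. It assumes a simplified analyzer (lowercasing and word splitting only), combines document lists with OR semantics, and scores with a basic smoothed TF-IDF formula; real engines use more refined scoring, and every name here is hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def analyze(text):
    # Simplified stand-in for the index-time pipeline (no stop words or stemming).
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    # term -> {doc_id: number of times the term occurs in that document}
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] += 1
    return index

def search(query, index, n_docs):
    scores = Counter()
    for term in analyze(query):  # the query uses the same pipeline as indexing
        postings = index.get(term)
        if not postings:
            continue
        # Assumed smoothed inverse document frequency: rarer terms weigh more.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]

docs = {
    1: "database index tuning database",
    2: "cooking index",
    3: "database backup",
    4: "database replication",
}
index = build_index(docs)
results = search("DATABASE Index", index, len(docs))
print(results[:2])            # [1, 2]
print(index.get("DATABASE"))  # None: the raw, unnormalized term is not stored
```

Document 1 ranks first because it contains both terms and repeats one of them. Document 2 outranks documents 3 and 4 because "index" is rarer in this collection than "database". The last line shows why the query must be normalized: the index only holds processed terms.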

Key idea

A full text index tokenizes, normalizes, and stems text into an inverted index so word searches return ranked relevant matches instead of scanning rows.

Check yourself

1. What does stemming accomplish in a full text pipeline?

2. Why must the query pass through the same pipeline as indexing?