Beyond LIKE
A LIKE pattern with a leading wildcard, such as '%word', cannot use an ordinary B-tree index, so the database has to scan every row. A full-text index is built specifically for searching words inside documents, and it supports relevance ranking and language-aware matching.
The Build Pipeline
Indexing text passes each document through several steps:
- Tokenization splits text into individual words or terms.
- Normalization lowercases terms and may strip accents, so that case and diacritics do not affect matching.
- Stop word removal drops very common words, such as "the" or "and", that carry little meaning.
- Stemming reduces words to a root form, so "running" and "runs" both match "run".
The processed terms feed an inverted index mapping each term to the documents containing it.
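
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular database implements it: the stop word list is a tiny made-up sample, the stemmer is a toy suffix stripper (real engines use algorithms such as Porter or Snowball), and the names analyze and build_index are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Tiny illustrative stop word list; real lists are language specific and longer.
STOP_WORDS = {"the", "and", "or", "a", "of", "to", "in"}

def normalize(token):
    # Lowercase, then decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(word):
    # Toy stemmer (assumption): strip a few suffixes, keeping at least 3 letters.
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    terms = []
    for token in re.findall(r"\w+", text):   # tokenization
        token = normalize(token)              # normalization
        if token in STOP_WORDS:               # stop word removal
            continue
        terms.append(stem(token))             # stemming
    return terms

def build_index(docs):
    # Inverted index: term -> set of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "Running shoes", 2: "She runs daily", 3: "The Café menu"}
index = build_index(docs)
```

With these sample documents, "Running" and "runs" both end up under the term "run", so index["run"] contains documents 1 and 2, and "Café" is stored as "cafe".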
Querying
A search query runs through the same pipeline, so the user's terms line up with the stored terms. The engine looks up each term, combines the resulting document lists, and ranks the results by a relevance score based on how often a term appears in a document (term frequency) and how rare it is across the collection.
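
As a rough sketch of the lookup and ranking step, the Python below scores documents with a simple TF-IDF style formula. Several things here are assumptions made for illustration: the analyze function is a stand-in that only lowercases and splits on whitespace, document lists are combined as a union (any term may match), and the exact scoring formula differs between engines.

```python
import math
from collections import Counter, defaultdict

def analyze(text):
    # Stand-in for the full pipeline (assumption): lowercase and split only.
    return text.lower().split()

def build_index(docs):
    # Inverted index with counts: term -> {doc_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(analyze(text)).items():
            index[term][doc_id] = count
    return index

def search(query, index, num_docs):
    scores = defaultdict(float)
    for term in analyze(query):               # same pipeline as indexing
        postings = index.get(term, {})
        if not postings:
            continue                          # term appears in no document
        # Rarer terms (fewer matching documents) get a higher weight.
        idf = math.log(num_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf        # frequent in this doc -> higher score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    1: "fast index search",
    2: "index index tuning",
    3: "search ranking",
}
index = build_index(docs)
results = search("index tuning", index, len(docs))
```

For the query "index tuning", document 2 ranks first because it contains "index" twice and is the only one with the rarer term "tuning"; document 1 follows, and document 3 is not returned because it matches neither term.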
Key idea
A full-text index tokenizes, normalizes, and stems text into an inverted index, so word searches return ranked, relevant matches instead of scanning every row.