Searching Inside Text
A B-tree index cannot efficiently find a word inside a body of text: a contains search such as LIKE '%word%' has a leading wildcard, so the index's sorted order cannot narrow the search and every row has to be scanned. A full-text index solves this by building an inverted index: a map from each searchable token to the list of documents that contain it.
How A Document Is Processed
Before indexing, text is run through an analysis pipeline:
- Tokenize: split the text into words.
- Normalize: lowercase and strip punctuation.
- Stem or lemmatize: reduce words to a root form, so "running" and "runs" both match "run".
- Remove stop words: drop very common words such as "the" and "and".
The result is a set of normalized tokens, each pointing to its documents.
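The pipeline above can be sketched in a few lines of Python. This is a minimal illustration and not how any particular engine implements it: the names analyze, naive_stem, build_index, and STOP_WORDS are hypothetical, the stemmer is a toy suffix stripper and not a real algorithm such as Porter's, and the stop list is deliberately tiny. Stop words are filtered before stemming here so the stop list can hold plain words; that ordering is a choice of this sketch.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "a", "of", "in", "to", "is"}

def naive_stem(word):
    """Toy stemmer: strip one common suffix. Not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def analyze(text):
    """Tokenize, normalize, drop stop words, then stem."""
    # Lowercasing plus an alphanumeric match both splits words and strips punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Build the inverted index: token -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

docs = {1: "The dog runs.", 2: "Running and jumping!", 3: "A quiet cat"}
index = build_index(docs)
print(sorted(index["run"]))  # documents 1 and 2 both reduce to the token "run"
```

Because "runs" and "Running" both reduce to the token "run", documents 1 and 2 end up in the same entry, while the stop words never enter the index at all.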
Querying
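A search applies the same analysis to the query terms, so that the query tokens take the same form as the indexed ones, then looks up each token in the inverted index and combines the document lists: intersecting them when every term is required, or taking their union when any term may match. Many full-text engines also produce a relevance rank, scoring how well each document matches so results can be ordered by relevance, not just presence.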
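A query over such an index might look like the sketch below. It is self-contained and rests on several assumptions: the helper names (analyze, build_index, search) are hypothetical, the analysis step is reduced to lowercasing and splitting, and the score is simply the sum of term frequencies, a stand-in for the more refined ranking functions that real engines use.

```python
import re
from collections import Counter, defaultdict

def analyze(text):
    """Simplified analysis: lowercase and split; stemming and stop words omitted."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index with counts: token -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token, count in Counter(analyze(text)).items():
            index[token][doc_id] = count
    return index

def search(index, query, require_all=True):
    """Return (doc_id, score) pairs, best match first."""
    terms = analyze(query)  # same analysis as at indexing time
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    doc_sets = [set(p) for p in postings]
    # Intersect for "all terms must match", union for "any term may match".
    matched = set.intersection(*doc_sets) if require_all else set.union(*doc_sets)
    # Illustrative score: total occurrences of the query terms in the document.
    scores = {d: sum(p.get(d, 0) for p in postings) for d in matched}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {1: "red fox", 2: "red red fox jumps", 3: "blue fox"}
index = build_index(docs)
print(search(index, "Red FOX"))                     # only documents with both terms
print(search(index, "red fox", require_all=False))  # documents with either term
```

Document 2 ranks ahead of document 1 because "red" occurs twice in it, and document 3 appears only in the second search, since it contains "fox" but not "red".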
Key idea
A full-text index tokenizes, normalizes, and stems text, then builds an inverted token-to-document map from the result, so word searches inside large bodies of text are fast and can be ranked by relevance, unlike a wildcard scan.