Databases

The Full Text Index

Breaking documents into searchable tokens lets the database find words inside text far faster than scanning with a wildcard.

Searching Inside Text

A B-tree index cannot efficiently find a word inside a body of text. A contains search such as LIKE '%word%' starts with a wildcard, so there is no fixed prefix to seek on and the database has to scan every row. A full text index solves this by building an inverted index: a map from each searchable token to the list of documents that contain it.
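To make the inverted index concrete, here is a minimal Python sketch. It assumes documents are plain strings keyed by an integer id and splits on whitespace only; the function name build_inverted_index and the sample documents are made up for illustration, not taken from any particular database.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample documents.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking saves the day",
}

index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["the"])    # [1, 2, 3]
```

Finding the documents that contain a word is now a single dictionary lookup instead of a scan over every document's text.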

How A Document Is Processed

Before indexing, text is run through an analysis pipeline:

  • Tokenize: split the text into words.
  • Normalize: lowercase and strip punctuation.
  • Stem or lemmatize: reduce words to a root, so "running" and "runs" both match "run".
  • Remove stop words: drop very common words such as "the" and "and".

The result is a set of normalized tokens, each pointing to its documents.
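The four steps can be sketched in Python as follows. This is a toy pipeline under stated assumptions: the stop word set is illustrative, and naive_stem is a made-up suffix stripper that handles only a few English endings. Real engines use established stemmers and language-specific stop word lists.

```python
import re

# Illustrative stop word list (an assumption, not a standard one).
STOP_WORDS = {"the", "and", "a", "of", "in"}

def naive_stem(token):
    """Toy stemmer: strip "ing" (undoubling a final consonant) or a plural "s"."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) >= 2 and token[-1] == token[-2]:
            token = token[:-1]
    elif token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def analyze(text):
    """Run text through tokenize -> normalize -> stem -> stop word removal."""
    tokens = text.split()                                            # tokenize
    normalized = [re.sub(r"[^a-z0-9]", "", t.lower()) for t in tokens]  # normalize
    normalized = [t for t in normalized if t]                        # drop empties
    stems = [naive_stem(t) for t in normalized]                      # stem
    return [t for t in stems if t not in STOP_WORDS]                 # stop words

print(analyze("Running, runs, and the run!"))  # ['run', 'run', 'run']
```

All three surface forms collapse to the same token, and the stop words disappear, so the index stores a single entry for "run".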

Querying

A search applies the same analysis to the query terms, then looks up each token in the inverted index and combines the document lists: intersecting them when every term is required, or taking their union when any term may match. Many full text engines also produce a relevance rank, scoring how well each document matches so results can be ordered by relevance, not just presence.
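A minimal Python sketch of querying with a relevance score follows. It makes several assumptions: analysis is simplified to lowercasing and splitting on non-alphanumeric characters (stemming is left out for brevity), every query term is required, and the score is simply the total number of term occurrences. Real engines use more refined ranking formulas. The function names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict

def analyze(text):
    """Simplified analysis: lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build token -> {doc_id: number of occurrences in that document}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token][doc_id] += 1
    return index

def search(index, query):
    """Return (doc_id, score) pairs for documents containing every query token."""
    tokens = analyze(query)  # same analysis as at index time
    if not tokens:
        return []
    postings = [index.get(t, {}) for t in tokens]
    matching = set(postings[0])
    for p in postings[1:]:
        matching &= set(p)  # intersect: all terms required
    scored = [(doc_id, sum(p[doc_id] for p in postings)) for doc_id in matching]
    # Highest score first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample documents.
docs = {
    1: "Index tuning makes search fast",
    2: "Search, search, and more search over an index",
    3: "A B tree index",
}

index = build_index(docs)
print(search(index, "Search INDEX"))  # [(2, 4), (1, 2)]
```

Because the query goes through the same analyze function as the documents, "Search INDEX" matches the lowercase tokens stored in the index. Document 3 is excluded because it lacks "search", and document 2 ranks first because the terms occur there more often.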

Key idea

A full text index builds an inverted token-to-document map after tokenizing, normalizing, and stemming text. Word searches inside large text are therefore fast and can be ranked by relevance, unlike a wildcard scan.

Check yourself

1. What core structure does a full text index build?

2. Why are query terms run through the same analysis as the documents?

3. What can many full text engines provide beyond presence of a term?