Tokenization and Analysis

From text to tokens

Before any word reaches the index it passes through an analysis pipeline. The goal is that a query and a matching document produce the same token even if the raw text differs in case, punctuation, or word form.

The pipeline has three common stages:

Character filtering strips or rewrites raw characters, such as removing HTML tags.
Tokenization splits the stream into tokens, usually on whitespace and punctuation.
Token filtering transforms each token, for example lowercasing, removing stop words, or reducing words to a root.
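
As a rough illustration, the three stages can be sketched in a few lines of Python. The function names (strip_html, tokenize, filter_tokens, analyze) and the tiny stop-word list are assumptions made for this example, not the API of any particular search engine.

```python
import re

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text: str) -> list[str]:
    """Tokenizer: keep runs of ASCII letters and digits, dropping whitespace and punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def filter_tokens(tokens: list[str]) -> list[str]:
    """Token filter: lowercase each token, then drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text: str) -> list[str]:
    """Run the three stages in order: character filter, tokenizer, token filter."""
    return filter_tokens(tokenize(strip_html(text)))

print(analyze("<p>The Quick, brown FOX!</p>"))  # ['quick', 'brown', 'fox']
print(analyze("quick BROWN fox"))               # ['quick', 'brown', 'fox']
```

The two inputs differ in markup, case, and punctuation, yet both reduce to the same token list, which is what lets a plain query match a messy document.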

Stemming and lemmatization

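Stemming chops suffixes with simple rules, so "running" becomes "run"; the output is not guaranteed to be a real word. Lemmatization uses a dictionary to map a word to its proper base form, so an irregular form such as "ran" also becomes "run". Both increase recall by matching variants, at some cost to precision, because unrelated words can collapse to the same token.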
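
The contrast can be made concrete with a toy sketch. The suffix rules and the lookup table below are assumptions invented for this example; production stemmers (Porter's, for instance) and real lemma dictionaries are far more extensive.

```python
# Hypothetical suffix rules, tried in order.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"running": "run", "ran": "run", "better": "good", "mice": "mouse"}

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

def lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)

for word in ["running", "ran", "better", "news"]:
    print(word, naive_stem(word), lemmatize(word))
```

In this sketch the rule-based stemmer handles "running" but leaves "ran" untouched, while the dictionary maps both to "run". The stemmer also turns "news" into "new", a small example of the precision cost: two unrelated words now share one token.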

The golden rule

The same analyzer must run at index time and at query time. If indexing lowercases but the query does not, a search for "Apple" looks up the token "Apple" while the index holds only "apple", so the lookup finds nothing and results vanish.
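
A minimal sketch of that failure, assuming a toy in-memory inverted index. All names here (index_analyzer, broken_query_analyzer, build_index, search) are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def index_analyzer(text: str) -> list[str]:
    """Index-time analysis: split on whitespace and lowercase."""
    return [t.lower() for t in text.split()]

def broken_query_analyzer(text: str) -> list[str]:
    """Query-time analysis that forgets to lowercase."""
    return text.split()

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each index-time token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in index_analyzer(text):
            index[token].add(doc_id)
    return index

def search(index, query, analyzer) -> set[int]:
    """Return ids of documents that contain every query token."""
    postings = [index.get(token, set()) for token in analyzer(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "Apple pie recipe", 2: "apple orchard tour"}
index = build_index(docs)

print(search(index, "Apple", broken_query_analyzer))  # set(): "Apple" is not a key in the index
print(search(index, "Apple", index_analyzer))         # {1, 2}
```

Both documents mention apples, but the mismatched analyzer looks up a token that was never stored. Reusing the index-time analyzer for the query restores both hits.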

Key idea

Analysis normalizes text into consistent tokens, and the same analyzer must run when indexing and when querying.

Check yourself