System Design

Tokenization and Analysis

Turning raw text into the clean, comparable tokens that fill the index.

4 min read · intro

From text to tokens

Before any word reaches the index it passes through an analysis pipeline. The goal is that a query and a matching document produce the same token even if the raw text differs in case, punctuation, or word form.

The pipeline has three common stages:

  • Character filtering strips or rewrites raw characters, such as removing HTML tags.
  • Tokenization splits the stream into tokens, usually on whitespace and punctuation.
  • Token filtering transforms each token, for example lowercasing, removing stop words, or reducing words to a root.
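The three stages above can be sketched as three small functions chained together. This is a minimal illustration, not any particular search engine's analyzer: the function names, the regular expressions, and the tiny stop-word list are all assumptions made for the example.

```python
import re
from html import unescape

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def char_filter(text: str) -> str:
    """Character filtering: replace HTML tags with spaces and decode entities."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return unescape(without_tags)


def tokenize(text: str) -> list[str]:
    """Tokenization: keep runs of ASCII letters and digits, split on everything else."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens: list[str]) -> list[str]:
    """Token filtering: lowercase each token, then drop stop words."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text: str) -> list[str]:
    """Run the full pipeline: character filter, tokenizer, token filter."""
    return token_filter(tokenize(char_filter(text)))


print(analyze("<p>The Quick, brown FOX!</p>"))  # ['quick', 'brown', 'fox']
print(analyze("quick brown fox"))               # ['quick', 'brown', 'fox']
```

The document text and the query text differ in markup, case, and punctuation, yet both produce the same tokens, which is the point of the pipeline.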

Stemming and lemmatization

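Stemming chops suffixes with simple rules, so "running" becomes "run". The result need not be a real word: the Porter stemmer, for example, reduces "studies" to "studi". Lemmatization uses a dictionary to map a word to its proper base form, which lets it handle irregular forms such as "mice" to "mouse". Both increase recall by matching variants, at some cost to precision, because unrelated words can collapse to the same token.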
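To make the contrast concrete, here is a toy sketch of both approaches. The suffix rules and the lemma dictionary are hypothetical and far smaller than anything a real stemmer or lemmatizer uses; they only illustrate rule-based chopping versus dictionary lookup.

```python
# Hypothetical suffix rules, tried in order; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lemma dictionary; a real lemmatizer ships a full vocabulary.
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse", "better": "good"}


def toy_stem(token: str) -> str:
    """Rule-based stemming: strip the first matching suffix, no dictionary involved."""
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo a doubled final consonant ("runn" -> "run"), but leave
            # doubled vowels and doubled l or s alone ("fall", "miss").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return token


def toy_lemmatize(token: str) -> str:
    """Dictionary-based lemmatization: look the word up, else leave it unchanged."""
    return LEMMAS.get(token, token)


print(toy_stem("running"), toy_lemmatize("running"))  # run run
print(toy_stem("mice"), toy_lemmatize("mice"))        # mice mouse
print(toy_stem("news"), toy_stem("new"))              # new new
```

The last line shows the precision cost: the rules collapse "news" and "new" into one token, so a search for one would also match the other.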

The golden rule

The exact same analyzer must run at index time and at query time. If indexing lowercases but the query does not, a query for "Python" looks up the token "Python" while the index holds only "python", so the lookup misses and the result vanishes.
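The failure is easy to reproduce with a small inverted index. The sketch below is illustrative only: the two analyzers, the sample documents, and the AND-style search are assumptions chosen to keep the example short.

```python
from collections import defaultdict


def lowercase_analyzer(text):
    """Split on whitespace and lowercase every token."""
    return text.lower().split()


def raw_analyzer(text):
    """Split on whitespace only, leaving case untouched."""
    return text.split()


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyzer):
    """Return ids of documents containing every query token (AND semantics)."""
    postings = [index.get(token, set()) for token in analyzer(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents.
docs = {1: "Python search engines", 2: "Rust search libraries"}
index = build_index(docs, lowercase_analyzer)

matched = search(index, "Python Search", lowercase_analyzer)
missed = search(index, "Python Search", raw_analyzer)
print(matched)  # {1}
print(missed)   # set()
```

The index and the documents are identical in both searches. Only the query-side analyzer changed, and that alone was enough to lose the match.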

Key idea

Analysis normalizes text into consistent tokens, and the same analyzer must run when indexing and when querying.

Check yourself


1. Why must the same analyzer run at index and query time?

2. What does stemming primarily improve?