Machine Learning

Semantic Chunking

Splitting documents where meaning shifts instead of at fixed lengths.

5 min read · core

Beyond fixed sizes

Fixed-size chunking cuts blindly at a set character or token count, which can break a thought mid-argument or fuse two unrelated topics into one chunk. Semantic chunking instead places boundaries where the meaning of the text actually changes, so each chunk holds one coherent idea.
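To make the failure mode concrete, here is a minimal sketch of fixed-size chunking by character count. The function name `fixed_size_chunks` and the sample text are illustrative assumptions, not part of any particular library.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring sentence and topic boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Two unrelated sentences (hypothetical sample text).
text = "Cats purr when content. Stock prices fell on Monday."
chunks = fixed_size_chunks(text, 30)

for chunk in chunks:
    print(repr(chunk))
```

With a size of 30, the first chunk contains the whole cat sentence plus the first word of the finance sentence, so a single chunk mixes two topics and the second sentence is split in the middle.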

How it decides boundaries

A common method embeds each sentence, then walks through the document comparing the embeddings of consecutive sentences, typically with cosine similarity.

  • When two neighboring sentences are similar, they belong together and stay in the same chunk.
  • When the similarity drops sharply, that gap marks a topic shift, and a boundary is placed there.

The threshold for what counts as a sharp drop is a tunable parameter, often set as a percentile of the similarity gaps observed across the document.
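The procedure above can be sketched in a few lines. This is one possible reading, under stated assumptions: `embed` is a toy bag-of-words stand-in for a real sentence embedding model, the "gap" between neighbors is taken to be 1 minus their cosine similarity, and the 80th percentile default is an arbitrary illustrative choice. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences: list[str], pct: float = 80) -> list[list[str]]:
    """Group sentences, starting a new chunk wherever the gap is unusually large."""
    if len(sentences) < 2:
        return [list(sentences)]
    vectors = [embed(s) for s in sentences]
    # Gap between neighbors: 1 - cosine similarity (assumed definition).
    gaps = [1 - cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(gaps, pct)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], gaps):
        if gap > threshold:  # sharp drop in similarity: place a boundary here
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "The cat sat on the mat.",
    "The cat slept on the mat.",
    "Interest rates rose this quarter.",
    "Interest rates fell last quarter.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

On this sample the only gap above the threshold falls between the second and third sentences, so the two cat sentences form one chunk and the two interest-rate sentences form another. In practice the toy `embed` would be replaced by a real embedding model, and the percentile would be tuned on representative documents.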

Costs and payoffs

Semantic chunking costs extra embedding work up front, since every sentence must be embedded before any boundary can be placed, and it produces variable-length chunks, which complicates packing. In return each chunk is topically clean, so its single embedding represents it faithfully and retrieval precision tends to rise.
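One way to see why variable lengths complicate packing is to try fitting ranked chunks into a fixed budget. The sketch below assumes "packing" means filling a character budget (for example a prompt window) and uses a simple greedy rule; `pack_chunks`, the budget, and the chunk sizes are all hypothetical.

```python
def pack_chunks(chunks: list[str], budget: int) -> tuple[list[str], int]:
    """Greedily keep chunks, in ranked order, while they fit a character budget.

    Returns the kept chunks and the number of characters left unused.
    """
    packed, used = [], 0
    for chunk in chunks:
        if used + len(chunk) <= budget:
            packed.append(chunk)
            used += len(chunk)
    return packed, budget - used


# Chunks of uneven length, best match first (made-up sizes).
ranked = ["a" * 120, "b" * 300, "c" * 60]
packed, unused = pack_chunks(ranked, 200)
print(len(packed), unused)
```

Here the second-ranked chunk is too long to fit and is skipped, and part of the budget stays empty. With equal-length chunks the number that fits is known in advance, which is the convenience semantic chunking gives up.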

Why it matters

When documents wander across many subjects, semantic boundaries keep each retrievable unit pure, which is exactly what an embedding model needs to match queries well.

Key idea

Semantic chunking sets boundaries where sentence embeddings show a topic shift, producing topically pure chunks whose embeddings match queries more precisely than fixed-size cuts.

Check yourself

1. How does semantic chunking find a boundary?

2. What is a downside of semantic chunking?