Machine Learning

The Chunking Strategy for Documents

Why you split documents before embedding them, and how the split shapes results.

5 min read · core

Why split at all

Long documents make poor retrieval units. A single embedding of a whole page blurs many topics together, and both embedding models and the language models that read the retrieved text accept only a limited amount of input. Chunking breaks a document into smaller passages that each carry one focused idea.

What good chunks look like

  • Self-contained: a chunk should make sense on its own, since it is retrieved alone.
  • Topically coherent: split at natural boundaries like paragraphs or sections.
  • Right-sized: a chunk that is too large dilutes its meaning, and one that is too small loses context.

Strategies in practice

  • Fixed size: split every few hundred tokens. Simple, but it may cut mid-sentence.
  • Structure-aware: split on headings, paragraphs, or sentences to respect meaning.
  • Recursive: try large boundaries first, then fall back to smaller ones.
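
The three strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter: the function names are hypothetical, sizes are counted in characters as a stand-in for tokens, and the separator order ("\n\n", then ". ", then " ") is an assumption about what counts as a large or small boundary.

```python
import re

def fixed_size_chunks(text, size=40):
    """Fixed size: cut every `size` characters, even mid-sentence."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def structure_aware_chunks(text):
    """Structure-aware: split on blank lines so each paragraph is one chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def recursive_chunks(text, max_len=80, separators=("\n\n", ". ", " ")):
    """Recursive: try the largest boundary first, fall back to smaller ones."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No boundary left to try: fall back to a hard cut.
        return fixed_size_chunks(text, max_len)
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        piece = piece.strip()
        if not piece:
            continue
        # Greedily pack neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: split it on a smaller boundary.
            chunks.extend(recursive_chunks(piece, max_len, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

With a two-paragraph document and a `max_len` that fits one paragraph, `recursive_chunks` returns the same chunks as `structure_aware_chunks`. With a smaller `max_len` it drops down to sentence boundaries instead of cutting mid-sentence. One simplification to note: a sentence that ends a chunk loses its trailing ". " separator.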

Why it matters so much

Retrieval can only return the chunks you created. If an answer spans two badly split chunks, neither chunk alone may rank well for the query. Chunking quietly sets the ceiling on retrieval quality.
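
A toy example makes the ceiling concrete. The sketch below scores chunks by simple word overlap with the query, which is only a stand-in for real embedding similarity. The query, the sample text, and the function names are all made up for illustration.

```python
import re

def overlap_score(query, chunk):
    """Toy relevance: fraction of query words that appear in the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q)

def best_score(query, chunks):
    """Score of the top-ranked chunk for this query."""
    return max(overlap_score(query, c) for c in chunks)

query = "annual refund window days"

# Split at the sentence boundary: the answer lives in one chunk.
good_split = [
    "Annual plans include a refund window of 30 days.",
    "Monthly plans renew automatically.",
]

# Split mid-sentence: the answer is spread over two chunks.
bad_split = [
    "Annual plans include a refund",
    "window of 30 days. Monthly plans renew automatically.",
]

print(best_score(query, good_split))  # 1.0
print(best_score(query, bad_split))   # 0.5
```

The text is identical in both cases. Only the boundary moved, and the best available match dropped from a full hit to a half hit.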

Key idea

Chunking turns documents into focused, self-contained passages, and the choice of boundaries and size sets the ceiling on what retrieval can find.

Check yourself

1. Why are whole documents poor retrieval units?

2. What does a structure-aware chunking strategy do?