Why split at all
Long documents make poor retrieval units. A single embedding of a whole page blurs many topics together, and a model's context window limits how much retrieved text it can use at once. Chunking breaks a document into smaller passages that each carry a focused idea.
What good chunks look like
- Self-contained: a chunk should make sense on its own, since it is retrieved without its surrounding text.
- Topically coherent: split at natural boundaries like paragraphs or sections.
- Right-sized: a chunk that is too large dilutes its meaning, and one that is too small loses context.
Strategies in practice
- Fixed size: split every few hundred tokens. Simple, but may cut mid-sentence.
- Structure aware: split on headings, paragraphs, or sentences to respect meaning.
- Recursive: try large boundaries first, then fall back to smaller ones.
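The fixed size and recursive strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it counts whitespace-separated words as a stand-in for tokens, the names fixed_size_chunks, recursive_chunks, and SEPARATORS are hypothetical, and unlike most production splitters it does not merge small neighbouring pieces back together.

```python
import re

def fixed_size_chunks(words, size):
    """Fixed size: cut every `size` words, ignoring sentence boundaries."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Boundaries ordered from largest to smallest (assumed for this sketch):
# blank-line paragraph breaks, then sentence ends.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+"]

def recursive_chunks(text, max_words, separators=SEPARATORS):
    """Recursive: keep text whole if it fits, otherwise split on the
    largest remaining boundary and recurse into each piece."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    if not separators:
        # No natural boundary left: fall back to a fixed-size cut.
        return fixed_size_chunks(text.split(), max_words)
    pieces = re.split(separators[0], text)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_chunks(piece, max_words, separators[1:]))
    return chunks

text = "Cats sleep a lot. Dogs bark loudly.\n\nBirds fly south in winter."
fixed = fixed_size_chunks(text.split(), 5)
recursive = recursive_chunks(text, 5)
```

With a limit of 5 words, the fixed size split yields "Cats sleep a lot. Dogs", "bark loudly. Birds fly south", and "in winter.", cutting across both sentence and paragraph boundaries. The recursive split yields one chunk per sentence, because the first paragraph is too long and is broken at its sentence boundary, while the second paragraph fits and is kept whole.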
Why it matters so much
Retrieval can only return chunks you created. If an answer spans two badly split chunks, neither alone may rank well. Chunking quietly sets the ceiling on retrieval quality.
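A toy example makes the split-answer problem concrete. The sketch below scores chunks by the fraction of query words they contain, which is a deliberately crude stand-in for a real retriever (embedding similarity or BM25); the function overlap_score, the query, and the sample sentence are all made up for illustration.

```python
import re

def overlap_score(query, chunk):
    """Toy relevance: fraction of query words that appear in the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q)

query = "refund window online orders"

# The same sentence, split badly in the middle versus kept whole.
bad_split = ["Our refund window is thirty", "days for all online orders."]
good_split = ["Our refund window is thirty days for all online orders."]

bad_scores = [overlap_score(query, c) for c in bad_split]
good_scores = [overlap_score(query, c) for c in good_split]
```

Here each half of the bad split matches only half of the query (scores of 0.5 and 0.5), while the intact sentence matches all of it (1.0). Under this toy scoring, either half could be outranked by an unrelated chunk that matches three of the four query words.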
Key idea
Chunking turns documents into focused, self-contained passages, and the choice of boundaries and size sets the ceiling on what retrieval can find.