Machine Learning

Document Chunking Strategies

How to split documents so retrieval finds the right context.

What it is

Chunking splits long documents into smaller pieces before embedding them for retrieval. Chunk size shapes what a vector search can find and how much context an LLM receives, so it strongly affects answer quality.

The size trade-off

  • Tiny chunks give precise matches but may lack enough context to answer.
  • Huge chunks carry full context but dilute the embedding: one vector has to represent many ideas, so the relevant sentence gets averaged away and recall drops.

The sweet spot is a chunk that covers one coherent idea.
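
The dilution effect can be illustrated with a toy example. This is only a sketch: word counts stand in for a real embedding model, and the query and chunk texts are made up for illustration. The point is that padding a relevant sentence with unrelated text lowers its cosine similarity to the query.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical query and chunks, invented for this illustration.
query = "refund window for returns"
small_chunk = "the refund window for returns is thirty days"
large_chunk = (
    small_chunk
    + " shipping takes five days our offices are in berlin"
    + " support is open on weekdays"
)

small_score = cosine(bow(query), bow(small_chunk))
large_score = cosine(bow(query), bow(large_chunk))

# The large chunk contains the same relevant sentence but scores lower,
# because the unrelated words inflate its vector norm.
print(f"small: {small_score:.3f}  large: {large_score:.3f}")
```

Real embedding models behave less mechanically than word counts, so treat this as intuition rather than a measurement.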

Strategies

  • Fixed size with overlap: split every N tokens and repeat the last few tokens of each chunk at the start of the next, so a fact spanning a boundary appears whole in at least one chunk.
  • Semantic or structural: split on natural units like paragraphs, headings, or sentences, which keeps ideas intact.
  • Parent document: embed small chunks for matching but return the larger parent section for context.
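
As a minimal sketch of the fixed-size strategy, the function below slides a window of `size` tokens forward by `size - overlap` each step. The name `chunk_fixed` and the whitespace tokenization are assumptions for illustration; a real pipeline would normally count tokens with the embedding model's own tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens
    with the previous window. (Hypothetical helper, not a library API.)"""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `size=4` and `overlap=1`, each chunk starts with the last token of the previous one, so content near a boundary is present in both neighbors.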

Pick the unit that matches your data: code by function, prose by paragraph, tables by row group.
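
The structural and parent document strategies can be combined in a short sketch. Here paragraphs (split on blank lines) are the parents, sentences are the small chunks used for matching, and a shared-word count stands in for vector similarity. All function names, the naive sentence splitter, and the sample document are assumptions made for illustration.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Structural split: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence split after ., ! or ? (an assumption; real text needs more care)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(query_words & chunk_words)


def retrieve_parent(query: str, document: str) -> str:
    """Match against small sentence chunks, but return the whole parent paragraph."""
    parents = split_paragraphs(document)
    # Each child chunk remembers the index of the paragraph it came from.
    children = [
        (sentence, idx)
        for idx, paragraph in enumerate(parents)
        for sentence in split_sentences(paragraph)
    ]
    if not children:
        return ""
    _, parent_idx = max(children, key=lambda child: overlap_score(query, child[0]))
    return parents[parent_idx]


# Hypothetical two-paragraph document.
document = """Returns are accepted within thirty days. Items must be unused.

Shipping takes five days. Express delivery costs extra."""

result = retrieve_parent("how long does shipping take", document)
print(result)
```

The query matches only the sentence about shipping, yet the caller receives the full paragraph, including the neighboring sentence about express delivery.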

Key idea

Chunking balances precision against context: chunks built around one coherent idea, combined with overlap or parent retrieval, generally retrieve better than arbitrary large blocks.

Check yourself

1. Why can very large chunks hurt retrieval recall?

2. What does the parent document strategy do?