Machine Learning

Chunk Overlap Tuning

Why neighboring chunks share text and how much overlap to use.

5 min read · intro

The boundary problem

When you cut a document into chunks, the key sentence for a query may land right on a boundary, split between two pieces. Neither chunk then holds the full thought, and retrieval can miss it. Chunk overlap copies a slice of text from the end of one chunk into the start of the next so ideas that straddle a cut survive in at least one piece.

How overlap works

Overlap is usually set as a fraction of chunk size, often ten to twenty percent. With a 500-token chunk and a 50-token overlap (ten percent), each chunk after the first begins with the last 50 tokens of its predecessor, so the chunking window advances 450 tokens at a time.

  • Too little overlap risks splitting a sentence or a definition across chunks.
  • Too much overlap stores the same text many times, inflating the index and returning near-duplicate results.
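
The sliding window described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name chunk_with_overlap and the placeholder token list are assumptions made for the example, and a real pipeline would take its tokens from whatever tokenizer it already uses.

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap=50):
    """Split a token list into chunks where each chunk after the first
    repeats the last `overlap` tokens of its predecessor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens standing in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=500, overlap=50)

# Total tokens stored versus tokens in the source: the cost of overlap.
stored = sum(len(c) for c in chunks)
inflation = stored / len(tokens)
```

Here the window advances 450 tokens per step, so the first 50 tokens of each later chunk match the last 50 of the chunk before it. Comparing stored with len(tokens) gives a rough measure of how much the index grows as overlap increases.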

Tuning it

Set overlap large enough to capture a typical complete thought, such as a sentence or a short list, but no larger. Watch for duplicate hits in retrieval results, a sign that the overlap is wasting space.
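
One way to watch for duplicate hits is to check how much boundary text the retrieved chunks share. The sketch below assumes each hit is available as a list of tokens; the helper names and the 0.3 threshold are hypothetical choices for illustration, not recommended values.

```python
from itertools import combinations


def shared_boundary(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def near_duplicate_pairs(hits, threshold=0.3):
    """Return index pairs of retrieved chunks whose shared boundary text
    is at least `threshold` of the shorter chunk (threshold is illustrative)."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(hits), 2):
        if not a or not b:
            continue  # skip empty chunks to avoid dividing by zero
        shared = max(shared_boundary(a, b), shared_boundary(b, a))
        if shared / min(len(a), len(b)) >= threshold:
            flagged.append((i, j))
    return flagged


# Toy hits: the first two share four tokens at their boundary.
hits = [list(range(0, 10)), list(range(6, 16)), list(range(20, 30))]
flagged = near_duplicate_pairs(hits)
```

If many of the top results for typical queries end up flagged, that suggests the overlap is larger than it needs to be.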

Why it matters

Overlap is cheap insurance against boundary loss, but excessive overlap quietly bloats storage and crowds out diverse results.

Key idea

Chunk overlap copies a small slice of text between neighboring chunks so boundary-spanning ideas survive. Tune it just large enough to hold a complete thought without bloating the index.

Check yourself

1. What problem does chunk overlap solve?

2. What is a symptom of too much overlap?