Machine Learning

The Overlap in Chunking

Sharing text between neighboring chunks so meaning is not cut in half.

4 min read · core

The problem at the seams

When you split a document into chunks, a single idea can fall exactly on a boundary. Half lands in one chunk and half in the next, so neither chunk fully captures it. Overlap copies a slice of text from the end of one chunk into the start of the next.
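As a rough illustration of the idea, here is a minimal sketch of overlapping chunking. It assumes the document has already been split into a list of tokens (plain whitespace-separated words here); the function name `chunk_with_overlap` and its parameters are hypothetical, not part of any particular library.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into chunks of up to chunk_size items.

    Each chunk after the first starts with the last `overlap` tokens
    of the chunk before it. Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


words = "the cat sat on the mat today".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=2):
    print(chunk)
# ['the', 'cat', 'sat', 'on']
# ['sat', 'on', 'the', 'mat']
# ['the', 'mat', 'today']
```

With an overlap of two words, the phrase "on the mat" is cut by the first boundary but appears whole in the second chunk.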

Why overlap helps

  • Context preservation: a sentence that would be cut at a boundary still appears whole in at least one chunk, provided the overlap is long enough to cover it.
  • Better recall: the answer near a boundary now lives intact in a retrievable chunk.
  • Smoother retrieval: queries that straddle a seam still match somewhere.

The cost side

Overlap is not free. With the chunk size held fixed, shared text means more chunks and more vectors, which raises storage and search cost. It can also surface near-duplicate results that need deduplication.
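To put a number on that cost, the sketch below counts how many chunks a document produces for a given overlap. It assumes a fixed chunk size and a sliding window that advances by `chunk_size - overlap` tokens each step; the function name `count_chunks` and the 10,000-token document are illustrative assumptions.

```python
def count_chunks(n_tokens, chunk_size, overlap):
    """Number of chunks a sliding window produces over n_tokens.

    Assumes the window advances by (chunk_size - overlap) tokens and
    the final chunk may be shorter. Hypothetical helper for illustration.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    remaining = n_tokens - chunk_size
    # Integer ceiling of remaining / step, plus the first chunk.
    return -(-remaining // step) + 1


for overlap in (0, 50, 100):
    print(overlap, count_chunks(10_000, chunk_size=500, overlap=overlap))
# 0 20
# 50 23
# 100 25
```

Under these assumptions, a twenty percent overlap turns 20 chunks into 25, so the index holds a quarter more vectors for the same document.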

Picking an amount

A common choice is a modest overlap, perhaps ten to twenty percent of the chunk size: enough to catch boundary ideas without bloating the index.
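A percentage still has to become a concrete token count before chunking. The sketch below shows one way to do that conversion; the helper name `overlap_size` and the 512-token chunk size are assumptions chosen for illustration, not recommended values.

```python
def overlap_size(chunk_size, fraction):
    """Convert an overlap fraction (e.g. 0.10 for ten percent) to a token count.

    The result is capped below chunk_size so the chunking window always
    moves forward. Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= fraction < 1:
        raise ValueError("fraction must satisfy 0 <= fraction < 1")
    return min(round(chunk_size * fraction), chunk_size - 1)


print(overlap_size(512, 0.10))  # 51
print(overlap_size(512, 0.20))  # 102
```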

Key idea

Overlap copies a slice of text between neighboring chunks so ideas at a boundary stay intact, improving recall at the cost of more vectors to store and search.

Check yourself

1. What problem does chunk overlap solve?

2. What is a cost of using overlap?