System Design

Deduplication Storage

Storing identical data only once to cut storage cost dramatically.

Store each unique piece once

Deduplication removes redundant copies of data so that identical content is stored only once. Backups, virtual machine images, and document repositories often contain enormous overlap, and dedup can shrink them manyfold.

Chunking and fingerprints

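Data is split into chunks, and each chunk is hashed to produce a fingerprint, typically with a cryptographic hash such as SHA-256 so that two different chunks are vanishingly unlikely to share one. The system keeps an index of the fingerprints it has already stored. When a new chunk's fingerprint is already in the index, the system stores only a reference to the existing chunk instead of writing the bytes again.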
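
The lookup-then-store flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the `ChunkStore` class, its method names, the 4-byte chunk size, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class ChunkStore:
    """Toy dedup store (hypothetical): maps fingerprint -> chunk bytes."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # tiny on purpose, so the demo is readable
        self.chunks = {}              # the fingerprint index

    def put(self, data):
        """Store data and return its recipe: the ordered list of fingerprints."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:   # new content: keep the bytes
                self.chunks[fp] = chunk
            recipe.append(fp)           # duplicate or not, record a reference
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = ChunkStore(chunk_size=4)
recipe_a = store.put(b"AAAABBBBCCCC")
recipe_b = store.put(b"AAAABBBBDDDD")

# Six chunks were written, but only four distinct ones are kept.
print(len(recipe_a) + len(recipe_b), "chunks referenced,", len(store.chunks), "stored")
```

Each file is kept as a recipe of fingerprints, so reading it back is a matter of following the references in order.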

Fixed versus variable chunking

  • Fixed-size chunking is simple, but a single inserted byte shifts every later boundary, so none of the later chunks match their earlier versions.
  • Content-defined chunking sets boundaries based on the data itself, so an insertion disturbs only nearby chunks and most fingerprints still match. This catches far more duplicates.
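
The difference between the two strategies can be seen in a small experiment. The sketch below is illustrative only and rests on several assumptions: the function names, the 16-byte window, the 5-bit mask, and the use of `zlib.crc32` recomputed at every position are choices made for clarity. Real systems typically use a rolling hash (Rabin fingerprints, for example) and enforce minimum and maximum chunk sizes.

```python
import hashlib
import zlib

WINDOW = 16   # bytes examined for each boundary decision (assumed value)
MASK = 0x1F   # cut when the low 5 bits are zero: roughly 32-byte chunks on average


def fixed_chunks(data, size=32):
    """Cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data):
    """Cut wherever the hash of the preceding WINDOW bytes matches the mask."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # The decision depends only on the WINDOW bytes just before `end`,
        # so it does not change when bytes are inserted far earlier.
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])   # trailing remainder
    return chunks


# Deterministic pseudo-random test data: 128 SHA-256 digests, 4096 bytes.
data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
edited = b"X" + data   # one byte inserted at the front

shared_fixed = len(set(fixed_chunks(data)) & set(fixed_chunks(edited)))
shared_cdc = len(set(cdc_chunks(data)) & set(cdc_chunks(edited)))
print("chunks shared after the insertion: fixed =", shared_fixed, " cdc =", shared_cdc)
```

With the one-byte insertion, every fixed-size chunk is shifted, while the content-defined boundaries realign after the first chunk and the rest of the chunks are byte-for-byte identical to the originals.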

The costs to weigh

Every incoming chunk triggers a lookup in the fingerprint index, so the index must be fast, and keeping it fast takes memory. Hashing every chunk costs CPU. And because many files now share chunks, reference counting is needed so that a chunk is freed only when no file references it. Despite these costs, the space savings are usually well worth it for redundant data.
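
Reference counting can be sketched as a counter kept next to each stored chunk. The code below is a simplified, single-threaded illustration. The `RefCountedStore` class, its method names, and the file names are hypothetical, and the caller is assumed to pass in data that has already been chunked.

```python
import hashlib


class RefCountedStore:
    """Toy dedup store (hypothetical) that frees a chunk when its count hits zero."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk bytes
        self.refcount = {}  # fingerprint -> number of references from files
        self.files = {}     # file name -> ordered list of fingerprints

    def write(self, name, chunks):
        if name in self.files:
            self.delete(name)          # overwrite: release the old references first
        recipe = []
        for chunk in chunks:
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                self.chunks[fp] = chunk
                self.refcount[fp] = 0
            self.refcount[fp] += 1     # one count per reference, even within a file
            recipe.append(fp)
        self.files[name] = recipe

    def delete(self, name):
        for fp in self.files.pop(name):
            self.refcount[fp] -= 1
            if self.refcount[fp] == 0:  # no file references this chunk any more
                del self.refcount[fp]
                del self.chunks[fp]


store = RefCountedStore()
store.write("a.img", [b"base", b"app-v1"])
store.write("b.img", [b"base", b"app-v2"])
store.delete("a.img")

# "base" survives because b.img still references it; "app-v1" is freed.
print(sorted(store.chunks.values()))
```

Deleting `a.img` removes only the chunk that no other file uses, which is exactly the guarantee that makes sharing chunks safe.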

Key idea

Deduplication splits data into chunks, fingerprints each one, and stores duplicates as references. Content-defined chunking catches the most overlap, and reference counting ensures shared chunks are freed only when truly unused.

Check yourself

1. How does deduplication decide a chunk is a duplicate?

2. Why does content-defined chunking catch more duplicates than fixed-size chunking?

3. Why is reference counting needed in a dedup store?