Deduplication Storage

Store each unique piece once

Deduplication removes redundant copies of data so that identical content is stored only once. Backups, virtual machine images, and document repositories contain a great deal of overlapping content, so deduplication can shrink them manyfold.

Chunking and fingerprints

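Data is split into chunks, and each chunk is hashed to produce a fingerprint. The hash is typically a cryptographic one, so two different chunks are vanishingly unlikely to share a fingerprint. The system keeps an index of the fingerprints it has already stored. When a new chunk's fingerprint is already in the index, the system stores only a reference to the existing chunk instead of writing the bytes again.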
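
The lookup-then-reference flow can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the ChunkStore class, its method names, the tiny fixed chunk size, and the choice of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib


class ChunkStore:
    """Toy in-memory store (hypothetical): one copy of each unique chunk."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint index: fingerprint -> chunk bytes

    def put(self, data):
        """Store data and return its recipe, the ordered list of fingerprints."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                # First time this content is seen: keep the bytes.
                self.chunks[fp] = chunk
            # Seen before or not, the file only records a reference.
            recipe.append(fp)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes by following the references."""
        return b"".join(self.chunks[fp] for fp in recipe)

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore(chunk_size=4)  # tiny chunks so the effect is visible
recipe = store.put(b"AAAABBBBAAAA")
print(len(recipe), "references,", store.stored_bytes(), "bytes stored")
```

The 12-byte input yields three references but only two stored chunks, because the first and last chunks have the same fingerprint.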

Fixed versus variable chunking

Fixed-size chunking is simple, but a single inserted byte shifts all later data relative to the chunk boundaries, so every subsequent chunk changes and its fingerprint no longer matches.
Content-defined chunking sets boundaries based on the data itself, so an insertion disturbs only the nearby chunks and most fingerprints still match. As a result, it finds far more duplicates in data that has been edited.
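
The difference is easy to see in a small experiment. The sketch below is illustrative only: the window size, the mask, and the use of CRC-32 as the boundary test are assumptions chosen for brevity. Real systems generally use a rolling hash that updates in constant time per byte and also enforce minimum and maximum chunk sizes, whereas this version rehashes the window from scratch at every position.

```python
import hashlib
import random
import zlib

WINDOW = 16  # bytes examined when deciding on a boundary (assumed value)
MASK = 0x3F  # cut when the low 6 bits are zero: roughly 64-byte chunks


def fixed_chunks(data, size=64):
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data):
    """Cut wherever the hash of the trailing window matches a pattern."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1):i + 1]
        if (zlib.crc32(window) & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = b"!" + original  # insert a single byte at the front

fixed_shared = len(fingerprints(fixed_chunks(original))
                   & fingerprints(fixed_chunks(edited)))
cdc_shared = len(fingerprints(content_defined_chunks(original))
                 & fingerprints(content_defined_chunks(edited)))
print("fixed-size chunks still shared:     ", fixed_shared)
print("content-defined chunks still shared:", cdc_shared)
```

Because each boundary decision depends only on the last WINDOW bytes, the chunking resynchronizes within WINDOW bytes of the insertion, and every chunk after that point keeps its fingerprint. With fixed-size chunking, the one-byte shift changes the contents of every chunk.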

The costs to weigh

The fingerprint index must be fast to look up, which usually means keeping much of it in memory. Hashing every chunk costs CPU time. And because many files now share chunks, the system needs reference counting so that a chunk is freed only when no file refers to it any longer. Despite these costs, the space savings are usually well worth it for highly redundant data.
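
Reference counting can be sketched as a counter kept alongside the fingerprint index. The code below is a toy under stated assumptions: the RefCountedStore class, its write and delete methods, and the 4-byte chunk size are hypothetical names and values chosen for the example, and everything lives in memory.

```python
import hashlib
from collections import Counter


class RefCountedStore:
    """Toy store (hypothetical) that frees a chunk when its count hits zero."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}       # fingerprint -> chunk bytes
        self.refs = Counter()  # fingerprint -> number of references
        self.files = {}        # file name -> list of fingerprints

    def write(self, name, data):
        if name in self.files:
            self.delete(name)  # drop the old version's references first
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # store bytes only if new
            self.refs[fp] += 1                 # one count per reference
            recipe.append(fp)
        self.files[name] = recipe

    def delete(self, name):
        for fp in self.files.pop(name):
            self.refs[fp] -= 1
            if self.refs[fp] == 0:
                # No file refers to this chunk any longer: reclaim it.
                del self.refs[fp]
                del self.chunks[fp]


s = RefCountedStore()
s.write("a", b"AAAABBBB")
s.write("b", b"BBBBCCCC")  # shares the BBBB chunk with file "a"
chunks_with_both = len(s.chunks)
s.delete("a")              # frees AAAA, keeps BBBB for file "b"
chunks_after_first_delete = len(s.chunks)
s.delete("b")
chunks_after_second_delete = len(s.chunks)
print(chunks_with_both, chunks_after_first_delete, chunks_after_second_delete)
```

Deleting the first file reclaims only the chunk that nothing else uses. The shared chunk survives until the second file is deleted as well.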

Key idea

Deduplication splits data into chunks, fingerprints each one, and stores duplicates as references. Content-defined chunking catches the most overlap, and reference counting ensures that shared chunks are freed only when they are truly unused.
