Store each unique piece once
Deduplication removes redundant copies of data so that identical content is stored only once. Backups, virtual machine images, and document repositories contain enormous overlap, and dedup can shrink them manyfold.
Chunking and fingerprints
Data is split into chunks, and each chunk is hashed to produce a fingerprint. The system keeps an index of fingerprints already stored. When a new chunk's fingerprint is already present, the system stores only a reference to the existing chunk instead of the bytes.
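The flow above can be sketched in a few lines. This is a minimal illustration, not a production design: the class name DedupStore, the tiny 4-byte chunk size, the choice of SHA-256 as the fingerprint, and the single in-memory dictionary serving as both index and chunk storage are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory store (hypothetical): maps fingerprint -> chunk bytes."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # unrealistically small, for demonstration
        self.chunks = {}              # fingerprint index and chunk storage in one dict

    def put(self, data):
        """Store data; return its recipe, the ordered list of fingerprints."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                # New content: store the bytes once.
                self.chunks[fp] = chunk
            # Seen or not, the file only records a reference.
            recipe.append(fp)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = DedupStore(chunk_size=4)
recipe_a = store.put(b"AAAABBBBCCCC")
recipe_b = store.put(b"AAAABBBBDDDD")
print(len(store.chunks), "unique chunks stored for 6 chunks written")
```

The two inputs share their first two chunks, so those bytes are stored once and both recipes point at the same fingerprints.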
Fixed versus variable chunking
- Fixed-size chunking is simple, but a single inserted byte shifts every later boundary, defeating matches.
- Content-defined chunking sets boundaries based on the data itself, so an insertion only disturbs nearby chunks and most fingerprints still match. This catches far more duplicates.
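The difference between the two approaches can be seen with a small experiment. The sketch below is an assumption-laden toy: the function names, the 16-byte window, the divisor of 64, and the use of zlib.crc32 over each window are choices made for illustration. Real systems typically use an incrementally updated rolling hash and enforce minimum and maximum chunk sizes, both of which are omitted here.

```python
import random
import zlib


def fixed_chunks(data, size=64):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, window=16, divisor=64):
    """Cut wherever a hash of the last `window` bytes is divisible by `divisor`.

    The hash is recomputed for every window, which is simple but slow;
    it stands in for a true rolling hash.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(a, b):
    """Number of distinct chunks that appear in both lists."""
    return len(set(a) & set(b))


rng = random.Random(0)  # seeded so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"X" + original  # one byte inserted at the very front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(edited))
cdc_shared = shared(cdc_chunks(original), cdc_chunks(edited))
print("fixed-size chunks shared:", fixed_shared)
print("content-defined chunks shared:", cdc_shared)
```

Because each boundary depends only on the bytes in its window, the inserted byte can affect only the first chunk; every later chunk is cut in the same place relative to the content and keeps its fingerprint.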
The costs to weigh
Lookups in the fingerprint index must be fast, which usually means keeping much of it in memory. Hashing every chunk costs CPU. And because many files now share chunks, reference counting is needed so a chunk is freed only when no file references it. Despite these costs, the space savings are usually well worth it for redundant data.
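Reference counting can be sketched as follows. The class RefCountedStore, its method names, and the in-memory dictionaries are hypothetical, and the caller is assumed to pass data that has already been split into chunks. A real system would also need to persist the counts and update them safely under concurrent writes.

```python
import hashlib


class RefCountedStore:
    """Toy store (hypothetical) that frees a chunk only when nothing references it."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes
        self.refs = {}    # fingerprint -> number of references from files
        self.files = {}   # file name -> recipe (ordered fingerprints)

    def write(self, name, chunks):
        if name in self.files:
            # Overwriting: release the old version's references first.
            self.delete(name)
        recipe = []
        for chunk in chunks:
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                self.chunks[fp] = chunk
            self.refs[fp] = self.refs.get(fp, 0) + 1
            recipe.append(fp)
        self.files[name] = recipe

    def delete(self, name):
        for fp in self.files.pop(name):
            self.refs[fp] -= 1
            if self.refs[fp] == 0:
                # Last reference gone: the bytes can finally be reclaimed.
                del self.refs[fp]
                del self.chunks[fp]


store = RefCountedStore()
store.write("a.txt", [b"header", b"shared body"])
store.write("b.txt", [b"shared body", b"footer"])
store.delete("a.txt")
print(sorted(store.chunks.values()))
```

Deleting a.txt frees the chunk only it used, while the chunk it shared with b.txt stays until b.txt is deleted too.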
Key idea
Deduplication splits data into chunks, fingerprints each one, and stores duplicates as references. Content-defined chunking catches the most overlap, and reference counting ensures shared chunks are freed only when truly unused.