What dedup does
Deduplication stores each unique block of data exactly once. When the same block appears again, whether in another file or another user's upload, the system stores only a reference to the existing block instead of new bytes. A reference count tracks how many files point at each block.
How identity is decided
Each block is fingerprinted with a strong hash, typically a cryptographic one such as SHA-256. Two blocks with the same hash are treated as identical. Before writing, the system checks whether that hash already exists; if it does, the system skips the write and adds a reference to the existing block. The hash therefore acts as the block's key, which makes the store content-addressed.
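As a rough illustration of that write path, here is a minimal in-memory sketch. It assumes SHA-256 as the strong hash and uses hypothetical names (BlockStore, put, get) that do not come from any particular system; a real store would persist blocks and handle concurrency.

```python
import hashlib


class BlockStore:
    """Toy in-memory store keyed by the SHA-256 digest of each block (assumed hash)."""

    def __init__(self):
        self.blocks = {}     # digest -> block bytes
        self.refcounts = {}  # digest -> number of references to the block

    def put(self, block: bytes) -> str:
        """Store a block if it is new, add a reference either way, return its key."""
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:
            # First time this content is seen: write the bytes.
            self.blocks[digest] = block
            self.refcounts[digest] = 0
        # New or duplicate, the caller now holds one more reference.
        self.refcounts[digest] += 1
        return digest

    def get(self, digest: str) -> bytes:
        """Return the bytes stored under a digest."""
        return self.blocks[digest]


store = BlockStore()
a = store.put(b"hello world")
b = store.put(b"hello world")     # duplicate: no new bytes are stored
c = store.put(b"something else")  # distinct content: stored separately
```

After these three writes the store holds two blocks, and the duplicated block carries a reference count of two.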
The hard parts
- Garbage collection: a block can be freed only when its reference count reaches zero, so increments and decrements must stay correct under concurrent writes and deletes.
- Hash collisions: vanishingly rare with a strong hash, but if two different blocks shared a hash, the second would never be written and reads of it would return the first block's bytes. The hash choice therefore matters.
- Privacy: cross-user dedup can leak whether a file already exists in the store, so some systems scope dedup per user.
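One possible way to address all three points in a single toy store is sketched below. Every name in it (ScopedBlockStore, put, release, block_count, the scope argument) is hypothetical. It assumes SHA-256, a single process-wide lock for reference counting, a byte comparison on every hash match as a collision guard, and a per-user scope folded into the key. These are illustrative choices, not a description of how any specific system works.

```python
import hashlib
import threading


class ScopedBlockStore:
    """Toy store: per-scope dedup, locked refcounts, byte check on hash match."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = {}  # (scope, digest) -> [block bytes, refcount]

    def put(self, scope: str, block: bytes):
        # The scope (for example a user id) is part of the key, so identical
        # blocks from different scopes are never deduplicated against each other.
        key = (scope, hashlib.sha256(block).hexdigest())
        with self._lock:
            entry = self._blocks.get(key)
            if entry is None:
                self._blocks[key] = [block, 1]
            else:
                # Collision guard: same hash must mean same bytes.
                if entry[0] != block:
                    raise RuntimeError("hash collision detected")
                entry[1] += 1
        return key

    def release(self, key) -> bool:
        """Drop one reference; free the block and return True when none remain."""
        with self._lock:
            entry = self._blocks[key]
            entry[1] -= 1
            if entry[1] == 0:
                del self._blocks[key]
                return True
            return False

    def block_count(self) -> int:
        with self._lock:
            return len(self._blocks)


store = ScopedBlockStore()
k1 = store.put("alice", b"data")
k2 = store.put("alice", b"data")  # same scope: deduplicated
k3 = store.put("bob", b"data")    # different scope: stored separately
```

Releasing k1 leaves the block in place because k2 still references it; releasing k2 frees it. The byte comparison costs a read on every duplicate write, and per-scope keys give up the space savings of cross-user dedup, so both are trade-offs rather than defaults.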
Key idea
Block deduplication stores each unique block once, keyed by a strong hash, and replaces duplicates with references tracked by a count.