Deduplication of Blocks

Store each unique block once and reference it everywhere, collapsing redundant copies.

What dedup does

Deduplication stores each unique block of data exactly once. When the same block appears again, whether in another file or another user's upload, the system stores only a reference to the existing block instead of new bytes. A reference count tracks how many files point at each block.

How identity is decided

Each block is fingerprinted with a strong hash. Two blocks with the same hash are treated as identical. Before writing, the system checks whether that hash already exists; if so, it skips the write and just adds a reference. This makes the store a deduplicating, content-addressed system: a block's key is derived from its contents.
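
The write path described above can be sketched in a few lines. This is a minimal in-memory illustration, not a real storage engine: the class name BlockStore, its put method, and the choice of SHA-256 as the strong hash are all assumptions made for the example.

```python
import hashlib


class BlockStore:
    """Toy in-memory content-addressed block store (hypothetical, for illustration)."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> block bytes
        self.refcounts = {}  # fingerprint -> number of references

    def put(self, data: bytes) -> str:
        """Store the block if it is new; return its fingerprint either way."""
        # Assumption: SHA-256 stands in for the "strong hash".
        fp = hashlib.sha256(data).hexdigest()
        if fp not in self.blocks:
            self.blocks[fp] = data   # first copy: write the bytes
            self.refcounts[fp] = 0
        self.refcounts[fp] += 1      # new or duplicate: add one reference
        return fp


store = BlockStore()
a = store.put(b"hello world")
b = store.put(b"hello world")  # same content: no new bytes are stored
c = store.put(b"other block")
```

Here a and b are the same fingerprint, the store holds two blocks rather than three, and the reference count for the shared block is 2.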

The hard parts

Garbage collection: a block can only be freed when its reference count reaches zero, so increments and decrements must be atomic under concurrent writes; otherwise a block could be freed while a new reference to it is being added.
Hash collisions: vanishingly rare with a strong hash, but a collision would make two different blocks share one stored copy and silently corrupt one of them, so the hash choice matters.
Privacy: cross-user dedup can leak whether a file already exists in the store, so some systems scope dedup per user.
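
One way the three concerns above might be handled together is sketched below, under stated assumptions: a single lock guards every count change, a byte comparison on a hash match detects collisions, and a scope string (for example a user ID) is made part of the key so dedup never crosses users. The names ScopedBlockStore, put, release, and block_count are hypothetical, and SHA-256 is an assumed hash choice.

```python
import hashlib
import threading


class ScopedBlockStore:
    """Toy store: per-scope dedup, locked refcounts, collision check (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = {}  # (scope, fingerprint) -> [data, refcount]

    def put(self, scope: str, data: bytes) -> tuple:
        # Including the scope in the key keeps dedup within one user.
        key = (scope, hashlib.sha256(data).hexdigest())
        with self._lock:  # check-then-update must be atomic
            entry = self._blocks.get(key)
            if entry is None:
                self._blocks[key] = [data, 1]
            else:
                # Same hash, different bytes would be a collision.
                if entry[0] != data:
                    raise ValueError("hash collision detected")
                entry[1] += 1
        return key

    def release(self, key: tuple) -> bool:
        """Drop one reference; return True if the block was freed."""
        with self._lock:
            entry = self._blocks[key]
            entry[1] -= 1
            if entry[1] == 0:
                del self._blocks[key]  # count reached zero: safe to free
                return True
        return False

    def block_count(self) -> int:
        with self._lock:
            return len(self._blocks)


store = ScopedBlockStore()
k1 = store.put("alice", b"payload")
k2 = store.put("alice", b"payload")  # deduplicated within alice's scope
k3 = store.put("bob", b"payload")    # separate copy: scopes do not share
```

Per-user scoping trades storage savings for privacy: identical uploads from different users are stored twice, but one user can no longer learn from the store's behavior whether another user already holds a file. A real system would also need durable counts and crash-safe freeing, which this sketch does not attempt.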

Key idea

Block deduplication stores each unique block once, keyed by a strong hash, and replaces duplicates with references tracked by a count.
