System Design

Deduplication of Blocks

Store each unique block once and reference it everywhere, collapsing redundant copies.

5 min read · advanced

What dedup does

Deduplication stores each unique block of data exactly once. When the same block appears again, whether in another file or another user's upload, the system stores only a reference to the existing block instead of new bytes. A reference count tracks how many files point at each block.
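The bookkeeping can be illustrated with a small sketch. It assumes fixed-size 4-byte blocks and two made-up files, and it uses the block bytes themselves as the key just to show the counting; a real store would not key on raw bytes.

```python
from collections import Counter


def split_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Hypothetical files, chosen so that some blocks repeat across them.
files = {
    "a.txt": b"AAAABBBBCCCC",
    "b.txt": b"AAAACCCCDDDD",
}

# Reference count per unique block: how many file positions point at it.
refcounts = Counter()
for data in files.values():
    refcounts.update(split_blocks(data, block_size=4))

logical_blocks = sum(refcounts.values())  # blocks the files refer to
stored_blocks = len(refcounts)            # blocks that actually need storing
```

Here the two files refer to six blocks in total, but only four distinct blocks need to be stored; the blocks `AAAA` and `CCCC` each carry a reference count of 2.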

How identity is decided

Each block is fingerprinted with a strong hash, and two blocks with the same hash are treated as identical. Before writing, the system checks whether that hash already exists; if it does, the system skips the write and just adds a reference. This makes the store content-addressed: a block's key is derived from its bytes.
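The check-before-write step might look like the following minimal sketch. It assumes SHA-256 as the strong hash and an in-memory dictionary as the store; the `BlockStore` class and `store_file` helper are hypothetical names used only for illustration.

```python
import hashlib


class BlockStore:
    """Toy in-memory content-addressed block store (illustrative only)."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> block bytes
        self.refcounts = {}  # fingerprint -> number of references

    def put(self, block: bytes) -> str:
        """Store the block if it is new and return its fingerprint."""
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in self.blocks:
            self.blocks[fingerprint] = block  # first copy: write the bytes
        # Every put adds one reference, whether or not bytes were written.
        self.refcounts[fingerprint] = self.refcounts.get(fingerprint, 0) + 1
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        return self.blocks[fingerprint]


def store_file(store: BlockStore, data: bytes, block_size: int = 4) -> list[str]:
    """Split data into fixed-size blocks and return the list of references."""
    return [
        store.put(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


store = BlockStore()
file_a = store_file(store, b"AAAABBBBCCCC")
file_b = store_file(store, b"AAAACCCCDDDD")

# A file is read back by following its references in order.
restored_a = b"".join(store.get(ref) for ref in file_a)
```

Each file ends up as a list of fingerprints rather than bytes, so the second file costs only one new block (`DDDD`) plus two extra references.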

The hard parts

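  • Garbage collection: a block can be freed only when its reference count reaches zero, so increments and decrements must stay correct under concurrent writes and deletes.
  • Hash collisions: vanishingly rare with a strong hash, but two different blocks with the same hash would be treated as one, and reads of the second would silently return the first block's bytes. The hash choice therefore matters, and some systems also compare the bytes when hashes match.
  • Privacy: cross-user dedup can reveal whether someone else has already stored a given file, so some systems scope dedup per user.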
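Two of these points, safe freeing and per-user scoping, can be sketched together. This is one possible approach under stated assumptions, not a prescribed design: a single lock guards all counter updates, and the user's identifier is mixed into the hash input so that identical blocks from different users get different keys. The class name `ScopedBlockStore`, its methods, and the NUL separator (which assumes scope names contain no NUL byte) are all hypothetical.

```python
import hashlib
import threading


class ScopedBlockStore:
    """Toy store with lock-protected reference counts and per-scope keys."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocks = {}     # key -> block bytes
        self._refcounts = {}  # key -> number of references

    @staticmethod
    def _key(scope: str, block: bytes) -> str:
        # Hash the scope together with the block so that the same bytes
        # stored by two different scopes never share a key.
        h = hashlib.sha256()
        h.update(scope.encode("utf-8"))
        h.update(b"\x00")  # separator; assumes scopes contain no NUL byte
        h.update(block)
        return h.hexdigest()

    def add_ref(self, scope: str, block: bytes) -> str:
        key = self._key(scope, block)
        with self._lock:  # lookup and increment happen atomically
            if key not in self._blocks:
                self._blocks[key] = block
            self._refcounts[key] = self._refcounts.get(key, 0) + 1
        return key

    def release(self, key: str) -> bool:
        """Drop one reference; return True if the block was freed."""
        with self._lock:  # decrement and free happen atomically
            remaining = self._refcounts[key] - 1
            if remaining > 0:
                self._refcounts[key] = remaining
                return False
            del self._refcounts[key]
            del self._blocks[key]
            return True

    def has(self, key: str) -> bool:
        with self._lock:
            return key in self._blocks


store = ScopedBlockStore()
k1 = store.add_ref("alice", b"report")
k2 = store.add_ref("alice", b"report")  # same user, same bytes: same key
k3 = store.add_ref("bob", b"report")    # other user: separate key and copy
```

Holding the lock across both the count update and the free means no writer can add a reference to a block in the middle of its removal. The price of scoping is visible in the example: Bob's identical block is stored a second time, trading some space savings for isolation.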

Key idea

Block deduplication stores each unique block once, keyed by a strong hash, and replaces duplicates with references tracked by a count.

Check yourself

1. How does dedup decide two blocks are identical?

2. When can a deduplicated block be safely freed?

3. What privacy concern does cross-user dedup raise?