Databases

Space Amplification

Space amplification measures how much more disk space a storage engine occupies than its live logical data actually needs.

What Space Amplification Is

Space amplification is the ratio of physical bytes on disk to the size of the live logical data. A value of 2 means the engine uses twice the disk space the current data requires; a value of 1 would mean no overhead at all.
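
As a minimal sketch of that definition, the ratio can be computed directly. The function name and the byte counts below are hypothetical and chosen only to reproduce the "value of two" case.

```python
def space_amplification(physical_bytes: int, live_bytes: int) -> float:
    """Ratio of bytes on disk to bytes of live logical data."""
    if live_bytes <= 0:
        raise ValueError("live_bytes must be positive")
    return physical_bytes / live_bytes


GIB = 1024 ** 3

# Hypothetical numbers: 200 GiB on disk holding 100 GiB of live data.
ratio = space_amplification(200 * GIB, 100 * GIB)
print(ratio)  # 2.0
```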

Where the Waste Comes From

  • Stale versions in an LSM tree linger in old SSTables until compaction removes them.
  • Tombstones mark deletions but still occupy space until they are merged away.
  • B-tree pages are rarely completely full, so partly empty pages waste space.
  • Fragmentation leaves gaps that hold no useful data.
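
The LSM items in the list above can be illustrated with a toy model. This is a sketch under stated assumptions, not how any real engine lays out data: SSTables are plain dicts ordered oldest to newest, `TOMBSTONE` is a hypothetical deletion marker, and an entry's size is just its key length plus its value length.

```python
TOMBSTONE = None  # hypothetical deletion marker for this sketch


def entry_size(key, value):
    # Toy size model: a tombstone still stores its key on disk.
    return len(key) + (0 if value is TOMBSTONE else len(value))


def physical_size(sstables):
    # Every entry in every table occupies disk, stale or not.
    return sum(entry_size(k, v) for t in sstables for k, v in t.items())


def live_view(sstables):
    # Tables are ordered oldest to newest, so newer entries shadow older ones.
    merged = {}
    for table in sstables:
        merged.update(table)
    return {k: v for k, v in merged.items() if v is not TOMBSTONE}


def live_size(sstables):
    return sum(entry_size(k, v) for k, v in live_view(sstables).items())


def compact(sstables):
    # Full compaction of all tables: stale versions and tombstones are dropped.
    return [live_view(sstables)]


sstables = [
    {"a": "1111", "b": "2222", "c": "3333"},  # oldest table
    {"a": "4444", "b": TOMBSTONE},            # newer: overwrites a, deletes b
]

before = physical_size(sstables) / live_size(sstables)
after = physical_size(compact(sstables)) / live_size(sstables)
print(before, after)  # 2.1 1.0
```

Before compaction the stale copy of `a`, the shadowed `b`, and the tombstone for `b` all count toward physical size, which is why the ratio sits above 2 even though only two keys are live.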

Why It Matters

Disk is cheaper than it used to be, but space amplification still increases storage cost, lengthens backups, and reduces how much useful data fits on a node. On large fleets it directly drives the number of machines needed.
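
A rough back-of-the-envelope sketch of the fleet-size effect follows. The dataset size, per-node capacity, and the helper name `nodes_needed` are all hypothetical, and the model ignores replication and headroom.

```python
import math


def nodes_needed(live_tb: float, space_amp: float, usable_tb_per_node: float) -> int:
    """Machines required to hold the physical footprint of a dataset."""
    return math.ceil(live_tb * space_amp / usable_tb_per_node)


# Hypothetical fleet: 500 TB of live data, 4 TB usable per node.
print(nodes_needed(500, 1.1, 4))  # 138
print(nodes_needed(500, 2.0, 4))  # 250
```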

The Tradeoff

  • Leveled compaction keeps space amplification low, often near 1.1, because it aggressively removes duplicates, at the cost of rewriting data more often.
  • Size-tiered compaction can briefly hold two full copies of the data during a merge, raising space amplification.

Choosing a strategy means deciding whether to spend disk, write bandwidth, or read time.
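
The transient cost of a merge can be sketched with simple arithmetic. The assumption here is that a merge writes its full output before its inputs are deleted, so peak usage is the resident data plus the merge output. All sizes are hypothetical, in GB.

```python
def peak_amplification(live: int, resident: int, merge_output: int) -> float:
    """Space amplification at the worst moment of a merge.

    Assumes the inputs stay on disk until the output is fully written.
    """
    return (resident + merge_output) / live


# Size-tiered: a merge of every table rewrites the whole 100 GB dataset.
print(peak_amplification(live=100, resident=100, merge_output=100))  # 2.0

# Leveled: 110 GB resident (about 1.1x), merging a 2 GB slice at a time.
print(peak_amplification(live=100, resident=110, merge_output=2))  # 1.12
```

The leveled case keeps the peak low because each merge touches only a small slice, but it runs many more merges, which is where the extra write bandwidth goes.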

Key idea

Space amplification is the extra disk a dataset occupies beyond its live size, caused by stale versions, tombstones, and partly empty pages.

Check yourself

1. What does space amplification measure?

2. Which factor adds to space amplification in an LSM tree?