Why Not Just B-Trees
B-trees update data in place, which means random disk writes. A log structured merge tree (LSM tree) instead buffers writes in memory and flushes them as sorted files, turning random writes into fast sequential ones.
How It Works
- New writes land in an in memory memtable, kept sorted.
- When the memtable fills it flushes to an immutable sorted file called an SSTable.
- A background compaction process merges SSTables and drops deleted keys.
- Deletes are recorded as a marker called a tombstone.
Reads And Their Cost
- A read may check the memtable then several SSTables, so reads can be slower.
- A bloom filter per file quickly tells the engine which files to skip.
- The cost of compaction and extra reads is the price for very fast writes.
Key idea
An LSM tree buffers writes in memory and flushes sorted immutable files, trading slower reads and background compaction for very high sequential write throughput.