Vivcre Learn learn it · write it · retain it

← All posts

Database #lsm#storage-engines#compaction

How LSM Trees Make Writes Cheap

24 Jun 2026

A B-tree updates data in place. To insert a row, the engine finds the right leaf page and writes there. When your keys arrive in random order — think UUIDv4 primary keys — consecutive inserts land on different pages scattered across the disk. Each one becomes a separate random write, and random I/O is roughly 10–100x slower than sequential I/O on a spinning disk and still meaningfully slower on SSDs because of write amplification. At a few thousand inserts per second, the storage layer becomes the bottleneck.

The Log-Structured Merge tree (LSM) attacks this by refusing to do random writes at all. It splits the problem into a fast in-memory part and a patient on-disk part.

The write path

  1. Every write first appends to a write-ahead log (WAL) — a pure sequential append, cheap and crash-safe.
  2. The write also goes into an in-memory sorted structure, the memtable (often a skip list or balanced tree). In RAM, keeping things sorted is free.
  3. When the memtable fills, it is flushed to disk in one shot as an immutable, sorted file called an SSTable. Because the memtable was already sorted, this flush is one large sequential write.

So no matter how randomly the keys arrive, they hit the disk only in big sorted batches. That is the whole trick: trade random writes for sequential ones.

The cost: reads and space

The price is paid on the read path. A key might live in the memtable, or in any of several SSTables on disk. A naive read would have to check all of them. Two mechanisms keep reads sane:

Compaction is what keeps read amplification and disk usage bounded — but it costs background I/O and CPU, and a poorly tuned compaction strategy can cause latency spikes. That is the fundamental LSM tradeoff: cheap writes now, paid for with background compaction and more complex reads later.

This is why write-heavy systems — Cassandra, RocksDB, LevelDB, ScyllaDB — are built on LSM trees, while read-heavy relational workloads often stay on B-trees.


Practise these questions →

Spaced-repetition MCQs for this post, on practise.vivcre.com.