A B-tree updates data in place. To insert a row, the engine finds the right leaf page and writes there. When your keys arrive in random order — think UUIDv4 primary keys — consecutive inserts land on different pages scattered across the disk. Each one becomes a separate random write, and random I/O is roughly 10–100x slower than sequential I/O on a spinning disk and still meaningfully slower on SSDs because of write amplification. At a few thousand inserts per second, the storage layer becomes the bottleneck.
The Log-Structured Merge tree (LSM) attacks this by refusing to do random writes at all. It splits the problem into a fast in-memory part and a patient on-disk part.
The write path
- Every write first appends to a write-ahead log (WAL) — a pure sequential append, cheap and crash-safe.
- The write also goes into an in-memory sorted structure, the memtable (often a skip list or balanced tree). In RAM, keeping things sorted is free.
- When the memtable fills, it is flushed to disk in one shot as an immutable, sorted file called an SSTable. Because the memtable was already sorted, this flush is one large sequential write.
So no matter how randomly the keys arrive, they hit the disk only in big sorted batches. That is the whole trick: trade random writes for sequential ones.
The cost: reads and space
The price is paid on the read path. A key might live in the memtable, or in any of several SSTables on disk. A naive read would have to check all of them. Two mechanisms keep reads sane:
- Bloom filters per SSTable cheaply answer “this key is definitely not here,” so most SSTables are skipped without a disk seek.
- Compaction runs in the background, merging overlapping SSTables into fewer, larger ones, discarding overwritten and deleted (tombstoned) keys along the way.
Compaction is what keeps read amplification and disk usage bounded — but it costs background I/O and CPU, and a poorly tuned compaction strategy can cause latency spikes. That is the fundamental LSM tradeoff: cheap writes now, paid for with background compaction and more complex reads later.
This is why write-heavy systems — Cassandra, RocksDB, LevelDB, ScyllaDB — are built on LSM trees, while read-heavy relational workloads often stay on B-trees.