LSM Trees vs B-Trees
Architecture
Two Philosophies of Storing Data on Disk
Almost every database you have used sits on top of one of two data structures. PostgreSQL, MySQL, SQLite, SQL Server, Oracle: all B-Trees (or the B+Tree variant). Cassandra, RocksDB, LevelDB, ScyllaDB, ClickHouse, HBase: all LSM trees. These are not minor implementation details. The choice between them shapes the performance characteristics, operational behavior, and failure modes of the entire system.
Understanding why these two approaches exist, and when each one wins, is one of the more useful things you can know about database internals.
B-Trees: The Incumbent
B-Trees have been around since 1970. The idea is straightforward: store data in fixed-size pages (typically 4KB, 8KB, or 16KB), organized as a balanced tree. Each internal page contains keys and pointers to child pages. Leaf pages contain the actual data (or pointers to heap storage, depending on the implementation).
Looking up a key means walking from the root page to a leaf page. With a branching factor of 500 (common for 8KB pages with small keys), a tree holding a billion keys is only 4 levels deep. That is 4 page reads for any lookup. If the root and first couple of levels are cached in memory (which they almost always are), real-world lookups hit 1-2 disk reads.
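As a sanity check on those numbers, here is the arithmetic spelled out (the branching factor and key count are the hypothetical figures from the paragraph above):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double branching_factor = 500;  // keys per 8KB internal page (assumed)
    const double total_keys = 1e9;        // one billion keys

    // Depth of the tree = page reads from root to leaf for a point lookup.
    double depth = std::ceil(std::log(total_keys) / std::log(branching_factor));
    std::printf("page reads per lookup: %.0f\n", depth);                      // 4

    // If the root and the next level stay cached, only the rest touch disk.
    std::printf("disk reads with top two levels cached: %.0f\n", depth - 2);  // 2
    return 0;
}
```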
Writing to a B-Tree means finding the right leaf page and updating it in place. Before modifying the page, the database writes the change to a write-ahead log (WAL) for crash safety. If the system crashes mid-write, the WAL allows replaying the change on recovery. After the WAL is durable, the actual page is updated. Pages occasionally split when they fill up, which requires updating the parent pointer, but splits are infrequent.
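A minimal sketch of that ordering, WAL append first, page overwrite second; the file names, record layout, and page handling are illustrative, and error handling is omitted:

```cpp
// Sketch of the B-Tree write path: log the change durably, then update the
// page in place. POSIX-only (fsync); file names and record layout are made up.
#include <cstdio>
#include <cstring>
#include <unistd.h>

constexpr size_t kPageSize = 8192;

void update_in_place(FILE* wal, FILE* pages, long page_no, size_t offset,
                     const char* value, size_t len) {
    // 1. Append the logical change to the write-ahead log and force it to disk.
    std::fprintf(wal, "page=%ld off=%zu len=%zu ", page_no, offset, len);
    std::fwrite(value, 1, len, wal);
    std::fflush(wal);
    fsync(fileno(wal));  // after this point, a crash is recoverable by WAL replay

    // 2. Read-modify-write the whole 8KB page, even for a tiny value change.
    char page[kPageSize];
    std::fseek(pages, page_no * kPageSize, SEEK_SET);
    std::fread(page, 1, kPageSize, pages);
    std::memcpy(page + offset, value, len);
    std::fseek(pages, page_no * kPageSize, SEEK_SET);
    std::fwrite(page, 1, kPageSize, pages);
}

int main() {
    FILE* wal = std::fopen("wal.log", "ab");
    FILE* pages = std::fopen("pages.db", "r+b");   // assumed to already exist
    if (!wal || !pages) return 1;
    update_in_place(wal, pages, /*page_no=*/42, /*offset=*/128, "new-value", 9);
    std::fclose(pages);
    std::fclose(wal);
    return 0;
}
```

In a real engine the modified page usually sits in the buffer pool and is written back later, but the WAL-before-page ordering is the essential part.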
Range scans are efficient because leaf pages are typically linked together in key order. Scanning from key 1000 to key 2000 means finding the leaf for key 1000 and then following next-page pointers until you pass key 2000. No seeking back to internal pages.
LSM Trees: The Write Optimizer
Log-Structured Merge Trees take a completely different approach. Nothing is ever updated in place. All writes go to an in-memory sorted buffer called a memtable (typically implemented as a skip list or red-black tree). When the memtable fills up (usually at 64-256MB), it is flushed to disk as an immutable sorted file called an SSTable (Sorted String Table).
Over time, multiple SSTables accumulate on disk. To find a key, you check the memtable first, then each SSTable from newest to oldest. The first match wins (because newer SSTables have more recent data).
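A toy version of that read path, with in-memory maps standing in for the memtable and the SSTable files (real SSTables are on-disk files with block indexes, which this skips):

```cpp
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy LSM: the memtable is a sorted map, each "SSTable" is a frozen sorted map.
struct ToyLSM {
    std::map<std::string, std::string> memtable;
    std::vector<std::map<std::string, std::string>> sstables;  // index 0 = newest

    void put(const std::string& key, const std::string& value) {
        memtable[key] = value;                  // all writes go to the memtable
    }

    void flush() {                              // memtable full -> immutable SSTable
        sstables.insert(sstables.begin(), memtable);
        memtable.clear();
    }

    std::optional<std::string> get(const std::string& key) const {
        // Check the memtable first, then SSTables from newest to oldest;
        // the first hit wins because newer data shadows older data.
        if (auto it = memtable.find(key); it != memtable.end()) return it->second;
        for (const auto& sst : sstables)
            if (auto it = sst.find(key); it != sst.end()) return it->second;
        return std::nullopt;
    }
};

int main() {
    ToyLSM db;
    db.put("user:1", "alice");
    db.flush();                                  // "user:1" now lives in an SSTable
    db.put("user:1", "alice-v2");                // newer value shadows the old one
    std::cout << db.get("user:1").value() << "\n";   // prints alice-v2
    return 0;
}
```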
Without optimization, reads get progressively slower as SSTables pile up. Two mechanisms keep reads fast:
Compaction merges multiple SSTables into fewer, larger ones. During compaction, duplicate keys are resolved (keeping the newest), deleted keys (marked with tombstones) are removed, and the output is a new sorted SSTable. This reduces the number of files that reads need to check.
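A sketch of what a single compaction pass does to its inputs, again with in-memory maps as stand-in SSTables; the tombstone sentinel is a made-up marker (real engines use a flag on the entry, and they only drop tombstones once no older version of the key can exist at deeper levels):

```cpp
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Made-up sentinel standing in for a tombstone (delete marker).
const std::string kTombstone = "__deleted__";

// Merge SSTables (index 0 = newest): keep the newest version of each key and
// drop keys whose surviving version is a tombstone.
std::map<std::string, std::string>
compact(const std::vector<std::map<std::string, std::string>>& inputs) {
    std::map<std::string, std::string> output;
    // Walk oldest -> newest so newer values overwrite older ones.
    for (auto it = inputs.rbegin(); it != inputs.rend(); ++it)
        for (const auto& [key, value] : *it)
            output[key] = value;
    for (auto it = output.begin(); it != output.end();)
        it = (it->second == kTombstone) ? output.erase(it) : std::next(it);
    return output;
}

int main() {
    std::vector<std::map<std::string, std::string>> sstables = {
        {{"a", "new"}, {"b", kTombstone}},            // newest SSTable
        {{"a", "old"}, {"b", "old"}, {"c", "x"}},     // older SSTable
    };
    for (const auto& [k, v] : compact(sstables))
        std::cout << k << " = " << v << "\n";          // a = new, c = x
    return 0;
}
```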
Bloom filters on each SSTable give a fast "definitely not here" answer. Before reading an SSTable, check its Bloom filter. If the filter says the key is absent, skip that SSTable entirely. With properly sized Bloom filters (10 bits per key gives ~1% false positive rate), most SSTables are skipped.
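The ~1% figure comes from the standard Bloom filter approximation; a quick check of it, with the number of hash functions chosen optimally for 10 bits per key:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double bits_per_key = 10.0;                        // m / n
    // Optimal number of hash functions: k = (m/n) * ln 2 (~7 for 10 bits/key).
    const double k = std::round(bits_per_key * std::log(2.0));
    // Standard false-positive approximation: (1 - e^{-k*n/m})^k.
    const double fpr = std::pow(1.0 - std::exp(-k / bits_per_key), k);
    std::printf("hash functions: %.0f, false positive rate: %.2f%%\n", k, fpr * 100);
    // ~0.8-1%: a lookup for an absent key skips ~99% of the SSTables it would
    // otherwise have to read.
    return 0;
}
```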
Write Amplification: The Hidden Cost
Write amplification is the ratio of bytes written to disk versus bytes of user data. It is the single most important metric for comparing these two approaches, and it is surprisingly large for both.
B-Tree write amplification. A small update (say, changing a 100-byte value) requires rewriting the entire page (8KB) plus the WAL entry (maybe 200 bytes). That is about 80x write amplification for a small update. If the page splits, you rewrite two pages plus update the parent. Databases like PostgreSQL soften this by holding dirty pages in the buffer pool, so repeated updates to a hot page coalesce into one write-back, but the page-granularity write is inherent.
LSM write amplification. The initial write goes to the WAL (1x) and memtable (0x, it is in memory). The memtable flush writes data to L0 (1x). Then compaction promotes data through levels. In RocksDB's default leveled compaction, each level is about 10x larger than the previous. Moving data from L0 to L1 rewrites it. Moving from L1 to L2 rewrites it again. With 5-7 levels total, data gets rewritten about 10-30x depending on update frequency and level size ratios.
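Rough arithmetic behind both figures, using the numbers quoted in the two paragraphs above (real amplification depends heavily on value sizes, update patterns, and level ratios):

```cpp
#include <cstdio>

int main() {
    // B-Tree: a 100-byte update rewrites the whole page plus a WAL record.
    const double user_bytes = 100, page = 8192, wal_record = 200;
    std::printf("B-Tree write amplification: ~%.0fx\n",
                (page + wal_record) / user_bytes);           // ~84x

    // LSM with leveled compaction: WAL + flush to L0, then at least one rewrite
    // per promotion through the levels (L1..L6 here).
    const double wal = 1, flush = 1, levels_below_l0 = 6;
    std::printf("LSM write amplification (lower bound): ~%.0fx\n",
                wal + flush + levels_below_l0);              // ~8x floor
    // In practice each promotion into a 10x-larger level overlaps many files
    // that also get rewritten, which is why the realistic range is 10-30x.
    return 0;
}
```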
Here is the counterintuitive part: despite having higher total write amplification, LSM trees often have better write throughput. Why? Because all LSM writes are sequential. The WAL is an append. The memtable flush writes a file sequentially. Compaction reads and writes files sequentially. B-Tree writes are random: update this page here, that page there. On spinning disks, sequential writes are 100x faster than random writes. On SSDs, the gap is smaller (maybe 4-10x) but still significant.
Read Amplification: Where B-Trees Fight Back
Read amplification measures how many disk reads a single query requires.
B-Tree reads. A point query reads 1-4 pages (root to leaf). With memory caching, the hot path is 1-2 disk reads. Range queries follow leaf pointers sequentially. Read amplification is low and predictable.
LSM reads. A point query might check the memtable (0 disk reads), then L0 SSTables (each one might need a check), then each subsequent level. Bloom filters help enormously: a well-tuned system skips most SSTables. But worst case, especially for keys that do not exist, the query checks a Bloom filter at every level. With 7 levels and a 1% false positive rate per level, about 7% of non-existent key lookups do a wasted disk read.
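The 7% figure is just the chance that at least one of the per-level filters returns a false positive:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double fpr_per_level = 0.01;   // 1% Bloom filter false positive rate
    const int levels = 7;
    // Probability that at least one of the 7 filters wrongly says "maybe here",
    // forcing a disk read for a key that does not exist anywhere.
    double wasted = 1.0 - std::pow(1.0 - fpr_per_level, levels);
    std::printf("wasted-read probability: %.1f%%\n", wasted * 100);  // ~6.8%
    return 0;
}
```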
Range queries on LSM trees are more complex. Each level might have relevant data, so the engine runs a merge sort across iterators from all levels. This is more CPU-intensive than following B-Tree leaf pointers, though the actual IO might be comparable if each level's data is clustered.
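A sketch of that merging scan: one sorted run per level, a min-heap over the runs' current heads, and key ties resolved in favor of the newer level (the level contents here are made up):

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Each level contributes one sorted run of (key, value) pairs; a lower level
// index means newer data. A min-heap over the heads of all runs yields keys in
// order; on duplicate keys, the entry from the newer level surfaces first.
using Run = std::vector<std::pair<std::string, std::string>>;

void merged_scan(const std::vector<Run>& levels) {
    // Heap entries: (key, level index, position within that level's run).
    using Entry = std::tuple<std::string, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<>> heap;
    for (size_t l = 0; l < levels.size(); ++l)
        if (!levels[l].empty()) heap.emplace(levels[l][0].first, l, 0);

    std::string last_emitted;
    bool emitted_any = false;
    while (!heap.empty()) {
        auto [key, l, i] = heap.top();
        heap.pop();
        if (!emitted_any || key != last_emitted) {   // newer level wins on ties
            std::cout << key << " = " << levels[l][i].second << "\n";
            last_emitted = key;
            emitted_any = true;
        }
        if (i + 1 < levels[l].size())
            heap.emplace(levels[l][i + 1].first, l, i + 1);
    }
}

int main() {
    merged_scan({
        {{"b", "b-new"}, {"d", "d1"}},               // L1 (newer)
        {{"a", "a2"}, {"b", "b-old"}, {"c", "c2"}},  // L2 (older)
    });
    // prints: a = a2, b = b-new, c = c2, d = d1
    return 0;
}
```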
The bottom line: B-Trees have fundamentally better read performance for point queries and simple range scans. LSM trees compensate with Bloom filters and caching, but the gap remains.
Space Amplification: The Third Axis
Space amplification is how much extra disk space the data structure uses beyond the raw data size.
B-Trees allocate fixed-size pages. Pages are rarely completely full (typically 60-75% after random inserts). So a dataset that is logically 100GB might occupy 130-160GB of B-Tree pages. But there is only one copy of each key.
LSM trees with leveled compaction maintain about 1.1x space amplification in steady state. The extra space comes from temporarily having both old and new SSTables during compaction. Once compaction finishes, the old SSTables are deleted.
LSM trees with tiered compaction can temporarily use up to 2x the data size during compaction, because multiple overlapping SSTables coexist at the same tier until they are merged.
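The same back-of-the-envelope treatment for space, using the fill factors and overheads quoted above:

```cpp
#include <cstdio>

int main() {
    const double logical_gb = 100;

    // B-Tree: pages average 60-75% full after random inserts.
    std::printf("B-Tree footprint: %.0f-%.0f GB\n",
                logical_gb / 0.75, logical_gb / 0.60);        // ~133-167 GB

    // LSM, leveled compaction: ~10% transient overlap during compaction.
    std::printf("LSM (leveled) footprint: ~%.0f GB\n", logical_gb * 1.1);

    // LSM, tiered compaction: up to 2x while overlapping SSTables await merging.
    std::printf("LSM (tiered) peak footprint: up to %.0f GB\n", logical_gb * 2.0);
    return 0;
}
```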
Compaction Strategies
The compaction strategy is the most impactful tuning knob for LSM-based systems.
Leveled compaction (RocksDB default, LevelDB). Each level has a size limit, typically 10x the previous level. When L1 exceeds its limit, an SSTable from L1 is merged with the overlapping SSTables in L2. This keeps each level sorted and non-overlapping (except L0, which is special). The result: excellent read performance (at most one SSTable per level needs checking), good space efficiency (about 1.1x), but the worst write amplification (10-30x).
Tiered compaction (Cassandra's default, the Size-Tiered Compaction Strategy). SSTables accumulate until there are enough similarly-sized ones to merge. Four 1GB SSTables merge into one 4GB SSTable. This drastically reduces write amplification (about 4-10x) because data is rewritten fewer times. The cost: read performance degrades because multiple overlapping SSTables exist at each size tier, and space amplification spikes during compaction.
FIFO compaction. Old SSTables are simply deleted when they expire. No merging, no rewriting. Write amplification is effectively 1x (WAL + flush, no compaction). Only works for time-series data with a TTL where old data is worthless.
Universal compaction (RocksDB). A hybrid that tries to balance write amplification and read performance dynamically. It merges sorted runs of SSTables based on their size ratios, adapting to the workload. More complex to reason about but can outperform both leveled and tiered for mixed workloads.
RocksDB: The LSM Engine That Powers Everything
RocksDB deserves special mention because it has become the de facto LSM engine. Built by Facebook as a fork of Google's LevelDB, it powers:
- CockroachDB (MVCC storage layer; later replaced by Pebble, a Go reimplementation of the same interface)
- TiKV/TiDB (storage engine)
- Apache Flink (state backend)
- Kafka Streams (state stores)
- MySQL's MyRocks engine (Facebook's MySQL fork)
- Apache Cassandra's experimental storage engine
RocksDB exposes an enormous number of tuning knobs: compaction strategy, level size multiplier, Bloom filter bits per key, block size, compression per level, rate limiting for compaction IO, and dozens more. The defaults are reasonable for general use, but production deployments at scale almost always require tuning.
One particularly useful RocksDB feature: per-level compression. L0 and L1 are uncompressed (they are small and accessed frequently), while deeper levels use Snappy, LZ4, or Zstd compression. Data at L5 might compress 2-3x, which reduces both disk usage and IO bandwidth at levels that are read less frequently.
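For concreteness, here is roughly what that looks like through RocksDB's C++ options API. Treat it as a hedged sketch rather than a recommendation: the specific level count, compression choices, and sizes are illustrative, and option details can shift between RocksDB versions.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.compaction_style = rocksdb::kCompactionStyleLevel;
    options.num_levels = 7;
    options.max_bytes_for_level_multiplier = 10;   // each level ~10x the previous
    options.write_buffer_size = 64 << 20;          // 64MB memtable

    // Per-level compression: hot, small levels stay uncompressed; deep,
    // rarely-read levels get stronger compression.
    options.compression_per_level = {
        rocksdb::kNoCompression,   // L0
        rocksdb::kNoCompression,   // L1
        rocksdb::kLZ4Compression,  // L2
        rocksdb::kLZ4Compression,  // L3
        rocksdb::kZSTD,            // L4
        rocksdb::kZSTD,            // L5
        rocksdb::kZSTD,            // L6
    };

    // Bloom filters at ~10 bits per key for ~1% false positives.
    rocksdb::BlockBasedTableOptions table_options;
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/example-db", &db);
    if (!status.ok()) return 1;
    delete db;
    return 0;
}
```

Building this requires linking against librocksdb (for example, -lrocksdb).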
When B-Trees Win
Read-heavy OLTP. The classic web application workload: 90% reads, 10% writes, random access patterns. B-Tree point queries are hard to beat. A well-cached PostgreSQL instance behind pgbouncer connection pooling can serve hundreds of thousands to millions of point reads per second.
Range scans. Analytics queries that scan large key ranges are faster on B-Trees because the leaf pages are sequential. LSM tree range scans require a merge sort across levels.
Predictable latency. B-Tree read latency is consistent: always 1-4 page reads. LSM read latency has a long tail because compaction can temporarily increase the number of SSTables, and background compaction IO can compete with foreground reads.
Frequent updates to existing keys. If your workload repeatedly updates the same keys, LSM write amplification gets worse because each update creates a new entry that must be compacted. B-Trees update the page in place with no additional copies.
When LSM Trees Win
Write-heavy workloads. Ingesting millions of events per second, time-series data, logging. The sequential write pattern of LSM trees saturates disk throughput in a way that B-Tree random writes cannot.
SSD-friendly access patterns. SSDs perform best with large sequential writes (which align with SSTable flushes) and poorly with small random writes (which B-Tree page updates create). LSM trees also cause less write amplification at the SSD firmware level because sequential writes reduce the SSD's internal garbage collection overhead.
Workloads dominated by inserts and deletes. LSM trees handle inserts and tombstone deletes efficiently since both are just appends. B-Trees need to find and modify the relevant page for each operation.
Compression. LSM SSTables are sorted and immutable, which makes them highly compressible. B-Tree pages, with their internal pointers and partially-full structure, compress less effectively.
The Hybrid Future
Some databases are blurring the line. WiredTiger (MongoDB's engine) is primarily a B-Tree engine but also ships an LSM mode. TiDB presents a conventional SQL-and-index layer on top of TiKV's RocksDB-based LSM storage, and CockroachDB layers its ordered, MVCC key-value model on an LSM engine (originally RocksDB, now its Go reimplementation, Pebble).
Bw-Trees (used in SQL Server's Hekaton and Azure Cosmos DB) try to get the best of both worlds: B-Tree-like structure with log-structured updates to avoid in-place page modifications. PebblesDB borrows skip-list-style guards to cut down compaction rewrites, and SLM-DB pairs a persistent B+-tree index with a single-level LSM toward the same goal of lower write amplification.
The point is not that one approach is universally better. It is that understanding the fundamental tradeoffs (write amplification vs. read amplification vs. space amplification) lets you reason about why your database behaves the way it does and what knobs to turn when performance degrades.
Key Points
- B-Trees store data in fixed-size pages organized as a balanced tree. Reads follow root-to-leaf in O(log N) page reads. Writes update pages in place with a write-ahead log for crash safety. PostgreSQL, MySQL InnoDB, and SQLite all use B-Trees
- LSM Trees buffer all writes in memory (memtable), flush to immutable sorted files (SSTables) when full, and merge SSTables in the background via compaction. RocksDB, LevelDB, Cassandra, and ScyllaDB all use LSM trees
- Write amplification in LSM trees comes from compaction rewriting data multiple times across levels. Leveled compaction in RocksDB rewrites data about 10-30x. B-Trees rewrite entire pages for small updates plus WAL entries
- Bloom filters on SSTables are what make LSM read performance tolerable. Without them, a point query might check SSTables across every level. With them, you skip levels that definitely do not contain the key
- The choice between them almost always comes down to workload: write-heavy and append-heavy favors LSM, read-heavy with random updates favors B-Trees
Common Mistakes
- ✗ Choosing an LSM-based engine for read-heavy OLTP with lots of random point queries. B-Trees will outperform because they do not need to check multiple levels
- ✗ Using default leveled compaction for a write-heavy workload without benchmarking tiered compaction. Leveled has the worst write amplification, and tiered can improve write throughput 3-5x
- ✗ Ignoring write amplification on SSDs. SSDs have a limited number of write cycles per cell. A 10x write amplification means your drive wears out 10x faster than the application-level write volume suggests
- ✗ Not monitoring compaction lag. When incoming writes outpace compaction, unmerged SSTables pile up, reads degrade because they check more files, and space usage balloons