RocksDB
The embedded LSM-tree engine that powers half the databases you already use
Anyone running CockroachDB, TiKV, Flink state, or Kafka Streams has already used RocksDB, possibly without knowing it. It started as Meta's fork of Google's LevelDB, heavily reworked for SSD workloads. It is not a database. It is a storage engine, an embedded library that other systems build on top of.
The write path is where RocksDB shines. Every write (Put, Delete, Merge) gets appended to the Write-Ahead Log (WAL) for durability, then inserted into the active MemTable, which is a concurrent skip list held in memory. When the MemTable hits write_buffer_size (default 64 MB, but tune this to 128-256 MB), it flips to an immutable MemTable and a fresh one takes over. A background thread flushes the immutable MemTable to disk as a sorted SSTable (Sorted String Table) at Level 0. The whole write path is sequential I/O. No random seeks. That is how it hits 500K-1M writes/sec on modern NVMe.
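A minimal sketch of that write path using the RocksJava API (the database path, key, and value are placeholders; the 256 MB buffer follows the tuning advice later in this page):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

public class WritePathSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                .setCreateIfMissing(true)
                // Larger MemTable means fewer flushes; 256 MB instead of the 64 MB default
                .setWriteBufferSize(256L * 1024 * 1024)
                // Let several immutable MemTables queue up while background flushes catch up
                .setMaxWriteBufferNumber(4);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-write-path");
             WriteOptions writeOpts = new WriteOptions().setSync(false)) {
            // Each put appends to the WAL for durability, then inserts into the active MemTable
            db.put(writeOpts, "user:42".getBytes(), "{\"score\":17}".getBytes());
        }
    }
}
```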
The read path is more involved. It checks, in order: active MemTable, immutable MemTable(s), then SSTables from Level 0 down to Level N. Each SSTable carries a Bloom filter (10 bits per key produces roughly 1% false positives) so the read path can skip files that definitely do not have the target key. The block cache (LRU or Clock-based) keeps hot data blocks in memory. For point lookups, after Bloom filter elimination, reads typically touch 1-2 SSTables. Not bad.
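A sketch of wiring up the two read-path helpers mentioned here, the Bloom filter and the block cache, assuming a recent RocksJava release (the 512 MB cache size is illustrative):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

class ReadPathOptions {
    static Options create() {
        BlockBasedTableConfig table = new BlockBasedTableConfig()
                .setFilterPolicy(new BloomFilter(10))             // 10 bits/key, ~1% false positives
                .setBlockCache(new LRUCache(512L * 1024 * 1024)); // keep hot data blocks in RAM
        return new Options()
                .setCreateIfMissing(true)
                .setTableFormatConfig(table);
    }
}
```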
Compaction is the aspect that demands the most attention. It merges SSTables across levels to keep things organized. Level Compaction (the default) maintains non-overlapping key ranges per level with a 10x size ratio between levels. Great for read-heavy workloads, but the cost is 10-30x write amplification. Universal Compaction (sometimes called Size-Tiered) merges similarly-sized files and writes less, but eats more temporary disk space. FIFO Compaction just drops the oldest SSTable. Only use this for TTL caches where losing old data is the whole point.
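Roughly how those three styles are selected through RocksJava, as a sketch (the multiplier shown is just the default 10x fan-out):

```java
import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;

class CompactionChoices {
    // Leveled: read-optimized, non-overlapping key ranges per level, ~10x size ratio
    static Options leveled() {
        return new Options().setCompactionStyle(CompactionStyle.LEVEL)
                            .setMaxBytesForLevelMultiplier(10);
    }
    // Universal: fewer rewrites (lower write amplification), more temporary disk space
    static Options universal() {
        return new Options().setCompactionStyle(CompactionStyle.UNIVERSAL);
    }
    // FIFO: oldest SSTables are simply dropped; only suitable for TTL-cache workloads
    static Options fifo() {
        return new Options().setCompactionStyle(CompactionStyle.FIFO);
    }
}
```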
As Flink's State Backend
Flink picks RocksDB as its production state backend whenever keyed state outgrows JVM heap. In something like a Top-K streaming pipeline, Flink operators hold state for Count-Min Sketch structures, Space-Saving heaps, window aggregations, potentially 50M+ unique keys. All of that lives in RocksDB instances embedded inside each Flink TaskManager.
The reason this works at all is incremental checkpointing. Flink periodically snapshots RocksDB state to durable storage (S3, HDFS). Rather than copying the entire database, incremental checkpoints only upload newly-flushed SSTables since the last snapshot. That means a few hundred MB instead of tens of GB. Checkpoint duration drops from minutes to seconds, which keeps a 60-second checkpoint interval practical even when sitting on terabytes of state.
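A sketch of enabling incremental checkpoints in a Flink job, assuming Flink 1.13+ and a placeholder S3 bucket:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // true = incremental checkpoints: only SSTables flushed since the last
        // snapshot are uploaded to durable storage
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
        env.enableCheckpointing(60_000); // 60-second checkpoint interval
        // ... define sources, keyed operators, sinks, then env.execute()
    }
}
```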
Key Flink-RocksDB tuning parameters:
- state.backend.rocksdb.memory.managed: Turn this on. It pools block cache and write buffers across all operators.
- state.backend.rocksdb.block.cache-size: 256 MB to 1 GB per TaskManager slot, depending on state size.
- state.backend.rocksdb.writebuffer.size: 64-128 MB per column family.
- state.backend.rocksdb.compaction.level.max-size-level-base: Match this to writebuffer.size for good level sizing.
- state.backend.rocksdb.use-bloom-filter: Enable it. 10 bits per key. There is no good reason to skip this.
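These usually live in flink-conf.yaml, but as a sketch they can also be set programmatically; the concrete sizes below are example values picked from the ranges above:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TunedEnvironment {
    public static StreamExecutionEnvironment create() {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.memory.managed", "true");
        conf.setString("state.backend.rocksdb.block.cache-size", "512mb");
        conf.setString("state.backend.rocksdb.writebuffer.size", "128mb");
        conf.setString("state.backend.rocksdb.compaction.level.max-size-level-base", "128mb");
        conf.setString("state.backend.rocksdb.use-bloom-filter", "true");
        return StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```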
Production Tuning Guide
Write-optimized configuration (streaming state, high-throughput ingestion):
- write_buffer_size: 256 MB (fewer flushes, less I/O churn)
- max_write_buffer_number: 4 (gives the MemTable pipeline room to absorb bursts)
- min_write_buffer_number_to_merge: 2 (merge MemTables before flushing to cut I/O)
- level0_file_num_compaction_trigger: 4
- max_bytes_for_level_base: 1 GB (4 * write_buffer_size)
- compression: Snappy for L0-L1, Zstd for L2+. Cold levels compress better and they are not read often.
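The same write-optimized settings expressed through RocksJava, as a sketch (the per-level compression list assumes the default 7-level tree):

```java
import java.util.Arrays;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;

class WriteOptimized {
    static Options create() {
        return new Options()
                .setWriteBufferSize(256L * 1024 * 1024)          // 256 MB MemTable
                .setMaxWriteBufferNumber(4)
                .setMinWriteBufferNumberToMerge(2)
                .setLevel0FileNumCompactionTrigger(4)
                .setMaxBytesForLevelBase(1024L * 1024 * 1024)    // 1 GB = 4 * write_buffer_size
                // Snappy on the hot top levels, Zstd on the cold lower ones (L0..L6)
                .setCompressionPerLevel(Arrays.asList(
                        CompressionType.SNAPPY_COMPRESSION, CompressionType.SNAPPY_COMPRESSION,
                        CompressionType.ZSTD_COMPRESSION, CompressionType.ZSTD_COMPRESSION,
                        CompressionType.ZSTD_COMPRESSION, CompressionType.ZSTD_COMPRESSION,
                        CompressionType.ZSTD_COMPRESSION));
    }
}
```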
Read-optimized configuration (point lookups, serving layer):
- bloom_bits_per_key: 10 (1% false positive rate)
- cache_index_and_filter_blocks: true (keep Bloom filters and index blocks in the block cache)
- pin_l0_filter_and_index_blocks_in_cache: true (L0 is always hot, do not let it get evicted)
- block_cache_size: as big as the budget allows. Ideally the working set fits in cache.
- optimize_filters_for_hits: true (skips the Bloom filter on the last level when most keys exist)
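The read-optimized knobs as a RocksJava sketch, assuming a recent RocksJava release (the 4 GB cache is a placeholder; size it to the working set):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

class ReadOptimized {
    static Options create() {
        BlockBasedTableConfig table = new BlockBasedTableConfig()
                .setFilterPolicy(new BloomFilter(10))                   // ~1% false positives
                .setCacheIndexAndFilterBlocks(true)                     // filters and index blocks live in the block cache
                .setPinL0FilterAndIndexBlocksInCache(true)              // never evict L0 metadata
                .setBlockCache(new LRUCache(4L * 1024 * 1024 * 1024));  // placeholder working-set size
        return new Options()
                .setTableFormatConfig(table)
                .setOptimizeFiltersForHits(true); // skip the Bloom check on the bottommost level
    }
}
```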
Write stall prevention:
- Monitor rocksdb.is-write-stopped and rocksdb.actual-delayed-write-rate. Not watching these leads to surprises.
- Set level0_slowdown_writes_trigger to 20 and level0_stop_writes_trigger to 36.
- Bump max_background_compactions to 2-4 for SSD, up to 8 for NVMe.
- Set a rate_limiter so compaction I/O does not starve foreground writes.
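A sketch of those stall-prevention settings in RocksJava; the 200 MB/s rate limit is an illustrative value, and newer RocksDB versions fold max_background_compactions into max_background_jobs:

```java
import org.rocksdb.Options;
import org.rocksdb.RateLimiter;

class StallPrevention {
    static Options create() {
        return new Options()
                .setLevel0SlowdownWritesTrigger(20)
                .setLevel0StopWritesTrigger(36)
                // deprecated in newer RocksDB in favor of max_background_jobs
                .setMaxBackgroundCompactions(4)
                // cap compaction I/O (~200 MB/s here) so it cannot starve foreground writes
                .setRateLimiter(new RateLimiter(200L * 1024 * 1024));
    }
}
```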
Capacity Planning
RocksDB uses more disk than the raw data. Budget for compaction overhead, Bloom filters, and in-memory structures.
| Component | Size Formula | Example (50M keys, 1 KB avg) |
|---|---|---|
| Raw data | keys * avg_value_size | ~50 GB |
| Compaction overhead | 10-15% (Level) or 50% (Universal) | 5-25 GB |
| Bloom filters | keys * bits_per_key / 8 | ~60 MB (10 bits) |
| Block cache (RAM) | Configurable | 1-4 GB |
| MemTable (RAM) | write_buffer_size * max_write_buffer_number | 1 GB |
For Flink with 50M keys at 1 KB each, plan on 80 GB of local NVMe per TaskManager slot, plus 2-4 GB of RAM for the block cache and write buffers. Checkpoint to S3 every 60 seconds. Incremental checkpoints upload only the delta, somewhere between 100 MB and 1 GB per checkpoint depending on write rate.
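The same arithmetic as a quick back-of-the-envelope calculation; values mirror the example column above:

```java
public class CapacitySketch {
    public static void main(String[] args) {
        long keys = 50_000_000L;
        long avgValueBytes = 1024;
        long rawData = keys * avgValueBytes;                 // ~50 GB
        long compaction = (long) (rawData * 0.15);           // leveled: 10-15% overhead
        long bloom = keys * 10 / 8;                          // 10 bits/key => ~60 MB
        long memTables = 256L * 1024 * 1024 * 4;             // write_buffer_size * max_write_buffer_number = 1 GB
        System.out.printf("disk ~= %d GB, bloom ~= %d MB, memtables ~= %d GB%n",
                (rawData + compaction) >> 30, bloom >> 20, memTables >> 30);
    }
}
```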
Failure Scenarios
Scenario 1: Write stall triggers a backpressure cascade in Flink. We saw this during a flash sale. Event rate doubled in under a minute. RocksDB's L0 file count blew past level0_slowdown_writes_trigger (default 20) because compaction could not keep up. RocksDB started throttling writes. Every Flink operator writing state to RocksDB slowed down, backpressure propagated upstream, and Kafka consumer lag jumped from seconds to minutes. Top-K results went stale. The fix: monitor rocksdb.num-files-at-level0 and Flink's outPoolUsage (backpressure indicator). Alert when L0 files exceed 15. Immediately bump max_background_compactions to 4-8 on NVMe. Raising level0_slowdown_writes_trigger to 40 works as a stopgap, but the real answer is more Flink parallelism. Spread state across more RocksDB instances so each one handles a lower write rate.
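A hedged sketch of what that monitoring could look like against a raw RocksDB handle. Inside Flink the instances are owned by the state backend, so in practice you would surface these through Flink's RocksDB native-metrics options rather than polling directly; treat this as illustrative:

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class StallMonitor {
    // Poll the native properties the scenario above calls out
    static void checkStallIndicators(RocksDB db) throws RocksDBException {
        int l0Files = Integer.parseInt(db.getProperty("rocksdb.num-files-at-level0"));
        boolean writeStopped = !"0".equals(db.getProperty("rocksdb.is-write-stopped"));
        long delayedRate = Long.parseLong(db.getProperty("rocksdb.actual-delayed-write-rate"));
        if (l0Files > 15 || writeStopped || delayedRate > 0) {
            System.err.printf("write stall risk: L0 files=%d stopped=%b delayed-rate=%d%n",
                    l0Files, writeStopped, delayedRate);
        }
    }
}
```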
Scenario 2: Checkpoint size balloons until S3 uploads start timing out. This one sneaks up gradually. Over weeks, RocksDB state grows because new keys keep arriving but old ones never get cleaned up (think: counts for items that dropped out of the Top-K long ago). Incremental checkpoints still work fine day to day. But when a full checkpoint is needed (after a rescale, or periodically for safety), it has to upload the entire state. 500 GB. The upload times out, Flink cancels it, and the latest recovery point starts falling behind. If the job crashes now, it restarts from a checkpoint that is 6 hours old and reprocesses millions of events. The answer is state TTL. Use StateTtlConfig to expire stale keys automatically. Set state.backend.rocksdb.timer-service.factory: rocksdb to get timers off the heap. And run a compaction filter to strip expired entries during compaction rather than waiting for reads to trigger cleanup. Monitor lastCheckpointDuration and lastCheckpointSize in Flink metrics. Alert when checkpoint duration exceeds 50% of the checkpoint interval.
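A sketch of wiring that TTL into a state descriptor using Flink's StateTtlConfig API; the 7-day TTL and descriptor name are placeholders:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

class TtlSetup {
    static ValueStateDescriptor<Long> itemCountDescriptor() {
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(7))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                // strip expired entries during RocksDB compaction rather than on read
                .cleanupInRocksdbCompactFilter(1000)
                .build();
        ValueStateDescriptor<Long> counts = new ValueStateDescriptor<>("item-count", Long.class);
        counts.enableTimeToLive(ttl);
        return counts;
    }
}
```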
Pros
- • Built for fast storage (SSD/NVMe) with write throughput that is hard to beat
- • Embeddable. Runs in-process, so no network hop, no serialization tax
- • Extremely tunable compaction, compression, and memory settings
- • Incremental checkpointing through hard links makes snapshots nearly free
- • Column families let you logically separate data inside one instance
Cons
- • Not a standalone database. You need to build access patterns on top of it
- • Read amplification is real. The LSM-tree structure means checking multiple levels
- • Tuning is its own discipline. Dozens of knobs, and they interact in ways that surprise you
- • Write amplification from compaction can hit 10-30x in the worst case
- • Space amplification during compaction means you need to provision extra disk headroom
When to use
- • You are building a system that needs an embedded storage engine (stream processor, distributed DB, etc.)
- • Write throughput is your primary bottleneck
- • Your data fits on a single node's local disk
- • You need fast point lookups and range scans over sorted keys
When NOT to use
- • You want a standalone database with SQL or some query language
- • You need distributed transactions across multiple nodes
- • Your workload is read-heavy with random access patterns (a B-tree will likely serve you better)
- • Your team has no experience tuning LSM-tree engines and no appetite to learn
Key Points
- • LSM-tree architecture turns random writes into sequential I/O. Writes land in an in-memory MemTable, then get flushed as sorted SSTables on disk. This is the core trick.
- • The Write-Ahead Log (WAL) provides durability. Every write hits the WAL before the MemTable, so the system can recover from crashes without losing data.
- • Bloom filters on each SSTable skip files that definitely do not contain the target key. At 10 bits per key, the false positive rate is roughly 1%, cutting reads from dozens of files down to 1-2.
- • Compaction merges SSTables across levels. Level Compaction caps space amplification at about 10%. Universal Compaction trades space for lower write amplification.
- • The block cache keeps hot data blocks in memory. Pair it with a compressed block cache to trade CPU cycles for memory savings.
- • Incremental checkpointing through hard links makes state snapshots nearly instant. This is what makes RocksDB practical for Flink's exactly-once guarantees.
Common Mistakes
- ✗ Running default settings in production. RocksDB ships conservative: write_buffer_size (64 MB), max_write_buffer_number (2), target_file_size_base (64 MB). These are too small for almost any real workload.
- ✗ Ignoring write stalls. When L0 files exceed the slowdown/stop triggers or pending compaction bytes cross the limits, RocksDB throttles or flat-out blocks writes. Monitor this closely.
- ✗ Letting compaction debt pile up. When background compaction cannot keep pace with ingestion, SSTable counts grow unchecked and read latency degrades because every read has to check more files.
- ✗ Setting Bloom filter bits per key too low. 10 bits/key gives about 1% false positives. Drop below 8 bits and false positives start hurting read performance noticeably.
- ✗ Mixing hot and cold data in one column family. Different access patterns in the same column family force unnecessary compaction of data nobody is reading.