RocksDB
The embedded LSM-tree engine that powers half the databases you already use
Anyone running CockroachDB, TiKV, Flink state, or Kafka Streams has already used RocksDB, possibly without knowing it. It started as Meta's fork of Google's LevelDB, heavily reworked for SSD workloads. It is not a database. It is a storage engine, an embedded library that other systems build on top of.
The write path is where RocksDB shines. Every write (Put, Delete, Merge) gets appended to the Write-Ahead Log (WAL) for durability, then inserted into the active MemTable, which is a concurrent skip list held in memory. When the MemTable hits write_buffer_size (default 64 MB, but tune this to 128-256 MB), it flips to an immutable MemTable and a fresh one takes over. A background thread flushes the immutable MemTable to disk as a sorted SSTable (Sorted String Table) at Level 0. The whole write path is sequential I/O. No random seeks. That is how it hits 500K-1M writes/sec on modern NVMe.
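A minimal sketch of that write path using the RocksJava API (the database path, key, and value are placeholders; the 256 MB buffer follows the tuning advice later in this page):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

public class WritePathSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                .setCreateIfMissing(true)
                // Larger MemTable means fewer flushes; 256 MB instead of the 64 MB default
                .setWriteBufferSize(256L * 1024 * 1024)
                // Let several immutable MemTables queue up while background flushes catch up
                .setMaxWriteBufferNumber(4);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-write-path");
             WriteOptions writeOpts = new WriteOptions().setSync(false)) {
            // Each put appends to the WAL for durability, then inserts into the active MemTable
            db.put(writeOpts, "user:42".getBytes(), "{\"score\":17}".getBytes());
        }
    }
}
```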
The read path is more involved. It checks, in order: active MemTable, immutable MemTable(s), then SSTables from Level 0 down to Level N. Each SSTable carries a Bloom filter (10 bits per key produces roughly 1% false positives) so the read path can skip files that definitely do not have the target key. The block cache (LRU or Clock-based) keeps hot data blocks in memory. For point lookups, after Bloom filter elimination, reads typically touch 1-2 SSTables. Not bad.
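A sketch of wiring up the two read-path helpers mentioned here, the Bloom filter and the block cache, assuming a recent RocksJava release (the 512 MB cache size is illustrative):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

class ReadPathOptions {
    static Options create() {
        BlockBasedTableConfig table = new BlockBasedTableConfig()
                .setFilterPolicy(new BloomFilter(10))             // 10 bits/key, ~1% false positives
                .setBlockCache(new LRUCache(512L * 1024 * 1024)); // keep hot data blocks in RAM
        return new Options()
                .setCreateIfMissing(true)
                .setTableFormatConfig(table);
    }
}
```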
Compaction is the aspect that demands the most attention. It merges SSTables across levels to keep things organized. Level Compaction (the default) maintains non-overlapping key ranges per level with a 10x size ratio between levels. Great for read-heavy workloads, but the cost is 10-30x write amplification. Universal Compaction (sometimes called Size-Tiered) merges similarly-sized files and writes less, but eats more temporary disk space. FIFO Compaction just drops the oldest SSTable. Only use this for TTL caches where losing old data is the whole point.
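Roughly how those three styles are selected through RocksJava, as a sketch (the multiplier shown is just the default 10x fan-out):

```java
import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;

class CompactionChoices {
    // Leveled: read-optimized, non-overlapping key ranges per level, ~10x size ratio
    static Options leveled() {
        return new Options().setCompactionStyle(CompactionStyle.LEVEL)
                            .setMaxBytesForLevelMultiplier(10);
    }
    // Universal: fewer rewrites (lower write amplification), more temporary disk space
    static Options universal() {
        return new Options().setCompactionStyle(CompactionStyle.UNIVERSAL);
    }
    // FIFO: oldest SSTables are simply dropped; only suitable for TTL-cache workloads
    static Options fifo() {
        return new Options().setCompactionStyle(CompactionStyle.FIFO);
    }
}
```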
As Flink's State Backend
Flink picks RocksDB as its production state backend whenever keyed state outgrows JVM heap. In something like a Top-K streaming pipeline, Flink operators hold state for Count-Min Sketch structures, Space-Saving heaps, window aggregations, potentially 50M+ unique keys. All of that lives in RocksDB instances embedded inside each Flink TaskManager.
The reason this works at all is incremental checkpointing. Flink periodically snapshots RocksDB state to durable storage (S3, HDFS). Rather than copying the entire database, incremental checkpoints only upload newly-flushed SSTables since the last snapshot. That means a few hundred MB instead of tens of GB. Checkpoint duration drops from minutes to seconds, which keeps a 60-second checkpoint interval practical even when sitting on terabytes of state.
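A sketch of enabling incremental checkpoints in a Flink job, assuming Flink 1.13+ and a placeholder S3 bucket:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // true = incremental checkpoints: only SSTables flushed since the last
        // snapshot are uploaded to durable storage
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
        env.enableCheckpointing(60_000); // 60-second checkpoint interval
        // ... define sources, keyed operators, sinks, then env.execute()
    }
}
```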
Key Flink-RocksDB tuning parameters:
- state.backend.rocksdb.memory.managed: Turn this on. It pools block cache and write buffers across all operators.
- state.backend.rocksdb.block.cache-size: 256 MB to 1 GB per TaskManager slot, depending on state size.
- state.backend.rocksdb.writebuffer.size: 64-128 MB per column family.
- state.backend.rocksdb.compaction.level.max-size-level-base: Match this to writebuffer.size for good level sizing.
- state.backend.rocksdb.use-bloom-filter: Enable it. 10 bits per key. There is no good reason to skip this.
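These usually live in flink-conf.yaml, but as a sketch they can also be set programmatically; the concrete sizes below are example values picked from the ranges above:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TunedEnvironment {
    public static StreamExecutionEnvironment create() {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.memory.managed", "true");
        conf.setString("state.backend.rocksdb.block.cache-size", "512mb");
        conf.setString("state.backend.rocksdb.writebuffer.size", "128mb");
        conf.setString("state.backend.rocksdb.compaction.level.max-size-level-base", "128mb");
        conf.setString("state.backend.rocksdb.use-bloom-filter", "true");
        return StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```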
Production Tuning Guide
Write-optimized configuration (streaming state, high-throughput ingestion):
- write_buffer_size: 256 MB (fewer flushes, less I/O churn)
- max_write_buffer_number: 4 (gives the MemTable pipeline room to absorb bursts)
- min_write_buffer_number_to_merge: 2 (merge MemTables before flushing to cut I/O)
- level0_file_num_compaction_trigger: 4
- max_bytes_for_level_base: 1 GB (4 * write_buffer_size)
- compression: Snappy for L0-L1, Zstd for L2+. Cold levels compress better and they are not read often.
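The same write-optimized settings expressed through RocksJava, as a sketch (the per-level compression list assumes the default 7-level tree):

```java
import java.util.Arrays;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;

class WriteOptimized {
    static Options create() {
        return new Options()
                .setWriteBufferSize(256L * 1024 * 1024)          // 256 MB MemTable
                .setMaxWriteBufferNumber(4)
                .setMinWriteBufferNumberToMerge(2)
                .setLevel0FileNumCompactionTrigger(4)
                .setMaxBytesForLevelBase(1024L * 1024 * 1024)    // 1 GB = 4 * write_buffer_size
                // Snappy on the hot top levels, Zstd on the cold lower ones (L0..L6)
                .setCompressionPerLevel(Arrays.asList(
                        CompressionType.SNAPPY_COMPRESSION, CompressionType.SNAPPY_COMPRESSION,
                        CompressionType.ZSTD_COMPRESSION, CompressionType.ZSTD_COMPRESSION,
                        CompressionType.ZSTD_COMPRESSION, CompressionType.ZSTD_COMPRESSION,
                        CompressionType.ZSTD_COMPRESSION));
    }
}
```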
Read-optimized configuration (point lookups, serving layer):
- bloom_bits_per_key: 10 (1% false positive rate)
- cache_index_and_filter_blocks: true (keep Bloom filters and index blocks in the block cache)
- pin_l0_filter_and_index_blocks_in_cache: true (L0 is always hot, do not let it get evicted)
- block_cache_size: as big as the budget allows. Ideally the working set fits in cache.
- optimize_filters_for_hits: true (skips the Bloom filter on the last level when most keys exist)
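The read-optimized knobs as a RocksJava sketch, assuming a recent RocksJava release (the 4 GB cache is a placeholder; size it to the working set):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

class ReadOptimized {
    static Options create() {
        BlockBasedTableConfig table = new BlockBasedTableConfig()
                .setFilterPolicy(new BloomFilter(10))                   // ~1% false positives
                .setCacheIndexAndFilterBlocks(true)                     // filters and index blocks live in the block cache
                .setPinL0FilterAndIndexBlocksInCache(true)              // never evict L0 metadata
                .setBlockCache(new LRUCache(4L * 1024 * 1024 * 1024));  // placeholder working-set size
        return new Options()
                .setTableFormatConfig(table)
                .setOptimizeFiltersForHits(true); // skip the Bloom check on the bottommost level
    }
}
```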
Write stall prevention:
- Monitor rocksdb.is-write-stopped and rocksdb.actual-delayed-write-rate. Not watching these leads to surprises.
- Set level0_slowdown_writes_trigger to 20 and level0_stop_writes_trigger to 36.
- Bump max_background_compactions to 2-4 for SSD, up to 8 for NVMe.
- Set a rate_limiter so compaction I/O does not starve foreground writes.
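A sketch of those stall-prevention settings in RocksJava; the 200 MB/s rate limit is an illustrative value, and newer RocksDB versions fold max_background_compactions into max_background_jobs:

```java
import org.rocksdb.Options;
import org.rocksdb.RateLimiter;

class StallPrevention {
    static Options create() {
        return new Options()
                .setLevel0SlowdownWritesTrigger(20)
                .setLevel0StopWritesTrigger(36)
                // deprecated in newer RocksDB in favor of max_background_jobs
                .setMaxBackgroundCompactions(4)
                // cap compaction I/O (~200 MB/s here) so it cannot starve foreground writes
                .setRateLimiter(new RateLimiter(200L * 1024 * 1024));
    }
}
```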
Capacity Planning
RocksDB uses more disk than the raw data. Budget for compaction overhead, Bloom filters, and in-memory structures.
| Component | Size Formula | Example (50M keys, 1 KB avg) |
|---|---|---|
| Raw data | keys * avg_value_size | ~50 GB |
| Compaction overhead | 10-15% (Level) or 50% (Universal) | 5-25 GB |
| Bloom filters | keys * bits_per_key / 8 | ~60 MB (10 bits) |
| Block cache (RAM) | Configurable | 1-4 GB |
| MemTable (RAM) | write_buffer_size * max_write_buffer_number | 1 GB |
For Flink with 50M keys at 1 KB each, plan on 80 GB of local NVMe per TaskManager slot, plus 2-4 GB of RAM for the block cache and write buffers. Checkpoint to S3 every 60 seconds. Incremental checkpoints upload only the delta, somewhere between 100 MB and 1 GB per checkpoint depending on write rate.
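The same arithmetic as a quick back-of-the-envelope calculation; values mirror the example column above:

```java
public class CapacitySketch {
    public static void main(String[] args) {
        long keys = 50_000_000L;
        long avgValueBytes = 1024;
        long rawData = keys * avgValueBytes;                 // ~50 GB
        long compaction = (long) (rawData * 0.15);           // leveled: 10-15% overhead
        long bloom = keys * 10 / 8;                          // 10 bits/key => ~60 MB
        long memTables = 256L * 1024 * 1024 * 4;             // write_buffer_size * max_write_buffer_number = 1 GB
        System.out.printf("disk ~= %d GB, bloom ~= %d MB, memtables ~= %d GB%n",
                (rawData + compaction) >> 30, bloom >> 20, memTables >> 30);
    }
}
```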
Failure Scenarios
Scenario 1: Write stall triggers a backpressure cascade in Flink. We saw this during a flash sale. Event rate doubled in under a minute. RocksDB's L0 file count blew past level0_slowdown_writes_trigger (default 20) because compaction could not keep up. RocksDB started throttling writes. Every Flink operator writing state to RocksDB slowed down, backpressure propagated upstream, and Kafka consumer lag jumped from seconds to minutes. Top-K results went stale. The fix: monitor rocksdb.num-files-at-level0 and Flink's outPoolUsage (backpressure indicator). Alert when L0 files exceed 15. Immediately bump max_background_compactions to 4-8 on NVMe. Raising level0_slowdown_writes_trigger to 40 works as a stopgap, but the real answer is more Flink parallelism. Spread state across more RocksDB instances so each one handles a lower write rate.
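A hedged sketch of what that monitoring could look like against a raw RocksDB handle. Inside Flink the instances are owned by the state backend, so in practice you would surface these through Flink's RocksDB native-metrics options rather than polling directly; treat this as illustrative:

```java
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class StallMonitor {
    // Poll the native properties the scenario above calls out
    static void checkStallIndicators(RocksDB db) throws RocksDBException {
        int l0Files = Integer.parseInt(db.getProperty("rocksdb.num-files-at-level0"));
        boolean writeStopped = !"0".equals(db.getProperty("rocksdb.is-write-stopped"));
        long delayedRate = Long.parseLong(db.getProperty("rocksdb.actual-delayed-write-rate"));
        if (l0Files > 15 || writeStopped || delayedRate > 0) {
            System.err.printf("write stall risk: L0 files=%d stopped=%b delayed-rate=%d%n",
                    l0Files, writeStopped, delayedRate);
        }
    }
}
```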
Scenario 2: Checkpoint size balloons until S3 uploads start timing out. This one sneaks up gradually. Over weeks, RocksDB state grows because new keys keep arriving but old ones never get cleaned up (think: counts for items that dropped out of the Top-K long ago). Incremental checkpoints still work fine day to day. But when a full checkpoint is needed (after a rescale, or periodically for safety), it has to upload the entire state. 500 GB. The upload times out, Flink cancels it, and the latest recovery point starts falling behind. If the job crashes now, it restarts from a checkpoint that is 6 hours old and reprocesses millions of events. The answer is state TTL. Use StateTtlConfig to expire stale keys automatically. Set state.backend.rocksdb.timer-service.factory: rocksdb to get timers off the heap. And run a compaction filter to strip expired entries during compaction rather than waiting for reads to trigger cleanup. Monitor lastCheckpointDuration and lastCheckpointSize in Flink metrics. Alert when checkpoint duration exceeds 50% of the checkpoint interval.
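A sketch of wiring that TTL into a state descriptor using Flink's StateTtlConfig API; the 7-day TTL and descriptor name are placeholders:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

class TtlSetup {
    static ValueStateDescriptor<Long> itemCountDescriptor() {
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(7))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                // strip expired entries during RocksDB compaction rather than on read
                .cleanupInRocksdbCompactFilter(1000)
                .build();
        ValueStateDescriptor<Long> counts = new ValueStateDescriptor<>("item-count", Long.class);
        counts.enableTimeToLive(ttl);
        return counts;
    }
}
```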
Pros
- • Built for fast storage (SSD/NVMe) with write throughput that is hard to beat
- • Embeddable. Runs in-process, so no network hop, no serialization tax
- • Extremely tunable compaction, compression, and memory settings
- • Incremental checkpointing through hard links makes snapshots nearly free
- • Column families let you logically separate data inside one instance
Cons
- • Not a standalone database. You need to build access patterns on top of it
- • Read amplification is real. The LSM-tree structure means checking multiple levels
- • Tuning is its own discipline. Dozens of knobs, and they interact in ways that surprise you
- • Write amplification from compaction can hit 10-30x in the worst case
- • Space amplification during compaction means you need to provision extra disk headroom
When to use
- • You are building a system that needs an embedded storage engine (stream processor, distributed DB, etc.)
- • Write throughput is your primary bottleneck
- • Your data fits on a single node's local disk
- • You need fast point lookups and range scans over sorted keys
When NOT to use
- • You want a standalone database with SQL or some query language
- • You need distributed transactions across multiple nodes
- • Your workload is read-heavy with random access patterns (a B-tree will likely serve you better)
- • Your team has no experience tuning LSM-tree engines and no appetite to learn
Key Points
- • LSM-tree architecture turns random writes into sequential I/O. Writes land in an in-memory MemTable, then get flushed as sorted SSTables on disk. This is the core trick.
- • The Write-Ahead Log (WAL) provides durability. Every write hits the WAL before the MemTable, so the system can recover from crashes without losing data.
- • Bloom filters on each SSTable skip files that definitely do not contain the target key. At 10 bits per key, the false positive rate is roughly 1%, cutting reads from dozens of files down to 1-2.
- • Compaction merges SSTables across levels. Level Compaction caps space amplification at about 10%. Universal Compaction trades space for lower write amplification.
- • The block cache keeps hot data blocks in memory. Pair it with a compressed block cache to trade CPU cycles for memory savings.
- • Incremental checkpointing through hard links makes state snapshots nearly instant. This is what makes RocksDB practical for Flink's exactly-once guarantees.
Common Mistakes
- ✗ Running default settings in production. RocksDB ships conservative: write_buffer_size (64 MB), max_write_buffer_number (2), target_file_size_base (64 MB). These are too small for almost any real workload.
- ✗ Ignoring write stalls. When L0 files exceed the slowdown/stop triggers or pending compaction bytes cross the limits, RocksDB throttles or flat-out blocks writes. Monitor this closely.
- ✗ Letting compaction debt pile up. When background compaction cannot keep pace with ingestion, SSTable counts grow unchecked and read latency degrades because every read has to check more files.
- ✗ Setting Bloom filter bits per key too low. 10 bits/key gives about 1% false positives. Drop below 8 bits and false positives start hurting read performance noticeably.
- ✗ Mixing hot and cold data in one column family. Different access patterns in the same column family force unnecessary compaction of data nobody is reading.