SSTable & Compaction Strategies
What Is an SSTable?
SSTable stands for Sorted String Table. Google introduced the term in the Bigtable paper (2006), though the concept of sorted, immutable files predates it.
An SSTable is a file on disk containing key-value pairs sorted by key. Once written, it is never modified. The file has three main sections:
Data blocks. The sorted key-value pairs, grouped into fixed-size blocks (typically 4-16 KB). Each block is independently compressed.
Index block. A sparse index mapping key prefixes to data block offsets. To find a key, binary search the index block to locate the right data block, then scan within the block.
Bloom filter block. A Bloom filter covering all keys in the SSTable. Before checking the index, query the Bloom filter. If it says the key is not here, skip this SSTable entirely. This avoids the cost of loading the index and data blocks for keys that are not present.
Optionally, SSTables include metadata blocks (compression dictionary, statistics), a footer with pointers to the other blocks, and data block checksums for integrity verification.
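To make the lookup path concrete, here is a minimal sketch. Everything in it is simplified for illustration: a plain set stands in for the Bloom filter, and data blocks are in-memory lists rather than compressed byte ranges on disk.

```python
from bisect import bisect_right

class SSTable:
    """Toy SSTable: check the Bloom filter, binary-search the sparse
    index to find the one candidate data block, then scan that block."""

    def __init__(self, index, blocks, bloom):
        self.index = index    # sorted list of (first_key, block_offset)
        self.blocks = blocks  # block_offset -> sorted list of (key, value)
        self.bloom = bloom    # a plain set stands in for a real Bloom filter

    def get(self, key):
        if key not in self.bloom:       # "definitely not here" -> skip all I/O
            return None
        first_keys = [k for k, _ in self.index]
        i = bisect_right(first_keys, key) - 1  # last block whose first key <= key
        if i < 0:
            return None
        _, offset = self.index[i]
        for k, v in self.blocks[offset]:       # linear scan within one small block
            if k == key:
                return v
        return None
```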
Why Immutability?
Mutable data files require careful concurrency control. If a reader is scanning a file while a writer is modifying it, you need locks or copy-on-write semantics. Recovery after a crash means replaying a log and checking for partially written records.
Immutable SSTables sidestep all of this. Writes go to an in-memory buffer (the memtable, usually a skip list or similar sorted data structure). When the memtable is full, it gets flushed to a new SSTable in one sequential write. Reads consult the memtable first, then SSTables in order from newest to oldest.
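The write and read paths fit in a few lines. The sketch below is a deliberately toy model (single-threaded, no write-ahead log, a plain dict as the memtable) meant only to show the flow: writes buffer in memory, a full memtable freezes into an immutable sorted table, and reads stop at the first hit so newer versions shadow older ones.

```python
class ToyLSM:
    def __init__(self, flush_threshold=1000):
        self.memtable = {}
        self.sstables = []  # newest last; each is an immutable sorted snapshot
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # one sequential write in a real engine; here, a sorted snapshot
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # freshest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable to oldest
            if key in table:
                return table[key]
        return None
```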
The trade-off: over time, you accumulate many SSTables, and a single key might have versions scattered across multiple files. Reading a key potentially requires checking every SSTable. This is where compaction comes in.
Size-Tiered Compaction (STCS)
STCS is the simplest compaction strategy and the default in Cassandra.
The idea: group SSTables into tiers by size. When enough SSTables of similar size accumulate (default: 4 in Cassandra), merge them into one larger SSTable. Small SSTables merge into medium ones, medium into large ones, and so on.
How It Works
- New SSTables arrive from memtable flushes. They are all roughly the same size (the memtable flush size).
- When 4 SSTables of similar size exist, compact them into one.
- The resulting SSTable is ~4x larger. It joins a higher tier.
- When 4 SSTables of that larger size accumulate, compact them.
- Repeat.
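A sketch of the bucketing logic, loosely modeled on Cassandra's behavior (a min_threshold of 4 and a similar-size band around each bucket's running average; the real implementation has more parameters and edge cases):

```python
def pick_stcs_bucket(sstable_sizes, min_threshold=4, low=0.5, high=1.5):
    """Group SSTables into buckets of similar size; return the first
    bucket with enough files to compact, or None if nothing qualifies."""
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:  # "similar size" band
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no fit: start a new bucket
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return None

# Four flush-sized files plus one big one: the four get merged, the big
# file waits in its own tier until it has similar-sized company.
print(pick_stcs_bucket([64, 60, 66, 62, 1024]))  # [60, 62, 64, 66]
```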
Trade-offs
Write-optimized. Each key is rewritten O(log n) times in total, once per tier it passes through. Write amplification is low compared to leveled compaction.
Space amplification. During compaction, the old and new SSTables coexist until compaction finishes and the old ones are deleted. Peak space usage can be up to 2x the logical data size. This is why Cassandra recommends keeping 50% free disk space with STCS.
Read penalty. Within a tier, SSTables can have overlapping key ranges. A read might need to check multiple SSTables in the same tier. Bloom filters mitigate this, but keeping filters for many SSTables costs memory.
Tombstone problem. A deleted key creates a tombstone in a new SSTable. The old version of the key lives in an older, larger SSTable. The tombstone only gets removed when a compaction merges those two SSTables, which might take a long time if the old SSTable is in a high tier.
STCS works well for write-heavy workloads where reads are infrequent or where Bloom filters effectively filter out misses.
Leveled Compaction (LCS)
LCS was introduced in LevelDB and refined in RocksDB. It organizes SSTables into numbered levels with strict invariants.
Structure
Level 0: SSTables from memtable flushes. Key ranges can overlap (same as STCS). This is the only level where overlap is allowed.
Level 1+: Each level has a size limit (level L has a limit of 10^L MB, configurable). Within each level, SSTables have non-overlapping key ranges. Any key exists in at most one SSTable per level.
Compaction Process
When Level L exceeds its size limit, pick one SSTable from Level L and merge it with all overlapping SSTables from Level L+1. Write the merged result back to Level L+1. Delete the source SSTables.
Because levels have non-overlapping key ranges (except L0), a read needs to check at most one SSTable per level, plus any overlapping files in L0. With 7 levels, a read checks on the order of 7 SSTables. Compare this to STCS, where a read might check dozens.
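A single compaction step is easy to sketch. The representation below is made up for illustration: each SSTable is reduced to its (min_key, max_key) range.

```python
def overlapping(source, next_level_files):
    """Given a source file from level L, collect every L+1 file whose key
    range intersects it; these get merged and the output replaces them."""
    src_min, src_max = source
    return [f for f in next_level_files
            if not (f[1] < src_min or f[0] > src_max)]

# A source file covering keys "f".."m" overlaps the middle two files of
# this L2 layout, so one L1->L2 step merges three files in total.
files_l2 = [("a", "d"), ("e", "h"), ("k", "p"), ("q", "z")]
print(overlapping(("f", "m"), files_l2))  # [('e', 'h'), ('k', 'p')]
```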
Trade-offs
Read-optimized. At most one SSTable per level contains any given key. Read amplification is O(number of levels), typically 5-7.
Space-efficient. No large temporary space spikes. Compaction merges one SSTable at a time with its overlapping neighbors. Peak space overhead is about 10% above the logical data size.
Write-expensive. Data flows from L0 to L1 to L2 and so on. Each level transition rewrites the data. Total write amplification is roughly 10-30x. For every byte you write, the system writes 10-30 bytes to disk over the lifetime of that data.
LCS works best for read-heavy workloads with frequent updates or deletes, where low read amplification and fast tombstone cleanup outweigh the write cost.
FIFO Compaction
FIFO compaction does not merge SSTables at all. It simply drops the oldest SSTables when total size exceeds a configured limit or when SSTables exceed a TTL.
This is purpose-built for time-series and logging workloads where old data has no value. Metrics older than 30 days? Delete the SSTable. Logs older than 7 days? Gone.
The advantages: zero write amplification from compaction (the only writes are memtable flushes), and predictable disk usage. The disadvantage: no key deduplication, no tombstone cleanup, and reads must check every SSTable. FIFO is only viable when queries are time-bounded (always include a time range that maps to a small number of SSTables).
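FIFO's entire policy fits in a few lines. The dict fields below (created, size) are hypothetical stand-ins for real file metadata:

```python
import time

def fifo_expire(sstables, max_total_bytes, ttl_seconds, now=None):
    """Drop whole SSTables: first anything past its TTL, then the oldest
    files until total size fits the budget. Nothing is ever merged."""
    now = time.time() if now is None else now
    live = [t for t in sstables if now - t["created"] <= ttl_seconds]
    live.sort(key=lambda t: t["created"])  # oldest first
    while live and sum(t["size"] for t in live) > max_total_bytes:
        live.pop(0)  # evict the oldest window of data wholesale
    return live
```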
Time-Window Compaction (TWCS)
Cassandra introduced TWCS as a hybrid for time-series data. SSTables are grouped by time window (e.g., one hour). Within each window, STCS applies. Once a window is closed (no more writes expected), its SSTables are compacted into a single file and left alone.
This avoids the STCS problem of compacting recent and historical data together. Old windows are static, and their SSTables are compact and efficient. The active window handles new writes efficiently with STCS.
Dropping old data is simple: delete the SSTables from expired time windows. No compaction needed, no tombstone overhead.
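A sketch of the window assignment, assuming one-hour windows. Cassandra's TWCS assigns each SSTable to a window based on the newest timestamp it contains; the boundary computation below mirrors that idea.

```python
import time

def window_for(max_timestamp, window_seconds=3600):
    # all SSTables whose newest data falls in the same hour share a window
    return max_timestamp - (max_timestamp % window_seconds)

now = int(time.time())
print(window_for(now))         # the active window: STCS applies here
print(window_for(now - 7200))  # a closed window: compacted once, then static
```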
Write Amplification: The Metric That Matters
Write amplification is the ratio of bytes written to disk versus bytes written by the application. If your app writes 1 GB and the storage engine writes 30 GB to disk (due to compaction), your write amplification is 30x.
| Strategy | Write Amplification | Read Amplification | Space Amplification |
|---|---|---|---|
| STCS | ~5-10x | High (many SSTables to check) | Up to 2x |
| LCS | ~10-30x | Low (1 SSTable per level) | ~1.1x |
| FIFO | 1x (no compaction) | Very high (all SSTables) | ~1x |
Write amplification directly impacts SSD lifespan (SSDs have limited write cycles) and disk throughput. On a system doing 100 MB/s of application writes, 30x write amplification means 3 GB/s of actual disk I/O. If your SSD does 2 GB/s, compaction cannot keep up and you have a problem.
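The arithmetic behind that last claim is a one-liner, and worth running with your own numbers (the figures below are the illustrative ones from this section):

```python
# Sustained device writes = application writes x write amplification.
app_write_mb_s = 100       # what the application writes
write_amp = 30             # e.g. LCS at the high end
device_budget_mb_s = 2000  # what the SSD can sustain

needed_mb_s = app_write_mb_s * write_amp
print(f"device writes needed: {needed_mb_s} MB/s")  # 3000 MB/s
print("compaction keeps up" if needed_mb_s <= device_budget_mb_s
      else "compaction falls behind and writes will eventually stall")
```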
RocksDB Compaction Tuning
RocksDB is the most configurable storage engine for compaction. Key knobs:
max_bytes_for_level_base: size of Level 1. Default: 256 MB. Larger values reduce the number of levels (fewer compactions) but increase space usage and L0 stall risk.
max_bytes_for_level_multiplier: ratio between level sizes. Default: 10. Level 2 is 10x Level 1, Level 3 is 10x Level 2. Lower values (like 8) give less write amplification but more levels.
level0_file_num_compaction_trigger: number of L0 files that triggers compaction to L1. Default: 4. If L0 accumulates too many files (because compaction is behind), reads slow down and eventually writes stall.
target_file_size_base: size of individual SSTables. Default: 64 MB. Larger files mean fewer files to manage but coarser compaction granularity.
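As a rough sketch of how these knobs fit together, here is how they might be set through the python-rocksdb bindings, which expose the same option names as the C++ API (an assumption worth checking against your binding's docs); the values are illustrative, not recommendations:

```python
import rocksdb  # assumes the python-rocksdb bindings

opts = rocksdb.Options()
opts.create_if_missing = True
opts.max_bytes_for_level_base = 512 * 1024 * 1024  # L1 limit: 512 MB
opts.max_bytes_for_level_multiplier = 10           # L2 = 5 GB, L3 = 50 GB, ...
opts.level0_file_num_compaction_trigger = 4        # start L0->L1 at 4 files
opts.target_file_size_base = 64 * 1024 * 1024      # individual SSTables: 64 MB

db = rocksdb.DB("example.db", opts)  # hypothetical path
```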
Getting these numbers right for your workload is an empirical exercise. RocksDB provides statistics (rocksdb.stats in the log) showing actual write amplification, compaction rates, and stall counts. Tune based on observed metrics, not theoretical expectations.
Compaction and Reads: The Bloom Filter Connection
Bloom filters are inseparable from SSTable-based storage. Without them, every read checks every SSTable's index block. With them, you skip SSTables that definitely do not contain the key.
RocksDB allocates a configurable number of bits per key for its Bloom filters. The default (10 bits/key, ~1% false positive rate) works for most workloads. For workloads with many point lookups and few hits (checking if a key exists), increasing to 14 bits/key (0.1% FPR) can significantly reduce I/O.
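Those figures follow from the standard Bloom filter formula: with the optimal number of hash functions, the false positive rate is approximately 0.6185 raised to the bits-per-key. A quick check reproduces them:

```python
# FPR with the optimal hash count: p = (1/2)^(ln 2 * bits_per_key),
# which simplifies to roughly 0.6185^bits_per_key.
def bloom_fpr(bits_per_key):
    return 0.6185 ** bits_per_key

for bits in (10, 14):
    print(f"{bits} bits/key -> ~{bloom_fpr(bits):.2%} false positives")
# 10 bits/key -> ~0.82%; 14 bits/key -> ~0.12%
```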
The interaction with compaction strategy matters. With LCS, you have fewer SSTables per level, so Bloom filters are checked less frequently. With STCS, you might have dozens of SSTables with overlapping ranges, each requiring a Bloom filter check. This is why STCS workloads tend to benefit more from generous Bloom filter sizing.
Key Points
- An SSTable (Sorted String Table) is an immutable, sorted file of key-value pairs with an index block for fast lookups. Immutability is the key design decision: writes never modify existing files, they create new ones. This makes writes fast and crash recovery simple.
- Compaction merges multiple SSTables into fewer, larger ones, discarding deleted keys (tombstones) and keeping only the latest version of each key. Without compaction, reads would need to check every SSTable, and disk space would grow without bound.
- Size-Tiered Compaction (STCS) groups SSTables by size and merges them when enough similar-sized files accumulate. Write-optimized, but it can temporarily use 2x disk space during compaction and creates wide variation in SSTable sizes.
- Leveled Compaction (LCS) organizes SSTables into levels with exponentially increasing size limits. Each level is sorted with no key overlap (except L0). Read-optimized because any key exists in at most one SSTable per level, but write amplification is higher.
- Choosing the wrong compaction strategy for your workload is one of the most common performance mistakes in Cassandra and RocksDB. Write-heavy workloads want STCS. Read-heavy workloads with updates want LCS. Time-series data often wants FIFO or a time-window strategy.
Common Mistakes
- Running leveled compaction on a write-heavy workload without enough disk I/O headroom. LCS rewrites data multiple times as it flows through levels (10-30x write amplification). On spinning disks or underpowered SSDs, compaction falls behind and the system chokes.
- Ignoring tombstone accumulation. Deleted keys leave tombstones that persist until compaction removes them. If compaction does not reach old SSTables, tombstones pile up and reads slow down because the system has to process deletes it cannot discard yet.
- Not reserving enough disk space for compaction. STCS needs up to 50% free space during a major compaction because it writes new SSTables before deleting old ones. Running at 80% disk utilization with STCS is a ticking time bomb.
- Using FIFO compaction for data you query by key. FIFO drops entire SSTables when they exceed a TTL or size limit. It does not merge or sort. If you need to look up individual keys in FIFO-compacted data, reads are slow because there is no compaction to consolidate keys.