SSTable & Compaction Strategies
What Is an SSTable?
SSTable stands for Sorted String Table. Google introduced the term in the Bigtable paper (2006), though the concept of sorted, immutable files predates it.
An SSTable is a file on disk containing key-value pairs sorted by key. Once written, it is never modified. The file has three main sections:
Data blocks. The sorted key-value pairs, grouped into fixed-size blocks (typically 4-16 KB). Each block is independently compressed.
Index block. A sparse index mapping key prefixes to data block offsets. To find a key, binary search the index block to locate the right data block, then scan within the block.
Bloom filter block. A Bloom filter covering all keys in the SSTable. Before checking the index, query the Bloom filter. If it says the key is not here, skip this SSTable entirely. This avoids the cost of loading the index and data blocks for keys that are not present.
Optionally, SSTables include metadata blocks (compression dictionary, statistics), a footer with pointers to the other blocks, and data block checksums for integrity verification.
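To make the lookup path concrete, here is a minimal sketch. Everything in it is simplified for illustration: a plain set stands in for the Bloom filter, and data blocks are in-memory lists rather than compressed byte ranges on disk.

```python
from bisect import bisect_right

class SSTable:
    """Toy SSTable: check the Bloom filter, binary-search the sparse
    index to find the one candidate data block, then scan that block."""

    def __init__(self, index, blocks, bloom):
        self.index = index    # sorted list of (first_key, block_offset)
        self.blocks = blocks  # block_offset -> sorted list of (key, value)
        self.bloom = bloom    # a plain set stands in for a real Bloom filter

    def get(self, key):
        if key not in self.bloom:       # "definitely not here" -> skip all I/O
            return None
        first_keys = [k for k, _ in self.index]
        i = bisect_right(first_keys, key) - 1  # last block whose first key <= key
        if i < 0:
            return None
        _, offset = self.index[i]
        for k, v in self.blocks[offset]:       # linear scan within one small block
            if k == key:
                return v
        return None
```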
Why Immutability?
Mutable data files require careful concurrency control. If a reader is scanning a file while a writer is modifying it, you need locks or copy-on-write semantics. Recovery after a crash means replaying a log and checking for partially written records.
Immutable SSTables sidestep all of this. Writes go to an in-memory buffer (the memtable, usually a skip list or similar sorted data structure). When the memtable is full, it gets flushed to a new SSTable in one sequential write. Reads consult the memtable first, then SSTables in order from newest to oldest.
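The write and read paths fit in a few lines. The sketch below is a deliberately toy model (single-threaded, no write-ahead log, a plain dict as the memtable) meant only to show the flow: writes buffer in memory, a full memtable freezes into an immutable sorted table, and reads stop at the first hit so newer versions shadow older ones.

```python
class ToyLSM:
    def __init__(self, flush_threshold=1000):
        self.memtable = {}
        self.sstables = []  # newest last; each is an immutable sorted snapshot
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # one sequential write in a real engine; here, a sorted snapshot
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # freshest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest SSTable to oldest
            if key in table:
                return table[key]
        return None
```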
The trade-off: over time, you accumulate many SSTables, and a single key might have versions scattered across multiple files. Reading a key potentially requires checking every SSTable. This is where compaction comes in.
Size-Tiered Compaction (STCS)
STCS is the simplest compaction strategy and the default in Cassandra.
The idea: group SSTables into tiers by size. When enough SSTables of similar size accumulate (default: 4 in Cassandra), merge them into one larger SSTable. Small SSTables merge into medium ones, medium into large ones, and so on.
How It Works
- New SSTables arrive from memtable flushes. They are all roughly the same size (the memtable flush size).
- When 4 SSTables of similar size exist, compact them into one.
- The resulting SSTable is ~4x larger. It joins a higher tier.
- When 4 SSTables of that larger size accumulate, compact them.
- Repeat.
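A sketch of the bucketing logic, loosely modeled on Cassandra's behavior (a min_threshold of 4 and a similar-size band around each bucket's running average; the real implementation has more parameters and edge cases):

```python
def pick_stcs_bucket(sstable_sizes, min_threshold=4, low=0.5, high=1.5):
    """Group SSTables into buckets of similar size; return the first
    bucket with enough files to compact, or None if nothing qualifies."""
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:  # "similar size" band
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no fit: start a new bucket
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return None

# Four flush-sized files plus one big one: the four get merged, the big
# file waits in its own tier until it has similar-sized company.
print(pick_stcs_bucket([64, 60, 66, 62, 1024]))  # [60, 62, 64, 66]
```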
Trade-offs
Write-optimized. Each key is rewritten O(log n) times in total, once per tier it passes through. Write amplification is low compared to leveled compaction.
Space amplification. During compaction, the old and new SSTables coexist until compaction finishes and the old ones are deleted. Peak space usage can be up to 2x the logical data size. This is why Cassandra recommends keeping 50% free disk space with STCS.
Read penalty. Within a tier, SSTables can have overlapping key ranges. A read might need to check multiple SSTables in the same tier. Bloom filters mitigate this, but keeping filters for many SSTables costs memory.
Tombstone problem. A deleted key creates a tombstone in a new SSTable. The old version of the key lives in an older, larger SSTable. The tombstone only gets removed when a compaction merges those two SSTables, which might take a long time if the old SSTable is in a high tier.
STCS works well for write-heavy workloads where reads are infrequent or where Bloom filters effectively filter out misses.
Leveled Compaction (LCS)
LCS was introduced in LevelDB and refined in RocksDB. It organizes SSTables into numbered levels with strict invariants.
Structure
Level 0: SSTables from memtable flushes. Key ranges can overlap (same as STCS). This is the only level where overlap is allowed.
Level 1+: Each level has a size limit (level L has a limit of 10^L MB, configurable). Within each level, SSTables have non-overlapping key ranges. Any key exists in at most one SSTable per level.
Compaction Process
When Level L exceeds its size limit, pick one SSTable from Level L and merge it with all overlapping SSTables from Level L+1. Write the merged result back to Level L+1. Delete the source SSTables.
Because levels have non-overlapping key ranges (except L0), a read needs to check at most one SSTable per level, plus any overlapping files in L0. With 7 levels, a read checks on the order of 7 SSTables. Compare this to STCS, where a read might check dozens.
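A single compaction step is easy to sketch. The representation below is made up for illustration: each SSTable is reduced to its (min_key, max_key) range.

```python
def overlapping(source, next_level_files):
    """Given a source file from level L, collect every L+1 file whose key
    range intersects it; these get merged and the output replaces them."""
    src_min, src_max = source
    return [f for f in next_level_files
            if not (f[1] < src_min or f[0] > src_max)]

# A source file covering keys "f".."m" overlaps the middle two files of
# this L2 layout, so one L1->L2 step merges three files in total.
files_l2 = [("a", "d"), ("e", "h"), ("k", "p"), ("q", "z")]
print(overlapping(("f", "m"), files_l2))  # [('e', 'h'), ('k', 'p')]
```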
Trade-offs
Read-optimized. At most one SSTable per level contains any given key. Read amplification is O(number of levels), typically 5-7.
Space-efficient. No large temporary space spikes. Compaction merges one SSTable at a time with its overlapping neighbors. Peak space overhead is about 10% above the logical data size.
Write-expensive. Data flows from L0 to L1 to L2 and so on. Each level transition rewrites the data. Total write amplification is roughly 10-30x. For every byte you write, the system writes 10-30 bytes to disk over the lifetime of that data.
LCS works best for read-heavy workloads with frequent updates or deletes, where low read amplification and fast tombstone cleanup outweigh the write cost.
FIFO Compaction
FIFO compaction does not merge SSTables at all. It simply drops the oldest SSTables when total size exceeds a configured limit or when SSTables exceed a TTL.
This is purpose-built for time-series and logging workloads where old data has no value. Metrics older than 30 days? Delete the SSTable. Logs older than 7 days? Gone.
The advantages: zero write amplification from compaction (the only writes are memtable flushes), and predictable disk usage. The disadvantage: no key deduplication, no tombstone cleanup, and reads must check every SSTable. FIFO is only viable when queries are time-bounded (always include a time range that maps to a small number of SSTables).
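FIFO's entire policy fits in a few lines. The dict fields below (created, size) are hypothetical stand-ins for real file metadata:

```python
import time

def fifo_expire(sstables, max_total_bytes, ttl_seconds, now=None):
    """Drop whole SSTables: first anything past its TTL, then the oldest
    files until total size fits the budget. Nothing is ever merged."""
    now = time.time() if now is None else now
    live = [t for t in sstables if now - t["created"] <= ttl_seconds]
    live.sort(key=lambda t: t["created"])  # oldest first
    while live and sum(t["size"] for t in live) > max_total_bytes:
        live.pop(0)  # evict the oldest window of data wholesale
    return live
```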
Time-Window Compaction (TWCS)
Cassandra introduced TWCS as a hybrid for time-series data. SSTables are grouped by time window (e.g., one hour). Within each window, STCS applies. Once a window is closed (no more writes expected), its SSTables are compacted into a single file and left alone.
This avoids the STCS problem of compacting recent and historical data together. Old windows are static, and their SSTables are compact and efficient. The active window handles new writes efficiently with STCS.
Dropping old data is simple: delete the SSTables from expired time windows. No compaction needed, no tombstone overhead.
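A sketch of the window assignment, assuming one-hour windows. Cassandra's TWCS assigns each SSTable to a window based on the newest timestamp it contains; the boundary computation below mirrors that idea.

```python
import time

def window_for(max_timestamp, window_seconds=3600):
    # all SSTables whose newest data falls in the same hour share a window
    return max_timestamp - (max_timestamp % window_seconds)

now = int(time.time())
print(window_for(now))         # the active window: STCS applies here
print(window_for(now - 7200))  # a closed window: compacted once, then static
```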
Write Amplification: The Metric That Matters
Write amplification is the ratio of bytes written to disk versus bytes written by the application. If your app writes 1 GB and the storage engine writes 30 GB to disk (due to compaction), your write amplification is 30x.
| Strategy | Write Amplification | Read Amplification | Space Amplification |
|---|---|---|---|
| STCS | ~5-10x | High (many SSTables to check) | Up to 2x |
| LCS | ~10-30x | Low (1 SSTable per level) | ~1.1x |
| FIFO | 1x (no compaction) | Very high (all SSTables) | ~1x |
Write amplification directly impacts SSD lifespan (SSDs have limited write cycles) and disk throughput. On a system doing 100 MB/s of application writes, 30x write amplification means 3 GB/s of actual disk I/O. If your SSD does 2 GB/s, compaction cannot keep up and you have a problem.
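The arithmetic behind that last claim is a one-liner, and worth running with your own numbers (the figures below are the illustrative ones from this section):

```python
# Sustained device writes = application writes x write amplification.
app_write_mb_s = 100       # what the application writes
write_amp = 30             # e.g. LCS at the high end
device_budget_mb_s = 2000  # what the SSD can sustain

needed_mb_s = app_write_mb_s * write_amp
print(f"device writes needed: {needed_mb_s} MB/s")  # 3000 MB/s
print("compaction keeps up" if needed_mb_s <= device_budget_mb_s
      else "compaction falls behind and writes will eventually stall")
```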
RocksDB Compaction Tuning
RocksDB is the most configurable storage engine for compaction. Key knobs:
max_bytes_for_level_base: size of Level 1. Default: 256 MB. Larger values reduce the number of levels (fewer compactions) but increase space usage and L0 stall risk.
max_bytes_for_level_multiplier: ratio between level sizes. Default: 10. Level 2 is 10x Level 1, Level 3 is 10x Level 2. Lower values (like 8) give less write amplification but more levels.
level0_file_num_compaction_trigger: number of L0 files that triggers compaction to L1. Default: 4. If L0 accumulates too many files (because compaction is behind), reads slow down and eventually writes stall.
target_file_size_base: size of individual SSTables. Default: 64 MB. Larger files mean fewer files to manage but coarser compaction granularity.
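As a rough sketch of how these knobs fit together, here is how they might be set through the python-rocksdb bindings, which expose the same option names as the C++ API (an assumption worth checking against your binding's docs); the values are illustrative, not recommendations:

```python
import rocksdb  # assumes the python-rocksdb bindings

opts = rocksdb.Options()
opts.create_if_missing = True
opts.max_bytes_for_level_base = 512 * 1024 * 1024  # L1 limit: 512 MB
opts.max_bytes_for_level_multiplier = 10           # L2 = 5 GB, L3 = 50 GB, ...
opts.level0_file_num_compaction_trigger = 4        # start L0->L1 at 4 files
opts.target_file_size_base = 64 * 1024 * 1024      # individual SSTables: 64 MB

db = rocksdb.DB("example.db", opts)  # hypothetical path
```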
Getting these numbers right for your workload is an empirical exercise. RocksDB provides statistics (rocksdb.stats in the log) showing actual write amplification, compaction rates, and stall counts. Tune based on observed metrics, not theoretical expectations.
Compaction and Reads: The Bloom Filter Connection
Bloom filters are inseparable from SSTable-based storage. Without them, every read checks every SSTable's index block. With them, you skip SSTables that definitely do not contain the key.
RocksDB allocates a configurable number of bits per key for its Bloom filters. The default (10 bits/key, ~1% false positive rate) works for most workloads. For workloads with many point lookups and few hits (checking if a key exists), increasing to 14 bits/key (0.1% FPR) can significantly reduce I/O.
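Those figures follow from the standard Bloom filter formula: with the optimal number of hash functions, the false positive rate is approximately 0.6185 raised to the bits-per-key. A quick check reproduces them:

```python
# FPR with the optimal hash count: p = (1/2)^(ln 2 * bits_per_key),
# which simplifies to roughly 0.6185^bits_per_key.
def bloom_fpr(bits_per_key):
    return 0.6185 ** bits_per_key

for bits in (10, 14):
    print(f"{bits} bits/key -> ~{bloom_fpr(bits):.2%} false positives")
# 10 bits/key -> ~0.82%; 14 bits/key -> ~0.12%
```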
The interaction with compaction strategy matters. With LCS, you have fewer SSTables per level, so Bloom filters are checked less frequently. With STCS, you might have dozens of SSTables with overlapping ranges, each requiring a Bloom filter check. This is why STCS workloads tend to benefit more from generous Bloom filter sizing.
Key Points
- An SSTable (Sorted String Table) is an immutable, sorted file of key-value pairs with an index block for fast lookups. Immutability is the key design decision: writes never modify existing files, they create new ones. This makes writes fast and crash recovery simple.
- Compaction merges multiple SSTables into fewer, larger ones, discarding deleted keys (tombstones) and keeping only the latest version of each key. Without compaction, reads would need to check every SSTable, and disk space would grow without bound.
- Size-Tiered Compaction (STCS) groups SSTables by size and merges them when enough similar-sized files accumulate. Write-optimized, but it can temporarily use 2x disk space during compaction and creates wide variation in SSTable sizes.
- Leveled Compaction (LCS) organizes SSTables into levels with exponentially increasing size limits. Each level is sorted with no key overlap (except L0). Read-optimized because any key exists in at most one SSTable per level, but write amplification is higher.
- Choosing the wrong compaction strategy for your workload is one of the most common performance mistakes in Cassandra and RocksDB. Write-heavy workloads want STCS. Read-heavy workloads with updates want LCS. Time-series data often wants FIFO or a time-window strategy.
Common Mistakes
- Running leveled compaction on a write-heavy workload without enough disk I/O headroom. LCS rewrites data multiple times as it flows through levels (10-30x write amplification). On spinning disks or underpowered SSDs, compaction falls behind and the system chokes.
- Ignoring tombstone accumulation. Deleted keys leave tombstones that persist until compaction removes them. If compaction does not reach old SSTables, tombstones pile up and reads slow down because the system has to process deletes it cannot discard yet.
- Not reserving enough disk space for compaction. STCS needs up to 50% free space during a major compaction because it writes new SSTables before deleting old ones. Running at 80% disk utilization with STCS is a ticking time bomb.
- Using FIFO compaction for data you query by key. FIFO drops entire SSTables when they exceed a TTL or size limit. It does not merge or sort. If you need to look up individual keys in FIFO-compacted data, reads are slow because there is no compaction to consolidate keys.