Cassandra
Distributed wide-column store built for punishing write loads
How It Works Internally
For workloads that demand a database that just will not stop accepting writes, Cassandra is probably the first name that comes up. The core insight: every node in the cluster is identical. There is no leader, no follower, no special coordinator. Any node can handle any request. That is what makes it so resilient, and also what makes it so different to operate.
Internally, Cassandra uses a ring with consistent hashing and virtual nodes (vnodes). Each physical node owns multiple token ranges (modern deployments use 16 vnodes per node, down from the old default of 256). When a write arrives, whatever node receives the connection becomes the coordinator. It hashes the partition key with Murmur3 to figure out which token range owns the data, then forwards the write to all replica nodes for that range. With a replication factor of 3, three nodes get the write.
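A toy sketch of that placement logic, with a made-up eight-token ring and MD5 standing in for Murmur3 (real clusters use a signed 64-bit Murmur3 token space and num_tokens vnodes per node):

```python
import bisect
import hashlib

RF = 3  # replication factor

# Hypothetical 4-node cluster; each physical node owns several tokens (vnodes).
# Real Cassandra assigns num_tokens random tokens per node; these are made up.
ring = sorted([
    (100, "node-a"), (220, "node-b"), (340, "node-c"), (460, "node-d"),
    (580, "node-a"), (700, "node-b"), (820, "node-c"), (940, "node-d"),
])
tokens = [t for t, _ in ring]

def token_for(partition_key: str) -> int:
    # Stand-in for Murmur3: any stable hash mapped onto the ring's range.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 1000

def replicas_for(partition_key: str) -> list[str]:
    """Walk clockwise from the key's token, collecting RF distinct nodes
    (SimpleStrategy-style placement; NetworkTopologyStrategy also considers
    racks and datacenters)."""
    start = bisect.bisect_right(tokens, token_for(partition_key)) % len(ring)
    owners: list[str] = []
    i = start
    while len(owners) < RF:
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

print(replicas_for("user:42"))  # e.g. ['node-c', 'node-d', 'node-a']
```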
The write path is where Cassandra really shines. The coordinator sends the mutation to replica nodes concurrently. Each replica appends to its commit log (sequential disk IO), then inserts into the in-memory MemTable (a sorted ConcurrentSkipListMap in Java). When the MemTable hits memtable_cleanup_threshold, it flushes to disk as an immutable SSTable (Sorted String Table). No reads required. No disk seeks. That is why a single Cassandra node can push 100K+ writes/sec without breaking a sweat.
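A stripped-down model of that path, with a tiny flush threshold and JSON lines standing in for the real commit log format; it captures the shape of the algorithm, not Cassandra's implementation:

```python
import json
import time

MEMTABLE_FLUSH_THRESHOLD = 4  # tiny for illustration; real flushes are size-based

class ToyNode:
    """Toy model of one replica's write path: append to the commit log,
    update the in-memory memtable, flush to an immutable sorted run when full."""
    def __init__(self):
        self.commit_log = open("commitlog.jsonl", "a")   # sequential appends only
        self.memtable: dict[str, tuple[int, str]] = {}   # key -> (write_ts, value)
        self.sstables: list[dict] = []                    # immutable, sorted snapshots

    def write(self, key: str, value: str) -> None:
        ts = time.time_ns()
        self.commit_log.write(json.dumps({"k": key, "ts": ts, "v": value}) + "\n")
        self.commit_log.flush()              # durability first
        self.memtable[key] = (ts, value)     # then the in-memory update
        if len(self.memtable) >= MEMTABLE_FLUSH_THRESHOLD:
            self._flush()

    def _flush(self) -> None:
        # Sort by key and freeze: this is the essence of an SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
```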
The read path is where things get interesting, and by interesting I mean painful if the data model is wrong. The coordinator sends read requests to replicas based on the consistency level. For QUORUM with RF=3, it contacts 2 replicas. Each replica checks the MemTable first, then searches SSTables using a Bloom filter (roughly 1% false-positive rate) to skip irrelevant files, a partition index to locate the partition, and a compression offset map to find the exact data block. If multiple SSTables contain data for the same partition key (common before compaction), results get merged by comparing write timestamps. Last-write-wins. Simple, but it means careful thought about concurrent updates is essential.
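Continuing the toy model, the read-side merge reduces to picking the newest timestamp among every sorted run that might hold the key (a plain membership test stands in for the Bloom filter and partition index lookups):

```python
def read(memtable: dict, sstables: list[dict], key: str):
    """Merge one key across the memtable and candidate SSTables;
    the entry with the newest write timestamp wins."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        if key in sstable:                # stand-in for Bloom filter + index checks
            candidates.append(sstable[key])
    if not candidates:
        return None
    ts, value = max(candidates)           # (timestamp, value) tuples compare by timestamp first
    return value
```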
Compaction runs in the background, merging SSTables. This is where the strategy choice matters a lot. STCS (Size-Tiered) groups similarly-sized SSTables and merges them. Good for write-heavy workloads, but it temporarily needs 50% extra disk space during compaction. LCS (Leveled) organizes SSTables into levels with non-overlapping key ranges. Reads become fast, but the cost is roughly 10x write amplification. TWCS (Time-Window) creates one SSTable per time window and never compacts across windows. Perfect for time-series data with no updates to old rows. Pick the wrong strategy and the pain is real. Pick the right one and Cassandra mostly stays out of the way.
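Choosing a strategy is a per-table CQL option. A sketch with the Python driver, using hypothetical keyspace and table names; the compaction options themselves are standard:

```python
from cassandra.cluster import Cluster

# Hypothetical keyspace/table names.
session = Cluster(["127.0.0.1"]).connect("metrics")

# Time-series table with no updates to old rows: TWCS, one window per day.
session.execute("""
    ALTER TABLE sensor_readings WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    }
""")

# Read-heavy table where latency matters more than write amplification: LCS.
session.execute("""
    ALTER TABLE user_profiles WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 160
    }
""")
```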
Production Architecture
Run at least 3 nodes per datacenter with RF=3 and NetworkTopologyStrategy for multi-DC awareness. For consistency on critical paths, use QUORUM or LOCAL_QUORUM. I would recommend LOCAL_QUORUM in most cases because it avoids cross-DC latency while still providing strong consistency within a single datacenter. Discord runs 177 Cassandra nodes storing trillions of messages and uses LOCAL_QUORUM for chat operations. That says something about what this setup can handle.
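A minimal setup sketch along those lines, assuming hypothetical contact points and datacenter names dc1/dc2, with LOCAL_QUORUM wired in as the driver's default consistency level:

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra import ConsistencyLevel

# LOCAL_QUORUM as the default: 2 of 3 replicas in the coordinator's own DC
# must acknowledge, so requests never wait on cross-DC round trips.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cluster = Cluster(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()

# RF=3 in each datacenter via NetworkTopologyStrategy.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS chat
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
    }
""")
```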
Set concurrent_compactors to match the number of disks, not CPU cores. Keep compaction_throughput_mb_per_sec between 64 and 128 MB/s so compaction does not starve foreground reads. Put the commit log and data directories on separate disks to avoid IO contention. If the cluster is over 50 nodes, switch to incremental repairs instead of full repairs. Full repairs at that scale will ruin someone's weekend.
What to monitor: Read Latency p99, Write Latency p99, Pending Compactions, SSTable Count per Read, Tombstone Scans per Read, Heap Usage, and GC Pause Duration. Page someone when pending compactions exceed 20 or p99 read latency crosses 100ms. Both of those are signs that compaction is falling behind, and things will get worse before they get better.
Capacity Planning
A single node handles 1-3TB comfortably on modern SSDs. Go past 3TB per node and compaction plus repair times start to hurt. Here is the math for a 30TB dataset with RF=3: that means 90TB of raw storage. STCS requires 50% temporary overhead during compaction, so provision 135TB. That works out to about 45 nodes at 3TB each, or 90 nodes at 1.5TB each for more operational breathing room. I lean toward more nodes with less data each. Repairs finish faster and node failures are less disruptive.
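The same arithmetic as a few lines of Python, using the numbers above rather than universal constants:

```python
# Sizing sketch: dataset size, RF, and per-node targets are the figures above.
dataset_tb = 30
rf = 3
stcs_overhead = 1.5          # STCS can need ~50% extra space during compaction

raw_tb = dataset_tb * rf                 # 90 TB of replicated data
provisioned_tb = raw_tb * stcs_overhead  # 135 TB including compaction headroom

for per_node_tb in (3, 1.5):
    nodes = provisioned_tb / per_node_tb
    print(f"{per_node_tb} TB/node -> {nodes:.0f} nodes")
# 3 TB/node -> 45 nodes
# 1.5 TB/node -> 90 nodes
```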
Throughput per node: expect 10,000-30,000 reads/sec at p99 < 10ms for a well-designed data model, and 50,000-100,000 writes/sec. Plan for 50% headroom on both CPU and disk. That buffer is needed to handle compaction spikes and the temporary load increase when a node dies and surviving nodes pick up its traffic. At 80% capacity on a good day, a single node failure will cascade.
On vnodes: use 16 per node. The old default of 256 causes streaming during bootstrap and repair to take forever. Yelp cut repair time on a 30-node cluster from 12 hours to 45 minutes just by dropping from 256 to 16. That is not a minor improvement.
Failure Scenarios
Scenario 1: Tombstone accumulation causes TombstoneOverwhelmingException. I have seen this one take down production reads more than once. An application deletes rows frequently, and each delete creates a tombstone. Over time, reads have to scan through thousands of tombstones before finding live data. Queries start timing out. Eventually, once the count blows past the default tombstone_failure_threshold of 100,000, Cassandra throws TombstoneOverwhelmingException and just stops serving reads for that table. It is exactly as bad as it sounds. To catch it early, monitor TombstoneScannedHistogram per table and alert when p99 exceeds 1,000. To fix it: run manual compaction on the affected tables as a band-aid, then redesign the data model to use TTLs instead of explicit deletes, or switch to TWCS with appropriate time windows. The real lesson is that Cassandra was not built for delete-heavy workloads. If the application deletes more than it writes, reconsider whether Cassandra is the right tool.
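A sketch of the TTL-plus-TWCS fix, with hypothetical keyspace and table names; the table options are standard CQL:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("events")

# Rows expire on their own instead of being deleted explicitly, and TWCS can
# drop whole expired SSTables rather than scanning tombstones at read time.
session.execute("""
    CREATE TABLE IF NOT EXISTS notifications (
        user_id bigint,
        created_at timeuuid,
        payload text,
        PRIMARY KEY (user_id, created_at)
    ) WITH default_time_to_live = 604800          -- rows expire after 7 days
      AND compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
      }
""")

# A per-row TTL also works; either way there is no explicit DELETE.
session.execute(
    "INSERT INTO notifications (user_id, created_at, payload) "
    "VALUES (%s, now(), %s) USING TTL 86400",
    (42, "order shipped"),
)
```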
Scenario 2: Hot partition causes node instability. One partition key (say, a viral celebrity's activity feed) gets hammered with traffic. The node owning that token range maxes out on CPU, GC pressure climbs, latencies spike. Then the coordinator starts retrying on other replicas, spreading the pain. The result is cascading degradation across the ring. To detect it, monitor per-partition latency via local_read_latency and identify oversized partitions with nodetool tablehistograms. The fix is to redesign the partition key with a bucketing suffix (for example, user_id:bucket_N) so the hot partition spreads across multiple token ranges. This is one of those problems that is obvious in hindsight but brutal to debug at 2 AM while staring at a melting cluster. Plan for it upfront if any access patterns have skewed distributions.
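One way the bucketing fix can look, assuming a hypothetical feeds.activity table and 16 buckets: writes spread across buckets, reads fan out and merge client-side.

```python
import random
from cassandra.cluster import Cluster

BUCKETS = 16  # hypothetical; size it to the write rate of the hottest key

session = Cluster(["127.0.0.1"]).connect("feeds")

# Partition key is (user_id, bucket), so one celebrity's feed is spread over
# 16 token ranges instead of hammering a single replica set.
session.execute("""
    CREATE TABLE IF NOT EXISTS activity (
        user_id bigint,
        bucket int,
        event_id timeuuid,
        body text,
        PRIMARY KEY ((user_id, bucket), event_id)
    ) WITH CLUSTERING ORDER BY (event_id DESC)
""")

def write_event(user_id: int, body: str) -> None:
    bucket = random.randrange(BUCKETS)   # spread writes; any even spreader works
    session.execute(
        "INSERT INTO activity (user_id, bucket, event_id, body) VALUES (%s, %s, now(), %s)",
        (user_id, bucket, body),
    )

def read_feed(user_id: int, limit: int = 50) -> list:
    # Reads fan out across all buckets and merge client-side; that is the cost
    # of the trick, so only bucket keys that actually run hot.
    rows = []
    for bucket in range(BUCKETS):
        rows.extend(session.execute(
            "SELECT event_id, body FROM activity WHERE user_id=%s AND bucket=%s LIMIT %s",
            (user_id, bucket, limit),
        ))
    rows.sort(key=lambda r: r.event_id.time, reverse=True)
    return rows[:limit]
```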
Pros
- • Linear horizontal scalability
- • No single point of failure (peer-to-peer)
- • Tunable consistency levels per query
- • Optimized for high write throughput
- • Multi-datacenter replication built-in
Cons
- • Limited query flexibility (no joins, no aggregations)
- • Data modeling driven by query patterns
- • Eventual consistency by default
- • Operational complexity (compaction, tombstones, repairs)
- • Read performance depends heavily on data model
When to use
- • Need to handle millions of writes per second
- • Data is naturally partitioned (time-series, per-user)
- • Require multi-region active-active deployment
- • Availability matters more than strong consistency
When NOT to use
- • Need complex queries with joins
- • Dataset is small (< 10 GB)
- • Require strong consistency for every read
- • Ad-hoc querying is a primary use case
Key Points
- • Leaderless ring architecture with consistent hashing. Every node is equal, any node can coordinate any request, so there is no single point of failure
- • Tunable consistency per query (ONE, QUORUM, ALL). QUORUM reads + QUORUM writes deliver strong consistency with RF=3
- • Partition key design controls data distribution and query speed. A bad partition key creates hot spots and bloated partitions that drag down the whole cluster
- • SSTable compaction strategies (STCS, LCS, TWCS) make very different trade-offs around write amplification, read performance, and disk overhead
- • Lightweight transactions (LWT) run Paxos consensus and need 4 round trips. They are 10-20x slower than regular writes, so use them sparingly (see the sketch after this list)
- • Anti-entropy repair is not optional. Skipping full repairs within gc_grace_seconds (default 10 days) means deleted data can come back from unreplicated tombstones
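To make the consistency and LWT points concrete, a sketch with the DataStax Python driver against a hypothetical users table:

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("accounts")

# Plain write at QUORUM: 2 of 3 replicas must acknowledge with RF=3.
write = SimpleStatement(
    "INSERT INTO users (user_id, email) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, "a@example.com"))

# Read at ONE: fastest and weakest; fine for non-critical paths.
read = SimpleStatement(
    "SELECT email FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read, (42,)).one()

# LWT: Paxos round trips make this far slower than the plain INSERT above,
# so reserve it for genuine uniqueness or compare-and-set requirements.
result = session.execute(
    "INSERT INTO users (user_id, email) VALUES (%s, %s) IF NOT EXISTS",
    (43, "b@example.com"),
)
print(result.was_applied)   # False if the row already existed
```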
Common Mistakes
- ✗ Modeling data like a relational database. Cassandra needs query-driven design: denormalize and create one table per query pattern
- ✗ Tombstone pile-up from frequent deletes. Cassandra marks deletions as tombstones that stack up until compaction, causing read amplification and TombstoneOverwhelmingException
- ✗ Wide partitions over 100MB. These cause heap pressure, GC pauses, and compaction bottlenecks. Netflix caps partitions at 100K cells for this reason
- ✗ Using LWT (IF NOT EXISTS/IF conditions) on hot paths. Paxos consensus makes these 10-20x slower and contention causes timeouts under load
- ✗ Ignoring compaction strategy trade-offs. STCS eats 50% temp disk space, LCS causes high write amplification, TWCS only works for time-series with no updates