ScyllaDB
A shard-per-core NoSQL database built for low latency at serious scale
How It Works Internally
Anyone who has run Cassandra in production knows the pain: GC pauses that show up at 3 AM, compaction storms that tank read latency, and a thread model that turns multi-core machines into expensive space heaters. ScyllaDB exists because someone looked at all of that and said, "What if we just rewrote the whole thing in C++ and threw away the JVM?"
The core idea is shard-per-core. Each CPU core runs one independent shard with its own memory allocator, commit log, memtable, cache, and SSTable set. Shards talk to each other only through explicit message passing. No locks. No shared mutable state. No cross-core cache line bouncing. This single design choice wipes out the JVM thread contention and stop-the-world GC pauses that cause most of Cassandra's tail-latency nightmares.
The write path looks a lot like Cassandra on the surface. The coordinator hashes the partition key (Murmur3) to find the replica owners, then forwards the mutation according to the configured consistency level. Each replica's shard appends to its per-core commit log (a sequential NVMe write), inserts into the memtable, and returns. The important difference: because each shard owns its I/O independently, writes to different partition ranges run truly in parallel with zero contention. Discord migrated their Cassandra clusters (trillions of messages) to ScyllaDB and cut their node count from 177 to 72. Their p99 read latency dropped from 40-125ms to 5-15ms. That is not a typo.
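From the client side, that write path is easy to see with the Python driver (cassandra-driver and its shard-aware scylla-driver fork both work against ScyllaDB; the node addresses, keyspace, table, and columns below are hypothetical):

```python
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token-aware routing hashes the partition key client-side (Murmur3) and sends
# the mutation straight to a replica that owns it, skipping an extra hop.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
cluster = Cluster(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("notifications")

# Prepared statements let the driver compute the partition token locally.
insert = session.prepare(
    "INSERT INTO delivery_log (tenant_id, notification_id, delivered_at, user_id, status) "
    "VALUES (?, ?, ?, ?, ?)"
)
session.execute(insert, ("tenant-42", "notif-123",
                         datetime.now(timezone.utc), "user-7", "sent"))
```

The shard-aware driver fork goes one step further: it opens a connection per shard and routes each request to the exact core that owns the partition, not just the owning node.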
The read path uses the same SSTable format and Bloom filter approach as Cassandra, but adds a row-level cache that lives in shard-local memory instead of on the JVM heap. Cache eviction is per-shard, so one shard's hot working set cannot push out another's. Pair that with speculative execution, which fires a redundant read to a second replica if the first does not respond within a configurable threshold (typically the recent p99), and the result is consistent sub-millisecond reads for cached data and single-digit-millisecond reads from SSD.
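Both knobs for that behavior are exposed: speculative_retry on the table (server side) and a speculative execution policy in the driver (client side). A sketch with the Python driver, where the 20 ms delay, addresses, and table names are assumptions:

```python
from cassandra.cluster import Cluster, ExecutionProfile
from cassandra.policies import ConstantSpeculativeExecutionPolicy
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])
# Client-side hedging: if nothing comes back within 20 ms, resend the request
# to another coordinator and keep whichever response lands first.
cluster.add_execution_profile(
    "hedged_reads",
    ExecutionProfile(
        speculative_execution_policy=ConstantSpeculativeExecutionPolicy(
            delay=0.02, max_attempts=2
        )
    ),
)
session = cluster.connect()

# Server-side: let the coordinator retry a slow replica once the recent p99
# read latency for this table has elapsed.
session.execute(
    "ALTER TABLE notifications.delivery_log WITH speculative_retry = '99.0PERCENTILE'"
)

# Only idempotent statements are safe to speculate on.
stmt = SimpleStatement(
    "SELECT status FROM notifications.delivery_log "
    "WHERE tenant_id = %s AND notification_id = %s",
    is_idempotent=True,
)
rows = session.execute(stmt, ("tenant-42", "notif-123"),
                       execution_profile="hedged_reads")
```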
Compaction is where things get really interesting. The workload-aware scheduler allocates I/O bandwidth between foreground operations (reads, writes) and background operations (compaction, streaming, repair) in real time. When traffic spikes, compaction gets throttled automatically. When things quiet down, it catches up. Anyone who has been woken up by a Cassandra compaction storm eating all disk I/O knows how much this matters.
Production Architecture
Start with a minimum of 3 nodes (RF=3) on NVMe-backed instances (AWS i3, i4i, or GCP n2-standard with local SSD). ScyllaDB favors vertical scaling, so fewer large nodes beat many small ones. A single i4i.8xlarge (32 vCPUs, 256 GB RAM, 2x3.75 TB NVMe) can push 500K-1M ops/sec. Use NetworkTopologyStrategy for multi-DC deployments.
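A matching keyspace definition for a two-DC cluster at RF=3 per datacenter might look like this ('dc1'/'dc2', the address, and the keyspace name are placeholders; DC names must match what your snitch reports):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# Three replicas in each datacenter.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS notifications
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
    }
""")
```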
Enable tablets (ScyllaDB's replacement for vnode-based data distribution in newer releases) for faster, more granular topology changes. Set internode_compression to all for anything multi-DC. Configure ScyllaDB's per-partition rate limiting to protect against hot partitions, because someone will eventually send a query pattern that hammers a single partition.
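Per-partition rate limiting is a table-level option in ScyllaDB (available in recent versions; the limits below are placeholders, and the exact syntax is worth confirming against your version's docs):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# Cap per-partition traffic so one hot key degrades gracefully instead of
# pinning a shard; requests over the limit are rejected and the client backs off.
session.execute("""
    ALTER TABLE notifications.delivery_log
    WITH per_partition_rate_limit = {
        'max_reads_per_second': 1000,
        'max_writes_per_second': 1000
    }
""")
```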
For a notification system pattern (2.4 PB storage, millions of writes/sec): partition by tenant_id:notification_id, use TWCS (Time-Window Compaction Strategy) for time-ordered data, and set a TTL on notifications (90 days, for example) so storage does not grow forever. ScyllaDB's large-partition detection logs any partition that crosses the configured warning threshold, which gives the team time to fix data model problems before latency degrades.
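A sketch of such a table, with a composite partition key, a one-day TWCS window, and a 90-day default TTL (column names and window size are assumptions to adapt):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("notifications")

# Hypothetical delivery log: partitioned by (tenant, notification), clustered
# by delivery time, compacted in one-day windows, expired after 90 days.
session.execute("""
    CREATE TABLE IF NOT EXISTS delivery_log (
        tenant_id        text,
        notification_id  text,
        delivered_at     timestamp,
        user_id          text,
        status           text,
        PRIMARY KEY ((tenant_id, notification_id), delivered_at, user_id)
    ) WITH CLUSTERING ORDER BY (delivered_at DESC, user_id ASC)
      AND compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': '1'
      }
      AND default_time_to_live = 7776000
""")
```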
Key metrics to watch: scylla_storage_proxy_coordinator_write_latency, scylla_storage_proxy_coordinator_read_latency, scylla_compaction_manager_compactions, scylla_scheduler_shares (workload-aware scheduler utilization), and scylla_cache_row_hits vs scylla_cache_row_misses for cache efficiency.
Capacity Planning
Each ScyllaDB node handles 2-5x the throughput of an equivalent Cassandra node thanks to shard-per-core efficiency. Do not trust generic benchmarks. Run the specific workload through cassandra-stress (which works against ScyllaDB unchanged) or ScyllaDB's own scylla-bench and get real numbers.
| Scale Tier | Ops/sec | Nodes | Instance Type | Storage | Reference |
|---|---|---|---|---|---|
| Startup | 50K | 3 | i3.2xlarge | 1.7 TB NVMe | Early-stage platform |
| Mid-scale | 500K | 6-9 | i3.4xlarge | 3.4 TB NVMe | E-commerce notifications |
| Large-scale | 5M | 20-40 | i4i.8xlarge | 7.5 TB NVMe | Notification system (100M/sec peak) |
| Hyper-scale | 50M+ | 100+ | i4i.16xlarge | 15 TB NVMe | Discord-scale messaging |
Rule of thumb: provision for 50% CPU headroom to handle compaction and repair. Storage should have 30% free space for compaction overhead with STCS, or 10% with LCS. ScyllaDB's automatic memory management typically splits RAM 50/50 between the row cache and memtables, adjustable via the --memory flag.
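A back-of-the-envelope sizing helper that encodes those rules (the per-node throughput number is whatever your own benchmark produced, not a constant):

```python
import math

def nodes_needed(target_ops_per_sec: float,
                 ops_per_node: float,
                 cpu_headroom: float = 0.50,
                 rf: int = 3) -> int:
    """Smallest node count that carries the load while keeping CPU headroom
    for compaction and repair, and never drops below the replication factor."""
    usable = ops_per_node * (1 - cpu_headroom)
    return max(math.ceil(target_ops_per_sec / usable), rf)

def usable_tb_per_node(raw_tb: float, free_space_reserve: float = 0.30) -> float:
    """Usable storage after reserving free space for compaction
    (roughly 30% with STCS, 10% with LCS)."""
    return raw_tb * (1 - free_space_reserve)

# Example: 500K ops/sec against nodes benchmarked at ~200K ops/sec each.
print(nodes_needed(500_000, 200_000))   # -> 5
print(usable_tb_per_node(7.5))          # -> 5.25 TB per i4i.8xlarge
```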
Failure Scenarios
Scenario 1: Hot partition from a viral notification campaign. A tenant blasts a push notification to 50M users at once. Every delivery record hashes to a small set of partitions keyed by campaign_id. The shard owning those partitions pegs its CPU core at 100% while every other shard on the node sits idle, doing nothing useful. ScyllaDB's workload conditioning detects the hot shard and logs it, but by then, latency for other tenants on that shard is already degraded. We have seen this happen. It is not fun. Detection: monitor scylla_storage_proxy_coordinator_write_latency per-shard and alert when imbalance exceeds 5x the cluster average. Fix: redesign the partition key from campaign_id to campaign_id:bucket with consistent hash bucketing (256 buckets works well), spreading writes across all shards. Enable per-partition rate limiting as a safety net going forward.
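A minimal sketch of that bucketing scheme (table and column names are hypothetical; the key point is that the bucket joins the partition key, so one campaign fans out across 256 partitions and therefore across shards and nodes):

```python
import zlib

NUM_BUCKETS = 256

def bucket_for(user_id: str) -> int:
    # Any stable hash works. Avoid Python's built-in hash(): it is randomized
    # per process, so reads could never recompute the writer's bucket.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS

# Hypothetical schema change: bucket becomes part of the partition key.
#   CREATE TABLE deliveries (
#       campaign_id text,
#       bucket      smallint,
#       user_id     text,
#       sent_at     timestamp,
#       PRIMARY KEY ((campaign_id, bucket), user_id)
#   );

# Writes for one campaign now spread across 256 partitions:
#   session.execute(insert, ("campaign-9", bucket_for(user_id), user_id, sent_at))
#
# Reads that need the whole campaign fan out one query per bucket, ideally
# in parallel with execute_async:
#   for b in range(NUM_BUCKETS):
#       session.execute(select_by_bucket, ("campaign-9", b))

print(bucket_for("user-12345"))  # deterministic bucket in [0, 255]
```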
Scenario 2: NVMe disk failure takes out a node's local data. One of two NVMe drives on a node dies. ScyllaDB detects the I/O errors, but with RAID-0 (the common setup for performance), the stripe spans both drives, so the node loses its entire local dataset, not just the contents of the failed drive. RF=3 means replicas exist on other nodes, so the cluster keeps serving reads and writes while the node is rebuilt, just with less capacity and less headroom. Detection: monitor scylla_io_queue_total_operations and disk error counters. Recovery: swap the drive, rebuild the RAID array, and bring the node back with the replace-node procedure (or nodetool rebuild after a clean restart) so it streams its data from replicas. With roughly 7.5 TB of data per node at 200 MB/s streaming throughput, expect around 10 hours for a full rebuild. During that window, the cluster runs at reduced capacity. This is exactly why the 50% CPU headroom matters. If the nodes are already running hot before a disk fails, it is going to be a very bad day.
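The rebuild estimate is worth keeping as a two-line calculation so it can be redone with real numbers during an incident (the data size and streaming throughput below are assumptions):

```python
def rebuild_hours(data_tb: float, stream_mb_per_sec: float) -> float:
    """Hours to restream a node's data from replicas at a given throughput."""
    return data_tb * 1_000_000 / stream_mb_per_sec / 3600

print(round(rebuild_hours(7.5, 200), 1))   # ~10.4 h for a full 2x3.75 TB node
print(round(rebuild_hours(3.75, 200), 1))  # ~5.2 h if only half the data must move
```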
Pros
- • Shard-per-core architecture removes cross-CPU contention entirely
- • Drop-in Cassandra CQL compatibility (drivers, tools, data model all work)
- • Written in C++ on the Seastar framework. No GC pauses. Period.
- • Automatic workload-aware scheduling (reads vs compaction vs streaming)
- • Speculative execution that genuinely cuts tail latency
Cons
- • Same query-driven data modeling constraints as Cassandra
- • Smaller community and ecosystem than Cassandra
- • Enterprise features (LDAP integration, encryption at rest) sit behind a paywall
- • Fewer managed-service options compared to DynamoDB or Cassandra on DataStax Astra
- • Lightweight transactions (LWT) are noticeably slower than regular writes because of Paxos
When to use
- • You want the Cassandra data model but 2-10x lower p99 latency
- • GC pauses in your JVM-based database are causing tail-latency spikes
- • Your workload exceeds 100K ops/sec per node and you want fewer nodes overall
- • You are running latency-sensitive reads alongside compaction-heavy tables
When NOT to use
- • Your dataset fits comfortably in a single relational database
- • You need complex joins or ad-hoc analytical queries
- • Your team has zero experience with Cassandra data modeling
- • You need a fully managed serverless setup with zero ops
Key Points
- • Shard-per-core design gives each CPU core its own independent shard with dedicated memory, commit log, and SSTables. No locks, no context switches, no cross-core coordination.
- • The Seastar framework provides cooperative scheduling, async I/O, and per-core memory allocation. No garbage collector means no GC pauses, ever.
- • CQL-compatible with Cassandra. Existing drivers, tooling, and data models work without modification, making incremental migration straightforward.
- • The workload-aware scheduler dynamically prioritizes foreground reads/writes over background compaction and streaming, which prevents compaction storms from hammering client latency.
- • Speculative execution sends a redundant read to another replica if the primary is slow. It cuts p99 tail latency by 2-5x at the cost of roughly 5% extra read traffic.
- • A single ScyllaDB node can replace 3-5 Cassandra nodes because of higher per-node throughput, which means fewer boxes to manage and lower infra cost.
Common Mistakes
- ✗ Blindly applying Cassandra tuning guides. ScyllaDB's shard-per-core model means memory allocation, compaction settings, and thread pools all work differently at the implementation level.
- ✗ Over-provisioning nodes instead of scaling vertically first. ScyllaDB is designed to saturate large instances. A single i3.16xlarge can handle 1M+ ops/sec.
- ✗ Ignoring partition size limits. Same as Cassandra: partitions over 100MB cause latency spikes. ScyllaDB's large-partition detection log helps catch these early.
- ✗ Skipping speculative execution for latency-sensitive reads. Without it, a single slow replica dictates the p99. With it, the fastest replica wins.
- ✗ Running on under-provisioned storage. ScyllaDB's I/O scheduler expects direct-attached NVMe. EBS or network storage introduces latency the scheduler simply cannot compensate for.