ScyllaDB
A shard-per-core NoSQL database built for low latency at serious scale
How It Works Internally
Anyone who has run Cassandra in production knows the pain: GC pauses that show up at 3 AM, compaction storms that tank read latency, and a thread model that turns multi-core machines into expensive space heaters. ScyllaDB exists because someone looked at all of that and said, "What if we just rewrote the whole thing in C++ and threw away the JVM?"
The core idea is shard-per-core. Each CPU core runs one independent shard with its own memory allocator, commit log, memtable, cache, and SSTable set. Shards talk to each other only through explicit message passing. No locks. No shared mutable state. No cross-core cache line bouncing. This single design choice wipes out the JVM thread contention and stop-the-world GC pauses that cause most of Cassandra's tail-latency nightmares.
The write path looks a lot like Cassandra on the surface. The coordinator hashes the partition key (Murmur3) to find the replica owners, then forwards the mutation according to the configured consistency level. Each replica's shard appends to its per-core commit log (a sequential NVMe write), inserts into the memtable, and returns. The important difference: because each shard owns its I/O independently, writes to different partition ranges run truly in parallel with zero contention. Discord migrated their Cassandra clusters (trillions of messages) to ScyllaDB and cut their node count from 177 to 72. Their p99 read latency dropped from 40-125ms to 5-15ms. That is not a typo.
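From the client side, that write path is easy to see with the Python driver (cassandra-driver and its shard-aware scylla-driver fork both work against ScyllaDB; the node addresses, keyspace, table, and columns below are hypothetical):

```python
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token-aware routing hashes the partition key client-side (Murmur3) and sends
# the mutation straight to a replica that owns it, skipping an extra hop.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
cluster = Cluster(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("notifications")

# Prepared statements let the driver compute the partition token locally.
insert = session.prepare(
    "INSERT INTO delivery_log (tenant_id, notification_id, delivered_at, user_id, status) "
    "VALUES (?, ?, ?, ?, ?)"
)
session.execute(insert, ("tenant-42", "notif-123",
                         datetime.now(timezone.utc), "user-7", "sent"))
```

The shard-aware driver fork goes one step further: it opens a connection per shard and routes each request to the exact core that owns the partition, not just the owning node.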
The read path uses the same SSTable format and Bloom filter approach as Cassandra, but adds a row-level cache that lives in shard-local memory instead of on the JVM heap. Cache eviction is per-shard, so one shard's hot working set cannot push out another's. Pair that with speculative execution, which fires a redundant read to a second replica if the first does not respond within a configurable threshold (typically the recent p99), and the result is consistent sub-millisecond reads for cached data and single-digit-millisecond reads from SSD.
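Both knobs for that behavior are exposed: speculative_retry on the table (server side) and a speculative execution policy in the driver (client side). A sketch with the Python driver, where the 20 ms delay, addresses, and table names are assumptions:

```python
from cassandra.cluster import Cluster, ExecutionProfile
from cassandra.policies import ConstantSpeculativeExecutionPolicy
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])
# Client-side hedging: if nothing comes back within 20 ms, resend the request
# to another coordinator and keep whichever response lands first.
cluster.add_execution_profile(
    "hedged_reads",
    ExecutionProfile(
        speculative_execution_policy=ConstantSpeculativeExecutionPolicy(
            delay=0.02, max_attempts=2
        )
    ),
)
session = cluster.connect()

# Server-side: let the coordinator retry a slow replica once the recent p99
# read latency for this table has elapsed.
session.execute(
    "ALTER TABLE notifications.delivery_log WITH speculative_retry = '99.0PERCENTILE'"
)

# Only idempotent statements are safe to speculate on.
stmt = SimpleStatement(
    "SELECT status FROM notifications.delivery_log "
    "WHERE tenant_id = %s AND notification_id = %s",
    is_idempotent=True,
)
rows = session.execute(stmt, ("tenant-42", "notif-123"),
                       execution_profile="hedged_reads")
```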
Compaction is where things get really interesting. The workload-aware scheduler allocates I/O bandwidth between foreground operations (reads, writes) and background operations (compaction, streaming, repair) in real time. When traffic spikes, compaction gets throttled automatically. When things quiet down, it catches up. Anyone who has been woken up by a Cassandra compaction storm eating all disk I/O knows how much this matters.
Production Architecture
Start with a minimum of 3 nodes (RF=3) on NVMe-backed instances (AWS i3, i4i, or GCP n2-standard with local SSD). ScyllaDB favors vertical scaling, so fewer large nodes beat many small ones. A single i4i.8xlarge (32 vCPUs, 256 GB RAM, 2x3.75 TB NVMe) can push 500K-1M ops/sec. Use NetworkTopologyStrategy for multi-DC deployments.
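A matching keyspace definition for a two-DC cluster at RF=3 per datacenter might look like this ('dc1'/'dc2', the address, and the keyspace name are placeholders; DC names must match what your snitch reports):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# Three replicas in each datacenter.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS notifications
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
    }
""")
```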
Enable tablets (ScyllaDB's replacement for vnode-based data distribution in newer releases) for faster, more granular topology changes. Set internode_compression to all for anything multi-DC. Configure ScyllaDB's per-partition rate limiting to protect against hot partitions, because someone will eventually send a query pattern that hammers a single partition.
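Per-partition rate limiting is a table-level option in ScyllaDB (available in recent versions; the limits below are placeholders, and the exact syntax is worth confirming against your version's docs):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# Cap per-partition traffic so one hot key degrades gracefully instead of
# pinning a shard; requests over the limit are rejected and the client backs off.
session.execute("""
    ALTER TABLE notifications.delivery_log
    WITH per_partition_rate_limit = {
        'max_reads_per_second': 1000,
        'max_writes_per_second': 1000
    }
""")
```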
For a notification system pattern (2.4 PB storage, millions of writes/sec): partition by tenant_id:notification_id, use TWCS (Time-Window Compaction Strategy) for time-ordered data, and set a TTL on notifications (90 days, for example) so storage does not grow forever. ScyllaDB's large-partition detection logs any partition that crosses the configured warning threshold, which gives the team time to fix data model problems before latency degrades.
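A sketch of such a table, with a composite partition key, a one-day TWCS window, and a 90-day default TTL (column names and window size are assumptions to adapt):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("notifications")

# Hypothetical delivery log: partitioned by (tenant, notification), clustered
# by delivery time, compacted in one-day windows, expired after 90 days.
session.execute("""
    CREATE TABLE IF NOT EXISTS delivery_log (
        tenant_id        text,
        notification_id  text,
        delivered_at     timestamp,
        user_id          text,
        status           text,
        PRIMARY KEY ((tenant_id, notification_id), delivered_at, user_id)
    ) WITH CLUSTERING ORDER BY (delivered_at DESC, user_id ASC)
      AND compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': '1'
      }
      AND default_time_to_live = 7776000
""")
```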
Key metrics to watch: scylla_storage_proxy_coordinator_write_latency, scylla_storage_proxy_coordinator_read_latency, scylla_compaction_manager_compactions, scylla_scheduler_shares (workload-aware scheduler utilization), and scylla_cache_row_hits vs scylla_cache_row_misses for cache efficiency.
Capacity Planning
Each ScyllaDB node handles 2-5x the throughput of an equivalent Cassandra node thanks to shard-per-core efficiency. Do not trust generic benchmarks. Run the specific workload through cassandra-stress (which works against ScyllaDB unchanged) or ScyllaDB's own scylla-bench and get real numbers.
| Scale Tier | Ops/sec | Nodes | Instance Type | Storage | Reference |
|---|---|---|---|---|---|
| Startup | 50K | 3 | i3.2xlarge | 1.7 TB NVMe | Early-stage platform |
| Mid-scale | 500K | 6-9 | i3.4xlarge | 3.4 TB NVMe | E-commerce notifications |
| Large-scale | 5M | 20-40 | i4i.8xlarge | 7.5 TB NVMe | Notification system (100M/sec peak) |
| Hyper-scale | 50M+ | 100+ | i4i.16xlarge | 15 TB NVMe | Discord-scale messaging |
Rule of thumb: provision for 50% CPU headroom to handle compaction and repair. Storage should have 30% free space for compaction overhead with STCS, or 10% with LCS. ScyllaDB's automatic memory management typically splits RAM 50/50 between the row cache and memtables, adjustable via the --memory flag.
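A back-of-the-envelope sizing helper that encodes those rules (the per-node throughput number is whatever your own benchmark produced, not a constant):

```python
import math

def nodes_needed(target_ops_per_sec: float,
                 ops_per_node: float,
                 cpu_headroom: float = 0.50,
                 rf: int = 3) -> int:
    """Smallest node count that carries the load while keeping CPU headroom
    for compaction and repair, and never drops below the replication factor."""
    usable = ops_per_node * (1 - cpu_headroom)
    return max(math.ceil(target_ops_per_sec / usable), rf)

def usable_tb_per_node(raw_tb: float, free_space_reserve: float = 0.30) -> float:
    """Usable storage after reserving free space for compaction
    (roughly 30% with STCS, 10% with LCS)."""
    return raw_tb * (1 - free_space_reserve)

# Example: 500K ops/sec against nodes benchmarked at ~200K ops/sec each.
print(nodes_needed(500_000, 200_000))   # -> 5
print(usable_tb_per_node(7.5))          # -> 5.25 TB per i4i.8xlarge
```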
Failure Scenarios
Scenario 1: Hot partition from a viral notification campaign. A tenant blasts a push notification to 50M users at once. Every delivery record hashes to a small set of partitions keyed by campaign_id. The shard owning those partitions pegs its CPU core at 100% while every other shard on the node sits idle, doing nothing useful. ScyllaDB's workload conditioning detects the hot shard and logs it, but by then, latency for other tenants on that shard is already degraded. We have seen this happen. It is not fun. Detection: monitor scylla_storage_proxy_coordinator_write_latency per-shard and alert when imbalance exceeds 5x the cluster average. Fix: redesign the partition key from campaign_id to campaign_id:bucket with consistent hash bucketing (256 buckets works well), spreading writes across all shards. Enable per-partition rate limiting as a safety net going forward.
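A minimal sketch of that bucketing scheme (table and column names are hypothetical; the key point is that the bucket joins the partition key, so one campaign fans out across 256 partitions and therefore across shards and nodes):

```python
import zlib

NUM_BUCKETS = 256

def bucket_for(user_id: str) -> int:
    # Any stable hash works. Avoid Python's built-in hash(): it is randomized
    # per process, so reads could never recompute the writer's bucket.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS

# Hypothetical schema change: bucket becomes part of the partition key.
#   CREATE TABLE deliveries (
#       campaign_id text,
#       bucket      smallint,
#       user_id     text,
#       sent_at     timestamp,
#       PRIMARY KEY ((campaign_id, bucket), user_id)
#   );

# Writes for one campaign now spread across 256 partitions:
#   session.execute(insert, ("campaign-9", bucket_for(user_id), user_id, sent_at))
#
# Reads that need the whole campaign fan out one query per bucket, ideally
# in parallel with execute_async:
#   for b in range(NUM_BUCKETS):
#       session.execute(select_by_bucket, ("campaign-9", b))

print(bucket_for("user-12345"))  # deterministic bucket in [0, 255]
```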
Scenario 2: NVMe disk failure takes out a node's local data. One of two NVMe drives on a node dies. ScyllaDB detects the I/O errors, but with RAID-0 (the common setup for performance), the stripe spans both drives, so the node loses its entire local dataset, not just the contents of the failed drive. RF=3 means replicas exist on other nodes, so the cluster keeps serving reads and writes while the node is rebuilt, just with less capacity and less headroom. Detection: monitor scylla_io_queue_total_operations and disk error counters. Recovery: swap the drive, rebuild the RAID array, and bring the node back with the replace-node procedure (or nodetool rebuild after a clean restart) so it streams its data from replicas. With roughly 7.5 TB of data per node at 200 MB/s streaming throughput, expect around 10 hours for a full rebuild. During that window, the cluster runs at reduced capacity. This is exactly why the 50% CPU headroom matters. If the nodes are already running hot before a disk fails, it is going to be a very bad day.
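The rebuild estimate is worth keeping as a two-line calculation so it can be redone with real numbers during an incident (the data size and streaming throughput below are assumptions):

```python
def rebuild_hours(data_tb: float, stream_mb_per_sec: float) -> float:
    """Hours to restream a node's data from replicas at a given throughput."""
    return data_tb * 1_000_000 / stream_mb_per_sec / 3600

print(round(rebuild_hours(7.5, 200), 1))   # ~10.4 h for a full 2x3.75 TB node
print(round(rebuild_hours(3.75, 200), 1))  # ~5.2 h if only half the data must move
```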
Pros
- • Shard-per-core architecture removes cross-CPU contention entirely
- • Drop-in Cassandra CQL compatibility (drivers, tools, data model all work)
- • Written in C++ on the Seastar framework. No GC pauses. Period.
- • Automatic workload-aware scheduling (reads vs compaction vs streaming)
- • Speculative execution that genuinely cuts tail latency
Cons
- • Same query-driven data modeling constraints as Cassandra
- • Smaller community and ecosystem than Cassandra
- • Enterprise features (LDAP integration, encryption at rest) sit behind a paywall
- • Fewer managed-service options compared to DynamoDB or Cassandra on DataStax Astra
- • Lightweight transactions (LWT) are noticeably slower than regular writes because of Paxos
When to use
- • You want the Cassandra data model but 2-10x lower p99 latency
- • GC pauses in your JVM-based database are causing tail-latency spikes
- • Your workload exceeds 100K ops/sec per node and you want fewer nodes overall
- • You are running latency-sensitive reads alongside compaction-heavy tables
When NOT to use
- • Your dataset fits comfortably in a single relational database
- • You need complex joins or ad-hoc analytical queries
- • Your team has zero experience with Cassandra data modeling
- • You need a fully managed serverless setup with zero ops
Key Points
- • Shard-per-core design gives each CPU core its own independent shard with dedicated memory, commit log, and SSTables. No locks, no context switches, no cross-core coordination.
- • The Seastar framework provides cooperative scheduling, async I/O, and per-core memory allocation. No garbage collector means no GC pauses, ever.
- • CQL-compatible with Cassandra. Existing drivers, tooling, and data models work without modification, making incremental migration straightforward.
- • The workload-aware scheduler dynamically prioritizes foreground reads/writes over background compaction and streaming, which prevents compaction storms from hammering client latency.
- • Speculative execution sends a redundant read to another replica if the primary is slow. It cuts p99 tail latency by 2-5x at the cost of roughly 5% extra read traffic.
- • A single ScyllaDB node can replace 3-5 Cassandra nodes because of higher per-node throughput, which means fewer boxes to manage and lower infra cost.
Common Mistakes
- ✗ Blindly applying Cassandra tuning guides. ScyllaDB's shard-per-core model means memory allocation, compaction settings, and thread pools all work differently at the implementation level.
- ✗ Over-provisioning nodes instead of scaling vertically first. ScyllaDB is designed to saturate large instances. A single i3.16xlarge can handle 1M+ ops/sec.
- ✗ Ignoring partition size limits. Same as Cassandra: partitions over 100MB cause latency spikes. ScyllaDB's large-partition detection log helps catch these early.
- ✗ Skipping speculative execution for latency-sensitive reads. Without it, a single slow replica dictates the p99. With it, the fastest replica wins.
- ✗ Running on under-provisioned storage. ScyllaDB's I/O scheduler expects direct-attached NVMe. EBS or network storage introduces latency the scheduler simply cannot compensate for.