Database Sharding
Why It Exists
Vertical scaling hits a wall. There is always a biggest machine available, and eventually the database will outgrow it.
When a single database instance cannot keep up with write throughput, storage needs, or query concurrency, sharding splits the dataset horizontally across multiple independent database instances. Each shard holds a distinct slice of the data, delivering near-linear write scalability. Read replicas only help with reads. Sharding scales both reads and writes.
Most teams reach for sharding too early. Before going down this road, exhaust vertical scaling, read replicas, caching, and query optimization first. Sharding adds serious operational complexity, and once committed, the team is living with that complexity for years.
How It Works
Sharding Strategies
- Range-based: Assigns contiguous key ranges to shards (e.g., user_id 1-1M goes to Shard 1). Simple to build and great for range queries. The downside is real, though: if access patterns cluster in certain ranges, hotspots emerge. A new signup wave hammering the latest range shard is a classic example.
- Hash-based: Applies a hash function to the shard key and maps the result to a shard via modulo or consistent hashing (see the sketch after this list). This produces uniform data distribution across shards. The cost? Range queries turn into scatter-gather operations across every shard. For range-heavy workloads, think twice.
- Directory-based: A lookup table maps each shard key to its shard. This provides maximum flexibility, since any key can live on any shard. But the directory itself becomes a single point of failure and a potential bottleneck. It trades one problem for another.
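To make the hash-based strategy concrete, here is a minimal routing sketch in Python. The shard count, key format, and function names are illustrative assumptions, not taken from any particular system.

```python
import hashlib

SHARD_COUNT = 8  # illustrative; real deployments pick this from capacity planning

def shard_for(key: str) -> int:
    """Map a shard key to a shard index via a stable hash and modulo."""
    # Use a stable hash (md5 here) rather than Python's built-in hash(),
    # which is randomized per process and unsuitable for routing.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT

print(shard_for("user:12345"))  # deterministic shard index in [0, 8)
```

Note that with plain modulo, changing SHARD_COUNT remaps almost every key. That is exactly why consistent hashing and virtual shards (covered under resharding below) matter.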
Shard Key Selection Criteria
The shard key determines everything. I cannot overstate this. A bad shard key will cause more pain than almost any other architectural mistake.
A good shard key has high cardinality (millions of distinct values), even distribution (no single value dominates), and query alignment (most queries include the shard key in their WHERE clause). If the most common query does not include the shard key, every request becomes a scatter-gather across all shards. That is worse than not sharding at all.
Composite shard keys (e.g., tenant_id + user_id) are common in multi-tenant systems. They co-locate a tenant's data on the same shard while distributing load across tenants. This is the right call for most SaaS applications.
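A minimal sketch of composite-key routing, assuming the tenant component alone drives shard placement; the names and shard count are hypothetical:

```python
import hashlib

SHARD_COUNT = 16  # illustrative

def shard_for_tenant(tenant_id: str) -> int:
    # Hash only tenant_id so every row for a tenant lands on one shard.
    # user_id remains part of the primary key for uniqueness, but it
    # does not affect routing, so tenant queries stay single-shard.
    digest = hashlib.md5(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT

# All of a tenant's users co-locate, while different tenants spread out:
assert shard_for_tenant("acme") == shard_for_tenant("acme")
print(shard_for_tenant("acme"), shard_for_tenant("globex"))
```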
Resharding Strategies
When shards become unbalanced or capacity runs out, resharding becomes necessary. This is one of the hardest operational tasks in infrastructure. Common approaches:
- Virtual shards: Map logical shards (e.g., 1024 virtual shards) to physical nodes (a sketch follows this list). Moving a virtual shard is a data copy, not a full repartition. Vitess and Citus both work this way. This is the approach I would recommend starting with, because it provides the most flexibility later.
- Consistent hashing: Adding a node only remaps roughly 1/N of keys. DynamoDB and Cassandra rely on this. Works well for key-value workloads, less well for relational data.
- Double-write migration: Write to both old and new shard layout during the transition, then cut over reads. It is complex and error-prone, but it avoids downtime. Budget more time for this than seems necessary.
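A minimal sketch of the virtual-shard mapping from the first bullet. The node names and in-memory map are assumptions; real systems keep this assignment in a coordination store (etcd, ZooKeeper) rather than in application code.

```python
import hashlib

VIRTUAL_SHARDS = 1024

# Assumed initial assignment: 1024 logical shards spread over 4 nodes.
virtual_to_physical = {v: f"node-{v % 4}" for v in range(VIRTUAL_SHARDS)}

def virtual_shard(key: str) -> int:
    # The key -> virtual-shard function never changes, even as nodes
    # are added or removed. Only the assignment map changes.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % VIRTUAL_SHARDS

def node_for(key: str) -> str:
    return virtual_to_physical[virtual_shard(key)]

# Rebalancing: reassign one virtual shard, copy its data, done. No
# global repartition of keys is required.
virtual_to_physical[42] = "node-4"
```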
Cross-Shard Query Patterns
Queries spanning multiple shards use a scatter-gather pattern: the routing layer fans out the query to all relevant shards, collects results, and merges them. This is slow. For transactions, two-phase commit (2PC) provides atomicity but introduces a blocking coordinator. Saga patterns break the transaction into compensating steps for better availability, but at the cost of strong consistency.
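A minimal scatter-gather sketch; query_shard is a hypothetical stand-in for a real per-shard database call:

```python
import asyncio

async def query_shard(shard_id: int, sql: str) -> list[dict]:
    # Stand-in for a network round trip to one shard's database.
    await asyncio.sleep(0.01)
    return [{"shard": shard_id, "total": shard_id * 10}]

async def scatter_gather(sql: str, shard_count: int) -> list[dict]:
    # Fan out to every shard concurrently. Overall latency is the
    # slowest shard's latency, which is why scatter-gather queries
    # hurt at high percentiles.
    results = await asyncio.gather(
        *(query_shard(s, sql) for s in range(shard_count))
    )
    # Merge step: flatten and re-sort, since each shard returns its
    # own locally ordered slice.
    merged = [row for shard_rows in results for row in shard_rows]
    return sorted(merged, key=lambda r: r["total"], reverse=True)

print(asyncio.run(scatter_gather("SELECT ...", shard_count=4)))
```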
My honest take: needing cross-shard transactions frequently usually means the wrong shard key was picked or sharding was the wrong choice. Consider CockroachDB or Spanner instead of fighting this battle with manual sharding.
Production Considerations
- Monitor shard balance: Track row counts, storage size, and QPS per shard. Alert when skew exceeds 20% (a sketch follows this list). Without monitoring this, the first sign of imbalance will be a shard falling over.
- Shard-local indexes only: Global secondary indexes across shards are prohibitively expensive to maintain. Denormalize the data or use a search index (Elasticsearch) for cross-shard lookups. Accept this constraint early and design around it.
- Connection pooling: N shards means N connection pools. Use PgBouncer or ProxySQL to prevent connection exhaustion. This surprises teams who are used to managing connections to a single database.
- Backup and restore per shard: Coordinate point-in-time recovery across shards to maintain cross-shard consistency. Test the restore procedure regularly. An untested backup is not a backup.
- Schema migrations: Roll out migrations to all shards atomically or in a backward-compatible way. Use expand-and-contract migration patterns. This is tedious but non-negotiable.
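A minimal skew-check sketch for the shard-balance bullet above; the metric shape and the alert action are assumptions, and a real job would pull these numbers from your metrics system and page instead of printing.

```python
def skew(values: list[float]) -> float:
    """Relative deviation of the hottest shard from the mean."""
    mean = sum(values) / len(values)
    return (max(values) - mean) / mean if mean else 0.0

def check_balance(qps_per_shard: dict[str, float], threshold: float = 0.20) -> None:
    s = skew(list(qps_per_shard.values()))
    if s > threshold:
        print(f"ALERT: shard QPS skew {s:.0%} exceeds {threshold:.0%}")

# shard-2 is running well above the mean, so this fires.
check_balance({"shard-0": 900, "shard-1": 1100, "shard-2": 2400})
```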
Failure Scenarios
Scenario 1: Hot Shard Cascade. A celebrity user on a range-sharded system triggers 50x normal write throughput to one shard. The shard's CPU saturates at 100%, the connection pool exhausts, and the shard router starts queuing all requests routed to that shard. Here is the nasty part: application threads block waiting for the hot shard, draining the thread pool, which causes timeouts on requests to healthy shards too. One bad shard takes down everything.
Detection: Monitor per-shard P99 latency and connection pool utilization. Alert when any shard exceeds 2x baseline.
Recovery: Implement circuit breakers per shard (sketched below). Short-term, rate-limit the hot key. Long-term, reshard the hot range or put a write-behind buffer in front of it.
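A minimal per-shard circuit breaker sketch, assuming a failure-count trip condition; the thresholds and class shape are illustrative, not from any library, and a production breaker would also track half-open probe outcomes more carefully.

```python
import time

class ShardBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None  # monotonic time the breaker opened, or None

    def allow(self) -> bool:
        # An open breaker rejects immediately, so application threads
        # never block on a saturated shard and healthy shards stay fast.
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None  # half-open: permit one probe request
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# One breaker per shard: call allow() before routing, record() after.
breakers = {shard: ShardBreaker() for shard in range(8)}
```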
Scenario 2: Shard Map Corruption. A directory-based sharding deployment pushes a bad shard map update. 15% of reads get routed to the wrong shard, returning empty results or incorrect data. The application silently serves wrong data for 12 minutes before anyone notices. Silent data corruption is the worst kind of outage.
Detection: Track per-shard query rates. A sudden drop on one shard paired with a spike on another points to misrouting. Add data integrity checksums on shard-routed responses.
Recovery: Version shard maps with rollback capability. Uber learned this lesson running Schemaless: they maintain shard map history with instant rollback, and they canary shard map deploys to 1% of traffic first.
Scenario 3: Resharding Data Loss. During a live resharding migration, the double-write pipeline falls behind. The cutover happens too early, and 0.3% of writes during the transition window land on the old shard but never get replicated to the new layout.
Detection: Run a post-migration reconciliation job that compares row counts and checksums per shard key range (sketched below). Alert on any delta greater than zero.
Recovery: Keep the old shard layout in read-only mode for at least 48 hours post-migration. Stripe keeps old shards available for a full week during migrations to support reconciliation. That is the right instinct.
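A minimal reconciliation sketch, under the assumption that rows for a given key range can be fetched from both the old and new layouts; the key-range naming and the way rows arrive are hypothetical, and a real job would stream from each database.

```python
import hashlib

def range_checksum(rows: list[tuple]) -> str:
    # Sort first so the checksum is independent of fetch order.
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def reconcile(old_rows: list[tuple], new_rows: list[tuple], key_range: str) -> None:
    # Row-count delta is the cheap check; checksum catches silent
    # content divergence when counts happen to match.
    if len(old_rows) != len(new_rows):
        print(f"ALERT {key_range}: row count {len(old_rows)} != {len(new_rows)}")
    elif range_checksum(old_rows) != range_checksum(new_rows):
        print(f"ALERT {key_range}: checksum mismatch")

# One write lost during the transition window -> alert fires.
reconcile([(1, "a"), (2, "b")], [(1, "a")], key_range="users:0-1M")
```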
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Shard storage utilization | > 70% | Start planning resharding (6-12 week lead time) |
| Per-shard QPS | > 80% of tested max | Add read replicas or reshard |
| Shard count growth rate | > 2x/year | Question whether the sharding strategy itself is correct |
| Cross-shard query ratio | > 15% of total queries | Revisit the shard key. Co-location is failing |
Real-world scale references: Uber's Schemaless runs on 4,500+ MySQL shards handling 1M+ QPS across ride and payments data. Vitess at YouTube manages tens of thousands of shards for MySQL. Pinterest operates 8,000+ MySQL shards partitioned by user_id. These numbers are useful for calibrating mental models, but do not use them to justify premature sharding. Each of those companies has a dedicated platform team of 20+ engineers managing this infrastructure.
Capacity formula: Required shards = max(total_storage / max_shard_size, peak_write_QPS / max_shard_write_QPS, total_connections / max_connections_per_shard). Always provision for the binding constraint. Add 30% headroom for traffic spikes. If the peak-to-average ratio exceeds 3:1, factor in burst capacity separately.
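A worked instance of this formula with made-up inputs, showing the binding constraint and the headroom arithmetic:

```python
import math

# Illustrative inputs, not benchmarks.
total_storage_tb = 12.0       # dataset size
max_shard_size_tb = 1.5       # per-shard storage ceiling
peak_write_qps = 180_000
max_shard_write_qps = 20_000
total_connections = 30_000
max_conns_per_shard = 4_000

required = max(
    math.ceil(total_storage_tb / max_shard_size_tb),    # 8 shards (storage)
    math.ceil(peak_write_qps / max_shard_write_qps),    # 9 shards (writes) <- binding
    math.ceil(total_connections / max_conns_per_shard), # 8 shards (connections)
)
with_headroom = math.ceil(required * 1.3)  # +30% headroom -> 12 shards
print(required, with_headroom)
```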
Architecture Decision Record
When to Shard: Decision Matrix
| Criteria (Weight) | Don't Shard | Application-Level Sharding | Vitess / ProxySQL | CockroachDB / Spanner |
|---|---|---|---|---|
| Dataset size | < 500GB | 500GB-5TB | 1TB-50TB | 5TB+ or global distribution |
| Write QPS | < 10K | 10K-100K | 50K-500K | 100K+ with strong consistency |
| Team expertise | Small team, no DBA | Senior backend + DBA | Dedicated platform team | Cloud-native, less operational burden |
| Consistency needs | Strong (single node) | Application-managed | Configurable | Built-in distributed transactions |
| Cross-shard queries | N/A | Rare and acceptable | Moderate with Vitess scatter-gather | Native distributed joins |
Decision rules:
- Largest table fits in RAM on a single instance (< 256GB): use read replicas. Do not shard.
- If only reads need to scale, add replicas. Sharding is for write scalability. Period.
- Frequent cross-shard transactions needed: choose CockroachDB or Spanner over manual sharding. The operational cost of distributed sagas is higher than the licensing cost. I have seen teams learn this the hard way.
- On MySQL and growing past one node: Vitess is the proven path (YouTube, Slack, HubSpot). For PostgreSQL, evaluate Citus before building custom sharding. Rolling custom shard routing is a trap.
- Partitionable by tenant (SaaS): shard by tenant_id. This eliminates cross-shard queries for 99% of operations. Slack does exactly this, and it is the simplest sharding model that actually works well.
Key Points
- Horizontally partitions data across multiple database instances using a shard key
- Shard key selection is the single most important decision. Get it wrong and the result is hotspots and cross-shard queries everywhere
- Range-based, hash-based, and directory-based sharding each come with real trade-offs
- Resharding (adding or removing shards) is operationally painful. Plan capacity early
- Cross-shard transactions need two-phase commit or saga patterns. Avoid them whenever possible
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Vitess | Open Source | MySQL sharding, used by YouTube/Slack | Large-Enterprise |
| CockroachDB | Open Source | Auto-sharding, distributed SQL, strong consistency | Medium-Enterprise |
| Citus (PostgreSQL) | Open Source | PostgreSQL extension, transparent sharding | Medium-Enterprise |
| MongoDB | Open Source | Native sharding with config servers and mongos | Medium-Enterprise |
Common Mistakes
- Choosing a shard key with low cardinality, so all data ends up on one shard
- Not planning for resharding. Data growth will make the initial shard count insufficient
- Designing queries that need cross-shard joins. This defeats the whole point of sharding
- Using auto-increment IDs as shard keys. This creates write hotspots on the latest shard
- Skipping shard failure testing. Losing one shard should not take down the entire system