Database Sharding
Why It Exists
Vertical scaling hits a wall. There is always a biggest machine available, and eventually the database will outgrow it.
When a single database instance cannot keep up with write throughput, storage needs, or query concurrency, sharding splits the dataset horizontally across multiple independent database instances. Each shard holds a distinct slice of the data, delivering near-linear write scalability. Read replicas only help with reads. Sharding scales both reads and writes.
Most teams reach for sharding too early. Before going down this road, exhaust vertical scaling, read replicas, caching, and query optimization first. Sharding adds serious operational complexity, and once committed, the team is living with that complexity for years.
How It Works
Sharding Strategies
- Range-based: Assigns contiguous key ranges to shards (e.g., user_id 1-1M goes to Shard 1). Simple to build and great for range queries. The downside is real, though: if access patterns cluster in certain ranges, hotspots emerge. A new signup wave hammering the latest range shard is a classic example.
- Hash-based: Applies a hash function to the shard key and maps the result to a shard via modulo or consistent hashing (see the sketch after this list). This produces uniform data distribution across shards. The cost? Range queries turn into scatter-gather operations across every shard. For range-heavy workloads, think twice.
- Directory-based: A lookup table maps each shard key to its shard. This provides maximum flexibility, since any key can live on any shard. But the directory itself becomes a single point of failure and a potential bottleneck. It trades one problem for another.
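To make the hash-based strategy concrete, here is a minimal routing sketch in Python. The shard count, key format, and function names are illustrative assumptions, not taken from any particular system.

```python
import hashlib

SHARD_COUNT = 8  # illustrative; real deployments pick this from capacity planning

def shard_for(key: str) -> int:
    """Map a shard key to a shard index via a stable hash and modulo."""
    # Use a stable hash (md5 here) rather than Python's built-in hash(),
    # which is randomized per process and unsuitable for routing.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT

print(shard_for("user:12345"))  # deterministic shard index in [0, 8)
```

Note that with plain modulo, changing SHARD_COUNT remaps almost every key. That is exactly why consistent hashing and virtual shards (covered under resharding below) matter.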
Shard Key Selection Criteria
The shard key determines everything. I cannot overstate this. A bad shard key will cause more pain than almost any other architectural mistake.
A good shard key has high cardinality (millions of distinct values), even distribution (no single value dominates), and query alignment (most queries include the shard key in their WHERE clause). If the most common query does not include the shard key, every request becomes a scatter-gather across all shards. That is worse than not sharding at all.
Composite shard keys (e.g., tenant_id + user_id) are common in multi-tenant systems. They co-locate a tenant's data on the same shard while distributing load across tenants. This is the right call for most SaaS applications.
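A minimal sketch of composite-key routing, assuming the tenant component alone drives shard placement; the names and shard count are hypothetical:

```python
import hashlib

SHARD_COUNT = 16  # illustrative

def shard_for_tenant(tenant_id: str) -> int:
    # Hash only tenant_id so every row for a tenant lands on one shard.
    # user_id remains part of the primary key for uniqueness, but it
    # does not affect routing, so tenant queries stay single-shard.
    digest = hashlib.md5(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT

# All of a tenant's users co-locate, while different tenants spread out:
assert shard_for_tenant("acme") == shard_for_tenant("acme")
print(shard_for_tenant("acme"), shard_for_tenant("globex"))
```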
Resharding Strategies
When shards become unbalanced or capacity runs out, resharding becomes necessary. This is one of the hardest operational tasks in infrastructure. Common approaches:
- Virtual shards: Map logical shards (e.g., 1024 virtual shards) to physical nodes (a sketch follows this list). Moving a virtual shard is a data copy, not a full repartition. Vitess and Citus both work this way. This is the approach I would recommend starting with, because it provides the most flexibility later.
- Consistent hashing: Adding a node only remaps roughly 1/N of keys. DynamoDB and Cassandra rely on this. Works well for key-value workloads, less well for relational data.
- Double-write migration: Write to both old and new shard layout during the transition, then cut over reads. It is complex and error-prone, but it avoids downtime. Budget more time for this than seems necessary.
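A minimal sketch of the virtual-shard mapping from the first bullet. The node names and in-memory map are assumptions; real systems keep this assignment in a coordination store (etcd, ZooKeeper) rather than in application code.

```python
import hashlib

VIRTUAL_SHARDS = 1024

# Assumed initial assignment: 1024 logical shards spread over 4 nodes.
virtual_to_physical = {v: f"node-{v % 4}" for v in range(VIRTUAL_SHARDS)}

def virtual_shard(key: str) -> int:
    # The key -> virtual-shard function never changes, even as nodes
    # are added or removed. Only the assignment map changes.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % VIRTUAL_SHARDS

def node_for(key: str) -> str:
    return virtual_to_physical[virtual_shard(key)]

# Rebalancing: reassign one virtual shard, copy its data, done. No
# global repartition of keys is required.
virtual_to_physical[42] = "node-4"
```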
Cross-Shard Query Patterns
Queries spanning multiple shards use a scatter-gather pattern: the routing layer fans out the query to all relevant shards, collects results, and merges them. This is slow. For transactions, two-phase commit (2PC) provides atomicity but introduces a blocking coordinator. Saga patterns break the transaction into compensating steps for better availability, but at the cost of strong consistency.
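A minimal scatter-gather sketch; query_shard is a hypothetical stand-in for a real per-shard database call:

```python
import asyncio

async def query_shard(shard_id: int, sql: str) -> list[dict]:
    # Stand-in for a network round trip to one shard's database.
    await asyncio.sleep(0.01)
    return [{"shard": shard_id, "total": shard_id * 10}]

async def scatter_gather(sql: str, shard_count: int) -> list[dict]:
    # Fan out to every shard concurrently. Overall latency is the
    # slowest shard's latency, which is why scatter-gather queries
    # hurt at high percentiles.
    results = await asyncio.gather(
        *(query_shard(s, sql) for s in range(shard_count))
    )
    # Merge step: flatten and re-sort, since each shard returns its
    # own locally ordered slice.
    merged = [row for shard_rows in results for row in shard_rows]
    return sorted(merged, key=lambda r: r["total"], reverse=True)

print(asyncio.run(scatter_gather("SELECT ...", shard_count=4)))
```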
My honest take: needing cross-shard transactions frequently usually means the wrong shard key was picked or sharding was the wrong choice. Consider CockroachDB or Spanner instead of fighting this battle with manual sharding.
Production Considerations
- Monitor shard balance: Track row counts, storage size, and QPS per shard. Alert when skew exceeds 20% (a sketch follows this list). Without monitoring this, the first sign of imbalance will be a shard falling over.
- Shard-local indexes only: Global secondary indexes across shards are prohibitively expensive to maintain. Denormalize the data or use a search index (Elasticsearch) for cross-shard lookups. Accept this constraint early and design around it.
- Connection pooling: N shards means N connection pools. Use PgBouncer or ProxySQL to prevent connection exhaustion. This surprises teams who are used to managing connections to a single database.
- Backup and restore per shard: Coordinate point-in-time recovery across shards to maintain cross-shard consistency. Test the restore procedure regularly. An untested backup is not a backup.
- Schema migrations: Roll out migrations to all shards atomically or in a backward-compatible way. Use expand-and-contract migration patterns. This is tedious but non-negotiable.
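A minimal skew-check sketch for the shard-balance bullet above; the metric shape and the alert action are assumptions, and a real job would pull these numbers from your metrics system and page instead of printing.

```python
def skew(values: list[float]) -> float:
    """Relative deviation of the hottest shard from the mean."""
    mean = sum(values) / len(values)
    return (max(values) - mean) / mean if mean else 0.0

def check_balance(qps_per_shard: dict[str, float], threshold: float = 0.20) -> None:
    s = skew(list(qps_per_shard.values()))
    if s > threshold:
        print(f"ALERT: shard QPS skew {s:.0%} exceeds {threshold:.0%}")

# shard-2 is running well above the mean, so this fires.
check_balance({"shard-0": 900, "shard-1": 1100, "shard-2": 2400})
```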
Failure Scenarios
Scenario 1: Hot Shard Cascade. A celebrity user on a range-sharded system triggers 50x normal write throughput to one shard. The shard's CPU saturates at 100%, the connection pool exhausts, and the shard router starts queuing all requests routed to that shard. Here is the nasty part: application threads block waiting for the hot shard, draining the thread pool, which causes timeouts on requests to healthy shards too. One bad shard takes down everything.
Detection: Monitor per-shard P99 latency and connection pool utilization. Alert when any shard exceeds 2x baseline.
Recovery: Implement circuit breakers per shard (sketched below). Short-term, rate-limit the hot key. Long-term, reshard the hot range or put a write-behind buffer in front of it.
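A minimal per-shard circuit breaker sketch, assuming a failure-count trip condition; the thresholds and class shape are illustrative, not from any library, and a production breaker would also track half-open probe outcomes more carefully.

```python
import time

class ShardBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None  # monotonic time the breaker opened, or None

    def allow(self) -> bool:
        # An open breaker rejects immediately, so application threads
        # never block on a saturated shard and healthy shards stay fast.
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None  # half-open: permit one probe request
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# One breaker per shard: call allow() before routing, record() after.
breakers = {shard: ShardBreaker() for shard in range(8)}
```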
Scenario 2: Shard Map Corruption. A directory-based sharding deployment pushes a bad shard map update. 15% of reads get routed to the wrong shard, returning empty results or incorrect data. The application silently serves wrong data for 12 minutes before anyone notices. Silent data corruption is the worst kind of outage.
Detection: Track per-shard query rates. A sudden drop on one shard paired with a spike on another points to misrouting. Add data integrity checksums on shard-routed responses.
Recovery: Version shard maps with rollback capability. Uber learned this lesson running Schemaless: they maintain shard map history with instant rollback, and they canary shard map deploys to 1% of traffic first.
Scenario 3: Resharding Data Loss. During a live resharding migration, the double-write pipeline falls behind. The cutover happens too early, and 0.3% of writes during the transition window land on the old shard but never get replicated to the new layout.
Detection: Run a post-migration reconciliation job that compares row counts and checksums per shard key range (sketched below). Alert on any delta greater than zero.
Recovery: Keep the old shard layout in read-only mode for at least 48 hours post-migration. Stripe keeps old shards available for a full week during migrations to support reconciliation. That is the right instinct.
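A minimal reconciliation sketch, under the assumption that rows for a given key range can be fetched from both the old and new layouts; the key-range naming and the way rows arrive are hypothetical, and a real job would stream from each database.

```python
import hashlib

def range_checksum(rows: list[tuple]) -> str:
    # Sort first so the checksum is independent of fetch order.
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def reconcile(old_rows: list[tuple], new_rows: list[tuple], key_range: str) -> None:
    # Row-count delta is the cheap check; checksum catches silent
    # content divergence when counts happen to match.
    if len(old_rows) != len(new_rows):
        print(f"ALERT {key_range}: row count {len(old_rows)} != {len(new_rows)}")
    elif range_checksum(old_rows) != range_checksum(new_rows):
        print(f"ALERT {key_range}: checksum mismatch")

# One write lost during the transition window -> alert fires.
reconcile([(1, "a"), (2, "b")], [(1, "a")], key_range="users:0-1M")
```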
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Shard storage utilization | > 70% | Start planning resharding (6-12 week lead time) |
| Per-shard QPS | > 80% of tested max | Add read replicas or reshard |
| Shard count growth rate | > 2x/year | Question whether the sharding strategy itself is correct |
| Cross-shard query ratio | > 15% of total queries | Revisit the shard key. Co-location is failing |
Real-world scale references: Uber's Schemaless runs on 4,500+ MySQL shards handling 1M+ QPS across ride and payments data. Vitess at YouTube manages tens of thousands of shards for MySQL. Pinterest operates 8,000+ MySQL shards partitioned by user_id. These numbers are useful for calibrating mental models, but do not use them to justify premature sharding. Each of those companies has a dedicated platform team of 20+ engineers managing this infrastructure.
Capacity formula: Required shards = max(total_storage / max_shard_size, peak_write_QPS / max_shard_write_QPS, total_connections / max_connections_per_shard). Always provision for the binding constraint. Add 30% headroom for traffic spikes. If the peak-to-average ratio exceeds 3:1, factor in burst capacity separately.
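A worked instance of this formula with made-up inputs, showing the binding constraint and the headroom arithmetic:

```python
import math

# Illustrative inputs, not benchmarks.
total_storage_tb = 12.0       # dataset size
max_shard_size_tb = 1.5       # per-shard storage ceiling
peak_write_qps = 180_000
max_shard_write_qps = 20_000
total_connections = 30_000
max_conns_per_shard = 4_000

required = max(
    math.ceil(total_storage_tb / max_shard_size_tb),    # 8 shards (storage)
    math.ceil(peak_write_qps / max_shard_write_qps),    # 9 shards (writes) <- binding
    math.ceil(total_connections / max_conns_per_shard), # 8 shards (connections)
)
with_headroom = math.ceil(required * 1.3)  # +30% headroom -> 12 shards
print(required, with_headroom)
```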
Architecture Decision Record
When to Shard: Decision Matrix
| Criteria (Weight) | Don't Shard | Application-Level Sharding | Vitess / ProxySQL | CockroachDB / Spanner |
|---|---|---|---|---|
| Dataset size | < 500GB | 500GB-5TB | 1TB-50TB | 5TB+ or global distribution |
| Write QPS | < 10K | 10K-100K | 50K-500K | 100K+ with strong consistency |
| Team expertise | Small team, no DBA | Senior backend + DBA | Dedicated platform team | Cloud-native, less operational burden |
| Consistency needs | Strong (single node) | Application-managed | Configurable | Built-in distributed transactions |
| Cross-shard queries | N/A | Rare and acceptable | Moderate with Vitess scatter-gather | Native distributed joins |
Decision rules:
- Largest table fits in RAM on a single instance (< 256GB): use read replicas. Do not shard.
- If only reads need to scale, add replicas. Sharding is for write scalability. Period.
- Frequent cross-shard transactions needed: choose CockroachDB or Spanner over manual sharding. The operational cost of distributed sagas is higher than the licensing cost. I have seen teams learn this the hard way.
- On MySQL and growing past one node: Vitess is the proven path (YouTube, Slack, HubSpot). For PostgreSQL, evaluate Citus before building custom sharding. Rolling custom shard routing is a trap.
- Partitionable by tenant (SaaS): shard by tenant_id. This eliminates cross-shard queries for 99% of operations. Slack does exactly this, and it is the simplest sharding model that actually works well.
Key Points
- Horizontally partitions data across multiple database instances using a shard key
- Shard key selection is the single most important decision. Get it wrong and the result is hotspots and cross-shard queries everywhere
- Range-based, hash-based, and directory-based sharding each come with real trade-offs
- Resharding (adding or removing shards) is operationally painful. Plan capacity early
- Cross-shard transactions need two-phase commit or saga patterns. Avoid them whenever possible
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Vitess | Open Source | MySQL sharding, used by YouTube/Slack | Large-Enterprise |
| CockroachDB | Open Source | Auto-sharding, distributed SQL, strong consistency | Medium-Enterprise |
| Citus (PostgreSQL) | Open Source | PostgreSQL extension, transparent sharding | Medium-Enterprise |
| MongoDB | Open Source | Native sharding with config servers and mongos | Medium-Enterprise |
Common Mistakes
- Choosing a shard key with low cardinality, so all data ends up on one shard
- Not planning for resharding. Data growth will make the initial shard count insufficient
- Designing queries that need cross-shard joins. This defeats the whole point of sharding
- Using auto-increment IDs as shard keys. This creates write hotspots on the latest shard
- Skipping shard failure testing. Losing one shard should not take down the entire system