Replication & Consistency
Why It Exists
Disks die. Networks split. Entire data centers go dark. None of this is theoretical. Replication keeps copies of data on multiple nodes so that one failure does not cause downtime. But it is not just about durability. Replicas also absorb read traffic, which matters a lot when the workload is 10:1 reads to writes or higher. Most are.
How It Works
Replication Topologies
- Single-leader (primary-replica): All writes hit one primary node, which streams changes out to replicas. Simple, well-understood, and boring in the best way. The downside: the primary is the write bottleneck and a single point of failure until failover kicks in.
- Multi-leader: Multiple nodes accept writes and replicate to each other. This typically appears in multi-datacenter setups where each DC has a local leader. The catch is write conflicts. A strategy is needed: last-writer-wins (LWW), vector clocks, or application-level merge logic. Each has tradeoffs and none of them are free.
- Leaderless (Dynamo-style): Any node accepts reads and writes. Consistency comes from quorums: a write needs W acknowledgments, a read needs R responses, and R + W > N guarantees overlap. Cassandra and DynamoDB both work this way.
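The quorum rule is just arithmetic. A minimal sketch of the overlap check, with illustrative values:

```python
# Quorum overlap check for a leaderless system (illustrative values).
# N = replication factor, W = write acknowledgments, R = read responses.
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """R + W > N guarantees every read quorum intersects the latest write quorum."""
    return r + w > n

# Typical Dynamo-style configuration: N=3, W=2, R=2.
assert quorum_overlaps(n=3, w=2, r=2)      # overlap guaranteed
assert not quorum_overlaps(n=3, w=1, r=1)  # stale reads are possible
```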
Consistency Models
| Model | Guarantee | Latency | Example |
|---|---|---|---|
| Strong (Linearizable) | Reads reflect the latest write globally | Highest | CockroachDB, Spanner |
| Causal | Causally related operations appear in order | Medium | MongoDB sessions |
| Read-your-own-writes | A client sees its own writes immediately | Low-Medium | PostgreSQL sync replica routing |
| Eventual | All replicas converge given no new writes | Lowest | Cassandra with CL=ONE |
CAP Theorem in Practice
People love debating CAP on whiteboards, but here is what it actually means in production. During a network partition, pick one: reject writes (CP, keep consistency, lose availability) or accept writes on both sides of the partition (AP, keep availability, lose consistency). Partitions are rare but they do happen. The practical answer is that most systems allow tuning this per query. Cassandra's consistency levels (ONE, QUORUM, ALL) provide that tradeoff on every single operation.
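As a concrete illustration of per-operation tuning, here is a sketch using the Python cassandra-driver; the keyspace, tables, and contact points are hypothetical:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical cluster and keyspace.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("shop")

# An order write leans CP: QUORUM fails if a majority of replicas is unreachable.
write = SimpleStatement(
    "INSERT INTO orders (id, total) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, 19.99))

# A product-page read leans AP: ONE answers even during a partition, possibly stale.
read = SimpleStatement(
    "SELECT * FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read, (7,)).one()
```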
Consensus Protocols
Raft and Paxos solve one specific problem: getting multiple nodes to agree on a value. That sounds simple until someone actually tries to implement it. Raft won the popularity contest because it is genuinely easier to understand. It breaks the problem into leader election, log replication, and safety as separate subproblems. CockroachDB, etcd, and TiKV all run on Raft. A Raft cluster with N nodes can survive (N-1)/2 failures.
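The sizing math is worth internalizing; a small sketch:

```python
# Raft fault tolerance: a cluster of n voting members needs a strict majority
# (n // 2 + 1) to elect a leader and commit entries, so it can lose
# (n - 1) // 2 members and keep accepting writes.
def raft_fault_tolerance(n: int) -> int:
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes -> quorum {n // 2 + 1}, tolerates {raft_fault_tolerance(n)} failure(s)")
# 3 nodes -> quorum 2, tolerates 1 failure(s)
# 5 nodes -> quorum 3, tolerates 2 failure(s)
# 7 nodes -> quorum 4, tolerates 3 failure(s)
```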
Production Considerations
- Monitor replication lag. Set alerts when async replicas fall more than a few seconds behind. In PostgreSQL, check pg_stat_replication; in MySQL, SHOW SLAVE STATUS. Without this monitoring, problems are guaranteed. A lag-check sketch follows this list.
- Route reads after writes carefully. After a write, send that user's reads to the primary (or a sync replica) for a short window. Otherwise they will see stale data and file bug reports that are technically correct.
- Automate failover. Patroni for PostgreSQL, Orchestrator for MySQL. Manual failover at 3 AM is when mistakes happen. The on-call engineer should not be making judgment calls about replica promotion while half-asleep.
- Accept the latency tax for cross-region sync replication. Expect 50-200ms per write. If that is too expensive, go async with conflict resolution, or use a globally distributed database like Spanner or CockroachDB that handles it internally.
- Prevent split-brain. When a primary becomes unreachable, fencing (STONITH) makes sure the old primary cannot keep accepting writes after a new one gets elected. Skip this and data divergence is a matter of when, not if.
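A minimal lag-check sketch against pg_stat_replication, assuming PostgreSQL 10+ (for the replay_lag column) and psycopg2; the threshold and connection details are placeholders:

```python
import psycopg2

LAG_ALERT_SECONDS = 5  # placeholder threshold for user-facing replicas

# Connection parameters are placeholders.
conn = psycopg2.connect("host=primary.internal dbname=postgres user=monitor")
with conn, conn.cursor() as cur:
    # replay_lag is an interval (PostgreSQL 10+); convert it to seconds.
    cur.execute("""
        SELECT application_name,
               COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds
        FROM pg_stat_replication
    """)
    for name, lag in cur.fetchall():
        if lag > LAG_ALERT_SECONDS:
            print(f"ALERT: replica {name} is {lag:.1f}s behind")
```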
Failure Scenarios
Scenario 1: Split-Brain Writes. A network partition cuts the primary off from its replicas. The failover system promotes a replica to new primary, but the old primary is still alive and happily accepting writes from clients on its side of the partition. Now two nodes are diverging. When the partition heals, the result is two conflicting datasets. Detection: Watch the number of nodes claiming primary role. If it is ever more than 1, page someone immediately. Track write epoch/term numbers so a primary with a stale term rejects writes. Recovery: Fence the old primary before the new one starts accepting writes, either by force-killing it (STONITH) or by rejecting writes that carry a stale epoch (fencing tokens). GitHub had a 24-hour outage in 2018 from exactly this scenario on their MySQL cluster. They ended up manually reconciling divergent writes by replaying the old primary's binlog against the new primary. Not fun.
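The epoch/term idea reduces to a small check at the storage layer. A sketch with illustrative names, not any particular product's API:

```python
class FencedStore:
    """Rejects writes carrying an epoch older than the highest one seen.

    The failover coordinator bumps the epoch when it promotes a new primary,
    so a deposed primary still serving old clients cannot commit writes.
    """

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: dict[str, str] = {}

    def write(self, key: str, value: str, epoch: int) -> None:
        if epoch < self.highest_epoch:
            raise PermissionError(
                f"stale epoch {epoch} < {self.highest_epoch}: write fenced off"
            )
        self.highest_epoch = epoch
        self.data[key] = value

store = FencedStore()
store.write("balance:42", "100", epoch=7)      # new primary, epoch 7
try:
    store.write("balance:42", "90", epoch=6)   # old primary with a stale epoch
except PermissionError as err:
    print(err)
```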
Scenario 2: Replication Lag Amplification. An async replica falls 30 seconds behind during a traffic spike. The application routes 80% of reads to replicas, so now users start seeing stale data. Profile updates "disappear" for half a minute. Worse, a reporting job reading from the lagging replica generates bad aggregates, and someone makes a business decision based on wrong numbers. Detection: Track replication_lag_seconds per replica. Alert at > 5s for user-facing replicas, > 30s for analytics replicas. Also watch the derivative. If lag is growing, the replica is falling further behind and action is needed. Recovery: Temporarily redirect reads to the primary. LinkedIn monitors lag across 1,000+ MySQL replicas and automatically pulls lagging ones out of the read pool when they exceed 10 seconds.
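The recovery step is just read routing. A sketch of pulling lagging replicas out of the pool and falling back to the primary; the cutoff and host names are hypothetical:

```python
import random

MAX_LAG_SECONDS = 10  # hypothetical cutoff, in the spirit of the policy above

def pick_read_host(primary: str, replica_lag: dict[str, float]) -> str:
    """Route a read to a healthy replica, or to the primary if none qualify."""
    healthy = [host for host, lag in replica_lag.items() if lag <= MAX_LAG_SECONDS]
    return random.choice(healthy) if healthy else primary

# Example: one replica fell 30s behind during a spike and is skipped.
lag = {"replica-a": 0.4, "replica-b": 30.2, "replica-c": 1.1}
print(pick_read_host("primary", lag))  # never returns replica-b
```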
Scenario 3: Consensus Quorum Loss. Two nodes fail at the same time in a 3-node Raft cluster (say, a correlated rack failure). The surviving node cannot form a quorum. Reads still work from that node, but writes are completely dead. This partial outage confuses monitoring because the system looks "mostly up" while actually being unable to accept any mutations. Detection: Alert on Raft leader election failures and write rejection rates. Track the count of healthy voting members and alert when it drops below (N/2 + 1). Recovery: Add replacement nodes and wait for state transfer to complete. CockroachDB recommends 5-node clusters for production specifically to tolerate 2 simultaneous failures. Spread Raft groups across at least 3 availability zones.
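The detection rule is a membership count check; a sketch:

```python
def quorum_at_risk(healthy_voters: int, total_voters: int) -> bool:
    """Alert when healthy voting members drop below a strict majority (N/2 + 1)."""
    majority = total_voters // 2 + 1
    return healthy_voters < majority

# 3-node cluster with two nodes down: reads may still work, writes cannot commit.
assert quorum_at_risk(healthy_voters=1, total_voters=3)
assert not quorum_at_risk(healthy_voters=3, total_voters=5)
```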
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Replication lag | > 1s (OLTP), > 60s (analytics) | Investigate replica throughput bottleneck |
| Replica count per primary | > 5 replicas | Primary's network bandwidth becomes the bottleneck. Use cascading replication |
| Failover time (RTO) | > 30s for critical services | Implement automated failover with pre-elected standby |
| WAL generation rate | > 80% of network link capacity | Upgrade network or reduce write throughput |
Real-world scale references: Google Spanner replicates across 5 zones using synchronized TrueTime clocks, hitting < 10ms commit latency for global strong consistency. Netflix runs 2,500+ Cassandra nodes across 3 regions with LOCAL_QUORUM for most reads. They can lose an entire region with zero data loss. Slack's Vitess deployment replicates each shard to 3 replicas with semi-synchronous replication, targeting < 1s failover via Orchestrator.
Capacity formula for replicas: Required replicas = ceil(peak_read_QPS / single_node_read_QPS) + 1 (failover spare). For cross-region, add at least one replica per region. Each replica costs replication bandwidth equal to WAL_generation_rate * 1.2 (overhead). A PostgreSQL primary generating 100MB/s of WAL needs > 120MB/s sustained network throughput per replica link. Plan for that or replicas will starve.
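The same formula, made concrete; the workload numbers below are placeholders:

```python
import math

def required_replicas(peak_read_qps: float, single_node_read_qps: float,
                      extra_regions: int = 0) -> int:
    """ceil(peak_read_QPS / single_node_read_QPS) + 1 failover spare,
    plus at least one replica per additional region."""
    return math.ceil(peak_read_qps / single_node_read_qps) + 1 + extra_regions

def replica_link_bandwidth_mb_per_s(wal_rate_mb_per_s: float, overhead: float = 1.2) -> float:
    """Sustained bandwidth needed on each primary -> replica link."""
    return wal_rate_mb_per_s * overhead

# Placeholder workload: 45k peak read QPS, 12k QPS per replica, one extra region.
print(required_replicas(45_000, 12_000, extra_regions=1))   # 6 replicas
print(replica_link_bandwidth_mb_per_s(100))                 # 120.0 MB/s per replica link
```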
Architecture Decision Record
Replication Strategy Decision Matrix
| Criteria (Weight) | Single-Leader Async | Single-Leader Semi-Sync | Multi-Leader | Leaderless (Dynamo) |
|---|---|---|---|---|
| Write latency tolerance | < 5ms required | 5-20ms acceptable | Per-DC low latency needed | Tunable per-request |
| Consistency requirement | Eventual OK | Read-your-writes needed | Conflict resolution acceptable | Tunable (ONE to ALL) |
| Geographic distribution | Single region | Single region, multi-AZ | Multi-region writes | Multi-region |
| Operational complexity | Low | Low-Medium | High (conflict resolution) | Medium |
| Data loss tolerance (RPO) | Seconds of loss OK | Zero loss | Depends on conflict resolution | Tunable via W parameter |
Decision rules:
- If single-region, read-heavy, and some data loss is tolerable: Single-leader async replication (PostgreSQL streaming, MySQL async). Simplest to run and debug. Start here unless there's a reason not to.
- If losing a single committed write is unacceptable (payments, financial transactions): Semi-synchronous replication with at least one sync replica. Every committed write must exist on at least 2 nodes before the client gets an acknowledgment.
- If writes in multiple regions are needed with < 100ms write latency per region: Multi-leader or CRDTs. That means signing up for conflict resolution complexity. CockroachDB or Spanner take that problem away but add per-write latency instead.
- If the workload is write-heavy and consistency can be tuned per query: Leaderless (Cassandra, DynamoDB). This is what Netflix and Apple run for workloads above 10M writes/second.
- If the team is fewer than 5 engineers: Stay away from multi-leader and leaderless. The operational cost of debugging consistency issues is overwhelming. Pick a managed solution (Aurora, Cloud SQL, PlanetScale) and spend the time on product instead.
Key Points
- Copies data across multiple nodes for fault tolerance and read scalability
- CAP theorem constrains the choices. Partition-tolerant systems must pick between consistency and availability
- Synchronous replication guarantees consistency but adds real write latency
- Eventual consistency works fine for most read-heavy workloads when the application is designed around it
- Consensus protocols (Raft, Paxos) handle leader election and keep replicated state machines in sync
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| PostgreSQL | Open Source | Streaming replication, synchronous commit options | Small-Enterprise |
| CockroachDB | Open Source | Raft-based automatic replication, strong consistency | Medium-Enterprise |
| Cassandra | Open Source | Tunable consistency, multi-DC replication | Large-Enterprise |
| TiDB | Open Source | MySQL-compatible, Raft-based, HTAP | Large-Enterprise |
Common Mistakes
- Reading from async replicas right after writing. Classic read-your-own-writes violation
- Treating eventual consistency as 'eventually correct.' Conflicts still need explicit resolution
- Ignoring replication lag. Stale reads cause subtle bugs that are painful to track down
- Running synchronous replication across data centers. The latency will destroy write throughput
- Never testing failover. Promoting a replica to primary should be rehearsed, not figured out during a 3 AM incident