Replication & Consistency
Why It Exists
Disks die. Networks split. Entire data centers go dark. None of this is theoretical. Replication keeps copies of data on multiple nodes so that one failure does not cause downtime. But it is not just about durability. Replicas also absorb read traffic, which matters a lot when the workload is 10:1 reads to writes or higher. Most are.
How It Works
Replication Topologies
- Single-leader (primary-replica): All writes hit one primary node, which streams changes out to replicas. Simple, well-understood, and boring in the best way. The downside: the primary is the write bottleneck and a single point of failure until failover kicks in.
- Multi-leader: Multiple nodes accept writes and replicate to each other. This typically appears in multi-datacenter setups where each DC has a local leader. The catch is write conflicts. A strategy is needed: last-writer-wins (LWW), vector clocks, or application-level merge logic. Each has tradeoffs and none of them are free.
- Leaderless (Dynamo-style): Any node accepts reads and writes. Consistency comes from quorums: a write needs W acknowledgments, a read needs R responses, and R + W > N guarantees overlap. Cassandra and DynamoDB both work this way.
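The quorum rule is just arithmetic. A minimal sketch of the overlap check, with illustrative values:

```python
# Quorum overlap check for a leaderless system (illustrative values).
# N = replication factor, W = write acknowledgments, R = read responses.
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """R + W > N guarantees every read quorum intersects the latest write quorum."""
    return r + w > n

# Typical Dynamo-style configuration: N=3, W=2, R=2.
assert quorum_overlaps(n=3, w=2, r=2)      # overlap guaranteed
assert not quorum_overlaps(n=3, w=1, r=1)  # stale reads are possible
```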
Consistency Models
| Model | Guarantee | Latency | Example |
|---|---|---|---|
| Strong (Linearizable) | Reads reflect the latest write globally | Highest | CockroachDB, Spanner |
| Causal | Causally related operations appear in order | Medium | MongoDB sessions |
| Read-your-own-writes | A client sees its own writes immediately | Low-Medium | PostgreSQL sync replica routing |
| Eventual | All replicas converge given no new writes | Lowest | Cassandra with CL=ONE |
CAP Theorem in Practice
People love debating CAP on whiteboards, but here is what it actually means in production. During a network partition, pick one: reject writes (CP, keep consistency, lose availability) or accept writes on both sides of the partition (AP, keep availability, lose consistency). Partitions are rare but they do happen. The practical answer is that most systems allow tuning this per query. Cassandra's consistency levels (ONE, QUORUM, ALL) provide that tradeoff on every single operation.
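As a concrete illustration of per-operation tuning, here is a sketch using the Python cassandra-driver; the keyspace, tables, and contact points are hypothetical:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical cluster and keyspace.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("shop")

# An order write leans CP: QUORUM fails if a majority of replicas is unreachable.
write = SimpleStatement(
    "INSERT INTO orders (id, total) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, 19.99))

# A product-page read leans AP: ONE answers even during a partition, possibly stale.
read = SimpleStatement(
    "SELECT * FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read, (7,)).one()
```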
Consensus Protocols
Raft and Paxos solve one specific problem: getting multiple nodes to agree on a value. That sounds simple until someone actually tries to implement it. Raft won the popularity contest because it is genuinely easier to understand. It breaks the problem into leader election, log replication, and safety as separate subproblems. CockroachDB, etcd, and TiKV all run on Raft. A Raft cluster with N nodes can survive (N-1)/2 failures.
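The sizing math is worth internalizing; a small sketch:

```python
# Raft fault tolerance: a cluster of n voting members needs a strict majority
# (n // 2 + 1) to elect a leader and commit entries, so it can lose
# (n - 1) // 2 members and keep accepting writes.
def raft_fault_tolerance(n: int) -> int:
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes -> quorum {n // 2 + 1}, tolerates {raft_fault_tolerance(n)} failure(s)")
# 3 nodes -> quorum 2, tolerates 1 failure(s)
# 5 nodes -> quorum 3, tolerates 2 failure(s)
# 7 nodes -> quorum 4, tolerates 3 failure(s)
```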
Production Considerations
- Monitor replication lag. Set alerts when async replicas fall more than a few seconds behind. In PostgreSQL, check pg_stat_replication; in MySQL, SHOW SLAVE STATUS. Without this monitoring, problems are guaranteed. A lag-check sketch follows this list.
- Route reads after writes carefully. After a write, send that user's reads to the primary (or a sync replica) for a short window. Otherwise they will see stale data and file bug reports that are technically correct.
- Automate failover. Patroni for PostgreSQL, Orchestrator for MySQL. Manual failover at 3 AM is when mistakes happen. The on-call engineer should not be making judgment calls about replica promotion while half-asleep.
- Accept the latency tax for cross-region sync replication. Expect 50-200ms per write. If that is too expensive, go async with conflict resolution, or use a globally distributed database like Spanner or CockroachDB that handles it internally.
- Prevent split-brain. When a primary becomes unreachable, fencing (STONITH) makes sure the old primary cannot keep accepting writes after a new one gets elected. Skip this and data divergence is a matter of when, not if.
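A minimal lag-check sketch against pg_stat_replication, assuming PostgreSQL 10+ (for the replay_lag column) and psycopg2; the threshold and connection details are placeholders:

```python
import psycopg2

LAG_ALERT_SECONDS = 5  # placeholder threshold for user-facing replicas

# Connection parameters are placeholders.
conn = psycopg2.connect("host=primary.internal dbname=postgres user=monitor")
with conn, conn.cursor() as cur:
    # replay_lag is an interval (PostgreSQL 10+); convert it to seconds.
    cur.execute("""
        SELECT application_name,
               COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds
        FROM pg_stat_replication
    """)
    for name, lag in cur.fetchall():
        if lag > LAG_ALERT_SECONDS:
            print(f"ALERT: replica {name} is {lag:.1f}s behind")
```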
Failure Scenarios
Scenario 1: Split-Brain Writes. A network partition cuts the primary off from its replicas. The failover system promotes a replica to new primary, but the old primary is still alive and happily accepting writes from clients on its side of the partition. Now two nodes are diverging. When the partition heals, the result is two conflicting datasets. Detection: Watch the number of nodes claiming primary role. If it is ever more than 1, page someone immediately. Track write epoch/term numbers so a primary with a stale term rejects writes. Recovery: Fence the old primary before the new one starts accepting writes, either by force-killing it (STONITH) or by rejecting writes that carry a stale epoch (fencing tokens). GitHub had a 24-hour outage in 2018 from exactly this scenario on their MySQL cluster. They ended up manually reconciling divergent writes by replaying the old primary's binlog against the new primary. Not fun.
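The epoch/term idea reduces to a small check at the storage layer. A sketch with illustrative names, not any particular product's API:

```python
class FencedStore:
    """Rejects writes carrying an epoch older than the highest one seen.

    The failover coordinator bumps the epoch when it promotes a new primary,
    so a deposed primary still serving old clients cannot commit writes.
    """

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: dict[str, str] = {}

    def write(self, key: str, value: str, epoch: int) -> None:
        if epoch < self.highest_epoch:
            raise PermissionError(
                f"stale epoch {epoch} < {self.highest_epoch}: write fenced off"
            )
        self.highest_epoch = epoch
        self.data[key] = value

store = FencedStore()
store.write("balance:42", "100", epoch=7)      # new primary, epoch 7
try:
    store.write("balance:42", "90", epoch=6)   # old primary with a stale epoch
except PermissionError as err:
    print(err)
```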
Scenario 2: Replication Lag Amplification. An async replica falls 30 seconds behind during a traffic spike. The application routes 80% of reads to replicas, so now users start seeing stale data. Profile updates "disappear" for half a minute. Worse, a reporting job reading from the lagging replica generates bad aggregates, and someone makes a business decision based on wrong numbers. Detection: Track replication_lag_seconds per replica. Alert at > 5s for user-facing replicas, > 30s for analytics replicas. Also watch the derivative. If lag is growing, the replica is falling further behind and action is needed. Recovery: Temporarily redirect reads to the primary. LinkedIn monitors lag across 1,000+ MySQL replicas and automatically pulls lagging ones out of the read pool when they exceed 10 seconds.
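The recovery step is just read routing. A sketch of pulling lagging replicas out of the pool and falling back to the primary; the cutoff and host names are hypothetical:

```python
import random

MAX_LAG_SECONDS = 10  # hypothetical cutoff, in the spirit of the policy above

def pick_read_host(primary: str, replica_lag: dict[str, float]) -> str:
    """Route a read to a healthy replica, or to the primary if none qualify."""
    healthy = [host for host, lag in replica_lag.items() if lag <= MAX_LAG_SECONDS]
    return random.choice(healthy) if healthy else primary

# Example: one replica fell 30s behind during a spike and is skipped.
lag = {"replica-a": 0.4, "replica-b": 30.2, "replica-c": 1.1}
print(pick_read_host("primary", lag))  # never returns replica-b
```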
Scenario 3: Consensus Quorum Loss. Two nodes fail at the same time in a 3-node Raft cluster (say, a correlated rack failure). The surviving node cannot form a quorum. Reads still work from that node, but writes are completely dead. This partial outage confuses monitoring because the system looks "mostly up" while actually being unable to accept any mutations. Detection: Alert on Raft leader election failures and write rejection rates. Track the count of healthy voting members and alert when it drops below (N/2 + 1). Recovery: Add replacement nodes and wait for state transfer to complete. CockroachDB recommends 5-node clusters for production specifically to tolerate 2 simultaneous failures. Spread Raft groups across at least 3 availability zones.
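The detection rule is a membership count check; a sketch:

```python
def quorum_at_risk(healthy_voters: int, total_voters: int) -> bool:
    """Alert when healthy voting members drop below a strict majority (N/2 + 1)."""
    majority = total_voters // 2 + 1
    return healthy_voters < majority

# 3-node cluster with two nodes down: reads may still work, writes cannot commit.
assert quorum_at_risk(healthy_voters=1, total_voters=3)
assert not quorum_at_risk(healthy_voters=3, total_voters=5)
```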
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Replication lag | > 1s (OLTP), > 60s (analytics) | Investigate replica throughput bottleneck |
| Replica count per primary | > 5 replicas | Primary's network bandwidth becomes the bottleneck. Use cascading replication |
| Failover time (RTO) | > 30s for critical services | Implement automated failover with pre-elected standby |
| WAL generation rate | > 80% of network link capacity | Upgrade network or reduce write throughput |
Real-world scale references: Google Spanner replicates across 5 zones using synchronized TrueTime clocks, hitting < 10ms commit latency for global strong consistency. Netflix runs 2,500+ Cassandra nodes across 3 regions with LOCAL_QUORUM for most reads. They can lose an entire region with zero data loss. Slack's Vitess deployment replicates each shard to 3 replicas with semi-synchronous replication, targeting < 1s failover via Orchestrator.
Capacity formula for replicas: Required replicas = ceil(peak_read_QPS / single_node_read_QPS) + 1 (failover spare). For cross-region, add at least one replica per region. Each replica costs replication bandwidth equal to WAL_generation_rate * 1.2 (overhead). A PostgreSQL primary generating 100MB/s of WAL needs > 120MB/s sustained network throughput per replica link. Plan for that or replicas will starve.
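The same formula, made concrete; the workload numbers below are placeholders:

```python
import math

def required_replicas(peak_read_qps: float, single_node_read_qps: float,
                      extra_regions: int = 0) -> int:
    """ceil(peak_read_QPS / single_node_read_QPS) + 1 failover spare,
    plus at least one replica per additional region."""
    return math.ceil(peak_read_qps / single_node_read_qps) + 1 + extra_regions

def replica_link_bandwidth_mb_per_s(wal_rate_mb_per_s: float, overhead: float = 1.2) -> float:
    """Sustained bandwidth needed on each primary -> replica link."""
    return wal_rate_mb_per_s * overhead

# Placeholder workload: 45k peak read QPS, 12k QPS per replica, one extra region.
print(required_replicas(45_000, 12_000, extra_regions=1))   # 6 replicas
print(replica_link_bandwidth_mb_per_s(100))                 # 120.0 MB/s per replica link
```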
Architecture Decision Record
Replication Strategy Decision Matrix
| Criteria (Weight) | Single-Leader Async | Single-Leader Semi-Sync | Multi-Leader | Leaderless (Dynamo) |
|---|---|---|---|---|
| Write latency tolerance | < 5ms required | 5-20ms acceptable | Per-DC low latency needed | Tunable per-request |
| Consistency requirement | Eventual OK | Read-your-writes needed | Conflict resolution acceptable | Tunable (ONE to ALL) |
| Geographic distribution | Single region | Single region, multi-AZ | Multi-region writes | Multi-region |
| Operational complexity | Low | Low-Medium | High (conflict resolution) | Medium |
| Data loss tolerance (RPO) | Seconds of loss OK | Zero loss | Depends on conflict resolution | Tunable via W parameter |
Decision rules:
- If single-region, read-heavy, and some data loss is tolerable: Single-leader async replication (PostgreSQL streaming, MySQL async). Simplest to run and debug. Start here unless there's a reason not to.
- If losing a single committed write is unacceptable (payments, financial transactions): Semi-synchronous replication with at least one sync replica. Every committed write must exist on at least 2 nodes before the client gets an acknowledgment.
- If writes in multiple regions are needed with < 100ms write latency per region: Multi-leader or CRDTs. That means signing up for conflict resolution complexity. CockroachDB or Spanner take that problem away but add per-write latency instead.
- If the workload is write-heavy and consistency can be tuned per query: Leaderless (Cassandra, DynamoDB). This is what Netflix and Apple run for workloads above 10M writes/second.
- If the team is fewer than 5 engineers: Stay away from multi-leader and leaderless. The operational cost of debugging consistency issues is overwhelming. Pick a managed solution (Aurora, Cloud SQL, PlanetScale) and spend the time on product instead.
Key Points
- Copies data across multiple nodes for fault tolerance and read scalability
- CAP theorem constrains the choices. Partition-tolerant systems must pick between consistency and availability
- Synchronous replication guarantees consistency but adds real write latency
- Eventual consistency works fine for most read-heavy workloads when the application is designed around it
- Consensus protocols (Raft, Paxos) handle leader election and keep replicated state machines in sync
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| PostgreSQL | Open Source | Streaming replication, synchronous commit options | Small-Enterprise |
| CockroachDB | Open Source | Raft-based automatic replication, strong consistency | Medium-Enterprise |
| Cassandra | Open Source | Tunable consistency, multi-DC replication | Large-Enterprise |
| TiDB | Open Source | MySQL-compatible, Raft-based, HTAP | Large-Enterprise |
Common Mistakes
- Reading from async replicas right after writing. Classic read-your-own-writes violation
- Treating eventual consistency as 'eventually correct.' Conflicts still need explicit resolution
- Ignoring replication lag. Stale reads cause subtle bugs that are painful to track down
- Running synchronous replication across data centers. The latency will destroy write throughput
- Never testing failover. Promoting a replica to primary should be rehearsed, not figured out during a 3 AM incident