Network Partition Handling
Partitions in the Real World
Textbook network partitions are clean: two halves of the network can't communicate at all. Real partitions are messy. You get 30% packet loss, 5-second latency spikes, connections that work in one direction but not the other. This partial failure is harder to handle because your systems can't cleanly decide whether a remote node is up or down.
A heartbeat that usually takes 10ms suddenly takes 8 seconds. Is the node down? Is the network slow? Your health check timeout is 5 seconds, so the node is marked unhealthy. But it's still running, still accepting writes, still thinking it's the leader. This is where split-brain begins.
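To make the ambiguity concrete, here's a minimal Python sketch of a timeout-based check (the host, port, and 5-second threshold are illustrative). A node that would answer in 8 seconds and a node that is truly dead look identical to the caller:

```python
import socket

HEALTH_CHECK_TIMEOUT = 5.0  # seconds; anything slower is declared dead

def is_healthy(host: str, port: int) -> bool:
    """Timeout-based health check: it cannot distinguish a dead node
    from a live one behind a degraded link."""
    try:
        # A node that would answer in 8s is alive, but this call gives
        # up at 5s and we mark it unhealthy anyway.
        with socket.create_connection((host, port), timeout=HEALTH_CHECK_TIMEOUT):
            return True
    except OSError:  # a timeout and "connection refused" land here alike
        return False
```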
Split-Brain Scenarios
Split-brain happens when a partition causes two nodes to both believe they're the leader. Each accepts writes independently. When the partition heals, you have two divergent datasets that need to be reconciled.
Redis Sentinel with only 2 sentinels is a classic split-brain setup. Each sentinel is in a different AZ. During a partition, each sentinel can see its local Redis instance but not the other sentinel. Neither side can form a majority, yet a permissive configuration can still end up promoting a replica on each side. Now you have two masters.
The fix: always deploy an odd number of sentinels (3 or 5) across 3+ availability zones. Quorum-based election ensures only the majority partition can elect a new leader. The minority partition knows it's in the minority and refuses to promote.
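The arithmetic is worth spelling out. A small Python sketch of the majority rule (illustrative, not Sentinel's actual implementation):

```python
def has_quorum(reachable: int, total: int) -> bool:
    """Only the partition side that can see a strict majority of
    sentinels is allowed to elect a new leader."""
    return reachable >= total // 2 + 1

# 2 sentinels split 1/1: neither side reaches the 2 votes needed.
assert not has_quorum(1, 2)
# 3 sentinels split 2/1: the majority side promotes; the minority refuses.
assert has_quorum(2, 3) and not has_quorum(1, 3)
```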
CAP Theorem in Practice
The CAP theorem gets oversimplified. It doesn't say "pick two of three." It says during a partition (which will happen), you choose between consistency and availability. When there's no partition, you can have both.
Most systems default to availability without being explicit about it. Your API keeps serving requests during a partition, but some of those responses contain stale data from a lagging replica. You're choosing AP (available but inconsistent) without telling your users.
For financial systems, this is unacceptable. A user checking their balance should never see a stale value. For social media timelines, seeing a post 30 seconds late is fine. Know which category each of your endpoints falls into and configure consistency levels accordingly.
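One way to make that categorization explicit is to declare it in code rather than inherit whatever the default read path does. A hedged sketch (the endpoint names and routing helper are hypothetical):

```python
from enum import Enum

class Consistency(Enum):
    STRONG = "strong"      # read from the primary: correct, but may be unavailable during a partition
    EVENTUAL = "eventual"  # read from any replica: available, but possibly stale

# Classify every endpoint explicitly; unlisted endpoints default to STRONG.
ENDPOINT_CONSISTENCY = {
    "/accounts/balance": Consistency.STRONG,    # stale balances are unacceptable
    "/timeline":         Consistency.EVENTUAL,  # a 30-second-old post is fine
}

def route_read(endpoint: str, primary, replicas):
    """Pick a backend based on the endpoint's declared consistency class."""
    level = ENDPOINT_CONSISTENCY.get(endpoint, Consistency.STRONG)
    if level is Consistency.STRONG or not replicas:
        return primary
    return replicas[0]
```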
Cross-Region Partitions
Cross-region partitions are rarer but more impactful. If your primary database is in us-east-1 and your disaster recovery is in us-west-2, a partition between regions means your DR site can't replicate. When the partition heals, the replication lag could represent hours of data.
The decision: do you failover to DR during the partition (accepting data loss) or do you wait for the partition to heal (accepting downtime)? This decision should be made in advance, documented in a runbook, and based on the specific RPO/RTO requirements of your business. Making this decision at 3am during an incident is how you get it wrong.
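That runbook entry can even be reduced to a pre-agreed rule. A hedged sketch, where the lag estimate and the RPO/RTO budgets are whatever your monitoring and your business actually define:

```python
def dr_decision(replication_lag_s: float, expected_outage_s: float,
                rpo_s: float, rto_s: float) -> str:
    """Pre-agreed failover rule: compare what each option costs against
    the budgets the business signed off on. Illustrative only."""
    if replication_lag_s <= rpo_s:
        return "failover"   # data loss fits within RPO: fail over to DR
    if expected_outage_s <= rto_s:
        return "wait"       # downtime fits within RTO: wait for the heal
    return "escalate"       # both budgets blown: wake people up, don't improvise
```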
Testing Partition Behavior
Use Toxiproxy to inject latency and packet loss between services. Use Linux tc (traffic control) to simulate network degradation at the infrastructure level. Chaos Mesh and LitmusChaos have network partition experiments for Kubernetes.
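For example, here's a sketch against Toxiproxy's HTTP API (default port 8474; the proxy name and addresses are assumptions). Note that Toxiproxy operates on TCP connections rather than individual packets, so true packet loss still needs tc netem:

```python
import requests

TOXIPROXY = "http://localhost:8474"  # Toxiproxy's default API endpoint

# Route the app's database traffic through the proxy instead of the real port.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "postgres",              # illustrative name
    "listen": "127.0.0.1:21212",     # the app connects here
    "upstream": "127.0.0.1:5432",    # the real database
}).raise_for_status()

# Inject a 5s +/- 2s latency spike on traffic flowing back to the app.
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 5000, "jitter": 2000},
}).raise_for_status()

# Approximate partial loss by stalling ~30% of connections. For real
# per-packet loss, use tc instead:
#   tc qdisc add dev eth0 root netem loss 30% delay 200ms
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "type": "timeout",
    "toxicity": 0.3,                 # toxic applies to ~30% of connections
    "attributes": {"timeout": 0},    # 0 = hold the connection open, stall data
}).raise_for_status()
```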
The critical test isn't whether your system survives the partition. It's what happens when the partition heals. Does the old leader step down? Do split-brain writes get reconciled? Does replication catch up without overloading the primary? The recovery is often worse than the partition itself.
Incident Timeline
- T+0m: Network link between us-east-1a and us-east-1b degrades. Packet loss hits 30%. Not a full partition, but enough to cause timeouts on inter-AZ communication.
- T+2m: Database primary in us-east-1a can't reach replicas in us-east-1b. Replication lag jumps from 50ms to 30 seconds. Reads from replicas return stale data.
- T+5m: Consul health checks between AZs start failing. Service discovery removes healthy instances in us-east-1b from the service registry. Traffic concentrates on us-east-1a.
- T+10m: us-east-1a instances are overloaded from handling all traffic. CPU hits 95%. Autoscaling triggers, but new instances take 3 minutes to warm up.
- T+15m: Split-brain detected: each AZ's sentinels have promoted their own Redis master. Writes are landing on both masters. Data divergence in progress.
- T+30m: Network recovers. Redis split-brain resolution loses 340 writes from the minority partition. Database replicas catch up. Full recovery after 45 minutes total.
Detection Signals
- Replication lag exceeding baseline by 10x across database replicas in different availability zones
- Service mesh reporting increased inter-AZ latency or connection failures between specific zone pairs
- Split-brain indicators: multiple nodes claiming leadership for the same resource simultaneously (see the sketch after this list)
- Gossip protocol failures in distributed systems (Consul, Cassandra) showing unreachable nodes in specific zones
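As an illustration of the split-brain signal, a minimal sketch (the node interface here is hypothetical): poll every node for its own view of leadership and alert when more than one claims it.

```python
from typing import Iterable, Optional, Set

def detect_split_brain(nodes: Iterable) -> Optional[Set[str]]:
    """Poll each node's own view of leadership for one resource; more
    than one self-proclaimed leader means split-brain. Assumes each
    node exposes .name and .claims_leadership() (hypothetical API)."""
    leaders = {node.name for node in nodes if node.claims_leadership()}
    return leaders if len(leaders) > 1 else None
```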
Prevention
- Design services to handle partial failures, not just full connectivity or full partition. Real network issues are usually packet loss and latency spikes, not clean cuts
- Use consensus protocols (Raft, Paxos) for leader election instead of simple heartbeat-based failover that's prone to split-brain
- Configure database replication with appropriate consistency levels. For PostgreSQL, use synchronous_commit for critical writes
- Deploy Redis Sentinel with a minimum of 3 sentinels across 3 AZs so quorum-based leader election survives a single AZ partition
- Implement fencing tokens for distributed locks to prevent stale lock holders from making writes after a partition heals (see the sketch after this list)
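Fencing is worth making concrete because it protects you even when leader election goes wrong. A minimal sketch of the token check (a real store must persist and compare the token atomically with the write):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0   # highest token seen so far
        self.data = {}

    def write(self, key, value, token: int) -> None:
        # Tokens increase monotonically with each lock acquisition, so a
        # leader that grabbed the lock before the partition carries a
        # lower token than the one elected after it, and is rejected here.
        if token < self.highest_token:
            raise PermissionError(
                f"stale fencing token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value
```

The lock service issues the token (an incrementing counter or a Raft log index works); the store enforces it, so even a paused or partitioned old leader can't corrupt data after the partition heals.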
Key Points
- Real partitions aren't clean. They're partial: some packets get through, latency spikes, some connections work and others don't. This is harder to handle than a full cut
- The CAP theorem says during a partition you choose consistency or availability. Most systems choose availability without realizing the consistency implications
- Split-brain is the most dangerous partition outcome. Two leaders accepting writes means data divergence that requires manual reconciliation after recovery
- Network partitions between cloud availability zones happen more often than providers admit. AWS had at least 3 significant inter-AZ connectivity events in 2022-2023
- Testing partition tolerance requires tools like Toxiproxy or tc (traffic control) that simulate partial failures, not just kill-the-connection tests
Common Mistakes
- ✗ Using timeout-based leader election without fencing, allowing a slow-but-alive old leader to continue accepting writes
- ✗ Assuming that if you can reach the database primary, there's no partition. The partition might be between the primary and its replicas
- ✗ Treating replication lag as a performance issue instead of a consistency issue. Lag during a partition means reads return stale data
- ✗ Not testing what happens when the partition heals. The recovery phase (data reconciliation, leader re-election) often causes a second outage