Network Partition Handling
Partitions in the Real World
Textbook network partitions are clean: two halves of the network can't communicate at all. Real partitions are messy. You get 30% packet loss, 5-second latency spikes, connections that work in one direction but not the other. This partial failure is harder to handle because your systems can't cleanly decide whether a remote node is up or down.
A heartbeat that usually takes 10ms suddenly takes 8 seconds. Is the node down? Is the network slow? Your health check timeout is 5 seconds, so the node is marked unhealthy. But it's still running, still accepting writes, still thinking it's the leader. This is where split-brain begins.
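To make the ambiguity concrete, here's a minimal Python sketch of a timeout-based check (the host, port, and 5-second threshold are illustrative). A node that would answer in 8 seconds and a node that is truly dead look identical to the caller:

```python
import socket

HEALTH_CHECK_TIMEOUT = 5.0  # seconds; anything slower is declared dead

def is_healthy(host: str, port: int) -> bool:
    """Timeout-based health check: it cannot distinguish a dead node
    from a live one behind a degraded link."""
    try:
        # A node that would answer in 8s is alive, but this call gives
        # up at 5s and we mark it unhealthy anyway.
        with socket.create_connection((host, port), timeout=HEALTH_CHECK_TIMEOUT):
            return True
    except OSError:  # a timeout and "connection refused" land here alike
        return False
```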
Split-Brain Scenarios
Split-brain happens when a partition causes two nodes to both believe they're the leader. Each accepts writes independently. When the partition heals, you have two divergent datasets that need to be reconciled.
Redis Sentinel with only 2 sentinels is a classic split-brain setup. Each sentinel is in a different AZ. During a partition, each sentinel can see its local Redis instance but not the other sentinel. Neither side can form a majority, yet a permissive configuration can still end up promoting a replica on each side. Now you have two masters.
The fix: always deploy an odd number of sentinels (3 or 5) across 3+ availability zones. Quorum-based election ensures only the majority partition can elect a new leader. The minority partition knows it's in the minority and refuses to promote.
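The arithmetic is worth spelling out. A small Python sketch of the majority rule (illustrative, not Sentinel's actual implementation):

```python
def has_quorum(reachable: int, total: int) -> bool:
    """Only the partition side that can see a strict majority of
    sentinels is allowed to elect a new leader."""
    return reachable >= total // 2 + 1

# 2 sentinels split 1/1: neither side reaches the 2 votes needed.
assert not has_quorum(1, 2)
# 3 sentinels split 2/1: the majority side promotes; the minority refuses.
assert has_quorum(2, 3) and not has_quorum(1, 3)
```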
CAP Theorem in Practice
The CAP theorem gets oversimplified. It doesn't say "pick two of three." It says during a partition (which will happen), you choose between consistency and availability. When there's no partition, you can have both.
Most systems default to availability without being explicit about it. Your API keeps serving requests during a partition, but some of those responses contain stale data from a lagging replica. You're choosing AP (available but inconsistent) without telling your users.
For financial systems, this is unacceptable. A user checking their balance should never see a stale value. For social media timelines, seeing a post 30 seconds late is fine. Know which category each of your endpoints falls into and configure consistency levels accordingly.
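One way to make that categorization explicit is to declare it in code rather than inherit whatever the default read path does. A hedged sketch (the endpoint names and routing helper are hypothetical):

```python
from enum import Enum

class Consistency(Enum):
    STRONG = "strong"      # read from the primary: correct, but may be unavailable during a partition
    EVENTUAL = "eventual"  # read from any replica: available, but possibly stale

# Classify every endpoint explicitly; unlisted endpoints default to STRONG.
ENDPOINT_CONSISTENCY = {
    "/accounts/balance": Consistency.STRONG,    # stale balances are unacceptable
    "/timeline":         Consistency.EVENTUAL,  # a 30-second-old post is fine
}

def route_read(endpoint: str, primary, replicas):
    """Pick a backend based on the endpoint's declared consistency class."""
    level = ENDPOINT_CONSISTENCY.get(endpoint, Consistency.STRONG)
    if level is Consistency.STRONG or not replicas:
        return primary
    return replicas[0]
```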
Cross-Region Partitions
Cross-region partitions are rarer but more impactful. If your primary database is in us-east-1 and your disaster recovery is in us-west-2, a partition between regions means your DR site can't replicate. When the partition heals, the replication lag could represent hours of data.
The decision: do you failover to DR during the partition (accepting data loss) or do you wait for the partition to heal (accepting downtime)? This decision should be made in advance, documented in a runbook, and based on the specific RPO/RTO requirements of your business. Making this decision at 3am during an incident is how you get it wrong.
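That runbook entry can even be reduced to a pre-agreed rule. A hedged sketch, where the lag estimate and the RPO/RTO budgets are whatever your monitoring and your business actually define:

```python
def dr_decision(replication_lag_s: float, expected_outage_s: float,
                rpo_s: float, rto_s: float) -> str:
    """Pre-agreed failover rule: compare what each option costs against
    the budgets the business signed off on. Illustrative only."""
    if replication_lag_s <= rpo_s:
        return "failover"   # data loss fits within RPO: fail over to DR
    if expected_outage_s <= rto_s:
        return "wait"       # downtime fits within RTO: wait for the heal
    return "escalate"       # both budgets blown: wake people up, don't improvise
```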
Testing Partition Behavior
Use Toxiproxy to inject latency and packet loss between services. Use Linux tc (traffic control) to simulate network degradation at the infrastructure level. Chaos Mesh and LitmusChaos have network partition experiments for Kubernetes.
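For example, here's a sketch against Toxiproxy's HTTP API (default port 8474; the proxy name and addresses are assumptions). Note that Toxiproxy operates on TCP connections rather than individual packets, so true packet loss still needs tc netem:

```python
import requests

TOXIPROXY = "http://localhost:8474"  # Toxiproxy's default API endpoint

# Route the app's database traffic through the proxy instead of the real port.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "postgres",              # illustrative name
    "listen": "127.0.0.1:21212",     # the app connects here
    "upstream": "127.0.0.1:5432",    # the real database
}).raise_for_status()

# Inject a 5s +/- 2s latency spike on traffic flowing back to the app.
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 5000, "jitter": 2000},
}).raise_for_status()

# Approximate partial loss by stalling ~30% of connections. For real
# per-packet loss, use tc instead:
#   tc qdisc add dev eth0 root netem loss 30% delay 200ms
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "type": "timeout",
    "toxicity": 0.3,                 # toxic applies to ~30% of connections
    "attributes": {"timeout": 0},    # 0 = hold the connection open, stall data
}).raise_for_status()
```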
The critical test isn't whether your system survives the partition. It's what happens when the partition heals. Does the old leader step down? Do split-brain writes get reconciled? Does replication catch up without overloading the primary? The recovery is often worse than the partition itself.
Incident Timeline
- T+0m: Network link between us-east-1a and us-east-1b degrades. Packet loss hits 30%. Not a full partition, but enough to cause timeouts on inter-AZ communication.
- T+2m: Database primary in us-east-1a can't reach replicas in us-east-1b. Replication lag jumps from 50ms to 30 seconds. Reads from replicas return stale data.
- T+5m: Consul health checks between AZs start failing. Service discovery removes healthy instances in us-east-1b from the service registry. Traffic concentrates on us-east-1a.
- T+10m: us-east-1a instances are overloaded from handling all traffic. CPU hits 95%. Autoscaling triggers, but new instances take 3 minutes to warm up.
- T+15m: Split-brain detected: each AZ's sentinels have promoted their own Redis master. Writes are landing on both masters. Data divergence in progress.
- T+30m: Network recovers. Redis split-brain resolution loses 340 writes from the minority partition. Database replicas catch up. Full recovery after 45 minutes total.
Detection Signals
- Replication lag exceeding baseline by 10x across database replicas in different availability zones
- Service mesh reporting increased inter-AZ latency or connection failures between specific zone pairs
- Split-brain indicators: multiple nodes claiming leadership for the same resource simultaneously (see the sketch after this list)
- Gossip protocol failures in distributed systems (Consul, Cassandra) showing unreachable nodes in specific zones
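As an illustration of the split-brain signal, a minimal sketch (the node interface here is hypothetical): poll every node for its own view of leadership and alert when more than one claims it.

```python
from typing import Iterable, Optional, Set

def detect_split_brain(nodes: Iterable) -> Optional[Set[str]]:
    """Poll each node's own view of leadership for one resource; more
    than one self-proclaimed leader means split-brain. Assumes each
    node exposes .name and .claims_leadership() (hypothetical API)."""
    leaders = {node.name for node in nodes if node.claims_leadership()}
    return leaders if len(leaders) > 1 else None
```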
Prevention
- Design services to handle partial failures, not just full connectivity or full partition. Real network issues are usually packet loss and latency spikes, not clean cuts
- Use consensus protocols (Raft, Paxos) for leader election instead of simple heartbeat-based failover that's prone to split-brain
- Configure database replication with appropriate consistency levels. For PostgreSQL, use synchronous_commit for critical writes
- Deploy Redis Sentinel with a minimum of 3 sentinels across 3 AZs so quorum-based leader election survives a single AZ partition
- Implement fencing tokens for distributed locks to prevent stale lock holders from making writes after a partition heals (see the sketch after this list)
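Fencing is worth making concrete because it protects you even when leader election goes wrong. A minimal sketch of the token check (a real store must persist and compare the token atomically with the write):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0   # highest token seen so far
        self.data = {}

    def write(self, key, value, token: int) -> None:
        # Tokens increase monotonically with each lock acquisition, so a
        # leader that grabbed the lock before the partition carries a
        # lower token than the one elected after it, and is rejected here.
        if token < self.highest_token:
            raise PermissionError(
                f"stale fencing token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value
```

The lock service issues the token (an incrementing counter or a Raft log index works); the store enforces it, so even a paused or partitioned old leader can't corrupt data after the partition heals.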
Key Points
- Real partitions aren't clean. They're partial: some packets get through, latency spikes, some connections work and others don't. This is harder to handle than a full cut
- The CAP theorem says during a partition you choose consistency or availability. Most systems choose availability without realizing the consistency implications
- Split-brain is the most dangerous partition outcome. Two leaders accepting writes means data divergence that requires manual reconciliation after recovery
- Network partitions between cloud availability zones happen more often than providers admit. AWS had at least 3 significant inter-AZ connectivity events in 2022-2023
- Testing partition tolerance requires tools like Toxiproxy or tc (traffic control) that simulate partial failures, not just kill-the-connection tests
Common Mistakes
- ✗ Using timeout-based leader election without fencing, allowing a slow-but-alive old leader to continue accepting writes
- ✗ Assuming that if you can reach the database primary, there's no partition. The partition might be between the primary and its replicas
- ✗ Treating replication lag as a performance issue instead of a consistency issue. Lag during a partition means reads return stale data
- ✗ Not testing what happens when the partition heals. The recovery phase (data reconciliation, leader re-election) often causes a second outage