Multi-Region Architecture
Active-Passive vs Active-Active
Active-passive means one region handles all traffic while the other stays warm with replicated data. Failover typically takes 30-60 seconds, gated mostly by DNS TTL expiry. This is the right starting point for most organizations. You get disaster recovery without the complexity of distributed writes.
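An active-passive setup maps directly onto Route 53's failover routing policy: a PRIMARY record backed by a health check, and a SECONDARY record that takes over when the check fails. The sketch below builds the change batch as plain data; domain names, IPs, zone and health-check IDs are placeholders, and the dict shape follows boto3's `change_resource_record_sets` API.

```python
def failover_record(role, ip, health_check_id=None):
    """Build one failover record set (role: 'PRIMARY' or 'SECONDARY')."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,
        "TTL": 60,  # low TTL so clients re-resolve soon after failover
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        # Route 53 flips to SECONDARY when the primary's check goes red.
        record["HealthCheckId"] = health_check_id
    return record

change_batch = {
    "Comment": "active-passive failover across two regions",
    "Changes": [
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("PRIMARY", "203.0.113.10", "hc-primary-id")},
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("SECONDARY", "198.51.100.20")},
    ],
}

# To apply (requires AWS credentials and a real hosted zone):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZEXAMPLE123", ChangeBatch=change_batch)
```

The 60-second TTL is the lever that bounds failover time; lowering it trades faster convergence for more DNS query load.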
Active-active means both regions serve traffic simultaneously. Users hit the nearest region. Both regions accept writes. This is where things get hard. Netflix runs active-active across three AWS regions. They spent years building custom tooling (Zuul, Eureka, EVCache) to make it work. Most companies are not Netflix.
Data Replication Strategies
The core challenge is keeping data consistent across regions separated by 40-120ms of network latency.
Single-leader replication is the simplest model. One region owns writes, others replicate asynchronously. Read replicas in secondary regions serve local reads. The risk: if the primary goes down, you either lose recent writes or wait for promotion.
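A toy model makes the failure mode concrete: writes land on the primary immediately but reach the replica through an asynchronous log, so a secondary-region read can be stale, and any log entries not yet shipped are lost if the primary dies. This is an illustrative sketch, not any particular database's replication protocol.

```python
from collections import deque

class Primary:
    """Leader: accepts writes, ships them to replicas asynchronously."""
    def __init__(self):
        self.data = {}
        self.log = deque()  # writes not yet applied downstream

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    """Follower: applies replicated writes with some lag."""
    def __init__(self):
        self.data = {}

    def apply_one(self, primary):
        if primary.log:
            key, value = primary.log.popleft()
            self.data[key] = value

primary, replica = Primary(), Replica()
primary.write("balance", 100)
primary.write("balance", 90)

replica.apply_one(primary)        # only the first write has replicated
stale = replica.data["balance"]   # 100: a stale read in the secondary region
# If the primary fails now, the write of 90 still sitting in its log is
# lost unless promotion waits for the replica to catch up.
```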
Multi-leader replication lets each region accept writes independently. PostgreSQL BDR and MySQL Group Replication support this, but conflict resolution is your problem. Last-write-wins (LWW) is the default strategy and it silently drops data.
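The LWW data-loss problem is easy to demonstrate in a few lines. Two regions update the same record within milliseconds of each other; the merge keeps the later timestamp and the other write vanishes with no error, no log entry, nothing. The record shape here is illustrative.

```python
def lww_merge(a, b):
    """Last-write-wins: keep the version with the later timestamp.
    The losing write disappears silently."""
    return a if a["ts"] >= b["ts"] else b

# Concurrent updates to the same cart record from two regions:
us_write = {"ts": 1000.000, "cart": ["book"]}
eu_write = {"ts": 1000.005, "cart": ["lamp"]}  # 5 ms later by wall clock

merged = lww_merge(us_write, eu_write)
# merged["cart"] == ["lamp"] -- the book is gone.
# A set-union merge would have yielded ["book", "lamp"].
```

Worse, the "later" timestamp depends on clock synchronization between regions, so with clock skew the winner may not even be the write that happened last.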
CRDTs (Conflict-free Replicated Data Types) solve specific conflict patterns. Counters, sets, and registers that merge automatically without coordination. Redis Enterprise and Riak use CRDTs internally. The limitation: not every data structure maps cleanly to a CRDT.
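The classic introductory CRDT is the grow-only counter: each region increments only its own slot, and merging takes the per-region maximum, so merges commute and both regions converge to the same total without coordination. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT. Each region increments only its own
    slot; merge takes the per-region max, so merge order never matters."""
    def __init__(self, region):
        self.region = region
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())

us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
us.increment(3)   # 3 events counted in the US region
eu.increment(2)   # 2 events counted in the EU region
us.merge(eu)      # exchanging state in either order...
eu.merge(us)
assert us.value() == eu.value() == 5  # ...converges to the same total
```

Note what the limitation in the paragraph above means in practice: a counter merges cleanly, but something like "the current inventory count after reservations" does not, because decrements and invariants reintroduce coordination.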
Routing and Traffic Management
DNS-based routing is the foundation. Route 53 latency-based routing sends users to the closest healthy region. Health checks run every 10-30 seconds. Set DNS TTLs to 60 seconds for faster failover, but accept that some clients cache aggressively.
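Latency-based routing with health checks reduces to a simple selection rule: of the regions currently passing health checks, pick the one with the lowest measured latency for this user. A toy version (region names and latencies are illustrative):

```python
regions = {
    "us-east-1":      {"latency_ms": 25,  "healthy": True},
    "eu-west-1":      {"latency_ms": 90,  "healthy": True},
    "ap-southeast-1": {"latency_ms": 180, "healthy": True},
}

def route(regions):
    """Lowest-latency region among those passing health checks."""
    healthy = {name: r for name, r in regions.items() if r["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy regions")
    return min(healthy, key=lambda name: healthy[name]["latency_ms"])

assert route(regions) == "us-east-1"
regions["us-east-1"]["healthy"] = False   # health check starts failing
assert route(regions) == "eu-west-1"      # traffic shifts to next-closest
```

Route 53 performs this logic server-side per resolver location; the caveat is the same one as above, since clients that cache DNS aggressively keep hitting the failed region until their cached answer expires.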
For more granular control, put a global load balancer (Cloudflare, AWS Global Accelerator, Google Cloud Load Balancing) in front of your regions. These use anycast IPs to steer traffic at the network layer, which reacts far faster than DNS propagation.
Data Sovereignty
Data residency requirements go well beyond "GDPR requires EU data to stay in the EU." The actual constraints are more nuanced and vary by regulation.
GDPR allows processing EU personal data outside the EU, but only in countries with an adequacy decision from the European Commission or under Standard Contractual Clauses (SCCs). The practical implication: you can replicate EU data to us-east-1, but you need legal agreements in place, and the Schrems II ruling made transfers to the US legally complex until the EU-US Data Privacy Framework was adopted in 2023. Many companies chose to keep EU data in EU regions to avoid the legal overhead entirely.
Beyond GDPR, specific industries add layers. Financial services under PSD2 may require transaction data to remain within the EU. Healthcare data under HIPAA has its own residency considerations. China's PIPL and Russia's data localization laws require citizen data to be stored domestically, which can force entirely separate infrastructure deployments rather than simple replication topology changes.
The architectural impact: you need per-table or per-record routing decisions, not just per-region deployments. Slack handles this by routing EU customer data exclusively through EU infrastructure while keeping non-personal metadata (feature flags, configuration) replicated globally. Stripe partitions payment data by merchant region. Both approaches require careful classification of which data fields are subject to residency requirements, a classification exercise that involves legal review, not just engineering judgment.
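Per-record routing ultimately reduces to a lookup: given a table, a data classification, and the user's jurisdiction, return the region (or "global") where the write may land. The sketch below is in the spirit of the Slack/Stripe approaches described above; the rule table, region names, and table names are all illustrative, and deciding which fields count as "personal" is the legal-review exercise, not this code.

```python
# Classification-driven routing rules (illustrative):
RESIDENCY_RULES = {
    "personal": {"EU": "eu-west-1"},  # EU personal data pinned to an EU region
}
DEFAULT_REGION = "us-east-1"
GLOBAL_TABLES = {"feature_flags", "configuration"}  # non-personal metadata

def target_region(table, classification, user_jurisdiction):
    """Decide where a record may be written and replicated."""
    if table in GLOBAL_TABLES:
        return "global"  # replicated everywhere, no residency constraint
    pinned = RESIDENCY_RULES.get(classification, {})
    return pinned.get(user_jurisdiction, DEFAULT_REGION)

assert target_region("messages", "personal", "EU") == "eu-west-1"
assert target_region("messages", "personal", "US") == "us-east-1"
assert target_region("feature_flags", "non-personal", "EU") == "global"
```

The hard part is not this function; it is keeping the classification table correct as the schema evolves, which is why residency review belongs in the schema-change process.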
Cost and Complexity Trade-offs
The raw infrastructure cost of multi-region (roughly 2-2.5x your single-region spend) is the easy part to budget for. The harder costs are operational complexity and engineering time.
Operational complexity compounds. Every deployment now targets multiple regions. Every database migration must be coordinated across replication topologies. Every monitoring alert needs region-aware context. Your on-call team needs runbooks for regional failover, and those runbooks need regular testing to stay current. Netflix estimates that their multi-region operational overhead adds 30-40% to their platform engineering headcount compared to what a single-region deployment would require.
The testing burden is real. You cannot test multi-region behavior with unit tests. You need infrastructure that simulates cross-region latency, partition scenarios, and failover flows. Chaos engineering (intentionally failing a region during business hours) is how Netflix, Amazon, and Google validate their multi-region setups. Without that investment, your failover is theoretical until the day you need it, at which point you discover it does not work.
Match the investment to the SLA requirement. Single-region with multiple availability zones gives you roughly 99.95% availability (about 4.4 hours of downtime per year). Active-passive multi-region gets you to 99.99% (about 52 minutes). Active-active multi-region targets 99.999% (about 5 minutes). Each step up costs significantly more in infrastructure and engineering effort. Most B2B SaaS companies can operate profitably at 99.95% with AZ-level redundancy. Multi-region becomes justified when your contractual SLAs demand 99.99% or when you need sub-100ms latency for users on multiple continents.
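The downtime figures quoted above fall straight out of the availability targets:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability):
    """Permitted downtime per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

print(f"{downtime_minutes(0.9995):.1f} min/yr")   # 99.95%:  ~262.8 min (~4.4 h)
print(f"{downtime_minutes(0.9999):.1f} min/yr")   # 99.99%:  ~52.6 min
print(f"{downtime_minutes(0.99999):.1f} min/yr")  # 99.999%: ~5.3 min
```

Each extra nine cuts the budget by 10x while the cost of achieving it grows much faster than 10x, which is the quantitative core of the "match the investment to the SLA" advice.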
Key Points
- Active-passive is 10x simpler than active-active. Start with active-passive and promote to active-active only when latency requirements demand it
- Cross-region replication lag is bounded by physics. US-East to EU-West is roughly 80ms round-trip. Your conflict resolution strategy must account for this
- Data sovereignty laws like GDPR may require that certain data never leaves a geographic boundary, which constrains replication topology
- Multi-region doubles or triples infrastructure cost. Budget for it explicitly and justify it against the business value of the uptime improvement
- DNS failover with Route 53 or Cloudflare health checks is the simplest entry point. You can build more sophisticated routing later
Common Mistakes
- Building active-active without a clear conflict resolution strategy. Two regions accepting writes to the same record simultaneously will produce data corruption
- Testing failover only during planned exercises. Chaos engineering practices like randomly failing one region during business hours reveal gaps that planned tests miss
- Assuming the database vendor handles multi-region automatically. CockroachDB and Spanner do, but most databases require careful configuration and ongoing tuning
- Ignoring the blast radius of shared global services. A single global authentication service defeats the purpose of multi-region isolation