Multi-Region Architecture
Active-Passive vs Active-Active
Active-passive means one region handles all traffic while the other stays warm with replicated data. Failover typically takes 30-60 seconds, gated mostly by DNS TTL expiry. This is the right starting point for most organizations. You get disaster recovery without the complexity of distributed writes.
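An active-passive setup maps directly onto Route 53's failover routing policy: a PRIMARY record backed by a health check, and a SECONDARY record that takes over when the check fails. The sketch below builds the change batch as plain data; domain names, IPs, zone and health-check IDs are placeholders, and the dict shape follows boto3's `change_resource_record_sets` API.

```python
def failover_record(role, ip, health_check_id=None):
    """Build one failover record set (role: 'PRIMARY' or 'SECONDARY')."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,
        "TTL": 60,  # low TTL so clients re-resolve soon after failover
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        # Route 53 flips to SECONDARY when the primary's check goes red.
        record["HealthCheckId"] = health_check_id
    return record

change_batch = {
    "Comment": "active-passive failover across two regions",
    "Changes": [
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("PRIMARY", "203.0.113.10", "hc-primary-id")},
        {"Action": "UPSERT",
         "ResourceRecordSet": failover_record("SECONDARY", "198.51.100.20")},
    ],
}

# To apply (requires AWS credentials and a real hosted zone):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZEXAMPLE123", ChangeBatch=change_batch)
```

The 60-second TTL is the lever that bounds failover time; lowering it trades faster convergence for more DNS query load.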
Active-active means both regions serve traffic simultaneously. Users hit the nearest region. Both regions accept writes. This is where things get hard. Netflix runs active-active across three AWS regions. They spent years building custom tooling (Zuul, Eureka, EVCache) to make it work. Most companies are not Netflix.
Data Replication Strategies
The core challenge is keeping data consistent across regions separated by 40-120ms of network latency.
Single-leader replication is the simplest model. One region owns writes, others replicate asynchronously. Read replicas in secondary regions serve local reads. The risk: if the primary goes down, you either lose recent writes or wait for promotion.
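A toy model makes the failure mode concrete: writes land on the primary immediately but reach the replica through an asynchronous log, so a secondary-region read can be stale, and any log entries not yet shipped are lost if the primary dies. This is an illustrative sketch, not any particular database's replication protocol.

```python
from collections import deque

class Primary:
    """Leader: accepts writes, ships them to replicas asynchronously."""
    def __init__(self):
        self.data = {}
        self.log = deque()  # writes not yet applied downstream

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    """Follower: applies replicated writes with some lag."""
    def __init__(self):
        self.data = {}

    def apply_one(self, primary):
        if primary.log:
            key, value = primary.log.popleft()
            self.data[key] = value

primary, replica = Primary(), Replica()
primary.write("balance", 100)
primary.write("balance", 90)

replica.apply_one(primary)        # only the first write has replicated
stale = replica.data["balance"]   # 100: a stale read in the secondary region
# If the primary fails now, the write of 90 still sitting in its log is
# lost unless promotion waits for the replica to catch up.
```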
Multi-leader replication lets each region accept writes independently. PostgreSQL BDR and MySQL Group Replication support this, but conflict resolution is your problem. Last-write-wins (LWW) is the default strategy and it silently drops data.
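The LWW data-loss problem is easy to demonstrate in a few lines. Two regions update the same record within milliseconds of each other; the merge keeps the later timestamp and the other write vanishes with no error, no log entry, nothing. The record shape here is illustrative.

```python
def lww_merge(a, b):
    """Last-write-wins: keep the version with the later timestamp.
    The losing write disappears silently."""
    return a if a["ts"] >= b["ts"] else b

# Concurrent updates to the same cart record from two regions:
us_write = {"ts": 1000.000, "cart": ["book"]}
eu_write = {"ts": 1000.005, "cart": ["lamp"]}  # 5 ms later by wall clock

merged = lww_merge(us_write, eu_write)
# merged["cart"] == ["lamp"] -- the book is gone.
# A set-union merge would have yielded ["book", "lamp"].
```

Worse, the "later" timestamp depends on clock synchronization between regions, so with clock skew the winner may not even be the write that happened last.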
CRDTs (Conflict-free Replicated Data Types) solve specific conflict patterns. Counters, sets, and registers that merge automatically without coordination. Redis Enterprise and Riak use CRDTs internally. The limitation: not every data structure maps cleanly to a CRDT.
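The classic introductory CRDT is the grow-only counter: each region increments only its own slot, and merging takes the per-region maximum, so merges commute and both regions converge to the same total without coordination. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT. Each region increments only its own
    slot; merge takes the per-region max, so merge order never matters."""
    def __init__(self, region):
        self.region = region
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())

us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
us.increment(3)   # 3 events counted in the US region
eu.increment(2)   # 2 events counted in the EU region
us.merge(eu)      # exchanging state in either order...
eu.merge(us)
assert us.value() == eu.value() == 5  # ...converges to the same total
```

Note what the limitation in the paragraph above means in practice: a counter merges cleanly, but something like "the current inventory count after reservations" does not, because decrements and invariants reintroduce coordination.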
Routing and Traffic Management
DNS-based routing is the foundation. Route 53 latency-based routing sends users to the closest healthy region. Health checks run every 10-30 seconds. Set DNS TTLs to 60 seconds for faster failover, but accept that some clients cache aggressively.
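Latency-based routing with health checks reduces to a simple selection rule: of the regions currently passing health checks, pick the one with the lowest measured latency for this user. A toy version (region names and latencies are illustrative):

```python
regions = {
    "us-east-1":      {"latency_ms": 25,  "healthy": True},
    "eu-west-1":      {"latency_ms": 90,  "healthy": True},
    "ap-southeast-1": {"latency_ms": 180, "healthy": True},
}

def route(regions):
    """Lowest-latency region among those passing health checks."""
    healthy = {name: r for name, r in regions.items() if r["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy regions")
    return min(healthy, key=lambda name: healthy[name]["latency_ms"])

assert route(regions) == "us-east-1"
regions["us-east-1"]["healthy"] = False   # health check starts failing
assert route(regions) == "eu-west-1"      # traffic shifts to next-closest
```

Route 53 performs this logic server-side per resolver location; the caveat is the same one as above, since clients that cache DNS aggressively keep hitting the failed region until their cached answer expires.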
For more granular control, put a global load balancer (Cloudflare, AWS Global Accelerator, Google Cloud Load Balancing) in front of your regions. These use anycast IPs to steer traffic at the network layer, which reacts far faster than DNS propagation.
Data Sovereignty
Data residency requirements go well beyond "GDPR requires EU data to stay in the EU." The actual constraints are more nuanced and vary by regulation.
GDPR allows processing EU personal data outside the EU, but only in countries with an adequacy decision from the European Commission or under Standard Contractual Clauses (SCCs). The practical implication: you can replicate EU data to us-east-1, but you need legal agreements in place, and the Schrems II ruling made transfers to the US legally complex until the EU-US Data Privacy Framework was adopted in 2023. Many companies chose to keep EU data in EU regions to avoid the legal overhead entirely.
Beyond GDPR, specific industries add layers. Financial services under PSD2 may require transaction data to remain within the EU. Healthcare data under HIPAA has its own residency considerations. China's PIPL and Russia's data localization laws require citizen data to be stored domestically, which can force entirely separate infrastructure deployments rather than simple replication topology changes.
The architectural impact: you need per-table or per-record routing decisions, not just per-region deployments. Slack handles this by routing EU customer data exclusively through EU infrastructure while keeping non-personal metadata (feature flags, configuration) replicated globally. Stripe partitions payment data by merchant region. Both approaches require careful classification of which data fields are subject to residency requirements, a classification exercise that involves legal review, not just engineering judgment.
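Per-record routing ultimately reduces to a lookup: given a table, a data classification, and the user's jurisdiction, return the region (or "global") where the write may land. The sketch below is in the spirit of the Slack/Stripe approaches described above; the rule table, region names, and table names are all illustrative, and deciding which fields count as "personal" is the legal-review exercise, not this code.

```python
# Classification-driven routing rules (illustrative):
RESIDENCY_RULES = {
    "personal": {"EU": "eu-west-1"},  # EU personal data pinned to an EU region
}
DEFAULT_REGION = "us-east-1"
GLOBAL_TABLES = {"feature_flags", "configuration"}  # non-personal metadata

def target_region(table, classification, user_jurisdiction):
    """Decide where a record may be written and replicated."""
    if table in GLOBAL_TABLES:
        return "global"  # replicated everywhere, no residency constraint
    pinned = RESIDENCY_RULES.get(classification, {})
    return pinned.get(user_jurisdiction, DEFAULT_REGION)

assert target_region("messages", "personal", "EU") == "eu-west-1"
assert target_region("messages", "personal", "US") == "us-east-1"
assert target_region("feature_flags", "non-personal", "EU") == "global"
```

The hard part is not this function; it is keeping the classification table correct as the schema evolves, which is why residency review belongs in the schema-change process.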
Cost and Complexity Trade-offs
The raw infrastructure cost of multi-region (roughly 2-2.5x your single-region spend) is the easy part to budget for. The harder costs are operational complexity and engineering time.
Operational complexity compounds. Every deployment now targets multiple regions. Every database migration must be coordinated across replication topologies. Every monitoring alert needs region-aware context. Your on-call team needs runbooks for regional failover, and those runbooks need regular testing to stay current. Netflix estimates that their multi-region operational overhead adds 30-40% to their platform engineering headcount compared to what a single-region deployment would require.
The testing burden is real. You cannot test multi-region behavior with unit tests. You need infrastructure that simulates cross-region latency, partition scenarios, and failover flows. Chaos engineering (intentionally failing a region during business hours) is how Netflix, Amazon, and Google validate their multi-region setups. Without that investment, your failover is theoretical until the day you need it, at which point you discover it does not work.
Match the investment to the SLA requirement. Single-region with multiple availability zones gives you roughly 99.95% availability (about 4.4 hours of downtime per year). Active-passive multi-region gets you to 99.99% (about 52 minutes). Active-active multi-region targets 99.999% (about 5 minutes). Each step up costs significantly more in infrastructure and engineering effort. Most B2B SaaS companies can operate profitably at 99.95% with AZ-level redundancy. Multi-region becomes justified when your contractual SLAs demand 99.99% or when you need sub-100ms latency for users on multiple continents.
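The downtime figures quoted above fall straight out of the availability targets:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability):
    """Permitted downtime per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

print(f"{downtime_minutes(0.9995):.1f} min/yr")   # 99.95%:  ~262.8 min (~4.4 h)
print(f"{downtime_minutes(0.9999):.1f} min/yr")   # 99.99%:  ~52.6 min
print(f"{downtime_minutes(0.99999):.1f} min/yr")  # 99.999%: ~5.3 min
```

Each extra nine cuts the budget by 10x while the cost of achieving it grows much faster than 10x, which is the quantitative core of the "match the investment to the SLA" advice.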
Key Points
- Active-passive is 10x simpler than active-active. Start with active-passive and promote to active-active only when latency requirements demand it
- Cross-region replication lag is bounded by physics. US-East to EU-West is roughly 80ms round-trip. Your conflict resolution strategy must account for this
- Data sovereignty laws like GDPR may require that certain data never leaves a geographic boundary, which constrains replication topology
- Multi-region doubles or triples infrastructure cost. Budget for it explicitly and justify it against the business value of the uptime improvement
- DNS failover with Route 53 or Cloudflare health checks is the simplest entry point. You can build more sophisticated routing later
Common Mistakes
- Building active-active without a clear conflict resolution strategy. Two regions accepting writes to the same record simultaneously will produce data corruption
- Testing failover only during planned exercises. Chaos engineering practices like randomly failing one region during business hours reveal gaps that planned tests miss
- Assuming the database vendor handles multi-region automatically. CockroachDB and Spanner do, but most databases require careful configuration and ongoing tuning
- Ignoring the blast radius of shared global services. A single global authentication service defeats the purpose of multi-region isolation