Cloud Provider Outage Response
The Uncomfortable Reality
Multi-region architecture is expensive. It doubles your infrastructure cost, triples your operational complexity, and introduces data consistency challenges that most teams aren't equipped to handle. For most startups and mid-sized companies, the honest approach is accepting single-region risk, having a communication plan, and focusing engineering effort on things that fail more often.
But if your revenue loss per hour of downtime exceeds the annual cost of multi-region, you need to build it. That break-even calculation is the starting point for every multi-region discussion.
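To make that rule concrete, here is the comparison as a back-of-the-envelope script; the dollar figures are illustrative placeholders, not benchmarks:

```python
# Break-even sketch for the multi-region decision, using the rule above:
# if one hour of downtime costs more than a year of multi-region, build it.
# All figures are illustrative placeholders, not benchmarks.

revenue_loss_per_hour = 200_000      # $ lost per hour of full downtime
annual_multi_region_cost = 150_000   # extra infra + engineering time per year

if revenue_loss_per_hour > annual_multi_region_cost:
    print("Build multi-region: one hour of downtime costs more than a year of redundancy")
else:
    print("Accept single-region risk and invest in a communication plan instead")
```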
When the Provider Goes Down
The first 10 minutes of a provider outage are chaos. Your monitoring looks half-green and half-red. Existing services keep running but new deployments fail. Autoscaling doesn't work. The provider's status page shows everything is fine. Twitter is on fire.
Your response depends on whether you have a tested failover region. If you do: activate the runbook, shift traffic, and monitor the secondary region's capacity. If you don't: communicate with customers, reduce load by disabling non-essential features, and wait for the provider to recover.
The critical mistake is trying to build multi-region during the outage. You can't replicate data to us-west-2 when the network to us-east-1 is unreliable. You can't test a deployment in a region you've never deployed to. Everything you need for failover must be pre-provisioned, pre-tested, and pre-deployed.
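If you do have a warm secondary, the "shift traffic" step can be as simple as rewriting DNS weights. Below is a minimal sketch using boto3 and Route 53 weighted records, assuming both regions already sit behind existing record sets; the zone ID, record name, and load balancer hostnames are placeholders:

```python
# Minimal sketch of the "shift traffic" runbook step, assuming you already run
# weighted Route 53 records for both regions. Zone ID, record name, and
# hostnames below are placeholders for illustration.
import boto3

route53 = boto3.client("route53")

def shift_traffic_to_secondary(zone_id: str, record_name: str) -> None:
    """Set the primary region's weight to 0 and the secondary's to 100."""
    changes = []
    for set_id, weight, target in [
        ("us-east-1", 0, "lb-us-east-1.example.com"),
        ("us-west-2", 100, "lb-us-west-2.example.com"),
    ]:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Comment": "Failover to us-west-2", "Changes": changes},
    )

shift_traffic_to_secondary("Z0000000EXAMPLE", "app.example.com")
```

Keep the TTL low (60 seconds in the sketch) so clients pick up the change quickly; a long TTL quietly adds minutes to every failover.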
Control Plane vs Data Plane
During the 2021 AWS us-east-1 outage, existing EC2 instances kept running. ECS containers kept serving traffic. RDS databases stayed up. But nobody could launch new instances, deploy new containers, or modify security groups. The data plane was fine. The control plane was down.
This distinction changes your strategy. If you've pre-provisioned enough capacity to handle peak traffic without autoscaling, a control plane outage is survivable. If you rely on autoscaling to handle traffic spikes, a control plane outage during a traffic spike means you can't scale.
Pre-provision capacity for peak load in your primary region. Yes, this costs more than relying on autoscaling. But during a control plane outage, the capacity you already have running is all the capacity you are going to get.
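One way to enforce this is to pin the Auto Scaling group's floor at peak capacity, so autoscaling becomes an optimization rather than a dependency. A sketch with boto3, where the group name and instance counts are assumptions:

```python
# Sketch: pin the Auto Scaling group's minimum at peak capacity so a control
# plane outage can't leave you below what peak traffic needs. The group name
# and instance counts are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

PEAK_CAPACITY = 40  # instances needed to serve peak traffic without scaling

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-primary-us-east-1",
    MinSize=PEAK_CAPACITY,
    DesiredCapacity=PEAK_CAPACITY,
    MaxSize=PEAK_CAPACITY + 10,  # headroom for when the control plane is healthy
)
```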
Active-Active vs Active-Passive
Active-passive means your secondary region is cold. No traffic, possibly outdated deployments, untested data replication. Failover is slow and risky because you're essentially launching a new environment under pressure.
Active-active means both regions serve traffic at all times. Failover is just traffic rebalancing. Both regions have current deployments, warm caches, and proven capacity. The engineering cost is higher, but the failover time drops from 15 minutes to under 1 minute.
The middle ground: active-passive with monthly failover drills. Run production traffic through the DR region for a few hours each month. This validates your deployment pipeline, data replication, and capacity planning without the full cost of active-active.
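A drill can reuse the same weighted-record mechanism as the failover sketch above: shift traffic to the DR region, hold it for the drill window, then restore the normal split. A self-contained sketch with placeholder zone ID, record name, and hostnames:

```python
# Drill sketch: route production traffic through the DR region for an hour,
# then restore. Zone ID, record name, hostnames, and the one-hour hold are
# placeholders; in practice this would be driven by a scheduler, not sleep().
import time
import boto3

route53 = boto3.client("route53")
ZONE, NAME = "Z0000000EXAMPLE", "app.example.com"

def set_region_weights(weights: dict) -> None:
    """weights maps set identifier -> (weight, target hostname)."""
    changes = [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": NAME,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    } for set_id, (weight, target) in weights.items()]
    route53.change_resource_record_sets(
        HostedZoneId=ZONE, ChangeBatch={"Changes": changes})

# Shift traffic to the DR region for the drill window...
set_region_weights({
    "us-east-1": (0, "lb-us-east-1.example.com"),
    "us-west-2": (100, "lb-us-west-2.example.com"),
})
time.sleep(60 * 60)  # hold for one hour while the team watches dashboards

# ...then restore the normal split.
set_region_weights({
    "us-east-1": (100, "lb-us-east-1.example.com"),
    "us-west-2": (0, "lb-us-west-2.example.com"),
})
```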
Customer Communication
When your provider is down, your customers don't care whose fault it is. They care about when their service will be back. Post updates every 15 minutes, even if the update is "still investigating." Acknowledge the impact honestly. "Our cloud provider is experiencing an outage that is affecting our service" is better than vague language about "intermittent issues."
Host your status page on a different provider. Statuspage.io on Atlassian Cloud, a static site on Cloudflare Pages, or even a simple GitHub Pages site. If your status page goes down with your primary infrastructure, you have no way to communicate with customers.
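The fallback can be as basic as a static HTML file regenerated on every update and pushed to whatever host you chose. A minimal sketch; the update text and output path are placeholders, and publishing is whatever your static host uses (git push, the host's CLI, etc.):

```python
# Sketch: regenerate a static status page that lives on a different provider
# (GitHub Pages, Cloudflare Pages, a static bucket on another cloud). Update
# text, timestamps, and output path are placeholders.
from datetime import datetime, timezone
from pathlib import Path

UPDATES = [
    (datetime(2024, 1, 1, 15, 45, tzinfo=timezone.utc),
     "Our cloud provider is experiencing an outage that is affecting our service."),
    (datetime(2024, 1, 1, 16, 0, tzinfo=timezone.utc),
     "Still investigating. Next update in 15 minutes."),
]

def render(updates) -> str:
    items = "\n".join(
        f"<li><strong>{ts:%Y-%m-%d %H:%M UTC}</strong>: {text}</li>"
        for ts, text in sorted(updates, reverse=True)  # newest first
    )
    return f"<html><body><h1>Service Status</h1><ul>\n{items}\n</ul></body></html>"

out = Path("status/index.html")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(render(UPDATES))
```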
Incident Timeline
- T+0m: AWS us-east-1 API calls start timing out. EC2 RunInstances, ECS task launches, and Lambda cold starts all failing. Existing running instances unaffected initially.
- T+2m: Internal monitoring shows green (instances still running) but external synthetic checks fail. Confusion about whether it's an internal issue or AWS. Status page shows no incident.
- T+5m: Twitter reports confirm us-east-1 issues. Team checks AWS Personal Health Dashboard. No updates yet. Decision made to start the failover runbook to us-west-2.
- T+10m: Failover initiated. Route 53 health checks detect the failure and start routing traffic to us-west-2. But the us-west-2 deployment is 2 releases behind because cross-region deploys weren't automated.
- T+15m: us-west-2 serving traffic, but missing the bug fixes from the last 2 releases. Customer experience degraded but functional. AWS status page finally acknowledges the incident.
- T+30m: AWS us-east-1 recovers. Traffic gradually shifted back. Post-incident review reveals failover took 15 minutes instead of the target 5 minutes due to stale deployments and untested runbooks.
Detection Signals
- Simultaneous failures across multiple unrelated services that all share the same cloud provider region
- Cloud provider API errors (throttling, timeouts) on infrastructure management calls (a probe sketch follows this list)
- Auto-scaling failures because the provider can't launch new instances during the outage
- Social media and community channels (Twitter, HackerNews, Reddit) reporting widespread issues with the same provider
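A minimal probe that separates "our app is broken" from "the provider control plane is broken" might look like the following; the health-check URL is an assumption about your own app, and a successful DescribeInstances call is only a weak signal that launches would also succeed:

```python
# Detection sketch: compare a cheap control-plane read against your own
# data-plane health endpoint. APP_HEALTH_URL is a placeholder.
import urllib.request

import boto3
from botocore.config import Config

APP_HEALTH_URL = "https://app.example.com/healthz"  # assumed endpoint

ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",
    config=Config(connect_timeout=3, read_timeout=5, retries={"max_attempts": 1}),
)

def control_plane_ok() -> bool:
    try:
        ec2.describe_instances(MaxResults=5)  # cheap control-plane read
        return True
    except Exception:
        return False

def data_plane_ok() -> bool:
    try:
        return urllib.request.urlopen(APP_HEALTH_URL, timeout=5).status == 200
    except Exception:
        return False

cp, dp = control_plane_ok(), data_plane_ok()
if dp and not cp:
    print("Control plane outage: app healthy, but don't count on scaling or deploys")
elif not dp:
    print("Data plane impact: customer-facing outage, start incident response")
else:
    print("Both control plane and data plane look healthy")
```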
Prevention
- Deploy active-active across at least 2 regions. Both regions serve traffic at all times, so failover is just traffic rebalancing, not cold-starting a new region
- Automate cross-region deployments so DR regions are never more than 1 release behind primary (a drift-check sketch follows this list)
- Run monthly failover drills. Actually route production traffic to the secondary region for at least 1 hour. Untested failover is not failover
- Minimize dependencies on provider control planes. Use pre-provisioned capacity instead of relying on autoscaling during an outage (you can't scale if the API is down)
- Maintain static site fallbacks for critical customer-facing pages (status page, documentation, basic account access) on a different provider
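For the deployment-drift point above, a simple check that can run in CI or on a schedule, assuming each region exposes a /version endpoint returning the deployed release (an application convention, not an AWS feature):

```python
# Prevention sketch: alert when the DR region falls behind the primary.
# Hostnames and the /version response shape are placeholders.
import json
import urllib.request

REGION_ENDPOINTS = {
    "us-east-1": "https://us-east-1.app.example.com/version",
    "us-west-2": "https://us-west-2.app.example.com/version",
}

def deployed_version(url: str) -> str:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["release"]  # e.g. "2024.06.18-build412"

versions = {region: deployed_version(url) for region, url in REGION_ENDPOINTS.items()}
print(versions)

if len(set(versions.values())) > 1:
    # In practice, fail the CI job or page the owning team instead of printing.
    print("DR region is behind primary; trigger a cross-region deploy")
```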
Key Points
- AWS us-east-1 has had major outages roughly once a year. The 2021 outage lasted 8 hours and took down a significant chunk of the internet, including Amazon's own retail site
- Multi-AZ is not multi-region. AZ failures are common and well-handled. Regional outages are rarer but affect the entire control plane, including your ability to remediate
- The control plane vs data plane distinction matters. During an AWS outage, existing EC2 instances keep running (data plane) but you can't launch new ones (control plane)
- Multi-region adds latency, complexity, and cost. For most companies, the honest answer is accepting single-region risk and having a communication plan for outages
- Your status page should not be hosted on the same provider as your primary infrastructure. Use a static site on a different CDN
Common Mistakes
- ✗ Hosting your status page on the same infrastructure that's down. When AWS goes down, your AWS-hosted status page goes down too
- ✗ Having a DR region that hasn't been tested in 6 months and has outdated deployments, configurations, and data
- ✗ Trying to migrate data to another region during the outage instead of using pre-replicated data. Network calls to the affected region are unreliable during the outage
- ✗ Assuming provider SLAs mean the outage won't happen. 99.99% uptime still means 52 minutes of downtime per year