Deployment Strategies
Why It Exists
The deployment strategy sets the blast radius when things go wrong. It controls how fast the team can roll back, and it shapes whether the team ships with confidence or with dread.
A "stop the world, replace everything" deploy creates downtime. At scale, even a few seconds of unavailability costs real money. That is why progressive delivery exists: make deployments boring, automated, and reversible.
At staff level, treat the deployment strategy as an architecture decision. Weigh infrastructure cost, database compatibility windows, how mature the observability stack is, and how much risk the org will tolerate. There is no single right answer.
How It Works
Rolling Updates
This is the simplest progressive strategy. Instances are replaced one at a time (or in small batches). Kubernetes does this by default, with the pace controlled by maxSurge and maxUnavailable.
The catch: during the rollout, both old and new versions serve traffic at the same time. The application needs to handle version skew. If it cannot, pick a different strategy.
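As a concrete sketch of how that pace is tuned, here is a Kubernetes Deployment with the rollout knobs set explicitly; the service name, image, and probe path are placeholders, and the values are illustrative rather than a recommendation:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api            # hypothetical service name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2               # allow up to 2 extra pods above the desired count
      maxUnavailable: 0         # never dip below the desired count during the rollout
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: registry.example.com/checkout-api:v2   # placeholder image
          readinessProbe:        # traffic only reaches a new pod once it reports ready
            httpGet:
              path: /healthz
              port: 8080
```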
Blue-Green Deployments
Keep two identical environments. Deploy the new version to the idle one, run the smoke tests, then flip the load balancer. Rollback is instant. Just flip back.
The cost is obvious: you run double the infrastructure during the switch window. In practice, scale the idle environment way down between deployments to keep costs reasonable.
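One way to express this declaratively is Argo Rollouts' blueGreen strategy. A minimal sketch, with placeholder service names and an assumed manual promotion gate:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: checkout-api-active     # receives live traffic
      previewService: checkout-api-preview   # receives only smoke-test traffic
      autoPromotionEnabled: false            # a human or CI gate flips the switch
      scaleDownDelaySeconds: 300             # keep the old version warm for fast rollback
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: registry.example.com/checkout-api:v2   # placeholder image
```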
Canary Releases
Send a small slice of traffic (1-5%) to the new version. Watch error rates, latency percentiles (p50, p95, p99), and business metrics closely. If everything looks good against the predefined thresholds, ramp up traffic progressively.
Argo Rollouts and Flagger automate this well. Define AnalysisTemplates that query Prometheus or Datadog, and the tooling auto-promotes or auto-rolls back based on SLO compliance. The biggest win here is removing humans from the judgment loop during deploys.
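A minimal sketch of such a gate, assuming a Prometheus-backed AnalysisTemplate and an http_requests_total metric labeled by version; the metric name, address, and 99% threshold are assumptions, not prescriptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m              # re-evaluate every minute during the bake period
      count: 10
      failureLimit: 2           # two failed measurements trigger an automatic rollback
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # placeholder address
          query: |
            sum(rate(http_requests_total{service="checkout-api",version="canary",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="checkout-api",version="canary"}[5m]))
```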
Shadow / Traffic Mirroring
Duplicate production traffic to the new version, but never serve its responses to real users. This validates performance and correctness under real load with zero user impact.
Istio supports this natively through VirtualService mirror configuration, built on Envoy's request mirroring. I have found this most useful when making big changes to core services where the test suite may not cover all the edge cases production traffic will hit.
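A sketch of what mirroring looks like in an Istio VirtualService, assuming the v1 and v2 subsets are already defined in a DestinationRule and the service name is a placeholder:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api
spec:
  hosts:
    - checkout-api
  http:
    - route:
        - destination:
            host: checkout-api
            subset: v1            # real users keep getting v1 responses
      mirror:
        host: checkout-api
        subset: v2                # v2 receives a copy of each request
      mirrorPercentage:
        value: 100.0              # mirror everything; v2's responses are discarded
```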
Database Migration Patterns
The expand-contract pattern is non-negotiable for zero-downtime deployments:
- Expand: Add the new column or table alongside the old schema. New code writes to both.
- Migrate: Backfill existing data from the old structure to the new one.
- Contract: Remove old code paths and old schema once every instance runs the new version.
Never rename or drop a column in the same release as the code change that depends on it. I have seen this take down production more than once. It always looks safe in staging.
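As an illustration of the expand step, here is a Liquibase-style changelog sketch; the table, column, and changeset id are hypothetical. The key detail is that the new column stays nullable, so v1 code can keep writing rows that do not populate it:

```yaml
databaseChangeLog:
  - changeSet:
      id: 042-expand-add-customer-email   # hypothetical changeset id
      author: platform-team
      changes:
        - addColumn:
            tableName: orders
            columns:
              - column:
                  name: customer_email
                  type: varchar(255)
                  constraints:
                    nullable: true        # v1 instances never write this column
```

The contract changeset that drops the old column ships only after every instance runs code that no longer reads it.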
Feature Flags Architecture
Feature flags separate deployment from release. A flag evaluation service (LaunchDarkly, Unleash, or something custom backed by a distributed config store) decides at runtime which users see new functionality.
This is what makes trunk-based development actually work. Incomplete features go out behind flags, the team tests them in production with internal users, and gradually rolls them out to everyone else. It sounds simple. Getting the discipline right across a team is the hard part.
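For teams going the custom route, a flag definition backed by a config store can be as simple as a Kubernetes ConfigMap. The schema, flag names, and rollout rules below are purely illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags            # hypothetical custom flag store
  namespace: checkout
data:
  flags.yaml: |
    new-checkout-flow:
      default: false             # incomplete features default to off, no exceptions
      rollout:
        internal-users: true     # dogfood with employees first
        percentage: 5            # then a 5% slice of external traffic
    legacy-payment-path:
      default: true              # old path stays on until the new one proves out
```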
Production Considerations
Automate rollback triggers. Define SLO-based thresholds (error rate > 1%, p99 latency > 500ms) that halt and revert a deployment without waiting for a human to notice. Track deployment frequency as a first-class metric. Teams deploying multiple times per day consistently have lower change failure rates than teams deploying weekly. This is well-documented in the DORA research and matches everything I have seen in practice.
Make sure the monitoring stack can distinguish canary traffic from baseline. This typically means version-labeled metrics and dedicated dashboards per rollout. Without that, the canary is just a smaller blast radius, not an actual feedback mechanism.
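One way to wire that up, assuming metrics carry a version label, is a Prometheus Operator PrometheusRule that pages when the canary's 5xx rate exceeds double the stable baseline. The metric names, label values, and 2x threshold are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: canary-error-budget
spec:
  groups:
    - name: rollout.rules
      rules:
        - alert: CanaryErrorRateExceedsBaseline
          expr: |
            (
              sum(rate(http_requests_total{service="checkout-api",version="canary",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="checkout-api",version="canary"}[5m]))
            )
            >
            2 * (
              sum(rate(http_requests_total{service="checkout-api",version="stable",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="checkout-api",version="stable"}[5m]))
            )
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Canary error rate is more than double the stable baseline"
```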
Run deployment runbooks as game days before they are needed during an incident. The first time the team tests rollback should not be at 2 AM with customers affected and the on-call getting paged.
Failure Scenarios
Scenario 1: Canary Passes but Production Fails at Scale
The canary at 5% traffic looks healthy. The team promotes to 100%, and the new version exhausts the database connection pool. The canary's traffic was too small to stress shared resources like connection pools, cache clusters, or rate-limited third-party APIs.
What happens next: All instances hit connection limits at once, throwing 500s across the entire service. Downstream services cascade into failure as retries amplify load.
How to detect it: Monitor connection pool utilization (db_active_connections / db_max_connections), thread pool saturation, and external API quota usage, all segmented by deployment version. Alert when any shared resource crosses 80% during rollout.
How to fix it: Use staged canary with resource-aware analysis. Go 5% traffic, then 25% with a 10-minute bake time specifically watching shared resource metrics, then 50%, then 100%. Argo Rollouts AnalysisTemplate can query connection pool saturation from Prometheus and gate promotion on it.
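A sketch of those stages as Argo Rollouts canary steps; the shared-resource-saturation template name is hypothetical and would be defined as an AnalysisTemplate like the one shown earlier:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}     # bake while analysis watches shared resources
        - analysis:
            templates:
              - templateName: shared-resource-saturation   # hypothetical template
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
```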
Scenario 2: Rolling Deploy with Incompatible Schema Migration
A rolling update pushes v2 code that expects a new database column. During the rollout window, v1 instances (which do not write to the new column) coexist with v2 instances. A user hits v2, which writes to the new column. Their next request hits v1, which ignores it. Data inconsistency piles up silently.
What happens next: Business logic corruption. Orders missing fields, analytics data skewed, reconciliation failures surfacing days later when someone finally notices the numbers do not add up.
How to detect it: Add schema version assertions in health checks. Monitor NULL rates on new columns segmented by app version.
How to fix it: Enforce expand-contract migrations through CI linting. Reject any migration that removes or renames columns without a prior two-phase rollout. Tools like gh-ost (GitHub) or pg-osc handle zero-downtime schema changes with shadow table swaps.
Scenario 3: Feature Flag Evaluation Service Outage
The feature flag service (say, LaunchDarkly) goes unreachable during a deployment. All flag evaluations fall back to defaults. If the default was set to "enabled," an unfinished feature suddenly shows up for every user.
What happens next: Partial feature exposure, broken user experiences, possibly data corruption from half-implemented flows.
How to detect it: Monitor flag evaluation latency and error rate. Alert on flag_sdk_timeout_rate > 1%.
How to fix it: Configure local flag caching so the SDK can serve stale values for 5-15 minutes during an outage. Set safe defaults (always "off" for incomplete features, no exceptions). Add a circuit breaker that freezes the current flag state if the evaluation service is unreachable.
Capacity Planning
| Metric | Target | Formula / Guidance |
|---|---|---|
| Deployment frequency | Multiple per day per service | DORA Elite: on-demand deployment |
| Rollback time | < 5 min (blue-green), < 15 min (canary) | Measure from detection to full rollback |
| Blue-green infra cost | 2x during deploy, 1.1x at rest | Keep standby at 10% capacity, autoscale on switch |
| Canary bake time | 10-30 min per stage | total_deploy_time = stages x bake_time + promotion_time |
| Change failure rate | < 5% | Requires automated analysis on > 95% of deploys |
Scale references: Amazon deploys roughly every 11.7 seconds on average across all services. Netflix uses Spinnaker for 4,000+ deployments per day, with automated canary analysis comparing real-time metrics against a baseline using statistical methods (Mann-Whitney U tests on latency distributions). These numbers are impressive, but remember that these companies built years of custom tooling to get there. Start with what the team can actually operate.
Capacity formula for blue-green: extra_cost = baseline_infra_cost x deployment_frequency x avg_deploy_duration / 1440. For 10 deploys per day at 15 minutes each with a $50K/month baseline: $50K x 10 x 15 / 1440 = $5.2K/month (roughly 10% overhead). Compare that against the cost of a single outage, which typically runs $5K-$100K per hour for revenue-generating services. The math usually justifies the spend.
Architecture Decision Record
ADR: Choosing a Deployment Strategy
Context: Picking a deployment strategy that balances speed, safety, cost, and operational complexity based on how critical the service is and how mature the team is.
| Criteria (Weight) | Rolling | Blue-Green | Canary | Shadow |
|---|---|---|---|---|
| Rollback speed (20%) | 5 | 10 | 8 | N/A |
| Infrastructure cost (15%) | 9 | 4 | 7 | 3 |
| Blast radius control (25%) | 5 | 8 | 10 | 10 |
| Operational complexity (15%) | 9 | 6 | 4 | 3 |
| Database compatibility needs (15%) | 4 | 7 | 5 | 8 |
| Observability requirements (10%) | 5 | 6 | 9 | 9 |
| Weighted Score | 6.05 | 7.15 | 7.40 | 6.88 |
Shadow is scored over the five criteria that apply to it, with the rollback-speed weight redistributed proportionally.
When to pick what:
- Stateless API with mature observability (SLO coverage across more than 90% of endpoints): Go with canary and automated analysis. Best safety-to-complexity ratio. Use Argo Rollouts or Flagger with Prometheus-backed analysis templates.
- Monolith with a shared database: Blue-green with expand-contract migrations. Rolling updates are dangerous here because of the long rollout window with mixed versions hitting the same schema.
- New service pre-launch, unknown traffic patterns: Start with shadow/traffic mirroring to validate performance under real load, then switch to canary for the GA release.
- Cost-constrained team, low deployment frequency: Rolling updates. Minimal infrastructure overhead, acceptable risk for weekly deploys with good test coverage. Graduate to canary as deployment frequency increases and observability matures.
Key Points
- Blue-green deploys switch traffic between two identical environments for instant rollback
- Canary releases gradually shift traffic (1% to 5% to 25% to 100%) to catch issues early
- Rolling updates replace instances one at a time. This is the Kubernetes default.
- Feature flags decouple deployment from release. Deploy dark, then enable for specific users.
- Database migrations must be backward-compatible because old code runs alongside new code during rollout
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Argo Rollouts | Open Source | K8s-native canary/blue-green with analysis | Medium-Enterprise |
| Flagger | Open Source | Service mesh integration, automated canary | Medium-Enterprise |
| Spinnaker | Open Source | Multi-cloud deployment pipelines | Large-Enterprise |
| AWS CodeDeploy | Managed | EC2/ECS/Lambda deployments, traffic shifting | Small-Enterprise |
Common Mistakes
- Not testing rollback procedures. Rollback always fails when it matters most.
- Running incompatible database migrations that break old code during a rolling deploy
- Canary without automated analysis. Humans cannot watch dashboards 24/7.
- Blue-green without enough capacity. It takes 2x infrastructure while deploying.
- Ignoring deployment velocity. Infrequent large deploys are far riskier than frequent small ones.