Deployment Strategies
Why It Exists
The deployment strategy sets the blast radius when things go wrong. It controls how fast the team can roll back, and it shapes whether the team ships with confidence or with dread.
A "stop the world, replace everything" deploy creates downtime. At scale, even a few seconds of unavailability costs real money. That is why progressive delivery exists: make deployments boring, automated, and reversible.
At staff level, treat the deployment strategy as an architecture decision. Weigh infrastructure cost, database compatibility windows, how mature the observability stack is, and how much risk the org will tolerate. There is no single right answer.
How It Works
Rolling Updates
This is the simplest progressive strategy. Instances are replaced one at a time (or in small batches). Kubernetes does this by default, with the pace controlled by maxSurge and maxUnavailable.
The catch: during the rollout, both old and new versions serve traffic at the same time. The application needs to handle version skew. If it cannot, pick a different strategy.
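As a concrete sketch of how that pace is tuned, here is a Kubernetes Deployment with the rollout knobs set explicitly; the service name, image, and probe path are placeholders, and the values are illustrative rather than a recommendation:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api            # hypothetical service name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2               # allow up to 2 extra pods above the desired count
      maxUnavailable: 0         # never dip below the desired count during the rollout
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: registry.example.com/checkout-api:v2   # placeholder image
          readinessProbe:        # traffic only reaches a new pod once it reports ready
            httpGet:
              path: /healthz
              port: 8080
```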
Blue-Green Deployments
Keep two identical environments. Deploy the new version to the idle one, run the smoke tests, then flip the load balancer. Rollback is instant. Just flip back.
The cost is obvious: you run double the infrastructure during the switch window. In practice, scale the idle environment way down between deployments to keep costs reasonable.
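One way to express this declaratively is Argo Rollouts' blueGreen strategy. A minimal sketch, with placeholder service names and an assumed manual promotion gate:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: checkout-api-active     # receives live traffic
      previewService: checkout-api-preview   # receives only smoke-test traffic
      autoPromotionEnabled: false            # a human or CI gate flips the switch
      scaleDownDelaySeconds: 300             # keep the old version warm for fast rollback
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: registry.example.com/checkout-api:v2   # placeholder image
```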
Canary Releases
Send a small slice of traffic (1-5%) to the new version. Watch error rates, latency percentiles (p50, p95, p99), and business metrics closely. If everything looks good against the predefined thresholds, ramp up traffic progressively.
Argo Rollouts and Flagger automate this well. Define AnalysisTemplates that query Prometheus or Datadog, and the tooling auto-promotes or auto-rolls back based on SLO compliance. The biggest win here is removing humans from the judgment loop during deploys.
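A minimal sketch of such a gate, assuming a Prometheus-backed AnalysisTemplate and an http_requests_total metric labeled by version; the metric name, address, and 99% threshold are assumptions, not prescriptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m              # re-evaluate every minute during the bake period
      count: 10
      failureLimit: 2           # two failed measurements trigger an automatic rollback
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # placeholder address
          query: |
            sum(rate(http_requests_total{service="checkout-api",version="canary",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="checkout-api",version="canary"}[5m]))
```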
Shadow / Traffic Mirroring
Duplicate production traffic to the new version, but never serve its responses to real users. This validates performance and correctness under real load with zero user impact.
Istio supports this natively through VirtualService mirror configuration, built on Envoy's request mirroring. I have found this most useful when making big changes to core services where the test suite may not cover all the edge cases production traffic will hit.
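A sketch of what mirroring looks like in an Istio VirtualService, assuming the v1 and v2 subsets are already defined in a DestinationRule and the service name is a placeholder:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api
spec:
  hosts:
    - checkout-api
  http:
    - route:
        - destination:
            host: checkout-api
            subset: v1            # real users keep getting v1 responses
      mirror:
        host: checkout-api
        subset: v2                # v2 receives a copy of each request
      mirrorPercentage:
        value: 100.0              # mirror everything; v2's responses are discarded
```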
Database Migration Patterns
The expand-contract pattern is non-negotiable for zero-downtime deployments:
- Expand: Add the new column or table alongside the old schema. New code writes to both.
- Migrate: Backfill existing data from the old structure to the new one.
- Contract: Remove old code paths and old schema once every instance runs the new version.
Never rename or drop a column in the same release as the code change that depends on it. I have seen this take down production more than once. It always looks safe in staging.
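As an illustration of the expand step, here is a Liquibase-style changelog sketch; the table, column, and changeset id are hypothetical. The key detail is that the new column stays nullable, so v1 code can keep writing rows that do not populate it:

```yaml
databaseChangeLog:
  - changeSet:
      id: 042-expand-add-customer-email   # hypothetical changeset id
      author: platform-team
      changes:
        - addColumn:
            tableName: orders
            columns:
              - column:
                  name: customer_email
                  type: varchar(255)
                  constraints:
                    nullable: true        # v1 instances never write this column
```

The contract changeset that drops the old column ships only after every instance runs code that no longer reads it.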
Feature Flags Architecture
Feature flags separate deployment from release. A flag evaluation service (LaunchDarkly, Unleash, or something custom backed by a distributed config store) decides at runtime which users see new functionality.
This is what makes trunk-based development actually work. Incomplete features go out behind flags, the team tests them in production with internal users, and gradually rolls them out to everyone else. It sounds simple. Getting the discipline right across a team is the hard part.
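For teams going the custom route, a flag definition backed by a config store can be as simple as a Kubernetes ConfigMap. The schema, flag names, and rollout rules below are purely illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags            # hypothetical custom flag store
  namespace: checkout
data:
  flags.yaml: |
    new-checkout-flow:
      default: false             # incomplete features default to off, no exceptions
      rollout:
        internal-users: true     # dogfood with employees first
        percentage: 5            # then a 5% slice of external traffic
    legacy-payment-path:
      default: true              # old path stays on until the new one proves out
```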
Production Considerations
Automate rollback triggers. Define SLO-based thresholds (error rate > 1%, p99 latency > 500ms) that halt and revert a deployment without waiting for a human to notice. Track deployment frequency as a first-class metric. Teams deploying multiple times per day consistently have lower change failure rates than teams deploying weekly. This is well-documented in the DORA research and matches everything I have seen in practice.
Make sure the monitoring stack can distinguish canary traffic from baseline. This typically means version-labeled metrics and dedicated dashboards per rollout. Without that, the canary is just a smaller blast radius, not an actual feedback mechanism.
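One way to wire that up, assuming metrics carry a version label, is a Prometheus Operator PrometheusRule that pages when the canary's 5xx rate exceeds double the stable baseline. The metric names, label values, and 2x threshold are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: canary-error-budget
spec:
  groups:
    - name: rollout.rules
      rules:
        - alert: CanaryErrorRateExceedsBaseline
          expr: |
            (
              sum(rate(http_requests_total{service="checkout-api",version="canary",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="checkout-api",version="canary"}[5m]))
            )
            >
            2 * (
              sum(rate(http_requests_total{service="checkout-api",version="stable",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="checkout-api",version="stable"}[5m]))
            )
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Canary error rate is more than double the stable baseline"
```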
Run deployment runbooks as game days before they are needed during an incident. The first time the team tests rollback should not be at 2 AM with customers affected and the on-call getting paged.
Failure Scenarios
Scenario 1: Canary Passes but Production Fails at Scale
The canary at 5% traffic looks healthy. The team promotes to 100%, and the new version exhausts the database connection pool. The canary's traffic was too small to stress shared resources like connection pools, cache clusters, or rate-limited third-party APIs.
What happens next: All instances hit connection limits at once, throwing 500s across the entire service. Downstream services cascade into failure as retries amplify load.
How to detect it: Monitor connection pool utilization (db_active_connections / db_max_connections), thread pool saturation, and external API quota usage, all segmented by deployment version. Alert when any shared resource crosses 80% during rollout.
How to fix it: Use staged canary with resource-aware analysis. Go 5% traffic, then 25% with a 10-minute bake time specifically watching shared resource metrics, then 50%, then 100%. Argo Rollouts AnalysisTemplate can query connection pool saturation from Prometheus and gate promotion on it.
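A sketch of those stages as Argo Rollouts canary steps; the shared-resource-saturation template name is hypothetical and would be defined as an AnalysisTemplate like the one shown earlier:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}     # bake while analysis watches shared resources
        - analysis:
            templates:
              - templateName: shared-resource-saturation   # hypothetical template
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
```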
Scenario 2: Rolling Deploy with Incompatible Schema Migration
A rolling update pushes v2 code that expects a new database column. During the rollout window, v1 instances (which do not write to the new column) coexist with v2 instances. A user hits v2, which writes to the new column. Their next request hits v1, which ignores it. Data inconsistency piles up silently.
What happens next: Business logic corruption. Orders missing fields, analytics data skewed, reconciliation failures surfacing days later when someone finally notices the numbers do not add up.
How to detect it: Add schema version assertions in health checks. Monitor NULL rates on new columns segmented by app version.
How to fix it: Enforce expand-contract migrations through CI linting. Reject any migration that removes or renames columns without a prior two-phase rollout. Tools like gh-ost (GitHub) or pg-osc handle zero-downtime schema changes with shadow table swaps.
Scenario 3: Feature Flag Evaluation Service Outage
The feature flag service (say, LaunchDarkly) goes unreachable during a deployment. All flag evaluations fall back to defaults. If the default was set to "enabled," an unfinished feature suddenly shows up for every user.
What happens next: Partial feature exposure, broken user experiences, possibly data corruption from half-implemented flows.
How to detect it: Monitor flag evaluation latency and error rate. Alert on flag_sdk_timeout_rate > 1%.
How to fix it: Configure local flag caching so the SDK can serve stale values for 5-15 minutes during an outage. Set safe defaults (always "off" for incomplete features, no exceptions). Add a circuit breaker that freezes the current flag state if the evaluation service is unreachable.
Capacity Planning
| Metric | Target | Formula / Guidance |
|---|---|---|
| Deployment frequency | Multiple per day per service | DORA Elite: on-demand deployment |
| Rollback time | < 5 min (blue-green), < 15 min (canary) | Measure from detection to full rollback |
| Blue-green infra cost | 2x during deploy, 1.1x at rest | Keep standby at 10% capacity, autoscale on switch |
| Canary bake time | 10-30 min per stage | total_deploy_time = stages x bake_time + promotion_time |
| Change failure rate | < 5% | Requires automated analysis on > 95% of deploys |
Scale references: Amazon deploys roughly every 11.7 seconds on average across all services. Netflix uses Spinnaker for 4,000+ deployments per day, with automated canary analysis comparing real-time metrics against a baseline using statistical methods (Mann-Whitney U tests on latency distributions). These numbers are impressive, but remember that these companies built years of custom tooling to get there. Start with what the team can actually operate.
Capacity formula for blue-green: extra_cost = baseline_infra_cost x deployment_frequency x avg_deploy_duration / 1440. For 10 deploys per day at 15 minutes each with a $50K/month baseline: $50K x 10 x 15 / 1440 = $5.2K/month (roughly 10% overhead). Compare that against the cost of a single outage, which typically runs $5K-$100K per hour for revenue-generating services. The math usually justifies the spend.
Architecture Decision Record
ADR: Choosing a Deployment Strategy
Context: Picking a deployment strategy that balances speed, safety, cost, and operational complexity based on how critical the service is and how mature the team is.
| Criteria (Weight) | Rolling | Blue-Green | Canary | Shadow |
|---|---|---|---|---|
| Rollback speed (20%) | 5 | 10 | 8 | N/A |
| Infrastructure cost (15%) | 9 | 4 | 7 | 3 |
| Blast radius control (25%) | 5 | 8 | 10 | 10 |
| Operational complexity (15%) | 9 | 6 | 4 | 3 |
| Database compatibility needs (15%) | 4 | 7 | 5 | 8 |
| Observability requirements (10%) | 5 | 6 | 9 | 9 |
| Weighted Score | 6.05 | 7.15 | 7.40 | 6.88 |
Shadow is scored over the five criteria that apply to it, with the rollback-speed weight redistributed proportionally.
When to pick what:
- Stateless API with mature observability (SLO coverage across more than 90% of endpoints): Go with canary and automated analysis. Best safety-to-complexity ratio. Use Argo Rollouts or Flagger with Prometheus-backed analysis templates.
- Monolith with a shared database: Blue-green with expand-contract migrations. Rolling updates are dangerous here because of the long rollout window with mixed versions hitting the same schema.
- New service pre-launch, unknown traffic patterns: Start with shadow/traffic mirroring to validate performance under real load, then switch to canary for the GA release.
- Cost-constrained team, low deployment frequency: Rolling updates. Minimal infrastructure overhead, acceptable risk for weekly deploys with good test coverage. Graduate to canary as deployment frequency increases and observability matures.
Key Points
- Blue-green deploys switch traffic between two identical environments for instant rollback
- Canary releases gradually shift traffic (1% to 5% to 25% to 100%) to catch issues early
- Rolling updates replace instances one at a time. This is the Kubernetes default.
- Feature flags decouple deployment from release. Deploy dark, then enable for specific users.
- Database migrations must be backward-compatible because old code runs alongside new code during rollout
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Argo Rollouts | Open Source | K8s-native canary/blue-green with analysis | Medium-Enterprise |
| Flagger | Open Source | Service mesh integration, automated canary | Medium-Enterprise |
| Spinnaker | Open Source | Multi-cloud deployment pipelines | Large-Enterprise |
| AWS CodeDeploy | Managed | EC2/ECS/Lambda deployments, traffic shifting | Small-Enterprise |
Common Mistakes
- Not testing rollback procedures. Rollback always fails when it matters most.
- Running incompatible database migrations that break old code during a rolling deploy
- Canary without automated analysis. Humans cannot watch dashboards 24/7.
- Blue-green without enough capacity. It takes 2x infrastructure while deploying.
- Ignoring deployment velocity. Infrequent large deploys are far riskier than frequent small ones.