Zero-Downtime Deployment Patterns
Deployment Strategies
Zero-downtime deployment is not a single technique. It is a combination of deployment strategy, database migration approach, and traffic management working together.
Blue-green: You maintain two identical production environments. Blue runs the current version. Green gets the new version deployed and validated. Once green is healthy, you switch the load balancer to point at green. Rollback means switching back to blue. AWS makes this straightforward with target groups and weighted routing. The downside: you pay for double the compute during the deployment window.
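The traffic switch at the heart of blue-green can be modeled as a single pointer flip that only happens after the target environment passes validation. A minimal sketch; the `LoadBalancer` and `Environment` classes are illustrative stand-ins, not an AWS API:

```python
class Environment:
    """One of the two identical production environments."""

    def __init__(self, version, healthy=True):
        self.version = version
        self._healthy = healthy

    def healthy(self):
        return self._healthy

    def handle(self, request):
        return f"{self.version} handled {request}"


class LoadBalancer:
    """Routes all traffic to whichever environment is currently live."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"  # blue serves current production traffic

    def route(self, request):
        return self.environments[self.live].handle(request)

    def switch_to(self, name):
        # Refuse to flip traffic unless the target passes health checks.
        if not self.environments[name].healthy():
            raise RuntimeError(f"{name} failed validation; staying on {self.live}")
        self.live = name  # rollback is simply switch_to("blue") again
```

Because the flip is a single assignment, both cutover and rollback are effectively instant; the old environment keeps running untouched until you decommission it.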
Canary: Route 1-5% of traffic to the new version. Monitor error rates, latency percentiles (p50, p95, p99), and business metrics for 10-30 minutes. If everything looks good, gradually increase to 25%, 50%, and 100%. Kubernetes supports this natively with Istio or Flagger. Canary catches problems that blue-green misses because real user traffic exercises edge cases that health checks do not.
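The promote-or-rollback decision at each canary stage reduces to a guard over the monitored metrics. A hedged sketch with illustrative thresholds; real limits should come from your SLOs, and the metric names are assumptions:

```python
CANARY_STAGES = [1, 5, 25, 50, 100]  # percent of traffic on the new version

def safe_to_promote(canary, baseline):
    """Compare the canary's metrics against the stable version's baseline."""
    return (
        canary["error_rate"] <= baseline["error_rate"] * 1.1  # <= 10% worse
        and canary["p99_ms"] <= baseline["p99_ms"] * 1.2      # <= 20% slower
    )

def next_stage(current_pct, canary, baseline):
    """Return the next traffic percentage, or 0 (full rollback) on regression."""
    if not safe_to_promote(canary, baseline):
        return 0  # shift all traffic back to the stable version
    higher = [s for s in CANARY_STAGES if s > current_pct]
    return higher[0] if higher else 100
```

Comparing against a live baseline rather than fixed thresholds matters: if the whole system is degraded (say, a slow downstream), the canary should not be blamed for latency the stable version also exhibits.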
Rolling: Replace instances a few at a time. A Kubernetes Deployment uses a rolling update by default; for strict zero downtime, set maxUnavailable: 0 and maxSurge: 1 so a new pod must become ready before an old one is terminated. It is the most resource-efficient strategy but has the slowest rollback, since rolling back means replacing every instance all over again.
The Expand-Contract Pattern for Databases
Schema changes break zero-downtime deployments more than anything else. The expand-contract pattern splits every migration into three phases.
Expand: Add the new column or table without removing anything. Both old and new application code can work with this schema. Deploy this migration first and verify it is safe.
Migrate: Deploy new application code that writes to both old and new columns. Backfill existing data. This phase is where you run data migrations, ideally in batches to avoid locking.
Contract: Once all data is migrated and no old application instances remain, remove the old column. This is a separate deployment days or weeks later.
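The three phases can be sketched against a toy schema. This uses SQLite so the example is self-contained; the `users` table, column names, and batch size are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("Ada@Example.com",), ("Bob@EXAMPLE.com",), ("eve@example.com",)])

# Expand: add the new column without removing anything.
# Old code ignores it; new code can start writing to it.
conn.execute("ALTER TABLE users ADD COLUMN email_lower TEXT")

# Migrate: backfill existing rows in small batches so no single
# statement holds locks for long under production load.
BATCH = 2
while True:
    rows = conn.execute(
        "SELECT id, email FROM users WHERE email_lower IS NULL LIMIT ?",
        (BATCH,),
    ).fetchall()
    if not rows:
        break
    conn.executemany(
        "UPDATE users SET email_lower = ? WHERE id = ?",
        [(email.lower(), row_id) for row_id, email in rows],
    )
    conn.commit()  # each batch commits independently, releasing locks

# Contract (a separate deployment, days or weeks later, once no old
# application instances remain):
#   ALTER TABLE users DROP COLUMN email;
```

The contract step is deliberately left as a comment: running it in the same deployment as the backfill defeats the purpose of the pattern.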
GitHub uses this approach for every schema change on their 1.5TB+ MySQL databases. It is slower than a single ALTER TABLE but eliminates the risk of locking tables under production load.
Feature Flags for Safe Rollouts
Feature flags decouple deployment from release. You deploy code to production behind a flag, then enable it gradually. If something breaks, you flip the flag without redeploying.
This is particularly valuable for risky changes that affect data. Deploy the new code path, enable it for internal users, validate, enable for 1% of production users, validate again, and then ramp to 100%. LaunchDarkly, Unleash, and even a simple database-backed flag system work for this.
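A database-backed flag system only needs a deterministic way to bucket users, so that each user gets a stable decision as the rollout percentage ramps. A minimal sketch; the hashing scheme and function names are illustrative:

```python
import hashlib

def flag_enabled(flag_name, user_id, rollout_pct):
    """Deterministically bucket a user into [0, 100) for gradual rollout."""
    key = f"{flag_name}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_pct
```

Hashing the flag name together with the user ID means each flag gets an independent bucketing, and `bucket < rollout_pct` makes the ramp monotone: a user enabled at 5% stays enabled at 25% and 100%, so no one flip-flops between code paths mid-rollout.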
Health Check Design
A liveness check answers "is the process running?" A readiness check answers "can this instance serve traffic?" They are different and must be separate endpoints.
Readiness checks should verify database connectivity, cache connectivity, and any required downstream service. If any dependency is unreachable, the instance should report not ready and stop receiving traffic. Kubernetes will remove it from the service endpoints until it recovers.
Set readiness check timeouts aggressively. If your readiness check takes more than 2 seconds, something is wrong with your dependencies and you should not be serving traffic.
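The two probes can be sketched as separate functions; the check registry and return shape below are assumptions for illustration, not any specific framework's API:

```python
import time

def liveness():
    """Is the process running? No dependency calls; always cheap."""
    return {"status": "ok"}

def readiness(checks, timeout=2.0):
    """Can this instance serve traffic? Probe every dependency; any
    failure, exception, or slow check means 'not ready'."""
    for name, check in checks.items():
        start = time.monotonic()
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok or time.monotonic() - start > timeout:
            return {"status": "unavailable", "failed": name}
    return {"status": "ok"}
```

A typical registry would be something like `{"database": ping_db, "cache": ping_redis}`, each entry a cheap callable against a real dependency. Note that liveness deliberately checks nothing: if it probed the database, a database outage would cause the orchestrator to restart every instance instead of merely draining traffic from them.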
Key Points
- Blue-green deployments give you instant rollback by switching traffic between two identical environments. The cost is maintaining double the infrastructure during deployment
- Canary releases route a small percentage of traffic (1-5%) to the new version first, validating metrics before full rollout. This catches issues that staging environments miss
- Database schema changes are the hardest part of zero-downtime deploys. The expand-contract pattern splits migrations into backward-compatible steps
- Health checks must verify application readiness, not just process liveness. A container that starts but cannot connect to its database should not receive traffic
- Every deployment needs a rollback plan that can execute in under 60 seconds. If rollback requires a database migration, it is not a real rollback plan
Common Mistakes
- Running ALTER TABLE statements that lock the table during deployment. Before version 11, PostgreSQL rewrites the entire table for ADD COLUMN with a DEFAULT value, holding an exclusive lock for the duration
- Deploying database migrations and application code in the same step. Separate them so the old code works with the new schema and vice versa
- Skipping canary validation metrics. Deploying to canary and immediately promoting to 100% defeats the purpose
- Not testing rollback procedures. The first time you roll back should not be during an incident