Zero-Downtime Deployment Patterns
Deployment Strategies
Zero-downtime deployment is not a single technique. It is a combination of deployment strategy, database migration approach, and traffic management working together.
Blue-green: You maintain two identical production environments. Blue runs the current version. Green gets the new version deployed and validated. Once green is healthy, you switch the load balancer to point at green. Rollback means switching back to blue. AWS makes this straightforward with target groups and weighted routing. The downside: you pay for double the compute during the deployment window.
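The traffic switch at the heart of blue-green can be modeled as a single pointer flip that only happens after the target environment passes validation. A minimal sketch; the `LoadBalancer` and `Environment` classes are illustrative stand-ins, not an AWS API:

```python
class Environment:
    """One of the two identical production environments."""

    def __init__(self, version, healthy=True):
        self.version = version
        self._healthy = healthy

    def healthy(self):
        return self._healthy

    def handle(self, request):
        return f"{self.version} handled {request}"


class LoadBalancer:
    """Routes all traffic to whichever environment is currently live."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"  # blue serves current production traffic

    def route(self, request):
        return self.environments[self.live].handle(request)

    def switch_to(self, name):
        # Refuse to flip traffic unless the target passes health checks.
        if not self.environments[name].healthy():
            raise RuntimeError(f"{name} failed validation; staying on {self.live}")
        self.live = name  # rollback is simply switch_to("blue") again
```

Because the flip is a single assignment, both cutover and rollback are effectively instant; the old environment keeps running untouched until you decommission it.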
Canary: Route 1-5% of traffic to the new version. Monitor error rates, latency percentiles (p50, p95, p99), and business metrics for 10-30 minutes. If everything looks good, gradually increase to 25%, 50%, and 100%. Kubernetes supports this natively with Istio or Flagger. Canary catches problems that blue-green misses because real user traffic exercises edge cases that health checks do not.
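The promote-or-rollback decision at each canary stage reduces to a guard over the monitored metrics. A hedged sketch with illustrative thresholds; real limits should come from your SLOs, and the metric names are assumptions:

```python
CANARY_STAGES = [1, 5, 25, 50, 100]  # percent of traffic on the new version

def safe_to_promote(canary, baseline):
    """Compare the canary's metrics against the stable version's baseline."""
    return (
        canary["error_rate"] <= baseline["error_rate"] * 1.1  # <= 10% worse
        and canary["p99_ms"] <= baseline["p99_ms"] * 1.2      # <= 20% slower
    )

def next_stage(current_pct, canary, baseline):
    """Return the next traffic percentage, or 0 (full rollback) on regression."""
    if not safe_to_promote(canary, baseline):
        return 0  # shift all traffic back to the stable version
    higher = [s for s in CANARY_STAGES if s > current_pct]
    return higher[0] if higher else 100
```

Comparing against a live baseline rather than fixed thresholds matters: if the whole system is degraded (say, a slow downstream), the canary should not be blamed for latency the stable version also exhibits.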
Rolling: Replace instances a few at a time. A Kubernetes Deployment uses a rolling update by default; for strict zero downtime, set maxUnavailable: 0 and maxSurge: 1 so a new pod must become ready before an old one is terminated. It is the most resource-efficient strategy but has the slowest rollback, since rolling back means replacing every instance all over again.
The Expand-Contract Pattern for Databases
Schema changes break zero-downtime deployments more than anything else. The expand-contract pattern splits every migration into three phases.
Expand: Add the new column or table without removing anything. Both old and new application code can work with this schema. Deploy this migration first and verify it is safe.
Migrate: Deploy new application code that writes to both old and new columns. Backfill existing data. This phase is where you run data migrations, ideally in batches to avoid locking.
Contract: Once all data is migrated and no old application instances remain, remove the old column. This is a separate deployment days or weeks later.
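The three phases can be sketched against a toy schema. This uses SQLite so the example is self-contained; the `users` table, column names, and batch size are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("Ada@Example.com",), ("Bob@EXAMPLE.com",), ("eve@example.com",)])

# Expand: add the new column without removing anything.
# Old code ignores it; new code can start writing to it.
conn.execute("ALTER TABLE users ADD COLUMN email_lower TEXT")

# Migrate: backfill existing rows in small batches so no single
# statement holds locks for long under production load.
BATCH = 2
while True:
    rows = conn.execute(
        "SELECT id, email FROM users WHERE email_lower IS NULL LIMIT ?",
        (BATCH,),
    ).fetchall()
    if not rows:
        break
    conn.executemany(
        "UPDATE users SET email_lower = ? WHERE id = ?",
        [(email.lower(), row_id) for row_id, email in rows],
    )
    conn.commit()  # each batch commits independently, releasing locks

# Contract (a separate deployment, days or weeks later, once no old
# application instances remain):
#   ALTER TABLE users DROP COLUMN email;
```

The contract step is deliberately left as a comment: running it in the same deployment as the backfill defeats the purpose of the pattern.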
GitHub uses this approach for every schema change on their 1.5TB+ MySQL databases. It is slower than a single ALTER TABLE but eliminates the risk of locking tables under production load.
Feature Flags for Safe Rollouts
Feature flags decouple deployment from release. You deploy code to production behind a flag, then enable it gradually. If something breaks, you flip the flag without redeploying.
This is particularly valuable for risky changes that affect data. Deploy the new code path, enable it for internal users, validate, enable for 1% of production users, validate again, and then ramp to 100%. LaunchDarkly, Unleash, and even a simple database-backed flag system work for this.
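A database-backed flag system only needs a deterministic way to bucket users, so that each user gets a stable decision as the rollout percentage ramps. A minimal sketch; the hashing scheme and function names are illustrative:

```python
import hashlib

def flag_enabled(flag_name, user_id, rollout_pct):
    """Deterministically bucket a user into [0, 100) for gradual rollout."""
    key = f"{flag_name}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_pct
```

Hashing the flag name together with the user ID means each flag gets an independent bucketing, and `bucket < rollout_pct` makes the ramp monotone: a user enabled at 5% stays enabled at 25% and 100%, so no one flip-flops between code paths mid-rollout.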
Health Check Design
A liveness check answers "is the process running?" A readiness check answers "can this instance serve traffic?" They are different and must be separate endpoints.
Readiness checks should verify database connectivity, cache connectivity, and any required downstream service. If any dependency is unreachable, the instance should report not ready and stop receiving traffic. Kubernetes will remove it from the service endpoints until it recovers.
Set readiness check timeouts aggressively. If your readiness check takes more than 2 seconds, something is wrong with your dependencies and you should not be serving traffic.
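The two probes can be sketched as separate functions; the check registry and return shape below are assumptions for illustration, not any specific framework's API:

```python
import time

def liveness():
    """Is the process running? No dependency calls; always cheap."""
    return {"status": "ok"}

def readiness(checks, timeout=2.0):
    """Can this instance serve traffic? Probe every dependency; any
    failure, exception, or slow check means 'not ready'."""
    for name, check in checks.items():
        start = time.monotonic()
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok or time.monotonic() - start > timeout:
            return {"status": "unavailable", "failed": name}
    return {"status": "ok"}
```

A typical registry would be something like `{"database": ping_db, "cache": ping_redis}`, each entry a cheap callable against a real dependency. Note that liveness deliberately checks nothing: if it probed the database, a database outage would cause the orchestrator to restart every instance instead of merely draining traffic from them.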
Key Points
- Blue-green deployments give you instant rollback by switching traffic between two identical environments. The cost is maintaining double the infrastructure during deployment
- Canary releases route a small percentage of traffic (1-5%) to the new version first, validating metrics before full rollout. This catches issues that staging environments miss
- Database schema changes are the hardest part of zero-downtime deploys. The expand-contract pattern splits migrations into backward-compatible steps
- Health checks must verify application readiness, not just process liveness. A container that starts but cannot connect to its database should not receive traffic
- Every deployment needs a rollback plan that can execute in under 60 seconds. If rollback requires a database migration, it is not a real rollback plan
Common Mistakes
- Running ALTER TABLE statements that lock the table during deployment. Before version 11, PostgreSQL rewrites the entire table for ADD COLUMN with a DEFAULT value, holding an exclusive lock for the duration
- Deploying database migrations and application code in the same step. Separate them so the old code works with the new schema and vice versa
- Skipping canary validation metrics. Deploying to canary and immediately promoting to 100% defeats the purpose
- Not testing rollback procedures. The first time you roll back should not be during an incident