Database Migration Failures
Why Migrations Break Production
Database migrations are among the most common causes of self-inflicted outages. The pattern is almost always the same: a migration works perfectly in development and staging, then locks a production table for 45 minutes because nobody tested it against real data volumes.
The root problem is that DDL operations (ALTER TABLE, CREATE INDEX, ADD CONSTRAINT) behave fundamentally differently on small tables than on large ones. On a 1,000-row table, adding a column is instant. On a 500-million-row table, the same operation can rewrite the entire table on disk.
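That makes "how big is this table, really?" the first pre-migration question. A quick way to answer it in PostgreSQL (a sketch; the orders table is illustrative):

```sql
-- Total on-disk size of the table, including its indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('orders'));

-- Approximate row count from the planner's statistics; avoids a full
-- scan on a huge table.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'orders';
```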
Lock Behavior by Database
MySQL before 5.6 took a full table lock on most ALTER TABLE operations, leaving the table effectively inaccessible for the duration. Versions 5.6 and 5.7 made many operations online (ALGORITHM=INPLACE), though they still rebuild the table. MySQL 8.0+ supports instant DDL for some operations (adding columns, renaming columns) but not others (changing column types, adding indexes).
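One useful habit in MySQL 8.0+ is to request the algorithm explicitly, so a migration that can't run instantly fails fast instead of silently falling back to a table rebuild. A sketch, with illustrative table and column names:

```sql
-- MySQL 8.0+: request instant DDL explicitly. If the operation cannot
-- be done instantly, the statement errors out instead of rebuilding.
ALTER TABLE orders
  ADD COLUMN priority TINYINT NULL,
  ALGORITHM = INSTANT;

-- A column type change cannot be instant; requesting INSTANT makes the
-- statement fail immediately rather than quietly taking a
-- table-rebuilding code path in production.
ALTER TABLE orders
  MODIFY COLUMN amount DECIMAL(12,2) NOT NULL,
  ALGORITHM = INSTANT;
```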
PostgreSQL is better in some ways, worse in others. Adding a nullable column is instant. Adding a column with a constant DEFAULT has been instant since PG 11. But adding a foreign key constraint scans both tables while holding locks on each. CREATE INDEX blocks writes (though not reads) unless you use the CONCURRENTLY variant, and CONCURRENTLY can't run inside a transaction.
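Both problems have standard workarounds. A minimal PostgreSQL sketch (names are illustrative): build the index concurrently, and split the foreign key into a fast NOT VALID creation followed by a later validation pass that doesn't block normal traffic:

```sql
-- Build the index without blocking writes. CONCURRENTLY cannot run
-- inside a transaction block, so run it as its own statement.
CREATE INDEX CONCURRENTLY idx_orders_customer_id
    ON orders (customer_id);

-- NOT VALID skips the full-table scan when the constraint is added;
-- existing rows are checked later by VALIDATE, which takes a weaker
-- lock and runs alongside normal reads and writes.
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
```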
The gh-ost Approach
GitHub's gh-ost tool changed how large MySQL deployments handle migrations. Instead of ALTER TABLE, it creates a ghost table with the new schema, copies existing rows in small batches while replaying ongoing writes from the binary log, and then does an atomic rename. The migration runs for hours but never locks the original table for more than a brief cut-over.
For PostgreSQL, pg_repack uses a similar shadow-table approach to rebuild tables online. pgroll from Xata is a newer option that adds version-aware schema management on top. These tools trade migration speed for zero downtime: a migration that would lock the table for 20 minutes instead runs for 2 hours with zero user impact.
Expand-Contract Pattern
The safest migration strategy splits every change into two deployments. First deployment (expand): add the new column as nullable, deploy code that writes to both old and new columns, backfill existing rows. Second deployment (contract): add the NOT NULL constraint, remove the old column, deploy code that only uses the new column.
This doubles the number of deployments but eliminates downtime risk. Each individual step is either instant or can be done with online tools.
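A sketch of what the two deployments might look like in SQL (PostgreSQL syntax; the users table and column names are hypothetical):

```sql
-- Expand (deployment 1): the new column is nullable, so this is instant.
ALTER TABLE users ADD COLUMN email_normalized TEXT;

-- Backfill in small batches; rerun until zero rows are updated, so no
-- single statement holds locks on millions of rows at once.
UPDATE users
SET email_normalized = lower(email)
WHERE id IN (
    SELECT id FROM users
    WHERE email_normalized IS NULL
    LIMIT 10000
);

-- Contract (deployment 2), only after all deployed code reads and
-- writes the new column. (SET NOT NULL scans the table once to verify
-- existing rows.)
ALTER TABLE users ALTER COLUMN email_normalized SET NOT NULL;
ALTER TABLE users DROP COLUMN email;
```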
When Migrations Fail Midway
A killed migration is worse than a slow migration. If you kill an ALTER TABLE in MySQL, the rollback can take as long as the migration itself. The table remains locked during rollback. In PostgreSQL, killing a CREATE INDEX CONCURRENTLY leaves an invalid index that you have to manually drop.
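For the PostgreSQL case, the leftover index is marked invalid in the system catalogs and can be found and dropped before retrying. A sketch; the index name is whatever your failed statement used:

```sql
-- List indexes left invalid by a killed CREATE INDEX CONCURRENTLY.
SELECT c.relname AS index_name
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- Drop it without blocking writes, then retry the build.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;
```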
Before starting any migration, have the rollback plan written down. Know the exact commands. Know how long rollback will take. Know which application version is compatible with the pre-migration schema. If you can't answer these questions, you're not ready to run the migration.
Incident Timeline
- T+0m: ALTER TABLE starts on a 500GB table during deployment. Migration tool estimates 2 minutes. The actual lock acquisition begins blocking all writes.
- T+2m: Write latency spikes to 30 seconds. Connection pool fills up as queries queue behind the DDL lock. Application threads start timing out.
- T+5m: Health checks fail due to database timeouts. Load balancer marks instances unhealthy. Customer-facing 500 errors begin.
- T+10m: Team kills the migration query, but the rollback of the partial ALTER takes another 8 minutes. Database remains locked.
- T+15m: Migration rollback completes. Connections drain and refill. Application recovers gradually as connection pools reset.
- T+30m: Post-incident: team discovers the migration worked in staging because the staging table had 10,000 rows, not 500 million.
Detection Signals
- Database connection pool utilization exceeding 80% during deployment windows
- "Lock wait timeout exceeded" errors in application logs
- Sudden spike in query duration for the affected table (p99 latency jump)
- Replication lag increasing on read replicas during migration execution
Prevention
- Use online schema change tools: gh-ost for MySQL, pg_repack or pgroll for PostgreSQL. These create shadow tables and swap atomically
- Test every migration against a production-sized dataset. Spin up a replica from a snapshot, run the migration, measure time and lock behavior
- Set statement_timeout and lock_timeout on migration connections. A migration that can't acquire a lock in 5 seconds should abort, not queue (see the sketch after this list)
- Use the expand-contract pattern: add new columns as nullable first, backfill data, update application code, then add constraints in a separate deployment
- Run migrations during low-traffic windows with a kill switch. Have the exact KILL query ready before you start
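A sketch of the timeout settings and the prepared kill switch on a PostgreSQL migration connection (timeout values and names are illustrative; in MySQL the equivalent is SHOW PROCESSLIST followed by KILL):

```sql
-- Fail fast instead of queueing behind other locks, and cap total
-- runtime so a misestimated migration cannot run unbounded.
SET lock_timeout = '5s';
SET statement_timeout = '30min';

ALTER TABLE orders ADD COLUMN priority SMALLINT;

-- The prepared kill switch: locate the migration's backend and
-- terminate it. Replace 12345 with the pid from the first query.
SELECT pid, state, query
FROM pg_stat_activity
WHERE query ILIKE 'ALTER TABLE%';

SELECT pg_terminate_backend(12345);
```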
Key Points
- A migration that takes 2 seconds on a table with 1,000 rows can take 2 hours on a table with 100 million rows. Test with production data volumes
- In MySQL, most ALTER TABLE operations took a full table lock before 5.6. Even in 8.0+, adding an index is online but changing a column type is not
- PostgreSQL's ADD COLUMN with a constant DEFAULT is instant since version 11, but adding a NOT NULL constraint still scans the entire table
- The expand-contract pattern doubles your migration count but eliminates downtime risk entirely
- Failed migrations are worse than slow migrations because partial schema changes leave the database in an inconsistent state
Common Mistakes
- ✗ Testing migrations against a staging database with 0.1% of production data volume and assuming the timing will be similar
- ✗ Running migrations inside the same transaction as application deployments, so a slow migration blocks the entire deployment pipeline
- ✗ Adding a NOT NULL column without a DEFAULT value, which fails immediately on tables with existing rows
- ✗ Forgetting that a plain CREATE INDEX in PostgreSQL takes a SHARE lock that blocks all writes unless you use CREATE INDEX CONCURRENTLY