Database Migration Failures
Why Migrations Break Production
Database migrations are among the most common causes of self-inflicted outages. The pattern is almost always the same: a migration works perfectly in development and staging, then locks a production table for 45 minutes because nobody tested it against real data volumes.
The root problem is that DDL operations (ALTER TABLE, CREATE INDEX, ADD CONSTRAINT) behave fundamentally differently on small tables than on large ones. On a 1,000-row table, adding a column is instant. On a 500-million-row table, the same operation can rewrite the entire table on disk.
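That makes "how big is this table, really?" the first pre-migration question. A quick way to answer it in PostgreSQL (a sketch; the orders table is illustrative):

```sql
-- Total on-disk size of the table, including its indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('orders'));

-- Approximate row count from the planner's statistics; avoids a full
-- scan on a huge table.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'orders';
```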
Lock Behavior by Database
MySQL before 5.6 took a full table lock on most ALTER TABLE operations, leaving the table effectively inaccessible for the duration. Versions 5.6 and 5.7 made many operations online (ALGORITHM=INPLACE), though they still rebuild the table. MySQL 8.0+ supports instant DDL for some operations (adding columns, renaming columns) but not others (changing column types, adding indexes).
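One useful habit in MySQL 8.0+ is to request the algorithm explicitly, so a migration that can't run instantly fails fast instead of silently falling back to a table rebuild. A sketch, with illustrative table and column names:

```sql
-- MySQL 8.0+: request instant DDL explicitly. If the operation cannot
-- be done instantly, the statement errors out instead of rebuilding.
ALTER TABLE orders
  ADD COLUMN priority TINYINT NULL,
  ALGORITHM = INSTANT;

-- A column type change cannot be instant; requesting INSTANT makes the
-- statement fail immediately rather than quietly taking a
-- table-rebuilding code path in production.
ALTER TABLE orders
  MODIFY COLUMN amount DECIMAL(12,2) NOT NULL,
  ALGORITHM = INSTANT;
```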
PostgreSQL is better in some ways, worse in others. Adding a nullable column is instant. Adding a column with a constant DEFAULT has been instant since PG 11. But adding a foreign key constraint scans both tables while holding locks on each. CREATE INDEX blocks writes (though not reads) unless you use the CONCURRENTLY variant, and CONCURRENTLY can't run inside a transaction.
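Both problems have standard workarounds. A minimal PostgreSQL sketch (names are illustrative): build the index concurrently, and split the foreign key into a fast NOT VALID creation followed by a later validation pass that doesn't block normal traffic:

```sql
-- Build the index without blocking writes. CONCURRENTLY cannot run
-- inside a transaction block, so run it as its own statement.
CREATE INDEX CONCURRENTLY idx_orders_customer_id
    ON orders (customer_id);

-- NOT VALID skips the full-table scan when the constraint is added;
-- existing rows are checked later by VALIDATE, which takes a weaker
-- lock and runs alongside normal reads and writes.
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
```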
The gh-ost Approach
GitHub's gh-ost tool changed how large MySQL deployments handle migrations. Instead of ALTER TABLE, it creates a ghost table with the new schema, copies existing rows in small batches while replaying ongoing writes from the binary log, and then does an atomic rename. The migration runs for hours but never locks the original table for more than a brief cut-over.
For PostgreSQL, pg_repack uses a similar shadow-table approach to rebuild tables online. pgroll from Xata is a newer option that adds version-aware schema management on top. These tools trade migration speed for zero downtime: a migration that would lock the table for 20 minutes instead runs for 2 hours with zero user impact.
Expand-Contract Pattern
The safest migration strategy splits every change into two deployments. First deployment (expand): add the new column as nullable, deploy code that writes to both old and new columns, backfill existing rows. Second deployment (contract): add the NOT NULL constraint, remove the old column, deploy code that only uses the new column.
This doubles the number of deployments but eliminates downtime risk. Each individual step is either instant or can be done with online tools.
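A sketch of what the two deployments might look like in SQL (PostgreSQL syntax; the users table and column names are hypothetical):

```sql
-- Expand (deployment 1): the new column is nullable, so this is instant.
ALTER TABLE users ADD COLUMN email_normalized TEXT;

-- Backfill in small batches; rerun until zero rows are updated, so no
-- single statement holds locks on millions of rows at once.
UPDATE users
SET email_normalized = lower(email)
WHERE id IN (
    SELECT id FROM users
    WHERE email_normalized IS NULL
    LIMIT 10000
);

-- Contract (deployment 2), only after all deployed code reads and
-- writes the new column. (SET NOT NULL scans the table once to verify
-- existing rows.)
ALTER TABLE users ALTER COLUMN email_normalized SET NOT NULL;
ALTER TABLE users DROP COLUMN email;
```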
When Migrations Fail Midway
A killed migration is worse than a slow migration. If you kill an ALTER TABLE in MySQL, the rollback can take as long as the migration itself. The table remains locked during rollback. In PostgreSQL, killing a CREATE INDEX CONCURRENTLY leaves an invalid index that you have to manually drop.
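For the PostgreSQL case, the leftover index is marked invalid in the system catalogs and can be found and dropped before retrying. A sketch; the index name is whatever your failed statement used:

```sql
-- List indexes left invalid by a killed CREATE INDEX CONCURRENTLY.
SELECT c.relname AS index_name
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- Drop it without blocking writes, then retry the build.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;
```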
Before starting any migration, have the rollback plan written down. Know the exact commands. Know how long rollback will take. Know which application version is compatible with the pre-migration schema. If you can't answer these questions, you're not ready to run the migration.
Incident Timeline
- T+0m: ALTER TABLE starts on a 500GB table during deployment. Migration tool estimates 2 minutes. The actual lock acquisition begins blocking all writes.
- T+2m: Write latency spikes to 30 seconds. Connection pool fills up as queries queue behind the DDL lock. Application threads start timing out.
- T+5m: Health checks fail due to database timeouts. Load balancer marks instances unhealthy. Customer-facing 500 errors begin.
- T+10m: Team kills the migration query, but the rollback of the partial ALTER takes another 8 minutes. Database remains locked.
- T+15m: Migration rollback completes. Connections drain and refill. Application recovers gradually as connection pools reset.
- T+30m: Post-incident: team discovers the migration worked in staging because the staging table had 10,000 rows, not 500 million.
Detection Signals
- Database connection pool utilization exceeding 80% during deployment windows
- "Lock wait timeout exceeded" errors in application logs
- Sudden spike in query duration for the affected table (p99 latency jump)
- Replication lag increasing on read replicas during migration execution
Prevention
- Use online schema change tools: gh-ost for MySQL, pg_repack or pgroll for PostgreSQL. These create shadow tables and swap atomically
- Test every migration against a production-sized dataset. Spin up a replica from a snapshot, run the migration, measure time and lock behavior
- Set statement_timeout and lock_timeout on migration connections. A migration that can't acquire a lock in 5 seconds should abort, not queue (see the sketch after this list)
- Use the expand-contract pattern: add new columns as nullable first, backfill data, update application code, then add constraints in a separate deployment
- Run migrations during low-traffic windows with a kill switch. Have the exact KILL query ready before you start
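A sketch of the timeout settings and the prepared kill switch on a PostgreSQL migration connection (timeout values and names are illustrative; in MySQL the equivalent is SHOW PROCESSLIST followed by KILL):

```sql
-- Fail fast instead of queueing behind other locks, and cap total
-- runtime so a misestimated migration cannot run unbounded.
SET lock_timeout = '5s';
SET statement_timeout = '30min';

ALTER TABLE orders ADD COLUMN priority SMALLINT;

-- The prepared kill switch: locate the migration's backend and
-- terminate it. Replace 12345 with the pid from the first query.
SELECT pid, state, query
FROM pg_stat_activity
WHERE query ILIKE 'ALTER TABLE%';

SELECT pg_terminate_backend(12345);
```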
Key Points
- A migration that takes 2 seconds on a table with 1,000 rows can take 2 hours on a table with 100 million rows. Test with production data volumes
- In MySQL, most ALTER TABLE operations took a full table lock before 5.6. Even in 8.0+, adding an index is online but changing a column type is not
- PostgreSQL's ADD COLUMN with a constant DEFAULT is instant since version 11, but adding a NOT NULL constraint still scans the entire table
- The expand-contract pattern doubles your migration count but eliminates downtime risk entirely
- Failed migrations are worse than slow migrations because partial schema changes leave the database in an inconsistent state
Common Mistakes
- ✗ Testing migrations against a staging database with 0.1% of production data volume and assuming the timing will be similar
- ✗ Running migrations inside the same transaction as application deployments, so a slow migration blocks the entire deployment pipeline
- ✗ Adding a NOT NULL column without a DEFAULT value, which fails immediately on tables with existing rows
- ✗ Forgetting that a plain CREATE INDEX in PostgreSQL takes a SHARE lock that blocks all writes unless you use CREATE INDEX CONCURRENTLY