Database migration with zero downtime

Zero-Downtime Database Changes: Safe Migration Patterns

Patterns for deploying database schema changes without downtime. Covers expand-contract migrations, backward-compatible columns, and online index builds.

DatabasesData & Databases
database-migrationszero-downtimeschema-changesexpand-contract

Database schema changes are one of the riskiest operations in production systems. A careless ALTER TABLE can lock a table for minutes, blocking every query. A failed migration without a rollback plan can leave your database in an inconsistent state. And a backfill that runs too aggressively can saturate your database's I/O capacity.

This article is a practitioner's guide to zero-downtime database schema changes. It covers the expand-contract pattern, safe backfill strategies, rollback design, and the specific pitfalls that catch teams in 2026 database operations. It is written for backend engineers, DBAs, and SREs who deploy schema changes to production databases under load.

The databases hub provides broader reading resources, and the data and databases path covers foundational skills. For PostgreSQL-specific context, see the PostgreSQL 18 upgrade guide, UUIDv7 article, and skip scan explainer.

The expand-contract pattern

The safest way to make schema changes under load is the expand-contract pattern. Instead of changing a column or table in one step, you break it into phases:

Phase 1: Expand

Add the new structure alongside the old one. Both exist simultaneously.

  • Add a new column (nullable, with a default)
  • Create a new table
  • Add a new index concurrently

This phase must be non-blocking. In PostgreSQL, use ALTER TABLE ... ADD COLUMN with a default (which is instant in modern PostgreSQL) or CREATE INDEX CONCURRENTLY.

Phase 2: Migrate

Update the application code to write to both the old and new structures. Read from the old structure. This dual-write phase ensures that new data is captured in both places.

Phase 3: Backfill

Copy data from the old structure to the new one for rows that were written before the dual-write started.

Phase 4: Cutover

Switch reads to the new structure. The application now reads from the new structure and writes to both.

Phase 5: Contract

Remove the old structure. Drop the old column, table, or index. Update the application to stop writing to the old structure.

Each phase is a separate deployment. Each phase is independently rollbackable.

Safe backfill strategies

Backfilling data is the most dangerous phase because it generates significant database load. Use these strategies:

Batch processing

Process rows in batches of 1,000–10,000. Between each batch, sleep for a short interval (100–500 ms) to let the database process other queries.

Rate limiting

Limit the backfill to a fixed number of rows per second. Monitor database metrics (CPU, I/O wait, replication lag) and pause automatically if thresholds are exceeded.

Off-peak execution

Run backfills during low-traffic periods. If your traffic has a predictable daily pattern, schedule the backfill during the trough.

Idempotent operations

Make every backfill operation idempotent. If the backfill is interrupted and restarted, it should not corrupt data or create duplicates. Use WHERE new_column IS NULL to process only unbackfilled rows.

Rollback design

Every migration must have a rollback plan before it is deployed. The rollback plan should be:

  • Documented: written down and reviewed by a second person
  • Tested: executed against a staging environment
  • Time-bounded: if the migration cannot be rolled back within a defined window, it should be redesigned

Rollback strategies by phase

  • Expand phase: drop the new column/table/index
  • Migrate phase: revert the application code to remove dual writes
  • Backfill phase: stop the backfill process (data in the new column can be truncated or left as-is)
  • Cutover phase: revert reads to the old structure
  • Contract phase: this is the point of no return. Once the old structure is dropped, rollback requires restoring from backup. Never reach this phase until you are confident the migration is stable.

Dangerous operations to avoid

ALTER TABLE ... ADD COLUMN with a volatile default

In older PostgreSQL versions, adding a column with a default rewrote the entire table. Modern PostgreSQL (11+) handles this efficiently for most cases, but volatile defaults (e.g., DEFAULT now()) still cause a full table rewrite. Use a separate backfill step instead.

CREATE INDEX without CONCURRENTLY

A regular CREATE INDEX takes an exclusive lock on the table, blocking all writes until the index is built. For large tables, this can take minutes or hours. Always use CREATE INDEX CONCURRENTLY.

ALTER TABLE ... ALTER COLUMN TYPE

Changing a column's data type typically rewrites the entire table. For large tables, this is a multi-hour operation that blocks writes. Instead, use the expand-contract pattern: add a new column with the target type, backfill, cut over, and drop the old column.

DROP COLUMN on a hot table

In PostgreSQL, DROP COLUMN does not rewrite the table—it marks the column as dropped. However, running DROP COLUMN takes an ACCESS EXCLUSIVE lock briefly, which can cause connection pile-ups on busy tables. Run it during low-traffic periods.

Monitoring during migrations

During every migration phase, monitor:

  • Replication lag: if lag increases significantly, pause the operation
  • Lock wait events: check pg_stat_activity for queries waiting on locks
  • I/O utilisation: watch for I/O saturation, especially during backfills
  • Connection pool health: ensure connection pools are not exhausted by blocked queries
  • Error rates: monitor application error rates for any spike correlated with the migration

Common pitfalls

Pitfall: skipping the dual-write phase. Without dual writes, there is a window between backfill completion and cutover where new data is only written to the old structure. This creates data loss in the new structure.

Pitfall: not testing rollback in staging. Every rollback plan that has not been tested is a hope, not a plan.

Pitfall: running backfills too fast. An aggressive backfill can saturate I/O and cause replication lag, affecting read replicas and potentially triggering failover.

Pitfall: deploying the contract phase too soon. Wait at least one full deploy cycle after cutover before dropping the old structure. This gives you time to detect subtle issues that only appear under production load.

Trade-offs

  • The expand-contract pattern adds complexity and takes longer than a single ALTER TABLE. The trade-off is safety under load.
  • Dual writes add application complexity and a small performance cost. The trade-off is the ability to roll back at any point.
  • Batched backfills are slower than bulk operations but do not impact production traffic.

Further reading on EBooks-Space