
How to Run Zero-Downtime Database Migrations


You usually do not notice database migrations until you do. The pattern is familiar: a “small” schema tweak lands during a deploy, latency creeps up, writes stack behind a lock queue, and suddenly your app is technically running but practically unavailable.

Zero-downtime database migrations mean changing schema and data while multiple versions of your application are live, without blocking the reads and writes users depend on. The hard part is not the SQL. It is compatibility across time.

The useful mental model is simple: treat your database like a public API. If old code and new code can run at the same time, your schema must support both.

What experienced teams converge on (and quietly warn you about)

After comparing how mature engineering orgs approach this, a few themes keep repeating.

Martin Fowler, working with teams practicing continuous delivery, popularized the idea of “parallel change,” also called expand and contract. The core insight is that breaking changes are only safe when split into reversible steps that coexist for a while.

GitHub’s database engineers ended up building their own MySQL migration tool, gh-ost, not because SQL was hard, but because operational control mattered more than raw speed. They needed migrations that could be paused, throttled, and observed under real production load.

Meanwhile, the PostgreSQL core documentation makes an uncomfortable truth explicit: many ALTER TABLE operations require strong locks, and those locks conflict with almost everything your application is doing.

Put together, the lesson is clear. Expand and contract is your default strategy for application compatibility. Online schema change tooling becomes necessary when the database engine’s locking behavior makes naive DDL unsafe.

Choose your migration lane before touching production

Before writing SQL, decide which lane you are in: in-place expand and contract, online schema change tooling, or a combination of both. This decision shapes everything else.

If you are unsure, default to expand and contract. It forces you to design compatibility deliberately, which is where most “zero downtime database migrations” efforts succeed or fail.

The expand and contract playbook that actually works

The pattern is easy to describe and deceptively hard to execute well.

Step 1: Expand the schema without breaking old code

Add new tables, columns, or indexes in a way that existing code does not notice.

In PostgreSQL, even “simple” DDL can queue behind long-running transactions because of the locks it requests. Use guardrails like lock_timeout, keep changes small, and avoid bundling multiple subcommands that escalate locking requirements.
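As a sketch of that guardrail, an expand-phase migration can set a short lock_timeout before each DDL statement so a blocked ALTER gives up quickly instead of queueing behind (and ahead of) application traffic. The table and column names here are illustrative, and the statements would run through your migration tool's raw-SQL hook:

```python
# Sketch of an expand-phase PostgreSQL migration with a lock guardrail.
# Table/column names are illustrative; execute each statement via your
# driver or migration framework (psycopg, Flyway, Alembic, etc.).

def expand_phase_statements(table: str, column: str) -> list[str]:
    """Build guarded DDL: fail fast if the lock can't be acquired quickly."""
    return [
        # If the ALTER cannot get its lock within 2s, abort instead of
        # stalling every query queued behind it.
        "SET lock_timeout = '2s';",
        # Adding a nullable column without a volatile default is a brief,
        # metadata-only change on modern Postgres versions.
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {column} text;",
    ]

for stmt in expand_phase_statements("users", "full_name"):
    print(stmt)
```

If the timeout fires, the migration fails cleanly and can simply be retried during a quieter window.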

In MySQL, expansion can still be dangerous for large tables. If the table is hot and large, consider online schema change tooling even for the expand phase.

Step 2: Ship code that can read both shapes

This is the step most teams rush, then regret.

Your application should behave correctly if:

  • The new column exists but is empty

  • Some rows are backfilled, and others are not

  • Old and new versions of the app overlap during rollout

That usually means fallback logic on reads. For example, prefer full_name and fall back to name when it is null, so behavior stays consistent across both shapes.
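A minimal sketch of that read-path fallback, using the full_name/name example (the row is assumed to be a dict as returned by a typical DB driver):

```python
# Sketch of a read-path fallback during expand and contract.
# Column names mirror the article's example; any dict-shaped row works.

def display_name(row: dict) -> str:
    """Prefer the new column, fall back to the old one if unset."""
    # During the transition, full_name may be NULL (not yet backfilled)
    # or missing entirely; name remains the legacy source of truth.
    value = row.get("full_name")
    if value:  # treats None and "" as "not backfilled yet"
        return value
    return row.get("name", "")

print(display_name({"full_name": "Ada Lovelace", "name": "Ada"}))  # new shape wins
print(display_name({"full_name": None, "name": "Ada"}))            # falls back
```

The key property is that every bullet above maps to a branch: empty new column, partially backfilled rows, and old app versions all resolve to a correct value.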

Step 3: Dual-write just enough to stay future-proof

Once the new schema exists, start writing to it for new or updated rows.

You rarely need permanent dual writes. Often, you can dual-write only the fields involved in the transition, behind a feature flag. This preserves rollback safety. If something goes wrong, you can revert reads to the old column and still have correct data.
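A sketch of that scoped, flag-gated dual write. The flag constant and update-building function are placeholders for your own feature-flag service and data layer:

```python
# Sketch of dual writes behind a feature flag. DUAL_WRITE_FULL_NAME and
# build_update are illustrative stand-ins for a real flag service and ORM.

DUAL_WRITE_FULL_NAME = True  # in practice, read from your flag service

def build_update(user_id: int, name: str) -> dict:
    """Assemble the column updates for a user write."""
    # Old column: always written, so reverting reads stays safe.
    updates = {"name": name}
    if DUAL_WRITE_FULL_NAME:
        # New column: only the field involved in the transition is
        # dual-written, and only while the flag is on.
        updates["full_name"] = name
    return updates

print(build_update(42, "Ada Lovelace"))
```

Turning the flag off instantly stops the dual write without a deploy, which is exactly the rollback lever the paragraph above describes.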


Step 4: Backfill old data in small, observable batches

Backfills are where “zero downtime database migrations” quietly turn into “slow incident” if you are careless.

A concrete example:

  • You need to populate full_name for 200 million users.

  • You choose 5,000 rows per second to keep IO and replication healthy.

  • 200,000,000 ÷ 5,000 = 40,000 seconds, or about 11 hours.

That is manageable if you throttle, monitor, and can pause. It is dangerous if you run one massive transaction and saturate the system.

Run small commits, measure replication lag, and build a kill switch. This is exactly why tools like gh-ost emphasize pause and resume as first-class features.
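The loop below sketches those properties under stated assumptions: fetch_batch, commit_batch, and replication_lag_seconds are placeholders for your data layer, wired here to an in-memory list so the sketch is runnable end to end.

```python
import time

# Sketch of a throttled, pausable backfill loop. The three helpers are
# placeholders; in production they would issue SQL and query replicas.

rows = [{"id": i, "name": f"user{i}", "full_name": None} for i in range(12)]

def fetch_batch(after_id: int, size: int):
    return [r for r in rows if r["id"] > after_id and r["full_name"] is None][:size]

def commit_batch(batch):
    for r in batch:
        r["full_name"] = r["name"]  # one small transaction per batch in real life

def replication_lag_seconds() -> float:
    return 0.0  # placeholder: measure lag on your replicas

KILL_SWITCH = False  # flip to stop the backfill mid-flight

def backfill(batch_size: int = 5, max_lag: float = 5.0, pause: float = 0.0):
    last_id = -1
    while not KILL_SWITCH:
        if replication_lag_seconds() > max_lag:
            time.sleep(1)          # back off until replicas catch up
            continue
        batch = fetch_batch(last_id, batch_size)
        if not batch:
            break                  # done: nothing left to backfill
        commit_batch(batch)
        last_id = batch[-1]["id"]
        time.sleep(pause)          # throttle to the rate you budgeted

backfill()
print(sum(r["full_name"] is not None for r in rows))  # all 12 rows filled
```

Small commits, a lag check, and a kill switch are each one line here, which is roughly the point: the safety machinery is cheap compared to the incident it prevents.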

Step 5: Contract only after you prove that nothing depends on the old schema

The contract phase is when you drop old columns, remove fallback logic, and delete unused indexes.

Do not rely on intuition. Prove safety:

  • Error rates and latency remain normal with new reads enforced

  • Logs and query stats show no access to old columns

  • Old application versions are fully drained

  • Backfill completeness is verified with counts or sampling

  • Rollback steps are documented before the deploy
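The backfill-completeness item can be sketched as a count check plus a random spot check. Here the "table" is an in-memory list; in production the same two checks would run as SQL counts plus sampled row reads:

```python
import random

# Sketch of a pre-contract verification: count missing values, then
# compare old and new columns on a random sample of rows.

rows = [{"id": i, "name": f"user{i}", "full_name": f"user{i}"} for i in range(1000)]

def verify_backfill(rows, sample_size: int = 50) -> bool:
    # Count check: no row should still be missing the new column.
    missing = sum(r["full_name"] is None for r in rows)
    if missing:
        return False
    # Sample check: new and old values must agree on randomly chosen rows.
    for r in random.sample(rows, min(sample_size, len(rows))):
        if r["full_name"] != r["name"]:
            return False
    return True

print(verify_backfill(rows))  # True when complete and consistent
```

Sampling will not catch every divergence, so treat it as a cheap smoke test that complements, rather than replaces, the count check.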

When online schema change tools save you from yourself

For MySQL, two approaches dominate production use.

pt-online-schema-change creates a new table, copies rows in the background, and swaps the tables at the end, using triggers to keep the copy consistent with live writes.

gh-ost, built by GitHub, avoids triggers and focuses on low impact and operational control.

The important lesson is not which tool you choose. It is recognizing when the database engine’s locking behavior makes native DDL unsafe for live traffic.

Also, be honest about edge cases. For example, heavy foreign key usage changes how table-swap tools behave. Planning for that up front is cheaper than discovering it mid-migration.

PostgreSQL-specific reality checks that prevent lock pileups

Most Postgres migration failures are not about speed. They are about locks and timing.


A few principles that pay off immediately:

  • Assume ALTER TABLE wants an aggressive lock unless proven otherwise.

  • Long-running transactions are the hidden enemy. Even a harmless-looking SELECT can block schema changes.

  • Version details matter. Newer Postgres releases optimize some operations, like adding columns with simple defaults, but you should still validate behavior in your environment.

In practice, zero downtime in Postgres comes from choosing low-lock operations, enforcing timeouts, and designing migrations that tolerate waiting.
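One practical pre-flight check for the "hidden enemy" above: before issuing DDL, look for transactions that have been open long enough to block your ALTER. The query below reads from PostgreSQL's pg_stat_activity view; the one-minute threshold and the gating helper are illustrative choices:

```python
# Sketch of a pre-flight check against long-running transactions.
# The SQL targets pg_stat_activity; run it via your driver of choice.

LONG_TXN_SQL = """
SELECT pid, state, now() - xact_start AS open_for, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 minute'
ORDER BY open_for DESC;
"""

def should_proceed(long_running_txns: list) -> bool:
    """Defer the migration while anything old enough to block it is live."""
    return len(long_running_txns) == 0

print(should_proceed([]))        # no blockers: safe to attempt the DDL
print(should_proceed([(123,)]))  # a long transaction is open: wait
```

Combined with lock_timeout, this turns "my ALTER got stuck" into "my migration politely declined to start."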

FAQ

Can you truly guarantee zero-downtime database migrations?
You can get very close, but guarantees are rare. The real goal is no user-visible interruption and the ability to pause or roll back safely.

Do you need feature flags for database migrations?
If reads or writes change, feature flags dramatically reduce risk. Purely additive changes may not need them.

What is the most common mistake?
Doing a breaking schema change in one deploy and assuming rollout speed will save you.

How do you know it is safe to contract?
When metrics, logs, and deploy state all agree that nothing touches the old schema.

Honest Takeaway

Zero-downtime database migrations are not a trick and not something you schedule “after hours.” They are a compatibility strategy, expand and contract, combined with operational discipline.

If you adopt only one habit, make it this: design every schema change so two versions of your application can safely use it at the same time. That mindset is what separates confident deploys from migration-day anxiety.

Rashan is a seasoned technology journalist and visionary leader serving as the Editor-in-Chief of DevX.com, a leading online publication focused on software development, programming languages, and emerging technologies. With his deep expertise in the tech industry and his passion for empowering developers, Rashan has transformed DevX.com into a vibrant hub of knowledge and innovation. Reach out to Rashan at [email protected]
