The ugly part of indexing large tables is not the SQL. It is the blast radius.
On a small table, adding an index feels harmless. On a table with hundreds of millions of rows, it can turn into a slow-motion production incident: elevated I/O, replication lag, blocked writers, cache churn, and a queue of application requests that suddenly all discover what “schema change” really means. When people say they want to index large tables “without downtime,” what they usually mean is simpler and stricter: keep reads and writes flowing while the database builds a new access path in the background.
The good news is that modern engines do have paths for this. PostgreSQL supports CREATE INDEX CONCURRENTLY, which avoids locks that block inserts, updates, and deletes, but it takes longer and has operational caveats. MySQL InnoDB supports online DDL for adding secondary indexes with concurrent DML. SQL Server supports online index operations and resumable operations built on that. The trick is not just knowing the feature exists. The trick is knowing when its caveats become your outage.
That is also where the practitioners sound more interesting than the marketing copy. Shlomi Noach, the engineer behind gh-ost, has described GitHub’s experience bluntly: on very large, high-traffic MySQL tables, earlier migration approaches could grind the database close to a halt or even cause outages, which is why GitHub built a triggerless migration path with throttling and control. Brandur Leach, an engineer who writes extensively about Postgres operations, has made the same point from the Postgres side: operations that look instantaneous on small tables can become production-threatening once they require full-table scans or heavyweight validation on large datasets. Together, that gives you the real rule: the right indexing plan is less about syntax and more about controlling concurrency, load, and rollback.
Start with the boring question: Do you really need this index now?
A lot of downtime stories begin with a technically correct index and a strategically bad moment. If the query is slow because of a missing predicate, poor join order, bloated table statistics, or an ORM-generated monstrosity, adding an index during peak traffic is expensive optimism.
Before you build anything, verify three things. First, the query is important enough to justify permanent write overhead. Second, the index shape is precise enough to win, meaning correct column order, selectivity, and whether you need covering or included columns. Third, that you are not about to build a giant index that duplicates one you already have in practice. Every engine charges rent for extra indexes: reads get faster, but every insert, update, and delete now pays to maintain the new structure.
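The duplication check is the easiest to automate. In PostgreSQL, for example, the `pg_indexes` view shows every index definition on a table, so you can spot an existing index whose leading columns already cover your candidate. A minimal sketch, assuming a table named `events` (the name is illustrative):

```sql
-- List existing index definitions on the table so a new composite
-- index does not duplicate a prefix of one you already have.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'events'
ORDER BY indexname;
```

If an existing index on `(customer_id, created_at, status)` is already there, a new one on `(customer_id, created_at)` is almost certainly dead weight.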
A simple thought experiment helps. Suppose your events table has 800 million rows and receives 2,000 writes per second. A new composite index might cut one dashboard query from 4.2 seconds to 120 milliseconds. Great. But if that dashboard runs 30 times a minute and the table absorbs a constant firehose of inserts, you still need to ask whether the new index earns its keep on every future write. That is not an argument against indexing. It is an argument for being specific before you light up disks for six hours.
The safe path depends on your database engine
This is where a lot of advice gets dangerously hand-wavy. “Use online indexing” is not a plan. Different engines mean different guarantees.
In PostgreSQL, the production-safe default is usually CREATE INDEX CONCURRENTLY. A normal index build allows reads but blocks writes until completion, while the concurrent form avoids locks that block inserts, updates, and deletes. The price is that PostgreSQL must do more work, including two table scans, plus wait for conflicting transactions and old snapshots. It can also leave behind an INVALID index if the build fails, and only one concurrent index build can run on a table at a time.
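The statement itself is short; the caveats are in the comments. A sketch, with table and index names as illustrative assumptions:

```sql
-- Allows INSERT/UPDATE/DELETE during the build, at the cost of
-- two table scans and waits for conflicting transactions.
-- Cannot run inside a transaction block, and a failed build
-- leaves an INVALID index behind that must be dropped or rebuilt.
CREATE INDEX CONCURRENTLY idx_events_customer_created
    ON events (customer_id, created_at);
```

Because it cannot run in a transaction, migration frameworks that wrap every step in BEGIN/COMMIT need explicit configuration to run this statement outside one.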
In MySQL 8 with InnoDB, adding a secondary index is generally supported as an online DDL operation that permits concurrent DML. The table remains available for reads and writes while the index is being created. But this does not mean free. Some operations, especially primary key changes and certain special index types, are more expensive, rebuild data, or have tighter locking semantics.
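A useful habit in MySQL is to state the algorithm and lock level explicitly rather than trusting the default: if the operation cannot actually run online, the statement fails immediately instead of silently taking a more aggressive lock. A sketch, names illustrative:

```sql
-- ALGORITHM=INPLACE and LOCK=NONE make MySQL fail fast if this
-- ALTER cannot run as an online operation with concurrent DML.
ALTER TABLE events
    ADD INDEX idx_customer_created (customer_id, created_at),
    ALGORITHM=INPLACE, LOCK=NONE;
```

An immediate error here is good news: it tells you before the build starts that you are in migration-tool territory, not quick-index territory.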
In SQL Server, online index operations let users continue to query and update the underlying table while the index operation runs. That is the feature you want on hot production tables, but there are restrictions, including cases involving certain LOB data types and edition-specific availability.
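The SQL Server form looks like this, as a sketch with illustrative names. Note the edition and version assumptions: online index operations are an Enterprise-edition feature, and resumable online index creation requires SQL Server 2019 or later:

```sql
-- Online build that can be paused and resumed; MAX_DURATION caps
-- how long one run may go before pausing itself.
CREATE NONCLUSTERED INDEX IX_Events_Customer_Created
    ON dbo.Events (CustomerId, CreatedAt)
    WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 60 MINUTES);
```

Resumability is what turns a hard maintenance window into a soft one: you can pause the build when load spikes and resume it later without losing progress.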
That is the first practical fork in the road. If you are adding a straightforward secondary index in modern MySQL or a nonblocking index in Postgres or SQL Server, the native online feature is usually the cleanest path. If your change falls outside those guarantees, then you are no longer just adding an index. You are planning a migration.
Why large-table indexing still hurts even when it is “online”
Online does not mean invisible. It means concurrent.
That distinction matters because indexing a large table still consumes I/O, CPU, memory, replication bandwidth, and background maintenance budget. In PostgreSQL, concurrent builds take significantly longer because of the extra scans and wait phases. The engine even exposes those phases in pg_stat_progress_create_index, including waits for writers, validation scans, and waits for old snapshots. In other words, the database is telling you where your pain will show up.
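Querying that progress view during a build tells you whether you are scanning, validating, or stuck waiting on someone else’s transaction:

```sql
-- One row per in-flight CREATE INDEX; phase names include
-- 'building index', 'waiting for writers before build', and
-- 'waiting for old snapshots'.
SELECT pid, phase,
       blocks_done, blocks_total,
       tuples_done, tuples_total
FROM pg_stat_progress_create_index;
```

If `blocks_done` is climbing, you are I/O-bound and just need patience. If the phase is a waiting phase and the counters are frozen, go hunt the blocking transaction instead of watching the build.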
MySQL tells a similar story more politely. InnoDB online DDL keeps the table writable for secondary index creation, but the statement finishes only after transactions already accessing the table complete. Performance, space use, and semantics still depend on the operation type and limitations. That is database-docs language for “yes, but please watch your system.”
This is why big-table indexing often fails in practice for reasons that live in the surrounding workload, not in the DDL itself. Long-lived transactions stay open. Backfills and ETL jobs pin snapshots. Replica lag goes unnoticed until read traffic shifts badly. Your migration is online, technically, and your service is still miserable.
Here’s how to do it without making your pager interesting
The safest approach is a staged rollout with explicit kill criteria.
Step one is to rehearse on production-like data, not a tiny staging clone. You want a credible estimate of build time, I/O pressure, replication effect, and lock behavior. For PostgreSQL, that also means checking for long-running transactions that can delay concurrent index validation or final usability. For MySQL and SQL Server, it means validating which flavor of online operation your exact statement qualifies for.
Step two is to lower the surprise factor. Schedule the build during a trough in write traffic, even if the feature is technically safe at peak. No downtime is not the same as no user impact. If your workload spikes every hour on the hour from batch jobs, do not challenge it to a duel.
Step three is to monitor the right things while the build runs. Keep the list short:
- active long transactions
- replica lag
- write latency and queue depth
- disk I/O saturation
- lock waits and blocked sessions
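Two of those symptoms are easy to watch from SQL alone in PostgreSQL: replica lag and blocked sessions. A sketch, assuming physical streaming replication and a primary you can query:

```sql
-- Replica lag in bytes, measured on the primary.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Sessions currently stuck on a heavyweight lock.
SELECT pid, wait_event_type, wait_event, left(query, 80) AS query_head
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```

A one-line cron or dashboard panel over these two queries is usually enough to notice trouble minutes before users do.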
That list looks obvious, but it saves teams because it focuses on symptoms users actually feel, not just whether the DDL session is still alive.
Step four is to be ready to abort cleanly. In PostgreSQL, failed concurrent builds can leave INVALID indexes behind, which should be dropped or rebuilt concurrently. SQL Server’s resumable online index operations exist for exactly this operational reason. In MySQL, if the operation is really a more invasive schema migration disguised as an index change, use tooling that supports throttling and pause/resume behavior instead of hoping the database will absorb it.
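In PostgreSQL, the cleanup after a failed concurrent build is itself a two-step operation, and the drop should also be concurrent so you do not trade one blocking problem for another. Index names here are illustrative:

```sql
-- Find indexes left INVALID by a failed CREATE INDEX CONCURRENTLY.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;

-- Drop without blocking writers, then retry the build.
DROP INDEX CONCURRENTLY idx_events_customer_created;
```

An INVALID index is worse than no index: queries ignore it, but every write still pays to maintain it, so do not leave it lying around while you debug.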
When native online indexing is not enough
This is where people reach for gh-ost or pt-online-schema-change, sometimes correctly and sometimes because the tool feels more serious than the native feature.
For MySQL, if you are only adding a normal secondary index and your version and table characteristics support online DDL, native ALTER TABLE ... ADD INDEX is usually enough. In that case, wrapping the operation in a copy-and-swap migration tool can add unnecessary complexity.
But if the change is more than that, or if your table has operational wrinkles that make native behavior too risky, then dedicated migration tools earn their keep. gh-ost was built at GitHub specifically because prior approaches on very large, busy MySQL tables could create a severe impact. Its appeal is not just that it is online. It is that it is controllable, testable in production, and explicitly designed to throttle.
pt-online-schema-change takes a different route. It works by creating a copy table, copying rows in chunks, using triggers to keep the new table in sync, and then swapping tables atomically. It avoids blocking reads and writes, but comes with tradeoffs around triggers, foreign keys, and operational risk that you need to understand before touching production.
For PostgreSQL, the equivalent escape hatch is usually architectural, not a universal external utility. You might index partitions one by one concurrently, then attach the partitioned index in a metadata-only step. That pattern is useful because concurrent builds for the partitioned parent are not supported directly.
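The partition-by-partition pattern looks like this in PostgreSQL 11 or later, with one partition shown and all names illustrative:

```sql
-- 1) Create an (initially invalid) index on the parent only,
--    without touching any partition.
CREATE INDEX idx_orders_customer ON ONLY orders (customer_id);

-- 2) Build each partition's index concurrently, one at a time.
CREATE INDEX CONCURRENTLY idx_orders_2024_customer
    ON orders_2024 (customer_id);

-- 3) Attach it. Once every partition's index is attached, the
--    parent index becomes valid in a metadata-only step.
ALTER INDEX idx_orders_customer
    ATTACH PARTITION idx_orders_2024_customer;
```

The win is pacing: each partition build is a small, abortable unit, so you can spread the work across several low-traffic windows instead of betting everything on one long run.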
A realistic playbook for a 500 million row table
Picture an orders table with 500 million rows, 900 inserts per second, and a query that filters on (customer_id, created_at).
The naive move is to add the composite index at 2 p.m. and hope the engine’s “online” claim does the rest. The safer move is to estimate the benefit first, verify the index really matches the query shape, check for overlapping indexes, and then run the build during your lowest write period with an observer watching replica lag, lock waits, and transaction age. In PostgreSQL, that likely means CREATE INDEX CONCURRENTLY, plus a pre-flight sweep for sessions holding long snapshots. In MySQL, if this is a regular secondary index on InnoDB, native online DDL is likely enough. In SQL Server, use an online operation and consider resumability if the maintenance window is soft rather than fixed.
The key insight is that the technical step is only half the job. The other half is load management. A concurrent build that stretches for five hours but keeps p99 write latency inside budget is a win. A fast offline build that blocks writes for 90 seconds is not. Large-table indexing is one of those tasks where the adult answer is not the shortest one.
FAQ
Is “online” the same as zero downtime?
Not exactly. Online usually means that reads and writes can continue during the operation. It does not mean zero performance impact, zero waits, or zero operational risk. PostgreSQL, MySQL, and SQL Server all have caveats around concurrency, duration, or restrictions.
Which is safer, native online indexing or migration tools?
For simple supported operations, native features are usually simpler and safer because they avoid copy-and-swap complexity. For more invasive MySQL schema changes on very hot tables, tools like gh-ost or pt-online-schema-change can be safer because they add throttling, testability, and operational control.
How do I know a Postgres concurrent index is stuck?
Check pg_stat_progress_create_index. It exposes phases such as waiting for writers, building the index, validation scans, and waiting for old snapshots. That view is the fastest way to tell whether you are CPU-bound, blocked by transactions, or simply impatient.
What is the most common mistake?
Treating a large-table index build like a syntax problem instead of a workload problem. Long transactions, hidden replicas, batch jobs, and overloaded storage are what turn safe features into incidents.
Honest Takeaway
If you want to index a large table without downtime, do not think first about the statement. Think about the concurrency model of your database, the shape of your write traffic, and the escape hatch if the build starts hurting the system.
For most teams, the winning recipe is boring in the best way: use the engine’s native online indexing when it truly applies, run it during low write pressure, monitor transaction age and lag like a hawk, and only reach for heavier migration tooling when the change stops being a simple index build. That is not flashy advice. It is the advice that keeps your app up.
Kirstie is a technology news reporter at DevX. She reports on emerging technologies and startups waiting to skyrocket.