
Understanding Distributed Locking for High-Scale Systems


If you have ever watched a “simple” cron job turn into a distributed stampede, you already understand the emotional core of distributed locking. One instance “wins,” the rest should back off, and everyone goes home happy. Then the network blips, a GC pause hits, a pod restarts, and suddenly two workers think they own the lock. Your “lock” becomes a suggestion.

Distributed locking is just mutual exclusion, but with failure as a first-class feature. You are trying to coordinate multiple processes that cannot share memory, may not agree on time, and can get partitioned in ways that look indistinguishable from a crash. At scale, the cost of being wrong is not theoretical: you get double-charged customers, duplicated emails, corrupted caches, or hot shards that never cool down.

The punchline is uncomfortable but useful: in high-scale systems, the right question is rarely “How do I lock?” It is “What level of correctness do I need, and what do I do when the lock lies?”

Early in the research for this piece, three themes kept repeating across the people who build these systems for real. Martin Kleppmann, a distributed systems researcher, has been blunt that if correctness depends on the lock, you should rely on a proper consensus system, and he consistently emphasizes fencing tokens as the real safety mechanism. Salvatore Sanfilippo (antirez), creator of Redis, takes a different angle: many systems that reach for strong distributed locks could instead tolerate occasional overlap and design the protected operation to be safe when it happens. And Oskar Dudycz, architect and educator, frames ZooKeeper and etcd as heavier infrastructure, but the kind you choose when you want strong consistency rather than best-effort coordination.

Put together, they suggest a pragmatic rule: treat locks as a traffic signal, and treat fencing as the guardrail.

Know which problem you are solving

A distributed lock gets used for at least four different jobs, and mixing them up is how teams get burned.

The first job is efficiency locking: you just want to reduce duplicated work. Example: “Only one worker should refresh this cache key.” If two workers occasionally overlap, you might waste some CPU, but correctness survives.

The second job is singleton scheduling: “Run the nightly compaction once.” Here, overlap can be expensive, but you can often make it safe with idempotency and deduplication.

The third job is resource ownership: “Only one node may write to this shard.” This is where you need stronger guarantees, because double-writes can corrupt the state.

The fourth job is coordination as a product feature: leases, semaphores, and leader election. At this point, you are basically building a coordination system, and you should probably use one on purpose.


If you cannot clearly say which job you are doing, you will pick the wrong mechanism and then spend a quarter writing compensating logic in panic.

The failure modes that make locks feel haunted

Two facts drive most distributed locking weirdness.

First, you cannot reliably distinguish “slow” from “dead” in a distributed system. Second, most lock designs depend on some notion of time (TTLs, heartbeats, lease renewals), but your processes do not experience time consistently. Garbage collection pauses, CPU throttling, and network jitter all distort reality.

That creates classic failure modes.

You see split-brain ownership, where two clients both believe they hold the lock because the system observed partial failure. You see lease extension lies, where a client keeps its lease alive, but its work is stalled, so the lock blocks progress. You see thundering herds, where many contenders wake up and retry at once, overwhelming the coordinator. And you see unlocking someone else’s lock, where a crash and restart reuses a lock key, then a delayed unlock deletes the new owner’s lock unless you use unique tokens.

Once you internalize these, the design space gets clearer. You either choose a coordinator that can make a strong, linearizable decision, or you accept that the lock is advisory and harden the protected operation.

Compare your main options without religious wars

Here is the smallest table that still helps you choose sanely.

Approach | What it's good at | Where it can hurt you
Database transaction or row lock | Strong correctness, minimal new infrastructure | Hot rows, contention, and tight coupling to database availability
Redis single-instance lock (SET NX with TTL) | Cheap efficiency locks, low latency | Partitions and failover can violate exclusivity without care
Redis Redlock | Attempts stronger semantics across Redis nodes | Correctness remains debated; operational complexity rises
ZooKeeper ephemeral sequential locks | Strong coordination semantics, well-known recipes | Operational overhead, client complexity
etcd concurrency locks | Strong consistency via Raft, lease-backed locks | Operational overhead; careful lease and watch tuning required

The quiet truth: if the lock protects something correctness-critical, you usually end up in consensus-backed coordinator territory or transactional databases. If it is just to reduce duplicate work, a simpler advisory lock is often fine and cheaper.

Build the lock like you expect it to fail

Here is the playbook that holds up under real load.

Start by using unique lock tokens. The lock value should be a random identifier generated by the client. You only release the lock if the stored value matches your token. This prevents one client from unlocking another client’s lock after crashes or reordering.
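The token check can be sketched in a few lines. This is an in-memory stand-in for the real thing (in Redis you would use SET with NX and a TTL to acquire, and a Lua script to compare-and-delete on release); the `TokenLock` class and its method names are illustrative, not a library API.

```python
import threading
import uuid

class TokenLock:
    """In-memory sketch of acquire-if-absent plus compare-on-release.

    A real deployment would store the token in a shared system such as
    Redis, with an expiry; the safety property shown here is the same:
    only the client whose token matches may release the lock.
    """

    def __init__(self):
        self._store = {}
        self._mu = threading.Lock()

    def acquire(self, key):
        token = uuid.uuid4().hex  # unique per acquisition
        with self._mu:
            if key in self._store:
                return None  # someone else holds the lock
            self._store[key] = token
            return token

    def release(self, key, token):
        with self._mu:
            # Delete only if the stored value matches our token, so a
            # delayed unlock from a crashed client cannot remove the
            # new owner's lock.
            if self._store.get(key) == token:
                del self._store[key]
                return True
            return False
```

A stale client calling `release("job", old_token)` after the lock has changed hands gets `False` back instead of silently destroying someone else's lock.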

See also  Why Some Architectures Scale and Others Break

Then use leases, not forever locks. In etcd, lock ownership is tied to a lease so that expiration or revocation releases the lock automatically. ZooKeeper’s ephemeral nodes give you the same “dies with the session” property.
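The lease idea can be sketched without any external system. This toy `LeaseLock` (an illustrative name, not a real API) accepts an injected clock value so the expiry behavior is easy to follow; a production lease would live in etcd or a ZooKeeper session.

```python
import time

class LeaseLock:
    """Sketch of lease-backed ownership: the lock expires automatically
    unless renewed, mimicking etcd leases and ZooKeeper ephemeral nodes."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._owner = None
        self._expires_at = 0.0

    def _expired(self, now):
        return now >= self._expires_at

    def acquire(self, owner, now=None):
        now = time.monotonic() if now is None else now
        if self._owner is None or self._expired(now):
            self._owner = owner
            self._expires_at = now + self.ttl
            return True
        return False

    def renew(self, owner, now=None):
        now = time.monotonic() if now is None else now
        if self._owner == owner and not self._expired(now):
            self._expires_at = now + self.ttl
            return True
        return False  # lease lost; the client must stop working immediately
```

Note the contract on `renew`: a `False` return means the client no longer owns anything and must abandon the protected work, not retry blindly.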

Next, watch for herd behavior. ZooKeeper-style lock recipes deliberately structure watches so that only the next contender wakes up, not everyone. In practice, this is one of the biggest differences between a system that works in staging and one that pages you at 2 a.m.
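The shape of the ZooKeeper recipe can be sketched with a simple FIFO queue: on release, exactly one waiter is signaled instead of every contender retrying at once. The `FifoLockQueue` class here is an illustrative simplification; the real recipe uses sequential znodes, with each contender watching only its immediate predecessor.

```python
from collections import deque

class FifoLockQueue:
    """Sketch of the sequential-node lock recipe: contenders queue in
    order and only the next waiter is woken on release, avoiding a
    thundering herd of simultaneous retries."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, client):
        self._queue.append(client)
        # The first in line holds the lock; everyone else would watch
        # only their predecessor's node, not the lock itself.
        return len(self._queue) == 1

    def release(self, client):
        assert self._queue and self._queue[0] == client, "not the holder"
        self._queue.popleft()
        # Wake exactly one contender, not all of them.
        return self._queue[0] if self._queue else None
```

With N waiters, a release triggers one wakeup instead of N retries against the coordinator, which is the whole point.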

Finally, instrument locking like a database dependency. At large scale, coordination latency becomes a real SLO input. Track lock acquisition latency, time spent holding the lock, contention rate, and lease renewal failures. Those metrics tell you whether locking is your bottleneck or your safety net.
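Those four signals are cheap to collect. Here is a minimal sketch of a metrics wrapper; the `LockMetrics` class is illustrative (in production you would emit to your metrics system rather than keep in-memory lists).

```python
import time
from contextlib import contextmanager

class LockMetrics:
    """Sketch of the four signals worth tracking: acquisition latency,
    hold time, contention rate, and lease renewal failures."""

    def __init__(self):
        self.acquire_latency = []   # seconds per acquisition attempt
        self.hold_time = []         # seconds the lock was held
        self.attempts = 0
        self.contended = 0          # attempts that failed to acquire
        self.renewal_failures = 0

    @contextmanager
    def timed_hold(self, acquire_fn):
        """Wrap an acquire attempt and, if it succeeds, the hold."""
        self.attempts += 1
        t0 = time.monotonic()
        acquired = acquire_fn()
        self.acquire_latency.append(time.monotonic() - t0)
        if not acquired:
            self.contended += 1
            yield False
            return
        t1 = time.monotonic()
        try:
            yield True
        finally:
            self.hold_time.append(time.monotonic() - t1)
```

Contention rate is then simply `contended / attempts`, and a creeping hold-time distribution is usually the first sign that locking is about to become your bottleneck.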

Use fencing tokens when correctness matters

If you only remember one idea from this article, make it fencing tokens.

A fencing token is a monotonically increasing number issued when a lock is acquired. Every write to the protected resource must include the token, and the resource must reject writes with an older token. This solves the core hazard: even if two clients think they hold the lock, only the newest owner can actually mutate the state.

A concrete example makes this obvious.

Imagine you run a payment capture worker.

Worker A acquires the lock and gets fencing token 41. Worker A pauses for several seconds due to GC or CPU throttling. The lease expires. Worker B acquires the lock and gets fencing token 42, then captures the payment and writes the state with token 42. Worker A wakes up and tries to write with token 41.

Without fencing, you risk double capture. With fencing, the database or state machine enforces “only accept writes with a token greater than or equal to the current token.” Token 41 loses, safely.

You can implement this using a database column that stores the last accepted token, compare-and-swap updates, or per-resource sequencing in a strongly consistent store. The key idea is that the lock is no longer the only line of defense.
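The payment scenario above can be sketched end to end. The `FencingLock` and `FencedStore` classes are illustrative (a real token issuer would be your coordinator, e.g. the ZooKeeper zxid or etcd revision, and the store would be a database column updated with compare-and-swap), and the counter is seeded at 40 only to match the article's 41/42 example.

```python
class FencingLock:
    """Sketch of a lock manager that issues a monotonically increasing
    fencing token on each acquisition."""

    def __init__(self, start=0):
        self._counter = start

    def acquire(self):
        self._counter += 1
        return self._counter


class FencedStore:
    """Sketch of a resource that enforces fencing: every write carries
    the token issued at lock acquisition, and stale tokens are rejected."""

    def __init__(self):
        self._last_token = 0
        self.value = None

    def write(self, token, value):
        if token < self._last_token:
            return False  # stale owner; reject the write
        self._last_token = token
        self.value = value
        return True
```

Replaying the scenario: Worker A holds token 41 but stalls; Worker B acquires token 42 and writes; when A wakes up, its write with token 41 is rejected by the store, not by the lock.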

A practical four-step recipe for high-scale systems

Step 1: Decide if you want correctness or fewer duplicates

If a violated lock can corrupt state, do not treat the lock as an optimization. That is when consensus systems or transactional guarantees make sense. If overlap is acceptable, choose an advisory lock and move on.


Step 2: Pick a coordinator that matches the blast radius

If your system already depends on a relational database, transactional locking or conditional updates can be the simplest strong option. If you already operate etcd or ZooKeeper, their lock recipes are usually the cleanest path.

Step 3: Make the protected operation idempotent anyway

Even with the “right” coordinator, retries happen. Networks partition. Clients restart mid-flight. Design your write path so that repeating the operation is safe. This is where many teams reduce their dependence on perfect locks.
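A common pattern is to key each operation by an idempotency key and dedupe on it. This is a minimal in-memory sketch (the `make_idempotent` helper is illustrative; in production the seen-keys table must live in a durable store so it survives restarts).

```python
def make_idempotent(handler):
    """Sketch: dedupe by idempotency key so a retry of the same
    operation returns the cached result instead of re-executing."""
    seen = {}

    def wrapped(idempotency_key, *args, **kwargs):
        if idempotency_key in seen:
            return seen[idempotency_key]  # replay: no side effects
        result = handler(*args, **kwargs)
        seen[idempotency_key] = result
        return result

    return wrapped
```

If a worker crashes after the write but before acknowledging, the retry hits the cache and the payment is captured once, even when the lock did nothing to help.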

Step 4: Add fencing for any operation that mutates the shared state

Do this for payments, inventory, counters, and state machines. The lock reduces contention. The fencing token protects correctness.

FAQ

Is Redis good enough for distributed locks?
For efficiency locks where overlap is not catastrophic, Redis can work well with unique tokens and conservative TTLs. For correctness-critical locking, serious practitioners typically pair stronger coordination with fencing.

Why do ZooKeeper and etcd feel heavier?
They are coordination systems built around strong consistency and session semantics. They cost more operationally, but they behave more predictably under failure.

What is the most common locking bug at scale?
Treating a TTL-based lock as a correctness guarantee, then getting bitten by a pause or partition. The fix is usually fencing plus idempotency, not a longer TTL.

Should you avoid distributed locks entirely?
No. But you should avoid using a lock as your only correctness mechanism. Design as if the lock can be violated, because sometimes it will be.

Honest Takeaway

Distributed locks are not hard because the algorithms are exotic. They are hard because production systems fail in messy, unpredictable ways, and a lock is just another distributed component that can lie to you under stress.

If you want a practical north star: use locks to reduce contention, use idempotency to survive retries, and use fencing tokens to protect correctness. That mindset scales better than arguing endlessly about which lock implementation is “safe,” because it assumes reality will eventually show up and test you.

sumit_kumar

Senior Software Engineer with a passion for building practical, user-centric applications. He specializes in full-stack development with a strong focus on crafting elegant, performant interfaces and scalable backend solutions. With experience leading teams and delivering robust, end-to-end products, he thrives on solving complex problems through clean and efficient code.
