
How to Reduce Latency in Distributed Systems

If you work on a distributed system long enough, latency stops being a nice-to-have metric and starts feeling like a tax on everything you ship. Every feature that touches another service or region quietly adds a few more milliseconds. Then one day a customer demo stalls, the spinner keeps spinning, and everyone stares at a trace that looks like an angry plate of spaghetti.

At a basic level, latency is the time between a request leaving a client and the response coming back. In a single process, that is mostly CPU and maybe a database call. In a distributed system, that same request might hop across ten services, two regions, three caches, and a load balancer. Suddenly you are not fighting just speed of code, but fan out, queues, retries, and the occasional network gremlin.

To write this piece, we pulled together ideas from Jeff Dean, the Google Fellow whose work on tail latency popularized many modern mitigation strategies; Cindy Sridharan, distributed systems engineer and author of “Distributed Systems Observability”; and Werner Vogels, CTO of Amazon, who has spent years reminding people that everything fails, all the time. Across their work and others, a pattern emerges: you cannot buy your way out of latency with hardware alone. You have to design for it.

Let us walk through the concrete techniques that actually move your p95 and p99 down, not just your average.

Why latency becomes brutal in distributed systems

The first mental shift is understanding that average latency is almost useless for user experience. Your system can be fast on average and still feel slow because of the tail.

Jeff Dean and Luiz Barroso’s “The Tail at Scale” shows a simple example. Suppose a single backend exceeds 10 ms only 1 percent of the time, that is, its 99th percentile latency is 10 ms. If a frontend fans out to 100 such backends in parallel and waits for all replies, then about 63 percent of overall requests take longer than 10 ms, because the chance that all 100 backends respond fast is only 0.99^100 ≈ 0.37. A handful of slow outliers per backend combine into a very visible drag.
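To see the amplification yourself, here is a small simulation. The latency distributions (uniform ranges, a 1 percent straggler rate) are made up for illustration, not taken from the paper:

```python
import random

random.seed(42)  # deterministic for illustration

def backend_latency_ms() -> float:
    """One backend call: usually fast, occasionally slow (a made-up 1 percent tail)."""
    if random.random() < 0.99:
        return random.uniform(1, 10)   # typical response
    return random.uniform(10, 100)     # tail response

def request_latency_ms(fan_out: int) -> float:
    """A frontend request that waits for all fan_out backends called in parallel."""
    return max(backend_latency_ms() for _ in range(fan_out))

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[int(len(ordered) * p / 100)]

single = [backend_latency_ms() for _ in range(100_000)]
fanned = [request_latency_ms(100) for _ in range(10_000)]

print(f"single backend p99:  {percentile(single, 99):.0f} ms")
print(f"fan-out 100 p50 (!): {percentile(fanned, 50):.0f} ms")
```

The striking part is that the *median* of the fanned-out request is worse than the *p99* of a single backend: the tail of one server becomes the typical experience of the whole request.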

Why this matters:

  • Most real user actions touch several services.

  • SLAs and SLOs are usually defined on percentiles, not averages.

  • Business critical moments, like “place order” or “checkout,” tend to be the ones hitting the worst tails.

As Igor Ostrovsky and many others have written about latency distributions, it is very common to see 75 percent of requests under 50 ms, another 20 percent under 100 ms, and the last 5 percent much slower. Those 5 percent are where your performance reputation lives.

So the problem is not only “make everything faster.” The real problem is “make the slowest 1 to 5 percent much less bad.”

Measure the right latency signals first

You cannot reduce what you do not measure, and you definitely cannot reduce tail latency with just one metric on a dashboard.

The Google SRE material and many SRE guides converge on the four golden signals: latency, traffic, errors, and saturation. For latency specifically, that usually means:

  • End to end latency from the user or API gateway.

  • Per service latency, broken down by endpoint or method.

  • Percentiles such as p50, p90, p95, p99, and sometimes p99.9.

Tooling-wise, teams commonly use Prometheus, OpenTelemetry, and an APM or tracing system such as Tempo, Jaeger, Honeycomb, or Datadog. The important part is consistent labels: service, route, HTTP status, and maybe region or tenant. Google’s SRE workbook suggests exporting size, latency, and response codes for each dependency and graphing them in ways that line up with the golden signals.


Worked example: reading your p99

Imagine your checkout service handles 10,000 requests in a 5 minute window.

  • 9,500 of them finish in 140 ms or less.

  • 400 take between 140 and 250 ms.

  • 100 are slow, between 250 and 900 ms.

To compute p99, you sort the latencies from fastest to slowest and take the value at rank 9,900, since 99 percent of 10,000 requests fall at or below it. Here that value sits at the top of the middle bucket, about 250 ms. So:

  • p50 ≈ 60 ms (typical)

  • p95 ≈ 140 ms (noticeable)

  • p99 ≈ 250 ms (painful for your unluckiest 1 percent)

If marketing expects all checkouts to feel “instant,” that p99 is the number that tells you how far you are from that promise.
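The percentile arithmetic above is easy to check with a few lines of Python. The nearest-rank method and the synthetic bucket values below are simplifications for illustration:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic sample mirroring the checkout example: 10,000 requests.
latencies = (
    [60] * 5_000 +    # fast half
    [140] * 4_500 +   # up to 140 ms
    [250] * 400 +     # 140-250 ms bucket
    [600] * 100       # slow tail
)

print(percentile(latencies, 50))   # 60
print(percentile(latencies, 95))   # 140
print(percentile(latencies, 99))   # 250
```

In production you would not sort raw samples like this; metrics systems approximate percentiles from histograms. But the rank logic is the same, and it is worth being able to sanity check your dashboards by hand.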

Design the architecture for fewer and faster hops

Once you measure latency properly, the next step is architectural. Distributed systems are often slower than they need to be because every feature adds yet another RPC hop.

Start with a service map or a tracing system that shows a full user request. Count how many network hops a typical request goes through. You can often cut latency without touching code speed, just by reorganizing dependencies.

Practical levers:

  1. Reduce fan out. Avoid patterns where one request fans out to 5 or 10 services when you can aggregate data behind a single “backend for frontend” or composite service. This directly shrinks the tail amplification described in “The Tail at Scale.”

  2. Prefer parallel queries where safe. If two dependencies are independent, call them in parallel instead of in sequence. This puts the longest of them, rather than their sum, on your critical path.

  3. Keep cross region calls off the hot path. Use asynchronous replication, change data capture, or background jobs for cross region work whenever possible. WAN latency is not something you optimize away with clever code.
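The second lever is easy to sketch with Python’s asyncio. The dependency functions and sleep times below are stand-ins for real network calls:

```python
import asyncio
import time

# Hypothetical independent dependencies; the sleeps stand in for network calls.
async def fetch_profile():
    await asyncio.sleep(0.08)    # ~80 ms call
    return {"user": "u1"}

async def fetch_recommendations():
    await asyncio.sleep(0.12)    # ~120 ms call
    return ["a", "b"]

async def handler_serial():
    profile = await fetch_profile()
    recs = await fetch_recommendations()
    return profile, recs         # ~200 ms: the sum of the two calls

async def handler_parallel():
    profile, recs = await asyncio.gather(fetch_profile(), fetch_recommendations())
    return profile, recs         # ~120 ms: the longest of the two calls

for handler in (handler_serial, handler_parallel):
    start = time.perf_counter()
    asyncio.run(handler())
    print(f"{handler.__name__}: {(time.perf_counter() - start) * 1000:.0f} ms")
```

The same idea applies in any language with futures, goroutines, or virtual threads: the win comes from the structure of the call graph, not from the concurrency primitive.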

Here is a small comparison table that captures some core architectural choices and their tradeoffs:

Technique | Latency impact | Tradeoff or risk
Reduce service fan out | Big p95 and p99 improvement | Larger, more complex services
Parallelize independent calls | Critical path becomes the longest call, not the sum | More complex concurrency and error handling
Keep cross region calls off the hot path | Removes large WAN latency spikes | Requires eventual consistency considerations

A rule of thumb: if your trace waterfalls are mostly horizontal (many hops), attack architecture first, not code micro optimizations.

Tame tail latency with defensive techniques

Even with a cleaner architecture, variability kills. Queues grow, GC pauses happen, a noisy neighbor hits the same storage. This is where you use techniques explicitly designed for the tail.

Jeff Dean and others popularized patterns such as hedged requests and reissuing lagging requests. The core idea is simple: if one replica is slow, fire a second copy of the request to another replica and use whichever returns first, possibly with a small delay to avoid storming the system. Evaluations often show p99 dropping far more than the extra resource usage would suggest.
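A minimal sketch of a hedged request in asyncio might look like the following. The replica functions, simulated latencies, and the 50 ms hedge delay are all invented for illustration; a production version would also cap how many hedges are in flight:

```python
import asyncio
import random

async def call_replica(replica: str, payload: str) -> str:
    """Hypothetical replica call; latency is simulated with a random sleep."""
    delay = random.uniform(0.01, 0.03)
    if random.random() < 0.05:           # the occasional straggler
        delay = random.uniform(0.2, 0.5)
    await asyncio.sleep(delay)
    return f"{replica}:{payload}"

async def hedged_call(payload: str, hedge_after: float = 0.05) -> str:
    """Fire the primary; if it has not answered within hedge_after seconds,
    fire a backup against a second replica and use whichever finishes first."""
    primary = asyncio.create_task(call_replica("replica-a", payload))
    try:
        return await asyncio.wait_for(asyncio.shield(primary), timeout=hedge_after)
    except asyncio.TimeoutError:
        backup = asyncio.create_task(call_replica("replica-b", payload))
        done, pending = await asyncio.wait(
            {primary, backup}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()                # drop the loser to free resources
        return done.pop().result()

print(asyncio.run(hedged_call("get_user_42")))
```

Because the hedge only fires after a delay, the extra load lands almost entirely on the slow tail of requests, which is exactly where you want to spend spare capacity.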

Other tail fighting tools:

  • Timeouts and budgets. Give each hop in a call graph a latency budget based on the overall SLO. Enforce timeouts that respect this budget, instead of letting low level libraries retry forever.

  • Circuit breakers. If a dependency is consistently slow or error prone, stop sending traffic for a while and fail fast or degrade gracefully so you do not pile up requests.

  • Graceful degradation. The Tail at Scale paper describes how returning partial results in a tiny fraction of cases improved p99 by more than 50 percent. For many products, “search results without thumbnails” is better than “spinner for 2 seconds and then error.”
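Latency budgets are easiest to understand in code. Here is a sketch of deadline propagation, where each hop spends whatever is left of the end-to-end budget instead of carrying its own fixed timeout. The Deadline class, dependency names, and the numbers are hypothetical:

```python
import asyncio
import time

class Deadline:
    """Carries one end-to-end latency budget through a call graph (a sketch)."""
    def __init__(self, budget_s: float):
        self.expires = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires - time.monotonic())

async def call_dependency(name: str, deadline: Deadline) -> str:
    # Each hop gets only what is left of the overall budget,
    # instead of its own fixed, independent timeout.
    try:
        await asyncio.wait_for(asyncio.sleep(0.05), timeout=deadline.remaining())
        return f"{name}: ok"
    except asyncio.TimeoutError:
        return f"{name}: degraded"       # fail fast once the budget is spent

async def handler():
    deadline = Deadline(budget_s=0.13)   # the request's overall SLO
    results = []
    for dep in ("profile", "recs", "ads"):
        results.append(await call_dependency(dep, deadline))
    return results

print(asyncio.run(handler()))
```

With a 130 ms budget and three ~50 ms hops, the third hop finds the budget exhausted and degrades instead of blowing the SLO; in real systems the deadline usually travels in request metadata, such as gRPC deadlines.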


Whenever you introduce these patterns, involve product owners. You are encoding business tradeoffs like “is it better to show a partial dashboard, or to show nothing and try again.”

Use caching and data locality intelligently

Caching can be the difference between a snappy system and a sluggish one, but only if it is designed with latency and locality in mind.

You have several levels to play with:

  1. In process caches such as Guava, Caffeine, or language built ins. These are very fast, usually microseconds to hit, and ideal for small, hot data such as configuration or permission checks.

  2. Near service caches such as Redis or Memcached in the same region. Network hop plus serialization cost is still far less than a relational database query or a cross region call.

  3. Global caches or CDNs, used especially for read heavy APIs and content. For static content or idempotent reads, global edges can win huge latency savings for distant users.

A few practical tips:

  • Cache negative results, not just positive ones, so you do not hammer slow backends for missing objects.

  • Treat cache invalidation as a product decision, not just a technical afterthought. Stale but fast is often better than fresh but slow, as long as everyone agrees on the rules.

  • Measure cache hit ratios by endpoint, not only globally. A single high volume, low hit rate endpoint can ruin your overall performance profile.

Work from traces: if you see the same key queried repeatedly along the critical path, that is your candidate for a higher level, lower latency cache.
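To make the first two tips concrete, here is a toy in-process cache with a TTL that also caches negative results. Real systems would use Caffeine, Guava, Redis, or similar rather than this sketch:

```python
import time

class TTLCache:
    """Minimal in-process cache sketch with TTL and negative caching."""
    _MISSING = object()          # sentinel so "not found" is itself cacheable

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._store = {}         # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            value = entry[0]
        else:
            value = loader(key)  # may return None for "does not exist"
            if value is None:
                value = self._MISSING   # cache the miss too
            self._store[key] = (value, time.monotonic() + self.ttl_s)
        return None if value is self._MISSING else value

def slow_lookup(key):
    """Stand-in for a slow backend; 'ghost' does not exist."""
    return None if key == "ghost" else f"value-for-{key}"

cache = TTLCache(ttl_s=30.0)
print(cache.get_or_load("user:1", slow_lookup))   # loads from the backend
print(cache.get_or_load("user:1", slow_lookup))   # served from cache
print(cache.get_or_load("ghost", slow_lookup))    # the miss is cached as well
```

Without the sentinel for missing keys, every lookup of a nonexistent object would fall through to the slow backend, which is exactly the "hammering for missing objects" pattern the first tip warns about.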

Reduce queuing and contention under load

A lot of “mysterious” latency explosions turn out to be boring queueing theory. You get near saturation on a service or a database, request queues grow, and someone posts an incident update.

Golden signal frameworks always pair latency with traffic and saturation for this reason. You should always be able to answer: when latency spikes, did our QPS or concurrency spike too, and did CPU, memory, or connection pools hit a ceiling?

Tactics that work in real systems:

  • Right size your concurrency. More worker threads do not always mean more throughput. At some point, context switching and lock contention make each request slower. Use controlled load tests to find the “knee” in your latency curve.

  • Control load with backpressure. If upstream services can keep sending requests without regard to downstream capacity, you will get unbounded queues. Use admission control, rate limits, or token buckets to keep queues short.

  • Separate hot and cold paths. Put heavy, long running work onto explicit background queues with their own capacity, and keep your synchronous path focused on the minimum needed for a response.
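Backpressure via a token bucket can be sketched in a few lines. The rate and burst numbers here are arbitrary:

```python
import time

class TokenBucket:
    """Admission-control sketch: refuse work instead of queueing it unboundedly."""
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False             # shed load: the caller fails fast or degrades

bucket = TokenBucket(rate_per_s=100.0, burst=10.0)
admitted = sum(1 for _ in range(50) if bucket.try_acquire())
print(f"admitted {admitted} of 50 burst requests")  # roughly the burst size
```

The point of rejecting early is queueing theory: a request you refuse in microseconds costs far less latency, for everyone, than a request that sits in a saturated queue for seconds.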

This is where Werner Vogels’s mantra that everything fails, all the time shows up in latency graphs. The quote is often paired with the reminder that you must plan for failure if you want systems to feel fast when things go wrong.

Build observability that explains latency

You will not get far with a single latency number and a log file. Teams that consistently reduce latency invest deeply in metrics, logs, and traces that can be correlated.

Cindy Sridharan’s work on observability emphasizes that metrics, logs, and traces are not three separate chores. They are three ways of exposing the same internal state such that you can answer “what is slow, where, and why” without shipping new code.

A sane minimal setup:

  • Metrics: golden signals per service, per endpoint, with percentiles.

  • Tracing: end to end traces that show every hop, tag important spans with parameters such as user type or feature flag.

  • Logs: structured, indexed logs for the hot paths. Include correlation IDs that tie them to traces.
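A structured log line with a correlation ID might be emitted like this. The field names and services are made up; in practice the trace ID would come from your tracing library rather than a fresh UUID:

```python
import json
import time
import uuid

def log_event(trace_id: str, service: str, message: str, **fields):
    """Emit one structured, JSON-formatted log line; trace_id ties it to the trace."""
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "service": service,
        "message": message,
        **fields,
    }
    print(json.dumps(record))

# The same trace_id flows through every hop of one request.
trace_id = uuid.uuid4().hex
log_event(trace_id, "gateway", "request received", route="/checkout")
log_event(trace_id, "checkout", "calling payment service", region="eu-west-1")
log_event(trace_id, "checkout", "response sent", status=200, latency_ms=184)
```

Because every line is machine-parseable and shares a trace_id, you can join logs to trace spans and metrics in one query, which is the correlation workflow described below.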


Then create a standard workflow for latency incidents. For example:

  1. Start from a user facing SLO or p99 graph.

  2. Jump into a trace sample from that window.

  3. Identify the slowest spans.

  4. From those spans, correlate metrics on the relevant services or dependencies.

Over time, this workflow matters more than any single tool choice. The point is that people can get from “checkout is slow in Europe” to “the recommendation service call to region B is timing out and retrying” in minutes, not hours.

FAQ

How low should my latency be?

There is no universal number. Google engineers suggest that interactions under 100 ms feel “instant” to users, while 500 ms is often still acceptable for content that requires thought, such as reading a page. In practice, teams define SLOs per use case. Search auto complete might target 50 to 100 ms, checkout or login might target 300 to 500 ms at p95.

Is it worth doing complex techniques like hedged requests?

It depends on how bad your tail is and how much extra capacity you can afford. Studies inspired by “The Tail at Scale” show that even modest use of partial results and reissued requests can cut p99 by 40 to 50 percent in some systems. Many teams start with architecture and caching first, then introduce hedging only on the most critical, read heavy endpoints.

Can I just overprovision hardware to fix latency?

More hardware helps avoid saturation, which certainly helps p99, but it cannot fix a bad architecture or noisy dependency. Patterns like cross region calls in the hot path or long serial chains of services will still show up in traces regardless of CPU. Hardware is your buffer, not your primary strategy.

What is the difference between monitoring and observability here?

Monitoring tells you something is wrong, usually by watching thresholds on metrics. Observability is about being able to ask new questions of your system without shipping new instrumentation. In the latency context, observability means you can slice and dice latency by user type, region, feature flag, or payload shape and actually isolate the cause.

Honest takeaway

Reducing latency in distributed systems is not one trick, it is a stack of decisions. You pick better metrics, you trim service graphs, you cache the right things, you add defensive patterns for tails, and you make it possible to debug all of that without guesswork. None of these alone will transform a system, but together they compound.

The realistic path looks like this: choose one critical user journey, instrument it well, and knock 50 to 100 ms off p95 and p99 through obvious fixes such as fan out reduction and caching. Use that win to justify deeper work on observability and tail techniques. Over a year or two, you will build a system that does not just pass load tests, it actually feels fast to the people who pay you.

sumit_kumar

Senior Software Engineer with a passion for building practical, user-centric applications. He specializes in full-stack development with a strong focus on crafting elegant, performant interfaces and scalable backend solutions. With experience leading teams and delivering robust, end-to-end products, he thrives on solving complex problems through clean and efficient code.
