
Monitoring and Observability for Distributed Systems


If you have ever been paged for “elevated error rate” and then spent 45 minutes arguing with dashboards, you already know the dirty secret of distributed systems: the failure is rarely in the place you are looking. Latency spikes in a downstream dependency, retries amplify load, one noisy tenant melts a cache shard, and suddenly your “healthy” service is only healthy in aggregate.

Monitoring and observability are how you get out of that trap. Monitoring is how you detect that something is broken, fast and reliably, with minimal cognitive load. Observability is how you explain what broke, with enough context to debug novel failures without shipping a new build. The best teams treat them as complementary loops. They monitor the user experience and known failure modes, then rely on high cardinality telemetry to slice, compare, and ask better questions when reality diverges from the runbook.

What practitioners actually mean by observability today

If you talk to engineers who live inside production incidents, a consistent theme emerges: system health in the abstract is not enough. You need to understand the health of a specific request, customer, or workflow.

Charity Majors, co-founder and CTO at Honeycomb, has long argued that observability is about being able to ask new questions of your system without predefining them. In practice, that means drilling down to individual events and understanding how real requests move through distributed systems.

Cindy Sridharan, author of Distributed Systems Observability, draws a sharp distinction between monitoring and observability. Monitoring is about known unknowns and predefined failure modes. Observability is about debugging when something surprising happens. Her warning that “monitor everything” quickly becomes noise still holds true in modern stacks.

From the SRE world, Google popularized a pragmatic framing: if you can only measure a few things, focus on latency, traffic, errors, and saturation. Those four signals map directly to user pain and operational risk, and they force teams to prioritize what matters instead of collecting metrics for their own sake.

Taken together, the message is simple. Alert on a small number of user impacting signals. Invest the rest of your energy in rich, structured telemetry that lets you answer “what is different about the bad requests?” when the system behaves in unexpected ways.


Start with the right targets: SLIs, SLOs, and error budgets

The fastest way to make observability useless is to turn it into a dashboard beauty contest. What you need instead are targets that force decisions.

Consider a simple example. You define an availability SLO of 99.9 percent for a critical request path over a 30 day window. Thirty days is 43,200 minutes. An error budget of 0.1 percent gives you 43.2 minutes of allowed failure.

That single number changes behavior:

  • Burning 10 minutes in an hour is no longer “a small blip.” At that pace you would exhaust the entire month’s budget in a matter of hours.

  • Burning five minutes over a week is annoying, but probably not worth waking someone up at 3 a.m.

This is why error budgets work. They translate abstract reliability goals into concrete tradeoffs between shipping features and stabilizing systems. When paired with golden signal monitoring, they keep alerting grounded in user impact rather than infrastructure trivia.
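
The arithmetic above is simple enough to sketch directly. The burn-rate formula below is a standard way to express it: a burn rate of 1.0 means you are spending the budget at exactly the pace that exhausts it at the end of the window.

```python
# Error-budget arithmetic for the example above: a 99.9% availability
# SLO over a 30-day window. All quantities are in minutes.

WINDOW_MINUTES = 30 * 24 * 60              # 43,200 minutes in the window
SLO = 0.999
ERROR_BUDGET = WINDOW_MINUTES * (1 - SLO)  # 43.2 minutes of allowed failure

def burn_rate(minutes_failed: float, minutes_elapsed: float) -> float:
    """How fast the budget is burning relative to a 'just barely OK' pace.

    A burn rate of 1.0 exhausts the budget exactly at the end of the
    window; anything well above 1.0 deserves immediate attention.
    """
    budget_fraction_used = minutes_failed / ERROR_BUDGET
    window_fraction_elapsed = minutes_elapsed / WINDOW_MINUTES
    return budget_fraction_used / window_fraction_elapsed

# 10 bad minutes in one hour: a massive burn rate -- page immediately.
print(round(burn_rate(10, 60), 1))        # → 166.7
# 5 bad minutes over a week: well under 1.0 -- a ticket, not a page.
print(round(burn_rate(5, 7 * 24 * 60), 2))  # → 0.5
```

Running the numbers this way makes the two bullets above concrete: the same absolute downtime can be an emergency or a non-event depending on how fast it accumulates.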

Build telemetry that can explain “why,” not just “what”

The classic “three pillars” model (metrics, logs, and traces) is still useful, and many teams now add continuous profiling as a fourth pillar. The deeper best practice is not the number of signals, but the consistency between them.

Metrics tell you something is wrong. Traces show you where time and errors accumulate across services. Logs give you exact events and payload context. Profiles explain where CPU and memory actually go. None of these is sufficient alone.

The unglamorous but critical requirement is shared context. If an alert fires because p95 latency increased, you should be able to pivot directly to traces for that endpoint and then to logs for those same requests. That only works if identifiers, service names, and attributes line up across all telemetry.
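
A minimal sketch of what shared context looks like in practice. The field names below are an illustrative convention, not a specific vendor schema; the point is that the same identifiers appear in trace attributes and log fields, so pivoting between signals becomes a query rather than detective work.

```python
import json
import uuid

# Illustrative only: these attribute names are an example convention.
# What matters is that the SAME join keys appear on every signal.
trace_id = uuid.uuid4().hex  # in practice, taken from the incoming request

shared_context = {
    "service.name": "checkout",
    "deployment.environment": "prod",
    "http.route": "/api/v1/orders",  # stable route, not the raw URL
}

# The trace span and the log line both carry the same identifiers.
span_attributes = {**shared_context, "trace_id": trace_id}
log_line = json.dumps({**shared_context, "trace_id": trace_id,
                       "level": "error", "message": "payment declined"})

# Pivoting from a slow trace to its logs is now a simple equality match.
assert json.loads(log_line)["trace_id"] == span_attributes["trace_id"]
```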


Teams that get this right spend far less time correlating dashboards by hand and far more time answering focused questions like “which tenants are affected?” or “what changed in this release?”

Instrumentation that scales in distributed systems

In distributed systems, instrumentation quality matters more than instrumentation volume.

A practical baseline that holds up at scale looks like this:

  1. Standardize context propagation across every hop, including asynchronous boundaries like queues and schedulers.

  2. Use consistent naming and semantic conventions for services, operations, and resources so queries behave the same across teams.

  3. Make sampling an explicit decision. Many teams sample all errors, sample slow requests above a latency threshold, and keep a small baseline sample of everything else.

  4. Treat cardinality as a budget. User IDs and request IDs are invaluable in traces and logs, but they can bankrupt your metrics system if used carelessly.
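
The sampling policy in point 3 can be sketched as a small decision function. Because the rule depends on final latency, this is effectively tail-based sampling; the threshold and baseline rate below are placeholders to tune for your own traffic.

```python
import random

# Placeholder values -- tune for your own traffic and storage budget.
SLOW_THRESHOLD_MS = 1000
BASELINE_RATE = 0.01  # keep 1% of ordinary traffic

def keep_trace(is_error: bool, duration_ms: float,
               rng: random.Random = random) -> bool:
    """Decide whether to keep a completed trace."""
    if is_error:
        return True                      # sample all errors
    if duration_ms >= SLOW_THRESHOLD_MS:
        return True                      # sample all slow requests
    return rng.random() < BASELINE_RATE  # small baseline sample of the rest

# Errors and slow requests are always kept; fast successes rarely are.
print(keep_trace(True, 20), keep_trace(False, 2500))  # → True True
```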

One high leverage habit is to define a required attribute set for every request boundary: service name, environment, region, endpoint, status code, and a stable operation name. This turns debugging from guesswork into structured analysis.
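
One way to enforce that habit, sketched below, is to validate telemetry records at the boundary and flag anything missing. The attribute names are illustrative.

```python
# Illustrative attribute names; adapt to your own naming conventions.
REQUIRED_ATTRIBUTES = {"service.name", "deployment.environment", "region",
                       "http.route", "http.status_code", "operation"}

def missing_attributes(record: dict) -> set:
    """Return the required attributes a telemetry record lacks."""
    return REQUIRED_ATTRIBUTES - record.keys()

record = {"service.name": "checkout", "deployment.environment": "prod",
          "region": "eu-west-1", "http.route": "/api/v1/orders",
          "http.status_code": 200, "operation": "CreateOrder"}
assert not missing_attributes(record)           # complete record passes
print(missing_attributes({"service.name": "checkout"}))  # shows what is absent
```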

Alerting without pager fatigue

Good alerting is boring. If your on call rotation feels exciting every day, something is wrong.

A minimal, effective strategy looks like this:

  • Page on symptoms, not causes. User facing latency, error rates, saturation, and SLO burn rate deserve immediate attention.

  • Create tickets for causes. Disk space trends, node churn, replica imbalance, and noisy logs should be addressed, but not at 3 a.m.

  • Use black box checks for critical user journeys so you catch cases where the system is technically up but functionally broken.

A simple rule keeps alerting sane: every paging alert must have a clear user impact, a defined owner, a linked runbook, and a known first action within five minutes. If any of those are missing, the alert probably does not belong on the pager.
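
One widely used symptom-based paging rule combines the error-budget idea with this checklist: page only when the budget is burning fast over both a long and a short window, which filters out brief blips that have already recovered. A minimal sketch, with thresholds that are illustrative rather than a standard (14.4 is a commonly cited value, corresponding to spending 2 percent of a 30-day budget in one hour):

```python
# Multi-window burn-rate check: both windows must be hot to page.
# Thresholds are illustrative; tune them for your SLO and window.
def should_page(long_window_burn: float, short_window_burn: float,
                long_threshold: float = 14.4,
                short_threshold: float = 14.4) -> bool:
    return (long_window_burn >= long_threshold
            and short_window_burn >= short_threshold)

print(should_page(20.0, 18.0))  # sustained fast burn  → True
print(should_page(20.0, 0.3))   # already recovering   → False
```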

Operate observability like a product

Dashboards are not observability. They are interfaces for testing hypotheses.

Teams that treat observability as a product tend to follow a steady cadence:

  • Maintain one “what users feel” dashboard per service, focused on golden signals and SLOs.

  • Maintain trace first views for critical user journeys, showing where time is spent and where failures occur.

  • After every incident, identify one question you could not answer quickly, then add the missing telemetry or context so the next incident is easier.


This is how observability compounds. Monitoring stays focused on known failure modes. Observability evolves to handle the unknown unknowns that inevitably appear in complex systems.

FAQ

Do I need both metrics and distributed tracing?
Yes, if you run distributed systems. Metrics are unmatched for cheap, continuous alerting. Traces are unmatched for understanding causality across services. Using one to replace the other usually leads to blind spots or runaway costs.

What is the most common observability failure in microservices?
Broken context propagation. When trace or correlation IDs do not survive async hops and service boundaries, you lose end-to-end visibility exactly when you need it.
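
The fix is usually mechanical: carry the trace context inside the message itself, since there are no request headers on an async hop. A minimal sketch, using an in-memory list as a stand-in for a real queue; the `traceparent` field name follows the W3C Trace Context convention, and in practice an instrumentation library would handle this for you.

```python
import json

def publish(queue: list, payload: dict, traceparent: str) -> None:
    # Embed the trace context in the message body, because there are
    # no HTTP headers to propagate it across the async boundary.
    queue.append(json.dumps({"traceparent": traceparent, "payload": payload}))

def consume(queue: list) -> tuple:
    message = json.loads(queue.pop(0))
    # Restore the context before doing any work, so the consumer's spans
    # join the producer's trace instead of starting a new one.
    return message["traceparent"], message["payload"]

queue: list = []
publish(queue, {"order_id": 42},
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
ctx, payload = consume(queue)
assert ctx.startswith("00-")  # context survived the hop
```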

How do teams keep observability costs under control?
They budget cardinality, sample traces intentionally, and standardize attributes so data is not duplicated across tools and teams. Consistency reduces both cost and cognitive overhead.

Honest Takeaway

Effective monitoring and observability require a deliberate tradeoff: fewer alerts, more context. Golden signals and error budgets keep you focused on user impact. High quality, consistent telemetry is what makes novel debugging possible when systems behave in ways you did not predict.

The payoff is real, but it is earned. Teams that succeed treat observability like software, with conventions, reviews, and continuous improvement after every incident, until “what changed?” becomes a query instead of a meeting.

Sumit Kumar

Senior Software Engineer with a passion for building practical, user-centric applications. He specializes in full-stack development with a strong focus on crafting elegant, performant interfaces and scalable backend solutions. With experience leading teams and delivering robust, end-to-end products, he thrives on solving complex problems through clean and efficient code.
