You have dashboards. Plural. They glow on wall-mounted TVs. They stream into Slack. They’re color-coded, real-time, and technically accurate.
And yet, your last incident still surprised you.
That’s the paradox of modern observability. You’re not lacking data. You’re drowning in it. What you’re missing is a dashboard that surfaces risk, not just metrics.
A monitoring dashboard is not a status page. It is not a vanity chart. It is a decision surface. Its job is simple and ruthless: show you, at a glance, where production is fragile right now.
If your dashboards don’t help you answer “Are we about to hurt users?” in under 10 seconds, they’re ornamental.
Let’s fix that.
What Experienced SREs Say About Dashboards (And Where Most Go Wrong)
Before writing this, we reviewed guidance from Google’s Site Reliability Engineering book and revisited how Datadog, Grafana Labs, and Honeycomb practitioners talk about dashboard design in public talks and docs.
Ben Treynor Sloss, former VP of Engineering at Google and founder of Google SRE, has repeatedly emphasized that SRE exists to manage risk, not to maximize uptime at any cost. That framing matters. If SRE is about risk management, then dashboards must expose risk signals, not just system states.
Charity Majors, co-founder of Honeycomb, often argues that dashboards are for known unknowns. They’re good for monitoring steady-state health but insufficient for exploring novel failure modes. Translation: dashboards should track what you already know can break, and escalate when it drifts.
Grafana Labs engineers, in their documentation and conference talks, consistently recommend organizing dashboards around services and user journeys, not infrastructure components. That’s a subtle but important shift.
Put those together, and a pattern emerges:
- Dashboards must map to user risk.
- They should encode known failure modes.
- They must reflect service boundaries, not host-level trivia.
Most dashboards fail because they optimize for visibility instead of vulnerability.
Start With Risk, Not Metrics
The most common mistake I see is teams starting with the question:
What metrics do we have?
Instead, start with:
How do we fail?
Take a typical SaaS API. Here are the production risks that actually matter:
- Users cannot authenticate
- Requests exceed SLA latency
- Payment processing fails
- Background jobs stall
- Database saturates or deadlocks
Now, map each risk to one or two leading indicators.
For example:
| Production Risk | Leading Signal | Lagging Signal |
|---|---|---|
| Auth outage | Error rate on /login > 2% | Drop in successful logins |
| Latency regression | P95 > 800ms for 5 minutes | SLA violation count |
| Payment failure | Stripe API 5xx spike | Revenue per minute drop |
| Job queue backlog | Queue depth growth rate | Processing delay > 10 min |
| DB saturation | CPU > 85% + connection pool waits | Timeout errors increase |
That table is more useful than 50 random CPU charts.
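As a sketch, the risk-to-signal mapping above can be encoded as evaluable rules rather than left implicit in chart titles. The metric names and thresholds below are hypothetical placeholders, not from any specific monitoring tool:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RiskSignal:
    name: str
    is_firing: Callable[[dict], bool]  # evaluates a snapshot of current metrics

# Hypothetical metric snapshot keys; map these to your own telemetry.
RISKS = [
    RiskSignal("auth_outage", lambda m: m["login_error_rate"] > 0.02),
    RiskSignal("latency_regression", lambda m: m["p95_ms"] > 800),
    RiskSignal("payment_failure", lambda m: m["stripe_5xx_rate"] > 0.01),
    RiskSignal("job_backlog", lambda m: m["queue_depth_growth_per_min"] > 50),
    RiskSignal("db_saturation", lambda m: m["db_cpu"] > 0.85 and m["pool_waits"] > 0),
]

def active_risks(snapshot: dict) -> list[str]:
    """Return the names of every known failure mode currently firing."""
    return [r.name for r in RISKS if r.is_firing(snapshot)]

snapshot = {"login_error_rate": 0.031, "p95_ms": 610, "stripe_5xx_rate": 0.002,
            "queue_depth_growth_per_min": 12, "db_cpu": 0.91, "pool_waits": 4}
print(active_risks(snapshot))  # -> ['auth_outage', 'db_saturation']
```

The point of the structure is that every panel on the dashboard corresponds to one named risk, not one raw metric.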
A good risk dashboard contains only signals that answer:
- Are users impacted?
- Are we close to breaching SLOs?
- Is a failure mode trending toward activation?
Everything else belongs elsewhere.
Build Around SLOs, Not Infrastructure
If you do nothing else, redesign your top-level dashboard around Service Level Objectives.
Google’s SRE model formalized this for a reason. Error budgets turn abstract reliability into a quantifiable risk budget.
Here is the mental shift:
Infrastructure dashboards ask:
- Is the CPU high?
- Is memory growing?
- Is disk usage climbing?
Risk dashboards ask:
- Are we burning the error budget faster than expected?
- Is availability dipping below 99.9%?
- Is latency threatening our contractual SLA?
For a customer-facing API, your top panel should include:
- Current availability percentage
- Error budget remaining
- P50, P95, and P99 latency
- Request rate
- Active incidents or alerts
Only below that should you drill into:
- Dependency health
- Database saturation
- Cache hit rate
- Queue depth
If your homepage shows node-level CPU before user-visible SLOs, you are optimizing for engineers, not users.
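To make the error-budget framing concrete, here is a minimal sketch of how budget remaining and burn rate fall out of request counts. The window size and counts are illustrative, not tied to any particular tool:

```python
def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the window's error budget still unspent (negative = overspent)."""
    allowed_bad = (1 - slo) * total  # failures the SLO budgets for this window
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad if allowed_bad else 0.0

def burn_rate(slo: float, good: int, total: int) -> float:
    """How many times faster than 'exactly exhausting the budget' we are burning.
    A value above 1.0 means the budget runs out before the window ends."""
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return actual_bad / allowed_bad if allowed_bad else float("inf")

# 99.9% SLO over a window of 1,000,000 requests, 1,300 of which failed
print(round(error_budget_remaining(0.999, 998_700, 1_000_000), 3))  # -> -0.3
print(round(burn_rate(0.999, 998_700, 1_000_000), 3))               # -> 1.3
```

A burn rate of 1.3x is exactly the kind of number that belongs in the top panel: it converts "some errors happened" into "we are spending reliability 30% faster than we can afford."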
Design for Cognitive Load Under Stress
Dashboards are consumed in two states:
- Casual health checks
- Incident response
The second state is what matters.
In an incident, your brain is bandwidth-constrained. You need signal compression.
Here’s how to design for that reality.
1. Make the “Red” Meaningful
If everything is red, nothing is.
Use thresholds tied to SLOs and business impact, not arbitrary values. CPU at 75% is not red unless it correlates with latency or saturation.
2. Collapse by Default
Your top view should have 6 to 10 panels max. Not 40.
For example:
- SLO summary
- Error rate by service
- Latency percentiles
- Dependency health rollup
- Queue depth
- DB saturation index
Everything else goes into drill-down dashboards.
3. Show Trends, Not Just Snapshots
A flat 500ms latency means something different if it was 200ms ten minutes ago.
Always include:
- A short window: the last 5 to 15 minutes
- A long window: the last 1 to 24 hours
Trend slope is often a better early warning than raw values.
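Trend slope is cheap to compute with a plain least-squares fit over recent samples. A minimal sketch, with illustrative data matching the flat-versus-climbing latency example above:

```python
def slope_per_minute(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (minute, value) samples, in value units per minute."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * v for t, v in samples)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

# Flat 500 ms vs. latency that climbed from 200 ms over the last 10 minutes
flat = [(t, 500.0) for t in range(11)]
rising = [(t, 200.0 + 30.0 * t) for t in range(11)]
print(slope_per_minute(flat))    # -> 0.0
print(slope_per_minute(rising))  # -> 30.0
```

Both series end at roughly the same value; only the slope reveals that one of them is a problem in progress.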
Encode Known Failure Modes Into the Dashboard
Dashboards should reflect postmortems.
Every serious incident should leave a scar on your monitoring surface.
If your last outage was caused by:
- Thread pool exhaustion
- Slow third-party dependency
- Unbounded queue growth
- Cache stampede
Then your dashboard should explicitly track those dimensions.
After a fintech platform I worked with suffered a cascading failure caused by connection pool exhaustion, we added a “Saturation Panel” to the top-level dashboard that combined:
- Active DB connections
- Wait time for pool acquisition
- Request latency correlation
We never had the same blind spot again.
Your dashboard should structure operational signals so humans can interpret risk at a glance.
Structure is not cosmetic. It determines what gets noticed.
Separate the Three Dashboard Types Clearly
One mistake is mixing audiences.
You need three distinct layers.
1. Executive Risk Dashboard
Audience: leadership
Question: Are customers impacted?
Content:
- Availability
- Error budget
- Revenue-impacting signals
- Open incidents
No JVM heap graphs. Ever.
2. Service Owner Dashboard
Audience: engineers
Question: Is my service healthy?
Content:
- SLO metrics
- Error rate by endpoint
- Latency percentiles
- Resource saturation
- Dependency status
3. Exploratory Debug Dashboard
Audience: responders
Question: Why is this breaking?
Content:
- High-cardinality breakdowns
- Per-region or per-tenant views
- Trace and log pivots
- Feature flag correlations
Mixing these creates clutter and hides risk.
Use “Risk Indicators” Instead of Raw Metrics
Raw metrics are ingredients. Risk indicators are conclusions.
Instead of showing:
- CPU %
- Memory %
- Request count
- Queue depth
Create composite panels like:
- Saturation index: weighted CPU + memory + connection pool waits
- Dependency risk score: combined 5xx rate and latency spike
- Backlog growth rate: derivative of queue depth
You can compute these with PromQL, Datadog formulas, or custom exporters.
The idea is to evaluate quality of signal, not quantity of charts: dashboards should elevate meaningful, synthesized indicators, not every metric you happen to collect.
Signal synthesis beats signal abundance.
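As an illustration, a saturation index of the kind listed above can be a simple weighted sum of normalized signals. The weights here are arbitrary placeholders you would tune for your own system, not recommended values:

```python
def saturation_index(cpu: float, mem: float, pool_wait_ratio: float,
                     weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Weighted composite of normalized saturation signals, each in [0, 1].

    Inputs: CPU utilization, memory utilization, and the fraction of
    requests that waited on the connection pool. Weights are placeholders.
    """
    w_cpu, w_mem, w_wait = weights
    return w_cpu * cpu + w_mem * mem + w_wait * pool_wait_ratio

# DB at 85% CPU, 60% memory, 20% of requests waiting on the pool
print(round(saturation_index(0.85, 0.60, 0.20), 2))  # -> 0.58
```

One number per panel, trending toward 1.0, is far easier to read under incident stress than three separate charts.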
Add Leading Indicators for Silent Failures
Some failures do not trip obvious alarms.
Examples:
- Slow degradation before a crash
- Regional imbalance
- Uneven shard utilization
- Gradual error rate increase below threshold
Add:
- Error rate change percentage over 10 minutes
- Traffic skew by region
- Growth rate of retries
- Timeout ratio vs total requests
These often catch issues before the hard threshold alert fires.
If your dashboard only shows “threshold crossed,” you are already late.
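A rate-of-change indicator like "error rate change over 10 minutes" is trivial to compute. A sketch, where the 10-minute baseline window is an assumption, not a standard:

```python
def pct_change(current: float, baseline: float) -> float:
    """Percentage change of a rate versus a baseline (e.g. 10 minutes ago)."""
    if baseline == 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline * 100

# 0.4% errors ten minutes ago, 0.7% now: both are under a 1% hard threshold,
# but a +75% jump is exactly the kind of silent drift worth surfacing
print(round(pct_change(0.007, 0.004), 1))  # -> 75.0
```

The absolute rate would not fire any alert here; the relative change tells you something is already moving.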
One Worked Example: Rebuilding a Real API Dashboard
Let’s say your SaaS API has:
- 99.9% availability SLO
- 800ms P95 latency SLO
- 10k RPS peak traffic
- PostgreSQL primary + replicas
- Redis cache
- Background job workers
Top-Level Risk Dashboard
Panel 1: SLO Summary
- Availability: 99.92%
- Error budget remaining: 68%
- 7-day burn rate: 1.3x
Panel 2: Latency
- P50: 120ms
- P95: 610ms
- P99: 1.4s
- Trend over 1 hour
Panel 3: Error Rate
- 5xx percentage
- Breakdown by endpoint
Panel 4: Saturation
- DB CPU
- Connection pool wait time
- Redis memory utilization
Panel 5: Queue Health
- Queue depth
- Processing latency
- Growth rate
If DB CPU hits 85%, connection waits increase, and P95 climbs from 600ms to 900ms over 10 minutes, the dashboard tells a story:
We are approaching user-visible degradation.
That is a risk surfaced in real time.
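That story can itself be encoded. A hedged sketch that combines the three signals from the scenario into one statement; the thresholds and function names are illustrative, not a prescription:

```python
def degradation_story(db_cpu: float, pool_wait_trend: float,
                      p95_now_ms: float, p95_10m_ago_ms: float) -> str:
    """Combine saturation, pool-wait trend, and latency drift into one risk line."""
    signs = []
    if db_cpu >= 0.85:
        signs.append("DB CPU saturated")
    if pool_wait_trend > 0:
        signs.append("pool waits climbing")
    if p95_now_ms > p95_10m_ago_ms * 1.25:
        signs.append("P95 rising fast")
    if len(signs) >= 2:  # require corroborating signals before declaring risk
        return "RISK: approaching user-visible degradation (" + ", ".join(signs) + ")"
    return "healthy"

# The scenario above: DB CPU at 85%, waits rising, P95 climbing 600 -> 900 ms
print(degradation_story(0.85, 4.0, 900, 600))
```

Requiring two corroborating signals is the same logic an experienced responder applies mentally; encoding it just makes the conclusion visible to everyone.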
FAQ
How many panels should a risk dashboard have?
Six to ten. More than that increases cognitive load. Use drill-down dashboards for detail.
Should dashboards replace alerts?
No. Dashboards are situational awareness tools. Alerts are interrupt mechanisms. They should reinforce each other.
What about logs and traces?
Dashboards should link to them. During incidents, responders pivot from metrics to traces quickly. Your panels should include one-click transitions.
How often should dashboards change?
After every significant postmortem. If you learn something new about how your system fails, encode it.
Honest Takeaway
A good monitoring dashboard does not make you feel informed. It makes you slightly uncomfortable when the risk is rising.
Design around failure modes, SLO burn, and user impact. Remove vanity metrics. Encode postmortem lessons. Separate audiences. Reduce cognitive load.
If you do that, your dashboard stops being a wall of charts and becomes what it was meant to be: an early warning system for production reality.
And the next time something breaks, you will see it coming.