Monitoring & Alert Quality
Know what to watch, not just what to install
Real-world analogy
Monitoring without tuning is like a hospital alarm that beeps for every patient blink. Doctors ignore it, and the one real emergency gets lost in the noise. Good monitoring beeps only when it matters.
What is it?
Monitoring & alerting is how you find out about problems before users do, without drowning in false alarms. Good monitoring is a product, not a tool install.
Real-world relevance
A payments service starts timing out at p99. Monitoring catches it before customers tweet. An alert with a runbook link tells the on-call to check the DB connection pool, which is near capacity. Scale the pool, incident avoided, post-incident review schedules a permanent fix.
Key points
- The MELT model — Metrics, Events, Logs, Traces — Metrics: time-series numbers (CPU, latency). Events: discrete occurrences (deploy, incident). Logs: text records. Traces: a request’s path. Each answers a different question; mature ops blends all four.
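To make the four MELT signals concrete, here is an illustrative sketch in plain Python. The dicts stand in for what a real observability stack (Prometheus, Loki, Tempo, etc.) would collect; the field names are assumptions, not any library's API.

```python
import time

def observe_request(path: str, duration_ms: float, status: int) -> dict:
    """Illustrative only: one request producing all four MELT signals."""
    now = time.time()
    return {
        # Metric: a numeric time-series sample
        "metric": {"name": "http_request_duration_ms", "value": duration_ms, "ts": now},
        # Event: a discrete occurrence worth marking on dashboards
        "event": {"type": "request_completed", "path": path, "ts": now},
        # Log: a human-readable text record
        "log": f"{status} {path} took {duration_ms:.0f}ms",
        # Trace: the request's path through services (a single span here)
        "trace": {"span": "handle_request", "attributes": {"path": path}, "duration_ms": duration_ms},
    }

telemetry = observe_request("/checkout", 420.0, 200)
print(telemetry["log"])
```

Each key answers a different question: "how much/how fast?" (metric), "what happened when?" (event), "what exactly went on?" (log), "where did the time go?" (trace).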
- SLI, SLO, SLA — SLI: what we measure (e.g., 99.9% of requests under 500ms). SLO: internal target. SLA: external contractual promise with penalties. Business cares about SLAs; teams work from SLOs.
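The SLI/SLO/error-budget arithmetic is simple enough to sketch directly. The latencies below are made up; in practice they come from your metrics backend, not an in-memory list.

```python
# Hypothetical request latencies in ms
latencies_ms = [120, 340, 90, 510, 220, 480, 700, 150, 300, 410]

THRESHOLD_MS = 500     # SLI: fraction of requests served under 500 ms
SLO_TARGET = 0.995     # internal target: 99.5% of requests under threshold

good = sum(1 for latency in latencies_ms if latency < THRESHOLD_MS)
sli = good / len(latencies_ms)            # measured reality: 8/10 = 0.8 here
error_budget = 1 - SLO_TARGET             # 0.5% of requests are allowed to be slow
budget_spent = (1 - sli) / error_budget   # > 1.0 means the budget is blown

print(f"SLI={sli:.3f}, error budget spent={budget_spent:.0%}")
```

The SLA would be the externally promised (and usually looser) version of the same number, with contractual penalties attached.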
- Golden signals — Latency, traffic, errors, saturation. For every critical service, monitor all four. It’s a simple, effective baseline that catches most real problems.
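One query per golden signal is a reasonable starting dashboard. The PromQL strings below are illustrative, and the metric names (`http_request_duration_seconds_bucket`, `http_requests_total`, `process_open_fds`) are assumptions; substitute whatever your service actually exports.

```python
# Illustrative PromQL per golden signal -- metric names are placeholders.
GOLDEN_SIGNALS = {
    "latency":    'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))',
    "traffic":    'sum(rate(http_requests_total[5m]))',
    "errors":     'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "saturation": 'max(process_open_fds / process_max_fds)',
}

for signal, query in GOLDEN_SIGNALS.items():
    print(f"{signal:>10}: {query}")
```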
- Actionable alerts — Every alert must have: a human explanation, a runbook link, and a clear action. Non-actionable alerts train humans to ignore alerts — the worst failure mode of all.
- Thresholds and noise — Avoid static thresholds for volatile metrics. Use percentiles (p95/p99 latency), windows (last 5 min), and comparisons (week over week). Alert fatigue kills on-call teams.
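Percentile-plus-window alerting can be sketched in a few lines. The samples below are invented, and the nearest-rank percentile is a simplification (real systems compute quantiles from histogram buckets), but it shows why a p95 rule stays quiet through a single spike that a static per-sample threshold would page on.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile; real systems use histogram buckets instead."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical last-5-minute latency window (ms): 19 normal samples, 1 spike
window = [110, 112, 115, 118, 120, 121, 122, 124, 125, 126,
          128, 130, 131, 133, 135, 137, 139, 141, 145, 900]

p95 = percentile(window, 95)   # ignores the single 900 ms outlier
p99 = percentile(window, 99)   # catches it

# A static "any sample > 500 ms" rule would have paged on the one spike;
# the p95-over-window rule stays quiet.
alert = p95 > 500
print(p95, p99, alert)
```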
- Common open-source tools — Prometheus + Grafana (metrics + dashboards), Elasticsearch/Loki (logs), Tempo/Jaeger (traces), Alertmanager (routing). Commercial: Datadog, New Relic, Dynatrace.
- Synthetic and real-user monitoring — Synthetic: scripted check from outside (‘every 1 minute, log in and search’). RUM: real user data. Both detect problems users feel before they file a ticket.
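A synthetic check is just a scripted user action run on a schedule with a pass/fail verdict. This sketch uses a placeholder URL and a hypothetical "body must contain ok" journey; the fetch function is injected so the check logic can be exercised without a network.

```python
import urllib.request

def synthetic_check(fetch, url="https://example.com/health", timeout=5):
    """Run one scripted probe; fetch is injected so this is testable offline."""
    try:
        status, body = fetch(url, timeout)
    except Exception:
        return {"ok": False, "reason": "request failed"}
    if status != 200:
        return {"ok": False, "reason": f"status {status}"}
    if "ok" not in body:
        return {"ok": False, "reason": "unexpected body"}
    return {"ok": True, "reason": ""}

def real_fetch(url, timeout):
    # Production version: an actual HTTP GET, run every minute by a probe runner.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode()

# Stubbed response to demonstrate the check logic itself:
result = synthetic_check(lambda url, t: (200, '{"status": "ok"}'))
print(result)
```

In production the runner would call `synthetic_check(real_fetch)` on a schedule and feed the verdict into alerting; RUM complements this with what real users actually experienced.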
- On-call hygiene — Runbooks per alert. Escalation paths. Shift handovers. Post-incident review after every page. Rotate calmly — burnt-out on-call is unsafe on-call.
Code example
// Alert design template
Alert name: api_p99_latency_high
Severity: Warning
Metric: histogram_quantile(0.99, ...)
Window: last 5 min
Condition: p99 > 500ms for 10 min
Runbook: https://runbooks.contoso.com/api-latency
Dashboards: https://grafana.contoso.com/d/api-overview
Owner: Team-Payments
On-call: payments-oncall
Suppressions: during maintenance windows tagged "payments"
Notes:
- Check DB connection pool, cache hit ratio, upstream latencies.
- Roll back recent deploys if correlated.
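The "for 10 min" clause in the template above is what separates a sustained problem from a blip. A minimal sketch of that hold-duration logic (timestamps in minutes for readability; real evaluators work on scrape intervals):

```python
def fires(samples, threshold_ms=500, hold_min=10):
    """samples: list of (minute, p99_ms) pairs, evaluated once per minute.

    The alert fires only when p99 has exceeded the threshold
    continuously for hold_min minutes; any recovery resets the clock.
    """
    breach_start = None
    for minute, p99 in samples:
        if p99 > threshold_ms:
            if breach_start is None:
                breach_start = minute
            if minute - breach_start >= hold_min:
                return True
        else:
            breach_start = None   # condition cleared; reset the clock
    return False

spike = [(m, 900 if m == 3 else 200) for m in range(20)]      # one bad minute
sustained = [(m, 900 if m >= 5 else 200) for m in range(20)]  # 15 bad minutes
print(fires(spike), fires(sustained))
```

The spike never pages; the sustained breach does. This is the single cheapest noise filter an alert can have.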
// Golden signals for every critical service
Latency, Traffic, Errors, Saturation.
// SLO example
99.5% of API requests under 500ms over a rolling 28-day window.
Error budget = 0.5% -> budget consumption triggers conversations.
Line-by-line walkthrough
- 1. Alert design template
- 2. Alert name
- 3. Severity
- 4. Metric definition
- 5. Evaluation window
- 6. Condition
- 7. Runbook URL
- 8. Dashboards URL
- 9. Owner
- 10. On-call routing
- 11. Suppression windows
- 12. Notes for first responders
- 13. Golden signals reminder
- 14. SLO example
- 15. Error budget description
Spot the bug
The on-call engineer mutes the 'Disk Warning' alert every week for six months. Eventually the disk fills and the app crashes at 3 AM.
Need a hint?
What is the real failure here — tool or process?
Show answer
Process. Muting a repeating alert without fixing it teaches the team to ignore reality. Fix: either tune the alert (threshold, window), add cleanup automation, or grow capacity. Every chronic alert needs a review + ticketed remediation — not permanent snoozing.
Explain like I'm 5
Alarms should only ring when something real is wrong. If the alarm cries wolf every hour, everyone stops listening — and then the real wolf eats the sheep.
Fun fact
Google’s SRE book is largely responsible for popularizing SLOs and error budgets worldwide. Many modern on-call cultures are downstream of a single book that said ‘stop celebrating heroes; build better alerts.’
Hands-on challenge
On a test service, define 1 latency SLO and 1 error-rate SLO. Build a Grafana dashboard showing them. Write one actionable alert for each, with a runbook link (even if placeholder).
More resources
- Google SRE workbook (free) (Google SRE)
- Prometheus docs (Prometheus)
- Grafana docs (Grafana)