Monitoring & Alert Quality
Know what to watch, not just what to install
Real-world analogy
Monitoring without tuning is like a hospital alarm that beeps for every patient blink. Doctors ignore it, and the one real emergency gets lost in the noise. Good monitoring beeps only when it matters.
What is it?
Monitoring & alerting is how you find out about problems before users do, without drowning in false alarms. Good monitoring is a product, not a tool install.
Real-world relevance
A payments service starts timing out at p99. Monitoring catches it before customers tweet. An alert with a runbook link tells the on-call to check the DB connection pool, which is near capacity. Scale the pool, incident avoided, post-incident review schedules a permanent fix.
Key points
- The MELT model — Metrics, Events, Logs, Traces — Metrics: time-series numbers (CPU, latency). Events: discrete occurrences (deploy, incident). Logs: text records. Traces: a request’s path. Each answers a different question; mature ops blends all four.
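To make the four MELT signals concrete, here is an illustrative sketch in plain Python. The dicts stand in for what a real observability stack (Prometheus, Loki, Tempo, etc.) would collect; the field names are assumptions, not any library's API.

```python
import time

def observe_request(path: str, duration_ms: float, status: int) -> dict:
    """Illustrative only: one request producing all four MELT signals."""
    now = time.time()
    return {
        # Metric: a numeric time-series sample
        "metric": {"name": "http_request_duration_ms", "value": duration_ms, "ts": now},
        # Event: a discrete occurrence worth marking on dashboards
        "event": {"type": "request_completed", "path": path, "ts": now},
        # Log: a human-readable text record
        "log": f"{status} {path} took {duration_ms:.0f}ms",
        # Trace: the request's path through services (a single span here)
        "trace": {"span": "handle_request", "attributes": {"path": path}, "duration_ms": duration_ms},
    }

telemetry = observe_request("/checkout", 420.0, 200)
print(telemetry["log"])
```

Each key answers a different question: "how much/how fast?" (metric), "what happened when?" (event), "what exactly went on?" (log), "where did the time go?" (trace).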
- SLI, SLO, SLA — SLI: what we measure (e.g., 99.9% of requests under 500ms). SLO: internal target. SLA: external contractual promise with penalties. Business cares about SLAs; teams work from SLOs.
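The SLI/SLO/error-budget arithmetic is simple enough to sketch directly. The latencies below are made up; in practice they come from your metrics backend, not an in-memory list.

```python
# Hypothetical request latencies in ms
latencies_ms = [120, 340, 90, 510, 220, 480, 700, 150, 300, 410]

THRESHOLD_MS = 500     # SLI: fraction of requests served under 500 ms
SLO_TARGET = 0.995     # internal target: 99.5% of requests under threshold

good = sum(1 for latency in latencies_ms if latency < THRESHOLD_MS)
sli = good / len(latencies_ms)            # measured reality: 8/10 = 0.8 here
error_budget = 1 - SLO_TARGET             # 0.5% of requests are allowed to be slow
budget_spent = (1 - sli) / error_budget   # > 1.0 means the budget is blown

print(f"SLI={sli:.3f}, error budget spent={budget_spent:.0%}")
```

The SLA would be the externally promised (and usually looser) version of the same number, with contractual penalties attached.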
- Golden signals — Latency, traffic, errors, saturation. For every critical service, monitor all four. It’s a simple, effective baseline that catches most real problems.
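One query per golden signal is a reasonable starting dashboard. The PromQL strings below are illustrative, and the metric names (`http_request_duration_seconds_bucket`, `http_requests_total`, `process_open_fds`) are assumptions; substitute whatever your service actually exports.

```python
# Illustrative PromQL per golden signal -- metric names are placeholders.
GOLDEN_SIGNALS = {
    "latency":    'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))',
    "traffic":    'sum(rate(http_requests_total[5m]))',
    "errors":     'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "saturation": 'max(process_open_fds / process_max_fds)',
}

for signal, query in GOLDEN_SIGNALS.items():
    print(f"{signal:>10}: {query}")
```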
- Actionable alerts — Every alert must have: a human explanation, a runbook link, and a clear action. Non-actionable alerts train humans to ignore alerts — the worst failure mode of all.
- Thresholds and noise — Avoid static thresholds for volatile metrics. Use percentiles (p95/p99 latency), windows (last 5 min), and comparisons (week over week). Alert fatigue kills on-call teams.
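Percentile-plus-window alerting can be sketched in a few lines. The samples below are invented, and the nearest-rank percentile is a simplification (real systems compute quantiles from histogram buckets), but it shows why a p95 rule stays quiet through a single spike that a static per-sample threshold would page on.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile; real systems use histogram buckets instead."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical last-5-minute latency window (ms): 19 normal samples, 1 spike
window = [110, 112, 115, 118, 120, 121, 122, 124, 125, 126,
          128, 130, 131, 133, 135, 137, 139, 141, 145, 900]

p95 = percentile(window, 95)   # ignores the single 900 ms outlier
p99 = percentile(window, 99)   # catches it

# A static "any sample > 500 ms" rule would have paged on the one spike;
# the p95-over-window rule stays quiet.
alert = p95 > 500
print(p95, p99, alert)
```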
- Common open-source tools — Prometheus + Grafana (metrics + dashboards), Elasticsearch/Loki (logs), Tempo/Jaeger (traces), Alertmanager (routing). Commercial: Datadog, New Relic, Dynatrace.
- Synthetic and real-user monitoring — Synthetic: scripted check from outside (‘every 1 minute, log in and search’). RUM: real user data. Both detect problems users feel before they file a ticket.
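A synthetic check is just a scripted user action run on a schedule with a pass/fail verdict. This sketch uses a placeholder URL and a hypothetical "body must contain ok" journey; the fetch function is injected so the check logic can be exercised without a network.

```python
import urllib.request

def synthetic_check(fetch, url="https://example.com/health", timeout=5):
    """Run one scripted probe; fetch is injected so this is testable offline."""
    try:
        status, body = fetch(url, timeout)
    except Exception:
        return {"ok": False, "reason": "request failed"}
    if status != 200:
        return {"ok": False, "reason": f"status {status}"}
    if "ok" not in body:
        return {"ok": False, "reason": "unexpected body"}
    return {"ok": True, "reason": ""}

def real_fetch(url, timeout):
    # Production version: an actual HTTP GET, run every minute by a probe runner.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode()

# Stubbed response to demonstrate the check logic itself:
result = synthetic_check(lambda url, t: (200, '{"status": "ok"}'))
print(result)
```

In production the runner would call `synthetic_check(real_fetch)` on a schedule and feed the verdict into alerting; RUM complements this with what real users actually experienced.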
- On-call hygiene — Runbooks per alert. Escalation paths. Shift handovers. Post-incident review after every page. Rotate calmly — burnt-out on-call is unsafe on-call.
Code example
// Alert design template
Alert name: api_p99_latency_high
Severity: Warning
Metric: histogram_quantile(0.99, ...)
Window: last 5 min
Condition: p99 > 500ms for 10 min
Runbook: https://runbooks.contoso.com/api-latency
Dashboards: https://grafana.contoso.com/d/api-overview
Owner: Team-Payments
On-call: payments-oncall
Suppressions: during maintenance windows tagged "payments"
Notes:
- Check DB connection pool, cache hit ratio, upstream latencies.
- Roll back recent deploys if correlated.
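The "for 10 min" clause in the template above is what separates a sustained problem from a blip. A minimal sketch of that hold-duration logic (timestamps in minutes for readability; real evaluators work on scrape intervals):

```python
def fires(samples, threshold_ms=500, hold_min=10):
    """samples: list of (minute, p99_ms) pairs, evaluated once per minute.

    The alert fires only when p99 has exceeded the threshold
    continuously for hold_min minutes; any recovery resets the clock.
    """
    breach_start = None
    for minute, p99 in samples:
        if p99 > threshold_ms:
            if breach_start is None:
                breach_start = minute
            if minute - breach_start >= hold_min:
                return True
        else:
            breach_start = None   # condition cleared; reset the clock
    return False

spike = [(m, 900 if m == 3 else 200) for m in range(20)]      # one bad minute
sustained = [(m, 900 if m >= 5 else 200) for m in range(20)]  # 15 bad minutes
print(fires(spike), fires(sustained))
```

The spike never pages; the sustained breach does. This is the single cheapest noise filter an alert can have.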
// Golden signals for every critical service
Latency, Traffic, Errors, Saturation.
// SLO example
99.5% of API requests under 500ms over a rolling 28-day window.
Error budget = 0.5% -> budget consumption triggers conversations.
Line-by-line walkthrough
- 1. Alert design template
- 2. Alert name
- 3. Severity
- 4. Metric definition
- 5. Evaluation window
- 6. Condition
- 7. Runbook URL
- 8. Dashboards URL
- 9. Owner
- 10. On-call routing
- 11. Suppression windows
- 12. Notes for first responders
- 13. Golden signals reminder
- 14. SLO example
- 15. Error budget description
Spot the bug
The on-call engineer mutes the 'Disk Warning' alert every week for six months. Eventually the disk fills and the app crashes at 3 AM.
Need a hint?
What is the real failure here — tool or process?
Show answer
Process. Muting a repeating alert without fixing it teaches the team to ignore reality. Fix: either tune the alert (threshold, window), add cleanup automation, or grow capacity. Every chronic alert needs a review + ticketed remediation — not permanent snoozing.
Explain like I'm 5
Alarms should only ring when something real is wrong. If the alarm cries wolf every hour, everyone stops listening — and then the real wolf eats the sheep.
Fun fact
Google’s SRE book is largely responsible for popularizing SLOs and error budgets worldwide. Many modern on-call cultures are downstream of a single book that said ‘stop celebrating heroes; build better alerts.’
Hands-on challenge
On a test service, define 1 latency SLO and 1 error-rate SLO. Build a Grafana dashboard showing them. Write one actionable alert for each, with a runbook link (even if placeholder).
More resources
- Google SRE workbook (free) (Google SRE)
- Prometheus docs (Prometheus)
- Grafana docs (Grafana)