Lesson 51 of 60 (Intermediate)

Monitoring & Alert Quality

Know what to watch, not just what to install

Open interactive version (quiz + challenge)

Real-world analogy

Monitoring without tuning is like a hospital alarm that beeps for every patient blink. Doctors ignore it, and the one real emergency gets lost in the noise. Good monitoring beeps only when it matters.

What is it?

Monitoring & alerting is how you find out about problems before users do — and not drown in false alarms. Good monitoring is a product, not a tool install.

Real-world relevance

A payments service starts timing out at p99. Monitoring catches it before customers tweet. An alert with a runbook link tells the on-call to check the DB connection pool, which is near capacity. Scale the pool, incident avoided, post-incident review schedules a permanent fix.

Key points

- Watch the four golden signals on every critical service: latency, traffic, errors, saturation.
- Every alert must be actionable: a clear owner, a severity, a runbook link, and a dashboard link.
- Define SLOs and error budgets so alert thresholds map to user impact, not arbitrary numbers.
- A chronically muted alert is a process failure: tune it, automate the remediation, or delete it.

Code example

// Alert design template

Alert name:   api_p99_latency_high
Severity:     Warning
Metric:       histogram_quantile(0.99, ...)
Window:       last 5 min
Condition:    p99 > 500ms for 10 min
Runbook:      https://runbooks.contoso.com/api-latency
Dashboards:   https://grafana.contoso.com/d/api-overview
Owner:        Team-Payments
On-call:      payments-oncall
Suppressions: during maintenance windows tagged "payments"
Notes:
  - Check DB connection pool, cache hit ratio, upstream latencies.
  - Roll back recent deploys if correlated.

// Golden signals for every critical service
  Latency, Traffic, Errors, Saturation.

// SLO example
  99.5% of API requests under 500ms over a rolling 28-day window.
  Error budget = 0.5% -> budget consumption triggers conversations.
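The error-budget arithmetic above can be sketched in a few lines of Python. The request counts are illustrative assumptions, not real traffic data:

```python
# Sketch: error-budget math for the SLO above (99.5% of requests under
# 500ms over a rolling 28-day window). Numbers below are made up.

def error_budget_remaining(total_requests: int, bad_requests: int,
                           slo_target: float = 0.995) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    budget = 1.0 - slo_target              # 0.5% of requests may miss the target
    allowed_bad = total_requests * budget  # absolute number of slow requests allowed
    if allowed_bad == 0:
        return 0.0
    return 1.0 - (bad_requests / allowed_bad)

# Example: 10M requests in the window, 30k of them slower than 500ms.
remaining = error_budget_remaining(10_000_000, 30_000)
print(f"{remaining:.0%}")  # 40% of the budget left -> time for those conversations
```

The useful property: the budget turns "we had some slow requests" into a concrete number the team can spend deliberately (risky deploys, experiments) or protect when it runs low.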

Line-by-line walkthrough

  1. Alert design template — a reusable header for every alert definition.
  2. Alert name — a unique, descriptive identifier (api_p99_latency_high).
  3. Severity — Warning here; reserve paging severities for user-facing impact.
  4. Metric definition — the query expression, a p99 latency histogram quantile.
  5. Evaluation window — how much data each evaluation considers (last 5 min).
  6. Condition — threshold plus duration: p99 > 500ms sustained for 10 min, so brief spikes don't fire.
  7. Runbook URL — the first thing the on-call opens; every alert needs one.
  8. Dashboards URL — the matching Grafana view for fast triage.
  9. Owner — the team accountable for tuning and fixing this alert.
  10. On-call routing — which rotation actually gets paged.
  11. Suppression windows — silence during tagged maintenance windows, not ad hoc muting.
  12. Notes for first-responders — likely causes to check and a rollback cue.
  13. Golden signals reminder — latency, traffic, errors, saturation.
  14. SLO example — a target (99.5% under 500ms) over a rolling 28-day window.
  15. Error budget description — the 0.5% you may "spend"; consumption triggers conversations.

Spot the bug

An on-call engineer mutes the 'Disk Warning' alert every week for six months. Eventually the disk fills and the app crashes at 3 AM.
Hint: what is the real failure here, the tool or the process?

Answer:
Process. Muting a repeating alert without fixing it teaches the team to ignore reality. Fix: either tune the alert (threshold, window), add cleanup automation, or grow capacity. Every chronic alert needs a review + ticketed remediation — not permanent snoozing.
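The "tune the threshold and window" fix boils down to one rule: fire only when the condition holds for the whole duration, so a single spike never pages anyone. A minimal sketch of that logic (a hypothetical evaluator, not any real monitoring API):

```python
# Sketch: "condition for N minutes" evaluation, the logic behind the
# "p99 > 500ms for 10 min" line in the template. Hypothetical helper.

from collections import deque

def should_fire(samples: deque, threshold_ms: float, for_minutes: int) -> bool:
    """samples: most recent per-minute p99 readings, newest last."""
    if len(samples) < for_minutes:
        return False  # not enough history yet; stay quiet
    recent = list(samples)[-for_minutes:]
    return all(p99 > threshold_ms for p99 in recent)

readings = deque([480, 510, 620, 590, 530], maxlen=60)
print(should_fire(readings, threshold_ms=500, for_minutes=3))  # True: last 3 minutes all above 500ms
print(should_fire(readings, threshold_ms=500, for_minutes=5))  # False: the 480ms reading breaks the streak
```

Widening `for_minutes` is often a better de-flapping knob than raising the threshold: it keeps sensitivity to real degradation while ignoring one-off blips.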

Explain like I'm 5

Alarms should only ring when something real is wrong. If the alarm cries wolf every hour, everyone stops listening — and then the real wolf eats the sheep.

Fun fact

Google’s SRE book is largely responsible for popularizing SLOs and error budgets worldwide. Many modern on-call cultures are downstream of a single book that said ‘stop celebrating heroes; build better alerts.’

Hands-on challenge

On a test service, define one latency SLO and one error-rate SLO. Build a Grafana dashboard showing both. Write one actionable alert for each, with a runbook link (even if it is a placeholder).
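While working through the challenge, a tiny "alert quality" lint helps keep the alerts honest. This is a sketch that assumes alerts are stored as dicts; the field names mirror the template earlier in the lesson but are otherwise an assumption:

```python
# Sketch: lint alert definitions for the qualities this lesson demands.
# Storage format (plain dicts) and field names are assumptions.

REQUIRED = ("name", "severity", "condition", "runbook", "owner")

def lint_alert(alert: dict) -> list[str]:
    """Return a list of problems; an empty list means the alert passes."""
    problems = [f"missing field: {f}" for f in REQUIRED if not alert.get(f)]
    if alert.get("runbook", "").startswith("TODO"):
        problems.append("runbook is still a placeholder; write it before go-live")
    return problems

alert = {"name": "api_p99_latency_high", "severity": "Warning",
         "condition": "p99 > 500ms for 10 min", "owner": "Team-Payments"}
print(lint_alert(alert))  # ['missing field: runbook']
```

Running a check like this in CI turns "every alert needs a runbook and an owner" from a team norm into an enforced invariant.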

More resources

← Back to course: IT Jobs Bootcamp