Lesson 47 of 60 (advanced)

From Tabletop to Real Recovery

How juniors contribute during outages


Real-world analogy

A real incident is like a hospital code-blue. Nobody needs a hero; everyone needs to follow the playbook. The junior who documents, stays calm, and hands over clearly is more valuable than the one who tries to solve everything.

What is it?

Real recovery work depends on clear roles, disciplined comms, and structured learning, not heroics. Juniors who build these habits early are well placed to grow toward Incident Commander responsibilities.

Real-world relevance

A DNS change at 1 AM breaks authentication across a bank. Scribe (junior) captures timeline. IC declares Sev 1. Comms lead updates customers every 30 minutes. Network + IAM teams roll back the change. Post-incident review (PIR) finds a missing pre-production check step; runbook updated, automation added. Four hours to full restore; weeks of lasting improvement.

Key points

The incident lifecycle — Detect → Declare → Contain → Recover → Verify → Close → Learn. Skipping any stage creates audit findings or repeat incidents.
Declaration criteria — A clear bar for calling an incident vs treating it as a ticket. ‘Customer-impacting’ vs ‘internal’, severity thresholds, external reporting triggers (regulators, customers, partners).
Roles during an incident — Incident Commander (IC), Communications Lead, Subject-Matter Experts, Scribe, Executive Liaison. Juniors often start as scribes — the single best seat to learn incident command.
Comms channels and discipline — One primary channel (e.g., a dedicated Teams/Slack room). Out-of-band comms for compromised-channel scenarios. One spokesperson for external comms. No social posts from responders.
Evidence preservation during recovery — Security incidents need logs, images, and timelines. ‘Just reboot to get it working’ can destroy the evidence needed later. If in doubt, preserve before acting.
Post-incident review (PIR) — A blameless review within 1–2 weeks. Goal: timeline + root causes + systemic improvements + owners + due dates. The PIR document is how organizations actually learn.
Communications to users — Short, clear updates at regular intervals: what we know, what we’re doing, what we don’t know yet, when we’ll update next. Silence is scarier than bad news.
Regulators and external reporting — In regulated sectors, critical incidents trigger external reporting clocks (for example, an expectation of 72-hour reporting in some frameworks). Know who owns these in your org; don’t freelance the reporting.
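
The lifecycle in the first key point can be sketched as a small state machine that refuses to skip a stage. This is a minimal illustration, not a real tool: the Stage and Incident names are hypothetical, and the strict one-step rule is a simplifying assumption (real incidents sometimes loop back, for example from Verify to Contain).

```python
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages, in the order the lesson lists them."""
    DETECT = 1
    DECLARE = 2
    CONTAIN = 3
    RECOVER = 4
    VERIFY = 5
    CLOSE = 6
    LEARN = 7


class Incident:
    """Hypothetical incident record that only moves one stage at a time."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self, target: Stage) -> None:
        # Only the immediate next stage is allowed; anything else is a skip or a rewind.
        if target != self.stage + 1:
            raise ValueError(f"cannot move from {self.stage.name} to {target.name}")
        self.stage = target
        self.history.append(target)


incident = Incident("Auth outage after DNS change")
incident.advance(Stage.DECLARE)
incident.advance(Stage.CONTAIN)
try:
    incident.advance(Stage.CLOSE)  # skips Recover and Verify
except ValueError as err:
    print(err)  # prints: cannot move from CONTAIN to CLOSE
```

Recording the history alongside the current stage gives the post-incident review a simple record of which stages were actually completed.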

Code example

// Incident response roles (concise)

Incident Commander (IC)
  - Runs the incident, makes decisions, owns timeline
  - Does NOT type commands or fix hands-on

Communications Lead
  - Owns customer/internal/exec messaging cadence
  - Drafts updates, coordinates with PR / legal if needed

SMEs (network, identity, DB, app)
  - Investigate and execute recovery steps
  - Report facts to IC; do not broadcast independently

Scribe
  - Captures timestamped facts: events, decisions, actions
  - Supports the post-incident review with clean evidence

Executive Liaison
  - Summarizes status for execs; translates technical to business
  - Shields IC from non-essential escalations

Line-by-line walkthrough

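1. Title comment for the roles summary
2. Blank separator
3. Incident Commander header
4. IC runs the incident, makes decisions, owns the timeline
5. IC boundary: directs the work, does not do hands-on fixes
6. Blank separator
7. Communications Lead header
8. Owns the messaging cadence for customers, internal staff, and execs
9. Drafts updates; coordinates with PR/legal if needed
10. Blank separator
11. SMEs header (network, identity, DB, app)
12. Investigate and execute recovery steps
13. Report facts to the IC; don’t broadcast independently
14. Blank separator
15. Scribe header
16. Captures timestamped facts: events, decisions, actions
17. Supports the PIR with clean evidence
18. Blank separator
19. Executive Liaison header
20. Summarizes status for execs in business terms
21. Shields the IC from non-essential escalations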
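
The Scribe's job in the role summary above (timestamped events, decisions, and actions) can be sketched as a tiny log. This is only an illustration of the habit: Entry, Timeline, and the field names are hypothetical, the sample times are made up, and most teams would use a shared document or an incident tool instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional

KINDS = ("event", "decision", "action")  # the three fact types from the role summary


@dataclass(frozen=True)
class Entry:
    at: datetime  # expected to be timezone-aware
    kind: str
    text: str
    who: str


@dataclass
class Timeline:
    entries: List[Entry] = field(default_factory=list)

    def record(self, kind: str, text: str, who: str,
               at: Optional[datetime] = None) -> Entry:
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind}")
        # Default to the current UTC time when the scribe does not supply one.
        stamp = at if at is not None else datetime.now(timezone.utc)
        entry = Entry(stamp, kind, text, who)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by time so late-entered notes still read in order; print in UTC.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at.astimezone(timezone.utc):%H:%M}Z [{e.kind}] {e.who}: {e.text}"
            for e in ordered
        )


start = datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc)  # made-up start time
log = Timeline()
log.record("event", "Authentication failures reported", "monitoring", at=start)
log.record("decision", "Sev 1 declared", "IC", at=start + timedelta(minutes=5))
print(log.render())
```

Keeping every entry as a plain fact with a time and a name is what makes the log usable as evidence in the PIR.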

Spot the bug

During a Sev 1 outage, the junior posts: 'We’re getting hacked! DM me for details' on LinkedIn.

Need a hint?

Which three rules does this break, and what could it cost?

Show answer

(1) Only the designated spokesperson communicates externally. (2) Never speculate publicly during an incident. (3) Preserve confidentiality. Consequences: customer panic, a possible regulatory breach, an advantage for any attacker (they read social media too), and personal disciplinary action. Right behavior: funnel all comms through the Communications Lead and save reflections for the blameless PIR.

Explain like I'm 5

In a real emergency, you want teammates who listen, take notes, update people calmly, and don’t panic. That’s exactly what a great junior looks like in an outage — priceless.

Fun fact

Google’s site reliability engineering culture popularized ‘blameless post-mortems.’ The premise: humans make mistakes; systems should be designed to survive them. Blame cultures quietly destroy transparency — and transparency is the only reliable input to learning.

Hands-on challenge

Run a tabletop with a friend: pretend an M365 outage is underway. You play the Incident Commander; they play a senior engineer. Practice 3 decision points and 3 customer comms updates, each with a timestamp.
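
For the comms updates in this challenge, a small helper can keep each message in the four-part shape from the key points (what we know, what we're doing, what we don't know yet, when we'll update next). This is a rough sketch: status_update is a hypothetical function, the sample wording is invented, and the 30-minute default simply mirrors the cadence in the bank example.

```python
from datetime import datetime, timedelta, timezone


def status_update(sent_at: datetime, known: str, doing: str, unknown: str,
                  interval_minutes: int = 30) -> str:
    """Format one user-facing update; sent_at is assumed to be in UTC."""
    next_at = sent_at + timedelta(minutes=interval_minutes)
    return "\n".join([
        f"[{sent_at:%H:%M} UTC] Incident update",
        f"What we know: {known}",
        f"What we're doing: {doing}",
        f"What we don't know yet: {unknown}",
        f"Next update by: {next_at:%H:%M} UTC",
    ])


sent = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # made-up tabletop time
print(status_update(
    sent,
    known="Some users cannot sign in to M365 mail.",
    doing="Reviewing recent changes and checking the vendor status page.",
    unknown="Root cause and time to restore.",
))
```

Committing to the next update time in every message is the part that prevents silence, so the helper computes it rather than leaving it to memory.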

More resources

Google SRE books (free online) (Google SRE)
NIST SP 800-61 Rev. 2, Computer Security Incident Handling Guide (NIST)
CISA Incident Response Plan basics (CISA)
