RPO, RTO & DR Drills
Business language for technical people
Real-world analogy
DR planning is like fire drills. Everyone thinks they’ll remember what to do until the alarm actually rings. RPO is ‘how much data can we lose?’ RTO is ‘how long can we be down?’ Drills are the difference between plan and performance.
What is it?
RPO, RTO, and DR planning translate technical resilience into business language and reveal whether "we have backups" actually means real, tested recovery. Juniors who can speak this language become the bridge between ops and the business.
Real-world relevance
A core banking outage at 10 AM on a Monday: RPO target 5 min, RTO target 30 min. DR plan is rehearsed quarterly. Incident commander declares disaster, failover begins at 10:04, service resumes at 10:27. Auditor happy; customers barely notice.
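The scenario above is just time arithmetic: the outage starts at 10:00 and service resumes at 10:27, so the question is whether 27 minutes beats the 30-minute RTO target. A minimal sketch in Python (the date is hypothetical; only the times and targets come from the scenario):

```python
from datetime import datetime

# Scenario figures from the text above; the calendar date is invented
outage_start = datetime(2026, 4, 13, 10, 0)
service_resumed = datetime(2026, 4, 13, 10, 27)
rto_target_min = 30

downtime_min = (service_resumed - outage_start).total_seconds() / 60
print(f"Downtime: {downtime_min:.0f} min, RTO met: {downtime_min <= rto_target_min}")
# → Downtime: 27 min, RTO met: True
```

The same subtraction works across midnight or month boundaries, which is why timestamps beat hand-counted minutes in a real incident report.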
Key points
- RPO — Recovery Point Objective — Maximum tolerable data loss measured in time. ‘Our RPO is 15 minutes’ means we can lose up to 15 minutes of changes. Drives backup frequency.
- RTO — Recovery Time Objective — Maximum tolerable downtime measured in time. ‘Our RTO is 2 hours’ means we must be back online within 2 hours. Drives recovery architecture (failover, replicas, DR site).
- MTTR vs MTBF vs MTTA — MTTR: Mean Time To Repair/Restore. MTBF: Mean Time Between Failures. MTTA: Mean Time To Acknowledge. The business side uses these figures in service-level reports.
- Replication vs backup — Replication: continuous or near-real-time copy for HA/DR. Backup: point-in-time copies for recovery from corruption/ransomware. You need both — replication alone won’t help against ransomware that replicates too.
- Active/active vs active/passive — Active/active: multiple sites serving traffic simultaneously. Active/passive: one site primary, another standby, failed over manually or automatically. Cost vs complexity tradeoff.
- Dependency mapping — A DR plan lists every dependency: DNS, identity, network, certificates, storage, apps, monitoring, third parties. Missing one equals plan failure.
- Runbooks and roles — Who declares a disaster? Who runs the runbook? Who talks to users? Who updates execs? Documented roles prevent chaos on day zero.
- Exercises: tabletop, partial, full — Tabletop: discussion-only walkthrough. Partial: recover one system. Full: recover the environment end-to-end. Mature shops run at least one partial exercise yearly.
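The MTTR/MTBF/MTTA metrics above are simple averages over an incident log. A minimal sketch in Python (the incident data and the 720-hour window are invented for illustration, not from the text):

```python
# Hypothetical incident log: (detected, acknowledged, restored) timestamps,
# in hours since the start of a 30-day (720-hour) observation window.
incidents = [
    (100.0, 100.1, 101.5),
    (300.0, 300.3, 302.0),
    (600.0, 600.2, 600.8),
]
window_hours = 720.0

# MTTR: average time from detection to restoration
mttr = sum(restored - detected for detected, _, restored in incidents) / len(incidents)
# MTTA: average time from detection to acknowledgement
mtta = sum(acked - detected for detected, acked, _ in incidents) / len(incidents)
# MTBF: mean operating (up) time between failures
downtime = sum(restored - detected for detected, _, restored in incidents)
mtbf = (window_hours - downtime) / len(incidents)

print(f"MTTR {mttr:.2f} h, MTTA {mtta:.2f} h, MTBF {mtbf:.1f} h")
```

Note the asymmetry the bullet list hints at: MTTR feeds the RTO conversation (how fast can we restore?), while backup frequency feeds the RPO conversation (how much can we lose?).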
Code example
// DR drill report template (partial exercise)
Exercise: Restore CRM database to DR site
Date: 2026-04-15
Participants: DBA, Network, Security, App Owner, IT Manager
RPO target: 15 minutes
RTO target: 2 hours

Pre-checks:
[ ] Current backup validated (timestamp + checksum)
[ ] DR network links reachable
[ ] DR identity and DNS ready
[ ] Runbook version 2.3 printed and reviewed

Execution timeline:
T+0 Declaration of exercise
T+10 min Backup delivered to DR storage
T+25 min DB restored, checksum verified
T+40 min App reconfigured to DR endpoints
T+55 min User smoke tests passed

Results:
RPO achieved: 10 minutes (target 15)
RTO achieved: 55 minutes (target 120)
Issues found: DNS TTLs too high; runbook missing a cert step
Actions: shorten DNS TTLs; update runbook; re-drill in Q3
Line-by-line walkthrough
- 1. DR drill template header
- 2. Exercise name
- 3. Date
- 4. Participants
- 5. RPO target
- 6. RTO target
- 7. Blank separator
- 8. Pre-checks section
- 9. Validate backup
- 10. DR networking
- 11. DR identity/DNS
- 12. Runbook version
- 13. Blank separator
- 14. Execution timeline header
- 15. T+0 declare
- 16. T+10 deliver backup
- 17. T+25 DB restored
- 18. T+40 app reconfigured
- 19. T+55 smoke tests
- 20. Blank separator
- 21. Results header
- 22. RPO achieved vs target
- 23. RTO achieved vs target
- 24. Issues found
- 25. Actions to close gaps
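The Results section of the report boils down to comparing achieved values against targets and tracking follow-up actions. A minimal sketch in Python (the field names are my own, not a standard report format; the values mirror the template above):

```python
# Hypothetical drill record mirroring the report template above
drill = {
    "rpo_target_min": 15, "rpo_achieved_min": 10,
    "rto_target_min": 120, "rto_achieved_min": 55,
    "issues": ["DNS TTLs too high", "runbook missing a cert step"],
    "actions": ["shorten DNS TTLs", "update runbook", "re-drill in Q3"],
}

rpo_met = drill["rpo_achieved_min"] <= drill["rpo_target_min"]
rto_met = drill["rto_achieved_min"] <= drill["rto_target_min"]
# Open issues don't fail a drill by themselves, but every issue
# should have at least one action item tracking it to closure.
issues_tracked = len(drill["actions"]) >= len(drill["issues"])
passed = rpo_met and rto_met and issues_tracked

print(f"RPO met: {rpo_met}, RTO met: {rto_met}, drill passed: {passed}")
# → RPO met: True, RTO met: True, drill passed: True
```

Structuring the report as data like this makes it easy to trend achieved RPO/RTO across quarterly drills, which is exactly the evidence auditors ask for.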
Spot the bug
Company has an RTO target of 1 hour for core banking, but no DR site, no runbook, and backups are on a USB drive in the same building.
Need a hint?
Why is the stated RTO meaningless here?
Show answer
Without a DR architecture, recent tested offsite backups, and a documented runbook, a 1-hour RTO is aspirational, not achievable. Fix: a geographically separated DR site (cloud or second datacenter), 3-2-1 backups with immutability, dependency-mapped runbooks, at least annual drills, and monitoring that can detect failures fast.
Explain like I'm 5
If your phone dies, how much do you mind losing (RPO) and how long can you wait before getting a new one (RTO)? DR is that question, but for a whole company — and a drill proves you know the answer.
Fun fact
The Bangladesh central bank’s directives and modern ICT-risk frameworks explicitly expect banks to plan, document, test, and report DR exercises. Many banks now run at least annual drills with evidence retained — a trend mirrored across global financial regulators.
Hands-on challenge
Write a one-page DR drill plan for a small fictional company: 3 apps, 1 DB, 1 file server. Include RPO, RTO, dependencies, runbook steps, roles, and success criteria.
More resources
- NIST SP 800-34 Contingency Planning Guide (NIST)
- ISO/IEC 27031 business continuity (ISO)
- DR planning for SMB/Enterprise (Lawrence Systems)