RPO, RTO & DR Drills
Business language for technical people
Real-world analogy
DR planning is like fire drills. Everyone thinks they’ll remember what to do until the alarm actually rings. RPO is ‘how much data can we lose?’ RTO is ‘how long can we be down?’ Drills are the difference between plan and performance.
What is it?
RPO, RTO, and DR planning translate technical resilience into business language and reveal whether "we have backups" actually means real, tested recovery. Juniors who can speak this language become the bridge between ops and the business.
Real-world relevance
A core banking outage at 10 AM on a Monday: RPO target 5 min, RTO target 30 min. DR plan is rehearsed quarterly. Incident commander declares disaster, failover begins at 10:04, service resumes at 10:27. Auditor happy; customers barely notice.
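The scenario above is just time arithmetic: the outage starts at 10:00 and service resumes at 10:27, so the question is whether 27 minutes beats the 30-minute RTO target. A minimal sketch in Python (the date is hypothetical; only the times and targets come from the scenario):

```python
from datetime import datetime

# Scenario figures from the text above; the calendar date is invented
outage_start = datetime(2026, 4, 13, 10, 0)
service_resumed = datetime(2026, 4, 13, 10, 27)
rto_target_min = 30

downtime_min = (service_resumed - outage_start).total_seconds() / 60
print(f"Downtime: {downtime_min:.0f} min, RTO met: {downtime_min <= rto_target_min}")
# → Downtime: 27 min, RTO met: True
```

The same subtraction works across midnight or month boundaries, which is why timestamps beat hand-counted minutes in a real incident report.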
Key points
- RPO — Recovery Point Objective — Maximum tolerable data loss measured in time. ‘Our RPO is 15 minutes’ means we can lose up to 15 minutes of changes. Drives backup frequency.
- RTO — Recovery Time Objective — Maximum tolerable downtime measured in time. ‘Our RTO is 2 hours’ means we must be back online within 2 hours. Drives recovery architecture (failover, replicas, DR site).
- MTTR vs MTBF vs MTTA — MTTR: Mean Time To Repair/Restore. MTBF: Mean Time Between Failures. MTTA: Mean Time To Acknowledge. The business side uses these figures in service-level reports.
- Replication vs backup — Replication: continuous or near-real-time copy for HA/DR. Backup: point-in-time copies for recovery from corruption/ransomware. You need both — replication alone won’t help against ransomware that replicates too.
- Active/active vs active/passive — Active/active: multiple sites serving traffic simultaneously. Active/passive: one site primary, another standby, failed over manually or automatically. Cost vs complexity tradeoff.
- Dependency mapping — A DR plan lists every dependency: DNS, identity, network, certificates, storage, apps, monitoring, third parties. Missing one equals plan failure.
- Runbooks and roles — Who declares a disaster? Who runs the runbook? Who talks to users? Who updates execs? Documented roles prevent chaos on day zero.
- Exercises: tabletop, partial, full — Tabletop: discussion-only walkthrough. Partial: recover one system. Full: recover the environment end-to-end. Mature shops run at least one partial exercise yearly.
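The MTTR/MTBF/MTTA metrics above are simple averages over an incident log. A minimal sketch in Python (the incident data and the 720-hour window are invented for illustration, not from the text):

```python
# Hypothetical incident log: (detected, acknowledged, restored) timestamps,
# in hours since the start of a 30-day (720-hour) observation window.
incidents = [
    (100.0, 100.1, 101.5),
    (300.0, 300.3, 302.0),
    (600.0, 600.2, 600.8),
]
window_hours = 720.0

# MTTR: average time from detection to restoration
mttr = sum(restored - detected for detected, _, restored in incidents) / len(incidents)
# MTTA: average time from detection to acknowledgement
mtta = sum(acked - detected for detected, acked, _ in incidents) / len(incidents)
# MTBF: mean operating (up) time between failures
downtime = sum(restored - detected for detected, _, restored in incidents)
mtbf = (window_hours - downtime) / len(incidents)

print(f"MTTR {mttr:.2f} h, MTTA {mtta:.2f} h, MTBF {mtbf:.1f} h")
```

Note the asymmetry the bullet list hints at: MTTR feeds the RTO conversation (how fast can we restore?), while backup frequency feeds the RPO conversation (how much can we lose?).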
Code example
// DR drill report template (partial exercise)
Exercise: Restore CRM database to DR site
Date: 2026-04-15
Participants: DBA, Network, Security, App Owner, IT Manager
RPO target: 15 minutes
RTO target: 2 hours

Pre-checks:
[ ] Current backup validated (timestamp + checksum)
[ ] DR network links reachable
[ ] DR identity and DNS ready
[ ] Runbook version 2.3 printed and reviewed

Execution timeline:
T+0 Declaration of exercise
T+10 min Backup delivered to DR storage
T+25 min DB restored, checksum verified
T+40 min App reconfigured to DR endpoints
T+55 min User smoke tests passed

Results:
RPO achieved: 10 minutes (target 15)
RTO achieved: 55 minutes (target 120)
Issues found: DNS TTLs too high; runbook missing a cert step
Actions: shorten DNS TTLs; update runbook; re-drill in Q3
Line-by-line walkthrough
- 1. DR drill template header
- 2. Exercise name
- 3. Date
- 4. Participants
- 5. RPO target
- 6. RTO target
- 7. Blank separator
- 8. Pre-checks section
- 9. Validate backup
- 10. DR networking
- 11. DR identity/DNS
- 12. Runbook version
- 13. Blank separator
- 14. Execution timeline header
- 15. T+0 declare
- 16. T+10 deliver backup
- 17. T+25 DB restored
- 18. T+40 app reconfigured
- 19. T+55 smoke tests
- 20. Blank separator
- 21. Results header
- 22. RPO achieved vs target
- 23. RTO achieved vs target
- 24. Issues found
- 25. Actions to close gaps
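The Results section of the report boils down to comparing achieved values against targets and tracking follow-up actions. A minimal sketch in Python (the field names are my own, not a standard report format; the values mirror the template above):

```python
# Hypothetical drill record mirroring the report template above
drill = {
    "rpo_target_min": 15, "rpo_achieved_min": 10,
    "rto_target_min": 120, "rto_achieved_min": 55,
    "issues": ["DNS TTLs too high", "runbook missing a cert step"],
    "actions": ["shorten DNS TTLs", "update runbook", "re-drill in Q3"],
}

rpo_met = drill["rpo_achieved_min"] <= drill["rpo_target_min"]
rto_met = drill["rto_achieved_min"] <= drill["rto_target_min"]
# Open issues don't fail a drill by themselves, but every issue
# should have at least one action item tracking it to closure.
issues_tracked = len(drill["actions"]) >= len(drill["issues"])
passed = rpo_met and rto_met and issues_tracked

print(f"RPO met: {rpo_met}, RTO met: {rto_met}, drill passed: {passed}")
# → RPO met: True, RTO met: True, drill passed: True
```

Structuring the report as data like this makes it easy to trend achieved RPO/RTO across quarterly drills, which is exactly the evidence auditors ask for.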
Spot the bug
Company has an RTO target of 1 hour for core banking, but no DR site, no runbook, and backups are on a USB drive in the same building.
Need a hint?
Why is the stated RTO meaningless here?
Show answer
Without a DR architecture, recent tested offsite backups, and a documented runbook, a 1-hour RTO is aspirational, not achievable. Fix: a geographically separated DR site (cloud or second datacenter), 3-2-1 backups with immutability, dependency-mapped runbooks, at least annual drills, and monitoring that can detect failures fast.
Explain like I'm 5
If your phone dies, how much do you mind losing (RPO) and how long can you wait before getting a new one (RTO)? DR is that question, but for a whole company — and a drill proves you know the answer.
Fun fact
The Bangladesh central bank’s directives and modern ICT-risk frameworks explicitly expect banks to plan, document, test, and report DR exercises. Many banks now run at least annual drills with evidence retained — a trend mirrored across global financial regulators.
Hands-on challenge
Write a one-page DR drill plan for a small fictional company: 3 apps, 1 DB, 1 file server. Include RPO, RTO, dependencies, runbook steps, roles, and success criteria.
More resources
- NIST SP 800-34 Contingency Planning Guide (NIST)
- ISO/IEC 27031 business continuity (ISO)
- DR planning for SMB/Enterprise (Lawrence Systems)