Application Dependency Mapping
Apps fail in layers, not magic
Real-world analogy
Every app is a kitchen line: front-of-house takes the order, the kitchen cooks, the dishwasher keeps up, the fridge holds ingredients. If the fridge is empty, the customer thinks ‘the restaurant is broken’ — but you find the real culprit by walking the line.
What is it?
App dependency mapping is the operational discipline that turns ‘app is slow’ into ‘service X is slow because dependency Y timed out.’ Once you think this way, every outage becomes solvable, not mystical.
Real-world relevance
A CRM page loads but saves fail. Dependency map shows CRM → API → identity → DB → storage. Identity is fine. DB healthy. Storage returns 503s for some regions — a cloud provider incident. You communicate, wait, validate recovery. Zero wasted escalations.
Key points
- The 5-layer mental model — User → app frontend → app backend/API → supporting services (cache, queue, DB, file store, identity) → infra (OS, network, cloud). A failure usually lives in one layer; your job is to find it.
- Health endpoints — Most apps expose /health or /status. Use them. Also ping DBs, hit cache, check queues, verify identity providers separately. A ‘green frontend’ can hide a red dependency.
- The right first 5 questions — (1) Who is affected (one user, many, all)? (2) When did it start? (3) What changed? (4) Which dependency is failing? (5) What’s the error code / trace ID? Good answers prevent wasted hours.
- Logs, metrics, traces — Logs: what happened. Metrics: numbers over time. Traces: a request’s journey across services. All three together beat any one alone.
- Dependency diagrams — A one-page diagram for each app showing its dependencies saves the world during outages. Label external parties, identity providers, DNS, storage, secrets stores.
- Secrets and certificates — Apps fail spectacularly when a secret rotates unexpectedly or a certificate expires. Know where they live, their renewal cadence, and their owner. ‘Certs expired at midnight’ has caused many global outages.
- Tenants, regions, zones — Many SaaS apps have multi-tenant, multi-region architectures. An outage may affect only your tenant or only one region. Before declaring ‘it’s broken,’ establish the scope.
- Talking across teams — Junior IT is often the bridge between end users and multiple specialist teams (app, DB, network, cloud, vendor). Clear dependency language makes you the go-to person.
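The layered model above can be sketched as a top-to-bottom health walk: probe each layer in order and stop at the first failure. This is a minimal sketch with simulated checks standing in for real probes (HTTP /health, DB ping, cache GET); the layer names and results are illustrative, not real endpoints.

```python
# Walk the dependency layers top-down and report the first failing one.
# Each entry is (layer name, check function); the checks here are
# simulated stand-ins for real probes -- illustrative only.

def find_failing_layer(layers):
    """Return the name of the first failing layer, or None if all are healthy."""
    for name, check in layers:
        try:
            if not check():
                return name
        except Exception:
            # A probe that raises (timeout, refused connection) also counts as failing.
            return name
    return None

# Simulated probes: everything healthy except storage (e.g. a 503 from one region).
layers = [
    ("frontend", lambda: True),
    ("backend/API", lambda: True),
    ("identity", lambda: True),
    ("database", lambda: True),
    ("storage", lambda: False),  # the real culprit
]

print(find_failing_layer(layers))  # -> storage
```

The same loop is what you do mentally during an incident; writing it down as a script is the first step toward the monitors listed later in this page.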
Code example
// Dependency-mapping template (per app)
App name: BillingPortal
Owner team: Finance Systems
Criticality: Tier 1

External users: Customers via https://billing.contoso.com
Internal users: Finance ops team

Frontend: React SPA on CDN + WAF
Backend: REST API (container) in region A, zone 1+2

Dependencies (and failure signal):
- Identity: Entra (OIDC) -> login fails
- DB: PostgreSQL HA cluster -> 500 on save
- Cache: Redis -> slow reads
- Queue: RabbitMQ -> delayed events
- Storage: S3 bucket for invoices -> download fails
- Secrets: Key Vault -> startup crash
- Email: SMTP gateway -> no notifications
- External: Tax API -> partial failures

Monitors:
- /health endpoint per service
- synthetic user journey (login -> save -> download)
- logs + metrics + traces with correlation IDs
- on-call runbook for each failure mode
Line-by-line walkthrough
- 1. Dependency template
- 2. App name
- 3. Owner team
- 4. Criticality tier
- 5. Blank separator
- 6. External users
- 7. Internal users
- 8. Blank separator
- 9. Frontend description
- 10. Backend description
- 11. Blank separator
- 12. Dependencies list
- 13. Identity failure signal
- 14. DB failure signal
- 15. Cache failure signal
- 16. Queue failure signal
- 17. Storage failure signal
- 18. Secrets failure signal
- 19. Email failure signal
- 20. External API failure
- 21. Blank separator
- 22. Monitors
- 23. Per-service health endpoint
- 24. Synthetic user journey
- 25. Logs + metrics + traces + correlation IDs
- 26. On-call runbook per failure mode
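The "synthetic user journey" monitor from the template can be sketched as a script that runs the same steps a real user would and tags every step with one correlation ID, so a failure can be searched across logs, metrics, and traces. The step functions below are simulated placeholders; a real probe would call the actual login, save, and download endpoints.

```python
import uuid

def run_journey(steps):
    """Run each (name, fn) step under one correlation ID; stop at the first failure."""
    corr_id = uuid.uuid4().hex
    for name, step in steps:
        ok = step()
        # In a real monitor this line would go to structured logs, not stdout.
        print(f"corr={corr_id} step={name} ok={ok}")
        if not ok:
            return (corr_id, name)  # failing step, plus the ID to search logs for
    return (corr_id, None)

# Simulated steps mirroring the template's journey: login -> save -> download.
steps = [
    ("login", lambda: True),
    ("save", lambda: True),
    ("download", lambda: False),  # e.g. the invoice storage bucket returns 503
]

corr_id, failed = run_journey(steps)
print(f"failed step: {failed}")  # -> failed step: download
```

Because every step shares one correlation ID, the on-call engineer can jump straight from "download failed" to the matching backend and storage log lines.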
Spot the bug
App outage ticket: 'Everything is broken, please fix!' Junior tells the whole company via email: 'CRM down, we don’t know yet, working on it.'
Need a hint?
Which two pieces of discipline are missing?
Show answer
(1) Scope first — identify whether this affects all users, one region, or one tenant before announcing enterprise-wide. (2) Use structured comms — follow the incident comms process (Comms Lead, channel, update cadence) rather than an unreviewed company-wide email. Calm, accurate, timed updates beat panic.
Explain like I'm 5
Every big app is a team of smaller helpers. When something breaks, don’t blame the whole team — find which helper fell down, then you can fix it fast.
Fun fact
In many severe incidents, the ‘failure’ turns out to be an expired TLS certificate somewhere quiet, like an internal API or a DNS-validated domain. Certificate observability (renewal alerts + inventory) pays for itself the first time it saves a Sunday night.
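Certificate observability can start very small: parse a certificate's expiry timestamp and compute the days remaining. A minimal sketch, assuming the date format that Python's `ssl.getpeercert()` reports (`'Jun  1 12:00:00 2026 GMT'`); the sample expiry value and reference date are made up for illustration.

```python
from datetime import datetime, timezone

# ssl.getpeercert() reports expiry as e.g. 'Jun  1 12:00:00 2026 GMT'.
CERT_DATE_FMT = "%b %d %H:%M:%S %Y %Z"

def days_until_expiry(not_after, now=None):
    """Days until the certificate expires (negative means already expired)."""
    expires = datetime.strptime(not_after, CERT_DATE_FMT).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

# Hypothetical cert expiry, checked against a fixed reference date.
ref = datetime(2026, 5, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jun  1 12:00:00 2026 GMT", now=ref))  # -> 31
```

Wired into an inventory of hosts and a daily alert at, say, 30 days remaining, this tiny check is the renewal-alert half of the certificate observability described above.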
Hands-on challenge
Pick any app you use daily (email, bank app, streaming). Draw its likely dependency map: frontend, backend, identity, DB, cache, storage, external APIs. Mark how each failure would feel to a user.
More resources
- Observability basics (Honeycomb)
- OpenTelemetry intro (OpenTelemetry)
- Tracing and observability talks (YouTube search)