Lesson 61 of 83 · Advanced

Production Debugging, Incident Response & Crash Triage

From alert to fix — the senior engineer's playbook for production fires


Real-world analogy

Think of production debugging like being an ER surgeon. The crash report is your patient's vitals. You don't panic — you triage, stabilize, diagnose, fix, and then write a post-mortem so the same emergency never kills anyone again.

What is it?

Production debugging is the discipline of identifying, triaging, reproducing, and resolving failures in shipped software with minimal user impact and maximum speed. It combines tool mastery (Crashlytics, LeakCanary, adb) with systematic methodology (bisect, staged rollouts, post-mortems) and communication skills that separate senior engineers from juniors.

Real-world relevance

At FieldBuzz, when a critical sync crash hit 12% of field officers on Android 10 devices after a release, the triage process was: Crashlytics showed a NullPointerException in the offline sync worker; the mapping file revealed it was in a DAO query path; git bisect identified a Room schema migration commit; the fix was a hotfix release at 5% rollout; a post-mortem led to adding migration unit tests to the CI pipeline.
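The 5% staged rollout in this story was handled by the Play Console, but the idea behind it — deterministic user bucketing — is worth internalizing. Here is a toy sketch in plain Kotlin (`inRollout` is an invented helper for illustration, not a Play API):

```kotlin
// Toy sketch of percentage bucketing: hash the user ID into a stable 0..99 bucket,
// enroll the user if the bucket falls below the rollout percentage. The same user
// always gets the same answer, so raising 5% -> 20% only ever *adds* users.
fun inRollout(userId: String, percent: Int): Boolean {
    val bucket = ((userId.hashCode() % 100) + 100) % 100  // stable value in 0..99
    return bucket < percent
}

fun main() {
    println(inRollout("officer-42", 0))    // false: 0% enrolls no one
    println(inRollout("officer-42", 100))  // true:  100% enrolls everyone
    // Deterministic: same user, same percent, same answer on every call.
    println(inRollout("officer-42", 5) == inRollout("officer-42", 5))  // true
}
```

Because the bucket is stable per user, the 5% cohort that saw the hotfix first stays enrolled as the rollout widens.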

Code example

// LeakCanary setup (debug only — build.gradle)
// debugImplementation 'com.squareup.leakcanary:leakcanary-android:2.12'
// No code needed — LeakCanary auto-installs via ContentProvider

// StrictMode setup in Application.onCreate()
class MyApp : Application() {
    override fun onCreate() {
        super.onCreate()
        if (BuildConfig.DEBUG) {
            StrictMode.setThreadPolicy(
                StrictMode.ThreadPolicy.Builder()
                    .detectDiskReads()
                    .detectDiskWrites()
                    .detectNetwork()
                    .penaltyLog()
                    .penaltyDialog()
                    .build()
            )
            StrictMode.setVmPolicy(
                StrictMode.VmPolicy.Builder()
                    .detectLeakedSqlLiteObjects()
                    .detectLeakedClosableObjects()
                    .detectActivityLeaks()
                    .penaltyLog()
                    .build()
            )
        }
    }
}

// Crashlytics custom keys for breadcrumbs
fun logSyncAttempt(userId: String, recordCount: Int) {
    Firebase.crashlytics.setCustomKey("last_sync_user", userId)
    Firebase.crashlytics.setCustomKey("last_sync_count", recordCount)
    Firebase.crashlytics.log("Sync started: $recordCount records")
}

// git bisect shell commands (not Kotlin — shown as comments)
// git bisect start
// git bisect bad HEAD
// git bisect good v2.3.1
// [git checks out midpoint — test the app]
// git bisect good    OR    git bisect bad
// [repeat until git prints the offending commit]
// git bisect reset
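To make the binary search concrete, here is a toy Kotlin model of what git bisect automates (assuming, as bisect itself does, that commits go from oldest to newest and the build flips from good to bad exactly once):

```kotlin
// Toy model of git bisect: commits indexed 0..count-1, oldest to newest.
// `isBad` plays the role of you testing each checkout and answering good/bad.
fun firstBadCommit(count: Int, isBad: (Int) -> Boolean): Int {
    var lastGood = -1         // like `git bisect good v2.3.1` (before commit 0)
    var firstBad = count - 1  // like `git bisect bad HEAD`
    while (firstBad - lastGood > 1) {
        val mid = (lastGood + firstBad) / 2  // git checks out this midpoint for you
        if (isBad(mid)) firstBad = mid else lastGood = mid
    }
    return firstBad
}

fun main() {
    // 100 commits, regression introduced at commit 73: found in ~7 tests, not 100.
    println(firstBadCommit(100) { it >= 73 })  // 73
}
```

The logarithmic step count is the whole point: a release with hundreds of commits narrows to one offender in under ten builds.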

Line-by-line walkthrough

  1. LeakCanary requires zero initialization code — it installs itself via a ContentProvider that runs before Application.onCreate(). Adding the dependency under debugImplementation is enough.
  2. StrictMode is activated in Application.onCreate() inside a BuildConfig.DEBUG check — it must never run in release builds, as the dialog penalties would be shown to real users.
  3. ThreadPolicy.detectDiskReads/Writes/Network catches the most common ANR causes — these operations block the main thread and should always run on Dispatchers.IO.
  4. VmPolicy.detectLeakedSqlLiteObjects catches unclosed Cursor objects — a classic Android memory and file-descriptor leak that stays invisible until it causes 'too many open files' crashes.
  5. Firebase.crashlytics.setCustomKey stores key-value pairs that appear alongside a crash report — set these proactively to capture user ID, feature flags, and sync state before a crash happens.
  6. Firebase.crashlytics.log adds breadcrumb messages to the crash report — they appear in the 'Logs' tab and give you a timeline of what happened before the crash.
  7. The git bisect commands, shown as comments, demonstrate the binary-search process — git automates the midpoint selection; you just mark each commit good or bad after testing.
  8. git bisect reset at the end restores your working directory to HEAD — forgetting it leaves you in a detached HEAD state in the middle of the commit history.
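The leaked-Closeable point is easy to demonstrate outside Android. In this sketch, FakeCursor is a stand-in invented for illustration (a real android.database.Cursor is likewise Closeable); it shows both the leak pattern and the `use {}` fix:

```kotlin
import java.io.Closeable

// Stand-in for an Android Cursor; like Cursor, it must be closed after use.
class FakeCursor : Closeable {
    var closed = false
        private set
    override fun close() { closed = true }
}

// Leak: the caller never closes — VmPolicy.detectLeakedClosableObjects()
// flags exactly this when the object is garbage-collected while still open.
fun queryLeaky(): FakeCursor = FakeCursor() // ...read rows, forget to close

// Fix: Kotlin's `use` closes the resource even if the block throws.
fun querySafely(): Boolean {
    val cursor = FakeCursor()
    cursor.use { /* read rows */ }
    return cursor.closed
}

fun main() {
    println(queryLeaky().closed) // false — this object would be reported as leaked
    println(querySafely())       // true  — closed deterministically by use {}
}
```

`use {}` is the Kotlin idiom for any Closeable — cursors, streams, sockets — and eliminates the whole class of leak StrictMode is hunting for.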

Spot the bug

class SyncWorker(context: Context, params: WorkerParameters)
    : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val db = Room.databaseBuilder(
            applicationContext,
            AppDatabase::class.java,
            "app_db"
        ).build()

        return try {
            val unsyncedRecords = db.recordDao().getUnsynced()
            apiService.uploadRecords(unsyncedRecords)
            Result.success()
        } catch (e: Exception) {
            Result.failure()
        }
    }
}
Need a hint?
There are two bugs: one causes a resource leak in every worker execution, and one silently swallows errors that should trigger a retry.
Show answer
Bug 1: A new Room database instance is built inside doWork() on every run. Each instance holds file handles, a thread pool, and WAL connections; building a fresh one per execution and never calling close() on it exhausts resources over time. The database should be a singleton provided by a DI container (Hilt/Koin) or a companion object, never built fresh in each worker invocation.

Bug 2: The catch block returns Result.failure() for ALL exceptions, including transient network errors (IOException, HttpException). Transient errors should return Result.retry() so WorkManager reschedules the work with exponential backoff. Only permanent errors (auth failures, data validation errors) should return Result.failure().

Fix: check the exception type and return Result.retry() for IOException and similar transient failures.
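The retry-versus-failure decision from Bug 2 can be pulled out into a small pure-Kotlin function so it is unit-testable without Android. SyncOutcome and classify() are names invented for this sketch, not WorkManager API; in doWork() you would map Retry to Result.retry() and Permanent to Result.failure():

```kotlin
import java.io.IOException

// Sketch of the Bug 2 fix: classify exceptions before choosing a WorkManager Result.
sealed class SyncOutcome {
    object Retry : SyncOutcome()      // transient: WorkManager retries with backoff
    object Permanent : SyncOutcome()  // permanent: retrying cannot help
}

fun classify(e: Exception): SyncOutcome = when (e) {
    is IOException -> SyncOutcome.Retry            // network/socket errors, timeouts
    is SecurityException -> SyncOutcome.Permanent  // e.g. auth/permission failures
    else -> SyncOutcome.Permanent                  // default: fail loudly, don't loop
}

fun main() {
    println(classify(IOException("timeout")) == SyncOutcome.Retry)              // true
    println(classify(SecurityException("bad token")) == SyncOutcome.Permanent)  // true
}
```

Defaulting unknown exceptions to Permanent is a deliberate choice here: an unbounded retry loop on a genuine bug burns battery and quota, whereas a failure surfaces in Crashlytics where you can triage it.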

Explain like I'm 5

Imagine your app is a restaurant. Sometimes something goes wrong in the kitchen — a dish takes too long (ANR), or the stove catches fire (crash). Production debugging is like being the head chef who gets a pager alert, rushes to the kitchen, reads the incident log (stack trace), figures out which cook made a mistake (git bisect), tells the manager what's happening every 10 minutes (incident comms), fixes the stove, and then writes a report so the same fire never happens again (post-mortem).

Fun fact

The famous 'Therac-25' radiation therapy machine bug in the 1980s killed patients because race conditions were only reproducible under specific timing. Modern post-mortem culture directly traces its roots to lessons learned from disasters like this — today's blameless post-mortems exist because blame-driven cultures suppressed the very information needed to fix systemic issues.

Hands-on challenge

Given a Crashlytics report showing a NullPointerException in your Room DAO's query method affecting 3% of Android 11 users after your last release, walk through the full triage: (1) What custom keys would you have set pre-emptively to help diagnose this? (2) How would you use git bisect to confirm this is a regression? (3) What's your rollout strategy for the fix? (4) What goes in the post-mortem action items to prevent recurrence?
