Lesson 61 of 83 · Advanced

Production Debugging, Incident Response & Crash Triage

From alert to fix — the senior engineer's playbook for production fires


Real-world analogy

Think of production debugging like being an ER surgeon. The crash report is your patient's vitals. You don't panic — you triage, stabilize, diagnose, fix, and then write a post-mortem so the same emergency never kills anyone again.

What is it?

Production debugging is the discipline of identifying, triaging, reproducing, and resolving failures in shipped software with minimal user impact and maximum speed. It combines tool mastery (Crashlytics, LeakCanary, adb) with systematic methodology (bisect, staged rollouts, post-mortems) and communication skills that separate senior engineers from juniors.

Real-world relevance

At FieldBuzz, when a critical sync crash hit 12% of field officers on Android 10 devices after a release, the triage process was: Crashlytics showed a NullPointerException in the offline sync worker; the mapping file revealed it was in a DAO query path; git bisect identified a Room schema migration commit; the fix was a hotfix release at 5% rollout; a post-mortem led to adding migration unit tests to the CI pipeline.
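The 5% staged rollout in this story was handled by the Play Console, but the idea behind it — deterministic user bucketing — is worth internalizing. Here is a toy sketch in plain Kotlin (`inRollout` is an invented helper for illustration, not a Play API):

```kotlin
// Toy sketch of percentage bucketing: hash the user ID into a stable 0..99 bucket,
// enroll the user if the bucket falls below the rollout percentage. The same user
// always gets the same answer, so raising 5% -> 20% only ever *adds* users.
fun inRollout(userId: String, percent: Int): Boolean {
    val bucket = ((userId.hashCode() % 100) + 100) % 100  // stable value in 0..99
    return bucket < percent
}

fun main() {
    println(inRollout("officer-42", 0))    // false: 0% enrolls no one
    println(inRollout("officer-42", 100))  // true:  100% enrolls everyone
    // Deterministic: same user, same percent, same answer on every call.
    println(inRollout("officer-42", 5) == inRollout("officer-42", 5))  // true
}
```

Because the bucket is stable per user, the 5% cohort that saw the hotfix first stays enrolled as the rollout widens.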

Code example

// LeakCanary setup (debug only — build.gradle)
// debugImplementation 'com.squareup.leakcanary:leakcanary-android:2.12'
// No code needed — LeakCanary auto-installs via ContentProvider

// StrictMode setup in Application.onCreate()
class MyApp : Application() {
    override fun onCreate() {
        super.onCreate()
        if (BuildConfig.DEBUG) {
            StrictMode.setThreadPolicy(
                StrictMode.ThreadPolicy.Builder()
                    .detectDiskReads()
                    .detectDiskWrites()
                    .detectNetwork()
                    .penaltyLog()
                    .penaltyDialog()
                    .build()
            )
            StrictMode.setVmPolicy(
                StrictMode.VmPolicy.Builder()
                    .detectLeakedSqlLiteObjects()
                    .detectLeakedClosableObjects()
                    .detectActivityLeaks()
                    .penaltyLog()
                    .build()
            )
        }
    }
}

// Crashlytics custom keys for breadcrumbs
fun logSyncAttempt(userId: String, recordCount: Int) {
    Firebase.crashlytics.setCustomKey("last_sync_user", userId)
    Firebase.crashlytics.setCustomKey("last_sync_count", recordCount)
    Firebase.crashlytics.log("Sync started: $recordCount records")
}

// git bisect shell commands (not Kotlin — shown as comments)
// git bisect start
// git bisect bad HEAD
// git bisect good v2.3.1
// [git checks out midpoint — test the app]
// git bisect good    OR    git bisect bad
// [repeat until git prints the offending commit]
// git bisect reset
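To make the binary search concrete, here is a toy Kotlin model of what git bisect automates (assuming, as bisect itself does, that commits go from oldest to newest and the build flips from good to bad exactly once):

```kotlin
// Toy model of git bisect: commits indexed 0..count-1, oldest to newest.
// `isBad` plays the role of you testing each checkout and answering good/bad.
fun firstBadCommit(count: Int, isBad: (Int) -> Boolean): Int {
    var lastGood = -1         // like `git bisect good v2.3.1` (before commit 0)
    var firstBad = count - 1  // like `git bisect bad HEAD`
    while (firstBad - lastGood > 1) {
        val mid = (lastGood + firstBad) / 2  // git checks out this midpoint for you
        if (isBad(mid)) firstBad = mid else lastGood = mid
    }
    return firstBad
}

fun main() {
    // 100 commits, regression introduced at commit 73: found in ~7 tests, not 100.
    println(firstBadCommit(100) { it >= 73 })  // 73
}
```

The logarithmic step count is the whole point: a release with hundreds of commits narrows to one offender in under ten builds.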

Line-by-line walkthrough

  1. LeakCanary requires zero initialization code — it installs itself via a ContentProvider that runs before Application.onCreate(). Adding the dependency under debugImplementation is enough.
  2. StrictMode is activated in Application.onCreate() inside a BuildConfig.DEBUG check — it must never run in release builds, as the dialog penalties would be shown to real users.
  3. ThreadPolicy.detectDiskReads/Writes/Network catches the most common ANR causes — these operations block the main thread and should always run on Dispatchers.IO.
  4. VmPolicy.detectLeakedSqlLiteObjects catches unclosed Cursor objects — a classic Android memory and file-descriptor leak that stays invisible until it causes 'too many open files' crashes.
  5. Firebase.crashlytics.setCustomKey stores key-value pairs that appear alongside a crash report — set these proactively to capture user ID, feature flags, and sync state before a crash happens.
  6. Firebase.crashlytics.log adds breadcrumb messages to the crash report — they appear in the 'Logs' tab and give you a timeline of what happened before the crash.
  7. The git bisect commands, shown as comments, demonstrate the binary-search process — git automates the midpoint selection; you just mark each commit good or bad after testing.
  8. git bisect reset at the end restores your working directory to HEAD — forgetting it leaves you in a detached HEAD state in the middle of the commit history.
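The leaked-Closeable point is easy to demonstrate outside Android. In this sketch, FakeCursor is a stand-in invented for illustration (a real android.database.Cursor is likewise Closeable); it shows both the leak pattern and the `use {}` fix:

```kotlin
import java.io.Closeable

// Stand-in for an Android Cursor; like Cursor, it must be closed after use.
class FakeCursor : Closeable {
    var closed = false
        private set
    override fun close() { closed = true }
}

// Leak: the caller never closes — VmPolicy.detectLeakedClosableObjects()
// flags exactly this when the object is garbage-collected while still open.
fun queryLeaky(): FakeCursor = FakeCursor() // ...read rows, forget to close

// Fix: Kotlin's `use` closes the resource even if the block throws.
fun querySafely(): Boolean {
    val cursor = FakeCursor()
    cursor.use { /* read rows */ }
    return cursor.closed
}

fun main() {
    println(queryLeaky().closed) // false — this object would be reported as leaked
    println(querySafely())       // true  — closed deterministically by use {}
}
```

`use {}` is the Kotlin idiom for any Closeable — cursors, streams, sockets — and eliminates the whole class of leak StrictMode is hunting for.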

Spot the bug

class SyncWorker(context: Context, params: WorkerParameters)
    : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val db = Room.databaseBuilder(
            applicationContext,
            AppDatabase::class.java,
            "app_db"
        ).build()

        return try {
            val unsyncedRecords = db.recordDao().getUnsynced()
            apiService.uploadRecords(unsyncedRecords)
            Result.success()
        } catch (e: Exception) {
            Result.failure()
        }
    }
}
Need a hint?
There are two bugs: one causes a resource leak in every worker execution, and one silently swallows errors that should trigger a retry.
Show answer
Bug 1: A new Room database instance is built inside doWork() on every run. Each instance holds file handles, a thread pool, and WAL connections; building a fresh one per execution and never calling close() on it exhausts resources over time. The database should be a singleton provided by a DI container (Hilt/Koin) or a companion object, never built fresh in each worker invocation.

Bug 2: The catch block returns Result.failure() for ALL exceptions, including transient network errors (IOException, HttpException). Transient errors should return Result.retry() so WorkManager reschedules the work with exponential backoff. Only permanent errors (auth failures, data validation errors) should return Result.failure().

Fix: check the exception type and return Result.retry() for IOException and similar transient failures.
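The retry-versus-failure decision from Bug 2 can be pulled out into a small pure-Kotlin function so it is unit-testable without Android. SyncOutcome and classify() are names invented for this sketch, not WorkManager API; in doWork() you would map Retry to Result.retry() and Permanent to Result.failure():

```kotlin
import java.io.IOException

// Sketch of the Bug 2 fix: classify exceptions before choosing a WorkManager Result.
sealed class SyncOutcome {
    object Retry : SyncOutcome()      // transient: WorkManager retries with backoff
    object Permanent : SyncOutcome()  // permanent: retrying cannot help
}

fun classify(e: Exception): SyncOutcome = when (e) {
    is IOException -> SyncOutcome.Retry            // network/socket errors, timeouts
    is SecurityException -> SyncOutcome.Permanent  // e.g. auth/permission failures
    else -> SyncOutcome.Permanent                  // default: fail loudly, don't loop
}

fun main() {
    println(classify(IOException("timeout")) == SyncOutcome.Retry)              // true
    println(classify(SecurityException("bad token")) == SyncOutcome.Permanent)  // true
}
```

Defaulting unknown exceptions to Permanent is a deliberate choice here: an unbounded retry loop on a genuine bug burns battery and quota, whereas a failure surfaces in Crashlytics where you can triage it.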

Explain like I'm 5

Imagine your app is a restaurant. Sometimes something goes wrong in the kitchen — a dish takes too long (ANR), or the stove catches fire (crash). Production debugging is like being the head chef who gets a pager alert, rushes to the kitchen, reads the incident log (stack trace), figures out which cook made a mistake (git bisect), tells the manager what's happening every 10 minutes (incident comms), fixes the stove, and then writes a report so the same fire never happens again (post-mortem).

Fun fact

The famous 'Therac-25' radiation therapy machine bug in the 1980s killed patients because race conditions were only reproducible under specific timing. Modern post-mortem culture directly traces its roots to lessons learned from disasters like this — today's blameless post-mortems exist because blame-driven cultures suppressed the very information needed to fix systemic issues.

Hands-on challenge

Given a Crashlytics report showing a NullPointerException in your Room DAO's query method affecting 3% of Android 11 users after your last release, walk through the full triage: (1) What custom keys would you have set pre-emptively to help diagnose this? (2) How would you use git bisect to confirm this is a regression? (3) What's your rollout strategy for the fix? (4) What goes in the post-mortem action items to prevent recurrence?
