Production Debugging, Incident Response & Crash Triage
From alert to fix — the senior engineer's playbook for production fires
What is it?
Production debugging is the discipline of identifying, triaging, reproducing, and resolving failures in shipped software with minimal user impact and maximum speed. It combines tool mastery (Crashlytics, LeakCanary, adb) with systematic methodology (bisect, staged rollouts, post-mortems) and communication skills that separate senior engineers from juniors.
Real-world relevance
At FieldBuzz, when a critical sync crash hit 12% of field officers on Android 10 devices after a release, the triage process was: Crashlytics showed a NullPointerException in the offline sync worker; the mapping file revealed it was in a DAO query path; git bisect identified a Room schema migration commit; the fix was a hotfix release at 5% rollout; a post-mortem led to adding migration unit tests to the CI pipeline.
Key points
- Crashlytics Triage Workflow — Open Crashlytics, sort by 'Most Impacted Users'. Look at crash-free user rate trend. Identify if the spike correlates with a release version. Pivot to the specific issue, read the full stack trace, check OS version and device distribution before touching code.
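- Reading Android Stack Traces — Read the exception type and message at the top first, then scan down to the last 'Caused by' line: the outermost exception is printed first and each 'Caused by' unwraps one level, so the final one is usually the root cause. Focus on frames from your own packages before digging into framework internals (android.*). Deobfuscate release traces with the ProGuard/R8 mapping file for that exact build. Crashlytics lists the crashed thread first; note whether it is the main thread or a background thread.
- Reproducing Issues Reliably — Use Crashlytics breadcrumbs and custom keys to rebuild the state that preceded the crash. Ship a debug build with verbose logging. If the issue is device-specific, use Firebase Test Lab to run on real hardware. For race conditions, widen the timing window with deliberate delays (for example Thread.sleep in a debug build) or use a stress test harness. Never debug only in production.
- ANR (Application Not Responding) Analysis — An ANR means the main thread was blocked too long, most commonly an input event that got no response within 5 seconds. Pull the ANR traces with adb (data/anr/traces.txt on older Android versions; on recent versions the files under /data/anr are usually only reachable through adb bugreport) or read the thread dump in Crashlytics. Look for deadlocks, contended synchronized blocks, and disk or network I/O on the main thread such as SharedPreferences.commit(); StrictMode violations logged in debug builds often point at the same call sites. Fix: move the work off the main thread using coroutines or WorkManager.
- Memory Leak Detection with LeakCanary — Add LeakCanary to debug builds only. It watches objects that should become unreachable once their lifecycle ends (destroyed Activities and Fragments, cleared ViewModels); if one survives a garbage collection, LeakCanary dumps and analyzes the heap and reports the leak. Common leaks: Activity held by a singleton, anonymous Runnable posted to a Handler, Context stored in a companion object. Read the leak trace top-to-bottom, from the GC root down to the leaking object; the references underlined in red are the suspected causes of the retention.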
- git bisect for Regression Hunting — Run 'git bisect start', mark HEAD as bad, mark a known-good commit as good. Git checks out the midpoint. Test, mark good or bad. Repeat ~log2(N) times to find the exact commit that introduced the regression. Works in minutes even across hundreds of commits.
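- Rollback Strategies — First: halt the staged rollout in Play Console. This stops further distribution, but Play has no true rollback, so users who already updated keep the bad build until a new release with a higher versionCode reaches them. In parallel: flip a server-side feature flag kill switch (Firebase Remote Config, LaunchDarkly) to disable the broken path without a release. Then: ship an emergency release with the fix, or a rebuild of the last good version. Never hot-patch production APKs by downloading executable code from outside Play; it violates Play policy.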
- Staged Rollouts for Recovery — Release fixes at 1%, watch crash-free rate. Promote to 10% after 2h stability. Full rollout at 24h. If crash rate rises above threshold, pause rollout. This limits blast radius and gives signal early without impacting all users.
- Incident Communication Protocol — Acknowledge in the incident channel within 15 minutes. Post hourly status updates that name the current phase: investigating, identified, fixing, or resolved. Give an ETA even if it is rough, and state the impact scope, for example 'X% of users on Android 12+ affected'.
- Post-Mortem Process — Write within 48h of resolution. Include: timeline, root cause, impact (users/revenue), mitigation steps, action items with DRI and due dates. Blameless — focus on system failures, not individual mistakes. Distribute to stakeholders. Action items must be tracked in the backlog.
- Proactive Crash Rate Alerting — Set Crashlytics alerts for crash-free user rate dropping below 99.5%. Create custom dashboards in Firebase for key user flows. Use BigQuery export for advanced analysis (funnel correlation, device segmentation). Senior engineers set up the guardrails before the fire, not during.
- StrictMode as a Debugging Shield — Enable StrictMode in debug builds to detect disk reads and network calls on the main thread, leaked SQLite cursors, and Activity leaks. Violations caught in debug and staging builds surface likely ANR causes before they ship. StrictMode only flags violations and does not measure them, so pair it with a benchmarking tool such as Macrobenchmark when you need performance numbers.
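The staged rollout and alerting points above boil down to a small decision rule. The Kotlin sketch below is one possible reading of it, not a Play Console or Firebase API: RolloutAction and nextRolloutAction are hypothetical names. It assumes elapsed time is measured from the start of the fix rollout, and that the 99.5% crash-free user rate from the alerting point doubles as the pause threshold.

```kotlin
/** Hypothetical actions an on-call engineer could take on a staged rollout. */
enum class RolloutAction { PAUSE, HOLD, PROMOTE_TO_10, PROMOTE_TO_100 }

/**
 * Decides the next step for a rollout that is still in progress.
 *
 * @param currentPercent    share of users currently receiving the release (1, 10 or 100)
 * @param hoursSinceRelease hours since the fix started rolling out (assumed clock)
 * @param crashFreeUserRate crash-free user rate for this release, in percent
 * @param minCrashFreeRate  pause threshold in percent (assumed to be 99.5)
 */
fun nextRolloutAction(
    currentPercent: Int,
    hoursSinceRelease: Double,
    crashFreeUserRate: Double,
    minCrashFreeRate: Double = 99.5
): RolloutAction = when {
    // The crash-rate guard wins over every time-based promotion.
    crashFreeUserRate < minCrashFreeRate -> RolloutAction.PAUSE
    // 1% -> 10% once the release has been stable for 2 hours.
    currentPercent < 10 && hoursSinceRelease >= 2.0 -> RolloutAction.PROMOTE_TO_10
    // 10% -> 100% once the release has been stable for 24 hours.
    currentPercent < 100 && hoursSinceRelease >= 24.0 -> RolloutAction.PROMOTE_TO_100
    // Otherwise keep watching the dashboards.
    else -> RolloutAction.HOLD
}
```

Calling nextRolloutAction(1, 3.0, 99.9) yields PROMOTE_TO_10, while any call with a crash-free rate under the threshold yields PAUSE regardless of elapsed time. Keeping the guard as the first branch means a time-based promotion can never mask a regression.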
Code example
// LeakCanary setup (debug only, in build.gradle)
// debugImplementation 'com.squareup.leakcanary:leakcanary-android:2.12'
// No code needed: LeakCanary auto-installs via ContentProvider

// Imports assume a recent Firebase SDK; older versions expose the same
// extensions under com.google.firebase.ktx and com.google.firebase.crashlytics.ktx
import android.app.Application
import android.os.StrictMode
import com.google.firebase.Firebase
import com.google.firebase.crashlytics.crashlytics

// StrictMode setup in Application.onCreate()
class MyApp : Application() {
    override fun onCreate() {
        super.onCreate()
        if (BuildConfig.DEBUG) {
            // Flag blocking I/O on the main thread
            StrictMode.setThreadPolicy(
                StrictMode.ThreadPolicy.Builder()
                    .detectDiskReads()
                    .detectDiskWrites()
                    .detectNetwork()
                    .penaltyLog()
                    .penaltyDialog()
                    .build()
            )
            // Flag leaked resources and Activities
            StrictMode.setVmPolicy(
                StrictMode.VmPolicy.Builder()
                    .detectLeakedSqlLiteObjects()
                    .detectLeakedClosableObjects()
                    .detectActivityLeaks()
                    .penaltyLog()
                    .build()
            )
        }
    }
}

// Crashlytics custom keys for breadcrumbs
fun logSyncAttempt(userId: String, recordCount: Int) {
    Firebase.crashlytics.setCustomKey("last_sync_user", userId)
    Firebase.crashlytics.setCustomKey("last_sync_count", recordCount)
    Firebase.crashlytics.log("Sync started: $recordCount records")
}

// git bisect shell commands (not Kotlin, shown as comments)
// git bisect start
// git bisect bad HEAD
// git bisect good v2.3.1
// [git checks out the midpoint; test the app]
// git bisect good OR git bisect bad
// [repeat until git prints the offending commit]
// git bisect reset
Line-by-line walkthrough
- 1. LeakCanary requires zero initialization code — it installs itself via a ContentProvider that runs before Application.onCreate(). Just adding the dependency in debugImplementation is enough.
- 2. StrictMode is activated in Application.onCreate() inside a BuildConfig.DEBUG check — it must NEVER run in release builds as the dialog penalties would show to real users.
- 3. ThreadPolicy.detectDiskReads/Writes/Network catches the most common ANR causes — these operations block the main thread and should always run on Dispatchers.IO.
- 4. VmPolicy.detectLeakedSqlLiteObjects catches unclosed Cursor objects — a classic Android memory and file-descriptor leak that's invisible until it causes 'too many open files' crashes.
- 5. Firebase.crashlytics.setCustomKey stores key-value pairs that appear alongside a crash report — use these proactively to capture user ID, feature flags, sync state before crashes happen.
- 6. Firebase.crashlytics.log adds breadcrumb messages to the crash report — these appear in the 'Logs' tab and give you a timeline of what happened before the crash.
- 7. The git bisect commands shown as comments demonstrate the binary search process — git automates the midpoint selection, you just mark each commit good or bad after testing.
- 8. git bisect reset at the end returns you to the branch or commit you had checked out before git bisect start. Forgetting it leaves you on a detached HEAD in the middle of the commit history.
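To see why bisect needs only about log2(N) tests, the search it performs can be modeled in plain Kotlin. This is a simplified sketch and not how git implements it. It assumes a linear history ordered oldest to newest, a known-good first commit, a known-bad last commit, and that every commit after the first bad one is also bad. BisectResult and firstBadCommit are hypothetical names, and real git bisect also copes with branching history.

```kotlin
/** Outcome of the simulated bisect: where the regression starts and how many tests it took. */
data class BisectResult(val firstBadIndex: Int, val testsRun: Int)

/**
 * Binary search for the first bad commit.
 *
 * @param commits history ordered oldest -> newest; commits.first() is assumed good,
 *                commits.last() is assumed bad
 * @param isBad   the manual or scripted test you would run at each bisect step
 */
fun <T> firstBadCommit(commits: List<T>, isBad: (T) -> Boolean): BisectResult {
    require(commits.size >= 2) { "need at least one good and one bad commit" }
    var good = 0                  // newest index known to be good
    var bad = commits.lastIndex   // oldest index known to be bad
    var testsRun = 0
    while (bad - good > 1) {
        val mid = good + (bad - good) / 2   // the commit git would check out next
        testsRun++
        if (isBad(commits[mid])) bad = mid else good = mid
    }
    return BisectResult(firstBadIndex = bad, testsRun = testsRun)
}
```

For a history of 101 commits where the regression landed at index 73, firstBadCommit((0..100).toList()) { it >= 73 } returns index 73 after at most 7 tests, since each test halves the remaining range.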
Spot the bug
class SyncWorker(context: Context, params: WorkerParameters)
    : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        val db = Room.databaseBuilder(
            applicationContext,
            AppDatabase::class.java,
            "app_db"
        ).build()
        return try {
            val unsyncedRecords = db.recordDao().getUnsynced()
            apiService.uploadRecords(unsyncedRecords)
            Result.success()
        } catch (e: Exception) {
            Result.failure()
        }
    }
}
More resources
- Firebase Crashlytics Docs (Firebase)
- LeakCanary Documentation (Square)
- StrictMode API Reference (Android Developers)
- Google SRE Book: Postmortems (Google SRE)
- ANR Diagnosis Guide (Android Developers)