Production Debugging, Incident Response & Root Cause Analysis
Triaging crashes, reading stack traces, memory leaks, rollback strategies, and post-mortems
What is it?
Production debugging combines crash triage, stack trace analysis, memory leak detection, git bisect, rollback strategies, and structured incident response processes to minimize user impact and prevent recurrence of production issues.
Real-world relevance
A fintech payment app sees a P0 crash spike after a release: the crash-free rate drops to 97%. Crashlytics shows a null-check error in PaymentScreen, affecting Android 12 users. Custom keys reveal that the affected users all have saved cards. Git bisect identifies the introducing commit in 10 minutes. A Remote Config kill switch disables the saved-card flow immediately. The fix ships within 2 hours. A post-mortem adds contract tests to prevent similar API contract breaks.
Key points
- Crashlytics triage workflow — Step 1: Monitor crash-free user rate in Firebase Console (alert if it drops below 99.5%). Step 2: Sort crashes by 'Impacted users' descending — fix high-impact crashes first. Step 3: Read the crash report: exception type, stack trace, OS version, device model, custom keys. Step 4: Check if custom keys (userId, screen, featureFlag) correlate with a specific cohort. Step 5: Reproduce locally.
- Reading Flutter stack traces: Flutter stack traces differ by build mode. Debug mode shows full Dart frames with file and line. Release builds compiled with --obfuscate and --split-debug-info show obfuscated frames; decode them with flutter symbolize and the matching symbols file. Key frames: find the first frame in YOUR code (not the framework), then read the frames listed below it, which are its callers, to find what triggered the call.
- Reproducing production issues — Steps: (1) Identify the OS version, device model, and Flutter version from the crash report. (2) Check custom keys — what screen was the user on? (3) Review recent commits for changes to that code path. (4) Use git bisect to find the introducing commit if the regression is confirmed. (5) Write a test that reproduces the crash before fixing it.
- ANR (Application Not Responding) analysis: on Android, an ANR is raised when the main thread stays blocked for about 5 seconds while input is pending. Crashlytics captures ANR traces. Read the thread dump and identify the main thread's top stack frame: is it waiting on a lock, doing I/O, or running computation? Common causes in Flutter apps: heavy synchronous plugin initialization during startup, long-running work inside a method channel handler (handlers run on the platform main thread by default), or CPU-heavy Dart work such as image processing that is not moved to a background isolate with compute().
- Memory leak detection workflow: Step 1: Run the app in debug mode and navigate the suspected flow multiple times. Step 2: Open the DevTools Memory tab and take heap snapshots before and after. Step 3: Diff the snapshots and look for object counts that keep growing. Step 4: Enable leak tracking with the leak_tracker package (the tooling Flutter's own tests use; debug and test builds only), which reports objects that were never disposed or never garbage collected. Step 5: Check for uncancelled StreamSubscriptions, undisposed controllers, and timers.
- Common Flutter memory leaks: 1. StreamSubscription not cancelled in dispose(). 2. AnimationController, TextEditingController, or ScrollController not disposed. 3. Timer.periodic not cancelled. 4. GlobalKey held in a static variable. 5. Image.network loading large images without cacheWidth/cacheHeight (not a leak in the strict sense, but the full-resolution bitmap is decoded into memory). 6. setState called after dispose: a sign that an async callback still holds a reference to the unmounted State.
- Git bisect for regression finding: git bisect start; git bisect bad HEAD; git bisect good v2.3.0 (the last known good release). Git checks out the midpoint commit. Test it: does the crash reproduce? Mark the commit with git bisect good or git bisect bad, and repeat until Git reports the first bad commit. To automate the search: git bisect run flutter test test/regression_test.dart (a zero exit code marks the commit good, a failing test marks it bad). Finish with git bisect reset to return to the original branch.
- Rollback strategies: Hot fix: fix the bug and ship immediately (request an expedited store review for critical crashes where the store offers one). Feature flag rollback: if the crashing code path is gated by a Remote Config flag, disable the flag immediately; no release is required. In-app kill switch: a Remote Config boolean that disables a feature and shows a maintenance message. Code rollback: revert the bad commit on the backend/API if the issue is server-side.
- Incident communication: Severity levels: P0 (app unusable for >1% of users), P1 (major feature broken), P2 (minor feature degraded). P0 response: acknowledge within 5 minutes, post a status update within 30 minutes, and share a resolution ETA. Communication channels: status page update, in-app banner via Remote Config, team Slack incident channel. Never leave users without updates during an outage.
- Post-mortem process — Blameless post-mortem: focus on systems, not people. Document: timeline (when detected, when fixed), root cause (5-Whys analysis), contributing factors, customer impact (users affected × duration), immediate fix, and action items. Action items must have owners and deadlines. Share with the team — post-mortems are learning opportunities, not blame sessions.
- 5-Whys root cause analysis: Example: the app crashes on the payment screen. Why? → A null-check error on user.paymentMethod. Why was it null? → The API returned 200 without the expected paymentMethod field. Why? → A backend migration renamed the field. Why wasn't this caught? → No integration test covered this field. Why? → Integration tests weren't required for backend API changes. Root cause: missing contract testing between the Flutter app and the backend API.
- Preventing regressions: After every production bug, write a test that would have caught it. For null-safety issues: enable the analyzer's strict language modes (strict-casts, strict-inference, strict-raw-types) in analysis_options.yaml. For API contract breaks: implement consumer-driven contract tests (Pact). For performance regressions: add performance benchmarks to CI. For crash rate monitoring: set up Crashlytics alerts so that a regression is flagged before it reaches 1% of users.
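The 5-Whys example and the contract-testing advice in the key points can be sketched in plain Dart. This is a minimal illustration, not the app's real model: `User`, `PaymentMethod`, `ContractViolation`, and the `paymentMethod` JSON shape are hypothetical names chosen to match the example. The parser turns a missing or renamed field into a descriptive error at the API boundary instead of a null dereference deep in the UI.

```dart
/// Thrown when an API payload does not match the shape the app expects.
class ContractViolation implements Exception {
  ContractViolation(this.message);
  final String message;

  @override
  String toString() => 'ContractViolation: $message';
}

/// Hypothetical saved-card model.
class PaymentMethod {
  const PaymentMethod({required this.last4});
  final String last4;
}

/// Hypothetical user model; `paymentMethod` is null when no card is saved.
class User {
  const User({required this.id, this.paymentMethod});
  final String id;
  final PaymentMethod? paymentMethod;

  factory User.fromJson(Map<String, dynamic> json) {
    final id = json['id'];
    if (id is! String) {
      throw ContractViolation('"id" must be a string');
    }
    // The key must be present even when its value is null. A renamed
    // field (for example "payment_method") fails here, loudly.
    if (!json.containsKey('paymentMethod')) {
      throw ContractViolation('"paymentMethod" key is missing');
    }
    final pm = json['paymentMethod'];
    if (pm == null) return User(id: id);
    if (pm is! Map<String, dynamic> || pm['last4'] is! String) {
      throw ContractViolation('"paymentMethod.last4" must be a string');
    }
    return User(
      id: id,
      paymentMethod: PaymentMethod(last4: pm['last4'] as String),
    );
  }
}
```

Wrapped in a test file, a check like this is also the kind of regression test that `git bisect run` can execute against each candidate commit.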
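For the ANR point in the key points, here is a hedged sketch of moving CPU-heavy work off the calling isolate. It uses `Isolate.run` from `dart:isolate` (available since Dart 2.19, not supported on the web) rather than Flutter's `compute()`, so it runs without Flutter. `checksum` and `checksumInBackground` are hypothetical names standing in for heavier work such as image processing.

```dart
import 'dart:isolate';

/// CPU-bound stand-in for heavy work such as image processing (hypothetical).
int checksum(List<int> bytes) {
  var hash = 0;
  for (final b in bytes) {
    hash = (hash * 31 + b) & 0x7fffffff;
  }
  return hash;
}

/// Runs [checksum] on a short-lived background isolate so the calling
/// isolate (the UI isolate in a Flutter app) stays free to render frames.
/// The captured list is copied into the new isolate.
Future<int> checksumInBackground(List<int> bytes) {
  return Isolate.run(() => checksum(bytes));
}
```

A caller would simply `await checksumInBackground(data)`. The trade-off is the cost of spawning the isolate and copying the input, so this pays off only for work that would otherwise block for many milliseconds.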
Code example
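// Production debugging toolkit
// Imports assumed by the snippets below (provider for context.read,
// package_info_plus for PackageInfo)
import 'dart:async';

import 'package:firebase_crashlytics/firebase_crashlytics.dart';
import 'package:firebase_remote_config/firebase_remote_config.dart';
import 'package:flutter/material.dart';
import 'package:leak_tracker_flutter_testing/leak_tracker_flutter_testing.dart';
import 'package:package_info_plus/package_info_plus.dart';
import 'package:provider/provider.dart';

// 1. Symbolize obfuscated stack trace
// Command line: flutter symbolize -i crashlytics_stack.txt -d path/to/app.android-arm64.symbols

// 2. Leak tracking for widget tests (goes in test/flutter_test_config.dart)
// API names follow the leak_tracker docs; verify against your installed version.
Future<void> testExecutable(FutureOr<void> Function() testMain) async {
  // Enable leak tracking for every testWidgets test in this directory
  LeakTesting.enable();
  await testMain();
}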
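// 3. Proper resource disposal to prevent leaks
// ChatScreen, WsService, and WsEvent are app-specific types assumed to exist.
class ChatScreenState extends State<ChatScreen>
    with SingleTickerProviderStateMixin { // required for `vsync: this`
  late StreamSubscription<WsEvent> _wsSub;
  late AnimationController _typingController;
  late TextEditingController _messageController;
  Timer? _typingTimer;

  @override
  void initState() {
    super.initState();
    _typingController = AnimationController(
      vsync: this,
      duration: const Duration(milliseconds: 600),
    )..repeat(reverse: true);
    _messageController = TextEditingController();
    _wsSub = context.read<WsService>().events.listen(_handleEvent);
  }

  @override
  void dispose() {
    _typingController.dispose(); // MUST dispose all controllers
    _messageController.dispose();
    _wsSub.cancel(); // MUST cancel subscriptions
    _typingTimer?.cancel(); // MUST cancel timers
    super.dispose();
  }

  void _handleEvent(WsEvent event) {
    // Update chat state here; check `mounted` before calling setState.
  }

  void _onTyping() {
    // Restart the 3-second idle timer on every keystroke
    _typingTimer?.cancel();
    _typingTimer = Timer(const Duration(seconds: 3), _sendTypingStop);
  }

  void _sendTypingStop() {
    // Tell the server that the user stopped typing.
  }

  @override
  Widget build(BuildContext context) {
    return TextField(
      controller: _messageController,
      onChanged: (_) => _onTyping(),
    );
  }
}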
// 4. Crashlytics context for better triage
class CrashlyticsContext {
  static Future<void> setUserContext({
    required String userId,
    required String screen,
    required String featureFlag,
  }) async {
    await FirebaseCrashlytics.instance.setUserIdentifier(userId);
    await FirebaseCrashlytics.instance.setCustomKey('screen', screen);
    await FirebaseCrashlytics.instance.setCustomKey('feature_flag', featureFlag);
    await FirebaseCrashlytics.instance.setCustomKey(
      'app_version',
      (await PackageInfo.fromPlatform()).version,
    );
  }
}
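// 5. Memory-safe image loading
// buildThumbnail is an illustrative wrapper so the snippet is valid Dart.
Widget buildThumbnail(String url) {
  return Image.network(
    url,
    cacheWidth: 400, // Decode at display size, not full resolution
    cacheHeight: 400, // Setting both forces 400x400; set only one to keep the aspect ratio
    errorBuilder: (_, error, __) => const Icon(Icons.broken_image),
  );
}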
// 6. Remote Config kill switch
class FeatureGuard extends StatelessWidget {
  final String flagKey;
  final Widget child;
  final Widget fallback;

  const FeatureGuard({
    super.key,
    required this.flagKey,
    required this.child,
    this.fallback = const SizedBox.shrink(),
  });

  @override
  Widget build(BuildContext context) {
    // getBool returns false for a key with no fetched or default value,
    // so register in-app defaults with setDefaults() at startup.
    final enabled = FirebaseRemoteConfig.instance.getBool(flagKey);
    return enabled ? child : fallback;
  }
}
Line-by-line walkthrough
- 1. Enabling leak tracking in debug or test builds turns on the leak_tracker tooling: it monitors the lifecycles of disposable objects and reports those that are never disposed or that survive past their expected disposal point.
- 2. _typingController.dispose() is called before super.dispose() — all custom resources must be released before the State is torn down.
- 3. _wsSub.cancel() in dispose() is the most commonly forgotten cleanup: the stream keeps a reference to the listener callback, which captures the State object, so an uncancelled subscription can keep the State and everything it references from being garbage collected.
- 4. _typingTimer?.cancel() uses null-safe call — the timer may not have been created yet if the user never typed, so null check is necessary.
- 5. CrashlyticsContext.setUserContext should be called on each screen navigation and after login — custom keys give you filtering power in the Crashlytics dashboard (e.g., 'show me only crashes on PaymentScreen for users with the new-checkout flag').
- 6. setCustomKey('app_version', ...) supplements Crashlytics' built-in version tracking — useful when you need to filter by semantic version in custom queries.
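- 7. Image.network with cacheWidth/cacheHeight decodes the image into a smaller bitmap: without these, a 4000x3000 JPEG is decoded into a ~46MB buffer (4000 × 3000 pixels × 4 bytes per pixel) even if it is displayed at 100x100 pixels.
- 8. FeatureGuard wraps any feature widget: when the Remote Config flag is false, it renders the fallback (SizedBox.shrink() by default, which draws nothing) instead of the feature, giving you a remote kill switch for any feature in the app. Because getBool returns false for a key that has no fetched or default value, register an in-app default of true so the feature is not hidden before the first fetch.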
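The memory figure in the image-loading point can be checked with a few lines of arithmetic. This sketch assumes 4 bytes per decoded pixel (RGBA8888), which is the usual case but not guaranteed for every format; `decodedBytes` and `toMiB` are illustrative helper names.

```dart
/// Approximate size of a decoded bitmap, assuming 4 bytes per pixel.
int decodedBytes(int width, int height) => width * height * 4;

/// Converts a byte count to mebibytes for display.
double toMiB(int bytes) => bytes / (1024 * 1024);

// decodedBytes(4000, 3000) -> 48000000 bytes, about 45.8 MiB
// decodedBytes(400, 300)   -> 480000 bytes, about 0.46 MiB
```

Decoding at 400 pixels wide instead of full resolution cuts the buffer by a factor of 100, which is why the cache size parameters matter most in long scrolling lists of images.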
Spot the bug
class ProfileScreen extends StatefulWidget {
  const ProfileScreen({super.key});

  @override
  State<ProfileScreen> createState() => _ProfileScreenState();
}

class _ProfileScreenState extends State<ProfileScreen> {
  late Future<User> _userFuture;
  StreamSubscription? _syncSub;

  @override
  void initState() {
    super.initState();
    _userFuture = context.read<UserRepo>().fetchUser();
    _syncSub = context.read<SyncService>().onSync.listen((event) {
      setState(() {
        _userFuture = context.read<UserRepo>().fetchUser();
      });
    });
  }
}
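One possible fix for the snippet above, offered as a hedged sketch rather than the official answer. The original never cancels `_syncSub`, so a sync event that arrives after the screen is popped calls setState on a disposed State and keeps that State alive; it also omits the required build() method. `User`, `UserRepo`, and `SyncService` below are hypothetical stand-ins for the app's real types, and `context.read` is assumed to come from the provider package.

```dart
import 'dart:async';

import 'package:flutter/material.dart';
import 'package:provider/provider.dart';

// Hypothetical stand-ins for the app's real types.
class User {
  const User(this.name);
  final String name;
}

class UserRepo {
  Future<User> fetchUser() async => const User('demo');
}

class SyncService {
  final _controller = StreamController<void>.broadcast();
  Stream<void> get onSync => _controller.stream;
  bool get hasListener => _controller.hasListener;
}

class ProfileScreen extends StatefulWidget {
  const ProfileScreen({super.key});

  @override
  State<ProfileScreen> createState() => _ProfileScreenState();
}

class _ProfileScreenState extends State<ProfileScreen> {
  late final UserRepo _repo;
  late Future<User> _userFuture;
  StreamSubscription<void>? _syncSub;

  @override
  void initState() {
    super.initState();
    // Resolve the repository once instead of reading context in the callback.
    _repo = context.read<UserRepo>();
    _userFuture = _repo.fetchUser();
    _syncSub = context.read<SyncService>().onSync.listen((_) {
      if (!mounted) return; // guard against events delivered during teardown
      setState(() {
        _userFuture = _repo.fetchUser();
      });
    });
  }

  @override
  void dispose() {
    _syncSub?.cancel(); // the original never cancelled the subscription
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return FutureBuilder<User>(
      future: _userFuture,
      builder: (context, snapshot) {
        if (snapshot.hasError) return const Text('Failed to load profile');
        if (!snapshot.hasData) return const CircularProgressIndicator();
        return Text(snapshot.data!.name);
      },
    );
  }
}
```

Cancelling in dispose() is the essential change; the `mounted` check is a second line of defence for events that are already in flight when the widget is removed.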
More resources
- Firebase Crashlytics Flutter documentation (FlutterFire)
- Flutter DevTools Memory documentation (Flutter Docs)
- Google SRE Book — Postmortem Culture (Google SRE)
- git bisect documentation (Git Docs)
- Flutter performance profiling (Flutter Docs)