Lesson 59 of 77 advanced

Production Debugging, Incident Response & Root Cause Analysis

Triaging crashes, reading stack traces, memory leaks, rollback strategies, and post-mortems


Real-world analogy

Production debugging is like being a detective at a crime scene that's still happening. The clues are stack traces and logs, the witnesses are your monitoring tools, and you need to find the culprit (root cause) fast enough to stop the damage — while keeping stakeholders calm and informed throughout.

What is it?

Production debugging combines crash triage, stack trace analysis, memory leak detection, git bisect, rollback strategies, and structured incident response processes to minimize user impact and prevent recurrence of production issues.

Real-world relevance

A fintech payment app sees a P0 crash spike after a release: the crash-free rate drops to 97%. Crashlytics shows a null-dereference crash in PaymentScreen (in Dart this typically surfaces as "Null check operator used on a null value"), affecting Android 12 users. Custom keys reveal that the affected users all have saved cards. Git bisect identifies the introducing commit in 10 minutes. A Remote Config kill switch disables the saved-card flow immediately. The fix ships within 2 hours. A post-mortem adds contract tests to prevent similar API contract breaks.

Key points

Crashlytics triage workflow — Step 1: Monitor crash-free user rate in Firebase Console (alert if it drops below 99.5%). Step 2: Sort crashes by 'Impacted users' descending — fix high-impact crashes first. Step 3: Read the crash report: exception type, stack trace, OS version, device model, custom keys. Step 4: Check if custom keys (userId, screen, featureFlag) correlate with a specific cohort. Step 5: Reproduce locally.
Reading Flutter stack traces — Flutter stack traces differ by build mode. Debug mode shows full Dart frames with file and line. Profile/release mode may show obfuscated frames. If you shipped with --obfuscate and --split-debug-info, use flutter symbolize with the matching symbols file to decode them. To read a trace, find the first frame in YOUR code (not the framework), then follow its callers, which are the frames listed below it, to find the trigger.
Reproducing production issues — Steps: (1) Identify the OS version, device model, and Flutter version from the crash report. (2) Check custom keys — what screen was the user on? (3) Review recent commits for changes to that code path. (4) Use git bisect to find the introducing commit if the regression is confirmed. (5) Write a test that reproduces the crash before fixing it.
ANR (Application Not Responding) analysis — On Android, an ANR is raised when the main thread stays blocked for about 5 seconds while input is waiting. Crashlytics captures ANR traces. Read the thread dump and identify the main thread's top stack frame: is it waiting on a lock, doing I/O, or running computation? Common causes in Flutter apps: slow plugin initialization during startup, blocking work inside a native method channel handler (handlers run on the platform main thread by default), or a heavy image decode not wrapped in compute().
Memory leak detection workflow — Step 1: Run the app in debug mode and navigate the suspected flow multiple times. Step 2: Open DevTools Memory tab → take heap snapshots before and after. Step 3: Compare snapshots and look for object counts that keep growing. Step 4: Enable leak tracking with the leak_tracker package (maintained by the Dart team and used by flutter_test); it reports objects that were never disposed or were never garbage-collected after disposal. Step 5: Check for uncancelled StreamSubscriptions, undisposed controllers, and timers.
Common Flutter memory leaks — 1. StreamSubscription not cancelled in dispose(). 2. AnimationController, TextEditingController, ScrollController not disposed. 3. Timer.periodic not cancelled. 4. GlobalKey held in a static variable. 5. Image.network loading large images without cacheWidth/cacheHeight constraints (loads full resolution into memory). 6. setState called after dispose (widget unmounted).
Git bisect for regression finding — git bisect start; git bisect bad HEAD; git bisect good v2.3.0 (last known good release). Git checks out the midpoint commit. Test whether the crash reproduces, then mark the commit with git bisect good or git bisect bad. Repeat until Git reports the first bad commit, then run git bisect reset to return to your original branch. To automate the search: git bisect run flutter test test/regression_test.dart (exit code 0 marks a commit good, a failing test run marks it bad).
Rollback strategies — Hot fix: fix the bug and ship immediately (for critical crashes, request an expedited App Store review; Google Play offers no equivalent request, so plan for its normal review time). Feature flag rollback: if the crashing code is gated by a Remote Config flag, disable the flag immediately, with no release required. In-app kill switch: a Remote Config boolean that disables a feature and shows a maintenance message. Code rollback: revert the bad commit on the backend/API if the issue is server-side.
Incident communication — Severity levels: P0 (app unusable for >1% of users), P1 (major feature broken), P2 (minor feature degraded). P0 response: acknowledge within 5 minutes, post a status update within 30 minutes, and share a resolution ETA. Communication channels: status page update, in-app banner via Remote Config, team Slack incident channel. Never leave users without information during an outage.
Post-mortem process — Blameless post-mortem: focus on systems, not people. Document: timeline (when detected, when fixed), root cause (5-Whys analysis), contributing factors, customer impact (users affected × duration), immediate fix, and action items. Action items must have owners and deadlines. Share with the team — post-mortems are learning opportunities, not blame sessions.
5-Whys root cause analysis — Example: App crashes on payment screen. Why? → Null check failure on user.paymentMethod ("Null check operator used on a null value"). Why was it null? → The API returned 200 with an empty paymentMethod field. Why? → A backend migration renamed the field. Why wasn't this caught? → No integration test covered this field. Why? → Integration tests weren't required for backend API changes. Root cause: missing contract testing between Flutter app and backend API.
Preventing regressions — After every production bug, write a test that would have caught it. For null-related crashes: enable the analyzer's strict modes (strict-casts, strict-inference, strict-raw-types) in analysis_options.yaml and avoid the ! operator on API data. For API contract breaks: implement consumer-driven contract tests (Pact). For performance regressions: add performance benchmarks to CI. For crash rate monitoring: set up Crashlytics alerts so a regression is caught before it reaches 1% of users.
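The triage and communication thresholds above (99.5% crash-free alert, P0 at more than 1% of users, impact as users affected × duration) can be turned into a few small helpers. This is a minimal sketch in plain Dart: the function names and the exact P1/P2 mapping are assumptions made for illustration, not part of Crashlytics or any other API.

```dart
/// Severity levels from the incident communication point.
enum Severity { p0, p1, p2 }

/// Percentage of users who did not experience a crash in the window.
double crashFreeRate({required int totalUsers, required int crashedUsers}) {
  if (totalUsers <= 0) {
    throw ArgumentError.value(totalUsers, 'totalUsers', 'must be positive');
  }
  if (crashedUsers < 0 || crashedUsers > totalUsers) {
    throw ArgumentError.value(crashedUsers, 'crashedUsers', 'out of range');
  }
  return 100.0 * (totalUsers - crashedUsers) / totalUsers;
}

/// True when the rate is below the alert threshold (99.5% in the workflow).
bool shouldAlert(double crashFreePercent, {double threshold = 99.5}) =>
    crashFreePercent < threshold;

/// Hypothetical mapping: P0 when the app is unusable for more than 1% of
/// users, P1 when a major feature is broken, otherwise P2.
Severity classify({
  required double affectedPercent,
  required bool appUnusable,
  required bool majorFeatureBroken,
}) {
  if (appUnusable && affectedPercent > 1.0) return Severity.p0;
  if (majorFeatureBroken || appUnusable) return Severity.p1;
  return Severity.p2;
}

/// Customer impact as users affected × duration, in user-minutes.
int impactUserMinutes({
  required int usersAffected,
  required Duration duration,
}) =>
    usersAffected * duration.inMinutes;
```

With 1,000 active users and 30 of them crashing, crashFreeRate returns 97.0, which is below the 99.5% threshold, so shouldAlert is true. The same outage lasting 2 hours for 1,200 users comes to 144,000 user-minutes for the post-mortem's impact line.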
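The 5-Whys example ends at a missing contract check between the app and the API. A full consumer-driven setup would use a tool such as Pact; the sketch below is a much simpler stand-in that validates the response shape on the app side, so a renamed field fails loudly in a test instead of reaching the UI as a null. The User model, its id field, and the payment_method name used in the note after the code are hypothetical, since the lesson does not show the real schema.

```dart
import 'dart:convert';

/// Hypothetical app-side model; the real field set is not shown in the lesson.
class User {
  const User({required this.id, required this.paymentMethod});

  final String id;
  final String paymentMethod;

  /// Parses a decoded JSON object and throws when the contract is broken.
  factory User.fromJson(Map<String, dynamic> json) {
    final id = json['id'];
    final paymentMethod = json['paymentMethod'];
    if (id is! String || id.isEmpty) {
      throw const FormatException(
        'Contract violation: "id" missing, empty, or not a string',
      );
    }
    if (paymentMethod is! String || paymentMethod.isEmpty) {
      throw const FormatException(
        'Contract violation: "paymentMethod" missing, empty, or not a string',
      );
    }
    return User(id: id, paymentMethod: paymentMethod);
  }
}

/// Returns null when [body] satisfies the contract, otherwise the
/// violation message. Invalid JSON is reported the same way.
String? checkUserContract(String body) {
  try {
    final decoded = jsonDecode(body);
    if (decoded is! Map<String, dynamic>) {
      return 'Contract violation: expected a JSON object';
    }
    User.fromJson(decoded);
    return null;
  } on FormatException catch (e) {
    return e.message;
  }
}
```

Running checkUserContract against a recorded API response in CI would flag the migration from the example: a body that carries payment_method instead of paymentMethod returns a violation message rather than null.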

Code example

// Production debugging toolkit

// 1. Symbolize obfuscated stack trace
// Command line: flutter symbolize -i crashlytics_stack.txt -d path/to/app.android-arm64.symbols

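// 2. Leak tracking integration (debug mode)
// The leak_tracker API has changed between releases; check these calls
// against the version pinned in your pubspec.
import 'package:flutter/foundation.dart';
import 'package:leak_tracker/leak_tracker.dart';

void main() {
  // Enable leak tracking in debug builds only
  if (kDebugMode) {
    LeakTracking.start();
    // Forward Flutter's object creation/disposal events to the tracker
    FlutterMemoryAllocations.instance.addListener(
      (ObjectEvent event) => LeakTracking.dispatchObjectEvent(event.toMap()),
    );
  }
  runApp(const App());
}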

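// 3. Proper resource disposal to prevent leaks
// (build(), _handleEvent and _sendTypingStop are omitted for brevity)
class ChatScreenState extends State<ChatScreen>
    with SingleTickerProviderStateMixin { // required for vsync: this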
  late StreamSubscription<WsEvent> _wsSub;
  late AnimationController _typingController;
  late TextEditingController _messageController;
  Timer? _typingTimer;

  @override
  void initState() {
    super.initState();
    _typingController = AnimationController(
      vsync: this,
      duration: const Duration(milliseconds: 600),
    )..repeat(reverse: true);
    _messageController = TextEditingController();
    _wsSub = context.read<WsService>().events.listen(_handleEvent);
  }

  @override
  void dispose() {
    _typingController.dispose(); // MUST dispose all controllers
    _messageController.dispose();
    _wsSub.cancel(); // MUST cancel subscriptions
    _typingTimer?.cancel(); // MUST cancel timers
    super.dispose();
  }

  void _onTyping() {
    _typingTimer?.cancel();
    _typingTimer = Timer(const Duration(seconds: 3), _sendTypingStop);
  }
}

// 4. Crashlytics context for better triage
class CrashlyticsContext {
  static Future<void> setUserContext({
    required String userId,
    required String screen,
    required String featureFlag,
  }) async {
    await FirebaseCrashlytics.instance.setUserIdentifier(userId);
    await FirebaseCrashlytics.instance.setCustomKey('screen', screen);
    await FirebaseCrashlytics.instance.setCustomKey('feature_flag', featureFlag);
    await FirebaseCrashlytics.instance.setCustomKey(
      'app_version',
      (await PackageInfo.fromPlatform()).version,
    );
  }
}

// 5. Memory-safe image loading
Image.network(
  url,
  cacheWidth: 400, // Decode near display size, not full resolution
  // cacheHeight omitted: setting only one dimension preserves the aspect ratio
  errorBuilder: (_, error, __) => const Icon(Icons.broken_image),
)

// 6. Remote Config kill switch
class FeatureGuard extends StatelessWidget {
  final String flagKey;
  final Widget child;
  final Widget fallback;

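  const FeatureGuard({
    super.key,
    required this.flagKey,
    required this.child,
    this.fallback = const SizedBox.shrink(),
  });

  @override
  Widget build(BuildContext context) {
    // getBool returns false when the key has no fetched or default value,
    // so register in-app defaults (setDefaults) for flags that should start on.
    final enabled = FirebaseRemoteConfig.instance.getBool(flagKey);
    return enabled ? child : fallback;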
  }
}

Line-by-line walkthrough

1. Starting leak tracking in debug builds activates the leak_tracker detector, which monitors object lifecycles and reports objects that survive past their expected disposal point.
2. _typingController.dispose() is called before super.dispose() — all custom resources must be released before the State is torn down.
3. _wsSub.cancel() in dispose() is the most commonly forgotten cleanup — StreamSubscriptions hold a reference to the stream and its listener, preventing garbage collection of the entire widget tree in some cases.
4. _typingTimer?.cancel() uses null-safe call — the timer may not have been created yet if the user never typed, so null check is necessary.
5. CrashlyticsContext.setUserContext should be called on each screen navigation and after login — custom keys give you filtering power in the Crashlytics dashboard (e.g., 'show me only crashes on PaymentScreen for users with the new-checkout flag').
6. setCustomKey('app_version', ...) supplements Crashlytics' built-in version tracking — useful when you need to filter by semantic version in custom queries.
7. Image.network with cacheWidth or cacheHeight decodes the image into a smaller bitmap. Without these, a 4000x3000 JPEG is decoded into about 48 MB of pixel data (4000 × 3000 pixels × 4 bytes per pixel), even if it is displayed at 100x100 pixels.
8. FeatureGuard wraps any feature widget — when the Remote Config flag is false, it renders SizedBox.shrink() (nothing) instead of the feature, giving you a remote kill switch for any feature in the app.
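The memory figure in point 7 follows from simple arithmetic, sketched here in plain Dart. The helpers assume 4 bytes per decoded pixel (RGBA 8888) and are hypothetical names for illustration, not Flutter APIs. cacheWidthFor reflects the common practice of multiplying the logical widget size by the device pixel ratio, because cacheWidth is expressed in physical pixels of the decoded image.

```dart
/// Bytes needed for a decoded bitmap, assuming 4 bytes per pixel (RGBA 8888).
int decodedBytes(int width, int height, {int bytesPerPixel = 4}) =>
    width * height * bytesPerPixel;

/// Converts bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
double toMiB(int bytes) => bytes / (1024 * 1024);

/// Suggested cacheWidth for a widget that is [logicalWidth] logical pixels
/// wide on a screen with the given [devicePixelRatio].
int cacheWidthFor(double logicalWidth, double devicePixelRatio) =>
    (logicalWidth * devicePixelRatio).round();
```

decodedBytes(4000, 3000) is 48,000,000 bytes (about 45.8 MiB), while the same image decoded at 400x300 needs 480,000 bytes, a 100-fold reduction. For a 100-logical-pixel thumbnail on a 3.0 pixel-ratio screen, cacheWidthFor suggests 300.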

Spot the bug

class ProfileScreen extends StatefulWidget {
  const ProfileScreen({super.key});
  @override
  State<ProfileScreen> createState() => _ProfileScreenState();
}

class _ProfileScreenState extends State<ProfileScreen> {
  late Future<User> _userFuture;
  StreamSubscription? _syncSub;

  @override
  void initState() {
    super.initState();
    _userFuture = context.read<UserRepo>().fetchUser();
    _syncSub = context.read<SyncService>().onSync.listen((event) {
      setState(() {
        _userFuture = context.read<UserRepo>().fetchUser();
      });
    });
  }
}

Hint

This screen causes two types of issues reported in Crashlytics. What are they?

Answer

Bug 1: Missing dispose(). _syncSub is never cancelled. When ProfileScreen is popped from the navigator, the StreamSubscription continues to hold a reference to the State object. When onSync fires, setState is called on a disposed widget, causing 'setState() called after dispose()', a common Crashlytics error. Fix: override dispose() and call _syncSub?.cancel() then super.dispose().
Bug 2: setState reassigns _userFuture to a new Future on every sync event. FutureBuilder with a new Future re-shows the loading state on every sync, causing flickering. Fix: use a StreamBuilder instead of FutureBuilder + manual Future reassignment, or update the user data in a local state variable (User? _user) and call setState with the new user directly without creating a new Future.
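The lifecycle behind Bug 1 can be reproduced without Flutter. The sketch below is a plain-Dart stand-in, not the real widget: SyncListener plays the role of _ProfileScreenState, refreshCount stands in for the setState calls, and runScenario is a hypothetical driver. It uses a synchronous broadcast StreamController so events are delivered immediately and the effect of cancelling is easy to observe.

```dart
import 'dart:async';

/// Flutter-free stand-in for a State object: it listens to sync events,
/// counts refreshes, and must stop listening once disposed.
class SyncListener {
  SyncListener(Stream<void> onSync) {
    _sub = onSync.listen((_) {
      // Mirrors the `mounted` check you would do before calling setState.
      if (_disposed) return;
      refreshCount++;
    });
  }

  late final StreamSubscription<void> _sub;
  bool _disposed = false;
  int refreshCount = 0;

  /// Equivalent of State.dispose(): cancel so no further events arrive.
  void dispose() {
    _disposed = true;
    _sub.cancel();
  }
}

/// Runs the scenario: two events while alive, one after dispose.
/// Returns how many refreshes the listener performed.
int runScenario() {
  final controller = StreamController<void>.broadcast(sync: true);
  final listener = SyncListener(controller.stream);
  controller.add(null);
  controller.add(null);
  listener.dispose();
  controller.add(null); // Not counted: the subscription is cancelled.
  controller.close();
  return listener.refreshCount;
}
```

In the real screen the equivalent change is the one the answer describes: override dispose(), call _syncSub?.cancel(), then super.dispose().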

Explain like I'm 5

Production debugging is like being a doctor when someone calls saying they're sick. First you check who's most sick (crash rate). Then you ask questions about symptoms (stack trace, custom keys). Then you look at what they ate recently (recent commits). Then you give them medicine that works immediately (kill switch or hotfix). Then you write down what happened so you don't make the same mistake again (post-mortem).

Fun fact

Google's Site Reliability Engineering (SRE) book helped popularize blameless post-mortems across the industry. Where blame is the norm, engineers have an incentive to hide mistakes; a blameless culture creates psychological safety that leads to faster incident disclosure and resolution, ultimately improving system reliability.

Hands-on challenge

Conduct a mock incident response: (1) A crash spike hits your payment app — crash-free rate drops to 96.5% 30 minutes after a release. Write your step-by-step triage process. (2) Crashlytics shows the crash is on Android 12 in PaymentConfirmationScreen. What custom keys would you have pre-configured to help narrow this down? (3) You identify the introducing commit using git bisect. Write the git bisect commands. (4) Design a kill switch strategy using Remote Config. (5) Write a 5-Whys analysis for the fictional root cause: 'API response field renamed without updating the Flutter model'.

More resources

Firebase Crashlytics Flutter documentation (FlutterFire)
Flutter DevTools Memory documentation (Flutter Docs)
Google SRE Book — Postmortem Culture (Google SRE)
git bisect documentation (Git Docs)
Flutter performance profiling (Flutter Docs)
