Lesson 59 of 77 advanced

Production Debugging, Incident Response & Root Cause Analysis

Triaging crashes, reading stack traces, memory leaks, rollback strategies, and post-mortems

Real-world analogy

Production debugging is like being a detective at a crime scene that's still happening. The clues are stack traces and logs, the witnesses are your monitoring tools, and you need to find the culprit (root cause) fast enough to stop the damage — while keeping stakeholders calm and informed throughout.

What is it?

Production debugging combines crash triage, stack trace analysis, memory leak detection, git bisect, rollback strategies, and structured incident response processes to minimize user impact and prevent recurrence of production issues.

Real-world relevance

A fintech payment app sees a P0 crash spike after a release — crash-free rate drops to 97%. Crashlytics shows NullPointerException in PaymentScreen, affecting Android 12 users. Custom keys reveal it's users with saved cards. Git bisect identifies the introducing commit in 10 minutes. A Remote Config kill switch disables the saved-card flow immediately. The fix ships within 2 hours. A post-mortem adds contract tests to prevent similar API contract breaks.
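
The contract tests mentioned at the end of this scenario could look something like the following minimal sketch. It is plain Dart with no test framework, and the `SavedCard` model, its field names, and `checkSavedCardContract` are hypothetical names chosen for illustration, not part of any real API. The idea is to parse a recorded sample response with the same `fromJson` the app uses, so a renamed field fails loudly in CI instead of surfacing as a null in production.

```dart
import 'dart:convert';

/// Hypothetical model for a saved card; field names are illustrative only.
class SavedCard {
  SavedCard({required this.id, required this.last4});

  final String id;
  final String last4;

  /// Throws a FormatException naming the missing or mistyped field instead of
  /// letting a null slip through to the UI.
  factory SavedCard.fromJson(Map<String, dynamic> json) {
    String requireString(String key) {
      final value = json[key];
      if (value is! String) {
        throw FormatException('SavedCard: "$key" missing or not a string');
      }
      return value;
    }

    return SavedCard(id: requireString('id'), last4: requireString('last4'));
  }
}

/// Returns null when [body] satisfies the contract, otherwise the error message.
String? checkSavedCardContract(String body) {
  try {
    SavedCard.fromJson(jsonDecode(body) as Map<String, dynamic>);
    return null;
  } on FormatException catch (e) {
    return e.message;
  }
}

void main() {
  // Recorded sample that matches the contract.
  print(checkSavedCardContract('{"id": "c_1", "last4": "4242"}'));
  // Same payload after a backend rename of last4 to last_four.
  print(checkSavedCardContract('{"id": "c_1", "last_four": "4242"}'));
}
```

In a real project the same check would normally live in a unit test that runs against sample payloads shared with the backend team.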

Key points

Code example

// Production debugging toolkit

// 1. Symbolize obfuscated stack trace
// Command line: flutter symbolize -i crashlytics_stack.txt -d path/to/app.android-arm64.symbols

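// 2. Leak tracking in widget tests (file: test/flutter_test_config.dart)
import 'dart:async';

import 'package:leak_tracker_flutter_testing/leak_tracker_flutter_testing.dart';

// flutter_test runs testExecutable before the tests in this directory.
FutureOr<void> testExecutable(FutureOr<void> Function() testMain) {
  // Enable leak checks for every testWidgets() test
  LeakTesting.enable();
  return testMain();
}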

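// 3. Proper resource disposal to prevent leaks
// Needs dart:async and package:flutter/material.dart; context.read comes from
// a package such as provider. _handleEvent, _sendTypingStop, and build() are
// omitted for brevity.
class ChatScreenState extends State<ChatScreen>
    with SingleTickerProviderStateMixin { // required for vsync: this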
  late StreamSubscription<WsEvent> _wsSub;
  late AnimationController _typingController;
  late TextEditingController _messageController;
  Timer? _typingTimer;

  @override
  void initState() {
    super.initState();
    _typingController = AnimationController(
      vsync: this,
      duration: const Duration(milliseconds: 600),
    )..repeat(reverse: true);
    _messageController = TextEditingController();
    _wsSub = context.read<WsService>().events.listen(_handleEvent);
  }

  @override
  void dispose() {
    _typingController.dispose(); // MUST dispose all controllers
    _messageController.dispose();
    _wsSub.cancel(); // MUST cancel subscriptions
    _typingTimer?.cancel(); // MUST cancel timers
    super.dispose();
  }

  void _onTyping() {
    _typingTimer?.cancel();
    _typingTimer = Timer(const Duration(seconds: 3), _sendTypingStop);
  }
}

// 4. Crashlytics context for better triage
class CrashlyticsContext {
  static Future<void> setUserContext({
    required String userId,
    required String screen,
    required String featureFlag,
  }) async {
    await FirebaseCrashlytics.instance.setUserIdentifier(userId);
    await FirebaseCrashlytics.instance.setCustomKey('screen', screen);
    await FirebaseCrashlytics.instance.setCustomKey('feature_flag', featureFlag);
    await FirebaseCrashlytics.instance.setCustomKey(
      'app_version',
      (await PackageInfo.fromPlatform()).version,
    );
  }
}

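// 5. Memory-safe image loading
Image.network(
  url,
  // Decode near display size, not full resolution. Set only one dimension so
  // the aspect ratio is preserved; the value is in image pixels, so multiply
  // the logical size by the device pixel ratio where it matters.
  cacheWidth: 400,
  errorBuilder: (_, error, __) => const Icon(Icons.broken_image),
)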

// 6. Remote Config kill switch
class FeatureGuard extends StatelessWidget {
  final String flagKey;
  final Widget child;
  final Widget fallback;

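  const FeatureGuard({
    super.key,
    required this.flagKey,
    required this.child,
    this.fallback = const SizedBox.shrink(),
  });

  @override
  Widget build(BuildContext context) {
    // getBool reads the last activated value; a changed flag only takes
    // effect after the app fetches and activates it (e.g. fetchAndActivate()).
    final enabled = FirebaseRemoteConfig.instance.getBool(flagKey);
    return enabled ? child : fallback;
  }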
}

Line-by-line walkthrough

  1. The leak-tracking setup at the top of the toolkit turns on the leak_tracker tooling for debug and test runs. It monitors the lifecycle of disposable objects and reports those that outlive their expected disposal point.
  2. _typingController.dispose() is called before super.dispose(): all custom resources must be released before the State is torn down.
  3. _wsSub.cancel() in dispose() is one of the most commonly forgotten cleanups. While a subscription is active, the stream keeps a reference to the listener callback, and that callback captures the State, so the State and everything it references cannot be garbage collected.
  4. _typingTimer?.cancel() uses the null-aware operator ?. because the timer is only created once the user starts typing, so it may still be null.
  5. CrashlyticsContext.setUserContext should be called on each screen navigation and after login. Custom keys give you filtering power in the Crashlytics dashboard (e.g., 'show me only crashes on PaymentScreen for users with the new-checkout flag').
  6. setCustomKey('app_version', ...) supplements Crashlytics' built-in version tracking, which is useful when you need to filter by semantic version in custom queries.
  7. Image.network with cacheWidth/cacheHeight decodes the image into a smaller bitmap. Without these, a 4000x3000 JPEG is decoded into a buffer of about 48 MB (4000 x 3000 pixels x 4 bytes, roughly 46 MiB) even if displayed at 100x100 pixels.
  8. FeatureGuard wraps any feature widget. When the Remote Config flag is false, it renders SizedBox.shrink() (nothing) instead of the feature, giving you a remote kill switch for any feature in the app.
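
The memory figure in point 7 can be checked with a few lines of plain Dart. This is a back-of-the-envelope sketch that assumes 4 bytes per decoded pixel (8-bit RGBA), which is the usual case but not guaranteed for every format. The helper names `decodedBytes` and `scaledHeight` are made up for this example.

```dart
/// Bytes needed for a decoded bitmap, assuming 4 bytes per pixel (8-bit RGBA).
int decodedBytes(int width, int height, {int bytesPerPixel = 4}) =>
    width * height * bytesPerPixel;

/// Height that keeps the aspect ratio when only the width is constrained.
int scaledHeight(int srcWidth, int srcHeight, int targetWidth) =>
    (srcHeight * targetWidth / srcWidth).round();

void main() {
  const mib = 1024 * 1024;

  // Full-resolution decode of a 4000x3000 image.
  final full = decodedBytes(4000, 3000);
  print('full: $full bytes = ${(full / mib).toStringAsFixed(1)} MiB');
  // prints "full: 48000000 bytes = 45.8 MiB"

  // Same image decoded with a 400-pixel width constraint.
  final h = scaledHeight(4000, 3000, 400);
  final small = decodedBytes(400, h);
  print('width 400: $small bytes = ${(small / mib).toStringAsFixed(2)} MiB');
  // prints "width 400: 480000 bytes = 0.46 MiB"
}
```

Under these assumptions the constrained decode uses one hundredth of the memory, which is why a list of full-resolution thumbnails can exhaust memory so quickly.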

Spot the bug

class ProfileScreen extends StatefulWidget {
  const ProfileScreen({super.key});
  @override
  State<ProfileScreen> createState() => _ProfileScreenState();
}

class _ProfileScreenState extends State<ProfileScreen> {
  late Future<User> _userFuture;
  StreamSubscription? _syncSub;

  @override
  void initState() {
    super.initState();
    _userFuture = context.read<UserRepo>().fetchUser();
    _syncSub = context.read<SyncService>().onSync.listen((event) {
      setState(() {
        _userFuture = context.read<UserRepo>().fetchUser();
      });
    });
  }
}

Hint

This screen causes two types of issues reported in Crashlytics. What are they?

Answer

Bug 1: Missing dispose(). _syncSub is never cancelled, so after ProfileScreen is popped from the navigator the subscription still holds a reference to the State object. When onSync fires, setState is called on a disposed State, which produces the common Crashlytics error 'setState() called after dispose()'. Fix: override dispose(), call _syncSub?.cancel(), then super.dispose().

Bug 2: setState reassigns _userFuture to a new Future on every sync event. Assuming build() renders _userFuture with a FutureBuilder, each new Future puts the builder back into its waiting state, so the loading indicator reappears and the screen flickers on every sync. Fix: use a StreamBuilder instead of FutureBuilder plus manual Future reassignment, or keep the user in a local state variable (User? _user) and call setState with the new user directly, without creating a new Future.
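
The mechanism behind Bug 1 can be reproduced without Flutter. The sketch below is a framework-free stand-in, not the real screen: `ProfileStateSketch` is a hypothetical class that plays the role of _ProfileScreenState, and incrementing `refreshes` stands in for the setState call. It only illustrates that cancelling the subscription in dispose() stops later sync events from reaching the object; it does not address Bug 2.

```dart
import 'dart:async';

/// Hypothetical, framework-free stand-in for _ProfileScreenState.
class ProfileStateSketch {
  ProfileStateSketch(Stream<void> onSync) {
    _syncSub = onSync.listen((_) => _onSync());
  }

  StreamSubscription<void>? _syncSub;
  bool disposed = false;
  int refreshes = 0;

  void _onSync() {
    // In Flutter, this is where setState() would throw after dispose().
    if (disposed) throw StateError('refresh after dispose');
    refreshes++;
  }

  void dispose() {
    _syncSub?.cancel(); // the fix: stop listening before teardown
    disposed = true;
  }
}

Future<void> main() async {
  final sync = StreamController<void>.broadcast();
  final state = ProfileStateSketch(sync.stream);

  sync.add(null);
  await Future<void>.delayed(Duration.zero); // let the event reach the listener
  print('refreshes before dispose: ${state.refreshes}'); // 1

  state.dispose();
  sync.add(null); // no listener is left, so _onSync is not called
  await Future<void>.delayed(Duration.zero);
  print('refreshes after dispose: ${state.refreshes}'); // still 1

  await sync.close();
}
```

If the cancel() line is removed, the second event reaches _onSync and the StateError surfaces as an unhandled asynchronous error, which mirrors the crash described in the answer.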

Explain like I'm 5

Production debugging is like being a doctor when someone calls saying they're sick. First you check who's most sick (crash rate). Then you ask questions about symptoms (stack trace, custom keys). Then you look at what they ate recently (recent commits). Then you give them medicine that works immediately (kill switch or hotfix). Then you write down what happened so you don't make the same mistake again (post-mortem).

Fun fact

Google's Site Reliability Engineering (SRE) book helped popularize blameless post-mortems across the industry. In a blame-heavy culture, engineers are tempted to hide bugs to protect themselves. A blameless culture creates the psychological safety that leads to faster incident disclosure and resolution, which ultimately improves system reliability.

Hands-on challenge

Conduct a mock incident response: (1) A crash spike hits your payment app — crash-free rate drops to 96.5% 30 minutes after a release. Write your step-by-step triage process. (2) Crashlytics shows the crash is on Android 12 in PaymentConfirmationScreen. What custom keys would you have pre-configured to help narrow this down? (3) You identify the introducing commit using git bisect. Write the git bisect commands. (4) Design a kill switch strategy using Remote Config. (5) Write a 5-Whys analysis for the fictional root cause: 'API response field renamed without updating the Flutter model'.
