Lesson 56 of 77 advanced

System Design I: Real-time Collaboration App

Designing a Tixio-style multi-workspace chat platform — architecture, WebSockets, offline queue, and scaling

Real-world analogy

Designing a real-time collaboration app is like designing a city's postal and telephone system simultaneously — messages must deliver instantly (phone calls), survive network outages (mail backup), handle millions of concurrent users (city scale), and organize communication by neighborhoods (workspaces) and streets (channels).

What is it?

A real-time collaboration system design covers the architecture needed to support multi-workspace messaging with presence, offline queuing, file sharing, and horizontal scaling — a common senior Flutter/backend system design interview topic.

Real-world relevance

Tixio is a Teamz Lab SaaS product providing multi-workspace collaboration. Its Flutter app maintains a persistent WebSocket connection, queues messages locally when offline, and uses Redis Pub/Sub to fan out messages across server instances to connected team members worldwide.
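To make the fan-out step concrete, here is a minimal in-memory sketch. It is not Tixio's code: `Broker` is a hypothetical stand-in for Redis Pub/Sub, `ChatServer` and its methods are invented for illustration, and plain lists stand in for WebSocket connections. The idea it shows is that each server instance subscribes to a channel topic once and relays whatever arrives to its own locally connected users.

```dart
import 'dart:async';

/// Hypothetical stand-in for Redis Pub/Sub: every subscriber of a topic
/// receives every message published to that topic.
class Broker {
  final _topics = <String, StreamController<String>>{};

  StreamController<String> _topic(String name) =>
      _topics.putIfAbsent(name, () => StreamController<String>.broadcast());

  void publish(String topic, String message) => _topic(topic).add(message);
  Stream<String> subscribe(String topic) => _topic(topic).stream;
}

/// One server instance holding the connections of some users.
class ChatServer {
  ChatServer(this._broker);
  final Broker _broker;
  // channelId -> inboxes of locally connected users (stand-ins for sockets).
  final _local = <String, List<List<String>>>{};
  final _subs = <String, StreamSubscription<String>>{};

  /// Registers a locally connected user and returns their inbox.
  List<String> join(String channelId) {
    final inbox = <String>[];
    _local.putIfAbsent(channelId, () => []).add(inbox);
    // Subscribe to the broker once per channel, not once per user.
    _subs.putIfAbsent(
      channelId,
      () => _broker.subscribe(channelId).listen((message) {
        for (final box in _local[channelId]!) {
          box.add(message);
        }
      }),
    );
    return inbox;
  }

  /// A user connected to this instance sends a message to a channel.
  void send(String channelId, String text) => _broker.publish(channelId, text);
}

Future<void> main() async {
  final broker = Broker();
  final serverA = ChatServer(broker);
  final serverB = ChatServer(broker);
  final alice = serverA.join('general');
  final bob = serverB.join('general');

  serverA.send('general', 'hello');
  await Future<void>.delayed(Duration.zero); // let the broker deliver

  print(alice); // [hello]
  print(bob); // [hello], although bob is connected to a different instance
}
```

The sending instance does not deliver to its own users directly; it publishes and receives its own message back through the broker like every other instance, which keeps a single delivery path.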

Key points

Code example

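// Flutter: Offline message queue with optimistic UI
import 'dart:async'; // unawaited, StreamController, Completer
import 'dart:convert'; // jsonEncode, jsonDecode

import 'package:drift/drift.dart';
import 'package:uuid/uuid.dart';
import 'package:web_socket_channel/web_socket_channel.dart';

// AppDatabase, its messagesDao, WsEvent, and ServerMessage are app-specific
// types that are not shown in this excerpt.

// drift schema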
class Messages extends Table {
  IntColumn get id => integer().autoIncrement()();
  TextColumn get localId => text()(); // client-generated UUID
  TextColumn get serverId => text().nullable()(); // assigned after ACK
  TextColumn get channelId => text()();
  TextColumn get content => text()();
  TextColumn get status => text().withDefault(const Constant('pending'))();
  // pending | sending | sent | failed
  DateTimeColumn get createdAt => dateTime()();
}

// MessageQueue service
class MessageQueue {
  final AppDatabase _db;
  final WebSocketService _ws;
  bool _flushing = false;

  MessageQueue(this._db, this._ws);

  // Called when user taps Send
  Future<Message> enqueue(String channelId, String content) async {
    final localId = const Uuid().v4();
    final msg = await _db.messagesDao.insertMessage(
      localId: localId,
      channelId: channelId,
      content: content,
      status: 'pending',
    );
    unawaited(flush()); // attempt immediate send
    return msg;
  }

  // Called on WS connect and network restore
  Future<void> flush() async {
    if (_flushing || !_ws.isConnected) return;
    _flushing = true;
    try {
      final pending = await _db.messagesDao.getPendingMessages();
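      for (final msg in pending) {
        // Stop if the connection dropped mid-flush; the next connect retries.
        if (!_ws.isConnected) break;
        await _db.messagesDao.updateStatus(msg.localId, 'sending');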
        try {
          final serverMsg = await _ws.sendMessage(
            channelId: msg.channelId,
            content: msg.content,
            idempotencyKey: msg.localId,
          );
          await _db.messagesDao.markSent(
            localId: msg.localId,
            serverId: serverMsg.id,
          );
        } catch (e) {
          await _db.messagesDao.updateStatus(msg.localId, 'failed');
        }
      }
    } finally {
      _flushing = false;
    }
  }
}

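// WebSocket event dispatcher
class WebSocketService {
  WebSocketChannel? _channel; // null until connect() is first called
  String? _token; // remembered so reconnect() can reuse it
  final StreamController<WsEvent> _events = StreamController.broadcast();

  Stream<WsEvent> get events => _events.stream;
  // closeCode stays null while the socket is open (or still connecting).
  bool get isConnected => _channel != null && _channel!.closeCode == null;

  void connect(String token) {
    _token = token;
    final channel = WebSocketChannel.connect(
      Uri.parse('wss://api.tixio.com/ws?token=$token'),
    );
    _channel = channel;
    channel.stream.listen(
      (data) => _events.add(WsEvent.fromJson(jsonDecode(data as String))),
      onDone: _handleDisconnect,
      onError: (e) => _handleDisconnect(),
      // An error followed by done must not schedule two reconnects.
      cancelOnError: true,
    );
  }

  void reconnect() {
    final token = _token;
    if (token != null) connect(token);
  }

  void _handleDisconnect() {
    // Simplified: fixed 2 s delay. Production code should use exponential
    // backoff with jitter instead.
    Future.delayed(const Duration(seconds: 2), reconnect);
  }

  Future<ServerMessage> sendMessage({
    required String channelId,
    required String content,
    required String idempotencyKey,
  }) async {
    final channel = _channel;
    if (channel == null) throw StateError('sendMessage called before connect');
    final completer = Completer<ServerMessage>();
    // Listen for the ACK matching idempotencyKey before sending,
    // so the ACK cannot be missed.
    unawaited(events
        .where((e) => e.type == 'message.ack' && e.idempotencyKey == idempotencyKey)
        .first
        .then(
          (e) => completer.complete(ServerMessage.fromEvent(e)),
          // Stream closed without an ACK; the timeout below reports the failure.
          onError: (Object _) {},
        ));
    channel.sink.add(jsonEncode({
      'type': 'message.send',
      'channelId': channelId,
      'content': content,
      'idempotencyKey': idempotencyKey,
    }));
    return completer.future.timeout(const Duration(seconds: 10));
  }
}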

Line-by-line walkthrough

  1. The Messages table stores both localId (a UUID generated on the client) and serverId (assigned after the server ACK). This enables optimistic UI (show the message immediately) while still reconciling with the server's canonical ID.
  2. The status column tracks the message lifecycle: pending → sending → sent/failed. The UI uses it to show delivery indicators (clock icon, check mark, error icon).
  3. enqueue() writes to the local DB first, then calls flush() wrapped in unawaited, so the UI responds immediately while delivery happens asynchronously.
  4. flush() guards with the _flushing flag so that concurrent flush calls (e.g., a reconnect and a manual retry at the same time) cannot send the same message twice.
  5. sendMessage uses localId as the idempotencyKey. The server returns the same idempotencyKey in the ACK, which lets the client match the ACK to the correct message and mark it as sent.
  6. The WS event listener uses .first on a filtered stream: it waits for exactly one matching ACK event, then completes the Completer.
  7. timeout(10 seconds) on the ACK wait prevents the flush from hanging indefinitely if the server never ACKs. The timeout triggers the catch block, which marks the message as failed. For the message to be retried, getPendingMessages() must also return failed rows, or a retry action must reset them to pending.
  8. _handleDisconnect triggers a reconnect after a fixed 2 s. A full implementation uses exponential backoff with jitter to avoid a thundering herd when a server restarts and many clients reconnect simultaneously.
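
The backoff with jitter mentioned in the last step could look roughly like the sketch below. It is one common variant ("full jitter": pick a random delay between zero and a capped exponential ceiling), not something the lesson's code implements. The function name `backoffDelay`, the 2 s base, and the 60 s cap are assumptions chosen for illustration.

```dart
import 'dart:math';

/// Delay before reconnect attempt number [attempt] (0-based).
/// Full jitter: a random duration between zero and min(cap, base * 2^attempt).
/// The default base and cap are illustrative values only.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 2),
  Duration cap = const Duration(seconds: 60),
  Random? random,
}) {
  final rng = random ?? Random();
  // Bound the exponent so the shift stays small during long outages.
  final exponent = min(max(attempt, 0), 20);
  final ceilingMs = min(base.inMilliseconds * (1 << exponent), cap.inMilliseconds);
  // nextInt excludes its upper bound, so add 1 to make the ceiling reachable.
  return Duration(milliseconds: rng.nextInt(ceilingMs + 1));
}

void main() {
  final rng = Random(42); // seeded only so repeated runs print the same values
  for (var attempt = 0; attempt < 6; attempt++) {
    final delay = backoffDelay(attempt, random: rng);
    print('attempt $attempt: wait ${delay.inMilliseconds} ms');
  }
}
```

In a service like the one above, the disconnect handler would keep an attempt counter, pass it to this function to schedule the next reconnect, and reset the counter once a connection succeeds.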

Spot the bug

// WebSocket reconnect with message replay
class ChatBloc extends Bloc<ChatEvent, ChatState> {
  StreamSubscription? _wsSub;

  ChatBloc() : super(ChatInitial()) {
    on<ConnectWs>((event, emit) {
      _ws.connect(event.token);
      _wsSub = _ws.events.listen((wsEvent) {
        add(WsEventReceived(wsEvent));
      });
    });
    on<WsEventReceived>((event, emit) {
      final messages = (state as ChatLoaded).messages;
      emit(ChatLoaded(messages: [...messages, event.wsEvent.message]));
    });
    on<DisconnectWs>((event, emit) {
      _wsSub?.cancel();
      _ws.disconnect();
    });
  }
}
Hint
After network reconnect, the user sees duplicate messages and misses some messages sent while offline. Two architectural issues cause this.

Answer
Bug 1: On reconnect, the client re-subscribes and receives new messages, but it never requests the messages it missed while disconnected. Fix: track lastEventId or lastSeenAt per channel; on WS reconnect, send a 'replay' request to the server for messages since lastSeenAt. The server returns the missed messages before resuming live events.

Bug 2: No deduplication. If the server replays messages that are already in the local DB (from optimistic inserts), they appear twice. Fix: deduplicate by message ID before adding to state, i.e. check whether the messages list already contains a message with the same serverId or localId before appending. In practice, use a LinkedHashMap keyed by message ID for O(1) dedup in the state.
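
Both fixes can be sketched together in a few lines. This is a simplified illustration under stated assumptions, not the course's reference solution: `ChatMessage`, `MessageStore`, the per-channel sequence number `seq`, and the `afterSeq` field of the replay request are hypothetical names, and the sketch assumes the server includes the client-generated localId on every message it delivers.

```dart
import 'dart:collection';

/// Minimal message shape for the sketch. `seq` is an assumed server-assigned
/// sequence number that plays the role of lastEventId.
class ChatMessage {
  const ChatMessage({
    required this.localId,
    required this.content,
    this.serverId,
    this.seq,
  });
  final String localId;
  final String content;
  final String? serverId; // null while the message is still pending
  final int? seq;
}

/// Messages keyed by localId, so a replayed or echoed copy replaces the
/// optimistic entry instead of being appended a second time.
class MessageStore {
  final _byLocalId = LinkedHashMap<String, ChatMessage>();
  int lastSeq = 0; // highest server sequence number seen so far

  void upsert(ChatMessage message) {
    // Writing to an existing key keeps the entry's original position.
    _byLocalId[message.localId] = message;
    final seq = message.seq;
    if (seq != null && seq > lastSeq) lastSeq = seq;
  }

  /// Payload to send right after the socket reconnects (Bug 1).
  Map<String, Object> replayRequest(String channelId) =>
      {'type': 'replay', 'channelId': channelId, 'afterSeq': lastSeq};

  List<ChatMessage> get messages => List.unmodifiable(_byLocalId.values);
}

void main() {
  final store = MessageStore();
  // Optimistic insert when the user taps Send.
  store.upsert(const ChatMessage(localId: 'a1', content: 'hi'));
  // The same message arrives again via ACK or replay: replaced, not appended.
  store.upsert(const ChatMessage(
      localId: 'a1', content: 'hi', serverId: 's9', seq: 9));
  // A message that was missed while offline arrives via replay.
  store.upsert(const ChatMessage(
      localId: 'b7', content: 'yo', serverId: 's10', seq: 10));

  print(store.messages.length); // 2
  print(store.replayRequest('general'));
}
```

In the bloc above, the WsEventReceived handler would call upsert and emit ChatLoaded(messages: store.messages) instead of appending to the previous list.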

Explain like I'm 5

Imagine you're designing a city's communication system. Everyone lives in neighborhoods (workspaces) with streets (channels). When someone shouts a message on a street, a runner (WebSocket server) carries it to everyone on that street. If the runner is sick, a backup runner takes over. If you're sleeping (offline), your messages are saved in your mailbox (queue) and delivered when you wake up. The post office (Redis) makes sure all runners know all the news.

Fun fact

Slack originally sharded its MySQL data by workspace, so all of a team's data lived on a single database shard. Adding workspaces meant adding shards, and no single database had to hold every customer's data. The trade-off was that very large workspaces could outgrow their shard, which is one reason Slack later moved to Vitess for more flexible sharding.

Hands-on challenge

Whiteboard the complete system design for a Tixio-like collaboration app. Cover: (1) Requirements clarification (list 6 questions you'd ask). (2) High-level architecture diagram with all layers. (3) Data model for workspaces, channels, and messages. (4) WebSocket connection lifecycle (connect, auth, subscribe, disconnect, reconnect). (5) Offline message queue flow. (6) How you'd scale from 1K to 1M concurrent users. (7) One failure scenario and your mitigation strategy.
