System Design I: Real-time Collaboration App
Designing a Tixio-style multi-workspace chat platform — architecture, WebSockets, offline queue, and scaling
Real-world analogy
Designing a real-time collaboration app is like designing a city's postal and telephone system simultaneously — messages must deliver instantly (phone calls), survive network outages (mail backup), handle millions of concurrent users (city scale), and organize communication by neighborhoods (workspaces) and streets (channels).
What is it?
A real-time collaboration system design covers the architecture needed to support multi-workspace messaging with presence, offline queuing, file sharing, and horizontal scaling — a common senior Flutter/backend system design interview topic.
Real-world relevance
Tixio is a Teamz Lab SaaS product providing multi-workspace collaboration. Its Flutter app maintains a persistent WebSocket connection, queues messages locally when offline, and uses Redis Pub/Sub to fan out messages across server instances to connected team members worldwide.
Key points
- Requirements gathering — the most important step. Senior interviewers watch HOW you gather requirements, not just what you design. Clarify: scale (MAU, concurrent users, message volume), geography (single region vs global), real-time requirements (latency SLA), offline needs, message persistence duration, file sharing limits, and compliance (GDPR, data residency). State your assumptions explicitly.
- High-level architecture layers — Client (Flutter) → API Gateway (load balancer + auth) → Application servers (NestJS/Go) → WebSocket server cluster → Message broker (Redis Pub/Sub or Kafka) → Persistence layer (PostgreSQL + Redis cache) → Object storage (S3/Supabase Storage for files) → Push notification service (FCM/APNs). Draw this top-down in interviews.
- Data model design — Core entities: User, Workspace, WorkspaceMember (role: owner/admin/member), Channel (public/private), ChannelMember, Message (id, channelId, authorId, content, type, replyToId, reactions, editedAt, deletedAt), FileAttachment, PresenceRecord. Use UUIDs for all IDs — distributed-safe and non-guessable.
- WebSocket design and connection management — Each Flutter client opens a persistent WebSocket to a sticky-load-balanced server. On connect: authenticate via JWT, subscribe to all channels the user is a member of, receive missed messages since last_seen_at. Server sends typed events: message.new, message.edited, member.typing, presence.update, channel.created.
- Offline message queue — Flutter client maintains a local SQLite queue (drift) of outbound messages. On send: write to local DB with status=pending, display optimistically in UI. Background isolate attempts WS send; on ACK from server, mark status=sent with server-assigned ID. On reconnect, flush queue in order. Never lose a user's message.
- Presence and typing indicators — Presence: client sends heartbeat every 30s over WS. Server marks user online; TTL expires after 45s without heartbeat. Typing: client sends typing.start on first keypress, typing.stop on 3s inactivity or send. Server fans out to channel members. Do NOT persist typing events to DB — ephemeral only.
- Message delivery guarantees — At-least-once delivery from server to client (client deduplicates by message ID). Client-to-server: queue + retry gives at-least-once. For exactly-once semantics, use idempotency keys (client generates UUID per message; server rejects duplicates). Sequence numbers per channel allow gap detection for missed messages.
- Fan-out architecture — When a message is sent to a channel with 500 members: naive approach broadcasts to all 500 WS connections on one server — doesn't scale. Better: use Redis Pub/Sub channel per workspace. Each WS server subscribes to relevant workspace channels. When a message arrives, the broker fans it out to all servers, each delivering to their connected members.
- Scaling considerations — WS servers are stateful — use sticky sessions (consistent hashing by userId). Horizontal scaling: add WS servers, Redis handles cross-server pub/sub. Database: read replicas for message history queries; write to primary. For 10M messages/day: PostgreSQL handles it with proper indexing (channelId + createdAt). For 100M+: consider TimescaleDB or Cassandra for time-series message storage.
- File sharing design — Client uploads file directly to S3/Supabase Storage via presigned URL (avoids routing large files through app server). On upload complete, client sends message with fileUrl. Server validates URL ownership before persisting. Thumbnails: generate server-side on upload trigger. Max file size enforced at presigned URL generation.
- Interview narration strategy — Structure your answer: (1) Clarify requirements — 2 minutes. (2) High-level diagram — 3 minutes. (3) Deep dive into 2-3 components the interviewer flags as interesting. (4) Discuss tradeoffs of your choices. (5) Address scaling and failure scenarios. Signal seniority by mentioning what you'd NOT build initially (over-engineering is a red flag).
- Failure handling and resilience — WS connection drop: client reconnects with exponential backoff (1s, 2s, 4s, max 30s). Reconnect sends last_event_id to receive missed messages. Server crash: Redis pub/sub state is in-memory — on restart, clients reconnect and replay. Message loss window: minimize by persisting to DB before broadcasting. Circuit breaker on external services (FCM, S3).
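The presence scheme described above (a 30s heartbeat with a 45s TTL) boils down to a small piece of bookkeeping. The sketch below is an in-memory stand-in for what would normally be Redis keys with a TTL; the class and method names (`PresenceTracker`, `heartbeat`, `isOnline`) are illustrative, not from any real library, and time is passed in explicitly so the logic is testable.

```typescript
// In-memory presence tracker: a user is "online" if a heartbeat arrived
// within the last ttlMs. In production this state would live in Redis
// (SET with EX), so any WS server can answer presence queries.
class PresenceTracker {
  private lastBeat = new Map<string, number>();

  constructor(private ttlMs = 45_000) {}

  // Called every ~30s per connected client.
  heartbeat(userId: string, nowMs: number): void {
    this.lastBeat.set(userId, nowMs);
  }

  // True while the most recent heartbeat is within the TTL window.
  isOnline(userId: string, nowMs: number): boolean {
    const t = this.lastBeat.get(userId);
    return t !== undefined && nowMs - t <= this.ttlMs;
  }
}
```

With a 30s heartbeat and 45s TTL, one missed heartbeat still leaves the user online; two misses flip them offline — a deliberate bit of slack for flaky mobile networks.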
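The per-channel sequence numbers mentioned under delivery guarantees enable a simple client-side check: if the next message's sequence number is not contiguous with the last one seen, some messages were missed and the client should request a replay. A minimal sketch, assuming the server stamps each message with a monotonically increasing per-channel `seq` (all names here are illustrative):

```typescript
// Tracks the highest sequence number seen per channel and reports gaps.
class GapDetector {
  private lastSeq = new Map<string, number>();

  // Returns the missing range [from, to] if a gap is detected,
  // or null for the first, contiguous, or already-seen messages.
  onMessage(channelId: string, seq: number): { from: number; to: number } | null {
    const last = this.lastSeq.get(channelId);
    // Duplicate or out-of-order replay of an already-seen message: ignore.
    if (last !== undefined && seq <= last) return null;
    this.lastSeq.set(channelId, seq);
    if (last !== undefined && seq > last + 1) {
      return { from: last + 1, to: seq - 1 }; // these seqs were missed
    }
    return null;
  }
}
```

On a detected gap, the client would fetch the missing range over HTTP (or a `replay` WS request) before rendering the new message, keeping the conversation ordered.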
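The fan-out pattern above can be demonstrated without Redis: the sketch below uses a tiny in-process broker as a stand-in for Redis Pub/Sub. Each WS server subscribes to one topic per workspace; a single publish reaches every server, and each server delivers only to its own connected members. `Broker` and `WsServer` are illustrative names, not real APIs.

```typescript
type Handler = (msg: string) => void;

// In-process stand-in for Redis Pub/Sub (SUBSCRIBE/PUBLISH semantics).
class Broker {
  private topics = new Map<string, Set<Handler>>();

  subscribe(topic: string, h: Handler): void {
    if (!this.topics.has(topic)) this.topics.set(topic, new Set());
    this.topics.get(topic)!.add(h);
  }

  publish(topic: string, msg: string): void {
    for (const h of this.topics.get(topic) ?? []) h(msg);
  }
}

// Each WS server subscribes to the workspaces of its connected users.
class WsServer {
  delivered: string[] = []; // stands in for writes to local WS connections

  constructor(broker: Broker, workspaceId: string, private name: string) {
    broker.subscribe(`ws:${workspaceId}`, (msg) => {
      this.delivered.push(`${this.name}:${msg}`);
    });
  }
}
```

The sender's server does one `publish`; the broker, not the application, bears the fan-out. Adding a WS server is just one more subscription, which is what makes horizontal scaling of the WS tier cheap.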
Code example
// Flutter: Offline message queue with optimistic UI
import 'dart:async';
import 'dart:convert';

import 'package:drift/drift.dart';
import 'package:uuid/uuid.dart';
import 'package:web_socket_channel/web_socket_channel.dart';

// drift schema
class Messages extends Table {
  IntColumn get id => integer().autoIncrement()();
  TextColumn get localId => text()(); // client-generated UUID
  TextColumn get serverId => text().nullable()(); // assigned after ACK
  TextColumn get channelId => text()();
  TextColumn get content => text()();
  TextColumn get status => text().withDefault(const Constant('pending'))();
  // pending | sending | sent | failed
  DateTimeColumn get createdAt => dateTime()();
}
// MessageQueue service
class MessageQueue {
  final AppDatabase _db;
  final WebSocketService _ws;
  bool _flushing = false;

  MessageQueue(this._db, this._ws);

  // Called when the user taps Send.
  Future<Message> enqueue(String channelId, String content) async {
    final localId = const Uuid().v4();
    final msg = await _db.messagesDao.insertMessage(
      localId: localId,
      channelId: channelId,
      content: content,
      status: 'pending',
    );
    unawaited(flush()); // attempt immediate send
    return msg;
  }

  // Called on WS connect and network restore.
  Future<void> flush() async {
    if (_flushing || !_ws.isConnected) return;
    _flushing = true;
    try {
      final pending = await _db.messagesDao.getPendingMessages();
      for (final msg in pending) {
        await _db.messagesDao.updateStatus(msg.localId, 'sending');
        try {
          final serverMsg = await _ws.sendMessage(
            channelId: msg.channelId,
            content: msg.content,
            idempotencyKey: msg.localId,
          );
          await _db.messagesDao.markSent(
            localId: msg.localId,
            serverId: serverMsg.id,
          );
        } catch (e) {
          await _db.messagesDao.updateStatus(msg.localId, 'failed');
        }
      }
    } finally {
      _flushing = false;
    }
  }
}
// WebSocket event dispatcher
class WebSocketService {
  // Nullable rather than `late`: isConnected must be safe to call
  // before connect() has ever run.
  WebSocketChannel? _channel;
  final StreamController<WsEvent> _events = StreamController.broadcast();

  Stream<WsEvent> get events => _events.stream;
  bool get isConnected => _channel != null && _channel!.closeCode == null;

  void connect(String token) {
    final channel = WebSocketChannel.connect(
      Uri.parse('wss://api.tixio.com/ws?token=$token'),
    );
    _channel = channel;
    channel.stream.listen(
      (data) => _events.add(WsEvent.fromJson(jsonDecode(data as String))),
      onDone: _handleDisconnect,
      onError: (Object e) => _handleDisconnect(),
    );
  }

  void _handleDisconnect() {
    // Fixed 2s delay for brevity; a full implementation uses
    // exponential backoff with jitter.
    Future.delayed(const Duration(seconds: 2), () => reconnect());
  }

  Future<ServerMessage> sendMessage({
    required String channelId,
    required String content,
    required String idempotencyKey,
  }) {
    final completer = Completer<ServerMessage>();
    // Subscribe for the ACK matching our idempotencyKey *before* sending,
    // so a fast ACK cannot slip past the subscription.
    events
        .where((e) =>
            e.type == 'message.ack' && e.idempotencyKey == idempotencyKey)
        .first
        .then((e) => completer.complete(ServerMessage.fromEvent(e)));
    _channel!.sink.add(jsonEncode({
      'type': 'message.send',
      'channelId': channelId,
      'content': content,
      'idempotencyKey': idempotencyKey,
    }));
    return completer.future.timeout(const Duration(seconds: 10));
  }
}

Line-by-line walkthrough
- 1. Messages table uses both localId (UUID from client) and serverId (assigned after server ACK) — this enables optimistic UI (show message immediately) while correctly reconciling with the server's canonical ID.
- 2. status column tracks the message lifecycle: pending → sending → sent/failed — the UI uses this to show delivery indicators (clock icon, check mark, error icon).
- 3. enqueue() writes to local DB first, then calls flush() with unawaited — the UI responds immediately while delivery happens asynchronously.
- 4. flush() guards with _flushing flag to prevent concurrent flush calls (e.g., from reconnect + manual retry simultaneously) causing duplicate sends.
- 5. sendMessage sets idempotencyKey from localId — the server returns the same idempotencyKey in the ACK, allowing the client to match and mark the correct message as sent.
- 6. The WS event listener uses .first on a filtered stream — it awaits exactly one matching ACK event then completes the Completer.
- 7. timeout(10 seconds) on the ACK wait prevents the flush from hanging indefinitely if the server doesn't ACK — triggers the catch block and marks message as failed for retry.
- 8. _handleDisconnect triggers reconnect after 2s — a full implementation uses exponential backoff with jitter to avoid thundering herd when a server restarts and many clients reconnect simultaneously.
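The exponential backoff with jitter mentioned above is a one-liner worth knowing cold in interviews. This sketch uses the "full jitter" variant — the delay is drawn uniformly from zero up to the capped exponential ceiling — with an injectable random source so it can be tested deterministically; the function name is illustrative.

```typescript
// Reconnect delay: random(0, min(cap, base * 2^attempt)).
// The randomness spreads reconnecting clients over time, avoiding a
// thundering herd when a server restarts and thousands reconnect at once.
function reconnectDelayMs(
  attempt: number,                  // 0 for the first retry, then 1, 2, ...
  baseMs = 1_000,
  capMs = 30_000,
  rand: () => number = Math.random, // injectable for testing
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}
```

Without jitter, every client that dropped at the same moment retries at the same moment, hammering the recovering server in synchronized waves; full jitter breaks that synchronization.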
Spot the bug
// WebSocket reconnect with message replay
class ChatBloc extends Bloc<ChatEvent, ChatState> {
  StreamSubscription? _wsSub;

  ChatBloc() : super(ChatInitial()) {
    on<ConnectWs>((event, emit) {
      _ws.connect(event.token);
      _wsSub = _ws.events.listen((wsEvent) {
        add(WsEventReceived(wsEvent));
      });
    });
    on<WsEventReceived>((event, emit) {
      final messages = (state as ChatLoaded).messages;
      emit(ChatLoaded(messages: [...messages, event.wsEvent.message]));
    });
    on<DisconnectWs>((event, emit) {
      _wsSub?.cancel();
      _ws.disconnect();
    });
  }
}

Need a hint?
After network reconnect, the user sees duplicate messages and misses some messages sent while offline. Two architectural issues cause this.
Show answer
Bug 1: On reconnect, the client re-subscribes and receives new messages, but doesn't request missed messages from when it was disconnected. Fix: track lastEventId or lastSeenAt per channel; on WS reconnect, send a 'replay' request to the server for messages since lastSeenAt. The server returns missed messages before resuming live events. Bug 2: No deduplication — if the server replays messages that were already in the local DB (from optimistic inserts), they appear twice. Fix: deduplicate by message ID before adding to state — check if messages list already contains a message with the same serverId or localId before appending. In practice, use a LinkedHashMap keyed by message ID for O(1) dedup in the state.
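The dedup fix described above — keying state by message ID so replayed or optimistic duplicates collapse to one entry — can be sketched as a small merge function. The `Msg` shape and function name are illustrative; a real client would key by `serverId` when present and fall back to `localId` for still-pending optimistic messages.

```typescript
interface Msg {
  id: string;      // serverId once ACKed, localId before that
  content: string;
}

// Merges incoming (e.g. replayed) messages into existing state.
// A Map keyed by id gives O(1) dedup; later entries win, so a
// replayed server copy replaces the optimistic local copy.
function mergeMessages(existing: Msg[], incoming: Msg[]): Msg[] {
  const byId = new Map<string, Msg>();
  for (const m of existing) byId.set(m.id, m);
  for (const m of incoming) byId.set(m.id, m); // duplicates collapse here
  return [...byId.values()];
}
```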
Explain like I'm 5
Imagine you're designing a city's communication system. Everyone lives in neighborhoods (workspaces) with streets (channels). When someone shouts a message on a street, a runner (WebSocket server) carries it to everyone on that street. If the runner is sick, a backup runner takes over. If you're sleeping (offline), your messages are saved in your mailbox (queue) and delivered when you wake up. The post office (Redis) makes sure all runners know all the news.
Fun fact
Slack's architecture originally used a single MySQL database per workspace ('cell-based architecture'). This allowed them to scale to thousands of workspaces without a single database becoming a bottleneck — an elegant design that influenced many subsequent collaboration tools.
Hands-on challenge
Whiteboard the complete system design for a Tixio-like collaboration app. Cover: (1) Requirements clarification (list 6 questions you'd ask). (2) High-level architecture diagram with all layers. (3) Data model for workspaces, channels, and messages. (4) WebSocket connection lifecycle (connect, auth, subscribe, disconnect, reconnect). (5) Offline message queue flow. (6) How you'd scale from 1K to 1M concurrent users. (7) One failure scenario and your mitigation strategy.
More resources
- Slack's architecture and scaling story (Slack Engineering)
- Redis Pub/Sub documentation (Redis Docs)
- Designing real-time messaging systems (systemdesign.one)
- WebSocketChannel Flutter package (pub.dev)
- System design interview framework (hellointerview.com)