System Design III: Offline-First Field App at Scale
Designing for 10K+ field workers in low-connectivity — sync queues, conflict resolution, and battery optimisation
Real-world analogy
An offline-first field app is like a clipboard that works anywhere — even underground, in rice fields, or in a tunnel. You write on it all day, then when you're back in range, it automatically syncs with the office filing cabinet. But 10,000 people might edit the same file, so you need rules about who wins when two people changed the same thing.
What is it?
An offline-first field app architecture enables full CRUD functionality without connectivity for thousands of concurrent field workers, with background synchronization, conflict resolution, and battery-optimized sync strategies.
Real-world relevance
BRAC's digital data collection tools and FieldBuzz (Bangladesh field operations platform) operate in areas where connectivity is intermittent. Field officers collect household data, visit records, and form submissions offline all day, with the app syncing when they return to areas with signal — maintaining data integrity across 10,000+ concurrent workers.
Key points
- The offline-first requirement — Field apps used by NGO workers (BRAC), utility crews (Hazira Khata), or farm extension officers operate in areas with no connectivity for hours or days. Offline-first means the app is fully functional without internet — reading, writing, and form submission all work locally. Sync is a background concern, not a blocking one.
- Requirements for field apps at scale — Always clarify: how many field workers (10K in this case), geographic distribution, average connectivity window (30min/day?), data volume per worker per day, multi-device (one worker, two phones?), supervisor review workflow, data integrity requirements, and whether conflicts are common (multiple workers editing the same record).
- Local database design with Drift — Drift (formerly Moor) provides type-safe SQLite with migrations. Schema for field apps: entities like Household, Beneficiary, Visit, FormSubmission, SyncStatus. Add sync metadata to every table: serverId (nullable until synced), clientId (UUID, always set), syncState (pending/syncing/synced/conflict), updatedAt, createdAt, deletedAt (soft delete for sync).
- Sync queue architecture — Every local write (insert, update, delete) generates a SyncQueueEntry: {id, entityType, entityId, operation, payload, timestamp, retryCount, status}. A background isolate processes this queue when connectivity is available. Queue is ordered by timestamp — maintains causality. Failed entries are retried with exponential backoff up to a max retry count.
- Conflict resolution strategies — Last-Write-Wins (LWW): server timestamp determines winner — simple but can silently discard work. Merge (field-level): if two workers edited different fields of the same record, both changes are merged — complex but data-preserving. Manual: flag as conflict, surface to supervisor for resolution — correct but requires UX. Choose based on data importance: demographic data needs merge; GPS coordinates can use LWW.
- Delta sync vs full sync — Full sync: download all records the user has access to — simple but slow for large datasets. Delta sync: server tracks a 'sync cursor' (timestamp or sequence number) per client; only changes since the cursor are returned. Much more efficient. Fully distributed delta sync without a central cursor is also possible — vector clocks (as used by many CRDTs) or Merkle-tree comparison (as in Dynamo-style anti-entropy) let each side identify what the other is missing.
- Background sync implementation — On Android: WorkManager with PeriodicWorkRequest (minimum 15min interval) for guaranteed background execution. Constraint: NetworkType.CONNECTED. On iOS: BGTaskScheduler (BGAppRefreshTask or BGProcessingTask). In Flutter: use the workmanager package. Run sync in a Dart isolate to avoid blocking the UI, and report progress back to the UI via IsolateNameServer.
- Battery optimisation — Background sync is a battery drain risk. Mitigations: batch multiple changes into one HTTP request, use sync windows when device is charging, compress payloads (gzip), avoid syncing images in background (defer to WiFi), exponential backoff on failure (avoid hammering server and wasting battery), monitor with Android Battery Historian.
- Data integrity and consistency — Use database transactions for multi-table writes — never leave data in a partial state. Ensure every locally-created record has a stable, client-generated UUID (clientId) assigned once at creation, before sync — the server acknowledges with its own ID (serverId). Reference foreign keys using clientId until serverId is known, then optionally migrate references.
- Server-side sync endpoint design — POST /sync/push: accepts array of change records with clientId, timestamp, operation. Server processes, detects conflicts, returns results per change: {clientId, status: 'accepted'|'conflict', serverRecord}. GET /sync/pull?cursor=X: returns all changes since cursor for this user's data scope. Return new cursor in response. Batch size limit (e.g., 500 records) prevents timeout.
- Handling large-scale conflict scenarios — With 10K workers, many workers may have the same supervisor or visit the same household. Design access scopes: each worker only syncs data within their assigned area (GPS-bounded or assignment-based). This dramatically reduces conflict probability and sync payload size. Server enforces scope at both push and pull endpoints.
- Interview narration for offline-first design — Signal seniority: (1) Raise the conflict resolution question before the interviewer does — it shows you understand the hard problem. (2) Distinguish between 'offline capable' (reads work) and 'offline-first' (writes and reads work). (3) Discuss sync cursor and delta sync rather than polling. (4) Mention WorkManager/BGTaskScheduler for background sync. (5) Ask about the conflict rate in practice — it changes the conflict strategy choice.
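To make the merge strategy in the conflict-resolution bullet concrete, here is a minimal, dependency-free sketch of a field-level (three-way) merge in plain Dart. The names `threeWayMerge` and `MergeResult` are illustrative, and a real implementation would compare per-field timestamps rather than raw maps:

```dart
class MergeResult {
  final Map<String, Object?> merged;
  final Set<String> conflicts; // same field changed differently on both sides
  MergeResult(this.merged, this.conflicts);
}

MergeResult threeWayMerge(
  Map<String, Object?> base,   // last version both sides agreed on
  Map<String, Object?> local,  // this worker's offline edit
  Map<String, Object?> remote, // the version now on the server
) {
  final merged = Map<String, Object?>.from(base);
  final conflicts = <String>{};
  for (final f in {...base.keys, ...local.keys, ...remote.keys}) {
    final localChanged = local[f] != base[f];
    final remoteChanged = remote[f] != base[f];
    if (localChanged && remoteChanged && local[f] != remote[f]) {
      conflicts.add(f);      // surface to supervisor for manual resolution
      merged[f] = remote[f]; // keep server value until resolved
    } else if (localChanged) {
      merged[f] = local[f];
    } else if (remoteChanged) {
      merged[f] = remote[f];
    }
  }
  return MergeResult(merged, conflicts);
}
```

With disjoint edits — one worker updates a phone number while another adds visit notes — both changes survive and `conflicts` is empty; only a field edited differently on both sides is flagged.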
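The exponential-backoff point from the battery bullet can be sketched as a pure function. The 30-second base and 1-hour cap are assumed values; the jitter term matters at this scale because without it, 10K devices regaining signal together retry in lockstep and stampede the server:

```dart
import 'dart:math';

Duration backoffDelay(int retryCount, Random rng) {
  const baseMs = 30 * 1000;     // 30s after the first failure (assumed)
  const capMs = 60 * 60 * 1000; // never wait more than 1h (assumed)
  // Double per retry; clamp the shift so the multiplication cannot overflow.
  final exponential = baseMs * (1 << min(retryCount, 20));
  final capped = min(exponential, capMs);
  final jitter = rng.nextInt(capped ~/ 2 + 1); // up to +50% random spread
  return Duration(milliseconds: capped + jitter);
}
```

The retryCount stored on each SyncQueueEntry feeds straight into this; once the max retry count is exceeded, the entry should move to a dead-letter state rather than keep consuming battery.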
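The cursor-based delta pull described above can be sketched against an in-memory stand-in for GET /sync/pull. `Change`, `FakeServer`, and the page shape are assumptions for illustration, not the real API; the key ideas are the monotonic sequence number doubling as the cursor and looping until a short page signals we are caught up:

```dart
class Change {
  final int seq; // server-assigned sequence number; doubles as the cursor
  final String clientId;
  Change(this.seq, this.clientId);
}

class FakeServer {
  final List<Change> log; // ordered by seq, as the server would keep it
  FakeServer(this.log);

  // Returns up to [limit] changes with seq > cursor, plus the new cursor.
  (List<Change>, int) pull(int cursor, int limit) {
    final page = log.where((c) => c.seq > cursor).take(limit).toList();
    final newCursor = page.isEmpty ? cursor : page.last.seq;
    return (page, newCursor);
  }
}

int syncPull(FakeServer server, int cursor, void Function(Change) upsert) {
  const limit = 2; // real endpoints cap pages at e.g. 500 records
  while (true) {
    final (page, next) = server.pull(cursor, limit);
    for (final c in page) {
      upsert(c); // real app: transactional upsert into the local DB
    }
    cursor = next; // advance only after the page is applied
    if (page.length < limit) return cursor; // short page => caught up
  }
}
```

Because the cursor only advances after a page is applied, a crash mid-sync means the same page is re-fetched next time — so server-side changes must be safe to upsert idempotently.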
Code example
// Offline-first sync architecture with Drift + WorkManager
// Drift schema with sync metadata
class Visits extends Table {
TextColumn get clientId => text()(); // UUID, set at creation
TextColumn get serverId => text().nullable()(); // set after sync ACK
TextColumn get householdId => text()();
TextColumn get officerId => text()();
TextColumn get notes => text().withDefault(const Constant(''))();
RealColumn get gpsLat => real().nullable()();
TextColumn get syncState => text().withDefault(const Constant('pending'))();
// pending | syncing | synced | conflict
DateTimeColumn get updatedAt => dateTime()();
DateTimeColumn get createdAt => dateTime()();
DateTimeColumn get deletedAt => dateTime().nullable()(); // soft delete
@override
Set<Column> get primaryKey => {clientId};
}
class SyncQueueEntries extends Table {
IntColumn get id => integer().autoIncrement()();
TextColumn get entityType => text()(); // 'visit', 'household', etc.
TextColumn get entityId => text()(); // clientId of the entity
TextColumn get operation => text()(); // 'insert' | 'update' | 'delete'
TextColumn get payload => text()(); // JSON of changed fields only
DateTimeColumn get enqueuedAt => dateTime()();
IntColumn get retryCount => integer().withDefault(const Constant(0))();
TextColumn get status => text().withDefault(const Constant('pending'))();
}
// Visit DAO — every write enqueues a sync entry
extension VisitDao on AppDatabase {
Future<void> createVisit(VisitsCompanion visit) async {
await transaction(() async {
await into(visits).insert(visit);
await into(syncQueueEntries).insert(SyncQueueEntriesCompanion.insert(
entityType: 'visit',
entityId: visit.clientId.value,
operation: 'insert',
payload: jsonEncode(visit.toJson()),
enqueuedAt: DateTime.now(),
));
});
}
}
// WorkManager task (runs in background isolate)
@pragma('vm:entry-point')
void syncBackgroundTask() {
Workmanager().executeTask((taskName, inputData) async {
try {
await SyncService.runSync();
return Future.value(true);
} catch (e) {
return Future.value(false); // WorkManager will retry
}
});
}
// Sync service
class SyncService {
static Future<void> runSync() async {
final db = await DatabaseFactory.open();
final api = ApiClient.create();
// Push: send pending queue entries
final pending = await db.syncQueueDao.getPendingEntries(limit: 100);
// Push only when there is something to push — but always fall through
// to the pull below, or local data goes stale on quiet days.
if (pending.isNotEmpty) {
final response = await api.post('/sync/push', data: {
'changes': pending.map((e) => e.toJson()).toList(),
});
for (final result in response.data['results']) {
final clientId = result['clientId'] as String;
if (result['status'] == 'accepted') {
await db.syncQueueDao.markSent(clientId);
await db.visitsDao.updateServerId(
clientId: clientId,
serverId: result['serverId'] as String,
syncState: 'synced',
);
} else if (result['status'] == 'conflict') {
await db.visitsDao.markConflict(clientId, result['serverRecord']);
}
}
}
// Pull: fetch changes since last cursor
final cursor = await db.syncMetaDao.getCursor();
final pullResponse = await api.get('/sync/pull', queryParameters: {
'cursor': cursor,
'limit': 500,
});
await db.transaction(() async {
for (final record in pullResponse.data['records']) {
await db.visitsDao.upsertFromServer(record);
}
await db.syncMetaDao.setCursor(pullResponse.data['newCursor']);
});
}
}
// WorkManager registration
Future<void> registerBackgroundSync() async {
// Workmanager().initialize(syncBackgroundTask) must run before this,
// so the background isolate knows which entry point to execute.
await Workmanager().registerPeriodicTask(
'field_sync',
'syncBackgroundTask',
frequency: const Duration(minutes: 15),
constraints: Constraints(networkType: NetworkType.connected),
existingWorkPolicy: ExistingWorkPolicy.keep,
);
}
Line-by-line walkthrough
- 1. The Visits table includes both clientId (UUID set at creation) and serverId (set after server ACK) — this two-ID pattern is fundamental to offline-first; the client uses clientId for all local references.
- 2. syncState column tracks the sync lifecycle: pending (awaiting sync), syncing (in-flight), synced (confirmed by server), conflict (server rejected with newer version).
- 3. createVisit uses a Drift transaction to insert the visit AND its sync queue entry atomically — if either write fails, both are rolled back, ensuring the queue is never missing an entry for a local change.
- 4. SyncQueueEntry.operation captures 'insert'/'update'/'delete' so the server can apply the correct operation when processing the push.
- 5. syncBackgroundTask is annotated with @pragma('vm:entry-point') — this prevents the Dart tree shaker from removing the function, which would cause WorkManager to fail to find it.
- 6. Workmanager().executeTask returning false signals WorkManager to retry the task — returning true marks it as complete. The retry policy is configured separately.
- 7. In runSync(), the push batch limit of 100 prevents a single sync from timing out when the queue is very large (e.g., after 7 days offline).
- 8. The pull sync uses a transaction to atomically upsert all received records AND update the cursor — if the transaction rolls back, the cursor is not advanced and the same records are re-fetched on next sync.
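The two-ID pattern in point 1 implies an ID-translation step that the walkthrough only hints at: outgoing payloads should reference the serverId once it is known, and fall back to the clientId for parents that have not synced yet. A minimal sketch (`IdMap` is a hypothetical helper, not a Drift API):

```dart
class IdMap {
  final _serverByClient = <String, String>{};

  // Called when /sync/push ACKs a record: remember the server's ID.
  void ack(String clientId, String serverId) =>
      _serverByClient[clientId] = serverId;

  // Translate an outgoing foreign-key reference: prefer the serverId once
  // known; fall back to the clientId for still-unsynced parents, which the
  // server can resolve because every pushed record carries its clientId.
  String outgoingRef(String clientId) => _serverByClient[clientId] ?? clientId;
}
```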
Spot the bug
// Sync queue processing
Future<void> processQueue() async {
final entries = await db.syncQueueDao.getAllPending();
for (final entry in entries) {
try {
final result = await api.post('/sync/push-single', data: entry.toJson());
if (result.data['status'] == 'accepted') {
await db.syncQueueDao.delete(entry);
}
} catch (e) {
// Will retry on next sync
}
}
}
Need a hint?
This approach causes major problems at scale with 10K workers, even when it works correctly. What are the two architectural problems?
Show answer
Bug 1: N HTTP requests for N queue entries — if a worker has 500 pending changes after a day offline, this makes 500 sequential API calls. At 100-500ms per round-trip on mobile networks, the total sync takes 50-250 seconds and drains the battery. Fix: batch all entries into a single POST /sync/push request with an array payload — one round-trip regardless of queue size, with the server returning per-entry results.

Bug 2: The loop swallows failures and keeps going, which breaks the queue's causal ordering — if entry 50 (say, the insert of a visit) fails but entry 51 (an update to that same visit) succeeds, the server receives an update for a record it has never seen. The empty catch block also never increments retryCount, so a permanently-failing entry is retried forever with no backoff and no path to a dead-letter state. Fix: preserve per-entity ordering (stop an entity's chain at the first failure), and track status and retryCount per entry so only failed or conflicted entries are retried — both of which fall out naturally from the batched design, since the server's per-entry results make targeted retry possible.
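A dependency-free sketch of the batched fix, with a stand-in for the server so the per-entry result handling is visible. `Entry`, `pushBatch`, and the hard-coded conflict on id 3 are all illustrative:

```dart
class Entry {
  final int id;
  String status; // 'pending' | 'synced' | 'conflict'
  Entry(this.id, [this.status = 'pending']);
}

// Stand-in for POST /sync/push: one round-trip, per-entry results.
// Hard-coding a conflict on id 3 just makes the failure path visible.
Map<int, String> pushBatch(List<Entry> batch) =>
    {for (final e in batch) e.id: e.id == 3 ? 'conflict' : 'accepted'};

void processQueue(List<Entry> queue) {
  final pending = queue.where((e) => e.status == 'pending').toList();
  if (pending.isEmpty) return;
  final results = pushBatch(pending); // one HTTP call, not N
  for (final e in pending) {
    // Conflicted entries leave the pending path without being retried
    // blindly — they wait for targeted resolution or supervisor review.
    e.status = results[e.id] == 'accepted' ? 'synced' : 'conflict';
  }
}
```

Compared with the buggy loop, the whole queue costs one round-trip, and the per-entry results mean a single bad record no longer forces re-processing of everything around it.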
Explain like I'm 5
Imagine 10,000 people all working in places with no phone signal, all writing notes in their personal notebooks. When they get signal again, all their notebooks need to update a single big shared notebook in the office. If two people wrote different things about the same family, someone (or a smart rule) has to decide what the shared notebook says. The app has to do this automatically, without losing anyone's work, and without draining their phone battery doing it.
Fun fact
The Open Data Kit (ODK) — used by WHO, CDC, and hundreds of NGOs for field data collection across 100+ countries — was one of the first mobile systems to prove offline-first data collection at massive scale, influencing the architecture of most modern field app platforms including BRAC's digital tools.
Hands-on challenge
Design the complete offline-first architecture for a field data collection app with 10,000 workers: (1) Drift schema for Visit and SyncQueueEntry tables with all sync metadata columns. (2) The sync queue write pattern (transaction ensures queue entry is always created with the data change). (3) WorkManager task registration with correct constraints. (4) Push sync flow — describe what happens when a conflict is detected. (5) Pull sync flow with cursor-based delta sync. (6) How would you handle the scenario where a worker's device was offline for 7 days?
More resources
- WorkManager Flutter package (pub.dev)
- Drift (SQLite ORM) documentation (Drift Docs)
- Offline-first app architecture patterns (raywenderlich.com)
- Conflict resolution strategies in distributed systems (Martin Fowler)
- CRDTs for conflict-free synchronisation (crdt.tech)