System Design I: Offline-First Field App at Scale
Designing a field operations platform for 10,000+ workers in low-connectivity areas
What is it?
Offline-first field app system design is the most common advanced Android system design question for apps targeting developing markets. It requires combining Room for local persistence, a sync queue pattern for reliable offline writes, delta sync for bandwidth efficiency, WorkManager for background synchronization, conflict resolution strategies, and battery optimization — all while maintaining data integrity guarantees that satisfy compliance requirements.
Real-world relevance
BRAC, one of the world's largest NGOs operating in Bangladesh, uses Android field apps for health workers visiting households in remote areas with no connectivity. FieldBuzz, a Bangladeshi SaaS platform, enables FMCG companies to manage field sales teams across areas with 2G-only coverage. Both require exactly this architecture: reliable offline writes, background sync, photo evidence capture, and GPS tracking that works for 10+ hours on a single charge.
Key points
- Requirements gathering — the interview opening move — Never jump into architecture. Ask: How many concurrent users? 10K field workers. Read-heavy or write-heavy? Both — workers read task lists, write inspection results. How long can they be offline? Up to 72 hours in remote areas. What is the acceptable sync lag? 15 minutes when online. What data must never be lost? Inspection reports, GPS coordinates, photos. This framing shows senior-level thinking and sets up every architectural decision.
- Core entities and Room schema design — Entities: Worker (id, role, region, syncedAt), Task (id, assignedTo, status, priority, dueAt, serverVersion), InspectionReport (id, taskId, workerId, answers JSON, photoUris, submittedAt, syncStatus), SyncQueue (id, entityType, entityId, operation, payload, retryCount, createdAt). SyncStatus enum: PENDING, SYNCING, SYNCED, FAILED. The SyncQueue table is the backbone — every local write generates a queue entry.
- Offline write path — the sync queue pattern — When a worker submits a report offline: 1) Write InspectionReport to Room with syncStatus=PENDING. 2) Insert a SyncQueueEntry with operation=INSERT, entityType=REPORT, payload=serialized report JSON. 3) Show success UI immediately — the write is durable on device. 4) WorkManager enqueues a sync job constrained to network availability. The UI never blocks on network — this is the core offline-first contract.
- Delta sync vs full sync — the critical tradeoff — Full sync: download all Tasks every sync cycle. Simple but expensive — 10K workers syncing 5MB task lists every 15 minutes is 50GB per cycle, roughly 200GB/hour of aggregate bandwidth. Delta sync: server tracks a lastModifiedAt timestamp per entity. Client sends its last sync timestamp; server returns only changed entities. Far more efficient, but it requires server-side change tracking and client-side merge logic. Always choose delta sync for field apps at scale.
- Conflict resolution strategy — Conflicts arise when a record is modified on both client and server while offline. Strategies: Last-Write-Wins (LWW) — simpler, use serverVersion timestamp; server always wins on pull, client wins on push if serverVersion matches. Three-way merge — for complex documents, compare base, client change, and server change. For inspection reports, use server-wins for task metadata, client-wins for report content (the worker's answer is authoritative). Document this decision explicitly in interviews — it shows you understand the tradeoffs.
- WorkManager for background sync — Use PeriodicWorkRequest with a 15-minute interval and constraints: NetworkType.CONNECTED, battery not low. The SyncWorker reads all PENDING SyncQueueEntries, sends them in a batch POST to the server, and on a 200 response marks them SYNCED and updates the corresponding Room entities. On failure, WorkManager retries with exponential backoff. Prefer WorkManager over raw AlarmManager or JobScheduler for deferrable sync work — it survives process death and respects Doze.
- Photo upload strategy — Photos are the heaviest payload. Strategy: 1) Save the photo to local file storage immediately. 2) Store the relative file path in InspectionReport, not the URI (URIs can become invalid after app restart). 3) Upload photos separately from report metadata — use a separate PhotoUploadQueue. 4) Server returns a CDN URL after upload; update the report with the URL. 5) Only mark the report SYNCED after all its photos are uploaded. Uploading in the background via WorkManager with a NetworkType.UNMETERED constraint is ideal for large photos.
- Battery optimization — Field workers use phones all day — battery is critical. Optimizations: batch sync (aggregate 50 queue entries per network request, not 50 individual requests), compress JSON payloads (gzip), upload photos only on UNMETERED or when battery > 30%, request balanced-power location updates at a coarse interval rather than continuous high-accuracy GPS, disable GPS when the worker is stationary (detected via accelerometer), cache task lists in memory to avoid redundant Room queries. Show these in interviews as evidence of production thinking.
- Data integrity guarantees — Use Room database transactions to write Report + SyncQueueEntry atomically — if the transaction fails, neither write happens, preventing orphaned queue entries. Use unique constraints on (taskId, workerId, submittedAt) to prevent duplicate submissions from UI double-tap. Use server-side idempotency keys (the local UUID of the report) so retried uploads do not create duplicate records on the server.
- Scaling considerations — server side awareness — Mention these to show full-stack thinking: Server needs a change log table (entity_changes) to support delta sync queries efficiently. Index on (entity_type, last_modified_at, region) for fast delta queries by worker region. Background jobs on server aggregate regional stats (do not compute in sync API). CDN for photo delivery — workers download task photos from CDN, not the app server. Consider read replicas for task list queries under 10K concurrent sync requests.
- Reference: BRAC/FieldBuzz architecture patterns — Field operations apps like BRAC's field management tools and FieldBuzz (a Bangladeshi field force management platform) face exactly this architecture. Key lessons from such systems: offline capability is not optional — 40% of work happens in areas with no signal. Photo evidence is legally required — loss of a photo can mean loss of compliance. Sync conflicts between supervisor override and worker submission must be logged for audit, not silently discarded.
- Interview narration strategy — Structure your answer: 1) Requirements (2 min). 2) Core entities + Room schema (3 min). 3) Offline write path with sync queue (3 min). 4) Delta sync and conflict resolution (3 min). 5) WorkManager sync job (2 min). 6) Battery and photo optimizations (2 min). 7) Scaling and failure scenarios (2 min). Draw a simple box diagram: Device (Room + SyncQueue) -> WorkManager -> API Server -> Database + CDN. Interviewers evaluate whether you can narrate confidently under pressure, not just whether your architecture is perfect.
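The conflict-resolution split described above (server-wins for task metadata, client-wins for report content) can be sketched as a pure function. The TaskMeta and ReportContent types below are simplified stand-ins for illustration, not the actual Room entities:

```kotlin
// Simplified stand-ins for the Room entities (illustration only)
data class TaskMeta(val status: String, val priority: Int, val serverVersion: Long)
data class ReportContent(val answersJson: String)

// Split policy: server-wins for task metadata (a supervisor may have
// reassigned or closed the task), client-wins for the worker's answers.
fun resolveConflict(
    localMeta: TaskMeta,
    serverMeta: TaskMeta,
    localContent: ReportContent
): Pair<TaskMeta, ReportContent> {
    // Server-wins on metadata: keep the copy with the newer serverVersion,
    // which after a pull is always the server's.
    val meta = if (serverMeta.serverVersion >= localMeta.serverVersion) serverMeta else localMeta
    // Client-wins on content: the field worker's answers are never overwritten.
    return meta to localContent
}
```

Writing the policy as a pure function like this also makes it trivially unit-testable, which is worth mentioning in an interview.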
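The photo-upload rule that a report is only marked SYNCED after all of its photos are uploaded can be sketched as a small gating function. PhotoUpload and ReportState here are illustrative types, not the real upload-queue schema:

```kotlin
// Illustrative types; the real app would read these rows from Room
data class PhotoUpload(val localPath: String, val cdnUrl: String?)

enum class ReportState { PENDING, SYNCING, SYNCED }

// A report is SYNCED only when its metadata has been accepted by the server
// AND every photo has come back with a CDN URL.
fun reportState(metadataSynced: Boolean, photos: List<PhotoUpload>): ReportState = when {
    !metadataSynced -> ReportState.PENDING
    photos.any { it.cdnUrl == null } -> ReportState.SYNCING // photos still in flight
    else -> ReportState.SYNCED
}
```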
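The batch-plus-gzip optimization is easy to demonstrate off-device. A minimal sketch using java.util.zip; the payload shape is made up for illustration, but repetitive field names across 50 batched entries compress extremely well:

```kotlin
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Gzip a JSON payload before the batch POST. The `use` block closes the
// stream, which finishes the gzip trailer before we read the bytes out.
fun gzip(payload: String): ByteArray {
    val out = ByteArrayOutputStream()
    GZIPOutputStream(out).use { it.write(payload.toByteArray(Charsets.UTF_8)) }
    return out.toByteArray()
}
```

With OkHttp or Retrofit this would be applied by an interceptor setting Content-Encoding: gzip, so the repository code never touches raw bytes.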
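The server-side idempotency-key guarantee from the data-integrity point can also be sketched; ReportStore is a made-up in-memory stand-in for the server's persistence layer, with the report's client-generated UUID doubling as the key:

```kotlin
// Made-up in-memory stand-in for the server's report table.
class ReportStore {
    private val seenIds = mutableSetOf<String>()

    // Returns true only the first time a given report id is submitted;
    // a retried upload after a lost ACK is acknowledged but not re-inserted.
    fun submit(reportId: String): Boolean {
        if (reportId in seenIds) return false
        seenIds += reportId
        return true
    }
}
```

In a real backend the same effect comes from a unique constraint on the report id plus an upsert-or-ignore write, so the check and the insert are atomic.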
Code example
// Core Room entities
@Entity(tableName = "tasks")
data class Task(
@PrimaryKey val id: String,
val assignedTo: String,
val title: String,
val status: TaskStatus,
val priority: Int,
val dueAt: Long,
val serverVersion: Long,
val syncStatus: SyncStatus = SyncStatus.SYNCED
)
@Entity(tableName = "inspection_reports",
indices = [Index(value = ["taskId", "workerId", "submittedAt"], unique = true)])
data class InspectionReport(
@PrimaryKey val id: String = UUID.randomUUID().toString(),
val taskId: String,
val workerId: String,
val answersJson: String,
val localPhotoPaths: String,
val serverPhotoUrls: String? = null,
val submittedAt: Long = System.currentTimeMillis(),
val syncStatus: SyncStatus = SyncStatus.PENDING
)
@Entity(tableName = "sync_queue")
data class SyncQueueEntry(
@PrimaryKey(autoGenerate = true) val id: Long = 0,
val entityType: String,
val entityId: String,
val operation: String,
val payload: String,
val retryCount: Int = 0,
val createdAt: Long = System.currentTimeMillis()
)
enum class SyncStatus { PENDING, SYNCING, SYNCED, FAILED }
// Repository — atomic offline write
class FieldRepository(private val db: AppDatabase, private val api: FieldApi) {
suspend fun submitReport(report: InspectionReport) {
db.withTransaction {
db.reportDao().insert(report)
db.syncQueueDao().insert(
SyncQueueEntry(
entityType = "REPORT",
entityId = report.id,
operation = "INSERT",
payload = Json.encodeToString(report)
)
)
}
// WorkManager will pick this up when network is available
}
suspend fun syncPendingReports() {
val pending = db.syncQueueDao().getPendingByType("REPORT", limit = 50)
if (pending.isEmpty()) return
try {
val response = api.batchSubmitReports(pending.map { Json.decodeFromString<InspectionReport>(it.payload) })
db.withTransaction {
response.synced.forEach { id ->
db.reportDao().updateSyncStatus(id, SyncStatus.SYNCED)
db.syncQueueDao().deleteByEntityId(id)
}
response.conflicts.forEach { conflict ->
db.reportDao().updateSyncStatus(conflict.localId, SyncStatus.FAILED)
db.reportDao().updateWithServerVersion(conflict.serverReport)
}
}
} catch (e: IOException) {
db.syncQueueDao().incrementRetryCount(pending.map { it.id })
}
}
}
// WorkManager sync job
class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {
override suspend fun doWork(): Result {
val repo = FieldRepository(AppDatabase.getInstance(applicationContext), FieldApi.create())
return try {
repo.syncPendingReports()
Result.success()
} catch (e: Exception) {
if (runAttemptCount < 3) Result.retry() else Result.failure()
}
}
}
// Scheduling periodic sync
fun scheduleSync(context: Context) {
val constraints = Constraints.Builder()
.setRequiredNetworkType(NetworkType.CONNECTED)
.setRequiresBatteryNotLow(true)
.build()
val syncWork = PeriodicWorkRequestBuilder<SyncWorker>(15, TimeUnit.MINUTES)
.setConstraints(constraints)
.setBackoffCriteria(BackoffPolicy.EXPONENTIAL, 30, TimeUnit.SECONDS)
.build()
WorkManager.getInstance(context).enqueueUniquePeriodicWork(
"field_sync",
ExistingPeriodicWorkPolicy.KEEP,
syncWork
)
}
// Delta sync — server request with last sync timestamp
data class DeltaSyncRequest(
val workerId: String,
val region: String,
val lastSyncAt: Long,
val deviceTime: Long = System.currentTimeMillis()
)
suspend fun performDeltaSync(db: AppDatabase, api: FieldApi, prefs: SyncPrefs) {
val request = DeltaSyncRequest(
workerId = prefs.workerId,
region = prefs.region,
lastSyncAt = prefs.lastSyncAt
)
val response = api.getDelta(request)
db.withTransaction {
response.updatedTasks.forEach { db.taskDao().upsert(it) }
response.deletedTaskIds.forEach { db.taskDao().markDeleted(it) }
prefs.lastSyncAt = response.serverTime
}
}
Line-by-line walkthrough
- 1. InspectionReport uses a unique index on (taskId, workerId, submittedAt) — this is the database-level guard against duplicate submissions, enforcing idempotency at the schema level rather than in application logic.
- 2. UUID.randomUUID().toString() as the primary key means the client generates the ID, not the server — this is essential for offline-first because the ID must exist before the network call.
- 3. db.withTransaction{} in submitReport() wraps both the report insert and sync queue insert — if either fails, both roll back, maintaining the invariant that every report in the DB has a corresponding queue entry.
- 4. syncPendingReports() fetches in batches of 50 — batching reduces network round trips from O(N) to O(N/50), critical on slow connections.
- 5. The conflict handling block updates the local record with serverReport data — this implements server-wins for conflicts, and the FAILED status allows the worker to review and resubmit.
- 6. incrementRetryCount on IOException — only network failures increment retry; application errors (conflict, validation) are handled separately and do not consume retry budget.
- 7. PeriodicWorkRequestBuilder with 15 minutes is the minimum interval WorkManager allows — execution timing is not exact: Android may batch the work with other jobs, and in Doze mode it is deferred until the next maintenance window, which can be considerably later.
- 8. ExistingPeriodicWorkPolicy.KEEP prevents duplicate sync chains if scheduleSync() is called multiple times (e.g., on every app launch).
- 9. DeltaSyncRequest sends lastSyncAt and deviceTime separately — deviceTime lets the server detect clock skew and adjust the comparison window, preventing missed updates from devices with wrong clocks.
- 10. response.serverTime is stored as the new lastSyncAt — using server time, not device time, eliminates drift from devices with incorrect clocks in the field.
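One way the server might combine a skew estimate from deviceTime with a safety overlap window — the helper name, the 60-second overlap value, and the exact adjustment are assumptions for illustration (once lastSyncAt is stored from serverTime the overlap alone may suffice):

```kotlin
// Hypothetical server-side helper: estimate the device's clock skew from the
// reported deviceTime, shift the client's sync marker into server time, and
// re-fetch a small overlap window so no update slips through the gap.
fun effectiveSince(
    lastSyncAt: Long,   // client's last sync marker (ms)
    deviceTime: Long,   // client's clock at request time (ms)
    serverNow: Long,    // server's clock at request time (ms)
    overlapMs: Long = 60_000
): Long {
    val skew = deviceTime - serverNow   // > 0 means the device clock runs fast
    val adjusted = lastSyncAt - skew    // translate the marker into server time
    return (adjusted - overlapMs).coerceAtLeast(0)
}
```

Re-fetching a small overlap is cheap because the delta response is idempotent to apply: upserting an already-seen task is a no-op.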
Spot the bug
class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {
override suspend fun doWork(): Result {
val db = AppDatabase.getInstance(applicationContext)
val api = FieldApi.create()
val pending = db.syncQueueDao().getAllPending()
pending.forEach { entry ->
try {
api.submitReport(Json.decodeFromString(entry.payload))
db.syncQueueDao().delete(entry)
db.reportDao().updateSyncStatus(entry.entityId, SyncStatus.SYNCED)
} catch (e: Exception) {
// silently continue
}
}
return Result.success()
}
}
More resources
- WorkManager guide — Android Developers (Android Docs)
- Room database with coroutines (Android Docs)
- Offline-first app architecture — Android Developers (Android Docs)
- Data and file storage overview (Android Docs)
- Battery optimization for background work (Android Docs)