System Design I: Offline-First Field App at Scale
Designing a field operations platform for 10,000+ workers in low-connectivity areas
What is it?
Offline-first field app system design is the most common advanced Android system design question for apps targeting developing markets. It requires combining Room for local persistence, a sync queue pattern for reliable offline writes, delta sync for bandwidth efficiency, WorkManager for background synchronization, conflict resolution strategies, and battery optimization — all while maintaining data integrity guarantees that satisfy compliance requirements.
Real-world relevance
BRAC, one of the world's largest NGOs operating in Bangladesh, uses Android field apps for health workers visiting households in remote areas with no connectivity. FieldBuzz, a Bangladeshi SaaS platform, enables FMCG companies to manage field sales teams across areas with 2G-only coverage. Both require exactly this architecture: reliable offline writes, background sync, photo evidence capture, and GPS tracking that works for 10+ hours on a single charge.
Key points
- Requirements gathering — the interview opening move — Never jump into architecture. Ask: How many concurrent users? 10K field workers. Read-heavy or write-heavy? Both — workers read task lists, write inspection results. How long can they be offline? Up to 72 hours in remote areas. What is the acceptable sync lag? 15 minutes when online. What data must never be lost? Inspection reports, GPS coordinates, photos. This framing shows senior-level thinking and sets up every architectural decision.
- Core entities and Room schema design — Entities: Worker (id, role, region, syncedAt), Task (id, assignedTo, status, priority, dueAt, serverVersion), InspectionReport (id, taskId, workerId, answers JSON, photoUris, submittedAt, syncStatus), SyncQueue (id, entityType, entityId, operation, payload, retryCount, createdAt). SyncStatus enum: PENDING, SYNCING, SYNCED, FAILED. The SyncQueue table is the backbone — every local write generates a queue entry.
- Offline write path — the sync queue pattern — When a worker submits a report offline: 1) Write InspectionReport to Room with syncStatus=PENDING. 2) Insert a SyncQueueEntry with operation=INSERT, entityType=REPORT, payload=serialized report JSON. 3) Show success UI immediately — the write is durable on device. 4) WorkManager enqueues a sync job constrained to network availability. The UI never blocks on network — this is the core offline-first contract.
- Delta sync vs full sync — the critical tradeoff — Full sync: download all Tasks every sync cycle. Simple but expensive — 10K workers syncing 5MB task lists every 15 minutes is 50GB per cycle, roughly 200GB/hour of aggregate bandwidth. Delta sync: server tracks a lastModifiedAt timestamp per entity. Client sends its last sync timestamp; server returns only changed entities. Far more efficient, but it requires server-side change tracking and client-side merge logic. Always choose delta sync for field apps at scale.
- Conflict resolution strategy — Conflicts arise when a record is modified on both client and server while offline. Strategies: Last-Write-Wins (LWW) — simpler, use serverVersion timestamp; server always wins on pull, client wins on push if serverVersion matches. Three-way merge — for complex documents, compare base, client change, and server change. For inspection reports, use server-wins for task metadata, client-wins for report content (the worker's answer is authoritative). Document this decision explicitly in interviews — it shows you understand the tradeoffs.
- WorkManager for background sync — Use PeriodicWorkRequest with a 15-minute interval and constraints: NetworkType.CONNECTED, battery not low. The SyncWorker reads all PENDING SyncQueueEntries, sends them in a batch POST to the server, and on a 200 response marks them SYNCED and updates the corresponding Room entities. On failure, WorkManager retries with exponential backoff. Prefer WorkManager over raw AlarmManager or JobScheduler for deferrable sync work — it survives process death and respects Doze.
- Photo upload strategy — Photos are the heaviest payload. Strategy: 1) Save the photo to local file storage immediately. 2) Store the relative file path in InspectionReport, not the URI (URIs can become invalid after app restart). 3) Upload photos separately from report metadata — use a separate PhotoUploadQueue. 4) Server returns a CDN URL after upload; update the report with the URL. 5) Only mark the report SYNCED after all its photos are uploaded. Uploading in the background via WorkManager with a NetworkType.UNMETERED constraint is ideal for large photos.
- Battery optimization — Field workers use phones all day — battery is critical. Optimizations: batch sync (aggregate 50 queue entries per network request, not 50 individual requests), compress JSON payloads (gzip), upload photos only on UNMETERED or when battery > 30%, request balanced-power location updates at a coarse interval rather than continuous high-accuracy GPS, disable GPS when the worker is stationary (detected via accelerometer), cache task lists in memory to avoid redundant Room queries. Show these in interviews as evidence of production thinking.
- Data integrity guarantees — Use Room database transactions to write Report + SyncQueueEntry atomically — if the transaction fails, neither write happens, preventing orphaned queue entries. Use unique constraints on (taskId, workerId, submittedAt) to prevent duplicate submissions from UI double-tap. Use server-side idempotency keys (the local UUID of the report) so retried uploads do not create duplicate records on the server.
- Scaling considerations — server side awareness — Mention these to show full-stack thinking: Server needs a change log table (entity_changes) to support delta sync queries efficiently. Index on (entity_type, last_modified_at, region) for fast delta queries by worker region. Background jobs on server aggregate regional stats (do not compute in sync API). CDN for photo delivery — workers download task photos from CDN, not the app server. Consider read replicas for task list queries under 10K concurrent sync requests.
- Reference: BRAC/FieldBuzz architecture patterns — Field operations apps like BRAC's field management tools and FieldBuzz (a Bangladeshi field force management platform) face exactly this architecture. Key lessons from such systems: offline capability is not optional — 40% of work happens in areas with no signal. Photo evidence is legally required — loss of a photo can mean loss of compliance. Sync conflicts between supervisor override and worker submission must be logged for audit, not silently discarded.
- Interview narration strategy — Structure your answer: 1) Requirements (2 min). 2) Core entities + Room schema (3 min). 3) Offline write path with sync queue (3 min). 4) Delta sync and conflict resolution (3 min). 5) WorkManager sync job (2 min). 6) Battery and photo optimizations (2 min). 7) Scaling and failure scenarios (2 min). Draw a simple box diagram: Device (Room + SyncQueue) -> WorkManager -> API Server -> Database + CDN. Interviewers evaluate whether you can narrate confidently under pressure, not just whether your architecture is perfect.
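The conflict-resolution split described above (server-wins for task metadata, client-wins for report content) can be sketched as a pure function. The TaskMeta and ReportContent types below are simplified stand-ins for illustration, not the actual Room entities:

```kotlin
// Simplified stand-ins for the Room entities (illustration only)
data class TaskMeta(val status: String, val priority: Int, val serverVersion: Long)
data class ReportContent(val answersJson: String)

// Split policy: server-wins for task metadata (a supervisor may have
// reassigned or closed the task), client-wins for the worker's answers.
fun resolveConflict(
    localMeta: TaskMeta,
    serverMeta: TaskMeta,
    localContent: ReportContent
): Pair<TaskMeta, ReportContent> {
    // Server-wins on metadata: keep the copy with the newer serverVersion,
    // which after a pull is always the server's.
    val meta = if (serverMeta.serverVersion >= localMeta.serverVersion) serverMeta else localMeta
    // Client-wins on content: the field worker's answers are never overwritten.
    return meta to localContent
}
```

Writing the policy as a pure function like this also makes it trivially unit-testable, which is worth mentioning in an interview.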
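The photo-upload rule that a report is only marked SYNCED after all of its photos are uploaded can be sketched as a small gating function. PhotoUpload and ReportState here are illustrative types, not the real upload-queue schema:

```kotlin
// Illustrative types; the real app would read these rows from Room
data class PhotoUpload(val localPath: String, val cdnUrl: String?)

enum class ReportState { PENDING, SYNCING, SYNCED }

// A report is SYNCED only when its metadata has been accepted by the server
// AND every photo has come back with a CDN URL.
fun reportState(metadataSynced: Boolean, photos: List<PhotoUpload>): ReportState = when {
    !metadataSynced -> ReportState.PENDING
    photos.any { it.cdnUrl == null } -> ReportState.SYNCING // photos still in flight
    else -> ReportState.SYNCED
}
```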
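The batch-plus-gzip optimization is easy to demonstrate off-device. A minimal sketch using java.util.zip; the payload shape is made up for illustration, but repetitive field names across 50 batched entries compress extremely well:

```kotlin
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Gzip a JSON payload before the batch POST. The `use` block closes the
// stream, which finishes the gzip trailer before we read the bytes out.
fun gzip(payload: String): ByteArray {
    val out = ByteArrayOutputStream()
    GZIPOutputStream(out).use { it.write(payload.toByteArray(Charsets.UTF_8)) }
    return out.toByteArray()
}
```

With OkHttp or Retrofit this would be applied by an interceptor setting Content-Encoding: gzip, so the repository code never touches raw bytes.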
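The server-side idempotency-key guarantee from the data-integrity point can also be sketched; ReportStore is a made-up in-memory stand-in for the server's persistence layer, with the report's client-generated UUID doubling as the key:

```kotlin
// Made-up in-memory stand-in for the server's report table.
class ReportStore {
    private val seenIds = mutableSetOf<String>()

    // Returns true only the first time a given report id is submitted;
    // a retried upload after a lost ACK is acknowledged but not re-inserted.
    fun submit(reportId: String): Boolean {
        if (reportId in seenIds) return false
        seenIds += reportId
        return true
    }
}
```

In a real backend the same effect comes from a unique constraint on the report id plus an upsert-or-ignore write, so the check and the insert are atomic.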
Code example
// Core Room entities
@Entity(tableName = "tasks")
data class Task(
@PrimaryKey val id: String,
val assignedTo: String,
val title: String,
val status: TaskStatus,
val priority: Int,
val dueAt: Long,
val serverVersion: Long,
val syncStatus: SyncStatus = SyncStatus.SYNCED
)
@Entity(tableName = "inspection_reports",
indices = [Index(value = ["taskId", "workerId", "submittedAt"], unique = true)])
data class InspectionReport(
@PrimaryKey val id: String = UUID.randomUUID().toString(),
val taskId: String,
val workerId: String,
val answersJson: String,
val localPhotoPaths: String,
val serverPhotoUrls: String? = null,
val submittedAt: Long = System.currentTimeMillis(),
val syncStatus: SyncStatus = SyncStatus.PENDING
)
@Entity(tableName = "sync_queue")
data class SyncQueueEntry(
@PrimaryKey(autoGenerate = true) val id: Long = 0,
val entityType: String,
val entityId: String,
val operation: String,
val payload: String,
val retryCount: Int = 0,
val createdAt: Long = System.currentTimeMillis()
)
enum class SyncStatus { PENDING, SYNCING, SYNCED, FAILED }
// Repository — atomic offline write
class FieldRepository(private val db: AppDatabase, private val api: FieldApi) {
suspend fun submitReport(report: InspectionReport) {
db.withTransaction {
db.reportDao().insert(report)
db.syncQueueDao().insert(
SyncQueueEntry(
entityType = "REPORT",
entityId = report.id,
operation = "INSERT",
payload = Json.encodeToString(report)
)
)
}
// WorkManager will pick this up when network is available
}
suspend fun syncPendingReports() {
val pending = db.syncQueueDao().getPendingByType("REPORT", limit = 50)
if (pending.isEmpty()) return
try {
val response = api.batchSubmitReports(pending.map { Json.decodeFromString<InspectionReport>(it.payload) })
db.withTransaction {
response.synced.forEach { id ->
db.reportDao().updateSyncStatus(id, SyncStatus.SYNCED)
db.syncQueueDao().deleteByEntityId(id)
}
response.conflicts.forEach { conflict ->
db.reportDao().updateSyncStatus(conflict.localId, SyncStatus.FAILED)
db.reportDao().updateWithServerVersion(conflict.serverReport)
}
}
} catch (e: IOException) {
db.syncQueueDao().incrementRetryCount(pending.map { it.id })
}
}
}
// WorkManager sync job
class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {
override suspend fun doWork(): Result {
val repo = FieldRepository(AppDatabase.getInstance(applicationContext), FieldApi.create())
return try {
repo.syncPendingReports()
Result.success()
} catch (e: Exception) {
if (runAttemptCount < 3) Result.retry() else Result.failure()
}
}
}
// Scheduling periodic sync
fun scheduleSync(context: Context) {
val constraints = Constraints.Builder()
.setRequiredNetworkType(NetworkType.CONNECTED)
.setRequiresBatteryNotLow(true)
.build()
val syncWork = PeriodicWorkRequestBuilder<SyncWorker>(15, TimeUnit.MINUTES)
.setConstraints(constraints)
.setBackoffCriteria(BackoffPolicy.EXPONENTIAL, 30, TimeUnit.SECONDS)
.build()
WorkManager.getInstance(context).enqueueUniquePeriodicWork(
"field_sync",
ExistingPeriodicWorkPolicy.KEEP,
syncWork
)
}
// Delta sync — server request with last sync timestamp
data class DeltaSyncRequest(
val workerId: String,
val region: String,
val lastSyncAt: Long,
val deviceTime: Long = System.currentTimeMillis()
)
suspend fun performDeltaSync(db: AppDatabase, api: FieldApi, prefs: SyncPrefs) {
val request = DeltaSyncRequest(
workerId = prefs.workerId,
region = prefs.region,
lastSyncAt = prefs.lastSyncAt
)
val response = api.getDelta(request)
db.withTransaction {
response.updatedTasks.forEach { db.taskDao().upsert(it) }
response.deletedTaskIds.forEach { db.taskDao().markDeleted(it) }
prefs.lastSyncAt = response.serverTime
}
}
Line-by-line walkthrough
- 1. InspectionReport uses a unique index on (taskId, workerId, submittedAt) — this is the database-level guard against duplicate submissions, enforcing idempotency at the schema level rather than in application logic.
- 2. UUID.randomUUID().toString() as the primary key means the client generates the ID, not the server — this is essential for offline-first because the ID must exist before the network call.
- 3. db.withTransaction{} in submitReport() wraps both the report insert and sync queue insert — if either fails, both roll back, maintaining the invariant that every report in the DB has a corresponding queue entry.
- 4. syncPendingReports() fetches in batches of 50 — batching reduces network round trips from O(N) to O(N/50), critical on slow connections.
- 5. The conflict handling block updates the local record with serverReport data — this implements server-wins for conflicts, and the FAILED status allows the worker to review and resubmit.
- 6. incrementRetryCount on IOException — only network failures increment retry; application errors (conflict, validation) are handled separately and do not consume retry budget.
- 7. PeriodicWorkRequestBuilder with 15 minutes is the minimum interval WorkManager allows — execution timing is not exact: Android may batch the work with other jobs, and in Doze mode it is deferred until the next maintenance window, which can be considerably later.
- 8. ExistingPeriodicWorkPolicy.KEEP prevents duplicate sync chains if scheduleSync() is called multiple times (e.g., on every app launch).
- 9. DeltaSyncRequest sends lastSyncAt and deviceTime separately — deviceTime lets the server detect clock skew and adjust the comparison window, preventing missed updates from devices with wrong clocks.
- 10. response.serverTime is stored as the new lastSyncAt — using server time, not device time, eliminates drift from devices with incorrect clocks in the field.
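One way the server might combine a skew estimate from deviceTime with a safety overlap window — the helper name, the 60-second overlap value, and the exact adjustment are assumptions for illustration (once lastSyncAt is stored from serverTime the overlap alone may suffice):

```kotlin
// Hypothetical server-side helper: estimate the device's clock skew from the
// reported deviceTime, shift the client's sync marker into server time, and
// re-fetch a small overlap window so no update slips through the gap.
fun effectiveSince(
    lastSyncAt: Long,   // client's last sync marker (ms)
    deviceTime: Long,   // client's clock at request time (ms)
    serverNow: Long,    // server's clock at request time (ms)
    overlapMs: Long = 60_000
): Long {
    val skew = deviceTime - serverNow   // > 0 means the device clock runs fast
    val adjusted = lastSyncAt - skew    // translate the marker into server time
    return (adjusted - overlapMs).coerceAtLeast(0)
}
```

Re-fetching a small overlap is cheap because the delta response is idempotent to apply: upserting an already-seen task is a no-op.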
Spot the bug
class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {
override suspend fun doWork(): Result {
val db = AppDatabase.getInstance(applicationContext)
val api = FieldApi.create()
val pending = db.syncQueueDao().getAllPending()
pending.forEach { entry ->
try {
api.submitReport(Json.decodeFromString(entry.payload))
db.syncQueueDao().delete(entry)
db.reportDao().updateSyncStatus(entry.entityId, SyncStatus.SYNCED)
} catch (e: Exception) {
// silently continue
}
}
return Result.success()
}
}
More resources
- WorkManager guide — Android Developers (Android Docs)
- Room database with coroutines (Android Docs)
- Offline-first app architecture — Android Developers (Android Docs)
- Data and file storage overview (Android Docs)
- Battery optimization for background work (Android Docs)