Lesson 58 of 83 advanced

System Design I: Offline-First Field App at Scale

Designing a field operations platform for 10,000+ workers in low-connectivity areas

Real-world analogy

Designing an offline-first field app is like building a network of village outposts that operate completely independently during the day, then send consolidated reports to headquarters every evening over a slow satellite link. Each outpost must be self-sufficient, handle conflicts when two outposts report the same event differently, and never lose data — even if the satellite link fails for three days.

What is it?

Offline-first field app design is a common advanced Android system design question, especially for apps targeting developing markets. It combines Room for local persistence, a sync queue pattern for reliable offline writes, delta sync for bandwidth efficiency, WorkManager for background synchronization, conflict resolution strategies, and battery optimization, all while maintaining data integrity guarantees that satisfy compliance requirements.

Real-world relevance

BRAC, one of the world's largest NGOs operating in Bangladesh, uses Android field apps for health workers visiting households in remote areas with no connectivity. FieldBuzz, a Bangladeshi SaaS platform, enables FMCG companies to manage field sales teams across areas with 2G-only coverage. Both require exactly this architecture: reliable offline writes, background sync, photo evidence capture, and GPS tracking that works for 10+ hours on a single charge.

Code example

// Core Room entities
@Entity(tableName = "tasks")
data class Task(
    @PrimaryKey val id: String,
    val assignedTo: String,
    val title: String,
    val status: TaskStatus,
    val priority: Int,
    val dueAt: Long,
    val serverVersion: Long,
    val syncStatus: SyncStatus = SyncStatus.SYNCED
)

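// @Serializable (kotlinx.serialization) is required by Json.encodeToString() in the repository
@Serializable
@Entity(tableName = "inspection_reports",
    indices = [Index(value = ["task_id", "worker_id", "submitted_at"], unique = true)])
data class InspectionReport(
    @PrimaryKey val id: String = UUID.randomUUID().toString(),
    // Column names must match the names listed in the index above
    @ColumnInfo(name = "task_id") val taskId: String,
    @ColumnInfo(name = "worker_id") val workerId: String,
    val answersJson: String,
    val localPhotoPaths: String,
    val serverPhotoUrls: String? = null,
    @ColumnInfo(name = "submitted_at") val submittedAt: Long = System.currentTimeMillis(),
    val syncStatus: SyncStatus = SyncStatus.PENDING
)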

@Entity(tableName = "sync_queue")
data class SyncQueueEntry(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val entityType: String,
    val entityId: String,
    val operation: String,
    val payload: String,
    val retryCount: Int = 0,
    val createdAt: Long = System.currentTimeMillis()
)

enum class SyncStatus { PENDING, SYNCING, SYNCED, FAILED }

// Repository — atomic offline write
class FieldRepository(private val db: AppDatabase, private val api: FieldApi) {

    suspend fun submitReport(report: InspectionReport) {
        db.withTransaction {
            db.reportDao().insert(report)
            db.syncQueueDao().insert(
                SyncQueueEntry(
                    entityType = "REPORT",
                    entityId = report.id,
                    operation = "INSERT",
                    payload = Json.encodeToString(report)
                )
            )
        }
        // WorkManager will pick this up when network is available
    }

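    suspend fun syncPendingReports() {
        val pending = db.syncQueueDao().getPendingByType("REPORT", limit = 50)
        if (pending.isEmpty()) return

        try {
            // One batched request instead of one request per report
            val reports = pending.map { Json.decodeFromString<InspectionReport>(it.payload) }
            val response = api.batchSubmitReports(reports)
            db.withTransaction {
                response.synced.forEach { id ->
                    db.reportDao().updateSyncStatus(id, SyncStatus.SYNCED)
                    db.syncQueueDao().deleteByEntityId(id)
                }
                response.conflicts.forEach { conflict ->
                    // Server wins: take the server copy, flag it for review, and
                    // drop the queue entry so the rejected payload is not resent
                    db.reportDao().updateWithServerVersion(conflict.serverReport)
                    db.reportDao().updateSyncStatus(conflict.localId, SyncStatus.FAILED)
                    db.syncQueueDao().deleteByEntityId(conflict.localId)
                }
            }
        } catch (e: IOException) {
            // Network failure: record the attempt, then rethrow so SyncWorker can retry with backoff
            db.syncQueueDao().incrementRetryCount(pending.map { it.id })
            throw e
        }
    }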
}

// WorkManager sync job
class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val repo = FieldRepository(AppDatabase.getInstance(applicationContext), FieldApi.create())
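        return try {
            repo.syncPendingReports()
            Result.success()
        } catch (e: CancellationException) {
            // Never swallow coroutine cancellation; let WorkManager stop the worker cleanly
            throw e
        } catch (e: Exception) {
            if (runAttemptCount < 3) Result.retry() else Result.failure()
        }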
    }
}

// Scheduling periodic sync
fun scheduleSync(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiredNetworkType(NetworkType.CONNECTED)
        .setRequiresBatteryNotLow(true)
        .build()

    val syncWork = PeriodicWorkRequestBuilder<SyncWorker>(15, TimeUnit.MINUTES)
        .setConstraints(constraints)
        .setBackoffCriteria(BackoffPolicy.EXPONENTIAL, 30, TimeUnit.SECONDS)
        .build()

    WorkManager.getInstance(context).enqueueUniquePeriodicWork(
        "field_sync",
        ExistingPeriodicWorkPolicy.KEEP,
        syncWork
    )
}

// Delta sync — server request with last sync timestamp
data class DeltaSyncRequest(
    val workerId: String,
    val region: String,
    val lastSyncAt: Long,
    val deviceTime: Long = System.currentTimeMillis()
)

// db is passed in explicitly; it was not in scope in a top-level function
suspend fun performDeltaSync(db: AppDatabase, api: FieldApi, prefs: SyncPrefs) {
    val request = DeltaSyncRequest(
        workerId = prefs.workerId,
        region = prefs.region,
        lastSyncAt = prefs.lastSyncAt
    )
    val response = api.getDelta(request)
    db.withTransaction {
        response.updatedTasks.forEach { db.taskDao().upsert(it) }
        response.deletedTaskIds.forEach { db.taskDao().markDeleted(it) }
    }
    // Advance the sync cursor only after the transaction has committed
    prefs.lastSyncAt = response.serverTime
}
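
To make the setBackoffCriteria(BackoffPolicy.EXPONENTIAL, 30, TimeUnit.SECONDS) call in scheduleSync() more concrete, here is a small sketch of the delays such a policy would produce. It assumes the commonly documented WorkManager behavior: the delay doubles on each retry and is capped at five hours. The function backoffDelayMillis and the constant below are illustrative names, not WorkManager API, and actual scheduling can be later than these values because the OS may defer work.

```kotlin
// Assumed cap on backoff (five hours), mirroring WorkManager's documented maximum
const val MAX_BACKOFF_MILLIS: Long = 5L * 60 * 60 * 1000

// Hypothetical helper: delay before the given retry (1 = first retry)
fun backoffDelayMillis(baseMillis: Long, retryNumber: Int): Long {
    require(retryNumber >= 1) { "retryNumber starts at 1" }
    // Limit the shift so the multiplication cannot overflow for realistic bases
    val shift = (retryNumber - 1).coerceAtMost(30)
    val delay = baseMillis * (1L shl shift)
    return delay.coerceAtMost(MAX_BACKOFF_MILLIS)
}

fun main() {
    val base = 30_000L // 30 seconds, as configured in scheduleSync()
    for (retry in 1..5) {
        println("retry $retry: ${backoffDelayMillis(base, retry) / 1000} s")
    }
}
```

With SyncWorker giving up after a few attempts, only the first entries of this schedule matter in practice; the next periodic run then starts a fresh attempt.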

Line-by-line walkthrough

  1. InspectionReport has a unique index on (task_id, worker_id, submitted_at). This is the database-level guard against duplicate submissions: re-inserting the same report fails at the schema level instead of relying on application logic. Note that it only catches exact repeats, because submitted_at differs between separately created reports.
  2. UUID.randomUUID().toString() as the primary key means the client generates the ID, not the server. This is essential for offline-first because the ID must exist before the network call.
  3. db.withTransaction{} in submitReport() wraps both the report insert and the sync queue insert. If either fails, both roll back, maintaining the invariant that every report in the DB has a corresponding queue entry.
  4. syncPendingReports() fetches in batches of 50. Batching cuts the number of requests from N to roughly N/50, which is critical on slow connections where each round trip is expensive.
  5. The conflict handling block updates the local record with serverReport data. This implements server-wins for conflicts, and the FAILED status allows the worker to review and resubmit.
  6. incrementRetryCount runs only on IOException, so only network failures increment the retry count. Application errors (conflict, validation) are handled separately and do not consume retry budget.
  7. PeriodicWorkRequestBuilder with 15 minutes uses the minimum interval WorkManager allows. This is a lower bound, not a guarantee: Android may batch the job with other work, and in Doze mode it is deferred to the next maintenance window, so the real gap between runs can be much longer.
  8. ExistingPeriodicWorkPolicy.KEEP prevents duplicate sync chains if scheduleSync() is called multiple times (e.g., on every app launch).
  9. DeltaSyncRequest sends lastSyncAt and deviceTime separately. deviceTime lets the server detect clock skew and adjust the comparison window, preventing missed updates from devices with wrong clocks.
  10. response.serverTime is stored as the new lastSyncAt. Using server time, not device time, eliminates drift from devices with incorrect clocks in the field.
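
Points 9 and 10 describe the server using deviceTime to detect clock skew. The sketch below shows one way a server-side helper might do that. Everything here is an assumption for illustration: the names SkewCheck, checkSkew, deltaWindowStart, and toServerTime, the 60-second tolerance, and the 5-second overlap are not part of the lesson's API. It also ignores network latency, which adds a small error to the skew estimate.

```kotlin
import kotlin.math.abs

// Hypothetical result of comparing the device clock with the server clock
data class SkewCheck(val skewMillis: Long, val suspicious: Boolean)

// Positive skew means the device clock is behind the server clock
fun checkSkew(serverNow: Long, deviceTime: Long, toleranceMillis: Long = 60_000): SkewCheck {
    val skew = serverNow - deviceTime
    return SkewCheck(skew, abs(skew) > toleranceMillis)
}

// One possible policy: always overlap slightly with the previous window, and
// widen the window further when the device clock looks unreliable.
// Re-sending a few already-seen tasks is safe because the client upserts them.
fun deltaWindowStart(lastSyncAt: Long, check: SkewCheck, overlapMillis: Long = 5_000): Long {
    val extra = if (check.suspicious) abs(check.skewMillis) else 0L
    return (lastSyncAt - overlapMillis - extra).coerceAtLeast(0L)
}

// Convert a device-stamped time (such as submittedAt) to approximate server time
fun toServerTime(deviceTimestamp: Long, check: SkewCheck): Long =
    deviceTimestamp + check.skewMillis

fun main() {
    val check = checkSkew(serverNow = 1_000_000, deviceTime = 700_000)
    println("skew=${check.skewMillis} ms, suspicious=${check.suspicious}")
    println("window starts at ${deltaWindowStart(900_000, check)}")
}
```

Widening the window trades a little extra bandwidth for not missing updates, which is usually the safer direction when the client applies changes idempotently.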

Spot the bug

class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val db = AppDatabase.getInstance(applicationContext)
        val api = FieldApi.create()
        val pending = db.syncQueueDao().getAllPending()

        pending.forEach { entry ->
            try {
                api.submitReport(Json.decodeFromString(entry.payload))
                db.syncQueueDao().delete(entry)
                db.reportDao().updateSyncStatus(entry.entityId, SyncStatus.SYNCED)
            } catch (e: Exception) {
                // silently continue
            }
        }
        return Result.success()
    }
}

Need a hint?

There are three significant bugs: one causes data loss on partial failure, one hides all errors making retry impossible, and one is a performance problem that will cause timeouts on slow connections.

Show answer

Bug 1: Processing queue entries one by one (forEach with a separate api.submitReport call per entry) costs N network requests for N pending entries. On a 2G connection with 500 ms RTT, 50 pending reports means at least 25 seconds of round-trip latency before any payload transfer time. Because getAllPending() is unbounded, a large backlog can run into WorkManager's 10-minute execution limit, and the system may stop the work earlier if constraints are no longer met. Fix: send entries in bounded batches, with one batchSubmitReports() call per batch.

Bug 2: The catch block silently swallows every exception, and the worker always returns Result.success(), so WorkManager never applies backoff. Network errors, auth errors, and server errors all look the same. A failed entry stays in the queue and is re-attempted on the next cycle, but there is no retry count or limit, so permanently failing entries accumulate and are resent indefinitely. Fix: distinguish IOException (retry) from HttpException (check the status code: a 4xx client error will not succeed on retry, a 5xx should be retried), and cap the number of retries.

Bug 3: delete(entry) and updateSyncStatus() are not in a transaction. If the app crashes between these two operations, the queue entry is deleted but the report is still marked PENDING. It will never sync again (an orphaned PENDING report). Fix: wrap both in db.withTransaction{}.
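
The error handling that Bug 2 calls for can be sketched as a pure decision function, kept separate from the worker so it is easy to test. This is a minimal sketch under stated assumptions: SyncDecision, decide, and MAX_RETRIES are hypothetical names, a null status stands for a network-level failure such as an IOException, and treating 408 and 429 as retryable is one reasonable choice rather than a rule from the lesson.

```kotlin
enum class SyncDecision { DONE, RETRY_LATER, MARK_FAILED }

// Hypothetical retry budget per queue entry
const val MAX_RETRIES = 5

// httpStatus == null means the request never got an HTTP response (network failure)
fun decide(httpStatus: Int?, retryCount: Int): SyncDecision {
    if (httpStatus != null && httpStatus in 200..299) return SyncDecision.DONE

    val retryable = when (httpStatus) {
        null -> true          // timeout, no route, connection reset
        408, 429 -> true      // request timeout, rate limited
        in 500..599 -> true   // server-side error, may succeed later
        else -> false         // other 4xx: resending the same payload will not help
    }
    return if (retryable && retryCount < MAX_RETRIES) SyncDecision.RETRY_LATER
    else SyncDecision.MARK_FAILED
}

fun main() {
    println(decide(httpStatus = null, retryCount = 0))
    println(decide(httpStatus = 422, retryCount = 0))
    println(decide(httpStatus = 503, retryCount = MAX_RETRIES))
}
```

A worker could map RETRY_LATER to Result.retry() and record MARK_FAILED entries so they stop consuming bandwidth and surface to the user for review.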

Explain like I'm 5

Imagine you are a postal worker in a village with no phone signal. You still write down all your deliveries in your notebook all day. In the evening, when you pass through a town with signal, you send all your reports at once. If your notebook and the main office have different info about the same package, the rules say which version wins. That is exactly what an offline-first field app does — your phone is the notebook, and Room + WorkManager are the rules.

Fun fact

Bangladesh has one of the world's highest concentrations of field force apps — BRAC alone employs over 100,000 field staff. Apps like FieldBuzz process millions of field reports per month from areas where 2G EDGE (50 kbps) is the best available connectivity. Every byte of sync payload matters — gzip alone can reduce sync time by 70% on these connections.
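
As a rough illustration of the gzip point, the sketch below compresses a repetitive JSON payload with the JDK's GZIPOutputStream and prints the sizes. The payload shape and the helper names gzip and gunzip are made up for this example. Real savings depend heavily on the data: repetitive JSON compresses well, while JPEG photos are already compressed and gain little.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Compress a UTF-8 string into gzip bytes
fun gzip(text: String): ByteArray {
    val buffer = ByteArrayOutputStream()
    // use {} closes the stream, which flushes the gzip trailer into the buffer
    GZIPOutputStream(buffer).use { it.write(text.toByteArray(Charsets.UTF_8)) }
    return buffer.toByteArray()
}

// Decompress gzip bytes back into a UTF-8 string
fun gunzip(bytes: ByteArray): String =
    GZIPInputStream(ByteArrayInputStream(bytes)).use {
        it.readBytes().toString(Charsets.UTF_8)
    }

fun main() {
    // Hypothetical batch of 50 small reports with many repeated keys
    val payload = (1..50).joinToString(",", "[", "]") { i ->
        """{"taskId":"task-$i","workerId":"w-17","status":"DONE","answers":{"q1":"yes","q2":"no"}}"""
    }
    val compressed = gzip(payload)
    println("raw=${payload.toByteArray(Charsets.UTF_8).size} bytes, gzip=${compressed.size} bytes")
    check(gunzip(compressed) == payload) // round trip must be lossless
}
```

In an HTTP client this is usually handled at the transport layer through the Content-Encoding and Accept-Encoding headers rather than by hand.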

Hands-on challenge

Design the complete sync architecture for a field inspection app with these requirements: 100K reports per day, photos up to 5MB each, 72-hour max offline window, compliance requires audit trail of every sync attempt. Specify: (1) The exact Room schema with all tables and indices. (2) The SyncWorker implementation including retry logic and failure recording. (3) How you handle a scenario where the server rejects a report due to conflict (supervisor already updated the task status). (4) How you ensure photos are not uploaded twice if the app crashes mid-upload. (5) Battery impact analysis and mitigation for a 10-hour work shift.
