Designing Reliable Data Synchronization at Scale

Lessons from building resilient synchronization flows across external systems.

Keeping data consistent across systems we don’t control is one of the hardest problems we work on. Every external integration is a small distributed system: networks fail, third-party APIs rate-limit us, and the “source of truth” is often whichever side wrote last. This post covers the patterns we lean on to make synchronization predictable rather than heroic.

The shape of the problem

A sync flow looks deceptively simple: read from a source, transform, write to a destination. In practice each step can fail independently, and the same change can arrive more than once. We design every flow around three assumptions:

  • Delivery is at-least-once. Anything can be retried, so anything can be duplicated.
  • Ordering is not guaranteed. Two updates to the same record may arrive out of order.
  • The other system can disappear. Timeouts and partial outages are normal, not exceptional.

Idempotency is the foundation

If a write can run twice, it must produce the same result both times. We make this concrete by deriving a stable idempotency key from the payload and the target resource, and by treating writes as upserts keyed on that identity.

def sync_record(record: Record, client: ExternalClient) -> SyncResult:
    # A stable key for this logical change — same input, same key, every time.
    idempotency_key = f"{record.resource_type}:{record.id}:{record.version}"

    existing = client.get(record.id)
    if existing and existing.version >= record.version:
        # We've already applied this change (or a newer one). No-op.
        return SyncResult.skipped(reason="stale_or_duplicate")

    return client.upsert(
        record.id,
        payload=record.to_payload(),
        idempotency_key=idempotency_key,
    )

The version check turns “apply this update” into “apply this update only if it’s newer,” which neutralizes both duplicates and out-of-order delivery.

Retries that don’t make things worse

Naive retries turn a brief blip into a self-inflicted outage. We use bounded exponential backoff with jitter, classify errors before retrying, and cap the total attempts. Transient errors (timeouts, 429s, 503s) are retried; client errors (400s, validation failures) are not — they’re sent to a dead-letter queue for inspection instead.

Processing through a queue

We don’t sync inline with user requests. Changes are enqueued and processed by workers, which gives us back-pressure, natural batching, and a place to apply rate limits per downstream system. A queue also makes the system honest about load: when a downstream slows down, the queue depth rises visibly instead of silently timing out user requests.

Observability is part of correctness

A sync system you can’t see into is a sync system you don’t trust. For every flow we emit the same core signals: records processed, skipped, retried, and dead-lettered; end-to-end lag; and per-destination error rates. Lag in particular is the metric customers actually feel — it answers “how stale is the data they’re looking at right now?”

Customer impact comes first

All of this exists to protect a simple promise: the data customers see should be correct and reasonably fresh. When we make trade-offs — dropping ordering guarantees, accepting a few seconds of lag, deferring a non-critical field — we frame them in terms of what a customer would notice. That framing keeps the engineering grounded in something real.

Takeaways

  • Treat every integration as a distributed system with at-least-once delivery.
  • Make writes idempotent and version-aware so retries are always safe.
  • Retry transient failures with backoff; dead-letter the rest.
  • Process through a queue to get back-pressure and rate limiting for free.
  • Measure lag and error rates per destination — they map directly to customer experience.