What We Learned From Running Background Workers in Production

Background workers are where a lot of our most important work happens — sending notifications, generating exports, syncing data, billing. They’re also where the most surprising production incidents start. Here’s what running them at scale has taught us.

Jobs are not functions

A function call either returns or throws. A background job can also be retried, duplicated, delayed for hours, killed mid-execution, or run on code that has since been deployed over. Designing jobs means designing for all of those states, not just success and failure.

Make every job idempotent

Because jobs are retried, every job must be safe to run more than once. We bake this in by checking for completed work before doing it again, and by guarding side effects behind a unique key.

@app.task(bind=True, max_retries=5, acks_late=True)
def send_invoice_email(self, invoice_id: str):
    invoice = Invoice.objects.get(id=invoice_id)

    # Guard: if we've already sent it, don't send a duplicate on retry.
    if invoice.email_sent_at is not None:
        return

    try:
        mailer.send(invoice.to_email_payload())
    except TransientMailerError as exc:
        # Exponential backoff: 1s, 2s, 4s, 8s, 16s (request.retries starts at 0).
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

    invoice.mark_email_sent()

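Note acks_late=True: the task is acknowledged only after it finishes, so if the worker dies mid-job the broker redelivers the unacknowledged message. One caveat: when only the pool child process running the task is killed, Celery still acknowledges the message by default, and the task needs reject_on_worker_lost=True to be requeued instead. Either way, redelivery is exactly why the idempotency guard matters.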
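The "unique key" guard mentioned above can be sketched with nothing more than a uniqueness constraint. This is a minimal illustration, not the implementation behind this post: the job_claims table, the open_claims and run_once helpers, and the key format are all hypothetical, and SQLite stands in for whatever datastore a real system would use.

```python
import sqlite3


def open_claims(path=":memory:"):
    """Open a store with one row per claimed idempotency key (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_claims (idempotency_key TEXT PRIMARY KEY)"
    )
    return conn


def run_once(conn, key, side_effect):
    """Run side_effect() only if no earlier attempt has claimed key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO job_claims (idempotency_key) VALUES (?)", (key,)
            )
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the key is already claimed

    try:
        side_effect()
    except Exception:
        # Release the claim so a later retry is allowed to try again.
        with conn:
            conn.execute(
                "DELETE FROM job_claims WHERE idempotency_key = ?", (key,)
            )
        raise
    return True


conn = open_claims()
sent = []
run_once(conn, "invoice-email:inv_1", lambda: sent.append("inv_1"))
run_once(conn, "invoice-email:inv_1", lambda: sent.append("inv_1"))  # skipped
```

Claiming the key before the side effect trades one failure mode for another: a crash between the claim and the send means the email is never sent, whereas marking after the send (as in the task above) risks a duplicate. Which side to err on depends on the job.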

Timeouts everywhere

A job with no timeout is a job that can hold a worker forever. We set both a soft and a hard time limit on every task. The soft limit raises an exception the job can catch to clean up; the hard limit kills the process. Without these, one slow downstream call can starve an entire worker pool.
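Celery exposes these limits as the soft_time_limit and time_limit task options, with the soft limit surfacing as celery.exceptions.SoftTimeLimitExceeded. To show the mechanics without a broker, here is a rough stdlib-only sketch of the same two-stage idea. The run_with_limits helper and the two example jobs are hypothetical, and the approach (SIGALRM plus a forked child) is Unix-only.

```python
import multiprocessing as mp
import queue
import signal
import time


class SoftTimeLimitExceeded(Exception):
    """Raised inside the job when its soft limit expires."""


def _on_soft_limit(signum, frame):
    raise SoftTimeLimitExceeded()


def _run_job(job, soft_seconds, results):
    # Soft limit: a one-shot timer interrupts the job with a catchable exception.
    signal.signal(signal.SIGALRM, _on_soft_limit)
    signal.setitimer(signal.ITIMER_REAL, soft_seconds)
    try:
        result = job()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
    results.put(result)


def run_with_limits(job, soft_seconds, hard_seconds):
    """Run job in a child process; return its result, or "killed" if
    nothing came back before the hard limit."""
    ctx = mp.get_context("fork")  # Unix-only; keeps the sketch short
    results = ctx.Queue()
    proc = ctx.Process(target=_run_job, args=(job, soft_seconds, results))
    proc.start()
    try:
        outcome = results.get(timeout=hard_seconds)
    except queue.Empty:
        # Hard limit: the job did not respond to the soft limit, so kill it.
        proc.kill()
        outcome = "killed"
    proc.join()
    return outcome


def cooperative_job():
    try:
        time.sleep(5)  # stands in for a slow downstream call
        return "finished"
    except SoftTimeLimitExceeded:
        return "cleaned up"  # release locks, close files, and so on


def stuck_job():
    while True:
        try:
            time.sleep(5)
        except SoftTimeLimitExceeded:
            pass  # ignores the soft limit, so only the hard limit stops it
```

The gap between the two limits is the job's cleanup budget, so the hard limit should sit comfortably above the soft one.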

Isolate queues by workload

Fast, latency-sensitive jobs (like sending a verification email) should not share a queue with slow, bulky jobs (like generating a large export). We route them to separate queues with separate worker pools so a backlog in one doesn’t delay the other. This single change did more for perceived reliability than almost anything else.
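In Celery this kind of routing is usually expressed through the task_routes setting, with each worker pool started against its own queue via the -Q option. The decision itself is small enough to sketch in plain Python. The task-name patterns and queue names below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical task-name patterns and queue names, for illustration only.
ROUTES = [
    ("notifications.*", "fast"),  # latency-sensitive, seconds matter
    ("exports.*", "bulk"),        # slow and heavy, minutes are fine
]
DEFAULT_QUEUE = "default"


def queue_for(task_name: str) -> str:
    """Return the queue a task should be published to (first match wins)."""
    for pattern, queue_name in ROUTES:
        if fnmatchcase(task_name, pattern):
            return queue_name
    return DEFAULT_QUEUE


print(queue_for("notifications.send_verification_email"))
print(queue_for("exports.generate_large_export"))
```

Routing only helps if the worker pools are separate too: a single pool consuming from both queues would still let bulk jobs occupy every worker.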

Watch the queue, not just the workers

The earliest signal of trouble is usually queue depth and oldest-message age, not CPU. A queue that’s growing faster than workers can drain it is a problem in progress. We alert on sustained growth and on the age of the oldest pending job, which directly reflects how long a user is waiting.
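As a rough sketch of those two alert conditions, assuming queue metrics are sampled periodically by some external collector: the QueueSample shape, the thresholds, and the alert names here are hypothetical, and "sustained growth" is simplified to "depth rose in every one of the last few samples".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSample:
    depth: int                 # messages waiting in the queue
    oldest_age_seconds: float  # how long the oldest pending job has waited


def check_queue(
    samples: List[QueueSample],
    max_oldest_age: float = 60.0,
    growth_window: int = 3,
) -> List[str]:
    """Return the names of alerts that should fire for the newest sample."""
    alerts = []
    if not samples:
        return alerts

    # Oldest-message age maps directly to how long some user has been waiting.
    if samples[-1].oldest_age_seconds > max_oldest_age:
        alerts.append("oldest_message_age")

    # Sustained growth: depth increased across every recent sample.
    recent = samples[-growth_window:]
    if len(recent) == growth_window and all(
        later.depth > earlier.depth for earlier, later in zip(recent, recent[1:])
    ):
        alerts.append("sustained_growth")

    return alerts


history = [QueueSample(10, 5.0), QueueSample(12, 8.0), QueueSample(15, 12.0)]
print(check_queue(history))
```

Requiring growth across a whole window, rather than a single high reading, keeps a brief burst from paging anyone.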

Make poison messages visible

Some jobs will never succeed — bad input, a deleted record, a permanent downstream rejection. After a bounded number of retries we move them to a dead-letter queue instead of retrying forever. Dead-lettering turns an invisible infinite-retry loop into a visible, fixable list of failures.
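The bounded-retry-then-dead-letter flow can be sketched with an in-memory queue. This is an illustration under simplifying assumptions, not a description of any particular broker's dead-letter feature: the drain function and the record format are hypothetical, and retries here are immediate, with no backoff.

```python
from collections import deque


def drain(pending, handler, dead_letters, max_attempts=3):
    """Process (job, attempts) pairs; dead-letter jobs that keep failing."""
    while pending:
        job, attempts = pending.popleft()
        try:
            handler(job)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Give up: record the failure where a human can see and fix it.
                dead_letters.append(
                    {"job": job, "attempts": attempts, "error": str(exc)}
                )
            else:
                pending.append((job, attempts))  # retry later


def handle(job):
    if job == "bad":
        raise ValueError("deleted record")


pending = deque([("ok", 0), ("bad", 0)])
dead_letters = []
drain(pending, handle, dead_letters)
print(dead_letters)
```

Keeping the attempt count and the last error with each dead-lettered job is what makes the list fixable rather than merely visible.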

Takeaways

Design jobs for retries, duplication, and mid-execution death — not just the happy path.
Make every job idempotent, and acknowledge late so crashes redeliver.
Set soft and hard timeouts on everything.
Isolate queues by workload so slow jobs can’t starve fast ones.
Alert on queue depth and oldest-message age; dead-letter poison messages.
