Background Job Queue
Persist jobs to durable storage (Postgres, Redis, SQS); workers pull and process them. Requirements: idempotent handlers (jobs can be delivered twice), a visibility timeout (a worker that crashes mid-job must release the job for retry), exponential retry with a dead-letter queue, scheduled/delayed jobs, and priority queuing.
What it is
A background job queue moves work out of the request path and into asynchronous workers. The user clicks "send", the request handler enqueues a job and responds immediately. A worker picks up the job, sends the email, marks it done.
The pattern is everywhere: email sending, image processing, batch reports, webhooks, daily/hourly jobs. The infrastructure pieces are the same: durable queue, workers, retries, dead-letter handling.
What it requires
Durable storage. Jobs must survive process and machine restart. In-memory queues lose work on crash. Real options: Postgres SKIP LOCKED, Redis (with persistence and consumer groups), AWS SQS, RabbitMQ, Kafka.
Visibility timeout. When a worker claims a job, the queue hides it for some duration (typically 30-300s). If the worker doesn't ack/delete within that window, the job reappears for another worker. Handles worker crashes automatically.
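The claim/ack/reappear cycle can be sketched in plain Java. This is an in-memory illustration of the mechanics only (durability is deliberately out of scope), and `VisibilityQueue` and its methods are invented for this sketch, not a real library:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.*;

// In-memory sketch of a visibility timeout. A claimed job carries a deadline;
// a periodic "reaper" returns expired claims to the pending state so another
// worker can retry them. A real queue would persist all of this state.
class VisibilityQueue {
    final Deque<String> pending = new ArrayDeque<>();
    final Map<String, Instant> inFlight = new HashMap<>(); // jobId -> visibility deadline
    final Duration timeout;

    VisibilityQueue(Duration timeout) { this.timeout = timeout; }

    void enqueue(String jobId) { pending.add(jobId); }

    // Claiming hides the job from other workers until the deadline.
    Optional<String> claim(Instant now) {
        String id = pending.poll();
        if (id == null) return Optional.empty();
        inFlight.put(id, now.plus(timeout));
        return Optional.of(id);
    }

    // Ack: the worker finished, so the job is deleted for good.
    void ack(String jobId) { inFlight.remove(jobId); }

    // Reaper: any claim past its deadline goes back to pending for retry.
    void reapExpired(Instant now) {
        Iterator<Map.Entry<String, Instant>> it = inFlight.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Instant> e = it.next();
            if (e.getValue().isBefore(now)) {
                pending.add(e.getKey());
                it.remove();
            }
        }
    }
}
```

Note that nothing here detects whether the original worker actually died; that is exactly why redelivery (and therefore idempotency) is unavoidable.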
At-least-once delivery + idempotent handlers. Visibility timeout means jobs can be delivered twice (worker timed out, but actually finished, just didn't ack). Handlers must tolerate this: idempotency key check, transactional "do + record done", or naturally idempotent operations.
Retry with backoff. Failed jobs are re-queued with increasing delay. Same exponential backoff with jitter as HTTP retries. Cap the attempts.
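The delay computation can be sketched as a pure function: exponential base, a cap, and full jitter so a burst of failures doesn't retry in lockstep. The constants and class name here are illustrative, not from any particular library:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of retry-delay computation: 2^attempt seconds, capped, with full
// jitter (uniform in [0, delay]) added on top of the deterministic delay.
class Backoff {
    static final long BASE_SEC = 2;
    static final long CAP_SEC = 600; // never wait more than 10 minutes

    // Deterministic exponential delay: 2, 4, 8, 16, ... capped at CAP_SEC.
    static long expDelaySec(int attempt) {
        double d = Math.pow(BASE_SEC, attempt);
        return (long) Math.min(d, CAP_SEC);
    }

    // Jittered delay: pick uniformly in [0, expDelay] to spread retries out.
    static long jitteredDelaySec(int attempt) {
        return ThreadLocalRandom.current().nextLong(expDelaySec(attempt) + 1);
    }
}
```

The cap matters as much as the exponent: without it, attempt 20 would schedule the retry roughly twelve days out.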
Dead-letter queue. After max attempts, the job moves to a separate "dead" table for human inspection. Without DLQ, persistently failing jobs retry forever and flood the queue.
The Postgres SKIP LOCKED pattern
For teams that already run Postgres, this avoids adding a new system:
UPDATE jobs SET state = 'processing', claimed_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE state = 'pending' AND run_at <= now()
    ORDER BY priority DESC, run_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
Each worker selects one unlocked job, locks it, takes it. Others SKIP the locked rows. Throughput scales with Postgres. Transactional with the rest of the application's data.
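A worker loop around a claim query might look like the following sketch. `JobStore` is an invented interface standing in for the SQL above, and the in-memory implementation exists only so the example is self-contained:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of a polling worker loop over an abstract store. In a real system
// claimOne() would run the SKIP LOCKED query, markDone() the DELETE, and
// markFailed() the backoff/dead-letter logic. All names are illustrative.
interface JobStore {
    Optional<String> claimOne();
    void markDone(String jobId);
    void markFailed(String jobId);
}

class Worker {
    // Drain the store: claim, process, ack; on error, hand off to retry logic.
    static int drain(JobStore store, Consumer<String> handler) {
        int processed = 0;
        Optional<String> job;
        while ((job = store.claimOne()).isPresent()) {
            String id = job.get();
            try {
                handler.accept(id);
                store.markDone(id);   // only ack after the work succeeded
                processed++;
            } catch (RuntimeException e) {
                store.markFailed(id); // schedule retry or dead-letter
            }
        }
        return processed;
    }
}

// In-memory stand-in for the jobs table, used to exercise the loop.
class MemoryStore implements JobStore {
    final Deque<String> pending = new ArrayDeque<>();
    final List<String> done = new ArrayList<>();
    final List<String> failed = new ArrayList<>();
    public Optional<String> claimOne() { return Optional.ofNullable(pending.poll()); }
    public void markDone(String id) { done.add(id); }
    public void markFailed(String id) { failed.add(id); }
}
```

The key ordering decision is visible in the loop: the ack happens after the handler returns, never before, which is what makes a crash mid-handler safe (the job stays claimable) at the cost of possible redelivery.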
For higher throughput or more complex scheduling, dedicated queue systems (Sidekiq for Ruby, Celery for Python, BullMQ for Node, SQS for AWS) give better tooling and observability.
Why idempotency matters here
Visibility-timeout-based queues guarantee at-least-once delivery, not exactly-once. The flow:
- Worker A claims job 42, starts processing.
- Worker A successfully completes the work (charges the card).
- Worker A is killed before it can delete the job.
- Visibility timeout expires.
- Worker B claims job 42.
- Worker B charges the card again.
Without idempotency on the handler, the customer sees two charges. With idempotency (check "have I processed job 42?" before charging), worker B's attempt is a no-op.
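The check can be sketched with a `Set` standing in for a hypothetical `processed_jobs` table; in Postgres the membership check and the charge would share one transaction so they commit or roll back together:

```java
import java.util.*;

// Sketch of an idempotent handler: record the job id atomically with the
// side effect, and treat a duplicate delivery as a no-op. The Set stands in
// for a processed_jobs table; names here are illustrative.
class ChargeHandler {
    final Set<String> processed = new HashSet<>(); // processed_jobs table
    int chargesMade = 0;                           // stands in for the card charge

    void handle(String jobId) {
        // Set.add returns false if the id was already present -> duplicate.
        if (!processed.add(jobId)) {
            return; // already charged for this job; redelivery is a no-op
        }
        chargesMade++; // the real side effect happens once per job id
    }
}
```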
What every system gets wrong
Three common mistakes:
No DLQ. Failed jobs retry forever. The queue grows. Real failures are buried in noise. Always have a DLQ and an alert on its size.
Non-idempotent handlers. "It only happens occasionally" is what people say right before the duplicate-charge incident. Make handlers idempotent from day one.
Treating the queue as a database. Long-running queries against the queue table, transactions held open for the whole job duration, row-level locks competing with workers. Claim the job in a short transaction, do the long work in the worker outside that transaction, then update the job's state when it finishes.
Implementations
On failure, increment attempts, schedule with exponential backoff. After max attempts, move to dead-letter table for human inspection. Without DLQ, persistent failures pile up forever.
public void onFailure(Job job, Throwable err) {
    int next = job.attempts + 1;
    if (next >= MAX_ATTEMPTS) {
        // Dead-letter: copy to jobs_dead, then remove from the live queue.
        // In production run both statements in one transaction so a crash
        // between them can't lose the job or leave it in both tables.
        jdbc.update("INSERT INTO jobs_dead (id, payload, attempts, last_error) " +
                    "VALUES (?, ?, ?, ?)",
                    job.id, job.payload, next, err.toString());
        jdbc.update("DELETE FROM jobs WHERE id = ?", job.id);
        metrics.counter("jobs.dead").increment();
        return;
    }
    long base = (long) Math.pow(2, next); // 2, 4, 8, 16, ...
    // Full jitter so a burst of failures doesn't retry in lockstep
    // (ThreadLocalRandom is java.util.concurrent).
    long backoffSec = base + java.util.concurrent.ThreadLocalRandom.current().nextLong(base + 1);
    jdbc.update("UPDATE jobs SET state='pending', attempts=?, " +
                "run_at = now() + interval '1 second' * ? WHERE id = ?",
                next, backoffSec, job.id);
}

Key points
- Durable storage so jobs survive restart. Memory-only queues lose work on crash.
- Visibility timeout: when a worker pulls a job, hide it for N seconds. If the worker doesn't ack, it reappears.
- Idempotency on the handler side: jobs can be delivered more than once (visibility timeout expiry, worker crashes after work but before ack).
- Dead-letter queue (DLQ) for jobs that fail repeatedly. Don't retry forever; surface to humans.
- Popular implementations: Sidekiq and Resque (Ruby, Redis-backed), RQ (Python, Redis), BullMQ (Node, Redis), Celery (Python; RabbitMQ is the primary broker, Redis is supported but not the default), AWS SQS, and the Postgres SKIP LOCKED pattern.
Follow-up questions
- Why not just use a list in Redis (LPUSH/RPOP)?
- Why is Postgres SKIP LOCKED so popular?
- What goes in the dead-letter queue?
- How do I prioritise jobs?
Gotchas
- No visibility timeout: a worker crash mid-job means the job is either lost or re-runs immediately and floods retries.
- Non-idempotent handler with at-least-once delivery: side effects double on retry.
- Unbounded retries: a permanently failing job retries forever, drowning out healthy work.
- No DLQ inspection: dead jobs accumulate and no one notices the silent failure.
- Treating the queue as a database: long-running jobs hold locks, block vacuum/compaction, and slow queries.