Background Job Queue
Persist jobs to durable storage (Postgres, Redis, SQS); workers pull and process them. Requirements: idempotent handlers (jobs can be delivered twice), a visibility timeout (a worker that crashes mid-job must release the job for retry), exponential retry with a dead-letter queue, scheduled/delayed jobs, and priority queuing.
What it is
A background job queue moves work out of the request path and into asynchronous workers. The user clicks "send", the request handler enqueues a job and responds immediately. A worker picks up the job, sends the email, marks it done.
The pattern is everywhere: email sending, image processing, batch reports, webhooks, daily/hourly jobs. The infrastructure pieces are the same: durable queue, workers, retries, dead-letter handling.
What it requires
Durable storage. Jobs must survive process and machine restart. In-memory queues lose work on crash. Real options: Postgres SKIP LOCKED, Redis (with persistence and consumer groups), AWS SQS, RabbitMQ, Kafka.
Visibility timeout. When a worker claims a job, the queue hides it for some duration (typically 30-300s). If the worker doesn't ack/delete within that window, the job reappears for another worker. Handles worker crashes automatically.
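The claim/ack/reappear cycle can be sketched in plain Java. This is an in-memory illustration of the mechanics only (durability is deliberately out of scope), and `VisibilityQueue` and its methods are invented for this sketch, not a real library:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.*;

// In-memory sketch of a visibility timeout. A claimed job carries a deadline;
// a periodic "reaper" returns expired claims to the pending state so another
// worker can retry them. A real queue would persist all of this state.
class VisibilityQueue {
    final Deque<String> pending = new ArrayDeque<>();
    final Map<String, Instant> inFlight = new HashMap<>(); // jobId -> visibility deadline
    final Duration timeout;

    VisibilityQueue(Duration timeout) { this.timeout = timeout; }

    void enqueue(String jobId) { pending.add(jobId); }

    // Claiming hides the job from other workers until the deadline.
    Optional<String> claim(Instant now) {
        String id = pending.poll();
        if (id == null) return Optional.empty();
        inFlight.put(id, now.plus(timeout));
        return Optional.of(id);
    }

    // Ack: the worker finished, so the job is deleted for good.
    void ack(String jobId) { inFlight.remove(jobId); }

    // Reaper: any claim past its deadline goes back to pending for retry.
    void reapExpired(Instant now) {
        Iterator<Map.Entry<String, Instant>> it = inFlight.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Instant> e = it.next();
            if (e.getValue().isBefore(now)) {
                pending.add(e.getKey());
                it.remove();
            }
        }
    }
}
```

Note that nothing here detects whether the original worker actually died; that is exactly why redelivery (and therefore idempotency) is unavoidable.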
At-least-once delivery + idempotent handlers. Visibility timeout means jobs can be delivered twice (worker timed out, but actually finished, just didn't ack). Handlers must tolerate this: idempotency key check, transactional "do + record done", or naturally idempotent operations.
Retry with backoff. Failed jobs are re-queued with increasing delay. Same exponential backoff with jitter as HTTP retries. Cap the attempts.
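The delay computation can be sketched as a pure function: exponential base, a cap, and full jitter so a burst of failures doesn't retry in lockstep. The constants and class name here are illustrative, not from any particular library:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of retry-delay computation: 2^attempt seconds, capped, with full
// jitter (uniform in [0, delay]) added on top of the deterministic delay.
class Backoff {
    static final long BASE_SEC = 2;
    static final long CAP_SEC = 600; // never wait more than 10 minutes

    // Deterministic exponential delay: 2, 4, 8, 16, ... capped at CAP_SEC.
    static long expDelaySec(int attempt) {
        double d = Math.pow(BASE_SEC, attempt);
        return (long) Math.min(d, CAP_SEC);
    }

    // Jittered delay: pick uniformly in [0, expDelay] to spread retries out.
    static long jitteredDelaySec(int attempt) {
        return ThreadLocalRandom.current().nextLong(expDelaySec(attempt) + 1);
    }
}
```

The cap matters as much as the exponent: without it, attempt 20 would schedule the retry roughly twelve days out.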
Dead-letter queue. After max attempts, the job moves to a separate "dead" table for human inspection. Without DLQ, persistently failing jobs retry forever and flood the queue.
The Postgres SKIP LOCKED pattern
For teams that already run Postgres, this avoids adding a new system:
UPDATE jobs SET state = 'processing', claimed_at = now()
WHERE id = (
    SELECT id FROM jobs
    WHERE state = 'pending' AND run_at <= now()
    ORDER BY priority DESC, run_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
Each worker selects one unlocked job, locks it, takes it. Others SKIP the locked rows. Throughput scales with Postgres. Transactional with the rest of the application's data.
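A worker loop around a claim query might look like the following sketch. `JobStore` is an invented interface standing in for the SQL above, and the in-memory implementation exists only so the example is self-contained:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of a polling worker loop over an abstract store. In a real system
// claimOne() would run the SKIP LOCKED query, markDone() the DELETE, and
// markFailed() the backoff/dead-letter logic. All names are illustrative.
interface JobStore {
    Optional<String> claimOne();
    void markDone(String jobId);
    void markFailed(String jobId);
}

class Worker {
    // Drain the store: claim, process, ack; on error, hand off to retry logic.
    static int drain(JobStore store, Consumer<String> handler) {
        int processed = 0;
        Optional<String> job;
        while ((job = store.claimOne()).isPresent()) {
            String id = job.get();
            try {
                handler.accept(id);
                store.markDone(id);   // only ack after the work succeeded
                processed++;
            } catch (RuntimeException e) {
                store.markFailed(id); // schedule retry or dead-letter
            }
        }
        return processed;
    }
}

// In-memory stand-in for the jobs table, used to exercise the loop.
class MemoryStore implements JobStore {
    final Deque<String> pending = new ArrayDeque<>();
    final List<String> done = new ArrayList<>();
    final List<String> failed = new ArrayList<>();
    public Optional<String> claimOne() { return Optional.ofNullable(pending.poll()); }
    public void markDone(String id) { done.add(id); }
    public void markFailed(String id) { failed.add(id); }
}
```

The key ordering decision is visible in the loop: the ack happens after the handler returns, never before, which is what makes a crash mid-handler safe (the job stays claimable) at the cost of possible redelivery.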
For higher throughput or more complex scheduling, dedicated queue systems (Sidekiq for Ruby, Celery for Python, BullMQ for Node, SQS for AWS) give better tooling and observability.
Why idempotency matters here
Visibility-timeout-based queues guarantee at-least-once delivery, not exactly-once. The flow:
- Worker A claims job 42, starts processing.
- Worker A successfully completes the work (charges the card).
- Worker A is killed before it can delete the job.
- Visibility timeout expires.
- Worker B claims job 42.
- Worker B charges the card again.
Without idempotency on the handler, the customer sees two charges. With idempotency (check "have I processed job 42?" before charging), worker B's attempt is a no-op.
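The check can be sketched with a `Set` standing in for a hypothetical `processed_jobs` table; in Postgres the membership check and the charge would share one transaction so they commit or roll back together:

```java
import java.util.*;

// Sketch of an idempotent handler: record the job id atomically with the
// side effect, and treat a duplicate delivery as a no-op. The Set stands in
// for a processed_jobs table; names here are illustrative.
class ChargeHandler {
    final Set<String> processed = new HashSet<>(); // processed_jobs table
    int chargesMade = 0;                           // stands in for the card charge

    void handle(String jobId) {
        // Set.add returns false if the id was already present -> duplicate.
        if (!processed.add(jobId)) {
            return; // already charged for this job; redelivery is a no-op
        }
        chargesMade++; // the real side effect happens once per job id
    }
}
```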
What every system gets wrong
Three common mistakes:
No DLQ. Failed jobs retry forever. The queue grows. Real failures are buried in noise. Always have a DLQ and an alert on its size.
Non-idempotent handlers. "It only happens occasionally" is what people say right before the duplicate-charge incident. Make handlers idempotent from day one.
Treating the queue as a database. Long-running queries against the queue table, transactions held open for the whole job duration, row-level locks competing with workers. Claim the job in a short transaction, do the long work in the worker outside that transaction, then update the job's state when it finishes.
Implementations
On failure, increment attempts, schedule with exponential backoff. After max attempts, move to dead-letter table for human inspection. Without DLQ, persistent failures pile up forever.
public void onFailure(Job job, Throwable err) {
    int next = job.attempts + 1;
    if (next >= MAX_ATTEMPTS) {
        // Dead-letter: copy to jobs_dead, then remove from the live queue.
        // In production run both statements in one transaction so a crash
        // between them can't lose the job or leave it in both tables.
        jdbc.update("INSERT INTO jobs_dead (id, payload, attempts, last_error) " +
                    "VALUES (?, ?, ?, ?)",
                    job.id, job.payload, next, err.toString());
        jdbc.update("DELETE FROM jobs WHERE id = ?", job.id);
        metrics.counter("jobs.dead").increment();
        return;
    }
    long base = (long) Math.pow(2, next); // 2, 4, 8, 16, ...
    // Full jitter so a burst of failures doesn't retry in lockstep
    // (ThreadLocalRandom is java.util.concurrent).
    long backoffSec = base + java.util.concurrent.ThreadLocalRandom.current().nextLong(base + 1);
    jdbc.update("UPDATE jobs SET state='pending', attempts=?, " +
                "run_at = now() + interval '1 second' * ? WHERE id = ?",
                next, backoffSec, job.id);
}

Key points
- Durable storage so jobs survive restart. Memory-only queues lose work on crash.
- Visibility timeout: when a worker pulls a job, hide it for N seconds. If the worker doesn't ack, it reappears.
- Idempotency on the handler side: jobs can be delivered more than once (visibility timeout expiry, worker crashes after work but before ack).
- Dead-letter queue (DLQ) for jobs that fail repeatedly. Don't retry forever; surface to humans.
- Popular implementations: Sidekiq and Resque (Ruby, Redis-backed), RQ (Python, Redis), BullMQ (Node, Redis), Celery (Python; RabbitMQ is the primary broker, Redis is supported but not the default), AWS SQS, and the Postgres SKIP LOCKED pattern.
Follow-up questions
- Why not just use a list in Redis (LPUSH/RPOP)?
- Why is Postgres SKIP LOCKED so popular?
- What goes in the dead-letter queue?
- How do I prioritise jobs?
Gotchas
- No visibility timeout: a worker crash mid-job means the job is either lost or re-runs immediately and floods retries.
- Non-idempotent handler with at-least-once delivery: side effects double on retry.
- Unbounded retries: a permanently failing job retries forever, drowning out healthy work.
- No DLQ inspection: dead jobs accumulate and no one notices the silent failure.
- Treating the queue as a database: long-running jobs hold locks, block vacuum/compaction, and slow queries.