Production Postmortem: Concurrency Outages
Real concurrency outages share patterns: thread pool exhaustion from a slow downstream, missing timeouts that turn local slowness into a global outage, retry storms after a brief failure, and deadlock under load that is invisible at low traffic. The fixes are usually defensive (timeouts, bulkheads, circuit breakers) rather than algorithmic.
What it is
A postmortem documents an outage: what happened, why, and what we'll change to prevent it. For concurrency outages specifically, the postmortem is often about defensive patterns that were missing, not about an individual bug.
This lesson walks through two concurrency outage shapes and what the postmortem looks like.
Pattern 1: Thread pool exhaustion
The story:
- 14:32, Downstream service A starts experiencing database degradation. A's response latency rises from 50ms to 30 seconds.
- 14:33, Service B (which calls A) sees its thread pool fill up. All 50 threads are stuck waiting on A.
- 14:34, Calls to B from clients (including for endpoints that don't even use A) start timing out. B is now functionally down.
- 14:35, Alerts fire: "B latency p99 > SLA". On-call paged.
- 14:38, On-call rolls back B's recent deploy. No effect (the cause is downstream).
- 14:42, A's database team discovers the issue and begins recovery.
- 14:50, A recovers. B's threads are released. B recovers.
Total impact: 18 minutes of B being unavailable. Root cause was A, but the proximate cause of the customer-visible outage was B's lack of timeouts and per-downstream pooling.
Action items:
- Per-downstream thread pools (bulkhead). When A fails, only A's pool fills.
- Per-call timeouts on every downstream. No more "wait forever".
- Circuit breakers around each downstream. Fail fast when it's down (a minimal breaker is sketched after this list).
- Alert on per-pool utilisation, not just on aggregate service latency.
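The bulkhead and timeout changes are shown in the Implementations section at the end of this lesson; the circuit breaker is only mentioned there. Here is a minimal sketch, assuming a hypothetical CircuitBreaker class wrapped around each (already time-bounded) downstream call; a production service would more likely reach for an existing library such as Resilience4j:

import java.util.concurrent.Callable;

// Hypothetical minimal circuit breaker: opens after N consecutive failures,
// fails fast while open, and allows a trial call again after a cool-down.
class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    private synchronized boolean allowCall() {
        // closed, or open long enough that a trial ("half-open") call is allowed
        return consecutiveFailures < failureThreshold
                || System.currentTimeMillis() - openedAt >= openMillis;
    }

    private synchronized void recordSuccess() { consecutiveFailures = 0; }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis(); // (re)start the cool-down
        }
    }

    <T> T call(Callable<T> downstreamCall) throws Exception {
        if (!allowCall()) {
            // fail fast: no thread sits waiting on a downstream known to be down
            throw new IllegalStateException("circuit open, failing fast");
        }
        try {
            T result = downstreamCall.call();
            recordSuccess();
            return result;
        } catch (Exception e) {
            recordFailure();
            throw e;
        }
    }
}

// One breaker per downstream, wrapping the already-timed-out call:
// CircuitBreaker breakerA = new CircuitBreaker(5, 10_000);
// A result = breakerA.call(() -> downstream.callA(req));

A real breaker would also limit the half-open state to a single probe and expose its open/closed state as a metric, which feeds the per-pool alerting item above.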
Pattern 2: Retry storm
The story:
- 09:00, Downstream service X has a 5-second blip (deploy rollover).
- 09:00:05, X recovers. But: 1000 client requests timed out during the blip. Each retries 3 times with exponential backoff but no jitter.
- 09:00:05, 1000 retries hit X simultaneously (all timed out at the same instant, all backed off by the same 100ms).
- 09:00:05, X cannot handle 1000 concurrent connections; starts failing.
- 09:00:05.2, The second round of 1000 retries hits X (200ms backoff).
- 09:00:05.6, The third round of 1000 retries hits X (400ms backoff).
- 09:01, Pattern continues; X is in a sustained overload state.
- 09:05, Operator manually disables retries; X recovers.
Root cause was the original 5-second blip, but the visible outage was the retry storm caused by no jitter and no circuit breaker.
Action items:
- Full jitter on all retries: delay = random(0, base * 2^attempt) (a sketch follows this list).
- Circuit breakers that open during sustained failures, preventing retry floods.
- Retry budget: cap retries per second across the fleet; once exceeded, fail fast.
- Load shedding on the downstream: when overloaded, return 503 instead of trying to serve everyone.
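The first and third items can be sketched in code. A minimal sketch follows, assuming a hypothetical Retrier.callWithRetry helper; the fleet-wide budget from the third item is approximated with a crude per-process counter, and the circuit breaker and load shedding items are not shown:

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical retry helper: full-jitter backoff plus a crude retry budget.
class Retrier {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_MILLIS = 100;
    // crude per-process budget: at most this many retries may be pending at once;
    // a real fleet-wide budget would live in shared config or the service mesh
    private static final int RETRY_BUDGET = 50;
    private static final AtomicInteger retriesInFlight = new AtomicInteger();

    static <T> T callWithRetry(Callable<T> call) throws Exception {
        try {
            return call.call();                      // first try, no delay
        } catch (Exception first) {
            Exception last = first;
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                if (retriesInFlight.incrementAndGet() > RETRY_BUDGET) {
                    retriesInFlight.decrementAndGet();
                    throw last;                      // budget exceeded: fail fast
                }
                try {
                    // full jitter: delay = random(0, base * 2^attempt), so clients
                    // that failed at the same instant do not retry at the same instant
                    long cap = BASE_MILLIS * (1L << attempt);
                    Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
                } finally {
                    retriesInFlight.decrementAndGet();
                }
                try {
                    return call.call();              // the retry itself
                } catch (Exception e) {
                    last = e;
                }
            }
            throw last;                              // all retries failed
        }
    }
}

// Usage (hypothetical): Retrier.callWithRetry(() -> client.callX(req));

With the fixed 100/200/400ms backoff in the timeline above, the 1000 retries arrive in three synchronized waves; with full jitter they spread across each window instead.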
What good postmortems include
Timeline: what happened in what order. Tied to logs, metrics, alerts. Specific timestamps.
Root causes: the actual bugs AND the gaps that allowed them to escalate. Concurrency outages almost always have multiple contributing causes.
Detection: how did we notice? Could we have noticed sooner? What signals were present but unmonitored?
Mitigation: what stopped the bleeding? Was it manual intervention, automatic recovery, or "the downstream came back"?
Action items: concrete fixes with owners and due dates. Not "improve resilience"; specifically "add per-downstream timeouts to service B by Friday."
Blameless framing: focus on what the system allowed, not who shipped the bug. "How did the system let this happen?" rather than "why did Alice miss this?"
Concurrency outages are usually defensive failures, not algorithmic bugs. The fix is almost always a missing timeout, bulkhead, circuit breaker, or jitter, not a clever new algorithm. And the postmortem itself is often the most valuable artifact an incident produces: it captures the lesson while it is fresh. Skipping postmortems for "small" incidents means missing the patterns that grow into the big ones. Run them blamelessly, and treat the action items as real work, not a checklist.
Implementations
A service uses a 50-thread pool to fan out to downstreams A and B. A's typical latency is 50ms; B's is 100ms. Neither call has a timeout configured. One day A's latency jumps to 30 seconds (its database is degraded). Within a minute, all 50 threads are stuck on A. Calls to B start failing with 'pool exhausted' even though B is healthy. The service appears down to users.
import java.util.concurrent.*;

// BEFORE the outage:
ExecutorService pool = Executors.newFixedThreadPool(50);

Response handle(Request req) throws Exception {
    Future<A> a = pool.submit(() -> downstream.callA(req));
    Future<B> b = pool.submit(() -> downstream.callB(req));
    return new Response(a.get(), b.get()); // no timeout!
}

// AFTER the outage, the fix:
// 1. Per-downstream pools (bulkhead): one pool per downstream
ExecutorService poolA = Executors.newFixedThreadPool(20);
ExecutorService poolB = Executors.newFixedThreadPool(20);

// 2. Per-call timeouts
Response handle(Request req) throws Exception {
    Future<A> a = poolA.submit(() -> downstream.callA(req));
    Future<B> b = poolB.submit(() -> downstream.callB(req));
    return new Response(
        a.get(2, TimeUnit.SECONDS), // bounded: throws TimeoutException instead of hanging
        b.get(2, TimeUnit.SECONDS)
    );
}

// 3. Circuit breakers around each downstream (not shown)
// 4. Alerts on per-pool utilisation, not just service latency

Key points
- Most outages have multiple contributing causes; rarely one bug.
- Common pattern: 'X got slow → no timeout → Y got tied up → cascade'.
- Timeline analysis: when did metrics change? Tie to deploys, traffic, downstream events.
- Action items: not just 'fix the bug' but 'prevent the class of bug': timeouts, bulkheads, alerts.
- The blameless postmortem: focus on systemic gaps, not individual mistakes.
Follow-up questions
- Why do most concurrency outages have multiple causes?
- What does a good postmortem cover?
- How do I prevent the next outage instead of just fixing this one?
- What metrics should I have for early detection?
Gotchas
- A postmortem that blames an individual instead of fixing the system
- Action items without owners or due dates never get done
- Fixing only the immediate bug, not the class of bug
- Not following through on the postmortem's findings: 'we'll add timeouts later'
- Skipping postmortems for 'small' incidents that turn out to be patterns