Production Postmortem: Concurrency Outages
Real concurrency outages share patterns: thread pool exhaustion from a slow downstream, missing timeouts that turn local slowness into a global outage, retry storms after a brief failure, and deadlock under load that is invisible at low traffic. The fixes are usually defensive (timeouts, bulkheads, circuit breakers) rather than algorithmic.
What it is
A postmortem documents an outage: what happened, why, and what we'll change to prevent it. For concurrency outages specifically, the postmortem is often about defensive patterns that were missing, not about an individual bug.
This lesson walks through two concurrency outage shapes and what the postmortem looks like.
Pattern 1: Thread pool exhaustion
The story:
- 14:32, Downstream service A starts experiencing database degradation. A's response latency rises from 50ms to 30 seconds.
- 14:33, Service B (which calls A) sees its thread pool fill up. All 50 threads are stuck waiting on A.
- 14:34, Calls to B from clients (including for endpoints that don't even use A) start timing out. B is now functionally down.
- 14:35, Alerts fire: "B latency p99 > SLA". On-call paged.
- 14:38, On-call rolls back B's recent deploy. No effect (the cause is downstream).
- 14:42, A's database team discovers the issue and begins recovery.
- 14:50, A recovers. B's threads are released. B recovers.
Total impact: 18 minutes of B being unavailable. Root cause was A, but the proximate cause of the customer-visible outage was B's lack of timeouts and per-downstream pooling.
Action items:
- Per-downstream thread pools (bulkhead). When A fails, only A's pool fills.
- Per-call timeouts on every downstream. No more "wait forever".
- Circuit breakers around each downstream. Fail fast when it's down (a minimal breaker is sketched after this list).
- Alert on per-pool utilisation, not just on aggregate service latency.
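The bulkhead and timeout changes are shown in the Implementations section at the end of this lesson; the circuit breaker is only mentioned there. Here is a minimal sketch, assuming a hypothetical CircuitBreaker class wrapped around each (already time-bounded) downstream call; a production service would more likely reach for an existing library such as Resilience4j:

import java.util.concurrent.Callable;

// Hypothetical minimal circuit breaker: opens after N consecutive failures,
// fails fast while open, and allows a trial call again after a cool-down.
class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    private synchronized boolean allowCall() {
        // closed, or open long enough that a trial ("half-open") call is allowed
        return consecutiveFailures < failureThreshold
                || System.currentTimeMillis() - openedAt >= openMillis;
    }

    private synchronized void recordSuccess() { consecutiveFailures = 0; }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis(); // (re)start the cool-down
        }
    }

    <T> T call(Callable<T> downstreamCall) throws Exception {
        if (!allowCall()) {
            // fail fast: no thread sits waiting on a downstream known to be down
            throw new IllegalStateException("circuit open, failing fast");
        }
        try {
            T result = downstreamCall.call();
            recordSuccess();
            return result;
        } catch (Exception e) {
            recordFailure();
            throw e;
        }
    }
}

// One breaker per downstream, wrapping the already-timed-out call:
// CircuitBreaker breakerA = new CircuitBreaker(5, 10_000);
// A result = breakerA.call(() -> downstream.callA(req));

A real breaker would also limit the half-open state to a single probe and expose its open/closed state as a metric, which feeds the per-pool alerting item above.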
Pattern 2: Retry storm
The story:
- 09:00, Downstream service X has a 5-second blip (deploy rollover).
- 09:00:05, X recovers. But: 1000 client requests timed out during the blip. Each retries 3 times with exponential backoff but no jitter.
- 09:00:05, 1000 retries hit X simultaneously (all timed out at the same instant, all backed off by the same 100ms).
- 09:00:05, X cannot handle 1000 concurrent connections; starts failing.
- 09:00:05.2, The second round of 1000 retries hits X (200ms backoff).
- 09:00:05.6, The third round of 1000 retries hits X (400ms backoff).
- 09:01, Pattern continues; X is in a sustained overload state.
- 09:05, Operator manually disables retries; X recovers.
Root cause was the original 5-second blip, but the visible outage was the retry storm caused by no jitter and no circuit breaker.
Action items:
- Full jitter on all retries: delay = random(0, base * 2^attempt) (a sketch follows this list).
- Circuit breakers that open during sustained failures, preventing retry floods.
- Retry budget: cap retries per second across the fleet; once exceeded, fail fast.
- Load shedding on the downstream: when overloaded, return 503 instead of trying to serve everyone.
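The first and third items can be sketched in code. A minimal sketch follows, assuming a hypothetical Retrier.callWithRetry helper; the fleet-wide budget from the third item is approximated with a crude per-process counter, and the circuit breaker and load shedding items are not shown:

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical retry helper: full-jitter backoff plus a crude retry budget.
class Retrier {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_MILLIS = 100;
    // crude per-process budget: at most this many retries may be pending at once;
    // a real fleet-wide budget would live in shared config or the service mesh
    private static final int RETRY_BUDGET = 50;
    private static final AtomicInteger retriesInFlight = new AtomicInteger();

    static <T> T callWithRetry(Callable<T> call) throws Exception {
        try {
            return call.call();                      // first try, no delay
        } catch (Exception first) {
            Exception last = first;
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                if (retriesInFlight.incrementAndGet() > RETRY_BUDGET) {
                    retriesInFlight.decrementAndGet();
                    throw last;                      // budget exceeded: fail fast
                }
                try {
                    // full jitter: delay = random(0, base * 2^attempt), so clients
                    // that failed at the same instant do not retry at the same instant
                    long cap = BASE_MILLIS * (1L << attempt);
                    Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
                } finally {
                    retriesInFlight.decrementAndGet();
                }
                try {
                    return call.call();              // the retry itself
                } catch (Exception e) {
                    last = e;
                }
            }
            throw last;                              // all retries failed
        }
    }
}

// Usage (hypothetical): Retrier.callWithRetry(() -> client.callX(req));

With the fixed 100/200/400ms backoff in the timeline above, the 1000 retries arrive in three synchronized waves; with full jitter they spread across each window instead.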
What good postmortems include
Timeline: what happened in what order. Tied to logs, metrics, alerts. Specific timestamps.
Root causes: the actual bugs AND the gaps that allowed them to escalate. Concurrency outages almost always have multiple contributing causes.
Detection: how did we notice? Could we have noticed sooner? What signals were present but unmonitored?
Mitigation: what stopped the bleeding? Was it manual intervention, automatic recovery, or "the downstream came back"?
Action items: concrete fixes with owners and due dates. Not "improve resilience"; specifically "add per-downstream timeouts to service B by Friday."
Blameless framing: focus on what the system allowed, not who shipped the bug. "How did the system let this happen?" rather than "why did Alice miss this?"
Concurrency outages are usually defensive failures, not algorithmic bugs. The fix is almost always a missing timeout, bulkhead, circuit breaker, or jitter, not a clever new algorithm. And the postmortem itself is often the most valuable artifact an incident produces: it captures the lesson while it is fresh. Skipping postmortems for "small" incidents means missing the patterns that grow into the big ones. Run them blamelessly, and treat the action items as real work, not a checklist.
Implementations
A service uses a 50-thread pool to fan out to downstreams A and B. A's typical latency is 50ms; B's is 100ms. Neither call has a timeout configured. One day A's latency jumps to 30 seconds (its database is degraded). Within a minute, all 50 threads are stuck on A. Calls to B start failing with 'pool exhausted' even though B is healthy. The service appears down to users.
import java.util.concurrent.*;

// BEFORE the outage:
ExecutorService pool = Executors.newFixedThreadPool(50);

Response handle(Request req) throws Exception {
    Future<A> a = pool.submit(() -> downstream.callA(req));
    Future<B> b = pool.submit(() -> downstream.callB(req));
    return new Response(a.get(), b.get()); // no timeout!
}

// AFTER the outage, the fix:
// 1. Per-downstream pools (bulkhead): one pool per downstream
ExecutorService poolA = Executors.newFixedThreadPool(20);
ExecutorService poolB = Executors.newFixedThreadPool(20);

// 2. Per-call timeouts
Response handle(Request req) throws Exception {
    Future<A> a = poolA.submit(() -> downstream.callA(req));
    Future<B> b = poolB.submit(() -> downstream.callB(req));
    return new Response(
        a.get(2, TimeUnit.SECONDS), // bounded: throws TimeoutException instead of hanging
        b.get(2, TimeUnit.SECONDS)
    );
}

// 3. Circuit breakers around each downstream (not shown)
// 4. Alerts on per-pool utilisation, not just service latency

Key points
- Most outages have multiple contributing causes; rarely one bug.
- Common pattern: 'X got slow → no timeout → Y got tied up → cascade'.
- Timeline analysis: when did metrics change? Tie to deploys, traffic, downstream events.
- Action items: not just 'fix the bug' but 'prevent the class of bug': timeouts, bulkheads, alerts.
- The blameless postmortem: focus on systemic gaps, not individual mistakes.
Follow-up questions
- Why do most concurrency outages have multiple causes?
- What does a good postmortem cover?
- How do I prevent the next outage instead of just fixing this one?
- What metrics should I have for early detection?
Gotchas
- A postmortem that blames an individual instead of fixing the system
- Action items without owners or due dates never get done
- Fixing only the immediate bug, not the class of bug
- Not following through on the postmortem's findings: 'we'll add timeouts later'
- Skipping postmortems for 'small' incidents that turn out to be patterns