Bug Hunt: Why Does This Deadlock Under Load?
Two threads each acquire two locks, in opposite order. Most of the time it works because the timing doesn't line up; under load, eventually it does, and both threads block forever waiting for each other. Fix it with global lock ordering or tryLock with a timeout.
The puzzle
A bank-transfer service runs perfectly in dev. Tests pass. Stage runs fine for hours. The day it ships to production, the on-call SRE gets paged at 4 a.m.: every transfer endpoint is hung, CPU is at 0%, and there are no errors in the logs.
A thread dump is in hand. What should you look for?
Two clues that scream deadlock
- CPU at 0% but threads are alive: they're blocked, not crashing.
- Multiple threads stuck waiting on locks, with their wait-for graph forming a cycle.
The bug below is the most common form: two threads, two locks, opposite acquisition order. It's not exotic; every "transfer between two accounts" kind of code is vulnerable.
What to look for in the broken code
Read the broken code below. The clue: each thread holds one lock and waits for another. Trace what happens when thread 1 calls transfer(A, B) and thread 2 simultaneously calls transfer(B, A): each grabs the first argument's lock, then tries to grab the second argument's lock, which the other thread already holds.
Why "under load" matters The race window is small. Low traffic = the threads serialize naturally and never overlap. High traffic = many concurrent transfers = eventually the unlucky timing happens. Latent deadlock bugs scale with load, the worst kind to debug.
The pattern of fixes
Three escape routes from the Coffman conditions
- Total lock ordering (breaks circular wait): every code path acquires locks in the same global order. Sort by ID, hash, or some stable key.
- tryLock with timeout (breaks no-preemption): if all locks can't be acquired within a deadline, release and retry. Add jitter to avoid livelock.
- Single-owner pattern (removes mutual exclusion: there are no locks at all): one goroutine/thread/actor owns the resource; others communicate via channels/queues. Idiomatic in Go; the actor model in Erlang/Akka.
The first is the cheapest in code; the third is the cheapest mentally, since there's no concurrent state to reason about (a sketch of the single-owner pattern follows below). Both are dramatically better than building deadlock-detection logic into the application.
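A minimal sketch of the single-owner pattern in Java, assuming a hypothetical TransferLedger class (the name and the String account keys are ours, not from the code above): one thread owns all balances, so no locks exist and no cycle can form.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: all balances are owned by a single thread,
// so there is nothing to lock and no possibility of deadlock.
class TransferLedger {
    private final Map<String, Integer> balances = new HashMap<>();
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    Future<Boolean> transfer(String from, String to, int amount) {
        // Every mutation runs on the single owner thread, in submission order.
        return owner.submit(() -> {
            int fromBalance = balances.getOrDefault(from, 0);
            if (fromBalance < amount) return false;
            balances.put(from, fromBalance - amount);
            balances.merge(to, amount, Integer::sum);
            return true;
        });
    }
}

Callers submit work and get a Future back; the design trades lock discipline for a serialization point, which is exactly why there's no concurrent state to reason about.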
Production debugging checklist
When a service hangs, before restarting:
- Capture a thread dump (jstack, py-spy dump, or SIGQUIT for Go in dev).
- Look for "Found N deadlocked threads" (Java labels it explicitly; a programmatic check is sketched after this list).
- If not Java: look for blocked threads whose wait-for chains form a cycle.
- Note which two (or more) locks are involved; that's where to fix the ordering.
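For Java services, the JDK can find the cycle for you via the standard ThreadMXBean API. A minimal sketch (the class name DeadlockCheck is ours):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockCheck {
    // Returns a human-readable report of deadlocked threads, or null if none.
    static String findDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads();  // null when no deadlock exists
        if (ids == null) return null;
        StringBuilder report = new StringBuilder("Found deadlocked threads:\n");
        for (ThreadInfo info : bean.getThreadInfo(ids)) {
            report.append(info.getThreadName())
                  .append(" waiting on ").append(info.getLockName())
                  .append(" held by ").append(info.getLockOwnerName())
                  .append('\n');
        }
        return report.toString();
    }
}

Wiring this into a health-check endpoint turns a 4 a.m. mystery into a labeled report.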
Implementations
A bank transfer between accounts. Each account has a lock. Looks fine. Run two threads concurrently calling transfer(A, B) and transfer(B, A): what happens?
import java.util.concurrent.locks.ReentrantLock;

class Account {
    private final ReentrantLock lock = new ReentrantLock();
    int balance;

    static void transfer(Account from, Account to, int amount) {
        from.lock.lock();
        try {
            to.lock.lock(); // ← spot the bug
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally { to.lock.unlock(); }
        } finally { from.lock.unlock(); }
    }
}

// Thread 1: transfer(a, b, 100)  // grabs a, waits for b
// Thread 2: transfer(b, a, 50)   // grabs b, waits for a
// → DEADLOCK

The bug: thread 1 grabs a and then waits for b; thread 2 grabs b and then waits for a. Each holds what the other needs, so both block forever: the canonical deadlock.
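A minimal harness to reproduce the hang, assuming the Account class above (the class name DeadlockDemo and the loop counts are ours):

class DeadlockDemo {
    public static void main(String[] args) throws InterruptedException {
        Account a = new Account(), b = new Account();
        a.balance = 1_000_000;
        b.balance = 1_000_000;
        // Opposite lock orders in a tight loop: under this "load", the
        // unlucky interleaving shows up almost immediately.
        Thread t1 = new Thread(() -> { for (int i = 0; i < 100_000; i++) Account.transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 100_000; i++) Account.transfer(b, a, 1); });
        t1.start(); t2.start();
        t1.join(5_000);  // give it five seconds
        System.out.println(t1.isAlive() ? "hung: almost certainly deadlocked" : "finished (got lucky this run)");
    }
}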
The fix: enforce a total ordering on locks. Both threads acquire the lower-id lock first, so the cycle is impossible. Use System.identityHashCode for a stable ordering when objects don't have natural IDs.

static void transfer(Account from, Account to, int amount) {
    // Always acquire locks in a consistent order, regardless of from/to.
    // (identityHashCode ties are rare but possible; a global tie-breaker
    // lock covers that case.)
    Account first = System.identityHashCode(from) < System.identityHashCode(to) ? from : to;
    Account second = first == from ? to : from;
    first.lock.lock();
    try {
        second.lock.lock();
        try {
            from.balance -= amount;
            to.balance += amount;
        } finally { second.lock.unlock(); }
    } finally { first.lock.unlock(); }
}

// Alternative: tryLock with timeout (back off on contention).
// Needs java.util.concurrent.TimeUnit; named tryTransfer so it can
// coexist with transfer above.
static boolean tryTransfer(Account from, Account to, int amount) throws InterruptedException {
    if (!from.lock.tryLock(50, TimeUnit.MILLISECONDS)) return false;
    try {
        if (!to.lock.tryLock(50, TimeUnit.MILLISECONDS)) return false;
        try {
            from.balance -= amount;
            to.balance += amount;
            return true;
        } finally { to.lock.unlock(); }
    } finally { from.lock.unlock(); }
}
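To avoid livelock when both sides keep timing out and retrying in lockstep, wrap tryTransfer in a jittered retry loop. A sketch (the method name transferWithRetry and the backoff bounds are ours):

static void transferWithRetry(Account from, Account to, int amount) throws InterruptedException {
    // Random backoff de-synchronizes competing callers so they don't
    // keep colliding and retrying forever (livelock).
    while (!tryTransfer(from, to, amount)) {
        Thread.sleep(java.util.concurrent.ThreadLocalRandom.current().nextLong(1, 20));
    }
}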
Key points
- Under low load, the unlucky interleaving is rare, so the bug looks like a flaky hang
- Under high load, it becomes nearly inevitable
- All four Coffman conditions must hold; break any ONE → no deadlock
- The cheapest fix: enforce a global total ordering on lock acquisition
Follow-up questions
- Why does the deadlock only happen 'sometimes'?
- What are the four Coffman conditions?
- How is this detected in production?
- When is tryLock preferable to lock ordering?
Gotchas
- Reentrant locks (Java ReentrantLock, Python RLock) prevent SELF-deadlock but NOT mutual deadlock between threads (see the sketch after this list)
- Go's runtime only catches 'all goroutines asleep'; partial deadlocks go undetected
- Holding a lock while calling user-supplied callback code is a deadlock waiting to happen
- Sorting locks by hash is fine for in-process objects, but pointers across processes (shared memory) need stable IDs
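A quick illustration of the first gotcha, assuming the Account class above: reentrancy saves a thread from blocking on its own lock, but does nothing about a cycle between two threads.

static void reentrancyDemo(Account a) {
    a.lock.lock();
    try {
        a.lock.lock();  // fine: the same thread re-enters its own ReentrantLock
        try {
            a.balance += 0;
        } finally { a.lock.unlock(); }
    } finally { a.lock.unlock(); }
    // Reentrancy is per-thread. If another thread holds b's lock and wants
    // a's lock while we hold a's lock and want b's, we still deadlock.
}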
Banking transfers, multi-resource allocation, distributed transactions: anything with more than one lock. The Mars Pathfinder famously hit a related bug (priority inversion). Many production hangs at scale turn out to be deadlocks.