Bug Hunt: Why Does This Deadlock Under Load?
Two threads each acquire two locks, in opposite order. Most of the time it works because the timing doesn't line up; under load, eventually it does, and both threads block forever waiting for each other. Fix it with global lock ordering or tryLock with a timeout.
The puzzle
A bank-transfer service runs perfectly in dev. Tests pass. Stage runs fine for hours. The day it ships to production, the on-call SRE gets paged at 4 a.m.: every transfer endpoint is hung, CPU is at 0%, and there are no errors in the logs.
A thread dump is in hand. What should you look for?
Two clues that scream deadlock
- CPU at 0% but threads are alive: they're blocked, not crashing.
- Multiple threads stuck waiting on locks, with their wait-for graph forming a cycle.
The bug below is the most common form: two threads, two locks, opposite acquisition order. It's not exotic; every "transfer between two accounts" kind of code is vulnerable.
What to look for in the broken code
Read the broken code below. The clue: each thread holds one lock and waits for another. Trace what happens when thread 1 calls transfer(A, B) and thread 2 simultaneously calls transfer(B, A): each grabs the first argument's lock, then tries to grab the second argument's lock, which the other thread already holds.
Why "under load" matters The race window is small. Low traffic = the threads serialize naturally and never overlap. High traffic = many concurrent transfers = eventually the unlucky timing happens. Latent deadlock bugs scale with load, the worst kind to debug.
The pattern of fixes
Three escape routes from the Coffman conditions
- Total lock ordering (breaks circular wait): every code path acquires locks in the same global order. Sort by ID, hash, or some stable key.
- tryLock with timeout (breaks no-preemption): if all locks can't be acquired within a deadline, release and retry. Add jitter to avoid livelock.
- Single-owner pattern (removes mutual exclusion: there are no locks at all): one goroutine/thread/actor owns the resource; others communicate via channels/queues. Idiomatic in Go; the actor model in Erlang/Akka.
The first is the cheapest in code; the third is the cheapest mentally, since there's no concurrent state to reason about (a sketch of the single-owner pattern follows below). Both are dramatically better than building deadlock-detection logic into the application.
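A minimal sketch of the single-owner pattern in Java, assuming a hypothetical TransferLedger class (the name and the String account keys are ours, not from the code above): one thread owns all balances, so no locks exist and no cycle can form.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: all balances are owned by a single thread,
// so there is nothing to lock and no possibility of deadlock.
class TransferLedger {
    private final Map<String, Integer> balances = new HashMap<>();
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    Future<Boolean> transfer(String from, String to, int amount) {
        // Every mutation runs on the single owner thread, in submission order.
        return owner.submit(() -> {
            int fromBalance = balances.getOrDefault(from, 0);
            if (fromBalance < amount) return false;
            balances.put(from, fromBalance - amount);
            balances.merge(to, amount, Integer::sum);
            return true;
        });
    }
}

Callers submit work and get a Future back; the design trades lock discipline for a serialization point, which is exactly why there's no concurrent state to reason about.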
Production debugging checklist
When a service hangs, before restarting:
- Capture a thread dump (jstack, py-spy dump, or SIGQUIT for Go in dev).
- Look for "Found N deadlocked threads" (Java labels it explicitly; a programmatic check is sketched after this list).
- If not Java: look for blocked threads whose wait-for chains form a cycle.
- Note which two (or more) locks are involved; that's where to fix the ordering.
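For Java services, the JDK can find the cycle for you via the standard ThreadMXBean API. A minimal sketch (the class name DeadlockCheck is ours):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockCheck {
    // Returns a human-readable report of deadlocked threads, or null if none.
    static String findDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads();  // null when no deadlock exists
        if (ids == null) return null;
        StringBuilder report = new StringBuilder("Found deadlocked threads:\n");
        for (ThreadInfo info : bean.getThreadInfo(ids)) {
            report.append(info.getThreadName())
                  .append(" waiting on ").append(info.getLockName())
                  .append(" held by ").append(info.getLockOwnerName())
                  .append('\n');
        }
        return report.toString();
    }
}

Wiring this into a health-check endpoint turns a 4 a.m. mystery into a labeled report.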
Implementations
A bank transfer between accounts. Each account has a lock. Looks fine. Run two threads concurrently calling transfer(A, B) and transfer(B, A): what happens?
import java.util.concurrent.locks.ReentrantLock;

class Account {
    private final ReentrantLock lock = new ReentrantLock();
    int balance;

    static void transfer(Account from, Account to, int amount) {
        from.lock.lock();
        try {
            to.lock.lock(); // ← spot the bug
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally { to.lock.unlock(); }
        } finally { from.lock.unlock(); }
    }
}

// Thread 1: transfer(a, b, 100)  // grabs a, waits for b
// Thread 2: transfer(b, a, 50)   // grabs b, waits for a
// → DEADLOCK

The bug: thread 1 grabs a and then waits for b; thread 2 grabs b and then waits for a. Each holds what the other needs, so both block forever: the canonical deadlock.
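A minimal harness to reproduce the hang, assuming the Account class above (the class name DeadlockDemo and the loop counts are ours):

class DeadlockDemo {
    public static void main(String[] args) throws InterruptedException {
        Account a = new Account(), b = new Account();
        a.balance = 1_000_000;
        b.balance = 1_000_000;
        // Opposite lock orders in a tight loop: under this "load", the
        // unlucky interleaving shows up almost immediately.
        Thread t1 = new Thread(() -> { for (int i = 0; i < 100_000; i++) Account.transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 100_000; i++) Account.transfer(b, a, 1); });
        t1.start(); t2.start();
        t1.join(5_000);  // give it five seconds
        System.out.println(t1.isAlive() ? "hung: almost certainly deadlocked" : "finished (got lucky this run)");
    }
}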
The fix: enforce a total ordering on locks. Both threads acquire the lower-id lock first, so the cycle is impossible. Use System.identityHashCode for a stable ordering when objects don't have natural IDs.

static void transfer(Account from, Account to, int amount) {
    // Always acquire locks in a consistent order, regardless of from/to.
    // (identityHashCode ties are rare but possible; a global tie-breaker
    // lock covers that case.)
    Account first = System.identityHashCode(from) < System.identityHashCode(to) ? from : to;
    Account second = first == from ? to : from;
    first.lock.lock();
    try {
        second.lock.lock();
        try {
            from.balance -= amount;
            to.balance += amount;
        } finally { second.lock.unlock(); }
    } finally { first.lock.unlock(); }
}

// Alternative: tryLock with timeout (back off on contention).
// Needs java.util.concurrent.TimeUnit; named tryTransfer so it can
// coexist with transfer above.
static boolean tryTransfer(Account from, Account to, int amount) throws InterruptedException {
    if (!from.lock.tryLock(50, TimeUnit.MILLISECONDS)) return false;
    try {
        if (!to.lock.tryLock(50, TimeUnit.MILLISECONDS)) return false;
        try {
            from.balance -= amount;
            to.balance += amount;
            return true;
        } finally { to.lock.unlock(); }
    } finally { from.lock.unlock(); }
}
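To avoid livelock when both sides keep timing out and retrying in lockstep, wrap tryTransfer in a jittered retry loop. A sketch (the method name transferWithRetry and the backoff bounds are ours):

static void transferWithRetry(Account from, Account to, int amount) throws InterruptedException {
    // Random backoff de-synchronizes competing callers so they don't
    // keep colliding and retrying forever (livelock).
    while (!tryTransfer(from, to, amount)) {
        Thread.sleep(java.util.concurrent.ThreadLocalRandom.current().nextLong(1, 20));
    }
}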
Key points
- Under low load, the unlucky interleaving is rare, so the bug looks like a flaky hang
- Under high load, it becomes nearly inevitable
- All four Coffman conditions must hold; break any ONE → no deadlock
- The cheapest fix: enforce a global total ordering on lock acquisition
Follow-up questions
- Why does the deadlock only happen 'sometimes'?
- What are the four Coffman conditions?
- How is this detected in production?
- When is tryLock preferable to lock ordering?
Gotchas
- Reentrant locks (Java ReentrantLock, Python RLock) prevent SELF-deadlock but NOT mutual deadlock between threads (see the sketch after this list)
- Go's runtime only catches 'all goroutines asleep'; partial deadlocks go undetected
- Holding a lock while calling user-supplied callback code is a deadlock waiting to happen
- Sorting locks by hash is fine for in-process objects, but pointers across processes (shared memory) need stable IDs
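A quick illustration of the first gotcha, assuming the Account class above: reentrancy saves a thread from blocking on its own lock, but does nothing about a cycle between two threads.

static void reentrancyDemo(Account a) {
    a.lock.lock();
    try {
        a.lock.lock();  // fine: the same thread re-enters its own ReentrantLock
        try {
            a.balance += 0;
        } finally { a.lock.unlock(); }
    } finally { a.lock.unlock(); }
    // Reentrancy is per-thread. If another thread holds b's lock and wants
    // a's lock while we hold a's lock and want b's, we still deadlock.
}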
Banking transfers, multi-resource allocation, distributed transactions: anything with more than one lock. The Mars Pathfinder famously hit a related bug (priority inversion). Many production hangs at scale turn out to be deadlocks.