Diagnosing Deadlocks and Hangs
When a service hangs, dump every thread's or goroutine's stack trace. The pattern of who is waiting on what reveals whether it's a deadlock (cyclic wait), a livelock (busy spinning), or starvation (one thread never scheduled). Tools: jstack, py-spy, Go pprof, kill -3.
What it is
A hang in production is "the service stopped doing useful work, but it's still running." It could be a deadlock (threads in a cyclic wait), a livelock (threads spinning without progress), starvation (one thread never scheduled), an infinite loop, or just an external service that timed out. Diagnosing requires dumping every thread or goroutine's stack and reading the pattern.
Why this matters
The first 30 seconds of any production hang: run the language's "dump every thread" command immediately; skipping it means guessing. The dump shows exactly what each worker is doing. Without it, all that's left is logs and intuition, neither of which usually pinpoints the issue.
This is a skill that pays off forever: the tools haven't changed much in 20 years. Java's jstack, Go's SIGQUIT/pprof, and Python's py-spy are all still the right answer.
The diagnosis playbook
Step 1: capture state before doing anything else
Don't restart yet: restarting the service "to fix the hang" destroys all evidence. Capture the diagnostic dump first, then restart. The next on-call shift will thank the current one.
Java:
```sh
jstack <pid> > thread-dump.txt
# Or, with access to the foreground process:
kill -3 <pid>   # sends SIGQUIT; the JVM dumps to stderr (does NOT exit)
```
Go:
```sh
# Production-safe (does not kill the process):
curl http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutines.txt
# Last resort (kills the process after dumping):
kill -SIGQUIT <pid>
```
Python:
```sh
py-spy dump --pid <pid> > stacks.txt   # synchronous code
# For asyncio, expose an admin endpoint that calls asyncio.all_tasks()
```
Step 2: read the dump
What to look for
- Java: at the bottom of `jstack` output, "Found X deadlocked threads" is the smoking gun. The cycle is annotated; fix the lock acquisition order.
- Go: each goroutine's state is in brackets: `[chan receive]`, `[semacquire]`, `[IO wait]`. Two goroutines waiting on locks held by each other = deadlock.
- Python: every thread stuck in `acquire` on an application-owned `Lock` = deadlock (review code paths to find the cycle).
- All three: most threads at the same code line = lock contention (not deadlock).
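To make the Go case concrete, here is a minimal sketch (all names are illustrative) that manufactures the classic two-lock cycle. In a live process the two goroutines would show up in a pprof dump as `[semacquire]` (or `[sync.Mutex.Lock]` on newer Go versions); run standalone, the runtime notices that every goroutine is asleep and prints the full dump itself:

```go
// Minimal deadlock demo: two goroutines take the same two locks in
// opposite order, so each ends up waiting on a lock the other holds.
package main

import (
	"sync"
	"time"
)

func main() {
	var mu1, mu2 sync.Mutex
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // goroutine A: locks mu1, then wants mu2
		defer wg.Done()
		mu1.Lock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		mu2.Lock()                        // blocks forever: B holds mu2
	}()
	go func() { // goroutine B: locks mu2, then wants mu1
		defer wg.Done()
		mu2.Lock()
		time.Sleep(10 * time.Millisecond)
		mu1.Lock() // blocks forever: A holds mu1
	}()

	// All goroutines end up blocked, so the runtime aborts with
	// "fatal error: all goroutines are asleep - deadlock!" plus full stacks.
	wg.Wait()
}
```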
Step 3: distinguish symptoms
| Symptom | Diagnosis | Fix |
|---|---|---|
| All threads blocked on locks, cycle visible | Deadlock | Lock ordering or tryLock with timeout |
| All threads blocked on one lock | Lock contention | Shorter critical section, RW lock, sharding |
| 100% CPU but no progress | Livelock or infinite loop | CPU profile (pprof, async-profiler) |
| Some threads progressing, one stuck | Starvation | Fair lock, priority adjustment |
| All threads in network I/O | External hang | Timeout, circuit breaker |
| All threads in GC | GC pause (not really hang) | Heap tuning |
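For the deadlock row, a sketch of the lock-ordering fix in Go, using a hypothetical `Account` type (the same idea applies to the Java `transfer()` example later in this section): acquire locks in one global order so no cycle can form.

```go
package account

import "sync"

// Account is a hypothetical type for illustration.
type Account struct {
	ID      int64
	mu      sync.Mutex
	balance int64
}

// Transfer always locks the lower-ID account first, so two concurrent
// transfers between the same pair can never hold one lock each while
// waiting for the other.
func Transfer(from, to *Account, amount int64) {
	first, second := from, to
	if to.ID < from.ID {
		first, second = to, from
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()

	from.balance -= amount
	to.balance += amount
}
```

The table's alternative fix, tryLock with a timeout, maps to `sync.Mutex.TryLock` (Go 1.18+) plus backoff and retry in Go, or `ReentrantLock.tryLock(timeout, unit)` in Java.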
The watchdog pattern: automatic dumps
For sporadic hangs that can't be reproduced on demand, install a watchdog that auto-dumps when something looks wrong:
Three watchdog patterns:
- Java: `ThreadMXBean.findDeadlockedThreads()` on a scheduled task. If non-null, log + alert.
- Python: `faulthandler.dump_traceback_later(timeout=60, repeat=True)` auto-dumps every 60s if the main thread is unresponsive.
- Go: a goroutine that polls `runtime.NumGoroutine()` for unbounded growth and dumps pprof on threshold breach (see the sketch after this list).
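A minimal version of the Go watchdog; the 30-second interval and the threshold are placeholders to tune per service:

```go
package watchdog

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// Watch polls the goroutine count and dumps full stacks when it breaches
// the threshold. Call once at startup: go watchdog.Watch(10_000).
func Watch(threshold int) {
	for range time.Tick(30 * time.Second) {
		if n := runtime.NumGoroutine(); n > threshold {
			log.Printf("watchdog: %d goroutines (threshold %d), dumping stacks", n, threshold)
			// debug=2 produces the same full-stack format as SIGQUIT,
			// but without killing the process.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}
}
```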
Production-safe vs production-fatal
| Tool | Kills process? | Production-safe? |
|---|---|---|
| `jstack <pid>` | No | Yes (low overhead) |
| `kill -3 <pid>` (SIGQUIT) | Java: no / Go: YES | Java: yes; Go: only when a restart is acceptable |
| `kill -SIGQUIT` (Go) | YES | Last resort |
| `/debug/pprof/goroutine` (Go) | No | Yes (private port) |
| `py-spy dump` | No | Yes |
| `py-spy record` | No (some perf hit) | Yes |
| `gdb` attach | Pauses process while attached | Risky |
The Go SIGQUIT trap
Sending kill -3 (SIGQUIT) to a Go process dumps every goroutine's stack and then kills it, unlike the JVM, which keeps running. For non-fatal dumps, expose net/http/pprof and use curl /debug/pprof/goroutine?debug=2, as sketched below.
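A minimal way to wire that up; port 6060 matches the curl command earlier, and the bind address must stay private:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on DefaultServeMux
)

func main() {
	go func() {
		// localhost only: the pprof port must never be publicly reachable.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... the rest of the service runs as usual
	select {} // placeholder that stands in for the real workload
}
```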
When the hang is NOT in the local code
A common surprise: every thread is blocked on the same network call. The hang isn't local at all; the service is healthy but waiting on an upstream service that's slow or down. The fix is:
- Add a timeout to the upstream call.
- Add a circuit breaker so future calls fail fast when upstream is down.
- Add a fallback: a degraded response, a cached value, or a queue for later.
The default network timeout is "wait forever": most HTTP clients, DB drivers, and message-queue clients have NO default timeout. Set explicit timeouts on every external call; otherwise one slow upstream service hangs the entire request path.
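What that looks like in Go; the 2-second budgets are illustrative and should be tuned per endpoint:

```go
package upstream

import (
	"context"
	"net/http"
	"time"
)

// Client-level timeout as a backstop; the zero value means "wait forever".
var client = &http.Client{Timeout: 2 * time.Second}

// Fetch adds a per-call deadline on top, so a slow upstream fails fast
// instead of pinning a worker indefinitely.
func Fetch(ctx context.Context, url string) (*http.Response, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}
```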
The interview answer
When asked "how should a hung service be debugged?", the right answer follows this sequence:
- Capture diagnostics before restarting (jstack/py-spy/pprof).
- Read the dump for the pattern: deadlock cycle, lock contention, or network wait.
- Apply the matching fix: lock ordering, RW lock, timeout, or circuit breaker.
- Add a watchdog so the next instance is self-diagnosing.
The trap answer: "I'd add more logging and restart." Logs don't show what threads are doing right now; the thread dump does.
Primitives (Java)
- jstack (thread dump)
- kill -3 <pid> (sends SIGQUIT, JVM dumps threads to stderr)
- ThreadMXBean.findDeadlockedThreads() (programmatic)
- Java Flight Recorder for lock contention events
Implementations
jstack <pid> prints a stack trace for every JVM thread. The killer feature: at the bottom of the output, "Found X deadlocked threads" with the cycle annotated. When that line appears, the deadlock is confirmed and the cycle is spelled out. If it's absent, look for threads BLOCKED on locks, which usually points to lock contention or starvation.
```
# In a terminal on the server:
$ jstack 12345

"pool-2-thread-1" #18 prio=5 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.example.Account.transfer(Account.java:42)
        - waiting to lock <0x000000076b9c0c80> (a com.example.Account)
        - locked <0x000000076b9c0c90> (a com.example.Account)
...

"pool-2-thread-2" #19 prio=5 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.example.Account.transfer(Account.java:42)
        - waiting to lock <0x000000076b9c0c90>
        - locked <0x000000076b9c0c80>
...

Found 1 deadlock:
"pool-2-thread-1" → "pool-2-thread-2" → "pool-2-thread-1"

# The fix: enforce global lock acquisition order in transfer()
```

ThreadMXBean.findDeadlockedThreads() returns the IDs of any threads in a deadlock cycle. Useful for self-monitoring services that should automatically dump diagnostics and alert on detection.
```java
import java.lang.management.*;

public class DeadlockMonitor {
    private final ThreadMXBean tmx = ManagementFactory.getThreadMXBean();

    public void check() {
        long[] deadlocked = tmx.findDeadlockedThreads();
        if (deadlocked == null) return; // no cycle right now

        // true, true: include locked monitors and locked synchronizers
        ThreadInfo[] info = tmx.getThreadInfo(deadlocked, true, true);
        StringBuilder sb = new StringBuilder("DEADLOCK DETECTED:\n");
        for (ThreadInfo t : info) {
            sb.append(t.toString());
        }
        // Placeholder: swap in the service's logger and alerting hooks.
        System.err.println(sb);
    }
}

// Run check() on a scheduled executor every 30 seconds
```

Key points
- Step 1 of any hang: dump every thread's/goroutine's stack; without it, the diagnosis is guesswork
- Java: `jstack <pid>` or `kill -3 <pid>`; output shows "Found X deadlocked threads"
- Go: send SIGQUIT or use `/debug/pprof/goroutine?debug=2` for a full stack of every goroutine
- Python: `py-spy dump --pid <pid>` for sync code; asyncio.all_tasks() for async
- Deadlock pattern: thread A blocked on a lock held by B, B blocked on a lock held by A
- Hang ≠ deadlock; it could also be an infinite loop, a blocked syscall, a network timeout, or starvation
Follow-up questions
- What's the very first thing to do when a Java service hangs?
- Service is at 100% CPU but hung. Deadlock?
- How does one tell deadlock from starvation?
- Should pprof be exposed in production?
- How to debug a deadlock that won't reproduce?
Gotchas
- `kill -3` in production sends SIGQUIT; the Go runtime PRINTS THE DUMP AND EXITS the process. Use a custom SIGUSR1/2 handler or a pprof endpoint instead.
- Java's `jstack` requires the same JVM version as the target process and the same UID; it usually means sudo or running as the service user.
- py-spy needs ptrace permissions; on hardened Linux, set CAP_SYS_PTRACE or run as root.
- asyncio hangs don't show up as "thread blocked": there's only one thread, and it looks idle. Use asyncio.all_tasks() instead.
- A goroutine dump alone doesn't show channel state: the dump shows "chan receive" but not who holds the other end of the channel. Combine it with code review.
- Hung threads waiting on a network socket look like a deadlock; verify with strace/tcpdump before assuming one.
Common pitfalls
- Restarting the service before capturing diagnostics; the evidence is lost
- Reading the thread dump only from the top; jstack's deadlock summary is at the bottom
- Assuming any hang is a deadlock; it could be I/O, GC, an infinite loop, or starvation
- Not having pprof exposed in production; by the time it's needed, it's too late to add
Practice problems
DFS for cycles in a directed graph; conceptually the same cycle detection a deadlock detector such as ThreadMXBean's performs on the wait-for graph (see the sketch below)
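A minimal wait-for-graph cycle check in Go; the map-based graph representation and all names are illustrative:

```go
package waitgraph

// HasCycle reports whether the wait-for graph (edges point from each
// waiter to whatever holds the resource it wants) contains a cycle,
// using three-color depth-first search.
func HasCycle(waitsFor map[string][]string) bool {
	const (
		white = iota // unvisited
		grey         // on the current DFS path
		black        // fully explored
	)
	color := map[string]int{}
	var visit func(n string) bool
	visit = func(n string) bool {
		color[n] = grey
		for _, m := range waitsFor[n] {
			if color[m] == grey {
				return true // back edge to the current path: cycle
			}
			if color[m] == white && visit(m) {
				return true
			}
		}
		color[n] = black
		return false
	}
	for n := range waitsFor {
		if color[n] == white && visit(n) {
			return true
		}
	}
	return false
}
```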
APIs worth memorising
- Java: jstack, jcmd, ThreadMXBean.findDeadlockedThreads, JFR thread events
- Python: py-spy {dump, top, record}, faulthandler, threading.enumerate, asyncio.all_tasks
- Go: SIGQUIT, /debug/pprof/goroutine, go tool pprof, GODEBUG=schedtrace
Every postmortem about a production hang follows the same playbook: dump threads, find the cycle, fix the lock order. Cloudflare, Stripe, and Datadog all have public engineering posts about specific deadlock incidents. The tools haven't changed in 20 years; knowing them is non-negotiable.