Diagnosing Deadlocks and Hangs
When a service hangs, dump every thread's or goroutine's stack trace. The pattern of who is waiting on what reveals whether it's a deadlock (cyclic wait), a livelock (busy spinning), or starvation (one thread never scheduled). Tools: jstack, py-spy, Go pprof, kill -3.
What it is
A hang in production is "the service stopped doing useful work, but it's still running." It could be a deadlock (threads in a cyclic wait), a livelock (threads spinning without progress), starvation (one thread never scheduled), an infinite loop, or just an external service that timed out. Diagnosing requires dumping every thread or goroutine's stack and reading the pattern.
Why this matters
The first 30 seconds of any production hang: run the language's "dump every thread" command immediately; skipping it means guessing. The dump shows exactly what each worker is doing. Without it, all that's left is logs and intuition, neither of which usually pinpoints the issue.
This is a skill that pays off forever: the tools haven't changed much in 20 years. Java's jstack, Go's SIGQUIT/pprof, and Python's py-spy are all still the right answer.
The diagnosis playbook
Step 1: capture state before doing anything else
Don't restart yet: restarting the service "to fix the hang" destroys all evidence. Capture the diagnostic dump first, then restart. The next on-call shift will thank the current one.
Java:
```sh
jstack <pid> > thread-dump.txt
# Or, with access to the foreground process:
kill -3 <pid>   # sends SIGQUIT; the JVM dumps to stderr (does NOT exit)
```
Go:
```sh
# Production-safe (does not kill the process):
curl http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutines.txt
# Last resort (kills the process after dumping):
kill -SIGQUIT <pid>
```
Python:
```sh
py-spy dump --pid <pid> > stacks.txt   # synchronous code
# For asyncio, expose an admin endpoint that calls asyncio.all_tasks()
```
Step 2: read the dump
What to look for
- Java: at the bottom of `jstack` output, "Found X deadlocked threads" is the smoking gun. The cycle is annotated; fix the lock acquisition order.
- Go: each goroutine's state is in brackets: `[chan receive]`, `[semacquire]`, `[IO wait]`. Two goroutines waiting on locks held by each other = deadlock.
- Python: every thread stuck in `acquire` on an application-owned `Lock` = deadlock (review code paths to find the cycle).
- All three: most threads at the same code line = lock contention (not deadlock).
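To make the Go case concrete, here is a minimal sketch (all names are illustrative) that manufactures the classic two-lock cycle. In a live process the two goroutines would show up in a pprof dump as `[semacquire]` (or `[sync.Mutex.Lock]` on newer Go versions); run standalone, the runtime notices that every goroutine is asleep and prints the full dump itself:

```go
// Minimal deadlock demo: two goroutines take the same two locks in
// opposite order, so each ends up waiting on a lock the other holds.
package main

import (
	"sync"
	"time"
)

func main() {
	var mu1, mu2 sync.Mutex
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // goroutine A: locks mu1, then wants mu2
		defer wg.Done()
		mu1.Lock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		mu2.Lock()                        // blocks forever: B holds mu2
	}()
	go func() { // goroutine B: locks mu2, then wants mu1
		defer wg.Done()
		mu2.Lock()
		time.Sleep(10 * time.Millisecond)
		mu1.Lock() // blocks forever: A holds mu1
	}()

	// All goroutines end up blocked, so the runtime aborts with
	// "fatal error: all goroutines are asleep - deadlock!" plus full stacks.
	wg.Wait()
}
```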
Step 3: distinguish symptoms
| Symptom | Diagnosis | Fix |
|---|---|---|
| All threads blocked on locks, cycle visible | Deadlock | Lock ordering or tryLock with timeout |
| All threads blocked on one lock | Lock contention | Shorter critical section, RW lock, sharding |
| 100% CPU but no progress | Livelock or infinite loop | CPU profile (pprof, async-profiler) |
| Some threads progressing, one stuck | Starvation | Fair lock, priority adjustment |
| All threads in network I/O | External hang | Timeout, circuit breaker |
| All threads in GC | GC pause (not really hang) | Heap tuning |
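For the deadlock row, a sketch of the lock-ordering fix in Go, using a hypothetical `Account` type (the same idea applies to the Java `transfer()` example later in this section): acquire locks in one global order so no cycle can form.

```go
package account

import "sync"

// Account is a hypothetical type for illustration.
type Account struct {
	ID      int64
	mu      sync.Mutex
	balance int64
}

// Transfer always locks the lower-ID account first, so two concurrent
// transfers between the same pair can never hold one lock each while
// waiting for the other.
func Transfer(from, to *Account, amount int64) {
	first, second := from, to
	if to.ID < from.ID {
		first, second = to, from
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()

	from.balance -= amount
	to.balance += amount
}
```

The table's alternative fix, tryLock with a timeout, maps to `sync.Mutex.TryLock` (Go 1.18+) plus backoff and retry in Go, or `ReentrantLock.tryLock(timeout, unit)` in Java.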
The watchdog pattern: automatic dumps
For sporadic hangs that can't be reproduced on demand, install a watchdog that auto-dumps when something looks wrong:
Three watchdog patterns:
- Java: `ThreadMXBean.findDeadlockedThreads()` on a scheduled task. If non-null, log + alert.
- Python: `faulthandler.dump_traceback_later(timeout=60, repeat=True)` auto-dumps every 60s if the main thread is unresponsive.
- Go: a goroutine that polls `runtime.NumGoroutine()` for unbounded growth and dumps pprof on threshold breach (see the sketch after this list).
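A minimal version of the Go watchdog; the 30-second interval and the threshold are placeholders to tune per service:

```go
package watchdog

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// Watch polls the goroutine count and dumps full stacks when it breaches
// the threshold. Call once at startup: go watchdog.Watch(10_000).
func Watch(threshold int) {
	for range time.Tick(30 * time.Second) {
		if n := runtime.NumGoroutine(); n > threshold {
			log.Printf("watchdog: %d goroutines (threshold %d), dumping stacks", n, threshold)
			// debug=2 produces the same full-stack format as SIGQUIT,
			// but without killing the process.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}
}
```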
Production-safe vs production-fatal
| Tool | Kills process? | Production-safe? |
|---|---|---|
| `jstack <pid>` | No | Yes (low overhead) |
| `kill -3 <pid>` (SIGQUIT) | Java: no / Go: YES | Java: yes; Go: only when a restart is acceptable |
| `kill -SIGQUIT` (Go) | YES | Last resort |
| `/debug/pprof/goroutine` (Go) | No | Yes (private port) |
| `py-spy dump` | No | Yes |
| `py-spy record` | No (some perf hit) | Yes |
| `gdb` attach | Pauses process while attached | Risky |
The Go SIGQUIT trap
Sending kill -3 (SIGQUIT) to a Go process dumps every goroutine's stack and then kills it, unlike the JVM, which keeps running. For non-fatal dumps, expose net/http/pprof and use curl /debug/pprof/goroutine?debug=2, as sketched below.
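A minimal way to wire that up; port 6060 matches the curl command earlier, and the bind address must stay private:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on DefaultServeMux
)

func main() {
	go func() {
		// localhost only: the pprof port must never be publicly reachable.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... the rest of the service runs as usual
	select {} // placeholder that stands in for the real workload
}
```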
When the hang is NOT in the local code
A common surprise: every thread is blocked on the same network call. The hang isn't local at all; the service is healthy but waiting on an upstream service that's slow or down. The fix is:
- Add a timeout to the upstream call.
- Add a circuit breaker so future calls fail fast when upstream is down.
- Add a fallback: a degraded response, a cached value, or a queue for later.
The default network timeout is "wait forever": most HTTP clients, DB drivers, and message-queue clients have NO default timeout. Set explicit timeouts on every external call; otherwise one slow upstream service hangs the entire request path.
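What that looks like in Go; the 2-second budgets are illustrative and should be tuned per endpoint:

```go
package upstream

import (
	"context"
	"net/http"
	"time"
)

// Client-level timeout as a backstop; the zero value means "wait forever".
var client = &http.Client{Timeout: 2 * time.Second}

// Fetch adds a per-call deadline on top, so a slow upstream fails fast
// instead of pinning a worker indefinitely.
func Fetch(ctx context.Context, url string) (*http.Response, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}
```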
The interview answer
When asked "how should a hung service be debugged?", the right answer follows this sequence:
- Capture diagnostics before restarting (jstack/py-spy/pprof).
- Read the dump for the pattern: deadlock cycle, lock contention, or network wait.
- Apply the matching fix: lock ordering, RW lock, timeout, or circuit breaker.
- Add a watchdog so the next instance is self-diagnosing.
The trap answer: "I'd add more logging and restart." Logs don't show what threads are doing right now; the thread dump does.
Primitives (Java)
- jstack (thread dump)
- kill -3 <pid> (sends SIGQUIT, JVM dumps threads to stderr)
- ThreadMXBean.findDeadlockedThreads() (programmatic)
- Java Flight Recorder for lock contention events
Implementations
jstack <pid> prints a stack trace for every JVM thread. The killer feature: at the bottom of the output, "Found X deadlocked threads" with the cycle annotated. When that line appears, the deadlock is confirmed and the cycle is spelled out. If it's absent, look for threads BLOCKED on locks, which usually points to lock contention or starvation.
```
# In a terminal on the server:
$ jstack 12345

"pool-2-thread-1" #18 prio=5 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.example.Account.transfer(Account.java:42)
        - waiting to lock <0x000000076b9c0c80> (a com.example.Account)
        - locked <0x000000076b9c0c90> (a com.example.Account)
...

"pool-2-thread-2" #19 prio=5 ...
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.example.Account.transfer(Account.java:42)
        - waiting to lock <0x000000076b9c0c90>
        - locked <0x000000076b9c0c80>
...

Found 1 deadlock:
"pool-2-thread-1" → "pool-2-thread-2" → "pool-2-thread-1"

# The fix: enforce global lock acquisition order in transfer()
```

ThreadMXBean.findDeadlockedThreads() returns the IDs of any threads in a deadlock cycle. Useful for self-monitoring services that should automatically dump diagnostics and alert on detection.
```java
import java.lang.management.*;

public class DeadlockMonitor {
    private final ThreadMXBean tmx = ManagementFactory.getThreadMXBean();

    public void check() {
        long[] deadlocked = tmx.findDeadlockedThreads();
        if (deadlocked == null) return; // no cycle right now

        // true, true: include locked monitors and locked synchronizers
        ThreadInfo[] info = tmx.getThreadInfo(deadlocked, true, true);
        StringBuilder sb = new StringBuilder("DEADLOCK DETECTED:\n");
        for (ThreadInfo t : info) {
            sb.append(t.toString());
        }
        // Placeholder: swap in the service's logger and alerting hooks.
        System.err.println(sb);
    }
}

// Run check() on a scheduled executor every 30 seconds
```

Key points
- Step 1 of any hang: dump every thread's/goroutine's stack; without it, the diagnosis is guesswork
- Java: `jstack <pid>` or `kill -3 <pid>`; output shows "Found X deadlocked threads"
- Go: send SIGQUIT or use `/debug/pprof/goroutine?debug=2` for a full stack of every goroutine
- Python: `py-spy dump --pid <pid>` for sync code; asyncio.all_tasks() for async
- Deadlock pattern: thread A blocked on a lock held by B, B blocked on a lock held by A
- Hang ≠ deadlock; it could also be an infinite loop, a blocked syscall, a network timeout, or starvation
Follow-up questions
- What's the very first thing to do when a Java service hangs?
- Service is at 100% CPU but hung. Deadlock?
- How does one tell deadlock from starvation?
- Should pprof be exposed in production?
- How to debug a deadlock that won't reproduce?
Gotchas
- `kill -3` in production sends SIGQUIT; the Go runtime PRINTS THE DUMP AND EXITS the process. Use a custom SIGUSR1/2 handler or a pprof endpoint instead.
- Java's `jstack` requires the same JVM version as the target process and the same UID; it usually means sudo or running as the service user.
- py-spy needs ptrace permissions; on hardened Linux, set CAP_SYS_PTRACE or run as root.
- asyncio hangs don't show up as "thread blocked": there's only one thread, and it looks idle. Use asyncio.all_tasks() instead.
- A goroutine dump alone doesn't show channel state: the dump shows "chan receive" but not who holds the other end of the channel. Combine it with code review.
- Hung threads waiting on a network socket look like a deadlock; verify with strace/tcpdump before assuming one.
Common pitfalls
- Restarting the service before capturing diagnostics; the evidence is lost
- Reading the thread dump only from the top; jstack's deadlock summary is at the bottom
- Assuming any hang is a deadlock; it could be I/O, GC, an infinite loop, or starvation
- Not having pprof exposed in production; by the time it's needed, it's too late to add
Practice problems
DFS for cycles in a directed graph; conceptually the same cycle detection a deadlock detector such as ThreadMXBean's performs on the wait-for graph (see the sketch below)
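A minimal wait-for-graph cycle check in Go; the map-based graph representation and all names are illustrative:

```go
package waitgraph

// HasCycle reports whether the wait-for graph (edges point from each
// waiter to whatever holds the resource it wants) contains a cycle,
// using three-color depth-first search.
func HasCycle(waitsFor map[string][]string) bool {
	const (
		white = iota // unvisited
		grey         // on the current DFS path
		black        // fully explored
	)
	color := map[string]int{}
	var visit func(n string) bool
	visit = func(n string) bool {
		color[n] = grey
		for _, m := range waitsFor[n] {
			if color[m] == grey {
				return true // back edge to the current path: cycle
			}
			if color[m] == white && visit(m) {
				return true
			}
		}
		color[n] = black
		return false
	}
	for n := range waitsFor {
		if color[n] == white && visit(n) {
			return true
		}
	}
	return false
}
```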
APIs worth memorising
- Java: jstack, jcmd, ThreadMXBean.findDeadlockedThreads, JFR thread events
- Python: py-spy {dump, top, record}, faulthandler, threading.enumerate, asyncio.all_tasks
- Go: SIGQUIT, /debug/pprof/goroutine, go tool pprof, GODEBUG=schedtrace
Every postmortem about a production hang follows the same playbook: dump threads, find the cycle, fix the lock order. Cloudflare, Stripe, and Datadog all have public engineering posts about specific deadlock incidents. The tools haven't changed in 20 years; knowing them is non-negotiable.