Profiling Concurrent Code
On-CPU profilers (perf, async-profiler, py-spy) show where threads spend CPU. Off-CPU profilers show where threads wait (locks, I/O). Concurrent code needs both: "CPU is fine but throughput won't go higher" points to off-CPU waiting, not on-CPU work. Lock profilers and goroutine dumps are the specific tools.
What it is
Profiling concurrent code requires a different toolkit than profiling sequential code. The dominant performance issue in concurrent code is often "what are threads doing when they're not running?", which on-CPU profilers can't answer.
The toolkit:
On-CPU profilers: where threads spend CPU. async-profiler (Java), perf (Linux), py-spy (Python), pprof (Go). Use for CPU-bound bottlenecks.
Off-CPU profilers: where threads wait. async-profiler in 'lock' or 'wall' mode, Go's mutex/block profilers, perf with sched events. Use for contention and I/O bottlenecks.
Thread/goroutine dumps: snapshot of every thread's stack. jstack (Java), py-spy dump (Python), pprof goroutine (Go). Use for hangs and deadlocks.
Tracing: per-request timeline. OpenTelemetry, Go's runtime/trace, JFR. Use for tail latency and slow-request analysis.
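The JVM commands appear under Implementations below; for the other runtimes in this list, these are the usual invocations (a hedged sketch: flags vary by version, and Go's mutex profile stays empty unless the program imports net/http/pprof and has called runtime.SetMutexProfileFraction).
// perf record -F 99 -g -p <pid>; perf report                 (on-CPU, Linux)
// py-spy dump --pid <pid>                                    (Python thread dump)
// go tool pprof http://localhost:6060/debug/pprof/mutex      (Go lock contention)
// curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' (goroutine dump)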
Picking the right tool
The starting question: what's the symptom?
"Service uses 100% CPU but throughput is low" → on-CPU profile. Find the hot loop. Optimise the algorithm or the hot path.
"CPU is at 30% but throughput won't go higher" → off-CPU profile. Find the lock or I/O wait. Reduce contention or fan out work.
"Service hangs" → thread dump. Look for threads BLOCKED or WAITING. Look for cycles (deadlock).
"P99 latency is bad but average is fine" → tracing. Look at what slow requests do that fast ones don't.
"Sometimes a request is 10x slower than usual" → tracing + GC logs. Often a GC pause coincides with the slow request.
On-CPU vs off-CPU
The distinction is crucial.
On-CPU sampling: every 10ms (100Hz typical), check what each thread is doing right now. If a thread is sleeping or blocked, it's not sampled. The flame graph shows code that runs.
Off-CPU sampling: track when threads block, what they're blocked on, for how long. The flame graph shows code that waits.
For concurrent code where the issue is "threads waiting for locks", on-CPU profiling reveals... nothing. The waiting threads aren't running. Off-CPU is the right tool.
For concurrent code where the issue is "the algorithm is slow", on-CPU profiling shows the hot loop. Off-CPU shows nothing meaningful.
Most concurrency bottlenecks are off-CPU. CPU bottlenecks are usually algorithmic (use a better algorithm) rather than concurrency-specific.
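To make that concrete, here is a minimal, hypothetical Java program whose throughput is capped by one lock while CPU stays near zero. An on-CPU profile of it shows almost nothing; a lock or wall profile shows the contended monitor immediately.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LockBound {
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    synchronized (LOCK) {        // 7 threads park here: off-CPU
                        try {
                            Thread.sleep(10);    // holder also waits: off-CPU
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                }
            });
        }
        TimeUnit.SECONDS.sleep(60);              // profile during this window
        pool.shutdownNow();
    }
}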
Thread dumps for hangs
When the service stops responding, take 3-4 thread dumps separated by seconds. Compare:
- Threads stuck in the same place across all dumps → genuinely blocked.
- Threads moving between calls → still working, slow but progressing.
- A cycle of "A waits for B, B waits for A" → deadlock. Most JVM tools flag these explicitly.
- Many threads waiting on the same monitor → contention; that monitor is the bottleneck.
Most hangs in production show a clear pattern in 5 minutes of thread-dump analysis. It's the cheapest debugging tool.
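A minimal deadlock to practise on (class and thread names are illustrative): two threads take the same two monitors in opposite order, both end up BLOCKED within ~100ms, and jstack reports the cycle explicitly.

public class DeadlockDemo {
    private static final Object A = new Object();
    private static final Object B = new Object();

    public static void main(String[] args) {
        new Thread(() -> lockBoth(A, B), "t1").start();
        new Thread(() -> lockBoth(B, A), "t2").start();
    }

    private static void lockBoth(Object first, Object second) {
        synchronized (first) {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            synchronized (second) {                // never acquired: the other
                System.out.println("unreachable"); // thread holds it and waits
            }
        }
    }
}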
Tracing for tail latency
P99 latency says the slowest 1% of requests are slow, but not why. Profiling reports what's typical, not what's rare, so it can't explain outliers. Tracing fills the gap: each request gets a trace ID, and slow requests come with a full timing breakdown.
OpenTelemetry is the standard. Each span records what happened and for how long: the DB query took 50ms, the downstream call took 200ms, a GC pause coincided. Patterns emerge: "all slow requests hit the same downstream", "all slow requests happen during GC".
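A minimal sketch of manual instrumentation with the OpenTelemetry Java API; it assumes an SDK and exporter are configured elsewhere, and the tracer and span names are illustrative.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedHandler {
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("checkout-service");

    void handle() {
        Span request = tracer.spanBuilder("handle-request").startSpan();
        try (Scope ignored = request.makeCurrent()) {
            Span db = tracer.spanBuilder("db-query").startSpan(); // child span
            try {
                // ... run the query ...
            } finally {
                db.end();          // records the query's duration in the trace
            }
        } finally {
            request.end();         // records end-to-end request latency
        }
    }
}

On the JVM, JFR can play a similar role: jcmd <pid> JFR.start duration=60s filename=rec.jfr records events (including GC pauses) that can be lined up against slow spans.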
Profile both on-CPU and off-CPU: most concurrency bottlenecks are off-CPU, and they don't show up in a CPU profile at all. Thread dumps are the cheapest debugging tool there is, so when a service hangs, take dumps before restarting; the pattern is usually obvious. And tracing answers a different question than profiling: tracing is for tail latency, profiling is for steady-state behaviour.
The common mistake is profiling only one mode and concluding "it's fine, no hot spot". The hot spot is in the mode that wasn't profiled.
Implementations
async-profiler is the standard JVM profiler. CPU mode for hot code paths. Lock mode for contention. Wall mode for end-to-end (including blocking time). Each mode answers a different question.
// CPU profile (on-CPU)
// ./profiler.sh -d 60 -e cpu -f cpu.html <pid>
// Flame graph: methods consuming CPU. Find hot loops.

// Lock profile (off-CPU contention)
// ./profiler.sh -d 60 -e lock -f lock.html <pid>
// Flame graph: methods blocked on monitors. Find contended locks.

// Wall profile (everything: CPU + waits)
// ./profiler.sh -d 60 -e wall -f wall.html <pid>
// Most useful for "where does time go in a slow request"

// Allocation profile
// ./profiler.sh -d 60 -e alloc -f alloc.html <pid>
// Flame graph by allocation. Find GC pressure.

jstack -l <pid> dumps every thread's stack. The -l flag includes lock information (which thread holds which lock). For deadlocks, jstack shows them explicitly. For hangs, look for threads in BLOCKED or WAITING states.
// jstack -l <pid> > dump.txt
//
// Sample output snippets:
//
// "worker-1" #42 prio=5 os_prio=0 tid=0x... nid=0x... waiting on condition [...]
//    java.lang.Thread.State: WAITING (parking)
//      at sun.misc.Unsafe.park(Native Method)
//      - parking to wait for <0x000000076ab62ff0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
//      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
//
// Look for:
// - many threads BLOCKED on the same lock object → contention
// - thread cycle (A waits for B, B waits for A) → deadlock (jstack flags these)
// - threads stuck in I/O → downstream slow
//
// For deadlock detection specifically:
// jstack -l <pid> | grep -A 3 "Found one Java-level deadlock"

Key points
- On-CPU profile: where threads spend CPU. Use for CPU-bound bottlenecks.
- Off-CPU profile: where threads wait (block, sleep, park). Use for concurrency bottlenecks.
- Lock profilers: per-lock contention time. async-profiler 'lock' mode, Go's mutex profiler.
- Thread/goroutine dumps: snapshot of every thread's stack. Use for hangs and deadlocks.
- Tracing (OpenTelemetry, runtime/trace): per-request timeline; useful for tail latency.
Follow-up questions
- How to tell if the problem is CPU or contention?
- What's the difference between CPU and wall profiling?
- When to use tracing instead of profiling?
- How much overhead does profiling add?
Gotchas
- Profiling only on-CPU when the bottleneck is contention: misses the problem
- A single thread dump is a snapshot; take 3-4 separated by seconds for hang diagnosis
- Profiling overhead can itself cause symptoms; use sampling at production scale
- Reading flame graphs without normalising by sample count: wide boxes might be common, not slow
- Conflating tracing with profiling: tracing reports on specific requests, profiling on aggregate behaviour