Profiling Concurrent Code
On-CPU profilers (perf, async-profiler, py-spy) show where threads spend CPU. Off-CPU profilers show where threads wait (locks, I/O). Concurrent code needs both: "CPU is fine but throughput won't go higher" points to off-CPU waiting, not on-CPU work. Lock profilers and goroutine dumps are the specific tools.
What it is
Profiling concurrent code requires a different toolkit than profiling sequential code. The dominant performance issue in concurrent code is often "what are threads doing when they're not running?", which on-CPU profilers can't answer.
The toolkit:
On-CPU profilers: where threads spend CPU. async-profiler (Java), perf (Linux), py-spy (Python), pprof (Go). Use for CPU-bound bottlenecks.
Off-CPU profilers: where threads wait. async-profiler in 'lock' or 'wall' mode, Go's mutex/block profilers, perf with sched events. Use for contention and I/O bottlenecks.
Thread/goroutine dumps: snapshot of every thread's stack. jstack (Java), py-spy dump (Python), pprof goroutine (Go). Use for hangs and deadlocks.
Tracing: per-request timeline. OpenTelemetry, Go's runtime/trace, JFR. Use for tail latency and slow-request analysis.
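The JVM commands appear under Implementations below; for the other runtimes in this list, these are the usual invocations (a hedged sketch: flags vary by version, and Go's mutex profile stays empty unless the program imports net/http/pprof and has called runtime.SetMutexProfileFraction).
// perf record -F 99 -g -p <pid>; perf report                 (on-CPU, Linux)
// py-spy dump --pid <pid>                                    (Python thread dump)
// go tool pprof http://localhost:6060/debug/pprof/mutex      (Go lock contention)
// curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' (goroutine dump)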
Picking the right tool
The starting question: what's the symptom?
"Service uses 100% CPU but throughput is low" → on-CPU profile. Find the hot loop. Optimise the algorithm or the hot path.
"CPU is at 30% but throughput won't go higher" → off-CPU profile. Find the lock or I/O wait. Reduce contention or fan out work.
"Service hangs" → thread dump. Look for threads BLOCKED or WAITING. Look for cycles (deadlock).
"P99 latency is bad but average is fine" → tracing. Look at what slow requests do that fast ones don't.
"Sometimes a request is 10x slower than usual" → tracing + GC logs. Often a GC pause coincides with the slow request.
On-CPU vs off-CPU
The distinction is crucial.
On-CPU sampling: every 10ms (100Hz typical), check what each thread is doing right now. If a thread is sleeping or blocked, it's not sampled. The flame graph shows code that runs.
Off-CPU sampling: track when threads block, what they're blocked on, for how long. The flame graph shows code that waits.
For concurrent code where the issue is "threads waiting for locks", on-CPU profiling reveals... nothing. The waiting threads aren't running. Off-CPU is the right tool.
For concurrent code where the issue is "the algorithm is slow", on-CPU profiling shows the hot loop. Off-CPU shows nothing meaningful.
Most concurrency bottlenecks are off-CPU. CPU bottlenecks are usually algorithmic (use a better algorithm) rather than concurrency-specific.
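To make that concrete, here is a minimal, hypothetical Java program whose throughput is capped by one lock while CPU stays near zero. An on-CPU profile of it shows almost nothing; a lock or wall profile shows the contended monitor immediately.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LockBound {
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    synchronized (LOCK) {        // 7 threads park here: off-CPU
                        try {
                            Thread.sleep(10);    // holder also waits: off-CPU
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                }
            });
        }
        TimeUnit.SECONDS.sleep(60);              // profile during this window
        pool.shutdownNow();
    }
}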
Thread dumps for hangs
When the service stops responding, take 3-4 thread dumps separated by seconds. Compare:
- Threads stuck in the same place across all dumps → genuinely blocked.
- Threads moving between calls → still working, slow but progressing.
- A cycle of "A waits for B, B waits for A" → deadlock. Most JVM tools flag these explicitly.
- Many threads waiting on the same monitor → contention; that monitor is the bottleneck.
Most hangs in production show a clear pattern in 5 minutes of thread-dump analysis. It's the cheapest debugging tool.
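A minimal deadlock to practise on (class and thread names are illustrative): two threads take the same two monitors in opposite order, both end up BLOCKED within ~100ms, and jstack reports the cycle explicitly.

public class DeadlockDemo {
    private static final Object A = new Object();
    private static final Object B = new Object();

    public static void main(String[] args) {
        new Thread(() -> lockBoth(A, B), "t1").start();
        new Thread(() -> lockBoth(B, A), "t2").start();
    }

    private static void lockBoth(Object first, Object second) {
        synchronized (first) {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            synchronized (second) {                // never acquired: the other
                System.out.println("unreachable"); // thread holds it and waits
            }
        }
    }
}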
Tracing for tail latency
P99 latency says the slowest 1% of requests are slow, but not why. Profiling reports what's typical, not what's rare, so it can't explain outliers. Tracing fills the gap: each request gets a trace ID, and slow requests come with a full timing breakdown.
OpenTelemetry is the standard. Each span records what happened and for how long: the DB query took 50ms, the downstream call took 200ms, a GC pause coincided. Patterns emerge: "all slow requests hit the same downstream", "all slow requests happen during GC".
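A minimal sketch of manual instrumentation with the OpenTelemetry Java API; it assumes an SDK and exporter are configured elsewhere, and the tracer and span names are illustrative.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedHandler {
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("checkout-service");

    void handle() {
        Span request = tracer.spanBuilder("handle-request").startSpan();
        try (Scope ignored = request.makeCurrent()) {
            Span db = tracer.spanBuilder("db-query").startSpan(); // child span
            try {
                // ... run the query ...
            } finally {
                db.end();          // records the query's duration in the trace
            }
        } finally {
            request.end();         // records end-to-end request latency
        }
    }
}

On the JVM, JFR can play a similar role: jcmd <pid> JFR.start duration=60s filename=rec.jfr records events (including GC pauses) that can be lined up against slow spans.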
Profile both on-CPU and off-CPU: most concurrency bottlenecks are off-CPU, and they don't show up in a CPU profile at all. Thread dumps are the cheapest debugging tool there is, so when a service hangs, take dumps before restarting; the pattern is usually obvious. And tracing answers a different question than profiling: tracing is for tail latency, profiling is for steady-state behaviour.
The common mistake is profiling only one mode and concluding "it's fine, no hot spot". The hot spot is in the mode that wasn't profiled.
Implementations
async-profiler is the standard JVM profiler. CPU mode for hot code paths. Lock mode for contention. Wall mode for end-to-end (including blocking time). Each mode answers a different question.
// CPU profile (on-CPU)
// ./profiler.sh -d 60 -e cpu -f cpu.html <pid>
// Flame graph: methods consuming CPU. Find hot loops.

// Lock profile (off-CPU contention)
// ./profiler.sh -d 60 -e lock -f lock.html <pid>
// Flame graph: methods blocked on monitors. Find contended locks.

// Wall profile (everything: CPU + waits)
// ./profiler.sh -d 60 -e wall -f wall.html <pid>
// Most useful for "where does time go in a slow request"

// Allocation profile
// ./profiler.sh -d 60 -e alloc -f alloc.html <pid>
// Flame graph by allocation. Find GC pressure.

jstack -l <pid> dumps every thread's stack. The -l flag includes lock information (which thread holds which lock). For deadlocks, jstack shows them explicitly. For hangs, look for threads in BLOCKED or WAITING states.
// jstack -l <pid> > dump.txt
//
// Sample output snippets:
//
// "worker-1" #42 prio=5 os_prio=0 tid=0x... nid=0x... waiting on condition [...]
//    java.lang.Thread.State: WAITING (parking)
//      at sun.misc.Unsafe.park(Native Method)
//      - parking to wait for <0x000000076ab62ff0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
//      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
//
// Look for:
// - many threads BLOCKED on the same lock object → contention
// - thread cycle (A waits for B, B waits for A) → deadlock (jstack flags these)
// - threads stuck in I/O → downstream slow
//
// For deadlock detection specifically:
// jstack -l <pid> | grep -A 3 "Found one Java-level deadlock"

Key points
- On-CPU profile: where threads spend CPU. Use for CPU-bound bottlenecks.
- Off-CPU profile: where threads wait (block, sleep, park). Use for concurrency bottlenecks.
- Lock profilers: per-lock contention time. async-profiler 'lock' mode, Go's mutex profiler.
- Thread/goroutine dumps: snapshot of every thread's stack. Use for hangs and deadlocks.
- Tracing (OpenTelemetry, runtime/trace): per-request timeline; useful for tail latency.
Follow-up questions
- How to tell if the problem is CPU or contention?
- What's the difference between CPU and wall profiling?
- When to use tracing instead of profiling?
- How much overhead does profiling add?
Gotchas
- Profiling only on-CPU when the bottleneck is contention: misses the problem
- A single thread dump is a snapshot; take 3-4 separated by seconds for hang diagnosis
- Profiling overhead can itself cause symptoms; use sampling at production scale
- Reading flame graphs without normalising by sample count: wide boxes might be common, not slow
- Conflating tracing with profiling: tracing reports on specific requests, profiling on aggregate behaviour