Measuring Lock Contention
Lock contention is threads spending significant time waiting for locks instead of running. Measure it with a profiler's block-time view, JFR/async-profiler lock events, or perf events for futex contention. The fix is usually one of: shrink the critical section, shard the lock, replace the lock with a lock-free structure, or remove the shared state entirely.
What it is
Lock contention is when threads spend significant time waiting for locks instead of running. The symptom: throughput plateaus while CPU utilisation sits well below 100%. Somewhere, threads are queuing for a shared resource that only one of them can use at a time.
Diagnosing contention means finding the bottleneck lock. Fixing it means shrinking the critical section, sharding the lock, or removing the sharing entirely.
A picture of contention
Wall-clock time ------------------------------------------------------>
Thread A: ███-░░░░░░░░░░░░░░-███-░░░░░░░░░░░░-███-░░░░░░░░░░░░-███
Thread B: ░░░-███-----------░░░-░░░-███-------░░░-░░░-███-------░░░
Thread C: ░░░-░░░-----------███-░░░-░░░-------███-░░░-░░░-------███
Thread D: ░░░-░░░-----------░░░-███-░░░-------░░░-███-░░░-------░░░
███ = inside the critical section, doing work
░░░ = waiting on the lock
4 threads, 1 contended lock.
Throughput ≈ 1 thread's worth of work, no matter how many threads run.
Cores 2, 3, 4 sit idle most of the time.
CPU utilisation looks low; latency at the lock looks bad.
Adding cores does not help when the bottleneck is one lock that everyone has to take.
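The plateau in the diagram can be reproduced in a few lines. A minimal sketch (class and method names invented here): four threads increment one counter under a single lock. The result is correct, but every increment serialises through the critical section, so throughput is roughly one thread's worth regardless of core count.

```java
// Sketch of the diagram above: four threads hammer one lock.
// All names here are illustrative, not from the original text.
public class ContendedCounter {
    private long count = 0;
    private final Object lock = new Object();

    void increment() {
        synchronized (lock) {   // every thread serialises here
            count++;
        }
    }

    // Run `threads` workers, each doing `perThread` increments.
    static long run(int threads, int perThread) {
        ContendedCounter c = new ContendedCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) c.increment();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return c.count;
    }

    public static void main(String[] args) {
        // Correct total, but ~1 core of useful work however many threads run.
        System.out.println(run(4, 100_000));
    }
}
```

Profiling this under async-profiler's lock mode shows nearly all wall-clock time attributed to waiting on `lock`.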
Two symptoms that suggest "look at locks"
1. CPU plateaus below 100% under load.
You add load. Throughput stops climbing. CPU sits at 40-60%.
Something is making threads wait.
2. Latency rises faster than throughput.
Each new request waits longer at the lock. Little's Law (L = λW) ties queue length to arrival rate and wait time:
with a fixed service rate, wait time climbs steeply as arrivals approach it, and the queue grows with it.
Neither is conclusive (could be slow I/O, slow downstream service, GC pause). Both are signals to pick up an off-CPU profiler.
Tools to find the contended lock
- Java: async-profiler in lock mode (flame graph of who waits where); JFR jdk.JavaMonitorEnter events; jstack thread dumps (look for threads BLOCKED on a monitor)
- Go: runtime.SetMutexProfileFraction(1), then go tool pprof http://localhost:6060/debug/pprof/mutex
- Linux: perf record -e sched:sched_stat_sleep (off-CPU time); perf trace -e 'syscalls:sys_enter_futex' (kernel-level lock waits)
- Python: the GIL is usually the main source of contention; py-spy shows it. For application locks, instrument by hand.
A futex (the Linux primitive behind most user-space locks) shows up in perf traces; many futex enters with long waits is the kernel-level signature of a hot lock.
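Instrumenting an application lock by hand is straightforward: wrap the lock and time each acquisition. A sketch under invented names (`TimedLock`, `averageWaitNanos` are not standard API), using the real `java.util.concurrent.locks.ReentrantLock`:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hand-rolled instrumentation for an application lock (the
// "instrument by hand" option above). Wraps a ReentrantLock and
// records how long each acquisition waited.
public class TimedLock {
    private final ReentrantLock inner = new ReentrantLock();
    private final AtomicLong waitNanos = new AtomicLong();
    private final AtomicLong acquisitions = new AtomicLong();

    public void lock() {
        long t0 = System.nanoTime();
        inner.lock();                               // blocks if contended
        waitNanos.addAndGet(System.nanoTime() - t0);
        acquisitions.incrementAndGet();
    }

    public void unlock() {
        inner.unlock();
    }

    // Average nanoseconds spent waiting per acquisition: the number
    // a contention dashboard would chart. Rising under load = hot lock.
    public long averageWaitNanos() {
        long n = acquisitions.get();
        return n == 0 ? 0 : waitNanos.get() / n;
    }

    public static void main(String[] args) {
        TimedLock l = new TimedLock();
        l.lock();
        l.unlock();
        System.out.println(l.averageWaitNanos() + " ns avg wait");
    }
}
```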
How to fix
In order of preference (cheapest first):
1. Shrink the critical section. Move work outside the lock. Most critical sections do too much. The pattern: compute the new value before acquiring the lock; acquire; install; release. Don't compute under the lock.
2. Reduce hold time. Replace expensive operations inside the lock with cheaper ones. A synchronized block that does I/O is a serious mistake; do the I/O outside.
3. Shard the lock. Split the protected resource into N partitions, each with its own lock. ConcurrentHashMap does this internally (per-bin locking). Cache servers shard by key hash.
4. Go lock-free. Replace the lock with atomic operations on the protected state: AtomicLong instead of a synchronized counter, atomic.Pointer instead of a mutex around a config object.
5. Eliminate sharing. The shared state itself might be unnecessary. Per-thread state with periodic merging often works for counters and metrics.
Each step adds complexity. Stop at the first one that brings the workload under the contention threshold.
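Step 1, the compute-before-acquire pattern, can be sketched as follows (class and method names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Pattern 1 above: compute the new value before taking the lock,
// then hold the lock only for the install.
public class ConfigHolder {
    private final Object lock = new Object();
    private Map<String, String> config = new HashMap<>();

    // Bad: parsing (the expensive part) happens while holding the lock.
    public void updateSlow(String raw) {
        synchronized (lock) {
            config = parse(raw);
        }
    }

    // Good: parse outside, hold the lock only for one reference swap.
    public void updateFast(String raw) {
        Map<String, String> fresh = parse(raw);  // expensive work, no lock held
        synchronized (lock) {
            config = fresh;                      // critical section: one store
        }
    }

    public String get(String key) {
        synchronized (lock) { return config.get(key); }
    }

    private static Map<String, String> parse(String raw) {
        Map<String, String> m = new HashMap<>();
        for (String pair : raw.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) m.put(kv[0].trim(), kv[1].trim());
        }
        return m;
    }

    public static void main(String[] args) {
        ConfigHolder c = new ConfigHolder();
        c.updateFast("a=1,b=2");
        System.out.println(c.get("a"));
    }
}
```

The critical section in `updateFast` shrinks from "parse a whole config" to a single reference assignment; contention drops proportionally.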
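Steps 4 and 5 are both in the JDK already: AtomicLong is the lock-free counter, and LongAdder goes further by striping the count across per-thread-ish cells and merging them on read, which is exactly the "per-thread state with periodic merging" pattern. A short sketch (`AdderDemo` and `addConcurrently` are invented names):

```java
import java.util.concurrent.atomic.LongAdder;

// Step 5 in practice: LongAdder stripes increments across internal
// cells so concurrent writers rarely touch the same memory, then
// sums the cells on read. (Step 4 would be an AtomicLong: lock-free,
// but still one contended location.)
public class AdderDemo {
    static long addConcurrently(int threads, int perThread) {
        LongAdder total = new LongAdder();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) total.increment();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.sum();   // merge of the per-thread cells
    }

    public static void main(String[] args) {
        System.out.println(addConcurrently(4, 100_000));
    }
}
```

The trade-off: `sum()` is not a point-in-time snapshot under concurrent writes, which is fine for counters and metrics but not for invariants.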
False sharing vs real contention
These look similar (high block time, low throughput) but have different fixes.
Real contention: threads explicitly waiting on the same lock or atomic. The fix is reducing or eliminating the sharing.
False sharing: threads modifying different fields that happen to share a cache line. No logical contention but cache-line ping-pong between cores. Symptom: high CPU but low IPC (instructions per cycle); cache-miss profile shows hot fields.
The fix for false sharing is padding: ensure independent fields live on different cache lines. @Contended in Java, alignas(64) in C++ (_Alignas(64) in C).
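In Java, @Contended needs the -XX:-RestrictContended JVM flag, so manual padding is the portable sketch. A minimal illustration (field layout is ultimately JVM-dependent, and an optimising JIT may elide unused fields; real implementations such as the JDK's Striped64 guard against this):

```java
// Manual cache-line padding: seven unused longs (56 bytes) between
// the two hot fields push them onto separate 64-byte cache lines,
// so two threads updating a and b stop ping-ponging one line.
public class PaddedCounters {
    volatile long a;
    long p1, p2, p3, p4, p5, p6, p7;   // padding, never read or written
    volatile long b;

    public static void main(String[] args) {
        PaddedCounters p = new PaddedCounters();
        p.a = 1;
        p.b = 2;
        System.out.println(p.a + p.b);
    }
}
```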
Don't optimise contention without measuring it first. "This lock looks contended" is wrong half the time. Profile first. And once contention is found, the biggest wins are usually in shrinking the critical section, not in fancier locking. Most synchronized blocks do work that could comfortably live outside the lock.
For most applications, plain synchronized (or Mutex) is fine. Reach for ConcurrentHashMap, sharded structures, or lock-free data structures only when a profiler points to a specific lock as the bottleneck.
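When a profiler does point at one lock, sharding (fix 3 above) looks like this by hand. A sketch with invented names (`ShardedMap` is not a library class); ConcurrentHashMap does something similar internally with per-bin locking:

```java
import java.util.HashMap;
import java.util.Map;

// Fix 3: split the protected resource into N partitions, each with
// its own lock, chosen by key hash. Contention drops by ~N for
// uniformly distributed keys.
public class ShardedMap<K, V> {
    private static final int SHARDS = 16;
    private final Map<K, V>[] maps;
    private final Object[] locks = new Object[SHARDS];

    @SuppressWarnings("unchecked")
    public ShardedMap() {
        maps = new Map[SHARDS];
        for (int i = 0; i < SHARDS; i++) {
            maps[i] = new HashMap<>();
            locks[i] = new Object();
        }
    }

    private int shard(Object key) {
        return (key.hashCode() & 0x7fffffff) % SHARDS;   // non-negative index
    }

    public V put(K key, V value) {
        int s = shard(key);
        synchronized (locks[s]) { return maps[s].put(key, value); }
    }

    public V get(K key) {
        int s = shard(key);
        synchronized (locks[s]) { return maps[s].get(key); }
    }

    public static void main(String[] args) {
        ShardedMap<String, Integer> m = new ShardedMap<>();
        m.put("x", 1);
        System.out.println(m.get("x"));
    }
}
```

The cost: cross-shard operations (size, iteration, atomic multi-key updates) become hard, which is one reason to prefer ConcurrentHashMap when it fits.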
Implementations
async-profiler can collect lock events: where threads waited for monitor entry. The output is a flame graph of contention. The hottest lock is the bottleneck.
// Run with async-profiler in lock mode:
// ./profiler.sh -d 60 -e lock -f locks.html <pid>
//
// The HTML flame graph shows methods grouped by lock object.
// Wide flames = lots of contention on that lock.
//
// Sample output: 60% of profiled time was in synchronized(cacheMap)
// → cacheMap is the contention bottleneck.
//
// Fix options:
// 1. Replace synchronized(cacheMap) with ConcurrentHashMap
// 2. Shard the cache into 16 partitions
// 3. Move the contended operation outside the lock

Key points
- Symptom: high time spent in lock acquisition / blocked / parked states.
- Java: JFR 'jdk.JavaMonitorEnter' events, async-profiler 'lock' mode, jstack thread dumps.
- Linux: perf trace -e 'sched:*' or perf record with off-CPU profiling.
- Fix priority: shrink critical section > shard lock > lock-free structure > redesign to avoid sharing.
- False sharing looks like contention; measure cache-miss rate to distinguish.
Follow-up questions
▸ How is contention identified as the bottleneck?
▸ Shrink critical section vs shard lock: which first?
▸ What's the difference between contention and false sharing?
▸ Is GIL contention measurable in CPython?
Gotchas
- CPU at 100%: contention is NOT the bottleneck; the system is CPU-bound, optimise the code.
- CPU low but throughput limited: a lock or I/O is the bottleneck; profile to find which.
- Confusing high CPU with high contention: profile both on-CPU AND off-CPU.
- Profiler overhead: high sampling rates can themselves cause perceived contention.
- Optimising a non-bottleneck: makes the system more complex without helping.