Measuring Lock Contention
Lock contention is threads spending significant time waiting for locks instead of running. Measure it with a profiler's block-time view, JFR/async-profiler lock events, or perf events for futex contention. The fix is usually one of: shrink the critical section, shard the lock, replace the lock with a lock-free structure, or remove the shared state entirely.
What it is
Lock contention is when threads spend significant time waiting for locks instead of running. The symptom: throughput plateaus while CPU utilisation sits well below 100%. Somewhere, threads are queuing for a shared resource that only one of them can use at a time.
Diagnosing contention means finding the bottleneck lock. Fixing it means shrinking the critical section, sharding the lock, or removing the sharing entirely.
A picture of contention
Wall-clock time ------------------------------------------------------>
Thread A: ███-░░░░░░░░░░░░░░-███-░░░░░░░░░░░░-███-░░░░░░░░░░░░-███
Thread B: ░░░-███-----------░░░-░░░-███-------░░░-░░░-███-------░░░
Thread C: ░░░-░░░-----------███-░░░-░░░-------███-░░░-░░░-------███
Thread D: ░░░-░░░-----------░░░-███-░░░-------░░░-███-░░░-------░░░
███ = inside the critical section, doing work
░░░ = waiting on the lock
4 threads, 1 contended lock.
Throughput ≈ 1 thread's worth of work, no matter how many threads run.
Cores 2, 3, 4 sit idle most of the time.
CPU utilisation looks low; latency at the lock looks bad.
Adding cores does not help when the bottleneck is one lock that everyone has to take.
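The plateau in the diagram can be reproduced in a few lines. A minimal sketch (class and method names invented here): four threads increment one counter under a single lock. The result is correct, but every increment serialises through the critical section, so throughput is roughly one thread's worth regardless of core count.

```java
// Sketch of the diagram above: four threads hammer one lock.
// All names here are illustrative, not from the original text.
public class ContendedCounter {
    private long count = 0;
    private final Object lock = new Object();

    void increment() {
        synchronized (lock) {   // every thread serialises here
            count++;
        }
    }

    // Run `threads` workers, each doing `perThread` increments.
    static long run(int threads, int perThread) {
        ContendedCounter c = new ContendedCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) c.increment();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return c.count;
    }

    public static void main(String[] args) {
        // Correct total, but ~1 core of useful work however many threads run.
        System.out.println(run(4, 100_000));
    }
}
```

Profiling this under async-profiler's lock mode shows nearly all wall-clock time attributed to waiting on `lock`.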
Two symptoms that suggest "look at locks"
1. CPU plateaus below 100% under load.
You add load. Throughput stops climbing. CPU sits at 40-60%.
Something is making threads wait.
2. Latency rises faster than throughput.
Each new request waits longer at the lock. Little's Law (L = λW) ties queue length to arrival rate and wait time:
with a fixed service rate, wait time climbs steeply as arrivals approach it, and the queue grows with it.
Neither is conclusive (could be slow I/O, slow downstream service, GC pause). Both are signals to pick up an off-CPU profiler.
Tools to find the contended lock
- Java: async-profiler in lock mode (flame graph of who waits where); JFR jdk.JavaMonitorEnter events; jstack thread dumps (look for threads BLOCKED on a monitor)
- Go: runtime.SetMutexProfileFraction(1), then go tool pprof http://localhost:6060/debug/pprof/mutex
- Linux: perf record -e sched:sched_stat_sleep (off-CPU time); perf trace -e 'syscalls:sys_enter_futex' (kernel-level lock waits)
- Python: the GIL is usually the main source of contention; py-spy shows it. For application locks, instrument by hand.
A futex (the Linux primitive behind most user-space locks) shows up in perf traces; many futex enters with long waits is the kernel-level signature of a hot lock.
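Instrumenting an application lock by hand is straightforward: wrap the lock and time each acquisition. A sketch under invented names (`TimedLock`, `averageWaitNanos` are not standard API), using the real `java.util.concurrent.locks.ReentrantLock`:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hand-rolled instrumentation for an application lock (the
// "instrument by hand" option above). Wraps a ReentrantLock and
// records how long each acquisition waited.
public class TimedLock {
    private final ReentrantLock inner = new ReentrantLock();
    private final AtomicLong waitNanos = new AtomicLong();
    private final AtomicLong acquisitions = new AtomicLong();

    public void lock() {
        long t0 = System.nanoTime();
        inner.lock();                               // blocks if contended
        waitNanos.addAndGet(System.nanoTime() - t0);
        acquisitions.incrementAndGet();
    }

    public void unlock() {
        inner.unlock();
    }

    // Average nanoseconds spent waiting per acquisition: the number
    // a contention dashboard would chart. Rising under load = hot lock.
    public long averageWaitNanos() {
        long n = acquisitions.get();
        return n == 0 ? 0 : waitNanos.get() / n;
    }

    public static void main(String[] args) {
        TimedLock l = new TimedLock();
        l.lock();
        l.unlock();
        System.out.println(l.averageWaitNanos() + " ns avg wait");
    }
}
```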
How to fix
In order of preference (cheapest first):
1. Shrink the critical section. Move work outside the lock. Most critical sections do too much. The pattern: compute the new value before acquiring the lock; acquire; install; release. Don't compute under the lock.
2. Reduce hold time. Replace expensive operations inside the lock with cheaper ones. A synchronized block that does I/O is a serious mistake; do the I/O outside.
3. Shard the lock. Split the protected resource into N partitions, each with its own lock. ConcurrentHashMap does this internally (per-bin locking). Cache servers shard by key hash.
4. Go lock-free. Replace the lock with atomic operations on the protected state: AtomicLong instead of a synchronized counter, atomic.Pointer instead of a mutex around a config object.
5. Eliminate sharing. The shared state itself might be unnecessary. Per-thread state with periodic merging often works for counters and metrics.
Each step adds complexity. Stop at the first one that brings the workload under the contention threshold.
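Step 1, the compute-before-acquire pattern, can be sketched as follows (class and method names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Pattern 1 above: compute the new value before taking the lock,
// then hold the lock only for the install.
public class ConfigHolder {
    private final Object lock = new Object();
    private Map<String, String> config = new HashMap<>();

    // Bad: parsing (the expensive part) happens while holding the lock.
    public void updateSlow(String raw) {
        synchronized (lock) {
            config = parse(raw);
        }
    }

    // Good: parse outside, hold the lock only for one reference swap.
    public void updateFast(String raw) {
        Map<String, String> fresh = parse(raw);  // expensive work, no lock held
        synchronized (lock) {
            config = fresh;                      // critical section: one store
        }
    }

    public String get(String key) {
        synchronized (lock) { return config.get(key); }
    }

    private static Map<String, String> parse(String raw) {
        Map<String, String> m = new HashMap<>();
        for (String pair : raw.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) m.put(kv[0].trim(), kv[1].trim());
        }
        return m;
    }

    public static void main(String[] args) {
        ConfigHolder c = new ConfigHolder();
        c.updateFast("a=1,b=2");
        System.out.println(c.get("a"));
    }
}
```

The critical section in `updateFast` shrinks from "parse a whole config" to a single reference assignment; contention drops proportionally.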
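Steps 4 and 5 are both in the JDK already: AtomicLong is the lock-free counter, and LongAdder goes further by striping the count across per-thread-ish cells and merging them on read, which is exactly the "per-thread state with periodic merging" pattern. A short sketch (`AdderDemo` and `addConcurrently` are invented names):

```java
import java.util.concurrent.atomic.LongAdder;

// Step 5 in practice: LongAdder stripes increments across internal
// cells so concurrent writers rarely touch the same memory, then
// sums the cells on read. (Step 4 would be an AtomicLong: lock-free,
// but still one contended location.)
public class AdderDemo {
    static long addConcurrently(int threads, int perThread) {
        LongAdder total = new LongAdder();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) total.increment();
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.sum();   // merge of the per-thread cells
    }

    public static void main(String[] args) {
        System.out.println(addConcurrently(4, 100_000));
    }
}
```

The trade-off: `sum()` is not a point-in-time snapshot under concurrent writes, which is fine for counters and metrics but not for invariants.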
False sharing vs real contention
These look similar (high block time, low throughput) but have different fixes.
Real contention: threads explicitly waiting on the same lock or atomic. The fix is reducing or eliminating the sharing.
False sharing: threads modifying different fields that happen to share a cache line. No logical contention but cache-line ping-pong between cores. Symptom: high CPU but low IPC (instructions per cycle); cache-miss profile shows hot fields.
The fix for false sharing is padding: ensure independent fields live on different cache lines. @Contended in Java, alignas(64) in C++ (_Alignas(64) in C).
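In Java, @Contended needs the -XX:-RestrictContended JVM flag, so manual padding is the portable sketch. A minimal illustration (field layout is ultimately JVM-dependent, and an optimising JIT may elide unused fields; real implementations such as the JDK's Striped64 guard against this):

```java
// Manual cache-line padding: seven unused longs (56 bytes) between
// the two hot fields push them onto separate 64-byte cache lines,
// so two threads updating a and b stop ping-ponging one line.
public class PaddedCounters {
    volatile long a;
    long p1, p2, p3, p4, p5, p6, p7;   // padding, never read or written
    volatile long b;

    public static void main(String[] args) {
        PaddedCounters p = new PaddedCounters();
        p.a = 1;
        p.b = 2;
        System.out.println(p.a + p.b);
    }
}
```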
Don't optimise contention without measuring it first. "This lock looks contended" is wrong half the time. Profile first. And once contention is found, the biggest wins are usually in shrinking the critical section, not in fancier locking. Most synchronized blocks do work that could comfortably live outside the lock.
For most applications, plain synchronized (or Mutex) is fine. Reach for ConcurrentHashMap, sharded structures, or lock-free data structures only when a profiler points to a specific lock as the bottleneck.
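When a profiler does point at one lock, sharding (fix 3 above) looks like this by hand. A sketch with invented names (`ShardedMap` is not a library class); ConcurrentHashMap does something similar internally with per-bin locking:

```java
import java.util.HashMap;
import java.util.Map;

// Fix 3: split the protected resource into N partitions, each with
// its own lock, chosen by key hash. Contention drops by ~N for
// uniformly distributed keys.
public class ShardedMap<K, V> {
    private static final int SHARDS = 16;
    private final Map<K, V>[] maps;
    private final Object[] locks = new Object[SHARDS];

    @SuppressWarnings("unchecked")
    public ShardedMap() {
        maps = new Map[SHARDS];
        for (int i = 0; i < SHARDS; i++) {
            maps[i] = new HashMap<>();
            locks[i] = new Object();
        }
    }

    private int shard(Object key) {
        return (key.hashCode() & 0x7fffffff) % SHARDS;   // non-negative index
    }

    public V put(K key, V value) {
        int s = shard(key);
        synchronized (locks[s]) { return maps[s].put(key, value); }
    }

    public V get(K key) {
        int s = shard(key);
        synchronized (locks[s]) { return maps[s].get(key); }
    }

    public static void main(String[] args) {
        ShardedMap<String, Integer> m = new ShardedMap<>();
        m.put("x", 1);
        System.out.println(m.get("x"));
    }
}
```

The cost: cross-shard operations (size, iteration, atomic multi-key updates) become hard, which is one reason to prefer ConcurrentHashMap when it fits.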
Implementations
async-profiler can collect lock events: where threads waited for monitor entry. The output is a flame graph of contention. The hottest lock is the bottleneck.
// Run with async-profiler in lock mode:
// ./profiler.sh -d 60 -e lock -f locks.html <pid>
//
// The HTML flame graph shows methods grouped by lock object.
// Wide flames = lots of contention on that lock.
//
// Sample output: 60% of profiled time was in synchronized(cacheMap)
// → cacheMap is the contention bottleneck.
//
// Fix options:
// 1. Replace synchronized(cacheMap) with ConcurrentHashMap
// 2. Shard the cache into 16 partitions
// 3. Move the contended operation outside the lock

Key points
- Symptom: high time spent in lock acquisition / blocked / parked states.
- Java: JFR 'jdk.JavaMonitorEnter' events, async-profiler 'lock' mode, jstack thread dumps.
- Linux: perf trace -e 'sched:*' or perf record with off-CPU profiling.
- Fix priority: shrink critical section > shard lock > lock-free structure > redesign to avoid sharing.
- False sharing looks like contention; measure cache-miss rate to distinguish.
Follow-up questions
▸ How is contention identified as the bottleneck?
▸ Shrink critical section vs shard lock: which first?
▸ What's the difference between contention and false sharing?
▸ Is GIL contention measurable in CPython?
Gotchas
- CPU at 100%: contention is NOT the bottleneck; the system is CPU-bound, optimise the code.
- CPU low but throughput limited: a lock or I/O is the bottleneck; profile to find which.
- Confusing high CPU with high contention: profile both on-CPU AND off-CPU.
- Profiler overhead: high sampling rates can themselves cause perceived contention.
- Optimising a non-bottleneck: makes the system more complex without helping.