Lock Convoys
A lock convoy happens when threads contending for a lock end up serialized in the kernel wait queue. The first thread runs and releases. The next thread is woken from a kernel sleep, which costs a context switch and a cache miss. The lock is fair, but throughput drops because every acquire pays a kernel round trip even when the critical section is microseconds. Common in fair mutexes under steady contention.
What a convoy looks like in plain words
The setup is simple. There is a short critical section, maybe 50 nanoseconds of real work. Many threads want to run it. The lock is fair, meaning whoever has been waiting longest goes next. That sounds reasonable, and it tanks throughput.
The reason is the cost of waking a sleeping thread. When a fair lock is released, the next waiter has been parked by the kernel and has to be woken up. That wake-up costs the kernel roughly 3 to 10 microseconds: a context switch out of whatever was running on the target core, restoring the registers and stack of the woken thread, and finally getting it onto a CPU. The actual work the thread is about to do is 50 nanoseconds. So every operation pays roughly 5 microseconds of "wake the next person up" plus 50 nanoseconds of real work. The wake-up dominates by two orders of magnitude.
The math behind the convoy:
- Per-operation wall time: about 5 µs of wake-up plus 50 ns of work ≈ 5 µs.
- Steady-state throughput: about 200,000 operations per second, regardless of core count.
Throwing 100 cores at the problem still serves only around 200K operations per second, because that is how fast the kernel can wake one thread at a time. Adding more threads adds more waiters but does not add more wake-ups per second.
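The ceiling falls out of simple arithmetic. A quick sanity check, using the same rough estimates as above (5 µs wakeup, 50 ns of work — estimates, not measurements):

```java
// Back-of-envelope check of the convoy throughput ceiling.
// The 5 µs / 50 ns figures are the rough estimates from the text.
class ConvoyMath {
    static double ceilingOpsPerSec(double wakeupSec, double workSec) {
        // Each operation pays the wakeup plus the work, serially.
        return 1.0 / (wakeupSec + workSec);
    }

    public static void main(String[] args) {
        double ops = ceilingOpsPerSec(5e-6, 50e-9);
        // ~198,000 ops/sec, no matter how many cores are available
        System.out.printf("ceiling ~ %.0f ops/sec%n", ops);
    }
}
```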
Why "fair" causes it
The convoy is specifically a fair-lock problem. An unfair lock under the same workload behaves completely differently.
An unfair lock lets the thread that just released the lock grab it again immediately if it wants to. The lock's cache line is already hot in that thread's L1, the lock state is local, no kernel call is needed. If one thread is doing all the work in a tight loop, it keeps the lock and goes fast. The other waiters stay parked, but that does not slow anyone down.
A fair lock refuses. The releasing thread has to go to the back of the line, even if everyone in the line is asleep and even if the releasing thread has more work ready to go right now. Every single acquire pays the kernel-wakeup tax.
The contrast on a contended lock with three waiters:
| Step | Unfair lock | Fair lock (convoy) |
|---|---|---|
| 1 | T1 acquires, works, releases (cache hot, no wake-up) | T1 acquires, works, releases |
| 2 | T1 reacquires immediately, works, releases | Kernel wakes T2 (~5 µs) |
| 3 | T1 reacquires again, works, releases | T2 acquires, works, releases |
| 4 | ... T1 keeps the lock for many iterations | Kernel wakes T3 (~5 µs) |
| 5 | Eventually T1 yields; T2 wakes once | T3 acquires, works, releases |
| 6 | T2 runs a batch, then yields; T3 wakes once | Every step pays ~5 µs of kernel wake-up |
| Throughput | Tens of millions of operations per second | About 200K operations per second |
The unfair lock amortises the kernel wake-up cost across many acquisitions by the same thread; the fair lock pays it on every single acquisition. The throughput gap is two orders of magnitude.
"Fair" is not the same as "fast"
Fairness is a property of the lock's ordering: it controls who gets the lock next. It says nothing about how fast anyone gets it. Fair locks are slower in steady state and have worse tail latency under sustained contention than unfair locks. Use them only when starvation is a measured, demonstrated problem, not as a precaution.
How to spot one
- Throughput goes flat as more threads are added, but no CPU is saturated.
- `top` shows lots of `S` (sleeping) threads.
- A mutex profiler shows the contended lock has cumulative wait time several orders of magnitude higher than the critical section.
- Latency histograms show a fat shoulder at roughly the OS wakeup latency, repeating per acquire.
- Removing the lock entirely (or replacing with an atomic) shows a 100x throughput jump.
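If the suspect lock happens to be a `ReentrantLock`, one cheap extra signal is to sample its wait-queue length: a persistently non-zero queue in front of a nanosecond-scale critical section is convoy-shaped. A minimal sketch (`ConvoyProbe` is an illustrative name; `getQueueLength()` is the JDK's monitoring estimate, not an exact count):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative probe: periodically sample how many threads are parked
// waiting on a suspect lock. getQueueLength() is documented as an
// estimate for monitoring, not for synchronization decisions.
class ConvoyProbe {
    static String sample(ReentrantLock lock) {
        int waiters = lock.getQueueLength();
        return waiters > 0
            ? "parked waiters: " + waiters + " (possible convoy)"
            : "no waiters";
    }
}
```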
How to fix it
Three buckets of fixes, in order of preference:
1. Remove the contention. Shard the data structure (LongAdder, per-CPU caches, RCU). This is the only fix that scales. A design that requires one mutable thing serializing every request will have a convoy or a bottleneck regardless of lock choice.
2. Use an unfair lock with a starvation backstop. Java's default ReentrantLock, Go's default sync.Mutex, and glibc's default pthread_mutex_t all allow barging in steady state and force fairness only when someone has waited too long. Best of both worlds for almost every workload.
3. Make the critical section so short that nobody blocks. If the section runs in a few hundred nanoseconds and contention is light, threads spin briefly in the lock's fast path and never reach the kernel. This is what synchronized and ReentrantLock both do well in the uncontended case.
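The "spin briefly, never park" idea can be made explicit with tryLock. A sketch, assuming a nanosecond-scale critical section; the class name and the spin budget of 64 are illustrative, not tuned values:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: spin briefly on tryLock before falling back to a blocking
// acquire, so light contention never reaches the kernel. The spin
// budget (64 attempts) is illustrative, not a tuned value.
class SpinThenBlockCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private long count = 0;

    void inc() {
        for (int i = 0; i < 64; i++) {
            if (lock.tryLock()) {       // uncontended fast path: one CAS
                try { count++; } finally { lock.unlock(); }
                return;
            }
            Thread.onSpinWait();        // JDK 9+ hint: we are busy-waiting
        }
        lock.lock();                    // give up and park in the kernel
        try { count++; } finally { lock.unlock(); }
    }

    long get() {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }
}
```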
When a fair lock is actually warranted
- Real-time systems where bounded acquire latency matters more than throughput.
- Priority-inversion-prone systems where a low-priority owner could starve a high-priority waiter.
- In normal application code: never as a default.
Primitives (Java)
- ReentrantLock(true) (fair, convoy-prone)
- ReentrantLock() (unfair, the default, much harder to convoy)
- synchronized (unfair, with adaptive spinning; biased locking was removed in JDK 18)
Implementations
A fair ReentrantLock makes the JVM grant the lock in arrival order. Under steady contention, every release wakes the next thread from park, which costs a kernel transition. Throughput is bounded by wakeup cost, not by the critical section. Switching to the unfair default lets the releasing thread re-acquire immediately if it needs to, skipping the wait queue entirely.
```java
import java.util.concurrent.locks.ReentrantLock;

class CounterFair {
    // FAIR mode: convoy-prone under sustained contention
    private final ReentrantLock lock = new ReentrantLock(true);
    private long count = 0;

    void inc() {
        lock.lock();
        try { count++; } finally { lock.unlock(); }
    }
}

class CounterUnfair {
    // Default UNFAIR: throughput-friendly, allows lock barging
    private final ReentrantLock lock = new ReentrantLock();
    private long count = 0;

    void inc() {
        lock.lock();
        try { count++; } finally { lock.unlock(); }
    }
}

// Bench (8 threads, 10 s, JMH):
//   CounterFair    ~150K ops/sec  (each acquire pays ~5 µs wakeup)
//   CounterUnfair  ~80M  ops/sec  (releaser re-acquires inline)
//   AtomicLong     ~250M ops/sec  (no lock at all)
```

Once a workload is in convoy territory, the lock itself is the wrong tool. LongAdder shards counters per CPU and recombines on read. Reads cost more, writes are nearly contention-free. This is the standard escape hatch for hot counters.
```java
import java.util.concurrent.atomic.LongAdder;

class CounterSharded {
    private final LongAdder count = new LongAdder();

    void inc() { count.increment(); }
    long sum() { return count.sum(); }
}

// 8 threads: ~600M ops/sec, no lock involved
// Read cost: O(num cells) instead of O(1), so use LongAdder when
// writes dominate and reads are infrequent (metrics scrapes, etc.)
```

Key points
- All N threads end up in the kernel wait queue. Each acquire pays a wakeup, a context switch, and a cold-cache restart.
- Critical section is short, but wall-clock time per acquire is dominated by the wakeup cost (microseconds), not the section itself (nanoseconds).
- Triggered by fair locks (FIFO wakeup), high contention, and short critical sections. The trio is the warning sign.
- Symptom: throughput plateaus or drops as more threads are added, but no thread is CPU bound; perf shows time inside futex/wait, not inside the critical section.
- Fixes: reduce contention (sharding, RCU, lock-free), prefer adaptive/unfair locks (default in glibc, Java ReentrantLock without fair=true), or restructure so the section is so short that nobody ever blocks.
Follow-up questions
- Why does fairness make convoys worse?
- Glibc moved away from fair futexes years ago. Why is this still a problem?
- How is a convoy different from regular lock contention?
- Can a convoy happen without a lock?
Gotchas
- Adding a 'just to be safe' fair lock to fix a starvation complaint can tank throughput by 100x.
- Convoys are time-of-day sensitive: light load looks fine, peak load suddenly cliffs. Load-test under realistic concurrency.
- GC pauses and STW phases extend critical sections enough to push a healthy unfair mutex into convoy mode for a few seconds.
- Profilers that sample the running thread won't show the convoy: time is inside the kernel scheduler, not application code. Use mutex/contention profilers.
Common pitfalls
- Concluding from microbenchmarks that 'lock is cheap' and using one lock for a hot path. Microbenchmarks don't have the steady contention that triggers convoys.
- Switching from synchronized to ReentrantLock(true) for 'better fairness' on a hot lock.
- Trying to fix a convoy by adding more workers. More workers means a longer queue, which makes the convoy worse.
APIs worth memorising
- java.util.concurrent.locks.ReentrantLock(boolean fair)
- runtime.SetMutexProfileFraction (Go)
- futex(2) (Linux)
Common in monitoring code paths (one shared rate limiter or counter, every request takes the lock), in connection pool acquisition under saturation, and in old-style BlockingQueue producers when the consumer is slightly behind. The fix is almost always to remove the contention (sharding, lock-free, batching), not to tune the lock.
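When the hot state is not a plain counter (so LongAdder does not apply directly), the same remove-the-contention idea can be sketched as lock striping: split the one hot lock into N locks keyed by hash, so most acquisitions land on different stripes. `StripedCounters` and the stripe count of 16 are illustrative choices, not a library API:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: stripe one hot lock into 16 locks keyed by hash so that
// unrelated keys rarely contend. Class name and stripe count are
// illustrative; real code would size stripes to the core count.
class StripedCounters {
    private static final int STRIPES = 16; // power of two: cheap masking
    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
    private final long[] counts = new long[STRIPES];

    StripedCounters() {
        for (int i = 0; i < STRIPES; i++) locks[i] = new ReentrantLock();
    }

    void inc(String key) {
        int idx = key.hashCode() & (STRIPES - 1); // mask handles negatives
        locks[idx].lock();
        try { counts[idx]++; } finally { locks[idx].unlock(); }
    }

    long total() { // reads pay O(STRIPES), like LongAdder.sum()
        long sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            locks[i].lock();
            try { sum += counts[i]; } finally { locks[i].unlock(); }
        }
        return sum;
    }
}
```

Like LongAdder, this trades read cost for nearly contention-free writes; it only helps when operations are spread across many keys.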