Lock Convoys
A lock convoy happens when threads contending for a lock end up serialized in the kernel wait queue. The first thread runs and releases. The next thread is woken from a kernel sleep, which costs a context switch and a cache miss. The lock is fair, but throughput drops because every acquire pays a kernel round trip even when the critical section is microseconds. Common in fair mutexes under steady contention.
What a convoy looks like in plain words
The setup is simple. There is a short critical section, maybe 50 nanoseconds of real work. Many threads want to run it. The lock is fair, meaning whoever has been waiting longest goes next. That sounds reasonable, and it tanks throughput.
The reason is the cost of waking a sleeping thread. When a fair lock is released, the next waiter has been parked by the kernel and has to be woken up. That wake-up costs the kernel roughly 3 to 10 microseconds: a context switch out of whatever was running on the target core, restoring the registers and stack of the woken thread, and finally getting it onto a CPU. The actual work the thread is about to do is 50 nanoseconds. So every operation pays roughly 5 microseconds of "wake the next person up" plus 50 nanoseconds of real work. The wake-up dominates by two orders of magnitude.
The math behind the convoy:
- Per-operation wall time: about 5 µs of wake-up plus 50 ns of work ≈ 5 µs.
- Steady-state throughput: about 200,000 operations per second, regardless of core count.
Throwing 100 cores at the problem still serves only around 200K operations per second, because that is how fast the kernel can wake one thread at a time. Adding more threads adds more waiters but does not add more wake-ups per second.
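The ceiling falls out of simple arithmetic. A quick sanity check, using the same rough estimates as above (5 µs wakeup, 50 ns of work — estimates, not measurements):

```java
// Back-of-envelope check of the convoy throughput ceiling.
// The 5 µs / 50 ns figures are the rough estimates from the text.
class ConvoyMath {
    static double ceilingOpsPerSec(double wakeupSec, double workSec) {
        // Each operation pays the wakeup plus the work, serially.
        return 1.0 / (wakeupSec + workSec);
    }

    public static void main(String[] args) {
        double ops = ceilingOpsPerSec(5e-6, 50e-9);
        // ~198,000 ops/sec, no matter how many cores are available
        System.out.printf("ceiling ~ %.0f ops/sec%n", ops);
    }
}
```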
Why "fair" causes it
The convoy is specifically a fair-lock problem. An unfair lock under the same workload behaves completely differently.
An unfair lock lets the thread that just released the lock grab it again immediately if it wants to. The lock's cache line is already hot in that thread's L1, the lock state is local, no kernel call is needed. If one thread is doing all the work in a tight loop, it keeps the lock and goes fast. The other waiters stay parked, but that does not slow anyone down.
A fair lock refuses. The releasing thread has to go to the back of the line, even if everyone in the line is asleep and even if the releasing thread has more work ready to go right now. Every single acquire pays the kernel-wakeup tax.
The contrast on a contended lock with three waiters:
| Step | Unfair lock | Fair lock (convoy) |
|---|---|---|
| 1 | T1 acquires, works, releases (cache hot, no wake-up) | T1 acquires, works, releases |
| 2 | T1 reacquires immediately, works, releases | Kernel wakes T2 (~5 µs) |
| 3 | T1 reacquires again, works, releases | T2 acquires, works, releases |
| 4 | ... T1 keeps the lock for many iterations | Kernel wakes T3 (~5 µs) |
| 5 | Eventually T1 yields; T2 wakes once | T3 acquires, works, releases |
| 6 | T2 runs a batch, then yields; T3 wakes once | Every step pays ~5 µs of kernel wake-up |
| Throughput | Tens of millions of operations per second | About 200K operations per second |
The unfair lock amortises the kernel wake-up cost across many acquisitions by the same thread; the fair lock pays it on every single acquisition. The throughput gap is two orders of magnitude.
"Fair" is not the same as "fast"
Fairness is a property of the lock's ordering: it controls who gets the lock next. It says nothing about how fast anyone gets it. Fair locks are slower in steady state and have worse tail latency under sustained contention than unfair locks. Use them only when starvation is a measured, demonstrated problem, not as a precaution.
How to spot one
- Throughput goes flat as more threads are added, but no CPU is saturated.
- `top` shows lots of `S` (sleeping) threads.
- A mutex profiler shows the contended lock has cumulative wait time several orders of magnitude higher than the critical section.
- Latency histograms show a fat shoulder at roughly the OS wakeup latency, repeating per acquire.
- Removing the lock entirely (or replacing with an atomic) shows a 100x throughput jump.
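If the suspect lock happens to be a `ReentrantLock`, one cheap extra signal is to sample its wait-queue length: a persistently non-zero queue in front of a nanosecond-scale critical section is convoy-shaped. A minimal sketch (`ConvoyProbe` is an illustrative name; `getQueueLength()` is the JDK's monitoring estimate, not an exact count):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative probe: periodically sample how many threads are parked
// waiting on a suspect lock. getQueueLength() is documented as an
// estimate for monitoring, not for synchronization decisions.
class ConvoyProbe {
    static String sample(ReentrantLock lock) {
        int waiters = lock.getQueueLength();
        return waiters > 0
            ? "parked waiters: " + waiters + " (possible convoy)"
            : "no waiters";
    }
}
```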
How to fix it
Three buckets of fixes, in order of preference:
1. Remove the contention. Shard the data structure (LongAdder, per-CPU caches, RCU). This is the only fix that scales. A design that requires one mutable thing serializing every request will have a convoy or a bottleneck regardless of lock choice.
2. Use an unfair lock with a starvation backstop. Java's default ReentrantLock, Go's default sync.Mutex, and glibc's default pthread_mutex_t all allow barging in steady state and force fairness only when someone has waited too long. Best of both worlds for almost every workload.
3. Make the critical section so short that nobody blocks. If the section runs in a few hundred nanoseconds and contention is light, threads spin briefly in the lock's fast path and never reach the kernel. This is what synchronized and ReentrantLock both do well in the uncontended case.
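The "spin briefly, never park" idea can be made explicit with tryLock. A sketch, assuming a nanosecond-scale critical section; the class name and the spin budget of 64 are illustrative, not tuned values:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: spin briefly on tryLock before falling back to a blocking
// acquire, so light contention never reaches the kernel. The spin
// budget (64 attempts) is illustrative, not a tuned value.
class SpinThenBlockCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private long count = 0;

    void inc() {
        for (int i = 0; i < 64; i++) {
            if (lock.tryLock()) {       // uncontended fast path: one CAS
                try { count++; } finally { lock.unlock(); }
                return;
            }
            Thread.onSpinWait();        // JDK 9+ hint: we are busy-waiting
        }
        lock.lock();                    // give up and park in the kernel
        try { count++; } finally { lock.unlock(); }
    }

    long get() {
        lock.lock();
        try { return count; } finally { lock.unlock(); }
    }
}
```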
When a fair lock is actually warranted
- Real-time systems where bounded acquire latency matters more than throughput.
- Priority-inversion-prone systems where a low-priority owner could starve a high-priority waiter.
- In normal application code: never as a default.
Primitives (Java)
- ReentrantLock(true) (fair, convoy-prone)
- ReentrantLock() (unfair, the default, much harder to convoy)
- synchronized (unfair, with adaptive spinning; biased locking was removed in JDK 18)
Implementations
A fair ReentrantLock makes the JVM grant the lock in arrival order. Under steady contention, every release wakes the next thread from park, which costs a kernel transition. Throughput is bounded by wakeup cost, not by the critical section. Switching to the unfair default lets the releasing thread re-acquire immediately if it needs to, skipping the wait queue entirely.
```java
import java.util.concurrent.locks.ReentrantLock;

class CounterFair {
    // FAIR mode: convoy-prone under sustained contention
    private final ReentrantLock lock = new ReentrantLock(true);
    private long count = 0;

    void inc() {
        lock.lock();
        try { count++; } finally { lock.unlock(); }
    }
}

class CounterUnfair {
    // Default UNFAIR: throughput-friendly, allows lock barging
    private final ReentrantLock lock = new ReentrantLock();
    private long count = 0;

    void inc() {
        lock.lock();
        try { count++; } finally { lock.unlock(); }
    }
}

// Bench (8 threads, 10 s, JMH):
//   CounterFair    ~150K ops/sec  (each acquire pays ~5 µs wakeup)
//   CounterUnfair  ~80M  ops/sec  (releaser re-acquires inline)
//   AtomicLong     ~250M ops/sec  (no lock at all)
```

Once a workload is in convoy territory, the lock itself is the wrong tool. LongAdder shards counters per CPU and recombines on read. Reads cost more, writes are nearly contention-free. This is the standard escape hatch for hot counters.
```java
import java.util.concurrent.atomic.LongAdder;

class CounterSharded {
    private final LongAdder count = new LongAdder();

    void inc() { count.increment(); }
    long sum() { return count.sum(); }
}

// 8 threads: ~600M ops/sec, no lock involved
// Read cost: O(num cells) instead of O(1), so use LongAdder when
// writes dominate and reads are infrequent (metrics scrapes, etc.)
```

Key points
- All N threads end up in the kernel wait queue. Each acquire pays a wakeup, a context switch, and a cold-cache restart.
- Critical section is short, but wall-clock time per acquire is dominated by the wakeup cost (microseconds), not the section itself (nanoseconds).
- Triggered by fair locks (FIFO wakeup), high contention, and short critical sections. The trio is the warning sign.
- Symptom: throughput plateaus or drops as more threads are added, but no thread is CPU bound; perf shows time inside futex/wait, not inside the critical section.
- Fixes: reduce contention (sharding, RCU, lock-free), prefer adaptive/unfair locks (default in glibc, Java ReentrantLock without fair=true), or restructure so the section is so short that nobody ever blocks.
Follow-up questions
- Why does fairness make convoys worse?
- Glibc moved away from fair futexes years ago. Why is this still a problem?
- How is a convoy different from regular lock contention?
- Can a convoy happen without a lock?
Gotchas
- Adding a 'just to be safe' fair lock to fix a starvation complaint can tank throughput by 100x.
- Convoys are time-of-day sensitive: light load looks fine, peak load suddenly cliffs. Load-test under realistic concurrency.
- GC pauses and STW phases extend critical sections enough to push a healthy unfair mutex into convoy mode for a few seconds.
- Profilers that sample the running thread won't show the convoy: time is inside the kernel scheduler, not application code. Use mutex/contention profilers.
Common pitfalls
- Concluding from microbenchmarks that 'lock is cheap' and using one lock for a hot path. Microbenchmarks don't have the steady contention that triggers convoys.
- Switching from synchronized to ReentrantLock(true) for 'better fairness' on a hot lock.
- Trying to fix a convoy by adding more workers. More workers means a longer queue, which makes the convoy worse.
APIs worth memorising
- java.util.concurrent.locks.ReentrantLock(boolean fair)
- runtime.SetMutexProfileFraction (Go)
- futex(2) (Linux)
Common in monitoring code paths (one shared rate limiter or counter, every request takes the lock), in connection pool acquisition under saturation, and in old-style BlockingQueue producers when the consumer is slightly behind. The fix is almost always to remove the contention (sharding, lock-free, batching), not to tune the lock.
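When the hot state is not a plain counter (so LongAdder does not apply directly), the same remove-the-contention idea can be sketched as lock striping: split the one hot lock into N locks keyed by hash, so most acquisitions land on different stripes. `StripedCounters` and the stripe count of 16 are illustrative choices, not a library API:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: stripe one hot lock into 16 locks keyed by hash so that
// unrelated keys rarely contend. Class name and stripe count are
// illustrative; real code would size stripes to the core count.
class StripedCounters {
    private static final int STRIPES = 16; // power of two: cheap masking
    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
    private final long[] counts = new long[STRIPES];

    StripedCounters() {
        for (int i = 0; i < STRIPES; i++) locks[i] = new ReentrantLock();
    }

    void inc(String key) {
        int idx = key.hashCode() & (STRIPES - 1); // mask handles negatives
        locks[idx].lock();
        try { counts[idx]++; } finally { locks[idx].unlock(); }
    }

    long total() { // reads pay O(STRIPES), like LongAdder.sum()
        long sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            locks[i].lock();
            try { sum += counts[i]; } finally { locks[i].unlock(); }
        }
        return sum;
    }
}
```

Like LongAdder, this trades read cost for nearly contention-free writes; it only helps when operations are spread across many keys.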