Futexes: Fast User-Space Locking
Mental Model
A bathroom with a "vacant/occupied" sign on the door. Walk up, flip the sign to "occupied," use the room, flip it back. No conversation needed. But if the sign already reads "occupied," press a buzzer and wait in the hallway. The occupant finishes, presses the buzzer to wake the next person. Flipping the sign is instantaneous. The buzzer and the hallway wait only happen when the room is actually in use.
The Problem
A kernel-only mutex costs ~200ns per lock/unlock -- syscall overhead paid even when no other thread is contending. At 10 million lock operations per second, that is 2 full CPU-seconds burned every wall-clock second on locks nobody was fighting over. A database using kernel mutexes for row-level locking would spend more time locking than querying. A lock profiler confirms the waste: 99.5% of acquisitions are uncontested.
Architecture
Every mutex in common use is hiding its true cost.
pthread_mutex_lock. sync.Mutex. Java synchronized. They all feel like simple lock/unlock operations. But underneath, they are all doing the same clever trick: avoiding the kernel entirely in the common case.
That trick is the futex.
What Actually Happens
The core of a futex-based mutex is a 32-bit word in user space. Three states: 0 (unlocked), 1 (locked, no waiters), 2 (locked, with waiters).
Locking (fast path): Try CAS(0, 1) on the futex word. If it succeeds, the thread holds the lock. Zero syscalls. ~10 nanoseconds.
Locking (slow path): CAS failed -- someone else holds the lock. Set the futex word to 2 (marking that waiters exist), then call futex(FUTEX_WAIT, &word, 2). The kernel atomically checks that the word still equals 2 and puts the thread on a wait queue. If the word changed (the lock was released between the CAS and the syscall), it returns EAGAIN immediately -- preventing the lost-wakeup race.
Unlocking (fast path): If the futex word was 1 (no waiters), just CAS to 0. No FUTEX_WAKE needed. No syscall.
Unlocking (slow path): If the futex word was 2 (has waiters), CAS to 0 AND call futex(FUTEX_WAKE, &word, 1) to wake one sleeping thread.
This is why the three-state protocol matters. Without it, every unlock would need a wake syscall "just in case." With it, the unlocker knows whether anyone is waiting and skips the syscall when nobody is.
Under the Hood
The kernel side is a hash table. Futex addresses map to wait queue buckets (256 by default). Each bucket has a spinlock protecting its chain of sleeping threads. Private futexes (within a single process) hash by virtual address. Shared futexes (across processes via shared memory) hash by the backing page's (inode, offset) -- more expensive but works regardless of differing virtual addresses.
The atomic check-and-sleep is critical. FUTEX_WAIT atomically verifies *addr == expected before sleeping. Without this, there is a race: thread A checks the word, sees it locked, but before it enters the kernel, thread B unlocks and calls FUTEX_WAKE. Thread A then enters the kernel and sleeps -- but the wake already happened. Thread A sleeps forever. The atomic check closes this window.
Priority inheritance prevents inversion. FUTEX_LOCK_PI handles the classic scenario: high-priority thread H blocks on a lock held by low-priority thread L, while medium-priority thread M preempts L. H is effectively blocked by M. With PI, the kernel boosts L's priority to H's level, ensuring L runs and releases the lock quickly.
Robust futexes handle crashes. If a thread dies holding a futex, other waiters deadlock. The kernel maintains a per-thread list of held futexes (registered via set_robust_list()). On thread death, it sets the FUTEX_OWNER_DIED bit and wakes a waiter, who can recover or abort.
Requeue avoids thundering herds. FUTEX_CMP_REQUEUE is used by pthread_cond_broadcast: instead of waking all waiters (who all contend on the mutex), it wakes one and moves the rest from the condition variable's queue to the mutex's queue. Only one thread wakes; the others stay sleeping and wake one at a time as the mutex is released.
Common Questions
Why does the kernel need to verify the value in FUTEX_WAIT?
To prevent lost wakeups. Between the user-space check and the syscall, the lock can be released and the wake sent. If the kernel does not re-verify, the thread sleeps after the wake, potentially forever. The atomic check ensures that if the value changed, FUTEX_WAIT returns EAGAIN immediately.
How does glibc pthread_mutex_lock use futexes?
It implements the three-state protocol. First: CAS(0, 1). If it succeeds, done -- zero syscalls. If not, loop: CAS(current, 2) to mark waiters, then futex(FUTEX_WAIT, 2) to sleep. On wakeup, retry CAS(0, 2). Unlock: atomic_fetch_sub(1) -- if the previous value was 2, call futex(FUTEX_WAKE, 1). PTHREAD_MUTEX_ADAPTIVE_NP spins a tunable number of times before sleeping.
How does Go use futexes?
Go's runtime uses futexsleep and futexwakeup in its scheduler. When an M (OS thread) has no runnable goroutines, it parks with futexsleep. When work appears (channel send unblocks a goroutine), futexwakeup wakes an idle M. sync.Mutex, channels, and WaitGroup all ultimately use the same mechanism.
What is the performance difference vs a pure kernel mutex?
A kernel mutex costs ~200-400ns per lock/unlock even when uncontended. A futex costs ~10ns for the uncontended CAS. With 10% contention, futex-based locks are 10-50x faster. That is why every user-space synchronization primitive is built on futexes.
How Technologies Use This
Containerd's throughput drops dramatically when coordinating 500 container lifecycle operations per second across 16 threads. Every lock acquisition in internal data structures costs 200ns for a kernel round-trip, adding 100us of overhead per container operation even when no threads are competing for the same lock.
The key insight is that 99% of lock acquisitions are uncontended -- no other thread wants the lock at that moment. Paying 200ns for a kernel syscall on every uncontended lock is wasteful when an atomic CAS instruction can resolve the same operation in about 10ns entirely in user space. The kernel only needs to be involved when threads actually contend and must sleep.
Futexes make this split design the default for every mutex in containerd and every container running on the host. Docker's seccomp profile must whitelist the futex syscall because blocking it breaks every mutex, condition variable, and thread synchronization primitive in the container.
The Go runtime needs to park and wake thousands of OS threads per second as goroutines block and unblock. If every thread park and wake required a heavyweight kernel mutex, the scheduler overhead alone would consume more CPU than the actual application logic.
The mechanism that makes Go's M:N scheduling practical is the futex. When an M (OS thread) runs out of runnable goroutines, it calls futexsleep to park with a single syscall costing about 200ns. When a channel send unblocks a goroutine, futexwakeup wakes an idle M to pick it up. For sync.Mutex, channels, and WaitGroup, the uncontended fast path is a single atomic CAS at roughly 10ns with zero kernel involvement.
In a server handling 50K goroutines, fewer than 1% of synchronization operations ever touch the kernel, saving millions of unnecessary syscalls per second. The takeaway is that Go's goroutine performance is not just about the scheduler -- it depends fundamentally on futexes making thread synchronization nearly free in the common case.
A Java application executes billions of synchronized blocks per second in hot loops. If every synchronized keyword required a 200ns kernel syscall, methods called in tight loops would spend 99% of their time on lock overhead rather than application logic.
The JVM avoids this with thin locks: an atomic CAS on the object header's mark word costs about 10ns with zero kernel involvement. Only when two threads actually contend for the same lock does the JVM inflate it to a heavyweight monitor backed by a futex. The vast majority of synchronized blocks in real applications are uncontended because they protect thread-local access patterns or short critical sections.
In a typical Spring application, less than 0.1% of lock acquisitions ever reach the kernel, keeping per-request overhead under 50us. The lesson is that the synchronized keyword is essentially free in the common case because futexes eliminate the kernel from the uncontended path.
Chrome's UI thread must render at 60fps, giving it exactly 16.6ms per frame. The rendering pipeline acquires roughly 300 locks per frame; at 200ns per kernel round trip, synchronization alone burns 60us of that budget before any rendering work begins.
The problem is that even at 200ns per lock, the cumulative cost of hundreds of lock operations per frame eats into a tight timing deadline. Traditional kernel mutexes cannot keep up because each one requires a privilege transition regardless of whether any thread is actually contending.
Chrome's base::Lock uses adaptive spinning that tries a CAS loop for short-held locks before falling back to futex sleep. In the common uncontended case, each lock costs roughly 10ns, keeping the synchronization overhead per frame under 3us. This leaves the rendering pipeline well within its timing budget and is why Chrome can maintain 60fps even with a complex multi-threaded rendering architecture.
Same Concept Across Tech
| Technology | How it uses futexes | Key detail |
|---|---|---|
| pthread (C/C++) | pthread_mutex is a futex wrapper. pthread_cond uses futex for wait/signal | Every C mutex is a futex |
| Go | sync.Mutex uses futex on Linux for the slow path (when spinning fails) | Fast path is atomic CAS, slow path is futex_wait |
| Java | synchronized and ReentrantLock use futexes via pthread underneath (on Linux) | Biased locking (disabled by default since JDK 15) eliminated even the CAS in the single-thread case |
| Rust | std::sync::Mutex uses futex on Linux (parking_lot crate also uses futex) | Minimal overhead, direct futex syscall wrapper |
| Node.js | Worker threads use Atomics.wait() which maps to futex on Linux | SharedArrayBuffer enables cross-worker synchronization |
| PostgreSQL | Spinlocks + semaphores; does not call futex() directly (historically SysV semaphores) | On Linux, POSIX unnamed semaphores (futex-backed in glibc) are the default since PostgreSQL 10 |
Stack layer mapping (lock contention debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Which lock is hot? How many threads contend? | Application lock profiler, flame graph |
| Runtime | Is the runtime spinning before falling back to futex? | Go: runtime mutex profiling. JVM: JDK Flight Recorder monitor-contention events |
| Syscall | How many futex() calls? How long are FUTEX_WAIT durations? | strace -e futex -c, strace -T |
| Kernel | Is the futex hash table causing collisions? | Rare issue, only at extreme scale |
| Hardware | Is the CPU cache line bouncing between cores (false sharing)? | perf c2c |
Design Rationale
Both existing synchronization options were broken at scale. Kernel mutexes cost 200-400ns per operation from the syscall overhead alone, even with zero contention -- a penalty paid billions of times per second in real applications. User-space spinlocks avoided the syscall but could not put a thread to sleep, burning CPU when contention lasted more than a few microseconds. Futexes split the difference: the uncontended path stays in user space at 10ns, and the kernel is invoked only when a thread genuinely needs to sleep. The three-state protocol (0/1/2) goes further, skipping the FUTEX_WAKE syscall when no thread is waiting -- cutting the slow-path cost in half for the common unlock-with-no-waiters case.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| strace shows many FUTEX_WAIT calls | Threads contending on a lock, actually sleeping | strace -e futex -c for count and time |
| High system CPU (sy%) in multi-threaded app | futex syscalls from lock contention | perf top, look for futex_wait in kernel |
| Application throughput drops with more threads | Lock contention negating parallelism (Amdahl's law) | Lock profiler, reduce critical section size |
| Thread stuck in FUTEX_WAIT indefinitely | Deadlock or lock holder not releasing | cat /proc/PID/task/TID/wchan for all threads |
| Latency spikes in otherwise fast operations | Priority inversion: low-priority lock holder preempted | Use FUTEX_LOCK_PI for priority inheritance |
| perf shows cache-line bouncing | False sharing: unrelated data on same cache line as futex word | perf c2c, pad data structures to separate cache lines |
When to Use / Avoid
Relevant when:
- Debugging lock contention in multi-threaded applications (high futex wait time in strace)
- Understanding how pthread_mutex, Go sync.Mutex, and Java synchronized work internally
- Profiling applications that show high system CPU from futex syscalls (contention signal)
- Implementing custom synchronization primitives
Watch out for:
- High futex() syscall count means real contention (threads are actually sleeping on locks)
- strace showing many FUTEX_WAIT calls: not inherently bad, but high wait times indicate a hot lock
- Priority inversion: low-priority thread holds a futex while high-priority thread waits (use FUTEX_LOCK_PI for priority inheritance)
- Futex word must be in shared memory for cross-process synchronization (mmap MAP_SHARED)
Try It Yourself
# Trace futex syscalls for a process (shows contention)
strace -e futex -f -c -p $(pidof java) 2>&1 | tail -20

# Watch futex operations in real time (strace per call)
strace -e futex -f -tt -p $(pidof python3) 2>&1 | head -50

# Decode futex op codes from strace output
# FUTEX_WAIT_PRIVATE = 128, FUTEX_WAKE_PRIVATE = 129
strace -e futex -f -p $$ -o /tmp/futex.log & sleep 2; kill %1
grep -c FUTEX_WAIT /tmp/futex.log
grep -c FUTEX_WAKE /tmp/futex.log

# See a process register its robust futex list at startup
strace -e set_robust_list ls > /dev/null

# Count futex syscalls system-wide with bpftrace
# bpftrace -e 'tracepoint:syscalls:sys_enter_futex { @[comm] = count(); }' (requires root)

# Show lock contention with perf
# perf lock record -p $(pidof myapp) -- sleep 10
# perf lock report
Debug Checklist
1. Check futex contention: timeout 5 strace -e futex -c -p <pid>
2. Find hot locks: perf lock record -p <pid> -- sleep 5 && perf lock report
3. Count futex syscalls by process: bpftrace -e 'tracepoint:syscalls:sys_enter_futex { @[comm] = count(); }'
4. Check if threads are stuck in futex: cat /proc/<pid>/task/*/wchan | sort | uniq -c | sort -rn
5. Profile lock contention: perf record -e syscalls:sys_enter_futex -p <pid>
6. Check for priority inversion: strace -e futex -T (look for long FUTEX_WAIT durations)
Key Takeaways
- ✓The fast path is ZERO syscalls -- just an atomic CAS (lock cmpxchg on x86, ldxr/stxr on ARM64). 5-20 nanoseconds on modern CPUs. The kernel is only involved when a thread must sleep. Futexes are roughly 20-40x faster than pure-kernel mutexes in the common case.
- ✓The three-state protocol (0/1/2) eliminates unnecessary wakes. State 0 = unlocked. State 1 = locked, no waiters. State 2 = locked, has waiters. On unlock, if state was 1, just CAS to 0 -- no FUTEX_WAKE needed. Only state 2 triggers a wake syscall.
- ✓FUTEX_LOCK_PI prevents priority inversion. When a high-priority thread blocks on a lock held by a low-priority thread, the kernel boosts the holder's priority. Used by real-time systems, Android Binder, and audio frameworks. Internally backed by rt_mutex.
- ✓Robust futexes solve the crash-while-holding-lock problem. The kernel maintains a per-thread list of held futexes. On thread death, it sets FUTEX_OWNER_DIED and wakes a waiter. Used by glibc's PTHREAD_MUTEX_ROBUST.
- ✓Private futexes (the default since glibc 2.10) hash by virtual address and only work within a single process. Shared futexes (for cross-process shared memory) hash by (inode, offset) and are ~30% slower.
Common Pitfalls
- ✗Mistake: Calling FUTEX_WAIT without trying CAS first. Reality: The syscall costs ~200ns vs ~10ns for an uncontended CAS. Always try the atomic operation first. The kernel is the fallback, not the default path.
- ✗Mistake: Always calling FUTEX_WAKE on unlock. Reality: If no threads are sleeping (state was 1, not 2), the wake is a wasted syscall. The three-state protocol exists to avoid this -- check the state before waking.
- ✗Mistake: Using FUTEX_WAIT with a shared mapping but forgetting to clear FUTEX_PRIVATE_FLAG. Reality: Private futexes hash by virtual address. Two processes mapping the same shared memory at different addresses hash to different buckets -- the waiter and waker never find each other.
- ✗Mistake: Not handling spurious wakeups. Reality: FUTEX_WAIT can return 0 without a corresponding FUTEX_WAKE (signal delivery, kernel requeue). Always re-check the futex word in a loop, just like pthread_cond_wait.
Reference
In One Line
Uncontested lock? One atomic CAS, 10ns, no syscall. Contested? Then and only then does the kernel get involved.