Spinlocks: When Busy-Waiting Wins
A spinlock is a mutex where the waiter spins in a tight CAS loop instead of going to sleep. Faster than a mutex when the critical section is shorter than a context switch (~1µs); awful when the critical section is long. Used inside OS kernels, sometimes in HFT systems; almost never appropriate in application code.
What it is
A spinlock is a mutex where a thread that cannot acquire the lock keeps trying in a tight loop instead of going to sleep. It "spins" on the lock variable, checking it over and over, until the lock becomes free. The bet behind a spinlock is simple: the lock holder will be done very soon, and burning a few thousand CPU cycles in a loop is cheaper than paying the operating system to park the thread and wake it up later.
A mutex and a spinlock differ in what the contended thread does while the lock is held by somebody else.
The mutex pays a fixed round-trip cost in microseconds (one context switch out, one back in) but uses zero CPU while waiting. The spinlock pays no round-trip cost but burns 100% of one core for as long as it waits. If the critical section is shorter than the round-trip cost (around one microsecond on Linux), the spinlock wins. If it is longer, the spinlock loses badly.
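To put rough, illustrative numbers on it: with a 200 ns critical section, a spinner burns about 200 ns of one core and takes the lock the moment it is free, while a parked waiter pays on the order of a microsecond of scheduling overhead, several times the cost of the work it was waiting for. Flip it to a 100 µs critical section and the spinner burns 100 µs of a core doing nothing useful, while the sleeper still pays only that same microsecond.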
Why this is rarely the right choice in application code
There are three real problems with using spinlocks in normal user-space code, and any one of them is enough to make the spinlock worse than a mutex.
The OS can preempt the lock holder. The lock holder's CPU quantum expires and the operating system schedules another thread. Now the waiters are spinning hard on every available core, but the holder cannot get back on a CPU to release the lock, because the spinners are occupying them. Every spinning core burns at 100% until the OS finally schedules the holder again, and the longer the preemption lasts, the more CPU is wasted.
Critical-section length is rarely predictable. A spinlock assumes the section is short. In real code, the section can stretch unexpectedly: a cache miss, a paged-out memory access, a sudden GC pause inside the locked region. The waiters do not know any of this is happening; they just keep spinning. Microseconds turn into milliseconds. CPU usage spikes for no useful work.
Wasted CPU is somebody else's loss. In a shared environment (a container with a CPU quota, a serverless host, a multitenant VM), spinning threads consume CPU budget that other workloads on the same machine could be using. The neighbouring service slows down even though it is doing its own unrelated work.
The defensive default for application code is to use a regular mutex. Modern mutex implementations are adaptive: they spin briefly first (a few microseconds), hoping the lock will be released, and fall back to parking the thread only if the spin budget runs out. This gives the spinlock's win when the section is short and the mutex's safety when the section is long. There is essentially no situation in normal application code where a hand-rolled spinlock beats the runtime's adaptive mutex.
When spinlocks actually win
A spinlock outperforms a mutex by a factor of two to ten when all three of the following conditions hold:
- Very short critical sections. Shorter than a context switch round trip, which is around one microsecond on Linux. Often the section is only tens of nanoseconds: flipping a flag, updating two adjacent fields, swapping a pointer.
- Spare CPU cores. Spinning is fine on a machine with idle cores. On an oversubscribed machine, every cycle a spinner burns is a cycle some other ready thread could have used.
- The lock holder cannot be preempted. This is the hardest one. It is true inside the OS kernel running with interrupts disabled. It is true on a real-time thread with elevated priority. It is true on a CPU that has been pinned and isolated for a single workload. It is almost never true for a normal user-space thread.
When all three hold, the spinlock is the right answer. This is why operating system kernels use them internally, why some HFT systems use them on dedicated cores, and why some lock-free data structure libraries use them in their hot paths. Outside those domains, the conditions almost never line up.
Implementations matter
Once a spinlock is the right choice, the specific implementation has a large effect on how well it performs under contention.
Plain test-and-set (TAS). The simplest spinlock: every iteration runs an atomic read-modify-write (a test-and-set or compare-and-swap) on the lock variable. The problem is that every such operation is a write, and writes invalidate the cache line on every other core that has it cached. With many threads spinning, the lock's cache line bounces between cores constantly. Even while the lock is held and nobody can actually acquire it, this coherence traffic eats a large share of the available memory bandwidth.
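A minimal plain-TAS sketch in Java; the class name and structure are illustrative, not a standard library API:

import java.util.concurrent.atomic.AtomicBoolean;

// Plain test-and-set spinlock: every spin iteration is an atomic
// read-modify-write, i.e. a write, so the lock's cache line bounces
// between all contending cores.
class TasSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (locked.getAndSet(true)) {
            // keep hammering the lock word until getAndSet returns false
        }
    }

    void unlock() {
        locked.set(false);
    }
}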
Test-then-test-and-set (TTAS). A small but important refinement. The waiter first does a plain load of the lock variable, which is a read and so the cache line can stay in shared state across many cores. Only when the load shows the lock is free does the waiter actually run a CAS to try to acquire. Reads are nearly free; writes are expensive. TTAS dramatically reduces cache-line traffic under contention.
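The same lock with the test-then-test-and-set refinement, again a conceptual sketch rather than production code:

import java.util.concurrent.atomic.AtomicBoolean;

// Test-then-test-and-set: spin on a plain read (the cache line stays in
// shared state across cores), and only attempt the expensive atomic write
// when the lock looks free.
class TtasSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (true) {
            while (locked.get()) {
                // read-only spin: no cache-line invalidations while we wait
            }
            if (locked.compareAndSet(false, true)) {
                return;  // acquired
            }
            // another thread won the race between our read and our CAS; retry
        }
    }

    void unlock() {
        locked.set(false);
    }
}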
The PAUSE and YIELD hints. On x86, the PAUSE instruction in the spin loop tells the CPU that this is a spin (so the pipeline can deprioritise it, save power, and not starve a co-running hyperthread). On ARM, YIELD plays the same role. Forgetting these is a small but real bug; the cost is wasted power and worse performance for the rest of the workload.
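In Java, the portable way to emit this hint is Thread.onSpinWait() (Java 9+), which HotSpot lowers to PAUSE on x86 where it can. Dropped into the read-only spin of the TTAS sketch above:

// inside TtasSpinLock.lock(), in the read-only spin
while (locked.get()) {
    Thread.onSpinWait();  // hint that this is a spin-wait loop (PAUSE on x86)
}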
Queue-based locks (ticket lock, MCS lock). The most sophisticated spinlocks. A ticket lock hands the lock out in FIFO order, so no waiter can be starved; an MCS lock goes further and has each waiting thread spin on its own cache line rather than the shared lock variable, so contention does not cause cache-line ping-pong at all. Queue-based locks are what the Linux kernel uses for many of its internal locks. They are more code than a TTAS spinlock and rarely worth implementing by hand outside the kernel.
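Of the two, the ticket lock is simple enough to sketch. The version below is only illustrative (note that its waiters still read one shared counter, which is exactly the traffic MCS-style locks remove):

import java.util.concurrent.atomic.AtomicInteger;

// Ticket lock: FIFO handoff. Each thread draws a ticket with one atomic
// increment, then spins read-only until the "now serving" counter reaches it.
class TicketLock {
    private final AtomicInteger nextTicket = new AtomicInteger(0);
    private final AtomicInteger nowServing = new AtomicInteger(0);

    void lock() {
        int myTicket = nextTicket.getAndIncrement();  // the only atomic write on acquire
        while (nowServing.get() != myTicket) {
            Thread.onSpinWait();                      // read-only spin until our turn
        }
    }

    void unlock() {
        nowServing.set(nowServing.get() + 1);         // hand the lock to the next ticket
    }
}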
The mental model
A spinlock trades CPU time for latency. It is worth that trade when latency is the bottleneck and CPU is plentiful, and not worth it when CPU is contested or when critical sections can stretch.
For application code: do not reach for a spinlock. Use the standard mutex. The runtime's adaptive mutex already spins briefly when that helps, and parks when it doesn't. The performance difference between an adaptive mutex and a hand-rolled spinlock in user code is tiny, and the failure modes of the spinlock under preemption are large.
For kernel code, library internals, real-time systems, and hardware-near work: understand the three conditions, confirm they all hold, and use a vetted spinlock implementation rather than rolling a new one. C++'s std::atomic_flag is the standard building block for one; Rust crates such as parking_lot and crossbeam provide well-tested synchronization primitives; the kernel ships its own locks. Custom spinlock code is one of the easier places to introduce a subtle correctness or performance bug.
Implementations
Java's synchronized and ReentrantLock are adaptive: they spin for a short time (a few microseconds) hoping the lock will be released, then fall back to parking the thread. This is almost always the right behaviour: it gets the spinlock's win when the critical section is short and the mutex's safety when it isn't.
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Java's synchronized and ReentrantLock already do adaptive spin-then-park;
// writing one by hand is almost never necessary. Conceptual structure:
class AdaptiveLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private final Queue<Thread> waiters = new ConcurrentLinkedQueue<>();
    private static final int SPIN_BUDGET = 100;

    void lock() {
        for (int i = 0; i < SPIN_BUDGET; i++) {
            if (locked.compareAndSet(false, true)) return;
            Thread.onSpinWait();                      // CPU hint: this is a spin
        }
        // Spin budget exhausted; enqueue ourselves and park until woken.
        Thread self = Thread.currentThread();
        waiters.add(self);
        while (!locked.compareAndSet(false, true)) {
            LockSupport.park(this);                   // unlock() will unpark us
        }
        waiters.remove(self);
    }

    void unlock() {
        locked.set(false);
        Thread next = waiters.peek();
        if (next != null) LockSupport.unpark(next);   // wake one parked waiter
    }
}

Key points
- Spin in a CAS loop instead of parking the thread. No context switch cost.
- Wins when critical section < ~1µs (context switch cost) and there are spare cores.
- Loses badly when critical section is long: spinning threads burn 100% CPU doing nothing.
- Adaptive mutexes (Linux futex, Java synchronized): spin briefly, then park. Usually the right default.
- Spinlocks should NOT be used in user code without guaranteeing very short critical sections AND a spare core.
Follow-up questions
- When do spinlocks actually win?
- What's the difference between a spinlock and a busy-wait?
- Why does the PAUSE instruction matter?
- When does TTAS (test-then-test-and-set) help?
Gotchas
- Spinlock with a long critical section = wasted CPU and bad latency
- Spinning waiter preventing the lock holder from being scheduled = livelock-like behaviour
- Plain TAS spinlock causes cache-line ping-pong; use TTAS
- Forgetting PAUSE on x86 wastes power and hurts co-running hyperthreads
- Userspace spinlocks where the OS can preempt the holder are almost always wrong