Lock-Free Programming, CAS Loops
Lock-free algorithms use atomic compare-and-swap (CAS) instead of mutexes. The pattern: read current value, compute new value, CAS-install, retry on failure. At least one thread always makes progress (no deadlocks, no waiting on a slow holder), but individual threads can starve under contention. Pays off when the critical section is tiny and a suspended or slow lock holder would otherwise stall every waiter.
What lock-free means in plain English
Most code that touches shared data uses a mutex: grab the lock, do the work, release the lock. While one thread holds the lock, every other thread that wants to do the same work has to wait. If the lock holder is unlucky enough to get suspended by the operating system in the middle of its work, every waiter is stuck behind it until the holder is scheduled again.
Lock-free programming gets around this by not using a lock at all. Instead, it relies on a hardware instruction called compare-and-swap, almost always shortened to CAS. CAS is a small, indivisible operation provided directly by the CPU. It says: "look at this memory location. If its value is still the value I last saw, replace it with this new value. Otherwise, do nothing and tell me you did nothing." All of that happens as one step that no other thread can interrupt.
The promise of a lock-free design is simple: at least one thread is always making progress. No single thread can freeze the whole system by holding something and getting suspended, because there is nothing to hold.
The cost is that an individual thread can lose the CAS race many times in a row. Progress is guaranteed for the system as a whole, not for any specific thread.
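In Java, CAS is exposed as compareAndSet on the java.util.concurrent.atomic classes. A minimal sketch of the "replace only if the value is still what I last saw" contract (CasDemo is a made-up name for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasDemo {
    static boolean demo() {
        AtomicInteger value = new AtomicInteger(10);
        // Succeeds: the value is still 10, so it becomes 11.
        boolean first = value.compareAndSet(10, 11);
        // Fails and changes nothing: we expected 10, but the value is now 11.
        boolean second = value.compareAndSet(10, 12);
        return first && !second && value.get() == 11;
    }
}
```

Both outcomes are reported by the boolean return value; the failing call leaves the value untouched, which is what makes the retry loop safe.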
How a CAS retry loop runs
Almost every lock-free algorithm has the same shape: read the current value, compute the next one, try to install it with CAS, and try again on failure. Walking through one of those loops, with two threads racing on the same value, makes it concrete.
Both threads start by reading the same value. Both compute their next value off it. Only one CAS can succeed, because once the value moves on, the other thread's expected value is stale. The loser does not get blocked or parked; it just goes back to the top of the loop, reads the new value, and tries again. The system keeps moving as long as somebody is winning.
That loop is the lock-free counterpart to "acquire mutex, modify, release". It is faster when the critical section is small and contention is bursty. It loses to a mutex when contention is sustained (lots of CAS failures, cache lines bouncing between cores) or when the work inside the loop is expensive enough that doing it twice is a real cost.
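The race can be replayed deterministically in a single thread by issuing both CAS attempts against the same stale snapshot (a sketch; RaceDemo is a made-up name):

```java
import java.util.concurrent.atomic.AtomicInteger;

class RaceDemo {
    static boolean demo() {
        AtomicInteger counter = new AtomicInteger(0);
        int snapshot = counter.get();                                // both "threads" read 0
        boolean winner = counter.compareAndSet(snapshot, snapshot + 1); // wins: 0 -> 1
        boolean loser  = counter.compareAndSet(snapshot, snapshot + 1); // loses: value is 1, not 0
        // The loser is not blocked; it re-reads the fresh value and retries.
        int fresh = counter.get();
        boolean retry = counter.compareAndSet(fresh, fresh + 1);        // now wins: 1 -> 2
        return winner && !loser && retry && counter.get() == 2;
    }
}
```

The second CAS fails precisely because its expected value went stale the moment the first one succeeded; the retry with a fresh read goes through.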
How locked and lock-free behave under stress
The real reason lock-free exists is what happens when something goes wrong with the lock holder, not in the steady state.
Suppose the lock holder gets suspended mid-critical-section: every thread that wanted that lock is sitting there doing nothing. Throughput goes to zero for that section.
Now the same hiccup, but with a CAS loop:
Thread 1 going to sleep does not block Thread 2, because Thread 1 was not holding anything across the suspend. The system keeps advancing on whoever is awake.
That is the practical advantage of lock-free: bounded blast radius from any single thread getting unlucky.
What is given up
Lock-free is not a free lunch. Three things get harder.
Multi-field updates. A single CAS works on one word at a time. If an invariant spans two fields (the head and the tail of a queue, the balance and the audit log of an account), CAS alone cannot update both as one step. Common workarounds are packing two small fields into one word, swapping a pointer to a brand-new immutable struct that has both new values, or putting the section under a lock anyway.
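The pointer-swap workaround can be sketched in Java: both fields live in one immutable object, and a single CAS on the reference moves them together (Bounds and BoundsTracker are made-up names):

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable snapshot of two fields with a joint invariant.
final class Bounds {
    final long min, max;
    Bounds(long min, long max) { this.min = min; this.max = max; }
}

class BoundsTracker {
    private final AtomicReference<Bounds> ref =
        new AtomicReference<>(new Bounds(Long.MAX_VALUE, Long.MIN_VALUE));

    Bounds snapshot() { return ref.get(); }

    void record(long v) {
        while (true) {
            Bounds cur = ref.get();
            long min = Math.min(cur.min, v);
            long max = Math.max(cur.max, v);
            if (min == cur.min && max == cur.max) return;   // nothing to change
            // Swap in a brand-new immutable object: both fields move in one CAS.
            if (ref.compareAndSet(cur, new Bounds(min, max))) return;
        }
    }
}
```

Readers always see a consistent min/max pair, because no observable state is ever half-updated; the cost is an allocation per successful update.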
Code complexity. Lock-free code is harder to write, harder to read, harder to test. Most well-known lock-free data structures (Michael-Scott queue, Treiber stack, Harris-Michael linked list) are research papers, not "type it out for an interview". The realistic advice is to reach for vetted library implementations and not hand-roll one.
Per-thread fairness. Lock-free guarantees that somebody makes progress, not that everybody does. A specific thread can keep losing its CAS forever while the others sail past. Real-time systems that need every thread to complete in bounded time use the stricter wait-free model. Most production code is fine with lock-free.
When the trade is worth it
Lock-free pays off cleanly in three cases:
- Counters and small atomic state. A hit counter, a high-water-mark sample, a set-once flag. The work inside the loop is tiny, the CAS is the whole story, and there is no mutex overhead.
- Hot-swappable references. A configuration pointer that gets atomically replaced when the config is reloaded. Readers do a single atomic load on the pointer; writers swap the pointer with CAS. Very common in long-running services.
- Specialised high-throughput structures. Lock-free queues, ring buffers, and concurrent hash maps inside message brokers, schedulers, and trading systems. Built once, reused everywhere.
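The hot-swappable reference pattern from the second case is small enough to sketch (ConfigHolder is a made-up name; the reload here uses CAS so a stale reload cannot clobber a newer one):

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class ConfigHolder {
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Map.of());

    // Readers: one atomic load, never blocked by a reload in progress.
    Map<String, String> get() { return current.get(); }

    // Writer: swap the pointer with CAS; fails if a newer config already landed.
    boolean reload(Map<String, String> expected, Map<String, String> fresh) {
        return current.compareAndSet(expected, fresh);
    }
}
```

A plain set() would also work for a single writer; the CAS version is the defensive choice when several reload paths might race.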
For everything else, a mutex is the right default. The mental cost of lock-free is real, and a profiler will usually show that the lock was not the bottleneck in the first place. Switch to lock-free only when a measurement says a specific lock is hurting and the critical section is small enough that CAS will be a clear win.
What the rest of this section covers
The other lessons in this section build on the CAS retry loop:
- Atomic Operations and CAS explains the hardware primitive in detail.
- Memory Ordering explains how loose the ordering can be without breaking things.
- The Treiber Stack is the canonical lock-free data structure, simple enough to follow end to end.
- The ABA Problem is the classic hazard with pointer-based lock-free code, where a value goes from A to B and back to A and a CAS succeeds when it should not have.
- Hazard Pointers and RCU are two ways to safely free memory in lock-free code without a garbage collector.
- Wait-Free vs Lock-Free vs Obstruction-Free is the formal taxonomy of progress guarantees.
- Spinlocks is the special case where a small busy-wait is faster than parking the thread.
Together they cover the foundation needed to read lock-free papers and library source code. Writing a new lock-free data structure from scratch is a much harder problem; the right answer in practice is almost always to use a tested library.
Implementations
The shape of every lock-free update: load the current value, compute the next one, try to swap atomically. If the CAS fails (someone else updated in between), re-read and retry.
AtomicLong maxSeen = new AtomicLong(0);

void recordSample(long sample) {
    while (true) {
        long current = maxSeen.get();
        if (sample <= current) return;
        if (maxSeen.compareAndSet(current, sample)) return;
        // CAS failed; another thread updated; loop and retry.
    }
}

Key points
- Lock-free = at least one thread is always making progress (system-wide). Individual threads may retry many times.
- The CAS retry loop is the building block: load → compute → CAS → retry on failure.
- No deadlocks possible (no thread blocks on another). No priority inversion. No waiting on a slow holder.
- Costs: ABA problem in pointer-based structures, memory reclamation in non-GC languages, cache-line contention under hot CAS.
- Use when locks are measurably the bottleneck. For most code, mutexes are simpler AND faster.
Follow-up questions
- Lock-free vs wait-free?
- When should I use lock-free instead of a mutex?
- What is the ABA problem?
- Why doesn't lock-free help with multi-field updates?
Gotchas
- Treating 'lock-free' as 'always faster': measure first; mutexes often win for simple cases.
- ABA in pointer structures: without a GC, avoiding it requires extra machinery (tagged pointers, hazard pointers).
- Hot CAS contention causes cache-line ping-pong; a single lock-free counter can be slower than sharded counters.
- Multi-field invariants need a lock or a pointer swap; one CAS only protects one word.
- Lock-free does NOT mean wait-free; individual threads can starve.