Atomic Operations & CAS
Atomic operations are indivisible CPU instructions that read-modify-write a single memory location safely, without locks. Compare-and-swap (CAS) is the foundation: "if this value is still X, replace it with Y." It is the primitive from which every lock-free data structure is built.
The two ideas in plain English
Atomic operation: a CPU instruction that reads, modifies, and writes a piece of memory in one indivisible step. Nobody else can see a half-finished update. Nobody can sneak in between the read and the write.
Compare-and-swap (CAS): the most useful atomic operation. It says: "If this memory still holds the value I expect, replace it with my new value. If something changed, don't write anything, and tell me what's actually there."
CAS is the foundation for coordinator-free updates. No locks, no waiting, no kernel calls. Just one hardware instruction.
A picture of CAS in action
Initial value at memory address X: 5

Time  Thread 1                     Thread 2
----  --------                     --------
t1    load X = 5
t2    compute new = 6
t3    (about to CAS)               load X = 5
t4                                 compute new = 7
t5    CAS(X, expected=5, new=6)
      SUCCESS, X is now 6
t6                                 CAS(X, expected=5, new=7)
                                   FAIL: X is 6, not 5
                                   (the CAS returns the current value)
t7                                 load X = 6
t8                                 compute new = 8
t9                                 CAS(X, expected=6, new=8)
                                   SUCCESS, X is now 8
Thread 1 wins the first round. Thread 2's CAS sees that the memory changed (expected 5, found 6), so it doesn't overwrite. Thread 2 re-reads, recomputes, and tries again. Eventually one of them succeeds. No locks. No blocking.
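The diagram can be replayed with Java's AtomicInteger. This is a single-threaded replay (the interleaving is simulated, and the class and method names are made up for this sketch), but the success/fail pattern is exactly what the two threads would see:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasReplay {
    static int replay() {
        AtomicInteger x = new AtomicInteger(5);

        boolean won  = x.compareAndSet(5, 6);  // Thread 1: expected 5 -> true, X = 6
        boolean lost = x.compareAndSet(5, 7);  // Thread 2: expected 5 -> false, X still 6

        int current = x.get();                 // Thread 2 re-reads: 6
        x.compareAndSet(current, current + 2); // recompute 8, retry -> true
        return x.get();                        // 8
    }

    public static void main(String[] args) {
        System.out.println(CasReplay.replay()); // prints 8
    }
}
```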
The retry-loop pattern
This is the shape every lock-free counter, queue, and hashmap uses internally:
loop:
    current = atomic.load(X)
    new = compute(current)
    if CAS(X, current, new) succeeded:
        return
    // someone else won; loop and try again
Want atomic increment? compute(current) = current + 1.
Want atomic max-of-seen? compute(current) = max(current, my_sample).
Want atomic "swap if even"? compute(current) = current % 2 == 0 ? new : current.
The loop turns one CAS instruction into any single-word atomic operation you need.
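Here is the "swap if even" variant written out in Java. The class and method names are invented for this sketch; modern Java can also express the whole loop in one call as `x.updateAndGet(cur -> cur % 2 == 0 ? replacement : cur)`:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SwapIfEven {
    static final AtomicInteger x = new AtomicInteger(4);

    // The generic retry loop, instantiated for "replace the value if it is even".
    static int setIfEven(int replacement) {
        while (true) {
            int current = x.get();
            if (current % 2 != 0) return current;              // odd: nothing to do
            if (x.compareAndSet(current, replacement)) return replacement;
            // CAS failed: another thread won; re-read and try again
        }
    }

    public static void main(String[] args) {
        System.out.println(setIfEven(9)); // 4 is even -> becomes 9
        System.out.println(setIfEven(7)); // 9 is odd  -> stays 9
    }
}
```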
What CAS cannot do
One memory location at a time. CAS works on a single word (or in some hardware, a single double-word). If the invariant spans two variables ("transfer 50 from account A to account B" or "update both head and tail of a queue"), CAS alone is not enough.
What CAS handles cleanly:              What CAS cannot do alone:
- Counters                             - Two-account money transfer
- Single-word state (flag, pointer)    - Update head AND tail together
- Lock-free stack head                 - Multi-field invariants
- Single hash bucket                   - Anything that needs >1 location
For multi-location updates: use a lock, or restructure the data so the invariant fits in one word (often: one atomic pointer to an immutable record that contains both fields).
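The "one atomic pointer to an immutable record" restructuring looks like this in Java. Both fields live in an immutable record, and a single CAS swaps the whole record, so they can never be observed half-updated. The names here are illustrative, not from the lesson:

```java
import java.util.concurrent.atomic.AtomicReference;

public class Window {
    record Bounds(int low, int high) {}  // immutable: both fields travel together

    static final AtomicReference<Bounds> bounds =
            new AtomicReference<>(new Bounds(0, 100));

    // Two-field invariant, maintained with a single-word CAS on the pointer.
    static void widen(int by) {
        while (true) {
            Bounds cur = bounds.get();
            Bounds next = new Bounds(cur.low() - by, cur.high() + by);
            if (bounds.compareAndSet(cur, next)) return; // both fields change at once
        }
    }
}
```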
When CAS gets expensive: contention
A CAS retry loop spins. If 16 threads are all CAS-ing the same AtomicLong 100,000 times per second, most CAS attempts will fail. The CPUs go to 100% with little throughput.
Low contention (few writers):      High contention (many writers):
T1: CAS succeeds, return           T1: CAS fails, retry
T2: CAS succeeds, return           T2: CAS fails, retry
T1: CAS succeeds, return           T3: CAS succeeds
...                                T1: CAS fails again, retry
                                   T2: CAS fails again, retry
                                   ...
Throughput: 100M ops/sec           Throughput: 1M ops/sec
                                   CPU: pegged at 100%
The fix is to remove the contention, not to make CAS faster:
- Stripe the counter. Split it across N cells, one per CPU. LongAdder does this: each thread bumps its own cell, and the total is the sum of the cells, computed on read. Trades read cost for write throughput.
- Back off. If the data structure can't be changed, sleep a small (random, exponentially growing) amount between failed retries to spread out the contention.
- Fall back to a lock. Sometimes a quick lock is cheaper than 50 failed CAS attempts.
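The backoff option can be sketched as a CAS loop that pauses a random, doubling amount after each failure. The pause bounds (1 µs start, 1 ms cap) are illustrative, not tuned values:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

public class BackoffCounter {
    static final AtomicLong counter = new AtomicLong();

    // CAS retry loop with randomized exponential backoff between failures.
    static long increment() {
        long pauseNanos = 1_000;                         // start around 1 microsecond
        while (true) {
            long cur = counter.get();
            if (counter.compareAndSet(cur, cur + 1)) return cur + 1;
            // lost the race: sleep a random slice of the window, then double it
            LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(1, pauseNanos + 1));
            pauseNanos = Math.min(pauseNanos * 2, 1_000_000); // cap around 1 ms
        }
    }
}
```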
The ABA problem in one sentence
If memory holds A, changes to B, then back to A, the CAS sees "still A" and assumes nothing changed. But the world did change in between. Covered in detail in its own lesson; the short summary: garbage-collected languages mostly dodge it; C/C++ and code with object pools have to defend with tagged pointers, hazard pointers, or epoch reclamation.
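The tagged-pointer defence can be glimpsed with Java's AtomicStampedReference, which makes the CAS match both the reference and a version stamp, so A → B → A leaves a different stamp and the stale CAS fails. A minimal sketch (the class and helper names are invented; ABA gets its own lesson):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class Tagged {
    static final AtomicStampedReference<String> top =
            new AtomicStampedReference<>("A", 0);

    // Succeeds only if BOTH the reference and the stamp match.
    static boolean casTop(String expect, int expectStamp, String next) {
        return top.compareAndSet(expect, next, expectStamp, expectStamp + 1);
    }

    public static void main(String[] args) {
        casTop("A", 0, "B");                     // A -> B, stamp 0 -> 1
        casTop("B", 1, "A");                     // B -> A, stamp 1 -> 2
        System.out.println(casTop("A", 0, "C")); // false: value is "A" again, stamp isn't 0
    }
}
```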
Picking the right tool
Single counter, mostly one writer
----------------------------------
USE AtomicLong / atomic.Int64 / a single CAS
Single counter, many concurrent writers
---------------------------------------
USE LongAdder / sharded counter
Multi-field invariant
---------------------
USE a Mutex / synchronized block
Read-mostly state, hot-swappable
--------------------------------
USE atomic pointer to an immutable record
The mental model
Atomics are mutual exclusion on a single word, at hardware speed. They're the cheapest way to coordinate when the data fits in one location. They're also the deepest building block: every lock in common use is implemented with atomic operations under the hood.
The skill is knowing when an atomic is enough (one word, one simple update) and when to step up to a lock (multi-field invariant, long critical section, complex condition).
Implementations
Every method on AtomicInteger is one atomic CPU instruction (or a CAS retry loop on top of one). No lock, no contention storm. The hardware guarantees the read-modify-write isn't interrupted.
AtomicInteger counter = new AtomicInteger(0);

counter.incrementAndGet();          // atomic ++; uses CAS internally
counter.addAndGet(5);               // atomic += 5
int old = counter.getAndSet(0);     // atomic swap

// The most powerful primitive, explicit CAS:
// "set to 100 only if still 5"
boolean success = counter.compareAndSet(5, 100);

AtomicInteger doesn't have an atomic max, but one can be built with CAS. Read current, compute new, CAS to install. If another thread raced in, CAS fails and the loop retries with the new current value. This is how every lock-free operation is built.
AtomicInteger highWaterMark = new AtomicInteger(0);

void recordSample(int sample) {
    int current;
    do {
        current = highWaterMark.get();
        if (sample <= current) return; // already higher; no-op
    } while (!highWaterMark.compareAndSet(current, sample));
}

Under heavy contention, every CAS on the same AtomicLong fails repeatedly and threads spin retrying. LongAdder shards internally across N counters, one per thread, summing on read. Trades memory for throughput.
// Hot counter under heavy contention: AtomicLong becomes a bottleneck
AtomicLong requestCount = new AtomicLong();
requestCount.incrementAndGet();  // every thread CAS's the same word

// LongAdder: internally sharded, much higher throughput
LongAdder adder = new LongAdder();
adder.increment();               // mostly hits a thread-local striped cell
long total = adder.sum();        // aggregates on read; eventually consistent

Key points
- An atomic operation is a single CPU instruction that can't be interrupted partway through
- CAS = compare-and-swap: "if memory[addr] == expected, write new and return success; else fail and return current value"
- Lock-free programming uses CAS in retry loops, never blocks, but may spin under contention
- Atomics give single-variable safety; multi-variable invariants still need a lock
- ABA problem: a value goes A → B → A in between observations; CAS thinks nothing changed
- LongAdder/striped counters trade memory for throughput under high contention
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| Atomic primitive (AtomicInteger / atomic.Int64) | One hardware instruction; no locks, no kernel calls | Single word only; the update must fit one location | Counters, flags, lock-free queues, stack pointers |
| CAS retry loop | Turns one CAS into any single-word atomic operation | Spins under contention; hand-rolled loops risk ABA | Custom atomic operations not in the stdlib |
| LongAdder / striped counter | High write throughput under heavy contention | Extra memory per cell; reads sum the cells and are eventually consistent | Hot counters with many writer threads, infrequent reads, request counts, metrics |
| Mutex | Protects multi-variable invariants and long critical sections | Blocking; heavier than a single atomic op | Anything more than one variable; long critical sections |
Follow-up questions
- What's the ABA problem?
- When does a CAS retry loop become a busy-wait?
- Are AtomicReference assignments atomic for compound operations?
- Why is volatile not enough for counter++?
- How does CAS work at the CPU level?
Gotchas
- ABA problem on linked-list CAS: the value comes back to the original after an intermediate change
- Reading an atomic field non-atomically (`int x = atomic.Load(); other code; use x`): the value can change between the read and the use
- AtomicInteger.incrementAndGet() is atomic; AtomicInteger.get() + 1 followed by set() is NOT
- A hot CAS loop on a single AtomicLong from many threads is a CPU bottleneck; switch to LongAdder
- Java: AtomicInteger.lazySet vs set: lazySet doesn't establish happens-before and can lose visibility
- Go: atomic operations take a pointer to the value (& prefix); operating on a copy of the value is silently wrong
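The get()-then-set() gotcha is easy to demonstrate: two separately atomic steps are not one atomic read-modify-write, so concurrent increments get lost. A small demo (class name invented for this sketch):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LostUpdates {
    static int[] run() {
        AtomicInteger broken = new AtomicInteger();
        AtomicInteger correct = new AtomicInteger();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                broken.set(broken.get() + 1); // read and write: two atomic steps, not one
                correct.incrementAndGet();    // one atomic read-modify-write
            }
        };
        Thread a = new Thread(task), b = new Thread(task);
        a.start(); b.start();
        try { a.join(); b.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { broken.get(), correct.get() };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println("broken:  " + r[0]); // usually less than 200000
        System.out.println("correct: " + r[1]); // always 200000
    }
}
```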
Common pitfalls
- Building lock-free data structures by hand without understanding ABA
- Using AtomicReference for what should be a Lock, multi-variable invariants slip through
- Spinning forever on a CAS loop without backoff under contention
- Mixing atomic and non-atomic access to the same variable
Practice problems
AtomicInteger.incrementAndGet, or CAS retry loop on AtomicReference, or LongAdder for contention
Linked list with atomic head pointer; push/pop use CAS retry to install new head
CAS retry loop: read current, check if new is larger, CAS to install
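The lock-free stack from the practice list is the classic Treiber stack: every operation is a CAS retry on a single atomic head pointer. A minimal sketch (omitting the ABA defences a production version would need in non-GC languages):

```java
import java.util.concurrent.atomic.AtomicReference;

public class TreiberStack<T> {
    private record Node<U>(U value, Node<U> next) {}

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void push(T value) {
        Node<T> cur, node;
        do {
            cur = head.get();
            node = new Node<>(value, cur);      // new node points at the old head
        } while (!head.compareAndSet(cur, node)); // retry if the head moved
    }

    public T pop() {
        Node<T> cur;
        do {
            cur = head.get();
            if (cur == null) return null;       // empty stack
        } while (!head.compareAndSet(cur, cur.next()));
        return cur.value();
    }
}
```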
APIs worth memorising
- Java: AtomicInteger, AtomicLong, AtomicReference, AtomicMarkableReference, LongAdder, VarHandle
- Go: sync/atomic.{Int32, Int64, Uint64, Bool, Pointer, Value}.{Load, Store, Swap, Add, CompareAndSwap}
- Python: no atomics in stdlib, use threading.Lock or multiprocessing.Value(c_int)
Every concurrent runtime relies on these. Java's ConcurrentHashMap uses CAS for resize and bin updates. The Linux kernel uses atomic ops everywhere; RCU is built on them. Go's runtime uses CAS for goroutine scheduling. Most interview-level lock-free queues are CAS-based Treiber stacks or Michael-Scott queues.