Lock-Free MPSC Queue
A multi-producer, single-consumer queue using CAS on linked-list pointers. Producers append at tail; consumer pops from head. No locks. The Michael-Scott (1996) design is the textbook impl. Used in Go's runtime, Linux kernel, low-latency trading.
What this is
An MPSC queue is a queue with a specific shape of access: many producer threads can enqueue items concurrently, but only one consumer thread ever dequeues. The queue is built as a linked list. Two atomic pointers are kept at the ends of the list: head (where the consumer pops from) and tail (where producers append to). All synchronization is done by CAS on those two pointers; there are no locks.
The asymmetry between producers and consumer is what makes MPSC simpler than the general case. Producers race against each other to install their node at the tail, so the tail pointer needs CAS. The consumer is alone, so the head pointer can be advanced with a plain store. Half of the synchronization disappears compared to a multi-producer multi-consumer (MPMC) queue.
Why lock-free wins here
The straightforward way to build a thread-safe queue is two locks, one for the head and one for the tail. Java's LinkedBlockingQueue does exactly this. Under high producer contention, the tail lock becomes the bottleneck; under high consumer contention, the head lock does. Each acquire-release pair costs on the order of tens of nanoseconds even when uncontended, plus a kernel transition if contention forces a thread to park.
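For contrast, the two-lock design looks roughly like this. This is a minimal sketch of the idea, not LinkedBlockingQueue's actual code (which adds capacity bounds, blocking, and signalling); class and field names are illustrative.

```java
import java.util.concurrent.locks.ReentrantLock;

// Two-lock queue sketch: producers serialize on tailLock, consumers on
// headLock. A dummy node decouples the two ends so the locks never
// guard the same node.
class TwoLockQueue<T> {
    static final class Node<T> {
        T value;
        // volatile gives the happens-before edge between a producer's
        // write under tailLock and a consumer's read under headLock
        // (they are different locks, so the locks alone don't provide it).
        volatile Node<T> next;
        Node(T v) { value = v; }
    }

    private Node<T> head = new Node<>(null); // dummy
    private Node<T> tail = head;
    private final ReentrantLock headLock = new ReentrantLock(); // consumer side
    private final ReentrantLock tailLock = new ReentrantLock(); // producer side

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        tailLock.lock();
        try {
            tail.next = node; // publish via volatile write
            tail = node;
        } finally {
            tailLock.unlock();
        }
    }

    public T dequeue() {
        headLock.lock();
        try {
            Node<T> first = head.next; // volatile read
            if (first == null) return null; // empty
            T v = first.value;
            first.value = null; // new dummy; help GC
            head = first;
            return v;
        } finally {
            headLock.unlock();
        }
    }
}
```

Every enqueue and dequeue pays one lock acquire-release; under many producers, all of them queue up on tailLock, which is exactly the bottleneck the lock-free version removes.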
A lock-free MPSC queue replaces both locks with CAS on the pointer. There is no kernel transition, no thread parking, no holder-suspended pathology. Under contention, the throughput is dominated by raw CAS speed instead of by lock acquisition. The queue also gains immunity to priority inversion: a low-priority producer that gets preempted in the middle of an enqueue does not block any other producer.
The textbook algorithm here is the Michael-Scott queue (Maged Michael and Michael Scott, 1996). It is technically MPMC, but the consumer side simplifies in the MPSC case: with a single consumer, the head pointer never has multiple writers, so the head update can be a plain store rather than a CAS. The producer side is identical.
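A sketch of that MPSC simplification, assuming GC (names are illustrative, and this is a sketch rather than a vetted implementation): the producer side is the usual Michael-Scott enqueue, while the consumer advances head with a plain store because it is the only writer.

```java
import java.util.concurrent.atomic.AtomicReference;

// MPSC variant of Michael-Scott: producers CAS on tail.next; the lone
// consumer advances head with a plain store, no CAS.
class MpscQueue<T> {
    static final class Node<T> {
        T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T v) { value = v; }
    }

    private final AtomicReference<Node<T>> head; // written only by the consumer
    private final AtomicReference<Node<T>> tail; // CAS'd by racing producers

    MpscQueue() {
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    // Producer side: identical shape to Michael-Scott's enqueue.
    void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            if (last.next.compareAndSet(null, node)) {
                tail.compareAndSet(last, node); // help advance tail
                return;
            }
            // Someone else appended first: help advance tail, then retry.
            tail.compareAndSet(last, last.next.get());
        }
    }

    // Consumer side: single writer of head, so no CAS and no retry loop.
    // Under GC it is safe for head to briefly pass a lagging tail;
    // producers repair tail via their help-CAS.
    T dequeue() {
        Node<T> first = head.get();
        Node<T> next = first.next.get();
        if (next == null) return null; // empty
        T v = next.value;
        next.value = null;  // next becomes the new dummy; help GC
        head.set(next);     // plain store
        return v;
    }
}
```

Compare this dequeue with the MPMC version later in the lesson: the `head.compareAndSet` retry loop is gone entirely.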
Where this is used in real systems
- LMAX Disruptor (a Java library used heavily in high-frequency trading) is built around a ring buffer with lock-free producer and consumer slots. It is the canonical example of "lock-free queue makes a measurable end-to-end difference".
- Go's runtime scheduler uses lock-free work-stealing deques to move goroutines between OS threads.
- The Linux kernel uses lock-free queues in several places, including network packet queues and some scheduler internals.
- JCTools, a Java library, ships production-grade lock-free queues (SPSC, MPSC, MPMC) that are widely used in latency-sensitive Java services.
When not to roll a custom one
Lock-free is hard
Hand-rolled lock-free data structures fail code review most of the time, and the reasons are always the same: memory-ordering subtleties (which atomics need acquire, which need release, which need both), the ABA problem under address reuse, false sharing on hot cache lines, and missing edge cases in the helping protocol. Use a vetted library: JCTools for Java, the LMAX Disruptor for trading systems, and Boost.Lockfree, moodycamel::ConcurrentQueue, or folly's queue family for C++. Roll a new one only with strong expertise and a measured need that the existing libraries cannot meet.
ABA in detail
ABA is the canonical pointer-based hazard for any lock-free structure that uses CAS on pointers. The Michael-Scott queue is no exception.
The story: producer thread A reads the tail pointer (value = node X). It plans to install its new node at X.next. Before A's CAS happens, the consumer thread (or, in the general MPMC case, another producer) pops X so it is no longer in the queue. The allocator reclaims X's memory. Then a fresh node is allocated and the allocator gives it back the same address that X used to live at. That fresh node ends up in the queue. Producer A finally runs its CAS. The pointer comparison succeeds because the bits match, but the node those bits refer to is a different node in a different position. The queue structure ends up corrupted.
The fix depends on the language.
- In Java and Go, the garbage collector prevents the address reuse from happening at all. As long as A holds a reference to the original X, the GC keeps X alive and the allocator cannot reuse its address. ABA is effectively impossible in idiomatic GC code.
- In C and C++, ABA has to be handled explicitly. The standard tools are hazard pointers (each thread publishes the pointers it is about to dereference, and the freer scans every thread's hazard list before freeing), epoch-based reclamation (delay every free until all threads have advanced past the freeing epoch), and tagged pointers (pack a small version counter into the pointer's unused bits, so the CAS compares both address and version).
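Java ships the tagged-pointer idea directly as java.util.concurrent.atomic.AtomicStampedReference: the CAS compares a (reference, stamp) pair, so a recycled reference with a stale stamp fails the compare. A minimal illustration (the A-B-A swap sequence here is contrived for the demo):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Demonstrates stamped CAS defeating ABA: the reference returns to its
// old value, but the stamp (a version counter) has moved on, so a CAS
// based on the stale snapshot fails as it should.
public class AbaDemo {
    static boolean staleCasSucceeds() {
        String a = "A", b = "B";
        AtomicStampedReference<String> top = new AtomicStampedReference<>(a, 0);

        // Thread 1 takes a snapshot: (A, stamp 0).
        int[] stampHolder = new int[1];
        String seen = top.get(stampHolder);
        int seenStamp = stampHolder[0];

        // Meanwhile another thread swaps A out and back in (A -> B -> A),
        // bumping the stamp each time.
        top.compareAndSet(a, b, 0, 1);
        top.compareAndSet(b, a, 1, 2);

        // Thread 1's CAS: the reference still matches (A == A), but the
        // stamp is stale (0 vs 2), so the whole CAS fails.
        return top.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println(staleCasSucceeds()); // prints false
    }
}
```

With a plain AtomicReference the final CAS would have succeeded silently, which is exactly the ABA corruption described above.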
There is a separate lesson on the ABA problem that goes through each of these in detail.
Implementations
Java's GC prevents node reuse, sidestepping ABA. The classic enqueue: read the tail, CAS the new node into tail.next; on success, advance the tail; on failure, retry.
import java.util.concurrent.atomic.AtomicReference;

class MichaelScottQueue<T> {
    static class Node<T> {
        T value;
        AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T v) { value = v; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    public MichaelScottQueue() {
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node); // help advance
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next); // help advance
                }
            }
        }
    }

    public T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == head.get()) {
                if (first == last) {
                    if (next == null) return null; // empty
                    tail.compareAndSet(last, next); // help
                } else {
                    T v = next.value;
                    if (head.compareAndSet(first, next)) return v;
                }
            }
        }
    }
}

Key points
- Linked list with atomic head and tail pointers
- enqueue: CAS to append; on failure, retry with the new tail
- dequeue: CAS to advance head; on failure, retry
- ABA-prone; needs hazard pointers, GC (Java), or tagged pointers
- Note: the Michael-Scott algorithm shown above is actually MPMC; a true MPSC variant simplifies the consumer side (no CAS on head)
Follow-up questions
- When is lock-free actually faster?
- How does Java sidestep ABA?
- MPSC vs MPMC: what's harder?
- Should I use this in production Java?
Gotchas
- ABA in C/C++: a freed node reused at the same address makes the CAS think nothing changed
- Memory ordering: acquire/release semantics or full fences are required
- Tail can lag: dequeue must help advance the tail before returning empty
- Cache-line contention: head and tail on the same cache line cause false sharing; pad to 64 bytes
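One way to do that padding in Java, sketched here with illustrative names: JCTools spreads the hot fields across a class hierarchy, because the JVM guarantees superclass fields are laid out before subclass fields (field order within a single class is not guaranteed). The alternative is the `jdk.internal.vm.annotation.@Contended` annotation, which needs `-XX:-RestrictContended` outside the JDK.

```java
import java.util.concurrent.atomic.AtomicReference;

// Cache-line padding via inheritance: the 8 longs (64 bytes) between the
// superclass's head and the subclass's tail guarantee the two hot fields
// land on different cache lines, so producer CASes on tail do not
// invalidate the consumer's line holding head.
class HeadField<T> {
    final AtomicReference<T> head = new AtomicReference<>();
}

class HeadPad<T> extends HeadField<T> {
    long p01, p02, p03, p04, p05, p06, p07, p08; // 64 bytes of padding
}

class PaddedQueueFields<T> extends HeadPad<T> {
    final AtomicReference<T> tail = new AtomicReference<>();
}
```

A queue class then extends PaddedQueueFields and uses head and tail exactly as before; the padding is invisible to the logic and only affects memory layout.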