Lock-Free MPSC Queue
A multi-producer, single-consumer queue using CAS on linked-list pointers. Producers append at tail; consumer pops from head. No locks. The Michael-Scott (1996) design is the textbook impl. Used in Go's runtime, Linux kernel, low-latency trading.
What this is
An MPSC queue is a queue with a specific shape of access: many producer threads can enqueue items concurrently, but only one consumer thread ever dequeues. The queue is built as a linked list. Two atomic pointers are kept at the ends of the list: head (where the consumer pops from) and tail (where producers append to). All synchronization is done by CAS on those two pointers; there are no locks.
The asymmetry between producers and consumer is what makes MPSC simpler than the general case. Producers race against each other to install their node at the tail, so the tail pointer needs CAS. The consumer is alone, so the head pointer can be advanced with a plain store. Half of the synchronization disappears compared to a multi-producer multi-consumer (MPMC) queue.
Why lock-free wins here
The straightforward way to build a thread-safe queue is two locks, one for the head and one for the tail. Java's LinkedBlockingQueue does exactly this. Under high producer contention, the tail lock becomes the bottleneck; under high consumer contention, the head lock does. Each acquire-release pair costs on the order of tens of nanoseconds even when uncontended, plus a kernel transition if contention forces a thread to park.
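For contrast, the two-lock design looks roughly like this. This is a minimal sketch of the idea, not LinkedBlockingQueue's actual code (which adds capacity bounds, blocking, and signalling); class and field names are illustrative.

```java
import java.util.concurrent.locks.ReentrantLock;

// Two-lock queue sketch: producers serialize on tailLock, consumers on
// headLock. A dummy node decouples the two ends so the locks never
// guard the same node.
class TwoLockQueue<T> {
    static final class Node<T> {
        T value;
        // volatile gives the happens-before edge between a producer's
        // write under tailLock and a consumer's read under headLock
        // (they are different locks, so the locks alone don't provide it).
        volatile Node<T> next;
        Node(T v) { value = v; }
    }

    private Node<T> head = new Node<>(null); // dummy
    private Node<T> tail = head;
    private final ReentrantLock headLock = new ReentrantLock(); // consumer side
    private final ReentrantLock tailLock = new ReentrantLock(); // producer side

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        tailLock.lock();
        try {
            tail.next = node; // publish via volatile write
            tail = node;
        } finally {
            tailLock.unlock();
        }
    }

    public T dequeue() {
        headLock.lock();
        try {
            Node<T> first = head.next; // volatile read
            if (first == null) return null; // empty
            T v = first.value;
            first.value = null; // new dummy; help GC
            head = first;
            return v;
        } finally {
            headLock.unlock();
        }
    }
}
```

Every enqueue and dequeue pays one lock acquire-release; under many producers, all of them queue up on tailLock, which is exactly the bottleneck the lock-free version removes.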
A lock-free MPSC queue replaces both locks with CAS on the pointer. There is no kernel transition, no thread parking, no holder-suspended pathology. Under contention, the throughput is dominated by raw CAS speed instead of by lock acquisition. The queue also gains immunity to priority inversion: a low-priority producer that gets preempted in the middle of an enqueue does not block any other producer.
The textbook algorithm here is the Michael-Scott queue (Maged Michael and Michael Scott, 1996). It is technically MPMC, but the consumer side simplifies in the MPSC case: with a single consumer, the head pointer never has multiple writers, so the head update can be a plain store rather than a CAS. The producer side is identical.
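A sketch of that MPSC simplification, assuming GC (names are illustrative, and this is a sketch rather than a vetted implementation): the producer side is the usual Michael-Scott enqueue, while the consumer advances head with a plain store because it is the only writer.

```java
import java.util.concurrent.atomic.AtomicReference;

// MPSC variant of Michael-Scott: producers CAS on tail.next; the lone
// consumer advances head with a plain store, no CAS.
class MpscQueue<T> {
    static final class Node<T> {
        T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T v) { value = v; }
    }

    private final AtomicReference<Node<T>> head; // written only by the consumer
    private final AtomicReference<Node<T>> tail; // CAS'd by racing producers

    MpscQueue() {
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    // Producer side: identical shape to Michael-Scott's enqueue.
    void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            if (last.next.compareAndSet(null, node)) {
                tail.compareAndSet(last, node); // help advance tail
                return;
            }
            // Someone else appended first: help advance tail, then retry.
            tail.compareAndSet(last, last.next.get());
        }
    }

    // Consumer side: single writer of head, so no CAS and no retry loop.
    // Under GC it is safe for head to briefly pass a lagging tail;
    // producers repair tail via their help-CAS.
    T dequeue() {
        Node<T> first = head.get();
        Node<T> next = first.next.get();
        if (next == null) return null; // empty
        T v = next.value;
        next.value = null;  // next becomes the new dummy; help GC
        head.set(next);     // plain store
        return v;
    }
}
```

Compare this dequeue with the MPMC version later in the lesson: the `head.compareAndSet` retry loop is gone entirely.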
Where this is used in real systems
- LMAX Disruptor (a Java library used heavily in high-frequency trading) is built around a ring buffer with lock-free producer and consumer slots. It is the canonical example of "lock-free queue makes a measurable end-to-end difference".
- Go's runtime scheduler uses lock-free work-stealing deques to move goroutines between OS threads.
- The Linux kernel uses lock-free queues in several places, including network packet queues and some scheduler internals.
- JCTools, a Java library, ships production-grade lock-free queues (SPSC, MPSC, MPMC) that are widely used in latency-sensitive Java services.
When not to roll a custom one
Lock-free is hard
Hand-rolled lock-free data structures fail code review most of the time, and the reasons are always the same: memory-ordering subtleties (which atomics need acquire, which need release, which need both), the ABA problem under address reuse, false sharing on hot cache lines, and missing edge cases in the helping protocol. Use a vetted library: JCTools for Java, the LMAX Disruptor for trading systems, and Boost.Lockfree, moodycamel::ConcurrentQueue, or folly's queue family for C++. Roll a new one only with strong expertise and a measured need that the existing libraries cannot meet.
ABA in detail
ABA is the canonical pointer-based hazard for any lock-free structure that uses CAS on pointers. The Michael-Scott queue is no exception.
The story: producer thread A reads the tail pointer (value = node X). It plans to install its new node at X.next. Before A's CAS happens, the consumer thread (or, in the general MPMC case, another producer) pops X so it is no longer in the queue. The allocator reclaims X's memory. Then a fresh node is allocated and the allocator gives it back the same address that X used to live at. That fresh node ends up in the queue. Producer A finally runs its CAS. The pointer comparison succeeds because the bits match, but the node those bits refer to is a different node in a different position. The queue structure ends up corrupted.
The fix depends on the language.
- In Java and Go, the garbage collector prevents the address reuse from happening at all. As long as A holds a reference to the original X, the GC keeps X alive and the allocator cannot reuse its address. ABA is effectively impossible in idiomatic GC code.
- In C and C++, ABA has to be handled explicitly. The standard tools are hazard pointers (each thread publishes the pointers it is about to dereference, and the freer scans every thread's hazard list before freeing), epoch-based reclamation (delay every free until all threads have advanced past the freeing epoch), and tagged pointers (pack a small version counter into the pointer's unused bits, so the CAS compares both address and version).
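Java ships the tagged-pointer idea directly as java.util.concurrent.atomic.AtomicStampedReference: the CAS compares a (reference, stamp) pair, so a recycled reference with a stale stamp fails the compare. A minimal illustration (the A-B-A swap sequence here is contrived for the demo):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Demonstrates stamped CAS defeating ABA: the reference returns to its
// old value, but the stamp (a version counter) has moved on, so a CAS
// based on the stale snapshot fails as it should.
public class AbaDemo {
    static boolean staleCasSucceeds() {
        String a = "A", b = "B";
        AtomicStampedReference<String> top = new AtomicStampedReference<>(a, 0);

        // Thread 1 takes a snapshot: (A, stamp 0).
        int[] stampHolder = new int[1];
        String seen = top.get(stampHolder);
        int seenStamp = stampHolder[0];

        // Meanwhile another thread swaps A out and back in (A -> B -> A),
        // bumping the stamp each time.
        top.compareAndSet(a, b, 0, 1);
        top.compareAndSet(b, a, 1, 2);

        // Thread 1's CAS: the reference still matches (A == A), but the
        // stamp is stale (0 vs 2), so the whole CAS fails.
        return top.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println(staleCasSucceeds()); // prints false
    }
}
```

With a plain AtomicReference the final CAS would have succeeded silently, which is exactly the ABA corruption described above.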
There is a separate lesson on the ABA problem that goes through each of these in detail.
Implementations
Java's GC prevents node reuse, sidestepping ABA. The classic enqueue: read the tail, CAS the new node into tail.next; on success, advance the tail; on failure, retry.
import java.util.concurrent.atomic.AtomicReference;

class MichaelScottQueue<T> {
    static class Node<T> {
        T value;
        AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T v) { value = v; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    public MichaelScottQueue() {
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {
                        tail.compareAndSet(last, node); // help advance
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next); // help advance
                }
            }
        }
    }

    public T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == head.get()) {
                if (first == last) {
                    if (next == null) return null; // empty
                    tail.compareAndSet(last, next); // help
                } else {
                    T v = next.value;
                    if (head.compareAndSet(first, next)) return v;
                }
            }
        }
    }
}

Key points
- Linked list with atomic head and tail pointers
- enqueue: CAS to append; on failure, retry with the new tail
- dequeue: CAS to advance head; on failure, retry
- ABA-prone; needs hazard pointers, GC (Java), or tagged pointers
- Note: the Michael-Scott algorithm shown above is actually MPMC; a true MPSC variant simplifies the consumer side (no CAS on head)
Follow-up questions
- When is lock-free actually faster?
- How does Java sidestep ABA?
- MPSC vs MPMC: what's harder?
- Should I use this in production Java?
Gotchas
- ABA in C/C++: a freed node reused at the same address makes the CAS think nothing changed
- Memory ordering: acquire/release semantics or full fences are required
- Tail can lag: dequeue must help advance the tail before returning empty
- Cache-line contention: head and tail on the same cache line cause false sharing; pad to 64 bytes
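One way to do that padding in Java, sketched here with illustrative names: JCTools spreads the hot fields across a class hierarchy, because the JVM guarantees superclass fields are laid out before subclass fields (field order within a single class is not guaranteed). The alternative is the `jdk.internal.vm.annotation.@Contended` annotation, which needs `-XX:-RestrictContended` outside the JDK.

```java
import java.util.concurrent.atomic.AtomicReference;

// Cache-line padding via inheritance: the 8 longs (64 bytes) between the
// superclass's head and the subclass's tail guarantee the two hot fields
// land on different cache lines, so producer CASes on tail do not
// invalidate the consumer's line holding head.
class HeadField<T> {
    final AtomicReference<T> head = new AtomicReference<>();
}

class HeadPad<T> extends HeadField<T> {
    long p01, p02, p03, p04, p05, p06, p07, p08; // 64 bytes of padding
}

class PaddedQueueFields<T> extends HeadPad<T> {
    final AtomicReference<T> tail = new AtomicReference<>();
}
```

A queue class then extends PaddedQueueFields and uses head and tail exactly as before; the padding is invisible to the logic and only affects memory layout.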