LMAX Disruptor
A pre-allocated ring buffer plus per-consumer sequence cursors. Producers claim slots with CAS; consumers wait until the producer cursor advances past their position. Slots are reused (no per-message allocation), data lives in cache lines under explicit control (no false sharing), and the wait strategy is pluggable (busy spin, yield, block). Built by LMAX for trading, hits 25M+ messages per second per pipeline.
The Disruptor in plain English
Picture a circular conveyor belt with 1024 slots. The producer drops items into slots, walking around the ring. The consumer picks items out of slots, walking around the same ring, a little behind the producer. Both walkers keep track of which slot they are at right now using a number called a cursor. That is the entire data structure: a ring of slots and two cursors. There are no locks anywhere.
The "done" slots are behind the consumer, already taken. The "ready" slots are between the two cursors, waiting to be consumed. The "empty" slots are ahead of the producer, free for the next round. The gap between the two cursors is the backlog of items waiting to be consumed. The maximum gap that can ever exist is the ring size itself; if the producer pulls one full lap ahead, it would overwrite a slot the consumer has not read yet, so it must wait.
The producer publishes a new item by writing into the slot at producerCursor + 1 and then bumping producerCursor. The consumer reads from consumerCursor + 1, then bumps its own cursor. No locks, no per-message allocation (the slots are pre-allocated objects that get reused forever), no cache-line bouncing (each cursor sits alone in its own 64-byte cache line so writes from one core do not invalidate the other core's cached copy).
Why it is fast
The classic point of comparison is LinkedBlockingQueue or any other standard locked queue from java.util.concurrent. The Disruptor wins on three independent things at once.
| Per-message cost | LinkedBlockingQueue | LMAX Disruptor |
|---|---|---|
| Allocation | One new node per put | Zero, slots are pre-allocated and reused |
| Lock | Acquire on every put and take | None, atomic cursor read/write |
| Memory layout | Scattered nodes on the heap | One contiguous 64KB ring (1024 × 64 bytes) |
| Garbage on hot path | One node every message, GC churn | None |
| Cross-core coordination | Lock contention, parking, kernel transitions | Cache-friendly atomic cursor advances |
Each row helps a little; together they compound. The result on hot paths is around ten to a hundred times higher throughput than a locked queue. The LMAX trading exchange itself runs around six million trades per second through this design. Log4j2's async logger uses the same library and pushes around eighteen million events per second.
When this is the right tool
The Disruptor is a specialised tool. It pays off when all of these are true:
- The workload moves more than around a million messages per second through a single pipeline.
- Tail latency matters, that is, the goal is a tight p99 or p99.9, not just an acceptable average.
- The work per message is small enough that queue overhead is a real fraction of total time.
For a typical web service, a CRUD application, or a request handler, none of those are true. The right choice in those cases is a LinkedBlockingQueue, an ArrayBlockingQueue, or a Java 21 virtual-thread-per-request model. They are simpler, well-understood, and fast enough. The Disruptor's complexity is only worth it under sustained, high-rate, low-latency workloads.
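For contrast, the "fast enough" baseline for those workloads is just a bounded blocking queue. A minimal sketch (class and message names here are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BaselineQueue {
    public static void main(String[] args) throws InterruptedException {
        // Bounded, lock-based, array-backed: built-in back-pressure, fine for typical request rates.
        BlockingQueue<String> work = new ArrayBlockingQueue<>(1024);

        Thread consumer = Thread.ofPlatform().start(() -> {
            try {
                while (true) {
                    String msg = work.take();          // parks when the queue is empty
                    System.out.println("handled " + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();    // shut down quietly
            }
        });

        work.put("request-1");                         // blocks when the queue is full
        work.put("request-2");
        Thread.sleep(100);
        consumer.interrupt();
    }
}
```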
How the cursors stay in sync without locks
The whole design is held together by two invariants. As long as both invariants are kept, no lock is needed.
| Invariant | What it prevents |
|---|---|
| `consumerCursor <= producerCursor` | The consumer reading a slot the producer has not written |
| `producerCursor - consumerCursor <= SIZE` | The producer overwriting a slot the consumer has not read |
The producer's protocol for publishing slot N is three steps:
- Wait until `N - consumerCursor <= SIZE`. There is room in the ring.
- Write the data into `slots[N & MASK]`. The mask trick (where `MASK = SIZE - 1`) replaces a modulo with a single AND because `SIZE` is a power of two.
- Atomically set `producerCursor = N` with release semantics. This is the publish.
The consumer's protocol for consuming slot N is also three steps:
- Wait until `N <= producerCursor`. The producer has written the slot.
- Read `slots[N & MASK]` with acquire semantics.
- Atomically set `consumerCursor = N`. The slot is now free for the producer to reuse on its next lap.
The atomic write to producerCursor is the publish. Any consumer that observes the new cursor value is guaranteed to also see the slot data, because the slot write happens-before the cursor write in program order, and the release-acquire pair on the cursor makes that ordering visible across cores. No locks, no allocations, just two atomic numbers and a chunk of contiguous memory.
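To make the release/acquire pairing concrete, here is a minimal single-producer sketch using explicit VarHandle access modes; the class and field names are illustrative, and the AtomicLong examples under Implementations get the same guarantee from plain volatile get/set:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class PublishIdiom {
    static final int SIZE = 1024, MASK = SIZE - 1;
    final long[] slots = new long[SIZE];
    long producerCursor = -1;                       // accessed only via the VarHandle below

    static final VarHandle CURSOR;
    static {
        try {
            CURSOR = MethodHandles.lookup()
                    .findVarHandle(PublishIdiom.class, "producerCursor", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(long seq, long value) {
        slots[(int) (seq & MASK)] = value;          // 1. write the slot (plain store)
        CURSOR.setRelease(this, seq);               // 2. release-store the cursor: the publish
    }

    long read(long seq) {
        while ((long) CURSOR.getAcquire(this) < seq) Thread.onSpinWait(); // acquire pairs with the release
        return slots[(int) (seq & MASK)];           // guaranteed to observe the slot write
    }
}
```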
Wait strategies
How a thread waits when it cannot make progress (the consumer waiting for the producer, or the producer waiting for slot reclamation) is configurable. The four common strategies trade CPU usage against latency; the list below runs from lowest latency (most CPU) to lowest CPU (highest latency):
- BusySpinWaitStrategy. A tight `while (!ready) Thread.onSpinWait();` loop. Microsecond-level latency, eats 100% of a core while waiting. Only sensible on a dedicated core in a latency-critical system.
- YieldingWaitStrategy. Spin a few times, then call `Thread.yield()`. Low latency, less CPU-greedy than busy spin. The sensible default for most low-latency systems.
- SleepingWaitStrategy. Spin briefly, then `LockSupport.parkNanos(1)`. Higher latency, near-zero CPU when idle. Good for cold or low-rate consumers.
- BlockingWaitStrategy. Park on a `Condition` until the producer signals. Highest latency, lowest CPU. Use for consumers that are expected to be cold most of the time, mixed with hot ones in the same pipeline.
The mistake people make is picking BusySpinWaitStrategy because it sounds fast, then watching the rest of the services on the box starve for CPU. Always pick the wait strategy under a representative load and look at the impact on neighbours, not just on the Disruptor itself.
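For intuition about the middle of that spectrum, here is a hand-rolled escalation in the spirit of SleepingWaitStrategy; the thresholds and method name are illustrative, not the library's code:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

final class EscalatingWait {
    // Wait until the cursor reaches 'wanted': spin, then yield, then park.
    static long awaitSequence(long wanted, AtomicLong cursor) {
        int counter = 200;
        long available;
        while ((available = cursor.get()) < wanted) {
            if (counter > 100)    { Thread.onSpinWait(); counter--; }   // stage 1: burn a few cycles
            else if (counter > 0) { Thread.yield();      counter--; }   // stage 2: give up the time slice
            else                  { LockSupport.parkNanos(1L); }        // stage 3: park, near-zero CPU
        }
        return available;
    }
}
```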
Pre-allocation cuts both ways
Slots hold references. Putting a reference to a large object into a slot prevents that object from being garbage-collected until the slot is overwritten on the next lap. For events that contain only primitive fields (longs, fixed-width records), this is fine. For events that hold byte[] payloads, large String references, or other heap-heavy data, a lot of garbage stays pinned in the ring even though the consumer is "done" with it. Either size the ring carefully so the lap time is short, or explicitly null out the heavy fields once a consumer is finished with the slot.
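One way to do the explicit clearing is a final handler stage whose only job is to null the heavy fields. A minimal sketch against the LMAX EventHandler interface; the event and handler names are illustrative:

```java
import com.lmax.disruptor.EventHandler;

class PayloadEvent {
    byte[] payload;                  // heap-heavy reference held by the slot
    void clear() { payload = null; }
}

// Last handler in the chain: drop the reference so the ring does not pin the payload
// until the producer laps this slot again.
class ClearingHandler implements EventHandler<PayloadEvent> {
    @Override
    public void onEvent(PayloadEvent event, long sequence, boolean endOfBatch) {
        event.clear();
    }
}

// Wiring (illustrative): disruptor.handleEventsWith(businessHandler).then(new ClearingHandler());
```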
Implementations
The simplest Disruptor variant: one producer, one consumer, fixed-size ring buffer of pre-allocated event objects. Producer claims the next slot, fills it, publishes by advancing the cursor. Consumer waits for the cursor, then reads. No locks, no per-event allocation. Padding around the cursors prevents false sharing between producer and consumer caches.
```java
import java.util.concurrent.atomic.AtomicLong;

class Event { long value; } // mutable, reused

// PaddedAtomicLong: AtomicLong plus unused longs to fill a 64-byte cache line
// (a more robust alternative is jdk.internal.vm.annotation.@Contended; see below)
class PaddedAtomicLong extends AtomicLong {
    volatile long p1, p2, p3, p4, p5, p6, p7; // filler so the two cursors never share a line
    PaddedAtomicLong(long initial) { super(initial); }
}

class SpscRingBuffer {
    private static final int SIZE = 1024; // power of two
    private static final int MASK = SIZE - 1;
    private final Event[] slots = new Event[SIZE];

    // Padded cursors: each AtomicLong sits alone in its cache line
    private final PaddedAtomicLong producerCursor = new PaddedAtomicLong(-1);
    private final PaddedAtomicLong consumerCursor = new PaddedAtomicLong(-1);

    SpscRingBuffer() { for (int i = 0; i < SIZE; i++) slots[i] = new Event(); }

    // Producer
    void publish(long value) {
        long seq = producerCursor.get() + 1;
        while (seq - consumerCursor.get() > SIZE) Thread.onSpinWait(); // wait for room
        slots[(int) (seq & MASK)].value = value;
        producerCursor.set(seq); // release: makes the slot write visible to the consumer
    }

    // Consumer
    long consume() {
        long seq = consumerCursor.get() + 1;
        while (seq > producerCursor.get()) Thread.onSpinWait(); // wait for data
        long v = slots[(int) (seq & MASK)].value;
        consumerCursor.set(seq); // release the slot for reuse
        return v;
    }
}
```

With multiple producers, each one CASes the producer cursor to its claimed sequence. The catch: a producer that claims sequence N may publish after a producer that claimed N+1, so consumers can't just read producerCursor and assume everything below is published. The Disruptor solves this with an "available buffer" array tracking which sequences are actually published.
```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only; the real Disruptor adds an availableBuffer for visibility tracking.
class MpscRingBuffer {
    private static final int SIZE = 1024;
    private static final int MASK = SIZE - 1;
    private final Event[] slots = new Event[SIZE]; // Event as in the SPSC example
    private final AtomicLong cursor = new AtomicLong(-1);
    private final AtomicLong consumerCursor = new AtomicLong(-1);

    MpscRingBuffer() { for (int i = 0; i < SIZE; i++) slots[i] = new Event(); }

    long claim() {
        while (true) {
            long current = cursor.get();
            long seq = current + 1;
            if (seq - consumerCursor.get() > SIZE) {
                Thread.onSpinWait(); // no room; wait for the consumer to catch up
                continue;
            }
            if (cursor.compareAndSet(current, seq)) { // try to claim seq
                return seq;
            }
            // CAS lost to another producer; loop and try the next sequence
        }
    }

    void publish(long seq, long value) {
        slots[(int) (seq & MASK)].value = value;
        // Real Disruptor: mark availableBuffer[seq & MASK] = (seq >>> log2(SIZE));
        // consumers scan availableBuffer to find the highest fully-published sequence.
    }
}
```

In production, use the LMAX library rather than rolling a custom one. The API exposes the ring buffer through an EventFactory (slot allocator), an EventTranslator (producer-side fill), and an EventHandler (consumer-side process). Wait strategies and producer types are configuration, not code.
```java
// Maven: com.lmax:disruptor:4.0.0
import com.lmax.disruptor.*;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

class TradeEvent { String symbol; double price; long qty; }

// Slot factory: pre-allocate the events
EventFactory<TradeEvent> factory = TradeEvent::new;

Disruptor<TradeEvent> disruptor = new Disruptor<>(
    factory,
    1024,                                                 // ring size, power of two
    Thread.ofPlatform().name("disruptor-", 0).factory(),
    ProducerType.SINGLE,                                  // SINGLE is much faster
    new YieldingWaitStrategy()
);

// Pipeline: handler1 -> handler2 (handler2 sees handler1's writes)
disruptor.handleEventsWith((event, seq, end) -> riskCheck(event))
         .then((event, seq, end) -> persist(event));
disruptor.start();

// Publish (producer side)
RingBuffer<TradeEvent> rb = disruptor.getRingBuffer();
rb.publishEvent((event, seq) -> {
    event.symbol = "AAPL";
    event.price = 235.10;
    event.qty = 100;
});

// No allocations on the hot path; events recycle around the ring.
```

Two unrelated AtomicLongs in the same cache line cause writes from one core to invalidate the other core's cached copy. Result: throughput drops 5-10x. Padding to a full cache line (64 bytes on x86) fixes it. JDK 8+ has @Contended for this.
```java
import jdk.internal.vm.annotation.Contended;

// BAD: false sharing -- the two hot counters can land in the same 64-byte line
class UnpaddedCursors {
    volatile long producer;
    volatile long consumer;
}

// GOOD: @Contended pads each field into its own cache line
class PaddedCursors {
    @Contended volatile long producer;
    @Contended volatile long consumer;
}

// Notes:
// - The annotation is sun.misc.Contended on JDK 8 and jdk.internal.vm.annotation.Contended on JDK 9+.
// - Compiling against it needs --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED,
//   and the JVM only honours it outside the JDK with -XX:-RestrictContended.
// - Padding a *reference* field to an AtomicLong does not help: the two AtomicLong objects can still
//   share a line. Pad the counter's storage itself (as above, or with hand-rolled filler longs).
```

Key points
- Ring buffer of fixed size (must be a power of two so `index = sequence & (size - 1)` is one AND instruction).
- Slots are pre-allocated objects that get reused; producers fill in fields, consumers read them. Zero per-message allocation.
- Producer cursor: an AtomicLong (cache-line padded) holding the highest published sequence. Consumers spin until their target sequence <= producer cursor.
- Consumer cursor: each consumer has its own cursor. Producers wait if the slowest consumer is too far behind (overwrites would lose data).
- Wait strategies are pluggable: BusySpinWaitStrategy (lowest latency, burns CPU), YieldingWaitStrategy (spin then yield), BlockingWaitStrategy (park on a condition).
- False sharing is the silent killer; every cursor is padded to occupy its own cache line.
- Single-producer mode skips the producer-side CAS and is much faster than multi-producer mode.
Follow-up questions
- Why is a ring buffer so much faster than a BlockingQueue?
- Why power of two for the ring size?
- What happens when the slowest consumer falls too far behind?
- Single-producer vs multi-producer: how big is the difference?
- When is the Disruptor the wrong choice?
Gotchas
- Forgetting padding. The single most common mistake: a "simple ring buffer" loses 80% of its throughput to false sharing between cursors.
- Using an Object[] of freshly allocated objects instead of pre-allocated slot fields. Defeats the no-allocation property.
- Multi-producer mode without availability tracking: consumers will read uninitialized slots (see the sketch after this list).
- Wait strategies that burn CPU in production. BusySpinWaitStrategy is for dedicated cores; on shared boxes it starves other work.
- Treating the ring as a dynamically resizable queue. The size is fixed at construction; pick it based on maximum in-flight, not average.
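To make the availability-tracking point concrete, here is a sketch of the lap-stamping idea behind the real multi-producer sequencer's available buffer; the class and method names are illustrative:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class AvailableBuffer {
    static final int SIZE = 1024, MASK = SIZE - 1, SHIFT = Integer.numberOfTrailingZeros(SIZE);
    private final AtomicIntegerArray available = new AtomicIntegerArray(SIZE);

    AvailableBuffer() { for (int i = 0; i < SIZE; i++) available.set(i, -1); } // -1 = never published

    // Producer, after filling the slot: stamp it with its lap number. This is the publish.
    void setAvailable(long seq) {
        available.set((int) (seq & MASK), (int) (seq >>> SHIFT));   // volatile store
    }

    // A sequence is published once its slot carries the matching lap number.
    boolean isAvailable(long seq) {
        return available.get((int) (seq & MASK)) == (int) (seq >>> SHIFT);
    }

    // Consumers read up to the highest *contiguous* published sequence, never past a gap.
    long highestPublished(long from, long claimed) {
        for (long s = from; s <= claimed; s++) {
            if (!isAvailable(s)) return s - 1;
        }
        return claimed;
    }
}
```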
Common pitfalls
- Building a Disruptor for a workload that doesn't need it. The library has nontrivial complexity; the win only shows up under sustained, high-rate, low-latency workloads (HFT, market data, log shipping).
- Pinning consumer threads incorrectly. Disruptor benefits from CPU pinning, but pinning the wrong thread to a shared core ruins it.
- Mixing Disruptor with reflection-based serialization. Per-event Kryo or Jackson eats the latency budget; pre-marshal upstream.
APIs worth memorising
- com.lmax.disruptor.RingBuffer
- com.lmax.disruptor.dsl.Disruptor
- com.lmax.disruptor.WaitStrategy (BlockingWaitStrategy, YieldingWaitStrategy, BusySpinWaitStrategy, SleepingWaitStrategy)
- jdk.internal.vm.annotation.Contended
Real-world usage
LMAX trading exchange (the original use case, ~6M trades per second). Log4j2's async logger uses Disruptor for 18M events per second. Apache Storm uses ring-buffer-style transport. Many HFT firms have internal Disruptor variants. Outside of trading and ultra-high-throughput logging, the pattern is rare; most teams reach for a BlockingQueue or a Kafka topic.