Bug Hunt: Memory Ordering on ARM

The Bug

The reproducer is small. Thread A writes data, then sets a "ready" flag. Thread B spins on the flag, then reads data. On x86, the reader reliably sees 42 in practice. On ARM, it might see 0.
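A minimal C++ sketch of that reproducer follows. It is an illustration under assumptions, not the original test case: the function name run_racy_handoff is made up, and relaxed atomics stand in for the plain variables, because a data race on plain variables is undefined behaviour in C++. Relaxed operations impose no ordering between the two stores or between the two loads, so the reordering described here remains allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the handoff once and returns the value the reader saw.
// Relaxed atomics model "no synchronisation": each access is atomic,
// but nothing orders the payload store against the flag store.
int run_racy_handoff() {
    std::atomic<int> data{0};
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread writer([&] {
        data.store(42, std::memory_order_relaxed);     // 1. write the payload
        ready.store(true, std::memory_order_relaxed);  // 2. raise the flag
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();                 // spin until the flag is up
        }
        seen = data.load(std::memory_order_relaxed);   // may be 0 or 42
    });

    writer.join();
    reader.join();
    assert(seen == 0 || seen == 42);  // the only two values data ever holds
    return seen;
}
```

A single run will almost always return 42, even on ARM. The stale result only shows up occasionally, which is what makes the bug flaky.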

This is the canonical memory-ordering bug. It's easy to miss because most developers test on x86 and the bug doesn't show up. When the code moves to ARM (Apple Silicon, AWS Graviton, Raspberry Pi), the bug surfaces as flaky behaviour.

What's Going Wrong

Without explicit synchronisation:

The writer's two stores (data = 42; ready = true) can be reordered by the compiler or by the CPU. On ARM, the hardware allows ready = true to become visible to other cores before data = 42 does.
The reader's loads (if (ready) return data) can be reordered too. On ARM, the load of data can be satisfied before the load of ready completes (speculative, out-of-order execution).

Either reordering can cause the reader to observe ready = true but data = 0. The "happens-before" relationship the programmer assumed is not enforced by the hardware.

Why x86 Hides It

x86 has historically strong memory ordering:

Stores are observed in program order. The writer's two stores can't be reordered.
Loads can't be reordered with prior loads. The reader's loads of ready and then data are sequential.

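This is essentially Total Store Order (TSO): the only reordering the hardware permits is a later load completing before an earlier store to a different address, which this pattern doesn't depend on. TSO is stronger than what the C/C++ and Java memory models guarantee for non-synchronised code, so code that relies on x86 ordering can have hidden bugs.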

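ARM and POWER have weaker models. Stores to different addresses can become visible out of order, and loads can be satisfied out of order. The hardware actually performs reorderings that the language spec already permits for unsynchronised code.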

The Fix

Use proper synchronisation:

Java: declare the flag volatile, or use AtomicBoolean. Volatile gives release-on-write, acquire-on-read.
C++: use std::atomic<bool> with memory_order_release on the store and memory_order_acquire on the load.
Go: use atomic.Bool from the sync/atomic package. Atomics in Go are sequentially consistent.
Rust: use std::sync::atomic::AtomicBool with Ordering::Release and Ordering::Acquire.

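These primitives insert the right memory barriers for the target architecture and also stop the compiler from reordering across them. On x86, acquire loads and release stores are usually free at the hardware level (they compile to plain MOV instructions), although sequentially consistent stores still cost an XCHG or MFENCE. On ARM, they emit explicit DMB barriers or LDAR/STLR instructions.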
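The C++ variant of the fix might look like the sketch below. The function name run_fixed_handoff is hypothetical. The payload is a plain int here: once the flag is a std::atomic<bool> written with release and read with acquire, the write to data happens-before the read, so the payload itself needs no atomic.

```cpp
#include <atomic>
#include <thread>

// Runs the handoff once with release/acquire on the flag and returns
// the value the reader saw.
int run_fixed_handoff() {
    int data = 0;                     // plain payload, published via the flag
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread writer([&] {
        data = 42;                                     // 1. write the payload
        ready.store(true, std::memory_order_release);  // 2. publish: earlier writes
                                                       //    cannot move below this
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();                 // later reads cannot move
        }                                              // above the acquire load
        seen = data;                                   // guaranteed to observe 42
    });

    writer.join();
    reader.join();
    return seen;
}
```

The guarantee comes from the pairing: a release store alone, or an acquire load alone, is not enough. Both sides of the handoff have to use the atomic flag.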

Detection

Three approaches:

Race detector. Go's -race, C++/Rust ThreadSanitizer (TSan). They detect concurrent unprotected access regardless of architecture. Run in CI.
Stress testing on ARM hardware. Apple Silicon laptops, AWS Graviton EC2, Raspberry Pi. Reproduce the workload at high concurrency; bugs that pass on x86 often show up.
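Memory model tools. jcstress stress-tests Java memory model assumptions by running small concurrent tests many times and tallying the observed outcomes. CDSChecker and RCMC are model checkers for C/C++ atomics that explore the allowed executions systematically. Both kinds of tool are used in lock-free library development.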
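As a rough illustration of the stress-testing approach, the sketch below repeats the handoff many times and counts how often the reader sees the flag without the payload. The function name count_stale_reads and its parameters are assumptions made for this example. It is a toy harness, not a substitute for jcstress or a model checker: spawning fresh threads on every iteration keeps the race window small, so a count of zero with relaxed orderings proves nothing.

```cpp
#include <atomic>
#include <thread>

// Repeats the handoff `iterations` times and returns how many times the
// reader saw ready == true but data != 42.
// store_order must be valid for a store (relaxed, release, seq_cst) and
// load_order valid for a load (relaxed, acquire, seq_cst).
int count_stale_reads(int iterations,
                      std::memory_order store_order,
                      std::memory_order load_order) {
    int stale = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> data{0};
        std::atomic<bool> ready{false};
        int seen = -1;

        std::thread reader([&] {
            while (!ready.load(load_order)) {
                std::this_thread::yield();            // spin on the flag
            }
            seen = data.load(std::memory_order_relaxed);
        });
        std::thread writer([&] {
            data.store(42, std::memory_order_relaxed);
            ready.store(true, store_order);           // ordering under test
        });

        reader.join();
        writer.join();
        if (seen != 42) {
            ++stale;                                  // flag seen, payload missed
        }
    }
    return stale;
}
```

With std::memory_order_release and std::memory_order_acquire the count is zero by construction. With std::memory_order_relaxed on both sides, any non-zero count on ARM hardware is a direct observation of the bug.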

Never read or write a shared mutable variable without synchronisation: pick volatile, atomic, mutex, or channel, but pick one. Remember that x86 is forgiving, ARM is less so, and the language spec is less forgiving still. Code that "works on x86" may break on ARM, and code that "works on ARM" may still violate what the spec actually promises.

The pattern is: write portable code that follows the language memory model, then catch the strays with race detection in CI. Once that's in place, correctness no longer depends on any particular hardware's ordering rules.

Follow-up questions

▸Why doesn't this bug appear on an x86 laptop?

x86 has strong memory ordering: stores are observed in program order, and loads aren't reordered with respect to each other. The hardware happens to provide the ordering that the language promises only when atomic/volatile is used, so buggy code 'just works' on x86. On ARM/POWER, the hardware is more permissive, which exposes the missing synchronisation.

▸Are these bugs detectable in tests?

Stress testing on ARM hardware exposes many of them. Race detectors (Go's -race, ThreadSanitizer for C++/Rust) catch concurrent unprotected access regardless of platform. Run TSan in CI. For Java, jcstress is the gold standard for stress-testing memory model assumptions.

▸What if the code is just for x86?

Three reasons to still fix it: (1) future-proofing: the code may move to ARM (Apple Silicon, AWS Graviton); (2) compiler reordering: even on x86, the compiler can reorder reads and writes within a thread, breaking lock-free protocols; (3) clarity and correctness: proper synchronisation makes the code correct by the language spec and makes its intent clear to readers.

▸How does this relate to volatile in Java?

Java's volatile gives release-on-write and acquire-on-read semantics on a single field. It's the minimum required to publish a value safely across threads. Plain (non-volatile) Java fields can be cached in registers, reordered, or have stale values across threads. Use volatile (or atomic, or synchronized) for any shared mutable state.

