Bug Hunt: Memory Ordering on ARM
Code passes tests on x86, fails on ARM. The bug: missing synchronisation (volatile in Java, std::atomic in C++, sync/atomic in Go) on a flag that the writer sets and the reader checks. x86's strong ordering hides the bug; ARM's weaker ordering exposes it. Fix: use proper synchronisation primitives, never plain reads or writes of shared mutable variables.
The Bug
The reproducer is small. Thread A writes data, then writes a "ready" flag. Thread B spins on the flag, then reads data. On x86, the result is 42. On ARM, it might be 0.
This is the canonical memory-ordering bug. It's easy to miss because most developers test on x86 and the bug doesn't show up. When the code moves to ARM (Apple Silicon, AWS Graviton, Raspberry Pi), the bug surfaces as flaky behaviour.
What's Going Wrong
Without explicit synchronisation:
- The writer's two stores (data = 42; ready = true) can be reordered by the compiler or by the CPU. On ARM, the hardware allows ready = true to become visible to other cores before data = 42 is.
- The reader's loads (if (ready) return data) can speculate. On ARM, the load of data can execute before the load of ready completes (out-of-order execution).
Either reordering can cause the reader to observe ready = true but data = 0. The happens-before relationship the programmer assumed is not enforced by the hardware.
Why x86 Hides It
x86 has historically strong memory ordering:
- Stores are observed in program order. The writer's two stores can't be reordered.
- Loads can't be reordered with prior loads. The reader's loads of ready and then data are sequential.
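This is essentially Total Store Order (TSO). The one reordering TSO permits, a store followed by a later load from a different address, does not affect this flag pattern. TSO is stronger than what the C++ and Java memory models guarantee for non-synchronised code, and it constrains only the hardware: the compiler or JIT may still reorder plain accesses. Code that relies on x86 ordering can have hidden bugs.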
ARM and POWER have weaker models. Stores can become visible out of order. Loads can speculate. The hardware actually performs reorderings that the language spec permits and x86 forbids.
The Fix
Use proper synchronisation:
- Java: declare the flag volatile, or use AtomicBoolean. Volatile gives release-on-write, acquire-on-read.
- C++: use std::atomic<bool> with memory_order_release on the store and memory_order_acquire on the load.
- Go: use atomic.Bool from sync/atomic. Atomics in Go are sequentially consistent.
- Rust: use std::sync::atomic::AtomicBool with Ordering::Release and Ordering::Acquire.
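These primitives insert the right memory barriers for the target architecture and stop the compiler from reordering across them. On x86, release and acquire are often free at the hardware level (plain loads and stores already provide the ordering), though a sequentially consistent store still costs a fence or a locked instruction. On ARM, they emit explicit DMB barriers or LDAR/STLR instructions.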
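For comparison with the C++ and Rust bullets, here is a minimal Java sketch of the same release/acquire pairing. It assumes Java 9 or later, where AtomicBoolean provides setRelease and getAcquire. The class name ReleaseAcquirePublisher and the runOnce helper are hypothetical, chosen only for illustration; a plain volatile flag is the more common choice in Java.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration class: release store on the flag, acquire load on the reader side.
final class ReleaseAcquirePublisher {
    private int data;                                           // plain field, published via the flag
    private final AtomicBoolean ready = new AtomicBoolean(false);

    void publish(int value) {
        data = value;            // ordinary store
        ready.setRelease(true);  // release store: the write to data cannot move below it
    }

    int awaitAndRead() {
        while (!ready.getAcquire()) {  // acquire load: the read of data cannot move above it
            Thread.onSpinWait();       // spin-loop hint to the runtime
        }
        return data;
    }

    // Runs one writer and one reader thread; returns the value the reader observed.
    static int runOnce(int value) {
        ReleaseAcquirePublisher p = new ReleaseAcquirePublisher();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = p.awaitAndRead());
        Thread writer = new Thread(() -> p.publish(value));
        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();   // join makes seen[0] safely visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen[0];
    }
}
```

Calling ReleaseAcquirePublisher.runOnce(42) should return 42 on any architecture, because the acquire load that observes true synchronises with the release store that wrote it.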
Detection
Three approaches:
- Race detector. Go's -race, C++/Rust ThreadSanitizer (TSan). They detect concurrent unprotected access regardless of architecture. Run in CI.
- Stress testing on ARM hardware. Apple Silicon laptops, AWS Graviton EC2, Raspberry Pi. Reproduce the workload at high concurrency; bugs that pass on x86 often show up.
- Memory model tools. jcstress for Java tests memory model assumptions exhaustively. CDSChecker, RCMC for C++. Used in lock-free library development.
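The stress-testing idea can be sketched as a small hand-rolled harness. This is an illustration only, not a replacement for jcstress: the names PublishStress, Box, and countStaleReads are hypothetical, and the trial count is arbitrary. It runs many independent writer/reader pairs against the volatile-flag version and counts how often the reader saw stale data.

```java
// Hypothetical harness: repeats the publish/read handoff and counts stale observations.
final class PublishStress {
    // Publisher state under test. The flag is volatile here, so zero stale reads are expected.
    static final class Box {
        int data;
        volatile boolean ready;
    }

    // Runs `trials` writer/reader pairs and returns how many readers saw something other than 42.
    static int countStaleReads(int trials) {
        int stale = 0;
        for (int i = 0; i < trials; i++) {
            Box box = new Box();
            int[] seen = new int[1];
            Thread reader = new Thread(() -> {
                while (!box.ready) {
                    Thread.onSpinWait();   // spin until the flag is published
                }
                seen[0] = box.data;
            });
            Thread writer = new Thread(() -> {
                box.data = 42;
                box.ready = true;          // volatile write publishes data
            });
            reader.start();
            writer.start();
            joinQuietly(writer);
            joinQuietly(reader);
            if (seen[0] != 42) {
                stale++;
            }
        }
        return stale;
    }

    private static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", e);
        }
    }
}
```

Removing volatile from Box.ready turns this into a bug hunt, with two caveats: the reader may then spin forever on any architecture, and a clean run proves nothing, since the reordering window is narrow. That is the gap tools like jcstress are built to close.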
Never read or write a shared mutable variable without synchronisation: pick volatile, atomic, mutex, or channel, but pick one. Remember that x86 is forgiving, ARM is less so, and the language spec promises less than either. Code that "works on x86" may break on ARM, and code that "works on ARM" may still violate what the spec actually promises.
The pattern is: write portable code that follows the language memory model, then catch the strays with race detection in CI. Once that's in place, the hardware's ordering rules stop mattering.
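As one illustration of the mutex option, the same handoff can be written with a monitor instead of a spinning flag. This is a sketch under assumptions: LockedPublisher and runOnce are hypothetical names, and wait/notifyAll is only one of several ways to block the reader.

```java
// Hypothetical illustration class: both fields are guarded by the object's monitor.
final class LockedPublisher {
    private int data;
    private boolean ready;

    synchronized void publish(int value) {
        data = value;
        ready = true;
        notifyAll();               // wake any reader blocked in awaitAndRead
    }

    synchronized int awaitAndRead() {
        boolean interrupted = false;
        while (!ready) {           // loop guards against spurious wakeups
            try {
                wait();            // releases the monitor while blocked
            } catch (InterruptedException e) {
                interrupted = true;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();  // restore the interrupt status
        }
        return data;
    }

    // Runs one writer and one reader thread; returns the value the reader observed.
    static int runOnce(int value) {
        LockedPublisher p = new LockedPublisher();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = p.awaitAndRead());
        Thread writer = new Thread(() -> p.publish(value));
        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen[0];
    }
}
```

Releasing the monitor in publish happens-before the reader's next acquisition of it, so the reader sees both ready and data without any volatile field. The trade-off is that the reader blocks instead of spinning.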
Implementations
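The writer sets data, then sets ready = true. The reader waits until it sees ready = true, then reads data. On x86 hardware this usually works, because stores are not reordered with other stores and loads are not reordered with other loads, although the JIT compiler is still free to reorder or hoist the plain accesses. On ARM, the writer's stores can also be reordered by the hardware (ready = true might become visible before data), or the reader's load of data can be satisfied before its load of ready. Either way, the reader sees stale data.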
// BAD: data race on both fields
class Publisher {
    int data;               // not volatile
    boolean ready = false;  // not volatile

    // Thread 1
    void publish() {
        data = 42;
        ready = true;       // may become visible before data (compiler or ARM hardware)
    }

    // Thread 2
    int read() {
        while (!ready) {}   // JIT may hoist the read out of the loop: can hang, even on x86
        return data;        // may see 0 on ARM
    }
}

// GOOD: volatile flag
class PublisherFixed {
    int data;
    volatile boolean ready = false;

    void publish() {
        data = 42;
        ready = true;       // volatile write: release semantics, data is published first
    }

    int read() {
        while (!ready) {}   // volatile read: acquire semantics, re-read on every iteration
        return data;        // guaranteed to see 42
    }
}
Key points
- x86 has strong ordering: most reorderings forbidden by hardware. Bugs hidden.
- ARM/POWER have weak ordering: writes can be reordered, loads can speculate ahead.
- The bug usually looks like: writer thread sets flag, reader thread sees the flag set but old data.
- Fix: use volatile / atomic / proper synchronisation. Never plain reads of shared mutable state.
- Detect via stress testing on ARM hardware (Apple Silicon, AWS Graviton, Raspberry Pi).
Follow-up questions
- Why doesn't this bug appear on an x86 laptop?
- Are these bugs detectable in tests?
- What if the code is just for x86?
- How does this relate to volatile in Java?
Gotchas
- Testing only on x86 = bugs hide until ARM deployment
- 'Works in dev, fails in prod' when prod is on ARM
- Race detector catches issues regardless of platform; run it in CI
- C++ default memory order is memory_order_seq_cst (safe, but can be slower on weakly ordered hardware); explicit memory_order_relaxed needs justification
- Java plain reads/writes of long and double are NOT guaranteed atomic (int is); use volatile or AtomicLong