Bug Hunt: Works on x86, Fails on ARM, Why?
x86 has a strong memory model that hides reordering bugs at runtime. ARM and POWER have weaker memory models that expose them. Code missing proper synchronization can 'work' on Intel and break on Apple Silicon, AWS Graviton, or mobile. The fix is to add a memory barrier, volatile, atomic, or a lock.
The puzzle
A Java service has run flawlessly for two years on x86 EC2 instances. Cost-optimization quarter rolls around; the team migrates to AWS Graviton (ARM). Within a week, on-call gets reports of NullPointerException deep inside a service that never saw one before. The stack trace points at getConfig(), a method whose null return path "should be impossible."
What changed?
The TL;DR before reading the code: x86 has a strong memory model; most reordering is forbidden by the hardware. ARM has a weak model; reordering is allowed unless explicit barriers are used. Code missing proper synchronization can hide on x86 for years and fire immediately on ARM. The bug was always there. The architecture was hiding it.
What to look for
Read the broken code in the language tab. The "obvious" execution order:
config = expensiveLoad()
ready = true
The reader expects: if ready is true, config was set first. That expectation is wrong on weak memory models. The compiler, the JIT, and the CPU are all allowed to reorder these two writes (single-threaded behavior is identical either way), so a reader on another core can observe ready == true while config is still null.
The hardware reality
Each CPU core has its own L1/L2 cache and store buffer. A write on Core 1 sits in Core 1's store buffer until cache coherence propagates it. On x86, the architecture forces certain ordering guarantees as the writes drain. On ARM, the writes can become visible to Core 2 in either order. There is no single global "memory"; there are per-core views, kept loosely consistent.
The fix is a memory barrier
Every "fix" in the tabs is the same idea: insert a release barrier on the write side and an acquire barrier on the read side. The barriers force the CPU to drain its buffers, prevent the compiler from reordering across them, and ensure the reader sees a coherent view.
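In Java, that release/acquire pairing can also be spelled out explicitly with VarHandle (Java 9+). A minimal sketch, with illustrative class and field names (not from the service above):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Holder {
    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup().findVarHandle(Holder.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private String config; // plain field: ordered by the barriers on `ready`
    private boolean ready;

    void publish(String c) {
        this.config = c;
        READY.setRelease(this, true);  // release: the config write is ordered before this
    }

    String get() {
        if (!(boolean) READY.getAcquire(this)) return null; // acquire: pairs with setRelease
        return config; // visible once ready is observed true
    }
}
```

setRelease/getAcquire are weaker (and potentially cheaper) than full volatile semantics, but they are exactly the release-on-write / acquire-on-read pairing the table describes.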
| Mechanism | Release-on-write | Acquire-on-read |
|---|---|---|
| Java volatile | ✅ | ✅ |
| Java synchronized | on monitor exit | on monitor enter |
| Java AtomicReference.set/get | ✅ | ✅ |
| Go sync/atomic Store/Load | ✅ | ✅ |
| Go Mutex.Unlock/Lock | ✅ | ✅ |
| Go channel send/receive | ✅ | ✅ |
| Python threading.Event.set/wait | ✅ | ✅ |
| Python Lock.release/acquire | ✅ | ✅ |
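As a sketch of the AtomicReference row, here is a safe-publication variant; the ConfigHolder class and the endpoint field are illustrative, not from the original service:

```java
import java.util.concurrent.atomic.AtomicReference;

class Config {
    final String endpoint;
    Config(String endpoint) { this.endpoint = endpoint; }
}

class ConfigHolder {
    private final AtomicReference<Config> ref = new AtomicReference<>();

    void publish(Config c) {
        ref.set(c);       // volatile-store semantics: release on write
    }

    Config get() {
        return ref.get(); // volatile-load semantics: acquire on read
    }
}
```

Because the reference itself carries the barrier, there is no separate ready flag to get out of sync: a non-null result implies a fully constructed Config.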
What does NOT work
- "It's just one variable; the read is atomic": atomicity ≠ ordering. An atomic read can still be reordered relative to other reads and writes.
- "x86 has a strong memory model": true at the hardware level, but the compiler and JIT can still reorder.
- "The GIL serializes everything": it serializes bytecode execution, not the ordering your code assumes between bytecodes.
- "It worked in testing for hours": race windows are unpredictable; absence of failure is not proof of correctness.
How to find these across a codebase
The diagnostic
- Run on ARM (Apple Silicon laptop, AWS Graviton, Azure ARM VMs).
- Use the language's race detector: go run -race, ThreadSanitizer for C/C++, jcstress for JMM verification.
- Code review: every shared mutable variable accessed by more than one thread must be protected by a barrier, volatile, atomic, lock, or channel.
The skill that pays off: when reviewing concurrent code, ask for each shared variable, "what's the happens-before edge from the writer to the reader?" If the answer is "I assume it's atomic" or "x86 handles it," the code is broken.
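Each row of the mechanism table names one such edge. For synchronized, the edge is monitor release in the writer → monitor acquire in the reader, on the same lock. A hedged sketch (LockedConfig and its lock are illustrative names):

```java
class LockedConfig {
    private final Object lock = new Object();
    private String config; // guarded by `lock`

    void publish(String c) {
        synchronized (lock) {
            config = c;
        }                    // monitor exit: release — writes inside are published
    }

    String get() {
        synchronized (lock) { // monitor enter: acquire — sees writes before any prior release
            return config;
        }
    }
}
```

The answer to "what's the happens-before edge?" here is concrete and auditable: publish()'s unlock happens-before get()'s lock.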
Implementations
A worker thread builds an expensive Config object, then sets a ready flag. Another thread polls ready and uses config once true. On x86, this works (almost always). On ARM, it can read config as null even when ready is true. Why?
```java
class ConfigService {
    private Config config;
    private boolean ready; // ← spot the bug

    void publish() {
        this.config = loadExpensiveConfig();
        this.ready = true; // signal: config is ready
    }

    Config get() {
        if (!ready) return null;
        return config; // may be null on ARM!
    }
}
```

The bug: without volatile (or other synchronization), the JIT and CPU can reorder the writes to config and ready. On weak memory models (ARM), the reader's view of these writes can arrive in either order: ready may become true before config is visible. The fix: making ready volatile inserts a release barrier on the write and an acquire barrier on the read. The barrier prevents reordering and guarantees that earlier writes are visible to the reader. Now the reader is guaranteed to see config != null whenever ready == true.
```java
class ConfigService {
    private Config config;
    private volatile boolean ready; // volatile: release/acquire

    void publish() {
        this.config = loadExpensiveConfig();
        this.ready = true; // release: prior writes visible before this one
    }

    Config get() {
        if (!ready) return null; // acquire: sees writes before the release
        return config; // guaranteed non-null
    }
}
```

Alternative: make config itself volatile (write-once reference):

```java
class ConfigService {
    private volatile Config config;
    void publish() { config = loadExpensiveConfig(); }
    Config get() { return config; }
}
```

Key points
- x86 uses TSO (Total Store Order): most reads and writes can't be reordered with each other
- ARM, POWER, and RISC-V allow more reordering; the same unsynchronized code fails far more often
- Compilers reorder too: even on x86, missing synchronization can fire after a JIT optimization
- Memory barriers (volatile, atomic store/load, lock acquire/release) prevent reordering
Follow-up questions
- Why does x86 hide so many reordering bugs?
- Does this affect compiled languages only?
- Are 64-bit reads/writes atomic in Java?
- Should I just make everything volatile / atomic?
Gotchas
- Tests pass on x86 dev laptops but fail on ARM CI runners: test on both.
- The JIT can hoist a non-volatile read out of a loop entirely → "why is my thread spinning forever?"
- Java: long/double on 32-bit JVMs aren't atomic without volatile; torn reads are possible.
- Python: free-threaded CPython (3.13+) loosens visibility guarantees that GIL-based code accidentally relied on.
- Go: copying a struct that contains atomic fields silently breaks them.
- x86 hides write-write reordering; ARM does not.
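The spinning-forever gotcha is the easiest one to see in a sketch: with a volatile flag the loop terminates promptly, while without volatile the JIT may hoist the read and the spinner can hang. A minimal sketch of the correct version (SpinDemo and runAndJoin are illustrative names):

```java
class SpinDemo {
    // volatile: the JIT must re-read this flag on every loop iteration
    private static volatile boolean stop = false;

    static boolean runAndJoin() throws InterruptedException {
        Thread spinner = new Thread(() -> {
            while (!stop) {
                Thread.onSpinWait(); // busy-wait hint (Java 9+)
            }
        });
        spinner.start();
        Thread.sleep(50);    // let the spinner enter the loop
        stop = true;         // volatile write: guaranteed to become visible
        spinner.join(5_000); // generous timeout; in practice joins almost immediately
        return !spinner.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAndJoin() ? "terminated" : "HUNG");
    }
}
```

Remove the volatile keyword and this program may never terminate, depending on JIT decisions, which is exactly the x86-vs-ARM lottery this post describes.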
Cloudflare migrated services to ARM-based AWS Graviton and discovered latent memory model bugs that had hidden on x86 for years. Apple Silicon adoption surfaced similar bugs in Mac dev environments. Every large-scale ARM migration uncovers code that was 'working' only because of x86's strong model.