Bug Hunt: Works on x86, Fails on ARM, Why?
x86 has a strong memory model that hides reordering bugs at runtime. ARM and POWER have weaker memory models that expose them. Code missing proper synchronization can 'work' on Intel and break on Apple Silicon, AWS Graviton, or mobile. The fix is to add a memory barrier, volatile, atomic, or a lock.
The puzzle
A Java service has run flawlessly for two years on x86 EC2 instances. Cost-optimization quarter rolls around; the team migrates to AWS Graviton (ARM). Within a week, on-call gets reports of NullPointerException deep inside a service that never saw one before. The stack trace points at getConfig(), a method whose null return path "should be impossible."
What changed?
The TL;DR before reading the code: x86 has a strong memory model; most reordering is forbidden by the hardware. ARM has a weak model; reordering is allowed unless explicit barriers are used. Code missing proper synchronization can hide on x86 for years and fire immediately on ARM. The bug was always there. The architecture was hiding it.
What to look for
Read the broken code in the language tab. The "obvious" execution order:
config = expensiveLoad()
ready = true
The reader expects: if ready is true, config was set first. That expectation is wrong on weak memory models. The compiler, the JIT, and the CPU are all allowed to reorder these two writes (single-threaded behavior is identical either way), so a reader on another core can observe ready == true while config is still null.
The hardware reality
Each CPU core has its own L1/L2 cache and store buffer. A write on Core 1 sits in Core 1's store buffer until cache coherence propagates it. On x86, the architecture forces certain ordering guarantees as the writes drain. On ARM, the writes can become visible to Core 2 in either order. There is no single global "memory"; there are per-core views, kept loosely consistent.
The fix is a memory barrier
Every "fix" in the tabs is the same idea: insert a release barrier on the write side and an acquire barrier on the read side. The barriers force the CPU to drain its buffers, prevent the compiler from reordering across them, and ensure the reader sees a coherent view.
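In Java, that release/acquire pairing can also be spelled out explicitly with VarHandle (Java 9+). A minimal sketch, with illustrative class and field names (not from the service above):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Holder {
    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup().findVarHandle(Holder.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private String config; // plain field: ordered by the barriers on `ready`
    private boolean ready;

    void publish(String c) {
        this.config = c;
        READY.setRelease(this, true);  // release: the config write is ordered before this
    }

    String get() {
        if (!(boolean) READY.getAcquire(this)) return null; // acquire: pairs with setRelease
        return config; // visible once ready is observed true
    }
}
```

setRelease/getAcquire are weaker (and potentially cheaper) than full volatile semantics, but they are exactly the release-on-write / acquire-on-read pairing the table describes.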
| Mechanism | Release-on-write | Acquire-on-read |
|---|---|---|
| Java volatile | ✅ | ✅ |
| Java synchronized | on monitor exit | on monitor enter |
| Java AtomicReference.set/get | ✅ | ✅ |
| Go sync/atomic Store/Load | ✅ | ✅ |
| Go Mutex.Unlock/Lock | ✅ | ✅ |
| Go channel send/receive | ✅ | ✅ |
| Python threading.Event.set/wait | ✅ | ✅ |
| Python Lock.release/acquire | ✅ | ✅ |
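As a sketch of the AtomicReference row, here is a safe-publication variant; the ConfigHolder class and the endpoint field are illustrative, not from the original service:

```java
import java.util.concurrent.atomic.AtomicReference;

class Config {
    final String endpoint;
    Config(String endpoint) { this.endpoint = endpoint; }
}

class ConfigHolder {
    private final AtomicReference<Config> ref = new AtomicReference<>();

    void publish(Config c) {
        ref.set(c);       // volatile-store semantics: release on write
    }

    Config get() {
        return ref.get(); // volatile-load semantics: acquire on read
    }
}
```

Because the reference itself carries the barrier, there is no separate ready flag to get out of sync: a non-null result implies a fully constructed Config.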
What does NOT work
- "It's just one variable; the read is atomic": atomicity ≠ ordering. An atomic read can still be reordered relative to other reads and writes.
- "x86 has a strong memory model": true at the hardware level, but the compiler and JIT can still reorder.
- "The GIL serializes everything": it serializes bytecode execution, not the ordering your code assumes between bytecodes.
- "It worked in testing for hours": race windows are unpredictable; absence of failure is not proof of correctness.
How to find these across a codebase
The diagnostic
- Run on ARM (Apple Silicon laptop, AWS Graviton, Azure ARM VMs).
- Use the language's race detector: go run -race, ThreadSanitizer for C/C++, jcstress for JMM verification.
- Code review: every shared mutable variable accessed by more than one thread must be protected by a barrier, volatile, atomic, lock, or channel.
The skill that pays off: when reviewing concurrent code, ask for each shared variable, "what's the happens-before edge from the writer to the reader?" If the answer is "I assume it's atomic" or "x86 handles it," the code is broken.
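Each row of the mechanism table names one such edge. For synchronized, the edge is monitor release in the writer → monitor acquire in the reader, on the same lock. A hedged sketch (LockedConfig and its lock are illustrative names):

```java
class LockedConfig {
    private final Object lock = new Object();
    private String config; // guarded by `lock`

    void publish(String c) {
        synchronized (lock) {
            config = c;
        }                    // monitor exit: release — writes inside are published
    }

    String get() {
        synchronized (lock) { // monitor enter: acquire — sees writes before any prior release
            return config;
        }
    }
}
```

The answer to "what's the happens-before edge?" here is concrete and auditable: publish()'s unlock happens-before get()'s lock.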
Implementations
A worker thread builds an expensive Config object, then sets a ready flag. Another thread polls ready and uses config once true. On x86, this works (almost always). On ARM, it can read config as null even when ready is true. Why?
```java
class ConfigService {
    private Config config;
    private boolean ready; // ← spot the bug

    void publish() {
        this.config = loadExpensiveConfig();
        this.ready = true; // signal: config is ready
    }

    Config get() {
        if (!ready) return null;
        return config; // may be null on ARM!
    }
}
```

The bug: without volatile (or other synchronization), the JIT and CPU can reorder the writes to config and ready. On weak memory models (ARM), the reader's view of these writes can arrive in either order: ready may become true before config is visible. The fix: making ready volatile inserts a release barrier on the write and an acquire barrier on the read. The barrier prevents reordering and guarantees that earlier writes are visible to the reader. Now the reader is guaranteed to see config != null whenever ready == true.
```java
class ConfigService {
    private Config config;
    private volatile boolean ready; // volatile: release/acquire

    void publish() {
        this.config = loadExpensiveConfig();
        this.ready = true; // release: prior writes visible before this one
    }

    Config get() {
        if (!ready) return null; // acquire: sees writes before the release
        return config; // guaranteed non-null
    }
}
```

Alternative: make config itself volatile (write-once reference):

```java
class ConfigService {
    private volatile Config config;
    void publish() { config = loadExpensiveConfig(); }
    Config get() { return config; }
}
```

Key points
- x86 uses TSO (Total Store Order): most reads and writes can't be reordered with each other
- ARM, POWER, and RISC-V allow more reordering; the same unsynchronized code fails far more often
- Compilers reorder too: even on x86, missing synchronization can fire after a JIT optimization
- Memory barriers (volatile, atomic store/load, lock acquire/release) prevent reordering
Follow-up questions
- Why does x86 hide so many reordering bugs?
- Does this affect compiled languages only?
- Are 64-bit reads/writes atomic in Java?
- Should I just make everything volatile / atomic?
Gotchas
- Tests pass on x86 dev laptops but fail on ARM CI runners: test on both.
- The JIT can hoist a non-volatile read out of a loop entirely → "why is my thread spinning forever?"
- Java: long/double on 32-bit JVMs aren't atomic without volatile; torn reads are possible.
- Python: free-threaded CPython (3.13+) loosens visibility guarantees that GIL-based code accidentally relied on.
- Go: copying a struct that contains atomic fields silently breaks them.
- x86 hides write-write reordering; ARM does not.
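The spinning-forever gotcha is the easiest one to see in a sketch: with a volatile flag the loop terminates promptly, while without volatile the JIT may hoist the read and the spinner can hang. A minimal sketch of the correct version (SpinDemo and runAndJoin are illustrative names):

```java
class SpinDemo {
    // volatile: the JIT must re-read this flag on every loop iteration
    private static volatile boolean stop = false;

    static boolean runAndJoin() throws InterruptedException {
        Thread spinner = new Thread(() -> {
            while (!stop) {
                Thread.onSpinWait(); // busy-wait hint (Java 9+)
            }
        });
        spinner.start();
        Thread.sleep(50);    // let the spinner enter the loop
        stop = true;         // volatile write: guaranteed to become visible
        spinner.join(5_000); // generous timeout; in practice joins almost immediately
        return !spinner.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAndJoin() ? "terminated" : "HUNG");
    }
}
```

Remove the volatile keyword and this program may never terminate, depending on JIT decisions, which is exactly the x86-vs-ARM lottery this post describes.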
Cloudflare migrated services to ARM-based AWS Graviton and discovered latent memory model bugs that had hidden on x86 for years. Apple Silicon adoption surfaced similar bugs in Mac dev environments. Every large-scale ARM migration uncovers code that was 'working' only because of x86's strong model.