Memory Visibility & CPU Reordering
Modern CPUs and compilers reorder reads and writes for performance. Without explicit synchronization, one thread's writes may NEVER become visible to another thread, or may appear in a different order. This is the deepest reason concurrent code breaks, and skipping it means writing code that "works on a laptop, fails in production."
The short version
Code looks like a list of instructions that run top to bottom. On real hardware, that is a polite fiction. The compiler reorders. The CPU reorders. Each CPU core has its own cache, and a write made on one core can sit there for a long time before any other core sees it. Some writes never become visible to other threads at all unless the system is told "publish this now."
That is what memory visibility is about. It's not about correctness of one thread's view of itself. It's about whether other threads see the writes, and in what order.
Skipping this leads to broken code. Tests pass. Code looks fine. Then under load on a different CPU, two flags get out of order, a counter never updates, a worker spins forever. This is the deepest reason concurrent code fails, and the hardest one to debug after the fact.
A picture of the layers
When the code runs x = 42, the value passes through a stack of layers before any other thread can possibly see it: the source line, the compiler's reordered instructions, the CPU's out-of-order execution, the store buffer, the per-core cache, and finally shared memory.
Six layers. Each one can hold the write for an indefinite time. Without a synchronization primitive forcing the write all the way through, the value is local to the writing thread.
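A classic symptom of that last point, as a minimal Java sketch (names are illustrative; the gotcha list later in this section describes this exact failure): without volatile, the JIT is free to hoist the read of the flag out of the loop, so the worker can spin forever after the writer has long since set it.

class Spinner {
    static boolean stop = false;   // the bug: not volatile

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) { }      // JIT may compile this to: if (!stop) while (true) { }
        });
        worker.start();
        Thread.sleep(100);
        stop = true;               // this write may never become visible to the worker
        worker.join();             // can hang forever
    }
}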
A concrete bug: the publication pattern
Here is the code everyone writes the first time:
(writer thread)
data = 42       // step 1
ready = true    // step 2

(reader thread)
if (ready) {    // step 3
    print(data) // step 4
}
The intent: writer fills in data, then sets ready. Reader sees ready and trusts that data is filled in.
What actually happens, with no synchronization:
The reader sees ready=true because that write happened to flush first. Then it reads data and gets 0, the old value, because the data=42 write hasn't propagated yet (or was reordered after ready=true).
This is real. It happens on ARM, POWER, RISC-V, even on x86 in some cases. x86 happens to hide most reordering bugs because of its strong memory model, which is why "works on my Intel laptop, fails on AWS Graviton" is a recurring story.
The fix is one idea
Every concurrency primitive in every language is, underneath, a way to tell the compiler and the CPU two things:
- Flush. Push my pending writes all the way through the stack before this point.
- Don't reorder. Keep these instructions in source order across this point.
That's it. Mutex lock and unlock do this. Atomic store and load do this. Volatile read and write do this. Channel send and receive do this. They all enforce the same two rules at different granularities.
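Here is what those two rules look like when spelled out explicitly, using Java's VarHandle acquire/release operations (listed under APIs below; class and field names are illustrative):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Explicit {
    int data = 0;
    boolean ready = false;

    static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(Explicit.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void writer() {
        data = 42;
        READY.setRelease(this, true);   // rule 1: publish pending writes
    }                                    // rule 2: the data write cannot sink below it

    void reader() {
        if ((boolean) READY.getAcquire(this)) {  // later reads cannot float above this
            System.out.println(data);            // sees 42
        }
    }
}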
| Primitive | What it forces |
|---|---|
| volatile write (Java) | Release: pending writes are published; earlier instructions cannot move below it |
| volatile read (Java) | Acquire: later instructions cannot move above it; sees the latest published value |
| synchronized enter / exit | Acquire (enter) and release (exit) on the monitor |
| sync/atomic.Store (Go) | Release: flushes pending writes before storing |
| sync/atomic.Load (Go) | Acquire: sees the latest published value |
| Mutex.Lock / Mutex.Unlock | Acquire on lock, release on unlock |
| Channel send / receive (Go) | Send publishes, receive sees |
| queue.Queue.put / get (Python) | Lock-based, same effect |
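Applied to the publication bug above, the first two rows of that table are the whole fix. A sketch in Java (field names follow the pseudocode):

class Publication {
    int data = 0;
    volatile boolean ready = false;   // the volatile flag is the publish point

    void writer() {
        data = 42;     // plain write: cannot move below the volatile store
        ready = true;  // volatile store: releases (publishes) the write to data
    }

    void reader() {
        if (ready) {                     // volatile load: acquires the publication
            System.out.println(data);    // guaranteed to print 42, never 0
        }
    }
}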
One way to think about it
Calling mutex.Lock() does not only say "no one else can enter this section." It also says "publish my pending writes so the next thread to lock this sees them." Same for atomic stores, channel sends, and volatile writes. The literature calls these memory barriers or fences. Plain language: a publish-then-subscribe contract for memory.
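The same contract through a lock, as a sketch with hypothetical names: the unlock publishes, the next lock subscribes.

class Handoff {
    private final Object lock = new Object();
    private int data = 0;
    private boolean ready = false;

    void writer() {
        synchronized (lock) {
            data = 42;
            ready = true;
        }   // monitor exit = release: publishes both writes
    }

    Integer reader() {
        synchronized (lock) {   // monitor enter = acquire: sees the last release
            return ready ? data : null;
        }
    }
}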
When none of this matters
A short list:
- Single-threaded code. No one else is reading; intra-thread reorderings can't hurt.
- Read-only shared state after construction. Built once, handed off via a synchronization primitive, never modified again; every reader sees the finished version (see the sketch after this list).
- Everything else matters. If two threads share a mutable value, every access needs a synchronization primitive. No exceptions.
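A sketch of the second bullet (names are illustrative): an object built once, made immutable, and handed off through a queue. The put/take pair is the synchronization primitive that publishes it.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class Config {
    final List<String> hosts;   // final, never modified after construction
    Config(List<String> hosts) { this.hosts = List.copyOf(hosts); }
}

class Boot {
    static final BlockingQueue<Config> handoff = new ArrayBlockingQueue<>(1);

    static void builder() throws InterruptedException {
        handoff.put(new Config(List.of("db1", "db2")));  // put publishes the finished object
    }

    static void worker() throws InterruptedException {
        Config cfg = handoff.take();    // take sees everything the builder wrote
        System.out.println(cfg.hosts);  // safe: read-only from here on
    }
}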
The skill that pays off the most
For every shared variable in concurrent code, ask one question: what synchronization primitive guarantees the writer's update is visible to the reader? If no answer exists, the code is broken, even if today's tests pass.
This is the question senior reviewers ask. It's also the question behind every "why is volatile needed here?" interview probe. Build the habit of asking it on every shared field, every time.
Implementations
Two threads, two flags. Each writes its flag, then reads the other. Naively, at least one thread should see the other's write, but on real hardware both reads can return 0. The CPU and compiler are free to reorder the read before the write, because in single-threaded execution it makes no difference.
class Reorder {
    int a = 0, b = 0;
    int x = 0, y = 0;

    void thread1() {
        a = 1; // write a
        x = b; // read b
    }
    void thread2() {
        b = 1; // write b
        y = a; // read a
    }
    // After both threads finish:
    // Naive expectation: x==1 OR y==1 (or both)
    // Reality: x==0 AND y==0 is OBSERVABLE on real hardware
}

The fix is to make both flags volatile. volatile writes are release operations: prior writes become visible to threads that perform an acquire read. And a volatile read can't be reordered before a preceding volatile write, so the (0,0) outcome becomes impossible.
class Fixed {
    volatile int a = 0, b = 0;
    int x = 0, y = 0;

    void thread1() {
        a = 1; // volatile write: release barrier
        x = b; // volatile read: acquire barrier
    }
    void thread2() {
        b = 1;
        y = a;
    }
}
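To hunt for the (0,0) outcome empirically, a loop harness like this can run against the Reorder class above (a hand-rolled sketch assuming both classes sit in the same file or package; OpenJDK's jcstress is the proper tool for litmus tests like this). Thread startup cost keeps the race window small, so expect a low hit rate; swapping in Fixed should keep the count at zero.

class ReorderHarness {
    public static void main(String[] args) throws InterruptedException {
        int seen = 0;
        for (int i = 0; i < 1_000_000; i++) {
            Reorder r = new Reorder();
            Thread t1 = new Thread(r::thread1);
            Thread t2 = new Thread(r::thread2);
            t1.start(); t2.start();
            t1.join();  t2.join();
            if (r.x == 0 && r.y == 0) seen++;   // the "impossible" outcome
        }
        System.out.println("x==0 && y==0 observed " + seen + " times");
    }
}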
Key points
- Skipping this lesson leads to broken concurrent code
- CPUs have per-core caches; a write on Core 1 may sit in Core 1's cache indefinitely, invisible to Core 2
- Compilers reorder instructions for optimization; CPUs reorder them for pipelining
- The "obvious sequential order" of source code is not what runs on the hardware
- x86 has a strong memory model and hides most reordering bugs; ARM and POWER expose them
- ALL modern languages have a memory model: Java's JMM, Go's, C++11's, even Python's (informally)
- Synchronization primitives (volatile, atomic, mutex, channel) insert MEMORY BARRIERS that publish pending writes and prevent reordering
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| Rely on x86 strong memory model | No code changes; no overhead | Breaks on ARM/POWER/RISC-V; compiler reordering still bites | Never. This is what 'works on my machine, fails in production' looks like. |
| volatile / atomic primitive | Cheap; non-blocking | Single variable only; no compound atomicity | One-flag handoff, status booleans, write-once references, hot counters via Atomic* |
| Mutex / lock | Protects multi-variable invariants; easy to reason about | Blocking; contention and deadlock risk | Multi-variable invariants, anything more than one field |
| Channel / Queue | Bundles data transfer with synchronization; clear ownership | Queuing overhead; can back up or deadlock | Producer-consumer, work handoff, ownership transfer |
Follow-up questions
- What's a memory barrier?
- Why does x86 hide so many reordering bugs?
- What's the difference between visibility and ordering?
- Are 64-bit reads/writes atomic in Java?
- If a Mutex is used everywhere, is reordering still a concern?
Gotchas
- Tests pass on x86 dev laptops and fail on ARM CI runners: TEST ON BOTH
- Compiler optimizations can hoist a volatile-less read out of a loop entirely → "why is my thread spinning forever?"
- long/double on 32-bit JVMs aren't atomic without volatile; torn reads are possible
- Java: writing to a field after publishing a reference to its containing object can be invisible to readers who don't synchronize
- Go: copying a sync.Mutex breaks it (go vet catches this); copying a struct with atomic fields breaks them
- Python's GIL gives bytecode-level atomicity but NOT visibility guarantees for the publication pattern
Common pitfalls
- Adding 'volatile' as a magic spell to fix unrelated bugs
- Assuming x86 strong memory model is portable
- Mixing volatile and non-volatile access to the same field
- Forgetting that volatile doesn't make compound operations atomic (see the sketch after this list)
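A sketch of that last pitfall (hypothetical names): volatile gives visibility, but counter++ is still a three-step read-modify-write, and two threads can interleave the steps.

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    volatile int broken = 0;                         // visible, but not atomic
    final AtomicInteger fixed = new AtomicInteger();

    void incrementBroken() { broken++; }             // two threads can both read 5
                                                     // and both write 6: a lost update
    void incrementFixed()  { fixed.incrementAndGet(); } // atomic read-modify-write
}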
Practice problems
Implement lazy one-time initialization three ways: double-checked locking with volatile (Java), a static inner-class holder (Java), and sync.Once (Go).
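A reference sketch of the first variant, double-checked locking, where volatile is load-bearing (without it, a reader can observe a partially constructed object):

class Lazy {
    private static volatile Lazy instance;

    static Lazy getInstance() {
        Lazy local = instance;           // single volatile read on the fast path
        if (local == null) {
            synchronized (Lazy.class) {
                local = instance;        // re-check under the lock
                if (local == null) {
                    instance = local = new Lazy();  // volatile store publishes
                }                                   // the fully constructed object
            }
        }
        return local;
    }
}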
APIs worth memorising
- Java: volatile, VarHandle.acquire/release/opaque, java.util.concurrent.atomic.*
- Go: sync/atomic.{Bool, Int64, Pointer}.{Load, Store, Swap, CompareAndSwap}
- Python: threading.Lock, threading.Event, queue.Queue (no lower-level barriers)
All of this shows up in every concurrent runtime, JIT, and OS kernel. Linux's RCU (read-copy-update) is built around explicit memory barriers. ConcurrentHashMap's resize uses careful release/acquire ordering. Bugs missed here become postmortems.