Memory Visibility & CPU Reordering
Modern CPUs and compilers reorder reads and writes for performance. Without explicit synchronization, one thread's writes may NEVER become visible to another thread, or may appear in a different order. This is the deepest reason concurrent code breaks, and skipping it means writing code that "works on a laptop, fails in production."
The short version
Code looks like a list of instructions that run top to bottom. On real hardware, that is a polite fiction. The compiler reorders. The CPU reorders. Each CPU core has its own cache, and a write made on one core can sit there for a long time before any other core sees it. Some writes never become visible to other threads at all unless the system is told "publish this now."
That is what memory visibility is about. It's not about correctness of one thread's view of itself. It's about whether other threads see the writes, and in what order.
Skipping this leads to broken code. Tests pass. Code looks fine. Then under load on a different CPU, two flags get out of order, a counter never updates, a worker spins forever. This is the deepest reason concurrent code fails, and the hardest one to debug after the fact.
A picture of the layers
When the code runs x = 42, the value passes through a stack of layers before any other thread can possibly see it: the source line, the compiler's reordered instructions, the CPU's out-of-order execution, the store buffer, the per-core cache, and finally shared memory.
Six layers. Each one can hold the write for an indefinite time. Without a synchronization primitive forcing the write all the way through, the value is local to the writing thread.
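A classic symptom of that last point, as a minimal Java sketch (names are illustrative; the gotcha list later in this section describes this exact failure): without volatile, the JIT is free to hoist the read of the flag out of the loop, so the worker can spin forever after the writer has long since set it.

class Spinner {
    static boolean stop = false;   // the bug: not volatile

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) { }      // JIT may compile this to: if (!stop) while (true) { }
        });
        worker.start();
        Thread.sleep(100);
        stop = true;               // this write may never become visible to the worker
        worker.join();             // can hang forever
    }
}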
A concrete bug: the publication pattern
Here is the code everyone writes the first time:
(writer thread)
data = 42       // step 1
ready = true    // step 2

(reader thread)
if (ready) {    // step 3
    print(data) // step 4
}
The intent: writer fills in data, then sets ready. Reader sees ready and trusts that data is filled in.
What actually happens, with no synchronization:
The reader sees ready=true because that write happened to flush first. Then it reads data and gets 0, the old value, because the data=42 write hasn't propagated yet (or was reordered after ready=true).
This is real. It happens on ARM, POWER, RISC-V, even on x86 in some cases. x86 happens to hide most reordering bugs because of its strong memory model, which is why "works on my Intel laptop, fails on AWS Graviton" is a recurring story.
The fix is one idea
Every concurrency primitive in every language is, underneath, a way to tell the compiler and the CPU two things:
- Flush. Push my pending writes all the way through the stack before this point.
- Don't reorder. Keep these instructions in source order across this point.
That's it. Mutex lock and unlock do this. Atomic store and load do this. Volatile read and write do this. Channel send and receive do this. They all enforce the same two rules at different granularities.
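Here is what those two rules look like when spelled out explicitly, using Java's VarHandle acquire/release operations (listed under APIs below; class and field names are illustrative):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Explicit {
    int data = 0;
    boolean ready = false;

    static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(Explicit.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void writer() {
        data = 42;
        READY.setRelease(this, true);   // rule 1: publish pending writes
    }                                    // rule 2: the data write cannot sink below it

    void reader() {
        if ((boolean) READY.getAcquire(this)) {  // later reads cannot float above this
            System.out.println(data);            // sees 42
        }
    }
}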
| Primitive | What it forces |
|---|---|
| volatile write (Java) | Release: pending writes are published; earlier instructions cannot move below it |
| volatile read (Java) | Acquire: later instructions cannot move above it; sees the latest published value |
| synchronized enter / exit | Acquire (enter) and release (exit) on the monitor |
| sync/atomic.Store (Go) | Release: flushes pending writes before storing |
| sync/atomic.Load (Go) | Acquire: sees the latest published value |
| Mutex.Lock / Mutex.Unlock | Acquire on lock, release on unlock |
| Channel send / receive (Go) | Send publishes, receive sees |
| queue.Queue.put / get (Python) | Lock-based, same effect |
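Applied to the publication bug above, the first two rows of that table are the whole fix. A sketch in Java (field names follow the pseudocode):

class Publication {
    int data = 0;
    volatile boolean ready = false;   // the volatile flag is the publish point

    void writer() {
        data = 42;     // plain write: cannot move below the volatile store
        ready = true;  // volatile store: releases (publishes) the write to data
    }

    void reader() {
        if (ready) {                     // volatile load: acquires the publication
            System.out.println(data);    // guaranteed to print 42, never 0
        }
    }
}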
One way to think about it
Calling mutex.Lock() does not only say "no one else can enter this section." It also says "publish my pending writes so the next thread to lock this sees them." Same for atomic stores, channel sends, and volatile writes. The literature calls these memory barriers or fences. Plain language: a publish-then-subscribe contract for memory.
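The same contract through a lock, as a sketch with hypothetical names: the unlock publishes, the next lock subscribes.

class Handoff {
    private final Object lock = new Object();
    private int data = 0;
    private boolean ready = false;

    void writer() {
        synchronized (lock) {
            data = 42;
            ready = true;
        }   // monitor exit = release: publishes both writes
    }

    Integer reader() {
        synchronized (lock) {   // monitor enter = acquire: sees the last release
            return ready ? data : null;
        }
    }
}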
When none of this matters
A short list:
- Single-threaded code. No one else is reading; intra-thread reorderings can't hurt.
- Read-only shared state after construction. Built once, handed off via a synchronization primitive, never modified again; every reader sees the finished version (see the sketch after this list).
- Everything else matters. If two threads share a mutable value, every access needs a synchronization primitive. No exceptions.
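A sketch of the second bullet (names are illustrative): an object built once, made immutable, and handed off through a queue. The put/take pair is the synchronization primitive that publishes it.

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class Config {
    final List<String> hosts;   // final, never modified after construction
    Config(List<String> hosts) { this.hosts = List.copyOf(hosts); }
}

class Boot {
    static final BlockingQueue<Config> handoff = new ArrayBlockingQueue<>(1);

    static void builder() throws InterruptedException {
        handoff.put(new Config(List.of("db1", "db2")));  // put publishes the finished object
    }

    static void worker() throws InterruptedException {
        Config cfg = handoff.take();    // take sees everything the builder wrote
        System.out.println(cfg.hosts);  // safe: read-only from here on
    }
}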
The skill that pays off the most
For every shared variable in concurrent code, ask one question: what synchronization primitive guarantees the writer's update is visible to the reader? If no answer exists, the code is broken, even if today's tests pass.
This is the question senior reviewers ask. It's also the question behind every "why is volatile needed here?" interview probe. Build the habit of asking it on every shared field, every time.
Implementations
Two threads, two flags. Each writes its flag, then reads the other. Naively, at least one thread should see the other's write, but on real hardware both reads can return 0. The CPU and compiler are free to reorder the read before the write, because in single-threaded execution it makes no difference.
class Reorder {
    int a = 0, b = 0;
    int x = 0, y = 0;

    void thread1() {
        a = 1; // write a
        x = b; // read b
    }
    void thread2() {
        b = 1; // write b
        y = a; // read a
    }
    // After both threads finish:
    // Naive expectation: x==1 OR y==1 (or both)
    // Reality: x==0 AND y==0 is OBSERVABLE on real hardware
}

The fix is to make both flags volatile. volatile writes are release operations: prior writes become visible to threads that perform an acquire read. And a volatile read can't be reordered before a preceding volatile write, so the (0,0) outcome becomes impossible.
class Fixed {
    volatile int a = 0, b = 0;
    int x = 0, y = 0;

    void thread1() {
        a = 1; // volatile write: release barrier
        x = b; // volatile read: acquire barrier
    }
    void thread2() {
        b = 1;
        y = a;
    }
}
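To hunt for the (0,0) outcome empirically, a loop harness like this can run against the Reorder class above (a hand-rolled sketch assuming both classes sit in the same file or package; OpenJDK's jcstress is the proper tool for litmus tests like this). Thread startup cost keeps the race window small, so expect a low hit rate; swapping in Fixed should keep the count at zero.

class ReorderHarness {
    public static void main(String[] args) throws InterruptedException {
        int seen = 0;
        for (int i = 0; i < 1_000_000; i++) {
            Reorder r = new Reorder();
            Thread t1 = new Thread(r::thread1);
            Thread t2 = new Thread(r::thread2);
            t1.start(); t2.start();
            t1.join();  t2.join();
            if (r.x == 0 && r.y == 0) seen++;   // the "impossible" outcome
        }
        System.out.println("x==0 && y==0 observed " + seen + " times");
    }
}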
Key points
- Skipping this lesson leads to broken concurrent code
- CPUs have per-core caches; a write on Core 1 may sit in Core 1's cache indefinitely, invisible to Core 2
- Compilers reorder instructions for optimization; CPUs reorder them for pipelining
- The "obvious sequential order" of source code is not what runs on the hardware
- x86 has a strong memory model and hides most reordering bugs; ARM and POWER expose them
- ALL modern languages have a memory model: Java's JMM, Go's, C++11's, even Python's (informally)
- Synchronization primitives (volatile, atomic, mutex, channel) insert MEMORY BARRIERS that publish pending writes and prevent reordering
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| Rely on x86 strong memory model | No code changes; no overhead | Breaks on ARM/POWER/RISC-V; compiler reordering still bites | Never. This is what 'works on my machine, fails in production' looks like. |
| volatile / atomic primitive | Cheap; non-blocking | Single variable only; no compound atomicity | One-flag handoff, status booleans, write-once references, hot counters via Atomic* |
| Mutex / lock | Protects multi-variable invariants; easy to reason about | Blocking; contention and deadlock risk | Multi-variable invariants, anything more than one field |
| Channel / Queue | Bundles data transfer with synchronization; clear ownership | Queuing overhead; can back up or deadlock | Producer-consumer, work handoff, ownership transfer |
Follow-up questions
- What's a memory barrier?
- Why does x86 hide so many reordering bugs?
- What's the difference between visibility and ordering?
- Are 64-bit reads/writes atomic in Java?
- If a Mutex is used everywhere, is reordering still a concern?
Gotchas
- Tests pass on x86 dev laptops and fail on ARM CI runners: TEST ON BOTH
- Compiler optimizations can hoist a volatile-less read out of a loop entirely → "why is my thread spinning forever?"
- long/double on 32-bit JVMs aren't atomic without volatile; torn reads are possible
- Java: writing to a field after publishing a reference to its containing object can be invisible to readers who don't synchronize
- Go: copying a sync.Mutex breaks it (go vet catches this); copying a struct with atomic fields breaks them
- Python's GIL gives bytecode-level atomicity but NOT visibility guarantees for the publication pattern
Common pitfalls
- Adding 'volatile' as a magic spell to fix unrelated bugs
- Assuming x86 strong memory model is portable
- Mixing volatile and non-volatile access to the same field
- Forgetting that volatile doesn't make compound operations atomic (see the sketch after this list)
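A sketch of that last pitfall (hypothetical names): volatile gives visibility, but counter++ is still a three-step read-modify-write, and two threads can interleave the steps.

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    volatile int broken = 0;                         // visible, but not atomic
    final AtomicInteger fixed = new AtomicInteger();

    void incrementBroken() { broken++; }             // two threads can both read 5
                                                     // and both write 6: a lost update
    void incrementFixed()  { fixed.incrementAndGet(); } // atomic read-modify-write
}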
Practice problems
Implement lazy one-time initialization three ways: double-checked locking with volatile (Java), a static inner-class holder (Java), and sync.Once (Go).
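A reference sketch of the first variant, double-checked locking, where volatile is load-bearing (without it, a reader can observe a partially constructed object):

class Lazy {
    private static volatile Lazy instance;

    static Lazy getInstance() {
        Lazy local = instance;           // single volatile read on the fast path
        if (local == null) {
            synchronized (Lazy.class) {
                local = instance;        // re-check under the lock
                if (local == null) {
                    instance = local = new Lazy();  // volatile store publishes
                }                                   // the fully constructed object
            }
        }
        return local;
    }
}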
APIs worth memorising
- Java: volatile, VarHandle.acquire/release/opaque, java.util.concurrent.atomic.*
- Go: sync/atomic.{Bool, Int64, Pointer}.{Load, Store, Swap, CompareAndSwap}
- Python: threading.Lock, threading.Event, queue.Queue (no lower-level barriers)
All of this shows up in every concurrent runtime, JIT, and OS kernel. Linux's RCU (read-copy-update) is built around explicit memory barriers. ConcurrentHashMap's resize uses careful release/acquire ordering. Bugs missed here become postmortems.