Race Conditions: Why Concurrent Bugs Happen
A race condition is when the program's correctness depends on the unpredictable interleaving of two or more threads' operations on shared state. Fix it by removing the sharing, or by making the operations atomic with a lock or atomic primitive.
What it is
A race condition is a bug whose outcome depends on the order in which threads' operations end up interleaving. Run the same input twice and the answer can differ. The program is no longer deterministic, and that's a correctness disaster.
The bug isn't "slow" or "occasionally weird." It's "sometimes returns the wrong answer, and it's not reproducible locally."
A picture of the classic race
Counter starts at 5. Two threads each run counter++. The OS interleaves the underlying load/add/store, for example:

t1  Thread A: load counter (reads 5)
t2  Thread B: load counter (reads 5)
t3  Thread A: increment in register (6)
t4  Thread B: increment in register (6)
t5  Thread A: store 6
t6  Thread B: store 6, Thread A's increment is lost

The bug is the gap between "load" and "store." Either thread can step inside that gap and corrupt the result. With a different schedule (rare on dev laptops, common at peak load), a different bad outcome surfaces.
Why it matters
Race conditions cause the worst class of production bugs:
- Tests pass. Unit tests run sequentially with predictable timing. Races need contention.
- Logs lie. The race interleaving usually doesn't get logged. Postmortem reads as "everything looks fine."
- They scale with load. A race that fires once per million ops is invisible at 100 req/sec, painful at 1M req/sec.
- They're worse on ARM than x86. Code that "works" on dev laptops can break on production servers.
The infamous Knight Capital incident: a 2012 deployment race activated a dormant order-routing code path; the algorithm bought roughly $7B of stock in 45 minutes. Knight lost $440M and never recovered as an independent firm. The race was in the deployment process, not the trading code itself.
How they happen, the three ingredients
Every race needs all three:
- Shared mutable state, a variable, struct field, map, or buffer that more than one thread can touch.
- Multiple threads, two or more execution units that can run "at the same time" (concurrent or parallel).
- Unsynchronized access, at least one writer, with no lock/atomic/channel/happens-before edge ordering the operations.
Remove ANY one and the race vanishes
- Make the state immutable → no shared mutable state.
- Confine the state to one thread → no multiple threads touching it.
- Add synchronization (lock, atomic, channel) → access is no longer unsynchronized.
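The first two removals can be sketched together in Java (class and method names here are illustrative): each thread owns a private slot that no other thread touches, and the main thread merges the partial counts only after join(), which establishes the happens-before edge that makes the writes visible.

```java
public class PerThreadCounter {
    public static long count(int threads, int incrementsPerThread) throws InterruptedException {
        long[] partial = new long[threads];          // one slot per thread, never shared
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    partial[slot]++;                 // only this thread touches this slot
                }
            });
            workers[i].start();
        }
        long total = 0;
        for (int i = 0; i < threads; i++) {
            workers[i].join();                       // happens-before: slot writes now visible
            total += partial[i];
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(count(8, 100_000));       // prints 800000, every run
    }
}
```

No lock, no atomic, and yet no race: the three ingredients never coexist, because no mutable slot is touched by more than one thread.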
Why counter++ is the textbook example
counter++ looks atomic. It isn't. The compiler emits three instructions:
- Load counter from memory into a register.
- Increment the register.
- Store the register back to memory.
Each of those three instructions is atomic on its own. The sequence of three is not. Any other thread can interleave between the load and the store: both threads load 5, both store 6. Two increments ran, but the counter went from 5 to 6, not 7. One update was lost.
This read-modify-write pattern is everywhere: incrementing counters, appending to lists, updating cache entries, updating linked-list pointers. All of them are races without synchronization.
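Cache updates are a read-modify-write (check-then-act) in disguise. A sketch in Java, with illustrative names, showing the racy shape next to the atomic one:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheExample {
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();

    // RACY even on a concurrent map: containsKey and put are two separate
    // operations, so another thread can insert between them. The map itself
    // stays intact, but the computation may run twice and overwrite a value.
    Integer getRacy(String key) {
        if (!cache.containsKey(key)) {
            cache.put(key, expensiveCompute(key));
        }
        return cache.get(key);
    }

    // SAFE: computeIfAbsent performs the check and the insert as one
    // atomic operation on the map.
    Integer getSafe(String key) {
        return cache.computeIfAbsent(key, CacheExample::expensiveCompute);
    }

    static Integer expensiveCompute(String key) {
        return key.length();   // placeholder for real work
    }
}
```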
How to fix them
The fix hierarchy
- Eliminate sharing, best. Use immutable data, copy-on-write, or per-thread state with a final merge. No race possible.
- Confine state, second best. One thread owns the data; others communicate via channels/queues. ("Share by communicating.")
- Atomic primitive, for single-variable updates: AtomicInteger, sync/atomic, or a threading.Lock-wrapped read-modify-write.
- Lock the critical section, for multi-step invariants. Cheapest mental model, costliest at runtime.
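The confinement option ("share by communicating") can be sketched as a single owner thread with a mailbox queue. This ConfinedCounter is a hypothetical illustration: value is touched by exactly one thread, and every read or write travels through the queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

public class ConfinedCounter {
    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    private long value = 0;                        // owned by ONE thread, never shared

    public ConfinedCounter() {
        Thread owner = new Thread(() -> {
            try {
                while (true) mailbox.take().run(); // owner processes messages serially
            } catch (InterruptedException ignored) { }
        });
        owner.setDaemon(true);
        owner.start();
    }

    public void inc() {                            // callable from any thread
        mailbox.add(() -> value++);                // the ++ always runs on the owner
    }

    public long get() {
        CompletableFuture<Long> reply = new CompletableFuture<>();
        mailbox.add(() -> reply.complete(value));  // reads go through the owner too
        return reply.join();
    }

    public static void main(String[] args) throws InterruptedException {
        ConfinedCounter c = new ConfinedCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 1000; j++) c.inc(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get());               // prints 4000, every run
    }
}
```

The queue provides both the ordering and the happens-before edges, so value++ needs no lock at all.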
How to find them before they bite
- Race detector: go run -race, ThreadSanitizer (clang/gcc), Java's jcstress. Slow in CI, priceless.
- Property-based tests: run the operation N times concurrently, assert invariants.
- Code review checklist: for every shared mutable field, is it volatile/atomic, locked, or confined?
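The property-based idea can be sketched as a small harness (hypothetical, not a real library): release every thread at once via a latch to maximise contention, run the operation many times, then let the caller assert the invariant.

```java
import java.util.concurrent.CountDownLatch;

public class StressTest {
    public static void run(Runnable op, int threads, int opsPerThread) throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();                 // everyone blocks here...
                } catch (InterruptedException e) {
                    return;
                }
                for (int j = 0; j < opsPerThread; j++) op.run();
            });
            workers[i].start();
        }
        start.countDown();                         // ...then all start together
        for (Thread w : workers) w.join();
    }
}
```

With an AtomicInteger::incrementAndGet the final count is exact; swap in an unsynchronized ++ and the same harness tends to expose the lost updates.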
"It works on my machine" Race conditions love x86 (strong memory model) and break on ARM (relaxed memory model). The same code can pass on Intel CI and fail on Apple Silicon laptops. Test on both.
Implementations
counter++ reads the value, increments, writes back, three instructions. Two threads can both read the same value, both increment, both write, and one update vanishes. This is the textbook race condition.
class Counter {
    int value = 0;              // shared mutable state
    void inc() { value++; }     // race
}

Counter c = new Counter();
// 1000 threads each call c.inc() 1000 times
// Expected: 1,000,000. Actual: less.

synchronized enforces that only one thread executes the method body at a time, AND establishes happens-before so the increment is visible to other threads.
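A runnable sketch of the racy version (thread and iteration counts are arbitrary; the exact deficit varies run to run):

```java
public class LostUpdateDemo {
    static int value = 0;                       // shared mutable state, unsynchronized

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) value++;   // racy read-modify-write
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // Expected 800000; on most runs the total is lower because updates are lost.
        System.out.println(value);
    }
}
```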
class Counter {
    int value = 0;
    synchronized void inc() { value++; }
}

AtomicInteger.incrementAndGet() is a single CAS-based operation, no lock, no contention storm, faster under load.
class Counter {
    AtomicInteger value = new AtomicInteger(0);
    void inc() { value.incrementAndGet(); }
}

Key points
- A race condition is a CORRECTNESS bug, not a 'maybe slow' bug, the wrong answer can be observed
- Three ingredients: shared mutable state, multiple threads, unsynchronized access
- Remove ANY one of the three and the race goes away
- Read-modify-write operations like counter++ are NEVER atomic, they're three instructions
- Compilers and CPUs reorder reads/writes, what looks sequential in source isn't
- Race conditions hide on x86 (strong memory model) and explode on ARM
Follow-up questions
▸ What's the difference between a data race and a race condition?
▸ Why does counter++ fail even with volatile?
▸ If code 'works' on x86, is there still a race?
▸ What's the simplest way to remove a race?
▸ Are read-only races okay?
Gotchas
- Tests almost always pass, races need real load and timing to manifest
- x86 hides races that ARM exposes, test on both
- Even atomic.Add doesn't help with check-then-act: if (atomic.Load() == 0) atomic.Store(1) is racy
- The Go race detector slows code 5-10x, run it in CI, not production
- Java's volatile is for visibility, not atomicity of compound ops
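The check-then-act gotcha has an atomic cure: compare-and-swap collapses the check and the act into a single operation. A minimal sketch (class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Once {
    private final AtomicInteger state = new AtomicInteger(0);

    // Racy version, do NOT do this: another thread can change state
    // between the get() and the set():
    //   if (state.get() == 0) state.set(1);

    // Atomic version: compareAndSet succeeds only if state is still 0,
    // so exactly one caller ever gets true.
    public boolean tryClaim() {
        return state.compareAndSet(0, 1);
    }
}
```

This is the idiom behind "initialize once" and "first caller wins" logic, no lock required.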
Common pitfalls
- Adding a lock around the wrong thing, protecting the read but not the write, or vice versa
- Using a different lock instance for related operations, looks locked, isn't
- Assuming int64 reads/writes are atomic, they're not on 32-bit platforms or under reordering
Practice problems
Implement the counter three ways: lock-based, atomic, and per-thread accumulation with a final merge. Compare contention behavior as the thread count grows.
APIs worth memorising
- Java: AtomicInteger, AtomicLong, LongAdder, synchronized, ReentrantLock
- Python: threading.Lock, threading.RLock, queue.Queue (avoids the race)
- Go: sync/atomic, sync.Mutex, go run -race
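Of the Java APIs above, LongAdder is the least known: it spreads increments across internal cells to reduce contention on hot write-mostly counters, then merges them when you ask for the total. A minimal usage sketch:

```java
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    public static void main(String[] args) {
        LongAdder hits = new LongAdder();
        hits.increment();                 // cheap even under heavy write contention
        hits.add(5);
        System.out.println(hits.sum());   // prints 6; sum() merges the internal cells
    }
}
```

The trade-off: sum() is not a snapshot under concurrent writes, so prefer AtomicLong when reads must be exact mid-flight.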
Race conditions cause the worst production incidents. Knight Capital's $440M loss (2012) was a deployment race. The CSRF Same-Site cookie patch in Chrome had a race that broke logins for some users. Every postmortem at scale eventually mentions a race condition.