False Sharing
Two unrelated variables can sit on the same CPU cache line. When two threads write to them concurrently, the cache line ping-pongs between cores even though the variables are not actually shared. The code is correct but slow. Fix by padding to one variable per cache line.
The one-line definition
Two CPU cores writing to different variables that happen to live in the same 64-byte chunk of memory end up serialising on hardware, even though their writes don't overlap logically. The cores think they're working independently. The cache says otherwise.
Why it happens
CPUs don't read or write 8 bytes at a time. They work in 64-byte chunks called cache lines. When any byte in a line is written, the cache reloads or invalidates the whole line. Most of the time that's a feature: programs tend to use nearby data together (one memory fetch loads a useful neighbourhood of bytes).
It becomes a bug when two unrelated variables land on the same line and two cores write to them at the same time.
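A quick sanity check on the arithmetic: two addresses share a line exactly when they fall in the same 64-byte-aligned block. A minimal sketch, with invented offsets for illustration:

static long lineOf(long address) {
    return address / 64;   // line index; equivalently address >> 6
}

// Suppose counterA sits at base+16 and counterB at base+24 (made-up offsets).
// If base is 64-byte aligned, lineOf(base + 16) == lineOf(base + 24):
// both fields map to the same line, so they share fate.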
Memory layout (one 64-byte cache line):
+----------+----------+------------------------+
| counterA | counterB |         unused         |
|  (8 B)   |  (8 B)   |         (48 B)         |
+----------+----------+------------------------+
     ^          ^
     |          |
  Core 0     Core 1
  writes     writes
   here       here
Logically: independent fields. Two different variables.
Hardware view: same cache line. They share fate.
What the hardware does next is the bug:
Step 1. Core 0 writes counterA.
Core 0's cache line is now MODIFIED.
The coherence protocol invalidates the line on Core 1.
Step 2. Core 1 wants to write counterB.
Its cache line was just invalidated.
It has to fetch the line back from Core 0 (or from L3).
Core 0's line goes from MODIFIED to SHARED or INVALID.
Step 3. Core 1 writes counterB.
Core 1's cache line is now MODIFIED.
The coherence protocol invalidates the line on Core 0.
Step 4. Core 0 wants to write counterA again. Goes back to Step 1.
The 64-byte cache line ping-pongs between cores forever.
Throughput drops 5-10x. Both cores spend most of their time
waiting for the line, not doing work.
Logically the writes don't conflict. Physically the cache line they share enforces serialisation. That's false sharing: the contention is fake (the variables are independent), but the slowdown is real.
Where it bites
Most concurrent code that suffers from false sharing looks fine. Two counters in a struct. Two pool slots in an array. Two queue heads side by side. The bug shows up only as "this should scale linearly with cores but doesn't." Adding cores can make things slower, because more cores means more parties fighting over the same cache line.
How to spot it
If a benchmark gets slower going from 1 core to many, and the code is not obviously sharing data, suspect false sharing. Run with padding around the hot fields and re-measure. If padding fixes it, that was the cause.
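One way to confirm the diagnosis in Java is a grouped JMH benchmark with one thread hammering each field. JMH here is an assumption about your tooling, and the class and field names are illustrative:

import org.openjdk.jmh.annotations.*;

@State(Scope.Group)
public class FalseSharingBench {
    public volatile long a;   // no padding: a and b share a cache line
    public volatile long b;

    @Benchmark
    @Group("sameLine")
    public void writeA() { a++; }

    @Benchmark
    @Group("sameLine")
    public void writeB() { b++; }
}

// Run it once as-is, then again with the fields padded (or @Contended plus
// -XX:-RestrictContended). If the padded run is several times faster,
// false sharing was the culprit.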
Where it shows up
The classic places: per-thread counters in profiling code, the head and tail pointers of a queue, the state field of a lock, statistics arrays where each thread writes to a different index. Any array of small writeable structs is a false-sharing candidate.
The JDK and Go runtime fix this internally. LongAdder pads its cells. ForkJoinPool pads its work queues. Go's sync.Pool pads its per-processor slots, and the runtime pads several of its own per-CPU structures. Library code that handles its own concurrency usually has this figured out. Application code that builds fresh primitives often does not.
How to fix it
Three options. Pad the variable to fill a full line. Place each variable in its own struct or object. Or use a primitive that already pads.
In Java, @Contended on a field tells the JVM to pad it. The annotation is internal and restricted by default; for benchmarks, run with -XX:-RestrictContended. For production, use LongAdder/DoubleAdder for counters, or wrap each hot field in its own object.
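For the counter case, a minimal sketch of the LongAdder route; the surrounding class and method names are illustrative:

import java.util.concurrent.atomic.LongAdder;

class RequestStats {
    // LongAdder spreads increments across padded cells, so many threads
    // can update it without fighting over a single cache line.
    private final LongAdder hits = new LongAdder();

    void recordHit()  { hits.increment(); }
    long totalHits()  { return hits.sum(); }   // read sum() off the hot path
}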
In Go, declare the field inside a struct and follow it with 56 bytes of explicit padding (for example a _ [56]byte field). Go preserves struct field order, so the padding stays where you put it.
In C and C++, use alignas(64) on the variable, or alignas(std::hardware_destructive_interference_size) (C++17), so each hot variable starts on its own cache line.
When to care
Most application code never needs to worry about cache lines. The threshold to start caring: high-frequency writes to multiple "independent" fields by multiple threads, where benchmarks show worse-than-linear scaling. Padding code that doesn't have that profile is over-engineering and wastes memory.
But for a counter that gets billions of increments per second, or a queue used by every request, or a thread-pool internal, false sharing is the difference between scaling and not.
Implementations
Two threads each increment their own field. Logically independent. But both fields fit in one 64-byte cache line, so each write forces the other core to re-fetch. Throughput is much worse than two truly independent counters.
class Counters {
    public volatile long a;   // both fields end up in the same cache line
    public volatile long b;
}

Counters c = new Counters();

// Note: volatile ++ is not atomic; this demo only measures write throughput.
Thread t1 = new Thread(() -> { for (long i = 0; i < 1_000_000_000L; i++) c.a++; });
Thread t2 = new Thread(() -> { for (long i = 0; i < 1_000_000_000L; i++) c.b++; });
t1.start(); t2.start();
t1.join();  t2.join();    // handle InterruptedException in real code

// Compare against two separate Counters objects on different cache lines.
// The two-object version is dramatically faster on multi-core.

@jdk.internal.vm.annotation.Contended instructs the JVM to pad the field so it sits alone in a cache line. Used by LongAdder, ForkJoinPool, and the JDK's own concurrent classes. Requires -XX:-RestrictContended on user code.
import jdk.internal.vm.annotation.Contended;

class PaddedCounters {
    @Contended public volatile long a;
    @Contended public volatile long b;
}

// Or hand-rolled with explicit padding:
class HandPaddedCounter {
    public volatile long value;
    public long p1, p2, p3, p4, p5, p6, p7;   // 56 bytes of padding
}

Key points
- CPU cache lines are typically 64 bytes on x86 and ARM. The smallest unit of cache coherence is the line, not the variable.
- If two threads write variables in the same line, every write invalidates the other core's cached copy.
- Symptom: linear-looking code that does not scale with cores. Adding threads makes it slower.
- Fix: pad each hot variable to a full cache line, or use primitives (LongAdder, padded atomics) that already do it.
- Most common in counters, queues, lock state, per-thread statistics.