False Sharing
Two unrelated variables can sit on the same CPU cache line. When two threads write to them concurrently, the cache line ping-pongs between cores even though the variables are not actually shared. The code is correct but slow. Fix by padding to one variable per cache line.
The one-line definition
Two CPU cores writing to different variables that happen to live in the same 64-byte chunk of memory end up serialising on hardware, even though their writes don't overlap logically. The cores think they're working independently. The cache says otherwise.
Why it happens
CPUs don't read or write 8 bytes at a time. They work in 64-byte chunks called cache lines. When any byte in a line is written, the cache reloads or invalidates the whole line. Most of the time that's a feature: programs tend to use nearby data together (one memory fetch loads a useful neighbourhood of bytes).
It becomes a bug when two unrelated variables land on the same line and two cores write to them at the same time.
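A quick sanity check on the arithmetic: two addresses share a line exactly when they fall in the same 64-byte-aligned block. A minimal sketch, with invented offsets for illustration:

static long lineOf(long address) {
    return address / 64;   // line index; equivalently address >> 6
}

// Suppose counterA sits at base+16 and counterB at base+24 (made-up offsets).
// If base is 64-byte aligned, lineOf(base + 16) == lineOf(base + 24):
// both fields map to the same line, so they share fate.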
Memory layout (one 64-byte cache line):
+----------+----------+------------------------+
| counterA | counterB |         unused         |
|  (8 B)   |  (8 B)   |         (48 B)         |
+----------+----------+------------------------+
     ^          ^
     |          |
  Core 0     Core 1
  writes     writes
   here       here
Logically: independent fields. Two different variables.
Hardware view: same cache line. They share fate.
What the hardware does next is the bug:
Step 1. Core 0 writes counterA.
Core 0's cache line is now MODIFIED.
The coherence protocol invalidates the line on Core 1.
Step 2. Core 1 wants to write counterB.
Its cache line was just invalidated.
It has to fetch the line back from Core 0 (or from L3).
Core 0's line goes from MODIFIED to SHARED or INVALID.
Step 3. Core 1 writes counterB.
Core 1's cache line is now MODIFIED.
The coherence protocol invalidates the line on Core 0.
Step 4. Core 0 wants to write counterA again. Goes back to Step 1.
The 64-byte cache line ping-pongs between cores forever.
Throughput drops 5-10x. Both cores spend most of their time
waiting for the line, not doing work.
Logically the writes don't conflict. Physically the cache line they share enforces serialisation. That's false sharing: the contention is fake (the variables are independent), but the slowdown is real.
Where it bites
Most concurrent code that suffers from false sharing looks fine. Two counters in a struct. Two pool slots in an array. Two queue heads side by side. The bug shows up only as "this should scale linearly with cores but doesn't." Adding cores can make things slower, because more cores means more parties fighting over the same cache line.
How to spot it
If a benchmark gets slower going from 1 core to many, and the code is not obviously sharing data, suspect false sharing. Run with padding around the hot fields and re-measure. If padding fixes it, that was the cause.
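One way to confirm the diagnosis in Java is a grouped JMH benchmark with one thread hammering each field. JMH here is an assumption about your tooling, and the class and field names are illustrative:

import org.openjdk.jmh.annotations.*;

@State(Scope.Group)
public class FalseSharingBench {
    public volatile long a;   // no padding: a and b share a cache line
    public volatile long b;

    @Benchmark
    @Group("sameLine")
    public void writeA() { a++; }

    @Benchmark
    @Group("sameLine")
    public void writeB() { b++; }
}

// Run it once as-is, then again with the fields padded (or @Contended plus
// -XX:-RestrictContended). If the padded run is several times faster,
// false sharing was the culprit.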
Where it shows up
The classic places: per-thread counters in profiling code, the head and tail pointers of a queue, the state field of a lock, statistics arrays where each thread writes to a different index. Any array of small writeable structs is a false-sharing candidate.
The JDK and Go runtime fix this internally. LongAdder pads its cells. ForkJoinPool pads its work queues. Go's sync.Pool pads its per-processor slots, and the runtime pads several of its own per-CPU structures. Library code that handles its own concurrency usually has this figured out. Application code that builds fresh primitives often does not.
How to fix it
Three options. Pad the variable to fill a full line. Place each variable in its own struct or object. Or use a primitive that already pads.
In Java, @Contended on a field tells the JVM to pad it. The annotation is internal and restricted by default; for benchmarks, run with -XX:-RestrictContended. For production, use LongAdder/DoubleAdder for counters, or wrap each hot field in its own object.
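For the counter case, a minimal sketch of the LongAdder route; the surrounding class and method names are illustrative:

import java.util.concurrent.atomic.LongAdder;

class RequestStats {
    // LongAdder spreads increments across padded cells, so many threads
    // can update it without fighting over a single cache line.
    private final LongAdder hits = new LongAdder();

    void recordHit()  { hits.increment(); }
    long totalHits()  { return hits.sum(); }   // read sum() off the hot path
}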
In Go, declare the field inside a struct and follow it with 56 bytes of explicit padding (for example a _ [56]byte field). Go preserves struct field order, so the padding stays where you put it.
In C and C++, use alignas(64) on the variable, or alignas(std::hardware_destructive_interference_size) (C++17), so each hot variable starts on its own cache line.
When to care
Most application code never needs to worry about cache lines. The threshold to start caring: high-frequency writes to multiple "independent" fields by multiple threads, where benchmarks show worse-than-linear scaling. Padding code that doesn't have that profile is over-engineering and wastes memory.
But for a counter that gets billions of increments per second, or a queue used by every request, or a thread-pool internal, false sharing is the difference between scaling and not.
Implementations
Two threads each increment their own field. Logically independent. But both fields fit in one 64-byte cache line, so each write forces the other core to re-fetch. Throughput is much worse than two truly independent counters.
class Counters {
    public volatile long a;   // both fields end up in the same cache line
    public volatile long b;
}

Counters c = new Counters();

// Note: volatile ++ is not atomic; this demo only measures write throughput.
Thread t1 = new Thread(() -> { for (long i = 0; i < 1_000_000_000L; i++) c.a++; });
Thread t2 = new Thread(() -> { for (long i = 0; i < 1_000_000_000L; i++) c.b++; });
t1.start(); t2.start();
t1.join();  t2.join();    // handle InterruptedException in real code

// Compare against two separate Counters objects on different cache lines.
// The two-object version is dramatically faster on multi-core.

@jdk.internal.vm.annotation.Contended instructs the JVM to pad the field so it sits alone in a cache line. Used by LongAdder, ForkJoinPool, and the JDK's own concurrent classes. Requires -XX:-RestrictContended on user code.
import jdk.internal.vm.annotation.Contended;

class PaddedCounters {
    @Contended public volatile long a;
    @Contended public volatile long b;
}

// Or hand-rolled with explicit padding:
class HandPaddedCounter {
    public volatile long value;
    public long p1, p2, p3, p4, p5, p6, p7;   // 56 bytes of padding
}

Key points
- CPU cache lines are typically 64 bytes on x86 and ARM. The smallest unit of cache coherence is the line, not the variable.
- If two threads write variables in the same line, every write invalidates the other core's cached copy.
- Symptom: linear-looking code that does not scale with cores. Adding threads makes it slower.
- Fix: pad each hot variable to a full cache line, or use primitives (LongAdder, padded atomics) that already do it.
- Most common in counters, queues, lock state, per-thread statistics.