Bug Hunt: False Sharing in a Striped Counter
A striped counter built from an array of AtomicLongs shows worse throughput at high thread counts than a single AtomicLong. Bug: the array elements share cache lines; threads writing different stripes cause cache-line ping-pong. Fix: pad each element to its own cache line (LongAdder, @Contended, or manual padding).
The Bug
A striped counter is designed to scale better than a single AtomicLong. Each thread hashes to a stripe; only that stripe gets incremented. Looks like it should give linear scaling.
Benchmark: at 16 threads, the striped version is barely faster than a single AtomicLong. Sometimes worse. CPU is at 100% but throughput is the same as one thread doing all the work.
The bug isn't in the striping logic. It's in the memory layout.
What's Going Wrong
AtomicLong instances are 16-24 bytes (object header plus the long value, depending on JVM internals). The array itself holds only references, but the 16 AtomicLong instances are typically allocated back to back on the heap, occupying roughly 320 bytes spread across five 64-byte cache lines.
So three or four AtomicLongs per cache line. When 16 threads each write to "their" stripe, multiple threads end up writing to AtomicLongs that share a cache line. Even though each thread writes a different object, the cache-coherency protocol (MESI and its variants) allows only one core at a time to hold a line in a writable state.
The line bounces between L1 caches. Each write costs a full cache-coherency round trip (~40-100 cycles) instead of the ~1-5 cycles of a hot L1 hit. Throughput plateaus.
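To see the layout directly, OpenJDK's JOL (Java Object Layout) tool prints object sizes and heap addresses. A minimal sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath (exact sizes vary with JVM version and compressed-oops settings):

import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class LayoutCheck {
    public static void main(String[] args) {
        // Header size, field offsets, and total instance size of one AtomicLong.
        System.out.println(ClassLayout.parseClass(AtomicLong.class).toPrintable());

        // Heap addresses of 16 stripes: instances less than 64 bytes apart
        // share a cache line.
        AtomicLong[] stripes = new AtomicLong[16];
        for (int i = 0; i < stripes.length; i++) stripes[i] = new AtomicLong();
        System.out.println(GraphLayout.parseInstance((Object) stripes).toPrintable());
    }
}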
How to Detect
Three signals together strongly suggest false sharing:
- High CPU utilisation, low throughput. At 100% CPU, when operations per second don't match the work being done, something is preventing real progress.
- Hot atomic operations in the profile. The flame graph shows the atomic-add (or its CAS retry loop) as the hot spot.
- High L1 cache-miss rate, low IPC. perf stat -e cache-misses,instructions,cycles shows high cache misses and low instructions per cycle. The cache lines are bouncing.
If all three are present, false sharing is the most likely cause. Pad and re-measure.
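A minimal way to reproduce the signals: spin up writer threads against the counter and run the JVM under perf stat (Linux perf also has perf c2c record / perf c2c report, which attributes coherency traffic to individual cache lines). A sketch only: the class name, thread count, and duration are illustrative, and it assumes the StripedCounter shown under Implementations below.

import java.util.concurrent.CountDownLatch;

// Hypothetical harness: THREADS workers increment for RUN_MS, then the
// total is printed. Run the JVM under `perf stat` to see the cache-miss
// and IPC signals described above.
public class FalseSharingHarness {
    static final int THREADS = 16;
    static final long RUN_MS = 2_000;

    public static void main(String[] args) throws InterruptedException {
        StripedCounter counter = new StripedCounter(THREADS);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            workers[i] = new Thread(() -> {
                try { start.await(); } catch (InterruptedException e) { return; }
                long deadline = System.nanoTime() + RUN_MS * 1_000_000;
                while (System.nanoTime() < deadline) counter.inc();
            });
            workers[i].start();
        }
        start.countDown(); // release all workers at once
        for (Thread t : workers) t.join();
        System.out.printf("total increments: %,d in %d ms%n", counter.sum(), RUN_MS);
    }
}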
The Fix
Three options.
Use LongAdder. Java's standard-library answer. Internally it uses padded cells, grown dynamically under contention. Scales near-linearly. Almost always the right answer for a counter.
@Contended annotation. Tells the JVM to pad a field (or every field of a class) to its own cache line. It's an internal annotation: application classes need -XX:-RestrictContended at runtime, or it is silently ignored. Use when a custom striped structure is required.
Manual padding. In C++, alignas(std::hardware_destructive_interference_size). In Go, add a [56]byte padding field alongside each 8-byte value. In C, or in Java without @Contended, insert unused filler fields by hand (see the sketch after this list). Use when no language-level helper exists.
All three give each hot field its own cache line (64 bytes on most x86-64 parts; some ARM designs use 128, which is why C++ exposes the size as a constant). Cores writing to different fields no longer contend on the same line.
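For Java code that can't use the internal annotation, the classic manual-padding trick places unused long fields in superclasses around the hot field. A hedged sketch: HotSpot keeps superclass fields before subclass fields, which is what makes the layering work, but layout is ultimately JVM-specific, so verify with a tool like JOL:

// Padding via inheritance: HotSpot lays out PadLeft's fields, then
// Value.value, then PaddedCell's fields, so ~56 bytes of filler sit on
// each side of the hot field and it gets a cache line to itself.
class PadLeft { long p01, p02, p03, p04, p05, p06, p07; }
class Value extends PadLeft { volatile long value; }
public class PaddedCell extends Value {
    long p11, p12, p13, p14, p15, p16, p17;
}

LongAdder's internal cells used this style of padding before @Contended was available.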
Why this is sneaky
False sharing has no source-code symptom. The code looks fine: each thread writes its own stripe; nothing is logically shared. The bug is in the memory layout, which the source code doesn't expose.
It's also platform-dependent. On a single-socket machine, the line ping-pongs between core-private L1/L2 caches through the shared L3. On multi-socket NUMA systems it is dramatically worse: the line crosses the socket interconnect, and cross-socket cache coherency is far more expensive.
The defensive practice: when writing a striped/sharded structure with hot per-shard writes, pad. The cost (a few hundred bytes per shard) is negligible compared to the throughput win.
Adjacent fields share cache lines, so writes from different cores to fields on the same line cause cache-line bouncing. Pad atomics that multiple cores write to in hot paths (@Contended in Java, alignas in C++, a [56]byte filler in Go), or reach for a primitive that pads automatically (LongAdder and friends).
The bug is invisible in source code, and the fix looks like a no-op in the source (just pad); only the performance curve changes shape. Knowing to look for it is most of the work.
Implementations
The counter is supposed to scale: each thread hashes to a stripe, increments only its stripe. But AtomicLong instances are 16-24 bytes; multiple stripes share each 64-byte cache line. False sharing kills the scaling.
import java.util.concurrent.atomic.AtomicLong;

// BAD: false sharing across stripes
public class StripedCounter {
    private final AtomicLong[] stripes;

    public StripedCounter(int n) {
        stripes = new AtomicLong[n];
        for (int i = 0; i < n; i++) stripes[i] = new AtomicLong();
    }

    public void inc() {
        // map each thread to a stripe; only that stripe is written
        int idx = (int) (Thread.currentThread().getId() % stripes.length);
        stripes[idx].incrementAndGet(); // hot op
    }

    public long sum() {
        long s = 0;
        for (AtomicLong c : stripes) s += c.get();
        return s;
    }
}

// Benchmark: at 16 threads, this is barely faster than a single AtomicLong.
// 4 AtomicLongs per cache line; 16 threads scrambling for ~4 lines.

LongAdder solves the same problem with internal padding. It's the standard library answer.
import java.util.concurrent.atomic.LongAdder;

public class GoodCounter {
    private final LongAdder counter = new LongAdder();

    public void inc() { counter.increment(); }
    public long sum() { return counter.sum(); }
}

// LongAdder dynamically grows a per-thread Cell array, each Cell padded
// to its own cache line via @Contended. Scales near-linearly with cores.

@Contended tells the JVM to pad the field to its own cache line. Requires -XX:-RestrictContended. Use only after measuring false sharing.
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import jdk.internal.vm.annotation.Contended;

// Note: compiling against jdk.internal.vm.annotation needs
// --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED,
// and the padding only takes effect with -XX:-RestrictContended.
public class PaddedStriped {
    // Each Cell gets its own cache line
    @Contended
    static final class Cell {
        volatile long value;
    }

    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(Cell.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final Cell[] stripes;

    public PaddedStriped(int n) {
        stripes = new Cell[n];
        for (int i = 0; i < n; i++) stripes[i] = new Cell();
    }

    public void inc() {
        int idx = (int) (Thread.currentThread().getId() % stripes.length);
        VALUE.getAndAdd(stripes[idx], 1L); // atomic add on the padded cell
    }

    public long sum() {
        long s = 0;
        for (Cell c : stripes) s += c.value;
        return s;
    }
}

// With @Contended, each Cell occupies its own cache line.
// Throughput at 16 threads: ~10x the unpadded version.

Key points
- Symptom: striped counter scales worse than expected; CPU at 100%, throughput plateaus or degrades.
- Root cause: array elements occupy adjacent memory; multiple elements share each cache line.
- Two cores writing to different elements on the same line: the line bounces between L1 caches.
- Fix: pad each element to 64 bytes (the cache line size). Java's LongAdder does this internally.
- Diagnose: the profile shows a hot atomic-add, the cache-miss rate is high, and IPC (instructions per cycle) is low.
Follow-up questions
- How do I detect false sharing in profiling?
- Should I always pad atomics?
- Why does cache line bouncing matter so much?
- How does LongAdder choose the number of cells?
Gotchas
- Striped counters without padding can scale worse than a single non-striped counter.
- Padding only some atomics misses the bouncing on the unpadded ones.
- @Contended without -XX:-RestrictContended is silently ignored on application classes.
- False sharing in field layout (two atomic fields adjacent in one object) is the same problem as in arrays; see the sketch below.
- Padding waste: 64 bytes per atomic adds up; pad only hot, multi-writer fields.
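The field-layout variant from that gotcha, as a sketch (the Cursors class is hypothetical; the classic real-world instance is a ring buffer's head/tail pair):

import java.util.concurrent.atomic.AtomicLong;

// head and tail are logically independent (the consumer writes one, the
// producer the other), but the two AtomicLong instances are usually
// allocated back to back and land on the same cache line, so every
// producer write invalidates the consumer's cached copy and vice versa.
public class Cursors {
    final AtomicLong head = new AtomicLong();
    final AtomicLong tail = new AtomicLong();
}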