Bug Hunt: Why Is This LongAdder Slower Than AtomicLong?
LongAdder shards counters across threads to win under contention, but its sum() walks every shard, so reads are O(N). When read rate is high relative to write rate, AtomicLong (single CAS) is faster. The bug is using the right tool for the wrong workload.
The puzzle
A senior engineer reads "LongAdder is faster than AtomicLong" on a blog. They migrate a hot counter in the service from AtomicLong to LongAdder and ship. Latency p99 doubles. Throughput drops. They roll back. What did the blog get right that the engineer applied wrong?
The general lesson: "X is faster than Y" is never the whole truth. Every concurrent data structure trades one operation's cost for another's. The blog said "LongAdder wins under contention," which is true. The engineer's workload had only occasional write contention but frequent reads. The two facts don't intersect.
How to read the broken code
Compare the AtomicLong and LongAdder versions below. Notice what's similar: both increment a counter, both expose a read method. Notice what's different:
- The increment path: AtomicLong does one CAS on a single memory location. LongAdder picks a thread-local cell and increments it.
- The read path: AtomicLong does one load. LongAdder walks every cell and sums.
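A minimal sketch makes the asymmetry concrete (ReadPathDemo is just an illustrative name; single-threaded, so both counters read 1):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class ReadPathDemo {
    public static void main(String[] args) {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();

        atomic.incrementAndGet();  // one CAS on one memory word
        adder.increment();         // bumps a per-thread cell; no shared CAS loop

        long a = atomic.get();     // one volatile load
        long s = adder.sum();      // iterates base + every cell; not an atomic snapshot
        System.out.println(a + " " + s); // prints "1 1"
    }
}
```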
The two costs that trade off
- Write contention cost (CAS retries when many threads update the same word): high for AtomicLong under contention, low for LongAdder.
- Read cost: O(1) for AtomicLong, O(num shards) for LongAdder.
Pick LongAdder only when the contention cost is the bottleneck AND reads are infrequent enough to absorb the O(N) sum.
The decision matrix
| Workload | Right tool |
|---|---|
| Hit counter read on every request, few concurrent writers | AtomicLong, read is O(1) |
| Metrics counter scraped every 10-60s, many writers | LongAdder, write is contention-free |
| Counter for billing/accounting (exact reads required) | AtomicLong or Lock, eventual consistency unacceptable |
| Read-write ratio uncertain | Benchmark both with the actual load |
How to actually measure this
The benchmark that matters
Don't trust micro-benchmarks from blogs. Write one for the actual workload:
- Vary contending writers from 1 to 64.
- Vary read frequency from "every increment" to "once per second."
- Plot throughput; pick the implementation that wins at the relevant operating point.
Java: JMH. Go: go test -bench with RunParallel. Python: timeit with threading. The numbers from a 5-minute benchmark beat hours of reasoning about cache-line behavior.
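For a quick first pass before reaching for JMH, a crude harness along these lines works (CounterBench, readsPerRun, and the thread/duration choices are all illustrative; JMH is the right tool for numbers you intend to trust):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

public class CounterBench {
    // `writers` threads spin on the increment while the calling thread reads in a
    // loop for `millis` ms. Returns reads completed: a rough proxy for read-path
    // cost at that write load.
    static long readsPerRun(Runnable inc, LongSupplier read, int writers, long millis) {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread[] ts = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            ts[i] = new Thread(() -> { while (!stop.get()) inc.run(); });
            ts[i].start();
        }
        long reads = 0;
        long deadline = System.currentTimeMillis() + millis;
        while (System.currentTimeMillis() < deadline) {
            read.getAsLong();
            reads++;
        }
        stop.set(true);
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return reads;
    }

    public static void main(String[] args) {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();
        // Sweep writer counts; read on every loop iteration (a read-heavy workload).
        for (int writers : new int[]{1, 4, 16}) {
            long a = readsPerRun(atomic::incrementAndGet, atomic::get, writers, 200);
            long b = readsPerRun(adder::increment, adder::sum, writers, 200);
            System.out.println(writers + " writers: AtomicLong " + a
                    + " reads, LongAdder " + b + " reads");
        }
    }
}
```

Run it at the operating point from the table above; the winner flips as the read:write ratio moves.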
The lesson beyond LongAdder
The pattern recurs everywhere
- ConcurrentHashMap vs Hashtable: same tradeoff at scale. CHM is faster for reads under contention.
- CopyOnWriteArrayList vs synchronized List: COW wins for read-heavy, dies under write-heavy load.
- Lock-free queues vs blocking queues: lock-free wins for low-contention high-throughput; blocking wins for back-pressure-aware workloads.
- Spinlocks vs mutexes: spin wins for very short critical sections; mutex wins everywhere else.
The shape of the lesson: every concurrent data structure has a "designed for" workload. Using it on a different workload reverses its claimed advantage.
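The CopyOnWriteArrayList case can be demonstrated directly: every write copies the whole backing array, which is exactly what buys readers a lock-free stable snapshot (CowSnapshotDemo is an illustrative sketch):

```java
import java.util.Iterator;
import java.util.concurrent.CopyOnWriteArrayList;

public class CowSnapshotDemo {
    public static void main(String[] args) {
        CopyOnWriteArrayList<String> list = new CopyOnWriteArrayList<>();
        list.add("a");
        Iterator<String> it = list.iterator(); // captures the current backing array
        list.add("b");                         // write copies the whole array: O(n) per add
        int seen = 0;
        while (it.hasNext()) { it.next(); seen++; }
        System.out.println(seen + " vs " + list.size()); // prints "1 vs 2"
    }
}
```

Cheap snapshot reads paid for by O(n) writes: the same shape as LongAdder's cheap sharded writes paid for by O(n) reads, with the direction reversed.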
Implementations
A monitoring component exposes a read-heavy counter: getCount() is called on every request to populate a header, while the counter is incremented only occasionally. Someone reads "LongAdder is faster" and swaps it in. Throughput drops. Why?
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// BEFORE: AtomicLong
class HitCounter {
    private final AtomicLong count = new AtomicLong();
    void inc() { count.incrementAndGet(); }     // CAS
    long getCount() { return count.get(); }     // single load
}

// AFTER: "optimized" with LongAdder
class HitCounter {
    private final LongAdder count = new LongAdder();
    void inc() { count.increment(); }           // sharded
    long getCount() { return count.sum(); }     // ← walks all cells
}

// Workload: 1 writer, 1000 readers/sec
// Result: throughput WORSE with LongAdder
```

The bug: LongAdder optimizes for write contention. Its sum() walks every cell, typically around Runtime.getRuntime().availableProcessors() cells, sometimes more. For a read-heavy workload on a 16-core machine, every getCount() does ~16 cache-line fetches instead of 1. The fix: use AtomicLong when reads dominate, and LongAdder when writes dominate AND reads are infrequent (e.g., metrics scraping every 10s). Benchmark first; the intuition that "newer = faster" lies.
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// RULE: read-heavy → AtomicLong; write-heavy + infrequent reads → LongAdder

// Read-heavy (thousands of reads, occasional writes)
class ReadHeavyCounter {
    private final AtomicLong count = new AtomicLong();
    void inc() { count.incrementAndGet(); }
    long getCount() { return count.get(); }   // O(1)
}

// Write-heavy with periodic snapshots
class WriteHeavyCounter {
    private final LongAdder count = new LongAdder();
    void inc() { count.increment(); }         // O(1) most of the time
    long sample() { return count.sum(); }     // O(num cells), called once per metric flush
}

// The benchmark to write (JMH):
// - Vary writer thread count (1, 2, 4, 8, 16, 32, 64)
// - Vary read frequency (every-write, 1:10, 1:100, 1:10000)
// - Pick the impl that wins for the real workload
```

Key points
- AtomicLong: cheap reads (one load), but CAS retries on contention (slow when many writers contend)
- LongAdder: cheap writes under contention (per-thread cells), expensive reads (sum walks all cells)
- Read-heavy + few writers → AtomicLong wins
- Write-heavy + many writers → LongAdder wins
- Always benchmark with the actual workload; intuition lies
Follow-up questions
- When does LongAdder definitively beat AtomicLong?
- Why is AtomicLong's read O(1) but LongAdder's O(N)?
- Are sharded counters ever wrong?
- Why does the JDK ship both AtomicLong and LongAdder if one is just better under contention?
Gotchas
- Many counters are read on every request (rate limiters, header injection) → AtomicLong wins
- Metrics counters scraped via Prometheus/StatsD → LongAdder may win
- Sharded counters are eventually consistent: sum() can return slightly stale values
- False sharing across cache lines kills sharded performance if cells aren't padded; the JDK pads automatically, but hand-rolled implementations often forget
- Profile-driven optimization beats intuition every time
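The padding gotcha can be sketched with a hand-rolled sharded counter (illustrative only: PaddedShardedCounter, the stride constant, and the random shard pick are inventions for this sketch; the JDK's LongAdder instead uses @Contended-padded cells keyed by a per-thread probe):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

class PaddedShardedCounter {
    private static final int SHARDS = Runtime.getRuntime().availableProcessors();
    private static final int STRIDE = 8; // 8 longs * 8 bytes = 64 bytes, one typical cache line

    // Only slot 0 of every 8-long stride is used, so no two shards share a cache line.
    private final AtomicLongArray cells = new AtomicLongArray(SHARDS * STRIDE);

    void inc() {
        // Simplified shard choice; LongAdder hashes a per-thread probe instead.
        int shard = ThreadLocalRandom.current().nextInt(SHARDS);
        cells.incrementAndGet(shard * STRIDE);
    }

    long sum() { // O(SHARDS): the same read cost that bit the engineer above
        long s = 0;
        for (int i = 0; i < SHARDS; i++) s += cells.get(i * STRIDE);
        return s;
    }
}
```

Drop the stride (store the shards contiguously) and adjacent shards land on one cache line, so concurrent writers invalidate each other's lines on every increment: exactly the false-sharing failure mode the gotcha describes.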
Hadoop, Cassandra, Spring Actuator, Hibernate statistics, and the Datadog Java tracer all use LongAdder for high-throughput metrics. The Linux kernel's lock-free counters use per-CPU shards. Production systems have switched in both directions based on measurement.