Bug Hunt: Why Is This LongAdder Slower Than AtomicLong?
LongAdder shards counters across threads to win under contention, but its sum() walks every shard, so reads are O(N). When read rate is high relative to write rate, AtomicLong (single CAS) is faster. The bug is using the right tool for the wrong workload.
The puzzle
A senior engineer reads "LongAdder is faster than AtomicLong" on a blog. They migrate a hot counter in the service from AtomicLong to LongAdder and ship. Latency p99 doubles. Throughput drops. They roll back. What did the blog get right that the engineer applied wrong?
The general lesson: "X is faster than Y" is never the whole truth. Every concurrent data structure trades one operation's cost for another's. The blog said "LongAdder wins under contention," which is true. The engineer's workload had only occasional write contention but frequent reads. The two facts don't intersect.
How to read the broken code
Compare the AtomicLong and LongAdder versions below. Notice what's similar: both increment a counter, both expose a read method. Notice what's different:
- The increment path: AtomicLong does one CAS on a single memory location. LongAdder picks a thread-local cell and increments it.
- The read path: AtomicLong does one load. LongAdder walks every cell and sums.
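A minimal sketch makes the asymmetry concrete (ReadPathDemo is just an illustrative name; single-threaded, so both counters read 1):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class ReadPathDemo {
    public static void main(String[] args) {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();

        atomic.incrementAndGet();  // one CAS on one memory word
        adder.increment();         // bumps a per-thread cell; no shared CAS loop

        long a = atomic.get();     // one volatile load
        long s = adder.sum();      // iterates base + every cell; not an atomic snapshot
        System.out.println(a + " " + s); // prints "1 1"
    }
}
```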
The two costs that trade off
- Write contention cost (CAS retries when many threads update the same word): high for AtomicLong under contention, low for LongAdder.
- Read cost: O(1) for AtomicLong, O(num shards) for LongAdder.
Pick LongAdder only when the contention cost is the bottleneck AND reads are infrequent enough to absorb the O(N) sum.
The decision matrix
| Workload | Right tool |
|---|---|
| Hit counter read on every request, few concurrent writers | AtomicLong, read is O(1) |
| Metrics counter scraped every 10-60s, many writers | LongAdder, write is contention-free |
| Counter for billing/accounting (exact reads required) | AtomicLong or Lock, eventual consistency unacceptable |
| Read-write ratio uncertain | Benchmark both with the actual load |
How to actually measure this
The benchmark that matters
Don't trust micro-benchmarks from blogs. Write one for the actual workload:
- Vary contending writers from 1 to 64.
- Vary read frequency from "every increment" to "once per second."
- Plot throughput; pick the implementation that wins at the relevant operating point.
Java: JMH. Go: go test -bench with RunParallel. Python: timeit with threading. The numbers from a 5-minute benchmark beat hours of reasoning about cache-line behavior.
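For a quick first pass before reaching for JMH, a crude harness along these lines works (CounterBench, readsPerRun, and the thread/duration choices are all illustrative; JMH is the right tool for numbers you intend to trust):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

public class CounterBench {
    // `writers` threads spin on the increment while the calling thread reads in a
    // loop for `millis` ms. Returns reads completed: a rough proxy for read-path
    // cost at that write load.
    static long readsPerRun(Runnable inc, LongSupplier read, int writers, long millis) {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread[] ts = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            ts[i] = new Thread(() -> { while (!stop.get()) inc.run(); });
            ts[i].start();
        }
        long reads = 0;
        long deadline = System.currentTimeMillis() + millis;
        while (System.currentTimeMillis() < deadline) {
            read.getAsLong();
            reads++;
        }
        stop.set(true);
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return reads;
    }

    public static void main(String[] args) {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();
        // Sweep writer counts; read on every loop iteration (a read-heavy workload).
        for (int writers : new int[]{1, 4, 16}) {
            long a = readsPerRun(atomic::incrementAndGet, atomic::get, writers, 200);
            long b = readsPerRun(adder::increment, adder::sum, writers, 200);
            System.out.println(writers + " writers: AtomicLong " + a
                    + " reads, LongAdder " + b + " reads");
        }
    }
}
```

Run it at the operating point from the table above; the winner flips as the read:write ratio moves.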
The lesson beyond LongAdder
The pattern recurs everywhere
- ConcurrentHashMap vs Hashtable: same tradeoff at scale. CHM is faster for reads under contention.
- CopyOnWriteArrayList vs synchronized List: COW wins for read-heavy, dies under write-heavy load.
- Lock-free queues vs blocking queues: lock-free wins for low-contention high-throughput; blocking wins for back-pressure-aware workloads.
- Spinlocks vs mutexes: spin wins for very short critical sections; mutex wins everywhere else.
The shape of the lesson: every concurrent data structure has a "designed for" workload. Using it on a different workload reverses its claimed advantage.
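The CopyOnWriteArrayList case can be demonstrated directly: every write copies the whole backing array, which is exactly what buys readers a lock-free stable snapshot (CowSnapshotDemo is an illustrative sketch):

```java
import java.util.Iterator;
import java.util.concurrent.CopyOnWriteArrayList;

public class CowSnapshotDemo {
    public static void main(String[] args) {
        CopyOnWriteArrayList<String> list = new CopyOnWriteArrayList<>();
        list.add("a");
        Iterator<String> it = list.iterator(); // captures the current backing array
        list.add("b");                         // write copies the whole array: O(n) per add
        int seen = 0;
        while (it.hasNext()) { it.next(); seen++; }
        System.out.println(seen + " vs " + list.size()); // prints "1 vs 2"
    }
}
```

Cheap snapshot reads paid for by O(n) writes: the same shape as LongAdder's cheap sharded writes paid for by O(n) reads, with the direction reversed.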
Implementations
A monitoring component exposes a read-heavy counter: getCount() is called on every request to populate a header, while the counter is incremented only occasionally. Someone reads "LongAdder is faster" and swaps it in. Throughput drops. Why?
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// BEFORE: AtomicLong
class HitCounter {
    private final AtomicLong count = new AtomicLong();
    void inc() { count.incrementAndGet(); }     // CAS
    long getCount() { return count.get(); }     // single load
}

// AFTER: "optimized" with LongAdder
class HitCounter {
    private final LongAdder count = new LongAdder();
    void inc() { count.increment(); }           // sharded
    long getCount() { return count.sum(); }     // ← walks all cells
}

// Workload: 1 writer, 1000 readers/sec
// Result: throughput WORSE with LongAdder
```

The bug: LongAdder optimizes for write contention. Its sum() walks every cell, typically around Runtime.getRuntime().availableProcessors() cells, sometimes more. For a read-heavy workload on a 16-core machine, every getCount() does ~16 cache-line fetches instead of 1. The fix: use AtomicLong when reads dominate, and LongAdder when writes dominate AND reads are infrequent (e.g., metrics scraping every 10s). Benchmark first; the intuition that "newer = faster" lies.
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// RULE: read-heavy → AtomicLong; write-heavy + infrequent reads → LongAdder

// Read-heavy (thousands of reads, occasional writes)
class ReadHeavyCounter {
    private final AtomicLong count = new AtomicLong();
    void inc() { count.incrementAndGet(); }
    long getCount() { return count.get(); }   // O(1)
}

// Write-heavy with periodic snapshots
class WriteHeavyCounter {
    private final LongAdder count = new LongAdder();
    void inc() { count.increment(); }         // O(1) most of the time
    long sample() { return count.sum(); }     // O(num cells), called once per metric flush
}

// The benchmark to write (JMH):
// - Vary writer thread count (1, 2, 4, 8, 16, 32, 64)
// - Vary read frequency (every-write, 1:10, 1:100, 1:10000)
// - Pick the impl that wins for the real workload
```

Key points
- AtomicLong: cheap reads (one load), but CAS retries on contention (slow when many writers contend)
- LongAdder: cheap writes under contention (per-thread cells), expensive reads (sum walks all cells)
- Read-heavy + few writers → AtomicLong wins
- Write-heavy + many writers → LongAdder wins
- Always benchmark with the actual workload; intuition lies
Follow-up questions
- When does LongAdder definitively beat AtomicLong?
- Why is AtomicLong's read O(1) but LongAdder's O(N)?
- Are sharded counters ever wrong?
- Why does the JDK ship both AtomicLong and LongAdder if one is just better under contention?
Gotchas
- Many counters are read on every request (rate limiters, header injection) → AtomicLong wins
- Metrics counters scraped via Prometheus/StatsD → LongAdder may win
- Sharded counters are eventually consistent: sum() can return slightly stale values
- False sharing across cache lines kills sharded performance if cells aren't padded; the JDK pads automatically, but hand-rolled implementations often forget
- Profile-driven optimization beats intuition every time
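The padding gotcha can be sketched with a hand-rolled sharded counter (illustrative only: PaddedShardedCounter, the stride constant, and the random shard pick are inventions for this sketch; the JDK's LongAdder instead uses @Contended-padded cells keyed by a per-thread probe):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

class PaddedShardedCounter {
    private static final int SHARDS = Runtime.getRuntime().availableProcessors();
    private static final int STRIDE = 8; // 8 longs * 8 bytes = 64 bytes, one typical cache line

    // Only slot 0 of every 8-long stride is used, so no two shards share a cache line.
    private final AtomicLongArray cells = new AtomicLongArray(SHARDS * STRIDE);

    void inc() {
        // Simplified shard choice; LongAdder hashes a per-thread probe instead.
        int shard = ThreadLocalRandom.current().nextInt(SHARDS);
        cells.incrementAndGet(shard * STRIDE);
    }

    long sum() { // O(SHARDS): the same read cost that bit the engineer above
        long s = 0;
        for (int i = 0; i < SHARDS; i++) s += cells.get(i * STRIDE);
        return s;
    }
}
```

Drop the stride (store the shards contiguously) and adjacent shards land on one cache line, so concurrent writers invalidate each other's lines on every increment: exactly the false-sharing failure mode the gotcha describes.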
Hadoop, Cassandra, Spring Actuator, Hibernate statistics, and the Datadog Java tracer all use LongAdder for high-throughput metrics. The Linux kernel's lock-free counters use per-CPU shards. Production systems have switched in both directions based on measurement.