JMH: Benchmarking Concurrent Java
JMH (Java Microbenchmark Harness) is the only Java microbenchmarking tool worth trusting. Hand-rolled timing loops give meaningless numbers because the JIT optimises away the work, dead-code-eliminates results, and warm-up effects dominate. JMH manages all of that and produces statistically sound measurements.
What it is
JMH is the OpenJDK project's microbenchmark harness for Java. Maintained by JIT engineers, designed to defeat the JVM optimisations that make naive benchmarks lie. For measuring Java performance below the millisecond, JMH is the right tool.
The hand-rolled timing pattern `long start = System.nanoTime(); for (int i = 0; i < N; i++) { ... } long end = System.nanoTime();` is broken in three ways:
The JIT runs code interpreted at first, then partially compiled, then fully optimised. The loop measures all three regimes blended together, so the reported number is meaningless.
If the loop body produces a value that goes unused, the JIT may eliminate the entire computation as dead code. The harness is timing nothing. The number is small and stable, which makes it look real.
If the input is a constant or a small set of constants, the JIT may constant-fold the work at compile time. The harness is timing a literal load. Again, small and stable, again misleading.
JMH addresses all of this with a careful protocol around warmup, forced consumption of results via Blackhole, per-thread state to defeat constant folding, and statistical reporting that surfaces variance.
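The constant-folding point can be sketched as a pair of benchmarks in the style of the official JMH samples (the class and field names here are illustrative, not from this page): a literal argument that the JIT can fold at compile time versus the same value read from a non-final @State field, which the JIT must treat as unknown.

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class FoldBench {

    // A non-final state field: the JIT cannot assume its value,
    // so the computation must actually run on every call.
    double x = 42.0;

    // BAD: argument is a compile-time constant; the JIT may fold
    // Math.log(42.0) into a precomputed literal, so this times a load.
    @Benchmark
    public double folded() { return Math.log(42.0); }

    // GOOD: input comes from state, so the log is really computed.
    @Benchmark
    public double notFolded() { return Math.log(x); }
}
```

Run both and the folded variant typically reports implausibly small, suspiciously stable times; that is the tell.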
The mental model
A benchmark class has @Benchmark methods (the things being measured) and @State fields (the inputs). Each iteration, JMH calls the benchmark method as many times as it can in a fixed time window (Throughput mode) or measures the per-call latency (AverageTime mode). After warm-up iterations to let the JIT stabilise, it runs measurement iterations and reports.
State scope decides whether each benchmark thread gets its own instance (Scope.Thread) or shares one (Scope.Benchmark or Scope.Group). For concurrency benchmarks, this is critical. Shared state with @Threads(N) is the way to measure contention.
The harness runs each trial in a fresh, forked JVM. Multiple forks let it see JVM-to-JVM variance (from heap layout, JIT compilation decisions, code cache placement). Confidence intervals in the output are computed across all measurement iterations across all forks.
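Fork and iteration counts can be set per run rather than per annotation. As a sketch, JMH can also be launched programmatically through its Runner API; the benchmark name and counts below are illustrative:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("CounterBench")    // regex over benchmark names
                .forks(3)                   // 3 fresh JVMs per benchmark
                .warmupIterations(5)
                .measurementIterations(5)
                .build();
        new Runner(opt).run();              // results aggregated across forks
    }
}
```

The same knobs exist as command-line flags on the benchmarks JAR (e.g. `-f 3`), which is usually the more reproducible way to run.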
Reading the output
Numbers come with confidence intervals. 18,451,212 ± 523,141 ops/s means the true value is in that range with 99.9% confidence. When two benchmarks have overlapping CIs, neither can be declared faster.
This catches a common mistake: the optimisation that "looks 5% faster" but is well within noise.
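The overlap rule is simple interval arithmetic: treat each result as the range score ± error and check for intersection. A small sketch (this helper is not part of JMH; the second interval in the first check is a hypothetical "5% faster" variant of the sample AtomicLong number from this page):

```java
public class CiOverlap {
    // true if [s1-e1, s1+e1] and [s2-e2, s2+e2] intersect
    static boolean overlaps(double s1, double e1, double s2, double e2) {
        return (s1 - e1) <= (s2 + e2) && (s2 - e2) <= (s1 + e1);
    }

    public static void main(String[] args) {
        // Baseline vs a hypothetical "5% faster" tweak: the intervals
        // intersect, so no winner can be declared.
        System.out.println(overlaps(18_451_212, 523_141,
                                    19_373_772, 600_000));   // true

        // AtomicLong vs LongAdder from the sample output: the intervals
        // are far apart, so the difference is real.
        System.out.println(overlaps(18_451_212, 523_141,
                                    98_212_847, 1_345_910)); // false
    }
}
```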
Where it does not help
JMH is for microbenchmarks: small chunks of code, sub-millisecond. For end-to-end workloads (full request paths, database calls, external APIs), use a load testing tool (k6, wrk, gatling) and look at p50/p99/throughput at the system level. JMH-style precision does not help when the workload is dominated by a network round trip.
For finding hot spots, use a profiler (async-profiler is the standard for Java). JMH says "operation A is faster than operation B"; the profiler says "this loop is where 60% of CPU goes". Different questions, different tools.
A practical workflow
Start with the profiler to find the hot path. Write a JMH benchmark for the operation in question. Make a code change. Re-run the JMH benchmark to confirm the change actually helps. Re-run the load test to confirm the system-level metric moved.
Without JMH in the middle, there is no way to tell whether the change was a real improvement or just system-level noise.
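The workflow above can be sketched as shell commands. Everything here is illustrative: `asprof` is async-profiler's launcher (older releases ship `profiler.sh`), the PID placeholder stays a placeholder, and k6 stands in for whatever load tool is in use.

```shell
# 1. Find the hot path: attach async-profiler to the running JVM
./asprof -d 30 -f flame.html <pid>

# 2. Benchmark the suspect operation in isolation
mvn clean package
java -jar target/benchmarks.jar CounterBench

# 3. After the code change, re-run the same benchmark and compare CIs
java -jar target/benchmarks.jar CounterBench

# 4. Confirm the system-level metric moved under real load
k6 run load-test.js
```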
Primitives
- @Benchmark methods (the unit being measured)
- @State (instance scope: Thread, Group, Benchmark)
- @Setup / @TearDown (per-trial init/cleanup)
- Blackhole (consume results to defeat dead-code elimination)
- @Threads (concurrent benchmarking)
Implementation
Shared state for the counters: both live in a @State(Scope.Benchmark) class, so every worker thread hits the same instances. @Threads(8) runs each benchmark on 8 worker threads concurrently. JMH warms up, then measures, then reports ops/sec for each.
```java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2)
public class CounterBench {

    AtomicLong atomic = new AtomicLong();
    LongAdder adder = new LongAdder();

    @Benchmark @Threads(8)
    public long benchAtomic() { return atomic.incrementAndGet(); }

    @Benchmark @Threads(8)
    public void benchAdder() { adder.increment(); }
}
```

Without consuming the result, the JIT may delete the entire computation as dead code. Blackhole.consume tells the JIT that the value is observed externally, so the computation cannot be eliminated. Without it, a benchmark of a fast pure function reports something like 0.5 ns/op because nothing is actually running.
```java
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class HashBench {
    int[] data = new int[1024];

    @Setup
    public void setup() {
        for (int i = 0; i < data.length; i++) data[i] = i;
    }

    // BAD: result discarded; JIT may remove the compute
    @Benchmark
    public void bad() {
        long sum = 0;
        for (int i : data) sum += i;
        // sum is unused; JIT may eliminate the loop
    }

    // GOOD: return forces consumption
    @Benchmark
    public long good() {
        long sum = 0;
        for (int i : data) sum += i;
        return sum;
    }

    // ALSO GOOD: explicit consumption
    @Benchmark
    public void blackhole(Blackhole bh) {
        for (int i : data) bh.consume(i * 31L);
    }
}
```

@State(Scope.Thread): each benchmark thread gets its own instance. @State(Scope.Benchmark): one shared instance across all threads. The choice changes what is being measured. Counters need shared state. RNGs and per-thread caches need per-thread state.
```java
@State(Scope.Thread)
public static class ThreadLocalState {
    long[] arr = new long[1024]; // one copy per benchmark thread
}

@State(Scope.Benchmark)
public static class SharedState {
    AtomicLong counter = new AtomicLong(); // shared across all threads
}

@Benchmark @Threads(8)
public long contended(SharedState s, ThreadLocalState ts) {
    return s.counter.addAndGet(ts.arr[0]);
}
```

JMH builds a runnable JAR. Output shows ops/sec for each benchmark with 99.9% confidence intervals. When the CIs of two benchmarks overlap, neither can be declared faster. Run multiple forks to reduce JVM-to-JVM noise.
```java
// Build and run
// mvn clean package
// java -jar target/benchmarks.jar CounterBench

// Sample output:
//
// Benchmark                  Mode  Cnt       Score       Error  Units
// CounterBench.benchAtomic  thrpt   10  18,451,212 ±   523,141  ops/s
// CounterBench.benchAdder   thrpt   10  98,212,847 ± 1,345,910  ops/s
//
// Adder is ~5.3x faster under 8-thread contention.
```

Key points
- JIT warmup matters: cold code runs interpreted, warm code is fully JIT-compiled. Measure only warm.
- Dead code elimination silently deletes unconsumed results. Use Blackhole or return values.
- Loop unrolling and constant folding can vaporise simple benchmarks. Inputs must come from @State.
- @Threads(N) measures contention. Per-thread state vs shared state matters.
- Reported numbers are ops/sec or ns/op with confidence intervals. Trust the CIs.
Follow-up questions
- Why not just use System.nanoTime() in a loop?
- How many forks to use?
- Does JMH measure contention correctly?
- What is the relationship between JMH and async-profiler?
Gotchas
- Hand-rolled timing loops are almost always wrong. Use JMH.
- Returning the result is the easy way to defeat DCE; use Blackhole when a return is not possible.
- @State scope decides shared vs per-thread; getting it wrong measures the wrong thing.
- Single-fork results are noisy; multiple forks expose JVM-to-JVM variance.
- Timings taken inside the IDE are misleading; build the JAR and run from the CLI.