JMH: Benchmarking Concurrent Java
JMH (Java Microbenchmark Harness) is the only Java microbenchmarking tool worth trusting. Hand-rolled timing loops give meaningless numbers because the JIT optimises away the work, dead-code-eliminates results, and warm-up effects dominate. JMH manages all of that and produces statistically sound measurements.
What it is
JMH is the OpenJDK project's microbenchmark harness for Java. Maintained by JIT engineers, designed to defeat the JVM optimisations that make naive benchmarks lie. For measuring Java performance below the millisecond, JMH is the right tool.
The hand-rolled timing pattern `long start = System.nanoTime(); for (int i = 0; i < N; i++) { ... } long end = System.nanoTime();` is broken in three ways:
The JIT runs code interpreted at first, then partially compiled, then fully optimised. The loop measures all three regimes blended together, so the reported number is meaningless.
If the loop body produces a value that goes unused, the JIT may eliminate the entire computation as dead code. The harness is timing nothing. The number is small and stable, which makes it look real.
If the input is a constant or a small set of constants, the JIT may constant-fold the work at compile time. The harness is timing a literal load. Again, small and stable, again misleading.
JMH addresses all of this with a careful protocol around warmup, forced consumption of results via Blackhole, per-thread state to defeat constant folding, and statistical reporting that surfaces variance.
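The constant-folding point can be sketched as a pair of benchmarks in the style of the official JMH samples (the class and field names here are illustrative, not from this page): a literal argument that the JIT can fold at compile time versus the same value read from a non-final @State field, which the JIT must treat as unknown.

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class FoldBench {

    // A non-final state field: the JIT cannot assume its value,
    // so the computation must actually run on every call.
    double x = 42.0;

    // BAD: argument is a compile-time constant; the JIT may fold
    // Math.log(42.0) into a precomputed literal, so this times a load.
    @Benchmark
    public double folded() { return Math.log(42.0); }

    // GOOD: input comes from state, so the log is really computed.
    @Benchmark
    public double notFolded() { return Math.log(x); }
}
```

Run both and the folded variant typically reports implausibly small, suspiciously stable times; that is the tell.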
The mental model
A benchmark class has @Benchmark methods (the things being measured) and @State fields (the inputs). Each iteration, JMH calls the benchmark method as many times as it can in a fixed time window (Throughput mode) or measures the per-call latency (AverageTime mode). After warm-up iterations to let the JIT stabilise, it runs measurement iterations and reports.
State scope decides whether each benchmark thread gets its own instance (Scope.Thread) or shares one (Scope.Benchmark or Scope.Group). For concurrency benchmarks, this is critical. Shared state with @Threads(N) is the way to measure contention.
The harness runs each trial in a fresh, forked JVM. Multiple forks let it see JVM-to-JVM variance (from heap layout, JIT compilation decisions, code cache placement). Confidence intervals in the output are computed across all measurement iterations across all forks.
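Fork and iteration counts can be set per run rather than per annotation. As a sketch, JMH can also be launched programmatically through its Runner API; the benchmark name and counts below are illustrative:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("CounterBench")    // regex over benchmark names
                .forks(3)                   // 3 fresh JVMs per benchmark
                .warmupIterations(5)
                .measurementIterations(5)
                .build();
        new Runner(opt).run();              // results aggregated across forks
    }
}
```

The same knobs exist as command-line flags on the benchmarks JAR (e.g. `-f 3`), which is usually the more reproducible way to run.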
Reading the output
Numbers come with confidence intervals. 18,451,212 ± 523,141 ops/s means the true value is in that range with 99.9% confidence. When two benchmarks have overlapping CIs, neither can be declared faster.
This catches a common mistake: the optimisation that "looks 5% faster" but is well within noise.
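The overlap rule is simple interval arithmetic: treat each result as the range score ± error and check for intersection. A small sketch (this helper is not part of JMH; the second interval in the first check is a hypothetical "5% faster" variant of the sample AtomicLong number from this page):

```java
public class CiOverlap {
    // true if [s1-e1, s1+e1] and [s2-e2, s2+e2] intersect
    static boolean overlaps(double s1, double e1, double s2, double e2) {
        return (s1 - e1) <= (s2 + e2) && (s2 - e2) <= (s1 + e1);
    }

    public static void main(String[] args) {
        // Baseline vs a hypothetical "5% faster" tweak: the intervals
        // intersect, so no winner can be declared.
        System.out.println(overlaps(18_451_212, 523_141,
                                    19_373_772, 600_000));   // true

        // AtomicLong vs LongAdder from the sample output: the intervals
        // are far apart, so the difference is real.
        System.out.println(overlaps(18_451_212, 523_141,
                                    98_212_847, 1_345_910)); // false
    }
}
```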
Where it does not help
JMH is for microbenchmarks: small chunks of code, sub-millisecond. For end-to-end workloads (full request paths, database calls, external APIs), use a load testing tool (k6, wrk, gatling) and look at p50/p99/throughput at the system level. JMH-style precision does not help when the workload is dominated by a network round trip.
For finding hot spots, use a profiler (async-profiler is the standard for Java). JMH says "operation A is faster than operation B"; the profiler says "this loop is where 60% of CPU goes". Different questions, different tools.
A practical workflow
Start with the profiler to find the hot path. Write a JMH benchmark for the operation in question. Make a code change. Re-run the JMH benchmark to confirm the change actually helps. Re-run the load test to confirm the system-level metric moved.
Without JMH in the middle, there is no way to tell whether the change was a real improvement or just system-level noise.
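The workflow above can be sketched as shell commands. Everything here is illustrative: `asprof` is async-profiler's launcher (older releases ship `profiler.sh`), the PID placeholder stays a placeholder, and k6 stands in for whatever load tool is in use.

```shell
# 1. Find the hot path: attach async-profiler to the running JVM
./asprof -d 30 -f flame.html <pid>

# 2. Benchmark the suspect operation in isolation
mvn clean package
java -jar target/benchmarks.jar CounterBench

# 3. After the code change, re-run the same benchmark and compare CIs
java -jar target/benchmarks.jar CounterBench

# 4. Confirm the system-level metric moved under real load
k6 run load-test.js
```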
Primitives
- @Benchmark methods (the unit being measured)
- @State (instance scope: Thread, Group, Benchmark)
- @Setup / @TearDown (per-trial init/cleanup)
- Blackhole (consume results to defeat dead-code elimination)
- @Threads (concurrent benchmarking)
Implementation
Shared state for the counters: both live in a @State(Scope.Benchmark) class, so every worker thread hits the same instances. @Threads(8) runs each benchmark on 8 worker threads concurrently. JMH warms up, then measures, then reports ops/sec for each.
```java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2)
public class CounterBench {

    AtomicLong atomic = new AtomicLong();
    LongAdder adder = new LongAdder();

    @Benchmark @Threads(8)
    public long benchAtomic() { return atomic.incrementAndGet(); }

    @Benchmark @Threads(8)
    public void benchAdder() { adder.increment(); }
}
```

Without consuming the result, the JIT may delete the entire computation as dead code. Blackhole.consume tells the JIT that the value is observed externally, so the computation cannot be eliminated. Without it, a benchmark of a fast pure function reports something like 0.5 ns/op because nothing is actually running.
```java
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class HashBench {
    int[] data = new int[1024];

    @Setup
    public void setup() {
        for (int i = 0; i < data.length; i++) data[i] = i;
    }

    // BAD: result discarded; JIT may remove the compute
    @Benchmark
    public void bad() {
        long sum = 0;
        for (int i : data) sum += i;
        // sum is unused; JIT may eliminate the loop
    }

    // GOOD: return forces consumption
    @Benchmark
    public long good() {
        long sum = 0;
        for (int i : data) sum += i;
        return sum;
    }

    // ALSO GOOD: explicit consumption
    @Benchmark
    public void blackhole(Blackhole bh) {
        for (int i : data) bh.consume(i * 31L);
    }
}
```

@State(Scope.Thread): each benchmark thread gets its own instance. @State(Scope.Benchmark): one shared instance across all threads. The choice changes what is being measured. Counters need shared state. RNGs and per-thread caches need per-thread state.
```java
@State(Scope.Thread)
public static class ThreadLocalState {
    long[] arr = new long[1024]; // one copy per benchmark thread
}

@State(Scope.Benchmark)
public static class SharedState {
    AtomicLong counter = new AtomicLong(); // shared across all threads
}

@Benchmark @Threads(8)
public long contended(SharedState s, ThreadLocalState ts) {
    return s.counter.addAndGet(ts.arr[0]);
}
```

JMH builds a runnable JAR. Output shows ops/sec for each benchmark with 99.9% confidence intervals. When the CIs of two benchmarks overlap, neither can be declared faster. Run multiple forks to reduce JVM-to-JVM noise.
```java
// Build and run
// mvn clean package
// java -jar target/benchmarks.jar CounterBench

// Sample output:
//
// Benchmark                  Mode  Cnt       Score       Error  Units
// CounterBench.benchAtomic  thrpt   10  18,451,212 ±   523,141  ops/s
// CounterBench.benchAdder   thrpt   10  98,212,847 ± 1,345,910  ops/s
//
// Adder is ~5.3x faster under 8-thread contention.
```

Key points
- JIT warmup matters: cold code runs interpreted, warm code is fully JIT-compiled. Measure only warm.
- Dead code elimination silently deletes unconsumed results. Use Blackhole or return values.
- Loop unrolling and constant folding can vaporise simple benchmarks. Inputs must come from @State.
- @Threads(N) measures contention. Per-thread state vs shared state matters.
- Reported numbers are ops/sec or ns/op with confidence intervals. Trust the CIs.
Follow-up questions
- Why not just use System.nanoTime() in a loop?
- How many forks to use?
- Does JMH measure contention correctly?
- What is the relationship between JMH and async-profiler?
Gotchas
- Hand-rolled timing loops are almost always wrong. Use JMH.
- Returning the result is the easy way to defeat DCE; use Blackhole when a return is not possible.
- @State scope decides shared vs per-thread; getting it wrong measures the wrong thing.
- Single-fork results are noisy; multiple forks expose JVM-to-JVM variance.
- Timings taken inside the IDE are misleading; build the JAR and run from the CLI.