Benchmarking Methodology
Microbenchmarks lie unless they're done carefully. Warm up the JIT/runtime. Defeat dead-code elimination. Use realistic input. Measure with confidence intervals (multiple runs). Benchmark the right thing (the operation, not the noise around it). End-to-end load tests for system-level decisions, microbenchmarks for code-level decisions.
What it is
Benchmarking is the discipline of measuring code performance accurately. Sounds easy; it isn't.
The trap: write a loop, time it, get a number, believe it. The number is often wrong by 10x or more, in either direction. JIT warmup makes cold code look slow; dead code elimination makes hot code look impossibly fast; per-run variance from GC, scheduling, and cache effects can swing results by 2x.
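For contrast, here is the kind of hand-rolled loop the warning is about, with each pitfall marked; the class name and workload are invented for illustration:

import java.util.Arrays;

public class NaiveBench {
    public static void main(String[] args) {
        int[] data = new int[1024];
        long start = System.nanoTime();              // no warmup: measures the loop while the JIT is still compiling it
        for (int i = 0; i < 10_000_000; i++) {
            Arrays.hashCode(data);                   // result never used: after inlining, the JIT may delete the work
        }
        long elapsed = System.nanoTime() - start;    // one run: GC pauses, scheduling, and cache effects stay invisible
        System.out.println((double) elapsed / 10_000_000 + " ns/op");
    }
}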
Production microbenchmarking tools (JMH for Java, pytest-benchmark for Python, testing.B for Go, criterion for Rust) handle all of this. Use them.
What real benchmarking provides
Warmup. The JIT compiles hot code, the branch predictor learns the patterns, the caches load the working set. A real benchmark waits for steady state before measuring.
Multiple iterations and forks. JVM-to-JVM variance from GC layout, code cache layout, JIT decisions can swing measurements significantly. Multiple forks expose this and report it as the confidence interval.
Result consumption. The compiler/JIT may eliminate work whose result is unused. Benchmarks must consume results (return them, or pass them to a Blackhole-style sink) so the work is observable; a sketch of both idioms follows after this list.
Statistical reporting. The output should be 'X ns/op ± Y' with a confidence interval. Two operations with overlapping CIs are not measurably different.
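A minimal JMH sketch of the two consumption idioms; the class name and workload are invented for illustration:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ConsumeBench {
    private int x = 42;

    private long compute() { return (long) x * x; }

    @Benchmark
    public long returned() {
        return compute();       // returned value is consumed by the harness, so the work survives
    }

    @Benchmark
    public void sunk(Blackhole bh) {
        bh.consume(compute());  // explicit sink, useful when several intermediate results must stay alive
    }
}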
Microbenchmark vs load test
Two different questions, two different tools.
Microbenchmark: "is operation A faster than operation B?" Suitable for: comparing two algorithms, measuring the cost of a single function. Tools: JMH, testing.B, pytest-benchmark.
Load test: "does the system handle the expected load?" Suitable for: end-to-end latency under load, throughput limits, resource saturation. Tools: k6, wrk, Gatling, Locust (a rough sketch follows below).
Don't use microbenchmarks for system questions. The microbenchmark will report that the operation is fast; the load test will reveal that the system collapses at 800 RPS because of GC pressure or connection pool exhaustion.
Don't use load tests for code-level questions. The load test averages too many variables; the signal-to-noise ratio is too low to compare two implementations of one function.
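The load-test sketch mentioned above, in Gatling's Java DSL; the endpoint, payload, and injection rates are invented for illustration, and the exact DSL calls may vary by Gatling version:

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class CheckoutLoadTest extends Simulation {
    // Hypothetical service under test.
    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    ScenarioBuilder scn = scenario("checkout")
            .exec(http("POST /checkout")
                    .post("/checkout")
                    .body(StringBody("{\"cartId\": 1}")).asJson()
                    .check(status().is(200)));

    {
        // Ramp from 10 to 200 new users per second over 120 seconds; the interesting
        // output is end-to-end latency percentiles and error rate, not ns/op.
        setUp(scn.injectOpen(rampUsersPerSec(10).to(200).during(120)))
                .protocols(httpProtocol);
    }
}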
Use a real benchmark harness instead of a hand-rolled time.time() or System.nanoTime() loop; the harness handles warmup, dead-code elimination, and statistical reporting. Report confidence intervals, not single numbers: two means with overlapping CIs are indistinguishable, and running enough iterations to get tight CIs is the only way to claim a real difference. And match the tool to the question: microbenchmark for "is this faster", load test for "does this scale", profiler for "where is the time spent". The right tool gives a clear answer; the wrong tool gives a number that looks right but is meaningless.
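As a concrete reading of "overlapping CIs", a tiny helper (names invented) that takes two score ± error pairs straight from harness output and reports whether the intervals overlap:

public final class CiOverlap {
    // Intervals [a - aErr, a + aErr] and [b - bErr, b + bErr] overlap iff |a - b| <= aErr + bErr.
    static boolean overlaps(double a, double aErr, double b, double bErr) {
        return Math.abs(a - b) <= aErr + bErr;
    }

    public static void main(String[] args) {
        // Figures in the style of the JMH sample output further below.
        System.out.println(overlaps(524_123, 4_210, 892_541, 9_103)); // false -> a real difference can be claimed
        System.out.println(overlaps(524_123, 4_210, 526_000, 5_000)); // true  -> indistinguishable, gather more samples
    }
}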
Implementations
JMH handles warmup, fork separation, dead-code elimination (via Blackhole or returns), confidence intervals. Hand-rolled timing loops are almost always wrong on the JVM.
import java.security.MessageDigest;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2)
public class HashBench {
    private byte[] data = new byte[1024];

    @Setup
    public void setup() { new Random(42).nextBytes(data); }

    @Benchmark
    public long sha256() throws Exception {
        // Returning a byte of the digest keeps the work observable.
        return MessageDigest.getInstance("SHA-256").digest(data)[0];
    }

    @Benchmark
    public long blake3() {
        // Assumes a BLAKE3 library exposing a static hash(byte[]) helper.
        return Blake3.hash(data)[0];
    }
}
// Sample output:
// Benchmark          Mode  Cnt    Score   Error  Units
// HashBench.sha256  thrpt   10  524,123 ± 4,210  ops/s
// HashBench.blake3  thrpt   10  892,541 ± 9,103  ops/s
Key points
- Warmup is critical: JIT, branch predictor, caches all need to stabilise.
- Dead-code elimination silently deletes the benchmark; consume the result.
- Multiple runs in fresh JVMs (forks) catch JVM-to-JVM variance.
- Report confidence intervals; if they overlap, neither result can be called faster.
- Microbenchmark vs load test: microbenchmark for 'is op A faster than op B'; load test for 'does the system handle 1000 RPS'.
Follow-up questions
- Why doesn't a simple time.time() loop work?
- How many runs are needed?
- When to prefer load testing over microbenchmarks?
- What's the relationship between benchmarking and profiling?
Gotchas
- Hand-rolled timing loops give numbers that are off by 10x or more
- Single run = no idea of variance; could be off by a factor of 2
- Discarding the benchmark result lets DCE delete the loop
- Reporting averages without intervals hides 'no significant difference'
- Benchmarking in dev/IDE = unreliable; always run benchmarks from CLI