Benchmarking Methodology
Microbenchmarks lie unless they're done carefully. Warm up the JIT/runtime. Defeat dead-code elimination. Use realistic input. Measure with confidence intervals (multiple runs). Benchmark the right thing (the operation, not the noise around it). End-to-end load tests for system-level decisions, microbenchmarks for code-level decisions.
What it is
Benchmarking is the discipline of measuring code performance accurately. Sounds easy; it isn't.
The trap: write a loop, time it, get a number, believe it. The number is often wrong by 10x or more, in either direction. JIT warmup makes cold code look slow; dead code elimination makes hot code look impossibly fast; per-run variance from GC, scheduling, and cache effects can swing results by 2x.
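For contrast, here is the kind of hand-rolled loop the warning is about, with each pitfall marked; the class name and workload are invented for illustration:

import java.util.Arrays;

public class NaiveBench {
    public static void main(String[] args) {
        int[] data = new int[1024];
        long start = System.nanoTime();              // no warmup: measures the loop while the JIT is still compiling it
        for (int i = 0; i < 10_000_000; i++) {
            Arrays.hashCode(data);                   // result never used: after inlining, the JIT may delete the work
        }
        long elapsed = System.nanoTime() - start;    // one run: GC pauses, scheduling, and cache effects stay invisible
        System.out.println((double) elapsed / 10_000_000 + " ns/op");
    }
}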
Production microbenchmarking tools (JMH for Java, pytest-benchmark for Python, testing.B for Go, criterion for Rust) handle all of this. Use them.
What real benchmarking provides
Warmup. The JIT compiles hot code, the branch predictor learns the patterns, the caches load the working set. A real benchmark waits for steady state before measuring.
Multiple iterations and forks. JVM-to-JVM variance from GC layout, code cache layout, JIT decisions can swing measurements significantly. Multiple forks expose this and report it as the confidence interval.
Result consumption. The compiler/JIT may eliminate work whose result is unused. Benchmarks must consume results (return them, or pass them to a Blackhole-style sink) so the work is observable; a sketch of both idioms follows after this list.
Statistical reporting. The output should be 'X ns/op ± Y' with a confidence interval. Two operations with overlapping CIs are not measurably different.
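A minimal JMH sketch of the two consumption idioms; the class name and workload are invented for illustration:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ConsumeBench {
    private int x = 42;

    private long compute() { return (long) x * x; }

    @Benchmark
    public long returned() {
        return compute();       // returned value is consumed by the harness, so the work survives
    }

    @Benchmark
    public void sunk(Blackhole bh) {
        bh.consume(compute());  // explicit sink, useful when several intermediate results must stay alive
    }
}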
Microbenchmark vs load test
Two different questions, two different tools.
Microbenchmark: "is operation A faster than operation B?" Suitable for: comparing two algorithms, measuring the cost of a single function. Tools: JMH, testing.B, pytest-benchmark.
Load test: "does the system handle the expected load?" Suitable for: end-to-end latency under load, throughput limits, resource saturation. Tools: k6, wrk, Gatling, Locust (a rough sketch follows below).
Don't use microbenchmarks for system questions. The microbenchmark will report that the operation is fast; the load test will reveal that the system collapses at 800 RPS because of GC pressure or connection pool exhaustion.
Don't use load tests for code-level questions. The load test averages too many variables; the signal-to-noise ratio is too low to compare two implementations of one function.
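The load-test sketch mentioned above, in Gatling's Java DSL; the endpoint, payload, and injection rates are invented for illustration, and the exact DSL calls may vary by Gatling version:

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class CheckoutLoadTest extends Simulation {
    // Hypothetical service under test.
    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    ScenarioBuilder scn = scenario("checkout")
            .exec(http("POST /checkout")
                    .post("/checkout")
                    .body(StringBody("{\"cartId\": 1}")).asJson()
                    .check(status().is(200)));

    {
        // Ramp from 10 to 200 new users per second over 120 seconds; the interesting
        // output is end-to-end latency percentiles and error rate, not ns/op.
        setUp(scn.injectOpen(rampUsersPerSec(10).to(200).during(120)))
                .protocols(httpProtocol);
    }
}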
Use a real benchmark harness instead of a hand-rolled time.time() or System.nanoTime() loop; the harness handles warmup, dead-code elimination, and statistical reporting. Report confidence intervals, not single numbers: two means with overlapping CIs are indistinguishable, and running enough iterations to get tight CIs is the only way to claim a real difference. And match the tool to the question: microbenchmark for "is this faster", load test for "does this scale", profiler for "where is the time spent". The right tool gives a clear answer; the wrong tool gives a number that looks right but is meaningless.
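As a concrete reading of "overlapping CIs", a tiny helper (names invented) that takes two score ± error pairs straight from harness output and reports whether the intervals overlap:

public final class CiOverlap {
    // Intervals [a - aErr, a + aErr] and [b - bErr, b + bErr] overlap iff |a - b| <= aErr + bErr.
    static boolean overlaps(double a, double aErr, double b, double bErr) {
        return Math.abs(a - b) <= aErr + bErr;
    }

    public static void main(String[] args) {
        // Figures in the style of the JMH sample output further below.
        System.out.println(overlaps(524_123, 4_210, 892_541, 9_103)); // false -> a real difference can be claimed
        System.out.println(overlaps(524_123, 4_210, 526_000, 5_000)); // true  -> indistinguishable, gather more samples
    }
}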
Implementations
JMH handles warmup, fork separation, dead-code elimination (via Blackhole or returns), confidence intervals. Hand-rolled timing loops are almost always wrong on the JVM.
import java.security.MessageDigest;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2)
public class HashBench {
    private byte[] data = new byte[1024];

    @Setup
    public void setup() { new Random(42).nextBytes(data); }

    @Benchmark
    public long sha256() throws Exception {
        // Returning a byte of the digest keeps the work observable.
        return MessageDigest.getInstance("SHA-256").digest(data)[0];
    }

    @Benchmark
    public long blake3() {
        // Assumes a BLAKE3 library exposing a static hash(byte[]) helper.
        return Blake3.hash(data)[0];
    }
}
// Sample output:
// Benchmark          Mode  Cnt    Score   Error  Units
// HashBench.sha256  thrpt   10  524,123 ± 4,210  ops/s
// HashBench.blake3  thrpt   10  892,541 ± 9,103  ops/s
Key points
- Warmup is critical: JIT, branch predictor, caches all need to stabilise.
- Dead-code elimination silently deletes the benchmark; consume the result.
- Multiple runs in fresh JVMs (forks) catch JVM-to-JVM variance.
- Report confidence intervals; if they overlap, neither result can be called faster.
- Microbenchmark vs load test: microbenchmark for 'is op A faster than op B'; load test for 'does the system handle 1000 RPS'.
Follow-up questions
- Why doesn't a simple time.time() loop work?
- How many runs are needed?
- When to prefer load testing over microbenchmarks?
- What's the relationship between benchmarking and profiling?
Gotchas
- Hand-rolled timing loops give numbers that are off by 10x or more
- Single run = no idea of variance; could be off by a factor of 2
- Discarding the benchmark result lets DCE delete the loop
- Reporting averages without intervals hides 'no significant difference'
- Benchmarking in dev/IDE = unreliable; always run benchmarks from CLI