JMH Deep Dive and Microbenchmark Correctness

Learn Java Formal Methods, Testing, Benchmarking, and Performance Engineering - Part 027

A production-grade deep dive into JMH and microbenchmark correctness: modes, warmup, forks, state, Blackhole, JVM optimizations, profilers, workload realism, review checklists, and benchmark governance.

[2026-07-02] 15 min read, 2942 words

JMH is not a magic truth machine.

It is a harness that helps you ask performance questions on the JVM without being destroyed immediately by warmup, dead-code elimination, tiered compilation, inlining, constant folding, and measurement overhead. But it cannot decide whether your benchmark represents production. It cannot know whether your input distribution is honest. It cannot know whether your benchmark accidentally measures a hot cache path while production suffers cold cache misses. It cannot know whether the JVM profile collected during the benchmark is completely different from the profile collected inside the real service.

So the rule is:

JMH makes JVM benchmarking possible. Engineering discipline makes it meaningful.

This part is a practical deep dive into JMH as an evidence tool.

We will focus on correctness before speed.

A wrong benchmark is worse than no benchmark because it gives confidence to the wrong decision.

1. What microbenchmarks are for

A microbenchmark answers a narrow question:

Under controlled JVM/runtime conditions, how does this small unit of code behave under a specified workload shape?

Good uses:

comparing two parsing strategies;
measuring allocation rate of two object construction paths;
choosing between data structure strategies under a specific lookup/write distribution;
understanding whether an optimization changes CPU cost, allocation cost, or branch behavior;
proving a local performance regression before changing an implementation;
building repeatable evidence for a method-level performance patch.

Bad uses:

proving end-to-end service capacity;
predicting database latency from an in-memory mock;
claiming a framework is faster from one synthetic request path;
choosing production architecture from a benchmark that ignores network, GC, data size, contention, and error paths;
ranking code using average latency only.

The boundary matters.

A microbenchmark is most valuable when you can clearly say:

This benchmark does not prove the whole system is fast.
It proves this local implementation behaves better/worse under this defined workload.

2. Minimal Maven setup

A serious Java codebase should keep JMH benchmarks separate from normal tests.

A common layout:

project/
  src/main/java/...
  src/test/java/...
  src/jmh/java/...
  pom.xml

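A minimal Maven configuration can look like this. The test scope keeps JMH off the production classpath, but it assumes the benchmark sources are compiled as test sources. Maven does not treat src/jmh/java as a source root by default, so that directory has to be registered explicitly (for example with build-helper-maven-plugin):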

<properties>
    <jmh.version>1.37</jmh.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-core</artifactId>
        <version>${jmh.version}</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>${jmh.version}</version>
        <scope>test</scope>
    </dependency>
</dependencies>

For production-grade usage, create a separate benchmark module:

service-core/
service-adapters/
service-benchmarks/

Why?

Because benchmarks often need:

larger fixtures;
multiple implementations;
benchmark-specific dependencies;
generated data files;
custom JVM arguments;
CI isolation;
result archives.

Do not pollute normal test execution with benchmark execution. Benchmarks are evidence jobs, not regular unit tests.

3. Minimal benchmark class

Example: compare two normalization strategies.

package com.acme.benchmark;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Locale;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
public class NormalizationBenchmark {

    @State(Scope.Thread)
    public static class InputState {
        @Param({"SMALL", "MEDIUM", "LARGE"})
        public String size;

        public String input;

        @Setup
        public void setup() {
            input = switch (size) {
                case "SMALL" -> "  Case-123  ";
                case "MEDIUM" -> "  Regulatory Case Submission 12345 / Region APAC  ";
                case "LARGE" -> "  ".repeat(32) + "Regulatory Case Submission 12345 / Region APAC" + "  ".repeat(32);
                default -> throw new IllegalArgumentException(size);
            };
        }
    }

    @Benchmark
    public String trimThenLowercase(InputState state) {
        return state.input.trim().toLowerCase(Locale.ROOT);
    }

    @Benchmark
    public String lowercaseThenTrim(InputState state) {
        return state.input.toLowerCase(Locale.ROOT).trim();
    }
}

This benchmark already shows several important rules:

benchmark methods are annotated with @Benchmark;
parameters are explicit with @Param;
input setup is separated from measurement;
warmup and measurement are configured;
multiple forks isolate JVM runs;
result is returned so the JVM cannot simply discard the computation;
output time unit is explicit;
locale is explicit, not environment-dependent.

A benchmark should read like an experiment, not like a random code snippet.

4. Benchmark modes

JMH supports multiple benchmark modes. Choose based on the question.

Mode	Measures	Use when
`Mode.AverageTime`	average time per operation	comparing operation cost
`Mode.Throughput`	operations per time unit	maximizing completed operations
`Mode.SampleTime`	sampled latency distribution	understanding tail-ish behavior in micro context
`Mode.SingleShotTime`	one invocation timing	cold-start-ish operations, setup-heavy paths
`Mode.All`	all modes	exploration, not final evidence

Avoid defaulting to throughput because it looks impressive.

For small local operations, AverageTime is often easier to reason about:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)

For batch-oriented operations, throughput may be more natural:

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)

For latency-sensitive code, average time is not enough. Consider SampleTime, but remember: microbenchmark latency distribution is not the same as service p99. Service tail latency includes queueing, locks, IO, GC, scheduling, network, database, and downstream services.

5. Warmup, measurement, and forks

The JVM changes behavior while your program runs.

During warmup:

methods become hot;
bytecode is interpreted, then compiled;
call sites collect type profiles;
branches collect probability profiles;
allocations may be optimized;
methods may be inlined;
speculative assumptions may be made;
deoptimization can happen later if assumptions fail.

So a benchmark with no warmup is usually measuring startup/transient behavior, not steady-state behavior.

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)

Interpretation:

each fork launches a fresh JVM;
warmup iterations are not reported as final measurement;
measurement iterations are reported;
multiple forks reduce single-JVM accident risk.
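These settings also decide how many samples a run reports and roughly how long it takes. The arithmetic can be sketched with a small helper; RunBudget is a hypothetical class for illustration, not part of JMH, and it ignores JVM startup and setup time:

```java
// Hypothetical helper, not part of JMH: estimates the sample count and the
// wall-clock time of a run, ignoring JVM startup and @Setup cost.
final class RunBudget {

    // Samples reported per benchmark/parameter row (the "Cnt" column).
    static int samples(int forks, int measurementIterations) {
        return forks * measurementIterations;
    }

    // Approximate seconds for the whole run when every iteration lasts iterationSeconds.
    static long totalSeconds(int forks, int warmupIterations, int measurementIterations,
                             int iterationSeconds, int benchmarkMethods, int paramCombinations) {
        long perFork = (long) (warmupIterations + measurementIterations) * iterationSeconds;
        return perFork * forks * benchmarkMethods * paramCombinations;
    }

    public static void main(String[] args) {
        // Settings from NormalizationBenchmark: 3 forks, 5 warmup and 10 measurement
        // iterations of 1 s each, 2 benchmark methods, 3 values of the size parameter.
        System.out.println("samples per row: " + samples(3, 10));
        System.out.println("approx seconds:  " + totalSeconds(3, 5, 10, 1, 2, 3));
    }
}
```

For the NormalizationBenchmark settings this gives 30 samples per row and about 270 seconds of iteration time, which is worth knowing before adding more parameters or forks.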

A serious benchmark result should include:

Benchmark: ValidationEngineBenchmark.evaluate
JDK: Eclipse Temurin 21.0.x / OpenJDK 25.x
OS: Linux x86_64
CPU: c6i.2xlarge / local machine model
Forks: 5
Warmup: 10 x 1s
Measurement: 20 x 1s
Mode: AverageTime
Unit: ns/op
Params: rules=100, facts=20, hitRate=0.25
Profiler: gc

Do not compare benchmark results from different machines unless the purpose is explicitly cross-machine comparison.

6. State scope

@State controls how benchmark state is shared.

Scope	Meaning	Use when
`Scope.Thread`	each benchmark thread gets its own state	no sharing, per-thread local work
`Scope.Benchmark`	state shared across all benchmark threads	shared data structure, contention, cache, global registry
`Scope.Group`	state shared within a benchmark group	producer/consumer or mixed read/write group

Example:

@State(Scope.Thread)
public static class PerThreadState {
    byte[] payload;

    @Setup
    public void setup() {
        payload = new byte[1024];
    }
}

For concurrent data structure benchmarking:

@State(Scope.Benchmark)
public static class SharedMapState {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
}

Wrong state scope can invalidate the benchmark.

If production has shared contention but your benchmark uses Scope.Thread, you measured a fantasy.

If production has per-request local state but your benchmark uses shared state, you measured artificial contention.

7. Setup levels

JMH supports setup at different levels:

@Setup(Level.Trial)
public void setupOncePerFork() {}

@Setup(Level.Iteration)
public void setupBeforeEachIteration() {}

@Setup(Level.Invocation)
public void setupBeforeEachInvocation() {}

Use them carefully.

Level	Cost included in measured operation?	Typical use
`Trial`	no	large immutable fixtures, lookup tables
`Iteration`	mostly no	reset moderate state per measurement iteration
`Invocation`	setup overhead can dominate	avoid unless operation mutates state and must be reset every call

Level.Invocation is dangerous because it can dominate or distort tiny benchmark operations.

Example of dangerous benchmark:

@Setup(Level.Invocation)
public void setup() {
    list = new ArrayList<>(initialData);
}

@Benchmark
public Object removeFirst() {
    return list.remove(0);
}

This may measure copying/setup more than removal. Better options:

benchmark a batch operation;
include setup intentionally and name the benchmark accordingly;
use larger operation granularity;
use @OperationsPerInvocation if applicable;
write separate benchmarks for setup cost and operation cost.

8. Returning values vs Blackhole

The JVM is allowed to remove work if the result is unused and observable behavior does not change.

Bad:

@Benchmark
public void bad() {
    expensiveComputation();
}

If expensiveComputation() has no observable side effects, the optimizer may eliminate it.

Better:

@Benchmark
public Result good() {
    return expensiveComputation();
}

Or use Blackhole:

@Benchmark
public void goodWithBlackhole(Blackhole blackhole) {
    blackhole.consume(expensiveComputation());
}

Prefer returning the result when natural.

Use Blackhole when:

there are multiple intermediate results;
the method naturally returns void but produces values internally;
you need to consume several outputs;
returning would distort benchmark structure.

But do not use Blackhole as ritual decoration. It is a tool for preserving benchmark semantics.

9. Dead-code elimination

Dead-code elimination happens when the JVM concludes work has no observable effect.

Example:

@Benchmark
public void eliminated() {
    int x = 1 + 2;
}

This benchmark may measure almost nothing.

Correct benchmark:

@Benchmark
public int notEliminated() {
    return 1 + 2;
}

But even this can be suspicious because constant folding can precompute the result.

10. Constant folding

The JVM can precompute constant expressions.

Bad:

@Benchmark
public int bad() {
    return hash("constant-input");
}

If the method is pure and input constant, the optimizer may simplify more than production would.

Better:

@State(Scope.Thread)
public static class Inputs {
    @Param({"case-123", "case-456", "case-789"})
    String input;
}

@Benchmark
public int better(Inputs inputs) {
    return hash(inputs.input);
}

Even better: use a representative corpus.

@State(Scope.Thread)
public static class Corpus {
    List<String> values;
    int index;

    @Setup
    public void setup() {
        values = List.of("case-123", "submission-456", "appeal-789");
    }

    String next() {
        String value = values.get(index);
        index = (index + 1) % values.size();
        return value;
    }
}

@Benchmark
public int hashCorpus(Corpus corpus) {
    return hash(corpus.next());
}

But be careful: the benchmark now includes index update and list access. That may be fine if the operation is large enough. If not, increase operation granularity.

11. Loop benchmarks

A common mistake is putting a loop inside a benchmark and reporting per-call time incorrectly.

@Benchmark
public int loopInsideBenchmark() {
    int sum = 0;
    for (int i = 0; i < 1_000; i++) {
        sum += compute(i);
    }
    return sum;
}

This measures a batch of 1,000 operations. That is not wrong, but the unit is now “batch operation,” not one compute().

If you want to express that, use:

@OperationsPerInvocation(1_000)
@Benchmark
public int batchedCompute() {
    int sum = 0;
    for (int i = 0; i < 1_000; i++) {
        sum += compute(i);
    }
    return sum;
}

Batching is useful when each operation is too small and measurement overhead dominates.

But batching can hide:

branch behavior;
cache misses;
allocation bursts;
per-request setup cost;
synchronization cost;
real data distribution.

Use batching intentionally.

12. Avoid benchmarking the wrong thing

A benchmark method contains three kinds of work:

measured operation = target work + harness work + accidental work

Accidental work includes:

generating random values per invocation;
parsing fixtures per invocation;
allocating test data per invocation;
logging;
calling assertions;
using synchronized fixture access;
using nonrepresentative adapters;
consuming data structure setup cost accidentally.

Example of accidental benchmark:

@Benchmark
public boolean contains() {
    List<String> data = loadData();        // accidental setup
    String key = UUID.randomUUID().toString(); // accidental random generation
    return data.contains(key);
}

Better:

@State(Scope.Thread)
public static class ContainsState {
    List<String> data;
    String[] keys;
    int index;

    @Setup(Level.Trial)
    public void setup() {
        data = loadDataOnce();
        keys = loadKeysOnce();
    }

    String nextKey() {
        String key = keys[index];
        index = (index + 1) % keys.length;
        return key;
    }
}

@Benchmark
public boolean contains(ContainsState state) {
    return state.data.contains(state.nextKey());
}

Now the benchmark measures lookup plus lightweight key selection.

If key selection is still significant, benchmark key selection separately or increase operation granularity.

13. Workload realism

The most dangerous benchmark is technically correct but semantically irrelevant.

Example:

Production:
- 80% small payloads
- 15% medium payloads
- 5% very large payloads
- 70% successful validation
- 20% rejected validation
- 10% exceptional/edge path

Benchmark:
- one small successful payload repeated forever

The benchmark will collect a clean JVM profile:

predictable branches;
stable receiver types;
hot cache paths;
no error allocation;
no large payload pressure;
no branch misprediction;
no polymorphic call sites.

Then it may claim a speedup that disappears in production.

A representative benchmark needs workload dimensions:

Dimension	Example
Input size	small / medium / large / pathological
Hit ratio	cache hit / cache miss
Branch mix	valid / invalid / exceptional
Data shape	flat / nested / sparse / duplicated
Type profile	one implementation / many implementations
State	empty / warm / saturated
Allocation	no allocation / moderate / bursty
Locality	sequential / random
Concurrency	single-threaded / contended / mixed role

Do not benchmark only the happy path unless production only has happy paths.
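One way to bring a production-like mix into a benchmark corpus is weighted sampling with a fixed seed. The sketch below is illustrative only: WorkloadMix and its size labels are hypothetical names, and the weights simply reuse the 80/15/5 payload split from the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: draws payload size classes according to a production-like mix.
final class WorkloadMix {

    static final String[] CLASSES = {"SMALL", "MEDIUM", "LARGE"};
    static final double[] WEIGHTS = {0.80, 0.15, 0.05}; // payload mix from the example

    // Builds a corpus of size labels; the fixed seed keeps runs reproducible.
    static List<String> sample(int count, long seed) {
        Random random = new Random(seed);
        List<String> corpus = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            double roll = random.nextDouble();
            double cumulative = 0.0;
            String chosen = CLASSES[CLASSES.length - 1]; // fallback guards against rounding
            for (int c = 0; c < CLASSES.length; c++) {
                cumulative += WEIGHTS[c];
                if (roll < cumulative) {
                    chosen = CLASSES[c];
                    break;
                }
            }
            corpus.add(chosen);
        }
        return corpus;
    }

    public static void main(String[] args) {
        List<String> corpus = sample(100_000, 42L);
        long small = corpus.stream().filter("SMALL"::equals).count();
        System.out.println("SMALL share: " + (double) small / corpus.size());
    }
}
```

A corpus like this belongs in a trial-level setup method, so the sampling cost stays outside the measured operation.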

14. Parametrize the experiment

Use @Param to expose workload dimensions.

@State(Scope.Thread)
public static class ValidationState {
    @Param({"10", "100", "1000"})
    int ruleCount;

    @Param({"0.10", "0.50", "0.90"})
    double hitRate;

    RuleEngine engine;
    List<Request> requests;
    int index;

    @Setup(Level.Trial)
    public void setup() {
        engine = RuleEngineFactory.create(ruleCount);
        requests = RequestCorpus.generate(ruleCount, hitRate, 10_000);
    }

    Request next() {
        Request request = requests.get(index);
        index = (index + 1) % requests.size();
        return request;
    }
}

@Benchmark
public Decision evaluate(ValidationState state) {
    return state.engine.evaluate(state.next());
}

Now the benchmark is not one number. It is a matrix.

ruleCount=10, hitRate=0.10
ruleCount=10, hitRate=0.50
ruleCount=10, hitRate=0.90
ruleCount=100, hitRate=0.10
...

This often reveals nonlinear behavior:

algorithm performs well for small n, badly for large n;
cache works at 90% hit rate, collapses at 50%;
branchy code wins for predictable input, loses for mixed input;
allocation strategy is fine until payload size crosses threshold.

The purpose of @Param is not convenience. It is experimental structure.
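JMH runs the full cross product of the declared @Param values, so the matrix grows multiplicatively with each new dimension. A small sketch that enumerates the rows for the two parameters used here (ParamMatrix is a hypothetical helper, not a JMH class):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerates the parameter combinations a run will cover.
final class ParamMatrix {

    // Cross product of the two parameter lists, in declaration order.
    static List<String> combinations(String[] ruleCounts, String[] hitRates) {
        List<String> rows = new ArrayList<>();
        for (String ruleCount : ruleCounts) {
            for (String hitRate : hitRates) {
                rows.add("ruleCount=" + ruleCount + ", hitRate=" + hitRate);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> rows = combinations(
                new String[] {"10", "100", "1000"},
                new String[] {"0.10", "0.50", "0.90"});
        rows.forEach(System.out::println);
        System.out.println(rows.size() + " combinations per benchmark method");
    }
}
```

Nine rows per benchmark method is manageable; adding a third parameter with three values already makes it 27, which is a reason to choose dimensions deliberately.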

15. Benchmarking allocation

For Java, allocation rate is often more important than raw CPU time.

Two implementations with similar ns/op may differ drastically in B/op.

Run with GC profiler:

java -jar target/benchmarks.jar ValidationBenchmark -prof gc

Example result shape:

Benchmark                       Mode  Cnt     Score    Error   Units
oldEngine                       avgt   30   820.000 ± 20.000   ns/op
oldEngine:·gc.alloc.rate.norm   avgt   30  2048.000           B/op
newEngine                       avgt   30   760.000 ± 18.000   ns/op
newEngine:·gc.alloc.rate.norm   avgt   30   128.000           B/op

Interpretation:

The new engine is only ~7% faster by average time, but reduces allocation by ~94%.
This may reduce GC pressure and improve tail latency under service load.
Need macrobenchmark/load test to confirm system-level effect.

Allocation is not automatically evil. Short-lived allocation can be cheap. But high allocation rate can create GC pressure, memory bandwidth pressure, cache churn, and latency instability.

Benchmark reports should include allocation evidence when optimization changes object creation.
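The percentages in the interpretation, and what B/op means at a given operation rate, can be checked with a few lines of arithmetic. AllocationDelta is a hypothetical helper, and the load of 100,000 operations per second is an assumption chosen only for illustration:

```java
// Hypothetical helper for reading gc-profiler numbers; not part of JMH.
final class AllocationDelta {

    // Relative reduction from before to after, as a percentage.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    // Allocation rate in MB/s implied by B/op at a given operation rate.
    static double megabytesPerSecond(double bytesPerOp, double opsPerSecond) {
        return bytesPerOp * opsPerSecond / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.printf("time reduction:       %.1f%%%n", reductionPercent(820.0, 760.0));
        System.out.printf("allocation reduction: %.1f%%%n", reductionPercent(2048.0, 128.0));
        // Assumed load of 100,000 ops/s, chosen only for illustration.
        System.out.printf("old engine: %.1f MB/s%n", megabytesPerSecond(2048.0, 100_000));
        System.out.printf("new engine: %.1f MB/s%n", megabytesPerSecond(128.0, 100_000));
    }
}
```

At the assumed load the old engine would allocate about 204.8 MB/s and the new one about 12.8 MB/s, which is the kind of difference that a load test should then confirm or refute.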

16. Benchmarking polymorphism and call-site profile

JVM performance depends heavily on call-site profiles.

A benchmark with one implementation may become monomorphic:

interface Rule {
    boolean applies(Request request);
}

final class CountryRule implements Rule { ... }

Bad benchmark:

Rule rule = new CountryRule();

Production may use many rule implementations:

CountryRule
ProductRule
CustomerSegmentRule
RiskRule
DateWindowRule
ManualOverrideRule

The JIT may optimize the monomorphic benchmark aggressively, while production sees polymorphic or megamorphic dispatch.

Better benchmark:

@State(Scope.Thread)
public static class RuleState {
    List<Rule> rules;
    Request request;

    @Param({"MONOMORPHIC", "POLYMORPHIC", "MEGAMORPHIC"})
    String profile;

    @Setup
    public void setup() {
        rules = switch (profile) {
            case "MONOMORPHIC" -> List.of(
                new CountryRule(), new CountryRule(), new CountryRule()
            );
            case "POLYMORPHIC" -> List.of(
                new CountryRule(), new ProductRule(), new DateWindowRule()
            );
            case "MEGAMORPHIC" -> List.of(
                new CountryRule(), new ProductRule(), new DateWindowRule(),
                new RiskRule(), new CustomerSegmentRule(), new ManualOverrideRule()
            );
            default -> throw new IllegalArgumentException(profile);
        };
        request = RequestFixtures.valid();
    }
}

@Benchmark
public int evaluateRules(RuleState state) {
    int matched = 0;
    for (Rule rule : state.rules) {
        if (rule.applies(state.request)) {
            matched++;
        }
    }
    return matched;
}

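This does not perfectly reproduce production (HotSpot, for example, typically profiles only two receiver types per call site, so three distinct types may already behave as megamorphic), but it makes type profile an explicit variable.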

That is the point.

17. Benchmarking branch behavior

Branch predictability matters.

Bad:

@Benchmark
public int alwaysValid() {
    return validator.validate(validRequest).score();
}

If the branch is always true, the CPU and JIT can optimize for that path.

Better:

@Param({"0.00", "0.25", "0.50", "0.75", "1.00"})
double invalidRate;

Generate a corpus with controlled invalid rate:

requests = RequestCorpus.withInvalidRate(invalidRate, 10_000);

This reveals behavior under branch mix.

The fastest implementation under 0% invalid may not be fastest under 50% invalid.
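RequestCorpus.withInvalidRate is not defined in this lesson. A minimal sketch of the idea, reduced to validity flags and using the hypothetical name InvalidRateCorpus, fixes the exact invalid share first and then shuffles with a seed so that invalid inputs are neither clustered nor different between runs:

```java
import java.util.Random;

// Hypothetical stand-in for RequestCorpus.withInvalidRate: produces validity flags
// with an exact invalid share, then shuffles so invalid inputs are not clustered.
final class InvalidRateCorpus {

    static boolean[] invalidFlags(double invalidRate, int size, long seed) {
        int invalidCount = (int) Math.round(invalidRate * size);
        boolean[] flags = new boolean[size];
        for (int i = 0; i < invalidCount; i++) {
            flags[i] = true; // true marks an invalid request
        }
        // Fisher-Yates shuffle with a fixed seed for reproducible runs.
        Random random = new Random(seed);
        for (int i = size - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            boolean tmp = flags[i];
            flags[i] = flags[j];
            flags[j] = tmp;
        }
        return flags;
    }

    static int countInvalid(boolean[] flags) {
        int count = 0;
        for (boolean flag : flags) {
            if (flag) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        boolean[] flags = invalidFlags(0.25, 10_000, 7L);
        System.out.println("invalid: " + countInvalid(flags) + " of " + flags.length);
    }
}
```

The shuffle matters: a corpus with all invalid requests grouped together has the right rate but a far more predictable branch pattern than an interleaved one.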

18. Benchmarking cache effects

Repeatedly benchmarking the same object may measure a hot-cache fantasy.

Bad:

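@Benchmark
public Decision sameRequestEveryTime(EngineState state) {
    return state.engine.evaluate(state.request);
}

Better:

// EngineState is a placeholder name; a nested class called State would shadow JMH's @State annotation
@Benchmark
public Decision corpus(EngineState state) {
    return state.engine.evaluate(state.nextRequest());
}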

But even corpus cycling can become predictable.

For local microbenchmarks, this may be acceptable if the corpus is large enough and the benchmark question is local. For realistic cache/memory behavior, move to component or macrobenchmark.

Use this decision rule:

If cache locality is central to the performance question, do not hide it.
Make cache state a benchmark parameter.

Example:

@Param({"100", "10000", "1000000"})
int corpusSize;

19. Multi-threaded JMH benchmarks

Use multi-threaded JMH when the unit under test is shared or contention-sensitive.

Example:

@State(Scope.Benchmark)
public static class SharedCounterState {
    AtomicLong counter = new AtomicLong();
}

@Benchmark
@Threads(8)
public long increment(SharedCounterState state) {
    return state.counter.incrementAndGet();
}

This measures contention on one shared counter.

But ask:

Does production have one shared counter?
How many threads contend?
Are threads CPU-bound or blocking?
Is the contention local or distributed?
Does production use batching/sharding?

A contention microbenchmark is useful, but easy to overgeneralize.

For mixed workloads, use groups.

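@State(Scope.Group)
public static class SharedMap {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    String[] keys;

    @Setup
    public void setup() {
        // 1024 keys, so the "& 1023" mask below always yields a valid index
        keys = new String[1024];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "key-" + i;
            map.put(keys[i], "value");
        }
    }
}

// Per-thread cursor: a plain int shared by all group threads would be a data race
@State(Scope.Thread)
public static class Cursor {
    int index;
}

@Group("map")
@GroupThreads(3)
@Benchmark
public String read(SharedMap state, Cursor cursor) {
    return state.map.get(state.keys[cursor.index++ & 1023]);
}

@Group("map")
@GroupThreads(1)
@Benchmark
public String write(SharedMap state, Cursor cursor) {
    return state.map.put(state.keys[cursor.index++ & 1023], "value");
}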

This models a 3:1 reader/writer group.

Still, it is not a service load test. It is local contention evidence.

20. False sharing

False sharing happens when independent variables used by different threads occupy the same cache line, causing unnecessary coherence traffic.

In benchmarks, false sharing can appear accidentally in state objects.

Example danger:

@State(Scope.Benchmark)
public static class Counters {
    volatile long a;
    volatile long b;
}

Thread 1 updates a, thread 2 updates b. They may still contend at the cache-line level.

JMH adds padding around its generated state objects, which helps in some contexts, but fields declared side by side in one state class can still share a cache line. The deeper rule is:

If your benchmark is concurrent, think about memory layout and sharing.

Do not benchmark concurrent performance without looking at:

shared mutable state;
cache line interaction;
synchronization path;
allocation path;
object graph locality;
thread count;
CPU topology.
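One common manual approach to separating two hot fields is padding through a class hierarchy. The sketch below is plain Java rather than JMH, all class names are hypothetical, and it rests on two assumptions: a 64-byte cache line, and a JVM that lays out superclass fields before subclass fields. Object layout is implementation-specific, so the result should be verified with a layout inspection tool before relying on it.

```java
// Sketch of manual padding through a class hierarchy (names are hypothetical).
// Assumptions: 64-byte cache lines, and superclass fields laid out before
// subclass fields. Neither is guaranteed by the Java language.
final class FalseSharingSketch {

    // Runs two threads, each incrementing its own counter, and returns the counters.
    static PaddedCounters run(int increments) {
        PaddedCounters counters = new PaddedCounters();
        // Each field has exactly one writer thread, so the non-atomic ++ is safe here.
        Thread first = new Thread(() -> {
            for (int i = 0; i < increments; i++) counters.a++;
        });
        Thread second = new Thread(() -> {
            for (int i = 0; i < increments; i++) counters.b++;
        });
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counters;
    }

    public static void main(String[] args) {
        PaddedCounters counters = run(1_000_000);
        System.out.println("a=" + counters.a + " b=" + counters.b);
    }
}

// 7 longs = 56 bytes of padding on each side of the hot fields.
class PadBefore { long p1, p2, p3, p4, p5, p6, p7; }
class CounterA extends PadBefore { volatile long a; }
class PadBetween extends CounterA { long q1, q2, q3, q4, q5, q6, q7; }
class CounterB extends PadBetween { volatile long b; }
final class PaddedCounters extends CounterB { long r1, r2, r3, r4, r5, r6, r7; }
```

Comparing this layout against the unpadded Counters class in a two-thread benchmark is a reasonable way to find out whether false sharing is actually present on the target hardware.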

21. Profilers in JMH

JMH can run with profilers:

java -jar target/benchmarks.jar MyBenchmark -prof gc
java -jar target/benchmarks.jar MyBenchmark -prof stack
java -jar target/benchmarks.jar MyBenchmark -prof perfasm

Common use:

Profiler	Useful for
`gc`	allocation rate, GC count/time
`stack`	rough stack sampling
`perf` / `perfasm`	native profiling / assembly-level analysis on supported platforms
external async-profiler	CPU/allocation/wall/lock analysis outside or alongside JMH

Do not start with assembly unless needed.

A practical flow:

1. Run benchmark normally.
2. Run with -prof gc.
3. If CPU question remains, use async-profiler/flamegraph.
4. If compiler/codegen question remains, inspect perfasm/JIT logs.
5. Confirm with macrobenchmark or production profiling if the change is system-relevant.

22. Reading JMH output

Example:

Benchmark                       (size)  Mode  Cnt     Score    Error   Units
ParserBenchmark.regex            SMALL  avgt   30   410.230 ± 12.100  ns/op
ParserBenchmark.manual           SMALL  avgt   30   180.840 ±  8.540  ns/op
ParserBenchmark.regex           MEDIUM  avgt   30  1550.120 ± 60.120  ns/op
ParserBenchmark.manual          MEDIUM  avgt   30   780.440 ± 35.230  ns/op

Do not report only:

manual is faster

Report:

For SMALL and MEDIUM input sizes, the manual parser is roughly 2.0x to 2.3x faster in avgt mode under the benchmarked corpus. However, we need to compare correctness coverage, maintenance risk, Unicode handling, allocation rate, and behavior on malformed/pathological input before replacing regex globally.

Performance decision requires engineering context.
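The ratio and a rough check of whether two rows are distinguishable can be computed directly from the Score and Error columns. ScoreComparison is a hypothetical helper, not part of JMH. JMH reports Error as the half-width of a confidence interval (99.9% by default), and treating non-overlapping intervals as "clearly different" is only a screening heuristic, not a full statistical test.

```java
// Hypothetical helper for comparing two JMH result rows; not part of JMH.
final class ScoreComparison {

    // Ratio of baseline to candidate score; above 1.0 means the candidate is faster
    // in time-per-operation modes such as avgt.
    static double speedup(double baselineScore, double candidateScore) {
        return baselineScore / candidateScore;
    }

    // True when the intervals [score - error, score + error] of the two rows overlap.
    static boolean intervalsOverlap(double scoreA, double errorA,
                                    double scoreB, double errorB) {
        return scoreA - errorA <= scoreB + errorB
                && scoreB - errorB <= scoreA + errorA;
    }

    public static void main(String[] args) {
        // Values taken from the ParserBenchmark table.
        System.out.printf("SMALL speedup:  %.2fx%n", speedup(410.230, 180.840));
        System.out.printf("MEDIUM speedup: %.2fx%n", speedup(1550.120, 780.440));
        System.out.println("SMALL intervals overlap: "
                + intervalsOverlap(410.230, 12.100, 180.840, 8.540));
    }
}
```

When the intervals do overlap, the honest report is "no measurable difference under this setup", not a winner.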

23. The benchmark review checklist

Every serious benchmark should pass review.

23.1 Question

What decision will this benchmark inform?

Bad:

See which one is faster.

Good:

Decide whether replacing regex-based case reference parsing with manual scanning reduces CPU/allocation cost for the top 5 production input shapes without breaking malformed-input behavior.

23.2 Boundary

Is this method-level, component-level, or service-level evidence?

23.3 Workload

Are input sizes, hit rates, branch mix, and data shape representative?

23.4 State

Is benchmark state shared or per-thread in the same way as production?

23.5 Setup

Is setup excluded or included intentionally?

23.6 JVM behavior

Are warmup, forks, compiler profile, and allocation behavior considered?

23.7 Result use

Will this result trigger direct code change, deeper profiling, or macrobenchmark confirmation?

24. Common anti-patterns

Anti-pattern 1: One input forever

@Benchmark
public Output benchmark() {
    return parser.parse("CASE-123");
}

Why it lies:

branch profile too clean;
cache too hot;
no malformed input;
no size variation;
may be constant-folded or over-specialized.

Anti-pattern 2: Benchmarking random generation

@Benchmark
public Output benchmark() {
    return parser.parse(UUID.randomUUID().toString());
}

Why it lies:

random generation dominates;
input distribution may not match production;
benchmark becomes noisy.

Anti-pattern 3: Measuring logging

@Benchmark
public void benchmark() {
    log.info("value {}", service.compute());
}

Why it lies:

logging backend/config dominates;
async logging may move cost elsewhere;
result may depend on environment.

Anti-pattern 4: Benchmarking assertions

@Benchmark
public void benchmark() {
    assertThat(service.compute()).isEqualTo(expected);
}

Why it lies:

assertion library cost is included;
benchmark becomes a test;
use tests for correctness, benchmark for cost.

Anti-pattern 5: Comparing without allocation evidence

Implementation A: 200 ns/op
Implementation B: 190 ns/op

But:

A: 16 B/op
B: 2,048 B/op

The faster local operation may create worse system behavior under load.

25. Benchmarking correctness oracle

A benchmark must not replace correctness tests.

Before benchmarking two implementations, prove they are semantically equivalent for the intended domain.

class ParserEquivalenceTest {

    @Property
    void manualAndRegexParserAgree(@ForAll("caseReferences") String input) {
        ParseResult regex = RegexParser.parse(input);
        ParseResult manual = ManualParser.parse(input);

        assertThat(manual).isEqualTo(regex);
    }
}

Then benchmark:

@Benchmark
public ParseResult regex(ParserState state) {
    return RegexParser.parse(state.next());
}

@Benchmark
public ParseResult manual(ParserState state) {
    return ManualParser.parse(state.next());
}

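The evidence chain is: equivalence tests first establish that both parsers agree, and only then does the benchmark compare their cost.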

Never optimize into semantic drift.
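Where a property-testing library is not available, even a plain differential check over a fixed corpus catches obvious drift. The sketch below applies the idea to the two normalization strategies from section 3; the class name and the corpus are illustrative assumptions, and a fixed corpus is much weaker evidence than generated inputs.

```java
import java.util.List;
import java.util.Locale;

// Minimal differential check without a property-testing library. The two
// normalizers mirror the benchmark methods from section 3; the corpus is illustrative.
final class NormalizationEquivalence {

    static String trimThenLowercase(String input) {
        return input.trim().toLowerCase(Locale.ROOT);
    }

    static String lowercaseThenTrim(String input) {
        return input.toLowerCase(Locale.ROOT).trim();
    }

    // Returns the first input on which the implementations disagree, or null if none.
    static String firstDisagreement(List<String> corpus) {
        for (String input : corpus) {
            if (!trimThenLowercase(input).equals(lowercaseThenTrim(input))) {
                return input;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("  Case-123  ", "", "   ", "\tAPPEAL-789\n", "Région APAC");
        String disagreement = firstDisagreement(corpus);
        System.out.println(disagreement == null
                ? "implementations agree on the corpus"
                : "disagreement on: [" + disagreement + "]");
    }
}
```

Running a check like this as a normal test, before the benchmark job, keeps the benchmark itself free of assertions.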

26. Case study: validation rule engine

Suppose a regulatory case platform has a rule engine:

public interface Rule {
    boolean applies(CaseSubmission submission);
    Violation violation();
}

public final class RuleEngine {
    private final List<Rule> rules;

    public ValidationResult validate(CaseSubmission submission) {
        List<Violation> violations = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.applies(submission)) {
                violations.add(rule.violation());
            }
        }
        return new ValidationResult(violations);
    }
}

A proposed optimization changes rule storage from list scan to indexed rules.

Weak benchmark:

@Benchmark
public ValidationResult validate() {
    return engine.validate(validSubmission);
}

Better benchmark design:

Dimension	Values
rule count	10 / 100 / 1000
submission size	small / medium / large
violation rate	0% / 10% / 50%
rule profile	monomorphic / polymorphic
result allocation	full list / early exit / lazy result
corpus size	100 / 10,000

Benchmark state:

@State(Scope.Thread)
public static class RuleEngineState {
    @Param({"10", "100", "1000"})
    int ruleCount;

    @Param({"0.0", "0.1", "0.5"})
    double violationRate;

    @Param({"LIST", "INDEXED"})
    String implementation;

    RuleEngine engine;
    List<CaseSubmission> submissions;
    int index;

    @Setup(Level.Trial)
    public void setup() {
        RuleCorpus corpus = RuleCorpus.generate(ruleCount, violationRate, 10_000);
        this.engine = switch (implementation) {
            case "LIST" -> RuleEngineFactory.listBased(corpus.rules());
            case "INDEXED" -> RuleEngineFactory.indexed(corpus.rules());
            default -> throw new IllegalArgumentException(implementation);
        };
        this.submissions = corpus.submissions();
    }

    CaseSubmission next() {
        CaseSubmission submission = submissions.get(index);
        index = (index + 1) % submissions.size();
        return submission;
    }
}

@Benchmark
public ValidationResult validate(RuleEngineState state) {
    return state.engine.validate(state.next());
}

Now the benchmark can reveal:

indexed engine wins only after ruleCount >= 100;
indexed engine allocates more during setup but less per validation;
list engine wins for small rule sets;
indexed engine has worse cold-start cost;
polymorphic rules reduce the expected win;
violation-heavy workloads allocate more result data.

This is decision-grade evidence.

27. JMH in CI

Do not run every benchmark on every commit by default.

Better layers:

Layer	Trigger	Purpose
local benchmark	developer command	explore implementation alternatives
PR smoke benchmark	opt-in label or changed performance-sensitive path	catch obvious regression
nightly benchmark	scheduled dedicated runner	trend tracking
release benchmark	before release candidate	release confidence
investigation benchmark	incident/regression analysis	root cause support

CI benchmark rules:

use dedicated runners if possible;
pin JDK version;
record CPU/machine metadata;
avoid noisy shared runners for strict gates;
compare against historical baseline, not arbitrary threshold;
archive raw JMH JSON;
archive profiler artifacts for important runs;
do not block PRs on statistically weak evidence;
require human review for benchmark meaning.
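Comparing against a historical baseline can start as a simple screen: flag a run only when its score lies well outside the spread of earlier runs on the same runner. BaselineGate below is a hypothetical sketch, and the three-sigma threshold is an assumption to tune per benchmark rather than a recommended constant.

```java
// Hypothetical regression screen against historical scores (ns/op, lower is better).
// The sigma threshold is an assumption to tune per benchmark, not a JMH feature.
final class BaselineGate {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // Sample standard deviation (divides by n - 1).
    static double standardDeviation(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two historical scores");
        }
        double mean = mean(values);
        double squares = 0.0;
        for (double value : values) {
            squares += (value - mean) * (value - mean);
        }
        return Math.sqrt(squares / (values.length - 1));
    }

    // Flags the current score only when it exceeds the historical mean by more
    // than the given number of standard deviations.
    static boolean isRegression(double[] history, double current, double sigmas) {
        return current > mean(history) + sigmas * standardDeviation(history);
    }

    public static void main(String[] args) {
        double[] history = {100.0, 102.0, 98.0, 101.0, 99.0};
        System.out.println("104 ns/op flagged: " + isRegression(history, 104.0, 3.0));
        System.out.println("110 ns/op flagged: " + isRegression(history, 110.0, 3.0));
    }
}
```

A flag from a screen like this is a prompt for human review and a rerun, not an automatic verdict.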

JMH can output JSON:

java -jar target/benchmarks.jar \
  -rf json \
  -rff target/jmh-results.json

Store it.

Benchmark evidence that cannot be revisited becomes tribal memory.
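A stored result is only revisitable if the environment that produced it is stored with it. The sketch below collects a few host and JVM facts using standard system properties and the runtime MXBean; the `RunMetadata` class name and the choice of keys are assumptions, and the output format is deliberately plain. Note that the values describe the JVM that runs this helper, so it should be launched with the same JDK that runs the benchmarks.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: collects environment facts to archive beside a JMH result file. */
final class RunMetadata {

    static Map<String, String> capture() {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("java.version", System.getProperty("java.version"));
        meta.put("java.vm.name", System.getProperty("java.vm.name"));
        meta.put("java.vm.version", System.getProperty("java.vm.version"));
        meta.put("os.name", System.getProperty("os.name"));
        meta.put("os.arch", System.getProperty("os.arch"));
        meta.put("availableProcessors",
                String.valueOf(Runtime.getRuntime().availableProcessors()));
        // Arguments passed to this JVM, excluding the arguments to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        meta.put("jvm.args", String.join(" ", jvmArgs));
        return meta;
    }

    public static void main(String[] args) {
        capture().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Redirecting this output into a file next to `target/jmh-results.json` keeps the result and its context together. CPU model and clock settings are not available through these APIs and would need an OS-specific step.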

28. From benchmark to decision

A benchmark result should end with a decision frame:

## Decision

We will replace RegexCaseReferenceParser with ManualCaseReferenceParser for the hot ingestion path only.

## Evidence

- Manual parser is 1.8x-2.4x faster across representative SMALL/MEDIUM/LARGE corpora.
- Allocation decreases from 320 B/op to 48 B/op.
- Property-based equivalence tests pass for generated valid/malformed case references.
- Fuzz corpus found two malformed-input differences; fixed before merge.
- Component benchmark shows ingestion CPU decreases by 11% at 800 msg/s.

## Limits

- Does not prove end-to-end p99 improvement.
- Does not cover non-ASCII normalization outside accepted product scope.
- Production canary must watch parse failure rate and ingestion latency.

This is how top engineers use benchmarks: as one layer in an evidence chain.
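The equivalence evidence in the decision frame can be illustrated with a small randomized check. This is a sketch, not the parsers named above: the reference format (three uppercase letters, a four-digit year, a six-digit sequence), the `ParserEquivalence` class, and the mutation strategy are all assumptions for illustration, and a real project would more likely use a property-based testing library. It generates valid references, mutates some of them, and counts inputs on which a regex-based check and a hand-written check disagree.

```java
import java.util.Random;
import java.util.regex.Pattern;

/** Hypothetical stand-ins for two implementations plus a randomized equivalence check. */
final class ParserEquivalence {

    // Assumed format for illustration only, e.g. "ABC-2024-000123".
    private static final Pattern REFERENCE = Pattern.compile("[A-Z]{3}-[0-9]{4}-[0-9]{6}");
    private static final String NOISE = "AZaz09-_ ";

    static boolean regexAccepts(String input) {
        return REFERENCE.matcher(input).matches();
    }

    static boolean manualAccepts(String input) {
        if (input.length() != 15) return false;
        for (int i = 0; i < 15; i++) {
            char c = input.charAt(i);
            boolean ok;
            if (i < 3) ok = c >= 'A' && c <= 'Z';          // prefix letters
            else if (i == 3 || i == 8) ok = c == '-';      // separators
            else ok = c >= '0' && c <= '9';                // year and sequence digits
            if (!ok) return false;
        }
        return true;
    }

    /** Builds a valid reference, then sometimes corrupts, truncates, or extends it. */
    static String randomInput(Random random) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3; i++) sb.append((char) ('A' + random.nextInt(26)));
        sb.append('-');
        for (int i = 0; i < 4; i++) sb.append((char) ('0' + random.nextInt(10)));
        sb.append('-');
        for (int i = 0; i < 6; i++) sb.append((char) ('0' + random.nextInt(10)));
        int mutation = random.nextInt(4); // 0 leaves the input valid
        char noise = NOISE.charAt(random.nextInt(NOISE.length()));
        if (mutation == 1) sb.setCharAt(random.nextInt(sb.length()), noise);
        if (mutation == 2) sb.setLength(random.nextInt(sb.length()));
        if (mutation == 3) sb.append(noise);
        return sb.toString();
    }

    static int countMismatches(long seed, int samples) {
        Random random = new Random(seed);
        int mismatches = 0;
        for (int i = 0; i < samples; i++) {
            String input = randomInput(random);
            if (regexAccepts(input) != manualAccepts(input)) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches=" + countMismatches(42L, 100_000));
    }
}
```

A check like this belongs in the test suite, outside the measured benchmark method. Only once it reports no mismatches does a speed comparison between the two implementations mean anything.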

29. Practical command patterns

Run all benchmarks:

java -jar target/benchmarks.jar

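Run one benchmark class (the argument is a regular expression matched against benchmark names, so it also selects any other class whose name contains ParserBenchmark):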

java -jar target/benchmarks.jar '.*ParserBenchmark.*'

Run with GC profiler:

java -jar target/benchmarks.jar '.*ParserBenchmark.*' -prof gc

Run with JSON output:

java -jar target/benchmarks.jar \
  '.*ParserBenchmark.*' \
  -rf json \
  -rff target/parser-benchmark-results.json

Override forks (-f), warmup iterations (-wi), and measurement iterations (-i) from the CLI; these take precedence over the annotation values:

java -jar target/benchmarks.jar \
  '.*ParserBenchmark.*' \
  -f 5 \
  -wi 10 \
  -i 20

List benchmarks:

java -jar target/benchmarks.jar -l

30. What to put in the repository

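Recommended structure (the src/jmh source set shown here is the convention of the Gradle JMH plugin; a Maven module generated from the JMH archetype keeps benchmarks under src/main/java unless the build is configured otherwise):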

benchmarks/
  README.md
  pom.xml
  src/jmh/java/com/acme/benchmark/
    ParserBenchmark.java
    RuleEngineBenchmark.java
    SerializationBenchmark.java
  src/jmh/resources/
    corpora/
      case-references-small.txt
      case-references-large.txt
  results/
    README.md

Benchmark README should explain:

# Benchmarks

## Purpose
These benchmarks support local implementation decisions for parser, validation, and serialization hot paths.
They do not replace service-level load tests.

## Running
...

## Interpreting Results
Always run with at least 3 forks for decision-making.
Use `-prof gc` when comparing allocation-sensitive paths.

## Hardware
Record CPU, OS, JDK, and JVM args with every result.

## Review Rules
Every new benchmark must include a workload explanation and correctness oracle.

A benchmark without documentation decays quickly.

31. Exercises

Exercise 1 — parser benchmark

Take a parser from your codebase.

Create:

example tests;
property-based equivalence tests if replacing implementation;
JMH benchmark with SMALL/MEDIUM/LARGE input;
GC profiler output;
decision note.

Exercise 2 — rule engine benchmark

Model:

10, 100, 1000 rules;
different violation rates;
monomorphic vs polymorphic rule implementations;
result allocation strategy.

Find where the algorithm changes behavior.

Exercise 3 — benchmark review

Pick an old benchmark and answer:

What decision did this benchmark support?
What workload did it model?
What production assumption did it encode?
What source of invalidity is most likely?

If you cannot answer, rewrite or delete the benchmark.

32. Final mental model

JMH is a microscope.

A microscope is powerful because it narrows attention. But if you put the wrong sample under it, you will confidently study the wrong thing.

Use JMH when:

the boundary is local;
workload dimensions are explicit;
correctness equivalence is already protected;
JVM optimization traps are considered;
allocation and profile effects are visible;
the result feeds into a broader evidence chain.

Do not ask:

Which code is faster?

Ask:

Under this workload, with this state, on this JVM, for this boundary, with this correctness oracle, what changed and what decision does that justify?

That is performance engineering.

References

OpenJDK JMH project: https://openjdk.org/projects/code-tools/jmh/
OpenJDK JMH repository and samples: https://github.com/openjdk/jmh
JMH Blackhole source: https://github.com/openjdk/jmh/blob/master/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java

