Test Taxonomy and Verification Ladder

Learn Java Formal Methods, Testing, Benchmarking, and Performance Engineering - Part 003

Test taxonomy and verification ladder for Java engineers: how to choose the cheapest and strongest form of evidence for correctness, integration, concurrency, performance, and production-behavior risks.

[2026-07-02] 18 min read, 3409 words

Part 003 — Test Taxonomy and Verification Ladder

Goal of this part: build a decision map. It is not enough to know the terms unit test, integration test, contract test, property-based test, mutation test, fuzzing, benchmark, load test, and observability. We want to know when to use which, what risk each one detects, what evidence it produces, and where its blind spots are.

Senior engineers do not write as many tests as possible. Senior engineers build the most economical evidence system for the risks that matter most.

We will use the following principle throughout the series:

For every meaningful engineering risk, choose the cheapest detector that can falsify it early.

That means:

if a risk can be detected by a deterministic unit test, do not wait for E2E;
if the risk is schema drift, do not expect a unit test to catch it;
if the risk is a race condition, a single example test is not enough;
if the risk is throughput collapse, a mock-based test is irrelevant;
if the risk is a state machine invariant, a happy-path test is not enough;
if the risk is a misunderstanding between services, a contract test is worth more than a local mock;
if the risk is distributed protocol design, a formal model is often cheaper than debugging a production incident.

1. Core Mental Model

Most teams think of tests as a folder:

src/test/java

That is too shallow.

For a serious Java system, tests must be understood as layers of risk detection.

Every tool has a range where it detects best, and a blind spot.

A simple example:

Risk	Cheap detector	Strong detector	Common blind spot
Wrong formula	Unit test	Property-based test	Only three happy-path examples are tested
Wrong rule priority	Table-driven unit test	Property/model-based test	Tests do not explore combinations
Response schema changes	Contract test	Provider verification in CI	Local mock client stays green
Wrong DB query	Integration test	Integration test + realistic data fixture	H2 does not behave like PostgreSQL
Race condition	Stress/concurrency test	Formal model + deterministic scheduler	Local test passes by luck
Allocation spike	JMH + allocation profiler	JFR/async-profiler under workload	Microbenchmark is not representative
Poor P99 latency	Load test	Load test + production telemetry	Average latency is misleading
Partial-failure incident	Chaos/failure injection	Formal model + production invariant metrics	Happy-path E2E does not exercise failure

2. Verification Ladder

The verification ladder orders evidence from the most local, cheap, and fast to the most realistic, expensive, and slow.

Going up the ladder:

realism increases;
cost increases;
feedback gets slower;
debugging gets harder;
the signal gets closer to user impact.

Going down the ladder:

feedback gets faster;
defect localization gets easier;
realism decreases;
the code must be designed to be testable.

Practical rule:

Push detection downward whenever possible.
Keep only irreducible realism at higher layers.

For example:

validating a tax rule does not need E2E;
a database-specific SQL query needs an integration test;
response compatibility needs a contract test;
P99 latency needs a load test;
GC pauses need runtime/profiling evidence;
distributed liveness sometimes needs a formal model.

3. Why “Test Pyramid” Is Not Enough

The test pyramid is useful for pushing back against E2E-heavy testing. For modern systems, however, the pyramid is too narrow.

Problems with the test pyramid:

It does not distinguish example-based from property-based evidence.
It does not address schema compatibility.
It does not help with concurrency bugs.
It does not cover performance regressions.
It does not include production feedback.
It does not measure whether the test oracle is strong.
It does not encourage formal reasoning about stateful/distributed behavior.

The pyramid is still useful, but it is not a complete operating model.

In this series we use the verification portfolio model instead.

4. Taxonomy by Risk Type

4.1 Functional correctness

Question:

Does this code compute the right result for the intended input space?

Tools:

unit test;
parameterized test;
property-based test;
mutation testing;
static analysis;
design by contract;
model-based testing for stateful logic.

A Java domain example:

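import java.math.BigDecimal;

// Money and ViolationSeverity are assumed to be domain types defined elsewhere.
final class PenaltyCalculator {
    private static final BigDecimal MAX_PENALTY_FACTOR = new BigDecimal("2.0");

    Money calculatePenalty(Money baseFine, int daysLate, ViolationSeverity severity) {
        // No penalty when the payment is not late.
        if (daysLate <= 0) return Money.zero(baseFine.currency());

        // Daily penalty rate by severity.
        BigDecimal dailyRate = switch (severity) {
            case LOW -> new BigDecimal("0.01");
            case MEDIUM -> new BigDecimal("0.025");
            case HIGH -> new BigDecimal("0.05");
        };

        // The penalty grows linearly with days late and is capped at twice the base fine.
        return baseFine.multiply(dailyRate)
            .multiply(BigDecimal.valueOf(daysLate))
            .capAt(baseFine.multiply(MAX_PENALTY_FACTOR));
    }
}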

Example-based tests can check a few cases:

@Test
void highSeverityPenaltyIsFivePercentPerDay() {
    var penalty = calculator.calculatePenalty(
        Money.usd("1000.00"),
        3,
        ViolationSeverity.HIGH
    );

    assertThat(penalty).isEqualTo(Money.usd("150.00"));
}

But the stronger statements are properties:

For any baseFine > 0 and daysLate > 0:
penalty >= 0
penalty <= baseFine * 2
increasing daysLate must not decrease penalty
higher severity must not produce lower penalty

These are not just examples. They are invariants.
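The four properties above can be checked mechanically. The following is a minimal sketch, not the calculator from this lesson: it assumes a simplified `PenaltyRules` that works on plain `BigDecimal` amounts instead of `Money`, and it uses a seeded `java.util.Random` loop as a stand-in for a real property-based testing library. The names `PenaltyRules` and `PenaltyProperties` are hypothetical.

```java
import java.math.BigDecimal;
import java.util.Random;

// Simplified stand-in for PenaltyCalculator: BigDecimal amounts instead of the Money type.
final class PenaltyRules {
    enum Severity { LOW, MEDIUM, HIGH }

    static final BigDecimal CAP_FACTOR = new BigDecimal("2.0");

    static BigDecimal penalty(BigDecimal baseFine, int daysLate, Severity severity) {
        if (daysLate <= 0) return BigDecimal.ZERO;
        BigDecimal dailyRate = switch (severity) {
            case LOW -> new BigDecimal("0.01");
            case MEDIUM -> new BigDecimal("0.025");
            case HIGH -> new BigDecimal("0.05");
        };
        BigDecimal uncapped = baseFine.multiply(dailyRate).multiply(BigDecimal.valueOf(daysLate));
        return uncapped.min(baseFine.multiply(CAP_FACTOR));
    }
}

final class PenaltyProperties {
    /** Returns null when every property held, otherwise a description of the first counterexample. */
    static String firstViolation(long seed, int trials) {
        Random random = new Random(seed); // a fixed seed keeps any failure reproducible
        for (int i = 0; i < trials; i++) {
            BigDecimal base = BigDecimal.valueOf(1 + random.nextInt(1_000_000), 2); // 0.01 to 10000.00
            int days = 1 + random.nextInt(3650);
            String input = "base=" + base + ", days=" + days;
            BigDecimal previous = BigDecimal.ZERO; // penalty of the next-lower severity
            for (PenaltyRules.Severity severity : PenaltyRules.Severity.values()) {
                BigDecimal penalty = PenaltyRules.penalty(base, days, severity);
                if (penalty.signum() < 0) return "negative penalty: " + input;
                if (penalty.compareTo(base.multiply(PenaltyRules.CAP_FACTOR)) > 0) return "cap exceeded: " + input;
                if (PenaltyRules.penalty(base, days + 1, severity).compareTo(penalty) < 0) {
                    return "not monotonic in days: " + input;
                }
                if (penalty.compareTo(previous) < 0) return "not monotonic in severity: " + input;
                previous = penalty;
            }
        }
        return null;
    }
}
```

A dedicated library would add input shrinking and better generators; the point here is only that each invariant becomes an executable check over many generated inputs.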

4.2 Behavioral correctness

Question:

Does the system perform the right observable behavior under a scenario?

Tools:

component test;
workflow test;
state machine test;
BDD-style scenario test;
model-based testing;
approval/snapshot testing for specific outputs;
event-driven test harness.

Example behavior:

Given a case is under investigation
When the officer submits an escalation request
Then the case moves to ESCALATION_REVIEW
And the escalation audit event is recorded
And the SLA clock switches to escalation policy

This is not just a method result. It is a state change plus side effects.

The test must verify:

state transition;
emitted event;
audit trail;
SLA recalculation;
idempotency;
authorization;
failure semantics.
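One way to make such behavior checkable is to have the handler return the new state and the emitted events as plain data. The sketch below is an assumption for illustration, not the design used in this series: `EscalationHandler`, `EscalationResult`, and the `"ESCALATION"` policy label are hypothetical, and only state transition, audit event, SLA policy switch, idempotency, and failure semantics are covered (authorization is left out).

```java
import java.util.List;

enum CaseStatus { UNDER_INVESTIGATION, ESCALATION_REVIEW, CLOSED }

record CaseEscalated(String caseId, String officerId, String reason) {}

// Everything the scenario promises is returned as data, so a test can assert on it directly.
record EscalationResult(CaseStatus newStatus, String slaPolicy, List<CaseEscalated> auditEvents) {}

final class EscalationHandler {
    static EscalationResult escalate(String caseId, CaseStatus current, String officerId, String reason) {
        if (reason == null || reason.isBlank()) {
            throw new IllegalArgumentException("escalation requires a reason");
        }
        if (current == CaseStatus.CLOSED) {
            throw new IllegalStateException("a closed case cannot be escalated");
        }
        if (current == CaseStatus.ESCALATION_REVIEW) {
            // Idempotency: repeating the request changes nothing and records no second event.
            return new EscalationResult(current, "ESCALATION", List.of());
        }
        return new EscalationResult(
            CaseStatus.ESCALATION_REVIEW,
            "ESCALATION",
            List.of(new CaseEscalated(caseId, officerId, reason)));
    }
}
```

A scenario test then asserts on `newStatus()`, `slaPolicy()`, and `auditEvents()` together, and a second call with the returned status shows that no duplicate event is produced.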

4.3 Integration correctness

Question:

Does our code work with the real dependency contract and behavior?

Tools:

Testcontainers;
real database integration test;
messaging broker integration test;
external service fake with recorded contract;
migration test;
transaction test.

Example risk:

repository.findActiveCasesByOfficer(officerId, PageRequest.of(0, 50));

A unit test with a mocked repository does not prove that the SQL query is correct.

An integration test must catch problems with:

SQL syntax;
column mapping;
transaction boundary;
constraint violation;
index-sensitive behavior;
database-specific semantics;
isolation behavior;
timezone conversion;
JSONB/operator behavior when using PostgreSQL.

4.4 Contract compatibility

Question:

Can producers and consumers evolve independently without breaking each other?

Tools:

consumer-driven contract test;
OpenAPI schema validation;
protobuf/avro compatibility rules;
generated client/server verification;
backward/forward compatibility test;
golden contract tests.

A contract test is not a full integration test. It answers a narrow question:

Does this API still satisfy the contract expected by consumers?

Examples of breaking changes:

a new required field is added;
an enum value changes;
a response field disappears;
a type changes from number to string;
the error format changes;
pagination semantics change;
a status code changes;
an event payload becomes incompatible.
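Some of these changes can be detected by comparing two versions of a schema. The sketch below is a deliberately small illustration for response payloads only, not how any particular contract-testing tool works: the `FieldSpec` record and the three rules (removed field, changed type, required field becoming optional) are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, minimal schema description: field name -> (type, required).
record FieldSpec(String type, boolean required) {}

final class ResponseCompatibility {
    /** Lists changes in {@code next} that can break a consumer written against {@code previous}. */
    static List<String> breakingChanges(Map<String, FieldSpec> previous, Map<String, FieldSpec> next) {
        List<String> problems = new ArrayList<>();
        // TreeMap gives a stable, alphabetical report order.
        for (Map.Entry<String, FieldSpec> entry : new TreeMap<>(previous).entrySet()) {
            String name = entry.getKey();
            FieldSpec before = entry.getValue();
            FieldSpec after = next.get(name);
            if (after == null) {
                problems.add("removed field: " + name);
            } else if (!after.type().equals(before.type())) {
                problems.add("type changed: " + name + " " + before.type() + " -> " + after.type());
            } else if (before.required() && !after.required()) {
                problems.add("no longer guaranteed: " + name);
            }
        }
        return problems;
    }
}
```

Adding a new optional response field produces an empty list here, which matches the intuition that additive changes are usually safe for tolerant consumers. Request schemas need the opposite rule: there, a newly required field is the breaking change.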

4.5 Temporal correctness

Question:

Does behavior remain correct when time matters?

Risks:

deadline;
SLA;
retry delay;
timeout;
scheduled job;
expiration;
grace period;
business calendar;
daylight saving/timezone;
monotonic vs wall-clock time.

Tools:

fake clock;
deterministic scheduler;
temporal property tests;
integration test for scheduled jobs;
production metric for late/expired items.

Rule:

Never let domain logic call Instant.now() directly.

Better:

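import java.time.Clock;
import java.time.Instant;

final class SlaPolicy {
    // The clock is injected so tests can supply a fixed or controllable time source.
    private final Clock clock;

    SlaPolicy(Clock clock) {
        this.clock = clock;
    }

    boolean isBreached(CaseFile file) {
        // Reads "now" from the injected clock, never from the system clock directly.
        return Instant.now(clock).isAfter(file.slaDeadline());
    }
}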
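With the clock injected, a test can pin "now" to any instant using `Clock.fixed` from the JDK. The sketch below repeats a minimal `SlaPolicy` so that it compiles on its own; the `CaseFile` record with a single `slaDeadline` field and the `SlaPolicyCheck` helper are assumptions for illustration.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical minimal case file: only the deadline matters for this check.
record CaseFile(Instant slaDeadline) {}

final class SlaPolicy {
    private final Clock clock;

    SlaPolicy(Clock clock) {
        this.clock = clock;
    }

    boolean isBreached(CaseFile file) {
        return clock.instant().isAfter(file.slaDeadline());
    }
}

final class SlaPolicyCheck {
    /** Evaluates the policy as if the current time were {@code now}. */
    static boolean breachedAt(Instant now, Instant deadline) {
        Clock fixed = Clock.fixed(now, ZoneOffset.UTC); // time never advances during the test
        return new SlaPolicy(fixed).isBreached(new CaseFile(deadline));
    }
}
```

The boundary is now testable without sleeping: one second before the deadline and exactly at the deadline are not breaches (`isAfter` is strict), one second after is.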

4.6 Concurrency correctness

Question:

Does behavior remain correct under interleaving, contention, duplicated execution, and reordering?

Tools:

deterministic concurrency test;
stress test;
jcstress-style thinking;
model checking;
TLA+ for protocol logic;
database isolation tests;
idempotency tests;
race reproduction harness.

Concurrency bugs are rarely caught by a one-shot unit test.

Typical bug:

Two workers claim the same case concurrently.

Invariant:

At most one active assignment exists for a case at any time.

Detector candidates:

Detector	Good for	Weak for
Unit test	sequential logic	interleaving
Integration test	DB constraint / transaction behavior	large interleaving space
Stress test	probabilistic race exposure	proof / exhaustive coverage
Formal model	protocol correctness	implementation details
Production invariant metric	real-world detection	late signal
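As a concrete instance of the stress-test row, here is a small harness that lets several threads race for the same case and counts how many claims succeed. It is a sketch under assumptions: `AssignmentRegistry` is a hypothetical in-memory stand-in for whatever enforces the invariant in a real system (often a database constraint), and the claim is made atomic with `ConcurrentHashMap.putIfAbsent`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class AssignmentRegistry {
    private final ConcurrentHashMap<String, String> activeOwnerByCase = new ConcurrentHashMap<>();

    /** Atomic check-and-claim: succeeds only when the case has no active owner yet. */
    boolean claim(String caseId, String workerId) {
        return activeOwnerByCase.putIfAbsent(caseId, workerId) == null;
    }
}

final class ClaimStress {
    /** Lets {@code workers} threads race for one case and returns how many claims succeeded. */
    static int successfulClaims(int workers) {
        AssignmentRegistry registry = new AssignmentRegistry();
        CountDownLatch startGate = new CountDownLatch(1);
        AtomicInteger successes = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int i = 0; i < workers; i++) {
                String workerId = "worker-" + i;
                pool.submit(() -> {
                    startGate.await(); // line every worker up before the race starts
                    if (registry.claim("case-1", workerId)) successes.incrementAndGet();
                    return null;
                });
            }
            startGate.countDown();
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdownNow();
        }
        return successes.get();
    }
}
```

Running many rounds and expecting exactly one success per round raises the chance of exposing a non-atomic claim, but as the table says, a passing stress test is probabilistic evidence, not proof.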

4.7 Performance correctness

Question:

Does the system satisfy performance requirements under representative conditions?

Performance correctness is correctness.

If the requirement says:

P99 case search latency <= 300 ms at 200 requests/second with 5M active cases

then a response that is functionally correct but takes 4 seconds is a failure.

Tools:

JMH microbenchmark;
macrobenchmark;
load test;
soak test;
JFR;
async-profiler;
database execution plan;
production SLO metrics.

Important distinction:

Tool	Answers
JMH	Is this local operation faster/slower under controlled JVM measurement?
Profiler	Where is time/allocation/lock contention spent?
Load test	Does the service satisfy latency/throughput under workload?
Soak test	Does performance degrade over time?
Production telemetry	What actually happens for real users?
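A requirement stated as a P99 cannot be checked with an average. The sketch below computes a nearest-rank percentile over recorded latencies; the `LatencyStats` class is hypothetical, and the sample distribution in the note that follows (980 requests at 50 ms, 20 requests at 4000 ms) is invented purely to show the effect.

```java
import java.util.Arrays;

final class LatencyStats {
    static double mean(long[] latenciesMs) {
        return Arrays.stream(latenciesMs).average().orElse(0.0);
    }

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone(); // do not reorder the caller's data
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For the invented distribution, the mean is 129 ms and looks comfortably under a 300 ms target, while the P99 is 4000 ms: the slowest 2% of requests are invisible in the average but dominate the tail.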

5. Taxonomy by Scope

5.1 Unit test

Scope:

One class/function/module, isolated from real external systems.

Good for:

pure domain rules;
branching logic;
validation;
small algorithms;
error handling;
edge cases;
state transition rules, if the model is small.

Bad for:

SQL correctness;
serialization compatibility;
actual broker semantics;
HTTP behavior;
JIT/GC/performance behavior;
distributed protocol realism.

Healthy unit test properties:

fast;
deterministic;
local;
readable;
checks behavior, not implementation noise;
fails for meaningful regression;
does not require external infra.

Bad unit test smell:

verify(repository).save(any());
verify(auditClient).send(any());
verify(clock).instant();
verify(logger).info(anyString());

This often means the test checks implementation choreography, not domain behavior.

Better:

assertThat(result.newStatus()).isEqualTo(CaseStatus.ESCALATION_REVIEW);
assertThat(result.events()).containsExactly(
    new CaseEscalated(caseId, officerId, reason)
);

5.2 Component test

Scope:

A meaningful slice of application code, often with fake adapters.

Example:

Command handler + domain service + in-memory repository + fake event publisher.

Good for:

application behavior;
use-case orchestration;
transaction-like semantics at logical level;
verifying emitted events;
checking authorization flow;
testing without real DB/network.

Bad for:

real SQL/migration bugs;
real serialization;
broker-specific behavior;
performance.

Component tests are often the sweet spot for application-layer behavior.

5.3 Integration test

Scope:

Our code + real external dependency instance.

Example:

Java service + PostgreSQL container;
repository + real schema migration;
Kafka producer/consumer + broker;
Redis cache integration;
HTTP client + WireMock/fake server;
file storage adapter + local compatible service.

Integration tests answer:

Does our adapter really speak the dependency's language?

They should not test every domain rule again.

Good integration test:

Repository persists and reloads Case aggregate with all important fields.

Poor integration test:

Full 12-step user journey repeated for every business rule.

5.4 Contract test

Scope:

Boundary compatibility between producer and consumer.

Contract tests are about evolution safety.

They should answer:

can old consumers read new producer output?
can new consumers tolerate old producer output?
does provider still honor consumer expectation?
are error responses compatible?
are required fields stable?
is enum expansion safe?

Contract test failure means:

Do not deploy this change without migration/coordination.

5.5 End-to-end test

Scope:

Entire system path, close to user journey.

Good for:

deployment wiring;
critical journey confidence;
smoke checks;
cross-service behavior;
environment readiness;
release validation.

Bad for:

exhaustive business rule testing;
debugging;
fast feedback;
finding local logic bugs cheaply.

Use E2E sparingly.

A useful E2E test usually says:

The system is wired and the critical path still works.

It should not be the primary detector for every rule.

6. Taxonomy by Technique

6.1 Example-based testing

Most familiar:

Given this input, expect this output.

Strength:

readable;
concrete;
good for known cases;
easy to explain;
good regression test after bug fix.

Weakness:

narrow input coverage;
can miss edge cases;
depends on author imagination;
often overfits implementation.

Use it for:

critical examples;
discovered bugs;
business examples;
boundary examples;
readable documentation.

6.2 Table-driven / parameterized testing

Useful when the rule is a matrix.

@ParameterizedTest
@CsvSource({
    "LOW, 1, 10.00",
    "LOW, 7, 70.00",
    "MEDIUM, 1, 25.00",
    "HIGH, 1, 50.00"
})
void calculatesPenaltyBySeverity(String severity, int daysLate, String expected) {
    var penalty = calculator.calculatePenalty(
        Money.usd("1000.00"),
        daysLate,
        ViolationSeverity.valueOf(severity)
    );

    assertThat(penalty).isEqualTo(Money.usd(expected));
}

Good for:

decision table;
validation matrix;
role/permission matrix;
state transition matrix;
parsing matrix.

Weakness:

still example-based;
combinatorial explosion;
can become unreadable.

6.3 Property-based testing

Instead of examples:

For all generated inputs satisfying assumptions, property must hold.

Good for:

algebraic behavior;
round-trip serialization;
parser/printer consistency;
monotonicity;
ordering;
idempotency;
commutativity;
state transition invariants;
business constraints with large input space.

Example properties:

deserialize(serialize(x)) equals x
sort(xs) contains same elements as xs
applying same command twice is equivalent to applying it once
penalty never exceeds legal maximum
case cannot be both CLOSED and UNDER_REVIEW

Property-based testing requires stronger thinking: you need to specify what must always be true.

6.4 Mutation testing

Mutation testing asks:

If the production code is slightly wrong, do tests fail?

Example mutation:

if (daysLate <= 0) { ... }

becomes:

if (daysLate < 0) { ... }

If tests still pass, the test suite may be weak.

Mutation testing is excellent for detecting:

missing assertions;
tests that execute code but don't verify behavior;
weak branch coverage;
fake green tests;
overreliance on line coverage.

It is not a perfect metric. Some mutants are equivalent or irrelevant. Use it as a feedback tool, not a religion.
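The boundary mutation above can be reproduced by hand to see what "killing a mutant" means. This is a hand-rolled illustration, not the output of a mutation testing tool; `MutationDemo` and its two "suites" are hypothetical, and each suite is reduced to a boolean so the comparison fits in a few lines.

```java
import java.util.function.IntPredicate;

final class MutationDemo {
    // Original guard: zero or negative days late means no penalty.
    static final IntPredicate ORIGINAL = daysLate -> daysLate <= 0;
    // Boundary mutant: the comparison operator was changed from <= to <.
    static final IntPredicate MUTANT = daysLate -> daysLate < 0;

    /** Weak suite: only inputs far away from the boundary. */
    static boolean weakSuitePasses(IntPredicate noPenalty) {
        return noPenalty.test(-5) && !noPenalty.test(10);
    }

    /** Stronger suite: adds both sides of the boundary at zero. */
    static boolean boundarySuitePasses(IntPredicate noPenalty) {
        return weakSuitePasses(noPenalty) && noPenalty.test(0) && !noPenalty.test(1);
    }
}
```

The weak suite passes for both `ORIGINAL` and `MUTANT`, so the mutant survives. The boundary suite passes for `ORIGINAL` and fails for `MUTANT`, so the mutant is killed. A mutation testing tool automates this comparison across many generated mutants.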

6.5 Fuzzing

Fuzzing explores malformed, unexpected, adversarial, or high-volume input.

Good for:

parsers;
decoders;
JSON/XML handling;
regex;
upload processing;
protocol handling;
validation boundary;
robustness/security-adjacent failures.

Fuzzing is less about expected business answer and more about:

Never crash, hang, corrupt, leak, or consume unbounded resources for invalid input.
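A fuzz harness for that rule can be very small. The sketch below is a naive random-input loop, not a coverage-guided fuzzer; `LimitParser`, its `limit=<n>` format, and the 1 to 500 range are hypothetical. The only oracle is robustness: a valid result in range or a controlled `IllegalArgumentException`, nothing else.

```java
import java.util.Random;

final class LimitParser {
    /** Parses "limit=<n>" with 1 <= n <= 500; rejects everything else with IllegalArgumentException. */
    static int parse(String raw) {
        if (raw == null || !raw.startsWith("limit=")) {
            throw new IllegalArgumentException("expected limit=<n>");
        }
        String digits = raw.substring("limit=".length());
        if (digits.isEmpty() || digits.length() > 3 || !digits.chars().allMatch(c -> c >= '0' && c <= '9')) {
            throw new IllegalArgumentException("expected one to three digits");
        }
        int value = Integer.parseInt(digits);
        if (value < 1 || value > 500) {
            throw new IllegalArgumentException("limit out of range");
        }
        return value;
    }
}

final class LimitParserFuzz {
    /** Returns null if the parser stayed well-behaved, otherwise a description of the first bad input. */
    static String firstUnexpectedFailure(long seed, int iterations) {
        Random random = new Random(seed); // seeded so a failing input can be replayed
        String alphabet = "limit=0123456789-+ x";
        for (int i = 0; i < iterations; i++) {
            StringBuilder input = new StringBuilder(random.nextBoolean() ? "limit=" : "");
            int extra = random.nextInt(8);
            for (int j = 0; j < extra; j++) {
                input.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
            try {
                int value = LimitParser.parse(input.toString());
                if (value < 1 || value > 500) return "out-of-range result for: " + input;
            } catch (IllegalArgumentException expected) {
                // Controlled rejection is the intended outcome for invalid input.
            } catch (RuntimeException unexpected) {
                return unexpected + " for input: " + input;
            }
        }
        return null;
    }
}
```

Note that the harness never states what the "right" answer is for a given string. That is typical for fuzzing: the oracle is a robustness rule, not a business expectation.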

6.6 Model-based testing

Model-based testing uses a simplified model as oracle.

Example:

A case can transition:
DRAFT -> SUBMITTED -> UNDER_REVIEW -> DECIDED -> CLOSED
UNDER_REVIEW -> ESCALATED -> DECIDED

Generate command sequences:

submit, assign, escalate, decide, close, reopen, assign, decide...

After each command:

implementation state must match model state
invariants must hold
invalid commands must be rejected

This is powerful for workflows, lifecycle engines, order management, case management, payment state machines, and approval systems.
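The loop described above (generate commands, apply to both sides, compare) can be sketched as follows. The assumptions are loud: the command set is reduced to five commands, the mapping of commands to transitions is invented for this example, and `CaseLifecycle`, `LifecycleModel`, and `LifecycleModelCheck` are hypothetical names. The model is a plain transition table; the implementation is hand-written branching.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Random;

enum State { DRAFT, SUBMITTED, UNDER_REVIEW, ESCALATED, DECIDED, CLOSED }
enum Command { SUBMIT, ASSIGN, ESCALATE, DECIDE, CLOSE }

/** Implementation under test: hand-written branching, as production code often is. */
final class CaseLifecycle {
    private State state = State.DRAFT;

    State state() { return state; }

    /** Applies the command; returns false and leaves the state unchanged if it is not allowed. */
    boolean apply(Command command) {
        State next = switch (command) {
            case SUBMIT -> state == State.DRAFT ? State.SUBMITTED : null;
            case ASSIGN -> state == State.SUBMITTED ? State.UNDER_REVIEW : null;
            case ESCALATE -> state == State.UNDER_REVIEW ? State.ESCALATED : null;
            case DECIDE -> (state == State.UNDER_REVIEW || state == State.ESCALATED) ? State.DECIDED : null;
            case CLOSE -> state == State.DECIDED ? State.CLOSED : null;
        };
        if (next == null) return false;
        state = next;
        return true;
    }
}

/** Simplified model used as the oracle: a transition table and nothing else. */
final class LifecycleModel {
    private static final Map<State, Map<Command, State>> TABLE = new EnumMap<>(State.class);
    static {
        TABLE.put(State.DRAFT, Map.of(Command.SUBMIT, State.SUBMITTED));
        TABLE.put(State.SUBMITTED, Map.of(Command.ASSIGN, State.UNDER_REVIEW));
        TABLE.put(State.UNDER_REVIEW, Map.of(Command.ESCALATE, State.ESCALATED, Command.DECIDE, State.DECIDED));
        TABLE.put(State.ESCALATED, Map.of(Command.DECIDE, State.DECIDED));
        TABLE.put(State.DECIDED, Map.of(Command.CLOSE, State.CLOSED));
        TABLE.put(State.CLOSED, Map.of());
    }

    /** Returns the next state, or null when the model rejects the command. */
    static State next(State from, Command command) {
        return TABLE.get(from).get(command);
    }
}

final class LifecycleModelCheck {
    /** Returns null if implementation and model always agreed, otherwise the failing command trace. */
    static String firstMismatch(long seed, int sequences, int length) {
        Random random = new Random(seed);
        Command[] commands = Command.values();
        for (int s = 0; s < sequences; s++) {
            CaseLifecycle implementation = new CaseLifecycle();
            State modelState = State.DRAFT;
            StringBuilder trace = new StringBuilder();
            for (int i = 0; i < length; i++) {
                Command command = commands[random.nextInt(commands.length)];
                trace.append(command).append(' ');
                State expected = LifecycleModel.next(modelState, command);
                boolean accepted = implementation.apply(command);
                if (accepted != (expected != null)) return "acceptance mismatch after: " + trace;
                if (expected != null) modelState = expected;
                if (implementation.state() != modelState) return "state mismatch after: " + trace;
            }
        }
        return null;
    }
}
```

The returned trace is the useful artifact: when the two sides disagree, it is a ready-made reproduction sequence.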

6.7 Formal model checking

Formal model checking explores state space of a model.

It is not “testing production Java”. It is checking whether a simplified specification admits bad states.

Good for:

concurrency protocol;
distributed coordination;
idempotency;
retry/timeout behavior;
leader election-like patterns;
queue/worker claim logic;
workflow state invariants;
eventually-completes properties.

Formal model checking can find counterexamples before code exists.

This is especially valuable when implementation bugs are expensive to reproduce.
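To make "explores the state space of a model" concrete, here is a toy explicit-state search written in plain Java. It is only an illustration of the idea and no substitute for a real model checker or specification language. The model is an assumption: each worker first reads whether the case is free and then, in a separate step, claims it. The search enumerates every interleaving and looks for a state with two owners.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ClaimProtocolModel {
    // Per-worker program counter values in the simplified model.
    static final int START = 0, SAW_FREE = 1, OWNER = 2, GAVE_UP = 3;

    /** All states reachable by letting exactly one worker take one step. */
    static List<List<Integer>> successors(List<Integer> pcs, boolean atomicClaim) {
        List<List<Integer>> next = new ArrayList<>();
        boolean owned = pcs.contains(OWNER);
        for (int w = 0; w < pcs.size(); w++) {
            int pc = pcs.get(w);
            int after;
            if (pc == START) {
                // Read step; with atomicClaim the read and the claim happen as one step.
                after = owned ? GAVE_UP : (atomicClaim ? OWNER : SAW_FREE);
            } else if (pc == SAW_FREE) {
                after = OWNER; // acts on a read that may be stale by now
            } else {
                continue; // OWNER and GAVE_UP are terminal
            }
            List<Integer> copy = new ArrayList<>(pcs);
            copy.set(w, after);
            next.add(copy);
        }
        return next;
    }

    /** Breadth-first search over all interleavings; returns a state with two owners, or null. */
    static List<Integer> findViolation(int workers, boolean atomicClaim) {
        List<Integer> initial = new ArrayList<>();
        for (int w = 0; w < workers; w++) initial.add(START);
        Set<List<Integer>> seen = new HashSet<>();
        ArrayDeque<List<Integer>> queue = new ArrayDeque<>();
        seen.add(initial);
        queue.add(initial);
        while (!queue.isEmpty()) {
            List<Integer> state = queue.poll();
            long owners = state.stream().filter(pc -> pc == OWNER).count();
            if (owners > 1) return state; // invariant "at most one owner" is broken
            for (List<Integer> successor : successors(state, atomicClaim)) {
                if (seen.add(successor)) queue.add(successor);
            }
        }
        return null;
    }
}
```

With the non-atomic protocol the search reaches a two-owner state for two workers; with the atomic variant it exhausts the state space without finding one. Unlike a stress test, the second result covers every interleaving of the model, though it says nothing about whether the Java implementation matches the model.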

6.8 Benchmarking and performance tests

Benchmarking answers:

How does a specific piece of code perform under controlled measurement?

Performance testing answers:

Does the system meet performance goals under workload?

These are not the same.

Activity	Scope	Example
Microbenchmark	method/class/algorithm	compare JSON serializers
Component benchmark	local subsystem	evaluate cache strategy
Macrobenchmark	service/application	benchmark API with realistic DB
Load test	deployed system	500 rps for 30 minutes
Soak test	long-running stability	12 hours under normal load
Stress test	find breaking point	ramp until SLA failure
Spike test	sudden surge	10x traffic in 60 seconds
Profiling	explain bottleneck	CPU/allocation/lock flamegraph

Never treat a microbenchmark as proof of end-to-end latency.

7. Evidence Strength vs Cost

A practical scoring model:

Technique	Speed	Debuggability	Realism	Bug class	Cost
Static check	Very high	High	Low	syntax/type/style/simple defects	Low
Unit test	Very high	High	Low-medium	local logic	Low
Property test	High-medium	Medium	Medium	broad input logic	Medium
Mutation test	Medium-low	Medium	N/A	weak tests	Medium
Component test	High-medium	High-medium	Medium	use-case behavior	Medium
Contract test	Medium	Medium	Boundary-realistic	compatibility	Medium
Integration test	Medium-low	Medium	High for dependency	adapter/data bugs	Medium-high
E2E test	Low	Low	High	wiring/journey	High
Formal model	Medium	Medium	Abstract	protocol/state bugs	Medium-high
JMH benchmark	Medium	Medium	Low-medium	local perf behavior	Medium
Load test	Low	Low-medium	High	capacity/SLO	High
Production telemetry	Continuous	Variable	Highest	real behavior	Required

Use this to decide where a check belongs.

8. Choosing the Right Test: Decision Tree

Before writing a test, force yourself to answer:

What failure would this test catch?
What failure would it not catch?
Is there a cheaper layer that should catch it earlier?
Is this test a detector or just ceremony?

9. Example: Case Escalation Workflow

Assume a case management domain:

Case states:
DRAFT -> SUBMITTED -> UNDER_REVIEW -> ESCALATION_REQUESTED -> ESCALATION_REVIEW -> DECIDED -> CLOSED

Important invariants:

closed case cannot be escalated
case cannot have two active owners
escalation requires reason
escalation emits audit event
escalation changes SLA policy
same escalation command is idempotent

Verification portfolio:

Risk	Detector
Closed case escalated	Unit/component test
Missing reason accepted	Unit/property test
Duplicate escalation command creates duplicate event	Property/model-based test
Two officers claim same case	DB constraint integration + formal model
Escalation event schema breaks consumer	Contract test
SLA query too slow at scale	DB integration + macrobenchmark/load test
Event publisher unavailable	Component failure test + integration failure injection
Production duplicate active owner	Invariant metric/alert

[Figure: test distribution]

The lesson:

One domain invariant may need multiple detectors at different layers.

Example:

At most one active owner per case.

Detectors:

Domain unit test prevents command-level violation.
Database unique partial index prevents persistence-level violation.
Integration test proves index works.
Formal model explores concurrent claim interleavings.
Production metric checks no corrupted data exists.

This is evidence layering.

10. The Test Oracle Problem

A test is only as strong as its oracle.

A test has two parts:

execution + oracle

Execution runs the code. Oracle decides whether behavior is correct.

Weak oracle:

@Test
void createsCase() {
    service.createCase(command);
}

This test only checks “no exception”. Sometimes useful, usually weak.

Slightly better:

@Test
void createsCase() {
    var id = service.createCase(command);
    assertThat(id).isNotNull();
}

Still weak.

Better:

@Test
void createsCaseWithSubmittedStatusAndAuditEvent() {
    var id = service.createCase(command);

    var saved = repository.get(id);
    assertThat(saved.status()).isEqualTo(CaseStatus.SUBMITTED);
    assertThat(saved.createdBy()).isEqualTo(command.requestedBy());
    assertThat(eventPublisher.events()).containsExactly(
        new CaseSubmitted(id, command.requestedBy())
    );
}

Strong oracle checks externally meaningful behavior.

Mutation testing is useful because it exposes weak oracles.

11. Coverage Is Not Confidence

Line coverage tells you:

Was this line executed?

It does not tell you:

Was the result checked?
Was the branch meaningfully asserted?
Were edge cases covered?
Would tests fail if logic were wrong?

Example:

@Test
void weakCoverage() {
    calculator.calculatePenalty(Money.usd("100"), 10, ViolationSeverity.HIGH);
}

This can cover many lines with zero meaningful verification.

Better metrics:

branch coverage for decision-heavy code;
mutation score for oracle strength;
property coverage for input-space behavior;
requirement/invariant coverage;
defect detection history;
flaky rate;
mean time to localize failure;
test runtime cost;
escaped defect analysis.

The best metric is not a number. It is:

Can the test suite detect the failures we actually care about at a reasonable cost?

12. Flakiness as a Design Smell

Flaky tests are not merely annoying. They destroy trust.

Common causes:

Cause	Example	Fix
Real time	sleep(1000)	fake clock / await condition
Shared state	static mutable fixture	isolated fixture
Test order dependency	relies on previous test	reset state
External dependency	real third-party API	fake/contract test
Race condition	unsynchronized assertion	deterministic scheduler/barrier
Resource contention	shared DB/schema	unique schema/container
Randomness	random UUID/data without seed	seed and print failing case
Async eventual consistency	immediate assertion	await with bounded timeout

Bad:

Thread.sleep(1000);
assertThat(repository.find(id)).isPresent();

Better:

await().atMost(Duration.ofSeconds(5))
       .untilAsserted(() -> assertThat(repository.find(id)).isPresent());

Even better: design the component to expose deterministic completion signals in tests.
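One way to expose a deterministic completion signal is to return a `CompletableFuture` from the asynchronous operation and let the executor be injected. The sketch below is an assumption for illustration: `AsyncCaseStore` is a hypothetical in-memory component, and the 5-second timeout is only a safety bound, not a delay.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

final class AsyncCaseStore {
    private final Map<String, String> titlesById = new ConcurrentHashMap<>();
    private final Executor executor;

    // The executor is injected so a test can choose real threads or same-thread execution.
    AsyncCaseStore(Executor executor) {
        this.executor = executor;
    }

    /** Saves asynchronously; the returned future completes once the case is visible to find(). */
    CompletableFuture<Void> save(String id, String title) {
        return CompletableFuture.runAsync(() -> titlesById.put(id, title), executor);
    }

    Optional<String> find(String id) {
        return Optional.ofNullable(titlesById.get(id));
    }
}

final class AsyncCaseStoreCheck {
    /** Waits on the completion signal instead of sleeping or polling. */
    static boolean savedCaseIsVisible(Executor executor) {
        AsyncCaseStore store = new AsyncCaseStore(executor);
        store.save("case-1", "Late filing")
             .orTimeout(5, TimeUnit.SECONDS) // fail fast instead of hanging the suite
             .join();
        return store.find("case-1").isPresent();
    }
}
```

Passing `Runnable::run` as the executor runs the save on the calling thread, which removes scheduling from the test entirely; passing a real pool exercises the asynchronous path while still waiting on an explicit signal.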

13. Test Runtime Budget

A serious codebase should have test runtime budget.

Example:

Suite	Target runtime	Trigger
Static checks	< 30 sec	every local/PR run
Unit tests	< 2 min	every local/PR run
Component tests	< 5 min	every PR
Contract tests	< 5 min	every PR/provider build
Integration tests	< 10-15 min	PR/merge queue
Mutation tests	selective/nightly	targeted modules
JMH smoke	selective/nightly	perf-sensitive modules
Full load/soak	scheduled/release	staging/perf environment
Formal model checking	on spec change/nightly	protocol/workflow modules

This is not universal. The key is explicit policy.

Without runtime budget, teams drift into either:

slow CI nobody trusts; or
shallow tests that miss real defects.

14. CI Verification Pipeline

[Figure: a practical CI verification pipeline]

Important principle:

CI should not merely run tests. CI should preserve evidence artifacts.

Useful artifacts:

test report;
failing seed for property tests;
mutation report;
contract verification report;
DB migration logs;
JMH JSON output;
GC logs/JFR recording for perf tests;
flamegraphs;
load test summary;
SLO/canary comparison.

Evidence without artifact is hard to audit.

15. Mapping Testing Strategy to Code Ownership

A verification portfolio only works if ownership is clear.

Artifact	Owner
Domain unit/property tests	owning feature team
Shared test utilities	platform/testing guild
Contract definitions	producer + consumer jointly
Integration harness	service team
Performance benchmark	module owner + performance owner
Formal model	protocol/workflow owner
Load test scenario	service owner + SRE/perf engineer
Production invariant alerts	service owner

Avoid this failure:

Everyone relies on the tests, but nobody owns the meaning of the tests.

Every important test should have a clear reason to exist.

16. Test Smells and What They Usually Mean

Smell	Usually means
Test name says implementation detail	behavior not clearly understood
20 mocks in one test	design has poor boundaries
Excessive `verify()` calls	test over-specifies choreography
Frequent fixture edits break many tests	shared fixture too broad
Lots of sleeps	nondeterminism not controlled
High coverage but many escaped bugs	weak oracle
E2E catches business rule bugs	lower layers missing detectors
Integration test validates pure math	wrong layer
Property test has too many assumptions	generator/model mismatch
Mutation score low in critical module	assertions too weak
Perf tests unstable	environment/workload not controlled
Test passes locally, fails in CI	hidden environment dependency

A test smell is not automatically a bug. It is a design signal.

17. A Practical Verification Design Recipe

For each feature, write a small verification plan before coding.

Template:

Feature:
Primary behavior:
Critical invariants:
Failure modes:
External contracts:
State transitions:
Concurrency risks:
Performance risks:
Observability requirements:

Evidence plan:
- Unit tests:
- Property/model-based tests:
- Contract tests:
- Integration tests:
- E2E smoke:
- Performance/profiling:
- Production metrics/assertions:

Example:

Feature: Escalate case

Critical invariants:
- closed cases cannot be escalated
- escalation requires reason
- duplicate command is idempotent
- one active escalation at a time
- audit event is recorded exactly once per logical escalation

Evidence plan:
- Unit: invalid states rejected, reason required
- Property: duplicate command idempotency
- Model-based: generated state transitions preserve invariants
- Integration: DB unique constraint prevents duplicate active escalation
- Contract: escalation event schema remains compatible
- E2E: officer can escalate case through API path
- Performance: escalation list query remains under SLA
- Production: metric active_escalations_per_case_max <= 1

This is far more useful than asking “how many tests should we write?”

18. The Verification Portfolio for This Series

[Figure: the verification stack this series will build]

The important thing is not tool mastery in isolation. The important thing is knowing where each tool fits in the evidence chain.

19. Minimal Practice Exercise

Take one feature from your codebase or an imaginary case management system.

Write:

1. Three invariants.
2. Three failure modes.
3. One example-based test.
4. One property you wish you could test.
5. One integration risk.
6. One contract risk.
7. One performance risk.
8. One production metric that would detect corruption or degradation.

Then classify each item into the verification ladder.

The purpose is to train your reflex:

risk -> detector -> evidence -> blind spot

20. Part Summary

You now have the mental map for the rest of the series.

Key points:

Testing is not one activity. It is a portfolio of evidence.
Test type should be selected by risk type.
The cheapest reliable detector should catch defects as early as possible.
Unit tests are excellent but not universal.
E2E tests are valuable but should not carry all correctness burden.
Property-based and model-based tests are powerful for broad input/state spaces.
Mutation testing evaluates test oracle strength.
Formal methods help when interleavings and distributed behavior exceed human intuition.
Benchmarking and profiling are part of correctness when performance is a requirement.
Production telemetry closes the evidence loop.

The next part moves from strategy to design:

How do we design Java code so that these verification techniques are cheap, deterministic, and meaningful?

References

JUnit User Guide: https://docs.junit.org/
JUnit project overview: https://junit.org/
Testcontainers for Java: https://java.testcontainers.org/
Java Microbenchmark Harness: https://openjdk.org/projects/code-tools/jmh/
JDK Flight Recorder: https://dev.java/learn/jvm/jfr/
