Build CoreOrdered learning track

TLA+ for Java System Design

Learn Java Formal Methods, Testing, Benchmarking, and Performance Engineering - Part 020

TLA+ untuk desain sistem Java: memodelkan state, action, invariant, liveness, idempotent command processing, optimistic locking, outbox, retry, counterexample, dan mapping dari specification ke implementasi Java/JUnit/jqwik.

7 min read1312 words
PrevNext
Lesson 2040 lesson track0922 Build Core
#java#tla-plus#formal-methods#model-checking+6 more

Part 020 — TLA+ for Java System Design

Tujuan bagian ini: memakai TLA+ sebagai alat desain untuk Java services yang stateful, concurrent, retry-prone, dan sulit divalidasi hanya dengan test contoh.

TLA+ bukan bahasa pemrograman. TLA+ adalah bahasa spesifikasi.

Cara berpikirnya berbeda dari Java:

Java asks: how do we execute this?
TLA+ asks: what behaviors are possible?

Itu membuat TLA+ sangat cocok untuk area yang sering menjadi sumber incident:

concurrent workflow transitions
idempotent commands
outbox/inbox messaging
retry and timeout
optimistic locking
scheduler race
state machine terminality
distributed coordination

Pada bagian ini, kita akan memakai contoh yang dekat dengan Java production system:

Case Command Processor
- menerima command ResolveCase
- harus idempotent
- memakai case status dan version
- menulis outbox event
- command bisa dikirim ulang
- dua worker bisa memproses case yang sama
- publisher bisa menerbitkan outbox event belakangan

Kita tidak akan mencoba memodelkan HTTP, JSON, JDBC, atau Kafka secara penuh. Kita akan memodelkan behavior correctness-nya.


1. TLA+ Mental Model

TLA+ melihat sistem sebagai sequence of states.

state0 -> state1 -> state2 -> state3 -> ...

Setiap transition dari satu state ke state berikutnya disebabkan oleh action.

Spesifikasi biasanya punya elemen inti:

CONSTANTS  = nilai yang ditentukan oleh model checker
VARIABLES  = state sistem
Init       = kondisi state awal
Next       = semua action yang boleh mengubah state
Spec       = Init dan selalu Next
Invariant  = properti yang harus selalu benar
Liveness   = properti yang akhirnya harus terjadi

Skeleton:

---- MODULE Example ----
EXTENDS Naturals, FiniteSets, Sequences

CONSTANTS Cases, Commands

VARIABLES status, processed, outbox

vars == << status, processed, outbox >>

Init ==
  /\ status = [c \in Cases |-> "OPEN"]
  /\ processed = {}
  /\ outbox = << >>

Next ==
  \/ SomeAction
  \/ SomeOtherAction

Spec == Init /\ [][Next]_vars

Invariant == TRUE
====

Makna [][Next]_vars secara praktis:

Setelah Init, setiap langkah berikutnya harus mengikuti salah satu action di Next atau stuttering step yang tidak mengubah vars.

Stuttering penting karena TLA+ memodelkan behavior temporal, dan sistem boleh “tidak melakukan apa-apa” pada satu step.


2. Jangan Modelkan Kode; Modelkan Risiko

Java implementation mungkin terlihat seperti ini:

@Transactional
public ResolveResult resolve(ResolveCaseCommand command) {
    IdempotencyRecord existing = idempotency.find(command.idempotencyKey());
    if (existing != null) {
        return existing.toResult();
    }

    CaseEntity entity = caseRepository.findByIdForUpdate(command.caseId());

    if (entity.status().isTerminal()) {
        ResolveResult result = ResolveResult.alreadyTerminal(entity.id());
        idempotency.save(command.idempotencyKey(), command.requestHash(), result);
        return result;
    }

    entity.resolve(command.resolution());
    caseRepository.save(entity);
    outbox.insert(CaseResolvedEvent.from(entity));

    ResolveResult result = ResolveResult.resolved(entity.id(), entity.version());
    idempotency.save(command.idempotencyKey(), command.requestHash(), result);
    return result;
}

TLA+ tidak perlu memodelkan:

class name
repository interface
JSON deserialization
JDBC statement
transaction annotation
logging format

TLA+ perlu memodelkan risiko:

same command processed twice
same case resolved twice
outbox event without committed state
command outcome changes across retries
two workers interleave
failure occurs between state update and outbox append

3. Domain Abstraction

Kita mulai dengan domain kecil.

Case status:
- OPEN
- RESOLVED

Command:
- command id
- case id

Kita abaikan detail resolution body. Kita fokus pada idempotency dan terminal event.

Desired properties:

P1: A case can have at most one CaseResolved event.
P2: A processed command id always maps to stable outcome.
P3: No CaseResolved event exists unless the case is RESOLVED.
P4: A RESOLVED case never returns to OPEN.

TLA+ state variables:

status     : case -> status
processed  : set of processed command ids
outcome    : command id -> outcome
outbox     : sequence of events

Nondeterministic environment:

which command arrives
whether duplicate command arrives
which worker processes it
whether publisher runs

4. First TLA+ Model: Naive Resolve

Model pertama sengaja salah. Tujuannya agar model checker menemukan bug.

---- MODULE CaseResolveNaive ----
EXTENDS Naturals, FiniteSets, Sequences

CONSTANTS Cases, Commands

VARIABLES status, processed, outcome, outbox

vars == << status, processed, outcome, outbox >>

Statuses == {"OPEN", "RESOLVED"}
Outcomes == {"RESOLVED", "ALREADY_TERMINAL"}

Init ==
  /\ status = [c \in Cases |-> "OPEN"]
  /\ processed = {}
  /\ outcome = [cmd \in Commands |-> "NONE"]
  /\ outbox = << >>

Resolve(cmd, c) ==
  /\ cmd \in Commands
  /\ c \in Cases
  /\ status[c] = "OPEN"
  /\ status' = [status EXCEPT ![c] = "RESOLVED"]
  /\ processed' = processed \cup {cmd}
  /\ outcome' = [outcome EXCEPT ![cmd] = "RESOLVED"]
  /\ outbox' = Append(outbox, [type |-> "CaseResolved", case |-> c, command |-> cmd])

RetryAlreadyProcessed(cmd) ==
  /\ cmd \in processed
  /\ UNCHANGED << status, processed, outcome, outbox >>

Next ==
  \/ \E cmd \in Commands, c \in Cases : Resolve(cmd, c)
  \/ \E cmd \in Commands : RetryAlreadyProcessed(cmd)

Spec == Init /\ [][Next]_vars

ResolvedEvents(c) ==
  {i \in 1..Len(outbox) : outbox[i].type = "CaseResolved" /\ outbox[i].case = c}

AtMostOneResolvedEvent ==
  \A c \in Cases : Cardinality(ResolvedEvents(c)) <= 1

TerminalNeverReopens ==
  \A c \in Cases : status[c] \in Statuses
====

Model ini terlihat aman karena Resolve hanya boleh berjalan saat status[c] = "OPEN". Namun ia belum memodelkan concurrency read/write race.

Jika kita hanya memodelkan atomic resolve, bug interleaving tidak muncul. Ini pelajaran penting:

If your model action is too atomic, you can hide the exact race you wanted to find.

5. Modeling Interleaving: Read Then Commit

Dalam sistem Java dengan dua worker, operasi tidak selalu satu langkah abstrak.

Realistic flow:

1. worker reads case
2. worker decides case is open
3. worker commits resolution
4. worker appends outbox event

Kalau dua worker membaca sebelum salah satu commit, keduanya bisa berpikir case masih open.

Kita tambahkan state inflight.

inflight[worker] = case read by worker and decision captured

Model:

---- MODULE CaseResolveRace ----
EXTENDS Naturals, FiniteSets, Sequences

CONSTANTS Cases, Commands, Workers

VARIABLES status, processed, outcome, outbox, inflight

vars == << status, processed, outcome, outbox, inflight >>

Statuses == {"OPEN", "RESOLVED"}
Outcomes == {"NONE", "RESOLVED", "ALREADY_TERMINAL"}

NoWork == [cmd |-> "NONE", case |-> "NONE", observed |-> "NONE"]

Init ==
  /\ status = [c \in Cases |-> "OPEN"]
  /\ processed = {}
  /\ outcome = [cmd \in Commands |-> "NONE"]
  /\ outbox = << >>
  /\ inflight = [w \in Workers |-> NoWork]

BeginResolve(w, cmd, c) ==
  /\ w \in Workers
  /\ cmd \in Commands
  /\ c \in Cases
  /\ inflight[w] = NoWork
  /\ cmd \notin processed
  /\ inflight' = [inflight EXCEPT ![w] = [cmd |-> cmd, case |-> c, observed |-> status[c]]]
  /\ UNCHANGED << status, processed, outcome, outbox >>

CommitResolveNaive(w) ==
  LET work == inflight[w] IN
  /\ w \in Workers
  /\ work # NoWork
  /\ work.observed = "OPEN"
  /\ status' = [status EXCEPT ![work.case] = "RESOLVED"]
  /\ processed' = processed \cup {work.cmd}
  /\ outcome' = [outcome EXCEPT ![work.cmd] = "RESOLVED"]
  /\ outbox' = Append(outbox, [type |-> "CaseResolved", case |-> work.case, command |-> work.cmd])
  /\ inflight' = [inflight EXCEPT ![w] = NoWork]

CommitAlreadyTerminal(w) ==
  LET work == inflight[w] IN
  /\ w \in Workers
  /\ work # NoWork
  /\ work.observed = "RESOLVED"
  /\ processed' = processed \cup {work.cmd}
  /\ outcome' = [outcome EXCEPT ![work.cmd] = "ALREADY_TERMINAL"]
  /\ inflight' = [inflight EXCEPT ![w] = NoWork]
  /\ UNCHANGED << status, outbox >>

Next ==
  \/ \E w \in Workers, cmd \in Commands, c \in Cases : BeginResolve(w, cmd, c)
  \/ \E w \in Workers : CommitResolveNaive(w)
  \/ \E w \in Workers : CommitAlreadyTerminal(w)

Spec == Init /\ [][Next]_vars

ResolvedEvents(c) ==
  {i \in 1..Len(outbox) : outbox[i].type = "CaseResolved" /\ outbox[i].case = c}

AtMostOneResolvedEvent ==
  \A c \in Cases : Cardinality(ResolvedEvents(c)) <= 1
====

Dengan Cases = {c1}, Commands = {cmd1, cmd2}, dan Workers = {w1, w2}, TLC dapat menemukan interleaving seperti:

w1 begins resolve c1, observes OPEN
w2 begins resolve c1, observes OPEN
w1 commits, appends event
w2 commits, appends event
AtMostOneResolvedEvent violated

Ini bug desain. Unit test happy path tidak akan menemukannya.


6. Fixing the Model: Optimistic Commit Check

Di Java, salah satu fix adalah optimistic locking/version check atau conditional update.

Abstract behavior:

A worker may decide based on old read, but commit must check current state/version.

Ubah commit action:

CommitResolveWithCheck(w) ==
  LET work == inflight[w] IN
  /\ w \in Workers
  /\ work # NoWork
  /\ work.observed = "OPEN"
  /\ status[work.case] = "OPEN"
  /\ status' = [status EXCEPT ![work.case] = "RESOLVED"]
  /\ processed' = processed \cup {work.cmd}
  /\ outcome' = [outcome EXCEPT ![work.cmd] = "RESOLVED"]
  /\ outbox' = Append(outbox, [type |-> "CaseResolved", case |-> work.case, command |-> work.cmd])
  /\ inflight' = [inflight EXCEPT ![w] = NoWork]

Lalu tambahkan action untuk stale decision:

CommitStaleDecision(w) ==
  LET work == inflight[w] IN
  /\ w \in Workers
  /\ work # NoWork
  /\ work.observed = "OPEN"
  /\ status[work.case] = "RESOLVED"
  /\ processed' = processed \cup {work.cmd}
  /\ outcome' = [outcome EXCEPT ![work.cmd] = "ALREADY_TERMINAL"]
  /\ inflight' = [inflight EXCEPT ![w] = NoWork]
  /\ UNCHANGED << status, outbox >>

Sekarang AtMostOneResolvedEvent seharusnya aman untuk model kecil.

Java mapping:

UPDATE case_table
SET status = 'RESOLVED', version = version + 1
WHERE case_id = :caseId
  AND status = 'OPEN'
  AND version = :observedVersion;

atau:

int updated = caseRepository.resolveIfOpen(caseId, observedVersion, resolution);

if (updated == 1) {
    outbox.insert(CaseResolvedEvent.of(caseId, observedVersion + 1));
    return resolved();
}

return alreadyTerminalOrConflict();

Critical detail:

outbox insert must happen in the same transaction as the successful state update.

Otherwise the model fix does not refine the implementation.


7. Idempotency Is Not Just “Already Processed”

The model above only partly handles command idempotency.

A stronger idempotency model needs:

command id
request hash
stored outcome
side effect count

Properties:

same command id never maps to two outcomes
same command id never creates two resolved events
same command id with different request hash is rejected

TLA abstraction:

CONSTANTS Cases, Commands, RequestHashes, Workers

VARIABLES status, processed, requestHash, outcome, outbox, inflight

Invariant:

StableProcessedOutcome ==
  \A cmd \in processed : outcome[cmd] # "NONE"

But this is weak. It only says processed commands have outcomes.

Stronger property needs history. Add outcomeHistory:

VARIABLES outcomeHistory

Whenever an outcome is recorded:

outcomeHistory' = Append(outcomeHistory, [cmd |-> work.cmd, result |-> "RESOLVED"])

Invariant:

NoTwoOutcomesForSameCommand ==
  \A cmd \in Commands :
    Cardinality({r \in Outcomes :
      \E i \in 1..Len(outcomeHistory) :
        outcomeHistory[i].cmd = cmd /\ outcomeHistory[i].result = r}) <= 1

This is common in TLA+ modeling:

If you want to check history-dependent properties, explicitly model history.

Do not keep history in production just because the model does. The model history is a verification device.


8. Modeling Outbox Atomicity

A common mistake:

Update aggregate in one transaction, insert outbox in another transaction.

Bug window:

state changed but no event exists

Model the bad design:

CommitStateOnly(w) ==
  LET work == inflight[w] IN
  /\ work # NoWork
  /\ status[work.case] = "OPEN"
  /\ status' = [status EXCEPT ![work.case] = "RESOLVED"]
  /\ processed' = processed \cup {work.cmd}
  /\ outcome' = [outcome EXCEPT ![work.cmd] = "RESOLVED"]
  /\ inflight' = [inflight EXCEPT ![w] = NoWork]
  /\ UNCHANGED outbox

Then model a crash or simply stop taking steps. Safety invariant:

ResolvedHasOutbox ==
  \A c \in Cases :
    status[c] = "RESOLVED" =>
      \E i \in 1..Len(outbox) : outbox[i].case = c /\ outbox[i].type = "CaseResolved"

This invariant will fail after CommitStateOnly.

The model teaches:

The atomic unit is not “case row update”.
The atomic unit is “case row update + outbox row insert”.

Correct abstract action:

CommitStateAndOutbox(w) ==
  LET work == inflight[w] IN
  /\ work # NoWork
  /\ status[work.case] = "OPEN"
  /\ status' = [status EXCEPT ![work.case] = "RESOLVED"]
  /\ processed' = processed \cup {work.cmd}
  /\ outcome' = [outcome EXCEPT ![work.cmd] = "RESOLVED"]
  /\ outbox' = Append(outbox, [type |-> "CaseResolved", case |-> work.case, command |-> work.cmd])
  /\ inflight' = [inflight EXCEPT ![w] = NoWork]

Java mapping:

@Transactional
public ResolveResult resolve(ResolveCaseCommand command) {
    int updated = caseRepository.resolveIfOpen(command.caseId(), command.observedVersion());

    if (updated == 1) {
        outboxRepository.insert(new CaseResolvedOutboxEvent(...));
        idempotencyRepository.record(command.key(), RESOLVED);
        return ResolveResult.resolved();
    }

    idempotencyRepository.record(command.key(), ALREADY_TERMINAL);
    return ResolveResult.alreadyTerminal();
}

Test mapping:

integration test verifies case update and outbox insert commit/rollback together
property test generates duplicate resolve commands
model counterexample becomes regression test
production metric tracks resolved cases without outbox event

9. TLC Configuration

TLC checks finite models. You provide small finite sets.

Example .cfg:

SPECIFICATION Spec

CONSTANTS
  Cases = {c1}
  Commands = {cmd1, cmd2}
  Workers = {w1, w2}

INVARIANTS
  AtMostOneResolvedEvent
  ResolvedHasOutbox

Small scope is not a weakness. For many concurrency bugs, two workers and one entity are enough.

If a design fails with 1 case and 2 workers, it will fail in production.

Increase scope gradually:

1 case, 2 commands, 2 workers
2 cases, 2 commands, 2 workers
2 cases, 3 commands, 3 workers

Watch state explosion.


10. Reading Counterexamples

A TLC counterexample is a trace of states/actions.

Treat it like a production incident replay.

Ask:

What was the first state where the invariant became doomed?
What action created the unsafe condition?
Was the action unrealistic or correctly modeled?
Did we omit a guard?
Did we make an action too atomic or not atomic enough?
Did the invariant overstate the requirement?

Counterexample classification:

TypeMeaningResponse
Real design bugModel found valid bad behaviorFix design
Missing environment constraintModel allowed impossible behaviorAdd assumption/guard
Over-strong invariantRequirement was too strictRefine property
Under-abstracted modelToo much irrelevant stateSimplify
Over-abstracted modelHidden important detailAdd state/action

Do not silence counterexamples until you understand them.


11. Safety Properties for Java Services

Useful TLA+ invariants for Java service design:

11.1 No Duplicate Terminal Event

NoDuplicateTerminalEvent ==
  \A c \in Cases : Cardinality(ResolvedEvents(c)) <= 1

Java enforcement:

unique index on event semantic key
conditional update
idempotency key
consumer deduplication

11.2 Event Corresponds to State

NoEventWithoutResolvedState ==
  \A i \in 1..Len(outbox) :
    outbox[i].type = "CaseResolved" => status[outbox[i].case] = "RESOLVED"

Java enforcement:

same transaction for state and outbox
payload generated from persisted aggregate version
transactional tests

11.3 Terminal State Immutable

TerminalStateValid ==
  \A c \in Cases : status[c] \in {"OPEN", "RESOLVED"}

Better with history:

NeverReopened ==
  \A c \in Cases :
    WasResolved(c) => status[c] = "RESOLVED"

Where WasResolved uses event/state history.

11.4 Stable Idempotency Outcome

StableOutcome ==
  \A cmd \in Commands :
    cmd \in processed => outcome[cmd] # "NONE"

Stronger with history:

NoOutcomeChange ==
  \A cmd \in Commands :
    Cardinality(OutcomesObservedFor(cmd)) <= 1

12. Liveness and Fairness

Safety says bad things never happen. Liveness says good things eventually happen.

Example:

Every outbox event is eventually published.

TLA+ style:

VARIABLES outbox, published

Publish(i) ==
  /\ i \in 1..Len(outbox)
  /\ i \notin published
  /\ published' = published \cup {i}
  /\ UNCHANGED outbox

AllOutboxEventuallyPublished ==
  \A i \in 1..Len(outbox) : <> (i \in published)

But this property is not true unless publishing action is fairly scheduled.

Fairness says, roughly:

If an action remains continuously enabled, it eventually occurs.

Spec with weak fairness:

Spec == Init /\ [][Next]_vars /\ WF_vars(PublishNext)

Practical Java interpretation:

If publisher keeps running and broker/database eventually respond, queued events should drain.

Without these assumptions, liveness is false in any real system. The publisher can be stopped forever. The broker can be down forever. The database can be unavailable forever.

So in engineering docs, write liveness properties with assumptions:

Assuming publisher is scheduled, database is available, and broker eventually accepts writes, every committed outbox row is eventually published or moved to a terminal dead-letter state.

13. Mapping TLA+ to Java Implementation

Use a mapping table.

TLA+ ConceptJava/Production Artifact
status[c]case_table.status
processedidempotency_key table
outcome[cmd]stored idempotency outcome/result reference
outboxoutbox_event table
publishedbroker ack + outbox publication marker
BeginResolveload aggregate / read status/version
CommitResolveWithCheckconditional update in DB transaction
AtMostOneResolvedEventunique semantic event constraint + tests
ResolvedHasOutboxtransactional outbox integration test
liveness publisher propertyoutbox lag SLO and dead-letter policy

This table is not bureaucracy. It prevents a common failure:

The model is correct, but nobody knows which part of Java enforces it.

14. Turning Counterexamples into JUnit Tests

Suppose TLC found:

w1 reads OPEN
w2 reads OPEN
w1 commits resolved event
w2 commits resolved event

Translate to a deterministic test harness.

@Test
void twoConcurrentResolveAttemptsMustProduceSingleResolvedEvent() throws Exception {
    CaseId caseId = fixtures.openCase();

    CyclicBarrier bothRead = new CyclicBarrier(2);
    CountDownLatch allowCommit = new CountDownLatch(1);

    ResolveWorker worker1 = new ResolveWorker(caseId, bothRead, allowCommit);
    ResolveWorker worker2 = new ResolveWorker(caseId, bothRead, allowCommit);

    Future<ResolveResult> f1 = executor.submit(worker1);
    Future<ResolveResult> f2 = executor.submit(worker2);

    bothRead.await(5, TimeUnit.SECONDS);
    allowCommit.countDown();

    ResolveResult r1 = f1.get(5, TimeUnit.SECONDS);
    ResolveResult r2 = f2.get(5, TimeUnit.SECONDS);

    assertThat(List.of(r1, r2))
        .filteredOn(ResolveResult::resolved)
        .hasSize(1);

    assertThat(outboxRepository.findByCaseId(caseId))
        .filteredOn(event -> event.type().equals("CaseResolved"))
        .hasSize(1);
}

The exact test implementation depends on repository seams. The important part:

The test reproduces the interleaving from the formal counterexample.

15. TLA+ and Property-Based Testing Together

TLA+ explores abstract state space. jqwik explores generated concrete test data.

They complement each other.

TLA+ property:

No duplicate terminal event under any modeled interleaving.

jqwik property:

For generated command sequences with duplicate ids and random order, Java implementation maintains domain invariant.

Example structure:

@Property
void generatedResolveCommandsDoNotCreateDuplicateTerminalEvents(
        @ForAll("resolveCommandSequences") List<ResolveCaseCommand> commands) {

    CaseId caseId = fixtures.openCase();

    for (ResolveCaseCommand command : commands) {
        service.resolve(command.withCaseId(caseId));
    }

    assertThat(outboxRepository.caseResolvedEvents(caseId)).hasSizeLessThanOrEqualTo(1);
}

TLA+ gives confidence about behavior shape. Property-based tests check implementation over more data combinations.


16. Modeling Time Carefully

Time is dangerous in models.

Bad model:

time increases automatically and continuously

Better model:

Tick is an action
scheduler behavior is explicit
SLA threshold is abstracted into small finite values

Example:

VARIABLES age, status, escalated

Tick(c) ==
  /\ c \in Cases
  /\ age' = [age EXCEPT ![c] = age[c] + 1]
  /\ UNCHANGED << status, escalated >>

Escalate(c) ==
  /\ c \in Cases
  /\ status[c] = "ASSIGNED"
  /\ age[c] >= SlaLimit
  /\ escalated' = escalated \cup {c}
  /\ UNCHANGED << status, age >>

Java mapping:

Clock injection
scheduler test harness
assignment epoch
SLA pause/resume state
idempotent escalation command

Do not model wall-clock details unless wall-clock behavior is the risk.


17. State Explosion Control

TLC explores state spaces. State space can explode.

Control techniques:

use small finite constants
model only relevant state
avoid unbounded sequences when possible
abstract payloads into symbols
reduce number of workers/entities
split specs by concern
check safety before liveness
add symmetry when appropriate

Bad:

Cases = 1000
Commands = 10000
Workers = 50
full event payload records
full user objects
full timestamp values

Good:

Cases = {c1, c2}
Commands = {cmd1, cmd2, cmd3}
Workers = {w1, w2}
Status = symbolic values
Time = 0..2

Remember:

A concurrency design that is wrong often fails with tiny numbers.

18. Modeling Checklist for Java Services

Before writing a TLA+ spec:

What production incident are we trying to prevent?
What is the smallest subsystem that contains the risk?
What are the state variables?
What actions change state?
Which actions are concurrent?
Which actions can retry?
Which actions can fail halfway?
Which operation must be atomic?
What invariant would be expensive to violate?
What progress condition matters?

While writing:

Are actions too atomic?
Are actions not atomic enough?
Are duplicate messages modeled?
Are stale reads modeled?
Are crash windows modeled?
Are terminal states modeled?
Are environment assumptions explicit?

After TLC passes:

Did we run with at least two workers where concurrency matters?
Did we try at least two commands where duplication matters?
Did we check non-trivial invariants?
Did we document what the model omits?
Did we map each invariant to Java enforcement?
Did we create tests for previous counterexamples?

19. What Not to Do

19.1 Do Not Translate Java Line-by-Line

TLA+ is not pseudo-Java. If the spec has the same accidental complexity as implementation, it loses value.

19.2 Do Not Trust a Passing Spec Blindly

Passing TLC means:

No violation found in this finite model with these properties.

It does not mean production is correct.

19.3 Do Not Hide Retry

If production has retry but model has no retry, model is probably too optimistic.

19.4 Do Not Hide Duplicate Delivery

If message brokers, HTTP clients, or job schedulers can duplicate execution, model duplicates.

19.5 Do Not Skip Mapping to Implementation

A formal model with no Java mapping becomes a detached artifact.


20. Minimal Repository Layout

Recommended layout:

/specs/tla/case-resolve
  CaseResolve.tla
  CaseResolve.cfg
  README.md
/src/test/java/.../CaseResolveConcurrencyTest.java
/src/test/java/.../CaseResolvePropertyTest.java
/docs/adr/ADR-021-idempotent-case-resolution.md

README should include:

Purpose
State variables
Actions
Invariants
Liveness assumptions
How to run TLC
Known omissions
Counterexamples found
Java mapping
Related tests

Example README section:

## Java Mapping

- `status[c]` maps to `case_table.status`.
- `processed` maps to `idempotency_key.command_id`.
- `outbox` maps to `outbox_event`.
- `CommitResolveWithCheck` maps to `resolveIfOpen(caseId, observedVersion)`.
- `AtMostOneResolvedEvent` is enforced by conditional update and unique semantic event key.

21. From TLA+ to Production Evidence

A strong engineering loop does not stop at model checking.

Production metrics linked to this model:

resolved_case_without_outbox_count
case_resolved_event_duplicate_count
idempotency_key_conflict_count
outbox_publish_lag_seconds
resolve_optimistic_lock_conflict_count
stale_command_attempt_count

Logs/events:

command id
case id
observed version
committed version
idempotency key
outbox event id
worker id
correlation id

Now the model is not just design-time assurance. It shapes runtime observability.


22. A Practical TLA+ Adoption Path

Do not start by mandating TLA+ for every team.

Start with one high-value design.

Week 1: pick one painful stateful/concurrent risk
Week 1: write invariant catalog
Week 2: write small TLA+ model
Week 2: run TLC and collect counterexamples
Week 3: patch Java design and tests
Week 3: document mapping and assumptions
Week 4: use spec in design review

Good first candidates:

idempotent command processing
outbox correctness
workflow terminality
lease ownership
scheduler duplicate execution
approval separation-of-duties handoff

Poor first candidates:

entire platform architecture
all business rules
large database schema
full Kafka topology
all HTTP endpoints

Start small. Win with one real bug. Then expand.


23. Closing Mental Model

TLA+ helps Java engineers because production systems are not just code. They are possible behaviors.

Java implementation gives one execution at a time. TLA+ explores many executions in a small model.

The value is not the syntax. The value is the shift:

from example thinking
to state-space thinking

from “this path works”
to “no allowed path violates the invariant”

from “we think retry is safe”
to “duplicate and stale retries were modeled”

from “outbox exists”
to “state + outbox atomicity is enforced”

Use TLA+ when the cost of a design mistake is higher than the cost of modeling.

For top-level engineering maturity, that is often the difference between:

finding a concurrency bug in a model trace today

and:

explaining a production incident months later

24. What Comes Next

Part 021 will cover PlusCal and algorithmic specifications.

TLA+ is powerful but mathematically shaped. PlusCal gives an algorithm/process-oriented way to write specs and translate them to TLA+.

That is useful when the system behavior naturally looks like:

workers
queues
locks
retry loops
publishers
consumers
scheduler processes
Lesson Recap

You just completed lesson 20 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.