
Idempotency, Retry, Reconciliation, and Exactly-Once Illusions

Learn Java Large Scale ERP - Part 023

Idempotency, retry, reconciliation, and exactly-once illusions for large-scale Java ERP integration and transaction processing.

[2026-07-01] 26 min read, 5136 words



1. Why This Part Matters

Large ERP systems fail less often because a single method throws an exception and more often because one business fact crosses a boundary twice, zero times, too late, in the wrong order, or without evidence.

Typical incidents look like this:

a supplier invoice is imported twice;
a payment request is retried after timeout and the bank executes both attempts;
a warehouse shipment event arrives before the sales order allocation event;
a customer payment is received but not matched to the AR invoice;
an inventory adjustment succeeds locally but fails to reach the finance subledger;
a nightly batch is re-run and posts another copy of the same journals;
a message broker redelivers a message after the consumer already committed the database transaction;
a user clicks submit twice and creates two purchase orders;
an API caller times out and resends the same command with a different request id;
an integration team says “the broker guarantees delivery”, but no one can prove whether the ERP processed the business fact.

In a small CRUD system, duplicate processing is annoying. In a large ERP, duplicate processing can create financial misstatement, inventory mismatch, broken audit trail, month-end close delay, legal sequence gaps, and customer/vendor disputes.

The mental model for this part is:

In ERP, reliability is not merely a transport property. Reliability is a business property proven by durable intent, idempotent execution, controlled retry, explicit exception handling, and reconciliation evidence.

A top-tier ERP engineer does not ask only:

Did the HTTP call return 200?

They ask:

Can we prove which business fact was intended, whether it was accepted, whether it was processed once in effect, whether downstream systems observed it, whether duplicates were suppressed, and whether exceptions are reconcilable?

2. Kaufman Skill Deconstruction

Following Kaufman's method, this skill must be decomposed into small capabilities that can be practiced independently.

Sub-skill	What You Need to Learn	Failure If Ignored
Delivery semantics	At-most-once, at-least-once, effective-once, ordering, replay	Engineers believe the infrastructure solves business correctness
Idempotency design	How to design commands, events, imports, projections, and workflows to tolerate duplicates	Duplicate invoices, duplicate postings, duplicate payments
Durable intent	Outbox, inbox, command log, request ledger, processing ledger	Work disappears between database commit and message publish
Retry classification	Transient, permanent, business, security, capacity, and poison failures	Retry storms and silent data corruption
Reconciliation	Count, sum, hash, status, aging, exception queues, control accounts	Month-end discovers mismatch too late
Compensation	Reversal, cancellation, correction, adjustment, voiding, settlement repair	Illegal deletes and non-auditable fixes
Observability	Correlation id, causation id, business key, attempt id, replay id	Support cannot trace what happened
Operational control	DLQ, backoff, quarantine, replay, manual resolution, runbooks	Operators either do nothing or mutate production data unsafely
Testing	Duplicate, reorder, timeout, partial commit, replay, concurrent retry	Happy-path integration passes but production fails

The 20-hour practice target is not to memorize patterns. It is to become able to look at any ERP integration and answer:

What is the business fact?
Who owns it?
What is its stable identity?
What is the idempotency key?
What durable log proves intent and processing?
Which failures are retriable?
Which failures must be quarantined?
How do we reconcile source, ERP, and destination?
What is the legal/audit correction path?
How do we prove it worked?

3. The Exactly-Once Illusion

The phrase “exactly once” is dangerous because people use it at different layers.

Layer	What “Once” Might Mean	Why It Is Not Enough
Network	Packet sent once	Network can lose, duplicate, delay, or reorder
HTTP client	Request attempted once	Client timeout may hide server-side success
Broker	Message delivered once	Broker guarantees rarely cover consumer database side effects
Consumer	Handler invoked once	Handler may crash after side effect and before acknowledgment
Database	Row inserted once	Business effect may be duplicate under different key
Business	Invoice/payment/posting has one effective outcome	This is the only level ERP users care about

The correct ERP objective is usually effective-once business processing:

The same business intent may be transmitted, retried, redelivered, re-read, or replayed many times, but it produces one accepted business outcome or one controlled rejection with evidence.

That is different from pretending duplicates never happen.

3.1 Failure Timeline: The Classic Duplicate

The important property is not that every component processed the fact exactly once. The important property is that every component had a stable key and made an explicit duplicate decision.

4. Delivery Semantics You Must Model Explicitly

4.1 At-Most-Once

At-most-once means the system tries once and does not retry.

It avoids duplicates but allows loss.

Useful for:

telemetry;
non-critical notifications;
transient UI hints;
low-value cache invalidation.

Dangerous for:

invoice posting;
stock movement;
payment execution;
tax submission;
approval decision;
legal document generation.

4.2 At-Least-Once

At-least-once means the system retries until it believes processing happened.

It reduces loss but creates duplicates.

This is common in messaging systems and integration pipelines. For ERP, it is acceptable only when consumers are idempotent.

4.3 Exactly-Once Infrastructure

Some infrastructure offers exactly-once processing under specific constraints. That does not remove the need for business idempotency.

Reasons:

external systems may not participate in the same transaction;
users may submit duplicate business commands;
batch jobs may be re-run;
messages may be replayed for recovery;
business keys may be unstable;
different transports may deliver the same fact;
manual repair may reintroduce a previously processed record.

4.4 Effective-Once Business Outcome

This is the ERP target.

Effective-once means:

stable business intent identity;
durable acceptance record;
deterministic duplicate handling;
side-effect ledger;
state transition guard;
reconciliation evidence;
auditable correction path.

5. ERP Idempotency Taxonomy

Not all idempotency is the same. An ERP platform needs multiple idempotency scopes.

Scope	Example	Key	Expected Duplicate Behavior
API command	Submit purchase requisition	`tenant + requester + clientRequestId`	Return original accepted command result
Business document	Supplier invoice import	`vendor + invoiceNumber + invoiceDate + legalEntity`	Reject duplicate or link to existing invoice
Posting	GL posting request	`sourceDocumentType + sourceDocumentId + postingPurpose`	Return existing journal
Payment	Bank payment execution	`paymentInstructionId`	Return same bank execution reference
Event projection	Update read model from event	`eventId` or `aggregateId + version`	Skip already applied event
Workflow task	Approver decision	`taskId + actor + decisionSequence`	Return already recorded decision
Batch import	CSV file import row	`fileId + rowNumber + rowHash`	Skip, reject, or mark duplicate
Reconciliation run	Daily bank recon	`account + statementDate + statementId`	Continue or return existing run
External callback	Payment gateway webhook	`provider + webhookId`	Acknowledge duplicate without reprocessing

The mistake is to rely on only one key type. A robust ERP usually combines:

technical event id;
source system id;
natural business key;
document identity;
version;
semantic purpose;
hash/fingerprint;
legal entity/tenant scope.

6. The ERP Idempotency Contract

Every command/event/import should answer these questions.

Question	Why It Matters
What is the idempotency key?	Duplicate suppression cannot work without stable identity
Who generated the key?	Client-generated and server-generated keys have different risks
What is the key scope?	A vendor invoice number may be unique only per vendor and legal entity
What is the retention period?	Duplicate may arrive weeks later during replay or partner re-send
What response is returned for duplicate?	API callers need deterministic behavior
Is the payload allowed to differ for same key?	Detect accidental key reuse or data tampering
Is duplicate accepted, rejected, linked, or ignored?	Business semantics differ by domain
What is logged as evidence?	Audit needs the reason for duplicate decision

6.1 Idempotency Key Design

Good idempotency keys are:

stable across retry;
scoped by tenant/legal entity/source;
generated before side effects;
stored durably;
linked to payload fingerprint;
retained long enough for operational replay;
visible to support users;
included in logs, traces, and reconciliation reports.

Bad idempotency keys are:

random per retry;
based on database auto-increment id created after accepting the command;
based only on timestamp;
based only on user id;
hidden in code;
not persisted;
not included in downstream calls;
not protected by a unique constraint.

6.2 Payload Fingerprint

If the same key arrives with a different payload, it is not a harmless duplicate. It may be:

client bug;
replay with corrupted data;
malicious tampering;
source system reusing keys incorrectly;
user resubmitting a changed document under old request id.

A command ledger should record a payload fingerprint.

create table erp_command_request (
    tenant_id           varchar(64) not null,
    idempotency_key     varchar(128) not null,
    command_type        varchar(80) not null,
    payload_hash        varchar(128) not null,
    status              varchar(40) not null,
    result_type         varchar(80),
    result_id           varchar(128),
    rejection_code      varchar(80),
    created_at          timestamp not null,
    completed_at        timestamp,
    created_by          varchar(128) not null,
    primary key (tenant_id, idempotency_key)
);

Duplicate behavior:

Existing Status	Same Payload	Different Payload
`ACCEPTED`	Return accepted status	Reject as idempotency conflict
`COMPLETED`	Return original result	Reject as idempotency conflict
`REJECTED`	Return original rejection	Reject or return original rejection based on policy
`IN_PROGRESS`	Return pending or retry later	Reject as conflict
`UNKNOWN`	Quarantine	Quarantine
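
The duplicate behavior table can be expressed as a small pure function. The following is a minimal sketch, not a definitive implementation: the names `DuplicateDecisionSketch`, `Status`, `Decision`, and `decide` are hypothetical, and the policy-dependent case (`REJECTED` with a different payload) is resolved here as a conflict, which is only one of the two options the table allows.

```java
final class DuplicateDecisionSketch {

    // Ledger statuses taken from the duplicate behavior table.
    enum Status { ACCEPTED, COMPLETED, REJECTED, IN_PROGRESS, UNKNOWN }

    // Hypothetical decision names; a real API would map them to responses.
    enum Decision {
        RETURN_ACCEPTED, RETURN_ORIGINAL_RESULT, RETURN_ORIGINAL_REJECTION,
        RETURN_PENDING, REJECT_CONFLICT, QUARANTINE
    }

    static Decision decide(Status existing, String storedHash, String incomingHash) {
        if (existing == Status.UNKNOWN) {
            // Unknown state is quarantined regardless of payload.
            return Decision.QUARANTINE;
        }
        if (!storedHash.equals(incomingHash)) {
            // Assumption: REJECTED + different payload is also treated as a conflict.
            return Decision.REJECT_CONFLICT;
        }
        switch (existing) {
            case ACCEPTED:    return Decision.RETURN_ACCEPTED;
            case COMPLETED:   return Decision.RETURN_ORIGINAL_RESULT;
            case REJECTED:    return Decision.RETURN_ORIGINAL_REJECTION;
            case IN_PROGRESS: return Decision.RETURN_PENDING;
            default:          throw new IllegalStateException("Unhandled status " + existing);
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(Status.COMPLETED, "abc", "abc"));
        System.out.println(decide(Status.COMPLETED, "abc", "xyz"));
        System.out.println(decide(Status.UNKNOWN, "abc", "abc"));
    }
}
```

Keeping the decision free of I/O makes every cell of the table directly unit-testable.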

7. Durable Intent: Outbox, Inbox, and Processing Ledger

The hardest reliability bug is the gap between:

committing local database changes; and
publishing a message or calling an external service.

If the database commit succeeds but publish fails, the world never hears about the business fact. If publish succeeds but database commit fails, the world hears about a fact that does not exist.

7.1 Outbox Pattern

The outbox records integration intent inside the same database transaction as the business change. A separate publisher safely sends it later.

Outbox table example:

create table erp_outbox_event (
    outbox_id           uuid primary key,
    tenant_id           varchar(64) not null,
    aggregate_type      varchar(80) not null,
    aggregate_id        varchar(128) not null,
    aggregate_version   bigint not null,
    event_type          varchar(120) not null,
    event_key           varchar(200) not null,
    causation_id        varchar(128),
    correlation_id      varchar(128),
    payload_json        text not null,
    payload_hash        varchar(128) not null,
    status              varchar(40) not null,
    attempts            int not null default 0,
    next_attempt_at     timestamp,
    created_at          timestamp not null,
    published_at        timestamp,
    unique (tenant_id, event_key)
);

Design rules:

write outbox inside the same transaction as the source fact;
make event identity stable;
include aggregate version or source version;
publish asynchronously;
retry publishing safely;
do not delete outbox rows immediately;
expose outbox backlog as operational metric;
do not treat outbox published as downstream processed.
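
The design rules can be illustrated with an in-memory relay. This is a sketch under stated assumptions: `OutboxRelaySketch`, `OutboxRow`, `append`, `relayOnce`, and `backlog` are hypothetical names, a `List` stands in for the `erp_outbox_event` table, and the `Predicate` stands in for a broker client that returns `true` only when the broker confirmed the publish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class OutboxRelaySketch {

    enum Status { PENDING, PUBLISHED }

    static final class OutboxRow {
        final String eventKey;
        final String payloadJson;
        Status status = Status.PENDING;
        int attempts = 0;

        OutboxRow(String eventKey, String payloadJson) {
            this.eventKey = eventKey;
            this.payloadJson = payloadJson;
        }
    }

    private final List<OutboxRow> rows = new ArrayList<>();

    // In a real system this append happens inside the business transaction.
    void append(String eventKey, String payloadJson) {
        rows.add(new OutboxRow(eventKey, payloadJson));
    }

    // One relay pass over all pending rows; returns how many were confirmed.
    int relayOnce(Predicate<OutboxRow> publisher) {
        int published = 0;
        for (OutboxRow row : rows) {
            if (row.status != Status.PENDING) {
                continue;
            }
            row.attempts++;
            if (publisher.test(row)) {
                row.status = Status.PUBLISHED; // the row is kept, not deleted
                published++;
            }
        }
        return published;
    }

    // Operational metric: rows still waiting to be published.
    long backlog() {
        return rows.stream().filter(r -> r.status == Status.PENDING).count();
    }

    public static void main(String[] args) {
        OutboxRelaySketch outbox = new OutboxRelaySketch();
        outbox.append("invoice:INV-1:v1", "{}");
        outbox.append("invoice:INV-2:v1", "{}");
        System.out.println("published=" + outbox.relayOnce(row -> false)
                + " backlog=" + outbox.backlog());
        System.out.println("published=" + outbox.relayOnce(row -> true)
                + " backlog=" + outbox.backlog());
    }
}
```

A relay of this shape is at-least-once: if the process crashes after the broker confirms but before the status update is saved, the row is published again, which is why consumers still need their own duplicate suppression.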

7.2 Inbox Pattern

The inbox records received messages before or during processing so duplicates can be suppressed.

create table erp_inbox_message (
    tenant_id           varchar(64) not null,
    source_system       varchar(80) not null,
    message_id          varchar(160) not null,
    message_type        varchar(120) not null,
    payload_hash        varchar(128) not null,
    status              varchar(40) not null,
    received_at         timestamp not null,
    processed_at        timestamp,
    failure_code        varchar(80),
    failure_message     text,
    primary key (tenant_id, source_system, message_id)
);

Consumer rule:

receive message;
compute message key and payload hash;
insert inbox row with unique key;
if duplicate with same hash, acknowledge and skip or return original result;
if duplicate with different hash, quarantine;
process business side effect;
mark inbox processed;
publish further outbox events if needed.
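
The consumer rule can be sketched in memory as follows. The names `InboxConsumerSketch`, `Outcome`, and `receive` are hypothetical, and `ConcurrentHashMap.putIfAbsent` is used here only as a stand-in for an `INSERT` guarded by the inbox primary key. In a real consumer the inbox insert and the business side effect would commit in one database transaction; the `remove` on failure only imitates that rollback.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InboxConsumerSketch {

    enum Outcome { PROCESSED, DUPLICATE_SKIPPED, QUARANTINED }

    // Stand-in for erp_inbox_message: message key -> payload hash.
    private final Map<String, String> inbox = new ConcurrentHashMap<>();

    Outcome receive(String tenantId, String sourceSystem, String messageId,
                    String payloadHash, Runnable businessSideEffect) {
        String key = tenantId + "|" + sourceSystem + "|" + messageId;

        // Atomic insert-if-absent plays the role of the unique constraint.
        String previousHash = inbox.putIfAbsent(key, payloadHash);
        if (previousHash != null) {
            return previousHash.equals(payloadHash)
                    ? Outcome.DUPLICATE_SKIPPED   // same message again: acknowledge and skip
                    : Outcome.QUARANTINED;        // same id, different content: do not process
        }

        try {
            businessSideEffect.run();
            return Outcome.PROCESSED;
        } catch (RuntimeException failure) {
            inbox.remove(key); // imitate transaction rollback so a retry can be processed
            throw failure;
        }
    }

    public static void main(String[] args) {
        InboxConsumerSketch consumer = new InboxConsumerSketch();
        int[] postings = {0};
        System.out.println(consumer.receive("t1", "WMS-EU-01", "m-1", "h1", () -> postings[0]++));
        System.out.println(consumer.receive("t1", "WMS-EU-01", "m-1", "h1", () -> postings[0]++));
        System.out.println(consumer.receive("t1", "WMS-EU-01", "m-1", "h2", () -> postings[0]++));
        System.out.println("side effects: " + postings[0]);
    }
}
```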

7.3 Processing Ledger

Some side effects need their own ledger, not merely outbox/inbox.

Examples:

payment execution ledger;
tax submission ledger;
GL posting ledger;
inventory movement ledger;
notification delivery ledger;
partner file transmission ledger;
report generation ledger.

A processing ledger answers:

For this business purpose, has this source fact already produced this side effect?

Example:

create table erp_processing_ledger (
    tenant_id           varchar(64) not null,
    process_type        varchar(80) not null,
    source_type         varchar(80) not null,
    source_id           varchar(128) not null,
    semantic_purpose    varchar(80) not null,
    status              varchar(40) not null,
    result_reference    varchar(160),
    attempts            int not null default 0,
    created_at          timestamp not null,
    completed_at        timestamp,
    primary key (tenant_id, process_type, source_type, source_id, semantic_purpose)
);

The semantic_purpose matters. The same sales invoice may produce:

initial AR posting;
tax reporting;
customer notification;
revenue recognition schedule;
analytics event;
document archive.

Each purpose needs its own idempotency boundary.

8. Retry Is a Business Decision, Not a Loop

A retry policy is not:

while (true) {
    trySend();
}

A retry policy is a controlled state machine.

8.1 Failure Classification

Failure Type	Example	Retry?	Action
Transient technical	Network timeout, temporary broker unavailable	Yes	Backoff + jitter
Capacity	DB pool exhausted, rate limit	Yes, slower	Throttle and alert
Permanent technical	Invalid endpoint config, schema mismatch	No automatic retry after threshold	Quarantine and fix config/code
Business validation	Vendor blocked, period closed, credit exceeded	Usually no	Route to business exception queue
Security	Unauthorized token, certificate expired	No blind retry	Rotate credential / alert security
Duplicate	Same idempotency key and same payload	No side effect	Return existing result
Idempotency conflict	Same key different payload	No	Quarantine / reject
Poison message	Handler always fails on same payload	No after threshold	DLQ with evidence
Unknown outcome	External call timed out after possible execution	Do not blindly repeat	Query status / reconcile first
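
One way to keep this classification out of ad hoc `catch` blocks is a single routing function. The sketch below is illustrative only: `RetryClassifierSketch`, `FailureType`, `Action`, and `nextAction` are hypothetical names, and routing exhausted transient and capacity failures to quarantine is an assumption, since the table does not state what happens after the last attempt.

```java
final class RetryClassifierSketch {

    // Failure types taken from the classification table.
    enum FailureType {
        TRANSIENT_TECHNICAL, CAPACITY, PERMANENT_TECHNICAL, BUSINESS_VALIDATION,
        SECURITY, DUPLICATE, IDEMPOTENCY_CONFLICT, POISON_MESSAGE, UNKNOWN_OUTCOME
    }

    // Hypothetical routing targets.
    enum Action {
        RETRY_WITH_BACKOFF, RETRY_THROTTLED, QUARANTINE, BUSINESS_EXCEPTION_QUEUE,
        ALERT_SECURITY, RETURN_EXISTING_RESULT, DEAD_LETTER, QUERY_STATUS_FIRST
    }

    static Action nextAction(FailureType type, int attemptsSoFar, int maxAttempts) {
        boolean exhausted = attemptsSoFar >= maxAttempts;
        switch (type) {
            case TRANSIENT_TECHNICAL:
                return exhausted ? Action.QUARANTINE : Action.RETRY_WITH_BACKOFF;
            case CAPACITY:
                return exhausted ? Action.QUARANTINE : Action.RETRY_THROTTLED;
            case PERMANENT_TECHNICAL:
                return exhausted ? Action.QUARANTINE : Action.RETRY_WITH_BACKOFF;
            case POISON_MESSAGE:
                return exhausted ? Action.DEAD_LETTER : Action.RETRY_WITH_BACKOFF;
            case BUSINESS_VALIDATION:
                return Action.BUSINESS_EXCEPTION_QUEUE;
            case SECURITY:
                return Action.ALERT_SECURITY;
            case DUPLICATE:
                return Action.RETURN_EXISTING_RESULT;
            case IDEMPOTENCY_CONFLICT:
                return Action.QUARANTINE;
            case UNKNOWN_OUTCOME:
                return Action.QUERY_STATUS_FIRST; // never re-send before checking status
            default:
                throw new IllegalArgumentException("Unclassified failure type " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(nextAction(FailureType.TRANSIENT_TECHNICAL, 1, 5));
        System.out.println(nextAction(FailureType.TRANSIENT_TECHNICAL, 5, 5));
        System.out.println(nextAction(FailureType.UNKNOWN_OUTCOME, 0, 5));
    }
}
```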

8.2 Backoff and Jitter

Without backoff, retries amplify incidents.

A retry storm can turn a 30-second database slowdown into a multi-hour outage.

A sane retry policy includes:

max attempts;
exponential backoff;
random jitter;
per-partner rate limits;
circuit breaker;
retry budget;
queue depth monitoring;
dead-letter/quarantine path;
operator-visible reason code.
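
Two items from this list, exponential backoff and random jitter, can be sketched in a few lines. The variant shown is often called "full jitter": the delay is drawn uniformly between zero and a capped exponential ceiling. `BackoffSketch`, `ceiling`, and `nextDelay` are hypothetical names, and the base and cap values in `main` are placeholders, not recommendations.

```java
import java.time.Duration;
import java.util.Random;

final class BackoffSketch {

    // Upper bound of the delay window for a 1-based attempt number:
    // base * 2^(attempt - 1), never more than cap.
    static Duration ceiling(int attempt, Duration base, Duration cap) {
        int shift = Math.min(Math.max(attempt, 1) - 1, 30); // bounded to avoid overflow
        long millis = base.toMillis() * (1L << shift);
        return Duration.ofMillis(Math.min(millis, cap.toMillis()));
    }

    // Full jitter: a uniformly random delay in [0, ceiling].
    static Duration nextDelay(int attempt, Duration base, Duration cap, Random random) {
        long ceilingMillis = ceiling(attempt, base, cap).toMillis();
        long jittered = (long) (random.nextDouble() * (ceilingMillis + 1));
        return Duration.ofMillis(Math.min(jittered, ceilingMillis));
    }

    public static void main(String[] args) {
        Duration base = Duration.ofMillis(200);
        Duration cap = Duration.ofSeconds(30);
        Random random = new Random();
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.println("attempt " + attempt
                    + " ceiling=" + ceiling(attempt, base, cap).toMillis() + "ms"
                    + " delay=" + nextDelay(attempt, base, cap, random).toMillis() + "ms");
        }
    }
}
```

Jitter matters because many callers that failed at the same moment would otherwise retry at the same moment too. The remaining items in the list (max attempts, retry budget, circuit breaker) sit around this function rather than inside it.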

8.3 Unknown Outcome Is Not Failure

The most dangerous state is not failure. The most dangerous state is unknown outcome.

Example:

ERP sends payment to bank.
The request to the bank times out.
ERP does not know whether the bank executed.
ERP retries.
Bank executes twice.

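Correct model: record the timed-out attempt as an unknown outcome, ask the bank for the status of that payment instruction, and re-send only if the bank confirms it never received it.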

Rule:

If an external side effect may have happened, query/reconcile before retrying the side effect.
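
A minimal sketch of this rule for the bank payment example follows. Everything named here is hypothetical: `UnknownOutcomeSketch`, `BankGateway`, `FlakyBank`, and `BankTimeout` are illustration-only types, and the sketch assumes the bank offers an authoritative status inquiry keyed by the payment instruction id. `FlakyBank` simulates the dangerous case: the first call executes the payment and then times out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class UnknownOutcomeSketch {

    enum PaymentStatus { EXECUTED, UNKNOWN_OUTCOME }

    static final class BankTimeout extends RuntimeException { }

    interface BankGateway {
        String execute(String paymentInstructionId, long amountMinor); // may throw BankTimeout
        Optional<String> findExecution(String paymentInstructionId);   // status inquiry
    }

    // Test double: the first call executes the payment but the response is lost.
    static final class FlakyBank implements BankGateway {
        final Map<String, String> executed = new HashMap<>();
        int executions = 0;
        private boolean timeoutPending = true;

        public String execute(String id, long amountMinor) {
            executions++;
            executed.put(id, "BANK-REF-" + executions);
            if (timeoutPending) {
                timeoutPending = false;
                throw new BankTimeout();
            }
            return executed.get(id);
        }

        public Optional<String> findExecution(String id) {
            return Optional.ofNullable(executed.get(id));
        }
    }

    private final BankGateway bank;
    private final Map<String, PaymentStatus> statusById = new HashMap<>();
    private final Map<String, String> bankReferenceById = new HashMap<>();

    UnknownOutcomeSketch(BankGateway bank) {
        this.bank = bank;
    }

    PaymentStatus attempt(String id, long amountMinor) {
        PaymentStatus current = statusById.get(id);
        if (current == PaymentStatus.EXECUTED) {
            return current; // already done: no second side effect
        }
        if (current == PaymentStatus.UNKNOWN_OUTCOME) {
            // Query before retrying the side effect.
            Optional<String> reference = bank.findExecution(id);
            if (reference.isPresent()) {
                return markExecuted(id, reference.get());
            }
            // Assumption: an empty answer means the bank never received the instruction.
        }
        try {
            return markExecuted(id, bank.execute(id, amountMinor));
        } catch (BankTimeout timeout) {
            statusById.put(id, PaymentStatus.UNKNOWN_OUTCOME);
            return PaymentStatus.UNKNOWN_OUTCOME;
        }
    }

    String bankReference(String id) {
        return bankReferenceById.get(id);
    }

    private PaymentStatus markExecuted(String id, String reference) {
        statusById.put(id, PaymentStatus.EXECUTED);
        bankReferenceById.put(id, reference);
        return PaymentStatus.EXECUTED;
    }

    public static void main(String[] args) {
        FlakyBank bank = new FlakyBank();
        UnknownOutcomeSketch payments = new UnknownOutcomeSketch(bank);
        System.out.println(payments.attempt("PI-1", 10_000));
        System.out.println(payments.attempt("PI-1", 10_000));
        System.out.println("bank executions: " + bank.executions);
    }
}
```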

9. Java Implementation Sketch: Idempotent Command Handler

This is not generic Spring/JPA material. The point is the ERP-specific control shape.

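public final class IdempotentCommandService {

    private final CommandLedgerRepository ledger;
    private final PurchaseOrderApplicationService purchaseOrders;
    private final PayloadHasher hasher;

    // Final fields need a constructor; collaborators are injected.
    public IdempotentCommandService(
            CommandLedgerRepository ledger,
            PurchaseOrderApplicationService purchaseOrders,
            PayloadHasher hasher
    ) {
        this.ledger = ledger;
        this.purchaseOrders = purchaseOrders;
        this.hasher = hasher;
    }

    public CommandResult submitPurchaseOrder(
            TenantId tenantId,
            String idempotencyKey,
            SubmitPurchaseOrderCommand command,
            Actor actor
    ) {
        // Hash a canonical form so formatting differences do not look like a new payload.
        String payloadHash = hasher.sha256CanonicalJson(command);

        // Fast path: this key has been seen before.
        CommandLedgerRecord existing = ledger.find(tenantId, idempotencyKey);
        if (existing != null) {
            if (!existing.payloadHash().equals(payloadHash)) {
                throw new IdempotencyConflictException(
                        "Same idempotency key was used with a different payload"
                );
            }
            // Same key, same payload: return the recorded outcome, create no new work.
            return existing.toCommandResult();
        }

        CommandLedgerRecord accepted = CommandLedgerRecord.accepted(
                tenantId,
                idempotencyKey,
                "SubmitPurchaseOrder",
                payloadHash,
                actor.id()
        );

        // The unique constraint protects concurrent duplicates: the losing insert fails,
        // and a retry by that caller then takes the fast path above.
        ledger.insert(accepted);

        try {
            PurchaseOrderId poId = purchaseOrders.createFromApprovedCommand(command, actor);
            ledger.markCompleted(tenantId, idempotencyKey, "PurchaseOrder", poId.value());
            return CommandResult.completed("PurchaseOrder", poId.value());
        } catch (BusinessRejection ex) {
            // A business rejection is a recorded outcome, not a reason to retry.
            ledger.markRejected(tenantId, idempotencyKey, ex.code(), ex.getMessage());
            return CommandResult.rejected(ex.code(), ex.getMessage());
        }
    }
}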

Important design notes:

the ledger insert must be protected by a unique constraint;
payload must be canonicalized before hashing;
duplicate request should return original outcome, not create new work;
business rejection can be recorded as an outcome;
conflict must be explicit;
command result must not depend on volatile transient state;
for long-running commands, return accepted/pending with correlation id.

10. Canonical Payload Hashing

Hashing raw JSON is often wrong because two equivalent payloads can have different formatting or field order.

Canonical hashing rules:

stable field ordering;
stable date/time format;
explicit timezone treatment;
stable number scale/rounding;
no volatile fields such as request timestamp unless semantically relevant;
normalized whitespace;
normalized default values;
schema version included;
tenant/source context included when necessary.

Example semantic fingerprint fields for supplier invoice import:

tenantId
legalEntityId
vendorId
vendorInvoiceNumber
vendorInvoiceDate
currency
invoiceGrossAmount
lineCount
lineItemSkuAndAmountHash
sourceSystem
sourceDocumentId
schemaVersion

A good fingerprint detects meaningful differences without being disturbed by formatting noise.
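
The canonical hashing rules can be sketched with the JDK alone. The names `CanonicalFingerprintSketch`, `canonicalize`, and `sha256Hex` are hypothetical. The sketch covers only three of the rules (stable field ordering via `TreeMap`, stable number scale, trimmed whitespace); the fixed scale of 2 is an assumption that would not hold for every currency, and values containing line breaks would need escaping before this format is safe.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class CanonicalFingerprintSketch {

    // Produces one "name=value" line per field, sorted by field name.
    static String canonicalize(Map<String, Object> fields) {
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, Object> entry : new TreeMap<>(fields).entrySet()) {
            Object value = entry.getValue();
            String text;
            if (value instanceof BigDecimal) {
                // Assumption: amounts are normalized to two decimal places.
                text = ((BigDecimal) value).setScale(2, RoundingMode.HALF_UP).toPlainString();
            } else {
                text = String.valueOf(value).trim();
            }
            canonical.append(entry.getKey()).append('=').append(text).append('\n');
        }
        return canonical.toString();
    }

    static String sha256Hex(String canonical) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonical.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", ex);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("vendorId", "V-100");
        first.put("invoiceGrossAmount", new BigDecimal("100.5"));

        Map<String, Object> second = new LinkedHashMap<>();
        second.put("invoiceGrossAmount", new BigDecimal("100.50"));
        second.put("vendorId", " V-100 ");

        // Field order, number scale, and padding differ, yet the fingerprints match.
        System.out.println(sha256Hex(canonicalize(first)).equals(sha256Hex(canonicalize(second))));
    }
}
```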

11. Idempotency by ERP Domain

11.1 Procure-to-Pay

Process	Idempotency Key	Notes
Requisition submission	requester + client request id	User double-click safe
Purchase order creation	requisition id + conversion purpose	Prevent multiple POs from same approved requisition unless split intentionally
Goods receipt import	warehouse + ASN/receipt source id	External WMS may resend
Supplier invoice import	vendor + invoice number + legal entity	Natural duplicate detection matters
3-way matching	invoice id + match run id or match version	Re-run should not duplicate match result
AP posting	invoice id + posting purpose	One initial liability posting
Payment proposal	payment run id + invoice id	Prevent duplicate inclusion
Bank payment	payment instruction id	Must survive timeout and replay

11.2 Order-to-Cash

Process	Idempotency Key	Notes
Sales order submission	channel + cart/order id	E-commerce retries are common
Allocation	sales order line + allocation purpose + version	Reallocation must be explicit
Shipment confirmation	WMS shipment id	WMS may resend status
Invoice generation	shipment id or order milestone + billing purpose	Prevent duplicate invoice
AR posting	invoice id + posting purpose	Link journal to invoice
Cash application	bank statement line id + invoice id	Prevent duplicate receipt matching
Return authorization	original sale + return request id	Avoid duplicate returns
Credit note	return id + credit purpose	Avoid duplicate credits

11.3 Inventory

Process	Idempotency Key	Notes
Stock movement	movement source + source id + line id	Ledger must not duplicate movement
Reservation	demand id + item + location + reservation purpose	Avoid over-reservation
Pick confirmation	pick task id + picked line version	WMS event duplicates common
Cycle count adjustment	count session + item + bin + serial/lot	Re-run should update result, not duplicate
Manufacturing issue	work order + component line + issue sequence	Backflush must be safe
Completion receipt	work order + completion batch	Avoid duplicate finished goods receipt

11.4 General Ledger

Process	Idempotency Key	Notes
Subledger posting	source document + posting purpose	One journal per purpose
Reversal	original journal + reversal reason + reversal sequence	Reversal is separate legal event
Adjustment	adjustment request id	Must be approved and auditable
Period close task	fiscal period + task code + run id	Re-run must be controlled
Balance projection	journal id + projection version	Projection idempotency separate from posting

12. Reconciliation as First-Class Architecture

Reconciliation is not a report at the end. It is a core reliability mechanism.

A system that cannot reconcile cannot prove correctness.

12.1 Reconciliation Types

Type	Example	Purpose
Count reconciliation	Number of invoices sent vs received	Detect missing/extra records
Sum reconciliation	Total payment amount in ERP vs bank	Detect amount mismatch
Hash reconciliation	File hash or line fingerprint	Detect tampering/corruption
Status reconciliation	ERP says shipped, WMS says pending	Detect lifecycle mismatch
Balance reconciliation	Inventory subledger vs GL control account	Detect financial mismatch
Aging reconciliation	Items stuck in pending > threshold	Detect operational blockage
Sequence reconciliation	Legal invoice sequence gaps	Detect missing/voided document
Cross-system reference reconciliation	ERP document not known by partner	Detect integration loss
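
Count, sum, and cross-system reference reconciliation share one core operation: compare two keyed sets and emit an exception for every difference. The sketch below assumes Java 16 or later for records; `ReconciliationSketch`, `ReconException`, `reconcile`, and `total` are hypothetical names, and the two maps stand in for source-side and ERP-side extracts keyed by document reference.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class ReconciliationSketch {

    enum ExceptionType { MISSING_IN_ERP, MISSING_IN_SOURCE, AMOUNT_MISMATCH }

    record ReconException(ExceptionType type, String reference,
                          BigDecimal expected, BigDecimal actual) { }

    // Compares source and ERP amounts per reference, in sorted reference order.
    static List<ReconException> reconcile(Map<String, BigDecimal> source,
                                          Map<String, BigDecimal> erp) {
        List<ReconException> exceptions = new ArrayList<>();
        TreeSet<String> references = new TreeSet<>(source.keySet());
        references.addAll(erp.keySet());
        for (String reference : references) {
            BigDecimal expected = source.get(reference);
            BigDecimal actual = erp.get(reference);
            if (actual == null) {
                exceptions.add(new ReconException(
                        ExceptionType.MISSING_IN_ERP, reference, expected, null));
            } else if (expected == null) {
                exceptions.add(new ReconException(
                        ExceptionType.MISSING_IN_SOURCE, reference, null, actual));
            } else if (expected.compareTo(actual) != 0) { // compareTo ignores scale
                exceptions.add(new ReconException(
                        ExceptionType.AMOUNT_MISMATCH, reference, expected, actual));
            }
        }
        return exceptions;
    }

    // Control total for sum reconciliation.
    static BigDecimal total(Map<String, BigDecimal> amounts) {
        return amounts.values().stream().reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    public static void main(String[] args) {
        Map<String, BigDecimal> source = Map.of(
                "INV-1", new BigDecimal("100.00"),
                "INV-2", new BigDecimal("50.00"),
                "INV-3", new BigDecimal("75.00"));
        Map<String, BigDecimal> erp = Map.of(
                "INV-1", new BigDecimal("100.0"),
                "INV-2", new BigDecimal("55.00"),
                "INV-4", new BigDecimal("10.00"));
        System.out.println("counts: " + source.size() + " vs " + erp.size());
        System.out.println("totals: " + total(source) + " vs " + total(erp));
        reconcile(source, erp).forEach(System.out::println);
    }
}
```

Note that in this sample the counts match (3 vs 3) while three records are wrong, which is why count reconciliation alone is not sufficient.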

12.2 Reconciliation Architecture

12.3 Control Totals

For file/batch integrations, control totals are mandatory.

Example file manifest:

{
  "sourceSystem": "WMS-EU-01",
  "fileId": "SHIPMENT-2026-07-01-001",
  "businessDate": "2026-07-01",
  "recordCount": 125000,
  "totalQuantity": "884321.000",
  "totalAmount": "0.00",
  "payloadSha256": "...",
  "createdAt": "2026-07-01T02:00:00Z"
}

The import should verify:

file id has not been processed before;
manifest hash matches payload;
record count matches parsed rows;
quantity/amount totals match;
schema version is supported;
business date is acceptable;
source system is authorized;
import result has matched count, rejected count, duplicate count, and quarantined count.
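
The first four checks in this list are mechanical and can be sketched as one function that returns every violation instead of failing on the first. The sketch assumes Java 16 or later for records; `ManifestCheckSketch`, `Manifest`, `verify`, and the violation codes are hypothetical, and the SHA-256 of the received payload is assumed to have been computed by the caller.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ManifestCheckSketch {

    // Subset of the manifest fields shown in the example file manifest.
    record Manifest(String fileId, long recordCount,
                    BigDecimal totalQuantity, String payloadSha256) { }

    static List<String> verify(Manifest manifest,
                               List<BigDecimal> parsedRowQuantities,
                               String actualPayloadSha256,
                               Set<String> processedFileIds) {
        List<String> violations = new ArrayList<>();
        if (processedFileIds.contains(manifest.fileId())) {
            violations.add("FILE_ALREADY_PROCESSED");
        }
        if (!manifest.payloadSha256().equalsIgnoreCase(actualPayloadSha256)) {
            violations.add("PAYLOAD_HASH_MISMATCH");
        }
        if (manifest.recordCount() != parsedRowQuantities.size()) {
            violations.add("RECORD_COUNT_MISMATCH");
        }
        BigDecimal parsedTotal = parsedRowQuantities.stream()
                .reduce(BigDecimal.ZERO, BigDecimal::add);
        if (parsedTotal.compareTo(manifest.totalQuantity()) != 0) {
            violations.add("TOTAL_QUANTITY_MISMATCH");
        }
        return violations;
    }

    public static void main(String[] args) {
        Manifest manifest = new Manifest(
                "SHIPMENT-2026-07-01-001", 2, new BigDecimal("15.000"), "abc");
        List<BigDecimal> rows = List.of(new BigDecimal("10.000"), new BigDecimal("5.000"));
        System.out.println(verify(manifest, rows, "abc", Set.of()));
        System.out.println(verify(manifest, rows.subList(0, 1), "zzz",
                Set.of("SHIPMENT-2026-07-01-001")));
    }
}
```

Collecting all violations in one pass gives the operator the full picture of a bad file rather than one failure per re-submission.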

13. Reconciliation Data Model

create table erp_reconciliation_run (
    run_id              uuid primary key,
    tenant_id           varchar(64) not null,
    recon_type          varchar(80) not null,
    scope_key           varchar(200) not null,
    business_date       date not null,
    status              varchar(40) not null,
    source_count        bigint,
    erp_count           bigint,
    destination_count   bigint,
    source_total        numeric(30, 6),
    erp_total           numeric(30, 6),
    destination_total   numeric(30, 6),
    started_at          timestamp not null,
    completed_at        timestamp,
    created_by          varchar(128) not null,
    unique (tenant_id, recon_type, scope_key, business_date)
);

create table erp_reconciliation_exception (
    exception_id        uuid primary key,
    run_id              uuid not null references erp_reconciliation_run(run_id),
    exception_type      varchar(80) not null,
    severity            varchar(40) not null,
    source_reference    varchar(200),
    erp_reference       varchar(200),
    destination_reference varchar(200),
    expected_value      text,
    actual_value        text,
    status              varchar(40) not null,
    resolution_code     varchar(80),
    resolution_note     text,
    created_at          timestamp not null,
    resolved_at         timestamp,
    resolved_by         varchar(128)
);

The reconciliation exception is not merely a technical ticket. It is business evidence.

14. Exception Queue Design

An exception queue is different from a dead-letter queue.

Queue	Meaning	Owner
Dead-letter queue	Technical message cannot be processed automatically	Platform/integration team
Business exception queue	Business fact is valid structurally but cannot complete because of domain rule	Business operations
Reconciliation exception queue	Cross-system mismatch found	Joint business + support
Security exception queue	Unauthorized/suspicious request	Security/compliance
Data quality queue	Master/reference data issue blocks processing	Data governance team

An exception queue item should include:

business key;
source system;
payload snapshot or secure reference;
failure reason code;
severity;
retry eligibility;
suggested resolution;
owner team;
SLA;
correlation id;
related document;
audit trail;
allowed actions.

Allowed actions must be controlled.

Do not let operators “edit raw payload and replay” without evidence.

15. Compensation, Reversal, and Correction

ERP systems should rarely delete mistakes. They correct them.

Action	Meaning	Example
Retry	Attempt same side effect again because previous attempt did not complete	Re-send message to broker
Replay	Process historical fact again through idempotent handler	Rebuild projection
Reverse	Create equal and opposite financial/stock effect	Reverse wrong journal
Void	Mark unused legal document number as void with reason	Void invoice number generated in error
Cancel	Stop a document before irreversible business effect	Cancel draft PO
Adjust	Add explicit correction transaction	Inventory adjustment after count
Amend	Create new revision while retaining history	Amend contract terms
Reconcile	Match/resolve difference between systems	Match payment to invoice

The key rule:

Correction must preserve evidence of the original mistake and the authorized remedy.

15.1 Bad Fix

delete from gl_journal where journal_id = 'J-123';

This destroys evidence.

15.2 Better Fix

create reversal journal;
link reversal to original journal;
require approval;
record reason code;
include period policy;
expose both in audit timeline;
update reconciliation status.
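
The first four steps can be sketched as a function that builds a mirror journal instead of touching the original. The sketch assumes Java 16 or later for records; `ReversalSketch`, `Journal`, `JournalLine`, `reverse`, and `netDebit` are hypothetical names, and approval is reduced to a mandatory `approvedBy` value. Period policy, the audit timeline, and reconciliation status are out of scope here.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

final class ReversalSketch {

    record JournalLine(String account, BigDecimal debit, BigDecimal credit) { }

    // reversesJournalId is null for an ordinary journal.
    record Journal(String journalId, String reversesJournalId, String reasonCode,
                   String approvedBy, List<JournalLine> lines) { }

    static Journal reverse(Journal original, String reversalId,
                           String reasonCode, String approvedBy) {
        if (reasonCode == null || reasonCode.isBlank()) {
            throw new IllegalArgumentException("A reversal needs a reason code");
        }
        if (approvedBy == null || approvedBy.isBlank()) {
            throw new IllegalArgumentException("A reversal needs an approver");
        }
        List<JournalLine> mirrored = new ArrayList<>();
        for (JournalLine line : original.lines()) {
            // Swap debit and credit: equal and opposite effect per account.
            mirrored.add(new JournalLine(line.account(), line.credit(), line.debit()));
        }
        // The original journal is left untouched; the link preserves the evidence.
        return new Journal(reversalId, original.journalId(), reasonCode,
                approvedBy, List.copyOf(mirrored));
    }

    // Net debit balance of one account across the given journals.
    static BigDecimal netDebit(String account, Journal... journals) {
        BigDecimal net = BigDecimal.ZERO;
        for (Journal journal : journals) {
            for (JournalLine line : journal.lines()) {
                if (line.account().equals(account)) {
                    net = net.add(line.debit()).subtract(line.credit());
                }
            }
        }
        return net;
    }

    public static void main(String[] args) {
        Journal original = new Journal("J-123", null, null, "poster-1", List.of(
                new JournalLine("6000", new BigDecimal("100.00"), BigDecimal.ZERO),
                new JournalLine("2000", BigDecimal.ZERO, new BigDecimal("100.00"))));
        Journal reversal = reverse(original, "J-124", "WRONG_ACCOUNT", "controller-1");
        System.out.println(reversal);
        System.out.println("net 6000: " + netDebit("6000", original, reversal));
    }
}
```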

16. Ordering and Versioning

Duplicate handling is not enough. Some ERP facts must be processed in order.

Example:

sales order created;
sales order approved;
shipment confirmed;
invoice generated;
payment received.

If the third event (shipment confirmed) arrives before the second (sales order approved), the consumer must not pretend the world is fine.

Ordering strategies:

Strategy	Use Case	Trade-off
Aggregate version	Events for same aggregate	Requires versioned aggregate
Sequence per source	File rows or partner events	Source must guarantee sequence
State guard	Reject illegal transition	Needs lifecycle model
Holding buffer	Temporarily hold future event	Requires timeout and cleanup
Reconciliation repair	Accept eventual mismatch and repair	Requires strong exception process
Snapshot replacement	Latest state wins	Dangerous for financial facts

For ERP financial and stock facts, “latest wins” is usually wrong.

16.1 Event Version Guard

public void applyShipmentEvent(ShipmentConfirmed event) {
    ProjectionState state = repository.getState(event.shipmentId());

    if (state.hasApplied(event.eventId())) {
        return;
    }

    if (event.aggregateVersion() != state.nextExpectedVersion()) {
        repository.recordOutOfOrder(event);
        throw new OutOfOrderEventException(event.shipmentId(), event.aggregateVersion());
    }

    state.apply(event);
    state.markApplied(event.eventId());
    repository.save(state);
}

17. Multi-System Correlation Model

You cannot operate a large ERP without correlation.

Minimum identifiers:

Identifier	Meaning
`traceId`	Technical distributed trace
`correlationId`	Business process correlation across commands/events
`causationId`	Immediate cause of this event/command
`idempotencyKey`	Duplicate suppression key
`sourceSystem`	Origin of the fact
`sourceDocumentId`	Source-side business reference
`erpDocumentId`	ERP-side document reference
`externalReference`	Destination-side reference
`attemptId`	Specific retry attempt
`replayId`	Specific replay operation

Every log/event/exception should carry enough of this to answer:

what happened;
why it happened;
what caused it;
whether it was duplicate;
whether it was retried;
what downstream result was produced;
how to reconcile it.

18. Observability Metrics

Technical metrics:

outbox backlog count;
outbox oldest age;
publish attempts per event;
inbox duplicate count;
DLQ size;
retry queue depth;
retry success rate;
max processing latency;
handler error rate;
downstream timeout rate.

Business metrics:

unposted invoice count;
unmatched payment count;
unreconciled shipment count;
stock movement pending finance count;
GL posting exceptions;
bank payment unknown outcome count;
duplicate supplier invoice attempts;
legal sequence gaps;
stuck approval tasks;
SLA breach count.

A top-tier ERP monitoring dashboard does not show only CPU and memory. It shows business process health.

19. Testing Strategy

19.1 Duplicate Tests

Test the same command twice:

same key, same payload;
same key, different payload;
different key, same natural business document;
concurrent duplicate submission;
duplicate after completed;
duplicate while in progress;
duplicate after rejection.

19.2 Retry Tests

Inject:

broker publish failure after DB commit;
DB failure before outbox insert;
consumer failure after side effect before ack;
external timeout after possible execution;
downstream rate limit;
poison payload;
schema mismatch;
unauthorized credential;
max retry exhausted.

19.3 Reconciliation Tests

Create datasets with:

missing source row;
missing ERP row;
missing destination row;
amount mismatch;
status mismatch;
duplicate candidate;
sequence gap;
wrong currency;
timezone boundary issue;
partial file import.

19.4 Property-Based Invariants

For generated sequences of retries, duplicates, and replays:

one supplier invoice business key creates at most one active payable;
one posted document purpose creates at most one initial GL journal;
duplicate stock movement key does not change stock twice;
payment unknown outcome is not blindly re-executed;
replayed events do not change final read model after first application;
every quarantined item has owner, reason, and audit evidence.
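
The stock movement invariant can be checked with a hand-rolled randomized test, shown below as a sketch. It does not use a property-based testing library; a seeded `java.util.Random` generates delivery sequences with duplicates and reordering. It assumes Java 16 or later for records, and `ReplayInvariantSketch`, `Movement`, `StockLedger`, and `holdsForRandomDeliveries` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class ReplayInvariantSketch {

    record Movement(String key, int quantity) { }

    // Minimal idempotent ledger: a movement key changes stock at most once.
    static final class StockLedger {
        private final Set<String> appliedKeys = new HashSet<>();
        private int onHand = 0;

        void apply(Movement movement) {
            if (appliedKeys.add(movement.key())) {
                onHand += movement.quantity();
            }
        }

        int onHand() {
            return onHand;
        }
    }

    // True when, for every generated sequence, duplicates and reordering
    // leave the final stock equal to the sum of the distinct movements.
    static boolean holdsForRandomDeliveries(long seed, int trials) {
        Random random = new Random(seed);
        for (int trial = 0; trial < trials; trial++) {
            int distinctCount = 1 + random.nextInt(20);
            List<Movement> distinct = new ArrayList<>();
            int expected = 0;
            for (int i = 0; i < distinctCount; i++) {
                Movement movement = new Movement("MOV-" + i, random.nextInt(201) - 100);
                distinct.add(movement);
                expected += movement.quantity();
            }

            // Every movement is delivered at least once, plus random redeliveries.
            List<Movement> deliveries = new ArrayList<>(distinct);
            int redeliveries = random.nextInt(40);
            for (int i = 0; i < redeliveries; i++) {
                deliveries.add(distinct.get(random.nextInt(distinctCount)));
            }
            Collections.shuffle(deliveries, random);

            StockLedger ledger = new StockLedger();
            deliveries.forEach(ledger::apply);
            if (ledger.onHand() != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("invariant holds: " + holdsForRandomDeliveries(42L, 500));
    }
}
```

Printing the seed on failure is the important habit: a failing sequence can then be reproduced exactly.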

20. Failure Modes and Anti-Patterns

Anti-pattern	Why It Fails	Better Approach
“Broker guarantees exactly once”	Does not cover business side effects and external systems	Design effective-once business processing
Random idempotency key per retry	Duplicate cannot be detected	Stable client/source/business key
Retry all exceptions	Business/security failures become storms	Classify failure and route appropriately
Delete bad data	Destroys audit evidence	Reverse, void, adjust, or amend
No payload hash	Same key can mutate meaning	Store canonical fingerprint
DLQ as graveyard	Failures are hidden	Owned exception workflow and SLA
Reporting as reconciliation	Too late and too passive	Dedicated recon runs and exception queues
Manual SQL repair	Uncontrolled, unaudited changes	Governed admin action with evidence
Source-only uniqueness	Same key collides across tenants or legal entities	Scope keys by tenant and legal entity
“Latest status wins”	Financial/stock facts lose history	Append-only ledger and state guards
Infinite replay	Reprocesses poison forever	Replay policy, dry-run, and cutoff
Outbox deletion too soon	Cannot audit or recover	Retain by policy and archive safely

21. Design Review Checklist

Use this checklist in architecture review.

21.1 Command/API

Does every mutating command have an idempotency key?
Is the key stable across retry?
Is the key scoped correctly?
Is payload hash stored?
Is duplicate behavior defined?
Is conflict behavior defined?
Is unique constraint enforced in the database?
Is the response deterministic for duplicate commands?

21.2 Messaging

Is the outbox row written in the same transaction as the business state?
Is inbox used by consumers?
Is event identity stable?
Is ordering required and enforced?
Is replay supported safely?
Is duplicate event application harmless?
Are broker acknowledgments sent only after the DB commit?

21.3 External Side Effects

Is the external request idempotent?
Is unknown outcome modelled?
Is status inquiry available?
Is blind retry prohibited after possible execution?
Is external reference stored?
Is reconciliation scheduled?

21.4 Retry and Exception

Are failures classified?
Is backoff with jitter used?
Is a maximum attempt count configured?
Is DLQ/quarantine visible?
Is owner assigned?
Is manual replay controlled?
Is every resolution audited?
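
As a reference point for the backoff and max-attempt questions, the following is a minimal sketch of capped exponential backoff with full jitter. It assumes the delay is drawn uniformly between zero and an exponentially growing ceiling. `RetryBackoff` and its parameters are hypothetical, not taken from a specific library.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical policy: capped exponential backoff with full jitter and a max attempt count. */
final class RetryBackoff {
    private final long baseMillis;
    private final long capMillis;
    private final int maxAttempts;

    RetryBackoff(Duration base, Duration cap, int maxAttempts) {
        this.baseMillis = base.toMillis();
        this.capMillis = cap.toMillis();
        this.maxAttempts = maxAttempts;
        if (baseMillis < 1 || capMillis < baseMillis || maxAttempts < 1) {
            throw new IllegalArgumentException("need 1ms <= base <= cap and maxAttempts >= 1");
        }
    }

    /** Upper bound of the delay after the given number of failed attempts (1-based). */
    long ceilingMillis(int failedAttempts) {
        // The exponent is clamped so the shift below cannot overflow.
        int exponent = Math.min(Math.max(failedAttempts - 1, 0), 30);
        long multiplier = 1L << exponent;
        if (baseMillis > capMillis / multiplier) {
            return capMillis;
        }
        return baseMillis * multiplier;
    }

    /** Returns a jittered delay, or empty when attempts are exhausted and the item should be quarantined. */
    Optional<Duration> nextDelay(int failedAttempts) {
        if (failedAttempts >= maxAttempts) {
            return Optional.empty();
        }
        long ceiling = ceilingMillis(failedAttempts);
        // Full jitter: spread retries uniformly over [0, ceiling] to avoid synchronized storms.
        long jittered = ThreadLocalRandom.current().nextLong(ceiling + 1);
        return Optional.of(Duration.ofMillis(jittered));
    }

    public static void main(String[] args) {
        RetryBackoff policy = new RetryBackoff(Duration.ofMillis(200), Duration.ofSeconds(30), 5);
        for (int failed = 1; failed <= 5; failed++) {
            System.out.println("after failure " + failed + ": " + policy.nextDelay(failed));
        }
    }
}
```

The empty result is deliberate: the caller has to decide what happens after exhaustion, which is where quarantine and ownership come in.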

21.5 Reconciliation

Are count/sum/hash/status reconciliations defined?
Are control totals captured?
Are exceptions routed to workflow?
Is aging monitored?
Are close blockers visible?
Can support answer “what happened” without raw SQL?

22. 20-Hour Practice Plan

Hour 1-3: Failure Vocabulary

Take five ERP processes:

supplier invoice import;
bank payment execution;
WMS shipment confirmation;
stock movement posting;
GL journal posting.

For each, define:

business fact;
source of truth;
idempotency key;
duplicate behavior;
retry behavior;
reconciliation method.

Hour 4-6: Build Command Ledger

Implement a small Java/Spring command ledger:

accept command;
hash payload;
enforce unique idempotency key;
return original result on duplicate;
reject same key/different payload.
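
Before wiring this into Spring and a database, the core behavior can be sketched in plain Java. The version below assumes an in-memory `ConcurrentHashMap` in place of a table with a unique constraint, a payload that is already canonicalized, and a string result. `CommandLedger`, `KeyConflictException`, and the key format are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical in-memory command ledger; a real one would persist entries transactionally. */
final class CommandLedger {

    static final class KeyConflictException extends RuntimeException {
        KeyConflictException(String key) {
            super("idempotency key reused with a different payload: " + key);
        }
    }

    private static final class Entry {
        final String payloadHash;
        final String result;

        Entry(String payloadHash, String result) {
            this.payloadHash = payloadHash;
            this.result = result;
        }
    }

    private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();

    /** Runs the handler at most once per key and returns the original result on duplicates. */
    String execute(String scopedKey, String canonicalPayload, Function<String, String> handler) {
        String hash = sha256Hex(canonicalPayload);
        // computeIfAbsent is atomic per key, so concurrent duplicates wait for the first caller.
        Entry entry = entries.computeIfAbsent(
                scopedKey, key -> new Entry(hash, handler.apply(canonicalPayload)));
        if (!entry.payloadHash.equals(hash)) {
            throw new KeyConflictException(scopedKey); // same key, different meaning
        }
        return entry.result;
    }

    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        CommandLedger ledger = new CommandLedger();
        String key = "tenant-1:client-7:req-42";
        String payload = "{\"qty\":5,\"sku\":\"A\"}";
        System.out.println(ledger.execute(key, payload, p -> "PO-1001"));
        System.out.println(ledger.execute(key, payload, p -> "PO-9999")); // duplicate: handler not run
    }
}
```

If the handler throws, no entry is recorded and the key can be retried. A persistent version would have to decide whether rejected commands are also recorded, since "duplicate after rejection" is one of the test cases.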

Hour 7-9: Build Outbox/Inbox

Implement:

outbox table;
poller;
publisher simulation;
inbox table;
duplicate consumer suppression.

Inject failures after DB commit and before publish.
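
A minimal in-memory simulation of this exercise might look like the sketch below. It assumes a single-threaded poller and uses lists and sets in place of tables. The names `Outbox`, `OutboxRow`, and `InboxConsumer` are hypothetical. The point is the ordering: a row is marked published only after the publisher returns, so a crash in between causes a redelivery that the inbox must absorb.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical outbox row; in a real system it is written in the business transaction. */
final class OutboxRow {
    final String eventId;
    final String payload;
    boolean published;

    OutboxRow(String eventId, String payload) {
        this.eventId = eventId;
        this.payload = payload;
    }
}

final class Outbox {
    private final List<OutboxRow> rows = new ArrayList<>();

    void append(String eventId, String payload) {
        rows.add(new OutboxRow(eventId, payload));
    }

    /** Publishes pending rows and returns how many were marked published in this poll. */
    int poll(Consumer<OutboxRow> publisher) {
        int marked = 0;
        for (OutboxRow row : rows) {
            if (row.published) {
                continue;
            }
            try {
                publisher.accept(row);
                row.published = true; // marked only after the publisher returned normally
                marked++;
            } catch (RuntimeException e) {
                // Leave the row pending; the next poll retries it (at-least-once delivery).
            }
        }
        return marked;
    }
}

/** Hypothetical consumer with an inbox of processed event ids. */
final class InboxConsumer {
    private final Set<String> seenEventIds = new HashSet<>();
    private final List<String> applied = new ArrayList<>();

    /** Applies the event once; returns false when the event id was already processed. */
    boolean handle(String eventId, String payload) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery suppressed
        }
        applied.add(payload);
        return true;
    }

    List<String> applied() {
        return applied;
    }
}
```

To inject the failure, pass a publisher that delivers to the consumer and then throws once. The first poll marks nothing, the second poll redelivers, and the consumer still applies the event only once. In a database-backed version the inbox insert and the side effect would need to share one transaction.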

Hour 10-12: Retry State Machine

Implement retry states:

pending;
processing;
retry scheduled;
quarantined;
completed;
cancelled.

Add transient/permanent/business failure classification.
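
The states and classification can be sketched as an explicit state machine with guarded transitions. The routing below is an assumption made for illustration: transient failures are rescheduled until attempts run out, while permanent and business failures go straight to quarantine. `RetryItem`, `RetryState`, and `FailureKind` are hypothetical names.

```java
enum RetryState { PENDING, PROCESSING, RETRY_SCHEDULED, QUARANTINED, COMPLETED, CANCELLED }

enum FailureKind { TRANSIENT, PERMANENT, BUSINESS }

/** Hypothetical work item whose retry lifecycle is an explicit state machine. */
final class RetryItem {
    private final int maxAttempts;
    private RetryState state = RetryState.PENDING;
    private int attempts;
    private String quarantineReason;

    RetryItem(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
    }

    void start() {
        require(state == RetryState.PENDING || state == RetryState.RETRY_SCHEDULED, "start");
        attempts++;
        state = RetryState.PROCESSING;
    }

    void succeed() {
        require(state == RetryState.PROCESSING, "succeed");
        state = RetryState.COMPLETED;
    }

    void fail(FailureKind kind) {
        require(state == RetryState.PROCESSING, "fail");
        if (kind == FailureKind.TRANSIENT && attempts < maxAttempts) {
            state = RetryState.RETRY_SCHEDULED;
        } else {
            // Assumed routing: exhausted, permanent, and business failures need a human owner.
            state = RetryState.QUARANTINED;
            quarantineReason = kind == FailureKind.TRANSIENT
                    ? "max attempts exhausted" : kind + " failure";
        }
    }

    void cancel() {
        require(state == RetryState.PENDING || state == RetryState.RETRY_SCHEDULED
                || state == RetryState.QUARANTINED, "cancel");
        state = RetryState.CANCELLED;
    }

    private void require(boolean allowed, String action) {
        if (!allowed) {
            throw new IllegalStateException("cannot " + action + " from " + state);
        }
    }

    RetryState state() { return state; }

    int attempts() { return attempts; }

    String quarantineReason() { return quarantineReason; }
}
```

Because every transition is guarded, a completed item cannot be started again by accident, which is the difference between a state machine and a retry loop.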

Hour 13-15: Reconciliation Engine

Build a small reconciliation engine for ERP vs partner records:

match by key;
detect missing;
detect amount mismatch;
detect duplicate;
generate exception queue.
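
A first version of such an engine can be very small. The sketch below assumes each record carries a business key and an amount in minor currency units. It keeps the first occurrence of a key for matching and reports every further occurrence as a duplicate. `Reconciler`, `ReconRecord`, `ReconIssue`, and `ReconException` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

enum ReconIssue { MISSING_IN_ERP, MISSING_IN_PARTNER, AMOUNT_MISMATCH, DUPLICATE_IN_ERP, DUPLICATE_IN_PARTNER }

/** Hypothetical record: business key plus amount in minor units (for example cents). */
final class ReconRecord {
    final String key;
    final long amountMinor;

    ReconRecord(String key, long amountMinor) {
        this.key = key;
        this.amountMinor = amountMinor;
    }
}

/** One entry of the exception queue. */
final class ReconException {
    final ReconIssue issue;
    final String key;

    ReconException(ReconIssue issue, String key) {
        this.issue = issue;
        this.key = key;
    }

    @Override
    public String toString() {
        return issue + ":" + key;
    }
}

final class Reconciler {

    static List<ReconException> reconcile(List<ReconRecord> erp, List<ReconRecord> partner) {
        List<ReconException> queue = new ArrayList<>();
        Map<String, ReconRecord> erpByKey = index(erp, ReconIssue.DUPLICATE_IN_ERP, queue);
        Map<String, ReconRecord> partnerByKey = index(partner, ReconIssue.DUPLICATE_IN_PARTNER, queue);

        for (ReconRecord erpRecord : erpByKey.values()) {
            ReconRecord match = partnerByKey.get(erpRecord.key);
            if (match == null) {
                queue.add(new ReconException(ReconIssue.MISSING_IN_PARTNER, erpRecord.key));
            } else if (match.amountMinor != erpRecord.amountMinor) {
                queue.add(new ReconException(ReconIssue.AMOUNT_MISMATCH, erpRecord.key));
            }
        }
        for (String partnerKey : partnerByKey.keySet()) {
            if (!erpByKey.containsKey(partnerKey)) {
                queue.add(new ReconException(ReconIssue.MISSING_IN_ERP, partnerKey));
            }
        }
        return queue;
    }

    /** Indexes records by key, reporting each repeated key as a duplicate. */
    private static Map<String, ReconRecord> index(
            List<ReconRecord> records, ReconIssue duplicateIssue, List<ReconException> queue) {
        Map<String, ReconRecord> byKey = new LinkedHashMap<>();
        for (ReconRecord record : records) {
            if (byKey.putIfAbsent(record.key, record) != null) {
                queue.add(new ReconException(duplicateIssue, record.key));
            }
        }
        return byKey;
    }
}
```

Status and currency checks, control totals, and ownership of each exception are natural next steps once the key-and-amount matching works.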

Hour 16-18: Unknown Outcome Simulation

Simulate bank timeout after possible payment execution.

Rules:

do not retry immediately;
query status first;
if the outcome remains unknown, route to manual review;
record evidence.
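
These rules could be sketched as a small resolver. The example assumes the bank offers a status inquiry that answers executed, not executed, or unknown, and that a "not executed" answer is definitive, which depends on the bank. `UnknownOutcomeResolver`, `BankStatus`, and `PaymentState` are hypothetical names. A real implementation would schedule inquiries with a delay instead of looping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical answers from a bank status inquiry. */
enum BankStatus { EXECUTED, NOT_EXECUTED, UNKNOWN }

enum PaymentState { CONFIRMED, RESEND_ALLOWED, MANUAL_REVIEW }

/** Resolves a payment whose outcome is unknown after a timeout, without blind re-execution. */
final class UnknownOutcomeResolver {
    private final Function<String, BankStatus> statusInquiry;
    private final int maxInquiries;
    private final List<String> evidence = new ArrayList<>();

    UnknownOutcomeResolver(Function<String, BankStatus> statusInquiry, int maxInquiries) {
        this.statusInquiry = statusInquiry;
        this.maxInquiries = maxInquiries;
    }

    PaymentState resolve(String paymentReference) {
        evidence.add("timeout after send, outcome unknown: " + paymentReference);
        for (int inquiry = 1; inquiry <= maxInquiries; inquiry++) {
            // Query first; the payment request itself is never resent from this loop.
            BankStatus status = statusInquiry.apply(paymentReference);
            evidence.add("inquiry " + inquiry + " -> " + status);
            if (status == BankStatus.EXECUTED) {
                return PaymentState.CONFIRMED;
            }
            if (status == BankStatus.NOT_EXECUTED) {
                // Assumption: the bank's negative answer is definitive.
                return PaymentState.RESEND_ALLOWED;
            }
        }
        evidence.add("outcome still unknown, routed to manual review");
        return PaymentState.MANUAL_REVIEW;
    }

    List<String> evidence() {
        return evidence;
    }

    public static void main(String[] args) {
        UnknownOutcomeResolver resolver =
                new UnknownOutcomeResolver(reference -> BankStatus.UNKNOWN, 3);
        System.out.println(resolver.resolve("PAY-2026-000123"));
        resolver.evidence().forEach(System.out::println);
    }
}
```

Even when a resend is allowed, it should reuse the original payment reference so the bank can detect a duplicate if its earlier answer turns out to be wrong.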

Hour 19-20: Architecture Review

Review your design using the checklist above.

Your output should include:

idempotency matrix;
retry policy;
reconciliation plan;
exception queue model;
sequence diagram;
failure-mode table;
runbook outline.

23. Source Notes

Jakarta Messaging describes APIs for Java applications to create, send, and receive messages through loosely coupled, reliable asynchronous communication services.
The idempotent consumer pattern is necessary in at-least-once delivery environments because message handlers may receive the same message repeatedly.
The transactional outbox pattern is commonly used to publish messages reliably when a service updates a database and must also notify other systems.
Jakarta Batch defines a Java API and job specification language for batch jobs, useful for controlled import/export/reconciliation workloads.
PostgreSQL transaction and locking behavior matters when implementing command ledgers, outbox polling, unique constraints, and concurrent duplicate suppression.

24. Key Takeaways

Exactly-once should be treated as an illusion unless defined at the business outcome level.
ERP correctness requires stable identity, idempotent behavior, durable intent, retry classification, and reconciliation.
Retry is a state machine, not a loop.
Unknown outcome must be resolved by inquiry/reconciliation before repeating irreversible side effects.
Outbox proves local intent; inbox protects consumers; processing ledger protects semantic side effects.
Reconciliation is not optional reporting. It is the proof mechanism that business facts survived distributed processing.
Audit-safe correction uses reversal, void, adjustment, amendment, or controlled replay, not destructive mutation.
