
Idempotency, Retry, Reconciliation, and Exactly-Once Illusions

Learn Java Large Scale ERP - Part 023

Idempotency, retry, reconciliation, and exactly-once illusions for large-scale Java ERP integration and transaction processing.



1. Why This Part Matters

Large ERP systems fail less often because a single method throws an exception and more often because one business fact crosses a boundary twice, zero times, too late, in the wrong order, or without evidence.

Typical incidents look like this:

  • a supplier invoice is imported twice;
  • a payment request is retried after timeout and the bank executes both attempts;
  • a warehouse shipment event arrives before the sales order allocation event;
  • a customer payment is received but not matched to the AR invoice;
  • an inventory adjustment succeeds locally but fails to reach the finance subledger;
  • a nightly batch is re-run and posts another copy of the same journals;
  • a message broker redelivers a message after the consumer already committed the database transaction;
  • a user clicks submit twice and creates two purchase orders;
  • an API caller times out and resends the same command with a different request id;
  • an integration team says “the broker guarantees delivery”, but no one can prove whether the ERP processed the business fact.

In a small CRUD system, duplicate processing is annoying. In a large ERP, duplicate processing can create financial misstatement, inventory mismatch, broken audit trail, month-end close delay, legal sequence gaps, and customer/vendor disputes.

The mental model for this part is:

In ERP, reliability is not merely a transport property. Reliability is a business property proven by durable intent, idempotent execution, controlled retry, explicit exception handling, and reconciliation evidence.

A top-tier ERP engineer does not ask only:

Did the HTTP call return 200?

They ask:

Can we prove which business fact was intended, whether it was accepted, whether it was processed once in effect, whether downstream systems observed it, whether duplicates were suppressed, and whether exceptions are reconcilable?


2. Kaufman Skill Deconstruction

Following Kaufman's method, this skill must be decomposed into small capabilities that can be practiced independently.

| Sub-skill | What You Need to Learn | Failure If Ignored |
| --- | --- | --- |
| Delivery semantics | At-most-once, at-least-once, effective-once, ordering, replay | Engineers believe the infrastructure solves business correctness |
| Idempotency design | How to design commands, events, imports, projections, and workflows to tolerate duplicates | Duplicate invoices, duplicate postings, duplicate payments |
| Durable intent | Outbox, inbox, command log, request ledger, processing ledger | Work disappears between database commit and message publish |
| Retry classification | Transient, permanent, business, security, capacity, and poison failures | Retry storms and silent data corruption |
| Reconciliation | Count, sum, hash, status, aging, exception queues, control accounts | Month-end discovers mismatch too late |
| Compensation | Reversal, cancellation, correction, adjustment, voiding, settlement repair | Illegal deletes and non-auditable fixes |
| Observability | Correlation id, causation id, business key, attempt id, replay id | Support cannot trace what happened |
| Operational control | DLQ, backoff, quarantine, replay, manual resolution, runbooks | Operators either do nothing or mutate production data unsafely |
| Testing | Duplicate, reorder, timeout, partial commit, replay, concurrent retry | Happy-path integration passes but production fails |

The 20-hour practice target is not to memorize patterns. It is to become able to look at any ERP integration and answer:

  1. What is the business fact?
  2. Who owns it?
  3. What is its stable identity?
  4. What is the idempotency key?
  5. What durable log proves intent and processing?
  6. Which failures are retriable?
  7. Which failures must be quarantined?
  8. How do we reconcile source, ERP, and destination?
  9. What is the legal/audit correction path?
  10. How do we prove it worked?

3. The Exactly-Once Illusion

The phrase “exactly once” is dangerous because people use it at different layers.

| Layer | What “Once” Might Mean | Why It Is Not Enough |
| --- | --- | --- |
| Network | Packet sent once | Network can lose, duplicate, delay, or reorder |
| HTTP client | Request attempted once | Client timeout may hide server-side success |
| Broker | Message delivered once | Broker guarantees rarely cover consumer database side effects |
| Consumer | Handler invoked once | Handler may crash after side effect and before acknowledgment |
| Database | Row inserted once | Business effect may be duplicate under different key |
| Business | Invoice/payment/posting has one effective outcome | This is the only level ERP users care about |

The correct ERP objective is usually effective-once business processing:

The same business intent may be transmitted, retried, redelivered, re-read, or replayed many times, but it produces one accepted business outcome or one controlled rejection with evidence.

That is different from pretending duplicates never happen.

3.1 Failure Timeline: The Classic Duplicate

The important property is not that every component processed once. The important property is that every component had a stable key and a duplicate decision.
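
A minimal sketch of that timeline, assuming an in-memory server that remembers results by idempotency key. `InvoiceServer`, `ClassicDuplicateTimeline`, and the key values are hypothetical names used only for illustration. The first attempt commits but its response is lost; the second attempt is the client's retry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory server: it commits an invoice, but the caller may never see the response.
final class InvoiceServer {
    private final Map<String, String> resultByKey = new HashMap<>(); // idempotency key -> invoice id
    private int invoicesCreated = 0;

    // Creates the invoice only the first time a key is seen; otherwise returns the original result.
    String submit(String idempotencyKey) {
        String existing = resultByKey.get(idempotencyKey);
        if (existing != null) {
            return existing; // duplicate decision: return the original outcome, no new side effect
        }
        invoicesCreated++;
        String invoiceId = "INV-" + invoicesCreated;
        resultByKey.put(idempotencyKey, invoiceId);
        return invoiceId;
    }

    int invoicesCreated() {
        return invoicesCreated;
    }
}

final class ClassicDuplicateTimeline {
    // Attempt 1 commits on the server but the client times out; attempt 2 is the retry.
    static int run(boolean reuseKeyOnRetry) {
        InvoiceServer server = new InvoiceServer();
        server.submit("req-1"); // server commits; assume the response is lost on the way back
        String retryKey = reuseKeyOnRetry ? "req-1" : "req-2";
        server.submit(retryKey); // the client retries after the timeout
        return server.invoicesCreated();
    }
}
```

With a stable key the retry is recognized and only one invoice exists; with a fresh key per retry the server has no way to tell the retry from a new request.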


4. Delivery Semantics You Must Model Explicitly

4.1 At-Most-Once

At-most-once means the system tries once and does not retry.

It avoids duplicates but allows loss.

Useful for:

  • telemetry;
  • non-critical notifications;
  • transient UI hints;
  • low-value cache invalidation.

Dangerous for:

  • invoice posting;
  • stock movement;
  • payment execution;
  • tax submission;
  • approval decision;
  • legal document generation.

4.2 At-Least-Once

At-least-once means the system retries until it believes processing happened.

It reduces loss but can create duplicates.

This is common in messaging systems and integration pipelines. For ERP, it is acceptable only when consumers are idempotent.

4.3 Exactly-Once Infrastructure

Some infrastructure offers exactly-once processing under specific constraints. That does not remove the need for business idempotency.

Reasons:

  • external systems may not participate in the same transaction;
  • users may submit duplicate business commands;
  • batch jobs may be re-run;
  • messages may be replayed for recovery;
  • business keys may be unstable;
  • different transports may deliver the same fact;
  • manual repair may reintroduce a previously processed record.

4.4 Effective-Once Business Outcome

This is the ERP target.

Effective-once means:

  • stable business intent identity;
  • durable acceptance record;
  • deterministic duplicate handling;
  • side-effect ledger;
  • state transition guard;
  • reconciliation evidence;
  • auditable correction path.

5. ERP Idempotency Taxonomy

Not all idempotency is the same. An ERP platform needs multiple idempotency scopes.

| Scope | Example | Key | Expected Duplicate Behavior |
| --- | --- | --- | --- |
| API command | Submit purchase requisition | tenant + requester + clientRequestId | Return original accepted command result |
| Business document | Supplier invoice import | vendor + invoiceNumber + invoiceDate + legalEntity | Reject duplicate or link to existing invoice |
| Posting | GL posting request | sourceDocumentType + sourceDocumentId + postingPurpose | Return existing journal |
| Payment | Bank payment execution | paymentInstructionId | Return same bank execution reference |
| Event projection | Update read model from event | eventId or aggregateId + version | Skip already applied event |
| Workflow task | Approver decision | taskId + actor + decisionSequence | Return already recorded decision |
| Batch import | CSV file import row | fileId + rowNumber + rowHash | Skip, reject, or mark duplicate |
| Reconciliation run | Daily bank recon | account + statementDate + statementId | Continue or return existing run |
| External callback | Payment gateway webhook | provider + webhookId | Acknowledge duplicate without reprocessing |

The mistake is to rely on only one key type. A robust ERP usually combines:

  • technical event id;
  • source system id;
  • natural business key;
  • document identity;
  • version;
  • semantic purpose;
  • hash/fingerprint;
  • legal entity/tenant scope.

6. The ERP Idempotency Contract

Every command/event/import should answer these questions.

| Question | Why It Matters |
| --- | --- |
| What is the idempotency key? | Duplicate suppression cannot work without stable identity |
| Who generated the key? | Client-generated and server-generated keys have different risks |
| What is the key scope? | A vendor invoice number may be unique only per vendor and legal entity |
| What is the retention period? | Duplicate may arrive weeks later during replay or partner re-send |
| What response is returned for duplicate? | API callers need deterministic behavior |
| Is the payload allowed to differ for same key? | Detect accidental key reuse or data tampering |
| Is duplicate accepted, rejected, linked, or ignored? | Business semantics differ by domain |
| What is logged as evidence? | Audit needs the reason for duplicate decision |

6.1 Idempotency Key Design

Good idempotency keys are:

  • stable across retry;
  • scoped by tenant/legal entity/source;
  • generated before side effects;
  • stored durably;
  • linked to payload fingerprint;
  • retained long enough for operational replay;
  • visible to support users;
  • included in logs, traces, and reconciliation reports.

Bad idempotency keys are:

  • random per retry;
  • based on database auto-increment id created after accepting the command;
  • based only on timestamp;
  • based only on user id;
  • hidden in code;
  • not persisted;
  • not included in downstream calls;
  • not protected by a unique constraint.
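
As a small illustration of a stable, scoped key, the sketch below derives a supplier-invoice key from business identity rather than from anything generated per attempt. `SupplierInvoiceKey` is a hypothetical type, and both the `|` delimiter and the upper-casing of the invoice number are assumptions; a real system would pick normalization rules that match how its vendors actually number invoices.

```java
import java.util.Locale;

// Hypothetical value type: a supplier-invoice idempotency key scoped by tenant and legal entity.
record SupplierInvoiceKey(String tenantId, String legalEntityId, String vendorId, String vendorInvoiceNumber) {

    // Normalizes each part so that a retry or partner re-send yields the same key string.
    // Assumption: none of the parts contains the '|' delimiter.
    String asString() {
        return String.join("|",
                tenantId.trim(),
                legalEntityId.trim(),
                vendorId.trim(),
                vendorInvoiceNumber.trim().toUpperCase(Locale.ROOT)); // assumed case-insensitive
    }
}
```

The resulting string is what a unique constraint would protect. The same vendor invoice number under a different legal entity produces a different key, which is the scoping behavior described above.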

6.2 Payload Fingerprint

If the same key arrives with a different payload, it is not a harmless duplicate. It may be:

  • client bug;
  • replay with corrupted data;
  • malicious tampering;
  • source system reusing keys incorrectly;
  • user resubmitting a changed document under old request id.

A command ledger should record a payload fingerprint.

create table erp_command_request (
    tenant_id           varchar(64) not null,
    idempotency_key     varchar(128) not null,
    command_type        varchar(80) not null,
    payload_hash        varchar(128) not null,
    status              varchar(40) not null,
    result_type         varchar(80),
    result_id           varchar(128),
    rejection_code      varchar(80),
    created_at          timestamp not null,
    completed_at        timestamp,
    created_by          varchar(128) not null,
    primary key (tenant_id, idempotency_key)
);

Duplicate behavior:

| Existing Status | Same Payload | Different Payload |
| --- | --- | --- |
| ACCEPTED | Return accepted status | Reject as idempotency conflict |
| COMPLETED | Return original result | Reject as idempotency conflict |
| REJECTED | Return original rejection | Reject or return original rejection based on policy |
| IN_PROGRESS | Return pending or retry later | Reject as conflict |
| UNKNOWN | Quarantine | Quarantine |
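
The duplicate-behavior table can be expressed as a single decision function, which makes the policy easy to test. This is a sketch only: `LedgerStatus`, `DuplicateDecision`, and `DuplicatePolicy` are hypothetical names, and for a REJECTED command with a different payload it picks the "reject as conflict" option where the table leaves the choice to policy.

```java
// Hypothetical enums mirroring the status values and outcomes in the table.
enum LedgerStatus { ACCEPTED, COMPLETED, REJECTED, IN_PROGRESS, UNKNOWN }

enum DuplicateDecision {
    RETURN_ACCEPTED, RETURN_ORIGINAL_RESULT, RETURN_ORIGINAL_REJECTION,
    RETURN_PENDING, REJECT_CONFLICT, QUARANTINE
}

final class DuplicatePolicy {

    // Decides what to do when a request arrives for a key that already exists in the ledger.
    static DuplicateDecision decide(LedgerStatus existingStatus, boolean samePayload) {
        if (existingStatus == LedgerStatus.UNKNOWN) {
            return DuplicateDecision.QUARANTINE; // quarantined regardless of payload
        }
        if (!samePayload) {
            // Assumption: REJECTED + different payload is also treated as a conflict.
            return DuplicateDecision.REJECT_CONFLICT;
        }
        return switch (existingStatus) {
            case ACCEPTED -> DuplicateDecision.RETURN_ACCEPTED;
            case COMPLETED -> DuplicateDecision.RETURN_ORIGINAL_RESULT;
            case REJECTED -> DuplicateDecision.RETURN_ORIGINAL_REJECTION;
            case IN_PROGRESS -> DuplicateDecision.RETURN_PENDING;
            case UNKNOWN -> DuplicateDecision.QUARANTINE; // handled above; kept for exhaustiveness
        };
    }
}
```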

7. Durable Intent: Outbox, Inbox, and Processing Ledger

The hardest reliability bug is the gap between:

  1. committing local database changes; and
  2. publishing a message or calling an external service.

If the database commit succeeds but publish fails, the world never hears about the business fact. If publish succeeds but database commit fails, the world hears about a fact that does not exist.

7.1 Outbox Pattern

The outbox records integration intent inside the same database transaction as the business change. A separate publisher safely sends it later.

Outbox table example:

create table erp_outbox_event (
    outbox_id           uuid primary key,
    tenant_id           varchar(64) not null,
    aggregate_type      varchar(80) not null,
    aggregate_id        varchar(128) not null,
    aggregate_version   bigint not null,
    event_type          varchar(120) not null,
    event_key           varchar(200) not null,
    causation_id        varchar(128),
    correlation_id      varchar(128),
    payload_json        text not null,
    payload_hash        varchar(128) not null,
    status              varchar(40) not null,
    attempts            int not null default 0,
    next_attempt_at     timestamp,
    created_at          timestamp not null,
    published_at        timestamp,
    unique (tenant_id, event_key)
);

Design rules:

  • write outbox inside the same transaction as the source fact;
  • make event identity stable;
  • include aggregate version or source version;
  • publish asynchronously;
  • retry publishing safely;
  • do not delete outbox rows immediately;
  • expose outbox backlog as operational metric;
  • do not treat outbox published as downstream processed.
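
A rough sketch of the publisher side of these rules, using an in-memory list in place of the `erp_outbox_event` table. `OutboxRow`, `OutboxRelay`, and the string status values are hypothetical; the `publisher` predicate stands in for a broker client and is assumed to return true only when the broker confirmed the publish. Backoff via `next_attempt_at` is omitted for brevity.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for one outbox row.
final class OutboxRow {
    final String eventKey;
    final String payloadJson;
    String status = "PENDING";
    int attempts = 0;

    OutboxRow(String eventKey, String payloadJson) {
        this.eventKey = eventKey;
        this.payloadJson = payloadJson;
    }
}

final class OutboxRelay {
    private final int maxAttempts;

    OutboxRelay(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Publishes every PENDING row once; returns how many rows were confirmed in this pass.
    int publishPending(List<OutboxRow> rows, Predicate<OutboxRow> publisher) {
        int published = 0;
        for (OutboxRow row : rows) {
            if (!row.status.equals("PENDING")) {
                continue; // already published or quarantined; rows are retained, never deleted here
            }
            row.attempts++;
            boolean confirmed;
            try {
                confirmed = publisher.test(row);
            } catch (RuntimeException e) {
                confirmed = false; // treat a broker exception as a failed attempt
            }
            if (confirmed) {
                row.status = "PUBLISHED"; // means "handed to the broker", not "processed downstream"
                published++;
            } else if (row.attempts >= maxAttempts) {
                row.status = "QUARANTINED"; // stop retrying and surface to operators
            }
        }
        return published;
    }
}
```

Because a publish can succeed while the status update fails, the same event may be published more than once; that is acceptable only because the event key is stable and consumers deduplicate on it.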

7.2 Inbox Pattern

The inbox records received messages before or during processing so duplicates can be suppressed.

create table erp_inbox_message (
    tenant_id           varchar(64) not null,
    source_system       varchar(80) not null,
    message_id          varchar(160) not null,
    message_type        varchar(120) not null,
    payload_hash        varchar(128) not null,
    status              varchar(40) not null,
    received_at         timestamp not null,
    processed_at        timestamp,
    failure_code        varchar(80),
    failure_message     text,
    primary key (tenant_id, source_system, message_id)
);

Consumer rule:

  1. receive message;
  2. compute message key and payload hash;
  3. insert inbox row with unique key;
  4. if duplicate with same hash, acknowledge and skip or return original result;
  5. if duplicate with different hash, quarantine;
  6. process business side effect;
  7. mark inbox processed;
  8. publish further outbox events if needed.
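
The consumer rule can be sketched with an in-memory map standing in for the `erp_inbox_message` table. `InboxConsumer` and `InboxOutcome` are hypothetical names, and `putIfAbsent` plays the role of the insert guarded by the primary key. In a real consumer the inbox insert and the business side effect would share one database transaction; that part is assumed here rather than shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

enum InboxOutcome { PROCESSED, DUPLICATE_SKIPPED, QUARANTINED }

final class InboxConsumer {
    // key: tenant|source|messageId -> payload hash (stand-in for the unique-keyed inbox table)
    private final Map<String, String> inbox = new ConcurrentHashMap<>();
    private final AtomicInteger sideEffects = new AtomicInteger();

    InboxOutcome receive(String tenantId, String sourceSystem, String messageId, String payloadHash) {
        String key = tenantId + "|" + sourceSystem + "|" + messageId;
        // Atomic "insert if no row with this key exists"; returns the stored hash on a duplicate.
        String previousHash = inbox.putIfAbsent(key, payloadHash);
        if (previousHash == null) {
            sideEffects.incrementAndGet(); // placeholder for the business side effect
            return InboxOutcome.PROCESSED;
        }
        return previousHash.equals(payloadHash)
                ? InboxOutcome.DUPLICATE_SKIPPED   // same message again: acknowledge and skip
                : InboxOutcome.QUARANTINED;        // same id, different content: needs a human
    }

    int sideEffects() {
        return sideEffects.get();
    }
}
```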

7.3 Processing Ledger

Some side effects need their own ledger, not merely outbox/inbox.

Examples:

  • payment execution ledger;
  • tax submission ledger;
  • GL posting ledger;
  • inventory movement ledger;
  • notification delivery ledger;
  • partner file transmission ledger;
  • report generation ledger.

A processing ledger answers:

For this business purpose, has this source fact already produced this side effect?

Example:

create table erp_processing_ledger (
    tenant_id           varchar(64) not null,
    process_type        varchar(80) not null,
    source_type         varchar(80) not null,
    source_id           varchar(128) not null,
    semantic_purpose    varchar(80) not null,
    status              varchar(40) not null,
    result_reference    varchar(160),
    attempts            int not null default 0,
    created_at          timestamp not null,
    completed_at        timestamp,
    primary key (tenant_id, process_type, source_type, source_id, semantic_purpose)
);

The semantic_purpose matters. The same sales invoice may produce:

  • initial AR posting;
  • tax reporting;
  • customer notification;
  • revenue recognition schedule;
  • analytics event;
  • document archive.

Each purpose needs its own idempotency boundary.


8. Retry Is a Business Decision, Not a Loop

A retry policy is not:

while (true) {
    trySend();
}

A retry policy is a controlled state machine.

8.1 Failure Classification

| Failure Type | Example | Retry? | Action |
| --- | --- | --- | --- |
| Transient technical | Network timeout, temporary broker unavailable | Yes | Backoff + jitter |
| Capacity | DB pool exhausted, rate limit | Yes, slower | Throttle and alert |
| Permanent technical | Invalid endpoint config, schema mismatch | No automatic retry after threshold | Quarantine and fix config/code |
| Business validation | Vendor blocked, period closed, credit exceeded | Usually no | Route to business exception queue |
| Security | Unauthorized token, certificate expired | No blind retry | Rotate credential / alert security |
| Duplicate | Same idempotency key and same payload | No side effect | Return existing result |
| Idempotency conflict | Same key different payload | No | Quarantine / reject |
| Poison message | Handler always fails on same payload | No after threshold | DLQ with evidence |
| Unknown outcome | External call timed out after possible execution | Do not blindly repeat | Query status / reconcile first |
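
One way to keep this classification out of scattered `catch` blocks is a single routing function. The sketch below is illustrative: `FailureType`, `NextStep`, and `RetryRouter` are hypothetical names, and sending exhausted transient or capacity failures to quarantine is an assumption, since the table does not say what happens once their retries run out.

```java
enum FailureType {
    TRANSIENT_TECHNICAL, CAPACITY, PERMANENT_TECHNICAL, BUSINESS_VALIDATION, SECURITY,
    DUPLICATE, IDEMPOTENCY_CONFLICT, POISON_MESSAGE, UNKNOWN_OUTCOME
}

enum NextStep {
    RETRY_WITH_BACKOFF, RETRY_THROTTLED, QUARANTINE, BUSINESS_EXCEPTION_QUEUE,
    SECURITY_ALERT, RETURN_EXISTING_RESULT, DEAD_LETTER, QUERY_STATUS_FIRST
}

final class RetryRouter {

    // Maps a classified failure and its attempt count to the next controlled step.
    static NextStep route(FailureType type, int attempts, int maxAttempts) {
        boolean exhausted = attempts >= maxAttempts;
        return switch (type) {
            // Assumption: exhausted transient/capacity failures are quarantined.
            case TRANSIENT_TECHNICAL -> exhausted ? NextStep.QUARANTINE : NextStep.RETRY_WITH_BACKOFF;
            case CAPACITY -> exhausted ? NextStep.QUARANTINE : NextStep.RETRY_THROTTLED;
            case PERMANENT_TECHNICAL -> exhausted ? NextStep.QUARANTINE : NextStep.RETRY_WITH_BACKOFF;
            case POISON_MESSAGE -> exhausted ? NextStep.DEAD_LETTER : NextStep.RETRY_WITH_BACKOFF;
            case BUSINESS_VALIDATION -> NextStep.BUSINESS_EXCEPTION_QUEUE;
            case SECURITY -> NextStep.SECURITY_ALERT;
            case DUPLICATE -> NextStep.RETURN_EXISTING_RESULT;
            case IDEMPOTENCY_CONFLICT -> NextStep.QUARANTINE;
            case UNKNOWN_OUTCOME -> NextStep.QUERY_STATUS_FIRST; // never a blind re-send
        };
    }
}
```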

8.2 Backoff and Jitter

Without backoff, retries amplify incidents.

A retry storm can turn a 30-second database slowdown into a multi-hour outage.

A sane retry policy includes:

  • max attempts;
  • exponential backoff;
  • random jitter;
  • per-partner rate limits;
  • circuit breaker;
  • retry budget;
  • queue depth monitoring;
  • dead-letter/quarantine path;
  • operator-visible reason code.
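
To make the backoff and jitter items concrete, here is a small sketch of capped exponential backoff with "full jitter", where the actual delay is drawn uniformly between zero and the exponential bound. `Backoff` and its method names are hypothetical, and the base and cap values in any caller are tuning assumptions rather than recommendations.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {

    // Upper bound for the given 1-based attempt: base * 2^(attempt - 1), capped at capMillis.
    static long cappedExponentialMillis(int attempt, long baseMillis, long capMillis) {
        if (attempt < 1 || baseMillis < 0 || capMillis < 0) {
            throw new IllegalArgumentException("attempt must be >= 1 and delays non-negative");
        }
        long delay = baseMillis;
        // Doubling stops as soon as the cap is reached, which also avoids overflow.
        for (int i = 1; i < attempt && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    // Full jitter: a uniformly random delay in [0, upper bound], so retries do not synchronize.
    static long fullJitterMillis(int attempt, long baseMillis, long capMillis) {
        long upper = cappedExponentialMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(upper + 1);
    }
}
```

With a base of 100 ms and a cap of 10 seconds, the bound grows 100, 200, 400, 800 ms over the first four attempts and stays at the cap from the eighth attempt onward. Max attempts, rate limits, and circuit breaking would sit around this calculation, not inside it.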

8.3 Unknown Outcome Is Not Failure

The most dangerous state is not failure. The most dangerous state is unknown outcome.

Example:

  1. ERP sends payment to bank.
  2. The call times out before ERP receives a response.
  3. ERP does not know whether the bank executed.
  4. ERP retries the payment without checking its status.
  5. Bank executes twice.

Correct model:

Rule:

If an external side effect may have happened, query/reconcile before retrying the side effect.
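
A minimal sketch of that rule, assuming the bank offers some way to look up a payment by its instruction id. `UnknownOutcomeResolver`, `PaymentResolution`, and the `bankStatusLookup` function are hypothetical; the lookup is assumed to return `Optional.of(true)` when the bank executed the payment, `Optional.of(false)` when the bank confirms it never received it, and an empty `Optional` when the bank cannot answer.

```java
import java.util.Optional;
import java.util.function.Function;

enum PaymentResolution { CONFIRMED_EXECUTED, SAFE_TO_RESEND, NEEDS_RECONCILIATION }

final class UnknownOutcomeResolver {

    // Resolves a timed-out payment by asking the bank first instead of re-sending blindly.
    static PaymentResolution resolve(
            String paymentInstructionId,
            Function<String, Optional<Boolean>> bankStatusLookup // hypothetical status query
    ) {
        Optional<Boolean> executed = bankStatusLookup.apply(paymentInstructionId);
        if (executed.isEmpty()) {
            // The bank cannot tell us yet: hold the payment and let reconciliation decide.
            return PaymentResolution.NEEDS_RECONCILIATION;
        }
        return executed.get()
                ? PaymentResolution.CONFIRMED_EXECUTED // record the result; do not send again
                : PaymentResolution.SAFE_TO_RESEND;    // re-send with the same instruction id
    }
}
```

Even in the "safe to resend" branch, the retry should carry the same payment instruction id so that the bank can deduplicate if its earlier answer was stale.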


9. Java Implementation Sketch: Idempotent Command Handler

This is not generic Spring/JPA material. The point is the ERP-specific control shape.

public final class IdempotentCommandService {

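    private final CommandLedgerRepository ledger;
    private final PurchaseOrderApplicationService purchaseOrders;
    private final PayloadHasher hasher;

    // Collaborators are injected so the final fields are always initialized.
    public IdempotentCommandService(
            CommandLedgerRepository ledger,
            PurchaseOrderApplicationService purchaseOrders,
            PayloadHasher hasher
    ) {
        this.ledger = ledger;
        this.purchaseOrders = purchaseOrders;
        this.hasher = hasher;
    }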

    public CommandResult submitPurchaseOrder(
            TenantId tenantId,
            String idempotencyKey,
            SubmitPurchaseOrderCommand command,
            Actor actor
    ) {
        String payloadHash = hasher.sha256CanonicalJson(command);

        CommandLedgerRecord existing = ledger.find(tenantId, idempotencyKey);
        if (existing != null) {
            if (!existing.payloadHash().equals(payloadHash)) {
                throw new IdempotencyConflictException(
                        "Same idempotency key was used with a different payload"
                );
            }
            return existing.toCommandResult();
        }

        CommandLedgerRecord accepted = CommandLedgerRecord.accepted(
                tenantId,
                idempotencyKey,
                "SubmitPurchaseOrder",
                payloadHash,
                actor.id()
        );

        ledger.insert(accepted); // unique constraint protects concurrent duplicates

        try {
            PurchaseOrderId poId = purchaseOrders.createFromApprovedCommand(command, actor);
            ledger.markCompleted(tenantId, idempotencyKey, "PurchaseOrder", poId.value());
            return CommandResult.completed("PurchaseOrder", poId.value());
        } catch (BusinessRejection ex) {
            ledger.markRejected(tenantId, idempotencyKey, ex.code(), ex.getMessage());
            return CommandResult.rejected(ex.code(), ex.getMessage());
        }
    }
}

Important design notes:

  • the ledger insert must be protected by a unique constraint;
  • payload must be canonicalized before hashing;
  • duplicate request should return original outcome, not create new work;
  • business rejection can be recorded as an outcome;
  • conflict must be explicit;
  • command result must not depend on volatile transient state;
  • for long-running commands, return accepted/pending with correlation id.

10. Canonical Payload Hashing

Hashing raw JSON is often wrong because two equivalent payloads can have different formatting or field order.

Canonical hashing rules:

  • stable field ordering;
  • stable date/time format;
  • explicit timezone treatment;
  • stable number scale/rounding;
  • no volatile fields such as request timestamp unless semantically relevant;
  • normalized whitespace;
  • normalized default values;
  • schema version included;
  • tenant/source context included when necessary.

Example semantic fingerprint fields for supplier invoice import:

tenantId
legalEntityId
vendorId
vendorInvoiceNumber
vendorInvoiceDate
currency
invoiceGrossAmount
lineCount
lineItemSkuAndAmountHash
sourceSystem
sourceDocumentId
schemaVersion

A good fingerprint detects meaningful differences without being disturbed by formatting noise.
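
A minimal sketch of such a fingerprint, assuming the semantic fields have already been extracted into a name-to-value map. `CanonicalFingerprint` is a hypothetical helper; it sorts fields by name, trims whitespace, fixes the number scale, and hashes the result with SHA-256. It assumes field values contain no newline characters, and a real implementation would also pin date and timezone formats.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class CanonicalFingerprint {

    // Hashes the fields in sorted-name order, so input ordering does not affect the result.
    static String sha256(Map<String, String> fields) {
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
            canonical.append(e.getKey()).append('=').append(e.getValue().trim()).append('\n');
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }

    // Stable scale so "100.5" and "100.50" hash identically; throws ArithmeticException
    // instead of silently rounding when the input has more digits than the scale allows.
    static String amount(String raw, int scale) {
        return new BigDecimal(raw.trim()).setScale(scale, RoundingMode.UNNECESSARY).toPlainString();
    }
}
```

Two payloads that differ only in field order or in trailing zeros on the amount produce the same fingerprint, while a changed amount or invoice number produces a different one.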


11. Idempotency by ERP Domain

11.1 Procure-to-Pay

| Process | Idempotency Key | Notes |
| --- | --- | --- |
| Requisition submission | requester + client request id | User double-click safe |
| Purchase order creation | requisition id + conversion purpose | Prevent multiple POs from same approved requisition unless split intentionally |
| Goods receipt import | warehouse + ASN/receipt source id | External WMS may resend |
| Supplier invoice import | vendor + invoice number + legal entity | Natural duplicate detection matters |
| 3-way matching | invoice id + match run id or match version | Re-run should not duplicate match result |
| AP posting | invoice id + posting purpose | One initial liability posting |
| Payment proposal | payment run id + invoice id | Prevent duplicate inclusion |
| Bank payment | payment instruction id | Must survive timeout and replay |

11.2 Order-to-Cash

| Process | Idempotency Key | Notes |
| --- | --- | --- |
| Sales order submission | channel + cart/order id | E-commerce retries are common |
| Allocation | sales order line + allocation purpose + version | Reallocation must be explicit |
| Shipment confirmation | WMS shipment id | WMS may resend status |
| Invoice generation | shipment id or order milestone + billing purpose | Prevent duplicate invoice |
| AR posting | invoice id + posting purpose | Link journal to invoice |
| Cash application | bank statement line id + invoice id | Prevent duplicate receipt matching |
| Return authorization | original sale + return request id | Avoid duplicate returns |
| Credit note | return id + credit purpose | Avoid duplicate credits |

11.3 Inventory

| Process | Idempotency Key | Notes |
| --- | --- | --- |
| Stock movement | movement source + source id + line id | Ledger must not duplicate movement |
| Reservation | demand id + item + location + reservation purpose | Avoid over-reservation |
| Pick confirmation | pick task id + picked line version | WMS event duplicates common |
| Cycle count adjustment | count session + item + bin + serial/lot | Re-run should update result, not duplicate |
| Manufacturing issue | work order + component line + issue sequence | Backflush must be safe |
| Completion receipt | work order + completion batch | Avoid duplicate finished goods receipt |

11.4 General Ledger

| Process | Idempotency Key | Notes |
| --- | --- | --- |
| Subledger posting | source document + posting purpose | One journal per purpose |
| Reversal | original journal + reversal reason + reversal sequence | Reversal is separate legal event |
| Adjustment | adjustment request id | Must be approved and auditable |
| Period close task | fiscal period + task code + run id | Re-run must be controlled |
| Balance projection | journal id + projection version | Projection idempotency separate from posting |

12. Reconciliation as First-Class Architecture

Reconciliation is not a report at the end. It is a core reliability mechanism.

A system that cannot reconcile cannot prove correctness.

12.1 Reconciliation Types

| Type | Example | Purpose |
| --- | --- | --- |
| Count reconciliation | Number of invoices sent vs received | Detect missing/extra records |
| Sum reconciliation | Total payment amount in ERP vs bank | Detect amount mismatch |
| Hash reconciliation | File hash or line fingerprint | Detect tampering/corruption |
| Status reconciliation | ERP says shipped, WMS says pending | Detect lifecycle mismatch |
| Balance reconciliation | Inventory subledger vs GL control account | Detect financial mismatch |
| Aging reconciliation | Items stuck in pending > threshold | Detect operational blockage |
| Sequence reconciliation | Legal invoice sequence gaps | Detect missing/voided document |
| Cross-system reference reconciliation | ERP document not known by partner | Detect integration loss |

12.2 Reconciliation Architecture

12.3 Control Totals

For file/batch integrations, control totals are mandatory.

Example file manifest:

{
  "sourceSystem": "WMS-EU-01",
  "fileId": "SHIPMENT-2026-07-01-001",
  "businessDate": "2026-07-01",
  "recordCount": 125000,
  "totalQuantity": "884321.000",
  "totalAmount": "0.00",
  "payloadSha256": "...",
  "createdAt": "2026-07-01T02:00:00Z"
}

The import should verify:

  • file id has not been processed before;
  • manifest hash matches payload;
  • record count matches parsed rows;
  • quantity/amount totals match;
  • schema version is supported;
  • business date is acceptable;
  • source system is authorized;
  • import result has matched count, rejected count, duplicate count, and quarantined count.
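
A sketch of the first few checks, covering file identity, record count, and quantity total. `Manifest`, `ControlTotals`, and the violation codes are hypothetical, and only a subset of the manifest fields is modeled. Hash, schema, and authorization checks would follow the same shape.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical subset of the manifest fields shown above.
record Manifest(String fileId, long recordCount, BigDecimal totalQuantity) {}

final class ControlTotals {

    // Returns all violations found; an empty list means the file may proceed to row-level import.
    static List<String> verify(Manifest manifest, List<BigDecimal> rowQuantities, Set<String> processedFileIds) {
        List<String> violations = new ArrayList<>();
        if (processedFileIds.contains(manifest.fileId())) {
            violations.add("FILE_ALREADY_PROCESSED");
        }
        if (rowQuantities.size() != manifest.recordCount()) {
            violations.add("RECORD_COUNT_MISMATCH");
        }
        BigDecimal parsedTotal = rowQuantities.stream().reduce(BigDecimal.ZERO, BigDecimal::add);
        // compareTo ignores scale, so 3.0 and 3.000 are treated as equal totals.
        if (parsedTotal.compareTo(manifest.totalQuantity()) != 0) {
            violations.add("QUANTITY_TOTAL_MISMATCH");
        }
        return violations;
    }
}
```

Collecting every violation instead of stopping at the first gives operators the full picture in one import result.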

13. Reconciliation Data Model

create table erp_reconciliation_run (
    run_id              uuid primary key,
    tenant_id           varchar(64) not null,
    recon_type          varchar(80) not null,
    scope_key           varchar(200) not null,
    business_date       date not null,
    status              varchar(40) not null,
    source_count        bigint,
    erp_count           bigint,
    destination_count   bigint,
    source_total        numeric(30, 6),
    erp_total           numeric(30, 6),
    destination_total   numeric(30, 6),
    started_at          timestamp not null,
    completed_at        timestamp,
    created_by          varchar(128) not null,
    unique (tenant_id, recon_type, scope_key, business_date)
);

create table erp_reconciliation_exception (
    exception_id        uuid primary key,
    run_id              uuid not null references erp_reconciliation_run(run_id),
    exception_type      varchar(80) not null,
    severity            varchar(40) not null,
    source_reference    varchar(200),
    erp_reference       varchar(200),
    destination_reference varchar(200),
    expected_value      text,
    actual_value        text,
    status              varchar(40) not null,
    resolution_code     varchar(80),
    resolution_note     text,
    created_at          timestamp not null,
    resolved_at         timestamp,
    resolved_by         varchar(128)
);

The reconciliation exception is not merely a technical ticket. It is business evidence.


14. Exception Queue Design

An exception queue is different from a dead-letter queue.

| Queue | Meaning | Owner |
| --- | --- | --- |
| Dead-letter queue | Technical message cannot be processed automatically | Platform/integration team |
| Business exception queue | Business fact is valid structurally but cannot complete because of domain rule | Business operations |
| Reconciliation exception queue | Cross-system mismatch found | Joint business + support |
| Security exception queue | Unauthorized/suspicious request | Security/compliance |
| Data quality queue | Master/reference data issue blocks processing | Data governance team |

Each exception queue item should include:

  • business key;
  • source system;
  • payload snapshot or secure reference;
  • failure reason code;
  • severity;
  • retry eligibility;
  • suggested resolution;
  • owner team;
  • SLA;
  • correlation id;
  • related document;
  • audit trail;
  • allowed actions.

Allowed actions must be controlled.

Do not let operators “edit raw payload and replay” without evidence.


15. Compensation, Reversal, and Correction

ERP systems should rarely delete mistakes. They correct them.

| Action | Meaning | Example |
| --- | --- | --- |
| Retry | Attempt same side effect again because previous attempt did not complete | Re-send message to broker |
| Replay | Process historical fact again through idempotent handler | Rebuild projection |
| Reverse | Create equal and opposite financial/stock effect | Reverse wrong journal |
| Void | Mark unused legal document number as void with reason | Void invoice number generated in error |
| Cancel | Stop a document before irreversible business effect | Cancel draft PO |
| Adjust | Add explicit correction transaction | Inventory adjustment after count |
| Amend | Create new revision while retaining history | Amend contract terms |
| Reconcile | Match/resolve difference between systems | Match payment to invoice |

The key rule:

Correction must preserve evidence of the original mistake and the authorized remedy.

15.1 Bad Fix

delete from gl_journal where journal_id = 'J-123';

This destroys evidence.

15.2 Better Fix

  • create reversal journal;
  • link reversal to original journal;
  • require approval;
  • record reason code;
  • include period policy;
  • expose both in audit timeline;
  • update reconciliation status.
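
The first four items can be sketched as a pure function that builds a linked reversal instead of deleting anything. `Journal`, `JournalLine`, and `JournalReversal` are hypothetical types; period policy, the approval workflow itself, and persistence are assumed to live elsewhere.

```java
import java.math.BigDecimal;
import java.util.List;

// Hypothetical journal model: reversesJournalId is null for an original journal.
record JournalLine(String account, BigDecimal debit, BigDecimal credit) {}

record Journal(String journalId, String reversesJournalId, String reasonCode,
               String approvedBy, List<JournalLine> lines) {}

final class JournalReversal {

    // Builds a new journal with debits and credits swapped, linked to the original.
    static Journal reverse(Journal original, String newJournalId, String reasonCode, String approvedBy) {
        if (approvedBy == null || approvedBy.isBlank()) {
            throw new IllegalArgumentException("a reversal requires an approver");
        }
        if (reasonCode == null || reasonCode.isBlank()) {
            throw new IllegalArgumentException("a reversal requires a reason code");
        }
        List<JournalLine> swapped = original.lines().stream()
                .map(line -> new JournalLine(line.account(), line.credit(), line.debit()))
                .toList();
        // The original journal is left untouched; both remain visible in the audit timeline.
        return new Journal(newJournalId, original.journalId(), reasonCode, approvedBy, swapped);
    }
}
```

Posting the reversal next to the original nets each account to zero while keeping the evidence of both the mistake and the authorized remedy.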

16. Ordering and Versioning

Duplicate handling is not enough. Some ERP facts must be processed in order.

Example:

  1. sales order created;
  2. sales order approved;
  3. shipment confirmed;
  4. invoice generated;
  5. payment received.

If event 3 arrives before event 2, the consumer must not pretend the world is fine.

Ordering strategies:

| Strategy | Use Case | Trade-off |
| --- | --- | --- |
| Aggregate version | Events for same aggregate | Requires versioned aggregate |
| Sequence per source | File rows or partner events | Source must guarantee sequence |
| State guard | Reject illegal transition | Needs lifecycle model |
| Holding buffer | Temporarily hold future event | Requires timeout and cleanup |
| Reconciliation repair | Accept eventual mismatch and repair | Requires strong exception process |
| Snapshot replacement | Latest state wins | Dangerous for financial facts |

For ERP financial and stock facts, “latest wins” is usually wrong.

16.1 Event Version Guard

public void applyShipmentEvent(ShipmentConfirmed event) {
    ProjectionState state = repository.getState(event.shipmentId());

    if (state.hasApplied(event.eventId())) {
        return;
    }

    if (event.aggregateVersion() != state.nextExpectedVersion()) {
        repository.recordOutOfOrder(event);
        throw new OutOfOrderEventException(event.shipmentId(), event.aggregateVersion());
    }

    state.apply(event);
    state.markApplied(event.eventId());
    repository.save(state);
}

17. Multi-System Correlation Model

You cannot operate a large ERP without correlation.

Minimum identifiers:

| Identifier | Meaning |
| --- | --- |
| traceId | Technical distributed trace |
| correlationId | Business process correlation across commands/events |
| causationId | Immediate cause of this event/command |
| idempotencyKey | Duplicate suppression key |
| sourceSystem | Origin of the fact |
| sourceDocumentId | Source-side business reference |
| erpDocumentId | ERP-side document reference |
| externalReference | Destination-side reference |
| attemptId | Specific retry attempt |
| replayId | Specific replay operation |

Every log/event/exception should carry enough of this to answer:

  • what happened;
  • why it happened;
  • what caused it;
  • whether it was duplicate;
  • whether it was retried;
  • what downstream result was produced;
  • how to reconcile it.

18. Observability Metrics

Technical metrics:

  • outbox backlog count;
  • outbox oldest age;
  • publish attempts per event;
  • inbox duplicate count;
  • DLQ size;
  • retry queue depth;
  • retry success rate;
  • max processing latency;
  • handler error rate;
  • downstream timeout rate.

Business metrics:

  • unposted invoice count;
  • unmatched payment count;
  • unreconciled shipment count;
  • stock movement pending finance count;
  • GL posting exceptions;
  • bank payment unknown outcome count;
  • duplicate supplier invoice attempts;
  • legal sequence gaps;
  • stuck approval tasks;
  • SLA breach count.

A top-tier ERP monitoring dashboard does not show only CPU and memory. It shows business process health.


19. Testing Strategy

19.1 Duplicate Tests

Submit the same command twice and cover each case:

  • same key, same payload;
  • same key, different payload;
  • different key, same natural business document;
  • concurrent duplicate submission;
  • duplicate after completed;
  • duplicate while in progress;
  • duplicate after rejection.

19.2 Retry Tests

Inject:

  • broker publish failure after DB commit;
  • DB failure before outbox insert;
  • consumer failure after side effect before ack;
  • external timeout after possible execution;
  • downstream rate limit;
  • poison payload;
  • schema mismatch;
  • unauthorized credential;
  • max retry exhausted.

19.3 Reconciliation Tests

Create datasets with:

  • missing source row;
  • missing ERP row;
  • missing destination row;
  • amount mismatch;
  • status mismatch;
  • duplicate candidate;
  • sequence gap;
  • wrong currency;
  • timezone boundary issue;
  • partial file import.

19.4 Property-Based Invariants

For generated sequences of retries, duplicates, and replays:

  • one supplier invoice business key creates at most one active payable;
  • one posted document purpose creates at most one initial GL journal;
  • duplicate stock movement key does not change stock twice;
  • payment unknown outcome is not blindly re-executed;
  • replayed events do not change final read model after first application;
  • every quarantined item has owner, reason, and audit evidence.
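The stock movement invariant can be checked with a seeded random generator even without a property-testing library. The sketch below is an assumption-laden illustration: `StockMovement`, `StockLedger`, and `StockInvariantCheck` are hypothetical names, and the ledger is in-memory. It generates delivery sequences with duplicates and reordering, then compares the result with the sum of the distinct movements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical stock movement identified by a stable movement key.
record StockMovement(String movementKey, int quantityDelta) {}

// Applies each movement key at most once, however often it is delivered.
final class StockLedger {
    private final Set<String> appliedKeys = new HashSet<>();
    private int onHand;

    void apply(StockMovement movement) {
        if (appliedKeys.add(movement.movementKey())) {
            onHand += movement.quantityDelta();
        }
    }

    int onHand() {
        return onHand;
    }
}

final class StockInvariantCheck {

    // Returns true if, for every generated sequence, duplicates and reordering
    // leave on-hand stock equal to the sum of the distinct movements.
    static boolean holds(long seed, int runs) {
        Random random = new Random(seed);
        for (int run = 0; run < runs; run++) {
            List<StockMovement> distinct = new ArrayList<>();
            int expected = 0;
            int count = 1 + random.nextInt(20);
            for (int i = 0; i < count; i++) {
                StockMovement movement =
                        new StockMovement("MV-" + run + "-" + i, random.nextInt(201) - 100);
                distinct.add(movement);
                expected += movement.quantityDelta();
            }
            // Every movement is delivered at least once, plus random redeliveries.
            List<StockMovement> deliveries = new ArrayList<>(distinct);
            int extras = random.nextInt(40);
            for (int i = 0; i < extras; i++) {
                deliveries.add(distinct.get(random.nextInt(distinct.size())));
            }
            Collections.shuffle(deliveries, random);

            StockLedger ledger = new StockLedger();
            deliveries.forEach(ledger::apply);
            if (ledger.onHand() != expected) {
                return false;
            }
        }
        return true;
    }
}
```

Fixing the seed keeps a failing sequence reproducible. A dedicated property-testing library would add shrinking, which helps when the failing sequence is long.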

20. Failure Modes and Anti-Patterns

| Anti-pattern | Why It Fails | Better Approach |
|---|---|---|
| “Broker guarantees exactly once” | Does not cover business side effects and external systems | Design effective-once business processing |
| Random idempotency key per retry | Duplicate cannot be detected | Stable client/source/business key |
| Retry all exceptions | Business/security failures become storms | Classify failure and route appropriately |
| Delete bad data | Destroys audit evidence | Reverse, void, adjust, or amend |
| No payload hash | Same key can mutate meaning | Store canonical fingerprint |
| DLQ as graveyard | Failures are hidden | Owned exception workflow and SLA |
| Reporting as reconciliation | Too late and too passive | Dedicated recon runs and exception queues |
| Manual SQL repair | Uncontrolled, unaudited changes | Governed admin action with evidence |
| Source-only uniqueness | Cross-tenant/legal entity duplicates | Scope keys correctly |
| “Latest status wins” | Financial/stock facts lose history | Append-only ledger and state guards |
| Infinite replay | Reprocesses poison forever | Replay policy, dry-run, and cutoff |
| Outbox deletion too soon | Cannot audit or recover | Retain by policy and archive safely |

21. Design Review Checklist

Use this checklist in architecture review.

21.1 Command/API

  • Does every mutating command have an idempotency key?
  • Is the key stable across retry?
  • Is the key scoped correctly?
  • Is payload hash stored?
  • Is duplicate behavior defined?
  • Is conflict behavior defined?
  • Is unique constraint enforced in the database?
  • Is the response deterministic for duplicate commands?

21.2 Messaging

  • Is outbox written in same transaction as business state?
  • Is inbox used by consumers?
  • Is event identity stable?
  • Is ordering required and enforced?
  • Is replay supported safely?
  • Is duplicate event application harmless?
  • Are broker acknowledgments aligned with DB commit?

21.3 External Side Effects

  • Is the external request idempotent?
  • Is unknown outcome modelled?
  • Is status inquiry available?
  • Is blind retry prohibited after possible execution?
  • Is external reference stored?
  • Is reconciliation scheduled?

21.4 Retry and Exception

  • Are failures classified?
  • Is backoff with jitter used?
  • Is a maximum attempt count configured?
  • Is DLQ/quarantine visible?
  • Is owner assigned?
  • Is manual replay controlled?
  • Is every resolution audited?
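For the backoff and maximum-attempt questions, a reviewer can ask to see something like the following. It is a sketch of capped exponential backoff with full jitter, where the delay is drawn uniformly between zero and the capped exponential ceiling. `BackoffPolicy` and its parameters are hypothetical names chosen for illustration.

```java
import java.time.Duration;
import java.util.Random;

// Hypothetical retry policy: capped exponential backoff with full jitter.
final class BackoffPolicy {
    private final long baseMillis;
    private final long capMillis;
    private final int maxAttempts;
    private final Random random;

    BackoffPolicy(Duration base, Duration cap, int maxAttempts, Random random) {
        this.baseMillis = base.toMillis();
        this.capMillis = cap.toMillis();
        this.maxAttempts = maxAttempts;
        this.random = random;
    }

    // True once the attempt budget is spent and the item should be quarantined.
    boolean exhausted(int attemptsSoFar) {
        return attemptsSoFar >= maxAttempts;
    }

    // Upper bound for a 1-based attempt number: base * 2^(attempt - 1), capped.
    Duration ceiling(int attempt) {
        // Clamp the exponent so the shift can never overflow or go negative.
        int doublings = Math.max(0, Math.min(attempt - 1, 30));
        long uncapped = baseMillis * (1L << doublings);
        return Duration.ofMillis(Math.min(uncapped, capMillis));
    }

    // Full jitter: a uniformly random delay between zero and the ceiling, inclusive.
    Duration nextDelay(int attempt) {
        long ceilingMillis = ceiling(attempt).toMillis();
        long jittered = (long) (random.nextDouble() * (ceilingMillis + 1));
        return Duration.ofMillis(jittered);
    }
}
```

Jitter matters most after a shared outage: without it, every consumer that failed at the same moment retries at the same moment.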

21.5 Reconciliation

  • Are count/sum/hash/status reconciliations defined?
  • Are control totals captured?
  • Are exceptions routed to workflow?
  • Is aging monitored?
  • Are close blockers visible?
  • Can support answer “what happened” without raw SQL?

22. 20-Hour Practice Plan

Hour 1-3: Failure Vocabulary

Take five ERP processes:

  • supplier invoice import;
  • bank payment execution;
  • WMS shipment confirmation;
  • stock movement posting;
  • GL journal posting.

For each, define:

  • business fact;
  • source of truth;
  • idempotency key;
  • duplicate behavior;
  • retry behavior;
  • reconciliation method.

Hour 4-6: Build Command Ledger

Implement a small Java/Spring command ledger:

  • accept command;
  • hash payload;
  • enforce unique idempotency key;
  • return original result on duplicate;
  • reject same key/different payload.
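As a starting point for this exercise, the sketch below shows the core behavior in plain Java without Spring or a database. It assumes the payload has already been canonicalized to a string. `CommandLedger`, `Outcome`, and `LedgerResult` are hypothetical names, and the in-memory map stands in for a table with a unique constraint on the idempotency key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical outcome of one submission.
enum Outcome { CREATED, DUPLICATE_REPLAYED, KEY_CONFLICT }

record LedgerResult(Outcome outcome, String result) {}

final class CommandLedger {
    private record Entry(String payloadHash, String result) {}

    // Stands in for a ledger table with a unique constraint on the key.
    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    // Runs the handler at most once per key; later calls replay the stored
    // result or are rejected when the payload fingerprint differs.
    LedgerResult submit(String idempotencyKey, String canonicalPayload,
                        Function<String, String> handler) {
        String hash = sha256(canonicalPayload);
        boolean[] first = {false};
        // Sketch only: a real ledger would not run business logic inside the map call.
        Entry entry = entries.computeIfAbsent(idempotencyKey, key -> {
            first[0] = true;
            return new Entry(hash, handler.apply(canonicalPayload));
        });
        if (first[0]) {
            return new LedgerResult(Outcome.CREATED, entry.result());
        }
        if (!entry.payloadHash().equals(hash)) {
            return new LedgerResult(Outcome.KEY_CONFLICT, null);
        }
        return new LedgerResult(Outcome.DUPLICATE_REPLAYED, entry.result());
    }

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

When porting this to Spring and a relational database, the interesting part is the in-progress state: the sketch has none, because the handler runs synchronously inside the first call.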

Hour 7-9: Build Outbox/Inbox

Implement:

  • outbox table;
  • poller;
  • publisher simulation;
  • inbox table;
  • duplicate consumer suppression.

Inject failures after DB commit and before publish.
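One possible skeleton for this exercise follows. It is an in-memory simulation under stated assumptions: `Outbox`, `OutboxEvent`, and `InboxConsumer` are hypothetical names, the publisher is any `Consumer<OutboxEvent>`, and a thrown `RuntimeException` models a publish failure or a lost acknowledgment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical outbox row. In a real system it is inserted in the same
// database transaction as the business change.
final class OutboxEvent {
    final String eventId;
    final String payload;
    boolean published;

    OutboxEvent(String eventId, String payload) {
        this.eventId = eventId;
        this.payload = payload;
    }
}

final class Outbox {
    private final List<OutboxEvent> rows = new ArrayList<>();

    void add(String eventId, String payload) {
        rows.add(new OutboxEvent(eventId, payload));
    }

    // Publishes every unpublished row. A row is marked published only after the
    // publisher returns normally, so a failure leaves it for the next poll.
    int poll(Consumer<OutboxEvent> publisher) {
        int sent = 0;
        for (OutboxEvent row : rows) {
            if (row.published) {
                continue;
            }
            try {
                publisher.accept(row);
                row.published = true;
                sent++;
            } catch (RuntimeException e) {
                // Row stays unpublished and is retried on the next poll.
            }
        }
        return sent;
    }

    long backlog() {
        return rows.stream().filter(row -> !row.published).count();
    }
}

// Consumer-side inbox: remembers processed event ids and suppresses repeats.
final class InboxConsumer {
    private final Set<String> processedIds = new HashSet<>();
    private int sideEffects;

    void onMessage(OutboxEvent event) {
        if (!processedIds.add(event.eventId)) {
            return; // duplicate delivery, already applied
        }
        sideEffects++; // stands in for the real business side effect
    }

    int sideEffects() {
        return sideEffects;
    }
}
```

The failure worth injecting is a publisher that delivers the message and then throws: the outbox redelivers on the next poll, and the inbox is what keeps the side effect at one.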

Hour 10-12: Retry State Machine

Implement retry states:

  • pending;
  • processing;
  • retry scheduled;
  • quarantined;
  • completed;
  • cancelled.

Add transient/permanent/business failure classification.
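A compact way to begin is an explicit transition guard around the six states. In the sketch below, `RetryItem`, `RetryState`, and `FailureClass` are hypothetical names. Two choices are assumptions rather than rules from this lesson: permanent and business failures both go to quarantine, and cancellation is allowed only from pending, retry scheduled, or quarantined.

```java
import java.util.EnumSet;
import java.util.Set;

enum RetryState { PENDING, PROCESSING, RETRY_SCHEDULED, QUARANTINED, COMPLETED, CANCELLED }

enum FailureClass { TRANSIENT, PERMANENT, BUSINESS }

// Hypothetical retry item with guarded state transitions.
final class RetryItem {
    private final int maxAttempts;
    private RetryState state = RetryState.PENDING;
    private int attempts;

    RetryItem(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    RetryState state() {
        return state;
    }

    int attempts() {
        return attempts;
    }

    // PENDING or RETRY_SCHEDULED -> PROCESSING, counting the attempt.
    void start() {
        require(EnumSet.of(RetryState.PENDING, RetryState.RETRY_SCHEDULED));
        attempts++;
        state = RetryState.PROCESSING;
    }

    // PROCESSING -> COMPLETED.
    void succeed() {
        require(EnumSet.of(RetryState.PROCESSING));
        state = RetryState.COMPLETED;
    }

    // PROCESSING -> RETRY_SCHEDULED only for transient failures with budget left;
    // everything else is quarantined for an owner to resolve (assumption).
    void fail(FailureClass failure) {
        require(EnumSet.of(RetryState.PROCESSING));
        boolean retryable = failure == FailureClass.TRANSIENT && attempts < maxAttempts;
        state = retryable ? RetryState.RETRY_SCHEDULED : RetryState.QUARANTINED;
    }

    // Cancellation is never allowed while processing or after completion (assumption).
    void cancel() {
        require(EnumSet.of(RetryState.PENDING, RetryState.RETRY_SCHEDULED, RetryState.QUARANTINED));
        state = RetryState.CANCELLED;
    }

    private void require(Set<RetryState> allowed) {
        if (!allowed.contains(state)) {
            throw new IllegalStateException("Illegal transition from " + state);
        }
    }
}
```

Persisting `state` and `attempts` on every transition is what turns this from a loop into a state machine that survives a restart.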

Hour 13-15: Reconciliation Engine

Build a small reconciliation engine for ERP vs partner records:

  • match by key;
  • detect missing;
  • detect amount mismatch;
  • detect duplicate;
  • generate exception queue.
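The four detection rules fit in a few dozen lines. The sketch below is a minimal in-memory version, assuming both sides have already been reduced to a key and an amount. `LedgerRecord`, `ExceptionType`, `ReconException`, and `Reconciler` are hypothetical names, and the returned list stands in for the exception queue.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical reduced record: a matching key and an amount.
record LedgerRecord(String key, BigDecimal amount) {}

enum ExceptionType { MISSING_IN_ERP, MISSING_IN_PARTNER, AMOUNT_MISMATCH, DUPLICATE }

record ReconException(ExceptionType type, String key) {}

final class Reconciler {

    // Compares ERP and partner records by key and returns the exception queue.
    static List<ReconException> reconcile(List<LedgerRecord> erp, List<LedgerRecord> partner) {
        List<ReconException> exceptions = new ArrayList<>();
        Map<String, BigDecimal> erpByKey = index(erp, exceptions);
        Map<String, BigDecimal> partnerByKey = index(partner, exceptions);

        // Sorted union of keys keeps the output order stable between runs.
        Set<String> allKeys = new TreeSet<>(erpByKey.keySet());
        allKeys.addAll(partnerByKey.keySet());

        for (String key : allKeys) {
            BigDecimal erpAmount = erpByKey.get(key);
            BigDecimal partnerAmount = partnerByKey.get(key);
            if (erpAmount == null) {
                exceptions.add(new ReconException(ExceptionType.MISSING_IN_ERP, key));
            } else if (partnerAmount == null) {
                exceptions.add(new ReconException(ExceptionType.MISSING_IN_PARTNER, key));
            } else if (erpAmount.compareTo(partnerAmount) != 0) {
                // compareTo ignores scale, so 100.0 and 100.00 match.
                exceptions.add(new ReconException(ExceptionType.AMOUNT_MISMATCH, key));
            }
        }
        return exceptions;
    }

    // Indexes records by key, keeps the first occurrence, and reports repeats.
    private static Map<String, BigDecimal> index(List<LedgerRecord> records,
                                                 List<ReconException> exceptions) {
        Map<String, BigDecimal> byKey = new HashMap<>();
        for (LedgerRecord record : records) {
            if (byKey.putIfAbsent(record.key(), record.amount()) != null) {
                exceptions.add(new ReconException(ExceptionType.DUPLICATE, record.key()));
            }
        }
        return byKey;
    }
}
```

A natural extension for the exercise is to add control totals (count and sum per side) so that a run can be rejected before row-level matching begins.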

Hour 16-18: Unknown Outcome Simulation

Simulate bank timeout after possible payment execution.

Rules:

  • do not retry immediately;
  • query status first;
  • if the outcome remains unknown, route to manual review;
  • record evidence.
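The rules above can be simulated with a small resolver. The sketch below is illustrative only: `BankInquiry` is a hypothetical status-inquiry port, not a real bank API, and the `REJECTED` status, meaning the bank confirms the payment was not executed, is an assumption added so that the simulation has a safe path back to resubmission.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical answers from a bank status inquiry.
enum BankStatus { EXECUTED, REJECTED, UNKNOWN }

enum NextAction { MARK_PAID, RESUBMIT_WITH_SAME_REFERENCE, MANUAL_REVIEW }

// Hypothetical status-inquiry port; a test can supply it as a lambda.
interface BankInquiry {
    BankStatus statusOf(String paymentReference);
}

record Resolution(NextAction action, List<String> evidence) {}

final class UnknownOutcomeResolver {

    // Never re-executes the payment. It only asks for status, up to maxInquiries
    // times, and records every step as evidence.
    static Resolution resolve(String paymentReference, BankInquiry inquiry, int maxInquiries) {
        List<String> evidence = new ArrayList<>();
        evidence.add("timeout after submit: " + paymentReference);
        for (int i = 1; i <= maxInquiries; i++) {
            BankStatus status = inquiry.statusOf(paymentReference);
            evidence.add("inquiry " + i + ": " + status);
            if (status == BankStatus.EXECUTED) {
                return new Resolution(NextAction.MARK_PAID, evidence);
            }
            if (status == BankStatus.REJECTED) {
                // Assumption: a confirmed non-execution makes resubmission safe.
                return new Resolution(NextAction.RESUBMIT_WITH_SAME_REFERENCE, evidence);
            }
        }
        // Outcome still unknown after all inquiries: a person decides.
        return new Resolution(NextAction.MANUAL_REVIEW, evidence);
    }
}
```

In a fuller simulation the inquiries would be spaced out by the retry policy rather than issued back to back, and the evidence list would be persisted with the payment record.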

Hour 19-20: Architecture Review

Review your design using the checklist above.

Your output should include:

  • idempotency matrix;
  • retry policy;
  • reconciliation plan;
  • exception queue model;
  • sequence diagram;
  • failure-mode table;
  • runbook outline.

23. Source Notes

  • Jakarta Messaging describes APIs for Java applications to create, send, and receive messages through loosely coupled, reliable asynchronous communication services.
  • The idempotent consumer pattern is necessary in at-least-once delivery environments because message handlers may receive the same message repeatedly.
  • The transactional outbox pattern is commonly used to publish messages reliably when a service updates a database and must also notify other systems.
  • Jakarta Batch defines a Java API and job specification language for batch jobs, useful for controlled import/export/reconciliation workloads.
  • PostgreSQL transaction and locking behavior matters when implementing command ledgers, outbox polling, unique constraints, and concurrent duplicate suppression.

24. Key Takeaways

  • Exactly-once should be treated as an illusion unless defined at the business outcome level.
  • ERP correctness requires stable identity, idempotent behavior, durable intent, retry classification, and reconciliation.
  • Retry is a state machine, not a loop.
  • Unknown outcome must be resolved by inquiry/reconciliation before repeating irreversible side effects.
  • Outbox proves local intent; inbox protects consumers; processing ledger protects semantic side effects.
  • Reconciliation is not optional reporting. It is the proof mechanism that business facts survived distributed processing.
  • Audit-safe correction uses reversal, void, adjustment, amendment, or controlled replay, not destructive mutation.