
Learn Java Payment Systems Part 037 Smart Retry And Fallback


title: Build From Scratch: Large Production Grade Java Payment Systems - Part 037
description: Smart retry and fallback design for enterprise Java payment systems, covering retriable vs non-retriable failures, unknown outcome, idempotency, provider fallback, retry budget, circuit breaking, subscriptions, payout retries, and operational controls.
series: learn-java-payment-systems
seriesTitle: Build From Scratch: Large Production Grade Java Payment Systems
order: 37
partTitle: Smart Retry and Fallback
tags:
  - java
  - payments
  - retry
  - fallback
  - idempotency
  - orchestration
  - resilience
  - payment-systems
  - enterprise-architecture
date: 2026-07-02

Part 037 — Smart Retry and Fallback

A retry is not a loop.

In a payment system, a retry is a financial decision.

A naive engineer sees this:

call provider
if timeout, call again

A production payment engineer sees this:

call provider
if timeout, decide whether a financial side effect may already exist
if side effect may exist, do not blindly call another provider
first resolve outcome through idempotency, provider reference, webhook, status inquiry, or reconciliation
only retry when the operation is proven safe under a specific retry contract

That difference is the boundary between a normal distributed system and a system that moves money.

Most distributed-systems retry advice is directionally useful:

use timeouts
use exponential backoff
use jitter
make APIs idempotent
avoid retry storms

But payment systems add a sharper rule:

never use retry to hide uncertainty about money movement.

If a payment provider times out, there are three possible realities:

1. provider never received the request
2. provider received it but did not execute it
3. provider received it, executed it, but response was lost

The client sees the same symptom in all three cases:

timeout

The payment system must not collapse those into:

failed

It must classify them as:

unknown until proven otherwise

This part builds the retry and fallback subsystem for a production-grade Java payment platform.


1. Core Mental Model

Retries are safe only when the repeated operation has the same business effect as one execution.

In normal backend systems, idempotency often means:

same request -> same response

In payment systems, idempotency means more:

same financial operation -> at most one external side effect -> at most one ledger posting -> same observable business outcome

A retry policy therefore needs four inputs:

operation type
current known state
provider contract
failure classification

Example:

operation type: card authorization
current state: AUTHORIZING
provider contract: supports idempotency key on authorization
failure: HTTP 503 before body received

Possible decision:

retry same provider with same idempotency key after backoff

Different example:

operation type: card authorization
current state: AUTHORIZING_UNKNOWN
provider contract: no reliable idempotency key
failure: socket timeout after request body was sent

Possible decision:

do not retry authorization
run status inquiry / wait webhook / queue reconciliation repair

The retry is not determined by error code alone.

It is determined by the combination of:

error + operation + state + idempotency guarantee + side-effect risk

2. Retry Classes in a Payment Platform

A payment platform has several retry domains. Mixing them is a common source of double charge, double refund, duplicate payout, or noisy incidents.

Each domain has different safety rules.

| Retry Domain | Typical Trigger | Safe When | Dangerous When |
|---|---|---|---|
| Client to platform | mobile retry, browser retry, API timeout | platform idempotency key is stable | key changes per click |
| Platform to provider | provider timeout, 5xx, rate limit | provider idempotency and operation log are reliable | outcome may already exist |
| Webhook processing | DB deadlock, transient consumer crash | raw webhook already stored and deduped | handler calls external side effect repeatedly |
| Ledger posting | outbox consumer retry, DB retry | journal idempotency key is unique | posting is generated with new key every retry |
| Reconciliation retry | file parse failure, provider API pagination failure | imports are content-addressed/deduped | each import creates new financial movement |
| Payout worker retry | bank timeout, worker crash | payout operation reference is stable | worker creates new bank transfer each time |
| Subscription retry | soft decline, issuer unavailable | retry window and card network rules allow | hard decline/fraud decline is retried aggressively |
| Backoffice action retry | operator page timeout | action command has idempotency and approval id | duplicate manual adjustment |

A production platform does not have one retry utility.

It has a retry decision layer.


3. The Most Important Rule: Retry Same Operation, Not Same Intent

Payment systems often confuse these two statements:

Retry this purchase.
Retry this provider operation.

They are not the same.

A purchase intent may have multiple attempts:

payment_intent pi_123
  attempt att_1 -> Provider A -> timeout -> unknown
  attempt att_2 -> Provider B -> authorised

This can be valid only if att_1 is proven not authorised, or if the business explicitly permits multiple attempts with only one capturable authorization and has a cancellation/reversal process for late success.

Most systems should not fall back from Provider A to Provider B while the outcome at Provider A is still unknown.

Safer model:

same external operation unknown -> resolve outcome first
same payment intent failed with terminal non-side-effect failure -> may create new attempt
same payment intent refused by issuer -> may ask customer for another method
same subscription invoice soft-declined -> may schedule future collection attempt

The retry target matters.

Retry provider operation = same attempt, same provider, same idempotency key, same amount, same method.
Retry payment intent = new attempt, possibly different provider or method, only after prior attempt is terminal or safely isolated.
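
One way to keep the two retry targets apart is to derive the provider idempotency key from the attempt and the operation type. A retry of the same operation then reuses the key by construction, and a new attempt necessarily gets a different one. The following is a minimal sketch: the class name `OperationKeySketch`, the `attemptId:operationType` key material, and the choice of a name-based UUID are illustrative assumptions, not part of the series codebase.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class OperationKeySketch {
    private OperationKeySketch() {}

    // Deterministic key: the same attempt and operation type always map to the same key.
    // The "attemptId:operationType" material is an assumption made for this sketch.
    static String idempotencyKey(String paymentAttemptId, String operationType) {
        String material = paymentAttemptId + ":" + operationType;
        return UUID.nameUUIDFromBytes(material.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

With this shape, retrying `att_1` cannot accidentally mint a fresh key, while a fallback attempt `att_2` cannot reuse the key of `att_1`.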

4. Failure Classification

The orchestrator should not consume raw errors directly. It should receive normalized failure classification.

raw provider outcome -> adapter normalization -> retry decision

Basic taxonomy:

| Failure Class | Meaning | Example | Retry Default |
|---|---|---|---|
| VALIDATION_ERROR | request invalid before side effect | invalid amount, unsupported currency | no retry |
| AUTHENTICATION_ERROR | credential/config issue | invalid API key | no automatic retry |
| RATE_LIMITED | provider throttling | HTTP 429 | retry with backoff if safe |
| TEMPORARY_PROVIDER_ERROR | provider transient failure | HTTP 500/503 | retry if idempotent |
| NETWORK_CONNECT_FAILURE | could not connect | DNS/connect timeout | retry if request not sent |
| NETWORK_READ_TIMEOUT | request may have been sent | socket read timeout | unknown unless idempotency/status inquiry exists |
| PROVIDER_TIMEOUT | provider acknowledged timeout | gateway timeout | unknown/safe retry depends on provider contract |
| ISSUER_SOFT_DECLINE | issuer says try later or authenticate | issuer unavailable, try again later | scheduled retry or customer action |
| ISSUER_HARD_DECLINE | issuer says do not retry | stolen card, invalid account | no retry |
| RISK_DECLINE | risk system blocks | fraud suspected | no retry unless review changes decision |
| UNKNOWN_OUTCOME | platform cannot know side effect | lost response after request sent | resolve first |

The retry policy should not say:

retry all 5xx

It should say:

retry if the operation is retry-safe under the provider contract and current state remains non-terminal
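
The adapter normalization step can be sketched as a pure function from transport evidence to a failure class. This is a hedged illustration only: `FailureClassifierSketch`, the `NetworkPhase` enum, and the specific mapping of 400/422, 401/403, and 504 are assumptions for the example. A real adapter would also inspect provider-specific error codes, which is where issuer declines would be recognized.

```java
final class FailureClassifierSketch {
    // Where in the request lifecycle the failure was observed (hypothetical enum).
    enum NetworkPhase { CONNECT, REQUEST_SENT, RESPONSE_RECEIVED }

    // A subset of the taxonomy above; issuer and risk declines need provider codes.
    enum FailureClass {
        VALIDATION_ERROR, AUTHENTICATION_ERROR, RATE_LIMITED, TEMPORARY_PROVIDER_ERROR,
        NETWORK_CONNECT_FAILURE, NETWORK_READ_TIMEOUT, PROVIDER_TIMEOUT, UNKNOWN_OUTCOME
    }

    private FailureClassifierSketch() {}

    // httpStatus is null when no HTTP response was received at all.
    static FailureClass classify(NetworkPhase phase, Integer httpStatus) {
        if (httpStatus == null) {
            return switch (phase) {
                case CONNECT -> FailureClass.NETWORK_CONNECT_FAILURE;
                case REQUEST_SENT -> FailureClass.NETWORK_READ_TIMEOUT;
                case RESPONSE_RECEIVED -> FailureClass.UNKNOWN_OUTCOME;
            };
        }
        int status = httpStatus;
        return switch (status) {
            case 400, 422 -> FailureClass.VALIDATION_ERROR;      // assumed mapping
            case 401, 403 -> FailureClass.AUTHENTICATION_ERROR;  // assumed mapping
            case 429 -> FailureClass.RATE_LIMITED;
            case 500, 503 -> FailureClass.TEMPORARY_PROVIDER_ERROR;
            case 504 -> FailureClass.PROVIDER_TIMEOUT;           // assumed mapping
            // Anything unrecognized is treated as unknown rather than as failed.
            default -> FailureClass.UNKNOWN_OUTCOME;
        };
    }
}
```

Note the default branch: an unrecognized status is classified as unknown, never silently as a terminal failure.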

5. Unknown Outcome Is Not Failure

This is the central payment reliability idea.

UNKNOWN must be explicit in the domain model.

Bad model:

if (timeout) {
    payment.markFailed();
}

Better model:

if (failure.isOutcomeUnknown()) {
    payment.markAuthorizingUnknown(operationId, failure.evidence());
    resolutionQueue.enqueue(operationId);
}

A timeout is evidence of communication failure, not evidence of payment failure.


6. Provider Operation Log

Every external side-effecting call should be represented as an operation.

PaymentIntent
  PaymentAttempt
    ProviderOperation AUTHORIZATION
    ProviderOperation CAPTURE
    ProviderOperation REFUND

Why?

Because retry and fallback decisions need operation-level evidence:

operation id
operation type
provider
provider account
idempotency key
request fingerprint
request sent at
network phase
response received at
normalized result
raw response reference
retry count
last failure class
resolution state

Example schema:

create table provider_operation (
    id uuid primary key,
    payment_attempt_id uuid not null,
    operation_type text not null,
    provider_code text not null,
    provider_account_id text not null,
    idempotency_key text not null,
    request_fingerprint text not null,
    status text not null,
    outcome text not null,
    retry_count integer not null default 0,
    next_retry_at timestamptz,
    last_failure_class text,
    provider_reference text,
    raw_request_ref text,
    raw_response_ref text,
    created_at timestamptz not null,
    updated_at timestamptz not null,
    constraint uq_provider_operation_idempotency
        unique (provider_code, provider_account_id, operation_type, idempotency_key),
    constraint ck_provider_operation_status
        check (status in (
            'PREPARED',
            'SENDING',
            'SUCCEEDED',
            'FAILED',
            'UNKNOWN',
            'RESOLUTION_PENDING',
            'REQUIRES_REVIEW'
        )),
    constraint ck_provider_operation_outcome
        check (outcome in (
            'NONE',
            'AUTHORISED',
            'CAPTURED',
            'REFUNDED',
            'DECLINED',
            'CANCELLED',
            'UNKNOWN'
        ))
);

Notice the unique constraint.

That is not just database hygiene.

It is a financial control.


7. Request Fingerprint

An idempotency key without a request fingerprint is dangerous.

Bad:

idempotency key: order-123
first request: amount 100000 IDR
second request: amount 150000 IDR

A payment platform should store a fingerprint of semantic request fields:

merchant_id
payment_intent_id
operation_type
provider
amount
currency
payment_method_reference
capture_method
customer_reference
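
A fingerprint over the fields listed above could be computed by hashing them in a fixed order. The sketch below assumes SHA-256, a newline separator, and a hypothetical `SemanticRequest` record; it also assumes field values never contain the separator, which a production canonicalization would have to enforce.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class RequestFingerprintSketch {
    // Hypothetical carrier for the semantic fields; names mirror the list above.
    record SemanticRequest(String merchantId, String paymentIntentId, String operationType,
                           String provider, long amountMinor, String currency,
                           String paymentMethodReference, String captureMethod,
                           String customerReference) {}

    private RequestFingerprintSketch() {}

    // Hashes the fields in a fixed order so equal requests always yield equal fingerprints.
    static String fingerprint(SemanticRequest r) {
        String canonical = String.join("\n",
                r.merchantId(), r.paymentIntentId(), r.operationType(), r.provider(),
                Long.toString(r.amountMinor()), r.currency(),
                r.paymentMethodReference(), r.captureMethod(), r.customerReference());
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(canonical.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```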

Then reject incompatible retries.

create table idempotent_operation_guard (
    idempotency_scope text not null,
    idempotency_key text not null,
    request_fingerprint text not null,
    operation_id uuid not null,
    created_at timestamptz not null,
    expires_at timestamptz not null,
    primary key (idempotency_scope, idempotency_key)
);

If the same key arrives with a different fingerprint:

409 IDEMPOTENCY_KEY_REUSED_WITH_DIFFERENT_PAYLOAD

Do not continue.
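
The guard check itself can be illustrated with an in-memory stand-in for the `idempotent_operation_guard` table. This is a sketch under stated assumptions: `IdempotencyGuardSketch` and its `Result` values are hypothetical names, expiry is ignored, and a real implementation would rely on the table's primary key inside a database transaction rather than on a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotencyGuardSketch {
    enum Result { ACCEPTED_NEW, REPLAY_SAME_REQUEST, REJECTED_DIFFERENT_PAYLOAD }

    // Key: scope + ":" + idempotency key. Value: fingerprint of the first request seen.
    private final Map<String, String> fingerprints = new ConcurrentHashMap<>();

    Result check(String scope, String idempotencyKey, String requestFingerprint) {
        // putIfAbsent is atomic, so two concurrent first requests cannot both be accepted.
        String existing = fingerprints.putIfAbsent(scope + ":" + idempotencyKey, requestFingerprint);
        if (existing == null) {
            return Result.ACCEPTED_NEW;
        }
        return existing.equals(requestFingerprint)
                ? Result.REPLAY_SAME_REQUEST
                : Result.REJECTED_DIFFERENT_PAYLOAD; // surfaces as the 409 above
    }
}
```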


8. Retry Decision Model

A good retry decision should be explicit and auditable.

public enum RetryAction {
    RETRY_SAME_OPERATION,
    SCHEDULE_RETRY,
    WAIT_FOR_WEBHOOK,
    STATUS_INQUIRY,
    MARK_TERMINAL_FAILURE,
    ASK_CUSTOMER_ACTION,
    FALLBACK_NEW_ATTEMPT,
    SEND_TO_MANUAL_REVIEW,
    STOP
}

public record RetryDecision(
    RetryAction action,
    boolean externalSideEffectMayExist,
    boolean requiresSameIdempotencyKey,
    Duration delay,
    String reasonCode,
    String explanation
) {}

The policy input should be rich enough:

public record RetryContext(
    OperationType operationType,
    PaymentState paymentState,
    ProviderCode providerCode,
    ProviderContract providerContract,
    FailureClass failureClass,
    int retryCount,
    Instant firstAttemptAt,
    Instant now,
    boolean requestBytesSent,
    boolean providerReferenceKnown,
    boolean statusInquiryAvailable,
    boolean webhookExpected,
    boolean paymentIntentAllowsNewAttempt
) {}

Policy skeleton:

public final class PaymentRetryPolicy {

    public RetryDecision decide(RetryContext ctx) {
        if (ctx.failureClass() == FailureClass.VALIDATION_ERROR) {
            return stop("REQUEST_INVALID", "Invalid request cannot be fixed by retry.");
        }

        if (ctx.failureClass() == FailureClass.ISSUER_HARD_DECLINE) {
            return customerAction("HARD_DECLINE", "Customer must use another payment method.");
        }

        if (ctx.failureClass() == FailureClass.NETWORK_READ_TIMEOUT && ctx.requestBytesSent()) {
            if (ctx.providerContract().supportsIdempotencyFor(ctx.operationType())) {
                return retrySameOperationWithBackoff(ctx, "READ_TIMEOUT_WITH_IDEMPOTENCY");
            }
            if (ctx.statusInquiryAvailable()) {
                return statusInquiry("READ_TIMEOUT_NEEDS_RESOLUTION");
            }
            return manualReview("READ_TIMEOUT_UNRESOLVABLE_AUTOMATICALLY");
        }

        if (ctx.failureClass() == FailureClass.RATE_LIMITED) {
            return scheduleBackoff(ctx, "PROVIDER_RATE_LIMIT");
        }

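        if (ctx.failureClass() == FailureClass.TEMPORARY_PROVIDER_ERROR) {
            // A 5xx is retried only when the provider deduplicates this operation
            // and the per-operation retry budget still has room.
            boolean idempotent = ctx.providerContract().supportsIdempotencyFor(ctx.operationType());
            if (idempotent && ctx.retryCount() < ctx.providerContract().maxSafeRetries(ctx.operationType())) {
                return retrySameOperationWithBackoff(ctx, "TRANSIENT_PROVIDER_ERROR");
            }
            return statusInquiry("RETRY_NOT_SAFE_OR_BUDGET_EXHAUSTED");
        }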

        if (ctx.failureClass() == FailureClass.UNKNOWN_OUTCOME) {
            return statusInquiry("UNKNOWN_OUTCOME_MUST_BE_RESOLVED");
        }

        return manualReview("UNCLASSIFIED_FAILURE");
    }
}

The implementation detail is less important than the invariant:

retry decisions are explicit domain decisions, not incidental exception handling.

9. Retry Budget

Unbounded retry is a self-inflicted outage.

A retry budget limits retries per:

operation
payment intent
merchant
provider
route
error class
time window

Example:

operation_policies:
  card_authorization:
    transient_provider_error:
      max_attempts: 2
      backoff: exponential_jitter
      initial_delay_ms: 300
      max_delay_ms: 5000
    network_read_timeout:
      max_attempts: 1
      only_if_provider_idempotency: true
      otherwise: status_inquiry
    issuer_hard_decline:
      max_attempts: 0
    issuer_soft_decline:
      realtime_retry: 0
      scheduled_retry: allowed_if_subscription

Provider-level budget:

Provider A has 40% timeout rate in last 5 minutes.
Stop retrying immediately.
Open circuit.
Route new eligible attempts elsewhere.
Resolve existing unknown operations separately.

Retry budget should be visible in metrics:

payment_retry_attempts_total{operation, provider, failure_class}
payment_retry_budget_exhausted_total{operation, provider}
payment_unknown_outcome_total{provider, operation}
payment_status_inquiry_total{provider, reason}
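
A per-key budget like the `max_attempts` values above can be sketched as a bounded counter. The class name `RetryBudgetSketch` and the string budget key (for example `provider:operation:failure_class`) are assumptions; this sketch has no time window, so a real budget would also need to reset or expire counters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class RetryBudgetSketch {
    private final int maxRetriesPerKey;
    private final Map<String, AtomicInteger> used = new ConcurrentHashMap<>();

    RetryBudgetSketch(int maxRetriesPerKey) {
        this.maxRetriesPerKey = maxRetriesPerKey;
    }

    // Consumes one unit and returns true if budget remains for this key; returns false otherwise.
    boolean tryAcquire(String budgetKey) {
        AtomicInteger counter = used.computeIfAbsent(budgetKey, k -> new AtomicInteger());
        while (true) {
            int current = counter.get();
            if (current >= maxRetriesPerKey) {
                return false;
            }
            // Compare-and-set keeps the limit exact under concurrent workers.
            if (counter.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    int remaining(String budgetKey) {
        AtomicInteger counter = used.get(budgetKey);
        return maxRetriesPerKey - (counter == null ? 0 : counter.get());
    }
}
```

A `false` from `tryAcquire` is the point at which a counter such as `payment_retry_budget_exhausted_total` would be incremented and the operation handed to status inquiry or review.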

10. Backoff With Jitter

Backoff without jitter can synchronize workers and amplify incidents.

Bad:

retry all failed provider calls after exactly 1 second

Better:

base delay grows exponentially
actual delay has jitter
max delay is capped
retry budget is enforced

Example:

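import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public final class JitteredBackoff {
    private final Duration base;
    private final Duration max;

    public JitteredBackoff(Duration base, Duration max) {
        this.base = base;
        this.max = max;
    }

    // Full jitter: a uniform delay in [0, min(max, base * 2^attempt)) milliseconds.
    public Duration delayForAttempt(int attempt) {
        // Clamp the exponent to [0, 10] so a negative attempt cannot produce a bogus shift.
        long exponent = 1L << Math.max(0, Math.min(attempt, 10));
        long upperMillis = Math.min(max.toMillis(), base.toMillis() * exponent);
        // Jitter needs no cryptographic randomness; nextLong(bound) requires bound > 0.
        long jitteredMillis = ThreadLocalRandom.current().nextLong(Math.max(1, upperMillis));
        return Duration.ofMillis(jitteredMillis);
    }
}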

Payment-specific addition:

backoff schedules retry of same operation, not creation of a new financial operation.

11. Fallback Semantics

Fallback means using an alternative route or provider.

Fallback is safe only in certain states.

Unsafe fallback:

Provider A timed out after receiving authorization request.
System immediately sends same card authorization to Provider B.
Later Provider A webhook says authorised.
Provider B also authorised.
Customer has two authorizations.

Safe fallback cases:

connect failure before request bytes sent
provider rejects request before side effect because capability/config validation failed
provider returns terminal failure that proves no authorization/capture/payout happened
risk/policy rejects route before external call
provider health circuit is open before route selection

Potentially safe but requires design:

provider returns issuer soft decline and policy says another acquirer may improve authorization rate
subscription retry after delay with stored credential rules
customer selects a different payment method after terminal decline

Never treat fallback as a generic catch block.
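
The safe and unsafe cases above can be encoded as an explicit predicate instead of a catch block. This is a minimal sketch: `FallbackGuardSketch`, the `PriorAttemptEvidence` values, and the `duplicateRiskContained` flag are hypothetical names chosen to mirror the lists in this section.

```java
final class FallbackGuardSketch {
    // What is known about the previous attempt when fallback is being considered.
    enum PriorAttemptEvidence {
        CONNECT_FAILED_BEFORE_SEND,
        REJECTED_BEFORE_SIDE_EFFECT,
        TERMINAL_FAILURE_PROVEN,
        TIMEOUT_AFTER_SEND,
        OUTCOME_UNKNOWN
    }

    private FallbackGuardSketch() {}

    // Fallback is allowed when no side effect can exist, or when the business has
    // explicit containment (for example a cancellation workflow for late success).
    static boolean fallbackAllowed(PriorAttemptEvidence evidence, boolean duplicateRiskContained) {
        return switch (evidence) {
            case CONNECT_FAILED_BEFORE_SEND, REJECTED_BEFORE_SIDE_EFFECT, TERMINAL_FAILURE_PROVEN -> true;
            case TIMEOUT_AFTER_SEND, OUTCOME_UNKNOWN -> duplicateRiskContained;
        };
    }
}
```

Because the switch has no default branch, adding a new evidence value forces a compile-time decision about whether it permits fallback.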


12. Route Attempt vs Provider Operation

Payment routing and retry must coordinate.

A common model:

PaymentIntent
  PaymentAttempt #1
    RouteDecision #1: Provider A
    ProviderOperation #1: AUTHORIZE
  PaymentAttempt #2
    RouteDecision #2: Provider B
    ProviderOperation #2: AUTHORIZE

Rules:

one payment attempt has one selected route for authorization
same attempt may retry same provider operation
fallback to another provider usually creates a new attempt
old attempt must be terminal failed or safely isolated before new attempt becomes active

Example table:

create table payment_attempt (
    id uuid primary key,
    payment_intent_id uuid not null,
    attempt_no integer not null,
    provider_code text not null,
    status text not null,
    amount_minor bigint not null,
    currency char(3) not null,
    created_at timestamptz not null,
    constraint uq_payment_attempt_no unique (payment_intent_id, attempt_no),
    constraint ck_payment_attempt_status check (status in (
        'CREATED',
        'ROUTE_SELECTED',
        'AUTHORIZING',
        'AUTHORIZING_UNKNOWN',
        'AUTHORISED',
        'FAILED',
        'CANCELLED',
        'REQUIRES_REVIEW'
    ))
);

A fallback attempt should reference the reason:

alter table payment_attempt
add column fallback_from_attempt_id uuid,
add column fallback_reason_code text,
add column fallback_policy_version text;

This makes later analysis possible:

Did fallback improve conversion?
Did fallback increase duplicate authorization incidents?
Did fallback produce higher cost?
Did fallback violate provider routing policy?

13. Operation-Specific Retry Rules

Not all payment operations are equally dangerous.

Authorization

Authorization creates a hold or approval.

Retry with same idempotency key can be safe if provider supports it.

Fallback to another acquirer is risky when the first outcome is unknown.

Capture

Capture moves an authorization toward clearing/settlement.

Rules:

capture amount cannot exceed authorized remaining amount
capture retry must use same capture operation idempotency key
unknown capture should be resolved before recapture
partial capture needs separate capture sequence
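
The first and last capture rules can be sketched as two small helpers: an amount check against the remaining authorized amount, and a key that is stable per capture sequence. `CaptureGuardSketch` and the `authorizationId:capture:sequence` key format are illustrative assumptions.

```java
final class CaptureGuardSketch {
    private CaptureGuardSketch() {}

    // All amounts are in minor units. A capture must be positive and fit the remaining amount.
    static boolean canCapture(long authorizedMinor, long alreadyCapturedMinor, long requestedMinor) {
        if (requestedMinor <= 0) {
            return false;
        }
        return requestedMinor <= authorizedMinor - alreadyCapturedMinor;
    }

    // Each partial capture gets its own sequence number, so retrying capture #1
    // reuses its key while capture #2 is a distinct operation.
    static String captureKey(String authorizationId, int captureSequence) {
        return authorizationId + ":capture:" + captureSequence;
    }
}
```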

Refund

Refund moves money back to customer.

Rules:

refund amount cannot exceed refundable amount
refund retry must not create a new refund object accidentally
unknown refund should be resolved before issuing another refund

Void/Cancel

Void/cancel releases an authorization before capture.

Rules:

cancel after capture may not be possible
cancel/capture race must be serialized per authorization
unknown cancel should not be interpreted as failed if provider may process it later

Payout

Payout sends money out of the platform.

Rules:

beneficiary verified before retry
balance reservation persists across retry
bank reference/idempotency key stable
unknown bank transfer is never retried as a new transfer without inquiry

Ledger Posting

Ledger posting is internal but financially critical.

Rules:

posting rule idempotency key unique
same business event cannot post twice
retry after DB deadlock is safe if journal key is stable
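
The ledger rules can be illustrated with an in-memory journal keyed by a stable journal key. This is only a sketch: `LedgerPostingSketch` and `JournalEntry` are hypothetical, a single-sided entry stands in for a balanced double-entry posting, and a real ledger would enforce the key with a unique database constraint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LedgerPostingSketch {
    record JournalEntry(String journalKey, String account, long amountMinor) {}

    private final Map<String, JournalEntry> journal = new ConcurrentHashMap<>();

    // Returns true only for the first posting of a journal key; retries are no-ops.
    boolean post(JournalEntry entry) {
        return journal.putIfAbsent(entry.journalKey(), entry) == null;
    }

    long balanceOf(String account) {
        return journal.values().stream()
                .filter(e -> e.account().equals(account))
                .mapToLong(JournalEntry::amountMinor)
                .sum();
    }
}
```

A worker that crashes after posting and retries with the same key leaves the balance unchanged, which is the property the rules above ask for.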

14. Subscription Retry and Dunning

Subscription retry is different from realtime checkout retry.

Realtime checkout retry:

customer is present
latency matters
retry budget is tiny
fallback may be possible only when safe
customer can choose another method

Subscription retry:

customer may be absent
time window is days/weeks
retry must respect stored credential/mandate semantics
notifications and dunning matter
entitlement policy matters
card network retry constraints matter

For subscription retry, the platform should store:

retry schedule version
failure classification
merchant policy
invoice state
customer notification evidence
payment method mandate/stored credential reference
attempt count
next collection time

Do not model dunning as only:

next_retry_at

Dunning is customer communication plus financial collection policy.
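
The collection side of that policy can be sketched as a versioned schedule of offsets from the first failure. The class `DunningScheduleSketch`, its parameters, and the offsets used in the usage note are assumptions for illustration; actual offsets depend on merchant policy and card network rules, and the notification side is not modeled here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

final class DunningScheduleSketch {
    private final String scheduleVersion;
    private final List<Duration> offsetsFromFirstFailure;

    DunningScheduleSketch(String scheduleVersion, List<Duration> offsetsFromFirstFailure) {
        this.scheduleVersion = scheduleVersion;
        this.offsetsFromFirstFailure = List.copyOf(offsetsFromFirstFailure);
    }

    String scheduleVersion() {
        return scheduleVersion;
    }

    // Returns the next collection time, or empty when collection must stop:
    // after a hard decline, or when every scheduled retry has been used.
    Optional<Instant> nextCollectionAt(Instant firstFailureAt, int retriesAlreadyMade, boolean hardDecline) {
        if (hardDecline || retriesAlreadyMade < 0 || retriesAlreadyMade >= offsetsFromFirstFailure.size()) {
            return Optional.empty();
        }
        return Optional.of(firstFailureAt.plus(offsetsFromFirstFailure.get(retriesAlreadyMade)));
    }
}
```

For example, a schedule built with offsets of 1, 3, and 7 days yields a first retry one day after the initial failure and no further retries after the third.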


15. Rate Limits and Retry Storms

Provider rate limits require backoff, not aggression.

If Provider A returns rate limit:

wrong response: all workers retry immediately
right response: shared provider budget reduces traffic, queue delays retries, circuit health degrades

A provider health model should include:

success rate
timeout rate
p95/p99 latency
rate limit rate
unknown outcome rate
error mix
status inquiry success rate
webhook delay

Circuit states:

CLOSED: route normally
DEGRADED: route lower priority, reduce retries
OPEN: do not route new attempts, only resolve existing unknowns
HALF_OPEN: limited probes

Payment-specific circuit rule:

open circuit for new attempts does not mean stop resolution of existing unknown operations.

Existing unknowns still need status inquiry, webhook ingestion, or reconciliation.
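
The split between routing and resolution can be made explicit in code. The sketch below is an assumption-laden illustration: `ProviderCircuitSketch` uses only the timeout rate as a health signal, takes its thresholds as parameters, and does not model the limited probes of HALF_OPEN.

```java
final class ProviderCircuitSketch {
    enum CircuitState { CLOSED, DEGRADED, OPEN, HALF_OPEN }

    private ProviderCircuitSketch() {}

    // Derives a state from one health signal; thresholds are supplied by the caller.
    static CircuitState fromTimeoutRate(double timeoutRate, double degradedAt, double openAt) {
        if (timeoutRate >= openAt) {
            return CircuitState.OPEN;
        }
        if (timeoutRate >= degradedAt) {
            return CircuitState.DEGRADED;
        }
        return CircuitState.CLOSED;
    }

    // New attempts are routed only when the circuit is CLOSED or DEGRADED.
    // HALF_OPEN probes would need a separate limiter, which this sketch omits.
    static boolean routesNewAttempts(CircuitState state) {
        return state == CircuitState.CLOSED || state == CircuitState.DEGRADED;
    }

    // Resolution of existing unknown operations continues in every state.
    static boolean resolvesExistingUnknowns(CircuitState state) {
        return true;
    }
}
```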


16. Retry Worker Design

A retry worker should not select arbitrary failed records.

It should acquire due operations safely.

select id
from provider_operation
where status in ('FAILED', 'UNKNOWN', 'RESOLUTION_PENDING')
  and next_retry_at <= now()
order by next_retry_at asc
for update skip locked
limit 100;

But the worker should still re-evaluate the state inside the transaction.

public final class ProviderOperationRetryWorker {

    public void handle(UUID operationId) {
        ProviderOperation op = operationRepository.lockById(operationId);
        PaymentAttempt attempt = attemptRepository.lockById(op.paymentAttemptId());

        RetryContext ctx = retryContextFactory.from(op, attempt);
        RetryDecision decision = retryPolicy.decide(ctx);

        switch (decision.action()) {
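            case RETRY_SAME_OPERATION -> retrySameOperation(op, decision);
            // SCHEDULE_RETRY is a declared RetryAction; without this case it would hit the default and throw.
            case SCHEDULE_RETRY -> scheduleRetry(op, decision);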
            case STATUS_INQUIRY -> scheduleStatusInquiry(op, decision);
            case WAIT_FOR_WEBHOOK -> waitForWebhook(op, decision);
            case MARK_TERMINAL_FAILURE -> markFailed(op, decision);
            case FALLBACK_NEW_ATTEMPT -> createFallbackAttempt(attempt, decision);
            case SEND_TO_MANUAL_REVIEW -> openCase(op, decision);
            case ASK_CUSTOMER_ACTION -> requestCustomerAction(attempt, decision);
            case STOP -> stop(op, decision);
            default -> throw new IllegalStateException("Unsupported retry action " + decision.action());
        }
    }
}

Important invariant:

retry worker does not decide from stale state.

It locks and revalidates.


17. Status Inquiry as Retry Alternative

Sometimes the correct retry is not repeating the original operation.

It is asking:

what happened to the original operation?

Status inquiry is necessary for:

unknown authorization
unknown capture
unknown refund
unknown payout
late webhook suspicion
provider response lost
merchant support investigation

Status inquiry result should feed the same state machine as webhook.

Do not create a second path that bypasses validation.

The state machine should not care whether evidence came from:

synchronous response
webhook
status inquiry
reconciliation file
operator action

It should apply evidence consistently.
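
A source-agnostic transition function might look like the sketch below. The names `EvidenceStateMachineSketch`, `Evidence`, and `Observed` are hypothetical, only the authorization states are covered, and conflicting evidence against an already terminal state is deliberately left to a separate late-event path.

```java
final class EvidenceStateMachineSketch {
    enum Source { SYNCHRONOUS_RESPONSE, WEBHOOK, STATUS_INQUIRY, RECONCILIATION_FILE, OPERATOR_ACTION }
    enum Status { AUTHORIZING, AUTHORIZING_UNKNOWN, AUTHORISED, FAILED }
    enum Observed { AUTHORISED, DECLINED, STILL_PENDING }

    // The source is recorded for audit but never influences the transition.
    record Evidence(Source source, Observed observed) {}

    private EvidenceStateMachineSketch() {}

    static Status apply(Status current, Evidence evidence) {
        // Terminal states are not overwritten here; a conflicting late event
        // should be routed to a dedicated handler instead.
        if (current == Status.AUTHORISED || current == Status.FAILED) {
            return current;
        }
        return switch (evidence.observed()) {
            case AUTHORISED -> Status.AUTHORISED;
            case DECLINED -> Status.FAILED;
            case STILL_PENDING -> current;
        };
    }
}
```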


18. Customer-Facing Retry

Internal retry and customer retry are not the same.

Customer-facing message should not say:

Provider A returned acquirer error 905 with network timeout after socket write.

Customer-facing messages should be safe and actionable:

Your bank could not approve this payment. Try another payment method or contact your bank.

Customer-facing retry options:

try same method again
use another card
complete authentication
choose bank transfer
wait and retry later
contact issuer
contact merchant support

The platform should map failure classifications to recommended actions.

issuer_hard_decline -> use another method / contact issuer
insufficient_funds -> use another method
authentication_required -> complete authentication
provider_temporary_error -> try again shortly
payment_unknown -> do not ask customer to pay again until resolved, unless duplicate risk is explicitly managed

This is both UX and risk control.
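
The mapping above can be kept as data rather than scattered conditionals. In this sketch, `CustomerActionSketch` and its `Action` values are hypothetical, and the choice to fall back to support for unmapped classifications is an assumption.

```java
import java.util.Map;

final class CustomerActionSketch {
    enum Action {
        USE_ANOTHER_METHOD, COMPLETE_AUTHENTICATION, TRY_AGAIN_SHORTLY,
        WAIT_FOR_RESOLUTION, CONTACT_SUPPORT
    }

    private static final Map<String, Action> ACTIONS = Map.of(
            "issuer_hard_decline", Action.USE_ANOTHER_METHOD,
            "insufficient_funds", Action.USE_ANOTHER_METHOD,
            "authentication_required", Action.COMPLETE_AUTHENTICATION,
            "provider_temporary_error", Action.TRY_AGAIN_SHORTLY,
            // Unknown outcome: do not invite a second payment until it is resolved.
            "payment_unknown", Action.WAIT_FOR_RESOLUTION);

    private CustomerActionSketch() {}

    // Unmapped classifications fall back to support instead of suggesting a retry.
    static Action recommend(String failureClassification) {
        return ACTIONS.getOrDefault(failureClassification, Action.CONTACT_SUPPORT);
    }
}
```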


19. Fallback and Ledger Timing

Do not post success ledger entries before outcome is known.

Possible states:

authorization pending -> no capture/settlement ledger movement yet, maybe optional internal hold/reservation
authorization success -> post authorization/receivable/pending payable as appropriate
capture success -> post capture/clearing-related movement
settlement success -> post settlement movement

When fallback happens, ledger must represent each attempt distinctly.

attempt A failed terminally -> no success posting
attempt B authorised -> success posting tied to attempt B
attempt A later succeeds unexpectedly -> duplicate incident, reversal/cancel workflow, ledger correction

A robust system also has a late-success handler:

if previously failed/unknown attempt later receives success evidence:
  if payment intent already paid by another attempt:
      open duplicate authorization case
      cancel/void/refund depending on operation stage
      post correction/reserve if money moved
  else:
      apply success normally

Production systems must be humble about time.

Late events happen.
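
The branching in the late-success handler above reduces to one decision, sketched here in Java. `LateSuccessSketch` and its parameters are hypothetical; the follow-up work (cancel, void, refund, ledger correction) is represented only by the returned resolution.

```java
final class LateSuccessSketch {
    enum Resolution { APPLY_SUCCESS, OPEN_DUPLICATE_CASE }

    private LateSuccessSketch() {}

    // paidByAttemptId is null when the payment intent has not been paid by any attempt.
    static Resolution onLateSuccess(String lateAttemptId, String paidByAttemptId) {
        if (paidByAttemptId == null || paidByAttemptId.equals(lateAttemptId)) {
            // No competing success exists, so the evidence is applied normally.
            return Resolution.APPLY_SUCCESS;
        }
        // Another attempt already paid the intent: treat this as a duplicate incident.
        return Resolution.OPEN_DUPLICATE_CASE;
    }
}
```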


20. Backoffice Controls

Operations need visibility into retry/fallback decisions.

Backoffice should show:

payment intent timeline
attempt timeline
provider operations
idempotency key
request fingerprint
raw response reference
failure classification
retry decisions
next retry time
retry budget remaining
status inquiry attempts
fallback attempts
manual override availability

Manual actions should be controlled:

force status inquiry
stop retry
resume retry
mark requires review
create customer action request
approve fallback
trigger provider-specific repair

Manual actions should not include:

click random retry with new provider operation id
edit amount after authorization
delete failed attempt record
manually mark paid without ledger posting

Backoffice must be safer than the automated system, not more dangerous.


21. Observability

Useful metrics:

payment_operation_retry_total{operation_type, provider, failure_class, decision}
payment_operation_unknown_total{operation_type, provider}
payment_operation_unknown_age_seconds{provider}
payment_fallback_total{from_provider, to_provider, reason}
payment_duplicate_late_success_total{provider, operation_type}
payment_retry_budget_exhausted_total{provider, operation_type}
payment_status_inquiry_total{provider, outcome}
payment_retry_queue_depth{provider, operation_type}
payment_retry_queue_oldest_age_seconds{provider, operation_type}

Important dashboards:

provider health
unknown outcome aging
retry volume by error class
fallback conversion impact
duplicate authorization incidents
subscription recovery rate
payout retry backlog

Important alerts:

unknown outcome count above threshold
unknown outcome age above SLA
fallback spike
retry budget exhausted spike
provider rate limit spike
late success after fallback
payout unknown state unresolved

22. Testing Matrix

Test retry logic as a state machine, not only as unit tests.

| Scenario | Expected Result |
|---|---|
| connect timeout before request sent | retry or fallback may be allowed |
| read timeout after request sent with provider idempotency | retry same operation with same key or inquiry |
| read timeout after request sent without idempotency | status inquiry/manual review |
| provider 500 before side effect proven | retry same operation if safe |
| provider 429 | backoff; reduce provider health score |
| issuer hard decline | no retry; customer action |
| issuer soft decline in checkout | limited retry/customer action |
| issuer soft decline in subscription | scheduled retry/dunning |
| capture timeout | do not create second capture blindly |
| refund timeout | do not create second refund blindly |
| payout timeout | do not create second bank transfer blindly |
| webhook success after fallback success | duplicate incident workflow |
| retry worker crashes after external success before DB update | idempotency/status inquiry resolves |
| operation log duplicate insert | unique constraint prevents duplicate side effect record |

Property-like invariant:

For any sequence of retries, crashes, webhooks, and status inquiries,
there must be at most one successful financial posting per idempotent operation.

Another invariant:

Fallback must not be selected while previous attempt has unresolved side-effect risk,
unless the platform has explicit duplicate-risk containment logic.

23. Anti-Patterns

Anti-Pattern 1: Retry All Exceptions

@Retryable
public AuthorizationResult authorize(...) {
    return provider.authorize(...);
}

This is dangerous because exception type alone does not encode side-effect risk.

Anti-Pattern 2: New Idempotency Key Per Retry

attempt 1: key abc
retry 1: key def
retry 2: key ghi

This defeats provider idempotency.

Anti-Pattern 3: Fallback on Timeout

Provider A timeout -> Provider B immediately

This can create duplicate authorization.

Anti-Pattern 4: Treat Decline as Error

Issuer decline is often a valid business outcome, not a system error.

Anti-Pattern 5: Hide Unknown From Domain Model

timeout -> failed

This corrupts settlement and support workflows.

Anti-Pattern 6: Retry Without Budget

A provider outage plus aggressive retry can become a self-inflicted traffic storm.

Anti-Pattern 7: Manual Retry Button Without Idempotency

Backoffice must not bypass financial controls.


24. Production Checklist

Before enabling smart retry/fallback, verify:

[ ] every external operation has operation id
[ ] operation has stable idempotency key
[ ] request fingerprint is stored
[ ] raw request/response evidence is stored safely
[ ] timeout phase is classified
[ ] unknown outcome is explicit
[ ] retry budget exists per operation/provider/failure class
[ ] backoff uses jitter
[ ] provider health can suppress retry/fallback
[ ] fallback is blocked while prior side-effect risk is unresolved
[ ] status inquiry is integrated into same state machine as webhook
[ ] ledger posting is idempotent
[ ] late success after fallback has incident workflow
[ ] subscription retry is separated from realtime checkout retry
[ ] payout retry has beneficiary and balance reservation controls
[ ] operations can inspect and control retry safely
[ ] metrics and alerts exist for unknown outcomes and retry storms

25. Minimal Build Order

Build in this order:

1. provider operation table
2. idempotency guard with request fingerprint
3. failure classification from adapter
4. explicit UNKNOWN state
5. retry policy engine
6. retry queue with retry budget
7. jittered backoff
8. status inquiry path
9. provider health circuit
10. fallback with attempt isolation
11. backoffice timeline
12. late success handler

Do not start with dynamic routing.

Start with safety.

