
Learn Java Error Reliability Observability Part 013 Retry Timeout Idempotency


title: Learn Java Error, Reliability & Observability Engineering - Part 013
description: Retry, timeout, idempotency, backoff, jitter, deadline propagation, retry safety, and duplicate-effect control for production Java systems.
series: learn-java-error-reliability-observability
seriesTitle: Learn Java Error, Reliability & Observability Engineering
order: 13
partTitle: Retry, Timeout & Idempotency
tags:
  • java
  • reliability
  • error-handling
  • retry
  • timeout
  • idempotency
  • observability
  • distributed-systems
date: 2026-06-28

Part 013 — Retry, Timeout & Idempotency

Learning Targets

After this part, you should be able to answer the following production questions precisely:

  1. May this operation be retried?
  2. Who performs the retry: the client, gateway, worker, message broker, or scheduler?
  3. What happens if the first request actually succeeded but the response was lost?
  4. Which timeout is active: connect timeout, read timeout, request timeout, total deadline, queue timeout, or transaction timeout?
  5. Does the retry improve reliability, or does it create a retry storm?
  6. How do you prove through telemetry that retries help rather than hurt?

This topic is not just library configuration. It is the design of a failure control loop.

Retry is a mechanism of optimism. Timeout is a mechanism for bounding time. Idempotency is a safety mechanism for side effects. Without all three, a distributed system easily turns into a damage multiplier.


1. Kaufman Skill Deconstruction

Following Josh Kaufman's approach, a large skill is broken down into small sub-skills that can be practiced quickly.

20-Hour Practice Breakdown

| Hours | Focus | Output |
|---|---|---|
| 1-2 | Failure classification | Matrix of retryable/non-retryable errors |
| 3-4 | Timeout taxonomy | End-to-end timeout diagram |
| 5-7 | Retry policy | Retry wrapper with classification |
| 8-10 | Backoff & jitter | Simulation of retry storm vs jitter |
| 11-13 | Idempotency key | Dedupe table implementation |
| 14-16 | Duplicate effect control | Simulation of lost response and duplicate request |
| 17-18 | Observability | Metrics/logs/traces for retry |
| 19-20 | Production review | Checklist, failure injection, mini postmortem |

2. Mental Model: A Distributed Operation Always Has an Uncertain Outcome

In single-process Java, a method call usually has a clear outcome:

Order order = orderService.create(command);

It either returns successfully or throws an exception.

In a distributed system, a call to a dependency has four possible outcomes:

| Client Sees | Server Actually | Meaning |
|---|---|---|
| Success response | Success committed | Safe |
| Error response | Failure before commit | Usually safe to retry if transient |
| Timeout/no response | Unknown | Dangerous: it may have failed, it may have succeeded |
| Connection reset | Unknown | Dangerous: the state may already be committed |

The most dangerous case is not an explicit failure. It is the unknown outcome.

Without idempotency, a retry can make a payment, a notification, a case creation, or a status mutation happen twice.


3. Basic Terms That Must Be Precise

| Term | Meaning |
|---|---|
| Attempt | A single try of a call |
| Initial attempt | The first try, before any retry |
| Retry | A repeated try after a failure or timeout |
| Max attempts | Total attempts, usually including the initial attempt |
| Backoff | Wait time before a retry |
| Jitter | Randomization of the backoff so that retries do not synchronize |
| Timeout | Local time limit for waiting on one operation |
| Deadline | Total time limit inherited along the whole call chain |
| Retry budget | Upper bound on retries so that they do not amplify overload |
| Idempotency | Property that repeated requests with the same intent do not change the result more than once |
| Dedupe store | Storage used to recognize duplicate requests |
| Unknown outcome | The caller does not know whether the side effect was committed |

Note the difference between timeout and deadline:

timeout  = time limit for one local operation
deadline = total time limit for the whole workflow/request

Bad example:

User request deadline: 2s
Service A timeout to B: 2s
Service B timeout to C: 2s
Service C timeout to D: 2s

Each hop starts its own 2s clock, so in the worst case downstream work keeps running for close to 6s, long after the user gave up at 2s.

Good example:

Request deadline: now + 2s
A receives the deadline
A gives B the remaining budget: 1.7s
B gives C the remaining budget: 1.1s
C gives D the remaining budget: 500ms
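
The shrinking budget in the example above can be sketched as a small helper. This is a minimal illustration, not a standard API: `RequestDeadline`, `timeoutForNextHop`, the local reserve, and the per-hop cap are all hypothetical names and parameters introduced here.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: carries one absolute deadline through a call chain.
final class RequestDeadline {
    private final Instant deadline;
    private final Clock clock;

    RequestDeadline(Instant deadline, Clock clock) {
        this.deadline = deadline;
        this.clock = clock;
    }

    // Time left before the original caller stops waiting; never negative.
    Duration remaining() {
        Duration left = Duration.between(clock.instant(), deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    // Timeout for the next downstream call: the remaining budget minus a
    // reserve for local work, capped by the hop's own maximum.
    Duration timeoutForNextHop(Duration localReserve, Duration hopMax) {
        Duration budget = remaining().minus(localReserve);
        if (budget.isNegative()) {
            return Duration.ZERO;
        }
        return budget.compareTo(hopMax) < 0 ? budget : hopMax;
    }
}
```

With a 2s deadline and a 300ms local reserve, the next hop gets 1.7s, matching the first step of the example. Across processes, clocks can disagree, so it is usually safer to send the remaining duration to the next service than an absolute timestamp.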

4. Retry Is Not Universal Error Handling

Retry only makes sense if the failure is:

  1. Transient: temporary overload, network blip, leader election, temporary throttling.
  2. Safe to repeat: it does not duplicate side effects.
  3. Bounded: limited by count, time, and budget.
  4. Observable: there are metrics that show whether the retry succeeded.
  5. Coordinated: it is not performed at many layers without control.

Do Not Retry

| Failure | Reason |
|---|---|
| Validation error | Wrong input does not become right on retry |
| Business rejection | A rule or the domain state rejects the operation |
| Authorization failure | Retrying can turn into brute force or noise |
| Deterministic 404 for a resource that does not exist | Not transient |
| Duplicate key caused by a different request | Needs conflict handling, not retry |
| Programming bug | Retry only repeats the bug |
| Permanent config error | Must fail fast and alert |
| Payload too large | Not transient |
| Unsupported operation | Not transient |

Maybe Retry

| Failure | Condition |
|---|---|
| 408 Request Timeout | Check idempotency and client/server semantics |
| 429 Too Many Requests | Respect the rate limit and Retry-After if present |
| 500 Internal Server Error | Only if the operation is idempotent or deduped |
| 502 Bad Gateway | Usually transient |
| 503 Service Unavailable | Usually transient, but watch for overload |
| 504 Gateway Timeout | Unknown outcome; requires idempotency |
| Connection reset | Unknown outcome |
| Socket timeout | Unknown outcome |
| Optimistic lock conflict | Can retry if the command is still valid |
| Deadlock/serialization failure | Can retry the transaction with tight limits |
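
Honoring Retry-After means parsing it first. RFC 9110 allows either a number of seconds or an HTTP-date. The sketch below shows one way to turn the header into a wait duration; `RetryAfterParser` is a hypothetical helper, and ignoring unparseable values is an assumption made for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper: converts a Retry-After header value into a wait duration.
final class RetryAfterParser {

    // Returns the requested wait, or empty if the header is absent or unparseable.
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null || headerValue.isBlank()) {
            return Optional.empty();
        }
        String value = headerValue.trim();
        try {
            // Form 1: delay in seconds, e.g. "120".
            long seconds = Long.parseLong(value);
            return seconds >= 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
        } catch (NumberFormatException notSeconds) {
            // Not a number; try the HTTP-date form below.
        }
        try {
            // Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            Instant when = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
            Duration wait = Duration.between(now, when);
            return Optional.of(wait.isNegative() ? Duration.ZERO : wait);
        } catch (DateTimeParseException unparseable) {
            return Optional.empty();
        }
    }
}
```

The parsed value should still be clamped against the remaining deadline. A server asking for 120s is a signal to fail fast when the caller only has 2s left.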

5. Timeout Taxonomy

Many production bugs happen because an engineer says "there is already a timeout" when the existing timeout is only one of many kinds.

| Timeout | Protects against |
|---|---|
| Queue timeout | Waiting too long before the request starts |
| DNS timeout | DNS resolution hanging |
| Connect timeout | TCP connection not being established |
| TLS handshake timeout | Slow or stuck handshake |
| Write timeout | Request body not being sent |
| Read/socket timeout | Response not being received |
| Request timeout | The whole HTTP call taking too long |
| Transaction timeout | DB transaction running too long |
| Lock timeout | Waiting too long for a lock |
| Executor timeout | Task waiting or running too long |
| Total deadline | The whole user/business operation taking too long |

Design Rule

Every external call must have:

  1. A single-attempt timeout
  2. A total retry deadline
  3. Cancellation propagation
  4. An observable timeout reason
  5. A fallback or failure mapping

maxAttempts=3 alone is not enough.


6. Timeout Budgeting

Suppose the API SLO is a p95 latency of 800ms. The request path:

API Gateway -> Order Service -> Payment Service -> Fraud Service -> DB

Rough budget:

| Segment | Budget |
|---|---|
| Gateway overhead | 50ms |
| Order validation | 50ms |
| Payment call total | 350ms |
| Fraud call total | 200ms |
| DB write | 100ms |
| Serialization/response | 50ms |

If the Payment Service call makes 3 attempts with a 350ms timeout per attempt, the budget breaks immediately, before any backoff is even counted:

3 attempts * 350ms = 1050ms

That exceeds both the 350ms payment budget and the 800ms SLO. Retries must therefore share one total deadline instead of each attempt carrying its own independent timeout.

Deadline-Aware Retry

public final class Deadline {
    private final long deadlineNanos;

    private Deadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    public static Deadline after(Duration duration) {
        return new Deadline(System.nanoTime() + duration.toNanos());
    }

    public Duration remaining() {
        long remaining = deadlineNanos - System.nanoTime();
        return remaining <= 0 ? Duration.ZERO : Duration.ofNanos(remaining);
    }

    public boolean expired() {
        return remaining().isZero();
    }
}

Retry loop:

public <T> T executeWithDeadline(
        Supplier<T> operation,
        Deadline deadline,
        RetryPolicy policy
) {
    int attempt = 1;
    Throwable lastFailure = null;

    while (!deadline.expired() && attempt <= policy.maxAttempts()) {
        try {
            return operation.get();
        } catch (Throwable failure) {
            lastFailure = failure;

            // Non-retryable failures propagate immediately.
            if (!policy.isRetryable(failure)) {
                throw failure;
            }

            // Do not sleep after the final attempt.
            if (attempt == policy.maxAttempts()) {
                break;
            }

            Duration delay = policy.nextDelay(attempt);

            // Give up if the backoff would consume the remaining budget.
            if (delay.compareTo(deadline.remaining()) >= 0) {
                break;
            }

            sleepInterruptibly(delay);
            attempt++;
        }
    }

    throw new DependencyTimeoutException(
            "Operation did not succeed within the deadline or attempt limit",
            lastFailure
    );
}

Important note: this example is illustrative. In production, avoid catching Throwable, because that also catches Error (for example OutOfMemoryError), which should normally propagate. Catch Exception or a narrower failure type unless you have a specific reason to handle Error.


7. Retry Amplification

Retries multiply traffic.

If there are 5 layers and each one makes up to 3 attempts:

3^5 = 243 attempts

One user request can turn into 243 calls to the deepest dependency.
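
The arithmetic generalizes to any number of layers. The sketch below computes the worst case, assuming every attempt at every layer fails; `RetryAmplification` is a name made up for this illustration.

```java
// Worked arithmetic for retry amplification across stacked layers.
final class RetryAmplification {

    // Worst-case number of calls reaching the deepest dependency when every
    // layer makes up to attemptsPerLayer attempts and every attempt fails.
    static long worstCaseCalls(int attemptsPerLayer, int layers) {
        long calls = 1;
        for (int i = 0; i < layers; i++) {
            calls = Math.multiplyExact(calls, attemptsPerLayer);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseCalls(3, 5)); // five layers, each retrying
        System.out.println(worstCaseCalls(3, 1)); // only one layer retries
    }
}
```

Letting a single layer own the retry keeps the worst case at 3 calls instead of 243.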

Rule: Retry Ownership Harus Jelas

For every dependency, decide:

| Question | Decision |
|---|---|
| Which layer may retry? | Usually the client closest to the user intent, or a standardized library boundary |
| May the gateway retry? | Only for safe/idempotent operations |
| May a worker retry? | Yes, if the message is idempotent and the DLQ policy is clear |
| May the DB driver retry? | Be careful: transaction semantics must be clear |
| May the service mesh retry? | Dangerous if the application is unaware of duplicate effects |

8. Backoff & Jitter

Immediate retries are often harmful:

t=0ms failure
t=1ms retry
t=2ms retry
t=3ms retry

When thousands of clients do this at the same time, the dependency that is already struggling gets hit even harder.

Exponential Backoff

delay = min(base * 2^(attempt - 1), maxDelay)

Example with base = 100ms and maxDelay = 1000ms:

| Attempt | Delay |
|---|---|
| 1 | 100ms |
| 2 | 200ms |
| 3 | 400ms |
| 4 | 800ms |
| 5 | 1000ms (capped) |

Full Jitter

delay = random(0, cappedExponentialDelay)

Java example:

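import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public final class Backoff {
    private final Duration base;
    private final Duration max;

    public Backoff(Duration base, Duration max) {
        this.base = base;
        this.max = max;
    }

    public Duration fullJitterDelay(int attempt) {
        // Clamp the exponent so the shift cannot wrap for large attempt numbers.
        int exponent = Math.min(Math.max(0, attempt - 1), 30);
        long factor = 1L << exponent;
        long baseMillis = base.toMillis();
        long maxMillis = max.toMillis();
        // Compare before multiplying so base * factor can never overflow.
        long capped = baseMillis > maxMillis / factor ? maxMillis : baseMillis * factor;
        // Full jitter: uniform random delay in [0, capped].
        long jittered = ThreadLocalRandom.current().nextLong(capped + 1);
        return Duration.ofMillis(jittered);
    }
}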

Anti-Pattern: Fixed Retry Delay

// Bad: all clients retry at the same rhythm
Thread.sleep(100);

Better

Duration delay = backoff.fullJitterDelay(attempt);
sleepInterruptibly(delay);

9. Retry Classification

A retry policy must be based on classification, not just "catch the exception and try again".

public enum FailureKind {
    TRANSIENT,
    THROTTLED,
    TIMEOUT_UNKNOWN_OUTCOME,
    PERMANENT,
    DOMAIN_REJECTION,
    PROGRAMMER_ERROR
}

public record ClassifiedFailure(
        FailureKind kind,
        boolean retryable,
        boolean requiresIdempotency,
        String reason
) {}

Classifier:

public final class DependencyFailureClassifier {

    public ClassifiedFailure classify(Throwable failure) {
        if (failure instanceof java.net.SocketTimeoutException) {
            return new ClassifiedFailure(
                    FailureKind.TIMEOUT_UNKNOWN_OUTCOME,
                    true,
                    true,
                    "socket timeout; server outcome unknown"
            );
        }

        if (failure instanceof java.net.ConnectException) {
            return new ClassifiedFailure(
                    FailureKind.TRANSIENT,
                    true,
                    false,
                    "connection could not be established"
            );
        }

        if (failure instanceof IllegalArgumentException) {
            return new ClassifiedFailure(
                    FailureKind.PROGRAMMER_ERROR,
                    false,
                    false,
                    "invalid caller argument"
            );
        }

        return new ClassifiedFailure(
                FailureKind.PERMANENT,
                false,
                false,
                "unclassified failure is not retryable by default"
        );
    }
}

Safe defaults:

unknown failure => not retryable
unknown outcome => retry only with idempotency

10. HTTP Status Mapping for Retry

| Status | Retry? | Notes |
|---|---|---|
| 400 | No | Bad request |
| 401 | No | The auth challenge may need a token refresh, not a blind retry |
| 403 | No | Forbidden |
| 404 | Usually no | Except for eventual consistency or async provisioning |
| 409 | Maybe | A conflict can be retried under optimistic concurrency if the command is still valid |
| 408 | Maybe | Unknown outcome |
| 425 | Maybe later | Too early; needs protocol-specific handling |
| 429 | Yes with backoff | Respect throttling |
| 500 | Maybe | Only if transient/idempotent |
| 502 | Maybe | Upstream gateway issue |
| 503 | Maybe | Service unavailable; watch for overload |
| 504 | Maybe | Unknown outcome |

Do not equate an HTTP idempotent method with business idempotency. PUT and DELETE are defined as idempotent by HTTP semantics, but a business operation can still produce extra side effects if it is implemented badly, for example by sending an email every time PUT is called.
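
The mapping can be turned into a small decision function. The sketch below is deliberately conservative and simplifies the table: it treats every "Maybe" 4xx/5xx in the retryable group as requiring idempotency or dedupe, and it leaves 409 and 425 as non-retryable because they need domain- or protocol-specific handling. `RetryAdvice` and `HttpStatusRetryMapper` are hypothetical names.

```java
import java.util.Set;

// Hypothetical decision type for this sketch.
enum RetryAdvice { RETRY_WITH_BACKOFF, RETRY_IF_IDEMPOTENT, DO_NOT_RETRY }

final class HttpStatusRetryMapper {
    // Throttling: the request was rejected before processing, so a delayed retry is safe.
    private static final Set<Integer> THROTTLED = Set.of(429);
    // Possibly processed or unknown outcome: retry only when duplicates are harmless.
    private static final Set<Integer> NEEDS_IDEMPOTENCY = Set.of(408, 500, 502, 503, 504);

    static RetryAdvice adviceFor(int status) {
        if (THROTTLED.contains(status)) {
            return RetryAdvice.RETRY_WITH_BACKOFF;
        }
        if (NEEDS_IDEMPOTENCY.contains(status)) {
            return RetryAdvice.RETRY_IF_IDEMPOTENT;
        }
        // Everything else, including unknown statuses, is not retried by default.
        return RetryAdvice.DO_NOT_RETRY;
    }

    static boolean shouldRetry(int status, boolean idempotentOrDeduped) {
        switch (adviceFor(status)) {
            case RETRY_WITH_BACKOFF:
                return true;
            case RETRY_IF_IDEMPOTENT:
                return idempotentOrDeduped;
            default:
                return false;
        }
    }
}
```

A POST without an idempotency key that receives a 504 is therefore not retried, while the same request with a key is.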


11. Idempotency: Safety Net for Unknown Outcome

Idempotency does not mean "the response is always the same". It means the main state effect does not happen more than once for the same intent.

Idempotency Key

The client sends a stable key:

POST /payments
Idempotency-Key: 01J2A4P9Y6W7K3C8K9M4Q6R2ZS
Content-Type: application/json

{
  "invoiceId": "INV-2026-0001",
  "amount": 100000,
  "currency": "IDR"
}

The server stores:

| Field | Purpose |
|---|---|
| key | Identity of the intent |
| actor/client | Scopes the key so it does not collide across tenants |
| payload_hash | Detects key reuse with a different payload |
| status | IN_PROGRESS, SUCCEEDED, FAILED_RETRYABLE, FAILED_FINAL |
| response_snapshot | Replays the successful result |
| resource_id | Link to the created entity |
| created_at | TTL cleanup |
| expires_at | Dedupe window |
| locked_until | Prevents concurrent duplicate processing |

Dedupe Table

CREATE TABLE idempotency_record (
    scope              VARCHAR(128) NOT NULL,
    idempotency_key    VARCHAR(128) NOT NULL,
    payload_hash       VARCHAR(128) NOT NULL,
    status             VARCHAR(32)  NOT NULL,
    resource_type      VARCHAR(64),
    resource_id        VARCHAR(128),
    response_status    INTEGER,
    response_body      TEXT,
    created_at         TIMESTAMP NOT NULL,
    updated_at         TIMESTAMP NOT NULL,
    expires_at         TIMESTAMP NOT NULL,
    PRIMARY KEY (scope, idempotency_key)
);

Processing Flow
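
One plausible shape of the flow, inferred from the stored fields above, is: claim the key, run the business effect once, store the result, and replay it for duplicates. The sketch below uses an in-memory map as a stand-in for the dedupe table. All names are hypothetical, and the handling of in-progress duplicates (reject with a conflict) is an assumption rather than the only valid choice.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// In-memory stand-in for the dedupe table; a real system would use the database.
final class InMemoryIdempotencyFlow {
    enum Status { IN_PROGRESS, SUCCEEDED }

    record Entry(String payloadHash, Status status, String response) {}

    static final class ConflictException extends RuntimeException {
        ConflictException(String message) { super(message); }
    }

    private final Map<String, Entry> records = new ConcurrentHashMap<>();

    String handle(String scope, String key, String payloadHash, Supplier<String> businessEffect) {
        String id = scope + ":" + key;
        Entry claim = new Entry(payloadHash, Status.IN_PROGRESS, null);
        // Atomic claim: only one caller per scope+key gets null back.
        Entry existing = records.putIfAbsent(id, claim);

        if (existing == null) {
            try {
                String response = businessEffect.get();
                records.put(id, new Entry(payloadHash, Status.SUCCEEDED, response));
                return response;
            } catch (RuntimeException e) {
                // Release the claim so a later retry can run (assumes the effect rolled back).
                records.remove(id, claim);
                throw e;
            }
        }
        if (!existing.payloadHash().equals(payloadHash)) {
            throw new ConflictException("Idempotency key reused with different payload");
        }
        if (existing.status() == Status.IN_PROGRESS) {
            throw new ConflictException("Request with this key is still in progress");
        }
        // Duplicate of a completed request: replay the stored response.
        return existing.response();
    }
}
```

The payload check runs before the replay so that a reused key with a different body is never answered with someone else's result.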


12. Atomicity: Business State and Idempotency Record Must Be Consistent

Common bug:

1. create payment committed
2. application crashes before saving idempotency result
3. client retries
4. second payment created

Solution: store the idempotency record and the business effect in the same transaction whenever possible.

@Transactional
public PaymentResponse createPayment(CreatePaymentCommand command, IdempotencyKey key) {
    IdempotencyRecord record = idempotencyRepository.tryStart(
            command.scope(),
            key.value(),
            command.payloadHash()
    );

    // Check for key reuse first, so a mismatched payload never gets a replayed response.
    if (record.isPayloadMismatch()) {
        throw new IdempotencyConflictException("Idempotency key reused with different payload");
    }

    // Same key, same payload, already completed: return the stored result.
    if (record.isReplayable()) {
        return record.replayAs(PaymentResponse.class);
    }

    Payment payment = paymentRepository.save(Payment.create(command));

    PaymentResponse response = PaymentResponse.from(payment);

    // Runs in the same transaction as the payment insert.
    idempotencyRepository.markSucceeded(
            command.scope(),
            key.value(),
            payment.id(),
            response
    );

    return response;
}

Invariant

For one idempotency scope + key + payload hash,
there must be at most one committed business effect.

If the same database cannot be used for both, the design becomes harder. You then need an outbox, a saga, downstream dedupe, or a compensating action.


13. Idempotency Scope

A global key is rarely ideal. Use a scope.

| Scope | Example |
|---|---|
| Tenant | tenant-123 |
| Actor | user-456 |
| Client application | mobile-app |
| Business aggregate | invoice-789 |
| Operation | payment:create |

Example of a composite scope:

tenantId + clientId + operationName

Why does it matter?

User A and User B can generate the same UUID, even though the probability is small.
Internal and external clients must not overwrite each other.
The create-payment and create-refund operations must not share a key namespace.

14. Payload Hash

An idempotency key must not be reused for a different payload.

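import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class PayloadHasher {

    // Expects JSON that has already been canonicalized by the caller.
    public String sha256CanonicalJson(String canonicalJson) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(canonicalJson.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}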

Canonicalization matters. The following JSON documents are semantically the same, but the strings differ:

{"amount":100000,"currency":"IDR"}
{"currency":"IDR","amount":100000}

In production, do not hash raw JSON without canonicalization if the client can change the field order.
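
A minimal sketch of order-independent hashing follows. It assumes the payload has already been parsed into a flat map of string fields, which is a simplification: nested objects, numbers, and Unicode normalization need a real canonical JSON scheme. `CanonicalPayloadHash` and its encoding are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper: hashes a flat payload independent of field order.
final class CanonicalPayloadHash {

    // Sorts fields by name and length-prefixes each part to avoid ambiguous joins.
    static String canonicalize(Map<String, String> fields) {
        return new TreeMap<>(fields).entrySet().stream()
                .map(e -> e.getKey().length() + ":" + e.getKey()
                        + "=" + e.getValue().length() + ":" + e.getValue())
                .collect(Collectors.joining(";"));
    }

    static String sha256(Map<String, String> fields) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(canonicalize(fields).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Two payloads with the same fields in a different order now produce the same hash, while a changed amount produces a different one.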


15. Retry Inside a Transaction

Be careful with retries inside a transaction.

Bad

@Transactional
public void process() {
    retry.execute(() -> {
        repository.updateSomething();
        externalClient.call();
        return null;
    });
}

Problems:

  1. The transaction stays open during the external call.
  2. DB locks are held for the duration of the retries.
  3. If the external call succeeds and the transaction then rolls back, the external side effect has already happened.
  4. The retry can duplicate the side effect.

Safer

Separate the steps:

  1. Validate the command.
  2. Commit the intent/outbox.
  3. A worker processes the side effect idempotently.
  4. Update the state based on the result.

16. Retry for Optimistic Locking

An optimistic locking conflict can be retried, but not always.

Retry is allowed if:

  1. The operation is pure with respect to the latest state.
  2. The command is still valid after a reload.
  3. There is no external side effect before the commit.
  4. The retry count is low.
  5. The conflict really is transient.

Example:

public Case assignInvestigator(CaseId caseId, InvestigatorId investigatorId) {
    return retryOptimisticLock(() -> {
        Case caze = caseRepository.findById(caseId);
        caze.assignInvestigator(investigatorId);
        return caseRepository.save(caze);
    });
}

Do not retry blindly if the command depends on the old state:

"Approve if amount <= previous balance"

When the state has changed, the domain decision must be recomputed, not just the write repeated.
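
A bounded helper in the spirit of `retryOptimisticLock` could look like the sketch below. The exception type is a hypothetical stand-in for whatever the persistence layer throws on a version conflict, and the explicit `maxAttempts` parameter is an addition for this example. The supplier must reload the aggregate and recompute the decision on every attempt.

```java
import java.util.function.Supplier;

final class OptimisticRetry {

    // Hypothetical stand-in for the persistence layer's version-conflict exception.
    static final class OptimisticLockConflictException extends RuntimeException {
        OptimisticLockConflictException(String message) { super(message); }
    }

    // Runs reload + decide + save up to maxAttempts times; only conflicts are retried.
    static <T> T retryOptimisticLock(Supplier<T> reloadDecideAndSave, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        OptimisticLockConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each attempt starts from freshly loaded state inside the supplier.
                return reloadDecideAndSave.get();
            } catch (OptimisticLockConflictException conflict) {
                last = conflict;
            }
        }
        // All attempts conflicted: surface the last conflict to the caller.
        throw last;
    }
}
```

Any other exception, such as a domain rejection after the reload, propagates on the first occurrence and is never retried.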


17. Retry for Message Processing

In messaging, retries are often performed by the broker or the consumer framework.

| Failure | Action |
|---|---|
| Transient dependency failure | Retry with backoff |
| Invalid message schema | Reject/DLQ |
| Final business rejection | Ack + record rejected outcome |
| Unknown processing outcome | Idempotency/dedupe required |
| Poison message | DLQ after bounded attempts |
| Downstream overloaded | Pause consumer / backpressure |

Consumers must be idempotent because brokers generally provide at-least-once delivery.

public void handle(MessageEnvelope envelope) {
    if (processedMessageRepository.exists(envelope.messageId())) {
        return;
    }

    processBusinessCommand(envelope);

    processedMessageRepository.markProcessed(envelope.messageId());
}

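In production, marking the message as processed and applying the business effect must be atomic, or at least there must be a reconciliation path. The check-then-act sequence above lets two concurrent deliveries of the same message both pass the exists check.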


18. Retry Budget

A retry budget limits retries to a percentage of normal traffic.

Example:

Normal requests per minute: 10,000
Retry budget: 10%
Max retry attempts per minute: 1,000

Once retries exceed the budget, the system must stop retrying and fail fast.

public final class RetryBudget {
    private final AtomicLong primaryCalls = new AtomicLong();
    private final AtomicLong retryCalls = new AtomicLong();

    public void recordPrimary() {
        primaryCalls.incrementAndGet();
    }

    public boolean tryAcquireRetry() {
        long primary = Math.max(1, primaryCalls.get());
        long retry = retryCalls.get();

        if (retry * 10 >= primary) {
            return false;
        }

        retryCalls.incrementAndGet();
        return true;
    }
}

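This is a simplified example. It never resets its counters, and the check-then-increment in tryAcquireRetry is not atomic, so concurrent callers can overshoot the budget. A production implementation must be windowed, thread-safe, and integrated with metrics.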
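
A slightly more realistic variant is sketched below: a fixed-window budget guarded by `synchronized`, with an injectable clock so it can be tested. `WindowedRetryBudget`, the percent parameter, and the fixed window are assumptions made for this example. A sliding window or token bucket would smooth the burst that a fixed window allows at each boundary.

```java
import java.util.function.LongSupplier;

// Hypothetical fixed-window retry budget; counters reset every windowMillis.
final class WindowedRetryBudget {
    private final int retryPercent;
    private final long windowMillis;
    private final LongSupplier clockMillis;

    private long windowStart;
    private long primaryCalls;
    private long retryCalls;

    WindowedRetryBudget(int retryPercent, long windowMillis, LongSupplier clockMillis) {
        this.retryPercent = retryPercent;
        this.windowMillis = windowMillis;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    synchronized void recordPrimary() {
        rollWindowIfNeeded();
        primaryCalls++;
    }

    // Grants a retry only while retries stay within retryPercent of primary calls.
    synchronized boolean tryAcquireRetry() {
        rollWindowIfNeeded();
        if ((retryCalls + 1) * 100 > primaryCalls * retryPercent) {
            return false;
        }
        retryCalls++;
        return true;
    }

    // Starts a fresh window once the current one has elapsed.
    private void rollWindowIfNeeded() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;
            primaryCalls = 0;
            retryCalls = 0;
        }
    }
}
```

With a 10% budget and 20 primary calls in the window, two retries are granted and the third is refused. In a real service, the refusal should increment a metric so that budget exhaustion is visible.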


19. Observability for Retry/Timeout/Idempotency

Without telemetry, a retry policy is only a guess.

Minimal Metrics

| Metric | Type | Tags |
|---|---|---|
| dependency.calls | counter | dependency, operation, outcome |
| dependency.attempts | counter | dependency, operation, attempt_number |
| dependency.retries | counter | dependency, reason |
| dependency.timeouts | counter | dependency, timeout_type |
| dependency.latency | histogram/timer | dependency, operation |
| idempotency.records | counter | operation, status |
| idempotency.duplicates | counter | operation, outcome |
| retry.budget.remaining | gauge | dependency |
| retry.exhausted | counter | dependency, operation |

Logs

Good log:

{
  "event": "dependency_retry_scheduled",
  "dependency": "payment-service",
  "operation": "createPayment",
  "attempt": 2,
  "maxAttempts": 3,
  "delayMs": 180,
  "failureKind": "TIMEOUT_UNKNOWN_OUTCOME",
  "idempotencyKeyPresent": true,
  "correlationId": "corr-123",
  "traceId": "trace-abc"
}

Bad log:

Retrying...

Trace

Every retry attempt should be visible as a span event or a child span.

Span: POST /orders
  Event: dependency.attempt attempt=1
  Event: dependency.timeout timeout_type=request
  Event: dependency.retry_scheduled delay_ms=180
  Event: dependency.attempt attempt=2
  Event: dependency.success

Telemetry Invariant

You must be able to answer:

Of all requests that initially failed, what percentage was rescued by a retry?
How much extra latency did retries add?
How many duplicate requests did idempotency stop?
How many retries made overload worse?

20. Java HTTP Client Example with Timeout

An example using java.net.http.HttpClient.

HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofMillis(200))
        .build();

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://payment.example.internal/payments"))
        .timeout(Duration.ofMillis(500))
        .header("Idempotency-Key", idempotencyKey)
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();

HttpResponse<String> response = client.send(
        request,
        HttpResponse.BodyHandlers.ofString()
);

Points to watch:

  1. connectTimeout only covers establishing the connection.
  2. The request timeout limits how long the client waits for the response; when it expires, send throws HttpTimeoutException.
  3. You still have to design the total retry deadline yourself.
  4. HTTP statuses need to be classified.
  5. A timeout on the client does not automatically cancel the side effect on the server.
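
The two timeouts surface as different exceptions, and that difference feeds the retry decision. The sketch below maps them to the outcome categories used in this part. `HttpFailureOutcome` and `HttpClientFailureClassifier` are hypothetical names, and treating every other IOException as an unknown outcome is a conservative assumption.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.http.HttpConnectTimeoutException;
import java.net.http.HttpTimeoutException;

// Hypothetical outcome categories for failures thrown by HttpClient.send.
enum HttpFailureOutcome { NOT_CONNECTED, UNKNOWN_OUTCOME }

final class HttpClientFailureClassifier {

    static HttpFailureOutcome classify(IOException failure) {
        // Check the subclass first: HttpConnectTimeoutException extends HttpTimeoutException.
        if (failure instanceof HttpConnectTimeoutException || failure instanceof ConnectException) {
            // No connection was established, so the request was never delivered.
            return HttpFailureOutcome.NOT_CONNECTED;
        }
        if (failure instanceof HttpTimeoutException) {
            // The request may have been sent and processed; the response did not arrive in time.
            return HttpFailureOutcome.UNKNOWN_OUTCOME;
        }
        // Resets and other I/O errors: assume the server may have acted.
        return HttpFailureOutcome.UNKNOWN_OUTCOME;
    }
}
```

A NOT_CONNECTED failure can be retried without an idempotency key. An UNKNOWN_OUTCOME failure on a POST should only be retried with the same Idempotency-Key header.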

21. Policy Object

Instead of scattering configuration, create an explicit policy.

public record DependencyRetryPolicy(
        String dependencyName,
        int maxAttempts,
        Duration baseDelay,
        Duration maxDelay,
        Duration totalDeadline,
        boolean requiresIdempotencyForUnknownOutcome
) {
    public void validate() {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        if (baseDelay.isNegative() || baseDelay.isZero()) {
            throw new IllegalArgumentException("baseDelay must be positive");
        }
        if (maxDelay.compareTo(baseDelay) < 0) {
            throw new IllegalArgumentException("maxDelay must be >= baseDelay");
        }
    }
}

A policy must be reviewable like an API contract.


22. Failure Matrix

| Failure Kind | Retry | Idempotency Required | User Response |
|---|---|---|---|
| Validation error | No | No | 400/problem detail |
| Domain rejection | No | No | 409/422/problem detail |
| Connect timeout before request sent | Maybe | Usually no | Retry/backoff |
| Read timeout after request sent | Maybe | Yes | Retry with key or return unknown |
| 429 throttled | Yes | Depends | Backoff, honor retry-after |
| 503 overload | Maybe | Depends | Backoff, maybe shed |
| Duplicate request same key | No processing | Yes | Replay |
| Duplicate key different payload | No | Yes | 409 conflict |
| Serialization bug | No | No | Alert/fail fast |
| DB deadlock | Maybe | No external side effect | Retry transaction |

23. Common Anti-Patterns

23.1 Retry Everything

catch (Exception e) {
    return retry();
}

Problems:

  • Retries validation errors.
  • Retries programmer bugs.
  • Retries authorization failures.
  • Hides the root cause.
  • Makes overload worse.

23.2 Timeout Without Cancellation

The caller times out, but the background task keeps running.

Client gives up at 500ms
Server continues processing for 30s
Retry arrives
Duplicate processing starts

23.3 Retry Without Jitter

All instances retry at the same time.

23.4 Retry at Many Layers

Gateway retry + service retry + SDK retry + DB retry = amplification.

23.5 Idempotency Key Stored After the Side Effect

The crash window creates a duplicate effect.

23.6 Idempotency Key Without Payload Hash

A client can reuse a key for a different request.

23.7 Retry Using Thread Sleep on the Event Loop

In a reactive/event-loop architecture, a blocking sleep can destroy throughput.
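
One non-blocking alternative is to schedule the next attempt instead of sleeping. The sketch below uses CompletableFuture and a ScheduledExecutorService from the JDK. `NonBlockingRetry` and its parameters are hypothetical, and the sketch retries every failure for brevity, whereas real code should classify failures first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class NonBlockingRetry {

    // Retries an async operation with capped exponential backoff and full jitter,
    // without blocking any thread between attempts.
    static <T> CompletableFuture<T> retryAsync(
            Supplier<CompletableFuture<T>> operation,
            int maxAttempts,
            long baseDelayMillis,
            long maxDelayMillis,
            ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(operation, 1, maxAttempts, baseDelayMillis, maxDelayMillis, scheduler, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> operation,
            int attemptNo,
            int maxAttempts,
            long baseDelayMillis,
            long maxDelayMillis,
            ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        CompletableFuture<T> stage;
        try {
            stage = operation.get();
        } catch (RuntimeException e) {
            stage = CompletableFuture.failedFuture(e);
        }
        stage.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptNo >= maxAttempts) {
                result.completeExceptionally(error);
            } else {
                long cap = Math.min(maxDelayMillis, baseDelayMillis << Math.min(attemptNo - 1, 20));
                long delay = ThreadLocalRandom.current().nextLong(cap + 1);
                // Schedule the next attempt; the current thread is released immediately.
                scheduler.schedule(
                        () -> attempt(operation, attemptNo + 1, maxAttempts,
                                baseDelayMillis, maxDelayMillis, scheduler, result),
                        delay, TimeUnit.MILLISECONDS);
            }
        });
    }
}
```

The caller gets a single future that completes with the first success or with the last failure once the attempts are used up.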


24. Review Checklist

Before merging a feature that calls an external dependency:

[ ] Does every call have a timeout?
[ ] Are connect, request, and total deadline timeouts distinguished?
[ ] Have retryable failures been classified?
[ ] Are non-retryable failures left unretried?
[ ] Does the retry use capped exponential backoff + jitter?
[ ] Is max attempts clearly defined?
[ ] Is there a retry budget?
[ ] Is the unknown outcome protected by idempotency?
[ ] Does the idempotency key have a scope?
[ ] Is the payload hash checked?
[ ] Can a successful response be replayed?
[ ] Are in-progress duplicates handled?
[ ] Are the business effect and the idempotency record atomic?
[ ] Do logs/metrics/traces show attempts?
[ ] Can an alert detect a retry storm?

25. Practice Exercises

Exercise 1: Failure Classification

Pick one service you own. Build a table:

dependency | operation | failure | retry? | timeout? | idempotency? | fallback?

At least 20 failure cases.

Exercise 2: Unknown Outcome Simulation

Implement a POST /cases endpoint with an idempotency key.

Simulate:

  1. The request succeeds but the response is dropped.
  2. The client retries with the same key.
  3. The server replays the response.
  4. The client retries with the same key but a different payload.
  5. The server returns a conflict.

Exercise 3: Retry Storm

Build a simulation of 1000 clients that retry without jitter and with jitter. Compare the distribution of attempts per second.

Exercise 4: Telemetry

Add metrics:

case_create_attempts_total
case_create_retries_total
case_create_idempotency_duplicate_total
case_create_timeout_total

Build a simple dashboard that answers:

Do retries increase the success rate?
What is the latency cost of retries?

26. Top 1% Mental Model

An average engineer asks:

"How many times should we retry?"

A strong engineer asks:

"Which failures are retryable?"
"Is the outcome unknown?"
"Is the operation idempotent?"
"Who owns the retry?"
"Does the retry have a deadline?"
"Does the retry make overload worse?"
"How does telemetry prove that this policy is correct?"

Reliability is not adding retries. Reliability is controlling the feedback loop so that the system stays stable when part of it fails.


References

  • AWS Builders Library — Timeouts, retries, and backoff with jitter
  • AWS Builders Library — Making retries safe with idempotent APIs
  • AWS Well-Architected Reliability Pillar — Control and limit retry calls
  • Google SRE Book — Addressing Cascading Failures
  • RFC 9110 — HTTP Semantics
  • Resilience4j Documentation — Retry and TimeLimiter