
Learn Java Patterns Part 024 Resilience Patterns

title: Learn Java Patterns - Part 024
description: Resilience patterns for Java systems: timeout, deadline, retry, exponential backoff, jitter, circuit breaker, bulkhead, rate limiter, load shedding, fallback, graceful degradation, idempotency, observability, testing, and production anti-patterns.
series: learn-java-patterns
seriesTitle: Learn Java Patterns, Data Patterns, Pipeline Patterns, Concurrency Patterns, Common Patterns, and Anti-Patterns
order: 24
partTitle: Resilience Patterns
tags: java, patterns, resilience, reliability, distributed-systems, microservices, fault-tolerance, advanced-java
date: 2026-06-27

Part 024 — Resilience Patterns

Goal: be able to design Java services that stay under control when a dependency is slow, failing, or overloaded, and during partial outages, retry storms, traffic spikes, and resource exhaustion.

Resilience does not mean the system never fails.

Resilience means:

Failures are expected, bounded, observable, and recoverable.

Patterns such as timeout, retry, circuit breaker, bulkhead, rate limiter, and fallback are not decoration. They change the shape of a failure.

A misapplied pattern can make an outage worse:

a retry storm multiplies load,
an overly long timeout exhausts threads,
a circuit breaker with the wrong threshold slows recovery,
a silent fallback hides incorrect data,
a rate limiter in the wrong place penalizes important traffic,
an unbounded queue makes latency uncontrollable,
a bulkhead that is too small reduces availability.

This part treats resilience as the engineering of failure boundaries.

1. Kaufman Skill Slice

Sub-skills to practice:

Classify failures: transient, persistent, overload, partial, slow, corrupt, timeout, rejected.
Derive timeouts and deadlines from a latency budget.
Decide when a retry is safe based on idempotency and the error taxonomy.
Design retries with exponential backoff and jitter.
Design a circuit breaker with thresholds, states, and a recovery strategy.
Design bulkheads to isolate dependencies and resources.
Design rate limiters and load shedding for admission control.
Design fallbacks that do not hide correctness problems.
Combine patterns without dangerous side effects.
Observe resilience behavior with the right metrics.
Test slow dependencies, partial outages, retry storms, and recovery.
Know when a resilience pattern does not fit.

Learning target:

After this part, you should be able to look at a service's call graph and answer: what the timeout per dependency is, which errors may be retried, what the idempotency key is, what the circuit breaker threshold is, which resource each bulkhead protects, where rate limits apply, what the fallback is, and which metrics prove the system is not creating a retry storm.

2. Failure Taxonomy

Before choosing a pattern, classify the failure.

Failure Type	Example	Pattern Candidate
transient network error	connection reset	retry + backoff + jitter
downstream slow	p99 latency spike	timeout/deadline, circuit breaker
downstream overload	503, queue full	backoff, circuit breaker, load shedding
persistent bug	400/validation error	no retry
data conflict	optimistic lock failed	domain-specific retry or user resolution
partial response	one backend unavailable	fallback/degrade
resource saturation	thread pool exhausted	bulkhead, queue bound, load shedding
rate exceeded	429	respect retry-after, client-side rate limit
corrupt/invalid data	deserialization error	fail fast, quarantine, alert

Rule:

Do not retry what you do not understand.

3. Resilience Pattern Map

4. Timeout Pattern

4.1 Intent

A timeout limits how long the caller waits for an operation.

Without a timeout, a slow dependency can exhaust:

request threads,
virtual threads,
database connections,
HTTP connections,
heap,
queue capacity,
user patience,
upstream latency budget.

Rule:

Every remote call must have a timeout.

4.2 Timeout vs Deadline

A timeout is relative to a single operation.

Call payment service with timeout 300 ms.

A deadline is the absolute time limit for the end-to-end request.

Request must finish before 10:00:01.250Z.

A deadline is stronger because it prevents every layer from spending its full timeout.

4.3 Java HttpClient Timeout

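// Uses java.net.http (Java 11+). send(...) throws IOException and InterruptedException.
HttpClient client = HttpClient.newBuilder()
        .connectTimeout(Duration.ofMillis(200))   // limit on establishing the connection
        .build();

HttpRequest request = HttpRequest.newBuilder(uri)
        .timeout(Duration.ofMillis(500))          // limit on waiting for the response
        .GET()
        .build();

// An expired request timeout surfaces as java.net.http.HttpTimeoutException.
HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());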

Differentiate:

connect timeout,
request/response timeout,
read timeout,
pool acquisition timeout,
total deadline.

4.4 Deadline Object

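import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Note: TimeoutException here is assumed to be an application-defined unchecked exception.
// java.util.concurrent.TimeoutException is checked and would require a throws clause.
public final class Deadline {
    private final Instant expiresAt;
    private final Clock clock;

    private Deadline(Instant expiresAt, Clock clock) {
        this.expiresAt = expiresAt;
        this.clock = clock;
    }

    // Creates a deadline that expires `duration` after the clock's current instant.
    public static Deadline after(Duration duration, Clock clock) {
        return new Deadline(clock.instant().plus(duration), clock);
    }

    // Time left before expiry, never negative.
    public Duration remaining() {
        Duration remaining = Duration.between(clock.instant(), expiresAt);
        return remaining.isNegative() ? Duration.ZERO : remaining;
    }

    public boolean expired() {
        return !clock.instant().isBefore(expiresAt);
    }

    public void throwIfExpired() {
        if (expired()) {
            throw new TimeoutException("Deadline expired");
        }
    }
}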
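A deadline becomes useful when each outgoing call derives its timeout from the time that is left. The following is a minimal sketch of that idea, not the lesson's own implementation: `DeadlineTimeouts`, `perAttemptTimeout`, and `requestWithin` are hypothetical names, and the caller is assumed to pass the current instant and the deadline's expiry explicitly.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;

final class DeadlineTimeouts {
    private DeadlineTimeouts() {}

    /** Per-attempt timeout: the smaller of the configured cap and the time left. */
    static Duration perAttemptTimeout(Instant now, Instant expiresAt, Duration cap) {
        Duration remaining = Duration.between(now, expiresAt);
        if (remaining.isNegative() || remaining.isZero()) {
            // HttpRequest.Builder.timeout rejects non-positive durations,
            // so fail here instead of sending a request that cannot finish in time.
            throw new IllegalStateException("Deadline expired");
        }
        return remaining.compareTo(cap) < 0 ? remaining : cap;
    }

    /** Builds a GET request whose timeout never exceeds the remaining budget. */
    static HttpRequest requestWithin(URI uri, Instant now, Instant expiresAt, Duration cap) {
        return HttpRequest.newBuilder(uri)
                .timeout(perAttemptTimeout(now, expiresAt, cap))
                .GET()
                .build();
    }
}
```

With 900 ms left and a 500 ms cap the attempt gets 500 ms; with only 120 ms left it gets 120 ms, so a late attempt cannot overrun the request deadline.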

4.5 Timeout Selection

Timeouts must not be chosen arbitrarily.

Consider:

user-facing latency budget,
downstream latency distribution,
network distance,
retry budget,
connection setup cost,
cold start/warmup,
false timeout rate,
workload type.

Example:

End-to-end API budget: 1000 ms
API overhead: 100 ms
Service A local work: 100 ms
Dependency B budget: 300 ms
Dependency C budget: 300 ms
Retry reserve: 200 ms

5. Cancellation Pattern

A timeout without cancellation only stops the waiting; the work itself may keep running.

In Java:

Future.cancel(true) attempts to interrupt the thread running the task,
CompletableFuture.cancel does not necessarily stop the underlying work,
blocking operations on virtual threads are generally more cancellation-friendly when they are interruptible,
structured concurrency helps propagate cancellation to child tasks.

Pattern:

public interface CancellationToken {
    boolean cancelled();
    void throwIfCancelled();
}

Simple implementation:

public final class AtomicCancellationToken implements CancellationToken {
    private final AtomicBoolean cancelled = new AtomicBoolean();

    public void cancel() {
        cancelled.set(true);
    }

    @Override
    public boolean cancelled() {
        return cancelled.get();
    }

    @Override
    public void throwIfCancelled() {
        if (cancelled()) {
            throw new CancellationException();
        }
    }
}

Long-running tasks must poll cancellation at safe points.

for (Chunk chunk : chunks) {
    token.throwIfCancelled();
    process(chunk);
}
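Token polling covers cooperative loops; blocking calls rely on interruption instead. The sketch below, with the hypothetical name `InterruptCancellationDemo`, shows `Future.cancel(true)` interrupting a worker that is blocked in an interruptible call. `Thread.sleep` stands in for slow I/O, which is an assumption for illustration only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class InterruptCancellationDemo {
    private InterruptCancellationDemo() {}

    /** Returns true if the worker observed the interrupt sent by Future.cancel(true). */
    static boolean cancelAndObserve() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch stopped = new CountDownLatch(1);
        try {
            Future<?> future = executor.submit(() -> {
                started.countDown();
                try {
                    // Interruptible blocking call standing in for a slow dependency.
                    Thread.sleep(60_000);
                } catch (InterruptedException e) {
                    stopped.countDown();
                    // Restore the flag so code further up can also see the interrupt.
                    Thread.currentThread().interrupt();
                }
            });
            started.await();          // make sure the task is actually running
            future.cancel(true);      // true = interrupt the running thread
            return stopped.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Work that neither polls a token nor blocks in an interruptible call would keep running after the cancel, which is the situation the section warns about.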

6. Retry Pattern

6.1 Intent

Retry repeats a failed operation because the failure may be transient.

Retry is safe if:

the operation is idempotent, or
duplicate side effects are prevented, or
the operation never reached the server, or
the domain has conflict resolution.

Retry is not safe if:

the command causes an irreversible side effect,
there is no idempotency key,
the error is a validation or business error,
the downstream is overloaded and retries add to the load,
a timeout leaves the commit state unknown.

6.2 Retry Error Taxonomy

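import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public enum RetryDecision {
    RETRY,
    DO_NOT_RETRY,
    UNKNOWN
}

public final class RetryClassifier {
    public RetryDecision classify(Throwable error) {
        // Supplier-based call sites cannot throw checked IOExceptions directly,
        // so look through an UncheckedIOException wrapper when one is present.
        Throwable cause = (error instanceof UncheckedIOException) ? error.getCause() : error;
        if (cause instanceof SocketTimeoutException) return RetryDecision.RETRY;
        if (cause instanceof ConnectException) return RetryDecision.RETRY;
        if (cause instanceof IllegalArgumentException) return RetryDecision.DO_NOT_RETRY;
        // BusinessRuleViolation is an application-defined exception.
        if (cause instanceof BusinessRuleViolation) return RetryDecision.DO_NOT_RETRY;
        // Anything unclassified is not retried by callers that require RETRY explicitly.
        return RetryDecision.UNKNOWN;
    }
}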

HTTP guideline:

Status	Retry?	Notes
400	no	bad request
401/403	no	auth/authz problem
404	usually no	unless eventual consistency expected
409	domain-specific	may retry with reload/rebase
408	maybe	timeout
429	yes, if allowed	respect `Retry-After`
500	maybe	depends on idempotency
502/503/504	often yes	with backoff and jitter
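The guideline table can be encoded as a small classifier. This sketch is one conservative reading of the table, not a rule from any library: it treats every "maybe" and "often yes" row as retryable only when the request is idempotent, and `HttpRetryPolicy` and its `Decision` enum are hypothetical names.

```java
final class HttpRetryPolicy {
    private HttpRetryPolicy() {}

    enum Decision { RETRY, DO_NOT_RETRY, DOMAIN_SPECIFIC }

    /**
     * @param status     HTTP status code of the failed response
     * @param idempotent whether repeating the request is safe (read, or command with idempotency key)
     */
    static Decision classify(int status, boolean idempotent) {
        switch (status) {
            case 429:
                // The server asked the client to slow down; retry later and honor Retry-After.
                return Decision.RETRY;
            case 408:
            case 500:
            case 502:
            case 503:
            case 504:
                // The outcome may be unknown, so only repeat requests that are safe to repeat.
                return idempotent ? Decision.RETRY : Decision.DO_NOT_RETRY;
            case 409:
                // Conflict: the domain decides (reload and rebase, or ask the user).
                return Decision.DOMAIN_SPECIFIC;
            default:
                // 400, 401, 403, 404 and anything unclassified: do not retry.
                return Decision.DO_NOT_RETRY;
        }
    }
}
```

A `RETRY` decision still has to pass through backoff, jitter, and the request deadline before another attempt is made.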

6.3 Retry with Deadline

Retry should consume a total budget, not reset time each attempt.

public <T> T retryWithinDeadline(
        Supplier<T> operation,
        Deadline deadline,
        int maxAttempts,
        RetryClassifier classifier
) {
    Throwable last = null;

    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        deadline.throwIfExpired();

        try {
            return operation.get();
        } catch (Throwable error) {
            last = error;
            if (classifier.classify(error) != RetryDecision.RETRY) {
                throw error;
            }
            if (attempt == maxAttempts) {
                throw error;
            }
            sleep(backoffWithJitter(attempt, deadline.remaining()));
        }
    }

    throw new IllegalStateException("unreachable", last);
}

7. Backoff and Jitter Pattern

7.1 Intent

Backoff delays retries so the downstream has time to recover. Jitter adds randomization so that clients do not retry in lockstep.

Without jitter:

1000 clients fail at T0
1000 clients retry at T0 + 100ms
1000 clients retry at T0 + 200ms
1000 clients retry at T0 + 400ms

With jitter, the retries spread out.

7.2 Exponential Backoff with Jitter

public Duration backoffWithJitter(int attempt, Duration maxAllowed) {
    long baseMillis = 50L;
    long capMillis = Math.min(1_000L, Math.max(1L, maxAllowed.toMillis()));
    long exponential = Math.min(capMillis, baseMillis * (1L << Math.min(attempt - 1, 10)));
    long jittered = ThreadLocalRandom.current().nextLong(0, exponential + 1);
    return Duration.ofMillis(jittered);
}

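This is the “full jitter” strategy: each delay is drawn uniformly between 0 and the capped exponential value, instead of sleeping for a deterministic interval.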

7.3 Retry Budget

Retry budget limits total additional traffic caused by retries.

Example policy:

Max attempts: 3
Max total elapsed: 800 ms
Retryable errors: connect timeout, 502, 503, 504
Do not retry: 400, 401, 403, 409, business validation
Jitter: full jitter
Idempotency required: yes for POST/command
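The policy above bounds retries per call. A budget in the traffic sense also bounds retries across calls, for example as a ratio of retries to original requests. The sketch below is a simplified illustration under that assumption: `RetryBudget` is a hypothetical class, and it uses lifetime counters where a production limiter would use a sliding time window.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RetryBudget {
    private final double maxRetryRatio;
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    /** @param maxRetryRatio allowed retries per original request, e.g. 0.1 for 10% extra traffic */
    RetryBudget(double maxRetryRatio) {
        this.maxRetryRatio = maxRetryRatio;
    }

    /** Call once for every original (non-retry) request. */
    void recordRequest() {
        requests.incrementAndGet();
    }

    /** Returns true and consumes budget if one more retry keeps the ratio within bounds. */
    boolean tryAcquireRetry() {
        while (true) {
            long used = retries.get();
            if (used + 1 > maxRetryRatio * requests.get()) {
                return false; // budget exhausted: fail instead of adding load
            }
            if (retries.compareAndSet(used, used + 1)) {
                return true;
            }
        }
    }
}
```

When the downstream degrades and most calls want to retry, the budget caps the extra traffic at the configured ratio instead of letting it multiply.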

8. Idempotency Pattern

8.1 Intent

Idempotency ensures repeated attempts have the same effect as one attempt.

For command APIs, use idempotency key.

record CommandIdempotencyKey(
        String tenantId,
        UUID actorId,
        String operation,
        UUID clientRequestId
) {}

8.2 Idempotency Record

record IdempotencyRecord(
        CommandIdempotencyKey key,
        String requestHash,
        String responseBody,
        CommandStatus status,
        Instant createdAt
) {}

Command flow:
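A minimal in-memory sketch of such a flow, under stated assumptions: look up the key, execute the handler only on first sight, store the response, replay it for duplicates, and reject a reused key whose request hash differs. `IdempotentCommandExecutor` is a hypothetical name, the key is flattened to a string, and a real system would persist the record in the same transaction as the command's side effects.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdempotentCommandExecutor {
    /** Outcome of the first execution for a key. */
    record Stored(String requestHash, String responseBody) {}

    private final Map<String, Stored> store = new ConcurrentHashMap<>();

    String execute(String idempotencyKey, String requestHash, Supplier<String> handler) {
        // computeIfAbsent runs the handler at most once per key, even under concurrency.
        // If the handler throws, nothing is stored and a later attempt may run it again.
        Stored stored = store.computeIfAbsent(
                idempotencyKey,
                key -> new Stored(requestHash, handler.get()));

        if (!stored.requestHash().equals(requestHash)) {
            // Same key, different payload: a client bug, never a legitimate retry.
            throw new IllegalStateException("Idempotency key reused with a different request");
        }
        // First call and every duplicate return the same stored response.
        return stored.responseBody();
    }
}
```

A client that timed out can resend the command with the same key and receives the original response instead of triggering the side effect twice.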

8.3 Unknown Outcome

Timeout after sending command is dangerous.

The server may have committed, but client did not receive response.

Idempotency key lets retry discover prior result.

Rule:

Retrying commands without idempotency is gambling with side effects.

9. Circuit Breaker Pattern

9.1 Intent

Circuit breaker stops calls to a dependency that is likely failing, allowing fast failure and recovery probes.

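State model: CLOSED (calls pass through and outcomes are recorded), OPEN (calls are rejected immediately), HALF_OPEN (a limited number of probe calls decide whether to close again or reopen). The breaker moves from CLOSED to OPEN when the failure threshold is crossed, from OPEN to HALF_OPEN after the wait duration, and from HALF_OPEN back to CLOSED or OPEN depending on the probe results.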
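The closed, open, and half-open transitions can be sketched as a tiny state machine. This is an illustration only: `SimpleCircuitBreaker` and its injected `nanoClock` are hypothetical, it opens on consecutive failures where Resilience4j uses a sliding-window failure rate, and it does not limit the number of concurrent half-open probes.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openNanos;
    private final LongSupplier nanoClock;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, Duration openDuration, LongSupplier nanoClock) {
        this.failureThreshold = failureThreshold;
        this.openNanos = openDuration.toNanos();
        this.nanoClock = nanoClock;
    }

    synchronized State state() {
        if (state == State.OPEN && nanoClock.getAsLong() - openedAt >= openNanos) {
            state = State.HALF_OPEN; // cooling period is over: allow a probe
        }
        return state;
    }

    <T> T call(Supplier<T> operation) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("Circuit open"); // fail fast, dependency is not called
        }
        try {
            T result = operation.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // a successful probe closes the breaker
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // a failed probe or too many failures (re)opens it
            openedAt = nanoClock.getAsLong();
        }
    }
}
```

Injecting the clock keeps the transitions testable without sleeping through the open duration.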

9.2 Why It Exists

Without circuit breaker:

every request waits for timeout,
caller resources saturate,
downstream receives more pressure,
upstream queue grows,
user sees latency spike.

With circuit breaker:

calls fail fast while open,
dependency gets time to recover,
caller resources are preserved,
recovery is probed carefully.

9.3 Resilience4j Example

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .slowCallRateThreshold(50)
        .slowCallDurationThreshold(Duration.ofMillis(500))
        .minimumNumberOfCalls(20)
        .slidingWindowSize(50)
        .waitDurationInOpenState(Duration.ofSeconds(10))
        .permittedNumberOfCallsInHalfOpenState(5)
        .build();

CircuitBreaker breaker = CircuitBreaker.of("case-search", config);

Supplier<SearchResult> decorated = CircuitBreaker
        .decorateSupplier(breaker, () -> searchClient.search(query));

SearchResult result = Try.ofSupplier(decorated)
        .recover(CallNotPermittedException.class, error -> fallbackSearch(query))
        .get();

9.4 Circuit Breaker Tuning

Important parameters:

Parameter	Meaning
failure rate threshold	when closed becomes open
slow call threshold	slow calls considered unhealthy
sliding window	sample size/time window
minimum calls	avoid opening on tiny sample
open wait duration	cooling period
half-open permitted calls	recovery probes

9.5 Failure Modes

Failure	Cause	Mitigation
flapping	thresholds too sensitive	larger window, better minimum calls
hidden outage	fallback masks all errors	alert on breaker open/fallback count
slow recovery	open duration too long	tune half-open probes
overload in half-open	too many probes	limit permitted calls
one breaker for many ops	mixed health signals	breaker per dependency/operation class

Rule:

Circuit breaker is not an error handler. It is a load and failure isolation mechanism.

10. Bulkhead Pattern

10.1 Intent

Bulkhead isolates resource pools so one dependency/workload cannot consume all capacity.

Ship compartments inspire the name: flooding one compartment should not sink the whole ship.

10.2 Java Semaphore Bulkhead

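import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public final class SemaphoreBulkhead {
    private final Semaphore permits;

    public SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    public <T> T execute(Supplier<T> supplier) {
        // tryAcquire never blocks: a full bulkhead rejects immediately instead of queueing.
        boolean acquired = permits.tryAcquire();
        if (!acquired) {
            throw new RejectedExecutionException("Bulkhead full");
        }
        try {
            return supplier.get();
        } finally {
            permits.release();
        }
    }

    // Exposed for metrics and tests (used by the bulkhead test in section 22.5).
    public int availablePermits() {
        return permits.availablePermits();
    }
}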

10.3 Thread Pool Bulkhead

Use separate executor per dependency/workload if blocking or isolation is needed.

ExecutorService searchExecutor = Executors.newFixedThreadPool(20);
ExecutorService notificationExecutor = Executors.newFixedThreadPool(5);

With virtual threads, do not assume thread count is the only resource. You still need bulkheads for:

DB connections,
HTTP connection pools,
downstream QPS,
memory,
CPU-heavy work,
tenant fairness.

10.4 Bulkhead Failure Modes

Failure	Cause	Mitigation
global pool starvation	all work shares same executor	separate pools/limits
unbounded queue	queue absorbs infinite requests	bounded queue + rejection
wrong rejection handling	caller retries immediately	backoff/load shedding
too small bulkhead	availability drops	capacity test/tune
too large bulkhead	downstream overload	align with dependency capacity

11. Rate Limiter Pattern

11.1 Intent

Rate limiter controls number of operations per time window.

Used for:

protecting downstream,
enforcing tenant limits,
controlling expensive operations,
smoothing spikes,
complying with external API quotas.

11.2 Token Bucket Mental Model

bucket has capacity N
tokens refill at rate R
each request consumes token
if no token, reject/wait/degrade

Simple Java sketch:

public final class SimpleTokenBucket {
    private final long capacity;
    private final long refillPerSecond;
    private long tokens;
    private long lastRefillNanos;

    public SimpleTokenBucket(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        refill();
        if (tokens <= 0) {
            return false;
        }
        tokens--;
        return true;
    }

    private void refill() {
        long now = System.nanoTime();
        long elapsed = now - lastRefillNanos;
        long add = elapsed * refillPerSecond / 1_000_000_000L;
        if (add > 0) {
            tokens = Math.min(capacity, tokens + add);
            lastRefillNanos = now;
        }
    }
}

Production code should use battle-tested libraries, but this mental model matters.

11.3 Rate Limit Dimensions

global,
per tenant,
per user,
per API key,
per operation,
per dependency,
per priority class.

Do not use one global limiter for all traffic if priority matters.

12. Load Shedding Pattern

12.1 Intent

Reject lower-value work when system is overloaded, before collapse.

Load shedding is not failure. It is controlled refusal.

Better to reject early than accept work that will timeout later.

Signals:

queue depth,
CPU saturation,
heap pressure,
GC pause,
DB pool saturation,
p99 latency,
downstream breaker open,
request deadline already too short.

12.2 Example

public void ensureAdmissible(RequestContext context) {
    if (context.deadline().remaining().compareTo(Duration.ofMillis(50)) < 0) {
        throw new RejectedExecutionException("Insufficient deadline remaining");
    }

    if (dbPoolMetrics.pendingAcquire() > 100) {
        throw new ServiceUnavailableException("Database pool saturated");
    }
}

12.3 Priority-Aware Shedding

In regulatory systems:

emergency action > background report,
command path > dashboard refresh,
statutory deadline task > notification digest,
human interactive request > batch backfill.

Do not shed blindly.

enum Priority {
    CRITICAL_COMMAND,
    INTERACTIVE_READ,
    BACKGROUND_TASK,
    BEST_EFFORT
}
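One way to use such a priority enum is to give each class its own admission threshold, so lower-value work is shed first as load rises. The sketch below is illustrative only: `PriorityShedder` is a hypothetical name, the thresholds are invented numbers, and `utilization` is assumed to be a 0.0 to 1.0 measure such as pool or queue occupancy.

```java
final class PriorityShedder {
    private PriorityShedder() {}

    enum Priority { CRITICAL_COMMAND, INTERACTIVE_READ, BACKGROUND_TASK, BEST_EFFORT }

    /**
     * Decides whether a request of the given priority is admitted at the current load.
     * Thresholds are placeholders; real values come from capacity tests.
     */
    static boolean admit(Priority priority, double utilization) {
        switch (priority) {
            case BEST_EFFORT:
                return utilization < 0.70; // shed first
            case BACKGROUND_TASK:
                return utilization < 0.80;
            case INTERACTIVE_READ:
                return utilization < 0.90;
            default:
                // CRITICAL_COMMAND: shed last, only at full saturation.
                return utilization < 1.0;
        }
    }
}
```

At 75% utilization this admits interactive reads and critical commands while already refusing best-effort work.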

13. Fallback Pattern

13.1 Intent

Provide alternative response when primary path fails.

Fallback examples:

cached/stale data,
partial response,
default configuration,
empty recommendation list,
degraded search,
manual review queue,
async accepted response.

13.2 Safe vs Unsafe Fallback

Scenario	Safe Fallback?	Notes
dashboard chart unavailable	yes	show partial UI
notification provider down	yes	enqueue retry
authorization service down	often no	fail closed or emergency policy
regulatory deadline calculation fails	no	cannot invent deadline
case search down	maybe	fallback to DB basic search
fraud risk unavailable	maybe	route to manual review

13.3 Java Example

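public CaseSummary getCaseSummary(UUID caseId) {
    try {
        return primarySummaryClient.fetch(caseId);
    } catch (TimeoutException | CallNotPermittedException error) {
        // CallNotPermittedException is what Resilience4j throws while the breaker is open.
        // getIfPresent is assumed to return null on a miss (as in Caffeine or Guava caches).
        return Optional.ofNullable(cache.getIfPresent(caseId))
                .map(summary -> summary.markStale("summary-service-unavailable"))
                .orElseThrow(() -> new ServiceUnavailableException("Case summary unavailable", error));
    }
}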

13.4 Fallback Must Be Observable

A fallback that users cannot see and operators cannot measure becomes silent data corruption.

Minimum:

metric: fallback count by reason,
log: dependency, request id, fallback type,
trace tag: degraded=true,
response metadata if user-visible stale/partial data.

14. Graceful Degradation Pattern

14.1 Intent

System reduces non-critical functionality while preserving critical path.

Example case platform:

Component	Degraded Behavior
audit timeline enrichment	show raw events
search ranking	basic exact search
notification digest	delay digest
dashboard aggregate	show stale snapshot
export service	queue job instead of synchronous export
external registry lookup	mark pending verification

14.2 Design

Degradation must be product/business-approved, not invented during incident.

15. Hedged Request Pattern

15.1 Intent

Send duplicate request after delay to reduce tail latency.

send request to replica A
if no response after p95 delay, send hedge to replica B
use first successful response
cancel loser
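The four steps can be sketched with `CompletableFuture`. This is a simplified illustration, not a production hedging client: `HedgedRequest` is a hypothetical name, the two suppliers stand in for calls to replica A and replica B, and the hedge delay would normally come from a measured p95.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class HedgedRequest {
    private HedgedRequest() {}

    static <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> primary,
                                         Supplier<CompletableFuture<T>> hedge,
                                         Duration hedgeDelay) {
        CompletableFuture<T> result = new CompletableFuture<>();
        List<CompletableFuture<T>> attempts = new CopyOnWriteArrayList<>();
        AtomicInteger failures = new AtomicInteger();

        track(primary.get(), attempts, failures, result);
        // Launch the hedge only if the primary has not produced a result by then.
        CompletableFuture.delayedExecutor(hedgeDelay.toMillis(), TimeUnit.MILLISECONDS)
                .execute(() -> {
                    if (!result.isDone()) {
                        track(hedge.get(), attempts, failures, result);
                    }
                });
        return result;
    }

    private static <T> void track(CompletableFuture<T> attempt,
                                  List<CompletableFuture<T>> attempts,
                                  AtomicInteger failures,
                                  CompletableFuture<T> result) {
        attempts.add(attempt);
        attempt.whenComplete((value, error) -> {
            if (error == null) {
                // Cancel the loser first, then publish the winner's value.
                for (CompletableFuture<T> other : attempts) {
                    if (other != attempt) {
                        other.cancel(true);
                    }
                }
                result.complete(value);
            } else if (failures.incrementAndGet() == 2) {
                // Both attempts ended without a value: surface the last error.
                result.completeExceptionally(error);
            }
        });
    }
}
```

Note that cancelling a `CompletableFuture` only marks it as cancelled; stopping the underlying I/O still depends on the client supporting cancellation, which is one of the preconditions for hedging.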

15.2 When It Fits

read-only/idempotent requests,
multiple equivalent replicas,
tail latency dominates,
extra load acceptable,
cancellation possible.

15.3 When It Is Dangerous

write commands,
overloaded downstream,
expensive operations,
no cancellation,
duplicate side effects.

Hedging is advanced. Use only after measuring tail latency.

16. Retry + Circuit Breaker + Timeout Composition

Composition is tricky.

Bad:

retry 3 times, each with 5s timeout, inside 1s API budget

Good:

one request deadline controls all attempts
per-attempt timeout derived from remaining budget
retry count bounded
backoff jittered
circuit breaker observes actual failures/slow calls
bulkhead protects resource

16.1 Conceptual Order

There is no universal order. The right nesting depends on what each layer should observe and bound.

Important nuance:

If breaker wraps all retries, it sees final outcome only.
If breaker wraps each attempt, it sees per-attempt failures.
If retry wraps breaker, open breaker may be retried pointlessly unless classified non-retryable.
If timeout wraps whole retry block, individual attempts may run too long unless also bounded.

Define semantics explicitly.
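The first two nuances can be made concrete with plain decorators. In this sketch a failure counter stands in for the breaker's accounting, so the effect of the wrapping order is visible; `CompositionOrder`, `withRetry`, and `withFailureCounter` are hypothetical helpers, not Resilience4j APIs.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class CompositionOrder {
    private CompositionOrder() {}

    /** Retries up to maxAttempts times and rethrows the last failure. */
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e; // no backoff here: the sketch only shows ordering
                }
            }
            throw last;
        };
    }

    /** Stand-in for a breaker: counts every failure that passes through this layer. */
    static <T> Supplier<T> withFailureCounter(Supplier<T> inner, AtomicInteger observed) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                observed.incrementAndGet();
                throw e;
            }
        };
    }
}
```

With an always-failing call and three attempts, a counter wrapped around the retry observes one failure, while a counter wrapped inside the retry observes three. A breaker's thresholds therefore mean different things depending on where it sits.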

16.2 Recommended Reasoning

For each dependency:

Decide total call budget.
Decide idempotency and retryable errors.
Decide max attempts and backoff.
Decide concurrency limit.
Decide circuit breaker thresholds.
Decide fallback.
Decide observability.

17. Resilience4j Decorator Example

Supplier<Response> supplier = () -> client.call(request);

Supplier<Response> decorated = Decorators.ofSupplier(supplier)
        .withBulkhead(bulkhead)
        .withCircuitBreaker(circuitBreaker)
        .withRetry(retry)
        .withFallback(List.of(CallNotPermittedException.class, TimeoutException.class),
                error -> fallbackResponse(request, error))
        .decorate();

Response response = decorated.get();

The exact order must be reviewed for semantics. The point is not “use all decorators”. The point is to encode a dependency policy.

Policy object:

record DependencyPolicy(
        String name,
        Duration totalBudget,
        int maxAttempts,
        Duration slowCallThreshold,
        int maxConcurrentCalls,
        boolean fallbackAllowed
) {}

18. Dependency Policy Catalog

Create one policy per dependency/operation class.

# Dependency Policy: external-registry.lookupCompany

## Purpose
Lookup legal company metadata.

## Criticality
Interactive read support. Not command-authoritative.

## Timeout
- connect: 200ms
- per attempt: 500ms
- total deadline: 900ms

## Retry
- max attempts: 2
- retryable: connect timeout, 502, 503, 504
- non-retryable: 400, 401, 403, 404
- backoff: full jitter, cap 200ms
- idempotency: read-only

## Circuit Breaker
- sliding window: 100 calls
- failure threshold: 50%
- slow call threshold: 50% over 500ms
- open duration: 10s
- half-open probes: 5

## Bulkhead
- max concurrent calls: 25
- queue: none

## Fallback
- stale cache up to 24h
- mark response as unverified

## Observability
- latency histogram
- timeout count
- retry count
- breaker state
- fallback count
- stale age

This catalog prevents resilience settings from being magic numbers hidden in annotations.

19. Queue Bound Pattern

Queues create decoupling, but unbounded queues create invisible failure.

Bad:

ExecutorService executor = Executors.newFixedThreadPool(10);

newFixedThreadPool uses an unbounded queue. Under overload, latency and memory can grow until collapse.

Better: explicit bounded queue.

ThreadPoolExecutor executor = new ThreadPoolExecutor(
        10,
        10,
        0L,
        TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(100),
        new ThreadPoolExecutor.AbortPolicy()
);

With virtual threads, queueing can still happen at:

DB pool,
HTTP pool,
semaphore bulkhead,
message broker,
downstream service,
CPU scheduler.

Rule:

Every queue must have a bound, owner, metric, and rejection behavior.

20. Health Check and Readiness Pattern

20.1 Liveness vs Readiness

Liveness: should process be restarted?
Readiness: should traffic be sent here?

Do not put every downstream dependency in liveness. A temporary downstream outage should not always restart your service.

Readiness should reflect whether this instance can serve its assigned traffic.

20.2 Dependency Health

Health checks should avoid becoming DDoS tools.

Bad:

Every instance hits every dependency every second with expensive query.

Better:

cheap checks,
cached health status,
backoff,
separate critical and optional dependencies,
circuit breaker metrics integrated.
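Caching the health status is the simplest of these measures. The sketch below assumes a hypothetical `CachedHealthCheck` that wraps an expensive probe and re-runs it at most once per TTL; the clock is injected as a nanosecond supplier so the behavior can be tested without waiting.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

final class CachedHealthCheck {
    private final BooleanSupplier probe;
    private final long ttlNanos;
    private final LongSupplier nanoClock;
    private boolean hasValue;
    private boolean lastResult;
    private long checkedAt;

    CachedHealthCheck(BooleanSupplier probe, Duration ttl, LongSupplier nanoClock) {
        this.probe = probe;
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    /** Returns the cached status, calling the real probe only when the cache is stale. */
    synchronized boolean isHealthy() {
        long now = nanoClock.getAsLong();
        if (!hasValue || now - checkedAt >= ttlNanos) {
            lastResult = probe.getAsBoolean(); // the only place the dependency is touched
            checkedAt = now;
            hasValue = true;
        }
        return lastResult;
    }
}
```

In production the clock would typically be `System::nanoTime`, and the TTL bounds how often each instance touches the dependency regardless of how frequently the orchestrator polls the endpoint.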

21. Resilience Observability

Minimum metrics per dependency:

Metric	Why
request count	load baseline
success/failure count	health
latency histogram	timeout tuning
timeout count	slowness
retry attempts	retry storm detection
retry success after retry	retry usefulness
circuit breaker state	failure isolation
rejected by bulkhead	saturation
rate limited count	admission pressure
fallback count	degraded mode
queue depth	overload early warning
pending duration	hidden latency

Trace tags:

dependency=external-registry
attempt=2
retry=true
circuitBreaker=closed
bulkheadRejected=false
fallback=stale-cache
deadlineRemainingMs=132

Structured log example:

log.warn("dependency_call_degraded dependency={} operation={} reason={} fallback={} requestId={} tenantId={}",
        "external-registry",
        "lookupCompany",
        error.getClass().getSimpleName(),
        "stale-cache",
        context.requestId(),
        context.tenantId());

Do not log sensitive payloads.

22. Testing Resilience Patterns

22.1 Fake Slow Dependency

public final class FakeRegistryClient implements RegistryClient {
    private Duration delay = Duration.ZERO;
    private RuntimeException error;

    public void delay(Duration delay) {
        this.delay = delay;
    }

    public void failWith(RuntimeException error) {
        this.error = error;
    }

    @Override
    public Company lookup(String id) {
        sleep(delay);
        if (error != null) throw error;
        return new Company(id, "ACME");
    }
}

22.2 Timeout Test

@Test
void shouldTimeoutSlowDependency() {
    fakeRegistry.delay(Duration.ofSeconds(2));

    assertThrows(TimeoutException.class, () ->
            service.lookupCompany("123", Deadline.after(Duration.ofMillis(200), clock))
    );
}

22.3 Retry Classification Test

@Test
void shouldNotRetryBusinessError() {
    fake.failWith(new BusinessRuleViolation("invalid"));

    assertThrows(BusinessRuleViolation.class, () -> client.call());
    assertEquals(1, fake.callCount());
}

22.4 Circuit Breaker Test

@Test
void shouldOpenCircuitAfterFailures() {
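    // ConnectException is a checked IOException, so wrap it to match failWith(RuntimeException).
    fake.failWith(new UncheckedIOException(new ConnectException("down")));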

    for (int i = 0; i < 20; i++) {
        ignoreFailure(() -> protectedClient.call());
    }

    assertEquals(CircuitBreaker.State.OPEN, breaker.getState());
}

22.5 Bulkhead Test

@Test
void shouldRejectWhenBulkheadFull() throws Exception {
    var bulkhead = new SemaphoreBulkhead(1);
    var latch = new CountDownLatch(1);
    var executor = Executors.newFixedThreadPool(2);

    Future<?> first = executor.submit(() -> bulkhead.execute(() -> {
        await(latch);
        return null;
    }));

    eventually(() -> assertEquals(0, bulkhead.availablePermits()));

    assertThrows(RejectedExecutionException.class,
            () -> bulkhead.execute(() -> "second"));

    latch.countDown();
    first.get();
    executor.shutdown();
}

23. Chaos and Fault Injection

Resilience must be tested with realistic failure:

latency injection,
connection reset,
DNS failure,
partial error rate,
slow response body,
429/503 storm,
dependency blackhole,
queue saturation,
DB pool exhaustion,
cache outage,
message broker lag.

Start in lower environment, then controlled production experiments if organization maturity allows.

Fault injection checklist:

blast radius limited,
rollback switch exists,
alerts active,
SLO defined,
customer impact understood,
on-call aware,
experiment duration bounded.

24. Anti-Patterns

24.1 Retry Storm

Symptom:

downstream slows,
callers retry,
traffic multiplies,
downstream collapses further.

Correction:

retry budget,
exponential backoff,
jitter,
circuit breaker,
respect 429/Retry-After,
shed load.

24.2 Timeout Too High

Symptom:

requests hang,
thread/connection pools exhausted,
p99 latency explodes.

Correction:

derive timeout from latency budget,
set per-dependency deadlines,
fail fast.

24.3 Timeout Too Low

Symptom:

false failures,
unnecessary retries,
user-visible errors during normal p99.

Correction:

use latency histogram,
account for cold connection/TLS,
warm connections,
tune based on false timeout target.

24.4 Fallback Lies

Symptom:

system returns default as if authoritative,
users make wrong decisions,
metrics show green while degraded.

Correction:

mark stale/partial data,
alert on fallback,
avoid fallback for correctness-critical commands.

24.5 One Global Circuit Breaker

Symptom:

one operation opens breaker for all operations,
low-value call blocks high-value call,
health signal too coarse.

Correction:

breaker per dependency and operation class,
priority-aware policy.

24.6 Shared Executor for Everything

Symptom:

notification failure blocks command processing,
background jobs starve API requests.

Correction:

bulkheads,
bounded queues,
separate resource pools.

24.7 Infinite Queue

Symptom:

system accepts work it cannot finish,
latency becomes invisible backlog,
OOM risk.

Correction:

bounded queue,
rejection,
load shedding,
backpressure.

25. Regulatory / Enforcement System Considerations

For enforcement lifecycle systems, resilience has defensibility implications.

25.1 Command Path

Commands like:

issue notice,
approve enforcement action,
close case,
assign legal officer,
extend deadline,
escalate violation,
freeze account,
publish sanction,

should not silently fallback to stale or default authority.

Better:

fail closed,
route to manual review,
persist pending command with clear status,
require authoritative revalidation,
record reason and dependency state.

25.2 Read Path

Read paths can degrade more safely:

dashboard partial data,
stale search results,
timeline without enrichment,
external registry marked unavailable,
export queued.

25.3 Auditability

Every degraded command-affecting behavior should be auditable:

which dependency failed,
which fallback was used,
what stale version was served,
who acted,
whether action was blocked or allowed,
whether manual review was required.

Rule:

Resilience must not erase accountability.

26. Virtual Threads and Resilience

Virtual threads reduce the cost of blocking threads, but they do not remove downstream capacity limits.

They help with:

thread-per-request simplicity,
blocking I/O scalability,
structured cancellation,
simpler code than callback chains.

They do not solve:

DB connection exhaustion,
remote service overload,
rate limits,
memory pressure,
CPU saturation,
unbounded request admission,
missing timeouts.

Rule:

Virtual threads make waiting cheaper for the caller, not cheaper for the dependency.

Still use:

timeouts,
deadlines,
bulkheads,
rate limits,
bounded queues,
cancellation,
observability.

27. Pattern Selection Matrix

Problem	Primary Pattern	Supporting Pattern
dependency slow	timeout/deadline	circuit breaker, fallback
transient network error	retry	backoff, jitter, idempotency
downstream overload	circuit breaker	load shedding, rate limit
one dependency consumes all resources	bulkhead	bounded queue, timeout
traffic spike	rate limiter	load shedding, priority
command unknown outcome	idempotency	retry, status lookup
user-facing optional data unavailable	fallback	stale cache, partial response
p99 tail latency high	timeout	maybe hedging for idempotent reads
background job backlog	bounded queue	priority, load shedding
regulatory command cannot validate	fail closed	manual review queue

28. Resilience Review Checklist

For every remote dependency:

Is there a total deadline?
Is there a connect/request/read timeout?
Is retry allowed?
Which errors are retryable?
Is operation idempotent?
Is there an idempotency key for commands?
Is backoff jittered?
Is retry budget bounded?
Is there a circuit breaker?
Is circuit breaker scoped correctly?
Is there a bulkhead/concurrency limit?
Is queue bounded?
Is rate limit needed?
Is fallback safe?
Is fallback observable?
Are metrics emitted per dependency/operation?
Are slow calls measured?
Are rejections measured?
Are timeouts tested?
Are partial outages tested?
Does behavior preserve auditability?

29. Practice Drill

Drill 1 — Protect External Registry Lookup

Given:

public Company lookupCompany(String registrationNumber) {
    return httpClient.get("/companies/" + registrationNumber);
}

Add:

connect timeout,
request timeout,
total deadline,
retry for 502/503/504 only,
exponential backoff with jitter,
circuit breaker,
semaphore bulkhead,
stale cache fallback,
metric tags,
tests for slow dependency and breaker open.

Drill 2 — Make Command Retry-Safe

Given command:

POST /cases/{id}/approve

Design:

idempotency key,
request hash,
duplicate command behavior,
unknown outcome retry,
response replay,
audit trail.

Drill 3 — Prevent Retry Storm

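Given 5 services in a call chain that ends at downstream X, each making up to 3 attempts.

Task:

calculate amplification,
move retry to one layer,
add backoff + jitter,
add circuit breaker,
add rate limit,
define retry budget.

Amplification example:

3 attempts at each of 5 chained layers = 3^5 = 243 possible downstream attempts for one user request

If the 5 services instead call X side by side rather than through each other, each original request is multiplied by 3 and the combined worst case is 5 × 3 = 15 attempts. The exponential blow-up comes from retries nested across layers.

This is why retry must be designed globally, not casually added locally.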
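The arithmetic can be checked with a small helper. `RetryAmplification` is a hypothetical name; `chained` models retries nested across layers of a call chain, and `parallel` models independent callers that each hit the downstream directly.

```java
final class RetryAmplification {
    private RetryAmplification() {}

    /** Worst-case downstream attempts when every layer of a call chain retries. */
    static long chained(int attemptsPerLayer, int layers) {
        long total = 1;
        for (int i = 0; i < layers; i++) {
            // Each layer multiplies the attempts made by the layer above it.
            total = Math.multiplyExact(total, (long) attemptsPerLayer);
        }
        return total;
    }

    /** Worst-case downstream attempts when independent callers each retry on their own. */
    static long parallel(int attemptsPerCaller, int callers) {
        return (long) attemptsPerCaller * callers;
    }
}
```

`chained(3, 5)` gives 243 and `parallel(3, 5)` gives 15. Moving retries to a single layer turns the chained case into `chained(3, 1)`, which is 3.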

30. Summary

Resilience patterns are not a checklist of libraries. They are a way to keep failure bounded.

Core invariant:

Every dependency call must have bounded time, bounded attempts, bounded concurrency, bounded queueing, explicit fallback, and observable behavior.

Use:

timeout/deadline to bound waiting,
cancellation to stop wasted work,
retry for transient idempotent failures,
backoff and jitter to prevent synchronized retry,
idempotency to make command retry safe,
circuit breaker to fail fast during dependency failure,
bulkhead to isolate resource pools,
rate limiter/load shedding to control admission,
fallback/graceful degradation for non-critical features,
observability to prove what happened.

Top 1% engineering behavior is not “add Resilience4j”. It is the ability to say:

For this dependency, under this business criticality, the total budget is X, attempts are Y, retryable failures are Z, idempotency is guaranteed by K, concurrency is bounded by B, fallback is F, and all degraded behavior is visible in metrics, traces, logs, and audit records.

References

Resilience4j documentation: circuit breaker, retry, bulkhead, rate limiter, time limiter, decorators.
AWS Builders Library: timeouts, retries, exponential backoff, jitter, overload behavior, and dependency isolation.
Java platform documentation: HttpClient, executors, futures, interrupts, structured concurrency, and virtual threads.
Martin Fowler pattern writing on Circuit Breaker and distributed system failure boundaries.
