
Learn Java Jersey Glassfish Part 028 Resilience Timeout Bulkhead Circuit Breaker Backpressure


title: Learn Java Eclipse Jersey & GlassFish - Part 028
description: Resilience patterns for Jersey applications on GlassFish: timeout budget, retry discipline, bulkhead, circuit breaker, backpressure, graceful degradation, load shedding, and failure-mode engineering.
series: learn-java-jersey-glassfish
seriesTitle: Learn Java Eclipse Jersey & GlassFish
order: 28
partTitle: Resilience Patterns: Timeout, Bulkhead, Circuit Breaker, Backpressure
tags:

  • java
  • jersey
  • glassfish
  • jakarta-ee
  • resilience
  • timeout
  • bulkhead
  • circuit-breaker
  • backpressure
  • production
  • series

date: 2026-06-28

Part 028 — Resilience Patterns: Timeout, Bulkhead, Circuit Breaker, Backpressure

Goal: after this part, we can design a Jersey + GlassFish service that stays defensible when the database is slow, a downstream is broken, traffic spikes, clients are slow, a rolling deployment is in progress, or a dependency suffers a partial outage. The focus is not "adding a resilience library" but understanding failure containment as runtime architecture.

Performance engineering asks:

How fast does the system work when all components are relatively healthy?

Resilience engineering asks:

What happens when some components are unhealthy?

A production Jersey + GlassFish system must be able to:

  • bound waiting time;
  • bound concurrency;
  • bound the blast radius;
  • reject requests in a controlled way under overload;
  • avoid retry storms;
  • return a consistent error contract;
  • keep observability working during an incident;
  • recover without a large-scale restart.

1. Kaufman Deconstruction

The resilience skill breaks down into several sub-skills.

| Sub-skill | Expected output |
| --- | --- |
| Timeout design | timeout budget per boundary |
| Retry design | retry only when safe and bounded |
| Bulkhead design | a dependency failure does not take down every endpoint |
| Circuit breaker | downstream failure is detected quickly and contained |
| Backpressure | overload does not turn into a memory/thread explosion |
| Load shedding | requests are rejected early with a clear error contract |
| Fallback | degraded response remains correct for the domain |
| Failure-mode testing | chaos/fault tests with expected behavior |

We do not want to merely memorize patterns. We want to be able to answer:

If service X is slow for 10 minutes, which requests fail, which stay alive, how many threads are held, which errors are returned, which alerts fire, and how does the system recover?


2. Resilience Mental Model

A production runtime is a dependency graph.

Every edge must have:

  • timeout;
  • concurrency limit;
  • error classification;
  • retry policy;
  • fallback decision;
  • observability;
  • owner/runbook.

If a single edge has no bound, it can drag the whole service down.


3. Failure Taxonomy

Not all failures are the same.

| Failure | Example | Strategy |
| --- | --- | --- |
| Fast failure | 400/401/403/404 | do not retry |
| Transient network | connection reset, short timeout | bounded retry if idempotent |
| Slow dependency | DB/downstream p95 rises | timeout, bulkhead, circuit breaker |
| Overload | queue full, pool exhausted | load shedding/backpressure |
| Partial outage | one dependency down | fallback/degraded response |
| Data conflict | optimistic lock, duplicate key | domain error, maybe client retry |
| Bad request | validation failure | 400, no retry |
| Auth failure | invalid token | 401/403, no retry |
| Rate limit | 429 | client retries after the policy delay |
| Deployment transient | 503 during rollout | retry with jitter at caller side |

Resilience goes bad when every error is treated as "500, try again".


4. Timeout Budget

Timeout is the most important resilience primitive.

Rule:

The outer timeout must be larger than the inner timeout so that the inner layer has a chance to return a controlled response.

If the load balancer times out at 5 seconds but the application waits on a downstream for 10 seconds, the client sees an edge timeout while the application keeps burning a thread for the full 10 seconds.
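
The ordering rule can be checked mechanically, for example at startup. The following is a minimal sketch, not part of Jersey or GlassFish: `TimeoutBudget` and `isConsistent` are hypothetical names, and the caller is assumed to list timeouts from the outermost layer to the innermost.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical helper: verifies that each outer timeout is larger than the one inside it. */
final class TimeoutBudget {
    private TimeoutBudget() {}

    /**
     * @param outerToInner timeouts ordered from the outermost layer (e.g. load balancer)
     *                     to the innermost one (e.g. downstream read timeout)
     * @return true when every outer timeout is strictly greater than the next inner timeout
     */
    static boolean isConsistent(List<Duration> outerToInner) {
        for (int i = 0; i + 1 < outerToInner.size(); i++) {
            Duration outer = outerToInner.get(i);
            Duration inner = outerToInner.get(i + 1);
            if (outer.compareTo(inner) <= 0) {
                return false; // the inner layer would outlive the outer one
            }
        }
        return true;
    }
}
```

With a 5 s load balancer timeout, a 3 s request budget, and an 800 ms read timeout the check passes; the 5 s edge over a 10 s downstream wait from the example above fails it.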


5. Timeout Types

| Timeout | Boundary | Meaning |
| --- | --- | --- |
| Connect timeout | client → server | time to establish the connection |
| Read timeout | client waiting for response | time waiting for data after the request is sent |
| Request budget | whole operation | total deadline of the business operation |
| Pool wait timeout | thread waiting for connection/resource | time waiting for a pooled resource |
| Transaction timeout | DB/JTA transaction | upper bound on transaction duration |
| Idle timeout | idle connection | connection is closed after being idle |
| Async response timeout | suspended response | upper bound on an async response |
| Load balancer timeout | edge/proxy | upper bound on the request at the edge |

Timeouts must be consistent. A single long timeout can defeat every other protection.


6. Jersey Client Timeout Pattern

A Jersey client must always have explicit timeouts.

@ApplicationScoped
public class RiskScoreClient {
    private final Client client;
    private final WebTarget target;

    public RiskScoreClient() {
        this.client = ClientBuilder.newBuilder()
                .connectTimeout(200, TimeUnit.MILLISECONDS)
                .readTimeout(800, TimeUnit.MILLISECONDS)
                .build();
        this.target = client.target("https://risk.internal/api/v1/score");
    }

    public RiskScore score(ScoreRequest request) {
        return target.request(MediaType.APPLICATION_JSON_TYPE)
                .post(Entity.json(request), RiskScore.class);
    }

    @PreDestroy
    void close() {
        client.close();
    }
}

Do not:

ClientBuilder.newClient(); // no timeout, no lifecycle owner

7. Deadline Propagation

A per-call client timeout is not enough when an operation has many steps.

public final class Deadline {
    private final long deadlineNanos;

    private Deadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    public static Deadline after(Duration duration) {
        return new Deadline(System.nanoTime() + duration.toNanos());
    }

    public Duration remaining() {
        long remaining = deadlineNanos - System.nanoTime();
        return Duration.ofNanos(Math.max(0, remaining));
    }

    public boolean expired() {
        return remaining().isZero();
    }
}

Usage:

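public CaseDecision decide(CaseRequest request) {
    // One budget for the whole operation; every step draws from it.
    Deadline deadline = Deadline.after(Duration.ofSeconds(3));

    AuthDecision auth = authClient.check(request, deadline.remaining());
    if (deadline.expired()) {
        throw new ServiceUnavailableException("deadline_exceeded");
    }

    // Cap the risk call at 800 ms, or less when the remaining budget is smaller.
    RiskScore score = riskClient.score(request, min(deadline.remaining(), Duration.ofMillis(800)));
    return decisionEngine.decide(auth, score);
}

// Returns the shorter of two durations.
private static Duration min(Duration a, Duration b) {
    return a.compareTo(b) <= 0 ? a : b;
}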

This pattern keeps the total time within the budget even though each call has its own timeout.


8. Retry Discipline

Retry can rescue a transient failure, or it can destroy the system.

Retry is safe only if:

  • the operation is idempotent or carries an idempotency key;
  • the failure is likely transient;
  • the retry count is small;
  • there is backoff + jitter;
  • the total timeout stays bounded;
  • the downstream is not severely overloaded;
  • error classification is clear.

Do not retry:

  • validation errors;
  • auth errors;
  • permission denied;
  • domain conflicts without a policy;
  • non-idempotent POST without an idempotency key;
  • requests that are already past their deadline.

9. Retry Storm

A retry storm happens when clients add load to a dependency that is already unhealthy.

Mitigation:

  • small retry count;
  • exponential backoff;
  • jitter;
  • circuit breaker;
  • retry budget;
  • idempotency key;
  • respect Retry-After;
  • fail fast under overload.
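
Several of these mitigations can be combined in one small helper. The sketch below is an illustration only: `BoundedRetry` is a hypothetical class, the 50 ms base and 1 s cap are assumed values, and the caller is assumed to pass only idempotent operations. It shows a small attempt count, exponential backoff with full jitter, a total time budget, and a predicate that decides which failures are retryable.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical helper: bounded retry with exponential backoff, full jitter, and a time budget. */
final class BoundedRetry {
    private BoundedRetry() {}

    /** Full jitter: a random delay in [0, min(cap, base * 2^attempt)] milliseconds. */
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long exponential = baseMillis << Math.min(attempt, 20); // shift is capped to avoid overflow
        long ceiling = Math.min(capMillis, exponential);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    static <T> T call(Callable<T> operation, int maxAttempts, Duration budget,
                      Predicate<Exception> retryable) throws Exception {
        long deadline = System.nanoTime() + budget.toNanos();
        for (int attempt = 0; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                boolean lastAttempt = attempt + 1 >= maxAttempts;
                long sleepMillis = backoffMillis(attempt, 50, 1_000); // assumed base and cap
                long remainingNanos = deadline - System.nanoTime();
                boolean outOfBudget = remainingNanos <= TimeUnit.MILLISECONDS.toNanos(sleepMillis);
                // Give up when the failure is not retryable, attempts are used up,
                // or sleeping would push the operation past its budget.
                if (lastAttempt || outOfBudget || !retryable.test(e)) {
                    throw e;
                }
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

A caller might retry only I/O failures, for example `BoundedRetry.call(op, 2, Duration.ofSeconds(1), e -> e instanceof java.io.IOException)`. Honoring Retry-After and a shared retry budget across callers are deliberately left out of this sketch.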

10. Idempotency Key for Mutations

Mutation endpoints need care.

Example:

@POST
@Path("/cases")
public Response createCase(
        @HeaderParam("Idempotency-Key") String idempotencyKey,
        CreateCaseRequest request) {
    CaseCreated result = commandService.create(idempotencyKey, request);
    return Response.status(Response.Status.CREATED).entity(result).build();
}

Server stores:

| Key | Request Hash | Result | Status |
| --- | --- | --- | --- |
| abc-123 | hash1 | case id | completed |

Rules:

  • same key + same request returns same result;
  • same key + different request returns conflict;
  • key expires after policy;
  • operation must be transactionally recorded.

With this, a client can retry specific POSTs without a double-create.
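
The first two rules can be sketched with an in-memory map. This is an illustration under loud assumptions: `IdempotencyStore` is a hypothetical class, the request hash is computed by the caller, and a real implementation would persist the entry in the same database transaction as the mutation and expire keys by policy, neither of which is shown here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical in-memory idempotency store; not durable and not transactional. */
final class IdempotencyStore<R> {
    /** What the server remembers per key: the hash of the first request and its result. */
    record Entry<R>(String requestHash, R result) {}

    static final class KeyConflictException extends RuntimeException {
        KeyConflictException(String key) {
            super("idempotency key reused with a different request: " + key);
        }
    }

    private final ConcurrentMap<String, Entry<R>> entries = new ConcurrentHashMap<>();

    /**
     * Runs the operation at most once per key.
     * Same key + same hash returns the stored result; same key + different hash is a conflict.
     */
    R execute(String key, String requestHash, Supplier<R> operation) {
        // computeIfAbsent runs the operation atomically for this key,
        // so concurrent retries with the same key do not double-create.
        Entry<R> entry = entries.computeIfAbsent(
                key, k -> new Entry<>(requestHash, operation.get()));
        if (!entry.requestHash().equals(requestHash)) {
            throw new KeyConflictException(key);
        }
        return entry.result();
    }
}
```

In the resource above, `KeyConflictException` would typically map to a 409 response, which is an assumption about the error contract rather than something the store enforces.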


11. Bulkhead Pattern

A bulkhead contains damage so that one dependency cannot exhaust all resources.

A bulkhead can be:

  • separate thread pool;
  • separate JDBC pool;
  • semaphore/concurrency limit;
  • separate Jersey Client connector/pool;
  • separate deployment/service;
  • endpoint-level rate limit.

Rule:

A heavy, non-critical workload must not share a resource pool with a critical workload unless it is limited.


12. Bulkhead with Semaphore

A simple concurrency limit for a dependency.

@ApplicationScoped
public class RiskScoreGateway {
    // At most 30 concurrent calls to the risk service from this instance.
    private final Semaphore permits = new Semaphore(30);
    private final RiskScoreClient client;

    // CDI needs a non-private no-arg constructor to proxy a normal-scoped bean.
    protected RiskScoreGateway() {
        this(null);
    }

    @Inject
    public RiskScoreGateway(RiskScoreClient client) {
        this.client = client;
    }

    public RiskScore score(ScoreRequest request) {
        // Fail fast instead of queueing behind a slow dependency.
        if (!permits.tryAcquire()) {
            throw new ServiceUnavailableException("risk_score_bulkhead_full");
        }
        try {
            return client.score(request);
        } finally {
            permits.release();
        }
    }
}

This is better than letting 500 requests wait on the same downstream.

Enhancement:

  • acquire timeout kecil;
  • metrics active/rejected;
  • fallback untuk optional dependency;
  • per-tenant fairness jika multi-tenant.
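
The first two enhancements can be sketched in plain Java. `MeteredBulkhead` and its method names are hypothetical, and the counters stand in for whatever metrics library the service actually uses.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical bulkhead: bounded concurrency, a short acquire wait, and basic counters. */
final class MeteredBulkhead {
    static final class BulkheadFullException extends RuntimeException {
        BulkheadFullException(String message) {
            super(message);
        }
    }

    private final Semaphore permits;
    private final int maxConcurrent;
    private final Duration maxWait;
    private final LongAdder rejected = new LongAdder();

    MeteredBulkhead(int maxConcurrent, Duration maxWait) {
        this.permits = new Semaphore(maxConcurrent);
        this.maxConcurrent = maxConcurrent;
        this.maxWait = maxWait;
    }

    <T> T call(Callable<T> task) throws Exception {
        // Wait briefly for a permit; a long wait would just move the queue somewhere else.
        if (!permits.tryAcquire(maxWait.toMillis(), TimeUnit.MILLISECONDS)) {
            rejected.increment();
            throw new BulkheadFullException("bulkhead_full");
        }
        try {
            return task.call();
        } finally {
            permits.release();
        }
    }

    /** Calls currently holding a permit. */
    int active() {
        return maxConcurrent - permits.availablePermits();
    }

    /** Calls rejected because no permit became free within maxWait. */
    long rejectedCount() {
        return rejected.sum();
    }
}
```

The acquire wait should stay far below the request budget; a value in the low tens of milliseconds is an assumed starting point, not a measured recommendation.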

13. Circuit Breaker Model

A circuit breaker stops the service from repeatedly calling a dependency that is failing.

States: Closed, Open, Half-open.

Meaning:

| State | Behavior |
| --- | --- |
| Closed | call dependency normally |
| Open | fail fast / fallback |
| Half-open | allow limited trial calls |

A circuit breaker should open based on a rolling window of outcomes, not on a single error, unless the dependency is highly critical and the error is catastrophic.
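
To make the state model concrete, here is a minimal sketch of a count-based breaker. It is not the Resilience4j implementation: `SimpleCircuitBreaker` is a hypothetical class, the window counts the last N outcomes rather than a time window, half-open admits exactly one trial call, and the clock is injectable so the behavior can be tested without sleeping.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical count-based circuit breaker with a single half-open trial call. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // fraction between 0.0 and 1.0
    private final long openNanos;
    private final LongSupplier clock;           // nanosecond time source
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold,
                         Duration openDuration, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openNanos = openDuration.toNanos();
        this.clock = clock;
    }

    /** Returns true when the caller may invoke the dependency. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openNanos) {
            state = State.HALF_OPEN; // this caller becomes the trial call
            return true;
        }
        return state == State.CLOSED;
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED; // trial succeeded, resume normal calls
            window.clear();
        } else {
            record(false);
        }
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN) {
            trip(); // trial failed, stay open for another period
        } else {
            record(true);
        }
    }

    synchronized State state() {
        return state;
    }

    private void record(boolean failure) {
        if (state != State.CLOSED) {
            return; // late result from a call that started before the breaker opened
        }
        window.addLast(failure);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        if (window.size() == windowSize) {
            long failures = window.stream().filter(f -> f).count();
            if ((double) failures / windowSize >= failureRateThreshold) {
                trip();
            }
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        window.clear();
    }
}
```

One known limitation of this sketch: if the trial caller never reports an outcome, the breaker stays half-open. Production libraries handle that case, along with slow-call thresholds and time-based windows.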


14. Circuit Breaker Placement

Place the circuit breaker at the gateway/client boundary, not sporadically inside resource methods.

Bad:

@Path("/cases")
public class CaseResource {
    @GET
    public CaseResponse get() {
        // circuit breaker logic inline here
    }
}

Better:

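@ApplicationScoped
public class RiskScoreGateway {
    // Assumed to be created and configured elsewhere, e.g. a Resilience4j CircuitBreaker.
    private CircuitBreaker breaker;
    private RiskScoreClient riskScoreClient;

    public RiskScore score(ScoreRequest request) {
        // The breaker wraps the outbound call, so every caller of the gateway is protected.
        return breaker.executeSupplier(() -> riskScoreClient.score(request));
    }
}

The resource method stays clean as an API adapter. Resilience policy lives at the dependency boundary.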


15. Circuit Breaker Is Not Timeout

A circuit breaker does not replace a timeout.

Without a timeout, a call can hang for a long time before it is counted as failed, so the circuit breaker reacts late.

The correct stack:

Timeout -> Bulkhead -> Circuit Breaker -> Retry? -> Client Call

Or, in some libraries, the composition is explicit:

Bulkhead protects concurrency
Timeout bounds duration
Circuit breaker observes outcomes
Retry repeats only safe failures within budget

16. Fallback Decision Model

A fallback is not always correct.

| Dependency | Safe to fall back? | Example |
| --- | --- | --- |
| Recommendation | Yes | return empty recommendations |
| Notification | Yes, async retry | queue later |
| Risk score | Depends on the domain | manual review instead of auto-approve |
| Authorization | Usually not | fail closed |
| Payment | Usually not | do not assume success |
| Identity | Usually not | fail closed |
| Reporting | Yes | show stale/partial data with label |

Rule:

A fallback must be correct for the domain, not merely make the error disappear.


17. Fail Open vs Fail Closed

Security and compliance often require fail closed.

| Area | Default Failure Mode |
| --- | --- |
| Authentication | fail closed |
| Authorization | fail closed |
| Tenant resolution | fail closed |
| Audit write | fail according to regulation; often block critical mutations or queue durably |
| Notification | fail open with durable retry if not part of transaction |
| Analytics | fail open |
| Recommendation | fail open |
| Risk scoring | domain-specific; often degrade to manual review |

For regulatory systems, “service unavailable” can be safer than silent inconsistent decision.


18. Backpressure

Backpressure means the system signals that it cannot accept more work at the current rate.

Without backpressure:

  • requests pile up;
  • memory grows;
  • threads run out;
  • queue latency rises;
  • health checks may stay green until it is too late;
  • recovery is slower.

Backpressure techniques:

| Technique | Boundary |
| --- | --- |
| bounded queue | executor/job queue |
| semaphore limit | dependency call |
| pool wait timeout | JDBC/outbound pool |
| rate limit | per client/tenant/endpoint |
| 429 response | caller should slow down |
| 503 response | temporary service overload |
| load balancer outlier ejection | bad instance removed |
| Kubernetes readiness false | stop new traffic |

19. Load Shedding

Load shedding rejects requests early to keep the system alive.

When to use 429:

  • a client/tenant sends too many requests;
  • a quota/rate limit is reached;
  • the caller can retry after a delay.

When to use 503:

  • global service overload;
  • a critical dependency is unavailable;
  • maintenance/deployment transient.

Add Retry-After when a retry is expected.


20. Jersey Filter for Overload Protection

A simple global concurrency guard.

@Provider
@Priority(Priorities.AUTHENTICATION - 100)
public class OverloadProtectionFilter implements ContainerRequestFilter, ContainerResponseFilter {
    private final Semaphore permits = new Semaphore(200);

    @Override
    public void filter(ContainerRequestContext requestContext) throws IOException {
        if (!permits.tryAcquire()) {
            requestContext.abortWith(Response.status(503)
                    .header("Retry-After", "1")
                    .entity(new ErrorResponse("service_overloaded", "Service is overloaded"))
                    .build());
            return;
        }
        requestContext.setProperty("overloadPermitAcquired", Boolean.TRUE);
    }

    @Override
    public void filter(ContainerRequestContext requestContext,
                       ContainerResponseContext responseContext) throws IOException {
        if (Boolean.TRUE.equals(requestContext.getProperty("overloadPermitAcquired"))) {
            permits.release();
        }
    }
}

Production note:

  • handle exception path carefully;
  • avoid releasing permit twice;
  • expose metrics;
  • use per-endpoint/per-tenant limits if needed;
  • async responses require lifecycle-aware release strategy.

21. Rate Limiting

Rate limit protects fairness and capacity.

Dimensions:

| Dimension | Example |
| --- | --- |
| Consumer | API key/client id |
| Tenant | tenant id |
| User | subject id |
| Endpoint | /reports/export |
| Method | POST more restricted than GET |
| Cost unit | one export = 100 normal requests |

Responses:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/problem+json

Rate limiting in Jersey can be implemented in a filter, but distributed rate limiting often needs shared state or gateway support.
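
For the single-node case, a token bucket per key is one common way to do this. The sketch below is an assumption-heavy illustration: `TokenBucketLimiter` is a hypothetical class, state lives in local memory only (so it does not solve the distributed case), and the `cost` parameter models the "cost unit" dimension from the table.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

/** Hypothetical per-key token bucket; local to one JVM, not shared across instances. */
final class TokenBucketLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;

        Bucket(double tokens, long lastRefillNanos) {
            this.tokens = tokens;
            this.lastRefillNanos = lastRefillNanos;
        }
    }

    private final double capacity;
    private final double refillPerSecond;
    private final LongSupplier clock; // nanosecond time source, injectable for tests
    private final ConcurrentMap<String, Bucket> buckets = new ConcurrentHashMap<>();

    TokenBucketLimiter(double capacity, double refillPerSecond, LongSupplier clock) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.clock = clock;
    }

    /**
     * @param key  the fairness dimension, e.g. tenant id or API client id
     * @param cost tokens this request consumes; a heavy export can cost more than a lookup
     * @return true if the request is admitted, false if the caller should receive 429
     */
    boolean tryAcquire(String key, double cost) {
        Bucket bucket = buckets.computeIfAbsent(
                key, k -> new Bucket(capacity, clock.getAsLong()));
        synchronized (bucket) {
            long now = clock.getAsLong();
            double refill = (now - bucket.lastRefillNanos) / 1_000_000_000.0 * refillPerSecond;
            bucket.tokens = Math.min(capacity, bucket.tokens + refill);
            bucket.lastRefillNanos = now;
            if (bucket.tokens < cost) {
                return false;
            }
            bucket.tokens -= cost;
            return true;
        }
    }
}
```

A request filter could call `tryAcquire(tenantId, 1)` and abort with 429 plus Retry-After on `false`. The map grows with the number of keys, so a real deployment would also need eviction for idle keys.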


22. Slow Client Protection

Slow clients can hold server resources.

Risks:

  • streaming response thread held;
  • output buffer pressure;
  • connection slot consumption;
  • long-running export tied to DB connection;
  • SSE connection explosion.

Mitigation:

  • connection idle timeout;
  • max response size;
  • async job for export;
  • separate endpoint limits;
  • SSE heartbeat and max connection count;
  • release DB connection before streaming if possible;
  • reverse proxy buffering strategy understood.

23. Database Resilience

Database failure modes:

| Failure | Symptom | Strategy |
| --- | --- | --- |
| Pool exhausted | wait timeout | tune pool, reduce hold time, bulkhead |
| Query slow | p95/p99 high | index/query plan, timeout |
| Lock contention | transactions wait | shorter transaction, isolation review |
| DB down | connection failure | fail fast, circuit breaker-ish DB health |
| Deadlock | exception | retry only safe transaction |
| Replica lag | stale reads | consistency policy |

Do not use one giant transaction around remote calls.

Bad:

@Transactional
public Decision decide(Request request) {
    CaseEntity entity = repository.load(request.caseId());
    RiskScore score = riskClient.score(request); // remote call inside transaction
    entity.apply(score);
    return mapper.toResponse(entity);
}

Better:

  • load minimal data;
  • release transaction;
  • call remote dependency;
  • start short transaction to commit decision;
  • handle conflict with optimistic locking.

24. Transaction Timeout

Transaction timeout should be shorter than user-visible timeout and aligned with DB behavior.

If transaction timeout is 60s but request budget is 3s:

  • request may fail at edge;
  • transaction may continue;
  • locks may remain longer;
  • user sees unknown result;
  • retry can cause conflict.

Design:

| Operation | Transaction Budget |
| --- | --- |
| simple lookup | no transaction or read-only short |
| mutation | 500 ms - 2 s typical target |
| batch import | async job transaction chunks |
| export | avoid long transaction while streaming |

25. Async and Queue Resilience

Async job queues need bounds.

Bad:

executor.submit(() -> heavyWork(request)); // unbounded queue hidden in executor

Better:

ThreadPoolExecutor executor = new ThreadPoolExecutor(
        10,
        20,
        30, TimeUnit.SECONDS,
        new ArrayBlockingQueue<>(100),
        new ThreadPoolExecutor.AbortPolicy()
);

Then map rejection:

try {
    executor.execute(task);
} catch (RejectedExecutionException e) {
    throw new ServiceUnavailableException("work_queue_full");
}

Every queue must have:

  • max size;
  • rejection policy;
  • metrics;
  • drain behavior;
  • retry/dead-letter strategy if durable.

26. SSE Resilience

SSE endpoints are long-lived.

Failure concerns:

  • client disconnect;
  • connection count explosion;
  • heartbeat missing;
  • idle timeout mismatch;
  • event producer overload;
  • slow consumer;
  • deployment drain.

Design:

| Concern | Practice |
| --- | --- |
| Max connection | limit per node/tenant/user |
| Heartbeat | send periodic comment/event |
| Replay | use event id if domain requires replay |
| Buffer | bounded per client |
| Slow consumer | disconnect or drop non-critical events |
| Shutdown | stop accepting, close/drain connections |

27. Health Checks and Readiness

Health is not one boolean.

| Check | Meaning | Use |
| --- | --- | --- |
| Liveness | process should be restarted if false | detect dead process |
| Readiness | instance can accept traffic | rollout/load balancing |
| Startup | app is still starting | avoid premature kill |
| Dependency health | DB/downstream status | diagnostics, sometimes readiness |
| Degraded health | service works with reduced capability | alerting/routing |

A service should not mark itself unready for every optional dependency failure. Otherwise optional outage can remove all instances and create larger outage.


28. Graceful Shutdown

During deployment or node termination:

  1. Stop accepting new traffic.
  2. Mark readiness false.
  3. Let load balancer drain.
  4. Finish in-flight requests within grace period.
  5. Stop SSE/async jobs safely.
  6. Close Jersey clients.
  7. Release resources.
  8. Shutdown GlassFish instance.

If shutdown is not graceful, clients see resets and may retry, creating traffic spikes during deployment.
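
For application-owned executors (async jobs, background work), step 4 and step 5 can be sketched as a bounded drain. `GracefulDrain` is a hypothetical helper; it covers only a plain `ExecutorService`, not the GlassFish request thread pools, which are drained by the server itself.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: stop intake, wait for in-flight work, then force the rest. */
final class GracefulDrain {
    private GracefulDrain() {}

    /**
     * @return the number of queued tasks that never started because the grace period ran out
     */
    static int drain(ExecutorService executor, Duration grace) {
        executor.shutdown(); // reject new tasks, keep running queued and in-flight ones
        try {
            if (executor.awaitTermination(grace.toMillis(), TimeUnit.MILLISECONDS)) {
                return 0; // everything finished within the grace period
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and fall through
        }
        // Grace period exceeded: interrupt running tasks and collect the ones never started.
        List<Runnable> neverStarted = executor.shutdownNow();
        return neverStarted.size();
    }
}
```

Such a helper could be called from a `@PreDestroy` method; the grace period should be shorter than the termination grace period of the platform so the process is not killed mid-drain.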


29. Error Contract for Resilience

Resilience failure must map to stable response.

Example:

{
  "type": "https://errors.example.com/service-unavailable",
  "title": "Service temporarily unavailable",
  "status": 503,
  "code": "risk_score_unavailable",
  "correlationId": "01HX...",
  "retryable": true
}

Guidelines:

| Failure | Status | Retryable |
| --- | --- | --- |
| bulkhead full | 503 | yes, with backoff |
| rate limited | 429 | yes, after delay |
| validation | 400 | no |
| auth invalid | 401 | no unless token refresh |
| forbidden | 403 | no |
| dependency unavailable | 503 | maybe |
| deadline exceeded | 503/504 | maybe |
| conflict | 409 | domain-specific |

30. Observability for Resilience

Metrics:

| Metric | Why |
| --- | --- |
| timeout count by dependency | dependency health |
| retry count | retry storm detection |
| circuit state | open/half-open visibility |
| bulkhead active/rejected | capacity pressure |
| queue size | backlog detection |
| rate limit rejected | client/tenant abuse |
| fallback count | degraded behavior |
| deadline exceeded | budget issue |
| pool wait time | resource contention |
| error status distribution | user-visible impact |

Logs should include:

  • dependency name;
  • operation name;
  • duration;
  • timeout/retry/circuit decision;
  • correlation ID;
  • tenant/client if safe;
  • error code.

31. Resilience Policy Configuration

Do not hardcode all policy values in code.

Better config model:

resilience:
  riskScore:
    connectTimeoutMs: 200
    readTimeoutMs: 800
    maxConcurrent: 30
    retry:
      maxAttempts: 2
      backoffMs: 50
      jitter: true
    circuitBreaker:
      failureRateThreshold: 50
      slowCallThresholdMs: 900
      openDurationMs: 10000
  reportExport:
    maxConcurrent: 5
    queueSize: 20

Config must still be validated at startup. Bad config can create outage.
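
A startup check might look like the sketch below. `DependencyPolicy` is a hypothetical record whose fields mirror a subset of the keys shown above; how the values are loaded (MicroProfile Config, system properties, a YAML parser) is outside the sketch, and the specific rules are assumptions about what counts as sane.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical typed view of one dependency's resilience policy. */
record DependencyPolicy(String name, int connectTimeoutMs, int readTimeoutMs,
                        int maxConcurrent, int retryMaxAttempts, int failureRateThreshold) {

    /** Collects every problem instead of failing on the first one. */
    List<String> violations() {
        List<String> problems = new ArrayList<>();
        if (connectTimeoutMs <= 0) {
            problems.add(name + ".connectTimeoutMs must be > 0");
        }
        if (readTimeoutMs <= 0) {
            problems.add(name + ".readTimeoutMs must be > 0");
        }
        if (maxConcurrent <= 0) {
            problems.add(name + ".maxConcurrent must be > 0");
        }
        if (retryMaxAttempts < 1) {
            problems.add(name + ".retry.maxAttempts must be >= 1");
        }
        if (failureRateThreshold < 1 || failureRateThreshold > 100) {
            problems.add(name + ".circuitBreaker.failureRateThreshold must be in 1..100");
        }
        return problems;
    }

    /** Intended to run at startup so a bad value stops deployment instead of causing an outage. */
    void validateOrThrow() {
        List<String> problems = violations();
        if (!problems.isEmpty()) {
            throw new IllegalStateException("Invalid resilience config: " + problems);
        }
    }
}
```

For the `riskScore` values above, `new DependencyPolicy("riskScore", 200, 800, 30, 2, 50).validateOrThrow()` passes, while a zero read timeout (which many clients treat as "wait forever") is rejected.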


32. Multi-Tenant Fairness

In regulatory/case-management systems, one tenant or integration partner must not consume all capacity.

Pattern:

Fairness dimensions:

  • tenant;
  • API client;
  • user;
  • endpoint cost;
  • priority class;
  • regulatory deadline sensitivity.

A VIP/internal/admin endpoint may need separate capacity, but this must be explicit and audited.
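
One way to express the tenant dimension is a per-tenant concurrency cap layered under a global cap. This is a minimal sketch with hypothetical names (`TenantBulkhead`, `tryEnter`, `exit`); it assumes every successful `tryEnter` is paired with exactly one `exit`, and it ignores the other dimensions such as priority class.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

/** Hypothetical two-level concurrency limit: per tenant first, then global. */
final class TenantBulkhead {
    private final int perTenantLimit;
    private final Semaphore global;
    private final ConcurrentMap<String, Semaphore> perTenant = new ConcurrentHashMap<>();

    TenantBulkhead(int globalLimit, int perTenantLimit) {
        this.global = new Semaphore(globalLimit);
        this.perTenantLimit = perTenantLimit;
    }

    /** Returns true when the tenant is under its own cap and the node has global capacity. */
    boolean tryEnter(String tenantId) {
        Semaphore tenant = perTenant.computeIfAbsent(
                tenantId, id -> new Semaphore(perTenantLimit));
        if (!tenant.tryAcquire()) {
            return false; // this tenant already uses its share
        }
        if (!global.tryAcquire()) {
            tenant.release(); // undo the tenant permit so it is not leaked
            return false;
        }
        return true;
    }

    /** Must be called exactly once for every successful tryEnter, typically in a finally block. */
    void exit(String tenantId) {
        global.release();
        perTenant.get(tenantId).release();
    }
}
```

With a global limit of 3 and a per-tenant limit of 2, one noisy tenant can hold at most two slots, so a second tenant is still admitted.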


33. Priority and Load-Shedding Policy

Not all requests have equal importance.

| Request Type | Priority | Overload Behavior |
| --- | --- | --- |
| Health/readiness | high but cheap | always cheap, no heavy dependency |
| Case decision mutation | high | protect capacity |
| Case lookup | medium | degrade/cache maybe |
| Search | medium/low | limit and timeout |
| Export | low/heavy | queue or reject |
| Analytics event | low | drop/queue |
| Notification | async | durable queue/retry |

During overload, reject low-priority heavy work first.
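
That rule can be written as an admission function of priority and current utilization. The sketch below is illustrative only: `SheddingPolicy` is a hypothetical class and the 60% / 85% / 100% thresholds are assumed values that would need tuning against real load tests.

```java
/** Hypothetical admission policy: lower priorities are shed at lower utilization. */
final class SheddingPolicy {
    enum Priority { HIGH, MEDIUM, LOW }

    private SheddingPolicy() {}

    /**
     * @param inFlight requests currently being processed on this instance
     * @param capacity the concurrency this instance is sized for (must be > 0)
     * @return true if a request of this priority should be admitted
     */
    static boolean admit(Priority priority, int inFlight, int capacity) {
        double utilization = (double) inFlight / capacity;
        return switch (priority) {
            case LOW -> utilization < 0.60;    // assumed threshold: shed heavy/low work first
            case MEDIUM -> utilization < 0.85; // assumed threshold
            case HIGH -> utilization < 1.00;   // admitted until the instance is full
        };
    }
}
```

At 70 in-flight requests out of 100, an export (low) is rejected while a case lookup (medium) and a decision mutation (high) are still admitted.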


34. Case Study: Downstream Risk Service Slow

Scenario:

Risk service p95 rises from 100ms to 5s.
Incoming traffic: 200 RPS.
Endpoint /cases/{id}/decision calls risk service.

Without resilience:

  • Jersey request threads block;
  • outbound connections pile up;
  • user p99 rises;
  • LB timeout starts;
  • retry storm from callers;
  • unrelated endpoints starve.

With resilience:

  • risk client read timeout 800ms;
  • risk bulkhead max 30;
  • circuit breaker opens after threshold;
  • decision endpoint returns manual-review fallback or 503 depending on the domain;
  • unrelated endpoints continue;
  • alert fires on risk timeout/circuit open;
  • circuit half-open probes recovery.

35. Case Study: Reporting Export Overload

Scenario:

20 users start large CSV exports.
Each export holds DB connection for minutes.

Bad result:

  • core JDBC pool exhausted;
  • normal case lookup waits;
  • all app latency rises;
  • admin thinks DB is down.

Better design:

  • export is async job;
  • job queue bounded;
  • export pool separate max 5;
  • snapshot query releases transaction quickly;
  • file stored externally;
  • user polls job status or receives notification;
  • overload returns 429/503 with retry guidance.

36. Case Study: Authentication Dependency Outage

Auth is special.

If token validation needs remote introspection and the auth service is down, the options are:

| Option | Risk |
| --- | --- |
| fail open | security breach risk |
| fail closed | availability impact |
| cached positive auth | stale permission risk |
| JWT local verification | requires key rotation/JWKS cache |

For most regulated systems:

  • verify JWT locally where possible;
  • cache JWKS with safe TTL;
  • fail closed for unknown/invalid token;
  • do not call remote auth on every request if avoidable;
  • treat auth outage as security-sensitive incident.

37. Testing Resilience

Resilience must be tested intentionally.

Test cases:

| Test | Expected Behavior |
| --- | --- |
| downstream returns 500 | classified, maybe retry/fallback |
| downstream connect timeout | bounded failure |
| downstream read timeout | no thread leak |
| downstream slow 5s | timeout before LB |
| downstream partial outage | circuit opens |
| DB pool exhausted | controlled 503, metrics increment |
| queue full | reject early |
| client sends burst | rate limit / backpressure |
| deployment shutdown | in-flight drained |
| SSE client disconnect | resources released |

Use test doubles that can simulate latency and error, not only success.


38. Local Fault Injection Example

Simple fake dependency resource:

@Path("/fake-risk")
public class FakeRiskResource {

    @POST
    public RiskScore score(@QueryParam("delayMs") @DefaultValue("0") long delayMs,
                           @QueryParam("status") @DefaultValue("200") int status) {
        sleep(delayMs);
        if (status >= 400) {
            throw new WebApplicationException(status);
        }
        return new RiskScore("LOW", 42);
    }

    private static void sleep(long delayMs) {
        try {
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new ServiceUnavailableException("interrupted");
        }
    }
}

Then test:

  • delayMs=100 success;
  • delayMs=2000 timeout;
  • status=500 retry/circuit;
  • many concurrent calls bulkhead.

39. Resilience Anti-Patterns

| Anti-pattern | Impact |
| --- | --- |
| No timeout | stuck threads, p99 explosion |
| Retry all failures | retry storm |
| Retry non-idempotent POST | duplicate mutation |
| One shared pool for all workloads | blast radius large |
| Circuit breaker without timeout | slow reaction |
| Fallback for auth failure | security risk |
| Unbounded executor queue | memory growth |
| Health check depends on heavy DB query | false unhealthy/slow health |
| Readiness false for optional dependency | entire service removed unnecessarily |
| Catch all exceptions as 500 retryable | clients behave badly |
| Long transaction around remote call | lock contention, unclear outcome |
| Rate limit without tenant dimension | unfair capacity sharing |

40. Production Resilience Checklist

| Check | Done? |
| --- | --- |
| Request-level timeout budget exists | |
| LB timeout > app timeout > dependency timeout | |
| Jersey Client connect/read timeouts set | |
| No per-request client creation | |
| Retries limited to safe/idempotent operations | |
| Mutation idempotency key where required | |
| Bulkhead for critical downstreams | |
| Separate pool for heavy reporting/export | |
| Circuit breaker for unstable dependencies | |
| Queue sizes bounded | |
| Rate limiting defined per tenant/client | |
| 429/503 error contracts stable | |
| Fallbacks reviewed by domain/security owner | |
| Auth failure mode documented | |
| Readiness/liveness split clear | |
| Graceful shutdown tested | |
| Resilience metrics/alerts configured | |
| Fault injection tests automated | |

41. Practical Lab

Build a mini service:

  • /cases/{id}/decision calls fake risk service;
  • fake risk can delay/fail;
  • add timeout;
  • add semaphore bulkhead;
  • add retry only for safe GET-like fake operation;
  • add circuit breaker using your chosen library or simple state machine;
  • add filter-based overload protection;
  • expose metrics/logs.

Measure behavior:

| Scenario | Expected |
| --- | --- |
| risk delay 100ms | success |
| risk delay 2s | timeout within endpoint budget |
| risk 500 for 1 minute | circuit opens |
| 200 concurrent risk calls | bulkhead rejects quickly |
| export overload | core lookup still works |
| shutdown during load | graceful drain |

Deliverable:

  • timeout budget diagram;
  • failure-mode matrix;
  • load/fault test results;
  • operational runbook.

42. Engineering Invariants

  1. Every dependency boundary needs timeout.
  2. Every queue needs max size and rejection policy.
  3. Every retry needs idempotency and budget.
  4. Every fallback needs domain approval.
  5. Every bulkhead needs metrics.
  6. Every circuit breaker needs timeout underneath.
  7. Every overload path must fail early and explicitly.
  8. Every health check must distinguish liveness from readiness.
  9. Every production incident should improve the failure-mode matrix.
  10. Resilience is not a library; it is a resource-governance model.

43. References

  • Jakarta RESTful Web Services 4.0 Specification.
  • Eclipse Jersey Client documentation.
  • Eclipse GlassFish Performance Tuning Guide.
  • Eclipse GlassFish Administration Guide.
  • Jakarta EE Platform 11 Specification.
  • Resilience4j documentation for circuit breaker/bulkhead concepts when used externally.

44. What Comes Next

Part 029 moves to high availability and clustering:

  • stateless REST preference;
  • sticky vs non-sticky sessions;
  • GlassFish cluster model;
  • load balancer behavior;
  • failover semantics;
  • rolling deployment;
  • HA anti-patterns.

Resilience protects a service instance and its dependency edges. HA designs how multiple instances/nodes survive instance and infrastructure failure.
