Failure Model of Java Microservices
Learn Java Microservices Design and Architect - Part 039
Failure model for production-grade Java microservices: a taxonomy of failures, timeouts, overload, saturation, dependency failure, partial availability, and how to design a service so it does not collapse when a dependency misbehaves.
A good microservice is not a service that “never errors”. That is a fantasy.
A good microservice is a service that:
- knows which kinds of failure can happen,
- knows where each failure is allowed to stop,
- knows how to respond correctly when a dependency is unhealthy,
- knows how to avoid amplifying a failure into a systemic outage,
- can still explain to operators, auditors, and consumers: “what happened, when, why, and what the impact was.”
In a monolith, many failures show up as local exceptions. In microservices, failure becomes a matter of networks, capacity, coordination, data staleness, message duplication, timeouts, retry storms, partial writes, and cascading failure.
So starting with this Phase 6, we do not treat reliability as “add a resilience library”. We build a failure model.
Top engineers do not ask: “which circuit breaker should we use?”
They ask:
“Which failures can happen, who is affected, how long may they last, is retry safe, is the fallback correct for the business, and how does the system know it is entering a dangerous state?”
1. Core Mental Model
A remote dependency is an untrusted component.
Not because the other team is bad, but because a distributed system has realities:
- the network can be slow,
- DNS can fail,
- a connection pool can run out,
- a dependency can be overloaded,
- a response can time out even though the operation succeeded,
- retries can amplify load,
- queues can back up,
- consumers can fall behind,
- events can arrive late,
- data can be stale,
- nodes can restart,
- deployments can produce mixed versions,
- traffic can drift far from the original assumptions.
A failure model is an explicit list of how a service fails and what its behavioral contract is when it does.
Without a failure model, a microservices design usually ends up as a plain box-and-arrow diagram of services calling each other.
Such a diagram looks tidy, but it answers almost none of the production questions:
- what is the timeout from A to D?
- if D is slow, does B run out of threads?
- does B retry?
- how many times?
- retry on which errors?
- is the command idempotent?
- may the user receive a partial response?
- is a fallback acceptable to the business?
- is D a critical or an optional dependency?
- does the call to E happen inside a transaction?
- how does the operator know that D is lowering A's availability?
- will an incident in D become an incident in A?
A failure model turns the diagram into a risk map.
2. Failure Taxonomy
A service needs a taxonomy before it needs a resilience library.
2.1 Local failure
Local failure happens inside the service boundary.
Examples:
- validation failure,
- invariant violation,
- database constraint violation,
- optimistic lock conflict,
- memory pressure,
- thread pool saturation,
- connection pool exhaustion,
- serialization failure,
- config invalid,
- disk full,
- JVM pause,
- deployment startup failure.
Local failure should be the easiest to classify because it is inside your ownership boundary.
2.2 Remote dependency failure
Remote dependency failure happens when another service, broker, database, cache, identity provider, or external API fails or degrades.
Examples:
- connection refused,
- DNS resolution failure,
- TLS handshake failure,
- slow response,
- timeout,
- HTTP 5xx,
- HTTP 429,
- gRPC UNAVAILABLE,
- gRPC DEADLINE_EXCEEDED,
- stale endpoint,
- broker unavailable,
- consumer lag.
Remote failure is dangerous because it can spread.
2.3 Data consistency failure
Data consistency failure happens when the state observed by the service is not what the business process expects.
Examples:
- projection not updated yet,
- event arrived out of order,
- duplicate event applied,
- stale cache,
- missing reference,
- saga step completed but reply lost,
- read model behind source-of-truth,
- cross-service report inconsistent during consistency window.
This is not always a bug. Sometimes it is the designed consistency model. The failure appears when consumers were not told the consistency contract.
2.4 Capacity failure
Capacity failure happens when demand exceeds the service’s ability to process safely.
Examples:
- CPU saturation,
- heap pressure,
- GC pause,
- event loop starvation,
- executor queue growth,
- DB connection pool exhausted,
- Kafka consumer lag,
- p99 latency explosion,
- HPA scaling too late,
- downstream rate limit exceeded.
Capacity failure often starts as latency, then becomes timeout, then becomes retry storm, then becomes cascading failure.
2.5 Semantic failure
Semantic failure happens when the system technically succeeds but violates business meaning.
Examples:
- fallback returns wrong policy,
- duplicate command creates duplicate case,
- compensation reverses too much,
- stale decision shown as final,
- audit event says “approved” but actual state is “pending review”,
- user receives success before durable write exists,
- retry repeats an external side effect.
This is why “availability” cannot be designed purely at transport level. Sometimes returning a fallback is worse than failing closed.
3. Failure Classification Table
Every service should maintain a failure classification table.
Example for a regulatory case service:
| Failure | Source | User Impact | Retry Safe? | Fallback Allowed? | Availability Mode | Notes |
|---|---|---|---|---|---|---|
| Invalid case transition | local domain invariant | request rejected | no | no | fail closed | business error |
| Case DB unavailable | local dependency | cannot mutate case | maybe at client with idempotency | no | fail closed | preserve data correctness |
| Party profile timeout | remote dependency | enrichment missing | yes for GET | yes | degraded | return case without profile |
| Policy service timeout during decision | remote dependency | cannot finalize decision | maybe | only if last-known-good policy allowed | fail closed or controlled degraded | compliance-sensitive |
| Audit broker unavailable | async dependency | command may still succeed if outbox durable | publisher retry | no user fallback needed | delayed audit dispatch | outbox must persist |
| Duplicate command | client/network | potential duplicate side effect | yes if idempotency key exists | no | replay previous response | idempotency store required |
| Projection lag | async read model | stale query | no direct retry | show freshness marker | bounded stale | expose watermark |
This table forces architectural clarity. It prevents accidental resilience.
4. The Most Dangerous Failure: Slow Dependency
A dependency that fails fast is easier to handle than a dependency that becomes slow.
Fast failure:
Party Service returns 503 in 20ms
Slow failure:
Party Service responds sometimes in 5s, sometimes never
Slow failure consumes resources:
- servlet threads,
- event-loop tasks,
- HTTP client connections,
- database transaction time,
- memory,
- queues,
- user patience,
- autoscaling capacity.
A service with no timeout treats slow dependencies as infinite wait. That is not resilience. That is surrender.
Even if the dependency eventually responds, the user-facing request may already have timed out. Work done after the caller has given up is often wasted work.
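As a minimal sketch of bounding the wait on a blocking dependency call: the names `BoundedCall`, `callWithTimeout`, and `DependencyTimeoutException` are hypothetical and not part of any library. It assumes the call runs on a dedicated executor; in a real service the timeout would normally also be configured on the HTTP or database client itself.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical exception type: signals that the caller stopped waiting.
final class DependencyTimeoutException extends RuntimeException {
    DependencyTimeoutException(String message) {
        super(message);
    }
}

// Hypothetical helper: bounds how long a caller waits for a blocking dependency call.
final class BoundedCall {

    private BoundedCall() {}

    static <T> T callWithTimeout(Callable<T> remoteCall, Duration timeout, ExecutorService executor) {
        Future<T> future = executor.submit(remoteCall);
        try {
            // Wait at most `timeout`; a slow dependency can no longer hold the caller forever.
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker so abandoned work does not keep consuming a thread.
            future.cancel(true);
            throw new DependencyTimeoutException("No response within " + timeout);
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for dependency", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Dependency call failed", e.getCause());
        }
    }
}
```

Cancelling the future on timeout matters here: it is the part that stops work nobody is waiting for anymore, although it only helps if the underlying call responds to interruption.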
5. Partial Failure
In a distributed system, one part can fail while others continue.
This is not an exceptional corner case. It is the default.
Example:
The question is not “is the system up?” The question is:
- which capability is available?
- which capability is degraded?
- which capability must fail closed?
- which response can be partial?
- which operations must be queued?
- which operations must be rejected?
- which dependency is optional?
- which dependency is critical?
Binary uptime is too crude for microservices.
A more useful model:
| Capability | Dependency | Mode if dependency fails |
|---|---|---|
| Create case | Case DB, ID generator, audit outbox | fail closed if DB unavailable |
| View case summary | Case DB, Party Service | degraded if Party Service unavailable |
| Submit enforcement decision | Case DB, Policy Service, audit outbox | fail closed if Policy Service unavailable |
| Export analytics report | reporting projection | stale output allowed with freshness marker |
| Send notification | broker/email provider | accept command, retry async, expose delivery status |
6. Failure Containment Boundary
A failure containment boundary answers:
“If this component fails, where does the failure stop?”
In microservices, boundaries are often confused:
- service boundary,
- database boundary,
- transaction boundary,
- thread pool boundary,
- connection pool boundary,
- queue boundary,
- rate limit boundary,
- tenant boundary,
- region boundary,
- team ownership boundary.
A service boundary is not automatically a failure boundary. If every service calls every other service synchronously with long timeouts and unbounded retries, you have a distributed failure amplifier.
Bad containment
If F slows down, A may become slow. That is failure propagation.
Better containment
Here, user-facing flow does not block on the external vendor. The external side effect becomes asynchronous, observable, retryable, and containable.
7. Critical vs Optional Dependencies
Not every dependency deserves equal treatment.
Classify each dependency:
Critical dependency
Without it, the operation cannot produce a correct business result.
Examples:
- local database for mutation,
- policy decision service during legally binding decision,
- identity/authorization check for protected resource,
- idempotency store for non-idempotent command,
- outbox table for mandatory audit event.
Failure behavior:
- fail closed,
- reject early,
- do not pretend success,
- expose clear problem detail,
- preserve correctness over availability.
Optional dependency
Without it, the operation can still produce a useful result.
Examples:
- enrichment profile,
- recommendation,
- non-critical notification preview,
- UI decoration,
- analytics counter,
- autocomplete.
Failure behavior:
- use fallback,
- return partial response,
- mark missing fragment,
- cache if safe,
- do not block critical user journey.
Deferrable dependency
The operation must happen eventually, but not in the request path.
Examples:
- email notification,
- audit publishing when outbox is durable,
- search indexing,
- projection update,
- external CRM sync.
Failure behavior:
- persist intent,
- retry asynchronously,
- expose delivery/reconciliation status,
- alert on backlog/age.
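The three classes above can be sketched in code roughly as follows. This is an illustration only: `Criticality` and `DependencyCalls` are hypothetical names, and the sketch assumes that any `RuntimeException` from the supplier represents a dependency failure.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical classification mirroring the three dependency classes.
enum Criticality { CRITICAL, OPTIONAL, DEFERRABLE }

// Hypothetical helper: applies the failure behavior that matches the classification.
final class DependencyCalls {

    private DependencyCalls() {}

    static <T> Optional<T> call(Criticality criticality, Supplier<T> dependencyCall) {
        if (criticality == Criticality.DEFERRABLE) {
            // Deferrable work does not belong in the request path at all.
            throw new IllegalArgumentException("Deferrable work must be persisted and processed asynchronously");
        }
        try {
            return Optional.ofNullable(dependencyCall.get());
        } catch (RuntimeException failure) {
            if (criticality == Criticality.OPTIONAL) {
                // Optional dependency: degrade to an empty fragment instead of failing the request.
                return Optional.empty();
            }
            // Critical dependency: fail closed and let the caller reject the operation.
            throw failure;
        }
    }
}
```

A caller would mark the missing fragment in its response when the optional call comes back empty, rather than silently dropping it.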
8. Failure Mode Matrix
A practical design artifact:
| Dependency | Criticality | Timeout | Retry | Fallback | Isolation | Observability |
|---|---|---|---|---|---|---|
| Case DB | critical | short DB query timeout | limited transaction retry on transient conflict | no | DB pool + transaction boundary | connection pool, slow query, lock wait |
| Party Service | optional for summary | 300ms | maybe 1 retry for GET only | omit enrichment | separate HTTP pool | dependency latency, timeout count |
| Policy Service | critical for decision | 500ms within deadline | no blind retry for command | last-known-good only if approved | separate pool + circuit breaker | policy version, decision trace |
| Audit Outbox | critical local table | same transaction | no remote retry in transaction | no | local DB | outbox insert count |
| Broker Publisher | deferrable | async publisher timeout | yes with backoff | no | worker pool | publish lag, DLQ, retry age |
This matrix is not bureaucracy. It is how architecture becomes executable.
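One way to make the matrix executable, as a hedged sketch: each row becomes a value that client code must look up before calling a dependency. `DependencyCriticality`, `DependencyPolicy`, and `DependencyPolicies` are hypothetical names; the timeouts and retry counts are taken from the Party Service and Policy Service rows above, and the policy-service fallback is assumed to be off until explicitly approved.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical classification of a dependency.
enum DependencyCriticality { CRITICAL, OPTIONAL, DEFERRABLE }

// Hypothetical type: one row of the failure mode matrix expressed as data.
record DependencyPolicy(DependencyCriticality criticality, Duration timeout, int maxRetries, boolean fallbackAllowed) {

    DependencyPolicy {
        if (timeout == null || timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }
    }
}

// Hypothetical registry: a dependency without a declared policy cannot be called.
final class DependencyPolicies {

    private static final Map<String, DependencyPolicy> POLICIES = Map.of(
        "party-service",
        new DependencyPolicy(DependencyCriticality.OPTIONAL, Duration.ofMillis(300), 1, true),
        "policy-service",
        new DependencyPolicy(DependencyCriticality.CRITICAL, Duration.ofMillis(500), 0, false));

    private DependencyPolicies() {}

    static DependencyPolicy forDependency(String name) {
        DependencyPolicy policy = POLICIES.get(name);
        if (policy == null) {
            throw new IllegalStateException("No failure policy declared for dependency: " + name);
        }
        return policy;
    }
}
```

Rejecting unknown dependencies is the design choice worth noting: a new outbound call cannot ship without someone deciding its timeout, retry, and fallback behavior.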
9. Exception Taxonomy in Java
A production Java service should not let low-level exceptions leak across layers.
Bad:
catch (Exception e) {
    throw new RuntimeException(e);
}
Worse:
throw new RuntimeException("Something went wrong");
Better: classify failures explicitly.
public sealed interface ServiceFailure permits
        ValidationFailure,
        ConflictFailure,
        DependencyFailure,
        CapacityFailure,
        InternalFailure {
    String code();
    String message();
}

public record ValidationFailure(String code, String message) implements ServiceFailure {}

public record ConflictFailure(String code, String message) implements ServiceFailure {}

public record DependencyFailure(
    String code,
    String message,
    String dependency,
    boolean retryable,
    boolean degradedResultAllowed
) implements ServiceFailure {}

public record CapacityFailure(
    String code,
    String message,
    boolean retryable
) implements ServiceFailure {}

public record InternalFailure(String code, String message) implements ServiceFailure {}
The point is not to over-engineer errors. The point is to avoid treating these as equal:
- user submitted invalid transition,
- party service timed out,
- DB connection pool exhausted,
- duplicate idempotency key,
- optimistic lock conflict,
- bug in mapper,
- policy engine unavailable.
They require different responses.
10. Mapping Failure to API Response
A common mistake is mapping all exceptions to 500.
Better mapping:
| Failure | HTTP |
|---|---|
| malformed JSON | 400 |
| validation failure | 422 |
| unauthorized | 401 |
| forbidden | 403 |
| not found | 404 |
| optimistic lock conflict | 409 |
| idempotency key conflict | 409 |
| rate limited | 429 |
| dependency timeout | 503 or 504, depending on the boundary |
| local overload | 503 |
| internal bug | 500 |
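A minimal sketch of this mapping in Java, assuming the sealed `ServiceFailure` hierarchy from section 9 (redeclared here so the example is self-contained). `HttpStatusMapper` is a hypothetical name, and mapping every dependency failure to 503 is a simplification: a gateway-style boundary might choose 504 for timeouts.

```java
// Failure hierarchy as in section 9, repeated so this sketch compiles on its own.
sealed interface ServiceFailure permits
        ValidationFailure, ConflictFailure, DependencyFailure, CapacityFailure, InternalFailure {
    String code();
    String message();
}

record ValidationFailure(String code, String message) implements ServiceFailure {}

record ConflictFailure(String code, String message) implements ServiceFailure {}

record DependencyFailure(String code, String message, String dependency,
                         boolean retryable, boolean degradedResultAllowed) implements ServiceFailure {}

record CapacityFailure(String code, String message, boolean retryable) implements ServiceFailure {}

record InternalFailure(String code, String message) implements ServiceFailure {}

// Hypothetical mapper: one place where failure classes become HTTP status codes.
final class HttpStatusMapper {

    private HttpStatusMapper() {}

    static int toStatus(ServiceFailure failure) {
        if (failure instanceof ValidationFailure) {
            return 422; // the request was understood but violates validation rules
        }
        if (failure instanceof ConflictFailure) {
            return 409; // optimistic lock or idempotency key conflict
        }
        if (failure instanceof DependencyFailure) {
            return 503; // assumption: this service is not acting as a gateway
        }
        if (failure instanceof CapacityFailure) {
            return 503; // local overload
        }
        return 500; // InternalFailure: a bug on our side
    }
}
```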
Example problem response:
{
  "type": "https://errors.example.com/dependency-timeout",
  "title": "Dependency timeout",
  "status": 503,
  "detail": "Party Service did not respond within the deadline while loading the party profile, so the request could not be completed.",
  "instance": "/cases/CASE-123",
  "errorCode": "CASE_PARTY_PROFILE_TIMEOUT",
  "dependency": "party-service",
  "degraded": false,
  "correlationId": "01HZ..."
}
A degraded response is not the same as a failed response. If the API returns 200 with missing optional fragment, the response should expose the completeness contract.
{
  "caseId": "CASE-123",
  "status": "UNDER_REVIEW",
  "party": null,
  "meta": {
    "partial": true,
    "missingFragments": ["partyProfile"],
    "freshness": {
      "case": "2026-07-05T10:15:31Z",
      "partyProfile": null
    }
  }
}
11. Failure-Aware Service Design
A service operation should be designed with failure behavior as part of its contract.
Example command: SubmitCaseForReview.
Functional happy path
Draft case -> validate completeness -> transition to submitted -> emit audit event
Failure-aware design
| Step | Failure | Behavior |
|---|---|---|
| Load case | not found | 404 |
| Check state | invalid transition | 409/422 |
| Validate completeness | missing evidence | 422 |
| Persist transition | optimistic lock conflict | 409, client may reload |
| Insert outbox audit event | DB failure | transaction fails; no false success |
| Publish audit event | broker down | async retry from outbox |
| Notify reviewer | notification provider down | async retry; case submission still valid |
Java application service sketch:
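import java.time.Clock;
import java.time.Instant;

public final class SubmitCaseForReviewHandler {
    private final CaseRepository cases;
    private final Outbox outbox;
    private final Clock clock;

    public SubmitCaseForReviewHandler(CaseRepository cases, Outbox outbox, Clock clock) {
        this.cases = cases;
        this.outbox = outbox;
        this.clock = clock;
    }

    // Expected to run inside one local DB transaction so the state change
    // and the outbox append commit or roll back together.
    public SubmitCaseForReviewResult handle(SubmitCaseForReview command) {
        // Read the clock once so the transition and the audit event carry the same timestamp.
        Instant now = clock.instant();
        CaseId caseId = CaseId.of(command.caseId());

        // Lock the row to serialize concurrent transitions on the same case.
        CaseRecord record = cases.findByIdForUpdate(caseId)
            .orElseThrow(() -> new CaseNotFoundException(caseId));
        EnforcementCase enforcementCase = CaseMapper.toDomain(record);

        // The domain object enforces the invariant and rejects invalid transitions.
        enforcementCase.submitForReview(
            SubmittedBy.of(command.actorId()),
            SubmissionTime.of(now)
        );
        cases.save(CaseMapper.toRecord(enforcementCase));

        // Persist the audit intent locally; a separate publisher sends it to the broker.
        outbox.append(AuditIntegrationEvent.caseSubmittedForReview(
            enforcementCase.id().value(),
            command.actorId(),
            now
        ));

        return new SubmitCaseForReviewResult(
            enforcementCase.id().value(),
            enforcementCase.status().name()
        );
    }
}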
Important detail: the outbox append is in the same local transaction as the case state transition. Broker publishing is not.
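The publishing side can be sketched separately. The following is an in-memory illustration only: `InMemoryOutbox` and its methods are hypothetical, a real outbox is a database table written in the same transaction as the state change, and the `Consumer<String>` stands in for a broker client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for an outbox table plus its relay.
final class InMemoryOutbox {

    record Entry(long id, String payload) {}

    private final List<Entry> pending = new ArrayList<>();
    private long nextId = 1;

    // In a real service this insert happens inside the business transaction.
    synchronized void append(String payload) {
        pending.add(new Entry(nextId++, payload));
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    // Publishes pending entries in order and stops at the first failure,
    // so nothing is lost and ordering is preserved for the next attempt.
    synchronized int relay(Consumer<String> publisher) {
        int published = 0;
        while (!pending.isEmpty()) {
            Entry next = pending.get(0);
            try {
                publisher.accept(next.payload());
            } catch (RuntimeException brokerFailure) {
                break; // broker is unavailable: keep the entry and retry on the next run
            }
            pending.remove(0); // mark as sent only after the publish call returned
            published++;
        }
        return published;
    }
}
```

Because an entry is removed only after a successful publish, a crash between the two steps leads to a repeated publish. Delivery is therefore at-least-once, and consumers need to tolerate duplicates.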
12. Remote Call Placement
One of the most important rules:
Do not place slow remote calls inside local database transaction unless you have a very specific, reviewed reason.
Bad:
@Transactional
public void submitDecision(Command command) {
    CaseRecord record = caseRepository.find(command.caseId());
    PolicyDecision decision = policyClient.evaluate(command.policyInput()); // remote call inside TX
    record.apply(decision);
    caseRepository.save(record);
}
Why bad?
- DB transaction remains open while remote service is slow.
- Locks are held longer.
- Connection pool is occupied longer.
- Retry may repeat remote call.
- Unknown outcome becomes harder to reason about.
Better:
public void submitDecision(Command command) {
    // Remote call happens before the transaction starts, so no DB connection or lock is held while waiting.
    PolicyDecision decision = policyClient.evaluate(command.policyInput());
    transactionTemplate.executeWithoutResult(tx -> {
        CaseRecord record = caseRepository.findForUpdate(command.caseId());
        record.apply(decision);
        caseRepository.save(record);
        outbox.append(AuditIntegrationEvent.decisionSubmitted(...));
    });
}
But this still has subtle issues. What if policy evaluation succeeds and transaction fails? You need to ask whether policy evaluation is pure/idempotent or has side effects. If it has side effects, it should be modeled as a saga/workflow step.
13. Overload
Overload is when the service receives or accepts more work than it can complete safely.
Overload can come from:
- normal traffic spike,
- retry storm,
- batch job,
- thundering herd,
- queue replay after outage,
- downstream slowness,
- expensive query,
- noisy tenant,
- deployment cold start,
- HPA lag,
- GC pause,
- connection pool starvation.
The dangerous part: overload often looks like “just latency” at first.
The system does not need a bug to fail. It can fail because normal safety mechanisms were not bounded.
14. Saturation Signals
A service should expose saturation signals.
| Resource | Signal |
|---|---|
| CPU | utilization, throttling, run queue |
| Heap | usage, GC pause, allocation rate |
| Thread pool | active threads, queue depth, rejection count |
| HTTP pool | active connections, pending acquire, timeout |
| DB pool | active, idle, pending, acquisition timeout |
| Broker consumer | lag, processing time, commit latency |
| Executor | queue age, queue size, task rejection |
| Cache | hit rate, eviction, load latency |
| Disk | IO wait, fsync latency |
| External API | latency, error rate, rate limit response |
Latency alone is a late signal. Queue age and pool pending counts are often earlier.
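As an illustration of exposing thread pool signals, the sketch below wraps a standard `ThreadPoolExecutor` with a bounded queue and counts rejections. `ObservedPool` and `Snapshot` are hypothetical names; a real service would export these values through its metrics library rather than a snapshot record.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical wrapper: a bounded pool that exposes its saturation signals.
final class ObservedPool {

    record Snapshot(int activeThreads, int queueDepth, long rejectedTasks) {}

    private final AtomicLong rejected = new AtomicLong();
    private final ThreadPoolExecutor executor;

    ObservedPool(int threads, int queueCapacity) {
        this.executor = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity), // bounded queue: overload becomes visible
            (task, pool) -> {
                rejected.incrementAndGet();          // count the rejection before failing fast
                throw new RejectedExecutionException("pool saturated");
            });
    }

    void submit(Runnable task) {
        executor.execute(task);
    }

    // getActiveCount() is an approximation according to the JDK documentation.
    Snapshot snapshot() {
        return new Snapshot(executor.getActiveCount(), executor.getQueue().size(), rejected.get());
    }

    void shutdownNow() {
        executor.shutdownNow();
    }
}
```

Queue depth and rejection count rise before latency percentiles do, which is why they are useful as early signals.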
15. Load Shedding
When overloaded, a service should reject work it cannot safely complete.
This feels harsh. It is often the safest behavior.
Bad overloaded behavior:
accept everything -> queue forever -> timeout everything -> retry storm -> total collapse
Better overloaded behavior:
accept within capacity -> reject excess quickly -> preserve critical path -> recover
Example:
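import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public final class CapacityGuard {
    private final Semaphore permits;

    public CapacityGuard(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    public <T> T execute(Supplier<T> operation) {
        // tryAcquire() never blocks: excess work is rejected immediately instead of queued.
        boolean acquired = permits.tryAcquire();
        if (!acquired) {
            // ServiceOverloadedException is an application-defined exception, typically mapped to 503.
            throw new ServiceOverloadedException("CASE_SERVICE_OVERLOADED");
        }
        try {
            return operation.get();
        } finally {
            // Always return the permit, even when the operation throws.
            permits.release();
        }
    }
}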
This is a simplified example. In production, you also need:
- priority classes,
- tenant isolation,
- endpoint-specific limits,
- queue timeout,
- metrics,
- clear Retry-After behavior where appropriate,
- rejection at edge/gateway before expensive work starts.
16. Failure-Aware Dependency Graph
A useful architecture diagram should include failure semantics.
This diagram is not decorative. It encodes architectural behavior.
17. Fail Open, Fail Closed, Fail Degraded
Fail closed
Reject the operation to preserve correctness/security/compliance.
Use when:
- authorization unavailable,
- policy decision unavailable for legally binding action,
- DB cannot persist mandatory state,
- idempotency cannot be checked for non-idempotent command,
- audit event cannot be durably recorded if audit is mandatory.
Fail open
Allow operation despite missing check.
Use rarely, and only when explicitly approved.
Examples:
- allow low-risk read if feature flag service unavailable and default is safe,
- allow non-critical UI rendering without personalization.
Do not fail open accidentally.
Fail degraded
Return reduced functionality while being honest about it.
Examples:
- omit optional enrichment,
- use cached reference data with freshness marker,
- delay notification,
- return stale read model with watermark.
Fail degraded is not a trick to hide failure. It is an explicit product and architecture decision.
18. Unknown Outcome Problem
The most painful failure is not “success” or “failure”. It is “I do not know.”
Example:
Case Service sends command to Payment/External/Policy service.
Connection times out.
Did the dependency process the command?
Possible outcomes:
- dependency never received command,
- dependency received and failed,
- dependency received and succeeded but response was lost,
- dependency is still processing.
If the operation has side effects, blind retry may duplicate work.
Defense:
- idempotency key,
- operation ID,
- status query,
- outbox/inbox,
- saga state,
- reconciliation job,
- external reference ID,
- audit trail.
public record ExternalOperationId(String value) {
    public static ExternalOperationId forCaseDecision(String caseId, String decisionId) {
        return new ExternalOperationId("case-decision:%s:%s".formatted(caseId, decisionId));
    }
}
Do not make remote commands without an operation identity.
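On the receiving side, an operation identity lets the dependency replay a previous result instead of repeating the side effect. The sketch below is a simplified in-memory illustration: `IdempotentExecutor` is a hypothetical name, and a production idempotency store would need to be durable and shared across instances.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical in-memory idempotency store keyed by operation ID.
final class IdempotentExecutor<R> {

    private final ConcurrentMap<String, R> results = new ConcurrentHashMap<>();

    // Runs the side effect at most once per operation ID within this JVM;
    // a retry with the same ID gets the stored result back.
    R execute(String operationId, Supplier<R> sideEffect) {
        // computeIfAbsent runs the supplier atomically per key and stores nothing if it throws,
        // so a failed attempt can safely be retried.
        return results.computeIfAbsent(operationId, id -> sideEffect.get());
    }
}
```

With this in place, a caller that timed out can retry using the same operation ID and obtain the original outcome rather than creating a second decision.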
19. Failure Model for Async Messaging
Messaging changes the failure shape. It does not remove failure.
Async failure examples:
- message not published,
- message published but not consumed,
- duplicate message,
- poison message,
- out-of-order message,
- consumer lag,
- DLQ growth,
- schema incompatible,
- broker partition unavailable,
- replay overload,
- projection update failure.
Async systems need their own failure model:
| Failure | Detection | Recovery |
|---|---|---|
| publish failed | outbox pending age | publisher retry |
| duplicate consumed | inbox unique key conflict | ignore/replay previous result |
| poison message | retry count exceeded | DLQ + operator workflow |
| projection lag | watermark age | scale consumer / investigate |
| out-of-order event | aggregate version gap | buffer, refetch, or reject |
| schema incompatible | deserialization error | compatibility test + DLQ |
| replay overload | queue age and CPU saturation | throttle replay |
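The duplicate and out-of-order rows can be sketched with a per-aggregate version check. This is an illustration under stated assumptions: `VersionedProjection` is a hypothetical name, versions are assumed to start at 1 and increase by exactly 1 per aggregate, and what happens after a detected gap (buffer, refetch, or reject) is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical projection guard: decides whether an event may be applied.
final class VersionedProjection {

    enum Outcome { APPLIED, DUPLICATE_IGNORED, GAP_DETECTED }

    // Last applied version per aggregate ID; absent means nothing applied yet.
    private final Map<String, Long> lastApplied = new HashMap<>();

    synchronized Outcome apply(String aggregateId, long version) {
        long last = lastApplied.getOrDefault(aggregateId, 0L);
        if (version <= last) {
            return Outcome.DUPLICATE_IGNORED; // already applied: safe to drop
        }
        if (version > last + 1) {
            return Outcome.GAP_DETECTED;      // an earlier event is missing: do not apply yet
        }
        lastApplied.put(aggregateId, version);
        return Outcome.APPLIED;
    }
}
```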
20. Java Threading and Failure
Java microservices often fail not because CPU is fully used, but because execution resources are blocked.
Servlet-style thread-per-request
In a traditional Spring MVC/Tomcat model:
- each request occupies a request thread,
- blocking remote calls occupy threads,
- slow dependency consumes threads,
- once threads are exhausted, new requests queue or fail.
Reactive/event-loop model
In a reactive model:
- event-loop threads must not block,
- blocking call on event loop can stall many requests,
- backpressure must be explicit,
- thread pool boundaries matter.
Different model, different failure shape.
Bad reactive code:
public Mono<CaseSummary> getCase(String id) {
    return Mono.just(repository.findById(id)); // blocking call inside reactive chain
}
Better:
public Mono<CaseSummary> getCase(String id) {
    // Defer the blocking call and run it on a scheduler intended for blocking work.
    return Mono.fromCallable(() -> repository.findById(id))
        .subscribeOn(Schedulers.boundedElastic());
}
This is not an endorsement to make everything reactive. It is a reminder: concurrency model is part of failure model.
21. Dependency Isolation
A single bad dependency should not consume all resources.
Bad:
all outbound calls share one HTTP connection pool
If Party Service is slow, Policy Service calls may also starve.
Better:
separate client config per dependency
separate timeout
separate pool
separate circuit breaker
separate metrics
Example config idea:
dependencies:
  party-service:
    base-url: https://party.internal
    connect-timeout: 100ms
    response-timeout: 300ms
    max-connections: 50
    criticality: optional
  policy-service:
    base-url: https://policy.internal
    connect-timeout: 100ms
    response-timeout: 500ms
    max-connections: 30
    criticality: critical
Failure model should be visible in configuration.
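The isolation idea can be sketched in plain Java with one concurrency limit per dependency. `Bulkheads` is a hypothetical name, and the sketch only limits concurrent calls; separate connection pools, timeouts, and circuit breakers would be configured on the actual clients.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical registry: each dependency gets its own concurrency budget.
final class Bulkheads {

    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    void register(String dependency, int maxConcurrentCalls) {
        permits.put(dependency, new Semaphore(maxConcurrentCalls));
    }

    <T> T call(String dependency, Supplier<T> remoteCall) {
        Semaphore semaphore = permits.get(dependency);
        if (semaphore == null) {
            throw new IllegalArgumentException("No bulkhead registered for " + dependency);
        }
        // Fail fast when this dependency's budget is used up; other dependencies are unaffected.
        if (!semaphore.tryAcquire()) {
            throw new IllegalStateException("Bulkhead full for " + dependency);
        }
        try {
            return remoteCall.get();
        } finally {
            semaphore.release();
        }
    }
}
```

If all party-service permits are held by slow calls, further party-service calls are rejected immediately while policy-service calls still go through.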
22. Incident-Oriented Questions
For each service, ask:
- What happens if local DB is slow?
- What happens if local DB rejects connections?
- What happens if each remote dependency is slow?
- What happens if each remote dependency returns 5xx?
- What happens if dependency succeeds but response is lost?
- What happens if broker is down?
- What happens if event consumer is 2 hours behind?
- What happens if message is duplicated?
- What happens if message is out of order?
- What happens if cache is stale?
- What happens if one tenant sends 10x traffic?
- What happens if a batch job replays one million events?
- What happens if deployment creates mixed versions?
- What happens if config is wrong?
- What happens if clock is skewed?
- What happens if audit publishing is delayed?
- What happens if policy engine is unavailable?
- What happens if external vendor rate-limits us?
- What happens if HPA scales too late?
- What happens if service receives traffic during startup?
If the answer is “we’ll see”, the architecture is incomplete.
23. Review Checklist
Use this before approving a service design.
Dependency classification
- Each dependency is classified as critical, optional, or deferrable.
- Each dependency has timeout, retry, fallback, and isolation policy.
- Remote calls inside DB transactions are explicitly reviewed.
- Side-effecting remote commands have operation IDs/idempotency.
Failure containment
- Optional dependency failure does not break critical path.
- Deferrable work is persisted before async processing.
- Overload behavior is explicit.
- Load shedding exists for expensive endpoints.
- Dependency pools are separated where needed.
Data and messaging
- Duplicate messages are safe.
- Out-of-order events are handled or rejected safely.
- Projection lag is measurable.
- Reconciliation exists for important async state.
- Unknown outcome cases have recovery path.
Observability
- Dependency latency and error metrics exist.
- Saturation metrics exist.
- Error responses include correlation ID.
- Degraded responses are visible.
- Dashboards distinguish local failure vs dependency failure.
Business semantics
- Fallback is business-approved.
- Fail-open behavior is explicitly approved.
- Compliance-critical actions fail closed.
- Audit trail can explain failure and recovery.
24. Common Failure Smells
Smell 1: “All exceptions are 500”
This means the service cannot distinguish user error, conflict, dependency failure, overload, or bug.
Smell 2: “No timeout because HTTP client has default”
Default timeout is usually not your business timeout.
Smell 3: “Retry everywhere”
Retry without idempotency and budget is a load amplifier.
Smell 4: “Fallback returns fake success”
Fallback must preserve business truth. Fake success creates semantic corruption.
Smell 5: “Healthy but unusable”
Health endpoint says UP while service cannot process main capability because DB pool is exhausted or projection is 2 hours behind.
Smell 6: “Async means reliable”
Async only moves failure into queue, lag, DLQ, replay, and reconciliation.
Smell 7: “Service boundary equals failure boundary”
Not if dependencies are synchronous, timeouts are long, and retries are unbounded.
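For Smell 3, a bounded retry can be sketched as follows. `BoundedRetry` is a hypothetical name; the sketch caps attempts per call, retries only errors the caller marks as retryable, and uses exponential backoff. Jitter and a service-wide retry budget are deliberately left out, and the call passed in is assumed to be idempotent.

```java
import java.util.function.LongConsumer;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper: retries an idempotent call a bounded number of times.
final class BoundedRetry {

    private final int maxAttempts;
    private final long baseDelayMillis;
    private final Predicate<RuntimeException> retryable;
    private final LongConsumer sleeper; // injected so tests do not need to sleep

    BoundedRetry(int maxAttempts, long baseDelayMillis,
                 Predicate<RuntimeException> retryable, LongConsumer sleeper) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
        this.retryable = retryable;
        this.sleeper = sleeper;
    }

    <T> T execute(Supplier<T> idempotentCall) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return idempotentCall.get();
            } catch (RuntimeException e) {
                last = e;
                // Stop on non-retryable errors or when the attempt limit is reached.
                if (!retryable.test(e) || attempt == maxAttempts) {
                    break;
                }
                // Exponential backoff: base, 2x base, 4x base, ...
                sleeper.accept(baseDelayMillis << (attempt - 1));
            }
        }
        throw last;
    }
}
```

In production the sleeper would block or schedule a delayed retry, and the total time spent retrying would have to fit inside the caller's deadline.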
25. Exercise
Take one service you own or imagine a Case Decision Service.
Create a failure model table with these columns:
operation
dependency
criticality
failure mode
timeout
retry policy
fallback policy
idempotency requirement
observability signal
operator action
Then answer:
- Which dependency failure becomes user-visible?
- Which failure should degrade?
- Which failure must fail closed?
- Which failure can be delayed?
- Which retries are unsafe?
- Which unknown outcomes need reconciliation?
- Which overload signal appears first?
- Which metric should page a human?
If you cannot answer those, the service is not production-designed yet.
26. Summary
Failure model is the bridge between architecture diagram and production reality.
A Java microservice should not only define:
- endpoints,
- DTOs,
- database tables,
- events,
- repositories,
- deployment manifest.
It must define:
- how it fails,
- how it contains failure,
- which dependencies are critical,
- which operations are retry-safe,
- which results may be partial,
- which actions fail closed,
- which side effects are deferrable,
- which overload signals matter,
- how humans diagnose and recover it.
In distributed systems, reliability is not an afterthought. It is part of the domain design.
References
- Google SRE Book — Addressing Cascading Failures: https://sre.google/sre-book/addressing-cascading-failures/
- Google SRE Book — Handling Overload: https://sre.google/sre-book/handling-overload/
- Microsoft Azure Architecture Center — Retry pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/retry
- Microsoft Azure Architecture Center — Circuit Breaker pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
- RFC 9110 — HTTP Semantics: https://www.rfc-editor.org/rfc/rfc9110.html
- gRPC Status Codes: https://grpc.io/docs/guides/status-codes/