Error Response Modeling: Problem Details, Retriability, Diagnostics

Learn Java Microservices Communication - Part 018

Production-grade guide to modeling HTTP error responses in Java microservices using status codes, Problem Details, retriability classification, diagnostics, and safe operational contracts.

2026-07-05


Part 018 — Error Response Modeling: Problem Details, Retriability, Diagnostics

A good error response is not an apology.

It is a control signal.

In microservices, an error response tells the caller:

What failed?
Was the request accepted?
Can the caller retry?
Should the caller change the request?
Is the dependency unhealthy?
Is this a business/domain rejection?
Is this a security boundary?
How can operators diagnose it?

A bad error response hides those answers.

Example:

{
  "message": "Something went wrong"
}

This is almost useless for service-to-service communication.

A production-grade error model must be:

semantically aligned with HTTP status codes;
machine-readable;
stable enough for clients;
safe enough for logs;
diagnostic enough for operators;
explicit about retriability;
compatible with tracing and correlation;
consistent across services.

This part builds that model.

1. Status Code Is Necessary but Not Sufficient

HTTP status code is the first error signal.

It is not the whole error contract.

HTTP/1.1 409 Conflict
Content-Type: application/problem+json

The status code says:

The request conflicts with current state.

But it does not say:

Which state?
Which business invariant?
Can the caller retry?
What is the stable error code?
What should be shown to an operator?
What correlation ID should be used for investigation?

That detail belongs in the error body and headers.

Think of an HTTP error as three layers: the status code, the problem body, and the response headers.

A mature service uses all three.

2. Error Response Is Part of the API Contract

Many teams design success responses carefully but treat error responses as implementation detail.

That is wrong.

For internal microservice communication, errors are part of the dependency contract.

A caller must know how to react to:

validation failure;
authorization failure;
missing resource;
stale version;
duplicate command;
rate limit;
timeout;
dependency failure;
unavailable service;
unknown outcome.

If those are not modeled, callers invent inconsistent behavior.

One caller retries validation failures.

Another swallows 500.

Another treats 404 as success.

Another pages operators for domain rejection.

That inconsistency becomes a distributed system failure.

3. RFC 9457 Problem Details

RFC 9457 defines Problem Details for HTTP APIs.

The media type for the JSON form is:

Content-Type: application/problem+json

A problem response has standard members:

{
  "type": "https://errors.example.internal/case-version-conflict",
  "title": "Case version conflict",
  "status": 409,
  "detail": "The case was modified by another process.",
  "instance": "/cases/CASE-123/commands/submit-review"
}

Core fields:

Field	Meaning
`type`	Stable identifier for the problem type
`title`	Short human-readable summary
`status`	HTTP status code associated with this occurrence
`detail`	Human-readable detail for this occurrence
`instance`	URI reference identifying the specific occurrence

Problem Details also allows extension members.

For internal microservices, extensions are where you add operationally useful but governed fields.

Example:

{
  "type": "https://errors.example.internal/case-version-conflict",
  "title": "Case version conflict",
  "status": 409,
  "detail": "The case was modified by another process.",
  "instance": "/problems/01J0XYZABCD123",
  "errorCode": "CASE_VERSION_CONFLICT",
  "correlationId": "9f9c3a0e3c4b4e6b",
  "retryable": false,
  "currentVersion": 42,
  "expectedVersion": 41
}

Do not add random fields per service without governance.

An extension field becomes a contract if clients use it.

4. Problem Type vs Error Code

Problem Details already has type.

So why add errorCode?

Because service clients often need a compact stable enum-like value.

Use both with clear semantics.

Field	Role
`type`	Globally unique problem type URI
`errorCode`	Compact domain/platform code for client logic and metrics
`title`	Short human-readable summary
`detail`	Occurrence-specific explanation

Example:

{
  "type": "https://errors.example.internal/idempotency-key-reuse",
  "title": "Idempotency key reused with different request body",
  "status": 409,
  "errorCode": "IDEMPOTENCY_KEY_REUSE",
  "retryable": false
}

errorCode should be stable and low-cardinality.

Do not generate codes dynamically.

Bad:

{
  "errorCode": "VALIDATION_FAILED_FIELD_customer.addresses[13].postalCode_2026_07_05_10_21_33"
}

Good:

{
  "errorCode": "VALIDATION_FAILED"
}

Put field-specific detail in structured violations, not in the code.

5. Error Classification Model

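Every error should belong to a class, such as validation failure, auth failure, business conflict, rate limit, overload, dependency unavailable, internal error, or timeout/unknown outcome.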

This classification drives response shape, retry policy, alerting, and client behavior.
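One way to make the classification concrete is a small enum that carries the defaults each class implies. The sketch below is illustrative only: the class names mirror the alerting table later in this part, while `ErrorClass`, `AlertPolicy`, and the specific retry defaults are assumptions, not a fixed standard.

```java
/** How operators are notified for a class; the values are illustrative assumptions. */
enum AlertPolicy { NONE, SECURITY_MONITORING, IF_SUSTAINED, ALWAYS }

/** Hypothetical error classes, each with a default retry hint and alert policy. */
enum ErrorClass {
    VALIDATION(false, AlertPolicy.NONE),
    AUTH(false, AlertPolicy.SECURITY_MONITORING),
    BUSINESS_CONFLICT(false, AlertPolicy.NONE),
    RATE_LIMIT(true, AlertPolicy.IF_SUSTAINED),
    OVERLOAD(true, AlertPolicy.IF_SUSTAINED),
    DEPENDENCY_UNAVAILABLE(true, AlertPolicy.IF_SUSTAINED),
    // The retry hint is only a hint; the client still applies its own budget.
    INTERNAL(true, AlertPolicy.ALWAYS),
    UNKNOWN_OUTCOME(true, AlertPolicy.IF_SUSTAINED);

    private final boolean retryableByDefault;
    private final AlertPolicy alertPolicy;

    ErrorClass(boolean retryableByDefault, AlertPolicy alertPolicy) {
        this.retryableByDefault = retryableByDefault;
        this.alertPolicy = alertPolicy;
    }

    /** Server-side default for the retryable flag of problems in this class. */
    boolean retryableByDefault() {
        return retryableByDefault;
    }

    /** Default alerting treatment for problems in this class. */
    AlertPolicy alertPolicy() {
        return alertPolicy;
    }
}
```

Keeping these defaults in one place means a new error code only has to pick a class to inherit consistent retry and alerting behavior.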

6. Status Code Mapping for Internal APIs

Use status codes consistently.

Status	Use when	Retry?
400	Syntax/shape request error	No
401	Missing/invalid authentication	No, unless token refresh path applies
403	Authenticated but not allowed	No
404	Resource not found or not visible	Usually no
405	Method not allowed	No
409	State conflict	Usually no automatic retry; caller may re-read state
410	Resource permanently gone	No
412	Precondition failed	No automatic retry; caller must refresh precondition
415	Unsupported media type/content encoding	No
422	Semantically invalid request	No
425	Too early / unsafe replay concern	Retry only according to protocol/policy
429	Rate limited	Maybe, after `Retry-After` and budget check
500	Server bug/unclassified failure	Maybe, but carefully
502	Bad gateway/upstream invalid response	Maybe
503	Service unavailable/overloaded/maintenance	Maybe, after budget check
504	Gateway/upstream timeout	Maybe, but outcome may be unknown

Do not build client logic from status family alone.

For example:

5xx does not always mean safe to retry.
409 does not always mean fatal forever.
404 may be expected in eventually consistent reads.
429 may be retryable only after delay.
504 may hide unknown server-side outcome.

The error body should refine the decision.

7. Retriability Must Be Explicit but Not Blindly Trusted

Include retriability as part of the error contract.

Example:

{
  "type": "https://errors.example.internal/rate-limited",
  "title": "Rate limit exceeded",
  "status": 429,
  "errorCode": "RATE_LIMITED",
  "retryable": true,
  "retryAfterMillis": 5000
}

But the client must still apply its own budget.

Server says retryable.
Client asks: Do I still have deadline budget? Is method safe? Is idempotency guaranteed? Is retry budget available?

Retriability has two sides:

Side	Responsibility
Server	Classify the failure accurately
Client	Decide whether retry is safe within its own budget

A server cannot know the caller's end-to-end deadline or business operation context.

So retryable=true means:

The server believes retry may succeed later.

It does not mean:

The client must retry.

8. Unknown Outcome Errors

Unknown outcome is one of the most important concepts in distributed communication.

A client timeout does not prove the server did nothing.

A gateway timeout does not prove the downstream operation failed.

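Example: the client sends a command, the server executes it, and the response is lost or arrives after the client's deadline.

The client sees a timeout.

But the operation may have succeeded.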

For command endpoints, error modeling must distinguish:

Rejected before execution
Failed during execution
Accepted but completion unknown to caller
Completed but response lost

HTTP alone cannot always tell you which occurred.

That is why commands need idempotency keys, operation IDs, or status lookup patterns.

Error model example:

{
  "type": "https://errors.example.internal/operation-outcome-unknown",
  "title": "Operation outcome unknown",
  "status": 504,
  "errorCode": "OPERATION_OUTCOME_UNKNOWN",
  "retryable": true,
  "safeToRetryWithSameIdempotencyKey": true,
  "operationId": "OP-789"
}

The key is precision.

Do not collapse unknown outcome into generic INTERNAL_ERROR.
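A caller can turn the fields of an unknown-outcome problem into an explicit next step. The following is a minimal sketch, assuming the `operationId` and `safeToRetryWithSameIdempotencyKey` extensions shown above; `NextStep`, `UnknownOutcome`, and `UnknownOutcomeResolver` are hypothetical names introduced for illustration.

```java
/** Hypothetical next steps a caller can take after an unknown outcome. */
enum NextStep { RETRY_WITH_SAME_KEY, LOOK_UP_OPERATION, ESCALATE }

/** The problem-body fields that matter for this decision. */
record UnknownOutcome(String operationId, boolean safeToRetryWithSameIdempotencyKey) {}

final class UnknownOutcomeResolver {
    private UnknownOutcomeResolver() {}

    /**
     * Chooses a next step for a command whose outcome is unknown.
     * sentIdempotencyKey tells whether the original request carried a key.
     */
    static NextStep resolve(UnknownOutcome outcome, boolean sentIdempotencyKey) {
        // Replaying is only safe when the server says so AND a key was actually sent.
        if (outcome.safeToRetryWithSameIdempotencyKey() && sentIdempotencyKey) {
            return NextStep.RETRY_WITH_SAME_KEY;
        }
        // Otherwise, ask the server what happened if it gave us a handle.
        if (outcome.operationId() != null && !outcome.operationId().isBlank()) {
            return NextStep.LOOK_UP_OPERATION;
        }
        // No safe automatic path: hand over to a human or a reconciliation job.
        return NextStep.ESCALATE;
    }
}
```

The point of the sketch is that a blind retry is never the default: without an idempotency key or an operation handle, the caller escalates instead of guessing.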

9. Validation Errors

Validation errors must be structured.

Bad:

{
  "message": "Invalid request"
}

Better:

{
  "type": "https://errors.example.internal/validation-failed",
  "title": "Validation failed",
  "status": 400,
  "errorCode": "VALIDATION_FAILED",
  "retryable": false,
  "violations": [
    {
      "field": "decision.reasonCode",
      "code": "REQUIRED",
      "message": "reasonCode is required"
    },
    {
      "field": "decision.effectiveDate",
      "code": "MUST_BE_FUTURE_OR_PRESENT",
      "message": "effectiveDate must not be in the past"
    }
  ]
}

Rules:

Use stable violation codes.
Keep field paths predictable.
Do not echo raw user input if it may be sensitive.
Do not expose internal validator class names.
Do not make clients parse human messages.

Human messages are for humans.

Codes are for machines.
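To show how stable violation codes and predictable field paths come together, here is a small hand-written validator. It is a sketch under assumptions: `Decision` and `DecisionValidator` are hypothetical, and a real service would more likely map violations from its validation framework into the same shape.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** One structured violation: stable code for machines, message for humans. */
record Violation(String field, String code, String message) {}

/** Hypothetical request payload used only for this example. */
record Decision(String reasonCode, LocalDate effectiveDate) {}

final class DecisionValidator {
    private DecisionValidator() {}

    /** Returns all violations for the decision; an empty list means it is valid. */
    static List<Violation> validate(Decision decision, LocalDate today) {
        List<Violation> violations = new ArrayList<>();
        if (decision.reasonCode() == null || decision.reasonCode().isBlank()) {
            violations.add(new Violation(
                    "decision.reasonCode", "REQUIRED", "reasonCode is required"));
        }
        if (decision.effectiveDate() == null) {
            violations.add(new Violation(
                    "decision.effectiveDate", "REQUIRED", "effectiveDate is required"));
        } else if (decision.effectiveDate().isBefore(today)) {
            violations.add(new Violation(
                    "decision.effectiveDate", "MUST_BE_FUTURE_OR_PRESENT",
                    "effectiveDate must not be in the past"));
        }
        // Immutable copy so callers cannot mutate the result.
        return List.copyOf(violations);
    }
}
```

Passing `today` in as a parameter keeps the check deterministic in tests. Note that the messages never include the submitted values.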

10. Business Rule Rejections

A business rule rejection is different from a malformed request.

Example:

The request is well-formed, but the case cannot be escalated because it is already closed.

Possible status:

409 Conflict

Problem body:

{
  "type": "https://errors.example.internal/case-not-escalatable",
  "title": "Case cannot be escalated",
  "status": 409,
  "errorCode": "CASE_NOT_ESCALATABLE",
  "retryable": false,
  "caseId": "CASE-123",
  "currentStatus": "CLOSED"
}

Do not use 500 for domain rejections.

A rejected command may be the successful enforcement of an invariant.

It is not an incident.

Alerting should not page on expected domain conflicts.

11. Optimistic Locking and Preconditions

State-changing APIs should often use preconditions.

Example:

POST /cases/CASE-123/commands/submit-review
If-Match: "case-version-41"

If the resource changed:

HTTP/1.1 412 Precondition Failed
Content-Type: application/problem+json

{
  "type": "https://errors.example.internal/precondition-failed",
  "title": "Precondition failed",
  "status": 412,
  "errorCode": "PRECONDITION_FAILED",
  "retryable": false,
  "expectedVersion": "case-version-41",
  "currentVersion": "case-version-42"
}

Use 409 when the conflict is a domain/state conflict without an explicit HTTP precondition.

Use 412 when an explicit precondition such as If-Match failed.

This distinction helps callers implement correct read-modify-write flows.
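The 409 versus 412 rule can be sketched as a single decision function. This is a simplified illustration: it assumes `If-Match` carries one entity tag (or `*`) rather than a comma-separated list, and `PreconditionCheck` and its parameters are hypothetical names.

```java
final class PreconditionCheck {
    private PreconditionCheck() {}

    /**
     * Picks the response status for a state-changing command.
     *
     * @param ifMatchHeader       raw If-Match value, or null when the header is absent
     * @param currentEtag         entity tag of the current resource state
     * @param domainAllowsCommand whether business rules accept the command right now
     */
    static int statusFor(String ifMatchHeader, String currentEtag, boolean domainAllowsCommand) {
        // An explicit HTTP precondition was sent and does not match: 412.
        boolean hasPrecondition = ifMatchHeader != null && !ifMatchHeader.equals("*");
        if (hasPrecondition && !ifMatchHeader.equals(currentEtag)) {
            return 412;
        }
        // No failed precondition, but the domain state rejects the command: 409.
        if (!domainAllowsCommand) {
            return 409;
        }
        return 200;
    }
}
```

Evaluating the precondition first means a stale caller is told to refresh before it is told about domain state it may not have seen yet.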

12. Rate Limit and Overload Errors

Rate limiting and overload should be explicit.

For rate limit:

HTTP/1.1 429 Too Many Requests
Retry-After: 5
Content-Type: application/problem+json

{
  "type": "https://errors.example.internal/rate-limited",
  "title": "Rate limit exceeded",
  "status": 429,
  "errorCode": "RATE_LIMITED",
  "retryable": true,
  "retryAfterMillis": 5000,
  "limitName": "risk-score-per-client"
}

For overload:

HTTP/1.1 503 Service Unavailable
Retry-After: 2
Content-Type: application/problem+json

{
  "type": "https://errors.example.internal/service-overloaded",
  "title": "Service temporarily overloaded",
  "status": 503,
  "errorCode": "SERVICE_OVERLOADED",
  "retryable": true,
  "retryAfterMillis": 2000
}

Do not use generic 500 for overload.

Overload is a capacity signal.

It should tell clients to slow down, shed, or retry later.
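On the client side, the `Retry-After` header has to be parsed before it can be honored. HTTP allows either a number of seconds or an HTTP date, so a minimal parser might look like the sketch below. `RetryAfter` is a hypothetical helper, and treating malformed or negative values as "no hint" is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

final class RetryAfter {
    private RetryAfter() {}

    /** Parses a Retry-After value into a delay relative to now, if it is usable. */
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null || headerValue.isBlank()) {
            return Optional.empty();
        }
        String value = headerValue.trim();
        try {
            // Form 1: delay in seconds, e.g. "5".
            long seconds = Long.parseLong(value);
            return seconds >= 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
        } catch (NumberFormatException notSeconds) {
            // Not a number; fall through and try the date form.
        }
        try {
            // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            Instant when = ZonedDateTime
                    .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            Duration delay = Duration.between(now, when);
            // A date in the past means "retry now", not a negative wait.
            return Optional.of(delay.isNegative() ? Duration.ZERO : delay);
        } catch (DateTimeParseException unparseable) {
            return Optional.empty();
        }
    }
}
```

The parsed delay is still only a hint: the caller compares it against its remaining deadline before deciding to wait.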

13. Dependency Failure Errors

A service may fail because a downstream dependency failed.

But be careful.

Do not leak internal topology unnecessarily.

Bad:

{
  "message": "NullPointerException from com.internal.risk.PostgresRiskRepository line 72"
}

Better:

{
  "type": "https://errors.example.internal/dependency-unavailable",
  "title": "Required dependency unavailable",
  "status": 503,
  "errorCode": "DEPENDENCY_UNAVAILABLE",
  "retryable": true,
  "dependencyClass": "risk-data-store",
  "correlationId": "9f9c3a0e3c4b4e6b"
}

Expose dependency class, not necessarily concrete hostnames, credentials, table names, or internal stack traces.

For operators, correlation ID and trace ID should lead to deeper details in observability systems.

The API response should not be the full incident report.

14. Authentication and Authorization Errors

Authentication and authorization errors should be precise but safe.

401 Unauthorized

Means authentication is missing or invalid.

403 Forbidden

Means caller is authenticated but not allowed.

But do not leak sensitive authorization details.

Bad:

{
  "message": "User lacks REGULATOR_SUPER_ADMIN on tenant central-bank-prod"
}

Better:

{
  "type": "https://errors.example.internal/access-denied",
  "title": "Access denied",
  "status": 403,
  "errorCode": "ACCESS_DENIED",
  "retryable": false,
  "correlationId": "9f9c3a0e3c4b4e6b"
}

Detailed authorization reasoning belongs in secure audit logs, not necessarily in client-visible response body.

This series will not repeat the full authorization model.

Here the communication rule is simple:

Return enough for the caller to behave correctly, not enough to help an attacker map permissions.

15. Error Body Field Policy

A standard internal problem body can use this shape:

{
  "type": "https://errors.example.internal/example-error",
  "title": "Example error",
  "status": 400,
  "detail": "Human-readable occurrence detail.",
  "instance": "/problems/01J0XYZABCD123",
  "errorCode": "EXAMPLE_ERROR",
  "correlationId": "9f9c3a0e3c4b4e6b",
  "retryable": false
}

Recommended standard extensions:

Field	Type	Cardinality	Purpose
`errorCode`	string	low	Stable machine-readable error code
`correlationId`	string	high but not metric tag	Support investigation
`retryable`	boolean	low	Server-side retry classification
`retryAfterMillis`	number	low/medium	Delay hint when retryable
`violations`	array	bounded	Structured validation issues
`operationId`	string	high but not metric tag	Command/outcome tracking
`safeToRetryWithSameIdempotencyKey`	boolean	low	Command retry safety

Do not use high-cardinality fields as metric labels.

Fields such as correlationId, operationId, caseId, and detail are useful for logs/traces, not for metric cardinality.

16. Do Not Expose Stack Traces

Never return stack traces in production service-to-service error responses.

Bad:

{
  "error": "java.lang.NullPointerException",
  "stackTrace": "com.example.CaseService.submit(CaseService.java:87)..."
}

Why it is bad:

leaks internal structure;
changes whenever code changes;
creates huge payloads;
may expose sensitive data;
encourages clients to depend on implementation detail;
makes logs and traces noisy.

Return a stable error code and correlation ID.

Keep stack trace in internal logs and tracing systems.

17. Human Detail vs Machine Detail

Do not make clients parse English text.

Bad client behavior:

if (problem.detail().contains("already closed")) {
    // handle closed case
}

Better:

if (problem.errorCode().equals("CASE_ALREADY_CLOSED")) {
    // handle closed case
}

Human text can change.

Machine codes must be stable.

Use detail for humans and operators.

Use errorCode, type, and structured fields for clients.

18. Localization

Service-to-service error responses should usually not localize title and detail.

Why?

Internal callers need stable diagnostics.

Localized text can make logs harder to aggregate.

If localization is needed for end users, translate at the edge/UI layer using stable error codes.

Internal service returns: CASE_ALREADY_CLOSED
UI maps to localized message for user.

Do not push user-facing language concerns deep into internal service communication unless the service explicitly owns user-facing content.
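At the edge, translation can be as simple as a lookup keyed by error code. The sketch below uses an in-memory catalog. `EdgeMessages`, the sample texts, and the fallback wording are all illustrative assumptions, and a real UI layer would typically use its own message bundles.

```java
import java.util.Locale;
import java.util.Map;

final class EdgeMessages {
    private static final String FALLBACK = "The request could not be completed.";

    // Error code -> language tag -> user-facing text (sample data only).
    private static final Map<String, Map<String, String>> CATALOG = Map.of(
            "CASE_ALREADY_CLOSED", Map.of(
                    "en", "This case is already closed.",
                    "de", "Dieser Fall ist bereits geschlossen."));

    private EdgeMessages() {}

    /** Maps a stable internal error code to a localized user message. */
    static String messageFor(String errorCode, Locale locale) {
        if (errorCode == null) {
            return FALLBACK;
        }
        Map<String, String> byLanguage = CATALOG.get(errorCode);
        if (byLanguage == null) {
            // Unknown code: show a generic message instead of leaking the code.
            return FALLBACK;
        }
        String english = byLanguage.getOrDefault("en", FALLBACK);
        return byLanguage.getOrDefault(locale.getLanguage(), english);
    }
}
```

The internal service never sees the locale. It only emits the code, and the edge owns the wording.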

19. Java Error Model

Define a shared error model intentionally.

Example:

import java.util.List;

// Shared problem body: RFC 9457 core members plus the governed extension fields.
public record ApiProblem(
        String type,
        String title,
        int status,
        String detail,
        String instance,
        String errorCode,
        String correlationId,
        boolean retryable,
        Long retryAfterMillis,   // nullable: only set when a delay hint exists
        List<Violation> violations
) {
    // Compact constructor: reject problems that lack the fields clients rely on.
    public ApiProblem {
        if (type == null || type.isBlank()) {
            throw new IllegalArgumentException("type is required");
        }
        if (title == null || title.isBlank()) {
            throw new IllegalArgumentException("title is required");
        }
        if (status < 400 || status > 599) {
            throw new IllegalArgumentException("status must be 4xx or 5xx");
        }
        if (errorCode == null || errorCode.isBlank()) {
            throw new IllegalArgumentException("errorCode is required");
        }
        // Normalize to an immutable list so a missing array never surfaces as null.
        violations = violations == null ? List.of() : List.copyOf(violations);
    }
}

Violation model:

public record Violation(
        String field,
        String code,
        String message
) {
    public Violation {
        if (code == null || code.isBlank()) {
            throw new IllegalArgumentException("violation code is required");
        }
    }
}

This is intentionally boring.

Error models should be boring.

Boring means predictable.

20. Error Code Registry

Use a registry.

Not necessarily a heavy central service.

A version-controlled document or enum can be enough.

Example:

public enum ErrorCode {
    VALIDATION_FAILED,
    ACCESS_DENIED,
    RESOURCE_NOT_FOUND,
    CASE_ALREADY_CLOSED,
    CASE_VERSION_CONFLICT,
    IDEMPOTENCY_KEY_REUSE,
    RATE_LIMITED,
    SERVICE_OVERLOADED,
    DEPENDENCY_UNAVAILABLE,
    OPERATION_OUTCOME_UNKNOWN,
    INTERNAL_ERROR
}

Each code should define:

HTTP status
problem type URI
retryable default
safe-to-log fields
owner team
client handling expectation
alerting classification

Example registry entry:

- errorCode: CASE_VERSION_CONFLICT
  status: 409
  type: https://errors.example.internal/case-version-conflict
  retryable: false
  owner: case-management-platform
  clientAction: re-read case state before retrying command
  alert: false

Without a registry, error codes drift.

Drift kills client correctness.
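If the registry lives in code, the enum can carry the registry metadata directly so that handlers cannot pick a mismatched status or type. This is a sketch, not the series' canonical registry: `RegisteredError` is a hypothetical name, and the statuses and retry defaults are taken from the examples in this part.

```java
import java.net.URI;

/** Error codes with the registry metadata attached to each constant. */
enum RegisteredError {
    VALIDATION_FAILED(400, "validation-failed", false),
    ACCESS_DENIED(403, "access-denied", false),
    CASE_VERSION_CONFLICT(409, "case-version-conflict", false),
    RATE_LIMITED(429, "rate-limited", true),
    SERVICE_OVERLOADED(503, "service-overloaded", true),
    DEPENDENCY_UNAVAILABLE(503, "dependency-unavailable", true),
    OPERATION_OUTCOME_UNKNOWN(504, "operation-outcome-unknown", true),
    INTERNAL_ERROR(500, "internal-error", true);

    private static final String TYPE_BASE = "https://errors.example.internal/";

    private final int status;
    private final String slug;
    private final boolean retryableByDefault;

    RegisteredError(int status, String slug, boolean retryableByDefault) {
        this.status = status;
        this.slug = slug;
        this.retryableByDefault = retryableByDefault;
    }

    /** HTTP status registered for this code. */
    int status() {
        return status;
    }

    /** Problem type URI, built from the shared base and the code's slug. */
    URI type() {
        return URI.create(TYPE_BASE + slug);
    }

    /** Server-side default for the retryable flag. */
    boolean retryableByDefault() {
        return retryableByDefault;
    }
}
```

Fields such as owner team and alerting classification fit the same pattern. They are left out here to keep the example short.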

21. Spring Boot Example with ProblemDetail

Modern Spring applications can use ProblemDetail as a base, then add properties.

Example:

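import java.net.URI;

import jakarta.servlet.http.HttpServletRequest;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.ProblemDetail;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

@RestControllerAdvice
public class ApiExceptionHandler {

    // Maps the domain exception to the governed 409 problem contract.
    @ExceptionHandler(CaseVersionConflictException.class)
    ResponseEntity<ProblemDetail> handleCaseVersionConflict(
            CaseVersionConflictException ex,
            HttpServletRequest request
    ) {
        ProblemDetail problem = ProblemDetail.forStatus(HttpStatus.CONFLICT);
        problem.setType(URI.create("https://errors.example.internal/case-version-conflict"));
        problem.setTitle("Case version conflict");
        problem.setDetail("The case was modified by another process.");
        problem.setInstance(URI.create(request.getRequestURI()));
        // Extension members: stable code, retry hint, and conflict context.
        problem.setProperty("errorCode", "CASE_VERSION_CONFLICT");
        problem.setProperty("retryable", false);
        problem.setProperty("expectedVersion", ex.expectedVersion());
        problem.setProperty("currentVersion", ex.currentVersion());

        return ResponseEntity
                .status(HttpStatus.CONFLICT)
                .contentType(MediaType.APPLICATION_PROBLEM_JSON)
                .body(problem);
    }
}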

This centralizes mapping from internal exceptions to API problem contracts.

Do not scatter error response construction across controllers.

22. JAX-RS Example

For JAX-RS/Jakarta REST style services:

import java.util.List;

import jakarta.ws.rs.core.Response;
import jakarta.ws.rs.ext.ExceptionMapper;
import jakarta.ws.rs.ext.Provider;

@Provider
public class CaseVersionConflictMapper
        implements ExceptionMapper<CaseVersionConflictException> {

    // Maps the domain exception to the shared ApiProblem contract.
    @Override
    public Response toResponse(CaseVersionConflictException ex) {
        ApiProblem problem = new ApiProblem(
                "https://errors.example.internal/case-version-conflict",
                "Case version conflict",
                409,
                "The case was modified by another process.",
                "/problems/" + ex.problemId(),
                "CASE_VERSION_CONFLICT",
                ex.correlationId(),
                false,      // retryable
                null,       // retryAfterMillis: no delay hint for conflicts
                List.of()   // violations
        );

        return Response
                .status(Response.Status.CONFLICT)
                .type("application/problem+json")
                .entity(problem)
                .build();
    }
}

The pattern is the same:

Exception -> stable problem contract -> HTTP response

Framework differs.

Architecture does not.

23. Client-Side Error Parsing

A Java client should parse problem responses into a typed error.

Bad:

if (statusCode == 500) {
    throw new RuntimeException(body);
}

Better:

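public final class RemoteServiceException extends RuntimeException {
    private final int status;
    private final String service;
    private final String route;
    private final ApiProblem problem;

    public RemoteServiceException(
            int status,
            String service,
            String route,
            ApiProblem problem
    ) {
        // The message carries only low-risk identifiers, never the raw body.
        super(service + " returned " + status + " " + problem.errorCode());
        this.status = status;
        this.service = service;
        this.route = route;
        this.problem = problem;
    }

    // Accessors so callers can log and branch without re-parsing the response.
    public int status() {
        return status;
    }

    public String service() {
        return service;
    }

    public String route() {
        return route;
    }

    public ApiProblem problem() {
        return problem;
    }

    // Server-side hint only; the client still applies its own retry budget.
    public boolean retryableByServerClassification() {
        return problem.retryable();
    }

    public String errorCode() {
        return problem.errorCode();
    }
}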

Client code should not need to parse raw JSON everywhere.

Centralize error decoding in the client abstraction.

24. Client Retry Decision

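A robust client retry decision uses multiple inputs: the server's retryable classification, the idempotency of the operation, the remaining deadline, and the remaining retry budget.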

Never retry solely because status is 500.

Never retry commands without idempotency or outcome strategy.

Never retry if the deadline budget is gone.
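These rules can be combined into one predicate that a client abstraction evaluates before every retry. The sketch below is one possible shape, assuming the inputs are already known to the caller. `RetryContext` and `RetryPolicy` are hypothetical names, not part of any library.

```java
import java.time.Duration;

/** Everything the client knows at the moment it considers a retry. */
record RetryContext(
        boolean serverSaysRetryable,   // the problem body's retryable flag
        boolean idempotentOrKeyed,     // safe method, or command sent with an idempotency key
        int attemptsLeft,              // remaining retry budget for this call
        Duration plannedDelay,         // backoff or Retry-After delay before the next attempt
        Duration remainingDeadline     // time left in the caller's end-to-end deadline
) {}

final class RetryPolicy {
    private RetryPolicy() {}

    /** Returns true only when every precondition for a safe retry holds. */
    static boolean shouldRetry(RetryContext ctx) {
        if (!ctx.serverSaysRetryable()) {
            return false; // the server classified the failure as not worth retrying
        }
        if (!ctx.idempotentOrKeyed()) {
            return false; // a replay could duplicate the side effect
        }
        if (ctx.attemptsLeft() <= 0) {
            return false; // retry budget is exhausted
        }
        // The wait itself must fit inside what is left of the deadline.
        return ctx.plannedDelay().compareTo(ctx.remainingDeadline()) < 0;
    }
}
```

The HTTP status does not appear among the inputs. The status selects which problems are candidates, but the decision itself rests on classification, safety, and budget.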

25. Error Observability

Every problem response should connect to observability.

Required response/log/tracing alignment:

correlationId in response
trace ID in telemetry
errorCode in logs
problem type in logs
HTTP status in metrics
route template in metrics
retryable classification in logs/metrics if low-cardinality

Avoid logging the entire problem body if it may include sensitive details.

Recommended structured log:

{
  "level": "WARN",
  "event": "http_problem_response",
  "service": "case-service",
  "route": "/cases/{caseId}/commands/submit-review",
  "status": 409,
  "errorCode": "CASE_VERSION_CONFLICT",
  "problemType": "https://errors.example.internal/case-version-conflict",
  "retryable": false,
  "correlationId": "9f9c3a0e3c4b4e6b"
}

Do not use detail as a metric label.

Do not use instance as a metric label.

Do not use raw request path with IDs as a metric label.

26. Alerting Classification

Not all errors are incidents.

Error class	Alert?	Notes
Validation failure	Usually no	Client/request issue
Auth failure	Maybe security monitoring	Not service health by default
Business conflict	Usually no	Expected domain rejection
Rate limit	Maybe if sustained	Capacity/backpressure signal
Overload	Yes if sustained	Service health issue
Dependency unavailable	Yes if sustained	Dependency or platform issue
Internal error	Yes	Bug or unhandled state
Timeout/unknown outcome	Yes if elevated	Reliability risk

If expected domain conflicts page operators, teams learn to ignore alerts.

Error modeling affects alert quality.

27. Safe Diagnostics

Good diagnostics answer:

What stable class of failure occurred?
Where can I investigate?
What should the caller do?

They should not expose:

stack traces
SQL statements
hostnames if sensitive
credentials
tokens
raw personal data
internal authorization rules
unredacted request body
library exception internals

Example safe diagnostic response:

{
  "type": "https://errors.example.internal/internal-error",
  "title": "Internal service error",
  "status": 500,
  "errorCode": "INTERNAL_ERROR",
  "retryable": true,
  "correlationId": "9f9c3a0e3c4b4e6b",
  "instance": "/problems/01J0XYZABCD123"
}

Operators use correlation ID to inspect logs/traces.

Clients use error code and retryable flag to decide behavior.

28. Error Response Size

Error responses should be compact.

A failure path is often hot during incidents.

Large error payloads amplify incidents.

Bad:

10 KB stack trace returned for every failed request during outage

At 10,000 failed requests per second, that is significant network and logging pressure.

Error response policy:

Keep problem bodies small.
Bound validation violation count.
Truncate safe details.
Do not include stack traces.
Do not include large nested objects.
Do not echo full request payload.

For validation, cap violations:

{
  "errorCode": "VALIDATION_FAILED",
  "violations": [
    { "field": "items[0].id", "code": "REQUIRED" }
  ],
  "violationCount": 157,
  "violationsTruncated": true
}

Bounded error responses are part of failure containment.
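Capping is straightforward to implement on the server. The helper below is a minimal sketch that produces the `violations`, `violationCount`, and `violationsTruncated` fields shown above. `CappedViolations` and `ViolationLimiter` are hypothetical names, and the limit is left to the caller.

```java
import java.util.List;

/** Trimmed violation for this example; message omitted for brevity. */
record Violation(String field, String code) {}

/** The bounded view that goes into the problem body. */
record CappedViolations(
        List<Violation> violations,
        int violationCount,
        boolean violationsTruncated
) {}

final class ViolationLimiter {
    private ViolationLimiter() {}

    /** Keeps at most max violations while reporting the true total. */
    static CappedViolations cap(List<Violation> all, int max) {
        if (max < 0) {
            throw new IllegalArgumentException("max must not be negative");
        }
        boolean truncated = all.size() > max;
        // subList is a view; copy it so the result does not depend on the input list.
        List<Violation> kept = truncated
                ? List.copyOf(all.subList(0, max))
                : List.copyOf(all);
        return new CappedViolations(kept, all.size(), truncated);
    }
}
```

Reporting the full count alongside the truncated list lets a client tell "one problem" apart from "the first of 157" without inflating the payload.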

29. Partial Failure Modeling

Some endpoints aggregate data from multiple dependencies.

A response may be partially successful.

Avoid pretending partial failure is full success without signal.

Example:

{
  "caseId": "CASE-123",
  "status": "UNDER_REVIEW",
  "riskScore": null,
  "warnings": [
    {
      "code": "RISK_SCORE_UNAVAILABLE",
      "message": "Risk score is temporarily unavailable"
    }
  ]
}

But use partial responses carefully.

They are appropriate when:

the caller can safely proceed with degraded data;
the response clearly marks missing sections;
the missing data is not required for invariant enforcement;
observability records the degradation;
SLOs define whether degraded success counts as success.

If missing data invalidates the operation, return an error instead.

Do not hide critical dependency failure inside a normal 200 response.

30. 200 with Error Body Is Usually Wrong

Bad:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "success": false,
  "error": "CASE_ALREADY_CLOSED"
}

This breaks:

client libraries;
proxies;
metrics;
alerting;
tracing conventions;
retry logic;
caches;
operational dashboards.

Use HTTP status correctly.

There are narrow exceptions, such as batch APIs where individual items can fail while the batch request itself succeeds.

Example:

{
  "results": [
    {
      "itemId": "A",
      "status": "SUCCESS"
    },
    {
      "itemId": "B",
      "status": "FAILED",
      "errorCode": "CASE_ALREADY_CLOSED"
    }
  ]
}

But for a single operation failure, do not return 200.

31. Batch Error Modeling

Batch endpoints need item-level errors.

Example:

POST /cases/batch-close

Response:

{
  "batchId": "BATCH-123",
  "summary": {
    "total": 3,
    "succeeded": 2,
    "failed": 1
  },
  "results": [
    {
      "caseId": "CASE-1",
      "outcome": "CLOSED"
    },
    {
      "caseId": "CASE-2",
      "outcome": "FAILED",
      "problem": {
        "type": "https://errors.example.internal/case-already-closed",
        "title": "Case already closed",
        "status": 409,
        "errorCode": "CASE_ALREADY_CLOSED",
        "retryable": false
      }
    }
  ]
}

Batch APIs must define:

Does one item failure abort the batch?
Can items partially succeed?
Are results ordered like input?
How is idempotency handled per item?
How are item-level errors represented?
What status code represents partial success?

Do not design batch error semantics casually.

They become hard to change.

32. Error Contract Versioning

Error contracts evolve.

Safe changes:

Add new optional extension field.
Add new error code if clients have default handling.
Add more specific violation code if clients are tolerant.

Risky changes:

Change HTTP status for existing error code.
Rename errorCode.
Change retryable meaning.
Remove field used by clients.
Change field type.
Move machine-readable detail into human text.

Clients should implement default handling for unknown error codes.

Example:

switch (problem.errorCode()) {
    case "CASE_VERSION_CONFLICT" -> handleConflict(problem);
    case "RATE_LIMITED" -> handleRateLimit(problem);
    default -> handleUnknownRemoteProblem(problem);
}

Unknown does not mean ignore.

Unknown means use safe fallback behavior.
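When a client models error codes as an enum rather than raw strings, the same default handling needs a parsing step that tolerates codes added after the client was built. A minimal sketch follows, with `KnownErrorCode` as a hypothetical client-side enum.

```java
/** Codes this client version knows how to handle, plus an explicit fallback. */
enum KnownErrorCode {
    CASE_VERSION_CONFLICT,
    RATE_LIMITED,
    UNKNOWN;

    /** Maps a wire value to a known code, or UNKNOWN when it is missing or new. */
    static KnownErrorCode parse(String raw) {
        if (raw == null) {
            return UNKNOWN;
        }
        try {
            return valueOf(raw);
        } catch (IllegalArgumentException unrecognized) {
            // The server added a code this client does not know yet.
            return UNKNOWN;
        }
    }
}
```

Callers should keep the raw string for logging, so that a new code surfacing as `UNKNOWN` can still be identified in telemetry.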

33. OpenAPI Documentation for Errors

OpenAPI should document common errors.

Example:

components:
  schemas:
    ApiProblem:
      type: object
      required:
        - type
        - title
        - status
        - errorCode
        - retryable
      properties:
        type:
          type: string
          format: uri
        title:
          type: string
        status:
          type: integer
        detail:
          type: string
        instance:
          type: string
        errorCode:
          type: string
        correlationId:
          type: string
        retryable:
          type: boolean
        retryAfterMillis:
          type: integer
          format: int64
        violations:
          type: array
          items:
            $ref: '#/components/schemas/Violation'

Endpoint response:

responses:
  '409':
    description: Case state conflict
    content:
      application/problem+json:
        schema:
          $ref: '#/components/schemas/ApiProblem'

Document expected errorCode values per endpoint.

Do not only document generic ApiProblem.

Clients need to know which problems are expected.

34. Error Response Testing

Test error responses as contracts.

Minimum tests:

validation error shape
business conflict shape
not found shape
auth failure shape
rate limit shape
dependency unavailable shape
internal error fallback shape
content type application/problem+json
correlation ID propagation
no stack trace leakage
retryable classification
OpenAPI examples stay valid

Example unit assertion:

assertThat(problem.errorCode()).isEqualTo("CASE_VERSION_CONFLICT");
assertThat(problem.status()).isEqualTo(409);
assertThat(problem.retryable()).isFalse();
assertThat(problem.detail()).doesNotContain("java.lang");

Error path tests are not second-class tests.

Most production incidents are error-path incidents.

35. Chaos and Failure Injection

Error modeling should be tested under injected failures.

Examples:

Downstream timeout -> 504 or 503 with DEPENDENCY_TIMEOUT
Connection refused -> 503 DEPENDENCY_UNAVAILABLE
Pool acquisition timeout -> 503 SERVICE_OVERLOADED or CLIENT_POOL_EXHAUSTED, depending on the side
Circuit open -> 503 DEPENDENCY_UNAVAILABLE or SERVICE_PROTECTION_ACTIVE
Request body too large -> 413 PAYLOAD_TOO_LARGE
Unsupported content encoding -> 415 UNSUPPORTED_CONTENT_ENCODING

The goal is not only to see that requests fail.

The goal is to see that they fail with useful, stable, safe signals.

36. Production Error Model Template

Use this as a starting template.

## Error Model

Media type:
- application/problem+json

Required fields:
- type
- title
- status
- errorCode
- retryable
- correlationId

Optional fields:
- detail
- instance
- retryAfterMillis
- violations
- operationId
- safeToRetryWithSameIdempotencyKey

Rules:
- Never return stack trace
- Never parse human message in clients
- Never use 200 for single-operation failure
- Bound validation violations
- Document expected error codes per endpoint
- Use Retry-After for 429/503 when delay is known
- Preserve correlation ID across service boundary
- Log errorCode/status/route/correlationId
- Keep high-cardinality values out of metrics

Default client behavior:
- 400/401/403/404/409/412/415/422: no automatic retry
- 429/503: retry only if budget remains and Retry-After/backoff permits
- 500/502/504: retry only for safe/idempotent operations or idempotency-key commands
- unknown errorCode: safe fallback, log, and surface typed RemoteServiceException

37. Anti-Patterns

37.1 Generic message-only errors

{
  "message": "Failed"
}

No status refinement.

No stable code.

No retry classification.

No diagnostics.

37.2 Exception names as API codes

{
  "errorCode": "NullPointerException"
}

This leaks implementation detail and creates unstable contracts.

37.3 Always returning 500

Validation failure, business conflict, and dependency timeout are different.

If everything is 500, clients cannot behave correctly.

37.4 Always retrying 5xx

This causes retry storms.

Retry must consider idempotency, deadline, retry budget, and error classification.

37.5 Hiding dependency failure as empty success

{
  "riskScore": null
}

without warning or degradation signal.

This creates silent correctness bugs.

38. Final Mental Model

Good HTTP error modeling is not about pretty JSON.

It is about making distributed failure explicit.

Use this model:

HTTP status = protocol-level classification
Problem type = stable problem identity
Error code = compact machine-readable handling key
Retryable = server-side retry hint
Headers = cross-cutting control metadata
Correlation ID = investigation handle
Logs/traces = detailed internal diagnosis

The client should never need to guess.

The operator should never need to scrape random messages.

The platform should never confuse expected domain rejection with service failure.

That is production-grade error communication.

39. Phase 2 Complete

This part completes the HTTP communication foundation phase.

So far, the series has covered:

HTTP transport semantics
method safety and idempotency
status code design
headers and context propagation
timeout budgeting
connection pooling
HTTP/1.1 vs HTTP/2
HTTP/3 and QUIC considerations
payload efficiency
error response modeling

The next phase moves from protocol foundation to Java implementation.

We will start with JDK HttpClient, then Spring RestClient, WebClient, OpenFeign, MicroProfile Rest Client, and production-grade client abstraction design.

References

RFC 9110 — HTTP Semantics: https://www.rfc-editor.org/rfc/rfc9110.html
RFC 9457 — Problem Details for HTTP APIs: https://www.rfc-editor.org/rfc/rfc9457.html
RFC 6585 — Additional HTTP Status Codes: https://www.rfc-editor.org/rfc/rfc6585.html
OpenTelemetry Semantic Conventions — HTTP Spans: https://opentelemetry.io/docs/specs/semconv/http/http-spans/
OpenTelemetry Semantic Conventions — Error Attributes: https://opentelemetry.io/docs/specs/semconv/registry/attributes/error/
Spring Framework Reference — Error Responses: https://docs.spring.io/spring-framework/reference/web/webmvc/mvc-ann-rest-exceptions.html
