Production Resilience Policy Template

Learn Java Microservices Communication - Part 048

Production resilience policy template for Java microservices: unified policy design for timeout, deadline, retry, circuit breaker, bulkhead, rate limit, load shedding, hedging, fallback, observability, validation, governance, rollout, testing, and runbooks.

[2026-07-05] 14 min read, 2681 words

Part 048 — Production Resilience Policy Template

At this point, we have covered many synchronous-call resilience patterns:

timeout,
retry,
circuit breaker,
bulkhead,
rate limiting,
load shedding,
hedged requests,
fallback,
stale data,
deadline propagation.

The real production challenge is not knowing these patterns individually.

The challenge is composing them into one coherent policy.

A mature Java service should not have resilience behavior scattered across:

annotations,
YAML fragments,
default HTTP client settings,
generated client defaults,
service mesh config,
retry libraries,
random helper methods,
unreviewed catch blocks.

It should have an explicit communication policy.

A resilience policy is the executable contract for how a service spends time, capacity, retries, and fallback under failure.

1. Why a Unified Policy Is Necessary

When policies are scattered, teams create contradictions.

Examples:

HTTP client timeout = 2s
gateway timeout = 1s
retry max attempts = 3
deadline remaining = 800ms
DB statement timeout = 10s
bulkhead wait = 500ms

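This configuration cannot behave well. The gateway gives up after 1s while the client is still willing to wait 2s, so the caller can receive a gateway error for a request that is still in flight. Three attempts at 2s each need up to 6s, but only 800ms of deadline remain, so the retries can never complete in time. A 500ms bulkhead wait can consume most of that 800ms before the first attempt starts, and a 10s DB statement timeout lets work keep running long after every caller has stopped waiting.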
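The mismatch in the example above can be checked with simple arithmetic. The following is a minimal sketch, not a definitive implementation: `BudgetMath` and its method names are hypothetical, it assumes the bulkhead wait is paid once per logical call, and it ignores backoff sleeps between attempts.

```java
import java.time.Duration;

// Hypothetical helper: estimates the worst-case time one logical call can consume.
final class BudgetMath {
    private BudgetMath() {}

    // Assumption: the bulkhead wait is paid once, and every attempt may run to its full timeout.
    // Backoff sleeps are ignored, so the real worst case can only be larger.
    static Duration worstCase(int maxAttempts, Duration attemptTimeout, Duration bulkheadWait) {
        return bulkheadWait.plus(attemptTimeout.multipliedBy(maxAttempts));
    }

    // True when the worst case fits inside the tightest surrounding budget.
    static boolean fits(Duration worstCase, Duration deadlineRemaining, Duration gatewayTimeout) {
        Duration tightest = deadlineRemaining.compareTo(gatewayTimeout) < 0
            ? deadlineRemaining
            : gatewayTimeout;
        return worstCase.compareTo(tightest) <= 0;
    }
}
```

With the values from the example (3 attempts of 2s plus a 500ms bulkhead wait), `worstCase` returns 6.5s, which does not fit inside either the 800ms deadline or the 1s gateway timeout.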

Another example:

circuit breaker opens after 50% failures
retry attempts 3x
rate limiter only limits original calls
fallback returns stale data without freshness label

The final success rate may look high while the dependency is overloaded by retries the rate limiter never sees, and users receive stale data they cannot recognize as stale.

A unified policy forces trade-offs to be visible.

2. Policy Is Not Just Configuration

Configuration says:

timeout: 500ms

Policy says:

This operation may spend at most 500 ms end-to-end.
It may retry once only for transient failures.
It must use idempotency key if command retry is enabled.
It fails fast when bulkhead is full.
It may return stale cache for reads up to 5 minutes.
It must not fallback to fake success for commands.

Configuration is data.

Policy is meaning.

Production systems need both.

3. Policy Scope

Define policy at multiple levels:

Scope	Example
Platform default	all outbound calls must have deadline, timeout, metrics
Service default	case-service clients use 500 ms max deadline
Dependency default	all calls to document-service have max 50 concurrent calls
Operation-specific	`createEscalation` requires idempotency and no stale fallback
Caller-specific	batch callers lower priority and lower rate
Environment-specific	staging lower limits than production
Incident override	disable hedging during overload

Precedence:

platform default
< service default
< dependency default
< operation policy
< safe incident override

Do not let request-level override bypass safety caps.
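One way to express "most specific layer wins, but never above the cap" is sketched below. This is an illustration under assumptions: `DeadlineResolver` is a hypothetical name, the layers are passed in from least to most specific, and only the deadline is resolved.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;

// Hypothetical resolver: later (more specific) layers override earlier ones,
// but the result can never exceed the platform safety cap.
final class DeadlineResolver {
    private DeadlineResolver() {}

    // layers are ordered from least specific (platform default) to most specific (incident override).
    static Duration resolve(List<Optional<Duration>> layers, Duration platformCap) {
        Duration effective = platformCap;
        for (Optional<Duration> layer : layers) {
            if (layer.isPresent()) {
                effective = layer.get(); // a layer that sets a value replaces the previous one
            }
        }
        // Safety cap: no layer, including a request-level override, may raise the deadline above it.
        return effective.compareTo(platformCap) > 0 ? platformCap : effective;
    }
}
```

A request that asks for a 5s deadline under a 1s platform cap therefore still gets 1s.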

4. The Policy Object Model

A good policy model separates concerns.

public record CommunicationPolicy(
    String dependency,
    Map<String, OperationPolicy> operations,
    DependencyDefaults defaults
) {}

Operation policy:

public record OperationPolicy(
    String operationName,
    OperationSemantics semantics,
    DeadlinePolicy deadline,
    TimeoutPolicy timeout,
    RetryPolicy retry,
    CircuitBreakerPolicy circuitBreaker,
    BulkheadPolicy bulkhead,
    RateLimitPolicy rateLimit,
    HedgingPolicy hedging,
    FallbackPolicy fallback,
    ObservabilityPolicy observability
) {}

Semantics:

public record OperationSemantics(
    boolean readOnly,
    boolean sideEffecting,
    boolean idempotentByNature,
    boolean idempotencyKeyRequired,
    boolean strongConsistencyRequired,
    Priority defaultPriority
) {}

This makes dangerous combinations detectable at startup.

5. Policy Validation Is Mandatory

Configuration that violates invariants should fail startup.

Examples:

retry enabled on side-effecting command without idempotency key
hedging enabled on side-effecting command
fallback default success enabled on command
timeout longer than deadline
bulkhead wait longer than operation budget
retry max attempts impossible within deadline
stale fallback enabled without max staleness
metrics disabled for critical dependency
circuit breaker records validation errors as dependency failure

Startup validation:

public final class CommunicationPolicyValidator {
    public void validate(OperationPolicy policy) {
        if (policy.semantics().sideEffecting()
            && policy.retry().enabled()
            && !policy.semantics().idempotencyKeyRequired()) {
            throw new InvalidPolicyException(
                policy.operationName() + " retries side-effecting command without idempotency"
            );
        }

        if (policy.semantics().sideEffecting() && policy.hedging().enabled()) {
            throw new InvalidPolicyException(
                policy.operationName() + " enables hedging for side-effecting command"
            );
        }

        if (policy.fallback().staleCacheAllowed()
            && policy.fallback().maxStaleness() == null) {
            throw new InvalidPolicyException(
                policy.operationName() + " stale fallback requires maxStaleness"
            );
        }

        if (policy.timeout().responseTimeout()
            .compareTo(policy.deadline().defaultDeadline()) > 0) {
            throw new InvalidPolicyException(
                policy.operationName() + " response timeout exceeds default deadline"
            );
        }
    }
}

Bad resilience config should not reach production.
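The validator above covers only some of the listed invariants. The timing rules (timeout versus deadline, bulkhead wait versus budget, retries that cannot fit) could be sketched as follows. `TimingPolicy` and `TimingPolicyChecks` are hypothetical names, and the retry-fit rule is a deliberately loose lower bound: it assumes every earlier attempt fails instantly and only the base delay is slept.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, flattened view of the timing fields of one operation policy.
record TimingPolicy(
    Duration deadline,
    Duration responseTimeout,
    Duration bulkheadMaxWait,
    int maxAttempts,
    Duration baseRetryDelay
) {}

final class TimingPolicyChecks {
    private TimingPolicyChecks() {}

    // Returns every violation instead of stopping at the first one.
    static List<String> violations(TimingPolicy p) {
        List<String> out = new ArrayList<>();
        if (p.responseTimeout().compareTo(p.deadline()) > 0) {
            out.add("response timeout exceeds deadline");
        }
        if (p.bulkheadMaxWait().compareTo(p.deadline()) >= 0) {
            out.add("bulkhead wait consumes the whole deadline");
        }
        // Earliest possible start of the last attempt: one base delay per earlier attempt.
        Duration earliestLastAttempt =
            p.baseRetryDelay().multipliedBy(Math.max(0, p.maxAttempts() - 1));
        if (p.maxAttempts() > 1 && earliestLastAttempt.compareTo(p.deadline()) >= 0) {
            out.add("retry attempts cannot fit within deadline");
        }
        return out;
    }
}
```

Collecting all violations, rather than throwing on the first, gives reviewers one complete report per policy change.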

6. Full YAML Template

communication:
  serviceName: workflow-service
  environment: production

  defaults:
    deadline:
      inboundHeader: X-Request-Deadline
      defaultMs: 500
      maxMs: 1000
      minUsefulMs: 75
      reserveResponseMarginMs: 25

    observability:
      metricsEnabled: true
      tracesEnabled: true
      logsEnabled: true
      recordAttemptMetrics: true
      recordPolicyDecisions: true

  dependencies:
    case-service:
      baseUrl: https://case-service.internal
      apiVersion: v1
      owner: case-platform
      criticality: high

      defaults:
        connectionPool:
          maxConnections: 100
          maxConnectionsPerRoute: 50
          acquisitionTimeoutMs: 25
        timeout:
          connectMs: 75
          responseMs: 400
        circuitBreaker:
          slidingWindowType: COUNT_BASED
          slidingWindowSize: 100
          minimumNumberOfCalls: 50
          failureRateThreshold: 50
          slowCallRateThreshold: 50
          slowCallDurationMs: 500
          waitDurationOpenMs: 20000
          permittedHalfOpenCalls: 5
        bulkhead:
          type: semaphore
          maxConcurrentCalls: 50
          maxWaitMs: 0

      operations:
        getCase:
          method: GET
          route: /v1/cases/{caseId}
          semantics:
            readOnly: true
            sideEffecting: false
            idempotentByNature: true
            strongConsistencyRequired: true
            defaultPriority: user-facing

          deadline:
            defaultMs: 300
            maxMs: 600
            minUsefulMs: 50

          timeout:
            connectMs: 50
            responseMs: 250

          retry:
            enabled: true
            maxAttempts: 2
            baseDelayMs: 30
            maxDelayMs: 120
            jitter: full
            retryableStatuses: [429, 502, 503, 504]
            retryBudgetRatio: 0.10
            deadlineAware: true

          circuitBreaker:
            enabled: true
            recordStatuses: [502, 503, 504]
            ignoreStatuses: [400, 401, 403, 404, 409, 422]

          bulkhead:
            enabled: true
            maxConcurrentCalls: 80
            maxWaitMs: 10

          rateLimit:
            enabled: true
            limitForPeriod: 300
            limitRefreshPeriodMs: 1000
            timeoutMs: 0

          hedging:
            enabled: false
            reason: strong-consistency-required

          fallback:
            type: fail-fast
            staleCacheAllowed: false

          observability:
            operationMetric: case.get_case
            traceSpanName: GET /v1/cases/{caseId}

        createEscalation:
          method: POST
          route: /v1/case-escalations
          semantics:
            readOnly: false
            sideEffecting: true
            idempotentByNature: false
            idempotencyKeyRequired: true
            strongConsistencyRequired: true
            defaultPriority: critical-command

          deadline:
            defaultMs: 600
            maxMs: 1000
            minUsefulMs: 100

          timeout:
            connectMs: 75
            responseMs: 450

          retry:
            enabled: true
            maxAttempts: 2
            requiresIdempotencyKey: true
            sameIdempotencyKeyAcrossAttempts: true
            retryableStatuses: [429, 502, 503]
            nonRetryableStatuses: [400, 401, 403, 404, 409, 422]
            unknownOutcomeHandling: dedup-replay-required
            deadlineAware: true

          circuitBreaker:
            enabled: true
            failureRateThreshold: 40
            slowCallDurationMs: 600
            ignoreStatuses: [400, 401, 403, 404, 409, 422]

          bulkhead:
            enabled: true
            maxConcurrentCalls: 40
            maxWaitMs: 0

          rateLimit:
            enabled: true
            limitForPeriod: 100
            limitRefreshPeriodMs: 1000
            timeoutMs: 0

          hedging:
            enabled: false
            reason: side-effecting-command

          fallback:
            type: fail-fast
            allowAsyncHandoff: false
            fakeSuccessAllowed: false

          observability:
            operationMetric: case.create_escalation
            traceSpanName: POST /v1/case-escalations
            recordIdempotencyPresence: true

This looks long because communication policy is real engineering.

Hidden complexity is still complexity.

This just makes it visible.

7. Recommended Decorator Composition

There is no universal ordering, but a policy must define one.

A practical default for synchronous outbound HTTP calls:

request context / deadline
→ rate limiter
→ bulkhead
→ circuit breaker
→ retry executor
→ per-attempt timeout
→ transport
→ error mapper
→ fallback

But the policy must specify what each layer observes.

Example decisions:

retry should not hold bulkhead permit while sleeping,
bulkhead rejection should not count as dependency circuit breaker failure,
timeout should count as circuit breaker failure for remote health,
deadline exhausted before remote call should not count as dependency failure,
fallback success must be tagged as degraded success.
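The ordering can be made explicit in code instead of being implied by the order of `apply...` calls. The sketch below uses plain `Supplier` wrappers as stand-ins for real decorators; `DecoratorOrderSketch`, `layer`, and `compose` are hypothetical names, and the layers only record that they were entered.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

final class DecoratorOrderSketch {
    private DecoratorOrderSketch() {}

    // Hypothetical stand-in for a real decorator: records entry, then delegates inward.
    static <T> UnaryOperator<Supplier<T>> layer(String name, List<String> trace) {
        return inner -> () -> {
            trace.add(name);
            return inner.get();
        };
    }

    // Layers are listed outermost first, matching the arrow list above.
    // They are applied in reverse, because the last wrapper applied runs first.
    static <T> Supplier<T> compose(
        Supplier<T> transport,
        List<UnaryOperator<Supplier<T>>> outermostFirst
    ) {
        Supplier<T> current = transport;
        for (int i = outermostFirst.size() - 1; i >= 0; i--) {
            current = outermostFirst.get(i).apply(current);
        }
        return current;
    }
}
```

Declaring the list outermost first keeps the configuration readable in the same direction as the request flows.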

8. Attempt Model

A logical call can contain multiple attempts.

public record RemoteCallAttempt(
    int attemptNumber,
    boolean hedgeAttempt,
    Instant startedAt,
    Duration timeout,
    Optional<Integer> statusCode,
    Optional<String> errorCode,
    AttemptOutcome outcome
) {}

Logical call result:

public record RemoteCallResult<T>(
    T value,
    boolean degraded,
    int attempts,
    Duration totalDuration,
    Optional<String> fallbackType
) {}

Metrics should distinguish:

logical call success
attempt success
success after retry
success via fallback
failure due to circuit open
failure due to bulkhead full
failure due to deadline exceeded

A single "success" counter is insufficient.
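A small classifier can turn the fields of a logical call result into one of the distinct success outcomes listed above. This is a sketch under assumptions: `SuccessOutcome` and `OutcomeClassifier` are hypothetical names, and any degraded or fallback result is treated as "success via fallback".

```java
import java.util.Optional;

// Success outcomes taken from the list of metrics to distinguish.
enum SuccessOutcome { SUCCESS_FRESH, SUCCESS_AFTER_RETRY, SUCCESS_VIA_FALLBACK }

final class OutcomeClassifier {
    private OutcomeClassifier() {}

    // Parameters mirror RemoteCallResult: degraded flag, attempt count, fallback type.
    static SuccessOutcome classifySuccess(
        boolean degraded,
        int attempts,
        Optional<String> fallbackType
    ) {
        // Fallback wins over retry: a degraded answer must never be reported as fresh.
        if (degraded || fallbackType.isPresent()) {
            return SuccessOutcome.SUCCESS_VIA_FALLBACK;
        }
        return attempts > 1 ? SuccessOutcome.SUCCESS_AFTER_RETRY : SuccessOutcome.SUCCESS_FRESH;
    }
}
```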

9. Failure Taxonomy

Policy depends on failure classification.

public enum FailureClass {
    CALLER_BAD_REQUEST,
    AUTHENTICATION,
    AUTHORIZATION,
    NOT_FOUND,
    DOMAIN_CONFLICT,
    PRECONDITION_FAILED,
    DOMAIN_VALIDATION,
    RATE_LIMITED,
    REMOTE_UNAVAILABLE,
    REMOTE_TIMEOUT,
    CONNECT_TIMEOUT,
    READ_TIMEOUT,
    DEADLINE_EXCEEDED,
    BULKHEAD_FULL,
    CIRCUIT_OPEN,
    LOCAL_RATE_LIMITED,
    UNKNOWN
}

Each pattern uses this taxonomy differently.

Failure class	Retry	Breaker	Fallback
caller bad request	no	ignore	no
auth/authz	no	usually ignore	no
not found	usually no	ignore	maybe domain-specific
domain conflict	no	ignore	no
rate limited	yes with delay	maybe	maybe
remote unavailable	yes bounded	record	maybe
remote timeout	yes if safe	record	maybe
bulkhead full	no remote retry by default	ignore dependency breaker	fallback/degrade
circuit open	no remote call	n/a	fallback/degrade
deadline exceeded	no	not dependency failure if before call	fallback if cheap

This table should be encoded, tested, and reviewed.
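Encoding the table can be as simple as an `EnumMap` with a completeness check. The sketch below covers only a subset of the rows and collapses "yes if safe" into a plain boolean, so idempotency checks still have to happen elsewhere. `FailureKind`, `Handling`, and `FailureHandlingTable` are hypothetical names.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Subset of the failure taxonomy, for illustration.
enum FailureKind {
    CALLER_BAD_REQUEST, DOMAIN_CONFLICT, REMOTE_UNAVAILABLE,
    REMOTE_TIMEOUT, BULKHEAD_FULL, CIRCUIT_OPEN
}

// retryable: may the retry layer try again; breakerRecords: counts against dependency health.
record Handling(boolean retryable, boolean breakerRecords) {}

final class FailureHandlingTable {
    private static final Map<FailureKind, Handling> TABLE = new EnumMap<>(FailureKind.class);

    static {
        TABLE.put(FailureKind.CALLER_BAD_REQUEST, new Handling(false, false));
        TABLE.put(FailureKind.DOMAIN_CONFLICT, new Handling(false, false));
        TABLE.put(FailureKind.REMOTE_UNAVAILABLE, new Handling(true, true));
        TABLE.put(FailureKind.REMOTE_TIMEOUT, new Handling(true, true)); // still subject to idempotency rules
        TABLE.put(FailureKind.BULKHEAD_FULL, new Handling(false, false));
        TABLE.put(FailureKind.CIRCUIT_OPEN, new Handling(false, false));
    }

    private FailureHandlingTable() {}

    static Handling handlingFor(FailureKind kind) {
        Handling handling = TABLE.get(kind);
        if (handling == null) {
            throw new IllegalStateException("no handling defined for " + kind);
        }
        return handling;
    }

    // Intended for a unit test: every enum constant must have a row.
    static boolean isComplete() {
        return TABLE.keySet().containsAll(EnumSet.allOf(FailureKind.class));
    }
}
```

The completeness check means a newly added failure class fails the build until someone decides how each pattern should treat it.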

10. Policy-Aware Executor Skeleton

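public final class RemoteOperationExecutor {
    private final PolicyRegistry policyRegistry;
    private final Telemetry telemetry;
    // Used by executeAttempts; the RetryDecider type name is assumed here.
    private final RetryDecider retryDecider;

    public RemoteOperationExecutor(
        PolicyRegistry policyRegistry,
        Telemetry telemetry,
        RetryDecider retryDecider
    ) {
        this.policyRegistry = policyRegistry;
        this.telemetry = telemetry;
        this.retryDecider = retryDecider;
    }

    public <T> T execute(
        String dependency,
        String operation,
        RequestContext context,
        Supplier<T> transportCall
    ) {
        // Policies are validated once, when the registry is built, not on every request.
        OperationPolicy policy = policyRegistry.get(dependency, operation);

        // Reject before doing any work if the remaining deadline is too short to be useful.
        if (!context.deadline().canFit(policy.deadline().minUseful())) {
            throw new DeadlineTooShortException();
        }

        return telemetry.observeLogicalCall(policy, context, () ->
            executeWithPolicy(policy, context, transportCall)
        );
    }

    private <T> T executeWithPolicy(
        OperationPolicy policy,
        RequestContext context,
        Supplier<T> transportCall
    ) {
        // Innermost layer: the retry loop with per-attempt timeouts.
        Supplier<T> supplier = () -> executeAttempts(policy, context, transportCall);

        // Each wrapper encloses the previous one, so the rate limiter ends up outermost:
        // rate limiter -> bulkhead -> circuit breaker -> retry loop.
        supplier = applyCircuitBreaker(policy, supplier);
        supplier = applyBulkhead(policy, supplier);
        supplier = applyRateLimiter(policy, supplier);

        try {
            return supplier.get();
        } catch (RuntimeException failure) {
            // Catch RuntimeException rather than Throwable so that errors such as
            // OutOfMemoryError propagate instead of being turned into a fallback.
            return applyFallback(policy, context, failure);
        }
    }
}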

This is simplified.

Production code needs careful composition, cancellation, async support, and metrics.

The important point is architectural:

resilience belongs in the owned client boundary, not scattered in business code.

11. Retry Loop with Deadline

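private <T> T executeAttempts(
    OperationPolicy policy,
    RequestContext context,
    Supplier<T> call
) {
    // With retry disabled, exactly one attempt is made.
    int maxAttempts = policy.retry().enabled()
        ? Math.max(1, policy.retry().maxAttempts())
        : 1;
    RuntimeException lastFailure = null;

    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        // Shrink the per-attempt timeout to whatever the deadline still allows.
        Duration attemptTimeout = context.deadline().timeoutWithMargin(
            policy.timeout().responseTimeout(),
            policy.deadline().reserveResponseMargin()
        );

        if (attemptTimeout.isZero() || attemptTimeout.isNegative()) {
            throw new DeadlineExceededException();
        }

        try {
            return executeOneAttemptWithTimeout(call, attemptTimeout);
        } catch (RuntimeException failure) {
            lastFailure = failure;

            RetryDecision decision = retryDecider.decide(
                policy,
                context,
                failure,
                attempt
            );

            // Non-retryable failures surface unchanged.
            if (!decision.shouldRetry()) {
                throw failure;
            }

            // No attempts left: do not sleep before reporting exhaustion.
            if (attempt == maxAttempts) {
                break;
            }

            sleepWithoutHoldingBulkheadPermit(decision.delay());
        }
    }

    throw new RetryExhaustedException(lastFailure);
}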

Rules:

do not retry when deadline cannot fit,
do not retry unsafe commands without idempotency,
use same idempotency key across attempts,
use jittered backoff,
enforce retry budget,
emit retry decision metrics.
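The "jittered backoff" and "deadline cannot fit" rules can be sketched together. This is a minimal illustration of full jitter (a uniformly random delay up to an exponentially growing, capped ceiling). `JitteredBackoff` and its method names are hypothetical, and the retry budget is not modeled.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.Random;

final class JitteredBackoff {
    private JitteredBackoff() {}

    // Full jitter: a delay chosen uniformly in [0, min(max, base * 2^(attempt - 1))].
    static Duration delay(int attempt, Duration base, Duration max, Random random) {
        int shift = Math.min(Math.max(attempt - 1, 0), 20); // cap the exponent to avoid overflow
        long ceilingMs = Math.min(max.toMillis(), base.toMillis() << shift);
        if (ceilingMs <= 0) {
            return Duration.ZERO;
        }
        // nextDouble() is in [0, 1), so the truncated result is in [0, ceilingMs].
        return Duration.ofMillis((long) (random.nextDouble() * (ceilingMs + 1)));
    }

    // Returns the delay only if sleeping and then running a minimally useful attempt still fits.
    static Optional<Duration> delayIfFits(Duration delay, Duration remaining, Duration minUseful) {
        return delay.plus(minUseful).compareTo(remaining) <= 0
            ? Optional.of(delay)
            : Optional.empty();
    }
}
```

An empty result from `delayIfFits` is the signal to stop retrying and surface the last failure instead of sleeping into a deadline that is already lost.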

12. Policy Registry

Policies must be discoverable.

public interface PolicyRegistry {
    OperationPolicy get(String dependency, String operation);
}

Implementation:

public final class ValidatingPolicyRegistry implements PolicyRegistry {
    private final Map<OperationKey, OperationPolicy> policies;
    private final CommunicationPolicyValidator validator;

    public ValidatingPolicyRegistry(
        Map<OperationKey, OperationPolicy> policies,
        CommunicationPolicyValidator validator
    ) {
        this.policies = Map.copyOf(policies);
        this.validator = validator;

        this.policies.values().forEach(validator::validate);
    }

    @Override
    public OperationPolicy get(String dependency, String operation) {
        OperationPolicy policy = policies.get(new OperationKey(dependency, operation));
        if (policy == null) {
            throw new MissingCommunicationPolicyException(dependency, operation);
        }
        return policy;
    }
}

Missing policy should fail startup or fail fast in development.

Do not silently use dangerous defaults for unknown dependencies.

13. Safe Defaults

Platform defaults should be safe but not overly broad.

Example safe defaults:

platformDefaults:
  allOutboundCalls:
    timeoutRequired: true
    deadlinePropagationRequired: true
    observabilityRequired: true
    connectionPoolRequired: true
    errorMappingRequired: true

  unsafeCommands:
    retryDisabledUnlessIdempotencyKeyRequired: true
    hedgingForbidden: true
    fakeSuccessFallbackForbidden: true

  allOperations:
    maxDeadlineMs: 2000
    maxRetryAttempts: 2
    maxBulkheadWaitMs: 50
    metricsCardinalityGuard: true

Defaults should prevent catastrophic mistakes.

Operation policy can be stricter.

Exceptions require review.

14. Dangerous Combination Rules

Policy validator should reject:

Combination	Why dangerous
retry command without idempotency	duplicate side effects
hedge command	duplicate side effects
fallback fake success for command	business corruption
stale cache without max staleness	unbounded stale truth
long rate limiter wait in sync path	hidden queueing
bulkhead wait longer than deadline	wasted wait
timeout > deadline	impossible budget
retry attempts impossible within deadline	retry only adds load
breaker counts 400/422 as remote failure	caller bugs open breaker
generated client direct use	policy bypass
no metrics for critical dependency	invisible failure
idempotency replay not observed	duplicate behavior hidden
open circuit with no fallback/fail-fast mapping	generic errors

These are not style rules.

They are production safety rules.

15. Policy as Code in CI

Add CI checks that validate:

YAML schema,
required fields,
dangerous combinations,
operation existence in OpenAPI,
Resilience4j instance names,
metric names,
owner metadata,
runbook links,
change risk.

Policy changes should be reviewed like code.

A bad retry config can take down production as effectively as a bad code deploy.

16. Policy Diff Classification

Not all policy changes have equal risk.

Change	Risk
reduce timeout	can increase false failures
increase timeout	can increase resource hold time
enable retry	can amplify load/duplicates
increase max attempts	high risk
enable hedging	high risk
lower circuit threshold	may open too often
raise circuit threshold	may react too late
increase bulkhead	more downstream load
decrease bulkhead	more local rejections
enable stale fallback	semantic risk
increase max staleness	business correctness risk
change fallback from fail-fast to default	high semantic risk

CI should require explicit approval for high-risk changes.
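A CI step could classify each changed field before deciding whether extra approval is needed. The sketch below encodes only the rows the table marks as high risk; the dotted field paths, `ChangeRisk`, and `PolicyDiffClassifier` are hypothetical, and everything else falls into a `STANDARD` bucket that still gets normal review.

```java
enum ChangeRisk { HIGH, STANDARD }

final class PolicyDiffClassifier {
    private PolicyDiffClassifier() {}

    // field is a dotted policy path such as "retry.maxAttempts" (hypothetical naming).
    static ChangeRisk classify(String field, String oldValue, String newValue) {
        switch (field) {
            case "retry.maxAttempts":
                // Only an increase is high risk; a decrease reduces load.
                return Integer.parseInt(newValue) > Integer.parseInt(oldValue)
                    ? ChangeRisk.HIGH
                    : ChangeRisk.STANDARD;
            case "hedging.enabled":
                return Boolean.parseBoolean(newValue) && !Boolean.parseBoolean(oldValue)
                    ? ChangeRisk.HIGH
                    : ChangeRisk.STANDARD;
            case "fallback.type":
                // Moving away from fail-fast changes what callers are told.
                return "fail-fast".equals(oldValue) && !"fail-fast".equals(newValue)
                    ? ChangeRisk.HIGH
                    : ChangeRisk.STANDARD;
            default:
                return ChangeRisk.STANDARD;
        }
    }
}
```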

17. Rollout Strategy

Resilience policy changes need rollout discipline.

Recommended:

deploy policy in shadow/metrics-only mode if possible,
canary one service instance,
canary one caller or tenant,
observe attempts, timeouts, rejections, fallback rate,
expand gradually,
keep rollback config ready,
document incident impact.

Examples:

new circuit breaker threshold: start metrics-only,
hedging: enable for 1% traffic,
retry: enable for one operation and low max attempts,
fallback stale cache: enable for non-critical consumers first,
load shedding: test under load before production enforcement.

Do not flip large resilience behavior globally.

18. Kill Switches

Some patterns need runtime kill switches.

High-risk kill switches:

runtimeOverrides:
  hedging.enabled: false
  retry.enabled: false
  fallback.staleCache.enabled: false
  loadShedding.forceLevel: DEGRADED
  circuitBreaker.forceOpen:
    - external-provider.submitDocument
  circuitBreaker.disable:
    - noncritical-recommendation.getSuggestions

Kill switches must be:

access controlled,
audited,
visible in dashboards,
time-bounded,
documented in incident notes.

A kill switch without observability is just another hidden failure mode.
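The "audited" and "time-bounded" requirements can be enforced by the override type itself. The record below is a sketch with hypothetical field names; it refuses overrides without an actor or a positive TTL, and access control and dashboards are out of scope.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical record for one runtime override; field names are illustrative.
record RuntimeOverride(
    String key,
    String value,
    String actor,
    String reason,
    Instant createdAt,
    Duration ttl
) {
    RuntimeOverride {
        if (actor == null || actor.isBlank()) {
            throw new IllegalArgumentException("actor is required for audit");
        }
        if (createdAt == null || ttl == null || ttl.isZero() || ttl.isNegative()) {
            throw new IllegalArgumentException("override must be time-bounded");
        }
    }

    Instant expiresAt() {
        return createdAt.plus(ttl);
    }

    // Active from createdAt (inclusive) until expiresAt (exclusive).
    boolean isActive(Instant now) {
        return !now.isBefore(createdAt) && now.isBefore(expiresAt());
    }
}
```

Because an expired override simply stops being active, a forgotten kill switch cannot silently become permanent behavior.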

19. Observability Contract

Every resilience policy must define telemetry.

Minimum metrics, covering logical calls and individual attempts:

remote.logical.calls.total{dependency,operation,outcome}
remote.logical.duration{dependency,operation,outcome}
remote.attempts.total{dependency,operation,attempt_kind,outcome}
remote.timeouts.total{dependency,operation,type}
remote.retries.total{dependency,operation,decision}
remote.circuit.state{dependency,operation}
remote.bulkhead.rejections.total{dependency,operation}
remote.rate_limit.denied.total{dependency,operation}
remote.fallback.used.total{dependency,operation,type}
remote.deadline.remaining_ms{dependency,operation}

Outcomes:

success_fresh,
success_after_retry,
success_stale_fallback,
success_partial,
failed_remote,
failed_timeout,
failed_deadline,
failed_circuit_open,
failed_bulkhead_full,
failed_rate_limited,
failed_validation,
failed_policy_rejected.

Avoid high-cardinality labels such as raw URLs, case IDs, or idempotency keys.

Use operation names and route templates.
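A closed set of label values keeps the number of time series bounded by configuration rather than by traffic. The in-memory counter below is only a stand-in for a real metrics library; `CallOutcome` lists a subset of the outcomes above, and `LogicalCallCounters` is a hypothetical name.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Subset of the outcome labels, as a closed enum so callers cannot invent new values.
enum CallOutcome { SUCCESS_FRESH, SUCCESS_AFTER_RETRY, FAILED_TIMEOUT, FAILED_CIRCUIT_OPEN }

final class LogicalCallCounters {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Labels are limited to dependency, operation name, and outcome: no IDs, no raw paths.
    private static String key(String dependency, String operation, CallOutcome outcome) {
        return dependency + "|" + operation + "|" + outcome.name().toLowerCase(Locale.ROOT);
    }

    void record(String dependency, String operation, CallOutcome outcome) {
        counters.computeIfAbsent(key(dependency, operation, outcome), k -> new LongAdder())
            .increment();
    }

    long count(String dependency, String operation, CallOutcome outcome) {
        LongAdder adder = counters.get(key(dependency, operation, outcome));
        return adder == null ? 0L : adder.sum();
    }

    // Number of distinct series created so far.
    int seriesCount() {
        return counters.size();
    }
}
```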

20. Dashboard Template

For each dependency operation, show:

RPS,
success/failure/degraded rate,
p50/p95/p99 latency,
timeout rate by type,
retry attempts and success-after-retry,
circuit breaker state and not-permitted calls,
bulkhead active/rejected,
rate limit granted/denied,
fallback usage and stale age,
deadline remaining at call start,
top callers,
recent deploy/config changes.

Dashboard should answer:

Is the dependency healthy?
Is the caller protected?
Are users seeing fresh success, degraded success, or failure?
Is resilience policy helping or hurting?

21. Alert Template

Alert categories:

Alert	Meaning
critical dependency error rate high	user/business impact
circuit open sustained	dependency unavailable or threshold too low
bulkhead full sustained	capacity saturation
timeout p99 rising	tail degradation
retry rate above budget	retry storm risk
fallback rate above baseline	degraded mode active
stale age near hard TTL	freshness risk
deadline too short spike	upstream budget mismatch
hedging extra load high	speculative load risk
rate limit denies critical caller	quota/capacity mismatch
policy validation failure	unsafe config

Alerts should include:

dependency,
operation,
caller,
current policy version,
recent config changes,
runbook link.

22. Runbook Template

# Runbook: case-service.createEscalation communication failure

## Symptoms
- timeout rate above 5%
- circuit breaker open
- bulkhead rejection > 10%
- fallback not allowed
- upstream workflows failing

## First checks
1. Check dependency health dashboard.
2. Check recent deploys for caller and provider.
3. Check policy version and recent overrides.
4. Check timeout vs deadline remaining.
5. Check retry volume and retry budget.
6. Check idempotency-key presence.
7. Check outbox/reconciliation backlog.

## Safe mitigations
- reduce retry attempts if retry storm
- force circuit open if dependency is collapsing
- shed batch traffic
- increase bulkhead only if provider has capacity
- route to async durable intent only if approved
- do not enable fake success fallback

## Unsafe mitigations
- disabling idempotency requirement
- retrying commands with new keys
- increasing timeout above gateway deadline
- returning success without provider confirmation

Runbooks are part of the policy.

Do not make operators infer semantics during incidents.

23. OpenAPI Linkage

Operation policy should link to OpenAPI operation IDs.

Example:

operations:
  createEscalation:
    openapiOperationId: createCaseEscalation
    method: POST
    route: /v1/case-escalations

CI should verify:

operation exists in OpenAPI,
method/route match,
idempotency header required if policy says required,
documented error statuses match retry/error policy,
fallback/degradation documented if exposed to consumers.

Contract and runtime policy must not drift.

24. Generated Client Wrapper Linkage

Generated clients must not bypass policy.

Policy executor should be outside generated code.

Generated code should not decide:

retry,
timeout,
fallback,
idempotency,
exception taxonomy,
metric names,
deadline propagation.

The owned adapter decides.
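In code, the owned adapter is the only class that touches the generated client, and it routes every call through the policy executor. The interfaces below are simplified stand-ins: `GeneratedCaseApi` represents whatever the code generator emits, and `PolicyExecutor` is a reduced, hypothetical form of the executor from this lesson.

```java
import java.util.function.Supplier;

// Stand-in for a generated client; real generated code would come from OpenAPI tooling.
interface GeneratedCaseApi {
    String getCase(String caseId);
}

// Reduced, hypothetical form of the policy executor.
interface PolicyExecutor {
    <T> T execute(String dependency, String operation, Supplier<T> transportCall);
}

// The owned adapter: business code depends on this class, never on GeneratedCaseApi.
final class CaseServiceAdapter {
    private final GeneratedCaseApi generated;
    private final PolicyExecutor executor;

    CaseServiceAdapter(GeneratedCaseApi generated, PolicyExecutor executor) {
        this.generated = generated;
        this.executor = executor;
    }

    String getCase(String caseId) {
        // The dependency and operation names select the policy; the lambda is only transport.
        return executor.execute("case-service", "getCase", () -> generated.getCase(caseId));
    }
}
```

Keeping the generated type out of the adapter's public signature makes a policy bypass a compile-time dependency that reviewers and architecture tests can spot.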

25. Service Mesh and Gateway Alignment

Application policy is not the only policy.

Also check:

gateway timeout,
ingress timeout,
service mesh retries,
service mesh circuit breakers,
proxy connection pool limits,
local rate limits,
global rate limits,
load balancer outlier detection,
Kubernetes readiness behavior.

Misalignment example:

app retry maxAttempts=2
mesh retry maxAttempts=3
gateway retry maxAttempts=2

Worst-case:

2 × 3 × 2 = 12 attempts

Avoid hidden amplification.

Document which layer owns each behavior.
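The worst case is a product, not a sum, because each layer retries every failed call it sees from the layer below. A small check like the following sketch can run in CI against the documented per-layer settings; `RetryAmplification` is a hypothetical name.

```java
import java.util.List;

final class RetryAmplification {
    private RetryAmplification() {}

    // Each layer retries independently, so worst-case attempts multiply across layers.
    static long worstCaseAttempts(List<Integer> maxAttemptsPerLayer) {
        long total = 1;
        for (int attempts : maxAttemptsPerLayer) {
            if (attempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be >= 1");
            }
            total = Math.multiplyExact(total, attempts);
        }
        return total;
    }
}
```

For the example above, `worstCaseAttempts(List.of(2, 3, 2))` returns 12.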

26. Ownership Matrix

Policy area	Owner
operation semantics	service/domain owner
timeout/deadline	caller owner + platform
retry eligibility	caller owner + provider contract
idempotency	provider owner
circuit breaker threshold	caller owner + platform
bulkhead limit	caller owner
rate limit quota	provider owner/platform
fallback semantics	domain/product owner
stale data max age	domain/product owner
observability	platform + service owner
mesh/gateway config	platform
runbook	service owner

Resilience policy crosses teams.

Make ownership explicit.

27. Environment Differences

Production and staging differ.

But staging should still validate policy shape.

Example:

environments:
  production:
    case-service.getCase.bulkhead.maxConcurrentCalls: 80
  staging:
    case-service.getCase.bulkhead.maxConcurrentCalls: 10

Do not disable all resilience in staging.

You need to catch:

missing deadlines,
invalid fallback,
bad retry classifier,
missing metrics,
OpenAPI mismatch,
generated client bypass.

Use smaller numbers, not absent policy.

28. Testing Strategy

Policy testing layers:

Test	Purpose
schema validation	YAML shape
semantic validation	dangerous combinations
unit tests	classifiers/deciders
stub tests	HTTP behavior
integration tests	actual client adapter
failure injection	timeout/retry/breaker/fallback
load tests	capacity and overload
canary	production safety
chaos game day	operational readiness

Example semantic test:

@Test
void rejectsRetryForSideEffectingCommandWithoutIdempotency() {
    OperationPolicy policy = policyBuilder()
        .sideEffecting(true)
        .idempotencyKeyRequired(false)
        .retryEnabled(true)
        .build();

    assertThatThrownBy(() -> validator.validate(policy))
        .isInstanceOf(InvalidPolicyException.class);
}

Example HTTP behavior test:

@Test
void createEscalationRetriesWithSameIdempotencyKey() {
    stub.transientFailureThenSuccess();

    client.createEscalation(command);

    stub.verifyPostCount(2);
    stub.verifySameHeaderAcrossAttempts("Idempotency-Key");
}

29. Failure Injection Matrix

Test each pattern deliberately.

Failure	Expected policy behavior
connect timeout	retry if safe, breaker records
read timeout on command	retry only with idempotency
400	no retry, breaker ignores
409 domain conflict	no retry, breaker ignores
429	respect retry-after if deadline fits
503	retry bounded, breaker records
bulkhead full	fail/degrade, breaker ignores
circuit open	no remote call, fallback/fail-fast
deadline too short	reject before work
stale cache exists	stale response if allowed
stale cache too old	fail fast
dependency slow p99	slow-call breaker/load shedding
retry budget exhausted	no retry
hedge budget exhausted	no hedge

If this table is not tested, the policy is mostly aspirational.

30. Policy Versioning

Policy should have a version.

communication:
  policyVersion: 2026-07-05.1

Emit version in metrics/logs:

communication.policy.version=2026-07-05.1

Why?

During incident:

Did behavior change because code changed or policy changed?

Policy version makes that answer observable.

31. Policy Documentation Generation

Because policy is structured, generate docs.

Generated doc per dependency:

# Dependency: case-service

Owner: case-platform
Criticality: high

## Operation: getCase
- Deadline: 300 ms default, 600 ms max
- Retry: enabled, max 2 attempts
- Circuit breaker: enabled
- Bulkhead: 80 concurrent
- Fallback: fail-fast
- Hedging: disabled due to strong consistency
- Error retryability: 429/502/503/504
- Dashboard: ...
- Runbook: ...

This keeps docs aligned with runtime.

Do not maintain policy docs manually if you can generate them.
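Generation can start as a plain string renderer over the policy object. The sketch below uses a trimmed-down, hypothetical `OperationDoc` record rather than the full `OperationPolicy`, and renders only a few of the lines shown in the example.

```java
// Hypothetical, trimmed-down view of one operation policy.
record OperationDoc(
    String name,
    int deadlineDefaultMs,
    int deadlineMaxMs,
    boolean retryEnabled,
    int maxAttempts,
    int bulkheadMaxConcurrent,
    String fallbackType
) {}

final class PolicyDocGenerator {
    private PolicyDocGenerator() {}

    // Renders one operation section in the same shape as the example document.
    static String render(OperationDoc op) {
        StringBuilder sb = new StringBuilder();
        sb.append("## Operation: ").append(op.name()).append('\n');
        sb.append("- Deadline: ").append(op.deadlineDefaultMs()).append(" ms default, ")
            .append(op.deadlineMaxMs()).append(" ms max\n");
        sb.append("- Retry: ")
            .append(op.retryEnabled() ? "enabled, max " + op.maxAttempts() + " attempts" : "disabled")
            .append('\n');
        sb.append("- Bulkhead: ").append(op.bulkheadMaxConcurrent()).append(" concurrent\n");
        sb.append("- Fallback: ").append(op.fallbackType()).append('\n');
        return sb.toString();
    }
}
```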

32. Example Review Questions

When reviewing a new operation:

Is it query or command?
Is it side-effecting?
Can it be retried safely?
Does provider support idempotency?
What is the caller deadline?
What is dependency p99?
What is max useful timeout?
What happens on unknown outcome?
Is stale fallback safe?
Is partial response safe?
Is circuit breaker threshold appropriate?
What is bulkhead limit based on?
Does rate limit exist?
Can this operation be hedged?
What metrics show degradation?
What is the runbook?

These questions are more valuable than asking "did you add @Retry?"

33. Pattern Selection Cheat Sheet

Situation	Prefer
transient network failure	bounded retry with jitter
dependency sustained failure	circuit breaker
dependency slow tail	timeout, maybe hedging for safe reads
dependency consumes too many resources	bulkhead
caller exceeds quota	rate limit
system overloaded	load shedding
query can tolerate old data	stale cache fallback
optional enrichment fails	partial response/omit
command cannot complete now	fail fast or durable intent
unknown command outcome	idempotency + dedup replay
fan-out high	concurrency cap + deadline
gateway/app timeout mismatch	deadline propagation
hidden generated client behavior	owned wrapper

Use this as a design starting point.

Not as a substitute for thinking.

34. The Final Phase 5 Template

For every synchronous dependency operation, create this record:

operation:
  identity:
    dependency:
    operation:
    method:
    route:
    owner:
    criticality:

  semantics:
    readOnly:
    sideEffecting:
    idempotencyKeyRequired:
    consistency:
    priority:

  deadline:
    defaultMs:
    maxMs:
    minUsefulMs:
    propagate:

  timeout:
    connectMs:
    poolAcquireMs:
    responseMs:
    totalAttemptMs:

  retry:
    enabled:
    maxAttempts:
    retryableFailures:
    nonRetryableFailures:
    backoff:
    jitter:
    budget:
    idempotencyRequired:

  circuitBreaker:
    enabled:
    window:
    minimumCalls:
    failureRate:
    slowCallRate:
    halfOpen:

  bulkhead:
    enabled:
    type:
    maxConcurrent:
    maxWait:
    queueCapacity:

  rateLimit:
    enabled:
    dimensions:
    rate:
    burst:
    wait:

  hedging:
    enabled:
    delay:
    budget:
    suppressionRules:

  fallback:
    type:
    maxStaleness:
    partialAllowed:
    fakeSuccessAllowed:
    asyncHandoff:

  observability:
    metrics:
    traces:
    logs:
    dashboard:
    alerts:
    runbook:

This is the reusable policy artifact.

35. Anti-Patterns

35.1 Annotation soup

@Retry, @CircuitBreaker, @TimeLimiter, and @RateLimiter scattered with no semantic policy.

35.2 Policy invisible to business owners

Fallback/staleness decisions are business decisions.

35.3 Mesh retries plus app retries

Hidden attempt multiplication.

35.4 Timeout constants copied everywhere

No deadline propagation.

35.5 All 5xx retryable

Crude and dangerous.

35.6 Circuit breaker counts caller bugs

Caller bugs (4xx responses) open the breaker for a healthy provider.

35.7 Bulkhead without fallback

Rejections become generic 500.

35.8 Fallback success counted as normal success

Degradation invisible.

35.9 No policy validation

Dangerous combinations reach production.

35.10 No runbook

Operators cannot safely change behavior during incidents.

36. Phase 5 Summary

The resilience patterns are not independent tricks.

They form one control system:

deadline decides whether work is still useful
timeout bounds each wait
retry handles transient failure
circuit breaker stops repeated failure
bulkhead isolates resource use
rate limit controls demand
load shedding preserves survival
hedging fights tail latency carefully
fallback preserves semantic usefulness
observability proves what happened

A top-tier Java microservice team does not ask:

Which library annotation should I add?

It asks:

What is the communication policy for this operation under success, slowness, overload, partial failure, and unknown outcome?

That is the level of thinking required for production-grade synchronous microservice communication.

Part 049 starts the gRPC phase.

References

Resilience4j Getting Started: https://resilience4j.readme.io/docs/getting-started
Resilience4j CircuitBreaker: https://resilience4j.readme.io/docs/circuitbreaker
Resilience4j Retry: https://resilience4j.readme.io/docs/retry
Resilience4j Bulkhead: https://resilience4j.readme.io/docs/bulkhead
Resilience4j RateLimiter: https://resilience4j.readme.io/docs/ratelimiter
Resilience4j TimeLimiter: https://resilience4j.readme.io/docs/timeout
AWS Builders Library — Timeouts, retries, and backoff with jitter: https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
Google SRE Book — Handling Overload: https://sre.google/sre-book/handling-overload/
Google SRE Book — Addressing Cascading Failures: https://sre.google/sre-book/addressing-cascading-failures/
gRPC Deadlines: https://grpc.io/docs/guides/deadlines/
