
Hedged Requests: Tail Latency vs Amplified Load

Learn Java Microservices Communication - Part 045

Hedged requests for Java microservices: tail latency mitigation, duplicate request semantics, quorum/fan-out mental model, when to hedge, when not to hedge, cancellation, idempotency, load amplification, adaptive hedging, implementation sketches, testing, observability, and production policy.


Part 045 — Hedged Requests: Tail Latency vs Amplified Load

A hedged request is a deliberate duplicate request sent after a delay to reduce tail latency.

The idea is simple:

If the first attempt is unusually slow, send a second attempt to another equivalent replica and use whichever response returns first.

This can reduce p99 latency.

It can also double load, duplicate side effects, overload dependencies, confuse observability, and create correctness bugs.

Hedging is not retry.

Hedging is speculative parallelism.

Use it only when the operation and infrastructure can tolerate it.


1. The Core Mental Model

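Normal request: the client sends one attempt to one replica and waits for its response.

Hedged request: the client sends a primary attempt; if it has not completed after the hedge delay, the client sends a second attempt to another equivalent replica.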

The client intentionally creates a race.

The fastest acceptable response wins.

The loser is cancelled or ignored.

The main benefit:

reduce user-visible tail latency

The main cost:

increase backend work

2. Why Tail Latency Matters

A single service call may have acceptable average latency.

But user-facing requests often fan out.

If one request depends on many subrequests, the slowest subrequest can dominate total latency.

Example:

If each dependency has a 1% chance of being slow, the probability that at least one dependency is slow grows with fan-out.

This is why tail latency matters more in large systems than averages suggest.
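To make the fan-out effect concrete, here is a minimal sketch that computes the probability that at least one of n subrequests is slow. It assumes the dependencies are slow independently of each other, which real systems only approximate. The class and method names are illustrative.

```java
final class FanOutTailProbability {
    private FanOutTailProbability() {}

    /**
     * Probability that at least one of fanOut independent calls is slow:
     * 1 - (1 - p)^fanOut, where p is the per-call slow probability.
     */
    static double atLeastOneSlow(double slowProbability, int fanOut) {
        if (slowProbability < 0.0 || slowProbability > 1.0) {
            throw new IllegalArgumentException("slowProbability must be in [0, 1]");
        }
        if (fanOut < 0) {
            throw new IllegalArgumentException("fanOut must be >= 0");
        }
        return 1.0 - Math.pow(1.0 - slowProbability, fanOut);
    }
}
```

With a 1% per-call slow probability, a fan-out of 1 gives 1%, a fan-out of 10 gives about 9.6%, and a fan-out of 100 gives about 63%.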

The "Tail at Scale" paper by Dean and Barroso describes hedged requests as one technique for reducing tail latency in large-scale online services.


3. Hedging vs Retry

Hedging and retry both send more than one request.

But their timing and meaning differ.

| Pattern | When second request is sent | Goal |
|---|---|---|
| Retry | after failure/timeout/error | recover failed attempt |
| Hedge | before primary fails, after delay | beat slow tail attempt |
| Parallel duplicate | immediately | minimize latency at high load cost |
| Quorum request | multiple requests required by protocol | correctness/consensus |
| Fallback | call alternate path after failure | degrade or recover |

Retry says:

the first attempt failed

Hedging says:

the first attempt is not done yet and may be a tail outlier

That difference matters.

Hedging can create duplicate in-flight work even when there is no failure.


4. When Hedging Works

Hedging works best when:

  • operation is read-only or idempotent,
  • replicas are equivalent,
  • latency has high variance,
  • tail latency is important,
  • slow requests are caused by per-replica noise,
  • dependency has spare capacity,
  • hedge delay is chosen carefully,
  • losing request can be cancelled or cheaply ignored,
  • duplicate work does not corrupt state,
  • client has enough budget,
  • observability separates primary and hedge attempts.

Example good candidates:

| Operation | Hedge suitability |
|---|---|
| read from replicated cache | high |
| search read replica query | medium to high |
| read-only metadata lookup | medium |
| idempotent object fetch | high |
| non-critical enrichment read | medium |
| expensive distributed query | low unless controlled |
| payment command | no |
| create escalation command | usually no |
| send notification command | no |
| mutate workflow state | no |

Hedging is mostly a read-path technique.


5. When Hedging Is Dangerous

Avoid hedging when:

  • operation has side effects,
  • server cannot deduplicate,
  • losing request cannot be cancelled,
  • dependency is already overloaded,
  • downstream cost is high,
  • request holds locks,
  • request triggers database writes,
  • request triggers external provider call,
  • result is not deterministic,
  • replicas are not equivalent,
  • consistency model can return conflicting answers,
  • fan-out is already large,
  • retry is already active,
  • circuit breaker or bulkhead is saturated.

Bad example:

POST /v1/payments

Hedging this can create duplicate payments unless the entire path is idempotent and deduplicated.

Even then, hedging commands is usually unnecessary and risky.


6. Hedge Delay

Never send a hedge immediately by default.

An immediate duplicate request doubles the load for every call.

Instead:

send hedge only after primary exceeds a threshold

Typical hedge delay choices:

| Strategy | Meaning |
|---|---|
| fixed delay | hedge after 50 ms |
| percentile-based | hedge after p95 latency |
| adaptive delay | hedge based on current latency distribution |
| budget-aware | hedge only if remaining deadline can fit |
| load-aware | hedge only when dependency not overloaded |
| priority-aware | hedge only for high-priority requests |

A practical starting point:

hedgeDelay = dependency p95 latency
maxHedges = 1

Why p95?

Because most requests complete normally and do not hedge.

Only the slow tail gets a duplicate attempt.
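A percentile-based delay can be derived from recently observed latencies. The following is a minimal sketch using the nearest-rank method over an in-memory sample array. The class name, the millisecond unit, and the idea of passing raw samples are assumptions for illustration; a production client would more likely read the percentile from its metrics library.

```java
import java.time.Duration;
import java.util.Arrays;

final class HedgeDelayEstimator {
    private HedgeDelayEstimator() {}

    /**
     * Nearest-rank percentile of the observed latencies.
     * percentile must be in (0, 100]; the input array is not modified.
     */
    static Duration percentile(long[] latenciesMillis, double percentile) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        if (percentile <= 0.0 || percentile > 100.0) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        // Nearest rank: the smallest value such that at least `percentile` percent
        // of the samples are less than or equal to it.
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return Duration.ofMillis(sorted[Math.max(rank, 1) - 1]);
    }
}
```

Calling `HedgeDelayEstimator.percentile(samples, 95.0)` yields a candidate `hedgeDelay`. The sample window should be long enough to be stable and short enough to follow real changes in the dependency.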


7. Expected Extra Load

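If the hedge delay is set to the dependency's p95 latency, about 5% of calls exceed it and trigger a hedge, as long as the latency distribution stays close to the one the p95 was measured on.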

Roughly:

extra request load ≈ hedge rate × original load

At 1000 RPS:

| Hedge rate | Extra RPS |
|---|---|
| 1% | 10 |
| 5% | 50 |
| 10% | 100 |
| 50% | 500 |
| 100% | 1000 |

A small hedge rate can be acceptable.

A high hedge rate can become self-inflicted overload.

Therefore:

Hedging must have a hedge budget.


8. Hedge Budget

A hedge budget limits speculative duplicate traffic.

Example:

hedged_attempts <= 2% of original attempts over rolling 1 minute

If budget is exhausted:

do not hedge

This prevents the worst scenario:

dependency slows
→ more requests exceed hedge delay
→ more hedges created
→ dependency receives more load
→ latency worsens
→ even more hedges

That feedback loop can become a retry-storm equivalent.

Hedge budget is as important as retry budget.
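A hedge budget can be sketched as a ratio limiter: count primary attempts, and allow a hedge only while hedges stay within the configured fraction of primaries. The version below is a simplification under stated assumptions: it uses a fixed window that resets, not a true rolling window, and the class name, constructor parameters, and injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;

final class HedgeBudget {
    private final double maxHedgeRatio;
    private final long windowNanos;
    private final LongSupplier nanoClock; // injected so tests can control time
    private long windowStart;
    private long primaries;
    private long hedges;

    HedgeBudget(double maxHedgeRatio, long windowNanos, LongSupplier nanoClock) {
        if (maxHedgeRatio < 0.0 || maxHedgeRatio > 1.0) {
            throw new IllegalArgumentException("maxHedgeRatio must be in [0, 1]");
        }
        if (windowNanos <= 0) {
            throw new IllegalArgumentException("windowNanos must be positive");
        }
        this.maxHedgeRatio = maxHedgeRatio;
        this.windowNanos = windowNanos;
        this.nanoClock = nanoClock;
        this.windowStart = nanoClock.getAsLong();
    }

    /** Call once for every primary (original) attempt. */
    synchronized void recordPrimary() {
        rollWindowIfExpired();
        primaries++;
    }

    /** Returns true and consumes budget only if one more hedge stays within the ratio. */
    synchronized boolean tryAcquireHedge() {
        rollWindowIfExpired();
        if (hedges + 1 > maxHedgeRatio * primaries) {
            return false;
        }
        hedges++;
        return true;
    }

    // Fixed-window reset: both counters start from zero when the window expires.
    private void rollWindowIfExpired() {
        long now = nanoClock.getAsLong();
        if (now - windowStart >= windowNanos) {
            windowStart = now;
            primaries = 0;
            hedges = 0;
        }
    }
}
```

With a ratio of 0.02 and 100 recorded primaries, two hedges are granted and the third is denied. Because the budget is tied to primary traffic, a slowing dependency cannot pull in more hedges than the ratio allows.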


9. Hedging and Bulkhead

Hedged attempts must consume bulkhead capacity.

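If hedges bypass bulkheads, they can overload the dependency path.

Better: route every attempt, primary and hedge alike, through the dependency's bulkhead so that each one must acquire a permit.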

But be careful: if primary holds one permit and hedge holds another, one logical call can consume two slots.

Policy options:

| Policy | Behavior |
|---|---|
| shared bulkhead | hedges compete with normal calls |
| separate hedge bulkhead | cap hedge attempts independently |
| hedge budget + shared bulkhead | good default |
| no hedge when bulkhead near full | safer during pressure |

Recommended:

hedge only when bulkhead has spare capacity

Hedging under saturation usually worsens the problem.
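One way to express "hedge only when the bulkhead has spare capacity" is a shared semaphore where hedges must leave a reserve of permits for primary calls. This is a sketch, not a replacement for a real bulkhead library; the class name and the `reservedForPrimaries` parameter are assumptions.

```java
import java.util.concurrent.Semaphore;

final class HedgeAwareBulkhead {
    private final Semaphore permits;
    private final int reservedForPrimaries;

    HedgeAwareBulkhead(int maxConcurrentCalls, int reservedForPrimaries) {
        if (maxConcurrentCalls <= 0) {
            throw new IllegalArgumentException("maxConcurrentCalls must be positive");
        }
        if (reservedForPrimaries < 0 || reservedForPrimaries > maxConcurrentCalls) {
            throw new IllegalArgumentException("reservedForPrimaries out of range");
        }
        this.permits = new Semaphore(maxConcurrentCalls);
        this.reservedForPrimaries = reservedForPrimaries;
    }

    /** Primary attempts may use any free permit. */
    boolean tryAcquirePrimary() {
        return permits.tryAcquire();
    }

    /**
     * Hedges take a permit only if the reserve for primaries stays intact.
     * Acquire first, then check, so the decision is based on a permit we already hold.
     */
    boolean tryAcquireHedge() {
        if (!permits.tryAcquire()) {
            return false;
        }
        if (permits.availablePermits() < reservedForPrimaries) {
            permits.release();
            return false;
        }
        return true;
    }

    /** Must be called exactly once for every successful acquire. */
    void release() {
        permits.release();
    }
}
```

Both attempts of one logical call still hold separate permits, so the reserve should be sized with that in mind.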


10. Hedging and Circuit Breaker

Do not hedge into an open circuit.

If the dependency breaker is open, speculative duplicate calls should not be created.

For multi-replica or multi-zone clients, there may be per-zone/per-endpoint breakers.

A hedge should select only healthy candidates.

If all candidates are unhealthy, fail fast or fall back.


11. Hedging and Retry

Hedging and retry can multiply attempts.

Example:

max attempts = 2 (original + 1 retry)
max hedges = 1 per attempt

Worst-case remote attempts:

2 attempts × (1 primary + 1 hedge) = 4 calls

With fan-out, this gets ugly fast.

Recommended:

  • avoid combining hedging and retry unless necessary,
  • use small retry count,
  • use shared attempt budget,
  • classify hedged attempt separately,
  • do not retry loser cancellation as failure,
  • do not hedge retry attempts by default.

Policy:

one logical call may use at most N total remote attempts

Example:

maxTotalAttemptsIncludingHedges = 2

That means:

  • either primary + one hedge,
  • or primary + one retry,
  • not both.
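The total attempt cap can be sketched as a small counter shared by the retry logic and the hedging logic of one logical call. Every remote attempt, whatever its reason, must take a slot first. The class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class AttemptBudget {
    private final int maxTotalAttempts;
    private final AtomicInteger used = new AtomicInteger();

    AttemptBudget(int maxTotalAttempts) {
        if (maxTotalAttempts < 1) {
            throw new IllegalArgumentException("maxTotalAttempts must be >= 1");
        }
        this.maxTotalAttempts = maxTotalAttempts;
    }

    /** Primary, retry, and hedge attempts all call this before going remote. */
    boolean tryAcquire() {
        while (true) {
            int current = used.get();
            if (current >= maxTotalAttempts) {
                return false;
            }
            if (used.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    int attemptsUsed() {
        return used.get();
    }
}
```

With `new AttemptBudget(2)`, the primary takes the first slot and whichever of the retry or the hedge asks first takes the second; the other is refused.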

12. Hedging and Idempotency

For read operations, hedging is usually semantically safe.

For commands, hedging is dangerous.

If a command must be hedged, which is rare, it requires:

  • idempotency key,
  • server-side deduplication,
  • stable request fingerprint,
  • external side-effect dedup,
  • response replay,
  • loser cancellation awareness,
  • downstream idempotent consumers,
  • strict observability.

Even then, ask:

Why are we hedging a command instead of making it asynchronous, improving dependency latency, or fixing tail causes?

Most production systems should not hedge commands.


13. Hedging and Cancellation

When one attempt wins, the client should cancel the loser.

But cancellation is not guaranteed to stop server work.

For HTTP:

  • connection may be closed,
  • request may already be executing,
  • server may not notice immediately,
  • database query may continue,
  • side effects may already have happened.

For gRPC:

  • cancellation is a first-class concept,
  • server can observe cancellation and stop work,
  • but committed side effects still cannot be undone.

Cancellation reduces wasted work only if the stack propagates and honors it.


14. Hedging and Consistency

If replicas can return different versions of data, hedging may return an older but faster result.

Example:

Replica A: updated case status = CLOSED, slower
Replica B: stale case status = OPEN, faster

Hedging chooses the faster response.

Is that acceptable?

Depends on the operation.

For eventually consistent reads:

maybe acceptable

For decision-making reads:

possibly unsafe

For regulatory action eligibility:

usually unsafe unless consistency contract allows it

Hedging must respect read consistency requirements.


15. Hedging and Cache

Hedging can be used between cache and origin in controlled ways.

Example:

  • start cache lookup,
  • if the cache is slow or misses, start origin lookup,
  • use first acceptable response,
  • cancel loser.

But this is not always wise.

If origin is expensive, hedging cache/origin can increase backend load.

Better common pattern:

cache first with very short budget
then origin

or:

origin first, stale cache fallback on failure

Hedging is one option, not the default.


16. Hedging and Fan-Out

Fan-out plus hedging multiplies load.

If one request fans out to 20 shards and each shard can hedge once:

max attempts = 40 shard calls

If 100 frontend requests arrive:

4000 shard calls

Hedging in fan-out systems must be extremely controlled.

Techniques:

  • hedge only the slowest subrequests,
  • hedge only after most responses returned,
  • cap total hedges per parent request,
  • use per-shard budget,
  • cancel losers aggressively,
  • do not hedge when system load is high.
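The arithmetic above, and the effect of capping hedges per parent request, can be checked with a small helper. This is only a worst-case calculator; the names and the example cap of 2 hedges per parent are assumptions for illustration.

```java
final class FanOutLoad {
    private FanOutLoad() {}

    /** Worst case when every shard call may hedge up to maxHedgesPerShardCall times. */
    static long worstCaseShardCalls(long parentRequests, int shardsPerRequest, int maxHedgesPerShardCall) {
        return parentRequests * shardsPerRequest * (1L + maxHedgesPerShardCall);
    }

    /** Worst case when hedges are capped per parent request instead of per shard call. */
    static long worstCaseWithParentCap(long parentRequests, int shardsPerRequest, int maxHedgesPerParent) {
        return parentRequests * ((long) shardsPerRequest + maxHedgesPerParent);
    }
}
```

For 100 parent requests over 20 shards, one hedge per shard call allows up to 4000 shard calls, while a cap of 2 hedges per parent allows at most 2200.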

17. Basic Java Hedging Skeleton

This is conceptual.

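import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// HedgePolicy and Deadline are application types assumed to be defined elsewhere.
public final class HedgedCallExecutor {
    private final ScheduledExecutorService scheduler;
    private final HedgePolicy policy;

    public HedgedCallExecutor(ScheduledExecutorService scheduler, HedgePolicy policy) {
        this.scheduler = scheduler;
        this.policy = policy;
    }

    public <T> CompletableFuture<T> execute(
        Supplier<CompletableFuture<T>> primarySupplier,
        Supplier<CompletableFuture<T>> hedgeSupplier,
        Deadline deadline
    ) {
        CompletableFuture<T> result = new CompletableFuture<>();
        CompletableFuture<T> primary = primarySupplier.get();

        AtomicReference<CompletableFuture<T>> hedgeRef = new AtomicReference<>();

        primary.whenComplete((value, error) -> completeIfFirst(result, value, error));

        // Start the hedge only if the primary is still pending after the hedge delay.
        ScheduledFuture<?> hedgeTask = scheduler.schedule(() -> {
            if (result.isDone()) {
                return;
            }
            if (!policy.canHedge(deadline)) {
                return;
            }

            CompletableFuture<T> hedge = hedgeSupplier.get();
            hedgeRef.set(hedge);

            hedge.whenComplete((value, error) -> completeIfFirst(result, value, error));

            // The result may have completed while the hedge was being started.
            // The cleanup below may then have missed this hedge, so cancel it here.
            if (result.isDone()) {
                hedge.cancel(true);
            }
        }, policy.hedgeDelay().toMillis(), TimeUnit.MILLISECONDS);

        // Clean up once a winner (or the timeout) completes the result.
        // CompletableFuture.cancel only completes the future; it does not interrupt
        // the underlying work unless the supplier wires cancellation through.
        result.whenComplete((value, error) -> {
            hedgeTask.cancel(false);
            if (!primary.isDone()) {
                primary.cancel(true);
            }
            CompletableFuture<T> hedge = hedgeRef.get();
            if (hedge != null && !hedge.isDone()) {
                hedge.cancel(true);
            }
        });

        // orTimeout completes the same future exceptionally when the deadline passes.
        return result.orTimeout(deadline.remainingMillis(), TimeUnit.MILLISECONDS);
    }

    // First completion wins, including errors. complete/completeExceptionally are
    // no-ops once the result is already done, so later completions are ignored.
    private <T> void completeIfFirst(CompletableFuture<T> result, T value, Throwable error) {
        if (error == null) {
            result.complete(value);
        } else {
            result.completeExceptionally(error);
        }
    }
}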

This skeleton is incomplete for production.

It still needs:

  • failure classification,
  • loser cancellation metrics,
  • hedge budget,
  • deadline-aware scheduling,
  • bulkhead integration,
  • trace context,
  • error aggregation,
  • interruption behavior,
  • request identity,
  • executor management.

18. Winner Selection Is Not Always First Response

"First response wins" is too simplistic.

A fast error may arrive before a slow success.

Should the fast error win?

Example:

Attempt A returns 503 after 20 ms
Attempt B returns 200 after 80 ms

If the client accepts first completion, it fails unnecessarily.

Better policy:

| Event | Behavior |
|---|---|
| first success | complete successfully, cancel losers |
| retryable failure while another attempt pending | wait if deadline allows |
| terminal failure from all attempts | fail |
| mixed failures | choose best classified error |
| deadline exceeded | cancel all attempts |

A hedged executor should prefer success over retryable failure.

But it must still respect deadline.
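A success-preferring race can be sketched with plain `CompletableFuture`: the result completes with the first successful attempt and fails only after every attempt has failed. This sketch assumes all attempts are already started and passed in as a list. It reports the last failure and does not classify errors or distinguish retryable from terminal ones; the class name is hypothetical. The caller is expected to apply the deadline, for example with `orTimeout`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class SuccessPreferringRace {
    private SuccessPreferringRace() {}

    /**
     * Completes with the first successful attempt.
     * Completes exceptionally only when all attempts have failed.
     */
    static <T> CompletableFuture<T> firstSuccess(List<CompletableFuture<T>> attempts) {
        CompletableFuture<T> result = new CompletableFuture<>();
        if (attempts.isEmpty()) {
            result.completeExceptionally(new IllegalArgumentException("no attempts"));
            return result;
        }
        // Counts attempts that have not failed yet.
        AtomicInteger notYetFailed = new AtomicInteger(attempts.size());
        for (CompletableFuture<T> attempt : attempts) {
            attempt.whenComplete((value, error) -> {
                if (error == null) {
                    result.complete(value); // no-op if another attempt already won
                } else if (notYetFailed.decrementAndGet() == 0) {
                    result.completeExceptionally(error); // every attempt failed
                }
            });
        }
        return result;
    }
}
```

In the 503-then-200 example, the fast 503 from attempt A leaves the result pending, and the 200 from attempt B completes it successfully.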


19. Candidate Selection

Do not send primary and hedge to the same overloaded replica if avoidable.

Candidate choices:

  • different connection,
  • different replica,
  • different zone,
  • different cache node,
  • different read replica,
  • different shard replica.

But do not violate consistency, locality, or data residency.

Candidate selection must respect:

  • tenant placement,
  • region,
  • routing policy,
  • service mesh behavior,
  • sticky session requirements,
  • cache key ownership,
  • authorization context,
  • consistency level.

If a load balancer or service mesh chooses the endpoint, the client may not control candidate selection, and the hedge can land on the same slow replica as the primary.

In that case, hedging is less precise and more risky.


20. Adaptive Hedging

Static hedge delay can be wrong as traffic changes.

Adaptive hedging adjusts delay based on current latency and load.

Possible inputs:

  • p95/p99 latency,
  • current success rate,
  • bulkhead saturation,
  • circuit breaker state,
  • dependency load,
  • hedge budget remaining,
  • request priority,
  • deadline remaining.

Policy:

if dependency healthy and tail high and budget available:
  hedge after p95
else:
  do not hedge

Adaptive hedging is powerful but can create feedback loops.

Use simple static hedging first.

Add adaptation only with strong observability.


21. Hedging Policy Object

public record HedgePolicy(
    boolean enabled,
    Duration hedgeDelay,
    int maxHedges,
    double maxHedgeRatio,
    boolean requireIdempotentOperation,
    boolean disableWhenBulkheadSaturated,
    boolean disableWhenCircuitOpen
) {
    public HedgePolicy {
        if (maxHedges < 0) {
            throw new IllegalArgumentException("maxHedges must be >= 0");
        }
        if (hedgeDelay.isNegative() || hedgeDelay.isZero()) {
            throw new IllegalArgumentException("hedgeDelay must be positive");
        }
    }
}

Operation semantics:

public record OperationSemantics(
    String operationName,
    boolean readOnly,
    boolean idempotent,
    boolean strongConsistencyRequired,
    boolean sideEffecting
) {}

Decision:

// policy, hedgeBudget, and minAttemptDuration are fields of the enclosing class.
public boolean canHedge(OperationSemantics semantics, RuntimeSignals signals, Deadline deadline) {
    if (!policy.enabled()) return false;
    if (semantics.sideEffecting()) return false;
    if (policy.requireIdempotentOperation() && !semantics.idempotent()) return false;
    if (semantics.strongConsistencyRequired()) return false;
    if (signals.bulkheadSaturated() && policy.disableWhenBulkheadSaturated()) return false;
    if (signals.circuitOpen() && policy.disableWhenCircuitOpen()) return false;
    // Check the deadline before the budget, so a hedge that cannot fit does not consume budget.
    if (deadline.remaining().compareTo(policy.hedgeDelay().plus(minAttemptDuration)) <= 0) return false;
    // Acquire budget last: tryAcquire has a side effect.
    return hedgeBudget.tryAcquire();
}

Hedging should be a deliberate decision, not a utility method anyone can call.


22. HTTP Client Implementation Considerations

JDK HttpClient

sendAsync returns CompletableFuture.

Hedging can be implemented by starting a second sendAsync.

But cancellation behavior depends on the underlying request state.

You still need:

  • request timeout,
  • deadline,
  • connection pool limits,
  • separate metrics,
  • candidate selection if possible,
  • safe request body replay.

Request body matters.

A streaming request body may not be replayable.

Do not hedge non-repeatable request bodies.

Spring WebClient

Reactive composition can race publishers:

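Mono<Response> primary = callPrimary();
// Mono.defer builds the hedge call only when the delay has elapsed and it is subscribed.
Mono<Response> hedge = Mono.delay(hedgeDelay).then(Mono.defer(() -> callHedge()));

// firstWithValue picks the first source that emits a value and cancels the other;
// an error from one source does not win while the other is still pending.
Mono<Response> result = Mono.firstWithValue(primary, hedge);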

But you need:

  • error handling,
  • cancellation semantics,
  • context propagation,
  • metrics,
  • timeout,
  • scheduler control.

Blocking RestClient/Feign

Hedging blocking calls requires extra threads or virtual threads.

Be careful:

  • hedging doubles blocked calls,
  • thread-pool bulkhead may be needed,
  • cancelling blocking calls may not stop underlying socket immediately,
  • request thread should not spawn unbounded tasks.

Blocking hedging is easy to implement badly.


23. Hedging and Request Body Replay

Hedging requires sending the request more than once.

That means the request body must be repeatable.

Safe:

  • small JSON body held in memory,
  • GET query,
  • deterministic protobuf message,
  • immutable byte array.

Risky:

  • streaming upload,
  • input stream consumed once,
  • large file upload,
  • request body generated with timestamp/random nonce,
  • non-deterministic signature,
  • one-time token.

If body cannot be replayed, do not hedge.
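A repeatable body can be as simple as an immutable byte array that hands out a fresh stream for each attempt. The sketch below illustrates that idea only; the class name and factory methods are hypothetical, and it deliberately keeps the whole body in memory, which is why it suits small payloads and not large uploads.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ReplayableBody {
    private final byte[] bytes;

    private ReplayableBody(byte[] bytes) {
        this.bytes = bytes;
    }

    static ReplayableBody ofString(String body) {
        return new ReplayableBody(body.getBytes(StandardCharsets.UTF_8));
    }

    /** Copies the input so later changes by the caller cannot alter a hedge attempt. */
    static ReplayableBody ofBytes(byte[] body) {
        return new ReplayableBody(body.clone());
    }

    /** Every attempt (primary or hedge) gets its own stream over the same bytes. */
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    int length() {
        return bytes.length;
    }
}
```

The primary and the hedge each call `newStream()`, so neither attempt consumes the other's input. Anything generated per attempt, such as a nonce or a timestamped signature, still makes the two requests differ and should suppress hedging.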


24. Observability

Metrics:

hedged_requests.total{dependency,operation,decision}
hedged_attempts.total{dependency,operation}
hedge_wins.total{dependency,operation,winner=primary|hedge}
hedge_losers_cancelled.total{dependency,operation}
hedge_budget_denied.total{dependency,operation}
hedge_suppressed.total{reason}
hedge_extra_load_ratio{dependency,operation}
remote_call.duration{attempt=primary|hedge}

Useful suppression reasons:

  • OPERATION_NOT_IDEMPOTENT,
  • SIDE_EFFECTING,
  • BUDGET_EXHAUSTED,
  • DEADLINE_TOO_SHORT,
  • BULKHEAD_SATURATED,
  • CIRCUIT_OPEN,
  • CONSISTENCY_REQUIRED,
  • BODY_NOT_REPLAYABLE,
  • LOW_PRIORITY.

Trace attributes:

hedge.attempt=primary
hedge.attempt=secondary
hedge.delay_ms=75
hedge.winner=true

Do not use resource IDs (for example, case IDs) as metric labels; they create unbounded label cardinality.


25. Alerting

Useful alerts:

| Alert | Meaning |
|---|---|
| hedge rate above policy | extra load rising |
| hedge win rate very low | hedging may be wasting load |
| hedge budget exhausted often | tail latency or delay policy problem |
| hedging during overload | dangerous feedback loop |
| hedges on side-effecting operation | correctness bug |
| loser cancellation not working | wasted backend work |
| p99 unchanged despite hedging | ineffective policy |
| primary latency high and hedge wins high | replica/path tail issue |

Hedging should prove its value.

If it does not reduce tail latency enough to justify extra load, turn it off.


26. Testing Hedged Requests

Minimum tests:

| Scenario | Expected behavior |
|---|---|
| primary fast | no hedge sent |
| primary slow, hedge fast | hedge wins |
| primary slow, hedge disabled by budget | no hedge |
| primary returns retryable error, hedge succeeds | success |
| primary succeeds, hedge pending | hedge cancelled |
| both fail | best classified failure returned |
| operation side-effecting | hedge suppressed |
| deadline too short | hedge suppressed |
| bulkhead saturated | hedge suppressed |
| request body non-repeatable | hedge suppressed |
| metrics emitted | primary/hedge visible |

Test primary fast:

@Test
void doesNotHedgeWhenPrimaryCompletesBeforeDelay() {
    fakeRemote.completePrimaryAfter(Duration.ofMillis(10));

    Response response = hedgedClient.getCase("CASE-100");

    assertThat(response.status()).isEqualTo("OPEN");
    assertThat(fakeRemote.hedgeAttempts()).isEqualTo(0);
}

Test hedge wins:

@Test
void hedgeWinsWhenPrimaryIsTailOutlier() {
    fakeRemote.completePrimaryAfter(Duration.ofMillis(500));
    fakeRemote.completeHedgeAfter(Duration.ofMillis(40));

    Response response = hedgedClient.getCase("CASE-100");

    assertThat(response.status()).isEqualTo("OPEN");
    assertThat(fakeRemote.hedgeAttempts()).isEqualTo(1);
    assertThat(metrics.hedgeWins("hedge")).isEqualTo(1);
}

27. Load Testing Hedging

Hedging must be load-tested.

Scenarios:

  • 1% slow replica,
  • 5% random tail latency,
  • dependency overload,
  • one zone degraded,
  • p99 improves but CPU doubles,
  • cancellation ignored by server,
  • fan-out parent request with many subrequests,
  • hedging plus retry,
  • hedging plus circuit breaker,
  • hedging with cache/origin,
  • burst traffic.

Questions:

  • How much does p99 improve?
  • How much does backend RPS increase?
  • Does p50/p95 change?
  • Does hedge budget cap extra load?
  • Does dependency saturation worsen?
  • Are losers cancelled?
  • Are stale/inconsistent responses possible?
  • Does hedging stop during overload?

Do not enable hedging because it looks good in a unit test.


28. Production Policy Template

hedging:
  dependencies:
    case-service:
      operations:
        getCase:
          enabled: true
          allowedFor:
            - read-only
            - idempotent
          hedgeDelayMs: 75
          maxHedgesPerLogicalCall: 1
          maxHedgeRatio: 0.02
          disableWhen:
            circuitOpen: true
            bulkheadUtilizationAbove: 0.70
            retryAttempt: true
            remainingDeadlineBelowMs: 150
          consistency:
            allowEventuallyConsistentReplica: false
          cancellation:
            cancelLoser: true
          observability:
            emitWinner: true
            emitSuppressionReason: true

        searchCases:
          enabled: false
          reason: expensive-query-fanout-risk

        createEscalation:
          enabled: false
          reason: side-effecting-command

Policy must be per operation.

Never enable hedging globally.


29. Common Anti-Patterns

29.1 Hedging all requests immediately

Doubles load for every call.

29.2 Hedging commands

Duplicates side effects unless deeply controlled.

29.3 No hedge budget

Extra load grows exactly when latency worsens.

29.4 Hedging under overload

Speculative traffic worsens saturation.

29.5 Ignoring loser cancellation

Backend keeps doing useless work.

29.6 First completion wins even if it is a fast error

Client fails despite slower success being available.

29.7 No consistency analysis

Fast stale replica wins when strong read was required.

29.8 Hedging plus retry without total attempt cap

Attempts multiply.

29.9 Hedging non-repeatable request body

Second attempt is invalid or corrupt.

29.10 No separate metrics

Extra load is invisible.


30. Decision Model

This decision model should be embedded into client policy.


31. Design Checklist

Before enabling hedging:

  • Which operation is hedged?
  • Is it read-only or idempotent?
  • Is it side-effecting?
  • Is request body repeatable?
  • What is the hedge delay?
  • Why that delay?
  • What is max hedges per logical call?
  • What is hedge budget?
  • Is retry also enabled?
  • What is total attempt cap?
  • Does hedging stop under overload?
  • Does hedging respect bulkhead capacity?
  • Does hedging respect circuit breaker state?
  • Does hedging respect deadline?
  • Can loser attempt be cancelled?
  • Does server honor cancellation?
  • Are replicas equivalent?
  • Is consistency safe?
  • Are metrics split by primary/hedge?
  • Is p99 improvement verified by load test?
  • Is extra backend load acceptable?
  • Is there a kill switch?

32. The Real Lesson

Hedging is a sharp tool.

It can make a large distributed system feel faster by cutting off tail outliers.

But it pays for that speed with speculative load.

A mature service uses hedging only when:

tail latency matters
+ operation is safe to duplicate
+ dependency has spare capacity
+ hedge delay is high enough
+ budget caps extra load
+ losers are cancelled
+ metrics prove benefit

Hedging is not a default resilience pattern.

It is a latency optimization with correctness and capacity consequences.

