Hedged Requests: Tail Latency vs Amplified Load

Learn Java Microservices Communication - Part 045

Hedged requests for Java microservices: tail latency mitigation, duplicate request semantics, quorum/fan-out mental model, when to hedge, when not to hedge, cancellation, idempotency, load amplification, adaptive hedging, implementation sketches, testing, observability, and production policy.

2026-07-05 · 15 min read · 2815 words


Part 045 — Hedged Requests: Tail Latency vs Amplified Load

A hedged request is a deliberate duplicate request sent after a delay to reduce tail latency.

The idea is simple:

If the first attempt is unusually slow, send a second attempt to another equivalent replica and use whichever response returns first.

This can reduce p99 latency.

It can also double load, duplicate side effects, overload dependencies, confuse observability, and create correctness bugs.

Hedging is not retry.

Hedging is speculative parallelism.

Use it only when the operation and infrastructure can tolerate it.

1. The Core Mental Model

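Normal request:

client → replica A → response

Hedged request:

client → replica A (primary)
→ primary still pending after the hedge delay
→ client → replica B (hedge)
→ first acceptable response is used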

The client intentionally creates a race.

The fastest acceptable response wins.

The loser is cancelled or ignored.

The main benefit:

reduce user-visible tail latency

The main cost:

increase backend work

2. Why Tail Latency Matters

A single service call may have acceptable average latency.

But user-facing requests often fan out.

If one request depends on many subrequests, the slowest subrequest can dominate total latency.

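Example:

If each dependency independently has a 1% chance of being slow, the probability that at least one of N dependencies is slow is 1 − 0.99^N. That is roughly 9.6% at N = 10 and roughly 63% at N = 100, so the probability grows quickly with fan-out.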

This is why tail latency matters more in large systems than averages suggest.
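The fan-out effect described above can be checked numerically. The following is a minimal sketch, assuming each dependency is slow independently of the others (real systems often show correlated slowness, which changes the numbers). The class and method names are illustrative and not part of any library.

```java
/** Probability that a fan-out request sees at least one slow dependency. */
final class FanOutTail {
    private FanOutTail() {}

    /**
     * @param pSlow  probability that a single dependency call is slow (0.0 to 1.0)
     * @param fanOut number of independent dependency calls the request waits for
     */
    static double probabilityAnySlow(double pSlow, int fanOut) {
        if (pSlow < 0.0 || pSlow > 1.0 || fanOut < 0) {
            throw new IllegalArgumentException("pSlow must be in [0,1] and fanOut >= 0");
        }
        // All calls are fast with probability (1 - pSlow)^fanOut; take the complement.
        return 1.0 - Math.pow(1.0 - pSlow, fanOut);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 50, 100}) {
            System.out.printf("fanOut=%d -> %.1f%%%n", n, 100 * probabilityAnySlow(0.01, n));
        }
    }
}
```

With a 1% per-call slow probability, a fan-out of 100 makes a slow parent request more likely than not, which is the situation hedging tries to address.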

The "Tail at Scale" paper by Dean and Barroso describes hedged requests as one technique for reducing tail latency in large-scale online services.

3. Hedging vs Retry

Hedging and retry both send more than one request.

But their timing and meaning differ.

Pattern	When second request is sent	Goal
Retry	after failure/timeout/error	recover failed attempt
Hedge	before primary fails, after delay	beat slow tail attempt
Parallel duplicate	immediately	minimize latency at high load cost
Quorum request	multiple requests required by protocol	correctness/consensus
Fallback	call alternate path after failure	degrade or recover

Retry says:

the first attempt failed

Hedging says:

the first attempt is not done yet and may be a tail outlier

That difference matters.

Hedging can create duplicate in-flight work even when there is no failure.

4. When Hedging Works

Hedging works best when:

operation is read-only or idempotent,
replicas are equivalent,
latency has high variance,
tail latency is important,
slow requests are caused by per-replica noise,
dependency has spare capacity,
hedge delay is chosen carefully,
losing request can be cancelled or cheaply ignored,
duplicate work does not corrupt state,
client has enough budget,
observability separates primary and hedge attempts.

Example good candidates:

Operation	Hedge suitability
read from replicated cache	high
search read replica query	medium to high
read-only metadata lookup	medium
idempotent object fetch	high
non-critical enrichment read	medium
expensive distributed query	low unless controlled
payment command	no
create escalation command	usually no
send notification command	no
mutate workflow state	no

Hedging is mostly a read-path technique.

5. When Hedging Is Dangerous

Avoid hedging when:

operation has side effects,
server cannot deduplicate,
losing request cannot be cancelled,
dependency is already overloaded,
downstream cost is high,
request holds locks,
request triggers database writes,
request triggers external provider call,
result is not deterministic,
replicas are not equivalent,
consistency model can return conflicting answers,
fan-out is already large,
retry is already active,
circuit breaker or bulkhead is saturated.

Bad example:

POST /v1/payments

Hedging this can create duplicate payments unless the entire path is idempotent and deduplicated.

Even then, hedging commands is usually unnecessary and risky.

6. Hedge Delay

Never send the hedge immediately by default.

An immediate duplicate request doubles load for every call.

Instead:

send hedge only after primary exceeds a threshold

Typical hedge delay choices:

Strategy	Meaning
fixed delay	hedge after 50 ms
percentile-based	hedge after p95 latency
adaptive delay	hedge based on current latency distribution
budget-aware	hedge only if remaining deadline can fit
load-aware	hedge only when dependency not overloaded
priority-aware	hedge only for high-priority requests

A practical starting point:

hedgeDelay = dependency p95 latency
maxHedges = 1

Why p95?

Because most requests complete normally and do not hedge.

Only the slow tail gets a duplicate attempt.
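One way to derive the starting hedge delay is to compute the p95 from a window of recent latency samples. The sketch below uses the nearest-rank percentile method; the class name, the method name, and the idea of feeding it an in-memory sample array are assumptions for illustration. A production client would more likely read the percentile from its metrics library.

```java
import java.util.Arrays;

/** Derives a hedge delay from observed latencies (illustrative helper). */
final class HedgeDelayEstimator {
    private HedgeDelayEstimator() {}

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * {@code percent}% of all samples are less than or equal to it.
     */
    static long percentileMillis(long[] latenciesMillis, int percent) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("no latency samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        long[] sorted = latenciesMillis.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        // Integer form of ceil(percent / 100 * n), giving a 1-based rank.
        int rank = (int) (((long) percent * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }
}
```

Calling `percentileMillis(samples, 95)` on a window of recent primary-attempt latencies gives a value that could be used as `hedgeDelay`. Only primary attempts should feed the window, otherwise hedge latencies skew the estimate.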

7. Expected Extra Load

If the hedge delay is the p95 latency, about 5% of calls hedge as long as the latency distribution matches the one the p95 was measured on.

Roughly:

extra request load ≈ hedge rate × original load

At 1000 RPS:

Hedge rate	Extra RPS
1%	10
5%	50
10%	100
50%	500
100%	1000

A small hedge rate can be acceptable.

A high hedge rate can become self-inflicted overload.

Therefore:

Hedging must have a hedge budget.

8. Hedge Budget

A hedge budget limits speculative duplicate traffic.

Example:

hedged_attempts <= 2% of original attempts over rolling 1 minute

If budget is exhausted:

do not hedge

This prevents the worst scenario:

dependency slows
→ more requests exceed hedge delay
→ more hedges created
→ dependency receives more load
→ latency worsens
→ even more hedges

That feedback loop can become a retry-storm equivalent.

Hedge budget is as important as retry budget.
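A hedge budget can be sketched as a token bucket: every primary attempt earns a small fraction of a hedge, and every hedge spends one whole token. This is an approximation of the rolling-window ratio described above, not an exact one-minute window, and the class and parameter names are hypothetical. Integer milli-tokens are used to avoid floating-point drift.

```java
/** Token-bucket hedge budget: hedges are limited to a fraction of primary attempts. */
final class HedgeBudget {
    private static final long TOKEN = 1_000; // one hedge costs 1000 milli-tokens

    private final long earnedPerPrimary; // milli-tokens earned per primary attempt
    private final long maxBalance;       // cap, so idle periods cannot bank unlimited hedges
    private long balance;                // starts empty: no hedges before any traffic

    /**
     * @param hedgesPerThousandPrimaries e.g. 20 for a 2% hedge ratio
     * @param maxBurstHedges             most hedges that can be saved up
     */
    HedgeBudget(int hedgesPerThousandPrimaries, int maxBurstHedges) {
        if (hedgesPerThousandPrimaries < 0 || maxBurstHedges < 1) {
            throw new IllegalArgumentException("invalid budget configuration");
        }
        this.earnedPerPrimary = hedgesPerThousandPrimaries;
        this.maxBalance = maxBurstHedges * TOKEN;
    }

    /** Call once for every primary (non-hedge) attempt that is sent. */
    synchronized void recordPrimaryAttempt() {
        balance = Math.min(maxBalance, balance + earnedPerPrimary);
    }

    /** Returns true and spends one token if a hedge is allowed right now. */
    synchronized boolean tryAcquire() {
        if (balance < TOKEN) {
            return false;
        }
        balance -= TOKEN;
        return true;
    }
}
```

Because the balance only grows with primary traffic, a dependency slowdown cannot raise the hedge rate above the configured ratio, which breaks the feedback loop shown above.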

9. Hedging and Bulkhead

Hedged attempts must consume bulkhead capacity.

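If hedges bypass bulkheads, they can overload the dependency path.

Better: make every hedged attempt acquire a permit from the same bulkhead that guards primary attempts, so hedges are refused when the dependency path is already full.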

But be careful: if primary holds one permit and hedge holds another, one logical call can consume two slots.

Policy options:

Policy	Behavior
shared bulkhead	hedges compete with normal calls
separate hedge bulkhead	cap hedge attempts independently
hedge budget + shared bulkhead	good default
no hedge when bulkhead near full	safer during pressure

Recommended:

hedge only when bulkhead has spare capacity

Hedging under saturation usually worsens the problem.
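The "shared bulkhead, but hedge only with spare capacity" policy can be sketched with a plain `java.util.concurrent.Semaphore`. The class below is hypothetical: primaries may use every permit, while hedges are admitted only when more than a reserved number of permits is still free. The check for hedges is approximate, because a primary can take a permit between the check and the acquire.

```java
import java.util.concurrent.Semaphore;

/** Shared bulkhead in which hedges may only use capacity beyond a reserve. */
final class HedgeAwareBulkhead {
    private final Semaphore permits;
    private final int reservedForPrimaries;

    /**
     * @param capacity             total concurrent remote attempts allowed
     * @param reservedForPrimaries permits that hedges must leave untouched
     */
    HedgeAwareBulkhead(int capacity, int reservedForPrimaries) {
        if (capacity < 1 || reservedForPrimaries < 0 || reservedForPrimaries > capacity) {
            throw new IllegalArgumentException("invalid bulkhead configuration");
        }
        this.permits = new Semaphore(capacity);
        this.reservedForPrimaries = reservedForPrimaries;
    }

    /** Primary attempts compete for every permit. */
    boolean tryAcquirePrimary() {
        return permits.tryAcquire();
    }

    /** Hedges are refused once free permits drop to the reserve. */
    synchronized boolean tryAcquireHedge() {
        if (permits.availablePermits() <= reservedForPrimaries) {
            return false;
        }
        return permits.tryAcquire();
    }

    /** Must be called exactly once for every successful acquire, primary or hedge. */
    void release() {
        permits.release();
    }
}
```

With a capacity of 10 and a reserve of 3, hedges stop once 7 permits are in use, which roughly corresponds to a 70% utilization threshold.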

10. Hedging and Circuit Breaker

Do not hedge into an open circuit.

If the dependency breaker is open, speculative duplicate calls should not be created.

For multi-replica or multi-zone clients, there may be per-zone/per-endpoint breakers.

A hedge should select only healthy candidates.

If all candidates are unhealthy, fail fast or fall back.

11. Hedging and Retry

Hedging and retry can multiply attempts.

Example:

max attempts = 2 (primary + 1 retry)
max hedges = 1 per attempt

Worst-case remote attempts:

2 attempts × 2 copies each (original + hedge) = 4 calls

With fan-out, this gets ugly fast.

Recommended:

avoid combining hedging and retry unless necessary,
use small retry count,
use shared attempt budget,
classify hedged attempt separately,
do not retry loser cancellation as failure,
do not hedge retry attempts by default.

Policy:

one logical call may use at most N total remote attempts

Example:

maxTotalAttemptsIncludingHedges = 2

That means:

either primary + one hedge,
or primary + one retry,
not both.
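A shared attempt cap can be sketched as a small counter that both the retry logic and the hedge logic must consult before sending anything. The class name and method names below are assumptions; one instance would be created per logical call.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Per-logical-call cap on remote attempts, shared by retries and hedges. */
final class AttemptBudget {
    private final AtomicInteger remaining;

    AttemptBudget(int maxTotalAttemptsIncludingHedges) {
        if (maxTotalAttemptsIncludingHedges < 1) {
            throw new IllegalArgumentException("at least one attempt is required");
        }
        this.remaining = new AtomicInteger(maxTotalAttemptsIncludingHedges);
    }

    /** Returns true if one more remote attempt (primary, retry, or hedge) may be sent. */
    boolean tryConsume() {
        while (true) {
            int current = remaining.get();
            if (current <= 0) {
                return false;
            }
            // CAS loop keeps the counter from going negative under concurrent callers.
            if (remaining.compareAndSet(current, current - 1)) {
                return true;
            }
        }
    }

    int remaining() {
        return remaining.get();
    }
}
```

With a cap of 2, the primary consumes the first attempt, and whichever of the hedge or the retry asks next gets the second; the other is refused.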

12. Hedging and Idempotency

For read operations, hedging is usually semantically safe.

For commands, hedging is dangerous.

If a command must be hedged, which is rare, it requires:

idempotency key,
server-side deduplication,
stable request fingerprint,
external side-effect dedup,
response replay,
loser cancellation awareness,
downstream idempotent consumers,
strict observability.

Even then, ask:

Why are we hedging a command instead of making it asynchronous, improving dependency latency, or fixing tail causes?

Most production systems should not hedge commands.

13. Hedging and Cancellation

When one attempt wins, the client should cancel the loser.

But cancellation is not guaranteed to stop server work.

For HTTP:

connection may be closed,
request may already be executing,
server may not notice immediately,
database query may continue,
side effects may already have happened.

For gRPC:

cancellation is a first-class concept,
server can observe cancellation and stop work,
but committed side effects still cannot be undone.

Cancellation reduces wasted work only if the stack propagates and honors it.

14. Hedging and Consistency

If replicas can return different versions of data, hedging may return an older but faster result.

Example:

Replica A: updated case status = CLOSED, slower
Replica B: stale case status = OPEN, faster

Hedging chooses the faster response.

Is that acceptable?

Depends on the operation.

For eventually consistent reads:

maybe acceptable

For decision-making reads:

possibly unsafe

For regulatory action eligibility:

usually unsafe unless consistency contract allows it

Hedging must respect read consistency requirements.

15. Hedging and Cache

Hedging can be used between cache and origin in controlled ways.

Example:

start cache lookup,
if cache is slow/missed, start origin lookup,
use first acceptable response,
cancel loser.

But this is not always wise.

If origin is expensive, hedging cache/origin can increase backend load.

Better common pattern:

cache first with very short budget
then origin

or:

origin first, stale cache fallback on failure

Hedging is one option, not the default.

16. Hedging and Fan-Out

Fan-out plus hedging multiplies load.

If one request fans out to 20 shards and each shard can hedge once:

max attempts = 40 shard calls

If 100 frontend requests arrive:

4000 shard calls

Hedging in fan-out systems must be extremely controlled.

Techniques:

hedge only the slowest subrequests,
hedge only after most responses returned,
cap total hedges per parent request,
use per-shard budget,
cancel losers aggressively,
do not hedge when system load is high.
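The effect of capping hedges per parent request can be estimated with simple arithmetic. The helper below is an illustrative sketch with hypothetical names; it computes worst-case shard calls with and without a per-parent hedge cap.

```java
/** Worst-case shard call counts for fan-out with hedging (illustrative arithmetic). */
final class FanOutLoad {
    private FanOutLoad() {}

    /** Every shard call may be hedged up to maxHedgesPerShard times. */
    static long worstCaseShardCalls(long parentRequests, int shards, int maxHedgesPerShard) {
        return parentRequests * shards * (1L + maxHedgesPerShard);
    }

    /** Same, but each parent request may issue at most maxHedgesPerParent hedges in total. */
    static long worstCaseWithParentCap(
            long parentRequests, int shards, int maxHedgesPerShard, int maxHedgesPerParent) {
        long uncappedHedges = (long) shards * maxHedgesPerShard;
        long hedgesPerParent = Math.min(uncappedHedges, maxHedgesPerParent);
        return parentRequests * (shards + hedgesPerParent);
    }
}
```

For the example above (100 parent requests, 20 shards, one hedge per shard), the uncapped worst case is 4000 shard calls. Allowing at most 2 hedges per parent request brings the worst case down to 2200.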

17. Basic Java Hedging Skeleton

This is conceptual.

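import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Deadline and HedgePolicy (with canHedge/hedgeDelay) are application types assumed to exist.
public final class HedgedCallExecutor {
    private final ScheduledExecutorService scheduler;
    private final HedgePolicy policy;

    public HedgedCallExecutor(ScheduledExecutorService scheduler, HedgePolicy policy) {
        this.scheduler = scheduler;
        this.policy = policy;
    }

    public <T> CompletableFuture<T> execute(
        Supplier<CompletableFuture<T>> primarySupplier,
        Supplier<CompletableFuture<T>> hedgeSupplier,
        Deadline deadline
    ) {
        CompletableFuture<T> result = new CompletableFuture<>();
        CompletableFuture<T> primary = primarySupplier.get();

        AtomicReference<CompletableFuture<T>> hedgeRef = new AtomicReference<>();

        primary.whenComplete((value, error) -> completeIfFirst(result, value, error));

        // Start the hedge only if the primary is still pending after the hedge delay.
        ScheduledFuture<?> hedgeTask = scheduler.schedule(() -> {
            if (result.isDone()) {
                return;
            }
            if (!policy.canHedge(deadline)) {
                return;
            }

            CompletableFuture<T> hedge = hedgeSupplier.get();
            hedgeRef.set(hedge);

            hedge.whenComplete((value, error) -> completeIfFirst(result, value, error));

            // The result may have completed while the hedge was starting, in which case
            // the cleanup callback below ran before hedgeRef was set. Cancel here too.
            if (result.isDone()) {
                hedge.cancel(true);
            }
        }, policy.hedgeDelay().toMillis(), TimeUnit.MILLISECONDS);

        // Once there is a winner (or a timeout), stop the pending hedge and cancel the loser.
        // CompletableFuture.cancel does not interrupt running work; the suppliers must
        // propagate cancellation to the underlying client for it to have any effect.
        result.whenComplete((value, error) -> {
            hedgeTask.cancel(false);
            if (!primary.isDone()) {
                primary.cancel(true);
            }
            CompletableFuture<T> hedge = hedgeRef.get();
            if (hedge != null && !hedge.isDone()) {
                hedge.cancel(true);
            }
        });

        return result.orTimeout(deadline.remainingMillis(), TimeUnit.MILLISECONDS);
    }

    // Completes the result with whichever attempt finishes first, success or failure.
    // Later completions are ignored because a CompletableFuture completes only once.
    private <T> void completeIfFirst(CompletableFuture<T> result, T value, Throwable error) {
        if (error == null) {
            result.complete(value);
        } else {
            result.completeExceptionally(error);
        }
    }
}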

This skeleton is incomplete for production.

It still needs:

failure classification,
loser cancellation metrics,
hedge budget,
deadline-aware scheduling,
bulkhead integration,
trace context,
error aggregation,
interruption behavior,
request identity,
executor management.

18. Winner Selection Is Not Always First Response

"First response wins" is too simplistic.

A fast error may arrive before a slow success.

Should the fast error win?

Example:

Attempt A returns 503 after 20 ms
Attempt B returns 200 after 80 ms

If the client accepts first completion, it fails unnecessarily.

Better policy:

Event	Behavior
first success	complete successfully, cancel losers
retryable failure while another attempt pending	wait if deadline allows
terminal failure from all attempts	fail
mixed failures	choose best classified error
deadline exceeded	cancel all attempts

A hedged executor should prefer success over retryable failure.

But it must still respect deadline.
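The "prefer success over failure" rule can be sketched as a combinator over the attempt futures. This is a minimal sketch under stated assumptions: the class name is hypothetical, all attempt futures are known up front (a hedge slot can be a future that is completed later), every failure is treated as retryable, and the caller applies the deadline, for example with `orTimeout`. When all attempts fail, it simply reports the last failure rather than choosing a "best classified" one.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/** Completes with the first successful attempt; fails only when every attempt has failed. */
final class FirstSuccess {
    private FirstSuccess() {}

    static <T> CompletableFuture<T> of(List<CompletableFuture<T>> attempts) {
        if (attempts.isEmpty()) {
            throw new IllegalArgumentException("at least one attempt is required");
        }
        CompletableFuture<T> result = new CompletableFuture<>();
        AtomicInteger pendingFailures = new AtomicInteger(attempts.size());

        for (CompletableFuture<T> attempt : attempts) {
            attempt.whenComplete((value, error) -> {
                if (error == null) {
                    // First success wins; later completions are ignored.
                    result.complete(value);
                } else if (pendingFailures.decrementAndGet() == 0) {
                    // Every attempt failed: surface the last failure observed.
                    result.completeExceptionally(error);
                }
                // Otherwise a failure arrived while another attempt is pending: keep waiting.
            });
        }
        return result;
    }
}
```

In the 503-then-200 example above, the result stays pending after attempt A fails and completes with attempt B's response 60 ms later. Loser cancellation and error classification would be layered on top of this.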

19. Candidate Selection

Do not send primary and hedge to the same overloaded replica if avoidable.

Candidate choices:

different connection,
different replica,
different zone,
different cache node,
different read replica,
different shard replica.

But do not violate consistency, locality, or data residency.

Candidate selection must respect:

tenant placement,
region,
routing policy,
service mesh behavior,
sticky session requirements,
cache key ownership,
authorization context,
consistency level.

If a load balancer or service mesh picks the backend for every request, the client may not control candidate selection, and the hedge can land on the same slow replica as the primary.

In that case, hedging is less precise and more risky.

20. Adaptive Hedging

Static hedge delay can be wrong as traffic changes.

Adaptive hedging adjusts delay based on current latency and load.

Possible inputs:

p95/p99 latency,
current success rate,
bulkhead saturation,
circuit breaker state,
dependency load,
hedge budget remaining,
request priority,
deadline remaining.

Policy:

if dependency healthy and tail high and budget available:
  hedge after p95
else:
  do not hedge

Adaptive hedging is powerful but can create feedback loops.

Use simple static hedging first.

Add adaptation only with strong observability.

21. Hedging Policy Object

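import java.time.Duration;
import java.util.Objects;

public record HedgePolicy(
    boolean enabled,
    Duration hedgeDelay,
    int maxHedges,
    double maxHedgeRatio,
    boolean requireIdempotentOperation,
    boolean disableWhenBulkheadSaturated,
    boolean disableWhenCircuitOpen
) {
    // Compact constructor: reject configurations that would hedge immediately or without limit.
    public HedgePolicy {
        Objects.requireNonNull(hedgeDelay, "hedgeDelay");
        if (maxHedges < 0) {
            throw new IllegalArgumentException("maxHedges must be >= 0");
        }
        if (hedgeDelay.isNegative() || hedgeDelay.isZero()) {
            throw new IllegalArgumentException("hedgeDelay must be positive");
        }
        if (maxHedgeRatio < 0.0 || maxHedgeRatio > 1.0) {
            throw new IllegalArgumentException("maxHedgeRatio must be between 0 and 1");
        }
    }
}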

Operation semantics:

public record OperationSemantics(
    String operationName,
    boolean readOnly,
    boolean idempotent,
    boolean strongConsistencyRequired,
    boolean sideEffecting
) {}

Decision:

// Assumes fields on the enclosing class: policy, hedgeBudget, minAttemptDuration.
public boolean canHedge(OperationSemantics semantics, RuntimeSignals signals, Deadline deadline) {
    if (!policy.enabled()) return false;
    if (semantics.sideEffecting()) return false;
    if (policy.requireIdempotentOperation() && !semantics.idempotent()) return false;
    if (semantics.strongConsistencyRequired()) return false;
    if (signals.bulkheadSaturated() && policy.disableWhenBulkheadSaturated()) return false;
    if (signals.circuitOpen() && policy.disableWhenCircuitOpen()) return false;
    // The remaining deadline must fit the hedge delay plus one attempt.
    if (deadline.remaining().compareTo(policy.hedgeDelay().plus(minAttemptDuration)) <= 0) return false;
    // Acquire budget last so a hedge suppressed for another reason does not consume it.
    return hedgeBudget.tryAcquire();
}

Hedging should be a deliberate decision, not a utility method anyone can call.

22. HTTP Client Implementation Considerations

JDK `HttpClient`

sendAsync returns CompletableFuture.

Hedging can be implemented by starting a second sendAsync.

But cancellation behavior depends on the underlying request state.

You still need:

request timeout,
deadline,
connection pool limits,
separate metrics,
candidate selection if possible,
safe request body replay.

Request body matters.

A streaming request body may not be replayable.

Do not hedge non-repeatable request bodies.

Spring `WebClient`

Reactive composition can race publishers:

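Mono<Response> primary = callPrimary();
// Defer so callHedge() is not even invoked until the delay has elapsed.
Mono<Response> hedge = Mono.delay(hedgeDelay).then(Mono.defer(() -> callHedge()));

// firstWithValue takes the first source that emits a value and cancels the other;
// it fails only if every source fails or completes empty.
Mono<Response> result = Mono.firstWithValue(primary, hedge);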

But you need:

error handling,
cancellation semantics,
context propagation,
metrics,
timeout,
scheduler control.

Blocking `RestClient`/Feign

Hedging blocking calls requires extra threads or virtual threads.

Be careful:

hedging doubles blocked calls,
thread-pool bulkhead may be needed,
cancelling blocking calls may not stop underlying socket immediately,
request thread should not spawn unbounded tasks.

Blocking hedging is easy to implement badly.

23. Hedging and Request Body Replay

Hedging requires sending the request more than once.

That means the request body must be repeatable.

Safe:

small JSON body held in memory,
GET query,
deterministic protobuf message,
immutable byte array.

Risky:

streaming upload,
input stream consumed once,
large file upload,
request body generated with timestamp/random nonce,
non-deterministic signature,
one-time token.

If body cannot be replayed, do not hedge.

24. Observability

Metrics:

hedged_requests.total{dependency,operation,decision}
hedged_attempts.total{dependency,operation}
hedge_wins.total{dependency,operation,winner=primary|hedge}
hedge_losers_cancelled.total{dependency,operation}
hedge_budget_denied.total{dependency,operation}
hedge_suppressed.total{reason}
hedge_extra_load_ratio{dependency,operation}
remote_call.duration{attempt=primary|hedge}

Useful suppression reasons:

OPERATION_NOT_IDEMPOTENT,
SIDE_EFFECTING,
BUDGET_EXHAUSTED,
DEADLINE_TOO_SHORT,
BULKHEAD_SATURATED,
CIRCUIT_OPEN,
CONSISTENCY_REQUIRED,
BODY_NOT_REPLAYABLE,
LOW_PRIORITY.
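To make the suppression reason available for the `hedge_suppressed.total{reason}` metric, the hedge decision can return the first reason that applies instead of a bare boolean. The sketch below is hypothetical: `HedgeGate` and `suppressWhen` are illustrative names, and the enum order is an assumption chosen so that cheap semantic checks run first and the budget check runs last.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BooleanSupplier;

/** Evaluates hedge suppression checks in a fixed order and reports the first that applies. */
final class HedgeGate {
    /** Declaration order is evaluation order; BUDGET_EXHAUSTED is last on purpose. */
    enum SuppressionReason {
        SIDE_EFFECTING, OPERATION_NOT_IDEMPOTENT, CONSISTENCY_REQUIRED, BODY_NOT_REPLAYABLE,
        LOW_PRIORITY, CIRCUIT_OPEN, BULKHEAD_SATURATED, DEADLINE_TOO_SHORT, BUDGET_EXHAUSTED
    }

    // EnumMap iterates in enum declaration order regardless of registration order.
    private final EnumMap<SuppressionReason, BooleanSupplier> checks =
            new EnumMap<>(SuppressionReason.class);

    /** Registers a condition that, when true, suppresses the hedge with the given reason. */
    HedgeGate suppressWhen(SuppressionReason reason, BooleanSupplier condition) {
        checks.put(reason, condition);
        return this;
    }

    /** Returns the first reason whose condition holds, or empty if the hedge may be sent. */
    Optional<SuppressionReason> evaluate() {
        for (Map.Entry<SuppressionReason, BooleanSupplier> check : checks.entrySet()) {
            if (check.getValue().getAsBoolean()) {
                return Optional.of(check.getKey());
            }
        }
        return Optional.empty();
    }
}
```

A caller would increment the suppression counter with the returned reason as the label value. Evaluating the budget check last means a hedge that is suppressed for another reason never consumes budget.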

Trace attributes:

hedge.attempt=primary
hedge.attempt=hedge
hedge.delay_ms=75
hedge.winner=true

Do not use resource IDs as metric labels.

25. Alerting

Useful alerts:

Alert	Meaning
hedge rate above policy	extra load rising
hedge win rate very low	hedging may be wasting load
hedge budget exhausted often	tail latency or delay policy problem
hedging during overload	dangerous feedback loop
hedges on side-effecting operation	correctness bug
loser cancellation not working	wasted backend work
p99 unchanged despite hedging	ineffective policy
primary latency high and hedge wins high	replica/path tail issue

Hedging should prove its value.

If it does not reduce tail latency enough to justify extra load, turn it off.

26. Testing Hedged Requests

Minimum tests:

Scenario	Expected behavior
primary fast	no hedge sent
primary slow, hedge fast	hedge wins
primary slow, hedge disabled by budget	no hedge
primary returns retryable error, hedge succeeds	success
primary succeeds, hedge pending	hedge cancelled
both fail	best classified failure returned
operation side-effecting	hedge suppressed
deadline too short	hedge suppressed
bulkhead saturated	hedge suppressed
request body non-repeatable	hedge suppressed
metrics emitted	primary/hedge visible

Test primary fast:

@Test
void doesNotHedgeWhenPrimaryCompletesBeforeDelay() {
    fakeRemote.completePrimaryAfter(Duration.ofMillis(10));

    Response response = hedgedClient.getCase("CASE-100");

    assertThat(response.status()).isEqualTo("OPEN");
    assertThat(fakeRemote.hedgeAttempts()).isEqualTo(0);
}

Test hedge wins:

@Test
void hedgeWinsWhenPrimaryIsTailOutlier() {
    fakeRemote.completePrimaryAfter(Duration.ofMillis(500));
    fakeRemote.completeHedgeAfter(Duration.ofMillis(40));

    Response response = hedgedClient.getCase("CASE-100");

    assertThat(response.status()).isEqualTo("OPEN");
    assertThat(fakeRemote.hedgeAttempts()).isEqualTo(1);
    assertThat(metrics.hedgeWins("hedge")).isEqualTo(1);
}

27. Load Testing Hedging

Hedging must be load-tested.

Scenarios:

1% slow replica,
5% random tail latency,
dependency overload,
one zone degraded,
p99 improves but CPU doubles,
cancellation ignored by server,
fan-out parent request with many subrequests,
hedging plus retry,
hedging plus circuit breaker,
hedging with cache/origin,
burst traffic.

Questions:

How much does p99 improve?
How much does backend RPS increase?
Does p50/p95 change?
Does hedge budget cap extra load?
Does dependency saturation worsen?
Are losers cancelled?
Are stale/inconsistent responses possible?
Does hedging stop during overload?

Do not enable hedging because it looks good in a unit test.

28. Production Policy Template

hedging:
  dependencies:
    case-service:
      operations:
        getCase:
          enabled: true
          allowedFor:
            - read-only
            - idempotent
          hedgeDelayMs: 75
          maxHedgesPerLogicalCall: 1
          maxHedgeRatio: 0.02
          disableWhen:
            circuitOpen: true
            bulkheadUtilizationAbove: 0.70
            retryAttempt: true
            remainingDeadlineBelowMs: 150
          consistency:
            allowEventuallyConsistentReplica: false
          cancellation:
            cancelLoser: true
          observability:
            emitWinner: true
            emitSuppressionReason: true

        searchCases:
          enabled: false
          reason: expensive-query-fanout-risk

        createEscalation:
          enabled: false
          reason: side-effecting-command

Policy must be per operation.

Never enable hedging globally.

29. Common Anti-Patterns

29.1 Hedging all requests immediately

Doubles load for every call.

29.2 Hedging commands

Duplicates side effects unless deeply controlled.

29.3 No hedge budget

Extra load grows exactly when latency worsens.

29.4 Hedging under overload

Speculative traffic worsens saturation.

29.5 Ignoring loser cancellation

Backend keeps doing useless work.

29.6 First completion wins even if it is a fast error

Client fails despite slower success being available.

29.7 No consistency analysis

Fast stale replica wins when strong read was required.

29.8 Hedging plus retry without total attempt cap

Attempts multiply.

29.9 Hedging non-repeatable request body

Second attempt is invalid or corrupt.

29.10 No separate metrics

Extra load is invisible.

30. Decision Model

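Hedge only if every check passes:

operation is read-only or idempotent, and not side-effecting,
request body is repeatable,
consistency contract allows reading from another replica,
circuit breaker is closed and bulkhead has spare capacity,
remaining deadline fits the hedge delay plus one attempt,
total attempt cap and hedge budget allow one more attempt.

If any check fails, send no hedge and record the suppression reason.

This decision model should be embedded into client policy.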

31. Design Checklist

Before enabling hedging:

Which operation is hedged?
Is it read-only or idempotent?
Is it side-effecting?
Is request body repeatable?
What is the hedge delay?
Why that delay?
What is max hedges per logical call?
What is hedge budget?
Is retry also enabled?
What is total attempt cap?
Does hedging stop under overload?
Does hedging respect bulkhead capacity?
Does hedging respect circuit breaker state?
Does hedging respect deadline?
Can loser attempt be cancelled?
Does server honor cancellation?
Are replicas equivalent?
Is consistency safe?
Are metrics split by primary/hedge?
Is p99 improvement verified by load test?
Is extra backend load acceptable?
Is there a kill switch?

32. The Real Lesson

Hedging is a sharp tool.

It can make a large distributed system feel faster by cutting off tail outliers.

But it pays for that speed with speculative load.

A mature service uses hedging only when:

tail latency matters
+ operation is safe to duplicate
+ dependency has spare capacity
+ hedge delay is high enough
+ budget caps extra load
+ losers are cancelled
+ metrics prove benefit

Hedging is not a default resilience pattern.

It is a latency optimization with correctness and capacity consequences.

References

Jeffrey Dean and Luiz André Barroso — The Tail at Scale: https://research.google/pubs/the-tail-at-scale/
The Tail at Scale PDF: https://www.barroso.org/publications/TheTailAtScale.pdf
AWS Builders Library — Timeouts, retries, and backoff with jitter: https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
gRPC Cancellation: https://grpc.io/docs/guides/cancellation/
gRPC Deadlines: https://grpc.io/docs/guides/deadlines/
Google SRE Book — Handling Overload: https://sre.google/sre-book/handling-overload/
