
Rate Limiting and Client-Side Throttling

Learn Java Microservices Communication - Part 043

Rate limiting for Java microservices: client-side and server-side throttling, quotas, token bucket, leaky bucket, fixed/sliding windows, RateLimit headers, Retry-After, Resilience4j RateLimiter, fairness, priority, testing, observability, and production policy.


Part 043 — Rate Limiting and Client-Side Throttling

Rate limiting is admission control over time.

It answers:

How many requests is this caller, tenant, client, endpoint, or dependency path allowed to make during a time window?

Without rate limiting, traffic can grow until some other part of the system becomes the limiter:

  • CPU saturation,
  • database connection exhaustion,
  • thread pool exhaustion,
  • queue explosion,
  • broker lag,
  • dependency throttling,
  • garbage collection pressure,
  • network bottleneck,
  • external provider quota,
  • cascading failure.

That is the worst kind of limit: accidental, late, and uncontrolled.

A production service should prefer explicit limits.

reject early
over fail late

1. Rate Limiting vs Load Shedding

These two are related but not identical.

Concept | Question | Example
Rate limiting | Is this caller within allowed quota/rate? | Tenant A may call searchCases 100 RPS
Client-side throttling | Should this client slow itself before being rejected? | Caller limits outbound dependency calls to 50 RPS
Load shedding | Is the system too overloaded to accept more work? | Server drops low-priority traffic at high CPU/queue depth
Bulkhead | How many concurrent calls can occupy this resource? | Max 40 concurrent calls to case-service.createEscalation
Circuit breaker | Is dependency unhealthy enough to stop calling? | Open breaker after 50% failures
Retry budget | How many extra retry attempts can be afforded? | Retries max 10% of original traffic

Rate limiting is usually about fairness, quota, and predictable usage.

Load shedding is about survival under overload.

They often work together, but they should not be designed as the same mechanism.


2. Why Rate Limiting Matters in Internal Microservices

Teams often rate-limit public APIs but ignore internal APIs.

That is a mistake.

Internal callers can create more dangerous traffic than external users:

  • batch jobs,
  • replay jobs,
  • retry storms,
  • workflow engines,
  • message consumers catching up after lag,
  • data migration scripts,
  • misconfigured cron jobs,
  • fan-out services,
  • generated clients with aggressive parallelism,
  • low-priority analytics jobs.

Internal does not mean safe.

Internal traffic often bypasses edge protections and hits critical dependencies directly.

Rate limits are internal blast-radius controls.


3. What Can Be Limited?

Rate limit dimensions:

Dimension | Example
Caller service | workflow-service max 500 RPS to case-service
Tenant/account | Tenant A max 100 RPS
User | User U max 20 requests/minute
API operation | searchCases max 200 RPS
HTTP method | POST commands stricter than GET
Resource key | one case cannot receive 1000 updates/sec
Priority class | batch lower than user-facing
Region | regional capacity-specific limits
External provider | provider quota 1000 requests/minute
Retry traffic | retries limited separately
Expensive query shape | complex filters lower quota
Payload size | large requests consume more tokens

A mature limiter often uses multiple dimensions.

Example:

caller-service + operation + tenant + priority

But beware cardinality and complexity.

Start with dimensions that map to ownership and capacity.


4. Rate Limit Is a Contract

If a service rate-limits consumers, it should document:

  • who is limited,
  • what is limited,
  • limit value,
  • window,
  • burst allowance,
  • response status,
  • retry-after behavior,
  • headers,
  • whether retries count,
  • whether failed requests count,
  • whether idempotent replay counts,
  • how to request higher limit,
  • whether limits differ by environment/tenant/priority.

Without a contract, rate limiting becomes random production pain.

For HTTP APIs, rate-limit responses usually use:

429 Too Many Requests

and often include:

Retry-After: 2

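The RateLimit and RateLimit-Policy header fields, specified in an IETF HTTP API working group draft, communicate quota policy and current limit state. The field syntax has changed between draft versions, so document which version your services emit and parse.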


5. HTTP Rate Limit Signals

Typical response:

HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
Retry-After: 2
RateLimit: limit=100, remaining=0, reset=2
RateLimit-Policy: 100;w=60

Problem body:

{
  "type": "https://errors.example.internal/rate-limited",
  "title": "Rate limit exceeded",
  "status": 429,
  "detail": "The caller exceeded the allowed rate for this operation.",
  "extensions": {
    "code": "RATE_LIMITED",
    "retryable": true,
    "retryAfterMillis": 2000,
    "limitScope": "caller-service:workflow-service:operation:searchCases"
  }
}

Important:

  • 429 is not a server crash.
  • 429 is intentional admission control.
  • Clients must not retry immediately.
  • Server should provide enough signal for cooperative clients to slow down.

6. Retry-After

Retry-After can be a delay in seconds or an HTTP date.

Examples:

Retry-After: 5
Retry-After: Sun, 05 Jul 2026 10:15:30 GMT

Client rule:

respect Retry-After if it fits the caller deadline and retry policy

If Retry-After is too long for a synchronous request, do not sleep the request thread for a long time.

Return controlled failure or shift to async workflow.

Example:

Retry-After = 30 seconds
user request deadline = 800 ms

Do not wait.

Return a retryable/degraded response upstream.
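
The client rule above can be sketched with the JDK alone. This is a minimal sketch, not a library API: `RetryAfterPolicy`, `parse`, and `fitsDeadline` are hypothetical names, and it assumes the caller tracks its own remaining deadline budget and a minimum useful attempt duration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper: parses Retry-After and checks it against the caller deadline. */
final class RetryAfterPolicy {
    private RetryAfterPolicy() {}

    /** Accepts either delay-seconds ("5") or an HTTP date ("Sun, 05 Jul 2026 10:15:30 GMT"). */
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null || headerValue.isBlank()) {
            return Optional.empty();
        }
        String value = headerValue.trim();
        try {
            long seconds = Long.parseLong(value);
            return seconds >= 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
        } catch (NumberFormatException notDelaySeconds) {
            // Not a number: fall through and try the HTTP-date form.
        }
        try {
            Instant retryAt = ZonedDateTime
                .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
            Duration delay = Duration.between(now, retryAt);
            // A date in the past means "retry now".
            return Optional.of(delay.isNegative() ? Duration.ZERO : delay);
        } catch (DateTimeParseException notHttpDate) {
            return Optional.empty();
        }
    }

    /** True only if waiting retryAfter and then making one more attempt fits the remaining budget. */
    static boolean fitsDeadline(Duration retryAfter, Duration remainingBudget, Duration minAttempt) {
        return retryAfter.plus(minAttempt).compareTo(remainingBudget) <= 0;
    }
}
```

With the numbers from the example, a 30 second Retry-After does not fit an 800 ms budget, so `fitsDeadline` returns false and the caller should fail fast or hand the work to an async path.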


7. Rate Limit Algorithms

7.1 Fixed window

100 requests per minute
window: 10:00:00–10:00:59

Simple, but boundary bursts can happen.

A caller can send 100 requests at 10:00:59 and 100 more at 10:01:00, so 200 requests pass within about one second even though the limit is 100 per minute.

Pros:

  • simple,
  • cheap,
  • easy to reason about.

Cons:

  • boundary burst,
  • unfair at edges.
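
To make the boundary burst concrete, here is a minimal fixed-window sketch. `FixedWindowLimiter` and its injectable millisecond clock are illustrative assumptions, not part of any library mentioned in this lesson.

```java
import java.util.function.LongSupplier;

/** Illustrative fixed-window counter: at most `limit` requests per window. */
final class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final LongSupplier clockMillis; // injectable so tests can control time
    private long currentWindow = Long.MIN_VALUE;
    private int count;

    FixedWindowLimiter(int limit, long windowMillis, LongSupplier clockMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clockMillis = clockMillis;
    }

    synchronized boolean tryAcquire() {
        long window = Math.floorDiv(clockMillis.getAsLong(), windowMillis);
        if (window != currentWindow) {
            // A new window started: the counter resets, whatever happened a moment ago.
            currentWindow = window;
            count = 0;
        }
        if (count >= limit) {
            return false;
        }
        count++;
        return true;
    }
}
```

With a limit of 100 per 60 000 ms, this limiter admits 100 calls at t = 59 000 ms and another 100 at t = 60 000 ms, which is exactly the edge behavior listed under the cons.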

7.2 Sliding window log

Store timestamps of recent requests.

Precise but expensive at high volume.

Pros:

  • accurate,
  • fair.

Cons:

  • memory/storage cost,
  • distributed implementation complexity.
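
A sliding window log can be sketched in a few lines for a single JVM. `SlidingWindowLogLimiter` is a hypothetical name; the sketch keeps one timestamp per admitted request, which is where the memory cost comes from.

```java
import java.util.ArrayDeque;
import java.util.function.LongSupplier;

/** Illustrative sliding-window log: remembers the timestamp of every admitted request. */
final class SlidingWindowLogLimiter {
    private final int limit;
    private final long windowMillis;
    private final LongSupplier clockMillis; // injectable so tests can control time
    private final ArrayDeque<Long> admittedAt = new ArrayDeque<>();

    SlidingWindowLogLimiter(int limit, long windowMillis, LongSupplier clockMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clockMillis = clockMillis;
    }

    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        // Drop entries that are no longer inside the window (now - windowMillis, now].
        while (!admittedAt.isEmpty() && admittedAt.peekFirst() <= now - windowMillis) {
            admittedAt.pollFirst();
        }
        if (admittedAt.size() >= limit) {
            return false;
        }
        admittedAt.addLast(now);
        return true;
    }
}
```

Unlike the fixed window, a burst of 100 at 10:00:59 leaves no room at 10:01:00, because those 100 timestamps are still inside the trailing 60 seconds.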

7.3 Sliding window counter

Approximate sliding window using buckets.

Pros:

  • cheaper than log,
  • smoother than fixed window.

Cons:

  • approximate,
  • more complex than fixed window.
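
One common way to approximate a sliding window is to keep two fixed-window counters and weight the previous one by how much of it still overlaps the trailing window. The sketch below assumes that variant; `SlidingWindowCounterLimiter` is a hypothetical name.

```java
import java.util.function.LongSupplier;

/** Illustrative sliding-window counter: current count plus a weighted share of the previous window. */
final class SlidingWindowCounterLimiter {
    private final int limit;
    private final long windowMillis;
    private final LongSupplier clockMillis; // injectable so tests can control time
    private long currentWindow = Long.MIN_VALUE;
    private int currentCount;
    private int previousCount;

    SlidingWindowCounterLimiter(int limit, long windowMillis, LongSupplier clockMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clockMillis = clockMillis;
    }

    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        long window = Math.floorDiv(now, windowMillis);
        if (window != currentWindow) {
            // Only an adjacent window still overlaps the trailing window.
            previousCount = (window == currentWindow + 1) ? currentCount : 0;
            currentCount = 0;
            currentWindow = window;
        }
        double elapsedFraction = (double) Math.floorMod(now, windowMillis) / windowMillis;
        // Assume the previous window's requests were spread evenly over that window.
        double estimated = previousCount * (1.0 - elapsedFraction) + currentCount;
        if (estimated >= limit) {
            return false;
        }
        currentCount++;
        return true;
    }
}
```

The estimate is wrong whenever the previous window's traffic was not evenly spread, which is the "approximate" in the cons list. In exchange, the state is two integers per key instead of one timestamp per request.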

7.4 Token bucket

Tokens refill at a steady rate. Requests consume tokens.

Allows bursts up to bucket capacity.

Good default for many service-to-service limits.

7.5 Leaky bucket

Requests enter a queue and are processed at a fixed rate.

Good for smoothing but can add queue latency.

For synchronous request/response, avoid deep queues.


8. Token Bucket Intuition

Config:

refill rate = 100 tokens/second
bucket capacity = 200 tokens

Meaning:

  • average allowed rate is 100 RPS,
  • short bursts up to 200 requests can pass,
  • sustained rate above 100 RPS will eventually be throttled.

This is usually better than a hard "100 per second" fixed window because real traffic is bursty.

But burst capacity must be deliberate.

Too much burst can still overload a dependency.
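
The configuration above maps directly onto a small token bucket. This is a single-JVM sketch under stated assumptions: `TokenBucket` is a hypothetical class, the bucket starts full, and the clock is injected as nanoseconds so the behavior can be tested deterministically.

```java
import java.util.function.LongSupplier;

/** Illustrative token bucket: refills continuously, allows bursts up to capacity. */
final class TokenBucket {
    private final double refillPerSecond;
    private final double capacity;
    private final LongSupplier clockNanos; // e.g. System::nanoTime in production
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double refillPerSecond, double capacity, LongSupplier clockNanos) {
        this.refillPerSecond = refillPerSecond;
        this.capacity = capacity;
        this.clockNanos = clockNanos;
        this.tokens = capacity; // assumption: start with a full bucket
        this.lastRefillNanos = clockNanos.getAsLong();
    }

    /** Takes `cost` tokens if available; never waits. */
    synchronized boolean tryAcquire(double cost) {
        refill();
        if (tokens < cost) {
            return false;
        }
        tokens -= cost;
        return true;
    }

    private void refill() {
        long now = clockNanos.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Tokens accumulate at the refill rate but never beyond capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
    }
}
```

With `new TokenBucket(100, 200, System::nanoTime)`, an idle caller can send 200 requests at once, then roughly 100 per second after that. Lowering the capacity toward the refill rate trades burst tolerance for smoother load on the dependency.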


9. Server-Side Rate Limiting

Server-side rate limiting protects the provider.

Server-side limit should happen early:

  • before heavy authentication if safe,
  • before request body parsing for large bodies if possible,
  • before expensive database access,
  • before fan-out,
  • before lock acquisition.

But it still needs enough identity to limit fairly.

Common locations:

Location | Pros | Cons
API gateway | central, early, cross-service visibility | may lack deep business context
service mesh/proxy | platform-level enforcement | limited application semantics
application filter/interceptor | rich business context | later in request path
domain operation layer | precise operation semantics | after more work already done
external rate-limit service | centralized dynamic policy | extra dependency

Often use layered limits:

gateway coarse limit
+ application fine-grained limit
+ dependency-specific outbound client limit

10. Client-Side Throttling

Client-side throttling protects both the client and the dependency.

Instead of waiting for 429, the caller limits its own outbound rate.

Use when:

  • dependency quota is known,
  • external provider has strict limits,
  • internal provider publishes capacity contract,
  • batch/replay jobs can self-throttle,
  • many worker threads could otherwise stampede,
  • retry traffic must be bounded.

Client-side throttling is especially important for:

  • message consumers,
  • workflow workers,
  • scheduled jobs,
  • data migrations,
  • fan-out aggregators.

Do not rely only on server-side rate limiting.

A cooperative client should avoid generating rejected traffic.


11. Rate Limiting Is Not Only Request Count

Some requests cost more.

Example:

GET /v1/cases?status=OPEN&pageSize=200

is not equal to:

GET /v1/cases/CASE-100

Weighted rate limits:

Request | Cost
get by ID | 1 token
search page size 50 | 5 tokens
search page size 200 | 20 tokens
export request | 100 tokens
bulk command item | 1 token per item
expensive filter | multiplier

Example:

tenant limit = 1000 tokens/minute
getCase costs 1
searchCases costs pageSize / 10
bulkCreate costs itemCount

Weighted limits align better with real capacity.
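
The cost rules in the example can be expressed as plain functions feeding a token budget. This is a sketch: `RequestCosts` and `TokenBudget` are hypothetical names, and rounding up with a minimum cost of 1 token is an assumption the example does not spell out.

```java
/** Illustrative cost rules matching the example: search costs pageSize / 10, bulk costs itemCount. */
final class RequestCosts {
    private RequestCosts() {}

    static int getCase() {
        return 1;
    }

    static int searchCases(int pageSize) {
        // Assumption: round up and never charge less than 1 token.
        return Math.max(1, (pageSize + 9) / 10);
    }

    static int bulkCreate(int itemCount) {
        return Math.max(1, itemCount);
    }
}

/** Illustrative per-window budget, e.g. 1000 tokens per minute for one tenant. */
final class TokenBudget {
    private int remaining;

    TokenBudget(int tokensPerWindow) {
        this.remaining = tokensPerWindow;
    }

    synchronized boolean tryConsume(int cost) {
        if (cost > remaining) {
            return false;
        }
        remaining -= cost;
        return true;
    }

    synchronized int remaining() {
        return remaining;
    }
}
```

Under these rules a tenant with 1000 tokens per minute can run 1000 lookups by ID or 50 searches with page size 200, but not both.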


12. Per-Tenant Fairness

Multi-tenant systems need fairness.

Without tenant limits, one tenant can consume shared capacity.

Per-tenant limiting:

global capacity = 1000 RPS
tenant default = 100 RPS
premium tenant = 300 RPS
reserved system traffic = 100 RPS

But beware:

  • too strict per-tenant limits waste idle capacity,
  • too loose limits allow noisy neighbor,
  • dynamic borrowing is useful but complex.

Start simple:

global limit + per-tenant limit + critical system reserve
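
The simple starting point can be sketched as one admission check per window. `TenantFairLimiter` is a hypothetical class; it counts requests within a single window and expects the owner to call `startNewWindow()` on a timer, which is an assumption made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative per-window admission: global limit + per-tenant limit + system reserve. */
final class TenantFairLimiter {
    private final int globalLimit;
    private final int systemReserve;
    private final int tenantDefault;
    private final Map<String, Integer> tenantOverrides; // e.g. premium tenants
    private final Map<String, Integer> usedByTenant = new HashMap<>();
    private int usedByAllTenants;
    private int usedBySystem;

    TenantFairLimiter(int globalLimit, int systemReserve, int tenantDefault,
                      Map<String, Integer> tenantOverrides) {
        this.globalLimit = globalLimit;
        this.systemReserve = systemReserve;
        this.tenantDefault = tenantDefault;
        this.tenantOverrides = tenantOverrides;
    }

    synchronized boolean tryAcquireForTenant(String tenantId) {
        int tenantLimit = tenantOverrides.getOrDefault(tenantId, tenantDefault);
        int used = usedByTenant.getOrDefault(tenantId, 0);
        if (used >= tenantLimit) {
            return false; // this tenant exhausted its own share
        }
        if (usedByAllTenants >= globalLimit - systemReserve) {
            return false; // tenant traffic may not eat into the system reserve
        }
        usedByTenant.put(tenantId, used + 1);
        usedByAllTenants++;
        return true;
    }

    synchronized boolean tryAcquireForSystem() {
        if (usedByAllTenants + usedBySystem >= globalLimit) {
            return false;
        }
        usedBySystem++;
        return true;
    }

    /** Call once per window (for example every second) to reset all counters. */
    synchronized void startNewWindow() {
        usedByTenant.clear();
        usedByAllTenants = 0;
        usedBySystem = 0;
    }
}
```

With the numbers above (global 1000, reserve 100, default 100, premium 300), tenants together can never take more than 900 per window, so system traffic always finds at least 100 permits.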

13. Priority-Aware Limits

Not all traffic deserves the same treatment.

Priority classes:

Priority | Example
critical | command completing regulatory action
user-facing | portal request
workflow | business process worker
reconciliation | background correction
batch | report/data sync
optional | recommendation/enrichment

When capacity is scarce, low-priority traffic should be limited first.

Rate limit config:

limits:
  case-service.searchCases:
    user-facing:
      rate: 300/s
      burst: 600
    batch:
      rate: 50/s
      burst: 100
    optional:
      rate: 20/s
      burst: 40

Priority only works if callers identify traffic class reliably.

Do not let callers self-declare high priority without trust controls.


14. Distributed Rate Limiting

A single JVM-local limiter is easy.

But in a horizontally scaled service, local limits multiply.

Example:

10 pods
local limit per pod = 100 RPS
actual global limit = 1000 RPS

That may be intended or accidental.

Options:

Approach | Behavior
local per-pod limit | simple, approximate
divide global limit by pod count | needs dynamic scaling awareness
centralized Redis/service limiter | more accurate, extra dependency
gateway-level global limiter | good for ingress
adaptive feedback | adjusts by observed load
sharded limiter by key | scalable but more complex

Use local limits when approximate protection is enough.

Use centralized/gateway limits for contractual quotas.

Use application-level limits for business-specific dimensions.
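
The multiplication effect and the "divide by pod count" option are simple arithmetic, shown here as a sketch. `PodLimitMath` is a hypothetical helper; how a pod learns the current replica count is left out on purpose.

```java
/** Illustrative arithmetic for local limits in a horizontally scaled service. */
final class PodLimitMath {
    private PodLimitMath() {}

    /** What the fleet admits in total when every pod enforces the same local limit. */
    static int effectiveGlobalLimit(int podCount, int perPodLimit) {
        return podCount * perPodLimit;
    }

    /**
     * Per-pod share of a global limit. Integer division rounds down, so the fleet
     * never exceeds the global limit; the result can be 0 if pods outnumber permits.
     */
    static int perPodShare(int globalLimit, int podCount) {
        if (podCount <= 0) {
            throw new IllegalArgumentException("podCount must be positive");
        }
        return globalLimit / podCount;
    }
}
```

If the service scales from 10 to 20 pods and each pod keeps a fixed local limit of 100 RPS, the effective global limit silently doubles to 2000 RPS. Recomputing the share on scale events keeps it at 1000.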


15. Rate Limiter Failure Mode

If your rate limiter depends on Redis or a central service, what happens when that limiter is unavailable?

Choices:

Mode | Behavior
fail open | allow traffic
fail closed | reject traffic
degraded local limit | fallback to approximate local limiter
cached decision | temporary stale policy

Choose per operation.

For public abuse protection, fail closed may be safer.

For internal critical commands, fail open with local emergency limit may be safer.

For external provider quota protection, fail closed or local conservative limit may be required to avoid provider ban/cost.
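
The failure mode is easiest to keep explicit when it is a constructor argument rather than an accident of exception handling. The sketch below assumes the central limiter surfaces unavailability as a `RuntimeException`; `FailMode` and `ResilientLimiter` are hypothetical names.

```java
import java.util.function.BooleanSupplier;

/** How to decide when the central limiter cannot be reached. */
enum FailMode { FAIL_OPEN, FAIL_CLOSED, LOCAL_FALLBACK }

/** Illustrative wrapper that applies an explicit policy when the central limiter fails. */
final class ResilientLimiter {
    private final BooleanSupplier centralLimiter; // assumed to throw when Redis or the limiter service is down
    private final BooleanSupplier localFallback;  // e.g. a conservative in-JVM token bucket
    private final FailMode failMode;

    ResilientLimiter(BooleanSupplier centralLimiter, BooleanSupplier localFallback, FailMode failMode) {
        this.centralLimiter = centralLimiter;
        this.localFallback = localFallback;
        this.failMode = failMode;
    }

    boolean tryAcquire() {
        try {
            return centralLimiter.getAsBoolean();
        } catch (RuntimeException limiterUnavailable) {
            switch (failMode) {
                case FAIL_OPEN:
                    return true;
                case FAIL_CLOSED:
                    return false;
                default:
                    return localFallback.getAsBoolean();
            }
        }
    }
}
```

A real implementation would also put a short timeout on the central call and emit a metric each time the fallback path is taken, so a degraded protection layer is visible.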


16. Resilience4j RateLimiter Model

Resilience4j RateLimiter controls permissions per refresh period.

Conceptual config:

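import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import java.time.Duration;
import java.util.function.Supplier;

// 100 permissions per 1-second refresh period; never wait for a permission.
RateLimiterConfig config = RateLimiterConfig.custom()
    .limitForPeriod(100)
    .limitRefreshPeriod(Duration.ofSeconds(1))
    .timeoutDuration(Duration.ZERO)
    .build();

RateLimiter limiter = RateLimiter.of("case-service.searchCases", config);

// Every call through the decorated supplier must first obtain a permission.
Supplier<SearchCasesResponse> decorated =
    RateLimiter.decorateSupplier(limiter, () -> callCaseService(query));

// Throws RequestNotPermitted when no permission is available.
SearchCasesResponse response = decorated.get();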

Meaning:

allow 100 permissions per 1 second period
if no permission is available, do not wait

If timeoutDuration is greater than zero, the calling thread blocks for up to that long waiting for a permission before the call is rejected.

For synchronous user-facing calls, prefer small or zero wait.

Waiting for limiter permission consumes caller latency budget.


17. Resilience4j Config Example

resilience4j:
  ratelimiter:
    instances:
      caseServiceSearchCases:
        limitForPeriod: 100
        limitRefreshPeriod: 1s
        timeoutDuration: 0ms

      externalSanctionsProviderScreen:
        limitForPeriod: 50
        limitRefreshPeriod: 1s
        timeoutDuration: 100ms

Notes:

  • limitForPeriod is the number of permissions per refresh period.
  • limitRefreshPeriod is the period at which permissions refresh.
  • timeoutDuration is how long a caller waits for permission.

Do not set long timeoutDuration on user-facing paths unless intentional.


18. Rate Limiter and Retry

Retries must be rate-limited too.

Otherwise a retry storm bypasses your original admission control.

Options:

  1. Same limiter for original and retry attempts.
  2. Separate smaller retry limiter.
  3. Retry budget plus rate limiter.
  4. Retry denied when limiter full.

Recommended:

original outbound calls use operation limiter
retry attempts also require retry budget token

Retries are traffic.

Treat them as traffic.
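
A retry budget can be sketched as two counters and a ratio. `RetryBudget` is a hypothetical class; it uses cumulative counters for brevity, whereas a production version would track a sliding time window so old traffic does not fund new retries.

```java
/** Illustrative retry budget: retries may not exceed a percentage of original calls. */
final class RetryBudget {
    private final int maxRetryPercent; // e.g. 10 means retries are at most 10% of originals
    private long originals;
    private long retries;

    RetryBudget(int maxRetryPercent) {
        this.maxRetryPercent = maxRetryPercent;
    }

    /** Record every first attempt; this is what funds the budget. */
    synchronized void recordOriginal() {
        originals++;
    }

    /** A retry is allowed only if it keeps retries within the configured share. */
    synchronized boolean tryAcquireRetry() {
        if ((retries + 1) * 100 > originals * maxRetryPercent) {
            return false;
        }
        retries++;
        return true;
    }
}
```

In the recommended setup a retry attempt calls `tryAcquireRetry()` first and then still passes through the same operation limiter as an original call, so retries are both budgeted and rate-limited.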


19. Rate Limiter and Bulkhead

Rate limiter controls rate over time.

Bulkhead controls concurrency.

You often need both.

Example:

limit: 100 requests/second
bulkhead: max 40 concurrent calls

If latency rises to 1 second:

  • rate limiter allows 100 requests/sec,
  • concurrency would become 100 in-flight,
  • bulkhead caps at 40.

If traffic bursts 1000 requests instantly:

  • bulkhead caps in-flight,
  • rate limiter caps accepted rate.

They solve different overload shapes.
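
The first scenario is Little's law: average in-flight calls equal arrival rate times average latency. The sketch below pairs that arithmetic with a semaphore-based bulkhead built on the JDK; `ConcurrencyMath` and `SimpleBulkhead` are hypothetical names.

```java
import java.util.concurrent.Semaphore;

/** Little's law: average in-flight calls = arrival rate x average latency. */
final class ConcurrencyMath {
    private ConcurrencyMath() {}

    static double expectedInFlight(double requestsPerSecond, double latencySeconds) {
        return requestsPerSecond * latencySeconds;
    }
}

/** Illustrative bulkhead: caps concurrent calls regardless of the admitted rate. */
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Non-blocking: returns false immediately when all slots are occupied. */
    boolean tryEnter() {
        return permits.tryAcquire();
    }

    /** Must be called in a finally block after a successful tryEnter(). */
    void exit() {
        permits.release();
    }
}
```

At 100 requests per second and 1 second latency the expected concurrency is 100, so a bulkhead of 40 starts rejecting even though the rate limiter is satisfied. At 250 ms latency the expected concurrency is 25 and the bulkhead stays out of the way.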


20. Rate Limiter and Circuit Breaker

When circuit breaker is open, calls should not consume normal remote-call rate permits unless you intentionally count attempted usage.

If rate limiter is before breaker:

RateLimiter -> CircuitBreaker -> Call

Open breaker traffic consumes permits.

If breaker is before limiter:

CircuitBreaker -> RateLimiter -> Call

Open breaker fails fast before limiter.

Which is right?

For outbound client dependency protection:

circuit breaker before remote rate limiter can avoid wasting permits when calls are not allowed

For caller admission fairness:

rate limiter first ensures all attempts are accounted

Again, define what the limiter is protecting.
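
The effect of ordering can be shown with plain stand-ins instead of real breaker and limiter objects. In this sketch `GuardOrdering`, `admit`, and `countingLimiter` are hypothetical; the open breaker is modeled as a guard that always returns false.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Illustrative guard chain: shows which guards run when an earlier one denies the call. */
final class GuardOrdering {
    private GuardOrdering() {}

    /** Runs guards left to right and stops at the first one that denies the call. */
    static boolean admit(BooleanSupplier first, BooleanSupplier second) {
        return first.getAsBoolean() && second.getAsBoolean();
    }

    /** Stand-in limiter that always grants and counts every permit it hands out. */
    static BooleanSupplier countingLimiter(AtomicInteger permitsConsumed) {
        return () -> {
            permitsConsumed.incrementAndGet();
            return true;
        };
    }
}
```

With an open breaker, `admit(limiter, breaker)` consumes one permit per rejected attempt, while `admit(breaker, limiter)` consumes none. Neither is wrong; the count tells you whether the limiter measures attempts or actual remote calls.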


21. Rate Limiter and Queueing

A rate limiter can either reject or wait.

Waiting creates queueing.

For synchronous APIs:

prefer reject/fast fallback over long in-memory waiting

For background workers:

waiting or delayed scheduling may be acceptable

But distinguish:

  • waiting in memory,
  • durable delayed retry,
  • message broker backoff,
  • workflow sleep,
  • scheduled retry.

If the work must eventually happen, do not rely on in-memory rate limiter wait.

Persist it.


22. Handling 429 in Java Client

Client behavior:

public final class RateLimitAwareErrorMapper {
    public RuntimeException map(int status, HttpHeaders headers, Problem problem) {
        if (status == 429) {
            // Retry-After is optional, so keep it as Optional<Duration> for the retry classifier.
            Optional<Duration> retryAfter = parseRetryAfter(headers.firstValue("Retry-After"));
            return new RemoteRateLimitedException(
                problem.code(),
                retryAfter,
                problem.detail()
            );
        }

        return mapOther(status, problem);
    }
}

Retry classifier:

public boolean isRetryable(Throwable throwable, Deadline deadline) {
    if (throwable instanceof RemoteRateLimitedException ex) {
        return ex.retryAfter()
            .filter(delay -> deadline.canFit(delay.plus(minAttemptDuration)))
            .isPresent();
    }

    return defaultClassifier.isRetryable(throwable);
}

The client should respect server intent, but not violate its own deadline.


23. Server-Side Spring Filter Concept

Application-level limiter:

public final class RateLimitFilter extends OncePerRequestFilter {
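    private final RateLimitService rateLimitService;
    private final ProblemResponseWriter problemWriter;

    public RateLimitFilter(RateLimitService rateLimitService, ProblemResponseWriter problemWriter) {
        this.rateLimitService = rateLimitService;
        this.problemWriter = problemWriter;
    }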

    @Override
    protected void doFilterInternal(
        HttpServletRequest request,
        HttpServletResponse response,
        FilterChain chain
    ) throws ServletException, IOException {
        RateLimitKey key = RateLimitKey.from(request);
        RateLimitDecision decision = rateLimitService.tryAcquire(key);

        if (!decision.allowed()) {
            response.setStatus(429);
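            // Round up to whole seconds so a sub-second delay never becomes "Retry-After: 0".
            long retryAfterSeconds = Math.max(1, (decision.retryAfter().toMillis() + 999) / 1000);
            response.setHeader("Retry-After", Long.toString(retryAfterSeconds));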
            response.setHeader("RateLimit-Policy", decision.policyHeader());
            response.setHeader("RateLimit", decision.rateLimitHeader());
            problemWriter.writeRateLimited(response, decision);
            return;
        }

        chain.doFilter(request, response);
    }
}

Key design is the hard part:

public record RateLimitKey(
    String callerService,
    String tenantId,
    String operation,
    String priority
) {}

Do not use raw URL with IDs as key.

Use route template / operation ID.


24. Rate Limit Key Design

Good key:

caller=workflow-service
tenant=tenant-a
operation=searchCases
priority=batch

Bad key:

GET /v1/cases?caseId=CASE-100&userId=U-999

Problems with bad key:

  • high cardinality,
  • sensitive data exposure,
  • no stable aggregation,
  • poor fairness,
  • hard dashboards.

Key should be:

  • low cardinality enough for metrics,
  • precise enough for fairness,
  • aligned with ownership,
  • derived from authenticated identity where possible,
  • not directly controlled by untrusted caller.
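
The key properties above can be sketched as two small functions. Everything here is illustrative: `RateLimitKeys`, the route patterns, and the set of priority names are assumptions, and the path is assumed to arrive without its query string.

```java
import java.util.Set;

/** Illustrative key builder: stable operation IDs and clamped priority, never raw URLs. */
final class RateLimitKeys {
    private static final Set<String> KNOWN_PRIORITIES = Set.of("user-facing", "batch", "optional");

    private RateLimitKeys() {}

    /** Maps a concrete path to an operation ID so resource IDs never enter the key. */
    static String operationIdFor(String method, String path) {
        if (method.equals("GET") && path.matches("/v1/cases/[^/]+")) {
            return "getCase";
        }
        if (method.equals("GET") && path.equals("/v1/cases")) {
            return "searchCases";
        }
        return "unknown";
    }

    /**
     * Builds the limiter key. The caller name should come from authenticated identity,
     * and an unrecognized or missing priority falls back to the lowest class.
     */
    static String keyFor(String authenticatedCaller, String tenant, String operationId,
                         String declaredPriority) {
        String priority = (declaredPriority != null && KNOWN_PRIORITIES.contains(declaredPriority))
            ? declaredPriority
            : "optional";
        return "caller=" + authenticatedCaller
            + ";tenant=" + tenant
            + ";operation=" + operationId
            + ";priority=" + priority;
    }
}
```

Requests for `/v1/cases/CASE-100` and `/v1/cases/CASE-200` both map to `getCase`, so they share one limiter and one metric series instead of creating a key per case.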

25. Rate Limit Headers for Successful Responses

A server can also send rate limit fields on successful responses.

Example:

RateLimit: limit=100, remaining=42, reset=10
RateLimit-Policy: 100;w=60

This helps cooperative clients slow down before receiving 429.

But be careful:

  • do not expose sensitive capacity details if inappropriate,
  • do not make clients depend on exact internal implementation,
  • document whether headers are approximate,
  • support multiple limits carefully.

For internal APIs, these headers are useful for platform-level client behavior and dashboards.


26. Rate Limit and Idempotency Replay

Should idempotency replay count against rate limit?

Example:

  • first command succeeded,
  • response lost,
  • client retries same idempotency key,
  • server replays original result.

Counting replay fully may punish reliable retry behavior.

Not counting replay at all may allow abuse.

Possible policy:

Request type | Count?
first command attempt | yes
duplicate replay same key | discounted or separate counter
same key different payload | yes + conflict metric
in-progress duplicate | yes or lower cost
validation error | usually yes
auth failure | yes, possibly security limiter
health check | separate limiter

Document it.

For internal command APIs, track replay separately:

rate_limit.tokens.consumed{kind="first_attempt"}
rate_limit.tokens.consumed{kind="idempotency_replay"}

27. Rate Limit and Security

Rate limiting is not only reliability.

It also supports:

  • abuse prevention,
  • brute-force protection,
  • credential misuse detection,
  • tenant isolation,
  • scraping control,
  • expensive-query protection,
  • internal runaway job containment.

But security limiters have different requirements:

  • often keyed by user/IP/client credential,
  • may fail closed,
  • may have lower thresholds,
  • may intentionally hide details,
  • may feed into alerting and blocking.

Do not mix all security throttling with normal capacity rate limiting.

Separate policies.


28. Observability

Metrics:

rate_limit.requests.total{operation,caller,tenant,decision}
rate_limit.permits.granted.total{limiter}
rate_limit.permits.denied.total{limiter}
rate_limit.wait.duration{limiter}
rate_limit.tokens.remaining{limiter}
http.server.requests{status="429",operation}
http.client.rate_limited.total{dependency,operation}

Useful labels:

  • operation ID,
  • caller service,
  • tenant tier, not necessarily tenant ID,
  • priority,
  • decision: allowed/denied/waited,
  • limit policy name,
  • retry-after bucket.

Avoid high cardinality:

  • user ID,
  • raw tenant ID in high-cardinality metrics unless controlled,
  • request ID,
  • raw URL,
  • idempotency key.

Structured log for denial:

{
  "event": "rate_limit_denied",
  "operation": "searchCases",
  "caller": "reporting-job",
  "priority": "batch",
  "policy": "case-search-batch-default",
  "retryAfterMs": 2000
}

29. Alerting

Useful alerts:

Alert | Meaning
429 rate high for critical caller | caller under-provisioned or runaway
429 rate high globally | limit too low or traffic spike
one tenant denied heavily | noisy tenant or legitimate growth
retry-after ignored by caller | client bug
client-side limiter saturated | dependency quota pressure
external provider limiter near quota | risk of provider throttling
rate-limit service unavailable | protection layer degraded
limit denied but system underutilized | policy too strict
no 429 during overload | limiter not protecting

Rate limiting alerts should be actionable.

A high 429 rate may be healthy if it prevents overload.


30. Testing Rate Limits

Minimum tests:

Scenario | Expected behavior
under limit | request allowed
over limit | 429 returned
Retry-After present | client can back off
rate-limit headers present | policy visible
different caller | separate quota
different tenant | separate quota
weighted request | consumes correct tokens
burst within capacity | allowed
burst beyond capacity | limited
limiter unavailable | fail-open/fail-closed policy applied
retries count against retry budget | no retry storm
idempotency replay behavior | counted according to policy
metrics emitted | allowed/denied visible

Single-threaded test for a local limiter (all three calls must land in the same refresh period):

@Test
void deniesRequestsAfterLimitForPeriod() {
    RateLimiterConfig config = RateLimiterConfig.custom()
        .limitForPeriod(2)
        .limitRefreshPeriod(Duration.ofSeconds(1))
        .timeoutDuration(Duration.ZERO)
        .build();

    RateLimiter limiter = RateLimiter.of("test", config);

    assertThat(limiter.acquirePermission()).isTrue();
    assertThat(limiter.acquirePermission()).isTrue();
    assertThat(limiter.acquirePermission()).isFalse();
}

HTTP test:

@Test
void returns429WithRetryAfterWhenLimitExceeded() {
    // Assumes the test policy allows exactly 10 requests per window for this caller.
    for (int i = 0; i < 10; i++) {
        http.get("/v1/cases");
    }

    HttpResponse<String> response = http.get("/v1/cases");

    assertThat(response.statusCode()).isEqualTo(429);
    assertThat(response.headers().firstValue("Retry-After")).isPresent();
}

31. Production Policy Template

rateLimits:
  inbound:
    case-service:
      operations:
        searchCases:
          dimensions:
            - callerService
            - tenantTier
            - priority
          policies:
            user-facing:
              algorithm: token-bucket
              rate: 300/s
              burst: 600
              responseStatus: 429
              retryAfter: dynamic
            batch:
              algorithm: token-bucket
              rate: 50/s
              burst: 100
              responseStatus: 429
              retryAfter: dynamic

        createEscalation:
          dimensions:
            - callerService
            - tenantId
          policies:
            default:
              algorithm: token-bucket
              rate: 100/s
              burst: 150
              idempotencyReplayCost: 0.2

  outbound:
    external-sanctions-provider:
      screenParty:
        algorithm: token-bucket
        rate: 40/s
        burst: 80
        timeoutWhenNoPermit: 100ms
        failMode: local-conservative-limit

A good policy says:

  • what is limited,
  • who is limited,
  • algorithm,
  • rate,
  • burst,
  • response behavior,
  • observability,
  • owner.

32. Common Anti-Patterns

32.1 No internal rate limits

A replay job or workflow bug can overwhelm a provider.

32.2 One global limit

Critical traffic and batch traffic compete unfairly.

32.3 Long wait inside synchronous limiter

The request times out anyway, but resources are held.

32.4 Rate limiting by raw URL

High-cardinality keys and poor fairness.

32.5 Retrying 429 immediately

Client ignores server backpressure.

32.6 Server returns 500 for throttling

Clients treat intentional throttling as server crash.

32.7 Limit only at gateway

Application-specific expensive operations bypass precise control.

32.8 Limit only in app

Gateway still accepts and forwards traffic that could be rejected earlier.

32.9 Distributed local limit accidentally multiplies

10 pods each allow 100 RPS, global becomes 1000 RPS.

32.10 No observability for denied traffic

Nobody knows whether limit protects the system or blocks legitimate growth.


33. Decision Model

Rate limiting is a design choice, not a checkbox.


34. Design Checklist

Before shipping rate limiting:

  • What capacity or quota is protected?
  • Is this inbound or outbound?
  • What dimensions are used?
  • Are keys low-cardinality and trustworthy?
  • What algorithm is used?
  • What is the average rate?
  • What burst is allowed?
  • Is the limit local or global?
  • What happens when limiter storage is unavailable?
  • Is 429 used for throttling?
  • Is Retry-After provided?
  • Are RateLimit fields exposed?
  • Do clients honor throttling?
  • Are retries counted or separately budgeted?
  • Are batch and user-facing traffic separated?
  • Are weighted costs needed?
  • Is idempotency replay counted?
  • Are metrics and alerts configured?
  • Is there a process to request limit changes?
  • Are tests covering boundary and burst behavior?

35. The Real Lesson

Rate limiting is not about saying "no" arbitrarily.

It is about keeping communication within known capacity.

A mature Java microservice platform uses rate limiting to create:

fairness
+ quota enforcement
+ dependency protection
+ retry control
+ tenant isolation
+ predictable degradation

A request denied early with 429 is often a success.

It means the system refused overload while it could still explain why.

