
Consistency, Invalidation, and Stampede Control

Learn Java Redis In Action - Part 015

Production consistency patterns for Redis-backed caches: invalidation, versioned keys, freshness windows, stampede control, stale-while-revalidate, refresh-ahead, single-flight, logical expiry, and correctness envelopes in Java systems.


Part 015 — Consistency Patterns: Invalidation, Versioned Cache, Stampede Control

Part 014 introduced major cache patterns. This part goes deeper into the hardest part of caching:

How do we keep Redis useful without letting it silently corrupt business behavior?

The main enemy is not stale data by itself. The real enemy is unbounded, invisible, unjustified staleness.

A senior engineer does not say:

Redis is eventually consistent.

A senior engineer says:

This Redis projection may be stale for at most 30 seconds under normal operation, may serve stale for 5 minutes during source outage, is invalidated by these domain events, and is never used for payment authorization.

That is the difference between using Redis as a speed hack and using Redis as a production engineering component.


1. Kaufman Skill Decomposition

Target skill:

Design Redis cache consistency behavior explicitly, including invalidation, freshness bounds, stampede control, stale fallback, and operational failure modes.

Sub-skills:

Sub-skill | What you must be able to do
Consistency envelope | Define allowed staleness per user journey
Invalidation design | Decide delete, update, version bump, or event-driven refresh
Versioned cache | Prevent stale overwrite and cross-version pollution
Stampede control | Stop hot misses from overwhelming source systems
Logical expiry | Separate usability expiry from Redis physical expiry
Single-flight | Ensure only one request refreshes a hot key
Stale fallback | Serve controlled stale data during source failures
Negative consistency | Prevent not-found poisoning and creation races
Observability | Measure staleness, refresh failures, and miss amplification
Failure modeling | Know what happens during Redis, DB, and event bus failures

Practice rule for this part:

Every cache pattern must answer: what can be stale, for how long, who refreshes it, who invalidates it, and what happens when refresh fails?


2. The Core Problem: Redis and the Source of Truth Are Separate Systems

Most Java services cache data from a durable source:

  • PostgreSQL
  • MySQL
  • Oracle
  • MongoDB
  • Elasticsearch/OpenSearch
  • external HTTP service
  • another microservice
  • event-derived projection

Redis is usually not updated in the same atomic transaction as the source.

The weak point is obvious:

DB commit and Redis invalidation are not one atomic operation.

  • If the DB commit succeeds but the Redis deletion fails, the cache may remain stale.
  • If the Redis deletion succeeds but the DB commit later fails, the cache may be unnecessarily cold.
  • If two writes race, a stale writer may overwrite a fresh cache value.
  • If hot data expires all at once, many clients may hit the DB together.

Therefore consistency must be designed, not assumed.


3. Taxonomy of Cache Consistency Requirements

Not every cache needs the same correctness.

Use case | Correctness need | Example Redis behavior
Product description | Low/medium | TTL + event invalidation
User display name | Medium | TTL + delete after update
Authorization permission | High | short TTL or direct source check for critical actions
Pricing quote | Very high | do not trust cache for final price confirmation
Inventory badge | Medium | stale allowed for browse, not checkout
Feature flag | High | versioned config + short TTL + fallback policy
Fraud/risk decision | High | cache intermediate features, not final irreversible decision
Search result | Medium | index version + TTL + rebuild path
Dashboard stats | Low/medium | precomputed projection + freshness timestamp

A useful rule:

Redis can accelerate decisions, but it should not secretly become the authority for decisions whose wrongness has high business cost.

Ask these questions before choosing a pattern:

  1. What is the source of truth?
  2. Is stale data acceptable?
  3. How stale is acceptable?
  4. Is stale data acceptable during outage?
  5. Can users see stale data, or only internal services?
  6. Can stale data cause financial, legal, security, or compliance damage?
  7. Is read latency more important than freshness?
  8. How expensive is source recomputation?
  9. How often does the data change?
  10. How many clients may request the same key simultaneously?

4. Consistency Envelope

A consistency envelope is a written contract around cached data.

Example:

cache: customer-profile
source: customer_db.customer
owner: customer-service
read_path: customer-service GET /customers/{id}
write_path: customer-service PATCH /customers/{id}
normal_freshness: <= 60 seconds
max_stale_during_source_outage: <= 10 minutes
physical_ttl: 15 minutes
logical_ttl: 60 seconds
refresh_policy: single-flight stale-while-revalidate
invalidation_policy: delete-after-commit on profile update event
critical_paths:
  - never use for KYC enforcement
  - never use for billing address validation

This looks heavy, but for important caches it prevents months of ambiguous production behavior.

Minimal Envelope Template

Cache name:
Source of truth:
Owner service:
Key pattern:
Value schema version:
Allowed normal staleness:
Allowed degraded staleness:
Invalidation trigger:
Refresh trigger:
Fallback when Redis unavailable:
Fallback when source unavailable:
Forbidden usage:
Observability metrics:

If a cache cannot be described this way, the team probably does not understand the cache.


5. Four Kinds of Expiry

Many Redis cache bugs come from mixing different meanings of “expired”.

Expiry type | Meaning
Physical TTL | Redis automatically removes the key
Logical TTL | Application considers value stale after timestamp
Business validity | Domain rule says the value is no longer valid
Client freshness | The caller’s journey requires data no older than X

These are not equivalent.

Example:

{
  "schemaVersion": 3,
  "loadedAtEpochMs": 1783000000000,
  "freshUntilEpochMs": 1783000060000,
  "staleUntilEpochMs": 1783000600000,
  "sourceVersion": 928172,
  "payload": {
    "customerId": "c-123",
    "tier": "GOLD"
  }
}

Redis key TTL may be 15 minutes. Logical freshness may be 60 seconds. Business validity may end when customer tier changes. Client freshness may differ between journeys.

The pattern:

physical TTL > staleUntil > freshUntil

Why?

  • freshUntil controls normal freshness.
  • staleUntil allows degraded serving during source failure.
  • physical TTL bounds memory and cleans abandoned keys.
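The three bounds above can be carried in one value envelope. The following is a minimal sketch, assuming a hypothetical `CacheEnvelope` class whose field names follow the JSON example; the factory signature, the exclusive upper bounds, and the `physicalTtl` helper are illustrative assumptions rather than a fixed design.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical envelope: field names mirror the JSON example, the rest is assumed.
final class CacheEnvelope<T> {
    final long loadedAtEpochMs;
    final long freshUntilEpochMs;
    final long staleUntilEpochMs;
    final T payload;

    private CacheEnvelope(long loadedAt, long freshUntil, long staleUntil, T payload) {
        this.loadedAtEpochMs = loadedAt;
        this.freshUntilEpochMs = freshUntil;
        this.staleUntilEpochMs = staleUntil;
        this.payload = payload;
    }

    // Enforces freshUntil < staleUntil at construction time.
    static <T> CacheEnvelope<T> fresh(T payload, Instant loadedAt, Duration freshFor, Duration staleFor) {
        if (freshFor.isNegative() || staleFor.compareTo(freshFor) <= 0) {
            throw new IllegalArgumentException("require 0 <= freshFor < staleFor");
        }
        long t = loadedAt.toEpochMilli();
        return new CacheEnvelope<>(t, t + freshFor.toMillis(), t + staleFor.toMillis(), payload);
    }

    // Inside the normal freshness window.
    boolean isFresh(Clock clock) {
        return clock.millis() < freshUntilEpochMs;
    }

    // Past the fresh window but still inside the degraded-serving window.
    boolean isUsablyStale(Clock clock) {
        long now = clock.millis();
        return now >= freshUntilEpochMs && now < staleUntilEpochMs;
    }

    // Physical TTL to pass to Redis: the stale window plus a cleanup margin,
    // so the key outlives every moment at which the value is still usable.
    Duration physicalTtl(Duration margin) {
        return Duration.ofMillis(staleUntilEpochMs - loadedAtEpochMs).plus(margin);
    }
}
```

With a 60-second fresh window, a 10-minute stale window, and a 5-minute margin, `physicalTtl` yields the 15 minutes used in the example.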

6. Basic Invalidation: Delete After Commit

The safest default for many write paths:

1. Write source of truth.
2. Commit transaction.
3. Delete Redis cache key.
4. Next read reloads from source.

Why delete instead of update?

Updating the cache after write looks attractive, but delete is often safer:

Approach | Benefit | Risk
Update cache after DB write | Keeps cache warm | stale writer can overwrite fresh value; incomplete projection risk
Delete cache after DB write | Forces fresh reload | next read pays source cost
Version bump | Avoids old key pollution | needs version lookup and cleanup
Event-driven invalidation | decouples writer/read model | event delay/loss must be handled

Delete-after-commit makes the next read rebuild from the actual source.

Java Pseudocode

@Transactional
public Customer updateCustomer(String customerId, PatchCustomerCommand command) {
    Customer updated = customerRepository.patch(customerId, command);

    // DB transaction commits when method exits.
    // In Spring, prefer after-commit hook for Redis invalidation.
    TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
        @Override
        public void afterCommit() {
            redisTemplate.delete("cache:customer:v3:" + customerId);
        }
    });

    return updated;
}

Important:

Do not delete before commit unless you deliberately accept refill-before-commit races.


7. The Classic Race: Delete Then Stale Refill

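Consider this sequence:

1. Reader misses the cache and reads the old row from the DB.
2. Writer updates the row and commits.
3. Writer deletes the cache key.
4. Reader writes the old value it loaded in step 1 into Redis.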

Result:

Redis contains stale old value after writer invalidated it.

This happens when a reader loads old data before the writer commits and writes it after invalidation.

Solutions:

  1. short TTL only
  2. double delete
  3. versioned cache
  4. compare source version before setting cache
  5. logical timestamp check in Lua
  6. cache only after reading committed source version

There is no universal answer. Choose based on data criticality and write/read concurrency.


8. Double Delete Pattern

Pattern:

1. Delete cache.
2. Write DB.
3. Delete cache again after a small delay.

Or safer in many Java services:

1. Write DB.
2. Commit.
3. Delete cache.
4. Schedule second delete after expected stale refill window.

When it helps:

  • read-through cache can refill stale data during write race
  • no source version available
  • stale risk is moderate
  • delayed second delete is cheap

When it is weak:

  • delay is guessed
  • long source query can exceed delay
  • event scheduler may fail
  • repeated writes complicate behavior
  • it is mitigation, not proof

Double delete is a pragmatic patch. It is not a strong correctness protocol.

Java Sketch

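public void invalidateCustomerAfterCommit(String customerId) {
    String key = customerKey(customerId);

    // First delete: runs right after the DB commit.
    redisTemplate.delete(key);

    // Second delete: clears a stale value that a racing reader may have refilled.
    // `scheduler` is assumed to be a java.util.concurrent.ScheduledExecutorService.
    scheduler.schedule(
        () -> { redisTemplate.delete(key); },
        750,
        TimeUnit.MILLISECONDS
    );
}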

Better version:

Delay should be based on observed p99 source-read + p99 cache-set latency, not folklore.

9. Versioned Cache Keys

Versioned keys avoid stale overwrite by changing the key namespace when source changes.

Example:

cache:customer:{c-123}:v42

A separate version pointer tells readers which key to use:

cachever:customer:{c-123} = 42
cache:customer:{c-123}:v42 = payload

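Flow:

1. Reader GETs the version pointer.
2. Reader GETs the payload key for that version.
3. On a miss, reader loads the source and SETs the payload under the key for the version it actually loaded.
4. Writer commits the source change, then advances the version pointer.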

Advantages:

  • old refills write to old key only
  • latest pointer selects new version
  • avoids stale overwrite on the active key
  • useful for materialized projections

Costs:

  • two Redis reads unless pipelined
  • old keys require TTL cleanup
  • version pointer consistency matters
  • source must expose version or update counter

Version Sources

Source | Suitability
DB row version column | strong and simple
updated_at timestamp | useful but watch clock precision
monotonically increasing sequence | best for strict ordering
domain event offset | good for projection caches
Redis INCR version | ok if Redis drives invalidation, weak if DB is source

Java Key Builder

public final class CustomerCacheKeys {
    public static String versionKey(String customerId) {
        return "cachever:customer:{" + customerId + "}";
    }

    public static String payloadKey(String customerId, long version) {
        return "cache:customer:{" + customerId + "}:v" + version;
    }
}

In Redis Cluster, the hash tag {customerId} keeps related keys in the same hash slot. This matters for Lua and multi-key operations.
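To see why the hash tag co-locates the two keys, the slot calculation can be reproduced locally. This is a small sketch of the Redis Cluster rule (CRC16/XMODEM of the hash-tag content, modulo 16384); the class name `ClusterSlots` is hypothetical, and a real client library already does this for you.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mirrors the Redis Cluster key-to-slot rule.
final class ClusterSlots {

    // CRC16/XMODEM: polynomial 0x1021, initial value 0, MSB first.
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // If the key contains a non-empty {...} section, only that section is hashed.
    static String hashPart(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                return key.substring(open + 1, close);
            }
        }
        return key;
    }

    static int slot(String key) {
        return crc16(hashPart(key).getBytes(StandardCharsets.UTF_8)) % 16384;
    }
}
```

Both `cachever:customer:{c-123}` and `cache:customer:{c-123}:v42` hash only `c-123`, so they map to the same slot and can be used together in one Lua script or multi-key command.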


10. Value-Level Version Guard

If you cannot version the key, version the value.

Payload envelope:

{
  "schemaVersion": 3,
  "sourceVersion": 43,
  "loadedAtEpochMs": 1783000000000,
  "payload": {
    "customerId": "c-123",
    "name": "Ari"
  }
}

Before writing cache:

Only set payload if existing sourceVersion is missing or <= candidate sourceVersion.

This should be atomic. Use Lua or Redis Functions.

Lua Version Guard

-- KEYS[1] = cache key
-- ARGV[1] = candidate source version
-- ARGV[2] = candidate payload
-- ARGV[3] = ttl seconds

local current = redis.call('GET', KEYS[1])
if current then
  local currentVersion = tonumber(string.match(current, '"sourceVersion"%s*:%s*(%d+)'))
  local candidateVersion = tonumber(ARGV[1])

  if currentVersion and currentVersion > candidateVersion then
    return 0
  end
end

redis.call('SET', KEYS[1], ARGV[2], 'EX', ARGV[3])
return 1

Do not parse large JSON in Lua in a hot path if avoidable. A more practical design stores version separately:

cache:customer:{c-123}:value
cache:customer:{c-123}:version

Then a script can compare numeric version cheaply.
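The set-if-newer rule is easier to reason about with an in-memory model. The sketch below is a hypothetical stand-in (`VersionGuardedCache` is not a Redis client) that applies the same comparison as the Lua guard, using `ConcurrentHashMap.compute` for per-key atomicity. It can serve as a fake in unit tests; production code still needs the atomic check on the Redis side.

```java
import java.util.concurrent.ConcurrentHashMap;

// In-memory model of "set only if not older"; names are illustrative.
final class VersionGuardedCache<V> {

    static final class Entry<T> {
        final long sourceVersion;
        final T value;

        Entry(long sourceVersion, T value) {
            this.sourceVersion = sourceVersion;
            this.value = value;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> store = new ConcurrentHashMap<>();

    // Returns true if the candidate was written, false if a strictly newer
    // version is already cached. Equal versions overwrite, as in the Lua guard.
    boolean putIfNewer(String key, long candidateVersion, V value) {
        boolean[] written = {false};
        store.compute(key, (k, current) -> {
            if (current != null && current.sourceVersion > candidateVersion) {
                return current; // keep the newer cached entry
            }
            written[0] = true;
            return new Entry<>(candidateVersion, value);
        });
        return written[0];
    }

    Entry<V> get(String key) {
        return store.get(key);
    }
}
```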


11. Event-Driven Invalidation

When writes happen in one service and reads happen in another, direct invalidation is insufficient.

Pattern:

1. Writer commits source change.
2. Writer emits domain event.
3. Cache-owning service consumes event.
4. Cache-owning service invalidates or refreshes Redis keys.

Important:

Use outbox if losing invalidation events would create unacceptable stale behavior.

Do not rely on “publish event after commit” without a retryable mechanism if correctness matters.

Invalidate or Refresh?

Strategy | When useful | Risk
Invalidate on event | common, simple, source reloads on demand | first read after event pays miss
Refresh on event | hot data stays warm | event consumer may overload source
Version bump on event | strong stale avoidance | needs version pointer and cleanup
Refresh-ahead | predictable hot keys | extra compute and stale scheduling

For most systems:

invalidate on event + refresh on demand

is safer than eager refresh everything.


12. Invalidation Fanout

One source change may affect many cache keys.

Example: product price update affects:

cache:product:{p-1}:detail:v3
cache:category:{cat-7}:products:page:1:v5
cache:search:q:{hash}:page:1:v2
cache:recommendation:user:{u-9}:v4
cache:quote-preview:{tenant}:{cart-hash}:v1

This is where naive caching collapses.

You need a dependency model.

Options

Approach | How it works | Trade-off
Direct key invalidation | writer knows all affected keys | tight coupling, brittle
Tag/index invalidation | maintain reverse index of affected keys | extra write/memory overhead
Version namespace | bump domain version used in key | simple, may orphan old keys
Short TTL | avoid explicit dependency tracking | stale window and source load
Rebuild projection | async refresh derived views | eventual freshness

Version Namespace Example

cachever:product:p-1 = 18
cachever:category:cat-7 = 81
cache:category:{cat-7}:products:v81:page:1

When category membership changes:

INCR cachever:category:cat-7

Old pages become unreachable and expire physically later.

This is often simpler than tracking every page key.


13. Cache Stampede Mental Model

A cache stampede happens when many requests miss at the same time and all recompute the same value.

Miss amplification:

source_load = miss_count * rebuild_cost

For hot keys, a single expiration can become a backend outage.

Stampede causes:

  • synchronized TTL
  • Redis restart flushes hot keys
  • deploy changes cache namespace
  • source outage causes refresh failures
  • high-cardinality keys with uneven traffic
  • manual invalidation of many hot keys
  • short TTL on expensive values

A top engineer treats stampede as a capacity problem, not only a cache problem.


14. TTL Jitter

Without jitter:

10,000 product keys loaded at 10:00
all expire at 10:15
backend spike at 10:15

With jitter:

ttl = baseTtl + random(0, jitterRange)

Example:

Duration ttlWithJitter(Duration base, Duration jitter) {
    long jitterMs = ThreadLocalRandom.current().nextLong(jitter.toMillis() + 1);
    return base.plusMillis(jitterMs);
}

Better for some cases:

ttl = baseTtl * random(0.8, 1.2)
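The proportional variant can be sketched as follows. The class name and the `spread` parameter are assumptions for illustration; a spread of 0.2 corresponds to the 0.8 to 1.2 range above.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for multiplicative TTL jitter.
final class ProportionalJitter {

    // Scales the base TTL by a random factor in [1 - spread, 1 + spread).
    static Duration jittered(Duration base, double spread) {
        if (spread < 0.0 || spread >= 1.0) {
            throw new IllegalArgumentException("spread must be in [0, 1)");
        }
        if (spread == 0.0) {
            return base; // nextDouble requires origin < bound
        }
        double factor = ThreadLocalRandom.current().nextDouble(1.0 - spread, 1.0 + spread);
        return Duration.ofMillis(Math.round(base.toMillis() * factor));
    }
}
```

Proportional jitter keeps the spread meaningful across caches with very different base TTLs, whereas a fixed additive range can be negligible for a one-hour TTL and excessive for a ten-second one.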

Guideline:

Situation | Jitter
low traffic, cheap source | optional
many keys loaded in batch | mandatory
hot keys with same TTL | mandatory
scheduled refresh job | mandatory
cache namespace migration | mandatory

Jitter spreads load. It does not solve hot-key recomputation by itself.


15. Single-Flight Refresh

Single-flight means:

For a given key, only one caller rebuilds the value; others wait, return stale, or fail fast.

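Basic lock-based flow:

1. A request finds the key missing or stale.
2. It tries to acquire a per-key refresh lock.
3. The winner reloads from the source, writes the cache, and releases the lock.
4. The losers return stale data, wait briefly and re-read the cache, or fail fast.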

Redis lock:

SET lock:refresh:customer:{c-123} <token> NX PX 5000

Rules:

  • lock TTL must be longer than expected refresh p99
  • lock token must be random
  • release only if token matches
  • never block request threads indefinitely
  • define fallback for lock wait timeout

Safe Unlock Lua

if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
else
  return 0
end

This prevents one worker from deleting another worker’s lock after timeout/reacquire.


16. Single-Flight Java Sketch

public Optional<CustomerDto> getCustomer(String customerId) {
    String cacheKey = customerKey(customerId);
    String lockKey = "lock:refresh:customer:{" + customerId + "}";

    Optional<CacheEnvelope<CustomerDto>> cached = redisCache.get(cacheKey, customerType);
    if (cached.isPresent() && cached.get().isFresh(clock)) {
        return Optional.of(cached.get().payload());
    }

    String token = UUID.randomUUID().toString();
    boolean lockAcquired = redisLock.tryAcquire(lockKey, token, Duration.ofSeconds(5));

    if (lockAcquired) {
        try {
            // Double-check after acquiring lock.
            Optional<CacheEnvelope<CustomerDto>> afterLock = redisCache.get(cacheKey, customerType);
            if (afterLock.isPresent() && afterLock.get().isFresh(clock)) {
                return Optional.of(afterLock.get().payload());
            }

            CustomerDto loaded = customerRepository.findDto(customerId)
                .orElseThrow(() -> new NotFoundException(customerId));

            CacheEnvelope<CustomerDto> envelope = CacheEnvelope.fresh(
                loaded,
                clock.instant(),
                Duration.ofSeconds(60),
                Duration.ofMinutes(10)
            );

            redisCache.set(cacheKey, envelope, Duration.ofMinutes(15));
            return Optional.of(loaded);
        } finally {
            redisLock.releaseIfTokenMatches(lockKey, token);
        }
    }

    if (cached.isPresent() && cached.get().isUsablyStale(clock)) {
        metrics.increment("redis.cache.stale_served", "cache", "customer");
        return Optional.of(cached.get().payload());
    }

    // Last resort: short local wait then retry cache.
    sleepSmallBoundedDelay();
    return redisCache.get(cacheKey, customerType).map(CacheEnvelope::payload);
}

Key points:

  • read before lock
  • acquire lock only on miss/stale
  • double-check after lock
  • stale fallback if another worker refreshes
  • bounded wait only
  • no infinite retry loop
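The Redis lock coordinates across service instances. Within one JVM, concurrent callers for the same key can be coalesced before they ever reach Redis. The sketch below is one possible in-process single-flight, assuming a hypothetical `InProcessSingleFlight` class and a loader that returns a `CompletableFuture`; it complements the Redis lock rather than replacing it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-JVM request coalescer: one in-flight load per key.
final class InProcessSingleFlight<K, V> {

    private final ConcurrentHashMap<K, CompletableFuture<V>> inflight = new ConcurrentHashMap<>();

    // The first caller for a key starts the load; later callers share its future.
    // startLoad must only *start* asynchronous work and return quickly, because it
    // runs inside computeIfAbsent and must not touch this map.
    CompletableFuture<V> load(K key, Supplier<CompletableFuture<V>> startLoad) {
        CompletableFuture<V> future = inflight.computeIfAbsent(key, k -> startLoad.get());
        // Once the load finishes (success or failure), forget it so the next
        // miss triggers a new load instead of reusing an old result.
        future.whenComplete((value, error) -> inflight.remove(key, future));
        return future;
    }
}
```

Callers should still bound their wait, for example with `future.get(timeout, unit)`, and fall back to stale data on timeout, in line with the bounded-wait rule above.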

17. Stale-While-Revalidate

Stale-while-revalidate separates response availability from refresh latency.

State model:

fresh      -> return immediately
stale      -> return stale and trigger refresh
too stale  -> block on refresh or fail
missing    -> block on refresh or fail

Value envelope:

{
  "loadedAtEpochMs": 1783000000000,
  "freshUntilEpochMs": 1783000060000,
  "staleUntilEpochMs": 1783000600000,
  "payload": {}
}

Behavior:

State | Action
fresh | return value
stale but usable | return stale; one worker refreshes async
too stale | try foreground refresh
refresh fails and stale usable | return stale with metric
refresh fails and too stale | fail or degraded response

This pattern is excellent for:

  • profile summary
  • product detail
  • CMS page
  • dashboard widgets
  • risk features that tolerate degraded freshness
  • external API response cache

It is dangerous for:

  • final price authorization
  • payment state
  • permission revocation
  • inventory reservation
  • compliance decision
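The state model and behavior table can be written down as two small pure functions, which makes the policy easy to unit test. This is a sketch under assumptions: the class, enum names, and the choice of exclusive upper bounds are illustrative, not prescribed by the pattern.

```java
// Hypothetical encoding of the stale-while-revalidate decision table.
final class StaleWhileRevalidate {

    enum Freshness { FRESH, STALE_USABLE, TOO_STALE, MISSING }

    enum Action {
        RETURN_VALUE,
        RETURN_STALE_AND_REFRESH_ASYNC,
        REFRESH_IN_FOREGROUND,
        RETURN_STALE_WITH_METRIC,
        FAIL_OR_DEGRADE
    }

    // freshUntilMs and staleUntilMs are null when the key is absent.
    static Freshness classify(long nowMs, Long freshUntilMs, Long staleUntilMs) {
        if (freshUntilMs == null || staleUntilMs == null) {
            return Freshness.MISSING;
        }
        if (nowMs < freshUntilMs) {
            return Freshness.FRESH;
        }
        if (nowMs < staleUntilMs) {
            return Freshness.STALE_USABLE;
        }
        return Freshness.TOO_STALE;
    }

    // What the read path does before any refresh result is known.
    static Action onRead(Freshness freshness) {
        switch (freshness) {
            case FRESH:
                return Action.RETURN_VALUE;
            case STALE_USABLE:
                return Action.RETURN_STALE_AND_REFRESH_ASYNC;
            default:
                return Action.REFRESH_IN_FOREGROUND; // TOO_STALE or MISSING
        }
    }

    // What the read path does after a refresh attempt has failed.
    static Action onRefreshFailure(Freshness freshness) {
        switch (freshness) {
            case FRESH:
                return Action.RETURN_VALUE;
            case STALE_USABLE:
                return Action.RETURN_STALE_WITH_METRIC;
            default:
                return Action.FAIL_OR_DEGRADE;
        }
    }
}
```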

18. Refresh-Ahead

Refresh-ahead means refreshing hot keys before clients observe expiry.

Example:

if now > freshUntil - refreshAheadWindow:
    trigger refresh in background

Use it for:

  • known hot keys
  • expensive source queries
  • predictable dashboards
  • product homepage data
  • configuration snapshots

Avoid it for:

  • unbounded high-cardinality keyspace
  • rarely used keys
  • source systems with tight capacity
  • data where refresh does not matter until requested

Refresh-ahead requires admission control. Without admission control it becomes a cache warming DDoS against your own database.

Hot-Key Refresh Queue

Sorted set key:

cache-refresh:customer-profile:due

Member:

customer:{c-123}

Score:

epoch millis refresh due time
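The trigger and the due queue can be modeled in memory to show how admission control fits in. This is a single-threaded sketch with a hypothetical `RefreshDueQueue` class standing in for the sorted set; against Redis the same steps would roughly map to ZADD, ZRANGEBYSCORE with a LIMIT, and ZREM. The `maxBatch` cap is the assumed admission-control knob.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model of a refresh-due sorted set: member -> due time (epoch millis).
final class RefreshDueQueue {

    private final Map<String, Long> dueAtMs = new HashMap<>();

    // Refresh-ahead trigger: true once "now" enters the window before freshUntil.
    static boolean shouldRefreshAhead(long nowMs, long freshUntilMs, long refreshAheadWindowMs) {
        return nowMs > freshUntilMs - refreshAheadWindowMs;
    }

    // Re-scheduling an existing member replaces its due time, like ZADD.
    void schedule(String member, long dueMs) {
        dueAtMs.put(member, dueMs);
    }

    // Removes and returns at most maxBatch members that are due, earliest first.
    // The cap keeps a burst of due keys from flooding the source.
    List<String> pollDue(long nowMs, int maxBatch) {
        List<Map.Entry<String, Long>> due = new ArrayList<>();
        for (Map.Entry<String, Long> e : dueAtMs.entrySet()) {
            if (e.getValue() <= nowMs) {
                due.add(e);
            }
        }
        due.sort((a, b) -> Long.compare(a.getValue(), b.getValue()));
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, Long> e : due) {
            if (batch.size() == maxBatch) {
                break;
            }
            batch.add(e.getKey());
        }
        for (String member : batch) {
            dueAtMs.remove(member);
        }
        return batch;
    }
}
```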

19. Negative Cache Consistency

Negative cache stores “not found” results.

Example:

cache:customer:{c-404}:negative = { "reason": "not_found" }

Benefits:

  • blocks repeated DB hits for missing IDs
  • protects source from malicious/random probes
  • reduces expensive external API calls

Risks:

  • newly created record hidden until negative TTL expires
  • permission-sensitive not-found may leak semantics
  • invalidation on creation is often forgotten

Rules:

Rule | Why
Use short TTL | not-found can become found
Include scope | tenant/user/permission affects visibility
Invalidate on create | creation must remove negative marker
Separate not-found vs forbidden | avoid security confusion
Do not cache transient source errors as not-found | prevents false negative poisoning

Negative key pattern:

cache:customer:{tenant-a:c-123}:neg:v1

TTL examples:

Data | Negative TTL
random ID lookup | 30s-5m
external API missing object | 1m-15m
user-created object | 5s-30s
security-sensitive resource | often do not cache negative result
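The three most commonly missed rules (scope in the key, short TTL, invalidate on create) can be illustrated with a small in-memory model. `NegativeCacheModel` is a hypothetical stand-in for Redis keys with TTL; time is passed in explicitly so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of tenant-scoped negative markers with expiry.
final class NegativeCacheModel {

    private final Map<String, Long> expiresAtMs = new HashMap<>();

    // Tenant and ID share the hash tag, following the negative key pattern above.
    static String negativeKey(String tenantId, String customerId) {
        return "cache:customer:{" + tenantId + ":" + customerId + "}:neg:v1";
    }

    // Call only after the source authoritatively answered "not found",
    // never after a timeout or other transient source error.
    void markNotFound(String tenantId, String customerId, long nowMs, long ttlMs) {
        expiresAtMs.put(negativeKey(tenantId, customerId), nowMs + ttlMs);
    }

    boolean isKnownMissing(String tenantId, String customerId, long nowMs) {
        String key = negativeKey(tenantId, customerId);
        Long expiresAt = expiresAtMs.get(key);
        if (expiresAt == null) {
            return false;
        }
        if (nowMs >= expiresAt) {
            expiresAtMs.remove(key); // marker expired; the source must be checked again
            return false;
        }
        return true;
    }

    // The create path must clear the marker, or the new record stays hidden.
    void onCreated(String tenantId, String customerId) {
        expiresAtMs.remove(negativeKey(tenantId, customerId));
    }
}
```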

20. Permission-Aware Cache Keys

One of the worst cache consistency bugs is missing authorization dimensions.

Bad key:

cache:case-detail:{case-123}

If the payload depends on the viewer, this is wrong.

Better:

cache:case-detail:{tenant-a:case-123}:viewer-role:{role-hash}:v4

Or split the cache:

cache:case-public-summary:{tenant-a:case-123}:v2
cache:case-sensitive-fields:{tenant-a:case-123}:permission-snapshot:{hash}:v1

A cache key must include every input that changes the output:

  • tenant
  • locale
  • currency
  • role
  • entitlement snapshot
  • experiment bucket
  • API version
  • projection version
  • source version
  • feature flag version

Invariant:

If two requests can legitimately receive different responses, they must not share the same cached value unless the cached value contains only their common subset.
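One way to keep the permission dimension in the key without listing every permission is to hash a canonical form of the permission set. The sketch below assumes a hypothetical `PermissionAwareKeys` helper, permission names that contain no commas, and a 64-bit truncated SHA-256 digest; all three are illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical key builder that folds the viewer's permissions into the key.
final class PermissionAwareKeys {

    // Sorting and de-duplicating first makes the hash independent of input order.
    static String permissionHash(Collection<String> permissions) {
        String canonical = String.join(",", new TreeSet<>(permissions));
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(canonical.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) { // first 8 bytes keep the key short
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Tenant and case ID form the hash tag; the permission hash is a plain segment.
    static String caseSummaryKey(String tenantId, String caseId, Collection<String> permissions) {
        return "cache:case-summary:{" + tenantId + ":" + caseId + "}:visibility:"
            + permissionHash(permissions) + ":v2";
    }
}
```

Two viewers with the same effective permissions share a cache entry, while any difference in the permission set produces a different key.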


21. Local In-Process Cache + Redis Cache

Many Java systems use two levels:

Caffeine local cache -> Redis -> DB

This improves latency, but complicates invalidation.

Problems:

  • Redis invalidation does not automatically clear local caches
  • local cache may serve stale after Redis changed
  • instance restart changes behavior
  • Pub/Sub invalidation is at-most-once

Safer design:

Layer | TTL | Role
local Caffeine | very short, e.g. 1-5 seconds | absorb microbursts
Redis | longer, e.g. 1-15 minutes | distributed cache
DB/source | authoritative | correctness

Do not put long TTL in local memory unless you have reliable invalidation.

Local Cache Rule

local_ttl <= smallest acceptable staleness window

For permission/security data, local cache should often be extremely short or disabled.


22. Cache Rebuild Admission Control

When Redis misses, you need to protect the source.

Admission strategies:

Strategy | Behavior
single-flight per key | one rebuild per key
global rebuild semaphore | cap total concurrent source loads
per-tenant rebuild limit | prevent noisy tenant overload
priority queue | rebuild critical caches first
fail fast | reject low-priority rebuilds during pressure
stale fallback | serve stale instead of rebuilding

Java sketch:

public <T> T loadWithAdmission(String cacheName, Supplier<T> loader) {
    if (!rebuildLimiter.tryAcquire()) {
        throw new CacheRebuildRejectedException(cacheName);
    }
    try {
        return loader.get();
    } finally {
        rebuildLimiter.release();
    }
}

Metric:

cache_rebuild_rejected_total{cache="customer-profile"}

If this metric is non-zero, the rebuild limiter is protecting the source by rejecting cache refreshes. That is often better than taking the source down.


23. Cache Miss Is Not One Thing

Do not observe only hit and miss.

A miss can mean:

Miss type | Meaning | Action
cold miss | never loaded | normal load
expired miss | physical TTL removed key | maybe stampede risk
invalidated miss | writer deleted key | expected source load
version miss | new version pointer, payload absent | rebuild latest
negative miss | no negative marker | source check needed
decode miss | value exists but cannot decode | delete and reload
too-stale miss | logical stale limit exceeded | foreground refresh
Redis error miss | cache unavailable | fallback or fail

Use reason codes in cache abstraction:

public enum CacheReadStatus {
    FRESH_HIT,
    STALE_HIT,
    MISS_ABSENT,
    MISS_TOO_STALE,
    MISS_DECODE_ERROR,
    MISS_REDIS_ERROR,
    NEGATIVE_HIT
}

Without reason codes, cache metrics become misleading.


24. Stale Overwrite Prevention

The stale overwrite problem:

Old reader loads source version 10 slowly.
New writer updates source to version 11.
Fast reader caches version 11.
Old reader finishes and overwrites cache with version 10.

Prevention options:

Option | Strength | Cost
versioned key | high | version pointer + old key cleanup
Lua compare version | high | envelope/version management
short TTL | low | stale window remains
delete on write | medium | race remains
double delete | medium | guessed delay

Preferred for important data:

sourceVersion + atomic set-if-newer

Separate Version Key Lua

-- KEYS[1] = value key
-- KEYS[2] = version key
-- ARGV[1] = candidate version
-- ARGV[2] = payload
-- ARGV[3] = ttl seconds

local currentVersion = redis.call('GET', KEYS[2])
local candidateVersion = tonumber(ARGV[1])

if currentVersion and tonumber(currentVersion) > candidateVersion then
  return 0
end

redis.call('SET', KEYS[1], ARGV[2], 'EX', ARGV[3])
redis.call('SET', KEYS[2], ARGV[1], 'EX', ARGV[3])
return 1

Cluster note:

cache:customer:{c-123}:value
cache:customer:{c-123}:version

The hash tag ensures both keys are in the same slot.


25. Freshness Budget

A freshness budget is the maximum age tolerated by a path.

Example:

Path | Allowed age
product browse | 5 minutes
product detail price preview | 30 seconds
checkout price confirmation | 0 seconds or source-authoritative
admin config page | 10 seconds
homepage recommendation | 15 minutes
fraud feature cache | 1-5 minutes depending on feature

Do not attach one TTL to an entity globally. The same entity may have different freshness budgets in different journeys.

Better:

cache:product-summary:{p-1}:browse:v3        TTL 10m
cache:product-price-preview:{p-1}:v5         TTL 30s
no-cache final checkout price authority      source read

This avoids over-constraining cheap paths and under-protecting critical paths.
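A per-path budget can also be enforced in code, so that a value cached for one journey is not silently reused by a stricter one. The enum below is a sketch: the `ReadPath` name and its three entries are assumptions taken from the table, and a zero budget is interpreted as "always read the source".

```java
import java.time.Duration;

// Hypothetical per-journey freshness budgets.
enum ReadPath {
    PRODUCT_BROWSE(Duration.ofMinutes(5)),
    PRICE_PREVIEW(Duration.ofSeconds(30)),
    CHECKOUT_PRICE(Duration.ZERO); // source-authoritative: never served from cache

    private final Duration maxAge;

    ReadPath(Duration maxAge) {
        this.maxAge = maxAge;
    }

    // True if a cached value of the given age is acceptable on this path.
    boolean mayServe(Duration valueAge) {
        return !maxAge.isZero() && valueAge.compareTo(maxAge) <= 0;
    }
}
```

A four-minute-old product summary passes for `PRODUCT_BROWSE` but is rejected for `PRICE_PREVIEW`, and `CHECKOUT_PRICE` rejects every cached value regardless of age.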


26. Cache Key Versioning for Deployments

Schema changes require cache versioning.

Bad:

cache:customer:{id}

Better:

cache:customer:v3:{id}

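In Redis Cluster, write the ID as a literal hash tag so that related keys share a slot:

cache:customer:{id}:v3

Only the text inside the first {...} is hashed, so the version segment may sit before or after the tag; pick one position and use it consistently.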

Use version bump when:

  • serialized schema changes incompatibly
  • value meaning changes
  • key dimension changes
  • source query changes materially
  • permission model changes

Do not use version bump casually for every deploy. It can cold-start your entire cache.

Migration Strategy

Strategy | Behavior
read old, write new | gradual migration
dual write | higher write cost
namespace cutover | simple but cold start risk
prewarm | reduces cold start, adds source load
fallback decoder | supports mixed versions

For high traffic caches, use gradual migration:

GET v3 -> if miss GET v2 -> transform -> SET v3

Then expire v2 naturally.
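The read-old/write-new path can be sketched with an in-memory map standing in for Redis. The class name, the string payloads, and the `upgradeV2ToV3` transform are hypothetical; a real implementation would also set a TTL on the v3 write and leave v2 to expire.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

// In-memory model of gradual schema migration: GET v3, else GET v2, transform, SET v3.
final class ReadOldWriteNew {

    final Map<String, String> store = new HashMap<>(); // stand-in for Redis
    private final UnaryOperator<String> upgradeV2ToV3;

    ReadOldWriteNew(UnaryOperator<String> upgradeV2ToV3) {
        this.upgradeV2ToV3 = upgradeV2ToV3;
    }

    static String key(String id, int schemaVersion) {
        return "cache:customer:{" + id + "}:v" + schemaVersion;
    }

    Optional<String> get(String id) {
        String current = store.get(key(id, 3));
        if (current != null) {
            return Optional.of(current);
        }
        String old = store.get(key(id, 2));
        if (old == null) {
            return Optional.empty(); // full miss: caller loads from the source
        }
        String upgraded = upgradeV2ToV3.apply(old);
        store.put(key(id, 3), upgraded); // write new; the v2 key is left untouched
        return Optional.of(upgraded);
    }
}
```

This only works when the v2 payload contains everything the v3 schema needs; if it does not, treat the v2 hit as a miss and reload from the source.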


27. Redis Unavailable: What Should Happen?

Every cache user needs a Redis failure policy.

Cache role | Redis down behavior
performance-only cache | bypass Redis, read source with admission control
source-protecting cache | degrade or fail fast to avoid DB collapse
coordination lock | do not pretend lock succeeded
idempotency marker | fail closed or use source uniqueness
rate limiter | fail open or fail closed based on risk
feature flag cache | use last-known-good local snapshot

For cache consistency, the most dangerous policy is accidental:

catch RedisException -> call DB with no limit

Under Redis outage, every service instance may stampede the DB.

Safer:

  • circuit breaker around Redis
  • bounded source fallback
  • per-cache fallback policy
  • stale local snapshot for selected caches
  • global rebuild semaphore
  • degraded response for non-critical data
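A bounded source fallback can be as small as a semaphore around the source call. The following is a minimal sketch; `SourceFallbackLimiter` and its two-argument `executeOrDegrade` are assumed names for illustration, and the permit count would need to be sized from the source's real capacity.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical limiter that caps concurrent source loads during a Redis outage.
final class SourceFallbackLimiter {

    private final Semaphore permits;

    SourceFallbackLimiter(int maxConcurrentSourceLoads) {
        this.permits = new Semaphore(maxConcurrentSourceLoads);
    }

    // Runs sourceLoad if a permit is free; otherwise returns the degraded
    // result immediately instead of queueing more work onto the source.
    <T> T executeOrDegrade(Supplier<T> sourceLoad, Supplier<T> degraded) {
        if (!permits.tryAcquire()) {
            return degraded.get();
        }
        try {
            return sourceLoad.get();
        } finally {
            permits.release();
        }
    }
}
```

The degraded supplier might return a last-known-good local snapshot, a partial response, or throw a domain-specific exception, depending on the per-cache fallback policy.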

28. Source Unavailable: What Should Happen?

If source is down and Redis has stale value, should you serve it?

Depends.

Data | Serve stale during source outage?
product content | yes, bounded
customer display profile | often yes
permissions | rarely, or very short bounded
payment status | usually no
rate plan definition | maybe, if effective date included
fraud rule | maybe last-known-good if governed
final quote price | no unless business explicitly accepts

Stale serving should be logged and measured:

cache_stale_served_total{cache="product-detail", reason="source_failure"}
cache_stale_age_seconds{cache="product-detail"}

Alert not on any stale serve, but on excessive stale age or volume.


29. Testing Cache Consistency

Unit tests are not enough. You need concurrency and failure tests.

Test Cases

Test | What it proves
concurrent read miss single-flight | only one source call
read/write race | stale overwrite prevented or bounded
Redis delete failure after DB commit | stale bounded by TTL/event repair
source failure with stale value | stale fallback policy works
source failure without stale value | correct failure response
decode failure | cache is deleted/reloaded
namespace migration | old value not decoded incorrectly
negative cache create race | creation invalidates negative marker
Redis cluster cross-slot | Lua/multi-key script keys co-located

Deterministic Race Test Sketch

@Test
void staleReaderMustNotOverwriteNewerCacheValue() throws Exception {
    CountDownLatch readerLoadedOld = new CountDownLatch(1);
    CountDownLatch writerCommittedNew = new CountDownLatch(1);

    CompletableFuture<Void> oldReader = CompletableFuture.runAsync(() -> {
        CustomerDto old = repository.loadVersion(10);
        readerLoadedOld.countDown();
        await(writerCommittedNew);
        cache.putIfNewer("c-123", 10, old);
    });

    CompletableFuture<Void> writer = CompletableFuture.runAsync(() -> {
        await(readerLoadedOld);
        repository.updateToVersion(11);
        cache.putIfNewer("c-123", 11, repository.loadVersion(11));
        writerCommittedNew.countDown();
    });

    CompletableFuture.allOf(oldReader, writer).join();

    assertThat(cache.get("c-123").sourceVersion()).isEqualTo(11);
}

This is the kind of test that reveals senior-level cache bugs.


30. Operational Metrics

Minimum metrics:

cache_read_total{cache,status}
cache_write_total{cache,status}
cache_delete_total{cache,status}
cache_hit_ratio{cache}
cache_stale_served_total{cache,reason}
cache_stale_age_seconds{cache}
cache_rebuild_total{cache,status}
cache_rebuild_duration_seconds{cache}
cache_rebuild_inflight{cache}
cache_lock_acquire_total{cache,status}
cache_lock_wait_duration_seconds{cache}
cache_decode_failure_total{cache,schemaVersion}
cache_payload_bytes{cache}
cache_negative_hit_total{cache}
cache_source_fallback_total{cache,reason}

Redis/server-side signals:

instantaneous_ops_per_sec
used_memory
evicted_keys
expired_keys
keyspace_hits
keyspace_misses
connected_clients
blocked_clients
cmdstat_get/usec_per_call
cmdstat_set/usec_per_call
slowlog_len
latency_spike_events

Business-facing signals:

checkout_price_revalidation_failure_total
permission_cache_stale_denied_total
permission_cache_stale_allowed_total
quote_preview_stale_age_seconds
customer_profile_cache_age_seconds

Do not stop at hit ratio. A high hit ratio can still hide stale, wrong, or oversized values.


31. Pattern Selection Matrix

Problem | Recommended pattern
normal product read cache | cache-aside + TTL jitter + delete-after-commit
hot expensive key | stale-while-revalidate + single-flight
source version available | versioned key or set-if-newer
many derived keys affected | namespace version bump
creation race with missing objects | short negative cache + invalidate on create
cache schema migration | versioned key + read-old/write-new
cross-service writes | outbox event invalidation
Redis outage risk to DB | source fallback admission control
stale overwrite unacceptable | atomic version guard
local cache above Redis | very short local TTL + invalidation hint only

32. Production Checklist

Before approving a Redis cache in a serious Java system:

  • Source of truth is explicit.
  • Cache owner is explicit.
  • Key pattern includes tenant/security/version dimensions.
  • Value schema is versioned.
  • Normal freshness budget is defined.
  • Degraded staleness budget is defined.
  • Physical TTL is greater than logical stale window.
  • TTL jitter is applied where relevant.
  • Invalidation trigger is documented.
  • Write path uses after-commit invalidation.
  • Stale overwrite race is either prevented or accepted explicitly.
  • Hot keys have single-flight or stale-while-revalidate.
  • Redis outage policy is defined.
  • Source outage policy is defined.
  • Negative cache TTL is short and create invalidation exists.
  • Metrics include stale age and rebuild pressure.
  • Tests include concurrency and failure cases.
  • Cache cannot be used accidentally in forbidden critical paths.

33. Anti-Patterns

Anti-pattern 1 — TTL as a Guess

TTL = 1 hour because it feels reasonable

Better:

TTL = based on freshness budget + source load + stale fallback policy
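
One way to make that derivation explicit is a small policy object that takes the budgets as inputs. The sketch below is illustrative only: TtlPolicy and its fields are hypothetical names, the 30-second and 5-minute figures in the usage note echo the example envelope from the introduction, and source load is not modeled here.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical TTL policy: derives expiries from stated budgets, not a guess. */
final class TtlPolicy {
    private final Duration freshnessBudget; // max age served under normal operation
    private final Duration staleWindow;     // extra age tolerated during a source outage
    private final double jitterFraction;    // 0.10 spreads expiry by up to +10%

    TtlPolicy(Duration freshnessBudget, Duration staleWindow, double jitterFraction) {
        if (freshnessBudget.isZero() || freshnessBudget.isNegative()) {
            throw new IllegalArgumentException("freshnessBudget must be positive");
        }
        if (staleWindow.isNegative()) {
            throw new IllegalArgumentException("staleWindow must not be negative");
        }
        if (jitterFraction < 0.0 || jitterFraction > 1.0) {
            throw new IllegalArgumentException("jitterFraction must be in [0, 1]");
        }
        this.freshnessBudget = freshnessBudget;
        this.staleWindow = staleWindow;
        this.jitterFraction = jitterFraction;
    }

    /** Logical expiry: past this age the entry should be refreshed. */
    Duration logicalTtl() {
        return freshnessBudget;
    }

    /**
     * Physical Redis TTL: fresh period plus stale window, plus random jitter.
     * Jitter is only added, so the key always outlives the stale window.
     */
    Duration physicalTtl() {
        long baseMillis = freshnessBudget.plus(staleWindow).toMillis();
        long maxJitterMillis = (long) (baseMillis * jitterFraction);
        long jitterMillis = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return Duration.ofMillis(baseMillis + jitterMillis);
    }
}
```

For example, new TtlPolicy(Duration.ofSeconds(30), Duration.ofMinutes(5), 0.10) yields a logical TTL of 30 seconds and a physical TTL between 330 and 363 seconds.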

Anti-pattern 2 — Cache Key Missing Security Context

cache:case:{caseId}

Better:

cache:case-summary:{tenantId:caseId}:visibility:{permissionHash}:v2
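
A key builder keeps those dimensions from being forgotten at individual call sites. The following is a sketch under assumptions: CaseSummaryKeys is a hypothetical class, the first pair of braces is treated as a literal Redis Cluster hash tag while {permissionHash} is treated as a placeholder, and the permission hash is taken to be a SHA-256 digest over the sorted permission names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.TreeSet;

/** Hypothetical key builder that always includes tenant, visibility, and version. */
final class CaseSummaryKeys {
    private static final String SCHEMA_VERSION = "v2";

    /** Order-independent hash of the caller's effective permissions. */
    static String permissionHash(Collection<String> permissions) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            // Sort and de-duplicate so equal permission sets hash identically.
            for (String permission : new TreeSet<>(permissions)) {
                sha256.update(permission.getBytes(StandardCharsets.UTF_8));
                sha256.update((byte) 0); // separator: ["ab","c"] differs from ["a","bc"]
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha256.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    static String key(String tenantId, String caseId, Collection<String> permissions) {
        return "cache:case-summary:{" + tenantId + ":" + caseId + "}"
                + ":visibility:" + permissionHash(permissions)
                + ":" + SCHEMA_VERSION;
    }
}
```

Two callers with the same permission set share an entry; a caller with a different set gets a different key and can never read a summary built for someone else's visibility.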

Anti-pattern 3 — Redis Exception Means Unlimited DB Fallback

catch (RedisException e) {
    return db.load(id);
}

Better:

catch (RedisException e) {
    return sourceFallbackLimiter.executeOrDegrade(() -> db.load(id));
}
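
The sourceFallbackLimiter above is not defined in this part. One possible shape, sketched here as an assumption, is a semaphore-based bulkhead: at most N fallback reads hit the database concurrently, and every other caller fails fast with an exception that the service maps to its degraded response. The class and exception names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bulkhead that caps concurrent source reads during a Redis outage. */
final class SourceFallbackLimiter {

    /** Signals that the fallback budget is exhausted; callers should degrade. */
    static final class SourceFallbackRejectedException extends RuntimeException {
        SourceFallbackRejectedException(String message) {
            super(message);
        }
    }

    private final Semaphore permits;

    SourceFallbackLimiter(int maxConcurrentFallbacks) {
        this.permits = new Semaphore(maxConcurrentFallbacks);
    }

    /** Runs the source call if a permit is free; otherwise rejects without waiting. */
    <T> T executeOrDegrade(Supplier<T> sourceCall) {
        if (!permits.tryAcquire()) {
            throw new SourceFallbackRejectedException("source fallback budget exhausted");
        }
        try {
            return sourceCall.get();
        } finally {
            permits.release(); // always return the permit, even if the source call throws
        }
    }
}
```

Rejecting immediately instead of queueing is deliberate in this sketch: during a Redis outage, queued callers would pile up and time out anyway, while a fast rejection lets the caller return a degraded response right away.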

Anti-pattern 4 — Hit Ratio as the Only Cache KPI

A cache can have a 99% hit ratio and still serve unauthorized or stale data.

Add:

  • stale age
  • decode failures
  • source fallback rate
  • lock contention
  • rebuild duration
  • forbidden-path usage guard

Anti-pattern 5 — Pub/Sub as Reliable Invalidation

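Redis Pub/Sub is an ephemeral signaling mechanism: messages are not persisted, and a subscriber that is disconnected when a message is published never receives it. Treat Pub/Sub invalidation as a hint unless you have another repair mechanism such as TTL, versioning, or event replay.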
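
As a rough illustration of "hint plus repair mechanism", the sketch below models a small in-process cache sitting above Redis. The Pub/Sub handler only evicts early; the short TTL is what bounds staleness when a message is lost. HintedLocalCache and its methods are hypothetical, and the injected clock exists only to make the behavior testable.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical local cache: short TTL is the guarantee, invalidation is a hint. */
final class HintedLocalCache<V> {

    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clockMillis;

    HintedLocalCache(long ttlMillis, LongSupplier clockMillis) {
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    void put(String key, V value) {
        Objects.requireNonNull(value, "value");
        entries.put(key, new Entry<>(value, clockMillis.getAsLong() + ttlMillis));
    }

    /** Returns the value only while it is inside the local TTL. */
    Optional<V> get(String key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return Optional.empty();
        }
        if (clockMillis.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(key, entry); // remove only this exact expired entry
            return Optional.empty();
        }
        return Optional.of(entry.value);
    }

    /** Call from the Pub/Sub message handler. A lost message only delays eviction until TTL. */
    void onInvalidationHint(String key) {
        entries.remove(key);
    }
}
```

With this shape, the worst-case local staleness is the local TTL regardless of whether any invalidation message arrives.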


34. Engineering Heuristics

Use these defaults unless you have a reason not to:

  • Prefer cache-aside for simple derived reads.
  • Prefer delete-after-commit over update-after-commit.
  • Prefer versioned keys for high-write/high-read race-prone caches.
  • Prefer short local cache TTL above Redis.
  • Prefer stale-while-revalidate for hot expensive reads where stale is acceptable.
  • Prefer source-authoritative reads for final irreversible decisions.
  • Prefer outbox-driven invalidation when writer and reader are separate services.
  • Prefer TTL jitter on all batch-loaded or hot cache families.
  • Prefer explicit degraded behavior over accidental best effort.

The top 1% skill is not knowing many cache patterns. The top 1% skill is knowing which pattern is safe under the actual failure modes.


35. Part Summary

Cache consistency is not binary. It is an envelope containing freshness, invalidation, refresh, fallback, and observability.

Key takeaways:

  • Redis and the source of truth are usually not updated atomically together.
  • Define a consistency envelope per cache and per user journey.
  • Physical TTL, logical TTL, business validity, and client freshness are different concepts.
  • Delete-after-commit is often the safest default invalidation strategy.
  • Double delete reduces some races but is not strong correctness.
  • Versioned keys and version-guarded writes prevent stale overwrites.
  • Event-driven invalidation needs reliable publication, often via outbox.
  • Stampede control is mandatory for hot keys.
  • Stale-while-revalidate improves availability when stale data is acceptable.
  • Negative cache must be short-lived and invalidated on create.
  • Cache keys must include security and personalization dimensions.
  • Redis outage must not become uncontrolled DB fallback.
  • Measure stale age, rebuild pressure, decode errors, and fallback behavior.

Next part:

Part 016 — Idempotency, Deduplication, and Exactly-Once Illusions

