Observability and Debugging: Metrics, Slowlog, MONITOR, and Tracing

Learn Java Redis In Action - Part 027

Production Redis observability and debugging for Java engineers covering INFO, SLOWLOG, LATENCY, commandstats, client metrics, tracing, dashboards, alerts, and incident runbooks.

2026-07-02 | 22 min read | 4385 words

Part 026 covered memory engineering: encoding, eviction, TTL, fragmentation, hot keys, and large keys. Now we move to the operational skill that separates Redis users from Redis production engineers:

Can you explain why Redis is slow, unstable, memory-heavy, inconsistent, or expensive while the incident is happening?

Redis is usually introduced as “fast”. That statement is useful but dangerous. Redis can still become slow because of:

network round trips
large payloads
slow commands
huge keys
hot keys
blocking operations
fork pressure during persistence
eviction storms
expiration storms
connection storms
replica lag
cluster slot movement
client retries
bad serialization
noisy neighbors on shared infrastructure
application code that turns one business request into hundreds of Redis calls

The core mental model:

Redis observability is not just monitoring the Redis process. It is correlating application-level intent, Java client behavior, Redis command execution, keyspace shape, memory lifecycle, persistence events, replication state, and host resources into one debuggable story.

1. Kaufman Skill Decomposition

The skill is not “open Redis Insight” or “check INFO”. The real skill is:

Given a production symptom, identify which layer is responsible, prove it with measurements, reduce blast radius, and implement a fix that prevents recurrence.

Breakdown:

Sub-skill	What you must be able to do
Signal design	Define Redis SLIs and metrics that match actual user-visible behavior
Client instrumentation	Measure Java command latency, timeout, retry, pool pressure, and payload size
Server metrics	Interpret `INFO`, `SLOWLOG`, `LATENCY`, commandstats, memory, persistence, replication, and keyspace metrics
Keyspace diagnosis	Detect hot keys, large keys, bad TTL distribution, and cardinality explosions
Incident triage	Move from symptom to likely cause without random guessing
Alert design	Alert on burn-rate and actionable risk, not noisy one-off spikes
Tracing	Correlate HTTP/API calls with Redis command groups and downstream effects
Safe debugging	Know when not to use expensive tools such as `MONITOR` in production
Runbook execution	Apply repeatable steps for latency, memory, eviction, replication, and client incidents
Feedback loop	Convert incidents into dashboards, tests, limits, and design changes

Kaufman-style outcome:

After this part, you should be able to design a Redis observability baseline and debug common production Redis incidents without guessing.

2. Four Layers of Redis Observability

A Redis incident is rarely only a Redis incident. A Java application can generate Redis pain long before Redis itself is “broken”.

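Use a layered model that separates application code, the Java client, the Redis server, and the underlying host and network.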

When debugging, do not start with a command. Start with a question:

Which layer changed, and which signal proves it?

Examples:

Symptom	Possible application cause	Possible Redis cause	Possible infrastructure cause
API p99 increased	N+1 Redis calls	slow command / large key	network packet loss
Redis CPU high	sudden traffic fanout	expensive command	CPU throttling
Timeouts	retry storm	blocked event loop	network congestion
Cache hit rate dropped	key version mismatch	eviction storm	memory limit too low
Replica stale	read routing mistake	replication lag	network/disk saturation
Memory rising	payload bloat	expired keys not reclaimed fast enough	fragmentation
Sudden command volume	application deploy	hot key workload	autoscaling burst

A mature team does not rely on Redis metrics alone. It collects correlated signals from all layers.

3. What “Good Redis Observability” Must Answer

At minimum, your Redis observability must answer these questions quickly:

3.1 Availability

Is Redis reachable from the application?
Are commands timing out?
Are connection attempts failing?
Are clients reconnecting repeatedly?
Is Sentinel/Cluster topology stable?

3.2 Latency

What is application-observed Redis latency?
What is server-side command execution time?
Is latency dominated by server CPU, network round trip, client queuing, or retries?
Which commands contribute most to p95/p99 latency?

3.3 Throughput

How many commands per second are executed?
Which command families dominate traffic?
Is throughput increasing because user traffic increased or because application fanout/retry increased?

3.4 Memory

How much memory is used by dataset vs overhead?
Is memory growing by key count, value size, fragmentation, clients, replication buffers, or AOF buffers?
Are evictions happening?
Are expired keys accumulating?

3.5 Correctness

Are cache hit rates behaving as expected?
Are idempotency keys expiring too early?
Are locks timing out or being renewed incorrectly?
Are stream pending entries accumulating?
Are queues stuck?

3.6 Durability and Recovery

Are RDB/AOF persistence operations succeeding?
Is the last save time acceptable?
Are AOF rewrites failing?
Is replication lag within budget?
Can backup restore be proven?

The dangerous dashboard is one that says “Redis is green” while users are still experiencing timeouts. The useful dashboard shows both Redis internal state and application-observed behavior.

4. Metric Taxonomy

Use a stable taxonomy so dashboards, alerts, and incident reviews do not become random metric collections.

5. Redis Server-Side Tools

Redis gives several built-in tools. Each answers a different question.

Tool	Primary use	Safe for continuous production use?	Notes
`INFO`	broad server metrics	yes	foundation for exporters/dashboards
`SLOWLOG`	server-side slow command execution	yes, with sane config	does not include network/client I/O time
`LATENCY`	latency event diagnosis	yes, if enabled intentionally	useful for internal latency spikes
commandstats via `INFO commandstats`	per-command call volume and cumulative execution time	yes	good for workload shifts
`CLIENT LIST`	connected client diagnosis	careful	useful during connection storms
`MEMORY STATS`	memory breakdown	yes, occasional	useful for capacity diagnosis
`MEMORY USAGE`	per-key memory estimate	sample only	do not scan blindly in hot prod path
`SCAN`	incremental keyspace inspection	careful	safer than `KEYS`, still workload-sensitive
`MONITOR`	live command stream	no for normal production	can be extremely expensive/noisy

Important distinction:

Redis server-side execution time and Java-observed command latency are not the same metric.

A command can be fast in SLOWLOG but slow to the Java caller because of:

connection pool wait
queued async commands
network latency
TLS overhead
retry delay
cluster redirect
serialization/deserialization
overloaded application threads

6. INFO: The Production Baseline

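INFO is the default observability surface. It returns sections such as server, clients, memory, persistence, stats, replication, CPU, cluster, and keyspace. The commandstats section is not part of the default output, so request it explicitly with INFO commandstats (or use INFO all).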

Useful commands:

redis-cli INFO
redis-cli INFO memory
redis-cli INFO stats
redis-cli INFO commandstats
redis-cli INFO persistence
redis-cli INFO replication
redis-cli INFO clients
redis-cli INFO keyspace

6.1 Memory Fields to Watch

Commonly useful fields:

used_memory
used_memory_human
used_memory_rss
used_memory_peak
used_memory_dataset
mem_fragmentation_ratio
maxmemory
maxmemory_policy
evicted_keys
expired_keys

Interpretation:

Field	Meaning	Common issue
`used_memory`	memory allocated by Redis allocator	dataset or buffer growth
`used_memory_rss`	memory seen by OS	fragmentation / allocator / fork effects
`used_memory_peak`	highest memory reached	capacity planning signal
`used_memory_dataset`	approximate dataset memory	actual data footprint
`mem_fragmentation_ratio`	RSS vs allocated memory relationship	high fragmentation or RSS overhead
`evicted_keys`	keys evicted due to maxmemory	correctness risk if unexpected
`expired_keys`	keys expired by TTL	TTL lifecycle behavior

Do not alert on mem_fragmentation_ratio alone without context. Low absolute memory with a high ratio can be harmless. High RSS near host limit can be dangerous even if Redis used memory seems acceptable.
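The rule above can be sketched as a check that looks at the ratio and at absolute RSS pressure together. This is a minimal sketch, not a recommended alert: the class name, the 1.5 ratio threshold, and the 80% RSS fraction are hypothetical values chosen for illustration, and the memory limit is assumed to be the host or container limit you already know.

```java
/** Sketch: judge mem_fragmentation_ratio together with absolute RSS pressure. */
final class FragmentationCheck {

    // Hypothetical thresholds; tune them for your own environment.
    static final double RATIO_THRESHOLD = 1.5;
    static final double RSS_LIMIT_FRACTION = 0.80;

    /** used_memory_rss divided by used_memory, both in bytes. */
    static double fragmentationRatio(long usedMemoryRss, long usedMemory) {
        if (usedMemory <= 0) {
            throw new IllegalArgumentException("usedMemory must be positive");
        }
        return (double) usedMemoryRss / usedMemory;
    }

    /** Actionable only when the ratio is high AND RSS is close to the memory limit. */
    static boolean isActionable(long usedMemoryRss, long usedMemory, long memoryLimitBytes) {
        boolean highRatio = fragmentationRatio(usedMemoryRss, usedMemory) > RATIO_THRESHOLD;
        boolean rssPressure = usedMemoryRss > RSS_LIMIT_FRACTION * memoryLimitBytes;
        return highRatio && rssPressure;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long limit = 8192 * mib;
        // Small instance: ratio 3.0 but only 30 MiB resident.
        System.out.println(isActionable(30 * mib, 10 * mib, limit));
        // Large instance: ratio 1.75 with 7000 MiB resident under an 8 GiB limit.
        System.out.println(isActionable(7000 * mib, 4000 * mib, limit));
    }
}
```

A separate alert on RSS alone is still worth having, because high RSS near the limit is risky regardless of the ratio.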

6.2 Stats Fields to Watch

Useful fields:

total_commands_processed
instantaneous_ops_per_sec
total_net_input_bytes
total_net_output_bytes
keyspace_hits
keyspace_misses
expired_keys
evicted_keys
rejected_connections
sync_full
sync_partial_ok
sync_partial_err

Cache hit rate:

hit_rate = keyspace_hits / (keyspace_hits + keyspace_misses)

But hit rate must be interpreted by use case. A session store may need a very high hit rate. A negative cache may intentionally produce misses. A write-heavy queue may not use cache hits meaningfully.
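Because keyspace_hits and keyspace_misses are cumulative counters, a dashboard usually wants the hit rate over an interval rather than since server start. The following is a minimal sketch that applies the formula above to the difference between two INFO stats snapshots. The class and method names are illustrative, and fetching the INFO text from Redis is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

final class HitRate {

    /** Parses "field:value" lines from INFO output, skipping blanks and "# Section" headers. */
    static Map<String, String> parseInfo(String infoText) {
        Map<String, String> fields = new HashMap<>();
        for (String line : infoText.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue;
            }
            fields.put(line.substring(0, colon), line.substring(colon + 1).trim());
        }
        return fields;
    }

    /** Hit rate between two snapshots; empty when there were no lookups or the counters were reset. */
    static OptionalDouble intervalHitRate(Map<String, String> earlier, Map<String, String> later) {
        long hits = counter(later, "keyspace_hits") - counter(earlier, "keyspace_hits");
        long misses = counter(later, "keyspace_misses") - counter(earlier, "keyspace_misses");
        long lookups = hits + misses;
        // A negative delta means the counters restarted (server restart or CONFIG RESETSTAT).
        if (hits < 0 || misses < 0 || lookups == 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of((double) hits / lookups);
    }

    private static long counter(Map<String, String> info, String field) {
        return Long.parseLong(info.getOrDefault(field, "0"));
    }

    public static void main(String[] args) {
        String t0 = "# Stats\r\nkeyspace_hits:1000\r\nkeyspace_misses:200\r\n";
        String t1 = "# Stats\r\nkeyspace_hits:1900\r\nkeyspace_misses:300\r\n";
        System.out.println(intervalHitRate(parseInfo(t0), parseInfo(t1)));
    }
}
```

Returning an empty value instead of 0 keeps a quiet interval from looking like a total cache failure.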

6.3 Commandstats

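INFO commandstats shows, per command, the number of calls (calls), the cumulative time in microseconds (usec), and the average time per call (usec_per_call). The counters accumulate since server start or the last CONFIG RESETSTAT.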

Example:

redis-cli INFO commandstats

You may see fields like:

cmdstat_get:calls=1000000,usec=900000,usec_per_call=0.90
cmdstat_hgetall:calls=50000,usec=700000,usec_per_call=14.00
cmdstat_zrange:calls=20000,usec=600000,usec_per_call=30.00

Questions to ask:

Did command mix change after deploy?
Are expensive commands used more often than expected?
Are multi-element reads such as HGETALL, SMEMBERS, LRANGE, or wide ZRANGE dominating?
Are Lua scripts consuming disproportionate time?
Are deletes of large keys causing spikes?

Commandstats is especially useful when application metrics say “Redis is slow” but SLOWLOG shows nothing conclusive: many individually fast commands never cross the slowlog threshold, yet their volume can still dominate server time.
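To make the volume-driven case concrete, here is a minimal sketch that parses cmdstat_ lines like the sample above and ranks commands by total accumulated time. The class and record names are illustrative. In practice you would diff two snapshots first, as with any cumulative counter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CommandStats {

    /** One parsed cmdstat_* line. */
    record Stat(String command, long calls, long usec) {
        double usecPerCall() {
            return calls == 0 ? 0.0 : (double) usec / calls;
        }
    }

    /** Parses INFO commandstats text and returns entries sorted by total usec, highest first. */
    static List<Stat> parse(String commandstatsText) {
        List<Stat> stats = new ArrayList<>();
        for (String line : commandstatsText.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (!line.startsWith("cmdstat_") || colon < 0) {
                continue;
            }
            String command = line.substring("cmdstat_".length(), colon);
            long calls = 0;
            long usec = 0;
            for (String pair : line.substring(colon + 1).split(",")) {
                String[] kv = pair.split("=", 2);
                if (kv.length != 2) {
                    continue;
                }
                if (kv[0].equals("calls")) {
                    calls = Long.parseLong(kv[1].trim());
                } else if (kv[0].equals("usec")) {
                    usec = Long.parseLong(kv[1].trim());
                }
            }
            stats.add(new Stat(command, calls, usec));
        }
        stats.sort(Comparator.comparingLong(Stat::usec).reversed());
        return stats;
    }

    public static void main(String[] args) {
        String sample = String.join("\r\n",
                "cmdstat_get:calls=1000000,usec=900000,usec_per_call=0.90",
                "cmdstat_hgetall:calls=50000,usec=700000,usec_per_call=14.00",
                "cmdstat_zrange:calls=20000,usec=600000,usec_per_call=30.00");
        for (Stat s : parse(sample)) {
            System.out.println(s.command() + " total_usec=" + s.usec() + " calls=" + s.calls());
        }
    }
}
```

With the sample data, GET ranks first by total time even though it is the cheapest command per call, which is the kind of shift SLOWLOG alone would not show.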

7. SLOWLOG: Server-Side Command Execution Diagnosis

SLOWLOG records commands whose server-side execution time exceeds a configured threshold. It is not a full query log. It does not include time spent sending responses to the client.

Common commands:

redis-cli CONFIG GET slowlog-log-slower-than
redis-cli CONFIG GET slowlog-max-len
redis-cli SLOWLOG LEN
redis-cli SLOWLOG GET 10
redis-cli SLOWLOG RESET

Example configuration idea:

# microseconds; 10000 = 10ms
CONFIG SET slowlog-log-slower-than 10000
CONFIG SET slowlog-max-len 1024

Production guidance:

Environment	Starting threshold
low-latency cache	1ms–5ms
general app cache	5ms–10ms
search / JSON / vector workload	depends on expected query profile
batch/admin Redis	higher, but separate from online path

Do not blindly set the threshold to 0 in production: that logs every command, which adds noise and overhead. A negative value disables the slow log entirely.

7.1 What SLOWLOG Can Reveal

Typical findings:

KEYS * executed by admin script
HGETALL on a huge hash
SMEMBERS on a huge set
LRANGE 0 -1 on an unbounded list
ZRANGE/ZREVRANGE over large ranges
long-running Lua script
mass delete of huge key
expensive search query
many commands from a maintenance job

7.2 What SLOWLOG Cannot Reveal

SLOWLOG does not directly show:

Java connection pool wait time
client-side queueing
network round-trip latency
TLS overhead
deserialization cost
blocked application threads
retry backoff
cluster redirect delay
response payload transfer time

That is why Java client metrics are mandatory.

8. LATENCY: Event-Level Latency Diagnosis

The LATENCY command family is useful for diagnosing latency spikes inside Redis.

Common commands:

redis-cli LATENCY LATEST
redis-cli LATENCY HISTORY command
redis-cli LATENCY DOCTOR
redis-cli LATENCY GRAPH command

Enable threshold intentionally:

CONFIG SET latency-monitor-threshold 100

The threshold is in milliseconds. Choose it based on the workload.

LATENCY DOCTOR can provide human-readable hints, but do not treat it as a substitute for analysis. Use it as a starting point.

Events may involve:

command execution
fork operation
AOF fsync/rewrite
eviction
expiration
active defragmentation
module/query behavior

A good incident workflow compares:

Java observed latency
vs Redis SLOWLOG
vs Redis LATENCY events
vs host CPU/network/disk

If Java latency is high but Redis SLOWLOG/LATENCY are clean, suspect client/network/application path. If Redis LATENCY shows fork or AOF spikes, suspect persistence/disk/fork pressure. If SLOWLOG shows commands, inspect key size and command usage.
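The comparison above can be written down as a small first-pass classifier. This is only a sketch of the three rules in the paragraph: the enum names are illustrative, and the order in which the signals are checked is an assumption made here, not something Redis prescribes.

```java
/** Sketch: map the three signals from the workflow above to a first suspect. */
final class LatencyTriage {

    enum Suspect {
        CLIENT_OR_NETWORK_PATH,
        PERSISTENCE_OR_FORK,
        COMMAND_OR_KEY_SHAPE,
        INCONCLUSIVE
    }

    /**
     * @param appLatencyHigh         Java-observed Redis latency is above its budget
     * @param slowlogHasEntries      SLOWLOG contains recent entries
     * @param forkOrAofLatencyEvents LATENCY reports fork or AOF related spikes
     */
    static Suspect classify(boolean appLatencyHigh,
                            boolean slowlogHasEntries,
                            boolean forkOrAofLatencyEvents) {
        // Assumed precedence: server-side evidence is checked before blaming the client path.
        if (forkOrAofLatencyEvents) {
            return Suspect.PERSISTENCE_OR_FORK;
        }
        if (slowlogHasEntries) {
            return Suspect.COMMAND_OR_KEY_SHAPE;
        }
        if (appLatencyHigh) {
            return Suspect.CLIENT_OR_NETWORK_PATH;
        }
        return Suspect.INCONCLUSIVE;
    }

    public static void main(String[] args) {
        // High Java latency with clean server-side signals points at the client or network path.
        System.out.println(classify(true, false, false));
    }
}
```

The result is a starting hypothesis for the runbooks later in this part, not a verdict.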

9. MONITOR: Powerful, Dangerous, Rarely the First Tool

MONITOR streams every command processed by Redis. It can be useful in a controlled debugging session, but it is not a normal production observability mechanism.

Problems:

very high output volume
potential performance impact
sensitive data exposure
hard to aggregate meaningfully
encourages debugging by watching noise instead of metrics

Safer alternatives:

Need	Prefer
Find slow commands	`SLOWLOG`, `INFO commandstats`
Find workload shift	app metrics + commandstats
Find key pattern	sampled instrumentation at app layer
Find key expiry behavior	keyspace metrics / sampled SCAN / application events
Debug one staging issue	`MONITOR` in isolated environment

Production rule:

Do not use MONITOR on a busy production Redis unless you understand the blast radius and have a specific short-lived reason.

10. Java Client Observability

Redis server metrics cannot tell you how your Java caller experienced Redis. You need client-side instrumentation.

10.1 Minimum Java Client Metrics

For every Redis operation group, collect:

Metric	Why it matters
command latency	user-visible Redis dependency cost
timeout count	reliability and saturation signal
exception count by type	topology/network/auth/client bug diagnosis
retry count	hidden traffic amplification
pool wait time	blocking synchronous client pressure
active/idle connections	pool sizing and leaks
in-flight async commands	backpressure risk
payload size	serialization and network cost
cache hit/miss by use case	business correctness signal
operation cardinality	N+1 and batch opportunity

Avoid metric labels with raw keys or user IDs. Use low-cardinality operation names:

redis.operation = session.get
redis.operation = product-cache.mget
redis.operation = quota.consume
redis.operation = idempotency.claim
redis.operation = order-stream.ack

Bad labels:

key=user:123456:profile
email=customer@example.com
tenant=every-tenant-id-as-label

High-cardinality labels can break your metrics system.
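One way to keep labels bounded is to derive them from an allowlist instead of from the key itself. The sketch below assumes keys follow a namespace:rest convention and that the set of namespaces is known up front. The class name and the allowlist contents are hypothetical.

```java
import java.util.Set;

/** Sketch: turn a raw Redis key into a bounded, low-cardinality metric label. */
final class MetricLabels {

    // Hypothetical allowlist; anything outside it collapses into "other".
    private static final Set<String> KNOWN_NAMESPACES =
            Set.of("session", "product", "quota", "idempotency", "user");

    /** Returns the key's namespace prefix if it is allowlisted, otherwise "other". */
    static String namespaceLabel(String key) {
        int colon = key.indexOf(':');
        String prefix = colon < 0 ? key : key.substring(0, colon);
        return KNOWN_NAMESPACES.contains(prefix) ? prefix : "other";
    }

    public static void main(String[] args) {
        System.out.println(namespaceLabel("user:123456:profile"));
        System.out.println(namespaceLabel("customer@example.com"));
    }
}
```

The label can take at most one value per allowlisted namespace plus "other", so a new key pattern cannot grow the metric series count on its own.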

10.2 Operation Wrapper Pattern

Wrap Redis calls by business operation, not just by command.

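import io.lettuce.core.RedisCommandTimeoutException;
import io.lettuce.core.RedisException;
import io.lettuce.core.api.sync.RedisCommands;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.Optional;

// Wraps Redis calls per business operation so every metric carries a low-cardinality operation tag.
public final class ObservedRedisCache {
    private final RedisCommands<String, String> redis;
    private final MeterRegistry meters;

    public ObservedRedisCache(RedisCommands<String, String> redis, MeterRegistry meters) {
        this.redis = redis;
        this.meters = meters;
    }

    public Optional<String> getProductSnapshot(String productId) {
        String operation = "product-cache.get";
        // Start timing before the call so client-side wait is included in the measurement.
        Timer.Sample sample = Timer.start(meters);
        try {
            // The raw key stays out of metric tags; only the operation name is recorded.
            String key = "product:v3:{" + productId + "}:snapshot";
            String value = redis.get(key);

            // Hit/miss is counted per operation, not per key.
            meters.counter("redis.cache.lookup",
                    "operation", operation,
                    "result", value == null ? "miss" : "hit")
                    .increment();

            return Optional.ofNullable(value);
        } catch (RedisCommandTimeoutException ex) {
            // Timeouts get a fixed tag value so they are easy to alert on.
            meters.counter("redis.errors",
                    "operation", operation,
                    "type", "timeout")
                    .increment();
            throw ex;
        } catch (RedisException ex) {
            // Other client errors are tagged by exception class, which is a bounded set.
            meters.counter("redis.errors",
                    "operation", operation,
                    "type", ex.getClass().getSimpleName())
                    .increment();
            throw ex;
        } finally {
            // Latency is recorded for successes and failures alike.
            sample.stop(Timer.builder("redis.operation.latency")
                    .tag("operation", operation)
                    .register(meters));
        }
    }
}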

This wrapper gives you:

business operation name
latency distribution
hit/miss semantics
exception type
low-cardinality metrics
a place to add tracing spans

10.3 Trace Spans

For distributed tracing, create spans around logical Redis operations. Do not create a separate span for every command in high-volume batch workflows unless sampling and cardinality are controlled.

Good span attributes:

db.system=redis
redis.operation=quota.consume
redis.command_family=lua
redis.keyspace=quota
redis.batch_size=1
redis.cluster_slot_known=true

Avoid storing full keys or values in spans. Keys may contain PII, tenant IDs, tokens, or security-sensitive data.

10.4 CLIENT SETNAME

Set client names to identify application sources:

CLIENT SETNAME order-service-api-1
CLIENT LIST

In Java, configure client name where supported or execute it after connection initialization. This helps during connection storms:

name=order-service-api-1 addr=10.1.2.3:50123 cmd=get age=120 idle=0

Use names like:

<service>-<component>-<environment>
order-api-cache-prod
order-worker-stream-prod
pricing-engine-rate-limit-prod

Do not encode pod IDs as metric labels unless necessary. They are high-cardinality.
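A small helper can enforce the naming convention before the name is handed to the client configuration or to CLIENT SETNAME. This is a sketch with an illustrative class name. The character check reflects the fact that Redis rejects client names containing spaces or newlines; here anything outside printable, non-space ASCII is refused.

```java
/** Sketch: build a client name following <service>-<component>-<environment>. */
final class ClientNames {

    static String build(String service, String component, String environment) {
        for (String part : new String[] {service, component, environment}) {
            if (part == null || part.isEmpty()) {
                throw new IllegalArgumentException("client name parts must be non-empty");
            }
        }
        String name = String.join("-", service, component, environment);
        for (char c : name.toCharArray()) {
            // Accept only printable, non-space ASCII so the server will not reject the name.
            if (c < '!' || c > '~') {
                throw new IllegalArgumentException("invalid character in client name: " + name);
            }
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(build("order-api", "cache", "prod"));
    }
}
```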

11. Redis Dashboard Design

A dashboard should tell a story. Do not make a wall of graphs.

11.1 Top-Level Redis Health Dashboard

Recommended sections:

Availability
- client error rate
- client timeout rate
- Redis up/down
- rejected connections
Latency
- app-observed Redis p50/p95/p99
- server slowlog count
- latency events
- network RTT if available
Traffic
- ops/sec
- command mix
- input/output bandwidth
- request-to-command fanout
Memory
- used memory
- RSS
- fragmentation
- evictions
- expired keys
- key count
Correctness/Domain
- cache hit rate per cache
- idempotency claim conflict rate
- rate limiter allow/deny rate
- queue depth
- stream lag/pending entries
Persistence/Replication
- RDB/AOF status
- last save age
- fork time
- replication lag
- role/failover events
Host
- CPU
- memory pressure
- disk latency
- network throughput/errors
- container throttling

11.2 Dashboard Anti-Patterns

Anti-pattern	Why it fails
Only Redis server metrics	misses app/client/network pain
Only average latency	hides p99 tail latency
One global cache hit rate	hides broken critical cache
Raw key labels	cardinality explosion and sensitive data leak
No deployment markers	cannot correlate incidents with changes
No command mix	cannot detect N+1 or expensive command shift
No business metrics	cannot tell if Redis issue affects users

12. Alert Design

Alerts must be actionable. A Redis alert should imply a decision.

12.1 Good Alerts

Examples:

Redis client timeout rate > 1% for 5 minutes on order-api prod

Action:

check Redis latency dashboard
check pool wait/in-flight queue
check recent deploy
check slowlog and commandstats

Redis evicted_keys increasing for non-evictable workload

Action:

determine impacted keyspace
scale memory or reduce dataset
disable unsafe eviction for correctness-critical store

Stream pending entries growing for consumer group order-worker

Action:

inspect consumer health
claim idle entries
check poison messages

AOF rewrite failed or RDB last save age exceeds RPO budget

Action:

check disk space
check persistence config
initiate backup risk response
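The "for 5 minutes" qualifier in the first alert is what separates a sustained problem from a one-off spike. A minimal sketch of that evaluation follows, assuming the application exports per-minute command and timeout totals. The record and method names are illustrative, and real alerting systems express this in their own rule language.

```java
import java.util.List;

/** Sketch: fire only when the timeout rate stays above a threshold for a whole window. */
final class SustainedRateAlert {

    /** Per-minute totals as exported by the application. */
    record MinuteSample(long commands, long timeouts) {
        double timeoutRate() {
            return commands == 0 ? 0.0 : (double) timeouts / commands;
        }
    }

    /** True when each of the last windowMinutes samples breaches the threshold. */
    static boolean shouldFire(List<MinuteSample> samples, int windowMinutes, double threshold) {
        if (windowMinutes <= 0 || samples.size() < windowMinutes) {
            return false;
        }
        List<MinuteSample> window =
                samples.subList(samples.size() - windowMinutes, samples.size());
        return window.stream().allMatch(s -> s.timeoutRate() > threshold);
    }

    public static void main(String[] args) {
        MinuteSample healthy = new MinuteSample(10_000, 10);   // 0.1% timeouts
        MinuteSample degraded = new MinuteSample(10_000, 200); // 2% timeouts
        List<MinuteSample> spike = List.of(healthy, healthy, degraded, healthy, healthy);
        List<MinuteSample> sustained = List.of(degraded, degraded, degraded, degraded, degraded);
        System.out.println(shouldFire(spike, 5, 0.01));
        System.out.println(shouldFire(sustained, 5, 0.01));
    }
}
```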

12.2 Bad Alerts

Redis CPU > 70% once

Not always bad. Could be normal load. Needs context.

Cache hit rate < 90% globally

Too generic. Some caches are intentionally low-hit. Some critical caches need much higher.

Memory fragmentation ratio > 1.5

Can be misleading without absolute memory and RSS pressure.

12.3 Alert Severity Mapping

Severity	Redis example	User impact
Sev1	Redis unavailable for critical path	checkout/login/order down
Sev2	p99 Redis latency causing API SLO breach	degraded user experience
Sev3	memory growth will breach limit within hours	risk but not immediate impact
Sev4	slowlog has occasional admin command in off-peak	investigate during business hours

13. Runbook: High Redis Latency

Symptom:

API p99 increased and Redis appears in traces.

Compare the server-side evidence from the commands below with the Java-side checks that follow.

Commands:

redis-cli INFO stats
redis-cli INFO commandstats
redis-cli SLOWLOG GET 20
redis-cli LATENCY LATEST
redis-cli INFO clients
redis-cli INFO memory

Java checks:

Redis operation p95/p99 by business operation
timeout count by exception type
pool wait time
active/idle connection count
in-flight async command count
retry count
request fanout count
payload size distribution

Typical root causes:

Evidence	Likely cause	Fix direction
slowlog shows `HGETALL` huge hash	large key read	field-specific reads / split hash
slowlog shows `SMEMBERS`	unbounded set read	page/scan/index differently
commandstats `GET` calls doubled	N+1 or retry storm	batch with `MGET` / reduce retries
app latency high, slowlog clean	client/network/pool	inspect pool/RTT/in-flight queue
latency event fork	persistence snapshot/rewrite	memory headroom/disk tuning/schedule
CPU high + commandstats expensive	workload issue	command redesign / shard / limit range

14. Runbook: Memory Growth

Symptom:

used_memory increasing faster than forecast.

Check:

redis-cli INFO memory
redis-cli INFO keyspace
redis-cli INFO stats
redis-cli MEMORY STATS

Questions:

Did key count increase?
Did value size increase?
Did TTL disappear from new keys?
Did a deployment change serializer format?
Did a queue/stream stop being consumed?
Did cache misses cause more negative cache entries?
Did retention change for time series/search/vector data?
Is RSS growth fragmentation or actual dataset growth?

Sample keyspace carefully:

redis-cli --scan --pattern 'session:*' | head -100
redis-cli MEMORY USAGE some:key
redis-cli TTL some:key

Do not run KEYS * in production. Use SCAN carefully and rate-limit inspection scripts.

Memory incident classification:

Class	Description	Response
forecasted growth	expected traffic/data growth	scale or optimize
lifecycle leak	keys missing TTL or cleanup	fix writer and backfill cleanup
payload bloat	serializer/schema changed	reduce payload, compress, migrate
queue buildup	consumers down/slow	restore consumers, drain backlog
eviction storm	maxmemory reached	protect correctness, scale, tune policy
fragmentation	RSS high vs used memory	restart/defrag/headroom, depending on risk
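For the forecasted-growth class, a rough time-to-limit estimate helps decide how urgent the response is. The sketch below uses simple linear extrapolation from two used_memory samples, which is an assumption: real growth is rarely linear, so treat the result as an order-of-magnitude signal. The names are illustrative.

```java
import java.time.Duration;
import java.util.Optional;

/** Sketch: linear estimate of how long until used_memory reaches maxmemory. */
final class MemoryForecast {

    /**
     * @return empty when memory is not growing, the interval is not positive,
     *         or maxmemory is 0 (Redis treats 0 as "no limit")
     */
    static Optional<Duration> timeToLimit(long usedEarlier, long usedLater,
                                          Duration interval, long maxmemory) {
        long growth = usedLater - usedEarlier;
        long intervalMs = interval.toMillis();
        if (growth <= 0 || intervalMs <= 0 || maxmemory <= 0) {
            return Optional.empty();
        }
        long headroom = maxmemory - usedLater;
        if (headroom <= 0) {
            return Optional.of(Duration.ZERO);
        }
        // Remaining time = headroom divided by the observed growth rate.
        double remainingMs = (double) headroom * intervalMs / growth;
        return Optional.of(Duration.ofMillis(Math.round(remainingMs)));
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // Grew from 4 GiB to 4.5 GiB in one hour with a 6 GiB limit.
        System.out.println(timeToLimit(4 * gib, 4 * gib + gib / 2, Duration.ofHours(1), 6 * gib));
    }
}
```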

15. Runbook: Evictions Happening

Evictions are not inherently wrong. They are wrong when Redis contains data that must not disappear.

Check:

redis-cli INFO memory
redis-cli INFO stats | grep evicted_keys
redis-cli CONFIG GET maxmemory
redis-cli CONFIG GET maxmemory-policy

Decision table:

Workload	Eviction acceptable?	Policy direction
pure cache	yes	`allkeys-lru` / `allkeys-lfu` depending behavior
session store	usually no	prefer `noeviction`, scale memory
lock/idempotency	dangerous	prefer `noeviction` or separate Redis
queue/stream	dangerous	no eviction for durable-ish semantics
rate limiter	sometimes	depends on fail-open/fail-closed policy

If evictions happen unexpectedly:

identify workload sharing same Redis
split correctness-critical keys from cache keys
increase memory or reduce payload/cardinality
add TTL jitter for cache waves
tune maxmemory policy only after correctness review
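TTL jitter from the list above is simple to implement on the writer side. The following is a minimal sketch that adds a random extra of up to a given fraction of the base TTL, so keys written together do not all expire in the same instant. The class name and the additive-only jitter are choices made for this example.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: spread expirations by adding random jitter to a base TTL. */
final class TtlJitter {

    /** Returns baseTtl plus a uniformly random extra in [0, jitterFraction * baseTtl]. */
    static Duration withJitter(Duration baseTtl, double jitterFraction) {
        if (baseTtl.isZero() || baseTtl.isNegative()) {
            throw new IllegalArgumentException("baseTtl must be positive");
        }
        if (jitterFraction < 0.0 || jitterFraction > 1.0) {
            throw new IllegalArgumentException("jitterFraction must be in [0, 1]");
        }
        long baseMs = baseTtl.toMillis();
        long maxExtraMs = (long) (baseMs * jitterFraction);
        // nextLong's bound is exclusive, so add 1 to make maxExtraMs reachable.
        long extraMs = ThreadLocalRandom.current().nextLong(maxExtraMs + 1);
        return Duration.ofMillis(baseMs + extraMs);
    }

    public static void main(String[] args) {
        // A 10 minute TTL with up to 10% jitter lands between 10 and 11 minutes.
        System.out.println(withJitter(Duration.ofMinutes(10), 0.10));
    }
}
```

Adding jitter only upward keeps the minimum freshness guarantee intact; if staleness matters more than cache lifetime, subtracting jitter instead is an equally valid choice.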

16. Runbook: Connection Storm

Symptoms:

high connected clients
rejected connections
app timeouts
CPU increase
many reconnect logs

Check:

redis-cli INFO clients
redis-cli CLIENT LIST
redis-cli CONFIG GET maxclients
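During a connection storm the raw CLIENT LIST output is long, so grouping it by client name is often the fastest way to see which service is responsible. A minimal sketch, assuming the usual space-separated key=value line format and that client names were set as described in section 10.4. The class name and the "(unnamed)" bucket are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: count connections per client name from CLIENT LIST output. */
final class ClientListSummary {

    static Map<String, Integer> countByName(String clientListOutput) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : clientListOutput.split("\\r?\\n")) {
            if (line.isBlank()) {
                continue;
            }
            String name = "(unnamed)";
            for (String token : line.trim().split(" ")) {
                // Match the "name" field only; a token such as "lib-name=..." does not start with "name=".
                if (token.startsWith("name=") && token.length() > "name=".length()) {
                    name = token.substring("name=".length());
                }
            }
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String output = String.join("\n",
                "id=3 addr=10.1.2.3:50123 fd=8 name=order-api-cache-prod age=120 idle=0 cmd=get",
                "id=4 addr=10.1.2.4:50124 fd=9 name=order-api-cache-prod age=2 idle=0 cmd=get",
                "id=5 addr=10.1.2.5:50200 fd=10 name= age=3 idle=3 cmd=ping");
        System.out.println(countByName(output));
    }
}
```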

Root causes:

each request creates new Redis connection
pool not reused
autoscaling created too many pods
retry/reconnect loop
health checks open connections too frequently
client topology refresh bug/misconfig
connection leak

Java-specific fixes:

reuse client/connection factory
bound pool size
set connection timeout and command timeout
add jittered reconnect/backoff
avoid creating Redis client per request
separate blocking connections from normal commands
monitor active/idle/borrow wait

17. Runbook: Cache Hit Rate Collapse

Symptom:

Database load increased and Redis hit rate dropped.

Check:

hit/miss by cache name, not global only
deploy that changed key format/version
TTL distribution
invalidation traffic
eviction count
serialization errors
backend errors causing cache fill failures

Common causes:

Evidence	Likely cause
misses only for one service version	key version mismatch
misses after deploy	serializer/key namespace change
misses plus evictions	memory pressure
misses plus backend errors	cache fill failing
synchronized misses	TTL avalanche
negative cache missing	cache penetration

Correct response depends on cause. Do not simply increase TTL. That can hide bugs and increase staleness.

18. Runbook: Stream/Queue Stuck

For Redis Streams:

redis-cli XINFO STREAM order-events
redis-cli XINFO GROUPS order-events
redis-cli XPENDING order-events order-consumers
redis-cli XPENDING order-events order-consumers - + 10

Questions:

Is producer still writing?
Are consumers connected?
Is pending count increasing?
Are the same messages retrying?
Is one consumer owning too many pending entries?
Is trimming deleting data too early?
Is a poison message blocking progress?

Fix direction:

restart failed consumers
use XAUTOCLAIM for idle pending entries
move poison messages to DLQ
reduce batch size if processing is too slow
add consumer-level metrics
review ack placement
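The extended form of XPENDING returns, for each pending entry, its ID, the owning consumer, the idle time in milliseconds, and the delivery count. Once those rows are mapped into objects, two of the questions above become simple computations. This is a sketch with illustrative names; mapping the client library's reply into PendingEntry is left out, and the delivery-count cutoff is a hypothetical parameter.

```java
import java.util.List;

/** Sketch: derive alerting signals from XPENDING detail rows. */
final class PendingAnalysis {

    /** One row of the extended XPENDING reply. */
    record PendingEntry(String id, String consumer, long idleMillis, long deliveryCount) {}

    /** Longest idle time among pending entries, or 0 when nothing is pending. */
    static long oldestIdleMillis(List<PendingEntry> entries) {
        return entries.stream().mapToLong(PendingEntry::idleMillis).max().orElse(0L);
    }

    /** Entries delivered more than maxDeliveries times: candidates for a DLQ. */
    static List<PendingEntry> poisonCandidates(List<PendingEntry> entries, long maxDeliveries) {
        return entries.stream()
                .filter(e -> e.deliveryCount() > maxDeliveries)
                .toList();
    }

    public static void main(String[] args) {
        List<PendingEntry> pending = List.of(
                new PendingEntry("1700000000000-0", "worker-1", 1_200, 1),
                new PendingEntry("1700000000001-0", "worker-2", 95_000, 7));
        System.out.println(oldestIdleMillis(pending));
        System.out.println(poisonCandidates(pending, 5));
    }
}
```

Idle time is measured since the last delivery, so it resets when an entry is claimed or redelivered. It approximates "oldest pending age" but is not the same as time since the entry was added.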

For list/sorted-set queues:

inspect ready queue depth
inspect processing queue depth
inspect delayed zset min score
inspect retry counters
inspect worker heartbeats

19. Runbook: Replication Lag or Stale Reads

Symptoms:

reads from replica return old data
failover causes unexpected data loss
replica offset lag increases

Check:

redis-cli INFO replication

Important fields:

role
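connected_slaves (open-source Redis still reports the field under this name)
master_repl_offset
slave_repl_offset (reported on replicas; on the master, each slaveN line carries that replica's offset and lag)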
master_link_status
repl_backlog_active

Application questions:

Are correctness-sensitive reads routed to replica?
Does application expect read-your-write?
Is WAIT used anywhere, and with what timeout?
Is failover tested under write load?

Design rule:

Replica reads are a latency/scaling optimization, not a default correctness mechanism.

Use replica reads only when stale reads are acceptable or explicitly handled.
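To judge whether stale reads are within budget, it helps to turn INFO replication into a number. The sketch below runs against the master's output and computes, per replica, how many bytes of the replication stream it is behind: master_repl_offset minus the offset in each slaveN line. The class name is illustrative, and the byte lag is a proxy for staleness, not a time measurement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: per-replica byte lag from the master's INFO replication output. */
final class ReplicationLag {

    static Map<String, Long> lagBytesByReplica(String infoReplication) {
        long masterOffset = -1;
        Map<String, Long> replicaOffsets = new LinkedHashMap<>();
        for (String line : infoReplication.split("\\r?\\n")) {
            if (line.startsWith("master_repl_offset:")) {
                masterOffset = Long.parseLong(
                        line.substring("master_repl_offset:".length()).trim());
            } else if (line.matches("slave\\d+:.*")) {
                // Example: slave0:ip=10.0.0.2,port=6379,state=online,offset=900,lag=1
                int colon = line.indexOf(':');
                String replicaId = line.substring(0, colon);
                for (String pair : line.substring(colon + 1).split(",")) {
                    String[] kv = pair.split("=", 2);
                    if (kv.length == 2 && kv[0].equals("offset")) {
                        replicaOffsets.put(replicaId, Long.parseLong(kv[1].trim()));
                    }
                }
            }
        }
        Map<String, Long> lag = new LinkedHashMap<>();
        if (masterOffset < 0) {
            return lag; // not a master, or the field was missing
        }
        for (Map.Entry<String, Long> e : replicaOffsets.entrySet()) {
            lag.put(e.getKey(), masterOffset - e.getValue());
        }
        return lag;
    }

    public static void main(String[] args) {
        String info = String.join("\r\n",
                "role:master",
                "connected_slaves:1",
                "slave0:ip=10.0.0.2,port=6379,state=online,offset=900,lag=1",
                "master_repl_offset:1000");
        System.out.println(lagBytesByReplica(info));
    }
}
```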

20. Debugging Keyspace Shape

Redis keyspace shape affects latency and memory. You need to understand:

number of keys by namespace
TTL distribution
value size distribution
large keys
hot keys
per-tenant cardinality
cluster slot distribution

20.1 Safe Sampling Script Idea

Use SCAN, not KEYS. Rate-limit it. Run from admin environment. Do not export sensitive key names unnecessarily.

Pseudo-flow:

for key in SCAN pattern namespace:* count 100:
    sample ttl
    sample memory usage
    infer namespace
    aggregate counts and percentiles

Output example:

Namespace	Key count	p50 TTL	p99 memory	Large key count
`session`	12M	21m	3KB	20
`product-cache`	2M	5h	120KB	4,200
`quota`	80M	58s	120B	0
`stream`	48	n/a	200MB	48

This data drives capacity planning. Without it, Redis scaling decisions are guesses.
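The aggregation step of the pseudo-flow can be sketched in plain Java once the samples have been collected. The example below groups sampled keys by namespace and reports a count, a p99 memory size, and a large-key count. It assumes a namespace:rest key convention and covers memory only; TTL percentiles would follow the same pattern. All names and the large-key threshold are illustrative, and the SCAN and MEMORY USAGE calls that produce the samples are not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: summarize sampled keys per namespace. */
final class KeyspaceSampleReport {

    /** One sampled key with its MEMORY USAGE estimate in bytes. */
    record Sample(String key, long memoryBytes) {}

    record Summary(int sampledKeys, long p99MemoryBytes, int largeKeys) {}

    static String namespaceOf(String key) {
        int colon = key.indexOf(':');
        return colon < 0 ? "(none)" : key.substring(0, colon);
    }

    /** Nearest-rank percentile over a non-empty list sorted in ascending order. */
    static long percentile(List<Long> sortedAscending, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedAscending.size());
        return sortedAscending.get(Math.max(rank, 1) - 1);
    }

    static Map<String, Summary> summarize(List<Sample> samples, long largeKeyBytes) {
        Map<String, List<Long>> sizesByNamespace = new TreeMap<>();
        for (Sample s : samples) {
            sizesByNamespace
                    .computeIfAbsent(namespaceOf(s.key()), k -> new ArrayList<>())
                    .add(s.memoryBytes());
        }
        Map<String, Summary> report = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : sizesByNamespace.entrySet()) {
            List<Long> sizes = e.getValue();
            Collections.sort(sizes);
            int large = (int) sizes.stream().filter(b -> b >= largeKeyBytes).count();
            report.put(e.getKey(), new Summary(sizes.size(), percentile(sizes, 99), large));
        }
        return report;
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
                new Sample("session:a", 100),
                new Sample("session:b", 200),
                new Sample("session:c", 5_000),
                new Sample("quota:x", 120));
        System.out.println(summarize(samples, 1_024));
    }
}
```

Only namespace-level aggregates leave the function, which fits the advice not to export sensitive key names.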

21. Observability for Redis Patterns

Each Redis pattern needs specific metrics.

21.1 Cache Metrics

Per cache:

hit count
miss count
load success/failure
load latency
stale serve count
refresh lock contention
negative cache hit/miss
invalidation count
evicted-key impact estimate

21.2 Rate Limiter Metrics

Per limiter, low-cardinality dimension:

allow count
deny count
Redis error count
fail-open/fail-closed count
Lua latency
quota key cardinality estimate
top noisy dimensions, sampled not labeled fully

21.3 Idempotency Metrics

claim success
duplicate replay
in-progress conflict
expired in-progress recovery
fingerprint mismatch
result replay failure
Redis timeout during claim/finalize

21.4 Lock Metrics

acquire success/failure
acquire wait time
lease renewal success/failure
lock hold duration
release mismatch
fencing token monotonicity failure
critical section timeout

21.5 Streams Metrics

stream length
add rate
read rate
ack rate
pending entries
oldest pending age
consumer count
DLQ count
trim count

21.6 Search/JSON/Vector Metrics

query count by query type
query latency
result count
timeout/error count
index size
index update rate
payload size
stale index detection metric if source of truth is elsewhere

22. Incident Review Template

After a Redis incident, write a review that answers:

# Redis Incident Review

## User impact
- What user journey was affected?
- Duration?
- Error rate / latency impact?

## Timeline
- First signal
- Detection
- Mitigation
- Recovery

## Redis symptom
- Latency?
- Memory?
- Eviction?
- Replication?
- Persistence?
- Connection?
- Command mix?

## Root cause
- Application behavior?
- Redis config?
- Data model?
- Infrastructure?
- Operational process?

## Why existing controls failed
- Missing metric?
- Missing alert?
- Bad threshold?
- Missing load test?
- Missing runbook?

## Permanent fixes
- Code change
- Config change
- Capacity change
- Dashboard/alert change
- Test/change management change

## Follow-up validation
- How do we know it is fixed?

The best Redis teams treat incidents as training data for better design.

23. Minimal Production Checklist

Before launching Redis-backed functionality:

24. Deliberate Practice

Exercise 1 — Build an Observability Baseline

For one Redis-backed feature, define:

operation name
commands used
expected QPS
expected payload size
expected latency budget
hit/miss expectation
failure policy
dashboard panels
alerts

Exercise 2 — Diagnose a Synthetic Latency Incident

In staging:

create a large hash
call HGETALL repeatedly
observe Java latency
observe SLOWLOG
observe commandstats
replace with field reads or pagination
compare metrics

Exercise 3 — Simulate a Cache Hit Collapse

deploy key version v1
warm cache
deploy code using v2
observe miss spike
add dashboard by cache namespace
document rollback/migration strategy

Exercise 4 — Stream Pending Entries

create stream consumer group
read without ack
inspect XPENDING
use XAUTOCLAIM
define oldest-pending-age alert

25. Mental Compression

A top-tier Redis engineer thinks like this:

Redis latency is not one number. It is a chain from Java request fanout, client connection behavior, network round trip, server command execution, key/value shape, persistence side effects, replication topology, and host resource pressure.

When debugging:

start from user impact
compare app-observed latency with server-side execution
inspect command mix and keyspace shape
check memory, eviction, persistence, replication
correlate with deploys and traffic changes
mitigate safely
convert finding into metric, alert, test, or design change

26. References

Redis observability tutorial: https://redis.io/tutorials/operate/redis-at-scale/observability/
Redis latency diagnosis: https://redis.io/docs/latest/operate/oss_and_stack/management/optimization/latency/
Redis SLOWLOG command: https://redis.io/docs/latest/commands/slowlog/
Redis INFO command: https://redis.io/docs/latest/commands/info/
Redis CLIENT command family: https://redis.io/docs/latest/commands/client-list/
Redis memory optimization and eviction references: https://redis.io/docs/latest/develop/reference/eviction/
Lettuce reference: https://redis.github.io/lettuce/
Micrometer documentation: https://docs.micrometer.io/

27. Next Part

Part 028 moves from observing Redis to protecting data:

Persistence and Durability: RDB, AOF, Hybrid, Backup, and Restore.
