Real-Time Features: Presence, WebSocket Fanout, and Notifications
Learn Java Redis In Action - Part 020
Production real-time features with Redis and Java: presence, WebSocket fanout, notifications, Pub/Sub, Streams, keyspace notifications, sharded channels, local connection registries, durability boundaries, and failure-aware delivery design.
Part 019 covered Redis-backed work queues, delayed jobs, retry, and worker pipelines. Now we move to a different class of runtime feature:
Real-time user experience built on Redis.
This includes:
- online/offline presence
- last-seen tracking
- multi-device sessions
- WebSocket gateway fanout
- room/channel membership
- live notifications
- unread counters
- ephemeral signals
- durable notification inboxes
- reconnect recovery
Redis is commonly used here because it is fast, simple, and well suited to shared runtime state. But real-time systems are easy to misdesign. The most common mistake is treating all real-time messages as equally durable. They are not.
The central distinction:
Presence and fanout are often ephemeral signals. Notification history and business state are durable records.
Redis can support both, but the design must separate them.
1. Kaufman Skill Decomposition
The skill is not “publish a message”. The real skill is:
Design a real-time delivery system where ephemeral connection state, durable notification state, fanout routing, reconnect behavior, and user-visible consistency are explicitly modeled.
Breakdown:
| Sub-skill | What you must be able to do |
|---|---|
| Delivery classification | Decide which messages may be lost and which require durable recovery |
| Presence modeling | Represent user, device, connection, heartbeat, expiry, and last-seen |
| Gateway routing | Route messages to the right WebSocket nodes and local connections |
| Fanout design | Choose direct channels, room channels, sharded Pub/Sub, Streams, or queues |
| Notification durability | Separate live push from notification inbox and unread state |
| Reconnect recovery | Allow clients to catch up after disconnect |
| Cluster behavior | Understand normal Pub/Sub vs sharded Pub/Sub and channel hot spots |
| Expiry handling | Use TTL/keyspace notifications as hints, not correctness sources |
| Backpressure | Protect gateways, Redis, and clients from fanout storms |
| Observability | Measure connected users, sessions, publish latency, dropped signals, and delivery lag |
Kaufman practice goal:
In 20 hours, build a small Java WebSocket gateway backed by Redis presence state, Pub/Sub fanout, and a durable notification inbox. Then test disconnects, node restarts, Redis restart, duplicate sessions, mobile reconnect, and offline catch-up.
2. Mental Model: Signal vs State
Real-time architecture has two different things:
- State — what must remain true after failures.
- Signal — what helps systems react quickly when online.
Example:
| Feature | Durable state? | Ephemeral signal? |
|---|---|---|
| User has 5 unread notifications | Yes | Optional |
| User is currently typing | No | Yes |
| User is online | Soft state | Yes |
| A payment was approved | Yes | Optional live push |
| A chat message exists | Yes | Live push is signal |
| WebSocket connection exists on node A | Soft state | Yes |
| User joined room | Often durable or soft, depending on the feature | Yes |
The dangerous design is to store critical facts only in Pub/Sub.
Bad:
Payment service -> PUBLISH user:123 payment-approved
WebSocket node -> sends to browser
If the user is offline or the WebSocket node disconnects, the message is gone.
Better:
Payment service -> durable notification row / stream / outbox
Payment service -> PUBLISH notification hint
WebSocket node -> sends if user online
Client reconnect -> fetch durable inbox after last seen notification id
Redis Pub/Sub is a signal bus. It is not a durable notification store.
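The "better" flow above can be sketched in Java. This is a minimal illustration under stated assumptions, not a definitive implementation: `NotificationStore`, `HintBus`, and `NotificationPublisher` are hypothetical names, and the two interfaces stand in for a real durable store and a real Redis PUBLISH call.

```java
// Hypothetical abstraction over the durable store (database row, outbox, or stream).
interface NotificationStore {
    String save(String userId, String body); // returns the durable notification id
}

// Hypothetical abstraction over the signal bus, e.g. a thin wrapper around Redis PUBLISH.
interface HintBus {
    void publish(String channel, String payload);
}

final class NotificationPublisher {
    private final NotificationStore store;
    private final HintBus bus;

    NotificationPublisher(NotificationStore store, HintBus bus) {
        this.store = store;
        this.bus = bus;
    }

    // Writes the durable record first, then sends a best-effort live hint.
    String notifyUser(String userId, String body) {
        // A failure here propagates to the caller: the fact must be stored.
        String notificationId = store.save(userId, body);
        try {
            bus.publish("realtime:user:" + userId, notificationId);
        } catch (RuntimeException e) {
            // The hint is only a signal. Losing it is acceptable because the
            // client can fetch the durable record after it reconnects.
        }
        return notificationId;
    }
}
```

The ordering is the point: the durable write may fail loudly, while the hint may fail silently.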
3. Reference Architecture
Main components:
| Component | Responsibility |
|---|---|
| WebSocket gateway | Owns client TCP/WebSocket connections |
| Local connection registry | Maps user/session to local channel objects inside one JVM |
| Redis presence store | Shared soft state: users, devices, node ownership, last heartbeat |
| Pub/Sub bus | Fast fanout signal across gateway nodes |
| Durable notification store | Source of truth for inbox, read state, audit state |
| Redis Stream/queue | Optional durable-ish event pipeline for gateway delivery/retry |
| Client recovery API | Fetch missed durable notifications after reconnect |
The gateway owns live sockets. Redis helps gateways discover and signal each other.
4. Presence Data Model
Presence is soft state. It must expire automatically if a gateway dies.
Model presence at three levels:
user -> device/session -> connection
Suggested keys:
presence:user:{userId}:sessions set of sessionId
presence:session:{sessionId} hash metadata
presence:node:{nodeId}:sessions set of sessionId
presence:last-seen zset userId -> lastSeenEpochMs
presence:online-users zset userId -> lastHeartbeatEpochMs
Session hash:
HSET presence:session:{sessionId}
  userId user_123
  nodeId ws-node-07
  deviceId device_abc
  connectedAt 1782972000000
  lastHeartbeatAt 1782972060000
  clientVersion 4.13.0
  ipHash sha256:...
  userAgentHash sha256:...
EXPIRE presence:session:{sessionId} 90
Heartbeat update:
HSET presence:session:{sessionId} lastHeartbeatAt <now>
EXPIRE presence:session:{sessionId} 90
ZADD presence:online-users <now> user_123
ZADD presence:last-seen <now> user_123
SADD presence:user:{userId}:sessions <sessionId>
SADD presence:node:{nodeId}:sessions <sessionId>
EXPIRE presence:user:{userId}:sessions 120
EXPIRE presence:node:{nodeId}:sessions 120
For cluster-safe multi-key atomic scripts, use a hash tag around the user or session partition. Do not accidentally force all presence keys into one slot unless that is intentional.
5. Online/Offline Is a Derived State
Do not model online as a permanent boolean. Model it as recent activity.
A user is online if:
ZSCORE presence:online-users user_123 >= now - onlineThresholdMs
Example threshold:
heartbeat interval: 30s
session TTL: 90s
online threshold: 75s
offline hysteresis: 120s
Why hysteresis matters:
- mobile networks drop temporarily
- browser tabs pause timers
- gateways restart
- load balancers rebalance connections
- clients reconnect quickly
Without hysteresis, users flicker online/offline.
User-visible presence should often be eventually consistent. A few seconds of delay is better than flicker.
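The threshold and hysteresis rules can be sketched as a small pure function. `PresenceEvaluator` is a hypothetical helper, and the idea of keeping the previous status in the gap between the two thresholds is one reasonable interpretation of hysteresis, not the only one.

```java
// Hypothetical helper: derives a user-visible status from the last heartbeat time.
final class PresenceEvaluator {
    enum Status { ONLINE, OFFLINE }

    private final long onlineThresholdMs;
    private final long offlineHysteresisMs;

    PresenceEvaluator(long onlineThresholdMs, long offlineHysteresisMs) {
        if (offlineHysteresisMs < onlineThresholdMs) {
            throw new IllegalArgumentException("hysteresis must be >= online threshold");
        }
        this.onlineThresholdMs = onlineThresholdMs;
        this.offlineHysteresisMs = offlineHysteresisMs;
    }

    Status evaluate(Status previous, long lastHeartbeatMs, long nowMs) {
        long age = nowMs - lastHeartbeatMs;
        if (age <= onlineThresholdMs) {
            return Status.ONLINE;   // recent heartbeat
        }
        if (age > offlineHysteresisMs) {
            return Status.OFFLINE;  // silent for longer than the hysteresis window
        }
        return previous;            // in between: keep what the user last saw, no flicker
    }
}
```

With the example values above (75s threshold, 120s hysteresis), a user whose last heartbeat is 100s old stays online if they were online, and stays offline if they were offline.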
6. Connection Registry in Java
Redis should not store actual WebSocket objects. The JVM process owns those.
Inside each gateway:
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public final class LocalConnectionRegistry {
    private final ConcurrentMap<String, Set<WebSocketConnection>> byUserId = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, WebSocketConnection> bySessionId = new ConcurrentHashMap<>();

    public void register(String userId, String sessionId, WebSocketConnection connection) {
        bySessionId.put(sessionId, connection);
        // compute() adds under the map's per-key lock, so a concurrent unregister
        // cannot drop the set between the lookup and the add.
        byUserId.compute(userId, (ignored, connections) -> {
            Set<WebSocketConnection> target = connections != null ? connections : ConcurrentHashMap.newKeySet();
            target.add(connection);
            return target;
        });
    }

    public void unregister(String userId, String sessionId) {
        WebSocketConnection connection = bySessionId.remove(sessionId);
        if (connection == null) {
            return; // already unregistered
        }
        // Returning null removes the user's entry once the last connection is gone.
        byUserId.computeIfPresent(userId, (ignored, connections) -> {
            connections.remove(connection);
            return connections.isEmpty() ? null : connections;
        });
    }

    // Returns how many local connections accepted the message.
    public int sendToUser(String userId, RealtimeMessage message) {
        int delivered = 0;
        for (WebSocketConnection connection : byUserId.getOrDefault(userId, Set.of())) {
            if (connection.trySend(message)) {
                delivered++;
            }
        }
        return delivered;
    }
}
Design principles:
- local registry must be thread-safe
- sending must not block event-loop threads
- slow clients need bounded buffers
- closing a socket must clean local and Redis state
- gateway crash cleanup relies on TTL
- multiple sessions per user must be supported
7. Presence Connect and Disconnect Flow
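Connect: authenticate the client, register the connection in the local registry, write the session hash with a TTL, and add the session to the user and node indexes.
Disconnect: unregister the local connection, delete the session key, remove the session from the user and node indexes, and update last-seen.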
Disconnect cleanup can be best-effort because TTL is the safety net.
8. Keyspace Notifications: Hint, Not Source of Truth
Redis keyspace notifications can publish events when keys expire or are modified. They are useful for presence cleanup hints.
Example idea:
subscribe to __keyevent@0__:expired
if key matches presence:session:*:
  schedule cleanup / recompute user presence
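But do not depend on keyspace notifications for correctness. They are disabled by default (enable them with notify-keyspace-events) and are delivered over Pub/Sub, so if the subscriber is down, the event is missed. An expired event also fires only when Redis actually deletes the key, which can lag behind the moment the TTL reaches zero.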
Correct model:
- TTL expiration removes stale session keys
- periodic reconciliation repairs indexes
- keyspace notification only accelerates reaction
Periodic reconciliation example:
for sessionId in presence:node:{nodeId}:sessions:
  if EXISTS presence:session:{sessionId} == 0:
    SREM presence:node:{nodeId}:sessions sessionId
For user presence:
for sessionId in presence:user:{userId}:sessions:
  if EXISTS presence:session:{sessionId} == 0:
    SREM presence:user:{userId}:sessions sessionId
if SCARD presence:user:{userId}:sessions == 0:
  consider user offline after hysteresis
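The reconciliation loop can be sketched in Java without committing to a Redis client. In this sketch the `Set` stands in for the index set (SMEMBERS/SREM) and the `Predicate` stands in for an EXISTS check on the session key. Both are assumptions for illustration, and `PresenceReconciler` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

final class PresenceReconciler {
    // Removes index entries whose session key no longer exists.
    // Returns the number of stale entries removed (useful as a drift metric).
    static int removeStale(Set<String> sessionIndex, Predicate<String> sessionKeyExists) {
        int removed = 0;
        // Iterate over a copy so the index can be modified while scanning.
        for (String sessionId : new HashSet<>(sessionIndex)) {
            if (!sessionKeyExists.test(sessionId)) {
                sessionIndex.remove(sessionId);
                removed++;
            }
        }
        return removed;
    }
}
```

The returned count maps naturally onto a counter such as presence_reconciliation_removed_total.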
9. Pub/Sub Fanout Model
Redis Pub/Sub is useful for sending live messages to gateway nodes.
Basic model:
Application service publishes:
PUBLISH realtime:user:user_123 <message>
All gateway nodes subscribed to relevant channels receive signal.
Only nodes with local connections for user_123 send to sockets.
This is simple but can waste work if every gateway receives every user message.
10. Routing-Aware Fanout
To reduce broadcast waste, track which node owns which sessions.
Presence session hash includes:
nodeId ws-node-07
Application can route to node channel:
PUBLISH realtime:node:ws-node-07 <message for user_123>
But this introduces lookup complexity:
SMEMBERS presence:user:{userId}:sessions
HGET presence:session:{sessionId} nodeId
PUBLISH realtime:node:{nodeId} <message>
This is better for high-volume direct messages. It is worse for simple low-volume systems.
Design options:
| Option | Pros | Cons |
|---|---|---|
| Broadcast to all gateways | Simple | Wasteful at scale |
| Route by node ID | Efficient | Requires presence lookup |
| Route by shard/channel | Balanced | Requires consistent routing |
| Use Streams per shard | Durable-ish | More operational complexity |
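The "route by node ID" option can be sketched as a lookup that collapses a user's sessions into distinct node channels. `NodeRouter` is a hypothetical name, and the lookup function stands in for `HGET presence:session:{sessionId} nodeId`. An empty result models an expired session hash.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

final class NodeRouter {
    // Resolves the distinct node channels that currently own a user's sessions.
    static Set<String> targetChannels(Collection<String> sessionIds,
                                      Function<String, Optional<String>> nodeIdOfSession) {
        Set<String> channels = new TreeSet<>(); // de-duplicates: one publish per node
        for (String sessionId : sessionIds) {
            // An empty Optional means the session hash has expired; skip it.
            nodeIdOfSession.apply(sessionId)
                    .ifPresent(nodeId -> channels.add("realtime:node:" + nodeId));
        }
        return channels;
    }
}
```

Two sessions on the same gateway produce one publish, which is where the saving over broadcast comes from.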
11. Redis Cluster and Sharded Pub/Sub
In Redis Cluster, normal Pub/Sub can become expensive because every published message is propagated to all nodes over the cluster bus, regardless of where the subscribers are connected.
Redis 7 introduced sharded Pub/Sub commands such as SPUBLISH and SSUBSCRIBE, where channels are assigned to hash slots.
Use sharded Pub/Sub when:
- running Redis Cluster
- Pub/Sub throughput is high
- channels can be distributed by shard key
- consumers can subscribe to shard channels intentionally
Channel examples:
realtime:user:{user_123}
realtime:room:{room_456}
realtime:node:{ws-node-07}
realtime:shard:{17}
Hash tags let you control slot placement. But do not put all channels under one hash tag unless you want one hot shard.
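To see how a hash tag decides placement, the slot calculation can be reproduced locally. The sketch below follows the Redis Cluster rule: CRC16 (XMODEM variant) of the key modulo 16384, hashing only the text between the first `{` and the next `}` when that text is non-empty. `ClusterSlots` is a hypothetical helper for experiments; in production, let the client library compute slots.

```java
import java.nio.charset.StandardCharsets;

final class ClusterSlots {
    // CRC16, XMODEM variant: polynomial 0x1021, initial value 0.
    static int crc16(byte[] data, int from, int to) {
        int crc = 0;
        for (int i = from; i < to; i++) {
            crc ^= (data[i] & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Slot for a key or sharded channel name, honoring a non-empty hash tag.
    static int slot(String keyOrChannel) {
        byte[] bytes = keyOrChannel.getBytes(StandardCharsets.UTF_8);
        int open = indexOf(bytes, (byte) '{', 0);
        if (open >= 0) {
            int close = indexOf(bytes, (byte) '}', open + 1);
            if (close > open + 1) { // tag present and not empty
                return crc16(bytes, open + 1, close) & 0x3FFF;
            }
        }
        return crc16(bytes, 0, bytes.length) & 0x3FFF;
    }

    private static int indexOf(byte[] bytes, byte target, int from) {
        for (int i = from; i < bytes.length; i++) {
            if (bytes[i] == target) {
                return i;
            }
        }
        return -1;
    }
}
```

With this rule, `realtime:user:{user_123}` and `presence:user:{user_123}:sessions` land in the same slot because only `user_123` is hashed.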
12. Room and Channel Membership
For chat rooms, collaboration spaces, live dashboards, or case rooms, you need membership.
Keys:
room:{roomId}:members set userId
room:{roomId}:sessions set sessionId
presence:user:{userId}:rooms set roomId
For soft membership based on active sockets:
room:{roomId}:active-sessions set sessionId with TTL-backed session keys
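Message flow:
write message to durable store
publish hint to room channel
gateways send to connected room members
reconnecting clients fetch messages after their last seen id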
Again, Pub/Sub is live delivery. The durable message store is recovery.
13. Notification Architecture
A notification has two lives:
- Durable notification in an inbox.
- Live push signal to connected devices.
Do not merge them into one Pub/Sub event.
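Recommended flow:
Domain event -> durable notification record -> PUBLISH hint -> WebSocket push if online
Client reconnect -> fetch inbox after last notification id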
Durable store can be:
- PostgreSQL notification table
- Cassandra/DynamoDB style inbox
- Redis Stream with retention if requirements allow
- hybrid: DB for source of truth, Redis for unread count and live hint
For important notifications, prefer a database/outbox as source of truth.
14. Notification Redis Key Model
Redis can accelerate notification state:
notif:unread:{userId} string counter
notif:recent:{userId} list or sorted set of recent notification ids
notif:delivery:{notificationId} hash delivery metadata
notif:seen:{userId} zset notificationId -> seenEpochMs
Use Redis for:
- unread count cache
- recent notification cache
- live delivery dedupe
- delivery attempt telemetry
- online push routing
Use durable store for:
- notification source of truth
- read/unread authoritative state
- audit history
- compliance retention
- cross-device recovery
Unread Count Pattern
On notification create:
INCR notif:unread:{userId}
LPUSH notif:recent:{userId} notificationId
LTRIM notif:recent:{userId} 0 99
PUBLISH realtime:user:{userId} <notification-created>
On mark read:
DECRBY notif:unread:{userId} <count-read>
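But the authoritative mark-read must be written to the durable store if correctness matters. The Redis counter is a cache: it can drift (for example, go negative after a duplicate mark-read) and must be rebuildable from the durable unread count.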
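The counter rules can be modeled in memory to make the intended behavior explicit. This sketch is not Redis code: `UnreadCounterCache` is a hypothetical class where `onNotificationCreated` plays the role of INCR, `onMarkRead` the role of DECRBY with an added clamp at zero, and `rebuild` the role of overwriting the cache from the durable store. Clamping atomically in Redis itself would need a script or transaction, which is left out here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class UnreadCounterCache {
    private final ConcurrentMap<String, Long> unreadByUser = new ConcurrentHashMap<>();

    // Equivalent in spirit to INCR notif:unread:{userId}.
    long onNotificationCreated(String userId) {
        return unreadByUser.merge(userId, 1L, Long::sum);
    }

    // Like DECRBY, but never lets the cached count go below zero.
    long onMarkRead(String userId, long countRead) {
        return unreadByUser.compute(userId, (id, current) ->
                Math.max(0L, (current == null ? 0L : current) - countRead));
    }

    // Overwrites the cache with the authoritative count from the durable store.
    void rebuild(String userId, long authoritativeUnread) {
        unreadByUser.put(userId, authoritativeUnread);
    }

    long unread(String userId) {
        return unreadByUser.getOrDefault(userId, 0L);
    }
}
```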
15. Reconnect Recovery
Clients disconnect. Networks fail. Browsers sleep. Mobile devices pause apps.
A real-time system must define reconnect recovery.
Client state:
{
"sessionId": "sess_abc",
"lastReceivedNotificationId": "notif_10021",
"lastReceivedMessageIdByRoom": {
"room_1": "msg_778",
"room_2": "msg_991"
}
}
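Reconnect flow:
Client reconnects and sends its last received ids
Gateway authenticates and re-registers presence
Client recovery API returns durable notifications and messages after those ids
Live delivery resumes over Pub/Sub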
Pub/Sub cannot provide this recovery. Streams can help if you keep per-user or per-shard retention, but you still need offset management and retention guarantees.
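The "fetch everything after my last id" step can be sketched against an in-memory inbox. The sketch assumes notification ids are numeric and increase monotonically per user, which is an assumption the durable store would have to guarantee. `InboxRecovery` is a hypothetical name and the sorted map stands in for an indexed table or stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

final class InboxRecovery {
    private final Map<String, NavigableMap<Long, String>> inboxByUser = new ConcurrentHashMap<>();

    void append(String userId, long notificationId, String body) {
        inboxByUser.computeIfAbsent(userId, ignored -> new ConcurrentSkipListMap<>())
                .put(notificationId, body);
    }

    // Returns up to `limit` notification ids strictly after lastReceivedId, oldest first.
    List<Long> fetchAfter(String userId, long lastReceivedId, int limit) {
        List<Long> missed = new ArrayList<>();
        NavigableMap<Long, String> inbox = inboxByUser.get(userId);
        if (inbox == null) {
            return missed; // unknown user: nothing to recover
        }
        for (Long id : inbox.tailMap(lastReceivedId, false).keySet()) {
            if (missed.size() == limit) {
                break; // the client asks again with the new last id
            }
            missed.add(id);
        }
        return missed;
    }
}
```

The limit keeps a long-offline client from pulling its whole backlog in one response.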
16. Streams for Durable-ish Real-Time Delivery
For stronger delivery tracking, use Redis Streams.
Example per-user stream:
XADD notifstream:{userId} * notificationId notif_123 type CASE_ASSIGNED summary "..."
XREAD BLOCK 5000 STREAMS notifstream:{userId} <lastId>
But per-user streams can create many keys. Alternative per-shard streams:
notifstream:{shard_00}
notifstream:{shard_01}
...
notifstream:{shard_63}
Each entry includes userId.
Gateways filter for connected users.
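Choosing the shard stream for a user needs a hash that is stable across JVMs and restarts. The sketch below uses CRC32 from the JDK for that purpose. This is an illustrative choice, not a Redis requirement, and `StreamShards` is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class StreamShards {
    // Maps a user to one of `shardCount` stream keys, e.g. notifstream:{shard_17}.
    static String streamKeyFor(String userId, int shardCount) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        int shard = (int) (crc.getValue() % shardCount); // getValue() is non-negative
        return String.format("notifstream:{shard_%02d}", shard);
    }
}
```

Changing `shardCount` remaps users to different streams, so treat it as a fixed deployment constant or plan a migration.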
Trade-offs:
| Stream model | Pros | Cons |
|---|---|---|
| Per-user stream | Simple recovery per user | Many keys, many readers |
| Per-room stream | Natural for chat/collaboration | Room explosion, retention management |
| Per-shard stream | Fewer keys, scalable | Filtering complexity |
| Single global stream | Simple ingestion | Hotspot and fanout overhead |
Streams are useful when:
- clients need missed event recovery
- retention window is bounded
- event volume is manageable
- Redis memory budget is explicit
- stream trimming is disciplined
For long retention, use a durable database or log.
17. Delivery Semantics
Real-time systems usually combine semantics:
| Message type | Recommended semantics |
|---|---|
| typing indicator | best-effort, no recovery |
| cursor movement | best-effort, no recovery |
| presence online hint | best-effort + periodic recompute |
| notification badge update | at-least-eventual, can recompute |
| notification content | durable store + live hint |
| chat message | durable store + live hint/recovery |
| system alert | durable inbox + retry/push |
| regulatory notice | durable workflow, not Pub/Sub-only |
Do not over-engineer ephemeral messages. Do not under-engineer durable messages.
18. Slow Client and Backpressure Handling
A WebSocket server can be killed by slow clients.
Rules:
- never let one client have an unbounded outgoing queue
- drop or coalesce ephemeral messages
- preserve durable notifications in store, not only in socket buffer
- close clients that cannot keep up
- separate high-priority and low-priority messages
Example per-connection policy:
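import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class WebSocketConnection {
    // Bounded queue: a slow client can never hold more than 1,000 pending messages.
    private final BlockingQueue<RealtimeMessage> outbound = new ArrayBlockingQueue<>(1_000);

    public boolean trySend(RealtimeMessage message) {
        // offer() never blocks; it returns false when the queue is full.
        boolean accepted = outbound.offer(message);
        if (!accepted && !message.isEphemeral()) {
            // The durable copy lives in the store, so close the socket and let
            // the client recover through the API after it reconnects.
            closeWithReason("CLIENT_TOO_SLOW");
        }
        return accepted; // a rejected ephemeral message is simply dropped
    }

    private void closeWithReason(String reason) {
        // Placeholder: close the underlying socket; details depend on the transport.
    }
}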
For high-frequency events, coalesce:
presence updates: keep latest per user
cursor updates: keep latest per document/user
badge count: keep latest count
progress update: keep latest percentage
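Coalescing can be sketched as a bounded "latest value per key" buffer. `CoalescingBuffer` and the key format are hypothetical. The sketch assumes non-null values and uses coarse synchronization for clarity rather than throughput.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class CoalescingBuffer<V> {
    private final int maxKeys;
    private final Map<String, V> latestByKey = new LinkedHashMap<>();

    CoalescingBuffer(int maxKeys) {
        this.maxKeys = maxKeys;
    }

    // Keeps only the newest value per key. Returns false if a new key would exceed the bound.
    synchronized boolean offer(String coalesceKey, V value) {
        if (!latestByKey.containsKey(coalesceKey) && latestByKey.size() >= maxKeys) {
            return false;
        }
        latestByKey.put(coalesceKey, value); // an existing key keeps its queue position
        return true;
    }

    // Removes and returns the oldest pending entry, or null when the buffer is empty.
    synchronized Map.Entry<String, V> poll() {
        Iterator<Map.Entry<String, V>> it = latestByKey.entrySet().iterator();
        if (!it.hasNext()) {
            return null;
        }
        Map.Entry<String, V> next = it.next();
        Map.Entry<String, V> copy = Map.entry(next.getKey(), next.getValue());
        it.remove();
        return copy;
    }

    synchronized int size() {
        return latestByKey.size();
    }
}
```

Ten cursor updates for the same document and user occupy one slot, and the client receives only the last one.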
19. Fanout Storm Control
Fanout storms happen when one event turns into too many socket sends.
Examples:
- broadcasting to 1 million users
- room message in a huge room
- repeated presence updates
- retrying live notifications too aggressively
- reconnect storm after gateway restart
Controls:
- rate limit publish per tenant/channel
- shard large rooms
- batch messages where possible
- coalesce state updates
- sample non-critical telemetry
- use durable inbox instead of forcing live delivery
- cap per-gateway sends per second
- apply circuit breaker for overloaded gateway nodes
- use backpressure from gateway to publisher
Metric to watch:
fanout multiplier = socket send rate / publish rate
If publish rate is small but socket send rate is enormous, fanout is the multiplier.
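One of the controls listed above, capping per-gateway sends per second, can be sketched as a token bucket. `SendRateLimiter` is a hypothetical name, and the caller passes the clock explicitly so the behavior is easy to test. What to do when a send is refused (drop, coalesce, or defer) is a policy decision outside this sketch.

```java
final class SendRateLimiter {
    private final double burst;
    private final double tokensPerMs;
    private double tokens;
    private long lastRefillMs;

    SendRateLimiter(int sendsPerSecond, int burst, long nowMs) {
        this.burst = burst;
        this.tokensPerMs = sendsPerSecond / 1000.0;
        this.tokens = burst;        // start with a full bucket
        this.lastRefillMs = nowMs;
    }

    // Returns true if one socket send may proceed at time nowMs.
    synchronized boolean tryAcquire(long nowMs) {
        if (nowMs > lastRefillMs) {
            // Refill in proportion to elapsed time, never above the burst size.
            tokens = Math.min(burst, tokens + (nowMs - lastRefillMs) * tokensPerMs);
            lastRefillMs = nowMs;
        }
        if (tokens < 1.0) {
            return false;
        }
        tokens -= 1.0;
        return true;
    }
}
```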
20. Java Pub/Sub Subscriber Architecture
Redis Pub/Sub connections are special: a subscribed connection is dedicated to subscription traffic. Do not share it with normal commands.
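Architecture:
dedicated subscriber connection -> listener decodes and validates -> bounded executor -> local connection registry -> socket send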
Java concerns:
- dedicate connection for subscription
- decode defensively
- avoid blocking Redis listener thread
- hand off to bounded executor
- protect against malformed messages
- include message type/version
- record dropped/invalid message metrics
Example envelope:
{
"messageId": "rt_01JZ4W4P4M2NSFMG1RXMWJQJ45",
"messageType": "notification.created",
"messageVersion": 2,
"targetType": "USER",
"targetId": "user_123",
"createdAtEpochMs": 1782972000000,
"traceId": "0af7651916cd43dd8448eb211c80319c",
"payload": {
"notificationId": "notif_987",
"summary": "New case assignment"
}
}
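The listener concerns above (decode defensively, never block, hand off to a bounded executor, count failures) can be sketched independently of any Redis client. `SubscriberHandoff` is a hypothetical class. The decoder, executor, and handler are injected, and the client library's message callback is assumed to call `onMessage` with the raw payload.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.Function;

final class SubscriberHandoff<T> {
    private final Function<String, T> decoder;
    private final Executor workers;
    private final Consumer<T> handler;
    final AtomicLong invalid = new AtomicLong(); // decode/validation failures
    final AtomicLong dropped = new AtomicLong(); // rejected because workers were saturated

    SubscriberHandoff(Function<String, T> decoder, Executor workers, Consumer<T> handler) {
        this.decoder = decoder;
        this.workers = workers;
        this.handler = handler;
    }

    // Called on the Redis listener thread: must return quickly and never throw.
    void onMessage(String rawPayload) {
        T decoded;
        try {
            decoded = decoder.apply(rawPayload);
        } catch (RuntimeException e) {
            invalid.incrementAndGet(); // a malformed message must not kill the subscriber
            return;
        }
        final T message = decoded;
        try {
            workers.execute(() -> handler.accept(message));
        } catch (RejectedExecutionException e) {
            dropped.incrementAndGet(); // bounded executor is full: shed load, keep listening
        }
    }
}
```

In a real gateway the executor could be a `ThreadPoolExecutor` over an `ArrayBlockingQueue`. Its default rejection policy throws `RejectedExecutionException`, which is what the second catch block counts.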
21. Message Versioning
Real-time message contracts evolve. Clients may be old. Gateways may be deployed before services. Services may publish v2 while some clients only understand v1.
Rules:
- include messageType
- include messageVersion
- keep payload backward-compatible when possible
- allow gateway-side down-conversion for important messages
- use capability negotiation at connection time
- do not remove fields abruptly
Client connect metadata:
{
"clientVersion": "4.13.0",
"supportedRealtimeMessages": {
"notification.created": [1, 2],
"presence.changed": [1],
"case.updated": [2, 3]
}
}
Gateway can decide whether to:
- send v2
- downgrade to v1
- send generic refresh hint
- ask client to refresh via API
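The gateway's decision can be sketched as a lookup over the capabilities the client declared at connect time. `VersionNegotiator` is a hypothetical helper. It picks the highest version the client supports that does not exceed the published version, and an empty result means the gateway should fall back to a refresh hint.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

final class VersionNegotiator {
    // Highest client-supported version that is <= the version the service published.
    static OptionalInt chooseVersion(Map<String, List<Integer>> supportedByType,
                                     String messageType,
                                     int publishedVersion) {
        return supportedByType.getOrDefault(messageType, List.of()).stream()
                .mapToInt(Integer::intValue)
                .filter(version -> version <= publishedVersion)
                .max();
    }
}
```

If the chosen version is lower than the published one, the gateway still needs a down-conversion step for that message type. This sketch only selects the target version.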
22. Security and Privacy
Real-time systems can leak data quickly.
Rules:
- authenticate WebSocket connection before registration
- authorize room subscription
- validate tenant boundary on every fanout
- never trust client-supplied userId
- avoid sensitive payloads in Pub/Sub if Redis admins/logging/tools can inspect them
- encrypt transport with TLS
- avoid storing raw IP/user-agent unless required
- hash or truncate privacy-sensitive metadata
- expire presence keys aggressively
- avoid broadcasting to channels that unauthorized nodes/consumers can subscribe to
Channel naming is not access control. Do not assume obscure channel names protect data.
23. Failure Modes
| Failure | Expected behavior |
|---|---|
| Gateway crashes | Local sockets gone; presence expires by TTL |
| Gateway loses Redis connection | Stop accepting or degrade presence/fanout depending policy |
| Redis Pub/Sub message missed | Durable state recovered via API/store if important |
| Client disconnects | Presence eventually offline; missed durable messages fetched later |
| Slow client | Drop ephemeral messages or close connection |
| Duplicate session | Support multi-session or replace explicitly |
| Network partition | Presence may be stale until TTL/hysteresis clears |
| Keyspace notification missed | Periodic reconciliation repairs indexes |
| Redis restart | Soft presence may be rebuilt from reconnects |
| Fanout storm | Rate limit, coalesce, degrade low-priority signals |
Failure-aware presence design accepts that presence is approximate. Failure-aware notification design ensures important messages are recoverable.
24. Observability
Metrics:
| Metric | Type | Meaning |
|---|---|---|
| ws_connections | gauge | active WebSocket connections |
| ws_users_online | gauge | derived online users |
| ws_sessions_per_user | histogram | multi-device/session distribution |
| redis_presence_sessions | gauge | active presence session keys/index size |
| realtime_pubsub_received_total | counter | messages consumed from Redis |
| realtime_pubsub_invalid_total | counter | decode/validation failures |
| realtime_socket_send_total | counter | socket sends attempted |
| realtime_socket_send_failed_total | counter | send failures |
| realtime_dropped_ephemeral_total | counter | dropped coalescible messages |
| realtime_client_slow_closed_total | counter | slow clients closed |
| notification_live_push_total | counter | live notification pushes |
| notification_recovery_fetch_total | counter | reconnect recovery fetches |
| presence_reconciliation_removed_total | counter | stale index cleanup |
Logs should include:
- nodeId
- sessionId
- userId or privacy-safe hash
- tenantId
- messageId
- messageType
- channel
- traceId
- delivery result
Dashboards should show:
- connections by gateway node
- publish rate vs socket send rate
- dropped messages by type
- reconnect rate
- presence index drift
- notification recovery rate
- slow client closures
- Redis command latency
- Redis Pub/Sub message rate
25. Testing Strategy
Unit Tests
- presence key naming
- heartbeat TTL calculation
- online threshold/hysteresis logic
- message envelope version parsing
- authorization decisions
- coalescing behavior
- local registry concurrency
Integration Tests
- gateway registers presence on connect
- heartbeat extends TTL
- disconnect removes local and Redis state
- TTL expiry makes user offline
- Pub/Sub routes to correct local user
- missed notification recovered from store
- slow client buffer closes connection
- duplicate sessions handled correctly
Failure Tests
| Test | What to verify |
|---|---|
| kill gateway process | presence expires without explicit disconnect |
| stop Redis temporarily | gateway degradation behavior is explicit |
| disconnect client mid-send | socket cleanup works |
| publish malformed message | subscriber does not die |
| flood room channel | backpressure/coalescing works |
| restart client after missed messages | recovery API returns gap |
| miss keyspace notification | reconciliation removes stale indexes |
Load Tests
Measure:
- max connected sockets per node
- heartbeat write rate
- Redis CPU under heartbeat load
- Pub/Sub throughput
- socket send throughput
- p99 fanout latency
- reconnect storm behavior
- memory growth of presence indexes
- effect of slow clients
26. Real-Time Design Patterns
Pattern A — Best-Effort Typing Indicator
Client -> Gateway -> PUBLISH typing:room:{roomId}
Other gateways -> send to connected room clients
No durable state. No recovery. Drop under pressure.
Pattern B — Durable Notification + Live Hint
Domain event -> Notification DB row -> Redis PUBLISH hint -> WebSocket push
Client reconnect -> fetch DB inbox after last notification id
This is the default for important notifications.
Pattern C — Presence with TTL + Reconciliation
connect/heartbeat -> session key with TTL + indexes
expired key notification -> cleanup hint
periodic reconciliation -> correctness repair
Presence is approximate but self-healing.
Pattern D — Room Broadcast with Durable Message Store
write message to DB
publish room hint
connected clients receive live message
reconnecting clients fetch messages after last seen id
Pub/Sub optimizes latency. The store provides correctness.
Pattern E — Gateway Node Routing
presence lookup user sessions -> node IDs
publish to realtime:node:{nodeId}
node sends only to local sockets
Use when direct user notifications are high-volume.
27. Production Anti-Patterns
Anti-Pattern 1 — Pub/Sub as Notification Database
Bad:
PUBLISH notif:user_123 "your case was assigned"
If user is offline, message disappears.
Better:
INSERT notification
PUBLISH notification-created hint
Anti-Pattern 2 — Presence Without TTL
Bad:
SADD online-users user_123
If gateway crashes, user stays online forever.
Better:
session key with TTL + heartbeat + last-seen zset
Anti-Pattern 3 — One Global Channel for Everything
Bad:
PUBLISH realtime:all <every message>
Every gateway parses every message.
Better:
- per-node channel
- per-room channel
- per-shard channel
- sharded Pub/Sub in Redis Cluster
Anti-Pattern 4 — Unbounded Socket Buffers
Bad:
queue.add(message) forever
A slow mobile client can exhaust gateway memory.
Better:
- bounded buffers
- drop/coalesce ephemeral messages
- close slow clients
- rely on durable recovery for important messages
Anti-Pattern 5 — Online/Offline Flicker
Bad:
if heartbeat missed once -> offline
Better:
- heartbeat threshold
- hysteresis
- last-seen
- delayed offline transition
28. Operational Checklist
Before shipping Redis-backed real-time features, answer:
- Which messages are ephemeral?
- Which messages require durable recovery?
- What is the source of truth for notifications?
- How does reconnect recovery work?
- What offset does the client send on reconnect?
- What is the presence heartbeat interval?
- What is the session TTL?
- What is the offline hysteresis window?
- How are stale presence indexes cleaned?
- Are keyspace notifications only hints?
- How are WebSocket connections mapped locally?
- What happens when Redis is unavailable?
- What happens when a gateway crashes?
- How are slow clients handled?
- What is the maximum outbound buffer per connection?
- How are messages versioned?
- How are tenants authorized for channels/rooms?
- Does channel naming leak sensitive information?
- Is normal Pub/Sub enough or is sharded Pub/Sub needed?
- What metrics indicate fanout storms?
- Can unread counts be rebuilt from durable state?
29. 20-Hour Practice Plan
Hours 1–4 — Presence Basics
Build:
- WebSocket connect/disconnect
- local registry
- Redis session key with TTL
- heartbeat update
- online query
Break:
- kill gateway without disconnect
- verify TTL cleans presence eventually
Hours 5–8 — Pub/Sub Fanout
Build:
- gateway subscription
- publish to user channel
- local routing by userId
- bounded socket send queue
Break:
- publish malformed messages
- publish while user offline
- publish during gateway restart
Hours 9–12 — Durable Notifications
Build:
- notification store table/mock
- live notification hint
- unread counter cache
- reconnect recovery after last ID
Break:
- disconnect before notification
- reconnect and fetch missed notification
Hours 13–15 — Room Broadcast
Build:
- room membership
- room Pub/Sub channel
- durable message store
- client last seen offset
Break:
- send messages during client disconnect
- recover missed messages
Hours 16–18 — Backpressure
Build:
- bounded outgoing buffer
- ephemeral coalescing
- slow client close
- fanout metrics
Break:
- simulate slow client
- flood room channel
Hours 19–20 — Operations
Create:
- dashboard sketch
- alert rules
- failure runbook
- capacity notes
Lesson:
Real-time correctness is not about never dropping a socket message. It is about knowing which messages can be dropped and how important state is recovered.
30. Summary
Redis is excellent for real-time runtime state and signaling when used carefully.
The core principles:
- Separate durable state from ephemeral signals.
- Use Pub/Sub for live hints, not critical history.
- Use TTL-backed presence, not permanent online flags.
- Derive online/offline from recent heartbeat and hysteresis.
- Keep WebSocket connection objects local to gateway nodes.
- Use Redis presence indexes for routing and discovery.
- Treat keyspace notifications as hints, not correctness mechanisms.
- Use Streams or durable stores for reconnect recovery.
- Use bounded buffers and coalescing to survive slow clients.
- Protect tenant boundaries and avoid sensitive Pub/Sub payloads.
- Measure fanout multiplier, dropped messages, reconnects, and recovery fetches.
The top 1% engineer does not ask:
How do I send a WebSocket message with Redis?
They ask:
Which facts must survive disconnects, and which Redis mechanisms are only fast-path signals?
Next: Part 021 will cover Redis Search, JSON, document modeling, secondary indexes, query patterns, and how Java services should treat Redis as an index/document acceleration layer without confusing it with the system of record.