RabbitMQ Deployment Model - Nodes, Clusters, Quorum Queues, and Streams

Learn Java RabbitMQ, RabbitMQ Streams, Patterns, and Deployment In Action - Part 031

Production-grade RabbitMQ deployment model covering nodes, clusters, quorum queues, streams, classic queues, data safety, placement, capacity, and Java application implications.

[2026-07-02] 28 min read, 5454 words

Part 031 — RabbitMQ Deployment Model: Nodes, Clusters, Quorum Queues, and Streams

This part is about the operational shape of RabbitMQ.

At this level, the main question is no longer:

"How do I publish and consume a message?"

The question becomes:

"What physical and logical deployment model makes my messaging guarantees true under real failure?"

A RabbitMQ application can have perfect Java code and still lose availability, overload downstream systems, or make recovery impossible if the deployment model is wrong.

This part builds the mental model for running RabbitMQ as production infrastructure.

We will focus on:

nodes
clusters
leader/follower placement
quorum queues
streams
classic queues
replication and durability
capacity boundaries
operational invariants
Java application consequences
failure runbooks

We will not repeat generic container, Kubernetes, Linux, or networking basics already covered elsewhere in this project.

1. Kaufman Framing: Deconstruct the Deployment Skill

Following Kaufman's method, we first deconstruct the skill into subskills that can be practiced independently.

For RabbitMQ deployment, the skill is not "install RabbitMQ".

The actual skill is:

Given a business workload, choose and operate a RabbitMQ topology whose failure behavior is explicit, measurable, and recoverable.

That decomposes into seven subskills.

Subskill	What You Must Be Able to Do
Queue type selection	Choose classic queue, quorum queue, stream, or super stream for the workload
Replication reasoning	Explain what data is replicated, when publish is confirmed, and what happens when a node dies
Leader placement	Understand which node owns writes and how leader distribution affects load
Capacity modelling	Estimate CPU, memory, disk, network, queue depth, lag, and replay retention
Failure modelling	Predict behavior during restart, network loss, disk pressure, and replica loss
Operational governance	Use policies, limits, permissions, monitoring, and runbooks consistently
Java alignment	Configure producers/consumers to match broker semantics rather than fight them

The goal of this part is to give you the map and the invariants.

You should be able to look at a RabbitMQ cluster and say:

this queue should be quorum, not classic
this workload needs stream replay, not a normal queue
this producer confirm mode is too weak for the queue type
this prefetch will amplify memory pressure during a consumer outage
this retention setting will delete audit data before replay finishes
this cluster can survive one node failure, but not this maintenance sequence
this topology is impossible to operate safely because ownership is unclear

2. Deployment Model Is Part of the Contract

A common mistake is to treat deployment as infrastructure detail and application messaging as code detail.

That split is false.

In RabbitMQ, deployment choices change application semantics.

For example:

a durable classic queue on one node is not the same as a quorum queue replicated across nodes
a stream with retention is not the same as a queue where consume removes messages
a queue with one consumer behaves differently from the same queue with competing consumers
a producer using confirms behaves differently from one that fire-and-forgets
a consumer with manual ack behaves differently from one using auto ack
a quorum queue with a delivery limit behaves differently from an infinite redelivery loop
a stream consumer with server-side offset tracking behaves differently from one storing progress in an external database

So deployment is not merely about where RabbitMQ runs.

Deployment determines the truth of statements like:

"Once accepted by the broker, this message survives a node crash."
"This workload can be replayed for seven days."
"This queue can keep accepting commands when one broker node is down."
"This consumer group can scale horizontally without breaking entity ordering."
"This system can be upgraded without losing messages."

A senior engineer should treat these statements as contracts.

3. The Core RabbitMQ Physical Model

A RabbitMQ deployment consists of one or more broker nodes.

Each node has:

Erlang runtime
RabbitMQ application process
local storage
memory/disk alarms
plugins
listeners
management endpoints
queue/stream replicas
connection and channel state

A cluster is a group of nodes that share metadata and cooperate to host queues, exchanges, bindings, policies, users, permissions, and other runtime structures.

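A simplified view, using an illustrative three-node cluster with one quorum queue and one stream:

rmq-0: quorum queue replica (leader), stream replica (follower)
rmq-1: quorum queue replica (follower), stream replica (leader)
rmq-2: quorum queue replica (follower), stream replica (follower)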

The critical detail is that queues and streams are not abstract clouds floating above the cluster.

They have physical placement.

A quorum queue has replicas. One replica is leader. Others are followers.

A stream has replicas. One replica is leader. Others are followers.

Most writes go through the leader.

Therefore, leader placement and replica placement directly affect throughput, latency, and failure behavior.

4. Logical Model vs Physical Model

Applications think in logical names.

Examples:

exchange: cpq.commands
queue: pricing.quote.calculate.q
stream: order-events.stream
super stream: order-events.super
routing key: quote.calculate.v1

Operators must think in physical placement.

Examples:

which node is queue leader?
how many replicas exist?
how much disk does each replica consume?
which node serves reads?
where does the client connection land?
what happens if rmq-1 is drained?
can the remaining nodes form quorum?

These two views must be connected.

The design failure usually happens when teams stop at the logical view.

They create many queues and assume RabbitMQ will "handle it".

A production engineer asks:

where are leaders distributed?
how much write amplification does replication create?
what is the disk growth rate?
what is the failover recovery time?
how does client reconnection behave?
what alerts prove the invariant still holds?

5. Queue and Stream Type Decision Matrix

RabbitMQ now gives you multiple data structures. Choosing the wrong one is one of the most expensive mistakes.

Workload Need	Recommended Data Structure	Why
Reliable task dispatch	Quorum queue	Replicated, data-safety oriented, consumer ack semantics
Short-lived transient task	Classic queue may be acceptable	Lower overhead if loss is acceptable or workload is non-critical
Replayable domain event feed	Stream	Append-only, non-destructive reads, retention, offset-based consumption
High-throughput partitioned event log	Super stream	Partitioned streams for horizontal scale
Pub/sub notification without replay	Topic/fanout exchange + subscriber queues	Subscriber isolation, simple delivery model
Long-lived audit trail	Stream, possibly external archive too	Retention/replay semantics are explicit
Strict per-entity processing	Quorum queue per shard or super stream partitioned by entity key	Limits concurrency to ordering boundary
RPC reply messages	Direct Reply-To or short-lived reply queue	Avoids unnecessary durable queue state
Delayed retries	Retry queues with TTL/DLX or delayed exchange plugin	Explicit retry stages

A practical rule:

use quorum queues for important commands/tasks
use streams for replayable event history
use super streams when a single stream becomes a throughput or partitioning bottleneck
use classic queues deliberately, not by accident

Do not choose queue types because they are familiar.

Choose them because their failure behavior matches the business contract.
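The practical rule above can be written down as a small decision function. This is only a sketch: `QueueTypeAdvisor`, its `Structure` enum, and the three boolean inputs are hypothetical names chosen for illustration, and a real decision would weigh more factors from the matrix (ordering, retention, RPC, retries).

```java
final class QueueTypeAdvisor {
    enum Structure { CLASSIC_QUEUE, QUORUM_QUEUE, STREAM, SUPER_STREAM }

    /**
     * Hypothetical helper mirroring the practical rule:
     * needsReplay            consumers must be able to re-read history
     * singleStreamBottleneck one stream cannot meet throughput or parallelism needs
     * lossAcceptable         the business contract tolerates message loss
     */
    static Structure recommend(boolean needsReplay,
                               boolean singleStreamBottleneck,
                               boolean lossAcceptable) {
        if (needsReplay) {
            // Replayable history is a stream concern; partition only when one stream is not enough.
            return singleStreamBottleneck ? Structure.SUPER_STREAM : Structure.STREAM;
        }
        // Classic queues are a deliberate choice, made only when loss is acceptable.
        return lossAcceptable ? Structure.CLASSIC_QUEUE : Structure.QUORUM_QUEUE;
    }
}
```

Note that the classic queue is only reachable through an explicit `lossAcceptable` flag, which matches the "deliberately, not by accident" rule.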

6. Classic Queues: Still Useful, But Not the Default for Critical Data

Classic queues are the historical queue type.

They are often still useful for:

low criticality messages
temporary queues
reply queues
non-durable workflow edges
local development
simple fanout subscribers where loss is acceptable
workloads where maximum throughput matters more than replication safety

But for critical durable messages, classic queues require careful reasoning.

A durable classic queue on a single node can persist messages to disk, but that does not automatically give multi-node data safety.

For highly available durable queues, modern RabbitMQ guidance favors quorum queues; the old classic mirrored queue model was deprecated and then removed in RabbitMQ 4.0.

Operational risks with classic queues:

single-node placement can become an availability bottleneck
a non-replicated classic queue has no failover: while its node is down, the queue is unavailable
accidental use for critical commands can create false confidence
queue length and memory pressure can grow quickly under consumer outage
topology may drift because teams create queues without policies

Classic queues are not "bad".

They are simply not the right default when the business contract says "accepted messages must survive node failure".

7. Quorum Queues: The Default for Critical Work Queues

A quorum queue is a replicated queue based on a consensus model.

It is designed for data safety and predictable failure semantics.

The queue has multiple replicas. One replica is leader. The others are followers.

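For writes, the leader coordinates replication. The broker confirms a publish to a quorum queue only after a majority of replicas have persisted the message, so a publisher confirm is meaningful only if the application enables confirms, waits for them, and treats a nack or timeout as "outcome unknown".

A simplified model:

producer publishes to the queue leader
leader appends the message and replicates it to the followers
a majority of replicas (leader included) persist the entry
leader sends the publisher confirm to the producer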

In practice, this means there is write amplification.

One logical message may become multiple replica writes.

That is the price of safety.

7.1 When to Use Quorum Queues

Use quorum queues for:

commands that must not be lost
task dispatch with manual ack
payment/order/quote/regulatory workflows
retry queues where durability matters
DLQs where diagnostic data must survive failure
workloads where availability under one-node failure is more important than raw throughput

7.2 When Not to Use Quorum Queues

Avoid quorum queues for:

huge replayable event logs
analytics fanout where consumers need independent replay
ultra-short transient reply queues
workloads that tolerate loss and require extreme throughput
cases where stream retention/replay is the actual requirement

7.3 Quorum Queue Design Invariants

For critical queues:

queue must be durable
messages must be persistent
publisher confirms must be enabled
consumer manual acknowledgements must be used
handlers must be idempotent
DLX/retry behavior must be explicit
queue length and delivery limit must be governed
leader placement must be monitored
cluster must maintain majority availability

If any item is missing, the phrase "reliable queue" becomes ambiguous.

8. Quorum Queue Delivery Limit and Poison Message Control

Quorum queues support delivery limits.

This matters because infinite redelivery is a production hazard.

Without a delivery limit, a poison message can keep cycling forever:

deliver -> handler fails -> requeue -> deliver -> handler fails -> requeue -> ...

The result:

CPU waste
log noise
alert fatigue
head-of-line blocking
delayed valid work
duplicate side effects if handler is not safe

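A safer model:

deliver -> handler fails -> requeue, delivery count incremented
delivery count exceeds the limit -> broker dead-letters the message (or drops it if no DLX is configured)
dead-lettered message is parked in a DLQ for diagnosis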

A strong production invariant:

Every durable command queue must have a bounded retry/dead-letter story.

Do not let poison messages define your runtime behavior.
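To make the bounded-retry idea concrete, here is a small in-memory simulation. It is a model, not broker code: `PoisonMessageSim` is a hypothetical name, and the counting rule (dead-letter once the number of failed deliveries exceeds the limit) is an assumption that should be checked against the quorum queue documentation for your RabbitMQ version.

```java
import java.util.function.IntPredicate;

final class PoisonMessageSim {
    enum Outcome { ACKED, DEAD_LETTERED }

    record Result(Outcome outcome, int deliveries) {}

    /**
     * Delivers one message repeatedly until the handler succeeds or the
     * delivery limit is exceeded. handlerSucceeds receives the 1-based attempt number.
     */
    static Result run(IntPredicate handlerSucceeds, int deliveryLimit) {
        int failedDeliveries = 0;
        for (int attempt = 1; ; attempt++) {
            if (handlerSucceeds.test(attempt)) {
                return new Result(Outcome.ACKED, attempt);
            }
            failedDeliveries++;
            if (failedDeliveries > deliveryLimit) {
                // Bounded: the message leaves the work queue instead of cycling forever.
                return new Result(Outcome.DEAD_LETTERED, attempt);
            }
        }
    }
}
```

With a limit of 3, a handler that always fails sees the message four times and then it is dead-lettered; a handler that recovers on the second attempt acks normally.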

9. Streams: Deployment Model for Replayable Logs

RabbitMQ Streams are not just "bigger queues".

They are a different data structure.

A stream models an append-only log.

Consumers track offsets instead of removing messages.

A queue answers:

"Who should process this message now?"

A stream answers:

"What happened, and from which offset should this consumer read?"

9.1 Use Streams For

replayable domain event history
audit logs
high fanout consumers
analytics feeds
event-carried state transfer
time-windowed processing
rebuilding projections
regulatory investigation and forensic replay
high-throughput immutable ingest

9.2 Avoid Streams For

simple one-off task dispatch
command handling where each message should be processed by exactly one worker group
workflows requiring per-message destructive consumption
unbounded retention without storage planning
cases where downstream cannot tolerate replay duplicates

9.3 Stream Deployment Implications

Streams are always persistent, and in a multi-node cluster they are replicated.

That gives durability but also creates operational responsibilities:

retention must be explicit
disk sizing must include replicas
consumer lag must be monitored
offset storage must be reliable
replay rate must be capacity-planned
partitions/super streams must be chosen before throughput bottlenecks become urgent

A stream deployment without retention planning is just delayed disk exhaustion.

10. Super Streams: Partitioned Stream Deployment

A super stream is a logical stream composed of partition streams.

Use it when one stream cannot meet throughput, storage, or parallelism requirements.

The important design decision is the partition key.

Good partition keys:

order id
account id
tenant id plus entity id
case id
customer id
aggregate root id

Bad partition keys:

timestamp only
random UUID when ordering matters
status field with low cardinality
tenant id alone for large tenants
routing key that changes over entity lifetime

A super stream gives scale, but it does not remove the need for ordering design.

Ordering remains local to the partition.
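A minimal sketch of why a stable entity key preserves per-entity ordering: the same key always maps to the same partition. `PartitionRouter` is a hypothetical helper, and `String.hashCode()` is used only for illustration; the RabbitMQ stream client applies its own routing and hash function, so this does not reproduce the client's actual partition assignment.

```java
final class PartitionRouter {
    /** Maps an entity key to a partition index in [0, partitionCount). */
    static int partitionFor(String entityKey, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(entityKey.hashCode(), partitionCount);
    }
}
```

A key that changes over the entity's lifetime (for example a status-dependent routing key) would send events for the same entity to different partitions, which is exactly the ordering break the bad-key list warns about.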

11. Cluster Size and Failure Tolerance

A three-node RabbitMQ cluster is the common minimum for quorum-style availability.

The reason is majority.

For a quorum of three replicas:

3 online: healthy
2 online: can make progress
1 online: cannot maintain quorum

For five replicas:

5 online: healthy
4 online: can make progress, with one more failure tolerated
3 online: can make progress
2 online: cannot maintain quorum

This means "more replicas" is not free.

It improves fault tolerance but increases:

storage usage
replication traffic
write latency risk
recovery cost
operational complexity

For most business workloads, start with three nodes and three replicas for critical quorum queues, then scale based on measured requirements.

Do not blindly set every queue to five replicas.

11.1 Majority Table

Replica Count	Majority Required	Failures Tolerated
1	1	0
3	2	1
5	3	2
7	4	3

Replication is a safety mechanism, not a capacity shortcut.
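The majority table follows from integer arithmetic, which a short sketch makes explicit. `QuorumMath` and its method names are hypothetical.

```java
final class QuorumMath {
    /** Smallest number of replicas that forms a majority. */
    static int majority(int replicas) {
        return replicas / 2 + 1;
    }

    /** Number of replicas that can be lost while a majority remains. */
    static int failuresTolerated(int replicas) {
        return replicas - majority(replicas);
    }

    /** True when the online replicas can still elect a leader and accept writes. */
    static boolean canMakeProgress(int replicas, int online) {
        return online >= majority(replicas);
    }
}
```

The same arithmetic shows why even replica counts are rarely chosen: four replicas need a majority of three and therefore tolerate one failure, the same as three replicas.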

12. Leader Placement and Hot Nodes

Even with three equal nodes, workload may be uneven.

If too many queue leaders live on one node, that node becomes hot.

Symptoms:

one node has higher CPU
one node has higher disk writes
one node has higher network egress
one node has higher connection/channel load
publisher confirm latency increases for queues led by that node
consumer latency differs across queues

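Example bad placement (illustrative counts):

rmq-0: leader for 9 queues
rmq-1: leader for 0 queues
rmq-2: leader for 0 queues

Better placement:

rmq-0: leader for 3 queues
rmq-1: leader for 3 queues
rmq-2: leader for 3 queues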

Leader distribution matters because leaders coordinate writes.

A Java producer does not usually know or care which node is leader, but latency will expose the truth.
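One way to turn "hot node" into a number is to count leaders per node and look at the share held by the busiest one. The sketch below assumes you already have a queue-to-leader mapping (for example exported from the management API); `LeaderDistribution` and its methods are hypothetical names, and leader count is only a proxy because queues differ in traffic.

```java
import java.util.Map;
import java.util.TreeMap;

final class LeaderDistribution {
    /** Counts leaders per node from a queue-name -> leader-node map. */
    static Map<String, Integer> leadersPerNode(Map<String, String> leaderByQueue) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String node : leaderByQueue.values()) {
            counts.merge(node, 1, Integer::sum);
        }
        return counts;
    }

    /** Share of all leaders held by the busiest node, between 0.0 and 1.0. */
    static double hottestShare(Map<String, String> leaderByQueue) {
        if (leaderByQueue.isEmpty()) {
            return 0.0;
        }
        int max = leadersPerNode(leaderByQueue).values().stream()
                .mapToInt(Integer::intValue)
                .max()
                .orElse(0);
        return (double) max / leaderByQueue.size();
    }
}
```

On a three-node cluster a perfectly even spread gives a share near one third; a value drifting toward 1.0 is a signal to rebalance leaders.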

13. Client Connection Strategy

Applications usually connect through:

Kubernetes service
DNS round-robin
TCP load balancer
service mesh TCP route
direct broker node list

A good connection strategy has four properties:

clients can connect when one node is down
clients eventually distribute across nodes
reconnection does not create a thundering herd
topology declaration is idempotent

A weak strategy:

factory.setHost("rmq-0.rabbitmq.local");

A stronger strategy uses multiple addresses:

Address[] addresses = new Address[] {
    new Address("rmq-0.rabbitmq.svc", 5672),
    new Address("rmq-1.rabbitmq.svc", 5672),
    new Address("rmq-2.rabbitmq.svc", 5672)
};

Connection connection = factory.newConnection(addresses, "pricing-service");

For Spring AMQP, a stronger pattern is to configure addresses rather than a single host.

spring:
  rabbitmq:
    addresses: rmq-0.rabbitmq.svc:5672,rmq-1.rabbitmq.svc:5672,rmq-2.rabbitmq.svc:5672
    username: pricing_app
    password: ${RABBITMQ_PASSWORD}
    publisher-confirm-type: correlated
    publisher-returns: true

The application should also use bounded reconnection/backoff behavior.

When a node fails, thousands of Java clients reconnecting instantly can create a second incident.
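Bounded reconnection usually means exponential backoff with a cap and random jitter, so that clients do not all retry at the same instant. The function below is a generic sketch of the "full jitter" variant; `ReconnectBackoff` is a hypothetical name, and wiring it into the Java client's recovery settings or Spring's retry configuration is not shown.

```java
import java.util.Random;

final class ReconnectBackoff {
    /**
     * Full-jitter exponential backoff: a random delay in
     * [0, min(capMillis, baseMillis * 2^attempt)] milliseconds.
     */
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random random) {
        // Clamp the exponent so the shift cannot overflow for large attempt counts.
        int shift = Math.min(Math.max(attempt, 0), 30);
        long ceiling = Math.min(capMillis, baseMillis << shift);
        // nextDouble() is in [0, 1), so the result is in [0, ceiling].
        return (long) (random.nextDouble() * (ceiling + 1));
    }
}
```

The cap bounds worst-case recovery time, while the jitter spreads thousands of clients across the window instead of stacking them on the first surviving node.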

14. Durability Is a Chain, Not a Flag

Many teams believe this:

durable queue = safe messages

That is incomplete.

Durability is a chain.

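For AMQP queue workloads, safety requires every link below:

durable queue
persistent message
publisher confirm received before the producer treats the publish as done
replicated queue with a majority of replicas available
manual consumer ack after processing
idempotent consumer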

If the chain breaks, the guarantee weakens.

Examples:

Missing Link	Failure Mode
non-durable queue	queue can disappear after restart
transient message	message can be lost on broker restart
no publisher confirm	producer cannot know if broker accepted responsibility
no replication	node loss can remove queue availability/data
auto ack	message can be lost after delivery before processing
non-idempotent consumer	redelivery creates duplicate side effects

The deployment model must be paired with application protocol.

A quorum queue without publisher confirms is a half-built reliability story.
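The chain can be treated as a checklist that a deployment review evaluates per workload. This sketch reports the missing links using the same labels as the table above; `DurabilityChain` and its parameters are hypothetical, and in practice the inputs would come from topology definitions and client configuration.

```java
import java.util.ArrayList;
import java.util.List;

final class DurabilityChain {
    /** Returns the broken links of the durability chain; an empty list means the chain is complete. */
    static List<String> missingLinks(boolean durableQueue, boolean persistentMessages,
                                     boolean publisherConfirms, boolean replicated,
                                     boolean manualAck, boolean idempotentConsumer) {
        List<String> missing = new ArrayList<>();
        if (!durableQueue) missing.add("non-durable queue");
        if (!persistentMessages) missing.add("transient message");
        if (!publisherConfirms) missing.add("no publisher confirm");
        if (!replicated) missing.add("no replication");
        if (!manualAck) missing.add("auto ack");
        if (!idempotentConsumer) missing.add("non-idempotent consumer");
        return missing;
    }
}
```

For the half-built case named above, a replicated quorum queue whose producer skips confirms, the result is the single entry "no publisher confirm".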

15. Persistence and Disk Model

RabbitMQ reliability eventually hits disk.

For durable queues and streams, disk is not just storage. It is part of the messaging contract.

You must model:

message ingress rate
average message size
replication factor
retention duration
backlog tolerance
DLQ retention
stream replay window
compaction/archive strategy if any
disk IOPS
disk throughput
fsync behavior

A rough sizing formula for stream storage:

logical_data_per_day = messages_per_second * avg_message_bytes * 86400
replicated_data_per_day = logical_data_per_day * replica_count
required_disk = replicated_data_per_day * retention_days * safety_factor

Example:

messages_per_second = 5,000
avg_message_bytes   = 2,000
retention_days      = 7
replica_count       = 3
safety_factor       = 1.5

logical/day   = 5,000 * 2,000 * 86,400
              = 864 GB/day

replicated/day = 864 GB * 3
               = 2.592 TB/day

required = 2.592 TB * 7 * 1.5
         = 27.216 TB

That is why stream retention must be designed, not guessed.
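The same formula as a reusable function, so that different retention or replica settings can be compared quickly. `StreamSizing` is a hypothetical name; the calculation uses decimal units, as the worked example does, and ignores segment and index overhead.

```java
final class StreamSizing {
    /** Required disk in bytes across all replicas for the given retention window. */
    static double requiredDiskBytes(long messagesPerSecond, long avgMessageBytes,
                                    int retentionDays, int replicaCount, double safetyFactor) {
        long logicalPerDay = messagesPerSecond * avgMessageBytes * 86_400L; // seconds per day
        long replicatedPerDay = logicalPerDay * replicaCount;
        return replicatedPerDay * (double) retentionDays * safetyFactor;
    }
}
```

Calling `StreamSizing.requiredDiskBytes(5_000, 2_000, 7, 3, 1.5)` reproduces the 27.216 TB figure above. That total is spread across the nodes hosting replicas, so per-node disk is roughly the total divided by the replica count when each node holds one replica.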

For quorum queues, storage depends heavily on backlog and delivery behavior.

A command queue should normally drain. If it grows for days, the system has an operational problem, not a retention strategy.

16. Memory Model

RabbitMQ uses memory for many runtime structures:

connections
channels
unacknowledged deliveries
queue index/runtime state
message metadata
buffers
plugins
management data

High memory usage often comes from application behavior:

excessive prefetch
too many channels
too many connections
slow consumers
giant messages
unbounded publish rate
many idle queues
management API abuse
large numbers of unacknowledged messages

Consumer prefetch is especially important.

If you have:

100 consumers
prefetch = 500
average message size = 100 KB

The theoretical unacked payload exposure is:

100 * 500 * 100 KB = 5 GB

That excludes metadata and JVM memory.

A senior deployment review always checks prefetch against memory budget.
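The prefetch check is simple enough to automate in a review script. The sketch below computes the exposure from the example and also inverts it to find the largest prefetch that fits a given budget. `PrefetchExposure` is a hypothetical name, and sizes are in decimal bytes (100 KB = 100,000 bytes) to match the 5 GB figure.

```java
final class PrefetchExposure {
    /** Worst-case unacknowledged payload bytes held in flight across all consumers. */
    static long unackedPayloadBytes(int consumers, int prefetch, long avgMessageBytes) {
        return (long) consumers * prefetch * avgMessageBytes;
    }

    /** Largest per-consumer prefetch whose worst-case exposure stays within the budget. */
    static int maxPrefetchFor(long budgetBytes, int consumers, long avgMessageBytes) {
        long perPrefetchUnit = (long) consumers * avgMessageBytes;
        return (int) Math.min(Integer.MAX_VALUE, budgetBytes / perPrefetchUnit);
    }
}
```

With 100 consumers and 100 KB messages, a 1 GB exposure budget allows a prefetch of 100, five times lower than the value in the example.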

17. Network Model

Replication and fanout multiply network usage.

For quorum queues:

producer sends to broker
leader replicates to followers
consumers receive deliveries
acknowledgements flow back
confirms flow to producers

For streams:

producer sends append entries
leader replicates to followers
consumers read chunks
multiple consumers may read same retained data

For fanout:

one logical event can become many queue copies

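Network model example: with a three-replica quorum queue and one consumer, each message payload crosses the network roughly four times (producer to leader, leader to two followers, broker to consumer), plus confirm and ack traffic, and possibly one more hop when the client is connected to a node that is not the leader.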

The production question is:

Is the network sized for physical traffic, not just logical message rate?

For replicated workloads, logical ingress is not equal to network traffic.
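A rough way to estimate physical traffic from logical ingress is to add up the payload copies listed above. This is a sketch under simplifying assumptions: `NetworkAmplification` is a hypothetical name, and the model counts payload bytes only, ignoring confirms, acks, protocol overhead, and any extra hop between the node a client connects to and the leader.

```java
final class NetworkAmplification {
    /**
     * Payload bytes per second crossing the network for one replicated queue or stream.
     * readersPerMessage is 1 for a work queue and the number of independent
     * consumers (or bound subscriber queues) for streams and fanout.
     */
    static long physicalBytesPerSecond(long ingressBytesPerSecond, int replicaCount, int readersPerMessage) {
        long producerToLeader = ingressBytesPerSecond;
        long leaderToFollowers = ingressBytesPerSecond * (replicaCount - 1);
        long brokerToConsumers = ingressBytesPerSecond * readersPerMessage;
        return producerToLeader + leaderToFollowers + brokerToConsumers;
    }
}
```

At 10 MB/s of logical ingress, three replicas and one consumer give 40 MB/s of payload traffic; a stream with four independent readers gives 70 MB/s.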

18. CPU Model

RabbitMQ CPU is affected by:

routing complexity
TLS
message serialization/deserialization at client side
persistence path
replication
management plugin polling
plugin overhead
connection churn
queue count
stream chunk handling
compression

The broker does not run your Java handler logic, but your application behavior can burn broker CPU.

Bad patterns:

excessive reconnect loops
declaring topology repeatedly on hot path
publishing tiny messages at very high rate without batching
using headers exchange for high-volume routing when topic/direct would do
overusing per-message priority
management API scraping too frequently

CPU tuning starts with topology and client behavior, not VM flags.

19. Message Size Policy

RabbitMQ can carry large messages, but large messages change the system shape.

Large messages increase:

memory exposure
disk writes
replication cost
network transfer time
GC pressure in Java clients
tail latency
retry/DLQ storage
management difficulty

A common production policy:

Message Size	Policy
< 64 KB	Normal path
64 KB – 1 MB	Review serialization and batching
1 MB – 10 MB	Strong review; consider object storage pointer
> 10 MB	Usually do not send through RabbitMQ directly

For large payloads, prefer:

store blob in object storage
send metadata pointer
include checksum
include authorization context
define expiration/cleanup policy

Do not turn RabbitMQ into a binary file transport unless you have proven the operational budget.
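The size policy table can be enforced at the publishing edge with a small classifier. This is a sketch: `MessageSizePolicy` and its `Verdict` values are hypothetical names, binary units (1 KB = 1024 bytes) are an assumption, and so is the choice to place each exact boundary value in the lower band except 64 KB, which the table puts in the review band.

```java
final class MessageSizePolicy {
    enum Verdict { NORMAL, REVIEW, STRONG_REVIEW, AVOID }

    private static final long KB = 1024L;
    private static final long MB = 1024L * KB;

    /** Classifies a payload size according to the policy table. */
    static Verdict classify(long payloadBytes) {
        if (payloadBytes < 64 * KB) return Verdict.NORMAL;          // < 64 KB
        if (payloadBytes <= MB) return Verdict.REVIEW;              // 64 KB to 1 MB
        if (payloadBytes <= 10 * MB) return Verdict.STRONG_REVIEW;  // 1 MB to 10 MB
        return Verdict.AVOID;                                       // > 10 MB
    }
}
```

A producer wrapper could log on `REVIEW`, require an explicit override on `STRONG_REVIEW`, and switch to the object-storage pointer pattern on `AVOID`.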

20. Workload Isolation

A production RabbitMQ cluster should not be one undifferentiated bucket.

Isolate by:

virtual host
user/permission
queue type
criticality
tenant
domain
environment
latency sensitivity
replay/audit requirements

Example vhost strategy:

Vhost	Purpose
`/cpq-prod`	CPQ production command/event workloads
`/oms-prod`	Order management production workloads
`/analytics-prod`	Stream consumers and analytics feeds
`/platform-prod`	Shared operational events
`/sandbox`	Non-critical test workloads

Isolation prevents one team from accidentally binding a massive fanout queue to another team's critical exchange.

But do not overdo it.

Too many vhosts can make governance harder.

Use vhosts for administrative isolation, not as a substitute for good topology design.

21. Policies as Operational Control

RabbitMQ policies let operators control queue behavior by name pattern.

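Use policies for:

dead-letter exchanges
TTL
max length
overflow behavior
stream retention
delivery limit
leader locator behavior

Queue type is not a policy key: it is fixed when the queue is declared (the x-queue-type argument) or taken from the virtual host's default queue type.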

Example policy idea:

# Queue type is fixed at declaration time; this policy only governs redelivery behavior.
rabbitmqctl set_policy critical-quorum '^critical\.' \
  '{"delivery-limit":20}' \
  --apply-to queues

Example stream retention policy idea:

rabbitmqctl set_policy order-streams '^orders\.' \
  '{"max-age":"7D","max-length-bytes":107374182400}' \
  --apply-to streams

Hardcoding queue arguments in application code is often less flexible than policy.

A good division:

Concern	Prefer Owner
exchange/queue names	application/platform contract
business routing keys	application/domain team
DLX name	platform policy plus application awareness
retention	platform + domain SLA
delivery limit	platform + application failure model
permissions	platform/security
message schema	application/domain governance

22. Deployment Profiles

22.1 Local Development

Goal:

fast feedback
simple topology
low durability expectations

Profile:

single node
management plugin enabled
test vhost
no assumption of HA
sample classic/quorum/stream resources

Do not infer production behavior from local single-node tests.

22.2 Integration Environment

Goal:

validate topology and contracts
test retry/DLQ
test publisher confirms
test consumer idempotency

Profile:

one or three nodes depending on budget
production-like queue types
realistic policies
reduced retention
contract tests

22.3 Staging / Pre-Production

Goal:

verify deployment, upgrade, failover, and performance

Profile:

same node count pattern as production
production queue types
realistic storage class
performance test data
failover drills
observability enabled

22.4 Production

Goal:

reliable business operation
explicit SLOs
recoverable failure modes

Profile:

multi-node cluster
quorum queues for critical commands
streams/super streams for replayable feeds
TLS
least privilege users
policy-based governance
monitored storage/memory/lag
documented runbooks
tested upgrades

23. Reference Deployment Architecture

A production-grade RabbitMQ architecture for Java microservices can be read as follows:

commands go through quorum queues
event notifications go through exchange bindings
replayable event history goes to stream
high-throughput partitioned feeds use super stream
poison messages are parked
audit consumers use replayable stream, not transient queues

This is not a universal blueprint.

It is a reference shape.

24. Java Application Implications

Your Java code must match the deployment model.

24.1 For Quorum Queues

Use:

durable queue declaration or topology managed by platform
persistent messages
publisher confirms
manual ack
idempotent consumers
bounded prefetch
retry/DLQ policy
connection recovery
metrics

Producer pattern:

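// Enable publisher confirms on this channel (once, not per message).
channel.confirmSelect();

AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
    .messageId(commandId)          // stable id, reused on retry for consumer-side dedup
    .correlationId(correlationId)
    .deliveryMode(2) // persistent
    .contentType("application/json")
    .type("quote.calculate.command.v1")
    .timestamp(new Date())
    .build();

// mandatory=true: unroutable messages are returned to the publisher,
// so the channel also needs a return listener to observe them.
channel.basicPublish(
    "cpq.commands",
    "quote.calculate.v1",
    true,
    props,
    payload
);

// Blocks until the broker confirms or nacks; throws TimeoutException after 5 seconds.
// A nack or timeout means the outcome is unknown: retry with the same messageId.
boolean confirmed = channel.waitForConfirms(5_000);
if (!confirmed) {
    throw new PublishNotConfirmedException(commandId);
}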

Consumer pattern:

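// Bound in-flight deliveries per consumer; size this against the memory budget.
channel.basicQos(50);

DeliverCallback callback = (consumerTag, delivery) -> {
    long tag = delivery.getEnvelope().getDeliveryTag();
    try {
        handler.handle(delivery);
        channel.basicAck(tag, false); // ack only after the handler has succeeded
    } catch (RetryableBusinessException ex) {
        channel.basicNack(tag, false, false); // requeue=false: route through DLX/retry topology
    } catch (FatalBusinessException ex) {
        channel.basicReject(tag, false); // requeue=false: dead-letter or discard by policy
    }
    // Any other exception leaves the delivery unacked; handle it explicitly in real code.
};

// autoAck=false: the consumer uses manual acknowledgements.
channel.basicConsume("quote.calculate.q", false, callback, tag -> {});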

24.2 For Streams

Use:

stable producer names if dedup is enabled
explicit offset strategy
named consumers if using server-side offset tracking
retention-aware replay
idempotent projectors
batch/checkpoint discipline
lag metrics

Stream consumer checkpoint pattern:

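Consumer consumer = environment.consumerBuilder()
    .stream("order-events.stream")
    .name("audit-projector")               // name under which the offset is tracked server-side
    .offset(OffsetSpecification.next())    // used only when no offset is stored for this name yet
    .manualTrackingStrategy()              // the application decides when to store offsets
        .builder()
    .messageHandler((context, message) -> {
        auditProjector.apply(message);     // must be idempotent: replay can repeat messages
        if (shouldCheckpoint()) {
            context.storeOffset();         // checkpoint after the projection is applied
        }
    })
    .build();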

Do not mix queue and stream semantics accidentally.

A queue ack means "this delivery is done".

A stream offset means "this consumer's progress has advanced".

25. Failure Matrix

Failure	Quorum Queue Behavior	Stream Behavior	Application Responsibility
producer loses connection before confirm	publish outcome ambiguous	publish outcome ambiguous	retry with idempotency/dedup key
broker node leader crashes	leader election if quorum remains	leader election if replicas remain	reconnect, tolerate latency spike
consumer crashes before ack	message redelivered	offset not advanced if not stored	idempotent handler
consumer crashes after DB commit before ack	message redelivered	offset may replay	dedup/inbox table
disk alarm	publishing may be blocked	append may slow/block	backpressure and alert
memory alarm	publisher flow control	appends/reads affected	reduce ingress, fix consumers
follower down	reduced redundancy	reduced redundancy	alert; restore replica before maintenance
majority lost	queue unavailable	stream unavailable	fail closed; do not pretend success
retention expires before replay	not applicable to normal queue	replay gap	alert on lag vs retention

The key pattern:

Broker replication protects data availability. It does not remove the need for application idempotency.
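The "dedup/inbox table" responsibility in the matrix can be sketched as follows. This is an in-memory stand-in for illustration only: `IdempotentInbox` is a hypothetical class, and a production inbox would record the message id in the same database transaction as the business side effect so that both survive a restart together.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotentInbox {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /**
     * Runs the side effect at most once per message id.
     * Returns false when the id was already processed (a redelivery or replay).
     */
    boolean handleOnce(String messageId, Runnable sideEffect) {
        if (!processed.add(messageId)) {
            return false; // duplicate: already applied, safe to ack without re-running
        }
        try {
            sideEffect.run();
            return true;
        } catch (RuntimeException e) {
            processed.remove(messageId); // allow a later redelivery to retry
            throw e;
        }
    }
}
```

In the "consumer crashes after DB commit before ack" row, the redelivered message hits the duplicate branch and is simply acknowledged.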

26. Monitoring Model

At minimum, monitor these dimensions.

26.1 Cluster Health

node up/down
cluster partition status
memory alarm
disk alarm
file descriptors
socket descriptors
Erlang process count
node uptime/restart count

26.2 Queue Health

ready messages
unacknowledged messages
publish rate
deliver rate
ack rate
redelivery rate
consumer count
consumer utilisation
queue leader
replica health
delivery limit/dead-letter activity

26.3 Stream Health

ingress rate
egress rate
disk usage
retention pressure
consumer offset
consumer lag
partition lag
replica status
leader distribution

26.4 Java Client Health

connection count
channel count
reconnect count
confirm latency
returned messages
publish failures
consumer processing latency
ack latency
handler failure rate
idempotency duplicate count
retry/DLQ count

26.5 Business Health

quote command age
order submission age
failed workflow count
duplicate command count
stuck saga count
audit projection lag
replay completion time

Do not alert only on broker metrics.

Messaging incidents are often application incidents expressed through the broker.

27. Alert Design

A bad alert:

Queue has more than 10,000 messages.

A better alert:

quote.calculate.q oldest message age exceeds 2 minutes for 5 minutes and consumer count is below expected baseline.

Reason:

Queue depth alone has no universal meaning.

A queue with 10,000 tiny messages and 5,000 msg/sec drain rate may be healthy.

A queue with 50 messages and oldest message age of 2 hours may be broken.

Use alerts based on:

oldest message age
consumer lag vs SLA
publish/consume rate divergence
redelivery rate
DLQ ingress rate
confirm latency
disk free time remaining
stream lag relative to retention
missing consumers
blocked connections
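The "better alert" above is a conjunction of three conditions, which a short predicate makes explicit. `QueueAgeAlert` is a hypothetical name, the two thresholds are hard-coded from the example, and in practice this logic would live in the alerting system's rule language rather than in application code.

```java
import java.time.Duration;

final class QueueAgeAlert {
    private static final Duration MAX_AGE = Duration.ofMinutes(2);
    private static final Duration SUSTAIN = Duration.ofMinutes(5);

    /**
     * Fires when the oldest message is older than 2 minutes, the breach has
     * lasted at least 5 minutes, and fewer consumers than expected are attached.
     */
    static boolean shouldFire(Duration oldestMessageAge, Duration timeAboveThreshold,
                              int consumerCount, int expectedConsumers) {
        boolean tooOld = oldestMessageAge.compareTo(MAX_AGE) > 0;
        boolean sustained = timeAboveThreshold.compareTo(SUSTAIN) >= 0;
        boolean consumersMissing = consumerCount < expectedConsumers;
        return tooOld && sustained && consumersMissing;
    }
}
```

Because this rule requires missing consumers, a queue that ages while all consumers are connected needs its own alert, for example on oldest message age alone with a higher threshold.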

28. Runbook: Queue Growth

Symptoms:

ready message count increasing
oldest message age increasing
consume rate below publish rate
business SLA violations

First questions:

Are consumers connected?
Are consumers acking?
Did downstream dependency slow down?
Did publish rate spike?
Did message size increase?
Did redelivery spike?
Is prefetch too low or too high?
Is one poison message blocking ordering-sensitive processing?
Are broker alarms active?
Did leader placement change?

Actions:

inspect consumer errors
check downstream service latency
verify DLQ/retry activity
scale consumers only if downstream can handle it
reduce producer ingress if necessary
park poison messages if safe
avoid purging unless business explicitly accepts loss

Never start with "purge the queue".

That is data deletion, not operations.

29. Runbook: Disk Alarm

Symptoms:

broker blocks publishers
publish latency spikes
confirms delayed
stream append slows
management shows disk alarm

Questions:

Which node is low on disk?
Which queues/streams consume the most disk?
Is stream retention too high?
Is consumer lag forcing retention to stay higher than the disk budget allows?
Did DLQ/parking lot grow?
Did message size change?
Is compaction/archive missing?

Actions:

stop unnecessary publishers if safe
restore consumers if lagging
reduce retention only with business approval
move/archive diagnostic data if policy allows
add storage if this is expected growth
investigate producer rate/message size change
do not delete stream segments blindly

Disk alarms are business incidents when messages are business records.
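During a disk incident the most useful number is how long the node has left at the current growth rate. A minimal sketch, assuming you can measure write and reclaim rates per node; `DiskRunway` is a hypothetical name, and because the broker raises its disk alarm at a configured free-space threshold rather than at zero, the usable runway is shorter than this estimate.

```java
import java.time.Duration;
import java.util.Optional;

final class DiskRunway {
    /**
     * Time until free space reaches zero at the current net growth rate.
     * Returns empty when disk usage is flat or shrinking.
     */
    static Optional<Duration> timeRemaining(long freeBytes, long writeBytesPerSecond,
                                            long reclaimBytesPerSecond) {
        long netGrowth = writeBytesPerSecond - reclaimBytesPerSecond;
        if (netGrowth <= 0) {
            return Optional.empty();
        }
        return Optional.of(Duration.ofSeconds(freeBytes / netGrowth));
    }
}
```

Alerting on this value (for example, less than a few hours remaining) gives operators time to act before publishers are blocked.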

30. Runbook: Node Failure

For a three-node cluster:

confirm which node failed
identify affected leaders
check quorum queue availability
check stream leader elections
monitor publish confirm latency
monitor client reconnect storms
avoid taking another node down
restore failed node
wait for replicas to catch up
verify no queues/streams have reduced redundancy

The dangerous state is not always the failure itself.

It is maintenance while redundancy is already degraded.

Rule:

Never perform planned maintenance that reduces quorum below the safety threshold.

31. Upgrade and Maintenance Principles

Production RabbitMQ upgrades need preparation.

Before upgrade:

read version-specific release notes
test in staging
verify plugin compatibility
verify client compatibility
verify queue type behavior
check deprecated features
check feature flags if applicable
ensure all queues/streams healthy
ensure no disk/memory alarms
snapshot configuration definitions
document rollback/forward plan

During upgrade:

one node at a time
wait for node health
verify quorum and stream replicas
monitor confirms and lag
watch reconnect spikes
pause non-critical heavy replay if needed

After upgrade:

verify cluster health
verify topology
verify policies
verify consumer counts
verify DLQ/retry flows
run smoke publish/consume tests
check dashboards for regressions

The operational invariant:

An upgrade should be a controlled failure drill, not a surprise chaos event.

32. Capacity Planning Worksheet

Use this worksheet before production approval.

Workload name:
Business criticality:
Queue/stream type:
Message type:
Average message size:
P95 message size:
Peak publish rate:
Peak consume rate:
Required retention:
Allowed backlog duration:
Allowed data loss:
Allowed duplicate behavior:
Ordering requirement:
Replay requirement:
Consumer groups:
Expected redelivery rate:
DLQ retention:
Replica count:
Node count:
Storage per node:
Network budget:
Confirm latency SLO:
Oldest message age SLO:
Consumer lag SLO:
Maintenance tolerance:
Disaster recovery strategy:

If you cannot fill this out, you are not ready to choose a deployment model.
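Several worksheet fields combine into a first-order disk estimate for a stream. A rough sketch, assuming time-based retention and ignoring index overhead and safety headroom; the names and numbers are hypothetical:

```java
/** First-order stream storage estimate from capacity worksheet fields. */
final class StreamDiskEstimate {

    /** Bytes held by one replica of the stream at the given retention. */
    static long bytesPerCopy(double publishPerSec, long avgMessageBytes, long retentionSeconds) {
        return (long) Math.ceil(publishPerSec * avgMessageBytes * retentionSeconds);
    }

    /** Bytes held across the whole cluster once every replica is counted. */
    static long clusterBytes(long bytesPerCopy, int replicaCount) {
        return bytesPerCopy * replicaCount;
    }

    /**
     * Average bytes per node, assuming many similar streams spread evenly.
     * A node hosting a replica of one stream still holds a full copy of that stream.
     */
    static long averageBytesPerNode(long bytesPerCopy, int replicaCount, int nodeCount) {
        return (clusterBytes(bytesPerCopy, replicaCount) + nodeCount - 1) / nodeCount;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 2000 msg/s, 1 KiB average size, 7 days retention.
        long perCopy = bytesPerCopy(2000.0, 1024, 7L * 24 * 3600);
        System.out.println(perCopy);                            // prints 1238630400000
        System.out.println(clusterBytes(perCopy, 3));           // prints 3715891200000
        System.out.println(averageBytesPerNode(perCopy, 3, 5)); // prints 743178240000
    }
}
```

Using the peak publish rate makes the estimate conservative. Use the average rate if peaks are short relative to the retention window.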

33. ADR Template: Queue Type Decision

Use this for every important queue/stream.

# ADR: Message Data Structure for <workload>

## Context

<What business workflow is this supporting?>

## Decision

We will use <quorum queue | stream | super stream | classic queue>.

## Rationale

- delivery guarantee required:
- replay requirement:
- ordering requirement:
- throughput requirement:
- retention requirement:
- failure tolerance:

## Consequences

- producer requirements:
- consumer requirements:
- monitoring requirements:
- operational runbooks:
- known trade-offs:

## Rejected Options

- <option 1> because ...
- <option 2> because ...

This prevents accidental infrastructure decisions.

34. Production Readiness Checklist

A RabbitMQ deployment is not production-ready because the pod is running.

It is production-ready when these are true:

Cluster

Queues

critical queues are quorum queues
transient queues are intentionally transient
DLX/retry/parking lot configured
delivery limit understood
queue length/age alerts configured
leader distribution monitored

Streams

retention defined by business requirement
disk sizing includes replica factor
consumer lag monitored
replay procedures tested
offset tracking strategy documented
partition key reviewed for super streams

Java Applications

publisher confirms enabled for critical messages
returned messages handled
manual ack used for critical consumers
idempotency implemented
prefetch aligned with worker capacity
connection recovery tested
metrics emitted
graceful shutdown tested
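"Idempotency implemented" is the item most often claimed without proof. A minimal sketch of a deduplicating handler, assuming every message carries a stable message ID. The in-memory set is a stand-in only; a real deployment would need a durable store shared by all consumer instances. The class name is hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Wraps business logic so a redelivered message is not applied twice. */
final class IdempotentHandler {

    // Assumption: in-memory only, so the record of processed IDs is lost on restart.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> businessLogic;

    IdempotentHandler(Consumer<String> businessLogic) {
        this.businessLogic = businessLogic;
    }

    /**
     * Returns true if the payload was processed and false if the ID was seen before.
     * The caller can ack in both cases. On failure the ID is released so that a
     * redelivery can retry.
     */
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // duplicate delivery
        }
        try {
            businessLogic.accept(payload);
            return true;
        } catch (RuntimeException e) {
            processedIds.remove(messageId);
            throw e;
        }
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler(p -> System.out.println("processed " + p));
        System.out.println(handler.handle("m-1", "order-created")); // prints "processed order-created", then true
        System.out.println(handler.handle("m-1", "order-created")); // prints false
    }
}
```

This sketch does not cover two deliveries of the same ID being processed at the same moment. If that can happen in your topology, the deduplication check and the business write need to share one transaction.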

Operations

35. Practice Drill

Build a local or staging three-node RabbitMQ cluster and create these workloads:

critical.command.q as quorum queue
notification.event.q as classic or quorum queue depending on criticality
orders.stream as stream with retention
orders.super as super stream with four partitions
critical.command.parking.q as parking lot queue

Then perform:

publish with confirms
consume with manual ack
kill a consumer before ack
kill a broker node
observe redelivery
observe leader election/recovery
fill queue faster than consumers drain
trigger DLQ route
replay stream from earlier offset
measure consumer lag

Write down what happened.

The important outcome is not that the system survives everything.

The important outcome is that you can predict what it will do.
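One prediction worth making before the drill is which partition of orders.super a given key lands on. A toy sketch of hash-based partition routing follows. `String.hashCode` is a stand-in chosen for illustration, so the partition numbers here will not match what a real client library computes:

```java
/** Toy model of hash-based partition routing for a partitioned stream. */
final class PartitionRouting {

    /** Maps a routing key to a partition index in [0, partitions). */
    static int partitionFor(String key, int partitions) {
        // floorMod keeps the result non-negative even when hashCode is negative.
        return Math.floorMod(key.hashCode(), partitions);
    }

    public static void main(String[] args) {
        int partitions = 4; // matches the four-partition super stream in the drill
        // The same key always maps to the same partition, which preserves per-key ordering.
        System.out.println(partitionFor("customer-42", partitions) == partitionFor("customer-42", partitions)); // prints true
        System.out.println(partitionFor("a", partitions)); // prints 1
    }
}
```

The mapping depends on the partition count. Changing the number of partitions moves keys, which is one reason the partition key and count deserve review before production.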

36. Key Takeaways

Deployment model is part of messaging semantics.
Quorum queues are the right default for critical durable task/command workloads.
Streams are for replayable append-only history, not simple task dispatch.
Super streams scale streams through partitioning, but force partition-key discipline.
Durability is a chain: durable topology, persistent messages, publisher confirms, replication, manual ack, idempotency.
Replication improves safety but increases write amplification and operational cost.
Disk, memory, network, and leader placement must be capacity-planned.
Java producers and consumers must be configured to match broker guarantees.
A production RabbitMQ deployment is defined by invariants, metrics, and runbooks, not by successful installation.

In the next part, we will deploy this model on Kubernetes using the RabbitMQ Cluster Operator, Messaging Topology Operator, storage policies, security controls, and upgrade-safe operational patterns.
