Benchmarking With PerfTest and Stream PerfTest

Learn Java RabbitMQ, RabbitMQ Streams, Patterns, and Deployment In Action - Part 029

Production-grade guide to benchmarking Java RabbitMQ systems with PerfTest and Stream PerfTest, covering benchmark design, workload modelling, AMQP 0-9-1 tests, stream protocol tests, latency interpretation, capacity baselines, failure benchmarking, and performance governance.

2026-07-01 | 29 min read | 5730 words

Part 029 — Benchmarking With PerfTest and Stream PerfTest

A RabbitMQ benchmark is not a contest to produce the largest msg/s number.

A useful benchmark answers a production question:

Can this topology handle the expected peak rate?
What is the p99 latency at the expected peak rate?
What happens when consumers slow down?
How many messages can safely backlog?
What is the cost of durability, replication, confirms, acknowledgements, message size, compression, and batching?
Which bottleneck fails first: producer, broker CPU, broker disk, network, consumer, database, JVM, or topology shape?

PerfTest and Stream PerfTest are useful because they let us isolate RabbitMQ behavior before mixing in real application code. PerfTest exercises AMQP 0-9-1 workloads through the Java client. Stream PerfTest exercises RabbitMQ Streams through the stream protocol and Stream Java Client.

This part teaches how to design benchmarks that produce decision-quality evidence instead of misleading charts.

1. Kaufman Deconstruction

To learn benchmarking quickly and correctly, decompose it into nine smaller skills:

Question design — define the exact production question before running a test.
Workload modelling — match message size, persistence, confirms, consumers, prefetch, topology, and payload shape to reality.
Isolation — test broker behavior separately from application behavior.
Controlled variation — change one variable at a time.
Measurement discipline — capture throughput, latency, resource use, backlog, confirms, lag, GC, network, and disk.
Warmup and steady state — ignore startup artifacts and measure enough time to catch tail behavior.
Failure benchmarking — test restarts, blocked connections, slow consumers, partition movement, DLQ spikes, and backlog drain.
Interpretation — explain the bottleneck using evidence, not intuition.
Governance — turn benchmark results into capacity baselines and regression gates.

The standard:

A benchmark is valid only when another engineer can reproduce the topology, workload, environment, and interpretation.

2. The Core Benchmarking Trap

The easiest benchmark to run is usually the least useful:

# Bad benchmark: impressive but underspecified
java -jar perf-test.jar --uri amqp://guest:guest@localhost:5672

This tells us almost nothing.

We do not know:

whether messages are persistent;
whether queues are durable;
whether publisher confirms are enabled;
whether consumers ack manually;
whether the broker is local or remote;
whether any reported latency is p50, p95, p99, or p999;
whether broker disk is involved;
whether replication is involved;
whether the workload resembles our application;
whether the bottleneck is producer, broker, consumer, network, or disk.

A production benchmark starts with a scenario.

Example:

We need to validate whether a three-node RabbitMQ cluster with quorum queues can process 8,000 persistent 2 KB command messages/sec, with publisher confirms, manual consumer acknowledgements, prefetch 100, four Java service replicas, and p99 end-to-end latency below 750 ms during normal operation and below 5 seconds during a 10-minute consumer slowdown.

That is benchmarkable.

3. Benchmark Question Template

Before running PerfTest, fill this out.

Field	Example
Business flow	order fulfillment command queue
Message type	command, event, replay, batch entry
Protocol	AMQP 0-9-1, stream protocol
Queue/stream type	quorum queue, classic queue, stream, super stream
Message size	512 B, 2 KB, 8 KB, 64 KB
Persistence	persistent, transient
Replication	one node, quorum, stream replicas
Producer behavior	confirms, no confirms, confirm batch size
Consumer behavior	manual ack, auto ack, prefetch
Topology	direct exchange, topic exchange, N queues, stream partitions
Target rate	average, p95, peak, burst
Latency SLO	p50, p95, p99, max tolerated
Backlog SLO	max queue depth, max stream lag
Failure scenario	broker restart, consumer pause, publisher overload
Pass/fail criterion	measurable threshold

If a field is unknown, the benchmark is not ready.
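
One way to enforce this rule is to hold the template in a typed record that reports which fields are still empty. This is a minimal sketch: the record name `BenchmarkQuestion`, its field names, and the `missingFields()` and `isReady()` methods are assumptions made for illustration and are not part of PerfTest.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record mirroring the template rows; all names are illustrative.
record BenchmarkQuestion(
        String businessFlow, String messageType, String protocol,
        String queueOrStreamType, String messageSize, String persistence,
        String replication, String producerBehavior, String consumerBehavior,
        String topology, String targetRate, String latencySlo,
        String backlogSlo, String failureScenario, String passFailCriterion) {

    // Names of template fields that are still null or blank.
    List<String> missingFields() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("businessFlow", businessFlow);
        fields.put("messageType", messageType);
        fields.put("protocol", protocol);
        fields.put("queueOrStreamType", queueOrStreamType);
        fields.put("messageSize", messageSize);
        fields.put("persistence", persistence);
        fields.put("replication", replication);
        fields.put("producerBehavior", producerBehavior);
        fields.put("consumerBehavior", consumerBehavior);
        fields.put("topology", topology);
        fields.put("targetRate", targetRate);
        fields.put("latencySlo", latencySlo);
        fields.put("backlogSlo", backlogSlo);
        fields.put("failureScenario", failureScenario);
        fields.put("passFailCriterion", passFailCriterion);

        List<String> missing = new ArrayList<>();
        fields.forEach((name, value) -> {
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        });
        return missing;
    }

    // The benchmark is ready to run only when every field is filled in.
    boolean isReady() {
        return missingFields().isEmpty();
    }

    public static void main(String[] args) {
        BenchmarkQuestion draft = new BenchmarkQuestion(
                "order fulfillment command queue", "command", "AMQP 0-9-1",
                "quorum queue", "2 KB", "persistent", "quorum",
                "confirms", "manual ack, prefetch 100", "direct exchange",
                "8000 msg/s peak", "p99 < 750 ms", null, "", "p99 within SLO");
        System.out.println("ready=" + draft.isReady() + " missing=" + draft.missingFields());
    }
}
```

A check like this can run at the start of a benchmark script so that an incomplete question stops the run before any load is generated.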

4. The Benchmarking Layers

A complete RabbitMQ performance program has four layers.

PerfTest cannot prove the whole application is fast. It proves the broker/topology can support a workload under controlled conditions.

If PerfTest fails, application code will not save you.

If PerfTest passes but the service fails, the bottleneck is likely in Java code, serialization, DB, downstream APIs, executor sizing, idempotency store, or observability overhead.

5. What PerfTest Is Good For

PerfTest is useful for AMQP 0-9-1 workloads:

queue throughput;
publish latency;
consumer throughput;
publisher confirms;
persistent vs transient messages;
manual ack behavior;
prefetch tuning;
queue type comparison;
message size impact;
producer/consumer count variation;
topology shape testing;
backlog creation and drain testing.

PerfTest is not a substitute for:

real serialization cost;
real business handler cost;
DB transaction cost;
idempotency table contention;
external API latency;
application logging/tracing cost;
security/encryption overhead;
full production network path;
realistic deployment autoscaling behavior.

Use it to isolate RabbitMQ, not to certify the whole system.

6. What Stream PerfTest Is Good For

Stream PerfTest is useful for RabbitMQ Stream workloads:

stream write throughput;
stream read throughput;
super stream partition scaling;
stream latency;
chunk/sub-entry batching effects;
compression trade-offs;
retention/backlog behavior;
offset-based consumption;
fan-out consumer scaling;
replay performance;
producer deduplication scenarios.

It is not a replacement for testing your actual stream consumer:

offset commit strategy;
replay-safe processing;
state store behavior;
aggregation/windowing performance;
event schema deserialization;
late-event correction;
checkpoint consistency.

7. Benchmark Environment Discipline

Always record environment details.

Dimension	Capture
RabbitMQ version	exact version and enabled plugins
Erlang/OTP version	exact version
Java version	benchmark client JVM
Broker deployment	bare metal, VM, container, Kubernetes
Node count	single node, three-node cluster, etc.
Queue type	classic, quorum, stream, super stream
Disk type	SSD/NVMe/network disk/ephemeral
Filesystem/storage class	especially on Kubernetes
CPU/memory	broker and benchmark client
Network	same host, same AZ, cross-AZ, cross-region
TLS	enabled/disabled
Compression	enabled/disabled
Broker config	memory watermark, disk limit, file handles
Test duration	warmup and measured interval

Without this metadata, benchmark numbers are anecdotes.

8. Workload Variables

Change one variable at a time.

Variable	Why It Matters
Message size	affects network, memory, disk, GC, serialization
Producer count	affects connection/channel load and broker ingress
Consumer count	affects dispatch, ack rate, ordering, downstream pressure
Queue type	classic/quorum/stream have different durability and replication models
Persistence	changes disk and safety characteristics
Confirms	affects producer throughput and safety
Confirm batch size	affects latency/throughput trade-off
Ack mode	affects loss risk and broker memory behavior
Prefetch	controls consumer in-flight work
Routing topology	affects exchange routing and queue fanout
Replication factor	affects write quorum and disk/network cost
TLS	affects CPU and latency
Compression	shifts bottleneck between CPU and network
Batch size	changes throughput, latency, duplicate amplification

The mistake is to tune ten variables and then claim victory.

9. Benchmark Taxonomy

Use a taxonomy so every test has a purpose.

Test Type	Question
Producer-only	How fast can the broker accept publishes?
Consumer-only	How fast can consumers drain existing backlog?
Balanced pub/sub	What is steady-state throughput and latency?
Burst test	How does the system absorb spikes?
Backlog drain	How quickly can the system recover?
Durability test	What is the cost of persistent messages and replication?
Confirm test	What is the cost of publisher confirms?
Prefetch test	What prefetch maximizes throughput without inflating latency?
Message size test	Where do CPU/network/disk become bottlenecks?
Fanout test	How does N subscribers change throughput and broker load?
Failure test	What happens during restart, pause, failover, or blocked connection?
Soak test	Does performance degrade over hours?

A single benchmark run cannot answer all of these.

10. PerfTest Baseline: Minimal AMQP 0-9-1 Run

A minimal baseline should still be explicit.

java -jar perf-test.jar \
  --uri amqp://user:pass@rabbitmq-1:5672/%2f \
  --queue q.benchmark.baseline \
  --producers 1 \
  --consumers 1 \
  --size 1024 \
  --rate 1000 \
  --time 300

This creates a starting point, not a conclusion.

Record:

publish rate;
consume rate;
latency distribution;
queue depth;
broker CPU;
broker memory;
disk I/O;
network I/O;
client CPU;
GC behavior;
warnings/alarms.

11. PerfTest With Publisher Confirms

Publisher confirms are required for serious at-least-once publish safety.

A confirm benchmark should vary confirm mode and confirm batch behavior.

Example shape:

# --rate is applied per producer: 4 producers x 2,000 msg/s = 8,000 msg/s in total.
java -jar perf-test.jar \
  --uri amqp://user:pass@rabbitmq-1:5672/%2f \
  --queue q.benchmark.confirms \
  --producers 4 \
  --consumers 4 \
  --size 2048 \
  --rate 2000 \
  --confirm 100 \
  --flag persistent \
  --time 600

Interpretation questions:

What is confirm latency at p95/p99?
Does confirm latency climb before broker CPU is saturated?
Is disk write latency the bottleneck?
Does increasing producer count improve throughput or only increase in-flight pressure?
Does a larger confirm batch reduce overhead but increase tail latency?
How much memory is used by unconfirmed messages on the client side?

Do not compare confirmed and unconfirmed publishing as equivalent workloads. They have different safety contracts.
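
The confirm window also puts a hard ceiling on publish rate. By Little's Law, the number of unconfirmed messages in flight equals publish rate times confirm latency, so a window of 100 per producer (the `--confirm 100` value used here) limits what the producers can reach. The sketch below is a back-of-envelope calculation under the assumption of steady state and a fully used window; the class and method names are hypothetical.

```java
// Back-of-envelope bound from Little's Law: in-flight = rate x latency.
final class ConfirmWindow {
    private ConfirmWindow() {}

    // Upper bound on aggregate publish rate (msg/s) when each producer may
    // hold at most windowPerProducer unconfirmed messages.
    static double maxPublishRate(int producers, int windowPerProducer, double confirmLatencySeconds) {
        if (confirmLatencySeconds <= 0) {
            throw new IllegalArgumentException("confirm latency must be > 0");
        }
        return producers * windowPerProducer / confirmLatencySeconds;
    }

    // Mean confirm latency (seconds) above which the confirm window,
    // rather than the broker, caps the target rate.
    static double latencyBudgetSeconds(int producers, int windowPerProducer, double targetRate) {
        if (targetRate <= 0) {
            throw new IllegalArgumentException("target rate must be > 0");
        }
        return producers * windowPerProducer / targetRate;
    }

    public static void main(String[] args) {
        // 4 producers, window of 100, 20 ms mean confirm latency.
        System.out.printf("max rate at 20 ms: %.0f msg/s%n", maxPublishRate(4, 100, 0.020));
        // Latency budget for an 8,000 msg/s target.
        System.out.printf("latency budget: %.3f s%n", latencyBudgetSeconds(4, 100, 8000));
    }
}
```

With 4 producers and a window of 100, a mean confirm latency above 50 ms makes 8,000 msg/s unreachable no matter how much broker headroom remains. That is a client-side limit and should not be reported as a broker bottleneck.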

12. PerfTest With Persistent Messages

Persistent messages survive a broker restart only when they sit in durable queues, and their cost depends on how the broker's storage behaves under write load.

A useful durability benchmark uses:

durable queues;
persistent messages;
publisher confirms;
manual acknowledgements;
realistic message size;
realistic queue type.

Example shape:

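# --rate is applied per producer: 6 producers x 2,000 msg/s = 12,000 msg/s in total.
# Manual ack is PerfTest's default; --autoack is a switch that takes no value, so it is omitted.
java -jar perf-test.jar \
  --uri amqp://user:pass@rabbitmq-1:5672/%2f \
  --queue q.benchmark.durable \
  --queue-args x-queue-type=quorum \
  --auto-delete false \
  --flag persistent \
  --producers 6 \
  --consumers 6 \
  --size 2048 \
  --rate 2000 \
  --confirm 100 \
  --qos 100 \
  --time 900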

Key observations:

Does throughput remain steady after warmup?
Does p99 latency drift upward?
Does queue depth oscillate or grow monotonically?
Does disk utilization approach saturation?
Are quorum queue leaders balanced across nodes?
Does consumer ack throughput match producer publish throughput?

13. Prefetch Benchmark Matrix

Prefetch is a work budget, not a magic speed knob.

Run the same workload with a fixed target rate and vary only prefetch.

Run	Producers	Consumers	Prefetch	Target Rate	Message Size
A	4	4	1	4,000/s	2 KB
B	4	4	10	4,000/s	2 KB
C	4	4	50	4,000/s	2 KB
D	4	4	100	4,000/s	2 KB
E	4	4	500	4,000/s	2 KB

Measure:

consume rate;
delivery latency;
processing latency, if using a custom client harness;
unacked message count;
consumer memory;
redelivery behavior after consumer kill;
fairness across consumers.

Expected shape:

too low: throughput limited by round-trips and dispatch inefficiency;
reasonable: stable throughput and bounded in-flight work;
too high: high unacked count, inflated tail latency, larger duplicate storm after crash.
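
The "work budget" view can be put into numbers before running the matrix. The sketch below computes the worst-case unacknowledged count for each prefetch value and a lower bound on prefetch derived from Little's Law. The class name, method names, and the 12.5 ms delivery-to-ack cycle are assumptions for illustration only.

```java
// Rough prefetch sizing; all names and example figures are hypothetical.
final class PrefetchBudget {
    private PrefetchBudget() {}

    // Worst-case unacknowledged messages held by consumers. This is also the
    // upper bound on redeliveries if every consumer is killed at once.
    static long maxUnacked(int consumers, int prefetch) {
        return (long) consumers * prefetch;
    }

    // Smallest prefetch that can keep one consumer busy at the target rate,
    // assuming each message spends cycleSeconds between delivery and ack.
    static int minPrefetch(double targetRatePerConsumer, double cycleSeconds) {
        return (int) Math.max(1, Math.ceil(targetRatePerConsumer * cycleSeconds));
    }

    public static void main(String[] args) {
        int consumers = 4;
        int[] prefetchValues = {1, 10, 50, 100, 500}; // runs A to E
        for (int prefetch : prefetchValues) {
            System.out.printf("prefetch=%d maxUnacked=%d%n", prefetch, maxUnacked(consumers, prefetch));
        }
        // 4,000 msg/s over 4 consumers is 1,000 msg/s each.
        System.out.println("minPrefetch=" + minPrefetch(1000, 0.0125));
    }
}
```

Values below the computed minimum should show up as throughput loss in the matrix, and values far above it mostly add in-flight work and redelivery exposure.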

14. Message Size Benchmark Matrix

Message size can shift the bottleneck completely.

Run	Size	Hypothesis
A	256 B	protocol overhead dominates
B	1 KB	common command/event baseline
C	4 KB	normal business payload
D	16 KB	network and memory begin to matter
E	64 KB	disk/network/GC pressure likely visible
F	256 KB	anti-pattern candidate unless justified

For each size, keep topology and rate strategy constant.

Measure:

throughput;
latency percentiles;
broker memory;
network throughput;
disk throughput;
client CPU;
GC allocation rate;
confirm latency.

Design rule:

If message size dominates performance, fix payload design before tuning broker knobs.
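
A quick bandwidth estimate shows where each run in the matrix is likely to hit the network. The sketch below counts payload bytes only, at an example rate of 8,000 msg/s. Protocol framing, replication traffic between nodes, and consumer egress all come on top. The class and method names are hypothetical.

```java
// Payload-only ingress bandwidth estimate; names are illustrative.
final class PayloadBandwidth {
    private PayloadBandwidth() {}

    // Megabits per second of payload entering the broker.
    static double megabitsPerSecond(double messagesPerSecond, int payloadBytes) {
        return messagesPerSecond * payloadBytes * 8 / 1_000_000.0;
    }

    public static void main(String[] args) {
        double rate = 8_000; // example target rate in msg/s
        int[] sizes = {256, 1024, 4096, 16_384, 65_536, 262_144}; // runs A to F
        for (int size : sizes) {
            System.out.printf("%7d B -> %10.1f Mbit/s%n", size, megabitsPerSecond(rate, size));
        }
    }
}
```

At this rate the 256 KB run needs more than 16 Gbit/s of payload ingress alone, which is why that size is listed as an anti-pattern candidate.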

15. Fanout Benchmark

Fanout changes the cost model. One publish can become N queue writes.

Benchmark with subscriber counts:

Run	Subscriber Queues	Consumers Per Queue	Message Size	Target Publish Rate
A	1	2	2 KB	5,000/s
B	3	2	2 KB	5,000/s
C	10	2	2 KB	5,000/s
D	25	2	2 KB	5,000/s

Observe:

broker write amplification;
per-queue depth;
slow subscriber impact;
routing CPU;
queue leader balance;
disk amplification;
alert volume.

If many subscribers need replay, RabbitMQ Streams may be a better fit than queue-per-subscriber fanout.
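
The amplification in the table can be estimated up front. The sketch below assumes every published message is routed to every subscriber queue and, for the replicated case, that each queue write is repeated on each replica. The class name, method names, and the replica count of 3 are illustrative assumptions.

```java
// Fanout write amplification estimate; names are hypothetical.
final class FanoutAmplification {
    private FanoutAmplification() {}

    // Queue writes per second: one enqueue per subscriber queue per publish.
    static long queueWritesPerSecond(long publishRate, int subscriberQueues) {
        return publishRate * subscriberQueues;
    }

    // Writes per second across all replicas when each queue is replicated.
    static long replicaWritesPerSecond(long publishRate, int subscriberQueues, int replicasPerQueue) {
        return queueWritesPerSecond(publishRate, subscriberQueues) * replicasPerQueue;
    }

    public static void main(String[] args) {
        long publishRate = 5_000;
        int[] subscriberQueues = {1, 3, 10, 25}; // runs A to D
        for (int queues : subscriberQueues) {
            System.out.printf("queues=%d queueWrites=%d/s replicaWrites(x3)=%d/s%n",
                    queues,
                    queueWritesPerSecond(publishRate, queues),
                    replicaWritesPerSecond(publishRate, queues, 3));
        }
    }
}
```

Run D turns 5,000 publishes per second into 125,000 queue writes per second before replication is counted.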

16. Queue Type Comparison

Do not benchmark queue types as if they are interchangeable.

Queue Type	Why Benchmark It
Classic queue	baseline and compatibility
Quorum queue	data safety, replication, poison handling via delivery limit
Stream	log-like retention, replay, high fan-out, large backlog

A comparison must keep the target semantic clear.

Bad comparison:

Classic queue is faster, therefore better.

Better comparison:

For this command queue, we require replicated data safety. Compare quorum queue throughput with required confirms and manual ack against the actual SLO.

Benchmark against requirements, not against vanity throughput.

17. Stream PerfTest Baseline

Stream PerfTest uses the RabbitMQ stream protocol.

A baseline stream run should specify:

stream name;
producers;
consumers;
message size;
rate;
duration;
retention if relevant;
replicas if relevant;
super stream partitioning if relevant.

Example shape:

java -jar stream-perf-test.jar \
  --uris rabbitmq-stream://user:pass@rabbitmq-1:5552 \
  --streams benchmark.stream.baseline \
  --producers 1 \
  --consumers 1 \
  --size 1024 \
  --rate 10000 \
  --time 300

Measure:

publish rate;
consume rate;
latency;
consumer lag;
broker disk write rate;
broker network;
producer CPU;
consumer CPU;
retention growth.

18. Stream Replay Benchmark

Replay is a core stream capability.

Test it explicitly:

Produce a fixed dataset, for example 100 million messages.
Stop producers.
Start consumers from the beginning.
Measure read throughput and lag reduction.
Repeat with different consumer counts.
Repeat with different message sizes.
Repeat with real deserialization if using a custom harness.

Questions:

How quickly can a new projection be built?
How quickly can a consumer recover after being offline?
Does replay saturate broker disk, network, or consumer CPU?
Can replay run while live producers continue publishing?
Does replay affect live consumer latency?

A stream system that cannot replay within the operational recovery window is not production-ready.

19. Super Stream Benchmark

Super streams add partitioning.

Benchmark scaling by partition count:

Run	Partitions	Producers	Consumers	Routing Key Cardinality
A	1	2	2	1M keys
B	3	3	3	1M keys
C	6	6	6	1M keys
D	12	12	12	1M keys

Measure:

total throughput;
per-partition throughput;
hot partition skew;
consumer assignment balance;
publish latency;
consumer lag per partition;
leader distribution across nodes.

Partitioning only helps when routing keys distribute load well.

A hot key can defeat the entire super stream design.
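
Skew can be checked on a sample of real routing keys before any load is generated. The sketch below uses `String.hashCode` modulo the partition count as a stand-in for routing. The real super stream routing strategy uses its own hash function, so exact placement will differ, but a dominant key still lands in a single partition either way. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Partition skew estimate; hashing here is a stand-in, not the client's algorithm.
final class PartitionSkew {
    private PartitionSkew() {}

    // Counts how many keys fall into each partition.
    static long[] assign(List<String> routingKeys, int partitions) {
        long[] counts = new long[partitions];
        for (String key : routingKeys) {
            counts[Math.floorMod(key.hashCode(), partitions)]++;
        }
        return counts;
    }

    // Skew = busiest partition divided by mean partition load; 1.0 is perfectly even.
    static double skew(long[] counts) {
        long max = 0;
        long total = 0;
        for (long count : counts) {
            max = Math.max(max, count);
            total += count;
        }
        if (total == 0) {
            return 1.0; // no traffic, nothing to skew
        }
        return max / ((double) total / counts.length);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 6_000; i++) {
            keys.add("customer-" + i);
        }
        System.out.printf("spread keys skew=%.2f%n", skew(assign(keys, 6)));

        // One tenant producing all traffic: every message lands in one partition.
        List<String> hot = Collections.nCopies(6_000, "tenant-42");
        System.out.printf("single hot key skew=%.2f%n", skew(assign(hot, 6)));
    }
}
```

A skew equal to the partition count means one partition carries everything, so adding partitions adds no capacity for that workload.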

20. Compression and Batch Benchmark for Streams

For streams, batching and compression can increase throughput but change latency and duplicate behavior.

Benchmark dimensions:

Variable	Values
Message size	512 B, 2 KB, 8 KB
Batch/sub-entry count	off, 10, 100, 500
Compression	none, gzip, snappy/lz4/zstd if available
Producer count	1, 4, 8
Partition count	1, 3, 6, 12

Measure:

throughput;
p99 latency;
CPU cost;
network reduction;
disk reduction;
duplicate amplification under retry;
consumer decompression cost.

Interpretation rule:

Compression is worth it when network/disk is the bottleneck and CPU headroom exists. It is harmful when CPU is already the bottleneck or latency budget is tight.
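
Whether compression pays off depends heavily on the payload, so it helps to probe real message batches first. The sketch below uses the JDK's `java.util.zip` gzip implementation as a stand-in for whichever codec the stream client applies. The synthetic JSON batch and all names are assumptions, and a single unwarmed timing is indicative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Offline compression probe; names and sample data are hypothetical.
final class CompressionProbe {
    private CompressionProbe() {}

    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // stream is closed, so the trailer is written
    }

    // Compressed size divided by original size; lower means better compression.
    static double ratio(byte[] input) {
        return (double) gzip(input).length / input.length;
    }

    public static void main(String[] args) {
        // Synthetic batch of 100 similar events, roughly one sub-entry batch.
        StringBuilder batch = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            batch.append("{\"orderId\":").append(i)
                 .append(",\"status\":\"CREATED\",\"currency\":\"EUR\"}\n");
        }
        byte[] payload = batch.toString().getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        double ratio = ratio(payload);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.printf("bytes=%d ratio=%.2f elapsedMicros=%d%n", payload.length, ratio, elapsedMicros);
    }
}
```

A ratio close to 1.0, typical for already compressed or encrypted payloads, means compression will cost CPU without reducing network or disk load.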

21. Latency: Do Not Trust Averages

Average latency hides the user-visible failure.

Always capture:

p50;
p90;
p95;
p99;
p999 if the workload is latency-sensitive;
max;
latency under backlog;
latency after broker restart;
confirm latency separately from end-to-end delivery latency.

Example interpretation:

Metric	Value	Meaning
p50	20 ms	normal path is healthy
p95	100 ms	mild queueing exists
p99	2,500 ms	tail is failing SLO
max	30 s	stall or failover visible

A benchmark with good average latency and bad p99 is a failed benchmark for most production workflows.
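
The gap between average and tail is easy to reproduce. The sketch below computes nearest-rank percentiles over a synthetic sample set in which 3 of 100 messages hit a 2.5-second stall. The data and the class and method names are illustrative assumptions, not PerfTest output.

```java
import java.util.Arrays;

// Nearest-rank percentile over raw latency samples; names are hypothetical.
final class LatencyPercentiles {
    private LatencyPercentiles() {}

    // Smallest sample such that at least p percent of samples are <= it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        Arrays.fill(samples, 20);       // healthy path: 20 ms
        samples[10] = 2_500;            // three stalled deliveries
        samples[50] = 2_500;
        samples[90] = 2_500;

        // The mean is about 94 ms and looks acceptable; p99 is 2,500 ms.
        System.out.printf("mean=%.1f p50=%d p95=%d p99=%d%n",
                mean(samples), percentile(samples, 50),
                percentile(samples, 95), percentile(samples, 99));
    }
}
```

For long runs a histogram with bounded memory is a better fit than sorting raw samples, but the interpretation is the same.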

22. Throughput: Sustainable vs Burst

There are two different throughput numbers.

Type	Meaning
Burst throughput	temporary rate before backlog or latency grows
Sustainable throughput	rate that can run indefinitely without unbounded queue depth or latency drift

To find sustainable throughput:

Set a target publish rate.
Run long enough to reach steady state.
Verify queue depth or lag does not grow monotonically.
Verify p99 latency remains bounded.
Verify broker resources remain below alert thresholds.
Verify consumer throughput equals producer throughput.
Repeat with a higher rate.

The maximum sustainable rate is the highest rate that passes all conditions.
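
Part of this procedure can be automated. The sketch below checks three of the conditions: queue depth trend (via a least-squares slope), bounded p99, and consumer rate keeping up with publish rate. Broker resource thresholds are left out. The class name, the 1% rate tolerance, and the slope tolerance parameter are assumptions to be replaced with your own criteria.

```java
// Steady-state check for one benchmark run; thresholds are hypothetical.
final class SustainabilityCheck {
    private SustainabilityCheck() {}

    // Least-squares slope of queue depth, in messages per sample interval.
    static double depthSlope(long[] depthSamples) {
        int n = depthSamples.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long depth : depthSamples) {
            meanY += depth;
        }
        meanY /= n;
        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (depthSamples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    // Passes when depth is not trending upward, p99 is within the SLO,
    // and consumers complete at least 99% of the publish rate.
    static boolean isSustainable(long[] depthSamples, double slopeTolerance,
                                 double p99Ms, double p99SloMs,
                                 double publishRate, double consumeRate) {
        return depthSlope(depthSamples) <= slopeTolerance
                && p99Ms <= p99SloMs
                && consumeRate >= 0.99 * publishRate;
    }

    public static void main(String[] args) {
        long[] oscillating = {1000, 1200, 900, 1100, 1000, 1150, 950, 1050};
        long[] growing = {1000, 2000, 3000, 4000};
        System.out.println(isSustainable(oscillating, 5, 640, 750, 8000, 8000)); // true
        System.out.println(isSustainable(growing, 5, 640, 750, 8000, 7000));     // false
    }
}
```

Stepping the target rate upward and keeping the last rate for which the check passes gives a defensible estimate of the maximum sustainable rate.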

23. Backlog Drain Benchmark

Backlog drain matters after outages.

Scenario:

Producers publish at normal rate.
Consumers are stopped for 10 minutes.
Queue depth or stream lag grows.
Consumers restart with normal replica count.
Measure drain time.
Repeat with temporary scale-out.

Pass/fail criteria:

backlog drains within recovery objective;
p99 latency returns to normal after drain;
broker does not hit memory/disk alarms;
consumers do not overload downstream DB/API;
duplicate/redelivery rate remains expected.

Backlog drain is where many “high throughput” systems fail operationally.
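
Drain time can be estimated before the test so that the measured result has something to be compared against. The sketch below assumes constant publish and consume rates, and the example figures of 8,000 msg/s and 10,000 msg/s consumer capacity are hypothetical, as are the names.

```java
// Backlog and drain-time estimate under constant rates; names are illustrative.
final class BacklogDrain {
    private BacklogDrain() {}

    // Messages accumulated while consumers are stopped.
    static double backlog(double publishRate, double outageSeconds) {
        return publishRate * outageSeconds;
    }

    // Seconds to drain while producers keep publishing.
    // Infinite when consumers have no headroom over the publish rate.
    static double drainSeconds(double backlog, double publishRate, double consumeCapacity) {
        double headroom = consumeCapacity - publishRate;
        return headroom <= 0 ? Double.POSITIVE_INFINITY : backlog / headroom;
    }

    public static void main(String[] args) {
        double backlog = backlog(8_000, 600); // 10-minute consumer outage
        System.out.printf("backlog=%.0f messages%n", backlog);
        System.out.printf("normal replicas: %.0f s%n", drainSeconds(backlog, 8_000, 10_000));
        System.out.printf("scaled out:      %.0f s%n", drainSeconds(backlog, 8_000, 16_000));
    }
}
```

In this example a 10-minute outage takes 40 minutes to drain with 25% consumer headroom, and 10 minutes when consumer capacity is doubled. If the measured drain is much slower than the estimate, the consumers or their downstream dependencies are not delivering the assumed capacity.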

24. Failure Benchmark: Broker Restart

Benchmark failover/restart as a normal event.

For queues:

publish with confirms;
consume with manual ack;
restart broker node or queue leader;
measure publish interruption;
measure confirm latency spike;
measure redelivery count;
measure queue recovery time;
verify no message loss according to idempotency ledger.

For streams:

publish continuously;
consume continuously;
restart stream leader node;
measure publish pause;
measure consumer lag spike;
verify offset recovery;
verify replay safety.

Failure benchmarking must be done before production, not during the first incident.

25. Failure Benchmark: Slow Consumer

Slow consumers reveal backpressure correctness.

Scenario:

Measure:

queue depth growth rate;
unacked messages;
p99 processing latency;
publisher confirm latency;
DLQ growth;
consumer memory;
downstream saturation;
recovery time after downstream returns.

The goal is not to avoid slowdown. The goal is bounded, observable degradation.

26. Coordinated Omission

A common latency mistake:

If the producer slows down when the system slows down, the benchmark may stop measuring the worst latency.

This is coordinated omission.

To reduce the risk:

use fixed-rate load where appropriate;
measure queueing delay separately;
measure publish timestamp to consume timestamp;
record backlog/lag during latency measurement;
measure during saturation and recovery;
do not discard timeout/error samples silently.

If the system is overloaded and your chart says latency is fine, your measurement is probably wrong.
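
The effect is easy to demonstrate with a deterministic simulation. In the sketch below a sender intends to publish every 10 ms but waits for each response before sending the next. One request stalls for 1 second. Measuring from the actual send time records a single slow sample, while measuring from the intended send time exposes every request that was delayed behind the stall. The class name and the numbers are assumptions for illustration.

```java
// Deterministic illustration of coordinated omission; no real I/O or sleeping.
final class CoordinatedOmissionDemo {
    private CoordinatedOmissionDemo() {}

    // serviceMs[i] is how long request i takes once it is actually sent.
    // Returns { naive latencies, corrected latencies } in milliseconds.
    static long[][] simulate(long intervalMs, long[] serviceMs) {
        long[] naive = new long[serviceMs.length];
        long[] corrected = new long[serviceMs.length];
        long senderFreeAt = 0;
        for (int i = 0; i < serviceMs.length; i++) {
            long intendedSend = i * intervalMs;
            long actualSend = Math.max(intendedSend, senderFreeAt);
            long done = actualSend + serviceMs[i];
            naive[i] = done - actualSend;       // what a blocked sender records
            corrected[i] = done - intendedSend; // what the workload experienced
            senderFreeAt = done;
        }
        return new long[][] {naive, corrected};
    }

    static long countAbove(long[] latencies, long thresholdMs) {
        long count = 0;
        for (long latency : latencies) {
            if (latency > thresholdMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] service = new long[20];
        java.util.Arrays.fill(service, 1);
        service[5] = 1_000; // one stalled request
        long[][] result = simulate(10, service);
        // naive: 1 slow sample, corrected: 15 slow samples
        System.out.printf("naive>100ms=%d corrected>100ms=%d%n",
                countAbove(result[0], 100), countAbove(result[1], 100));
    }
}
```

The same idea applies to a real harness: schedule sends on a fixed timetable and compute latency from the scheduled time, not from the moment the client managed to send.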

27. Application-Level Benchmark Harness

PerfTest isolates broker behavior. A Java harness isolates your client code.

A minimal harness should implement:

real ConnectionFactory config;
real exchange/queue declarations;
real message converter;
real publisher confirm handling;
real consumer acknowledgement policy;
synthetic business handler with configurable delay/error rate;
metrics for publish, confirm, delivery, processing, ack, retry;
bounded executor;
graceful shutdown;
structured output.

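/** Workload plugged into the harness: builds payloads and handles deliveries. */
public interface BenchmarkWorkload {
    /** Returns the message body for the given publish sequence number. */
    byte[] payload(int sequence);

    /** Synthetic business handler; throwing signals a failed delivery. */
    void handle(byte[] body) throws Exception;
}

/** Counters captured for one benchmark run. */
public record BenchmarkMetrics(
        long published,   // messages handed to the client for publishing
        long confirmed,   // publishes confirmed by the broker
        long consumed,    // deliveries received by consumers
        long acked,       // deliveries acknowledged after processing
        long failed,      // deliveries whose handler threw
        long redelivered  // deliveries flagged as redelivered
) {}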

This harness reveals overhead hidden by PerfTest:

serialization;
allocation;
JSON parsing;
validation;
logging;
tracing;
idempotency checks;
transaction boundaries.

28. Benchmark Output Format

Every benchmark should produce a result record.

benchmarkId: rabbitmq-command-quorum-2kb-confirm-prefetch100-2026-07-01
purpose: Validate command queue capacity for order workflow
environment:
  rabbitmqVersion: "x.y.z"
  erlangVersion: "x.y.z"
  javaVersion: "21"
  brokerNodes: 3
  deployment: kubernetes
  storage: nvme-backed-storage-class
workload:
  protocol: amqp-0-9-1
  queueType: quorum
  messageSizeBytes: 2048
  producers: 6
  consumers: 6
  publisherConfirms: true
  persistentMessages: true
  manualAck: true
  prefetch: 100
  targetRatePerSecond: 8000
  durationSeconds: 900
results:
  publishRatePerSecond: 8000
  consumeRatePerSecond: 8000
  p50LatencyMs: 35
  p95LatencyMs: 210
  p99LatencyMs: 640
  maxLatencyMs: 1800
  maxQueueDepth: 12000
  brokerCpuMaxPercent: 72
  brokerDiskUtilMaxPercent: 68
  clientGcPauseP99Ms: 12
conclusion:
  status: pass
  notes:
    - Meets p99 latency SLO under normal load.
    - Disk becomes primary bottleneck above 11k msg/s.

This format makes benchmark history comparable.

29. Benchmark Matrix for Command Queue

Use this matrix for a command queue workload.

Run	Queue	Persistent	Confirms	Manual Ack	Producers	Consumers	Prefetch	Size	Rate
C1	quorum	yes	yes	yes	2	2	50	1 KB	2k/s
C2	quorum	yes	yes	yes	4	4	100	1 KB	5k/s
C3	quorum	yes	yes	yes	6	6	100	2 KB	8k/s
C4	quorum	yes	yes	yes	8	8	200	2 KB	12k/s
C5	quorum	yes	yes	yes	8	8	200	8 KB	8k/s
C6	quorum	yes	yes	yes	8	0	n/a	2 KB	8k/s

C6 is backlog creation.

Then run drain tests by restarting consumers.

30. Benchmark Matrix for Event Fanout

Run	Exchange	Subscriber Queues	Queue Type	Producers	Consumers/Queue	Size	Rate
E1	topic	1	quorum	4	2	1 KB	5k/s
E2	topic	5	quorum	4	2	1 KB	5k/s
E3	topic	10	quorum	4	2	1 KB	5k/s
E4	topic	20	quorum	4	2	1 KB	5k/s

Measure broker write amplification.

If E4 cannot pass, do not solve it with blind broker scaling first. Reconsider topology:

stream fanout;
fewer subscribers;
event aggregation;
selective routing;
subscriber-side filtering moved to broker routing keys;
separating hot event categories.

31. Benchmark Matrix for Streams

Run	Stream Type	Partitions	Producers	Consumers	Size	Batch	Compression	Rate
S1	stream	1	1	1	1 KB	off	none	10k/s
S2	stream	1	4	4	1 KB	on	none	50k/s
S3	super stream	3	6	6	1 KB	on	none	100k/s
S4	super stream	6	12	12	1 KB	on	none	200k/s
S5	super stream	6	12	12	8 KB	on	compression	100k/s

Measure:

throughput scaling by partition;
latency under each partition count;
hot partition skew;
consumer lag;
replay throughput;
CPU/disk/network distribution.

32. Reading Broker Metrics During Benchmark

Important broker-side metrics:

Metric	Meaning
publish rate	ingress
deliver/get rate	egress
ack rate	completion rate
confirm rate	broker responsibility acceptance rate
queue depth	backlog
unacked messages	consumer in-flight work
redeliver rate	retry/consumer failure signal
memory used	broker memory pressure
disk free / disk alarm	storage pressure
file descriptors	connection/channel/queue pressure
socket descriptors	network pressure
connection blocked	flow control visible to publisher
stream lag	consumer progress behind stream tail
stream segment growth	retention/storage pressure

A benchmark is incomplete without broker metrics.

33. Reading Client Metrics During Benchmark

Important Java client metrics:

Metric	Meaning
publish attempts	offered load
publish success	accepted by client path
confirms	broker accepted responsibility
confirm latency	broker/disk/replication pressure
returned messages	routing failure
in-flight messages	producer memory risk
deliveries	consumer ingress
processing latency	handler/downstream pressure
ack latency	completion delay
redeliveries	failure/retry loop
executor queue depth	JVM local backpressure
allocation rate	GC pressure
GC pause	tail latency risk
thread count	concurrency health
socket write wait	network/backpressure signal

Broker metrics tell you what RabbitMQ sees. Client metrics tell you what the service experiences.

34. Benchmark Interpretation Patterns

34.1 Throughput low, broker CPU low

Likely causes:

producer bottleneck;
consumer bottleneck;
network bottleneck;
confirm window too small;
benchmark client CPU saturated;
rate limit too low;
single channel bottleneck;
TLS/client-side encryption overhead.

Next action:

inspect client CPU;
increase producer count;
increase in-flight confirm window carefully;
separate benchmark client from broker host;
inspect network throughput.

34.2 Throughput high, p99 latency bad

Likely causes:

queueing delay;
disk flush latency;
GC pauses;
confirm batching too large;
consumer prefetch too high;
backlog oscillation;
node imbalance.

Next action:

inspect latency over time, not just summary;
inspect queue depth/lag correlation;
reduce batch/prefetch;
rebalance leaders;
check disk latency.

34.3 Queue depth grows steadily

Likely causes:

producer rate > consumer completion rate;
downstream consumer dependency slow;
prefetch too low;
insufficient consumers;
poison messages causing retry loops;
broker throttling deliveries.

Next action:

compare publish rate vs ack rate;
inspect consumer processing latency;
inspect DLQ and redelivery;
increase consumers only if downstream can handle it.

34.4 Confirm latency rises before CPU saturation

Likely causes:

disk latency;
quorum replication latency;
network between cluster nodes;
leader placement;
storage throttling;
memory/disk alarm near threshold.

Next action:

inspect disk I/O;
inspect cluster inter-node network;
inspect queue leader distribution;
test with smaller message size;
test single-node vs replicated to isolate replication cost.

35. Validating Capacity With Little’s Law

Part 028 introduced Little’s Law:

L = λ × W

Where:

L = average number of messages in system;
λ = arrival rate;
W = average time in system.

Use benchmark data to sanity-check capacity.

Example:

arrival rate = 8,000 msg/s
average end-to-end latency = 0.25 s
expected in-system messages = 8,000 × 0.25 = 2,000

If the benchmark reports average in-system messages around 20,000 instead of 2,000, either latency measurement is wrong, queueing is hidden, or workload is unstable.
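
This sanity check is simple enough to run on every benchmark result. The sketch below compares the observed in-system count with the value predicted by Little's Law. The class name, method names, and the tolerance factor of 2 are assumptions for illustration.

```java
// Little's Law consistency check for benchmark results; names are hypothetical.
final class LittlesLawCheck {
    private LittlesLawCheck() {}

    // L = lambda x W
    static double expectedInSystem(double arrivalRate, double meanLatencySeconds) {
        return arrivalRate * meanLatencySeconds;
    }

    // Observed divided by expected in-system messages; 1.0 means consistent.
    static double discrepancy(double observedInSystem, double arrivalRate, double meanLatencySeconds) {
        return observedInSystem / expectedInSystem(arrivalRate, meanLatencySeconds);
    }

    // True when the observed value is within the tolerance factor of the expected one.
    static boolean consistent(double observedInSystem, double arrivalRate,
                              double meanLatencySeconds, double toleranceFactor) {
        double ratio = discrepancy(observedInSystem, arrivalRate, meanLatencySeconds);
        return ratio <= toleranceFactor && ratio >= 1.0 / toleranceFactor;
    }

    public static void main(String[] args) {
        System.out.println(expectedInSystem(8_000, 0.25));          // 2000.0
        System.out.println(consistent(2_300, 8_000, 0.25, 2.0));    // true
        System.out.println(consistent(20_000, 8_000, 0.25, 2.0));   // false
    }
}
```

A failed check does not say which number is wrong. It only says the run needs a closer look before its latency or throughput figures are trusted.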

36. Benchmarking Consumer Handler Delay

PerfTest can simulate consumer behavior, but a custom Java harness is better for service-like delay.

Scenarios:

Handler Delay	Meaning
0 ms	broker/client maximum drain
5 ms	light CPU/DB path
25 ms	normal business transaction
100 ms	slow downstream dependency
1 s	outage/backpressure scenario

Run each with the same prefetch and consumer count.

Expected:

max throughput ≈ consumers × concurrency / processing_time

Example:

8 workers, 25 ms each
capacity ≈ 8 / 0.025 = 320 msg/s

If the benchmark shows 5,000 msg/s with 25 ms handler and only 8 workers, the measurement is wrong or acknowledgements are happening before processing finishes.
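
The same bound can act as an automatic plausibility check on harness output. The sketch below flags any measured rate that exceeds the theoretical capacity by more than a small margin. The 5% margin and the names are assumptions, not a standard.

```java
// Plausibility check for measured consumer throughput; names are hypothetical.
final class HandlerCapacity {
    private HandlerCapacity() {}

    // Theoretical ceiling: concurrent workers divided by per-message handler time.
    static double maxThroughput(int workers, double handlerSeconds) {
        if (handlerSeconds <= 0) {
            throw new IllegalArgumentException("handler time must be > 0");
        }
        return workers / handlerSeconds;
    }

    // A measured rate above the ceiling (plus 5% measurement slack) means the
    // measurement or the ack timing is wrong.
    static boolean isPlausible(double measuredRate, int workers, double handlerSeconds) {
        return measuredRate <= maxThroughput(workers, handlerSeconds) * 1.05;
    }

    public static void main(String[] args) {
        System.out.printf("ceiling=%.0f msg/s%n", maxThroughput(8, 0.025));
        System.out.println(isPlausible(300, 8, 0.025));   // true
        System.out.println(isPlausible(5_000, 8, 0.025)); // false
    }
}
```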

37. Benchmarking Ack Timing

Ack timing can create fake performance.

Bad benchmark:

receive message → ack → process

This shows high consume rate but violates correctness.

Production benchmark:

receive message → process → commit side effect → ack

If processing includes a database transaction, include it in the service-level benchmark.

For a broker-only benchmark, document that it excludes business commit cost.
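
The production ordering can be sketched without a broker. In the example below, `Acker` and `Handler` are hypothetical stand-ins for the channel and the business logic. In a real consumer built on the RabbitMQ Java client, the corresponding calls would be `channel.basicAck` and `channel.basicNack`. Rejecting without requeue is one possible failure policy, chosen here only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Ack-after-commit ordering; interfaces are stand-ins, not client library types.
final class AckAfterCommit {
    private AckAfterCommit() {}

    interface Acker {
        void ack(long deliveryTag);
        void nack(long deliveryTag, boolean requeue);
    }

    interface Handler {
        void processAndCommit(byte[] body) throws Exception;
    }

    // Acknowledge only after the side effect has committed. On failure, reject
    // without requeue so the message is dead-lettered or dropped, depending on
    // queue configuration, instead of looping.
    static void onDelivery(long deliveryTag, byte[] body, Handler handler, Acker acker) {
        try {
            handler.processAndCommit(body);
            acker.ack(deliveryTag);
        } catch (Exception e) {
            acker.nack(deliveryTag, false);
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        Acker recordingAcker = new Acker() {
            @Override public void ack(long tag) { events.add("ack:" + tag); }
            @Override public void nack(long tag, boolean requeue) { events.add("nack:" + tag); }
        };
        onDelivery(1, new byte[0], body -> events.add("commit"), recordingAcker);
        onDelivery(2, new byte[0], body -> { throw new IllegalStateException("db down"); }, recordingAcker);
        System.out.println(events); // [commit, ack:1, nack:2]
    }
}
```

A harness built this way reports a consume rate that includes the commit cost, which is the number that matters for capacity planning.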

38. Benchmarking DLQ and Poison Behavior

Performance tests often ignore failure paths.

Add a controlled poison rate:

Poison Rate	Purpose
0%	baseline
0.1%	normal malformed input
1%	upstream bug
10%	incident scenario

Measure:

retry rate;
DLQ rate;
redelivery rate;
queue depth;
handler throughput;
CPU/logging overhead;
alert behavior.

A retry system that performs well only when nothing fails is not production-ready.

39. Benchmarking Observability Overhead

Metrics, logs, and traces cost CPU and allocation.

Run with:

metrics off, logs minimal, tracing off;
metrics on, logs minimal, tracing off;
metrics on, structured logs normal;
metrics on, structured logs normal, tracing sampled;
metrics on, logs verbose, tracing 100%.

This reveals whether observability configuration can take down the message path during incidents.

Rule:

Debug logging in a hot message loop is a denial-of-service vector.

40. Benchmarking TLS

TLS may be required for security, but it changes performance.

Benchmark:

plaintext baseline in non-production lab;
TLS enabled with realistic certs;
TLS with same node placement;
TLS with real client count;
TLS under connection churn;
TLS with large message size.

Measure CPU, latency, and throughput.

Do not disable TLS in production because of a benchmark. Use the benchmark to size CPU and tune connection reuse.

41. Benchmarking Kubernetes Reality

On Kubernetes, benchmark storage and scheduling, not only RabbitMQ.

Capture:

storage class;
persistent volume type;
pod CPU/memory limits;
CPU throttling;
node affinity/anti-affinity;
cross-node network latency;
disruption budget;
container filesystem behavior;
service mesh overhead if any;
metrics sidecar overhead if any.

A RabbitMQ benchmark on a developer laptop does not predict Kubernetes performance.

42. CI Performance Guardrails

Not every benchmark belongs in CI.

Use three layers:

Layer	Frequency	Purpose
Microbenchmark	every PR	serialization/client helper regressions
Small integration benchmark	daily	obvious RabbitMQ interaction regression
Full capacity benchmark	release/infra change	production baseline validation

CI guardrail example:

Fail build if:
- p95 publish helper latency regresses > 20%
- allocation per message regresses > 30%
- consumer handler benchmark throughput drops > 15%

Do not put a 2-hour broker benchmark in every PR. It will be ignored or disabled.
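
The guardrail rules translate directly into a small gate that a build step can call. The sketch below compares a current measurement against a stored baseline using the three thresholds from the example. The record fields and class name are hypothetical, and how the numbers are produced (for example by a microbenchmark run) is outside this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// CI regression gate using the example thresholds; names are illustrative.
final class RegressionGate {
    private RegressionGate() {}

    record Sample(double p95PublishLatencyMs, double allocBytesPerMessage, double handlerThroughput) {}

    // Returns one message per violated rule; an empty list means the build passes.
    static List<String> violations(Sample baseline, Sample current) {
        List<String> found = new ArrayList<>();
        if (current.p95PublishLatencyMs() > baseline.p95PublishLatencyMs() * 1.20) {
            found.add("p95 publish helper latency regressed by more than 20%");
        }
        if (current.allocBytesPerMessage() > baseline.allocBytesPerMessage() * 1.30) {
            found.add("allocation per message regressed by more than 30%");
        }
        if (current.handlerThroughput() < baseline.handlerThroughput() * 0.85) {
            found.add("consumer handler throughput dropped by more than 15%");
        }
        return found;
    }

    public static void main(String[] args) {
        Sample baseline = new Sample(10.0, 1_000, 1_000);
        Sample current = new Sample(12.5, 1_000, 800);
        List<String> found = violations(baseline, current);
        found.forEach(System.out::println);
        if (!found.isEmpty()) {
            System.exit(1); // non-zero exit fails the build step
        }
    }
}
```

Keeping the baseline in version control next to the gate makes every threshold change visible in review.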

43. Benchmark Report Template

Use this report structure:

# Benchmark Report: <name>

## Purpose

## Environment

## RabbitMQ Configuration

## Topology

## Workload

## Tool Command

## Duration and Warmup

## Results Summary

## Latency Distribution

## Broker Metrics

## Client Metrics

## Failure Observations

## Bottleneck Analysis

## Decision

## Follow-up Actions

A benchmark without a decision is just measurement.

44. Common Benchmark Anti-Patterns

Anti-Pattern	Why It Is Dangerous
Benchmarking localhost only	hides network and deployment cost
Reporting only average latency	hides tail failures
Ignoring confirms	tests unsafe publish path
Using auto-ack	tests unsafe consume path
Tiny message only	hides real payload cost
No warmup	captures startup artifacts
Too short duration	misses steady-state and GC/storage behavior
No broker metrics	cannot identify bottleneck
No client metrics	cannot identify producer/consumer bottleneck
Changing many variables	results are uninterpretable
Ignoring failures	production behavior unknown
Comparing queue and stream without semantics	wrong conclusion
Using peak rate as capacity	ignores sustainable throughput

45. Decision Framework

After a benchmark, decide on one of five outcomes.

Never accept a benchmark solely because throughput looks good.

46. Production Benchmark Checklist

Before running:

47. Practice Drill

Build a benchmark program for the capstone platform.

Drill A — Command Queue Baseline

quorum queue;
2 KB persistent messages;
publisher confirms;
manual ack;
prefetch 100;
target 5k/s, 8k/s, 12k/s;
15-minute run each.

Deliverable:

benchmark report;
bottleneck analysis;
recommended capacity.

Drill B — Event Fanout

topic exchange;
1, 5, 10, 20 subscriber queues;
1 KB persistent events;
target 5k/s;
measure write amplification.

Deliverable:

fanout scalability curve;
recommendation: queue fanout vs stream fanout.

Drill C — Super Stream Replay

6 partitions;
100 million events;
replay from beginning;
measure rebuild time;
measure live publish impact.

Deliverable:

replay capacity estimate;
partition hot spot analysis.

Drill D — Failure Test

run balanced workload;
restart a broker node;
pause consumers for 10 minutes;
resume consumers;
verify drain time and duplicate behavior.

Deliverable:

failure timeline;
runbook improvement list.

48. Self-Correction Rubric

You understand RabbitMQ benchmarking when you can answer these without guessing:

What exact production question does this benchmark answer?
What safety semantics were enabled: confirms, persistence, durable topology, manual ack?
What variable changed between runs?
Was the result sustainable or only burst?
What was the p99 latency under steady state?
What happened to p99 during backlog drain?
What was the broker bottleneck?
What was the client bottleneck?
How did message size affect capacity?
How did prefetch affect throughput and redelivery risk?
How did replication affect confirm latency?
How did fanout amplify broker work?
How quickly can backlog drain?
How quickly can a stream consumer replay?
What decision did the benchmark support?

If you cannot answer these, rerun the benchmark with better instrumentation.

49. Key Takeaways

PerfTest answers AMQP 0-9-1 broker/topology capacity questions.
Stream PerfTest answers stream protocol, stream, and super stream capacity questions.
Benchmarks must be designed around production questions, not vanity throughput.
Publisher confirms, persistence, replication, manual ack, and prefetch materially change results.
Sustainable throughput matters more than peak throughput.
p99 latency matters more than average latency.
Backlog drain and failure behavior are part of performance, not separate concerns.
A good benchmark produces a decision and a reproducible report.

50. References

RabbitMQ Java Tools / PerfTest documentation: https://www.rabbitmq.com/client-libraries/java-tools
RabbitMQ PerfTest documentation: https://perftest.rabbitmq.com/
RabbitMQ Stream PerfTest documentation: https://rabbitmq.github.io/rabbitmq-stream-perf-test/stable/htmlsingle/
RabbitMQ Java Client API Guide: https://www.rabbitmq.com/client-libraries/java-api-guide
RabbitMQ Streams documentation: https://www.rabbitmq.com/docs/streams
RabbitMQ Stream Java Client documentation: https://rabbitmq.github.io/rabbitmq-stream-java-client/stable/htmlsingle/
