
Benchmarking With PerfTest and Stream PerfTest

Learn Java RabbitMQ, RabbitMQ Streams, Patterns, and Deployment In Action - Part 029

Production-grade guide to benchmarking Java RabbitMQ systems with PerfTest and Stream PerfTest, covering benchmark design, workload modelling, AMQP 0-9-1 tests, stream protocol tests, latency interpretation, capacity baselines, failure benchmarking, and performance governance.


Part 029 — Benchmarking With PerfTest and Stream PerfTest

A RabbitMQ benchmark is not a contest to produce the largest msg/s number.

A useful benchmark answers a production question:

  • Can this topology handle the expected peak rate?
  • What is the p99 latency at the expected peak rate?
  • What happens when consumers slow down?
  • How many messages can safely backlog?
  • What is the cost of durability, replication, confirms, acknowledgements, message size, compression, and batching?
  • Which bottleneck fails first: producer, broker CPU, broker disk, network, consumer, database, JVM, or topology shape?

PerfTest and Stream PerfTest are useful because they let us isolate RabbitMQ behavior before mixing in real application code. PerfTest exercises AMQP 0-9-1 workloads through the Java client. Stream PerfTest exercises RabbitMQ Streams through the stream protocol and Stream Java Client.

This part teaches how to design benchmarks that produce decision-quality evidence instead of misleading charts.


1. Kaufman Deconstruction

To learn benchmarking quickly and correctly, decompose it into nine smaller skills:

  1. Question design — define the exact production question before running a test.
  2. Workload modelling — match message size, persistence, confirms, consumers, prefetch, topology, and payload shape to reality.
  3. Isolation — test broker behavior separately from application behavior.
  4. Controlled variation — change one variable at a time.
  5. Measurement discipline — capture throughput, latency, resource use, backlog, confirms, lag, GC, network, and disk.
  6. Warmup and steady state — ignore startup artifacts and measure enough time to catch tail behavior.
  7. Failure benchmarking — test restarts, blocked connections, slow consumers, partition movement, DLQ spikes, and backlog drain.
  8. Interpretation — explain the bottleneck using evidence, not intuition.
  9. Governance — turn benchmark results into capacity baselines and regression gates.

The standard:

A benchmark is valid only when another engineer can reproduce the topology, workload, environment, and interpretation.


2. The Core Benchmarking Trap

The easiest benchmark to run is usually the least useful:

# Bad benchmark: impressive but underspecified
java -jar perf-test.jar --uri amqp://guest:guest@localhost:5672

This tells us almost nothing.

We do not know:

  • whether messages are persistent;
  • whether queues are durable;
  • whether publisher confirms are enabled;
  • whether consumers ack manually;
  • whether the broker is local or remote;
  • whether the reported latency is p50, p95, p99, or p999;
  • whether broker disk is involved;
  • whether replication is involved;
  • whether the workload resembles our application;
  • whether the bottleneck is producer, broker, consumer, network, or disk.

A production benchmark starts with a scenario.

Example:

We need to validate whether a three-node RabbitMQ cluster with quorum queues can process 8,000 persistent 2 KB command messages/sec, with publisher confirms, manual consumer acknowledgements, prefetch 100, four Java service replicas, and p99 end-to-end latency below 750 ms during normal operation and below 5 seconds during a 10-minute consumer slowdown.

That is benchmarkable.


3. Benchmark Question Template

Before running PerfTest, fill this out.

| Field | Example |
|---|---|
| Business flow | order fulfillment command queue |
| Message type | command, event, replay, batch entry |
| Protocol | AMQP 0-9-1, stream protocol |
| Queue/stream type | quorum queue, classic queue, stream, super stream |
| Message size | 512 B, 2 KB, 8 KB, 64 KB |
| Persistence | persistent, transient |
| Replication | one node, quorum, stream replicas |
| Producer behavior | confirms, no confirms, confirm batch size |
| Consumer behavior | manual ack, auto ack, prefetch |
| Topology | direct exchange, topic exchange, N queues, stream partitions |
| Target rate | average, p95, peak, burst |
| Latency SLO | p50, p95, p99, max tolerated |
| Backlog SLO | max queue depth, max stream lag |
| Failure scenario | broker restart, consumer pause, publisher overload |
| Pass/fail criterion | measurable threshold |

If a field is unknown, the benchmark is not ready.
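
One way to enforce this rule is to treat the template as data and refuse to start a run while any required field is blank. The sketch below is illustrative only: `BenchmarkQuestion`, its methods, and the field names passed to it are hypothetical and not part of PerfTest.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the benchmark question template.
final class BenchmarkQuestion {
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Records one template field; returns this so calls can be chained.
    BenchmarkQuestion set(String field, String value) {
        fields.put(field, value);
        return this;
    }

    // Returns the required fields that are absent or blank.
    List<String> missing(List<String> required) {
        List<String> result = new ArrayList<>();
        for (String name : required) {
            String value = fields.get(name);
            if (value == null || value.isBlank()) {
                result.add(name);
            }
        }
        return result;
    }

    // The benchmark is ready only when nothing required is missing.
    boolean ready(List<String> required) {
        return missing(required).isEmpty();
    }

    public static void main(String[] args) {
        List<String> required =
                List.of("businessFlow", "protocol", "messageSize", "passFailCriterion");
        BenchmarkQuestion question = new BenchmarkQuestion()
                .set("businessFlow", "order fulfillment command queue")
                .set("protocol", "AMQP 0-9-1")
                .set("messageSize", "2 KB");
        System.out.println("ready=" + question.ready(required)
                + " missing=" + question.missing(required));
    }
}
```

In this example the pass/fail criterion was never filled in, so the question is reported as not ready.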


4. The Benchmarking Layers

A complete RabbitMQ performance program has four layers.

PerfTest cannot prove the whole application is fast. It proves the broker/topology can support a workload under controlled conditions.

If PerfTest fails, application code will not save you.

If PerfTest passes but the service fails, the bottleneck is likely in Java code, serialization, DB, downstream APIs, executor sizing, idempotency store, or observability overhead.


5. What PerfTest Is Good For

PerfTest is useful for AMQP 0-9-1 workloads:

  • queue throughput;
  • publish latency;
  • consumer throughput;
  • publisher confirms;
  • persistent vs transient messages;
  • manual ack behavior;
  • prefetch tuning;
  • queue type comparison;
  • message size impact;
  • producer/consumer count variation;
  • topology shape testing;
  • backlog creation and drain testing.

PerfTest is not a substitute for:

  • real serialization cost;
  • real business handler cost;
  • DB transaction cost;
  • idempotency table contention;
  • external API latency;
  • application logging/tracing cost;
  • security/encryption overhead;
  • full production network path;
  • realistic deployment autoscaling behavior.

Use it to isolate RabbitMQ, not to certify the whole system.


6. What Stream PerfTest Is Good For

Stream PerfTest is useful for RabbitMQ Stream workloads:

  • stream write throughput;
  • stream read throughput;
  • super stream partition scaling;
  • stream latency;
  • chunk/sub-entry batching effects;
  • compression trade-offs;
  • retention/backlog behavior;
  • offset-based consumption;
  • fan-out consumer scaling;
  • replay performance;
  • producer deduplication scenarios.

It is not a replacement for testing your actual stream consumer:

  • offset commit strategy;
  • replay-safe processing;
  • state store behavior;
  • aggregation/windowing performance;
  • event schema deserialization;
  • late-event correction;
  • checkpoint consistency.

7. Benchmark Environment Discipline

Always record environment details.

| Dimension | Capture |
|---|---|
| RabbitMQ version | exact version and enabled plugins |
| Erlang/OTP version | exact version |
| Java version | benchmark client JVM |
| Broker deployment | bare metal, VM, container, Kubernetes |
| Node count | single node, three-node cluster, etc. |
| Queue type | classic, quorum, stream, super stream |
| Disk type | SSD/NVMe/network disk/ephemeral |
| Filesystem/storage class | especially on Kubernetes |
| CPU/memory | broker and benchmark client |
| Network | same host, same AZ, cross-AZ, cross-region |
| TLS | enabled/disabled |
| Compression | enabled/disabled |
| Broker config | memory watermark, disk limit, file handles |
| Test duration | warmup and measured interval |

Without this metadata, benchmark numbers are anecdotes.


8. Workload Variables

Change one variable at a time.

| Variable | Why It Matters |
|---|---|
| Message size | affects network, memory, disk, GC, serialization |
| Producer count | affects connection/channel load and broker ingress |
| Consumer count | affects dispatch, ack rate, ordering, downstream pressure |
| Queue type | classic/quorum/stream have different durability and replication models |
| Persistence | changes disk and safety characteristics |
| Confirms | affects producer throughput and safety |
| Confirm batch size | affects latency/throughput trade-off |
| Ack mode | affects loss risk and broker memory behavior |
| Prefetch | controls consumer in-flight work |
| Routing topology | affects exchange routing and queue fanout |
| Replication factor | affects write quorum and disk/network cost |
| TLS | affects CPU and latency |
| Compression | shifts bottleneck between CPU and network |
| Batch size | changes throughput, latency, duplicate amplification |

The mistake is to tune ten variables and then claim victory.
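
The one-variable rule can be checked mechanically before a run is accepted into a comparison. The sketch below assumes each run is described by a simple name-to-value map; `RunDiff` and the variable names used in `main` are hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class RunDiff {
    // Names of the variables whose values differ between two run configurations.
    static Set<String> changedVariables(Map<String, String> baseline,
                                        Map<String, String> candidate) {
        Set<String> keys = new TreeSet<>(baseline.keySet());
        keys.addAll(candidate.keySet());
        Set<String> changed = new TreeSet<>();
        for (String key : keys) {
            if (!Objects.equals(baseline.get(key), candidate.get(key))) {
                changed.add(key);
            }
        }
        return changed;
    }

    // A comparison is controlled only when exactly one variable changed.
    static boolean isControlledVariation(Map<String, String> baseline,
                                         Map<String, String> candidate) {
        return changedVariables(baseline, candidate).size() == 1;
    }

    public static void main(String[] args) {
        Map<String, String> runA = Map.of("prefetch", "10", "sizeBytes", "2048", "producers", "4");
        Map<String, String> runB = Map.of("prefetch", "50", "sizeBytes", "2048", "producers", "4");
        Map<String, String> runC = Map.of("prefetch", "50", "sizeBytes", "8192", "producers", "8");
        System.out.println("A vs B changed: " + changedVariables(runA, runB));
        System.out.println("A vs C controlled: " + isControlledVariation(runA, runC));
    }
}
```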


9. Benchmark Taxonomy

Use a taxonomy so every test has a purpose.

| Test Type | Question |
|---|---|
| Producer-only | How fast can the broker accept publishes? |
| Consumer-only | How fast can consumers drain existing backlog? |
| Balanced pub/sub | What is steady-state throughput and latency? |
| Burst test | How does the system absorb spikes? |
| Backlog drain | How quickly can the system recover? |
| Durability test | What is the cost of persistent messages and replication? |
| Confirm test | What is the cost of publisher confirms? |
| Prefetch test | What prefetch maximizes throughput without inflating latency? |
| Message size test | Where do CPU/network/disk become bottlenecks? |
| Fanout test | How does N subscribers change throughput and broker load? |
| Failure test | What happens during restart, pause, failover, or blocked connection? |
| Soak test | Does performance degrade over hours? |

A single benchmark run cannot answer all of these.


10. PerfTest Baseline: Minimal AMQP 0-9-1 Run

A minimal baseline should still be explicit.

java -jar perf-test.jar \
  --uri amqp://user:pass@rabbitmq-1:5672/%2f \
  --queue q.benchmark.baseline \
  --producers 1 \
  --consumers 1 \
  --size 1024 \
  --rate 1000 \
  --time 300

This creates a starting point, not a conclusion.

Record:

  • publish rate;
  • consume rate;
  • latency distribution;
  • queue depth;
  • broker CPU;
  • broker memory;
  • disk I/O;
  • network I/O;
  • client CPU;
  • GC behavior;
  • warnings/alarms.

11. PerfTest With Publisher Confirms

Publisher confirms are required for serious at-least-once publish safety.

A confirm benchmark should vary confirm mode and confirm batch behavior.

Example shape:

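# PerfTest applies --rate per producer: 4 producers x 2,000 = 8,000 msg/s in total.
# Persistence is requested with --flag persistent; there is no standalone --persistent option.
# Check flag names against --help for the PerfTest version in use.
java -jar perf-test.jar \
  --uri amqp://user:pass@rabbitmq-1:5672/%2f \
  --queue q.benchmark.confirms \
  --producers 4 \
  --consumers 4 \
  --size 2048 \
  --rate 2000 \
  --confirm 100 \
  --flag persistent \
  --time 600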

Interpretation questions:

  • What is confirm latency at p95/p99?
  • Does confirm latency climb before broker CPU is saturated?
  • Is disk write latency the bottleneck?
  • Does increasing producer count improve throughput or only increase in-flight pressure?
  • Does larger confirm batch reduce overhead but increase tail latency?
  • How much memory is used by unconfirmed messages on the client side?

Do not compare confirmed and unconfirmed publishing as equivalent workloads. They have different safety contracts.


12. PerfTest With Persistent Messages

Marking messages persistent only protects them when they are routed to durable queues; a persistent message sitting in a non-durable queue is still lost when the broker restarts.

A useful durability benchmark uses:

  • durable queues;
  • persistent messages;
  • publisher confirms;
  • manual acknowledgements;
  • realistic message size;
  • realistic queue type.

Example shape:

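# --rate is per producer: 6 producers x 2,000 = 12,000 msg/s in total.
# --flag persistent replaces the non-existent --persistent option.
# Manual ack is PerfTest's default; --autoack is a switch without a value, so it is omitted.
# A quorum queue must not be auto-delete, hence --auto-delete false.
# Check flag names against --help for the PerfTest version in use.
java -jar perf-test.jar \
  --uri amqp://user:pass@rabbitmq-1:5672/%2f \
  --queue q.benchmark.durable \
  --queue-args x-queue-type=quorum \
  --auto-delete false \
  --flag persistent \
  --producers 6 \
  --consumers 6 \
  --size 2048 \
  --rate 2000 \
  --confirm 100 \
  --qos 100 \
  --time 900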

Key observations:

  • Does throughput remain steady after warmup?
  • Does p99 latency drift upward?
  • Does queue depth oscillate or grow monotonically?
  • Does disk utilization approach saturation?
  • Are quorum queue leaders balanced across nodes?
  • Does consumer ack throughput match producer publish throughput?

13. Prefetch Benchmark Matrix

Prefetch is a work budget, not a magic speed knob.

Run the same workload with a fixed target rate and vary only prefetch.

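| Run | Producers | Consumers | Prefetch | Target Rate | Message Size |
|---|---|---|---|---|---|
| A | 4 | 4 | 1 | 4,000/s | 2 KB |
| B | 4 | 4 | 10 | 4,000/s | 2 KB |
| C | 4 | 4 | 50 | 4,000/s | 2 KB |
| D | 4 | 4 | 100 | 4,000/s | 2 KB |
| E | 4 | 4 | 500 | 4,000/s | 2 KB |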

Measure:

  • consume rate;
  • delivery latency;
  • processing latency if using custom client harness;
  • unacked message count;
  • consumer memory;
  • redelivery behavior after consumer kill;
  • fairness across consumers.

Expected shape:

  • too low: throughput limited by round-trips and dispatch inefficiency;
  • reasonable: stable throughput and bounded in-flight work;
  • too high: high unacked count, inflated tail latency, larger duplicate storm after crash.
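
The "work budget" framing can be made concrete with simple arithmetic. Assuming prefetch is applied per consumer (the usual non-global QoS setting), the product of consumers and prefetch is an upper bound on unacknowledged deliveries, and therefore on how many messages can be redelivered if every consumer dies at once. `PrefetchBudget` is a hypothetical helper name.

```java
final class PrefetchBudget {
    // Upper bound on unacknowledged deliveries across all consumers,
    // assuming the prefetch limit applies to each consumer separately.
    static long maxInFlight(int consumers, int prefetch) {
        return (long) consumers * prefetch;
    }

    public static void main(String[] args) {
        int consumers = 4; // matches the matrix above
        for (int prefetch : new int[] {1, 10, 50, 100, 500}) {
            System.out.printf("prefetch=%d maxInFlight=%d%n",
                    prefetch, maxInFlight(consumers, prefetch));
        }
    }
}
```

With four consumers, moving from prefetch 100 to prefetch 500 raises the worst-case redelivery burst from 400 to 2,000 messages, which is the "duplicate storm" risk named in the list.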

14. Message Size Benchmark Matrix

Message size can shift the bottleneck completely.

| Run | Size | Hypothesis |
|---|---|---|
| A | 256 B | protocol overhead dominates |
| B | 1 KB | common command/event baseline |
| C | 4 KB | normal business payload |
| D | 16 KB | network and memory begin to matter |
| E | 64 KB | disk/network/GC pressure likely visible |
| F | 256 KB | anti-pattern candidate unless justified |

For each size, keep topology and rate strategy constant.

Measure:

  • throughput;
  • latency percentiles;
  • broker memory;
  • network throughput;
  • disk throughput;
  • client CPU;
  • GC allocation rate;
  • confirm latency.

Design rule:

If message size dominates performance, fix payload design before tuning broker knobs.


15. Fanout Benchmark

Fanout changes the cost model. One publish can become N queue writes.

Benchmark with subscriber counts:

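| Run | Subscriber Queues | Consumers Per Queue | Message Size | Target Publish Rate |
|---|---|---|---|---|
| A | 1 | 2 | 2 KB | 5,000/s |
| B | 3 | 2 | 2 KB | 5,000/s |
| C | 10 | 2 | 2 KB | 5,000/s |
| D | 25 | 2 | 2 KB | 5,000/s |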

Observe:

  • broker write amplification;
  • per-queue depth;
  • slow subscriber impact;
  • routing CPU;
  • queue leader balance;
  • disk amplification;
  • alert volume.
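
A rough estimate of write amplification helps set expectations before the runs. The sketch below assumes every published message is routed to every subscriber queue and ignores replication, which would multiply the cost further; `FanoutAmplification` is a hypothetical helper.

```java
final class FanoutAmplification {
    // Queue writes per second when each publish is copied to every subscriber queue.
    static long queueWritesPerSecond(long publishRate, int subscriberQueues) {
        return publishRate * subscriberQueues;
    }

    // Payload bytes written per second across all subscriber queues (before replication).
    static long payloadBytesPerSecond(long publishRate, int subscriberQueues, int messageSizeBytes) {
        return queueWritesPerSecond(publishRate, subscriberQueues) * messageSizeBytes;
    }

    public static void main(String[] args) {
        long publishRate = 5_000; // msg/s, as in the matrix above
        int sizeBytes = 2_048;    // 2 KB
        for (int queues : new int[] {1, 3, 10, 25}) {
            System.out.printf("queues=%d writes/s=%d bytes/s=%d%n",
                    queues,
                    queueWritesPerSecond(publishRate, queues),
                    payloadBytesPerSecond(publishRate, queues, sizeBytes));
        }
    }
}
```

At 25 subscriber queues, 5,000 publishes per second become 125,000 queue writes per second under these assumptions.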

If many subscribers need replay, RabbitMQ Streams may be a better fit than queue-per-subscriber fanout.


16. Queue Type Comparison

Do not benchmark queue types as if they are interchangeable.

| Queue Type | Why Benchmark It |
|---|---|
| Classic queue | baseline and compatibility |
| Quorum queue | data safety, replication, poison handling via delivery limit |
| Stream | log-like retention, replay, high fan-out, large backlog |

A comparison must keep the target semantic clear.

Bad comparison:

Classic queue is faster, therefore better.

Better comparison:

For this command queue, we require replicated data safety. Compare quorum queue throughput with required confirms and manual ack against the actual SLO.

Benchmark against requirements, not against vanity throughput.


17. Stream PerfTest Baseline

Stream PerfTest uses the RabbitMQ stream protocol.

A baseline stream run should specify:

  • stream name;
  • producers;
  • consumers;
  • message size;
  • rate;
  • duration;
  • retention if relevant;
  • replicas if relevant;
  • super stream partitioning if relevant.

Example shape:

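# Stream PerfTest names the target with --streams (plural).
# Check flag names against --help for the Stream PerfTest version in use.
java -jar stream-perf-test.jar \
  --uris rabbitmq-stream://user:pass@rabbitmq-1:5552 \
  --streams benchmark.stream.baseline \
  --producers 1 \
  --consumers 1 \
  --size 1024 \
  --rate 10000 \
  --time 300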

Measure:

  • publish rate;
  • consume rate;
  • latency;
  • consumer lag;
  • broker disk write rate;
  • broker network;
  • producer CPU;
  • consumer CPU;
  • retention growth.

18. Stream Replay Benchmark

Replay is a core stream capability.

Test it explicitly:

  1. Produce a fixed dataset, for example 100 million messages.
  2. Stop producers.
  3. Start consumers from the beginning.
  4. Measure read throughput and lag reduction.
  5. Repeat with different consumer counts.
  6. Repeat with different message sizes.
  7. Repeat with real deserialization if using a custom harness.

Questions:

  • How quickly can a new projection be built?
  • How quickly can a consumer recover after being offline?
  • Does replay saturate broker disk, network, or consumer CPU?
  • Can replay run while live producers continue publishing?
  • Does replay affect live consumer latency?

A stream system that cannot replay within the operational recovery window is not production-ready.


19. Super Stream Benchmark

Super streams add partitioning.

Benchmark scaling by partition count:

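| Run | Partitions | Producers | Consumers | Routing Key Cardinality |
|---|---|---|---|---|
| A | 1 | 2 | 2 | 1M keys |
| B | 3 | 3 | 3 | 1M keys |
| C | 6 | 6 | 6 | 1M keys |
| D | 12 | 12 | 12 | 1M keys |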

Measure:

  • total throughput;
  • per-partition throughput;
  • hot partition skew;
  • consumer assignment balance;
  • publish latency;
  • consumer lag per partition;
  • leader distribution across nodes.

Partitioning only helps when routing keys distribute load well.

A hot key can defeat the entire super stream design.
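
Hot partition skew can be summarized as a single number: the busiest partition's message count divided by the mean count per partition. This is one possible definition, chosen here for illustration; `PartitionSkew` is a hypothetical helper and the counts in `main` are made-up inputs, not measurements.

```java
final class PartitionSkew {
    // Ratio of the busiest partition to the mean; 1.0 means perfectly even load.
    static double skewRatio(long[] perPartitionCounts) {
        if (perPartitionCounts.length == 0) {
            throw new IllegalArgumentException("at least one partition is required");
        }
        long max = 0;
        long total = 0;
        for (long count : perPartitionCounts) {
            max = Math.max(max, count);
            total += count;
        }
        if (total == 0) {
            return 1.0; // no traffic at all, so no skew to report
        }
        double mean = (double) total / perPartitionCounts.length;
        return max / mean;
    }

    public static void main(String[] args) {
        long[] even = {250, 250, 250, 250};
        long[] hotKey = {700, 100, 100, 100}; // one key dominates
        System.out.println("even skew = " + skewRatio(even));
        System.out.println("hot-key skew = " + skewRatio(hotKey));
    }
}
```

A ratio close to the partition count means one partition carries nearly all traffic, so adding partitions will not add throughput.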


20. Compression and Batch Benchmark for Streams

For streams, batching and compression can increase throughput but change latency and duplicate behavior.

Benchmark dimensions:

| Variable | Values |
|---|---|
| Message size | 512 B, 2 KB, 8 KB |
| Batch/sub-entry count | off, 10, 100, 500 |
| Compression | none, gzip, snappy/lz4/zstd if available |
| Producer count | 1, 4, 8 |
| Partition count | 1, 3, 6, 12 |

Measure:

  • throughput;
  • p99 latency;
  • CPU cost;
  • network reduction;
  • disk reduction;
  • duplicate amplification under retry;
  • consumer decompression cost.

Interpretation rule:

Compression is worth it when network/disk is the bottleneck and CPU headroom exists. It is harmful when CPU is already the bottleneck or latency budget is tight.


21. Latency: Do Not Trust Averages

Average latency hides the user-visible failure.

Always capture:

  • p50;
  • p90;
  • p95;
  • p99;
  • p999 if the workload is latency-sensitive;
  • max;
  • latency under backlog;
  • latency after broker restart;
  • confirm latency separately from end-to-end delivery latency.

Example interpretation:

| Metric | Value | Meaning |
|---|---|---|
| p50 | 20 ms | normal path is healthy |
| p95 | 100 ms | mild queueing exists |
| p99 | 2,500 ms | tail is failing SLO |
| max | 30 s | stall or failover visible |

A benchmark with good average latency and bad p99 is a failed benchmark for most production workflows.
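
To see how an average hides the tail, the sketch below computes nearest-rank percentiles over a synthetic sample set in which 2 of 100 messages stall. The data is invented for illustration, and nearest-rank is only one of several percentile definitions; `Percentiles` is a hypothetical helper.

```java
import java.util.Arrays;

final class Percentiles {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElseThrow();
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        Arrays.fill(samples, 20L);   // 98 healthy deliveries at 20 ms
        samples[10] = 2_500L;        // two stalled deliveries
        samples[60] = 2_500L;
        System.out.println("mean = " + mean(samples) + " ms");
        System.out.println("p50  = " + percentile(samples, 50) + " ms");
        System.out.println("p99  = " + percentile(samples, 99) + " ms");
    }
}
```

The mean stays under 70 ms while p99 is 2,500 ms, so a report that shows only the average would pass a run that fails a 750 ms p99 SLO.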


22. Throughput: Sustainable vs Burst

There are two different throughput numbers.

| Type | Meaning |
|---|---|
| Burst throughput | temporary rate before backlog or latency grows |
| Sustainable throughput | rate that can run indefinitely without unbounded queue depth or latency drift |

To find sustainable throughput:

  1. Set a target publish rate.
  2. Run long enough to reach steady state.
  3. Verify queue depth or lag does not grow monotonically.
  4. Verify p99 latency remains bounded.
  5. Verify broker resources remain below alert thresholds.
  6. Verify consumer throughput equals producer throughput.
  7. Repeat with a higher rate.

The maximum sustainable rate is the highest rate that passes all conditions.
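
The procedure above is a step search. In the sketch below, `runPasses` is a hypothetical callback that stands for one full benchmark run at the given rate plus the evaluation of all pass conditions; here it is stubbed with a fixed threshold purely so the example can execute.

```java
import java.util.function.IntPredicate;

final class SustainableRateSearch {
    // Steps the target rate upward and returns the highest rate that passed,
    // or -1 if even the starting rate failed. Stops at the first failure.
    static int highestPassingRate(int startRate, int step, int maxRate, IntPredicate runPasses) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        int best = -1;
        for (int rate = startRate; rate <= maxRate; rate += step) {
            if (!runPasses.test(rate)) {
                break;
            }
            best = rate;
        }
        return best;
    }

    public static void main(String[] args) {
        // Stub: pretend every run at or below 9,500 msg/s meets all conditions.
        IntPredicate stubbedRun = rate -> rate <= 9_500;
        int sustainable = highestPassingRate(2_000, 1_000, 20_000, stubbedRun);
        System.out.println("highest passing rate = " + sustainable + " msg/s");
    }
}
```

The step size sets the resolution of the answer: with 1,000 msg/s steps the search can only report that the limit lies between the last passing rate and the first failing one.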


23. Backlog Drain Benchmark

Backlog drain matters after outages.

Scenario:

  1. Producers publish at normal rate.
  2. Consumers are stopped for 10 minutes.
  3. Queue depth or stream lag grows.
  4. Consumers restart with normal replica count.
  5. Measure drain time.
  6. Repeat with temporary scale-out.

Pass/fail criteria:

  • backlog drains within recovery objective;
  • p99 latency returns to normal after drain;
  • broker does not hit memory/disk alarms;
  • consumers do not overload downstream DB/API;
  • duplicate/redelivery rate remains expected.

Backlog drain is where many “high throughput” systems fail operationally.
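
Before running the scenario, a back-of-envelope estimate gives a number to compare the measurement against. The sketch assumes producers keep publishing at a constant rate during the drain and that consumers drain at a constant capacity; the 10,000 msg/s consumer capacity is an illustrative input, not a measurement, and `DrainEstimate` is a hypothetical helper.

```java
final class DrainEstimate {
    // Messages accumulated while consumers are stopped.
    static long backlogAfterOutage(long publishRate, long outageSeconds) {
        return publishRate * outageSeconds;
    }

    // Seconds needed to drain the backlog while producers keep publishing.
    // Returns infinity when consumers have no surplus capacity.
    static double drainSeconds(long backlog, long publishRate, long consumeCapacity) {
        long surplus = consumeCapacity - publishRate;
        if (surplus <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (double) backlog / surplus;
    }

    public static void main(String[] args) {
        long publishRate = 8_000;      // msg/s
        long outageSeconds = 600;      // 10-minute consumer stop
        long consumeCapacity = 10_000; // msg/s, assumed for illustration
        long backlog = backlogAfterOutage(publishRate, outageSeconds);
        System.out.println("backlog = " + backlog + " messages");
        System.out.println("drain   = " + drainSeconds(backlog, publishRate, consumeCapacity) + " s");
    }
}
```

With only 2,000 msg/s of surplus capacity, a 10-minute outage takes 40 minutes to drain, which is why the scenario repeats the test with temporary scale-out.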


24. Failure Benchmark: Broker Restart

Benchmark failover/restart as a normal event.

For queues:

  • publish with confirms;
  • consume with manual ack;
  • restart broker node or queue leader;
  • measure publish interruption;
  • measure confirm latency spike;
  • measure redelivery count;
  • measure queue recovery time;
  • verify no message loss according to idempotency ledger.

For streams:

  • publish continuously;
  • consume continuously;
  • restart stream leader node;
  • measure publish pause;
  • measure consumer lag spike;
  • verify offset recovery;
  • verify replay safety.

Failure benchmarking must be done before production, not during the first incident.


25. Failure Benchmark: Slow Consumer

Slow consumers reveal backpressure correctness.

Scenario:

Measure:

  • queue depth growth rate;
  • unacked messages;
  • p99 processing latency;
  • publisher confirm latency;
  • DLQ growth;
  • consumer memory;
  • downstream saturation;
  • recovery time after downstream returns.

The goal is not to avoid slowdown. The goal is bounded, observable degradation.


26. Coordinated Omission

A common latency mistake:

If the producer slows down when the system slows down, the benchmark may stop measuring the worst latency.

This is coordinated omission.

To reduce the risk:

  • use fixed-rate load where appropriate;
  • measure queueing delay separately;
  • measure publish timestamp to consume timestamp;
  • record backlog/lag during latency measurement;
  • measure during saturation and recovery;
  • do not discard timeout/error samples silently.

If the system is overloaded and your chart says latency is fine, your measurement is probably wrong.
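
A small deterministic simulation shows the effect. It models a sender that wants to publish every `intervalMs` but blocks while the previous request is still in progress. Latency is computed twice: from the actual send time (which hides the wait) and from the intended send time (which includes it). All names and numbers are illustrative.

```java
import java.util.Arrays;

final class CoordinatedOmissionDemo {
    // serviceMs[i] is how long the system takes to handle request i.
    // When fromIntendedTime is false, latency starts at the delayed actual send.
    // When true, latency starts at the scheduled send time i * intervalMs.
    static long[] latencies(long intervalMs, long[] serviceMs, boolean fromIntendedTime) {
        long[] result = new long[serviceMs.length];
        long senderFreeAt = 0;
        for (int i = 0; i < serviceMs.length; i++) {
            long intended = i * intervalMs;
            long actualSend = Math.max(senderFreeAt, intended); // a blocked sender sends late
            long done = actualSend + serviceMs[i];
            result[i] = done - (fromIntendedTime ? intended : actualSend);
            senderFreeAt = done;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] serviceMs = {1, 1, 100, 1, 1}; // the third request hits a 100 ms stall
        System.out.println("from actual send:   "
                + Arrays.toString(latencies(10, serviceMs, false)));
        System.out.println("from intended send: "
                + Arrays.toString(latencies(10, serviceMs, true)));
    }
}
```

Measured from the actual send, only one request looks slow. Measured from the intended send, the two requests queued behind the stall also show latencies of 91 ms and 82 ms, which is what a user on a fixed arrival schedule would experience.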


27. Application-Level Benchmark Harness

PerfTest isolates broker behavior. A Java harness isolates your client code.

A minimal harness should implement:

  • real ConnectionFactory config;
  • real exchange/queue declarations;
  • real message converter;
  • real publisher confirm handling;
  • real consumer acknowledgement policy;
  • synthetic business handler with configurable delay/error rate;
  • metrics for publish, confirm, delivery, processing, ack, retry;
  • bounded executor;
  • graceful shutdown;
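  • structured output.

A sketch of the two core types:

// Workload contract for the harness: how payloads are built and how deliveries are handled.
public interface BenchmarkWorkload {
    // Builds the message body for the given publish sequence number.
    byte[] payload(int sequence);

    // Processes one delivered body; may throw when processing fails.
    void handle(byte[] body) throws Exception;
}

// Immutable snapshot of the harness counters.
public record BenchmarkMetrics(
        long published,
        long confirmed,
        long consumed,
        long acked,
        long failed,
        long redelivered
) {}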

This harness reveals overhead hidden by PerfTest:

  • serialization;
  • allocation;
  • JSON parsing;
  • validation;
  • logging;
  • tracing;
  • idempotency checks;
  • transaction boundaries.

28. Benchmark Output Format

Every benchmark should produce a result record.

benchmarkId: rabbitmq-command-quorum-2kb-confirm-prefetch100-2026-07-01
purpose: Validate command queue capacity for order workflow
environment:
  rabbitmqVersion: "x.y.z"
  erlangVersion: "x.y.z"
  javaVersion: "21"
  brokerNodes: 3
  deployment: kubernetes
  storage: nvme-backed-storage-class
workload:
  protocol: amqp-0-9-1
  queueType: quorum
  messageSizeBytes: 2048
  producers: 6
  consumers: 6
  publisherConfirms: true
  persistentMessages: true
  manualAck: true
  prefetch: 100
  targetRatePerSecond: 8000
  durationSeconds: 900
results:
  publishRatePerSecond: 8000
  consumeRatePerSecond: 8000
  p50LatencyMs: 35
  p95LatencyMs: 210
  p99LatencyMs: 640
  maxLatencyMs: 1800
  maxQueueDepth: 12000
  brokerCpuMaxPercent: 72
  brokerDiskUtilMaxPercent: 68
  clientGcPauseP99Ms: 12
conclusion:
  status: pass
  notes:
    - Meets p99 latency SLO under normal load.
    - Disk becomes primary bottleneck above 11k msg/s.

This format makes benchmark history comparable.


29. Benchmark Matrix for Command Queue

Use this matrix for a command queue workload.

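| Run | Queue | Persistent | Confirms | Manual Ack | Producers | Consumers | Prefetch | Size | Rate |
|---|---|---|---|---|---|---|---|---|---|
| C1 | quorum | yes | yes | yes | 2 | 2 | 50 | 1 KB | 2k/s |
| C2 | quorum | yes | yes | yes | 4 | 4 | 100 | 1 KB | 5k/s |
| C3 | quorum | yes | yes | yes | 6 | 6 | 100 | 2 KB | 8k/s |
| C4 | quorum | yes | yes | yes | 8 | 8 | 200 | 2 KB | 12k/s |
| C5 | quorum | yes | yes | yes | 8 | 8 | 200 | 8 KB | 8k/s |
| C6 | quorum | yes | yes | yes | 8 | 0 | n/a | 2 KB | 8k/s |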

C6 is backlog creation.

Then run drain tests by restarting consumers.


30. Benchmark Matrix for Event Fanout

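| Run | Exchange | Subscriber Queues | Queue Type | Producers | Consumers/Queue | Size | Rate |
|---|---|---|---|---|---|---|---|
| E1 | topic | 1 | quorum | 4 | 2 | 1 KB | 5k/s |
| E2 | topic | 5 | quorum | 4 | 2 | 1 KB | 5k/s |
| E3 | topic | 10 | quorum | 4 | 2 | 1 KB | 5k/s |
| E4 | topic | 20 | quorum | 4 | 2 | 1 KB | 5k/s |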

Measure broker write amplification.

If E4 cannot pass, do not solve it with blind broker scaling first. Reconsider topology:

  • stream fanout;
  • fewer subscribers;
  • event aggregation;
  • selective routing;
  • subscriber-side filtering moved to broker routing keys;
  • separating hot event categories.

31. Benchmark Matrix for Streams

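| Run | Stream Type | Partitions | Producers | Consumers | Size | Batch | Compression | Rate |
|---|---|---|---|---|---|---|---|---|
| S1 | stream | 1 | 1 | 1 | 1 KB | off | none | 10k/s |
| S2 | stream | 1 | 4 | 4 | 1 KB | on | none | 50k/s |
| S3 | super stream | 3 | 6 | 6 | 1 KB | on | none | 100k/s |
| S4 | super stream | 6 | 12 | 12 | 1 KB | on | none | 200k/s |
| S5 | super stream | 6 | 12 | 12 | 8 KB | on | compression | 100k/s |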

Measure:

  • throughput scaling by partition;
  • latency under each partition count;
  • hot partition skew;
  • consumer lag;
  • replay throughput;
  • CPU/disk/network distribution.

32. Reading Broker Metrics During Benchmark

Important broker-side metrics:

| Metric | Meaning |
|---|---|
| publish rate | ingress |
| deliver/get rate | egress |
| ack rate | completion rate |
| confirm rate | broker responsibility acceptance rate |
| queue depth | backlog |
| unacked messages | consumer in-flight work |
| redeliver rate | retry/consumer failure signal |
| memory used | broker memory pressure |
| disk free / disk alarm | storage pressure |
| file descriptors | connection/channel/queue pressure |
| socket descriptors | network pressure |
| connection blocked | flow control visible to publisher |
| stream lag | consumer progress behind stream tail |
| stream segment growth | retention/storage pressure |

A benchmark is incomplete without broker metrics.


33. Reading Client Metrics During Benchmark

Important Java client metrics:

| Metric | Meaning |
|---|---|
| publish attempts | offered load |
| publish success | accepted by client path |
| confirms | broker accepted responsibility |
| confirm latency | broker/disk/replication pressure |
| returned messages | routing failure |
| in-flight messages | producer memory risk |
| deliveries | consumer ingress |
| processing latency | handler/downstream pressure |
| ack latency | completion delay |
| redeliveries | failure/retry loop |
| executor queue depth | JVM local backpressure |
| allocation rate | GC pressure |
| GC pause | tail latency risk |
| thread count | concurrency health |
| socket write wait | network/backpressure signal |

Broker metrics tell you what RabbitMQ sees. Client metrics tell you what the service experiences.


34. Benchmark Interpretation Patterns

34.1 Throughput low, broker CPU low

Likely causes:

  • producer bottleneck;
  • consumer bottleneck;
  • network bottleneck;
  • confirm window too small;
  • benchmark client CPU saturated;
  • rate limit too low;
  • single channel bottleneck;
  • TLS/client-side encryption overhead.

Next action:

  • inspect client CPU;
  • increase producer count;
  • increase in-flight confirm window carefully;
  • separate benchmark client from broker host;
  • inspect network throughput.

34.2 Throughput high, p99 latency bad

Likely causes:

  • queueing delay;
  • disk flush latency;
  • GC pauses;
  • confirm batching too large;
  • consumer prefetch too high;
  • backlog oscillation;
  • node imbalance.

Next action:

  • inspect latency over time, not just summary;
  • inspect queue depth/lag correlation;
  • reduce batch/prefetch;
  • rebalance leaders;
  • check disk latency.

34.3 Queue depth grows steadily

Likely causes:

  • producer rate > consumer completion rate;
  • downstream consumer dependency slow;
  • prefetch too low;
  • insufficient consumers;
  • poison messages causing retry loops;
  • broker throttling deliveries.

Next action:

  • compare publish rate vs ack rate;
  • inspect consumer processing latency;
  • inspect DLQ and redelivery;
  • increase consumers only if downstream can handle it.

34.4 Confirm latency rises before CPU saturation

Likely causes:

  • disk latency;
  • quorum replication latency;
  • network between cluster nodes;
  • leader placement;
  • storage throttling;
  • memory/disk alarm near threshold.

Next action:

  • inspect disk I/O;
  • inspect cluster inter-node network;
  • inspect queue leader distribution;
  • test with smaller message size;
  • test single-node vs replicated to isolate replication cost.

35. Validating Capacity With Little’s Law

Part 028 introduced Little’s Law:

L = λ × W

Where:

  • L = average number of messages in system;
  • λ = arrival rate;
  • W = average time in system.

Use benchmark data to sanity-check capacity.

Example:

arrival rate = 8,000 msg/s
average end-to-end latency = 0.25 s
expected in-system messages = 8,000 × 0.25 = 2,000

If the benchmark reports around 20,000 messages in the system on average instead of 2,000, then the latency measurement is wrong, queueing is hidden somewhere, or the workload is unstable.
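
The same sanity check can be automated against each result record. In the sketch below, the 20% tolerance is an arbitrary illustrative threshold and `LittlesLaw` is a hypothetical helper name.

```java
final class LittlesLaw {
    // L = lambda x W: expected average number of messages in the system.
    static double expectedInSystem(double arrivalRatePerSec, double avgLatencySec) {
        return arrivalRatePerSec * avgLatencySec;
    }

    // True when the observed in-system count is within the given fractional
    // tolerance of what Little's Law predicts.
    static boolean consistent(double observedInSystem, double arrivalRatePerSec,
                              double avgLatencySec, double tolerance) {
        double expected = expectedInSystem(arrivalRatePerSec, avgLatencySec);
        return Math.abs(observedInSystem - expected) <= tolerance * expected;
    }

    public static void main(String[] args) {
        double rate = 8_000;   // msg/s
        double latency = 0.25; // seconds, average end-to-end
        System.out.println("expected in system = " + expectedInSystem(rate, latency));
        System.out.println("observed 2,100 consistent?  " + consistent(2_100, rate, latency, 0.2));
        System.out.println("observed 20,000 consistent? " + consistent(20_000, rate, latency, 0.2));
    }
}
```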


36. Benchmarking Consumer Handler Delay

PerfTest can simulate consumer behavior, but a custom Java harness is better for service-like delay.

Scenarios:

| Handler Delay | Meaning |
|---|---|
| 0 ms | broker/client maximum drain |
| 5 ms | light CPU/DB path |
| 25 ms | normal business transaction |
| 100 ms | slow downstream dependency |
| 1 s | outage/backpressure scenario |

Run each with the same prefetch and consumer count.

Expected:

max throughput ≈ consumers × concurrency / processing_time

Example:

8 workers, 25 ms each
capacity ≈ 8 / 0.025 = 320 msg/s

If the benchmark shows 5,000 msg/s with 25 ms handler and only 8 workers, the measurement is wrong or acknowledgements are happening before processing finishes.
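
The capacity formula can be turned into a plausibility check on reported results. The sketch assumes each worker handles one message at a time with a fixed delay; the 5% measurement slack and the `HandlerCapacity` name are illustrative choices.

```java
final class HandlerCapacity {
    // Upper bound on throughput when each worker processes one message at a time.
    // A zero delay means the handler is not the limiting factor.
    static double maxThroughputPerSec(int workers, long handlerDelayMs) {
        if (handlerDelayMs <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return workers * 1000.0 / handlerDelayMs;
    }

    // A reported rate well above the bound suggests early acks or a measurement error.
    static boolean plausible(double reportedPerSec, int workers, long handlerDelayMs) {
        return reportedPerSec <= maxThroughputPerSec(workers, handlerDelayMs) * 1.05;
    }

    public static void main(String[] args) {
        int workers = 8;
        for (long delayMs : new long[] {5, 25, 100, 1000}) {
            System.out.printf("delay=%d ms capacity=%.0f msg/s%n",
                    delayMs, maxThroughputPerSec(workers, delayMs));
        }
        System.out.println("5,000 msg/s at 25 ms plausible? " + plausible(5_000, workers, 25));
    }
}
```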


37. Benchmarking Ack Timing

Ack timing can create fake performance.

Bad benchmark:

receive message → ack → process

This shows high consume rate but violates correctness.

Production benchmark:

receive message → process → commit side effect → ack

If processing includes a database transaction, include it in the service-level benchmark.

For a broker-only benchmark, document that it excludes business commit cost.


38. Benchmarking DLQ and Poison Behavior

Performance tests often ignore failure paths.

Add controlled poison rate:

| Poison Rate | Purpose |
|---|---|
| 0% | baseline |
| 0.1% | normal malformed input |
| 1% | upstream bug |
| 10% | incident scenario |

Measure:

  • retry rate;
  • DLQ rate;
  • redelivery rate;
  • queue depth;
  • handler throughput;
  • CPU/logging overhead;
  • alert behavior.

A retry system that performs well only when nothing fails is not production-ready.


39. Benchmarking Observability Overhead

Metrics, logs, and traces cost CPU and allocation.

Run with:

  1. metrics off, logs minimal, tracing off;
  2. metrics on, logs minimal, tracing off;
  3. metrics on, structured logs normal;
  4. metrics on, structured logs normal, tracing sampled;
  5. metrics on, logs verbose, tracing 100%.

This reveals whether observability configuration can take down the message path during incidents.

Rule:

Debug logging in a hot message loop is a denial-of-service vector.


40. Benchmarking TLS

TLS may be required for security, but it changes performance.

Benchmark:

  • plaintext baseline in non-production lab;
  • TLS enabled with realistic certs;
  • TLS with same node placement;
  • TLS with real client count;
  • TLS under connection churn;
  • TLS with large message size.

Measure CPU, latency, and throughput.

Do not disable TLS in production because of a benchmark. Use the benchmark to size CPU and tune connection reuse.


41. Benchmarking Kubernetes Reality

On Kubernetes, benchmark storage and scheduling, not only RabbitMQ.

Capture:

  • storage class;
  • persistent volume type;
  • pod CPU/memory limits;
  • CPU throttling;
  • node affinity/anti-affinity;
  • cross-node network latency;
  • disruption budget;
  • container filesystem behavior;
  • service mesh overhead if any;
  • metrics sidecar overhead if any.

A RabbitMQ benchmark on a developer laptop does not predict Kubernetes performance.


42. CI Performance Guardrails

Not every benchmark belongs in CI.

Use three layers:

| Layer | Frequency | Purpose |
|---|---|---|
| Microbenchmark | every PR | serialization/client helper regressions |
| Small integration benchmark | daily | obvious RabbitMQ interaction regression |
| Full capacity benchmark | release/infra change | production baseline validation |

CI guardrail example:

Fail build if:
- p95 publish helper latency regresses > 20%
- allocation per message regresses > 30%
- consumer handler benchmark throughput drops > 15%

Do not put a 2-hour broker benchmark in every PR. It will be ignored or disabled.
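
The guardrail thresholds above translate directly into a small gate that a CI step could call with baseline and current numbers. `PerfGate` and its method names are hypothetical; how the baseline is stored and how the numbers are measured is left out of this sketch.

```java
final class PerfGate {
    // Percent change of current relative to baseline; positive means the value grew.
    static double changePercent(double baseline, double current) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (current - baseline) / baseline * 100.0;
    }

    // Latency and allocation fail when they grow by more than the allowed percentage.
    static boolean latencyOk(double baselineP95, double currentP95) {
        return changePercent(baselineP95, currentP95) <= 20.0;
    }

    static boolean allocationOk(double baselineBytesPerMsg, double currentBytesPerMsg) {
        return changePercent(baselineBytesPerMsg, currentBytesPerMsg) <= 30.0;
    }

    // Throughput fails when it drops by more than the allowed percentage.
    static boolean throughputOk(double baselinePerSec, double currentPerSec) {
        return -changePercent(baselinePerSec, currentPerSec) <= 15.0;
    }

    public static void main(String[] args) {
        boolean pass = latencyOk(10.0, 11.5)
                && allocationOk(512, 600)
                && throughputOk(1_000, 900);
        System.out.println(pass ? "guardrails passed" : "guardrails failed");
    }
}
```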


43. Benchmark Report Template

Use this report structure:

# Benchmark Report: <name>

## Purpose

## Environment

## RabbitMQ Configuration

## Topology

## Workload

## Tool Command

## Duration and Warmup

## Results Summary

## Latency Distribution

## Broker Metrics

## Client Metrics

## Failure Observations

## Bottleneck Analysis

## Decision

## Follow-up Actions

A benchmark without a decision is just measurement.


44. Common Benchmark Anti-Patterns

| Anti-Pattern | Why It Is Dangerous |
| --- | --- |
| Benchmarking localhost only | hides network and deployment cost |
| Reporting only average latency | hides tail failures |
| Ignoring confirms | tests unsafe publish path |
| Using auto-ack | tests unsafe consume path |
| Tiny message only | hides real payload cost |
| No warmup | captures startup artifacts |
| Too short duration | misses steady-state and GC/storage behavior |
| No broker metrics | cannot identify bottleneck |
| No client metrics | cannot identify producer/consumer bottleneck |
| Changing many variables | results are uninterpretable |
| Ignoring failures | production behavior unknown |
| Comparing queue and stream without semantics | wrong conclusion |
| Using peak rate as capacity | ignores sustainable throughput |
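
The "average latency" anti-pattern is easy to demonstrate with arithmetic. The sketch below uses a made-up sample set in which 2% of messages are slow. The class name is hypothetical, and the percentile uses the nearest-rank definition, which is one of several common conventions.

```java
import java.util.Arrays;

/** Hypothetical helper: contrasts mean latency with a nearest-rank percentile. */
final class TailLatency {

    static double mean(long[] samples) {
        long sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return (double) sum / samples.length;
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up data: 980 messages at 2 ms and 20 messages at 500 ms.
        long[] latenciesMs = new long[1000];
        Arrays.fill(latenciesMs, 0, 980, 2L);
        Arrays.fill(latenciesMs, 980, 1000, 500L);

        System.out.println("mean ms: " + mean(latenciesMs));
        System.out.println("p50 ms:  " + percentile(latenciesMs, 50));
        System.out.println("p99 ms:  " + percentile(latenciesMs, 99));
    }
}
```

With this data the mean stays near 12 ms while p99 lands on the 500 ms outliers, which is the gap an average-only report hides.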

45. Decision Framework

After the benchmark, choose one of five outcomes.

Never accept a benchmark solely because throughput looks good.


46. Production Benchmark Checklist

Before running:

  • Purpose is written.
  • Pass/fail thresholds are written.
  • RabbitMQ version is recorded.
  • Broker topology is recorded.
  • Queue/stream type is explicit.
  • Persistence and replication are explicit.
  • Message size is realistic.
  • Producer behavior is explicit.
  • Consumer behavior is explicit.
  • Prefetch/ack/confirm settings are explicit.
  • Warmup and duration are defined.
  • Broker metrics are captured.
  • Client metrics are captured.
  • Latency percentiles are captured.
  • Failure scenarios are included.
  • Benchmark commands are saved.
  • Result interpretation is reviewed by another engineer.

47. Practice Drill

Build a benchmark program for the capstone platform.

Drill A — Command Queue Baseline

  • quorum queue;
  • 2 KB persistent messages;
  • publisher confirms;
  • manual ack;
  • prefetch 100;
  • target 5k/s, 8k/s, 12k/s;
  • 15-minute run each.

Deliverable:

  • benchmark report;
  • bottleneck analysis;
  • recommended capacity.

Drill B — Event Fanout

  • topic exchange;
  • 1, 5, 10, 20 subscriber queues;
  • 1 KB persistent events;
  • target 5k/s;
  • measure write amplification.

Deliverable:

  • fanout scalability curve;
  • recommendation: queue fanout vs stream fanout.
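
Before running Drill B, it helps to write down the expected write amplification so the measured curve has something to be compared against. The sketch below is a first-order model, assuming every bound subscriber queue receives its own copy of each event and ignoring replication, headers, and protocol overhead. The class and method names are hypothetical.

```java
/** Hypothetical first-order model of queue fanout cost. */
final class FanoutAmplification {

    /** Enqueue operations per second when every publish is routed to each queue. */
    static long enqueuesPerSecond(long publishRate, int subscriberQueues) {
        return publishRate * subscriberQueues;
    }

    /** Payload bytes per second the broker must enqueue across all queues. */
    static long payloadBytesPerSecond(long publishRate, int subscriberQueues, int messageBytes) {
        return enqueuesPerSecond(publishRate, subscriberQueues) * messageBytes;
    }

    public static void main(String[] args) {
        long publishRate = 5_000;   // target from the drill
        int messageBytes = 1024;    // 1 KB events
        for (int queues : new int[] {1, 5, 10, 20}) {
            System.out.println(queues + " queues: "
                    + enqueuesPerSecond(publishRate, queues) + " enqueues/s, "
                    + payloadBytesPerSecond(publishRate, queues, messageBytes) + " payload bytes/s");
        }
    }
}
```

If the measured broker load grows faster than this linear model, the extra cost is a finding to explain in the bottleneck analysis.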

Drill C — Super Stream Replay

  • 6 partitions;
  • 100 million events;
  • replay from beginning;
  • measure rebuild time;
  • measure live publish impact.

Deliverable:

  • replay capacity estimate;
  • partition hot spot analysis.
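
For Drill C, a rough replay estimate and a skew figure can be computed from per-partition event counts. The sketch below assumes partitions are replayed in parallel at the same per-partition rate, so the largest partition determines rebuild time. The class name, the rate, and the partition counts in `main` are illustrative assumptions, and the input array is expected to be non-empty.

```java
/** Hypothetical helper: estimates super stream replay time and partition skew. */
final class ReplayEstimate {

    /** Seconds until the largest partition finishes replaying. */
    static double replaySeconds(long[] eventsPerPartition, double eventsPerSecondPerPartition) {
        long largest = 0;
        for (long n : eventsPerPartition) {
            largest = Math.max(largest, n);
        }
        return largest / eventsPerSecondPerPartition;
    }

    /** Hot spot indicator: largest partition divided by mean partition size (1.0 means even). */
    static double skew(long[] eventsPerPartition) {
        long total = 0;
        long largest = 0;
        for (long n : eventsPerPartition) {
            total += n;
            largest = Math.max(largest, n);
        }
        double mean = (double) total / eventsPerPartition.length;
        return largest / mean;
    }

    public static void main(String[] args) {
        // 100 million events over 6 partitions, with one partition deliberately hot.
        long[] partitions = {25_000_000L, 15_000_000L, 15_000_000L,
                             15_000_000L, 15_000_000L, 15_000_000L};
        double assumedRate = 250_000.0; // events/s per partition, an assumption to replace with measured data
        System.out.println("replay seconds: " + replaySeconds(partitions, assumedRate));
        System.out.println("skew: " + skew(partitions));
    }
}
```

A skew well above 1.0 means the routing key distribution, not broker capacity, is limiting replay time.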

Drill D — Failure Test

  • run balanced workload;
  • restart a broker node;
  • pause consumers for 10 minutes;
  • resume consumers;
  • verify drain time and duplicate behavior.

Deliverable:

  • failure timeline;
  • runbook improvement list.
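
The drain time to verify in Drill D can be predicted up front. The sketch below assumes a constant publish rate during and after the pause and a constant consumer capacity once consumers resume, so the backlog only shrinks by the difference between the two. Names and numbers are hypothetical.

```java
/** Hypothetical helper: predicts backlog size and drain time after a consumer pause. */
final class BacklogDrain {

    /** Messages accumulated while consumers are paused. */
    static long backlog(long publishRate, long pauseSeconds) {
        return publishRate * pauseSeconds;
    }

    /** Seconds to drain the backlog while publishing continues; infinite without headroom. */
    static double drainSeconds(long backlogMessages, long publishRate, long consumeRate) {
        long headroom = consumeRate - publishRate;
        if (headroom <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (double) backlogMessages / headroom;
    }

    public static void main(String[] args) {
        long publishRate = 5_000;   // msg/s, illustrative
        long consumeRate = 8_000;   // msg/s consumer capacity, illustrative
        long pauseSeconds = 10 * 60; // the 10-minute pause from the drill

        long queued = backlog(publishRate, pauseSeconds);
        System.out.println("backlog: " + queued + " messages");
        System.out.println("drain seconds: " + drainSeconds(queued, publishRate, consumeRate));
    }
}
```

Compare the prediction with the measured timeline; a much longer real drain usually points at a consumer, prefetch, or broker limit that the steady-state run did not expose.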

48. Self-Correction Rubric

You understand RabbitMQ benchmarking when you can answer these without guessing:

  1. What exact production question does this benchmark answer?
  2. What safety semantics were enabled: confirms, persistence, durable topology, manual ack?
  3. What variable changed between runs?
  4. Was the result sustainable or only burst?
  5. What was the p99 latency under steady state?
  6. What happened to p99 during backlog drain?
  7. What was the broker bottleneck?
  8. What was the client bottleneck?
  9. How did message size affect capacity?
  10. How did prefetch affect throughput and redelivery risk?
  11. How did replication affect confirm latency?
  12. How did fanout amplify broker work?
  13. How quickly can backlog drain?
  14. How quickly can a stream consumer replay?
  15. What decision did the benchmark support?

If you cannot answer these, rerun the benchmark with better instrumentation.


49. Key Takeaways

  • PerfTest answers AMQP 0-9-1 broker/topology capacity questions.
  • Stream PerfTest answers stream protocol, stream, and super stream capacity questions.
  • Benchmarks must be designed around production questions, not vanity throughput.
  • Publisher confirms, persistence, replication, manual ack, and prefetch materially change results.
  • Sustainable throughput matters more than peak throughput.
  • p99 latency matters more than average latency.
  • Backlog drain and failure behavior are part of performance, not separate concerns.
  • A good benchmark produces a decision and a reproducible report.
