Failure Modelling: Network Split, Broker Restart, Consumer Crash, Duplicate Storm
Learn Java RabbitMQ, RabbitMQ Streams, Patterns, and Deployment In Action - Part 015
Production-grade failure modelling for Java RabbitMQ systems, covering network interruption, broker restart, quorum leader failure, producer ambiguity, consumer crash, duplicate storms, redelivery loops, chaos testing, and operational runbooks.
Part 015 — Failure Modelling: Network Split, Broker Restart, Consumer Crash, Duplicate Storm
A RabbitMQ design is not production-grade until it has been reasoned through under failure.
The beginner question is:
How do I publish and consume a message?
The senior question is:
What exactly happens if the producer times out after publishing, the broker restarts before confirm arrives, the consumer commits to the database but crashes before ack, and the message is redelivered to another instance?
This part builds a failure-first model for Java RabbitMQ systems. We will not treat failure as an afterthought. We will treat it as the primary design surface.
1. Kaufman Deconstruction
Following Kaufman's skill acquisition model, we deconstruct failure modelling into small sub-skills:
- Lifecycle state modelling — represent message progress explicitly.
- Ambiguity detection — identify states where the client cannot know what happened.
- Failure classification — separate transient, permanent, poison, overload, and operator-induced failures.
- Recovery mechanism mapping — know which failures are handled by client recovery, confirms, redelivery, DLX, idempotency, or manual intervention.
- Duplicate containment — assume duplicate delivery and prevent duplicate effects.
- Chaos verification — prove the model by killing processes, blocking networks, restarting brokers, and delaying dependencies.
The goal is not to predict every possible failure. The goal is to design so the system remains correct under the failures that matter.
2. Core Reliability Invariant
For RabbitMQ-backed systems, the core invariant is:
A message may be delayed, redelivered, reordered, or duplicated, but it must not silently disappear and it must not produce an invalid business state.
This has two parts:
| Concern | Engineering Response |
|---|---|
| Do not silently lose messages | durable topology, persistent messages, publisher confirms, quorum queues where appropriate |
| Do not corrupt business state under duplicate/retry | idempotent consumer, dedup store, business invariant checks, transactional boundaries |
RabbitMQ can help with durability and delivery. It cannot make your business operation idempotent for you.
3. Message Lifecycle State Machine
Before modelling failure, define normal flow.
The most dangerous states are not the obvious failures. They are the ambiguous states:
- producer does not know whether broker accepted the message
- consumer does not know whether downstream operation committed
- broker redelivers after connection loss even though consumer may have finished local work
- retry logic cannot distinguish transient dependency failure from poison payload
Production reliability is mostly the art of making ambiguity safe.
4. Failure Taxonomy
A failure taxonomy prevents the team from using one retry strategy for everything.
| Failure Type | Example | Retry? | Preferred Handling |
|---|---|---|---|
| Transient dependency | HTTP 503, database connection timeout | yes | delayed retry with bounded attempts |
| Overload | DB pool exhausted, executor saturated | maybe | backpressure, pause, requeue carefully, circuit breaker |
| Permanent validation | unknown enum, missing required field | no | reject/dead-letter/parking lot |
| Poison message | payload always crashes handler | no automatic infinite retry | DLQ + investigation |
| Ambiguous publish | publish timeout before confirm | maybe | retry with idempotency key/outbox |
| Ambiguous consume | DB commit succeeded but ack failed | no blind side effect | idempotent consumer then ack on redelivery |
| Operator-induced | queue deleted, permission changed | no | fail fast, alert, restore topology |
| Infrastructure | broker restart, leader election, network partition | automatic + bounded | recovery + monitoring + chaos drill |
Rule:
Retry is a correctness mechanism only when the operation is idempotent or duplicate-safe.
5. Producer Failure Model
5.1 Producer Normal Flow
The producer should only mark a publish as broker-accepted after confirm.
5.2 Producer Failure Matrix
| Failure Point | What Producer Knows | What May Have Happened | Required Design |
|---|---|---|---|
Before basicPublish leaves process | not sent | no broker effect | retry safely |
| After network write, before broker receives | ambiguous | broker may not have seen it | retry with idempotency |
| Broker receives, persists, but confirm lost | ambiguous | message is in queue, producer thinks failed | retry may duplicate |
| Broker rejects unroutable mandatory publish | knows returned | not routed | handle return as publish failure |
| Broker unavailable before publish | knows unavailable | no message accepted | retry/backoff/outbox |
| Confirm timeout | ambiguous | accepted or not accepted | do not assume failure means no effect |
The critical lesson:
A publish timeout is not proof that the message was not published.
Therefore, a correct producer uses one of these patterns:
- Transactional outbox — database transaction writes business state and outbox row; relay publishes with confirms; duplicate publish is handled by message id/idempotent consumer.
- Producer-side idempotency key — every message has stable
messageIdderived from business operation id. - Broker-side stream deduplication — for RabbitMQ Streams use producer name + publishing id where applicable.
- Business-level duplicate acceptance — the operation itself is safe if repeated.
6. Consumer Failure Model
6.1 Consumer Normal Flow
The ack happens after the business transaction commits.
6.2 Consumer Failure Matrix
| Failure Point | Broker State | Business State | Consequence | Correct Handling |
|---|---|---|---|---|
| Crash before processing | unacked delivery lost with connection | not changed | redelivery | process normally |
| Crash during processing before commit | unacked | rolled back/partial external effect possible | redelivery | transaction + idempotency |
| Crash after DB commit before ack | unacked | committed | redelivery duplicate | dedup detects already processed then ack |
| Handler throws transient error | unacked | not committed | retry | nack/retry topology |
| Handler throws permanent error | unacked | not committed | poison | reject/dead-letter |
| Ack fails due to lost connection | broker may redeliver | maybe committed | duplicate | idempotent consumer |
The most important row is this:
Crash after commit before ack causes redelivery of an already-processed message.
If your consumer cannot survive that, it is not production-safe.
7. Idempotent Consumer as Failure Absorber
A RabbitMQ consumer should treat duplicate delivery as normal.
A common implementation is a processed-message table:
CREATE TABLE processed_message (
consumer_name VARCHAR(128) NOT NULL,
message_id VARCHAR(128) NOT NULL,
processed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
business_key VARCHAR(128),
PRIMARY KEY (consumer_name, message_id)
);
Pseudo-flow:
void handle(Delivery delivery) {
String messageId = delivery.properties().getMessageId();
transactionTemplate.executeWithoutResult(tx -> {
boolean firstTime = processedMessageRepository.tryInsert("billing-worker", messageId);
if (!firstTime) {
return; // already applied previously
}
BillingCommand command = decode(delivery.body());
billingService.apply(command);
});
channel.basicAck(delivery.envelope().getDeliveryTag(), false);
}
Important implementation note:
- The dedup insert and business mutation must be in the same transaction when possible.
- If the business effect is external and non-transactional, use provider idempotency keys.
- Do not ack before the idempotency record and business effect are safe.
8. Broker Restart Model
A broker restart affects different message classes differently.
| Topology/Message | Survives Restart? | Notes |
|---|---|---|
| transient queue | no | deleted when broker stops or connection ends depending on declaration |
| durable queue | yes | queue metadata survives |
| non-persistent message in durable queue | not guaranteed | message can be lost on restart |
| persistent message in durable queue | intended to survive | use publisher confirms for accepted responsibility |
| quorum queue message | replicated/durable | confirm after quorum conditions |
| unacked delivery | returns to queue after connection/channel loss | may be redelivered |
The beginner mistake is assuming that deliveryMode=2 alone is enough.
The durable path requires:
- durable exchange/queue as needed
- persistent message
- publisher confirms
- no unsafe auto-ack consumer
- idempotent consumer
- appropriate queue type for safety requirements
9. Quorum Queue Leader Failure
Quorum queues are replicated queues. They are useful when message safety matters more than minimum latency.
Mental model:
If the leader fails, another replica can be elected. The client may see interruption, timeout, connection close, or confirm delay.
Producer implication:
- confirms may be delayed
- publish timeout can be ambiguous
- retry can duplicate
- bounded in-flight confirms are essential
Consumer implication:
- delivery can pause during failover
- unacked deliveries can be redelivered
- processing must be idempotent
- monitoring must distinguish temporary failover from sustained outage
Design rule:
Quorum queues improve broker-side safety; they do not remove the need for producer confirms or idempotent consumers.
10. Network Interruption and Automatic Recovery
RabbitMQ Java client supports automatic connection recovery when enabled. Recovery is helpful, but it is not a correctness proof.
Recovery may restore:
- connection
- channels
- exchanges
- queues
- bindings
- consumers
But the application must still handle:
- messages published during connection outage
- confirms that were in-flight when the connection failed
- consumer deliveries that were unacked when channel closed
- topology declaration drift
- duplicated delivery after recovery
- executor tasks still running while broker has already requeued unacked messages
10.1 Recovery Race
This is why deduplication belongs at the business boundary, not merely inside the RabbitMQ client callback.
11. Network Split and Cluster Partition Thinking
In distributed systems, network partition means nodes can disagree about reachability. The client-side symptom may be:
- connection timeout
- blocked publish
- confirm delay
- consumer cancellation
- leader unavailability
- temporary throughput collapse
Do not model network split as a clean binary failure. Model it as partial progress with ambiguous visibility.
| Component | Bad Assumption | Safer Assumption |
|---|---|---|
| Producer | timeout means message failed | timeout means outcome unknown |
| Consumer | lost connection means local work stopped | local work may continue while broker redelivers |
| Broker | cluster failover is invisible | clients may see delay, close, cancellation, redelivery |
| Operator | restart is harmless | restart can amplify duplicates and lag |
12. Duplicate Storm Model
A duplicate storm happens when many messages are redelivered and retried faster than the system can absorb.
Common causes:
- Consumer crashes after committing but before acking many messages.
- Prefetch is too high, so one failing consumer owns too many unacked deliveries.
- Retry requeues immediately into the same hot queue.
- Poison message repeatedly crashes consumers.
- Publisher retries ambiguous publishes without stable idempotency keys.
- Broker restart causes many unacked messages to return at once.
12.1 Duplicate Storm Dynamics
12.2 Containment Strategy
| Layer | Control |
|---|---|
| Producer | stable message id, bounded retry, outbox relay |
| Queue | retry delay, delivery limit, DLX, queue length policy |
| Consumer | low enough prefetch, idempotency, bounded executor |
| Database | unique constraints, transaction isolation, indexed dedup table |
| Operations | alert on redelivery rate, DLQ growth, consumer exception rate |
A duplicate storm is not solved by one setting. It is solved by layered containment.
13. Redelivery Loop Model
A redelivery loop is a special failure where the same message cycles endlessly.
Bad implementation:
try {
process(delivery);
channel.basicAck(tag, false);
} catch (Exception e) {
channel.basicNack(tag, false, true); // immediate requeue forever
}
This can create hot loops.
Better approach:
try {
process(delivery);
channel.basicAck(tag, false);
} catch (PermanentMessageException e) {
channel.basicReject(tag, false); // dead-letter if DLX configured
} catch (TransientDependencyException e) {
republishToDelayedRetry(delivery, nextAttempt(delivery));
channel.basicAck(tag, false); // ack original after retry copy is safe
} catch (Throwable t) {
channel.basicReject(tag, false); // fail closed, investigate
}
The exact implementation depends on your retry topology, but the invariant is stable:
Never let an unclassified error create infinite immediate requeue.
14. Consumer Crash During Parallel Processing
A common architecture uses the RabbitMQ consumer callback to submit work to an executor.
The risks:
- callback returns before processing is complete
- channel is shared unsafely across worker threads
- application shuts down while tasks are still running
- unacked delivery is redelivered while old task still commits
- ack is attempted on closed channel
Production rules:
- Use manual ack.
- Bound prefetch to match executor capacity.
- Avoid unsafe shared channel usage across arbitrary worker threads.
- Track in-flight tasks.
- On shutdown, stop consuming first, drain tasks, then close connection.
- Treat ack failure after commit as duplicate risk, not data loss.
15. Failure-Tolerant Java Consumer Skeleton
public final class SafeRabbitConsumer {
private final Channel channel;
private final ExecutorService workers;
private final AtomicBoolean stopping = new AtomicBoolean(false);
public void start(String queue) throws IOException {
channel.basicQos(50);
channel.basicConsume(queue, false, (consumerTag, delivery) -> {
if (stopping.get()) {
channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
return;
}
workers.submit(() -> processSafely(delivery));
}, consumerTag -> {
// consumer cancellation handler
});
}
private void processSafely(Delivery delivery) {
long tag = delivery.getEnvelope().getDeliveryTag();
try {
handleIdempotently(delivery);
synchronized (channel) {
channel.basicAck(tag, false);
}
} catch (PermanentMessageException e) {
safeReject(tag);
} catch (TransientDependencyException e) {
safeRetry(delivery, tag);
} catch (Throwable t) {
safeReject(tag);
}
}
private void safeReject(long tag) {
try {
synchronized (channel) {
channel.basicReject(tag, false);
}
} catch (IOException ignored) {
// Connection loss means broker may redeliver. Idempotency must absorb it.
}
}
private void safeRetry(Delivery delivery, long tag) {
// Implementation-specific: publish to delayed retry exchange with confirm,
// then ack original only after retry copy is confirmed.
}
}
This skeleton is intentionally incomplete. The real lesson is the sequencing:
- process idempotently
- commit business state
- ack after commit
- when ack fails, rely on idempotency
- when retrying by republish, confirm retry copy before acking original
16. Failure-Tolerant Producer Skeleton
public final class ConfirmingPublisher {
private final Channel channel;
private final Semaphore inFlight = new Semaphore(10_000);
private final ConcurrentNavigableMap<Long, PendingMessage> pending = new ConcurrentSkipListMap<>();
public void start() throws IOException {
channel.confirmSelect();
channel.addConfirmListener(this::handleAck, this::handleNack);
channel.addReturnListener(returned -> {
// mandatory publish could not be routed
markReturned(returned.getProperties().getMessageId(), returned.getReplyText());
});
}
public void publish(OutboundMessage message) throws Exception {
inFlight.acquire();
long seqNo = channel.getNextPublishSeqNo();
pending.put(seqNo, new PendingMessage(message, System.nanoTime()));
try {
channel.basicPublish(
message.exchange(),
message.routingKey(),
true, // mandatory
message.properties(),
message.body()
);
} catch (IOException e) {
pending.remove(seqNo);
inFlight.release();
throw e;
}
}
private void handleAck(long seqNo, boolean multiple) {
complete(seqNo, multiple, true);
}
private void handleNack(long seqNo, boolean multiple) {
complete(seqNo, multiple, false);
}
private void complete(long seqNo, boolean multiple, boolean ack) {
var confirmed = multiple
? pending.headMap(seqNo, true)
: pending.subMap(seqNo, true, seqNo, true);
int count = confirmed.size();
confirmed.clear();
inFlight.release(count);
if (!ack) {
// Retry through outbox or duplicate-safe producer strategy.
}
}
}
The key controls are:
- stable message identity
- bounded in-flight confirms
- mandatory return handling
- timeout handling for pending confirms
- outbox or retry storage
17. Failure Injection Practice
Kaufman's model emphasizes practice that produces rapid feedback. For RabbitMQ, that means failure drills.
Drill 1 — Producer Timeout Ambiguity
Setup:
- publish with confirms
- block network after publish but before confirm callback
- retry publish
- verify duplicate-safe consumer behavior
Expected result:
- no lost message
- duplicate is possible
- business state remains valid
Drill 2 — Consumer Crash After Commit Before Ack
Setup:
- consumer writes DB transaction
- deliberately crash process before
basicAck - restart consumer
- verify redelivery is deduplicated
Expected result:
- message is redelivered
- handler detects already processed
- ack succeeds
- no duplicate business effect
Drill 3 — Broker Restart With Unacked Messages
Setup:
- set prefetch to 100
- start processing messages slowly
- restart broker or kill connection
- observe unacked redelivery
Expected result:
- temporary interruption
- redelivery occurs
- duplicate-safe processing
- no unbounded retry storm
Drill 4 — Poison Message
Setup:
- send invalid payload
- consumer classifies validation failure
- verify message goes to DLQ/parking lot
Expected result:
- no immediate infinite requeue
- DLQ metadata includes reason and attempt count
- alert fires if poison rate exceeds threshold
Drill 5 — Downstream Slowdown
Setup:
- delay DB/API by 5 seconds
- keep producer load constant
- observe queue depth, unacked count, confirm latency, executor saturation
Expected result:
- producer backpressure or admission control activates
- consumer does not pull unbounded work
- system degrades predictably
18. Failure Observability
A failure model is useless if production telemetry cannot validate it.
18.1 Producer Metrics
| Metric | Meaning |
|---|---|
| publish rate | ingress load |
| confirm latency | broker acceptance pressure |
| in-flight confirms | producer risk window |
| returned messages | routing failures |
| nack count | broker negative confirm |
| publish timeout count | ambiguous publish rate |
| outbox relay lag | producer recovery backlog |
18.2 Consumer Metrics
| Metric | Meaning |
|---|---|
| delivery rate | broker-to-consumer throughput |
| ack rate | completed processing throughput |
| nack/reject rate | failure classification volume |
| redelivery rate | duplicate/retry pressure |
| processing latency | handler cost |
| dedup hit rate | duplicate absorption rate |
| unacked count | in-flight risk |
| DLQ depth | poison/permanent failure accumulation |
18.3 Broker Metrics
| Metric | Meaning |
|---|---|
| queue depth | backlog |
| memory alarm | broker pressure |
| disk alarm | persistence pressure |
| connection churn | instability |
| channel count | resource usage |
| quorum leader changes | failover/maintenance impact |
| consumer utilization | ability to deliver to consumers |
19. Operational Runbooks
19.1 Queue Depth Growing
Diagnose:
- Is publish rate higher than ack rate?
- Are consumers alive?
- Is prefetch too low or too high?
- Is DB/API latency increasing?
- Are messages stuck due to poison payload?
- Are publisher confirms slowing down?
Respond:
- scale consumers only if downstream capacity exists
- pause or throttle producers if downstream is saturated
- inspect DLQ and redelivery rate
- do not purge unless business owner approves data loss
19.2 Redelivery Rate Spike
Diagnose:
- Did consumers restart?
- Did broker restart/fail over?
- Are consumers crashing after commit?
- Is retry topology immediately requeueing?
- Is there a poison message?
Respond:
- reduce consumer concurrency if downstream is overloaded
- move poison messages to parking lot
- verify dedup table performance
- inspect recent deployments
19.3 Confirm Latency Spike
Diagnose:
- Is disk slow?
- Are quorum queues waiting for replication?
- Is broker under memory/disk alarm?
- Is network between nodes degraded?
- Did message size increase?
Respond:
- throttle producers
- check broker disk and inter-node network
- inspect queue type and replication state
- avoid disabling confirms as a quick fix
19.4 DLQ Spike
Diagnose:
- Is error permanent or transient?
- Is schema compatibility broken?
- Did a producer deploy new payload shape?
- Is downstream returning new error codes?
- Are retries exhausted too quickly?
Respond:
- stop offending producer if needed
- sample DLQ messages safely
- classify failure reason
- replay only after fixing consumer or data
20. Failure Design Review Checklist
Before approving a RabbitMQ architecture, answer these questions:
Producer
- Are messages assigned stable IDs?
- Are publisher confirms enabled for important messages?
- Is in-flight publish count bounded?
- Are returned mandatory messages handled?
- Is publish timeout treated as ambiguous?
- Is there an outbox or durable retry mechanism?
Broker
- Are exchanges/queues declared durably where required?
- Are messages persistent where required?
- Is the queue type appropriate: classic, quorum, stream?
- Is DLX configured through policy where possible?
- Is queue growth bounded or operationally monitored?
Consumer
- Is manual ack used?
- Does ack happen only after business commit?
- Is the handler idempotent?
- Is dedup storage transactional with business mutation?
- Is prefetch aligned with processing capacity?
- Are permanent and transient failures separated?
- Can shutdown drain in-flight tasks?
Operations
- Are redelivery, DLQ, confirm latency, queue depth, and consumer lag monitored?
- Are there runbooks for duplicate storm and poison messages?
- Are chaos drills automated?
- Can parking lot messages be replayed safely?
21. Mental Model Summary
RabbitMQ failure modelling is built on five truths:
- Timeout means unknown, not failure.
- Redelivery is normal, not exceptional.
- Duplicate processing must be harmless.
- Ack after commit, never before.
- Retry only when duplicate-safe and classified.
A top-tier engineer does not merely configure RabbitMQ. They define and test the invariants that keep the business correct when RabbitMQ, the network, the JVM, and downstream systems behave imperfectly.
22. Practice Assignment
Build a Java worker that processes InvoiceIssued commands from a quorum queue.
Required behavior:
- Producer uses publisher confirms and stable message IDs.
- Consumer uses manual ack.
- Consumer writes to
processed_messageandinvoice_projectionin one transaction. - Crash the consumer after DB commit but before ack.
- Verify duplicate redelivery causes no duplicate invoice projection.
- Add a poison message and verify it goes to DLQ.
- Add a transient DB failure and verify delayed retry.
- Add metrics for redelivery rate, dedup hit rate, DLQ count, and processing latency.
Completion criteria:
- no message loss under broker restart
- no duplicate business effect under consumer crash
- no infinite immediate retry loop
- operator can see what happened through metrics/logs
References
- RabbitMQ Reliability Guide: https://www.rabbitmq.com/docs/reliability
- RabbitMQ Java Client API Guide: https://www.rabbitmq.com/client-libraries/java-api-guide
- RabbitMQ Publisher Confirms and Consumer Acknowledgements: https://www.rabbitmq.com/docs/confirms
- RabbitMQ Quorum Queues: https://www.rabbitmq.com/docs/quorum-queues
- RabbitMQ Dead Letter Exchanges: https://www.rabbitmq.com/docs/dlx
- RabbitMQ Flow Control: https://www.rabbitmq.com/docs/flow-control
You just completed lesson 15 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.