title: Learn Java Messaging and Event Streaming - Part 021
description: Kafka error handling, retry strategy, dead-letter topics, poison events, quarantine, replay, ordering trade-offs, and production recovery design.
series: learn-java-messaging-event-streaming
seriesTitle: Learn Java Messaging and Event Streaming
order: 21
partTitle: Kafka Error Handling, Retry, DLQ, and Poison Events
tags:
- java
- kafka
- apache-kafka
- error-handling
- retry
- dead-letter-queue
- dlq
- poison-message
- resilience
- reliability
- event-streaming
- distributed-systems
date: 2026-06-28
Part 021 — Kafka Error Handling: Retries, DLQ, Retry Topics, Poison Events, and Quarantine
Goals of This Part
Kafka is not a traditional queue that automatically “moves failed messages” somewhere else. Kafka is a retained, partitioned log. Error handling is an application design built on top of that log.
This part covers how to build production-safe Kafka error handling: no lost events, no infinite retry storms, no partition held up without reason, and audit and replay still possible.
After this part, you should be able to:
- Distinguish transient, deterministic, poison, infrastructure, and business-rejection errors.
- Design a retry policy that does not block the poll loop for too long.
- Choose between fail-fast, inline retry, retry topic, DLQ, and quarantine.
- Preserve ordering when errors occur.
- Design a DLQ envelope that carries enough information for debugging and replay.
- Avoid anti-patterns such as infinite retry, commit-before-side-effect, and a DLQ without an owner.
- Write runbooks for replay, quarantine release, and incident recovery.
1. Core Mental Model
In Kafka, a consumer reads from a partition by offset.
When processing fails, three facts matter:
- the record stays in Kafka until retention deletes it;
- the consumer's in-memory position may already have advanced because poll() has fetched the record;
- the committed offset determines where the consumer group resumes after a restart or rebalance.
The most common design mistake is treating offset commit as error handling. An offset commit only stores progress. It does not answer these questions:
- should the failed record be tried again?
- when does retrying stop?
- may the partition move on to the next record?
- where is the error stored?
- who fixes the failed data?
- how is replay performed?
Kafka provides the primitives. The error policy remains the application's responsibility.
2. Kafka Consumer Error Taxonomy
Before choosing a retry mechanism, classify the error.
| Error type | Example | Retry suitable? | DLQ suitable? | Notes |
|---|---|---|---|---|
| Transient infrastructure | DB timeout, HTTP 503, network reset | Yes | If the retry budget is exhausted | Do not send straight to DLQ |
| Rate/pressure failure | downstream overloaded, 429, pool exhausted | Yes, with backoff | Possible if it lasts too long | Needs throttling |
| Deterministic data error | invalid enum, missing required field | Rarely useful | Yes | Data/schema must be fixed |
| Poison event | event always crashes the consumer | No, once the failure is shown to be deterministic | Yes/quarantine | Must be isolated quickly |
| Business rejection | case closed, illegal transition | Depends on the domain | Can go to a rejected topic | Not always a technical error |
| Serialization/deserialization | schema incompatible, bad bytes | Not in the normal consumer | Yes, raw DLQ | Needs a dedicated handler |
| Programming bug | NullPointerException, bad mapping | Not until a hotfix ships | Quarantine | Do not mask bugs with unbounded retry |
| Ordering dependency missing | event B arrives before A | Retry/delay | Can use a parking lot | Needs a sequence/state model |
Mental model:
retry helps when waiting changes the outcome.
retry does not help when the input is invalid or the code is wrong.
3. Processing Outcome Matrix
Every polled record must end in exactly one of the following outcomes.
| Outcome | Meaning | Offset commit? | Evidence |
|---|---|---|---|
| Success | Business side effect completed | Yes | Output written, DB updated, audit recorded |
| Skipped by policy | Record is valid but not relevant | Yes | Reason logged/metric |
| Retriable failure | Will be retried | No, or commit after the record is moved to a retry topic | Retry metadata |
| Dead-lettered | Will not go through normal processing | Yes, after the DLQ write succeeds | DLQ record |
| Quarantined | Needs manual/controlled review | Yes, after the quarantine write succeeds | Quarantine record |
| Fatal application failure | Consumer must stop | No | Alert + crash |
Production invariant:
Never commit past a record unless its outcome is durably known.
“Durably known” means one of the following:
- the side effect has succeeded;
- the output topic has received the result event;
- the retry topic has received the retry event;
- the DLQ/quarantine topic has received the failed event;
- the record is explicitly agreed to be skipped, with an audit reason.
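The invariant above can be captured as a small commit gate. This is a minimal sketch, not a prescribed implementation: the `Outcome` enum and the `CommitGate.mayCommit` helper are hypothetical names that mirror the outcome matrix, and `handoffAcked` is assumed to mean that the write to the retry, DLQ, or quarantine topic has been acknowledged.

```java
/** Hypothetical outcome names mirroring the processing outcome matrix. */
enum Outcome { SUCCESS, SKIPPED_BY_POLICY, RETRIABLE_FAILURE, DEAD_LETTERED, QUARANTINED, FATAL }

final class CommitGate {
    private CommitGate() {}

    /**
     * Returns true only when the record's outcome is durably known.
     * handoffAcked: the retry/DLQ/quarantine write for this record was acknowledged.
     */
    static boolean mayCommit(Outcome outcome, boolean handoffAcked) {
        return switch (outcome) {
            // The side effect or the audited skip is itself the durable evidence.
            case SUCCESS, SKIPPED_BY_POLICY -> true;
            // The record may only be passed once it durably lives somewhere else.
            case RETRIABLE_FAILURE, DEAD_LETTERED, QUARANTINED -> handoffAcked;
            // The consumer stops; the offset stays where it is.
            case FATAL -> false;
        };
    }
}
```

Routing every commit through one such check makes it harder for a new code path to commit an offset without evidence.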
4. Five Kafka Error Handling Patterns
4.1 Pattern A — Stop on Error
The consumer stops when a record fails.
Suitable for:
- strict ordering;
- financial ledger;
- state machines that must not skip events;
- regulatory lifecycle events that must be processed in order.
Advantages:
- strongest ordering guarantee;
- failures surface quickly;
- does not hide bad data.
Drawbacks:
- a single poison event can hold up a partition;
- needs fast manual intervention;
- low availability if errors are frequent.
Use it when the ordering invariant matters more than throughput.
4.2 Pattern B — Inline Retry
The consumer retries on the same processing thread.
Suitable for:
- short transient errors;
- downstream hiccups lasting a few seconds;
- low-volume consumers;
- cases where ordering must be preserved.
Risks:
- the poll loop can take too long;
- max.poll.interval.ms can be exceeded;
- a rebalance can be triggered;
- the consumer group becomes unstable;
- the worker thread is blocked.
Rule of thumb:
Inline retry may be short. Long retries must leave the poll loop.
A simple Java example:
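// Assumes process(...), publishToRetryTopic(...), and backoff(...) exist in the enclosing class,
// and that the enclosing method declares InterruptedException for Thread.sleep.
TopicPartition tp = new TopicPartition(record.topic(), record.partition());
int attempts = 0;
while (true) {
    try {
        process(record);
        // Success: mark the record as done by committing the next offset.
        consumer.commitSync(Map.of(tp, new OffsetAndMetadata(record.offset() + 1)));
        break;
    } catch (TransientDependencyException ex) {
        attempts++;
        if (attempts >= 3) {
            // Inline budget exhausted: hand the record to the retry topic, then commit.
            publishToRetryTopic(record, ex, attempts);
            consumer.commitSync(Map.of(tp, new OffsetAndMetadata(record.offset() + 1)));
            break;
        }
        Thread.sleep(backoff(attempts));
    }
}
The code above is only safe if publishToRetryTopic is durable (it returns only after the broker has acknowledged the write) and the error type is genuinely retriable.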
4.3 Pattern C — Retry Topic
A failed record is published to a retry topic with attempt and due-time metadata. The main consumer commits the offset after the retry-topic write succeeds.
Suitable for:
- retries that last longer than a few seconds;
- high-throughput consumers;
- downstream outages;
- keeping the main partition moving;
- separate observability for retries.
Trade-offs:
- ordering can break because later records proceed first;
- the topology is more complex;
- retry metadata must be kept disciplined;
- the retry consumer can be flooded when the dependency recovers.
A retry topic must have a contract; it is not a dumping ground.
4.4 Pattern D — Dead Letter Topic / DLQ
Records that cannot be processed after a defined policy has run its course are sent to the DLQ.
A DLQ is suitable for:
- invalid payload;
- schema incompatible;
- max retry exceeded;
- poison event;
- unknown exceptions after the circuit breaker policy;
- manual review.
A DLQ is not suitable if:
- there is no owner;
- there is no alert;
- there is no replay process;
- retention is insufficient;
- there is no metadata describing the failure cause.
Invariant:
A DLQ without ownership is delayed data loss.
4.5 Pattern E — Quarantine / Parking Lot
Quarantine is a more controlled DLQ, typically for records that are business-critical and must not simply be ignored.
Suitable for:
- regulatory case event;
- enforcement lifecycle transition;
- payment/settlement;
- legal/audit record;
- high-value customer operations.
Distinguish between:
| Topic | Purpose |
|---|---|
| retry topic | automatic delayed retry |
| DLQ | failed terminal processing |
| quarantine | failed record requiring controlled decision |
| rejected topic | business decision that input is invalid/denied |
| parking lot | temporarily held until dependency/state exists |
5. Inline Retry vs Retry Topic vs DLQ
| Concern | Inline retry | Retry topic | DLQ/quarantine |
|---|---|---|---|
| Short transient failure | Good | Overkill | Bad |
| Long dependency outage | Bad | Good | Only after budget |
| Ordering preservation | Good | Weak unless partition blocked | Depends on replay |
| Poll loop stability | Risky if long | Good | Good |
| Operational visibility | Medium | High | High |
| Complexity | Low | Medium/high | Medium |
| Duplicate risk | Medium | Medium/high | Medium/high |
| Replay support | Low | Medium | Must be high |
Decision heuristic:
If waiting milliseconds/seconds can fix it, inline retry.
If waiting minutes can fix it, retry topic.
If waiting will not fix it, DLQ/quarantine.
If ordering cannot be violated, stop partition or use ordered parking strategy.
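The heuristic above can be written down as a small decision function. This is only a sketch under stated assumptions: the `Strategy` enum, the `StrategyChooser.choose` signature, and the 10-second cut-off between "seconds" and "minutes" are illustrative choices, not values taken from Kafka or from this series.

```java
import java.time.Duration;

/** Hypothetical strategy names for the four branches of the heuristic. */
enum Strategy { INLINE_RETRY, RETRY_TOPIC, DLQ_OR_QUARANTINE, STOP_PARTITION }

final class StrategyChooser {
    // Assumed boundary between "waiting seconds" and "waiting minutes"; tune per consumer.
    private static final Duration INLINE_LIMIT = Duration.ofSeconds(10);

    private StrategyChooser() {}

    static Strategy choose(boolean waitingCanFix, Duration expectedRecovery, boolean strictOrdering) {
        // Short waits stay inline, which also preserves ordering.
        if (waitingCanFix && expectedRecovery.compareTo(INLINE_LIMIT) <= 0) {
            return Strategy.INLINE_RETRY;
        }
        // Any other option lets later records overtake the failed one.
        if (strictOrdering) {
            return Strategy.STOP_PARTITION;
        }
        return waitingCanFix ? Strategy.RETRY_TOPIC : Strategy.DLQ_OR_QUARANTINE;
    }
}
```

For example, `choose(true, Duration.ofMinutes(5), false)` selects the retry topic, while the same outage on a strictly ordered stream stops the partition instead.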
6. Kafka Consumer Poll Loop Safety
A Kafka consumer is not a job runner that can block indefinitely. It has group membership constraints.
Bad shape:
while (true) {
    ConsumerRecords<String, Event> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, Event> record : records) {
        processWithTenMinuteRetry(record); // dangerous
    }
    consumer.commitSync();
}
Problem:
- poll() cadence becomes unstable;
- heartbeats/group management can be impacted depending on client behavior and config;
- max.poll.interval.ms can be exceeded;
- rebalance can revoke partitions while processing is still running;
- duplicate processing risk increases;
- commit may apply to records whose actual outcome is ambiguous.
Safer shape:
while (running) {
    ConsumerRecords<String, Event> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, Event> record : records) {
        ProcessingResult result = processor.process(record);
        switch (result.status()) {
            case SUCCESS -> commit(record);
            case RETRYABLE_EXHAUSTED -> {
                retryPublisher.publish(record, result.error());
                commit(record);
            }
            case NON_RETRYABLE -> {
                dlqPublisher.publish(record, result.error());
                commit(record);
            }
            case FATAL -> throw result.error();
        }
    }
}
Even safer for consume-process-produce is using Kafka transactions if the output is Kafka-only. But external DB/API side effects still need idempotency.
7. Commit Discipline
Kafka commit stores the next offset to read for a consumer group.
For a record at offset 42, committing 43 means:
record 42 is done from this consumer group's perspective
Do not commit before the outcome is known.
7.1 Dangerous: Commit Before Side Effect
consumer.commitSync(offsetAfter(record));
callExternalApi(record); // may fail after commit
If the process crashes after commit and before API call, the event is skipped.
7.2 Dangerous: Side Effect Before Idempotency
callExternalApi(record);                    // succeeds
consumer.commitSync(offsetAfter(record));   // process crashes before this commit completes
After restart, record is processed again and external API may be called twice.
Therefore, side effects must be idempotent.
7.3 Safer: Idempotent Side Effect + Commit
String idempotencyKey = record.topic() + ":" + record.partition() + ":" + record.offset();
externalApi.call(record.value(), idempotencyKey);
consumer.commitSync(offsetAfter(record));
Better if the domain event has a natural event ID:
String idempotencyKey = event.eventId();
Do not use offset as the only business idempotency key if the same event can be replayed from a different topic or regenerated.
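To make the idempotency requirement concrete, here is a minimal in-memory sketch. `IdempotentExecutor` and `runOnce` are hypothetical names, and the in-memory set is only a stand-in: a real consumer would need a durable store (for example a unique key in the target database) so that the guard survives restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for a durable idempotency store (assumption: production code persists the keys). */
final class IdempotentExecutor {
    private final Set<String> claimedKeys = ConcurrentHashMap.newKeySet();

    /**
     * Runs the side effect at most once per idempotency key.
     * Returns false when the key was already claimed, so a redelivered record becomes a no-op.
     */
    boolean runOnce(String idempotencyKey, Runnable sideEffect) {
        // add() is atomic: only the first caller for a key gets true.
        if (!claimedKeys.add(idempotencyKey)) {
            return false;
        }
        try {
            sideEffect.run();
            return true;
        } catch (RuntimeException ex) {
            // The side effect failed, so release the key and allow a later retry.
            claimedKeys.remove(idempotencyKey);
            throw ex;
        }
    }
}
```

With this shape, a crash between the side effect and the offset commit leads to a second delivery that is detected and skipped, provided the key store is as durable as the side effect itself.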
8. DLQ Record Design
A DLQ record should preserve enough information to debug, classify, and replay.
Minimum DLQ envelope:
{
  "failureId": "01JZ...",
  "failedAt": "2026-06-28T10:15:30Z",
  "source": {
    "topic": "case.lifecycle.v1",
    "partition": 7,
    "offset": 123456,
    "timestamp": "2026-06-28T10:12:01Z",
    "key": "case-81273"
  },
  "consumer": {
    "groupId": "case-risk-projection-v3",
    "application": "case-risk-projector",
    "version": "3.14.2"
  },
  "error": {
    "type": "NON_RETRYABLE_SCHEMA_ERROR",
    "exceptionClass": "InvalidEventVersionException",
    "message": "Unsupported case status: PRE_ESCALATED",
    "stackTraceHash": "d7a6d1b8",
    "attempt": 5,
    "retryExhausted": true
  },
  "payload": {
    "format": "avro",
    "schemaId": 1024,
    "redacted": false,
    "data": { }
  },
  "headers": {
    "traceparent": "00-...",
    "correlationId": "corr-123",
    "causationId": "event-999"
  }
}
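A part of this envelope can be assembled with plain Java records. The sketch below is illustrative only: the record names, the `DlqEnvelopes` helper, the use of a random UUID as `failureId`, and the way `stackTraceHash` is derived (first 4 bytes of a SHA-256 over exception class and frame names) are all assumptions, not a format defined by Kafka or by this series.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.HexFormat;
import java.util.UUID;

record SourceCoordinates(String topic, int partition, long offset, String key) {}

record DlqEnvelope(String failureId, Instant failedAt, SourceCoordinates source,
                   String exceptionClass, String message, String stackTraceHash,
                   int attempt, boolean retryExhausted) {}

final class DlqEnvelopes {
    private DlqEnvelopes() {}

    /** Short fingerprint built from class and method names only, so line-number changes do not alter it. */
    static String stackTraceHash(Throwable error) {
        StringBuilder sb = new StringBuilder(error.getClass().getName());
        for (StackTraceElement frame : error.getStackTrace()) {
            sb.append('|').append(frame.getClassName()).append('#').append(frame.getMethodName());
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest, 0, 4); // 4 bytes -> 8 hex characters
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    static DlqEnvelope from(SourceCoordinates source, Throwable error,
                            int attempt, int maxAttempts, Instant now) {
        return new DlqEnvelope(
                UUID.randomUUID().toString(), // assumption: any unique ID scheme works here
                now, source,
                error.getClass().getSimpleName(), error.getMessage(),
                stackTraceHash(error), attempt, attempt >= maxAttempts);
    }
}
```

A stable hash lets operators group DLQ records by root cause without storing the full stack trace in every record.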
8.1 Preserve Original Bytes When Needed
For deserialization failure, the consumer may not be able to construct the typed payload. You need a lower-level error handler that can capture:
- raw key bytes;
- raw value bytes;
- headers;
- topic/partition/offset;
- deserializer exception;
- schema ID if available.
Do not assume all failures happen after successful deserialization.
8.2 Redaction and PII
DLQ often becomes a hidden data lake of broken payloads.
If events contain personal data:
- apply the same retention policy as source topic or stricter;
- redact sensitive fields when full payload is not required;
- encrypt DLQ topics where required;
- restrict ACLs;
- make DLQ access auditable;
- document lawful basis/retention for regulatory contexts.
A DLQ is production data, not debug scratch space.
9. Retry Metadata Contract
Retry topic messages should carry retry metadata.
Example headers:
x-original-topic: case.lifecycle.v1
x-original-partition: 7
x-original-offset: 123456
x-original-timestamp: 2026-06-28T10:12:01Z
x-retry-attempt: 3
x-retry-not-before: 2026-06-28T10:22:01Z
x-first-failed-at: 2026-06-28T10:12:05Z
x-last-error-class: java.net.SocketTimeoutException
x-last-error-hash: a7c2e9
x-correlation-id: corr-123
Retry contract should answer:
- how many attempts have happened?
- why did the last attempt fail?
- when should next attempt happen?
- what was the original record location?
- what consumer/application failed?
- which trace/correlation does this belong to?
Do not encode retry state only in logs. Logs expire and are hard to join with records.
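One way to keep this contract consistent across hops is to derive the next set of headers from the previous one. The following is a minimal sketch using plain string maps; `RetryHeaders.next` is a hypothetical helper, and converting the map to real Kafka record headers is left out on purpose.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

final class RetryHeaders {
    private RetryHeaders() {}

    /** Builds the headers for the next attempt from the headers of the delivery that just failed. */
    static Map<String, String> next(Map<String, String> current, String topic, int partition,
                                    long offset, Throwable error, Instant now, Duration delay) {
        Map<String, String> next = new LinkedHashMap<>(current);
        // Original coordinates and first-failure time are written once, then carried forward unchanged.
        next.putIfAbsent("x-original-topic", topic);
        next.putIfAbsent("x-original-partition", Integer.toString(partition));
        next.putIfAbsent("x-original-offset", Long.toString(offset));
        next.putIfAbsent("x-first-failed-at", now.toString());
        // Per-attempt fields are overwritten on every hop.
        int attempt = Integer.parseInt(current.getOrDefault("x-retry-attempt", "0")) + 1;
        next.put("x-retry-attempt", Integer.toString(attempt));
        next.put("x-retry-not-before", now.plus(delay).toString());
        next.put("x-last-error-class", error.getClass().getName());
        return next;
    }
}
```

Because the original coordinates use `putIfAbsent`, a record that fails again inside a retry topic still points back to its location in the main topic.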
10. Delay Strategies in Kafka
Kafka does not natively schedule arbitrary per-record delivery like a timer queue. Common delay strategies:
10.1 Fixed Retry Topics
case.events.retry.1m
case.events.retry.5m
case.events.retry.30m
case.events.retry.2h
case.events.dlq
Each retry consumer delays or filters until due time.
Pros:
- simple topology;
- easy metrics per tier;
- operationally visible.
Cons:
- many topics;
- less flexible;
- ordering changes.
10.2 Timestamp-Based Delay Consumer
Retry record includes notBefore. Consumer checks whether it is due.
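Problem: if the first record in a partition is not due yet, later records in that partition may be blocked behind it.
For example, if retry record A is not yet due and records B and C follow it in the same partition, then under strict partition order B and C wait behind A even if they are already due.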
10.3 External Scheduler
Failed records are stored in a DB/scheduler and re-emitted to Kafka when due.
Pros:
- flexible scheduling;
- queryable;
- can prioritize.
Cons:
- adds another durable system;
- requires exactly-once-ish bridging discipline;
- can create operational split-brain between scheduler and Kafka.
10.4 Backoff Topic Chain
Consumer publishes to next retry topic based on attempt.
main -> retry.1m -> retry.5m -> retry.30m -> dlq
This is usually the best default for microservice event pipelines.
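Selecting the next hop in such a chain is a small lookup. The sketch below assumes hypothetical `RetryTier` and `RetryChain` types and counts delayed attempts from 1 for the first hop out of the main topic; the topic names reuse the fixed retry topic examples from this section.

```java
import java.time.Duration;
import java.util.List;

record RetryTier(String topic, Duration delay) {}

final class RetryChain {
    private final List<RetryTier> tiers;
    private final String terminalTopic;

    RetryChain(List<RetryTier> tiers, String terminalTopic) {
        this.tiers = List.copyOf(tiers);
        this.terminalTopic = terminalTopic;
    }

    /**
     * Returns the topic for the given delayed attempt.
     * Attempt 1 maps to the first tier; attempts beyond the last tier map to the terminal topic.
     */
    String topicFor(int delayedAttempt) {
        if (delayedAttempt < 1) {
            throw new IllegalArgumentException("delayedAttempt must be >= 1");
        }
        return delayedAttempt <= tiers.size()
                ? tiers.get(delayedAttempt - 1).topic()
                : terminalTopic;
    }
}
```

Keeping the chain as data rather than as branching code makes it straightforward to load the tiers from configuration.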
11. Ordering and Error Handling
Kafka ordering is per partition. Error handling can violate perceived entity order.
Example:
partition 3:
offset 100: CaseOpened(case-1)
offset 101: CaseEscalated(case-1)
offset 102: CaseClosed(case-1)
If offset 101 fails and is sent to retry topic while offset 102 is processed, downstream state can become invalid.
11.1 Ordering Policy Options
| Policy | Behavior | Use when |
|---|---|---|
| Stop partition on error | Do not commit beyond failed record | Strict ordering required |
| Retry out-of-band | Move failed record to retry and continue | Availability > strict order |
| Entity parking lot | Park later records for same entity | Per-entity order required |
| State-machine guard | Later event rejected until prerequisite exists | Domain can self-heal |
| Compensating event | Allow out-of-order then correct later | Eventually consistent domain |
11.2 Per-Entity Parking Lot
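A more advanced pattern: when a record for an entity fails, mark that entity as blocked, park later records for the same entity instead of applying them, and keep processing records for other entities. Once the failed record is resolved, release the parked records in their original order.
This preserves ordering per entity but adds state.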
Required table:
create table entity_processing_block (
    entity_id varchar primary key,
    blocked_since timestamp not null,
    failed_topic varchar not null,
    failed_partition int not null,
    failed_offset bigint not null,
    reason varchar not null
);
create table parked_event (
    entity_id varchar not null,
    sequence_no bigint,
    source_topic varchar not null,
    source_partition int not null,
    source_offset bigint not null,
    payload jsonb not null,
    parked_at timestamp not null,
    primary key (entity_id, source_topic, source_partition, source_offset)
);
Trade-off:
- stronger business ordering;
- more operational complexity;
- need cleanup/replay tooling;
- risk of unbounded parking if entity stays broken.
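The control flow of a per-entity parking lot can be sketched in memory before committing to a database design. `EntityParkingLot` below is a hypothetical, single-threaded illustration; a production version would persist the block and the parked events (as in the tables above) and would cap how many events an entity may park.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory illustration only; not thread-safe and not durable. */
final class EntityParkingLot<E> {
    // entityId -> events parked behind the failed record, in arrival order
    private final Map<String, Deque<E>> parked = new HashMap<>();

    /** Marks the entity as blocked after one of its records failed. */
    void block(String entityId) {
        parked.computeIfAbsent(entityId, id -> new ArrayDeque<>());
    }

    /** Returns true if the event may be processed now; otherwise the event is parked. */
    boolean admit(String entityId, E event) {
        Deque<E> queue = parked.get(entityId);
        if (queue == null) {
            return true; // entity is not blocked
        }
        queue.addLast(event);
        return false;
    }

    /** Unblocks the entity and returns its parked events in their original order. */
    List<E> release(String entityId) {
        Deque<E> queue = parked.remove(entityId);
        return queue == null ? List.of() : List.copyOf(queue);
    }
}
```

Other entities keep flowing while one entity is blocked, which is the availability gain this pattern buys over stopping the whole partition.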
12. Poison Event Handling
A poison event is a record that repeatedly causes failure and prevents progress.
Signs:
- same topic/partition/offset fails repeatedly;
- stack trace hash stable;
- retry attempts high;
- consumer lag grows behind one record;
- redeploy does not fix;
- CPU spikes due to tight retry loop.
Response:
- Stop infinite retry.
- Capture original record and failure metadata.
- Move to DLQ/quarantine if policy allows.
- Commit only after DLQ/quarantine publish succeeds.
- Alert owner.
- Add regression test using captured payload.
- Replay after fix through controlled tool.
Do not simply skip poison events silently.
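Detecting the "same offset, same stack trace" sign can be as simple as counting failures per record. The `PoisonDetector` below is a hypothetical sketch with an assumed threshold; its counts live in memory, so they reset on restart, which is one reason to also carry the attempt count in record headers.

```java
import java.util.HashMap;
import java.util.Map;

/** Counts repeated failures of the same record with the same failure fingerprint. */
final class PoisonDetector {
    private final int threshold;
    private final Map<String, Integer> failures = new HashMap<>();

    PoisonDetector(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records one failure; returns true once this record has failed `threshold` times with the same hash. */
    boolean recordFailure(String topic, int partition, long offset, String stackTraceHash) {
        String key = topic + ":" + partition + ":" + offset + ":" + stackTraceHash;
        return failures.merge(key, 1, Integer::sum) >= threshold;
    }

    /** Forgets a record once it has been dead-lettered, quarantined, or processed. */
    void forget(String topic, int partition, long offset, String stackTraceHash) {
        failures.remove(topic + ":" + partition + ":" + offset + ":" + stackTraceHash);
    }
}
```

When `recordFailure` returns true, the consumer would stop retrying and follow the response steps above: capture, publish to DLQ or quarantine, commit, and alert.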
13. Retry Budget
Retry budget prevents endless attempts.
Example policy:
retryPolicy:
  inline:
    maxAttempts: 3
    backoff: [100ms, 500ms, 2s]
  delayed:
    topics:
      - name: case.lifecycle.v1.retry.1m
        delay: PT1M
      - name: case.lifecycle.v1.retry.10m
        delay: PT10M
      - name: case.lifecycle.v1.retry.1h
        delay: PT1H
    maxTotalAttempts: 9
  terminal:
    topic: case.lifecycle.v1.dlt
Retry budget dimensions:
| Dimension | Reason |
|---|---|
| max attempts | stop infinite loop |
| max age | old event may be harmful |
| max cumulative delay | SLA impact |
| per-error-class policy | invalid data should not retry |
| per-entity cap | avoid one entity dominating |
| global retry rate | avoid thundering herd |
Retry should be treated as load. During outage, retry can amplify traffic.
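Two of these dimensions, max attempts and max age, are easy to check in code. The sketch below uses a hypothetical `RetryBudget` record; the other dimensions (per-entity cap, global retry rate) need shared state and are not covered here.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical budget covering the attempt-count and age dimensions only. */
record RetryBudget(int maxTotalAttempts, Duration maxAge) {

    /** Returns true when the record must leave the retry path and go to its terminal topic. */
    boolean isExhausted(int attemptsSoFar, Instant firstFailedAt, Instant now) {
        boolean tooManyAttempts = attemptsSoFar >= maxTotalAttempts;
        boolean tooOld = Duration.between(firstFailedAt, now).compareTo(maxAge) > 0;
        return tooManyAttempts || tooOld;
    }
}
```

Checking age as well as attempts matters during long outages: a record that has only been tried a few times may still be too stale to apply safely.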
14. Retry Storm and Thundering Herd
If a downstream service is down for 30 minutes, all failed events may become due around the same time.
Mitigations:
- exponential backoff;
- jitter;
- retry rate limit;
- circuit breaker;
- bulkhead per downstream dependency;
- pause/resume consumer partitions;
- drain slowly after recovery;
- monitor dependency saturation.
Example jitter:
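import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

Duration backoffWithJitter(int attempt) {
    // 1 s * 2^attempt (exponent capped at 6), capped at 60 s, plus up to 50% random jitter.
    long baseMillis = Math.min(60_000, 1_000L * (1L << Math.min(attempt, 6)));
    long jitter = ThreadLocalRandom.current().nextLong(0, baseMillis / 2 + 1);
    return Duration.ofMillis(baseMillis + jitter);
}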
15. DLQ Topic Naming
Good topic names encode source and purpose.
<domain>.<event-stream>.<version>.retry.<delay>
<domain>.<event-stream>.<version>.dlt
<domain>.<event-stream>.<version>.quarantine
Examples:
case.lifecycle.v1
case.lifecycle.v1.retry.1m
case.lifecycle.v1.retry.10m
case.lifecycle.v1.retry.1h
case.lifecycle.v1.dlt
case.lifecycle.v1.quarantine
Avoid:
dlq
errors
failed-events
misc-retry
Bad names make ownership and lineage unclear.
16. DLQ Partitioning
DLQ partitioning is not trivial.
Options:
| DLQ key | Benefit | Risk |
|---|---|---|
| original key | preserves entity locality | hot failed entity creates hotspot |
| original topic-partition | easier source replay | less domain-friendly |
| failure type | easy operational grouping | loses entity ordering |
| random | spreads load | hard replay/order |
| composite | flexible | more complex |
Default recommendation:
Use original record key for DLQ if entity-level replay matters.
Use original topic-partition-offset in headers for exact lineage.
Do not rely on DLQ offset as source identity.
17. DLQ Retention
DLQ retention must be long enough for operational reality.
Questions:
- How long until someone notices?
- How long until root cause is fixed?
- How long until replay is approved?
- Are failed records legally/audit relevant?
- Does payload contain PII?
- Is compacted DLQ appropriate? Usually no.
For regulatory case-management:
DLQ retention should usually align with incident audit requirements, not developer convenience.
In many systems, DLQ should be retained longer than normal retry topics, but access should be stricter.
18. Replay Design
Replay is not “produce the DLQ payload back to main topic” blindly.
Replay must answer:
- What was fixed?
- Is the original event still valid?
- Has the entity state moved on?
- Should replay preserve original key?
- Should replay preserve original timestamp?
- Should replay go to original topic or a repair topic?
- Should consumers distinguish replay from first delivery?
- How do we avoid replaying twice?
18.1 Replay Envelope
When replaying, add headers:
x-replayed: true
x-replay-id: replay-20260628-001
x-replay-requested-by: user@example.com
x-replay-reason: fixed schema mapping for PRE_ESCALATED
x-original-topic: case.lifecycle.v1
x-original-partition: 7
x-original-offset: 123456
x-original-failure-id: 01JZ...
18.2 Replay Modes
| Mode | Meaning |
|---|---|
| Same-topic replay | Re-publish to original topic |
| Repair-topic replay | Publish to <topic>.repair consumed by same projection |
| Direct projector replay | Tool calls consumer logic directly |
| Bulk rebuild | Rebuild projection from source-of-truth/log |
| Manual correction | Do not replay, correct state with audit record |
Same-topic replay is simple but can confuse consumers that assume topic timestamp/order equals original business time.
19. Kafka Transaction Use in Error Handling
If consumer reads from topic A and writes to topic B/DLQ, Kafka transactions can atomically commit output records and source offsets.
Pseudo-flow:
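producer.initTransactions();
while (running) {
    ConsumerRecords<String, Event> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) {
        continue; // nothing to commit, so do not open a transaction
    }
    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, Event> record : records) {
            try {
                ProducerRecord<String, ProjectionEvent> output = transform(record);
                producer.send(output);
            } catch (NonRetryableException ex) {
                producer.send(toDlq(record, ex));
            }
            offsets.put(
                new TopicPartition(record.topic(), record.partition()),
                new OffsetAndMetadata(record.offset() + 1)
            );
        }
        // Output records and source offsets become visible atomically.
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (KafkaException ex) {
        // Abortable errors only. Fatal errors such as ProducerFencedException require closing the producer.
        // After an abort, rewind the consumer to the last committed offsets (or restart it) before polling again.
        producer.abortTransaction();
    }
}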
This protects Kafka-to-Kafka processing. It does not make external HTTP/DB side effects exactly-once.
20. Deserialization Error Handling
Deserialization errors happen before your typed handler receives a domain object.
Bad assumption:
try { handle(event); } catch (...) { sendToDlq(event); }
If deserialization fails, event does not exist.
Design options:
- Use framework-level deserialization error handler.
- Consume raw bytes and deserialize inside application boundary.
- Have a side-channel raw DLQ for malformed records.
- Validate schema at producer boundary to reduce bad records.
Raw DLQ record should include:
- raw bytes or secure pointer to raw bytes;
- schema ID if encoded in wire format;
- deserializer class/config;
- exception class/message;
- topic/partition/offset;
- headers.
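The "consume raw bytes and deserialize inside the application boundary" option can be sketched without any Kafka types. In the example below, `SafeDeserializer`, `Decoded`, and `RawFailure` are hypothetical names; the sketch assumes non-null record values and a deserializer that signals failure with a runtime exception.

```java
import java.util.Base64;
import java.util.Optional;
import java.util.function.Function;

/** What a raw DLQ record would carry when the typed payload cannot be built. */
record RawFailure(String topic, int partition, long offset,
                  String valueBase64, String exceptionClass, String message) {}

/** Exactly one of value or failure is present. */
record Decoded<T>(Optional<T> value, Optional<RawFailure> failure) {}

final class SafeDeserializer {
    private SafeDeserializer() {}

    /** Assumes rawValue is non-null; tombstone handling is out of scope for this sketch. */
    static <T> Decoded<T> decode(String topic, int partition, long offset,
                                 byte[] rawValue, Function<byte[], T> deserializer) {
        try {
            return new Decoded<>(Optional.of(deserializer.apply(rawValue)), Optional.empty());
        } catch (RuntimeException ex) {
            // Keep the original bytes so the record can be inspected and replayed later.
            RawFailure failure = new RawFailure(topic, partition, offset,
                    Base64.getEncoder().encodeToString(rawValue),
                    ex.getClass().getName(), ex.getMessage());
            return new Decoded<>(Optional.empty(), Optional.of(failure));
        }
    }
}
```

The handler then branches on which side of `Decoded` is present, so a malformed record reaches the raw DLQ path instead of crashing the poll loop.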
21. Business Rejection Is Not Always Technical Failure
Example:
CaseEscalated(caseId=123)
Consumer reads current case state and finds case already closed.
This can mean:
- event is out of order;
- upstream emitted invalid transition;
- consumer state is stale;
- duplicate event;
- legitimate no-op;
- fraud/error requiring investigation.
Do not blindly send all business rejections to technical DLQ.
Better taxonomy:
case.lifecycle.v1.rejected
case.lifecycle.v1.quarantine
case.lifecycle.v1.dlt
A rejected event may be a domain signal, not an infrastructure error.
22. Error Classification Code Shape
Use explicit classification.
public enum FailureClass {
    TRANSIENT_DEPENDENCY,
    RATE_LIMITED,
    NON_RETRYABLE_DATA,
    SCHEMA_INCOMPATIBLE,
    BUSINESS_REJECTED,
    POISON_EVENT,
    PROGRAMMING_BUG,
    UNKNOWN
}
Classifier:
public final class FailureClassifier {
    public FailureClass classify(Throwable error) {
        if (error instanceof SocketTimeoutException) {
            return FailureClass.TRANSIENT_DEPENDENCY;
        }
        if (error instanceof RateLimitedException) {
            return FailureClass.RATE_LIMITED;
        }
        if (error instanceof InvalidSchemaException) {
            return FailureClass.SCHEMA_INCOMPATIBLE;
        }
        if (error instanceof BusinessTransitionRejectedException) {
            return FailureClass.BUSINESS_REJECTED;
        }
        if (error instanceof NullPointerException) {
            return FailureClass.PROGRAMMING_BUG;
        }
        return FailureClass.UNKNOWN;
    }
}
Policy:
public sealed interface FailureDecision {
    record RetryInline(Duration backoff) implements FailureDecision {}
    record PublishRetryTopic(String topic, Instant notBefore) implements FailureDecision {}
    record PublishDlq(String topic) implements FailureDecision {}
    record PublishQuarantine(String topic) implements FailureDecision {}
    record StopConsumer() implements FailureDecision {}
}
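The classification and the decision type meet in a policy function. The sketch below redeclares `FailureClass` and `FailureDecision` only so that it compiles on its own; `FailurePolicy`, the inline budget of 3, the doubling backoff, and the mapping of each class to a decision are assumptions for illustration, loosely following the case-management policy later in this part.

```java
import java.time.Duration;
import java.time.Instant;

enum FailureClass {
    TRANSIENT_DEPENDENCY, RATE_LIMITED, NON_RETRYABLE_DATA, SCHEMA_INCOMPATIBLE,
    BUSINESS_REJECTED, POISON_EVENT, PROGRAMMING_BUG, UNKNOWN
}

sealed interface FailureDecision {
    record RetryInline(Duration backoff) implements FailureDecision {}
    record PublishRetryTopic(String topic, Instant notBefore) implements FailureDecision {}
    record PublishDlq(String topic) implements FailureDecision {}
    record PublishQuarantine(String topic) implements FailureDecision {}
    record StopConsumer() implements FailureDecision {}
}

final class FailurePolicy {
    private static final int INLINE_ATTEMPTS = 3; // assumed inline retry budget
    private final String baseTopic;

    FailurePolicy(String baseTopic) {
        this.baseTopic = baseTopic;
    }

    /** attempt is the number of attempts already made for this record (1 after the first failure). */
    FailureDecision decide(FailureClass failureClass, int attempt, Instant now) {
        return switch (failureClass) {
            case TRANSIENT_DEPENDENCY, RATE_LIMITED -> retryDecision(attempt, now);
            case NON_RETRYABLE_DATA, SCHEMA_INCOMPATIBLE -> new FailureDecision.PublishDlq(baseTopic + ".dlt");
            case BUSINESS_REJECTED, POISON_EVENT, UNKNOWN ->
                    new FailureDecision.PublishQuarantine(baseTopic + ".quarantine");
            case PROGRAMMING_BUG -> new FailureDecision.StopConsumer();
        };
    }

    private FailureDecision retryDecision(int attempt, Instant now) {
        if (attempt < INLINE_ATTEMPTS) {
            // 200 ms after the first failure, 400 ms after the second.
            return new FailureDecision.RetryInline(Duration.ofMillis(100L << attempt));
        }
        return new FailureDecision.PublishRetryTopic(baseTopic + ".retry.1m", now.plus(Duration.ofMinutes(1)));
    }
}
```

Because the switch has no default branch, adding a new `FailureClass` constant fails compilation until the policy handles it, which keeps classification and policy in step.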
Avoid error handling code shaped like:
catch (Exception ex) {
    log.warn("failed", ex);
}
That is data loss disguised as resilience.
23. Observability for Error Handling
Required metrics:
| Metric | Why it matters |
|---|---|
| processed_total | baseline throughput |
| processing_failed_total by class | failure classification |
| retry_published_total | retry volume |
| retry_exhausted_total | terminal failures |
| dlq_published_total | DLQ inflow |
| quarantine_published_total | manual workload |
| retry_age_seconds | stale retry detection |
| dlq_oldest_age_seconds | unowned DLQ detection |
| poison_offset_repeats | same offset failing repeatedly |
| replay_total | recovery activity |
| replay_failed_total | bad recovery process |
Required logs:
- topic;
- partition;
- offset;
- key;
- event ID;
- correlation ID;
- causation ID;
- consumer group;
- application version;
- failure class;
- retry attempt;
- action taken.
Example log shape:
{
  "level": "WARN",
  "message": "event processing failed; publishing to retry topic",
  "topic": "case.lifecycle.v1",
  "partition": 7,
  "offset": 123456,
  "key": "case-81273",
  "eventId": "evt-999",
  "consumerGroup": "case-risk-projection-v3",
  "failureClass": "TRANSIENT_DEPENDENCY",
  "attempt": 2,
  "nextTopic": "case.lifecycle.v1.retry.10m",
  "correlationId": "corr-123"
}
24. Alerting Rules
Alert on symptoms that require action.
| Alert | Severity | Action |
|---|---|---|
| DLQ inflow > 0 for critical stream | High | Triage immediately |
| DLQ rate spike | High | Check deployment/schema/upstream |
| Retry backlog age exceeds SLA | Medium/high | Check downstream dependency |
| Same offset fails > N times | High | Poison event |
| Quarantine oldest age > SLA | High | Manual process stuck |
| Replay failure | High | Recovery unsafe |
| Consumer lag + retry errors | High | Downstream/infrastructure pressure |
Do not alert only on consumer crash. Many broken pipelines keep running while silently writing DLQ.
25. Testing Error Handling
Minimum tests:
25.1 Transient Failure Then Success
Given downstream fails twice
When consumer processes record
Then it retries according to policy
And commits only after success
25.2 Retry Exhausted
Given downstream keeps timing out
When retry budget is exhausted
Then record is published to DLQ
And source offset is committed only after DLQ publish succeeds
25.3 Non-Retryable Data Error
Given payload has invalid enum
When consumer processes record
Then it does not retry infinitely
And record is sent to DLQ/quarantine with reason
25.4 Deserialization Error
Given record bytes cannot be deserialized
When consumer polls
Then raw failure is captured
And offset handling follows policy
25.5 Ordering Failure
Given event N fails for entity E
And event N+1 arrives
When ordering policy is strict
Then N+1 is not applied before N
25.6 Crash Windows
Test crash windows explicitly:
| Crash point | Expected result |
|---|---|
| after side effect before commit | duplicate-safe reprocess |
| after retry topic publish before commit | duplicate retry-safe |
| after DLQ publish before commit | duplicate DLQ-safe |
| after commit before DLQ publish | must not happen |
26. Regulatory Case-Management Example
Input topic:
case.lifecycle.v1
Events:
CaseOpened
EvidenceSubmitted
RiskScoreChanged
CaseEscalated
NoticeIssued
CaseClosed
Consumer:
case-enforcement-projection
Failure policy:
| Failure | Policy |
|---|---|
| DB timeout | inline retry 3 times, then retry.1m |
| DB outage | retry topics with jitter and rate limit |
| invalid lifecycle transition | quarantine, not generic DLQ |
| unsupported event version | DLQ/schema incident |
| duplicate event ID | idempotent skip with metric |
| out-of-order sequence | entity parking lot |
| programming bug | stop consumer for critical stream |
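Topology: case.lifecycle.v1 is consumed by case-enforcement-projection, which routes each failure according to the policy table above to the retry topics, the quarantine topic, the DLQ, or the entity parking lot.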
Important: business invariant violation goes to quarantine, not ordinary DLQ, because it may imply a legally relevant lifecycle inconsistency.
27. Common Anti-Patterns
27.1 Infinite Retry in Poll Loop
while true -> process same bad record -> fail -> sleep -> retry forever
Impact:
- partition stuck;
- lag grows;
- dependency overloaded;
- no terminal signal.
Fix:
- retry budget;
- poison detection;
- DLQ/quarantine;
- alert.
27.2 Catch-and-Commit
try {
    process(record);
} catch (Exception ex) {
    log.error("failed", ex);
}
consumer.commitSync();
This loses data.
27.3 DLQ Without Replay
DLQ that nobody checks is just a slower deletion mechanism.
27.4 Retrying Non-Retryable Errors
Invalid enum does not become valid after sleeping 30 seconds.
27.5 One Global DLQ
A single errors topic mixes unrelated domains, schemas, owners, and severities.
27.6 Retrying Everything at Same Time
No jitter + downstream recovery = retry storm.
27.7 Treating Business Rejection as Technical Failure
Some rejected events are valid domain facts.
27.8 DLQ Without Original Coordinates
Without original topic/partition/offset, replay and forensic analysis become weak.
28. Production Checklist
Before production, answer:
- What failure classes exist?
- Which failures are retriable?
- What is max retry attempt?
- What is max retry age?
- Which topic receives terminal failures?
- Who owns the DLQ?
- What alert fires when DLQ receives data?
- How is DLQ replay performed?
- Is replay idempotent?
- Are DLQ payloads redacted/encrypted?
- What is retention?
- How is ordering preserved or intentionally relaxed?
- What is the poison event procedure?
- What happens if DLQ publishing fails?
- What happens if consumer crashes after side effect but before commit?
- What happens if consumer crashes after DLQ publish but before commit?
29. Deliberate Practice
Exercise 1 — Classify Failures
Take 20 failures from production logs or test cases. Group them into:
- transient;
- deterministic;
- schema;
- business rejection;
- poison;
- programming bug;
- unknown.
For each class, write a policy.
Exercise 2 — Design Retry Topology
For the case.lifecycle.v1 stream, design:
- retry topics;
- DLQ;
- quarantine;
- retention;
- alert;
- replay flow.
Exercise 3 — Crash Window Test
Build an integration test that kills the consumer at four points:
- after poll;
- after side effect;
- after retry publish;
- after commit.
Prove that there is no silent loss.
Exercise 4 — Poison Event Drill
Inject an event that always fails.
Expected:
- the retry budget is applied;
- the event lands in quarantine/DLQ;
- the partition moves on if policy allows;
- an alert fires;
- the replay tool can process the event after the fix.
30. Summary
Kafka error handling is not a single feature. It is an end-to-end design.
Core principles:
- An offset commit is not proof of business success.
- Do not commit past a record unless its outcome is durable.
- Retry only helps when time can change the outcome.
- Long retries must not hold up the poll loop.
- A DLQ must have an owner, metadata, alerts, retention, and a replay process.
- Ordering must be stated as an explicit policy.
- Quarantine is different from a technical DLQ.
- External side effects need idempotency.
- Replay is a production operation, not an ad hoc script.
- Every failure must leave evidence.
The next part covers schema discipline: how JSON, Avro, Protobuf, Schema Registry, and compatibility policy determine whether an event stream can evolve without breaking consumers.
References
- Apache Kafka Documentation — https://kafka.apache.org/documentation/
- Apache Kafka APIs — https://kafka.apache.org/42/apis/
- Apache Kafka Connect User Guide, Errors and DLQ — https://kafka.apache.org/40/kafka-connect/user-guide/
- Confluent: Error Handling Patterns for Apache Kafka Applications — https://www.confluent.io/blog/error-handling-patterns-in-kafka/
- Confluent: Apache Kafka Dead Letter Queue Guide — https://www.confluent.io/learn/kafka-dead-letter-queue/