Architecture Review and Anti-Patterns

Learn Java Kafka in Action - Part 034

Architecture review handbook for Kafka-based systems covering design invariants, review questions, anti-pattern catalog, failure modeling, governance, and production-readiness scoring.

2026-07-02 · 25 min read · 4868 words

Part 034 — Architecture Review and Anti-Patterns

Kafka systems fail for two broad reasons:

the cluster is operated poorly;
the architecture asks Kafka to provide guarantees it does not provide.

Part 033 handled operations. This part handles architecture review: how to detect weak designs before they become incidents.

A top-tier Kafka engineer does not merely ask:

Can this service produce and consume messages?

They ask:

What truth is represented by this event, what ordering boundary protects it, what happens when it is replayed, which team owns the contract, and how does the system fail under load, schema change, rebalancing, disaster recovery, and partial downstream outage?

This part gives you a review framework and an anti-pattern catalog.

1. Review Philosophy

Kafka architecture review should not be a gatekeeping ceremony. It should expose hidden coupling, missing invariants, and unowned failure modes.

A good review produces:

sharper domain boundaries;
clearer event contracts;
safer topic design;
explicit ordering and idempotency rules;
operational readiness;
better upgrade/replay/DR posture;
fewer production surprises.

A bad review produces:

generic checklist compliance;
diagrams with no failure paths;
topic names with no semantics;
performance claims without measurements;
"Kafka guarantees exactly-once" as a magic phrase;
no ownership after launch.

2. The Kafka Architecture Review Stack

Review Kafka systems layer by layer.

Do not start the review at Kafka configuration. Start at the business invariant.

3. Core Review Questions

3.1 Business Semantics

What business fact does this event represent?
Is it a fact, command, notification, snapshot, or integration record?
Is it immutable?
What is the authoritative source?
Can the event be corrected? How?
Is the event part of an audit trail?
What is the consequence of duplicate, delayed, or missing events?

3.2 Topic Design

Why is this a separate topic?
Who owns it?
What is the key?
What is the partition count and why?
What is the retention policy and why?
Is the topic's cleanup policy compact, delete, or both?
Is the topic public domain API or internal pipeline topic?
Is there a lifecycle/deprecation plan?

3.3 Producer Design

What happens if produce succeeds but local DB transaction fails?
What happens if local DB transaction succeeds but produce fails?
Is an outbox needed?
Are retries enabled safely?
Is the idempotent producer enabled, and are acks, retries, and max.in.flight.requests.per.connection compatible with it?
Is acks=all used for critical topics?
What metadata is included for traceability?
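The two dual-write questions above are the ones a transactional outbox is meant to answer. The following is a minimal in-memory sketch, not a production implementation: the class OutboxSketch, its method names, and the use of in-memory collections in place of database tables are assumptions made for illustration (Java 16+ for records). In a real system the business row and the outbox row would be written in one database transaction, and a relay or CDC process would publish outbox rows to Kafka.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical in-memory model of a transactional outbox.
final class OutboxSketch {
    record OutboxRow(long id, String topic, String key, String payload) {}

    private final Map<String, String> quoteStatus = new HashMap<>(); // stands in for the business table
    private final Deque<OutboxRow> outbox = new ArrayDeque<>();      // stands in for the outbox table
    private long nextId = 1;

    // Business change and outbox row are written together.
    // The synchronized block stands in for a single DB transaction.
    synchronized void approveQuote(String quoteId) {
        quoteStatus.put(quoteId, "APPROVED");
        outbox.addLast(new OutboxRow(nextId++, "cpq.quote-approved.v1", quoteId,
                "{\"quoteId\":\"" + quoteId + "\"}"));
    }

    // Relay: publish rows in insertion order. A row is removed only after
    // publish reports success, so a failed publish is retried on the next run.
    synchronized int relay(Predicate<OutboxRow> publish) {
        int sent = 0;
        while (!outbox.isEmpty()) {
            if (!publish.test(outbox.peekFirst())) {
                break; // keep the row and everything behind it
            }
            outbox.removeFirst();
            sent++;
        }
        return sent;
    }

    synchronized int pending() {
        return outbox.size();
    }

    public static void main(String[] args) {
        OutboxSketch sketch = new OutboxSketch();
        sketch.approveQuote("q-1");
        System.out.println("sent while broker is down: " + sketch.relay(row -> false));
        System.out.println("sent after recovery: " + sketch.relay(row -> true));
    }
}
```

If a real relay crashes after publishing but before removing the row, the row is published again on restart. The outbox therefore gives at-least-once publication, which is why the consumer-side idempotency questions in 3.4 still apply.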

3.4 Consumer Design

When is offset committed?
Is processing idempotent?
What happens on crash after side effect before commit?
What happens on poison pill?
Is retry blocking or non-blocking?
Is DLQ replay safe?
How is backpressure handled?
What is the ordering boundary?

3.5 Stream Processing Design

Is stateful processing required?
What state store exists?
What changelog/internal topics are created?
What repartitions happen?
What happens during state restore?
What is the topology evolution plan?
Is exactly-once needed or misunderstood?

3.6 Contract Design

What schema format is used?
What compatibility mode is used?
Who approves schema changes?
Are semantic breaking changes reviewed?
How are deprecated fields handled?
How do consumers handle unknown fields?

3.7 Operational Design

What is the SLO?
What metrics detect failure?
What dashboards exist?
What alerts page humans?
What runbook exists?
Who is on-call?
What is the replay procedure?
What is the DR tier?

4. Review Output: Decision Record

Every important Kafka design should leave an Architecture Decision Record.

# ADR: <Kafka Design Decision>

## Context
What business and technical problem are we solving?

## Decision
What topic/event/pattern/configuration did we choose?

## Alternatives Considered
What else did we evaluate?

## Invariants
What must remain true for the design to be correct?

## Failure Modes
How can this fail?

## Operational Requirements
Metrics, alerts, runbooks, ownership, DR.

## Consequences
Trade-offs accepted by the team.

Architecture without decision history becomes folklore.

5. Anti-Pattern Catalog Overview

We will group anti-patterns by failure domain.

Domain	Anti-Pattern Examples
Event semantics	command disguised as event, mutable fact, ambiguous event name
Topic modeling	mega-topic, topic-per-user, database-table mirror without domain meaning
Partitioning	low-cardinality key, random key when ordering needed, over-partitioning
Producer	dual write, unbounded retries, no idempotence, giant records
Consumer	auto-commit side effects, no idempotency, poison pill loop
Retry/DLQ	DLQ graveyard, retry storm, no replay runbook
Schema	schema-free JSON, compatibility without semantic review
Streams	hidden repartition, state on ephemeral disk, topology reset as fix
ksqlDB	SQL sprawl, unowned persistent queries, pull query as OLTP API
Connect	connector as unowned black box, SMT business logic abuse
Security	shared principal, wildcard ACL, no data classification
Observability	offset lag only, no freshness SLO, no business impact mapping
DR	untested failover, replicated bytes without schemas/ACLs/offsets
Organization	platform owns everything, domain owns nothing

6. Anti-Pattern: Kafka as a Generic Queue Replacement

Symptom

A team migrates from a queue to Kafka and expects:

per-message acknowledgment;
invisible messages;
competing consumers with arbitrary parallelism;
easy dead-letter behavior;
queue depth semantics;
task distribution semantics.

Why It Is Wrong

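Kafka is a partitioned log. A consumer group tracks one committed offset per partition; it does not acknowledge individual records. Records stay in the log according to retention, whether or not they have been consumed. Within a consumer group, parallelism is limited by partition count. Ordering is guaranteed only within a partition.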

Damage

incorrect offset commits;
surprise reprocessing;
ordering bugs;
lag misunderstood as queue depth;
inability to scale beyond partition count;
poison pill blocks partition.

Better Design

Use Kafka when you need a durable event log, replay, fan-out, event-driven integration, stream processing, or event-carried state transfer.

Use a work queue when you need individual task leasing, visibility timeout, and per-task acknowledgment semantics.

7. Anti-Pattern: Mega-Topic

Symptom

One topic carries many unrelated event types:

company.events.all

Payload has eventType, and every consumer filters what it needs.

Why Teams Do It

fewer topics to manage;
easy early integration;
weak governance;
misunderstanding topic count cost;
desire for global ordering.

Damage

unrelated consumers coupled to unrelated producer changes;
schema compatibility becomes impossible;
retention cannot match event needs;
ACLs become too broad;
replay becomes dangerous;
consumer filtering wastes resources;
ownership unclear.

Better Design

Create domain-owned topics with coherent event semantics:

cpq.quote-created.v1
cpq.quote-priced.v1
oms.order-submitted.v1
billing.invoice-issued.v1

A topic should group events that share ownership, key semantics, retention, security classification, and consumer expectations.

8. Anti-Pattern: Topic per User / Tenant / Entity

Symptom

A topic is created for each tenant, user, customer, case, order, or device.

user.123.events
user.124.events
user.125.events

Why It Is Wrong

Kafka topics and partitions are operational resources. Topic explosion increases metadata, ACL management, monitoring cardinality, and operational complexity.

Damage

controller metadata pressure;
impossible governance;
dashboard cardinality explosion;
ACL sprawl;
difficult retention management;
poor capacity planning.

Better Design

Use topic + key + headers:

topic: customer.activity.v1
key: customerId
headers: tenantId, region, classification

Use separate topics only when tenants require distinct retention, security, ownership, or physical isolation.

9. Anti-Pattern: Database Table Mirroring Without Domain Meaning

Symptom

Kafka topics are named after database tables:

public.quote
public.quote_line
public.quote_status_history

Consumers reconstruct business behavior from row-level changes.

When It Is Acceptable

CDC table mirroring can be valid for integration, replication, analytics, or low-level data capture.

Why It Becomes an Anti-Pattern

It becomes dangerous when table changes are treated as stable domain events.

Database tables are implementation details. Domain events are business facts.

Damage

consumers coupled to database schema;
refactoring database breaks event consumers;
event meaning unclear;
multiple row changes required to infer one business transition;
audit story becomes accidental.

Better Design

For domain integration, use outbox/domain events:

quote-created
quote-priced
quote-approved
quote-expired

Use CDC as a transport mechanism, not as a semantic API, unless table semantics are explicitly part of the contract.

10. Anti-Pattern: Command Disguised as Event

Symptom

Topic contains messages named like events but semantically commands:

UserShouldBeSuspended
InvoiceMustBeGenerated
OrderNeedsToBeCancelled

Why It Matters

An event is a fact that already happened. A command asks someone to do something.

Damage

unclear ownership;
consumer becomes hidden command handler;
retry semantics ambiguous;
multiple consumers may act on same command;
audit trail lies about what happened.

Better Design

Use explicit command topic if asynchronous command is intended:

command: cancel-order-requested
handler: oms-order-service
response event: order-cancelled or order-cancellation-rejected

Or use an event if the fact already occurred:

order-cancelled

11. Anti-Pattern: Mutable Event

Symptom

Producer republishes an event with the same eventId but changed business meaning.

Why It Is Wrong

Events should be immutable facts. Corrections should be new facts.

Damage

audit trail corrupted;
replay nondeterministic;
dedup logic hides correction;
consumers cannot explain state at time T.

Better Design

Use correction events:

quote-price-calculated
quote-price-corrected
quote-price-reversal-issued

Corrections must include causation metadata, such as the eventId of the event being corrected.
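As a rough illustration of corrections as new facts, the sketch below models a correction as a separate immutable event that points back at the original. The record shape, the field name causationId, and the helper priceAt are hypothetical and not taken from any schema in this series (Java 16+ for records).

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.List;
import java.util.UUID;

final class CorrectionEvents {
    // Immutable event. causationId is null for an original fact and holds the
    // eventId of the corrected event for a correction.
    record QuotePriceEvent(String eventId, String type, String quoteId,
                           BigDecimal amount, Instant occurredAt, String causationId) {}

    static QuotePriceEvent calculated(String quoteId, BigDecimal amount, Instant at) {
        return new QuotePriceEvent(UUID.randomUUID().toString(), "quote-price-calculated",
                quoteId, amount, at, null);
    }

    // A correction gets its own eventId; the original event is never changed.
    static QuotePriceEvent corrected(QuotePriceEvent original, BigDecimal newAmount, Instant at) {
        return new QuotePriceEvent(UUID.randomUUID().toString(), "quote-price-corrected",
                original.quoteId(), newAmount, at, original.eventId());
    }

    // Price as known at time t: the amount of the last event in the log whose
    // occurredAt is not after t. Assumes the list is in log order.
    static BigDecimal priceAt(List<QuotePriceEvent> log, Instant t) {
        BigDecimal price = null;
        for (QuotePriceEvent e : log) {
            if (!e.occurredAt().isAfter(t)) {
                price = e.amount();
            }
        }
        return price;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2026-01-01T00:00:00Z");
        QuotePriceEvent original = calculated("q-1", new BigDecimal("100.00"), t0);
        QuotePriceEvent fix = corrected(original, new BigDecimal("90.00"), t0.plusSeconds(3600));
        System.out.println(fix.type() + " caused by " + fix.causationId());
    }
}
```

Because both events stay in the log, a consumer can still explain the state at time T, which is what a republished event with the same eventId makes impossible.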

12. Anti-Pattern: Low-Cardinality Partition Key

Symptom

Key is a status, region, type, or boolean.

key = "APPROVED"
key = "FAILED"
key = "US"
key = "true"

Damage

hot partitions;
poor parallelism;
uneven broker load;
lag concentrated in one partition;
consumer scale ineffective.

Better Design

Choose a key with enough cardinality and correct ordering boundary:

quoteId
orderId
caseId
customerId
accountId

If both ordering and load distribution matter, evaluate composite keys carefully.
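The effect of key cardinality can be sketched with a stand-in partitioner. This is a simplified model under stated assumptions: partitionFor uses String.hashCode modulo the partition count, which is not the hash function Kafka's client uses, and the class name, key values, and the partition count of 12 are made up for illustration. The point it shows holds regardless of hash function: a topic cannot use more partitions than it has distinct keys.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class KeyCardinalityDemo {
    // Stand-in for a keyed partitioner: hash(key) mod numPartitions.
    // Assumption: String.hashCode replaces Kafka's real key hashing.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Number of distinct partitions that receive at least one of the keys.
    static int partitionsUsed(List<String> keys, int numPartitions) {
        Set<Integer> used = new HashSet<>();
        for (String key : keys) {
            used.add(partitionFor(key, numPartitions));
        }
        return used.size();
    }

    public static void main(String[] args) {
        int partitions = 12;
        List<String> statusKeys = List.of("APPROVED", "FAILED", "PENDING", "EXPIRED");
        List<String> quoteIds = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            quoteIds.add("quote-" + i);
        }
        System.out.println("status keys use " + partitionsUsed(statusKeys, partitions)
                + " of " + partitions + " partitions");
        System.out.println("quoteId keys use " + partitionsUsed(quoteIds, partitions)
                + " of " + partitions + " partitions");
    }
}
```

With four status values, at most four partitions ever receive data, so adding consumers beyond that number does not help.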

13. Anti-Pattern: Random Key When Ordering Is Required

Symptom

Producer uses UUID/random key to spread load.

Damage

events for same entity land on different partitions;
state transitions arrive out of order;
consumers need complex reorder buffers;
regulatory state machine becomes non-defensible.

Better Design

Key by aggregate/entity whose state transition order matters.

For high-volume hot entities, redesign the domain workflow instead of breaking ordering silently.

14. Anti-Pattern: Over-Partitioning

Symptom

Every topic gets hundreds or thousands of partitions "for future scale".

Damage

metadata overhead;
more file handles;
more leader election work;
more consumer assignment overhead;
slower recovery;
more internal topics for stream apps;
operational noise.

Better Design

Partition count should be based on:

target throughput;
consumer parallelism;
key cardinality;
broker count;
future growth;
recovery time;
operational overhead.

Do not optimize for imaginary future traffic without a scaling plan.

15. Anti-Pattern: Increasing Partitions Without Reviewing Key Semantics

Symptom

Lag occurs, so team increases partition count.

Hidden Problem

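For keyed records, the default partitioner derives the partition from a hash of the key modulo the partition count, so increasing partitions changes the key-to-partition mapping for future records. Existing records remain in their old partitions. Future records for the same key may route to a different partition.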
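A small simulation can make the remapping visible. It is a sketch under assumptions: partitionFor uses String.hashCode modulo the partition count as a stand-in for the real key hash, and the class name, key format, and the 6-to-12 change are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionRemapDemo {
    // Stand-in for a keyed partitioner: hash(key) mod numPartitions.
    // Assumption: String.hashCode replaces Kafka's real key hashing.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts keys whose partition differs between the old and new partition count.
    static long keysRemapped(List<String> keys, int oldCount, int newCount) {
        return keys.stream()
                .filter(k -> partitionFor(k, oldCount) != partitionFor(k, newCount))
                .count();
    }

    public static void main(String[] args) {
        List<String> orderIds = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            orderIds.add("order-" + i);
        }
        System.out.println(keysRemapped(orderIds, 6, 12)
                + " of " + orderIds.size() + " keys change partition when going from 6 to 12");
    }
}
```

Every remapped key has its older records in one partition and its newer records in another, and Kafka gives no ordering guarantee across the two.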

Damage

ordering assumptions may break for future records relative to old records;
stream joins/repartition assumptions change;
consumers may see entity history split across old/new partitions over time.

Better Design

Before increasing partitions:

confirm lag is parallelism-bound;
confirm key semantics tolerate changed mapping;
consider new topic version;
consider custom partitioning if stable mapping is required;
document impact.

16. Anti-Pattern: Schema-Free JSON for Critical Topics

Symptom

Events are JSON blobs with no registry or compatibility check.

Why Teams Do It

speed;
perceived flexibility;
no Schema Registry setup;
weak contract culture.

Damage

runtime-only failures;
consumer assumptions undocumented;
breaking changes discovered late;
no evolution discipline;
analytics pipelines infer wrong types;
replay breaks after producer changes.

Better Design

Use Avro, Protobuf, or JSON Schema with registry and compatibility mode. Even if JSON is required, govern it with JSON Schema.

17. Anti-Pattern: Compatibility Check as the Only Review

Symptom

Team says, "Schema Registry says it is backward compatible, so it is safe."

Why It Is Incomplete

Schema compatibility checks structural evolution. They do not guarantee business meaning compatibility.

Example

Changing field meaning from:

amount = total gross amount

to:

amount = net amount after discount

may be schema-compatible but semantically breaking.

Better Design

Add semantic review:

meaning changes;
unit changes;
precision changes;
nullability meaning;
enum lifecycle;
default value semantics;
consumer interpretation.

18. Anti-Pattern: Auto-Commit With Non-Idempotent Side Effects

Symptom

Consumer uses auto-commit while writing to external DB/API.

Failure

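With auto-commit, the consumer commits offsets on a timer as part of polling, independently of whether the application's side effect succeeded. If the side effect fails after the offset has been committed, Kafka treats the record as processed, but the business side effect is missing.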

Better Design

Use manual commit after durable side effect. Make side effects idempotent.

poll -> process -> durable side effect -> commit offset

For async processing, commit only contiguous completed offsets per partition.
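The contiguous-offset rule for async processing can be sketched as a small per-partition tracker. The class and method names are hypothetical, and the sketch leaves out the actual commit call. It follows the convention that the committed offset is the offset of the next record to read.

```java
import java.util.HashSet;
import java.util.Set;

// Tracks out-of-order completions for ONE partition.
final class ContiguousOffsetTracker {
    private long nextToCommit;                               // lowest offset not yet completed
    private final Set<Long> completedAhead = new HashSet<>(); // completed offsets above the gap

    ContiguousOffsetTracker(long firstOffset) {
        this.nextToCommit = firstOffset;
    }

    // Called when the side effect for a record has durably completed.
    synchronized void markCompleted(long offset) {
        if (offset < nextToCommit) {
            return; // already covered by an earlier advance
        }
        completedAhead.add(offset);
        // Advance across every completed offset that is now contiguous.
        while (completedAhead.remove(nextToCommit)) {
            nextToCommit++;
        }
    }

    // Offset that is safe to commit: everything below it has completed.
    synchronized long committableOffset() {
        return nextToCommit;
    }

    public static void main(String[] args) {
        ContiguousOffsetTracker tracker = new ContiguousOffsetTracker(10);
        tracker.markCompleted(12);
        tracker.markCompleted(11);
        System.out.println("committable with offset 10 still in flight: " + tracker.committableOffset());
        tracker.markCompleted(10);
        System.out.println("committable after offset 10 completes: " + tracker.committableOffset());
    }
}
```

Committing past a gap would mark the in-flight record as processed, so a crash at that moment would lose it.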

19. Anti-Pattern: No Idempotency Because "Kafka Is Exactly Once"

Symptom

Consumer writes to database or external API without deduplication because producer or Streams app uses exactly-once.

Why It Is Wrong

Kafka exactly-once semantics cover reads, processing, and writes that stay inside Kafka. They do not automatically make external databases or APIs exactly-once.

Damage

duplicate charges;
duplicate notifications;
duplicate workflow transitions;
audit confusion;
replay unsafe.

Better Design

Use idempotency key, dedup table, inbox pattern, ledger table, or monotonic state guard.
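One of these options, the dedup table or inbox, can be sketched as follows. This is an in-memory illustration only: InboxDeduplicator and processOnce are hypothetical names, and the in-memory set stands in for a database table with a unique constraint on the event ID. In a real inbox, recording the event ID and applying the side effect would share one database transaction.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class InboxDeduplicator {
    // Stands in for an inbox table keyed by eventId.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    // Runs the side effect only the first time an eventId is seen.
    // Returns true if the side effect ran, false if the event was a duplicate.
    boolean processOnce(String eventId, Runnable sideEffect) {
        if (!processedEventIds.add(eventId)) {
            return false; // duplicate delivery or replay
        }
        try {
            sideEffect.run();
            return true;
        } catch (RuntimeException e) {
            processedEventIds.remove(eventId); // forget the ID so a retry can run
            throw e;
        }
    }

    public static void main(String[] args) {
        InboxDeduplicator inbox = new InboxDeduplicator();
        AtomicInteger ordersCreated = new AtomicInteger();
        inbox.processOnce("evt-1", ordersCreated::incrementAndGet);
        inbox.processOnce("evt-1", ordersCreated::incrementAndGet); // redelivered
        System.out.println("orders created: " + ordersCreated.get());
    }
}
```

The same guard is what makes replay safe: replayed records carry event IDs the inbox has already seen.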

20. Anti-Pattern: Poison Pill Infinite Loop

Symptom

Consumer repeatedly fails on the same bad record and never advances.

Damage

partition blocked;
lag grows;
retry storm;
downstream SLA breach;
on-call confusion.

Better Design

Classify errors:

Error Type	Handling
transient	retry with backoff
deterministic bad data	DLQ/quarantine
unknown	bounded retry then DLQ with alert
fatal code/config	stop rollout, page owner

DLQ must include enough metadata for replay.
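The classification table above could be encoded roughly as below. The exception types BadRecordException and FatalConfigException, the Action names, and the choice to treat IOException and TimeoutException as transient are assumptions for this sketch; a real consumer would map its own exception hierarchy.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

final class ErrorClassifier {
    enum Action { RETRY_WITH_BACKOFF, SEND_TO_DLQ, SEND_TO_DLQ_AND_ALERT, STOP_AND_PAGE }

    // Hypothetical marker for deterministic bad data (for example, a record that cannot be parsed).
    static final class BadRecordException extends RuntimeException {
        BadRecordException(String message) { super(message); }
    }

    // Hypothetical marker for fatal code or configuration problems.
    static final class FatalConfigException extends RuntimeException {
        FatalConfigException(String message) { super(message); }
    }

    // attemptsSoFar counts attempts already made for this record, including the failed one.
    static Action classify(Throwable error, int attemptsSoFar, int maxAttempts) {
        if (error instanceof FatalConfigException) {
            return Action.STOP_AND_PAGE;            // fatal code/config
        }
        if (error instanceof BadRecordException) {
            return Action.SEND_TO_DLQ;              // deterministic bad data
        }
        if (error instanceof IOException || error instanceof TimeoutException) {
            return Action.RETRY_WITH_BACKOFF;       // transient
        }
        // Unknown: bounded retry, then DLQ with an alert.
        return attemptsSoFar < maxAttempts
                ? Action.RETRY_WITH_BACKOFF
                : Action.SEND_TO_DLQ_AND_ALERT;
    }

    public static void main(String[] args) {
        System.out.println(classify(new BadRecordException("unparseable payload"), 1, 3));
        System.out.println(classify(new RuntimeException("unexpected"), 3, 3));
    }
}
```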

21. Anti-Pattern: DLQ Graveyard

Symptom

Records go to DLQ, but nobody owns replay, analysis, or cleanup.

Damage

silent data loss from business perspective;
compliance gap;
storage growth;
no accountability;
operational false comfort.

Better Design

Every DLQ needs:

owner;
alert threshold;
schema;
error metadata;
triage SLA;
replay runbook;
retention policy;
dashboard.

22. Anti-Pattern: Retry Storm

Symptom

Many consumers retry aggressively against a failing downstream dependency.

Damage

downstream outage worsens;
Kafka lag grows;
duplicate attempts increase;
thread pools saturate;
circuit breakers trip globally.

Better Design

Use:

exponential backoff;
retry topics;
rate limits;
circuit breakers;
pause/resume;
bounded concurrency;
DLQ after bounded attempts.
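The first item in that list, exponential backoff, can be sketched with a cap and jitter. The class name, the parameter values, and the "full jitter" variant (a uniformly random delay between zero and the capped value) are choices made for this illustration, not settings prescribed by the lesson. Inputs are assumed to be non-negative.

```java
import java.util.concurrent.ThreadLocalRandom;

final class RetryBackoff {
    // Delay doubles with each attempt (attempt 0 = base) and never exceeds capMillis.
    static long cappedExponentialMillis(int attempt, long baseMillis, long capMillis) {
        int shift = Math.min(Math.max(attempt, 0), 30);
        // Compare against the cap before shifting so the multiplication cannot overflow.
        if (baseMillis > (capMillis >> shift)) {
            return capMillis;
        }
        return baseMillis << shift;
    }

    // Full jitter: a uniformly random delay in [0, delayMillis], so many
    // consumers that failed together do not retry at the same instant.
    static long withFullJitter(long delayMillis) {
        return ThreadLocalRandom.current().nextLong(delayMillis + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            long ceiling = cappedExponentialMillis(attempt, 100, 30_000);
            System.out.println("attempt " + attempt + ": sleep up to " + ceiling
                    + " ms, jittered to " + withFullJitter(ceiling) + " ms");
        }
    }
}
```

Backoff alone only slows each consumer down; the other items in the list (rate limits, circuit breakers, bounded concurrency) limit the combined load on the failing dependency.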

23. Anti-Pattern: Synchronous Request/Reply Over Kafka Everywhere

Symptom

Kafka is used like HTTP/RPC:

service A sends request event
service B consumes
service B sends response event
service A blocks waiting

When It Can Be Valid

Rarely, when asynchronous transport is required and timeout/correlation semantics are explicit.

Why It Is Usually Wrong

Kafka is optimized for durable streams and fan-out, not low-latency RPC.

Damage

hidden synchronous coupling;
hard timeout semantics;
correlation topic complexity;
blocked threads;
poor user-facing latency;
difficult error handling.

Better Design

Use HTTP/gRPC for synchronous query/command when immediate response is required. Use Kafka for durable facts and asynchronous workflows.

24. Anti-Pattern: Event Choreography Without Ownership

Symptom

Many services react to each other's events. Nobody owns the end-to-end business process.

Damage

emergent workflow bugs;
hard incident diagnosis;
no single process state;
compensation unclear;
circular event chains;
regulatory explanation difficult.

Better Design

Use choreography for simple independent reactions. Use orchestration/workflow engine when process state, timeout, compensation, and auditability are central.

25. Anti-Pattern: Kafka Streams State on Ephemeral Storage

Symptom

Kafka Streams app uses local state store on ephemeral pod storage without understanding restore cost.

Damage

slow startup;
repeated full state restore;
network and broker pressure;
rebalance instability;
pod churn causes processing delay.

Better Design

use persistent volume where appropriate;
configure standby replicas if needed;
size local state;
monitor restore time;
design changelog retention;
test pod loss.
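As a rough sketch of the first two items, the following builds the relevant Kafka Streams settings as plain properties. The keys application.id, state.dir, and num.standby.replicas are Kafka Streams configuration names; the class name and every value, including the application ID and the directory path, are hypothetical and would depend on your deployment.

```java
import java.util.Properties;

final class StreamsStateConfigSketch {
    // Builds only the state-related settings; a real application also needs
    // bootstrap servers, serdes, and the rest of its configuration.
    static Properties stateProps(String applicationId, String stateDir, int standbyReplicas) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        // Point local state at a directory backed by a persistent volume.
        props.setProperty("state.dir", stateDir);
        // Keep warm copies of state on other instances to shorten failover.
        props.setProperty("num.standby.replicas", Integer.toString(standbyReplicas));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration.
        Properties props = stateProps("quote-pricing-streams", "/var/lib/kafka-streams", 1);
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Neither setting removes the need to measure restore time; they change how often a full restore is needed and how long processing is delayed when one happens.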

26. Anti-Pattern: Deleting Kafka Streams Internal Topics to Fix Bugs

Symptom

An operator deletes changelog/repartition topics because the app is failing.

Damage

state loss;
duplicate/replayed outputs;
inconsistent materialized views;
irreversible recovery complexity.

Better Design

Use a documented reset process. Understand topology, state stores, output topics, and downstream idempotency first.

27. Anti-Pattern: Hidden Repartition in Kafka Streams or ksqlDB

Symptom

A join or group-by silently creates repartition/internal topics.

Damage

unexpected latency;
additional storage;
ACL failures;
internal topic ownership unclear;
degraded throughput;
partition/key mismatch.

Better Design

Inspect topology/query plan. Name processors/internal topics where possible. Review key alignment before joins and aggregations.

28. Anti-Pattern: ksqlDB SQL Sprawl

Symptom

Many persistent queries are created by different teams with unclear ownership.

Damage

unowned sink topics;
hidden pipelines;
no code review;
hard rollback;
duplicated logic;
production dependency on ad hoc SQL.

Better Design

Treat ksqlDB persistent queries as deployable artifacts:

version controlled SQL;
owner;
migration plan;
query naming convention;
sink topic governance;
dashboard;
rollback.

29. Anti-Pattern: Kafka Connect as Unowned Magic

Symptom

Connectors are installed and forgotten.

Damage

silent source lag;
sink duplicates;
credential expiration;
DLQ ignored;
schema drift;
external system overload.

Better Design

Every connector must have:

owner;
source/sink SLA;
offset policy;
DLQ policy;
credentials owner;
capacity model;
connector config in version control;
recovery runbook.

30. Anti-Pattern: Business Logic in SMTs

Symptom

Kafka Connect Single Message Transforms are used for complex domain rules.

Why It Is Risky

SMTs are useful for lightweight transformations. Complex business logic becomes hard to test, version, observe, and debug inside connector configuration.

Better Design

Use SMTs for structural adaptation. Use application code, Kafka Streams, or ksqlDB for domain transformation where lifecycle and tests are stronger.

31. Anti-Pattern: Shared Service Principal

Symptom

Many apps use the same Kafka username/certificate.

Damage

no auditability;
overbroad ACLs;
impossible credential rotation;
blast radius too large;
compromised service affects many apps.

Better Design

Use one principal per workload or bounded service identity. Grant least privilege by topic/resource.

32. Anti-Pattern: Wildcard ACL Everywhere

Symptom

Apps get read/write access to broad topic patterns.

User:app-* can Read/Write Topic:*

Damage

accidental writes;
unauthorized reads;
data exfiltration risk;
weak tenant isolation;
compliance failure.

Better Design

Define ACLs from data flow contracts:

producer can write only owned output topics;
consumer can read only required input topics;
Streams apps can access required internal topics;
Connect workers have connector-specific rights;
admin rights limited to platform automation.

33. Anti-Pattern: Offset Lag as the Only Alert

Symptom

Only consumer lag offset count is monitored.

Why It Is Incomplete

Offset lag does not directly express business freshness. A lag of 10 records can be severe if records are old and critical. A lag of 1 million can be acceptable during planned backfill.

Better Design

Monitor:

lag records;
lag age/freshness;
processing latency;
consumer error rate;
DLQ rate;
rebalance frequency;
downstream latency;
business SLO impact.
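The difference between lag in records and lag age can be sketched as below. The record LagSnapshot, the method names, and the plannedBackfill flag are hypothetical; how the log end offset, committed offset, and oldest unprocessed record timestamp are obtained is left out (Java 16+ for records).

```java
import java.time.Duration;
import java.time.Instant;

final class LagFreshness {
    // One partition's view at a point in time.
    record LagSnapshot(long logEndOffset, long committedOffset, Instant oldestUnprocessedTimestamp) {
        long lagRecords() {
            return Math.max(0, logEndOffset - committedOffset);
        }

        // Age of the oldest record that has not been processed yet.
        Duration lagAge(Instant now) {
            if (lagRecords() == 0) {
                return Duration.ZERO;
            }
            return Duration.between(oldestUnprocessedTimestamp, now);
        }
    }

    // Freshness alert: based on age, not on record count. A planned backfill
    // is expected to be behind, so it is excluded from this alert.
    static boolean breachesFreshness(LagSnapshot s, Instant now, Duration slo, boolean plannedBackfill) {
        if (plannedBackfill) {
            return false;
        }
        return s.lagAge(now).compareTo(slo) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        LagSnapshot small = new LagSnapshot(1_010, 1_000, now.minus(Duration.ofHours(2)));
        System.out.println("lag records: " + small.lagRecords()
                + ", breach: " + breachesFreshness(small, now, Duration.ofMinutes(5), false));
    }
}
```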

34. Anti-Pattern: No Replay Plan

Symptom

Team relies on Kafka retention but has never replayed.

Damage

replay causes duplicate side effects;
old schema cannot be read;
consumers cannot handle old event versions;
downstream systems overload;
audit trail unclear.

Better Design

Replay plan must include:

offset range;
rate limit;
idempotency;
schema versions;
downstream capacity;
metadata;
approval;
validation.

35. Anti-Pattern: DR Means "We Replicate Topics"

Symptom

DR design only says MirrorMaker, Cluster Linking, or another replication tool copies topic data.

Missing Pieces

schemas;
ACLs;
credentials;
consumer offsets;
producer routing;
application configs;
connector configs;
ksqlDB queries;
Streams state strategy;
failover authority;
failback strategy.

Better Design

DR must be tested as an end-to-end business recovery process.

36. Anti-Pattern: No Topic Ownership

Symptom

A topic exists, but nobody can answer:

who owns it;
whether it is still used;
whether it can be deleted;
what schema changes are safe;
what retention it needs;
what SLO it has.

Damage

platform clutter;
risky changes;
long incidents;
no cost accountability;
compliance uncertainty.

Better Design

Maintain topic catalog and enforce ownership at creation.

37. Anti-Pattern: Platform Team Owns All Kafka Semantics

Symptom

Application teams produce domain events but expect platform team to define correctness, schema meaning, replay impact, and idempotency.

Why It Is Wrong

Platform can own Kafka infrastructure. Domain teams must own business event semantics.

Better Ownership Split

Concern	Owner
Broker health	Platform
Topic policy framework	Platform
Event business meaning	Domain team
Schema semantic compatibility	Domain + consumers
Consumer idempotency	Consumer team
DR infrastructure	Platform
DR business validation	Domain + platform

38. Kafka Design Review Scorecard

Score each area from 0 to 3.

Area	0	1	2	3
Event semantics	unclear	named but ambiguous	mostly clear	precise, owned, documented
Topic design	ad hoc	basic naming	key/retention considered	full lifecycle/governance
Producer correctness	unsafe dual write	retries only	idempotence/outbox partially	failure-tested
Consumer correctness	auto-commit unsafe	manual commit	idempotent	replay-tested
Schema	none	schema exists	compatibility	semantic governance
Ordering	assumed	per-topic claim	key-based	failure-tested boundary
Retry/DLQ	none	basic DLQ	bounded retry	replay runbook + owner
Observability	logs only	lag dashboard	SLO metrics	business-impact runbooks
Security	shared access	basic ACL	least privilege	audited identity lifecycle
DR	none	data replicated	failover documented	game-day tested

Interpretation:

Total	Meaning
0–10	Not production-ready
11–20	Prototype or low-criticality only
21–25	Production with risks
26–30	Strong production design

The score is not the goal. The conversation is the goal.
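For teams that keep scorecards in a tool or repository, the arithmetic can be sketched as below. The class and method names are hypothetical; the bands are the ones from the interpretation table, which assumes all ten areas are scored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReviewScorecard {
    // Sums per-area scores, rejecting anything outside 0..3.
    static int total(Map<String, Integer> scoresByArea) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : scoresByArea.entrySet()) {
            int score = e.getValue();
            if (score < 0 || score > 3) {
                throw new IllegalArgumentException("Score for " + e.getKey() + " must be 0..3");
            }
            sum += score;
        }
        return sum;
    }

    // Maps a total (0..30) to the band from the interpretation table.
    static String interpret(int total) {
        if (total < 0 || total > 30) {
            throw new IllegalArgumentException("Total must be 0..30");
        }
        if (total <= 10) return "Not production-ready";
        if (total <= 20) return "Prototype or low-criticality only";
        if (total <= 25) return "Production with risks";
        return "Strong production design";
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        scores.put("Event semantics", 3);
        scores.put("Topic design", 2);
        scores.put("Producer correctness", 2);
        scores.put("Consumer correctness", 1);
        scores.put("Schema", 2);
        scores.put("Ordering", 2);
        scores.put("Retry/DLQ", 1);
        scores.put("Observability", 1);
        scores.put("Security", 2);
        scores.put("DR", 0);
        int sum = total(scores);
        System.out.println(sum + ": " + interpret(sum));
    }
}
```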

39. Architecture Review Walkthrough: CPQ to OMS

Consider a CPQ platform producing quote events consumed by OMS.

39.1 Weak Design

topic: all-events
key: random UUID
payload: arbitrary JSON
consumer: OMS auto-commits and writes orders
retry: infinite retry
DLQ: none
schema: none
observability: offset lag only

Failure modes:

quote lifecycle events reorder;
OMS creates duplicate orders;
bad record blocks partition;
schema change breaks runtime;
replay unsafe;
no audit defensibility.

39.2 Stronger Design

topic: cpq.quote-approved.v1
key: quoteId
schema: Protobuf/Avro with registry
producer: transactional outbox from CPQ DB
consumer: OMS idempotent inbox table
retry: bounded retry topics
DLQ: owned and replayable
observability: freshness SLO + DLQ + duplicate detection
DR: topic/schema/ACL/offset recovery tested

39.3 Review Diagram

39.4 Review Result

This design is not perfect, but it has explicit invariants:

CPQ DB and event publication are connected by outbox;
ordering boundary is quoteId;
OMS side effect is idempotent;
replay is possible;
DLQ has a process;
schema is governed;
event is audit-friendly.

40. Architecture Review Meeting Format

Keep the review efficient.

40.1 Inputs Required

system context diagram;
topic list;
event schema examples;
key/partition decision;
producer failure handling;
consumer offset/idempotency design;
retry/DLQ design;
observability plan;
security/ACL plan;
DR/replay plan.

40.2 Meeting Agenda

Business process and criticality.
Event semantics and topic ownership.
Key, ordering, partitioning.
Producer/consumer correctness.
Schema and compatibility.
Failure and replay.
Observability and operations.
Security and governance.
Decision, risks, actions.

40.3 Review Outcomes

Outcome	Meaning
Approved	Ready with no blocking issues
Approved with conditions	Specific fixes required before production
Needs redesign	Core invariant is unsafe
Deferred	Missing information prevents review

41. Design Smell Matrix

Smell	Ask This
"Kafka guarantees exactly once"	Exactly once between which systems?
"We can replay anytime"	Which side effects are idempotent?
"We use JSON for flexibility"	How are breaking changes prevented?
"One topic is simpler"	Simpler for whom: producer, consumer, operator, auditor?
"We will add partitions later"	What happens to keyed ordering?
"DLQ is enough"	Who replays it and how?
"Consumers can ignore fields"	What about changed meaning?
"DR copies topics"	What about schemas, ACLs, offsets, and applications?
"ksqlDB is easier"	Who owns persistent query lifecycle?
"Connect handles it"	What are connector offset and duplicate semantics?

42. Production Go/No-Go Checklist

Do not ship a Kafka flow to production unless these are answered.

Semantics

Event names represent facts or explicitly named commands.
Event owner is defined.
Business criticality is defined.
Correction/reversal model exists if needed.

Topic and Key

Topic name follows convention.
Key chosen based on ordering boundary.
Partition count justified.
Retention and cleanup policy justified.
Topic has owner metadata.

Producer

Dual-write risk addressed.
Reliability settings appropriate.
Event ID and trace metadata included.
Serialization failures handled.
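For the "Reliability settings appropriate" item, a reviewer might expect to see something like the following for a critical topic. This is a sketch, not a recommended baseline: the configuration keys are Kafka producer settings, but the class name and the specific values (including the timeout) are assumptions that should be checked against your client version and latency budget.

```java
import java.util.Properties;

final class ProducerReliabilityConfigSketch {
    // Builds only the reliability-related settings; serializers and security
    // settings are left out.
    static Properties reliableProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Wait for all in-sync replicas before treating a send as successful.
        props.setProperty("acks", "all");
        // Let the broker discard duplicates caused by producer retries.
        props.setProperty("enable.idempotence", "true");
        // Idempotence requires at most 5 in-flight requests per connection.
        props.setProperty("max.in.flight.requests.per.connection", "5");
        // Upper bound on how long a send may be retried before it fails (illustrative value).
        props.setProperty("delivery.timeout.ms", "120000");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address.
        reliableProducerProps("localhost:9092")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These settings address retries and broker-side durability only. They do not address the dual-write risk in the first checklist item, which needs an outbox or an equivalent pattern.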

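Consumer

Offset commit happens after the durable side effect.
Processing is idempotent.
Poison pill handling defined.
Retry and DLQ policy defined.
Backpressure handled.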

Schema

Schema registered.
Compatibility mode set.
Semantic compatibility process defined.
Consumer contract tests exist.

Operations

Metrics and dashboards exist.
Alerts map to business impact.
Runbook exists.
Ownership and on-call known.

Security

Principal is workload-specific.
ACLs are least privilege.
Data classification defined.
Secrets rotation understood.

DR and Replay

Replay tested.
DR tier defined.
Failover behavior understood.
Offset/schema/ACL recovery considered.

43. What Senior Reviewers Look For

Senior engineers do not look only for correctness under normal conditions. They look for clarity under stress.

They ask:

Can this be operated at 3 a.m.?
Can a new team understand the event contract?
Can we replay without fear?
Can we fail over without inventing procedure during incident?
Can we prove what happened to an auditor?
Can we evolve schema without breaking unknown consumers?
Can we scale without changing business semantics?
Can we remove this topic later?

If the answer is no, the design may still be acceptable for a prototype, but it is not yet mature production architecture.

44. Exercises

Exercise 1 — Review a Topic

Pick one topic and answer:

owner;
event type;
key;
retention;
schema;
compatibility;
known consumers;
replay policy;
DR tier.

If any answer is unknown, create an action item.

Exercise 2 — Find Hidden Coupling

Draw producers and consumers for a domain flow. Mark:

synchronous dependencies;
event dependencies;
schema dependencies;
operational dependencies;
security dependencies.

Identify coupling that is not visible in service diagrams.

Exercise 3 — Replay Readiness

For one consumer, simulate replay of 100 historical records.

Validate:

no duplicate side effects;
old schema compatibility;
expected state result;
replay observability;
rate limiting.

Exercise 4 — Anti-Pattern Hunt

Search your architecture for:

mega-topic;
unowned DLQ;
shared principal;
schema-free critical JSON;
auto-commit side effect;
retry storm risk;
hidden ksqlDB query;
Connect black box.

Exercise 5 — Review Scorecard

Score a real Kafka design using the scorecard in Section 38. Use low scores to drive improvement, not blame.

45. Key Takeaways

Kafka architecture review starts from domain invariants, not broker config.
Topic design is API design.
Key design is ordering and scaling design.
Schema compatibility is necessary but not sufficient.
Exactly-once does not eliminate idempotency for external side effects.
DLQ without owner and replay is delayed data loss.
DR without schema, ACL, offset, and application behavior is incomplete.
Anti-patterns usually begin as shortcuts that survived too long.
Production readiness is proven by failure behavior, not happy-path throughput.

46. Bridge to Final Capstone

Part 035 will use everything from this series to build a capstone architecture:

Java services;
Kafka producers/consumers;
transactional outbox;
Schema Registry;
Kafka Streams;
ksqlDB;
Kafka Connect;
observability;
security;
deployment;
runbooks;
architecture decision records.

The capstone will not be a toy. It will be a production-grade reference platform design.

47. References

Apache Kafka Documentation: https://kafka.apache.org/documentation/
Apache Kafka Design: https://kafka.apache.org/41/design/design/
Apache Kafka Producer Configurations: https://kafka.apache.org/41/configuration/producer-configs/
Apache Kafka Consumer Design — Confluent: https://docs.confluent.io/kafka/design/consumer-design.html
Confluent Schema Registry Fundamentals: https://docs.confluent.io/platform/current/schema-registry/fundamentals/index.html
Kafka Streams Architecture — Confluent: https://docs.confluent.io/platform/current/streams/architecture.html
Kafka Connect Documentation — Confluent: https://docs.confluent.io/platform/current/connect/index.html
