Topic Design for Data Pipelines
Learn Java Data Pipeline Pattern - Part 034
Kafka topic design for Java data pipelines: domain boundary, partition key, compaction, retention, topic taxonomy, tenancy, lifecycle, security, DLQ, backfill, and production review.
Part 034 — Topic Design for Data Pipelines
Kafka topic design is architecture.
It is not naming convention trivia.
A topic defines a long-lived contract between producers, consumers, storage policy, ordering semantics, replay strategy, security boundary, and operational ownership.
Weak topic design creates years of downstream pain:
ambiguous ownership
unclear event meaning
unbounded schema drift
broken ordering
consumer-specific hacks
topic sprawl
retention mismatch
DLQ graveyards
PII leakage
impossible replay
expensive migrations
Strong topic design makes pipelines boring:
clear meaning
stable keys
known replay window
safe schema evolution
predictable consumer behavior
explicit ownership
observable failure modes
governed access
This part gives you a production-grade decision framework for Kafka topic design in Java data pipelines.
1. The Topic Is a Contract Boundary
A Kafka topic is not just a string.
It is a boundary.
At that boundary, the producer promises something and consumers build assumptions on top of it.
A topic contract should answer:
What does this stream mean?
Who owns it?
Who may write to it?
Who may read from it?
What is the key?
What ordering is guaranteed?
What schema is used?
What compatibility mode applies?
How long is data retained?
Is it compacted?
Is it replayable?
Is it authoritative?
What data classification applies?
What is the deprecation policy?
If those answers are missing, consumers infer them informally.
Informal assumptions become production dependencies.
Production dependencies become migration blockers.
2. Start With Meaning, Not Naming
Do not begin with:
Should the topic be called case-events-v1 or case.lifecycle.events?
Begin with:
What facts should exist in this stream?
What state changes do they represent?
Who owns those facts?
Who needs to derive state from them?
What ordering do they require?
How long must they be recoverable?
Good naming follows good meaning.
Bad meaning cannot be fixed by naming style.
3. Topic Taxonomy
Most production Kafka estates contain several topic types.
Treating all topics the same causes wrong retention, wrong keying, wrong security, and wrong consumer expectations.
3.1 Domain Event Topic
Represents business facts.
Example:
case.lifecycle.events
Meaning:
Facts about case lifecycle transitions that already happened.
Typical properties:
cleanup.policy: delete
key: caseId
retention: aligned with replay/audit hot window
schema: Avro/Protobuf/JSON Schema envelope
compatibility: backward or full transitive depending on consumers
owner: domain team
Use for:
- audit stream;
- read model rebuild;
- downstream analytics;
- stateful stream processing;
- event-driven integration.
3.2 Command Topic
Represents requested work.
Example:
notification.email.commands
Meaning:
Requests to send an email notification.
Typical properties:
cleanup.policy: delete
key: commandTargetId or business id
retention: short-to-medium operational window
idempotency: commandId required
owner: command receiver or platform team
Commands are not facts until executed.
A command topic should not be treated as an audit log of completed outcomes.
3.3 CDC Topic
Represents low-level source database changes.
Example:
db.enforcement.public.case
Meaning:
Changes captured from the public.case table in the enforcement database.
Typical properties:
cleanup.policy: delete or compact depending on use case
key: source primary key
schema: connector-generated schema
owner: source system/platform
classification: often same as source table
CDC topics expose source implementation details.
Do not treat them as stable domain APIs unless you intentionally govern them as such.
3.4 Canonical Event Topic
Represents normalized business events derived from raw sources or CDC.
Example:
enforcement.case.lifecycle.v1
Meaning:
Canonical case lifecycle facts independent of source table shape.
Typical properties:
cleanup.policy: delete
key: caseId
schema: strongly governed canonical schema
owner: data product/domain platform team
Canonical topics are useful when many consumers should not depend on raw CDC shape.
3.5 Projection / Latest-State Topic
Represents latest known value by key.
Example:
case.current-status
Typical properties:
cleanup.policy: compact
key: caseId
value: latest status projection
retention: compacted, optionally with delete retention
Use for enrichment and lookup.
Do not use as complete history.
3.6 Changelog Topic
Represents state store updates for stream processors.
Example:
case-sla-detector-case-state-changelog
Usually internal to Kafka Streams or stream processing infrastructure.
Avoid exposing internal changelog topics as public data products.
3.7 Retry Topic
Represents records delayed for retry.
Example:
case.lifecycle.events.retry.5m
Retry topics should have clear policy:
what error class enters
how many attempts
delay/backoff
next destination
headers preserved
original topic/partition/offset preserved
3.8 DLQ / Quarantine Topic
Represents failed records requiring repair or explicit discard.
Example:
case.lifecycle.events.dlq
case.lifecycle.events.quarantine
DLQ/quarantine topics require ownership and replay tooling.
They are not trash bins.
3.9 Backfill Topic
Represents isolated historical replay.
Example:
case.lifecycle.events.backfill.2026-07-v2
Use sparingly.
Backfill topics prevent live-topic pollution but can create topic sprawl.
3.10 Audit Archive Topic
Sometimes a topic is designed specifically as a hot audit stream before archival.
Example:
audit.case.lifecycle.raw
Usually paired with immutable object storage/lakehouse sink.
4. Topic Design Decision Tree
Use this as a first-pass decision framework.
If you cannot classify the topic, the design is not ready.
5. Naming: Optimize for Meaning and Operations
A good topic name communicates:
domain
entity or data product
semantic type
optionally version/layer/environment when needed
Example style:
<domain>.<entity-or-product>.<semantic-type>
Examples:
case.lifecycle.events
case.assignment.events
case.current-status
case.sla-breach.detected
notification.email.commands
reference.officer-profile.current
For CDC:
db.<system>.<schema>.<table>
Examples:
db.enforcement.public.case
db.enforcement.public.case_assignment
For pipeline internals:
<pipeline>.<purpose>.<internal-type>
Examples:
case-sla-detector.case-state.changelog
case-sla-detector.events.retry.5m
case-sla-detector.events.dlq
Avoid vague names
Bad:
events
case-events
updates
integration
pipeline-data
realtime-feed
all-domain-events
Why bad:
- unclear ownership;
- unclear meaning;
- unclear schema boundary;
- likely to accumulate unrelated records;
- hard to apply retention/security correctly.
6. Should Topic Names Include Version?
Sometimes.
But do not put v1 everywhere by habit.
Schema evolution should usually happen inside a stable topic through compatibility rules.
Create a new topic version when semantics break, not merely when schema changes.
Do not create a new topic for safe additive schema change
Example:
Add optional field: escalationReason
If compatible, keep same topic.
Consider a new topic when meaning changes
Example:
Old: case.lifecycle.events means workflow status changes only.
New: case.lifecycle.events also includes assignment and SLA timer events.
That is semantic expansion that may break consumers.
Consider a new topic when key changes
Example:
Old key: caseId
New key: organizationId
This changes ordering and partitioning semantics.
Consider a new topic when retention/security changes fundamentally
Example:
Old topic contained non-PII public case metadata.
New topic contains sensitive enforcement notes.
Security classification changed.
A new topic may be required.
Rule:
Version topics for incompatible semantic contracts, not every schema revision.
7. Choosing Topic Granularity
One of the hardest design decisions is granularity.
Should you create one broad topic or many narrow topics?
Option A — Broad Domain Topic
Example:
case.lifecycle.events
Contains:
CaseOpened
CaseAssigned
CaseEscalated
CaseClosed
CaseReopened
Advantages:
- preserves order across related lifecycle events for same key;
- easier for consumers needing full lifecycle;
- fewer topics;
- simpler replay for aggregate history.
Disadvantages:
- consumers may filter many event types;
- schema union can grow;
- permissions are coarse;
- retention applies to all contained event types.
Option B — Narrow Event-Type Topics
Examples:
case.opened.events
case.assigned.events
case.escalated.events
case.closed.events
Advantages:
- clear event-specific schema;
- fine-grained access;
- targeted consumers;
- per-topic retention/security possible.
Disadvantages:
- cross-event ordering is harder;
- consumers needing lifecycle must subscribe to many topics;
- topic sprawl;
- harder aggregate replay.
Option C — Domain Topic + Derived Narrow Topics
Pattern:
case.lifecycle.events // authoritative broad fact stream
case.closed.events // derived narrow stream for specific consumers
case.sla-breach.detected // derived analytical/domain event
This is often best for mature platforms.
The broad topic preserves aggregate history.
Derived topics optimize consumer-specific access.
But derived topics must be documented as derived, not authoritative source of truth.
8. Topic Boundary Heuristics
Use these rules of thumb.
Same topic when
events describe the same aggregate lifecycle
same owner controls semantics
same key preserves required order
same security classification
same retention requirement
same compatibility policy
most consumers understand the union
Separate topic when
different owner
different key
different security classification
different retention/replay requirement
different semantic lifecycle
different consumer population
different throughput/capacity profile
different compatibility policy
Create derived topic when
source topic is authoritative
many consumers repeatedly filter the same subset
subset has distinct access/security needs
subset is materially easier to consume
derivation is deterministic and observable
9. Partition Key Selection
Partition key is the most important topic design decision after meaning.
It controls:
- per-key ordering;
- parallelism;
- data distribution;
- compaction identity for compacted topics;
- state locality;
- join feasibility;
- consumer scaling;
- hot partition risk.
Ask:
What state must be processed in order?
That state owner is usually the key.
For case lifecycle:
key = caseId
For officer workload updates:
key = officerId
For tenant usage aggregation:
key = tenantId
For payment movement ledger:
key = accountId or ledgerAccountId
For idempotency index:
key = eventId
Bad key choices
random UUID:
good distribution, bad ordering
current timestamp:
hot partitions and no state meaning
event type:
hot partitions and poor entity ordering
tenantId only:
may create hot partition for large tenant
null key:
partitioner-specific behavior, usually poor for stateful processing
10. Composite Keys
Sometimes one field is not enough.
Example:
tenantId + caseId
Composite key helps:
- preserve tenant isolation;
- preserve per-case ordering;
- avoid key collision across tenants;
- support multi-tenant routing.
In Java, avoid ad-hoc string concatenation without escaping/versioning.
Bad:
String key = tenantId + ":" + caseId;
Better:
public record TopicKey(
String tenantId,
String entityType,
String entityId
) {
public String canonical() {
return String.join("|",
escape(tenantId),
escape(entityType),
escape(entityId)
);
}
}
For binary schemas, a typed key format may be better.
But remember: Kafka partitioner needs bytes.
Whatever encoding you use must be stable.
11. Hot Partition Analysis
A correct key can still create skew.
Example:
key = tenantId
If one tenant produces 70% of traffic, one partition may become hot.
Options:
Option A — Keep key, accept skew
Use when ordering per tenant is mandatory and volume is manageable.
Option B — Key by lower-level entity
key = caseId instead of tenantId
Preserves per-case ordering and distributes traffic better.
Option C — Key salting
key = tenantId + bucket
Increases distribution but breaks strict per-tenant ordering.
Use only when consumers can handle it.
Option D — Split domain stream
Separate extremely high-volume producer/tenant/type into another topic.
Option E — Dedicated pipeline for hot entity
Use special handling when one entity is naturally large.
Decision rule:
Never fix hot partition by breaking ordering unless you have redesigned the downstream state model.
12. Partition Count Selection
Partition count affects maximum parallelism and operational cost.
Inputs:
expected producer throughput
expected consumer throughput per instance
record size
peak traffic multiplier
required consumer parallelism
broker capacity
replication factor
retention size
future growth
ordering boundary
state store size
rebalancing tolerance
A simple starting formula:
partitions_by_write = ceil(peak_write_mb_per_sec / target_mb_per_sec_per_partition)
partitions_by_read = desired_max_consumer_instances
partition_count = max(partitions_by_write, partitions_by_read, minimum_operational_floor)
But this is only a sizing aid.
Partition count is also a semantic decision.
If you increase partition count later, key-to-partition mapping may change for new records.
Production heuristic
For important topics:
Choose enough partitions for 12-24 months of growth.
Avoid extreme over-partitioning.
Document expansion implications.
Load test with realistic record size and key distribution.
13. Replication Factor and Durability
Replication factor determines how many broker replicas store each partition.
Common production baseline:
replication.factor = 3
min.insync.replicas = 2
producer acks = all
This is not universal, but it is a common high-durability posture.
The topic contract should state durability expectations.
For low-value telemetry, a lower durability posture may be acceptable.
For regulatory facts, financial events, audit records, or case decisions, durability should be stricter.
Tie configuration to data criticality.
Do not copy one setting everywhere without classification.
14. Retention Design
Retention should be based on recovery and audit requirements.
Questions:
How long can a consumer be down and still recover from Kafka?
How long must a new consumer bootstrap from Kafka?
Is Kafka the authoritative replay source?
Is there a raw archive elsewhere?
What is the maximum acceptable data loss window?
What is the expected backfill horizon?
What legal/regulatory retention applies?
What is the storage cost?
Common retention classes
short operational buffer:
hours to days
standard replay buffer:
7-30 days
extended replay buffer:
90-365 days
latest-state compacted:
compaction-based, not full history
archive-backed:
Kafka hot window + object storage long-term archive
Alerting implication
If retention is 7 days, do not alert only when consumer lag is 8 days.
Alert when oldest unprocessed event age approaches threshold.
Example:
warning at 60% of retention
critical at 80% of retention
page at 90% of retention
15. Delete Retention vs Compaction
Kafka cleanup policy is central to topic meaning.
Delete policy
Keeps records for time/size retention window.
Good for:
- immutable event logs;
- CDC history windows;
- command streams;
- retry streams;
- DLQ streams;
- audit hot windows.
Compact policy
Keeps latest value per key eventually.
Good for:
- current-state topics;
- reference data;
- lookup/enrichment topics;
- changelog topics;
- idempotency state distribution.
Delete + compact
Can be used when you want latest value plus bounded deletion behavior.
But understand cleanup semantics carefully before depending on it.
Rule
If consumers need historical sequence, do not rely on compaction.
If consumers need latest state by key, compaction may be ideal.
16. Tombstones in Compacted Topics
In compacted topics, a record with key and null value is often used as a tombstone.
It represents deletion of the key's latest value.
Design questions:
What does deletion mean?
Is it physical deletion, logical deletion, privacy deletion, or source absence?
Should consumers remove derived state?
How long are tombstones retained?
Can late older records resurrect deleted state?
Consumer rule:
Tombstone handling must be explicit.
Example Java shape:
public sealed interface CurrentCaseStatusRecord permits CaseStatusValue, CaseStatusTombstone {
String caseId();
}
public record CaseStatusValue(String caseId, String status, Instant updatedAt)
implements CurrentCaseStatusRecord {}
public record CaseStatusTombstone(String caseId, Instant deletedAt, String reason)
implements CurrentCaseStatusRecord {}
If you rely only on raw null value, you may lose semantic deletion reason.
Sometimes a domain-level deletion event plus compacted tombstone is the better pair.
17. Topic Schema Strategy
Topic schema strategy must match topic type.
Single event envelope with union payload
Example:
{
"eventId": "...",
"eventType": "CaseEscalated",
"occurredAt": "...",
"payload": { }
}
Good for broad domain event topics.
One schema per event type
Good for narrow event-type topics or systems with schema registry subject per record type.
CDC connector schema
Generated by connector.
Good for raw capture, not necessarily for public consumers.
Latest-state schema
Good for compacted projection topics.
Error envelope schema
DLQ should have a stable wrapper:
original topic/partition/offset
original key
original headers
original payload
error category
error message
processor name
processor version
failedAt
replay hint
DLQ schema matters as much as source schema.
18. Topic Subject Naming for Schema Registry
Schema registry subject naming affects compatibility scope.
Common strategies:
TopicNameStrategy:
one subject per topic key/value
RecordNameStrategy:
one subject per fully qualified record name
TopicRecordNameStrategy:
one subject per topic + record name
The choice affects whether multiple event types can coexist in one topic and how compatibility is checked.
For broad event topics, you need to intentionally design subject naming and event envelope strategy.
Do not let default settings accidentally define your long-term schema boundary.
19. Topic Ownership
Every production topic needs an owner.
Owner responsibilities:
approve producers
approve contract changes
maintain schema compatibility
publish deprecation notices
monitor topic health
respond to DLQ/quarantine issues
manage retention and cost
review access requests
own documentation
coordinate incident response
Topic ownership should not default to “the Kafka platform team”.
The platform team owns the substrate.
The domain/data product team owns semantics.
20. Producer Ownership and Allowed Writers
A topic should define allowed producer identities.
Reasons:
- prevent rogue writes;
- preserve contract quality;
- enforce schema compatibility;
- maintain event identity discipline;
- protect ordering assumptions;
- support incident tracing.
Good pattern:
case.lifecycle.events
owner: Case Management domain team
allowed producers:
- case-service-outbox-relay-prod
- case-migration-backfill-job-prod
producer approval required: yes
Dangerous pattern:
Any service can publish if it has the schema.
That turns the topic into a shared mutable bus.
21. Consumer Governance
Consumers build dependencies.
The topic owner should know important consumers.
At minimum, track:
consumer group id
owning team
purpose
criticality
contact channel
fields consumed
freshness requirement
replay dependency
schema compatibility expectation
PII authorization
This enables impact analysis.
If a producer wants to remove field escalationReason, it must know which consumers depend on it.
A schema registry can catch structural incompatibility.
It cannot always catch semantic incompatibility.
Consumer registry fills that gap.
22. Multi-Tenant Topic Design
Multi-tenant systems need explicit topic strategy.
Option A — Shared Topic, Tenant in Key/Header/Payload
Example:
case.lifecycle.events
key = tenantId|caseId
Advantages:
- fewer topics;
- easier platform operations;
- consumers can process all tenants;
- shared retention and schema.
Disadvantages:
- security isolation is coarser;
- noisy tenant can affect others;
- per-tenant retention/access is hard;
- hot tenant skew possible.
Option B — Topic Per Tenant
Example:
tenant-a.case.lifecycle.events
tenant-b.case.lifecycle.events
Advantages:
- strong isolation;
- per-tenant access/retention;
- easier tenant-level deletion/export.
Disadvantages:
- topic explosion;
- operational overhead;
- schema rollout complexity;
- consumer deployment complexity.
Option C — Tiered Hybrid
Shared topic for normal tenants.
Dedicated topic for large/sensitive tenants.
Often practical.
Decision drivers:
tenant count
data isolation requirement
regulatory boundary
throughput skew
retention differences
operational maturity
consumer model
Do not choose shared topics if tenant isolation is a hard compliance boundary unless compensating controls are strong.
23. Environment Strategy
Avoid mixing environments inside one topic.
Bad:
case.lifecycle.events with env header = dev/test/prod
Better:
separate clusters or separate namespaces with strong ACLs
Examples:
prod.case.lifecycle.events
stg.case.lifecycle.events
Or environment separation at cluster level:
prod Kafka cluster:
case.lifecycle.events
staging Kafka cluster:
case.lifecycle.events
Production and test data should not share the same operational log unless you have a very specific controlled reason.
24. Security Classification by Topic
Classify each topic.
Example classes:
PUBLIC
INTERNAL
CONFIDENTIAL
SENSITIVE_PII
REGULATED_EVIDENCE
SECRET
Classification controls:
- ACLs;
- retention;
- encryption;
- masking/tokenization;
- logging restrictions;
- DLQ handling;
- schema visibility;
- cross-region replication;
- consumer approval;
- archive policy.
A topic with mixed classification is difficult to govern.
If one event type contains sensitive notes and another contains public case status, putting both in the same topic may force the whole topic to the strictest classification.
This may be correct.
Or it may be a sign that topic granularity is wrong.
25. DLQ Topic Design
DLQ topic design is often neglected.
A good DLQ topic preserves enough information to repair and replay.
DLQ record should include:
failureId
failedAt
processorName
processorVersion
errorCategory
errorCode
errorMessage
stackTraceHash
originalTopic
originalPartition
originalOffset
originalTimestamp
originalKey
originalHeaders
originalValue
schemaId if known
contractVersion if known
replayEligibility
attemptCount
correlationId
traceId
tenantId
classification
DLQ topic naming
<source-topic>.dlq
<consumer-name>.<source-topic>.dlq
Choose based on ownership.
If failures are consumer-specific, consumer-specific DLQ is often clearer.
Example:
case-search-indexer.case.lifecycle.events.dlq
This avoids mixing failures from different processors with different error meanings.
DLQ retention
DLQ retention should usually be longer than source hot retention.
Why?
Failed records may need investigation after source records expire.
DLQ access
DLQ may contain original sensitive payload plus error context.
Do not make it broadly readable.
26. Retry Topic Design
Retry topic design should avoid uncontrolled retry storms.
Common pattern:
case-indexer.events.retry.1m
case-indexer.events.retry.5m
case-indexer.events.retry.30m
case-indexer.events.dlq
Record headers:
x-original-topic
x-original-partition
x-original-offset
x-attempt-count
x-first-failed-at
x-last-error-code
x-next-visible-at
Retry topics are useful when:
- sink is temporarily unavailable;
- rate limit should be respected;
- transient network failures happen;
- order can tolerate delayed records.
Be careful with strict per-key ordering.
If record N fails and record N+1 succeeds, retry topics may reorder effects.
For ordering-sensitive consumers, partition pause and blocking retry may be safer than side-lane retry.
Decision:
If per-key order is mandatory, retry strategy must preserve or explicitly repair ordering.
27. Backfill Topic Design
Backfill topics isolate historical processing.
Example:
case.lifecycle.events.backfill.2026-07-rebuild-v2
Backfill topic contract should state:
source data range
extraction timestamp
transform version
schema version
processing mode
target sink
merge policy
owner
expiry date
cleanup date
Backfill topics should often be temporary.
Add TTL/lifecycle owner.
Otherwise topic sprawl becomes permanent.
Alternative:
Use same topic with backfill headers only if live consumers can safely ignore/process backfill mode.
That is rarely safe without careful design.
28. CDC Topic Design
CDC topics should preserve source metadata.
Typical CDC topic dimensions:
source system
database/schema/table
primary key
operation type
before/after image
source commit timestamp
transaction id
log position
snapshot indicator
schema history
For source table:
enforcement.public.case
CDC topic:
db.enforcement.public.case
Key:
case primary key
But CDC table topics are not always ideal consumer APIs.
Raw CDC is good for:
- replication;
- canonical event derivation;
- lake ingestion;
- search indexing when table shape is acceptable;
- operational data sync.
Raw CDC is risky for:
- public cross-team event contracts;
- long-term semantic stability;
- domain-level event interpretation;
- consumers that should not know database internals.
Common architecture:
29. Outbox Topic Design
Outbox topics are usually cleaner than raw table CDC for domain events.
Outbox table columns often include:
id
aggregate_type
aggregate_id
event_type
event_version
payload
headers
occurred_at
created_at
Topic routing can use aggregate_type or event category.
Example:
outbox row:
aggregate_type = case
event_type = CaseEscalated
topic:
case.lifecycle.events
key:
aggregate_id = caseId
Outbox topic design should preserve:
- aggregate ordering;
- event identity;
- schema version;
- causation/correlation;
- producer service identity;
- transaction boundary.
Avoid one global outbox topic for all domains unless you are intentionally building a central integration bus with strict governance.
30. Internal vs Public Topics
A topic can be public, private, or internal.
Public data product topic
case.lifecycle.events
Promised to consumers.
Requires contract, documentation, compatibility, support.
Private service topic
case-service.internal.retry
Used by one service/team.
Still needs operational quality, but contract is narrower.
Internal framework topic
streams-app-KTABLE-AGGREGATE-STATE-STORE-changelog
Generated by framework.
Do not expose as product contract.
Rule:
Consumers should not depend on internal topics unless the owner explicitly promotes them to public contract.
31. Topic Lifecycle
Every topic should have lifecycle state.
Proposed
Design under review.
Experimental
Used by limited consumers. Contract may change.
Production
Stable contract. Changes governed.
Deprecated
No new consumers. Migration plan exists.
Retired
No active consumers. Data retained only if required.
Deleted
Topic removed after retention/compliance checks.
Lifecycle avoids permanent zombie topics.
32. Deprecation Strategy
Deprecating a topic requires more than saying “do not use”.
Checklist:
identify active consumer groups
identify known downstream datasets
publish migration target
run dual publish if needed
provide field mapping
define sunset date
monitor remaining consumption
block new consumer access
archive old topic if required
remove producers
delete after retention/compliance approval
For important topics, deprecation may take months.
Designing topics carefully up front is cheaper.
33. Cross-Region and Replication Concerns
If topics are replicated across regions or clusters, define:
source of truth cluster
replication direction
active-active or active-passive
conflict handling
ordering expectations after replication
consumer failover behavior
topic naming in target cluster
offset translation strategy
schema registry replication
PII export restrictions
Cross-region replication can change latency, ordering visibility, failover behavior, and compliance posture.
Do not treat replicated topics as identical unless operationally proven.
34. Topic Design for Joins
Stream joins require compatible partitioning.
If two Kafka streams are joined by caseId, both should be keyed by caseId or repartitioned.
Repartitioning costs:
- extra internal topic;
- extra network IO;
- extra storage;
- extra latency;
- extra failure surface;
- possible schema/key confusion.
Design topics with known joins in mind.
Example:
case.lifecycle.events
key = caseId
case.risk-score.events
key = caseId
case.officer-assignment.events
key = caseId
This makes case-centric joins natural.
If one topic is keyed by officerId, case-centric join requires repartition or lookup strategy.
There is no universally correct key.
There is only a key aligned with primary state access pattern.
35. Topic Design for Enrichment
Enrichment commonly uses compacted reference topics.
Example:
reference.officer-profile.current
key = officerId
cleanup.policy = compact
A stream processor consumes:
case.assignment.events
and enriches with officer profile current state.
Design questions:
Should enrichment use current reference state or historical state as of event time?
Is enriched output reproducible during replay?
What happens if reference data changes after original event?
Is reference topic compacted only, or is reference history also retained elsewhere?
For audit-grade replay, current-state enrichment may be wrong.
You may need temporal reference history.
36. Topic Design for Materialized Views
Materialized view topics should document derivation.
Example:
case.current-status
Contract:
source_topics:
- case.lifecycle.events
key: caseId
cleanup.policy: compact
value_semantics: latest derived case status
ordering_dependency: per-case lifecycle order
rebuild_strategy: replay case.lifecycle.events from earliest available archive
owner: case data platform
Consumers must know whether the topic is authoritative or derived.
A derived compacted topic is convenient.
But if it is wrong, repair requires replaying its source, not editing records by hand.
37. Topic Design for Regulatory Pipelines
For regulatory/enforcement lifecycle systems, topic design needs extra rigor.
Important questions:
Can we prove when a case entered a state?
Can we prove who/what produced the event?
Can we reconstruct the timeline as known at the time?
Can corrections be represented without rewriting history?
Can we distinguish business effective time from processing time?
Can we show why an SLA breach was detected?
Can we identify downstream decisions affected by a bad event?
Can sensitive investigation notes be protected separately?
Suggested topic split:
case.lifecycle.events // immutable lifecycle facts
case.assignment.events // assignment facts
case.sla-breach.detected // derived detection facts
case.current-status // compacted latest status
case.investigation-note.events // sensitive, stricter ACL
case.lifecycle.events.dlq // repair stream
case.lifecycle.raw-archive // optional hot audit handoff stream
Do not mix highly sensitive notes into broad lifecycle topics unless every consumer is authorized.
38. Topic Configuration Template
A topic should be declared as code.
Example YAML:
topic: case.lifecycle.events
owner: case-management-domain
lifecycle: production
classification: regulated-evidence
purpose: Immutable business facts for case lifecycle transitions.
allowedProducers:
- case-service-outbox-relay-prod
allowedConsumerApproval: required
key:
format: string
fields:
- caseId
orderingGuarantee: All events for the same caseId are ordered within one partition.
value:
schemaFormat: avro
subjectStrategy: TopicRecordNameStrategy
compatibility: BACKWARD_TRANSITIVE
retention:
cleanupPolicy: delete
retentionMs: 2592000000 # 30 days
archive: s3://regulated-raw/case/lifecycle/events/
partitions: 48
replicationFactor: 3
minInSyncReplicas: 2
producer:
acks: all
idempotenceRequired: true
dlq:
topic: case-lifecycle-projector.case.lifecycle.events.dlq
owner: case-data-platform
observability:
freshnessSlo: PT5M
alertWhenOldestUnprocessedAgeExceeds: PT20M
alertBeforeRetentionLossAtPercent: 80
sunset:
policy: no deletion without compliance approval
This is not only documentation.
It can drive automation:
- topic provisioning;
- ACL provisioning;
- schema registry configuration;
- retention checks;
- lineage metadata;
- owner registry;
- CI policy checks.
39. Java Topic Descriptor Model
In Java, avoid scattering topic strings.
Bad:
producer.send(new ProducerRecord<>("case.lifecycle.events", key, value));
Better:
public record TopicDescriptor(
String name,
TopicKind kind,
CleanupPolicy cleanupPolicy,
String owner,
DataClassification classification,
KeyStrategy keyStrategy,
String schemaSubject,
CompatibilityMode compatibilityMode
) {}
public enum TopicKind {
DOMAIN_EVENT,
COMMAND,
CDC,
PROJECTION,
RETRY,
DLQ,
BACKFILL,
INTERNAL
}
Then route through a catalog:
public interface TopicCatalog {
TopicDescriptor descriptorFor(DomainEvent event);
TopicDescriptor descriptorByName(String topicName);
}
Benefits:
- central topic governance;
- testable key strategy;
- no typo-driven production bugs;
- easier schema registry integration;
- easier documentation generation.
40. Topic Key Strategy in Java
Create explicit key strategies.
public interface KafkaKeyStrategy<T> {
String keyFor(T record);
String description();
}
public final class CaseLifecycleKeyStrategy implements KafkaKeyStrategy<CaseLifecycleEvent> {
@Override
public String keyFor(CaseLifecycleEvent event) {
return event.caseId().value();
}
@Override
public String description() {
return "caseId; preserves ordering for lifecycle events of the same case";
}
}
Test it.
@Test
void lifecycleEventsForSameCaseUseSameKey() {
CaseLifecycleKeyStrategy strategy = new CaseLifecycleKeyStrategy();
assertEquals("CASE-1", strategy.keyFor(opened("CASE-1")));
assertEquals("CASE-1", strategy.keyFor(escalated("CASE-1")));
}
This test seems simple.
It protects a major invariant.
41. Topic Contract Review Checklist
Use this before creating a production topic.
Meaning
Is the topic type clear?
Is the topic authoritative or derived?
Is the owner accountable for semantics?
Does the name communicate meaning?
Key and Ordering
What key is used?
What ordering is guaranteed?
What ordering is explicitly not guaranteed?
Can hot keys occur?
Can partition count expansion affect assumptions?
Schema
What schema format is used?
What compatibility mode applies?
How are event types represented?
Can historical records be decoded?
Are semantic changes governed?
Retention and Replay
How long is Kafka replay available?
Where is long-term archive?
Can new consumers bootstrap?
What happens if consumer lag exceeds retention?
Security
What classification applies?
Does topic contain PII/secrets/investigation notes?
Are DLQ/retry topics classified the same or stricter?
Are ACLs producer/consumer-specific?
Operations
What metrics and alerts exist?
Who handles DLQ?
Who approves backfill?
What is incident contact?
What is deprecation process?
42. Topic Design Anti-Patterns
Anti-Pattern 1 — One Topic Per Microservice Without Meaning
case-service-events
A service is not a semantic boundary by itself.
What events?
What facts?
What consumers?
Anti-Pattern 2 — One Global Enterprise Event Topic
enterprise.events
Looks simple but becomes impossible to govern.
Different domains have different owners, schemas, retention, classification, and keying.
Anti-Pattern 3 — Topic Per Event Type by Default
This creates topic sprawl and can break aggregate ordering.
Use when justified by schema/security/consumer/retention differences.
Anti-Pattern 4 — Null Key Everywhere
Destroys ordering and state locality.
Anti-Pattern 5 — Compaction on Business Event History
May remove historical facts.
Anti-Pattern 6 — No Owner
Nobody approves changes, handles incidents, or answers consumer questions.
Anti-Pattern 7 — DLQ Without Original Metadata
Impossible to replay safely.
Anti-Pattern 8 — Backfill Into Live Topic Without Mode
Can confuse live consumers and cause duplicate side effects.
Anti-Pattern 9 — Topic Version for Every Schema Change
Creates unnecessary migration burden.
Use schema compatibility for safe evolution.
Anti-Pattern 10 — Security Classification Ignored Until Later
Retroactive access cleanup is painful and risky.
43. Worked Example: Case Lifecycle Topics
Assume regulatory case management platform.
Domain facts
CaseOpened
CaseAssigned
CaseEscalated
CaseStatusChanged
CaseClosed
CaseReopened
Candidate topic design
case.lifecycle.events
Topic contract:
kind: DOMAIN_EVENT
owner: case-management-domain
key: caseId
cleanup.policy: delete
retention: 30 days hot
archive: immutable lake raw archive
classification: regulated-evidence
schema: Avro envelope with event-specific payloads
compatibility: backward transitive
ordering: per caseId only
Why one topic?
same aggregate lifecycle
same ordering key
same owner
same classification
consumers often need full lifecycle
Sensitive notes
case.investigation-note.events
Why separate?
higher sensitivity
restricted consumers
different retention policy
not needed by most lifecycle consumers
Latest status
case.current-status
Why separate?
compacted projection for enrichment/search
not full history
keyed by caseId
rebuilt from lifecycle events
SLA derived event
case.sla-breach.detected
Why separate?
derived analytical/domain event
owned by SLA detector/data platform
consumed by alerting and reporting
Diagram:
44. Worked Example: Wrong Design and Repair
Wrong design:
topic: case-events
key: random eventId
contains:
lifecycle changes
investigation notes
officer comments
notification commands
current status snapshots
cleanup.policy: compact
retention: 7 days
owner: platform
Problems:
random key breaks per-case ordering
compaction destroys event history
mixed sensitivity leaks notes to broad consumers
commands mixed with facts
snapshots mixed with events
owner does not own domain semantics
retention too short for recovery
name hides meaning
Repair direction:
case.lifecycle.events key=caseId delete retention
case.investigation-note.events key=caseId restricted ACL
notification.email.commands key=notificationId/caseId command semantics
case.current-status key=caseId compacted
case.lifecycle.events.dlq repair-owned DLQ
Migration pattern:
1. define new contracts
2. dual-publish or derive new topics from old
3. migrate consumers
4. freeze old topic producers
5. monitor remaining reads
6. archive old topic if needed
7. retire old topic
45. Testing Topic Design
Topic design should be testable.
Key tests
same aggregate -> same key
unrelated aggregate -> likely distributed
null key rejected for stateful topics
composite key encoded deterministically
Schema tests
new schema compatible with registered policy
old records decode with new consumer
new records rejected by old consumer if expected
semantic required fields preserved
Retention tests
consumer lag alert fires before retention loss
backfill source available beyond Kafka hot retention
DLQ retention longer than repair SLA
Security tests
unauthorized producer cannot write
unauthorized consumer cannot read
DLQ ACL matches source classification
schema visibility does not leak sensitive fields
Replay tests
new consumer can rebuild from earliest retained offset
archive replay produces equivalent output
compacted projection rebuilds from source event topic
46. Topic Design as Code
Mature organizations manage Kafka topics as code.
A repository may contain:
topics/
case.lifecycle.events.yaml
case.current-status.yaml
case.sla-breach.detected.yaml
schemas/
case/CaseOpened.avsc
case/CaseEscalated.avsc
policies/
regulated-evidence.yaml
CI checks:
name matches convention
owner exists
classification valid
retention allowed for classification
schema compatibility passes
partition count within platform limits
replication factor matches criticality
DLQ configured for production consumers
consumer approval required for sensitive topics
Provisioning applies:
- topic creation;
- partition count;
- retention;
- compaction;
- ACLs;
- schema subjects;
- lineage metadata;
- documentation pages.
This turns topic governance from meeting notes into executable policy.
47. Operational Runbook Per Topic
Each production topic should have a runbook.
Minimum runbook:
owner and escalation contact
purpose
critical consumers
normal throughput range
normal record size range
retention
oldest-lag alert threshold
producer failure playbook
consumer lag playbook
schema failure playbook
DLQ repair playbook
backfill procedure
deprecation procedure
Example incident:
Symptom:
case-search-indexer lag reaches 2 million records.
Questions:
Is source topic healthy?
Is lag increasing or draining?
Which partitions are lagging?
Is there key skew?
Is sink slow?
Are records failing validation?
Is oldest unprocessed event near retention risk?
Can we scale consumers within partition limit?
Should we pause low-priority consumers?
Without a runbook, operators guess under pressure.
48. Mental Model Recap
A Kafka topic is a durable semantic boundary.
Design it by answering:
meaning
ownership
key
ordering
schema
retention
cleanup policy
security
consumer population
replay source
failure handling
lifecycle
The biggest mistakes are usually not API mistakes.
They are meaning mistakes:
wrong boundary
wrong key
wrong cleanup policy
wrong retention
wrong classification
wrong owner
A good topic makes pipeline code simpler.
A bad topic forces every consumer to compensate forever.
49. Final Checklist
Before creating or approving a topic, verify:
[ ] topic type is explicit
[ ] semantic owner exists
[ ] allowed producers are known
[ ] important consumers are discoverable
[ ] key strategy is documented and tested
[ ] ordering guarantee is stated
[ ] partition count is justified
[ ] cleanup policy matches history/latest-state need
[ ] retention matches recovery requirement
[ ] schema format and compatibility mode are defined
[ ] security classification is assigned
[ ] DLQ/retry topics are designed if needed
[ ] replay/backfill strategy is known
[ ] topic lifecycle state is tracked
[ ] deprecation path exists
[ ] observability and alert thresholds exist
If any of these are missing, the topic is not production-ready.
50. References
- Apache Kafka Documentation — Topics, partitions, producers, consumers, and event streaming model.
- Apache Kafka Documentation — Topic configuration and cleanup policies.
- Confluent Documentation — Schema Registry subject naming and compatibility modes.
- Confluent Documentation — Kafka retention and log compaction.
- Debezium Documentation — CDC topic naming, event envelope, and outbox routing patterns.
You just completed lesson 34 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.