Build CoreOrdered learning track

Pipeline Topology Design

Learn Java Kafka in Action - Part 016

Pipeline topology design for Java Kafka systems: source-transform-sink architecture, fan-out/fan-in, branching, enrichment, ownership, contracts, lineage, replay, scaling, and operability.

23 min read4470 words
PrevNext
Lesson 1635 lesson track0719 Build Core
#java#kafka#pipeline-design#stream-processing+4 more

Part 016 — Pipeline Topology Design

Part 015 explained how data enters Kafka reliably through outbox, CDC, event sourcing, and CQRS. This part asks the next question:

Once data is in Kafka, how should we design the pipeline topology so it remains understandable, evolvable, observable, and safe to operate?

A Kafka pipeline is not just a sequence of consumers. It is a distributed data product composed of:

  • source contracts;
  • topic boundaries;
  • processing stages;
  • state stores;
  • enrichment dependencies;
  • retry and DLQ paths;
  • output contracts;
  • ownership boundaries;
  • replay and rebuild procedures;
  • deployment units;
  • operational SLOs.

A junior design says:

Service A reads topic X and writes topic Y.

A senior design asks:

  • What is the semantic meaning of X and Y?
  • Is this transformation deterministic?
  • Can it be replayed?
  • Who owns the output?
  • Does this stage preserve ordering?
  • Does this stage repartition?
  • Is enrichment local, remote, or table-backed?
  • What happens when downstream is slow?
  • Can we rebuild the output from retained inputs?
  • How do we observe correctness, not just throughput?

1. Kaufman Skill Decomposition

The skill is designing Kafka pipelines as explicit, operable topologies.

SubskillProduction Meaning
Topology modellingRepresent pipeline stages, topics, state, and dependencies clearly.
Contract boundary designDefine where semantic contracts start and end.
Transformation classificationDistinguish map/filter/enrich/join/aggregate/project/sink.
Ownership assignmentKnow which team owns each input, output, processor, and runbook.
Replay reasoningDecide what can be rebuilt and what cannot.
Scaling designAlign partitioning, consumer groups, state stores, and throughput goals.
Failure routingPlace retry, DLQ, quarantine, and compensation boundaries intentionally.
Observability designMeasure lag, freshness, correctness, error rate, and business completeness.
Evolution planningAdd, split, migrate, and deprecate stages safely.
GovernanceDocument lineage, schema, retention, SLO, and access rules.

1.1 Practice Goal

By the end of this part, you should be able to design a pipeline like:

quote.events.v1
  -> quote.normalized.v1
  -> quote.enriched.v1
  -> quote.approval-summary.v1
  -> search index / dashboard / fulfillment trigger

And explain:

  • why each topic exists;
  • who owns each contract;
  • what each stage guarantees;
  • what each stage can replay;
  • where ordering is preserved or lost;
  • where state lives;
  • what happens under partial failure;
  • how to deploy and monitor it.

2. Pipeline as a Graph, Not a Chain

A real pipeline is usually a graph.

This graph has:

  • source event topic;
  • normalization stage;
  • current-state enrichment topic;
  • enriched event topic;
  • multiple downstream projections;
  • side outputs;
  • external sink.

Each edge should have a contract. Each node should have an owner.

2.1 Why Graph Thinking Matters

Chain thinking hides important questions:

  • Can Risk Scoring be redeployed without affecting search?
  • Can Approval Projection replay independently?
  • Does Enrich with Customer depend on a remote API?
  • If customer data is late, does the pipeline block, emit partial event, or route to retry?
  • Does quote.enriched.v1 become a stable product or an internal implementation topic?
  • What happens if one branch is slow?

Kafka makes fan-out easy. Governance makes fan-out safe.


3. Topology Vocabulary

3.1 Source

A source is where data enters the topology.

Examples:

  • domain event topic;
  • outbox CDC topic;
  • raw CDC table topic;
  • external connector source;
  • command topic;
  • compacted reference-data topic;
  • replay/backfill topic.

3.2 Processor

A processor consumes records, performs a transformation, and may emit records.

Processor types:

Processor TypeExampleState?
MapConvert field namesUsually stateless
FilterDrop irrelevant eventsStateless
ValidateCheck contract/business invariantsStateless or stateful
NormalizeConvert raw source into canonical eventUsually stateless
EnrichAdd customer/product/policy dataOften stateful
JoinCombine streams/tablesStateful
AggregateCount/sum/windowStateful
ProjectBuild read modelStateful/external
RouteBranch to output topicsStateless or stateful
SinkWrite to external systemExternal state

3.3 Sink

A sink is an output target:

  • Kafka output topic;
  • database table;
  • search index;
  • cache;
  • object storage;
  • warehouse;
  • external API;
  • notification system;
  • workflow engine.

Kafka-to-Kafka sinks are easier to replay than external side-effecting sinks.

3.4 State Store

State may live in:

  • Kafka Streams local state store;
  • compacted Kafka topic;
  • external database;
  • Redis/cache;
  • warehouse table;
  • object storage;
  • in-memory ephemeral cache.

Every state store needs an answer for:

  • rebuild source;
  • backup/restore;
  • schema evolution;
  • consistency with offsets;
  • ownership;
  • retention.

4. The Source-Transform-Sink Skeleton

A basic pipeline skeleton:

This is the minimum. Production pipelines also need:

  • retry topics;
  • metrics;
  • schema validation;
  • idempotency;
  • offset strategy;
  • error classification;
  • deployment unit;
  • ownership metadata;
  • replay procedure;
  • capacity model.

4.1 Stage Contract Template

# Pipeline Stage Contract

## Stage
- Name:
- Owner:
- Runtime: Java consumer / Kafka Streams / ksqlDB / Kafka Connect / other

## Inputs
- Topic:
- Key:
- Schema:
- Retention:
- Ordering assumption:

## Outputs
- Topic/sink:
- Key:
- Schema:
- Semantics:

## Processing
- Transformation type:
- Deterministic:
- Stateful:
- External dependencies:
- Idempotency strategy:

## Failure Handling
- Retry:
- DLQ:
- Poison pill policy:
- Replay support:

## Operations
- SLO:
- Metrics:
- Alerts:
- Runbook:

5. Transformation Classes

5.1 Stateless Transform

Example: rename fields, normalize currency code, filter irrelevant event type.

public NormalizedQuoteEvent normalize(RawQuoteEvent raw) {
    return new NormalizedQuoteEvent(
            raw.quoteId(),
            raw.customerId(),
            raw.amount().setScale(2, RoundingMode.HALF_UP),
            raw.currency().toUpperCase(Locale.ROOT),
            raw.occurredAt()
    );
}

Stateless transforms are easiest to scale and replay.

5.2 Validation Stage

Validation has two categories:

Validation TypeExampleFailure Destination
TechnicalInvalid schema, missing required fieldDLQ/quarantine
BusinessNegative price, illegal status transitionDomain rejection event or DLQ depending on meaning

Do not send every validation failure to a generic DLQ. Some are business facts and deserve explicit rejection events.

5.3 Enrichment Stage

Enrichment adds context from another source.

Examples:

  • add customer segment to quote;
  • add product category to order;
  • add jurisdiction policy to enforcement case;
  • add FX rate to invoice;
  • add risk score to approval request.

Enrichment can be implemented through:

  • local table loaded from compacted Kafka topic;
  • Kafka Streams KTable/GlobalKTable;
  • cache/database lookup;
  • remote API call;
  • pre-join upstream topic.

5.4 Aggregation Stage

Aggregation computes derived state:

  • count orders per customer;
  • total quote value per account;
  • number of failed payments in 10 minutes;
  • session duration;
  • rolling fraud score;
  • SLA breach count.

Aggregations introduce state and time semantics. They require stricter observability.

5.5 Projection Stage

Projection writes a read model.

Projection examples:

  • quote summary table;
  • case lifecycle view;
  • customer timeline;
  • materialized approval queue;
  • OpenSearch index;
  • Redis lookup cache.

Projection stages must be idempotent and usually need their own applied-event tracking.


6. Fan-Out Pattern

Fan-out lets multiple consumers process the same source independently.

Each consumer group gets its own offset. This is Kafka's strength.

6.1 Fan-Out Benefits

  • independent scaling;
  • independent deployment;
  • independent replay;
  • loose runtime coupling;
  • multiple data products from the same facts.

6.2 Fan-Out Risks

  • too many consumers relying on unstable event internals;
  • topic schema evolution becomes politically hard;
  • producer becomes blocked by fear of breaking consumers;
  • consumers duplicate transformation logic;
  • hidden business workflows emerge through uncoordinated subscriptions.

6.3 Fan-Out Rule

Fan-out is safe when the source topic is a stable product contract, not an implementation detail.

If the topic is internal, keep fan-out limited or publish a curated downstream product topic.


7. Branching and Routing

Branching splits records by predicate.

Branching is useful when downstream ownership differs by category.

7.1 Branching Design Questions

  • Is the branch condition stable?
  • Does each branch need a different schema?
  • Are branches domain facts or processing routes?
  • Can a record belong to multiple branches?
  • What happens when a new branch is added?
  • Are old consumers affected?

7.2 Processing Route vs Domain Event

A topic named payment.retryable-failure.v1 may be an internal processing route.

A topic named payment.failed.v1 may be a domain event.

Do not confuse them. Processing topics can change with implementation. Domain event topics are contracts.


8. Fan-In and Merge

Fan-in merges multiple sources into one output.

8.1 Fan-In Risks

  • incompatible semantics;
  • source-specific fields leak into common schema;
  • ordering across sources is not guaranteed;
  • duplicate identities;
  • mismatched timestamps;
  • inconsistent retry semantics;
  • one noisy source harms shared output.

8.2 Canonicalization Rule

A merged output topic should not be a bag of source-specific fields. It should represent a stable canonical contract.

Bad:

{
  "cardAuthCode": "...",
  "bankRef": "...",
  "walletProvider": "..."
}

Better:

{
  "paymentId": "PAY-1001",
  "methodType": "CARD",
  "amount": "100.00",
  "currency": "USD",
  "status": "AUTHORIZED",
  "authorizedAt": "2026-07-01T10:00:00Z",
  "source": {
    "system": "card-gateway",
    "reference": "AUTH-991"
  }
}

9. Enrichment Topologies

9.1 Remote Lookup Enrichment

This is simple but dangerous at scale:

  • remote API latency slows processing;
  • API outage blocks pipeline;
  • retry can amplify load;
  • ordering can be affected;
  • replay can overload the dependency.

Use remote lookup only when throughput is low, dependency is reliable, or cache/fallback exists.

9.2 Local Table Enrichment

This is usually better for high-throughput stream processing.

Reference data arrives through Kafka and is materialized locally. The event processing path avoids synchronous network calls.

9.3 Global Table Enrichment

For small reference datasets, a global table can replicate all reference data to each processing instance. This avoids co-partitioning requirements but increases memory/disk footprint.

9.4 Enrichment Staleness

Enrichment always has a time question:

Should the event be enriched with reference data as of event time, processing time, or latest known value?

Examples:

Use CaseCorrect Enrichment Time
Audit decisionReference data as of decision time.
Dashboard displayLatest known reference data may be acceptable.
Billing calculationEffective-time policy required.
Search indexingLatest known value often acceptable.
Regulatory explanationHistorical/effective version required.

Do not hide this decision inside code.


10. Join Topology

Joins combine streams and/or tables.

10.1 Join Questions

  • Are the inputs keyed the same way?
  • Is repartitioning needed?
  • Is one side a stream or table?
  • Are late records expected?
  • What is the join window?
  • What happens when reference data is missing?
  • Is output deterministic under replay?
  • What is the state store retention?

10.2 Missing Reference Data Policy

When enrichment data is missing, options include:

PolicyBehaviorUse When
DropIgnore recordNon-critical telemetry.
DLQQuarantine recordMissing data indicates bug.
RetryDelay and try againReference may arrive soon.
Emit partialContinue with missing fieldsDownstream tolerates partial data.
Emit pending stateModel incomplete data explicitlyBusiness flow has pending semantics.
Use defaultFill fallback valueDefault is legally/business valid.

A hidden default is often worse than a visible pending state.


11. Repartitioning

Repartitioning changes the key distribution so downstream operations can group, join, or aggregate correctly.

11.1 Why Repartitioning Happens

You may need repartitioning when:

  • grouping by a different key;
  • joining two streams/tables by a different key;
  • changing ownership boundary;
  • building customer-level aggregate from quote-level events;
  • correcting a poor upstream key choice.

11.2 Repartition Cost

Repartitioning is a full Kafka write/read shuffle:

  • more network IO;
  • more broker storage;
  • more latency;
  • more internal topics;
  • more state restore cost;
  • more operational complexity.

Do not fear repartitioning, but make it visible in design.

11.3 Repartition Design Rule

Key topics by the dominant ordering and processing boundary, not by arbitrary entity id.

A wrong key early in the pipeline can force expensive downstream shuffles.


12. Topic Boundary Design

A topic boundary should exist when it creates one of these values:

  • stable contract for another team;
  • replay checkpoint;
  • fan-out point;
  • ownership boundary;
  • failure isolation boundary;
  • expensive transformation reuse;
  • audit boundary;
  • materialized data product.

A topic should not exist only because “every method returns a topic.”

12.1 Too Few Topics

Symptoms:

  • generic mega-topic;
  • consumers filter most records;
  • schema chaos;
  • ACLs too broad;
  • unrelated retention needs conflict;
  • impossible ownership.

12.2 Too Many Topics

Symptoms:

  • topic-per-step with no contract value;
  • operational noise;
  • unclear lineage;
  • many internal schemas;
  • too much broker metadata;
  • hard cleanup.

12.3 Boundary Heuristic

Create a topic when you need a durable, observable, independently consumable boundary.

Avoid creating a topic for purely local computation inside one application unless it helps state recovery or operational isolation.


13. Internal vs Product Topics

13.1 Internal Topic

An internal topic is an implementation detail.

Examples:

  • retry topic;
  • repartition topic;
  • changelog topic;
  • intermediate normalization topic used only by one owning service;
  • DLQ/quarantine topic.

Internal topics can change faster but need clear naming and restricted access.

13.2 Product Topic

A product topic is a stable data/event contract.

Examples:

  • quote.events.v1;
  • customer.profile.current.v1;
  • case.lifecycle.v1;
  • payment.authorized.v1.

Product topics need stronger governance:

  • schema compatibility;
  • documentation;
  • owners;
  • retention;
  • SLO;
  • consumer communication;
  • deprecation policy.

13.3 Naming Convention Example

<domain>.<entity-or-capability>.<semantic-type>.v<major>

Examples:

quote.lifecycle.events.v1
quote.current.state.v1
quote.approval.summary.v1
quote.approval.retry.v1
quote.approval.dlq.v1

Do not let naming become religious. The point is predictability and meaning.


14. Pipeline Ownership Model

Every pipeline needs owners at three levels:

Ownership LevelResponsibility
Source ownerMaintains source schema, quality, retention, and meaning.
Processor ownerMaintains transformation code, deployment, retry, DLQ, and SLO.
Output ownerMaintains output contract, consumers, and deprecation.

In small teams one group may own all three. In large organizations, they often differ.

14.1 Ownership Metadata

For each topic, document:

owner: quote-platform-team
slack: '#quote-platform'
pager: quote-platform-oncall
schemaOwner: quote-domain-team
dataClassification: confidential
retention: 30d
cleanupPolicy: delete
consumerSlo: p95 freshness < 60s
replaySupported: true

This metadata matters during incidents.


15. Runtime Choice: Consumer, Streams, ksqlDB, Connect

15.1 Plain Java Consumer

Use when:

  • custom side effects;
  • complex external transactions;
  • imperative workflow;
  • simple projection;
  • low-level control over offset commit;
  • integration with existing service logic.

Risk:

  • you must implement state, retry, idempotency, and offset discipline yourself.

15.2 Kafka Streams

Use when:

  • Kafka-to-Kafka stream processing;
  • stateful joins/aggregations;
  • local state stores;
  • exactly-once Kafka read-process-write requirement;
  • topology testability;
  • Java-native processing.

Risk:

  • state stores, repartition topics, task assignment, and restore behavior require operational understanding.

15.3 ksqlDB

Use when:

  • SQL is sufficient;
  • rapid materialized views;
  • common filters/joins/aggregations;
  • platform wants declarative stream processing;
  • transformation ownership can be expressed as SQL artifacts.

Risk:

  • complex domain logic, custom error handling, and strict lifecycle control may be harder than in Java.

15.4 Kafka Connect

Use when:

  • source/sink integration is standard;
  • connector exists and is mature;
  • data movement is more important than custom domain logic;
  • configuration-driven pipeline is desired.

Risk:

  • connector-specific semantics, DLQ behavior, offset management, and schema conversion must be understood.

15.5 Decision Table

NeedPreferred Tool
DB -> Kafka CDCKafka Connect + Debezium
Kafka -> ElasticsearchKafka Connect sink if connector meets semantics
Kafka -> Kafka join/aggregateKafka Streams or ksqlDB
Complex Java domain transitionPlain Java consumer or service
Declarative materialized viewksqlDB
External API side effectPlain Java consumer with idempotency
Stateful stream processing with testsKafka Streams

16. Failure Handling Topology

A production topology includes failure routes.

16.1 Failure Classification

Failure TypeExampleRoute
DeserializationInvalid payloadDLQ/quarantine
Schema compatibilityMissing required fieldDLQ or reject event
Transient dependencyDB timeoutRetry
Permanent dependencyUnknown customer idPending/reject/DLQ depending on domain
Business invalidIllegal transitionBusiness event, not technical DLQ
BugNull pointerStop, alert, fix, replay

16.2 Retry Boundary

Place retry where it preserves correctness.

Bad retry:

  • blocks all records behind one poison pill;
  • breaks ordering without documentation;
  • retries forever;
  • hides failure from operators;
  • overloads downstream dependency.

Good retry:

  • bounded;
  • observable;
  • classified;
  • idempotent;
  • has DLQ escape;
  • has replay tool.

17. Backpressure and Flow Control

Kafka decouples producer and consumer time, but it does not remove capacity problems. It stores them as lag.

17.1 Backpressure Signals

  • consumer lag increasing;
  • oldest unprocessed event age increasing;
  • retry topic growth;
  • DLQ growth;
  • processor latency rising;
  • state store restore time rising;
  • external sink queue increasing;
  • broker disk/network pressure.

17.2 Flow-Control Options

OptionUse WhenRisk
Pause consumer partitionsDownstream temporary pressureMust resume correctly.
Reduce poll batch sizePer-record work is heavyLower throughput.
Increase consumersPartition count allows scalingMore concurrency complexity.
Buffer internallySmall burstsMemory pressure.
Retry with delay topicTransient failuresOrdering changes.
Shed low-priority recordsNon-critical telemetryData loss if misclassified.
Scale downstream sinkSink is bottleneckCost and dependency constraints.

17.3 Freshness SLO

For business pipelines, lag in records is not enough. Use freshness:

freshness = now - event.occurredAt or now - event.recordedAt

A topic with 10 records of lag can be worse than 10,000 records if the 10 records are old and business-critical.


18. Observability and Lineage

18.1 Topology-Level Dashboard

Track per stage:

  • input rate;
  • output rate;
  • processing latency;
  • consumer lag;
  • event freshness;
  • error rate;
  • retry count;
  • DLQ count;
  • rebalance count;
  • state restore time;
  • output completeness.

18.2 Business Completeness Metrics

Technical metrics are insufficient.

Examples:

  • number of approved quotes vs number of fulfillment triggers;
  • number of created orders vs indexed orders;
  • number of payment captures vs settlement records;
  • number of escalated cases vs audit entries.

Completeness checks find semantic pipeline bugs that lag metrics miss.

18.3 Lineage Metadata

Every output should be traceable to inputs:

{
  "eventId": "out-123",
  "sourceEvents": [
    {
      "topic": "quote.events.v1",
      "partition": 3,
      "offset": 91811,
      "eventId": "quote-event-991"
    },
    {
      "topic": "customer.current.v1",
      "partition": 1,
      "offset": 11102,
      "eventId": "customer-snapshot-441"
    }
  ],
  "processor": "quote-enricher",
  "processorVersion": "2026.07.01"
}

This is especially valuable for audit, debugging, and reproducibility.


19. Evolution Patterns

19.1 Add New Consumer

Safest evolution:

  • no source change;
  • new consumer group;
  • independent deployment;
  • replay from earliest or latest depending on use case.

19.2 Add New Output Topic

Useful when creating a new data product.

Need:

  • schema;
  • owner;
  • retention;
  • SLO;
  • backfill plan;
  • deprecation plan if replacing old output.

19.3 Split Pipeline Stage

Before:

input -> normalize+enrich+project -> output

After:

input -> normalize -> normalized topic -> enrich -> enriched topic -> project

Split when:

  • transformation is reused;
  • failure domains differ;
  • ownership differs;
  • stage has separate scaling needs;
  • observability needs better isolation.

Do not split merely for aesthetics.

19.4 Merge Pipeline Stages

Merge when:

  • intermediate topic has no independent value;
  • latency matters;
  • operational overhead is high;
  • ownership is same;
  • replay boundary is not needed.

19.5 Version Output Contract

When breaking semantic changes are needed:

quote.enriched.v1
quote.enriched.v2

Run both during migration. Do not surprise existing consumers.


20. Deployment Unit Design

A pipeline stage may be deployed as:

  • part of an existing domain service;
  • separate stream processor app;
  • Kafka Streams application;
  • ksqlDB persistent query;
  • Kafka Connect connector;
  • scheduled/batch backfill job.

20.1 Deployment Decision Questions

  • Does it need independent scaling?
  • Does it need independent release cadence?
  • Does it own state stores?
  • Does it have different operational SLO?
  • Does it call external dependencies?
  • Does it need domain model access?
  • Does it belong to platform or product team?

20.2 Microservice Boundary Warning

Do not create a new deployable service for every topic transformation. Operational overhead is real:

  • CI/CD;
  • monitoring;
  • on-call;
  • configs;
  • secrets;
  • scaling;
  • logs;
  • versioning;
  • incident response.

A topology should be modular in design, not necessarily fragmented into dozens of deployments.


21. Example: Quote Approval Pipeline

21.1 Requirements

  • Quote service publishes lifecycle events reliably.
  • Fulfillment needs approved quotes.
  • Search needs enriched quote data.
  • Risk team needs approval metrics.
  • Compliance needs audit trail.
  • Replay must be possible for search and risk.
  • Email notification must not be blindly replayed.

21.2 Proposed Topology

21.3 Design Notes

  • quote.events.v1 is the domain event source.
  • quote.normalized.v1 is an internal or semi-product boundary depending on reuse.
  • quote.enriched.v1 is a product topic if multiple teams consume it.
  • Search and risk are replayable.
  • Notification is isolated behind notification.commands.v1 to avoid email side effects during projection replay.
  • Audit writer consumes original domain events to avoid losing semantics through enrichment transformation.

21.4 Failure Handling

StageFailureHandling
NormalizerInvalid eventDLQ with schema/error metadata.
EnricherMissing customerPending/retry depending on business rule.
Search projectorOpenSearch downRetry with backoff; idempotent upsert.
Risk aggregatorState store restoreLet Streams restore; monitor lag.
Notification producerDuplicate eventCommand idempotency key = source event id.
Audit writerDB downRetry; high-priority alert.

22. Testing Pipeline Topologies

22.1 Topology Unit Tests

For Kafka Streams, use topology tests to verify:

  • input record;
  • transformation;
  • output topic;
  • key;
  • timestamp;
  • headers;
  • state store update.

22.2 Contract Tests

For every output topic:

  • schema compatibility;
  • key format;
  • required metadata;
  • semantic examples;
  • null/tombstone behavior;
  • error envelope.

22.3 Integration Tests

Run the actual pipeline with Kafka and dependencies:

  • source topic input;
  • processor app;
  • output topic assertion;
  • retry topic assertion;
  • DLQ assertion;
  • state store restoration;
  • restart behavior.

22.4 Replay Tests

Replay the same input twice and assert deterministic output where expected.

@Test
void quoteSummaryProjectionIsDeterministicUnderReplay() {
    publishHistoricalQuoteEvents();
    runProjectorUntilCaughtUp();
    QuoteSummary first = repository.find("Q-1001");

    resetProjectionAndOffsets();
    publishHistoricalQuoteEvents();
    runProjectorUntilCaughtUp();
    QuoteSummary second = repository.find("Q-1001");

    assertThat(second).isEqualTo(first);
}

22.5 Chaos Tests

Inject:

  • processor crash after output write before offset commit;
  • downstream sink timeout;
  • invalid input record;
  • reference data delay;
  • rebalance during processing;
  • state store restoration;
  • duplicate input event;
  • out-of-order event.

A topology that only works in a happy-path local demo is not production-ready.


23. Architecture Review Checklist

23.1 Topology Clarity

  • Topology graph exists.
  • All input and output topics are listed.
  • Internal vs product topics are marked.
  • State stores are identified.
  • External dependencies are identified.
  • Ownership is documented.

23.2 Contract Safety

  • Input schemas are versioned.
  • Output schemas are versioned.
  • Key semantics are documented.
  • Tombstone behavior is documented.
  • Retention is documented.
  • Compatibility mode is defined.

23.3 Correctness

  • Ordering assumptions are explicit.
  • Repartitioning is visible.
  • Idempotency exists for side effects.
  • Offset commit strategy is safe.
  • Replay behavior is defined.
  • Duplicate handling is tested.

23.4 Failure Handling

  • Retry policy exists.
  • DLQ/quarantine policy exists.
  • Poison pill procedure exists.
  • Missing reference data policy exists.
  • Downstream outage behavior exists.
  • Manual repair runbook exists.

23.5 Operations

  • Lag metrics exist.
  • Freshness metrics exist.
  • Error metrics exist.
  • Business completeness metrics exist.
  • Dashboards exist.
  • Alerts have owners.
  • Capacity model exists.

24. Anti-Patterns

24.1 Topic-per-Function

Creating a topic after every tiny code function creates operational noise without meaningful boundaries.

24.2 Mega-Topic

One topic for every event in the company creates schema chaos, broad ACLs, and inefficient consumers.

24.3 Hidden Remote API in Stream Processor

A high-throughput Kafka processor that calls a slow synchronous API can turn a streaming system into a distributed timeout generator.

24.4 No Ownership for Intermediate Topics

An intermediate topic consumed by another team has become a product topic whether you admit it or not.

24.5 Replay Triggers Side Effects

If replay sends emails, charges cards, or opens cases, your pipeline is unsafe.

24.6 Lag-Only Observability

Lag can be zero while output is wrong. Add business completeness metrics.

24.7 Repartitioning by Accident

Kafka Streams and ksqlDB can create repartition topics when key requirements change. Understand and monitor them.

24.8 Transformation Without Semantic Contract

If no one can explain what an output topic means, consumers will infer meaning and create accidental contracts.


25. Pipeline ADR Template

# ADR: Pipeline Topology for <Pipeline Name>

## Context
- Business capability:
- Source topic/table/API:
- Consumers:
- Latency/freshness requirement:
- Audit/regulatory requirement:

## Topology
```mermaid
flowchart LR
    A[input] --> B[processor]
    B --> C[output]

Stages

StageOwnerRuntimeInputOutputStateReplay

Topic Contracts

TopicTypeOwnerKeySchemaRetentionProduct/Internal

Correctness

  • Ordering assumptions:
  • Repartitioning:
  • Idempotency:
  • Offset strategy:
  • Time semantics:

Failure Handling

  • Retry:
  • DLQ:
  • Missing reference data:
  • Downstream outage:
  • Replay:

Observability

  • Technical metrics:
  • Business metrics:
  • Alerts:
  • Dashboards:

Evolution

  • Versioning plan:
  • Backfill plan:
  • Deprecation plan:

Consequences

  • Benefits:
  • Costs:
  • Open risks:
--- ## 26. Deliberate Practice ### Exercise 1 — Draw a Real Topology Pick one business flow: - quote approval; - order fulfillment; - payment settlement; - enforcement escalation; - fraud scoring. Draw the topology with: - source topics; - processors; - output topics; - state stores; - retry/DLQ; - external sinks. ### Exercise 2 — Classify Topics For each topic in your topology, classify it as: - domain event; - current-state topic; - command topic; - internal retry; - DLQ; - repartition/changelog; - product projection; - external integration topic. ### Exercise 3 — Identify Repartition Points Find every stage where key changes. For each, explain: - old key; - new key; - why key changes; - cost; - ordering impact; - alternative design. ### Exercise 4 — Design Failure Routes For each stage, define: - retryable errors; - terminal errors; - business invalid cases; - DLQ payload; - replay path. ### Exercise 5 — Business Completeness Metric Design one metric that proves semantic correctness. Examples: ```text approved_quotes_total == fulfillment_triggers_total + intentionally_skipped_quotes_total

or:

case_escalated_events_total == audit_escalation_rows_total

27. Mental Model Summary

  • A Kafka pipeline is a graph of contracts, state, ownership, and failure paths.
  • Topic boundaries should represent durable, observable, independently consumable value.
  • Internal topics and product topics require different governance.
  • Enrichment design is mostly about dependency and time semantics.
  • Repartitioning is a necessary but costly shuffle; make it explicit.
  • Replay is a design requirement, not an emergency trick.
  • Observability must include business completeness, not only lag.
  • Pipeline topology should be reviewed like application architecture, not treated as glue code.

28. References

Lesson Recap

You just completed lesson 16 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.