Multi-Service Transaction Boundaries
Learn Java Kafka in Action - Part 027
Multi-service transaction boundaries for Java Kafka systems: local database transactions, Kafka transactions, transactional outbox, CDC, inbox pattern, saga, orchestration, compensation, auditability, and regulatory-grade failure modeling.
Part 026 covered ordering, consistency, and idempotency.
Now we address one of the hardest parts of Kafka architecture:
How do we keep a multi-service system correct when one business action touches a local database, Kafka topics, downstream services, external APIs, and sometimes a workflow engine?
The short answer:
Kafka gives strong primitives for durable event logs, consumer offsets, idempotent producers, and Kafka-native transactions. It does not magically make your database, payment provider, notification system, search index, and case-management workflow part of one global ACID transaction.
This part is about transaction boundary design.
A top-level Kafka engineer should be able to answer these questions clearly:
- What is the atomic boundary of this operation?
- Which side effects can happen twice?
- Which side effects can be delayed?
- Which side effects require compensation?
- Which state is authoritative?
- Which event is audit-grade evidence?
- How do we recover after crash, replay, and partial commit?
1. Kaufman Skill Decomposition
The target skill is designing event-driven workflows that remain correct without pretending distributed ACID exists everywhere.
| Subskill | Production Meaning |
|---|---|
| Transaction boundary modeling | Know exactly what commits atomically and what does not. |
| Dual-write diagnosis | Detect unsafe DB + Kafka write patterns. |
| Outbox design | Publish reliable events after local DB commit. |
| CDC reasoning | Understand how database log capture changes reliability and latency. |
| Kafka transaction usage | Use transactions for Kafka read-process-write boundaries only where appropriate. |
| Inbox/idempotency | Protect consumers from duplicates and replay. |
| Saga design | Coordinate long-running multi-service workflows through events/commands. |
| Compensation | Undo or neutralize business effects when rollback is impossible. |
| Auditability | Preserve enough evidence to explain every transition. |
| Failure matrix design | Enumerate crash points and recovery behavior before implementation. |
1.1 Practice Goal
By the end of this part, you should be able to:
- Explain why dual-write is unsafe.
- Choose between direct publish, Kafka transaction, outbox, CDC, and saga.
- Design an outbox table and publisher that is replay-safe.
- Use Kafka transactions only inside the correct boundary.
- Build an inbox/idempotent consumer for downstream services.
- Model a multi-service order workflow as a saga.
- Write an architecture decision record for transaction strategy.
2. The Central Boundary Problem
Consider this Java service:
public Quote approveQuote(ApproveQuoteCommand command) {
    Quote quote = quoteRepository.findById(command.quoteId());
    quote.approve(command.approverId());
    quoteRepository.save(quote);  // write 1: local database
    kafkaTemplate.send("quote.approved.v1", quote.id(), QuoteApproved.from(quote));  // write 2: Kafka
    return quote;
}
It looks simple. It is also dangerous.
There are two separate systems:
- The local database.
- Kafka.
They do not commit atomically in this code.
2.1 Failure Matrix
| Step | DB State | Kafka State | Failure Effect |
|---|---|---|---|
| Before DB commit | unchanged | no event | Safe retry. |
| After DB commit, before Kafka send | approved | no event | Invisible state change to downstream services. |
| After Kafka send, before response | approved | event published | Client may retry and produce duplicate intent. |
| Kafka send succeeds but acknowledgement lost | approved | event may exist | Producer may retry. |
| DB rollback after Kafka send | not approved | event exists | Downstream observes false fact. |
This is the dual-write problem.
Dual-write means one business operation writes to two durable systems without one atomic commit boundary covering both writes.
Kafka does not remove this problem. It gives patterns to design around it.
3. Transaction Boundary Vocabulary
Before choosing a pattern, define the boundaries.
| Boundary | Atomic? | Example |
|---|---|---|
| Local DB transaction | Yes, within one database | Update order + insert outbox row. |
| Kafka producer transaction | Yes, within Kafka topics/partitions and consumed offsets | Consume input, produce output, commit offsets atomically. |
| DB + Kafka transaction | Usually no | Updating PostgreSQL and publishing Kafka event. |
| Kafka + external REST API | No | Consume event then call payment gateway. |
| Multi-service business workflow | No global atomic commit | Quote → order → payment → fulfillment. |
| Regulatory case lifecycle | Usually long-running, stateful, auditable | Case opened → evidence requested → escalation → enforcement action. |
A strong design does not hide these boundaries. It makes them explicit.
4. The Wrong Pattern: Local DB Then Kafka Send
The dangerous window is between DB commit and Kafka publish.
If the service crashes in that window, the quote is approved but no event exists.
4.1 When Is Direct Publish Acceptable?
Direct publish can be acceptable only when the event is not part of the correctness model.
Examples:
| Use Case | Direct Publish Acceptable? | Reason |
|---|---|---|
| Debug telemetry | Yes | Loss is tolerable. |
| Cache warm notification | Maybe | Cache can be rebuilt. |
| Audit event | No | Loss breaks evidence. |
| Billing event | No | Loss causes financial inconsistency. |
| Workflow transition | No | Downstream lifecycle stalls. |
| Regulatory enforcement state | No | Loss breaks defensibility. |
A useful rule:
If a downstream service must observe the event for the business process to remain correct, do not use naive dual-write.
5. Pattern 1 — Transactional Outbox
Transactional outbox solves the DB + Kafka dual-write problem by writing the business state and an event record in the same local database transaction.
The event is later published to Kafka by a relay process.
The local DB commit becomes the source of truth for both:
- Entity state.
- Intent to publish event.
5.1 Outbox Table Design
CREATE TABLE outbox_event (
    id UUID PRIMARY KEY,
    aggregate_type VARCHAR(100) NOT NULL,
    aggregate_id VARCHAR(100) NOT NULL,
    aggregate_version BIGINT NOT NULL,
    event_type VARCHAR(200) NOT NULL,
    event_version INTEGER NOT NULL,
    topic VARCHAR(200) NOT NULL,
    partition_key VARCHAR(300) NOT NULL,
    payload JSONB NOT NULL,
    headers JSONB NOT NULL DEFAULT '{}',
    status VARCHAR(30) NOT NULL DEFAULT 'PENDING',
    attempt_count INTEGER NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at TIMESTAMPTZ,
    last_error TEXT
);

CREATE INDEX idx_outbox_pending
    ON outbox_event (status, next_attempt_at, created_at)
    WHERE status IN ('PENDING', 'FAILED_RETRYABLE');

CREATE UNIQUE INDEX uq_outbox_aggregate_event
    ON outbox_event (aggregate_type, aggregate_id, aggregate_version, event_type);
Important fields:
| Field | Purpose |
|---|---|
| id | Stable event ID for deduplication. |
| aggregate_id | Business entity boundary. |
| aggregate_version | Monotonic transition guard. |
| event_type | Semantic event name. |
| topic | Destination topic. |
| partition_key | Kafka key chosen intentionally. |
| payload | Event body. |
| headers | Correlation, causation, tenant, trace, schema metadata. |
| status | Relay lifecycle. |
| attempt_count | Retry control. |
| last_error | Operational debugging. |
5.2 Application Transaction Example
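public final class QuoteApprovalService {
    private final QuoteMapper quoteMapper;
    private final OutboxEventMapper outboxMapper;
    private final TransactionManager tx;

    public QuoteApprovalService(QuoteMapper quoteMapper,
                                OutboxEventMapper outboxMapper,
                                TransactionManager tx) {
        this.quoteMapper = quoteMapper;
        this.outboxMapper = outboxMapper;
        this.tx = tx;
    }

    public Quote approve(ApproveQuoteCommand command) {
        // Everything inside tx.required commits or rolls back as one local DB transaction.
        return tx.required(() -> {
            Quote quote = quoteMapper.findForUpdate(command.quoteId())
                .orElseThrow(() -> new NotFoundException("quote not found"));
            quote.approve(command.approverId());
            quoteMapper.updateStatusAndVersion(quote);

            QuoteApproved event = QuoteApproved.from(quote, command.correlationId());
            outboxMapper.insert(new OutboxEventRow(
                event.eventId(),            // id
                "Quote",                    // aggregate_type
                quote.id().toString(),      // aggregate_id
                quote.version(),            // aggregate_version
                "QuoteApproved",            // event_type
                1,                          // event_version
                "quote.approved.v1",        // topic
                quote.id().toString(),      // partition_key
                Json.write(event),          // payload
                Json.write(EventHeaders.from(command))  // headers
            ));
            return quote;
        });
    }
}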
The key invariant:
If the quote is approved, the outbox row exists. If the transaction rolls back, neither the quote transition nor the outbox event exists.
5.3 Relay Publisher Example
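import static java.nio.charset.StandardCharsets.UTF_8;

import java.util.List;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class OutboxRelay implements Runnable {
    private final OutboxEventMapper outboxMapper;
    private final KafkaProducer<String, byte[]> producer;
    private final TransactionManager tx;

    public OutboxRelay(OutboxEventMapper outboxMapper,
                       KafkaProducer<String, byte[]> producer,
                       TransactionManager tx) {
        this.outboxMapper = outboxMapper;
        this.producer = producer;
        this.tx = tx;
    }

    @Override
    public void run() {
        List<OutboxEventRow> batch = outboxMapper.claimBatch(100);
        for (OutboxEventRow row : batch) {
            try {
                ProducerRecord<String, byte[]> record = new ProducerRecord<>(
                    row.topic(),
                    row.partitionKey(),
                    row.payloadBytes()
                );
                record.headers().add("event-id", row.id().toString().getBytes(UTF_8));
                record.headers().add("event-type", row.eventType().getBytes(UTF_8));
                record.headers().add("aggregate-id", row.aggregateId().getBytes(UTF_8));

                // Block until the broker acknowledges, so the row is marked only after a confirmed write.
                producer.send(record).get();

                tx.required(() -> {
                    outboxMapper.markPublished(row.id());
                    return null;
                });
            } catch (Exception ex) {
                tx.required(() -> {
                    outboxMapper.markRetryableFailure(row.id(), ex.getMessage());
                    return null;
                });
            }
        }
    }
}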
This simple relay is understandable but has one subtle issue:
Publishing to Kafka and marking the row as published are still two separate systems.
If the relay publishes successfully and crashes before markPublished, it may publish the same event again later.
Therefore downstream consumers must still be idempotent.
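The crash window above can be illustrated with a minimal in-memory sketch. It is not a real relay: `FakeTopic` and `DedupConsumer` are hypothetical stand-ins for a Kafka topic and an inbox-backed consumer, used only to show that a republished event ID produces one business effect when the consumer deduplicates.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical in-memory stand-ins; a real system would use Kafka and an inbox table.
final class OutboxDuplicateDemo {

    // Stand-in for a Kafka topic: an append-only list of event IDs.
    static final class FakeTopic {
        final List<UUID> log = new ArrayList<>();
        void publish(UUID eventId) { log.add(eventId); }
    }

    // Stand-in for an idempotent consumer: applies each event ID at most once.
    static final class DedupConsumer {
        private final Set<UUID> seen = new HashSet<>();
        int applied = 0;

        void handle(UUID eventId) {
            if (seen.add(eventId)) {  // add() returns false for an ID already recorded
                applied++;
            }
        }
    }

    // Simulates: publish, crash before markPublished, restart, publish again.
    static int run() {
        FakeTopic topic = new FakeTopic();
        DedupConsumer consumer = new DedupConsumer();
        UUID eventId = UUID.randomUUID();

        topic.publish(eventId);  // first attempt reaches the topic
        // relay crashes here: the outbox row is still PENDING
        topic.publish(eventId);  // restarted relay publishes the same row again

        topic.log.forEach(consumer::handle);
        return consumer.applied;  // number of business effects applied
    }

    public static void main(String[] args) {
        System.out.println("published=2 applied=" + run());
    }
}
```

The topic holds two copies of the event, but the consumer applies it once because the stable event ID is the deduplication key.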
5.4 Outbox Guarantee
Transactional outbox gives:
| Property | Guarantee |
|---|---|
| DB state + event intent atomicity | Yes |
| Event eventually published | Yes, if relay operates correctly |
| Kafka event exactly once globally | No |
| Duplicate event possible | Yes |
| Consumer idempotency still required | Yes |
| Auditability | Strong if event row is retained |
Outbox changes the problem from:
“Event may be lost.”
into:
“Event may be duplicated, but not silently lost.”
That is usually the better failure mode.
6. Pattern 2 — CDC Outbox
Instead of polling the outbox table directly, a CDC connector can read the database transaction log and publish changes to Kafka.
This pattern is common with Debezium.
6.1 Why CDC Helps
| Concern | Polling Relay | CDC Outbox |
|---|---|---|
| Extra DB polling load | Yes | Lower, reads log |
| Event order relative to DB commit | Approximate by polling | Based on log order |
| Operational dependency | Application-owned relay | Kafka Connect + connector |
| Transformation | Application code | Connector SMT / routing |
| Failure surface | App scheduler + DB + Kafka | Connect cluster + DB log + Kafka |
CDC outbox is powerful, but it moves operational complexity to Kafka Connect and the database log configuration.
6.2 CDC Outbox Invariants
- The outbox row is inserted in the same transaction as business state.
- The CDC connector reads only committed changes.
- The connector publishes to Kafka with stable key and payload routing.
- The consumer remains idempotent.
- Connector lag is monitored as part of business process latency.
6.3 When CDC Outbox Is Better
Use CDC outbox when:
- You already operate Kafka Connect reliably.
- Multiple services need consistent outbox publishing.
- Polling overhead is unacceptable.
- You want database commit order to be reflected more naturally.
- You need operational uniformity across many services.
Avoid CDC outbox when:
- The team cannot operate connectors, offsets, and schema routing.
- The database log retention is too short for outage recovery.
- The application needs complex custom publish logic that does not belong in connector transforms.
- Security policy does not allow CDC access to the database log.
7. Pattern 3 — Kafka Transactions
Kafka transactions are useful when the atomic boundary is inside Kafka.
Typical read-process-write flow:
- Consume from input topic.
- Process records.
- Produce to output topic.
- Commit consumed offsets.
- Make output records and offset commits atomic.
7.1 Java Transaction Skeleton
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "risk-score-worker-7");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
producer.initTransactions();

while (running) {
    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) {
        continue;
    }
    try {
        producer.beginTransaction();
        for (ConsumerRecord<String, byte[]> record : records) {
            ProducerRecord<String, byte[]> output = transform(record);
            producer.send(output);
        }
        Map<TopicPartition, OffsetAndMetadata> offsets = currentOffsets(records);
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException fatal) {
        // Fatal for this producer instance: close it and let the worker restart.
        producer.close();
        throw fatal;
    } catch (Exception retryable) {
        producer.abortTransaction();
        // The consumer position has already moved past this batch. Rewind to the last
        // committed offsets so the aborted records are polled again.
        // resetToLastCommittedPositions is a helper to implement, like transform and currentOffsets.
        resetToLastCommittedPositions(consumer);
    }
}
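The skeleton calls a `currentOffsets(records)` helper that it does not define. One detail is easy to get wrong: the offset committed for a partition is the offset of the next record to read, which is the highest processed offset plus one. The following is a minimal sketch of that rule using plain Java types. `Rec` and `Tp` are hypothetical stand-ins for Kafka's `ConsumerRecord` and `TopicPartition`, and the topic name is made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-ins for ConsumerRecord and TopicPartition (assumptions for illustration).
final class OffsetSketch {

    record Rec(String topic, int partition, long offset) {}
    record Tp(String topic, int partition) {}

    // For each partition, the offset to commit is the highest processed offset + 1.
    static Map<Tp, Long> nextOffsets(List<Rec> records) {
        Map<Tp, Long> next = new HashMap<>();
        for (Rec r : records) {
            next.merge(new Tp(r.topic(), r.partition()), r.offset() + 1, Long::max);
        }
        return next;
    }

    public static void main(String[] args) {
        List<Rec> batch = List.of(
            new Rec("risk.input.v1", 0, 41),
            new Rec("risk.input.v1", 0, 42),
            new Rec("risk.input.v1", 1, 7));
        System.out.println(nextOffsets(batch));
    }
}
```

In a real worker the same loop would build a `Map<TopicPartition, OffsetAndMetadata>` from the polled `ConsumerRecords`.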
7.2 What Kafka Transactions Do Guarantee
| Scenario | Covered? | Explanation |
|---|---|---|
| Input offsets + output records atomic in Kafka | Yes | Offsets and output are committed together. |
| Duplicate output from retry inside Kafka transaction | Prevented for committed transactions | Aborted transactions are hidden from read_committed consumers. |
| Zombie producer fencing | Yes | transactional.id fences old producers. |
| Local DB update atomic with Kafka output | No | DB is outside Kafka transaction. |
| External REST call atomic with Kafka output | No | REST system is outside Kafka transaction. |
| Email/payment/notification exactly once | No | Requires external idempotency. |
7.3 When to Use Kafka Transactions
Use Kafka transactions when:
- The workflow is Kafka-in/Kafka-out.
- Outputs are derived from consumed records.
- Offsets and outputs must move together.
- You can control consumer isolation level.
- You understand transaction timeout and fencing behavior.
Do not use Kafka transactions as a substitute for outbox when the local database is authoritative.
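The consumer side of a read-process-write worker needs matching settings. The sketch below uses the standard consumer configuration keys as plain strings so it stays self-contained. The class name and the parameters are assumptions, and deserializer settings are omitted for brevity.

```java
import java.util.Properties;

// Sketch of consumer settings for a transactional read-process-write worker.
final class TransactionalConsumerConfig {

    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        // Offsets are committed through the producer transaction, never by the consumer itself.
        props.put("enable.auto.commit", "false");
        // Read only records that belong to committed transactions.
        props.put("isolation.level", "read_committed");
        return props;
    }
}
```

`isolation.level=read_committed` matters most for consumers of the output topic: with the default `read_uncommitted`, they would also see records from aborted transactions.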
8. Pattern 4 — Inbox Pattern
The inbox pattern records consumed event IDs before or during business processing.
It protects consumers from duplicates caused by retry, replay, relay crash, or producer uncertainty.
CREATE TABLE inbox_event (
    consumer_name VARCHAR(150) NOT NULL,
    event_id UUID NOT NULL,
    topic VARCHAR(200) NOT NULL,
    partition_no INTEGER NOT NULL,
    offset_no BIGINT NOT NULL,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (consumer_name, event_id)
);
8.1 Consumer Example
public void handle(ConsumerRecord<String, QuoteApproved> record) {
    QuoteApproved event = record.value();
    tx.required(() -> {
        boolean firstTime = inboxMapper.insertIfAbsent(
            "order-service.quote-approved-handler",
            event.eventId(),
            record.topic(),
            record.partition(),
            record.offset()
        );
        if (!firstTime) {
            return null;
        }
        orderMapper.createDraftOrderIfAbsent(
            event.quoteId(),
            event.customerId(),
            event.approvedPrice()
        );
        return null;
    });
}
The invariant:
The business side effect and the inbox record are committed together in the consumer database.
This makes duplicate events safe.
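The consumer example relies on an `insertIfAbsent` method whose implementation is not shown. One possible shape, assuming PostgreSQL and plain JDBC, lets the primary key on `(consumer_name, event_id)` decide whether the event is new. The class name `JdbcInbox` is hypothetical, and the source does not say which mapper technology `inboxMapper` actually uses.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

// Sketch of an inbox mapper method; assumes PostgreSQL and the inbox_event table above.
final class JdbcInbox {

    static final String INSERT_IF_ABSENT =
        "INSERT INTO inbox_event (consumer_name, event_id, topic, partition_no, offset_no) "
      + "VALUES (?, ?, ?, ?, ?) "
      + "ON CONFLICT (consumer_name, event_id) DO NOTHING";

    // Returns true only when this call inserted the row, i.e. the event has not been seen.
    // Must run on the same connection and transaction as the business update.
    static boolean insertIfAbsent(Connection conn, String consumerName, UUID eventId,
                                  String topic, int partition, long offset) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_IF_ABSENT)) {
            ps.setString(1, consumerName);
            ps.setObject(2, eventId);  // the PostgreSQL JDBC driver maps java.util.UUID to uuid
            ps.setString(3, topic);
            ps.setInt(4, partition);
            ps.setLong(5, offset);
            return ps.executeUpdate() == 1;
        }
    }
}
```

Because the insert and the business update share one transaction, a crash between them rolls back both, and the redelivered record is processed as if it were new.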
8.2 Inbox vs Dedup Table
| Pattern | Main Use |
|---|---|
| Inbox table | Record consumed event identity. |
| Dedup table | Prevent duplicate commands or requests. |
| Ledger table | Preserve append-only business facts. |
| Projection version column | Ignore stale aggregate versions. |
In high-integrity workflows, you often use more than one.
9. Pattern 5 — Saga
A saga coordinates a long-running business process using a series of local transactions and messages.
Each step commits locally. If a later step fails, the system executes compensating actions.
9.1 Choreography Saga
In choreography, services react to each other’s events.
Pros:
- No central coordinator.
- Services are loosely coupled at runtime.
- Good for simple flows.
Cons:
- Hard to see global process state.
- Failure handling is distributed.
- Compensation logic may be scattered.
- Audit trail must be reconstructed from many topics.
9.2 Orchestration Saga
In orchestration, a coordinator decides the next command.
Pros:
- Global workflow state is explicit.
- Easier timeout handling.
- Easier regulatory audit narrative.
- Compensation is centralized.
Cons:
- Coordinator can become a coupling point.
- Requires careful state machine design.
- More operational responsibility.
9.3 Choreography vs Orchestration Decision Matrix
| Criterion | Choreography | Orchestration |
|---|---|---|
| Simple event propagation | Strong | Overkill |
| Complex branching | Weak | Strong |
| Regulatory traceability | Medium | Strong |
| Team autonomy | Strong | Medium |
| Central process visibility | Weak | Strong |
| Timeout handling | Distributed | Centralized |
| Compensation | Scattered | Explicit |
| Debuggability | Harder | Easier |
For complex case management, enforcement lifecycle, or order orchestration, orchestration often wins because the process itself is a first-class domain object.
10. Commands, Events, and Saga Boundaries
Do not blur command and event semantics.
| Message | Meaning | Example |
|---|---|---|
| Command | Please do this | ReserveCreditCommand |
| Event | This happened | CreditReserved |
| Rejection event | This requested action failed | CreditReservationRejected |
| Timeout event | Expected response did not arrive | CreditReservationTimedOut |
| Compensation command | Please undo/neutralize previous effect | ReleaseCreditCommand |
| Compensation event | Undo/neutralization happened | CreditReleased |
A common anti-pattern is naming commands as events:
Bad: CreditShouldBeReserved
Good: ReserveCreditCommand
Good: CreditReserved
Good: CreditReservationRejected
Events are facts. Commands are requests.
11. Compensation Is Not Rollback
Rollback means the original change never became visible.
Compensation means a later action neutralizes or reverses the business effect.
| Original Action | Compensation |
|---|---|
| Reserve credit | Release credit |
| Create draft order | Cancel draft order |
| Request fulfillment | Send cancellation request |
| Open enforcement case | Close as superseded / withdrawn |
| Notify customer | Send correction notice |
Compensation must be modeled as real business behavior.
Do not hide it as technical cleanup.
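As a rough illustration, a saga can track each completed step together with the command that neutralizes it. The sketch below assumes compensation runs in reverse order of completion, which is a common convention rather than something this series prescribes. The class name and `CancelDraftOrderCommand` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch: remember completed steps and derive compensation commands, newest step first.
final class CompensationSketch {

    // A completed saga step paired with the command that neutralizes it.
    record Step(String action, String compensationCommand) {}

    private final Deque<Step> completed = new ArrayDeque<>();

    void recordCompleted(Step step) {
        completed.push(step);  // most recently completed step sits on top
    }

    // Drains the completed steps and returns their compensation commands in reverse order.
    List<String> compensate() {
        List<String> commands = new ArrayList<>();
        while (!completed.isEmpty()) {
            commands.add(completed.pop().compensationCommand());
        }
        return commands;
    }

    public static void main(String[] args) {
        CompensationSketch saga = new CompensationSketch();
        saga.recordCompleted(new Step("Reserve credit", "ReleaseCreditCommand"));
        saga.recordCompleted(new Step("Create draft order", "CancelDraftOrderCommand"));
        System.out.println(saga.compensate());
    }
}
```

In a real saga each compensation command would be written to the outbox in the same transaction as the saga state change, so that compensation itself survives a crash.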
11.1 Compensation Event Design
{
  "eventId": "4a8b1a9c-b6d8-43e7-a722-b43eac4b9a43",
  "eventType": "CreditReleased",
  "eventVersion": 1,
  "correlationId": "order-2026-00019",
  "causationId": "InventoryReservationRejected.eventId",
  "aggregateId": "credit-reservation-8831",
  "reason": "INVENTORY_UNAVAILABLE",
  "releasedAmount": "1200000.00",
  "occurredAt": "2026-07-01T10:15:30Z"
}
For audit-grade systems, compensation should answer:
- What was reversed?
- Why was it reversed?
- Who or what caused it?
- Was the original effect visible?
- What state is valid now?
12. Timeout as a Domain Event
Distributed systems cannot assume silence means success.
A saga step needs timeout policy.
A timeout should usually become an explicit state transition, not just a log line.
12.1 Timeout Policy
| Field | Example |
|---|---|
| Step | WAITING_FOR_CREDIT_RESERVATION |
| Deadline | 2026-07-01T10:20:00Z |
| Retry limit | 3 |
| Retry backoff | exponential, capped |
| Timeout event | CreditReservationTimedOut |
| Compensation | CancelOrderCommand or ReleaseCreditCommand |
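A timeout job built on this policy can be sketched as a pure function from a waiting step and the current time to an optional timeout event. The record names and fields below are assumptions for illustration. A real job would also persist the state transition and publish the event through the outbox.

```java
import java.time.Instant;
import java.util.Optional;

// Sketch of a deadline check that turns silence into an explicit timeout event.
final class TimeoutSketch {

    record WaitingStep(String sagaId, String step, Instant deadline) {}
    record TimedOut(String sagaId, String eventType, Instant detectedAt) {}

    // Emits a timeout event only once the deadline has passed.
    static Optional<TimedOut> check(WaitingStep waiting, Instant now) {
        if (now.isAfter(waiting.deadline())) {
            return Optional.of(new TimedOut(waiting.sagaId(), "CreditReservationTimedOut", now));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        WaitingStep waiting = new WaitingStep(
            "order-2026-00019",
            "WAITING_FOR_CREDIT_RESERVATION",
            Instant.parse("2026-07-01T10:20:00Z"));
        System.out.println(check(waiting, Instant.parse("2026-07-01T10:19:59Z")).isPresent());
        System.out.println(check(waiting, Instant.parse("2026-07-01T10:20:01Z")).isPresent());
    }
}
```

Passing `now` as a parameter keeps the check deterministic, which makes the timeout policy easy to unit test.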
13. Workflow Engine Boundary
Kafka is not a workflow engine.
Kafka is excellent at durable event distribution. It does not automatically provide:
- Human task assignment.
- Timers as business objects.
- Visual process instance state.
- BPMN decision gateways.
- Built-in compensation modeling.
- Case lifecycle audit views.
For complex process orchestration, Kafka often works beside a workflow engine.
The key design question:
Is Kafka carrying facts and commands, while the workflow engine owns process state? Or is a Kafka Streams/Java service owning process state?
Both can work. Mixing them without clear ownership creates subtle bugs.
14. Reference Architecture: Order Saga with Outbox and Inbox
Notice the repeated pattern:
- Consume event idempotently.
- Update local state.
- Insert outbox event/command.
- Publish asynchronously.
- Downstream consumes idempotently.
This is the backbone of reliable event-driven workflows.
15. State Machine Guard for Workflow Correctness
A saga should reject invalid transitions.
public enum OrderSagaState {
    STARTED,
    CREDIT_RESERVED,
    INVENTORY_RESERVED,
    READY_FOR_FULFILLMENT,
    COMPENSATING,
    REJECTED,
    COMPLETED
}

public void apply(CreditReserved event) {
    if (state != OrderSagaState.STARTED) {
        throw new InvalidTransitionException(state, "CreditReserved");
    }
    this.creditReservationId = event.reservationId();
    this.state = OrderSagaState.CREDIT_RESERVED;
    this.version++;
}
For replay-tolerant consumers, invalid transition handling should distinguish:
| Case | Action |
|---|---|
| Duplicate event | Ignore safely. |
| Stale event | Ignore and audit. |
| Impossible transition | Quarantine / DLQ. |
| Valid transition | Apply and publish next command. |
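The four cases in the table can be expressed as a small classification step that runs before the state machine is touched. This sketch assumes each event carries a version comparable to the saga's current version, and that the caller has already looked up whether the event ID was seen. Both are assumptions, and the names are hypothetical.

```java
// Sketch: classify an incoming event before applying it to the saga.
final class TransitionGuard {

    enum Outcome { APPLY, IGNORE_DUPLICATE, IGNORE_STALE, QUARANTINE }

    // currentVersion:    version already applied to the saga.
    // eventVersion:      version the incoming event claims to produce.
    // eventIdSeen:       true if the inbox already holds this event ID.
    // transitionAllowed: true if the state machine permits this event in the current state.
    static Outcome classify(long currentVersion, long eventVersion,
                            boolean eventIdSeen, boolean transitionAllowed) {
        if (eventIdSeen) {
            return Outcome.IGNORE_DUPLICATE;  // same event delivered again
        }
        if (eventVersion <= currentVersion) {
            return Outcome.IGNORE_STALE;      // older than what is already applied; audit it
        }
        if (!transitionAllowed) {
            return Outcome.QUARANTINE;        // new but impossible: send to DLQ for investigation
        }
        return Outcome.APPLY;
    }
}
```

Checking for duplicates and stale versions first keeps expected replay noise out of the DLQ, so quarantined events point at real defects.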
16. Transaction Boundary Decision Framework
Use this framework during architecture review.
16.1 Question 1 — What Is the System of Record?
| System of Record | Preferred Pattern |
|---|---|
| Local relational DB | Transactional outbox + inbox |
| Kafka topic | Kafka transactions / stream processing |
| Event store | Event sourcing + projections |
| External SaaS | Idempotent command + reconciliation |
| Workflow engine | Engine-owned state + Kafka integration |
16.2 Question 2 — Can the Side Effect Be Repeated?
| Side Effect | Repeat-Safe? | Required Control |
|---|---|---|
| Updating projection by version | Usually | Version guard |
| Inserting audit event | Yes if event ID unique | Unique constraint |
| Sending email | No | Notification idempotency key |
| Charging payment | No | Provider idempotency key |
| Creating shipment | No | External idempotency / reconciliation |
| Producing Kafka event | Duplicate possible | Event ID + consumer idempotency |
16.3 Question 3 — Is Compensation Meaningful?
| Operation | Compensation Quality |
|---|---|
| Reserve inventory | Good: release inventory |
| Reserve credit | Good: release credit |
| Send customer notification | Weak: send correction |
| Publish audit event | Usually no delete; publish correction |
| Enforcement decision | Must preserve history; supersede/revoke |
16.4 Question 4 — What Is the Required User Experience?
| UX Requirement | Architecture Implication |
|---|---|
| Immediate confirmation | Local state transition + async fulfillment |
| Strong finality | Synchronous dependency or pending state |
| Long-running review | Saga/workflow state |
| Human approval | Workflow engine or case state machine |
| Regulatory evidence | Append-only event/audit ledger |
17. Failure Modeling Checklist
For every transaction boundary, enumerate these crash points.
17.1 Producer Service with Outbox
| Crash Point | Expected Recovery |
|---|---|
| Before DB commit | Command retried; no state/event exists. |
| After DB commit before response | Client may retry; command idempotency prevents duplicate transition. |
| After outbox row exists before relay publishes | Relay publishes later. |
| After relay publishes before marking published | Event may publish again; consumer idempotency handles duplicate. |
| Relay permanently fails | Pending outbox alert fires. |
| Kafka unavailable | Outbox backlog grows; business state remains committed. |
17.2 Consumer with Inbox
| Crash Point | Expected Recovery |
|---|---|
| Before inbox insert | Kafka redelivers; process normally. |
| After inbox insert before business update in same tx | No partial state: the transaction rolls back both writes, and Kafka redelivers the record. |
| After business update before offset commit | Kafka redelivers; inbox detects duplicate. |
| After outbox insert before offset commit | Kafka redelivers; inbox detects duplicate; outbox event exists. |
| After offset commit | Processing complete. |
17.3 Saga Coordinator
| Crash Point | Expected Recovery |
|---|---|
| Before state transition persisted | Event redelivered and applied. |
| After state transition before command published | Outbox publishes command later. |
| After command published before command marked sent | Duplicate command possible; receiver idempotency required. |
| Waiting for response | Timeout job emits timeout transition. |
| Compensation partially complete | Compensation state machine resumes. |
18. Regulatory and Audit-Grade Considerations
In regulatory or enforcement systems, transaction design is not only about technical correctness.
It must be explainable.
A defensible event-driven workflow needs:
- Stable event IDs.
- Correlation and causation IDs.
- Actor identity.
- Decision rationale.
- Input evidence reference.
- State before and after.
- Timestamp semantics.
- Correction/supersession model.
- Replay/reconstruction procedure.
- Immutable audit retention policy.
18.1 Audit Event Example
{
  "eventId": "7be0c2b4-3d6e-4b8c-99bc-9d03ac8b80be",
  "eventType": "CaseEscalated",
  "eventVersion": 1,
  "caseId": "CASE-2026-008812",
  "previousState": "UNDER_REVIEW",
  "newState": "ESCALATED",
  "actor": {
    "type": "SYSTEM",
    "id": "rules-engine",
    "reason": "risk_score_threshold_exceeded"
  },
  "causationId": "RiskScoreCalculated.eventId",
  "correlationId": "CASE-2026-008812",
  "evidenceRefs": [
    "document:doc-881",
    "transaction:txn-933"
  ],
  "occurredAt": "2026-07-01T12:01:00Z"
}
A later correction should not mutate or erase this event. It should append another event.
19. Anti-Patterns
19.1 “Kafka Is Our Transaction Manager”
Kafka transactions are not distributed transactions across every system.
Use them for Kafka-native boundaries. Use outbox/inbox/saga for multi-system boundaries.
19.2 “Exactly Once Means No Idempotency Needed”
Exactly-once Kafka processing does not make external APIs exactly-once.
You still need idempotency keys for side effects outside Kafka.
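One way to get a stable key is to derive it deterministically from the event ID and the name of the side effect, so that every retry of the same event sends the same key. The sketch below uses the JDK's name-based UUID for that. The class and method names are hypothetical, and how a given provider accepts and honours the key is provider-specific.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Sketch: derive a repeatable idempotency key for one side effect of one event.
final class IdempotencyKeys {

    // The same (eventId, effectName) pair always yields the same key, so a retry reuses it.
    static String forSideEffect(UUID eventId, String effectName) {
        String material = effectName + ":" + eventId;
        return UUID.nameUUIDFromBytes(material.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        UUID eventId = UUID.fromString("4a8b1a9c-b6d8-43e7-a722-b43eac4b9a43");
        String first = forSideEffect(eventId, "charge-payment");
        String retry = forSideEffect(eventId, "charge-payment");
        System.out.println(first.equals(retry));
    }
}
```

Including the effect name keeps keys distinct when one event triggers several external calls, such as a payment and a notification.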
19.3 “DLQ Is Compensation”
A DLQ is an operational quarantine. It is not a business compensation model.
19.4 “Rollback the Event”
Events are historical facts. If the business fact was wrong, publish a correction or supersession event.
19.5 “One Giant Saga Topic”
A mega-topic for all workflow commands and events makes authorization, schema, retention, and ownership harder.
19.6 “No Explicit Pending State”
If a process depends on asynchronous downstream work, model pending states explicitly.
Bad:
Order status = APPROVED
Better:
Order status = CREDIT_RESERVATION_PENDING
Order status = INVENTORY_RESERVATION_PENDING
Order status = READY_FOR_FULFILLMENT
20. Architecture Decision Record Template
# ADR: Transaction Boundary for <Workflow Name>
## Context
<Business operation, involved systems, consistency requirement.>
## Systems Involved
- Service:
- Local database:
- Kafka topics:
- External systems:
- Workflow engine:
## System of Record
<Which system owns authoritative state?>
## Chosen Pattern
<Outbox / CDC outbox / Kafka transaction / saga / workflow orchestration / hybrid.>
## Atomic Boundary
<Exactly what commits atomically?>
## Non-Atomic Side Effects
<What can happen before/after other effects?>
## Duplicate Strategy
<Event ID, command ID, idempotency table, external idempotency key.>
## Ordering Strategy
<Partition key, aggregate version, sequence.>
## Compensation Strategy
<Compensation commands/events and terminal states.>
## Timeout Strategy
<Deadlines, retries, timeout events.>
## Recovery Procedure
<Crash points and expected recovery behavior.>
## Observability
<Metrics, logs, traces, audit events, dashboards, alerts.>
## Consequences
<Trade-offs accepted.>
21. Observability for Transaction Boundaries
21.1 Metrics
| Metric | Why It Matters |
|---|---|
| Outbox pending count | Publishing backlog. |
| Outbox oldest pending age | Business latency risk. |
| Outbox publish failure rate | Kafka or serialization issue. |
| Inbox duplicate count | Retry/replay/producer uncertainty. |
| Saga pending count by state | Workflow bottleneck. |
| Saga timeout count | Downstream failure or insufficient timeout. |
| Compensation count | Business failure and operational stress. |
| DLQ count by event type | Data quality or consumer defect. |
| External idempotency conflict count | Duplicate command pressure. |
21.2 Logs
A useful transaction-boundary log includes:
{
  "level": "INFO",
  "message": "outbox event published",
  "eventId": "...",
  "aggregateType": "Order",
  "aggregateId": "ORD-991",
  "aggregateVersion": 7,
  "topic": "order.created.v1",
  "partitionKey": "ORD-991",
  "correlationId": "...",
  "attemptCount": 1
}
21.3 Traces
Trace propagation should include:
- HTTP request span.
- DB transaction span.
- Outbox insert span.
- Relay publish span.
- Kafka consume span.
- Downstream transaction span.
Do not rely only on offset numbers for workflow debugging. Use correlation IDs.
22. Deliberate Practice
Exercise 1 — Dual-Write Failure Table
Take one service method in your system that updates a database and publishes Kafka.
Write a crash matrix with these columns:
- Step.
- DB state.
- Kafka state.
- Client-visible state.
- Recovery behavior.
- Duplicate risk.
- Data-loss risk.
Exercise 2 — Outbox Design
Design an outbox table for one aggregate.
Required fields:
- Event ID.
- Aggregate ID.
- Aggregate version.
- Topic.
- Partition key.
- Payload.
- Headers.
- Status.
- Attempt count.
- Error field.
Exercise 3 — Inbox Consumer
Implement a consumer that:
- Inserts an inbox row with unique (consumer_name, event_id).
- Applies a business update in the same transaction.
- Commits offset only after DB commit.
- Ignores duplicate event IDs safely.
Exercise 4 — Saga State Machine
Model a workflow with at least four services.
Define:
- States.
- Events.
- Commands.
- Timeouts.
- Compensation actions.
- Terminal states.
- Invalid transition behavior.
Exercise 5 — External Side Effect Idempotency
Pick an external side effect such as email, payment, document generation, or shipment creation.
Define:
- Idempotency key.
- Retry policy.
- Timeout behavior.
- Reconciliation job.
- Compensation/correction model.
23. Mental Model Summary
A production Kafka transaction model is built from explicit boundaries, each with its own guarantee.
Key conclusions:
- Local DB + Kafka direct write is unsafe for correctness-critical events.
- Transactional outbox makes event intent atomic with local state.
- Outbox still allows duplicate publication; consumers must be idempotent.
- Kafka transactions solve Kafka-native read-process-write atomicity.
- Kafka transactions do not cover databases or external APIs.
- Saga is the correct model for long-running multi-service business workflows.
- Compensation is business behavior, not technical rollback.
- Audit-grade systems need explicit event identity, causation, correction, and state transition evidence.
24. References
- Apache Kafka Documentation — Transactions, producers, consumers, and design: https://kafka.apache.org/documentation/
- Apache Kafka Producer Configs — idempotence and transactional configuration: https://kafka.apache.org/documentation/#producerconfigs
- Confluent — Kafka delivery semantics: https://docs.confluent.io/kafka/design/delivery-semantics.html
- Confluent Blog — Transactions in Apache Kafka: https://www.confluent.io/blog/transactions-apache-kafka/
- Debezium Documentation — Outbox Event Router: https://debezium.io/documentation/reference/stable/transformations/outbox-event-router.html
25. What Comes Next
Part 028 moves from correctness to protection:
How do we secure Kafka with service identity, TLS/mTLS, SASL, ACLs, least privilege, topic governance, tenant isolation, and auditability?