Distributed Transactions and Sagas
Learn Database Design and Architect - Part 036
Distributed transactions and sagas in production systems: local transactions, 2PC, outbox, compensation, idempotency, orchestration, choreography, consistency boundaries, and failure modelling.
Part 036 — Distributed Transactions and Sagas
The hardest distributed transaction problem is not writing to two databases.
The hardest problem is knowing what the business state means after one step succeeded, another failed, a message was delayed, and the user retried.
A database transaction gives you a clean boundary: either the local database changes commit or they do not. Distributed business processes do not have that luxury by default.
When one business action spans multiple services, databases, queues, payment providers, search indexes, or external systems, you need a design for partial success.
This part explains distributed transactions and sagas from an architect’s point of view.
1. Local Transaction vs Distributed Business Transaction
A local transaction:
one database connection
one database engine
one commit decision
one atomic boundary
Example:
BEGIN;
UPDATE account
SET balance = balance - 100
WHERE id = :from_account_id;
UPDATE account
SET balance = balance + 100
WHERE id = :to_account_id;
INSERT INTO ledger_entry (...);
COMMIT;
A distributed business transaction:
service A database
service B database
message broker
external provider
search index
notification service
Example:
Submit enforcement case
-> create case record
-> reserve case number
-> attach evidence
-> assign investigator
-> notify supervisor
-> publish audit event
-> update search index
Trying to treat this as one giant database transaction is usually the wrong architecture.
2. The Fundamental Problem: Partial Success
Consider:
Possible realities:
- DB commit succeeded, event publish failed.
- DB commit succeeded, event publish succeeded, but API timed out.
- DB commit succeeded, event publish delayed.
- DB commit failed, event publish mistakenly succeeded.
- User retries and creates duplicate intent.
The architect’s job is to make these realities safe.
3. Two Broad Approaches
| Approach | Meaning | Best Fit | Main Risk |
|---|---|---|---|
| Distributed atomic commit | Try to commit multiple resources as one atomic unit | tightly controlled infrastructure, rare cases | blocking, availability, coordinator failure, complexity |
| Saga / eventual consistency | Break process into local transactions with messages and compensation | microservices, external providers, long workflows | partial state complexity, compensation correctness |
In most service-oriented systems, prefer local transactions plus saga/outbox patterns unless you have a strong reason and platform support for distributed atomic commit.
4. Two-Phase Commit Mental Model
Two-phase commit, or 2PC, has a coordinator and participants.
Phase 1: prepare.
Phase 2: commit or abort.
The attractive promise:
All participants commit or all abort.
The hard production reality:
- participants may hold locks while waiting
- coordinator failure creates uncertainty
- prepared transactions can block resources
- not every resource supports 2PC
- external APIs usually do not participate
- latency grows with participants
- operational recovery is hard
2PC is not evil. It is just a heavy tool. Use it only when its guarantees are worth its operational cost.
5. Why Sagas Exist
A saga decomposes one distributed business transaction into a sequence of local transactions.
Each step:
- commits locally
- records durable progress
- emits or receives a message
- triggers the next step
- may have a compensation path
Example:
A saga does not make the world atomic. It makes the non-atomic reality explicit, durable, observable, and recoverable.
6. Saga Is a State Machine
A saga should be modelled like a state machine, not as invisible async glue.
CREATE TABLE saga_instance (
id uuid PRIMARY KEY,
saga_type text NOT NULL,
business_key text NOT NULL,
status text NOT NULL CHECK (status IN (
'STARTED',
'WAITING',
'COMPLETED',
'COMPENSATING',
'COMPENSATED',
'FAILED'
)),
current_step text NOT NULL,
attempt_count integer NOT NULL DEFAULT 0,
last_error_code text,
last_error_detail text,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
completed_at timestamptz,
UNIQUE (saga_type, business_key)
);
Step table:
CREATE TABLE saga_step_execution (
id uuid PRIMARY KEY,
saga_id uuid NOT NULL REFERENCES saga_instance(id),
step_name text NOT NULL,
direction text NOT NULL CHECK (direction IN ('FORWARD', 'COMPENSATION')),
status text NOT NULL CHECK (status IN (
'PENDING', 'RUNNING', 'SUCCEEDED', 'FAILED', 'SKIPPED'
)),
idempotency_key text NOT NULL,
started_at timestamptz,
completed_at timestamptz,
error_code text,
error_detail text,
UNIQUE (saga_id, step_name, direction),
UNIQUE (idempotency_key)
);
This gives you:
- durable progress
- retry control
- duplicate protection
- auditability
- operator visibility
- compensation trace
7. Orchestration vs Choreography
7.1 Orchestrated Saga
A central orchestrator tells each participant what to do.
Pros:
- clear process visibility
- easier timeout handling
- easier compensation ordering
- easier audit trail
- easier versioning
Cons:
- orchestrator can become central dependency
- services may become passive workers
- process logic concentrated in one place
Best for:
- regulated workflows
- long-running business processes
- explicit approvals
- strong audit requirements
- operational visibility
7.2 Choreographed Saga
Services react to events without central coordinator.
Pros:
- loose coupling
- no central process coordinator
- easy to add listeners
- natural event-driven architecture
Cons:
- process is harder to see
- compensation logic scattered
- event cycles possible
- versioning can be difficult
- debugging needs excellent tracing
Best for:
- simple flows
- naturally independent reactions
- notification/projection/update side effects
- low-regulation domains
For enforcement/case-management systems, orchestrated saga is often easier to defend because the workflow is explicit.
8. Transactional Outbox Pattern
The outbox pattern solves the classic dual-write problem:
write database + publish message
Bad pattern:
caseRepository.save(caseRecord); // commits
messageBroker.publish(event); // may fail
Better pattern:
BEGIN;
INSERT INTO case_record (id, tenant_id, status, created_at)
VALUES (:case_id, :tenant_id, 'OPEN', now());
INSERT INTO outbox_event (
id,
aggregate_type,
aggregate_id,
event_type,
payload,
created_at,
status
)
VALUES (
gen_random_uuid(),
'CASE',
:case_id,
'CaseCreated',
:payload::jsonb,
now(),
'PENDING'
);
COMMIT;
Then a relay publishes the outbox event.
The event may still be delivered more than once. Therefore consumers must be idempotent.
9. Outbox Table Design
CREATE TABLE outbox_event (
id uuid PRIMARY KEY,
aggregate_type text NOT NULL,
aggregate_id uuid NOT NULL,
event_type text NOT NULL,
event_version integer NOT NULL,
payload jsonb NOT NULL,
headers jsonb NOT NULL DEFAULT '{}'::jsonb,
idempotency_key text NOT NULL,
status text NOT NULL CHECK (status IN ('PENDING', 'PUBLISHED', 'FAILED')),
available_at timestamptz NOT NULL DEFAULT now(),
attempt_count integer NOT NULL DEFAULT 0,
last_attempt_at timestamptz,
published_at timestamptz,
created_at timestamptz NOT NULL DEFAULT now(),
UNIQUE (idempotency_key)
);
CREATE INDEX idx_outbox_pending
ON outbox_event (available_at, id)
WHERE status = 'PENDING';
Relay claim pattern:
WITH picked AS (
SELECT id
FROM outbox_event
WHERE status = 'PENDING'
AND available_at <= now()
ORDER BY available_at, id
LIMIT 100
FOR UPDATE SKIP LOCKED
)
UPDATE outbox_event o
SET status = 'PUBLISHED',
published_at = now(),
attempt_count = attempt_count + 1,
last_attempt_at = now()
FROM picked
WHERE o.id = picked.id
RETURNING o.*;
In many designs, you should mark published after broker ack, not before. If you claim and update before publish, you need a separate IN_PROGRESS state and timeout recovery.
Safer relay states:
PENDING -> IN_PROGRESS -> PUBLISHED
\-> PENDING after timeout
\-> FAILED after max attempts
10. Inbox / Dedup Pattern
Consumers need an inbox table.
CREATE TABLE consumed_message (
message_id uuid PRIMARY KEY,
consumer_name text NOT NULL,
consumed_at timestamptz NOT NULL DEFAULT now(),
UNIQUE (consumer_name, message_id)
);
Consumer flow:
BEGIN;
INSERT INTO consumed_message (message_id, consumer_name)
VALUES (:message_id, 'assignment-service')
ON CONFLICT DO NOTHING;
-- If insert affected 0 rows, message was already processed.
-- Otherwise execute side effect inside same local transaction.
UPDATE assignment_queue
SET status = 'READY'
WHERE case_id = :case_id;
COMMIT;
Exactly-once delivery is rarely a reliable assumption at system boundary. Design for at-least-once delivery plus idempotent consumers.
11. Compensation Is Not Rollback
A database rollback erases uncommitted changes.
A compensation is a new business action that semantically counteracts a previous committed action.
Example:
| Forward Step | Compensation |
|---|---|
| reserve inventory | release reservation |
| assign investigator | unassign or reassign |
| debit account | credit reversal entry |
| send notification | send correction notification |
| create case | cancel case with reason |
A compensation must be:
- explicit
- idempotent
- auditable
- allowed by business rules
- safe to retry
- versioned with the saga
Bad compensation:
DELETE FROM case_record WHERE id = :case_id;
Better compensation:
UPDATE case_record
SET status = 'CANCELLED',
cancelled_at = now(),
cancellation_reason = 'SAGA_COMPENSATION'
WHERE id = :case_id
AND status IN ('DRAFT', 'OPEN');
Regulated systems usually need reversal, not deletion.
12. Saga Failure Modes
12.1 Step Succeeds, Reply Lost
The participant completed work but orchestrator did not receive response.
Mitigation:
- idempotency key
- query step status
- retry same command safely
12.2 Message Delivered Twice
Mitigation:
- inbox dedup
- unique idempotency key
- atomic consumer transaction
12.3 Compensation Fails
Mitigation:
- compensation step state
- retry policy
- manual intervention state
- operator dashboard
12.4 Timeout Ambiguity
Mitigation:
- distinguish business timeout vs technical timeout
- reconcile with participant
- use durable status endpoints
12.5 Out-of-Order Event
Mitigation:
- aggregate version
- sequence number
- monotonic state transition guard
- ignore stale events
12.6 Poison Message
Mitigation:
- dead-letter queue
- max attempts
- structured error code
- replay tooling
- schema validation
12.7 Saga Definition Changed Mid-Flight
Mitigation:
- saga version on instance
- old versions remain executable
- migration plan for in-flight sagas
13. Idempotency Key Design
An idempotency key identifies a business intent, not an HTTP request attempt.
Good keys:
tenant:{tenant_id}:submit-case:{client_request_id}
tenant:{tenant_id}:assign-case:{case_id}:{assignment_command_id}
payment:{payment_intent_id}:capture
Bad keys:
random UUID generated by server on every retry
current timestamp
message offset only
Command table pattern:
CREATE TABLE command_request (
idempotency_key text PRIMARY KEY,
tenant_id uuid NOT NULL,
command_type text NOT NULL,
request_hash text NOT NULL,
status text NOT NULL CHECK (status IN ('STARTED', 'SUCCEEDED', 'FAILED')),
response_payload jsonb,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
Request hash protects against accidental reuse of same idempotency key with different payload.
14. Designing a Saga Boundary
A saga boundary should follow business process, not technical convenience.
Ask:
- What business intent starts the saga?
- What local transaction records the intent?
- What is the durable business key?
- Which steps are required vs optional?
- Which steps are compensatable?
- Which steps are irreversible?
- What state is visible to users during progress?
- What timeout is business-significant?
- Who can manually intervene?
- What evidence must be retained?
If a step is irreversible, design the saga so irreversible work happens late, after validations and reservations.
Example:
validate -> reserve -> persist -> notify -> external irreversible action
Not:
external irreversible action -> validate -> fail -> panic
15. Saga State Visibility
Users and operators need meaningful intermediate states.
Bad UI/backend state:
PROCESSING
Better states:
CASE_SUBMISSION_RECEIVED
VALIDATING_EVIDENCE
WAITING_FOR_ASSIGNMENT
NOTIFICATION_PENDING
SUBMISSION_COMPLETED
SUBMISSION_FAILED_RETRYING
SUBMISSION_REQUIRES_OPERATOR_REVIEW
But avoid exposing every internal technical step as domain state. Separate:
- business state
- saga execution state
- technical retry state
Schema split:
case_record.status -- domain-visible state
saga_instance.status -- process execution state
saga_step_execution.status -- technical step state
16. Distributed Consistency Boundary
A local transaction gives immediate consistency inside one database.
A saga gives eventual consistency across boundaries.
Document this explicitly:
| Data | Owner | Consistency Guarantee |
|---|---|---|
| case status | Case DB | local transaction consistent |
| assignment | Assignment DB | eventually consistent after CaseSubmitted |
| notification | Notification service | at-least-once attempt |
| search index | Search projection | eventually consistent, freshness SLO 30s |
| report warehouse | Analytics | eventually consistent, daily load |
Do not let product, support, and compliance teams assume everything updates atomically when it does not.
17. External Systems
External APIs rarely support your transaction semantics.
Examples:
- payment gateway
- email/SMS provider
- identity provider
- government registry
- document signing service
- third-party evidence repository
Design rules:
- Use provider idempotency key when available.
- Store request and response correlation IDs.
- Treat timeout as unknown, not failed.
- Reconcile by querying provider status.
- Do not retry non-idempotent calls blindly.
- Keep irreversible external side effects late.
- Record evidence for audit.
External call table:
CREATE TABLE external_call_attempt (
id uuid PRIMARY KEY,
saga_id uuid NOT NULL,
provider text NOT NULL,
operation text NOT NULL,
idempotency_key text NOT NULL,
request_payload jsonb NOT NULL,
response_payload jsonb,
status text NOT NULL CHECK (status IN (
'PENDING', 'SENT', 'SUCCEEDED', 'FAILED', 'UNKNOWN'
)),
provider_ref text,
attempt_count integer NOT NULL DEFAULT 0,
last_error_code text,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
UNIQUE (provider, operation, idempotency_key)
);
18. Retry Policy
Not every failure deserves the same retry.
| Failure | Retry? | Notes |
|---|---|---|
| network timeout | yes, with idempotency | status unknown |
| serialization failure | yes | rerun local transaction |
| deadlock | yes with backoff | inspect contention if frequent |
| validation failure | no | business rejection |
| insufficient funds | no or compensate | business failure |
| provider 429 | yes with backoff | respect rate limit |
| provider 400 | usually no | request bug |
| schema validation failure | no until fixed | poison message risk |
Retry policy table:
CREATE TABLE retry_policy (
operation text PRIMARY KEY,
max_attempts integer NOT NULL,
initial_delay_ms integer NOT NULL,
max_delay_ms integer NOT NULL,
backoff_type text NOT NULL CHECK (backoff_type IN ('FIXED', 'EXPONENTIAL'))
);
Architectural rule:
Retry is safe only when the operation is idempotent or the previous attempt is known not to have happened.
19. Ordering and Versioning
Event ordering is rarely globally reliable. Design around aggregate-level ordering.
Event payload:
{
"eventId": "...",
"aggregateType": "CASE",
"aggregateId": "...",
"aggregateVersion": 12,
"eventType": "CaseAssigned",
"occurredAt": "2026-07-05T10:15:30Z"
}
Consumer guard:
UPDATE case_projection
SET status = :new_status,
aggregate_version = :event_version
WHERE case_id = :case_id
AND aggregate_version < :event_version;
This prevents stale events from moving projection backward.
20. Outbox + Saga Combined
A robust command handler often uses both.
The API does not need to complete the whole business process synchronously. It records the intent and returns a durable accepted state.
21. Case Management Example
Business command:
Submit Case For Enforcement Review
Steps:
- Create case record.
- Validate required evidence.
- Lock evidence package.
- Assign investigator queue.
- Create SLA timer.
- Notify supervisor.
- Index case for search.
- Emit audit event.
Schema shape:
CREATE TABLE case_submission_saga (
id uuid PRIMARY KEY,
tenant_id uuid NOT NULL,
case_id uuid NOT NULL,
status text NOT NULL,
current_step text NOT NULL,
version integer NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
UNIQUE (tenant_id, case_id)
);
Forward/compensation thinking:
| Step | Forward | Compensation |
|---|---|---|
| create case | insert case_record | mark cancelled, not delete |
| validate evidence | mark validated | mark validation invalidated |
| lock evidence | set locked | unlock if no downstream use |
| assign investigator | create assignment | cancel assignment |
| create SLA | create timer | cancel timer |
| notify supervisor | send notification | send correction if needed |
| index search | upsert projection | remove/update projection |
Some compensations may be impossible or undesirable. That is not a problem if explicitly modelled.
22. Financial Ledger Example
Never compensate ledger by deleting entries.
Bad:
DELETE FROM ledger_entry WHERE id = :entry_id;
Good:
INSERT INTO ledger_entry (
id,
account_id,
amount,
direction,
reference_type,
reference_id,
reversal_of_entry_id,
created_at
)
VALUES (
gen_random_uuid(),
:account_id,
:amount,
'CREDIT',
'REVERSAL',
:saga_id,
:original_entry_id,
now()
);
A ledger compensation is another immutable fact.
23. Testing Sagas
Test the failure matrix, not only the happy path.
Minimum tests:
- duplicate command request
- duplicate event delivery
- event delivered out of order
- participant timeout after success
- participant timeout before success
- orchestrator crash after local commit
- relay crash before publish
- relay crash after publish before marking published
- compensation failure
- retry exhaustion
- poison message
- schema version mismatch
- manual intervention path
Chaos-style scenario:
Kill process after DB commit but before outbox relay observes event.
Expected: outbox row remains pending and is eventually published.
24. Observability
A saga architecture without observability becomes invisible failure.
Track:
- saga count by status
- stuck saga age
- retry attempts by step
- compensation count
- failed compensation count
- outbox backlog
- outbox oldest pending age
- inbox duplicate count
- dead-letter count
- event publish latency
- end-to-end saga duration
- participant timeout rate
Dashboard sections:
Saga Health
- Started / Completed / Failed / Compensating
- Oldest WAITING saga
- Top failing steps
Outbox Health
- Pending count
- Oldest pending age
- Publish error rate
Consumer Health
- Duplicate messages
- Dead-letter queue
- Projection lag
25. Manual Intervention
Production sagas need operator states.
FAILED_RETRYABLE
FAILED_REQUIRES_REVIEW
COMPENSATION_REQUIRES_REVIEW
BLOCKED_EXTERNAL_PROVIDER
WAITING_MANUAL_APPROVAL
Operator action table:
CREATE TABLE saga_operator_action (
id uuid PRIMARY KEY,
saga_id uuid NOT NULL REFERENCES saga_instance(id),
action_type text NOT NULL,
reason text NOT NULL,
actor_id uuid NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
Operator actions must be audited. Manual repair that bypasses saga state is a future incident.
26. Versioning Sagas
Long-running workflows outlive deployments.
Design:
ALTER TABLE saga_instance
ADD COLUMN saga_version integer NOT NULL DEFAULT 1;
Rules:
- a saga instance runs under the version it started with
- new deployments can start new version
- old version handlers remain until old sagas complete or migrate
- compensation semantics are versioned
- event payload versions are explicit
Do not assume all in-flight sagas can instantly adopt new logic.
27. When Not to Use Saga
Do not use saga when:
- one local transaction is sufficient
- process does not cross consistency boundaries
- compensation is impossible and partial success is unacceptable
- business requires immediate atomic result
- team cannot operate async workflows
- product cannot tolerate eventual consistency
Alternative options:
- keep operation inside one database/service boundary
- redesign aggregate boundary
- use synchronous local transaction
- use database constraints instead of async coordination
- use explicit reservation before irreversible action
The best distributed transaction is often the one avoided through better boundary design.
28. Architecture Review Checklist
Boundary
- Which local transaction records the initial business intent?
- Which data is immediately consistent?
- Which data is eventually consistent?
- Is the consistency contract documented?
Saga Design
- Is saga state durable?
- Is each step idempotent?
- Is each compensation explicit?
- Are irreversible steps delayed as much as possible?
- Is saga version stored?
- Are stuck states visible?
Outbox/Inbox
- Are DB write and outbox insert in same transaction?
- Is relay safe against crash?
- Are consumers idempotent?
- Is duplicate delivery expected and tested?
- Are event versions explicit?
Failure Handling
- Are timeouts treated as unknown when appropriate?
- Is retry policy classified by error type?
- Is compensation retryable?
- Is there manual intervention path?
- Are poison messages isolated?
Observability
- Is outbox backlog monitored?
- Is oldest pending message monitored?
- Is stuck saga age monitored?
- Are compensation failures alerted?
- Can support trace one business key end-to-end?
29. Common Anti-Patterns
29.1 Dual Write Without Outbox
DB commit succeeds, publish fails.
This creates silent inconsistency.
29.2 Compensation by Delete
Deleting evidence of a committed action destroys auditability.
29.3 Retry Without Idempotency
Retry turns transient failure into duplicate side effects.
29.4 One Giant PROCESSING State
Operators cannot know what is stuck.
29.5 Choreography Without Traceability
Nobody can answer where the process is.
29.6 Assuming Event Order
Out-of-order messages corrupt projection/state.
29.7 Treating Timeout as Failure
Timeout means unknown unless you can prove no side effect occurred.
29.8 No Manual Repair Model
Every distributed system eventually needs controlled human intervention.
30. Final Mental Model
A local transaction protects invariants inside one database.
A saga protects business progress across multiple consistency boundaries by making partial success explicit.
The core design principles:
- Keep as much correctness as possible inside local transactions.
- Use outbox to atomically record state change and message intent.
- Make consumers idempotent with inbox/dedup.
- Model saga state durably.
- Treat compensation as business action, not technical rollback.
- Prefer explicit orchestration for regulated workflows.
- Make intermediate state observable.
- Test crash points, duplicates, retries, and timeouts.
The top 1% database architect does not pretend distributed systems are atomic. They design the truth model for when they are not.
References
- AWS Prescriptive Guidance — Transactional Outbox Pattern: https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/transactional-outbox.html
- AWS Prescriptive Guidance — Saga Choreography Pattern: https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/saga-choreography.html
- Microsoft Azure Architecture Center — Saga Distributed Transactions Pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/saga
- Microsoft Azure Architecture Center — Compensating Transaction Pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/compensating-transaction
- PostgreSQL Documentation — Transaction Isolation: https://www.postgresql.org/docs/current/transaction-iso.html
- PostgreSQL Documentation — Explicit Locking: https://www.postgresql.org/docs/current/explicit-locking.html
You just completed lesson 36 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.