Final StretchOrdered learning track

Database Architecture Decision Framework

Learn Database Design and Architect - Part 081

Production-grade framework untuk memilih, menjustifikasi, dan mempertahankan keputusan arsitektur database berdasarkan authority, invariant, workload, consistency, scale, security, operasi, dan evolusi.

29 min read5629 words
PrevNext
Lesson 8184 lesson track70–84 Final Stretch
#database#architecture#decision-framework#adr+7 more

Part 081 — Database Architecture Decision Framework

Target pembelajaran: mampu memilih, menjustifikasi, mengkritik, dan mempertahankan keputusan arsitektur database secara sistematis. Bukan berdasarkan selera engine, bukan berdasarkan hype, bukan berdasarkan “tim sebelumnya pakai ini”, tetapi berdasarkan authority, invariant, workload, consistency, evolution, operational capability, risk, dan cost of change.

Database architect yang kuat bukan orang yang selalu memilih PostgreSQL, MongoDB, Cassandra, Spanner, DynamoDB, Elasticsearch, atau warehouse tertentu.

Database architect yang kuat adalah orang yang bisa menjawab:

  1. data mana yang menjadi truth;
  2. invariant mana yang wajib dijaga;
  3. operasi mana yang harus atomic;
  4. query mana yang latency-sensitive;
  5. data mana yang boleh stale;
  6. perubahan schema mana yang akan sering terjadi;
  7. failure mana yang paling mahal;
  8. siapa yang mengoperasikan sistem saat incident;
  9. kapan keputusan ini harus diubah;
  10. bagaimana membuktikan keputusan ini benar setelah production berjalan.

Framework ini adalah sintesis dari semua part sebelumnya. Anggap ini seperti decision operating system untuk database architecture.


1. The Real Problem: Database Decisions Are Usually Made Too Late

Banyak tim membuat keputusan database setelah mereka sudah punya:

  • beberapa entity class;
  • beberapa endpoint;
  • beberapa query;
  • beberapa migration;
  • beberapa report;
  • beberapa integrasi;
  • beberapa production incident.

Pada titik itu, keputusan sudah terjadi, hanya belum ditulis.

Contoh:

"Kita pakai JSONB dulu, nanti kalau stabil dinormalisasi."

Itu bukan temporary shortcut. Itu keputusan arsitektur.

"Report langsung query database production saja, datanya belum besar."

Itu bukan quick win. Itu keputusan workload isolation.

"Access control dicek di service saja."

Itu bukan simplifikasi. Itu keputusan security boundary.

"Event dikirim setelah insert berhasil. Kalau publish gagal, retry dari app."

Itu bukan detail implementasi. Itu keputusan consistency antara database dan event bus.

Keputusan database jarang gagal karena engineer tidak tahu teknologi. Ia gagal karena tradeoff tidak dibuat eksplisit.

A database architecture decision that is not written down still exists. It is just harder to review, harder to test, and harder to reverse.


2. Mental Model: Decision as Constraint + Bet + Recovery Plan

Setiap keputusan database punya tiga komponen.

KomponenPertanyaanContoh
ConstraintApa yang wajib benar?Satu case hanya boleh punya satu active investigator
BetApa yang kita asumsikan akan terjadi?Volume case per tenant tidak akan melebihi 10 juta per tahun
Recovery planJika bet salah, apa jalur keluar?Tenant besar bisa dipindah ke dedicated database

Keputusan yang matang bukan keputusan yang tidak punya risiko. Keputusan matang adalah keputusan yang risikonya dikenal, diamati, dan punya escape hatch.

Example decision:

Decision:
Use PostgreSQL as canonical store for case-management data.
Use OpenSearch as search projection, not source of truth.
Use warehouse/lakehouse for analytical reports.

Constraint:
Case transitions, assignment uniqueness, decision audit, and evidence lifecycle must be transactionally protected.

Bet:
Operational query workload remains OLTP-shaped; search and analytics are projection workloads.

Recovery:
If search projection falls behind, operational writes continue; search freshness SLO degrades but case truth remains safe.
If analytics grows, warehouse scales independently.

3. Decision Levels

Jangan mencampur semua keputusan ke dalam satu diskusi “pakai database apa?”.

Ada banyak level keputusan.

LevelKeputusanContoh
Data authorityDi mana truth tinggal?case_file di PostgreSQL adalah source of truth
ModelBagaimana data dimodelkan?Normalize party/evidence/decision, denormalize timeline projection
ConstraintSiapa menjaga invariant?Unique partial index untuk one active assignment
TransactionOperasi apa harus atomic?Create case + intake event + audit row + outbox
EngineEngine apa dipakai?PostgreSQL, distributed SQL, document DB, warehouse
TopologyBagaimana engine disebar?Single primary + replicas, partitioning, sharding, multi-region
IntegrationBagaimana data keluar?CDC, outbox, API, batch export
SecurityBoundary akses di mana?RLS, application policy, views, grants
MigrationBagaimana berubah?Expand-migrate-contract
OperationBagaimana diamati dan dipulihkan?Metrics, runbook, PITR, restore drill
GovernanceBagaimana keputusan dikontrol?ADR, review checklist, ownership

Architecture review yang buruk melompat langsung ke engine.

Architecture review yang baik memulai dari authority, invariant, workload, dan failure mode.


4. The Decision Pipeline

Gunakan pipeline berikut untuk keputusan database besar.

The key idea:

Engine selection is not the first decision. It is a consequence of earlier decisions.


5. Step 1 — Define Data Authority

The first question is not:

Which database should we use?

The first question is:

Which system has authority over this fact?

Authority means the system is allowed to decide what is true.

Example facts:

FactAuthorityNot authority
Case current statusCase operational DBSearch index, dashboard, email notification
Evidence chain of custodyEvidence DB/moduleObject storage metadata alone
User display nameIdentity/profile serviceCase DB local copy
Daily case count metricWarehouse/semantic layerOperational table as live query source
Search relevance scoreSearch engineCanonical case DB

A database decision that ignores authority will create duplicate truth.

Authority questions

Ask:

  1. Who is allowed to create this fact?
  2. Who is allowed to correct it?
  3. Who is allowed to delete or expire it?
  4. Who must be notified when it changes?
  5. Can another system override it?
  6. Is it canonical, derived, cached, projected, or temporary?
  7. What happens when copies disagree?

Authority map template

## Data Authority Map

| Data / Fact | Authority | Copies / Projections | Freshness | Repair Path | Owner |
|---|---|---|---|---|---|
| Case status | Case DB | Search, timeline, warehouse | Search <= 60s, warehouse daily | Rebuild projection from case DB | Case Platform Team |
| Evidence metadata | Evidence DB | Search, audit export | Search <= 5m | Re-index from evidence table | Evidence Team |

6. Step 2 — Extract Invariants

A database design without explicit invariants is just storage.

Invariant examples:

  • a case cannot transition from CLOSED back to UNDER_REVIEW without a reopening decision;
  • a ledger transaction must balance debit and credit lines;
  • a tenant user cannot read another tenant’s case;
  • an evidence item cannot be deleted while under legal hold;
  • a report must be reproducible using the same data snapshot;
  • a case cannot have two active primary investigators;
  • a policy version cannot overlap another active policy version for the same jurisdiction.

Invariant classification

TypeExampleCommon enforcement
Domain invariantOne active assignmentPartial unique index, transaction lock
Referential invariantEvidence belongs to existing caseForeign key
Temporal invariantEffective periods do not overlapExclusion constraint, trigger, serializable transaction
Security invariantTenant isolationRLS, composite keys, app policy
Lifecycle invariantClosed case cannot accept new taskTransaction guard
Analytical invariantMetric has one definitionSemantic layer, metric contract
Integration invariantEvent published exactly after DB commitOutbox pattern

Invariant enforcement ladder

Prefer the strongest practical enforcement layer.

LayerStrengthUse when
Database constraintVery highInvariant is local to database rows/tables
Transactional guardHighNeeds read-modify-write logic
Serializable transactionHigh but operationally complexPredicate invariant cannot be expressed as simple constraint
Application validationMediumUX-level validation or cross-service policy
Async repairLow as prevention, useful as safety netDerived data, projection drift, external copies
Dashboard alert onlyDetection onlyInvariant is hard to prevent but must be detected

A decision framework must ask:

Can this invariant be made impossible to violate?
If not, can violation be detected fast?
If detected, can it be repaired safely?

7. Step 3 — Profile the Workload

Database decisions must be workload-first.

A workload profile is not “read-heavy” or “write-heavy”. That is too shallow.

Good workload profile:

## Workload Profile

Commands:
- Create case: 20 TPS peak, inserts 8 rows, writes audit + outbox.
- Assign investigator: 50 TPS peak, must enforce one active primary investigator.
- Submit decision: 5 TPS peak, writes decision, status transition, audit, outbox.

Queries:
- Case detail by id: p95 < 100ms, strongly consistent.
- Case queue by assignee/status/SLA: p95 < 250ms, current enough within same primary.
- Full-text search by party/evidence keyword: p95 < 1s, freshness <= 60s acceptable.
- Monthly compliance report: async, reproducible, not from OLTP primary.

Data growth:
- 30M cases/year, 200M tasks/year, 2B audit events/year.
- Large tenants can be 100x median tenant.

Concurrency risks:
- Assignment, status transition, and SLA escalation can conflict.
- Queue workers claim tasks concurrently.

Workload dimensions

DimensionWhy it matters
Read/write ratioAffects index, projection, cache, replica design
Query shapeDetermines whether B-Tree, search index, graph traversal, warehouse is appropriate
Write shapeDetermines transaction boundary, contention, WAL pressure
Latency SLODetermines serving path and topology
Freshness toleranceDetermines primary vs replica vs projection
ConcurrencyDetermines lock/transaction strategy
GrowthDetermines partitioning/sharding/lifecycle
SkewDetermines tenant isolation/hotspot mitigation
RetentionDetermines archive/purge/partitioning
Recovery objectiveDetermines backup/PITR/replication strategy

8. Step 4 — Define Consistency Contract

Consistency is not a single global checkbox.

Different operations need different guarantees.

OperationConsistency requiredReason
Case status transitionStrong, transactionalPrevent illegal lifecycle
Assignment uniquenessStrong, transactionalPrevent duplicate active owner
Search result listingEventually consistentProjection; freshness SLO acceptable
Dashboard countEventually consistent with watermarkAnalytical display
Regulatory decision recordStrong + auditableDefensibility
Notification deliveryAt-least-once with idempotencyExternal side effect
Report exportSnapshot consistentReproducibility

Consistency vocabulary

TermPractical meaning
Strong consistencyRead observes committed truth needed for correctness
Read-your-writesUser sees own write after commit
Monotonic readUser does not move backward in observed state
Snapshot consistencyQuery sees one stable snapshot
Eventual consistencySystem converges after delay
Causal consistencyDependent effects become visible in causal order
External consistencyCommit order respects real-time ordering

Decision rule

Use the weakest consistency that still preserves user trust and domain correctness.

But do not weaken consistency for invariants.

Latency optimization is acceptable for derived views.
Latency optimization is dangerous for authority updates.

9. Step 5 — Model Data Lifecycle

Database decisions must account for lifecycle.

For every major entity, document:

  1. creation path;
  2. valid transitions;
  3. correction path;
  4. cancellation/reversal path;
  5. archival path;
  6. retention period;
  7. purge/erasure conditions;
  8. downstream propagation;
  9. reporting snapshot behavior;
  10. audit requirement.

Lifecycle matrix

EntityCreated byMutated byTerminal stateRetentionPurge?Audit level
CaseIntake serviceWorkflow commandsClosed / withdrawn7 yearsAfter retention unless legal holdFull
EvidenceInvestigator / integrationMetadata correctionArchivedCase retention + holdRestrictedFull chain of custody
TaskWorkflow engineAssignment/escalationDone / canceled2 years after case closeYesMedium
DecisionAuthorized officerCorrection/reversal onlyFinalPermanent/long retentionUsually noFull

If lifecycle is not understood, engine selection is premature.

A document database may fit aggregate lifecycle when document boundary is stable and bounded.

A relational database may fit when lifecycle has many relationships, constraints, and independent sub-entities.

An event log may fit when history is the primary truth and current state is projection.

A warehouse may fit when immutable analytical snapshots matter more than command-time constraints.


10. Step 6 — Hard Filters Before Scoring

Many teams use weighted scoring too early. Scoring is useful only after disqualifying unsafe options.

Hard filter examples

RequirementDisqualifies
Multi-row transactional invariant is centralPure search engine as authority
Cross-entity joins are core OLTP pathKey-value store as only canonical store
Unbounded relationship traversal is primary queryPlain relational-only solution may become painful unless graph projection added
Regulatory audit and correction history requiredCache/projection-only truth
Reproducible historical reporting requiredLive OLTP-only dashboard without snapshots
Strict tenant isolation requiredApp-only tenant filtering without DB-level guardrails
Frequent schema evolution with many consumersShared raw table access as public contract
Low-latency global writes with strict consistencySingle-region primary without locality strategy
Need point-in-time restoreSystem without tested backup/PITR path

Hard filters prevent “pretty scorecards” from justifying impossible designs.


11. Step 7 — Engine Fit Matrix

This matrix is not a universal truth. It is a decision starting point.

Engine familyStrong fitWeak fitWatch out
Relational OLTPInvariants, joins, transactions, referential integrityMassive write-only event ingestion without partitioningLock/contention, schema migration, reporting overload
Distributed SQLRelational + scale/multi-region needsSimple app where ops complexity is not justifiedTransaction retries, locality, cost, operational maturity
Document DBAggregate/document boundary, flexible document shapeStrong cross-document invariants and joinsUnbounded arrays, duplicated truth, schema drift
Key-value / LSMHigh write throughput, simple key access, event/state storesAd-hoc query, multi-dimensional filteringSecondary indexes, compaction, hot keys
Wide-columnQuery-driven high-scale access patternsExploratory relational queriesPartition key mistakes, tombstones, table-per-query duplication
Graph DBRelationship traversal, network analysis, lineageSimple CRUD with limited traversalHigh-degree nodes, traversal explosion, transaction boundary
Search engineFull-text, relevance, faceting, semantic retrievalCanonical transactional truthStaleness, reindexing, authorization freshness
Vector store/indexSimilarity search, RAG, recommendationExact canonical truthEmbedding versioning, recall/precision, filtering semantics
Warehouse/lakehouseAnalytics, reporting, historical snapshotsLow-latency command processingFreshness, reconciliation, governance, cost
CacheLatency reduction, computed viewSource of truthInvalidation, stampede, stale data

Practical principle

Choose the engine whose failure modes you can operate, not only the one whose happy path looks elegant.


12. Step 8 — Topology Decision

Even after engine selection, topology matters.

PostgreSQL as single primary with replicas is different from PostgreSQL partitioned by time, from database-per-tenant, from logical replication to read models, from distributed SQL.

Topology options

TopologyUse whenRisk
Single primaryStrong consistency, modest scale, simpler operationsPrimary bottleneck, regional latency
Primary + read replicasRead scaling and HAStale reads, lag, failover complexity
Partitioned single databaseLarge table lifecycle/query pruningWrong partition key, migration complexity
Sharded application-levelHorizontal scale/tenant isolationRouting, cross-shard query, resharding
Database-per-tenantIsolation, custom backup/restore, enterprise tenantsOperational overhead, fleet management
Cell-based architectureBlast radius control, tenant groupingRouting, deployment complexity
Distributed SQLScale/multi-region relational semanticsCost, retries, topology-aware schema
Polyglot projectionSpecialized query servingConsistency and rebuild discipline

Topology decision questions

  1. What is the blast radius of one database failure?
  2. What is the blast radius of one tenant spike?
  3. Can a tenant be restored independently?
  4. Can a tenant be migrated independently?
  5. Can data residency be enforced?
  6. How is routing done?
  7. What happens during failover?
  8. What becomes stale?
  9. What is the operational overhead per partition/shard/tenant/cell?

13. Step 9 — Evolution and Migration Fit

A design is not complete unless it can evolve.

Ask:

  • Can we add a column safely?
  • Can we add a constraint safely?
  • Can we change a relationship from one-to-one to one-to-many?
  • Can we split a table?
  • Can we move a tenant?
  • Can we rebuild derived stores?
  • Can we run old and new code together?
  • Can we rollback the application without rolling back data?
  • Can we validate data correctness after migration?

Evolution score

ScoreMeaning
1Change requires downtime/manual repair
2Change possible but risky and untested
3Change possible with expand-contract and backfill
4Change path documented and automated
5Change path continuously tested with validation/reconciliation

A design that is fast on day one but impossible to evolve is not cheap. It is deferred cost.


14. Step 10 — Security and Privacy Fit

Security is not a feature added after schema design.

Security boundary affects:

  • table structure;
  • key design;
  • tenant scoping;
  • row-level policy;
  • view design;
  • export paths;
  • backup access;
  • audit logging;
  • analytics access;
  • support access;
  • deletion/retention design.

Security decision questions

AreaQuestions
IdentityWho is the principal: user, service, tenant, organization?
AuthorizationIs access object-level, row-level, field-level, purpose-based?
Tenant isolationIs tenant boundary enforced by app, DB, topology, or all?
Sensitive dataWhich fields are PII/secret/confidential/evidence-grade?
AuditCan we prove who accessed/changed data?
BackupAre backups encrypted and access-controlled?
AnalyticsDoes sensitive data leak into warehouse/search/export?
Support accessIs break-glass controlled and audited?
ErasureCan deletion propagate to projections and derived stores?

If a design says “authorization is handled elsewhere”, require a concrete proof path.


15. Step 11 — Operational Fit

A database design must match operational capability.

Questions:

  1. Who receives the alert?
  2. What dashboard shows the cause?
  3. What runbook handles the incident?
  4. How do we restore data?
  5. How often do we test restore?
  6. How do we detect bloat, lag, lock storms, queue backlog, projection drift?
  7. Who approves migrations?
  8. How do we stop a bad migration?
  9. How do we test failover?
  10. What is our cost of on-call complexity?

Operational fit score

ScoreMeaning
1Team cannot operate this safely
2Operation depends on one expert
3Basic monitoring/runbook exists
4Alerts, runbooks, drills, ownership exist
5Self-service guardrails and continuous validation exist

A technically superior database choice can be the wrong choice if the organization cannot operate it.


16. Step 12 — Cost of Change and Exit Strategy

Every database decision should include an exit strategy.

Exit does not mean “we can rewrite everything later”. That is not a strategy.

Exit strategy examples:

DecisionExit strategy
Use search projectionProjection rebuild from canonical DB
Use JSONB for dynamic attributesAttribute registry + extraction migration path
Use shared schema multi-tenancyTenant catalog + tenant export/import + DB-per-tenant path
Use relational primaryCDC stream to analytical warehouse/search
Use document DB canonical storeSchema versioning + migration jobs + validation queries
Use distributed SQLLocality-aware schema + fallback strategy for non-global operations

Reversibility classification

ClassMeaningExample
Type 1Hard to reverseCanonical database engine, sharding strategy, tenant isolation model
Type 2Reversible with effortIndex strategy, projection technology, report store
Type 3Easily reversibleQuery tuning, cache TTL, read replica routing rule

Spend review energy proportional to irreversibility.


17. Weighted Scorecard

After hard filters, use scoring.

Example dimensions:

DimensionWeightOption A: PostgreSQL canonical + search projectionOption B: Document DB canonicalOption C: Distributed SQL
Transactional invariants5535
Relationship complexity4535
Query flexibility4535
Write scale3344
Multi-region readiness3235
Migration simplicity4433
Team operational skill5532
Security/RLS fit4534
Analytics boundary3434
Cost predictability3432

Weighted scoring must include notes, not just numbers.

## Scoring Notes

Option A wins for invariant enforcement, operational maturity, SQL query flexibility, and RLS.
Option B is attractive for flexible intake documents but weak for cross-document workflow invariants.
Option C is strong for future global topology but currently operationally expensive for the team.

Decision:
Choose Option A now with explicit exit path:
- search projection for text/semantic search;
- warehouse for analytics;
- tenant extraction path for future silo/cell model;
- revisit distributed SQL if multi-region write latency becomes a real requirement.

18. Decision Pattern: Canonical Store + Projection Store

This is one of the most common production-grade patterns.

Use when:

  • canonical data has transactional invariants;
  • search/reporting access patterns differ from command workload;
  • derived stores can be rebuilt;
  • freshness can be explicitly stated.

Do not use when:

  • projection becomes hidden authority;
  • events are not replayable;
  • deletes/privacy changes do not propagate;
  • consumers query raw internal tables without contract.

19. Decision Pattern: Shared Schema vs Database-per-Tenant

Shared schema

Use when:

  • tenants are numerous and small/medium;
  • cost efficiency matters;
  • schema is common;
  • tenant isolation can be enforced with composite keys/RLS/app controls;
  • per-tenant restore is not a hard business requirement.

Risks:

  • noisy neighbor;
  • tenant filter leak;
  • difficult per-tenant restore;
  • large tenant skew;
  • data residency complexity.

Database-per-tenant

Use when:

  • enterprise tenants require strong isolation;
  • per-tenant backup/restore matters;
  • data residency/custom retention matters;
  • noisy neighbor risk is unacceptable;
  • tenant count is operationally manageable or automated.

Risks:

  • migration fleet complexity;
  • version drift;
  • monitoring explosion;
  • cost overhead;
  • cross-tenant analytics complexity.

Hybrid/cell model

Use when:

  • most tenants can share pools;
  • large/regulated tenants need isolation;
  • routing catalog and operational automation exist.

20. Decision Pattern: Normalize vs Denormalize

Do not frame this as ideology.

Normalize when:

  • correctness matters;
  • relationship has independent lifecycle;
  • data is updated in many places;
  • duplicate copies would drift;
  • reporting/search can be projected separately.

Denormalize when:

  • read path is hot and stable;
  • duplicate data has clear source of truth;
  • freshness contract is explicit;
  • rebuild/repair exists;
  • storage/write amplification is acceptable.

Decision record should state:

## Denormalization Contract

Canonical source:
- `case_party.display_name` from Party/Profile service snapshot at case intake.

Denormalized copy:
- `case_search_document.party_names`.

Freshness:
- <= 60 seconds after CDC event.

Repair:
- Rebuild search document from canonical tables.

Drift detection:
- Daily sample reconciliation and event-lag alert.

21. Decision Pattern: Database Constraint vs Application Validation

Application validation improves user experience.

Database constraints protect truth.

RuleBest home
Email field must look like emailApp + maybe DB check if stable
Case ID must be uniqueDB unique constraint
Evidence must belong to existing caseDB foreign key
User cannot submit after deadlineApp + transaction guard
One active assignment per case/roleDB partial unique index or transaction guard
Tenant isolationDB/RLS + app policy + tests
Workflow transition allowed only by roleApp/service policy + audit + transaction guard

A mature decision does not ask “DB or app?”. It asks:

Which layer prevents corruption?
Which layer improves UX?
Which layer detects violation?
Which layer repairs violation?

22. Decision Pattern: Cache vs Replica vs Read Model

These are not interchangeable.

ChoiceBest forNot for
CacheExpensive computed reads, low-latency repeated readsSource of truth, complex invalidation without owner
Read replicaOffloading SQL reads with acceptable lagCritical fresh read after write unless routed carefully
Materialized viewDB-local derived query resultHighly dynamic per-user search with complex relevance
CQRS read modelStable query serving shapeAd-hoc exploratory analytics
Search projectionText/relevance/facets/vector retrievalTransactional truth
WarehouseAnalytics/reporting/historyLow-latency command guard

Decision rule:

If correctness depends on fresh data, do not route blindly to stale stores.
If the view is derived, define rebuild and drift detection.

23. Decision Pattern: Event Sourcing vs State + History

Event sourcing is powerful when the event log is the primary truth.

But many systems only need:

  • current state table;
  • transition history table;
  • audit trail;
  • outbox events.

Use event sourcing when

  • reconstructing state from events is a core requirement;
  • event sequence is the canonical business record;
  • retroactive interpretation/versioning is expected;
  • audit needs exact command/event timeline;
  • team can operate event replay/versioning.

Use state + history when

  • current state is primary operational need;
  • history is audit/explanation;
  • relational constraints are central;
  • event replay is not required for every query;
  • simpler operations matter.

Decision smell:

"We use event sourcing because audit is required."

Audit requirement alone does not imply event sourcing.


24. Decision Pattern: Single Region vs Multi-Region

Multi-region is not an automatic maturity upgrade.

It adds:

  • latency tradeoffs;
  • data residency complexity;
  • conflict/failover semantics;
  • operational burden;
  • cost;
  • testing requirements;
  • incident complexity.

Single-region primary is often correct when

  • user base is regionally concentrated;
  • write latency is acceptable;
  • DR can be met with replica/PITR;
  • team maturity is still growing;
  • consistency and simplicity matter more than global latency.

Multi-region becomes justified when

  • latency is a product requirement;
  • data residency is mandatory;
  • regional outage tolerance requires regional continuity;
  • entity ownership can be region-scoped;
  • failover and conflict model are explicitly designed.

Multi-region decision contract

## Multi-Region Contract

Home-region rule:
- Each tenant has one home region for writes.

Read rule:
- Local region can serve stale profile/search reads within freshness SLO.
- Case transition reads must go to home region or strongly consistent distributed path.

Failover rule:
- Tenant can be promoted to DR region with fencing epoch.

Conflict rule:
- No active-active writes for the same case.

Residency rule:
- Evidence payload remains in jurisdiction-bound storage.

25. Database ADR Template

Use this template for significant database decisions.

# ADR-DB-XXX: <Decision Title>

Status: Proposed | Accepted | Superseded | Deprecated
Date: YYYY-MM-DD
Owner: <team/person>
Reviewers: <names/roles>
Related ADRs: <links>

## 1. Context

What problem are we solving?
What business/system pressure triggered this decision?
What constraints are non-negotiable?

## 2. Decision

State the decision clearly.
Avoid vague language.

## 3. Scope

In scope:
- ...

Out of scope:
- ...

## 4. Data Authority

| Fact | Authority | Copies | Freshness | Repair Path |
|---|---|---|---|---|

## 5. Invariants

| Invariant | Enforcement | Test | Failure handling |
|---|---|---|---|

## 6. Workload Profile

Commands:
Queries:
Growth:
Concurrency:
Freshness:
Retention:

## 7. Options Considered

| Option | Pros | Cons | Rejection reason |
|---|---|---|---|

## 8. Decision Drivers

- Correctness
- Latency
- Scale
- Operational simplicity
- Security/privacy
- Migration path
- Cost

## 9. Risk and Failure Modes

| Failure mode | Impact | Detection | Mitigation | Recovery |
|---|---|---|---|---|

## 10. Migration / Rollout Plan

Expand:
Backfill:
Validate:
Cutover:
Contract:
Rollback / roll-forward:

## 11. Observability

Metrics:
Logs:
Traces:
Dashboards:
Alerts:
Runbooks:

## 12. Security / Privacy

Access boundary:
Sensitive fields:
Audit:
Retention:
Backup/export handling:

## 13. Consequences

Positive consequences:
Negative consequences:
Deferred work:
Open questions:

## 14. Review Date

Revisit when:
- workload exceeds ...
- tenant count exceeds ...
- p95 latency exceeds ...
- restore time exceeds ...

26. Decision Review Checklist

A decision is not ready if these are unanswered.

Authority

  • Is source of truth explicit?
  • Are derived stores labelled as derived?
  • Is repair/rebuild path defined?
  • Are consumer contracts known?

Invariants

  • Are critical invariants listed?
  • Is each invariant enforced somewhere concrete?
  • Are race conditions considered?
  • Are violations detectable?

Workload

  • Are command/query shapes known?
  • Are latency/freshness expectations documented?
  • Are growth/skew assumptions stated?
  • Are reporting/export paths separate from OLTP when needed?

Consistency

  • Which operations require strong consistency?
  • Which operations tolerate stale reads?
  • Is read-your-writes needed?
  • Is retry behavior documented?

Schema/model

  • Is grain clear?
  • Are keys stable?
  • Are relationships explicit?
  • Is lifecycle modelled?
  • Are temporal/correction semantics clear?

Engine/topology

  • Were hard filters applied before scoring?
  • Are engine failure modes known?
  • Does team have operational skill?
  • Is topology appropriate for blast radius/growth?

Migration

  • Can change be rolled out without downtime?
  • Is compatibility preserved?
  • Is backfill idempotent?
  • Is validation defined?
  • Is rollback/roll-forward realistic?

Security/privacy

  • Is tenant/security boundary enforced?
  • Is PII classified?
  • Are audit and access logs sufficient?
  • Are backup/export/search/warehouse covered?

Operations

  • Are metrics and alerts defined?
  • Is runbook ready?
  • Is backup/restore tested?
  • Are capacity triggers defined?
  • Is incident ownership clear?

27. Example ADR — Case Management Canonical Store

# ADR-DB-081: Use PostgreSQL as Canonical Store for Regulatory Case Management

Status: Accepted
Date: 2026-07-05
Owner: Case Platform Team

## Context

The platform needs to manage regulatory cases with strict lifecycle rules, assignment uniqueness, evidence chain of custody, decision audit, tenant access boundaries, search, and reporting.

## Decision

Use PostgreSQL as the canonical operational store.
Use OpenSearch as a derived search projection.
Use warehouse/lakehouse as analytical reporting platform.
Use outbox/CDC to update derived stores.

## Data Authority

| Fact | Authority | Projection | Freshness |
|---|---|---|---|
| Case status | PostgreSQL | Search, timeline, warehouse | Search <= 60s |
| Assignment | PostgreSQL | Queue read model | Immediate on primary |
| Evidence metadata | PostgreSQL | Search | <= 5m |
| Report metric | Warehouse semantic layer | Dashboard | Daily/hourly watermark |

## Invariants

| Invariant | Enforcement |
|---|---|
| One active primary investigator per case | Partial unique index |
| Closed case cannot accept new task | Transaction guard |
| Evidence under legal hold cannot be purged | Purge transaction guard |
| Tenant cannot read another tenant case | Composite keys + RLS + app policy |

## Options Considered

| Option | Result |
|---|---|
| Document DB canonical | Rejected: cross-document workflow invariants and reporting joins too central |
| Event sourcing as full truth | Deferred: current state + transition history sufficient; replay complexity not justified |
| Distributed SQL now | Deferred: global active-active writes not required yet |
| PostgreSQL canonical + projections | Accepted |

## Consequences

Positive:
- Strong invariant enforcement.
- Mature SQL query/review tooling.
- Clear source of truth.
- Derived stores rebuildable.

Negative:
- Need careful schema migration discipline.
- Need partitioning/archival for audit/event growth.
- Need search projection pipeline.

Revisit when:
- Global write latency becomes product blocker.
- Single-primary write throughput exceeds capacity model.
- Tenant isolation requirements require dedicated databases.

28. Example Decision — Add Graph Database or Not?

Problem:

Investigators want to see relationships between parties, companies, cases, transactions, addresses, and evidence.

Bad decision:

Use graph database because the UI has nodes and edges.

Better decision process:

Authority

Are relationships canonical or derived?

If canonical:

  • graph DB may be source of truth;
  • need strong write/update governance;
  • audit must be graph-native.

If derived:

  • relational operational DB remains truth;
  • graph projection supports exploration;
  • rebuild from canonical entities/relationships.

Workload

QueryFit
Case detail by idRelational
All evidence for caseRelational
Find all parties connected within 3 hopsGraph
Identify central actor by relationship countGraph/analytics
Official decision recordRelational/audit

Decision

Use graph projection for investigation exploration.
Canonical relationship facts remain in PostgreSQL.
Graph projection is rebuilt from relationship tables and CDC.
Graph results are exploratory and must link back to canonical evidence before decision use.

This avoids turning exploratory tooling into hidden authority.


29. Example Decision — JSONB Dynamic Attributes

Dynamic fields are tempting.

Bad decision:

Put custom fields in JSONB because requirements change often.

Better decision:

Classify fields

Field typeStorage decision
Stable domain fieldColumn
Field used in invariantColumn/constraint
Field used in filter/sort frequentlyColumn or governed generated/indexed expression
Rare display-only custom fieldJSONB with registry
Versioned form submissionJSON document with schema version
Evidence payload metadataHybrid: core columns + JSON envelope

JSONB contract

JSONB allowed for:
- low-criticality custom metadata;
- form snapshots;
- rarely queried extensibility fields.

JSONB not allowed for:
- lifecycle state;
- tenant boundary;
- permission-critical fields;
- high-volume queue filters;
- fields required for foreign key/reference integrity.

Required controls:
- schema version;
- attribute registry;
- validation layer;
- index review for any promoted query field;
- extraction path to real column.

30. Example Decision — Strong RLS or App-Only Authorization?

For low-risk internal tools, app-only authorization might be acceptable.

For tenant-regulated systems, app-only authorization is fragile.

Decision matrix:

RequirementApp-onlyDB RLS + app
Simple app, few queriesGoodMaybe overkill
Multi-tenant sensitive dataRiskyStrong fit
Ad-hoc support SQLDangerousSafer if roles/policies enforced
Analytics/export boundaryStill needs governanceStill needs governance
DebuggabilityEasierRequires policy tooling
Performance simplicityEasierNeeds index/policy testing

A mature decision might be:

Use app authorization for business actions.
Use database RLS as tenant/case confidentiality invariant.
Use service roles with least privilege.
Use policy tests for allowed/denied scenarios.
Use break-glass role with approval and audit.

31. Decision Evidence: What Counts as Proof?

Architecture arguments should rely on evidence.

Evidence examples:

ClaimEvidence
Query shape fits indexEXPLAIN ANALYZE on realistic data
Migration safeLock test + staging rehearsal + rollback/roll-forward plan
Projection rebuildableRebuild dry-run + checksum/reconciliation
RLS correctPolicy test suite with allow/deny cases
Backup sufficientRestore drill result and measured RTO
Tenant isolation worksCross-tenant negative tests + monitoring
Capacity sufficientLoad test + growth model + headroom policy
Event consumer idempotentDuplicate/replay test

Opinion is useful for hypothesis.

Evidence is needed for decision.


32. Architecture Decision Smells

SmellWhy dangerous
“We can migrate later” without migration pathUsually false under production load
“Database is just persistence”Ignores invariant and authority
“Use NoSQL for scale”Scale type undefined
“Use relational for consistency”Specific invariants and isolation not stated
“Use events for audit”Audit and event sourcing conflated
“Use cache to fix performance”May hide query/model problems
“Replica reads are safe”Staleness not classified
“Search index has latest state”Projection treated as truth
“RLS is too slow” without testPerformance claim without evidence
“DBA will handle it”Ownership unclear
“All reports query production”Workload isolation ignored
“One database for everything”Could be fine, but must be deliberate
“Polyglot is modern”Could create operational debt

33. Decision Review Meeting Format

Keep review efficient.

60-minute structure

TimeTopic
0–5 minProblem and non-negotiable constraints
5–15 minData authority and invariants
15–25 minWorkload and consistency contract
25–35 minOptions and hard filters
35–45 minRisks, failure modes, migration path
45–55 minSecurity, operations, cost of change
55–60 minDecision, action items, revisit trigger

The review should end with one of:

  1. accepted;
  2. accepted with conditions;
  3. rejected with reason;
  4. needs evidence;
  5. split decision into smaller ADRs.

34. Revisit Triggers

A good decision includes conditions for re-evaluation.

Examples:

Revisit this decision when:
- p95 case queue query exceeds 300ms for 3 consecutive business days;
- largest tenant exceeds 20% of pooled database write load;
- daily audit table growth exceeds forecast by 2x;
- restore drill exceeds RTO;
- search projection lag exceeds freshness SLO twice in one month;
- cross-region latency becomes a signed product requirement;
- schema migration takes more than 30 minutes of elevated risk window;
- compliance requires per-tenant data residency.

Without revisit triggers, decisions become stale doctrine.


35. Mini-Framework for Fast Decisions

Not every decision deserves a 20-page ADR.

For small database decisions, use this lightweight form.

## Quick DB Decision

Decision:

Why now:

Data authority affected:

Invariant affected:

Workload affected:

Alternatives considered:

Risk:

Rollback / recovery:

Owner:

Review date / trigger:

Use this for:

  • adding a non-critical index;
  • adding a projection field;
  • changing cache TTL;
  • adding a small table;
  • introducing a materialized view;
  • changing retention for non-sensitive derived data.

Use full ADR for:

  • new canonical store;
  • tenant isolation model;
  • sharding/partitioning strategy;
  • RLS strategy;
  • distributed SQL adoption;
  • event sourcing decision;
  • warehouse/lakehouse architecture;
  • major schema refactor;
  • data residency/security decision.

36. The Final Decision Rule

When options are close, prefer the option that:

  1. makes truth explicit;
  2. makes invalid state hard to create;
  3. makes failure visible;
  4. makes recovery testable;
  5. makes migration possible;
  6. matches team operational capability;
  7. keeps derived data rebuildable;
  8. keeps security boundary enforceable;
  9. avoids irreversible complexity until justified;
  10. leaves evidence for future engineers.

Top-tier database architecture is not the art of picking the most powerful database. It is the discipline of preserving truth under change, scale, concurrency, failure, and human turnover.


37. Practice Drills

Drill 1 — Case workflow database

You need to design a database for regulatory cases with:

  • 10M cases/year;
  • full audit trail;
  • complex assignment workflow;
  • search by party/evidence text;
  • monthly compliance reporting;
  • tenant confidentiality;
  • possible enterprise tenant isolation.

Produce:

  1. data authority map;
  2. invariant list;
  3. workload profile;
  4. engine/topology options;
  5. decision ADR.

Drill 2 — Ledger

You need a high-volume ledger with:

  • double-entry transactions;
  • idempotent external references;
  • balance queries;
  • reversal/correction;
  • reconciliation;
  • CDC to downstream accounting.

Choose between:

  • relational OLTP;
  • key-value store + event log;
  • distributed SQL;
  • event sourcing.

Explain hard filters and failure modes.

Drill 3 — Global SaaS

You need a SaaS platform with:

  • EU and US tenants;
  • data residency;
  • enterprise tenant restore;
  • low-latency reads;
  • occasional cross-region collaboration;
  • strict authorization.

Design:

  1. tenant topology;
  2. routing catalog;
  3. consistency contract;
  4. failover model;
  5. migration path.

38. Key Takeaways

  • Database decisions should start from authority, not engine.
  • Invariants determine what must be protected transactionally.
  • Workload determines serving shape, indexing, projections, and topology.
  • Consistency should be per operation, not globally hand-waved.
  • Hard filters should happen before weighted scoring.
  • Every decision is a constraint, a bet, and a recovery plan.
  • ADRs are not bureaucracy when they preserve context, consequences, and revisit triggers.
  • Good database architecture optimizes for correctness under change, not just performance on day one.

Source Anchors

  • AWS Well-Architected Framework — decision tradeoffs, operational excellence, reliability, security, performance, and cost dimensions.
  • AWS Prescriptive Guidance — Architectural Decision Records process.
  • Google Cloud Well-Architected / Architecture Framework — operational excellence, reliability, security, privacy, cost, and performance categories.
  • PostgreSQL Documentation — constraints, indexes, transactions, row-level security, monitoring, and migration mechanics.
  • Prior parts in this series — invariants-first design, workload-first design, replication, partitioning, CDC, schema evolution, observability, and production readiness.
Lesson Recap

You just completed lesson 81 in final stretch. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.