Failure Mode and Risk Modelling
Learn Database Design and Architect - Part 059
Failure-mode and risk modelling for production-grade database architecture: failure taxonomy, blast radius, FMEA, fault trees, controls, degraded modes, and risk review method.
Part 059 — Failure Mode and Risk Modelling
Database architecture is not finished when the schema is normalized, the indexes are fast, and replication is configured. It is finished when the team can explain how the design fails, how the failure is detected, how damage is contained, how recovery works, and how correctness is proven afterward.
This part gives you a practical failure-modelling method for database systems. The goal is not pessimism. The goal is disciplined design.
A strong database architect does not ask only:
Can this design work?
They ask:
How does this design fail under concurrency, load, migration, operator error,
replication lag, corrupt input, stale consumers, bad assumptions, and partial recovery?
That question is where senior database design begins.
1. Learning Goals
After this part, you should be able to:
- Model database failures beyond “database down”.
- Separate correctness failure, availability failure, performance failure, durability failure, confidentiality failure, and operability failure.
- Build a database-specific FMEA: Failure Mode and Effects Analysis.
- Draw fault trees for data-loss, data-corruption, stale-read, tenant-leakage, and unrecoverable-migration scenarios.
- Assign controls into prevention, detection, containment, recovery, and proof.
- Convert abstract risk into concrete schema, transaction, index, migration, backup, monitoring, and runbook decisions.
- Design degraded modes that preserve core invariants.
- Review a database architecture with a risk register rather than opinions.
2. The Core Mental Model
A database failure is not simply the loss of a process, node, disk, or connection.
A production database is responsible for several promises:
| Promise | Meaning |
|---|---|
| Correctness | The database rejects illegal state and preserves business invariants. |
| Availability | Required operations can be performed when needed. |
| Durability | Committed data survives expected failures. |
| Recoverability | The system can be restored to an acceptable point within acceptable time. |
| Confidentiality | Only authorized principals can see sensitive data. |
| Integrity | Data cannot be silently altered, lost, duplicated, or mis-associated. |
| Traceability | Important changes can be explained later with evidence. |
| Evolvability | Schema and access patterns can change without unsafe downtime or corruption. |
| Operability | Operators can diagnose, mitigate, and recover under pressure. |
So the correct question is not:
Is the database reliable?
The correct question is:
Which promise can fail, through which mechanism, with what blast radius,
under which detection signal, with what recovery path, and what proof of recovery?
That sentence is the backbone of this part.
3. Why Database Failure Modelling Is Different
Application failures are often reversible. A bad API deployment can be rolled back. A worker can be restarted. A cache can be flushed.
Database failures are different because they can modify the system of truth.
A bad database failure may create one of these outcomes:
- Data is unavailable.
- Data is wrong.
- Data is missing.
- Data is duplicated.
- Data is visible to the wrong party.
- Data history becomes unverifiable.
- Recovery overwrites valid newer data.
- Consumers process inconsistent projections.
- Backups exist but cannot restore the required state.
- Operators do not know which records are affected.
The dangerous cases are not always loud. A failed database node is obvious. A silent invariant violation can live for months.
4. Failure Taxonomy
Use this taxonomy during architecture review.
4.1 Availability Failure
The operation cannot complete.
Examples:
- Database unavailable.
- Connection pool exhausted.
- Primary unavailable and failover not complete.
- Query times out.
- Lock wait exceeds SLA.
- Disk full prevents writes.
- Replica lag prevents read scaling.
Availability failure is often visible, but its root cause may be correctness, capacity, schema, index, or operational failure.
4.2 Correctness Failure
The database accepts or returns a state that violates expected business meaning.
Examples:
- Two active assignments for a case that must have one owner.
- Negative balance where impossible.
- Case transitions from
CLOSEDback toUNDER_REVIEWwithout reopening event. - Invoice paid twice because idempotency was missing.
- A soft-deleted row participates in uniqueness incorrectly.
- Effective-dated records overlap.
Correctness failures are architecturally serious because backups faithfully preserve wrong data.
4.3 Durability Failure
A committed write is lost or not recoverable.
Examples:
- WAL not flushed as assumed.
- Async replica promoted before receiving latest commits.
- Snapshot backup without consistent log stream.
- Misconfigured storage cache lies about flush.
- Manual restore omits required schema/data dependency.
Durability failure is rare in mature systems, but when it happens, it destroys trust.
4.4 Recoverability Failure
The system technically has backups, but cannot restore the business state within required RPO/RTO.
Examples:
- Backup is corrupt.
- Restore process was never tested.
- PITR target time is ambiguous.
- Tenant-level restore is impossible in shared schema.
- Restore replays bad migration or bad batch job.
- Backup contains encrypted data but key is unavailable.
A backup is not a recovery capability. A tested restore is.
4.5 Consistency Failure Across Boundaries
The canonical store and derived systems disagree beyond their freshness contract.
Examples:
- Search index shows deleted document.
- Reporting table misses reversal transaction.
- CDC consumer processes out of order events.
- Cache returns unauthorized stale data after permission revocation.
- Replica read violates workflow guard.
This failure often happens when the database design is correct locally but integration contracts are weak.
4.6 Security Failure
Unauthorized access, modification, exfiltration, or privilege escalation occurs.
Examples:
- Shared application role can read all tenants.
- SQL injection reaches tables directly.
- Admin credentials reused by service runtime.
- Read replica exposed to analytics users with production PII.
- Backup copied to insecure storage.
- CDC stream leaks sensitive payloads.
Security failures are database architecture failures when the schema, roles, grants, boundaries, and audit path do not enforce intent.
4.7 Operability Failure
The system may be technically recoverable, but operators cannot understand or safely act in time.
Examples:
- No query fingerprint to identify top offenders.
- No migration identifier in lock storm.
- No tenant dimension in metrics.
- No runbook for replication lag.
- Alert fires after disk is already full.
- Data repair script has no dry-run mode.
Operability failure converts small incidents into large incidents.
4.8 Evolvability Failure
The database cannot change safely.
Examples:
- Table has no compatibility layer for old and new application versions.
- Enum/status migration requires global downtime.
- Backfill cannot resume after failure.
- Contract consumers are unknown.
- Column rename breaks reports and CDC consumers.
- Data model has no versioning for historical records.
A database that cannot evolve becomes a hidden product constraint.
5. Failure Surfaces in Database Architecture
A failure surface is a place where a system promise can be violated.
Do not review only tables and indexes. Review all surfaces.
6. Risk Is Not One Number
Many teams score risk as:
risk = likelihood × impact
That is useful but incomplete for database design.
Use this richer model:
risk = likelihood × impact × undetectability × unrecoverability × blast_radius
Where:
| Factor | Question |
|---|---|
| Likelihood | How likely is this failure under expected workload and change frequency? |
| Impact | What business, user, regulatory, financial, or operational damage occurs? |
| Detectability | How quickly and precisely can we detect it? |
| Recoverability | Can we repair or restore safely? |
| Blast radius | How much data, how many tenants, how many consumers, and how many operations are affected? |
A rare failure with huge impact and poor recovery can be more important than a frequent small failure.
7. Database FMEA Template
FMEA means Failure Mode and Effects Analysis. For databases, do not make it academic. Make it executable.
Use this template:
| Field | Meaning |
|---|---|
| Component | Table, transaction, job, replica, migration, index, CDC flow, backup flow, role, etc. |
| Failure mode | What can go wrong? |
| Cause | Why can it happen? |
| Effect | What breaks? |
| Invariant impacted | Which system truth becomes unsafe? |
| Detection signal | Metric, log, query, audit, reconciliation, alert. |
| Prevention control | Constraint, transaction design, role, migration gate, test. |
| Containment control | Partition, tenant boundary, rate limit, circuit breaker, feature flag. |
| Recovery control | Restore, replay, repair script, rebuild, compensation. |
| Proof of recovery | Query/report/reconciliation showing state is correct. |
| Owner | Team or role accountable. |
8. Example FMEA: Regulatory Case Assignment
Assume this invariant:
A case can have at most one active primary assignee at a time.
A weak schema might use:
CREATE TABLE case_assignment (
id uuid PRIMARY KEY,
case_id uuid NOT NULL,
user_id uuid NOT NULL,
assignment_type text NOT NULL,
active boolean NOT NULL DEFAULT true,
assigned_at timestamptz NOT NULL DEFAULT now(),
revoked_at timestamptz
);
This schema says active, but it does not prevent two active primary assignees.
FMEA:
| Field | Example |
|---|---|
| Component | case_assignment write path |
| Failure mode | Two active primary assignments for same case |
| Cause | Concurrent reassignment; missing partial unique index; retry duplicates command |
| Effect | Two officers act on same case; SLA and accountability become ambiguous |
| Invariant impacted | Single active primary owner per case |
| Detection signal | Reconciliation query counts active primary assignments per case |
| Prevention | Partial unique index on (case_id) where active and type is primary |
| Containment | Reassignment command serialized by case row lock |
| Recovery | Resolve active assignment by transition event; preserve duplicate as revoked correction |
| Proof | Query returns zero cases with more than one active primary assignment |
Better schema-level control:
CREATE UNIQUE INDEX uq_one_active_primary_assignment
ON case_assignment (case_id)
WHERE active = true AND assignment_type = 'PRIMARY';
But that is not enough if the assignment transition also needs audit history. A production path should use a transaction:
BEGIN;
SELECT id
FROM enforcement_case
WHERE id = :case_id
FOR UPDATE;
UPDATE case_assignment
SET active = false,
revoked_at = now(),
revoked_reason = :reason
WHERE case_id = :case_id
AND assignment_type = 'PRIMARY'
AND active = true;
INSERT INTO case_assignment (
id,
case_id,
user_id,
assignment_type,
active,
assigned_at,
assigned_by,
command_id
)
VALUES (
gen_random_uuid(),
:case_id,
:new_user_id,
'PRIMARY',
true,
now(),
:actor_id,
:command_id
);
INSERT INTO case_timeline_event (
id,
case_id,
event_type,
occurred_at,
actor_id,
payload
)
VALUES (
gen_random_uuid(),
:case_id,
'PRIMARY_ASSIGNEE_CHANGED',
now(),
:actor_id,
jsonb_build_object('newUserId', :new_user_id, 'reason', :reason)
);
COMMIT;
The lock is not merely a performance detail. It is a containment control.
9. Controls: Prevention, Detection, Containment, Recovery, Proof
A mature risk model has five control types.
9.1 Prevention
Prevention reduces the chance of failure.
Examples:
NOT NULL,CHECK,UNIQUE,FOREIGN KEY.- Exclusion constraint for non-overlapping validity intervals.
- Idempotency table.
- Transaction boundary.
- Least privilege database role.
- Migration precheck.
statement_timeoutandlock_timeout.CREATE INDEX CONCURRENTLYwhere appropriate.- Backfill batching and resume cursor.
9.2 Detection
Detection reveals failure or drift.
Examples:
- Reconciliation query.
- Slow query logs.
- Lock wait alert.
- Replication lag metric.
- Row-count mismatch between canonical and projection stores.
- Duplicate command metric.
- Audit completeness check.
- Backup restore drill failure.
9.3 Containment
Containment limits blast radius.
Examples:
- Tenant partitioning.
- Cell architecture.
- Feature flag.
- Rate limit per tenant.
- Circuit breaker for heavy reports.
- Read-only degraded mode.
- Quarantine table for invalid imports.
- Per-consumer CDC offset.
- Separate database role for batch job.
9.4 Recovery
Recovery returns the system to acceptable state.
Examples:
- Point-in-time restore.
- Roll-forward data repair.
- Replay event log.
- Rebuild projection.
- Re-run idempotent batch.
- Compensating transaction.
- Tenant-level restore.
- Reconciliation-driven repair script.
9.5 Proof
Proof demonstrates the system is correct after recovery.
Examples:
- Invariant query returns zero violations.
- Balance reconciliation matches ledger.
- CDC consumer offset reaches source LSN.
- Search index checksum matches source table count/version.
- Audit chain has no missing sequence.
- Sampled restored records match expected point in time.
Proof is often missing. Without proof, recovery is only a hope.
10. Fault Tree Analysis
FMEA starts from a component. Fault tree analysis starts from an unacceptable outcome.
Example top event:
User sees another tenant's confidential case data.
Fault tree:
This reveals a key principle:
If one missed predicate can leak tenant data, the security architecture is too fragile.
11. Blast Radius Modelling
For every failure mode, ask four questions:
- Scope of data: one row, one case, one account, one tenant, one partition, all data?
- Scope of users: one user, one team, one customer, all customers, external public?
- Scope of time: current request, last hour, all history, future decisions?
- Scope of consumers: only OLTP app, reports, search, CDC, downstream systems, backups?
Example:
| Failure mode | Bad blast radius | Better blast radius |
|---|---|---|
| Bad import | Corrupts canonical tables directly | Lands in quarantine table, validated before merge |
| Hot tenant | Saturates whole shared DB | Isolated by tenant rate limit/cell/shard |
| Bad migration | Locks critical table globally | Runs in batches with timeout and feature flag |
| Wrong report query | Exports all tenants | Role/view enforces tenant and masking boundary |
| CDC consumer bug | All projections corrupted | Per-consumer offset + replay + rebuild path |
Design for small blast radius before you design for heroic recovery.
12. Risk Categories by Database Layer
12.1 Schema Layer
Failure modes:
- Missing uniqueness.
- Weak FK strategy.
- Nullable field with overloaded meaning.
- No effective-date non-overlap enforcement.
- Status column with invalid transitions.
- JSON blob hides invariant-critical data.
Controls:
- Constraints.
- Generated columns.
- Partial indexes.
- Exclusion constraints.
- State transition table.
- Schema review checklist.
12.2 Transaction Layer
Failure modes:
- Lost update.
- Write skew.
- Double execution of command.
- External side effect before commit.
- Long transaction blocks vacuum or locks.
Controls:
- Atomic conditional update.
- Row lock.
- Serializable transaction with retry.
- Idempotency key.
- Transactional outbox.
- Short transaction discipline.
12.3 Query and Index Layer
Failure modes:
- Bad plan after data growth.
- Full scan under peak traffic.
- Sort spill.
- Offset pagination degradation.
- Index bloat.
- Write amplification due to too many indexes.
Controls:
- Query review.
EXPLAIN ANALYZEbaseline.- Composite index aligned to predicate/order.
- Keyset pagination.
- Statistics maintenance.
- Index lifecycle governance.
12.4 Replication Layer
Failure modes:
- Stale read violates workflow rule.
- Failover loses async commits.
- Replica slot retains WAL until disk pressure.
- Split brain accepts writes in two primaries.
Controls:
- Read routing by freshness contract.
- Synchronous replication where business requires it.
- Promotion runbook.
- Fencing token.
- Lag alerting.
- Replica slot monitoring.
12.5 Migration Layer
Failure modes:
- Unsafe DDL locks production table.
- New app version assumes column exists while old version still writes old shape.
- Backfill corrupts rows.
- Rollback impossible after destructive contract phase.
Controls:
- Expand-migrate-contract.
- Compatibility window.
- Backfill checkpoint.
- Dry-run and validation query.
- Lock timeout.
- Kill switch.
12.6 Backup and Recovery Layer
Failure modes:
- Backup missing WAL required for PITR.
- Backup encrypted with lost key.
- Restore takes longer than RTO.
- Tenant-specific restore impossible.
- Restore is logically corrupt because bad data was already committed.
Controls:
- Restore drill.
- PITR test.
- Key recovery test.
- Backup inventory.
- Reconciliation after restore.
- Logical export for selected recovery needs.
12.7 Security Layer
Failure modes:
- Overprivileged role.
- RLS bypass.
- Misconfigured read replica.
- Backup leakage.
- Sensitive field copied to logs, reports, or CDC.
Controls:
- Least privilege grants.
- RLS/tenant enforcement.
- Secret rotation.
- Encryption.
- Audit.
- Masked views.
- Separate roles for app, migration, analyst, support, and admin.
13. Risk Register Format
Create a database risk register per major system or bounded context.
# Database Risk Register
## Risk: Duplicate enforcement case created for same complaint
- ID: DB-RISK-001
- Owner: Case Platform Team
- Status: Open
- Impact: High
- Likelihood: Medium
- Detectability: Medium
- Recoverability: Medium
- Blast radius: Per complaint / per tenant
### Failure Mode
Concurrent imports or retries create two active cases for the same external complaint.
### Impact
Duplicate investigation, conflicting decisions, invalid SLA reporting, audit confusion.
### Existing Controls
- External complaint id stored.
- Unique index on `(tenant_id, source_system, external_complaint_id)`.
- Idempotency key for import command.
### Missing Controls
- Reconciliation dashboard for duplicate suspect cases.
- Merge workflow with audit event.
- Import replay proof.
### Detection
Daily query: active duplicate candidates by external complaint id.
### Recovery
Merge duplicate case using approved merge workflow. Preserve both original records in timeline.
### Proof
No duplicate active case remains for same `(tenant_id, source_system, external_complaint_id)`.
A risk register is useful only if it changes design decisions.
14. Scoring Model
Use a simple 1–5 scale.
| Score | Likelihood | Impact | Detectability | Recoverability | Blast radius |
|---|---|---|---|---|---|
| 1 | Rare | Minor | Immediate | Automatic | Single row/request |
| 2 | Unlikely | Low | Fast | Simple manual | Single user/case |
| 3 | Possible | Medium | Delayed | Scripted repair | Tenant or feature |
| 4 | Likely | High | Hard | Restore/replay required | Multiple tenants/systems |
| 5 | Expected | Critical | Silent | Unclear/impossible | Global/systemic |
Example calculation:
risk_score = likelihood + impact + detectability + recoverability + blast_radius
Or weighted:
risk_score = impact*3 + blast_radius*3 + recoverability*2 + detectability*2 + likelihood
For database systems, impact, blast radius, and recoverability usually matter more than likelihood.
15. Example: Bad Migration Risk Model
Failure mode:
Add NOT NULL column with default to very large table during peak traffic.
Effects:
- Table rewrite or heavy lock depending engine/version/DDL shape.
- Write path blocked.
- Connection pool fills.
- API timeouts.
- Retry storm increases load.
- Operators kill migration halfway.
- Application version expects column but migration incomplete.
Controls:
| Control | Type |
|---|---|
| Add nullable column first | Prevention |
| Backfill in batches | Prevention/containment |
lock_timeout and statement_timeout | Containment |
| Feature flag read/write path | Containment |
| Validation query before contract | Detection/proof |
| Roll-forward migration plan | Recovery |
| Migration dashboard | Detection |
| Maintenance window for risky DDL | Containment |
Mermaid risk path:
16. Example: Stale Replica Risk Model
Failure mode:
Application reads workflow state from replica immediately after write to primary.
Effects:
- User sees old status.
- Guard condition evaluates incorrectly.
- Duplicate command submitted.
- Case transitions incorrectly.
- Audit timeline appears inconsistent.
Controls:
| Control | Type |
|---|---|
| Query classification: guard reads use primary | Prevention |
| Sticky primary after write | Prevention |
| Freshness token / LSN wait | Prevention |
| Replica lag alert | Detection |
| Idempotency key | Containment |
| State transition constraint | Containment |
| Duplicate command reconciliation | Recovery/proof |
Design rule:
A replica is a scalability tool, not a correctness boundary, unless the operation explicitly tolerates staleness.
17. Example: Silent Data Corruption
Failure mode:
Batch job recalculates derived balances using an incorrect filter.
Effects:
- Derived balance wrong.
- Reports wrong.
- Customer notices discrepancy.
- Downstream exports propagate wrong numbers.
- Audit questions arise.
Controls:
| Control | Type |
|---|---|
| Append-only ledger as canonical source | Prevention |
| Derived table marked rebuildable | Containment |
| Batch job writes job_id and calculation_version | Detection/recovery |
| Reconciliation between ledger and balance | Detection/proof |
| Shadow calculation before cutover | Prevention |
| Ability to rebuild derived balances | Recovery |
Important distinction:
If a value is derived, design it as disposable and rebuildable.
If a value is canonical, design it as protected and auditable.
18. Detection Queries as Architecture Artifacts
A database design should ship with detection queries.
Duplicate active primary assignee
SELECT case_id, count(*) AS active_primary_count
FROM case_assignment
WHERE active = true
AND assignment_type = 'PRIMARY'
GROUP BY case_id
HAVING count(*) > 1;
Orphan records that should be impossible
SELECT e.id, e.case_id
FROM evidence_document e
LEFT JOIN enforcement_case c ON c.id = e.case_id
WHERE c.id IS NULL;
Overlapping effective intervals
SELECT a.id AS id_a, b.id AS id_b, a.subject_id
FROM subject_assignment a
JOIN subject_assignment b
ON a.subject_id = b.subject_id
AND a.id < b.id
AND tstzrange(a.valid_from, a.valid_to, '[)') && tstzrange(b.valid_from, b.valid_to, '[)');
Outbox lag
SELECT
now() - min(created_at) AS oldest_unpublished_age,
count(*) AS pending_count
FROM outbox_event
WHERE published_at IS NULL;
Tenant distribution skew
SELECT tenant_id, count(*) AS row_count
FROM enforcement_case
GROUP BY tenant_id
ORDER BY row_count DESC
LIMIT 20;
These queries are not merely monitoring snippets. They are executable statements of your invariants.
19. Failure Injection
Do not wait for production to discover whether recovery works.
Run controlled failure experiments.
| Experiment | Question answered |
|---|---|
| Kill app during transaction | Are writes atomic and retry-safe? |
| Kill worker after DB commit before publish | Does outbox publish later? |
| Delay replica | Do stale-sensitive reads route to primary? |
| Fill disk in staging | Do alerts and runbooks work before total failure? |
| Cancel migration mid-backfill | Can it resume safely? |
| Reprocess event stream | Are consumers idempotent? |
| Restore backup to staging | Can we meet RTO and prove correctness? |
| Revoke DB privilege | Does app fail closed with useful error? |
| Inject duplicate command | Does idempotency return same result? |
| Simulate tenant hot key | Is blast radius contained? |
Failure injection is not chaos for entertainment. It is proof collection.
20. Degraded Modes
A degraded mode is a deliberate reduced-capability state that preserves critical invariants.
Examples:
| Failure | Bad response | Better degraded mode |
|---|---|---|
| Replica lag high | Continue stale reads for all pages | Route sensitive reads to primary; disable stale dashboards warning-free |
| Search index stale | Show outdated confidential documents | Show “search updating” and fallback to canonical query for sensitive records |
| Outbox lag | Keep accepting all commands blindly | Accept commands but expose downstream-lag warning and throttle noncritical events |
| Reporting DB unavailable | Query OLTP heavily | Disable heavy reports, keep operational writes available |
| Migration issue | Continue rollout | Freeze deploy, disable feature flag, run validation |
| Backup failure | Ignore until next cycle | Treat as reliability incident; freeze destructive migrations if required |
Degraded mode design principle:
When full functionality is unsafe, preserve correctness, confidentiality,
and recoverability before convenience.
21. Recovery Strategies
21.1 Rollback
Useful for code, dangerous for data if destructive change has already happened.
Rollback works when:
- Old schema still exists.
- Old application version can read/write current data.
- No irreversible transformation occurred.
- External side effects can be ignored or compensated.
21.2 Roll-Forward
Often safer for databases.
Roll-forward works by applying a corrective migration or repair that preserves history.
Example:
Bad migration wrote wrong status for 10,000 cases.
Do not restore whole database if valid newer updates exist.
Instead, identify affected records, create correction events, repair current state,
and prove invariant health afterward.
21.3 Point-in-Time Restore
Useful for catastrophic corruption or data loss, but complex when new valid writes happened after the target time.
Questions:
- What is the exact target time?
- Which writes after target time must be preserved?
- Can affected tenant be restored separately?
- Are encryption keys available?
- Will replay re-trigger external side effects?
21.4 Replay
Useful when canonical log/event/outbox exists.
Replay works when:
- Events are durable.
- Consumers are idempotent.
- Event schema is versioned.
- Ordering scope is understood.
- Rebuild target is disposable.
21.5 Reconciliation + Repair
Useful for data correctness failure.
Pattern:
- Detect affected records.
- Freeze or limit affected write path if needed.
- Export affected set.
- Produce expected state from canonical source.
- Apply repair in batches.
- Write audit repair record.
- Run proof queries.
- Publish incident summary.
22. Database Risk Review Meeting
A good database risk review is structured.
Inputs
- Data model.
- Transaction design.
- Query workload.
- Migration plan.
- Backup/restore plan.
- Security model.
- Integration/CDC contract.
- Observability plan.
- Runbook.
Questions
- What data is canonical?
- What state must never exist?
- Which failures create wrong data rather than unavailable data?
- Which operations rely on stale reads?
- Which writes are retry-safe?
- Which migrations are destructive?
- Which restore scenarios have been tested?
- Which consumers can be rebuilt?
- Which access paths can leak data?
- What is the proof of recovery for each critical failure?
Output
- Approved controls.
- Open risks.
- Required tests.
- Runbook tasks.
- Observability gaps.
- Architecture decision records.
23. Risk Modelling for Regulatory Systems
Regulatory and enforcement systems have special risk characteristics:
- Decisions must be explainable.
- Evidence history matters.
- Deadlines and SLA have legal or procedural weight.
- Unauthorized disclosure can be severe.
- Incorrect state transitions can invalidate process.
- Audit trails must preserve who, what, when, why, and under which authority.
- Corrections must not erase prior facts without trace.
Typical high-risk failure modes:
| Failure mode | Risk |
|---|---|
| Evidence deleted without audit | Loss of defensibility |
| Case owner ambiguity | Accountability failure |
| SLA recalculation wrong | Procedural unfairness or missed enforcement deadline |
| Role revocation not reflected in search projection | Confidential data leakage |
| Duplicate case not merged properly | Conflicting decisions |
| State transition skipped | Invalid workflow outcome |
| Backdated update without transaction/effective-time distinction | Historical reconstruction failure |
Design response:
- Prefer append-only timeline for decisions and evidence events.
- Store correction events, not silent overwrite, for important facts.
- Put hard invariants in database constraints where possible.
- Treat report reproducibility as an architecture requirement.
- Treat access control changes as time-sensitive writes.
- Make repair actions auditable.
24. Anti-Patterns
24.1 “We Have Backups” Without Restore Proof
Backups do not prove recoverability.
A restore drill proves recoverability.
24.2 “The App Validates That”
Application validation is useful, but concurrency, batch jobs, admin scripts, migrations, and direct integrations can bypass it.
Database constraints are the last line of defense for core invariants.
24.3 “Read Replicas Are Always Safe for Reads”
Some reads are correctness-sensitive.
A read that guards a write, checks authorization freshness, or validates workflow state may need primary or freshness guarantee.
24.4 “The Migration Worked in Dev”
Dev data does not have production size, skew, lock contention, index bloat, replication lag, or old consumers.
Migration safety must be tested with production-like data shape.
24.5 “We Can Repair It Manually”
Manual repair without script, audit, dry-run, and proof is a new failure mode.
24.6 “Only One Consumer Uses This Table”
Hidden consumers include reports, exports, support tools, ad hoc analytics, CDC pipelines, search indexes, caches, and compliance jobs.
25. Database Risk Checklist
Use this before production launch.
Correctness
- Critical invariants are listed.
- Each invariant has enforcement location.
- Database-level constraints exist for hard invariants where feasible.
- Concurrency anomalies are identified.
- Retry and idempotency strategy exists.
- Derived data has reconciliation or rebuild path.
Availability and Performance
- Workload envelope is defined.
- Critical queries have execution plan baseline.
- Lock-sensitive paths are reviewed.
- Hot tenant/key/row scenarios are considered.
- Connection pool limits and timeouts are configured.
- Degraded mode exists for heavy noncritical operations.
Replication and Consistency
- Read routing is classified by freshness need.
- Replica lag alerts exist.
- Failover behavior is understood.
- Promotion data-loss risk is accepted or mitigated.
- CDC ordering and replay strategy are documented.
Migration
- Migration follows expand-migrate-contract where needed.
- Backfill is resumable.
- Validation queries exist.
- Lock timeout and rollback/roll-forward plan exist.
- Consumer compatibility is reviewed.
Backup and Recovery
- RPO/RTO are explicit.
- Restore drill has been run.
- PITR target selection is understood.
- Backup encryption keys are recoverable.
- Tenant-level restore needs are addressed.
- Proof-of-restore queries exist.
Security
- Database roles follow least privilege.
- Tenant/data access boundary is enforced below application where needed.
- Sensitive data in backup, replica, CDC, and export is protected.
- Admin and break-glass access are audited.
- Security failure modes are in risk register.
Operability
- Metrics cover query, lock, WAL, disk, replication, connection, and job health.
- Alerts fire before hard failure.
- Runbooks exist for top incidents.
- Reconciliation queries exist for core invariants.
- Repair scripts have dry-run and audit mode.
26. Production Review Exercise
Take any table in your system and answer:
- What is the canonical fact represented by this table?
- What illegal states should never exist?
- Which illegal states are prevented by constraints?
- Which are prevented only by application code?
- What happens under concurrent writes?
- What happens if a command is retried?
- What happens if a migration fails halfway?
- What happens if this table is restored to yesterday?
- What derived systems depend on it?
- What query proves it is healthy?
If you cannot answer these questions, the table is not architecture-reviewed yet.
27. Summary
Failure-mode and risk modelling turns database architecture from design optimism into operational truth.
The key lessons:
- Database failure is broader than outage.
- Wrong data can be worse than unavailable data.
- Every critical invariant needs enforcement, detection, recovery, and proof.
- Risk must include blast radius and recoverability, not only likelihood.
- FMEA starts from components; fault trees start from unacceptable outcomes.
- Degraded modes should preserve correctness, confidentiality, and recoverability first.
- Backups are not recovery until restored and verified.
- Security, migration, replication, and integration risks are database architecture risks.
- A production database design should ship with detection queries and runbooks.
- Top-level engineers design not only for the happy path, but for controlled failure.
28. References
- AWS Well-Architected Reliability Pillar — design principles, failure management, disaster recovery, RTO/RPO, multi-location deployment: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
- AWS Well-Architected Framework Reliability overview: https://docs.aws.amazon.com/wellarchitected/latest/framework/reliability.html
- AWS Reliability Pillar — Plan for Disaster Recovery: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/plan-for-disaster-recovery-dr.html
- PostgreSQL Documentation — Monitoring Database Activity and Statistics: https://www.postgresql.org/docs/current/monitoring-stats.html
- PostgreSQL Documentation — Explicit Locking: https://www.postgresql.org/docs/current/explicit-locking.html
- PostgreSQL Documentation — Transaction Isolation: https://www.postgresql.org/docs/current/transaction-iso.html
- PostgreSQL Documentation — Row Security Policies: https://www.postgresql.org/docs/current/ddl-rowsecurity.html
- NIST SP 800-53 Rev. 5 — Security and Privacy Controls for Information Systems and Organizations: https://csrc.nist.gov/pubs/sp/800/53/r5/upd1/final
You just completed lesson 59 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.