Backup, Restore, and Recovery Objectives

Learn Database Design and Architect - Part 032

Backup, restore, RPO, RTO, point-in-time recovery, restore drills, blast-radius control, and the operational discipline behind recoverable databases.

[2026-07-05] 22 min read, 4256 words

Part 032 — Backup, Restore, and Recovery Objectives

A backup that has never been restored is not a recovery mechanism. It is an optimistic file.

Crash recovery protects the database from incomplete physical state after process, OS, or power failure.

Backup and restore protect the organization from broader failure classes:

accidental delete;
bad migration;
bad deploy;
data corruption discovered late;
ransomware;
credential compromise;
storage loss;
region failure;
logical bug that slowly damages data;
tenant-specific recovery request;
audit/regulatory reconstruction;
environment rebuild;
legal hold and retention conflict.

This part is about designing recoverability as a product capability, not a checkbox.

1. The Core Mental Model

Backup is not the goal.

Restore is the goal.

Recovery objectives define what restore must achieve.

Architectural rule:

The system is only recoverable if restore has been tested, timed, secured, automated enough, and validated against business correctness.

2. RPO and RTO

Two terms matter more than almost everything else.

RPO — Recovery Point Objective

RPO answers:

How much data can the business afford to lose?

Examples:

RPO	Meaning
0 seconds	No acknowledged transaction may be lost
5 seconds	Losing last few seconds may be acceptable
15 minutes	Restore may lose recent quarter-hour of data
24 hours	Daily backup is enough

RPO is about data loss.

RTO — Recovery Time Objective

RTO answers:

How long can the service be down or degraded while recovering?

Examples:

RTO	Meaning
< 1 minute	Automated failover / hot standby required
< 15 minutes	Warm standby or fast restore path required
< 4 hours	Snapshot restore may be acceptable
< 24 hours	Manual restore from archive may be acceptable

RTO is about time to service restoration.

Common mistake:

“We have daily backups” answers neither RPO nor RTO completely.

Daily backups alone imply up to 24 hours of data loss (more if the most recent backup failed) unless they are combined with log archiving for PITR. They also say nothing about how long a restore takes.
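The arithmetic behind that statement can be made explicit. The following is a minimal sketch, not a sizing tool: the function name and parameters are hypothetical, and it assumes every scheduled backup and log archive run succeeds on time.

```python
from datetime import timedelta
from typing import Optional


def worst_case_data_loss(
    backup_interval: timedelta,
    log_archive_interval: Optional[timedelta] = None,
) -> timedelta:
    """Upper bound on lost data if the primary is destroyed.

    Without log archiving, everything since the last backup is lost.
    With log archiving, only changes not yet archived are lost.
    Assumes all scheduled backup and archive runs succeeded.
    """
    if log_archive_interval is None:
        return backup_interval
    return min(backup_interval, log_archive_interval)


# Daily backups only, versus daily backups plus logs archived every 5 minutes.
daily_only = worst_case_data_loss(timedelta(hours=24))
with_logs = worst_case_data_loss(timedelta(hours=24), timedelta(minutes=5))
print(daily_only, with_logs)
```

Neither number says anything about RTO; that has to be measured by restoring.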

3. Recovery Objectives Are Business Decisions

Engineers can explain cost and mechanism.

Business owners must accept risk.

For each data domain, capture:

Data Domain	Loss Consequence	RPO	RTO	Restore Granularity	Owner
Enforcement decision	Legal/regulatory defect	0 to very low	Low	case/database/PITR	Compliance owner
Financial ledger	Money mismatch	0	Low	database/ledger range	Finance owner
User uploaded evidence	Case integrity loss	Very low	Medium	object + metadata	Case owner
Search projection	Rebuildable	High	Medium	rebuild index	Platform owner
Analytics aggregate	Recomputable	Medium	High	table/date partition	Data owner
Audit log	Legal evidence	Very low	Low/medium	append-only archive	Security/compliance

In this table, Low and High describe how much loss or downtime is tolerated: a low RPO means almost no data loss is acceptable, and a high RPO means the data can be rebuilt or recomputed.

A single application can have multiple recovery classes.

Treating all data identically usually creates either unnecessary cost or unacceptable risk.

4. Backup Types

4.1 Logical Backup

A logical backup exports database objects and data in logical form.

Examples:

SQL dump;
CSV export;
custom dump format;
per-table export;
schema-only export;
tenant-specific logical export.

Strengths:

portable across storage layout changes;
useful for partial restore;
useful for schema inspection;
can migrate between versions in some cases;
smaller for selective data.

Weaknesses:

slower for large databases;
restore can be slow due to index rebuild and constraints;
may not capture cluster-level objects unless explicitly included;
point-in-time recovery usually requires physical/log backup instead;
consistency across multiple databases/services is hard.

Use cases:

small/medium database backup;
per-tenant export;
pre-migration safety copy;
archival extract;
test data seed;
partial logical recovery.

4.2 Physical Backup

A physical backup copies database storage files in engine-specific format.

Examples:

base backup;
file-system copy with correct protocol;
volume snapshot;
managed database snapshot;
storage-level backup.

Strengths:

usually faster for large databases;
preserves physical storage state;
suitable for PITR when combined with WAL/log archive;
good for whole-cluster restore;
often integrates with replication.

Weaknesses:

engine/version/storage specific;
less flexible for partial restore;
may require careful consistency protocol;
can include bloat/dead space;
restore target compatibility matters.

4.3 Continuous Archiving / PITR

Point-in-time recovery needs two ingredients:

a base backup;
a continuous transaction log archive.

Recovery then has two steps:

restore the base backup;
replay logs until the target time, LSN, or restore point.

PITR is essential for recovering from logical mistakes.

Example:

10:00 good state
10:13 bad migration starts
10:15 bad migration commits
10:20 issue detected

Restore to 10:12:59.

4.4 Snapshot Backup

A snapshot captures a point-in-time view of storage.

Strengths:

fast to create;
often incremental at storage layer;
convenient in cloud environments;
useful for large volumes.

Weaknesses:

database consistency depends on snapshot mechanism and database integration;
cross-volume consistency may be tricky;
restore can still take time;
snapshots are not enough if attacker can delete them;
snapshots may not provide fine-grained PITR without log replay.

4.5 Replica as Recovery Aid

A replica can reduce downtime.

But a replica is not a backup.

Scenario	Replica Helps?	Backup/PITR Needed?
Primary node crash	Yes	Still yes
Accidental delete	Usually no; delete replicates	Yes
Bad migration	Usually no; bad change replicates	Yes
Region failover	Yes if cross-region	Still yes
Corruption discovered late	Often no; the damage may already have replicated	Yes
Ransomware deletes primary and replica	No	Yes, isolated backup needed

5. Backup Consistency

A backup must represent a consistent database state.

For a transactional database, “consistent” means the backup must never contain only part of a transaction's changes: each transaction is either fully present or fully absent.

Logical dump tools often create transactionally consistent snapshots.

Physical backups require engine-specific protocol to ensure copied files plus logs can recover to a consistent state.

Unsafe pattern:

cp -r /var/lib/database /backup/database

while the database is running.

That may copy files in mutually inconsistent states.

Safe patterns:

engine-supported base backup;
managed snapshot integrated with database engine;
filesystem snapshot with database backup mode/checkpoint protocol;
logical dump using consistent snapshot;
replication-based backup with correct log coverage.
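
For PostgreSQL, one engine-supported base backup tool is pg_basebackup (listed in the references). The sketch below only builds the command line and does not run it. The host name, user, and target directory are hypothetical, and the option set is one reasonable choice rather than a recommended standard.

```python
import shlex
from typing import List


def build_base_backup_command(target_dir: str, host: str, user: str) -> List[str]:
    """Build a pg_basebackup invocation as an argument list; nothing is executed."""
    return [
        "pg_basebackup",
        f"--host={host}",
        f"--username={user}",
        f"--pgdata={target_dir}",   # where the backup is written
        "--format=tar",             # tar archives instead of a plain directory
        "--gzip",                   # compress the tar output
        "--wal-method=stream",      # stream the WAL needed to make the copy consistent
        "--checkpoint=fast",        # start promptly instead of waiting for a spread checkpoint
        "--progress",
    ]


# Hypothetical values, for illustration only.
command = build_base_backup_command(
    "/backup/base/2026-07-05", "db-primary.internal", "backup_user"
)
print(shlex.join(command))
```

Unlike the `cp -r` pattern, the tool coordinates with the running server so that the copied files plus the streamed WAL can be recovered to a consistent state.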

6. Cross-System Consistency

Modern systems often store truth across multiple components:

relational database for metadata;
object storage for evidence files;
search index for query;
message broker for integration;
warehouse for analytics;
external SaaS for communication.

Restoring only the database may create broken references.

Example:

Database restored to 10:00.
Object storage remains at 10:30.
Search index remains at 10:30.
Broker has messages from 10:20.

Questions:

Are object files versioned?
Can search index be rebuilt from database/outbox?
Can events after restore be replayed or discarded?
Are external messages idempotent?
Are downstream consumers aware of rewind?
Is there a global recovery point?

For most systems, the practical pattern is:

restore source-of-truth database;
rebuild projections from source;
reconcile object storage references;
replay outbox/CDC from restored point;
invalidate incompatible downstream data;
document any external side effect that cannot be undone.
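
The "reconcile object storage references" step can be sketched as a set comparison. This is a minimal illustration: the function name, the report keys, and the object key layout are assumptions, and a real system would stream both listings rather than hold them in memory.

```python
from typing import Dict, Set


def reconcile_object_references(
    db_keys: Set[str], store_keys: Set[str]
) -> Dict[str, Set[str]]:
    """Compare object keys referenced by the restored database with the store."""
    return {
        # Referenced by the database but absent from the store: broken references.
        "missing_in_store": db_keys - store_keys,
        # Present in the store but unknown to the restored database,
        # for example files uploaded after the recovery point.
        "orphaned_in_store": store_keys - db_keys,
    }


report = reconcile_object_references(
    db_keys={"evidence/1.pdf", "evidence/2.pdf"},
    store_keys={"evidence/2.pdf", "evidence/3.pdf"},
)
print(report)
```

Orphaned objects are usually quarantined rather than deleted, because they may be the only surviving copy of data written after the recovery point.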

7. Recovery Granularity

Recovery can happen at different scopes.

Scope	Example	Difficulty	Risk
Whole cluster	restore entire database cluster	Medium	downtime/blast radius
Single database	restore one database	Medium	cross-db dependencies
Single schema	restore tenant schema	Medium-high	dependencies/security
Single tenant	recover tenant accidentally deleted data	High	isolation/partial correctness
Single table	recover dropped table	High	FK/time consistency
Single entity/case	recover one case	Very high	business semantics
Single row	repair row	Very high	audit/provenance risk

Important principle:

The smaller the restore scope, the more logical understanding is required.

Whole database PITR is mechanically simpler but operationally disruptive.

Single-tenant or single-case restore is business-friendly but technically complex.

Design for required granularity early.

8. RPO/RTO Design Matrix

Use a matrix like this during architecture review.

Requirement	Mechanism	Notes
RPO near zero, local node failure	synchronous replication / quorum commit	increases write latency and availability coupling
RPO seconds-minutes	async replication + frequent WAL archive	data loss possible on primary loss
RPO to arbitrary recent time	PITR	requires base backup + continuous logs
RTO seconds-minutes	hot standby / managed failover	failover correctness must be tested
RTO minutes-hours	snapshot restore / warm standby	depends on data size and automation
RTO hours-days	cold backup restore	cheaper, but operationally heavy
tenant-level restore	tenant isolation + logical export/import + repair scripts	must avoid cross-tenant contamination
corruption recovery	backup history + checksums + validation	detection time matters
ransomware resilience	immutable/offline/cross-account backups	access separation matters

There is no free design.

Lower RPO/RTO usually increases:

cost;
operational complexity;
write latency;
replication complexity;
test burden;
security requirements.

9. PITR Timeline Mental Model

Suppose you take base backups every day and archive WAL continuously.

Day 1 00:00  Base backup B1
Day 1 00:00-23:59 WAL archived continuously
Day 2 00:00  Base backup B2
Day 2 00:00-23:59 WAL archived continuously

To restore to Day 2 14:37:

pick base backup B2;
restore B2 files;
fetch WAL from archive after B2 start/checkpoint;
replay until Day 2 14:37;
stop recovery;
validate.

RPO depends on whether WAL is archived up to the desired recovery point.

RTO depends on:

size of base backup;
speed of storage provisioning;
number/size of WAL segments to replay;
CPU/storage performance during replay;
validation process;
DNS/application cutover;
human approval steps;
dependent service rebuild time.
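
The base backup choice and a rough RTO estimate can be sketched as follows. The class, function names, calendar dates, and phase durations are all hypothetical, and the sketch assumes the listed time of each base backup is its completion time. Only a timed restore drill gives a trustworthy number.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class BaseBackup:
    label: str
    finished_at: datetime


def pick_base_backup(backups: List[BaseBackup], target: datetime) -> BaseBackup:
    """Latest base backup that finished at or before the recovery target."""
    candidates = [b for b in backups if b.finished_at <= target]
    if not candidates:
        raise ValueError("no base backup finished before the recovery target")
    return max(candidates, key=lambda b: b.finished_at)


def estimate_rto(phases: List[timedelta]) -> timedelta:
    """Sum of sequential restore phases; ignores retries and waiting time."""
    return sum(phases, timedelta())


# Day 1 and Day 2 mapped onto arbitrary calendar dates.
backups = [
    BaseBackup("B1", datetime(2026, 1, 1, 0, 0)),
    BaseBackup("B2", datetime(2026, 1, 2, 0, 0)),
]
chosen = pick_base_backup(backups, datetime(2026, 1, 2, 14, 37))

estimate = estimate_rto([
    timedelta(minutes=40),  # provision storage and restore base backup
    timedelta(minutes=25),  # replay WAL up to the target
    timedelta(minutes=15),  # validation
    timedelta(minutes=10),  # approval and cutover
])
print(chosen.label, estimate)
```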

10. Logical Backup vs PITR

Do not use one tool for every recovery problem.

Need	Logical Dump	Physical + WAL / PITR
Restore entire large DB fast	Weak	Strong
Restore to exact time	Weak	Strong
Inspect/modify data before import	Strong	Weak
Partial table restore	Strong	Medium/weak
Cross-version migration	Sometimes strong	Risky/dependent
Small DB simplicity	Strong	Maybe overkill
Preserve physical layout	No	Yes
Rebuild indexes during restore	Usually yes	Usually no, already present
Human-readable-ish output	Often yes	No

Production systems often use both:

physical/PITR for disaster and point-in-time recovery;
logical exports for partial recovery, migration, legal archive, and tenant-level tasks.

11. Backup Security

Backups are high-value targets.

They often contain:

production PII;
secrets accidentally stored in tables;
deleted records still within retention;
historical data beyond current app visibility;
audit trails;
credentials/tokens if schema is poor;
sensitive evidence files or references.

Security controls:

encryption at rest;
encryption in transit;
separate backup access roles;
cross-account or isolated backup vault;
immutable retention / write-once controls where appropriate;
MFA/delete protection for backup deletion;
audit access to backups;
restore approval workflow;
secrets scanning for exported dumps;
masking for non-production restore;
key management lifecycle;
tested key recovery.

A common failure:

Backups are encrypted, but the same compromised admin role can delete both database and backups.

Real resilience needs separation of duties and deletion resistance.
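
A separation-of-duties review can start with a simple check for principals that can delete both production data and backups. The permission strings and role names below are hypothetical and do not correspond to any particular cloud provider's policy language.

```python
from typing import Dict, List, Set

# Hypothetical permission names, chosen for illustration.
DESTRUCTIVE_PAIR = {"database:delete", "backup:delete"}


def principals_with_shared_blast_radius(grants: Dict[str, Set[str]]) -> List[str]:
    """Principals that could destroy the database and its backups alone."""
    return sorted(
        principal
        for principal, permissions in grants.items()
        if DESTRUCTIVE_PAIR <= permissions
    )


grants = {
    "prod-admin": {"database:write", "database:delete", "backup:delete"},
    "backup-operator": {"backup:create", "backup:restore"},
    "app": {"database:write"},
}
risky = principals_with_shared_blast_radius(grants)
print(risky)
```

Any name in the result is a single credential whose compromise removes both the data and the means to recover it.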

12. Retention Policy

Retention answers:

How long do we keep recoverable history?

It is constrained by:

compliance requirements;
legal hold;
privacy/erasure obligations;
storage cost;
corruption detection window;
business audit needs;
incident investigation needs;
backup restore compatibility;
encryption key retention.

Example retention tiers:

Tier	Retention	Purpose
PITR logs	7-35 days	recent mistake recovery
Daily backups	30-90 days	operational restore
Monthly backups	1-7 years	compliance/archive
Immutable audit archive	policy-specific	legal/regulatory evidence
Non-prod masked restore	short	testing/debugging

Be careful:

Retention is not only how long backup files exist. It is how long they remain decryptable, restorable, and legally allowed to exist.

13. Privacy and Erasure Tension

Backups complicate privacy deletion.

If a user/entity must be erased from active systems, old backups may still contain their data.

Common strategies:

expire backups after defined retention;
prevent restoring erased data back into active systems without re-erasure process;
maintain deletion tombstone list applied after restore;
encrypt subject-specific data with keys that can be destroyed in some architectures;
document backup retention exception in privacy policy where legally allowed;
separate long-term audit facts from unnecessary PII.

Restore runbook must include:

After restoring backup older than erasure time:
  apply erasure replay/tombstone process
  validate erased subjects are not reintroduced
  log the restore and re-erasure action

Otherwise, a restore can violate privacy obligations by resurrecting deleted data.
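
The tombstone replay step can be sketched as below. The names are hypothetical, and the sketch assumes the erasure ledger is stored outside the restored database; otherwise the ledger itself would be rewound together with the data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Tombstone:
    subject_id: str
    erased_at: datetime


def subjects_to_re_erase(
    tombstones: Iterable[Tombstone], recovery_point: datetime
) -> List[str]:
    """Subjects erased after the recovery point reappear in the restored data."""
    return sorted({t.subject_id for t in tombstones if t.erased_at > recovery_point})


def reintroduced_subjects(restored_ids: Set[str], erased_ids: Iterable[str]) -> Set[str]:
    """Validation: this must be empty once re-erasure has run."""
    return restored_ids & set(erased_ids)


ledger = [
    Tombstone("subject-1", datetime(2026, 3, 1, 9, 0)),   # erased before recovery point
    Tombstone("subject-2", datetime(2026, 3, 5, 16, 30)),  # erased after recovery point
]
recovery_point = datetime(2026, 3, 3, 0, 0)
pending = subjects_to_re_erase(ledger, recovery_point)
still_present = reintroduced_subjects({"subject-2", "subject-9"}, pending)
print(pending, still_present)
```

In this example `still_present` is non-empty, which is the state immediately after restore and before the re-erasure job has run.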

14. Restore Validation

A restore is not complete when the database starts.

It is complete when the recovered system is validated as fit for purpose.

Validation layers:

14.1 Engine-Level Validation

database starts;
WAL replay completed;
no corruption errors;
checksums pass where available;
expected databases/schemas exist;
extension versions compatible.

14.2 Structural Validation

table counts roughly match expected point;
constraints valid;
indexes valid;
migrations at expected version;
required roles/permissions exist;
partition metadata correct.

14.3 Business Validation

known canary records exist;
known post-target records do not exist for PITR;
critical invariants hold;
ledger balances reconcile;
case status history matches current status;
audit event counts match domain operations;
tenant boundaries intact.

14.4 Application Validation

application can connect;
read/write smoke tests pass;
background workers safe to start;
outbox position handled;
search/index projections compatible;
scheduled jobs do not re-run destructive work unexpectedly.

Example validation queries:

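-- current state must match latest status history
-- IS DISTINCT FROM also reports cases that have no history rows at all;
-- a plain <> would silently skip them because the comparison yields NULL
SELECT c.case_id
FROM enforcement_case c
LEFT JOIN LATERAL (
    SELECT h.to_status
    FROM case_status_history h
    WHERE h.case_id = c.case_id
    ORDER BY h.changed_at DESC, h.history_id DESC
    LIMIT 1
) latest ON true
WHERE c.status IS DISTINCT FROM latest.to_status;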

-- no duplicate active case number within tenant
SELECT tenant_id, case_number, count(*)
FROM enforcement_case
WHERE deleted_at IS NULL
GROUP BY tenant_id, case_number
HAVING count(*) > 1;

-- outbox should not contain impossible future messages after PITR target
SELECT count(*)
FROM outbox_message
WHERE created_at > :recovery_target_time;

15. Restore Drill Types

15.1 Full Restore Drill

Purpose:

prove entire database can be restored;
measure RTO;
validate runbook.

Steps:

select backup point;
provision restore environment;
restore database;
run engine validation;
run business validation;
connect application in isolated mode;
record elapsed time and issues;
update runbook.

15.2 PITR Drill

Purpose:

prove recovery to a specific time before a bad operation.

Scenario:

create canary row at T1;
create second canary row at T2;
restore to between T1 and T2;
verify first exists and second does not.
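
The pass/fail logic of this drill is small enough to automate. The sketch below is a minimal check with hypothetical canary identifiers; how the canary rows are written and read back depends on the schema.

```python
from datetime import datetime
from typing import List, Set


def pitr_drill_failures(
    t1: datetime, t2: datetime, target: datetime, restored_canaries: Set[str]
) -> List[str]:
    """Return the reasons the drill failed; an empty list means it passed."""
    failures = []
    if not (t1 < target < t2):
        failures.append("recovery target is not between the two canary writes")
    if "canary-1" not in restored_canaries:
        failures.append("canary written at T1 is missing")
    if "canary-2" in restored_canaries:
        failures.append("canary written at T2 survived the restore")
    return failures


t1 = datetime(2026, 7, 5, 10, 0)
t2 = datetime(2026, 7, 5, 10, 10)
target = datetime(2026, 7, 5, 10, 5)

passed = pitr_drill_failures(t1, t2, target, {"canary-1"})
overshoot = pitr_drill_failures(t1, t2, target, {"canary-1", "canary-2"})
print(passed, overshoot)
```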

15.3 Tenant Restore Drill

Purpose:

recover one tenant without affecting others.

Requires:

tenant-scoped data map;
FK dependency graph;
tenant-safe import process;
conflict handling;
audit of restored data;
validation that other tenants unchanged.

15.4 Corruption Drill

Purpose:

recover from corrupted table/index/file discovered after delay.

Includes:

detection path;
selecting clean backup;
restoring to side environment;
extracting clean data;
repairing production;
documenting data loss window.

15.5 Ransomware / Credential Compromise Drill

Purpose:

prove backups survive admin credential compromise.

Questions:

Can attacker delete backups using same role?
Are backups immutable?
Are backup keys isolated?
Is there offline/cross-account copy?
Can restore happen with break-glass credentials?

16. Restore Runbook Template

# Database Restore Runbook

## 1. Incident Summary
- Incident type:
- Detection time:
- Suspected bad-change time:
- Systems affected:
- Business owner:
- Technical incident commander:

## 2. Recovery Objective
- Target recovery point:
- Maximum accepted data loss:
- Maximum accepted downtime:
- Restore scope:
- Legal/compliance constraints:

## 3. Backup Selection
- Backup ID:
- Backup timestamp:
- WAL/archive range required:
- Encryption key required:
- Backup integrity status:

## 4. Restore Environment
- Target environment:
- Network isolation:
- Credentials:
- Storage size:
- Database version:
- Extensions:

## 5. Restore Steps
1. Provision target.
2. Restore base backup/snapshot.
3. Apply WAL/logs until target.
4. Start database in restricted mode.
5. Run validation.
6. Apply privacy tombstones if needed.
7. Rebuild projections if needed.
8. Cut over application.

## 6. Validation
- Engine checks:
- Structural checks:
- Business invariant checks:
- Application smoke checks:
- Owner sign-off:

## 7. Cutover
- Freeze writes:
- DNS/connection switch:
- Worker restart policy:
- Outbox/CDC handling:
- Monitoring dashboard:

## 8. Post-Recovery
- Root cause:
- Data loss assessment:
- Customer/regulatory notification:
- Backup/runbook improvement:
- Follow-up tasks:

This template should live in the engineering handbook, not in one engineer’s head.

17. Restore Architecture for Multi-Tenant Systems

Multi-tenant restore is hard because tenants share infrastructure but expect isolated recovery.

Shared Table with `tenant_id`

Pros:

efficient resource use;
simpler global operations;
easier shared schema evolution.

Restore challenge:

tenant data is interleaved;
tenant-specific restore requires logical extraction/import;
global sequences and shared reference data complicate replay;
accidental cross-tenant import is severe.

Schema Per Tenant

Pros:

easier tenant-specific logical restore;
clearer boundary;
per-tenant migration possible.

Restore challenge:

many schemas to manage;
shared services still exist;
schema drift risk;
operational automation required.

Database Per Tenant

Pros:

strongest restore isolation;
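PITR per tenant, provided each tenant database runs in its own cluster or instance (PostgreSQL PITR restores a whole cluster, not a single database inside it);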
clearer blast radius.

Restore challenge:

higher cost;
fleet management;
cross-tenant reporting complexity;
migration orchestration.

Architecture principle:

Tenant isolation is not only a query-security decision. It is also a backup/restore decision.

18. Data Dependency Map

Before you can restore safely, you need a data dependency map.

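Example for case management: the nodes are the components listed under cross-system consistency (relational database, object storage, search index, message broker, warehouse, external SaaS).

For each node, document: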

source of truth or projection;
backup mechanism;
restore order;
consistency check;
retention rule;
security classification;
tenant ownership;
rebuild possibility.

Without this map, partial restore becomes guesswork.
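
One way to make the map executable is to record it as a graph and derive a restore order from it. The node names and edges below are hypothetical; the sketch uses `graphlib.TopologicalSorter` from the Python standard library (3.9+), where each key maps to the set of nodes that must be restored before it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: node -> nodes it depends on.
dependencies = {
    "case_database": set(),
    "evidence_object_store": {"case_database"},
    "search_index": {"case_database"},
    "outbox_publisher": {"case_database", "search_index"},
    "analytics_warehouse": {"outbox_publisher"},
}

# static_order() yields every node after all of its dependencies.
# A cycle in the map raises graphlib.CycleError, which is itself a useful finding.
restore_order = list(TopologicalSorter(dependencies).static_order())
print(restore_order)
```

The exact order among independent nodes is not significant; what matters is that the source of truth comes first and projections come after it.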

19. Restore Order

A general restore order:

infrastructure/network/security prerequisites;
database engine and storage;
source-of-truth database;
object/file storage metadata alignment;
audit/security tables;
application configuration/secrets;
projections/read models/search indexes;
background workers;
integration/event streams;
external-facing application traffic.

Do not start all workers immediately after restore.

Workers may:

publish old outbox messages;
replay scheduled jobs;
send duplicate emails;
call external APIs;
mutate restored data before validation;
rebuild projections from wrong checkpoint.

Safe restore starts in restricted mode.

20. Backup Monitoring

Monitor the backup system like production.

Key signals:

Signal	Why It Matters
Last successful backup time	detects backup job failure
Last successful WAL/log archive time	protects PITR RPO
Backup size trend	detects abnormal growth/shrink
Backup duration	detects RTO/operational drift
Restore test age	detects unproven backup process
WAL archive lag	detects RPO risk
Replication slot retained bytes	detects disk-full risk
Snapshot deletion events	detects attack/operator error
Backup encryption/key status	detects unrecoverable backup
Validation failure count	detects silent corruption/process drift

Bad metric:

backup_job_exit_code = 0

Better metric:

last_restorable_point_age_seconds
restore_drill_last_success_timestamp
restore_drill_duration_seconds
validated_recovered_database_version

A system is not backed up because a cron job succeeded.

It is backed up when you can prove a restorable point exists.
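
A metric such as `last_restorable_point_age_seconds` could be derived roughly as follows. This is a sketch under a strong assumption: the log archive is contiguous from the latest base backup up to the last archived log. Verifying that contiguity is a separate check.

```python
from datetime import datetime, timezone
from typing import Optional


def last_restorable_point_age_seconds(
    last_base_backup_finished_at: Optional[datetime],
    last_archived_log_at: Optional[datetime],
    now: datetime,
) -> Optional[float]:
    """Age of the newest point that could be restored, or None if there is none."""
    if last_base_backup_finished_at is None:
        return None  # no base backup means nothing can be restored
    restorable_point = last_base_backup_finished_at
    # Archived logs extend the restorable point beyond the base backup.
    if last_archived_log_at is not None and last_archived_log_at > restorable_point:
        restorable_point = last_archived_log_at
    return (now - restorable_point).total_seconds()


now = datetime(2026, 7, 5, 12, 0, tzinfo=timezone.utc)
age = last_restorable_point_age_seconds(
    datetime(2026, 7, 5, 0, 0, tzinfo=timezone.utc),
    datetime(2026, 7, 5, 11, 55, tzinfo=timezone.utc),
    now,
)
print(age)
```

Alerting on this value against the agreed RPO is more meaningful than alerting on a job exit code.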

21. Backup Failure Modes

21.1 Backup Succeeds but Restore Fails

Causes:

missing WAL segments;
missing encryption key;
incompatible database version;
corrupted backup file;
incomplete snapshot set;
extension not installed;
restore procedure undocumented;
credentials unavailable.

Mitigation:

automated restore drill;
checksum verification;
dependency manifest;
key recovery test;
version compatibility matrix.

21.2 PITR Gap

Causes:

WAL archive command failed;
log retention too short;
archive bucket access denied;
replication slot issue;
wrong timeline selected;
backup catalog metadata wrong.

Mitigation:

alert on archive lag;
periodically restore to random timestamp;
retain enough logs;
monitor backup catalog.
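
For PostgreSQL archives, a gap check over WAL segment file names is one cheap way to catch a broken chain early. The sketch assumes the default 16 MB segment size and a single timeline, and expects the caller to filter out non-segment files such as `.history` or `.backup` entries. The helper names are hypothetical.

```python
from typing import List

# With 16 MB segments there are 0x100 segments per 4 GB "log id".
SEGMENTS_PER_LOG_ID = 0x100


def segment_number(name: str) -> int:
    """Linear position of a 24-hex-digit WAL segment name (timeline ignored)."""
    if len(name) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    log_id = int(name[8:16], 16)   # middle 8 hex digits
    segment = int(name[16:24], 16)  # last 8 hex digits
    return log_id * SEGMENTS_PER_LOG_ID + segment


def find_gaps(archived: List[str]) -> List[int]:
    """Segment numbers missing between the oldest and newest archived segment."""
    present = {segment_number(name) for name in archived}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


archived = [
    "0000000100000000000000FE",
    "0000000100000000000000FF",
    "000000010000000100000001",  # 000000010000000100000000 was never archived
]
gaps = find_gaps(archived)
print(gaps)
```

A gap check complements, but does not replace, periodically restoring to a random timestamp.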

21.3 Backup Contains Corruption

Causes:

corruption existed before backup;
no checksums/validation;
logical bug damaged data long before detection.

Mitigation:

keep long enough history;
validate invariants regularly;
use checksums where possible;
sample restore older backups;
maintain audit/history for logical repair.

21.4 Restore Reintroduces Deleted Private Data

Causes:

restored old backup after privacy erasure;
no tombstone replay process;
no post-restore erasure validation.

Mitigation:

erasure ledger;
post-restore deletion replay;
restore runbook privacy step;
minimized PII retention.

21.5 Backup Deleted by Attacker

Causes:

same admin role controls production and backup deletion;
no immutable backup retention;
no cross-account copy;
no deletion alert.

Mitigation:

immutable backup vault;
access separation;
MFA/delete protection;
offline/cross-account copy;
tested break-glass restore.

22. Pre-Migration Backup Strategy

Before dangerous database migration:

classify migration risk;
verify latest restorable point;
take explicit pre-change backup/snapshot if needed;
confirm WAL archiving/PITR is healthy;
estimate rollback feasibility;
define forward-fix strategy;
define validation queries;
define abort thresholds;
pause conflicting jobs;
document restore decision owner.

For high-risk migrations, “rollback” may be impossible once writes continue.

Examples of changes that are hard to roll back:

drop column;
rewrite table;
merge entities;
destructive deduplication;
data anonymization;
partition detach/drop;
change primary key model.

Safer approach:

expand-contract migration;
shadow copy;
backfill with validation;
dual read/write where needed;
cutover;
delayed cleanup after backup retention window.

23. Backup and Analytics/Reporting

Do not run heavy analytics directly against a restored backup unless the restore environment was designed for that workload.

But backups can support:

forensic analysis;
regulatory point-in-time report reconstruction;
debugging bad migration;
data diff between time points;
audit evidence extraction.

Pattern:

restore backup into isolated forensic environment;
disable external side effects;
run read-only analysis;
export approved evidence/report;
destroy environment according to policy.

Never casually restore a production backup into a low-security dev environment.

Backups have production sensitivity.

24. Case Study — Accidental Case Closure

Scenario:

A buggy deploy runs:

UPDATE enforcement_case
SET status = 'closed'
WHERE status <> 'closed';

Detected 12 minutes later.

Questions:

Can you identify exact bad-change time?
Are there legitimate updates after the bad change?
Is full database rewind acceptable?
Can you repair logically using audit/status history?
Do downstream systems need compensation?
Were notifications sent based on bad closure?
Do backups allow PITR to before the bad change?
How do you preserve evidence of the incident?

Possible recovery strategies:

Strategy A — Full PITR Rewind

Use when:

system can tolerate losing all writes after target time;
incident detected quickly;
downstream effects manageable.

Pros:

clean mechanical recovery.

Cons:

loses legitimate post-target writes;
requires outage/cutover;
downstream rewind complexity.

Strategy B — Logical Repair

Use when:

full rewind unacceptable;
audit/status history can reconstruct prior state;
bad update is well-scoped.

Pros:

preserves legitimate writes.

Cons:

higher logical complexity;
must prove correctness;
may need compensating downstream events.

Example repair sketch:

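-- Assumes the bad batch wrote status history rows that can be identified.
-- A real repair would also append a status history row for each correction.
WITH previous_status AS (
    -- earliest bad closure per case inside the incident window
    SELECT DISTINCT ON (case_id)
        case_id,
        from_status AS status_before_bad_change
    FROM case_status_history
    WHERE changed_at >= :bad_change_started_at
      AND changed_at <= :bad_change_ended_at
      AND to_status = 'closed'
      AND reason_code = 'buggy_batch_closure'
    ORDER BY case_id, changed_at ASC, history_id ASC
)
UPDATE enforcement_case c
SET status = p.status_before_bad_change,
    updated_at = now()
FROM previous_status p
WHERE c.case_id = p.case_id
  AND c.status = 'closed';  -- skip cases whose status changed again after the incident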

A senior architect designs the schema so Strategy B is possible, while keeping Strategy A available through PITR.

25. Recovery Design Document Section

Every serious database design doc should include this section:

## Recovery Design

### Data Classes
- Source-of-truth tables:
- Rebuildable projections:
- External objects/files:
- Audit/evidence records:

### Objectives
- RPO by data class:
- RTO by incident type:
- Restore granularity:
- Retention requirements:

### Backup Mechanisms
- Logical backup:
- Physical backup:
- PITR/WAL archive:
- Snapshot:
- Cross-region/cross-account copy:

### Restore Procedures
- Full restore:
- PITR restore:
- Tenant restore:
- Partial logical restore:
- Object storage reconciliation:
- Projection rebuild:

### Validation
- Engine checks:
- Structural checks:
- Business invariant checks:
- Security/privacy checks:

### Drills
- Frequency:
- Last drill result:
- Measured RPO:
- Measured RTO:
- Known gaps:

If this section is empty, the architecture is not production-ready.

26. Senior-Level Heuristics

Heuristic 1 — Restore Is the Product

Backup tooling is an implementation detail.

The product is the ability to recover a valid business state within accepted time and loss boundaries.

Heuristic 2 — RPO/RTO Must Be Measured, Not Claimed

A slide can say RTO is 30 minutes.

A restore drill tells the truth.

Heuristic 3 — Replicas Copy Mistakes

Use replicas for availability and read scaling.

Use backups/PITR for going back.

Heuristic 4 — Partial Restore Requires Data Model Discipline

Tenant restore, case restore, and table restore are easier when ownership boundaries, FK relationships, audit history, and tenant keys are clean.

Heuristic 5 — Security of Backups Is Security of Production

If backups are exposed, production data is exposed.

Heuristic 6 — Recovery Must Include Downstream Effects

Database restore without search/index/broker/object-store strategy is incomplete.

Heuristic 7 — Every Dangerous Migration Needs a Recovery Story

Before you run destructive DDL/DML, know the restore point, validation plan, and decision owner.

27. Practice Exercises

Exercise 1 — Define Recovery Objectives

For a regulatory case management platform, define RPO/RTO for:

case metadata;
evidence files;
case status history;
audit logs;
search index;
analytics reports.

Explain why they differ.

Exercise 2 — PITR Drill

Design a PITR test that proves you can restore to a point between two canary writes.

Include:

canary creation;
backup/log verification;
target restore time;
validation query;
elapsed time measurement.

Exercise 3 — Tenant Restore

Assume shared tables with tenant_id.

Design a tenant-specific restore procedure after accidental deletion of one tenant’s cases.

Include:

dependency map;
extraction source;
import ordering;
conflict handling;
audit trail;
verification that other tenants are unchanged.

Exercise 4 — Backup Security Review

Review a backup system where the production DBA role can delete backups.

Identify failure modes and propose controls.

Exercise 5 — Bad Migration Recovery

A migration incorrectly normalizes party records and merges unrelated people.

Choose between PITR and logical repair.

What evidence do you need to decide?

28. What You Should Remember

Backup is not a checkbox.

Recoverability is a system property.

The architect’s job is to ensure:

recovery objectives are explicit;
mechanisms match objectives;
backups are consistent, secure, and retained;
restore is tested regularly;
PITR works end to end;
partial recovery is designed where required;
privacy and retention are not afterthoughts;
downstream projections and external objects are included;
every dangerous migration has a recovery story.

The core phrase to remember:

We do not have backups unless we have proven restores.

The next part moves into replication models: leader-follower, synchronous/asynchronous replication, quorum, lag, failover, and the difference between scaling reads and preserving truth.

References

PostgreSQL Documentation — SQL Dump: https://www.postgresql.org/docs/current/backup-dump.html
PostgreSQL Documentation — Continuous Archiving and Point-in-Time Recovery: https://www.postgresql.org/docs/current/continuous-archiving.html
PostgreSQL Documentation — pg_basebackup: https://www.postgresql.org/docs/current/app-pgbasebackup.html
AWS Prescriptive Guidance — Choosing the right database for RTO and RPO: https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-database-disaster-recovery/choosing-database.html
AWS Backup Documentation — Continuous backups and point-in-time recovery: https://docs.aws.amazon.com/aws-backup/latest/devguide/point-in-time-recovery.html
Amazon RDS Documentation — Restoring a DB instance to a specified time: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIT.html
