Series/Learn Database Design and Architect

Deepen PracticeOrdered learning track

Capacity Planning and Growth Model

Learn Database Design and Architect - Part 056

Capacity planning and growth modelling for production databases: demand model, storage growth, index amplification, WAL, memory, CPU, I/O, connections, tenant skew, retention, headroom, scaling triggers, forecasting, and cost-aware architecture decisions.

[2026-07-05]16 min read3142 words

In This Lesson

1. Core Mental Model 2. Stop Guessing Capacity 3. Capacity Dimensions

PrevNext

Lesson 5684 lesson track46–69 Deepen Practice

#database#capacity-planning#growth-model#performance-engineering+4 more

Part 056 — Capacity Planning and Growth Model

Capacity planning is not guessing how large the database should be.

It is modelling when the database will become unsafe, slow, expensive, or operationally difficult as demand grows.

A weak capacity plan says:

We have a big database instance, so we are fine.

A strong capacity plan says:

At current growth, the audit table reaches 4.8B rows in 14 months. The current partitioning strategy still supports retention purge, but index size exceeds memory by Q3, replica replay becomes the limiting factor during peak write bursts, and backup restore time exceeds RTO by November unless we change backup topology or archival strategy.

Capacity is a multidimensional envelope.

This part teaches how to build that envelope.

1. Core Mental Model

Capacity planning is a function of four things:

capacity_required = demand × data_shape × workload_cost × safety_margin

Where:

demand = users, tenants, TPS, QPS, batch jobs, reports, integrations
data shape = row count, row width, history, indexes, partitions, skew, retention
workload cost = CPU, memory, I/O, WAL, locks, network per operation
safety margin = headroom for spikes, failover, maintenance, incidents, growth uncertainty

If you ignore any of these, the plan is incomplete.

Capacity is not just storage.

It includes:

CPU
memory
buffer cache
storage size
storage IOPS
storage throughput
write-ahead log throughput
network throughput
connection capacity
lock/concurrency capacity
replica apply capacity
backup window
restore time
maintenance work
partition count
index size
operational human capacity
cost envelope

A database can have enough disk and still be out of capacity because p99 latency fails, replica lag grows, or backup restore exceeds RTO.

2. Stop Guessing Capacity

Cloud infrastructure can make capacity easier to change, but it does not remove the need to understand demand.

The AWS Well-Architected reliability guidance explicitly frames “stop guessing capacity” as a design principle: monitor demand and utilization, then add or remove resources to satisfy demand without over- or under-provisioning.

For databases, this principle has a catch:

vertical scaling may require failover/restart
storage grows easily, but I/O and restore time may not
partitioning is hard to retrofit
sharding is much harder to retrofit
indexes grow invisibly until memory/I/O behavior changes
retention mistakes become expensive years later
large tenants can break assumptions before total system volume does

Capacity planning must happen before the database is in pain.

3. Capacity Dimensions

3.1 Storage Capacity

Storage includes:

table data
indexes
TOAST/large object storage
WAL files
temporary files
bloat/dead tuples
materialized views
projections
archive partitions
backup snapshots
replication slots retained WAL

Do not estimate only table rows.

3.2 CPU Capacity

CPU is consumed by:

query execution
joins
aggregation
sorting
JSON processing
compression/decompression
encryption
constraint checks
triggers
query planning
background workers
replication apply
autovacuum

CPU bottlenecks often appear after query complexity or concurrency grows, not merely after storage grows.

3.3 Memory Capacity

Memory is consumed by:

buffer cache
connection overhead
sort/hash memory
maintenance operations
prepared statement caches
OS page cache
background workers

A system can degrade when the working set no longer fits in memory.

3.4 I/O Capacity

I/O includes:

random reads
sequential reads
random writes
WAL writes
checkpoint writes
vacuum reads/writes
index maintenance
temporary file spills
backup reads
replication reads

Storage capacity in GB is not the same as storage performance.

3.5 WAL Capacity

Write-heavy systems are often limited by WAL throughput.

WAL grows with:

inserts
updates
deletes
index modifications
full-page writes after checkpoint
bulk backfills
table rewrites
outbox/audit writes
high-churn tables

A design with many indexes and audit/outbox tables may generate much more WAL than table-size growth suggests.

3.6 Connection Capacity

Connections consume memory and scheduling overhead.

Capacity must define:

maximum database connections
application pool size per service
number of service replicas
background worker connections
migration/admin connections
reporting connections
failover surge behavior

The database is not a free connection sink.

3.7 Lock/Concurrency Capacity

Some designs saturate on locks before CPU or I/O.

Examples:

one counter row per tenant
one queue row for all workers
parent-row locking for high-frequency child inserts
hot account balance row
global sequence-like business identifier
high-contention state transition

Capacity planning must include contention hotspots.

3.8 Operational Capacity

Operational capacity includes:

backup duration
restore duration
migration duration
vacuum/maintenance duration
index rebuild duration
failover duration
incident diagnosis complexity
on-call ability to operate system safely

A design is not production-ready if humans cannot recover it within required time.

4. Build a Demand Model

Start from business/product growth.

business_growth:
  tenants:
    current: 120
    monthly_growth: 8
    projected_12_months: 216
  users:
    current_active_daily: 25_000
    monthly_growth_percent: 5
  cases:
    created_per_day_current: 180_000
    created_per_day_projected_12_months: 420_000
  evidence_documents:
    avg_per_case: 3.5
    p95_per_case: 20
    p99_per_case: 120
  audit_events:
    avg_per_case_lifetime: 25
  retention:
    cases_years: 7
    audit_years: 10

Translate this into database growth.

5. Entity Growth Model

For each major table, estimate row growth.

table_growth:
  cases:
    current_rows: 80_000_000
    daily_insert_rows: 180_000
    monthly_insert_rows: 5_400_000
    yearly_insert_rows: 65_700_000
    retention_years: 7
  evidence_metadata:
    current_rows: 260_000_000
    daily_insert_rows: 630_000
    yearly_insert_rows: 229_950_000
    retention_years: 7
  audit_event:
    current_rows: 1_800_000_000
    daily_insert_rows: 4_500_000
    yearly_insert_rows: 1_642_500_000
    retention_years: 10
  outbox_event:
    current_rows: 50_000_000
    daily_insert_rows: 1_200_000
    retention_days: 14

Do not average everything together.

Audit, outbox, timeline, and history tables often grow faster than the root entity table.

6. Storage Growth Formula

A rough storage model:

monthly_table_growth = monthly_rows × avg_row_size_bytes
monthly_index_growth = monthly_rows × sum(avg_index_entry_size_bytes)
monthly_bloat_budget = (table_growth + index_growth) × expected_bloat_factor
monthly_total_growth = table_growth + index_growth + bloat_budget + derived_storage_growth

Example:

cases_storage_model:
  monthly_rows: 5_400_000
  avg_heap_row_size_bytes: 700
  indexes:
    cases_pkey: 32
    idx_cases_tenant_status_created: 64
    idx_cases_assignee_open: 56
    idx_cases_case_number: 48
  total_index_entry_size_bytes: 200
  monthly_heap_growth_gb: 3.52
  monthly_index_growth_gb: 1.01
  bloat_budget_percent: 20
  monthly_total_growth_gb: 5.44

For audit tables:

audit_storage_model:
  monthly_rows: 135_000_000
  avg_heap_row_size_bytes: 900
  total_index_entry_size_bytes: 160
  monthly_heap_growth_gb: 113.2
  monthly_index_growth_gb: 20.1
  bloat_budget_percent: 10
  monthly_total_growth_gb: 146.6

Audit/history can dominate storage.

7. Index Amplification

Index amplification is the ratio of index storage to table storage.

index_amplification = total_index_size / table_heap_size

If a table has 1 TB heap and 2 TB indexes, index amplification is 2.0.

High index amplification increases:

storage cost
backup size
restore time
insert/update/delete cost
WAL volume
cache pressure
vacuum cost
replication apply work

Every index must justify itself.

Index capacity review:

index_capacity_review:
  table: cases
  heap_size_gb: 900
  index_size_gb: 1400
  index_amplification: 1.56
  indexes:
    - name: cases_pkey
      purpose: identity lookup
      keep: true
    - name: idx_cases_tenant_status_created
      purpose: open case list
      keep: true
    - name: idx_cases_assignee_open
      purpose: user queue
      keep: true
    - name: idx_cases_priority
      purpose: unknown
      usage_last_30_days: low
      action: review/drop

Capacity planning is also index governance.

8. WAL Growth Model

WAL is often the hidden bottleneck.

Estimate WAL per operation from measurement, not pure theory.

wal_measurement:
  create_case:
    avg_wal_bytes: 18_000
  assign_case:
    avg_wal_bytes: 12_500
  add_evidence_metadata:
    avg_wal_bytes: 8_000
  audit_only_event:
    avg_wal_bytes: 4_000

Then model peak WAL:

wal_bytes_per_second = sum(operation_tps × avg_wal_bytes_per_operation)

Example:

peak_wal_model:
  create_case:
    tps: 150
    wal_bytes_per_op: 18_000
    wal_mb_per_sec: 2.57
  assign_case:
    tps: 200
    wal_bytes_per_op: 12_500
    wal_mb_per_sec: 2.38
  add_evidence:
    tps: 300
    wal_bytes_per_op: 8_000
    wal_mb_per_sec: 2.29
  audit_event:
    tps: 900
    wal_bytes_per_op: 4_000
    wal_mb_per_sec: 3.43
  total_peak_wal_mb_per_sec: 10.67

Now compare against:

storage write throughput
WAL disk capacity
archive throughput
replica replay capacity
network throughput
backup/PITR assumptions

WAL capacity matters for durability, replication, recovery, and storage pressure.

9. Backup and Restore Capacity

Capacity planning must include restore time.

Storage size alone is not the risk. Restore duration is.

restore_capacity_model:
  database_size_tb: 8
  backup_restore_throughput_mb_per_sec: 500
  estimated_restore_seconds: 16777
  estimated_restore_hours: 4.66
  rto_hours: 2
  status: fails_rto

If restore time exceeds RTO, options include:

smaller failure domains
partition-level restore strategy
warm standby
more frequent snapshots
cross-region replica
archival split
reduce retained hot data
tenant isolation for critical tenants
faster storage/restore pipeline

A system with 10-year retention in the primary OLTP database may be easy to query and hard to recover.

10. Memory and Working Set Model

The database does not need all data in memory.

It needs the working set in memory.

Working set includes:

hot indexes
frequently accessed table pages
active tenant data
queue rows
recent timeline data
lookup/reference tables
current-state rows

Model it:

working_set:
  hot_tenant_count: 20
  hot_cases_per_tenant: 500_000
  hot_case_row_size_bytes: 700
  hot_case_heap_gb: 6.52
  hot_case_index_gb: 9.8
  hot_audit_recent_gb: 12
  hot_lookup_gb: 2
  estimated_working_set_gb: 30.32
  available_memory_for_cache_gb: 90
  status: healthy

If the working set exceeds effective cache memory, expect more reads from storage and higher tail latency.

Beware: reporting queries can evict OLTP working set.

11. CPU Capacity Model

CPU capacity is best measured from benchmarks.

Example:

cpu_measurement:
  benchmark_load:
    tps: 500
    qps: 3000
    cpu_utilization_percent: 55
  projected_peak:
    tps: 900
    qps: 5400
    estimated_cpu_percent_linear: 99
  conclusion: CPU may saturate before projected peak

But CPU rarely scales perfectly linearly.

At higher load, additional costs appear:

lock waits
context switching
cache misses
connection overhead
plan instability
memory pressure
I/O waits
retry amplification

Use benchmark curves, not simple multiplication alone.

12. I/O Capacity Model

Track four I/O categories:

read IOPS
write IOPS
read throughput
write throughput

For OLTP, random I/O and latency matter.

For backup/reporting, throughput matters.

For WAL, sequential write latency matters.

io_capacity:
  storage:
    provisioned_iops: 16000
    provisioned_throughput_mb_sec: 1000
  measured_peak:
    read_iops: 7000
    write_iops: 4500
    read_throughput_mb_sec: 250
    write_throughput_mb_sec: 180
    avg_read_latency_ms: 1.2
    avg_write_latency_ms: 1.8
  projected_12_month_peak:
    read_iops: 14000
    write_iops: 9000
    read_throughput_mb_sec: 500
    write_throughput_mb_sec: 360
  risk:
    iops_headroom_low: true

Headroom matters because maintenance tasks also need I/O.

13. Connection Capacity Model

A common failure mode:

services × replicas × pool_size > database max_connections

Example:

connection_model:
  database_max_connections: 600
  reserved_admin_connections: 20
  reserved_background_connections: 30
  available_app_connections: 550
  services:
    case_service:
      replicas: 20
      pool_size: 20
      total: 400
    search_service:
      replicas: 10
      pool_size: 10
      total: 100
    report_service:
      replicas: 5
      pool_size: 15
      total: 75
  total_requested: 575
  status: exceeds_available_app_connections

The fix may be:

reduce pool sizes
use pooling middleware
separate reporting connection pool
separate read replicas
limit service replicas
backpressure at application layer
queue batch/report jobs

More connections are not always more capacity.

14. Little's Law for Database Thinking

A useful approximation:

concurrency = throughput × latency

If a system handles 1,000 requests/sec and each request spends 100 ms in DB, average concurrent DB work is:

1000 × 0.1 = 100 concurrent operations

If latency rises to 500 ms at same arrival rate:

1000 × 0.5 = 500 concurrent operations

Latency creates concurrency.

Concurrency creates contention.

Contention creates more latency.

This feedback loop causes collapse.

Capacity planning must include this nonlinear effect.

15. Headroom Policy

Running a database at 95% utilization is fragile.

Define headroom explicitly.

headroom_policy:
  cpu:
    normal_max: 60%
    peak_max: 75%
    emergency_max: 85%
  storage:
    used_max: 70%
    emergency_expand_at: 80%
  iops:
    normal_max: 60%
    peak_max: 75%
  connections:
    normal_max: 60%
    peak_max: 75%
  replication_lag:
    normal_p95_seconds: 5
    emergency_seconds: 60
  wal_archive_lag:
    normal_minutes: 2
    emergency_minutes: 10

Headroom is not waste. It buys:

spikes
failover
noisy neighbor absorption
maintenance
backup
index creation
backfills
incident response
traffic forecasting error

16. Tenant Skew Capacity Model

Multi-tenant capacity is rarely about average tenant size.

Average tenant:

100M cases / 500 tenants = 200k cases per tenant

But real distribution may be:

largest tenant = 25M cases
second largest = 14M cases
long tail median = 40k cases

Plan for:

largest tenant
top 1% tenants
noisy tenant traffic bursts
tenant-specific reports
tenant restore
tenant migration
tenant-level retention/legal hold
tenant-level search projection rebuild

Capacity model:

tenant_skew:
  total_tenants: 500
  largest_tenant_cases: 25_000_000
  p99_tenant_cases: 12_000_000
  median_tenant_cases: 40_000
  largest_tenant_daily_writes: 80_000
  top_5_tenants_share_of_traffic: 35%
  risk:
    noisy_neighbor: high
    large_tenant_query_plan_skew: high

Average-based planning fails multi-tenant systems.

17. Retention and Archival Capacity

Retention policy determines long-term capacity.

Questions:

How long is data retained online?
How long is it retained in warm archive?
How long is it retained in cold archive?
What is the legal hold behavior?
Is deletion physical or logical?
Are backups allowed to retain deleted personal data?
Can reports query archive data?
Can cases be restored from archive?

Retention model:

retention_model:
  cases:
    hot_online: 2 years
    warm_archive: 5 years
    cold_archive: 10 years
  audit_event:
    hot_online: 1 year
    warm_archive: 9 years
    immutable: true
  outbox_event:
    hot_online: 14 days
    hard_delete_after: 30 days
  idempotency_key:
    hot_online: 7 days
    hard_delete_after: 30 days

Without retention control, storage grows forever.

18. Partitioning Capacity Model

Partitioning changes capacity operations.

Useful partition dimensions:

time
tenant
region
status/lifecycle
hash bucket

For time-partitioned audit events:

partition_model:
  table: audit_event
  partition_strategy: monthly_range
  rows_per_month: 135_000_000
  avg_partition_size_gb: 145
  hot_partitions: current_month + previous_month
  retention_months: 120
  expected_total_partitions: 120
  purge_method: detach_and_drop_partition

Review:

partition count
partition size
partition pruning effectiveness
index strategy per partition
maintenance window
backup/restore impact
query complexity
partition creation automation
old partition archival

Partitioning should be designed before tables become unmanageable.

19. Growth Forecasting

Use multiple forecast scenarios.

forecast_scenarios:
  conservative:
    case_growth_monthly: 3%
    traffic_growth_monthly: 4%
  expected:
    case_growth_monthly: 7%
    traffic_growth_monthly: 8%
  aggressive:
    case_growth_monthly: 15%
    traffic_growth_monthly: 20%
  event_spike:
    traffic_multiplier: 5x
    duration_hours: 6

Capacity planning should not use a single optimistic line.

Create forecast tables:

Month	Cases Rows	Audit Rows	DB Size	Peak TPS	CPU Peak	Storage Used	Risk
0	80M	1.8B	3.2TB	500	55%	45%	healthy
3	98M	2.2B	4.0TB	650	64%	55%	monitor
6	118M	2.7B	5.1TB	820	77%	68%	scale soon
9	142M	3.2B	6.4TB	1050	90%	81%	unsafe
12	170M	3.9B	8.1TB	1350	>100%	96%	fails

This makes timing visible.

20. Capacity Envelope

The capacity envelope is the safe operating region.

Example:

capacity_envelope:
  safe_until:
    cases_rows: 120_000_000
    audit_rows: 2_700_000_000
    peak_tps: 800
    database_size_tb: 5
  limiting_factor:
    primary: CPU under mixed workload
    secondary: audit storage growth
    tertiary: restore time
  required_actions_before_limit:
    - move dashboard counts to projection
    - partition audit_event monthly
    - increase storage throughput
    - run restore drill on 5TB snapshot
    - review unused indexes on cases

The envelope must state the limiting factor.

If you do not know the limiting factor, you do not have a capacity model.

21. Scaling Triggers

Capacity planning should produce triggers.

scaling_triggers:
  vertical_scale:
    condition: cpu_p95_peak > 70% for 7 days
    action: scale instance during maintenance window
  storage_expand:
    condition: storage_used > 70% or projected_60_days > 80%
    action: increase allocated storage and IOPS
  partitioning_required:
    condition: audit_event projected > 2B rows or partition size > 200GB
    action: implement monthly partitions
  read_replica_required:
    condition: primary_read_cpu > 40% from read-only queries
    action: route eligible reads to replica
  archival_required:
    condition: hot_data_retention_exceeds_2_years
    action: move closed cases older than 2 years to archive tier
  sharding_review:
    condition: largest_tenant exceeds 30% primary workload
    action: evaluate tenant isolation/cell architecture

Triggers convert capacity planning into action.

22. Scaling Options and Tradeoffs

Option	Helps	Costs/Risks
Bigger instance	CPU/memory	failover/restart, cost, vertical limit
More IOPS/throughput	storage bottleneck	cost, does not fix bad queries
Read replicas	read load	stale reads, lag, routing complexity
Partitioning	maintenance, pruning, retention	design complexity, query constraints
Archival	storage/restore/hot set	query complexity, product behavior
Index cleanup	write/storage/cache	risk if usage misunderstood
Query rewrite	CPU/I/O/latency	application changes
Projection/read model	dashboard/search load	eventual consistency, rebuild path
Sharding/cells	tenant isolation/scale	operational complexity
Database-per-tenant	isolation/restore	high operational overhead
Distributed SQL	multi-region/scale	latency/transaction complexity/cost

Scaling is not just adding hardware.

Often the correct capacity move is changing data lifecycle or query architecture.

23. Cost Capacity Model

Capacity must include cost.

Database cost includes:

compute
storage
provisioned IOPS
backups
snapshots
cross-region replication
data transfer
monitoring/logging
read replicas
support/operations
migration time
engineer time

Example:

monthly_cost_model:
  primary_compute: 4200
  read_replicas: 3600
  storage: 1800
  provisioned_iops: 2400
  backups: 900
  cross_region_replication: 1200
  monitoring_logs: 500
  total_usd: 14600

Cost review questions:

Which indexes are expensive and unused?
Which data can be archived?
Which reports should leave OLTP?
Which tenants drive most cost?
Should large tenants be isolated and priced differently?
Are backups retaining unnecessary data?
Are replicas serving enough reads to justify cost?

Capacity planning is also financial architecture.

24. Capacity and Reliability

Capacity failures are reliability failures.

Common examples:

disk fills and database stops accepting writes
WAL archive lags and PITR window is threatened
replica lag breaks stale-read contract
CPU saturation causes timeout storm
connection exhaustion takes app down
backup restore exceeds RTO
autovacuum cannot keep up and bloat grows
partition count grows without automation
large tenant report takes down primary

Reliability planning must include capacity thresholds.

25. Capacity and Security/Compliance

Security/compliance requirements affect capacity.

Examples:

audit retention increases storage
encryption increases CPU overhead
row-level security adds predicate cost
data masking/projection creates derived storage
legal hold prevents purge
right-to-erasure requires delete propagation
access audit creates more write volume
immutable logs need archival strategy
tenant data residency may require regional duplication

Do not treat compliance as metadata. It changes capacity.

26. Capacity Review for a New Feature

When reviewing a feature, require a capacity section.

feature_capacity_review:
  feature: case_risk_scoring
  new_tables:
    - case_risk_score
    - case_risk_score_history
  row_growth:
    case_risk_score: one per case
    case_risk_score_history: avg 5 per case per year
  new_indexes:
    - tenant_id, risk_score desc, updated_at desc
  write_amplification:
    on_case_update: +1 history row
  read_workload:
    high_risk_queue: interactive
    risk_trend_report: analytical
  retention:
    history_years: 7
  estimated_12_month_storage_gb: 650
  risks:
    - high-risk queue may become hot query
    - history table grows faster than root table
    - report should not run on OLTP primary

No production feature should add unbounded data without capacity review.

27. Database Capacity Dashboard

A production capacity dashboard should show trend, not just current value.

Include:

database size over time
table size top N
index size top N
index amplification by table
row count estimates
daily insert/update/delete rate
WAL generated per day
backup size and duration
restore drill duration
CPU p50/p95/p99
I/O latency p95
storage used percent and projected exhaustion date
connection usage
lock waits
dead tuples/bloat indicators
autovacuum lag
replication lag
largest tenant data/traffic share
query latency top N by total time and p95

Trend is more important than snapshot.

28. Forecasting Storage Exhaustion

A simple forecast:

days_until_full = (usable_storage - current_used_storage) / daily_growth

Example:

storage_forecast:
  usable_storage_tb: 10
  current_used_tb: 6.5
  daily_growth_gb: 35
  days_until_80_percent: 42
  days_until_90_percent: 71
  days_until_full: 102

But use multiple growth rates:

last 7 days
last 30 days
last 90 days
projected product launch impact
worst-case import/backfill impact

If the forecast says 102 days to full, do not wait 100 days.

29. Forecasting Restore Failure

Restore time grows with data size.

restore_forecast:
  current_db_size_tb: 5
  restore_time_hours_current: 2.5
  projected_db_size_tb_12m: 9
  estimated_restore_time_hours_12m: 4.5
  rto_hours: 3
  action_required_by: before_db_size_reaches_6TB

Capacity risk is not only “disk full”.

It is also “we can no longer recover within business requirements.”

30. Forecasting WAL/Replication Failure

If WAL generation exceeds replica replay capacity, lag grows.

replication_capacity:
  normal_wal_mb_sec: 8
  peak_wal_mb_sec: 35
  replica_replay_capacity_mb_sec: 28
  peak_duration_minutes: 30
  expected_lag_growth_mb: 12_600
  risk: replica_lag_exceeds_freshness_contract

Mitigations:

reduce write amplification
reduce indexes
scale replica
improve storage throughput
batch writes differently
throttle backfills
separate analytical replicas
adjust freshness contract

Replica lag is a capacity signal.

31. Forecasting Hotspot Failure

Some limits do not scale with total hardware.

Example: single hot row.

UPDATE tenant_counter
SET next_number = next_number + 1
WHERE tenant_id = $1;

This row can serialize all case creation for a large tenant.

Capacity model:

hotspot_model:
  operation: allocate_case_number
  current_largest_tenant_tps: 40
  projected_largest_tenant_tps: 200
  measured_row_lock_saturation_tps: 90
  risk: largest tenant growth exceeds hot-row capacity
  mitigation: range allocation or sequence-per-tenant strategy

Hardware may not fix serialization bottlenecks.

32. Case Study: Capacity Plan for Case Management

32.1 Current State

current_state:
  tenants: 120
  cases: 80_000_000
  evidence_metadata: 260_000_000
  audit_events: 1_800_000_000
  database_size_tb: 3.2
  peak_tps: 500
  peak_qps: 3000
  largest_tenant_case_share: 18%
  restore_time_hours: 1.7

32.2 12-Month Expected State

expected_12_month_state:
  tenants: 216
  cases: 150_000_000
  evidence_metadata: 510_000_000
  audit_events: 3_600_000_000
  database_size_tb: 7.4
  peak_tps: 1200
  peak_qps: 7200
  largest_tenant_case_share: 24%
  restore_time_hours_estimated: 3.9

32.3 Risks

risks:
  - audit_event table becomes operationally heavy without partitioning
  - restore time exceeds 3h RTO by month 8
  - dashboard exact counts overload primary for largest tenants
  - largest tenant may create noisy-neighbor issue
  - WAL generation during backfills may create replica lag
  - index amplification on cases exceeds 1.5x

32.4 Decisions

architecture_decisions:
  - partition audit_event by month before 2.5B rows
  - move dashboard counts to async summary table
  - introduce tenant-level capacity dashboard
  - review/drop unused indexes before next major growth phase
  - set storage expansion trigger at 70% projected 60-day usage
  - run quarterly restore drill
  - evaluate cell isolation for top 5 tenants

That is a capacity plan.

33. Capacity Planning Anti-Patterns

33.1 Planning Only Disk

Disk is only one dimension. CPU, I/O, WAL, connections, restore time, and lock capacity matter.

33.2 Using Average Tenant

Average tenant hides large-tenant failure.

33.3 Ignoring Index Growth

Index size can exceed table size.

33.4 Ignoring Audit/History

Audit and history often dominate long-term storage.

33.5 No Restore-Time Forecast

A backup that exists but cannot be restored within RTO is not enough.

33.6 No Backfill Capacity

Backfills can consume WAL, I/O, and vacuum capacity.

33.7 No Spike Scenario

Peak events, imports, campaigns, deadlines, and regulatory submissions create non-average demand.

33.8 No Human Operating Model

If only one expert can safely operate the database at scale, operational capacity is low.

33.9 Over-scaling Instead of Fixing Design

Bigger hardware does not fix bad query shape, hot rows, unbounded history, or missing archival.

33.10 No Trigger-Based Actions

A forecast without action thresholds becomes a document nobody uses.

34. Capacity Planning Checklist

Business demand:

Tenant growth model exists
User growth model exists
Transaction growth model exists
Batch/import/report growth is modelled
Spike scenarios are modelled

Data growth:

Row growth per major table is estimated
Row width is estimated/measured
Index growth is estimated
Audit/history growth is estimated
Retention and archival are defined
Bloat budget is included

Resource capacity:

CPU capacity is benchmarked
Memory/working set is estimated
I/O capacity is measured
WAL capacity is measured
Connection capacity is modelled
Lock hotspot capacity is reviewed
Replica replay capacity is tested

Operational capacity:

Backup size trend is monitored
Restore time is tested
RPO/RTO remain achievable
Migration/backfill capacity is known
Partition/archival automation exists where needed
Incident runbooks include capacity failures

Governance:

Scaling triggers are defined
Cost forecast exists
Large tenant/noisy neighbor policy exists
Capacity dashboard exists
Forecast reviewed regularly

35. Sources and Further Reading

AWS Well-Architected Framework — Reliability Pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html
AWS Well-Architected Framework — Design Principles, Stop Guessing Capacity: https://docs.aws.amazon.com/wellarchitected/latest/framework/rel-dp.html
AWS Well-Architected Framework — Performance Efficiency Pillar: https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/welcome.html
PostgreSQL Documentation — Monitoring Database Activity: https://www.postgresql.org/docs/current/monitoring.html
PostgreSQL Documentation — The Cumulative Statistics System: https://www.postgresql.org/docs/current/monitoring-stats.html
PostgreSQL Documentation — pgbench: https://www.postgresql.org/docs/current/pgbench.html
PostgreSQL Documentation — Routine Vacuuming: https://www.postgresql.org/docs/current/routine-vacuuming.html
PostgreSQL Documentation — WAL Configuration: https://www.postgresql.org/docs/current/runtime-config-wal.html

36. Final Mental Model

Capacity planning is not a spreadsheet exercise.

It is the act of making future failure visible early enough that architecture can still change cheaply.

The question is not:

How big should the database be?

The real question is:

Under projected business growth, which database resource or design assumption fails first, when will it fail, how will we know, and what action must happen before that point?

Part 057 continues this by zooming into one of the hardest capacity problems: hotspots, skew, and high-cardinality workloads.

Lesson Recap

You just completed lesson 56 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Back To Series Next Lesson

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.

Previous Lesson

Lesson 55

Load Testing and Benchmarking Databases

Next Lesson

Lesson 57

Hotspots, Skew, and High Cardinality