Series MapLesson 34 / 35
Final StretchOrdered learning track

Learn Sql In Action Part 034 System Design With Sql And Relational Databases

22 min read4228 words
PrevNext
Lesson 3435 lesson track3035 Final Stretch

title: Learn SQL in Action - Part 034 description: System design with SQL and relational databases: OLTP vs OLAP, read replicas, CQRS, outbox/inbox, CDC, transactional messaging, cache invalidation, consistency boundaries, and failure modelling. series: learn-sql-in-action seriesTitle: Learn SQL in Action order: 34 partTitle: System Design with SQL and Relational Databases tags:

  • sql
  • database
  • system-design
  • cdc
  • outbox
  • cqrs
  • replication
  • consistency
  • cache
  • series date: 2026-07-01

Part 034 — System Design with SQL and Relational Databases

Target skill: mampu menempatkan SQL database secara benar di arsitektur sistem: bukan hanya sebagai tempat menyimpan data, tetapi sebagai consistency boundary, integration source, audit anchor, read model source, dan operational dependency yang punya failure mode.

Outcome praktis: ketika mendesain sistem case management, regulatory workflow, order lifecycle, billing, entitlement, atau audit platform, kita tahu kapan memakai single relational database, kapan memakai replica, kapan memakai outbox/CDC, kapan membuat read model, kapan cache berbahaya, dan kapan sharding/partitioning diperlukan.

SQL skill untuk top engineer tidak berhenti di query. Skill itu harus naik ke system design.

Database relasional adalah salah satu komponen paling kuat dalam sistem karena ia bisa memberikan:

  • durable state,
  • transaction boundary,
  • referential integrity,
  • concurrency control,
  • auditability,
  • queryability,
  • recovery path,
  • integration stream,
  • operational visibility.

Tetapi kekuatan itu datang dengan trade-off:

  • latency,
  • contention,
  • migration risk,
  • replication lag,
  • cache inconsistency,
  • backup/restore complexity,
  • schema evolution cost,
  • blast radius,
  • operational coupling.

Tujuan part ini adalah menyatukan semua materi sebelumnya ke desain sistem.

1. Kaufman Lens: Deconstructing Database System Design

System design dengan SQL bisa dipecah menjadi sub-skill:

Sub-skillPertanyaan intiFeedback cepat
Boundary designApa yang harus konsisten dalam satu transaksi?invariant bisa dilanggar atau tidak
Workload classificationIni OLTP, OLAP, HTAP, queue, audit, atau search?query latency dan contention
Data ownershipSiapa source of truth?conflicting writes
Replication designReplica dipakai untuk apa?lag, stale reads
Event propagationBagaimana state change keluar dari DB?lost event / duplicate event
Cache strategyApa yang boleh stale?stale read incident
Migration strategyBagaimana schema berubah tanpa downtime?deployment failure
Failure modellingApa yang terjadi saat DB, broker, replica, atau consumer gagal?incident drill
ObservabilityBagaimana mendeteksi query, lock, lag, bloat, pool saturation?time-to-diagnose

Kita tidak belajar pattern untuk dihafal. Kita belajar constraint dan invariant.

2. Database as Consistency Boundary

Pertanyaan paling penting:

Data apa yang harus berubah bersama-sama atau tidak berubah sama sekali?

Itulah transaction boundary.

Contoh case approval:

  • update case_file.status,
  • insert case_status_history,
  • insert case_decision,
  • insert audit_event,
  • insert outbox_event.

Semua ini harus terjadi atomically.

BEGIN;

UPDATE case_file
SET status = 'APPROVED',
    version = version + 1,
    updated_at = current_timestamp
WHERE case_id = :case_id
  AND status = 'UNDER_REVIEW'
  AND version = :expected_version;

INSERT INTO case_status_history (...)
VALUES (...);

INSERT INTO case_decision (...)
VALUES (...);

INSERT INTO audit_event (...)
VALUES (...);

INSERT INTO outbox_event (...)
VALUES (...);

COMMIT;

If any part fails, none should be committed.

This is where relational database shines.

3. Unit of Consistency vs Unit of Deployment

A common mistake: assuming microservice boundary must always split database transaction boundary.

Better distinction:

ConceptMeaning
Unit of consistencyData that must obey invariant together
Unit of ownershipTeam/service responsible for changing data
Unit of deploymentArtifact deployed independently
Unit of scalingComponent scaled independently
Unit of queryShape needed by user/report/API

These are not always identical.

A regulatory case system may have one strong consistency boundary for case lifecycle, but multiple read models for search, dashboard, notification, and analytics.

4. OLTP vs OLAP vs Operational Analytics

OLTP

OLTP optimizes for small, concurrent, transactional changes.

Examples:

  • open case,
  • assign officer,
  • approve decision,
  • record payment,
  • update SLA deadline,
  • submit evidence.

Characteristics:

  • short transactions,
  • strict constraints,
  • high concurrency,
  • predictable indexes,
  • normalized model,
  • row-level changes,
  • low latency.

OLAP

OLAP optimizes for large scans, aggregations, joins, and trend analysis.

Examples:

  • monthly enforcement trend,
  • officer workload distribution,
  • cohort retention,
  • violation severity analysis,
  • SLA compliance report.

Characteristics:

  • large scans,
  • columnar storage useful,
  • denormalized/star schema useful,
  • batch/stream ingestion,
  • higher latency tolerance,
  • snapshot consistency important.

Operational Analytics

Operational analytics lives between OLTP and OLAP.

Examples:

  • open cases due in next 24 hours,
  • backlog by team,
  • current SLA breach count,
  • queue health,
  • stuck workflow detection.

It often runs on OLTP data but can hurt OLTP if not controlled.

Strategy:

  • indexed summary table,
  • materialized view,
  • read replica,
  • read model database,
  • CDC-fed dashboard store,
  • query budget and timeout.

5. Read Replicas

Read replicas are not magic scale-out. They are consistency trade-offs.

Read replicas help when:

  • read workload is heavy,
  • stale reads are acceptable,
  • reports should not overload primary,
  • failover readiness is needed,
  • backup can run away from primary.

Read replicas hurt when:

  • read-after-write consistency is required,
  • application is unaware of lag,
  • long analytical queries create replica conflicts,
  • replica lag causes wrong decisions,
  • failover assumptions are untested.

6. Read-After-Write Consistency

Problem:

  1. user updates case,
  2. API writes to primary,
  3. next screen reads from replica,
  4. replica has not caught up,
  5. user sees stale state.

This is not a database bug. It is architecture behavior.

Solutions:

StrategyWhen usefulTrade-off
Read own writes from primaryuser-facing immediate consistencymore primary load
Lag-aware routingroute to primary if replica lag too highneeds lag telemetry
Session consistency tokenread from replica only after LSN/binlog position visiblecomplexity
UI pending stateeventual consistency acceptableUX design needed
CQRS read model with delay indicatordashboard/reportingeventual semantics explicit

For regulatory workflow, do not show approval result from stale replica unless UI explicitly marks freshness.

7. CQRS: Command Model vs Query Model

CQRS means separating write model from read model. It does not mean eventual consistency everywhere by default.

Write model:

  • normalized,
  • constraint-heavy,
  • transactionally correct,
  • optimized for state changes.

Read model:

  • denormalized,
  • query-shaped,
  • possibly duplicated,
  • rebuildable,
  • optimized for user/API/dashboard reads.

CQRS is useful when:

  • read shape differs greatly from write shape,
  • query load threatens OLTP,
  • search/reporting needs denormalized fields,
  • read model can be rebuilt,
  • eventual consistency is acceptable for those queries.

CQRS is harmful when:

  • used to avoid learning SQL joins,
  • read model becomes source of truth,
  • projection lag is not observable,
  • no replay/rebuild strategy exists,
  • duplicate business rules drift.

8. Transactional Outbox Pattern

Problem: dual write.

Bad flow:

If DB commit succeeds and broker publish fails, state changed but event lost.

Transactional outbox solves this by writing event into the same transaction as business state.

BEGIN;

UPDATE case_file
SET status = 'APPROVED'
WHERE case_id = :case_id
  AND status = 'UNDER_REVIEW';

INSERT INTO outbox_event (
    event_id,
    aggregate_type,
    aggregate_id,
    event_type,
    payload,
    occurred_at,
    published_at
) VALUES (
    :event_id,
    'CaseFile',
    :case_id,
    'CaseApproved',
    :payload,
    current_timestamp,
    NULL
);

COMMIT;

Then a relay publishes the outbox event.

This gives atomic state + event recording, but delivery to broker is usually at-least-once.

Consumers must be idempotent.

9. Outbox Table Design

Example:

CREATE TABLE outbox_event (
    event_id uuid PRIMARY KEY,
    aggregate_type varchar(80) NOT NULL,
    aggregate_id varchar(120) NOT NULL,
    event_type varchar(120) NOT NULL,
    event_version integer NOT NULL,
    payload json NOT NULL,
    headers json NULL,
    occurred_at timestamp NOT NULL,
    available_at timestamp NOT NULL,
    published_at timestamp NULL,
    publish_attempts integer NOT NULL DEFAULT 0,
    last_error text NULL
);

CREATE INDEX idx_outbox_unpublished
ON outbox_event (available_at, occurred_at, event_id)
WHERE published_at IS NULL;

Important fields:

FieldPurpose
event_ididempotency key for consumers
aggregate_typerouting and rebuild
aggregate_idordering/keying per aggregate
event_typeschema contract
event_versionevent evolution
payloadbusiness fact
occurred_atsource timestamp
available_atdelay/retry scheduling
published_atrelay progress
publish_attemptspoison detection
last_erroroperational debugging

10. Polling Outbox vs CDC Outbox

Polling Outbox

Relay queries outbox table periodically.

SELECT event_id, payload
FROM outbox_event
WHERE published_at IS NULL
  AND available_at <= current_timestamp
ORDER BY occurred_at, event_id
FETCH FIRST 100 ROWS ONLY
FOR UPDATE SKIP LOCKED;

Pros:

  • simple,
  • explicit,
  • works without CDC infrastructure,
  • easy to reason about.

Cons:

  • polling overhead,
  • locking complexity,
  • marking published creates more writes,
  • ordering needs design,
  • relay failure can leave rows in progress.

CDC Outbox

CDC reads database log and streams inserted outbox rows.

Pros:

  • avoids polling query,
  • preserves commit order more naturally depending connector,
  • scales integration better,
  • outbox row itself becomes event source.

Cons:

  • needs CDC infrastructure,
  • replication slot/binlog retention risk,
  • connector failure can grow log storage,
  • schema evolution needs discipline,
  • operational complexity higher.

11. Change Data Capture

CDC means capturing committed database changes and propagating them to downstream systems.

CDC use cases:

  • search indexing,
  • analytics ingestion,
  • read model projection,
  • audit streaming,
  • cache invalidation,
  • cross-service integration,
  • migration synchronization.

CDC risks:

  • lag,
  • duplicate events,
  • reordering by key/partition,
  • schema change breakage,
  • connector outage,
  • log retention growth,
  • snapshot/re-snapshot complexity,
  • downstream backpressure.

CDC rule:

CDC gives reliable observation of committed changes. It does not remove the need for idempotent consumers and schema contracts.

12. Inbox Pattern

If outbox solves producer-side dual write, inbox solves consumer-side duplicate handling.

Consumer table:

CREATE TABLE consumed_event (
    consumer_name varchar(100) NOT NULL,
    event_id uuid NOT NULL,
    consumed_at timestamp NOT NULL,
    PRIMARY KEY (consumer_name, event_id)
);

Consumer transaction:

BEGIN;

INSERT INTO consumed_event (consumer_name, event_id, consumed_at)
VALUES (:consumer_name, :event_id, current_timestamp);

-- apply business projection/update
UPDATE case_search_projection
SET status = :status,
    updated_at = :occurred_at
WHERE case_id = :case_id;

COMMIT;

If insert conflicts, event was already processed.

This is essential because at-least-once delivery is the realistic default for robust distributed systems.

13. Cache Invalidation

Cache is a second truth unless designed carefully.

Cache strategies:

StrategyGood forRisk
Cache-asidesimple readsstale after write
Write-throughcontrolled updatesapp complexity
Write-behindhigh write throughputdata loss/inconsistency risk
CDC invalidationDB as source of truthinvalidation lag
TTL onlylow correctness needstale window always exists

For regulatory systems, cache domain decisions carefully. Cache reference data and read-only projections more freely than lifecycle state.

14. Cache Key and Invalidation Scope

Bad key:

case:123

Better key includes tenant and representation:

tenant:{tenantId}:case-summary:{caseId}:v3

Invalidation must know:

  • which aggregate changed,
  • which query result sets include it,
  • whether list caches exist,
  • whether permission-dependent view differs,
  • whether stale read is acceptable.

For list queries, invalidation is hard. Often better:

  • short TTL,
  • event-driven projection,
  • no list cache for critical workflow queue,
  • cache only immutable pages/snapshots.

15. Search Index as Read Model

Search index is not source of truth.

Common architecture:

Rules:

  • index documents should include enough fields for search/list,
  • canonical detail should come from source DB if correctness matters,
  • indexer must be idempotent,
  • reindex from source must be possible,
  • projection lag must be visible,
  • delete/soft-delete semantics must be explicit.

16. Reporting and Analytics Architecture

Do not run heavy monthly reports on the OLTP primary unless intentionally budgeted.

Options:

  1. read replica for moderate reports,
  2. ETL/ELT into warehouse/lakehouse,
  3. materialized summary tables,
  4. CDC stream to analytical store,
  5. snapshot export,
  6. DuckDB/local analytical processing for offline/embedded workloads.

Architecture:

Use OLTP for current operational truth. Use analytical store for broad historical scanning.

17. Materialized Summary Tables

For operational dashboards:

CREATE TABLE team_case_backlog_summary (
    tenant_id bigint NOT NULL,
    team_id bigint NOT NULL,
    status varchar(30) NOT NULL,
    case_count bigint NOT NULL,
    breached_count bigint NOT NULL,
    refreshed_at timestamp NOT NULL,
    PRIMARY KEY (tenant_id, team_id, status)
);

Refresh options:

  • synchronous update in same transaction,
  • asynchronous event projection,
  • scheduled recomputation,
  • hybrid incremental + periodic reconciliation.

Trade-off:

Refresh styleFreshnessWrite costComplexity
Synchronoushighesthighmedium
Async eventmediumlowhigh
Batch refreshlow/mediumlowmedium
On-demand querycurrentquery cost highlow

For dashboards, eventual consistency is often acceptable if freshness is visible.

18. Multi-Tenancy with SQL

Common models:

ModelProsCons
Shared DB, shared schema, tenant columnefficient, simple opsisolation relies on predicates/RLS
Shared DB, schema per tenantstronger logical separationmigration complexity
DB per tenantstrong isolation, custom backupoperational overhead
Cluster per tenantstrongest isolationhighest cost

Shared schema pattern:

CREATE TABLE case_file (
    tenant_id bigint NOT NULL,
    case_id bigint NOT NULL,
    status varchar(30) NOT NULL,
    opened_at timestamp NOT NULL,
    PRIMARY KEY (tenant_id, case_id)
);

CREATE INDEX idx_case_file_tenant_status_opened
ON case_file (tenant_id, status, opened_at DESC, case_id DESC);

Never forget tenant in unique constraints:

ALTER TABLE case_file
ADD CONSTRAINT uq_case_external_ref
UNIQUE (tenant_id, external_reference);

Tenant isolation is an invariant, not a UI filter.

19. Sharding and Distribution

Sharding is not partitioning with marketing.

Partitioning usually divides data inside one database system. Sharding divides data across independent database nodes.

Use sharding when:

  • one primary cannot handle write load,
  • one node cannot hold data economically,
  • tenant isolation/blast radius requires physical split,
  • regional data residency requires separate placement.

Avoid sharding when:

  • query model needs frequent cross-shard joins,
  • team lacks operational maturity,
  • bottleneck is missing index/query design,
  • archive/partitioning would solve the issue.

Shard key questions:

  • Does it align with transaction boundary?
  • Does it avoid hot shards?
  • Can tenants grow unevenly?
  • How do we move a tenant?
  • How do we run global reports?
  • How do we enforce global uniqueness?
  • How do we handle cross-shard workflow?

20. Cross-Database Transactions

Distributed transactions are expensive operationally and conceptually.

Safer pattern:

  • keep strong invariant inside one database,
  • publish event via outbox,
  • downstream systems update eventually,
  • use saga/process manager for multi-step workflow,
  • make every step idempotent,
  • reconcile periodically.

Saga example:

Not every process can be eventually consistent. Identify which invariant truly requires atomicity.

21. Transactional Messaging Without Illusions

Exactly-once is often a composed illusion built from:

  • atomic write to source,
  • durable event log,
  • idempotent consumer,
  • deduplication key,
  • transactional sink,
  • deterministic replay,
  • careful offset management.

Design for:

  • at-least-once delivery,
  • duplicate events,
  • delayed events,
  • partial failure,
  • replay,
  • poison messages,
  • schema evolution.

Consumer idempotency is not optional.

22. Event Schema Design

Bad event:

{
  "status": "APPROVED"
}

Better event:

{
  "eventId": "8a2e...",
  "eventType": "CaseApproved",
  "eventVersion": 2,
  "occurredAt": "2026-07-01T10:15:30Z",
  "tenantId": "tenant-42",
  "caseId": "case-1001",
  "previousStatus": "UNDER_REVIEW",
  "newStatus": "APPROVED",
  "decisionId": "decision-778",
  "approvedBy": "user-91",
  "reasonCode": "COMPLIANT"
}

Event should contain enough context for consumers without forcing them to query source synchronously for every projection.

But do not leak sensitive data unnecessarily.

23. Audit Log vs Event Log

Audit log and event log overlap but are not identical.

AspectAudit logDomain event/outbox
Purposeaccountability, evidenceintegration, projection
Audiencecompliance, support, investigationdownstream systems
Mutabilityappend-only, strictappend-only ideally
Payloadwho/what/when/whybusiness fact for consumers
Retentionoften long/legaloperational/event retention
Query patternreconstruction/investigationconsumption/replay

Do not assume Kafka topic replaces audit table. Auditability often needs local queryable evidence with permission controls.

24. Backup, Restore, and Recovery Design

A database system design is incomplete without recovery thinking.

Questions:

  • What is RPO?
  • What is RTO?
  • Are backups tested?
  • Can we restore to a point in time?
  • Are encryption keys backed up?
  • Are schema migrations recoverable?
  • How do replicas interact with backups?
  • How do CDC/outbox consumers recover after restore?
  • How do we reconcile events emitted after restored point?

Restore drill is not optional for critical systems.

25. Failure Mode: Primary DB Down

Behavior to define:

  • write APIs fail fast or queue?
  • read APIs serve stale replica?
  • admin UI disabled?
  • background jobs paused?
  • outbox relay paused?
  • health checks avoid thundering herd?
  • circuit breaker protects DB during recovery?

Never let retry storms kill a recovering database.

26. Failure Mode: Replica Lag

Symptoms:

  • user sees old state,
  • reports inconsistent,
  • search result missing recent case,
  • dashboard undercounts backlog,
  • failover risks data loss if async replication.

Mitigations:

  • monitor lag,
  • route critical reads to primary,
  • expose freshness timestamp,
  • limit long queries on replicas,
  • capacity plan replica apply rate,
  • alert on lag thresholds,
  • test failover.

27. Failure Mode: CDC Connector Down

Consequences:

  • search index stale,
  • notifications delayed,
  • analytics delayed,
  • outbox backlog grows,
  • WAL/binlog retention grows,
  • disk risk on primary depending configuration.

Mitigations:

  • backlog dashboard,
  • connector health check,
  • lag metric,
  • disk alert,
  • replay runbook,
  • schema-change compatibility process,
  • poison event quarantine.

28. Failure Mode: Cache Stale

Questions:

  • What is maximum acceptable stale window?
  • Can stale data cause wrong approval/decision?
  • Does permission change invalidate cached data?
  • Can user see data after access revoked?
  • Can list cache omit newly escalated case?

For permission-sensitive data, stale cache can be a security incident.

29. Failure Mode: Migration Half-Complete

A safe migration plan handles:

  • app v1 and app v2 both running,
  • nullable transition,
  • partial backfill,
  • failed backfill,
  • constraint validation failure,
  • rollback not possible,
  • old code reading new shape,
  • new code reading old shape,
  • CDC consumers seeing schema change.

Use expand-contract and feature flags.

30. Case Management Reference Architecture

Core principle:

  • DB owns canonical lifecycle state.
  • Audit is written atomically with state change.
  • Outbox is written atomically with state change.
  • Read models are rebuildable.
  • Consumers are idempotent.
  • Search and dashboard are not source of truth.

31. Write Path Design

Example service flow:

Notice: user success depends on DB commit, not broker publish.

32. Read Path Design

Different reads have different freshness needs.

Read use caseSourceFreshness requirement
confirm approval resultprimary OLTPimmediate
case detail audit pageOLTP + audit tableimmediate/strong
search listsearch indexeventual acceptable if marked
dashboard countprojection/summaryeventual acceptable
legal reportwarehouse/snapshotpoint-in-time consistency
workflow queueOLTP or strongly consistent queue tablestrong enough to avoid duplicate work

Design read path per use case, not per generic repository.

33. Connection Pool and Backpressure

A relational database can be killed by too many connections even when CPU is not the first bottleneck.

Rules:

  • use bounded pool,
  • set statement timeout,
  • set lock timeout for risky operations,
  • separate OLTP and reporting pools,
  • avoid long idle transactions,
  • use bulkhead per workload,
  • apply backpressure before DB collapses,
  • monitor pool wait time.

Connection pool saturation is often the first visible symptom of slow query, lock wait, or DB overload.

34. Query Budgeting

Every query class should have a budget:

Query classExampleBudget
command writeapprove caselow latency, short transaction
user detail readopen case pagelow latency
workflow queuenext work itemvery low latency, high concurrency
dashboardbacklog summarymoderate latency, bounded
reportmonthly complianceasync/off primary
backfillschema migrationchunked, throttleable
reconciliationdata quality checkscheduled, controlled

Without query budget, every query competes equally. That is operationally false.

35. Database Observability in Architecture

Minimum signals:

  • query latency by normalized query,
  • rows examined/returned where available,
  • lock wait,
  • deadlock count,
  • connection pool active/waiting,
  • transaction age,
  • replication lag,
  • CDC lag,
  • outbox backlog,
  • cache hit/miss/staleness,
  • migration duration,
  • disk growth,
  • table/index bloat,
  • vacuum/analyze health where relevant.

Tie signals to user impact:

  • “approval latency high because update waits on lock”,
  • “dashboard stale because projection lag 8 minutes”,
  • “search missing case because CDC connector stopped”,
  • “database CPU high because report ran on primary”.

36. Data Ownership and Shared Database Trap

Multiple services writing the same tables is dangerous.

Better:

  • one owner for write model,
  • other services consume events or use API,
  • shared read replica only for read-only/reporting with contract,
  • schema changes controlled by owner,
  • permissions enforce ownership.

Shared database is not always wrong, but uncontrolled shared writes create hidden coupling.

37. API vs Database Integration

When another system needs data, options include:

IntegrationUse whenRisk
Synchronous APIneeds current decision/commandlatency/coupling
Read replica accessinternal reportingschema coupling
CDC/event streamdownstream projectioneventual consistency
Batch exportperiodic reportingfreshness delay
Shared table writealmost neverownership violation

For regulated data, prefer explicit contracts and auditability over convenience.

38. Reconciliation as Architecture Pattern

Eventual consistency needs reconciliation.

Examples:

-- Cases approved in OLTP but missing from search projection
SELECT c.case_id
FROM case_file c
LEFT JOIN case_search_projection p
  ON p.case_id = c.case_id
WHERE c.status = 'APPROVED'
  AND p.case_id IS NULL;
-- Outbox events not published after threshold
SELECT event_id, event_type, occurred_at, publish_attempts, last_error
FROM outbox_event
WHERE published_at IS NULL
  AND occurred_at < current_timestamp - interval '10 minutes';

Reconciliation jobs should be first-class, not emergency scripts.

39. Decision Heuristics

Use single relational DB when:

  • invariants cross several tables,
  • transaction correctness matters,
  • team size/system scale does not require distribution,
  • query flexibility is valuable,
  • operational simplicity matters.

Add read replica when:

  • reads overload primary,
  • stale reads are acceptable for some paths,
  • reporting needs isolation,
  • failover/read scaling needs justify complexity.

Add outbox/CDC when:

  • state changes must drive external systems,
  • dual write risk exists,
  • read models/search/analytics need reliable updates,
  • consumers can be idempotent.

Add CQRS read model when:

  • read shape is expensive on normalized OLTP,
  • eventual consistency is acceptable,
  • rebuild/replay is possible,
  • projection lag is observable.

Add cache when:

  • data is read-heavy,
  • staleness is acceptable or controlled,
  • invalidation is understood,
  • source of truth remains clear.

Add sharding when:

  • one primary cannot meet write/storage/isolation requirements,
  • partitioning/archival/indexing is insufficient,
  • cross-shard queries are manageable,
  • operations team can handle it.

40. Anti-Patterns

Anti-pattern: Broker as Source of Truth for Transactional State

Events are important, but canonical transactional state usually belongs in a database with constraints and recovery.

Anti-pattern: Cache Before Index

Many systems add cache because query is slow, when the real issue is missing index or bad query shape.

Anti-pattern: Read Replica for Strong Reads

If user must see their write immediately, route to primary or implement consistency token.

Anti-pattern: Outbox Without Idempotent Consumer

Outbox does not guarantee exactly-once business effect. Consumer must dedupe.

Anti-pattern: CDC Without Operational Ownership

CDC pipeline requires monitoring, schema policy, lag alert, replay plan, and disk/log retention guard.

Anti-pattern: Sharding Before Lifecycle Partitioning

Large table can often be handled with partitioning, archival, indexing, and workload isolation before sharding.

Anti-pattern: Analytics on Primary

A single unbounded report can degrade user-facing writes.

41. Architecture Review Checklist

For every SQL-backed system design, ask:

Consistency

  • What invariants must be transactionally enforced?
  • Are they enforced by constraints, guarded writes, or application logic only?
  • What isolation level is assumed?
  • Is retry safe?

Ownership

  • Which service owns writes?
  • Who can read directly?
  • Are permissions aligned with ownership?
  • Are schema changes controlled?

Read Models

  • Which reads require strong consistency?
  • Which reads can be stale?
  • Is freshness visible?
  • Can read models be rebuilt?

Integration

  • Is there a dual write?
  • Is outbox/CDC used?
  • Are consumers idempotent?
  • Is event schema versioned?

Operations

  • What are query budgets?
  • What are pool limits?
  • What are lock/statement timeouts?
  • How is lag monitored?
  • How are backups tested?
  • What is the failover plan?

Migration

  • Is expand-contract used?
  • Is backfill chunked and throttleable?
  • Are old and new app versions compatible?
  • Are CDC consumers schema-compatible?

42. Practice Drill: Design a Regulatory Case Platform

Design the SQL architecture for a case management platform with:

  • case lifecycle,
  • assignment,
  • approval,
  • escalation,
  • SLA deadline,
  • audit trail,
  • external notification,
  • search,
  • dashboard,
  • monthly compliance report.

Deliverables:

  1. OLTP schema boundaries,
  2. transaction boundary for case approval,
  3. audit table design,
  4. outbox event schema,
  5. idempotent consumer/inbox schema,
  6. search projection design,
  7. dashboard summary design,
  8. read-after-write strategy,
  9. replica usage policy,
  10. cache invalidation policy,
  11. migration strategy,
  12. observability dashboard,
  13. failure-mode table.

Failure-mode table example:

FailureUser impactDetectionMitigation
primary DB downcannot approve casehealth check, error ratefail fast, pause workers, failover
replica lagstale dashboardlag metricfreshness badge, route critical reads primary
CDC downsearch staleCDC lag, outbox backlogrestart connector, replay, alert
duplicate eventduplicate notificationinbox conflict, event auditidempotency key
migration fails mid-backfillmixed data shapemigration dashboardresume backfill, roll forward

43. Key Takeaways

  1. A relational database is a consistency boundary, not just storage.
  2. System design starts by identifying invariants that must hold atomically.
  3. Read replicas trade freshness for capacity/isolation.
  4. CQRS is useful when read and write shapes diverge, but read models must be rebuildable.
  5. Outbox prevents producer-side dual-write loss, but consumers must still be idempotent.
  6. CDC is powerful but operationally serious: lag, schema evolution, and log retention matter.
  7. Cache is safe only when staleness and invalidation are explicitly designed.
  8. Sharding is a last-resort distribution strategy, not a substitute for indexing/partitioning.
  9. Every SQL architecture needs observability, recovery, migration, and reconciliation plans.
  10. Top-level SQL mastery means reasoning across query, transaction, schema, and distributed failure boundaries.

The next part is the capstone. We will combine the full series into an end-to-end production casebook: schema, query correctness, optimizer work, transaction anomalies, auditability, migration, observability, and final drills.

Lesson Recap

You just completed lesson 34 in final stretch. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.