Series/Learn Database Design and Architect

Series MapLesson 29 / 84

Build CoreOrdered learning track

MVCC, Locking, and Contention

Learn Database Design and Architect - Part 029

Deep dive into MVCC, locking, contention, deadlocks, hot rows, and production diagnosis for database systems.

[2026-07-05]26 min read5112 words

In This Lesson

1. The Core Mental Model 2. MVCC in One Sentence 3. What MVCC Solves

PrevNext

Lesson 2984 lesson track16–45 Build Core

#database#architecture#concurrency#mvcc+4 more

Part 029 — MVCC, Locking, and Contention

Database concurrency is not mainly about making queries run in parallel. It is about allowing many actors to read and change shared state while preserving invariants.

A top-tier database architect does not stop at: "the database supports transactions". They ask sharper questions:

What state is shared?
Which writes can race?
Which read must observe a stable snapshot?
Which invariant is local to one row, one aggregate, one tenant, or the whole system?
Which rows, indexes, partitions, or metadata objects become contention points?
Which lock waits are acceptable, and which ones indicate a broken write path?

This part builds the operational mental model behind MVCC, locks, deadlocks, and contention. The goal is not to memorize every lock mode in one database engine. The goal is to understand what the database is doing when production traffic turns a clean schema into a blocked system.

1. The Core Mental Model

A database is a shared state machine.

Every transaction proposes a state transition. Concurrency control decides which transitions can safely overlap and which ones must wait, retry, or abort.

The database engine is trying to satisfy three forces at once:

Correctness — illegal states must not become committed states.
Concurrency — independent transactions should not block each other unnecessarily.
Performance — the engine must avoid excessive coordination, disk I/O, and memory pressure.

MVCC is one strategy for balancing those forces. Locks are another. Modern transactional databases usually use both.

2. MVCC in One Sentence

MVCC, or Multi-Version Concurrency Control, lets transactions see a stable version of data while other transactions are modifying newer versions of the same logical rows.

Instead of having one physical copy of a row that every reader and writer fights over, the engine may keep multiple row versions. A reader sees the version that is visible to its snapshot. A writer creates a new version or marks an old version obsolete.

This is why in MVCC systems, simple readers often do not block simple writers. But that does not mean there are no locks. It means the database can often separate read visibility from write coordination.

3. What MVCC Solves

MVCC solves a fundamental problem: if every reader had to lock every row it reads, high-read systems would collapse under blocking.

For example, assume a dashboard reads 10,000 open cases while agents are updating individual cases. Without MVCC, the dashboard could block agents, or agents could block the dashboard. With MVCC, the dashboard can read a snapshot while agents continue creating newer row versions.

-- Dashboard query.
SELECT status, count(*)
FROM regulatory_case
WHERE tenant_id = :tenant_id
GROUP BY status;

-- Agent action at the same time.
UPDATE regulatory_case
SET status = 'UNDER_REVIEW', updated_at = now()
WHERE tenant_id = :tenant_id
  AND case_id = :case_id;

The dashboard may not see the newest update depending on its isolation level and snapshot timing. That is usually acceptable for a dashboard. It is not acceptable for a command that must enforce a business invariant.

That distinction matters.

MVCC helps reads scale. MVCC does not automatically make application-level invariants safe.

4. What MVCC Does Not Solve

MVCC does not remove all conflicts.

It does not make these problems disappear:

two writers updating the same row;
two commands checking the same condition and both deciding it is safe to insert;
a long transaction preventing cleanup of old row versions;
a hot counter row updated thousands of times per second;
a slow migration holding locks longer than expected;
index page contention from monotonically increasing keys;
stale reads from replicas;
write skew across multiple rows;
inconsistent application retry behavior;
unbounded transactions that combine user think-time, network calls, and database locks.

A dangerous belief is:

"We use MVCC, so locking is not a problem."

A better belief is:

"MVCC reduces unnecessary read/write blocking, but write coordination, invariant enforcement, cleanup, and contention still require explicit design."

5. Snapshot, Visibility, and Row Versions

A transaction does not simply read "the table". It reads a visibility-filtered view of row versions.

The exact implementation differs by engine, but the mental model is portable:

Concept	Meaning
Transaction id	Logical marker identifying a transaction or commit order
Snapshot	The set of transaction effects visible to a statement or transaction
Row version	A physical representation of a logical row at some point in time
Visible row	A row version valid for the current snapshot
Dead row	A row version no longer visible to any active or future transaction
Vacuum / cleanup	Process that removes or reclaims obsolete versions
Tuple visibility	Engine-specific logic that decides which version a transaction sees

A simplified update looks like this:

Under READ COMMITTED, many engines take a new snapshot per statement. Under snapshot-style repeatable read, the transaction keeps a stable snapshot. Under serializable isolation, the engine also tracks conflicts so the final result behaves as if transactions ran one at a time.

6. Readers, Writers, and the Dangerous Half-Truth

In MVCC systems, a common rule of thumb is:

Readers do not block writers, and writers do not block readers.

This is useful but incomplete.

It usually applies to ordinary reads and writes on visible row versions. It does not mean:

DDL never blocks queries;
updates to the same row can proceed simultaneously;
foreign key checks never wait;
unique index checks never wait;
serializable transactions never abort;
vacuum/cleanup has no relationship with long-running transactions;
SELECT FOR UPDATE behaves like a normal SELECT;
materialized view refresh, schema migration, or maintenance tasks are free.

The precise architecture view is:

Use the simple rule to reason quickly. Use the detailed model to debug production.

7. Locks Still Exist Under MVCC

MVCC decides what row version a transaction can see. Locks coordinate access to resources that cannot safely be changed concurrently.

Resources can include:

rows;
tables;
indexes;
pages;
metadata/catalog entries;
advisory application-level resources;
predicate/range dependencies under serializable isolation;
internal latches protecting in-memory structures.

At the application design level, the most important categories are:

Lock type	Used for	Architect concern
Row lock	Coordinate concurrent updates to the same row	Hot rows, deadlocks, lock ordering
Table lock	Protect table-level operations	DDL, migrations, bulk operations
Predicate/range dependency	Protect logical conditions under serializable isolation	Write skew, phantom prevention, retry
Advisory lock	Application-defined coordination	Non-row resource, distributed job, coarse critical section
Internal latch	Protect engine internals	Usually not directly controlled, but can surface as contention

A plain SELECT usually does not acquire row locks that block writers. A SELECT ... FOR UPDATE does.

BEGIN;

SELECT *
FROM regulatory_case
WHERE tenant_id = :tenant_id
  AND case_id = :case_id
FOR UPDATE;

-- Safe to make decisions based on the locked current row.
UPDATE regulatory_case
SET status = 'ESCALATED'
WHERE tenant_id = :tenant_id
  AND case_id = :case_id;

COMMIT;

This is a deliberate pessimistic concurrency pattern. It is powerful. It is also easy to overuse.

8. Row Locking: The Practical Model

A row lock says: "another transaction cannot perform conflicting writes to this row until this transaction finishes."

Typical use cases:

update account balance;
transition case status;
assign task to a worker;
reserve inventory;
approve a decision exactly once;
ensure only one worker processes a job.

Example:

BEGIN;

SELECT id, status, assigned_to
FROM case_task
WHERE tenant_id = :tenant_id
  AND id = :task_id
FOR UPDATE;

-- Application checks status and permissions.

UPDATE case_task
SET status = 'IN_PROGRESS',
    assigned_to = :user_id,
    started_at = now()
WHERE tenant_id = :tenant_id
  AND id = :task_id;

COMMIT;

The important timing property:

A lock is held until transaction end, not until the statement result is returned.

This means transaction duration is lock duration.

Bad transaction design creates lock contention even when each SQL statement is fast.

// Bad shape: lock is held across external work.
transaction(() -> {
    Task task = taskRepository.lockTask(taskId);
    remotePolicyService.check(task);   // network call while DB lock is held
    emailService.send(...);            // external side effect while DB lock is held
    taskRepository.markApproved(taskId);
});

Better shape:

// Better shape: external work happens before or after short DB transaction.
PolicyDecision decision = remotePolicyService.check(input);

transaction(() -> {
    Task task = taskRepository.lockTask(taskId);
    task.assertStillEligibleFor(decision);
    taskRepository.markApproved(taskId, decision.id());
    outboxRepository.enqueueApprovalNotification(taskId);
});

// Outbox worker sends email after commit.

9. Lock Wait vs Deadlock

A lock wait is not automatically a bug. A transaction may legitimately wait for another transaction to finish.

A deadlock is different. A deadlock occurs when transactions wait on each other in a cycle.

Deadlocks usually reveal inconsistent lock ordering.

Bad:

-- Transaction A
SELECT * FROM regulatory_case WHERE case_id = 1 FOR UPDATE;
SELECT * FROM regulatory_case WHERE case_id = 2 FOR UPDATE;

-- Transaction B
SELECT * FROM regulatory_case WHERE case_id = 2 FOR UPDATE;
SELECT * FROM regulatory_case WHERE case_id = 1 FOR UPDATE;

Better:

-- Always lock in deterministic order.
SELECT *
FROM regulatory_case
WHERE case_id IN (:case_id_a, :case_id_b)
ORDER BY case_id
FOR UPDATE;

The production rule:

Any transaction that locks multiple entities must define a deterministic lock order.

10. Contention Is a Shape Problem

Contention is not just "too much traffic". It is traffic concentrated on the same resource.

Ten thousand updates per second across ten thousand rows may be fine. Ten thousand updates per second to one row may destroy throughput.

Common contention shapes:

Shape	Example	Symptom
Hot row	`tenant_usage.current_count` updated every request	Lock waits, low write throughput
Hot parent	All child inserts validate/update same parent	Parent row lock contention
Hot index page	Monotonic inserts into same B-Tree edge	Insert latency, page split pressure
Hot tenant	One tenant dominates workload	Noisy neighbor, partition skew
Hot queue	Many workers claim from same small pending set	Lock waits, duplicate work risk
Long transaction	Report/export keeps snapshot open	Vacuum lag, bloat, old versions retained
DDL lock	Migration waits for or blocks active query	Incident during deployment
FK check contention	Parent row referenced by many concurrent child writes	Waits around referential integrity
Unique constraint contention	Many requests try same unique value/idempotency key	Waits or unique violation storm

A senior engineer asks: "Where does concurrency collapse into serialization?"

11. Hot Row Example: Counter Table

A naive counter design:

CREATE TABLE tenant_usage_counter (
    tenant_id uuid PRIMARY KEY,
    api_call_count bigint NOT NULL DEFAULT 0,
    updated_at timestamptz NOT NULL DEFAULT now()
);

Every request does:

UPDATE tenant_usage_counter
SET api_call_count = api_call_count + 1,
    updated_at = now()
WHERE tenant_id = :tenant_id;

For a high-volume tenant, this single row becomes a serialization point. Only one transaction can update the row at a time.

Better options depend on correctness requirement:

Option A — Append events, aggregate asynchronously

CREATE TABLE tenant_usage_event (
    tenant_id uuid NOT NULL,
    event_id uuid NOT NULL,
    occurred_at timestamptz NOT NULL,
    unit_count integer NOT NULL,
    PRIMARY KEY (tenant_id, event_id)
);

Pros:

high write throughput;
natural audit trail;
replayable;
easy deduplication with event_id.

Cons:

current count is eventually consistent unless aggregated in transaction;
more storage;
more query complexity.

Option B — Bucketed counters

CREATE TABLE tenant_usage_counter_bucket (
    tenant_id uuid NOT NULL,
    bucket_no integer NOT NULL,
    api_call_count bigint NOT NULL DEFAULT 0,
    updated_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, bucket_no)
);

Write path:

UPDATE tenant_usage_counter_bucket
SET api_call_count = api_call_count + 1,
    updated_at = now()
WHERE tenant_id = :tenant_id
  AND bucket_no = :bucket_no;

Read path:

SELECT sum(api_call_count)
FROM tenant_usage_counter_bucket
WHERE tenant_id = :tenant_id;

Pros:

reduces row-level contention;
bounded read aggregation.

Cons:

approximate or delayed semantics if cached;
more rows to manage;
not suitable for every invariant.

Option C — Enforce hard limit with atomic conditional update

UPDATE tenant_quota
SET used_count = used_count + 1
WHERE tenant_id = :tenant_id
  AND used_count < max_count;

Then check affected row count.

This remains a hot row but preserves a hard invariant. Sometimes serialization is the cost of correctness. The architect's job is to know when that cost is required.

12. Hot Parent Example: Aggregate-Level Invariant

Assume a case can have at most 5 active reviewers.

Naive flow:

SELECT count(*)
FROM case_reviewer
WHERE case_id = :case_id
  AND status = 'ACTIVE';

INSERT INTO case_reviewer(case_id, reviewer_id, status)
VALUES (:case_id, :reviewer_id, 'ACTIVE');

Two concurrent transactions can both see count 4 and both insert. Now count is 6.

A common fix is parent-row locking:

BEGIN;

SELECT id
FROM regulatory_case
WHERE id = :case_id
FOR UPDATE;

SELECT count(*)
FROM case_reviewer
WHERE case_id = :case_id
  AND status = 'ACTIVE';

INSERT INTO case_reviewer(case_id, reviewer_id, status)
VALUES (:case_id, :reviewer_id, 'ACTIVE');

COMMIT;

This serializes reviewer assignment per case. That may be acceptable because the invariant is per case. It is much better than serializing reviewer assignment globally.

The design principle:

Lock the smallest object that represents the invariant boundary.

13. Foreign Key Contention

Foreign keys are correctness tools. They are not free.

When many transactions insert child rows referencing the same parent, the database must ensure the parent exists and is not concurrently deleted or changed in a way that violates referential integrity. Depending on engine and operation, this can create waits around parent rows or indexes.

Example:

CREATE TABLE enforcement_case (
    id uuid PRIMARY KEY
);

CREATE TABLE evidence_item (
    id uuid PRIMARY KEY,
    case_id uuid NOT NULL REFERENCES enforcement_case(id),
    file_name text NOT NULL
);

High-volume evidence ingestion for one case can stress:

parent row access;
child foreign key index;
insert locality;
storage and WAL throughput;
downstream audit/outbox writes.

Do not remove foreign keys casually. Instead, decide deliberately:

Choice	Use when	Risk
Keep FK	Same database boundary, correctness important	Possible write overhead
Deferrable FK	Complex transaction graph, temporary intermediate states	Error appears at commit
Application-enforced reference	Cross-service or cross-database boundary	Orphan risk, needs reconciliation
Periodic reconciliation	High-volume ingestion, eventual repair acceptable	Temporary inconsistency

For strong internal data integrity, foreign keys are usually worth the cost. If they become bottlenecks, inspect workload shape before deleting integrity.

14. Unique Constraint Contention

Unique constraints are a strong concurrency control mechanism.

Example:

CREATE TABLE idempotency_record (
    tenant_id uuid NOT NULL,
    idempotency_key text NOT NULL,
    request_hash text NOT NULL,
    response_status text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, idempotency_key)
);

Concurrent inserts for the same idempotency key will converge on one winner. The loser must read the existing record and decide whether to return the stored result or reject a conflicting request hash.

INSERT INTO idempotency_record(
    tenant_id,
    idempotency_key,
    request_hash,
    response_status
)
VALUES (
    :tenant_id,
    :key,
    :hash,
    'PROCESSING'
)
ON CONFLICT (tenant_id, idempotency_key) DO NOTHING;

This is good contention. It turns a race into deterministic behavior.

Bad unique contention happens when:

clients retry with low-entropy keys;
a sequence generator creates collisions;
a unique index covers too wide a hot predicate;
a single natural key is abused for unrelated workflow attempts;
the application treats unique violation as a 500 error instead of a known concurrency outcome.

15. Long Transactions and MVCC Cleanup

MVCC keeps old row versions as long as they may still be visible to active transactions. A long-running transaction can prevent cleanup.

Common causes:

report/export transaction that runs for minutes or hours;
open transaction left idle by application bug;
batch job scanning huge table in one transaction;
migration script that combines DDL, DML, and validation in one long transaction;
interactive admin console session with BEGIN but no COMMIT;
queue worker that starts transaction before doing slow external work.

Symptoms:

table bloat;
index bloat;
vacuum lag;
increased disk usage;
slower sequential scans;
slower index scans due to more dead entries;
replication lag if WAL volume grows;
transaction id wraparound pressure in some engines.

The rule:

Long-running reads are not free in MVCC systems; they keep old versions alive.

Safer reporting patterns:

use read replica with known staleness contract;
chunk export by primary key/time window;
avoid user think-time inside transaction;
prefer short repeatable chunks over one giant transaction;
materialize report snapshots deliberately;
set transaction timeout and idle-in-transaction timeout.

16. `SELECT FOR UPDATE`, `NOWAIT`, and `SKIP LOCKED`

SELECT FOR UPDATE is a pessimistic lock. It is useful when the application must read current state and then modify based on that state.

Wait by default

SELECT *
FROM case_task
WHERE id = :task_id
FOR UPDATE;

If another transaction holds a conflicting lock, this waits.

Fail fast with `NOWAIT`

SELECT *
FROM case_task
WHERE id = :task_id
FOR UPDATE NOWAIT;

Use this when waiting would harm user experience or when the caller can return a clear conflict response.

Example response:

409 Conflict: task is currently being modified by another operation

Skip locked rows for worker queues

SELECT id
FROM case_task
WHERE status = 'PENDING'
ORDER BY priority DESC, created_at ASC, id ASC
FOR UPDATE SKIP LOCKED
LIMIT 10;

This lets multiple workers claim different rows without waiting on the same locked rows.

But be careful:

locked high-priority rows may be skipped;
starvation is possible;
ordering is not globally fair under concurrency;
abandoned work needs timeout/reclaim logic;
long transactions reduce queue throughput;
SKIP LOCKED is not a substitute for idempotent processing.

Worker claim pattern:

WITH candidate AS (
    SELECT id
    FROM case_task
    WHERE status = 'PENDING'
      AND available_at <= now()
    ORDER BY priority DESC, created_at ASC, id ASC
    FOR UPDATE SKIP LOCKED
    LIMIT 10
)
UPDATE case_task t
SET status = 'IN_PROGRESS',
    locked_by = :worker_id,
    locked_at = now()
FROM candidate c
WHERE t.id = c.id
RETURNING t.*;

17. Advisory Locks

Advisory locks are application-defined locks managed by the database. They are useful when the resource you need to coordinate does not map cleanly to a single row.

Examples:

one migration per tenant;
one scheduled billing run per account;
one import job per external file;
one recalculation per case;
one maintenance task per named resource.

Conceptual example:

-- Engine-specific syntax; shown as PostgreSQL-style idea.
SELECT pg_try_advisory_xact_lock(hashtext(:resource_key));

Use transaction-scoped advisory locks when possible. They are automatically released when the transaction ends.

Good use:

resource_key = tenant:{tenant_id}:billing-cycle:{cycle_id}

Bad use:

resource_key = global-import-lock

A global advisory lock can quietly become a platform-wide bottleneck.

Advisory lock rules:

Keep resource keys deterministic.
Keep critical sections small.
Prefer transaction-scoped locks.
Treat lock acquisition failure as a normal outcome.
Monitor lock wait time.
Do not use advisory locks to hide missing database constraints.

18. Predicate Conflicts and Write Skew

Some invariants are not about one row. They are about a predicate.

Example:

A case must always have at least one active reviewer.

Two transactions each deactivate a different reviewer. Each sees another active reviewer and proceeds. After both commit, there are zero active reviewers.

Row locks on each reviewer do not help if each transaction locks only its own row.

Possible fixes:

lock the parent case row before changing reviewer set;
use serializable isolation and retry serialization failures;
represent the invariant with a constraint where possible;
redesign workflow so one aggregate row owns the reviewer-set state;
use a counter with guarded update if exact count is needed.

The recurring design move:

Convert predicate-level invariants into enforceable boundaries.

19. Isolation Level and MVCC Interaction

MVCC gives the engine versions. Isolation level defines visibility and conflict behavior.

Isolation style	Mental model	Architect consequence
Read committed	Each statement sees a fresh committed snapshot	Safer than dirty reads, but multi-statement logic can race
Repeatable read / snapshot	Transaction sees stable snapshot	Avoids read skew, but may still allow some predicate anomalies depending on engine
Serializable	Result equivalent to some serial order	Strongest model, but may require retries or blocking

Do not choose isolation by superstition. Choose it by invariant.

For many OLTP commands:

keep transaction small;
enforce most invariants with constraints and guarded writes;
use row locks for aggregate-local state transitions;
use serializable selectively when predicate invariants are complex;
implement retry discipline where serializable or optimistic conflicts are expected.

20. Diagnosing Lock Contention

When production slows down, do not guess. Build a wait graph.

Questions:

Which sessions are waiting?
What are they waiting for?
Who is blocking them?
How long has the blocker been running?
Is the blocker active, idle in transaction, or waiting too?
What query started the transaction?
Which table/index/row/predicate is involved?
Is this a normal high-contention path or a new incident?

PostgreSQL-style diagnostic query:

SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    blocked.wait_event_type,
    blocked.wait_event,
    now() - blocked.query_start AS blocked_duration,
    now() - blocking.query_start AS blocking_duration
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks
  ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_stat_activity blocking
  ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;

Operationally, you want dashboards for:

lock wait time;
deadlock count;
idle-in-transaction sessions;
transaction age;
query duration;
blocking session count;
row update rate by table;
dead tuple growth;
vacuum lag;
replication lag;
queue claim latency;
retry rate;
unique violation rate by constraint.

21. Diagnosing Long Transactions

Find old transactions:

SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    now() - query_start AS query_age,
    wait_event_type,
    wait_event,
    query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;

Things to inspect:

Is the transaction idle in transaction?
Is it from app, migration, report, admin console, or batch job?
Does it hold locks?
Is it preventing vacuum cleanup?
Is it part of normal business flow?
Can it be killed safely?
Should there be timeout configuration?

Design response:

Finding	Engineering fix
App transaction waits on external API	Move external call outside transaction or use outbox
Export transaction runs for hours	Chunk export or use snapshot table
Migration waiting on active query	Use online migration pattern and short lock windows
Worker idle in transaction	Fix worker transaction scope and timeout
Transaction age grows during traffic spike	Reduce batch size, shorten unit of work

22. Database Contention Is Often Application Design Leaking

The database exposes contention, but the application often creates it.

Common examples:

Chatty transaction

BEGIN
SELECT entity
SELECT related A
SELECT related B
Call remote service
UPDATE entity
INSERT audit
INSERT outbox
COMMIT

Fix:

fetch non-locking reads before transaction when possible;
lock only at decision point;
move external effects after commit;
combine related writes;
ensure indexes support locked reads.

Coarse aggregate

A giant aggregate root forces unrelated operations to lock the same parent.

Fix:

split aggregate by true invariant boundary;
move independent counters/statuses to separate rows;
use event stream or projection for derived summary.

Global sequence of business events

A system wants one global ordering for everything.

Fix:

challenge whether global order is needed;
use per-tenant/per-case/per-account order if sufficient;
use append-only event table with scoped sequence;
separate audit order from business causality.

Read model updated synchronously on every write

Every command updates many summary tables.

Fix:

separate correctness write from projection update;
use outbox/CDC to update projection;
rebuild projection from source when drift occurs;
define freshness contract.

23. Lock Granularity

Lock granularity is the size of the protected resource.

Granularity	Example	Pros	Cons
Row	one task, one case, one account	precise, high concurrency	may not protect predicate invariants
Parent aggregate	one case before changing children	protects aggregate invariant	serializes child operations per parent
Tenant	one tenant maintenance job	simple isolation	noisy for large tenants
Table	bulk migration	simple operationally	blocks too much traffic
Global	one platform-wide job	easiest to reason about	worst scalability

Architectural rule:

Lock at the smallest boundary that fully protects the invariant.

Not smaller, or the invariant leaks. Not larger, or throughput collapses.

24. Lock Ordering Discipline

Lock ordering is one of the simplest ways to prevent deadlocks.

Examples:

always lock tenants by tenant_id ascending;
always lock cases by case_id ascending;
always lock account rows before ledger rows;
always lock parent before child;
always lock source account before destination account using ordered account ids, not transfer direction;
always lock workflow definition before workflow instance if both must be changed.

Transfer example:

BEGIN;

SELECT id, balance
FROM account
WHERE id IN (:from_account_id, :to_account_id)
ORDER BY id
FOR UPDATE;

-- Then apply transfer logic.

COMMIT;

Do not lock by business action order. Lock by deterministic resource order.

25. Deadlock Handling

Even with good design, deadlocks can happen. The application must treat them as retryable when the operation is idempotent and safe to retry.

Retryable cases often include:

deadlock detected;
serialization failure;
transient lock timeout;
transient connection fail before known commit outcome;
optimistic version conflict depending on UX.

Non-retryable cases include:

unique constraint violation for non-idempotent command;
foreign key violation due to invalid reference;
check constraint violation due to bad input;
permission failure;
business rule failure.

Retry requires discipline:

for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        return transactionTemplate.execute(tx -> commandHandler.handle(command));
    } catch (TransientConcurrencyException ex) {
        if (attempt == maxAttempts) throw ex;
        sleep(backoffWithJitter(attempt));
    }
}

Never blindly retry every database exception. That turns deterministic bugs into load amplification.

26. Contention and Index Design

Index design can create or relieve contention.

Examples:

Missing index on lock query

SELECT *
FROM case_task
WHERE status = 'PENDING'
ORDER BY priority DESC, created_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 10;

Without a supporting index, workers scan too many rows and may lock inefficiently.

Better:

CREATE INDEX idx_case_task_claim
ON case_task(status, priority DESC, created_at ASC, id)
WHERE status = 'PENDING';

Monotonic insert hotspot

A B-Tree index on monotonically increasing created_at or sequence id may concentrate inserts at the right edge. Usually this is acceptable in many systems, but at high write rates it can become a hot page or partition problem.

Mitigations:

partition by time or tenant;
use hash partitioning for write distribution;
use batch insert;
avoid unnecessary secondary indexes on high-write tables;
separate append-only ingest from query-serving projection.

Unique key contention

Unique constraints are good, but a hot unique predicate can become a coordination point. Use them deliberately for correctness, not accidentally for workflow serialization.

27. Queue Tables and Contention

Database-backed queues are legitimate for moderate throughput and transactional coupling. They are not magic.

Good queue table design:

CREATE TABLE background_job (
    id uuid PRIMARY KEY,
    tenant_id uuid NOT NULL,
    job_type text NOT NULL,
    status text NOT NULL,
    priority integer NOT NULL DEFAULT 0,
    available_at timestamptz NOT NULL DEFAULT now(),
    attempts integer NOT NULL DEFAULT 0,
    locked_by text,
    locked_at timestamptz,
    payload jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),
    CHECK (status IN ('PENDING', 'IN_PROGRESS', 'DONE', 'FAILED'))
);

CREATE INDEX idx_background_job_claim
ON background_job(status, available_at, priority DESC, created_at, id)
WHERE status = 'PENDING';

Claim query:

WITH picked AS (
    SELECT id
    FROM background_job
    WHERE status = 'PENDING'
      AND available_at <= now()
    ORDER BY priority DESC, created_at ASC, id ASC
    FOR UPDATE SKIP LOCKED
    LIMIT :batch_size
)
UPDATE background_job j
SET status = 'IN_PROGRESS',
    locked_by = :worker_id,
    locked_at = now(),
    attempts = attempts + 1,
    updated_at = now()
FROM picked p
WHERE j.id = p.id
RETURNING j.*;

Reclaim abandoned work:

UPDATE background_job
SET status = 'PENDING',
    locked_by = NULL,
    locked_at = NULL,
    available_at = now() + interval '30 seconds',
    updated_at = now()
WHERE status = 'IN_PROGRESS'
  AND locked_at < now() - interval '10 minutes';

Do not use database queues for everything. They are best when:

job creation must commit atomically with domain state;
throughput is moderate;
job state must be queryable in the same database;
operational simplicity is worth more than broker-level throughput.

Use a dedicated broker when:

fan-out is high;
queue depth is huge;
ordering semantics are broker-specific;
consumer groups and streaming semantics are core;
database write path is already under pressure.

28. The Contention Budget

A production database design should have a contention budget.

For each critical write path, define:

Question	Example answer
What row/resource is locked?	`case_task.id`
How long is it locked?	p95 transaction duration < 50 ms
What is expected conflict rate?	< 1% of attempts
What happens on conflict?	return 409 or retry up to 3 times
Is operation idempotent?	yes, via command id
How is lock wait monitored?	metric `db.lock_wait.ms` by query fingerprint
What is degradation mode?	queue command for later or fail fast

Without a contention budget, teams discover limits only during incidents.

29. Production Failure Modes

Failure Mode 1 — Idle transaction blocks deployment

A user session opens a transaction and sits idle. Deployment migration tries to alter the table. Migration waits. New queries pile up behind migration lock. The app becomes unavailable.

Prevention:

set idle-in-transaction timeout;
use lock timeout for migrations;
run DDL in small safe steps;
inspect blockers before migration;
avoid wrapping large migration scripts in one transaction unless required.

Failure Mode 2 — Hot tenant saturates shared table

One tenant has 90% of traffic. Shared table indexes become hot. Other tenants see latency.

Prevention:

tenant-aware metrics;
partition or shard by tenant when needed;
per-tenant rate limiting;
cell architecture for large tenants;
separate noisy workloads.

Failure Mode 3 — Queue workers fight for the same rows

Workers scan pending jobs without a good index. They lock/wait inefficiently. Throughput drops as workers increase.

Prevention:

claim query with FOR UPDATE SKIP LOCKED;
partial index for pending jobs;
bounded batch size;
short transaction;
reclaim timeout;
idempotent job execution.

Failure Mode 4 — Reporting transaction causes bloat

A long report keeps old snapshots alive. Heavy OLTP updates produce dead versions. Cleanup cannot reclaim them. Disk grows and query performance degrades.

Prevention:

chunked report;
replica for reporting;
snapshot table;
transaction timeout;
monitoring transaction age and dead tuples.

Failure Mode 5 — Retry storm

A contention bug causes many serialization failures or deadlocks. Application retries immediately. Load increases. Conflicts increase. The system melts down.

Prevention:

exponential backoff with jitter;
retry budget;
circuit breaker for hot operation;
conflict metrics;
idempotent command design;
degrade gracefully instead of infinite retry.

30. Design Checklist

Use this checklist during database design review.

MVCC and snapshot

Which queries require latest committed state?
Which queries can tolerate snapshot staleness?
Which workflows are unsafe under multi-statement READ COMMITTED logic?
Are long-running reads expected?
Are reports/export jobs chunked?
Do we have transaction timeout and idle-in-transaction timeout?

Locking

Which commands lock rows explicitly?
Are locks held only inside short transactions?
Are external calls outside lock-holding transactions?
Do multi-entity operations use deterministic lock order?
Are lock waits monitored by query fingerprint?
Are lock timeouts configured for migrations/admin operations?

Contention

What are the hottest rows?
What are the hottest indexes?
What tenant/account/case can dominate workload?
Does any write path update a global counter or global metadata row?
Are high-volume events append-only where possible?
Are counters bucketed or asynchronous when exact immediate count is not required?

Invariants

Is the invariant row-local, aggregate-local, tenant-local, or global?
Does the lock boundary match the invariant boundary?
Could write skew occur?
Are predicate invariants protected by parent lock, constraint, or serializable isolation?
Are unique constraints used intentionally for concurrency control?

Retry

Which errors are retryable?
Which errors are deterministic business failures?
Are commands idempotent?
Is there backoff with jitter?
Is retry count bounded?
Are retries observable?

31. Implementation Heuristics

A few rules that work well in production:

Keep transactions short. Transaction duration is often lock duration and snapshot lifetime.
Use constraints for facts, not comments. If the database can enforce it cheaply and correctly, let it.
Use row locks when the invariant is row/aggregate-local. Locking is not primitive; it is a correctness tool.
Use deterministic lock ordering for multi-row operations. This prevents many deadlocks.
Treat unique violation as a domain outcome when uniqueness is part of the workflow. It is not always an unexpected error.
Use NOWAIT when the caller can fail fast. Better a clear conflict than a hanging request.
Use SKIP LOCKED only with reclaim and idempotency. It is a worker coordination tool, not a complete queue architecture.
Avoid global mutable rows. They become serialization points.
Monitor the database by wait, not only by CPU. Many database incidents are wait problems.
Retry only when the operation is safe to retry. Retry without idempotency is duplicate damage.

32. Mini Case Study: Case Escalation Under Contention

Requirement:

An enforcement case can be escalated once. Escalation creates an audit record and emits an outbox event. Multiple users may click escalate at nearly the same time.

Bad design:

SELECT status
FROM regulatory_case
WHERE id = :case_id;

-- App checks status.

UPDATE regulatory_case
SET status = 'ESCALATED'
WHERE id = :case_id;

INSERT INTO case_audit(...);
INSERT INTO outbox_event(...);

Problems:

concurrent users may both believe they escalated;
duplicate audit/outbox events are possible;
command is not idempotent;
no clear conflict behavior.

Better design with guarded update:

BEGIN;

UPDATE regulatory_case
SET status = 'ESCALATED',
    escalated_at = now(),
    updated_at = now()
WHERE id = :case_id
  AND tenant_id = :tenant_id
  AND status IN ('UNDER_REVIEW', 'PENDING_APPROVAL')
RETURNING id, status, escalated_at;

-- If no row returned, command did not transition.
-- App maps this to already escalated / invalid state.

INSERT INTO case_audit(
    id,
    tenant_id,
    case_id,
    event_type,
    occurred_at,
    actor_id
)
VALUES (
    :audit_id,
    :tenant_id,
    :case_id,
    'CASE_ESCALATED',
    now(),
    :actor_id
);

INSERT INTO outbox_event(
    id,
    aggregate_type,
    aggregate_id,
    event_type,
    payload,
    created_at
)
VALUES (
    :event_id,
    'REGULATORY_CASE',
    :case_id,
    'CASE_ESCALATED',
    :payload::jsonb,
    now()
);

COMMIT;

Even better with idempotency:

CREATE TABLE command_idempotency (
    tenant_id uuid NOT NULL,
    command_id uuid NOT NULL,
    command_type text NOT NULL,
    aggregate_id uuid NOT NULL,
    result_code text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, command_id)
);

The first command wins. Retries return the same result. Concurrent different commands converge through guarded update.

33. Exercises

Exercise 1 — Find the contention point

You see this query dominating lock waits:

UPDATE tenant_summary
SET open_case_count = open_case_count + 1
WHERE tenant_id = :tenant_id;

Questions:

Is exact synchronous count required?
Can count be derived from case events?
Can counters be bucketed?
Is this used for billing, dashboard, quota, or analytics?
What correctness would be lost by making it async?

Exercise 2 — Prevent reviewer write skew

Invariant:

A case must always have at least one active reviewer.

Design at least two safe implementations:

parent-row lock on case;
serializable isolation with retry;
state machine where reviewer set changes through one aggregate command;
counter row with guarded update.

Explain tradeoffs.

Exercise 3 — Queue worker review

Given:

SELECT id
FROM job
WHERE status = 'PENDING'
ORDER BY created_at
LIMIT 1;

Then app updates the row.

Find the race. Design a safe claim query. Add indexes. Add reclaim logic. Add idempotency.

34. Key Takeaways

MVCC gives readers stable visibility without forcing every read to block every write.
MVCC does not remove write conflicts, predicate anomalies, long-transaction costs, or hot-resource contention.
Locks are not the enemy; uncontrolled lock scope and duration are the enemy.
Contention is traffic concentration on a resource, not merely high traffic.
Lock the smallest boundary that fully protects the invariant.
Use deterministic lock ordering for multi-entity transactions.
Use constraints, guarded updates, idempotency keys, and short transactions before reaching for complex distributed coordination.
Monitor lock waits, transaction age, deadlocks, retries, and hot query fingerprints.
Retrying concurrency failures is valid only when the command is idempotent and retry budget is bounded.

In the next part, we convert these mechanics into reusable concurrency control patterns that can be applied consistently in application and database design.

References and Further Reading

PostgreSQL Documentation — Chapter 13, Concurrency Control: https://www.postgresql.org/docs/current/mvcc.html
PostgreSQL Documentation — Explicit Locking: https://www.postgresql.org/docs/current/explicit-locking.html
PostgreSQL Documentation — Transaction Isolation: https://www.postgresql.org/docs/current/transaction-iso.html
CockroachDB Documentation — Transaction Retry Error Reference: https://www.cockroachlabs.com/docs/stable/transaction-retry-error-reference

Lesson Recap

You just completed lesson 29 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Back To Series Next Lesson

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.

Previous Lesson

Lesson 28

Isolation Levels and Anomalies

Next Lesson

Lesson 30

Concurrency Control Patterns

MVCC, Locking, and Contention

Part 029 — MVCC, Locking, and Contention

1. The Core Mental Model

2. MVCC in One Sentence

3. What MVCC Solves

4. What MVCC Does Not Solve

5. Snapshot, Visibility, and Row Versions

6. Readers, Writers, and the Dangerous Half-Truth

7. Locks Still Exist Under MVCC

8. Row Locking: The Practical Model

9. Lock Wait vs Deadlock

10. Contention Is a Shape Problem

11. Hot Row Example: Counter Table

Option A — Append events, aggregate asynchronously

Option B — Bucketed counters

Option C — Enforce hard limit with atomic conditional update

12. Hot Parent Example: Aggregate-Level Invariant

13. Foreign Key Contention

14. Unique Constraint Contention

15. Long Transactions and MVCC Cleanup

16. SELECT FOR UPDATE, NOWAIT, and SKIP LOCKED

Wait by default

Fail fast with NOWAIT

Skip locked rows for worker queues

17. Advisory Locks

18. Predicate Conflicts and Write Skew

19. Isolation Level and MVCC Interaction

20. Diagnosing Lock Contention

21. Diagnosing Long Transactions

22. Database Contention Is Often Application Design Leaking

Chatty transaction

Coarse aggregate

Global sequence of business events

Read model updated synchronously on every write

23. Lock Granularity

24. Lock Ordering Discipline

25. Deadlock Handling

26. Contention and Index Design

Missing index on lock query

Monotonic insert hotspot

Unique key contention

27. Queue Tables and Contention

28. The Contention Budget

29. Production Failure Modes

Failure Mode 1 — Idle transaction blocks deployment

Failure Mode 2 — Hot tenant saturates shared table

Failure Mode 3 — Queue workers fight for the same rows

Failure Mode 4 — Reporting transaction causes bloat

Failure Mode 5 — Retry storm

30. Design Checklist

MVCC and snapshot

Locking

Contention

Invariants

Retry

31. Implementation Heuristics

32. Mini Case Study: Case Escalation Under Contention

33. Exercises

Exercise 1 — Find the contention point

Exercise 2 — Prevent reviewer write skew

Exercise 3 — Queue worker review

34. Key Takeaways

References and Further Reading

16. `SELECT FOR UPDATE`, `NOWAIT`, and `SKIP LOCKED`

Fail fast with `NOWAIT`