Build CoreOrdered learning track

Learn Postgresql In Action Part 019 Jsonb Document Hybrid Modeling

[]15 min read2812 words

In This Lesson

Learning Goal Kaufman Deconstruction Mental Model

Lesson 1935 lesson track07–19 Build Core

title: Learn PostgreSQL in Action - Part 019 description: JSONB dan hybrid relational-document modeling di PostgreSQL untuk engineer Java yang perlu menyeimbangkan fleksibilitas schema, queryability, integrity, performance, migration, dan operasional production. series: learn-postgresql-in-action seriesTitle: Learn PostgreSQL in Action order: 19 partTitle: JSONB and Hybrid Relational-Document Modeling tags:

postgresql
jsonb
database
java
schema-design
indexing
performance
series date: 2026-07-01

Part 019 — JSONB and Hybrid Relational-Document Modeling

Learning Goal

Setelah part ini, kamu harus mampu mengambil keputusan yang defensible tentang kapan data harus dimodelkan sebagai kolom relasional biasa, kapan boleh masuk ke jsonb, kapan perlu hybrid, dan kapan JSONB justru menjadi smell arsitektural.

Target skill bukan sekadar bisa memakai operator JSONB. Targetnya adalah bisa menjawab pertanyaan seperti:

Apakah field ini harus menjadi column, generated column, domain object, child table, atau jsonb?
Query mana yang masih sehat bila data disimpan di JSONB?
Index apa yang benar untuk containment, path extraction, attribute search, dan flexible metadata?
Bagaimana menjaga integrity saat sebagian data schema-less?
Bagaimana mapping JSONB di Java tanpa membuat domain model menjadi map-of-string-anything yang rapuh?
Bagaimana melakukan migrasi dari JSONB fleksibel menuju schema yang lebih kuat saat business rule mulai stabil?

Kaufman Deconstruction

Josh Kaufman menekankan bahwa skill kompleks harus dipecah menjadi sub-skill kecil yang bisa dilatih cepat. Untuk JSONB production, sub-skill-nya adalah:

Sub-skill	Yang Harus Bisa Dilakukan	Bukti Penguasaan
Modeling boundary	Memutuskan column vs JSONB vs child table	Bisa menjelaskan invariant yang hilang jika field masuk JSONB
Operator semantics	Memilih operator `->`, `->>`, `#>`, `@>`, `?`, `@@`, `@?`	Query tidak accidental full scan karena salah operator
Index selection	Memilih GIN, expression index, partial index, generated column	`EXPLAIN` membuktikan plan menggunakan index yang relevan
Integrity strategy	Menjaga validasi JSONB dengan CHECK, generated column, trigger, atau app contract	Invalid document tidak diam-diam masuk production
Java mapping	Mapping JSONB ke value object, bukan loose map di semua layer	Domain code tetap type-safe dan migration-friendly
Evolution strategy	Migrasi field dari JSONB ke column dengan expand-contract	Tidak perlu big-bang rewrite data

Mental Model

JSONB adalah document fragment inside a relational engine.

Ia bukan pengganti relational modeling. Ia adalah alat untuk menyimpan bagian data yang:

bentuknya fleksibel;
tidak selalu hadir pada semua row;
sering dibaca bersama parent row;
belum stabil menjadi schema kuat;
tidak membutuhkan relational integrity yang kompleks;
query-nya bisa dibatasi ke pola yang diketahui.

Model paling sehat adalah:

Jika field menentukan identity, ownership, lifecycle state, money, permission, audit, eligibility, or regulatory decision, default-nya jangan disembunyikan di JSONB. Data seperti itu biasanya perlu column, constraint, FK, index, dan observability yang kuat.

PostgreSQL JSON vs JSONB

PostgreSQL memiliki json dan jsonb.

Secara praktik production, jsonb hampir selalu pilihan default karena:

disimpan dalam bentuk binary decomposed;
tidak mempertahankan whitespace dan urutan key asli;
mendukung indexing yang jauh lebih berguna;
mendukung containment dan operators yang kuat.

json masih berguna bila kamu benar-benar perlu mempertahankan representasi textual asli, misalnya payload auditing yang harus persis sama dengan input. Untuk domain querying, gunakan jsonb.

CREATE TABLE case_event (
    id           uuid PRIMARY KEY DEFAULT uuidv7(),
    case_id      uuid NOT NULL,
    event_type   text NOT NULL,
    occurred_at  timestamptz NOT NULL DEFAULT now(),
    payload      jsonb NOT NULL
);

JSONB Normalization Effect

jsonb melakukan normalisasi internal. Contoh konsekuensi:

key order tidak dianggap meaningful;
duplicate key tidak dipertahankan seperti raw JSON text;
comparison dan containment berbasis struktur, bukan string literal.

Ini bagus untuk query. Ini kurang cocok bila kamu butuh exact raw payload.

Pattern yang aman untuk audit/event ingestion:

CREATE TABLE inbound_message (
    id              uuid PRIMARY KEY DEFAULT uuidv7(),
    received_at     timestamptz NOT NULL DEFAULT now(),
    source_system   text NOT NULL,
    raw_payload     jsonb NOT NULL,
    raw_payload_sha bytea NOT NULL,
    processing_state text NOT NULL DEFAULT 'received'
);

Jika exact string diperlukan:

CREATE TABLE inbound_message_raw (
    id              uuid PRIMARY KEY DEFAULT uuidv7(),
    received_at     timestamptz NOT NULL DEFAULT now(),
    raw_payload_text text NOT NULL,
    parsed_payload   jsonb GENERATED ALWAYS AS (raw_payload_text::jsonb) STORED
);

Catatan: generated expression harus memenuhi aturan PostgreSQL untuk generated columns. Bila expression tidak valid untuk generated column di versi atau konfigurasi tertentu, gunakan ingestion pipeline atau trigger.

Modeling Decision Matrix

Gunakan matrix ini sebelum membuat kolom JSONB.

Data Characteristic	Column	Child Table	JSONB	Reasoning
Dibutuhkan untuk FK	✅	✅	❌	FK tidak natural terhadap nested JSONB
Dibutuhkan untuk uniqueness	✅	⚠️	⚠️	Bisa expression/partial unique, tapi lebih fragile
Dipakai di banyak WHERE/JOIN	✅	✅	⚠️	JSONB query bisa diindex, tetapi lebih sulit distabilkan
Bentuk fleksibel per source	❌	⚠️	✅	JSONB cocok untuk source-specific attributes
Auditing payload eksternal	⚠️	❌	✅	Raw/parsed payload lebih cocok JSONB/text
Business lifecycle state	✅	❌	❌	State harus observable dan constrained
Array entity dengan identity sendiri	❌	✅	❌	Child table lebih tepat
Rare metadata, dibaca bersama parent	⚠️	❌	✅	JSONB menghindari sparse columns
Reporting dimension	✅	✅	⚠️	Reporting butuh stable columns/index/stats

Rule sederhana:

JSONB adalah tempat yang baik untuk context, bukan tempat yang baik untuk core invariant.

Hybrid Modeling Pattern

Hybrid modeling berarti core fields tetap relasional, sementara data fleksibel disimpan sebagai JSONB.

Contoh domain regulatory case:

CREATE TABLE enforcement_case (
    id              uuid PRIMARY KEY DEFAULT uuidv7(),
    case_number     text NOT NULL UNIQUE,
    subject_id      uuid NOT NULL,
    lifecycle_state text NOT NULL,
    priority        text NOT NULL,
    opened_at       timestamptz NOT NULL DEFAULT now(),
    closed_at       timestamptz,
    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT enforcement_case_lifecycle_state_chk
        CHECK (lifecycle_state IN ('draft', 'screening', 'investigation', 'enforcement', 'closed')),
    CONSTRAINT enforcement_case_priority_chk
        CHECK (priority IN ('low', 'normal', 'high', 'critical'))
);

Yang menjadi column:

case_number, karena identity;
subject_id, karena relationship;
lifecycle_state, karena workflow invariant;
priority, karena routing/escalation;
opened_at, karena lifecycle/query/retention.

Yang masuk JSONB:

source-specific flags;
external enrichment;
optional metadata;
calculated hints yang belum stabil;
UI display preferences yang tidak menjadi invariant.

Contoh metadata:

{
  "source": {
    "system": "complaints-portal",
    "importBatchId": "batch-2026-07-01"
  },
  "riskSignals": {
    "complaintCountLast90Days": 14,
    "hasPriorWarning": true
  },
  "externalReferences": [
    { "type": "ticket", "value": "TCK-8821" },
    { "type": "document", "value": "DOC-441" }
  ]
}

Operator Semantics

Extraction Operators

SELECT
    metadata -> 'source' AS source_json,
    metadata -> 'source' ->> 'system' AS source_system_text,
    metadata #> '{riskSignals,complaintCountLast90Days}' AS complaint_count_json,
    metadata #>> '{riskSignals,complaintCountLast90Days}' AS complaint_count_text
FROM enforcement_case;

Mental model:

Operator	Result	Use Case
`->`	JSON/JSONB	Extract object/array/value while preserving JSON type
`->>`	text	Extract scalar for comparison/cast/display
`#>`	JSON/JSONB by path	Nested extraction with path array
`#>>`	text by path	Nested scalar extraction

Common mistake:

-- Bad for numeric semantics: text comparison
WHERE metadata #>> '{riskSignals,complaintCountLast90Days}' > '9';

This is lexical comparison. '14' > '9' can behave differently than numeric reasoning expects.

Better:

WHERE (metadata #>> '{riskSignals,complaintCountLast90Days}')::int > 9;

But once a field becomes important enough for numeric comparisons, strongly consider generated column or real column.

Containment Operator

SELECT *
FROM enforcement_case
WHERE metadata @> '{"source": {"system": "complaints-portal"}}'::jsonb;

@> asks whether the left JSONB value contains the right JSONB structure.

This is one of the main query shapes that works well with GIN indexes.

Key Existence Operators

SELECT *
FROM enforcement_case
WHERE metadata ? 'riskSignals';

SELECT *
FROM enforcement_case
WHERE metadata ?| array['riskSignals', 'externalReferences'];

SELECT *
FROM enforcement_case
WHERE metadata ?& array['source', 'riskSignals'];

Use these for key existence, not for semantic business validation alone.

JSONPath Operators

PostgreSQL supports SQL/JSON path query operators such as @? and @@ for path existence and predicate checks.

SELECT *
FROM enforcement_case
WHERE metadata @? '$.externalReferences[*] ? (@.type == "ticket")';

JSONPath is expressive, but expressive does not always mean operationally cheap. Use EXPLAIN (ANALYZE, BUFFERS) and understand index support before relying on it for hot paths.

Indexing JSONB

JSONB performance depends on matching query shape to index strategy.

Strategy 1: General GIN Index

CREATE INDEX CONCURRENTLY idx_case_metadata_gin
ON enforcement_case
USING gin (metadata);

Useful for containment and key existence queries.

Example:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM enforcement_case
WHERE metadata @> '{"source": {"system": "complaints-portal"}}'::jsonb;

This can use the GIN index.

Trade-off:

flexible;
can be large;
write amplification can be significant;
may need recheck;
not a replacement for targeted B-tree indexes on stable scalar fields.

Strategy 2: `jsonb_path_ops` GIN Index

CREATE INDEX CONCURRENTLY idx_case_metadata_path_ops
ON enforcement_case
USING gin (metadata jsonb_path_ops);

jsonb_path_ops is more specialized. It is often smaller and faster for containment-style queries, but supports fewer operators than default jsonb_ops.

Use it when the workload is dominated by containment query shapes like:

WHERE metadata @> '{"source": {"system": "complaints-portal"}}'

Do not choose it blindly. It is a workload-specific trade-off.

Strategy 3: Expression Index for Stable Scalar Path

If you frequently filter by a scalar value:

CREATE INDEX CONCURRENTLY idx_case_source_system
ON enforcement_case ((metadata #>> '{source,system}'));

Query must match expression shape closely:

SELECT *
FROM enforcement_case
WHERE metadata #>> '{source,system}' = 'complaints-portal';

This is often better than a broad GIN index when:

the field is scalar;
the predicate is equality/range/sort;
the field is heavily used;
you want B-tree behavior.

Strategy 4: Generated Column + Normal B-tree Index

When a JSONB field becomes semi-stable and important, expose it:

ALTER TABLE enforcement_case
ADD COLUMN source_system text
GENERATED ALWAYS AS (metadata #>> '{source,system}') STORED;

CREATE INDEX CONCURRENTLY idx_case_source_system_generated
ON enforcement_case (source_system);

This has several advantages:

clearer query shape;
better statistics;
easier ORM mapping;
better reporting compatibility;
easier migration toward real column.

Generated columns are a bridge from flexible JSONB to stronger schema.

Strategy 5: Partial Index on JSONB Predicate

CREATE INDEX CONCURRENTLY idx_case_high_risk_from_portal
ON enforcement_case (opened_at DESC, id)
WHERE metadata @> '{"source": {"system": "complaints-portal"}, "riskSignals": {"hasPriorWarning": true}}'::jsonb;

Good for specialized work queues or operational dashboards.

But partial JSONB indexes are fragile if application query predicates do not imply the same condition.

JSONB and Statistics

Planner statistics for JSONB are not as straightforward as normal columns. A query on:

WHERE metadata #>> '{source,system}' = 'complaints-portal'

may suffer from weak cardinality estimation compared to a normal column with statistics.

If the field is important to planning:

use generated column;
run ANALYZE;
inspect pg_stats;
compare estimated vs actual rows;
consider extended statistics if multiple extracted columns correlate.

Example:

ALTER TABLE enforcement_case
ADD COLUMN source_system text
GENERATED ALWAYS AS (metadata #>> '{source,system}') STORED;

ALTER TABLE enforcement_case
ADD COLUMN has_prior_warning boolean
GENERATED ALWAYS AS ((metadata #>> '{riskSignals,hasPriorWarning}')::boolean) STORED;

CREATE STATISTICS st_case_source_warning
ON source_system, has_prior_warning
FROM enforcement_case;

ANALYZE enforcement_case;

Now PostgreSQL has a better chance of estimating combined predicates.

Integrity in and around JSONB

A JSONB column can easily become a dumping ground. Production-grade design needs integrity layers.

Layer 1: Non-null and Object Shape

ALTER TABLE enforcement_case
ADD CONSTRAINT enforcement_case_metadata_object_chk
CHECK (jsonb_typeof(metadata) = 'object');

This prevents arrays/scalars from becoming metadata.

Layer 2: Required Key for Certain Type

ALTER TABLE case_event
ADD CONSTRAINT case_event_payload_shape_chk
CHECK (
    CASE event_type
        WHEN 'case.assigned' THEN payload ? 'assigneeUserId'
        WHEN 'case.escalated' THEN payload ? 'escalationReason'
        ELSE true
    END
);

This is useful when event payload varies by type.

Layer 3: Type Check Nested Field

ALTER TABLE enforcement_case
ADD CONSTRAINT enforcement_case_risk_count_number_chk
CHECK (
    metadata #> '{riskSignals,complaintCountLast90Days}' IS NULL
    OR jsonb_typeof(metadata #> '{riskSignals,complaintCountLast90Days}') = 'number'
);

Layer 4: Promote Important JSON Field

If a JSON key becomes part of correctness, promote it.

Bad long-term design:

WHERE metadata #>> '{approval,status}' = 'approved'

Better:

ALTER TABLE enforcement_case
ADD COLUMN approval_status text;

ALTER TABLE enforcement_case
ADD CONSTRAINT approval_status_chk
CHECK (approval_status IN ('not_required', 'pending', 'approved', 'rejected'));

A field that drives workflow should be column-level.

JSONB Arrays: The Most Common Modeling Trap

JSONB arrays are tempting:

{
  "violations": [
    { "code": "KYC_MISSING", "severity": "high" },
    { "code": "REPORT_LATE", "severity": "medium" }
  ]
}

This is acceptable if violations are small embedded facts, rarely updated independently, and always read with the parent.

It becomes a smell if you need:

unique constraint per violation;
FK to violation catalog;
per-violation lifecycle;
per-violation audit;
frequent filtering/joining;
partial updates to individual elements;
high-volume analytics.

Then use a child table:

CREATE TABLE case_violation (
    case_id     uuid NOT NULL REFERENCES enforcement_case(id),
    code        text NOT NULL,
    severity    text NOT NULL,
    detected_at timestamptz NOT NULL DEFAULT now(),
    metadata    jsonb NOT NULL DEFAULT '{}'::jsonb,
    PRIMARY KEY (case_id, code)
);

Hybrid still works: stable child entity + flexible metadata.

Update Semantics and Write Amplification

Updating a nested JSONB value still updates the row version under MVCC.

UPDATE enforcement_case
SET metadata = jsonb_set(
    metadata,
    '{riskSignals,hasPriorWarning}',
    'true'::jsonb,
    true
)
WHERE id = $1;

Operational consequence:

row version changes;
indexes may need maintenance;
large JSONB values can increase write cost;
HOT update eligibility can be lost if indexed expressions depend on changed JSONB;
frequent small updates inside a large JSONB blob can create bloat.

If a field changes frequently and independently, it probably deserves its own column or child table.

JSONB and TOAST

Large JSONB documents may be TOASTed. That means the value can be stored out-of-line.

This matters because:

reading the row may not always read the entire value immediately, depending on query;
extracting nested values can force detoasting;
updating large values can be expensive;
broad JSONB documents can hurt cache behavior.

Design implication:

keep hot columns separate from cold JSONB;
avoid selecting JSONB when listing rows;
use projection discipline in Java queries;
split large audit payloads into dedicated tables when necessary.

Bad repository method:

// Loads every column including large metadata/payload.
List<CaseEntity> findByLifecycleState(String state);

Better query projection:

SELECT id, case_number, lifecycle_state, priority, opened_at
FROM enforcement_case
WHERE lifecycle_state = ?
ORDER BY opened_at DESC
LIMIT ?;

Java Mapping Strategy

Anti-pattern: Map Everywhere

class EnforcementCase {
    UUID id;
    String caseNumber;
    Map<String, Object> metadata;
}

This looks flexible but creates hidden costs:

no compile-time semantic boundary;
string keys leak everywhere;
casting errors become runtime bugs;
refactoring is unsafe;
API contract becomes unclear;
validation becomes inconsistent.

Better: Value Object Boundary

public record CaseMetadata(
    SourceInfo source,
    RiskSignals riskSignals,
    List<ExternalReference> externalReferences
) {
    public static CaseMetadata empty() {
        return new CaseMetadata(null, null, List.of());
    }
}

public record SourceInfo(
    String system,
    String importBatchId
) {}

public record RiskSignals(
    Integer complaintCountLast90Days,
    Boolean hasPriorWarning
) {}

public record ExternalReference(
    String type,
    String value
) {}

Mapping can still store JSONB, but the application boundary remains typed.

JDBC Example

With Jackson:

String sql = """
    INSERT INTO enforcement_case (
        case_number,
        subject_id,
        lifecycle_state,
        priority,
        metadata
    ) VALUES (?, ?, ?, ?, ?::jsonb)
    """;

try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setString(1, command.caseNumber());
    ps.setObject(2, command.subjectId());
    ps.setString(3, "draft");
    ps.setString(4, command.priority());
    ps.setString(5, objectMapper.writeValueAsString(command.metadata()));
    ps.executeUpdate();
}

Hibernate Mapping Direction

For Hibernate, prefer explicit type mapping through supported JSON support or a known library. The architecture decision is more important than the annotation:

map JSONB to a value object;
avoid exposing mutable map directly;
use immutable records or defensive copying;
keep hot query fields as columns;
avoid dirty-checking huge JSON documents unnecessarily.

Common ORM issue:

application loads entity;
modifies one nested JSON attribute;
Hibernate sees the whole JSONB value as dirty;
update rewrites the row;
index maintenance and bloat increase;
optimistic locking may conflict more broadly than intended.

For high-write fields, split them out.

Event Payload Pattern

JSONB is a strong fit for event payloads when event type defines payload shape.

CREATE TABLE domain_event (
    id             uuid PRIMARY KEY DEFAULT uuidv7(),
    aggregate_type text NOT NULL,
    aggregate_id   uuid NOT NULL,
    event_type     text NOT NULL,
    event_version  int NOT NULL,
    occurred_at    timestamptz NOT NULL DEFAULT now(),
    payload        jsonb NOT NULL,
    metadata       jsonb NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT domain_event_payload_object_chk CHECK (jsonb_typeof(payload) = 'object')
);

CREATE INDEX CONCURRENTLY idx_domain_event_aggregate
ON domain_event (aggregate_type, aggregate_id, occurred_at, id);

CREATE INDEX CONCURRENTLY idx_domain_event_type_time
ON domain_event (event_type, occurred_at DESC);

Do not over-index event payloads unless you have a real query workload. Most event tables should be append-heavy and queried by aggregate/time/type.

Audit Log Pattern

Audit logs benefit from JSONB but still need relational anchors.

CREATE TABLE audit_log (
    id            uuid PRIMARY KEY DEFAULT uuidv7(),
    actor_type    text NOT NULL,
    actor_id      text,
    action        text NOT NULL,
    resource_type text NOT NULL,
    resource_id   uuid NOT NULL,
    occurred_at   timestamptz NOT NULL DEFAULT now(),
    before_state  jsonb,
    after_state   jsonb,
    metadata      jsonb NOT NULL DEFAULT '{}'::jsonb
);

CREATE INDEX CONCURRENTLY idx_audit_resource_time
ON audit_log (resource_type, resource_id, occurred_at DESC);

CREATE INDEX CONCURRENTLY idx_audit_action_time
ON audit_log (action, occurred_at DESC);

Do not rely on JSONB for the main audit access path. Keep resource identity and actor/action as columns.

External Integration Payload Pattern

For external systems, store payload in JSONB but classify it relationally.

CREATE TABLE integration_message (
    id               uuid PRIMARY KEY DEFAULT uuidv7(),
    provider         text NOT NULL,
    message_type     text NOT NULL,
    provider_ref     text,
    received_at      timestamptz NOT NULL DEFAULT now(),
    processing_state text NOT NULL DEFAULT 'received',
    payload          jsonb NOT NULL,
    error_detail     jsonb,
    CONSTRAINT integration_message_state_chk
        CHECK (processing_state IN ('received', 'processing', 'processed', 'failed', 'ignored'))
);

CREATE UNIQUE INDEX CONCURRENTLY ux_integration_provider_ref
ON integration_message (provider, provider_ref)
WHERE provider_ref IS NOT NULL;

CREATE INDEX CONCURRENTLY idx_integration_work_queue
ON integration_message (processing_state, received_at, id)
WHERE processing_state IN ('received', 'failed');

This design supports idempotency, retry, observability, and failure triage.

Query Patterns

Good: Stable Filter Outside JSONB

SELECT id, case_number, priority, opened_at
FROM enforcement_case
WHERE lifecycle_state = 'investigation'
  AND priority IN ('high', 'critical')
ORDER BY opened_at DESC
LIMIT 50;

Acceptable: Metadata Filter with Targeted Index

SELECT id, case_number, opened_at
FROM enforcement_case
WHERE metadata #>> '{source,system}' = 'complaints-portal'
ORDER BY opened_at DESC
LIMIT 50;

With:

CREATE INDEX CONCURRENTLY idx_case_source_opened
ON enforcement_case ((metadata #>> '{source,system}'), opened_at DESC, id);

Better as Field Stabilizes

ALTER TABLE enforcement_case
ADD COLUMN source_system text
GENERATED ALWAYS AS (metadata #>> '{source,system}') STORED;

CREATE INDEX CONCURRENTLY idx_case_source_system_opened
ON enforcement_case (source_system, opened_at DESC, id);

Then query:

SELECT id, case_number, opened_at
FROM enforcement_case
WHERE source_system = 'complaints-portal'
ORDER BY opened_at DESC
LIMIT 50;

Risky: Open-Ended JSON Search

SELECT *
FROM enforcement_case
WHERE metadata::text ILIKE '%portal%';

This is usually a smell:

not semantically precise;
hard to index well;
expensive on large rows;
fragile under key/value changes.

If you need search, model it explicitly: generated search fields, full-text index, trigram index, or external search system depending on requirements.

Migration: JSONB to Column

JSONB often starts as flexibility and later becomes stable schema. Use expand-contract.

Step 1: Add Nullable Column

ALTER TABLE enforcement_case
ADD COLUMN source_system text;

Step 2: Backfill in Batches

UPDATE enforcement_case
SET source_system = metadata #>> '{source,system}'
WHERE source_system IS NULL
  AND metadata ? 'source'
LIMIT 1000;

PostgreSQL does not support LIMIT directly in UPDATE in this form. Use a CTE:

WITH batch AS (
    SELECT id
    FROM enforcement_case
    WHERE source_system IS NULL
      AND metadata ? 'source'
    ORDER BY id
    LIMIT 1000
)
UPDATE enforcement_case c
SET source_system = c.metadata #>> '{source,system}'
FROM batch
WHERE c.id = batch.id;

Repeat from application job or migration runner.

Step 3: Dual Write

Application writes both metadata.source.system and source_system temporarily.

Step 4: Add Constraint When Clean

ALTER TABLE enforcement_case
ADD CONSTRAINT source_system_consistency_chk
CHECK (
    source_system IS NULL
    OR source_system = metadata #>> '{source,system}'
) NOT VALID;

ALTER TABLE enforcement_case
VALIDATE CONSTRAINT source_system_consistency_chk;

Step 5: Index Column

CREATE INDEX CONCURRENTLY idx_case_source_system
ON enforcement_case (source_system);

Step 6: Read from Column

Switch queries and application reads to source_system.

Step 7: Decide Whether to Keep JSON Copy

Options:

keep for raw payload history;
remove nested key from future writes;
keep consistency constraint during transition;
eventually remove duplicate if no audit requirement.

Migration: Column to JSONB

Less common, but useful for retiring sparse columns.

UPDATE enforcement_case
SET metadata = jsonb_set(
    metadata,
    '{legacy,oldCategory}',
    to_jsonb(old_category),
    true
)
WHERE old_category IS NOT NULL;

Then remove column only after readers no longer depend on it.

Anti-Patterns

Anti-pattern 1: JSONB as Escape Hatch for Poor Domain Modeling

CREATE TABLE thing (
    id uuid PRIMARY KEY,
    data jsonb NOT NULL
);

This is not flexible architecture. It is usually schema avoidance.

Ask:

What is identity?
What is lifecycle state?
What relationships exist?
What constraints must never be violated?
What access paths matter?

Anti-pattern 2: Everything in JSONB Because “Requirements Change”

Requirements change. That does not mean no schema. It means schema should evolve safely.

PostgreSQL is good at schema evolution when you use expand-contract, concurrent indexes, nullable-first columns, backfills, and contract checks.

Anti-pattern 3: JSONB Field Used in Hot Join

SELECT c.*
FROM enforcement_case c
JOIN subject s
  ON s.external_ref = c.metadata #>> '{subject,externalRef}';

If this join matters, promote the field.

Anti-pattern 4: Huge Mutable JSONB Blob

If a 200KB JSONB document is updated for a single boolean flag, the design is probably wrong.

Split by write frequency:

hot fields: columns or narrow child table;
cold payload: JSONB archive table;
audit: append-only event/audit table.

Anti-pattern 5: Indexing Every JSON Key

Indexing is not free. Every index creates write overhead and maintenance cost.

Index only proven access paths.

Diagnostic Queries

Find Large JSONB Columns

SELECT
    id,
    pg_column_size(metadata) AS metadata_bytes
FROM enforcement_case
ORDER BY pg_column_size(metadata) DESC
LIMIT 20;

Check JSONB Query Plan

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM enforcement_case
WHERE metadata @> '{"source": {"system": "complaints-portal"}}'::jsonb;

Find Sequential Scan JSONB Queries via pg_stat_statements

SELECT
    calls,
    mean_exec_time,
    rows,
    query
FROM pg_stat_statements
WHERE query ILIKE '%jsonb%'
   OR query ILIKE '%metadata%'
ORDER BY mean_exec_time DESC
LIMIT 20;

Compare Expression Index Usage

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM enforcement_case
WHERE metadata #>> '{source,system}' = 'complaints-portal';

Ensure the expression in the query matches the expression in the index.

Practice Lab

Lab 1: Create Hybrid Table

DROP TABLE IF EXISTS enforcement_case;

CREATE TABLE enforcement_case (
    id              uuid PRIMARY KEY DEFAULT uuidv7(),
    case_number     text NOT NULL UNIQUE,
    subject_id      uuid NOT NULL,
    lifecycle_state text NOT NULL,
    priority        text NOT NULL,
    opened_at       timestamptz NOT NULL DEFAULT now(),
    metadata        jsonb NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT lifecycle_state_chk CHECK (lifecycle_state IN ('draft', 'screening', 'investigation', 'enforcement', 'closed')),
    CONSTRAINT priority_chk CHECK (priority IN ('low', 'normal', 'high', 'critical')),
    CONSTRAINT metadata_object_chk CHECK (jsonb_typeof(metadata) = 'object')
);

Lab 2: Insert Synthetic Data

INSERT INTO enforcement_case (
    case_number,
    subject_id,
    lifecycle_state,
    priority,
    metadata
)
SELECT
    'CASE-' || gs,
    uuidv7(),
    (ARRAY['draft', 'screening', 'investigation', 'enforcement', 'closed'])[1 + (random() * 4)::int],
    (ARRAY['low', 'normal', 'high', 'critical'])[1 + (random() * 3)::int],
    jsonb_build_object(
        'source', jsonb_build_object(
            'system', (ARRAY['portal', 'batch', 'api', 'manual'])[1 + (random() * 3)::int],
            'importBatchId', 'batch-' || ((random() * 20)::int)
        ),
        'riskSignals', jsonb_build_object(
            'complaintCountLast90Days', (random() * 100)::int,
            'hasPriorWarning', random() > 0.7
        ),
        'externalReferences', jsonb_build_array(
            jsonb_build_object('type', 'ticket', 'value', 'TCK-' || gs)
        )
    )
FROM generate_series(1, 100000) gs;

ANALYZE enforcement_case;

Lab 3: Test Without Index

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, case_number
FROM enforcement_case
WHERE metadata @> '{"source": {"system": "portal"}}'::jsonb;

Record:

scan type;
estimated rows;
actual rows;
buffer reads/hits;
runtime.

Lab 4: Add GIN Index

CREATE INDEX CONCURRENTLY idx_case_metadata_gin
ON enforcement_case
USING gin (metadata);

ANALYZE enforcement_case;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, case_number
FROM enforcement_case
WHERE metadata @> '{"source": {"system": "portal"}}'::jsonb;

Observe whether GIN is used.

Lab 5: Add Expression Index

CREATE INDEX CONCURRENTLY idx_case_source_system_expr
ON enforcement_case ((metadata #>> '{source,system}'));

ANALYZE enforcement_case;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, case_number
FROM enforcement_case
WHERE metadata #>> '{source,system}' = 'portal';

Compare against GIN containment.

Lab 6: Promote to Generated Column

ALTER TABLE enforcement_case
ADD COLUMN source_system text
GENERATED ALWAYS AS (metadata #>> '{source,system}') STORED;

CREATE INDEX CONCURRENTLY idx_case_source_system
ON enforcement_case (source_system);

ANALYZE enforcement_case;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, case_number
FROM enforcement_case
WHERE source_system = 'portal';

Compare clarity, stats, and plan stability.

Engineering Checklist

Before using JSONB, answer:

Is this field part of identity, lifecycle, permission, money, or correctness?
Does it need FK, uniqueness, or check constraint?
Will it be used in hot WHERE/JOIN/ORDER BY?
Is it frequently updated independently?
Is it large enough to affect cache/TOAST behavior?
Is the Java representation typed or a loose map?
Do we have a migration path if this field stabilizes?
Do we know the exact query operators we need?
Have we validated the plan with EXPLAIN (ANALYZE, BUFFERS)?
Do we have tests for invalid JSON shape?

Self-Correction Drills

Take an existing table with 10+ nullable optional columns. Decide which should become JSONB metadata and which should stay columns.
Take an existing JSONB blob. Identify 3 fields that should be promoted to generated columns or real columns.
Write the same query using @>, #>>, and generated column. Compare plans.
Create one intentionally bad JSONB design and refactor it to hybrid relational-document design.
Simulate large JSONB updates and observe table/index growth.

Mental Model Summary

JSONB is powerful when it is treated as a controlled flexibility boundary.

It becomes dangerous when it replaces domain modeling, hides invariants, or moves high-value query fields out of the relational planner’s comfort zone.

Top-level rule:

Put stable business truth in relational structure. Put flexible context in JSONB. Promote JSONB fields when they become important.

References

PostgreSQL Documentation — JSON Types: https://www.postgresql.org/docs/current/datatype-json.html
PostgreSQL Documentation — JSON Functions and Operators: https://www.postgresql.org/docs/current/functions-json.html
PostgreSQL Documentation — GIN Indexes: https://www.postgresql.org/docs/current/gin.html
PostgreSQL Documentation — Expression Indexes: https://www.postgresql.org/docs/current/indexes-expressional.html
PostgreSQL Documentation — Partial Indexes: https://www.postgresql.org/docs/current/indexes-partial.html

Lesson Recap

You just completed lesson 19 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Back To Series Next Lesson

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.

Previous Lesson

Lesson 18

Aggregation, Sorting, and Memory Engineering

Next Lesson

Lesson 20

Learn Postgresql In Action Part 020 Partitioning Lifecycle Architecture