
Bronze Silver Gold Without Cargo Cult

Learn Java Data Pipeline Pattern - Part 054

Bronze, silver, and gold data pipeline layering without cargo cult: responsibilities, invariants, contracts, quality gates, replay, correction, governance, and Java implementation patterns.


Part 054 — Bronze / Silver / Gold Without Cargo Cult

Bronze, silver, and gold are useful words.

They are also dangerously easy to misuse.

Many teams adopt the three-layer model like this:

raw data -> bronze
clean data -> silver
business data -> gold

Then the system slowly degrades:

  • bronze becomes a dumping ground,
  • silver becomes “whatever is not raw”,
  • gold becomes a pile of dashboard tables,
  • nobody knows where validation belongs,
  • correction logic is duplicated,
  • PII leaks across layers,
  • backfills are unsafe,
  • every pipeline adds layers even when not needed,
  • quality is implied by layer name instead of enforced by contract.

This part is about using the layered lakehouse model without cargo cult.

The goal is not to memorize bronze/silver/gold.

The goal is to understand what responsibility each boundary owns.


1. The Layer Names Are Not the Architecture

Layer names are labels.

The architecture is the set of invariants.

A layer is useful only if crossing it means something concrete.

Bad boundary:

Data moved from table A to table B.

Good boundary:

Data crossed from raw-source preservation into validated canonical semantics.

Even better:

Data crossed from raw append-only capture into a canonical event model after schema validation, identity normalization, event-time validation, PII classification, duplicate classification, and rejection routing.

That is a real boundary.


2. A Better Mental Model

Instead of thinking:

bronze -> silver -> gold

Think:

capture -> understand -> serve

Or more explicitly:

preserve source truth
  -> establish platform truth
  -> publish product truth

Bronze is not low quality by definition.

Silver is not “cleaned bronze” in a vague sense.

Gold is not “final data”.

Each layer has a different job.


3. The Real Contract of Bronze

Bronze exists to preserve source-aligned facts with minimal irreversible interpretation.

The bronze invariant:

Bronze should make source data durable, replayable, traceable, and inspectable without pretending it is already semantically correct.

Bronze usually owns:

  • raw source payload,
  • source metadata,
  • ingestion metadata,
  • source checkpoint,
  • schema-at-ingestion,
  • decode status,
  • ingestion timestamp,
  • producer identity,
  • tenant/domain routing,
  • privacy classification if detectable,
  • file/API/CDC/Kafka source position,
  • rejection reason for unreadable records.

Bronze should not usually own:

  • complex business rules,
  • derived KPIs,
  • consumer-specific joins,
  • final reporting semantics,
  • destructive correction,
  • ambiguous dedupe that loses source history.

Bronze table examples

bronze.case_cdc_raw
bronze.case_event_kafka_raw
bronze.case_api_response_raw
bronze.case_file_ingestion_raw
bronze.party_reference_raw

Bronze row example

{
  "raw_payload": "{...}",
  "source_system": "case-management-db",
  "source_table": "case",
  "source_operation": "UPDATE",
  "source_position": "lsn:349923499",
  "source_commit_time": "2026-07-04T12:01:09Z",
  "ingestion_time": "2026-07-04T12:01:13Z",
  "schema_fingerprint": "sha256:...",
  "decode_status": "SUCCESS",
  "ingestion_run_id": "...",
  "trace_id": "..."
}

Bronze preserves evidence.

It should be possible to ask:

  • what did the source send,
  • when did we ingest it,
  • under which schema did we decode it,
  • which source checkpoint contained it,
  • which raw record produced a downstream canonical record,
  • why was a record rejected.
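
A minimal sketch of how such a row could be built in Java 17+, assuming the field names from the JSON example above. `BronzeRow`, `BronzeRows`, `fingerprint`, and `capture` are hypothetical names for illustration, and hashing the schema text with SHA-256 is one possible way to produce `schema_fingerprint`, not a prescribed one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.HexFormat;

// Hypothetical bronze row; field names follow the JSON example above.
record BronzeRow(
    String rawPayload,
    String sourceSystem,
    String sourceTable,
    String sourceOperation,
    String sourcePosition,
    Instant sourceCommitTime,
    Instant ingestionTime,
    String schemaFingerprint,
    String decodeStatus
) {}

final class BronzeRows {
    private BronzeRows() {}

    // Fingerprint the schema text so a later reader can tell which schema decoded the row.
    static String fingerprint(String schemaText) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(schemaText.getBytes(StandardCharsets.UTF_8));
            return "sha256:" + HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Wrap a CDC payload without interpreting it; only capture metadata is added.
    static BronzeRow capture(String rawPayload, String operation, long lsn,
                             Instant commitTime, Instant now, String schemaText) {
        return new BronzeRow(rawPayload, "case-management-db", "case", operation,
            "lsn:" + lsn, commitTime, now, fingerprint(schemaText), "SUCCESS");
    }
}
```

The payload stays an opaque string. Everything else is evidence about where and when it was captured.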

4. Bronze Is Not a Trash Can

A common mistake is treating bronze as ungoverned storage.

That fails in production.

Bronze still needs rules:

  • location and naming standards,
  • retention policy,
  • access control,
  • encryption,
  • PII classification,
  • schema capture,
  • ingestion ledger,
  • source ownership,
  • replay contract,
  • basic technical validation.

Technical validations at bronze

Bronze does not need to prove business correctness, but it should prove ingestibility:

  • payload is readable or explicitly marked unreadable,
  • source identity is known,
  • ingestion timestamp exists,
  • source checkpoint exists,
  • file was complete,
  • API page was fully retrieved,
  • CDC event has source position,
  • Kafka event has topic/partition/offset,
  • record can be routed to the right domain.

Bronze rejection

If the pipeline cannot even classify or persist the raw record safely, use an ingestion failure lane:

ingestion_failure.case_raw_unreadable

This is different from silver-level business rejection.
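
The ingestibility checks above could be sketched like this for a Kafka source. `RawKafkaRecord`, `IngestibilityCheck`, and the problem codes are hypothetical; the two table names are taken from the examples in this part. Treating any missing field as a reason for the failure lane is a simplifying assumption.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal view of a raw Kafka record at the bronze boundary.
record RawKafkaRecord(String topic, Integer partition, Long offset,
                      Instant ingestionTime, byte[] payload) {}

final class IngestibilityCheck {
    private IngestibilityCheck() {}

    // Returns technical problems only; an empty list means "ingestible", not "correct".
    static List<String> problems(RawKafkaRecord r) {
        List<String> problems = new ArrayList<>();
        if (r.topic() == null || r.topic().isBlank()) problems.add("missing_topic");
        if (r.partition() == null) problems.add("missing_partition");
        if (r.offset() == null) problems.add("missing_offset");
        if (r.ingestionTime() == null) problems.add("missing_ingestion_time");
        if (r.payload() == null) problems.add("missing_payload");
        return problems;
    }

    // Route to the bronze table or to the ingestion failure lane.
    static String targetTable(RawKafkaRecord r) {
        return problems(r).isEmpty()
            ? "bronze.case_event_kafka_raw"
            : "ingestion_failure.case_raw_unreadable";
    }
}
```

No business rule appears in this check. A record with a nonsensical status still lands in bronze, because judging the status is silver's job.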


5. The Real Contract of Silver

Silver exists to establish canonical, validated, reusable semantics.

The silver invariant:

Silver data should be understandable without knowing every source-system accident, and reusable without embedding a specific dashboard’s business logic.

Silver owns:

  • canonical field names,
  • normalized types,
  • stable identifiers,
  • event-time normalization,
  • dedupe classification,
  • source-to-domain mapping,
  • schema compatibility enforcement,
  • data quality validation,
  • rejection/quarantine routing,
  • canonical event modeling,
  • current-state projections,
  • reference data normalization,
  • correction semantics.

Silver should not usually own:

  • report-specific aggregation,
  • dashboard-specific formatting,
  • ML feature leakage,
  • organization-specific presentation rules,
  • ad hoc analyst transformations that are not governed.

Silver table examples

silver.case_event_canonical
silver.case_current_projection
silver.case_assignment_current
silver.party_canonical
silver.regulatory_calendar
silver.case_event_rejected

Silver canonical event example

{
  "event_id": "case-101:status:349923499",
  "aggregate_id": "case-101",
  "aggregate_type": "CASE",
  "event_type": "CASE_STATUS_CHANGED",
  "event_time": "2026-07-04T12:01:09Z",
  "business_effective_time": "2026-07-04T00:00:00+07:00",
  "previous_status": "OPEN",
  "new_status": "UNDER_REVIEW",
  "source_record_ref": {
    "table": "bronze.case_cdc_raw",
    "snapshot_id": 981231,
    "source_position": "lsn:349923499"
  },
  "contract_version": "case-event-v3",
  "quality_status": "VALID"
}

Silver should make downstream consumers less coupled to source chaos.

But silver must not lie.

When a source ambiguity cannot be resolved, silver should preserve the ambiguity or reject the record with evidence.
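
Two of the silver responsibilities, a stable event ID and timestamp normalization, can be illustrated with a small sketch. The ID format mirrors the `event_id` in the example above. `CanonicalFields` and its method names are hypothetical, and whether a field such as `business_effective_time` is stored with its offset or as UTC is a contract decision this sketch does not settle.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

final class CanonicalFields {
    private CanonicalFields() {}

    // Deterministic ID: replaying the same source record yields the same event_id.
    static String eventId(String aggregateId, String changeKind, long logPosition) {
        return aggregateId + ":" + changeKind + ":" + logPosition;
    }

    // Convert an ISO-8601 timestamp with offset into the UTC instant it denotes.
    static Instant toUtc(String isoOffsetDateTime) {
        return OffsetDateTime.parse(isoOffsetDateTime).toInstant();
    }
}
```

For instance, `toUtc("2026-07-04T00:00:00+07:00")` denotes `2026-07-03T17:00:00Z`. Midnight local time falls on the previous UTC date, which is exactly the kind of detail a dashboard should not have to rediscover.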


6. Silver Is the Semantic Choke Point

Silver is where teams often under-invest.

They move too quickly from raw data to gold dashboards.

That creates duplicated logic:

Dashboard A defines active case one way.
Dashboard B defines it differently.
ML feature job defines it a third way.
Regulatory export defines it a fourth way.

A strong silver layer centralizes reusable semantics:

  • active case definition,
  • valid escalation event,
  • official business calendar,
  • case age calculation base fields,
  • canonical party identity,
  • canonical assignment history,
  • deduped event identity,
  • source deletion semantics,
  • correction model.

Silver should be boring, explicit, and heavily reviewed.

It is the place where data engineering becomes domain engineering.


7. The Real Contract of Gold

Gold exists to serve a specific product, decision, report, API, dashboard, ML feature set, or external obligation.

The gold invariant:

Gold should be optimized for a declared use case and should be traceable back to reusable silver semantics.

Gold owns:

  • business aggregation,
  • dimensional modeling,
  • report snapshots,
  • SLA metrics,
  • dashboard-friendly shape,
  • denormalized serving tables,
  • export-ready datasets,
  • ML feature tables,
  • product-specific freshness SLO,
  • consumer-specific access policy,
  • publication/certification state.

Gold should not usually own:

  • raw source decoding,
  • hidden source-specific patches,
  • canonical identity rules,
  • undocumented fixes,
  • untraceable manual edits.

Gold table examples

gold.case_sla_daily_snapshot
gold.case_inventory_monthly_report
gold.enforcement_action_dashboard
gold.case_risk_feature_daily
gold.regulatory_breach_export
gold.executive_case_health_metrics

Gold report row example

{
  "reporting_date": "2026-07-04",
  "regulatory_unit": "UNIT-A",
  "open_case_count": 12041,
  "overdue_case_count": 391,
  "sla_breach_rate": 0.0324,
  "source_snapshot_ids": {
    "silver.case_event_canonical": 882199,
    "silver.case_current_projection": 882201
  },
  "metric_definition_version": "case-sla-v5.1.0",
  "published_snapshot_id": 992014
}

Gold is allowed to be opinionated.

But the opinion must be declared.


8. Cargo Cult Symptoms

A medallion architecture has become cargo cult when you see these symptoms.

Symptom 1: Every pipeline must have all three layers

Not always necessary.

A temporary migration job may need staging and publish only.

A raw immutable audit archive may never produce gold.

A small reference feed may go bronze -> silver and stop.

Symptom 2: Layer name replaces contract

Someone says:

It is silver, so it is clean.

That means nothing.

Ask:

  • which validations ran,
  • what was rejected,
  • what semantics were normalized,
  • what guarantees are documented,
  • who owns the contract.

Symptom 3: Bronze contains transformed business data

Then silver cannot tell source truth from interpreted truth.

Fix: keep bronze source-aligned.

Symptom 4: Silver contains dashboard logic

Then reusable domain semantics become polluted.

Fix: move product-specific aggregation to gold.

Symptom 5: Gold tables are treated as source of truth

A gold report snapshot can be authoritative for a report, but it should not become the hidden system of record for core domain state.

Fix: trace gold back to silver and source.

Symptom 6: No quarantine layer

Invalid records are silently dropped or patched.

Fix: make rejection explicit.

Symptom 7: No correction model

Teams overwrite data and cannot explain restatements.

Fix: add correction/restatement workflow.


9. Layering by Invariant, Not by Habit

Use this decision table.

| Question | If yes | Layer implication |
|---|---|---|
| Need to preserve what source sent? | yes | bronze/raw |
| Need to normalize source into domain semantics? | yes | silver/canonical |
| Need reusable current state? | yes | silver projection |
| Need report/dashboard/export-specific shape? | yes | gold/product |
| Need invalid record analysis? | yes | quarantine/rejected |
| Need certified output? | yes | published/certified gold |
| Need temporary publish validation? | yes | staging/validated boundary |

Layering is a consequence of responsibility.

Not a template.


10. More Explicit Zones for Production

For production systems, I prefer more explicit zones than only bronze/silver/gold.

The names can vary, but the responsibilities should be explicit:

| Zone | Responsibility |
|---|---|
| raw/bronze | source preservation and replay |
| ingestion failure | source data cannot be safely ingested |
| canonical/silver | reusable validated semantics |
| rejected/quarantine | semantically invalid or policy-blocked records |
| projection/silver | reusable current state/materialization |
| product/gold | use-case-specific shape |
| certified/published | approved/reportable output |

Bronze/silver/gold is a simplification.

Production systems often need the extra boundaries.


11. The Boundary Contract

Every layer transition should have a contract.

Bronze to silver contract

name: case_raw_to_canonical_event
source:
  table: bronze.case_cdc_raw
  required_fields:
    - source_position
    - source_operation
    - raw_payload
    - ingestion_time
output:
  table: silver.case_event_canonical
invariants:
  - event_id is not null
  - aggregate_id is not null
  - event_type is in registered taxonomy
  - event_time is valid
  - source_record_ref is preserved
rejection:
  table: silver.case_event_rejected
  fatal_reasons:
    - missing_primary_key
    - unknown_operation
    - invalid_event_time

Silver to gold contract

name: case_sla_daily_snapshot
source:
  tables:
    - silver.case_event_canonical
    - silver.case_current_projection
output:
  table: gold.case_sla_daily_snapshot
invariants:
  - one row per reporting_date and regulatory_unit
  - counts reconcile with input snapshot boundary
  - metric_definition_version is recorded
  - source_snapshot_ids are recorded
publication:
  requires_quality_pass: true
  requires_owner_approval: true

The layer boundary is not the table write.

The boundary is the contract enforcement.
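
To make "contract enforcement" concrete, here is a minimal sketch that checks the bronze-to-silver invariants listed above before a record may enter silver. `CanonicalEvent`, `CaseRawToCanonicalContract`, and the hard-coded taxonomy are assumptions for illustration. The contract does not say what "valid" means for `event_time`; this sketch interprets it as present and not in the future.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical canonical event carrying only the fields named in the contract.
record CanonicalEvent(String eventId, String aggregateId, String eventType,
                      Instant eventTime, String sourceRecordRef) {}

final class CaseRawToCanonicalContract {
    // Assumed taxonomy; in practice this would come from a registry.
    private static final Set<String> REGISTERED_EVENT_TYPES =
        Set.of("CASE_CREATED", "CASE_STATUS_CHANGED", "CASE_CLOSED");

    private CaseRawToCanonicalContract() {}

    // Returns the violated invariants; an empty list means the record may enter silver.
    static List<String> violations(CanonicalEvent e, Instant now) {
        List<String> violations = new ArrayList<>();
        if (e.eventId() == null) violations.add("event_id is null");
        if (e.aggregateId() == null) violations.add("aggregate_id is null");
        if (e.eventType() == null || !REGISTERED_EVENT_TYPES.contains(e.eventType())) {
            violations.add("event_type is not in registered taxonomy");
        }
        // Assumption: "valid" means present and not in the future.
        if (e.eventTime() == null || e.eventTime().isAfter(now)) {
            violations.add("event_time is invalid");
        }
        if (e.sourceRecordRef() == null) violations.add("source_record_ref is missing");
        return violations;
    }
}
```

A non-empty result routes the record to `silver.case_event_rejected` together with the violations as evidence. The table write happens only after this check.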


12. Bronze Implementation Pattern in Java

A Java bronze ingestion pipeline should be conservative.

It should capture data with enough metadata to replay and debug.

Interfaces

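import java.time.Instant;
import java.util.List;
import java.util.Map;

// Placeholder types so the sketch compiles; real definitions depend on the source.
record SourceCheckpoint(String value) {}
record SourcePosition(String value) {}
record BronzeWriteResult(long recordsWritten) {}

// Reads the next batch of raw records starting from a source checkpoint.
interface RawSourceReader<T> {
    List<RawEnvelope<T>> read(SourceCheckpoint checkpoint, int maxRecords);
}

// A raw payload plus the capture metadata needed for replay and debugging.
record RawEnvelope<T>(
    String sourceSystem,
    String sourceEntity,
    SourcePosition sourcePosition,
    Instant sourceEventTime,
    Instant ingestionTime,
    Map<String, String> headers,
    T payload
) {}

// Appends envelopes to bronze storage without interpreting the payload.
interface BronzeWriter<T> {
    BronzeWriteResult write(List<RawEnvelope<T>> records);
}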

Bronze rules

  • Never drop raw records silently.
  • Always capture source position.
  • Always capture ingestion time.
  • Preserve raw payload or stable raw reference.
  • Store decode failure separately.
  • Use append-only where possible.
  • Make retention explicit.
  • Avoid heavy business transforms.

File/API/CDC differences

| Source | Bronze metadata |
|---|---|
| File | path, size, checksum, manifest ID, row number |
| API | endpoint, cursor, page token, request ID, response status |
| Kafka | topic, partition, offset, timestamp, headers |
| CDC | database, table, operation, transaction ID, log position |

Bronze is where you preserve evidence of input reality.
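
One way to keep source-specific metadata typed instead of flattening it into a string too early is a sealed hierarchy with one record per source kind. This is a sketch under assumptions: `SourcePos`, `FilePos`, `KafkaPos`, `CdcPos`, and the checkpoint string formats are hypothetical, except the `lsn:` prefix, which follows the bronze row example in section 3. Only a subset of the metadata fields is shown.

```java
// Each source kind carries its own position metadata.
sealed interface SourcePos permits FilePos, KafkaPos, CdcPos {
    // A stable string that can be stored in bronze and used to resume or replay.
    String asCheckpoint();
}

record FilePos(String path, String checksum, long rowNumber) implements SourcePos {
    public String asCheckpoint() { return "file:" + path + "#" + rowNumber; }
}

record KafkaPos(String topic, int partition, long offset) implements SourcePos {
    public String asCheckpoint() { return "kafka:" + topic + "/" + partition + "@" + offset; }
}

record CdcPos(String database, String table, long logPosition) implements SourcePos {
    public String asCheckpoint() { return "lsn:" + logPosition; }
}
```

Because the interface is sealed, adding a new source kind forces a deliberate change in one place rather than a new ad hoc string format somewhere downstream.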


13. Silver Implementation Pattern in Java

Silver transforms raw source-aligned data into canonical domain data.

Interfaces

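import java.util.Map;

// Placeholder types so the sketch compiles; real fields depend on the platform.
record CanonicalizationContext(String contractVersion) {}
record RawRecordRef(String table, String sourcePosition) {}

// Turns one raw record into either a canonical record or an explicit rejection.
interface Canonicalizer<R, C> {
    CanonicalizationResult<C> canonicalize(R raw, CanonicalizationContext context);
}

// Sealed so that callers must handle both outcomes and nothing else.
sealed interface CanonicalizationResult<C> permits Accepted, Rejected {}

record Accepted<C>(C canonicalRecord) implements CanonicalizationResult<C> {}

// A rejection keeps the reason and a pointer back to the raw record as evidence.
record Rejected<C>(
    String reasonCode,
    String reasonDetail,
    RawRecordRef sourceRef,
    Map<String, Object> evidence
) implements CanonicalizationResult<C> {}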

Canonicalization responsibilities

  • parse and normalize types,
  • map source code values to domain enum,
  • derive stable event ID,
  • validate identity fields,
  • normalize timestamps,
  • attach source reference,
  • classify PII,
  • route invalid records,
  • preserve reason for rejection.

Silver canonicalization flow

The rejection table is not a trash bin.

It is operational evidence.
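
The flow itself can be sketched as a loop that sends every raw record to exactly one of two lanes. The types below are trimmed, self-contained stand-ins for the interfaces above (`Outcome` plays the role of `CanonicalizationResult`), and `FlowResult` and `CanonicalizationFlow` are hypothetical names. Real pipelines would write to tables rather than lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

sealed interface Outcome<C> permits Accepted, Rejected {}
record Accepted<C>(C canonicalRecord) implements Outcome<C> {}
record Rejected<C>(String reasonCode, String sourceRef) implements Outcome<C> {}

// Both lanes are returned so that nothing is dropped between bronze and silver.
record FlowResult<C>(List<C> accepted, List<Rejected<C>> rejected) {}

final class CanonicalizationFlow {
    private CanonicalizationFlow() {}

    static <R, C> FlowResult<C> run(List<R> rawRecords,
                                    Function<R, Outcome<C>> canonicalizer) {
        List<C> accepted = new ArrayList<>();
        List<Rejected<C>> rejected = new ArrayList<>();
        for (R raw : rawRecords) {
            Outcome<C> outcome = canonicalizer.apply(raw);
            if (outcome instanceof Accepted<C> a) {
                accepted.add(a.canonicalRecord());   // goes to the canonical table
            } else if (outcome instanceof Rejected<C> r) {
                rejected.add(r);                     // goes to the rejection table
            }
        }
        return new FlowResult<>(accepted, rejected);
    }
}
```

A useful reconciliation check follows directly: accepted count plus rejected count must equal the input count for the run.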


14. Gold Implementation Pattern in Java

Gold usually runs as batch, micro-batch, SQL transform, Spark job, or Flink aggregation depending on latency and volume.

Gold job contract

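import java.time.LocalDate;
import java.util.Map;
import java.util.UUID;

// Placeholder; the actual publication modes are not defined in this lesson.
enum PublicationMode { STAGING, PUBLISH }

// Everything a gold run needs in order to be reproducible and explainable.
record GoldJobContext(
    UUID runId,
    LocalDate reportingDate,
    Map<String, Long> inputSnapshotIds,
    String metricDefinitionVersion,
    String transformVersion,
    PublicationMode publicationMode
) {}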

Gold responsibilities

  • read declared input versions,
  • apply declared metric definitions,
  • aggregate or denormalize,
  • validate output shape,
  • record source snapshots,
  • publish atomically,
  • emit lineage,
  • support restatement.

Gold output should be explainable

For every metric:

  • what is the definition,
  • which input rows qualify,
  • which version of logic was used,
  • which reporting cut-off applies,
  • which records were excluded,
  • what changed from previous run.

Gold is where business decisions meet data engineering.

That makes it high-risk.
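
A minimal in-memory sketch of a gold job that aggregates per regulatory unit and records its provenance on every output row. `CaseState`, `SlaDailyRow`, and `SlaDailySnapshotJob` are hypothetical, and the rule "only open cases count, overdue is a subset of open" is an assumed metric definition for illustration, not the definition behind the example row in section 7.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical slice of silver.case_current_projection needed for the metric.
record CaseState(String caseId, String regulatoryUnit, boolean open, boolean overdue) {}

// One gold row per reporting date and regulatory unit, with its provenance.
record SlaDailyRow(LocalDate reportingDate, String regulatoryUnit,
                   long openCaseCount, long overdueCaseCount,
                   Map<String, Long> sourceSnapshotIds, String metricDefinitionVersion) {}

final class SlaDailySnapshotJob {
    private SlaDailySnapshotJob() {}

    static List<SlaDailyRow> run(LocalDate reportingDate, List<CaseState> cases,
                                 Map<String, Long> inputSnapshotIds, String metricVersion) {
        // TreeMap keeps output order deterministic, which makes reruns comparable.
        Map<String, long[]> counts = new TreeMap<>();
        for (CaseState c : cases) {
            if (!c.open()) continue; // assumed definition: closed cases are excluded
            long[] unit = counts.computeIfAbsent(c.regulatoryUnit(), k -> new long[2]);
            unit[0]++;                     // open cases
            if (c.overdue()) unit[1]++;    // overdue cases among the open ones
        }
        return counts.entrySet().stream()
            .map(e -> new SlaDailyRow(reportingDate, e.getKey(),
                e.getValue()[0], e.getValue()[1],
                Map.copyOf(inputSnapshotIds), metricVersion))
            .toList();
    }
}
```

The snapshot IDs and the metric version travel with each row, so the question "which inputs and which logic produced this number" can be answered from the output alone.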


15. Quality Gates by Layer

Quality is not one monolithic check.

Different layers validate different things.

| Layer | Validation focus |
|---|---|
| Bronze | ingestibility, source metadata, raw durability |
| Silver | canonical meaning, identity, type, time, domain rules |
| Gold | metric correctness, aggregation, reconciliation, publication readiness |
| Certified | approval, retention, evidence, external obligation |

Example severity model

severity:
  fatal:
    action: block_publish
  high:
    action: quarantine_and_alert
  medium:
    action: publish_with_warning_if_under_threshold
  low:
    action: record_metric

Example by layer

| Rule | Layer | Failure action |
|---|---|---|
| payload unreadable | bronze | ingestion failure table |
| missing case_id | silver | reject canonical record |
| invalid status transition | silver | reject or correction workflow |
| open case count drops 80% | gold | block publish |
| report lacks source snapshot IDs | certified | block certification |

A quality rule without an action is only documentation.
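
A sketch of tying each severity to its action in code, so a rule cannot exist without one. The action strings come from the severity model above. `Severity`, `RuleFailure`, `QualityGate`, the `"publish"` result for a clean run, and the choice to let the worst failure decide are assumptions for illustration.

```java
import java.util.Comparator;
import java.util.List;

// Declared from least to most severe so that natural ordering reflects severity.
enum Severity {
    LOW("record_metric"),
    MEDIUM("publish_with_warning_if_under_threshold"),
    HIGH("quarantine_and_alert"),
    FATAL("block_publish");

    final String action;

    Severity(String action) { this.action = action; }
}

record RuleFailure(String rule, Severity severity) {}

final class QualityGate {
    private QualityGate() {}

    // Assumption: the worst failure decides; no failures means a plain publish.
    static String decide(List<RuleFailure> failures) {
        return failures.stream()
            .map(RuleFailure::severity)
            .max(Comparator.naturalOrder())
            .map(s -> s.action)
            .orElse("publish");
    }

    // Gold rule from the table: flag a drop of 80% or more in open case count.
    static boolean dropsTooMuch(long previous, long current) {
        return previous > 0 && (previous - current) >= 0.8 * previous;
    }
}
```

With this shape, `dropsTooMuch(100, 20)` is true and would be reported as a `FATAL` failure, which maps to `block_publish`.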


16. Rejection and Quarantine Pattern

Invalid data must not disappear.

Use explicit rejection tables.

Rejection table schema

rejection_id            string
source_layer            string
source_table            string
source_snapshot_id      long
source_record_ref       string
target_contract         string
reason_code             string
reason_detail           string
severity                string
raw_payload_ref         string
first_seen_at           timestamp
pipeline_run_id         string
status                  string
owner                   string

Rejection lifecycle

Key principle

Quarantine is part of the pipeline, not an error side effect.

Rejected records need ownership, status, metrics, and a reprocessing path.
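
One way to give the `status` column a defined lifecycle is a small state machine. The status names and allowed transitions below are purely hypothetical, chosen only to illustrate the idea. An actual lifecycle should come from the team that owns the quarantine.

```java
// Hypothetical statuses for a rejected record.
enum RejectionStatus { OPEN, TRIAGED, REPROCESSED, WAIVED }

final class RejectionLifecycle {
    private RejectionLifecycle() {}

    // Assumed rules: triage comes first; REPROCESSED and WAIVED are terminal.
    static boolean allowed(RejectionStatus from, RejectionStatus to) {
        return switch (from) {
            case OPEN -> to == RejectionStatus.TRIAGED;
            case TRIAGED -> to == RejectionStatus.REPROCESSED || to == RejectionStatus.WAIVED;
            case REPROCESSED, WAIVED -> false;
        };
    }

    // Applies a transition or fails loudly, so status changes cannot be ad hoc.
    static RejectionStatus transition(RejectionStatus from, RejectionStatus to) {
        if (!allowed(from, to)) {
            throw new IllegalStateException("illegal transition " + from + " -> " + to);
        }
        return to;
    }
}
```

Whatever the concrete states are, the point is that a rejected record can only move along declared paths, and each move can be counted.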


17. Correction and Restatement Pattern

Real data changes after publication.

Sources send late updates.

Rules change.

Bug fixes happen.

Manual corrections arrive.

A mature bronze/silver/gold architecture needs correction semantics.

Correction types

| Correction type | Example | Handling |
|---|---|---|
| source correction | source updates wrong field | CDC/raw changelog + silver update |
| pipeline bug fix | transform misclassified status | backfill affected partitions |
| business rule change | SLA definition updated | new gold version/restatement |
| privacy correction | data must be redacted | governed erasure/anonymization |
| reference correction | calendar changed | dependent recompute |

Do not silently overwrite gold

For certified reports, use restatement.

gold.case_sla_daily_snapshot
  reporting_date=2026-07-04
  statement_version=1

gold.case_sla_daily_snapshot
  reporting_date=2026-07-04
  statement_version=2
  restates_statement_version=1
  restatement_reason="holiday calendar correction"

This is especially important in regulatory contexts.

A correction must be explainable.
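
The restatement idea can be sketched as an append-only log in which a new version points back at the one it restates. `Statement` and `StatementLog` are hypothetical names, the single `openCaseCount` metric stands in for a full report row, and an in-memory list stands in for a versioned table.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// restatesVersion is null for the first statement of a reporting date.
record Statement(String reportingDate, int statementVersion,
                 Integer restatesVersion, String reason, long openCaseCount) {}

// Append-only: earlier statements stay readable after a restatement.
final class StatementLog {
    private final List<Statement> statements = new ArrayList<>();

    Statement publish(String reportingDate, long openCaseCount, String reason) {
        Optional<Statement> previous = latest(reportingDate);
        int version = previous.map(s -> s.statementVersion() + 1).orElse(1);
        Integer restates = previous.map(Statement::statementVersion).orElse(null);
        Statement statement =
            new Statement(reportingDate, version, restates, reason, openCaseCount);
        statements.add(statement);
        return statement;
    }

    Optional<Statement> latest(String reportingDate) {
        return statements.stream()
            .filter(s -> s.reportingDate().equals(reportingDate))
            .max(Comparator.comparingInt(Statement::statementVersion));
    }

    List<Statement> history(String reportingDate) {
        return statements.stream()
            .filter(s -> s.reportingDate().equals(reportingDate))
            .toList();
    }
}
```

Consumers that want the current number read `latest`. Auditors that want to know what was reported on a given day, and why it changed, read `history`.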


18. Replay and Rebuild Semantics

A layered architecture is valuable only if it supports replay.

Replay questions

  • Can bronze be replayed into silver?
  • Can silver be rebuilt from bronze?
  • Can current projections be rebuilt from canonical events?
  • Can gold reports be recomputed from fixed input snapshots?
  • Can rejected records be reprocessed after a fix?
  • Can outputs be compared before publication?

Replay graph

Replay invariant

Replay must specify input version, transform version, and output publish strategy.

Without these, replay is not engineering. It is guessing.
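
The invariant can be enforced at construction time, so that a replay without all three parts cannot even be requested. `ReplayRequest`, `PublishStrategy`, and its values are hypothetical names for illustration.

```java
import java.util.Objects;

// Hypothetical strategies for what happens to replay output.
enum PublishStrategy { SHADOW_COMPARE, STAGE_THEN_SWAP, APPEND_RESTATEMENT }

// A replay request is only constructible with input version, transform version,
// and publish strategy all stated explicitly.
record ReplayRequest(String inputTable, long inputSnapshotId,
                     String transformVersion, PublishStrategy publishStrategy) {
    ReplayRequest {
        Objects.requireNonNull(inputTable, "inputTable");
        Objects.requireNonNull(transformVersion, "transformVersion");
        Objects.requireNonNull(publishStrategy, "publishStrategy");
        if (transformVersion.isBlank()) {
            throw new IllegalArgumentException("transformVersion must be explicit");
        }
    }
}
```

There is deliberately no default for any field. "Latest input" and "current code" are exactly the implicit choices that make a replay unrepeatable.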


19. Freshness and Latency by Layer

Not every layer needs the same latency.

| Layer | Common latency goal |
|---|---|
| Bronze streaming raw | seconds to minutes |
| Silver canonical events | seconds to hours depending on validation |
| Silver current projection | near-real-time if operational analytics need it |
| Gold daily report | daily/cut-off based |
| Certified regulatory output | approval-based, not purely technical latency |

Cargo cult mistake

Forcing gold to be streaming because bronze is streaming.

That may create unstable business metrics.

Some outputs should wait for:

  • late events,
  • validation windows,
  • reconciliation,
  • approval,
  • reference data availability.

Latency is a product requirement, not a universal virtue.


20. Ownership by Layer

Layering without ownership fails.

| Layer | Typical owner |
|---|---|
| Bronze | ingestion/platform team with source owner support |
| Silver | domain data product team/platform-domain partnership |
| Gold | consuming product/report owner with data engineering support |
| Quarantine | shared source + pipeline owner |
| Certified | accountable business/regulatory owner |

Ownership contract

Each dataset should declare:

dataset: silver.case_event_canonical
owner: enforcement-data-domain
technical_owner: data-platform-team
source_owner: case-management-platform
support_channel: '#data-enforcement-support'
slo:
  freshness: PT15M
  availability: 99.5%
  rejection_rate_threshold: 0.5%

If nobody owns silver, every gold consumer will reinvent it.
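
A declared SLO only helps if something checks it. As a small sketch, the `freshness: PT15M` value above is an ISO-8601 duration that `java.time.Duration` can parse directly. `FreshnessSlo` and `isFresh` are hypothetical names, and measuring age from the last successful publish is an assumption about how freshness is defined.

```java
import java.time.Duration;
import java.time.Instant;

final class FreshnessSlo {
    private FreshnessSlo() {}

    // sloIso8601 uses the same ISO-8601 duration form as the contract, e.g. "PT15M".
    static boolean isFresh(Instant lastSuccessfulPublish, Instant now, String sloIso8601) {
        Duration slo = Duration.parse(sloIso8601);
        Duration age = Duration.between(lastSuccessfulPublish, now);
        return age.compareTo(slo) <= 0;
    }
}
```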


21. Access Control by Layer

Access should become more restrictive or more purpose-specific depending on sensitivity.

Bronze often contains raw PII and source-specific secrets.

Silver may contain cleaned but still sensitive canonical data.

Gold may be aggregated and safer, or it may be highly sensitive because it encodes decisions.

Access model

| Layer | Access default |
|---|---|
| Bronze raw | restricted to platform/source owners |
| Silver canonical | domain-authorized consumers |
| Gold product | product/report audience |
| Quarantine | restricted incident/source/pipeline owners |
| Certified reports | controlled distribution |

Important warning

Gold is not automatically safer than silver.

A gold table can expose sensitive outcomes:

  • risk scores,
  • breach flags,
  • enforcement recommendations,
  • investigation priority,
  • protected attributes through aggregation leakage.

Security follows data meaning, not layer color.


22. Naming and Catalog Structure

Names should express responsibility.

bronze.<domain>_<source>_<entity>_raw
silver.<domain>_<entity>_canonical
silver.<domain>_<entity>_current
gold.<domain>_<product>_<grain>
quarantine.<domain>_<entity>_rejected
certified.<domain>_<report>_<grain>

Examples

bronze.enforcement_cms_case_cdc_raw
silver.enforcement_case_event_canonical
silver.enforcement_case_current
gold.enforcement_case_sla_daily
quarantine.enforcement_case_event_rejected
certified.enforcement_sla_regulatory_daily

Avoid:

case_table_v2
case_table_clean
case_table_gold_new
case_dashboard_final

Bad names hide semantics.

Hidden semantics create production incidents.


23. Table Granularity

The layer is not enough. You also need grain.

Grain answers one question: what does one row represent?

Examples

| Table | Grain |
|---|---|
| bronze.case_cdc_raw | one source CDC event |
| silver.case_event_canonical | one canonical domain event |
| silver.case_current | one case latest state |
| silver.case_assignment_history | one assignment interval |
| gold.case_sla_daily | one reporting date + unit |
| gold.case_agent_daily_workload | one date + agent |

A gold table without declared grain is dangerous.

A silver table without declared grain is worse.

Grain mismatch bug

If silver.case_current has one row per case but a downstream job treats it as one row per case assignment, metrics inflate.

Declare grain in the contract.
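
A declared grain can also be verified. The sketch below counts rows per grain key and reports every key that appears more than once. `GrainCheck` and `duplicates` are hypothetical names, and the check runs in memory for illustration; on a real table the same idea is a group-by with a count filter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class GrainCheck {
    private GrainCheck() {}

    // Returns the grain keys that occur more than once, with their row counts.
    static <T, K> Map<K, Long> duplicates(List<T> rows, Function<T, K> grainKey) {
        Map<K, Long> counts = new HashMap<>();
        for (T row : rows) {
            counts.merge(grainKey.apply(row), 1L, Long::sum);
        }
        counts.values().removeIf(n -> n == 1L); // keep only violations of the grain
        return counts;
    }
}
```

For a table declared as one row per case, the grain key is the case ID and the expected result is an empty map. A non-empty result means either the table or its declared grain is wrong.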


24. Multi-Hop Does Not Mean Multi-Copy Everything

A layered architecture creates copies.

Copies cost money and complexity.

But they also isolate responsibilities.

The question is not “should we minimize copies?”

The question is:

Which semantic boundaries are important enough to materialize?

Materialize when

  • boundary is reused by many consumers,
  • validation is expensive,
  • contract needs stable snapshot,
  • audit requires evidence,
  • latency requires serving shape,
  • source replay is costly,
  • downstream isolation matters.

Do not materialize when

  • transformation is trivial and not reused,
  • data is temporary,
  • consumer can safely query upstream view,
  • materialization creates governance overhead without value,
  • freshness/cost trade-off is poor.

Layering is a trade-off.

Not a moral rule.


25. Views vs Tables

A layer can be a view or a table.

Use a view when

  • transform is lightweight,
  • source table is stable,
  • no independent retention needed,
  • performance is acceptable,
  • consumers can tolerate upstream changes,
  • audit does not require frozen output.

Use a materialized table when

  • transform is expensive,
  • output must be versioned,
  • quality gate must publish atomically,
  • downstream requires stable snapshot,
  • report must be reproducible,
  • input sources are volatile,
  • cross-engine access matters.

Rule

Gold reports that require defensibility should usually be materialized and versioned, not only exposed as ephemeral views.

Views are useful.

But views are not a replacement for publication boundaries.


26. Case Study: Regulatory Enforcement Lifecycle

Assume a case management system emits operational changes:

  • case created,
  • status changed,
  • assigned officer changed,
  • escalation triggered,
  • deadline breached,
  • enforcement action issued,
  • case closed.

Bronze

bronze.enforcement_case_cdc_raw

Captures:

  • database table,
  • operation,
  • before/after payload,
  • transaction ID,
  • log position,
  • source commit time,
  • ingestion time,
  • schema fingerprint.

Silver canonical events

silver.enforcement_case_event_canonical

Produces:

  • CASE_CREATED,
  • CASE_STATUS_CHANGED,
  • CASE_ASSIGNED,
  • CASE_ESCALATED,
  • CASE_DEADLINE_BREACHED,
  • ENFORCEMENT_ACTION_ISSUED,
  • CASE_CLOSED.

Silver current state

silver.enforcement_case_current

One row per case.

Includes latest known:

  • status,
  • owner,
  • risk tier,
  • SLA state,
  • current escalation level,
  • open/closed flag,
  • last source position.

Gold daily SLA

gold.enforcement_case_sla_daily

One row per reporting date + regulatory unit.

Includes:

  • open cases,
  • cases due today,
  • overdue cases,
  • breach count,
  • breach rate,
  • average age,
  • restatement version,
  • source snapshot IDs.

Certified report

certified.enforcement_sla_regulatory_daily

Published after:

  • quality gates pass,
  • reconciliation succeeds,
  • owner approval is recorded,
  • evidence bundle is stored.

This is bronze/silver/gold as responsibility boundaries.

Not as naming decoration.


27. Layer-Specific Failure Models

Each layer fails differently.

Bronze failures

  • source unavailable,
  • file incomplete,
  • API cursor invalid,
  • CDC log retention lost,
  • payload unreadable,
  • schema not captured,
  • source checkpoint duplicate.

Silver failures

  • source field mapping invalid,
  • unknown enum value,
  • missing identity,
  • timestamp ambiguity,
  • duplicate event ID,
  • invalid state transition,
  • reference data missing,
  • PII classification fails.

Gold failures

  • aggregate mismatch,
  • metric definition ambiguity,
  • input snapshot unavailable,
  • report cut-off wrong,
  • late corrections after publication,
  • dashboard query exceeds SLA,
  • consumer expects old schema.

Certified failures

  • evidence incomplete,
  • approval missing,
  • retention policy wrong,
  • report restatement not traceable,
  • external export inconsistent.

A mature runbook is layer-specific.

Generic “pipeline failed” alerts are not enough.


28. Observability by Layer

Bronze metrics

  • source lag,
  • ingestion lag,
  • records ingested,
  • decode failures,
  • source checkpoint progress,
  • raw bytes,
  • duplicate source positions,
  • file completeness failures,
  • API rate-limit events.

Silver metrics

  • accepted records,
  • rejected records,
  • rejection rate by reason,
  • duplicate event IDs,
  • reference lookup misses,
  • late event count,
  • schema version distribution,
  • canonicalization latency.

Gold metrics

  • output row count,
  • aggregate checksum,
  • metric drift,
  • freshness vs SLA,
  • publication delay,
  • restatement count,
  • downstream query latency,
  • report approval latency.

Certified metrics

  • evidence completeness,
  • approval SLA,
  • export success,
  • consumer delivery status,
  • retention/legal hold status.

Layer names should appear in dashboards, but metrics should reflect responsibilities.


29. Layered Pipeline DAG

A simplified orchestration model:

Notice that the DAG is not merely a graph of technical dependencies.

Each edge crosses a semantic boundary.

That is what makes it reviewable.
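
As a rough illustration of such a DAG, the sketch below registers datasets as nodes and derives an execution order with Kahn's algorithm. `LayeredDag` is a hypothetical class, and which edges exist in a real pipeline is an assumption; the dataset names in the usage note are borrowed from the case study in section 26.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LayeredDag {
    // Adjacency list: upstream dataset -> downstream datasets, in insertion order.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    LayeredDag edge(String from, String to) {
        downstream.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        downstream.computeIfAbsent(to, k -> new ArrayList<>());
        return this;
    }

    // Kahn's algorithm: a dataset runs only after all of its upstream datasets.
    List<String> executionOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        downstream.keySet().forEach(n -> inDegree.put(n, 0));
        downstream.values().forEach(ts -> ts.forEach(t -> inDegree.merge(t, 1, Integer::sum)));

        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((n, d) -> { if (d == 0) ready.add(n); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.remove();
            order.add(n);
            for (String t : downstream.get(n)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) ready.add(t);
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle in DAG");
        }
        return order;
    }
}
```

Chaining `edge("bronze.enforcement_case_cdc_raw", "silver.enforcement_case_event_canonical")` and so on through `certified.enforcement_sla_regulatory_daily` yields an order in which every dataset follows its inputs. In a review, each `edge` call is a place to ask which contract guards that boundary.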


30. When to Skip Bronze/Silver/Gold

Skip bronze only when

Rarely.

Maybe if:

  • source is already immutable and versioned,
  • source contract is strong,
  • raw data is accessible elsewhere with retention,
  • pipeline is not required to replay,
  • governance accepts the risk.

For most production systems, especially regulated systems, skipping raw preservation is risky.

Skip silver when

Maybe if:

  • source data is already canonical for your domain,
  • there is one narrow consumer,
  • no reuse is expected,
  • no semantic normalization is needed.

But be careful. Silver often becomes necessary later.

Skip gold when

Often.

If silver already serves the consumer efficiently and safely, do not create a gold table just because the pattern says so.

Add extra layers when

  • certification is needed,
  • quarantine needs lifecycle management,
  • PII isolation is required,
  • ML features need point-in-time correctness,
  • external exports require delivery evidence.

Architecture should follow risk and responsibility.


31. The “Clean Data” Trap

“Clean data” is not precise enough.

Clean according to what?

  • syntactic validity,
  • domain validity,
  • completeness,
  • consistency,
  • freshness,
  • uniqueness,
  • privacy policy,
  • report definition,
  • consumer expectation?

A silver table may be valid for canonical events but not suitable for a dashboard.

A gold table may be suitable for one report but misleading for another.

Replace “clean” with explicit guarantees.

Example:

dataset: silver.case_event_canonical
guarantees:
  - event_id is globally unique within domain
  - aggregate_id is non-null
  - event_type is from registered taxonomy
  - event_time is normalized to UTC
  - source_record_ref is preserved
  - rejected records are stored in silver.case_event_rejected
non_guarantees:
  - does not guarantee case is currently open
  - does not aggregate by regulatory unit
  - does not certify report metrics

Non-guarantees are just as important as guarantees.


32. Data Product Thinking

A well-designed silver or gold dataset should behave like a product.

It needs:

  • owner,
  • contract,
  • SLO,
  • versioning,
  • changelog,
  • support channel,
  • deprecation policy,
  • access model,
  • lineage,
  • quality report,
  • examples.

Layering helps organize products, but it does not replace product ownership.

Dataset README template

# silver.enforcement_case_event_canonical

## Purpose
Reusable canonical event stream for enforcement case lifecycle.

## Grain
One row per canonical case lifecycle event.

## Sources
- bronze.enforcement_case_cdc_raw

## Guarantees
- stable event_id
- source_record_ref preserved
- event_time normalized to UTC

## Non-guarantees
- not a current-state table
- not report-certified

## Owner
Enforcement Data Domain

## SLO
Freshness: 15 minutes p95

This is the kind of documentation that prevents misuse.


33. Production Checklist

Before accepting a bronze/silver/gold design, review these questions.

Bronze

  • What exactly is preserved from source?
  • Is source checkpoint stored?
  • Is raw payload or raw reference retained?
  • Is ingestion idempotent?
  • Are unreadable records captured?
  • Is PII classified?
  • Is retention intentional?

Silver

  • What canonical semantics are established?
  • What is the grain?
  • What validations run?
  • Where do rejected records go?
  • Are corrections represented?
  • Can silver be rebuilt from bronze?
  • Are source references preserved?

Gold

  • What use case does it serve?
  • What metric definitions apply?
  • What input versions are used?
  • Is output materialized or view-based?
  • Is publication atomic?
  • Can the report be restated?
  • Is owner approval required?

Cross-layer

  • Is lineage captured?
  • Are quality gates enforced?
  • Are access policies correct?
  • Are freshness SLOs realistic?
  • Is backfill isolated?
  • Are layer boundaries meaningful?
  • Are non-guarantees documented?

34. Final Mental Model

Bronze, silver, and gold are not maturity levels.

They are responsibility boundaries.

Use this framing:

Bronze: preserve source reality.
Silver: establish reusable domain reality.
Gold: publish product-specific decision reality.

Then add missing boundaries when needed:

Quarantine: preserve invalid evidence.
Certified: preserve approved/reportable truth.
Staging: isolate unvalidated output.

A top-level engineer does not ask:

Do we have bronze, silver, and gold?

They ask:

What invariant changes at each boundary?
What can be replayed?
What can be audited?
What can be corrected?
Who owns each dataset?
What exactly is guaranteed?

That is how you avoid cargo cult architecture.


References

  • Databricks medallion architecture documentation and articles: bronze, silver, and gold as progressively refined lakehouse layers.
  • Apache Iceberg documentation: snapshots, table metadata, schema evolution, partition evolution, maintenance, and time travel.
  • OpenLineage concepts: dataset-level lineage and run tracking.
  • Great Expectations / GX concepts: data quality validation, expectations, and checkpoints.
  • Data mesh/data product operating concepts: ownership, contracts, SLOs, and domain-oriented data products.