Data Lineage and Provenance
Learn Database Design and Architect - Part 063
Data lineage dan provenance untuk database architecture: origin, transformation, dependency graph, evidence-grade traceability, impact analysis, reconstruction, governance, dan operational proof.
Part 063 — Data Lineage and Provenance
Target bagian ini: kamu bisa mendesain database dan platform data yang mampu menjawab data ini berasal dari mana, diproses oleh apa, berubah karena apa, dipakai oleh siapa, dan apa dampaknya jika berubah atau salah. Fokusnya adalah database architecture, metadata model, impact analysis, evidence-grade traceability, dan operasionalisasi lineage di production.
Lineage bukan diagram cantik di katalog data.
Lineage adalah kemampuan sistem untuk menjelaskan perjalanan data dari sumber sampai penggunaan akhir. Provenance adalah versi yang lebih evidence-oriented: bukan hanya “A mengalir ke B”, tetapi siapa/apa yang menghasilkan data, aktivitas apa yang mengubahnya, input apa yang dipakai, kapan terjadi, versi rule apa yang berlaku, dan bukti apa yang mendukung output.
Di sistem regulatory, enforcement, finance, healthcare, atau case management, lineage menjawab pertanyaan defensible:
- Keputusan ini dibuat memakai data apa?
- Data itu berasal dari sistem mana?
- Transformasi apa yang diterapkan?
- Versi aturan/model apa yang dipakai?
- Apakah data sudah dikoreksi setelah keputusan dibuat?
- Report mana yang terdampak jika kolom atau rule berubah?
- Bagaimana membuktikan bahwa export, dashboard, dan evidence package konsisten dengan source-of-truth saat itu?
Architectural rule:
If data can influence a decision, the path from source to decision must be explainable.
1. Mental model: lineage is a dependency graph plus evidence
Lineage paling sederhana adalah dependency graph.
Tetapi production-grade lineage tidak cukup hanya graph. Ia harus punya metadata yang menjawab:
| Question | Required metadata |
|---|---|
| Data berasal dari mana? | source system, source entity, source key, extraction timestamp |
| Diproses oleh apa? | job name, job version, code version, rule version, config version |
| Input apa yang dipakai? | input dataset version, upstream watermark, file checksum, query snapshot |
| Output apa yang dibuat? | output dataset, row/entity key, output version, commit/batch id |
| Siapa/apa responsible? | service account, actor, workflow, owner team |
| Kapan terjadi? | event time, processing time, effective time, transaction time |
| Bagaimana diverifikasi? | row count, checksum, reconciliation result, validation status |
| Siapa mengonsumsi? | dashboard, API, report, export, model, downstream job |
W3C PROV memberi model dasar yang berguna: provenance menjelaskan penggunaan dan produksi entities oleh activities, yang dapat dipengaruhi oleh agents. Model ini cocok sebagai mental model, meskipun implementasi database tidak harus memakai RDF/ontology.
Terjemahan database-oriented:
| PROV concept | Database/platform example |
|---|---|
| Entity | table, row, file, event, report, model output |
| Activity | ETL job, migration, backfill, API command, rule engine run |
| Agent | service account, user, team, scheduler, external system |
2. Lineage vs provenance vs audit log
Tiga istilah ini sering dicampur. Untuk architect, bedakan dengan tajam.
| Concept | Main question | Typical granularity | Example |
|---|---|---|---|
| Audit log | Who did what? | action/event | User approved case at 10:04 |
| Lineage | What depends on what? | dataset/table/column/job/row | Report uses warehouse fact derived from case_event |
| Provenance | How was this output produced? | entity/activity/agent/evidence | This risk score was generated by model v3 using input snapshot X |
Audit log membantu accountability. Lineage membantu dependency/impact analysis. Provenance membantu explainability dan defensibility.
Contoh:
A case decision was approved.
Audit log menjawab:
Actor: reviewer_123
Action: APPROVE_DECISION
Time: 2026-07-05T09:21:33Z
Lineage menjawab:
Decision record depends on:
- case profile projection v12
- sanctions screening result v5
- evidence document metadata v8
- policy rule table version 2026.06
Provenance menjawab:
The risk classification was generated by rule_engine@sha256:...
using source events up to LSN 42/BF9E2A10,
policy bundle 2026.06.14,
and external registry snapshot 2026-07-01.
Validation passed: checksum=..., input_count=..., exception_count=0.
Rule:
Audit says what happened. Lineage says what depends on what. Provenance says how an output came into existence.
3. Lineage levels
Lineage punya beberapa level. Jangan langsung mengejar row-level lineage jika table/column-level belum rapi.
| Level | Meaning | Use case | Cost |
|---|---|---|---|
| System-level | system A feeds system B | architecture overview | low |
| Dataset/table-level | table/file/topic A feeds table/report B | impact analysis | medium |
| Column-level | target column derived from source columns | schema change safety | medium-high |
| Row/entity-level | output row/entity derived from input rows/events | regulatory evidence, debugging | high |
| Cell/value-level | target value derived from exact source values | high assurance, scientific/research systems | very high |
Recommendation:
- Start with system/dataset lineage.
- Add column lineage for shared contracts, reports, and derived fields.
- Add row/entity lineage only where decisions, money, rights, compliance, or evidence are affected.
- Use value-level lineage only when cost is justified.
Failure mode:
Teams attempt universal row-level lineage and end up with expensive metadata nobody trusts.
Better:
Capture coarse lineage everywhere; capture fine-grained provenance at decision-critical boundaries.
4. Data product lineage contract
Setiap data product yang production-grade harus punya lineage contract.
name: regulatory_case_summary
owner: enforcement-data-platform
classification: internal-restricted
source_of_truth: false
canonical_source:
- operational.case
- operational.case_status_transition
- operational.evidence_document
freshness_sla: P15M
rebuildable: true
retention_policy: CASE_RETENTION_7Y
lineage_level:
dataset: required
column: required
row: required_for_decision_fields
quality_gates:
- row_count_reconciliation
- orphan_case_check
- status_transition_validity
consumers:
- case_dashboard
- monthly_enforcement_report
- evidence_export_service
Lineage contract menjelaskan:
| Field | Why it matters |
|---|---|
| owner | Siapa yang memperbaiki jika lineage rusak |
| canonical source | Menghindari circular truth |
| freshness SLA | Menjelaskan staleness downstream |
| rebuildable | Menentukan recovery strategy |
| lineage level | Mengontrol biaya metadata |
| quality gates | Membuktikan transformasi valid |
| consumers | Impact analysis saat perubahan |
5. Metadata model: minimum viable lineage schema
Untuk internal platform, mulai dari schema metadata yang sederhana tapi kuat.
CREATE TABLE data_asset (
asset_id uuid PRIMARY KEY,
asset_type text NOT NULL CHECK (asset_type IN (
'database', 'schema', 'table', 'column', 'topic', 'file', 'report', 'api', 'model', 'dashboard'
)),
asset_urn text NOT NULL UNIQUE,
display_name text NOT NULL,
owner_team text NOT NULL,
classification text NOT NULL,
system_name text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
retired_at timestamptz
);
CREATE TABLE data_job (
job_id uuid PRIMARY KEY,
job_urn text NOT NULL UNIQUE,
job_type text NOT NULL CHECK (job_type IN (
'etl', 'cdc', 'api_command', 'migration', 'backfill', 'report_build', 'ml_inference'
)),
owner_team text NOT NULL,
code_repository text,
description text,
created_at timestamptz NOT NULL DEFAULT now(),
retired_at timestamptz
);
CREATE TABLE data_lineage_edge (
lineage_edge_id uuid PRIMARY KEY,
upstream_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
downstream_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
job_id uuid REFERENCES data_job(job_id),
transformation_type text NOT NULL CHECK (transformation_type IN (
'copy', 'filter', 'join', 'aggregate', 'derive', 'mask', 'tokenize', 'enrich', 'project', 'export'
)),
lineage_level text NOT NULL CHECK (lineage_level IN ('system', 'dataset', 'column', 'row', 'value')),
confidence text NOT NULL CHECK (confidence IN ('declared', 'observed', 'verified')),
valid_from timestamptz NOT NULL DEFAULT now(),
valid_to timestamptz,
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX idx_lineage_upstream ON data_lineage_edge(upstream_asset_id);
CREATE INDEX idx_lineage_downstream ON data_lineage_edge(downstream_asset_id);
CREATE INDEX idx_lineage_job ON data_lineage_edge(job_id);
Design notes:
| Column | Reason |
|---|---|
asset_urn | Stable identity across tools |
confidence | Distinguishes declared vs observed vs verified lineage |
valid_from/valid_to | Lineage changes over time |
metadata jsonb | Extensible without blocking core model |
Do not make lineage purely free-form JSON. You need typed edges for graph traversal and impact analysis.
6. Asset URN convention
Lineage graph dies when identity is inconsistent. Define URN convention early.
Examples:
postgres://prod-core/enforcement/public.case
postgres://prod-core/enforcement/public.case.status
kafka://prod/enforcement.case-events.v1
s3://regulated-lake/bronze/case_events/dt=2026-07-05/
dashboard://superset/enforcement-case-overview
report://regulatory/monthly-enforcement-summary/v2
api://case-service/GET /cases/{caseId}
model://risk-classifier/v3
Rules:
- URN must be globally unique.
- URN must be stable enough for metadata history.
- URN must not embed volatile deployment detail unless environment-specific lineage is intentional.
- Table and column URNs must be separable.
- External data sources must have explicit source identities.
Bad:
case_table
summary_report
source_file
Good:
postgres://prod-core/enforcement/public.case
report://regulator-x/monthly-enforcement-summary/v2
sftp://agency-y/imports/registry_snapshot_2026_07_01.csv
7. Dataset-level lineage
Dataset-level lineage answers: “If this table changes, what breaks?”
Example:
INSERT INTO data_lineage_edge (
lineage_edge_id,
upstream_asset_id,
downstream_asset_id,
job_id,
transformation_type,
lineage_level,
confidence,
metadata
)
VALUES (
gen_random_uuid(),
:case_table_asset_id,
:case_summary_table_asset_id,
:case_summary_job_id,
'aggregate',
'dataset',
'verified',
jsonb_build_object(
'refresh_mode', 'incremental',
'freshness_sla', 'P15M',
'watermark_column', 'updated_at'
)
);
Use dataset-level lineage for:
- migration impact analysis
- ownership review
- consumer notification
- stale data debugging
- data product inventory
- report dependency mapping
Common anti-pattern:
“Only the pipeline team knows dependencies.”
That creates hidden coupling. Dependencies must be inspectable.
8. Column-level lineage
Column-level lineage answers: “Can we rename/drop/change semantics of this column?”
Schema:
CREATE TABLE column_lineage_edge (
column_lineage_edge_id uuid PRIMARY KEY,
source_column_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
target_column_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
job_id uuid REFERENCES data_job(job_id),
expression text,
transformation_type text NOT NULL,
confidence text NOT NULL CHECK (confidence IN ('declared', 'observed', 'verified')),
valid_from timestamptz NOT NULL DEFAULT now(),
valid_to timestamptz
);
CREATE INDEX idx_column_lineage_source ON column_lineage_edge(source_column_asset_id);
CREATE INDEX idx_column_lineage_target ON column_lineage_edge(target_column_asset_id);
Example mappings:
| Target column | Source columns | Transformation |
|---|---|---|
case_summary.current_status | case.status | copy |
case_summary.age_days | case.created_at, now() | derive |
monthly_report.total_closed_cases | case.closed_at, case.status | aggregate/filter |
search_document.display_name | person.legal_name | mask/project |
Column lineage should preserve semantic meaning, not only SQL text.
Example:
target: warehouse.case_fact.days_to_close
sources:
- operational.case.opened_at
- operational.case.closed_at
expression: date_diff('day', opened_at, closed_at)
semantic_contract: calendar days from case opening to terminal closure
null_semantics: null when case is not terminal
version: 2
If the semantic contract changes, version the derived field.
9. Row/entity-level provenance
Row-level lineage is expensive but valuable for decisions.
Use it for:
- risk score explanation
- regulatory decision evidence
- financial ledger reconciliation
- case recommendation
- automated enforcement action
- AI/ML inference evidence
- customer-impacting eligibility decision
Minimal schema:
CREATE TABLE provenance_run (
run_id uuid PRIMARY KEY,
job_id uuid NOT NULL REFERENCES data_job(job_id),
run_status text NOT NULL CHECK (run_status IN ('running', 'succeeded', 'failed', 'cancelled')),
started_at timestamptz NOT NULL,
completed_at timestamptz,
code_version text NOT NULL,
config_version text,
rule_version text,
input_watermark text,
output_watermark text,
actor_type text NOT NULL CHECK (actor_type IN ('scheduler', 'service', 'user', 'external_system')),
actor_id text NOT NULL,
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE TABLE entity_provenance (
entity_provenance_id uuid PRIMARY KEY,
run_id uuid NOT NULL REFERENCES provenance_run(run_id),
output_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
output_entity_key text NOT NULL,
input_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
input_entity_key text NOT NULL,
relationship_type text NOT NULL CHECK (relationship_type IN (
'copied_from', 'derived_from', 'aggregated_from', 'joined_with', 'validated_against', 'enriched_by'
)),
captured_at timestamptz NOT NULL DEFAULT now(),
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX idx_entity_prov_output ON entity_provenance(output_asset_id, output_entity_key);
CREATE INDEX idx_entity_prov_input ON entity_provenance(input_asset_id, input_entity_key);
CREATE INDEX idx_entity_prov_run ON entity_provenance(run_id);
Example:
Output: risk_assessment(case_id=CASE-123, version=8)
Derived from:
- case(case_id=CASE-123, version=27)
- evidence_document(document_id=DOC-991, checksum=...)
- screening_result(screening_id=SCR-77, version=5)
- policy_bundle(policy_id=POLICY-2026-06, checksum=...)
Generated by:
- job: risk-classifier
- code: git sha abc123
- rule version: 2026.06.14
- run: 7df4...
Important: do not store sensitive source values unnecessarily in provenance. Store keys, checksums, versions, and references unless value-level proof is required.
10. Provenance for decision records
For decision-grade systems, attach provenance directly to decisions.
CREATE TABLE decision_record (
decision_id uuid PRIMARY KEY,
case_id uuid NOT NULL,
decision_type text NOT NULL,
decision_status text NOT NULL,
decision_outcome text NOT NULL,
decided_by_actor_id text NOT NULL,
decided_at timestamptz NOT NULL,
rationale text NOT NULL,
evidence_package_id uuid,
provenance_run_id uuid REFERENCES provenance_run(run_id),
source_snapshot_ref text NOT NULL,
policy_version text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
Why embed provenance references?
Because decisions are not only current facts. They are outputs of a context at a time.
A later correction should not silently rewrite what the decision saw. Instead, model:
Pattern:
- Store current state separately.
- Store decision context snapshot/provenance immutably.
- If source data changes later, create an impact review rather than mutating old decision explanation.
11. Snapshot references and reproducibility
Lineage says what depends on what. Reproducibility says whether we can recompute the same output.
For reproducible report/decision/model output, capture:
| Metadata | Example |
|---|---|
| source snapshot | database transaction timestamp, LSN, backup snapshot id, file checksum |
| code version | git SHA, container digest |
| config version | feature flag snapshot, YAML checksum |
| rule/model version | policy bundle id, model registry version |
| environment | engine version, locale/timezone, dependency versions |
| input watermark | Kafka offset, CDC LSN, batch id |
| output checksum | row count, aggregate checksum, report hash |
Example report build table:
CREATE TABLE report_build (
report_build_id uuid PRIMARY KEY,
report_name text NOT NULL,
report_version text NOT NULL,
period_start date NOT NULL,
period_end date NOT NULL,
built_at timestamptz NOT NULL DEFAULT now(),
built_by_job_id uuid NOT NULL REFERENCES data_job(job_id),
source_snapshot_ref text NOT NULL,
code_version text NOT NULL,
config_version text NOT NULL,
row_count bigint NOT NULL,
output_checksum text NOT NULL,
validation_status text NOT NULL CHECK (validation_status IN ('passed', 'failed', 'waived')),
storage_uri text NOT NULL
);
Rule:
A regulatory report without source snapshot and build metadata is a screenshot, not a defensible artifact.
12. Lineage capture mechanisms
Lineage can be captured in multiple ways.
| Mechanism | Example | Strength | Weakness |
|---|---|---|---|
| Declared metadata | YAML contract, dbt model metadata | simple, explicit | can drift from reality |
| SQL parsing | parse transformation queries | useful for column lineage | misses runtime/dynamic logic |
| Runtime instrumentation | emit events from jobs/services | accurate execution lineage | requires engineering discipline |
| CDC/log analysis | derive from change streams | useful for data movement | weak semantic context |
| Catalog integration | scan warehouse/database metadata | broad inventory | shallow semantics |
| Manual curation | owner documents dependency | handles complex semantics | costly and stale-prone |
Best practice: combine declared + observed + verified.
Use confidence level:
| Confidence | Meaning |
|---|---|
| declared | owner says dependency exists |
| observed | system observed data/job access |
| verified | dependency validated by tests/reconciliation/runtime event |
13. Runtime lineage events
A practical lineage event should be small, structured, and emitted by jobs/services.
{
"eventType": "LINEAGE_RUN_COMPLETED",
"runId": "7df4b8d7-7a75-4a0f-9f9b-3b6b5d8b0d1a",
"job": {
"name": "case-summary-builder",
"version": "git:abc123",
"owner": "enforcement-data-platform"
},
"inputs": [
{
"assetUrn": "postgres://prod-core/enforcement/public.case",
"snapshotRef": "lsn:42/BF9E2A10",
"rowCount": 982331
},
{
"assetUrn": "postgres://prod-core/enforcement/public.case_status_transition",
"snapshotRef": "lsn:42/BF9E2A10",
"rowCount": 3341220
}
],
"outputs": [
{
"assetUrn": "postgres://analytics/enforcement/public.case_summary",
"rowCount": 982331,
"checksum": "sha256:..."
}
],
"startedAt": "2026-07-05T01:00:00Z",
"completedAt": "2026-07-05T01:04:12Z",
"status": "succeeded"
}
This is close to OpenLineage-style thinking: jobs, datasets, runs, and contextual metadata. You do not need to implement the OpenLineage spec directly, but adopting its vocabulary avoids inventing a weak custom model.
14. Impact analysis queries
Lineage becomes valuable when it answers questions quickly.
14.1 Downstream impact
“What depends on case.status?”
WITH RECURSIVE downstream AS (
SELECT
le.upstream_asset_id,
le.downstream_asset_id,
1 AS depth,
ARRAY[le.upstream_asset_id, le.downstream_asset_id] AS path
FROM data_lineage_edge le
WHERE le.upstream_asset_id = :asset_id
AND le.valid_to IS NULL
UNION ALL
SELECT
d.downstream_asset_id,
le.downstream_asset_id,
d.depth + 1,
d.path || le.downstream_asset_id
FROM downstream d
JOIN data_lineage_edge le
ON le.upstream_asset_id = d.downstream_asset_id
AND le.valid_to IS NULL
WHERE NOT le.downstream_asset_id = ANY(d.path)
)
SELECT
d.depth,
a.asset_urn,
a.asset_type,
a.owner_team,
a.classification
FROM downstream d
JOIN data_asset a ON a.asset_id = d.downstream_asset_id
ORDER BY d.depth, a.asset_urn;
14.2 Upstream root cause
“Report total looks wrong. What inputs produced it?”
WITH RECURSIVE upstream AS (
SELECT
le.downstream_asset_id,
le.upstream_asset_id,
1 AS depth,
ARRAY[le.downstream_asset_id, le.upstream_asset_id] AS path
FROM data_lineage_edge le
WHERE le.downstream_asset_id = :report_asset_id
AND le.valid_to IS NULL
UNION ALL
SELECT
u.upstream_asset_id,
le.upstream_asset_id,
u.depth + 1,
u.path || le.upstream_asset_id
FROM upstream u
JOIN data_lineage_edge le
ON le.downstream_asset_id = u.upstream_asset_id
AND le.valid_to IS NULL
WHERE NOT le.upstream_asset_id = ANY(u.path)
)
SELECT
u.depth,
a.asset_urn,
a.asset_type,
a.owner_team
FROM upstream u
JOIN data_asset a ON a.asset_id = u.upstream_asset_id
ORDER BY u.depth, a.asset_urn;
14.3 Consumer notification
“Who must be notified if this column semantic changes?”
SELECT DISTINCT
consumer.asset_urn AS consumer_urn,
consumer.owner_team
FROM column_lineage_edge cle
JOIN data_asset target_col
ON target_col.asset_id = cle.target_column_asset_id
JOIN data_lineage_edge le
ON le.upstream_asset_id = target_col.asset_id
JOIN data_asset consumer
ON consumer.asset_id = le.downstream_asset_id
WHERE cle.source_column_asset_id = :source_column_asset_id
AND cle.valid_to IS NULL
AND le.valid_to IS NULL;
15. Lineage and schema evolution
Schema evolution without lineage is guesswork.
Before changing a table/column, ask:
| Change | Lineage question |
|---|---|
| drop column | downstream consumers? derived columns? reports? exports? |
| rename column | contract alias? parser impact? dbt/SQL jobs? |
| type change | downstream cast? precision loss? warehouse schema? |
| semantic change | consumers expecting old meaning? report historical comparability? |
| split table | lineage rewrite? consumer migration? rebuild path? |
| merge entities | provenance of merged identity? duplicate resolution evidence? |
| mask/tokenize PII | search/export/report impact? re-identification path? |
Migration gate:
Never approve destructive schema changes based only on source repository search. Repository search misses dashboards, BI tools, ad-hoc reports, CDC consumers, external exports, and notebooks.
16. Lineage and data quality
Lineage helps diagnose quality incidents.
Example incident:
Monthly closed case count is 12% lower than expected.
Diagnosis path:
- Identify affected report build.
- Read report provenance: source snapshot, job version, config version.
- Traverse upstream lineage to source tables/jobs.
- Compare row counts/checksums with previous successful build.
- Check recent schema/config/job changes on upstream assets.
- Isolate whether error is source, transform, filter, aggregation, or freshness.
- Rebuild from known-good snapshot if needed.
Quality metadata table:
CREATE TABLE data_quality_result (
quality_result_id uuid PRIMARY KEY,
run_id uuid NOT NULL REFERENCES provenance_run(run_id),
asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
rule_name text NOT NULL,
rule_version text NOT NULL,
status text NOT NULL CHECK (status IN ('passed', 'failed', 'warning', 'waived')),
checked_at timestamptz NOT NULL DEFAULT now(),
observed_value numeric,
expected_value numeric,
details jsonb NOT NULL DEFAULT '{}'::jsonb
);
Attach quality result to provenance. A derived asset without quality metadata is hard to trust.
17. Lineage for PII and privacy
Lineage is mandatory for privacy operations because personal data propagates.
Questions:
- Where does PII originate?
- Which derived tables contain PII?
- Which search indexes, vector indexes, exports, reports, and backups include it?
- Which transformations mask/tokenize/anonymize it?
- Which downstream stores must receive erasure or anonymization commands?
PII lineage example:
Metadata to add:
ALTER TABLE data_asset
ADD COLUMN contains_pii boolean NOT NULL DEFAULT false,
ADD COLUMN pii_categories text[] NOT NULL DEFAULT '{}',
ADD COLUMN privacy_transform text;
Privacy transform examples:
| Transform | Meaning |
|---|---|
| none | identifiable data remains |
| masked | displayed partially |
| tokenized | reversible tokenization exists |
| pseudonymized | linkable but not directly identifying without key |
| anonymized | no reasonable re-identification path under threat model |
Be careful: aggregation does not automatically mean anonymous. Small cells can re-identify.
18. Lineage for AI, scoring, and automated decisions
If a model or rule engine influences decisions, provenance must capture more than input/output.
Capture:
| Metadata | Why |
|---|---|
| model id/version | identifies exact model behavior |
| feature set version | explains feature semantics |
| training data lineage | reveals upstream bias/change |
| inference input snapshot | explains individual decision |
| prompt/template version | for LLM-based workflows |
| tool/output references | for agentic workflows |
| confidence/score | decision explanation |
| human override | accountability |
| policy threshold version | why score led to outcome |
Schema extension:
CREATE TABLE model_inference_provenance (
inference_id uuid PRIMARY KEY,
model_name text NOT NULL,
model_version text NOT NULL,
feature_set_version text NOT NULL,
input_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
input_entity_key text NOT NULL,
output_asset_id uuid NOT NULL REFERENCES data_asset(asset_id),
output_entity_key text NOT NULL,
inference_time timestamptz NOT NULL,
code_version text NOT NULL,
config_version text NOT NULL,
input_checksum text NOT NULL,
output_checksum text NOT NULL,
explanation_ref text,
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
Do not store raw prompts or full input payloads blindly if they may contain PII, secrets, or privileged investigation data. Store classified references and redacted evidence where possible.
19. Lineage storage design
Lineage is graph-shaped, but you can start relational.
Options:
| Storage | Good for | Weakness |
|---|---|---|
| Relational tables | governance, SQL queries, constraints | recursive graph traversal can become heavy |
| Graph database | deep traversal, impact graph exploration | operationally another engine |
| Search index | discovery, keyword lookup | weak consistency and graph semantics |
| Data catalog platform | UI, stewardship, integrations | may not support custom row-level provenance |
| Event log | immutable lineage events | needs materialized query model |
Practical architecture:
This mirrors a common architecture pattern:
- event log is source of lineage changes
- relational store enforces metadata integrity
- graph/search projections optimize exploration
20. Lineage event immutability and correction
Lineage/provenance can be wrong. Do not update historical facts silently.
Pattern:
CREATE TABLE lineage_event (
lineage_event_id uuid PRIMARY KEY,
event_type text NOT NULL,
event_time timestamptz NOT NULL,
emitted_by text NOT NULL,
correlation_id text,
payload jsonb NOT NULL,
payload_hash text NOT NULL,
supersedes_event_id uuid REFERENCES lineage_event(lineage_event_id)
);
Correction strategy:
| Situation | Handling |
|---|---|
| missing lineage edge | append correction event |
| wrong edge | append superseding event, mark old edge invalid |
| wrong run metadata | append correction with reason |
| wrong report provenance | create corrected report build or annotation |
| wrong decision provenance | create evidence correction and impact review |
Rule:
Provenance correction must itself be provenance-tracked.
21. Lineage freshness and drift detection
Lineage can drift from reality.
Drift examples:
- job changed SQL but lineage metadata not updated
- dashboard directly queries source table without catalog registration
- emergency script writes derived table outside pipeline
- CDC consumer added without owner registration
- BI user creates unmanaged report using restricted table
- schema migration changes column meaning without contract update
Detection mechanisms:
| Signal | Detection |
|---|---|
| unexpected DB query consumer | database logs / query telemetry |
| unregistered table write | audit trigger / permissions / DDL event trigger |
| stale declared metadata | CI check compares SQL model to lineage declaration |
| missing consumer owner | catalog required field validation |
| dashboard unmanaged query | BI metadata scanner |
| CDC consumer unknown | broker ACL / consumer group inventory |
Lineage freshness metric:
lineage_coverage = assets_with_owner_and_upstream / total_production_assets
verified_lineage_coverage = verified_edges / total_edges
stale_lineage_count = edges_not_confirmed_within_policy_window
unknown_consumer_count = production_queries_from_unregistered_clients
22. Enforcement: make lineage hard to bypass
Lineage cannot rely only on goodwill.
Controls:
- Production data assets require owner metadata.
- New tables require registration in catalog.
- New public views require contract metadata.
- ETL jobs emit lineage events before being considered production-ready.
- BI dashboards must use governed datasets.
- CDC consumers require owner and purpose registration.
- Migration PRs must include impact analysis.
- PII-containing assets require privacy lineage.
- Report builds require source snapshot and checksum.
- Decision records require provenance references.
Example migration review gate:
schema_change:
asset: postgres://prod-core/enforcement/public.case.status
change_type: semantic_change
impact_analysis_required: true
downstream_assets:
- postgres://analytics/enforcement/public.case_summary.current_status
- dashboard://superset/enforcement-case-overview
- report://regulatory/monthly-enforcement-summary/v2
consumer_approval_required: true
backward_compatible: true
23. Case study: regulatory enforcement decision
Scenario:
A regulator asks why CASE-123 was escalated automatically to senior review on 2026-07-05.
A weak system answers:
Because risk_score was HIGH.
A defensible system answers:
CASE-123 was escalated by rule bundle POLICY-2026-06.14,
rule RISK_ESCALATION_07,
at 2026-07-05T09:21:33Z.
The rule evaluated these inputs:
- case_profile version 27
- evidence_document DOC-991 checksum sha256:...
- sanctions_screening SCR-77 version 5
- prior_enforcement_history snapshot LSN 42/BF9E2A10
The escalation was generated by service risk-engine version git:abc123.
Validation passed and the transition was recorded in case_status_transition id ...
The decision context is preserved in evidence_package EPK-...
Schema sketch:
CREATE TABLE case_decision_input_ref (
decision_input_ref_id uuid PRIMARY KEY,
decision_id uuid NOT NULL REFERENCES decision_record(decision_id),
source_asset_urn text NOT NULL,
source_entity_key text NOT NULL,
source_version text NOT NULL,
source_checksum text,
input_role text NOT NULL,
captured_at timestamptz NOT NULL DEFAULT now()
);
This makes the decision reconstructable without duplicating all raw data into decision row.
24. Common failure modes
| Failure mode | Why it happens | Consequence | Prevention |
|---|---|---|---|
| Diagram-only lineage | manually maintained docs | stale, untrusted lineage | runtime events + catalog |
| No owner metadata | assets created ad hoc | no one fixes broken data | owner required at creation |
| Dataset lineage only | no column mapping | schema changes break reports | column lineage for contracts |
| Overly granular lineage everywhere | ambition exceeds need | high cost, poor adoption | tiered lineage policy |
| Provenance stores raw PII | careless evidence capture | privacy/security risk | references/checksums/redaction |
| Lineage not versioned | only current dependencies shown | historical reports unexplained | valid_from/valid_to |
| No confidence level | declared and verified mixed | false trust | declared/observed/verified |
| Consumers not registered | dashboards/notebooks bypass process | hidden breaking changes | query telemetry + BI scanning |
| No source snapshot | report cannot be reproduced | weak audit defense | report_build metadata |
| No correction model | provenance edited silently | evidence integrity risk | append correction events |
25. Design checklist
Use this checklist before approving database/data-platform architecture.
Asset inventory
- Every production table/topic/file/report has a stable asset URN.
- Every asset has owner team and classification.
- PII/sensitive assets are explicitly marked.
- Retired assets are not deleted from lineage history.
Lineage graph
- Critical assets have upstream and downstream lineage.
- Shared contracts have column-level lineage.
- Decision-critical outputs have row/entity provenance.
- Lineage edges have confidence level.
- Lineage edges are time-versioned.
Provenance
- Report/decision/model outputs capture source snapshot.
- Jobs capture code/config/rule/model version.
- Inputs/outputs capture row counts and checksums where feasible.
- Corrections are append-only and traceable.
- Raw sensitive values are not unnecessarily duplicated in provenance.
Operations
- Schema changes require impact analysis.
- Consumer notification is driven by lineage graph.
- Lineage drift is monitored.
- Unknown consumers are detected.
- Runbooks can use lineage for root cause analysis.
Governance
- New assets cannot become production without metadata.
- CDC/export consumers must register owner and purpose.
- Regulatory reports include build/provenance metadata.
- Decision records attach evidence/provenance references.
- Privacy erasure/masking uses lineage to propagate changes.
26. Practical implementation path
Do not try to build a perfect lineage platform on day one.
Phase 1 — Inventory
- asset registry
- owner/team/classification
- stable URNs
- production vs non-production flag
Phase 2 — Dataset lineage
- pipeline declared metadata
- runtime job events
- downstream/upstream traversal
- schema change impact query
Phase 3 — Column lineage
- public data product columns
- BI/report fields
- CDC/export contracts
- semantic versioning
Phase 4 — Decision/report provenance
- source snapshot
- code/config/rule version
- row count/checksum
- evidence package reference
Phase 5 — Enforcement and automation
- CI/CD impact gates
- drift detection
- unknown consumer detection
- privacy propagation
- incident runbook integration
Architectural rule:
Start with the lineage questions you need to answer under pressure. Build only the metadata required to answer them reliably.
27. Final mental model
Data lineage and provenance are not documentation tasks. They are database architecture capabilities.
A top-tier database architect treats lineage as:
- Dependency graph — what depends on what.
- Evidence chain — how an output came into existence.
- Impact system — what breaks if something changes.
- Trust mechanism — why a report, dashboard, decision, or model output can be believed.
- Compliance enabler — how privacy, retention, audit, and regulatory traceability become executable.
The key shift:
Do not ask, “Can we draw the data flow?” Ask, “Can we prove how this value reached this decision and who is affected if it changes?”
That is the difference between a data diagram and an evidence-grade architecture.
References
- W3C, PROV-DM: The PROV Data Model — provenance as entities, activities, and agents.
- W3C, PROV-O: The PROV Ontology — formal ontology for provenance representation.
- OpenLineage, Object Model — jobs, datasets, runs, and facets as lineage metadata concepts.
- PostgreSQL Documentation, Recursive Queries / WITH RECURSIVE, useful for lineage graph traversal in relational metadata stores.
- PostgreSQL Documentation, JSON Types, useful for extensible metadata fields when core columns remain typed.
You just completed lesson 63 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.