Series/Learn Database Design and Architect

Deepen PracticeOrdered learning track

Master Data, Reference Data, and Taxonomy

Learn Database Design and Architect - Part 064

Master data, reference data, dan taxonomy untuk database architecture: trusted entities, reference lifecycle, hierarchy, versioning, stewardship, synchronization, mapping, and governance.

[2026-07-05]19 min read3678 words

In This Lesson

1. Core definitions 2. Why master/reference data matters 3. Master data mental model

PrevNext

Lesson 6484 lesson track46–69 Deepen Practice

#database#database-design#database-architecture#master-data+4 more

Part 064 — Master Data, Reference Data, and Taxonomy

Target bagian ini: kamu bisa membedakan master data, reference data, dan taxonomy, lalu mendesain schema, lifecycle, ownership, versioning, mapping, synchronization, governance, dan failure mode-nya di sistem produksi.

Banyak sistem enterprise gagal bukan karena query lambat, tetapi karena istilah bisnis, kode referensi, dan entity utama tidak konsisten.

Contoh:

customer, client, party, account_holder, subject, dan respondent dipakai bergantian tanpa definisi jelas.
Status “closed” berarti selesai di satu sistem, tetapi berarti archived di sistem lain.
Country code, currency code, industry classification, risk category, dan case type tersebar sebagai string bebas.
Organisasi punya tiga “single source of truth” untuk customer.
Dashboard berbeda karena taxonomy berubah tanpa versioning.
Case lama tidak bisa diinterpretasi karena reference value lama dihapus.

Database architect harus menguasai ini karena master/reference data adalah semantic infrastructure.

Architectural rule:

If teams cannot agree what an entity or code means, the database cannot compensate with indexes and constraints.

1. Core definitions

Concept	Meaning	Example	Volatility	Ownership
Master data	Core business entity shared across processes	person, organization, product, location, supplier, account	medium	domain/stewardship team
Reference data	Controlled values used to classify/categorize	country code, currency, status code, case type, risk rating	low-medium	governance/domain owner
Taxonomy	Structured classification system	enforcement category tree, product hierarchy, document type hierarchy	medium	policy/governance/domain owner
Metadata	Data about data	table owner, column classification, rule version	medium	platform/governance
Transactional data	Business events/facts	payment, case transition, order, inspection result	high	operational service/domain

IBM describes reference data as data used to classify or categorize other data, such as country and currency codes. That distinction is useful: reference data is not the event itself; it shapes interpretation of events.

2. Why master/reference data matters

Master/reference data affects correctness, not only reporting convenience.

Area	Failure when master/reference data is weak
Transaction processing	invalid status/type/category accepted
Workflow	wrong queue/escalation because category mapping wrong
Authorization	actor linked to wrong organization/role
Reporting	inconsistent aggregation across systems
Compliance	old decisions cannot be interpreted after taxonomy changes
Integration	external code mappings drift
Search	duplicate entities and inconsistent labels
Analytics	dimensions unstable; time-series reports not comparable
Data quality	duplicate parties/products/locations explode

A strong database design separates:

identity of core entity
attributes of entity
classification of entity
versioned interpretation of classification
mapping to external systems
historical usage in transactions

3. Master data mental model

Master data answers: “What shared thing does the enterprise recognize?”

Examples:

Master entity	Why it is master data
Person/Party	referenced by cases, accounts, claims, communications
Organization	referenced by permits, enforcement actions, ownership structures
Product	referenced by orders, inventory, pricing, support
Location	referenced by inspections, incidents, service areas
Supplier	referenced by procurement, risk, contracts

Master data is not “all data about customer”. It is the trusted cross-process representation of shared entity identity and key attributes.

4. Reference data mental model

Reference data answers: “What controlled values are allowed, and what do they mean?”

CREATE TABLE ref_case_type (
  case_type_code      text PRIMARY KEY,
  display_name        text NOT NULL,
  description         text NOT NULL,
  is_active           boolean NOT NULL DEFAULT true,
  valid_from          date NOT NULL,
  valid_to            date,
  sort_order          integer NOT NULL,
  owner_team          text NOT NULL,
  created_at          timestamptz NOT NULL DEFAULT now(),
  CHECK (valid_to IS NULL OR valid_to > valid_from)
);

Reference data should not be random app constants.

Bad:

if (caseType.equals("A")) { ... }

Better:

CREATE TABLE ref_case_type (
  case_type_code text PRIMARY KEY,
  display_name text NOT NULL,
  regulatory_meaning text NOT NULL,
  requires_senior_review boolean NOT NULL,
  default_sla_hours integer NOT NULL,
  valid_from date NOT NULL,
  valid_to date
);

Design question:

Is this just a display label, or does it drive business behavior?

If it drives behavior, it needs ownership, versioning, tests, and change governance.

5. Taxonomy mental model

Taxonomy is reference data with structure.

Example:

Taxonomy can be:

Shape	Example	DB pattern
flat list	risk rating LOW/MEDIUM/HIGH	reference table
hierarchy	product category tree	adjacency list / closure table
DAG	document can belong to multiple categories	edge table
ontology-like	rich typed relationships	graph/properties
versioned taxonomy	regulatory classification by year	taxonomy version tables

Architectural rule:

A taxonomy is not only a tree; it is a contract for classification, reporting, routing, and historical interpretation.

6. Master vs reference vs transactional data

Do not confuse entity identity with transaction events.

Example: regulatory case management.

Data	Type	Reason
`party`	master data	shared person/org identity
`case`	transactional/operational entity	lifecycle of investigation/enforcement matter
`case_type`	reference data	classifies case
`case_status`	reference data + state machine	classifies lifecycle state
`case_status_transition`	transactional event	records state change
`enforcement_taxonomy_node`	taxonomy	structured regulatory classification
`evidence_document`	operational entity	case-specific artifact
`country_code`	reference data	classification/standardized location attribute
`organization_relationship`	master relationship	shared relationship between orgs/people

Rule:

Reference data defines allowed meaning. Transactional data records what happened.

7. Reference table design

Minimum viable production reference table:

CREATE TABLE ref_value_set (
  value_set_code   text PRIMARY KEY,
  display_name     text NOT NULL,
  description      text NOT NULL,
  owner_team       text NOT NULL,
  change_policy    text NOT NULL CHECK (change_policy IN ('platform', 'domain_owner', 'committee', 'external_standard')),
  created_at       timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE ref_value (
  value_set_code   text NOT NULL REFERENCES ref_value_set(value_set_code),
  value_code       text NOT NULL,
  display_name     text NOT NULL,
  description      text NOT NULL,
  is_active        boolean NOT NULL DEFAULT true,
  valid_from       date NOT NULL,
  valid_to         date,
  sort_order       integer NOT NULL DEFAULT 0,
  metadata         jsonb NOT NULL DEFAULT '{}'::jsonb,
  created_at       timestamptz NOT NULL DEFAULT now(),
  updated_at       timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (value_set_code, value_code),
  CHECK (valid_to IS NULL OR valid_to > valid_from)
);

But generic reference table can become a dumping ground. Use it carefully.

Prefer dedicated reference tables when:

values have domain-specific attributes
FK constraints are important
values drive behavior
history/versioning matters
query performance matters
semantics differ across value sets

Example dedicated table:

CREATE TABLE ref_case_status (
  status_code           text PRIMARY KEY,
  display_name          text NOT NULL,
  lifecycle_phase       text NOT NULL CHECK (lifecycle_phase IN ('intake', 'assessment', 'investigation', 'decision', 'closure')),
  is_terminal           boolean NOT NULL,
  allows_reopen         boolean NOT NULL,
  default_sla_hours     integer,
  requires_decision_ref boolean NOT NULL,
  is_active             boolean NOT NULL DEFAULT true,
  valid_from            date NOT NULL,
  valid_to              date,
  CHECK (valid_to IS NULL OR valid_to > valid_from)
);

8. Do not delete reference values casually

Deleting reference values breaks history.

Bad:

DELETE FROM ref_case_type WHERE case_type_code = 'LEGACY_X';

This creates old rows with invalid references or forces destructive update.

Better:

UPDATE ref_case_type
SET is_active = false,
    valid_to = DATE '2026-07-05'
WHERE case_type_code = 'LEGACY_X';

Then operational tables can preserve history:

CREATE TABLE case_file (
  case_id         uuid PRIMARY KEY,
  case_number     text NOT NULL UNIQUE,
  case_type_code  text NOT NULL REFERENCES ref_case_type(case_type_code),
  opened_at       timestamptz NOT NULL,
  closed_at       timestamptz
);

But this only preserves code, not full meaning if meaning changes. For defensibility, capture versioned meaning when necessary.

9. Versioned reference data

Reference values can change meaning over time.

Example:

risk rating HIGH threshold changes
case category definition changes after regulation update
country/region classification changes
product category hierarchy changes
enforcement violation taxonomy version changes

Pattern:

CREATE TABLE ref_case_category_version (
  category_code      text NOT NULL,
  version_no         integer NOT NULL,
  display_name       text NOT NULL,
  definition         text NOT NULL,
  valid_from         date NOT NULL,
  valid_to           date,
  regulation_ref     text,
  is_active          boolean NOT NULL DEFAULT true,
  PRIMARY KEY (category_code, version_no),
  CHECK (valid_to IS NULL OR valid_to > valid_from)
);

CREATE UNIQUE INDEX uq_ref_case_category_current
ON ref_case_category_version(category_code)
WHERE valid_to IS NULL;

Transaction row can store both code and version:

CREATE TABLE case_classification (
  case_classification_id uuid PRIMARY KEY,
  case_id                uuid NOT NULL,
  category_code          text NOT NULL,
  category_version_no    integer NOT NULL,
  classified_at          timestamptz NOT NULL,
  classified_by_actor_id text NOT NULL,
  rationale              text,
  FOREIGN KEY (category_code, category_version_no)
    REFERENCES ref_case_category_version(category_code, version_no)
);

Why store version?

Because a classification made under old taxonomy must remain interpretable after taxonomy changes.

Rule:

If the meaning can change and historical interpretation matters, store the reference version used at decision time.

10. Effective-dated reference data lookup

Sometimes transaction date determines applicable reference value.

SELECT r.*
FROM ref_tax_rate r
WHERE r.jurisdiction_code = :jurisdiction_code
  AND r.valid_from <= :transaction_date
  AND (r.valid_to IS NULL OR :transaction_date < r.valid_to)
ORDER BY r.valid_from DESC
LIMIT 1;

For PostgreSQL, range types and exclusion constraints can prevent overlapping validity periods.

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE ref_tax_rate (
  jurisdiction_code text NOT NULL,
  valid_period      daterange NOT NULL,
  rate              numeric(8,5) NOT NULL,
  PRIMARY KEY (jurisdiction_code, valid_period),
  EXCLUDE USING gist (
    jurisdiction_code WITH =,
    valid_period WITH &&
  )
);

This enforces:

For the same jurisdiction, two active tax-rate rows cannot overlap in time.

The same pattern applies to policy threshold, SLA calendar, classification definition, and jurisdiction-specific rule.

11. Reference data lifecycle

Reference values need lifecycle governance.

Lifecycle table:

CREATE TABLE reference_change_request (
  change_request_id uuid PRIMARY KEY,
  value_set_code    text NOT NULL,
  change_type       text NOT NULL CHECK (change_type IN ('add', 'update', 'deprecate', 'retire', 'merge', 'split')),
  requested_by      text NOT NULL,
  requested_at      timestamptz NOT NULL DEFAULT now(),
  status            text NOT NULL CHECK (status IN ('draft', 'under_review', 'approved', 'rejected', 'implemented', 'cancelled')),
  reason            text NOT NULL,
  effective_date    date NOT NULL,
  impact_summary    text,
  metadata          jsonb NOT NULL DEFAULT '{}'::jsonb
);

Operational rule:

low-risk UI labels can be changed by owner team
behavior-driving values require review
regulatory classifications require effective date and evidence
destructive changes require impact analysis
taxonomy restructuring requires migration plan

12. Reference data as deploy-time vs runtime configuration

Not all reference data should be editable at runtime.

Type	Example	Change mechanism
Standards reference	ISO country/currency code	controlled import/versioned update
Domain classification	case type, violation category	governance workflow
Behavior configuration	SLA hours by case type	versioned config with tests
UI label	display name, sort order	owner update with audit
State machine status	workflow state	code/schema migration + tests

If reference data drives business logic, treat it like code:

version it
test it
review it
deploy it safely
preserve old behavior for historical records

Anti-pattern:

Business-critical rules hidden inside editable lookup tables with no tests.

Better:

Reference data that changes behavior has an explicit rule contract, lifecycle, audit trail, and validation suite.

13. Master data architecture patterns

Master data architecture usually falls into several patterns.

Pattern	Description	Good for	Risk
Registry	central index links records across systems	lightweight identity resolution	attributes remain inconsistent
Consolidation	central hub consolidates data for reporting	analytics and dedup	not real-time operational authority
Coexistence	hub and sources synchronize trusted attributes	shared operations	complex synchronization
Centralized	hub is system of record for master entity	strong governance	high organizational coupling

Do not choose pattern based on tool marketing. Choose based on authority and workflow.

Questions:

Which system creates the entity?
Which system can change authoritative attributes?
Which attributes are mastered where?
How are duplicates merged?
How are splits handled?
What happens when source systems disagree?
How fast must corrections propagate?
Which consumers require historical state?

14. Party master data pattern

For person/organization, use party modelling.

Schema sketch:

CREATE TABLE party (
  party_id        uuid PRIMARY KEY,
  party_type      text NOT NULL CHECK (party_type IN ('person', 'organization')),
  lifecycle_state text NOT NULL CHECK (lifecycle_state IN ('active', 'merged', 'split', 'retired')),
  created_at      timestamptz NOT NULL DEFAULT now(),
  updated_at      timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE person_profile (
  party_id       uuid PRIMARY KEY REFERENCES party(party_id),
  legal_name     text NOT NULL,
  date_of_birth  date,
  created_at     timestamptz NOT NULL DEFAULT now(),
  updated_at     timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE organization_profile (
  party_id             uuid PRIMARY KEY REFERENCES party(party_id),
  legal_name           text NOT NULL,
  registration_number  text,
  jurisdiction_code    text,
  created_at           timestamptz NOT NULL DEFAULT now(),
  updated_at           timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE party_source_xref (
  party_source_xref_id uuid PRIMARY KEY,
  party_id             uuid NOT NULL REFERENCES party(party_id),
  source_system        text NOT NULL,
  source_entity_type   text NOT NULL,
  source_entity_key    text NOT NULL,
  source_confidence    text NOT NULL CHECK (source_confidence IN ('low', 'medium', 'high', 'verified')),
  valid_from           timestamptz NOT NULL DEFAULT now(),
  valid_to             timestamptz,
  UNIQUE (source_system, source_entity_type, source_entity_key)
);

Do not expose internal party_id as the only identifier business users see. Keep business identifiers explicit and governed.

15. Golden record vs source-specific truth

The phrase “golden record” can be misleading.

A golden record is not always one row with perfect truth. Often it is an assembled trusted view with attribute-level survivorship.

Example:

Attribute	Source of truth	Rule
legal name	government registry	verified external source wins
email	self-service portal	latest verified user input wins
risk category	risk system	computed by policy version
organization relationship	corporate registry + stewardship	verified manual curation wins
mailing address	customer portal	most recent confirmed address wins

Schema:

CREATE TABLE mastered_attribute (
  mastered_attribute_id uuid PRIMARY KEY,
  party_id              uuid NOT NULL REFERENCES party(party_id),
  attribute_name        text NOT NULL,
  attribute_value       text NOT NULL,
  source_system         text NOT NULL,
  source_entity_key     text NOT NULL,
  confidence_score      numeric(5,4) NOT NULL,
  survivorship_rule     text NOT NULL,
  effective_from        timestamptz NOT NULL,
  effective_to          timestamptz,
  created_at            timestamptz NOT NULL DEFAULT now(),
  UNIQUE (party_id, attribute_name, effective_from)
);

Alternative: materialize current trusted profile for fast reads, but keep attribute history.

CREATE TABLE party_trusted_profile (
  party_id              uuid PRIMARY KEY REFERENCES party(party_id),
  legal_name            text NOT NULL,
  legal_name_source     text NOT NULL,
  email                 text,
  email_source          text,
  updated_at            timestamptz NOT NULL
);

Rule:

The golden record should be explainable, not magical.

16. Duplicate, merge, and split

Master data systems must model duplicate resolution explicitly.

Merge

CREATE TABLE party_merge_event (
  merge_event_id        uuid PRIMARY KEY,
  surviving_party_id    uuid NOT NULL REFERENCES party(party_id),
  merged_party_id       uuid NOT NULL REFERENCES party(party_id),
  merge_reason          text NOT NULL,
  confidence_score      numeric(5,4) NOT NULL,
  approved_by_actor_id  text NOT NULL,
  approved_at           timestamptz NOT NULL,
  evidence_ref          text,
  CHECK (surviving_party_id <> merged_party_id)
);

After merge:

old party should not disappear
old references should resolve to surviving party where appropriate
historical transactions should preserve original reference if legally meaningful
downstream systems need merge event

Split

Splits are harder than merges.

CREATE TABLE party_split_event (
  split_event_id       uuid PRIMARY KEY,
  original_party_id    uuid NOT NULL REFERENCES party(party_id),
  new_party_id         uuid NOT NULL REFERENCES party(party_id),
  split_reason         text NOT NULL,
  approved_by_actor_id text NOT NULL,
  approved_at          timestamptz NOT NULL,
  evidence_ref         text
);

A split may require reassigning identifiers, relationships, cases, consents, and historical events. Never treat it as a simple update.

Rule:

Identity correction is a business event, not a row overwrite.

17. External identifiers and crosswalk mapping

Enterprise systems rarely agree on IDs.

Use crosswalk mapping:

CREATE TABLE master_identity_crosswalk (
  crosswalk_id       uuid PRIMARY KEY,
  master_entity_type text NOT NULL,
  master_entity_id   uuid NOT NULL,
  source_system      text NOT NULL,
  source_entity_type text NOT NULL,
  source_entity_key  text NOT NULL,
  match_method       text NOT NULL CHECK (match_method IN ('source_asserted', 'deterministic', 'probabilistic', 'manual')),
  confidence_score   numeric(5,4) NOT NULL,
  valid_from         timestamptz NOT NULL DEFAULT now(),
  valid_to           timestamptz,
  UNIQUE (source_system, source_entity_type, source_entity_key)
);

Match methods:

Method	Example	Risk
source asserted	external system says ID maps to party	source may be wrong
deterministic	exact national ID hash match	missing/incorrect IDs
probabilistic	name+DOB+address similarity	false positive/negative
manual	steward confirms	human error, slow

For regulated systems, automatic merge should be conservative. Prefer candidate match + steward review when false positive risk is high.

18. Taxonomy hierarchy patterns

18.1 Adjacency list

Simple and common.

CREATE TABLE taxonomy_node (
  taxonomy_code    text NOT NULL,
  node_code        text NOT NULL,
  parent_node_code text,
  display_name     text NOT NULL,
  definition       text NOT NULL,
  sort_order       integer NOT NULL DEFAULT 0,
  is_active        boolean NOT NULL DEFAULT true,
  PRIMARY KEY (taxonomy_code, node_code),
  FOREIGN KEY (taxonomy_code, parent_node_code)
    REFERENCES taxonomy_node(taxonomy_code, node_code)
);

Good for simple trees. Recursive queries needed for ancestors/descendants.

18.2 Closure table

Better for frequent ancestor/descendant queries.

CREATE TABLE taxonomy_node_closure (
  taxonomy_code        text NOT NULL,
  ancestor_node_code   text NOT NULL,
  descendant_node_code text NOT NULL,
  depth                integer NOT NULL CHECK (depth >= 0),
  PRIMARY KEY (taxonomy_code, ancestor_node_code, descendant_node_code)
);

Query descendants:

SELECT n.*
FROM taxonomy_node_closure c
JOIN taxonomy_node n
  ON n.taxonomy_code = c.taxonomy_code
 AND n.node_code = c.descendant_node_code
WHERE c.taxonomy_code = :taxonomy_code
  AND c.ancestor_node_code = :node_code;

18.3 Versioned taxonomy

For historical interpretation:

CREATE TABLE taxonomy_version (
  taxonomy_code     text NOT NULL,
  version_no        integer NOT NULL,
  display_name      text NOT NULL,
  valid_from        date NOT NULL,
  valid_to          date,
  status            text NOT NULL CHECK (status IN ('draft', 'approved', 'active', 'retired')),
  PRIMARY KEY (taxonomy_code, version_no)
);

CREATE TABLE taxonomy_node_version (
  taxonomy_code     text NOT NULL,
  version_no        integer NOT NULL,
  node_code         text NOT NULL,
  parent_node_code  text,
  display_name      text NOT NULL,
  definition        text NOT NULL,
  PRIMARY KEY (taxonomy_code, version_no, node_code),
  FOREIGN KEY (taxonomy_code, version_no)
    REFERENCES taxonomy_version(taxonomy_code, version_no)
);

Transactions reference taxonomy version:

CREATE TABLE case_taxonomy_assignment (
  assignment_id      uuid PRIMARY KEY,
  case_id            uuid NOT NULL,
  taxonomy_code      text NOT NULL,
  taxonomy_version   integer NOT NULL,
  node_code          text NOT NULL,
  assigned_at        timestamptz NOT NULL,
  assigned_by        text NOT NULL,
  FOREIGN KEY (taxonomy_code, taxonomy_version, node_code)
    REFERENCES taxonomy_node_version(taxonomy_code, version_no, node_code)
);

19. Mapping between taxonomies

Different regulators, departments, or systems may classify the same concept differently.

CREATE TABLE taxonomy_mapping (
  mapping_id              uuid PRIMARY KEY,
  source_taxonomy_code    text NOT NULL,
  source_taxonomy_version integer NOT NULL,
  source_node_code        text NOT NULL,
  target_taxonomy_code    text NOT NULL,
  target_taxonomy_version integer NOT NULL,
  target_node_code        text NOT NULL,
  mapping_type            text NOT NULL CHECK (mapping_type IN (
    'equivalent', 'broader', 'narrower', 'related', 'manual_review_required'
  )),
  confidence              text NOT NULL CHECK (confidence IN ('low', 'medium', 'high', 'verified')),
  valid_from              date NOT NULL,
  valid_to                date,
  rationale               text NOT NULL
);

Mapping is rarely perfect.

Mapping	Meaning
equivalent	same meaning
broader	target is more general
narrower	target is more specific
related	useful association but not substitutable
manual_review_required	no safe automatic mapping

Rule:

A mapping table without mapping semantics is a hidden data quality incident.

20. Reference data and state machines

Statuses are often reference data, but state transitions are not merely lookup values.

Weak design:

CREATE TABLE case_file (
  case_id uuid PRIMARY KEY,
  status text NOT NULL
);

Better:

CREATE TABLE ref_case_status (
  status_code text PRIMARY KEY,
  display_name text NOT NULL,
  is_terminal boolean NOT NULL,
  lifecycle_phase text NOT NULL
);

CREATE TABLE ref_case_status_transition_rule (
  from_status_code text NOT NULL REFERENCES ref_case_status(status_code),
  to_status_code   text NOT NULL REFERENCES ref_case_status(status_code),
  requires_reason  boolean NOT NULL DEFAULT false,
  requires_approval boolean NOT NULL DEFAULT false,
  PRIMARY KEY (from_status_code, to_status_code)
);

CREATE TABLE case_status_transition (
  transition_id    uuid PRIMARY KEY,
  case_id          uuid NOT NULL,
  from_status_code text NOT NULL,
  to_status_code   text NOT NULL,
  transitioned_at  timestamptz NOT NULL,
  transitioned_by  text NOT NULL,
  reason           text,
  FOREIGN KEY (from_status_code, to_status_code)
    REFERENCES ref_case_status_transition_rule(from_status_code, to_status_code)
);

Reference data defines allowed states. Transition event records movement. Rule table defines allowed movement.

21. Caching reference data

Reference data is read often and changes slowly, so caching is common. But stale reference data can cause wrong behavior.

Define freshness contract:

Reference type	Cache strategy
UI label/sort	longer TTL acceptable
behavior-driving SLA/risk rule	versioned, cache by version, short TTL or explicit reload
state transition rule	deploy-time or version-pinned runtime config
external standards	versioned release/import
taxonomy used for historical reports	version-pinned; never “latest” for old period

Pattern:

command includes reference_version or policy_version
service loads exact version
transaction stores version used

Do not let long-lived app cache silently use old reference values for write validation.

22. Synchronizing reference data across services

Options:

Pattern	Use when	Risk
shared DB reference table	modular monolith / same DB boundary	coupling
reference API	services need central lookup	latency/availability dependency
replicated reference package	high availability, versioned config	propagation complexity
event stream of reference changes	many consumers need updates	consumer lag/version mismatch
code constants	values rarely change and are behavior-critical	redeploy needed

A production-grade reference data service publishes:

value set metadata
active values
versions
effective dates
change events
deprecation notices
checksum/package signature
compatibility notes

Reference change event:

{
  "eventType": "REFERENCE_VALUE_CHANGED",
  "valueSet": "CASE_TYPE",
  "version": 17,
  "changeType": "deprecate",
  "valueCode": "LEGACY_X",
  "effectiveDate": "2026-08-01",
  "ownerTeam": "enforcement-policy",
  "requiresConsumerAction": true
}

23. Data quality rules for master/reference data

Master/reference data requires stronger quality gates.

Reference data rules

code is stable and unique
display name is not used as key
inactive values cannot be used for new transactions unless explicitly allowed
effective periods do not overlap
destructive changes require impact analysis
behavior-driving attributes have tests
value definition is not empty
owner is assigned

Master data rules

source crosswalk is unique
merge/split events are auditable
duplicate candidates are reviewed
identifiers are classified and protected
survivorship rules are explicit
trusted attributes have source/provenance
lifecycle state is valid
downstream consumers receive identity correction events

Example duplicate candidate table:

CREATE TABLE party_duplicate_candidate (
  candidate_id       uuid PRIMARY KEY,
  party_id_a         uuid NOT NULL REFERENCES party(party_id),
  party_id_b         uuid NOT NULL REFERENCES party(party_id),
  match_score        numeric(5,4) NOT NULL,
  match_reason       jsonb NOT NULL,
  status             text NOT NULL CHECK (status IN ('open', 'confirmed_duplicate', 'rejected', 'needs_more_evidence')),
  created_at         timestamptz NOT NULL DEFAULT now(),
  reviewed_by        text,
  reviewed_at        timestamptz,
  CHECK (party_id_a <> party_id_b)
);

24. Security and privacy for master data

Master data often concentrates risk.

Risks:

party/entity hub becomes high-value target
crosswalk table enables re-identification
identifiers contain sensitive personal/business data
merge/split mistakes have large blast radius
analytics copies overexpose identifiable attributes
steward tools bypass normal app authorization

Controls:

Control	Purpose
attribute classification	know which fields are sensitive
row/tenant/org access policy	restrict who sees which party/entity
hashed identifiers	prevent casual exposure
tokenization	protect reversible sensitive identifiers
stewardship audit	trace manual identity decisions
break-glass control	controlled emergency access
export policy	prevent broad extraction
lineage	know where master data propagates

Example identifier storage:

CREATE TABLE party_identifier (
  identifier_id        uuid PRIMARY KEY,
  party_id             uuid NOT NULL REFERENCES party(party_id),
  identifier_type      text NOT NULL,
  identifier_hash      text NOT NULL,
  identifier_token_ref text,
  issuing_authority    text,
  verified_at          timestamptz,
  valid_from           date,
  valid_to             date,
  UNIQUE (identifier_type, identifier_hash)
);

Store raw identifiers only when required, and protect them separately.

25. Master/reference data and analytics dimensions

Reference and master data often become dimensions in analytics.

Dimension design questions:

Is this attribute slowly changing?
Should historical reports reflect old value or current value?
Is the hierarchy versioned?
Are mappings one-to-one or many-to-many?
Does category change retroactively?
Does report need regulatory taxonomy version?

Example slowly changing dimension:

CREATE TABLE dim_case_category (
  dim_case_category_key bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  category_code         text NOT NULL,
  category_version      integer NOT NULL,
  display_name          text NOT NULL,
  parent_category_code  text,
  effective_from        date NOT NULL,
  effective_to          date,
  is_current            boolean NOT NULL,
  UNIQUE (category_code, category_version)
);

Operational transaction stores code/version; warehouse dimension adds surrogate dimension key for analytics performance and historical reporting.

26. Case study: enforcement taxonomy change

Scenario:

A regulator changes violation taxonomy. Old category DISCLOSURE_BREACH splits into:

MISLEADING_DISCLOSURE
LATE_DISCLOSURE
INCOMPLETE_DISCLOSURE

Weak approach:

UPDATE case_file
SET category_code = 'MISLEADING_DISCLOSURE'
WHERE category_code = 'DISCLOSURE_BREACH';

This destroys historical meaning.

Better approach:

Create taxonomy version 2.
Retire old node in version 2; keep it in version 1.
Add mapping from old to new with manual_review_required or conditional rules.
New cases use taxonomy v2.
Existing cases keep v1 assignment unless reclassified through explicit event.
Reports declare taxonomy version or mapping strategy.
Reclassification creates audit/provenance record.

Schema event:

CREATE TABLE case_reclassification_event (
  reclassification_event_id uuid PRIMARY KEY,
  case_id                   uuid NOT NULL,
  old_taxonomy_code          text NOT NULL,
  old_taxonomy_version       integer NOT NULL,
  old_node_code              text NOT NULL,
  new_taxonomy_code          text NOT NULL,
  new_taxonomy_version       integer NOT NULL,
  new_node_code              text NOT NULL,
  reason                    text NOT NULL,
  reclassified_by            text NOT NULL,
  reclassified_at            timestamptz NOT NULL,
  evidence_ref               text
);

Rule:

Taxonomy change is not a bulk update. It is a semantic migration with historical consequences.

27. Operational ownership model

Reference/master data requires stewardship, not just engineering ownership.

Role	Responsibility
Data owner	accountable for meaning and policy
Data steward	manages quality, duplicates, reference changes
Engineering owner	implements schema, APIs, pipelines, controls
Security/privacy owner	reviews sensitive data handling
Consumer owner	validates downstream impact
Governance forum	approves high-impact semantic changes

RACI example:

Activity	Data owner	Steward	Engineer	Security	Consumers
Add reference value	A	R	C	C if sensitive	I
Change behavior-driving attribute	A	R	R	C	C
Merge party records	A	R	C	C	I
Split party record	A	R	C	C	C
Taxonomy version release	A	R	R	C	C
Retire value	A	R	R	C	C

A = accountable, R = responsible, C = consulted, I = informed.

28. APIs for reference/master data

For distributed systems, expose stable APIs.

Reference API examples:

GET /reference/value-sets/CASE_TYPE?asOf=2026-07-05
GET /reference/value-sets/CASE_TYPE/versions/17
GET /reference/taxonomies/ENFORCEMENT_CATEGORY/versions/2/nodes
GET /reference/mappings?from=taxonomyA:v1&to=taxonomyB:v3

Master API examples:

GET /parties/{partyId}
GET /parties:resolve?sourceSystem=CRM&sourceKey=123
GET /parties/{partyId}/identifiers
GET /parties/{partyId}/relationships?asOf=2026-07-05
POST /party-merge-candidates

API response should include version and metadata:

{
  "valueSet": "CASE_TYPE",
  "version": 17,
  "effectiveDate": "2026-07-05",
  "checksum": "sha256:...",
  "values": [
    {
      "code": "MARKET_ABUSE",
      "displayName": "Market Abuse",
      "active": true,
      "definition": "..."
    }
  ]
}

Consumers can cache by version/checksum safely.

29. Common anti-patterns

Anti-pattern	Why it hurts	Better design
Stringly typed status/category	invalid values, inconsistent meaning	FK to reference table
Display label as key	label changes break data	stable code + display name
Generic lookup table for everything	weak constraints and semantics	dedicated tables for important value sets
Deleting old reference values	history becomes invalid	deprecate/retire with validity period
No effective dating	historical decisions ambiguous	versioned/effective-dated reference data
Taxonomy as nested JSON blob only	hard to query/govern/version	node/edge/version tables
Golden record without provenance	unexplainable trust	attribute source/provenance/survivorship
Auto-merge without review	false positives corrupt identity	conservative matching + stewardship
Crosswalk without confidence	unsafe identity mapping	match method + confidence + validity
Runtime editable business rules	untested behavior changes	versioned rules + review + tests
Cache latest reference values for decisions	old decisions unreproducible	version-pin decisions
No owner for value set	no accountable semantics	owner/steward metadata required

30. Design checklist

Reference data

Stable code is separate from display label.
Values have definitions, owner, active flag, and effective period.
Old values are retired, not deleted.
Behavior-driving values have tests.
New transactions cannot use inactive values unless explicitly allowed.
Historical transactions can be interpreted under the version used.
External standards/imports are versioned and auditable.

Taxonomy

Taxonomy has explicit version.
Nodes have stable codes and definitions.
Parent/child relationships are versioned.
Mapping semantics are explicit.
Reclassification is event-driven and audited.
Reports declare taxonomy version or mapping strategy.

Master data

Master entity identity is stable and governed.
Source-system crosswalks are unique and versioned.
Duplicate/merge/split is modelled as events.
Trusted attributes have source/provenance.
Survivorship rules are explicit.
Sensitive identifiers are hashed/tokenized/protected.
Downstream consumers receive identity correction events.

Operations

Reference changes have workflow and approval.
Consumers are notified of breaking semantic changes.
Caches are version-aware.
Reference/master changes are observable and auditable.
Backup/restore preserves reference history.
Data lineage tracks reference/master propagation.

31. Practical implementation path

Phase 1 — Stop string chaos

identify top 20 business-critical codes/statuses/categories
create dedicated reference tables
add FKs from operational tables
define owner and definitions

Phase 2 — Add lifecycle and versioning

active/inactive
valid_from/valid_to
change requests
audit trail
behavior tests for important values

Phase 3 — Build taxonomy support

taxonomy version
node table
closure table if needed
mapping table
reclassification events

Phase 4 — Establish master identity

party/product/location master tables
source crosswalks
duplicate candidate queue
merge/split events
trusted attribute provenance

Phase 5 — Operationalize governance

reference APIs/packages
change events
consumer impact analysis
lineage integration
stewardship dashboards

32. Final mental model

Master data, reference data, and taxonomy are not clerical tables.

They are the semantic backbone of the database architecture.

A top-tier database architect asks:

Which entities are shared across processes?
Which attributes are authoritative, and where?
Which codes define allowed meaning?
Which classifications drive workflow, risk, reporting, or compliance?
Can old transactions still be interpreted after definitions change?
Can we prove why two records were merged or split?
Can every consumer safely handle semantic change?

The key shift:

Treat meaning as data with lifecycle, ownership, versioning, and evidence.

When meaning is governed, schema becomes more than storage. It becomes a reliable model of the business.

References

IBM, What is Master Data Management? — defines master data categories and reference data as data used to classify or categorize other data.
IBM InfoSphere MDM Reference Data Management Hub — describes centralized management and distribution of trusted enterprise reference data.
PostgreSQL Documentation, Constraints — foreign keys, unique constraints, check constraints for enforcing reference integrity.
PostgreSQL Documentation, Range Types and Exclusion Constraints — useful for effective-dated reference data without overlapping validity periods.
W3C PROV — useful conceptual foundation for provenance of mastered attributes and taxonomy-driven decisions.

Lesson Recap

You just completed lesson 64 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Back To Series Next Lesson

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.

Previous Lesson

Lesson 63

Data Lineage and Provenance

Next Lesson

Lesson 65

Analytical Modelling and Dimensional Design