Data Quality Gates

Learn Java Data Pipeline Pattern - Part 068

Data quality gates for production Java data pipelines: fail-fast, warn, quarantine, progressive validation, executable quality contracts, quality result modeling, policy decisions, data quality eventing, and Java implementation patterns.

Part 068 — Data Quality Gates

A data quality check tells you whether data looks wrong.

A data quality gate decides what the platform is allowed to do next.

That distinction matters.

A junior pipeline often has checks like this:

run SQL query
print failed count
send Slack alert
continue publishing anyway

A production pipeline needs gates like this:

Validate source completeness.
If critical source completeness fails, block publish.
If non-critical optional field drift occurs, warn and continue.
If record-level validity fails below threshold, quarantine bad records and continue.
If privacy rule fails, fail closed.
Attach quality result to run manifest and lineage.
Propagate degraded status to downstream consumers.

Data quality gates are control decisions over data state.

They are not decorative assertions.


1. The Core Mental Model

A quality gate has five components:

scope + rule + measurement + threshold + action

Example:

scope       = silver.case_event partition event_date=2026-07-04
rule        = case_id must be unique
measurement = duplicate case_id count
threshold   = 0 duplicates
action      = block publication

Another example:

scope       = bronze.vendor_case_import file batch vendor-a-20260704
rule        = optional field middle_name may be null
measurement = null percentage
threshold   = <= 90%
action      = warn if exceeded, do not block

Quality gate output must be machine-readable:

PASS
WARN
QUARANTINE
BLOCK
FAIL_CLOSED
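
The five components fit in one small value type. The following is a minimal sketch, not the lesson's implementation: `GateSpecSketch`, its nested `Gate` and `Outcome` records, and the `evaluateUniqueness` helper are hypothetical names, and it covers only the uniqueness example above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: one gate = scope + rule + measurement + threshold + action.
final class GateSpecSketch {
    enum Action { PASS, WARN, QUARANTINE, BLOCK, FAIL_CLOSED }

    // maxAllowed is the threshold; actionOnBreach applies when the measurement exceeds it.
    record Gate(String scope, String rule, long maxAllowed, Action actionOnBreach) {}

    // Machine-readable outcome: the measurement is kept as evidence next to the action.
    record Outcome(Gate gate, long measurement, Action action) {}

    // Measurement for the uniqueness rule: occurrences that repeat an earlier case_id.
    static long duplicateCount(List<String> caseIds) {
        Set<String> seen = new HashSet<>();
        long duplicates = 0;
        for (String caseId : caseIds) {
            if (!seen.add(caseId)) {
                duplicates++;
            }
        }
        return duplicates;
    }

    static Outcome evaluateUniqueness(Gate gate, List<String> caseIds) {
        long duplicates = duplicateCount(caseIds);
        Action action = duplicates > gate.maxAllowed() ? gate.actionOnBreach() : Action.PASS;
        return new Outcome(gate, duplicates, action);
    }
}
```

Keeping the measurement in the outcome means the same object can drive the control decision and serve as audit evidence.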

2. Quality Checks vs Quality Gates

| Concept | Purpose | Example |
| --- | --- | --- |
| Check | Measures a condition | case_id is not null |
| Expectation | Declares desired property | status in allowed set |
| Rule | Business/data invariant | closed case must have closed_at |
| Suite | Group of checks | silver.case_event required checks |
| Gate | Makes control decision | block publish if critical failures exist |
| Policy | Maps failure to action | privacy failure -> fail closed |
| Result | Evidence of execution | 2 failures, 31 rejected records |
| Quarantine | Isolate invalid data | write rejected rows to quarantine table |

Checks without gates create noise.

Gates without evidence create distrust.

A serious pipeline needs both.


3. Where Gates Exist in a Pipeline

Quality gates are not only at the end.

Common gate points:

| Gate Point | Purpose |
| --- | --- |
| Source readiness gate | avoid processing incomplete source |
| File integrity gate | detect partial/corrupt file |
| Schema gate | reject incompatible structure |
| Parse gate | isolate malformed records |
| Contract gate | enforce producer/consumer contract |
| Semantic gate | enforce business invariants |
| Join/enrichment gate | detect missing reference data |
| Reconciliation gate | compare source vs target counts/balances |
| Privacy gate | block unmasked sensitive data |
| Publication gate | decide whether asset becomes certified/current |
| Backfill gate | prevent bad historical rewrite |

Do not push every rule into one final gate. Late detection increases blast radius.


4. Gate Action Taxonomy

A gate must choose an action.

| Action | Meaning | Example |
| --- | --- | --- |
| PASS | continue normally | all critical checks pass |
| WARN | continue but mark degraded | distribution drift above warning threshold |
| QUARANTINE_RECORDS | isolate bad records, continue good records | 0.02% malformed rows |
| QUARANTINE_BATCH | isolate entire input batch | source manifest incomplete |
| BLOCK_PUBLICATION | do not publish output | uniqueness failure in gold table |
| FAIL_CLOSED | stop and require human approval | privacy leak detected |
| SUPERSEDE | replace previous output with corrected version | repair run after bad output |
| DEGRADE_CONSUMER | mark downstream asset/API/dashboard degraded | freshness SLO breach |

The action must be explicit in the quality policy. Engineers should not decide manually during incidents unless policy is missing.


5. Severity Is Not the Same as Action

Severity describes the problem. Action describes what to do.

| Severity | Meaning |
| --- | --- |
| INFO | informational signal |
| LOW | minor issue, no immediate consumer risk |
| MEDIUM | potential consumer degradation |
| HIGH | likely wrong or incomplete output |
| CRITICAL | regulatory/security/customer-impacting risk |

Example mapping:

| Rule | Severity | Action |
| --- | --- | --- |
| Optional description null rate increased | LOW | WARN |
| Case ID missing | HIGH | QUARANTINE_RECORDS or BLOCK |
| Duplicate case ID in current projection | HIGH | BLOCK_PUBLICATION |
| Unmasked national ID in gold table | CRITICAL | FAIL_CLOSED |
| Source file missing trailer | HIGH | QUARANTINE_BATCH |
| Row count changed by 2% | MEDIUM | WARN |
| Row count changed by 80% | HIGH | BLOCK_PUBLICATION |

Do not hardcode severity == action. Use policy.
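
One way to keep severity and action separate is a reviewable lookup table keyed by rule ID. This is a sketch under assumptions: `RuleActionPolicySketch`, the rule IDs, and the conservative default are illustrative and do not come from the lesson.

```java
import java.util.Map;

// Hypothetical sketch: severity lives on the rule, the action lives in a policy table.
final class RuleActionPolicySketch {
    enum Severity { INFO, LOW, MEDIUM, HIGH, CRITICAL }
    enum Action { WARN, QUARANTINE_RECORDS, QUARANTINE_BATCH, BLOCK_PUBLICATION, FAIL_CLOSED }

    record Rule(String ruleId, Severity severity) {}

    private final Map<String, Action> actionByRuleId;
    private final Action defaultAction;

    RuleActionPolicySketch(Map<String, Action> actionByRuleId, Action defaultAction) {
        this.actionByRuleId = Map.copyOf(actionByRuleId);
        this.defaultAction = defaultAction;
    }

    // The action is looked up by rule, never derived from the severity value.
    Action actionFor(Rule failedRule) {
        return actionByRuleId.getOrDefault(failedRule.ruleId(), defaultAction);
    }
}
```

Two HIGH rules can then map to different actions, as in the table above, and an unmapped rule falls back to whatever default the policy owner chose.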


6. Quality Rule Categories

Structural rules

schema exists
required columns exist
type is compatible
field can be parsed
record envelope has required metadata

Completeness rules

row count above lower bound
all expected partitions arrived
mandatory fields non-null
all expected source files present
all expected tenants processed

Validity rules

status in allowed set
date not in impossible range
amount non-negative
email format valid
country code known

Uniqueness rules

case_id unique in current projection
event_id unique within dedupe horizon
natural key unique per effective time

Consistency rules

closed case has closed_at
closed_at >= opened_at
breach flag agrees with SLA deadline
case status transition is legal

Referential rules

case.policy_id exists in policy reference table
case.assignee_id exists in user dimension
court_id exists in reference.court

Freshness rules

latest event time within 15 minutes
source watermark not older than threshold
asset published by deadline

Distribution/drift rules

status distribution within expected range
null rate did not spike
vendor volume within historical band
priority mix not impossible

Reconciliation rules

source count equals target count after filters
financial balance matches source ledger
checksums match for imported file
accepted + rejected = input

Privacy/security rules

PII fields masked in gold layer
restricted fields absent from external export
tenant ID is preserved and enforced
classification tag exists

7. Quality Gate Design Principle

The strongest rule:

A data quality gate must protect consumer trust without destroying pipeline operability.

Too weak:

always publish and alert

Result: bad data leaks.

Too strict:

block entire pipeline for every minor anomaly

Result: platform becomes brittle.

Better:

critical invariant failure blocks publication
record-level defects are quarantined when safe
statistical anomalies warn unless known consumer risk
privacy failures fail closed
reconciliation failures block certified outputs

Quality gate engineering is policy design, not just validation code.


8. Java Quality Model

Start with stable types.

public record QualityScope(
        DatasetRef dataset,
        Optional<PartitionRef> partition,
        Optional<String> runId,
        Optional<String> backfillId
) {}

public record QualityRule(
        String ruleId,
        String version,
        String description,
        QualityDimension dimension,
        QualitySeverity severity,
        GatePolicy policy
) {}

public enum QualityDimension {
    STRUCTURE,
    COMPLETENESS,
    VALIDITY,
    UNIQUENESS,
    CONSISTENCY,
    REFERENTIAL_INTEGRITY,
    FRESHNESS,
    DISTRIBUTION,
    RECONCILIATION,
    PRIVACY,
    SECURITY
}

public enum QualitySeverity {
    INFO,
    LOW,
    MEDIUM,
    HIGH,
    CRITICAL
}

public enum GateAction {
    PASS,
    WARN,
    QUARANTINE_RECORDS,
    QUARANTINE_BATCH,
    BLOCK_PUBLICATION,
    FAIL_CLOSED,
    DEGRADE_CONSUMER
}

Result:

public record QualityCheckResult(
        QualityRule rule,
        QualityScope scope,
        CheckStatus status,
        long checkedCount,
        long failedCount,
        Optional<Double> observedValue,
        Optional<Double> threshold,
        List<BadRecordRef> sampleFailures,
        Instant evaluatedAt
) {}

public enum CheckStatus {
    PASS,
    FAIL,
    ERROR,
    SKIPPED
}

public record QualityGateDecision(
        QualityScope scope,
        GateAction action,
        QualitySeverity maxSeverity,
        List<QualityCheckResult> results,
        String explanation
) {}

A quality system should produce a decision object, not just throw exceptions.


9. Quality Evaluator Interface

public interface QualityCheck<T> {
    QualityRule rule();
    QualityCheckResult evaluate(QualityScope scope, T data) throws Exception;
}

public interface QualityGatePolicy {
    QualityGateDecision decide(QualityScope scope, List<QualityCheckResult> results);
}

public final class QualityGate<T> {
    private final List<QualityCheck<T>> checks;
    private final QualityGatePolicy policy;

    public QualityGate(List<QualityCheck<T>> checks, QualityGatePolicy policy) {
        this.checks = List.copyOf(checks);
        this.policy = policy;
    }

    public QualityGateDecision evaluate(QualityScope scope, T data) {
        var results = new ArrayList<QualityCheckResult>();

        for (QualityCheck<T> check : checks) {
            try {
                results.add(check.evaluate(scope, data));
            } catch (Exception e) {
                results.add(QualityResults.error(check.rule(), scope, e));
            }
        }

        return policy.decide(scope, results);
    }
}

The policy decides whether a check error blocks the pipeline.

For example, if the check engine fails:

  • fail closed for regulatory export,
  • warn for exploratory dataset,
  • block publish if data cannot be certified,
  • retry if external reference check timed out.

10. Policy Evaluation

Example policy:

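import java.util.Comparator;
import java.util.List;

public final class DefaultQualityGatePolicy implements QualityGatePolicy {
    @Override
    public QualityGateDecision decide(QualityScope scope, List<QualityCheckResult> results) {
        // Privacy is fail-closed: a privacy check that fails or cannot run stops the gate.
        boolean privacyFailed = results.stream().anyMatch(r ->
            (r.status() == CheckStatus.FAIL || r.status() == CheckStatus.ERROR)
                && r.rule().dimension() == QualityDimension.PRIVACY);

        if (privacyFailed) {
            return decision(scope, GateAction.FAIL_CLOSED, results,
                "Privacy rule failed or could not be evaluated; fail closed.");
        }

        boolean criticalFailed = results.stream().anyMatch(r ->
            r.status() == CheckStatus.FAIL
                && r.rule().severity() == QualitySeverity.CRITICAL);

        if (criticalFailed) {
            return decision(scope, GateAction.BLOCK_PUBLICATION, results,
                "Critical quality rule failed.");
        }

        // Conservative default: a check that could not be evaluated never counts as a pass.
        boolean errored = results.stream().anyMatch(r -> r.status() == CheckStatus.ERROR);

        if (errored) {
            return decision(scope, GateAction.BLOCK_PUBLICATION, results,
                "At least one check could not be evaluated.");
        }

        boolean quarantineable = results.stream().anyMatch(r ->
            r.status() == CheckStatus.FAIL
                && r.rule().policy().supportsRecordQuarantine());

        if (quarantineable) {
            return decision(scope, GateAction.QUARANTINE_RECORDS, results,
                "Record-level failures can be isolated.");
        }

        boolean warning = results.stream().anyMatch(r -> r.status() == CheckStatus.FAIL);

        if (warning) {
            return decision(scope, GateAction.WARN, results,
                "Non-blocking quality failures detected.");
        }

        return decision(scope, GateAction.PASS, results, "All checks passed.");
    }

    // Builds the decision; maxSeverity is taken from the checks that did not pass.
    private static QualityGateDecision decision(QualityScope scope, GateAction action,
            List<QualityCheckResult> results, String explanation) {
        QualitySeverity maxSeverity = results.stream()
            .filter(r -> r.status() == CheckStatus.FAIL || r.status() == CheckStatus.ERROR)
            .map(r -> r.rule().severity())
            .max(Comparator.naturalOrder())
            .orElse(QualitySeverity.INFO);
        return new QualityGateDecision(scope, action, maxSeverity, List.copyOf(results), explanation);
    }
}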

This is intentionally simple. Production policy often uses asset criticality, consumer criticality, historical behavior, and backfill mode.


11. Record-Level Quarantine

Quarantine is not a trash bin. It is a controlled lane for invalid data.

A quarantine record should contain:

public record QuarantineRecord(
        String quarantineId,
        String sourceRunId,
        DatasetRef sourceDataset,
        Optional<String> sourcePosition,
        String recordKey,
        Optional<String> eventId,
        String ruleId,
        String ruleVersion,
        QualitySeverity severity,
        String reasonCode,
        String reasonMessage,
        String payloadRef,
        String payloadHash,
        Instant quarantinedAt
) {}

Do not store the raw payload inline by default. For sensitive data, store an encrypted payload reference or a masked payload.


12. Quarantine Safety Rules

You can quarantine records and continue only when:

  • bad records are independent,
  • downstream output can represent partial acceptance,
  • completeness impact is within threshold,
  • no aggregate correctness is silently corrupted,
  • consumer contract allows partial data,
  • rejected count is visible and auditable,
  • replay/repair path exists.

You should block the batch when:

  • file manifest is incomplete,
  • schema is incompatible,
  • primary key generation is broken,
  • reference table is missing,
  • high percentage of records are invalid,
  • privacy/security invariant fails,
  • aggregate output would be misleading,
  • rejected records cannot be safely separated.

Quarantine is a scalpel, not a blanket excuse to publish bad data.
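
The two lists above can be folded into a single eligibility test that runs before a gate is allowed to choose record-level quarantine. This is a simplified sketch: `QuarantineEligibilitySketch`, `BatchFacts`, and the `maxRejectedRatio` parameter are hypothetical, and only a few of the listed conditions are modeled.

```java
// Hypothetical sketch: record-level quarantine is allowed only if every condition holds.
final class QuarantineEligibilitySketch {
    record BatchFacts(
            long inputCount,
            long rejectedCount,
            boolean recordsIndependent,
            boolean contractAllowsPartial,
            boolean replayPathExists) {}

    // Returns false when any safety condition fails; the caller should then block the batch.
    static boolean canQuarantineAndContinue(BatchFacts facts, double maxRejectedRatio) {
        if (facts.inputCount() <= 0) {
            return false; // nothing to measure the rejected ratio against
        }
        double rejectedRatio = (double) facts.rejectedCount() / facts.inputCount();
        return facts.recordsIndependent()
            && facts.contractAllowsPartial()
            && facts.replayPathExists()
            && rejectedRatio <= maxRejectedRatio;
    }
}
```

A result of `false` does not say which action to take instead; that remains a policy decision such as QUARANTINE_BATCH or BLOCK_PUBLICATION.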


13. Batch Gate Example

A batch publication gate:

public final class BatchPublicationGate {
    private final QualityGate<TableBatch> qualityGate;
    private final Publisher publisher;
    private final QuarantineWriter quarantineWriter;

    public BatchPublicationGate(QualityGate<TableBatch> qualityGate,
                                Publisher publisher,
                                QuarantineWriter quarantineWriter) {
        this.qualityGate = qualityGate;
        this.publisher = publisher;
        this.quarantineWriter = quarantineWriter;
    }

    public PublicationResult publishIfAllowed(QualityScope scope, TableBatch batch) {
        QualityGateDecision decision = qualityGate.evaluate(scope, batch);

        // Exhaustive over GateAction: adding a new action fails compilation here.
        return switch (decision.action()) {
            case PASS -> publisher.publish(batch, decision);
            case WARN -> publisher.publishDegraded(batch, decision);
            case QUARANTINE_RECORDS -> {
                // Write the rejected rows first so they are never lost if publishing fails.
                var split = batch.splitByValidity(decision.results());
                quarantineWriter.write(split.invalidRecords(), decision);
                yield publisher.publish(split.validRecords(), decision);
            }
            case QUARANTINE_BATCH -> {
                quarantineWriter.writeBatch(batch, decision);
                yield PublicationResult.notPublished(decision);
            }
            case BLOCK_PUBLICATION, FAIL_CLOSED -> PublicationResult.blocked(decision);
            case DEGRADE_CONSUMER -> publisher.markDegraded(batch, decision);
        };
    }
}

Every path records the quality decision.


14. Streaming Gate Example

Streaming quality gates cannot block forever.

They need local decisions and aggregate windows.

Per-record gates:

  • schema parse,
  • required key,
  • event timestamp valid,
  • enum value allowed,
  • tenant present,
  • PII not in forbidden field.

Window-level gates:

  • error rate,
  • duplicate rate,
  • source lag,
  • watermark delay,
  • distribution drift,
  • output volume anomaly.

Streaming publication is continuous, so actions include:

  • route bad record to DLQ/quarantine,
  • pause source partition,
  • mark asset degraded,
  • trip circuit breaker,
  • fail job for critical invariant,
  • alert owner.
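
A window-level gate can be sketched as a small error-rate circuit breaker fed by the per-record results. The sketch below assumes a count-based tumbling window for simplicity; `WindowErrorRateBreakerSketch` and its parameters are hypothetical, and a real streaming job would more likely use the engine's own windowing and state.

```java
// Hypothetical sketch: trips when the error rate of a full window exceeds a threshold.
final class WindowErrorRateBreakerSketch {
    enum State { CLOSED, OPEN }

    private final long windowSize;      // records per tumbling window
    private final double maxErrorRate;  // e.g. 0.005 for 0.5%
    private long seen;
    private long failed;
    private State state = State.CLOSED;

    WindowErrorRateBreakerSketch(long windowSize, double maxErrorRate) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.maxErrorRate = maxErrorRate;
    }

    // Feed one per-record gate outcome; the state is re-evaluated when a window completes.
    State observe(boolean recordValid) {
        seen++;
        if (!recordValid) {
            failed++;
        }
        if (seen == windowSize) {
            double errorRate = (double) failed / seen;
            state = errorRate > maxErrorRate ? State.OPEN : State.CLOSED;
            seen = 0;
            failed = 0;
        }
        return state;
    }
}
```

Here the breaker closes again after one healthy window. Whether it should instead stay open until an operator resets it is a policy choice this sketch does not settle.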

15. Gate State Machine

The state that results from a gate decision should appear in asset/run metadata.

Consumers should know whether they are reading:

certified data
degraded data
partial data
uncertified data
superseded data
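
One possible way to derive that consumer-visible state is a direct mapping from the gate action. The mapping below is an assumption made for illustration, not something the lesson prescribes; `AssetTrustStateSketch` and `TrustState` are hypothetical names.

```java
// Hypothetical sketch: translate a gate action into the state consumers see.
final class AssetTrustStateSketch {
    enum GateAction {
        PASS, WARN, QUARANTINE_RECORDS, QUARANTINE_BATCH,
        BLOCK_PUBLICATION, FAIL_CLOSED, DEGRADE_CONSUMER
    }

    // SUPERSEDED is not produced by a gate; a later repair run would set it on the old version.
    enum TrustState { CERTIFIED, DEGRADED, PARTIAL, UNCERTIFIED, SUPERSEDED }

    static TrustState stateAfter(GateAction action) {
        return switch (action) {
            case PASS -> TrustState.CERTIFIED;
            case WARN, DEGRADE_CONSUMER -> TrustState.DEGRADED;
            case QUARANTINE_RECORDS -> TrustState.PARTIAL;
            case QUARANTINE_BATCH, BLOCK_PUBLICATION, FAIL_CLOSED -> TrustState.UNCERTIFIED;
        };
    }
}
```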

16. Publication Gate

The publication gate is the final guard before an asset becomes current/certified.

It should check:

  • all required upstream assets are available,
  • input versions are known,
  • transform version is approved,
  • schema is compatible,
  • critical quality rules pass,
  • reconciliation passes,
  • lineage manifest exists,
  • sensitive fields are classified,
  • access policy exists,
  • output was written to staging,
  • output row count and partition count are plausible,
  • backfill/rewrite is authorized.

Publication gate output:

{
  "asset": "gold.case_sla_breach",
  "candidateVersion": "iceberg-snapshot-885",
  "decision": "PUBLISHED_CERTIFIED",
  "qualityStatus": "PASS",
  "lineageStatus": "COMPLETE",
  "reconciliationStatus": "PASS",
  "decidedAt": "2026-07-04T02:08:31Z"
}

Do not let a job write directly to the public/current location without passing the publication gate.


17. Progressive Validation

Validate progressively as data gains trust.

Layer-specific gates:

| Layer | Gate Focus |
| --- | --- |
| Raw/Bronze | arrival, integrity, parseability, preservation |
| Parsed | schema, type, required metadata |
| Canonical/Silver | domain invariants, identity, time semantics, dedupe |
| Derived/Gold | aggregation correctness, business rules, consumer contract |
| Export | privacy, authorization, row-level policy, external contract |

Each layer should reduce uncertainty.


18. Reconciliation Gates

Reconciliation gates compare independent views of the same data.

Examples:

input rows = accepted rows + rejected rows
source count = bronze count for file batch
CDC event count = applied mutations + ignored no-op mutations
case ledger balance = sum(case movements)
source vendor total = imported total

Reconciliation check model:

public record ReconciliationCheck(
        String checkId,
        DatasetRef leftDataset,
        DatasetRef rightDataset,
        String measurement,
        Tolerance tolerance
) {}

public record Tolerance(
        double absolute,
        double percentage
) {}

Example decision:

source_count=1,000,000
target_count=999,996
rejected_count=4
accepted + rejected = 1,000,000
=> PASS

Another:

source_count=1,000,000
target_count=945,000
rejected_count=10
missing=54,990
=> BLOCK_PUBLICATION

Reconciliation is often more important than sophisticated statistical checks.
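
The two decisions above reduce to a few lines of arithmetic. The sketch below is illustrative: `ReconciliationSketch` is a hypothetical name, and it assumes the gap may be as large as the greater of the absolute and percentage tolerances, which is one of several possible readings of `Tolerance`.

```java
// Hypothetical sketch: count reconciliation with an explicit tolerance.
final class ReconciliationSketch {
    enum Outcome { PASS, BLOCK_PUBLICATION }

    // percentage is expressed in percent of the source count, e.g. 0.01 means 0.01%.
    record Tolerance(double absolute, double percentage) {}

    // Rows that are neither in the target nor explicitly rejected.
    static long missing(long sourceCount, long targetCount, long rejectedCount) {
        return sourceCount - targetCount - rejectedCount;
    }

    static Outcome reconcile(long sourceCount, long targetCount, long rejectedCount,
                             Tolerance tolerance) {
        long gap = Math.abs(missing(sourceCount, targetCount, rejectedCount));
        // Assumption: the larger of the two tolerances applies.
        double allowed = Math.max(tolerance.absolute(),
                                  sourceCount * tolerance.percentage() / 100.0);
        return gap <= allowed ? Outcome.PASS : Outcome.BLOCK_PUBLICATION;
    }
}
```

With a zero tolerance, the first example (999,996 accepted plus 4 rejected) passes and the second (54,990 rows unaccounted for) blocks publication.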


19. Freshness Gates

Freshness is a quality dimension.

Freshness gate examples:

latest_event_time >= now - 15 minutes
latest_source_commit_time >= now - 10 minutes
asset_published_at <= expected_deadline
watermark >= now - 30 minutes

Freshness result:

public record FreshnessMeasurement(
        Instant observedTime,
        Duration maxAllowedDelay,
        Duration actualDelay
) {}

Action policy:

| Asset | Freshness Breach | Action |
| --- | --- | --- |
| internal exploration | warn | WARN |
| daily dashboard | past deadline | DEGRADE_CONSUMER |
| real-time risk API | > 15 min | BLOCK dependent publication / alert |
| regulatory report | deadline miss | incident |

Freshness gates must use the correct time:

  • event time,
  • source commit time,
  • ingestion time,
  • publication time,
  • business effective time.

Do not use processing time as a proxy for any of these without first confirming that it tracks the time the rule is actually about.
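
A freshness measurement is easiest to test when the clock is injected. The following sketch reuses the shape of `FreshnessMeasurement` from above inside a hypothetical `FreshnessGateSketch` class; the `measure` helper and the `breached` method are assumptions added for illustration.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: compute the delay of an observed time against an injected clock.
final class FreshnessGateSketch {
    record FreshnessMeasurement(Instant observedTime, Duration maxAllowedDelay, Duration actualDelay) {
        // Breached only when the actual delay is strictly greater than the allowed delay.
        boolean breached() {
            return actualDelay.compareTo(maxAllowedDelay) > 0;
        }
    }

    // observedTime is whichever time the rule is about: event, commit, or publication time.
    static FreshnessMeasurement measure(Instant observedTime, Duration maxAllowedDelay, Clock clock) {
        Duration actualDelay = Duration.between(observedTime, clock.instant());
        return new FreshnessMeasurement(observedTime, maxAllowedDelay, actualDelay);
    }
}
```

The caller decides which timestamp to pass in, which keeps the choice of time semantics visible at the call site rather than buried in the check.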


20. Schema Gates

A schema gate answers:

Can this data be safely parsed and interpreted by downstream consumers?

It should check:

  • schema ID exists,
  • schema compatibility mode passes,
  • required fields exist,
  • field type is compatible,
  • enum changes are allowed,
  • removed fields are not used by consumers,
  • semantic version transition is approved,
  • unknown-field handling is defined,
  • default values are safe,
  • schema registry and contract registry agree.

Schema gate action:

| Situation | Action |
| --- | --- |
| compatible additive nullable field | PASS/WARN |
| required field removed | BLOCK |
| type changed string -> int | BLOCK unless explicitly migrated |
| enum value added | WARN or BLOCK depending on consumer exhaustiveness |
| schema registry unavailable | policy-dependent |
| schema ID missing | BLOCK |

Do not confuse “can deserialize” with “semantically compatible”.
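
The structural rows of the table above can be approximated by diffing two field maps. This is a deliberately small sketch: `SchemaGateSketch` and `Field` are hypothetical, a real gate would consult the schema registry's compatibility mode, and the handling of removed optional fields and new required fields is an assumption.

```java
import java.util.Map;

// Hypothetical sketch: structural schema comparison; the worst finding wins.
final class SchemaGateSketch {
    enum Action { PASS, WARN, BLOCK }

    record Field(String type, boolean required) {}

    static Action compare(Map<String, Field> current, Map<String, Field> candidate) {
        boolean warn = false;
        for (Map.Entry<String, Field> entry : current.entrySet()) {
            Field before = entry.getValue();
            Field after = candidate.get(entry.getKey());
            if (after == null) {
                if (before.required()) {
                    return Action.BLOCK;   // required field removed
                }
                warn = true;               // assumption: optional removal needs consumer review
            } else if (!after.type().equals(before.type())) {
                return Action.BLOCK;       // type changed, e.g. string -> int
            }
        }
        for (Map.Entry<String, Field> entry : candidate.entrySet()) {
            if (!current.containsKey(entry.getKey())) {
                if (entry.getValue().required()) {
                    return Action.BLOCK;   // assumption: new required field breaks old producers
                }
                warn = true;               // additive nullable field
            }
        }
        return warn ? Action.WARN : Action.PASS;
    }
}
```

This only covers deserialization-level compatibility. It says nothing about whether a field still means the same thing, which is the semantic question raised above.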


21. Semantic Gates

Semantic gates encode domain truth.

For enforcement lifecycle:

case cannot be CLOSED before OPENED
case cannot move from CLOSED to UNDER_REVIEW without REOPEN event
sla_deadline must be calculated from accepted_at using policy active at accepted_at
breach event must not be emitted before sla_deadline
assigned officer must belong to active enforcement unit at assignment time

These rules are not generic data quality. They are domain invariants.

Example:

public final class CaseStatusTransitionCheck implements QualityCheck<List<CaseEvent>> {
    @Override
    public QualityRule rule() {
        return Rules.caseStatusTransitionV1();
    }

    @Override
    public QualityCheckResult evaluate(QualityScope scope, List<CaseEvent> events) {
        long failures = events.stream()
            .filter(e -> !StatusTransitions.isLegal(e.previousStatus(), e.newStatus()))
            .count();

        return failures == 0
            ? QualityResults.pass(rule(), scope, events.size())
            : QualityResults.fail(rule(), scope, events.size(), failures);
    }
}

Domain gates are where software engineering and data engineering meet.


22. Temporal Gates

Temporal correctness requires dedicated checks.

Rules:

event_time is present
event_time is not too far in the future
event_time is not before source system existence
effective_time is valid for business rule
recorded_time >= event_time for normal events
watermark progress is within SLO
late event rate is within threshold

Temporal check example:

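public final class EventTimeBoundsCheck implements QualityCheck<List<Envelope<?>>> {
    private final QualityRule rule;
    private final Clock clock;
    private final Duration allowedFutureSkew;
    private final Instant minimumBusinessDate;

    public EventTimeBoundsCheck(QualityRule rule, Clock clock,
                                Duration allowedFutureSkew, Instant minimumBusinessDate) {
        this.rule = rule;
        this.clock = clock;
        this.allowedFutureSkew = allowedFutureSkew;
        this.minimumBusinessDate = minimumBusinessDate;
    }

    @Override
    public QualityRule rule() {
        return rule;
    }

    @Override
    public QualityCheckResult evaluate(QualityScope scope, List<Envelope<?>> records) {
        Instant now = clock.instant();
        long failures = records.stream().filter(r -> {
            Instant eventTime = r.eventTime();
            // A missing event_time counts as a failure instead of throwing.
            return eventTime == null
                || eventTime.isAfter(now.plus(allowedFutureSkew))
                || eventTime.isBefore(minimumBusinessDate);
        }).count();

        return QualityResults.fromFailureCount(rule(), scope, records.size(), failures);
    }
}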

Temporal gates protect windowing, backfill, bitemporal reporting, and SLA detection.


23. Join and Enrichment Gates

Enrichment can silently corrupt data.

Rules:

reference table version is known
lookup hit rate above threshold
missing reference keys quarantined or defaulted explicitly
temporal join uses reference version valid at event time
broadcast/reference data not stale

Example:

| Enrichment Failure | Risk | Action |
| --- | --- | --- |
| 0.01% optional country lookup missing | low | WARN |
| 30% policy lookup missing | high | BLOCK |
| reference snapshot unknown | high | BLOCK |
| stale risk policy | critical | FAIL_CLOSED for regulated decision |

Never silently default important reference data.

Bad:

String policy = policyMap.getOrDefault(policyId, "STANDARD");

Better:

Optional<Policy> policy = policyRepository.find(policyId, eventTime);
if (policy.isEmpty()) {
    return ValidationResult.reject("MISSING_POLICY_REFERENCE");
}

24. Distribution and Drift Gates

Distribution gates detect unexpected shape changes.

Examples:

case volume by vendor not below historical p5
priority distribution not shifted beyond threshold
null rate of accepted_at not above 0.1%
rejected record rate not above 0.5%
unknown enum value rate not above 0

Use with care.

Distribution checks are not proof of correctness. They are anomaly signals.

Action policy:

  • warn for statistical anomaly,
  • block only when anomaly violates known consumer contract,
  • require human review for suspicious but not structurally invalid data,
  • suppress or adjust during known business events/backfill/migration.

False positives destroy trust in gates.

False negatives allow silent corruption.

Design thresholds with historical context and owner review.


25. Privacy Gates

Privacy gates should be fail-closed.

Rules:

restricted PII must not appear in gold public table
external export must not contain direct identifiers
hashed identifiers must use approved salt/key policy
masking transform version must be approved
classification tags must exist for all fields
row-level tenant filter must be enforced

Example:

public final class RestrictedFieldExportCheck implements QualityCheck<SchemaContract> {
    private final QualityRule rule;

    public RestrictedFieldExportCheck(QualityRule rule) {
        this.rule = rule;
    }

    @Override
    public QualityRule rule() {
        return rule;
    }

    @Override
    public QualityCheckResult evaluate(QualityScope scope, SchemaContract contract) {
        // Counts restricted PII fields only when the contract targets an external vendor.
        long restrictedFields = contract.fields().stream()
            .filter(f -> f.classification() == Sensitivity.RESTRICTED_PII)
            .filter(f -> contract.destination() == Destination.EXTERNAL_VENDOR)
            .count();

        return restrictedFields == 0
            ? QualityResults.pass(rule(), scope, contract.fields().size())
            : QualityResults.fail(rule(), scope, contract.fields().size(), restrictedFields);
    }
}

Privacy gate failures should not be downgraded to warnings without formal exception.


26. Gate Evidence

Every gate decision should be stored.

Quality result event:

{
  "runId": "gold-sla-breach/20260704T0200",
  "asset": "gold.case_sla_breach",
  "scope": {"partition": {"report_date": "2026-07-04"}},
  "suiteId": "gold-case-sla-breach-publication",
  "suiteVersion": "3.1.0",
  "decision": "PASS",
  "evaluatedAt": "2026-07-04T02:08:00Z",
  "checks": [
    {
      "ruleId": "case_id_not_null",
      "status": "PASS",
      "checkedCount": 9904,
      "failedCount": 0
    },
    {
      "ruleId": "breach_flag_consistent",
      "status": "PASS",
      "checkedCount": 9904,
      "failedCount": 0
    }
  ]
}

Attach this to:

  • run manifest,
  • lineage event,
  • asset version metadata,
  • observability metrics,
  • audit evidence store,
  • incident records.

A gate decision without evidence is hard to trust.


27. Metrics From Gates

Gate outputs should produce metrics:

quality_check_total{rule_id,status,asset}
quality_failed_records_total{rule_id,asset}
quality_gate_decision_total{decision,asset}
quality_gate_duration_seconds{suite_id,asset}
quarantine_records_total{reason_code,asset}
publication_block_total{asset,reason_code}
certified_asset_age_seconds{asset}

Do not label metrics with high-cardinality values like record ID, full exception message, or file path.

Use logs/events for high-cardinality detail.


28. Quality Gate and Lineage Integration

Quality gates should affect lineage.

Lineage should show:

  • input asset,
  • candidate output asset,
  • quality result,
  • final publication action,
  • quarantine output if any,
  • downstream degraded status.

Quality failure is lineage-relevant. It changes the graph of trustworthy data.


29. Quality Gates and Impact Analysis

A failed gate should trigger impact analysis.

Examples:

silver.case_event quality failed for event_date=2026-07-04
-> gold.case_sla_breach depends on this partition
-> regulatory dashboard is degraded
-> case-risk-api has critical freshness risk

Gate decision can emit an impact event:

{
  "type": "QUALITY_GATE_BLOCKED",
  "asset": "silver.case_event",
  "partition": {"event_date": "2026-07-04"},
  "failedRules": ["case_id_unique"],
  "recommendedImpactAnalysis": true
}

Then the control plane can:

  • block dependent jobs,
  • mark downstream assets stale/degraded,
  • notify owners,
  • create incident,
  • schedule reprocessing after repair.
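
Finding the affected assets is a graph traversal over lineage. The sketch below uses a breadth-first walk over an in-memory adjacency map; `ImpactAnalysisSketch` and the `downstreamOf` map are hypothetical stand-ins for whatever the control plane's lineage store provides.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: collect every asset downstream of a failed asset.
final class ImpactAnalysisSketch {
    // downstreamOf maps an asset to the assets that read from it directly.
    static Set<String> impacted(String failedAsset, Map<String, List<String>> downstreamOf) {
        Set<String> visited = new LinkedHashSet<>(); // keeps discovery order, prevents revisits
        Deque<String> queue = new ArrayDeque<>();
        queue.add(failedAsset);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String child : downstreamOf.getOrDefault(current, List.of())) {
                if (!child.equals(failedAsset) && visited.add(child)) {
                    queue.add(child);
                }
            }
        }
        return visited;
    }
}
```

The returned set is what the control plane would iterate over to block jobs, mark assets degraded, or notify owners.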

30. Gate Configuration as Code

Quality gates should be versioned.

Example YAML:

suiteId: silver-case-event-core
version: 2.4.0
asset: iceberg://prod/enforcement/silver.case_event
owner: enforcement-data-platform
rules:
  - id: case_id_not_null
    dimension: COMPLETENESS
    severity: HIGH
    expression: case_id IS NOT NULL
    actionOnFail: QUARANTINE_RECORDS
  - id: event_id_unique
    dimension: UNIQUENESS
    severity: HIGH
    expression: event_id UNIQUE WITHIN partition
    actionOnFail: BLOCK_PUBLICATION
  - id: status_allowed
    dimension: VALIDITY
    severity: HIGH
    allowedValues: [OPENED, ACCEPTED, ASSIGNED, CLOSED, REOPENED]
    actionOnFail: QUARANTINE_RECORDS
  - id: restricted_pii_absent
    dimension: PRIVACY
    severity: CRITICAL
    actionOnFail: FAIL_CLOSED

Rules should go through review like code.

Changing a gate can be as dangerous as changing the transform.


31. Gate Versioning

A quality result must know which rules were used.

asset version 885 passed suite silver-case-event-core:2.4.0
asset version 886 passed suite silver-case-event-core:2.5.0

This matters for:

  • audit,
  • reproducibility,
  • comparing historical quality,
  • avoiding false regression,
  • backfill consistency,
  • proving what was certified at the time.

Do not overwrite quality rules without version history.


32. CI Gates vs Runtime Gates

| Gate Type | Runs When | Protects Against |
| --- | --- | --- |
| CI schema gate | pull request | incompatible schema changes |
| CI contract gate | pull request | producer/consumer assumption break |
| CI golden dataset gate | pull request | transform behavior regression |
| Runtime source gate | job start | missing/incomplete input |
| Runtime quality gate | during run | invalid records/data anomalies |
| Runtime publication gate | before publish | bad asset becoming current |
| Post-publication monitor | after publish | delayed anomaly detection |

You need both CI and runtime gates.

CI cannot detect bad production data.

Runtime cannot replace code review and compatibility checks.


33. Golden Dataset Quality Gate

Golden datasets prevent accidental behavior changes.

input fixture + expected output + expected quality result

Example:

@Test
void breachDetectorShouldRejectImpossibleClosedCase() {
    var input = GoldenDatasets.loadCaseEvents("closed_before_opened.jsonl");
    var output = breachDetector.transform(input);
    var decision = qualityGate.evaluate(scope, output);

    assertThat(decision.action()).isEqualTo(GateAction.QUARANTINE_RECORDS);
    assertThat(decision.results())
        .anyMatch(r -> r.rule().ruleId().equals("case_status_transition_legal")
            && r.failedCount() == 1);
}

Golden tests should include:

  • normal data,
  • malformed data,
  • boundary time data,
  • duplicate data,
  • late correction,
  • missing reference,
  • privacy leak,
  • schema evolution case,
  • backfill case.

34. Great Expectations, Deequ, Soda, and Custom Java

Tools can help, but they do not remove design responsibility.

| Tool/Approach | Strength | Caveat |
| --- | --- | --- |
| Great Expectations / GX | expressive expectations, validation workflows, docs | Python-centered ecosystem; integrate evidence with Java platform |
| Deequ | Spark-scale data quality checks, metrics/analyzers | Scala/Spark ecosystem; good for large batch datasets |
| Soda Core | YAML checks/contracts and scans | external tool integration and governance needed |
| Custom Java checks | deep domain logic, service integration | must avoid reinventing full quality platform badly |
| SQL checks | simple and portable | weak for complex domain/record logic |

For Java-heavy platforms, a practical design is:

Java pipeline emits run manifest and candidate dataset.
Spark/SQL/GX/Deequ/Soda/custom validators evaluate checks.
Control plane normalizes results into one QualityGateDecision model.
Publication gate uses normalized decision.

Do not let each tool invent a different quality result schema.


35. Normalized Quality Result Schema

Even if checks come from multiple tools, normalize results.

public record NormalizedQualityResult(
        String tool,
        String toolVersion,
        String suiteId,
        String suiteVersion,
        String ruleId,
        QualityDimension dimension,
        QualitySeverity severity,
        CheckStatus status,
        long checkedCount,
        long failedCount,
        Map<String, Object> measurements,
        List<BadRecordRef> examples
) {}

Adapters:

GX result      -> NormalizedQualityResult
Deequ result   -> NormalizedQualityResult
SQL result     -> NormalizedQualityResult
Java result    -> NormalizedQualityResult
Soda scan      -> NormalizedQualityResult

This keeps the control plane independent from validation tooling.


36. Data Quality Result Store

Store quality results separately from logs.

Basic tables:

CREATE TABLE quality_suite (
    suite_id TEXT NOT NULL,
    suite_version TEXT NOT NULL,
    asset_namespace TEXT NOT NULL,
    asset_name TEXT NOT NULL,
    owner TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (suite_id, suite_version)
);

CREATE TABLE quality_gate_result (
    gate_result_id UUID PRIMARY KEY,
    run_id TEXT NOT NULL,
    asset_namespace TEXT NOT NULL,
    asset_name TEXT NOT NULL,
    asset_version TEXT,
    suite_id TEXT NOT NULL,
    suite_version TEXT NOT NULL,
    decision TEXT NOT NULL,
    max_severity TEXT NOT NULL,
    evaluated_at TIMESTAMPTZ NOT NULL,
    payload JSONB NOT NULL
);

CREATE TABLE quality_check_result (
    gate_result_id UUID NOT NULL REFERENCES quality_gate_result(gate_result_id),
    rule_id TEXT NOT NULL,
    rule_version TEXT NOT NULL,
    dimension TEXT NOT NULL,
    severity TEXT NOT NULL,
    status TEXT NOT NULL,
    checked_count BIGINT,
    failed_count BIGINT,
    observed_value DOUBLE PRECISION,
    threshold_value DOUBLE PRECISION,
    PRIMARY KEY (gate_result_id, rule_id, rule_version)
);

This supports audit and trend analysis.


37. Bad Record Sampling

For high-volume pipelines, do not store every bad record inline.

Store:

  • count,
  • reason code,
  • sample records,
  • payload references,
  • hash/fingerprint,
  • partition/source position,
  • replay instructions.

Sampling strategy:

store first N per reason code
store random sample per partition
store all critical privacy failures in secure quarantine
store aggregate counts for high-volume repeated failures

Bad record examples should be sanitized by default.
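
The "first N per reason code" strategy can be sketched as a small in-memory collector. The names below (`BadRecordSamplerSketch`, `BadRecordSample`) are hypothetical, and the choice to keep only a source position and a payload fingerprint, never the raw payload, is an assumption about what "sanitized by default" might look like.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: keeps exact counts but at most N samples per reason code.
final class BadRecordSamplerSketch {

    // Sanitized reference: no raw payload, only where it came from and a fingerprint.
    record BadRecordSample(String reasonCode, String sourcePosition, String payloadFingerprint) {}

    private final int maxSamplesPerReason;
    private final Map<String, Long> countsByReason = new LinkedHashMap<>();
    private final Map<String, List<BadRecordSample>> samplesByReason = new LinkedHashMap<>();

    BadRecordSamplerSketch(int maxSamplesPerReason) {
        if (maxSamplesPerReason < 0) {
            throw new IllegalArgumentException("maxSamplesPerReason must be >= 0");
        }
        this.maxSamplesPerReason = maxSamplesPerReason;
    }

    void observe(String reasonCode, String sourcePosition, String payloadFingerprint) {
        // The aggregate count is always updated, even after the sample quota is full.
        countsByReason.merge(reasonCode, 1L, Long::sum);
        List<BadRecordSample> samples =
                samplesByReason.computeIfAbsent(reasonCode, k -> new ArrayList<>());
        if (samples.size() < maxSamplesPerReason) {
            samples.add(new BadRecordSample(reasonCode, sourcePosition, payloadFingerprint));
        }
    }

    long count(String reasonCode) {
        return countsByReason.getOrDefault(reasonCode, 0L);
    }

    List<BadRecordSample> samples(String reasonCode) {
        return List.copyOf(samplesByReason.getOrDefault(reasonCode, List.of()));
    }

    public static void main(String[] args) {
        BadRecordSamplerSketch sampler = new BadRecordSamplerSketch(2);
        for (int i = 0; i < 5; i++) {
            sampler.observe("INVALID_DATE", "vendor-a-20260704:line-" + i, "fp-" + i);
        }
        System.out.println(sampler.count("INVALID_DATE") + " failures, "
                + sampler.samples("INVALID_DATE").size() + " samples kept");
    }
}
```

Critical privacy failures would bypass a sampler like this and go to secure quarantine in full, as the strategy above states.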


38. Handling Check Engine Failure

A quality check can fail because the data is bad or because the checker broke.

Separate:

CHECK_FAIL = data violates rule
CHECK_ERROR = rule could not be evaluated

Policy examples:

| Check error scenario | Action |
| --- | --- |
| optional distribution checker timeout | WARN |
| schema registry unavailable | BLOCK if schema unknown |
| privacy scanner unavailable | FAIL_CLOSED |
| reference DB timeout | retry, then BLOCK/DEGRADE depending on asset |
| quality service unavailable for experimental asset | WARN |
| quality service unavailable for regulatory report | BLOCK |

Do not convert check errors into passes.
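
A minimal sketch of such a policy follows. It covers only part of the table above, and the enum names (`CheckKind`, `AssetCriticality`, `ErrorAction`) are hypothetical. The one deliberate design choice is that `ErrorAction` has no PASS constant, so a check error cannot be mapped to a pass by mistake.

```java
// Illustrative sketch of a CHECK_ERROR policy; all names are assumptions.
final class CheckErrorPolicySketch {

    enum CheckKind { OPTIONAL_DISTRIBUTION, SCHEMA, PRIVACY, QUALITY_SERVICE }

    enum AssetCriticality { EXPERIMENTAL, REGULATORY }

    // No PASS here on purpose: an unevaluated rule is never a pass.
    enum ErrorAction { WARN, BLOCK, FAIL_CLOSED }

    static ErrorAction onCheckError(CheckKind kind, AssetCriticality criticality) {
        return switch (kind) {
            case PRIVACY -> ErrorAction.FAIL_CLOSED;
            case SCHEMA -> ErrorAction.BLOCK;
            case OPTIONAL_DISTRIBUTION -> ErrorAction.WARN;
            case QUALITY_SERVICE -> criticality == AssetCriticality.EXPERIMENTAL
                    ? ErrorAction.WARN
                    : ErrorAction.BLOCK;
        };
    }

    public static void main(String[] args) {
        System.out.println(onCheckError(CheckKind.PRIVACY, AssetCriticality.EXPERIMENTAL));
        System.out.println(onCheckError(CheckKind.QUALITY_SERVICE, AssetCriticality.REGULATORY));
    }
}
```

Retry behavior, such as the reference DB case, is left out of this sketch; it would wrap the check call before the policy is consulted.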


39. Backfill Quality Gates

Backfill has special risk.

Backfill gates should check:

  • backfill campaign is approved,
  • transform version is pinned,
  • input snapshot/window is declared,
  • output mutation mode is explicit,
  • expected affected partitions are known,
  • quality thresholds are adjusted intentionally if historical data differs,
  • comparison against current output is available,
  • downstream notification is prepared,
  • rollback/supersession plan exists.

Backfill decision:

{
  "backfillId": "BF-20260704-SLA-RECALC",
  "asset": "gold.case_sla_breach",
  "partitions": ["2026-01-01..2026-06-30"],
  "decision": "BLOCK_PUBLICATION",
  "reason": "recomputed breach_count differs by 18% without approved correction manifest"
}

Backfill gates should be stricter than normal daily runs.
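
Several of the preconditions listed above can be checked mechanically before any data is touched. The sketch below assumes a hypothetical `BackfillRequest` shape and a hypothetical `PRECONDITIONS_MET` outcome; only `BLOCK_PUBLICATION` comes from the decision example above. It covers the declarative checks only, not the output comparison or downstream notification.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: field names and the request shape are assumptions.
final class BackfillGateSketch {

    record BackfillRequest(
            String backfillId,
            boolean campaignApproved,
            String pinnedTransformVersion,
            String inputSnapshot,
            String outputMutationMode,
            List<String> expectedPartitions,
            boolean rollbackPlanExists) {}

    record BackfillDecision(String backfillId, String decision, List<String> reasons) {}

    static BackfillDecision evaluate(BackfillRequest r) {
        // Collect every missing precondition so the operator sees the full picture at once.
        List<String> reasons = new ArrayList<>();
        if (!r.campaignApproved()) reasons.add("backfill campaign is not approved");
        if (isBlank(r.pinnedTransformVersion())) reasons.add("transform version is not pinned");
        if (isBlank(r.inputSnapshot())) reasons.add("input snapshot/window is not declared");
        if (isBlank(r.outputMutationMode())) reasons.add("output mutation mode is not explicit");
        if (r.expectedPartitions() == null || r.expectedPartitions().isEmpty()) {
            reasons.add("expected affected partitions are unknown");
        }
        if (!r.rollbackPlanExists()) reasons.add("rollback/supersession plan is missing");

        String decision = reasons.isEmpty() ? "PRECONDITIONS_MET" : "BLOCK_PUBLICATION";
        return new BackfillDecision(r.backfillId(), decision, List.copyOf(reasons));
    }

    private static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }

    public static void main(String[] args) {
        BackfillRequest request = new BackfillRequest("BF-20260704-SLA-RECALC", true, null,
                "silver.case_event snapshot 884", "REPLACE_PARTITIONS",
                List.of("2026-01-01..2026-06-30"), false);
        System.out.println(evaluate(request));
    }
}
```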


40. Thresholds and Tolerances

Thresholds are dangerous when arbitrary.

Bad:

row count must not change by more than 10%

Why 10%?

Better:

For vendor-a daily case file:
- warn if count below trailing 28-day p5 unless known holiday
- block if count below 50% of expected manifest count
- block if accepted + rejected != manifest declared count

Threshold types:

| Type | Use |
| --- | --- |
| absolute | small datasets, exact count |
| percentage | proportional tolerance |
| historical percentile | seasonal variation |
| business calendar aware | holidays/weekends |
| manifest-driven | file/API source declares count |
| reference-driven | compare to independent system |
| zero-tolerance | privacy, uniqueness, critical validity |

Critical invariants should not use fuzzy thresholds.
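
The vendor-a rules above can be sketched as one evaluation function. This is a hedged illustration: it assumes "count" means accepted plus rejected rows, treats the expected count and the manifest-declared count as separate inputs, and uses a nearest-rank percentile. `CountThresholdSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of manifest-driven and percentile-based count thresholds.
final class CountThresholdSketch {

    enum GateAction { PASS, WARN, BLOCK }

    // Nearest-rank percentile over a non-empty history; p is in (0, 100].
    static long percentile(List<Long> history, double p) {
        List<Long> sorted = new ArrayList<>(history);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    static GateAction evaluate(long accepted, long rejected, long manifestDeclaredCount,
                               long expectedCount, List<Long> trailingCounts,
                               boolean knownHoliday) {
        long received = accepted + rejected;
        // Exact invariant: every declared row is accounted for as accepted or rejected.
        if (received != manifestDeclaredCount) {
            return GateAction.BLOCK;
        }
        // Hard floor: fewer than 50% of the expected rows arrived.
        if (received * 2 < expectedCount) {
            return GateAction.BLOCK;
        }
        // Soft signal: unusually low versus recent history, unless a holiday explains it.
        if (!knownHoliday && !trailingCounts.isEmpty()
                && received < percentile(trailingCounts, 5.0)) {
            return GateAction.WARN;
        }
        return GateAction.PASS;
    }

    public static void main(String[] args) {
        List<Long> trailing = new ArrayList<>();
        for (long i = 0; i < 28; i++) {
            trailing.add(1000L + i);
        }
        System.out.println(evaluate(980, 10, 990, 1000, trailing, false));
    }
}
```

The exact-match rule runs first because it is an invariant, not a tolerance; the percentile rule only ever produces a warning.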


41. Quality Gates for Different Pipeline Types

File ingestion

Gate focus:

  • file arrived completely,
  • manifest matches files,
  • checksum matches,
  • header/trailer valid,
  • row count matches declared count,
  • parse error threshold,
  • duplicate file detection.
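
Two of these checks, checksum and declared row count, are simple enough to sketch directly. The `ManifestEntry` shape and the choice of SHA-256 are assumptions for illustration; a real manifest format may use a different digest or carry more fields.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a file arrival gate; the manifest shape is hypothetical.
final class FileArrivalGateSketch {

    record ManifestEntry(String fileName, String sha256Hex, long declaredRowCount) {}

    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            // Left-pad to 64 hex characters so leading zero bytes are preserved.
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns every violation found; an empty list means both checks passed.
    static List<String> violations(ManifestEntry entry, byte[] content, long parsedRowCount) {
        List<String> found = new ArrayList<>();
        if (!entry.sha256Hex().equalsIgnoreCase(sha256Hex(content))) {
            found.add("checksum mismatch for " + entry.fileName());
        }
        if (entry.declaredRowCount() != parsedRowCount) {
            found.add("row count mismatch for " + entry.fileName() + ": declared "
                    + entry.declaredRowCount() + ", parsed " + parsedRowCount);
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] content = "case_id\nC-1\nC-2\n".getBytes(StandardCharsets.UTF_8);
        ManifestEntry entry = new ManifestEntry("vendor-a-20260704.csv", sha256Hex(content), 3);
        System.out.println(violations(entry, content, 2));
    }
}
```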

API ingestion

Gate focus:

  • response schema,
  • pagination completeness,
  • cursor monotonicity,
  • rate limit handling,
  • deletion semantics,
  • freshness of sync.

CDC ingestion

Gate focus:

  • connector heartbeat,
  • log position progress,
  • snapshot completion,
  • schema history availability,
  • outbox payload parseability,
  • duplicate event ID.

Kafka streaming

Gate focus:

  • schema ID,
  • key presence,
  • event-time bounds,
  • poison rate,
  • duplicate rate,
  • lag and watermark.

Flink streaming

Gate focus:

  • state growth,
  • late event rate,
  • checkpoint health,
  • side output rejection count,
  • sink commit success.

Spark batch

Gate focus:

  • input snapshot known,
  • row/partition count,
  • null/validity/uniqueness checks,
  • reconciliation,
  • output staging validation.

Lakehouse

Gate focus:

  • snapshot commit,
  • expected partitions replaced,
  • small file count,
  • schema/partition evolution,
  • orphan write detection,
  • time-travel reproducibility.

42. Quality Gate UX

A gate is operational only if humans can understand it.

Bad message:

Validation failed.

Good message:

Publication blocked for gold.case_sla_breach partition report_date=2026-07-04.
Rule case_id_unique failed.
9904 rows checked, 12 duplicate case_id values found.
Severity HIGH.
Action BLOCK_PUBLICATION.
Input silver.case_event snapshot 884.
Producing run gold-sla-breach/20260704T0200.
Recommended action: inspect duplicate case_id sample, repair upstream dedupe, rerun partition.

Gate output should include:

  • what failed,
  • where it failed,
  • how bad it is,
  • why action was taken,
  • what data is affected,
  • how to investigate,
  • who owns it.
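
A message like the good example is easiest to keep consistent when it is rendered from structured fields rather than written by hand. The sketch below is illustrative: `GateFailure`, its field names, and the owner value are hypothetical, and the wording only approximates the example above.

```java
// Illustrative sketch: renders an operator-facing message from structured gate output.
final class GateMessageSketch {

    record GateFailure(
            String asset, String partition, String ruleId,
            long checkedCount, long failedCount,
            String severity, String action,
            String runId, String owner, String recommendedAction) {}

    static String render(GateFailure f) {
        return String.join("\n",
                "Publication blocked for " + f.asset() + " partition " + f.partition() + ".",
                "Rule " + f.ruleId() + " failed.",
                f.checkedCount() + " rows checked, " + f.failedCount() + " violations found.",
                "Severity " + f.severity() + ".",
                "Action " + f.action() + ".",
                "Producing run " + f.runId() + ".",
                "Owner " + f.owner() + ".",
                "Recommended action: " + f.recommendedAction());
    }

    public static void main(String[] args) {
        GateFailure failure = new GateFailure(
                "gold.case_sla_breach", "report_date=2026-07-04", "case_id_unique",
                9904, 12, "HIGH", "BLOCK_PUBLICATION",
                "gold-sla-breach/20260704T0200", "case-platform-team",
                "inspect duplicate case_id sample, repair upstream dedupe, rerun partition.");
        System.out.println(render(failure));
    }
}
```

Making the owner a required field means a message without an owner cannot be constructed.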

43. Runbook Template

Every critical gate should have a runbook.

# Runbook: case_id_unique failure

## Meaning
The current projection contains duplicate case_id values.

## Impact
Downstream SLA breach and workload reports may double-count cases.

## Immediate Action
Do not publish gold.case_sla_breach.

## Investigation
1. Open quality result for failed run.
2. Inspect duplicate samples.
3. Trace duplicate records to source event_id and source position.
4. Check whether duplicate is source duplicate, replay duplicate, or transform bug.
5. Check dedupe state/version.

## Repair
- Source duplicate: quarantine duplicate source events and rerun affected partition.
- Replay duplicate: verify idempotency key and sink upsert logic.
- Transform bug: patch transform, run golden dataset tests, backfill affected window.

## Evidence
Attach run manifest, quality result, lineage impact analysis, and supersession record.

Runbooks reduce incident improvisation.


44. Anti-Patterns

Anti-pattern: alert-only quality

Problem:

Bad data is published; alert arrives later.

Fix:

Critical gates block publication.

Anti-pattern: all-or-nothing validation

Problem:

One bad optional record blocks a million valid records.

Fix:

Use record quarantine when safe.

Anti-pattern: warn everything

Problem:

Warnings become noise; no one acts.

Fix:

Map severity to action and ownership.

Anti-pattern: no quality result history

Problem:

Cannot prove which checks passed for a published report.

Fix:

Store versioned quality results linked to asset version.

Anti-pattern: quality rules outside code review

Problem:

Someone relaxes rule in UI to make pipeline green.

Fix:

Quality gates as code with approval workflow.

Anti-pattern: blocking on statistical drift by default

Problem:

Legitimate business seasonality causes false incidents.

Fix:

Use drift as warning unless tied to a hard consumer contract.

Anti-pattern: privacy as warning

Problem:

Sensitive data leak continues while team investigates.

Fix:

Privacy/security gates fail closed.

45. Production Checklist

Before calling data quality gates production-grade, verify:

  • Each gate has scope, rule, measurement, threshold, and action.
  • Critical rules are separated from warning/anomaly rules.
  • Gate results are stored as structured evidence.
  • Gate decisions attach to run manifest and lineage.
  • Publication cannot bypass required gates.
  • Quarantine records include reason, source position, rule ID, and replay path.
  • Quarantine does not silently corrupt aggregates.
  • Schema gates run in CI and runtime.
  • Golden datasets cover normal, malformed, duplicate, temporal, privacy, and backfill cases.
  • Backfill has stricter gates and approved campaign metadata.
  • Privacy/security failures fail closed.
  • Check engine errors are distinct from data failures.
  • Gate thresholds are justified and owned.
  • Quality result history is versioned.
  • Quality rules are reviewed like code.
  • Alerts include owner, affected asset, severity, and action.
  • Impact analysis runs on critical gate failures.
  • Runbooks exist for critical gates.

46. Mental Model Recap

Data quality checks answer:

Does the data violate an expectation?

Data quality gates answer:

Given this violation, what is the platform allowed to do?

The most important invariant:

Bad or uncertified data must not silently become trusted data.

The second most important invariant:

Quality gates must protect trust without making the platform unusably brittle.

That balance is what makes quality gates engineering, not just validation.


47. Further Reading

Use these as factual anchors:

  • Great Expectations / GX Core: https://greatexpectations.io/
  • Great Expectations docs: https://docs.greatexpectations.io/
  • Deequ: https://github.com/awslabs/deequ
  • Soda Core: https://github.com/sodadata/soda-core
  • OpenLineage: https://openlineage.io/docs/spec/
  • OpenTelemetry: https://opentelemetry.io/docs/

End of Part 068
