Deepen PracticeOrdered learning track

Control Plane vs Data Plane

Learn Java Data Pipeline Pattern - Part 064

Control plane vs data plane architecture for internal Java data pipeline platforms: pipeline definitions, run state, scheduling, policy enforcement, lineage, execution workers, asset registry, self-service, governance, and production operating model.

16 min read3145 words
PrevNext
Lesson 6484 lesson track46–69 Deepen Practice
#java#data-pipeline#platform-engineering#control-plane+4 more

Part 064 — Control Plane vs Data Plane

A pipeline platform fails when every pipeline owns its own operational brain.

One job implements retries this way. Another stores checkpoints differently. A third has no run manifest. A fourth writes lineage only when it succeeds. A fifth sends Slack alerts from transform code. A sixth has manual backfill scripts known by one engineer.

This is not a platform.

It is a pile of jobs.

The architectural move is to separate:

Control plane = decides, records, governs, schedules, observes.
Data plane = reads, transforms, writes, and moves data.

This separation is not cosmetic.

It determines whether your organization can scale from ten pipelines to hundreds without losing correctness, ownership, auditability, and operational control.


1. The Mental Model

Think of a pipeline platform as two cooperating systems.

The control plane does not process every record.

The data plane does not decide global policy.

That is the boundary.


2. Why the Separation Exists

Without separation, business logic, operational policy, and execution mechanics become tangled.

Example bad Java job:

public static void main(String[] args) {
    readConfig();
    checkIfHoliday();
    queryDatabase();
    transformRows();
    validateOutput();
    writeParquet();
    updateRunStatus();
    sendSlackAlert();
    updateLineage();
    triggerDownstreamJobs();
}

This job knows too much.

It owns:

  • scheduling policy
  • business calendar
  • source integration
  • transformation
  • validation
  • output publishing
  • run state
  • notification
  • lineage
  • downstream triggering

That makes every job a platform clone.

A better split:

Control plane:
  - create run
  - enforce policy
  - pass run context
  - monitor state
  - publish lineage/asset state
  - trigger downstream

Data plane:
  - read assigned input
  - transform deterministically
  - write output through declared boundary
  - report progress and evidence

The job becomes smaller and safer.


3. Control Plane Responsibilities

The control plane owns metadata and decisions.

Core responsibilities:

ResponsibilityMeaning
Pipeline registryWhat pipelines exist and who owns them
Version registryWhich code/config/schema version is active
Run creationWhen a run exists and why
Scheduling/triggeringCron, asset events, manual, SLA, webhook
Policy enforcementAccess, data classification, approval, quality gate
Run statePlanned/running/succeeded/failed/cancelled/superseded
Asset stateFresh/stale/invalid/partial/degraded
LineageInputs, outputs, run, code, schema, transform version
Backfill controlWhat historical range is allowed and isolated
Retry/rerun controlWhat can be retried safely
ObservabilityMetrics, alerts, logs correlation, dashboard
GovernanceOwnership, retention, sensitive-data rules, audit evidence
Self-serviceDeveloper-facing API/UI/scaffold

The control plane is the operational memory of the platform.

If the control plane does not know it happened, it is not operationally real.


4. Data Plane Responsibilities

The data plane owns execution.

Core responsibilities:

ResponsibilityMeaning
Source readPull or receive input records/assets
TransformApply deterministic or declared stateful logic
Sink writeWrite output/effects using boundary contracts
Local resource managementCPU, memory, parallelism, batching, backpressure
Worker lifecyclestartup, graceful shutdown, heartbeat
Checkpointingengine-specific progress and recoverability
Runtime metricsrecords, bytes, lag, error, latency
Evidence emissionrun facts, quality facts, output facts
Failure classificationretryable/non-retryable/unknown outcome

The data plane must be allowed to move fast.

But it should not be allowed to invent platform semantics.


5. What Belongs Where

A useful rule:

If the decision affects multiple pipelines or consumers, it belongs in the control plane.
If the decision is about processing records inside one run, it belongs in the data plane.

Examples:

ConcernPlane
Transforming CaseUpdated into CaseSnapshotData plane
Deciding whether CaseSnapshot is publishableControl plane + quality policy
Reading Kafka partition offsetsData plane
Recording which offsets were processed by a runControl plane metadata
Flink keyed stateData plane
Asset freshness statusControl plane
Spark executor sizingData plane runtime/config
Maximum backfill range allowed for PII assetControl plane policy
DLQ item serializationData plane
DLQ replay approvalControl plane
Retry transient API timeoutData plane/boundary policy
Retry failed regulatory report publicationControl plane decision

6. Pipeline Definition Model

A pipeline definition should be declarative enough for the control plane to reason about it.

Example:

pipeline: case-daily-snapshot
owner: enforcement-data-platform
runtime: java-batch
entrypoint: com.acme.pipeline.case.snapshot.CaseDailySnapshotJob
version: 9.4.1
inputs:
  - asset: silver.case
    freshness_required: PT2H
    mode: read_snapshot
  - asset: silver.case_assignment
    freshness_required: PT2H
    mode: read_snapshot
outputs:
  - asset: gold.case_daily_snapshot
    mode: replace_partition
    partition_key: business_date
contracts:
  output_schema: case_daily_snapshot@3.2.0
  quality_contract: case_daily_snapshot_quality@2.1.0
schedule:
  type: cron
  expression: "0 2 * * *"
  timezone: Asia/Singapore
policies:
  data_classification: restricted
  requires_quality_gate: true
  max_backfill_days_without_approval: 31
  publish_requires_reconciliation: true
runtime_config:
  memory: 4Gi
  max_parallelism: 8
  timeout: PT2H

This definition is not only for documentation.

The control plane can use it to:

  • create runs
  • check dependencies
  • enforce access
  • validate schema compatibility
  • determine rerun scope
  • create lineage edges
  • publish asset state
  • display ownership
  • constrain backfill

7. Run Model

A run is a first-class entity.

Do not rely only on logs.

public record PipelineRun(
    String runId,
    String pipelineId,
    String pipelineVersion,
    RunReason reason,
    RunStatus status,
    Instant plannedAt,
    Optional<Instant> startedAt,
    Optional<Instant> finishedAt,
    Map<String, String> parameters,
    List<InputAssetRef> inputs,
    List<OutputAssetRef> expectedOutputs,
    Optional<String> parentRunId,
    Optional<String> approvalId
) {}

Run status should be explicit:

Do not collapse Succeeded and Published.

A data plane job may finish successfully while quality gate or publish gate fails.


8. Run Context Passed to Data Plane

Every job should receive a run context.

{
  "runId": "run-20260704-000183",
  "pipelineId": "case-daily-snapshot",
  "pipelineVersion": "9.4.1",
  "attempt": 1,
  "reason": "SCHEDULED",
  "dataIntervalStart": "2026-07-03T00:00:00+08:00",
  "dataIntervalEnd": "2026-07-04T00:00:00+08:00",
  "inputAssets": [
    {"asset": "silver.case", "version": "snapshot-9012"}
  ],
  "outputAssets": [
    {"asset": "gold.case_daily_snapshot", "partition": "business_date=2026-07-03"}
  ],
  "policy": {
    "qualityGate": "blocking",
    "piiLoggingAllowed": false
  }
}

The job should not infer this from wall-clock time.

Use the control plane's assigned data interval and input asset versions.

That makes reruns deterministic.


9. Asset Registry

The asset registry is the control plane's model of outputs.

An asset is not just a table name.

public record DataAsset(
    String assetId,
    String name,
    AssetKind kind,
    String owner,
    DataClassification classification,
    SchemaRef schema,
    QualityContractRef qualityContract,
    RetentionPolicy retentionPolicy,
    List<AssetDependency> dependencies,
    AssetFreshnessPolicy freshnessPolicy
) {}

Asset state:

public enum AssetState {
    UNKNOWN,
    MATERIALIZING,
    FRESH,
    STALE,
    INVALID,
    PARTIAL,
    DEGRADED,
    DEPRECATED
}

The registry answers:

  • Who owns this asset?
  • What schema does it promise?
  • Is it fresh enough?
  • Which run produced the current version?
  • Which inputs created it?
  • What downstream assets depend on it?
  • Is it allowed to contain PII?
  • Is it publishable?

10. Data Plane Execution Modes

The platform should support multiple data plane modes.

ModeGood ForControl Plane Role
Java batch workerbounded jobs, custom integrationcreate run, dispatch command, capture output
Java streaming servicecontinuous Kafka consumersdeploy config, monitor lag, manage version rollout
Kafka Streams appKafka-native stream/table processingregister topology, monitor state, manage reset/replay
Flink jobstateful event-time stream processingsubmit job, manage savepoint, observe checkpoint/state
Spark joblarge batch/micro-batch processingsubmit job, track application, validate outputs
Airflow taskworkflow-level task executionorchestrate tasks, pass run context
Temporal workerdurable workflow and side-effect controlstart workflow, observe workflow history

A mature platform does not force one runtime for everything.

It standardizes the contracts around all runtimes.


11. Control Plane API

A pipeline control plane usually needs APIs like:

POST /pipelines
GET  /pipelines/{pipelineId}
POST /pipelines/{pipelineId}/runs
GET  /runs/{runId}
POST /runs/{runId}/cancel
POST /runs/{runId}/retry
POST /runs/{runId}/approve
GET  /assets/{assetId}
GET  /assets/{assetId}/lineage
POST /assets/{assetId}/materializations
POST /quality-results
POST /lineage-events

Internal Java model:

public interface PipelineControlPlane {
    PipelineRun createRun(CreateRunCommand command);
    void markRunStarted(String runId, RunStartedEvidence evidence);
    void reportProgress(String runId, RunProgress progress);
    void publishQualityResult(String runId, QualityResult result);
    void publishMaterialization(String runId, AssetMaterialization materialization);
    void markRunFinished(String runId, RunOutcome outcome);
}

The API should be boring and stable.

The platform should not require every pipeline to invent a new metadata protocol.


12. Worker Contract

Every data plane worker should implement a consistent contract.

Input:

  • run context
  • pipeline version
  • input asset references
  • output target references
  • policy bundle
  • secrets references
  • resource limits

Output:

  • status
  • metrics
  • materialization evidence
  • quality evidence
  • lineage evidence
  • error classification
  • checkpoint/effect references

Worker result example:

{
  "runId": "run-20260704-000183",
  "status": "SUCCEEDED",
  "outputs": [
    {
      "asset": "gold.case_daily_snapshot",
      "partition": "business_date=2026-07-03",
      "recordCount": 188392,
      "schemaVersion": "case_daily_snapshot@3.2.0",
      "storageRef": "iceberg://warehouse/gold/case_daily_snapshot@snapshot-9182"
    }
  ],
  "quality": {
    "status": "PASSED",
    "checksPassed": 41,
    "checksFailed": 0
  },
  "metrics": {
    "durationMs": 482000,
    "recordsRead": 190104,
    "recordsWritten": 188392
  }
}

This result is more valuable than a log line saying job complete.


13. Policy Engine

The control plane needs policy enforcement.

Policy examples:

policy: restricted-asset-publish
rules:
  - output.classification in [restricted, confidential] requires quality_gate == passed
  - output.classification == restricted requires lineage_complete == true
  - backfill.days > 31 requires approval
  - pii_logging_allowed must be false
  - source.owner must approve schema breaking change
  - consumer_count > 0 requires deprecation_notice before deletion

Policy belongs in the control plane because it affects more than one job.

The data plane can enforce local policy checks, but the canonical decision should be centralized.


14. Quality Gate as Control Plane Decision

Quality checks often run in the data plane.

But publish decision should be visible in the control plane.

This allows consistent behavior:

  • fail publish on critical quality breach
  • allow degraded publish for low-severity warning
  • require approval for override
  • record why output was blocked or allowed

15. Lineage and Evidence

Lineage should be emitted by the platform, not manually reconstructed from code.

Minimum lineage event:

public record LineageEvent(
    String runId,
    String jobName,
    String jobVersion,
    List<DatasetRef> inputs,
    List<DatasetRef> outputs,
    Instant eventTime,
    Map<String, String> facets
) {}

Useful facets:

  • schema version
  • transform version
  • source offsets
  • source snapshots
  • row counts
  • data quality summary
  • code git SHA
  • container image digest
  • environment
  • owner
  • classification

For regulated systems, lineage is evidence.

It answers:

Why did this report say what it said on that day?

16. Scheduling in the Control Plane

The scheduler should create runs.

It should not directly mutate business data.

Run creation reasons:

public enum RunReason {
    SCHEDULED,
    ASSET_TRIGGERED,
    MANUAL,
    BACKFILL,
    RETRY,
    REPROCESSING,
    CORRECTION,
    DEPENDENCY_REFRESH,
    SLA_RECOVERY
}

This matters because different reasons imply different policies.

A manual backfill over two years of restricted data may require approval.

A retry of a transient API timeout may not.

A correction run may need to supersede prior output.

A scheduled run may be blocked by upstream freshness.


17. Backfill Control Plane

Backfill is where platform governance becomes necessary.

A backfill request should include:

public record BackfillRequest(
    String pipelineId,
    LocalDate startDate,
    LocalDate endDate,
    String requestedBy,
    String reason,
    BackfillMode mode,
    boolean isolateOutputs,
    Optional<String> approvalId
) {}

Backfill policy questions:

  • Is the date range allowed?
  • Are inputs still retained?
  • Are schemas compatible historically?
  • Will output overwrite current data?
  • Should output go to isolated namespace?
  • Are downstream consumers auto-triggered?
  • Is approval required?
  • What cost limit applies?

The data plane executes chunks.

The control plane owns the backfill campaign.


18. Retry and Rerun Control

Retry is not always safe.

The control plane should know the difference between:

ActionMeaning
Retry attemptSame run, same inputs, same output target, transient failure
RerunNew run for same logical interval/output
ReprocessRecompute with possibly new code or contract
BackfillHistorical range run
CorrectionApply correction semantics and supersede prior output
ReplayRe-read event log from offsets/time

Each action has different safety rules.

Example:

Retry Java task after DB timeout: maybe safe.
Retry side-effect notification: safe only if idempotency/effect ledger exists.
Rerun replace-partition output: safe if staged publish is atomic.
Replay Kafka topic into search index: safe if index write is version-aware.
Reprocess regulatory report: requires supersession record.

The control plane should encode this.


19. Continuous Streaming Jobs

Batch jobs map naturally to runs.

Streaming jobs are different.

A streaming job may run for weeks.

Control plane concepts:

  • deployment version
  • job instance
  • checkpoint/savepoint
  • input lag
  • watermark lag
  • error rate
  • state size
  • current topology version
  • restart history
  • upgrade history

A streaming job can still emit logical materialization events:

case_breach_alert_stream processed up to event_time=2026-07-04T10:00:00Z
silver.case_current updated through source_lsn=18499200

Do not force streaming into fake daily run semantics.

Use:

job lifecycle state + periodic progress/materialization state

20. Versioning Across Planes

A pipeline output depends on many versions:

  • pipeline definition version
  • code version
  • transform version
  • schema version
  • quality contract version
  • input asset version
  • runtime image version
  • config version
  • reference data version
  • platform policy version

The control plane should record them.

Example materialization:

{
  "asset": "gold.case_daily_snapshot",
  "assetVersion": "snapshot-9182",
  "runId": "run-20260704-000183",
  "codeVersion": "git:1a2b3c4",
  "imageDigest": "sha256:...",
  "schemaVersion": "case_daily_snapshot@3.2.0",
  "qualityContractVersion": "2.1.0",
  "transformVersion": "9.4.1",
  "inputVersions": [
    {"asset": "silver.case", "version": "snapshot-9012"}
  ]
}

This enables reproducibility.


21. Platform Database Model

A minimal control plane can start with tables like:

CREATE TABLE pipeline_definition (
  pipeline_id       text PRIMARY KEY,
  name              text NOT NULL,
  owner             text NOT NULL,
  runtime           text NOT NULL,
  active_version    text NOT NULL,
  definition_json   jsonb NOT NULL,
  created_at        timestamptz NOT NULL,
  updated_at        timestamptz NOT NULL
);

CREATE TABLE pipeline_run (
  run_id            text PRIMARY KEY,
  pipeline_id       text NOT NULL,
  pipeline_version  text NOT NULL,
  reason            text NOT NULL,
  status            text NOT NULL,
  data_interval_start timestamptz,
  data_interval_end   timestamptz,
  parameters_json   jsonb NOT NULL,
  created_at        timestamptz NOT NULL,
  started_at        timestamptz,
  finished_at       timestamptz,
  parent_run_id     text,
  approval_id       text
);

CREATE TABLE asset_materialization (
  materialization_id text PRIMARY KEY,
  asset_id           text NOT NULL,
  asset_version      text NOT NULL,
  run_id             text NOT NULL,
  status             text NOT NULL,
  storage_ref        text,
  schema_version     text,
  record_count       bigint,
  data_interval_start timestamptz,
  data_interval_end   timestamptz,
  materialized_at    timestamptz NOT NULL,
  metadata_json      jsonb NOT NULL
);

Add more only when needed.

The important point is not the exact schema.

The important point is that control-plane facts are persisted as first-class state.


22. Control Plane State Transitions

Use compare-and-swap style updates for run state.

Bad:

UPDATE pipeline_run SET status = 'SUCCEEDED' WHERE run_id = ?;

Better:

UPDATE pipeline_run
SET status = 'SUCCEEDED', finished_at = now()
WHERE run_id = ?
  AND status = 'RUNNING';

If no row is updated, your worker is stale or the run was cancelled.

This prevents accidental transitions like:

CANCELLED -> SUCCEEDED
FAILED -> RUNNING
PUBLISHED -> FAILED

Model the state machine explicitly.


23. Dispatch Model

The control plane can dispatch work in several ways:

DispatchGood ForTrade-Off
HTTP call to workersimple servicestimeout/availability coupling
Queue commandasync workersneed command dedupe/lease
Kubernetes jobbatch isolationstartup overhead
Airflow DAG/taskworkflow ecosystemAirflow state must be reconciled
Spark submitlarge batchexternal app tracking needed
Flink job submitstreaming/statefulsavepoint/upgrade lifecycle
Temporal workflowdurable processdeterministic workflow constraints

A dispatch command should be idempotent.

public record DispatchCommand(
    String commandId,
    String runId,
    String pipelineId,
    String runtime,
    String workerRef,
    RunContext runContext
) {}

If the control plane retries dispatch after timeout, it must not start duplicate unsafe runs.


24. Heartbeats and Leases

For long-running workers, use heartbeat/lease.

public record WorkerHeartbeat(
    String runId,
    String workerId,
    Instant heartbeatAt,
    RunProgress progress,
    Map<String, String> runtimeStats
) {}

Lease model:

UPDATE pipeline_run
SET leased_by = ?, lease_expires_at = now() + interval '2 minutes'
WHERE run_id = ?
  AND status IN ('READY', 'RUNNING')
  AND (lease_expires_at IS NULL OR lease_expires_at < now());

Leases prevent two workers from owning the same run.

But leases alone do not protect external side effects.

Use fencing tokens for write boundaries where needed.


25. Fencing Tokens

A fencing token prevents stale workers from committing after losing ownership.

Example:

public record RunLease(
    String runId,
    String workerId,
    long fencingToken,
    Instant expiresAt
) {}

External publish operation includes token:

UPDATE asset_publish_pointer
SET asset_version = ?, fencing_token = ?
WHERE asset_id = ?
  AND fencing_token < ?;

If stale worker tries to publish with old token, update fails.

This matters in distributed systems because cancellation and restart are not instantaneous.


26. Self-Service Platform API

The platform should make the right path easy.

A developer should be able to define:

  • source
  • transform
  • sink
  • contract
  • schedule
  • quality gate
  • ownership
  • data classification

without hand-writing all operational glue.

Self-service does not mean no governance.

It means governance is embedded.

Example developer flow:

pipeline init case-daily-snapshot --runtime java-batch
pipeline add-input silver.case
pipeline add-output gold.case_daily_snapshot --mode replace-partition
pipeline add-quality-contract case_daily_snapshot_quality.yaml
pipeline deploy --env staging
pipeline run --date 2026-07-03
pipeline promote --env prod

Behind the scenes, the control plane creates:

  • pipeline definition
  • CI checks
  • schema compatibility gate
  • deployment metadata
  • run configuration
  • lineage hooks
  • standard alerts
  • standard dashboards

27. Control Plane vs Frameworks

Do not confuse a framework with the control plane.

Airflow can be part of a control plane, but Airflow alone may not be your full platform model.

Flink has job management and checkpointing, but Flink is primarily data plane for stream processing.

Kafka has consumer group state, but Kafka is not an asset registry.

Temporal has durable workflow history, but Temporal does not automatically know data contracts or lineage.

Spark has application tracking, but Spark does not decide whether an asset is publishable.

The platform may compose them:

The control plane is the system that makes cross-runtime decisions.


28. Platform Evolution Stages

Most organizations do not build the final platform first.

A practical evolution:

Stage 1 — Standard Library

Provide Java libraries for:

  • run context
  • metrics
  • logging
  • retry policy
  • boundary adapters
  • quality result emission
  • lineage emission

Stage 2 — Run Registry

Persist:

  • pipeline definitions
  • run state
  • materialization facts
  • quality results
  • basic lineage

Stage 3 — Controlled Scheduling

Move from ad hoc cron to:

  • run creation
  • dependency checks
  • backfill tracking
  • retry/rerun semantics

Stage 4 — Policy Enforcement

Add:

  • data classification
  • approval workflow
  • schema gate
  • quality gate
  • publish gate

Stage 5 — Self-Service Platform

Add:

  • API/UI
  • templates
  • scaffolding
  • ownership model
  • cost controls
  • consumer registry
  • impact analysis

Do not start with a huge platform rewrite.

Start by making metadata and run semantics explicit.


29. Regulatory Enforcement Case Study

Consider a regulatory enforcement data platform.

Data sources:

  • case management database
  • assignment history
  • breach events
  • decision records
  • document metadata
  • external regulator submissions

Outputs:

  • case current state
  • SLA breach alert stream
  • daily enforcement report
  • investigator workload dashboard
  • audit evidence bundle
  • external submission status projection

Control plane responsibilities:

  • enforce restricted data policy
  • prevent report publish if quality checks fail
  • track lineage from operational case changes to report rows
  • require approval for correction backfills over closed reporting periods
  • record which run superseded prior report output
  • manage asset freshness for dashboards
  • block downstream gold assets when silver canonical case is invalid

Data plane responsibilities:

  • consume CDC events
  • run Flink breach detection
  • run Spark daily snapshot
  • write Iceberg tables
  • update search projection
  • emit quality metrics
  • emit lineage events

Architecture:

This is where the control-plane/data-plane separation pays off.

The organization can prove:

  • what ran
  • why it ran
  • which data it used
  • which version of code produced output
  • which quality gates passed
  • who approved exceptional actions
  • which downstream assets were affected

That is regulatory defensibility.


30. Anti-Patterns

Anti-Pattern 1: Every Pipeline Has Its Own Control Plane

Each job stores status differently, retries differently, and alerts differently.

This does not scale.

Anti-Pattern 2: Control Plane Processes Records

A central service tries to transform every record for every pipeline.

This becomes bottleneck and coupling point.

Keep high-volume processing in the data plane.

Anti-Pattern 3: Data Plane Decides Governance

A Spark job decides to ignore quality failure and publish anyway.

That decision must be visible and governed.

Anti-Pattern 4: No Run ID Everywhere

Without run id propagation, debugging becomes archaeology.

Every log, metric, lineage event, materialization, and quality result should carry run context where applicable.

Anti-Pattern 5: Wall-Clock Driven Jobs

A job computes “yesterday” internally.

Reruns become ambiguous.

The control plane should pass explicit data intervals.

Anti-Pattern 6: One Runtime to Rule Them All

Forcing every pipeline into one engine creates unnatural designs.

Use standard contracts across runtimes instead.

Anti-Pattern 7: Platform Metadata as Afterthought

Lineage and quality are emitted only on success or only from some jobs.

Then impact analysis is incomplete exactly when needed most.

Anti-Pattern 8: No Publish State

Job success is treated as asset freshness.

But a job can succeed and publish can fail.

Model materialization and publication explicitly.


31. Implementation Blueprint

A minimal internal platform can be implemented incrementally.

31.1 Java Library

Provide:

pipeline-runtime-core
pipeline-boundary-core
pipeline-quality-core
pipeline-lineage-client
pipeline-controlplane-client
pipeline-testkit

31.2 Control Plane Service

Build a Java service with:

  • pipeline definition API
  • run API
  • asset API
  • materialization API
  • quality result API
  • lineage sink integration
  • policy evaluation
  • backfill campaign API

31.3 Worker SDK

Every job uses:

public final class PipelineJobMain {
    public static void main(String[] args) {
        RunContext context = RunContextLoader.fromArgs(args);
        PipelineControlClient control = PipelineControlClient.fromEnv();

        control.markStarted(context.runId());

        try {
            JobResult result = new CaseDailySnapshotJob().run(context);
            control.publishQualityResult(context.runId(), result.quality());
            control.publishMaterialization(context.runId(), result.materialization());
            control.markSucceeded(context.runId(), result.outcome());
        } catch (PipelineException ex) {
            control.markFailed(context.runId(), FailureReport.from(ex));
            throw ex;
        }
    }
}

31.4 Metadata First

Before building a UI, make sure the platform records:

  • run state
  • input/output assets
  • materialization
  • quality result
  • lineage event
  • failure classification

A small truthful platform is better than a large dashboard over unreliable metadata.


32. Production Checklist

Before calling something a pipeline platform, verify:

Control Plane

  • Pipeline definitions are versioned.
  • Runs are first-class records.
  • Run state transitions are guarded.
  • Data intervals are explicit.
  • Backfills are tracked as campaigns.
  • Asset materializations are recorded.
  • Asset freshness is computed from materializations.
  • Quality gate decisions are visible.
  • Policy decisions are auditable.
  • Ownership is explicit.
  • Sensitive-data classification is explicit.
  • Lineage is emitted consistently.

Data Plane

  • Jobs receive run context.
  • Jobs do not infer logical dates from wall-clock time.
  • Boundary adapters classify failures.
  • Sink writes are idempotent or reconciled.
  • Metrics include run id and boundary names.
  • Logs avoid sensitive payloads.
  • Workers support graceful shutdown.
  • Streaming jobs expose lag/watermark/state metrics.
  • Batch jobs produce materialization evidence.

Cross-Plane

  • Dispatch is idempotent.
  • Heartbeats exist for long-running work.
  • Fencing prevents stale publish.
  • Retry/rerun/reprocess/backfill are distinct.
  • Publish state is separate from job success.
  • Manual overrides are recorded.
  • Downstream impact can be queried.
  • Operators have runbooks.

33. Mental Model Summary

The control plane is the platform's memory and governance brain.

The data plane is the execution muscle.

Do not put all memory in logs. Do not put governance in transform code. Do not put record processing in the control service. Do not let every runtime invent its own semantics.

The invariant:

The data plane may vary by engine.
The control-plane contract must remain stable.

That is how a Java data pipeline platform grows without becoming operationally incoherent.


34. What Comes Next

Part 065 begins the reliability and operational excellence phase.

We will define pipeline SLOs in practical terms:

  • freshness
  • completeness
  • accuracy
  • availability
  • cost
  • recovery time
  • replayability
  • auditability

This is where pipeline engineering moves from implementation patterns to operating promises.

Lesson Recap

You just completed lesson 64 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.