Deepen PracticeOrdered learning track

Batch Pipeline Architecture

Learn Java Data Pipeline Pattern - Part 049

Production-grade batch pipeline architecture for Java systems: partitioning, deterministic recompute, incremental load, late correction, backfill, reconciliation, and operational design.

17 min read3331 words
PrevNext
Lesson 4984 lesson track46–69 Deepen Practice
#java#data-pipeline#batch#recompute+3 more

Part 049 — Batch Pipeline Architecture

Batch pipeline is often underestimated because it looks simple:

read yesterday's data -> transform -> write output

That description hides the real system.

A production batch pipeline is not just a scheduled script. It is a deterministic recomputation system with explicit rules for:

  • what input range is included,
  • what output partition is owned by the job,
  • how prior output is replaced or corrected,
  • how partial writes are avoided,
  • how late data is handled,
  • how the job can be replayed,
  • how correctness is proven,
  • how downstream consumers know what changed.

If streaming pipelines are hard because data never stops, batch pipelines are hard because humans assume the boundary is obvious. It is not.

The main question in batch architecture is:

Can we recompute a known slice of truth safely, deterministically, and repeatedly without corrupting downstream state?

This part builds the mental model and implementation blueprint.


1. Batch Is Not Legacy

A weak mental model says:

streaming = modern
batch     = old

A better mental model:

streaming = continuous incremental computation
batch     = bounded deterministic recomputation

Batch remains essential when:

  • input arrives in files,
  • source APIs expose paginated snapshots,
  • historical correction is common,
  • regulatory reporting needs reproducibility,
  • cost requires scheduled compute instead of always-on compute,
  • model training needs complete windows,
  • reconciliation needs full-period comparison,
  • downstream reports expect closed accounting periods,
  • you need to rebuild derived state after a bug fix.

In serious systems, batch and streaming are not enemies. They are different ways to answer the same question under different constraints.

The mature design is not “choose batch or streaming”. It is:

Define the source of truth, then decide which computation mode produces which derived truth under which SLA.


2. The Core Batch Contract

Every batch job needs a contract.

A production batch contract answers these questions:

Job identity:
  What logical dataset does this job produce?

Input boundary:
  What input records are included?

Output boundary:
  What output records or partitions are owned by this run?

Time boundary:
  Which business time, event time, ingestion time, or source commit time is used?

Determinism:
  Will the same inputs and parameters produce the same outputs?

Commit protocol:
  When is output visible to consumers?

Correction policy:
  How are late, revised, deleted, or invalid records handled?

Reconciliation:
  How do we know the output is complete and accurate?

Without this contract, the job is not an engineered pipeline. It is a script with operational risk.


3. Batch Job as a State Machine

A batch job should be modeled as a state machine, not a one-shot command.

The key states:

StateMeaning
PLANNEDJob run has parameters but has not fixed input boundary.
INPUT_LOCKEDInput snapshot/range is fixed. Re-run must use same boundary.
RUNNINGTransform is executing and writing to staging.
VALIDATINGOutput exists but is not yet published. Quality/reconciliation checks run here.
COMMITTINGOutput is being promoted atomically.
PUBLISHEDConsumers may read this output.
UNKNOWNCommit may or may not have succeeded; recovery must inspect durable state.
FAILEDOutput must not be consumed unless explicitly marked partial/debug.

The dangerous state is UNKNOWN.

Many systems handle FAILED, but not UNKNOWN. UNKNOWN occurs when the process crashes, network disconnects, or orchestrator kills the container while committing output.

Production batch architecture needs a recovery algorithm, not hope.


4. The Batch Run Manifest

Every batch run should produce a manifest.

A manifest is the durable evidence of what the run intended to do, what it read, what it wrote, and what it published.

Example Java model:

public record BatchRunManifest(
    String runId,
    String jobName,
    String jobVersion,
    BatchMode mode,
    InputBoundary inputBoundary,
    OutputBoundary outputBoundary,
    Map<String, String> parameters,
    Instant plannedAt,
    Instant inputLockedAt,
    Instant startedAt,
    Instant finishedAt,
    List<InputDatasetSnapshot> inputs,
    List<OutputArtifact> outputs,
    List<QualityResult> qualityResults,
    ReconciliationResult reconciliation,
    BatchRunStatus status
) {}

The manifest must be stored in durable storage:

  • relational job ledger,
  • object storage metadata file,
  • lakehouse table metadata,
  • orchestration metadata plus independent audit table.

Do not rely only on orchestrator logs. Logs are operational traces; manifests are data correctness evidence.


5. Input Boundary: The Most Important Design Choice

Batch jobs fail silently when the input boundary is vague.

Bad boundary:

process yesterday's data

Better boundary:

process source events where event_date = 2026-07-03
using raw_dataset_snapshot_id = s-20260704-010000
and contract_version = 12
and reference_data_version = ref-20260703-v4

A batch input boundary can be based on different clocks:

Boundary TypeMeaningRisk
Event timeTime the business event happenedlate data can arrive after window closes
Ingestion timeTime platform captured recordbusiness period may be wrong
Source commit timeTime source DB committed transactionsource-specific and can be hard to expose
File arrival timeTime file landedfile delay changes completeness
Snapshot IDImmutable source/table snapshotbest for reproducibility when available
Cursor rangeAPI/database incremental cursormust handle updates and deletes carefully

The correct boundary depends on output semantics.

For regulatory reporting, business effective time may matter more than ingestion time. For operational freshness monitoring, ingestion time may matter more. For replay and debugging, snapshot ID is often the strongest evidence.


6. Output Boundary: What Does This Job Own?

Input boundary says what the job reads.

Output boundary says what the job is allowed to replace.

Examples:

Output boundary examples:

- report_date = 2026-07-03
- tenant_id = t-17 AND report_month = 2026-06
- dataset_version = v20260704_010000
- case_region = APAC AND effective_date between 2026-07-01 and 2026-07-03
- model_run_id = m-20260704-01

Without output ownership, retry/backfill can accidentally mix old and new records.

Bad sink pattern:

INSERT INTO daily_case_metrics
SELECT ...

Why dangerous?

  • retry duplicates rows,
  • partial run leaves partial data,
  • backfill mixes old and new version,
  • delete/correction is ambiguous.

Better pattern:

-- write to staging first
INSERT INTO daily_case_metrics_staging(run_id, report_date, metric_name, metric_value)
SELECT :run_id, report_date, metric_name, metric_value
FROM ...;

-- validate staging
-- then atomically replace owned partition
BEGIN;
DELETE FROM daily_case_metrics WHERE report_date = :reportDate;
INSERT INTO daily_case_metrics(report_date, metric_name, metric_value, run_id)
SELECT report_date, metric_name, metric_value, run_id
FROM daily_case_metrics_staging
WHERE run_id = :runId;
COMMIT;

For object storage/lakehouse, the equivalent is:

write files under staging/run_id=...
validate files
commit table snapshot or swap partition pointer
publish manifest

7. Partitioning Is a Correctness Boundary

Partitioning is not only a performance technique.

It defines the unit of:

  • recompute,
  • replacement,
  • reconciliation,
  • ownership,
  • retention,
  • downstream invalidation,
  • incident recovery.

Common partition strategies:

StrategyExampleGood ForMain Risk
Business datebusiness_date=2026-07-03reports, daily metricslate/corrected data
Ingestion dateingest_date=2026-07-04raw landing, operational auditbusiness period split
Tenant + datetenant_id=t1/date=...multi-tenant isolationsmall-file explosion
Region + dateregion=APAC/date=...operational routingre-org changes
Hash bucketbucket=17large uniform tableshard for business repair
Snapshot/run IDrun_id=...reproducible outputconsumer must select active run

The mistake is picking partitioning only for query speed.

A strong design asks:

If a correction arrives for one business day, can I recompute exactly the affected slice without rewriting unrelated truth?


8. Immutable Output vs Replace-in-Place

Batch output has two broad styles.

8.1 Immutable Versioned Output

Each run writes a new version:

s3://lake/case_metrics/version=20260704T010000/...

Consumers read through a pointer:

current version = 20260704T010000

Benefits:

  • easy rollback,
  • strong auditability,
  • supports diffing old vs new,
  • good for regulated reporting,
  • avoids destructive overwrite.

Costs:

  • more storage,
  • pointer management,
  • consumer must respect version pointer,
  • compaction/retention needed.

8.2 Replace Owned Partition

Job replaces a known partition:

dataset=case_metrics/report_date=2026-07-03

Benefits:

  • simpler consumer model,
  • lower storage cost,
  • works well for daily derived tables,
  • easier partition pruning.

Costs:

  • rollback needs previous copy/snapshot,
  • concurrent writers are dangerous,
  • partial replacement must be guarded.

8.3 Rule of Thumb

Use immutable versioned output when:

  • output is regulatory evidence,
  • rollback is frequent,
  • diff/review is required,
  • downstream consumers need reproducible versions.

Use partition replacement when:

  • output is derived operational analytics,
  • partition boundary is stable,
  • table format supports atomic commit,
  • validation happens before publish.

9. Full Recompute, Incremental Recompute, and Delta Apply

Batch architecture has three common compute modes.

9.1 Full Recompute

Read all relevant historical input and regenerate full output.

input: all cases since 2020
output: all monthly metrics since 2020

Best when:

  • data volume is manageable,
  • correctness matters more than cost,
  • transformation changed deeply,
  • historical correction affects many periods.

Risk:

  • expensive,
  • long runtime,
  • larger blast radius,
  • difficult to schedule often.

9.2 Incremental Recompute

Recompute affected partitions from raw truth.

input: raw events for business_date in [D-3, D]
output: metrics for those dates

Best when:

  • correction window is bounded,
  • raw data is retained,
  • output partitions align with correction boundary.

Risk:

  • missing late data outside lookback,
  • derived dependency can cross partition boundaries,
  • incremental logic can diverge from full logic.

9.3 Delta Apply

Apply changes to prior output.

previous metric + delta = new metric

Best when:

  • source emits reliable change events,
  • output is additive or ledger-like,
  • recompute is too costly.

Risk:

  • delta bugs compound,
  • correction/deletion is hard,
  • output becomes dependent on mutation history,
  • replay requires exact prior state.

Top-tier engineering bias:

Prefer recompute from immutable raw truth where possible. Use delta apply only when you can prove algebraic safety and reconciliation coverage.


10. Determinism Rules

A batch transform is deterministic when the same inputs and parameters produce the same outputs.

Sources of nondeterminism:

  • reading “latest” reference data without version pinning,
  • using now() inside transformation logic,
  • unordered aggregation with non-associative floating-point behavior,
  • random UUID generation for output identity,
  • relying on file listing order,
  • mutable external APIs during transform,
  • timezone defaults from host/container,
  • locale-dependent parsing,
  • concurrent side effects,
  • non-repeatable source query.

Bad:

record.put("generatedAt", Instant.now());
record.put("outputId", UUID.randomUUID().toString());

Better:

record.put("runId", runContext.runId());
record.put("computedAt", runContext.inputLockedAt());
record.put("outputId", stableHash(record.businessKey(), runContext.outputVersion()));

A production batch job should treat time, randomness, and reference data as explicit dependencies.


11. Reference Data Pinning

Batch jobs often enrich with reference data:

  • product catalog,
  • organization hierarchy,
  • regulatory rule set,
  • country/region mapping,
  • risk score threshold,
  • SLA calendar,
  • currency rate,
  • user-role snapshot.

If reference data is not pinned, backfill output changes unpredictably.

Example:

Case event happened on: 2026-06-10
Backfill executed on:   2026-07-04
Current region mapping: APAC-2
Historical mapping:     APAC-1

Which mapping should be used?

There is no universal answer. The contract must say.

Common policies:

PolicyMeaning
As-at event timeUse reference value effective when event happened.
As-at processing timeUse reference value at original processing time.
As-at report closeUse reference value at reporting cutoff.
LatestUse newest reference data. Dangerous unless explicitly intended.
Rule-version pinnedUse fixed rule/config version.

Implementation model:

public record ReferenceDataBinding(
    String name,
    String version,
    TemporalPolicy temporalPolicy,
    Instant effectiveAt,
    String checksum
) {}

The run manifest should record reference bindings.


12. Late Data and Correction Strategy

Late data is data that arrives after the batch window has been processed.

Correction is data that changes prior truth.

They are related but not identical.

Late data:
  event happened yesterday, arrived today

Correction:
  event arrived yesterday, value was wrong, fixed today

Deletion:
  record was previously valid, now removed or invalidated

Restatement:
  output for prior period is officially changed

A serious batch pipeline does not hide these under “rerun job”.

It defines correction classes:

ClassExampleAction
Within open windowevent for current day arrives lateinclude in next incremental run
Within lookback windowevent for D-3 arrives todayrecompute affected partition
Closed period correctionregulatory report already publishedcreate restatement version
Invalid historical datasource sends deletion/correctionapply correction event and audit
Unknown impactschema/business rule changedimpact analysis before backfill

13. Lookback Window Pattern

The lookback window pattern recomputes recent partitions repeatedly.

Example:

Daily job runs at 01:00.
For run date 2026-07-04, recompute business dates:

2026-07-01
2026-07-02
2026-07-03

This catches late arrivals within three days.

Benefits:

  • simple,
  • robust for bounded lateness,
  • avoids per-record correction complexity,
  • easy to reason about.

Risks:

  • does not catch late data outside window,
  • may repeatedly rewrite large partitions,
  • downstream consumers must tolerate restated partitions,
  • output should carry run_id and as_of metadata.

Rule:

Lookback windows are not correctness guarantees. They are operational approximations that must be backed by lateness metrics and reconciliation.


14. Incremental Cursor Pattern

Some batch jobs read incrementally from source:

SELECT *
FROM cases
WHERE updated_at > :lastWatermark
  AND updated_at <= :nextWatermark
ORDER BY updated_at, id;

This is simple but dangerous.

Problems:

  • timestamp precision collision,
  • source clock skew,
  • updates during scan,
  • missing rows with same timestamp,
  • deletes not visible,
  • updated_at not updated consistently,
  • out-of-order commits,
  • source transaction isolation differences.

Safer composite cursor:

SELECT *
FROM cases
WHERE (updated_at, id) > (:lastUpdatedAt, :lastId)
  AND (updated_at, id) <= (:nextUpdatedAt, :maxIdAtBoundary)
ORDER BY updated_at, id;

Better when available:

  • CDC log position,
  • immutable event table sequence,
  • source snapshot ID,
  • monotonically increasing transaction ID,
  • table format snapshot.

The principle:

Cursor must identify a stable ordered boundary, not just a convenient timestamp column.


15. Staging and Atomic Publish

Never write final output directly if consumers can see partial state.

The standard commit protocol:

1. Create run manifest with status = RUNNING.
2. Read input boundary.
3. Write output to staging path/table with run_id.
4. Validate staging output.
5. Reconcile staging output with input.
6. Publish atomically.
7. Mark manifest as PUBLISHED.
8. Notify/invalidate downstream consumers if needed.

Object storage example:

/staging/case_metrics/run_id=run-123/part-000.parquet
/staging/case_metrics/run_id=run-123/part-001.parquet

/published/case_metrics/business_date=2026-07-03/...

RDBMS example:

CREATE TABLE case_metrics_stage (..., run_id text not null);
CREATE TABLE case_metrics (..., active_run_id text not null);

Lakehouse example:

write data files -> create metadata -> commit snapshot -> publish snapshot ID

Atomic publish is the difference between “job completed” and “dataset is safe to consume”.


16. The Unknown Commit Problem

A batch job can crash while publishing.

Example:

DELETE old partition succeeded
INSERT new partition partially succeeded
process crashed before manifest update

Or:

Iceberg/Delta/Hudi commit succeeded
orchestrator timed out before seeing success

You cannot solve this with retry alone.

You need idempotent recovery:

public BatchRecoveryDecision recover(String runId) {
    BatchRunManifest manifest = manifestStore.load(runId);
    OutputInspection inspection = outputStore.inspect(manifest.outputBoundary(), runId);

    if (inspection.isPublishedForRun(runId)) {
        manifestStore.markPublished(runId, inspection.publishedVersion());
        return BatchRecoveryDecision.ALREADY_PUBLISHED;
    }

    if (inspection.hasPartialStagingOnly()) {
        outputStore.cleanupStaging(runId);
        manifestStore.markFailed(runId, "partial staging cleaned");
        return BatchRecoveryDecision.FAILED_CLEANED;
    }

    if (inspection.hasAmbiguousFinalOutput()) {
        manifestStore.markUnknown(runId, inspection.evidence());
        return BatchRecoveryDecision.MANUAL_REVIEW_REQUIRED;
    }

    return BatchRecoveryDecision.SAFE_TO_RETRY;
}

The recovery algorithm must inspect durable output, not memory state.


17. Reconciliation Patterns

Reconciliation proves the output is plausible or correct enough to publish.

Common checks:

CheckExample
Count reconciliationinput events count vs accepted + rejected + duplicate
Sum reconciliationfinancial amount totals match source control total
Hash/checksumstable hash over sorted business keys
Min/max boundaryevent time min/max within expected window
Partition completenessall expected tenant/date partitions exist
Duplicate checkno duplicate natural key in output
Referential checkall foreign keys resolved or explicitly quarantined
Drift checkcategorical distribution did not change unexpectedly
Freshness checksource snapshot not older than SLA

Example reconciliation result:

public record ReconciliationResult(
    long inputRecords,
    long acceptedRecords,
    long rejectedRecords,
    long duplicateRecords,
    Map<String, BigDecimal> inputControlTotals,
    Map<String, BigDecimal> outputControlTotals,
    List<ReconciliationViolation> violations,
    ReconciliationDecision decision
) {}

Important:

Reconciliation should happen before publish, but the result should be stored after publish as evidence.


18. Data Quality Gate Severity

Not every quality violation should fail the job.

A production system needs severity levels:

SeverityMeaningAction
INFOinformational metricpublish
WARNunusual but allowedpublish with warning
QUARANTINE_RECORDbad records isolatedpublish accepted subset
FAIL_PARTITIONone partition unsafedo not publish that partition
FAIL_RUNdataset unsafedo not publish run
MANUAL_REVIEWambiguous business impacthold output

Example:

public enum QualityAction {
    PUBLISH,
    PUBLISH_WITH_WARNING,
    QUARANTINE_RECORDS,
    FAIL_PARTITION,
    FAIL_RUN,
    MANUAL_REVIEW
}

The mistake is treating all validation failures as either ignore or fail everything.

Good systems route failures based on consumer impact.


19. Backfill Architecture

Backfill is not just “run old dates”.

A backfill changes historical output and can have downstream impact.

Backfill must define:

Why:
  bug fix, schema migration, source correction, new dataset, restatement

Scope:
  dataset, tenant, date range, entity range, version range

Input version:
  raw snapshot, reference data version, transform version

Output mode:
  shadow, compare-only, publish, restatement

Downstream behavior:
  notify, block, auto-refresh, manual approval

Validation:
  golden diff, reconciliation, sampled review, full parity check

Backfill modes:

ModeMeaning
Dry runcompute but do not write final output
Shadowwrite separate shadow output for diff
Compare-onlycompare old vs new without publish
Replacereplace owned partitions
Restatementpublish new official historical version
Migrationwrite output in new schema/version

A mature backfill is planned like a production release.


20. Dependency-Aware Backfill

Batch datasets often depend on each other.

If silver_cases is backfilled for June, downstream datasets may be stale.

A backfill planner needs impact analysis:

Changed dataset: silver_cases
Changed partitions: 2026-06-01..2026-06-30
Affected downstream:
  - case_daily_metrics same date range
  - case_sla_metrics same date range plus window spillover
  - executive_dashboard monthly June report

Window spillover matters.

If a 30-day rolling metric changes for June 15, the affected output may extend into July 14.

This is why output boundary must understand transformation semantics, not only table partitions.


21. Closed Period and Restatement

Some outputs become closed.

Examples:

  • regulatory report submitted,
  • monthly financial statement closed,
  • SLA breach count certified,
  • customer invoice generated,
  • enforcement decision summary archived.

For closed periods, replacing the old partition may be unacceptable.

Use restatement:

report_month=2026-06 version=1 status=SUBMITTED
report_month=2026-06 version=2 status=RESTATED reason=SOURCE_CORRECTION

Restatement output should include:

  • prior version reference,
  • reason code,
  • changed metrics,
  • input delta summary,
  • approver/reviewer if required,
  • timestamp and run ID,
  • downstream notification state.

In regulated environments, restatement is not an error path. It is a first-class lifecycle.


22. Small Files and Partition Explosion

Batch jobs often write too many small files.

Small-file symptoms:

  • slow listing,
  • slow query planning,
  • high metadata cost,
  • poor scan efficiency,
  • object-store request explosion,
  • downstream query instability.

Causes:

  • too many partitions,
  • too many tasks,
  • low data volume per tenant/date,
  • streaming-like writes into batch tables,
  • writing each source file independently,
  • no compaction.

Rules of thumb:

  • do not partition by high-cardinality fields unless query and repair boundaries require it,
  • avoid tenant/date partitioning for thousands of tiny tenants unless you bucket or group,
  • compact after high-frequency writes,
  • separate raw landing layout from curated analytical layout,
  • measure file count per partition as an SLO.

Example metric:

files_per_partition{dataset="case_metrics", partition="2026-07-03"} = 1842
average_file_size_mb = 2.1

This is not a cosmetic issue. File layout becomes a performance and cost contract.


23. Batch Pipeline Observability

A batch pipeline should emit metrics at run, partition, and record class level.

Run-level metrics:

batch_run_duration_seconds
batch_run_status
batch_run_input_records
batch_run_output_records
batch_run_rejected_records
batch_run_duplicate_records
batch_run_publish_duration_seconds
batch_run_reconciliation_status

Partition-level metrics:

partition_input_records{date="2026-07-03"}
partition_output_records{date="2026-07-03"}
partition_late_records{date="2026-07-03"}
partition_quality_violations{date="2026-07-03"}
partition_publish_status{date="2026-07-03"}

Data-level metrics:

null_rate(field="case_owner")
invalid_enum_count(field="case_status")
amount_sum(source="raw")
amount_sum(source="output")
duplicate_business_key_count

Important:

Batch observability is not just job success/failure. A successful job can produce wrong data.


24. Java Batch Job Skeleton

A minimal production-minded Java batch job has these components:

public interface BatchJob<P extends BatchParameters> {
    JobName name();
    JobVersion version();
    BatchPlan plan(P parameters);
    BatchRunResult execute(BatchRunContext context) throws Exception;
}

Plan:

public record BatchPlan(
    InputBoundary inputBoundary,
    OutputBoundary outputBoundary,
    List<ReferenceDataBinding> referenceData,
    List<ExpectedPartition> expectedPartitions,
    ValidationPolicy validationPolicy,
    PublishPolicy publishPolicy
) {}

Execution context:

public record BatchRunContext(
    String runId,
    JobName jobName,
    JobVersion jobVersion,
    BatchMode mode,
    Instant plannedAt,
    Instant inputLockedAt,
    BatchPlan plan,
    ManifestStore manifestStore,
    Metrics metrics
) {}

Runner:

public final class BatchRunner<P extends BatchParameters> {
    private final ManifestStore manifestStore;
    private final OutputPublisher outputPublisher;
    private final BatchRecoveryService recoveryService;

    public BatchRunResult run(BatchJob<P> job, P parameters) throws Exception {
        BatchPlan plan = job.plan(parameters);
        String runId = manifestStore.createRun(job.name(), job.version(), plan);

        try {
            manifestStore.markInputLocked(runId, plan.inputBoundary());
            manifestStore.markRunning(runId);

            BatchRunContext context = createContext(runId, job, plan);
            BatchRunResult result = job.execute(context);

            manifestStore.markValidating(runId);
            validate(result);

            manifestStore.markCommitting(runId);
            outputPublisher.publish(result);

            manifestStore.markPublished(runId, result.publishedVersion());
            return result;
        } catch (UnknownCommitException e) {
            manifestStore.markUnknown(runId, e.evidence());
            return recoveryService.recover(runId);
        } catch (Exception e) {
            manifestStore.markFailed(runId, e);
            throw e;
        }
    }
}

The skeleton is intentionally explicit. The goal is not to replace Spark/Airflow. The goal is to force correct boundaries even when Spark/Airflow execute the heavy work.


25. Batch Job Packaging Pattern

A batch job should separate:

job definition     -> what dataset is produced
planning           -> input/output boundaries
transform logic    -> pure computation
IO adapters        -> read/write implementations
publish protocol   -> commit visibility
validation         -> quality and reconciliation
operations         -> runbook, metrics, recovery

Suggested module layout:

case-metrics-batch/
  pom.xml
  src/main/java/com/company/pipeline/casemetrics/
    CaseMetricsJob.java
    CaseMetricsParameters.java
    CaseMetricsPlanner.java
    CaseMetricsTransform.java
    CaseMetricsInputReader.java
    CaseMetricsOutputWriter.java
    CaseMetricsPublisher.java
    CaseMetricsReconciliation.java
    CaseMetricsRunbook.md

Do not bury business transform logic inside orchestration glue.

Bad:

public static void main(String[] args) {
    // parse args, query DB, transform, write files, update manifest, send Slack
}

Better:

public final class CaseMetricsTransform {
    public Dataset<CaseMetric> transform(
        Dataset<CaseEvent> events,
        Dataset<SlaCalendar> calendars,
        TransformConfig config
    ) {
        // pure transformation boundary
    }
}

26. Batch and Orchestration

Airflow, Dagster, Argo, Jenkins, Kubernetes CronJob, and enterprise schedulers can all start batch work.

But orchestration does not replace batch semantics.

Wrong boundary:

Airflow task succeeded => data is correct

Correct boundary:

Airflow task succeeded
AND manifest status = PUBLISHED
AND reconciliation status = PASSED
AND output version is visible

Orchestration should manage control-flow:

  • schedule,
  • dependency order,
  • retries,
  • notifications,
  • resource allocation,
  • task state.

The batch job should manage data correctness:

  • input boundary,
  • transform version,
  • output boundary,
  • validation,
  • publish semantics,
  • manifest evidence.

This separation prevents orchestration metadata from becoming the only source of truth.


27. Batch Anti-Patterns

Anti-Pattern 1: DELETE + INSERT without Staging

Risk:

  • partial final output,
  • unknown commit,
  • consumer reads half-written data.

Fix:

  • staging table/path,
  • validate before publish,
  • atomic partition replacement or version pointer.

Anti-Pattern 2: “Yesterday” Hardcoded Inside Job

Risk:

  • rerun changes meaning,
  • timezone bugs,
  • impossible historical replay.

Fix:

  • explicit businessDate, inputSnapshot, runMode parameters.

Anti-Pattern 3: Incremental Logic Different from Backfill Logic

Risk:

  • historical output does not match daily output,
  • bug fixes behave differently.

Fix:

  • shared transform core,
  • mode-specific IO only.

Anti-Pattern 4: Latest Reference Data in Backfill

Risk:

  • historical results change unpredictably.

Fix:

  • reference data binding with temporal policy.

Anti-Pattern 5: Success Means No Exception

Risk:

  • silent data loss,
  • wrong output with green scheduler.

Fix:

  • reconciliation and quality gates.

Anti-Pattern 6: Backfill Without Impact Analysis

Risk:

  • downstream inconsistency,
  • dashboards/reporting mismatch,
  • unplanned restatement.

Fix:

  • lineage-aware backfill plan.

28. Production Checklist

Before approving a batch pipeline, ask:

  • What is the input boundary?
  • Is the input boundary reproducible?
  • What is the output boundary?
  • Can the job be safely retried?
  • Can a crash during commit be recovered?
  • Are outputs staged before publish?
  • Is publish atomic or versioned?
  • Does output include run_id or equivalent provenance?
  • Are reference datasets version-pinned?
  • Is transformation deterministic?
  • Are late data and corrections explicitly handled?
  • Is there a lookback or restatement policy?
  • Are quality gates defined by severity?
  • Are reconciliation results stored?
  • Can a backfill run in shadow mode?
  • Is downstream impact known for backfill?
  • Does the run manifest contain enough audit evidence?
  • Are small files and partition explosion monitored?
  • Can consumers know which output version is active?

If these questions are not answerable, the batch pipeline is not production-grade yet.


29. Case Study Mini-Example: Daily Enforcement SLA Metrics

Suppose we compute daily SLA metrics for enforcement cases.

Input:

raw_case_events
raw_assignment_events
raw_escalation_events
sla_calendar_snapshot

Output:

case_sla_daily_metrics
partitioned by business_date

Contract:

Input boundary:
  business_date in [run_date - 3, run_date - 1]
  raw snapshot as of run_date 01:00 Asia/Jakarta

Reference data:
  SLA calendar effective as of business date
  organization hierarchy effective as of event time

Output boundary:
  replace business_date partitions in lookback window

Quality:
  no negative durations
  all cases must map to one owning unit or quarantine
  accepted + quarantined = input relevant cases

Publish:
  write to staging
  validate
  atomically replace partitions
  publish manifest

Architecture:

This is not complicated because it uses exotic technology. It is robust because every boundary is explicit.


30. The Mental Model to Keep

A batch pipeline is a controlled recomputation machine.

The core loop is:

lock input boundary
compute deterministic output
validate output
publish atomically
record evidence
support safe replay/backfill

The strongest batch systems are boring in the right way:

  • explicit boundaries,
  • immutable evidence,
  • staged writes,
  • deterministic transforms,
  • pinned reference data,
  • reproducible backfills,
  • lineage-aware impact,
  • clear correction/restatement policy.

The weak version of batch is a cron script.

The strong version is a production truth factory.


References

  • Apache Spark documentation: Spark is a unified analytics engine with high-level APIs including Java and Spark SQL/DataFrames.
  • Apache Iceberg specification: table snapshots represent table state and are used to access the complete set of data files.
  • Apache Airflow documentation: DAGs model task dependencies and scheduling/control-flow, not per-record dataflow.
Lesson Recap

You just completed lesson 49 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.