Deepen PracticeOrdered learning track

Backfill and Reprocessing

Learn Java Data Pipeline Pattern - Part 055

Backfill and reprocessing as deterministic production operations: replay window, transformation version, side-effect boundaries, validation, rollout, and recovery.

21 min read4155 words
PrevNext
Lesson 5584 lesson track46–69 Deepen Practice
#java#data-pipeline#backfill#reprocessing+5 more

Part 055 — Backfill and Reprocessing

Backfill is not “run the job again.”
Backfill is a controlled attempt to rewrite history without lying about what happened.

A junior pipeline treats backfill as a script.

A serious pipeline treats backfill as a production change operation with identity, scope, ownership, isolation, validation, rollback, cost control, and audit evidence.

The mental model:

Backfill = deterministic recomputation + bounded mutation + evidence trail

A backfill is safe only when we can answer:

  1. What exact input range is being replayed?
  2. What transformation version is being used?
  3. What target state is allowed to change?
  4. What side effects are forbidden, mocked, deduplicated, or compensated?
  5. How do we prove the output is correct before publication?
  6. How do we undo or supersede the result if it is wrong?

This part is deliberately operational. Many engineers can write a Spark job, Kafka consumer, or Java batch runner. Fewer can run a 3-year historical reprocessing operation without corrupting reporting, alerting, downstream caches, regulatory audit trails, or customer-visible state.


1. The Problem Backfill Actually Solves

Backfill exists because the first output was incomplete, wrong, or no longer sufficient.

Common causes:

CauseExampleRequired response
New derived fieldAdd case_age_days to enforcement analyticsRecompute historical rows
Buggy transformationWrong timezone conversionRestate affected partitions
Missing source recordsAPI outage for two daysIngest missing window and recompute downstream
Late/corrected dataCase decision effective date amendedApply correction and restate impacted aggregates
Schema evolutionNew event version introduces better semanticsDual-read old/new versions and republish canonical output
New downstream productML feature store needs 5 years of historyHistorical feature generation
Compliance requirementNeed reproducible audit evidenceRebuild from immutable source and preserve run manifest
MigrationMove from old warehouse table to IcebergOne-time bulk load plus verification

Backfill is not exceptional. In data systems, it is part of the normal lifecycle.

The weak assumption is: “pipeline output is final.”

A more accurate assumption is:

Pipeline output is a versioned claim produced from specific input, code, configuration, and reference data.

If any of those change, output may need to be recomputed.


2. Vocabulary: Backfill, Replay, Reprocessing, Restatement

These terms are often mixed. Keep them separate.

TermMeaningExample
ReplayRead old events again from a log/sourceReset Kafka offset to 2025-01-01
BackfillProduce missing or newly required historical outputGenerate case_sla_snapshot for 2023–2025
ReprocessingRe-run transformation over existing inputRecompute risk score after bug fix
RestatementPublish corrected output that supersedes previous outputReplace April 2026 regulatory report partition
RepairFix a bounded corrupted subsetReprocess records for tenant T-41 only
Migration backfillPopulate a new target from an old sourceLoad new Iceberg table from warehouse history
BootstrapInitial load before streaming beginsSnapshot DB table, then switch to CDC
ReconciliationCompare source and target to detect mismatchCount/hash check by partition

The most important distinction:

Replay reads old input.
Backfill creates or changes historical output.
Restatement changes the accepted truth for a period/product.

A replay that writes to a non-idempotent sink can corrupt output. A restatement without evidence can break regulatory trust.


3. Backfill as a State Machine

A production backfill should have lifecycle state.

Avoid a design where backfill is only a terminal command:

java -jar pipeline.jar --from 2023-01-01 --to 2023-12-31

That command has no durable memory of what was intended, what ran, what version ran, what changed, or whether the output was accepted.

A serious backfill has a run manifest.


4. The Backfill Run Manifest

The run manifest is the contract between engineering, data owners, and operations.

Minimum fields:

{
  "runId": "bf-2026-07-04-case-sla-v3-001",
  "kind": "BACKFILL",
  "requestedBy": "data-platform",
  "approvedBy": "regulatory-reporting-owner",
  "reason": "Fix SLA breach calculation after timezone bug",
  "source": {
    "type": "ICEBERG_TABLE",
    "name": "bronze.case_events",
    "snapshotId": "8392012349012",
    "range": {
      "eventDateFrom": "2025-01-01",
      "eventDateTo": "2025-12-31"
    }
  },
  "transform": {
    "name": "case-sla-transform",
    "version": "3.1.0",
    "gitCommit": "8d91ab7",
    "containerImage": "registry/company/case-sla:3.1.0",
    "configHash": "sha256:...",
    "referenceDataVersion": "calendar-v2026-06-30"
  },
  "target": {
    "type": "ICEBERG_TABLE",
    "name": "silver.case_sla_daily",
    "publishMode": "REPLACE_PARTITIONS",
    "partitions": ["2025-01", "2025-02", "...", "2025-12"]
  },
  "safety": {
    "sideEffectsAllowed": false,
    "maxRowsExpected": 500000000,
    "maxCostUsd": 1200,
    "requiresManualPublish": true
  },
  "validation": {
    "countTolerance": 0,
    "checksumRequired": true,
    "qualityGate": "case-sla-v3-gate"
  }
}

The manifest should be immutable after approval. If scope changes, create a new manifest version.

The manifest gives you audit-grade answers:

QuestionManifest answer
Why did historical numbers change?reason
Who approved it?approvedBy
What input was used?source snapshot/range
What code generated it?transform version/git SHA/image
What output changed?target partitions/table
Was it validated?validation section + result
Was it allowed to cause side effects?sideEffectsAllowed

5. Core Invariants

Backfill correctness depends on invariants.

5.1 Input scope must be explicit

Bad:

Reprocess all old data.

Good:

Reprocess case events with event_date >= 2025-01-01 and event_date < 2026-01-01 from Iceberg snapshot 8392012349012.

If the input scope is not explicit, the output cannot be defended.

5.2 Transformation must be versioned

A backfill must not run with “whatever code is currently deployed.”

It should run with:

  • transform name
  • semantic version
  • Git commit
  • container image digest
  • config hash
  • dependency lock
  • reference data version
  • schema version set

5.3 Output mutation must be bounded

The target mutation must say exactly what it can change:

  • append only
  • replace partitions
  • merge by business key
  • write new versioned table
  • write shadow table only
  • publish pointer swap

Avoid unrestricted mutation:

DELETE FROM target;
INSERT INTO target SELECT ...;

That is not a backfill strategy. It is a production incident waiting to happen.

5.4 Side effects must be isolated

Backfill must not accidentally send notifications, charge customers, trigger enforcement actions, open tickets, or update operational systems.

Data pipelines often have hidden side effects:

  • publishing Kafka events consumed by alerting systems
  • updating search indexes used by users
  • refreshing materialized views
  • sending metrics interpreted as live alerts
  • invalidating caches
  • touching feature stores consumed by online models

Backfill must carry an execution mode:

enum ExecutionMode {
    LIVE,
    BACKFILL,
    REPLAY,
    DRY_RUN,
    RECONCILIATION
}

Downstream systems should know whether an output is live or historical.

5.5 Publication must be separate from computation

Do not compute directly into the final target unless the mutation is inherently safe.

Preferred pattern:

read input -> compute into staging/shadow -> validate -> publish atomically

6. Backfill Architecture

A robust design separates four planes:

ComponentResponsibility
PlannerConvert intent into bounded input/output scope
Manifest storeDurable record of approved run plan
ExecutorRun computation deterministically
Staging targetHold output before publication
ValidatorCompare output against rules and expected invariants
PublisherMake result visible safely
Evidence storePreserve logs, metrics, checks, lineage, approvals

A backfill platform is not necessarily a big product. It may be a set of conventions, tables, and Java libraries. But the separation matters.


7. Backfill Modes

7.1 Append-only historical generation

Use when target did not exist before.

Example:

Generate daily case SLA facts for 2023–2025.

Characteristics:

  • safe if target is empty or partition-isolated
  • no need to delete old output
  • still requires validation
  • common for migration/bootstrap

7.2 Partition replacement

Use when output is partitioned by a stable dimension such as date.

Recompute target partitions dt=2025-01-01 through dt=2025-12-31.

This is often the best lakehouse/warehouse strategy.

Requirements:

  • partition key must match output correction boundary
  • target engine must support atomic or controlled partition replacement
  • validation must happen before publication
  • downstream consumers must tolerate restatement

7.3 Keyed upsert/merge

Use when output identity is business-key based.

Example:

Recompute current case projection by case_id.

Risks:

  • stale rows may remain if deletes are not handled
  • merge condition may be wrong
  • version ordering may be ambiguous
  • concurrent live writes may conflict

Use a version column:

MERGE INTO case_projection t
USING staging_case_projection s
ON t.case_id = s.case_id
WHEN MATCHED AND s.projection_version >= t.projection_version THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

7.4 Shadow output + pointer swap

Use when correctness risk is high.

Write to case_sla_daily_bf_20260704
Validate
Swap view or metadata pointer

Useful when:

  • output is large
  • validation needs time
  • downstream consumers require stable table names
  • rollback must be fast

7.5 Versioned output

Use when old and new output must coexist.

Example:

case_sla_daily_v2
case_sla_daily_v3

or:

case_sla_daily(output_version='v3')

Useful for:

  • regulatory restatement comparison
  • ML feature backtesting
  • migration validation
  • consumer phased rollout

7.6 Replay into live topic

High risk.

Only use when downstream semantics are designed for it.

Replay to a live Kafka topic can trigger:

  • duplicate alerts
  • state regression
  • consumer backlog explosion
  • cache poisoning
  • external side effects

Safer alternatives:

  • replay to dedicated backfill topic
  • include processingMode=BACKFILL
  • include runId
  • require consumers to opt in
  • write to table instead of event topic

8. Deterministic Reprocessing

A transform is deterministic if the same input and same context produce the same output.

same input records
+ same transform version
+ same configuration
+ same reference data
+ same time policy
= same output

Non-determinism sources:

SourceProblemFix
Instant.now()Output changes per runInject run clock
Random UUIDDifferent IDs per replayDerive deterministic IDs from input/run
Live reference DB lookupReference value changesVersion reference data
Unordered iterationOutput order/hash unstableSort before hash/output
Floating point aggregationNon-associative result driftUse decimal/fixed precision where needed
External API callMutable responseSnapshot API response or forbid during backfill
Non-versioned configOld output unreproducibleHash and store config

Java pattern:

public record TransformContext(
        String runId,
        ExecutionMode mode,
        Instant runStartedAt,
        Clock clock,
        String transformVersion,
        String configHash,
        String referenceDataVersion
) {}

public interface DeterministicTransform<I, O> {
    List<O> apply(I input, TransformContext context);
}

Do not let transform code access static system time directly.

Bad:

record.setProcessedAt(Instant.now());

Better:

record.setProcessedAt(context.runStartedAt());

Best, for business semantics:

record.setEffectiveAt(input.businessEffectiveAt());
record.setProducedByRun(context.runId());

9. Backfill Identity

Every output produced by a backfill should carry lineage.

Minimum output metadata:

public record OutputLineage(
        String runId,
        ExecutionMode mode,
        String transformName,
        String transformVersion,
        String inputSnapshot,
        String inputRange,
        Instant producedAt
) {}

In tables:

produced_by_run_id      varchar not null,
produced_by_transform   varchar not null,
transform_version       varchar not null,
input_snapshot_id       varchar null,
input_range_start       timestamp null,
input_range_end         timestamp null,
processing_mode         varchar not null

In Kafka headers:

x-run-id: bf-2026-07-04-case-sla-v3-001
x-processing-mode: BACKFILL
x-transform-version: case-sla-transform/3.1.0
x-input-snapshot: iceberg:bronze.case_events@8392012349012

Without lineage, historical output changes become suspicious.

With lineage, they become explainable.


10. Source-Specific Backfill Patterns

10.1 Kafka replay

Kafka is replay-friendly because events in a topic can be read multiple times while retained.

But replay requires strict offset/range control.

topic: case-events-v1
partition: 0..47
start offset per partition: known
end offset per partition: known
consumer group: unique backfill group

Do not rely only on timestamp lookup if exact reproducibility matters. Timestamp-to-offset lookup can be useful for planning, but the actual run manifest should persist resolved offsets.

Kafka replay manifest section:

{
  "source": {
    "type": "KAFKA",
    "cluster": "prod-a",
    "topic": "case-events-v1",
    "partitions": [
      { "partition": 0, "fromOffset": 9012331, "toOffsetExclusive": 9120000 },
      { "partition": 1, "fromOffset": 8871200, "toOffsetExclusive": 8990201 }
    ]
  }
}

Important rules:

  • use a new consumer group or assign partitions manually
  • disable live side effects
  • preserve original event time
  • set processingMode=BACKFILL
  • avoid committing offsets to a live consumer group
  • validate output before publication

Java sketch: manual partition assignment

try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
    TopicPartition tp = new TopicPartition("case-events-v1", 0);
    consumer.assign(List.of(tp));
    consumer.seek(tp, 9_012_331L);

    long endExclusive = 9_120_000L;

    while (consumer.position(tp) < endExclusive) {
        ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, byte[]> record : records.records(tp)) {
            if (record.offset() >= endExclusive) {
                break;
            }
            processBackfillRecord(record);
        }
    }
}

This is intentionally manual. A backfill should not accidentally mutate a shared consumer group position.


10.2 CDC backfill

CDC backfill is tricky because the source has two histories:

  1. table snapshot state
  2. change log stream

Typical bootstrap:

initial snapshot -> stream changes from captured log position

Backfill options:

OptionUse caseRisk
Re-snapshot tableCurrent-state projectionLoses historical changes unless table stores history
Replay CDC topicHistorical reconstructionRequires retained CDC topic and schema history
Use raw changelog tableLakehouse CDC archiveBest for audit/replay if complete
Combine snapshot + CDC rangePoint-in-time reconstructionHarder but powerful

CDC reprocessing must preserve operation semantics:

c = create
u = update
d = delete
r = snapshot read

A delete is not “missing data.” It is a fact.

A snapshot row is not always equivalent to a create event. It is a state observation at snapshot time.


10.3 File backfill

File backfill is common when historical files are re-landed.

Required identity:

  • file path
  • file size
  • content hash
  • manifest ID
  • producer timestamp
  • business date
  • schema version
  • ingestion attempt ID

Safe design:

landing/2025/01/file.csv
manifest/2025/01/manifest.json
archive/accepted/...
archive/rejected/...

Use import ledger:

create table file_import_ledger (
    file_id varchar primary key,
    content_hash varchar not null,
    file_path varchar not null,
    business_date date not null,
    schema_version varchar not null,
    first_seen_at timestamp not null,
    accepted_run_id varchar null,
    status varchar not null
);

Backfill should say whether duplicate files are:

  • ignored
  • reprocessed
  • treated as correction
  • rejected

10.4 API backfill

API backfill is constrained by external behavior.

Problems:

  • rate limits
  • mutable responses
  • missing historical endpoints
  • pagination instability
  • deleted records not exposed
  • token expiration
  • vendor-side retention
  • data corrected without updated timestamp

Safe API backfill patterns:

PatternUse case
Cursor replayProvider supports stable cursor/history
Time-window scanProvider supports updated_since
ID range scanStable ID ordering exists
Snapshot exportVendor can export historical dump
Webhook + reconciliationWebhooks are incomplete, periodic scan fixes gaps

Always store raw API responses for critical backfills when allowed by policy.


10.5 Lakehouse backfill

Lakehouse backfill often uses partition replacement, snapshot isolation, or versioned table publication.

Good targets support:

  • immutable files
  • table snapshots
  • atomic commit
  • schema evolution
  • partition evolution
  • time travel
  • metadata inspection

Backfill on a lakehouse table should distinguish:

compute output files != publish table snapshot

This separation enables validation before commit.


11. Target Mutation Patterns

11.1 Append with run ID

insert into target
select *, 'bf-2026-07-04-001' as produced_by_run_id
from staging;

Use when target supports multiple output versions.

Add a view for accepted version:

create or replace view accepted_case_sla as
select *
from target
where output_status = 'ACCEPTED';

11.2 Replace partition

replace partitions dt=2025-01-01..2025-12-31

Validation before replacement:

  • count check
  • null check
  • uniqueness check
  • business reconciliation
  • sample diff
  • checksum by partition

11.3 Merge by key

Use only when key identity is stable and delete handling is explicit.

Output should include:

  • key
  • version
  • operation
  • valid/effective time if relevant
  • source lineage

11.4 Publish pointer swap

case_sla_current -> case_sla_v20260704

Useful when table/view indirection exists.

Benefits:

  • fast rollback
  • visible publication moment
  • easier validation
  • consumers see a consistent cutover

11.5 Correction event publication

For event-sourced downstream systems, restating table rows may be wrong. Publish correction events instead.

Example:

{
  "eventType": "CaseSlaBreachCorrected",
  "caseId": "C-101",
  "correctsEventId": "evt-2025-00091",
  "reason": "TIMEZONE_BUG_RESTATEMENT",
  "oldBreachAt": "2025-04-10T00:00:00+07:00",
  "newBreachAt": "2025-04-09T17:00:00Z",
  "producedByRunId": "bf-2026-07-04-case-sla-v3-001"
}

Do not silently overwrite history when downstream meaning requires correction lineage.


12. Live + Backfill Concurrency

A dangerous scenario:

Live pipeline writes current data while backfill rewrites historical data.

Questions:

  • Can live and backfill write the same keys?
  • Which output wins?
  • Is there a version ordering?
  • Does the sink support concurrent commits?
  • Are downstream consumers aware of restatement?

Patterns:

12.1 Freeze target window

Block live writes for affected partitions/time range.

Pros:

  • simple correctness

Cons:

  • operational disruption

12.2 Versioned writes

Live and backfill write separate versions.

output_version = live-current
output_version = bf-20260704

Publish selects accepted version.

12.3 Last-writer-wins with run priority

Usually bad unless explicitly modeled.

12.4 Merge with event-time/version semantics

Backfill output can be accepted only if it represents a known correction version.

boolean shouldUpdate(Projection existing, Projection candidate) {
    return candidate.sourceVersion().compareTo(existing.sourceVersion()) > 0;
}

But be careful: sourceVersion is not the same as processing time.


13. Side-Effect Boundaries

Backfill must classify every sink.

Sink typeBackfill default
Analytical tableAllowed with staging/validation
Materialized viewAllowed if versioned or rebuildable
Kafka live event topicUsually forbidden
Notification serviceForbidden
Ticket/case management commandForbidden
Search indexAllowed only if isolated index or controlled rebuild
Feature storeAllowed only with backfill mode/version
CacheUsually forbidden; rebuild after publish
External APIForbidden unless compensation/idempotency exists

A transform should not decide this alone. The run manifest should.

Java guard:

public interface SinkPolicy {
    void assertAllowed(SinkDescriptor sink, ExecutionMode mode);
}

public final class DefaultSinkPolicy implements SinkPolicy {
    @Override
    public void assertAllowed(SinkDescriptor sink, ExecutionMode mode) {
        if (mode == ExecutionMode.BACKFILL && sink.hasExternalSideEffect()) {
            throw new IllegalStateException(
                "Backfill cannot write to external side-effect sink: " + sink.name()
            );
        }
    }
}

14. Validation Before Publication

Backfill output is not accepted because the job succeeded. It is accepted because checks passed.

Validation layers:

14.1 Technical checks

  • no failed tasks
  • no unhandled poison records
  • expected partitions written
  • output row count within range
  • no unexpected null metadata
  • no sink commit errors

14.2 Schema checks

  • output schema compatible with consumer contract
  • no unexpected column removal/type change
  • schema version recorded
  • nullable/non-nullable expectations satisfied

14.3 Data quality checks

  • uniqueness
  • nullability
  • range
  • referential integrity
  • accepted enum values
  • timestamp bounds
  • PII rules

14.4 Reconciliation checks

Examples:

-- count by business date
select business_date, count(*)
from staging_case_sla
where run_id = 'bf-2026-07-04-001'
group by business_date;
-- checksum by partition
select business_date,
       count(*) as row_count,
       sum(hash(case_id, sla_state, breach_at)) as checksum
from staging_case_sla
group by business_date;

For financial/regulatory systems, reconciliation often must be domain-specific:

  • balance totals
  • case counts by lifecycle state
  • number of open/closed/escalated cases
  • breach counts by jurisdiction
  • appeal counts by decision category

14.5 Business reasonableness

A result can pass schema checks and still be wrong.

Example:

SLA breach count drops by 80% after backfill.

Maybe the bug fix is correct. Maybe the transform is broken. That requires owner review.


15. Publication Strategies

15.1 Atomic publish

Best when supported.

Examples:

  • Iceberg snapshot commit
  • atomic table rename/swap
  • transactional warehouse table update
  • version pointer update in metadata table

15.2 Progressive publish

Use when target is large or consumers are segmented.

publish tenant A -> validate -> tenant B -> validate -> full publish

15.3 Dark publish

Compute and expose only to internal validators.

case_sla_daily_candidate

15.4 Dual-read cutover

Consumers compare old and new output before switch.

read old + new -> diff -> accept -> switch

15.5 Correction publication

Instead of replacing old rows, publish correction events or restatement tables.

For regulated systems, this may be the most defensible pattern.


16. Rollback and Supersession

Rollback means the system returns to previous accepted state.

Supersession means a newer accepted output replaces a previous accepted output.

In immutable/lakehouse systems, prefer supersession.

run bf-001 published
run bf-002 supersedes bf-001

Do not physically delete evidence unless retention policy requires it.

Run table:

create table pipeline_run (
    run_id varchar primary key,
    run_type varchar not null,
    status varchar not null,
    reason text not null,
    transform_name varchar not null,
    transform_version varchar not null,
    input_descriptor jsonb not null,
    target_descriptor jsonb not null,
    created_at timestamp not null,
    approved_at timestamp null,
    published_at timestamp null,
    superseded_by_run_id varchar null
);

Publication pointer:

create table accepted_output_version (
    data_product varchar primary key,
    accepted_run_id varchar not null,
    accepted_at timestamp not null,
    accepted_by varchar not null
);

Rollback becomes pointer movement, not blind deletion.


17. Backfill Cost Model

Backfill can overload systems.

Cost dimensions:

DimensionRisk
Source read loadDB/API/object store pressure
Broker read loadKafka cluster/network load
ComputeSpark/Flink/Java worker cost
ShuffleNetwork and disk explosion
StateRocksDB/state-store growth
Output filesSmall file problem
Sink commitWarehouse/lakehouse contention
Downstream impactConsumer backlog/cache rebuild

Backfill should be throttled intentionally.

Example controls:

public record BackfillLimits(
        int maxPartitionsInParallel,
        long maxRowsPerMinute,
        long maxBytesPerMinute,
        int maxSinkCommitsPerHour,
        Duration sleepBetweenWindows
) {}

A slow correct backfill is better than a fast incident.


18. Chunking Strategy

Backfills should be chunked along a dimension that matches correctness boundaries.

Possible chunk keys:

  • business date
  • event date
  • source commit date
  • tenant
  • jurisdiction
  • source partition
  • ID range
  • Kafka partition/offset range

Good chunking properties:

  • independent validation
  • retryable without duplicate effects
  • bounded memory
  • bounded sink mutation
  • natural progress reporting
  • limited blast radius

Bad chunking:

Split arbitrary every 1 million rows even though aggregates cross chunk boundaries.

If output aggregate is monthly, chunking daily may be fine only if monthly publish waits for all days.


19. Backfill Runner Skeleton in Java

A minimal but serious Java model:

public record BackfillPlan(
        String runId,
        SourceRange sourceRange,
        TransformDescriptor transform,
        TargetMutation targetMutation,
        ValidationPlan validationPlan,
        BackfillLimits limits
) {}

public interface BackfillSource<I> {
    Iterator<I> read(SourceRange range);
}

public interface BackfillTransform<I, O> {
    Stream<O> transform(I input, TransformContext context);
}

public interface StagingSink<O> {
    void writeBatch(List<O> outputs, OutputLineage lineage);
}

public interface BackfillValidator {
    ValidationResult validate(String runId);
}

public interface Publisher {
    PublishResult publish(String runId, TargetMutation mutation);
}

Execution:

public final class BackfillRunner<I, O> {
    private final BackfillSource<I> source;
    private final BackfillTransform<I, O> transform;
    private final StagingSink<O> stagingSink;
    private final BackfillValidator validator;
    private final Publisher publisher;

    public void run(BackfillPlan plan) {
        TransformContext context = TransformContext.forBackfill(plan);

        List<O> buffer = new ArrayList<>();
        Iterator<I> inputs = source.read(plan.sourceRange());

        while (inputs.hasNext()) {
            I input = inputs.next();
            transform.transform(input, context).forEach(buffer::add);

            if (buffer.size() >= 10_000) {
                stagingSink.writeBatch(buffer, context.lineage());
                buffer.clear();
            }
        }

        if (!buffer.isEmpty()) {
            stagingSink.writeBatch(buffer, context.lineage());
        }

        ValidationResult result = validator.validate(plan.runId());
        if (!result.passed()) {
            throw new BackfillValidationException(result);
        }

        if (plan.targetMutation().requiresManualPublish()) {
            return;
        }

        publisher.publish(plan.runId(), plan.targetMutation());
    }
}

This skeleton intentionally separates staging and publication.


20. Idempotency for Backfill Output

Backfill can fail mid-run. Re-running must be safe.

Patterns:

20.1 Run-scoped staging table

create table staging_case_sla (
    run_id varchar not null,
    case_id varchar not null,
    business_date date not null,
    sla_state varchar not null,
    breach_at timestamp null,
    primary key (run_id, case_id, business_date)
);

Re-run can overwrite same (run_id, key) or require a new run_id.

20.2 Deterministic output ID

String outputId = sha256(runId + ":" + caseId + ":" + businessDate);

For restatements, do not use random IDs.

20.3 Chunk ledger

create table backfill_chunk_run (
    run_id varchar not null,
    chunk_id varchar not null,
    status varchar not null,
    input_count bigint null,
    output_count bigint null,
    checksum varchar null,
    started_at timestamp not null,
    finished_at timestamp null,
    primary key (run_id, chunk_id)
);

Chunk ledger enables retry from failed chunk, not from beginning.


21. Reprocessing Stateful Pipelines

Stateful pipelines are harder to backfill than stateless transformations.

Examples:

  • dedupe state
  • session windows
  • SLA timers
  • fraud/risk rolling windows
  • temporal joins
  • case lifecycle state machines

Questions:

  1. Is state derived entirely from replayed input?
  2. Does state require a warm-up period?
  3. Is the start boundary safe?
  4. Are timers deterministic?
  5. Is late data policy versioned?
  6. Can state be snapshotted and restored?

Warm-up window

If output for March depends on events from February, input range must include February.

requested output: 2025-03-01..2025-03-31
required input:   2025-02-01..2025-03-31
publish output:   March only

Manifest should separate input range from output range.

{
  "inputRange": {
    "eventDateFrom": "2025-02-01",
    "eventDateTo": "2025-04-01"
  },
  "publishRange": {
    "businessDateFrom": "2025-03-01",
    "businessDateTo": "2025-04-01"
  }
}

22. Backfill and Watermarks

For event-time pipelines, backfill needs a synthetic watermark policy.

Live stream:

watermark follows observed event-time progress

Backfill:

input is finite; watermark can advance deterministically after sorted/chunked input

But be careful. If backfill reads unsorted historical data, watermark advancement may close windows too early.

Strategies:

StrategyUse case
Sort by event time within keyHighest correctness, expensive
Process by time partitionsGood for batch/lakehouse
Use bounded out-of-ordernessGood if source disorder bound known
Treat backfill as batch aggregationBetter if window logic can be batch-equivalent
Emit correction events after late factsGood for always-on systems

23. Backfill and Reference Data

A hidden source of incorrect backfill is reference data.

Example:

Case SLA due date depends on jurisdiction calendar.

If the calendar table today is different from the calendar table in 2025, which one should be used?

Answer depends on business semantics:

SemanticsUse
Reconstruct what system believed thenReference data as recorded then
Compute what should have been trueCorrected reference data
Produce current reporting truthCurrent accepted reference version
Audit an old decisionReference version used by original decision

Backfill manifest must include reference data version.

{
  "referenceData": [
    {
      "name": "jurisdiction_calendar",
      "version": "calendar-v2025-restated-002",
      "semantics": "CORRECTED_TRUTH"
    }
  ]
}

24. Backfill and Privacy

Backfill can resurrect data that should no longer be used.

Risks:

  • replaying deleted PII
  • writing raw sensitive data into new targets
  • violating retention policy
  • bypassing masking/tokenization
  • publishing historical fields now classified as restricted

Backfill source policy must check:

  • source retention legality
  • purpose limitation
  • target classification
  • masking requirements
  • access rights
  • deletion/tombstone handling

Rule:

Replay permission is not implied by historical availability.

A source may technically contain old data that policy no longer allows you to process.


25. Backfill Observability

Metrics:

MetricMeaning
backfill_input_records_totalRecords read
backfill_output_records_totalRecords produced
backfill_rejected_records_totalValidation/parse rejects
backfill_chunks_completed_totalProgress
backfill_chunk_duration_secondsRuntime per chunk
backfill_sink_write_latency_msSink pressure
backfill_validation_failures_totalQuality failure count
backfill_cost_estimateCost tracking
backfill_publish_duration_secondsPublication cost

Logs should always include:

run_id
chunk_id
source_range
transform_version
target
processing_mode

Tracing is useful for orchestration and sink calls, but large record-level tracing is usually too expensive. Use sampled record traces for debugging.


26. Backfill Runbook

A production runbook should include:

  1. define reason
  2. identify impacted products/consumers
  3. define input range and output range
  4. identify transform version
  5. identify reference data version
  6. estimate cost and duration
  7. choose mutation strategy
  8. choose side-effect policy
  9. create manifest
  10. get owner approval
  11. run dry run on subset
  12. validate subset
  13. run full staging
  14. validate full output
  15. review diff with owner
  16. publish
  17. monitor downstream
  18. archive evidence
  19. communicate restatement
  20. create post-run notes

Communication matters. Backfill changes facts people rely on.


27. Testing Backfill Logic

Test classes:

27.1 Determinism test

@Test
void sameInputAndContextProducesSameOutput() {
    var context = fixedBackfillContext();
    var input = sampleCaseEvent();

    var first = transform.apply(input, context);
    var second = transform.apply(input, context);

    assertEquals(first, second);
}

27.2 Idempotent staging test

Run same chunk twice. Output count must not double.

27.3 Partial failure test

Fail after writing staging but before chunk commit. Retry must converge.

27.4 Publish failure test

Simulate unknown outcome during publication. Recovery must inspect target state before retry.

27.5 Warm-up range test

Assert output range excludes warm-up records but state includes them.

27.6 Side-effect guard test

Assert backfill mode cannot write to forbidden sinks.

27.7 Reconciliation test

Golden dataset + expected aggregate checks.


28. Anti-Patterns

Anti-pattern: using production consumer group for replay

This mutates live consumer offsets.

Use a dedicated replay group or manual assignment.

Anti-pattern: backfill directly to live event topic

This can trigger downstream side effects.

Use staging, backfill topic, or versioned target.

Anti-pattern: no transform version

You cannot reproduce output.

Anti-pattern: no input snapshot/range

You cannot defend why output changed.

Anti-pattern: replaying without reference data version

You may generate a hybrid truth that never existed and was never approved.

Anti-pattern: delete-and-insert without atomicity

Readers can see partial output.

Anti-pattern: validation after publish only

At that point validation is incident detection, not prevention.

Anti-pattern: “just rerun Airflow DAG”

A DAG rerun may not imply same input, same code, same config, same target mutation, or same side-effect policy.


29. Case Study: Regulatory Enforcement SLA Restatement

Scenario:

  • Enforcement cases have SLA breach deadlines.
  • Deadline depends on jurisdiction, business calendar, escalation pause periods, and decision effective date.
  • A timezone bug caused some deadlines to be computed one day late.
  • Need restatement for 2025 reports.

29.1 Plan

Reason: timezone bug in SLA transform v2.4.1
New transform: v3.1.0
Input: bronze.case_events snapshot 8392012349012, event date 2024-12-01..2026-01-01
Output: silver.case_sla_daily, business date 2025-01-01..2026-01-01
Warm-up: 2024-12 events for open cases crossing into 2025
Reference: jurisdiction_calendar v2026-06-restated
Publish mode: replace 2025 partitions
Side effects: forbidden
Validation: compare breach count by jurisdiction/month; sample high-risk cases
Approval: regulatory reporting owner

29.2 Why input starts before output

An SLA breach in January 2025 may depend on a case opened in December 2024.

So:

input range  != output range

29.3 Output lineage

Every restated row includes:

produced_by_run_id = bf-2026-07-04-case-sla-v3-001
transform_version = case-sla-transform/3.1.0
reference_data_version = jurisdiction_calendar/v2026-06-restated
restatement_reason = TIMEZONE_BUG_FIX

29.4 Validation

Checks:

  • count by jurisdiction/month
  • breach count delta threshold
  • no missing active cases
  • no negative SLA duration
  • due date timezone normalized
  • sample 100 previously disputed cases
  • compare old vs new for all changed rows

29.5 Publication

Write staging table, validate, replace 2025 partitions, update accepted run pointer.

29.6 Evidence

Store:

  • run manifest
  • code version
  • validation result
  • diff summary
  • owner approval
  • publish timestamp
  • superseded run ID

30. Design Checklist

Before approving a backfill, answer:

  • What is the business reason?
  • What output is allowed to change?
  • What input range/snapshot is used?
  • Is input range different from output range?
  • What transform version is used?
  • What config/reference data version is used?
  • Are side effects forbidden or controlled?
  • Is output written to staging first?
  • What validation gates must pass?
  • What reconciliation proves completeness?
  • What is the publication strategy?
  • What is rollback/supersession strategy?
  • What downstream consumers are impacted?
  • Is privacy/retention policy satisfied?
  • Is cost/blast radius bounded?
  • Is run evidence preserved?

31. The Core Lesson

Backfill is where pipeline architecture tells the truth about itself.

If a system cannot backfill safely, then it likely also cannot recover safely, audit safely, or evolve safely.

The mature design is not:

We can rerun the job.

The mature design is:

We can reproduce a bounded output from declared input, code, config, and reference data; validate it before publication; publish it safely; and explain the result later.

That is the difference between a script and an engineering system.

Lesson Recap

You just completed lesson 55 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.