State Reconstruction and Replay

Learn Java Microservices File Handling, State, Configuration and Secret Management - Part 033

State reconstruction and replay for Java microservices: event replay, file reindex, metadata repair, checkpoints, idempotency, reconciliation, and audit-safe recovery.

2026-07-05

Part 033 — State Reconstruction and Replay

The strongest state is not the state that never breaks.

The strongest state is the state you can explain, rebuild, verify, and repair.

In production microservices, state breaks in ways that are rarely dramatic but very expensive:

file metadata exists, but the object payload is missing;
the object payload exists, but there is no metadata;
the search index lags the database by 2 hours;
a cache holds stale authorization data;
an event consumer has processed an event, but the offset has not been committed;
the offset has been committed, but the side effect has not reached the database;
a worker crashes after the storage upload succeeds but before the audit event is written;
a scan result arrives twice, in a different order;
a reprocessing job produces state that differs from the original processing;
a retention job deletes an object that is still referenced by an active workflow.

State reconstruction is a system's ability to rebuild correct state from a more authoritative source.

Replay is one reconstruction technique, but it is not the only one.

This part covers:

the state reconstruction model;
the differences between replay, reindex, reconciliation, repair, and backfill;
how to design state so that it can be reconstructed;
how to replay events without breaking correctness;
how to repair drift between file metadata and object storage;
how checkpoints and idempotency work;
how to make recovery audit-safe.

1. Mental Model

State reconstruction starts with a simple question:

If this state disappears or is suspected to be wrong, from what source can we rebuild it?

The answer differs for each kind of state.

State	Can Be Reconstructed From	Notes
Search index	DB source of truth / event log	Routine reindex
Cache	Source service / DB / computed policy	Can be flushed and rebuilt
File metadata	Upload ledger + object inventory + audit log	Hard if the ledger is incomplete
File payload	Nothing, if the payload is lost and no backup exists	Requires backup/replication/versioning
Workflow state	Event log / BPM history / DB snapshot	Depends on the design
Session state	Usually not reconstructed	User logs in again
Secret lease state	Secret manager authority	Refresh/reissue
Config effective state	Git revision + runtime source + deployment metadata	Reconstructable if provenance exists

Key idea:

Not all state must be reconstructable.
But every state must have an explicit recovery story.

Session state may be lost if the UX tolerates a fresh login. An evidence payload must not be lost if regulated retention requires it as proof.

2. Reconstruction Taxonomy

Do not call every recovery a "replay". There are several distinct techniques.

Technique	Source	Target	Purpose
Replay	Event log	Projection / state store	Rebuild from historical events
Reindex	DB / object metadata	Search index / read model	Synchronize the read model
Reconciliation	Two or more sources	Mismatch report + correction	Detect drift
Repair	Source of truth + rule	Corrupted state	Fix incorrect data
Backfill	Existing data	New field / new projection	Populate new state
Recompute	Raw facts	Derived value	Recalculate the value
Restore	Backup/snapshot	Durable store	Bring data back
Resync	Remote authority	Local copy	Synchronize external state

In architecture reviews, use the specific term.

Bad:

If it breaks, we will just replay.

Better:

The search index can be rebuilt by reindexing from the evidence_file table.
Object-metadata drift is detected by a daily reconciliation job.
An accepted file payload cannot be reconstructed from metadata; it must rely on object versioning, backup, and replication.

3. Source of Truth Hierarchy

Reconstruction needs a hierarchy. If every source is treated as equally trustworthy, a repair job can reinforce data that is wrong.

Take an evidence file as an example. There is no single hierarchy: the source of truth can differ for each fact.

Fact	Source of Truth	Derived/Secondary
File ID exists	Metadata DB	Search index, cache
Payload bytes exist	Object storage	Metadata pointer
Payload hash	Verified upload process / object checksum	Metadata copy
File lifecycle status	Metadata DB/domain state machine	Audit projection
Who changed status	Audit log	Metadata lastModifiedBy
Searchable text	Extractor result	Search index
Retention policy decision	Compliance policy service	File metadata snapshot

Do not repair anything without knowing the authoritative source for the fact being repaired.

4. Designing State for Reconstruction

Reconstructable state does not happen by accident. It has to be designed.

4.1 Stable Identity

Every important artifact must have a stable identity.

public record FileId(String value) {
    public FileId {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("fileId is required");
        }
    }
}

Do not rely on:

the user's original file name;
a temporary local path;
the object key as the domain ID;
an auto-increment ID that cannot be correlated with external events because it was never stored with them;
a trace ID as the artifact ID.

4.2 Immutable Event Identity

A replayable event must have an event ID and an aggregate ID.

public record FileLifecycleEvent(
    String eventId,
    String fileId,
    String eventType,
    String previousStatus,
    String nextStatus,
    String reasonCode,
    String actorId,
    String correlationId,
    Instant occurredAt
) {}

The event ID is used for idempotency. The aggregate ID (fileId in the record above) is used for per-entity ordering.

4.3 Versioned State

State that is updated repeatedly must carry a version.

ALTER TABLE evidence_file
ADD COLUMN version BIGINT NOT NULL DEFAULT 0;

Update:

UPDATE evidence_file
SET status = ?, version = version + 1
WHERE file_id = ? AND version = ?;

If the row count is 0, either a concurrent update happened or the state has already changed. The caller must reload the row and decide again instead of retrying blindly.
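The same compare-and-set rule can be sketched in memory. This is only an illustration: VersionedStatusStore is a hypothetical stand-in for the evidence_file table, and its method names are assumptions, not part of any real repository API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the evidence_file table.
final class VersionedStatusStore {
    record Row(String status, long version) {}

    private final Map<String, Row> rows = new HashMap<>();

    synchronized void insert(String fileId, String status) {
        rows.put(fileId, new Row(status, 0L));
    }

    // Mirrors: UPDATE evidence_file SET status = ?, version = version + 1
    //          WHERE file_id = ? AND version = ?
    // Returns the affected row count (0 or 1).
    synchronized int updateStatus(String fileId, long expectedVersion, String newStatus) {
        Row current = rows.get(fileId);
        if (current == null || current.version() != expectedVersion) {
            return 0; // stale read or missing row: caller must reload and decide again
        }
        rows.put(fileId, new Row(newStatus, current.version() + 1));
        return 1;
    }

    synchronized Row get(String fileId) {
        return rows.get(fileId);
    }
}
```

A writer that read version 0 succeeds once; a second writer still holding version 0 gets a row count of 0 and cannot silently overwrite the first change.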

4.4 Provenance

Derived state must record where it came from.

public record SearchProjectionMetadata(
    String sourceTable,
    String sourceId,
    long sourceVersion,
    String projectionVersion,
    Instant indexedAt
) {}

Without provenance, you cannot tell whether the index reflects the latest source.
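One way provenance pays off is a cheap staleness check. The sketch below repeats the record so it compiles on its own; ProjectionStaleness and its isStale rule are assumptions made for illustration, and a real system may need additional criteria.

```java
import java.time.Instant;

// Same shape as the record in the text, repeated so the sketch is self-contained.
record SearchProjectionMetadata(
    String sourceTable,
    String sourceId,
    long sourceVersion,
    String projectionVersion,
    Instant indexedAt
) {}

// Hypothetical helper, not part of any library.
final class ProjectionStaleness {
    private ProjectionStaleness() {}

    // A document is stale when the source row has moved past the version it was
    // built from, or when it was built by a different projection schema version.
    static boolean isStale(SearchProjectionMetadata doc,
                           long currentSourceVersion,
                           String currentProjectionVersion) {
        return doc.sourceVersion() < currentSourceVersion
            || !doc.projectionVersion().equals(currentProjectionVersion);
    }
}
```

A reindex or reconciliation job could use a check like this to select only the documents that actually need rebuilding.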

4.5 Rebuildable Projection Contract

A projection must be safe to delete and rebuild.

Example:

Search index is disposable.
Evidence metadata DB is authoritative.
Reindex must rebuild all documents from metadata DB and extractor output.

Consequences:

the index schema must be versioned;
the reindex job must be able to run in parallel with live serving;
user queries can switch through an alias from the old index to the new one;
reindexing must not modify the source of truth;
a failed reindex must not delete the old index before the new index is validated.

5. Event Replay

Event replay means reading historical events and building state or a projection from them.

Kafka and other event logs suit many replay use cases because events are stored in order within a partition and consumers read them as a stream. But replay is not magic exactly-once. Correctness depends on the design of the consumer, the offsets, idempotency, and the sink. It also depends on retention: events that have expired or been compacted away cannot be replayed.

5.1 Replay Use Cases

Replay is a good fit for:

building a new read model;
fixing a projection after a consumer bug;
migrating a projection schema;
audit reconstruction;
reprocessing file extraction after the extractor has been fixed;
backfilling an event-derived field.

Replay is not a good fit for:

resurrecting a lost payload that has no backup;
re-running external side effects such as email or payment without a guard;
reprocessing commands that were meant to run only once;
replacing an old domain decision with new logic without governance.

5.2 Event as Fact, Not Command

Event replay is safe when events are facts.

Good:

FileAccepted(fileId=F1, acceptedAt=T1, checksum=abc, policyVersion=v3)

Dangerous:

AcceptFile(fileId=F1)

A command can yield a different decision on replay because config, policy, time, or external state has changed.

A historical event must store the decision that was already made.

5.3 Deterministic Projection

Projection replay must be deterministic.

public final class EvidenceProjectionBuilder {
    public EvidenceProjection apply(EvidenceProjection current, FileLifecycleEvent event) {
        return switch (event.eventType()) {
            case "FILE_UPLOADED" -> current.withUploadedAt(event.occurredAt());
            case "FILE_ACCEPTED" -> current.withStatus("ACCEPTED")
                                             .withAcceptedAt(event.occurredAt());
            case "FILE_REJECTED" -> current.withStatus("REJECTED")
                                             .withRejectedAt(event.occurredAt())
                                             .withReason(event.reasonCode());
            default -> current;
        };
    }
}

Avoid:

reading the current time during replay to fill historical fields;
calling external APIs;
using current config to interpret old events that carry no version;
generating random IDs instead of deriving them from the event ID;
non-idempotent side effects.
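Determinism can be checked directly: replaying the same ordered events twice must give equal projections. The following is a minimal sketch with simplified, hypothetical shapes (LifecycleFact, FileView); the real event and projection carry more fields.

```java
import java.time.Instant;
import java.util.List;

// Simplified, hypothetical shapes used only for this sketch.
record LifecycleFact(String eventId, String fileId, String eventType, Instant occurredAt) {}

record FileView(String status, Instant uploadedAt, Instant decidedAt) {
    static FileView empty() {
        return new FileView("UNKNOWN", null, null);
    }
}

final class DeterministicReplay {
    private DeterministicReplay() {}

    // Pure function: the result depends only on the previous view and the event.
    // No clock reads, no external calls, no random values.
    static FileView apply(FileView current, LifecycleFact event) {
        return switch (event.eventType()) {
            case "FILE_UPLOADED" -> new FileView("UPLOADED", event.occurredAt(), current.decidedAt());
            case "FILE_ACCEPTED" -> new FileView("ACCEPTED", current.uploadedAt(), event.occurredAt());
            case "FILE_REJECTED" -> new FileView("REJECTED", current.uploadedAt(), event.occurredAt());
            default -> current; // unknown event types leave the view untouched
        };
    }

    // Replays events in the given order, starting from an empty view.
    static FileView replay(List<LifecycleFact> orderedEvents) {
        FileView view = FileView.empty();
        for (LifecycleFact event : orderedEvents) {
            view = apply(view, event);
        }
        return view;
    }
}
```

Because every timestamp comes from the event itself, two replays of the same list are equal by record equality, which makes the property easy to assert in a test.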

5.4 Replay Mode vs Live Mode

Consumers often need different modes.

Mode	Purpose	Side Effect
Live consumer	Process new events	May emit downstream events if idempotent
Replay consumer	Rebuild projection	Must not send email/webhook/payment
Repair consumer	Correct known corruption	Limited, audit-heavy
Backfill consumer	Populate a new field/projection	Must be scoped

Make the mode an explicit flag:

public enum ProcessingMode {
    LIVE,
    REPLAY,
    REPAIR,
    BACKFILL
}

Do not reuse the live handler for replay without disabling dangerous side effects.
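A simple way to enforce this is to gate external calls on the mode while letting the projection update run in every mode. The sketch below assumes a hypothetical ExternalNotifier port and ModeAwareHandler class; neither comes from a real framework.

```java
import java.util.ArrayList;
import java.util.List;

// Repeated from the text so the sketch compiles on its own.
enum ProcessingMode { LIVE, REPLAY, REPAIR, BACKFILL }

// Hypothetical port for anything that leaves the service (email, webhook, payment).
interface ExternalNotifier {
    void notifyAccepted(String fileId);
}

final class ModeAwareHandler {
    private final ProcessingMode mode;
    private final ExternalNotifier notifier;
    private final List<String> projectionLog = new ArrayList<>();

    ModeAwareHandler(ProcessingMode mode, ExternalNotifier notifier) {
        this.mode = mode;
        this.notifier = notifier;
    }

    void onFileAccepted(String fileId) {
        projectionLog.add(fileId + ":ACCEPTED"); // projection update runs in every mode
        if (mode == ProcessingMode.LIVE) {
            notifier.notifyAccepted(fileId);     // external side effect only in LIVE
        }
    }

    List<String> projectionLog() {
        return List.copyOf(projectionLog);
    }
}
```

Passing the mode through the constructor keeps the decision in one place, so a replay deployment cannot send notifications just because someone forgot a flag deep inside the handler.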

6. Offset, Checkpoint, and Commit Boundary

Replay and consumer processing always face the same classic problem:

When do we record progress?

If progress is recorded too early, data can be lost. If it is recorded too late, events are processed again.

6.1 At-Least-Once Baseline

Many consumer systems use an at-least-once model:

1. Read message
2. Process message
3. Write side effect
4. Commit offset/checkpoint

If the process crashes after step 3 but before step 4, the message is processed again.

So side effects must be idempotent.
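To see why this matters, consider a sink that increments a counter. Without a guard, a redelivered event counts twice. The sketch below is a minimal illustration; IdempotentCounterSink and the caseId grouping are hypothetical names invented for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sink that counts accepted files per case and remembers applied event IDs.
final class IdempotentCounterSink {
    private final Set<String> appliedEventIds = new HashSet<>();
    private final Map<String, Integer> acceptedPerCase = new HashMap<>();

    // Returns true if the event changed state, false if it was a duplicate delivery.
    boolean apply(String eventId, String caseId) {
        if (!appliedEventIds.add(eventId)) {
            return false; // already applied: redelivery after a crash before commit
        }
        acceptedPerCase.merge(caseId, 1, Integer::sum);
        return true;
    }

    int acceptedCount(String caseId) {
        return acceptedPerCase.getOrDefault(caseId, 0);
    }
}
```

In a real service the set of applied event IDs has to live in the same durable store as the counter and be updated in the same transaction; keeping it only in memory would lose the guard on restart.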

6.2 Store Offset with Derived State

For a projection stored in a database, a robust pattern is:

In one DB transaction:
- upsert projection row
- record processed event ID or source offset

Example tables:

CREATE TABLE projection_checkpoint (
  consumer_name TEXT NOT NULL,
  partition_id INT NOT NULL,
  offset_value BIGINT NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (consumer_name, partition_id)
);

CREATE TABLE processed_event (
  consumer_name TEXT NOT NULL,
  event_id TEXT NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (consumer_name, event_id)
);

Pseudo-code:

@Transactional
public void handle(EventEnvelope envelope) {
    if (processedEventRepository.exists(consumerName, envelope.eventId())) {
        return;
    }

    projectionRepository.apply(envelope);
    processedEventRepository.insert(consumerName, envelope.eventId());
    checkpointRepository.update(
        consumerName,
        envelope.partition(),
        envelope.offset()
    );
}

This does not make the whole world exactly-once, but it makes the DB sink resilient to duplicate events.

6.3 Checkpoint for Batch/Reindex

For a file metadata reindex:

CREATE TABLE reindex_job_checkpoint (
  job_id TEXT PRIMARY KEY,
  last_seen_file_id TEXT,
  last_seen_updated_at TIMESTAMPTZ,
  processed_count BIGINT NOT NULL,
  status TEXT NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL
);

Use a stable ordering:

SELECT *
FROM evidence_file
WHERE (updated_at, file_id) > (?, ?)
ORDER BY updated_at, file_id
LIMIT 500;

Do not rely on offset pagination alone for a dataset that changes. Offset pagination can skip or duplicate rows when inserts or deletes happen between pages.
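The row comparison in the query above can be mirrored in plain Java to make the cursor logic explicit. This is an in-memory sketch only: FileRow, Cursor, and KeysetPager are hypothetical names, and the cursor is what a table like reindex_job_checkpoint would persist between batches.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

// Hypothetical row and cursor shapes for illustration.
record FileRow(String fileId, Instant updatedAt) {}
record Cursor(Instant updatedAt, String fileId) {}

final class KeysetPager {
    private static final Comparator<FileRow> ORDER =
        Comparator.comparing(FileRow::updatedAt).thenComparing(FileRow::fileId);

    private KeysetPager() {}

    // Equivalent of: WHERE (updated_at, file_id) > (?, ?)
    //                ORDER BY updated_at, file_id LIMIT ?
    // A null cursor means "start from the beginning".
    static List<FileRow> nextPage(List<FileRow> table, Cursor after, int limit) {
        return table.stream()
            .filter(row -> after == null || isAfter(row, after))
            .sorted(ORDER)
            .limit(limit)
            .toList();
    }

    // Lexicographic comparison on (updatedAt, fileId); fileId breaks timestamp ties.
    private static boolean isAfter(FileRow row, Cursor cursor) {
        int byTime = row.updatedAt().compareTo(cursor.updatedAt());
        return byTime > 0 || (byTime == 0 && row.fileId().compareTo(cursor.fileId()) > 0);
    }

    static Cursor cursorOf(FileRow lastRow) {
        return new Cursor(lastRow.updatedAt(), lastRow.fileId());
    }
}
```

The tie-breaker on fileId is what keeps rows with identical timestamps from being skipped at a page boundary. Note that the first page needs its own handling (here a null cursor), since comparing against NULL bind values in SQL would match nothing.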

7. File Reindex

Reindexing rebuilds a derived index from the source of truth.

Example targets:

Elasticsearch/OpenSearch index;
database read model;
reporting table;
file search metadata;
full-text extracted content projection.

7.1 Reindex Architecture

Safe strategy:

1. Create new index version: evidence-v2-20260705
2. Read source in stable batches
3. Build documents with projectionVersion=v2
4. Write to new index
5. Validate document count and sample integrity
6. Switch alias from old index to new index
7. Keep old index during rollback window
8. Delete old index after acceptance
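Steps 5 and 6 are the critical ones: the alias moves only after validation passes. The sketch below models that rule in memory. IndexAliasRegistry is a hypothetical class, not a search-cluster client; a real cluster exposes its own alias API, and a count check alone is weaker than the count-plus-sample validation described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of index aliases, for illustration only.
final class IndexAliasRegistry {
    private final Map<String, String> aliasToIndex = new HashMap<>();
    private final Map<String, Long> documentCounts = new HashMap<>();

    void registerIndex(String indexName, long documentCount) {
        documentCounts.put(indexName, documentCount);
    }

    void pointAlias(String alias, String indexName) {
        aliasToIndex.put(alias, indexName);
    }

    String resolve(String alias) {
        return aliasToIndex.get(alias);
    }

    // Switches the alias only if the new index holds the expected number of documents.
    // On failure the alias keeps pointing at the old index, which is left untouched.
    boolean switchIfValid(String alias, String newIndex, long expectedCount) {
        Long actual = documentCounts.get(newIndex);
        if (actual == null || actual != expectedCount) {
            return false;
        }
        aliasToIndex.put(alias, newIndex);
        return true;
    }
}
```

Nothing in this flow deletes the old index, so rollback during the acceptance window is just pointing the alias back.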

7.2 Reindex Must Not Be Domain Mutation

Reindexing must not change a file's lifecycle.

Bad:

If file missing from object storage during reindex, mark file DELETED.

Better:

Emit mismatch metric and reconciliation candidate.
Domain repair flow decides status change.

Reindex builds projections. Reconciliation and repair fix source inconsistencies.

8. Object Storage and Metadata Reconciliation

A file platform has two physical sources:

metadata DB;
object storage.

Drift between them is a normal consequence of partial failure.

8.1 Drift Types

Drift	Contoh	Risk
Metadata without object	DB row points to missing key	Download failure, audit issue
Object without metadata	Upload succeeded, DB commit failed	Orphan cost, retention ambiguity
Checksum mismatch	Metadata sha256 != actual object hash	Integrity violation
Status mismatch	Metadata ACCEPTED but object in quarantine prefix	Security/lifecycle bug
Retention mismatch	Metadata legal hold active, storage no lock	Compliance risk
Version mismatch	Metadata object version not current	Wrong payload served

8.2 Reconciliation Job

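import java.time.Instant;
import java.util.List;

// Read-only job: it records findings and never modifies metadata or objects.
public final class FileStorageReconciliationJob {
    private final EvidenceFileRepository files;
    private final ObjectStorage storage;
    private final ReconciliationFindingRepository findings;

    public FileStorageReconciliationJob(
            EvidenceFileRepository files,
            ObjectStorage storage,
            ReconciliationFindingRepository findings) {
        this.files = files;
        this.storage = storage;
        this.findings = findings;
    }

    public void reconcileBatch(Instant olderThan, int limit) {
        List<StoredFile> batch = files.findActiveFilesForReconciliation(olderThan, limit);

        for (StoredFile file : batch) {
            ObjectHead head = storage.headObject(file.bucket(), file.storageKey(), file.objectVersion());

            // Metadata points at an object that does not exist: nothing else to compare.
            if (!head.exists()) {
                findings.recordMissingObject(file.fileId(), file.storageKey());
                continue;
            }

            // Stored checksum disagrees with the object: integrity finding.
            if (!file.sha256().equals(head.sha256())) {
                findings.recordChecksumMismatch(file.fileId(), file.sha256(), head.sha256());
            }

            // Metadata expects retention, but storage does not enforce it.
            if (file.retentionUntil() != null && !head.hasRetentionProtection()) {
                findings.recordRetentionMismatch(file.fileId(), file.retentionUntil());
            }
        }
    }
}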

Important:

reconciliation records finding;
repair is separate;
dangerous fixes require approval or domain rule;
findings are observable;
repeated findings must not spam without dedupe.
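The last point, deduplication, can be as simple as keying open findings by file and type. The sketch below assumes a hypothetical DedupingFindingStore; the key format and the occurrence counter are illustrative choices, not a prescribed schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical finding store that collapses repeats of the same problem into one record.
final class DedupingFindingStore {
    record Finding(String fileId, String type, int occurrences) {}

    private final Map<String, Finding> openFindings = new LinkedHashMap<>();

    // Returns true only the first time a (fileId, type) pair is seen, so callers can
    // alert once and merely bump a counter on later reconciliation runs.
    boolean report(String fileId, String type) {
        String key = fileId + "|" + type;
        Finding existing = openFindings.get(key);
        if (existing == null) {
            openFindings.put(key, new Finding(fileId, type, 1));
            return true;
        }
        openFindings.put(key, new Finding(fileId, type, existing.occurrences() + 1));
        return false;
    }

    int openCount() {
        return openFindings.size();
    }

    int occurrences(String fileId, String type) {
        Finding finding = openFindings.get(fileId + "|" + type);
        return finding == null ? 0 : finding.occurrences();
    }
}
```

A daily job that sees the same missing object for a week then produces one open finding with seven occurrences instead of seven alerts.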

8.3 Orphan Object Handling

An object without metadata can result from a failed commit.

Safe cleanup flow:

1. List objects in temp/quarantine prefix older than threshold
2. Check metadata DB by uploadSessionId/fileId/objectKey
3. If no metadata and object is older than safety window, mark orphan candidate
4. Emit metric and audit-like operational record
5. Delete only if object is in temp prefix and not under legal retention
6. Never auto-delete accepted/final prefix without stronger proof

Orphan cleanup must be conservative. Paying for a few extra days of storage is better than deleting final evidence.
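The decision part of that flow fits in a small pure function. The sketch below is one possible reading of the steps, with assumed names (OrphanPolicy, ObjectInfo, OrphanAction) and an assumed prefix convention; the clock is passed in so the rule stays testable.

```java
import java.time.Duration;
import java.time.Instant;

enum OrphanAction { KEEP, MARK_CANDIDATE, DELETE }

// Hypothetical view of one stored object, combined from a storage listing and a metadata lookup.
record ObjectInfo(String key, Instant lastModified, boolean hasMetadata, boolean underLegalRetention) {}

final class OrphanPolicy {
    private final String tempPrefix;
    private final Duration safetyWindow;

    OrphanPolicy(String tempPrefix, Duration safetyWindow) {
        this.tempPrefix = tempPrefix;
        this.safetyWindow = safetyWindow;
    }

    OrphanAction decide(ObjectInfo object, Instant now) {
        if (object.hasMetadata()) {
            return OrphanAction.KEEP; // referenced object, not an orphan
        }
        boolean oldEnough = object.lastModified().plus(safetyWindow).isBefore(now);
        if (!oldEnough) {
            return OrphanAction.KEEP; // the upload may still be committing
        }
        boolean inTempPrefix = object.key().startsWith(tempPrefix);
        if (inTempPrefix && !object.underLegalRetention()) {
            return OrphanAction.DELETE; // only temp, unretained objects are auto-deleted
        }
        return OrphanAction.MARK_CANDIDATE; // final prefix or retained: needs stronger proof
    }
}
```

Anything outside the temp prefix can at most become a candidate, which keeps the "never auto-delete final evidence" rule enforced by code rather than by convention.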

9. Metadata Repair

Repair is a state change that restores an invariant.

Repair is more dangerous than reconciliation because repair writes.

9.1 Repair Command

public record RepairFileMetadataCommand(
    String repairId,
    String fileId,
    String findingId,
    String repairType,
    String reason,
    String approvedBy,
    Instant approvedAt
) {}

A repair must be:

idempotent;
auditable;
scoped;
approval-aware for critical data;
reversible where possible;
recorded as an event.
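Idempotency and auditability can be combined by keying the repair on its repairId. The sketch below is a minimal in-memory illustration; IdempotentRepairExecutor, its method names, and the audit string format are all assumptions, and a real executor would persist these in one transaction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical executor: the names and the audit format are illustrative only.
final class IdempotentRepairExecutor {
    private final Map<String, String> fileStatus = new HashMap<>();
    private final Map<String, String> appliedRepairs = new HashMap<>(); // repairId -> outcome
    private final List<String> auditTrail = new ArrayList<>();

    void seed(String fileId, String status) {
        fileStatus.put(fileId, status);
    }

    // Applies the repair once per repairId. A retry returns the recorded outcome
    // without touching state or writing a second audit entry.
    String markMissingPayload(String repairId, String fileId, String approvedBy) {
        String previousOutcome = appliedRepairs.get(repairId);
        if (previousOutcome != null) {
            return previousOutcome;
        }
        if (approvedBy == null || approvedBy.isBlank()) {
            throw new IllegalArgumentException("approval is required for this repair type");
        }
        String before = fileStatus.get(fileId);
        fileStatus.put(fileId, "MISSING_PAYLOAD");
        String outcome = fileId + ":" + before + "->MISSING_PAYLOAD";
        appliedRepairs.put(repairId, outcome);
        auditTrail.add(repairId + " by " + approvedBy + " " + outcome);
        return outcome;
    }

    String status(String fileId) {
        return fileStatus.get(fileId);
    }

    List<String> auditTrail() {
        return List.copyOf(auditTrail);
    }
}
```

Recording the previous status in the outcome also gives a starting point for reversing the repair later.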

9.2 Repair Types

Repair	Safe Automation?	Notes
Expire stale UPLOADING session	Usually yes	If past its TTL
Delete temp orphan object	Usually yes	Temp prefix only
Recompute checksum from object	Sometimes	Requires trusting the object payload
Mark metadata as MISSING_PAYLOAD	Maybe	High domain impact
Delete accepted metadata	No	Requires review
Remove legal hold	No	Compliance decision

9.3 Repair as State Machine

Repair is itself a workflow. Do not run large repairs with manual scripts that have no state and no audit trail.
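One way to give a repair explicit state is a small transition table. The states and transitions below are hypothetical, chosen only to illustrate the idea; the actual workflow depends on your approval and verification process.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical repair workflow states; these names are assumptions for the sketch.
enum RepairState {
    PROPOSED, APPROVED, EXECUTING, VERIFIED, FAILED, REJECTED;

    private static final Map<RepairState, Set<RepairState>> ALLOWED = Map.of(
        PROPOSED, EnumSet.of(APPROVED, REJECTED),
        APPROVED, EnumSet.of(EXECUTING),
        EXECUTING, EnumSet.of(VERIFIED, FAILED),
        FAILED, EnumSet.of(EXECUTING),               // a failed repair may be retried
        VERIFIED, EnumSet.noneOf(RepairState.class), // terminal
        REJECTED, EnumSet.noneOf(RepairState.class)  // terminal
    );

    boolean canMoveTo(RepairState next) {
        return ALLOWED.get(this).contains(next);
    }

    // Returns the next state, or fails loudly on an illegal transition.
    RepairState moveTo(RepairState next) {
        if (!canMoveTo(next)) {
            throw new IllegalStateException("illegal repair transition: " + this + " -> " + next);
        }
        return next;
    }
}
```

With this in place a repair cannot jump from PROPOSED straight to EXECUTING, so the approval step is enforced structurally, and each transition is a natural point to write an audit record.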

10. Replay Drift

Replay drift occurs when the replay result differs from the original state even though the events are the same.

Common causes:

handler logic has changed;
current config differs from the config in effect when the event occurred;
a faulty event schema migration;
external lookups return newer data;
events store only the input, not the final decision;
unstable event ordering;
missing events;
duplicate events handled non-idempotently.

10.1 Detect Replay Drift

Use a comparison:

Original projection snapshot hash
vs
Replayed projection snapshot hash

Example table for the results:

CREATE TABLE replay_validation_result (
  replay_id TEXT NOT NULL,
  aggregate_id TEXT NOT NULL,
  original_hash TEXT,
  replayed_hash TEXT,
  status TEXT NOT NULL,
  diff_summary JSONB,
  created_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (replay_id, aggregate_id)
);
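The two hash columns only help if the hash is computed over a canonical form. The sketch below assumes the projection can be flattened to string fields; ProjectionHasher and the key=value encoding are illustrative choices, and values that contain '=' or newlines would need a stricter encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for comparing an original projection with a replayed one.
final class ProjectionHasher {
    private ProjectionHasher() {}

    // Hashes fields in sorted key order so map iteration order cannot change the result.
    static String hash(Map<String, String> projectionFields) {
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, String> entry : new TreeMap<>(projectionFields).entrySet()) {
            canonical.append(entry.getKey()).append('=').append(entry.getValue()).append('\n');
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(bytes);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    // Produces a value suitable for the status column of replay_validation_result.
    static String compare(Map<String, String> original, Map<String, String> replayed) {
        return hash(original).equals(hash(replayed)) ? "MATCH" : "DRIFT";
    }
}
```

Fields that are expected to differ between runs, such as indexedAt, should be excluded before hashing, otherwise every aggregate will be reported as drift.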

10.2 Reduce Replay Drift

Design rules:

store policy version in event;
store decision result, not only input;
avoid external calls during replay;
version event schema;
store actor and reason;
use deterministic ordering per aggregate;
treat replay as pure function over event stream;
record projection version.

11. Backfill Without Breaking Production

Backfills are often underestimated. Yet a backfill is a temporary production write path.

Example:

Add new column evidence_file.normalized_content_type
Backfill all existing rows based on detectedContentType and file extension

Risks:

large locks;
high DB load;
inconsistent partial state;
bad logic that writes wrong data;
the backfill colliding with live writes;
difficult rollback.

11.1 Backfill Pattern

1. Add nullable column
2. Deploy code that writes new column for new records
3. Backfill old records in small batches
4. Validate count and sample
5. Make read path prefer new column if present
6. Add constraint only after data complete
7. Remove fallback later

Pseudo-code:

public void backfillNormalizedContentType(int limit) {
    List<FileMetadata> batch = repository.findMissingNormalizedContentType(limit);

    for (FileMetadata file : batch) {
        String normalized = contentTypeNormalizer.normalize(
            file.detectedContentType(),
            file.originalFilename()
        );
        repository.updateNormalizedContentType(file.fileId(), file.version(), normalized);
    }
}

Use a version check. Do not overwrite a concurrent update.

12. Restore vs Replay

Restore and replay are often used interchangeably, but they are different.

Aspect	Restore	Replay
Source	Backup/snapshot	Event log/source records
Target	Durable store	Projection/state store
Granularity	DB/table/object/set	Event/entity/projection
Risk	Data loss since backup	Logic drift, missing event
Speed	Can be fast for snapshots	Depends on event volume
Correctness	State at backup time	State according to replay logic

For a source-of-truth DB, restoring a backup is often more appropriate than event replay when the event stream is incomplete.

For a search index, replay or reindex is more appropriate than restoring a backup because the index is derived.

13. Runbook for Reconstruction

Every important state must have a runbook.

Template:

## State Reconstruction Runbook

### State
- Name:
- Owner:
- Source of truth:
- Derived stores:

### Failure symptoms
- Metrics:
- Alerts:
- User impact:

### Safety classification
- Can rebuild automatically: yes/no
- Requires approval: yes/no
- Data loss risk:
- Compliance risk:

### Reconstruction method
- Replay / reindex / reconcile / repair / restore:
- Input source:
- Target:
- Batch size:
- Checkpoint:
- Idempotency key:

### Validation
- Count check:
- Hash check:
- Sample check:
- Invariant check:

### Rollback
- How to stop:
- How to revert target:
- How to preserve evidence:

### Audit
- Events emitted:
- Operator identity:
- Change ticket:

14. Engineering Checklist

Before declaring state "production-grade", answer:

Reconstructability

Is this state authoritative or derived?
If it is lost, can it be rebuilt?
From which source?
How long does a rebuild take?
What are its RTO and RPO?
Does the rebuild require downtime?

Replay

Are events facts or commands?
Do events have stable IDs?
Are handlers idempotent?
Where are offsets and checkpoints stored?
Are external side effects disabled during replay?
Is replay deterministic?
Is the event schema versioned?

Reconciliation

Which sources are compared?
Which mismatches are possible?
Does the job only record findings, or does it repair directly?
Does repair require approval?
Are false positives safe?

Backfill

Do live writes already support the new field?
Are batches small?
Is a checkpoint available?
Is validation available?
Is rollback available?

Audit

Are reconstruction actions recorded?
Is operator identity recorded?
Can repair results be proven?
Is sensitive data in events minimized?

15. Key Takeaways

State reconstruction is not an add-on feature. It is a core part of reliability.

The main principles:

Every state needs a recovery story.
Replay is only one reconstruction technique.
Derived state should be disposable and rebuildable.
Source of truth must be explicit per fact, not per system.
Replay must be deterministic and side-effect controlled.
Offset/checkpoint must align with side effect commit boundary.
Reconciliation detects drift; repair changes state. Do not mix them casually.
Backfill is a temporary production write path and needs the same discipline as normal code.
Regulated systems need audit-safe reconstruction, not silent scripts.

The next part closes the state block with consistency and failure modeling: lost update, duplicate event, split brain, replay drift, stale read, and how Java microservices must withstand all of them.

References

Apache Kafka Documentation: https://kafka.apache.org/documentation/
Kafka Design: Delivery Semantics: https://docs.confluent.io/kafka/design/delivery-semantics.html
Kafka Log Compaction: https://docs.confluent.io/kafka/design/log_compaction.html
Spring Batch Reference Documentation: https://docs.spring.io/spring-batch/reference/
Kubernetes Leases: https://kubernetes.io/docs/concepts/architecture/leases/
