State Consistency and Failure Modeling

Learn Java Microservices File Handling, State, Configuration and Secret Management - Part 034

State consistency and failure modeling for Java microservices: lost update, duplicate event, stale read, split brain, replay drift, fencing token, optimistic locking, and convergence.

Part 034 — State Consistency and Failure Modeling

Consistency bugs are not bugs in syntax.

They are bugs in the mental model of time.

Microservices do not fail only because a server is down. They fail because two correct things happen at the wrong time:

  • two valid requests modify the same row concurrently;
  • a correct event arrives twice;
  • a correct event arrives late;
  • a cache was correct 10 seconds ago but is wrong now;
  • the old leader keeps working after a new leader has been elected;
  • a retry meant to heal instead duplicates a side effect;
  • a rebuilt projection uses new logic and produces a different historical decision;
  • config has changed on some pods but not yet on others;
  • a secret has been rotated but some connections still use the old credential.

This part closes the state management block with failure modeling.

The goal is not to memorize CAP terminology or isolation levels. The goal is to be able to look at a Java microservice design and ask:

What happens if this operation happens twice, concurrently, late,
out of order, partially successfully, or is run by two leaders?

1. Consistency as Invariant Preservation

Consistency does not mean that every node always sees the same data at the same time.

In production design, consistency means:

System behavior preserves declared invariants under concurrency, retry,
partial failure, stale read, and recovery.

Example invariant:

A file cannot be ACCEPTED unless scan result is CLEAN and checksum is verified.

Consistency questions:

  • What if the scan result is duplicated?
  • What if a clean result arrives after the file has already been rejected?
  • What if checksum verification and the scan result complete at the same time?
  • What if the worker crashes after updating the status but before writing the audit event?
  • What if a replay produces an accepted state without a scan event?

Consistency must be modeled against concrete invariants.
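The example invariant can be written as an executable check, which makes the questions above testable. This is a minimal sketch: `AcceptanceInvariant`, `ScanResult`, and `FileRecord` are hypothetical names introduced for illustration and are not part of the lesson's model.

```java
import java.util.Optional;

// Hypothetical types for illustration only.
final class AcceptanceInvariant {
    enum ScanResult { CLEAN, INFECTED }

    // scanResult stays empty until a scan event has been recorded for the file.
    record FileRecord(String fileId, Optional<ScanResult> scanResult, boolean checksumVerified) {}

    // True only when the declared invariant allows the file to become ACCEPTED:
    // a CLEAN scan result exists and the checksum has been verified.
    static boolean mayAccept(FileRecord file) {
        return file.scanResult().filter(r -> r == ScanResult.CLEAN).isPresent()
            && file.checksumVerified();
    }
}
```

A replay that tries to accept a file with no scan result fails this check, which is the last question in the list.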


2. Failure Model Vocabulary

Use this vocabulary during design reviews.

| Failure | Form | Risk |
|---|---|---|
| Lost update | Two writers overwrite each other's changes | State is lost |
| Dirty read | Reading uncommitted data | Decisions based on transient data |
| Non-repeatable read | Re-reading returns a different result | Multi-step logic becomes unstable |
| Phantom read | The result set of a range query changes | Batch/reconciliation skips or duplicates rows |
| Stale read | Data has changed but the cache/replica is old | Wrong authorization/config |
| Duplicate event | Event processed more than once | Duplicated side effects |
| Out-of-order event | An old event arrives after a newer one | Status moves backward |
| Missing event | Event is never received | Incomplete projection |
| Partial failure | One side effect succeeds, another fails | Drift |
| Split brain | Two actors each believe they are the leader | Double processing |
| Replay drift | Replay differs from the original | Audit/projection mismatch |
| Clock skew | Nodes have different times | Wrong TTL/lease/ordering |
| Config skew | Pods use different config | Inconsistent behavior |
| Secret skew | Some pods use the old credential, some the new | Sporadic auth failures |

A clear vocabulary avoids vague discussions such as “looks like a race condition”.


3. Lost Update

A lost update occurs when two transactions read the same old state and then each write a new state, so one change is lost.

Bad example:

T1 reads file.status = SCANNED
T2 reads file.status = SCANNED
T1 writes ACCEPTED
T2 writes REJECTED
Final status = REJECTED
Accepted audit may already exist

3.1 Optimistic Locking

Use a version column.

ALTER TABLE evidence_file
ADD COLUMN version BIGINT NOT NULL DEFAULT 0;

Update:

UPDATE evidence_file
SET status = ?, version = version + 1, updated_at = now()
WHERE file_id = ? AND version = ?;

Java boundary:

public final class EvidenceFileRepository {
    public void updateStatus(FileId fileId, long expectedVersion, FileLifecycleStatus next) {
        int updated = jdbc.update("""
            UPDATE evidence_file
            SET status = ?, version = version + 1, updated_at = now()
            WHERE file_id = ? AND version = ?
        """, next.name(), fileId.value(), expectedVersion);

        if (updated != 1) {
            throw new ConcurrentModificationException(
                "Evidence file was modified concurrently: " + fileId.value()
            );
        }
    }
}

Optimistic locking fits when conflicts are rare but must be detected.
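The version check can be tried without a database. The sketch below is an in-memory stand-in for the versioned `UPDATE`: `VersionedStatusStore` and its methods are hypothetical names, and a `ConcurrentHashMap` compare-and-replace takes the place of the `WHERE ... AND version = ?` predicate.

```java
import java.util.ConcurrentModificationException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory illustration only; a real repository would run the SQL UPDATE instead.
final class VersionedStatusStore {
    record Versioned(String status, long version) {}

    private final ConcurrentMap<String, Versioned> rows = new ConcurrentHashMap<>();

    void insert(String fileId, String status) {
        rows.put(fileId, new Versioned(status, 0));
    }

    Versioned read(String fileId) {
        return rows.get(fileId);
    }

    // Succeeds only if nobody has bumped the version since the caller read the row.
    void updateStatus(String fileId, long expectedVersion, String next) {
        Versioned current = rows.get(fileId);
        boolean swapped = current != null
            && current.version() == expectedVersion
            // Atomic compare-and-replace: fails if another writer changed the row first.
            && rows.replace(fileId, current, new Versioned(next, expectedVersion + 1));
        if (!swapped) {
            throw new ConcurrentModificationException(
                "Evidence file was modified concurrently: " + fileId);
        }
    }
}
```

If two writers both read version 0, the first update moves the row to version 1 and the second is rejected instead of silently overwriting it.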

3.2 Pessimistic Locking

For critical transitions where a conflict is expensive:

SELECT *
FROM evidence_file
WHERE file_id = ?
FOR UPDATE;

Use it with care:

  • keep the lock duration short;
  • do not call external services while the lock is held;
  • avoid inconsistent lock ordering;
  • measure deadlocks;
  • set a timeout.

3.3 Domain Transition Guard

A lock without a transition guard can still go wrong.

public void transitionTo(FileLifecycleStatus next) {
    if (!this.status.canMoveTo(next)) {
        throw new IllegalStateException("Invalid transition " + this.status + " -> " + next);
    }
    this.status = next;
}

A DB guard plus a domain guard is stronger than either one alone.


4. Duplicate Event

At-least-once delivery means duplicate events are normal.

Kafka and other brokers can provide delivery guarantees, but the consumer must still model duplicate handling because of retries, rebalances, crashes before offset commit, and sink failures.

4.1 Idempotent Consumer

CREATE TABLE processed_event (
  consumer_name TEXT NOT NULL,
  event_id TEXT NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (consumer_name, event_id)
);

Handler:

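@Transactional
public void onFileAccepted(FileAccepted event) {
    // Fast path: skip events this consumer has already processed.
    if (processedEventRepository.exists("file-projection", event.eventId())) {
        return;
    }

    projectionRepository.markAccepted(
        event.fileId(),
        event.acceptedAt(),
        event.checksum()
    );

    // Written in the same transaction as the projection update. If a concurrent
    // delivery races past the check above, the primary key on
    // (consumer_name, event_id) rejects the second insert and that transaction
    // should roll back.
    processedEventRepository.insert("file-projection", event.eventId());
}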

Key point:

Idempotency record and side effect must commit together.

If the idempotency record commits but the side effect fails, the event is treated as processed even though it was not.
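The "commit together" rule can be illustrated in memory. In this sketch a single lock stands in for the database transaction; `InMemoryIdempotentProjection` and its methods are hypothetical names, and a real consumer would rely on a transaction rather than `synchronized`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory illustration only: the lock plays the role of the DB transaction.
final class InMemoryIdempotentProjection {
    private final Set<String> processedEventIds = new HashSet<>();
    private final Map<String, String> statusByFileId = new HashMap<>();

    // Returns true if the event was applied, false if it was a duplicate.
    synchronized boolean onFileAccepted(String eventId, String fileId) {
        if (processedEventIds.contains(eventId)) {
            return false; // already processed: no side effect
        }
        // Both writes happen under the same lock, so callers of the other
        // synchronized methods never observe one without the other.
        statusByFileId.put(fileId, "ACCEPTED");
        processedEventIds.add(eventId);
        return true;
    }

    synchronized String statusOf(String fileId) {
        return statusByFileId.get(fileId);
    }

    synchronized int processedCount() {
        return processedEventIds.size();
    }
}
```

Delivering the same event twice applies it once; the second call returns `false` and changes nothing.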

4.2 Natural Idempotency

Some operations are naturally idempotent:

UPDATE evidence_projection
SET status = 'ACCEPTED', accepted_at = ?
WHERE file_id = ? AND status <> 'ACCEPTED';

But be careful. Natural idempotency is safe only if the repeated operation really produces the same state and emits no additional side effects.


5. Out-of-Order Event

Events can arrive out of global order. Even when per-partition ordering is preserved, events for the same aggregate must use the same key so that they land in the same partition.
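Why the key matters can be shown with a toy partitioner. This is a sketch of key-based partition selection in general, not the hash function of any particular broker client; `KeyPartitioner` is a hypothetical name.

```java
// Illustration only: real clients use their own hash function, but the principle
// is the same: the partition is a deterministic function of the key.
final class KeyPartitioner {
    static int partitionFor(String key, int partitionCount) {
        // floorMod keeps the result in [0, partitionCount) even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }
}
```

If every event for one file uses the file ID as its key, all of them map to one partition and keep their relative order. Keying by event ID instead would scatter them across partitions.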

5.1 Versioned Event

public record FileLifecycleChanged(
    String eventId,
    String fileId,
    long aggregateVersion,
    String previousStatus,
    String nextStatus,
    Instant occurredAt
) {}

Consumer:

@Transactional
public void apply(FileLifecycleChanged event) {
    Projection projection = repository.get(event.fileId());

    if (event.aggregateVersion() <= projection.lastSeenVersion()) {
        return; // duplicate or old event
    }

    if (event.aggregateVersion() != projection.lastSeenVersion() + 1) {
        throw new MissingEventGapException(event.fileId(), projection.lastSeenVersion(), event.aggregateVersion());
    }

    repository.apply(event);
}

Gap detection matters for projections that need a complete event stream.

5.2 Monotonic State Guard

For certain lifecycles, do not allow the status to move backward.

public boolean isTerminal() {
    return this == ACCEPTED || this == REJECTED || this == DELETED;
}

But do not rely on enum ordinals alone. Lifecycles often branch. Build an explicit transition matrix.
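One way to write such a matrix is an `EnumMap` from each status to the set of statuses it may move to. The sketch below uses a hypothetical `LifecycleStatus` enum; the status set and the allowed edges are assumptions chosen for illustration, not the lesson's actual lifecycle.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical lifecycle: the statuses and edges are illustrative assumptions.
enum LifecycleStatus {
    UPLOADED, SCANNED, ACCEPTED, REJECTED;

    private static final Map<LifecycleStatus, Set<LifecycleStatus>> ALLOWED =
        new EnumMap<>(LifecycleStatus.class);

    static {
        ALLOWED.put(UPLOADED, EnumSet.of(SCANNED));
        ALLOWED.put(SCANNED, EnumSet.of(ACCEPTED, REJECTED)); // the lifecycle branches here
        ALLOWED.put(ACCEPTED, EnumSet.noneOf(LifecycleStatus.class)); // terminal
        ALLOWED.put(REJECTED, EnumSet.noneOf(LifecycleStatus.class)); // terminal
    }

    // Looks up the explicit matrix instead of comparing ordinals.
    boolean canMoveTo(LifecycleStatus next) {
        return ALLOWED.get(this).contains(next);
    }
}
```

An ordinal comparison would allow `ACCEPTED -> REJECTED` because `REJECTED` is declared later; the matrix rejects it.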


6. Stale Read

A stale read occurs when a service reads an old value from a cache, replica, local memory, config cache, or token claim.

A stale read is not always bad. It is bad when it violates an invariant.

6.1 Classify Staleness Tolerance

| Data | Staleness Tolerance | Strategy |
|---|---|---|
| Product display name | Minutes to hours | Cache OK |
| File download permission | Low | Recheck source/policy |
| Upload size limit | Medium | Short TTL + server-side validation |
| Secret credential | TTL-bound | Refresh before expiry |
| Legal hold status | Very low | Authoritative read before delete |
| Retention policy | Low | Versioned policy snapshot |
| Search index | Minutes may be OK | Show eventual consistency expectation |

6.2 Critical Action Recheck

For risky actions, do not depend on a stale cache.

public DownloadTicket createDownloadTicket(UserContext user, FileId fileId) {
    StoredFile file = fileRepository.getRequired(fileId);
    PermissionDecision decision = accessPolicy.evaluateAuthoritatively(user, file);

    if (!decision.allowed()) {
        throw new AccessDeniedException("Not allowed to download file payload");
    }

    return downloadTicketIssuer.issue(file, user, Duration.ofMinutes(5));
}

A cache may speed up the UI, but the final action must remain authoritative.
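The split between a cached read and an authoritative read can be sketched with a small TTL cache. `TtlCache`, `get`, and `getAuthoritative` are hypothetical names; the caller passes the current time explicitly so the behavior is deterministic in tests.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: display paths may call get(), critical actions call getAuthoritative().
final class TtlCache<K, V> {
    private record Entry<V>(V value, Instant loadedAt) {}

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final Function<K, V> authoritativeLoader;
    private final Duration ttl;

    TtlCache(Function<K, V> authoritativeLoader, Duration ttl) {
        this.authoritativeLoader = authoritativeLoader;
        this.ttl = ttl;
    }

    // May return a value that is up to `ttl` old.
    synchronized V get(K key, Instant now) {
        Entry<V> entry = entries.get(key);
        if (entry == null || !now.isBefore(entry.loadedAt().plus(ttl))) {
            entry = new Entry<>(authoritativeLoader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value();
    }

    // Always reads the source of truth, then refreshes the cached entry.
    synchronized V getAuthoritative(K key, Instant now) {
        V value = authoritativeLoader.apply(key);
        entries.put(key, new Entry<>(value, now));
        return value;
    }
}
```

When a permission flips from allowed to denied, `get` can keep returning the old answer until the TTL expires, while `getAuthoritative` sees the change immediately.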


7. Partial Failure

Partial failure is the core of distributed systems.

Example:

1. DB metadata created
2. Object uploaded
3. Audit event write failed

Or:

1. Secret rotated in manager
2. Some pods reload
3. Some pods keep old connection
4. Old secret revoked too early

7.1 Commit Point

For every workflow, define the commit point.

File upload example:

Commit point = metadata status moves to UPLOADED/QUARANTINED with verified object pointer.

Before the commit point, cleanup may discard temp objects. After the commit point, the file must be treated as a durable artifact.

7.2 Outbox Pattern

If the DB update and the event publish must be consistent, use an outbox.

CREATE TABLE outbox_event (
  event_id TEXT PRIMARY KEY,
  aggregate_type TEXT NOT NULL,
  aggregate_id TEXT NOT NULL,
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  published_at TIMESTAMPTZ
);

In a single transaction:

- update evidence_file status
- insert outbox_event FILE_ACCEPTED

A separate publisher reads the outbox and publishes to the broker.

This prevents the DB state from changing without an event, as long as the outbox publisher and reconciliation are working.
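The two halves of the pattern can be sketched in memory. `InMemoryOutbox` is a hypothetical name: one lock stands in for the DB transaction, a queue stands in for the `outbox_event` table, and the broker is modeled as a `Predicate` that returns `true` when a publish succeeds.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Predicate;

// Illustration only: a real outbox lives in the same database as the state it describes.
final class InMemoryOutbox {
    record OutboxEvent(String eventId, String aggregateId, String eventType) {}

    private final Map<String, String> statusByFileId = new HashMap<>();
    private final Queue<OutboxEvent> pending = new ArrayDeque<>();

    // Stands in for one transaction: the state change and the outbox row are written together.
    synchronized void acceptFile(String fileId, String eventId) {
        statusByFileId.put(fileId, "ACCEPTED");
        pending.add(new OutboxEvent(eventId, fileId, "FILE_ACCEPTED"));
    }

    // Publisher step: an event leaves the outbox only after the broker call succeeds,
    // so a failed publish stays queued and is retried on the next run.
    synchronized int publishPending(Predicate<OutboxEvent> broker) {
        int published = 0;
        while (!pending.isEmpty() && broker.test(pending.peek())) {
            pending.remove();
            published++;
        }
        return published;
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

Because an event is removed only after a successful publish, a crash between publish and removal produces a duplicate, which is why consumers still need to be idempotent.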

7.3 Compensation

If the operation cannot be atomic, use compensation.

If object upload succeeds but metadata commit fails:
- mark object with temporary tag/uploadSessionId
- cleanup job deletes temp object after safety window

Compensation is not a perfect rollback. It is an explicit action that brings the system to a safe state.
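The cleanup rule above can be sketched as a pure selection function. `TempObjectCleanup`, `StoredObject`, and the `committed` flag are hypothetical; in a real system the flag would come from checking whether a metadata row points at the object.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustration only: selects which temp objects a cleanup job may delete.
final class TempObjectCleanup {
    // committed == true means the object is past the commit point and must be kept.
    record StoredObject(String key, Instant uploadedAt, boolean committed) {}

    static List<String> keysToDelete(List<StoredObject> objects, Instant now, Duration safetyWindow) {
        return objects.stream()
            .filter(o -> !o.committed())
            // Keep recent uploads: their metadata commit may still be in flight.
            .filter(o -> o.uploadedAt().plus(safetyWindow).isBefore(now))
            .map(StoredObject::key)
            .toList();
    }
}
```

The safety window is what keeps the job from deleting an object whose upload finished a moment ago but whose metadata transaction has not committed yet.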


8. Split Brain and Leader Election

Split brain occurs when two actors each believe they hold exclusive authority.

Examples:

  • two schedulers run the same cleanup job;
  • two workers perform compaction/repair at the same time;
  • the old leader is slow to notice that its lease has expired;
  • a network partition makes a lock appear valid on one side;
  • a pod pauses or freezes in GC, then resumes writing after its lease is gone.

Kubernetes has a Lease API that is used for leader election in high-availability components. But holding a lease is not enough if the old worker can still write after losing leadership.

8.1 Fencing Token

A fencing token prevents the old leader from writing once its token has been superseded.

Leader A gets token 10
Leader B gets token 11
Storage accepts writes only with token >= current token
Leader A resumes and writes with token 10 -> rejected

Database schema:

CREATE TABLE singleton_job_fence (
  job_name TEXT PRIMARY KEY,
  current_token BIGINT NOT NULL
);

Worker update:

UPDATE repair_job
SET status = 'RUNNING', fencing_token = ?
WHERE job_id = ?
  AND ? >= (
    SELECT current_token FROM singleton_job_fence WHERE job_name = 'repair-worker'
  );

Lease selects leader. Fencing protects resource from stale leader.
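The resource-side check can be sketched in memory. `FencedResource` is a hypothetical name; it applies the same rule as the SQL predicate, accepting a write only when the caller's token is at least as new as the newest token it has seen.

```java
// Illustration only: the protected resource, not the leader, enforces the fence.
final class FencedResource {
    private long highestTokenSeen = 0;
    private String lastWriter = null;

    // Returns true if the write was accepted, false if the token is stale.
    synchronized boolean write(String writer, long fencingToken) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale leader: rejected
        }
        highestTokenSeen = fencingToken;
        lastWriter = writer;
        return true;
    }

    synchronized String lastWriter() {
        return lastWriter;
    }
}
```

Leader A writing with token 10 after leader B has written with token 11 is rejected, even if A still believes it holds the lease.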

8.2 Idempotent Scheduled Jobs

Even with leader election, scheduled jobs should be idempotent.

Leader election reduces duplicate execution.
Idempotency makes duplicate execution safe.

Do not rely on scheduler exclusivity as your only correctness boundary.


9. Replay Drift

Replay drift is a consistency failure between historical processing and replay processing.

Example:

Original event at T1:
- max upload size policy v3 allowed 150MB
Replay at T2:
- current policy v5 allows only 100MB
If replay re-evaluates policy, file becomes rejected.

This is wrong if the replay is meant to build a historical projection.

9.1 Store Decision Context

Events must carry context.

public record FileAccepted(
    String eventId,
    String fileId,
    String checksum,
    long sizeBytes,
    String policyVersion,
    String scanEngineVersion,
    Instant acceptedAt
) {}

Replay uses the decision result instead of re-evaluating the current policy.
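The difference between the two replay styles fits in a few lines. `ReplayProjection` and `FileDecided` are hypothetical names; the event shape is simplified to the fields needed to show the drift from the 150MB example.

```java
// Illustration only: contrasts trusting the recorded decision with re-evaluating it.
final class ReplayProjection {
    // Carries the outcome and the policy version that produced it.
    record FileDecided(String fileId, long sizeBytes, String policyVersion, boolean accepted) {}

    // Historical replay: trust the recorded decision.
    static String replayStatus(FileDecided event) {
        return event.accepted() ? "ACCEPTED" : "REJECTED";
    }

    // Source of drift: applying today's limit can flip a past decision.
    static String reevaluateStatus(FileDecided event, long currentMaxBytes) {
        return event.sizeBytes() <= currentMaxBytes ? "ACCEPTED" : "REJECTED";
    }
}
```

A 150MB file accepted under policy v3 stays `ACCEPTED` with `replayStatus`, while `reevaluateStatus` with a 100MB limit turns it into `REJECTED`.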

9.2 Projection Version

Projections may change, so version them.

evidence-search-projection-v1
evidence-search-projection-v2

Do not expect the projection v2 hash to match v1 if the schema or logic has genuinely changed. What must stay the same is the declared invariants.


10. Config and Secret Consistency

This part is not the config/secret block, but the consistency aspects of config and secrets already need to be understood.

10.1 Config Skew

Config skew:

Pod A: maxUploadSize=100MB
Pod B: maxUploadSize=200MB

Impact:

  • user-visible behavior differs depending on the pod;
  • a canary may create skew deliberately;
  • a slow rollout can cause mixed behavior;
  • auditing is hard if the effective config is not recorded.

Mitigation:

  • version the config;
  • log the effective config version at startup;
  • include the config version in critical decision audits;
  • avoid runtime reload for boundary-changing config;
  • roll out via a clearly defined deployment strategy.

10.2 Secret Skew

Secret skew:

Pod A uses DB credential v1
Pod B uses DB credential v2
Secret manager revoked v1
Pod A starts failing

Mitigation:

  • overlap window;
  • dual credential acceptance;
  • pool max lifetime;
  • refresh before expiry;
  • observe old credential use;
  • revoke after all consumers moved.
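The last mitigation, revoking only after all consumers have moved, can be expressed as a simple gate. `RevocationGate` is a hypothetical name, and the map of pod name to reported credential version is an assumed input; how pods report the version they use is outside this sketch.

```java
import java.util.Map;

// Illustration only: decides whether an old credential version can be revoked.
final class RevocationGate {
    // versionByPod maps a pod name to the credential version it last reported using.
    static boolean safeToRevoke(String oldVersion, Map<String, String> versionByPod) {
        return versionByPod.values().stream().noneMatch(oldVersion::equals);
    }
}
```

With pod A still on v1 and pod B on v2, the gate returns `false` for v1, so the revocation step waits.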

11. Consistency Pattern Catalog

| Problem | Pattern | Notes |
|---|---|---|
| Lost update | Optimistic locking | Good default for domain rows |
| High-conflict transition | Pessimistic lock | Keep transaction short |
| Duplicate event | Idempotent consumer | Store event ID with side effect |
| DB update + event publish | Transactional outbox | Avoid dual-write gap |
| External side effect retry | Idempotency key | Scope key by operation |
| Out-of-order event | Aggregate version | Detect gap/old event |
| Stale cache | TTL + invalidation + critical recheck | Classify tolerance |
| Split brain | Lease + fencing token | Lease alone may not protect writes |
| Partial object/metadata write | Lifecycle state + reconciliation | Treat temp/final differently |
| Replay drift | Decision event + policy version | Avoid current policy re-eval |
| Backfill conflict | Small batch + version guard | Avoid long lock |
| Derived state drift | Reindex + validation | Projection disposable |

12. Failure Modeling with Timeline

Use timelines in design docs.

12.1 Duplicate Upload Finalization

T1 Client sends finalizeUpload(idempotencyKey=K)
T2 Service verifies object checksum
T3 Service updates metadata to UPLOADED
T4 Network timeout before response
T5 Client retries finalizeUpload(idempotencyKey=K)
T6 Service returns previous result, no new file created

Required:

  • idempotency key;
  • durable idempotency store;
  • same response semantics;
  • no duplicate metadata row.
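The retry at T5 can be sketched with an idempotency-keyed result store. `IdempotentFinalizer` is a hypothetical name, and the in-memory map is a simplification: the requirement above is a durable store, which a `ConcurrentHashMap` is not.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: a production store must survive restarts.
final class IdempotentFinalizer {
    private final ConcurrentMap<String, String> resultByKey = new ConcurrentHashMap<>();
    private final AtomicInteger filesCreated = new AtomicInteger();

    // computeIfAbsent runs the finalization at most once per key;
    // a retry with the same key gets the stored result back.
    String finalizeUpload(String idempotencyKey) {
        return resultByKey.computeIfAbsent(
            idempotencyKey,
            key -> "file-" + filesCreated.incrementAndGet());
    }

    int filesCreated() {
        return filesCreated.get();
    }
}
```

Calling `finalizeUpload("K")` twice returns the same file ID and creates one file, which matches T6 in the timeline.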

12.2 Worker Crash

T1 Worker reads FILE_UPLOADED event
T2 Worker starts scan
T3 Worker writes scan result CLEAN
T4 Worker crashes before commit offset
T5 Event redelivered
T6 Worker sees scan result already exists and returns safely

Required:

  • scan result keyed by fileId + scannerVersion or eventId;
  • idempotent write;
  • no duplicate downstream transition.

12.3 Stale Leader

T1 Pod A acquires lease token 10
T2 Pod A pauses for 90s
T3 Pod B acquires lease token 11
T4 Pod B runs repair job
T5 Pod A resumes and tries to write with token 10
T6 DB/storage rejects token 10

Required:

  • fencing token stored with writes;
  • resource validates token;
  • job idempotent.

13. Java Design Guidelines

13.1 Make Time Explicit

Avoid hidden current time in domain logic.

Bad:

if (Instant.now().isAfter(file.retentionUntil())) {
    file.delete();
}

Better:

public void requestDeletion(Instant decisionTime, RetentionPolicy policy) {
    if (!policy.canDelete(this, decisionTime)) {
        throw new RetentionViolationException(id.value());
    }
    this.status = FileLifecycleStatus.DELETION_REQUESTED;
}

Explicit time makes tests and audits stronger.
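The testing benefit is easy to see once the decision time is a parameter. `RetentionCheck` is a hypothetical helper that reduces the retention rule to a pure function of two instants.

```java
import java.time.Instant;

// Illustration only: no call to Instant.now(), so the result depends only on the inputs.
final class RetentionCheck {
    // Deletion is allowed only strictly after the retention deadline.
    static boolean canDelete(Instant retentionUntil, Instant decisionTime) {
        return decisionTime.isAfter(retentionUntil);
    }
}
```

A test can pin both instants and check the boundary exactly, and an audit record can store the same `decisionTime` that was used for the decision.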

13.2 Make Version Explicit

public record VersionedFileState(
    FileId fileId,
    FileLifecycleStatus status,
    long version
) {}

Do not send a mutation command without an expected version for operations that can conflict.

public record AcceptFileCommand(
    FileId fileId,
    long expectedVersion,
    String scanResultId,
    String actorId
) {}

13.3 Make Idempotency Explicit

public record CommandEnvelope<T>(
    String commandId,
    String idempotencyKey,
    String actorId,
    Instant requestedAt,
    T payload
) {}

Idempotency is not an afterthought. It is part of the command contract.

13.4 Make Side Effects Explicit

Separate the domain decision from the side effects.

public record DomainDecision(
    List<StateMutation> mutations,
    List<OutboxEvent> events,
    List<ExternalActionRequest> externalActions
) {}

The more explicit the side effects, the easier the failure modeling.


14. Testing Consistency

14.1 Concurrent Test

@Test
void concurrentAcceptAndRejectCannotBothWin() throws Exception {
    FileId fileId = fixture.scannedFile();

    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
        Future<?> accept = pool.submit(() -> service.accept(fileId));
        Future<?> reject = pool.submit(() -> service.reject(fileId, "MALWARE"));

        // Wait for both tasks; only the expected conflict failure is swallowed.
        ignoreConflict(accept);
        ignoreConflict(reject);
    } finally {
        pool.shutdown(); // release the worker threads even if a task fails
    }

    StoredFile file = repository.getRequired(fileId);
    assertTrue(file.status() == ACCEPTED || file.status() == REJECTED);
    assertSingleTerminalAuditEvent(fileId);
}

14.2 Duplicate Event Test

@Test
void duplicateFileAcceptedEventIsIdempotent() {
    FileAccepted event = fixture.fileAcceptedEvent();

    consumer.onFileAccepted(event);
    consumer.onFileAccepted(event);

    assertEquals(1, projectionRepository.countAcceptedRows(event.fileId()));
    assertEquals(1, processedEventRepository.count(event.eventId()));
}

14.3 Out-of-Order Test

@Test
void oldLifecycleEventCannotMoveProjectionBackward() {
    // Assumes the fixture leaves the projection at version 4, so version 5 is the next expected event.
    consumer.apply(event(5, "ACCEPTED"));
    consumer.apply(event(4, "SCANNED")); // stale event: must be ignored

    Projection p = projectionRepository.get(fileId);
    assertEquals("ACCEPTED", p.status());
    assertEquals(5, p.lastSeenVersion());
}

14.4 Config Skew Test

Given pod A uses config version v1
And pod B uses config version v2
When both process upload finalization
Then audit event records config version
And behavior difference is explainable

14.5 Secret Rotation Test

Given service has active DB connections using credential v1
When credential v2 is issued
And v1 remains valid during overlap window
Then service refreshes connection pool
And v1 is not revoked until no active v1 use is observed

15. Production Observability

Monitor consistency stress, not just throughput.

Metrics:

state_concurrent_modification_total
state_invalid_transition_total
event_duplicate_detected_total
event_gap_detected_total
event_old_version_ignored_total
outbox_publish_lag_seconds
outbox_unpublished_total
reconciliation_finding_total
replay_drift_total
cache_stale_critical_read_total
leader_fencing_rejection_total
config_version_skew_pods
secret_old_version_in_use_total

Alerts:

event_gap_detected_total > 0
outbox_unpublished_total increasing for 15 minutes
replay_drift_total > 0 after replay validation
leader_fencing_rejection_total > 0 after leadership change
accepted_file_without_scan_total > 0
metadata_payload_mismatch_total > 0
secret_old_version_in_use_total > 0 after rotation window

Good observability answers:

  • which invariant is close to breaking?
  • which entities are affected?
  • since when?
  • is automatic recovery running?
  • does an operator need to intervene?

16. Consistency Review Template

## Consistency Review

### State
- Entity:
- Source of truth:
- Derived states:

### Invariants
- Invariant 1:
- Invariant 2:

### Concurrency
- Possible concurrent commands:
- Conflict detection:
- Conflict resolution:
- Locking strategy:

### Events
- Event identity:
- Ordering key:
- Aggregate version:
- Duplicate handling:
- Missing event detection:

### Partial Failure
- Commit point:
- Side effects:
- Outbox/transaction boundary:
- Compensation:
- Reconciliation:

### Staleness
- Cached data:
- TTL:
- Critical recheck:
- Replica lag tolerance:

### Leadership
- Singleton jobs:
- Lease:
- Fencing token:
- Idempotency:

### Replay
- Replay mode:
- Side effects disabled:
- Projection version:
- Drift detection:

### Observability
- Metrics:
- Alerts:
- Audit events:
- Runbook:

17. Key Takeaways

Consistency in microservices is not a single setting. It is a set of explicit protections around invariants.

The main principles:

  1. Consistency must be defined against concrete invariants.
  2. Lost update needs versioning, locking, or single-writer design.
  3. Duplicate event is normal; consumer must be idempotent.
  4. Out-of-order event needs aggregate version and gap detection.
  5. Stale read is acceptable only when invariant tolerance allows it.
  6. Partial failure requires commit point, outbox, compensation, and reconciliation.
  7. Leader election reduces duplication; fencing prevents stale leaders from writing.
  8. Replay drift is a historical correctness problem, not just a technical mismatch.
  9. Config and secret skew are consistency problems too.
  10. Testing must include duplicate, concurrent, delayed, stale, partial, and replay scenarios.

The state management block ends here. The next part opens a new block: Configuration Mental Model.

