State Consistency and Failure Modeling
Learn Java Microservices File Handling, State, Configuration and Secret Management - Part 034
State consistency and failure modeling for Java microservices: lost update, duplicate event, stale read, split brain, replay drift, fencing token, optimistic locking, and convergence.
Consistency bugs are not bugs in syntax.
They are bugs in the mental model of time.
Microservices do not fail only because a server is down. They fail because two correct things happen at the wrong time:
- two valid requests change the same row concurrently;
- a correct event arrives twice;
- a correct event arrives late;
- a cache was correct 10 seconds ago but is wrong now;
- the old leader keeps working after a new leader has been elected;
- a retry meant to heal ends up duplicating a side effect;
- a rebuilt projection uses new logic and produces different historical decisions;
- config has changed on some pods but not yet on others;
- a secret has been rotated but some connections still use the old credential.
This part closes the state management block with failure modeling.
The goal is not to memorize CAP terms or isolation levels. The goal is to be able to look at a Java microservice design and ask:
What happens if this operation runs twice, concurrently, late,
in reversed order, partially succeeds, or is executed by two leaders?
1. Consistency as Invariant Preservation
Consistency does not mean that every node always sees the same data at the same time.
In production design, consistency means:
System behavior preserves declared invariants under concurrency, retry,
partial failure, stale read, and recovery.
Example invariant:
A file cannot be ACCEPTED unless scan result is CLEAN and checksum is verified.
Consistency questions:
- What if the scan result is duplicated?
- What if a clean result arrives after the file has already been rejected?
- What if checksum verification and the scan result complete at the same time?
- What if the worker crashes after updating the status but before writing the audit event?
- What if a replay produces an accepted state without a scan event?
Consistency must be modeled against concrete invariants.
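As a minimal sketch of what "modeled against a concrete invariant" can look like, the in-memory class below guards the example invariant against duplicate scan results, a late clean result, and repeated accepts. The names `EvidenceFileState`, `FileStatus`, and `ScanVerdict` are hypothetical and exist only for this illustration; a real service would also persist the state transactionally.

```java
// Hypothetical names for illustration only.
enum ScanVerdict { CLEAN, INFECTED }

enum FileStatus { UPLOADED, ACCEPTED, REJECTED }

final class EvidenceFileState {
    private FileStatus status = FileStatus.UPLOADED;
    private ScanVerdict scanVerdict;      // null until a scan result arrives
    private boolean checksumVerified;

    void recordScan(ScanVerdict verdict) {
        // A duplicate scan result with the same verdict is a no-op.
        if (scanVerdict != null && scanVerdict != verdict) {
            throw new IllegalStateException("Conflicting scan verdict");
        }
        scanVerdict = verdict;
    }

    void recordChecksumVerified() {
        checksumVerified = true;
    }

    void accept() {
        if (status == FileStatus.ACCEPTED) {
            return; // a repeated accept is idempotent
        }
        // Invariant: ACCEPTED requires a CLEAN scan and a verified checksum,
        // and a file that was already rejected cannot be accepted later.
        if (status != FileStatus.UPLOADED
                || scanVerdict != ScanVerdict.CLEAN
                || !checksumVerified) {
            throw new IllegalStateException("Invariant violated: cannot accept");
        }
        status = FileStatus.ACCEPTED;
    }

    void reject() {
        if (status == FileStatus.ACCEPTED) {
            throw new IllegalStateException("Already accepted");
        }
        status = FileStatus.REJECTED;
    }

    FileStatus status() {
        return status;
    }
}
```

Each consistency question above maps to one branch in this class, which is what makes the invariant reviewable and testable.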
2. Failure Model Vocabulary
Use this vocabulary during design review.
| Failure | Form | Risk |
|---|---|---|
| Lost update | Two writers overwrite each other's changes | State is lost |
| Dirty read | Reading uncommitted data | Decisions based on transient data |
| Non-repeatable read | Re-reading returns a different result | Multi-step logic is unstable |
| Phantom read | The result of a range query changes | Batch/reconciliation skips or duplicates rows |
| Stale read | Data has changed but the cache/replica is old | Wrong authorization/config |
| Duplicate event | Event is processed more than once | Duplicated side effect |
| Out-of-order event | An old event arrives after a newer one | Status moves backward |
| Missing event | Event is never received | Incomplete projection |
| Partial failure | One side effect succeeds, another fails | Drift |
| Split brain | Two actors each believe they are the leader | Double processing |
| Replay drift | Replay differs from the original | Audit/projection mismatch |
| Clock skew | Nodes have different clocks | Wrong TTL/lease/ordering |
| Config skew | Pods use different config | Inconsistent behavior |
| Secret skew | Some pods use the old credential, some the new one | Sporadic auth failures |
A clear vocabulary avoids vague discussions such as “looks like a race condition”.
3. Lost Update
A lost update happens when two transactions read the same old state and then each write a new state, so one change is lost.
Bad example:
T1 reads file.status = SCANNED
T2 reads file.status = SCANNED
T1 writes ACCEPTED
T2 writes REJECTED
Final status = REJECTED
Accepted audit may already exist
3.1 Optimistic Locking
Use a version column.
ALTER TABLE evidence_file
ADD COLUMN version BIGINT NOT NULL DEFAULT 0;
Update:
UPDATE evidence_file
SET status = ?, version = version + 1, updated_at = now()
WHERE file_id = ? AND version = ?;
Java boundary:
public final class EvidenceFileRepository {
public void updateStatus(FileId fileId, long expectedVersion, FileLifecycleStatus next) {
int updated = jdbc.update("""
UPDATE evidence_file
SET status = ?, version = version + 1, updated_at = now()
WHERE file_id = ? AND version = ?
""", next.name(), fileId.value(), expectedVersion);
if (updated != 1) {
throw new ConcurrentModificationException(
"Evidence file was modified concurrently: " + fileId.value()
);
}
}
}
Optimistic locking fits when conflicts are rare but must be detected.
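Detecting the conflict is only half of the pattern; the caller also has to decide what to do next. The sketch below is an in-memory stand-in for the versioned `UPDATE`, plus a bounded retry loop that re-reads the row and re-runs the decision on fresh state. `InMemoryVersionedStore`, `Versioned`, and `OptimisticRetry` are hypothetical names, and the `ConcurrentHashMap` only simulates the database row.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.UnaryOperator;

record Versioned(String status, long version) {}

final class InMemoryVersionedStore {
    private final ConcurrentMap<String, Versioned> rows = new ConcurrentHashMap<>();

    void insert(String id, String status) {
        rows.put(id, new Versioned(status, 0));
    }

    Versioned read(String id) {
        return rows.get(id);
    }

    // Mirrors "UPDATE ... WHERE file_id = ? AND version = ?":
    // the write succeeds only if the version is still the expected one.
    boolean compareAndUpdate(String id, long expectedVersion, String nextStatus) {
        Versioned current = rows.get(id);
        if (current == null || current.version() != expectedVersion) {
            return false;
        }
        return rows.replace(id, current, new Versioned(nextStatus, expectedVersion + 1));
    }
}

final class OptimisticRetry {
    // Re-reads and re-decides on every attempt instead of blindly rewriting.
    static boolean updateWithRetry(InMemoryVersionedStore store, String id,
                                   UnaryOperator<String> decide, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Versioned snapshot = store.read(id);
            String next = decide.apply(snapshot.status());
            if (store.compareAndUpdate(id, snapshot.version(), next)) {
                return true;
            }
        }
        return false; // caller surfaces the conflict after bounded retries
    }
}
```

Whether a retry is appropriate depends on the command: a user-driven accept or reject is usually better surfaced as a conflict than silently retried.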
3.2 Pessimistic Locking
For critical transitions where a conflict is expensive:
SELECT *
FROM evidence_file
WHERE file_id = ?
FOR UPDATE;
Use it with care:
- keep the lock duration short;
- do not call external services while the lock is held;
- avoid inconsistent lock ordering;
- measure deadlocks;
- set a timeout.
3.3 Domain Transition Guard
A lock without a transition guard can still go wrong.
public void transitionTo(FileLifecycleStatus next) {
if (!this.status.canMoveTo(next)) {
throw new IllegalStateException("Invalid transition " + this.status + " -> " + next);
}
this.status = next;
}
A DB guard plus a domain guard is stronger than either one alone.
4. Duplicate Event
At-least-once delivery means duplicate events are normal.
Kafka and other brokers can provide delivery guarantees, but the consumer must still model duplicate handling because of retries, rebalances, crashes before offset commit, and sink failures.
4.1 Idempotent Consumer
CREATE TABLE processed_event (
consumer_name TEXT NOT NULL,
event_id TEXT NOT NULL,
processed_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (consumer_name, event_id)
);
Handler:
@Transactional
public void onFileAccepted(FileAccepted event) {
if (processedEventRepository.exists("file-projection", event.eventId())) {
return;
}
projectionRepository.markAccepted(
event.fileId(),
event.acceptedAt(),
event.checksum()
);
processedEventRepository.insert("file-projection", event.eventId());
}
Key point:
Idempotency record and side effect must commit together.
If the idempotency record commits but the side effect fails, the event is treated as processed even though it was not.
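The "commit together" rule can be illustrated without a database. In this minimal sketch the `synchronized` method stands in for the single transaction, and removing the ID on failure stands in for the rollback. `IdempotentHandler` is a hypothetical name, not part of the series' code.

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentHandler {
    private final Set<String> processedEventIds = new HashSet<>();

    // synchronized plays the role of the database transaction boundary here.
    synchronized boolean handle(String eventId, Runnable sideEffect) {
        // add() returns false when the ID is already recorded: a duplicate.
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        try {
            sideEffect.run();
            return true;
        } catch (RuntimeException e) {
            // "Rollback": the idempotency record must not survive a failed
            // side effect, otherwise the redelivered event would be skipped.
            processedEventIds.remove(eventId);
            throw e;
        }
    }
}
```

With a real database the same effect comes from putting the insert into `processed_event` and the projection update in one transaction, so the primary key rejects the duplicate and a failure rolls back both.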
4.2 Natural Idempotency
Some operations are naturally idempotent:
UPDATE evidence_projection
SET status = 'ACCEPTED', accepted_at = ?
WHERE file_id = ? AND status <> 'ACCEPTED';
But be careful. Natural idempotency is only safe if the repeated operation really produces the same state and emits no additional side effects.
5. Out-of-Order Event
Events can arrive out of global order. Even when per-partition ordering is preserved, events for the same aggregate must use the same key so that they land in the same partition.
5.1 Versioned Event
public record FileLifecycleChanged(
String eventId,
String fileId,
long aggregateVersion,
String previousStatus,
String nextStatus,
Instant occurredAt
) {}
Consumer:
@Transactional
public void apply(FileLifecycleChanged event) {
Projection projection = repository.get(event.fileId());
if (event.aggregateVersion() <= projection.lastSeenVersion()) {
return; // duplicate or old event
}
if (event.aggregateVersion() != projection.lastSeenVersion() + 1) {
throw new MissingEventGapException(event.fileId(), projection.lastSeenVersion(), event.aggregateVersion());
}
repository.apply(event);
}
Gap detection matters for projections that need a complete event stream.
5.2 Monotonic State Guard
For certain lifecycles, do not allow the status to move backward.
public boolean isTerminal() {
return this == ACCEPTED || this == REJECTED || this == DELETED;
}
But do not rely on enum ordinals alone. Lifecycles often branch. Build an explicit transition matrix.
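One way to make the transition matrix explicit is an `EnumMap` from each status to the set of statuses it may move to. The sketch below uses a hypothetical `LifecycleStatus` enum, and the edges are assumptions chosen for illustration; the real lifecycle in a given service may allow different transitions.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical lifecycle; the allowed edges below are illustrative assumptions.
enum LifecycleStatus { UPLOADED, QUARANTINED, SCANNED, ACCEPTED, REJECTED, DELETED }

final class TransitionMatrix {
    private static final Map<LifecycleStatus, Set<LifecycleStatus>> ALLOWED =
            new EnumMap<>(LifecycleStatus.class);

    static {
        ALLOWED.put(LifecycleStatus.UPLOADED, EnumSet.of(LifecycleStatus.QUARANTINED));
        ALLOWED.put(LifecycleStatus.QUARANTINED, EnumSet.of(LifecycleStatus.SCANNED));
        // The branch that an ordinal comparison cannot express:
        ALLOWED.put(LifecycleStatus.SCANNED,
                EnumSet.of(LifecycleStatus.ACCEPTED, LifecycleStatus.REJECTED));
        ALLOWED.put(LifecycleStatus.ACCEPTED, EnumSet.of(LifecycleStatus.DELETED));
        ALLOWED.put(LifecycleStatus.REJECTED, EnumSet.of(LifecycleStatus.DELETED));
        ALLOWED.put(LifecycleStatus.DELETED, EnumSet.noneOf(LifecycleStatus.class));
    }

    // Every enum constant has an entry above, so get() never returns null.
    static boolean canMove(LifecycleStatus from, LifecycleStatus to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

Because the matrix is data, it can be reviewed next to the invariants and reused by both the domain guard and the projection consumer.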
6. Stale Read
A stale read happens when a service reads an old value from a cache, replica, local memory, config cache, or token claim.
A stale read is not always bad. It is bad when it violates an invariant.
6.1 Classify Staleness Tolerance
| Data | Staleness Tolerance | Strategy |
|---|---|---|
| Product display name | Minutes to hours | Cache OK |
| File download permission | Low | Recheck source/policy |
| Upload size limit | Medium | Short TTL + server-side validation |
| Secret credential | TTL-bound | Refresh before expiry |
| Legal hold status | Very low | Authoritative read before delete |
| Retention policy | Low | Versioned policy snapshot |
| Search index | Minutes may be OK | Show eventual consistency expectation |
6.2 Critical Action Recheck
For risky actions, do not depend on a stale cache.
public DownloadTicket createDownloadTicket(UserContext user, FileId fileId) {
StoredFile file = fileRepository.getRequired(fileId);
PermissionDecision decision = accessPolicy.evaluateAuthoritatively(user, file);
if (!decision.allowed()) {
throw new AccessDeniedException("Not allowed to download file payload");
}
return downloadTicketIssuer.issue(file, user, Duration.ofMinutes(5));
}
A cache may speed up the UI, but the final action stays authoritative.
7. Partial Failure
Partial failure is at the core of distributed systems.
Example:
1. DB metadata created
2. Object uploaded
3. Audit event write failed
Or:
1. Secret rotated in manager
2. Some pods reload
3. Some pods keep old connection
4. Old secret revoked too early
7.1 Commit Point
For every workflow, define the commit point.
File upload example:
Commit point = metadata status moves to UPLOADED/QUARANTINED with verified object pointer.
Before the commit point, cleanup may discard the temp object. After the commit point, the file must be treated as a durable artifact.
7.2 Outbox Pattern
If a DB update and an event publish must be consistent, use an outbox.
CREATE TABLE outbox_event (
event_id TEXT PRIMARY KEY,
aggregate_type TEXT NOT NULL,
aggregate_id TEXT NOT NULL,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
status TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
published_at TIMESTAMPTZ
);
In one transaction:
- update evidence_file status
- insert outbox_event FILE_ACCEPTED
A separate publisher reads the outbox and publishes to the broker.
This avoids a DB state change without an event, as long as the outbox publisher and reconciliation are working.
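The publisher side of the outbox can be sketched as a polling relay. The version below is in-memory and hypothetical (`OutboxRelay`, `OutboxRow`, and the `Consumer` standing in for the broker client are assumptions for illustration). It shows two properties worth reviewing: a failed publish leaves the row pending, and a crash between publish and mark would cause a re-publish, which is why consumers still need to be idempotent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class OutboxRow {
    final String eventId;
    final String payload;
    boolean published;

    OutboxRow(String eventId, String payload) {
        this.eventId = eventId;
        this.payload = payload;
    }
}

final class OutboxRelay {
    // Stands in for the outbox_event table, in insertion order.
    private final List<OutboxRow> table = new ArrayList<>();

    void append(String eventId, String payload) {
        table.add(new OutboxRow(eventId, payload));
    }

    // Publishes pending rows in order and stops at the first failure,
    // so a later event never overtakes an earlier unpublished one.
    int publishPending(Consumer<OutboxRow> broker) {
        int published = 0;
        for (OutboxRow row : table) {
            if (row.published) {
                continue;
            }
            try {
                broker.accept(row);
            } catch (RuntimeException e) {
                break; // row stays pending and is retried on the next poll
            }
            // A crash between accept() and this line means a re-publish later:
            // delivery is at-least-once, not exactly-once.
            row.published = true;
            published++;
        }
        return published;
    }

    long pendingCount() {
        return table.stream().filter(r -> !r.published).count();
    }
}
```

`pendingCount()` corresponds to the kind of value a metric such as `outbox_unpublished_total` would report.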
7.3 Compensation
If the operation cannot be atomic, use compensation.
If object upload succeeds but metadata commit fails:
- mark object with temporary tag/uploadSessionId
- cleanup job deletes temp object after safety window
Compensation is not a perfect rollback. It is an explicit action that brings the system to a safe state.
8. Split Brain and Leader Election
Split brain happens when two actors each believe they hold exclusive authority.
Examples:
- two schedulers run the same cleanup job;
- two workers run compaction/repair at the same time;
- the old leader is slow to notice that its lease expired;
- a network partition makes a lock look valid on one side;
- a pod pauses or freezes in GC, then continues writing after its lease is gone.
Kubernetes has a Lease API that is used for leader election in high-availability components. But holding a lease is not enough if the old worker can still write after losing leadership.
8.1 Fencing Token
A fencing token prevents the old leader from writing once its token has been superseded.
Leader A gets token 10
Leader B gets token 11
Storage accepts writes only with token >= current token
Leader A resumes and writes with token 10 -> rejected
Database schema:
CREATE TABLE singleton_job_fence (
job_name TEXT PRIMARY KEY,
current_token BIGINT NOT NULL
);
Worker update:
UPDATE repair_job
SET status = 'RUNNING', fencing_token = ?
WHERE job_id = ?
AND ? >= (
SELECT current_token FROM singleton_job_fence WHERE job_name = 'repair-worker'
);
Lease selects leader. Fencing protects resource from stale leader.
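The resource-side check is the part that actually enforces fencing, so it helps to see it in isolation. The following in-memory sketch uses a hypothetical `FencedResource` class: it remembers the highest token it has accepted and rejects any write carrying a lower one, matching the "token >= current token" rule above.

```java
final class FencedResource {
    private long highestTokenSeen = 0;
    private String value = "";

    // Accepts the write only if the token is not older than the newest one seen.
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale leader: its token has been superseded
        }
        highestTokenSeen = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String value() {
        return value;
    }
}
```

The important design choice is that the check lives in the resource (database, storage) and not in the worker, because a paused worker cannot be trusted to notice that its lease expired.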
8.2 Idempotent Scheduled Jobs
Even with leader election, scheduled jobs should be idempotent.
Leader election reduces duplicate execution.
Idempotency makes duplicate execution safe.
Do not rely on scheduler exclusivity as your only correctness boundary.
9. Replay Drift
Replay drift is a consistency failure between historical processing and replay processing.
Example:
Original event at T1:
- max upload size policy v3 allowed 150MB
Replay at T2:
- current policy v5 allows only 100MB
If replay re-evaluates policy, file becomes rejected.
This is wrong if the replay is meant to build a historical projection.
9.1 Store Decision Context
Events must carry context.
public record FileAccepted(
String eventId,
String fileId,
String checksum,
long sizeBytes,
String policyVersion,
String scanEngineVersion,
Instant acceptedAt
) {}
Replay uses the decision result instead of re-evaluating the current policy.
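The difference between the two replay styles is small in code but large in outcome. The sketch below contrasts them for the 150MB/100MB example; `FileAcceptedFact`, `Replay`, and the hard-coded current limit are hypothetical and exist only to make the drift visible.

```java
// Trimmed-down, hypothetical version of the decision event.
record FileAcceptedFact(String fileId, long sizeBytes, String policyVersion) {}

final class Replay {
    // Assumption from the example: the current policy (v5) allows 100 MB.
    static final long CURRENT_MAX_BYTES = 100L * 1024 * 1024;

    // Historical replay: the event already records the decision that was made.
    static String replayFromDecision(FileAcceptedFact event) {
        return "ACCEPTED";
    }

    // Drifting replay: re-evaluates the past decision with today's policy.
    static String replayByReevaluating(FileAcceptedFact event) {
        return event.sizeBytes() <= CURRENT_MAX_BYTES ? "ACCEPTED" : "REJECTED";
    }
}
```

For a 120 MB file accepted under policy v3, the first method reproduces the historical state while the second silently rewrites history; keeping `policyVersion` on the event is what lets an auditor explain why.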
9.2 Projection Version
Projections may change. So version the projection.
evidence-search-projection-v1
evidence-search-projection-v2
Do not expect the projection v2 hash to equal v1 if the schema/logic really changed. What must stay the same are the declared invariants.
10. Config and Secret Consistency
This part is not the config/secret block, but their consistency already needs to be understood.
10.1 Config Skew
Config skew:
Pod A: maxUploadSize=100MB
Pod B: maxUploadSize=200MB
Impact:
- user-visible behavior differs depending on the pod;
- a canary may create skew on purpose;
- a slow rollout can cause mixed behavior;
- audit is hard if the effective config is not recorded.
Mitigation:
- version the config;
- log the effective config version at startup;
- include the config version in critical decision audits;
- avoid runtime reload for boundary-changing config;
- roll out through a clear deployment strategy.
10.2 Secret Skew
Secret skew:
Pod A uses DB credential v1
Pod B uses DB credential v2
Secret manager revoked v1
Pod A starts failing
Mitigation:
- overlap window;
- dual credential acceptance;
- pool max lifetime;
- refresh before expiry;
- observe old credential use;
- revoke after all consumers moved.
11. Consistency Pattern Catalog
| Problem | Pattern | Notes |
|---|---|---|
| Lost update | Optimistic locking | Good default for domain rows |
| High-conflict transition | Pessimistic lock | Keep transaction short |
| Duplicate event | Idempotent consumer | Store event ID with side effect |
| DB update + event publish | Transactional outbox | Avoid dual-write gap |
| External side effect retry | Idempotency key | Scope key by operation |
| Out-of-order event | Aggregate version | Detect gap/old event |
| Stale cache | TTL + invalidation + critical recheck | Classify tolerance |
| Split brain | Lease + fencing token | Lease alone may not protect writes |
| Partial object/metadata write | Lifecycle state + reconciliation | Treat temp/final differently |
| Replay drift | Decision event + policy version | Avoid current policy re-eval |
| Backfill conflict | Small batch + version guard | Avoid long lock |
| Derived state drift | Reindex + validation | Projection disposable |
12. Failure Modeling with Timeline
Use timelines in design docs.
12.1 Duplicate Upload Finalization
T1 Client sends finalizeUpload(idempotencyKey=K)
T2 Service verifies object checksum
T3 Service updates metadata to UPLOADED
T4 Network timeout before response
T5 Client retries finalizeUpload(idempotencyKey=K)
T6 Service returns previous result, no new file created
Required:
- idempotency key;
- durable idempotency store;
- same response semantics;
- no duplicate metadata row.
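The T5/T6 steps of this timeline can be sketched with an idempotency store keyed by the client's key. The version below is in-memory and therefore not durable, which the requirements above explicitly ask for; it only illustrates the "same key, same result, no second file" behavior. `FinalizeUploadService` and the generated file IDs are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class FinalizeUploadService {
    // In production this map would be a durable table keyed by idempotency key.
    private final ConcurrentHashMap<String, String> resultByKey = new ConcurrentHashMap<>();
    private final AtomicInteger filesCreated = new AtomicInteger();

    String finalizeUpload(String idempotencyKey) {
        // computeIfAbsent runs the mapping function at most once per key,
        // so a retry with the same key returns the stored result.
        return resultByKey.computeIfAbsent(
                idempotencyKey,
                key -> "file-" + filesCreated.incrementAndGet());
    }

    int filesCreated() {
        return filesCreated.get();
    }
}
```

A retry after the network timeout at T4 calls `finalizeUpload("K")` again and receives the original file ID without creating a second metadata row.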
12.2 Worker Crash
T1 Worker reads FILE_UPLOADED event
T2 Worker starts scan
T3 Worker writes scan result CLEAN
T4 Worker crashes before commit offset
T5 Event redelivered
T6 Worker sees scan result already exists and returns safely
Required:
- scan result keyed by fileId + scannerVersion or eventId;
- idempotent write;
- no duplicate downstream transition.
12.3 Stale Leader
T1 Pod A acquires lease token 10
T2 Pod A pauses for 90s
T3 Pod B acquires lease token 11
T4 Pod B runs repair job
T5 Pod A resumes and tries to write with token 10
T6 DB/storage rejects token 10
Required:
- fencing token stored with writes;
- resource validates token;
- job idempotent.
13. Java Design Guidelines
13.1 Make Time Explicit
Avoid hidden current time in domain logic.
Bad:
if (Instant.now().isAfter(file.retentionUntil())) {
file.delete();
}
Better:
public void requestDeletion(Instant decisionTime, RetentionPolicy policy) {
if (!policy.canDelete(this, decisionTime)) {
throw new RetentionViolationException(id.value());
}
this.status = FileLifecycleStatus.DELETION_REQUESTED;
}
Explicit time makes tests and audits stronger.
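A small sketch of why this helps in tests: once the decision time is a parameter, the exact retention boundary can be checked deterministically, and `java.time.Clock` can supply the time at the edge of the system. `RetentionCheck` is a hypothetical helper and its "strictly after" rule is an assumption for illustration.

```java
import java.time.Clock;
import java.time.Instant;

final class RetentionCheck {
    // Pure function: the same inputs always give the same answer.
    static boolean canDelete(Instant retentionUntil, Instant decisionTime) {
        return decisionTime.isAfter(retentionUntil);
    }

    // The clock is injected at the boundary; tests can pass Clock.fixed(...).
    static boolean canDeleteNow(Instant retentionUntil, Clock clock) {
        return canDelete(retentionUntil, clock.instant());
    }
}
```

In production code the caller would pass `Clock.systemUTC()`, and the `decisionTime` that was used can be written to the audit event.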
13.2 Make Version Explicit
public record VersionedFileState(
FileId fileId,
FileLifecycleStatus status,
long version
) {}
Do not send a mutation command without an expected version for operations that can conflict.
public record AcceptFileCommand(
FileId fileId,
long expectedVersion,
String scanResultId,
String actorId
) {}
13.3 Make Idempotency Explicit
public record CommandEnvelope<T>(
String commandId,
String idempotencyKey,
String actorId,
Instant requestedAt,
T payload
) {}
Idempotency is not an afterthought. It is part of the command contract.
13.4 Make Side Effects Explicit
Separate the domain decision from the side effect.
public record DomainDecision(
List<StateMutation> mutations,
List<OutboxEvent> events,
List<ExternalActionRequest> externalActions
) {}
The more explicit the side effects, the easier failure modeling becomes.
14. Testing Consistency
14.1 Concurrent Test
@Test
void concurrentAcceptAndRejectCannotBothWin() throws Exception {
FileId fileId = fixture.scannedFile();
ExecutorService pool = Executors.newFixedThreadPool(2);
Future<?> accept = pool.submit(() -> service.accept(fileId));
Future<?> reject = pool.submit(() -> service.reject(fileId, "MALWARE"));
// ignoreConflict waits for the task and swallows only the expected conflict error.
ignoreConflict(accept);
ignoreConflict(reject);
pool.shutdown(); // release the worker threads once both tasks are done
StoredFile file = repository.getRequired(fileId);
assertTrue(file.status() == ACCEPTED || file.status() == REJECTED);
assertSingleTerminalAuditEvent(fileId);
}
14.2 Duplicate Event Test
@Test
void duplicateFileAcceptedEventIsIdempotent() {
FileAccepted event = fixture.fileAcceptedEvent();
consumer.onFileAccepted(event);
consumer.onFileAccepted(event);
assertEquals(1, projectionRepository.countAcceptedRows(event.fileId()));
assertEquals(1, processedEventRepository.count(event.eventId()));
}
14.3 Out-of-Order Test
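@Test
void oldLifecycleEventCannotMoveProjectionBackward() {
// fixture.projectionAtVersion(...) and event(...) are hypothetical test helpers.
// The projection starts at version 4 so that version 5 does not trigger gap detection.
String fileId = fixture.projectionAtVersion(4);
consumer.apply(event(fileId, 5, "ACCEPTED"));
consumer.apply(event(fileId, 4, "SCANNED")); // old event must be ignored
Projection p = projectionRepository.get(fileId);
assertEquals("ACCEPTED", p.status());
assertEquals(5, p.lastSeenVersion());
}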
14.4 Config Skew Test
Given pod A uses config version v1
And pod B uses config version v2
When both process upload finalization
Then audit event records config version
And behavior difference is explainable
14.5 Secret Rotation Test
Given service has active DB connections using credential v1
When credential v2 is issued
And v1 remains valid during overlap window
Then service refreshes connection pool
And v1 is not revoked until no active v1 use is observed
15. Production Observability
Monitor consistency stress, not just throughput.
Metrics:
state_concurrent_modification_total
state_invalid_transition_total
event_duplicate_detected_total
event_gap_detected_total
event_old_version_ignored_total
outbox_publish_lag_seconds
outbox_unpublished_total
reconciliation_finding_total
replay_drift_total
cache_stale_critical_read_total
leader_fencing_rejection_total
config_version_skew_pods
secret_old_version_in_use_total
Alerts:
event_gap_detected_total > 0
outbox_unpublished_total increasing for 15 minutes
replay_drift_total > 0 after replay validation
leader_fencing_rejection_total > 0 after leadership change
accepted_file_without_scan_total > 0
metadata_payload_mismatch_total > 0
secret_old_version_in_use_total > 0 after rotation window
Good observability answers:
- which invariant is close to breaking?
- which entity is affected?
- since when?
- is automatic recovery running?
- does an operator need to intervene?
16. Consistency Review Template
## Consistency Review
### State
- Entity:
- Source of truth:
- Derived states:
### Invariants
- Invariant 1:
- Invariant 2:
### Concurrency
- Possible concurrent commands:
- Conflict detection:
- Conflict resolution:
- Locking strategy:
### Events
- Event identity:
- Ordering key:
- Aggregate version:
- Duplicate handling:
- Missing event detection:
### Partial Failure
- Commit point:
- Side effects:
- Outbox/transaction boundary:
- Compensation:
- Reconciliation:
### Staleness
- Cached data:
- TTL:
- Critical recheck:
- Replica lag tolerance:
### Leadership
- Singleton jobs:
- Lease:
- Fencing token:
- Idempotency:
### Replay
- Replay mode:
- Side effects disabled:
- Projection version:
- Drift detection:
### Observability
- Metrics:
- Alerts:
- Audit events:
- Runbook:
17. Key Takeaways
Consistency in microservices is not a single setting. It is a set of explicit protections around invariants.
The main principles:
- Consistency must be defined against concrete invariants.
- Lost update needs versioning, locking, or single-writer design.
- Duplicate event is normal; consumer must be idempotent.
- Out-of-order event needs aggregate version and gap detection.
- Stale read is acceptable only when invariant tolerance allows it.
- Partial failure requires commit point, outbox, compensation, and reconciliation.
- Leader election reduces duplication; fencing prevents stale leaders from writing.
- Replay drift is a historical correctness problem, not just a technical mismatch.
- Config and secret skew are consistency problems too.
- Testing must include duplicate, concurrent, delayed, stale, partial, and replay scenarios.
The state management block ends here. The next part opens a new block: Configuration Mental Model.
References
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
- Kafka Design: Delivery Semantics: https://docs.confluent.io/kafka/design/delivery-semantics.html
- Kafka Log Compaction: https://docs.confluent.io/kafka/design/log_compaction.html
- Kubernetes Leases: https://kubernetes.io/docs/concepts/architecture/leases/
- Spring Batch Reference Documentation: https://docs.spring.io/spring-batch/reference/