Learn Java Microservices File Handling, State, Configuration and Secret Management - Part 061

Chaos and failure testing for Java microservices that manage files, state, configuration, and secrets: storage outage, secret expiry, config drift, worker crash, replay, rollback, and game days.

2026-07-05 · 12 min read · 2269 words

Part 061 — Chaos and Failure Testing

Do not wait for production to teach you how your system fails.

Production is an expensive classroom.

We have already built the models:

file lifecycle;
object storage;
state consistency;
configuration management;
secret management;
threat modeling;
leakage prevention;
encryption/access control;
auditability;
observability.

Now we have to prove all of it.

Chaos and failure testing does not mean "randomly break production". That reading is wrong and dangerous.

Chaos engineering is a controlled experiment that validates a hypothesis about how the system behaves under failure. For this series, the target is very specific:

Can the system preserve or detect its invariants when file, state,
config, secret, and runtime dependencies fail?

Kubernetes itself provides readiness, liveness, and startup probes that help the kubelet decide whether a container is healthy, ready to receive traffic, or needs a restart. Chaos tooling such as Chaos Mesh provides fault injection on Kubernetes to test how the system behaves under realistic abnormal conditions. But a tool is only a mechanism. The real value lies in the design of the experiment.

1. Failure Testing Mindset

There are three levels of tests.

Level	Question
Unit/integration failure test	Does this code path handle failure?
Scenario test	Does this workflow recover from partial failure?
Chaos experiment	Does the running system preserve steady state under injected fault?

Example:

Unit test:
ObjectStorageClient throws timeout.

Scenario test:
Upload object succeeds, DB commit fails, reconciliation cleans orphan object.

Chaos experiment:
Object storage returns intermittent 503 for 10 minutes while upload traffic continues.
System degrades safely, no accepted file lacks checksum, alerts fire.

2. The Steady State

A chaos experiment must define its steady state first.

Weak steady state:

Service is up.

Useful steady state:

- Valid upload sessions can be created.
- Uploaded files reach QUARANTINED or fail with explicit reason.
- No ACCEPTED file exists without checksum.
- Audit outbox oldest age < 60 seconds.
- Secret TTL > 15 minutes.
- Config version is consistent across ready pods.
- Download authorization deny rate does not unexpectedly drop to zero.

Steady state should be measurable.
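
One way to make the steady state measurable is to capture its signals in a snapshot and evaluate the same snapshot before, during, and after the experiment. The sketch below is only an illustration: the `SteadyStateSnapshot` record, its field names, and the `violations` method are hypothetical, and the thresholds are copied from the list above.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical snapshot of some steady-state signals from the list above.
// How the values are collected (metrics backend, SQL query) is out of scope here.
record SteadyStateSnapshot(
        long acceptedFilesWithoutChecksum,
        Duration auditOutboxOldestAge,
        Duration secretTimeToLive,
        int distinctConfigVersionsOnReadyPods) {

    // Returns the violated conditions; an empty list means steady state holds.
    List<String> violations() {
        List<String> out = new ArrayList<>();
        if (acceptedFilesWithoutChecksum > 0) {
            out.add("accepted file without checksum");
        }
        if (auditOutboxOldestAge.compareTo(Duration.ofSeconds(60)) >= 0) {
            out.add("audit outbox oldest age >= 60s");
        }
        if (secretTimeToLive.compareTo(Duration.ofMinutes(15)) <= 0) {
            out.add("secret TTL <= 15m");
        }
        if (distinctConfigVersionsOnReadyPods > 1) {
            out.add("mixed config versions on ready pods");
        }
        return out;
    }
}
```

Returning the list of violated conditions, rather than a single boolean, keeps the experiment evidence specific about which part of the steady state broke.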

3. Invariant-Driven Chaos

Start with an invariant, then pick the failure that could break it.

Example:

Invariant	Failure
No accepted file without checksum	scanner returns duplicate/out-of-order result
No secret expires before refresh	Vault unavailable near lease renewal
No unsafe config value in prod	ConfigMap patched manually
No metadata-payload mismatch persists	DB failure after object write
No duplicate processing side effect	worker crashes after side effect before ack
No stale authorization decision for critical download	permission revoked while cache contains allow

4. Failure Taxonomy

4.1 File Handling Failures

Failure	What to Verify
disk full in temp directory	upload fails safely, cleanup works
temp file deleted mid-processing	worker retries/rejects explicitly
path permission denied	service fails readiness or operation fails cleanly
partial multipart upload	abort/reconciliation works
checksum mismatch	file rejected/quarantined
malware scanner timeout	file not accepted
scanner duplicate event	idempotent state transition
object missing for metadata	reconciliation detects
metadata missing for object	orphan cleanup detects
delete denied by retention	deletion blocked and audited

4.2 Object Storage Failures

Failure	What to Verify
`PutObject` timeout	retry/idempotency, no false accepted state
`GetObject` 404	error classification, no misleading response
storage 403	access policy alert
storage 503	backoff/circuit breaker
slow storage	timeout, backpressure
multipart complete failure	session remains recoverable
abort multipart failure	cleanup retry
object version conflict	metadata version consistency
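
The "backoff" entry for storage 503 can be exercised locally with a bounded retry helper. This is a minimal sketch under stated assumptions: `BoundedRetry` is a hypothetical helper, it retries on any `RuntimeException` (a real client would classify errors first), and it has no jitter or circuit breaker.

```java
import java.util.function.Supplier;

// Hypothetical helper: retry an operation a bounded number of times
// with exponential backoff (baseDelayMs, then 2x, then 4x, ...).
final class BoundedRetry {

    static <T> T call(Supplier<T> op, int maxAttempts, long baseDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(baseDelayMs << (attempt - 1));
                }
            }
        }
        // All attempts failed: surface the last error instead of retrying forever.
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", e);
        }
    }
}
```

The bound matters more than the backoff curve: an unbounded retry loop turns a storage outage into thread exhaustion in the caller.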

4.3 State Failures

Failure	What to Verify
optimistic lock conflict	retry or explicit conflict
duplicate event	idempotent consumer
out-of-order event	lifecycle guard
DB deadlock	bounded retry
DB commit after external side effect fails	reconciliation
cache stale	critical operations recheck
replay job uses changed config	replay policy versioning
worker split brain	fencing/idempotency

4.4 Configuration Failures

Failure	What to Verify
missing required config	startup fails
invalid config value	validation blocks
unsafe production value	policy blocks
config reload partial failure	old config remains active
mixed config versions	detected and bounded
ConfigMap changed manually	drift alert
feature flag service down	safe fallback
config import unavailable	fail-fast or safe degrade

4.5 Secret Failures

Failure	What to Verify
secret missing	startup/readiness fails
secret malformed	validation blocks
secret manager unavailable	cached credential TTL respected
lease renewal fails	alert and degrade before expiry
secret expires	readiness fails before unsafe work
rotation new credential invalid	old credential remains valid
old credential revoked too early	alert and recovery path
secret volume updates but app does not reload	rollout/reload strategy verified

5. Experiment Template

Use a consistent template.

# Chaos Experiment: <Name>

## Purpose
What invariant are we validating?

## Hypothesis
Under <fault>, the system will <expected behavior>.

## Scope
- Environment:
- Services:
- Tenants/test data:
- Blast radius:
- Time window:

## Steady State
- Metric:
- Threshold:
- Dashboard:

## Fault Injection
- Fault:
- Duration:
- Intensity:
- Target:

## Expected Behavior
- User-visible:
- Internal state:
- Audit:
- Alerts:
- Recovery:

## Abort Conditions
- Error budget burn:
- Data integrity risk:
- Security risk:
- Manual stop signal:

## Rollback
- How to stop fault:
- How to recover:
- Who owns recovery:

## Evidence
- Metrics:
- Logs:
- Traces:
- Audit events:
- Screenshots/links:

## Result
- Pass/fail:
- Findings:
- Corrective actions:

6. Safety Rules

Chaos without guardrails is negligence.

6.1 Blast Radius

Limit by:

environment;
namespace;
label selector;
tenant;
traffic percentage;
duration;
dependency;
operation type;
canary service instance.

Never start with broad production faults.

6.2 Abort Conditions

Define before running.

Examples:

- upload error rate > 10% for 5 minutes
- audit outbox oldest age > 15 minutes
- accepted file without checksum > 0
- secret seconds until expiry < 5 minutes
- scanner queue age > 1 hour
- security deny rate unexpectedly drops

6.3 Data Safety

Use test tenants and synthetic files.

If real data is involved:

get explicit approval;
ensure retention/legal constraints;
no destructive test unless controlled;
backup/recovery plan;
audit evidence.

6.4 Communication

For game day:

declare window;
owner;
incident channel;
rollback authority;
observer roles;
expected alerts.

7. Local and Integration Failure Injection

Not every failure test needs cluster chaos.

7.1 Java Unit Test with Faulty Storage

// Test double: every put() fails with a simulated storage timeout.
final class FailingObjectStorage implements ObjectStorage {
    @Override
    public StoredObject put(PutObjectRequest request) {
        throw new StorageTimeoutException("simulated timeout");
    }
}

Test:

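import static org.junit.jupiter.api.Assertions.assertNotEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Assumes JUnit 5; repository, audit, and command are fixtures of the enclosing test class.
@Test
void uploadDoesNotAcceptFileWhenStorageTimesOut() {
    ObjectStorage storage = new FailingObjectStorage();
    UploadService service = new UploadService(storage, repository, audit);

    // The simulated timeout must surface to the caller.
    assertThrows(StorageTimeoutException.class, () -> service.completeUpload(command));

    // The file must not have reached ACCEPTED after the failed upload.
    StoredFile file = repository.get(command.fileId());
    assertNotEquals(FileLifecycleStatus.ACCEPTED, file.status());
}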

7.2 Toxiproxy for Dependency Faults

For integration tests, put a fault-injecting proxy such as Toxiproxy between the service and its dependency to simulate:

latency;
timeout;
connection reset;
bandwidth limit;
dependency unavailable.

Useful targets:

PostgreSQL;
Redis;
object storage emulator;
Vault;
HTTP scanner.

7.3 Testcontainers

Testcontainers can run dependencies and inject lifecycle disruption:

start DB -> perform upload -> stop DB -> verify retry/failure -> start DB -> verify recovery

Do not overfit tests to emulator behavior. An object storage emulator may not perfectly match S3 semantics.

8. Kubernetes Fault Injection

Kubernetes-native experiments can target:

pod kill;
container kill;
network delay;
network loss;
DNS failure;
CPU stress;
memory stress;
disk pressure;
time skew where supported by tooling;
HTTP fault injection via mesh/proxy.

Chaos Mesh, LitmusChaos, and other tools provide CRD-driven experiments in Kubernetes. Tool choice matters less than experiment discipline.

8.1 Pod Kill Experiment

Hypothesis:

If upload worker pod dies after receiving scan result but before ack,
duplicate delivery/retry will not create invalid lifecycle transition.

Expected:

worker is restarted or replaced;
event reprocessed idempotently;
file remains in valid state;
audit event not duplicated or duplicate is idempotently ignored;
alert only if backlog exceeds threshold.

8.2 Network Delay to Object Storage

Hypothesis:

If object storage latency increases by 2s, upload API applies timeout/backpressure,
does not exhaust servlet threads, and does not mark file accepted prematurely.

Expected metrics:

object_storage_request_duration_seconds increases
file_upload_failed_total{reason="storage_timeout"} may increase
JVM threads stable
no accepted file without checksum

8.3 Secret Manager Outage

Hypothesis:

If Vault/secret manager is unavailable for 10 minutes,
service continues only while cached credential is valid,
alerts before expiry, and readiness fails before unsafe work.

Expected:

secret refresh failures increment;
secret TTL gauge decreases;
readiness degrades near threshold;
no secret value logged;
no immediate crash loop if current credential valid.
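
The "readiness fails before unsafe work" expectation can be modeled as a check on the remaining lifetime of the cached credential. The sketch below makes several assumptions: `SecretReadiness`, `onRefreshed`, and the safety margin are hypothetical, and wiring the result into an actual readiness endpoint is left out.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical readiness check: report not-ready while the cached credential
// still works but is closer to expiry than the safety margin.
final class SecretReadiness {
    private final Clock clock;
    private final Duration safetyMargin;
    private volatile Instant expiresAt;

    SecretReadiness(Clock clock, Duration safetyMargin, Instant expiresAt) {
        this.clock = clock;
        this.safetyMargin = safetyMargin;
        this.expiresAt = expiresAt;
    }

    // Called after each successful refresh or lease renewal.
    void onRefreshed(Instant newExpiry) {
        this.expiresAt = newExpiry;
    }

    // Candidate value for a secret TTL gauge; negative once expired.
    Duration timeUntilExpiry() {
        return Duration.between(clock.instant(), expiresAt);
    }

    boolean isReady() {
        return timeUntilExpiry().compareTo(safetyMargin) > 0;
    }
}
```

Injecting the `Clock` lets a test move through the outage window without waiting for a real TTL to run down.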

8.4 ConfigMap Drift

Hypothesis:

If ConfigMap is manually patched to unsafe value,
drift detection and policy block or alert before service accepts unsafe behavior.

Expected:

GitOps reverts or drift alert fires;
config validation blocks if app restarts;
runtime reload rejects unsafe value;
audit event records drift.

9. File Workflow Experiments

9.1 DB Commit Fails After Object Upload

Failure:

Object upload succeeds.
Metadata DB commit fails.

Expected:

client gets explicit failure or retryable response;
object remains in temp/quarantine prefix;
reconciliation detects object_without_metadata;
cleanup or metadata repair path runs;
no accepted file exists without metadata;
audit records failure or recovery.
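
At its core, the reconciliation step in this scenario is a set difference between what storage holds and what metadata references. A minimal sketch, assuming both sides can be listed as sets of object keys; `ReconciliationReport` and its method names are hypothetical:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical reconciliation report comparing object keys found in storage
// with object keys referenced by metadata rows.
record ReconciliationReport(Set<String> objectsWithoutMetadata,
                            Set<String> metadataWithoutObject) {

    static ReconciliationReport compare(Set<String> storageKeys, Set<String> metadataKeys) {
        // Objects written to storage whose metadata commit never happened.
        Set<String> orphans = new TreeSet<>(storageKeys);
        orphans.removeAll(metadataKeys);

        // Metadata rows that point at an object that does not exist.
        Set<String> missing = new TreeSet<>(metadataKeys);
        missing.removeAll(storageKeys);

        return new ReconciliationReport(orphans, missing);
    }

    boolean isClean() {
        return objectsWithoutMetadata.isEmpty() && metadataWithoutObject.isEmpty();
    }
}
```

In the experiment, the pass criterion is that the orphaned key shows up in `objectsWithoutMetadata` and is then cleaned up or repaired, not that the report is empty from the start.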

9.2 Scan Result Arrives Twice

Failure:

Scanner sends SCAN_CLEAN twice.

Expected:

first event transitions SCANNED -> ACCEPTED;
second event is ignored/idempotent;
no duplicate material audit event, or the duplicate is marked as idempotent;
state version unchanged.
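
The expected behavior can be sketched as an in-memory aggregate that remembers applied event IDs. This is an illustration only: `ScannedFile`, `ScanState`, and the event-ID scheme are hypothetical, and a real service would persist the deduplication record in the same transaction as the state change.

```java
import java.util.HashSet;
import java.util.Set;

enum ScanState { SCANNED, ACCEPTED }

// Hypothetical aggregate showing idempotent handling of SCAN_CLEAN.
final class ScannedFile {
    private ScanState state = ScanState.SCANNED;
    private long version = 0;
    private final Set<String> seenEventIds = new HashSet<>();

    // Returns true only when the event caused a transition.
    boolean onScanClean(String eventId) {
        boolean firstDelivery = seenEventIds.add(eventId);
        if (!firstDelivery || state == ScanState.ACCEPTED) {
            return false; // duplicate delivery or already accepted: no-op
        }
        state = ScanState.ACCEPTED;
        version++;
        return true;
    }

    ScanState state() { return state; }

    long version() { return version; }
}
```

The boolean result gives the caller a place to decide whether to emit a material audit event or only count the duplicate.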

9.3 Scan Result Arrives After Deletion Request

Failure:

File deletion requested while scan is still pending.
Scan result arrives later.

Expected:

lifecycle guard prevents deleted/rejected file becoming accepted;
late event recorded as ignored/conflict;
alert if frequent.
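
A lifecycle guard of this kind is often expressed as an explicit table of allowed transitions. The table below is an assumption made for illustration, since the series' actual state machine may differ; `Lifecycle` and `LifecycleGuard` are hypothetical names.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum Lifecycle { QUARANTINED, SCANNED, ACCEPTED, REJECTED, DELETE_REQUESTED, DELETED }

// Hypothetical guard: a transition is legal only if it is listed here.
final class LifecycleGuard {
    private static final Map<Lifecycle, Set<Lifecycle>> ALLOWED = Map.of(
            Lifecycle.QUARANTINED, EnumSet.of(Lifecycle.SCANNED, Lifecycle.REJECTED, Lifecycle.DELETE_REQUESTED),
            Lifecycle.SCANNED, EnumSet.of(Lifecycle.ACCEPTED, Lifecycle.REJECTED, Lifecycle.DELETE_REQUESTED),
            Lifecycle.ACCEPTED, EnumSet.of(Lifecycle.DELETE_REQUESTED),
            Lifecycle.REJECTED, EnumSet.of(Lifecycle.DELETE_REQUESTED),
            Lifecycle.DELETE_REQUESTED, EnumSet.of(Lifecycle.DELETED),
            Lifecycle.DELETED, EnumSet.noneOf(Lifecycle.class));

    static boolean canTransition(Lifecycle from, Lifecycle to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```

With this table, a late scan result that tries to move a file from `DELETE_REQUESTED` to `ACCEPTED` is rejected by the guard and can be recorded as an ignored or conflicting event.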

9.4 Presigned Upload Never Completed

Failure:

Client creates upload session but never uploads.

Expected:

session expires;
multipart upload aborted;
temp metadata marked expired/rejected;
no permanent object;
quota released.

10. Config Experiments

10.1 Bad Startup Config

Inject:

evidence:
  file:
    quarantine-bucket: evidence-prod
    accepted-bucket: evidence-prod

Expected:

startup fails;
readiness never becomes ready;
deployment rollout stops;
previous version continues serving;
alert includes config validation failure;
no traffic sent to bad pod.
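
The injected config is unsafe because the quarantine and accepted buckets are identical. A fail-fast validator for that rule could look like the sketch below; `EvidenceFileConfig` is a hypothetical binding of the two properties shown above, and framework wiring is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical binding of the evidence.file.* properties.
record EvidenceFileConfig(String quarantineBucket, String acceptedBucket) {

    List<String> validate() {
        List<String> errors = new ArrayList<>();
        boolean hasQuarantine = quarantineBucket != null && !quarantineBucket.isBlank();
        boolean hasAccepted = acceptedBucket != null && !acceptedBucket.isBlank();
        if (!hasQuarantine) {
            errors.add("quarantine-bucket is required");
        }
        if (!hasAccepted) {
            errors.add("accepted-bucket is required");
        }
        // Unscanned and accepted files must never share a bucket.
        if (hasQuarantine && hasAccepted && quarantineBucket.equals(acceptedBucket)) {
            errors.add("quarantine-bucket and accepted-bucket must differ");
        }
        return errors;
    }

    // Call during startup so a bad pod never becomes ready.
    void requireValid() {
        List<String> errors = validate();
        if (!errors.isEmpty()) {
            throw new IllegalStateException("Invalid config: " + String.join("; ", errors));
        }
    }
}
```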

10.2 Runtime Reload Failure

Inject:

reloadable timeout value malformed

Expected:

new config rejected;
old config remains active;
metric increments;
audit/ops event records failure;
no partial object/client rebuild.
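
"Old config remains active" follows naturally when the reload path parses and validates a candidate first and swaps it in only on success. A minimal sketch, assuming the timeout is supplied as an ISO-8601 duration string such as `PT2S`; `ReloadableTimeout` is a hypothetical class:

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for one reloadable timeout value.
final class ReloadableTimeout {
    private final AtomicReference<Duration> active;
    private final AtomicLong rejectedReloads = new AtomicLong();

    ReloadableTimeout(Duration initial) {
        this.active = new AtomicReference<>(initial);
    }

    // Parse and validate before swapping; on any failure keep the old value.
    boolean reload(String rawIso8601) {
        try {
            Duration candidate = Duration.parse(rawIso8601);
            if (candidate.isZero() || candidate.isNegative()) {
                throw new IllegalArgumentException("timeout must be positive");
            }
            active.set(candidate);
            return true;
        } catch (RuntimeException e) {
            rejectedReloads.incrementAndGet(); // source for a rejected-reload metric
            return false;
        }
    }

    Duration current() { return active.get(); }

    long rejectedReloads() { return rejectedReloads.get(); }
}
```

The same validate-then-swap shape applies when the reloaded value drives a client rebuild: build the new client completely, then replace the reference.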

10.3 Mixed Config Version

Inject:

only 50% of pods receive the new config or restart

Expected:

dashboard shows mixed versions;
alert if beyond rollout window;
canary analysis catches behavior difference;
rollback or complete rollout path.

11. Secret Experiments

11.1 Secret Rotation New Credential Invalid

Inject:

new DB password wrong

Expected:

canary fails readiness;
old credential remains valid;
rollout paused;
old pods continue serving;
alert fires;
old credential is not revoked.

11.2 Secret Expires During Traffic

Inject:

short TTL dynamic credential, block renewal

Expected:

refresh failure observed;
readiness fails before expiry or near unsafe threshold;
existing in-flight work handled according to policy;
no infinite auth retry storm;
no secret logged.

11.3 Mounted Secret Updated Without App Reload

Inject:

Kubernetes Secret volume content changes

Expected:

if runtime reload is supported: app loads the new version, validates it, and rebuilds the client;
if not supported: a rollout is triggered;
service does not assume file update automatically updates connection pool.

12. State Experiments

12.1 Lost Update

Inject:

two requests attempt same lifecycle transition concurrently

Expected:

optimistic lock conflict;
only one transition commits;
loser gets conflict or retry;
audit event only for committed transition.
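
The optimistic-lock behavior can be emulated in memory with a compare-and-set on a versioned value, which plays the role of `UPDATE ... WHERE version = :expected`. This is a sketch with hypothetical names (`VersionedStatus`, `FileRow`), not the series' persistence code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable status plus version, as a row would carry them.
record VersionedStatus(String status, long version) {}

// Hypothetical in-memory stand-in for a row protected by optimistic locking.
final class FileRow {
    private final AtomicReference<VersionedStatus> row;

    FileRow(String initialStatus) {
        this.row = new AtomicReference<>(new VersionedStatus(initialStatus, 0));
    }

    VersionedStatus read() {
        return row.get();
    }

    // Commits only if nobody else has committed since `expected` was read.
    // compareAndSet compares references, so pass the instance returned by read().
    boolean transition(VersionedStatus expected, String newStatus) {
        VersionedStatus next = new VersionedStatus(newStatus, expected.version() + 1);
        return row.compareAndSet(expected, next);
    }
}
```

A `false` result is the signal for the loser to report a conflict or retry from a fresh read, and for audit to stay silent about the transition that did not commit.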

12.2 Worker Split Brain

Inject:

two workers process same job with same fileId

Expected:

idempotency key/fencing prevents duplicate side effect;
state transition guard holds;
duplicate metric increments;
no duplicate physical delete.
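
Fencing can be sketched as a resource that rejects any request carrying a token lower than the highest one it has seen. The example is an assumption-heavy illustration: `FencedDeleter` is hypothetical, tokens are assumed to be issued in increasing order by whatever grants the job lease, and the physical delete is reduced to a counter.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical resource that accepts a delete only from the holder of the
// newest fencing token, and performs each delete at most once.
final class FencedDeleter {
    private long highestToken = Long.MIN_VALUE;
    private final Set<String> deleted = new HashSet<>();
    private int physicalDeletes = 0;

    synchronized boolean delete(String fileId, long fencingToken) {
        if (fencingToken < highestToken) {
            return false; // stale worker: a newer lease holder exists
        }
        highestToken = fencingToken;
        if (!deleted.add(fileId)) {
            return false; // already deleted: idempotent no-op
        }
        physicalDeletes++;
        return true;
    }

    synchronized int physicalDeletes() {
        return physicalDeletes;
    }
}
```

The two checks cover different failures: the token comparison stops a worker that lost its lease, while the per-file check stops a duplicate from the current lease holder.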

12.3 Replay Drift

Inject:

run replay with newer policy against old events

Expected:

replay uses captured policy version or flags conflict;
drift report generated;
no silent mutation of authoritative state.

13. Readiness and Probe Testing

Kubernetes readiness controls whether a Pod receives traffic. Use failure tests to verify readiness semantics.

Test cases:

Condition	Expected Readiness
DB down	not ready for write service
object storage down	not ready for file operations
scanner down	possibly ready, but uploads are accepted only into quarantine, depending on design
secret expired	not ready
config invalid	not ready
audit outbox too old for high-risk operations	degraded/not ready depending on policy

Do not use liveness to restart pod for every dependency blip. That can cause crash-loop storms. Liveness should detect unrecoverable process failure; readiness should represent traffic safety.

14. Game Day Structure

A game day is a coordinated exercise.

14.1 Roles

Role	Responsibility
Experiment lead	runs experiment
Service owner	observes service behavior
SRE	monitors platform
Security	watches security signals
Incident commander	coordinates if abort needed
Scribe	records evidence/timeline

14.2 Timeline

T-30m: confirm steady state
T-15m: announce experiment
T0: inject fault
T+5m: observe first signals
T+15m: evaluate hypothesis
T+20m: remove fault
T+30m: confirm recovery
T+60m: write findings

14.3 Output

Every game day should produce:

pass/fail result;
observed metrics;
screenshots/dashboard links;
audit/log/tracing evidence;
bugs found;
runbook improvements;
backlog items;
owner and due date.

15. Failure Testing in CI/CD

Not all chaos belongs in production.

15.1 Pull Request Level

unit failure tests;
config validation tests;
secret redaction tests;
lifecycle transition tests;
idempotency tests.
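
A secret redaction test at this level usually checks that a secret's string form never contains the raw value. The sketch below assumes a hypothetical wrapper type, `SecretValue`, that the application would use instead of passing plain strings around.

```java
// Hypothetical wrapper that keeps the raw value out of logs and string
// concatenation by overriding toString().
final class SecretValue {
    private final char[] value;

    SecretValue(char[] value) {
        this.value = value.clone(); // defensive copy
    }

    // Explicit access for the one place that needs the raw value.
    char[] reveal() {
        return value.clone();
    }

    @Override
    public String toString() {
        return "SecretValue[REDACTED]";
    }
}
```

A pull-request-level test can then format the wrapper the way a logger would and assert that the raw value does not appear in the output.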

15.2 Integration Pipeline

dependency timeout tests;
DB restart;
object storage emulator failure;
duplicate message delivery;
malformed config;
missing secret.

15.3 Staging Game Day

Kubernetes pod kill;
network delay/loss;
secret manager outage simulation;
config drift;
storage 503 simulation;
DLQ replay.

15.4 Production Controlled Experiment

Only after staging proof:

tiny blast radius;
synthetic tenant;
clear abort;
owner present;
known dashboard/runbook.

16. Chaos Experiment Examples

16.1 Object Storage 503

## Hypothesis
When object storage returns intermittent 503 for 10 minutes,
file upload API returns retryable errors and no file reaches ACCEPTED without checksum.

## Steady State
- file_upload_completed_total increasing
- file_checksum_verification_failed_total stable
- audit_outbox_oldest_age_seconds < 60
- no metadata_payload_mismatch for accepted files

## Expected
- object_storage_retry_total increases
- file_upload_failed_total{reason="storage_unavailable"} increases
- readiness may remain up if the service can return graceful errors
- alert fires only if SLO breached

16.2 Audit Sink Down

## Hypothesis
If audit sink is down, business transactions still write audit outbox rows,
publisher retries, and high-risk delete operations fail closed if synchronous audit is required.

## Expected
- audit_outbox_pending_total increases
- audit_publish_failure_total increases
- no material state change without audit outbox row
- destructive operation blocked if policy requires

16.3 Secret Manager Down Near Expiry

## Hypothesis
If secret manager is unavailable and cached secret expires soon,
service fails readiness before using invalid credential.

## Expected
- secret_refresh_failure_total increases
- secret_seconds_until_expiry decreases
- readiness not ready below threshold
- dependency auth failures do not spike

17. Anti-Patterns

17.1 Random Failure Injection

No hypothesis, no steady state, no learning.

17.2 No Abort Condition

This is how experiments become incidents.

17.3 Testing Only Infrastructure, Not Domain Invariants

Killing pods is easy. Proving no accepted file lacks checksum is valuable.

17.4 No Evidence Capture

If you do not record results, the organization does not learn.

17.5 Running Chaos Before Observability

If you cannot observe outcome, do not inject fault yet.

17.6 No Follow-Through

Finding weakness without fixing it trains the team to ignore risk.

18. Production Checklist

[ ] Critical invariants defined
[ ] Steady state metrics exist
[ ] Failure scenarios mapped to invariants
[ ] Abort conditions defined
[ ] Blast radius limited
[ ] Synthetic test data available
[ ] Owners present
[ ] Dashboards ready
[ ] Alerts tested
[ ] Runbooks linked
[ ] Audit evidence captured
[ ] Findings turned into backlog
[ ] Experiment repeated after fixes

19. Key Takeaways

Chaos testing validates invariants under controlled fault, not random destruction.
Every experiment needs hypothesis, steady state, blast radius, abort condition, and evidence.
File handling needs failure tests for partial upload, scanner failure, orphan objects, checksum mismatch, and retention delete denial.
Configuration tests must cover missing, invalid, unsafe, stale, mixed, and drifted config.
Secret tests must cover missing secret, lease failure, bad rotation, expiry, and stale consumer.
State tests must cover duplicate event, lost update, replay drift, stale cache, and split brain.
Readiness must represent traffic safety; liveness must not restart pods for every dependency blip.
Game days should produce evidence and corrective actions.
Run chaos only after observability is sufficient.
A system is not production-grade until its failure behavior is tested.

Next, we close the cross-cutting block with Compliance and Regulatory Defensibility: turning technical controls into evidence that can survive audit, legal review, and post-incident scrutiny.

References

Principles of Chaos Engineering: https://principlesofchaos.org/
Chaos Mesh Documentation: https://chaos-mesh.org/docs/
Kubernetes Liveness, Readiness, and Startup Probes: https://kubernetes.io/docs/concepts/workloads/pods/probes/
Kubernetes Configure Probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
Kubernetes Pod Disruption Budgets: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
OpenTelemetry Documentation: https://opentelemetry.io/docs/
