Observability for File, Config, Secret, and State

Learn Java Microservices File Handling, State, Configuration and Secret Management - Part 060

Observability for files, configuration, secrets, and state in Java microservices: metrics, logs, traces, audit events, SLOs, alerts, dashboards, and invariant-driven monitoring.

[2026-07-05] 8 min read, 1549 words

Observability is not “we have dashboards”.

Observability is the ability to ask new questions about a running system without shipping new code.

For this series, observability has a sharper purpose:

Detect invariant stress before it becomes data loss, security failure,
configuration incident, secret outage, or regulatory gap.

Generic RED metrics are useful:

request rate;
error rate;
latency.

But they are not enough for file/state/config/secret systems.

You need signals like:

accepted files without checksum
upload sessions stuck in UPLOADING
metadata-payload mismatch
config reload failures
pods running mixed config versions
secret lease nearing expiry
old credential still used after rotation
stale authorization reads from cache
audit outbox backlog
object storage delete denied by retention

This part builds an observability model for Java microservices that manage runtime artifacts.

OpenTelemetry provides a vendor-neutral framework for generating, collecting, and exporting telemetry data such as traces, metrics, and logs. That gives us the plumbing. The engineering challenge is deciding what to observe.

1. Observability vs Monitoring

Monitoring answers known questions:

Is CPU high?
Is error rate high?
Is service up?

Observability supports unknown investigation:

Why are accepted files increasing but scan completions flat?
Which config version caused upload failures?
Are only pods with secret version v42 failing DB auth?
Did a specific file transition happen before or after object storage PutObject?

You need both.

2. Signal Taxonomy

For this series, use six signal types.

Signal	Purpose
Metrics	aggregate health and alerting
Logs	contextual events and debugging
Traces	causal path across service/dependency
Audit events	accountable material decisions
Health/readiness	traffic routing safety
Reconciliation reports	invariant drift detection

Do not force every signal into one tool. Each has different semantics.

3. Invariant-Driven Observability

Start from the invariant, then derive the signal.

3.1 Example: File Integrity

Invariant:

No file in ACCEPTED state may exist without a verified checksum.

Signals:

file_integrity_missing_total
file_accepted_total
file_checksum_verification_failed_total
file_reconciliation_mismatch_total

Alert:

file_integrity_missing_total > 0

Dashboard:

accepted files per minute;
checksum failures;
reconciliation mismatches;
oldest accepted file without checksum.
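
One way to derive the violation count behind this alert is a periodic check over file metadata. The sketch below is only illustrative: FileState, FileRecord, and FileIntegrityInvariant are hypothetical names, and a real service would more likely run an equivalent query against the metadata DB.

```java
import java.util.List;

/** Hypothetical lifecycle states; only ACCEPTED matters for this invariant. */
enum FileState { UPLOADING, UPLOADED, QUARANTINED, SCANNED, ACCEPTED, REJECTED }

/** Hypothetical metadata row, reduced to the fields the invariant needs. */
final class FileRecord {
    final FileState state;
    final boolean checksumVerified;

    FileRecord(FileState state, boolean checksumVerified) {
        this.state = state;
        this.checksumVerified = checksumVerified;
    }
}

final class FileIntegrityInvariant {
    private FileIntegrityInvariant() {}

    /** Counts ACCEPTED files that lack a verified checksum; any value above zero is a violation. */
    static long countViolations(List<FileRecord> records) {
        return records.stream()
            .filter(r -> r.state == FileState.ACCEPTED && !r.checksumVerified)
            .count();
    }
}
```

Files in other states are ignored on purpose: an UPLOADING file without a checksum is expected, an ACCEPTED one is not.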

3.2 Example: Secret Rotation

Invariant:

A secret must not expire before its consumer refreshes it.

Signals:

secret_seconds_until_expiry
secret_refresh_success_total
secret_refresh_failure_total
secret_current_version_info
dependency_auth_failure_total

Alert:

secret_seconds_until_expiry < 600 AND secret_refresh_failure_total increasing
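
The alert condition combines a threshold with a trend. A minimal sketch of that predicate, assuming the failure counter is sampled at two points in time (SecretExpiryAlert and its parameters are hypothetical; in practice this logic usually lives in the alerting backend):

```java
final class SecretExpiryAlert {
    /** Threshold taken from the alert expression above: 600 seconds. */
    static final long EXPIRY_THRESHOLD_SECONDS = 600;

    private SecretExpiryAlert() {}

    /**
     * Fires only when both conditions hold: the secret is close to expiry
     * and the refresh failure counter grew between two observations.
     */
    static boolean shouldFire(long secondsUntilExpiry, long failuresBefore, long failuresNow) {
        boolean nearExpiry = secondsUntilExpiry < EXPIRY_THRESHOLD_SECONDS;
        boolean failuresIncreasing = failuresNow > failuresBefore;
        return nearExpiry && failuresIncreasing;
    }
}
```

Requiring both conditions avoids paging on a short TTL that a healthy refresh loop is about to fix.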

3.3 Example: Config Rollout

Invariant:

Pods should not run mixed critical config versions beyond the rollout window.

Signals:

config_current_version_info{pod="...", version="..."}
config_reload_success_total
config_validation_failure_total
config_runtime_mixed_version_pods

Alert:

mixed critical config versions > 0 for > expected rollout window
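
A rough sketch of the convergence check, assuming the version reported by each pod has already been collected into a map (ConfigConvergence and its parameters are hypothetical names):

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Map;

final class ConfigConvergence {
    private ConfigConvergence() {}

    /** Number of distinct config versions currently reported by pods. */
    static int distinctVersions(Map<String, String> versionByPod) {
        return new HashSet<>(versionByPod.values()).size();
    }

    /** True when pods still disagree after the allowed rollout window has elapsed. */
    static boolean violatesInvariant(Map<String, String> versionByPod,
                                     Duration sinceRolloutStart,
                                     Duration rolloutWindow) {
        return distinctVersions(versionByPod) > 1
            && sinceRolloutStart.compareTo(rolloutWindow) > 0;
    }
}
```

Mixed versions inside the window are normal during a rollout, so the time condition is part of the invariant rather than an afterthought.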

4. Metrics Design

Metrics should be:

bounded in cardinality;
semantically stable;
action-oriented;
connected to an SLO or invariant;
redacted.

4.1 File Metrics

file_upload_started_total{tenant_class, file_type}
file_upload_completed_total{tenant_class, file_type}
file_upload_failed_total{reason}
file_upload_bytes_total{file_type}
file_upload_active_sessions
file_upload_stale_sessions_total
file_upload_session_age_seconds
file_checksum_verification_failed_total{reason}
file_lifecycle_transition_total{from,to}
file_lifecycle_transition_denied_total{from,to,reason}
file_scan_requested_total
file_scan_completed_total{decision}
file_scan_pending_age_seconds
file_download_granted_total{file_type}
file_download_denied_total{reason}
file_presigned_url_issued_total{method}
file_orphan_object_total
file_metadata_payload_mismatch_total

Avoid labels:

raw fileId;
filename;
user email;
object key;
presigned URL;
unbounded case ID.
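
A label guard can enforce this list in tests or at registration time. The sketch below assumes hypothetical label names for the items above; MetricLabelGuard is not part of any metrics library.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class MetricLabelGuard {
    /** Hypothetical label names corresponding to the "avoid" list above. */
    private static final Set<String> FORBIDDEN = Set.of(
        "fileId", "filename", "email", "objectKey", "presignedUrl", "caseId");

    private MetricLabelGuard() {}

    /** Returns the forbidden label names present in the given label set, sorted for stable output. */
    static List<String> forbiddenLabels(Map<String, String> labels) {
        return labels.keySet().stream()
            .filter(FORBIDDEN::contains)
            .sorted()
            .collect(Collectors.toList());
    }
}
```

An empty result means the label set passed the check; a unit test can assert that for every metric the service registers.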

4.2 Object Storage Metrics

object_storage_request_total{operation,status}
object_storage_request_duration_seconds{operation}
object_storage_retry_total{operation,reason}
object_storage_error_total{operation,error_class}
object_storage_put_bytes_total
object_storage_get_bytes_total
object_storage_multipart_active_uploads
object_storage_multipart_aborted_total
object_storage_multipart_stale_uploads_total
object_storage_egress_bytes_total{destination_class}

Important dimensions:

operation: put, get, delete, copy, complete_multipart, abort_multipart;
status: success, failure;
error class: throttled, timeout, access_denied, not_found, conflict.
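
To keep error_class bounded, raw failures need to be mapped to a fixed vocabulary. The mapping below is an assumption for illustration (in particular treating 503 as throttling and 412 as conflict, and adding an "other" fallback that is not in the list above); adjust it to the storage provider in use.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

final class StorageErrorClassifier {
    private StorageErrorClassifier() {}

    /** Maps an HTTP status code to a bounded error_class label value. */
    static String fromHttpStatus(int status) {
        switch (status) {
            case 403: return "access_denied";
            case 404: return "not_found";
            case 409:
            case 412: return "conflict";
            case 429:
            case 503: return "throttled";
            default:  return "other";
        }
    }

    /** Maps a client-side exception to a bounded error_class label value. */
    static String fromException(Throwable error) {
        if (error instanceof SocketTimeoutException || error instanceof TimeoutException) {
            return "timeout";
        }
        return "other";
    }
}
```

The exception message never reaches the label, which keeps both cardinality and data leakage under control.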

4.3 State Metrics

state_transition_total{entity,from,to}
state_transition_denied_total{entity,reason}
state_optimistic_lock_conflict_total{entity}
state_replay_started_total{job}
state_replay_completed_total{job,status}
state_replay_conflict_total{job}
state_reconciliation_mismatch_total{type}
cache_hit_total{cache}
cache_miss_total{cache}
cache_stale_read_total{cache}
cache_invalidation_total{cache,source}

For an authorization cache, explicitly expose:

authorization_cache_stale_decision_total
authorization_cache_forced_source_check_total

4.4 Config Metrics

config_validation_success_total
config_validation_failure_total{reason}
config_reload_success_total{config_group}
config_reload_failure_total{config_group,reason}
config_current_version_info{service,environment,version}
config_drift_detected_total{source}
config_unsafe_value_blocked_total{key_class}
config_runtime_mixed_version_pods

Do not use raw config keys or values as labels when they are sensitive or high-cardinality.

4.5 Secret Metrics

secret_load_success_total{secret}
secret_load_failure_total{secret,reason}
secret_refresh_success_total{secret}
secret_refresh_failure_total{secret,reason}
secret_seconds_until_expiry{secret}
secret_rotation_started_total{secret}
secret_rotation_completed_total{secret}
secret_old_version_usage_total{secret}
secret_access_denied_total{secret}
dependency_auth_failure_total{dependency,reason}

Do not expose:

secret value;
token prefix;
full version ID if sensitive/high-cardinality.

4.6 Audit Metrics

audit_event_created_total{event_type}
audit_event_publish_success_total
audit_event_publish_failure_total{reason}
audit_outbox_pending_total
audit_outbox_oldest_age_seconds
audit_event_duplicate_total
audit_sink_latency_seconds

Audit metrics are crucial. If the audit path fails silently, the system's claim of accountability is false.

5. Logs Design

Logs should provide context, not raw data.

5.1 Event-Style Logs

Good:

INFO file_lifecycle_transition fileId=FILE-01JZ from=SCANNED to=ACCEPTED actorType=SERVICE correlationId=req-abc

Bad:

INFO accepted uploaded file john-smith-report.pdf with url https://...
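
One way to make the good form the default is to render event logs through an allowlist. This is a sketch with a hypothetical EventLogFormatter and field set; it assumes values were validated upstream and contain no whitespace or newlines.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class EventLogFormatter {
    /** Hypothetical allowlist; anything else (filename, URL, token) is dropped. */
    private static final Set<String> ALLOWED_FIELDS =
        Set.of("fileId", "from", "to", "actorType", "correlationId");

    private EventLogFormatter() {}

    /** Renders "eventName key=value ..." using only allowlisted fields, in the map's iteration order. */
    static String format(String eventName, Map<String, String> fields) {
        String rendered = fields.entrySet().stream()
            .filter(e -> ALLOWED_FIELDS.contains(e.getKey()))
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(" "));
        return rendered.isEmpty() ? eventName : eventName + " " + rendered;
    }
}
```

Passing a LinkedHashMap keeps the field order stable, which makes log lines easier to scan and to parse.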

5.2 Log Levels

Level	Usage
DEBUG	local/staging diagnostics; avoid sensitive data
INFO	material operational events, state transitions
WARN	recoverable anomaly or invariant stress
ERROR	failed operation requiring attention
SECURITY/AUDIT	separate channel if framework supports

Avoid using ERROR for expected denied operations. Use a structured event and a metric instead.

5.3 Log Sampling

Sampling can hide rare security events.

Do not sample:

access denied high-risk events;
secret access denied;
file deletion attempts;
config unsafe value blocked;
audit publish failure.

Can sample:

repeated transient storage retry logs;
high-frequency health check logs;
noisy expected validation errors, while metrics count all.
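
The two lists can be expressed as a small sampling policy. The event names below are hypothetical stand-ins for the "do not sample" items, and the keep-one-in-N rule is just one simple sampling strategy.

```java
import java.util.Set;

final class LogSamplingPolicy {
    /** Hypothetical event names for the "do not sample" list above. */
    private static final Set<String> NEVER_SAMPLED = Set.of(
        "access_denied_high_risk", "secret_access_denied", "file_deletion_attempt",
        "config_unsafe_value_blocked", "audit_publish_failure");

    private LogSamplingPolicy() {}

    /**
     * Decides whether the n-th occurrence of an event should be logged.
     * Protected events are always logged; others keep one in every keepOneIn.
     */
    static boolean shouldLog(String eventType, long occurrence, int keepOneIn) {
        if (NEVER_SAMPLED.contains(eventType)) {
            return true;
        }
        return keepOneIn <= 1 || occurrence % keepOneIn == 0;
    }
}
```

Metrics should still count every occurrence, sampled or not, so the dropped log lines remain visible in aggregate.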

6. Tracing Design

Tracing helps answer causality:

Request -> metadata DB -> object storage -> scan worker -> audit sink

6.1 Trace Spans for File Upload

Span names:

POST /files/upload-sessions
db.evidence_file.insert
object_storage.verify_object
audit.outbox.insert
file.transition.UPLOADING_TO_UPLOADED

Attributes:

file.lifecycle.from=UPLOADING
file.lifecycle.to=UPLOADED
file.size.bucket=lt_100mb
storage.operation=head_object
audit.event_type=FILE_PAYLOAD_RECEIVED

Avoid:

filename;
presigned URL;
auth token;
raw object key if it leaks semantics;
request body.
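
An attribute such as file.size.bucket can be derived with a small bucketing function. The bucket boundaries and names other than lt_100mb are assumptions for illustration.

```java
final class FileSizeBucket {
    private static final long MB = 1024L * 1024L;

    private FileSizeBucket() {}

    /** Returns a coarse, bounded bucket name instead of the exact byte count. */
    static String of(long sizeBytes) {
        if (sizeBytes < MB) {
            return "lt_1mb";
        }
        if (sizeBytes < 10 * MB) {
            return "lt_10mb";
        }
        if (sizeBytes < 100 * MB) {
            return "lt_100mb";
        }
        return "gte_100mb";
    }
}
```

Bucketing keeps the attribute useful for latency analysis without turning exact sizes into a fingerprint for individual files.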

6.2 Async Trace Propagation

For outbox/worker:

propagate traceparent if appropriate;
preserve correlation ID;
add causation ID/event ID.

If tracing breaks at the async boundary, forensic debugging becomes harder.
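
A worker that receives a traceparent value from an outbox message should validate it before using it. The sketch below handles only version 00 of the W3C Trace Context header and uses a hypothetical TraceParentHeader class; an OpenTelemetry propagator would normally do this work.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParentHeader {
    /** Version 00 layout: version-traceid-parentid-flags, all lowercase hex. */
    private static final Pattern FORMAT =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}");

    private TraceParentHeader() {}

    /** Extracts the trace ID from a well-formed header; empty if malformed or all zeros. */
    static Optional<String> traceId(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher matcher = FORMAT.matcher(header);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        String traceId = matcher.group(1);
        String parentId = matcher.group(2);
        // The specification treats all-zero IDs as invalid.
        if (isAllZeros(traceId) || isAllZeros(parentId)) {
            return Optional.empty();
        }
        return Optional.of(traceId);
    }

    private static boolean isAllZeros(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

When the result is empty, the worker can start a new trace and still log the correlation ID and event ID, so the causal chain is not lost entirely.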

7. Health and Readiness

Health is more than "the process is alive".

7.1 Liveness

Liveness answers:

Should Kubernetes restart this container?

Use liveness for deadlock/fatal state, not dependency blips.

7.2 Readiness

Readiness answers:

Should this pod receive traffic?

Readiness should fail when required runtime dependencies or invariants make request handling unsafe.

Examples:

Condition	Readiness Impact
required config invalid	not ready
secret missing/expired	not ready
DB unavailable	usually not ready
object storage unavailable	not ready for file endpoints
audit outbox temporarily pending	maybe ready
audit outbox too old for high-risk service	not ready/degraded
scanner queue backlog high	maybe ready but degrade upload acceptance
cache unavailable	depends on fallback
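
Part of this table can be sketched as a pure decision function. ReadinessState and ReadinessEvaluator are hypothetical, and mapping an overly old audit outbox to DEGRADED rather than NOT_READY is an assumption; a high-risk service might choose the stricter option.

```java
enum ReadinessState { READY, DEGRADED, NOT_READY }

final class ReadinessEvaluator {
    private ReadinessEvaluator() {}

    /** Combines hard requirements (config, secret, DB) with a soft limit on audit outbox age. */
    static ReadinessState evaluate(boolean configValid,
                                   boolean secretValid,
                                   boolean dbAvailable,
                                   long auditOutboxOldestAgeSeconds,
                                   long auditOutboxMaxAgeSeconds) {
        if (!configValid || !secretValid || !dbAvailable) {
            return ReadinessState.NOT_READY;
        }
        if (auditOutboxOldestAgeSeconds > auditOutboxMaxAgeSeconds) {
            return ReadinessState.DEGRADED;
        }
        return ReadinessState.READY;
    }
}
```

Keeping the decision free of I/O makes it easy to unit test; the health endpoint only has to gather the inputs.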

7.3 Component Health

Expose safe component health:

{
  "status": "DEGRADED",
  "components": {
    "metadataDb": "UP",
    "objectStorage": "UP",
    "scanner": "DEGRADED",
    "auditOutbox": "UP",
    "secret:evidence-db": "UP",
    "config": "UP"
  }
}

Do not expose secret values, URLs with credentials, or internal topology publicly.

8. SLO Design

SLOs should reflect user and invariant outcomes.

8.1 File Upload SLO

Bad SLO:

HTTP 200 rate > 99.9%

Better:

99.5% of valid upload sessions under 100MB complete and reach QUARANTINED
within 2 minutes.
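
To show how such an SLO could be measured, the sketch below computes compliance from observed session outcomes. UploadSlo is a hypothetical helper; sessions that never reached QUARANTINED count against the objective.

```java
import java.time.Duration;
import java.util.List;

final class UploadSlo {
    /** Targets taken from the SLO text above. */
    static final Duration TARGET = Duration.ofMinutes(2);
    static final double OBJECTIVE = 0.995;

    private UploadSlo() {}

    /**
     * Fraction of valid sessions that reached QUARANTINED within the target.
     * neverCompleted is the number of valid sessions that did not get there at all.
     */
    static double compliance(List<Duration> timeToQuarantine, int neverCompleted) {
        int total = timeToQuarantine.size() + neverCompleted;
        if (total == 0) {
            return 1.0; // no valid sessions, so nothing violated the objective
        }
        long good = timeToQuarantine.stream()
            .filter(d -> d.compareTo(TARGET) <= 0)
            .count();
        return (double) good / total;
    }

    static boolean meetsObjective(double compliance) {
        return compliance >= OBJECTIVE;
    }
}
```

Unlike an HTTP 200 rate, this number drops when uploads are accepted by the API but stall before quarantine.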

8.2 File Processing SLO

99% of uploaded files are scanned and accepted/rejected within 10 minutes.

8.3 Download SLO

99.9% of authorized download grant requests complete within 500ms,
excluding object storage client download time for direct downloads.

8.4 Secret SLO

Required production secrets have at least 15 minutes of valid TTL remaining
for 99.99% of service-minutes.

8.5 Config SLO

Critical config rollout converges to a single validated version across ready pods
within 10 minutes.

8.6 Audit SLO

99.9% of material audit events are durably stored within 60 seconds.
No high-risk destructive operation commits without durable audit intent.

9. Alert Design

Alerts should be actionable.

9.1 Bad Alerts

CPU > 80%
log contains "error"
S3 request failed once
cache miss rate high

These may be useful dashboard signals, but they make poor pages.

9.2 Good Alerts

Accepted file without checksum > 0
Secret expires in < 10m and refresh failing
Audit outbox oldest pending > 5m for prod
Config validation failed on new rollout
Pods mixed critical config version for > 15m
File scan pending p95 age > SLO
Object storage access denied spike after deploy
Metadata-payload mismatch detected
Old DB credential still used after rotation window

Every alert should have:

owner;
severity;
runbook;
dashboard link;
expected action;
false-positive guidance.
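
This checklist can be enforced mechanically, for example in a CI step over alert definitions. The sketch assumes alerts are available as simple key/value maps and uses hypothetical field names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class AlertDefinitionCheck {
    /** Hypothetical field names for the required items listed above. */
    private static final List<String> REQUIRED = List.of(
        "owner", "severity", "runbook", "dashboard", "expectedAction", "falsePositiveGuidance");

    private AlertDefinitionCheck() {}

    /** Returns the required fields that are absent or blank, in checklist order. */
    static List<String> missingFields(Map<String, String> alert) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED) {
            String value = alert.get(field);
            if (value == null || value.isBlank()) {
                missing.add(field);
            }
        }
        return missing;
    }
}
```

An alert with a non-empty result would fail review before it can page anyone.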

10. Dashboards

10.1 File Platform Dashboard

Sections:

upload started/completed/failed;
active upload sessions;
stale sessions;
object storage error/latency;
scan queue depth and pending age;
lifecycle transitions;
download grants/denies;
metadata-payload mismatch;
orphan object count;
audit event backlog.

10.2 Config Dashboard

current config version per pod;
validation failures;
reload successes/failures;
drift detections;
unsafe values blocked;
rollout convergence time;
config source availability.

10.3 Secret Dashboard

secret version per service/pod;
seconds until expiry;
refresh success/failure;
dependency auth failures;
old credential usage;
rotation progress;
secret manager latency/error rate.

10.4 State Dashboard

state transition counts;
invalid transition attempts;
optimistic lock conflicts;
cache hit/miss/stale reads;
replay/reconciliation status;
DLQ counts;
oldest unprocessed event.

11. Java Instrumentation Patterns

11.1 Micrometer Counter

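import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public final class FileMetrics {
    private final Counter uploadStarted;
    private final Counter uploadFailed;

    public FileMetrics(MeterRegistry registry) {
        // Counters are registered once and reused for every increment.
        this.uploadStarted = Counter.builder("file_upload_started_total")
            .description("Number of file uploads started")
            .register(registry);

        // The tag value is fixed at registration, so this counter can only
        // ever report reason="unknown".
        this.uploadFailed = Counter.builder("file_upload_failed_total")
            .description("Number of file uploads failed")
            .tag("reason", "unknown")
            .register(registry);
    }

    public void uploadStarted() {
        uploadStarted.increment();
    }

    public void uploadFailed() {
        uploadFailed.increment();
    }
}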

Be careful with dynamic tags. Do not create a counter per file/user/error message.

Better:

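// Assumes the MeterRegistry is kept in a field named registry.
public void uploadFailed(String reason) {
    // register() returns the already-registered counter for a given
    // name and tag set, so repeated calls do not create duplicates.
    Counter.builder("file_upload_failed_total")
        .tag("reason", normalizeReason(reason))
        .register(registry)
        .increment();
}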

Where normalizeReason returns bounded values:

size_limit
auth_denied
storage_timeout
checksum_mismatch
scan_rejected
unknown
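
A minimal sketch of such a normalizer, assuming callers pass reason codes rather than free-form error messages (ReasonNormalizer is a hypothetical holder class):

```java
import java.util.Locale;
import java.util.Set;

final class ReasonNormalizer {
    /** The bounded label values listed above, apart from the "unknown" fallback. */
    private static final Set<String> ALLOWED = Set.of(
        "size_limit", "auth_denied", "storage_timeout",
        "checksum_mismatch", "scan_rejected");

    private ReasonNormalizer() {}

    /** Maps any input to one of the allowed values, or to "unknown". */
    static String normalizeReason(String reason) {
        if (reason == null) {
            return "unknown";
        }
        String candidate = reason.trim().toLowerCase(Locale.ROOT);
        return ALLOWED.contains(candidate) ? candidate : "unknown";
    }
}
```

Anything outside the allowlist collapses to unknown, so an exception message or file ID passed by mistake cannot create a new time series.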

11.2 Timer for Object Storage

public <T> T timedStorageCall(String operation, Supplier<T> supplier) {
    return Timer.builder("object_storage_request_duration_seconds")
        .tag("operation", operation)
        .register(meterRegistry)
        .record(supplier);
}

Again, operation must be bounded.

11.3 Gauge for Audit Backlog

Gauge.builder("audit_outbox_pending_total", repository, AuditOutboxRepository::countPending)
    .description("Pending audit outbox events")
    .register(registry);

The gauge query must be efficient. Do not run an expensive DB count on every scrape against a huge table without some optimization, such as an index, an approximate count, or a cached value.
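
One possible optimization is to cache the count for a short time so that scrapes do not each hit the database. CachedCount below is a hypothetical helper with an injectable clock for testing; the TTL is a trade-off between freshness and load.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

final class CachedCount {
    private final LongSupplier expensiveCount;
    private final LongSupplier clockMillis;
    private final long ttlMillis;
    private long cachedValue;
    private long loadedAtMillis;
    private boolean loaded;

    CachedCount(LongSupplier expensiveCount, Duration ttl, LongSupplier clockMillis) {
        this.expensiveCount = expensiveCount;
        this.ttlMillis = ttl.toMillis();
        this.clockMillis = clockMillis;
    }

    /** Returns the cached value, reloading it only when the TTL has elapsed. */
    synchronized long get() {
        long now = clockMillis.getAsLong();
        if (!loaded || now - loadedAtMillis >= ttlMillis) {
            cachedValue = expensiveCount.getAsLong();
            loadedAtMillis = now;
            loaded = true;
        }
        return cachedValue;
    }
}
```

The gauge would then read from the cache, for example by passing a CachedCount instance and CachedCount::get to Gauge.builder, with System::currentTimeMillis as the clock in production.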

11.4 OpenTelemetry Span

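import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

// tracer is an io.opentelemetry.api.trace.Tracer obtained elsewhere.
Span span = tracer.spanBuilder("file.transition")
    .setAttribute("file.lifecycle.from", from.name())
    .setAttribute("file.lifecycle.to", to.name())
    .setAttribute("file.type", safeFileType)
    .startSpan();

// makeCurrent() puts the span in the current context for nested calls.
try (Scope ignored = span.makeCurrent()) {
    transition();
    span.setStatus(StatusCode.OK);
} catch (Exception ex) {
    // Record the failure on the span, then let the caller handle it.
    span.recordException(ex);
    span.setStatus(StatusCode.ERROR);
    throw ex;
} finally {
    // Always end the span, otherwise it is never exported.
    span.end();
}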

Do not attach raw filename or secret values.

12. Correlation IDs

Correlation ID must flow across:

HTTP ingress;
logs;
traces;
audit events;
outbox messages;
workers;
storage metadata if safe;
DLQ.

12.1 HTTP Filter

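import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Optional;
import java.util.UUID;
import org.slf4j.MDC;

// Imports assume the jakarta.servlet namespace; older stacks use javax.servlet.
public final class CorrelationIdFilter implements Filter {
    private static final String HEADER = "X-Correlation-ID";
    private static final String MDC_KEY = "correlationId";

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {

        HttpServletRequest http = (HttpServletRequest) request;
        // Accept the caller's ID only if it passes validation; otherwise generate one.
        String correlationId = Optional.ofNullable(http.getHeader(HEADER))
            .filter(this::isValid)
            .orElseGet(() -> UUID.randomUUID().toString());

        try {
            MDC.put(MDC_KEY, correlationId);
            ((HttpServletResponse) response).setHeader(HEADER, correlationId);
            chain.doFilter(request, response);
        } finally {
            // Remove only our key so other MDC entries are left intact.
            MDC.remove(MDC_KEY);
        }
    }

    private boolean isValid(String value) {
        // Bounded length and a restricted character set block log injection.
        return value.length() <= 128 && value.matches("[A-Za-z0-9._-]+");
    }
}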

Do not trust the length or content of an incoming correlation ID. Otherwise an attacker can inject forged log lines or flood the system with high-cardinality values.

13. Reconciliation Observability

Reconciliation is a first-class signal.

Examples:

reconciliation_run_started_total{job}
reconciliation_run_completed_total{job,status}
reconciliation_duration_seconds{job}
reconciliation_mismatch_total{type}
reconciliation_repaired_total{type}
reconciliation_failed_total{type}

For file metadata-payload reconciliation:

mismatch_type:
- metadata_without_object
- object_without_metadata
- checksum_mismatch
- invalid_lifecycle
- retention_policy_mismatch

Alert:

metadata_without_object > 0 for accepted files

Not all mismatches have the same severity. A temporary orphan in the upload temp prefix may be low severity; accepted metadata without an object is high severity.
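
The severity distinction can be encoded next to the reconciliation job. In the sketch below only the two cases named in the text (temporary orphan is low, accepted metadata without object is high) come from this lesson; every other mapping, and the names MismatchLevel and MismatchSeverity, are assumptions.

```java
enum MismatchLevel { LOW, MEDIUM, HIGH }

final class MismatchSeverity {
    private MismatchSeverity() {}

    /** Classifies a reconciliation mismatch using the file's state and the object's location. */
    static MismatchLevel classify(String mismatchType, boolean fileAccepted, boolean inTempPrefix) {
        switch (mismatchType) {
            case "metadata_without_object":
                // Accepted metadata pointing at nothing means a payload is missing.
                return fileAccepted ? MismatchLevel.HIGH : MismatchLevel.MEDIUM;
            case "object_without_metadata":
                // Orphans in the temp prefix are usually abandoned uploads.
                return inTempPrefix ? MismatchLevel.LOW : MismatchLevel.MEDIUM;
            case "checksum_mismatch":
                return MismatchLevel.HIGH; // assumption: integrity failures always page
            default:
                return MismatchLevel.MEDIUM; // assumption: unknown types get reviewed
        }
    }
}
```

Routing alerts on the classified level rather than on the raw mismatch count keeps abandoned temp uploads from paging anyone.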

14. DLQ Observability

DLQ must be monitored.

Signals:

dlq_messages_total{queue,reason}
dlq_oldest_message_age_seconds{queue}
dlq_replay_success_total{queue}
dlq_replay_failure_total{queue}

Runbook should answer:

is message safe to replay?
is operation idempotent?
does payload contain sensitive data?
what caused failure?
should it be repaired manually or code-fixed?

15. Observability for Cost

File systems can fail economically.

Metrics:

storage_bytes_total{class}
storage_object_count{prefix_class}
storage_egress_bytes_total{destination_class}
multipart_incomplete_bytes_total
archive_restore_request_total
presigned_download_bytes_total
scan_compute_seconds_total

Alerts:

incomplete multipart bytes > threshold
egress bytes spike
archive restore requests spike
orphan object count increasing
storage class transition failures

Cost observability is reliability observability. Runaway upload abuse can be both a DoS and a cost incident.

16. Observability for Security Controls

Security controls need health metrics.

malware_scan_required_effective{service="evidence"} 1
malware_scan_bypass_attempt_total
authorization_denied_total{operation}
secret_redaction_test_failure_total
config_unsafe_value_blocked_total
rbac_access_denied_total
kms_decrypt_denied_total

If scan is disabled accidentally, you want a binary signal:

malware_scan_required_effective == 0 in prod

Alert immediately.

17. Observability Anti-Patterns

17.1 Dashboard Theater

Many graphs, no action.

Fix:

Every dashboard panel should map to a question or invariant.

17.2 High Cardinality Labels

Bad:

file_download_total{fileId="FILE-..."}

This can hurt the metrics backend and leak data.

17.3 Logging Everything

More logs can reduce observability if noise hides signal and leaks data.

17.4 No Async Correlation

Worker logs without correlation ID make event-driven systems hard to debug.

17.5 Only Infrastructure Metrics

CPU/memory is not enough. You need domain and invariant metrics.

17.6 Alert on Symptoms Too Late

Example: the page that fires is

DB CPU high

but only after a secret rotation has already caused an auth failure and retry storm.

Better alert earlier:

dependency_auth_failure_total spike after secret version change

18. Observability Review Template

# Observability Review

## Service
- Name:
- Owner:
- Runtime:
- Critical artifacts:

## Invariants
| Invariant | Metric | Log | Trace | Audit | Alert |
|---|---|---|---|---|---|

## Metrics
- Counters:
- Gauges:
- Histograms:
- Cardinality review:

## Logs
- Structured fields:
- Redaction:
- Sampling:
- Retention:

## Traces
- Span boundaries:
- Attributes:
- Async propagation:

## Health
- Liveness:
- Readiness:
- Dependency checks:
- Degraded states:

## Alerts
| Alert | Severity | Owner | Runbook | SLO |
|---|---|---|---|---|

## Dashboards
- Operational:
- Security:
- Cost:
- Compliance:

## Failure Drills
- ...

19. Testing Observability

Observability must be tested.

19.1 Metric Test

Given upload fails due to checksum mismatch
Then file_upload_failed_total{reason="checksum_mismatch"} increments
And no label contains fileId or filename

19.2 Log Test

Given request contains Authorization header
When operation fails
Then logs do not contain token
And logs contain correlationId

19.3 Trace Test

Given file upload request
Then trace contains spans for DB, storage, audit
And does not contain request body or presigned URL

19.4 Alert Test

Synthetic failure:

Force secret_seconds_until_expiry below threshold with refresh failure
Verify alert fires and runbook is correct

19.5 Dashboard Review

During game day, ask:

Can an engineer answer what is broken, what is affected,
and what action to take within 5 minutes?

20. Key Takeaways

Observability should be driven by invariants, not generic dashboards.
File systems need metrics for lifecycle, scan latency, object mismatch, orphan objects, and upload/session health.
Config observability must expose version, validation, reload, drift, and rollout convergence.
Secret observability must expose TTL, refresh, rotation, old-version usage, and dependency auth failures.
State observability must expose transition conflicts, replay drift, stale cache, and reconciliation mismatch.
Audit pipeline itself needs metrics and alerts.
Tracing should preserve causality across async boundaries without leaking sensitive data.
Readiness should represent whether service can safely receive traffic, not just whether JVM is alive.
Cost and security controls are part of observability.
Every alert needs owner, runbook, severity, and action.

Next, we intentionally break things: Chaos and Failure Testing for file, config, secret, and state systems.

References

OpenTelemetry Documentation: https://opentelemetry.io/docs/
OpenTelemetry Logs Data Model: https://opentelemetry.io/docs/specs/otel/logs/data-model/
OpenTelemetry Handling Sensitive Data: https://opentelemetry.io/docs/security/handling-sensitive-data/
Spring Boot Actuator Endpoints: https://docs.spring.io/spring-boot/reference/actuator/endpoints.html
Micrometer Documentation: https://docs.micrometer.io/micrometer/reference/
OWASP Logging Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html
