
Learn Production Grade Contract-First Java Orchestration Platform - Part 038

Observability design for a production-grade contract-first Java orchestration platform, covering structured logging, metrics, tracing, correlation, SLOs, dashboards, alerting, and failure drills across NGINX, Jersey, Kafka, PostgreSQL, Camunda 7, and Kubernetes.

2026-07-03 · 24 min read · 4797 words


Part 038 — Observability: Logging, Metrics, Tracing

The previous part covered the edge. This part covers how to know what is happening once a request has entered the system.

Observability is not "installing a dashboard". Observability is the ability to answer production questions we did not know to ask in advance.

Examples of real questions:

Why did partner A receive a 504 even though the case was created successfully?
Is Kafka lag causing SLA tasks to run late?
Does the Camunda incident storm come from one delegate or from many process definitions?
Is PostgreSQL lock contention causing API timeouts?
Did the new release increase p95 latency for POST /v1/cases?
Which event sent a particular process instance down the escalation path?
Do duplicate commands come from client retries, proxy timeouts, or Kafka replay?
Why is the officer queue empty even though cases have come in?

If telemetry cannot answer questions like these, the system is not yet production-grade.

1. Mental Model: Telemetry Is Not Observability

Telemetry is data:

log;
metric;
trace;
event;
profile;
audit record;
health signal.

Observability is the ability to use that data to understand the internal behavior of the system.

A poor observability system has a lot of data but few answers.

A good observability system has data that:

is structured;
is consistent;
is low-cardinality for metrics;
is high-context for logs/traces;
does not leak sensitive data;
can be correlated across layers;
has an owner;
is used for alerts and runbooks;
is tested through failure drills.

2. Observability Contract for This Platform

Our stack: NGINX, Jersey, PostgreSQL, Kafka, Camunda 7, and Kubernetes.

We need to be able to follow the end-to-end flow:

HTTP request -> Java command -> PostgreSQL transaction -> outbox row -> Kafka event -> consumer inbox -> Camunda process/message -> human task/SLA

Canonical correlation fields:

Field	Scope	Goes into
`request_id`	one HTTP request	edge log, app log, response, audit technical context
`trace_id`	distributed trace	spans/logs/metrics exemplars
`correlation_id`	business operation	command, event, audit, process integration
`case_id`	aggregate	domain log, event payload/header, audit, task projection
`event_id`	produced event	outbox, Kafka header/payload, consumer inbox
`idempotency_key`	command dedupe	request log, DB idempotency table, audit
`process_instance_id`	Camunda runtime	process log, incident, operation log
`business_key`	domain-process bridge	Camunda correlation, logs, event metadata
`tenant_id`	isolation boundary	logs/metrics with caution, auth decision, audit
`actor_id`	user/service actor	audit; logs only if safe/pseudonymized

Do not put high-cardinality values blindly in metrics labels. case_id, request_id, and event_id usually belong in logs/traces, not metric labels.
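One way to keep these fields consistent across layers is to carry them in a single value object. The sketch below is only an illustration: the `CorrelationContext` record and its methods are hypothetical names, and the header names reuse the Kafka header list from section 6.2. It covers only four of the fields in the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical carrier for some canonical correlation fields; not part of any library. */
record CorrelationContext(String requestId, String correlationId, String eventId, String caseId) {

  /** Renders the non-blank fields as string headers, e.g. for a Kafka record. */
  Map<String, String> toHeaders() {
    Map<String, String> headers = new LinkedHashMap<>();
    putIfPresent(headers, "x-request-id", requestId);
    putIfPresent(headers, "x-correlation-id", correlationId);
    putIfPresent(headers, "x-event-id", eventId);
    putIfPresent(headers, "x-case-id", caseId);
    return headers;
  }

  /** Rebuilds the context on the consumer side; missing headers become null. */
  static CorrelationContext fromHeaders(Map<String, String> headers) {
    return new CorrelationContext(
        headers.get("x-request-id"),
        headers.get("x-correlation-id"),
        headers.get("x-event-id"),
        headers.get("x-case-id"));
  }

  private static void putIfPresent(Map<String, String> target, String name, String value) {
    if (value != null && !value.isBlank()) target.put(name, value);
  }
}
```

The same object can feed log fields and span attributes, which keeps field names from drifting between runtimes. It should not be used to build metric labels.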

3. Golden Signals, RED, and USE

Google SRE popularized four golden signals:

latency;
traffic;
errors;
saturation.

For services, RED is often useful:

Rate;
Errors;
Duration.

For resources, USE is useful:

Utilization;
Saturation;
Errors.

Mapping to our platform:

Layer	Primary model
NGINX/API endpoint	RED + golden signals
Java service/JVM	RED + USE
PostgreSQL	USE + query latency/error
Kafka producer/consumer	lag/rate/error/duration
Camunda 7	job backlog, incident count, process duration
Kubernetes node/pod	USE + restart/availability
Business workflow	domain SLO, SLA breach, queue aging

Observability has to combine technical and domain signals.

Technical-only monitoring can report "all Pods healthy" while the case queue is stuck.

Domain-only monitoring can report "SLA breaches rising" without revealing that the root cause is Kafka lag or a DB lock.

4. Logging Contract

A log is a record of a discrete event.

A good log answers:

what happened;
when;
in which service;
for which request/event/case/process;
at what severity;
with what outcome;
with which error code;
whether it is retryable;
how long it took;
what the next action is.

4.1 JSON log schema baseline

{
  "timestamp": "2026-07-03T10:15:30.123Z",
  "level": "INFO",
  "service": "case-api",
  "environment": "prod",
  "version": "1.12.3",
  "message": "Case intake command accepted",
  "event_type": "case.command.accepted",
  "request_id": "req-abc",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "correlation_id": "corr-123",
  "case_id": "case-2026-000001",
  "actor_type": "PARTNER_SYSTEM",
  "route": "POST /v1/cases",
  "http_status": 202,
  "duration_ms": 83,
  "error_code": null
}

4.2 Required log fields by runtime

Runtime	Required fields
NGINX	request id, host, method, path, status, request time, upstream status/time
Jersey API	request id, trace id, route template, status, duration, command type
PostgreSQL adapter	operation name, duration, SQLSTATE on error, row count, transaction id if available
Outbox publisher	event id, outbox id, topic, key, attempt, status, duration
Kafka consumer	event id, topic, partition, offset, consumer group, attempt, handling result
Camunda adapter	business key, process key, process instance id, message name, correlation result
Camunda delegate	activity id, process definition key, process instance id, job id, retry count
Kubernetes workload	pod name, namespace, container, version, restart context

4.3 Log levels

Level	Use
DEBUG	local diagnosis, disabled by default in prod
INFO	important state transition/outcome
WARN	recoverable abnormal behavior requiring attention if frequent
ERROR	failed operation requiring operator/developer diagnosis

Do not log every successful SQL query at INFO.

Do not log an expected validation failure as ERROR.

Do not log a stack trace for a business error such as an invalid state transition unless it indicates a bug.

4.4 Safe logging

Never log:

raw access token;
password;
API key;
full Authorization header;
full evidence payload;
large request body;
personally sensitive data unless explicitly approved;
unredacted document content;
database connection string with password;
private key/certificate.

Use redaction:

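public final class SafeLog {
  private SafeLog() {}

  // Short values are fully redacted: showing 4 + 4 characters of an
  // 8-character token would reveal the whole token.
  public static String token(String value) {
    if (value == null || value.length() < 16) return "<redacted>";
    return value.substring(0, 4) + "..." + value.substring(value.length() - 4);
  }
}

Even so, the best practice is not to pass sensitive values to the logger at all.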

5. Metrics Contract

A metric is a numeric measurement of the runtime.

A metric must:

have a stable name;
have a unit;
have low-cardinality labels;
have clear semantics;
have an owner;
have a dashboard/alert when it matters;
not store unique IDs as labels.

5.1 HTTP metrics

Metric	Type	Labels
`http_server_requests_total`	counter	service, method, route, status_class
`http_server_request_duration_seconds`	histogram	service, method, route, status_class
`http_server_active_requests`	gauge	service, route
`http_request_body_rejected_total`	counter	route, reason
`http_idempotency_conflicts_total`	counter	route, command_type

The route label must be the route template:

/v1/cases/{caseId}

not the actual path:

/v1/cases/case-2026-000001

Actual paths create one time series per case and make cardinality explode.
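The effect of template-based labels can be shown with a small in-memory counter. This is a sketch under assumptions, not a metrics library: `HttpRequestCounter` and its `Labels` record are hypothetical, and a real service would register a counter with whatever metrics library it already uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative in-memory request counter keyed by low-cardinality labels only. */
final class HttpRequestCounter {

  /** Label set: route is the template ("/v1/cases/{caseId}"), never the concrete path. */
  record Labels(String service, String method, String route, String statusClass) {}

  private final Map<Labels, LongAdder> counts = new ConcurrentHashMap<>();

  /** Collapses a status code into its class so 200/202/204 share one series. */
  static String statusClass(int status) {
    if (status < 100 || status > 599) return "unknown";
    return (status / 100) + "xx";
  }

  void record(String service, String method, String routeTemplate, int status) {
    Labels key = new Labels(service, method, routeTemplate, statusClass(status));
    counts.computeIfAbsent(key, k -> new LongAdder()).increment();
  }

  long count(Labels labels) {
    LongAdder adder = counts.get(labels);
    return adder == null ? 0 : adder.sum();
  }

  /** Number of distinct label combinations, i.e. time series this counter would produce. */
  int seriesCount() {
    return counts.size();
  }
}
```

With the template as the label, requests for any number of different cases still produce one series per method and status class. Passing the concrete path instead would add one series per case.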

5.2 NGINX metrics

Metric	Meaning
request rate	incoming traffic
4xx/5xx rate	client/edge/upstream error
upstream response time	upstream latency
request time	total latency at the edge
active connections	connection pressure
413/429/502/503/504 count	edge failure classes

NGINX access logs are often the main source of edge metrics when no dedicated exporter is used.

5.3 JVM metrics

Metric	Why it matters
heap used/max	memory pressure
non-heap/metaspace	classloading/leak indicators
GC pause duration	latency spike
thread count	thread leak/pool pressure
executor active/queued tasks	backpressure
connection pool active/idle/pending	DB bottleneck
CPU usage	saturation
process uptime/restart	stability

Do not monitor only the heap. Many Java outages come from thread pools, connection pools, GC pauses, or CPU throttling.

5.4 PostgreSQL metrics

Metric	Why it matters
active connections	pool/DB pressure
waiting sessions	lock/resource contention
transaction duration	long transaction risk
query duration histogram	performance regression
deadlocks total	concurrency bug
serialization failures	retry pressure
rows scanned/returned	query shape issue
replication lag	HA/read replica awareness
autovacuum activity	bloat/maintenance
table/index bloat indicators	long-term performance

Application-level DB metrics:

Metric	Label
`db_operation_duration_seconds`	operation, outcome
`db_sqlstate_errors_total`	sqlstate_class, operation
`db_lock_wait_seconds`	operation
`db_transaction_retries_total`	reason

SQLSTATE label should be class or curated code, not full message.
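A minimal sketch of deriving the label from a SQLSTATE value follows. The `SqlStates` helper is a hypothetical name. The codes are PostgreSQL's: class `23` is integrity constraint violation, `40001` is serialization_failure, and `40P01` is deadlock_detected. Treating only those two as retryable is an assumption for illustration, not a complete policy.

```java
/** Hypothetical helper for turning a SQLSTATE into a low-cardinality metric label. */
final class SqlStates {
  private SqlStates() {}

  /** The first two characters of a five-character SQLSTATE form its class, e.g. "23505" -> "23". */
  static String sqlstateClass(String sqlstate) {
    if (sqlstate == null || sqlstate.length() != 5) return "unknown";
    return sqlstate.substring(0, 2);
  }

  /** Retry candidates in this sketch: serialization_failure and deadlock_detected. */
  static boolean isRetryable(String sqlstate) {
    return "40001".equals(sqlstate) || "40P01".equals(sqlstate);
  }
}
```

The full five-character code still belongs in the log line. Only the class, or a short curated list of codes, goes into the metric label.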

5.5 Kafka metrics

Producer:

Metric	Why
send rate	throughput
send error rate	broker/network/schema issue
request latency	broker pressure
batch size	efficiency
record retry rate	transient issue

Consumer:

Metric	Why
records consumed rate	throughput
consumer lag	backlog
processing duration	handler speed
processing error total	handler correctness
DLQ/quarantine count	poison events
rebalance count	instability
poll interval/max poll risk	stuck consumer

Outbox/inbox metrics:

Metric	Why
outbox pending count	publication backlog
outbox oldest age seconds	event freshness SLO
outbox publish failures	publisher issue
inbox pending count	consumer backlog
inbox oldest age seconds	process delay
inbox duplicate count	replay/duplicate behavior
quarantined message count	manual action required

5.6 Camunda 7 metrics

Camunda-specific indicators:

Metric	Why
active process instances by key	volume
process start rate	traffic
process completion rate	throughput
process duration	workflow latency
open incidents by process/activity	failure hotspot
failed jobs count	technical failure
job acquisition latency/backlog	executor pressure
external task locked/failed count	worker issue
user task count by group	workload
user task oldest age	SLA risk
timer due backlog	scheduler pressure
history cleanup duration/failure	maintenance risk

Domain workflow metrics:

Metric	Why
case intake accepted total	business traffic
case triage completed total	process progress
investigation task open count	workload
SLA breach total	regulatory risk
escalation total	risk/control
case closure duration	lifecycle performance
appeal window expired count	timer correctness

6. Tracing Contract

A trace shows the path of a request/operation across services.

OpenTelemetry defines signals such as traces, metrics, logs, baggage, and profiles. On this platform, traces are used to answer questions about path, latency breakdown, and causal relationships.

6.1 HTTP trace

Span model:

HTTP POST /v1/cases
  validate.case_intake
  db.transaction.create_case
    db.insert.case
    db.insert.audit
    db.insert.outbox
  response.serialize

Attributes:

Attribute	Example
`http.route`	`/v1/cases`
`http.method`	`POST`
`http.status_code`	`202`
`service.name`	`case-api`
`case.command_type`	`CASE_INTAKE`
`error.code`	`CASE_STATE_CONFLICT`

Avoid putting full case data in span attributes.

6.2 Async trace: HTTP to Kafka to consumer

Async trace is harder because the causal path crosses a database outbox and Kafka.

Outbox row should preserve enough trace/correlation context:

trace_id text,
span_context jsonb,
request_id text,
correlation_id text,

Kafka headers:

traceparent
tracestate
x-request-id
x-correlation-id
x-event-id
x-case-id

Consumer creates a new span linked to producer context.

Do not force every async continuation to look like one synchronous trace if tooling cannot represent it cleanly. It is acceptable to use span links or correlation search.
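The `traceparent` value stored in the outbox row and copied into the Kafka header follows the W3C Trace Context format: version, 32 hex characters of trace id, 16 hex characters of parent span id, and 2 hex characters of flags. The parser below is a hand-written sketch for illustration. The `TraceParent` record is a hypothetical name, it handles version `00` only, and in practice an OpenTelemetry propagator would normally do this work.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parsed W3C traceparent header; a hypothetical helper, not the OpenTelemetry API. */
record TraceParent(String traceId, String parentSpanId, boolean sampled) {

  // Version "00": 32 hex trace id, 16 hex parent span id, 2 hex flags.
  private static final Pattern FORMAT =
      Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

  static Optional<TraceParent> parse(String header) {
    if (header == null) return Optional.empty();
    Matcher m = FORMAT.matcher(header.trim());
    if (!m.matches()) return Optional.empty();
    String traceId = m.group(1);
    String spanId = m.group(2);
    // All-zero ids are invalid according to the W3C Trace Context specification.
    if (traceId.chars().allMatch(c -> c == '0') || spanId.chars().allMatch(c -> c == '0')) {
      return Optional.empty();
    }
    int flags = Integer.parseInt(m.group(3), 16);
    return Optional.of(new TraceParent(traceId, spanId, (flags & 0x01) == 1));
  }

  /** Serializes back to a header value; only the sampled flag is preserved in this sketch. */
  String toHeader() {
    return "00-" + traceId + "-" + parentSpanId + "-" + (sampled ? "01" : "00");
  }
}
```

A consumer that cannot parse the header should still process the event and fall back to searching by `x-correlation-id` or `x-event-id`. A missing trace context should not block business processing.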

6.3 Camunda trace boundary

Camunda 7 internal execution may not automatically produce perfect spans for every BPMN element. We can still instrument boundaries:

start process;
correlate message;
execute delegate;
complete external task;
create user task projection;
handle incident;
publish process event.

Useful span attributes:

Attribute	Example
`camunda.process_definition_key`	`case_enforcement_lifecycle`
`camunda.process_instance_id`	`...`
`camunda.business_key`	`case-2026-000001`
`camunda.activity_id`	`AssessCaseTask`
`camunda.job_id`	`...`
`camunda.message_name`	`CaseAccepted`
`camunda.incident_type`	`failedJob`

7. OpenTelemetry Architecture

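Vendor-neutral architecture: services export telemetry to an OpenTelemetry Collector, which processes it and forwards it to the chosen backend.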

The Collector gives one place to:

receive telemetry;
enrich resource attributes;
sample traces;
filter unsafe data;
batch/export;
route data to vendor backend;
decouple application from vendor endpoint.

Baseline resource attributes:

Attribute	Example
`service.name`	`case-api`
`service.version`	`1.12.3`
`deployment.environment`	`prod`
`k8s.namespace.name`	`enforcement`
`k8s.pod.name`	`case-api-abc`
`k8s.container.name`	`case-api`
`cloud.region`	if applicable

8. Structured Logging in Java

8.1 MDC Context

Use MDC for request-scoped log fields:

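import java.util.HashMap;
import java.util.Map;
import org.slf4j.MDC;

// Sets an MDC field for the duration of a try-with-resources block,
// then restores whatever value the key had before.
public final class LogContext implements AutoCloseable {
  private final Map<String, String> previous = new HashMap<>();

  private LogContext() {}

  public static LogContext put(String key, String value) {
    LogContext ctx = new LogContext();
    ctx.capture(key);
    if (value != null) MDC.put(key, value);
    return ctx;
  }

  // Remember the current value (possibly null) so close() can restore it.
  private void capture(String key) {
    previous.put(key, MDC.get(key));
  }

  @Override
  public void close() {
    for (var entry : previous.entrySet()) {
      if (entry.getValue() == null) MDC.remove(entry.getKey());
      else MDC.put(entry.getKey(), entry.getValue());
    }
  }
}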

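Use in a request filter (shown here for a servlet-style filter chain; the exact call depends on the filter API in use):

try (var ignored = LogContext.put("request_id", requestId)) {
  chain.doFilter(request, response);
}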

Be careful with thread pools. MDC does not always propagate automatically.
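MDC values are stored per thread, so a task handed to an executor does not see the fields set on the request thread unless they are copied. The wrapper below sketches that copy step in a library-independent way. `ContextPropagation` is a hypothetical name, and the capture and install functions are parameters so the example runs without SLF4J on the classpath.

```java
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Wraps a task so the caller's context map is installed on the worker thread and restored after. */
final class ContextPropagation {
  private ContextPropagation() {}

  static Runnable wrap(
      Runnable task,
      Supplier<Map<String, String>> capture,
      Consumer<Map<String, String>> install) {
    Map<String, String> callerContext = capture.get(); // snapshot on the submitting thread
    return () -> {
      Map<String, String> workerContext = capture.get(); // whatever the pooled thread held before
      install.accept(callerContext);
      try {
        task.run();
      } finally {
        install.accept(workerContext); // avoid leaking fields into the next task
      }
    };
  }
}
```

With SLF4J, the capture function would presumably be `MDC::getCopyOfContextMap` and the install function a small lambda around `MDC.setContextMap`. The copy can be null when the context is empty, so calling `MDC.clear()` in that case is the safer choice.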

8.2 Log Event Naming

Use stable event_type:

case.command.received
case.command.accepted
case.command.rejected
case.outbox.event.created
case.outbox.event.published
case.kafka.event.consumed
case.workflow.message.correlated
case.workflow.incident.created
case.task.assigned
case.sla.breached

Stable event types make queries and dashboards easier.

9. API Observability

For every API request, log once at the boundary:

{
  "event_type": "http.request.completed",
  "service": "case-api",
  "method": "POST",
  "route": "/v1/cases",
  "status": 202,
  "duration_ms": 83,
  "request_id": "req-abc",
  "trace_id": "...",
  "idempotency_key_hash": "sha256:...",
  "outcome": "ACCEPTED"
}

Do not log full payload.
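The `idempotency_key_hash` field in the log sample can be produced with a plain SHA-256 digest. The sketch below uses only the JDK (`MessageDigest` and `HexFormat`, the latter available since Java 17). The `KeyHasher` class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Hypothetical helper that hashes an idempotency key before it is logged. */
final class KeyHasher {
  private KeyHasher() {}

  /** Returns "sha256:<hex>" so logs can correlate retries without exposing the raw key. */
  static String sha256(String idempotencyKey) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(idempotencyKey.getBytes(StandardCharsets.UTF_8));
      return "sha256:" + HexFormat.of().formatHex(hash);
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform, so this is not expected.
      throw new IllegalStateException(e);
    }
  }
}
```

Two requests with the same key produce the same hash, which is enough to spot retries in the logs. If keys are short or guessable, an unkeyed hash can be reversed by brute force. A keyed HMAC is one possible alternative in that case.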

Measure:

request count;
duration histogram;
4xx by error code;
5xx by error code;
validation failure count;
idempotency replay count;
idempotency conflict count;
auth failure count;
downstream timeout count.

Dashboard panels:

RPS by route;
p50/p95/p99 latency by route;
error rate by route/status;
top error codes;
idempotency conflicts/replays;
NGINX 504 vs app 500;
deployment version comparison.

10. PostgreSQL Observability

Application should expose operation-level DB metrics instead of raw SQL labels.

Good label:

operation=create_case_transaction

Bad label:

sql=insert into case_core.case_record ...

DB operation metrics:

db_operation_duration_seconds{operation="create_case_transaction", outcome="success"}
db_sqlstate_errors_total{operation="create_case_transaction", sqlstate_class="23"}
db_transaction_retries_total{operation="assign_case", reason="serialization_failure"}

Log SQLSTATE on error:

{
  "event_type": "db.operation.failed",
  "operation": "create_case_transaction",
  "sqlstate": "23505",
  "sqlstate_class": "23",
  "retryable": false,
  "request_id": "req-abc",
  "case_id": "case-2026-000001"
}

Important PostgreSQL dashboards:

active connections vs max;
waiting locks;
long-running transactions;
deadlocks;
slow query count;
top operations by latency;
outbox/inbox pending rows;
autovacuum lag/bloat indicators;
table/index size growth;
replication lag if used.

Failure drill:

create lock contention on case_record;
send concurrent assignment commands;
observe API p95 latency;
observe DB lock wait;
observe SQLSTATE errors/retries;
verify runbook points to locking, not random API timeout.

11. Kafka Observability

Kafka observability must connect broker-level and application-level signals.

Broker/client level:

produce request latency;
send error rate;
consumer lag;
rebalance count;
fetch latency;
records consumed/produced rate.

Application level:

event handling duration;
inbox pending count;
oldest unprocessed event age;
duplicate event count;
quarantined event count;
DLQ publish count;
event schema/version rejection count;
event-to-process correlation failure count.

Do not only alert on consumer lag. Lag is meaningful only with context:

Scenario	Interpretation
lag high, incoming rate high, processing rate healthy	capacity issue
lag high, processing errors high	poison message/bug
lag high, rebalance high	consumer instability
lag zero, inbox pending high	consumer wrote inbox but processor stuck
lag high only one partition	hot key / ordering bottleneck

Kafka log example:

{
  "event_type": "kafka.event.handled",
  "service": "case-process-adapter",
  "topic": "case.lifecycle.v1",
  "partition": 3,
  "offset": 912391,
  "event_id": "evt-123",
  "case_id": "case-2026-000001",
  "handler": "CaseAcceptedHandler",
  "duration_ms": 47,
  "outcome": "CORRELATED_TO_PROCESS"
}

12. Outbox/Inbox Observability

Outbox and inbox are reliability infrastructure. They need first-class observability.

12.1 Outbox metrics

Metric	Alert possibility
pending count	if rising for N minutes
oldest pending age	if exceeds freshness SLO
publish success rate	low success indicates broker/publisher issue
publish failure rate	high means action needed
stale in-flight count	publisher crashed or stuck
poison event count	manual triage

Outbox log:

{
  "event_type": "outbox.publish.failed",
  "outbox_id": "out-123",
  "event_id": "evt-123",
  "topic": "case.lifecycle.v1",
  "key": "case-2026-000001",
  "attempt": 5,
  "error_code": "KAFKA_PRODUCE_TIMEOUT",
  "retryable": true
}
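Pending count and oldest pending age can both be derived from the creation timestamps of the rows that are still unpublished. The sketch below assumes those timestamps have already been read from the outbox table. The `OutboxBacklog` class is a hypothetical name, and the `Clock` parameter is there only to make the calculation testable.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical calculation of the two outbox backlog gauges. */
final class OutboxBacklog {
  private OutboxBacklog() {}

  record Snapshot(int pendingCount, Duration oldestPendingAge) {}

  /** pendingCreatedAt holds the creation time of every row still waiting for publication. */
  static Snapshot measure(List<Instant> pendingCreatedAt, Clock clock) {
    Instant now = clock.instant();
    Duration oldest = pendingCreatedAt.stream()
        .min(Instant::compareTo)
        .map(created -> Duration.between(created, now))
        .orElse(Duration.ZERO); // empty outbox: report zero age rather than a missing value
    return new Snapshot(pendingCreatedAt.size(), oldest);
  }

  /** True when the oldest pending row is older than the freshness objective. */
  static boolean breachesFreshness(Snapshot snapshot, Duration freshnessSlo) {
    return snapshot.oldestPendingAge().compareTo(freshnessSlo) > 0;
  }
}
```

Oldest age is usually the better alert input. A large pending count can be normal during a burst, while a single row that has waited for minutes points to a stuck publisher.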

12.2 Inbox metrics

Metric	Meaning
pending count	backlog
oldest pending age	staleness
duplicate count	replay/at-least-once evidence
processing failure count	handler bug or dependency issue
quarantine count	operator action
stale lock count	worker crash

Inbox is where exactly-once illusion is tested. If duplicate count rises, it may be normal during replay, but it must be visible.
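The duplicate counter falls out naturally from the deduplication step itself. The sketch below uses an in-memory set as a stand-in for an inbox table with a unique constraint on the event id. `InboxDeduplicator` is a hypothetical name, and a real inbox would record the event id in the same database transaction as the handler's side effects.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Consumer;

/** In-memory stand-in for an inbox table with a unique constraint on event_id. */
final class InboxDeduplicator {
  private final Set<String> seenEventIds = ConcurrentHashMap.newKeySet();
  private final LongAdder duplicates = new LongAdder();

  /** Runs the handler only for the first delivery of an event id; returns true if it ran. */
  boolean handleOnce(String eventId, Consumer<String> handler) {
    if (!seenEventIds.add(eventId)) {
      duplicates.increment(); // this value would feed the inbox duplicate count metric
      return false;
    }
    handler.accept(eventId);
    return true;
  }

  long duplicateCount() {
    return duplicates.sum();
  }
}
```

In this simplified version a handler that throws leaves the event id marked as seen. That is exactly the gap the transactional inbox closes, so the sketch only illustrates where the duplicate signal comes from.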

13. Camunda 7 Observability

Camunda has technical runtime state and business workflow state.

Technical:

job executor backlog;
failed jobs;
incidents;
retries left;
due timers;
external task locks;
process definition version distribution;
DB query latency against Camunda tables;
history cleanup.

Business:

active cases by lifecycle phase;
tasks open by candidate group;
oldest task age;
SLA breach count;
escalation count;
case closure duration;
cases stuck in assessment/investigation/appeal.

Incident log example:

{
  "event_type": "camunda.incident.created",
  "service": "case-process-engine",
  "process_definition_key": "case_enforcement_lifecycle",
  "process_instance_id": "pi-123",
  "business_key": "case-2026-000001",
  "activity_id": "NotifyDecisionServiceTask",
  "incident_type": "failedJob",
  "error_code": "DOWNSTREAM_TIMEOUT",
  "retryable": true
}

Dashboard panels:

open incidents by process/activity;
failed jobs by exception class/error code;
job backlog by due date;
active process instances by version;
open user tasks by group;
oldest user task age;
SLA timer due/triggered;
process duration percentile;
migration version distribution.

14. Kubernetes Observability

Kubernetes tells whether runtime is healthy enough to run the application.

Signals:

Pod restart count;
container OOMKilled;
CPU throttling;
memory usage vs limit;
readiness failures;
liveness restarts;
pending Pods;
image pull failures;
rollout progress;
HPA scaling;
node pressure;
network policy denies if available;
disk pressure/log volume.

Correlate release with symptoms:

Symptom	Possible K8s signal
p99 latency spike	CPU throttling, GC, node pressure
502 from NGINX	Pod restart/readiness failure
Kafka lag rising	consumer Pods crashloop or CPU throttled
outbox backlog	publisher not scheduled / blocked / no config
Camunda incidents	process adapter version mismatch
DB connection exhaustion	too many replicas/pool size too high

Deployment version must be in telemetry.

Without version label, canary/rollback diagnosis becomes guesswork.

15. SLO Design

SLO connects telemetry to user/business expectation.

Example SLIs:

User journey	SLI
Case intake	percentage of valid `POST /v1/cases` accepted within 2s
Case event publication	outbox event published within 30s of DB commit
Workflow start	process instance started/correlated within 60s of `CaseAccepted` event
Human task creation	triage task visible within 90s of case acceptance
Case search	p95 query response < 1s for standard filters
SLA escalation	escalation timer fires within tolerance window

Example SLO:

99.5% of valid case intake commands are accepted within 2 seconds over 30 days.

Another:

99% of outbox events are published to Kafka within 30 seconds of transaction commit over 7 days.

Do not define only uptime SLO. A system can be “up” while cases are not moving.
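An SLO target translates directly into an error budget. As a worked example with an assumed volume: at 99.5% over 30 days and 1,000,000 valid intake commands, 5,000 commands may be slow or failed before the objective is violated. The helper below is a sketch of that arithmetic, and `ErrorBudget` is a hypothetical name.

```java
/** Hypothetical helper for SLI and error budget arithmetic. */
final class ErrorBudget {
  private ErrorBudget() {}

  /** SLI as a fraction in [0, 1]; an empty window counts as fully compliant. */
  static double sli(long goodEvents, long validEvents) {
    return validEvents == 0 ? 1.0 : (double) goodEvents / validEvents;
  }

  /** Number of bad events the SLO tolerates in the window, e.g. 0.5% of valid events. */
  static long allowedBadEvents(double sloTarget, long validEvents) {
    return Math.round((1.0 - sloTarget) * validEvents);
  }

  /** Fraction of the budget still unspent; a negative value means the SLO is violated. */
  static double budgetRemaining(double sloTarget, long goodEvents, long validEvents) {
    long allowed = allowedBadEvents(sloTarget, validEvents);
    long bad = validEvents - goodEvents;
    if (allowed == 0) return bad == 0 ? 1.0 : -1.0;
    return 1.0 - (double) bad / allowed;
  }
}
```

Only valid commands count toward the denominator. Requests rejected for schema or auth reasons are the client's errors and should not consume the service's budget.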

16. Alerting Principles

Alert on symptoms first, causes second.

Bad alert:

CPU > 70% for 5 minutes

May be harmless.

Better:

case intake 5xx rate > threshold and p95 latency > threshold

Cause alerts are still useful, but usually lower priority unless they imply imminent user impact.

16.1 Alert examples

Alert	Severity	Why
valid case intake success rate below SLO	page	user/business impact
outbox oldest pending age > 5m	page/warn	event pipeline stuck
Camunda open incidents rising rapidly	page	workflow stuck
SLA breach count spike	page	regulatory impact
NGINX 504 spike	page	user-visible timeout
PostgreSQL deadlocks spike	warn/page	concurrency bug
Kafka consumer lag rising with processing errors	page	poison/handler failure
Pod restart loop for critical workload	page	availability risk
metrics scrape missing	warn	blind spot

16.2 Avoid noisy alerts

Avoid:

alert per single 500;
alert on every validation error;
alert on normal duplicate event during replay;
alert on raw CPU without saturation/user impact;
alert with no runbook;
alert without owner;
alert that fires every deploy.

Each alert should have:

owner;
severity;
runbook;
dashboard link;
likely causes;
immediate mitigation;
escalation path.
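The required fields above can be enforced mechanically, for example in a build step that loads alert definitions and fails when metadata is missing. The record below is a sketch of such a check. `AlertDefinition` and its field names are hypothetical and are not tied to any alerting tool's schema.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shape of an alert's metadata; field names mirror the list above. */
record AlertDefinition(
    String name,
    String owner,
    String severity,
    String runbook,
    String dashboardLink,
    List<String> likelyCauses,
    String immediateMitigation,
    String escalationPath) {

  /** Returns the names of required fields that are missing or blank. */
  List<String> missingFields() {
    List<String> missing = new ArrayList<>();
    requireText(missing, "owner", owner);
    requireText(missing, "severity", severity);
    requireText(missing, "runbook", runbook);
    requireText(missing, "dashboard link", dashboardLink);
    if (likelyCauses == null || likelyCauses.isEmpty()) missing.add("likely causes");
    requireText(missing, "immediate mitigation", immediateMitigation);
    requireText(missing, "escalation path", escalationPath);
    return missing;
  }

  private static void requireText(List<String> missing, String field, String value) {
    if (value == null || value.isBlank()) missing.add(field);
  }
}
```

A release check could then reject any alert whose `missingFields()` list is not empty. That turns "alert without runbook" from a review comment into a failing build.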

17. Dashboard Design

17.1 Executive/system overview

Panels:

valid case intake success rate;
case intake latency p95/p99;
active case count by phase;
SLA breaches;
open Camunda incidents;
outbox/inbox oldest age;
Kafka lag summary;
DB saturation summary;
current deployment versions.

17.2 API dashboard

RPS by route;
latency by route/status;
4xx/5xx by error code;
idempotency replays/conflicts;
auth failures;
NGINX 413/429/502/503/504;
app 500s;
top slow DB operations triggered by route.

17.3 Event pipeline dashboard

outbox pending count/age;
publish success/failure;
Kafka producer latency/error;
consumer lag by topic/partition;
inbox pending count/age;
DLQ/quarantine;
duplicate events;
event processing duration.

17.4 Workflow dashboard

active process instances;
process duration p50/p95/p99;
open incidents by activity;
failed jobs;
due timers;
open user tasks by group;
oldest task age;
SLA breaches;
migration version distribution.

17.5 Database dashboard

connections;
waiting locks;
deadlocks;
slow queries;
transaction duration;
table/index growth;
vacuum/autovacuum indicators;
outbox/inbox table size;
DB operation latency from app.

17.6 Release dashboard

request rate by version;
error rate by version;
latency by version;
consumer lag by version;
Camunda incidents after deployment;
DB errors after migration;
Pod restarts after rollout;
canary vs stable comparison.

18. End-to-End Diagnostic Example

Incident:

Partner reports 504 when submitting cases. Some cases still appear later.

Investigation path:

Search NGINX logs by partner/time/status 504.
Extract request_id.
Search app logs by request_id.
Check whether command accepted and DB transaction committed.
Check idempotency table for same key.
Check DB operation duration and locks.
Check outbox row created.
Check outbox publication age.
Check Kafka event produced.
Check consumer inbox and Camunda correlation.
Determine user-visible correction: client should query command status or retry with same idempotency key.

If observability is good, this is minutes.

If observability is poor, this is archaeology.

19. Failure Drills

Drill 1 — PostgreSQL lock contention

Action:

open long transaction locking a case row;
send assignment command;
observe timeout/lock wait.

Expected signals:

DB lock wait rises;
API latency rises for assignment route;
SQLSTATE/retry logs appear;
no duplicate assignment;
alert if user impact threshold crossed.

Drill 2 — Kafka broker unavailable

Action:

block producer from Kafka;
submit case intake.

Expected:

API still accepts command if outbox commit succeeds;
outbox pending grows;
oldest pending age alert fires;
no request path sync publish failure;
publisher logs retryable error.

Drill 3 — Poison event

Action:

inject event with unsupported schema version.

Expected:

consumer rejects with clear error;
event quarantined/DLQ;
consumer continues processing other events if ordering policy permits;
alert includes topic/partition/offset/event id;
runbook identifies replay/quarantine action.

Drill 4 — Camunda delegate failure

Action:

make service task throw retryable technical exception.

Expected:

failed job/incident visible;
process instance remains recoverable;
logs include process instance id/business key/activity id;
retry policy behaves as expected;
alert maps to workflow dashboard.

Drill 5 — NGINX body limit

Action:

upload evidence above limit.

Expected:

NGINX returns 413;
app receives no request;
access log contains request id/status;
client error contract is understandable;
no Pod memory pressure.

Drill 6 — Pod crash during request

Action:

kill API Pod during command handling.

Expected:

NGINX may return 502/504;
Kubernetes restart count increments;
idempotency handles retry;
DB transaction either committed or rolled back;
outbox consistency preserved.

20. Observability as Code

Observability should live in repository:

observability/
  dashboards/
    api-dashboard.json
    event-pipeline-dashboard.json
    workflow-dashboard.json
    database-dashboard.json
  alerts/
    case-api-alerts.yaml
    kafka-pipeline-alerts.yaml
    camunda-alerts.yaml
  log-schema/
    application-log.schema.json
    edge-access-log.schema.json
  runbooks/
    nginx-504.md
    outbox-backlog.md
    camunda-incident-storm.md
    db-lock-contention.md

Build/release should validate:

dashboard JSON parseable;
alert rules syntactically valid;
runbook links present;
service emits required metrics in integration test;
log sample matches schema;
trace smoke test can find expected spans.

21. Observability Test Examples

21.1 Log schema test

@Test
void commandAcceptedLogContainsRequiredFields() {
  var log = captureLog(() -> submitValidCase());

  assertThat(log.eventType()).isEqualTo("case.command.accepted");
  assertThat(log.requestId()).isNotBlank();
  assertThat(log.caseId()).isNotBlank();
  assertThat(log.durationMs()).isGreaterThanOrEqualTo(0);
  assertThat(log.rawPayload()).isNull();
}

21.2 Metrics smoke test

@Test
void exposesHttpMetricsWithRouteTemplate() {
  submitValidCase();

  String metrics = scrapeMetrics();

  assertThat(metrics).contains("http_server_request_duration_seconds");
  assertThat(metrics).contains("route=\"/v1/cases\"");
  assertThat(metrics).doesNotContain("case-2026-");
}

21.3 Trace smoke test

@Test
void createsTraceAcrossApiAndOutbox() {
  var response = submitValidCaseWithTraceparent();
  var traceId = response.traceId();

  assertTraceContains(traceId, "HTTP POST /v1/cases");
  assertTraceContains(traceId, "db.transaction.create_case");
  assertTraceContains(traceId, "outbox.event.created");
}

22. Privacy, Compliance, and Audit Separation

Audit log and application log are not the same.

Application log:

operational diagnosis;
can be sampled/rotated;
avoids sensitive details;
may go to centralized logging platform.

Audit log:

evidentiary record;
immutable/append-only design;
actor/action/resource/outcome;
domain-level meaning;
retention governed by policy;
not used as debug dumping ground.

Example audit record:

{
  "audit_type": "CASE_DECISION_RECORDED",
  "case_id": "case-2026-000001",
  "actor_id": "officer-123",
  "actor_type": "HUMAN_OFFICER",
  "decision": "ENFORCEMENT_ACTION_APPROVED",
  "occurred_at": "2026-07-03T10:15:30Z",
  "request_id": "req-abc"
}

Example application log for same action:

{
  "event_type": "case.decision.recorded",
  "case_id": "case-2026-000001",
  "request_id": "req-abc",
  "duration_ms": 42,
  "outcome": "SUCCESS"
}

Do not use application logs as legal audit source unless explicitly designed for it.

23. Cardinality Discipline

High cardinality kills metrics systems.

Usually safe labels:

service;
route template;
method;
status class;
error code from registry;
environment;
version;
topic;
consumer group;
process definition key;
activity id;
candidate group.

Usually unsafe labels:

request id;
trace id;
case id;
event id;
user id;
raw exception message;
SQL query text;
full URL;
partner-provided arbitrary value.

Put unsafe identifiers in logs/traces, not metric labels.
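Even a "usually safe" label can grow without bound if a caller passes unexpected values. One defensive option is a guard that caps the number of distinct values per label and folds the overflow into a single bucket. The sketch below illustrates the idea. `LabelCardinalityGuard`, the cap, and the `"other"` bucket name are all assumptions.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Caps distinct values per label key; overflow collapses into a single "other" value. */
final class LabelCardinalityGuard {
  private final int maxValuesPerLabel;
  private final Map<String, Set<String>> seen = new ConcurrentHashMap<>();

  LabelCardinalityGuard(int maxValuesPerLabel) {
    this.maxValuesPerLabel = maxValuesPerLabel;
  }

  /** Returns the value to use as the metric label: the input, or "other" once the cap is hit. */
  String admit(String labelKey, String value) {
    Set<String> values = seen.computeIfAbsent(labelKey, k -> ConcurrentHashMap.newKeySet());
    if (values.contains(value)) return value; // already admitted earlier
    synchronized (values) {
      if (values.contains(value)) return value;
      if (values.size() >= maxValuesPerLabel) return "other";
      values.add(value);
      return value;
    }
  }
}
```

A sudden rise in the `"other"` series is itself a useful signal. It usually means a raw identifier or a partner-provided value has slipped into a label.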

24. Sampling Strategy

Not every trace needs to be retained at full rate.

But sampling must preserve rare failures.

Baseline:

sample all errors;
sample all high-latency requests above threshold;
sample small percentage of successful common routes;
sample all workflow incidents;
sample all DLQ/quarantine paths;
preserve trace context even if not sampled.
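The baseline above can be written as a single decision function. Because it needs the outcome of the request, a decision like this can only run once the trace has finished, which is usually called tail-based sampling. The sketch below is an assumption-laden illustration: `TraceSampling`, the `Outcome` record, the threshold, and the hash-based bucketing are all hypothetical choices.

```java
/** Hypothetical keep/drop decision for a finished trace. */
final class TraceSampling {
  private TraceSampling() {}

  record Outcome(boolean error, long durationMs, boolean workflowIncident, boolean quarantined) {}

  /**
   * Keeps every error, slow request, workflow incident, and DLQ/quarantine path.
   * Keeps roughly baselinePercent of the rest, chosen deterministically from the trace id.
   */
  static boolean keep(String traceId, Outcome outcome, long slowThresholdMs, int baselinePercent) {
    if (outcome.error() || outcome.workflowIncident() || outcome.quarantined()) return true;
    if (outcome.durationMs() >= slowThresholdMs) return true;
    // The same trace id always lands in the same bucket, so repeated decisions agree.
    int bucket = Math.floorMod(traceId.hashCode(), 100);
    return bucket < baselinePercent;
  }
}
```

Deriving the bucket from the trace id rather than from a random number keeps the decision reproducible, which helps when several components evaluate the same trace. The trace context itself is still propagated for traces that end up dropped.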

Do not sample away regulatory failure evidence. Audit remains separate.

25. Runbook Template

Each major alert needs a runbook.

# Runbook: Outbox Oldest Pending Age High

## Meaning
Outbox rows are not being published to Kafka within freshness SLO.

## User impact
Case lifecycle events may be delayed. Camunda workflow may start late. Officer tasks may not appear.

## First checks
1. Open event pipeline dashboard.
2. Check outbox pending count and oldest age.
3. Check publisher Pod readiness/restarts.
4. Check Kafka producer errors.
5. Check DB lock/connection saturation.
6. Check recent deployment.

## Likely causes
- Kafka unavailable.
- Publisher crashed.
- Poison row stuck at head of batch.
- DB lock on outbox table.
- Bad config/secret after deploy.

## Immediate mitigation
- Restart publisher only if stuck and safe.
- Scale publisher if backlog and Kafka healthy.
- Quarantine poison row if identified.
- Rollback recent deployment if error started after release.

## Recovery verification
- Oldest pending age decreases.
- Publish success rate normal.
- Kafka consumer lag acceptable.
- Camunda process starts resume.

Runbook without telemetry is wishful thinking.

Telemetry without runbook is noise.

26. Production Readiness Checklist

27. Anti-Patterns

Anti-pattern 1 — Logging everything

Too much log without structure makes diagnosis slower and increases data/privacy risk.

Anti-pattern 2 — Metrics with unique IDs as labels

This can destroy metric backend performance.

Anti-pattern 3 — Dashboard without SLO

Dashboard becomes wall decoration.

Anti-pattern 4 — Alert without runbook

This creates human stress, not reliability.

Anti-pattern 5 — Only monitoring infrastructure

Pods healthy does not mean workflow healthy.

Anti-pattern 6 — Only monitoring business state

SLA breach without technical cause path leads to slow recovery.

Anti-pattern 7 — No edge logs

Without NGINX status/upstream time, 502/504 diagnosis is guesswork.

Anti-pattern 8 — Treating audit log as debug log

Audit must be evidentiary and stable, not noisy operational dumping.

Anti-pattern 9 — No observability test

Instrumentation often breaks silently during refactor unless tested.

28. Mini Capstone for This Part

For POST /v1/cases, design the complete telemetry:

Signal	Required
NGINX access log	request id, status, upstream status, request time
API log	command received/accepted/rejected
API metric	request count/duration/error
DB metric	create case transaction duration/error
Outbox metric	event created/published age
Kafka metric	produce success/failure
Consumer metric	event handling duration/error
Camunda metric	process started/message correlated/incident
Trace	HTTP + DB + outbox + Kafka + consumer + Camunda boundary
Audit	case intake accepted/rejected with actor/outcome
SLO	valid command accepted within target
Alert	success rate/latency breach, outbox age breach

Then simulate:

valid request;
invalid schema;
duplicate idempotency key;
DB lock;
Kafka unavailable;
Camunda correlation failure;
NGINX timeout.

For each scenario, answer:

where do you see it first?
what ID connects the signals?
what dashboard shows impact?
what runbook gives action?
what user/business effect occurred?

If you cannot answer these, observability is incomplete.

29. Summary

Production observability for this platform means:

structured logs with stable event types;
metrics with low-cardinality labels;
traces across sync and async boundaries;
correlation IDs across HTTP, DB, Kafka, Camunda;
NGINX edge visibility;
PostgreSQL lock/query visibility;
Kafka lag and handler visibility;
outbox/inbox backlog visibility;
Camunda incident/job/task visibility;
Kubernetes runtime visibility;
SLO-driven dashboards;
actionable alerts with runbooks;
failure drills that prove telemetry works.

Observability is not an add-on at the end. It is part of the contract-first architecture.

The next part covers production readiness and failure drills more systematically: how to prove the system is ready to face outages, failed deploys, Kafka lag, DB locks, Camunda incident storms, and rollbacks.

References

OpenTelemetry Documentation: https://opentelemetry.io/docs/
OpenTelemetry Signals: https://opentelemetry.io/docs/concepts/signals/
OpenTelemetry Observability Primer: https://opentelemetry.io/docs/concepts/observability-primer/
Google SRE Book, Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/
Kubernetes Monitoring, Logging, and Debugging: https://kubernetes.io/docs/tasks/debug/
Apache Kafka Documentation: https://kafka.apache.org/documentation/
PostgreSQL Monitoring: https://www.postgresql.org/docs/current/monitoring.html
Camunda 7 User Guide: https://docs.camunda.org/manual/7.21/user-guide/
