
Debugging Production Failures

Learn Java Error, Reliability & Observability Engineering - Part 032

Debugging production failures in Java systems: evidence chain, hypothesis loop, logs-metrics-traces correlation, thread dumps, heap dumps, JFR, GC, Kubernetes/deployment context, and production-safe diagnosis.

[2026-06-28]


Part 032 — Debugging Production Failures

Target skill: diagnose production failures systematically, safely, and from evidence. After this part, you should be able to build a timeline, read the correlation between logs, metrics, and traces, capture thread dump/JFR/heap evidence carefully, distinguish cause from symptom, and choose a mitigation without making the system worse.

Production debugging is different from local debugging.

Locally, you often have a debugger, controllable input, and time. In production, you have:

partial visibility,
pressure from users and the business,
sensitive data,
distributed causality,
state that keeps changing,
the risk that a diagnostic action makes the system worse,
the need to mitigate before the root cause is fully understood.

Top engineers do not “guess the cause”. They build an evidence chain.

1. Kaufman Deconstruction

Production debugging consists of the following sub-skills:

Sub-skill	Outcome
Impact scoping	Determine who and what is affected, and how badly
Timeline construction	Order the signals, changes, and symptoms
Hypothesis loop	Form, test, and discard hypotheses based on evidence
Signal correlation	Combine logs, metrics, traces, events, dumps, and deployment data
JVM diagnostics	Use thread dumps, heap dumps, GC data, JFR, and jcmd safely
Distributed debugging	Trace failures across services, queues, DBs, caches, and external dependencies
Mitigation reasoning	Choose actions that reduce impact without adding risk
Evidence preservation	Keep important data for the postmortem, audit, and prevention

Production debugging is not only about finding the bug. The goal is to restore service, understand the failure mode, and prevent recurrence.

2. Mental Model: Evidence Chain

An evidence chain is a series of pieces of evidence that links a symptom to its contributing factors.

Key points:

Start from impact, not from your favorite component.
Do not jump from correlation to causation.
Every hypothesis must have expected evidence.
If the evidence does not fit, discard or revise the hypothesis.

3. First 10 Minutes: Production Triage Algorithm

When an alert fires, work through the steps in this order.

3.1 Confirm impact

Ask:

Are users actually affected?
Which operation is failing?
Is it availability, latency, correctness, freshness, or a deadline?
What percentage of requests/jobs/cases is affected?
Which region/tenant/version?

3.2 Establish timeline

Find:

when the symptom started,
when the alert fired,
the last deployment,
config/feature flag changes,
traffic spikes,
dependency incidents,
infrastructure events,
schema/data migrations.

3.3 Identify blast radius

Segment by safe dimensions:

service,
endpoint/operation,
region/zone,
tenant tier, not a high-cardinality tenant ID as a metric label,
app version,
dependency,
message topic/consumer group,
business capability.
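
The segmentation above can be sketched as a small grouping step. This is a minimal illustration, not a real telemetry client: `RequestSample` and `BlastRadius` are hypothetical names, and the samples are assumed to come from whatever metric or trace store you already have.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical request sample; the fields are illustrative, not a real schema.
record RequestSample(String version, String region, boolean failed) {}

final class BlastRadius {
    /** Returns failed/total for each value of the chosen dimension, sorted by key. */
    static Map<String, Double> errorRatioBy(List<RequestSample> samples,
                                            Function<RequestSample, String> dimension) {
        Map<String, int[]> counts = new TreeMap<>(); // value = [failed, total]
        for (RequestSample s : samples) {
            int[] c = counts.computeIfAbsent(dimension.apply(s), k -> new int[2]);
            if (s.failed()) c[0]++;
            c[1]++;
        }
        Map<String, Double> ratios = new LinkedHashMap<>();
        counts.forEach((key, c) -> ratios.put(key, (double) c[0] / c[1]));
        return ratios;
    }
}
```

Calling `BlastRadius.errorRatioBy(samples, RequestSample::version)` and then the same with `RequestSample::region` shows quickly whether the failure follows a version, a region, or neither.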

3.4 Choose mitigation path

Choose the safest mitigation:

roll back if the deployment is correlated and reversible,
disable the feature flag,
route to fallback/degraded mode,
stop harmful retries,
shed load,
pause the consumer to prevent further damage,
isolate the bad pod,
scale out if the bottleneck is capacity and the downstream is healthy.

3.5 Preserve evidence

Before a restart or rollback, if possible, capture:

sample traces,
representative logs,
thread dump,
heap/JFR if relevant and safe,
deployment hash,
config/flag state,
DB/session evidence,
message IDs for failed sample.

4. Hypothesis Loop

Use this loop so that debugging does not turn into speculation: form a hypothesis, state what it predicts, check the evidence, then keep, revise, or discard it.

Hypothesis template:

Hypothesis:
  Recent deployment v2026.06.28.3 introduced a blocking call inside the request path.

Predictions:
  - Latency increase starts after deployment time.
  - Only pods on new version are affected.
  - Traces show time spent in dependency X or thread pool Y.
  - Rollback or traffic shift reduces latency.

Evidence to check:
  - request latency by version
  - deployment events
  - trace critical path
  - thread dump blocked/waiting states

If the predictions do not show up, the hypothesis is weak.
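
The template can also be kept as data, so that each prediction is explicitly marked as observed, contradicted, or not yet checked. The sketch below is one possible shape under that assumption; `Hypothesis`, `Prediction`, and `Verdict` are hypothetical names, not part of any library.

```java
import java.util.List;

// Hypothetical types for illustration only.
enum Verdict { SUPPORTED, WEAKENED, UNTESTED }

// observed == null means the prediction has not been checked yet.
record Prediction(String statement, Boolean observed) {}

record Hypothesis(String claim, List<Prediction> predictions) {
    /** One contradicted prediction weakens the hypothesis; unchecked ones keep it untested. */
    Verdict evaluate() {
        boolean anyUnchecked = false;
        for (Prediction p : predictions) {
            if (p.observed() == null) {
                anyUnchecked = true;
            } else if (!p.observed()) {
                return Verdict.WEAKENED;
            }
        }
        return anyUnchecked || predictions.isEmpty() ? Verdict.UNTESTED : Verdict.SUPPORTED;
    }
}
```

SUPPORTED here only means that no checked prediction contradicted the claim. It is not proof of causation.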

5. Logs, Metrics, Traces: How to Use Each

5.1 Metrics answer “how much” and “when”

Metrics are good for:

impact size,
start time,
trend,
rate,
saturation,
error budget burn,
recovery confirmation.

Example questions:

When did error ratio increase?
Which operation has the highest bad-event ratio?
Is latency high for all versions or only new deployment?
Is queue age rising or draining?

5.2 Logs answer “what happened here”

Logs are good for:

error code distribution,
exception cause chain,
domain rejection reason,
request lifecycle events,
audit events,
deployment/config decisions,
sample-level investigation.

Do not start from an unscoped full-text log search. Start from a metric or trace, then pull sample logs by correlation ID, trace ID, or error code.

5.3 Traces answer “where time/failure flowed”

Traces are good for:

critical path,
dependency latency,
retry/fallback behavior,
fan-out/fan-in,
context propagation gap,
cross-service causal chain.

A good trace should be able to answer:

Which span consumed the latency budget?
Which dependency returned error?
Was fallback used?
How many retry attempts happened?
Where did context disappear?
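
To make "which span consumed the latency budget" concrete, here is a rough sketch that ranks spans by self time. It assumes a flattened, hypothetical `Span` record and child spans that run sequentially; real tracing backends have richer models, and parallel children would need a different calculation.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical flattened span; parentId is null for the root span.
record Span(String id, String parentId, String name, long durationMs) {}

final class CriticalPath {
    /** Self time = own duration minus the summed duration of direct children. */
    static long selfTimeMs(Span span, List<Span> all) {
        long children = all.stream()
            .filter(s -> span.id().equals(s.parentId()))
            .mapToLong(Span::durationMs)
            .sum();
        return Math.max(0, span.durationMs() - children);
    }

    /** The span that spent the most time in its own work rather than in children. */
    static Span slowestBySelfTime(List<Span> all) {
        return all.stream()
            .max(Comparator.comparingLong((Span s) -> selfTimeMs(s, all)))
            .orElseThrow();
    }
}
```

Ranking by self time rather than total duration avoids always blaming the root span, which is the longest by construction.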

6. Debugging by Failure Shape

6.1 Availability drop

Symptoms:

5xx rate rising,
timeouts rising,
circuit breaker open,
pod restart,
rejected task.

Check:

error ratio by operation/version,
top error codes,
dependency failure ratio,
recent deployment,
pod status/restart reason,
thread pool and connection pool metrics.

Likely failure modes:

bad deploy,
dependency outage,
connection pool exhaustion,
thread pool saturation,
retry storm,
schema mismatch,
config error.

6.2 Latency spike

Symptoms:

p95/p99 rising,
timeouts starting to appear,
queue age rising,
request concurrency rising.

Check:

trace critical path,
dependency latency,
DB query duration,
GC pause,
lock contention,
pool acquisition time,
CPU throttling/container limits.

Likely failure modes:

slow dependency,
DB lock/query regression,
blocking call in event loop,
pool starvation,
GC pressure,
noisy neighbor/container throttling.

6.3 Correctness failure

Symptoms:

duplicate side effects,
wrong state transition,
inconsistent read/write,
audit mismatch,
support tickets despite green metrics.

Check:

domain event timeline,
idempotency table,
transaction boundaries,
state transition logs,
message redelivery,
retry behavior,
compensation events.

Likely failure modes:

non-idempotent retry,
race condition,
stale read,
missing invariant check,
partial commit,
out-of-order event.

6.4 Freshness/backlog failure

Symptoms:

queue lag/age rising,
batch job misses deadline,
stale cache,
delayed projection.

Check:

producer rate vs consumer rate,
consumer error/retry/DLQ,
oldest message age,
partition skew,
downstream bottleneck,
poison messages.

Likely failure modes:

consumer capacity insufficient,
poison message retry loop,
downstream timeout,
partition hot spot,
batch window too short,
schema evolution issue.
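
The "producer rate vs consumer rate" check reduces to simple arithmetic: a backlog only drains when consumers outpace producers. A minimal sketch, assuming steady rates over the window you measured (`BacklogMath` is a hypothetical helper):

```java
import java.util.OptionalDouble;

final class BacklogMath {
    /** Estimated seconds until the backlog is empty; empty if it is flat or growing. */
    static OptionalDouble secondsToDrain(long backlog, double producePerSec, double consumePerSec) {
        double net = consumePerSec - producePerSec;
        if (net <= 0) {
            return OptionalDouble.empty(); // consumers are not keeping up
        }
        return OptionalDouble.of(backlog / net);
    }
}
```

For example, 6000 queued messages with 100 msg/s produced and 150 msg/s consumed drain at a net 50 msg/s, so about 120 seconds. If the result is empty, scaling consumers or fixing the downstream has to come before waiting.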

7. JVM Diagnostics: Production-Safe Ladder

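JVM diagnostics should be done in stages. Start with the cheapest and safest tool.

The further down the ladder you go, the greater the overhead, risk, and sensitivity. The sections below cover the main tools, from thread dumps through JFR and heap histograms to full heap dumps, which are the heaviest.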

8. Thread Dump

A thread dump shows what each thread is doing at a given moment.

Useful for:

deadlock,
lock contention,
thread pool starvation,
blocked I/O,
runaway thread,
virtual thread diagnosis,
stuck shutdown.

Command examples:

jcmd <pid> Thread.print > thread-dump-$(date +%s).txt

Or:

jstack <pid> > thread-dump-$(date +%s).txt

Take several dumps at an interval:

for i in 1 2 3; do
  jcmd <pid> Thread.print > thread-dump-$i.txt
  sleep 10
done

One dump is only a snapshot. Three dumps help show whether a thread is stuck or making progress.

8.1 What to look for

Pattern	Meaning
Many threads BLOCKED on same monitor	Lock contention
Many threads WAITING on pool acquisition	Connection/thread pool starvation
Same stack across dumps	Stuck work
Deadlock section present	JVM detected monitor deadlock
Many threads logging	Logging bottleneck
Many virtual threads parked on I/O	May be normal; check downstream latency
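
Some of these patterns can also be checked from inside the JVM, for example from a diagnostics endpoint. The sketch below uses the standard `java.lang.management.ThreadMXBean`; the class name `ThreadStateSummary` is hypothetical. Note that `ThreadMXBean` covers platform threads only, so virtual threads still need a dump taken with jcmd.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateSummary {
    /** Counts live platform threads per Thread.State at this instant. */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip monitor/synchronizer details to keep the snapshot cheap
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Number of threads the JVM reports as deadlocked (0 if none). */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }
}
```

A BLOCKED count that stays high across several calls points toward lock contention, but the stack traces in a full dump are still needed to see which monitor is involved.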

9. Java Flight Recorder

Java Flight Recorder (JFR) is a diagnostic and profiling facility built into the JDK. It records runtime events such as allocation, CPU, locks, GC, threads, I/O, and exceptions, with relatively low overhead when configured properly.

Useful for:

CPU hot path,
allocation pressure,
lock contention,
GC behavior,
exception storm,
file/socket I/O,
thread scheduling,
virtual thread pinning investigation.

Command examples:

jcmd <pid> JFR.start name=incident settings=profile duration=120s filename=/tmp/incident.jfr

If a recording is already running:

jcmd <pid> JFR.check
jcmd <pid> JFR.dump name=incident filename=/tmp/incident-dump.jfr
jcmd <pid> JFR.stop name=incident

Production caution:

Do not record for longer than you need.
Write the file to a location with enough free space.
Treat JFR files as potentially sensitive artifacts.
Do not enable overly verbose settings without understanding the overhead.
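
When attaching jcmd is not possible, a bounded recording can also be started from application code with the `jdk.jfr` API. This is a sketch under assumptions: `IncidentRecording` is a hypothetical helper, it uses the built-in "default" settings rather than "profile" to stay on the low-overhead side, and it writes to a temp file that you would still need to move somewhere access-controlled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import java.time.Duration;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

final class IncidentRecording {
    /** Records for the given window and returns the path of the dumped .jfr file. */
    static Path record(Duration window) {
        try {
            Configuration config = Configuration.getConfiguration("default");
            Path out = Files.createTempFile("incident-", ".jfr");
            try (Recording recording = new Recording(config)) {
                recording.setName("incident");
                recording.start();
                Thread.sleep(window.toMillis()); // bounded window: never record open-ended
                recording.stop();
                recording.dump(out);
            }
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParseException e) {
            throw new IllegalStateException("JFR settings could not be parsed", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while recording", e);
        }
    }
}
```

The method blocks the calling thread for the whole window, so in a real service it would run on a dedicated thread and behind an authenticated operator action.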

10. Heap Diagnostics

10.1 Heap histogram

A heap histogram is lighter than a heap dump.

jcmd <pid> GC.class_histogram > class-histogram.txt

Use it to see which classes dominate.

Useful questions:

Is any object type growing fast?
Do buffers/String/byte[] dominate?
Is a cache/map growing without bound?
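
Answering "what is growing fast" usually means comparing two histograms taken a few minutes apart. The sketch below diffs two captured outputs. It assumes the usual row shape of `GC.class_histogram` output (rank, instance count, bytes, class name), which you should verify against your JDK version; `HistogramDiff` is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HistogramDiff {
    // Assumed row shape: "   1:   12345   678901  java.lang.String (java.base@21)"
    private static final Pattern ROW =
        Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Parses histogram lines into bytes per class; header and total lines are skipped. */
    static Map<String, Long> bytesByClass(List<String> lines) {
        Map<String, Long> bytes = new HashMap<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                bytes.put(m.group(3), Long.parseLong(m.group(2)));
            }
        }
        return bytes;
    }

    /** Bytes gained per class between snapshots; classes that shrank are omitted. */
    static Map<String, Long> growth(List<String> before, List<String> after) {
        Map<String, Long> old = bytesByClass(before);
        Map<String, Long> grown = new HashMap<>();
        bytesByClass(after).forEach((cls, b) -> {
            long delta = b - old.getOrDefault(cls, 0L);
            if (delta > 0) grown.put(cls, delta);
        });
        return grown;
    }
}
```

Growth between two snapshots is only a hint. A class that grows and then drops after the next GC is allocation churn, not a leak.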

10.2 Heap dump

A heap dump can be very large, can contain sensitive data, and can put pressure on disk and CPU.

jcmd <pid> GC.heap_dump /tmp/heap-$(date +%s).hprof

Use it when:

an OOM/leak is strongly suspected,
the histogram is not enough,
instance retention needs to be analyzed,
approval and secure handling are in place.

Do not take a heap dump automatically for every incident.

11. GC and Memory Failure Debugging

11.1 Symptoms

latency spikes aligned with GC pauses,
high allocation rate,
old generation grows and does not drop,
container OOMKilled,
OutOfMemoryError,
restart loop.

11.2 Evidence

Check:

GC logs,
heap usage after GC,
allocation rate,
native memory if relevant,
container memory limit,
heap dump/histogram,
JFR allocation events,
recent deployment allocation changes.

11.3 Common root patterns

Pattern	Evidence
Java heap leak	Old gen grows across full GC, retained objects grow
Allocation storm	High allocation rate, frequent young GC, CPU pressure
Native memory pressure	RSS grows beyond heap, direct buffers/metaspace/thread stacks
Container limit mismatch	JVM heap + native > cgroup memory
Large response buffering	byte[]/String/object graph spike

Command examples:

jcmd <pid> GC.heap_info
jcmd <pid> VM.native_memory summary

VM.native_memory only returns useful data if the JVM was started with Native Memory Tracking enabled, for example with -XX:NativeMemoryTracking=summary.
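
"Heap usage after GC" is the number that separates a leak from normal churn, and it can be read in-process through the standard memory pool MXBeans. A minimal sketch follows; `HeapAfterGc` is a hypothetical name, and the pool names it returns depend on the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

final class HeapAfterGc {
    /** Bytes still in use in each heap pool after its most recent collection. */
    static Map<String, Long> usedAfterLastGc() {
        Map<String, Long> used = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // metaspace, code cache and similar are non-heap pools
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported by this pool
            if (afterGc != null) {
                used.put(pool.getName(), afterGc.getUsed());
            }
        }
        return used;
    }
}
```

If the old generation value keeps rising across samples taken after full collections, that matches the "Java heap leak" row above. This view says nothing about native memory, which is why RSS and the container limit still need to be compared separately.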

12. Thread Pool and Connection Pool Debugging

Thread pool and connection pool failures often present as latency or timeout.

12.1 Thread pool evidence

Metrics to collect:

active threads,
pool size,
queue size,
queue age,
completed task rate,
rejected task count,
task duration.

Thread dump evidence:

workers blocked on dependency,
workers waiting on DB connection,
workers stuck in synchronized block,
logging appender blocking,
too many tasks in same path.
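
Several of the thread pool metrics listed above are available directly from `java.util.concurrent.ThreadPoolExecutor`. The sketch below captures them in one snapshot; `PoolSnapshot` and its `saturated()` rule are illustrative assumptions, and queue age and task duration are not covered because the executor does not track them.

```java
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical value type; the getters it reads are standard ThreadPoolExecutor methods.
record PoolSnapshot(int active, int poolSize, int maxPoolSize, int queued, long completed) {

    static PoolSnapshot of(ThreadPoolExecutor pool) {
        return new PoolSnapshot(
            pool.getActiveCount(),          // approximate number of busy workers
            pool.getPoolSize(),
            pool.getMaximumPoolSize(),
            pool.getQueue().size(),
            pool.getCompletedTaskCount());  // approximate, monotonically increasing
    }

    /** Saturated here means: every allowed worker is busy and work is waiting. */
    boolean saturated() {
        return active >= maxPoolSize && queued > 0;
    }
}
```

Taking two snapshots a few seconds apart and comparing `completed` gives a rough completed-task rate. A saturated pool whose rate is near zero is the case where a thread dump is most useful.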

12.2 Connection pool evidence

Metrics:

active connections,
idle connections,
pending acquisition,
acquisition duration,
timeout count,
query duration,
transaction duration.

Common mistakes:

increasing pool size without DB capacity,
increasing timeout without deadline budget,
ignoring leaked connections,
ignoring long transactions,
retrying when pool is already exhausted.

13. Distributed Failure Debugging

Distributed systems fail partially.

13.1 Use trace to identify boundary

In trace view, inspect:

root span duration,
slow child span,
failed child span,
retries,
fallback span,
queue publish/consume link,
missing parent context.

13.2 Use logs to explain local decision

For failed sample, correlate by:

trace ID,
correlation ID,
request ID,
error code,
idempotency key hash,
message ID,
job execution ID.

13.3 Use metrics to measure scale

After understanding one failed example, measure:

how many requests share that error code,
how many tenants/regions/versions affected,
whether rate is increasing,
whether mitigation works.

One sample explains behavior. Metrics prove magnitude.

14. Change Correlation

Most production incidents are triggered by change, but not all changes are deployments.

Check:

application deployment,
dependency deployment,
feature flag change,
config change,
secret/certificate rotation,
schema migration,
data backfill,
traffic/routing change,
autoscaling event,
infrastructure/node upgrade,
library version change,
observability agent change.

Timeline example:

10:00 deployment v42 begins
10:04 first pod v42 ready
10:06 p99 latency starts increasing for v42 only
10:08 retry rate increases
10:10 checkout SLO fast-burn alert fires
10:12 rollback begins
10:18 error ratio returns to baseline

This is strong evidence for deployment-related regression, but still not root cause. Root cause may be blocking call, config mismatch, query regression, or dependency behavior triggered by the new version.
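
A timeline like this is easier to reason about when it is kept as structured events rather than chat messages. The sketch below finds the last change before the first symptom, which is the usual first suspect. `TimelineEvent`, `Kind`, and `Timeline` are hypothetical names, and classifying each event is a judgment call made by whoever builds the timeline.

```java
import java.time.LocalTime;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical event model for illustration.
enum Kind { CHANGE, SYMPTOM, ALERT, MITIGATION }

record TimelineEvent(LocalTime at, Kind kind, String description) {}

final class Timeline {
    /** Latest CHANGE at or before the first SYMPTOM, if there is one. */
    static Optional<TimelineEvent> lastChangeBeforeFirstSymptom(List<TimelineEvent> events) {
        Optional<TimelineEvent> firstSymptom = events.stream()
            .filter(e -> e.kind() == Kind.SYMPTOM)
            .min(Comparator.comparing(TimelineEvent::at));
        if (firstSymptom.isEmpty()) {
            return Optional.empty();
        }
        LocalTime onset = firstSymptom.get().at();
        return events.stream()
            .filter(e -> e.kind() == Kind.CHANGE && !e.at().isAfter(onset))
            .max(Comparator.comparing(TimelineEvent::at));
    }
}
```

With the example above as input, the result is the 10:04 "first pod v42 ready" event. That is a suspect ordered by time, which still has to be confirmed through the hypothesis loop.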

15. Kubernetes and Runtime Context

For Java services in Kubernetes, include platform evidence.

Check:

kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl get events -n <namespace> --sort-by=.lastTimestamp
kubectl top pod -n <namespace>

Look for:

restarts,
OOMKilled,
CrashLoopBackOff,
readiness/liveness probe failures,
CPU throttling,
node pressure,
image pull issues,
rollout stuck,
termination grace exceeded,
sidecar issues.

For graceful shutdown debugging, correlate:

SIGTERM time,
readiness change,
request drain,
executor shutdown logs,
queue listener stop,
telemetry flush,
KILL time.
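
On the application side, the drain step between SIGTERM and KILL often comes down to shutting executors down within a deadline. A minimal sketch, assuming a plain `ExecutorService` and a deadline chosen to be shorter than the pod's termination grace period (`GracefulDrain` is a hypothetical helper):

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class GracefulDrain {
    /** Returns true if all tasks finished inside the deadline; otherwise interrupts stragglers. */
    static boolean drain(ExecutorService executor, Duration deadline) {
        executor.shutdown(); // stop accepting new work, let submitted tasks finish
        try {
            if (executor.awaitTermination(deadline.toMillis(), TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<Runnable> neverStarted = executor.shutdownNow(); // interrupt running tasks
        System.err.println("drain deadline exceeded; tasks never started: " + neverStarted.size());
        return false;
    }
}
```

It could be wired in with `Runtime.getRuntime().addShutdownHook(new Thread(() -> GracefulDrain.drain(pool, Duration.ofSeconds(20))))`. Logging the start and result of the drain gives you the "executor shutdown logs" entry in the correlation list above.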

16. Debugging Error Code Spikes

If your architecture has error codes, debugging becomes faster.

Process:

Identify top error codes by rate.
Split by endpoint/operation/version.
Pick representative trace/log sample for each.
Map error code to owner and domain meaning.
Distinguish expected rejection vs unexpected failure.
Check whether retryable errors are actually retried safely.
Check whether unknown error code is boundary translation failure.

Example query intent:

Show rate of application_errors_total by error_code for service=case-decision in last 30m.

Investigation questions:

Is this a business rejection spike?
Is validation rejecting a new client payload?
Is dependency failure being translated correctly?
Is error code stable or newly introduced?
Is client behavior changed?

17. Production-Safe Debugging Rules

17.1 Do not make state-changing experiments blindly

Bad:

Let's retry all failed messages immediately.

Risk:

duplicate side effects,
retry storm,
downstream overload,
corrupted state.

Better:

sample a small batch,
verify idempotency,
use rate limit,
monitor downstream,
record replay decision.
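
The "better" list can be sketched as a replay loop that is small, idempotent, and paced. This is an illustration under assumptions: `SafeReplay` is hypothetical, message IDs stand in for real messages, the in-memory set stands in for an idempotency table, and a fixed sleep stands in for a proper rate limiter.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

final class SafeReplay {
    /**
     * Replays at most batchLimit messages, skipping IDs that were already processed,
     * and pauses between sends. Returns how many messages were actually replayed.
     */
    static int replay(List<String> failedIds, Set<String> processedIds,
                      int batchLimit, long pauseMillis, Consumer<String> handler) {
        int replayed = 0;
        for (String id : failedIds) {
            if (replayed >= batchLimit) break;        // small batch first
            if (processedIds.contains(id)) continue;  // idempotency check
            handler.accept(id);
            processedIds.add(id);                     // mark only after the handler succeeded
            replayed++;
            try {
                Thread.sleep(pauseMillis);            // crude rate limit
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return replayed;
    }
}
```

Starting with a batch limit of a handful of messages and watching downstream metrics before raising it keeps the experiment reversible.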

17.2 Do not increase timeouts as default mitigation

Increasing timeout can:

hold threads longer,
exhaust pools,
increase queue age,
worsen user latency,
amplify cascading failure.

Only increase timeout if:

downstream is healthy but legitimately slower,
caller deadline allows it,
concurrency/pool capacity is safe,
SLO impact improves.
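
The "caller deadline allows it" condition can be made explicit by capping every downstream timeout with the time the caller has left. A minimal sketch, assuming the remaining deadline is propagated to the callee somehow (`DeadlineBudget` and the `reserve` parameter are hypothetical):

```java
import java.time.Duration;

final class DeadlineBudget {
    /** Timeout for a downstream call: the configured value, capped by what the caller has left. */
    static Duration effectiveTimeout(Duration configured, Duration callerRemaining, Duration reserve) {
        Duration available = callerRemaining.minus(reserve); // keep time to build the response
        if (available.isNegative() || available.isZero()) {
            return Duration.ZERO; // no budget left: fail fast instead of calling
        }
        return configured.compareTo(available) <= 0 ? configured : available;
    }
}
```

With a configured timeout of 2 s, 800 ms left on the caller's deadline, and a 100 ms reserve, the call gets 700 ms. Raising the configured value alone would change nothing for that request, which is one reason a larger timeout is rarely the fix.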

17.3 Do not restart before preserving evidence

Restart may remove:

thread state,
heap state,
JFR context,
logs in ephemeral storage,
in-memory queue/cache evidence.

If impact is severe, mitigation may outweigh evidence preservation. But make that trade-off explicit.

17.4 Do not expose sensitive artifacts

Thread dump, heap dump, logs, and JFR may contain:

PII,
secrets,
tokens,
payload data,
tenant identifiers,
business-sensitive decisions.

Handle with restricted access and retention policy.

18. Debugging Playbooks by Symptom

18.1 High 5xx

1. Check error ratio by endpoint/version/region.
2. Check top error codes and exception classes.
3. Check recent deployment/config/flag changes.
4. Sample traces for failed requests.
5. Check dependency error ratio.
6. Check pod restarts and OOM.
7. Mitigate: rollback, flag off, fallback, isolate dependency, shed load.

18.2 High latency

1. Check p95/p99 by endpoint/version/region.
2. Compare app latency vs dependency latency.
3. Inspect slow trace critical path.
4. Check pool acquisition time.
5. Check GC pause and CPU throttling.
6. Take thread dump if saturation/stuck suspected.
7. Mitigate: rollback, reduce concurrency, shed load, scale if safe.

18.3 Consumer lag/backlog

1. Check oldest message age and consumer throughput.
2. Check error/retry/DLQ rate.
3. Identify poison message pattern.
4. Check downstream latency/error.
5. Check partition skew.
6. Pause harmful consumer if repeated side effects risk exists.
7. Mitigate: isolate poison messages, scale consumers if downstream allows, fix retry policy.

18.4 Memory pressure/OOM

1. Check restart reason and memory graph.
2. Check heap vs RSS.
3. Capture histogram/JFR/heap dump if safe.
4. Check recent deployment and traffic/payload change.
5. Look for cache/map/buffer growth.
6. Mitigate: rollback, reduce load, scale out, disable memory-heavy feature.

18.5 Stuck shutdown

1. Check SIGTERM and readiness transition logs.
2. Check in-flight request drain.
3. Check executor shutdown logs.
4. Take thread dump before KILL if possible.
5. Identify blocking tasks and non-daemon threads.
6. Check telemetry flush and message ack behavior.
7. Fix lifecycle ordering and deadlines.

19. Debugging Checklist

During incident:

After mitigation:

SLO recovery confirmed.
Backlog/drain state checked.
Duplicate/partial side effects checked.
Unknown outcomes captured.
Root/contributing factors investigated.
Telemetry gaps documented.
Preventive actions assigned.

20. Practice Lab

Lab 1 — Build a timeline

Given logs, deployment events, and SLO graph, build:

start time,
first bad event,
detection time,
mitigation time,
recovery time,
suspected trigger,
evidence strength.

Lab 2 — Debug latency spike

Simulate:

one endpoint p99 latency spike,
dependency call slow,
retry count rising,
thread pool queue growing.

Produce:

hypothesis,
evidence query,
mitigation,
postmortem action.

Lab 3 — Thread dump reading

Take three thread dumps from a test service with an intentionally blocked pool.

Identify:

blocked threads,
waiting threads,
repeated stack,
owner code path,
mitigation.

Lab 4 — JFR incident recording

Run a Java app with artificial allocation pressure.

Capture:

JFR recording,
allocation hotspot,
GC behavior,
code path causing pressure.

21. Anti-Patterns

Anti-pattern	Why it fails
Debugging from favorite subsystem	Bias hides actual cause
Starting from logs without scope	Too much noise
Treating correlation as causation	Wrong fix risk
Restarting everything	Removes evidence, may amplify issue
Increasing timeout blindly	Can worsen saturation
Retrying failed jobs blindly	Can duplicate side effects
Ignoring correctness failures	Green availability can hide bad outcomes
No timeline	Team argues from memory
No evidence preservation	Postmortem becomes speculation

22. Key Takeaways

Production debugging is evidence-driven impact reduction.
Start from user impact and symptom metrics, not from implementation guesses.
Use traces for causal path, logs for local facts, metrics for scale and trend.
JVM diagnostics are powerful but must be used with production safety.
Thread dumps are cheap and useful for stuck/saturation problems.
JFR is valuable for CPU, allocation, lock, GC, and runtime event analysis.
Heap dumps can expose sensitive data and should be deliberate.
The strongest debugging artifact is a clear timeline with evidence.

23. What Comes Next

Part 033 covers error management architecture: how to bring the exception hierarchy, error code registry, Problem Details, logging, metrics, traces, audit, retry policy, and boundary translation together into one defensible architecture.
