
Distributed Tracing Mental Model

Learn Java Error, Reliability & Observability Engineering - Part 027

A mental model of distributed tracing for Java production systems: trace, span, parent-child, causal chain, critical path, context propagation, sampling, span design, and debugging failures across services.

2026-06-28 · 17 min read · 3260 words


Part 027 — Distributed Tracing Mental Model

Metrics answer: how often, how slow, how broken. Logs answer: what happened at a particular point. Tracing answers: how one unit of work moved through the system and where its causal chain broke.

Distributed tracing is not a dashboard feature. It is an evidence model for answering questions such as:

Where did this request start?
Which service called which service?
Which operation was the most expensive?
Did this error come from a dependency, a policy, the data, or a local bug?
Did the caller's timeout cause cancellation in the callee?
Did the retry produce multiple attempts or a single long attempt?
Can the log, metric, and error be tied to the same unit of work?

This part builds the tracing mental model before moving on to OpenTelemetry Java in Part 028.

1. Skill Target

After completing this part, you should be able to:

Explain the differences between trace, span, event, link, attribute, status, resource, and baggage.
Design span boundaries that are useful, rather than merely numerous.
Read a trace as a causal graph, not a pretty timeline.
Identify the critical path and bottlenecks.
Distinguish local latency, dependency latency, queue latency, retry latency, and client wait time.
Decide when to create a new span, when to add an event, and when an attribute is enough.
Avoid trace cardinality leaks, privacy leaks, and span explosion.
Use traces to debug errors, timeouts, retries, fallbacks, shutdown, and partial failure.

2. Mental Model: Trace as Evidence of Work

One trace represents one end-to-end unit of work.

Examples of a unit of work:

HTTP request POST /cases/{id}/submit
message handling CaseSubmitted
scheduled job RecomputeRiskScores
batch chunk processing import-line[10000..10999]
workflow step ApplySanctionDecision
async fan-out operation FetchCustomerProfile + FetchRiskScore + FetchAuditTrail

A trace consists of many spans. A span represents one measured operation within that unit of work.

A trace is not only a tree. In modern systems, a trace can be a graph:

parent-child for a synchronous causal chain
span links for async, messaging, and batch fan-out
remote parent for context received from another service
separate traces for long-running workflows, connected by a domain correlation ID

Core principle:

A trace should explain the journey of the work, not the structure of the code.

3. Trace vs Span vs Log vs Metric

Signal	Unit	Main Question	Strength	Weakness
Metric	Time series aggregate	“Is the system healthy?”	Cheap, aggregated, alertable	Loses per-request detail
Log	Event record	“What happened?”	Textual, event-level detail	Hard to see the causal chain
Trace	Causal path	“Where did this request spend time or fail?”	End-to-end path	Sampling, storage cost, instrumentation gaps

A trace is not a replacement for logs or metrics.

A good trace usually connects all three.

During an incident, the ideal investigation pattern is:

metrics show the aggregate symptom
an alert opens the dashboard
the dashboard points to a trace sample/exemplar
the trace shows the problematic service/span
logs filtered by trace_id explain the local detail
code/config explains the root cause

4. Core Terms

4.1 Trace

A trace is the entire causal path of one unit of work.

Every span on that path carries the same trace_id.

Example:

trace_id = 0af7651916cd43dd8448eb211c80319c

A trace ID is not a request ID. A request ID may be local to a gateway; a trace ID must be propagated across service and operation boundaries.

4.2 Span

A span represents an operation with a start time, end time, parent, attributes, status, and events.

A minimally useful span has:

name        = POST /cases/{caseId}/submit
kind        = SERVER
start/end   = timestamp
status      = OK | ERROR | UNSET
attributes  = http.request.method, http.route, domain.operation, error.code
parent      = previous span context

A span is measurement plus context.
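To make "measurement plus context" concrete, the fields listed above can be sketched as a plain Java data model. This is only an illustration using the JDK: `SpanRecord`, `SpanEvent`, `SpanKind`, and `SpanStatus` are hypothetical names invented here, not types from OpenTelemetry or any other tracing library.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;

/** Telemetry-level outcome of a span. */
enum SpanStatus { UNSET, OK, ERROR }

/** Role of the span relative to a process or network boundary. */
enum SpanKind { SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL }

/** A point-in-time marker inside a span. */
record SpanEvent(String name, Instant at, Map<String, String> attributes) {}

/**
 * Hypothetical, simplified span: measurement (start/end) plus context
 * (trace id, span id, parent id) plus diagnosis data (attributes, events).
 */
record SpanRecord(
        String traceId,
        String spanId,
        String parentSpanId, // null for a root span
        String name,
        SpanKind kind,
        Instant start,
        Instant end,
        SpanStatus status,
        Map<String, String> attributes,
        List<SpanEvent> events) {

    /** The measurement part: how long the operation took. */
    Duration duration() {
        return Duration.between(start, end);
    }

    /** The context part: a span without a parent starts the trace. */
    boolean isRoot() {
        return parentSpanId == null;
    }
}
```

Two spans belong to the same trace when their `traceId` values match, and the `parentSpanId` of a child equals the `spanId` of the span that started it.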

4.3 Parent-Child

Parent-child expresses a causal dependency:

A started B
A waits for B or depends on B's outcome

Example:

Trace tree:

POST /cases/{id}/submit
├── db.query case
└── GET risk-service /risk/{id}

4.4 Span Kind

Common span kind:

Kind	Meaning	Example
`SERVER`	receives a remote request	HTTP controller receiving a request
`CLIENT`	makes a remote request	HTTP client call, DB client call
`PRODUCER`	sends a message	publish Kafka/Rabbit/SQS message
`CONSUMER`	receives/processes a message	message listener
`INTERNAL`	in-process internal operation	validation, policy evaluation

Rule of thumb:

If the operation crosses a network or process boundary, the kind is usually not INTERNAL.

4.5 Attribute

An attribute is key-value metadata on a span.

Example:

service.name = case-command-service
http.route = /cases/{caseId}/submit
case.state.from = DRAFT
case.state.to = SUBMITTED
error.code = CASE_STATE_CONFLICT
retry.attempt = 2

Attributes should be searchable, and low-cardinality when they are used for aggregation.

Do not include:

raw payload
token
password
full user input
full SQL with sensitive literal
unbounded exception messages as a dimension
customer name, email, phone, address
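One way to enforce the "do not include" list is to pass every candidate attribute through an allowlist before it reaches a span. The sketch below assumes a hypothetical `SafeAttributes` helper with an invented key list and length limit; a real allowlist would come from a reviewed telemetry policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class SafeAttributes {
    // Hypothetical allowlist for illustration; not taken from any library.
    private static final Set<String> ALLOWED_KEYS = Set.of(
            "http.route", "case.state.from", "case.state.to", "error.code", "retry.attempt");

    // Hypothetical cap so free-text values cannot grow without bound.
    private static final int MAX_VALUE_LENGTH = 64;

    private SafeAttributes() {}

    /** Keeps only allowlisted keys and truncates overly long values. */
    static Map<String, String> filter(Map<String, String> candidate) {
        Map<String, String> safe = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : candidate.entrySet()) {
            String value = entry.getValue();
            if (value == null || !ALLOWED_KEYS.contains(entry.getKey())) {
                continue; // unknown keys (payloads, tokens, PII) are dropped by default
            }
            safe.put(entry.getKey(),
                    value.length() > MAX_VALUE_LENGTH ? value.substring(0, MAX_VALUE_LENGTH) : value);
        }
        return safe;
    }
}
```

The design choice is deny-by-default: a new attribute has to be added to the allowlist deliberately, instead of someone having to remember to block it.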

4.6 Event

A span event is a point-in-time marker within a span.

Use events for important occurrences inside an operation:

event: validation.failed
attributes:
  validation.error_count = 3
  error.code = CASE_REQUIRED_FIELD_MISSING

Events are a good fit for:

exception event
fallback activated
retry attempt
state transition accepted/rejected
lock acquired/released
degraded mode selected
message ack/nack decision

4.7 Status

Span status describes the telemetry-level outcome:

OK: the operation succeeded from an observability standpoint
ERROR: the operation failed
UNSET: the default, used when no status has been set explicitly

Do not blindly equate HTTP status with span status.

Example:

HTTP Status	Domain Meaning	Span Status
200	success	OK
400	expected client validation rejection	OK or ERROR, depending on observability policy
404	expected “not found” lookup	often OK for an ordinary lookup
409	domain state conflict	usually ERROR if the command operation fails
500	server failure	ERROR
503	dependency unavailable	ERROR

For platform reliability, command rejections that are normal but important are usually recorded as a domain outcome metric/log, not always as an infrastructure error.
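The table can be written down as an explicit policy function so that every service applies the same mapping. This is a minimal sketch of one possible policy, not a standard: `SpanStatusPolicy`, `TelemetryStatus`, and the `validationRejectionIsError` switch are hypothetical names introduced for illustration.

```java
/** Telemetry-level status, mirroring the OK / ERROR / UNSET values above. */
enum TelemetryStatus { OK, ERROR, UNSET }

final class SpanStatusPolicy {
    // Policy switch: should an expected 4xx validation rejection count as a span error?
    private final boolean validationRejectionIsError;

    SpanStatusPolicy(boolean validationRejectionIsError) {
        this.validationRejectionIsError = validationRejectionIsError;
    }

    /** Maps the HTTP status of a server span to a span status, following the table above. */
    TelemetryStatus forServerSpan(int httpStatus) {
        if (httpStatus >= 500) {
            return TelemetryStatus.ERROR; // server failure or unavailable dependency
        }
        if (httpStatus == 409) {
            return TelemetryStatus.ERROR; // domain state conflict: the command failed
        }
        if (httpStatus == 404) {
            return TelemetryStatus.OK;    // expected "not found" lookup
        }
        if (httpStatus >= 400) {
            return validationRejectionIsError ? TelemetryStatus.ERROR : TelemetryStatus.OK;
        }
        return TelemetryStatus.OK;
    }
}
```

Whatever values a team picks, keeping the mapping in one function makes the policy reviewable and consistent across endpoints.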

4.8 Link

A span link connects spans that have no direct parent-child relationship.

Use links for:

a message consumer span that relates to a producer span
a batch job that processes many input traces
fan-in from many parents
an async handoff that does not preserve the parent lifecycle

Parent-child means “executed within the causal execution tree”. A link means “causally related, but not a direct child”.

4.9 Resource

A resource describes the entity that produces the telemetry.

Example:

service.name = case-command-service
service.version = 2026.06.28.1
deployment.environment = prod
cloud.region = ap-southeast-3
k8s.namespace.name = enforcement
k8s.pod.name = case-command-7fdc...

Resource attributes should help answer:

which service?
which version?
which environment?
which region/zone?
which pod/container?

4.10 Baggage

Baggage is key-value context that is propagated across services.

Important:

Baggage is not an automatically safe attribute. It can travel out with downstream requests, including requests to third parties, so it must be tightly restricted.

Use baggage only for data that:

is not sensitive
has controlled cardinality
genuinely needs to be available across services
is safe to be seen by downstream services you do not fully control

Relatively safe examples:

business_unit = enforcement
channel = backoffice

Dangerous examples:

user_email
raw customer_id
jwt
case narrative
investigation note

5. W3C Trace Context Mental Model

To keep traces from breaking across vendors, tools, and services, distributed tracing needs a standard propagation format.

HTTP usually uses these headers:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor-specific-data

traceparent carries:

version-trace_id-parent_id-flags

Interpretation:

00                                version
0af7651916cd43dd8448eb211c80319c  trace_id
b7ad6b7169203331                  parent/span id
01                                sampled flag, etc.
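The field layout above can be checked with a small parser. This is a sketch that only handles the four-field layout shown here (two hex digits, 32 hex digits, 16 hex digits, two hex digits, all lowercase); `TraceParent` is a hypothetical record, not a class from a tracing library. The rejection of version `ff` and of all-zero IDs follows the W3C Trace Context rules.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

record TraceParent(String version, String traceId, String parentId, int flags) {

    // version-trace_id-parent_id-flags, lowercase hex only
    private static final Pattern FORMAT =
            Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** Returns the parsed header, or empty when the value is missing or malformed. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String version = m.group(1);
        String traceId = m.group(2);
        String parentId = m.group(3);
        // Version "ff" and all-zero ids are invalid in W3C Trace Context.
        if (version.equals("ff")
                || traceId.equals("0".repeat(32))
                || parentId.equals("0".repeat(16))) {
            return Optional.empty();
        }
        return Optional.of(new TraceParent(version, traceId, parentId, Integer.parseInt(m.group(4), 16)));
    }

    /** The lowest bit of the flags field is the sampled flag. */
    boolean sampled() {
        return (flags & 0x01) != 0;
    }
}
```

A service that receives an empty result would start a new trace, which is exactly the situation that produces an unexpected root span in the middle of a call chain.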

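Mental model: the caller injects the current trace context into the outgoing request headers, and the callee extracts it and continues the same trace.

If context is not propagated, the callee starts a new trace and the causal chain is cut.

Symptoms of broken trace context:

every service has a new trace
logs have a correlation ID but no trace ID
a root span appears in a middle service
client spans have no matching server span
message consumer traces are not connected to the producer
the dashboard dependency map looks disconnected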

6. Trace as Causal Graph, Not Stack Trace

A stack trace shows the local call stack at the moment of an error.

A distributed trace shows the causal path across services and time.

Example stack trace:

CaseService.submit
PolicyEvaluator.evaluate
RiskClient.getRisk
HttpClient.send

Example trace:

POST /cases/{id}/submit  1400ms ERROR
├── validate request       8ms OK
├── db.select case         24ms OK
├── evaluate policy        1170ms ERROR
│   ├── GET /risk          1000ms ERROR timeout
│   └── fallback stale     80ms OK
└── persist audit          50ms OK

The trace tells us:

the latency came from the risk service, not the DB
the fallback activated but the command still failed
the audit was still written even though the command failed
the caller waited 1.4 seconds
the domain error code probably originated from a dependency timeout

7. Critical Path

The critical path is the chain of spans that determines total latency.

Fan-out example:

POST /cases/{id}/review  900ms
├── load case             40ms
├── parallel checks       700ms
│   ├── sanctions check   120ms
│   ├── risk check        700ms
│   └── duplicate check   200ms
└── persist decision      80ms

The total is not 40 + 120 + 700 + 200 + 80. Because the checks run in parallel, the critical path is:

load case + max(checks) + persist decision
= 40 + 700 + 80
= 820ms plus overhead
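The same arithmetic can be expressed as a tiny model: sequential steps add up, parallel steps contribute only their slowest branch. The types below (`Step`, `Leaf`, `Sequential`, `Parallel`, `CriticalPathDemo`) are hypothetical and exist only to reproduce the calculation above.

```java
import java.util.List;

/** A unit of work whose contribution to end-to-end latency can be computed. */
interface Step {
    long criticalPathMs();
}

/** A single operation with a measured duration. */
record Leaf(String name, long durationMs) implements Step {
    public long criticalPathMs() {
        return durationMs;
    }
}

/** Steps that run one after another: durations add up. */
record Sequential(List<Step> steps) implements Step {
    public long criticalPathMs() {
        return steps.stream().mapToLong(Step::criticalPathMs).sum();
    }
}

/** Steps that run concurrently: only the slowest branch is on the critical path. */
record Parallel(List<Step> steps) implements Step {
    public long criticalPathMs() {
        return steps.stream().mapToLong(Step::criticalPathMs).max().orElse(0L);
    }
}

final class CriticalPathDemo {
    /** The POST /cases/{id}/review example from the text. */
    static Step reviewRequest() {
        return new Sequential(List.of(
                new Leaf("load case", 40),
                new Parallel(List.of(
                        new Leaf("sanctions check", 120),
                        new Leaf("risk check", 700),
                        new Leaf("duplicate check", 200))),
                new Leaf("persist decision", 80)));
    }

    public static void main(String[] args) {
        System.out.println(reviewRequest().criticalPathMs() + " ms"); // prints "820 ms"
    }
}
```

Speeding up the sanctions or duplicate check would not change the result; only the risk check, the case load, or the persist step can shorten this request.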

In a trace UI, do not just look for the longest span. Look for the spans that sit on the critical path.

A long span that is not on the critical path can be caused by:

a detached background task
an async operation that was not awaited
an incorrect telemetry span lifetime
queue wait time merged with processing time

8. Span Design Principles

8.1 A span should represent an operation boundary

Good spans:

POST /cases/{caseId}/submit
policy.evaluate.submit_case
db.case.select_by_id
risk-service GET /risk/{subjectId}
case.transition.persist

Bad spans:

methodA
line 48
loop iteration 923
process
execute

Span names must be stable and must not contain high-cardinality values.

Good:

POST /cases/{caseId}/submit

Bad:

POST /cases/CASE-2026-000000123/submit

8.2 One span is not one method

Do not instrument every method.

Instrument:

API boundary
dependency call
expensive domain operation
state transition
queue wait/processing
retry attempt
fallback decision
transaction boundary
external side effect

No span is needed for:

getters/setters
trivial pure helpers
string formatting
small object mapping
every predicate

8.3 Span duration must be unambiguous

Before creating a span, answer:

What begins when the span starts, and what has finished when the span ends?

Example of ambiguity:

span: publish CaseSubmitted

Does its duration cover:

serializing the event?
sending to the broker?
waiting for the broker ack?
scheduling an async publish?
retries?

State the span boundary clearly:

message.serialize CaseSubmitted
kafka.produce case-events

8.4 Attributes should support diagnosis

A span without attributes is often of little use.

Minimum for a domain operation:

operation.name = submit_case
case.type = enforcement
case.state.from = DRAFT
case.state.to = SUBMITTED
outcome = rejected
error.code = CASE_STATE_CONFLICT
retryable = false

Minimum for a dependency call:

peer.service = risk-service
http.request.method = GET
http.route = /risk/{subjectId}
http.response.status_code = 503
retry.attempt = 2
timeout.ms = 800

9. Span Boundary Catalog for Java Services

9.1 HTTP Server Boundary

Root-ish span per incoming request:

SERVER POST /cases/{caseId}/submit

Attributes:

http.request.method = POST
url.path = /cases/123/submit
http.route = /cases/{caseId}/submit
http.response.status_code = 409
user_agent.original = ... if safe
client.address = ... if policy allows

Avoid:

http.route = /cases/123/submit
request.body = {...}

9.2 Domain Command Boundary

For important commands:

INTERNAL case.submit

Attributes:

domain.entity = case
domain.operation = submit
case.state.from = DRAFT
case.state.to = SUBMITTED
rule_set.version = 2026-06-01

9.3 Policy Evaluation Boundary

INTERNAL policy.evaluate submit_case

Attributes:

policy.name = submit_case_eligibility
policy.version = 2026.06
policy.outcome = reject
policy.reason_code = MISSING_APPROVAL

Do not include sensitive narrative detail. Store the full detail in an access-controlled audit store, not in telemetry.

9.4 Persistence Boundary

Auto-instrumentation usually creates DB spans.

Make sure:

query spans do not contain raw PII
sensitive parameters do not appear
slow queries are visible
connection pool metrics are available as metrics, not only in traces

Span:

CLIENT SELECT enforcement.case

Attributes:

db.system = postgresql
db.operation.name = SELECT
db.collection.name = enforcement_case

9.5 Messaging Boundary

Producer:

PRODUCER case-events publish CaseSubmitted

Consumer:

CONSUMER case-events process CaseSubmitted

Important attributes:

messaging.system = kafka
messaging.destination.name = case-events
messaging.operation.name = process
message.type = CaseSubmitted
message.id = stable message id if safe

For async systems, a span link may be more accurate than parent-child, depending on the instrumentation model.

9.6 Batch Boundary

Batch trace design must avoid one giant unhelpful span.

Better:

job.import_cases
├── chunk.read[0]
├── chunk.validate[0]
├── chunk.persist[0]
├── chunk.read[1]
├── chunk.validate[1]
└── chunk.persist[1]

Attributes:

job.name = import_cases
chunk.size = 1000
chunk.index = 12
records.success = 982
records.failed = 18

Avoid a span per row unless you are debugging in a special sampling mode.

10. Error Representation in Traces

When an operation fails, trace should show:

where the failure happened
what kind of failure it was
whether it was expected or unexpected
whether it was retryable
whether fallback/degradation happened
what external effect already happened
final caller-visible outcome

Example:

POST /cases/{id}/submit ERROR 409
├── case.load OK
├── policy.evaluate ERROR
│   └── event: exception
│       exception.type = CaseStateConflictException
│       exception.message = safe summary only
│       error.code = CASE_STATE_CONFLICT
│       retryable = false
└── audit.record OK

Do not mark all client errors as span error without thinking.

A validation failure may be:

normal product behavior
not an availability error
still a domain rejection metric
still important for UX/support

Recommended distinction:

Failure	Span Status	Metric	Log
Validation rejection	OK/ERROR by policy	`request_rejected_total{reason}`	INFO/WARN structured
Domain conflict	ERROR for command span	`domain_rejection_total`	INFO/WARN with code
Dependency timeout	ERROR	`dependency_timeout_total`	WARN/ERROR
Bug/null pointer	ERROR	`application_exception_total`	ERROR with stack
Security denial	OK or ERROR by policy	`security_denial_total`	WARN/security audit

The key is consistency.

11. Trace Sampling

Tracing every request can be expensive.

Sampling decides which traces are retained/exported.

Common strategies:

Strategy	Meaning	Use Case
Head sampling	decision near trace start	simple, cheap
Tail sampling	decision after seeing full trace	keep errors/slow traces
Probabilistic sampling	percentage-based	high-volume endpoints
Rules-based sampling	endpoint/error/tenant-aware	production control
Always sample	retain everything	local/dev/test/low volume critical ops

Important mental model:

Sampling is not just cost optimization. It shapes what failures you can see.

Bad sampling policy:

sample 1% of all traffic, regardless of status

Problem:

rare error may disappear
slow traces may not be retained
VIP tenant issue may be invisible
async failure trace may be dropped

Better policy:

retain:
- all ERROR traces
- traces above latency threshold
- all critical command operations
- small percentage of normal traffic
- selected tenant/case under investigation

Tail sampling is powerful but requires collector/backend support and buffering.
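The "better policy" above can be sketched as a retention decision made after the whole trace has been seen. Everything here is illustrative: `TraceSummary`, `RetentionPolicy`, the 1000 ms threshold, the operation names, and the 5 percent baseline are assumptions, and the sketch leaves out the "selected tenant/case under investigation" rule. In practice this logic usually lives in a collector or backend rather than in application code.

```java
import java.util.Set;

/** What a tail sampler knows once a trace is complete. */
record TraceSummary(String traceId, String rootOperation, boolean hasError, long durationMs) {}

final class RetentionPolicy {
    // Hypothetical values for illustration only.
    private static final long SLOW_THRESHOLD_MS = 1_000;
    private static final Set<String> CRITICAL_OPERATIONS =
            Set.of("submit_case", "apply_sanction_decision");
    private static final int BASELINE_PERCENT = 5;

    private RetentionPolicy() {}

    /** Decides whether a finished trace is kept. */
    static boolean retain(TraceSummary trace) {
        if (trace.hasError()) {
            return true; // all ERROR traces
        }
        if (trace.durationMs() >= SLOW_THRESHOLD_MS) {
            return true; // traces above the latency threshold
        }
        if (CRITICAL_OPERATIONS.contains(trace.rootOperation())) {
            return true; // all critical command operations
        }
        // Small share of normal traffic; deterministic, so a given trace id always gets the same answer.
        return Math.floorMod(trace.traceId().hashCode(), 100) < BASELINE_PERCENT;
    }
}
```

The ordering matters: errors and slow traces are kept before the probabilistic rule ever runs, so a rare failure cannot be lost to the baseline percentage.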

12. Trace Cardinality and Cost

High cardinality in traces is less destructive than high cardinality in metrics, but still dangerous.

Bad attribute examples:

case.id = CASE-2026-0000000123
user.email = person@example.com
sql.full_statement = SELECT ... literal PII ...
exception.message = "Case 123 owned by John..."

Safer alternatives:

case.type = enforcement
case.state = SUBMITTED
user.role = supervisor
error.code = CASE_STATE_CONFLICT
sql.operation = SELECT

For debugging a specific entity, use:

secure audit store
support tooling
searchable domain event store
temporary controlled diagnostic mode
hash/tokenized identifier if approved by policy
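For the last option, a keyed hash is one common way to produce a tokenized identifier: the same entity always maps to the same token, but the raw ID cannot be read from the trace. This is a sketch using the JDK's `javax.crypto.Mac`; the `IdentifierTokenizer` class, the 16-character truncation, and the way the secret is supplied are assumptions, and whether such a token may be recorded at all remains a policy decision.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class IdentifierTokenizer {
    private static final String ALGORITHM = "HmacSHA256";
    private static final int TOKEN_BYTES = 8; // hypothetical: 8 bytes -> 16 hex characters

    private final SecretKeySpec key;

    /** The secret must come from a managed secret store, never from source code. */
    IdentifierTokenizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, ALGORITHM);
    }

    /** Returns a stable, non-reversible token for a raw identifier such as a case ID. */
    String tokenize(String rawIdentifier) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(key);
            byte[] digest = mac.doFinal(rawIdentifier.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(TOKEN_BYTES * 2);
            for (int i = 0; i < TOKEN_BYTES; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 is not available", e);
        }
    }
}
```

Using a keyed hash instead of a plain hash means that someone with access to the telemetry backend cannot confirm a guessed ID without also holding the secret.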

Do not turn the observability backend into an unauthorized data lake.

13. Tracing Failure Patterns

13.1 Broken Context Propagation

Symptoms:

Service A trace: POST /submit -> HTTP client risk-service
Service B trace: GET /risk appears as independent root trace

Likely causes:

custom HTTP client not instrumented
manual thread/executor context lost
message headers dropped
gateway/proxy strips headers
non-W3C propagator mismatch
async callback starts with empty context
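The executor case is easy to reproduce with nothing but a `ThreadLocal`. The sketch below uses a hypothetical `TraceContextHolder` as a stand-in for whatever context mechanism a tracing library provides: a task submitted as-is sees no trace ID on the worker thread, while a wrapped task sees the one captured from the submitting thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical thread-bound trace context, standing in for a real tracing library's context. */
final class TraceContextHolder {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    private TraceContextHolder() {}

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    /** Captures the submitting thread's trace id and restores it around the task on the worker thread. */
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = TRACE_ID.get(); // read on the submitting thread
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Leave the pooled worker thread as it was found.
                if (previous == null) TRACE_ID.remove(); else TRACE_ID.set(previous);
            }
        };
    }
}

final class PropagationDemo {
    /** Returns {trace id seen without wrapping, trace id seen with wrapping}. */
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        TraceContextHolder.set("0af7651916cd43dd8448eb211c80319c");
        try {
            Callable<String> readContext = TraceContextHolder::get;
            String lost = pool.submit(readContext).get();                          // context lost
            String kept = pool.submit(TraceContextHolder.wrap(readContext)).get(); // context preserved
            return new String[] {lost, kept};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            TraceContextHolder.clear();
            pool.shutdown();
        }
    }
}
```

The first element of the result is `null`, which is the in-process equivalent of a downstream service starting a brand-new trace.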

13.2 Span Explosion

Symptoms:

trace has thousands of spans
UI slow/unusable
cost spikes
important spans hidden

Causes:

instrument every method
span per item in large batch
span per loop iteration
high-volume internal cache lookup spans

Fix:

aggregate small operations
use events/counters instead of spans
sample detail mode only when diagnosing

13.3 Missing Error Cause

Symptoms:

root span ERROR
dependency span OK
no exception event
logs have stack trace but no trace ID

Causes:

catch and wrap without recording cause
error swallowed and converted to generic response
trace status only set at outer boundary
logger not correlated

Fix:

record exception on the span where it occurs
set error code attribute
keep cause chain
correlate logs with trace/span id

13.4 Misleading Span Duration

Symptoms:

span shows 20s duration but actual network call was 200ms
span includes queue wait and processing without distinction
span starts before task is scheduled, ends after callback cleanup

Fix:

separate queue.wait and task.process
define span lifetime precisely
avoid starting span too early
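Separating the two durations only requires recording one extra timestamp: the moment the task actually starts running. A minimal sketch, with `TaskTimeline` as a hypothetical record:

```java
import java.time.Duration;
import java.time.Instant;

/** Three timestamps are enough to tell waiting apart from working. */
record TaskTimeline(Instant enqueuedAt, Instant startedAt, Instant finishedAt) {

    /** Time spent waiting in the queue: a candidate for a queue.wait span. */
    Duration queueWait() {
        return Duration.between(enqueuedAt, startedAt);
    }

    /** Time spent doing the work: a candidate for a task.process span. */
    Duration processing() {
        return Duration.between(startedAt, finishedAt);
    }

    /** What a single span started at enqueue time would have reported. */
    Duration total() {
        return Duration.between(enqueuedAt, finishedAt);
    }
}
```

For the symptom above, a task enqueued at `t`, started at `t + 19.8s`, and finished at `t + 20s` yields a 19.8 s queue wait and 200 ms of processing, which points at pool or queue sizing rather than at the network call.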

13.5 Trace Without Domain Meaning

Symptoms:

Controller.process
Service.execute
Repository.find
HttpClient.send

This is technically traced but not operationally useful.

Fix:

use route names
use domain operation names
add stable error code
add policy/rule version
add outcome attributes

14. Reading a Trace During Incident

Use this order:

Step 1: Identify the root span outcome

Ask:

status code?
error code?
user-visible result?
total duration?
route/operation?

Step 2: Find the critical path

Ask:

which child span dominates latency?
are spans sequential or parallel?
is there queue time?
is caller waiting on callee?

Step 3: Locate first meaningful failure

Do not stop at outermost error.

Find:

first exception event
first dependency timeout
first non-2xx dependency result (4xx or 5xx)
first cancellation
first fallback activation
first retry attempt

Step 4: Compare retries and attempts

Ask:

how many attempts?
were they serialized or parallel?
did retry exceed deadline?
did attempts hit same dependency?
did error change between attempts?

Step 5: Correlate logs

Use trace_id and span_id.

Look for:

stack trace
boundary translation
safe attributes
audit decision
cancellation reason
shutdown marker

Step 6: Confirm aggregate impact

Trace shows one request. Metrics show population.

Ask:

is this isolated?
is error rate elevated?
is latency p95/p99 elevated?
is one tenant/region/version affected?
is one dependency failing globally?

15. Tracing Retry, Timeout, and Fallback

Bad trace:

POST /submit 3000ms ERROR timeout
└── GET /risk 3000ms ERROR

Better trace:

POST /submit 3000ms ERROR RISK_TIMEOUT
└── risk.lookup 3000ms ERROR
    ├── attempt 1 GET /risk 800ms ERROR timeout
    ├── event retry.scheduled backoff_ms=100
    ├── attempt 2 GET /risk 800ms ERROR timeout
    ├── event retry.scheduled backoff_ms=200
    ├── attempt 3 GET /risk 800ms ERROR timeout
    └── fallback.stale-risk 100ms OK stale_age_ms=60000

This trace answers:

timeout budget used by retries
fallback activated
stale data age
final result still failed or degraded
retry count visible

Important attributes:

retry.attempt
retry.max_attempts
retry.backoff.ms
timeout.ms
deadline.remaining.ms
fallback.name
fallback.outcome
fallback.data_staleness.ms
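To produce one span per attempt, the retry loop has to expose each attempt as its own record instead of hiding them inside one long call. The sketch below is a deterministic model with no real I/O or sleeping: `RetryPlan` and `AttemptRecord` are hypothetical names, the dependency call is reduced to "does attempt n succeed?", and the doubling backoff is an assumption chosen to match the 100 ms and 200 ms values in the trace above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** One entry per attempt; each would become its own child span with retry.attempt set. */
record AttemptRecord(int attempt, boolean success, long backoffBeforeNextMs) {}

final class RetryPlan {
    private RetryPlan() {}

    /** Runs up to maxAttempts and records every attempt, including the backoff scheduled after it. */
    static List<AttemptRecord> run(int maxAttempts, long initialBackoffMs, IntPredicate attemptSucceeds) {
        List<AttemptRecord> records = new ArrayList<>();
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean ok = attemptSucceeds.test(attempt);
            boolean willRetry = !ok && attempt < maxAttempts;
            records.add(new AttemptRecord(attempt, ok, willRetry ? backoff : 0));
            if (ok) {
                break;
            }
            backoff *= 2; // assumed exponential backoff: 100 ms, 200 ms, ...
        }
        return records;
    }

    /** Worst-case time consumed if every recorded attempt runs into its timeout. */
    static long budgetMs(List<AttemptRecord> attempts, long perAttemptTimeoutMs) {
        long backoffTotal = attempts.stream().mapToLong(AttemptRecord::backoffBeforeNextMs).sum();
        return attempts.size() * perAttemptTimeoutMs + backoffTotal;
    }
}
```

With three failing attempts, an 800 ms timeout, and a 100 ms initial backoff, `budgetMs` returns 2700: most of the 3000 ms in the example trace is consumed by attempts and backoff before the fallback even starts.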

16. Tracing Shutdown and Cancellation

During shutdown, traces can show incomplete work.

Recommended events:

event: shutdown.received
attributes:
  signal = SIGTERM
  grace_period_ms = 30000

event: intake.stopped

event: inflight.drain.started
attributes:
  inflight.count = 42

event: task.cancelled
attributes:
  reason = shutdown_deadline

For request traces:

POST /cases/{id}/submit ERROR
├── event cancellation.requested reason=shutdown
├── db.write UNKNOWN
└── audit.unknown_outcome_recorded OK

Operational rule:

If shutdown can interrupt side effects, the trace must help classify final outcome as success, failure, cancelled, or unknown.
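The four outcomes can be derived from two facts the service already knows at shutdown: whether cancellation was requested, and how far the side effect got. The sketch below is one possible classification, not a prescribed rule; `WriteState`, `FinalOutcome`, and `OutcomeClassifier` are hypothetical names.

```java
/** How far a side effect (for example a DB write) got before the operation ended. */
enum WriteState { NOT_STARTED, IN_FLIGHT, COMMITTED, ROLLED_BACK }

/** Caller-visible classification recorded on the span. */
enum FinalOutcome { SUCCESS, FAILURE, CANCELLED, UNKNOWN }

final class OutcomeClassifier {
    private OutcomeClassifier() {}

    static FinalOutcome classify(boolean cancellationRequested, WriteState write) {
        switch (write) {
            case COMMITTED:
                return FinalOutcome.SUCCESS;   // the side effect is durable, whatever happened afterwards
            case IN_FLIGHT:
                return FinalOutcome.UNKNOWN;   // sent but never confirmed: must be reconciled later
            case ROLLED_BACK:
            case NOT_STARTED:
            default:
                // No durable effect: the only question is why the work stopped.
                return cancellationRequested ? FinalOutcome.CANCELLED : FinalOutcome.FAILURE;
        }
    }
}
```

`UNKNOWN` is the important case: it corresponds to the `db.write UNKNOWN` span in the example and signals that the outcome has to be reconciled rather than assumed.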

17. Trace and Regulatory Defensibility

For regulatory systems, tracing helps with engineering diagnosis, but it is not always the system of record.

Use trace for:

execution path
latency source
technical failure cause
dependency behavior
retry/fallback/cancellation evidence

Use audit/event store for:

official decision history
actor identity
legal basis
before/after state
immutable business record
user-visible explanation

Never rely on sampled traces as sole audit record.

Recommended relationship:

A trace can reference an audit event ID only if policy allows it and the cardinality and cost are acceptable. It is often safer to record:

audit.event.type = CASE_SUBMISSION_REJECTED
audit.recorded = true

Not:

audit.event.id = AUDIT-2026-000000123456789

Unless you deliberately need it and the backend is access-controlled.

18. Good Trace Example

Trace: POST /cases/{caseId}/submit
trace_id=0af765...

SERVER POST /cases/{caseId}/submit 942ms ERROR
attributes:
  http.route=/cases/{caseId}/submit
  http.response.status_code=409
  error.code=CASE_STATE_CONFLICT
  domain.operation=submit_case

├── INTERNAL case.submit 870ms ERROR
│   attributes:
│     case.state.from=UNDER_REVIEW
│     case.state.to=SUBMITTED
│     retryable=false
│
│   ├── INTERNAL validation.request 8ms OK
│   │   attributes:
│   │     validation.error_count=0
│   │
│   ├── CLIENT db.case.select_by_id 25ms OK
│   │
│   ├── INTERNAL policy.evaluate submit_case 80ms ERROR
│   │   event exception
│   │   attributes:
│   │     policy.name=submit_case_state_policy
│   │     policy.version=2026.06
│   │     error.code=CASE_STATE_CONFLICT
│   │
│   └── CLIENT audit.record 45ms OK
│
└── INTERNAL response.problem_details 5ms OK

Why this is good:

stable route name
clear domain operation
failure cause visible near policy span
outer HTTP response mapped to problem details
audit recording visible
no sensitive payload
no raw case ID in span name

19. Bad Trace Example

Trace: /cases/CASE-123/submit

Controller.submit 1000ms ERROR
├── method1 2ms
├── method2 3ms
├── method3 4ms
├── method4 5ms
├── execute 900ms ERROR
└── toString 1ms

Problems:

route contains concrete ID
method names expose code structure, not operation meaning
no error code
no policy version
no dependency spans
no outcome classification
too many low-value spans
no correlation with audit/domain state

20. Trace Design Checklist

For each new service/feature, define:

Root operation:
- What is the main unit of work?
- What is the stable span name?
- What route/message/job name identifies it?

Important child operations:
- Which dependency calls matter?
- Which domain decisions matter?
- Which side effects matter?
- Which expensive computations matter?

Attributes:
- What low-cardinality attributes help diagnosis?
- What error code identifies failure?
- What outcome field supports filtering?
- What data must never be recorded?

Context propagation:
- Which HTTP headers are propagated?
- Which message headers carry trace context?
- Which executor/reactive boundaries preserve context?

Sampling:
- Are errors retained?
- Are slow traces retained?
- Are critical commands retained?

Correlation:
- Do logs contain trace_id/span_id?
- Are metrics linked through exemplars if supported?

21. Practice Lab

Lab 1 — Trace Design for Command API

Design spans for:

POST /cases/{caseId}/escalate

Requirements:

validate request
load case
check state transition
evaluate policy
persist transition
publish event
record audit
return Problem Details on rejection

Produce:

span tree
attributes per important span
error mapping
log correlation fields
metrics that complement the trace

Lab 2 — Debug Broken Trace

Given:

Trace A:
POST /submit
└── HTTP client GET risk-service

Trace B:
GET /risk
└── DB query

Find likely causes and fixes.

Expected answers:

outgoing HTTP instrumentation missing
gateway strips traceparent
risk service not extracting W3C context
propagation library mismatch
custom client not instrumented

Lab 3 — Retry Trace

Design trace for:

Risk service timeout after 3 attempts, stale fallback succeeds, command returns 202 Accepted with degraded flag.

Include:

attempt spans
retry events
fallback span
final status
attributes
logs

22. Common Anti-Patterns

22.1 “Trace Everything”

More spans do not equal more observability.

Trace everything often creates:

high cost
unreadable trace
noisy UI
low signal
performance overhead

22.2 “Trace Nothing Internal”

Only auto-instrumented HTTP/DB spans often miss domain causes.

Add manual spans for:

policy decision
state transition
fallback decision
validation aggregate
queue wait
retry orchestration

22.3 Sensitive Attributes

Telemetry has broad operational access. Do not put confidential case data into traces.

22.4 Misusing Baggage

Do not put everything into baggage because it “follows the request”. That is exactly why it is dangerous.

22.5 Exception Only at Root

If only the root span has error, the trace does not show failure origin.

Record exception near source.

22.6 No Sampling Policy for Errors

Sampling that drops rare errors defeats tracing during incident.

23. Production Readiness Checklist

A service is tracing-ready when:

24. Key Takeaways

A trace is evidence of one unit of work moving through a system.
A span is an operation boundary with timing, status, attributes, events, and parent context.
Trace value comes from causal structure, not from span count.
Good traces use stable names, safe attributes, and domain outcome fields.
Context propagation is the backbone of distributed tracing.
Sampling policy determines which failures remain visible.
In regulatory systems, traces support diagnosis but do not replace audit records.

25. References

OpenTelemetry — Context Propagation: https://opentelemetry.io/docs/concepts/context-propagation/
OpenTelemetry — Semantic Conventions: https://opentelemetry.io/docs/concepts/semantic-conventions/
OpenTelemetry — Trace Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/general/trace/
OpenTelemetry — Baggage: https://opentelemetry.io/docs/concepts/signals/baggage/
W3C Trace Context: https://www.w3.org/TR/trace-context/
