OpenTelemetry Java
Learn Java Error, Reliability & Observability Engineering - Part 028
Production-grade OpenTelemetry Java practice: Java agent, manual instrumentation, tracer, span, context propagation, exporter, collector, semantic conventions, exception recording, logs correlation, sampling, and debugging trace gaps.
Part 028 — OpenTelemetry Java
The previous part built a mental model of distributed tracing. Now we move into Java practice with OpenTelemetry.
OpenTelemetry Java can be used in three layers:
- Zero-code instrumentation with the Java agent.
- Manual instrumentation with the OpenTelemetry API for domain-specific spans, metrics, and log enrichment.
- SDK/Collector configuration for export, sampling, resource attributes, processors, and vendor/backend integration.
The goal of this part is not just “traces show up in Jaeger/Tempo/vendor APM”. The real goal:
The Java application produces telemetry that is causally correct, safe with respect to data, stable in schema, and useful when debugging production failures.
1. OpenTelemetry Architecture
Responsibilities:
| Component | Responsibility |
|---|---|
| API | Code-level interface for creating spans/metrics/logs without binding to a vendor |
| SDK | Runtime implementation: sampler, processor, exporter, resource |
| Java Agent | Bytecode instrumentation for popular frameworks/libraries |
| Exporter | Sends telemetry out, typically over OTLP |
| Collector | Receives, processes, batches, samples, transforms, and routes telemetry |
| Backend | Storage, query, visualization, alerting |
Important boundary:
Business code should depend on the OpenTelemetry API, not directly on a vendor APM.
2. Auto Instrumentation vs Manual Instrumentation
2.1 Auto Instrumentation
The Java agent is a good fit for edge spans:
- inbound HTTP server
- outbound HTTP client
- JDBC/database client
- messaging clients
- common frameworks
- logs MDC correlation depending on configuration
- runtime/JVM metrics depending on setup
Advantages:
- quick to enable
- broad coverage
- few code changes
- good for a dependency map
Limitations:
- does not know your domain operations
- does not know regulatory decision semantics
- does not know state transitions
- does not know your specific fallback policies
- does not know your business error codes
2.2 Manual Instrumentation
Manual instrumentation is a good fit for:
- domain command
- policy evaluation
- validation aggregate
- state transition
- retry orchestration
- fallback selection
- audit recording
- idempotency decision
- batch chunk processing
- feature degradation
Rule:
Use the agent for technical boundaries. Use manual spans for domain boundaries.
3. Running the Java Agent
Common launch:
java \
-javaagent:/opt/opentelemetry-javaagent.jar \
-Dotel.service.name=case-command-service \
-Dotel.resource.attributes=deployment.environment=prod,service.version=2026.06.28.1 \
-Dotel.exporter.otlp.endpoint=http://otel-collector:4318 \
-Dotel.traces.exporter=otlp \
-Dotel.metrics.exporter=otlp \
-Dotel.logs.exporter=otlp \
-jar app.jar
Equivalent environment style:
export JAVA_TOOL_OPTIONS="-javaagent:/opt/opentelemetry-javaagent.jar"
export OTEL_SERVICE_NAME="case-command-service"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,service.version=2026.06.28.1"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_METRICS_EXPORTER="otlp"
export OTEL_LOGS_EXPORTER="otlp"
java -jar app.jar
For Kubernetes, prefer environment variables, with the agent jar mounted from a volume or baked into an image layer.
Example container command concept:
containers:
  - name: app
    image: registry.example.com/case-command-service:2026.06.28.1
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-javaagent:/otel/opentelemetry-javaagent.jar"
      - name: OTEL_SERVICE_NAME
        value: "case-command-service"
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: "deployment.environment=prod,service.version=2026.06.28.1"
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: "http://otel-collector.observability:4318"
4. Resource Attributes
Resource attributes identify the producer of telemetry.
Minimal production fields:
service.name=case-command-service
service.version=2026.06.28.1
deployment.environment=prod
service.namespace=enforcement
Useful platform fields:
k8s.namespace.name=enforcement
k8s.pod.name=case-command-7fdc...
k8s.container.name=app
cloud.region=ap-southeast-3
Do not put per-request/user/case data in resource attributes. Resource describes service instance, not business entity.
Bad:
case.id=CASE-123
user.id=U-999
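The rule above can be checked mechanically before a deployment goes out. The following is a minimal stdlib-only sketch and not part of the OpenTelemetry API: the class name `ResourceAttributeCheck` and the small denylist of per-request keys are assumptions made for illustration. It parses a value in the `key=value,key=value` form used by `OTEL_RESOURCE_ATTRIBUTES` and reports keys that look like business entity data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not an OpenTelemetry class.
final class ResourceAttributeCheck {

    // Assumed denylist: keys that describe a request or business entity,
    // not the service instance that produces the telemetry.
    private static final Set<String> PER_REQUEST_KEYS = Set.of("case.id", "user.id");

    // Parses "k1=v1,k2=v2" into an ordered map; entries without '=' are skipped.
    static Map<String, String> parse(String raw) {
        Map<String, String> attrs = new LinkedHashMap<>();
        if (raw == null || raw.isBlank()) {
            return attrs;
        }
        for (String pair : raw.split(",")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attrs.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return attrs;
    }

    // Returns the keys that should not appear in resource attributes.
    static List<String> perRequestKeys(Map<String, String> attrs) {
        List<String> offending = new ArrayList<>();
        for (String key : attrs.keySet()) {
            if (PER_REQUEST_KEYS.contains(key)) {
                offending.add(key);
            }
        }
        return offending;
    }
}
```

A check like this could run in a deployment pipeline or a startup self-test; the denylist would need to reflect your own domain keys.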
5. Maven Dependencies for Manual Instrumentation
For application code that creates spans manually:
<dependencies>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-api</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-context</artifactId>
</dependency>
</dependencies>
If you configure SDK in application code, add SDK/exporter dependencies. In many production setups with Java agent, application code only needs API dependency. The agent supplies SDK runtime and bridges instrumentation.
6. Creating a Manual Span
Example domain command instrumentation:
package com.example.casecmd.telemetry;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class CaseCommandTelemetry {

    private static final Tracer tracer = GlobalOpenTelemetry.getTracer(
        "com.example.casecmd",
        "1.0.0"
    );

    private final CaseService caseService;

    public CaseCommandTelemetry(CaseService caseService) {
        this.caseService = caseService;
    }

    public SubmitResult submit(SubmitCaseCommand command) {
        Span span = tracer.spanBuilder("case.submit")
            .setAttribute("domain.entity", "case")
            .setAttribute("domain.operation", "submit")
            .setAttribute("case.state.target", "SUBMITTED")
            .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            SubmitResult result = caseService.submit(command);
            span.setAttribute("outcome", result.outcome().name());
            span.setAttribute("retryable", result.retryable());
            if (result.rejected()) {
                span.setAttribute("error.code", result.errorCode());
                span.setStatus(StatusCode.ERROR, result.errorCode());
            }
            return result;
        } catch (DomainException ex) {
            span.recordException(ex);
            span.setAttribute("error.code", ex.errorCode());
            span.setAttribute("retryable", ex.retryable());
            span.setStatus(StatusCode.ERROR, ex.errorCode());
            throw ex;
        } catch (RuntimeException ex) {
            span.recordException(ex);
            span.setStatus(StatusCode.ERROR, ex.getClass().getSimpleName());
            throw ex;
        } finally {
            span.end();
        }
    }
}
Important points:
- span.makeCurrent() makes this span the parent for downstream spans created in the same context.
- recordException(ex) records an exception event on the span.
- setStatus(ERROR, ...) classifies the failure.
- span.end() must happen exactly once.
- The span name is stable: case.submit, not case.submit.CASE-123.
7. Utility Wrapper to Avoid Span Leaks
Manual span code can become repetitive. Encapsulate carefully.
package com.example.telemetry;

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import java.util.function.Supplier;

public final class TracingSupport {

    private final Tracer tracer;

    public TracingSupport(Tracer tracer) {
        this.tracer = tracer;
    }

    public <T> T inSpan(String spanName, Supplier<T> action) {
        Span span = tracer.spanBuilder(spanName).startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return action.get();
        } catch (RuntimeException ex) {
            span.recordException(ex);
            span.setStatus(StatusCode.ERROR, ex.getClass().getSimpleName());
            throw ex;
        } finally {
            span.end();
        }
    }
}
Domain-aware wrapper:
public <T> T inDomainSpan(
String spanName,
String operation,
Supplier<T> action
) {
Span span = tracer.spanBuilder(spanName)
.setAttribute("domain.operation", operation)
.startSpan();
try (Scope ignored = span.makeCurrent()) {
return action.get();
} catch (DomainException ex) {
span.recordException(ex);
span.setAttribute("error.code", ex.errorCode());
span.setAttribute("retryable", ex.retryable());
span.setStatus(StatusCode.ERROR, ex.errorCode());
throw ex;
} catch (RuntimeException ex) {
span.recordException(ex);
span.setStatus(StatusCode.ERROR, ex.getClass().getSimpleName());
throw ex;
} finally {
span.end();
}
}
Do not hide all tracing behind magic annotations if developers lose understanding of span lifetime.
8. Adding Span Events
Events are useful for state changes inside a longer span.
Span current = Span.current();
current.addEvent("validation.completed", Attributes.builder()
.put("validation.error_count", errors.size())
.put("validation.mode", "accumulate")
.build());
Example fallback event:
current.addEvent("fallback.selected", Attributes.builder()
.put("fallback.name", "stale-risk-score")
.put("fallback.reason", "risk-service-timeout")
.put("fallback.data_staleness_ms", staleAge.toMillis())
.build());
Use events when:
- event is meaningful but not long enough for span
- you need point-in-time evidence
- you need to mark phase transition
Use span when:
- operation has duration
- operation can fail independently
- operation has child operations
- operation crosses boundary
9. Exception Recording Policy
A good exception trace includes:
exception.type
exception.message, if safe
exception.stacktrace, if configured/allowed
error.code
retryable
outcome
Manual pattern:
try {
return action.get();
} catch (DomainException ex) {
span.recordException(ex);
span.setAttribute("error.code", ex.errorCode());
span.setAttribute("error.category", "domain");
span.setAttribute("retryable", ex.retryable());
span.setStatus(StatusCode.ERROR, ex.errorCode());
throw ex;
}
For infrastructure failure:
catch (ExternalDependencyException ex) {
span.recordException(ex);
span.setAttribute("error.category", "dependency");
span.setAttribute("peer.service", ex.peerService());
span.setAttribute("retryable", ex.retryable());
span.setStatus(StatusCode.ERROR, ex.errorCode());
throw ex;
}
Avoid:
span.setAttribute("exception.message", ex.getMessage()); // if message may contain PII
span.setAttribute("payload", command.toString()); // dangerous
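One way to keep exception details useful without shipping raw messages is to sanitize them before they become attributes. The sketch below is stdlib-only and hypothetical: `SafeExceptionAttributes` is not an OpenTelemetry class, and the two regular expressions (email addresses, runs of six or more digits) are assumptions that illustrate the idea rather than a complete PII detector.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the patterns are assumptions, not a full PII filter.
final class SafeExceptionAttributes {

    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    // Long digit runs often carry phone numbers, account numbers, or national IDs.
    private static final Pattern LONG_DIGITS = Pattern.compile("\\d{6,}");

    // Masks email addresses and long digit runs; a null message becomes an empty string.
    static String redact(String message) {
        if (message == null) {
            return "";
        }
        String masked = EMAIL.matcher(message).replaceAll("[email]");
        return LONG_DIGITS.matcher(masked).replaceAll("[number]");
    }

    // Builds the attribute set from the list above: type, sanitized message, stable error code.
    static Map<String, String> of(Throwable ex, String errorCode) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("exception.type", ex.getClass().getName());
        attrs.put("exception.message", redact(ex.getMessage()));
        attrs.put("error.code", errorCode);
        return attrs;
    }
}
```

The stable `error.code` remains the primary signal; the sanitized message is only a convenience and can be dropped entirely where policy requires it.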
10. Context Propagation in Java
10.1 Current Context
OpenTelemetry context is attached to execution flow.
Span parent = tracer.spanBuilder("parent").startSpan();
try (Scope ignored = parent.makeCurrent()) {
doWork(); // child spans created here see parent
} finally {
parent.end();
}
Inside doWork():
Span child = tracer.spanBuilder("child").startSpan();
try (Scope ignored = child.makeCurrent()) {
// child is current
} finally {
child.end();
}
10.2 Executor Boundary
Thread boundaries can lose context if not instrumented.
Problem:
executor.submit(() -> {
Span child = tracer.spanBuilder("async-work").startSpan();
// may not have correct parent if context not propagated
});
Safer explicit pattern:
import io.opentelemetry.context.Context;

Context parentContext = Context.current();
executor.submit(parentContext.wrap(() -> {
    Span span = tracer.spanBuilder("async-work").startSpan();
    try (Scope ignored = span.makeCurrent()) {
        performAsyncWork();
    } catch (RuntimeException ex) {
        span.recordException(ex);
        span.setStatus(StatusCode.ERROR);
        throw ex;
    } finally {
        span.end();
    }
}));
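To see why wrapping is needed, it helps to reproduce the mechanism without any tracing library. The sketch below is stdlib-only and is not how OpenTelemetry is implemented internally: a plain `ThreadLocal<String>` stands in for the current context, and the hypothetical `wrap` method captures the caller's value, installs it on the worker thread, and restores the previous value afterwards.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: a ThreadLocal stands in for the tracing context.
final class ContextWrapSketch {

    static final ThreadLocal<String> CURRENT_TRACE = new ThreadLocal<>();

    // Captures the submitting thread's value now and applies it when the task runs.
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = CURRENT_TRACE.get();
        return () -> {
            String previous = CURRENT_TRACE.get();
            CURRENT_TRACE.set(captured);
            try {
                return task.call();
            } finally {
                CURRENT_TRACE.set(previous); // do not leak context into the next pooled task
            }
        };
    }

    // Returns what the worker thread observes: [without wrap, with wrap].
    static String[] demo() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CURRENT_TRACE.set("trace-abc");
        try {
            Callable<String> readCurrent = CURRENT_TRACE::get;
            String unwrapped = executor.submit(readCurrent).get();
            String wrapped = executor.submit(wrap(readCurrent)).get();
            return new String[] {unwrapped, wrapped};
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        } finally {
            CURRENT_TRACE.remove();
            executor.shutdown();
        }
    }
}
```

In `demo()`, the unwrapped task sees no trace on the worker thread, while the wrapped task sees the value captured at submission time. The restore step in `finally` matters for pooled threads, where a leaked value would attach unrelated work to the wrong trace.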
10.3 CompletableFuture Boundary
Context context = Context.current();
CompletableFuture<Result> future = CompletableFuture.supplyAsync(
context.wrapSupplier(() -> {
Span span = tracer.spanBuilder("risk.lookup.async").startSpan();
try (Scope ignored = span.makeCurrent()) {
return riskClient.lookup();
} catch (RuntimeException ex) {
span.recordException(ex);
span.setStatus(StatusCode.ERROR);
throw ex;
} finally {
span.end();
}
}),
executor
);
If using agent, many common executors/frameworks may already be instrumented. Still test your actual async path.
10.4 Reactor Boundary
Reactive pipelines have their own context mechanics. Do not assume ThreadLocal/MDC works the same as blocking request-per-thread code.
Conceptual pattern:
Mono.deferContextual(ctx -> {
Span span = tracer.spanBuilder("case.submit.reactive").startSpan();
return Mono.fromCallable(() -> {
try (Scope ignored = span.makeCurrent()) {
return service.submit(command);
}
})
.doOnError(ex -> {
span.recordException(ex);
span.setStatus(StatusCode.ERROR);
})
.doFinally(signal -> span.end());
});
Prefer framework-supported instrumentation where available, then add manual spans around domain operations.
11. Propagating Context Over HTTP Manually
Most production systems should let instrumentation libraries handle HTTP propagation. But knowing the model helps when debugging custom clients.
Conceptual injection:
TextMapSetter<HttpRequestBuilder> setter =
(carrier, key, value) -> carrier.header(key, value);
Context current = Context.current();
openTelemetry.getPropagators()
.getTextMapPropagator()
.inject(current, requestBuilder, setter);
Conceptual extraction:
TextMapGetter<HttpServletRequest> getter = new TextMapGetter<>() {
@Override
public Iterable<String> keys(HttpServletRequest request) {
return Collections.list(request.getHeaderNames());
}
@Override
public String get(HttpServletRequest request, String key) {
return request.getHeader(key);
}
};
Context extracted = openTelemetry.getPropagators()
.getTextMapPropagator()
.extract(Context.current(), request, getter);
Do this manually only when needed. Most frameworks should be instrumented by agent/library.
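When debugging a custom client, it is often enough to inspect the `traceparent` header itself. The following stdlib-only sketch formats and validates a W3C Trace Context header for version `00` only; the class name `TraceParentSketch` is hypothetical, and real propagators handle more (future versions, `tracestate`), so treat this as a debugging aid rather than a replacement for the propagator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of the W3C traceparent layout: version-traceid-parentid-flags.
final class TraceParentSketch {

    // Version "00", 32 lowercase hex trace ID, 16 lowercase hex span ID, 2 hex flags.
    private static final Pattern FORMAT =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final String ZERO_TRACE = "0".repeat(32);
    private static final String ZERO_SPAN = "0".repeat(16);

    static String format(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    // Returns {traceId, spanId, flags}, or null if the header is missing or invalid.
    static String[] parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        // All-zero IDs are invalid per the W3C specification.
        if (m.group(1).equals(ZERO_TRACE) || m.group(2).equals(ZERO_SPAN)) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    // The lowest bit of the flags byte is the "sampled" flag.
    static boolean isSampled(String flagsHex) {
        return (Integer.parseInt(flagsHex, 16) & 0x01) != 0;
    }
}
```

If a downstream service starts a new root trace, comparing the header captured at the client with the one received at the server (after proxies) usually shows where it was dropped or rewritten.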
12. Messaging Context Propagation
For message producers:
current trace context -> message headers
For consumers:
message headers -> extracted context -> consumer span
Conceptual Kafka header injection:
TextMapSetter<ProducerRecord<String, byte[]>> setter = (record, key, value) ->
record.headers().add(key, value.getBytes(StandardCharsets.UTF_8));
openTelemetry.getPropagators()
.getTextMapPropagator()
.inject(Context.current(), record, setter);
Consumer extraction concept:
TextMapGetter<ConsumerRecord<String, byte[]>> getter = new TextMapGetter<>() {
@Override
public Iterable<String> keys(ConsumerRecord<String, byte[]> record) {
List<String> keys = new ArrayList<>();
record.headers().forEach(header -> keys.add(header.key()));
return keys;
}
@Override
public String get(ConsumerRecord<String, byte[]> record, String key) {
Header header = record.headers().lastHeader(key);
if (header == null) {
return null;
}
return new String(header.value(), StandardCharsets.UTF_8);
}
};
Context parent = openTelemetry.getPropagators()
.getTextMapPropagator()
.extract(Context.current(), record, getter);
For message systems, decide whether consumer span should be:
- child of producer span
- linked to producer span
- new trace with domain correlation ID
The answer depends on semantics:
| Scenario | Recommended Relationship |
|---|---|
| immediate async processing | child or linked consumer span |
| durable queue with delayed processing | often link is more accurate |
| batch consumes many messages | links to multiple source contexts |
| long-running workflow | separate trace + domain correlation |
13. Span Links in Java
Conceptual usage:
SpanContext producerContext = extractedSpanContext;
Span consumerSpan = tracer.spanBuilder("case-events process CaseSubmitted")
.addLink(producerContext)
.setSpanKind(SpanKind.CONSUMER)
.startSpan();
Use links when parent-child would lie about lifecycle ownership.
Example:
- Message produced at 10:00
- Consumer processes at 10:30
- Producer request long finished
Making consumer a direct child may imply a misleading continuous execution chain. Link preserves relation without pretending the producer is still active.
14. Semantic Conventions
Semantic conventions provide common attribute names for operations.
Use them whenever applicable:
- HTTP server/client
- database call
- messaging
- RPC
- exceptions
- resources
- logs
- metrics
Why it matters:
- dashboards can aggregate consistently
- backend understands common fields
- polyglot systems correlate better
- future tools work without custom mapping
Domain-specific attributes are still allowed, but prefix/structure them deliberately.
Example:
http.request.method = POST
http.route = /cases/{caseId}/submit
http.response.status_code = 409
error.code = CASE_STATE_CONFLICT
domain.operation = submit_case
case.state.from = UNDER_REVIEW
case.state.to = SUBMITTED
Avoid inventing custom fields for standard concepts:
method=POST
url_template=/cases/{id}
status=409
Use standard keys where standard keys exist.
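During a migration from ad-hoc attribute names, a small mapping step can rename legacy keys to the standard ones listed above. This is a stdlib-only sketch: `SemanticKeyMapper` is a hypothetical helper, and the mapping contains only the three pairs taken from the examples in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical migration helper; the mapping covers only the examples from this section.
final class SemanticKeyMapper {

    private static final Map<String, String> STANDARD_KEYS = Map.of(
        "method", "http.request.method",
        "url_template", "http.route",
        "status", "http.response.status_code");

    // Renames known legacy keys and leaves every other key (for example domain keys) untouched.
    static Map<String, Object> normalize(Map<String, Object> attributes) {
        Map<String, Object> result = new LinkedHashMap<>();
        attributes.forEach((key, value) -> result.put(STANDARD_KEYS.getOrDefault(key, key), value));
        return result;
    }
}
```

Domain attributes such as `error.code` pass through unchanged, which keeps the standard and domain namespaces clearly separated.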
15. Spring Boot Integration Model
In Spring Boot services, a common production setup:
Design choice:
- Use Micrometer for application metrics if already standardized in Spring Boot.
- Use OpenTelemetry for traces and context propagation.
- Bridge/align logs with trace_id and span_id.
- Avoid duplicating the same metric in both Micrometer and manual OTel unless intentionally harmonized.
16. Logs Correlation
Goal:
{
"timestamp": "2026-06-28T10:15:30Z",
"level": "WARN",
"message": "Case submission rejected",
"trace_id": "0af7651916cd43dd8448eb211c80319c",
"span_id": "b7ad6b7169203331",
"error.code": "CASE_STATE_CONFLICT",
"domain.operation": "submit_case"
}
With Java agent and logging instrumentation, trace/span IDs can be injected into MDC/log context depending on setup.
But verify:
- logs contain trace_id
- logs contain span_id
- async logs still carry context
- Reactor logs still carry context
- virtual thread logs still carry context
- scheduled job logs carry trace when a job span is created
Do not assume. Test by triggering one request and checking trace-to-log navigation.
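A correlation check like this can be automated against captured log output. The sketch below is stdlib-only and assumes JSON logs that use the field names `trace_id` and `span_id` as in the example above; `LogCorrelationCheck` is a hypothetical name, and the regular expressions are a shortcut rather than a real JSON parser.

```java
import java.util.regex.Pattern;

// Hypothetical test helper; assumes JSON log lines with "trace_id" and "span_id" fields.
final class LogCorrelationCheck {

    // 32 hex characters, rejecting the all-zero ID that indicates no valid span context.
    private static final Pattern TRACE_ID =
        Pattern.compile("\"trace_id\"\\s*:\\s*\"(?!0{32})[0-9a-f]{32}\"");
    // 16 hex characters, again rejecting the all-zero ID.
    private static final Pattern SPAN_ID =
        Pattern.compile("\"span_id\"\\s*:\\s*\"(?!0{16})[0-9a-f]{16}\"");

    // True only if the line carries both a valid trace ID and a valid span ID.
    static boolean isCorrelated(String jsonLogLine) {
        return jsonLogLine != null
            && TRACE_ID.matcher(jsonLogLine).find()
            && SPAN_ID.matcher(jsonLogLine).find();
    }
}
```

Running this over the log lines produced by one request on each path (blocking, async, reactive, scheduled) gives a quick signal about where context is lost.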
17. Collector Configuration Mental Model
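OpenTelemetry Collector typically receives telemetry, processes it (batching, sampling, transformation), and routes it to one or more backends.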
Example conceptual config:
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
exporters:
  otlp/traces:
    endpoint: trace-backend:4317
  prometheusremotewrite:
    endpoint: http://prometheus-remote-write/api/v1/write
  otlp/logs:
    endpoint: logs-backend:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/logs]
Production collector concerns:
- backpressure
- queue size
- batch size
- retry policy
- memory limit
- tail sampling
- data redaction
- multi-tenant routing
- environment routing
- failure mode when backend unavailable
Telemetry pipeline itself must be reliable enough not to take down the application.
18. Sampling Configuration
18.1 Parent-Based Trace ID Ratio
Common app-side sampling:
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.10"
Meaning:
- respect upstream sampling decision
- start new traces with 10% probability
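The two rules above can be written down as a small decision function. This is a simplified stdlib-only sketch of the idea, not the SDK's exact algorithm: `ParentBasedRatioSketch` is a hypothetical name, and the way the trace ID is mapped to a number here is an assumption chosen for clarity. What it does share with real ratio samplers is that the decision is derived from the trace ID, so it is deterministic for a given trace.

```java
// Simplified illustration of parent-based ratio sampling; not the SDK implementation.
final class ParentBasedRatioSketch {

    // parentSampled == null means there is no parent, so this span starts a new trace.
    static boolean shouldSample(Boolean parentSampled, String traceIdHex, double ratio) {
        if (parentSampled != null) {
            return parentSampled; // respect the upstream decision
        }
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Use the lower 64 bits of the 128-bit trace ID and drop the sign bit
        // to obtain a non-negative number that is stable for this trace.
        long lower = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        long bound = (long) (ratio * Long.MAX_VALUE);
        return lower < bound;
    }
}
```

Because the root decision depends only on the trace ID and the ratio, every service that applies the same rule to the same trace reaches the same answer, and child spans simply follow the parent.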
18.2 Error Retention
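If you sample head-side at 10%, the keep/drop decision is made before the outcome is known, so about 90% of error traces are dropped along with normal traffic. Collector tail sampling can only retain traces that the application actually exported.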
Better production model:
- app/head sampling for cost guard
- collector tail sampling to retain errors/slow traces
- always sample selected critical operations if volume allows
Example conceptual tail sampling rules:
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
Tail sampling requires collector/backend capacity and buffering. It is not free.
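The intent of the three policies can be expressed as one decision over a completed trace. The sketch below is stdlib-only and only mirrors the logic of the conceptual rules above; `TailSamplingSketch` is a hypothetical name, the exact boundary handling of the latency threshold is an assumption, and the random roll is passed in so the function stays deterministic and testable.

```java
// Illustration of the decision made after a whole trace has been buffered.
final class TailSamplingSketch {

    private static final long SLOW_THRESHOLD_MS = 1000;    // mirrors threshold_ms: 1000
    private static final double NORMAL_KEEP_RATIO = 0.05;  // mirrors sampling_percentage: 5

    // roll is expected to be a uniform random number in [0, 1).
    static boolean keep(boolean hasErrorSpan, long traceDurationMs, double roll) {
        if (hasErrorSpan) {
            return true; // "errors" policy
        }
        if (traceDurationMs > SLOW_THRESHOLD_MS) {
            return true; // "slow" policy
        }
        return roll < NORMAL_KEEP_RATIO; // "probabilistic-normal" policy
    }
}
```

The function needs the complete trace (error status and total duration), which is why the collector has to buffer spans for `decision_wait` before deciding.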
19. Instrumenting Retry and Fallback
Manual tracing adds business semantics around auto-instrumented HTTP spans.
public RiskScore lookupRisk(SubjectId subjectId) {
    Span span = tracer.spanBuilder("risk.lookup")
        .setAttribute("peer.service", "risk-service")
        .startSpan();
    try (Scope ignored = span.makeCurrent()) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            span.addEvent("retry.attempt.started", Attributes.builder()
                .put("retry.attempt", attempt)
                .build());
            try {
                RiskScore score = riskClient.fetch(subjectId); // agent creates HTTP client span
                span.setAttribute("retry.attempts.used", attempt);
                span.setStatus(StatusCode.OK);
                return score;
            } catch (TimeoutException ex) {
                span.addEvent("retry.attempt.failed", Attributes.builder()
                    .put("retry.attempt", attempt)
                    .put("error.category", "timeout")
                    .build());
                if (attempt == 3) {
                    span.recordException(ex);
                    span.setAttribute("error.code", "RISK_TIMEOUT");
                    span.setStatus(StatusCode.ERROR, "RISK_TIMEOUT");
                    return fallbackRisk(subjectId, span);
                }
            }
        }
        throw new IllegalStateException("unreachable");
    } finally {
        span.end();
    }
}

private RiskScore fallbackRisk(SubjectId subjectId, Span parentSpan) {
    parentSpan.addEvent("fallback.selected", Attributes.builder()
        .put("fallback.name", "stale-risk-score")
        .put("fallback.reason", "risk-timeout")
        .build());
    Span fallbackSpan = tracer.spanBuilder("risk.lookup.fallback.stale")
        .startSpan();
    try (Scope ignored = fallbackSpan.makeCurrent()) {
        RiskScore score = staleRiskRepository.find(subjectId);
        fallbackSpan.setAttribute("fallback.outcome", "used_stale_value");
        fallbackSpan.setAttribute("fallback.data_staleness_ms", score.ageMillis());
        return score;
    } finally {
        fallbackSpan.end();
    }
}
Notes:
- Auto instrumentation captures HTTP client calls.
- Manual parent span captures retry orchestration.
- Fallback is explicit and visible.
- Error code is stable.
20. Instrumenting Validation and Rejection
public ValidationResult validate(SubmitCaseCommand command) {
Span span = tracer.spanBuilder("case.submit.validation")
.setAttribute("validation.mode", "accumulate")
.startSpan();
try (Scope ignored = span.makeCurrent()) {
ValidationResult result = validator.validate(command);
span.setAttribute("validation.error_count", result.errors().size());
span.setAttribute("validation.outcome", result.valid() ? "accepted" : "rejected");
if (!result.valid()) {
span.addEvent("validation.rejected", Attributes.builder()
.put("error.code", result.primaryErrorCode())
.put("validation.error_count", result.errors().size())
.build());
}
return result;
} finally {
span.end();
}
}
Do not attach raw invalid field values.
Better:
validation.field = address.postal_code
validation.rule = required
Dangerous:
validation.value = full user-provided address
21. Instrumenting State Transitions
For regulatory lifecycle systems, state transition spans are highly valuable.
Span span = tracer.spanBuilder("case.transition")
.setAttribute("domain.entity", "case")
.setAttribute("transition.name", "submit")
.setAttribute("case.state.from", fromState.name())
.setAttribute("case.state.to", toState.name())
.startSpan();
try (Scope ignored = span.makeCurrent()) {
transitionService.apply(command);
span.setAttribute("transition.outcome", "applied");
span.setStatus(StatusCode.OK);
} catch (InvalidTransitionException ex) {
span.recordException(ex);
span.setAttribute("transition.outcome", "rejected");
span.setAttribute("error.code", ex.errorCode());
span.setStatus(StatusCode.ERROR, ex.errorCode());
throw ex;
} finally {
span.end();
}
Trace answers operational questions:
- which transition was attempted?
- from which state to which state?
- was rejection domain-valid?
- which rule blocked it?
- was audit recorded?
Audit store still remains the official evidence.
22. Instrumenting Batch Jobs
Avoid span per row for large batch.
Good structure:
job.import_cases
├── chunk.read
├── chunk.validate
├── chunk.persist
└── chunk.publish_summary
Example:
Span jobSpan = tracer.spanBuilder("job.import_cases").startSpan();
try (Scope jobScope = jobSpan.makeCurrent()) {
for (int chunkIndex = 0; chunkIndex < chunks.size(); chunkIndex++) {
processChunk(chunks.get(chunkIndex), chunkIndex);
}
} finally {
jobSpan.end();
}
Chunk span:
private void processChunk(List<Record> records, int chunkIndex) {
Span span = tracer.spanBuilder("job.import_cases.chunk")
.setAttribute("chunk.index", chunkIndex)
.setAttribute("chunk.size", records.size())
.startSpan();
try (Scope ignored = span.makeCurrent()) {
ChunkResult result = importer.process(records);
span.setAttribute("records.success", result.successCount());
span.setAttribute("records.failed", result.failedCount());
} catch (RuntimeException ex) {
span.recordException(ex);
span.setStatus(StatusCode.ERROR);
throw ex;
} finally {
span.end();
}
}
If row-level debugging is needed, enable temporary diagnostic sampling for selected batch/job only.
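The loop above assumes the input has already been split into `chunks`. A minimal stdlib-only sketch of that split follows; `Chunker` is a hypothetical helper, and the chunk size is whatever your job configuration chooses. The point for tracing is that the number of chunk spans becomes roughly the record count divided by the chunk size instead of one span per row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that bounds span count by grouping rows into chunks.
final class Chunker {

    // Splits records into consecutive chunks of at most chunkSize elements.
    static <T> List<List<T>> chunk(List<T> records, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < records.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, records.size());
            // Copy the view so later changes to the source list do not affect the chunk.
            chunks.add(new ArrayList<>(records.subList(from, to)));
        }
        return chunks;
    }
}
```

With 100,000 rows and a chunk size of 1,000, the job produces about 100 chunk spans under one job span, which stays readable and cheap to export.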
23. Instrumenting Shutdown
On shutdown, create a lifecycle span or structured events if the telemetry pipeline still has time to flush.
public void shutdown() {
Span span = tracer.spanBuilder("service.shutdown")
.setAttribute("shutdown.reason", "SIGTERM")
.startSpan();
try (Scope ignored = span.makeCurrent()) {
span.addEvent("intake.stopped");
intake.stop();
span.addEvent("executor.drain.started", Attributes.builder()
.put("inflight.count", executor.inflightCount())
.build());
executor.drain(Duration.ofSeconds(20));
span.addEvent("telemetry.flush.started");
telemetry.flush(Duration.ofSeconds(3));
} catch (RuntimeException ex) {
span.recordException(ex);
span.setStatus(StatusCode.ERROR);
throw ex;
} finally {
span.end();
}
}
Be careful:
- Shutdown spans may be lost if exporter/collector is already unavailable.
- Logs may be more reliable at the final moments.
- Metrics can expose shutdown counters before termination.
- Kubernetes grace period limits all of this.
24. Data Safety and Redaction
OpenTelemetry makes it easy to ship data. That is both useful and dangerous.
Never include:
- credentials
- tokens
- raw request/response body
- full SQL with sensitive literal
- customer name/email/phone/address
- case narrative
- investigation notes
- unredacted exception messages if they may contain PII
Prefer:
error.code = CASE_STATE_CONFLICT
case.type = enforcement
case.state.from = UNDER_REVIEW
policy.name = submit_case_policy
policy.version = 2026.06
For special debugging, use controlled diagnostic tooling with:
- access control
- time limit
- sampling limit
- approval
- audit trail
- redaction
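A last line of defense is to filter attributes by key before they are attached to a span or log record. The sketch below is stdlib-only and hypothetical: `AttributeRedactor` is not an OpenTelemetry class, and the list of sensitive key fragments is an assumption that would need to be reviewed against your own data classification.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical key-based redactor; the fragment list is an assumption for illustration.
final class AttributeRedactor {

    private static final Set<String> SENSITIVE_FRAGMENTS = Set.of(
        "password", "token", "secret", "email", "phone", "address", "narrative", "payload");

    // Replaces the value of every sensitive-looking key and keeps all other entries as they are.
    static Map<String, String> redact(Map<String, String> attributes) {
        Map<String, String> safe = new LinkedHashMap<>();
        attributes.forEach((key, value) ->
            safe.put(key, isSensitive(key) ? "[REDACTED]" : value));
        return safe;
    }

    // A key is treated as sensitive if it contains any listed fragment, ignoring case.
    static boolean isSensitive(String key) {
        String lower = key.toLowerCase(Locale.ROOT);
        for (String fragment : SENSITIVE_FRAGMENTS) {
            if (lower.contains(fragment)) {
                return true;
            }
        }
        return false;
    }
}
```

Key-based filtering does not catch sensitive values stored under harmless-looking keys, so it complements, rather than replaces, the rule of not emitting such data in the first place.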
25. Debugging Trace Gaps
25.1 No traces at all
Check:
- Java agent actually loaded?
- OTEL_SERVICE_NAME set?
- exporter endpoint reachable?
- collector receiver enabled?
- backend ingest healthy?
- sampling set to always_off accidentally?
- firewall/network policy blocking port?
Useful startup check:
java -javaagent:/otel/opentelemetry-javaagent.jar -jar app.jar
Look for agent startup logs.
25.2 Server spans exist, client spans missing
Check:
- HTTP client library supported?
- custom client wrapper bypasses instrumentation?
- instrumentation disabled?
- request executed in uninstrumented native code?
25.3 Client span exists, downstream server trace is new root
Check:
- traceparent header sent?
- proxy strips headers?
- downstream extracts W3C context?
- different propagation config?
- message headers dropped?
25.4 Logs have no trace ID
Check:
- logging instrumentation enabled?
- MDC fields included in log pattern/JSON layout?
- async appender preserves MDC?
- Reactor/CompletableFuture/Executor context propagated?
25.5 Error trace has no exception
Check:
- exception swallowed?
- boundary converts to response without recordException?
- span status set only at outer handler?
- domain result returned instead of thrown without error attributes?
25.6 Too many spans
Check:
- custom instrumentation per method/loop?
- batch row-level spans?
- debug instrumentation enabled in prod?
- library instrumentation too verbose?
26. Local Development Setup
A simple local loop:
For learning, start with console/log exporter or local collector. Confirm:
- root HTTP span appears
- DB/client spans appear
- manual domain span appears
- logs contain trace ID
- error spans show exception event
- context survives async call
27. Testing Telemetry
Telemetry should be tested because instrumentation breaks silently.
Test categories:
| Test | Purpose |
|---|---|
| Unit test manual span wrapper | span.end() always called, exceptions recorded |
| Integration test HTTP trace | inbound + outbound spans connected |
| Async context test | executor/future/reactor path keeps parent context |
| Error mapping test | domain exception sets error.code |
| Redaction test | sensitive data not emitted |
| Sampling test | errors retained in staging collector |
| Shutdown test | drain events/logs emitted before termination |
Example design:
Given POST /cases/{id}/submit rejects with CASE_STATE_CONFLICT
Then trace contains span case.submit
And span has error.code=CASE_STATE_CONFLICT
And logs contain same trace_id
And no raw case narrative appears in span attributes/logs
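The first row of the table, testing a manual span wrapper, can be illustrated without any tracing dependency. The sketch below is stdlib-only: `MiniSpan` and `RecordingSpan` are hypothetical stand-ins for a real span and a test double, and `inSpan` follows the same try/catch/finally shape as the wrapper shown earlier in this part. A real test would assert against spans captured from the SDK instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustration of what a wrapper test should assert; MiniSpan is a hypothetical stand-in.
final class SpanWrapperTestSketch {

    interface MiniSpan {
        void recordException(Throwable ex);
        void end();
    }

    // Test double that remembers how it was used.
    static final class RecordingSpan implements MiniSpan {
        final List<Throwable> exceptions = new ArrayList<>();
        int endCalls = 0;

        @Override
        public void recordException(Throwable ex) {
            exceptions.add(ex);
        }

        @Override
        public void end() {
            endCalls++;
        }
    }

    // Same shape as the wrapper under test: record on failure, always end exactly once.
    static <T> T inSpan(MiniSpan span, Supplier<T> action) {
        try {
            return action.get();
        } catch (RuntimeException ex) {
            span.recordException(ex);
            throw ex;
        } finally {
            span.end();
        }
    }
}
```

The two properties worth asserting are that `end()` is called exactly once on both the success and the failure path, and that the exception is recorded before it is rethrown.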
28. Production Rollout Plan
Phase 1 — Agent only
- enable Java agent in staging
- set service/resource attributes
- export traces to collector
- verify inbound/outbound/db spans
- verify cost/volume
Phase 2 — Logs correlation
- include trace/span IDs in JSON logs
- verify async boundaries
- verify trace-to-log navigation
Phase 3 — Manual domain spans
Add spans for:
- critical commands
- policy evaluation
- state transition
- fallback/retry orchestration
- audit boundary
Phase 4 — Sampling and governance
- retain errors
- retain slow traces
- sample normal traffic
- redact sensitive data
- define telemetry schema review
Phase 5 — Incident workflow
- dashboards link metrics → traces → logs
- runbooks include trace reading procedure
- on-call knows error code catalog
- postmortems include instrumentation gaps
29. OpenTelemetry Java Anti-Patterns
29.1 Agent Installed, Problem Solved
Agent gives technical spans, not domain understanding.
29.2 Manual Spans Around Every Method
This creates cost and noise. Instrument operation boundaries.
29.3 No Stable Service Name
If service.name changes per pod/build randomly, dashboards fragment.
29.4 High-Cardinality Span Names
Bad:
GET /cases/CASE-123
Good:
GET /cases/{caseId}
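Where a framework already provides the route template, use it. For custom clients or handlers that only see the raw path, a heuristic normalizer can keep span names low-cardinality. The sketch below is stdlib-only and hypothetical: `SpanNameNormalizer` and its patterns (numeric segments, `PREFIX-123` style IDs, UUIDs) are assumptions, and it emits a generic `{id}` placeholder because it cannot know the real parameter name.

```java
import java.util.regex.Pattern;

// Heuristic fallback for illustration; prefer the framework's route template when available.
final class SpanNameNormalizer {

    // Segments that look like identifiers: digits, PREFIX-123, or a UUID.
    private static final Pattern ID_SEGMENT = Pattern.compile(
        "\\d+|[A-Z]+-\\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    // Builds "<METHOD> <path>" with identifier-like segments replaced by "{id}".
    static String normalize(String method, String path) {
        StringBuilder name = new StringBuilder(method).append(' ');
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            name.append('/').append(ID_SEGMENT.matcher(segment).matches() ? "{id}" : segment);
        }
        if (name.charAt(name.length() - 1) == ' ') {
            name.append('/'); // the path was just "/"
        }
        return name.toString();
    }
}
```

The raw identifier still belongs in telemetry when it is safe to record, but as an attribute on the span rather than as part of the span name.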
29.5 Sensitive Baggage
Baggage propagates. Treat it like outbound header data.
29.6 Forgetting span.end()
Leaked spans distort duration and may never export.
29.7 Status Without Error Code
ERROR alone is not enough. Add stable error code/category.
29.8 Telemetry Backend as Audit Store
Traces are sampled, mutable by retention policy, and operational. They do not replace official audit evidence.
30. Reference Implementation Sketch
public final class CaseSubmitHandler {
private final Tracer tracer;
private final CaseApplicationService service;
public CaseSubmitHandler(OpenTelemetry openTelemetry, CaseApplicationService service) {
this.tracer = openTelemetry.getTracer("case-command-service");
this.service = service;
}
public ProblemOr<SubmitResponse> submit(SubmitCaseCommand command) {
Span span = tracer.spanBuilder("case.submit")
.setAttribute("domain.operation", "submit_case")
.setAttribute("domain.entity", "case")
.startSpan();
try (Scope ignored = span.makeCurrent()) {
SubmitOutcome outcome = service.submit(command);
if (outcome.accepted()) {
span.setAttribute("outcome", "accepted");
span.setStatus(StatusCode.OK);
return ProblemOr.success(new SubmitResponse(outcome.caseVersion()));
}
span.setAttribute("outcome", "rejected");
span.setAttribute("error.code", outcome.errorCode());
span.setAttribute("retryable", outcome.retryable());
span.setStatus(StatusCode.ERROR, outcome.errorCode());
return ProblemOr.problem(ProblemDetails.from(outcome));
} catch (DomainException ex) {
span.recordException(ex);
span.setAttribute("error.category", "domain");
span.setAttribute("error.code", ex.errorCode());
span.setAttribute("retryable", ex.retryable());
span.setStatus(StatusCode.ERROR, ex.errorCode());
throw ex;
} catch (RuntimeException ex) {
span.recordException(ex);
span.setAttribute("error.category", "unexpected");
span.setStatus(StatusCode.ERROR, ex.getClass().getSimpleName());
throw ex;
} finally {
span.end();
}
}
}
This gives:
- stable span name
- domain operation attribute
- error code
- retryability
- consistent status
- exception recording
- context propagation to downstream auto-instrumented spans
31. Capstone Exercise
Instrument a Java/Spring service operation:
POST /cases/{caseId}/submit
Requirements:
- Enable OpenTelemetry Java agent.
- Set service.name, service.version, deployment.environment.
- Create manual span case.submit.
- Add child span/event for validation result.
- Add span/event for policy evaluation.
- Record domain rejection with error.code.
- Record dependency timeout and retry attempts.
- Emit fallback selected event.
- Ensure logs include trace_id and span_id.
- Verify no sensitive payload in telemetry.
- Verify trace survives CompletableFuture or executor boundary.
- Verify trace-to-log navigation during a controlled failure.
Expected trace:
SERVER POST /cases/{caseId}/submit
└── INTERNAL case.submit
├── INTERNAL case.submit.validation
├── INTERNAL policy.evaluate submit_case
├── CLIENT db.case.select_by_id
├── INTERNAL risk.lookup
│ ├── CLIENT GET /risk/{subjectId}
│ ├── event retry.attempt.failed
│ └── INTERNAL risk.lookup.fallback.stale
├── CLIENT db.case.transition_update
└── PRODUCER case-events publish CaseSubmitted
32. Production Readiness Checklist
- Java agent enabled consistently across environments
- service.name stable and meaningful
- service.version set from build/release
- deployment.environment set
- OTLP endpoint points to collector, not random backend from app code
- inbound HTTP spans created
- outbound HTTP spans created
- DB/broker spans visible where required
- manual domain spans added for critical operations
- exceptions recorded near source
- stable error.code added
- logs include trace/span IDs
- async context propagation tested
- message context propagation tested
- sampling keeps errors/slow traces
- sensitive data redaction tested
- collector resource/memory/batch policy configured
- telemetry failure does not break business traffic
33. Key Takeaways
- OpenTelemetry Java works best as a combination of agent + manual domain instrumentation.
- Auto instrumentation sees frameworks; manual instrumentation explains business causality.
- Context propagation is what keeps distributed traces connected.
- Span names must be stable and low-cardinality.
- Exception recording must include safe stable error code, not just stack trace.
- Baggage is powerful but risky because it propagates across boundaries.
- Collector configuration is part of production architecture, not an afterthought.
- Telemetry must be tested like other production behavior.
34. References
- OpenTelemetry Java: https://opentelemetry.io/docs/languages/java/
- OpenTelemetry Java Agent: https://opentelemetry.io/docs/zero-code/java/agent/
- OpenTelemetry Context Propagation: https://opentelemetry.io/docs/concepts/context-propagation/
- OpenTelemetry Semantic Conventions: https://opentelemetry.io/docs/concepts/semantic-conventions/
- OpenTelemetry Trace Semantic Conventions: https://opentelemetry.io/docs/specs/semconv/general/trace/
- OpenTelemetry Baggage: https://opentelemetry.io/docs/concepts/signals/baggage/
- W3C Trace Context: https://www.w3.org/TR/trace-context/