
Telemetry Quality Engineering

Learn Java Error, Reliability & Observability Engineering - Part 030

Telemetry quality engineering for Java production systems: signal-to-noise ratio, cardinality budget, sampling, semantic conventions, schema governance, telemetry testing, privacy, cost control, and observability anti-patterns.


Part 030 — Telemetry Quality Engineering

Target skill: be able to assess and improve telemetry quality as an engineering system, not just add logs, metrics, and traces. After this part, you should be able to design telemetry that is useful during an incident, cheap to operate, safe with data, stable for dashboards and alerts, and precise enough for both debugging and audit.

Bad telemetry can be more dangerous than no telemetry. It creates the illusion that the system is observable, when during an incident every signal turns out to be too noisy, too expensive, inconsistent, too high-cardinality, missing context, or unable to answer the operational question.

Top 1% engineers do not ask:

“Does this service already have logs, metrics, and traces?”

They ask:

“Can this telemetry speed up diagnosis, prove impact, reduce ambiguity, and stay safe and cost-effective under production traffic?”


1. Kaufman Deconstruction

Telemetry quality breaks down into the following sub-skills:

| Sub-skill | Outcome |
|---|---|
| Signal design | Know which operational questions telemetry must answer |
| Signal-to-noise control | Reduce log spam, metric noise, and trace overload |
| Cardinality budgeting | Prevent cost spikes and backend collapse caused by unique labels/attributes |
| Semantic consistency | Metric, span, and log field names are consistent across services |
| Sampling strategy | Control volume without losing important insight |
| Privacy/security control | Keep secrets, PII, and raw payloads out of telemetry |
| Telemetry testing | Ensure the telemetry contract does not break during refactoring |
| Feedback loop | Use incidents and postmortems to improve telemetry |

2. Mental Model: Telemetry Is an Operational Interface

Telemetry is the API between the production system and its human and machine operators.

Like an API, telemetry needs:

  • contract,
  • naming convention,
  • versioning discipline,
  • backward compatibility,
  • security policy,
  • cost model,
  • ownership.

If telemetry changes carelessly, dashboards, alerts, runbooks, and incident muscle memory all break.


3. Telemetry Quality Attributes

Use the following 12 quality attributes.

| Attribute | Question |
|---|---|
| Relevance | Does the signal answer an important operational question? |
| Accuracy | Does the signal represent the real state of the system? |
| Timeliness | Does the signal arrive fast enough to act on? |
| Correlatability | Can it be linked to a trace, request, case, tenant, or dependency? |
| Bounded cardinality | Do labels/attributes stay bounded instead of exploding? |
| Consistency | Are names and semantics stable across services? |
| Actionability | Does the operator know what to do after seeing the signal? |
| Safety | Is it free of secrets, PII, and sensitive data? |
| Cost efficiency | Are volume and retention reasonable? |
| Completeness | Does the critical path have enough signals? |
| Low noise | Is it free of excessive false positives and log spam? |
| Testability | Can the important signals be tested automatically? |

Good telemetry does not have to be plentiful. It has to answer the right questions.


4. Start From Operational Questions

Do not start from “add a metric”. Start from questions:

Service health

  • Is the service receiving traffic?
  • Are requests succeeding?
  • What is the p50/p95/p99 latency?
  • Is the error rate rising?
  • Is the queue backlog growing?
  • Is the thread/connection pool saturated?

Dependency health

  • Which dependency is slow?
  • Are timeouts, retries, or circuit breaker trips increasing?
  • Is a fallback active?
  • Is the failure isolated or cascading?

Domain health

  • How many cases/reviews/transactions succeeded?
  • How many were rejected by a rule?
  • How many are stuck in a particular state?
  • Are escalations exceeding the SLA?
  • Have automated decisions shifted drastically?

Incident investigation

  • Which requests failed?
  • Which error code dominates?
  • Which tenant/user segment is affected?
  • Which release/config change correlates with the failure?
  • Which critical-path span is the slowest?

Regulatory/audit

  • Who did what?
  • Under which policy/rule?
  • What was the outcome?
  • What evidence is stored?
  • Did the failure produce an unknown outcome?

Telemetry quality means every signal has a reason to exist.


5. Logs, Metrics, Traces: Different Quality Criteria

| Signal | Best For | Quality Risk | Quality Rule |
|---|---|---|---|
| Logs | Event detail, forensic evidence, rare error | spam, missing context, secret leak | structured, correlated, sampled if needed |
| Metrics | Trend, alert, SLO, capacity | cardinality explosion, wrong aggregation | bounded labels, stable names |
| Traces | Causal path, latency breakdown, dependency chain | oversampling/undersampling, missing spans | propagate context, meaningful spans |
| Audit events | Compliance evidence | incomplete actor/reason/outcome | immutable, domain-specific, safe |

Do not force one signal to do every job.

Bad:

  • every investigation depends on free-text logs,
  • all business evidence is only a metric counter,
  • every alert is based on a log pattern,
  • all debugging depends on traces, but sampling is too low.

6. Signal-to-Noise Ratio

Signal-to-noise ratio is the ratio of useful information to operational noise.

Noise comes from:

  • logging every loop item,
  • stack traces for expected errors,
  • metrics for events that are not actionable,
  • trace spans that are too granular,
  • alerts for a symptom already covered by another alert,
  • duplicate telemetry from framework plus manual instrumentation,
  • debug logs left enabled in production.

Log level quality

| Level | Use For | Bad Use |
|---|---|---|
| ERROR | user-visible/system-impacting failure requiring attention | expected validation rejection |
| WARN | degraded mode, retry exhausted, fallback activated, suspicious but handled | every retry attempt at scale |
| INFO | lifecycle/business milestone | every internal branch |
| DEBUG | diagnostic detail disabled by default | production steady-state flow |
| TRACE | deep local debugging | distributed production telemetry |

Quality rule:

A log event should either help reconstruct an important event, explain a decision, or diagnose a failure. Otherwise it is noise.
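
One way to apply this rule on a hot path is to gate repeated events per key, so the first few occurrences are logged and the rest are only counted. The sketch below assumes a hypothetical `LogRateGate` helper (it is not part of any logging library) and takes the clock as a parameter so the behavior is easy to test.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: allows at most maxPerWindow log events per key in each time window. */
final class LogRateGate {
    private static final class Window {
        long startMillis;
        int allowed;
        long suppressed;
    }

    private final int maxPerWindow;
    private final long windowMillis;
    private final Map<String, Window> windows = new HashMap<>();

    LogRateGate(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the caller should emit the log event for this key at time nowMillis. */
    synchronized boolean shouldLog(String key, long nowMillis) {
        Window w = windows.computeIfAbsent(key, k -> {
            Window fresh = new Window();
            fresh.startMillis = nowMillis;
            return fresh;
        });
        if (nowMillis - w.startMillis >= windowMillis) {
            // New window: reset counters so the key can be logged again.
            w.startMillis = nowMillis;
            w.allowed = 0;
            w.suppressed = 0;
        }
        if (w.allowed < maxPerWindow) {
            w.allowed++;
            return true;
        }
        w.suppressed++;
        return false;
    }

    /** Number of events suppressed for this key in its current window. */
    synchronized long suppressedCount(String key) {
        Window w = windows.get(key);
        return w == null ? 0 : w.suppressed;
    }
}
```

A caller would wrap the log statement in `if (gate.shouldLog("retry.identity-service", System.currentTimeMillis()))` and could attach `suppressedCount` to the next emitted event so the volume is not lost entirely.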


7. Cardinality Budget

Cardinality is the number of unique label/attribute combinations. High cardinality is one of the main drivers of telemetry cost and backend instability.

Metric cardinality example:

http.server.requests{method="GET",status="200",route="/cases/{id}"}

This is bounded.

Bad:

http.server.requests{method="GET",status="200",path="/cases/C-827361923"}

If the path carries the actual ID, cardinality can explode.
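
The difference between the two series above is just path normalization. As a rough illustration, the sketch below collapses ID-like segments into `{id}`. `RouteTemplates` and its ID pattern are assumptions made for this example; they are not taken from a framework.

```java
import java.util.regex.Pattern;

/** Hypothetical normalizer: turns a raw request path into a bounded route template for metric tags. */
final class RouteTemplates {
    // Assumption: identifiers are numeric, UUIDs, or PREFIX-digits such as C-827361923.
    private static final Pattern ID_SEGMENT = Pattern.compile(
        "\\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}|[A-Za-z]+-\\d+");

    private RouteTemplates() {}

    static String toTemplate(String rawPath) {
        // Drop the query string; it is unbounded and may carry sensitive values.
        int q = rawPath.indexOf('?');
        String path = q >= 0 ? rawPath.substring(0, q) : rawPath;
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) continue;
            out.append('/').append(ID_SEGMENT.matcher(segment).matches() ? "{id}" : segment);
        }
        return out.length() == 0 ? "/" : out.toString();
    }
}
```

Where the web framework already exposes the matched route pattern, prefer that value over regex guessing; a normalizer like this is a fallback for paths the framework cannot template.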

Cardinality classification

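| Field | Cardinality | Metric Tag? | Trace Attribute? | Log Field? |
|---|---|---|---|---|
| http.method | low | yes | yes | yes |
| http.status_code | low | yes | yes | yes |
| error_code | low/medium | yes | yes | yes |
| exception.type | medium | yes if bounded | yes | yes |
| tenant_tier | low | yes | yes | yes |
| tenant_id | high | rarely | maybe | maybe |
| user_id | very high | no | maybe hashed | maybe hashed |
| case_id | very high | no | maybe | yes if allowed |
| trace_id | very high | no | inherent | yes |
| request_path_raw | very high | no | no | rarely |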
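
The table marks `user_id` as "maybe hashed" for traces and logs. One possible reading is a keyed hash, so the same user always maps to the same opaque token while the raw ID stays out of telemetry. The sketch below assumes a hypothetical `IdPseudonymizer` class and a secret supplied by the caller; the 16-character truncation is an arbitrary choice for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: derives a stable, opaque token from an identifier using HMAC-SHA256. */
final class IdPseudonymizer {
    private final SecretKeySpec key;

    IdPseudonymizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Returns the first 8 bytes of the HMAC as 16 lowercase hex characters. */
    String pseudonym(String id) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] digest = mac.doFinal(id.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

Hashing hides the value but does not reduce cardinality: a hashed `user_id` is still one value per user, so it remains unsuitable as a metric tag.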

Cardinality budget policy

telemetry:
  metric_tags:
    allowed:
      - service.name
      - operation
      - error.code
      - dependency.name
      - http.route
      - http.method
      - http.status_code
      - outcome
      - retryable
    forbidden:
      - user.id
      - case.id
      - trace.id
      - correlation.id
      - raw.path
      - email
      - access.token

A mature platform treats tag approval like API design.
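
A policy like the one above only helps if something enforces it. A minimal sketch, assuming a hypothetical `TagBudgetGuard` that sits in front of metric recording: unknown tag keys are rejected, and once a key has used up its distinct-value budget, new values collapse into a single `other` bucket.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical guard enforcing a metric tag allowlist and a per-tag distinct-value budget. */
final class TagBudgetGuard {
    static final String OVERFLOW = "other";

    private final Set<String> allowedKeys;
    private final int maxValuesPerKey;
    private final Map<String, Set<String>> seen = new HashMap<>();

    TagBudgetGuard(Set<String> allowedKeys, int maxValuesPerKey) {
        this.allowedKeys = Set.copyOf(allowedKeys);
        this.maxValuesPerKey = maxValuesPerKey;
    }

    /** Returns the tag value to record; throws if the key is not approved. */
    synchronized String admit(String key, String value) {
        if (!allowedKeys.contains(key)) {
            throw new IllegalArgumentException("metric tag not in allowlist: " + key);
        }
        Set<String> values = seen.computeIfAbsent(key, k -> new HashSet<>());
        if (values.contains(value)) {
            return value;
        }
        if (values.size() >= maxValuesPerKey) {
            // Budget exhausted: collapse new values into one bucket instead of creating new series.
            return OVERFLOW;
        }
        values.add(value);
        return value;
    }
}
```

Throwing on an unapproved key is a deliberately strict choice that suits tests and CI; in production code a team might prefer to drop the tag and count the violation instead.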


8. Semantic Consistency

Semantic inconsistency destroys cross-service analysis.

Bad:

service A: case_error_total{code="RULE_DENIED"}
service B: errors_count{errorCode="rule_denied"}
service C: failed_cases{reason="POLICY_DENIAL"}

Better:

case_decision_total{outcome="rejected", error_code="POLICY_DENIED"}

Create naming rules:

| Concept | Rule | Example |
|---|---|---|
| Metric names | lowercase, dot or underscore style consistently | case.decision.total or case_decision_total |
| Error code | uppercase stable domain code | CASE_STATE_CONFLICT |
| Outcome | small enum | success, error, rejected, degraded |
| Retryability | boolean or enum | retryable=true |
| Dependency | stable logical name | identity-service |
| Operation | business operation | case.submit |
| Route | templated route | /cases/{caseId} |

Follow existing semantic conventions where available, and define local conventions for domain-specific telemetry.
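
Naming rules are easier to keep when a build-time check rejects violations. The sketch below is one possible encoding of the local rules, assuming the team picked the dot style for metric names; `TelemetryNaming` and its regular expressions are illustrative, not a standard.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical validator for local telemetry naming rules (dot-style metric names assumed). */
final class TelemetryNaming {
    // Lowercase segments separated by dots, e.g. case.decision.total.
    private static final Pattern METRIC_NAME =
        Pattern.compile("[a-z][a-z0-9_]*(\\.[a-z][a-z0-9_]*)+");
    // Upper snake case, e.g. CASE_STATE_CONFLICT.
    private static final Pattern ERROR_CODE =
        Pattern.compile("[A-Z][A-Z0-9]*(_[A-Z0-9]+)*");
    private static final Set<String> OUTCOMES =
        Set.of("success", "error", "rejected", "degraded");

    private TelemetryNaming() {}

    static boolean isValidMetricName(String name) {
        return METRIC_NAME.matcher(name).matches();
    }

    static boolean isValidErrorCode(String code) {
        return ERROR_CODE.matcher(code).matches();
    }

    static boolean isValidOutcome(String outcome) {
        return OUTCOMES.contains(outcome);
    }
}
```

Under these assumed rules, `errors_count` and `rule_denied` from the "Bad" example would both fail validation, while `case.decision.total` and `POLICY_DENIED` would pass.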


9. Telemetry Schema Governance

Telemetry schema needs governance because dashboards and alerts depend on it.

Create a registry:

metrics:
  - name: case.decision.total
    type: counter
    owner: case-platform
    description: Number of case decisions by outcome and error code.
    tags:
      - name: outcome
        values: [approved, rejected, error]
      - name: error_code
        bounded: true
      - name: policy_version
        bounded: true
    stability: stable

logs:
  - event: case.decision.completed
    owner: case-platform
    required_fields:
      - correlation_id
      - trace_id
      - tenant_id
      - case_id
      - outcome
      - decision_id
      - policy_version
    forbidden_fields:
      - access_token
      - raw_payload

spans:
  - name: case.policy.evaluate
    owner: policy-engine
    required_attributes:
      - case.type
      - policy.version
      - decision.outcome

This prevents “observability drift”.
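
A registry entry becomes enforceable once it is checked against what the code actually emits. The sketch below models one `logs` entry as a hypothetical `LogEventContract` and reports violations for a captured event's fields; how the registry is loaded from YAML is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-code form of one registry entry for a log event. */
final class LogEventContract {
    private final String event;
    private final Set<String> requiredFields;
    private final Set<String> forbiddenFields;

    LogEventContract(String event, Set<String> requiredFields, Set<String> forbiddenFields) {
        this.event = event;
        // Sorted copies keep the violation order deterministic.
        this.requiredFields = new TreeSet<>(requiredFields);
        this.forbiddenFields = new TreeSet<>(forbiddenFields);
    }

    /** Returns human-readable violations; an empty list means the event honors the contract. */
    List<String> violations(Map<String, ?> fields) {
        List<String> problems = new ArrayList<>();
        for (String name : requiredFields) {
            if (fields.get(name) == null) {
                problems.add(event + ": missing required field " + name);
            }
        }
        for (String name : forbiddenFields) {
            if (fields.containsKey(name)) {
                problems.add(event + ": forbidden field present " + name);
            }
        }
        return problems;
    }
}
```

Running a check like this in contract tests, or on a sample of production events, is one way to notice drift before a dashboard or alert silently breaks.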


10. Sampling Strategy

Sampling reduces telemetry volume. Bad sampling hides failures.

Types:

| Sampling Type | Where | Strength | Risk |
|---|---|---|---|
| Head sampling | before/when trace starts | cheap | cannot know future error/latency |
| Tail sampling | after observing trace | can retain errors/slow traces | requires collector/backend support |
| Log sampling | logger/appender/platform | reduces spam | may lose rare detail |
| Metric aggregation | SDK/backend | efficient trend | loses individual events |

Trace sampling policy examples:

sampling:
  traces:
    always_keep:
      - status: ERROR
      - latency_ms_gt: 2000
      - error_code_prefix: PAYMENT_
      - operation: case.escalate
    probabilistic:
      default: 0.05
      high_volume_healthcheck: 0.001

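Important: with head sampling alone, the keep/drop decision is made before the trace's outcome is known, so rules such as "keep every ERROR" or "keep everything slower than 2000 ms" cannot be guaranteed. Those rules need tail sampling, or a collector/backend that evaluates the policy after the trace has completed.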
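
To make the `always_keep` plus `probabilistic` structure concrete, the sketch below expresses it as a decision evaluated once a trace has finished. `TailSamplingPolicy` is a hypothetical class for illustration; real deployments would normally configure this in the collector or backend rather than in application code.

```java
import java.util.Set;

/** Hypothetical tail-sampling decision, evaluated after the trace has finished. */
final class TailSamplingPolicy {
    private final long slowThresholdMillis;
    private final String errorCodePrefix;
    private final Set<String> alwaysKeepOperations;
    private final double defaultRatio;

    TailSamplingPolicy(long slowThresholdMillis, String errorCodePrefix,
                       Set<String> alwaysKeepOperations, double defaultRatio) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.errorCodePrefix = errorCodePrefix;
        this.alwaysKeepOperations = Set.copyOf(alwaysKeepOperations);
        this.defaultRatio = defaultRatio;
    }

    /**
     * @param randomUnit a value in [0, 1), for example derived from the trace id,
     *                   so the decision is reproducible for a given trace
     */
    boolean keep(boolean error, long latencyMillis, String errorCode,
                 String operation, double randomUnit) {
        if (error) return true;
        if (latencyMillis > slowThresholdMillis) return true;
        if (errorCode != null && errorCode.startsWith(errorCodePrefix)) return true;
        if (operation != null && alwaysKeepOperations.contains(operation)) return true;
        // Everything else is kept with the default probability.
        return randomUnit < defaultRatio;
    }
}
```

Passing the random value in, rather than calling a random generator inside `keep`, keeps the policy deterministic and therefore testable.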


11. Error Telemetry Quality

Every meaningful error should produce coordinated signals.

For example:

catch (CaseStateConflictException ex) {
    // Log: full detail for reconstruction, including the entity id if policy allows it.
    log.warn("case command rejected",
        kv("event", "case.command.rejected"),
        kv("error_code", ex.errorCode()),
        kv("case_id", safeCaseId(ex.caseId())),
        kv("expected_state", ex.expectedState()),
        kv("actual_state", ex.actualState()));

    // Metric: bounded tags only; no case_id here.
    Metrics.counter("case.command.total",
        "operation", "submit",
        "outcome", "rejected",
        "error_code", ex.errorCode()).increment();

    // Trace: mark the current span and attach the exception as a span event.
    Span.current().setStatus(StatusCode.ERROR);
    Span.current().setAttribute("error.code", ex.errorCode());
    Span.current().recordException(ex);

    throw ex;
}

Quality issue: this example may be too verbose or too high-cardinality depending on fields. The design question is which fields belong to which signal.

Recommended mapping:

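| Error Data | Log | Metric | Trace | Response | Audit |
|---|---|---|---|---|---|
| error code | yes | yes | yes | yes | yes |
| exception type | yes | bounded | yes | no/internal | maybe |
| stack trace | unexpected only | no | exception event | no | no |
| domain entity id | yes if allowed | no | maybe | maybe | yes |
| user message | maybe | no | no | yes | maybe |
| retryable | yes | yes | yes | maybe | no |
| dependency name | yes | yes | yes | maybe | no |
| raw payload | no | no | no | no | no |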

12. Telemetry for Reliability Controls

Reliability controls without telemetry become invisible behavior changes.

Retry

Required signals:

  • attempt count,
  • final outcome,
  • retry reason,
  • backoff duration,
  • retry exhausted,
  • idempotency key hash if needed,
  • dependency name.

Avoid logging every retry attempt at WARN on high-volume paths. Prefer metric + trace event; log only final exhaustion or unusual behavior.
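
A rough sketch of that split is shown below: every attempt is counted, but only final exhaustion produces a warning. `ObservedRetry` is a hypothetical wrapper; the counters and the `warnings` list stand in for a real metrics registry and logger, and backoff is omitted to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical retry wrapper: counts every attempt, warns only on final exhaustion. */
final class ObservedRetry {
    final AtomicLong attempts = new AtomicLong();    // stand-in for an attempt counter metric
    final AtomicLong exhausted = new AtomicLong();   // stand-in for a "retry exhausted" metric
    final List<String> warnings = new ArrayList<>(); // stand-in for WARN log output (not thread-safe)

    <T> T call(String dependency, int maxAttempts, Supplier<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            attempts.incrementAndGet();
            try {
                return action.get();
            } catch (RuntimeException ex) {
                last = ex; // intermediate failures are counted, not logged
            }
        }
        exhausted.incrementAndGet();
        warnings.add("retry exhausted dependency=" + dependency
            + " attempts=" + maxAttempts
            + " reason=" + last.getClass().getSimpleName());
        throw last;
    }
}
```

The warning carries the dependency name, attempt count, and exception type, but not the exception message, which may contain sensitive detail from an external system.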

Circuit breaker

Required signals:

  • state transition: closed/open/half-open,
  • rejected call count,
  • failure rate,
  • slow call rate,
  • dependency name,
  • fallback activation.

A circuit state transition should be both a logged event and a metric dimension, because it changes how the system behaves.

Fallback/degradation

Required signals:

  • fallback type,
  • degraded outcome,
  • user impact,
  • freshness/staleness,
  • policy reason,
  • duration.

Fallback must not look like success in telemetry. Use outcome="degraded", not only success.
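
One way to make this hard to get wrong is to derive the outcome tag in a single place. The sketch below assumes a hypothetical `Outcome` enum with the four values used elsewhere in this part; the boolean inputs are illustrative.

```java
import java.util.Locale;

/** Hypothetical outcome classifier: a fallback response is reported as degraded, never as plain success. */
enum Outcome {
    SUCCESS, DEGRADED, REJECTED, ERROR;

    static Outcome classify(boolean failed, boolean rejectedByRule, boolean servedFromFallback) {
        if (failed) return ERROR;
        if (rejectedByRule) return REJECTED;
        return servedFromFallback ? DEGRADED : SUCCESS;
    }

    /** Lowercase form used as the bounded outcome tag value. */
    String tagValue() {
        return name().toLowerCase(Locale.ROOT);
    }
}
```

Because the enum is closed, the `outcome` tag stays bounded at four values regardless of how many fallback paths the service grows.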

Shutdown

Required signals:

  • shutdown received,
  • intake stopped,
  • in-flight count,
  • drain duration,
  • forced cancellation count,
  • telemetry flush result,
  • unknown outcome count.

13. Telemetry and Privacy

Telemetry often accidentally becomes the largest uncontrolled data exhaust in the company.

Forbidden by default:

  • passwords,
  • tokens,
  • authorization headers,
  • session ids,
  • raw request/response body,
  • full names,
  • email addresses,
  • national identifiers,
  • payment identifiers,
  • unredacted documents,
  • arbitrary exception messages from external systems.

Pattern:

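import java.util.Locale;
import java.util.Set;

public final class TelemetrySanitizer {
    // Keys whose values must never reach telemetry; matched case-insensitively.
    private static final Set<String> FORBIDDEN_KEYS = Set.of(
        "password", "token", "authorization", "cookie", "ssn", "email"
    );

    private TelemetrySanitizer() {}

    public static String safeAttribute(String key, String value) {
        if (value == null) return "null";
        if (key != null && FORBIDDEN_KEYS.contains(key.toLowerCase(Locale.ROOT))) {
            return "[REDACTED]";
        }
        // Cap the length so a stray payload cannot bloat an attribute.
        if (value.length() > 256) {
            return value.substring(0, 256) + "...[TRUNCATED]";
        }
        return value;
    }
}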

Better: avoid collecting sensitive data at source rather than redacting later.


14. Telemetry Testing

Telemetry that is not tested will rot.

Unit-level metric test

@Test
void recordsRejectedCaseMetric() {
    SimpleMeterRegistry registry = new SimpleMeterRegistry();
    CaseMetrics metrics = new CaseMetrics(registry);

    metrics.recordRejected("CASE_STATE_CONFLICT");

    Counter counter = registry.find("case.command.total")
        .tag("outcome", "rejected")
        .tag("error_code", "CASE_STATE_CONFLICT")
        .counter();

    assertNotNull(counter);
    assertEquals(1.0, counter.count());
}

Log field contract test

@Test
void rejectionLogContainsRequiredFields() {
    LogEvent event = captureLog(() -> service.reject(command));

    assertThat(event.field("event")).isEqualTo("case.command.rejected");
    assertThat(event.field("correlation_id")).isNotBlank();
    assertThat(event.field("error_code")).isEqualTo("CASE_STATE_CONFLICT");
    assertThat(event.fields()).doesNotContainKeys("authorization", "access_token");
}

Trace contract test

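@Test
void spanContainsDomainOutcome() {
    // Wire the in-memory exporter into the SDK so finished spans actually reach it.
    InMemorySpanExporter exporter = InMemorySpanExporter.create();
    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .addSpanProcessor(SimpleSpanProcessor.create(exporter))
        .build();
    // Assumption: PolicyService is a hypothetical service under test that accepts a Tracer.
    PolicyService service = new PolicyService(tracerProvider.get("test"));

    service.evaluate(command);

    List<SpanData> spans = exporter.getFinishedSpanItems();
    assertThat(spans).anySatisfy(span -> {
        assertThat(span.getName()).isEqualTo("case.policy.evaluate");
        assertThat(span.getAttributes().get(AttributeKey.stringKey("decision.outcome")))
            .isEqualTo("rejected");
    });
}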

Testing does not need to assert every telemetry detail. Test critical contract fields.


15. Telemetry Review Checklist

Before adding a new metric/log/span, answer:

  • What operational question does it answer?
  • Who owns it?
  • What action follows when it changes?
  • Is the name aligned with convention?
  • Are labels/attributes bounded?
  • Does it include required context?
  • Does it exclude secrets/PII?
  • Is it emitted once per meaningful event, not inside noisy loops?
  • Does it duplicate existing signal?
  • Is retention cost acceptable?
  • Is it documented in telemetry registry?
  • Is critical behavior covered by test?

16. Dashboard Quality

A dashboard is not a collection of charts. It is a decision surface.

Good dashboard layers:

Layer 1 — Executive health

  • request rate,
  • error rate,
  • latency p95/p99,
  • saturation,
  • SLO/error budget.

Layer 2 — Dependency impact

  • dependency latency,
  • dependency error rate,
  • retry count,
  • circuit breaker state,
  • fallback activation.

Layer 3 — Domain impact

  • cases submitted/approved/rejected,
  • stuck state count,
  • SLA breach risk,
  • backlog age,
  • automated/manual decision ratio.

Layer 4 — Debug drill-down

  • top error codes,
  • top slow operations,
  • trace exemplars,
  • recent deploy/config changes,
  • log query links.

Bad dashboard:

  • 60 charts with no hierarchy,
  • average latency only,
  • no error budget,
  • no domain impact,
  • no dependency breakdown,
  • high-cardinality filters that break under load.

17. Alert Quality

An alert should mean action.

Bad alerts:

  • “CPU > 80% for 5 minutes” without service impact,
  • “any ERROR log exists”,
  • “one request failed”,
  • duplicate symptom alerts from every layer,
  • alerts that always self-heal before response.

Better alerts:

  • SLO burn rate,
  • high user-visible error rate,
  • elevated p99 latency on critical endpoint,
  • queue age exceeding SLA,
  • circuit breaker open for critical dependency,
  • fallback active beyond allowed duration,
  • domain backlog threatening regulatory deadline.
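
For the first item, burn rate is the observed error ratio divided by the error budget, where the error budget is 1 minus the SLO target. The sketch below is a hypothetical `BurnRate` helper; the two-window check is one common way to avoid paging on short blips, and any threshold passed to it is the team's own choice.

```java
/** Hypothetical SLO burn-rate helper. A burn rate of 1.0 spends the error budget exactly over the SLO period. */
final class BurnRate {
    private BurnRate() {}

    static double of(long failedRequests, long totalRequests, double sloTarget) {
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            throw new IllegalArgumentException("sloTarget must be between 0 and 1 (exclusive)");
        }
        if (totalRequests <= 0) {
            return 0.0; // no traffic, nothing burned
        }
        double errorRatio = (double) failedRequests / totalRequests;
        double errorBudget = 1.0 - sloTarget;
        return errorRatio / errorBudget;
    }

    /** Fires only when both the long and the short window burn fast, which filters brief self-healing blips. */
    static boolean shouldAlert(double longWindowRate, double shortWindowRate, double threshold) {
        return longWindowRate >= threshold && shortWindowRate >= threshold;
    }
}
```

With a 99.9% target, 50 failures out of 10,000 requests is a 0.5% error ratio against a 0.1% budget, a burn rate of about 5: the budget would be gone in roughly a fifth of the SLO period if that rate held.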

Alert should include:

  • symptom,
  • scope,
  • impact,
  • likely dashboards,
  • runbook,
  • owner,
  • escalation path.

18. Anti-Patterns

18.1 Log Everything

More logs do not mean better observability. High volume often hides signal and increases cost.

18.2 Metric Everything With IDs

Metrics with user_id, case_id, trace_id, or raw path become cardinality bombs.

18.3 Trace Every Tiny Method

Tracing internal helper methods creates noise. Trace meaningful operations and boundary calls.

18.4 Expected Errors as ERROR

Validation failure and business rejection are not always ERROR. Incorrect level creates alert fatigue.

18.5 Success Masking Degradation

Fallback response counted as success hides user impact. Use outcome=degraded.

18.6 No Telemetry Ownership

If nobody owns a metric/log/span, nobody will maintain it when semantics drift.

18.7 Secret in Exception Message

External libraries may include sensitive detail in exception messages. Do not blindly expose or index them.


19. Telemetry Quality Maturity Model

| Level | Description |
|---|---|
| 0 | Ad hoc print/log, no consistent correlation |
| 1 | Basic logs/metrics/traces exist but inconsistent |
| 2 | Standard fields, correlation IDs, core RED/USE metrics |
| 3 | SLO dashboards, bounded cardinality, context propagation tests |
| 4 | Telemetry registry, schema governance, cost/privacy controls |
| 5 | Incident feedback loop continuously improves telemetry quality |

Top-tier engineering teams live around Level 4–5 for critical systems, not necessarily every toy service.


20. Production Implementation Pattern

Create a telemetry facade for domain/platform events.

public final class CaseTelemetry {
    private final MeterRegistry registry;
    private final Tracer tracer;
    private final Logger log = LoggerFactory.getLogger(CaseTelemetry.class);

    public CaseTelemetry(MeterRegistry registry, Tracer tracer) {
        this.registry = registry;
        this.tracer = tracer;
    }

    public void decisionCompleted(CaseDecision decision, ExecutionContext ctx) {
        String outcome = decision.outcome().name().toLowerCase(Locale.ROOT);
        String errorCode = decision.errorCode().orElse("none");

        log.info("case decision completed",
            kv("event", "case.decision.completed"),
            kv("correlation_id", ctx.correlationId()),
            kv("tenant_id", safeTenant(ctx.tenantId())),
            kv("case_id", safeCase(decision.caseId())),
            kv("outcome", outcome),
            kv("error_code", errorCode));

        registry.counter("case.decision.total",
            "outcome", outcome,
            "error_code", errorCode
        ).increment();

        Span span = Span.current();
        span.setAttribute("case.decision.outcome", outcome);
        span.setAttribute("error.code", errorCode);
    }
}

Why facade?

  • keeps telemetry consistent,
  • centralizes redaction,
  • avoids tag drift,
  • simplifies tests,
  • gives ownership.

21. Cost Model

Telemetry cost roughly follows:

cost = volume × cardinality × retention × query/index cost × replication/export factor

Cost control levers:

| Lever | Example |
|---|---|
| Reduce volume | sample debug logs, avoid loop logs |
| Reduce cardinality | remove user/case/request IDs from metric tags |
| Reduce retention | keep verbose logs shorter |
| Reduce indexing | index only key fields |
| Aggregate earlier | metrics instead of event logs for high-volume counters |
| Tail sample traces | keep errors/slow traces, drop normal high-volume traces |
| Route by importance | critical audit logs retained longer |

Cost control is not anti-observability. It protects observability from becoming unsustainable.


22. Incident Feedback Loop

After every incident, ask:

  1. Which signal detected the incident?
  2. Which signal should have detected it but did not?
  3. Which dashboard answered impact fastest?
  4. Which log/trace/metric was misleading?
  5. Which context was missing?
  6. Which alert was noisy or duplicate?
  7. Which telemetry field would have reduced diagnosis time?
  8. Which signal should be removed?
  9. Which runbook should link to which dashboard/query?

Postmortem action items should include telemetry fixes, not only code fixes.


23. Practical Telemetry Quality Scorecard

Score each critical service 0–2:

| Category | 0 | 1 | 2 |
|---|---|---|---|
| Correlation | missing | partial | complete logs/traces/requests |
| Metrics | ad hoc | RED/USE only | RED/USE + domain + dependency |
| Logs | free-text | structured partial | structured, safe, event-based |
| Traces | missing | auto only | auto + critical manual spans |
| Cardinality | unknown | some guardrails | budget + enforcement |
| Privacy | ad hoc | redaction partial | source controls + tests |
| Alerts | noisy | basic symptom | SLO/actionable |
| Dashboards | scattered | service view | layered decision surface |
| Tests | none | limited | contract tests |
| Ownership | unclear | team-owned | registry + review process |

Max score: 20. Critical production systems should target 16+.


24. Deliberate Practice

Exercise 1 — Telemetry Audit

Pick one Java service. Build an inventory:

| Signal | Name/Event | Owner | Operational Question | Cardinality Risk | Privacy Risk | Action |
|---|---|---|---|---|---|---|

Delete or redesign at least 5 weak signals.

Exercise 2 — Cardinality Attack

Add a metric tag with case_id in local environment and simulate 10,000 case IDs. Observe series count. Then replace it with bounded case_type or outcome.
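
If no metrics backend is at hand, the effect can be approximated in plain Java. The sketch below uses a hypothetical `SeriesCounter` that treats every unique combination of metric name and tag values as one time series, which is how most metrics backends count them.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a metrics backend: one time series per unique (name, tag values) combination. */
final class SeriesCounter {
    private final Set<String> series = new HashSet<>();

    /** Records one observation; a new series is created only for an unseen combination. */
    void record(String metric, String... tagValues) {
        series.add(metric + "|" + String.join("|", tagValues));
    }

    int seriesCount() {
        return series.size();
    }
}
```

Recording 10,000 observations tagged with distinct case IDs yields 10,000 series, while the same observations tagged with three case types and two outcomes yield at most six.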

Exercise 3 — Error Code Dashboard

Build dashboard panels for:

  • error rate by error_code,
  • top dependency failures,
  • rejected vs failed domain outcomes,
  • fallback activation count,
  • retry exhausted count.

Ensure no panel depends on raw exception message.

Exercise 4 — Telemetry Contract Test

Write tests that prove:

  • critical error emits stable error code,
  • metric has bounded tags,
  • log has correlation id,
  • log excludes forbidden fields,
  • span records exception for unexpected failure.

Exercise 5 — Incident Replay

Take one past incident. Reconstruct timeline using only existing telemetry. Mark every place where you had to guess. Turn each guess into telemetry improvement.


25. Summary

Telemetry quality engineering is the discipline of making observability reliable, useful, safe, and sustainable. It requires more than adding logs, metrics, and traces. It requires operational questions, schema governance, cardinality discipline, sampling strategy, privacy control, testing, dashboards, alerts, and incident feedback loops.

A high-quality telemetry system has these properties:

  • it detects user-visible impact,
  • it supports fast diagnosis,
  • it correlates logs/metrics/traces/audit events,
  • it avoids high-cardinality explosions,
  • it protects sensitive data,
  • it keeps cost under control,
  • it evolves through incident learning.

The core principle:

Telemetry is production evidence. Design it with the same care as public APIs and domain state transitions.

