
title: Learn AWS Engineering Mastery - Part 023
description: Observability engineering on AWS using CloudWatch, X-Ray, OpenTelemetry, correlation IDs, metrics, logs, traces, alarms, dashboards, and SLO-driven operations.
series: learn-aws
seriesTitle: Learn AWS Engineering Mastery
order: 23
partTitle: Observability: CloudWatch, X-Ray, OpenTelemetry, and SLO
tags:
  - aws
  - cloudwatch
  - xray
  - opentelemetry
  - observability
  - sre
  - slo
  - operations
  - platform-engineering
  - series
date: 2026-07-01

Observability: CloudWatch, X-Ray, OpenTelemetry, and SLO

Learning target: after this part, we can design production-grade observability on AWS: not just building dashboards, but building a telemetry contract that helps engineers detect, explain, and fix system failures quickly.

Observability is often reduced to “we have logs” or “we have a CloudWatch dashboard”. That is not enough for a production-grade system.

Good observability answers the following operational questions:

  1. Are users currently affected?
  2. How large is the impact?
  3. Which service is causing the degradation?
  4. Is the problem latency, errors, saturation, throttling, a dependency, a deployment, data, or a quota?
  5. When did it start?
  6. What changed before the symptoms appeared?
  7. Will a rollback help?
  8. Is the mitigation safe to apply?
  9. What evidence can be kept for the post-incident review?

AWS provides CloudWatch, CloudWatch Logs, CloudWatch Metrics, CloudWatch Alarms, CloudWatch Dashboards, CloudWatch Logs Insights, CloudWatch Embedded Metric Format, CloudWatch Agent, X-Ray, AWS Distro for OpenTelemetry, CloudTrail, EventBridge, AWS Health, Config, service-specific metrics, and integrations with many managed services.

But tools are not the core. The core of observability is an information structure that can be used when the system fails.


1. Kaufman Skill Map

A Kaufman-style deconstruction of observability:

| Sub-skill | What to master | Self-correction measure |
|---|---|---|
| Telemetry design | Choosing the right metric/log/trace/event | Alarms respond to user symptoms, not infrastructure noise |
| CloudWatch metrics | Namespace, dimension, statistic, period, alarm | Alarms can distinguish error rate, latency, and saturation |
| Logging | Structured logs, retention, privacy, query | Engineers can answer “what happened?” without SSH |
| Tracing | Trace ID, span, downstream call, sampling | Engineers can see the request path across services |
| SLO | SLI, objective, error budget, burn rate | Alerts are based on impact, not CPU alone |
| Dashboard | Layered dashboard: executive, service, dependency, infrastructure | Dashboards speed up diagnosis rather than decorate the console |
| Governance | Retention, encryption, access, cost | Telemetry does not leak sensitive data and does not blow up cost |

2. Mental Model: Observability Is a Runtime Contract

Observability is not an add-on feature. It is a runtime contract between a service and its operators.

A production service must emit the following four primary signals:

| Signal | Primary use | Do not use for |
|---|---|---|
| Metrics | Alarms, trends, capacity, SLO | Per-request forensic detail |
| Logs | Debugging, forensics, audit, event detail | Ad hoc high-cardinality time series |
| Traces | Request path, dependency latency, causal chain | Long-term audit or every business event |
| Events | State changes, deployments, scaling, incident timeline | Detailed application debugging without context |

Simple rule:

  • Metrics answer: “is there a problem?”
  • Logs answer: “what happened?”
  • Traces answer: “where did the time go or where did the error appear?”
  • Events answer: “what changed?”

Top-tier engineers do not turn every signal into a log. They choose signals based on the operational question.


3. AWS Observability Stack

CloudWatch is the general-purpose observability hub on AWS. It can monitor AWS resources and applications in real time, and provides metrics, alarms, dashboards, logs, an agent, cross-account monitoring, OpenTelemetry support, and other observability features.

AWS observability does not have to mean that all data stops at CloudWatch. Many enterprises send logs/traces to a SIEM, OpenSearch, Datadog, New Relic, Splunk, Grafana, or a data lake. Still, an AWS-native baseline matters because:

  1. Many AWS services publish native metrics to CloudWatch.
  2. CloudWatch alarms are easy to use as automation triggers.
  3. IAM/KMS/CloudTrail integration is relatively mature.
  4. Cross-account observability can serve as a multi-account baseline.
  5. Many AWS incident workflows start from AWS-native alarms and events.

4. Metrics: Alarmable Facts, Not Debug Text

A metric is a time-series number with a timestamp, namespace, metric name, dimensions, statistic, and period.

Examples of good metrics:

| Metric | Dimension | Why it is good |
|---|---|---|
| RequestCount | Service, Environment, Route | Measures traffic |
| ErrorRate | Service, Environment | Alarmable against user impact |
| LatencyP95 | Service, Route | Measures tail latency |
| DependencyTimeoutCount | Service, Dependency | Isolates dependency failure |
| QueueAgeSeconds | QueueName, Consumer | Measures backlog freshness |
| ThrottleCount | Service, Resource | Signals quota/capacity pressure |

Bad metrics:

| Metric | Problem |
|---|---|
| RequestId as a dimension | Extreme cardinality, rising cost, useless alarms |
| UserId as a default dimension | Cardinality and privacy risk |
| ExceptionMessage as a dimension | Metric fragmentation and potential PII |
| CPU as the only alarm | Does not always represent user impact |

4.1 Namespace and Dimension Discipline

Recommended custom namespace:

Company/Product/Service

Example:

Acme/EnforcementCase/WorkflowService

Minimal dimension example:

Environment=prod
Service=case-workflow
Region=ap-southeast-1

Add a dimension only if it will be used for diagnosis or alarming.
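
The dimension discipline above could be enforced in code before a metric is published. The following is a minimal sketch, not an AWS SDK feature: the allow-list `ALLOWED_DIMENSIONS`, the deny-list `FORBIDDEN_DIMENSIONS`, and the helper `validate_dimensions` are hypothetical names introduced for illustration.

```python
# Hypothetical allow-list: only low-cardinality keys reviewed by the team.
ALLOWED_DIMENSIONS = {"Environment", "Service", "Region", "Route", "Dependency"}
# Keys that tend to explode cardinality or leak personal data.
FORBIDDEN_DIMENSIONS = {"RequestId", "UserId", "ExceptionMessage"}


def validate_dimensions(dimensions: dict) -> list:
    """Return a list of problems; an empty list means the dimensions are acceptable."""
    problems = []
    for key in dimensions:
        if key in FORBIDDEN_DIMENSIONS:
            problems.append(f"{key}: high-cardinality or sensitive, keep it in logs instead")
        elif key not in ALLOWED_DIMENSIONS:
            problems.append(f"{key}: not in the reviewed allow-list")
    return problems


ok = validate_dimensions(
    {"Environment": "prod", "Service": "case-workflow", "Region": "ap-southeast-1"}
)
bad = validate_dimensions({"Service": "case-workflow", "RequestId": "r-123"})
```

A check like this fits naturally in a unit test or a code-review gate, so that a new dimension is a deliberate decision rather than an accident.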

4.2 Golden Signals

For online services, use the golden signals:

| Signal | AWS implementation |
|---|---|
| Latency | ALB TargetResponseTime, API Gateway Latency, custom p95/p99 |
| Traffic | RequestCount, Count, Invocations, MessagesReceived |
| Errors | 5XX, function errors, failed transitions, DLQ count |
| Saturation | CPU, memory, concurrency, connection pool, queue age, throttles |

For data pipelines:

| Signal | Metric |
|---|---|
| Freshness | Age of latest processed event |
| Completeness | Expected vs processed records |
| Error | Failed records, DLQ entries |
| Throughput | Records/sec, bytes/sec |
| Lag | IteratorAge, consumer lag, queue age |

For workflow/case-management platforms:

| Signal | Metric |
|---|---|
| State transition success rate | Successful transitions / attempted transitions |
| Escalation latency | Time from trigger to escalation created |
| SLA breach risk | Cases nearing deadline |
| Stuck workflow count | Cases with no state change beyond threshold |
| Audit persistence failure | Failed audit append count |
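
A business signal such as the stuck workflow count can be computed from case records and then published as a metric. This is a minimal sketch under stated assumptions: the `Case` record, its `last_transition_at` field, the 24-hour threshold, and the `CLOSED` terminal state are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Case:
    case_id: str
    state: str
    last_transition_at: datetime  # server-side, timezone-aware timestamp


def count_stuck_cases(cases, now, threshold=timedelta(hours=24), terminal_states=("CLOSED",)):
    """Count non-terminal cases whose state has not changed within the threshold."""
    return sum(
        1
        for c in cases
        if c.state not in terminal_states and now - c.last_transition_at > threshold
    )


now = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
cases = [
    Case("a", "UNDER_REVIEW", now - timedelta(hours=30)),  # stuck
    Case("b", "UNDER_REVIEW", now - timedelta(hours=2)),   # recently active
    Case("c", "CLOSED", now - timedelta(days=10)),         # terminal, ignored
]
stuck = count_stuck_cases(cases, now)
```

The resulting number is low-cardinality by construction, which makes it a good alarm input.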

5. Logs: Structured Evidence for Debugging and Audit

CloudWatch Logs centralizes logs from systems, applications, and AWS services in a scalable service. In a production system, logs are not print statements. Logs are structured evidence.

Bad log:

Something failed

Good log:

{
  "timestamp": "2026-07-01T10:15:12.341Z",
  "level": "ERROR",
  "service": "case-workflow",
  "environment": "prod",
  "correlationId": "c-8e9b1",
  "caseIdHash": "h:2a9f...",
  "tenantId": "regulator-a",
  "operation": "transitionCaseState",
  "fromState": "UNDER_REVIEW",
  "toState": "ESCALATED",
  "errorType": "ConditionalWriteFailed",
  "dependency": "dynamodb",
  "durationMs": 184,
  "retryable": true
}
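
A log entry shaped like the one above can be produced with Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class and the convention of passing structured data through `extra={"fields": ...}` are assumptions made for illustration, not a prescribed library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (a local convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


stream = io.StringIO()  # stand-in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("case-workflow")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "state transition failed",
    extra={"fields": {
        "service": "case-workflow",
        "correlationId": "c-8e9b1",
        "operation": "transitionCaseState",
        "errorType": "ConditionalWriteFailed",
        "retryable": True,
    }},
)
logged = json.loads(stream.getvalue())
```

Keeping every entry as a single JSON object per line is what makes the fields queryable later, for example by correlation ID.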

5.1 Logging Levels

| Level | Meaning | Example |
|---|---|---|
| DEBUG | Development/temporary detailed data | Disabled by default in prod |
| INFO | Business/operational milestone | Case transitioned, job completed |
| WARN | Recoverable anomaly | Retry, fallback, degraded dependency |
| ERROR | Failed operation requiring attention | Request failed, audit append failed |
| FATAL/CRITICAL | Service cannot continue or severe data risk | Cannot load config, integrity breach |

5.2 Log Events That Matter

Production logs should capture:

  1. Request received and completed.
  2. External dependency call and result.
  3. State transition attempt and result.
  4. Authorization decision failure.
  5. Validation failure category, not raw sensitive payload.
  6. Retry and final failure.
  7. DLQ publication.
  8. Idempotency conflict.
  9. Data integrity violation.
  10. Deployment version and configuration version.

5.3 Log Retention

Do not keep all logs forever by default.

| Log type | Typical retention reasoning |
|---|---|
| Debug application logs | Short retention, e.g. 7-30 days |
| Production error logs | Medium retention, e.g. 30-90 days |
| Security/audit logs | Long retention based on compliance |
| Access logs | Depends on forensic and privacy policy |
| Regulated evidence logs | Explicit retention/legal policy |

Important distinction:

  • Operational logs help run the service.
  • Audit records prove what happened.
  • Evidence records may need immutability and chain-of-custody controls.

Do not confuse ordinary logs with regulatory evidence.


6. Embedded Metric Format

CloudWatch Embedded Metric Format allows applications to emit structured log events that CloudWatch can extract into metrics. This is useful when we want logs and metrics from one event emission path.

Example:

{
  "_aws": {
    "Timestamp": 1782900912341,
    "CloudWatchMetrics": [
      {
        "Namespace": "Acme/CaseWorkflow",
        "Dimensions": [["Environment", "Service"]],
        "Metrics": [
          { "Name": "TransitionLatencyMs", "Unit": "Milliseconds" },
          { "Name": "TransitionFailure", "Unit": "Count" }
        ]
      }
    ]
  },
  "Environment": "prod",
  "Service": "case-workflow",
  "TransitionLatencyMs": 92,
  "TransitionFailure": 0,
  "correlationId": "c-8e9b1"
}

Use EMF when:

  1. Application code owns the metric.
  2. You want structured log + metric together.
  3. Metric dimensions are controlled.
  4. You avoid high-cardinality dimensions.

Avoid EMF when:

  1. Every user/request becomes a metric dimension.
  2. You need pure high-throughput metric ingestion without log retention cost.
  3. You cannot control payload shape.
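
The EMF document shown earlier can be assembled by a small helper so that dimensions stay controlled in one place. This is a minimal sketch using only the standard library; `emf_event` is a hypothetical helper, and it only builds the JSON line. Extraction into metrics happens only where EMF ingestion is available (for example, through standard output in Lambda or through the CloudWatch agent).

```python
import json
import time


def emf_event(namespace, dimensions, metrics, properties=None, timestamp_ms=None):
    """Build one EMF log line. `metrics` maps name -> (value, unit)."""
    event = {
        "_aws": {
            "Timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # a single dimension set
                "Metrics": [{"Name": n, "Unit": u} for n, (_, u) in metrics.items()],
            }],
        }
    }
    event.update(dimensions)                               # dimension values as top-level keys
    event.update({n: v for n, (v, _) in metrics.items()})  # metric values as top-level keys
    event.update(properties or {})                         # searchable fields, not dimensions
    return json.dumps(event)


line = emf_event(
    "Acme/CaseWorkflow",
    {"Environment": "prod", "Service": "case-workflow"},
    {"TransitionLatencyMs": (92, "Milliseconds"), "TransitionFailure": (0, "Count")},
    properties={"correlationId": "c-8e9b1"},
    timestamp_ms=1782900912341,
)
print(line)
```

Note that `correlationId` is passed as a property, not a dimension: it stays searchable in the log without creating a new metric series per request.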

7. Tracing: Causal Path Across Services

Metrics tell us there is a problem. Logs tell us details. Traces show the path.

AWS X-Ray collects request data and helps visualize and analyze requests across applications and downstream dependencies. AWS Distro for OpenTelemetry can collect and send metrics/traces to AWS X-Ray, CloudWatch, OpenSearch, and other monitoring systems.

A useful trace should show:

  1. Entry point.
  2. Service boundaries.
  3. Dependency calls.
  4. Latency per hop.
  5. Error per segment/span.
  6. Retry behavior.
  7. Trace/correlation ID.

7.1 X-Ray vs OpenTelemetry

| Option | Strength | Trade-off |
|---|---|---|
| X-Ray SDK/native integration | AWS-native, service map, direct integration | More AWS-specific instrumentation model |
| OpenTelemetry/ADOT | Open standard, portable, vendor-flexible | Collector/configuration complexity |
| CloudWatch Agent OTLP | Consolidates metrics/traces ingestion path | Requires agent lifecycle management |

Recommended enterprise posture:

  • Prefer OpenTelemetry semantic conventions where possible.
  • Export to AWS-native backends for operational integration.
  • Keep instrumentation independent from dashboard vendor.
  • Standardize trace propagation headers.

7.2 Sampling

Tracing every request may be expensive. Sampling controls volume.

Sampling strategy:

| Situation | Sampling approach |
|---|---|
| Low traffic critical service | Higher sampling |
| High traffic stable service | Lower default sampling |
| Error responses | Always sample or biased sampling |
| Canary deployment | Temporarily higher sampling |
| Incident investigation | Increase sampling with time limit |

Failure mode: sampling only successful requests makes traces useless during incidents.
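
Error-biased sampling can be sketched as a small decision function. This is an illustration only, not the X-Ray sampling-rule API or an OpenTelemetry sampler: `should_sample` is a hypothetical helper, and because the error flag is only known once the request has finished, it models a tail-based decision.

```python
import hashlib


def should_sample(trace_id: str, is_error: bool, base_rate: float) -> bool:
    """Keep every error; keep a deterministic fraction of everything else."""
    if is_error:
        return True
    # Hash the trace ID so every service reaches the same decision for one trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < base_rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_sample(t, is_error=False, base_rate=0.05) for t in trace_ids)
```

With a 5% base rate, `kept` lands near 500 of the 10,000 successful traces, while failed requests are always retained. Raising `base_rate` temporarily covers the canary and incident rows of the table.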


8. Correlation ID and Context Propagation

Every request crossing a boundary should carry context.

Minimum context:

correlationId
traceId
service
operation
tenantId or tenantHash
environment
requestId
actorType

Do not log raw sensitive identifiers when hashes or internal references are sufficient.

For async messaging, context must be placed in message attributes or event metadata. Without this, an incident timeline breaks at the queue boundary.

Recommended event envelope:

{
  "eventId": "evt-001",
  "eventType": "CaseEscalated",
  "occurredAt": "2026-07-01T10:15:12Z",
  "correlationId": "c-8e9b1",
  "traceId": "1-...",
  "producer": "case-workflow",
  "tenantId": "regulator-a",
  "schemaVersion": "1.0",
  "payload": {}
}
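
Carrying this context across a queue boundary can be sketched as a pair of helpers, one for the producer and one for the consumer. The attribute shape below follows the `DataType`/`StringValue` form used by SQS message attributes when sending; other delivery paths (for example Lambda event records) may use different key casing, so treat the exact keys as an assumption. `CONTEXT_KEYS` and both function names are hypothetical.

```python
CONTEXT_KEYS = ("correlationId", "traceId", "tenantId", "eventId", "schemaVersion")


def to_message_attributes(context: dict) -> dict:
    """Producer side: encode context as string message attributes."""
    return {
        key: {"DataType": "String", "StringValue": str(context[key])}
        for key in CONTEXT_KEYS
        if context.get(key) is not None
    }


def from_message_attributes(attributes: dict) -> dict:
    """Consumer side: restore the context so logs and traces stay linked."""
    return {
        key: attributes[key]["StringValue"]
        for key in CONTEXT_KEYS
        if key in attributes
    }


context = {
    "correlationId": "c-8e9b1",
    "tenantId": "regulator-a",
    "eventId": "evt-001",
    "schemaVersion": "1.0",
    "payloadSize": 512,  # not a context key, so it is not propagated
}
attributes = to_message_attributes(context)
restored = from_message_attributes(attributes)
```

The consumer should attach `restored` to every log entry it writes, so a search by `correlationId` returns both sides of the queue.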

9. SLI, SLO, and Error Budget

A dashboard without an SLO often becomes decoration. An SLO converts telemetry into an engineering commitment.

9.1 Definitions

| Term | Meaning |
|---|---|
| SLI | Service Level Indicator; measurement |
| SLO | Service Level Objective; target |
| SLA | Legal/business agreement |
| Error budget | Allowed unreliability within SLO window |

Example:

SLI: percentage of successful case state transitions completed under 500 ms
SLO: 99.5% over rolling 30 days
Error budget: 0.5% failed/slow transitions allowed
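
The error budget becomes concrete once it is expressed as a number of events. A minimal sketch, assuming a hypothetical volume of 2,000,000 transitions in the 30-day window, 4,000 of which failed or were slower than 500 ms; `error_budget` is an illustrative helper.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Translate an SLO target into an allowed number of bad events."""
    allowed_bad = (1.0 - slo_target) * total_events
    return {
        "allowed_bad": allowed_bad,
        "remaining_bad": allowed_bad - bad_events,
        "budget_consumed": bad_events / allowed_bad if allowed_bad else float("inf"),
    }


# 99.5% target over the window: roughly 10,000 bad transitions are allowed,
# so 4,000 bad transitions consume roughly 40% of the budget.
budget = error_budget(0.995, 2_000_000, 4_000)
```

Framing the budget as a count makes release decisions easier: the team can see how many more bad events the window can absorb.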

9.2 Good SLI Examples

| Workload | SLI |
|---|---|
| API service | % valid requests that return non-5xx under latency threshold |
| Workflow engine | % state transitions committed successfully under threshold |
| Async worker | % messages processed before freshness deadline |
| Data pipeline | % partitions delivered complete before SLA deadline |
| Search service | % queries returning successful result under p95 threshold |

9.3 Burn Rate

Burn rate measures how fast error budget is being consumed.

burn_rate = current_error_rate / allowed_error_rate

If SLO allows 0.1% error and current error is 1%, burn rate is 10x.

A good alert uses multiple windows:

| Alert | Meaning |
|---|---|
| Fast burn | Severe current incident |
| Slow burn | Sustained degradation |
| Ticket alert | Needs action but not page |
| Dashboard-only | Informational trend |
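
The burn-rate formula and the multi-window idea can be combined in a short sketch. The thresholds `14.4` and `6.0` and the pairing of a short and a long window are illustrative assumptions, not values taken from this lesson; tune them to the SLO window in use.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """burn_rate = current_error_rate / allowed_error_rate."""
    allowed = 1.0 - slo_target
    return error_rate / allowed


def classify(short_window_rate, long_window_rate, slo_target, fast=14.4, slow=6.0):
    """Require both windows to exceed a threshold so a brief spike alone does not alert."""
    short_burn = burn_rate(short_window_rate, slo_target)
    long_burn = burn_rate(long_window_rate, slo_target)
    if short_burn >= fast and long_burn >= fast:
        return "fast_burn"
    if short_burn >= slow and long_burn >= slow:
        return "slow_burn"
    return "ok"


# SLO 99.9% allows 0.1% errors; a 1% error rate burns budget about 10x too fast.
example = burn_rate(0.01, 0.999)
```

The short window confirms the problem is still happening; the long window confirms it is large enough to matter.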

10. Alarm Design

Bad alarm:

CPU > 80% for 5 minutes

This may be useful for diagnosis, but high CPU does not necessarily mean users are affected.

Better alarm:

p95 latency > 800 ms AND 5xx rate > 2% for 5 minutes on prod API

Or:

ApproximateAgeOfOldestMessage > freshness target for 10 minutes

10.1 Alarm Severity

| Severity | Condition | Response |
|---|---|---|
| Sev1 | Broad user impact, data integrity risk, critical security issue | Page immediately |
| Sev2 | Significant degradation, partial region/workload impact | Page team/on-call |
| Sev3 | Degraded non-critical path or approaching quota | Ticket/working-hours response |
| Sev4 | Trend or hygiene issue | Backlog |

10.2 Symptom vs Cause Alarms

Use symptom alarms to page. Use cause alarms to diagnose.

| Type | Example | Page? |
|---|---|---|
| Symptom | User request success rate below SLO | Yes |
| Cause | RDS CPU high | Usually no alone |
| Cause | Lambda throttles | Maybe if linked to user impact |
| Cause | Queue age high | Yes if freshness SLO violated |
| Cause | Disk usage high | Ticket unless imminent outage |

10.3 Composite Alarms

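Composite alarms reduce noise by combining signals:

ALARM if:
  ALARM("API5xxHigh")
  AND ALARM("RequestCountNormal")
  AND NOT ALARM("DeploymentInProgress")

Here RequestCountNormal is read as an alarm that enters ALARM only while traffic is at or above its normal baseline. The rule therefore avoids paging for low-traffic anomalies and supports deployment-aware alarms.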
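
The rule above can be modelled as a small truth function to check which combinations of child alarm states would page. This is a local simulation for reasoning about the rule, not a call to CloudWatch; `composite_state` is a hypothetical helper.

```python
def composite_state(states: dict) -> str:
    """Fire only when 5xx is high, traffic is at normal volume, and no deployment is running."""
    def in_alarm(name: str) -> bool:
        return states.get(name) == "ALARM"

    fires = (
        in_alarm("API5xxHigh")
        and in_alarm("RequestCountNormal")
        and not in_alarm("DeploymentInProgress")
    )
    return "ALARM" if fires else "OK"


real_incident = composite_state(
    {"API5xxHigh": "ALARM", "RequestCountNormal": "ALARM", "DeploymentInProgress": "OK"}
)
low_traffic_blip = composite_state(
    {"API5xxHigh": "ALARM", "RequestCountNormal": "OK", "DeploymentInProgress": "OK"}
)
during_deploy = composite_state(
    {"API5xxHigh": "ALARM", "RequestCountNormal": "ALARM", "DeploymentInProgress": "ALARM"}
)
```

Walking through the combinations before creating the alarm helps catch cases such as a deployment marker that never clears and silently suppresses paging.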


11. Dashboards That Actually Help

Use layered dashboards.

11.1 Executive/Service Health Dashboard

Shows:

  1. Current SLO compliance.
  2. Error budget remaining.
  3. Active incidents.
  4. User-impacting latency/error.
  5. Business process throughput.

11.2 Service Owner Dashboard

Shows:

  1. Request rate.
  2. Error rate.
  3. Latency p50/p95/p99.
  4. Dependency error/latency.
  5. Queue age.
  6. Worker throughput.
  7. Deployment version.
  8. Throttling/quota.

11.3 Dependency Dashboard

Shows:

  1. RDS/Aurora connections, CPU, replica lag, deadlocks.
  2. DynamoDB throttles, consumed capacity, hot partition symptoms.
  3. SQS age, visible/not visible messages, DLQ.
  4. Lambda concurrency, errors, duration, throttles.
  5. ALB target health, 5xx, response time.

11.4 Incident Dashboard

Shows only what an incident commander needs:

  1. User impact.
  2. Start time.
  3. Deployment/change timeline.
  4. Regional/AZ symptoms.
  5. Dependency health.
  6. Mitigation state.
  7. Recovery trend.

Dashboard smell:

  • 80 graphs and no clear answer.
  • No SLO.
  • No deployment marker.
  • No dependency view.
  • All p50, no p95/p99.
  • No tenant/environment dimension.

12. Workload-Specific Observability Patterns

12.1 Lambda

Minimum signals:

| Signal | Why |
|---|---|
| Invocations | Traffic |
| Errors | Failure rate |
| Duration p95/p99 | Latency and timeout risk |
| Throttles | Concurrency/capacity issue |
| ConcurrentExecutions | Saturation |
| IteratorAge | Stream lag |
| DLQ/destination failures | Async failure |
| Cold start count/custom metric | Runtime efficiency |


12.2 ECS/EKS

Minimum signals:

| Layer | Signals |
|---|---|
| Service | Request rate, error rate, latency, dependency latency |
| Container | CPU, memory, restart count, OOMKilled |
| Cluster | Capacity, pending tasks/pods, node pressure |
| Ingress | ALB 5xx, target response time, target health |
| Deployment | Version, rollout state, failed deployment |

12.3 API Gateway / ALB

Minimum signals:

| Signal | Meaning |
|---|---|
| Request count | Traffic |
| 4xx | Client/auth/input issue |
| 5xx | Service/platform issue |
| Latency | End-to-end gateway latency |
| Integration latency | Backend latency |
| Throttles | Rate/quota issue |

12.4 SQS/EventBridge/Step Functions

Minimum signals:

| Service | Signals |
|---|---|
| SQS | Age of oldest message, visible messages, not visible messages, DLQ count |
| EventBridge | Failed invocations, throttles, DLQ |
| Step Functions | Failed/timed-out executions, execution duration, state transition failures |

For a workflow platform, never monitor request count alone. Also monitor stuck business state.

Example custom metrics:

CasesStuckInReview
EscalationDeadlineBreaches
AuditAppendFailureCount
ManualOverrideCount
ReopenCaseRate

13. Observability for Regulated Case Management

A regulated case-management/enforcement system needs two telemetry planes: operational telemetry and evidence records.

Operational telemetry can be sampled, aggregated, and expired. Evidence records often cannot.

Do not store regulatory evidence only in application logs. Logs are optimized for operations, not necessarily legal defensibility.

Minimum regulated observability model:

| Concern | Implementation direction |
|---|---|
| Who changed case state | Append-only audit event |
| Why changed | Reason code/comment reference |
| When changed | Server-side timestamp |
| Under what authority | Role/permission/context |
| Was transition valid | Policy/rule version |
| Was notification sent | Notification event/result |
| Was SLA breached | Case timer metric + event |
| Was evidence accessed | Access audit event |

14. Security and Privacy

Observability data often contains sensitive information. Treat telemetry as production data.

Rules:

  1. Do not log passwords, tokens, session cookies, private keys, or raw secrets.
  2. Avoid raw PII unless explicitly required and protected.
  3. Use structured redaction libraries.
  4. Encrypt log groups if policy requires customer-managed KMS keys.
  5. Set retention explicitly.
  6. Restrict CloudWatch Logs Insights access.
  7. Separate security logs from application debug logs.
  8. Store audit/evidence records in tamper-resistant storage when required.
  9. Monitor access to logs through CloudTrail.
  10. Use account boundary for centralized logging when appropriate.

Bad pattern:

logger.info("request={}", fullHttpRequest)

Better pattern:

logger.info("request received method={} route={} tenant={} correlationId={} contentLength={}", method, route, tenant, correlationId, contentLength)
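
Rule 3 (structured redaction) can be sketched as a function applied to log fields before they are written. This is a minimal illustration, not a specific redaction library: `SENSITIVE_KEYS` and `redact` are hypothetical, and matching is by key name only, so it will not catch a secret embedded inside a free-text value.

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "cookie", "secret", "privatekey"}


def redact(fields: dict) -> dict:
    """Return a copy that is safer to log: sensitive values masked, nested dicts walked."""
    safe = {}
    for key, value in fields.items():
        normalized = key.lower().replace("_", "").replace("-", "")
        if normalized in SENSITIVE_KEYS:
            safe[key] = "[REDACTED]"
        elif isinstance(value, dict):
            safe[key] = redact(value)
        else:
            safe[key] = value
    return safe


request_fields = {
    "method": "POST",
    "route": "/cases/{id}/transition",
    "headers": {"Authorization": "Bearer abc123", "Content-Length": "184"},
    "private_key": "-----BEGIN...",
}
safe_fields = redact(request_fields)
```

The function returns a copy and leaves the original untouched, so request handling is unaffected by logging.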

15. Cost Engineering for Observability

Observability cost is real. Cost issues usually come from:

  1. Excessive log volume.
  2. Long retention for noisy logs.
  3. High-cardinality custom metrics.
  4. Too many dashboards/alarms with low value.
  5. Tracing every request in high-volume service.
  6. Copying logs to multiple vendors without filtering.
  7. Debug logs left on in production.

Cost controls:

| Control | Benefit |
|---|---|
| Retention by log group | Avoid infinite storage |
| Sampling | Reduce trace cost |
| EMF dimension discipline | Avoid metric explosion |
| Subscription filters | Export only useful streams |
| Log level governance | Reduce noise |
| Aggregated business metrics | Lower cardinality |
| Separate hot/cold storage | Lower long-term cost |

Operational rule:

Every new telemetry stream should have an owner, retention policy, security classification, and known use case.
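
The cost effect of a high-cardinality dimension can be estimated before a metric ships. A minimal sketch: each distinct combination of dimension values is a separate metric series, so the worst case is the product of the per-dimension cardinalities. The cardinality figures below are hypothetical.

```python
from math import prod


def estimated_series(dimension_cardinality: dict) -> int:
    """Worst-case number of distinct metric series for one metric name."""
    return prod(dimension_cardinality.values())


controlled = estimated_series({"Environment": 3, "Service": 20, "Route": 15})
with_user_id = estimated_series(
    {"Environment": 3, "Service": 20, "Route": 15, "UserId": 50_000}
)
```

Here the controlled set yields at most 900 series, while adding `UserId` raises the ceiling to 45,000,000. A cardinality review only needs this arithmetic to reject the second design.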


16. Failure Modes

| Failure mode | Symptom | Prevention |
|---|---|---|
| No correlation IDs | Incident timeline breaks across services | Standard request/event envelope |
| Alarm on cause only | Pages for CPU but misses user impact | SLO/symptom alarms |
| High-cardinality metrics | Cost spike and unusable metrics | Dimension review |
| Logs contain secrets | Security incident | Redaction and policy tests |
| No deployment markers | Hard to connect incident with change | Emit deployment events |
| Async boundary loses trace | Cannot diagnose delayed failures | Propagate context through messages |
| Dashboard too broad | Slow diagnosis | Layered dashboards |
| No runbook linked to alarm | On-call improvises | Alarm-to-runbook mapping |
| Sampling hides failures | Traces missing for errors | Bias sampling toward errors |
| Audit mixed with debug logs | Regulatory evidence weak | Separate audit store |

17. Alarm-to-Runbook Mapping

Every production alarm should have:

alarmName: prod-case-workflow-high-transition-failure-rate
owner: case-platform-team
severity: Sev2
userImpact: case state transitions may fail
slo: case-transition-success-rate
firstChecks:
  - check recent deployments
  - check DynamoDB conditional failures
  - check downstream event publish failures
  - check IAM/KMS errors
mitigation:
  - pause non-critical workflow consumers
  - rollback latest deployment if error started after release
  - increase provisioned capacity only if throttling confirmed
rollbackSafe: conditional
escalation:
  - platform-oncall
  - data-platform-oncall if persistence failure

If an alarm does not have an owner and first response steps, it is not production-ready.
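
That readiness rule can be checked automatically, for example in a CI step that loads each alarm definition. A minimal sketch, assuming the definition has already been parsed into a dictionary; `REQUIRED_FIELDS` and `missing_runbook_fields` are hypothetical names, and the required set is one possible choice.

```python
REQUIRED_FIELDS = (
    "alarmName", "owner", "severity", "userImpact",
    "firstChecks", "mitigation", "escalation",
)


def missing_runbook_fields(alarm: dict) -> list:
    """Return required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not alarm.get(field)]


complete = {
    "alarmName": "prod-case-workflow-high-transition-failure-rate",
    "owner": "case-platform-team",
    "severity": "Sev2",
    "userImpact": "case state transitions may fail",
    "firstChecks": ["check recent deployments"],
    "mitigation": ["rollback latest deployment if error started after release"],
    "escalation": ["platform-oncall"],
}
incomplete = {"alarmName": "prod-api-cpu-high", "severity": "Sev3", "firstChecks": []}

complete_gaps = missing_runbook_fields(complete)
incomplete_gaps = missing_runbook_fields(incomplete)
```

An empty list means the alarm meets the minimum bar; anything else names exactly what the owning team still has to fill in.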


18. Deliberate Practice

Exercise 1: Build a Service Health Dashboard

For one service, define:

  1. Request rate.
  2. Error rate.
  3. p95/p99 latency.
  4. Dependency latency.
  5. Queue age if async.
  6. Current deployment version.
  7. SLO compliance.

Self-check:

  • Can a new on-call identify impact in under 2 minutes?
  • Can they see whether the issue is service or dependency?
  • Can they find the latest deployment/change?

Exercise 2: Design Three Alarms

Create:

  1. One fast-burn SLO alarm.
  2. One slow-burn SLO alarm.
  3. One dependency saturation alarm.

Self-check:

  • Which alarm pages humans?
  • Which alarm opens ticket only?
  • Which alarm triggers automation?

Exercise 3: Trace an Async Flow

Pick a flow:

API -> service -> SQS -> worker -> database -> notification

Propagate:

  1. correlationId
  2. traceId
  3. tenantId
  4. eventId
  5. schemaVersion

Self-check:

  • Can you reconstruct the full path from one user complaint?
  • Can you find where latency accumulated?
  • Can you replay safely if a message failed?

19. Production Checklist

[ ] Every service has owner and service catalog entry.
[ ] Every service emits structured logs.
[ ] Every request has correlation ID.
[ ] Async events carry correlation context.
[ ] Metrics separate service, dependency, and business signals.
[ ] SLO is defined for critical user journeys.
[ ] Page alarms are symptom/SLO-based.
[ ] Cause alarms are used for diagnosis or tickets.
[ ] Dashboards are layered by audience.
[ ] Logs have explicit retention.
[ ] Sensitive fields are redacted.
[ ] Trace sampling strategy is documented.
[ ] Deployment markers are visible.
[ ] Alarm has linked runbook.
[ ] Cost/cardinality review exists for custom metrics.
[ ] Audit/evidence records are separate from debug logs.

20. Summary

Observability engineering on AWS is the ability to build systems that can explain themselves when they fail.

Key points of Part 023:

  1. Observability is a runtime contract, not a dashboard.
  2. Metrics are for alarms and trends.
  3. Logs are for forensics and debugging.
  4. Traces are for the causal path.
  5. Events are for the change timeline.
  6. SLOs turn telemetry into an operational commitment.
  7. CloudWatch is a strong AWS-native baseline.
  8. X-Ray and OpenTelemetry help reveal the distributed path.
  9. The correlation ID is the backbone of cross-service investigation.
  10. Telemetry must be secure, cost-efficient, and owned.

In Part 024, we will cover how these observability signals are used in real operations: Systems Manager, Session Manager, Automation, OpsCenter, Incident Manager, runbooks, playbooks, patching, and incident response.

