Learn Aws Part 023 Observability Cloudwatch Xray Opentelemetry And Slo
title: Learn AWS Engineering Mastery - Part 023
description: Observability engineering on AWS using CloudWatch, X-Ray, OpenTelemetry, correlation IDs, metrics, logs, traces, alarms, dashboards, and SLO-driven operations.
series: learn-aws
seriesTitle: Learn AWS Engineering Mastery
order: 23
partTitle: Observability: CloudWatch, X-Ray, OpenTelemetry, and SLO
tags:
- aws
- cloudwatch
- xray
- opentelemetry
- observability
- sre
- slo
- operations
- platform-engineering
- series
date: 2026-07-01
Observability: CloudWatch, X-Ray, OpenTelemetry, and SLO
Learning target: after this part, we can design production-grade observability on AWS: not just building dashboards, but building a telemetry contract that helps engineers detect, explain, and fix system failures quickly.
Observability is often reduced to “we have logs” or “we have a CloudWatch dashboard”. That is not enough for a production-grade system.
Good observability answers the following operational questions:
- Are users currently affected?
- How large is the impact?
- Which service is causing the degradation?
- Is the problem latency, errors, saturation, throttling, a dependency, a deployment, data, or a quota?
- When did it start?
- What changed before the symptoms appeared?
- Will a rollback help?
- Is the mitigation safe to apply?
- What evidence can be preserved for the post-incident review?
AWS provides CloudWatch, CloudWatch Logs, CloudWatch Metrics, CloudWatch Alarms, CloudWatch Dashboards, CloudWatch Logs Insights, CloudWatch Embedded Metric Format, CloudWatch Agent, X-Ray, AWS Distro for OpenTelemetry, CloudTrail, EventBridge, AWS Health, Config, service-specific metrics, and integrations with many managed services.
The tools are not the core, though. The core of observability is an information structure that can be used when the system fails.
1. Kaufman Skill Map
Kaufman-style deconstruction for observability:
| Sub-skill | What to master | Self-correction measure |
|---|---|---|
| Telemetry design | Choosing the right metric/log/trace/event | Alarms answer user symptoms, not infrastructure noise |
| CloudWatch metrics | Namespace, dimension, statistic, period, alarm | Alarms can distinguish error rate, latency, and saturation |
| Logging | Structured logs, retention, privacy, query | Engineers can answer “what happened?” without SSH |
| Tracing | Trace ID, span, downstream call, sampling | Engineers can see the request path across services |
| SLO | SLI, objective, error budget, burn rate | Alerts are impact-based, not CPU alone |
| Dashboard | Layered dashboard: executive, service, dependency, infrastructure | Dashboards speed up diagnosis rather than decorate the console |
| Governance | Retention, encryption, access, cost | Telemetry does not leak sensitive data or blow up cost |
2. Mental Model: Observability Is a Runtime Contract
Observability is not an add-on feature. Observability is a runtime contract between a service and its operators.
A production service must emit four main signals:
| Signal | Main use | Do not use for |
|---|---|---|
| Metrics | Alarm, trend, capacity, SLO | Per-request forensic detail |
| Logs | Debug, forensics, audit, event detail | Arbitrary high-cardinality time series |
| Traces | Request path, dependency latency, causal chain | Long-term audit or every business event |
| Events | State changes, deployment, scaling, incident timeline | Detailed application debugging without context |
Simple rule:
- Metrics answer: “is there a problem?”
- Logs answer: “what happened?”
- Traces answer: “where did the time go or the error appear?”
- Events answer: “what changed?”
Top-tier engineers do not turn every signal into a log. They choose signals based on the operational question.
3. AWS Observability Stack
CloudWatch is the general-purpose observability hub on AWS. It can monitor AWS resources and applications in real time, and it provides metrics, alarms, dashboards, logs, an agent, cross-account monitoring, OpenTelemetry support, and other observability features.
AWS observability does not have to mean that all data stops at CloudWatch. Many enterprises send logs and traces to a SIEM, OpenSearch, Datadog, New Relic, Splunk, Grafana, or a data lake. The AWS-native baseline still matters because:
- Many AWS services publish native metrics to CloudWatch.
- CloudWatch alarms are easy to use as automation triggers.
- IAM/KMS/CloudTrail integration is relatively mature.
- Cross-account observability can serve as a multi-account baseline.
- Many AWS incident workflows start from AWS-native alarms and events.
4. Metrics: Alarmable Facts, Not Debug Text
A metric is a numeric time series identified by a namespace, a metric name, and dimensions. Each data point carries a timestamp, and a statistic is computed over a period when the metric is graphed or alarmed on.
Examples of good metrics:
| Metric | Dimension | Why it is good |
|---|---|---|
| RequestCount | Service, Environment, Route | Measures traffic |
| ErrorRate | Service, Environment | Alarmable against user impact |
| LatencyP95 | Service, Route | Measures tail latency |
| DependencyTimeoutCount | Service, Dependency | Isolates dependency failures |
| QueueAgeSeconds | QueueName, Consumer | Measures backlog freshness |
| ThrottleCount | Service, Resource | Signals quota/capacity pressure |
Bad metrics:
| Metric | Problem |
|---|---|
| RequestId as a dimension | Extreme cardinality, rising cost, useless alarms |
| UserId as a default dimension | Cardinality and privacy risk |
| ExceptionMessage as a dimension | Metric fragmentation and potential PII |
| CPU as the only alarm | Does not always represent user impact |
4.1 Namespace and Dimension Discipline
Recommended custom namespace:
Company/Product/Service
Example:
Acme/EnforcementCase/WorkflowService
Minimal example dimensions:
Environment=prod
Service=case-workflow
Region=ap-southeast-1
Add a dimension only if it will be used for diagnosis or alarming.
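The dimension discipline above can be checked in code before a metric is emitted. The following is a minimal sketch, not an AWS API: the allowlist `ALLOWED_DIMENSIONS`, the denylist `FORBIDDEN_DIMENSIONS`, and the function `validate_dimensions` are hypothetical names chosen for illustration, and the dimension names come from the tables in this section.

```python
# Hypothetical guard that rejects uncontrolled metric dimensions.
# The two sets below are illustrative assumptions, not an AWS-defined list.
ALLOWED_DIMENSIONS = {"Environment", "Service", "Region", "Route", "Dependency"}
FORBIDDEN_DIMENSIONS = {"RequestId", "UserId", "ExceptionMessage"}


def validate_dimensions(dimensions: dict) -> list:
    """Return a list of problems; an empty list means the dimensions are acceptable."""
    problems = []
    for name in dimensions:
        if name in FORBIDDEN_DIMENSIONS:
            problems.append(f"{name}: high-cardinality or privacy-sensitive dimension")
        elif name not in ALLOWED_DIMENSIONS:
            problems.append(f"{name}: not in the reviewed allowlist")
    return problems


if __name__ == "__main__":
    good = {"Environment": "prod", "Service": "case-workflow", "Region": "ap-southeast-1"}
    bad = {"Environment": "prod", "RequestId": "r-123"}
    print(validate_dimensions(good))  # no problems reported
    print(validate_dimensions(bad))   # one problem, for RequestId
```

A check like this could run in a unit test or a code review hook, so that a new dimension is a deliberate decision instead of a side effect.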
4.2 Golden Signals
For online services, use the golden signals:
| Signal | AWS implementation |
|---|---|
| Latency | ALB TargetResponseTime, API Gateway Latency, custom p95/p99 |
| Traffic | RequestCount, Count, Invocations, MessagesReceived |
| Errors | 5XX, function errors, failed transitions, DLQ count |
| Saturation | CPU, memory, concurrency, connection pool, queue age, throttles |
For data pipelines:
| Signal | Metric |
|---|---|
| Freshness | Age of latest processed event |
| Completeness | Expected vs processed records |
| Error | Failed records, DLQ entries |
| Throughput | Records/sec, bytes/sec |
| Lag | IteratorAge, consumer lag, queue age |
For a workflow/case-management platform:
| Signal | Metric |
|---|---|
| State transition success rate | Successful transitions / attempted transitions |
| Escalation latency | Time from trigger to escalation created |
| SLA breach risk | Cases nearing deadline |
| Stuck workflow count | Cases with no state change beyond threshold |
| Audit persistence failure | Failed audit append count |
5. Logs: Structured Evidence for Debugging and Audit
CloudWatch Logs centralizes logs from systems, applications, and AWS services in a scalable service. In a production system, logs are not print statements. Logs are structured evidence.
Bad log:
Something failed
Good log:
{
"timestamp": "2026-07-01T10:15:12.341Z",
"level": "ERROR",
"service": "case-workflow",
"environment": "prod",
"correlationId": "c-8e9b1",
"caseIdHash": "h:2a9f...",
"tenantId": "regulator-a",
"operation": "transitionCaseState",
"fromState": "UNDER_REVIEW",
"toState": "ESCALATED",
"errorType": "ConditionalWriteFailed",
"dependency": "dynamodb",
"durationMs": 184,
"retryable": true
}
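One way to produce a log line shaped like the example above is a JSON formatter on top of Python's standard `logging` module. This is a sketch under assumptions: the class `JsonFormatter`, the helper `build_logger`, and the convention of passing structured fields under an `extra={"fields": ...}` key are illustrative choices, not something prescribed by CloudWatch Logs.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str, environment: str) -> None:
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "environment": self.environment,
            "message": record.getMessage(),
        }
        # Structured fields arrive through logging's `extra` mechanism under
        # a "fields" key (a convention assumed for this sketch).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, environment: str) -> logging.Logger:
    """Create a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service, environment))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("case-workflow", "prod")
    log.error(
        "state transition failed",
        extra={"fields": {
            "correlationId": "c-8e9b1",
            "operation": "transitionCaseState",
            "errorType": "ConditionalWriteFailed",
            "dependency": "dynamodb",
            "durationMs": 184,
            "retryable": True,
        }},
    )
```

Keeping the service and environment in the formatter means individual call sites only supply the fields that vary per event.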
5.1 Logging Levels
| Level | Meaning | Example |
|---|---|---|
| DEBUG | Development/temporary detailed data | Disabled by default in prod |
| INFO | Business/operational milestone | Case transitioned, job completed |
| WARN | Recoverable anomaly | Retry, fallback, degraded dependency |
| ERROR | Failed operation requiring attention | Request failed, audit append failed |
| FATAL/CRITICAL | Service cannot continue or severe data risk | Cannot load config, integrity breach |
5.2 Log Events That Matter
Production logs should capture:
- Request received and completed.
- External dependency call and result.
- State transition attempt and result.
- Authorization decision failure.
- Validation failure category, not raw sensitive payload.
- Retry and final failure.
- DLQ publication.
- Idempotency conflict.
- Data integrity violation.
- Deployment version and configuration version.
5.3 Log Retention
Do not keep all logs forever by default.
| Log type | Typical retention reasoning |
|---|---|
| Debug application logs | Short retention, e.g. 7-30 days |
| Production error logs | Medium retention, e.g. 30-90 days |
| Security/audit logs | Long retention based on compliance |
| Access logs | Depends on forensic and privacy policy |
| Regulated evidence logs | Explicit retention/legal policy |
Important distinction:
- Operational logs help run the service.
- Audit records prove what happened.
- Evidence records may need immutability and chain-of-custody controls.
Do not confuse ordinary logs with regulatory evidence.
6. Embedded Metric Format
CloudWatch Embedded Metric Format allows applications to emit structured log events that CloudWatch can extract into metrics. This is useful when we want logs and metrics from one event emission path.
Example:
{
"_aws": {
"Timestamp": 1782900912341,
"CloudWatchMetrics": [
{
"Namespace": "Acme/CaseWorkflow",
"Dimensions": [["Environment", "Service"]],
"Metrics": [
{ "Name": "TransitionLatencyMs", "Unit": "Milliseconds" },
{ "Name": "TransitionFailure", "Unit": "Count" }
]
}
]
},
"Environment": "prod",
"Service": "case-workflow",
"TransitionLatencyMs": 92,
"TransitionFailure": 0,
"correlationId": "c-8e9b1"
}
Use EMF when:
- Application code owns the metric.
- You want structured log + metric together.
- Metric dimensions are controlled.
- You avoid high-cardinality dimensions.
Avoid EMF when:
- Every user/request becomes a metric dimension.
- You need pure high-throughput metric ingestion without log retention cost.
- You cannot control payload shape.
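The EMF document shown earlier can be assembled by a small helper so that dimensions stay controlled and everything else is a plain searchable property. The function `build_emf_event` and its parameters are hypothetical names for this sketch; it only builds the JSON structure and assumes the result is written to a log stream that CloudWatch processes as EMF (for example, standard output in Lambda).

```python
import json
import time


def build_emf_event(namespace, dimensions, metrics, properties=None, timestamp_ms=None):
    """Build an EMF-shaped dict.

    dimensions: {"Environment": "prod", ...}, emitted as a single dimension set.
    metrics:    {"TransitionLatencyMs": (92, "Milliseconds"), ...}
    properties: extra searchable fields that must NOT become dimensions.
    """
    event = {
        "_aws": {
            "Timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension values and metric values live at the top level of the event.
    event.update(dimensions)
    event.update({name: value for name, (value, _) in metrics.items()})
    # High-cardinality context such as correlationId stays a property only.
    event.update(properties or {})
    return event


if __name__ == "__main__":
    event = build_emf_event(
        namespace="Acme/CaseWorkflow",
        dimensions={"Environment": "prod", "Service": "case-workflow"},
        metrics={
            "TransitionLatencyMs": (92, "Milliseconds"),
            "TransitionFailure": (0, "Count"),
        },
        properties={"correlationId": "c-8e9b1"},
    )
    print(json.dumps(event))
```

Separating `dimensions` from `properties` in the function signature makes the cardinality decision visible at every call site.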
7. Tracing: Causal Path Across Services
Metrics tell us there is a problem. Logs tell us details. Traces show the path.
AWS X-Ray collects request data and helps visualize and analyze requests across applications and downstream dependencies. AWS Distro for OpenTelemetry can collect and send metrics/traces to AWS X-Ray, CloudWatch, OpenSearch, and other monitoring systems.
A useful trace should show:
- Entry point.
- Service boundaries.
- Dependency calls.
- Latency per hop.
- Error per segment/span.
- Retry behavior.
- Trace/correlation ID.
7.1 X-Ray vs OpenTelemetry
| Option | Strength | Trade-off |
|---|---|---|
| X-Ray SDK/native integration | AWS-native, service map, direct integration | More AWS-specific instrumentation model |
| OpenTelemetry/ADOT | Open standard, portable, vendor-flexible | Collector/configuration complexity |
| CloudWatch Agent OTLP | Consolidates metrics/traces ingestion path | Requires agent lifecycle management |
Recommended enterprise posture:
- Prefer OpenTelemetry semantic conventions where possible.
- Export to AWS-native backends for operational integration.
- Keep instrumentation independent from dashboard vendor.
- Standardize trace propagation headers.
7.2 Sampling
Tracing every request may be expensive. Sampling controls volume.
Sampling strategy:
| Situation | Sampling approach |
|---|---|
| Low traffic critical service | Higher sampling |
| High traffic stable service | Lower default sampling |
| Error responses | Always sample or biased sampling |
| Canary deployment | Temporarily higher sampling |
| Incident investigation | Increase sampling with time limit |
Failure mode: sampling only successful requests makes traces useless during incidents.
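The strategy table can be reduced to a small decision function. This is a sketch under assumptions: `should_sample` and its parameters are hypothetical, and because the decision uses the outcome of the request, it fits a setup where sampling is decided after the request completes (for example, in a collector) rather than at the first hop.

```python
import random


def should_sample(is_error, base_rate, boost_active=False, boosted_rate=1.0, rng=random.random):
    """Decide whether to keep a trace.

    is_error:     True if the request failed; failures are always kept.
    base_rate:    default keep probability between 0.0 and 1.0.
    boost_active: True during a canary or a time-limited incident investigation.
    boosted_rate: keep probability used while the boost is active.
    rng:          source of uniform random numbers in [0, 1), injectable for tests.
    """
    if is_error:
        return True
    rate = boosted_rate if boost_active else base_rate
    return rng() < rate


if __name__ == "__main__":
    kept = sum(should_sample(False, 0.05) for _ in range(10_000))
    print(f"kept {kept} of 10000 successful requests at a 5% base rate")
    print(should_sample(True, 0.05))  # errors are always kept
```

The time limit on an incident boost is not modeled here; in practice the caller would decide when `boost_active` turns off.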
8. Correlation ID and Context Propagation
Every request crossing a boundary should carry context.
Minimum context:
correlationId
traceId
service
operation
tenantId or tenantHash
environment
requestId
actorType
Do not log raw sensitive identifiers when hashes or internal references are sufficient.
For async messaging, context must be placed in message attributes or event metadata. Without this, an incident timeline breaks at the queue boundary.
Recommended event envelope:
{
"eventId": "evt-001",
"eventType": "CaseEscalated",
"occurredAt": "2026-07-01T10:15:12Z",
"correlationId": "c-8e9b1",
"traceId": "1-...",
"producer": "case-workflow",
"tenantId": "regulator-a",
"schemaVersion": "1.0",
"payload": {}
}
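To keep the timeline intact across a queue boundary, the context fields can travel as message attributes next to the payload. The sketch below assumes the attribute shape used by the SQS `SendMessage` API (`DataType` plus `StringValue`); the helper names and the `CONTEXT_KEYS` tuple are hypothetical, and no AWS call is made.

```python
# Context fields that should survive every async hop (taken from the envelope above).
CONTEXT_KEYS = ("correlationId", "traceId", "tenantId", "eventId", "schemaVersion")


def to_message_attributes(context: dict) -> dict:
    """Encode known context fields as SQS-style string message attributes."""
    return {
        key: {"DataType": "String", "StringValue": str(context[key])}
        for key in CONTEXT_KEYS
        if context.get(key) is not None
    }


def from_message_attributes(attributes: dict) -> dict:
    """Rebuild the context dict on the consumer side."""
    return {
        key: attributes[key]["StringValue"]
        for key in CONTEXT_KEYS
        if key in attributes
    }


if __name__ == "__main__":
    context = {
        "correlationId": "c-8e9b1",
        "tenantId": "regulator-a",
        "eventId": "evt-001",
        "schemaVersion": "1.0",
    }
    attrs = to_message_attributes(context)
    print(attrs["correlationId"])
    print(from_message_attributes(attrs) == context)
```

A consumer would restore the context first and attach it to its logger and tracer before processing the payload.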
9. SLI, SLO, and Error Budget
A dashboard without an SLO often becomes decoration. An SLO converts telemetry into an engineering commitment.
9.1 Definitions
| Term | Meaning |
|---|---|
| SLI | Service Level Indicator; measurement |
| SLO | Service Level Objective; target |
| SLA | Legal/business agreement |
| Error budget | Allowed unreliability within SLO window |
Example:
SLI: percentage of successful case state transitions completed under 500 ms
SLO: 99.5% over rolling 30 days
Error budget: 0.5% failed/slow transitions allowed
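The error budget in this example can be turned into concrete numbers once a traffic volume is known. The sketch below assumes a hypothetical volume of 2,000,000 transitions in the 30-day window; the function names are illustrative.

```python
def allowed_bad_events(slo_target: float, total_events: int) -> int:
    """Number of failed or slow events the SLO tolerates in the window."""
    return round((1.0 - slo_target) * total_events)


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed


if __name__ == "__main__":
    slo = 0.995          # 99.5% over a rolling 30 days, as in the example above
    total = 2_000_000    # assumed transition volume for the window
    print(allowed_bad_events(slo, total))                 # 10000
    print(round(budget_remaining(slo, total, 4_000), 3))  # 0.6
```

With 4,000 bad transitions so far, 60% of the budget remains, which is the kind of figure an executive or service health dashboard would show.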
9.2 Good SLI Examples
| Workload | SLI |
|---|---|
| API service | % valid requests that return non-5xx under latency threshold |
| Workflow engine | % state transitions committed successfully under threshold |
| Async worker | % messages processed before freshness deadline |
| Data pipeline | % partitions delivered complete before SLA deadline |
| Search service | % queries returning successful result under p95 threshold |
9.3 Burn Rate
Burn rate measures how fast error budget is being consumed.
burn_rate = current_error_rate / allowed_error_rate
If the SLO allows a 0.1% error rate and the current error rate is 1%, the burn rate is 10x.
A good alert uses multiple windows:
| Alert | Meaning |
|---|---|
| Fast burn | Severe current incident |
| Slow burn | Sustained degradation |
| Ticket alert | Needs action but not page |
| Dashboard-only | Informational trend |
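The alert tiers above can be sketched as a classifier over burn rates measured on several windows. The window names (`5m`, `1h`, `30m`, `6h`, `3d`) and the thresholds 14.4, 6.0, and 1.0 are assumptions for illustration, based on commonly cited multi-window defaults for a 30-day SLO; they should be tuned per service.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """burn_rate = current_error_rate / allowed_error_rate."""
    return error_rate / (1.0 - slo_target)


def classify_burn(error_rates: dict, slo_target: float) -> str:
    """Map per-window error rates to an alert tier.

    error_rates must contain the keys "5m", "1h", "30m", "6h", and "3d".
    A long window must agree with a short window before paging, so that
    an incident that has already recovered does not keep paging.
    """
    burn = {window: burn_rate(rate, slo_target) for window, rate in error_rates.items()}
    if burn["1h"] >= 14.4 and burn["5m"] >= 14.4:
        return "fast-burn"
    if burn["6h"] >= 6.0 and burn["30m"] >= 6.0:
        return "slow-burn"
    if burn["3d"] >= 1.0:
        return "ticket"
    return "dashboard-only"


if __name__ == "__main__":
    slo = 0.999  # allows a 0.1% error rate
    incident = {"5m": 0.02, "1h": 0.02, "30m": 0.02, "6h": 0.01, "3d": 0.002}
    print(classify_burn(incident, slo))  # fast-burn
```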
10. Alarm Design
Bad alarm:
CPU > 80% for 5 minutes
This may be useful, but it is not necessarily user impact.
Better alarm:
p95 latency > 800 ms AND 5xx rate > 2% for 5 minutes on prod API
Or:
ApproximateAgeOfOldestMessage > freshness target for 10 minutes
10.1 Alarm Severity
| Severity | Condition | Response |
|---|---|---|
| Sev1 | Broad user impact, data integrity risk, critical security issue | Page immediately |
| Sev2 | Significant degradation, partial region/workload impact | Page team/on-call |
| Sev3 | Degraded non-critical path or approaching quota | Ticket/working-hours response |
| Sev4 | Trend or hygiene issue | Backlog |
10.2 Symptom vs Cause Alarms
Use symptom alarms to page. Use cause alarms to diagnose.
| Type | Example | Page? |
|---|---|---|
| Symptom | User request success rate below SLO | Yes |
| Cause | RDS CPU high | Usually not on its own |
| Cause | Lambda throttles | Maybe if linked to user impact |
| Cause | Queue age high | Yes if freshness SLO violated |
| Cause | Disk usage high | Ticket unless imminent outage |
10.3 Composite Alarms
Composite alarms reduce noise by combining signals:
ALARM if:
API5xxHigh == ALARM
AND RequestCountNormal == ALARM
AND DeploymentInProgress != ALARM
This avoids paging for low traffic anomalies and supports deployment-aware alarms.
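CloudWatch composite alarms express this kind of condition as a rule string built from functions such as `ALARM(...)` and the operators `AND`, `OR`, and `NOT`. The helper below is a hypothetical sketch that only assembles such a string from the alarm names used above; creating the alarm itself (for example, through the `PutCompositeAlarm` API) is out of scope here.

```python
def alarm_rule(all_in_alarm, none_in_alarm):
    """Build a composite alarm rule expression.

    all_in_alarm:  child alarms that must all be in ALARM state.
    none_in_alarm: child alarms that must not be in ALARM state.
    """
    terms = [f'ALARM("{name}")' for name in all_in_alarm]
    terms += [f'NOT ALARM("{name}")' for name in none_in_alarm]
    return " AND ".join(terms)


if __name__ == "__main__":
    rule = alarm_rule(
        all_in_alarm=["API5xxHigh", "RequestCountNormal"],
        none_in_alarm=["DeploymentInProgress"],
    )
    print(rule)
    # ALARM("API5xxHigh") AND ALARM("RequestCountNormal") AND NOT ALARM("DeploymentInProgress")
```

Generating the rule from lists keeps the suppression conditions, such as an in-progress deployment, explicit and reviewable.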
11. Dashboards That Actually Help
Use layered dashboards.
11.1 Executive/Service Health Dashboard
Shows:
- Current SLO compliance.
- Error budget remaining.
- Active incidents.
- User-impacting latency/error.
- Business process throughput.
11.2 Service Owner Dashboard
Shows:
- Request rate.
- Error rate.
- Latency p50/p95/p99.
- Dependency error/latency.
- Queue age.
- Worker throughput.
- Deployment version.
- Throttling/quota.
11.3 Dependency Dashboard
Shows:
- RDS/Aurora connections, CPU, replica lag, deadlocks.
- DynamoDB throttles, consumed capacity, hot partition symptoms.
- SQS age, visible/not visible messages, DLQ.
- Lambda concurrency, errors, duration, throttles.
- ALB target health, 5xx, response time.
11.4 Incident Dashboard
Shows only what an incident commander needs:
- User impact.
- Start time.
- Deployment/change timeline.
- Regional/AZ symptoms.
- Dependency health.
- Mitigation state.
- Recovery trend.
Dashboard smell:
- 80 graphs and no clear answer.
- No SLO.
- No deployment marker.
- No dependency view.
- All p50, no p95/p99.
- No tenant/environment dimension.
12. Workload-Specific Observability Patterns
12.1 Lambda
Minimum signals:
| Signal | Why |
|---|---|
| Invocations | Traffic |
| Errors | Failure rate |
| Duration p95/p99 | Latency and timeout risk |
| Throttles | Concurrency/capacity issue |
| ConcurrentExecutions | Saturation |
| IteratorAge | Stream lag |
| DLQ/destination failures | Async failure |
| Cold start count/custom metric | Runtime efficiency |
12.2 ECS/EKS
Minimum signals:
| Layer | Signals |
|---|---|
| Service | Request rate, error rate, latency, dependency latency |
| Container | CPU, memory, restart count, OOMKilled |
| Cluster | capacity, pending tasks/pods, node pressure |
| Ingress | ALB 5xx, target response time, target health |
| Deployment | version, rollout state, failed deployment |
12.3 API Gateway / ALB
Minimum signals:
| Signal | Meaning |
|---|---|
| Request count | Traffic |
| 4xx | Client/auth/input issue |
| 5xx | Service/platform issue |
| Latency | End-to-end gateway latency |
| Integration latency | Backend latency |
| Throttles | Rate/quota issue |
12.4 SQS/EventBridge/Step Functions
Minimum signals:
| Service | Signals |
|---|---|
| SQS | Age of oldest message, visible messages, not visible messages, DLQ count |
| EventBridge | Failed invocations, throttles, DLQ |
| Step Functions | Failed/timed-out executions, execution duration, state transition failures |
For a workflow platform, never monitor only request count. Also monitor stuck business state.
Example custom metrics:
CasesStuckInReview
EscalationDeadlineBreaches
AuditAppendFailureCount
ManualOverrideCount
ReopenCaseRate
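A metric such as `CasesStuckInReview` is a count over business state, not over requests. The sketch below shows one way to derive it, assuming each case record exposes a `state` and a timezone-aware `lastTransitionAt`; both field names and the function `count_stuck_cases` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def count_stuck_cases(cases, state, threshold, now):
    """Count cases in `state` whose last state change is older than `threshold`."""
    return sum(
        1
        for case in cases
        if case["state"] == state and now - case["lastTransitionAt"] > threshold
    )


if __name__ == "__main__":
    now = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
    cases = [
        {"state": "UNDER_REVIEW", "lastTransitionAt": now - timedelta(days=9)},
        {"state": "UNDER_REVIEW", "lastTransitionAt": now - timedelta(hours=3)},
        {"state": "ESCALATED", "lastTransitionAt": now - timedelta(days=30)},
    ]
    # Only the first case is both in review and older than the 7-day threshold.
    print(count_stuck_cases(cases, "UNDER_REVIEW", timedelta(days=7), now))  # 1
```

The resulting number would be published on a schedule as a custom metric, so that an alarm can fire even when request traffic looks healthy.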
13. Observability for Regulated Case Management
A regulated case-management/enforcement system needs two telemetry planes: operational telemetry and audit/evidence records.
Operational telemetry can be sampled, aggregated, and expired. Evidence records often cannot.
Do not store regulatory evidence only in application logs. Logs are optimized for operations, not necessarily legal defensibility.
Minimum regulated observability model:
| Concern | Implementation direction |
|---|---|
| Who changed case state | Append-only audit event |
| Why changed | Reason code/comment reference |
| When changed | Server-side timestamp |
| Under what authority | Role/permission/context |
| Was transition valid | Policy/rule version |
| Was notification sent | Notification event/result |
| Was SLA breached | Case timer metric + event |
| Was evidence accessed | Access audit event |
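The who/why/when/authority columns above map naturally onto an immutable audit event. The sketch below is an illustration only: the field names are hypothetical, and the hash chain is one possible way to make tampering detectable in an append-only sequence, not a method prescribed by this series or a substitute for tamper-resistant storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEvent:
    case_id: str
    actor: str           # who changed the case state
    reason_code: str     # why it changed
    occurred_at: str     # server-side timestamp, ISO 8601
    authority: str       # role/permission context
    policy_version: str  # rule version that validated the transition
    from_state: str
    to_state: str


def chain_hash(prev_hash: str, event: AuditEvent) -> str:
    """Hash the event together with the previous entry's hash."""
    payload = json.dumps(asdict(event), sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: AuditEvent) -> None:
    """Append an event; each entry commits to everything before it."""
    prev_hash = log[-1]["hash"] if log else ""
    log.append({"event": event, "hash": chain_hash(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute the chain and report whether any entry was altered."""
    prev_hash = ""
    for entry in log:
        if entry["hash"] != chain_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_event(log, AuditEvent(
        "case-1", "officer-17", "DEADLINE_RISK", "2026-07-01T10:15:12Z",
        "supervisor", "policy-v3", "UNDER_REVIEW", "ESCALATED",
    ))
    print(verify(log))  # True
```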
14. Security and Privacy
Observability data often contains sensitive information. Treat telemetry as production data.
Rules:
- Do not log passwords, tokens, session cookies, private keys, or raw secrets.
- Avoid raw PII unless explicitly required and protected.
- Use structured redaction libraries.
- Encrypt log groups if policy requires customer-managed KMS keys.
- Set retention explicitly.
- Restrict CloudWatch Logs Insights access.
- Separate security logs from application debug logs.
- Store audit/evidence records in tamper-resistant storage when required.
- Monitor access to logs through CloudTrail.
- Use account boundary for centralized logging when appropriate.
Bad pattern:
logger.info("request={}", fullHttpRequest)
Better pattern:
logger.info("request received method={} route={} tenant={} correlationId={} contentLength={}", method, route, tenant, correlationId, contentLength)
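When a whole structure has to be logged, a redaction pass before serialization limits what can leak. This is a minimal sketch: the `SENSITIVE_KEYS` set and the `redact` function are hypothetical, and key-name matching alone will not catch secrets embedded in free-text values.

```python
# Key names treated as sensitive, compared after lowercasing and
# stripping "-" and "_" (an illustrative list, not a complete policy).
SENSITIVE_KEYS = {"password", "token", "authorization", "cookie", "setcookie", "secret", "privatekey"}


def _is_sensitive(key: str) -> bool:
    return key.lower().replace("_", "").replace("-", "") in SENSITIVE_KEYS


def redact(value):
    """Return a copy of `value` with sensitive dict entries masked, recursively."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if _is_sensitive(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    request = {
        "method": "POST",
        "route": "/cases/transition",
        "headers": {"Authorization": "Bearer abc", "Content-Length": "212"},
    }
    print(redact(request))
```

The function returns a new structure and leaves the original request untouched, so it is safe to call right before logging.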
15. Cost Engineering for Observability
Observability cost is real. Cost issues usually come from:
- Excessive log volume.
- Long retention for noisy logs.
- High-cardinality custom metrics.
- Too many dashboards/alarms with low value.
- Tracing every request in high-volume service.
- Copying logs to multiple vendors without filtering.
- Debug logs left on in production.
Cost controls:
| Control | Benefit |
|---|---|
| Retention by log group | Avoid infinite storage |
| Sampling | Reduce trace cost |
| EMF dimension discipline | Avoid metric explosion |
| Subscription filters | Export only useful streams |
| Log level governance | Reduce noise |
| Aggregated business metrics | Lower cardinality |
| Separate hot/cold storage | Lower long-term cost |
Operational rule:
Every new telemetry stream should have an owner, retention policy, security classification, and known use case.
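A cardinality review for custom metrics can start with simple arithmetic: the number of possible time series for one metric is bounded by the product of the distinct values of each dimension. The function name and the value counts below are assumptions for illustration.

```python
from math import prod


def max_series(distinct_values: dict) -> int:
    """Upper bound on time series for one metric name.

    distinct_values maps each dimension name to its number of distinct values.
    Real usage is often lower, because not every combination occurs.
    """
    return prod(distinct_values.values())


if __name__ == "__main__":
    controlled = {"Environment": 3, "Service": 20, "Route": 50}
    print(max_series(controlled))  # 3000

    with_user_id = dict(controlled, UserId=100_000)
    print(max_series(with_user_id))  # 300000000
```

Adding a single per-user dimension multiplies the bound by the number of users, which is why identifiers belong in logs or traces rather than in metric dimensions.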
16. Failure Modes
| Failure mode | Symptom | Prevention |
|---|---|---|
| No correlation IDs | Incident timeline breaks across services | Standard request/event envelope |
| Alarm on cause only | Pages for CPU but misses user impact | SLO/symptom alarms |
| High-cardinality metrics | Cost spike and unusable metrics | Dimension review |
| Logs contain secrets | Security incident | Redaction and policy tests |
| No deployment markers | Hard to connect incident with change | Emit deployment events |
| Async boundary loses trace | Cannot diagnose delayed failures | Propagate context through messages |
| Dashboard too broad | Slow diagnosis | Layered dashboards |
| No runbook linked to alarm | On-call improvises | Alarm-to-runbook mapping |
| Sampling hides failures | Traces missing for errors | Bias sampling toward errors |
| Audit mixed with debug logs | Regulatory evidence weak | Separate audit store |
17. Alarm-to-Runbook Mapping
Every production alarm should have:
alarmName: prod-case-workflow-high-transition-failure-rate
owner: case-platform-team
severity: Sev2
userImpact: case state transitions may fail
slo: case-transition-success-rate
firstChecks:
- check recent deployments
- check DynamoDB conditional failures
- check downstream event publish failures
- check IAM/KMS errors
mitigation:
- pause non-critical workflow consumers
- rollback latest deployment if error started after release
- increase provisioned capacity only if throttling confirmed
rollbackSafe: conditional
escalation:
- platform-oncall
- data-platform-oncall if persistence failure
If an alarm does not have an owner and first response steps, it is not production-ready.
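The production-readiness rule above can be enforced mechanically, for example in CI against the alarm definitions. The sketch assumes each definition has already been loaded into a dict with the keys from the example; `REQUIRED_FIELDS` and `missing_runbook_fields` are hypothetical names.

```python
# Fields an alarm definition must fill in before it may page anyone
# (taken from the example mapping above).
REQUIRED_FIELDS = (
    "alarmName", "owner", "severity", "userImpact",
    "firstChecks", "mitigation", "escalation",
)


def missing_runbook_fields(alarm: dict) -> list:
    """Return required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not alarm.get(field)]


if __name__ == "__main__":
    alarm = {
        "alarmName": "prod-case-workflow-high-transition-failure-rate",
        "owner": "case-platform-team",
        "severity": "Sev2",
        "userImpact": "case state transitions may fail",
        "firstChecks": ["check recent deployments"],
        "mitigation": [],
        "escalation": ["platform-oncall"],
    }
    print(missing_runbook_fields(alarm))  # ['mitigation']
```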
18. Deliberate Practice
Exercise 1: Build a Service Health Dashboard
For one service, define:
- Request rate.
- Error rate.
- p95/p99 latency.
- Dependency latency.
- Queue age if async.
- Current deployment version.
- SLO compliance.
Self-check:
- Can a new on-call identify impact in under 2 minutes?
- Can they see whether the issue is service or dependency?
- Can they find the latest deployment/change?
Exercise 2: Design Three Alarms
Create:
- One fast-burn SLO alarm.
- One slow-burn SLO alarm.
- One dependency saturation alarm.
Self-check:
- Which alarm pages humans?
- Which alarm opens ticket only?
- Which alarm triggers automation?
Exercise 3: Trace an Async Flow
Pick a flow:
API -> service -> SQS -> worker -> database -> notification
Propagate:
correlationId
traceId
tenantId
eventId
schemaVersion
Self-check:
- Can you reconstruct the full path from one user complaint?
- Can you find where latency accumulated?
- Can you replay safely if a message failed?
19. Production Checklist
[ ] Every service has owner and service catalog entry.
[ ] Every service emits structured logs.
[ ] Every request has correlation ID.
[ ] Async events carry correlation context.
[ ] Metrics separate service, dependency, and business signals.
[ ] SLO is defined for critical user journeys.
[ ] Page alarms are symptom/SLO-based.
[ ] Cause alarms are used for diagnosis or tickets.
[ ] Dashboards are layered by audience.
[ ] Logs have explicit retention.
[ ] Sensitive fields are redacted.
[ ] Trace sampling strategy is documented.
[ ] Deployment markers are visible.
[ ] Alarm has linked runbook.
[ ] Cost/cardinality review exists for custom metrics.
[ ] Audit/evidence records are separate from debug logs.
20. Summary
Observability engineering on AWS is the ability to build systems that can explain themselves when they fail.
Key points of Part 023:
- Observability is a runtime contract, not a dashboard.
- Metrics are for alarms and trends.
- Logs are for forensics and debugging.
- Traces are for the causal path.
- Events are for the change timeline.
- SLOs turn telemetry into an operational commitment.
- CloudWatch is a strong AWS-native baseline.
- X-Ray and OpenTelemetry help reveal the distributed path.
- The correlation ID is the backbone of cross-service investigation.
- Telemetry must be secure, cost-efficient, and owned.
In Part 024, we will cover how these observability signals are used in real operations: Systems Manager, Session Manager, Automation, OpsCenter, Incident Manager, runbooks, playbooks, patching, and incident response.
References
- AWS Documentation — What is Amazon CloudWatch?: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html
- AWS Documentation — What is CloudWatch Logs?: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html
- AWS Documentation — Embedded Metric Format: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html
- AWS Documentation — AWS X-Ray: https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html
- AWS Documentation — AWS Distro for OpenTelemetry and X-Ray: https://docs.aws.amazon.com/xray/latest/devguide/xray-services-adot.html
- AWS Documentation — OpenTelemetry in CloudWatch: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-OpenTelemetry-Sections.html