Observability Foundation: Logs, Metrics, Traces, and Events

Learn Kubernetes with Cloud Services AWS & Azure - Part 029

Production-grade Kubernetes observability foundation covering logs, metrics, traces, events, telemetry contracts, cardinality, correlation, SLO signals, dashboards, alerts, and incident workflows.

2026-07-03, 23 min read, 4458 words

Observability is not the same as installing Prometheus, Grafana, CloudWatch, Azure Monitor, Loki, OpenTelemetry, or any other tool.

Those are instruments.

Observability is the ability to answer operational questions from external evidence:

Is the system healthy?
Which users are affected?
What changed?
Which component is the bottleneck?
Is this a workload failure, node failure, network failure, dependency failure, or control-plane failure?
Is the service violating its SLO?
Can we prove the incident timeline?
Can we debug it without SSH-ing into nodes or guessing from incomplete logs?

In Kubernetes, this is harder than in a static VM platform because applications are ephemeral, Pods are replaced, nodes are drained, endpoints are dynamic, controllers reconcile asynchronously, and failures often cross layers.

The invariant for this part:

Kubernetes observability is reliable only when metrics, logs, traces, events, metadata, and SLOs are designed as one evidence system.

1. The Real Problem

A weak Kubernetes observability setup usually fails in the same way:

dashboards exist but do not explain user impact;
alerts fire but do not point to ownership or next action;
logs exist but cannot be correlated with request, Pod, node, deployment, or trace identity;
metrics exist but have uncontrolled cardinality or missing labels;
traces exist but stop at the ingress boundary or lose async context;
Kubernetes events exist but expire before someone investigates;
cloud logs exist but are separated from cluster and application signals;
incident reviews rely on screenshots and tribal memory.

A production-grade observability platform solves for evidence quality, not tool count.

The signal chain must be continuous.

If there is a gap between user symptom and infrastructure evidence, incident handling becomes speculation.

2. Observability vs Monitoring

Monitoring asks:

Did a known condition happen?

Observability asks:

Can we understand an unknown condition from emitted evidence?

You need both.

Capability	Monitoring	Observability
Primary question	Is a known thing broken?	Why is an unknown thing happening?
Input	Predefined thresholds	Rich telemetry and metadata
Output	Alert	Explanation path
Example	CPU > 90%	Requests to checkout fail only on zone B after rollout
Failure mode	Alert fatigue	Data overload / poor correlation

In production Kubernetes, monitoring without observability creates noisy alerts. Observability without monitoring creates beautiful dashboards that nobody watches.

3. The Four Evidence Streams

Kubernetes observability has at least four evidence streams.

Stream	Best at answering	Weakness
Metrics	How much? How often? How fast?	Usually loses per-request detail
Logs	What happened at a point in code?	Expensive, noisy, hard to aggregate
Traces	Where did one request spend time?	Requires instrumentation and context propagation
Events	What did Kubernetes decide?	Short retention, not a durable audit log

A mature platform also collects:

audit logs;
cloud load balancer logs;
VPC/VNet flow logs;
DNS logs;
controller logs;
admission webhook logs;
deployment metadata;
CI/CD release metadata;
feature flag changes;
incident timeline annotations.

But the four core evidence streams are the foundation.

4. Metrics Mental Model

Metrics are structured numeric measurements over time.

They are excellent for:

trend detection;
alerting;
SLO evaluation;
saturation analysis;
capacity planning;
rollout comparison;
anomaly detection;
fleet-level aggregation.

They are weak for:

exact request reconstruction;
detailed business context;
long exception payloads;
high-cardinality identifiers;
causal narratives.

4.1 Metric Types

Type	Meaning	Example
Counter	Monotonically increasing count	`http_requests_total`
Gauge	Current value	`queue_depth`
Histogram	Distribution buckets	`http_request_duration_seconds_bucket`
Summary	Client-side quantiles; cannot be aggregated across instances, so less common in centralized Prometheus setups	`request_duration_seconds{quantile="0.99"}`

For production services, prefer histograms for latency. Average latency hides tail behavior.

Bad:

avg_latency_ms = 120

Better:

p50 = 80ms
p90 = 220ms
p99 = 1300ms

The p99 is often where user pain lives.
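Percentiles like these are usually estimated from histogram buckets rather than raw samples. The following is a minimal Python sketch of that estimate, using linear interpolation inside the target bucket, similar in spirit to Prometheus's `histogram_quantile`. The bucket bounds and counts are made-up illustration data, and `estimate_quantile` is a hypothetical helper, not a library function.

```python
import math

# Cumulative buckets: (upper bound in seconds, requests with duration <= bound).
# Illustrative numbers only; the last entry plays the role of the +Inf bucket.
BUCKETS = [
    (0.05, 300), (0.1, 620), (0.25, 900),
    (0.5, 960), (1.0, 985), (2.5, 1000), (math.inf, 1000),
]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile by interpolating linearly inside the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf; report last finite bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


for q in (0.5, 0.9, 0.99):
    print(f"p{int(q * 100)} ~ {estimate_quantile(q, BUCKETS) * 1000:.0f} ms")
```

The estimate is only as precise as the bucket layout: wide buckets around the SLO threshold blur exactly the region you care about, so place bucket boundaries near the latency targets.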

4.2 Golden Signals

For user-facing services, start with four signals:

Signal	Meaning	Common Metric
Latency	How long requests take	request duration histogram
Traffic	How much demand arrives	requests/sec
Errors	How many requests fail	error rate by class
Saturation	How close resources are to limits	CPU, memory, queue, connection pool, thread pool

For infrastructure, use the USE method:

Signal	Meaning	Example
Utilization	How busy the resource is	CPU utilization
Saturation	How much queued demand exists	disk queue, pending Pods
Errors	Explicit failure count	network drops, disk errors

4.3 Kubernetes Metrics Layers

You need metrics at each layer.

Layer	Examples	Failure detected
Application	request rate, error rate, latency, queue lag	service regression
Pod/container	restarts, CPU, memory, throttling	resource pressure
Node	allocatable, disk pressure, network drops	node saturation
Cluster	pending Pods, API server latency, scheduling latency	platform bottleneck
Ingress/gateway	4xx/5xx, target health, TLS errors	edge failures
Cloud	load balancer, NAT, storage, IAM, DNS	provider integration failures

4.4 Cardinality Discipline

Cardinality is the number of unique time series produced by a metric.

This metric is dangerous:

http_requests_total{user_id="123", order_id="A-999", path="/orders/123/items/456"}

Why?

Because every user, order, and path variant creates new time series.

Better:

http_requests_total{method="GET", route="/orders/{orderId}/items/{itemId}", status_class="2xx"}

Use bounded labels:

service;
namespace;
route template;
method;
status class;
error class;
dependency name;
region;
zone;
version;
workload;
tenant tier, if bounded.

Avoid unbounded labels:

user ID;
session ID;
order ID;
email;
request ID;
IP address;
raw URL;
exception message;
SQL query text;
Kubernetes Pod UID as primary business dimension.

The invariant:

Metrics are for aggregation; logs and traces are for high-cardinality detail.
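A small Python sketch makes the difference measurable. It counts distinct label combinations for the same traffic labeled two ways: with raw paths and with normalized routes. The regex, the `{id}` placeholder, and the helper names are assumptions for illustration; in a real service the route template should come from the web framework's router rather than from pattern matching.

```python
import re

# Hypothetical patterns: numeric segments, order-style IDs such as "A-999", and UUIDs.
_ID_SEGMENT = re.compile(r"^(\d+|[A-Z]-\d+|[0-9a-f]{8}-[0-9a-f-]{27})$")


def normalize_route(path):
    """Replace identifier-like path segments with a fixed placeholder."""
    segments = path.strip("/").split("/")
    return "/" + "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)


def status_class(code):
    """Collapse an HTTP status code into a bounded class such as '2xx'."""
    return f"{code // 100}xx"


def series_count(label_sets):
    """Cardinality is the number of distinct label combinations."""
    return len({tuple(sorted(labels.items())) for labels in label_sets})


# Simulated traffic: 50 orders x 20 items, all successful GETs.
traffic = [("GET", f"/orders/{o}/items/{i}", 200) for o in range(50) for i in range(20)]

raw = [{"method": m, "path": p, "status": str(s)} for m, p, s in traffic]
bounded = [
    {"method": m, "route": normalize_route(p), "status_class": status_class(s)}
    for m, p, s in traffic
]

print("raw series:", series_count(raw))
print("bounded series:", series_count(bounded))
```

The raw labeling grows with every new order and item; the bounded labeling stays constant no matter how much traffic arrives.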

5. Logs Mental Model

Logs are timestamped records of application or system events.

They are useful when they are structured, correlated, and intentionally emitted.

They are harmful when they are treated as a dumping ground.

5.1 Kubernetes Logging Contract

In Kubernetes, application containers should generally write logs to stdout/stderr. The node/container runtime handles capture; a logging agent ships logs to central storage.

The application should not assume local log files survive Pod restart or node replacement.

5.2 Structured Logging

Bad:

Payment failed for user 123, order A-999, timeout from gateway

Better:

{
  "timestamp": "2026-07-03T09:15:21Z",
  "level": "ERROR",
  "service": "payment-api",
  "env": "prod",
  "trace_id": "8f2c1d...",
  "span_id": "41a7...",
  "request_id": "req-7c2e",
  "tenant_id": "tenant-42",
  "operation": "authorize_payment",
  "dependency": "card-gateway",
  "error_class": "dependency_timeout",
  "duration_ms": 3012,
  "message": "Payment authorization timed out"
}

Structured logs make query, aggregation, and correlation possible.
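One way to emit records in this shape from Python is a custom formatter on the standard `logging` module. This is a sketch under assumptions: `JsonFormatter`, the `fields` key passed through `extra=`, and the static service identity are illustrative choices, not a standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Hypothetical static identity; a real service would read this from configuration.
    STATIC_FIELDS = {"service": "payment-api", "env": "prod"}

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            **self.STATIC_FIELDS,
        }
        # Anything passed as extra={"fields": {...}} lands on the record as `record.fields`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()  # defaults to stderr, which the container runtime captures
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error(
    "Payment authorization timed out",
    extra={"fields": {
        "operation": "authorize_payment",
        "dependency": "card-gateway",
        "error_class": "dependency_timeout",
        "duration_ms": 3012,
    }},
)
```

Keeping the human-readable `message` constant and moving variable data into named fields is what makes the records aggregatable by `error_class` or `dependency`.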

5.3 Log Levels

Level	Meaning	Production guidance
DEBUG	Detailed diagnostic context	Disabled or sampled in prod by default
INFO	Important state transitions	Useful but bounded
WARN	Recoverable abnormal condition	Must be actionable or suppressible
ERROR	Failed operation requiring attention	Must include error class and correlation ID
FATAL	Process cannot continue	Rare; process exits

Do not use ERROR for expected business rejections.

A payment declined by the issuer is a business outcome, not necessarily an application error.

A timeout while calling the issuer is an application error.

5.4 Logs Must Carry Identity

Every log line should be joinable to:

service name;
environment;
version/build SHA;
namespace;
pod name;
container name;
node name;
trace ID;
span ID;
request ID;
tenant/customer context, if safe;
operation name;
dependency name;
error class.

Kubernetes metadata can be enriched by collectors, but application-level correlation IDs must come from the app or middleware.
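The application-side half of that split can be sketched with `contextvars` and a `logging.Filter`: middleware sets the IDs once per request, and the filter stamps them onto every record. The names `trace_id_var`, `CorrelationFilter`, and `handle_request` are hypothetical, and the IDs in the example are placeholders.

```python
import contextvars
import logging

# Set once per request by middleware; "-" means the log line has no request context.
trace_id_var = contextvars.ContextVar("trace_id", default="-")
request_id_var = contextvars.ContextVar("request_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation IDs onto every record passing through."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        record.request_id = request_id_var.get()
        return True  # never drop the record, only enrich it


def handle_request(logger, trace_id, request_id):
    """Hypothetical handler showing where middleware would set and reset the IDs."""
    trace_token = trace_id_var.set(trace_id)
    request_token = request_id_var.set(request_id)
    try:
        logger.info("authorizing payment")
    finally:
        trace_id_var.reset(trace_token)
        request_id_var.reset(request_token)


logger = logging.getLogger("payment.correlated")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s request_id=%(request_id)s %(message)s"
))
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

handle_request(logger, "0af7651916cd43dd8448eb211c80319c", "req-demo-1")
```

Because the IDs live in context variables, business code never has to pass them around explicitly, and they also follow `asyncio` tasks started within the request.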

5.5 Log Sampling

High-volume systems need sampling.

But sample carefully.

Good candidates for sampling:

successful debug-level request logs;
high-volume health checks;
repeated expected validation failures;
duplicate dependency warnings.

Bad candidates for blind sampling:

errors;
security events;
payment events;
audit events;
rare tail-latency samples;
workflow state transitions.

The platform should support dynamic log-level change, but it must be controlled and time-bound.
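A time-bound override can be as simple as remembering the baseline level and an expiry. The sketch below assumes something calls `tick()` periodically (a timer thread or scheduler); `TimeBoundLevel` is a hypothetical helper, and the access control around who may call `enable()` is out of scope here.

```python
import logging
import time


class TimeBoundLevel:
    """Temporarily change a logger's level and restore it after a deadline."""

    def __init__(self, logger, clock=time.monotonic):
        self.logger = logger
        self.clock = clock            # injectable so the expiry can be tested
        self.baseline = logger.level  # level to restore when the override expires
        self.expires_at = None

    def enable(self, level, ttl_seconds):
        """Apply the override and record when it must end."""
        self.logger.setLevel(level)
        self.expires_at = self.clock() + ttl_seconds

    def tick(self):
        """Call periodically; reverts to the baseline once the TTL has passed."""
        if self.expires_at is not None and self.clock() >= self.expires_at:
            self.logger.setLevel(self.baseline)
            self.expires_at = None


logger = logging.getLogger("payment.debuggable")
logger.setLevel(logging.INFO)

control = TimeBoundLevel(logger)
control.enable(logging.DEBUG, ttl_seconds=900)  # 15 minutes of DEBUG, then back to INFO
```

The expiry is the important part: a debug override that depends on someone remembering to turn it off becomes a cost and privacy problem.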

6. Traces Mental Model

A trace represents one request or workflow path across components.

A span represents one operation inside that path.

Traces are strongest for:

latency breakdown;
dependency fan-out;
critical path analysis;
retries;
cascading failures;
async propagation;
request-specific debugging.

They are weak for fleet-level alerting unless converted into metrics.

6.1 Trace Context Propagation

Every ingress request should receive or create a trace context.

That context must propagate through:

HTTP headers;
gRPC metadata;
messaging headers;
async jobs;
scheduled workflows;
outbound dependency calls.

For W3C Trace Context, the primary header is `traceparent`; an optional `tracestate` header carries vendor-specific data alongside it:

traceparent

If trace context stops at the first async boundary, distributed tracing becomes partial storytelling.
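The header itself is small enough to handle by hand, which helps when debugging propagation. A version-00 `traceparent` value has four lowercase hex fields: version, a 16-byte trace ID, an 8-byte parent span ID, and one byte of flags. The sketch below parses an incoming value and builds the outgoing one for the next hop. It is simplified: it accepts only version `00`, and marking new traces as sampled (`01`) is an assumption. In practice an OpenTelemetry SDK does this for you.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, span_id, flags


def outgoing_traceparent(incoming):
    """Continue the caller's trace if present, otherwise start a new one."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed:
        trace_id, _, flags = parsed
    else:
        trace_id, flags = secrets.token_hex(16), "01"  # assumption: new traces start sampled
    span_id = secrets.token_hex(8)  # this hop's span becomes the parent for the callee
    return f"00-{trace_id}-{span_id}-{flags}"


incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
print(outgoing_traceparent(incoming))
```

The same string can be placed in a message header or job payload, which is how the context survives queues and scheduled work.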

6.2 Span Naming

Bad span names:

GET
callService
process
query

Good span names:

HTTP POST /checkout
OrderService.createOrder
PaymentGateway.authorize
PostgreSQL orders.insert
Kafka publish order-created

A span name must identify intent, not just technology.

6.3 Sampling Strategy

Tracing everything is expensive at scale.

Common strategies:

Strategy	Use case	Risk
Head sampling	Decide at request start	May drop interesting failures
Tail sampling	Decide after seeing outcome	More powerful, more infrastructure
Error-biased sampling	Keep failed traces	May miss slow successful requests
Route-based sampling	Keep critical routes more often	Requires route taxonomy
Tenant-tier sampling	Higher fidelity for premium/regulated tenants	Governance complexity

For serious platforms, start with:

keep all error traces;
keep high-latency traces;
sample successful traces;
preserve traces for critical business operations;
ensure sampling decisions are documented.
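That starting policy can be written down as a single tail-sampling decision function, which also serves as its documentation. The thresholds, the sample rate, and the operation names below are hypothetical values to be replaced per service.

```python
import hashlib

LATENCY_THRESHOLD_MS = 1000   # hypothetical cutoff for "high latency"
SUCCESS_SAMPLE_RATE = 0.05    # hypothetical: keep 5% of ordinary successful traces
CRITICAL_OPERATIONS = {"authorize_payment", "create_order"}  # hypothetical names


def keep_trace(trace_id, had_error, duration_ms, operation):
    """Decide after the trace completes, so outcome and latency are known.

    Returns (keep, reason) so the reason can be recorded as a metric or attribute.
    """
    if had_error:
        return True, "error"
    if duration_ms >= LATENCY_THRESHOLD_MS:
        return True, "high_latency"
    if operation in CRITICAL_OPERATIONS:
        return True, "critical_operation"
    # Hash the trace ID so every collector makes the same choice for the same trace.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 0x100000000
    if bucket < SUCCESS_SAMPLE_RATE:
        return True, "sampled"
    return False, "dropped"


print(keep_trace("trace-a", had_error=True, duration_ms=40, operation="list_items"))
print(keep_trace("trace-b", had_error=False, duration_ms=2400, operation="list_items"))
```

Returning the reason alongside the decision makes it possible to count how many traces each rule keeps, which is what the cost review later needs.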

7. Kubernetes Events Mental Model

Kubernetes Events describe decisions and observations made by Kubernetes components.

Examples:

failed scheduling;
image pull failure;
probe failure;
backoff;
volume attach failure;
node not ready;
killing container;
scaled ReplicaSet;
load balancer provisioning error.

Events are operational gold, but they are not a durable event store.

A production platform should export Kubernetes events to central storage.

7.1 Event Categories

Category	Example	Likely owner
Scheduling	insufficient CPU, taint mismatch	platform + app
Image	pull backoff, auth failure	app/platform
Runtime	crash loop, OOMKilled	app
Network	endpoint not ready, LB provisioning fail	platform
Storage	PVC pending, attach timeout	platform/data
Policy	admission denied	platform/security
Node	node not ready, disk pressure	platform/cloud

7.2 Event Retention Problem

Native event retention is limited: the kube-apiserver `--event-ttl` flag defaults to one hour. During noisy failures, important context can expire or be aggregated into a repeat count before humans inspect it.

Therefore:

export events;
normalize event fields;
correlate events with workload, version, node, and deployment;
attach event timeline to incidents;
alert on selected event patterns.

Do not use events as your only observability source. Use them as control-plane evidence.
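Normalizing an exported event can be sketched as flattening it into one record with explicit join keys. The input shape assumed here is a core/v1 Event as returned by `kubectl get events -o json`; the sample event, the `cluster` value, and the set of alert-worthy reasons are made up for illustration.

```python
# Hypothetical selection of event reasons that should raise an alert.
ALERT_REASONS = {"FailedScheduling", "FailedAttachVolume"}


def normalize_event(event, cluster):
    """Flatten a core/v1 Event into one record that joins to workload telemetry."""
    obj = event.get("involvedObject") or {}
    source = event.get("source") or {}
    reason = event.get("reason", "")
    return {
        "cluster": cluster,
        "namespace": obj.get("namespace", ""),
        "kind": obj.get("kind", ""),
        "name": obj.get("name", ""),
        "type": event.get("type", "Normal"),
        "reason": reason,
        "message": event.get("message", ""),
        "count": event.get("count", 1),
        "last_seen": event.get("lastTimestamp") or event.get("eventTime"),
        "component": source.get("component", ""),
        "alert": event.get("type") == "Warning" and reason in ALERT_REASONS,
    }


sample = {
    "type": "Warning",
    "reason": "FailedScheduling",
    "message": "0/3 nodes are available: 3 Insufficient cpu.",
    "count": 4,
    "lastTimestamp": "2026-07-03T09:15:21Z",
    "involvedObject": {"kind": "Pod", "namespace": "payments", "name": "payment-api-demo"},
    "source": {"component": "default-scheduler"},
}

print(normalize_event(sample, cluster="prod-demo"))
```

Version, node, and deployment revision are not on the event itself; they have to be joined in from workload metadata using the namespace, kind, and name fields.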

8. Metadata Is the Join Key

Telemetry without metadata creates isolated evidence.

Every signal should be joinable by common dimensions.

8.1 Required Kubernetes Dimensions

Dimension	Why it matters
cluster	multi-cluster routing and ownership
cloud provider	AWS/Azure comparison
region	regional incident isolation
zone	AZ/zone failure detection
namespace	tenant/team boundary
workload	application ownership
pod	runtime instance
container	sidecar/main distinction
node	infrastructure correlation
image digest	artifact identity
app version	rollout diagnosis
deployment revision	change correlation

8.2 Required Business Dimensions

Use bounded dimensions only.

Dimension	Example
product domain	billing, order, case-management
operation	create_case, approve_order
tenant tier	free, enterprise, regulated
channel	api, batch, portal
criticality	tier-0, tier-1, tier-2

Do not put raw PII or unbounded customer identifiers into metrics labels.

8.3 Label Contract

A platform should define a standard label contract.

Example Kubernetes labels:

metadata:
  labels:
    app.kubernetes.io/name: payment-api
    app.kubernetes.io/instance: payment-api-prod
    app.kubernetes.io/version: "2026.07.03-1a2b3c4"
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: checkout
    app.kubernetes.io/managed-by: argocd
    platform.company.io/tier: tier-1
    platform.company.io/owner: payments

The same identity should appear in logs, metrics, traces, dashboards, alerts, and runbooks.
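A contract is only useful if it is checked. As a sketch, the check can run in CI against manifests that have already been parsed into dictionaries. The required set mirrors the example labels above; treating all of them as mandatory and restricting `tier` to three values are assumptions a platform team would set for itself.

```python
REQUIRED_LABELS = {
    "app.kubernetes.io/name",
    "app.kubernetes.io/instance",
    "app.kubernetes.io/version",
    "app.kubernetes.io/component",
    "app.kubernetes.io/part-of",
    "app.kubernetes.io/managed-by",
    "platform.company.io/tier",
    "platform.company.io/owner",
}
ALLOWED_TIERS = {"tier-0", "tier-1", "tier-2"}  # assumed value set


def contract_violations(manifest):
    """Return a list of human-readable problems; an empty list means compliant."""
    labels = manifest.get("metadata", {}).get("labels") or {}
    problems = [f"missing label: {key}" for key in sorted(REQUIRED_LABELS - set(labels))]
    tier = labels.get("platform.company.io/tier")
    if tier is not None and tier not in ALLOWED_TIERS:
        problems.append(f"invalid tier: {tier}")
    return problems


manifest = {
    "kind": "Deployment",
    "metadata": {
        "name": "payment-api",
        "labels": {
            "app.kubernetes.io/name": "payment-api",
            "app.kubernetes.io/version": "2026.07.03-1a2b3c4",
            "platform.company.io/tier": "gold",
        },
    },
}

for problem in contract_violations(manifest):
    print(problem)
```

The same rule set could be enforced at admission time instead of in CI; the point is that the contract is machine-checked rather than documented and hoped for.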

9. Service-Level Observability

Infrastructure dashboards are not enough.

The first dashboard for a service should answer:

Are users succeeding?
How slow is the service?
Which dependencies are failing?
Was there a recent deployment?
Is capacity saturated?
Which region/zone/cluster is affected?

9.1 RED Dashboard

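For request/response services, RED covers Rate, Errors, and Duration. The saturation panel is an addition borrowed from the golden signals: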

Panel	Metric
Request rate	requests per second by route/status
Error rate	5xx, dependency errors, timeout errors
Duration	p50/p90/p99 latency
Saturation	CPU, memory, queue, DB pool, thread pool

9.2 Worker Dashboard

For async consumers:

Panel	Metric
Input rate	messages/events per second
Processing rate	successful messages per second
Error rate	failed/retried/dead-lettered
Lag/backlog	queue depth, consumer lag
Age	oldest message age
Saturation	worker concurrency, CPU, memory

9.3 Batch/Cron Dashboard

For jobs:

Panel	Metric
Last success time	timestamp gauge
Duration	histogram
Records processed	counter
Failure count	counter
Retry count	counter
Data freshness	age of generated artifact/data

Do not alert on a CronJob Pod failing once if the retry succeeds and the SLO is intact.

Alert on missed completion or stale business output.

10. SLO-Driven Observability

A production observability system should be shaped by SLOs.

An SLO converts telemetry into service reliability commitments.

Example:

99.9% of checkout API requests over 30 days must complete successfully under 500ms, excluding client-side 4xx errors.

This implies:

good request counter;
correct success/error classification;
latency histogram;
route-level dimensions;
burn-rate alerting;
dashboard linking to traces/logs;
incident policy when error budget burns too fast.
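The classification rule hidden in that SLO sentence is worth writing out, because it is where most SLI bugs live. The sketch below applies the example SLO to a batch of requests: 4xx responses are excluded, and a success that takes 500 ms or longer counts as bad. The `Request` type and the sample traffic are illustrative.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999
LATENCY_THRESHOLD_MS = 500


@dataclass
class Request:
    status: int
    duration_ms: float


def classify(req):
    """Return 'excluded', 'good', or 'bad' for one checkout request."""
    if 400 <= req.status < 500:
        return "excluded"  # client-side 4xx counts neither for nor against the SLO
    if req.status < 400 and req.duration_ms < LATENCY_THRESHOLD_MS:
        return "good"
    return "bad"  # a 5xx, or a success that was too slow


def sli_and_budget(requests):
    """Compute the SLI and how much of the error budget remains."""
    outcomes = [classify(r) for r in requests]
    valid = sum(o != "excluded" for o in outcomes)
    if valid == 0:
        return None
    good = outcomes.count("good")
    bad = valid - good
    allowed_bad = (1 - SLO_TARGET) * valid
    return {
        "sli": good / valid,
        "bad": bad,
        "allowed_bad": allowed_bad,
        "budget_remaining": 1 - bad / allowed_bad,
    }


traffic = (
    [Request(200, 120)] * 9992   # fast successes
    + [Request(200, 900)] * 5    # successful but too slow
    + [Request(503, 40)] * 3     # server errors
    + [Request(404, 15)] * 200   # client errors, excluded
)
print(sli_and_budget(traffic))
```

With 10,000 valid requests the budget allows about 10 bad ones; 8 have been spent, so roughly a fifth of the budget remains.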

10.1 SLI Design

Service	Good SLI	Bad SLI
API	successful non-5xx requests under threshold	CPU < 80%
Worker	events processed before age threshold	Pod is Running
Batch	output produced before deadline	Job Pod succeeded once
Gateway	successful routed requests	load balancer exists
Storage dependency	successful operations + latency	disk usage only

Infrastructure metrics are useful, but user-facing SLOs should be based on user-visible outcomes.

10.2 Burn-Rate Alerts

A basic threshold alert says:

Error rate > 1%.

A burn-rate alert says:

The service is consuming its monthly error budget too fast.

This reduces noise because it ties alerting to reliability impact.

Example alert classes:

Alert	Window	Meaning
Fast burn	5m + 1h	page now
Slow burn	30m + 6h	urgent business hours
Ticket	6h + 3d	investigate trend
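Burn rate is the observed error ratio divided by the error budget ratio, so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below evaluates the three alert classes from the table using a long and a short window each. The thresholds 14.4, 6, and 1 follow the worked example in the Google SRE Workbook for a 30-day SLO and should be treated as starting points; the error ratios fed in at the end are made-up inputs.

```python
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET

# (name, long window, short window, burn-rate threshold, action)
POLICIES = [
    ("fast_burn", "1h", "5m", 14.4, "page now"),
    ("slow_burn", "6h", "30m", 6.0, "urgent business hours"),
    ("ticket", "3d", "6h", 1.0, "investigate trend"),
]


def burn_rate(error_ratio):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / ERROR_BUDGET


def evaluate(error_ratio_by_window):
    """Fire a policy only when both its long and short windows exceed the threshold."""
    firing = []
    for name, long_w, short_w, threshold, action in POLICIES:
        long_burn = burn_rate(error_ratio_by_window[long_w])
        short_burn = burn_rate(error_ratio_by_window[short_w])
        if long_burn >= threshold and short_burn >= threshold:
            firing.append((name, action))
    return firing


# Illustrative error ratios per window, e.g. from recording rules.
observed = {"5m": 0.02, "30m": 0.018, "1h": 0.016, "6h": 0.004, "3d": 0.0012}
print(evaluate(observed))
```

The short window is what makes the alert reset quickly once the problem stops; the long window is what keeps a brief spike from paging anyone.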

11. Alert Design

An alert is not a metric condition.

An alert is an operational contract.

Every alert should have:

owner;
severity;
user impact statement;
condition;
threshold rationale;
runbook link;
dashboard link;
expected first action;
dependency context;
suppression/maintenance strategy.

11.1 Page vs Ticket

Page only when:

users are affected now;
error budget is burning fast;
data loss/corruption risk exists;
security boundary is breached;
platform capacity is blocking critical workloads;
automated recovery failed.

Ticket when:

trend is unhealthy but not urgent;
capacity will run out later;
certificate expires in weeks;
version is approaching end of support;
policy drift needs cleanup.

11.2 Bad Alerts

Avoid:

CPU > 80% without user impact;
any Pod restart count > 0;
node not ready for 30 seconds;
deployment replica mismatch during normal rollout;
transient load balancer provisioning delay;
every WARN log.

Good alerts encode symptoms and consequences.

12. Dashboard Design

Dashboards are for diagnosis, not decoration.

A useful dashboard follows the incident path.

12.1 Service Dashboard Layout

Suggested order:

SLO and error budget;
request rate, errors, latency;
route/operation breakdown;
dependency calls;
Kubernetes workload health;
resource saturation;
recent deployment/change markers;
trace/log/event links;
runbook links.

12.2 Platform Dashboard Layout

API server health;
scheduler pending Pods;
node readiness;
node pressure;
CNI/IP capacity;
DNS/CoreDNS health;
ingress/gateway health;
storage attach/provisioning;
autoscaler activity;
admission webhook errors;
cluster add-on health;
cloud quota/capacity symptoms.

13. Change Correlation

A large share of incidents is triggered by change.

Your observability system must show changes directly on timelines.

Capture:

deployment time;
image digest;
Git commit;
chart version;
config version;
feature flag changes;
policy changes;
node upgrades;
add-on upgrades;
cloud infrastructure changes;
certificate rotation;
DNS changes.

Without change markers, engineers waste time asking “what changed?” during every incident.

14. Control-Plane Observability

Kubernetes control plane failures often appear as application issues.

Examples:

Pods stay Pending;
rollouts hang;
Services do not update endpoints;
admission webhooks time out;
CRD controllers lag;
scheduler latency increases;
API server throttles clients;
DNS updates lag.

For managed Kubernetes, AWS/Azure own most control-plane internals, but you still own detecting symptoms and escalating with evidence.

Track:

API server request latency;
API server error rate;
client throttling;
scheduler pending Pods;
controller reconciliation failures;
webhook latency/error rate;
CRD controller workqueue depth;
kubelet/node health;
add-on health.

15. Node and Runtime Observability

Node-level signals explain many workload symptoms.

Track:

CPU utilization;
CPU throttling;
memory working set;
OOM kills;
filesystem usage;
ephemeral storage pressure;
disk I/O;
network transmit/receive;
packet drops;
conntrack pressure;
kubelet errors;
container runtime errors;
node conditions;
image pull latency;
Pod sandbox creation failures.

Do not stop at Pod CPU and memory.

A service can look healthy in its application metrics while the node underneath approaches failure from disk pressure or Pod IP exhaustion in the CNI.

16. Dependency Observability

Most serious incidents are dependency-shaped.

Your service should emit dependency metrics:

Metric	Purpose
outbound request rate	traffic to dependency
dependency latency	slow dependency detection
dependency error rate	failure boundary
timeout count	timeout tuning
retry count	amplification detection
circuit breaker state	degraded mode visibility
pool saturation	local bottleneck

Dependency labels should be bounded:

dependency="postgres-orders"
dependency="redis-session"
dependency="payment-gateway"
dependency="case-workflow-engine"

Avoid raw URL and host cardinality unless carefully controlled.

17. Observability for Progressive Delivery

Rollouts need telemetry gates.

A safe rollout observes:

new version request rate;
new version error rate;
new version latency;
old vs new comparison;
dependency error delta;
Pod restart delta;
saturation delta;
business KPI delta;
canary trace samples;
Kubernetes events for new Pods.

Version labels are mandatory for progressive delivery.

Without versioned telemetry, canary analysis is guesswork.
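A telemetry gate reduces to comparing the same signals for two version labels. The sketch below is deliberately small and covers only request volume, error rate, and p99 latency. The thresholds and the `wait`/`promote`/`rollback` vocabulary are assumptions, not the behavior of any particular rollout controller.

```python
def error_rate(stats):
    """Errors divided by requests, or 0.0 when there is no traffic."""
    return stats["errors"] / stats["requests"] if stats["requests"] else 0.0


def canary_verdict(stable, canary, min_requests=500, max_error_delta=0.005, max_p99_ratio=1.25):
    """Compare canary telemetry against stable; return 'wait', 'promote', or 'rollback'."""
    if canary["requests"] < min_requests:
        return "wait"  # not enough traffic on the new version to judge
    if error_rate(canary) - error_rate(stable) > max_error_delta:
        return "rollback"  # new version fails noticeably more often
    if canary["p99_ms"] > stable["p99_ms"] * max_p99_ratio:
        return "rollback"  # new version has a noticeably slower tail
    return "promote"


# Illustrative numbers, as if queried with a version label on each metric.
stable = {"requests": 100_000, "errors": 100, "p99_ms": 400}
canary = {"requests": 5_000, "errors": 100, "p99_ms": 410}
print(canary_verdict(stable, canary))
```

The `wait` outcome matters as much as the other two: a canary that has seen too little traffic has not passed, it has simply not been tested yet.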

18. Observability Data Pipeline

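A common platform pipeline moves telemetry from applications and nodes, through collectors, into central storage that serves dashboards and alerts.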

18.1 Collector Placement

Pattern	Description	Use case
Sidecar collector	Per-Pod collector	strict isolation, expensive
DaemonSet collector	Per-node collector	node-local logs/metrics
Deployment collector	shared gateway collector	trace/metric aggregation
Managed agent	cloud-provided add-on	fastest managed path

A platform often uses a mix:

DaemonSet for logs/node metrics;
Deployment collector for traces/app metrics;
managed add-on for cloud integration;
sidecar only for special regulated or legacy cases.

18.2 Backpressure

Telemetry systems can fail too.

Design for:

collector CPU/memory limits;
queue capacity;
retry policy;
batching;
dropped telemetry counters;
sampling;
priority signals;
failure isolation;
cost guardrails.

A telemetry outage must not take down the application.
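On the application side, that rule usually means a bounded buffer that drops and counts rather than blocks. A minimal sketch, assuming a single producer thread per buffer (the `dropped` counter is not guarded by a lock); `TelemetryBuffer` is a hypothetical name.

```python
import queue


class TelemetryBuffer:
    """Bounded buffer: when full, new items are dropped and counted, never waited on."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # expose this as a metric so telemetry loss is itself visible

    def offer(self, item):
        """Try to enqueue without blocking the request path."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain_batch(self, max_items):
        """Called by the exporter: take up to max_items for one batched send."""
        batch = []
        while len(batch) < max_items:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


buffer = TelemetryBuffer(capacity=2)
for span in ("span-1", "span-2", "span-3"):
    buffer.offer(span)

print(buffer.drain_batch(max_items=10), "dropped:", buffer.dropped)
```

If the backend is slow or down, the exporter falls behind, the buffer fills, and the application keeps serving requests while the dropped counter records how much evidence was lost.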

19. Security and Privacy

Telemetry can leak sensitive data.

Never assume logs are safe just because they are internal.

Risk areas:

PII in logs;
secrets in environment dumps;
tokens in headers;
SQL parameters;
stack traces with payloads;
trace attributes containing customer data;
metrics labels with user identifiers;
audit logs with privileged operations;
cross-tenant observability access.

Guardrails:

structured logging schema;
redaction library;
allowlist fields;
denylist emergency filters;
separate audit retention;
access control by team/tenant;
encryption at rest;
retention policy;
legal hold procedure;
telemetry review in code review.
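An allowlist is the strongest of these guardrails because unknown fields are dropped by default. The sketch below combines an allowlist with one emergency denylist-style pattern for bearer tokens inside the message text. The field set and the regex are illustrative assumptions, not a complete redaction policy.

```python
import re

# Only these fields may leave the process; everything else is dropped silently.
ALLOWED_FIELDS = {
    "timestamp", "level", "service", "env", "trace_id", "span_id", "request_id",
    "operation", "dependency", "error_class", "duration_ms", "message",
}

# Emergency pattern for tokens that leak into free-text messages.
_BEARER = re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+")


def redact(record):
    """Return a copy of the record that is safe to ship to central storage."""
    clean = {key: value for key, value in record.items() if key in ALLOWED_FIELDS}
    if isinstance(clean.get("message"), str):
        clean["message"] = _BEARER.sub("Bearer [REDACTED]", clean["message"])
    return clean


record = {
    "level": "ERROR",
    "operation": "authorize_payment",
    "message": "upstream rejected header Authorization: Bearer abc.def.ghi",
    "card_number": "4111111111111111",   # not on the allowlist, so it is dropped
    "customer_email": "user@example.com",
}
print(redact(record))
```

A denylist alone fails open when a new sensitive field appears; the allowlist fails closed, which is the safer default for telemetry.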

20. Cost Model

Observability cost usually grows through:

high-cardinality metrics;
verbose logs;
duplicate collectors;
sampling rates that keep too many traces;
long retention for hot storage;
unbounded Kubernetes metadata;
debug logs left on;
per-Pod log sidecars;
too many dashboards and alerts nobody owns.

Use tiered retention:

Data	Hot retention	Cold retention
SLO metrics	long	very long
raw application logs	short-medium	archive if required
security/audit logs	policy-driven	long
traces	short	sampled/archived only
Kubernetes events	medium	incident-linked archive
deployment metadata	long	long

Cost control must not destroy incident evidence.

The correct goal is useful evidence per dollar, not minimum telemetry spend.

21. Failure Modes

21.1 You Have Dashboards but No Answers

Symptoms:

many dashboards;
no SLO view;
no service ownership;
no change markers;
no correlation IDs.

Fix:

redesign from incident questions;
create service golden dashboard;
enforce telemetry labels;
add deployment annotations.

21.2 Metrics Cardinality Explosion

Symptoms:

backend cost spike;
slow queries;
ingestion throttling;
missing samples.

Fix:

remove unbounded labels;
aggregate at collector;
normalize routes;
enforce metric linting in CI.

21.3 Logs Cannot Be Correlated

Symptoms:

logs exist but cannot join to traces;
incident timeline is assembled manually;
request ID missing.

Fix:

propagate trace/request ID;
inject Kubernetes metadata;
standardize log schema.

21.4 Traces Stop at Async Boundary

Symptoms:

HTTP trace exists;
queue/worker processing invisible;
cause of delay unknown.

Fix:

propagate trace context through message headers;
instrument producer and consumer;
model workflow spans.

21.5 Alerts Page for Non-Impact

Symptoms:

page fatigue;
engineers ignore alerts;
CPU alerts dominate.

Fix:

route alerts by SLO/user impact;
turn weak alerts into tickets;
require runbook and owner.

21.6 Events Disappear Before Investigation

Symptoms:

`kubectl describe` no longer shows the relevant event;
CrashLoopBackOff root cause unclear;
scheduler failure evidence lost.

Fix:

export events;
retain event timeline;
include events in incident evidence.

22. Reference Architecture

This architecture is intentionally tool-neutral.

The same mental model applies whether the backend is CloudWatch, Azure Monitor, Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, Honeycomb, Elastic, or a hybrid stack.

23. Production Checklist

23.1 Service Telemetry Checklist

23.2 Platform Telemetry Checklist

23.3 Operational Checklist

Every page alert has owner and runbook.
SLO dashboards exist for tier-0/tier-1 services.
Change markers are visible in dashboards.
Incident timeline can link logs, metrics, traces, events.
Retention policy matches compliance and incident needs.
Cost/cardinality review runs regularly.
Access control prevents cross-team/tenant leaks.

24. Deliberate Practice

Exercise 1 — Build a Signal Map

Choose one production service and map:

user-facing SLO;
application metrics;
logs;
traces;
Kubernetes workload signals;
node signals;
ingress signals;
dependency signals;
cloud provider signals.

Identify every missing join key.

Exercise 2 — Design a Minimal Golden Dashboard

Build a dashboard with only 10 panels.

It must answer:

Are users impacted?
Which operation is failing?
Which version is failing?
Which dependency is involved?
Is the cluster saturated?
What changed recently?

Exercise 3 — Kill the Bad Alerts

Review 20 existing alerts.

For each alert:

does it imply user impact?
does it have an owner?
does it have a runbook?
should it page, ticket, or be deleted?

Exercise 4 — Trace an Async Workflow

Instrument an HTTP request that publishes a message which a worker then consumes.

Success criteria:

trace crosses producer and consumer;
worker logs include same trace context;
failure in worker can be traced to original request;
queue delay is visible.

25. Summary

Kubernetes observability is not a dashboard collection.

It is an evidence architecture.

Metrics quantify symptoms. Logs explain local facts. Traces show request paths. Events reveal Kubernetes decisions. Metadata joins the evidence. SLOs decide what matters.

The mature engineer does not ask, “Which observability tool should we install?” first.

They ask:

What questions must we answer during failure, and what evidence must exist before the failure happens?

That question is the foundation for EKS and AKS observability in the next parts.
