

Observability on AKS

Learn Kubernetes with Cloud Services AWS & Azure - Part 031

Production-grade observability on Azure Kubernetes Service using Azure Monitor, Container Insights, Managed Prometheus, Azure Managed Grafana, diagnostic settings, activity logs, resource logs, network observability, and AKS-specific runbooks.

2026-07-03 | 24 min read | 4630 words



Part 031 — Observability on AKS

AKS observability is not just "install Prometheus".

AKS runs inside Azure, so every incident crosses at least two worlds:

Kubernetes evidence: Pods, Deployments, Services, Events, Nodes, probes, workloads.
Azure evidence: Azure Monitor, Log Analytics, Managed Prometheus, Managed Grafana, Activity Logs, Resource Logs, NSG flow evidence, load balancer metrics, Application Gateway metrics, identity, Key Vault, ACR, disk, network, and control-plane diagnostics.

The invariant:

AKS observability is production-grade only when Kubernetes symptoms and Azure platform evidence can be correlated into one timeline.

If your dashboard can show CPU but cannot answer "why did checkout return 502s for 9 minutes after the node pool upgrade?", it is not an operations system. It is only instrumentation.

1. What AKS Adds to Kubernetes Observability

Generic Kubernetes observability gives you:

Pod status;
container restarts;
Deployment rollout state;
Service endpoints;
cluster events;
kubelet/node metrics;
logs;
traces;
application metrics.

AKS adds Azure-specific operational surfaces:

Azure Monitor metrics;
Log Analytics workspace;
Container Insights;
Azure Monitor managed service for Prometheus;
Azure Managed Grafana;
diagnostic settings;
AKS control-plane logs;
Azure Activity Log;
Azure Resource Graph;
Entra ID and Azure RBAC audit trails;
Managed Identity evidence;
ACR pull errors and registry logs;
Azure Load Balancer and Application Gateway evidence;
Azure CNI, Network Observability, NSG, UDR, NAT Gateway, and Azure Firewall evidence;
Azure Disk / Azure Files storage metrics;
Key Vault diagnostics;
node resource group resources;
VMSS / node pool health signals.

AKS is managed, but the visibility contract is still yours.

The important thing is not the number of tools.

The important thing is whether they explain the same production event from different angles.

2. AKS Observability Stack: Responsibility Split

A production AKS observability architecture usually has these layers.

Layer	Primary evidence	Typical Azure capability	Why it matters
Control plane	API server, scheduler, controller manager, cluster autoscaler logs	AKS diagnostic settings to Log Analytics / storage / Event Hub	Explains control-plane and scaling behavior
Node	CPU, memory, disk, kubelet, node condition, VMSS health	Container Insights, VM metrics	Explains capacity, eviction, node failure
Pod/container	restarts, logs, resource usage, probes	Container Insights, Log Analytics	Explains workload symptoms
Application	RED metrics, traces, domain logs	OpenTelemetry, App Insights, Managed Prometheus	Explains user impact
Network	DNS, CNI, NSG, LB/App Gateway, flow logs	Azure Monitor, Network Watcher, Advanced Container Networking Services	Explains connectivity and traffic loss
Identity	token exchange, RBAC, Entra ID, managed identity	Activity Log, Entra logs, Azure resource diagnostics	Explains denied access and auth failures
Storage	attach/mount, latency, throttling, capacity	Azure Disk/File metrics, CSI logs	Explains I/O and stateful workload failure
Delivery	GitOps sync, Helm release, rollout, admission	Argo/Flux metrics, Kubernetes events	Explains deployment-induced incidents

The cloud-native trap:

Engineers often monitor Pods, but incidents usually happen at boundaries: DNS, identity, ingress, storage, node pool, quota, or policy.

3. Container Insights vs Managed Prometheus vs Managed Grafana

These are not interchangeable.

Capability	What it is	Best for	Not enough for
Container Insights	Azure Monitor feature for container/cluster health and logs	cluster inventory, node/pod/container health, logs, portal-driven investigation	high-cardinality app metrics, PromQL-heavy SLOs
Managed Prometheus	Azure Monitor managed Prometheus-compatible metrics platform	Kubernetes/app metrics, PromQL, alerting, Grafana dashboards	logs, traces, raw event narratives
Azure Managed Grafana	Managed Grafana visualization service	dashboarding, Prometheus visualization, cross-source views	storage of telemetry itself
Log Analytics	log query and analytics workspace	container logs, diagnostic logs, KQL investigations	Prometheus-native metric workflows
Application Insights	application performance monitoring	traces, dependencies, request telemetry	cluster-level Kubernetes object view

A practical rule:

use Container Insights for cluster/container health and log investigation;
use Managed Prometheus for metric-driven SLOs and Kubernetes/app metrics;
use Managed Grafana for dashboards;
use Log Analytics/KQL for forensic log and platform diagnostics;
use OpenTelemetry/Application Insights for distributed tracing and application dependency evidence.

Do not force one backend to do everything.

4. Minimum Production Signal Set

Before discussing tool choice, define the evidence contract.

A production AKS platform should answer these questions quickly:

Cluster

Is the Kubernetes API reachable?
Are control-plane diagnostics enabled?
Are node pools healthy?
Are upgrades in progress?
Are add-ons degraded?
Are there quota or capacity constraints?

Nodes

Are nodes Ready?
Are nodes under CPU/memory/disk/PID pressure?
Are kubelet/container runtime errors increasing?
Are system node pools isolated from user workloads?
Are Spot/ephemeral node failures expected or abnormal?

Workloads

Which Deployment changed before the incident?
Which Pods restarted?
Which containers failed probes?
Which workloads are Pending, OOMKilled, CrashLoopBackOff, or ImagePullBackOff?
Which namespaces exceed request/limit budgets?

Traffic

Is DNS resolving?
Is ingress accepting traffic?
Are backend health checks passing?
Are Services backed by ready endpoints?
Are Application Gateway / Load Balancer metrics healthy?
Are NetworkPolicy or NSG rules blocking traffic?

Identity

Are workload identity token exchanges succeeding?
Are Managed Identity permissions correct?
Are Key Vault, Storage, ACR, or Service Bus calls denied?
Did RBAC or Azure role assignments change?

Storage

Are volumes attaching and mounting?
Is disk latency increasing?
Are Azure Disk/File throttling limits hit?
Are CSI sidecars healthy?

Application

Is user-facing latency above SLO?
Are errors concentrated in one dependency, route, zone, or node pool?
Are traces complete enough to locate the failing hop?
Are logs structured and correlated with trace/request IDs?

5. Reference AKS Observability Architecture

This architecture has one main design goal:

Use metrics for detection, logs/events for explanation, traces for request path, and Azure diagnostics for platform boundary proof.

6. AKS Control-Plane Diagnostics

Control-plane observability is often ignored because AKS is managed.

That is a mistake.

You do not manage the AKS control-plane infrastructure, but you still need evidence from it.

Important diagnostic categories can include, depending on cluster features/version:

kube-apiserver;
kube-audit;
kube-audit-admin;
kube-controller-manager;
kube-scheduler;
cluster-autoscaler;
cloud-controller-manager;
guard/admission-related logs;
CNI/network-related diagnostics where supported.

The operational value:

Signal	What it helps diagnose
API server logs	request failures, latency, authentication/authorization failures
Audit logs	who changed what, when, from where
Scheduler logs	scheduling failures and constraints
Controller manager logs	reconciliation and resource controller problems
Cluster autoscaler logs	scale-up/scale-down decisions
Cloud controller logs	Azure load balancer/storage/cloud integration issues

The practical design:

enable diagnostic settings during cluster creation, not after the first outage;
route logs to Log Analytics for interactive investigation;
optionally route copies to storage/Event Hub for retention/SIEM;
define retention by compliance and incident analysis needs;
exclude high-noise categories only with evidence, not intuition.
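
The baseline rule above can be checked mechanically. The following is a minimal Python sketch, assuming you have already collected the names of the enabled log categories from the cluster's diagnostic setting (how you fetch them is not shown). The REQUIRED_CATEGORIES baseline and the function name are illustrative assumptions, not an official list; the categories actually available depend on cluster features and version.

```python
# Hypothetical baseline; adjust to the categories your clusters actually expose.
REQUIRED_CATEGORIES = frozenset({
    "kube-apiserver",
    "kube-audit-admin",
    "kube-controller-manager",
    "kube-scheduler",
    "cluster-autoscaler",
    "guard",
})


def missing_categories(enabled, required=REQUIRED_CATEGORIES):
    """Return the required diagnostic categories that are not enabled, sorted."""
    return sorted(set(required) - set(enabled))
```

A check like this could run in a pipeline for every cluster, so a missing category is found at creation time rather than during an incident.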

Example Diagnostic Intent

Do not think in terms of "enable all logs forever".

Think in terms of queries you need during incident response:

who changed RBAC before failed deployment?
why did the scheduler reject these Pods?
did cluster autoscaler refuse scale-up because the node pool reached its maximum node count?
did a webhook timeout block admission?
did the API server reject a deprecated API after upgrade?

7. Container Insights: Cluster and Container Evidence

Container Insights is useful because it gives a managed baseline for cluster health and workload evidence.

Use it for:

node inventory;
node utilization;
pod/container inventory;
restarts;
container logs;
namespace-level view;
cluster health visualizations;
quick portal investigation;
KQL-based analysis.

But avoid treating it as the whole observability strategy.

It is a strong default for AKS operations, but application SLOs usually need app metrics and traces.

Useful Investigation Questions

Container Insights should help answer:

Which namespace is producing most restarts?
Which node has abnormal memory pressure?
Which pods are pending or failed?
Which container logs include errors after a deployment?
Which workload consumes unexpected resources?
Which node pool is hot or imbalanced?

Design Rules

Label namespaces and workloads consistently.
Standardize app labels: app.kubernetes.io/name, app.kubernetes.io/component, app.kubernetes.io/part-of, app.kubernetes.io/version.
Keep logs structured.
Include request/correlation IDs in logs.
Avoid logging secrets, tokens, headers, or PII.
Set retention intentionally.
Use sampling or routing for high-volume debug logs.
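
The rule about secrets can be partly enforced in code. Below is a minimal Python sketch of a redaction filter for key=value log lines. The word list and function name are assumptions for illustration, and the pattern only handles unquoted, single-token values. An explicit allow-list of structured fields is the more robust approach.

```python
import re

# Illustrative list of sensitive key fragments; extend it for your own fields.
SENSITIVE_WORDS = ("password", "secret", "token", "api_key")

_SENSITIVE_PAIR = re.compile(
    r"\b(\w*(?:" + "|".join(SENSITIVE_WORDS) + r")\w*)=(\S+)",
    re.IGNORECASE,
)


def redact(line):
    """Mask the value of any key=value pair whose key contains a sensitive word."""
    return _SENSITIVE_PAIR.sub(lambda match: f"{match.group(1)}=[REDACTED]", line)
```

For example, `redact("level=info access_token=abc123 user=svc")` keeps `level` and `user` but masks the token value.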

8. Managed Prometheus: Metric Plane

Managed Prometheus gives a Prometheus-compatible metric model without forcing each platform team to operate Prometheus storage.

Use it for:

Kubernetes metrics;
node metrics;
kube-state-metrics-style object state;
application metrics;
service-level indicators;
PromQL alerting;
Grafana dashboards;
network observability metrics where supported.

Metric Taxonomy

Metric type	Examples	Purpose
Saturation	CPU, memory, disk, pod capacity, connection count	capacity and scaling
Traffic	request rate, queue depth, consumer lag	demand
Errors	HTTP 5xx, failed jobs, dependency failures	correctness
Latency	p50/p95/p99, dependency latency	user experience
State	replicas desired/available, pods ready, node ready	control-plane truth
Cost proxy	requested CPU/memory, idle nodes	FinOps

Anti-Patterns

Do not create metrics for everything.

Bad metrics are expensive and useless.

Common mistakes:

unbounded labels such as user ID, order ID, request ID;
route labels with raw IDs in path;
per-Pod alerts instead of service-level alerts;
dashboards without SLO thresholds;
alerting on CPU without user impact;
collecting verbose metrics with no owner;
mixing application health and infrastructure health into one ambiguous alert.
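
The raw-ID route label problem is usually solved by normalizing the path before it becomes a label value. The following Python sketch shows one way to do that. The placeholder names (:id, :uuid) and the function name are assumptions; if your web framework exposes the matched route template, prefer that over pattern matching.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_NUMERIC = re.compile(r"^\d+$")


def normalize_route(path):
    """Replace ID-like path segments with placeholders to bound label values."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = []
    for segment in path.split("/"):
        if _UUID.match(segment):
            segments.append(":uuid")
        elif _NUMERIC.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)
```

With this, `/orders/12345/items/7?expand=true` becomes `/orders/:id/items/:id`, so the number of distinct label values depends on the number of routes, not the number of orders.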

Production Metric Contract

Every app deployed to AKS should expose:

request rate;
error rate;
latency histogram;
dependency latency/error metrics;
queue depth / lag if asynchronous;
business outcome counters where relevant;
build/version labels;
readiness state or health metric if meaningful.
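
The contract asks for a latency histogram rather than a precomputed average because percentiles can then be estimated from bucket counts. The sketch below is a simplified Python version of the linear-interpolation approach used by Prometheus-style quantile estimation. The function name and the sample bucket layout are assumptions for illustration.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    with the last bound being float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into the open-ended bucket.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

The estimate is only as precise as the bucket boundaries, so it helps to place boundaries close to the SLO threshold you intend to alert on.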

9. Azure Managed Grafana: Operational Views

Grafana is not the monitoring system.

It is the operational view layer.

Dashboards should be designed by use case.

Dashboard 1: Executive SLO View

Audience: engineering leadership, incident commander.

Shows:

service availability;
error budget burn;
p95/p99 latency;
request volume;
active incidents;
top impacted services;
deployment markers.

Dashboard 2: Service Owner View

Audience: application team.

Shows:

RED metrics;
dependency breakdown;
pod restarts;
rollout status;
app version;
queue/consumer metrics;
JVM/.NET/Node runtime metrics if relevant.

Dashboard 3: Platform View

Audience: platform/SRE team.

Shows:

API server health;
node pool capacity;
pending pods;
cluster autoscaler activity;
namespace resource usage;
CoreDNS health;
ingress health;
NetworkPolicy/egress symptoms;
CSI health;
cost/capacity signals.

Dashboard 4: Edge and Network View

Audience: platform/network team.

Shows:

Application Gateway / Load Balancer metrics;
backend health;
4xx/5xx distribution;
TLS certificate expiry;
DNS symptoms;
SNAT/egress capacity;
network flow metrics/logs;
Azure Firewall/NAT Gateway evidence if used.

10. Logs: From Noise to Forensics

Logs are the most abused telemetry type.

A log strategy must define:

who produces logs;
what format is allowed;
what fields are mandatory;
what is forbidden;
where logs are stored;
how long they are retained;
how logs connect to metrics/traces;
what queries are part of incident playbooks.

Minimum Structured Log Fields

For application logs:

{
  "timestamp": "2026-07-03T10:15:30.123Z",
  "level": "ERROR",
  "service": "checkout-api",
  "version": "2026.07.03-42",
  "environment": "prod",
  "namespace": "payments",
  "pod": "checkout-api-7fbb99c7b8-x9n2c",
  "trace_id": "...",
  "span_id": "...",
  "request_id": "...",
  "operation": "createOrder",
  "error_code": "PAYMENT_PROVIDER_TIMEOUT",
  "message": "Payment provider timed out after retry budget exhausted"
}
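
One way to produce records of this shape is a custom formatter on top of Python's standard logging module. This is a minimal sketch, not a prescribed implementation: the JsonFormatter class name and the choice of which fields are static versus per-request are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, static_fields):
        super().__init__()
        # Fields that are constant for the process: service, version, environment, ...
        self.static_fields = dict(static_fields)

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
            **self.static_fields,
        }
        # Per-request fields arrive through logging's `extra=` mechanism.
        for key in ("trace_id", "span_id", "request_id", "operation", "error_code"):
            value = getattr(record, key, None)
            if value is not None:
                payload[key] = value
        return json.dumps(payload)
```

A service would attach it to a stream handler and log with `logger.error("...", extra={"trace_id": ..., "operation": "createOrder"})`, leaving namespace and pod name to be supplied from the environment.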

For platform logs, preserve Kubernetes and Azure metadata:

cluster name;
namespace;
workload name;
pod name;
container name;
node name;
node pool;
subscription;
resource group;
region;
resource ID;
correlation ID where available.

Bad Log Example

failed

This is not a log. It is an emotion.

Better Log Example

level=error service=checkout-api operation=createOrder dependency=payment-provider status=timeout attempt=3 max_attempts=3 timeout_ms=2000 trace_id=abc123

The second one can be queried, grouped, and joined with traces.
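
To make "can be queried" concrete, here is a small Python sketch that turns a key=value line like the one above into a dictionary. The function name is an assumption, and it relies on shell-style quoting rules from the standard library rather than on a formal logfmt specification.

```python
import shlex


def parse_logfmt(line):
    """Parse a key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:
            fields[key] = value
    return fields
```

Parsing the better example yields fields such as `dependency`, `status`, and `trace_id`. Parsing the bad example yields an empty dictionary, which is the practical difference between the two.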

11. Tracing and OpenTelemetry on AKS

Tracing answers a different question than metrics.

Metrics answer:

Is the service unhealthy?

Traces answer:

Which hop made this request slow or caused it to fail?

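For AKS, a practical tracing architecture is: services instrumented with OpenTelemetry send spans to an OpenTelemetry Collector in the cluster, which exports them to Application Insights or another tracing backend.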

Trace Contract

Every production trace should include:

service name;
environment;
version;
route or operation name;
status;
latency;
dependency spans;
error attributes;
tenant/customer tier only if allowed and low-cardinality;
Kubernetes metadata where possible.

Trace Anti-Patterns

recording 100% of traces during high traffic with no cost model;
sampling away rare errors;
using raw URL paths with IDs as span names;
missing dependency instrumentation;
no correlation between logs and traces;
no deployment version in telemetry.
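
The first two anti-patterns pull in opposite directions: keep everything and cost explodes, sample blindly and rare errors vanish. A common compromise is to decide after the trace completes. The Python sketch below illustrates that decision under assumed thresholds; the function name, the 1000 ms slow threshold, and the 5% baseline rate are hypothetical. In practice this logic usually lives in collector sampling configuration rather than in hand-written code.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_ms,
               slow_threshold_ms=1000, baseline_rate=0.05):
    """Decide whether to keep a completed trace.

    Errors and slow traces are always kept. The rest are sampled
    deterministically from the trace ID, so every collector instance
    makes the same decision for the same trace.
    """
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a fraction between 0 and 1.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < baseline_rate
```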

12. AKS Network Observability

Network incidents are hard because Kubernetes and Azure both participate in the packet path.

A single failed request may involve:

DNS;
Application Gateway / Load Balancer;
Gateway/Ingress controller;
Service;
EndpointSlice;
Pod readiness;
NetworkPolicy;
Azure CNI;
NSG;
UDR;
Azure Firewall;
NAT Gateway;
Private Endpoint DNS;
external dependency routing.

Network Evidence Map

Symptom	Evidence to collect
Client gets 502/503	ingress logs, backend health, endpoints, pod readiness
Pod cannot call dependency	DNS query, NetworkPolicy, NSG, UDR, firewall, identity, dependency logs
Egress intermittent	NAT/SNAT metrics, connection count, DNS, retry logs
East-west service fails	Service endpoints, CoreDNS, NetworkPolicy, pod labels
Private endpoint fails	private DNS zone, VNet link, UDR, NSG, firewall
Node cannot reach registry	ACR firewall/private endpoint, DNS, managed identity, route table

Debugging Principle

Never debug AKS networking from one layer only.

Use a layered path:

client -> Azure edge -> ingress/load balancer -> Service -> EndpointSlice -> Pod -> dependency

At each hop ask:

Is the object configured?
Is the backend healthy?
Is DNS resolving?
Is routing allowed?
Is policy allowing traffic?
Is identity required and valid?
Is the dependency healthy?
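
The layered walk can be expressed as a small routine that stops at the first hop with a failed or missing check. This Python sketch assumes you record check results per hop by hand or from scripts. The hop names mirror the path above, while the function name and the check names in the usage note are illustrative.

```python
# Hops in the order a request traverses them.
HOPS = ["client", "azure-edge", "ingress", "service", "endpointslice", "pod", "dependency"]


def first_failing_hop(results):
    """Return (hop, failed_checks) for the first hop with a failed check.

    results maps a hop name to a dict of check name -> bool. A hop with no
    recorded evidence is reported as failing "no-evidence", because an
    unchecked hop cannot be ruled out.
    """
    for hop in HOPS:
        checks = results.get(hop)
        if checks is None:
            return hop, ["no-evidence"]
        failed = [name for name, ok in checks.items() if not ok]
        if failed:
            return hop, failed
    return None
```

Treating "no evidence" as a failure is deliberate: it prevents skipping a layer simply because nobody looked at it.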

13. Identity Observability

AKS production workloads often depend on Azure identity:

Workload Identity;
Managed Identity;
Entra ID;
Azure RBAC;
Key Vault access policies/RBAC;
ACR pull permissions;
Storage/Service Bus/Event Hubs role assignments.

Identity failures usually look like application failures:

403 from Key Vault;
image pull failure from ACR;
failed token exchange;
missing federated credential;
wrong ServiceAccount annotation;
wrong namespace/service account subject;
role assignment not propagated;
tenant/subscription mismatch.

Evidence to Capture

For identity-related incidents, capture:

ServiceAccount name and namespace;
Pod labels and annotations;
federated identity credential subject;
user-assigned managed identity client ID;
Azure role assignments;
target resource scope;
Azure Activity Log around role assignment changes;
application SDK credential chain logs where safe;
Key Vault/Storage/ACR diagnostics.

Identity Debugging Checklist

kubectl -n payments get sa checkout-api -o yaml
kubectl -n payments get pod -l app=checkout-api -o yaml
kubectl -n payments describe pod <pod>
kubectl -n payments logs deploy/checkout-api --since=30m

Then verify Azure side:

does the managed identity exist?
does federated credential subject match exactly?
does role assignment exist at the correct scope?
did propagation delay occur?
are resource firewall/private endpoint rules blocking access?
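
The "match exactly" check is easy to get wrong by eye. For workload identity, the federated credential subject follows the form system:serviceaccount:<namespace>:<service-account>, and the comparison is case-sensitive. A minimal Python sketch of that check follows; the function names are assumptions.

```python
def expected_subject(namespace, service_account):
    """Build the subject a federated credential must carry for a ServiceAccount."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def subject_matches(configured_subject, namespace, service_account):
    """Exact, case-sensitive comparison; a near miss is still a failure."""
    return configured_subject == expected_subject(namespace, service_account)
```

For the checkout-api ServiceAccount in the payments namespace, the expected subject is `system:serviceaccount:payments:checkout-api`. A credential created for `payment` (singular) fails the token exchange even though it looks correct at a glance.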

14. Storage Observability

AKS storage failures often involve Azure Disk, Azure Files, CSI drivers, zones, and node pools.

Common symptoms:

Pod stuck Pending because PVC cannot bind;
Pod stuck ContainerCreating because volume mount fails;
attach/detach timeout;
disk zone mismatch;
Azure Disk throttling;
Azure Files latency;
permission problems with SMB/NFS;
CSI driver crash;
StatefulSet rollout blocked by volume issue.

Storage Evidence

Layer	Evidence
Kubernetes	PVC/PV events, Pod events, StatefulSet status
CSI	CSI controller/node logs
Azure	Disk/File metrics, activity logs, quota, zone, resource health
App	I/O latency, timeout, transaction failure

Storage Runbook Skeleton

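kubectl -n data get pvc
kubectl -n data describe pvc <pvc>
kubectl -n data describe pod <pod>
# List CSI driver pods. With AKS-managed CSI drivers the controller may run in the managed control plane,
# in which case only node pods appear here and controller logs come from diagnostic settings.
kubectl -n kube-system get pods -o wide | grep csi
# CSI pods run several containers, so collect logs from all of them.
kubectl -n kube-system logs <csi-controller-pod> --all-containers --since=30m
kubectl -n kube-system logs <csi-node-pod> --all-containers --since=30m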

Then check:

StorageClass;
volume binding mode;
availability zone;
disk quota;
node pool zone;
CSI driver version;
Azure resource health and metrics.
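
The event reasons shown by kubectl describe usually indicate which item on this list to check first. The Python sketch below maps a few common Kubernetes event reasons to a suggested next step. The reasons are standard Kubernetes ones, but the mapping and the function name are illustrative assumptions, not an exhaustive or authoritative table.

```python
# Event reason -> suggested next check (illustrative, not exhaustive).
NEXT_STEP = {
    "ProvisioningFailed": "check StorageClass parameters, disk quota, and zone",
    "FailedScheduling": "check volume binding mode and node pool zones",
    "FailedAttachVolume": "check attach/detach state, disk zone vs node zone, CSI logs",
    "FailedMount": "check CSI node logs and SMB/NFS permissions",
}


def triage_storage_events(reasons):
    """Map observed event reasons to suggested checks, in first-seen order."""
    steps = []
    for reason in reasons:
        step = NEXT_STEP.get(reason)
        if step and (reason, step) not in steps:
            steps.append((reason, step))
    return steps
```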

15. Alerts: From Symptoms to Action

A production alert must be actionable.

Bad alert:

CPU above 80%.

Better alert:

checkout-api p95 latency exceeds SLO for 10 minutes while error budget burn rate is above threshold.

Better platform alert:

More than 20% of production Pods are Pending for 10 minutes due to insufficient CPU in user node pools.

Alert Classes

Class	Example	Page?
User impact	high 5xx, high latency, failed orders	yes
Platform capacity	many pending pods, nodes not ready, autoscaler blocked	yes if prod impact likely
Security	privileged pod admitted, suspicious RBAC change	depends on severity
Certificate	cert expires soon	ticket/warning before page
Cost	sudden resource spike	ticket unless causing outage
Hygiene	high restart count in dev	dashboard/ticket

Alert Design Rules

Alert on symptoms before causes.
Page humans only for urgent, actionable issues.
Include service, namespace, cluster, region, severity, and runbook link.
Include recent deployment marker when possible.
Use burn-rate alerts for SLOs.
Avoid one alert per Pod.
Avoid alerts that require tribal knowledge.
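
Burn rate is the ratio between the observed error ratio and the error budget allowed by the SLO. The Python sketch below shows the arithmetic and a two-window paging condition. The function names are assumptions, and the default threshold of 14.4 is the commonly used value that corresponds to spending 2% of a 30-day budget in one hour; tune it to your own SLO window.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)
```

With a 99.9% SLO, a 1% error ratio is a burn rate of about 10. Requiring both windows to exceed the threshold keeps the alert from firing on a spike that has already ended.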

16. KQL Investigation Patterns

Log Analytics and KQL are central to AKS forensic investigation.

The exact table names and schemas can vary based on configuration and product evolution, but the investigation patterns are stable.

Pattern 1: Find Recent Container Errors

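// Recent error-level or error-looking container logs, newest first.
ContainerLogV2
| where TimeGenerated > ago(30m)
// LogMessage is a dynamic column, so cast it before string matching.
| where LogLevel =~ "error" or tostring(LogMessage) has_any ("Exception", "timeout", "failed")
| project TimeGenerated, PodNamespace, PodName, ContainerName, LogMessage
| order by TimeGenerated desc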

Pattern 2: Find Restarting Pods

// KubePodInventory stores the pod name in the Name column.
KubePodInventory
| where TimeGenerated > ago(1h)
| summarize MaxRestarts = max(ContainerRestartCount) by Namespace, PodName = Name, ContainerName
| where MaxRestarts > 0
| order by MaxRestarts desc

Pattern 3: Correlate Namespace Resource Pressure

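// In InsightsMetrics, Namespace is the metric namespace (the metric source), not a Kubernetes namespace.
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "container.azm.ms/telegraf"
| summarize avg(Val), max(Val) by Name, bin(TimeGenerated, 5m)
| order by TimeGenerated desc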

Do not memorize table names blindly.

Learn the query habit:

constrain time;
constrain cluster/namespace/service;
find abnormal state;
join with deployment/change event;
preserve evidence in incident notes.

17. AKS Incident Timeline Template

During an outage, build a single timeline.

T-30m: Argo CD synced checkout-api v2026.07.03-42
T-27m: Pods started failing readiness
T-26m: Application Gateway backend unhealthy count increased
T-25m: 5xx exceeded SLO burn threshold
T-24m: checkout-api logs show Key Vault 403
T-23m: Azure Activity Log shows role assignment removed
T-20m: rollback attempted but still failing due to external identity dependency
T-15m: role assignment restored
T-10m: Pods restarted and token refreshed
T-05m: backend health recovered
T+00m: incident mitigated

A good timeline joins:

deployment events;
Kubernetes events;
app logs;
metrics;
traces;
Azure Activity Log;
platform diagnostics;
manual operator actions.

This is what separates observability from dashboard watching.
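
Merging evidence from several systems into one ordered list is mostly a matter of normalizing timestamps. The following Python sketch merges event lists and renders them relative to the mitigation time, in the same style as the template above. The tuple layout (timestamp, source, message) and the function names are assumptions for illustration.

```python
from datetime import datetime, timezone


def build_timeline(*sources):
    """Merge (timestamp, source, message) events from several systems.

    Timestamps must be timezone-aware so that Kubernetes, Azure, and
    operator notes sort on the same clock.
    """
    events = [event for source in sources for event in source]
    for timestamp, _, _ in events:
        if timestamp.tzinfo is None:
            raise ValueError("naive timestamp; normalize to UTC first")
    return sorted(events, key=lambda event: event[0])


def render(timeline, incident_end):
    """Format events relative to mitigation time, for example 'T-24m'."""
    lines = []
    for timestamp, source, message in timeline:
        minutes = round((timestamp - incident_end).total_seconds() / 60)
        lines.append(f"T{minutes:+03d}m [{source}] {message}")
    return lines
```

Rejecting naive timestamps is a deliberate choice: mixing local time from one tool with UTC from another is a common way to build a timeline that shows the effect before the cause.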

18. AKS Observability Implementation Blueprint

Phase 1: Baseline Cluster Evidence

Enable:

Container Insights;
diagnostic settings for control-plane categories;
Log Analytics workspace;
Azure Monitor metrics;
Managed Prometheus;
Azure Managed Grafana integration;
Activity Log retention/export;
cluster and node pool resource metrics.

Define:

workspace naming;
retention;
RBAC;
environment labels;
tagging;
data export requirements.

Phase 2: Application Telemetry Standard

Require every service to emit:

structured logs;
RED metrics;
dependency metrics;
traces;
version labels;
correlation IDs;
business error codes.

Provide:

language instrumentation templates;
OpenTelemetry collector baseline;
sample dashboards;
alert rules;
runbook template.

Phase 3: Network and Edge Evidence

Enable/standardize:

ingress controller metrics/logs;
Application Gateway / Load Balancer metrics;
DNS failure visibility;
NetworkPolicy observability;
network flow evidence where required;
NAT Gateway/Azure Firewall metrics if used;
certificate expiry monitoring.

Phase 4: Identity and Security Evidence

Enable/standardize:

Azure Activity Log monitoring;
Entra ID audit/sign-in logs where relevant;
Key Vault diagnostics;
ACR diagnostics;
Kubernetes audit logs;
policy admission logs;
privileged workload detection.

Phase 5: SLO and Incident Operations

Implement:

service SLO catalog;
burn-rate alerts;
incident dashboards;
release markers;
post-incident telemetry review;
recurring alert quality review.

19. Failure Modes and How to Recognize Them

Failure Mode 1: Dashboard Green, Users Broken

Cause:

infrastructure metrics are healthy, but application SLO metrics are missing.

Symptoms:

CPU/memory normal;
no node issues;
users report failure;
no service-level latency/error metrics.

Fix:

instrument RED metrics and traces;
alert on user impact;
add synthetic checks for critical flows.

Failure Mode 2: Too Many Alerts

Cause:

alerting on low-level symptoms without deduplication or severity design.

Symptoms:

hundreds of Pod alerts;
engineers ignore notifications;
no clear incident owner.

Fix:

aggregate by service/namespace;
classify alerts;
page on SLO/user-impact first;
route hygiene issues to tickets.

Failure Mode 3: Missing Control-Plane Logs

Cause:

diagnostics not enabled before incident.

Symptoms:

cannot explain admission failure;
cannot audit RBAC changes;
cannot diagnose autoscaler decision.

Fix:

enable AKS diagnostics as baseline;
define retention and export;
create queries for common incidents.

Failure Mode 4: Log Cost Explosion

Cause:

high-volume debug logs, verbose libraries, no sampling/routing.

Symptoms:

Log Analytics bill spikes;
important logs buried;
teams disable logging entirely.

Fix:

enforce log levels;
sample noisy logs;
route audit/security logs separately;
use metrics for high-volume numeric signals.

Failure Mode 5: Metrics Cardinality Explosion

Cause:

labels include user IDs, request IDs, raw URLs, tenant IDs, order IDs.

Symptoms:

Prometheus cost/query latency increases;
dashboards slow;
alerts unstable.

Fix:

restrict label vocabulary;
normalize route labels;
review metrics in CI;
reject unbounded labels.

Failure Mode 6: Identity Failure Looks Like App Failure

Cause:

missing managed identity permission or federated credential mismatch.

Symptoms:

app logs show 403/401;
Pods Ready but operations fail;
dependency metrics show failures.

Fix:

correlate app errors with Activity Log;
check ServiceAccount annotation and federated credential;
verify role scope.

Failure Mode 7: Network Failure Has No Packet Evidence

Cause:

no network observability, no LB/App Gateway logs, no DNS visibility.

Symptoms:

intermittent timeouts;
no clear Kubernetes event;
only app-level timeout seen.

Fix:

add edge and network telemetry;
inspect endpoint/backend health;
capture route/policy/firewall evidence.

20. Production Checklist

Cluster

Container Insights enabled or alternative baseline documented.
Managed Prometheus enabled or equivalent metric backend exists.
Grafana dashboards exist for service, platform, edge, and capacity views.
AKS diagnostic settings enabled.
Activity Logs retained/exported.
Control-plane logs have retention policy.
Cluster tags identify environment, owner, cost center, criticality.

Workload

Logs are structured.
Logs include service, version, environment, request/trace ID.
Metrics include request rate, errors, latency, saturation.
Traces include dependency spans.
Telemetry does not expose secrets/PII.
Deployment/version markers are visible.

Network

Ingress/load balancer metrics and logs are available.
Backend health is observable.
DNS symptoms can be investigated.
Egress/NAT/firewall signals exist if those components are used.
NetworkPolicy effects can be verified.

Identity and Security

Kubernetes audit logs or equivalent evidence are retained.
Azure Activity Log is monitored for critical changes.
Key Vault/ACR diagnostics enabled where relevant.
Workload Identity failures have a runbook.
Privileged workload/policy violation alerts exist.

Operations

Alerts are tied to runbooks.
SLO burn-rate alerts exist for critical services.
Alert ownership is defined.
Incident timeline template is used.
Post-incident telemetry gaps become backlog items.

21. Runbook: AKS Service Is Returning 503

Step 1: Confirm User Impact

kubectl -n prod get deploy,rs,pod,svc,endpointslice
kubectl -n prod get events --sort-by=.lastTimestamp | tail -50

Check:

SLO dashboard;
error rate;
latency;
affected routes;
deployment marker.

Step 2: Check Ingress / Edge

Application Gateway backend health;
Azure Load Balancer metrics;
ingress controller logs;
Gateway/HTTPRoute status;
TLS certificate status;
DNS resolution.

Step 3: Check Service Backends

kubectl -n prod describe svc checkout-api
kubectl -n prod get endpointslice -l kubernetes.io/service-name=checkout-api -o wide
kubectl -n prod get pod -l app=checkout-api -o wide

If there are no ready endpoints, inspect readiness.
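
Counting ready endpoints can be scripted against the JSON form of the same command (adding -o json). This Python sketch assumes the standard EndpointSlice list layout, where each endpoint carries a conditions.ready flag; the function name is an assumption. An unset ready condition is treated as ready here, which is how consumers are generally expected to interpret it.

```python
import json


def count_ready_endpoints(endpointslice_list_json):
    """Count (ready, not_ready) endpoints in `kubectl get endpointslice -o json` output."""
    document = json.loads(endpointslice_list_json)
    ready = not_ready = 0
    for item in document.get("items", []):
        for endpoint in item.get("endpoints") or []:
            conditions = endpoint.get("conditions") or {}
            flag = conditions.get("ready")
            # A missing or null "ready" condition is interpreted as ready.
            if flag is None or flag:
                ready += 1
            else:
                not_ready += 1
    return ready, not_ready
```

A result of (0, N) points at readiness probes or the rollout. A result of (0, 0) points at the Service selector or at Pods that do not exist.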

Step 4: Check Rollout

kubectl -n prod rollout status deploy/checkout-api
kubectl -n prod describe deploy checkout-api
kubectl -n prod describe pod <pod>
kubectl -n prod logs deploy/checkout-api --since=20m

Look for:

readiness failures;
CrashLoopBackOff;
image pull errors;
config/secret errors;
identity errors;
dependency timeouts.

Step 5: Check Azure Boundary

Activity Log for changes;
Key Vault diagnostics;
ACR diagnostics;
Application Gateway backend logs;
Azure Firewall/NAT/NSG if applicable;
node pool health;
quota/capacity changes.

Step 6: Mitigate

Possible mitigations:

rollback release;
restore identity role assignment;
fix readiness path;
scale replicas;
drain bad node;
revert ingress rule;
disable broken route;
fail over traffic if multi-region.

Do not stop after mitigation. Preserve evidence for post-incident review.

22. Deliberate Practice

Exercise 1: Build an Incident Dashboard

Create one dashboard for a critical AKS service.

It must show:

request rate;
error rate;
latency;
deployment version;
pod restarts;
ready replicas;
ingress/backend health;
dependency error rate;
trace exemplar link if available.

Pass condition:

A new engineer can identify whether the issue is app, rollout, ingress, node, or dependency within 5 minutes.

Exercise 2: Simulate Broken Readiness

Break the readiness path in a staging Deployment.

Observe:

Pod readiness;
Service endpoints;
ingress backend health;
app logs;
alert behavior;
rollout status.

Write a short incident timeline.

Exercise 3: Simulate Identity Failure

Remove a non-production Key Vault role assignment from a test workload identity.

Observe:

app logs;
Azure Activity Log;
Key Vault diagnostics;
metrics;
trace error attributes.

Pass condition:

You can prove the failure from both app telemetry and Azure control-plane evidence.

Exercise 4: Metric Cardinality Review

Review metrics from one service.

Reject labels that contain:

user ID;
request ID;
order ID;
raw URL with IDs;
email;
high-cardinality tenant ID unless explicitly approved.
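
This review can be automated as a CI check over the label names each metric declares. The Python sketch below is one possible shape. The forbidden label names and the function name are assumptions chosen to mirror the list above, and the approved parameter models the "unless explicitly approved" exception.

```python
# Illustrative label names; align them with your own naming conventions.
FORBIDDEN_LABELS = frozenset({
    "user_id", "request_id", "order_id", "email", "url", "path", "tenant_id",
})


def review_labels(metric_name, label_names, approved=frozenset()):
    """Return one violation message per forbidden label that is not approved."""
    violations = []
    for label in sorted(label_names):
        if label.lower() in FORBIDDEN_LABELS and label not in approved:
            violations.append(f"{metric_name}: label '{label}' is on the forbidden list")
    return violations
```

A name-based check catches only the obvious cases. Counting distinct values per label in a staging environment is a useful complement.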

Exercise 5: Alert Quality Review

Pick 10 existing alerts.

For each, answer:

Is it actionable?
Who owns it?
Does it have a runbook?
Does it page or create a ticket?
Does it represent user impact or early warning?
How many times did it fire in the last 30 days?

Delete or downgrade bad alerts.
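
The answers to these questions can be turned into a simple decision rule so that the review is applied consistently. The Python sketch below encodes one possible policy. The field names and the keep/downgrade/delete rules are assumptions for illustration, not a standard.

```python
def review_alert(alert):
    """Classify an alert as 'keep', 'downgrade', or 'delete' (illustrative policy)."""
    if not alert.get("actionable"):
        return "delete"
    if not alert.get("owner") or not alert.get("runbook"):
        return "downgrade"
    # Paging is reserved for alerts that represent user impact.
    if alert.get("pages") and not alert.get("user_impact"):
        return "downgrade"
    return "keep"
```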

23. Design Heuristics

Use these heuristics when building AKS observability:

Metrics detect. Logs explain. Traces localize. Cloud diagnostics prove platform boundaries.
Every alert needs an owner and a runbook.
Every critical service needs an SLO dashboard.
Every production rollout must leave telemetry markers.
Every managed cloud dependency needs diagnostic visibility.
Every high-volume telemetry stream needs a cost model.
Every incident should create at least one observability improvement.

24. Mental Model Recap

AKS observability is a multi-layer evidence system.

It joins:

Kubernetes object state;
workload telemetry;
Azure platform diagnostics;
network evidence;
identity evidence;
storage evidence;
delivery/change events;
SLO impact.

The best AKS teams do not just ask:

Is the cluster up?

They ask:

Can we prove what changed, what broke, who was impacted, why mitigation worked, and how to detect it earlier next time?

That is the operational standard.

References

Microsoft Learn — Monitor Azure Kubernetes Service (AKS): https://learn.microsoft.com/en-us/azure/aks/monitor-aks
Microsoft Learn — Enable monitoring for Azure Kubernetes Service clusters: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/kubernetes-monitoring-enable
Microsoft Learn — Kubernetes monitoring in Azure Monitor: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/kubernetes-monitoring-overview
Microsoft Learn — Monitor AKS applications with OpenTelemetry Protocol and Azure Monitor: https://learn.microsoft.com/en-us/azure/azure-monitor/containers/kubernetes-open-protocol
Microsoft Learn — Container Network Observability for AKS: https://learn.microsoft.com/en-us/azure/aks/container-network-observability-how-to
Microsoft Learn — Container network logs in Advanced Container Networking Services: https://learn.microsoft.com/en-us/azure/aks/container-network-observability-logs
Kubernetes Documentation — Observability: https://kubernetes.io/docs/concepts/cluster-administration/observability/
