Azure AKS Architecture Foundation

Learn Kubernetes with Cloud Services AWS & Azure - Part 015

Azure AKS architecture foundation for production platform engineers: control plane boundary, cluster modes, node pools, identity, networking, private access, add-ons, governance, and operational design decisions.

2026-07-03 · 26 min read · 5119 words

Part 015 — Azure AKS Architecture Foundation

AKS looks simple from the outside: create a cluster, deploy workloads, expose services.

Production AKS is not that.

Production AKS is a set of design decisions around who owns the control plane, who owns node lifecycle, where identity lives, how traffic enters and leaves, how Azure services are consumed, how policy is enforced, how upgrades are rehearsed, and how failure is contained.

This part is the AKS counterpart to the previous EKS foundation. The goal is not to memorize every Azure feature. The goal is to build a durable mental model so that when you see an AKS cluster, you can immediately reason about:

what Azure manages;
what the platform team must still manage;
what the application team should never need to manage;
which defaults are safe;
which defaults are dangerous at scale;
where blast radius hides.

AKS is not just Kubernetes on Azure VMs. It is Kubernetes embedded into Azure's identity, networking, load balancing, private connectivity, policy, monitoring, registry, and governance systems.

1. The Core Mental Model

AKS has three layers:

In a production conversation, do not ask only:

“Is this AKS?”

Ask:

“Which AKS operating model are we using, which network model owns pod IPs, which identity boundary is used, where is the ingress/egress boundary, which team owns node pool upgrades, and which Azure controls enforce guardrails?”

That is the difference between a cluster and a platform.

2. AKS Standard vs AKS Automatic

Modern AKS has two broad operating modes:

Mode	Main Idea	Best Fit	Platform Responsibility
AKS Automatic	Azure applies production-oriented defaults and automates more of the operational surface	Teams that want less infrastructure decision burden	Less node/network/add-on tuning, more workload/platform-policy focus
AKS Standard	You control cluster configuration and node pools directly	Enterprises with specific network, compliance, cost, or automation requirements	More explicit ownership of node pools, networking, add-ons, upgrade behavior

This series mostly teaches AKS Standard foundations because that builds your actual architecture muscle. AKS Automatic is easier to consume after you understand the choices it hides.

The practical difference

AKS Automatic compresses many platform decisions into managed defaults:

managed virtual network defaults;
Azure CNI Overlay powered by Cilium in modern guidance;
managed ingress/application routing defaults;
managed egress NAT defaults;
production-oriented node management;
opinionated namespace and policy affordances.

AKS Standard exposes more knobs:

custom VNet/subnet design;
explicit node pools;
explicit ingress controller choices;
explicit egress model;
explicit monitoring and add-on choices;
explicit upgrade and surge behavior;
explicit identity and governance integration.

The invariant

AKS Automatic reduces accidental misconfiguration, but it does not remove architecture responsibility.

You still own:

application readiness;
resource requests;
rollout safety;
data boundaries;
secret usage;
SLOs;
dependency behavior;
namespace and tenancy model;
regulatory defensibility;
incident response.

A managed platform can manage nodes. It cannot manage your domain failure model.

3. Managed Control Plane Boundary

AKS offloads the Kubernetes control plane operation to Azure. Azure creates and manages the API server and associated control plane components. You interact with Kubernetes through kubectl and Azure through az aks / ARM / Bicep / Terraform / Azure Portal.

But “managed control plane” does not mean “no responsibility”.

Azure owns

provisioning and operating the Kubernetes control plane;
high availability of the managed API server, within the limits of the cluster's pricing tier and SLA;
control plane patching mechanics;
integration with Azure resource provider APIs;
cluster lifecycle APIs;
managed add-on lifecycle where enabled.

You still own

Kubernetes version selection within supported range;
upgrade planning and testing;
migration of workloads and manifests off deprecated and removed Kubernetes APIs;
node pool lifecycle;
add-on compatibility;
RBAC and identity design;
namespace boundaries;
workload behavior;
VNet/subnet sizing when using custom networking;
ingress/egress security posture;
policy exceptions;
backup/restore design.

Dangerous assumption

“Azure manages AKS, so upgrades are safe.”

Better model:

“Azure can orchestrate cluster upgrade mechanics. I must prove my workloads, admission policies, CRDs, controllers, add-ons, node images, and APIs survive that upgrade.”

4. The AKS Resource Graph

An AKS cluster is not one thing. It is a graph of Azure and Kubernetes resources.

Important boundary: node resource group

AKS commonly creates a separate managed node resource group (named MC_<resource-group>_<cluster-name>_<region> by default) for infrastructure resources such as VM scale sets, load balancers, disks, NICs, and related objects.

Production rule:

Treat the node resource group as AKS-managed infrastructure. Do not manually mutate resources there unless the AKS documentation or your platform runbook explicitly allows it.

Manual mutation creates configuration drift. Drift is not just messy; it creates an upgrade and incident hazard.

5. System Node Pool vs User Node Pools

AKS has node pools. A node pool maps to a set of nodes with consistent VM SKU, OS, configuration, and lifecycle.

At minimum, production AKS should separate:

Pool Type	Purpose	Workload Type
System node pool	Runs critical system pods	CoreDNS, metrics components, CNI/system add-ons, AKS-managed components
User node pool	Runs application workloads	Product services, workers, ingress, batch, platform add-ons depending on design

Why system/user separation matters

If product workloads fill or destabilize the same pool used by system pods, cluster-level functions degrade:

DNS slows or fails;
metrics disappear;
CNI components struggle;
ingress/controller pods may not schedule;
autoscaler signals degrade;
upgrade operations become less predictable.

Baseline invariant

System workloads must have reserved scheduling capacity independent of product workload spikes.

Common production node pool model

Node pool anti-patterns

Anti-Pattern	Why It Hurts
One pool for everything	No isolation, bad cost attribution, bad upgrade control
Too many tiny pools	Fragmentation, autoscaler inefficiency, operational complexity
System workloads mixed with noisy apps	DNS/control add-on instability
Spot pool without eviction design	Random workload loss and false incidents
GPU/specialized nodes without taints	Expensive nodes consumed by generic pods
No labels/taints convention	Scheduling becomes tribal knowledge

Practical baseline

Start with:

one system pool;
one default user pool;
one optional ingress/platform pool;
specialized pools only when there is a measurable reason.

Do not create a pool for every team by default. Prefer namespace-level tenancy first, node-pool tenancy only when isolation, compliance, latency, hardware, or cost model requires it.
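The baseline above could be provisioned roughly as follows. This is a minimal sketch, assuming an existing cluster: the resource group, cluster name, pool names, node counts, and the `workload=general` label are placeholders, not values from this lesson. The `CriticalAddonsOnly=true:NoSchedule` taint is the one Microsoft documents for keeping application pods off a system pool. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Dedicated system pool: tainted so only critical add-ons schedule there.
add_system_pool() (
  rg=$1; cluster=$2
  run az aks nodepool add \
    --resource-group "$rg" --cluster-name "$cluster" \
    --name sysp1 --mode System --node-count 3 \
    --node-taints CriticalAddonsOnly=true:NoSchedule
)

# General-purpose user pool with autoscaler bounds and a scheduling label.
add_user_pool() (
  rg=$1; cluster=$2
  run az aks nodepool add \
    --resource-group "$rg" --cluster-name "$cluster" \
    --name userp1 --mode User \
    --enable-cluster-autoscaler --min-count 2 --max-count 10 \
    --labels workload=general
)

# Placeholder names; prints the two commands in dry-run mode.
add_system_pool my-rg my-aks
add_user_pool my-rg my-aks
```

Specialized pools (GPU, spot, ingress) would follow the same shape with their own taints and labels, which is where a written labels/taints convention pays off.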

6. VM Scale Sets and Node Lifecycle

AKS node pools are usually backed by Azure Virtual Machine Scale Sets. That means node lifecycle is not only Kubernetes lifecycle. It is also Azure compute lifecycle.

When you scale a node pool, you are changing VMSS capacity. When you upgrade a node pool, you are replacing node images or Kubernetes components on VM instances. When cluster autoscaler acts, it changes VMSS size.

Node lifecycle pipeline

Operational implications

Kubernetes sees nodes as schedulable capacity.
Azure sees nodes as VM instances in a scale set.
Autoscaling touches both worlds.
Upgrade touches both worlds.
Drain must protect workload availability.
PodDisruptionBudgets matter because node operations evict pods.

Production node operation rule

No node pool lifecycle action should be approved without checking:

PodDisruptionBudgets;
workload replica counts;
topology spread;
autoscaler behavior;
node surge settings;
remaining subnet IP capacity;
capacity quota for the VM SKU;
expected workload interruption.
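The PodDisruptionBudget item in that checklist is easy to automate. A minimal sketch, assuming the input is three whitespace-separated columns (namespace, name, allowed disruptions); the function name and the sample rows are hypothetical. It flags budgets that currently allow zero voluntary disruptions, which are the ones that can stall a node drain.

```shell
# List PodDisruptionBudgets that currently allow zero voluntary disruptions.
# Input on stdin: "<namespace> <name> <disruptionsAllowed>" per line.
blocked_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Against a live cluster (requires kubectl access):
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocked_pdbs

# Hypothetical sample data, for illustration only.
blocked_pdbs <<'EOF'
payments   api-pdb      1
payments   worker-pdb   0
platform   ingress-pdb  2
EOF
# prints: payments/worker-pdb
```

A budget at zero is not always a bug, but it should be explained before the node operation is approved, not discovered during the drain.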

7. AKS Networking Foundation

AKS networking is one of the most consequential architecture decisions. It determines:

pod IP ownership;
subnet sizing;
how pods reach Azure services;
how on-prem networks see traffic;
egress control;
network policy engine;
private cluster access;
load balancer behavior;
DNS design.

There are two broad network models:

Model	Pod IP Source	External Visibility	Primary Benefit	Primary Risk
Overlay	Pod IPs from overlay CIDR; outbound usually SNATs to node IP	Pod IP usually hidden outside cluster	Conserves VNet IP space	Less direct pod reachability
Flat	Pod IPs from Azure VNet/subnet	Pod IP can be directly visible/routable depending on design	Direct integration with VNet/on-prem	Requires careful IP planning

Modern AKS guidance increasingly pushes toward Azure CNI Overlay for many scenarios because it improves IP efficiency while preserving Azure CNI integration patterns.

Networking decision questions

Before creating a production AKS cluster, answer:

Will pods need direct IP reachability from peered VNets or on-prem?
Is preserving VNet IP space more important than direct pod routability?
Who owns the VNet: the app team, the platform team, or a central networking team?
Is outbound internet allowed directly, through NAT Gateway, or through Azure Firewall?
Is the API server public, private, or restricted by authorized IP ranges?
Which ingress boundary is standard: Azure Load Balancer, Application Gateway, Application Gateway for Containers, managed NGINX, or Gateway API?
Which network policy engine is used?
How are DNS and private endpoints handled?

If these are not answered up front, the cluster may work but the platform will be ungovernable.

8. VNet and Subnet Topology

A production AKS cluster belongs inside a network topology, not floating by itself.

Baseline topology:

Subnet design principles

Do not put unrelated infrastructure into the node subnet.
Do not delegate the node subnet to another Azure service; an AKS node subnet must not be a delegated subnet.
Plan IP capacity based on node count and network model.
Reserve subnet space for growth, not for the first release only.
Separate ingress, private endpoints, Azure Firewall, and node pools when required by architecture.
Treat subnet exhaustion as a production incident risk.
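The capacity principle can be turned into rough arithmetic. A sketch under stated assumptions: "flat" here means the classic Azure CNI node-subnet model that pre-allocates one IP per potential pod on every node, "overlay" means only node IPs come from the subnet, and Azure reserves 5 addresses in every subnet. The function name and the example numbers are illustrative.

```shell
# Estimate node-subnet IPs needed at peak (steady-state nodes + upgrade surge).
#   model:    "flat" (pod IPs pre-allocated from the subnet per node) or
#             "overlay" (only node IPs come from the subnet)
#   nodes:    maximum node count after autoscaling
#   surge:    extra nodes created during upgrades
#   max_pods: max pods per node (only affects the flat model)
required_subnet_ips() (
  model=$1; nodes=$2; surge=$3; max_pods=$4
  total_nodes=$((nodes + surge))
  case "$model" in
    flat)    per_node=$((max_pods + 1)) ;;
    overlay) per_node=1 ;;
    *) echo "unknown model: $model" >&2; exit 1 ;;
  esac
  # Azure reserves 5 addresses in every subnet.
  echo $((total_nodes * per_node + 5))
)

required_subnet_ips flat 50 5 30      # prints 1710
required_subnet_ips overlay 50 5 250  # prints 60
```

With these example numbers the flat model needs a /21 node subnet (2048 addresses), while the overlay model fits in a /26. Overlay moves the planning problem to the pod CIDR instead, where each node is assigned its own /24 block.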

Common subnet mistakes

Mistake	Consequence
Small node subnet	Node scale or pod scheduling fails later
No private endpoint plan	Teams expose services publicly under time pressure
Egress not centralized	Regulatory and audit gaps
Mixed ingress and node subnet carelessly	Harder security and routing ownership
Missing DNS design	Private endpoint resolution failures

9. Private Cluster Design

A private AKS cluster exposes the Kubernetes API server only on a private IP address, reachable from the cluster VNet and networks connected to it, instead of on a public endpoint. This reduces exposure but increases operational design complexity.

Private clusters require a clear access path for:

CI/CD systems;
platform engineers;
break-glass access;
automation tools;
GitOps controllers if external;
monitoring/control integrations.

Private access model

Private cluster invariant

A private API server is safer only if the private access path is well-governed.

If engineers bypass it with ad-hoc jump boxes, shared credentials, broad firewall rules, or long-lived admin kubeconfigs, the “private” cluster only creates a false sense of security.

Production checklist

Private DNS works from all expected admin networks.
CI/CD runners have private connectivity.
Break-glass workflow is documented and tested.
Admin access uses identity, not shared kubeconfig secrets.
No permanent public fallback exists without approval.
Monitoring and GitOps controllers can reach required endpoints.
Incident runbook includes private endpoint troubleshooting.

10. Identity Foundation: Entra ID, Kubernetes RBAC, and Managed Identity

AKS identity has three overlapping layers:

Layer	Purpose
Microsoft Entra ID	Human and service principal identity source
Kubernetes RBAC	Authorization inside Kubernetes API
Azure Managed Identity / Workload Identity	Access from cluster/workloads to Azure resources

Do not collapse them into one concept.

Control-plane access identity

Human users should authenticate through Entra ID integration and receive Kubernetes permissions through RBAC mappings or Azure RBAC integration, depending on your organization model.

The key question:

Is Azure RBAC the source of truth for Kubernetes authorization, or does Kubernetes RBAC remain the main in-cluster authorization model?

Both can work. Mixing them without governance creates confusion.

Workload identity

Workloads should not use static Azure credentials. Production AKS should prefer workload identity patterns that allow a Kubernetes service account to federate with Azure identity and access Azure resources with narrow permissions.
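The federation step hinges on one string: the subject that ties a managed identity to exactly one Kubernetes service account. A minimal sketch, assuming a user-assigned managed identity already exists and the cluster has its OIDC issuer enabled; every name, the issuer URL, and both helper functions are placeholders. The command is printed, not executed.

```shell
# Build the federated-credential subject that Entra ID matches against the
# Kubernetes service account token.
federated_subject() {
  echo "system:serviceaccount:$1:$2"
}

# Print (not run) the command that links a user-assigned managed identity
# to one service account. The real issuer URL comes from:
#   az aks show -g <rg> -n <cluster> --query oidcIssuerProfile.issuerUrl -o tsv
print_federation_command() (
  rg=$1; identity=$2; issuer=$3; ns=$4; sa=$5
  echo az identity federated-credential create \
    --name "${ns}-${sa}" \
    --identity-name "$identity" \
    --resource-group "$rg" \
    --issuer "$issuer" \
    --subject "$(federated_subject "$ns" "$sa")" \
    --audience api://AzureADTokenExchange
)

# Placeholder values for illustration only.
print_federation_command my-rg orders-identity \
  "https://example.invalid/oidc-issuer/" orders orders-api
```

On the Kubernetes side, the service account carries the `azure.workload.identity/client-id` annotation and the pod carries the `azure.workload.identity/use: "true"` label. One identity per service account keeps role assignments narrow and auditable.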

Example target model:

Managed identity boundary

AKS itself also uses identities for cluster operations. These identities may need permissions against VNets, load balancers, route tables, managed disks, ACR, or other Azure resources depending on configuration.

Do not overgrant cluster identities. A common failure is giving broad Contributor permissions at subscription scope because something failed during provisioning.

Better pattern:

explicit user-assigned managed identities where appropriate;
least privilege per resource group or resource;
infrastructure-as-code controlled role assignments;
periodic access review;
separate cluster identity from workload identity.

11. Container Registry: ACR Integration

AKS commonly pulls images from Azure Container Registry.

Production design decisions:

Is ACR shared across environments or per environment?
Is ACR exposed publicly or via Private Link?
Are images immutable by digest?
Is vulnerability scanning enforced?
Is image signing required?
Are promotion flows tag-based or digest-based?
Who can push to production repositories?
Can dev clusters pull production images?

Minimal pull model

Production pull invariant

The cluster should be able to pull only the image repositories required for its environment.

Do not use broad registry permissions as a convenience layer. Registry permissions are part of supply-chain security.

12. Secrets and Key Vault Boundary

Kubernetes Secret is an API object. Azure Key Vault is an external secret management service. They solve different parts of the problem.

Option	Strength	Risk
Kubernetes Secret only	Simple, native	Secret material lives in Kubernetes API/etcd; needs strict encryption/RBAC
Key Vault CSI Driver	External secret source; mount secret into pods	Runtime dependency; mount/update behavior must be understood
External Secrets operator style	Syncs external secrets into Kubernetes Secrets	Still creates Kubernetes Secrets; controller becomes sensitive component
Workload fetches Key Vault directly	App controls fetch/refresh	App complexity; retry/caching/security mistakes

Production design rule

A secret strategy is not complete until it defines:

where secret material is stored;
how the workload gets it;
whether it is mounted or env-var injected;
how rotation happens;
how the app reloads it;
who can read it;
how access is audited;
how leaked secrets are revoked.

Key Vault integration mental model

The most common failure is assuming Key Vault rotation automatically makes the application use the new value. Runtime reload remains an application/platform contract.

13. Ingress and Edge Architecture

AKS does not prescribe one ingress pattern for all systems. You need a platform standard.

Common options:

Option	Where It Fits
Azure Load Balancer Service	Simple L4 exposure
NGINX Ingress Controller	Common Kubernetes-native L7 routing
Application Gateway Ingress Controller	Azure Application Gateway integrated ingress
Application Gateway for Containers	Modern Azure application load balancing with Kubernetes integration
Gateway API	Future-facing Kubernetes-native traffic API model
Service Mesh Ingress Gateway	Mesh-heavy platforms with mTLS/traffic policy requirements

Edge ownership questions

Does the networking team own public IPs and the WAF?
Does the platform team own the ingress controller lifecycle?
Do app teams define routes directly?
Are certificates managed centrally or per namespace?
Is WAF policy shared or app-specific?
Is internal ingress separate from external ingress?
Is Gateway API the long-term route API?

Baseline edge topology

Production invariant

The app team may own route intent. The platform must own route safety.

That means app teams can request:

hostname;
path;
backend service;
timeout profile;
auth policy;
traffic split.

But platform guardrails must enforce:

allowed domains;
TLS policy;
WAF profile;
public/private exposure;
cross-namespace reference controls;
certificate source;
forbidden annotations.

14. Egress Architecture

Ingress gets attention because it is visible. Egress causes incidents because it is invisible.

AKS outbound traffic can be handled through several models:

Azure Load Balancer outbound rules;
NAT Gateway;
user-defined routing through Azure Firewall/NVA;
managed NAT patterns in automatic modes;
private endpoints for Azure service access;
proxy-based egress controls.

Egress decision matrix

Requirement	Preferred Direction
Simple internet egress	NAT Gateway or managed outbound option
Regulated outbound inspection	UDR to Azure Firewall or approved NVA
Private Azure service access	Private Link / private endpoints
Stable outbound IP allowlisting	NAT Gateway or firewall public IPs
Per-namespace egress policy	NetworkPolicy + egress gateway/proxy pattern

Egress failure modes

Failure	Symptom
SNAT port exhaustion	Random outbound connection failures
Missing private DNS for private endpoints	Service resolves public endpoint or fails
Firewall denies required cloud endpoint	Image pull, Key Vault, telemetry, or dependency call fails
No egress ownership	App teams add random public dependencies
Overbroad egress	Data exfiltration and audit risk

Production invariant

Every production AKS cluster should have an egress story before the first workload goes live.

Not after the first audit.

15. Observability Foundation in AKS

AKS observability usually combines Kubernetes-native telemetry with Azure-native telemetry.

Common layers:

Layer	Typical Tools
Cluster metrics	Azure Monitor managed Prometheus, metrics-server, kube-state-metrics
Logs	Container logs to Log Analytics / Azure Monitor
Dashboards	Azure Managed Grafana, Grafana, Azure Portal views
Traces	OpenTelemetry collector / vendor backend
Events	Kubernetes Events, Azure Activity Logs
Control plane audit	Azure diagnostics / control-plane logs depending on configuration

Observability ownership

Platform team should provide:

cluster health dashboards;
node pool dashboards;
ingress/egress dashboards;
DNS and CoreDNS dashboards;
API server and control-plane relevant logs where enabled;
network drop/deny visibility;
cost/capacity reports;
alert standards.

Application teams should provide:

service-level metrics;
request latency/error rates;
dependency metrics;
business operation metrics;
trace spans;
actionable alerts tied to SLOs.

Anti-pattern

“We installed Azure Monitor, so we have observability.”

Better model:

“Telemetry exists only if we can answer incident questions quickly.”

Incident questions:

Which deployment changed?
Which node pool is failing?
Are pods unschedulable or crashing?
Is DNS failing?
Is ingress returning 502/503?
Is egress blocked?
Did a policy deny the deployment?
Is the app slow or its dependency slow?
Is the cluster out of IPs, CPU, memory, or SNAT ports?

16. Azure Policy and Governance

AKS production platforms need guardrails. Without guardrails, every namespace becomes a separate governance experiment.

Azure governance commonly uses:

Azure Policy for AKS;
Kubernetes admission policies;
Gatekeeper/OPA;
Kyverno;
RBAC;
namespace templates;
resource quotas;
limit ranges;
network policies;
image policy;
tagging and Azure resource policy.

Governance layers

Guardrail examples

Pods must run as non-root.
Privileged containers are forbidden except approved namespaces.
Images must come from approved registries.
Public LoadBalancer Services are denied by default.
Required labels must exist.
Resource requests are mandatory.
HostPath is denied.
Latest image tag is denied.
Secrets cannot be mounted into unauthorized workloads.
Ingress hostnames must match allowed domains.

Production rule

Policies should have a lifecycle:

audit;
warn;
enforce;
exception process;
periodic exception expiry review.

Hard enforcement from day one often causes teams to route around the platform. Audit-only forever creates security theater.

17. AKS Add-ons and Extension Strategy

AKS has built-in add-ons and integrations, but you can also install open-source or vendor components.

Typical platform add-ons:

Azure Monitor / Container Insights;
managed Prometheus;
managed Grafana;
Azure Policy add-on;
Key Vault CSI driver;
workload identity;
application routing add-on;
ingress/gateway controllers;
cert-manager;
external-dns;
secret operators;
service mesh;
GitOps controllers;
policy engines;
backup tools;
security agents.

Add-on ownership matrix

Add-on Type	Owner	Upgrade Rule
Cloud provider managed add-on	Platform team + Azure lifecycle	Track AKS compatibility
Ingress/gateway	Platform networking	Upgrade with traffic canary
Policy engine	Platform security	Audit policies before enforcement
GitOps controller	Platform delivery	Upgrade before app feature adoption
Observability agent	Platform SRE	Validate telemetry continuity
Secret integration	Platform security	Test rotation and outage behavior

Add-on failure model

Every add-on is a controller. Every controller can become a cluster-level dependency.

Ask:

What permissions does it have?
What namespaces can it affect?
What CRDs does it own?
What happens if it is down?
What happens if it is upgraded incorrectly?
What happens if its webhook times out?
Does it block deployments?
Does it mutate workloads?

Admission webhooks deserve special caution. A broken validating/mutating webhook can block cluster changes.

18. AKS Upgrade Model

AKS upgrade is not a button. It is a system change.

There are multiple upgrade surfaces:

Kubernetes control plane version;
node pool Kubernetes/kubelet version;
node image version;
CNI and network components;
managed add-ons;
CRDs and controllers;
ingress/gateway components;
policy engines;
service mesh;
workload APIs.

Upgrade dependency chain

Upgrade invariants

Never upgrade production first.
Never upgrade without a deprecated API scan.
Never upgrade without a PDB review.
Never upgrade all specialized pools at once.
Never assume add-ons are compatible.
Never ignore node image updates.
Never run unsupported Kubernetes versions as a normal operating model.

Node pool upgrade strategy

For production:

upgrade system pool carefully;
upgrade one user pool at a time;
use surge capacity;
respect PodDisruptionBudgets;
watch unschedulable pods;
watch ingress error rate;
watch DNS and CNI health;
watch application SLOs.
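The ordering can be written down as a plan before anything is executed. A sketch, assuming the control plane is upgraded before any node pool (kubelets must not run ahead of the API server) and that pools are then upgraded one at a time, system pool first. The function name, pool names, version, and the 33% surge value are placeholders. The function only prints commands.

```shell
# Print an ordered upgrade plan: control plane first, then the system pool,
# then user pools one at a time. Nothing is executed.
#   usage: upgrade_plan <rg> <cluster> <version> <system-pool> <user-pool>...
upgrade_plan() (
  rg=$1; cluster=$2; version=$3; system_pool=$4
  shift 4
  echo "az aks get-upgrades --resource-group $rg --name $cluster --output table"
  echo "az aks upgrade --resource-group $rg --name $cluster --kubernetes-version $version --control-plane-only"
  for pool in "$system_pool" "$@"; do
    echo "az aks nodepool upgrade --resource-group $rg --cluster-name $cluster --name $pool --kubernetes-version $version --max-surge 33%"
  done
)

# Placeholder names and version for illustration only.
upgrade_plan my-rg my-aks 1.30.0 sysp1 userp1 ingressp1
```

Between each printed step belongs a pause for the watch items above: unschedulable pods, ingress error rate, DNS and CNI health, and application SLOs.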

19. AKS Reliability Architecture

Reliability starts with failure domains.

AKS failure domains include:

pod;
container;
node;
node pool;
availability zone;
subnet;
VNet;
region;
Azure service dependency;
identity provider;
DNS;
ingress;
egress;
storage account/disk;
Key Vault;
ACR;
monitoring pipeline.

Zonal design

For production, distribute node pools across availability zones where supported and align workload topology spread with zone failure expectations.

But zone-aware nodes alone are not enough.

You also need:

replicas > 1;
topology spread constraints;
anti-affinity for critical workloads;
multi-zone storage where required;
zone-aware ingress/load balancing;
PDBs that do not block all disruption;
capacity to survive one zone loss;
dependency architecture that survives one zone loss.
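"Capacity to survive one zone loss" has a simple arithmetic form. A sketch, assuming replicas are spread as evenly as possible across zones and that the zone holding the most replicas is the one that fails; the function name is hypothetical.

```shell
# Smallest replica count that still leaves `required` replicas running after
# losing the fullest zone, assuming an even spread across `zones` zones.
min_replicas_for_zone_loss() (
  required=$1; zones=$2
  if [ "$zones" -lt 2 ]; then
    echo "need at least 2 zones" >&2; exit 1
  fi
  total=$required
  while :; do
    worst_zone=$(( (total + zones - 1) / zones ))   # ceil(total / zones)
    if [ $((total - worst_zone)) -ge "$required" ]; then
      echo "$total"; exit 0
    fi
    total=$((total + 1))
  done
)

min_replicas_for_zone_loss 4 3   # prints 6
min_replicas_for_zone_loss 1 3   # prints 2
```

The same reasoning applies to node capacity: if four replicas carry peak load across three zones, the surviving two zones need room for all four.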

Failure-mode table

Failure	Required Design Response
One pod crashes	ReplicaSet restores; readiness removes bad pod
One node lost	Pods reschedule; capacity exists elsewhere
One node pool degraded	Critical workloads can run on other eligible pools or fail safely
One zone lost	Replicas spread across zones; ingress routes remaining endpoints
ACR unavailable	Pods on nodes that already hold the image keep running; new pulls, scale-out, and rollouts may stall
Key Vault unavailable	Apps must define cached/rotated secret behavior
DNS failure	CoreDNS and upstream DNS observability/runbook
Azure identity issue	Workload identity failure surfaces as dependency incident

Production invariant

A highly available AKS cluster cannot compensate for a single-replica application, a single-zone database, or an app that dies on dependency timeout.

20. Namespace and Tenancy Model

AKS platform design must define how teams consume the cluster.

A namespace is not a security boundary by itself, but it is a practical tenancy unit.

Namespace template should include:

labels and annotations;
ResourceQuota;
LimitRange;
default deny NetworkPolicy;
allowed ingress/egress policy;
RBAC bindings;
service account standards;
workload identity mapping rules;
secret access pattern;
allowed registries;
logging/monitoring labels;
ownership metadata;
cost allocation metadata.
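Part of that template can be rendered from a small script. A minimal sketch covering only three of the items (ownership label, default-deny NetworkPolicy, ResourceQuota); the function name, the `owner` label key, the object names, and the quota values are placeholders, not a platform standard.

```shell
# Render a starting manifest set for one tenant namespace.
#   usage: render_namespace_baseline <namespace> <owner>
render_namespace_baseline() (
  ns=$1; owner=$2
  cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ${ns}
  labels:
    owner: ${owner}
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: baseline-quota
  namespace: ${ns}
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    pods: "50"
EOF
)

# Prints the manifests; to apply them:
#   render_namespace_baseline orders team-orders | kubectl apply -f -
render_namespace_baseline orders team-orders
```

A default-deny egress policy also blocks DNS lookups, so an explicit allow rule toward CoreDNS normally ships in the same template. In practice a GitOps controller or a templating tool would own this rendering rather than an ad-hoc script.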

Namespace factory model

Tenant isolation levels

Isolation Level	Mechanism	When to Use
Soft	Namespace + RBAC + quota	Internal low-risk multi-team platform
Medium	Namespace + network policy + policy-as-code + identity isolation	Most enterprise shared clusters
Strong	Dedicated node pools + taints + separate identities	Regulated/high-sensitivity workloads
Hard	Dedicated cluster/subscription	Strong compliance or blast-radius requirements

Do not promise hard isolation from namespaces alone.

21. Production AKS Reference Architecture

A reasonable production AKS baseline looks like this:

This is not the only architecture. It is a reference shape.

Modify it only after you know what property you are trading away.

22. AKS Production Readiness Checklist

Cluster foundation

Kubernetes version is supported and upgrade path is known.
AKS Standard vs Automatic decision is documented.
Cluster is deployed by infrastructure-as-code.
Node resource group is treated as managed infrastructure.
System and user node pools are separated.
Node pool labels/taints follow platform convention.
Autoscaler behavior is tested.
PDBs exist for critical workloads.

Networking

Network model is chosen deliberately: overlay vs flat.
VNet/subnet sizing is based on growth assumptions.
Private cluster decision is documented.
API server access path is tested.
Ingress controller/gateway standard is selected.
Egress path is controlled and observable.
Private endpoints and private DNS are designed.
NetworkPolicy engine is installed and validated.

Identity/security

Entra ID integration is configured.
Admin access is not based on shared credentials.
Kubernetes RBAC/Azure RBAC model is clear.
Workload identity is used for Azure service access.
Cluster managed identity permissions are least privilege.
ACR pull permissions are environment scoped.
Key Vault access is workload scoped.
Policy-as-code has audit/enforce lifecycle.

Operations

Azure Monitor/Prometheus/Grafana or equivalent is configured.
Alerts are actionable and SLO-linked.
Upgrade runbook exists and is rehearsed.
Backup/restore strategy is documented.
Incident runbooks exist for DNS, CNI, ingress, egress, identity, storage.
Cost allocation tags/labels exist.
Platform ownership model is documented.

23. AKS Decision Matrix

Decision	Default Recommendation	Change When
Cluster mode	AKS Standard for full architecture control; Automatic for reduced ops	Your organization values managed defaults over custom controls
Network model	Azure CNI Overlay for many modern clusters	Direct pod routability or strict flat-network integration is required
API access	Private or restricted public	Developer simplicity outweighs strict private access in non-prod
Node pools	System + user + specialized only when justified	Hardware/compliance/cost isolation requires more pools
Ingress	Standardize on Gateway/Ingress platform	Teams have specialized edge requirements
Egress	NAT Gateway or firewall-controlled path	Full private-only dependency model exists
Identity	Entra ID + workload identity	Legacy service principal constraints exist temporarily
Secrets	Key Vault integration for sensitive production secrets	Simple non-prod or low-risk internal workloads
Policy	Audit then enforce	Emergency remediation requires immediate deny
Observability	Azure Monitor + Prometheus/Grafana + OTel	Existing enterprise observability platform is mandated

24. Practical Lab: Design an AKS Platform Slice

Design a production AKS environment for three workloads:

Public API service.
Internal worker consuming queue messages.
Admin dashboard reachable only through private network.

Define:

cluster mode;
VNet/subnet topology;
node pools;
ingress pattern;
egress pattern;
identity pattern;
Key Vault access;
ACR access;
namespace model;
NetworkPolicy baseline;
observability signals;
upgrade approach.

Expected output

A good answer should not just say “use AKS”. It should produce something like:

AKS Standard, private cluster;
custom spoke VNet peered to hub;
system pool + general user pool + ingress pool;
Application Gateway for public API and internal gateway for admin;
NAT Gateway or Azure Firewall for egress;
private endpoints for Key Vault, ACR, and database;
workload identity per service account;
default deny namespace policies;
resource quotas per namespace;
Prometheus metrics + Azure Monitor logs + OpenTelemetry traces;
staged node pool upgrades with PDB validation.

25. Troubleshooting Orientation

When AKS fails, classify first.

Commands to build muscle memory

# Cluster and node pool inventory
az aks show -g <rg> -n <cluster> -o table
az aks nodepool list -g <rg> --cluster-name <cluster> -o table

# Kubernetes view
kubectl get nodes -o wide
kubectl describe node <node>
kubectl get pods -A -o wide
kubectl get events -A --sort-by=.lastTimestamp

# Scheduling issues
kubectl describe pod <pod> -n <ns>
kubectl get quota -A
kubectl get pdb -A

# Network/service issues (gateway and httproute need the Gateway API CRDs installed)
kubectl get svc -A
kubectl get ingress -A
kubectl get gateway -A
kubectl get httproute -A
kubectl get endpointslice -A

# Identity/secret symptoms
kubectl describe serviceaccount <sa> -n <ns>
kubectl get secret -n <ns>
kubectl logs <pod> -n <ns>

26. What a Top-Tier Engineer Should Internalize

A top-tier AKS engineer does not merely know AKS commands. They can reason from invariants.

Invariants

AKS manages the control plane, but not your platform operating model.
Node pools are capacity and failure-domain contracts.
Network model must be chosen before scale, not after exhaustion.
Identity must separate human admin access, cluster identity, and workload identity.
Ingress is a governance boundary, not just a routing object.
Egress must be designed before audit and incident pressure.
Secret rotation is a runtime contract, not a storage checkbox.
Observability is only useful if it answers failure questions quickly.
Policy must include exception workflow or teams will bypass it.
Upgrade safety is proven in rehearsal, not asserted in change tickets.

27. References

Microsoft Learn — What is Azure Kubernetes Service: https://learn.microsoft.com/en-us/azure/aks/what-is-aks
Microsoft Learn — Baseline architecture for an AKS cluster: https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/containers/aks/baseline-aks
Microsoft Learn — AKS best practices: https://learn.microsoft.com/en-us/azure/aks/best-practices
Microsoft Learn — AKS networking concepts and Azure CNI overview: https://learn.microsoft.com/en-us/azure/aks/concepts-network-cni-overview
Microsoft Learn — Azure CNI Overlay: https://learn.microsoft.com/en-us/azure/aks/concepts-network-azure-cni-overlay
Microsoft Learn — AKS Automatic introduction: https://learn.microsoft.com/en-us/azure/aks/intro-aks-automatic
Microsoft Learn — AKS managed identity overview: https://learn.microsoft.com/en-us/azure/aks/managed-identity-overview
Kubernetes Documentation — Cluster architecture: https://kubernetes.io/docs/concepts/architecture/
