Reliability, SLO, and Failure Modeling

Learn Kubernetes with Cloud Services AWS & Azure - Part 035

Reliability, SLO, and failure modeling for production Kubernetes platforms on AWS EKS and Azure AKS.

2026-07-03

Part 035 — Reliability, SLO, and Failure Modeling

Kubernetes does not make a system reliable by itself.

It gives you a set of control surfaces:

desired state reconciliation
replica management
rollout strategy
health probes
resource isolation
topology-aware scheduling
disruption management
autoscaling
event stream
policy enforcement
cloud integration

Reliability comes from how those surfaces are composed into a system that continues to provide correct user-visible behavior when parts of the platform fail.

This part is about that composition.

The practical question is not:

Are my Pods running?

The real question is:

Can the user-visible capability still satisfy its service objective while the platform is changing, failing, scaling, draining, upgrading, throttling, and recovering?

That distinction is the difference between operating Kubernetes as a deployment target and operating Kubernetes as a production platform.

1. What You Will Build in This Part

By the end of this part, you should be able to:

Translate product reliability goals into Kubernetes-level design constraints.
Define useful SLIs/SLOs for Kubernetes-hosted services.
Connect SLOs to probes, resources, replicas, PDBs, topology spread, autoscaling, and rollout policy.
Model failure at container, Pod, node, zone, cluster, region, DNS, certificate, identity, and dependency layers.
Design reliability classes for workloads.
Run failure reviews before incidents happen.
Build runbooks that explain causality, not just commands.
Review EKS and AKS reliability from a cloud-provider-aware perspective.

We are not repeating basic probes, resources, autoscaling, storage, GitOps, or observability. Those were already covered in earlier parts. Here we combine them into a reliability system.

2. The Core Mental Model

Reliability is not a property of a component. It is a property of a user journey under stress.

A Kubernetes workload may look healthy while the user journey is broken:

Pods are Running, but DNS resolution is failing.
Deployment has all replicas available, but TLS certificate expired.
HPA is scaling, but downstream database pool is exhausted.
Service has endpoints, but NetworkPolicy blocks egress.
Cluster is healthy, but identity federation fails and Pods cannot read secrets.
Pod readiness is green, but the app returns stale or incorrect data.

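A production reliability model must connect multiple layers, from the user journey and its SLO down to workload configuration, cluster behavior, and cloud dependencies.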

The platform engineer's job is to keep these links honest.

If the SLO says 99.95% availability, but the workload has one replica, no PDB, no zone spread, no graceful shutdown, and no dependency timeout policy, then the SLO is fiction.

3. Reliability Terms That Matter

3.1 SLI

A Service Level Indicator is a measured signal of user-visible service behavior.

Good SLIs are close to the user journey.

Examples:

Capability	Useful SLI	Weak SLI
Public REST API	Percentage of valid requests completed successfully under 300 ms	Pod CPU usage
Checkout	Percentage of checkout attempts completed without duplicate charge	Deployment available replicas
Search	Percentage of queries returning valid result set under 500 ms	Number of Pods running
Event processor	Percentage of events processed within 60 seconds	Consumer Pod count
Reporting	Percentage of reports generated before SLA deadline	CronJob succeeded count only
Payment callback	Percentage of callbacks accepted and persisted exactly once	Service endpoint count

Kubernetes metrics are supporting evidence. They are usually not the SLI.
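As a minimal sketch of the first table row, a request-level SLI can be computed from edge request records. The `Request` shape, the rule that 4xx responses are excluded as invalid, and the 300 ms threshold are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # end-to-end latency observed at the edge


def api_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Percentage of valid requests that succeeded under the threshold.

    Assumption: 4xx responses are caller errors and are excluded as invalid;
    a good event is a response with status < 400 and latency below threshold_ms.
    """
    valid = [r for r in requests if not 400 <= r.status < 500]
    if not valid:
        return 100.0  # no valid traffic, so nothing was violated
    good = sum(1 for r in valid if r.status < 400 and r.latency_ms < threshold_ms)
    return 100.0 * good / len(valid)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),  # too slow: counts against the SLI
    Request(503, 30.0),   # server error: counts against the SLI
    Request(404, 10.0),   # caller error: excluded from the denominator
    Request(201, 80.0),
]
print(f"{api_sli(sample):.1f}%")  # prints 50.0%
```

Note that nothing in this calculation reads Pod or node state. That is the point: the SLI is measured where the user is.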

3.2 SLO

A Service Level Objective is the target threshold for an SLI over a time window.

Examples:

99.9% of valid API requests succeed over 30 days.
99% of successful responses complete below 300 ms over 7 days.
99.95% of payment events are persisted exactly once over 30 days.
99% of queue messages are processed within 60 seconds over 24 hours.

A useful SLO has:

a user-visible event
a success definition
an exclusion policy
a measurement source
a time window
an owner
a consequence when breached

3.3 Error Budget

If the SLO is 99.9%, the allowed unreliability is 0.1% for the window.

For 30 days:

30 days = 43,200 minutes
99.9% SLO allows 0.1% bad minutes
0.001 * 43,200 = 43.2 minutes of allowed badness
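The same arithmetic as a small helper, useful for comparing targets side by side. This is a sketch for a time-based SLO; the function name is illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of allowed unreliability for a time-based SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo, 30):.1f} minutes")
```

Each extra nine cuts the budget by a factor of ten, which is why the target should be chosen from user need and not from habit.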

This is not just math. It is a decision tool.

When error budget is healthy:

release velocity can remain normal
controlled experiments are acceptable
dependency upgrades can proceed

When error budget is burning fast:

freeze risky releases
prioritize reliability fixes
reduce blast radius
review autoscaling, rollouts, and dependency resilience
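One common way to quantify "burning fast" is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes an event-based SLI. The `freeze_at` threshold and the policy strings are hypothetical and should come from your own error-budget policy.

```python
def burn_rate(bad_events: int, total_events: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget lasts exactly one SLO window.
    """
    if total_events == 0:
        return 0.0
    allowed = (100.0 - slo_percent) / 100.0
    return (bad_events / total_events) / allowed


def release_policy(rate: float, freeze_at: float = 2.0) -> str:
    """Map a burn rate to a release decision (thresholds are assumptions)."""
    if rate >= freeze_at:
        return "freeze risky releases"
    if rate > 1.0:
        return "review reliability work"
    return "normal release velocity"


rate = burn_rate(bad_events=50, total_events=10_000, slo_percent=99.9)
print(f"burn rate {rate:.1f} -> {release_policy(rate)}")
```

With 50 failures in 10,000 requests against a 99.9% target, the service is consuming budget about five times faster than the SLO allows.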

3.4 SLA

A Service Level Agreement is an external promise, often contractual.

Do not confuse an internal SLO with an external SLA. In engineering practice, the internal SLO should usually be stricter than the SLA so the team has room to detect and recover before contractual impact.

4. Kubernetes Reliability Is Mostly About Disruption Control

A Kubernetes cluster is always changing:

Deployments roll out.
Nodes are drained.
Autoscalers add and remove capacity.
Images are pulled.
Volumes are attached and detached.
CNI allocates IPs.
DNS caches expire.
Certificates rotate.
Cloud load balancers update targets.
Policies mutate or reject objects.
Add-ons upgrade.
Nodes reboot.

A reliable platform controls the rate, blast radius, and observability of those changes.

The invariant:

Every planned change must preserve enough healthy capacity to satisfy the SLO, and every unplanned failure must degrade within an understood blast radius.

5. Failure Domains

A failure domain is a boundary where failure can be isolated.

In Kubernetes on cloud, the important domains are:

Domain	Example Failure	Usual Control
Process	app crash, deadlock, bad config	probes, restart policy, app shutdown contract
Container	OOMKill, filesystem full, bad image	requests/limits, image policy, logs
Pod	stuck terminating, not ready	probes, termination grace, rollout
ReplicaSet	wrong selector, rollout stall	Deployment strategy, validation
Node	kernel issue, kubelet failure, disk pressure	node pool spread, eviction, autoscaling
Node pool	bad AMI/image, bad VM SKU, taint mistake	canary node pool, surge upgrade
Zone	AZ capacity issue, zonal disk failure	topology spread, multi-AZ LB, storage class design
Cluster	API server outage, etcd issue, policy outage	managed control plane, multi-cluster strategy
Region	cloud regional issue	multi-region architecture, DNS failover
DNS	bad record, CoreDNS failure, external DNS delay	DNS monitoring, caching, fallback design
TLS	expired cert, wrong secret, CA issue	cert-manager, ACME monitoring, manual break-glass
Identity	STS/Entra failure, wrong role binding	workload identity validation, scoped fallback
Dependency	database outage, queue lag, API throttling	timeout, retry budget, circuit breaker, backpressure
Delivery	bad chart, drift, wrong environment	GitOps policy, canary, progressive delivery

The mistake is to design only for Pod failure. Pod failure is the easy case.

The hard cases are correlated failures:

all replicas scheduled in one zone
all replicas depend on one secret rotation path
all nodes in one pool use a bad image
all traffic goes through one ingress controller deployment
all workloads share one overloaded DNS path
all releases depend on one broken admission webhook

6. Reliability Controls in Kubernetes

Think of each Kubernetes object as a reliability lever.

Control	Reliability Purpose	Common Mistake
`replicas`	tolerate Pod failure	replica count chosen without traffic or zone model
`readinessProbe`	remove unsafe Pods from traffic	probe only proves the process is up, not that the Pod can serve requests
`livenessProbe`	recover wedged processes	probe too aggressive; causes restart storm
`startupProbe`	protect slow startup	missing for slow JVM/runtime boot
`resources.requests`	reserve scheduling capacity	requests too low; node pressure during peak
`resources.limits`	contain blast radius	CPU limit throttles latency-sensitive app
`PodDisruptionBudget`	protect against voluntary disruption	`minAvailable: 100%` blocks node drain
`topologySpreadConstraints`	reduce correlated failure	only spread by node, not zone
`affinity` / `antiAffinity`	influence placement	hard rules make Pods unschedulable
`priorityClass`	protect critical workloads	everything marked critical
`maxSurge` / `maxUnavailable`	control rollout capacity	unsafe values for low replica count
`terminationGracePeriodSeconds`	allow graceful shutdown	shorter than real in-flight request time
`preStop`	coordinate drain	used as sleep without understanding endpoint propagation
HPA/KEDA	scale replicas by signal	signal is lagging or wrong unit
Cluster Autoscaler/Karpenter/NAP	add nodes for pending Pods	missing subnet/IP/capacity planning
NetworkPolicy	contain lateral movement	default deny without DNS and observability egress

Reliability is not one feature. It is the interaction among these controls.

7. Service Reliability Classes

Do not use the same reliability standard for every workload.

Create workload classes.

7.1 Class C — Best Effort Internal

Use for:

admin tools
low criticality dashboards
batch helpers
non-production-like preview apps

Baseline:

1 replica acceptable
no cross-zone requirement
basic readiness/liveness
normal priority
no strict PDB
manual recovery acceptable

Not acceptable for:

customer-facing APIs
regulatory workflows
payment/event ingestion

7.2 Class B — Standard Production

Use for:

typical internal APIs
worker services with retryable work
business services with moderate impact

Baseline:

2+ replicas
readiness probe required
startup probe for slow boot
resource requests required
graceful shutdown required
PDB with at least one voluntary disruption allowed
topology spread across nodes
dashboard and alerts required

7.3 Class A — Critical Production

Use for:

public APIs
order creation
payment orchestration
enforcement/case lifecycle services
compliance-impacting workflows

Baseline:

3+ replicas
zone spread required
PDB required
strict readiness semantics
explicit rollout policy
autoscaling tested under load
dependency timeout/retry policy
error-budget policy
synthetic monitoring
incident runbook
restore drill if stateful

7.4 Class S — Regulated / Mission Critical

Use for:

financial settlement
audit-sensitive workflows
regulatory enforcement lifecycle
evidence chain systems
safety-critical processing

Baseline:

Class A plus evidence retention
immutable audit log
controlled change window
approval workflow
recovery test evidence
policy exception registry
multi-region or formally accepted regional risk
strict RPO/RTO
compliance mapping

8. SLO-to-Kubernetes Mapping

This is the bridge that most teams skip.

8.1 Example: Public API SLO

SLO:

99.9% of valid API requests complete successfully under 300 ms over 30 days.

Kubernetes design implications:

SLO Concern	Kubernetes Control
request success	readiness probe, dependency health, rollout policy
latency	CPU request, no excessive CPU limit, HPA by latency/RPS, node sizing
availability during deploy	`maxUnavailable: 0`, `maxSurge: 1 or 25%`, PDB
zone failure tolerance	topology spread by zone, multi-AZ load balancer
node drain tolerance	PDB, enough replicas, graceful shutdown
traffic safety	ingress health check aligned with readiness
dependency protection	timeout, retry budget, circuit breaker
diagnosis	logs, metrics, traces, events, deployment annotations

8.2 Example: Queue Consumer SLO

SLO:

99% of accepted events are processed within 60 seconds over 24 hours.

Kubernetes design implications:

SLO Concern	Kubernetes Control
lag	KEDA/HPA by queue lag or consumer lag
poison messages	dead-letter strategy, app-level handling
processing correctness	idempotency, checkpoint discipline
capacity	requests, node provisioning, batch size
graceful termination	stop polling, finish in-flight message, commit only after success
rollout safety	`maxUnavailable` low enough to preserve consumers
observability	lag, processing age, failure reason, retry count

8.3 Example: Scheduled Job SLO

SLO:

Daily report must be generated by 06:00 local business time with 99.5% monthly success.

Kubernetes design implications:

SLO Concern	Kubernetes Control
deadline	CronJob schedule, `startingDeadlineSeconds`, alert before deadline
retry	`backoffLimit`, idempotent job design
concurrency	`concurrencyPolicy: Forbid` or `Replace` based on semantics
data correctness	input snapshot/versioning
capacity	dedicated node pool or priority
failure visibility	job status export, log retention, alert on late run
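A quick check of what this SLO allows, assuming the SLI counts each scheduled run as one event. The helper name is illustrative.

```python
import math


def allowed_failed_runs(slo_percent: float, runs_in_window: int) -> int:
    """Whole runs that may miss the deadline without breaching the SLO."""
    budget = runs_in_window * (100.0 - slo_percent) / 100.0
    # Small epsilon guards against floating point landing just under an integer.
    return math.floor(budget + 1e-9)


print(allowed_failed_runs(99.5, 30))   # one run per day for a month
print(allowed_failed_runs(99.5, 365))  # one run per day for a year
```

With about 30 runs per month, a 99.5% target leaves no room for a single missed report. That is why in-deadline retries (`backoffLimit`), idempotent job design, and an alert before the deadline matter more here than they would for a high-volume API.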

9. PodDisruptionBudget: Useful, But Often Misunderstood

A PDB limits voluntary evictions. It does not protect against every failure.

Voluntary disruptions include:

node drain
cluster/node upgrade
autoscaler scale-down
manual eviction through eviction API

PDB does not protect against:

node hardware failure
kernel crash
cloud zone outage
direct Pod deletion
Deployment deletion
application crash
OOMKill
badly configured rollout

9.1 Good PDB for 3+ Replica Stateless API

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api-pdb
  namespace: prod-orders
spec:
  maxUnavailable: 1
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app.kubernetes.io/name: orders-api

Why maxUnavailable: 1?

Because it keeps node drains possible while still preventing too much voluntary disruption at once.

Why unhealthyPodEvictionPolicy: AlwaysAllow?

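Because an unhealthy Pod should usually not block maintenance forever. If a Pod is already unhealthy, preserving it does not preserve availability. The field is only available on recent Kubernetes versions (it became stable in 1.31), so confirm that your EKS or AKS cluster version supports it.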

9.2 Dangerous PDB

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bad-pdb
spec:
  minAvailable: 100%
  selector:
    matchLabels:
      app: api

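This can block node drain indefinitely: with minAvailable: 100%, the number of allowed disruptions is always zero, so every eviction request is refused. It is often created with good intent but poor operational semantics.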

A PDB must protect availability and allow maintenance.

If it prevents all maintenance, it creates a larger reliability risk.
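The difference between the two PDBs can be checked with a simplified model of how allowed disruptions are derived. This is an approximation of the disruption controller, not its implementation. It assumes percentages are rounded up, and the function names are illustrative.

```python
import math
from typing import Optional, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, total: int) -> int:
    """Resolve an int or "N%" string against total, rounding percentages up."""
    if isinstance(value, str):
        return math.ceil(total * int(value.rstrip("%")) / 100)
    return value


def disruptions_allowed(
    expected_pods: int,
    healthy_pods: int,
    min_available: Optional[IntOrPercent] = None,
    max_unavailable: Optional[IntOrPercent] = None,
) -> int:
    """Approximate how many voluntary evictions a PDB would permit right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _resolve(min_available, expected_pods)
    else:
        desired_healthy = expected_pods - _resolve(max_unavailable, expected_pods)
    return max(0, healthy_pods - desired_healthy)


print(disruptions_allowed(3, 3, max_unavailable=1))       # drain can proceed
print(disruptions_allowed(3, 3, min_available="100%"))    # drain is blocked
print(disruptions_allowed(3, 2, max_unavailable=1))       # one Pod already down
```

The third case is the one teams forget: a PDB that is maintenance-friendly when everything is healthy can still block a drain while one replica is already unhealthy.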

10. Topology Spread Constraints

Replica count alone is not enough.

Three replicas on one node do not give high availability.

Three replicas on three nodes in one zone do not survive zone failure.

Three replicas across three zones are a meaningful starting point.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: prod-orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: orders-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: orders-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: orders-api
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: orders-api
      containers:
        - name: app
          image: registry.example.com/orders-api@sha256:...
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
          startupProbe:
            httpGet:
              path: /started
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
          resources:
            requests:
              cpu: "500m"
              memory: "768Mi"
            limits:
              memory: "1Gi"

Hard zone spread is reasonable for critical services if capacity exists in every zone.

For less critical services, ScheduleAnyway may be better to avoid availability loss due to temporary capacity shortage.

The trade-off:

`whenUnsatisfiable`	Benefit	Risk
`DoNotSchedule`	preserves placement invariant	Pods may remain pending
`ScheduleAnyway`	favors availability now	may weaken failure-domain isolation

Good engineering judgment is not memorizing which one is better. It is knowing which failure you are choosing.
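To see how maxSkew decides placement, here is a minimal sketch of the skew check for one candidate zone. It models only the counting rule (matching Pods in the candidate zone after placement, minus the minimum across zones) and ignores node capacity, taints, and other scheduler filters. The zone names are hypothetical.

```python
def can_place(counts: dict[str, int], zone: str, max_skew: int = 1) -> bool:
    """Return True if adding one Pod to `zone` keeps skew within max_skew.

    Skew is the matching Pod count in the candidate zone after placement
    minus the minimum matching Pod count across all eligible zones.
    """
    after = counts[zone] + 1
    return after - min(counts.values()) <= max_skew


counts = {"zone-a": 1, "zone-b": 1, "zone-c": 0}
for zone in counts:
    print(zone, can_place(counts, zone))
```

With this distribution, only zone-c is acceptable. If zone-c has no free capacity, DoNotSchedule leaves the Pod pending, while ScheduleAnyway would place it in zone-a or zone-b and accept the skew.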

11. Rollout Reliability

A Deployment rollout is a controlled failure injection.

You are intentionally killing old Pods and introducing new Pods.

Unsafe rollout parameters can violate an SLO even when Kubernetes behaves correctly.

11.1 Critical API Rollout Baseline

apiVersion: apps/v1
kind: Deployment
metadata:
  name: case-api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 20
  progressDeadlineSeconds: 600

This means:

do not voluntarily reduce available capacity during rollout
add at most one extra Pod above desired replica count
only count Pods as available after they remain ready for 20 seconds
mark the rollout as failed if it makes no progress for 600 seconds (this does not trigger an automatic rollback)
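The capacity bounds implied by maxSurge and maxUnavailable can be computed directly. This sketch follows the Deployment rounding rule for percentages (maxSurge rounds up, maxUnavailable rounds down). The function name is illustrative.

```python
import math
from typing import Union

IntOrPercent = Union[int, str]


def rollout_bounds(
    replicas: int, max_surge: IntOrPercent, max_unavailable: IntOrPercent
) -> tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rolling update."""

    def resolve(value: IntOrPercent, round_up: bool) -> int:
        if isinstance(value, str):
            raw = replicas * int(value.rstrip("%")) / 100
            return math.ceil(raw) if round_up else math.floor(raw)
        return value

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(4, 1, 0))           # the baseline above
print(rollout_bounds(4, "25%", "25%"))   # Deployment defaults on 4 replicas
print(rollout_bounds(10, "25%", "25%"))  # Deployment defaults on 10 replicas
```

With the default 25%/25% on four replicas, one quarter of serving capacity may be gone during every deploy. The baseline above keeps all four available at the cost of needing room for a fifth Pod.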

11.2 Common Rollout Failure Modes

Failure	Symptom	Cause	Control
readiness too weak	traffic errors after rollout	readiness does not check real serving ability	improve readiness contract
startup too slow	CrashLoopBackOff	liveness kills app before boot	startup probe
capacity drop	latency spike during deploy	`maxUnavailable` too high	reduce maxUnavailable, increase surge
node full	new Pods pending	no surge capacity	cluster autoscaler/Karpenter/NAP readiness
bad dependency	new version ready but failing user path	readiness checks only local process	synthetic and canary checks
rollback false confidence	rollback succeeds but data incompatible	schema/state change not backward compatible	expand/contract migration

12. Reliability and Resource Management

Under-sized requests create hidden unreliability.

The scheduler places Pods using requests. If requests are too low, the cluster admits too much work onto the node. Under real load, the node hits pressure and kubelet starts eviction.

12.1 Memory

Memory is not compressible.

If a container exceeds its memory limit, it may be OOMKilled.

If a node experiences memory pressure, kubelet may evict Pods according to eviction policy and QoS class.

Reliability impact:

BestEffort Pods are easiest to evict.
Burstable Pods may be evicted before Guaranteed depending on usage relative to requests.
Guaranteed can protect critical workloads but can reduce bin-packing efficiency.

12.2 CPU

CPU is compressible.

CPU limit does not kill the process; it throttles it.

Reliability impact:

excessive CPU limits can create latency spikes
missing CPU requests can cause poor scheduling
HPA CPU percentage depends on CPU requests

For latency-sensitive services, prefer measured CPU requests and cautious CPU limits.
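The dependency between HPA and CPU requests is easy to see in the scaling formula, where desired replicas are the current replicas scaled by current over target utilization, rounded up. This sketch ignores the HPA tolerance band, stabilization windows, and min/max replica clamps. The numbers are hypothetical.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    avg_cpu_usage_m: float,
    cpu_request_m: float,
    target_utilization_pct: float,
) -> int:
    """Desired replicas for a CPU utilization target (simplified)."""
    # Utilization is measured relative to the request, not to the node or limit.
    utilization_pct = 100.0 * avg_cpu_usage_m / cpu_request_m
    return math.ceil(current_replicas * utilization_pct / target_utilization_pct)


# Same real usage (400m per Pod), same 70% target, different requests.
for request_m in (250, 500, 1000):
    print(request_m, hpa_desired_replicas(4, 400, request_m, 70))
```

The workload did not change between the three runs. Only the request did, and the HPA answer moved from scaling in to more than doubling the Deployment. A guessed request therefore produces a guessed scaling policy.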

13. Reliability and Autoscaling

Autoscaling is not instant reliability.

Autoscaling has delay:

load increase
  -> metric changes
  -> metrics pipeline delay
  -> HPA/KEDA decision
  -> replica creation
  -> scheduler placement
  -> image pull
  -> app startup
  -> readiness
  -> load balancer endpoint propagation

This delay is your scaling reaction time.

If your SLO cannot tolerate that delay, you need warm capacity.
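A rough way to size warm capacity is to add up the stages and ask how much traffic arrives before new replicas can serve. Every duration and rate below is a hypothetical placeholder; measure your own. The model assumes linear traffic growth.

```python
import math


def reaction_time_seconds(stage_seconds: dict[str, int]) -> int:
    """Total delay from load increase to new replicas receiving traffic."""
    return sum(stage_seconds.values())


def warm_replicas_needed(
    current_rps: int, growth_rps_per_second: int, reaction_seconds: int, rps_per_replica: int
) -> int:
    """Replicas that must already be running to absorb growth during the reaction time."""
    peak_before_scale = current_rps + growth_rps_per_second * reaction_seconds
    return math.ceil(peak_before_scale / rps_per_replica)


stages = {
    "metrics pipeline": 30,
    "HPA/KEDA decision": 15,
    "scheduling": 5,
    "image pull": 20,
    "app startup": 25,
    "readiness": 10,
    "load balancer propagation": 15,
}
reaction = reaction_time_seconds(stages)
print(reaction, warm_replicas_needed(1000, 5, reaction, 100))
```

Under these assumptions, a service at 1000 RPS growing by 5 RPS per second needs 16 replicas already running, not the 10 that current load alone would suggest. If a node must also be provisioned, add that time to the stages.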

13.1 Autoscaling Failure Modes

Failure	Symptom	Control
wrong metric	scales but SLO still burns	scale on user or queue signal
cold start too long	spike during traffic burst	minimum replicas, faster startup, pre-warmed nodes
no node capacity	Pods pending	cluster autoscaler/Karpenter/NAP, subnet/IP capacity
too aggressive scale down	repeated cold starts	stabilization window
dependency saturation	more Pods make system worse	backpressure, concurrency limits
HPA/VPA conflict	unstable sizing	run VPA in recommendation-only mode (`updateMode: "Off"`) or separate responsibilities

13.2 Reliability Rule

Do not scale blindly from application load alone. Scale against the bottleneck.

If the bottleneck is database connections, adding more API Pods can reduce reliability.
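For the database-connection case, the ceiling on useful replicas is simple arithmetic. This sketch assumes each Pod holds a fixed-size connection pool and that some connections are reserved for migrations and operators. All numbers are hypothetical.

```python
def max_safe_replicas(
    db_max_connections: int, reserved_connections: int, pool_size_per_pod: int
) -> int:
    """Largest replica count whose combined pools fit within the database limit."""
    usable = db_max_connections - reserved_connections
    return max(0, usable // pool_size_per_pod)


print(max_safe_replicas(db_max_connections=500, reserved_connections=50, pool_size_per_pod=20))
```

An HPA maxReplicas above this number cannot add capacity. It can only add connection errors. Remember that a rolling update with surge temporarily runs more Pods than the replica count.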

14. Dependency Failure Modeling

Most Kubernetes incidents are not purely Kubernetes incidents.

A user journey usually crosses:

ingress controller
application Pod
identity provider
secrets provider
database
cache
message broker
object storage
external API
DNS
certificate chain

Model each dependency as a failure object.

Dependency	Failure Mode	Platform Evidence	App Control
database	connection exhaustion	latency, connection errors, DB metrics	pooling, timeout, circuit breaker
cache	unavailable or stale	connection errors, cache hit drop	fallback, TTL discipline
queue	lag spike	consumer lag, processing age	KEDA, backpressure, DLQ
object storage	throttling	429/503, retry metrics	retry budget, idempotency
identity	token exchange failure	STS/Entra errors	token cache, fail closed/open decision
secrets	mount/sync failure	CSI/ExternalSecrets events	startup validation, rotation runbook
DNS	lookup failure	CoreDNS errors, NXDOMAIN	DNS monitoring, caching

The platform should expose enough telemetry to prove which dependency caused the SLO burn.
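A simple model shows why the dependency chain matters. If the journey needs every dependency to succeed and failures are independent, the journey's availability is at most the product of the individual availabilities. Both conditions are simplifying assumptions, and the figures are hypothetical.

```python
import math


def journey_availability(dependency_availability_pct: dict[str, float]) -> float:
    """Upper bound on journey availability (percent) for hard, independent dependencies."""
    return 100.0 * math.prod(a / 100.0 for a in dependency_availability_pct.values())


chain = {
    "ingress controller": 99.99,
    "application Pod": 99.95,
    "identity provider": 99.9,
    "database": 99.95,
    "DNS": 99.99,
}
print(f"{journey_availability(chain):.2f}%")
```

Five dependencies that each look excellent combine to roughly 99.78%, which is below a 99.9% SLO. Caching, fallbacks, and asynchronous paths are how a journey stops being a strict product of its dependencies.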

15. EKS Reliability Design Notes

15.1 Multi-AZ Is Not Optional for Critical Workloads

For Class A/S workloads:

use subnets in multiple Availability Zones
spread node capacity across zones
ensure load balancer subnets exist in required zones
configure topology spread by topology.kubernetes.io/zone
review EBS volume topology if stateful

A stateless API with replicas across zones can survive a node or zone problem better than one packed into a single zone.

15.2 EBS Is Zonal

An EBS volume lives in a single Availability Zone and can only be attached to instances in that same zone.

Reliability implication:

StatefulSet Pods with EBS-backed PVCs are constrained by volume zone.
Rescheduling to another zone is not simple failover.
Backup/restore and replication strategy matter.

If the workload needs multi-AZ shared filesystem semantics, evaluate EFS carefully, including latency and consistency expectations.

15.3 Node Provisioning

EKS node reliability depends on the provisioning model:

Model	Reliability Strength	Risk
Managed Node Groups	predictable, AWS-managed lifecycle	less flexible packing/provisioning
Karpenter	fast, workload-aware provisioning	configuration complexity
EKS Auto Mode	managed provisioning/operation	less low-level control
Fargate	no node management for some workloads	constraints and unsupported patterns

For critical workloads, do not rely on autoscaling alone. Use minimum warm capacity and test scale from zero separately.

15.4 Load Balancer Target Health

EKS traffic path depends on AWS load balancer health checks.

Check alignment:

load balancer health check path
Kubernetes readiness path
Pod termination behavior
target group deregistration delay
app graceful shutdown time

If these disagree, traffic may reach a Pod that Kubernetes already considers not ready, or a target may be deregistered too early or too late.
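One way to make this alignment reviewable is to write the timing relationships down as checks. The three rules below are rules of thumb, not AWS or Kubernetes requirements, and every parameter name is hypothetical. Measure `lb_propagation_s` in your own environment.

```python
def shutdown_findings(
    termination_grace_s: int,
    pre_stop_delay_s: int,
    max_request_s: int,
    lb_propagation_s: int,
    lb_deregistration_delay_s: int,
) -> list[str]:
    """Return timing mismatches between Pod shutdown and load balancer draining."""
    findings = []
    if pre_stop_delay_s < lb_propagation_s:
        findings.append("preStop delay ends before the load balancer stops sending new requests")
    if termination_grace_s < pre_stop_delay_s + max_request_s:
        findings.append("grace period is shorter than preStop delay plus the longest request")
    if lb_deregistration_delay_s < max_request_s:
        findings.append("deregistration delay is shorter than the longest request")
    return findings


for finding in shutdown_findings(
    termination_grace_s=30,
    pre_stop_delay_s=10,
    max_request_s=25,
    lb_propagation_s=15,
    lb_deregistration_delay_s=20,
):
    print(finding)
```

Each finding corresponds to a class of errors users see during deploys and node drains, such as connection resets or 5xx responses for requests that were in flight.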

16. AKS Reliability Design Notes

16.1 Availability Zones and Node Pools

For critical workloads:

use zone-capable regions
spread node pools across zones
avoid single-zone system bottlenecks
enforce topology spread by zone
test node image upgrade with surge capacity

16.2 Azure Disk Is Zonal/Regional Depending on Disk Type

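Most AKS stateful workloads using Azure Disk have placement constraints: a zonal (locally redundant, LRS) disk can only be attached to a node in the same zone, while a zone-redundant (ZRS) disk can follow the Pod to another zone in the region.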

Reliability implication:

Pod recovery follows disk placement rules
zone failure can affect volume availability
restore and replication strategy must be explicit

Azure Files may provide shared filesystem behavior, but it has its own latency, throughput, and semantics trade-offs.

16.3 Egress Dependency

AKS production designs often route egress through:

NAT Gateway
Azure Firewall
UDR
private endpoints
proxy appliances

A reliable app can fail because egress to identity, registry, storage, or dependency endpoints is blocked.

Always include egress path in failure modeling.

16.4 Managed Identity and Key Vault

AKS workload identity improves secretless access, but reliability depends on:

OIDC issuer availability
federated credential correctness
managed identity permissions
Key Vault network/firewall rules
token cache behavior

Treat identity as a runtime dependency, not just a security setting.

17. Failure Modeling Method

Use this sequence for every critical service.

Step 1 — Define the User Journey

Example:

User submits enforcement case escalation request.

Include:

entry point
service chain
data writes
events emitted
external dependencies
success definition

Step 2 — Define SLIs

Example:

request success rate
p95 latency
durable persistence success
event emission success
duplicate prevention rate

Step 3 — Define SLO

Example:

99.9% of valid escalation requests must be accepted, persisted,
and return a final synchronous response under 800 ms over 30 days.

Step 4 — Draw the Runtime Path

Step 5 — Enumerate Failure Points

Point	Failure	User Impact	Detection	Mitigation
Edge	TLS expired	total outage	synthetic check	certificate alert + renewal runbook
API Pod	CrashLoop	partial outage	Pod restart alert	rollback, inspect logs
DB	pool exhausted	slow/failing requests	DB metrics, trace spans	limit concurrency, tune pool
Kafka	publish timeout	request failure or async loss	producer error metric	outbox pattern
Identity	token failure	secret/cloud access failure	STS/Entra error logs	scoped retry/cache
Node	memory pressure	evictions	node condition alert	right-size requests
Zone	capacity loss	capacity reduction	zone-level metrics	topology spread, multi-zone LB

Step 6 — Map Failure to Controls

For every failure, identify:

prevention
detection
containment
recovery
evidence

If one of those is missing, reliability is incomplete.
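The completeness rule can be enforced mechanically during review. This is a minimal sketch; the `FailureControls` record and its example values are illustrative, with field names taken from the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class FailureControls:
    failure: str
    prevention: str = ""
    detection: str = ""
    containment: str = ""
    recovery: str = ""
    evidence: str = ""


def missing_controls(entry: FailureControls) -> list[str]:
    """Names of control fields that are empty for this failure."""
    return [
        f.name
        for f in fields(entry)
        if f.name != "failure" and not getattr(entry, f.name).strip()
    ]


tls_expired = FailureControls(
    failure="TLS expired at edge",
    prevention="automated certificate renewal",
    detection="synthetic check",
    recovery="renewal runbook",
)
print(missing_controls(tls_expired))  # ['containment', 'evidence']
```

A check like this can run in CI against a reliability review file, so that a failure point without containment or evidence blocks approval instead of surfacing during an incident.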

18. Reliability Review Template

Use this before approving a workload into production.

# Reliability Review: <service-name>

## 1. User Journey
- Capability:
- Criticality class:
- Primary users:
- Business impact of failure:

## 2. SLO
- SLI:
- SLO target:
- Window:
- Exclusions:
- Error budget policy:

## 3. Runtime Path
- Edge:
- Service:
- Database:
- Queue/stream:
- External APIs:
- Identity:
- Secrets:

## 4. Kubernetes Controls
- Replicas:
- Readiness:
- Liveness:
- Startup:
- Resources:
- PDB:
- Topology spread:
- Autoscaling:
- Rollout strategy:
- Priority class:

## 5. Cloud Controls
- EKS/AKS cluster:
- Node pools:
- Zones:
- Load balancer:
- DNS:
- Certificate:
- Storage:
- Identity:

## 6. Failure Model
| Failure | Expected impact | Detection | Recovery | Owner |
|---|---|---|---|---|

## 7. Runbooks
- Pod crash:
- Rollout failure:
- Node drain blocked:
- Dependency outage:
- DNS/TLS failure:
- Identity failure:

## 8. Test Evidence
- Load test:
- Rollout test:
- Node drain test:
- Zone disruption assumption:
- Backup/restore:

19. Failure Injection Exercises

Do not wait for production incidents to learn failure behavior.

Exercise	Method	Expected Learning
Pod kill	delete one Pod	replica and readiness behavior
Deployment bad image	deploy nonexistent image	rollout failure and alert path
Slow startup	increase startup delay	startup probe correctness
Memory pressure	run controlled memory stress	eviction and QoS behavior
Node drain	cordon/drain node	PDB and topology behavior
Zone capacity simulation	taint nodes in one zone	scheduling and traffic impact
DNS failure simulation	block DNS egress in test namespace	dependency on CoreDNS
Secret rotation	rotate secret while app runs	reload/restart contract
Identity permission removal	remove cloud permission	error visibility and fail mode
Queue backlog	inject event spike	KEDA/HPA behavior

Rules:

Run in non-prod first.
Define expected outcome before test.
Capture metrics/logs/events/traces.
Convert surprise into platform improvement.

20. Runbook: SLO Burn Investigation

20.1 Start From User Impact

# Identify which SLO is burning
# This is usually done in the observability platform, not with kubectl.

Ask:

Is the burn availability, latency, correctness, freshness, or durability?
Is it global or isolated to one route/tenant/zone?
Did it start after deploy, scale event, node drain, certificate rotation, policy change, or dependency incident?

20.2 Check Deployment State

kubectl -n prod-orders rollout status deployment/orders-api
kubectl -n prod-orders get deploy,rs,pod -l app.kubernetes.io/name=orders-api -o wide
kubectl -n prod-orders describe deployment orders-api

Look for:

rollout stuck
unavailable replicas
pending Pods
old/new ReplicaSet overlap
image pull errors
readiness failures

20.3 Check Pod Health and Events

kubectl -n prod-orders get pod -l app.kubernetes.io/name=orders-api -o wide
kubectl -n prod-orders describe pod <pod-name>
kubectl -n prod-orders logs <pod-name> --previous
kubectl -n prod-orders get events --sort-by=.lastTimestamp

Look for:

OOMKilled
probe failures
CrashLoopBackOff
failed mounts
failed scheduling
node pressure

20.4 Check Capacity and Scheduling

kubectl top pod -n prod-orders
kubectl top node
kubectl get nodes -L topology.kubernetes.io/zone
kubectl describe node <node-name>

Look for:

memory pressure
disk pressure
CPU saturation
uneven zone distribution
taints blocking scheduling
missing capacity in one node pool

20.5 Check PDB and Drain Constraints

kubectl -n prod-orders get pdb
kubectl -n prod-orders describe pdb orders-api-pdb

Look for:

Allowed disruptions: 0 (status.disruptionsAllowed)
selector mismatch
too strict minAvailable
unhealthy Pods blocking eviction

20.6 Check Traffic Path

kubectl -n prod-orders get svc,endpointslices
kubectl -n ingress-system get pod,svc

For EKS, also inspect load balancer target health and AWS Load Balancer Controller logs.

For AKS, inspect Azure Load Balancer/Application Gateway/Application Gateway for Containers health and controller logs.

20.7 Check Dependency Path

Use traces and dependency dashboards.

Validate:

database latency and connection pool
queue lag
external API throttling
identity token failures
Key Vault/Secrets Manager access
DNS errors

20.8 Decide Action

Possible actions:

rollback release
pause rollout
increase replicas
increase node capacity
remove bad Pod from traffic
fix dependency configuration
apply emergency NetworkPolicy exception
rotate certificate/secret
restore previous config

Do not blindly restart Pods. Restarting removes evidence and may amplify load.

21. The Reliability Anti-Patterns

21.1 All Green Dashboards, Broken User Journey

Cause:

dashboards monitor infrastructure state, not user-visible behavior

Fix:

add synthetic checks and request-level SLIs

21.2 Probe as Dependency Health Check

Cause:

readiness checks every downstream dependency

Risk:

transient downstream issue removes all Pods from traffic

Fix:

readiness should answer: can this Pod accept traffic safely now?
dependency degradation should be managed with circuit breaker/backpressure where appropriate

21.3 PDB That Blocks Maintenance

Cause:

minAvailable: 100%

Fix:

use maxUnavailable: 1 where possible
test node drains

21.4 One Replica with a 99.9% SLO

Cause:

SLO not translated into architecture

Fix:

require reliability class before production approval

21.5 Autoscaling Without Warm Capacity

Cause:

assuming scaling is instantaneous

Fix:

minimum replicas, pre-warmed nodes, startup optimization

21.6 Zone Spread Without Capacity Planning

Cause:

hard topology spread with insufficient per-zone capacity

Fix:

capacity reservation, node pool spread, autoscaler validation

21.7 Control Plane Dependency Hidden in App SLO

Cause:

app deployment/recovery depends on Kubernetes API, admission webhooks, GitOps, and cloud APIs

Fix:

model delivery/control-plane failure separately from serving-path failure

22. Decision Matrix

22.1 How Many Replicas?

Requirement	Baseline
dev/test	1
internal non-critical	1-2
standard production API	2+
critical stateless API	3+ across zones
zone failure tolerance	at least one healthy replica per surviving zone, plus enough capacity
high traffic + deploy safety	enough replicas to absorb `maxUnavailable` and load during rollout
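
The last two rows of the table can be turned into arithmetic. The sketch below is one conservative way to do it, assuming evenly spread replicas, a known per-Pod throughput, and that a zone loss and a rollout may overlap; `replicas_needed` and its parameters are hypothetical names, not a Kubernetes feature.

```python
import math


def replicas_needed(peak_rps, per_pod_rps, zones=1,
                    tolerate_zone_loss=False, max_unavailable=0):
    """Replica count that still carries peak load in the worst modeled case.

    Worst case modeled here: max_unavailable Pods are down for a rollout
    and, if requested, one whole zone is gone at the same time.
    """
    serving = math.ceil(peak_rps / per_pod_rps)   # Pods needed to carry peak
    required = serving + max_unavailable          # headroom for the rollout
    if tolerate_zone_loss and zones > 1:
        # The surviving (zones - 1) zones must hold `required` Pods.
        per_zone = math.ceil(required / (zones - 1))
        return per_zone * zones
    return required
```

For example, 900 requests per second at 100 per Pod needs 9 replicas with no failure tolerance, and 15 replicas (5 per zone across 3 zones) once a zone loss and `maxUnavailable: 1` are both accounted for.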

22.2 PDB Strategy

Workload	Suggested PDB
1 replica best effort	usually none; if a PDB is set, explicitly accept that it will block node drains
2 replicas	`maxUnavailable: 1` if app tolerates 50% capacity loss briefly
3+ replicas	`maxUnavailable: 1` baseline
quorum system	design based on quorum rules, not generic template
stateful primary/replica	app-aware disruption rules required

22.3 Topology Strategy

Workload	Strategy
best effort	no hard zone spread
standard production	spread by hostname, prefer zone spread
critical stateless	hard or strong preferred zone spread
stateful zonal storage	align Pod placement with storage topology
ingress/controller	spread across nodes/zones; protect with PDB

23. Reliability Scorecard

Use this for a quick review.

Item	Score 0	Score 1	Score 2
SLO	none	vague target	measurable SLI/SLO with owner
Replicas	1	2	3+ and zone-aware
Probes	missing	basic	realistic startup/readiness/liveness
Resources	missing	guessed	measured and reviewed
PDB	missing	present but untested	tested and maintenance-friendly
Topology	none	node spread	zone + node failure-aware
Rollout	default	configured	tested with failure scenario
Autoscaling	none	basic HPA	signal-driven and load-tested
Observability	logs only	dashboards	SLO + trace + runbook
Dependency model	implicit	documented	tested with failure injection
Cloud path	implicit	partially documented	LB/DNS/TLS/IAM/storage modeled
Recovery	tribal	basic runbook	practiced and evidenced

Interpretation:

below 12: not production-ready for an important workload
12-18: standard production with gaps
19-24: strong production posture
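
If the scorecard is filled in during reviews, the tally can be automated. A small sketch, assuming each of the twelve items is scored 0, 1, or 2; the function name and the dictionary format are illustrative.

```python
ITEMS = (
    "SLO", "Replicas", "Probes", "Resources", "PDB", "Topology",
    "Rollout", "Autoscaling", "Observability", "Dependency model",
    "Cloud path", "Recovery",
)


def interpret(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the scorecard and map the total to the interpretation bands."""
    missing = [item for item in ITEMS if item not in scores]
    if missing:
        raise ValueError(f"unscored items: {missing}")
    if any(scores[item] not in (0, 1, 2) for item in ITEMS):
        raise ValueError("each item must be scored 0, 1, or 2")
    total = sum(scores[item] for item in ITEMS)
    if total < 12:
        verdict = "not production-ready for an important workload"
    elif total <= 18:
        verdict = "standard production with gaps"
    else:
        verdict = "strong production posture"
    return total, verdict
```

A workload that scores 1 on every item totals 12 and lands at the bottom of the "standard production with gaps" band, which is a useful reminder that a total hides which individual items are still at 0.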

24. Capstone Exercise

Design reliability for a case-escalation-api deployed on both EKS and AKS.

Requirements:

SLO: 99.9% successful escalation requests under 800 ms over 30 days.
Must survive one node failure without user-visible outage.
Must tolerate planned node drain.
Must not lose an accepted escalation event.
Must expose evidence for incident review.
Must clearly distinguish EKS and AKS cloud dependencies.
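
Before writing the deliverables, it helps to turn the SLO requirement into an error budget. The sketch below assumes a request-based SLI where a request is good only if it succeeds and completes in under 800 ms; the function names and the request volume are hypothetical.

```python
from fractions import Fraction

SLO_TARGET = Fraction("0.999")   # 99.9% good requests
LATENCY_THRESHOLD_MS = 800
WINDOW_DAYS = 30


def is_good(success: bool, latency_ms: float) -> bool:
    """A request counts as good only if it succeeded AND was fast enough."""
    return success and latency_ms < LATENCY_THRESHOLD_MS


def allowed_bad_requests(total_requests: int) -> int:
    """Error budget in requests for the whole window."""
    return int(total_requests * (1 - SLO_TARGET))


def budget_consumed(total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    return bad_requests / allowed_bad_requests(total_requests)


def full_outage_minutes() -> float:
    """Budget expressed as minutes of total outage at constant traffic."""
    return float(WINDOW_DAYS * 24 * 60 * (1 - SLO_TARGET))
```

At an assumed 2,000,000 escalation requests per 30 days, the budget is 2,000 bad requests, or the equivalent of 43.2 minutes of full outage at constant traffic. A slow success (for example 200 OK in 950 ms) spends budget exactly like an error, which is why latency belongs in the SLI definition rather than in a separate dashboard.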

Deliverables:

SLI/SLO definition.
Workload reliability class.
Deployment YAML with probes, resources, rollout, topology spread.
PDB YAML.
Failure model table.
EKS-specific reliability notes.
AKS-specific reliability notes.
SLO burn runbook.
Failure injection plan.

25. Production Checklist

Before approving a critical workload:

26. References

Official and primary references for this part:

Kubernetes Documentation — Disruptions: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
Kubernetes Documentation — Configure Pod Disruption Budget: https://kubernetes.io/docs/tasks/run-application/configure-pdb/
Kubernetes Documentation — Scheduling, Preemption and Eviction: https://kubernetes.io/docs/concepts/scheduling-eviction/
Kubernetes Documentation — Node-pressure Eviction: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
Kubernetes Documentation — Topology Spread Constraints: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
Kubernetes Documentation — Resource Management: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
Google SRE Book — Service Level Objectives: https://sre.google/sre-book/service-level-objectives/
Google SRE Workbook — Implementing SLOs: https://sre.google/workbook/implementing-slos/
AWS EKS Best Practices Guide: https://docs.aws.amazon.com/eks/latest/best-practices/introduction.html
Azure AKS Baseline Architecture: https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/containers/aks/baseline-aks
