Reliability Engineering, SLOs, and Failure Modelling

Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering - Part 028

Reliability engineering for Kubernetes platforms, including SLOs, SLIs, error budgets, availability thinking, probes, graceful shutdown, disruption budgets, rollout reliability, dependency failure, blast radius, resilience patterns, and failure modelling.

Part 028 — Reliability Engineering, SLOs, and Failure Modelling

1. Why This Part Exists

Kubernetes can restart containers.

That does not make your system reliable.

Kubernetes can reschedule Pods.

That does not make your system available.

Kubernetes can roll out new versions.

That does not make deployment safe.

Kubernetes can autoscale replicas.

That does not guarantee latency, correctness, or user trust.

Reliability is not the absence of failure. In distributed systems, failure is normal.

Reliability is the ability to keep user-visible behavior within an acceptable boundary while parts of the system fail, change, overload, restart, move, or degrade.

This part connects Kubernetes primitives to reliability engineering:

  • SLI
  • SLO
  • error budget
  • blast radius
  • graceful shutdown
  • readiness and liveness
  • disruption budget
  • redundancy
  • rollout safety
  • dependency failure
  • capacity margin
  • failure modelling
  • operational maturity

The goal is to stop thinking:

Is the Pod running?

And start thinking:

Is the user journey reliable under expected failure modes?

2. Kaufman Skill Target

The practical skill target is:

Given a Kubernetes workload and deployment model, identify the reliability objective, map failure modes, choose Kubernetes controls, and design a safe operating envelope.

After this part, you should be able to:

  1. Define meaningful SLIs and SLOs for Kubernetes workloads.
  2. Connect error budgets to release and operations decisions.
  3. Model failure at Pod, Node, Zone, dependency, rollout, and traffic layers.
  4. Use probes without creating cascading outages.
  5. Design graceful shutdown so Pod termination does not drop requests.
  6. Use PodDisruptionBudget correctly and understand its limits.
  7. Evaluate rollout strategy against user impact and rollback safety.
  8. Reason about availability, redundancy, quorum, capacity, and blast radius.
  9. Build reliability checklists for platform and application teams.

The aim is not academic SRE theory.

The aim is operational judgment.


3. Mental Model: Kubernetes Reliability Is Layered

A reliable Kubernetes application is not reliable because one object is configured well.

It is reliable because several layers cooperate.

A Deployment with three replicas can still be unreliable if:

  • all replicas are on the same node
  • readiness probes remove every Pod during a dependency blip
  • liveness probes restart healthy-but-slow Pods
  • shutdown drops in-flight requests
  • rollout introduces incompatible schema changes
  • HPA reacts too late
  • PDB blocks node maintenance or allows too much disruption
  • dependency timeouts exceed request deadlines
  • retries amplify overload
  • logs exist but no user-facing SLI exists

Reliability emerges from system design, not from replica count alone.


4. From Pod Health to User Reliability

Kubernetes has object health.

Users experience journey health.

These are not the same.

| Kubernetes Signal | What It Tells You | What It Does Not Tell You |
|---|---|---|
| Pod Running | Container process exists | App is useful to users |
| Pod Ready | Pod is eligible for traffic | User journey succeeds end-to-end |
| Deployment available replicas | Controller has minimum availability | Latency/error budget is healthy |
| HPA scaled replicas | Metric crossed threshold | Capacity is enough for SLO |
| Node Ready | Node can run Pods | Workload is correctly distributed |
| PDB healthy | Voluntary disruption is bounded | Involuntary failures are handled |

Reliability must be measured at the boundary users care about.

For a payment API, the SLI is not:

Number of running Pods.

It is closer to:

Percentage of valid payment authorization requests completed successfully within 500 ms.

5. SLI, SLO, and Error Budget

5.1 Definitions

| Term | Meaning |
|---|---|
| SLI | Service Level Indicator: measured signal of service behavior |
| SLO | Service Level Objective: target value for an SLI over a window |
| Error budget | Allowed unreliability before the SLO is violated |
| SLA | External contractual promise, often with business/legal consequences |

5.2 Example

SLI:
  successful HTTP requests under 300 ms / all valid HTTP requests

SLO:
  99.9% over 30 rolling days

Error budget:
  0.1% of valid requests may fail or exceed latency threshold
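
The example above can be turned into arithmetic. The following is a minimal sketch, assuming request records are already available in memory; the `Request` type, its field names, and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    valid: bool        # False for requests excluded from the SLI (for example malformed input)
    succeeded: bool    # True when the response counts as a success
    latency_ms: float  # observed latency for this request


def compute_sli(requests, latency_threshold_ms=300.0):
    """Fraction of valid requests that succeeded under the latency threshold."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None  # no valid traffic, so the SLI is undefined for this window
    good = sum(1 for r in valid if r.succeeded and r.latency_ms < latency_threshold_ms)
    return good / len(valid)


def error_budget_remaining(sli, slo=0.999):
    """Unspent share of the error budget: 1.0 is untouched, 0.0 is exhausted, negative is overspent."""
    allowed_bad = 1.0 - slo
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad
```

With a 99.9% SLO, one bad request out of 2000 valid requests leaves about half of the budget, while one bad request out of 1000 spends all of it.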

5.3 Why Error Budget Matters

Without an error budget, teams argue emotionally:

Can we deploy faster?
Is this system reliable enough?
Should we stop feature work?
Was the incident bad?

With an error budget, the decision becomes grounded:

If we are burning budget too fast, reduce release risk and prioritize reliability.
If budget is healthy, controlled risk is acceptable.

Kubernetes does not create the error budget.

Kubernetes gives you mechanisms to protect it.


6. Good SLIs for Kubernetes Workloads

Choose SLIs close to user outcomes.

6.1 HTTP API

| SLI | Notes |
|---|---|
| Availability rate | Successful valid requests / total valid requests |
| Latency percentile | Usually p95 or p99, not average |
| Error rate | 5xx, selected 4xx, dependency-mapped errors |
| Saturation | CPU, memory, queue depth, connection pool, worker utilization |

6.2 Async Worker

| SLI | Notes |
|---|---|
| Queue lag | Age of oldest unprocessed message |
| Processing success rate | Successful jobs / attempted jobs |
| Retry rate | High retry may indicate hidden failure |
| Dead-letter rate | Data loss or manual repair indicator |
| End-to-end processing time | User/business outcome delay |

6.3 Batch Job

| SLI | Notes |
|---|---|
| Completion success | Job finishes successfully |
| Completion time | Meets expected window |
| Correctness | Output count/checksum/reconciliation |
| Retry exhaustion | Jobs that hit retry limit |

6.4 Stateful Service

| SLI | Notes |
|---|---|
| Read/write availability | Separate read and write paths |
| Replication lag | Data freshness |
| Commit latency | User-visible write performance |
| Quorum health | Minimum members for consistency |
| Backup freshness | Recovery guarantee |

7. Availability Math Without Illusions

Availability percentage feels abstract.

Translate it into downtime budget.

Approximate monthly downtime:

| SLO | Monthly Error Budget |
|---|---|
| 99% | ~7h 18m |
| 99.5% | ~3h 39m |
| 99.9% | ~43m 49s |
| 99.95% | ~21m 54s |
| 99.99% | ~4m 23s |
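
The table values can be reproduced with a few lines of arithmetic. This sketch assumes an average month of 30.4375 days (365.25 / 12), which appears to match the figures above; the function name is hypothetical.

```python
def downtime_budget_seconds(slo_percent, window_days=30.4375):
    """Seconds of full downtime allowed per window for an availability SLO."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (1 - slo_percent / 100)
```

Calling `downtime_budget_seconds(99.9)` gives roughly 2630 seconds, which is the ~43m 49s row. Passing `window_days=30` instead models a 30-day rolling window and yields slightly smaller budgets.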

The trap is assuming Kubernetes makes high availability cheap.

A 99.99% user-facing SLO requires:

  • fast detection
  • safe rollout
  • redundancy
  • dependency reliability
  • capacity margin
  • tested rollback
  • minimal manual recovery
  • zone/provider failure strategy
  • strong observability
  • operational discipline

Replica count alone cannot buy four nines.


8. Failure Modelling Framework

For each workload, model failures by layer.

Use this table:

| Failure Mode | Detection | Impact | Existing Control | Missing Control | Recovery |
|---|---|---|---|---|---|
| Pod crash | Restart count, logs | Reduced capacity | Deployment replicas | Startup probe | Auto restart / rollback |
| Node lost | Node condition | Pods unavailable | Replica spread | Topology spread | Reschedule |
| Bad rollout | Error SLI, rollout status | User errors | Rolling update | Canary metric gate | Rollback |
| DNS failure | DNS SLI, CoreDNS metrics | Dependency discovery fails | CoreDNS replicas | DNS egress policy test | Fix CoreDNS/policy |
| DB latency | App latency SLI | Slow requests | Timeouts | Circuit breaker | Degrade/read-only |
| PVC mount failure | Pod events | Stateful Pod unavailable | CSI | Runbook/snapshot | Remount/failover |

The purpose is not to predict everything.

The purpose is to discover the most dangerous unhandled failures before production does.


9. Kubernetes Primitives as Reliability Controls

Kubernetes features are reliability controls only when mapped to a failure mode.

| Primitive | Reliability Use | Does Not Solve |
|---|---|---|
| Deployment | Maintains stateless replicas and rollout | Bad code, bad dependency, bad schema |
| StatefulSet | Stable identity/storage | Data correctness, quorum design |
| Readiness probe | Removes unready Pods from Service | Dependency architecture |
| Liveness probe | Restarts deadlocked processes | Startup slowness, downstream outage |
| Startup probe | Protects slow startup | Runtime deadlock after startup |
| PDB | Limits voluntary disruption | Involuntary node failure |
| HPA | Adds/removes replicas | Bad metric, cold start, dependency bottleneck |
| Topology spread | Reduces correlated node/zone failure | App-level quorum/capacity math |
| Resource requests | Enables scheduling/capacity planning | Runtime leaks or overload alone |
| Limits | Bounds noisy behavior | CPU throttling risk |
| PriorityClass | Protects critical workloads | Infinite capacity |
| NetworkPolicy | Reduces network blast radius | App authz, mesh policy, DNS correctness |
| Rollback | Restores previous template | Data corruption or incompatible migration |

10. Probe Design for Reliability

Probes are one of the most powerful and most dangerous reliability tools in Kubernetes.

10.1 Correct Probe Roles

| Probe | Reliability Meaning |
|---|---|
| Startup | “Do not judge liveness until boot is complete.” |
| Readiness | “Should this instance receive traffic now?” |
| Liveness | “Is this process unrecoverably broken and should be restarted?” |

10.2 Bad Probe Patterns

Liveness checks a dependency

Bad:

If the database is slow, liveness fails.
The kubelet restarts the container in every API Pod.
The resulting startup storm increases database load.
The outage worsens.

A liveness probe should usually check local process health, not downstream dependency health.

Readiness removes all replicas simultaneously

Bad:

All Pods depend on the same downstream service.
The downstream service has a 10-second blip.
All Pods fail readiness.
The Service has zero endpoints.
The ingress returns 503.

Readiness must be designed with degradation semantics.
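
One possible reading of "degradation semantics" is sketched below: readiness fails immediately for local problems, but tolerates a short outage of a shared dependency before taking the Pod out of rotation. The class name, the `local_ok` flag, and the 30-second grace value are assumptions for illustration, not a recommended default.

```python
import time


class ReadinessPolicy:
    """Reports not-ready for local problems or a sustained critical-dependency failure."""

    def __init__(self, dependency_grace_seconds=30.0, clock=time.monotonic):
        self._grace = dependency_grace_seconds
        self._clock = clock
        self._failing_since = None  # time the dependency was first seen failing
        self.local_ok = True        # e.g. config loaded, worker pool started

    def record_dependency_check(self, healthy):
        """Feed in the result of a background dependency check."""
        if healthy:
            self._failing_since = None
        elif self._failing_since is None:
            self._failing_since = self._clock()

    def is_ready(self):
        """Value the /health/ready handler would return."""
        if not self.local_ok:
            return False
        if self._failing_since is None:
            return True
        # Stay in rotation during a short blip; give up only after the grace window.
        return self._clock() - self._failing_since < self._grace
```

With this policy a 10-second blip no longer empties the Service. A sustained outage still removes every replica after the grace window, so whether readiness should depend on the shared dependency at all remains a design decision.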

Probe timeouts too low

A strict timeout during CPU throttling or GC pauses can trigger false failure.

10.3 Probe Configuration Example

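startupProbe:               # liveness and readiness checks wait until this succeeds
  httpGet:
    path: /health/startup
    port: http              # refers to a named container port
  failureThreshold: 30
  periodSeconds: 2

readinessProbe:             # controls whether the Pod receives Service traffic
  httpGet:
    path: /health/ready
    port: http
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3

livenessProbe:              # failure restarts the container; check local health only
  httpGet:
    path: /health/live
    port: http
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3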

This is not a universal template.

The thresholds must reflect real startup time, request latency, runtime pauses, and dependency behavior.
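
A quick way to sanity-check thresholds is to convert them into time. The sketch below uses the approximation failureThreshold × periodSeconds and ignores timeoutSeconds and probe scheduling jitter, so treat the results as rough bounds; the function names are hypothetical.

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate longest startup time before the startup probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


def detection_time_seconds(failure_threshold, period_seconds):
    """Approximate time for a probe to register failure once checks start failing."""
    return failure_threshold * period_seconds
```

For the configuration above this gives about 60 seconds of startup budget, about 15 seconds before a failing Pod leaves the Service, and about 30 seconds before a failing container is restarted. If real startup sometimes takes 90 seconds, the startup values are too tight.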


11. Graceful Shutdown and Request Draining

Pod termination is a reliability event.

If termination is not designed, rolling update, node drain, autoscaling, and eviction can drop requests.

11.1 Termination Flow

The exact behavior depends on application, kubelet, endpoints, and traffic infrastructure, but the design goal is simple:

Stop new traffic before killing existing work.

11.2 Application Requirements

A reliable application should:

  • handle SIGTERM
  • stop accepting new requests
  • fail readiness quickly on shutdown
  • finish in-flight requests within grace period
  • close connections cleanly
  • commit or rollback work safely
  • avoid starting new long-running tasks after termination begins
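
The requirements above can be sketched as a small piece of shared state that the signal handler, the readiness endpoint, and the request path all consult. This is a minimal illustration, not a full server; `DrainState` and its method names are hypothetical.

```python
import signal
import threading
import time


class DrainState:
    """Tracks shutdown state and in-flight requests for a serving process."""

    def __init__(self):
        self.shutting_down = threading.Event()
        self._lock = threading.Lock()
        self._in_flight = 0

    def handle_sigterm(self, signum, frame):
        # Matches the handler signature expected by signal.signal().
        self.shutting_down.set()

    def is_ready(self):
        # The readiness endpoint should fail as soon as shutdown begins.
        return not self.shutting_down.is_set()

    def try_begin_request(self):
        """Admit a request unless shutdown has started."""
        with self._lock:
            if self.shutting_down.is_set():
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._lock:
            self._in_flight -= 1

    def wait_for_drain(self, timeout_seconds, poll_seconds=0.05):
        """Block until in-flight work finishes or the timeout expires."""
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_seconds)
        with self._lock:
            return self._in_flight == 0


def install(state):
    """Register the SIGTERM handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, state.handle_sigterm)
```

A process would call `install(state)` at startup, wrap each request in `try_begin_request()` and `end_request()`, and on shutdown call `wait_for_drain()` with a timeout shorter than the Pod's termination grace period before exiting.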

11.3 Kubernetes Configuration

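spec:
  terminationGracePeriodSeconds: 60   # Pod-level; the preStop hook runs inside this budget
  containers:
    - name: app                       # placeholder container name
      lifecycle:                      # container-level field
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]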

A sleep preStop can give load balancers time to stop routing, but it is not enough by itself.

The application should participate in draining.

11.4 Common Failure

Deployment rollout starts.
Old Pod receives SIGTERM.
App exits immediately.
External LB still has connection.
User request drops.
Rollout looks healthy, but users see intermittent failures.

This is why graceful shutdown must be tested, not assumed.


12. PodDisruptionBudget

A PodDisruptionBudget limits voluntary disruption for a set of Pods.

It helps during:

  • node drain
  • cluster maintenance
  • autoscaler scale-down
  • upgrade operations
  • administrator-initiated eviction

It does not prevent:

  • node hardware failure
  • kernel crash
  • cloud zone outage
  • process crash
  • OOM kill
  • forced deletion

12.1 Example

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: payments-api

For three replicas, minAvailable: 2 allows one voluntary disruption at a time.
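
The arithmetic behind that statement can be sketched as follows. The function names are hypothetical, only `minAvailable` is modelled, and the percentage handling assumes the value is rounded up to a whole Pod.

```python
import math


def desired_healthy(min_available, expected_pods):
    """Resolve minAvailable (an int or a percentage string such as '90%') to a Pod count."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        percent = int(min_available[:-1])
        return math.ceil(expected_pods * percent / 100)  # percentages round up
    return int(min_available)


def disruptions_allowed(current_healthy, min_available, expected_pods):
    """Voluntary evictions the budget would currently permit."""
    return max(0, current_healthy - desired_healthy(min_available, expected_pods))
```

With three healthy replicas and `minAvailable: 2` the result is 1. If one replica is already unhealthy the result drops to 0, and eviction requests are refused until it recovers.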

12.2 PDB Design Rules

| Workload | Possible PDB |
|---|---|
| Stateless API, 3 replicas | minAvailable: 2 |
| Stateless API, 10 replicas | maxUnavailable: 10% or minAvailable: 90% |
| Singleton | Usually no meaningful PDB; fix architecture |
| Quorum system, 5 members | Keep quorum available, usually at least 3 |
| DaemonSet | PDB semantics may not match operational needs |

12.3 PDB Anti-Patterns

minAvailable: 100%

This can block node drains and upgrades indefinitely.

Use only when you fully understand the operational cost.

PDB selector matches wrong Pods

A selector bug can protect nothing or block too much.

PDB without spare capacity

If the cluster has no spare capacity, a PDB can prevent maintenance from completing.

Reliability requires both policy and capacity.


13. Replica, Topology, and Blast Radius

Three replicas do not guarantee high availability if they all run on the same node.

13.1 Bad Placement

Node A:
  payments-api-1
  payments-api-2
  payments-api-3

Node A failure removes all replicas.

13.2 Better Placement

Node A:
  payments-api-1
Node B:
  payments-api-2
Node C:
  payments-api-3

Use topology spread constraints or anti-affinity to reduce correlated failure.

13.3 Zone-Aware Placement

For regional clusters:

Zone A: 2 replicas
Zone B: 2 replicas
Zone C: 2 replicas

But zone-aware placement must account for capacity.

If Zone A fails, remaining zones must absorb traffic.

13.4 The N+1 Capacity Rule

A service is not resilient to a node/zone failure unless remaining capacity can handle the load.

Availability = replicas + placement + capacity + dependency survival
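
The capacity part of that rule is easy to check numerically. The sketch below assumes every replica can serve the same load and that traffic rebalances evenly after a zone is lost; the function names and the example numbers are illustrative.

```python
import math


def survives_zone_loss(replicas_per_zone, capacity_per_replica, peak_load):
    """True if losing the largest zone still leaves enough capacity for peak load."""
    total = sum(replicas_per_zone.values())
    worst_case_remaining = total - max(replicas_per_zone.values())
    return worst_case_remaining * capacity_per_replica >= peak_load


def replicas_needed_per_zone(zones, capacity_per_replica, peak_load):
    """Replicas per zone so that the surviving zones can carry peak load alone."""
    if zones < 2:
        raise ValueError("zone redundancy needs at least two zones")
    needed_total = math.ceil(peak_load / capacity_per_replica)
    return math.ceil(needed_total / (zones - 1))
```

Six replicas at 100 requests per second each look sufficient for a 450 rps peak, but after one of three zones fails only 400 rps of capacity remains. Running three replicas per zone closes that gap.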

14. Rollout Reliability

Every deployment is a controlled failure injection.

It replaces working code with new code under live traffic.

14.1 Native Rolling Update Controls

Deployment rolling update knobs:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

Interpretation:

| Field | Meaning |
|---|---|
| maxSurge | Extra Pods above desired replicas during rollout |
| maxUnavailable | How many desired replicas may be unavailable |

maxUnavailable: 0 protects capacity, but requires surge capacity.
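
To see what a given pair of values means for capacity, the bounds can be computed directly. This sketch accepts integers or percentage strings and rounds maxSurge percentages up and maxUnavailable percentages down; the helper names are hypothetical.

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an int or a percentage string against the desired replica count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Peak Pod count and minimum available Pods during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # A rollout cannot progress if neither surge nor unavailability is permitted.
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

For three replicas with `maxSurge: 1` and `maxUnavailable: 0`, the cluster must be able to schedule a fourth Pod, and availability never drops below three. If that fourth Pod stays Pending, the rollout stalls.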

14.2 Rollout Risk Dimensions

| Risk | Example |
|---|---|
| Code risk | New bug causes 500s |
| Config risk | Missing env var crashes Pods |
| Schema risk | New version writes incompatible data |
| Dependency risk | New version calls dependency differently |
| Capacity risk | New version uses more CPU/memory |
| Startup risk | Cold start too slow for rollout deadline |
| Traffic risk | Canary receives unrepresentative traffic |
| Rollback risk | Old version cannot read new data |

14.3 Rollout Safety Ladder

Unit/integration tests
-> container build verification
-> admission policy
-> staging environment
-> one replica smoke
-> canary traffic
-> metric gate
-> gradual promotion
-> rollback automation

Kubernetes Deployment handles only part of this ladder.

Progressive delivery tools and platform process fill the rest.


15. Dependency Failure and Graceful Degradation

Most user-visible incidents are not pure Kubernetes failures.

They happen at boundaries:

  • database
  • cache
  • queue
  • object storage
  • identity provider
  • payment provider
  • DNS
  • third-party API
  • internal service

15.1 Dependency Control Toolkit

| Control | Purpose |
|---|---|
| Timeout | Bound waiting time |
| Retry with backoff | Recover from transient errors |
| Circuit breaker | Stop amplifying overload |
| Bulkhead | Isolate resource pools |
| Fallback | Serve degraded response |
| Cache | Reduce dependency load |
| Queue | Decouple request path from processing |
| Rate limit | Protect downstream service |
| Load shedding | Preserve core traffic under overload |
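
As an illustration of the "retry with backoff" row, here is a minimal sketch of capped exponential backoff with full jitter. The function name and default values are assumptions, and `operation` is expected to enforce its own timeout.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on exception, retry with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)  # random delay up to the ceiling spreads clients out
```

The attempt cap matters as much as the backoff. Unbounded retries turn a slow dependency into a permanently overloaded one. The `sleep` and `rng` parameters exist only so the behavior can be tested without real delays.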

15.2 Retry Storm Pattern

Autoscaling can worsen dependency overload if the bottleneck is downstream.

Scaling clients is not always the answer.
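
The amplification can be estimated with simple multiplication. This sketch models the worst case, where every call fails and every layer uses its full retry allowance; the function names and traffic numbers are illustrative.

```python
def worst_case_amplification(attempts_per_layer):
    """Multiply per-layer attempt counts to get the worst-case load multiplier."""
    factor = 1
    for attempts in attempts_per_layer:
        factor *= attempts
    return factor


def downstream_load(replicas, rps_per_replica, attempts_per_layer):
    """Worst-case requests per second arriving at the dependency."""
    return replicas * rps_per_replica * worst_case_amplification(attempts_per_layer)
```

Three layers that each make up to three attempts can multiply load on the bottom dependency by 27. Doubling client replicas from 10 to 20 doubles that worst case again, which is how autoscaling can deepen an outage instead of relieving it.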

15.3 Degradation Example

For a product catalog:

If recommendation service fails:
  Serve product page without recommendations.

If review service fails:
  Serve cached rating or hide review section.

If payment authorization fails:
  Do not fake success; fail safely.

Reliability is not always “return success.”

Sometimes reliability means failing clearly, safely, and recoverably.
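
The catalog example can be sketched in code. Optional features degrade quietly and record that they did so, while payment failure is surfaced explicitly. All function names, the page structure, and the `PaymentUnavailable` exception are hypothetical.

```python
class PaymentUnavailable(Exception):
    """Raised when payment authorization could not be confirmed."""


def build_product_page(product, fetch_recommendations, fetch_reviews, cached_rating=None):
    """Assemble a product page, degrading optional sections on dependency failure."""
    page = {"product": product, "degraded": []}
    try:
        page["recommendations"] = fetch_recommendations(product)
    except Exception:
        page["recommendations"] = []       # optional feature: serve the page without it
        page["degraded"].append("recommendations")
    try:
        page["reviews"] = fetch_reviews(product)
    except Exception:
        page["reviews"] = None             # hide the review section
        page["rating"] = cached_rating     # may be None when nothing is cached
        page["degraded"].append("reviews")
    return page


def authorize_payment(order, authorize):
    """Never report success unless the provider confirmed it."""
    try:
        return authorize(order)
    except Exception as exc:
        raise PaymentUnavailable("authorization could not be confirmed") from exc
```

Recording which sections degraded keeps the fallback visible to metrics, so a quiet degradation does not become a hidden failure.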


16. Stateful Reliability

Stateful workloads require different reliability thinking.

A stateless API can often be rolled back or replaced.

A database cannot be treated as disposable.

16.1 Stateful Failure Questions

What is the quorum requirement?
What is the replica placement model?
What is the backup frequency?
What is the recovery time objective (RTO)?
What is the recovery point objective (RPO)?
Can old and new versions read the same data?
What happens during node drain?
What happens during zone failure?
Can the system tolerate split brain?

16.2 RPO and RTO

| Term | Meaning |
|---|---|
| RPO | Recovery Point Objective: acceptable data loss window |
| RTO | Recovery Time Objective: acceptable recovery duration |

Example:

RPO: 5 minutes
RTO: 30 minutes

This means the system may lose up to 5 minutes of data and must recover within 30 minutes.
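
A restore drill can be checked against these objectives with a few lines. The sketch assumes periodic backups, where worst-case data loss is roughly the longest gap between successful backups, and a restore duration measured in a real drill. The function names are hypothetical, and at least two backup timestamps are required.

```python
def worst_case_data_loss_minutes(backup_times_minutes):
    """Largest gap between consecutive successful backups, in minutes."""
    ordered = sorted(backup_times_minutes)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def meets_objectives(backup_times_minutes, measured_restore_minutes,
                     rpo_minutes=5, rto_minutes=30):
    """Compare observed backup gaps and restore time with the RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss_minutes(backup_times_minutes) <= rpo_minutes,
        "rto_met": measured_restore_minutes <= rto_minutes,
    }
```

A single skipped backup is enough to break a 5-minute RPO, which is why backup freshness belongs among the SLIs for stateful services.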

If you do not test restore, you do not have a backup strategy.

You have a backup hope.

16.3 Quorum and PDB

For a 5-member quorum system, losing 2 may still preserve quorum.

But planned disruption should never take the system below quorum, and it should leave margin for an unplanned failure that happens during maintenance.

A PDB can help with voluntary disruptions, but it cannot save the system from correlated node/zone failure if placement is wrong.
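
The quorum arithmetic is worth writing down. This sketch assumes a simple majority quorum and reserves one member for unplanned failure during maintenance; the reserve value is a policy choice, not a Kubernetes rule, and the function names are hypothetical.

```python
def quorum_size(members):
    """Smallest majority of the cluster."""
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can be lost while a majority remains."""
    return members - quorum_size(members)


def safe_voluntary_disruptions(members, reserve_for_unplanned=1):
    """Planned disruptions that still leave room for unplanned failures."""
    return max(0, tolerated_failures(members) - reserve_for_unplanned)
```

For five members the quorum is 3 and two failures are tolerated, so keeping one in reserve suggests a PDB of `minAvailable: 4`. For three members the same policy leaves no room for voluntary disruption at all, which is a signal to plan maintenance carefully.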


17. Capacity Reliability

Capacity is a reliability feature.

17.1 Capacity Failure Modes

| Failure | Description |
|---|---|
| No scheduling headroom | New Pods cannot start during rollout/failure |
| CPU throttling | Latency increases while Pods appear healthy |
| Memory pressure | OOM kills and evictions occur |
| Disk pressure | Node becomes unhealthy, image pulls fail |
| Connection pool saturation | App threads wait despite CPU being low |
| Queue backlog | Async system misses processing SLO |
| Autoscaling lag | Replicas arrive after user impact |
| Cold start | New replicas not useful quickly enough |

17.2 Headroom Questions

Can we lose one node and still serve traffic?
Can we lose one zone and still serve priority traffic?
Can we roll out with maxSurge without Pending Pods?
Can HPA scale before saturation?
Can dependencies handle scaled clients?

17.3 Requests and Limits

Resource requests define scheduling assumptions.

If requests are too low, the scheduler overpacks nodes and runtime pressure appears later.

If requests are too high, utilization is poor and Pods may remain Pending.

Reliability needs realistic requests based on observed behavior, not arbitrary defaults.


18. Reliability of Cluster Operations

Cluster operations are reliability events:

  • node drain
  • node image upgrade
  • Kubernetes version upgrade
  • CNI upgrade
  • CSI upgrade
  • ingress controller upgrade
  • certificate rotation
  • policy rollout
  • autoscaler configuration change

Treat platform changes like application deployments.

18.1 Operational Safety Checklist

What workloads are affected?
Are PDBs configured correctly?
Is there enough spare capacity?
Are system components redundant?
Are add-on versions compatible?
Can we roll back?
Are alerts temporarily noisy or blind?
Is there a maintenance window or progressive rollout?

18.2 Node Drain Reliability

A node drain is safe only if:

  • workloads have enough replicas
  • PDB allows controlled eviction
  • other nodes have capacity
  • storage can detach/attach if needed
  • topology constraints can be satisfied
  • critical DaemonSets are healthy

Otherwise, a node drain can become an outage.


19. Failure Mode and Effects Analysis for Kubernetes

Use FMEA-style reasoning.

| Component | Failure Mode | Effect | Detection | Control |
|---|---|---|---|---|
| Pod | Crash loop | Reduced capacity | Restart count, SLI | Replicas, rollback, probes |
| Node | NotReady | Pod loss | Node condition | Spread, reschedule, capacity |
| Zone | Unavailable | Regional capacity loss | Cloud/K8s metrics | Multi-zone placement, spare capacity |
| Service | No endpoints | 503 | EndpointSlice alert | Readiness, selector tests |
| DNS | Resolution timeout | Dependency failure | DNS probe | CoreDNS scaling, NetworkPolicy allow |
| CSI | Mount failure | Stateful outage | Pod events | Storage runbook, snapshots |
| Rollout | Bad version | Errors/latency | SLI burn | Canary, rollback, schema discipline |
| HPA | Scales too late | Latency spike | Saturation | Better metrics, min replicas |
| NetworkPolicy | Blocks dependency | Timeouts | Synthetic test | Policy tests, staged rollout |
| Secret rotation | App auth failure | Errors | Auth metrics/logs | Rotation runbook, dual credentials |

The strongest engineers do this before incident review, not only after.
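
Classic FMEA ranks failure modes by a risk priority number: severity × occurrence × detection difficulty. The source table does not assign scores, so the sketch below introduces 1-to-10 scales as an assumption, and any scores fed into it are team estimates.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    failure: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (always detected) to 10 (effectively undetectable)

    @property
    def rpn(self):
        """Risk priority number used to order mitigation work."""
        return self.severity * self.occurrence * self.detection


def rank(modes):
    """Return failure modes ordered from highest to lowest risk priority."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)
```

The scores themselves are rough. The value is in forcing a ranking, so that the most dangerous poorly-detected failure gets a control before the most familiar one does.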


20. Reliability Design Patterns

20.1 Redundancy

Run multiple instances across failure domains.

But redundancy without capacity is weak.

20.2 Bulkheads

Separate resource pools so one failure does not consume everything.

Examples:

  • separate Deployments for public API and internal batch workers
  • separate node pools for critical workloads
  • separate connection pools per dependency
  • separate queues for high-priority traffic

20.3 Circuit Breakers

Stop calling a failing dependency temporarily.

Without circuit breakers, retries can amplify failure.
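
A minimal circuit breaker can be sketched as follows. It opens after a number of consecutive failures, rejects calls while open, and allows a trial call after a reset timeout. The class, its defaults, and `CircuitOpenError` are hypothetical, and this version is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("dependency calls are suspended")
            # Reset timeout elapsed: fall through and allow a trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

Callers should treat `CircuitOpenError` as a cue to use a fallback or fail fast, not as something to retry immediately.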

20.4 Load Shedding

Reject low-priority work to preserve core service.

Example:

Drop recommendation enrichment before dropping checkout.

20.5 Backpressure

Make overload visible to callers instead of hiding it until collapse.

20.6 Idempotency

Critical for retries, Jobs, queue consumers, and rollout recovery.

20.7 Compatibility Windows

During rollout, old and new versions may run simultaneously.

APIs, events, and database schemas must support overlap.


21. Reliability Anti-Patterns

21.1 Replica Count as Religion

We have 3 replicas, so we are highly available.

Not if all replicas share the same node, dependency, storage bottleneck, or bad release.

21.2 Liveness as a Hammer

Restarting a process is not always healing.

Sometimes it creates more load and loses evidence.

21.3 PDB as Decoration

A PDB with the wrong selector protects nothing.

A PDB with impossible constraints blocks operations.

21.4 Autoscaling as Reliability Strategy

Autoscaling is not instant and may scale the wrong bottleneck.

21.5 Ignoring Shutdown

If the app ignores SIGTERM, every rollout and node drain can drop traffic.

21.6 SLO Without Ownership

An SLO nobody reviews does not change behavior.

Error budgets must influence release and reliability decisions.


22. Reliability Review Template

Use this before promoting a workload to production.

Workload: <name>
Owner: <team>
Criticality: <tier>
User journey: <journey>

SLI:
- Availability:
- Latency:
- Correctness:
- Freshness/lag:

SLO:
- Target:
- Window:
- Error budget:

Failure domains:
- Pod:
- Node:
- Zone:
- Dependency:
- Storage:
- Rollout:
- Policy/security:

Kubernetes controls:
- Replicas:
- Readiness:
- Liveness:
- Startup:
- Resources:
- HPA/VPA:
- PDB:
- Topology spread:
- NetworkPolicy:
- Rollout strategy:

Recovery:
- Rollback method:
- Data restore method:
- Escalation owner:
- Runbook:

Known risks:
- <risk>

Required before launch:
- <guardrail>

23. SLO-Driven Alerting

Alert on user impact and fast burn, not only infrastructure symptoms.

23.1 Weak Alerts

Pod restarted once.
CPU > 80%.
Memory > 80%.
One node NotReady.

These may be useful signals, but they are not always pages.

23.2 Stronger Alerts

Checkout availability SLO burn rate > threshold.
P99 latency exceeds SLO for 10 minutes.
Queue oldest message age exceeds processing SLO.
No ready endpoints for critical service.
Error budget burn indicates page-worthy impact.

A good alert says:

Users are or soon will be affected, and human action is needed.
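
Burn rate makes "too fast" measurable: it is the observed error ratio divided by the ratio the SLO allows. The sketch below pages only when both a long and a short window burn quickly. The 14.4 threshold is a commonly cited value that corresponds to spending 2% of a 30-day budget in one hour; treat it and the function names as assumptions to tune.

```python
def burn_rate(error_ratio, slo=0.999):
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo=0.999, threshold=14.4):
    """Page only when both windows burn faster than the threshold."""
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)
```

Requiring the short window as well keeps the alert from paging long after the problem has already stopped.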

23.3 Platform Alerts

Platform alerts are still needed:

  • API server unavailable
  • CoreDNS failure
  • ingress controller down
  • CNI failure
  • CSI failure
  • node pool capacity exhausted
  • certificate expiration
  • admission webhook failure

But they should be tied to likely user or workload impact.


24. Reliability Maturity Rubric

| Level | Behavior |
|---|---|
| 1: Reactive | Fixes incidents manually after users complain |
| 2: Basic | Has replicas, probes, and dashboards |
| 3: Managed | Defines SLOs, PDBs, rollout checks, and runbooks |
| 4: Proactive | Uses error budgets, canaries, failure testing, capacity planning |
| 5: Platformized | Provides golden paths, policy guardrails, automated rollback, and reliability scorecards |

Top engineers move systems from Level 2 to Level 4/5.

They do not merely add more YAML.


25. Practice Labs

Lab 1 — Readiness Failure Blast Radius

Create three replicas where readiness checks a shared dependency.

Simulate dependency failure.

Observe whether all endpoints disappear.

Improve readiness semantics.

Lab 2 — Graceful Shutdown

Deploy an HTTP server with long-running requests.

Roll out a new version.

Observe dropped requests.

Add SIGTERM handling, readiness drain, and termination grace period.

Lab 3 — PDB and Node Drain

Create a Deployment with three replicas and a PDB minAvailable: 2.

Drain a node.

Observe controlled disruption.

Then test bad PDB settings.

Lab 4 — Topology Spread

Force all replicas onto one node.

Simulate node loss.

Then add topology spread constraints and compare impact.

Lab 5 — Bad Rollout and Rollback

Deploy a version that fails readiness.

Watch Deployment rollout status.

Rollback and verify SLO recovery.

Lab 6 — Retry Storm

Create a service that calls a slow dependency with aggressive retries.

Increase dependency latency.

Observe client amplification.

Add timeout, backoff, and circuit breaker.


26. Production Checklist

Before declaring a Kubernetes workload production-ready:

Does it have user-centered SLIs?
Does it have an SLO and owner?
Does it have realistic resource requests?
Does it have startup, readiness, and liveness probes with correct semantics?
Does it handle SIGTERM gracefully?
Does it have enough replicas for the failure model?
Are replicas spread across nodes/zones?
Does it have a PDB where appropriate?
Can it roll out without reducing capacity below safe levels?
Can it roll back safely?
Are dependencies protected by timeouts and backoff?
Is autoscaling based on the right signal?
Are dashboards and alerts tied to user impact?
Is storage backed up and restore tested?
Is the runbook tested?

If the answer to most of these is no, the workload is not production-grade.

It is merely deployed.


27. Key Takeaways

  1. Kubernetes health is not the same as user reliability.
  2. SLOs convert reliability from opinion into an operating constraint.
  3. Error budgets should influence release speed and reliability investment.
  4. Probes can improve reliability or create outages depending on semantics.
  5. Graceful shutdown is mandatory for safe rollout, drain, and scale-down.
  6. PDBs protect against voluntary disruptions, not all failures.
  7. Replica count only matters when combined with placement and capacity.
  8. Autoscaling is useful but cannot fix every bottleneck.
  9. Stateful reliability requires backup, restore, quorum, RPO, and RTO thinking.
  10. Failure modelling is the bridge between Kubernetes primitives and production resilience.
