Learn Kubernetes Deployment Model Part 013 Resource Management

title: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering - Part 013
description: Deep dive into Kubernetes CPU, memory, requests, limits, QoS classes, eviction, overcommit, noisy-neighbor control, and production-grade resource governance.
series: learn-kubernetes-deployment-model
seriesTitle: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering
order: 13
partTitle: CPU, Memory, Requests, Limits, QoS, and Resource Governance
tags:
  - kubernetes
  - deployment
  - resource-management
  - qos
  - production-engineering
  - platform-engineering
date: 2026-07-01

Part 013 — CPU, Memory, Requests, Limits, QoS, and Resource Governance

Goal: understand Kubernetes resource management deeply enough to design safe workload defaults, diagnose resource-related incidents, prevent noisy-neighbor failure, and build defensible production governance.

This part is not about memorizing resources.requests and resources.limits.

It is about answering harder production questions:

Why did the Pod get scheduled, then die later?
Why is the app slow even though CPU is below 100%?
Why did a memory limit make latency worse?
Why did a low-priority batch workload evict a business-critical service?
Why did Cluster Autoscaler not help even though the application was overloaded?
Why did a node have available CPU on paper but still behave badly?
Why does one team’s deployment degrade another team’s workload?

Kubernetes resource management is a contract between four systems:

the application,
the kube-scheduler,
the kubelet,
the underlying Linux/kernel/container runtime isolation layer.

A top engineer understands all four.

1. Kaufman Deconstruction

Using Josh Kaufman’s skill acquisition framing, we deconstruct resource management into smaller sub-skills.

Sub-skill	What You Must Be Able To Do
Resource vocabulary	Explain CPU, memory, ephemeral storage, huge pages, extended resources, requests, limits, QoS, eviction.
Scheduling model	Predict whether a Pod can be scheduled based on requests and node allocatable capacity.
Runtime model	Explain what happens when a container exceeds CPU or memory limits.
QoS model	Predict Pod eviction priority under node pressure.
Overcommit model	Decide when CPU/memory overcommit is acceptable.
Sizing model	Derive requests/limits from real usage distributions, not guesses.
Failure model	Diagnose OOMKilled, CPU throttling, Pending Pods, Evicted Pods, node pressure, and noisy neighbors.
Governance model	Design namespace quotas, LimitRanges, policy defaults, and platform guardrails.

The most valuable sub-skill is resource reasoning under failure.

Anyone can copy this:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

A stronger engineer can explain exactly what trade-off that contract creates.

2. Resource Management Mental Model

Kubernetes does not magically know how much capacity your application needs.

It only knows what you declare.

Important distinction:

Field	Primary Consumer	Main Purpose
CPU request	Scheduler, HPA, kubelet	Reserve scheduling capacity and calculate utilization baseline.
CPU limit	Runtime/kernel	Cap CPU usage through throttling.
Memory request	Scheduler, QoS, eviction	Reserve scheduling capacity and classify eviction risk.
Memory limit	Runtime/kernel	Kill container if memory exceeds limit.
Ephemeral storage request	Scheduler/kubelet	Reserve local scratch/log/image writable storage.
Ephemeral storage limit	Kubelet/runtime	Evict/fail if local storage usage exceeds contract.

Kubernetes scheduling is request-driven.

Runtime enforcement is limit-driven.

This means a Pod can be:

schedulable but later unstable,
unschedulable despite low actual usage,
apparently healthy but throttled,
comfortably within its memory request when scheduled but killed at peak,
safe alone but unsafe during multi-tenant contention.

3. The Core Resource Types

3.1 CPU

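Kubernetes CPU is measured in CPU units. One unit corresponds to one physical core or one virtual core (vCPU or hyperthread), depending on the node.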

Common representations:

cpu: "1"      # one CPU core
cpu: "500m"   # half a CPU core
cpu: "100m"   # one tenth of a CPU core

m means millicpu.

So:

Value	Meaning
`1000m`	1 CPU
`500m`	0.5 CPU
`250m`	0.25 CPU
`100m`	0.1 CPU
`10m`	0.01 CPU
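The conversion in the table can be sketched in a few lines. This is a minimal illustration, not the Kubernetes quantity parser: the function name `cpu_to_millicores` is hypothetical, and it only handles plain numbers and the `m` suffix.

```python
from decimal import Decimal


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "0.25" to whole millicores."""
    text = quantity.strip()
    if text.endswith("m"):
        # Already expressed in millicpu, e.g. "250m".
        return int(text[:-1])
    # Plain number of CPUs, e.g. "1" or "0.5"; Decimal avoids float rounding.
    return int(Decimal(text) * 1000)


for quantity in ["1", "500m", "0.25", "100m", "10m"]:
    print(f"{quantity:>5} = {cpu_to_millicores(quantity)} millicores")
```

Normalizing to millicores before comparing values avoids treating `"1"` and `"1000m"` as different amounts.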

CPU is compressible.

If a container wants more CPU than it can get, it usually becomes slower, not dead.

That makes CPU failure subtle:

latency increases,
queue depth grows,
request timeouts rise,
GC pauses can get worse,
health probes may start failing,
autoscaling may lag behind demand.

3.2 Memory

Memory is measured in bytes using suffixes such as Mi, Gi, M, G.

Prefer binary units for Kubernetes memory sizing:

memory: "256Mi"
memory: "1Gi"
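The reason to prefer binary units becomes clearer when both forms are converted to bytes. The sketch below is a simplified converter under stated assumptions: `memory_to_bytes` is a hypothetical helper that covers only the common suffixes, not every form the Kubernetes quantity format accepts.

```python
from decimal import Decimal

# Binary suffixes are powers of 1024; decimal suffixes are powers of 1000.
SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    text = quantity.strip()
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return int(Decimal(text[: -len(suffix)]) * factor)
    # No suffix: the value is already in bytes.
    return int(text)


for quantity in ["512Mi", "512M", "1Gi", "1G"]:
    print(f"{quantity:>6} = {memory_to_bytes(quantity)} bytes")
```

`1Gi` is roughly 7% larger than `1G`. Mixing the two styles across manifests makes limits harder to compare at a glance.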

Memory is non-compressible.

If a container exceeds its memory limit, it can be killed.

That kill commonly appears as:

Reason: OOMKilled
Exit Code: 137

Memory pressure is therefore sharper than CPU pressure.

CPU overuse slows things down.

Memory overuse kills things.

3.3 Ephemeral Storage

Ephemeral storage includes local writable storage associated with a Pod/container, such as:

container writable layer,
logs,
emptyDir,
temporary files,
unpacked runtime data.

This is frequently ignored until production fails.

Typical failure cases:

verbose logs fill node storage,
batch job writes temporary files without cleanup,
image layers consume excessive disk,
sidecar proxy logs grow unexpectedly,
emptyDir becomes an unbounded local database by accident.

Example:

resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "4Gi"

3.4 Extended Resources

Extended resources are custom node-level resources such as:

GPUs,
FPGAs,
smart NICs,
specialized accelerators,
device-plugin resources.

Example:

resources:
  limits:
    nvidia.com/gpu: 1

Extended resources are usually integer and non-overcommittable.

You generally request them by using limits, and Kubernetes schedules Pods only onto nodes that advertise the resource.

4. Requests vs Limits

4.1 Requests

A request says:

“This workload needs at least this much resource capacity to be scheduled safely.”

The scheduler uses requests to decide whether a node can fit a Pod.

Example:

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: example.com/api:1.0.0
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"

If no node has enough allocatable capacity left after subtracting the requests of Pods already placed there, the Pod remains Pending.

This is why incorrect requests produce two opposite failures:

Mistake	Failure
Request too high	Pod cannot schedule or cluster becomes underutilized.
Request too low	Pod schedules onto crowded node and becomes unstable under contention.

4.2 Limits

A limit says:

“This container must not exceed this runtime boundary.”

Example:

resources:
  limits:
    cpu: "1"
    memory: "1Gi"

CPU and memory limits behave differently.

Resource	If Exceeded
CPU limit	Container is throttled.
Memory limit	Container can be killed.

The difference is fundamental.

Do not reason about CPU and memory limits as if they are symmetric.

5. Scheduling Capacity vs Actual Utilization

Kubernetes scheduling is based on declared requests, not live usage.

Suppose a node has:

Allocatable CPU:    4 cores
Allocatable Memory: 8Gi

Existing Pods request:

CPU requested:    3.5 cores
Memory requested: 7Gi

A new Pod requests:

CPU:    700m
Memory: 512Mi

Even if actual CPU usage is only 20%, the new Pod cannot fit: 3.5 + 0.7 = 4.2 cores of requests would exceed the 4 allocatable cores. The memory request (7Gi + 512Mi) would have fit.

This is often misunderstood.

The scheduler is not a live load balancer.

It is a placement decision engine based primarily on declared constraints and available allocatable capacity.
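The arithmetic in this example can be sketched as a request-only fit check. This is a deliberate simplification: the real scheduler also evaluates taints, affinity, topology, and other plugins. The function name `blocking_resources` and the dictionary keys are hypothetical.

```python
def blocking_resources(allocatable, requested, pod):
    """Return the resources whose summed requests would exceed allocatable."""
    return [
        name
        for name in pod
        if requested[name] + pod[name] > allocatable[name]
    ]


# Values from the example above, in millicores and Mi.
allocatable = {"cpu_m": 4000, "memory_mi": 8 * 1024}
requested = {"cpu_m": 3500, "memory_mi": 7 * 1024}
new_pod = {"cpu_m": 700, "memory_mi": 512}

print(blocking_resources(allocatable, requested, new_pod))  # ['cpu_m']
```

Live usage never appears in the calculation. Only declared requests do.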

6. Node Capacity, Allocatable, and Reserved Resources

A node’s total capacity is not the same as capacity available for Pods.

Node Capacity
  - kube reserved
  - system reserved
  - eviction reserved
  = Node Allocatable

Kubernetes schedules Pods against node allocatable capacity.

Production implication:

You cannot safely pack Pods based on cloud VM advertised size alone.

A 4 vCPU / 16Gi node is not a 4 CPU / 16Gi Pod bin.

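Some capacity is intentionally reserved for:

kubelet,
container runtime,
OS processes,
eviction safety margin.

Node agents that run as DaemonSet Pods are not carved out of allocatable. They are scheduled against it, so they shrink what remains for application Pods:

logging agent,
monitoring agent,
CNI agent,
CSI driver,
node-local DNS.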
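The subtraction at the top of this section can be worked through with numbers. The reservation values below are assumptions chosen for illustration; real values depend on the kubelet configuration and the cloud provider.

```python
def node_allocatable(capacity, kube_reserved, system_reserved, eviction_threshold=0):
    """Capacity left for Pods after reservations are subtracted."""
    return capacity - kube_reserved - system_reserved - eviction_threshold


# Hypothetical 4 vCPU / 16Gi node. CPU in millicores, memory in Mi.
cpu_m = node_allocatable(4000, kube_reserved=100, system_reserved=100)
memory_mi = node_allocatable(
    16 * 1024, kube_reserved=1024, system_reserved=512, eviction_threshold=100
)

print(f"allocatable CPU:    {cpu_m}m")       # 3800m
print(f"allocatable memory: {memory_mi}Mi")  # 14748Mi
```

Under these assumptions the node offers roughly 3.8 CPU and 14.4Gi to Pods, before any DaemonSet requests are counted.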

7. QoS Classes

Kubernetes assigns Pods one of three QoS classes:

Guaranteed,
Burstable,
BestEffort.

QoS affects eviction ordering under resource pressure.

7.1 Guaranteed

A Pod is Guaranteed when every container has CPU and memory request equal to limit.

Example:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Use for:

hard real-time-ish platform components,
latency-sensitive services with predictable resource envelope,
critical infrastructure workloads,
workloads where eviction is more dangerous than underutilization.

Trade-off:

Guaranteed QoS can reduce cluster efficiency if limits are set too conservatively.

7.2 Burstable

A Pod is Burstable when at least one container sets a CPU or memory request or limit, but the Pod does not meet the Guaranteed criteria.

Example:

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

This is the most common production QoS class.

Use for:

normal microservices,
APIs with variable demand,
workers,
non-critical but important services.

Trade-off:

Burstable Pods can be evicted before Guaranteed Pods when nodes face pressure.

7.3 BestEffort

A Pod is BestEffort when no CPU or memory request/limit is specified.

Example:

resources: {}

Use for:

throwaway experiments,
local dev clusters,
intentionally low-priority opportunistic work.

Avoid for production services.

BestEffort Pods are first-class Kubernetes objects but last-class citizens under pressure.

8. QoS Decision Table

Requests Set?	Limits Set?	Request = Limit?	QoS
No	No	N/A	BestEffort
Some	No or partial	No	Burstable
CPU + memory for all containers	CPU + memory for all containers	Yes	Guaranteed
CPU + memory for all containers	Limits higher than requests	No	Burstable

QoS is Pod-level, not container-level.

A single sidecar without proper resources can accidentally change the Pod’s QoS class.

That matters for service mesh sidecars, log shippers, and agents injected by admission controllers.
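The decision table can be sketched as a small classifier. It is an approximation under stated assumptions: `qos_class` is a hypothetical function, quantities are compared as plain strings (so `"1"` and `"1000m"` would not match), and a missing request is defaulted to the limit, which is how Kubernetes treats a container that sets only a limit.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a Pod from a list of container dicts with 'resources'."""
    anything_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        # A request that is omitted defaults to the limit when one exists.
        requests = {**limits, **resources.get("requests", {})}
        for name in RESOURCES:
            if name in requests or name in limits:
                anything_set = True
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


app = {"resources": {"requests": {"cpu": "500m", "memory": "512Mi"},
                     "limits": {"cpu": "500m", "memory": "512Mi"}}}
sidecar_without_resources = {"resources": {}}

print(qos_class([app]))                             # Guaranteed
print(qos_class([app, sidecar_without_resources]))  # Burstable
```

The second call shows the sidecar effect described above: one unconstrained container downgrades the whole Pod.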

9. Eviction Mental Model

Eviction means kubelet terminates Pods to reclaim node resources.

Common pressure signals:

memory pressure,
disk pressure,
PID pressure,
inode pressure.

Eviction is not random.

Kubelet considers factors such as:

QoS class (indirectly: the kubelet ranks Pods by usage relative to requests, which QoS class approximates),
whether usage exceeds requests,
Pod priority,
resource pressure type,
local node conditions.

Production lesson:

If your service is business-critical but has low requests and no priority class, Kubernetes has little evidence that it should protect it.
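A rough way to reason about ordering under memory pressure is to sort Pods the way the Kubernetes node-pressure eviction documentation describes: Pods using more than they requested first, then lower priority first, then larger overage first. The sketch below is a simplified model, not kubelet code. The `PodStats` type and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodStats:
    name: str
    priority: int
    memory_request_mi: int
    memory_usage_mi: int


def eviction_order(pods):
    """Return Pod names from most likely to least likely to be evicted."""
    def rank(pod):
        overage = pod.memory_usage_mi - pod.memory_request_mi
        exceeds_request = overage > 0
        # False sorts before True, so Pods over their request come first.
        return (not exceeds_request, pod.priority, -overage)

    return [pod.name for pod in sorted(pods, key=rank)]


pods = [
    PodStats("api", priority=1000, memory_request_mi=512, memory_usage_mi=400),
    PodStats("batch", priority=0, memory_request_mi=256, memory_usage_mi=900),
    PodStats("scratch", priority=0, memory_request_mi=0, memory_usage_mi=300),
    PodStats("cache", priority=1000, memory_request_mi=1024, memory_usage_mi=1500),
]

print(eviction_order(pods))  # ['batch', 'scratch', 'cache', 'api']
```

The `api` Pod survives longest here because it stays within its request, not because anything marked it as important.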

10. OOMKilled vs Evicted

These are different failures.

Failure	Trigger	Actor	Common Cause
`OOMKilled`	Container exceeds memory limit	Kernel/runtime	Memory limit too low, memory leak, traffic spike, GC mis-sizing.
`Evicted`	Node pressure threshold crossed	Kubelet	Node memory/disk/PID pressure, requests set below real usage, BestEffort or Burstable Pods using more than they requested.
`Pending`	No schedulable node	Scheduler	Requests too high, constraints too strict, insufficient node capacity.

10.1 OOMKilled Example

Last State: Terminated
Reason: OOMKilled
Exit Code: 137

Interpretation:

The container exceeded its memory limit or was killed under cgroup memory enforcement.

10.2 Evicted Example

Status: Failed
Reason: Evicted
Message: The node was low on resource: memory.

Interpretation:

The kubelet terminated the Pod to protect the node.

11. CPU Throttling

CPU throttling happens when a container wants more CPU than its configured CPU quota allows.

Symptoms:

latency spikes,
timeouts,
low apparent CPU usage at service level,
increased request queueing,
longer GC cycles,
slow startup,
liveness probe failures,
HPA not scaling as expected.

A dangerous configuration:

resources:
  requests:
    cpu: "100m"
    memory: "512Mi"
  limits:
    cpu: "200m"
    memory: "512Mi"

This looks safe but can create artificial latency ceilings.

For CPU-bound services, a low CPU limit can make the service slower than the hardware actually allows.

11.1 CPU Request Without CPU Limit

Many production teams prefer:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    memory: "1Gi"

Rationale:

CPU request helps scheduling and HPA math.
No CPU limit avoids unnecessary throttling.
Memory limit prevents one process from consuming the node.

This is not universal, but it is a common production pattern.

Use it intentionally, not blindly.

12. Memory Limit Sizing

Memory limit is a kill boundary.

Set it too low and the app restarts.

Set it too high and the node becomes vulnerable to pressure.

A good memory limit considers:

steady-state heap,
native memory,
off-heap buffers,
thread stacks,
JIT/compiler memory,
sidecar memory,
TLS buffers,
request burst buffers,
cache behavior,
GC behavior,
memory fragmentation,
startup peak,
migration/reindexing peak.

For JVM services, do not set container memory equal to JVM heap.

Example bad pattern:

resources:
  limits:
    memory: "1Gi"

-Xmx1g

This ignores non-heap memory.

Safer mental model:

container memory limit
  > heap
  + metaspace
  + thread stacks
  + direct buffers
  + code cache
  + GC overhead
  + native libraries
  + sidecar/proxy overhead if in same Pod
  + safety margin
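The budget above can be turned into a quick sum. Every number below is a hypothetical placeholder; real values should come from JVM native memory tracking and container metrics for the service in question.

```python
def min_memory_limit_mi(components_mi, safety_margin_pct=15):
    """Sum the memory components and add a percentage safety margin."""
    total = sum(components_mi.values())
    # Integer ceiling division keeps the result exact.
    return -(-total * (100 + safety_margin_pct) // 100)


# Hypothetical budget for a service running with -Xmx1g, in Mi.
components = {
    "heap": 1024,
    "metaspace": 128,
    "thread_stacks": 100,
    "direct_buffers": 64,
    "code_cache": 48,
    "gc_overhead": 64,
    "native_libraries": 32,
}

print(min_memory_limit_mi(components))  # 1679
```

With a 1Gi heap, this budget lands well above a 1Gi container limit, which is why the pattern shown earlier is fragile.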

13. Requests and HPA Interaction

Horizontal Pod Autoscaler commonly uses CPU utilization relative to CPU requests.

Simplified:

CPU utilization = current CPU usage / requested CPU

If the CPU request is too low, the HPA may scale too aggressively.

If the CPU request is too high, the HPA may scale too slowly.

Example:

Pod CPU Usage	CPU Request	Observed Utilization
200m	100m	200%
200m	500m	40%
200m	1000m	20%

Same actual usage.

Different autoscaling signal.

This is why resource requests are not only scheduling hints.

They are also control-loop calibration inputs.
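The table can be extended one step to show the effect on replica count. The sketch uses the documented HPA ratio, desired = ceil(current replicas x current utilization / target utilization), and ignores tolerance, stabilization windows, and min/max bounds. The 70% target and 4 replicas are assumptions.

```python
import math


def cpu_utilization_pct(usage_m, request_m):
    """Utilization as the HPA sees it: usage relative to the request."""
    return 100 * usage_m / request_m


def desired_replicas(current_replicas, utilization_pct, target_pct):
    """Simplified HPA scaling ratio, rounded up."""
    return math.ceil(current_replicas * utilization_pct / target_pct)


usage_m = 200
for request_m in (100, 500, 1000):
    utilization = cpu_utilization_pct(usage_m, request_m)
    replicas = desired_replicas(4, utilization, target_pct=70)
    print(f"request={request_m}m utilization={utilization:.0f}% desired={replicas}")
```

Identical usage asks for 12, 3, or 2 replicas depending only on the declared request.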

14. Overcommit Strategy

Overcommit means the combined limits, or potential peak demand, of the Pods on a node exceed what the node can physically deliver, even though their combined requests still fit.

There are two forms:

Type	Meaning
CPU overcommit	Sum of CPU limits or potential demand exceeds physical CPU. Common and often acceptable.
Memory overcommit	Sum of potential memory usage exceeds physical memory. Riskier because memory pressure kills Pods.

14.1 CPU Overcommit

CPU is compressible, so moderate overcommit is common.

Example:

Node allocatable CPU: 8 cores
Total CPU requests: 6 cores
Total CPU limits: 20 cores

This can be acceptable if workloads do not peak simultaneously.

Risk:

contention,
throttling,
latency degradation,
noisy-neighbor effects.

14.2 Memory Overcommit

Memory overcommit is more dangerous.

Example:

Node allocatable memory: 32Gi
Total memory requests: 20Gi
Total memory limits: 80Gi

If workloads burst together, the node can enter memory pressure and evict Pods.

Memory overcommit must be governed by workload class.
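Both examples can be reduced to two ratios per resource: how much of the node is reserved by requests, and how far limits exceed it. The helper name `commit_ratios` is hypothetical. What ratio is acceptable is a policy decision and is not implied by the code.

```python
def commit_ratios(allocatable, total_requests, total_limits):
    """Return (request ratio, limit ratio) relative to node allocatable."""
    return total_requests / allocatable, total_limits / allocatable


# Numbers from the CPU and memory examples above.
cpu_requested, cpu_limited = commit_ratios(8, 6, 20)
mem_requested, mem_limited = commit_ratios(32, 20, 80)

print(f"CPU:    requests {cpu_requested:.0%}, limits {cpu_limited:.1f}x")
print(f"Memory: requests {mem_requested:.1%}, limits {mem_limited:.1f}x")
```

Both nodes are overcommitted 2.5x on limits. The CPU case degrades into latency; the memory case degrades into evictions and kills.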

15. Workload Classes and Resource Policy

A production platform should not use one resource policy for all workloads.

Workload Class	CPU Strategy	Memory Strategy	QoS Target
Critical control-plane add-on	Request = realistic baseline; limit careful or absent for CPU	Tight but safe limit	Guaranteed or high-priority Burstable
Latency-sensitive API	Request based on p50-p70 steady usage; avoid low CPU limit	Limit based on p99 + safety margin	Burstable or Guaranteed
Worker service	Request based on concurrency model	Limit based on max in-flight workload	Burstable
Batch job	Lower priority; explicit request	Explicit limit	Burstable
Opportunistic analytics	Low request; quota controlled	Strict limit	Burstable (BestEffort only if requests and limits are omitted)
Stateful database	Conservative request; avoid CPU starvation	Carefully tested memory limit	Guaranteed/Burstable depending on engine

No single resource template is correct for every workload.

Resource policy must reflect business criticality and failure mode.

16. Namespace ResourceQuota

ResourceQuota limits aggregate resource consumption in a namespace.

Example:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "80Gi"
    limits.cpu: "40"
    limits.memory: "160Gi"
    pods: "100"
    persistentvolumeclaims: "20"

Use quotas to prevent:

one team consuming all cluster capacity,
runaway deployments,
unbounded test environments,
accidental infinite job creation,
namespace-level denial of service.

But quotas can also create friction:

deployments fail unexpectedly,
teams ask platform teams for quota increases,
stale workloads consume quota,
bad requests waste namespace capacity.

A good quota system needs visibility.

17. LimitRange

LimitRange can define default and min/max resource constraints for objects in a namespace.

Example:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      default:
        memory: "512Mi"
      min:
        cpu: "50m"
        memory: "64Mi"
      max:
        cpu: "2"
        memory: "4Gi"

Use LimitRange for:

safe defaults,
preventing missing requests,
preventing absurd limits,
normalizing team behavior.

But be careful:

Defaulting can hide bad engineering.

If every app silently gets the same default, capacity planning becomes fictional.

18. Policy-as-Code Guardrails

At scale, ResourceQuota and LimitRange are not enough.

You may need admission policies such as:

every production container must set CPU request,
every production container must set memory request and memory limit,
no BestEffort Pods in production namespaces,
max CPU limit/request ratio must be bounded,
privileged namespaces require explicit exception,
sidecars must have resource constraints,
Pods must include ownership labels,
batch namespaces must use lower PriorityClass.

Example policy intent:

Deny production Pods where:
  container.resources.requests.memory is missing
  OR container.resources.requests.cpu is missing
  OR container.resources.limits.memory is missing

Governance should enforce invariants, not personal preferences.
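The policy intent can be prototyped as a plain check over a Pod spec, for example in CI. This is a sketch of the rule, not a real admission controller. The function name `resource_violations` is hypothetical, and the check reads the manifest as written, before any defaulting.

```python
def resource_violations(pod_spec):
    """Return human-readable violations for a Pod spec dict."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if "memory" not in requests:
            problems.append(f"{name}: requests.memory is missing")
        if "cpu" not in requests:
            problems.append(f"{name}: requests.cpu is missing")
        if "memory" not in limits:
            problems.append(f"{name}: limits.memory is missing")
    return problems


pod_spec = {
    "containers": [
        {"name": "api",
         "resources": {"requests": {"cpu": "500m", "memory": "512Mi"},
                       "limits": {"memory": "1Gi"}}},
        {"name": "log-shipper", "resources": {}},
    ]
}

for problem in resource_violations(pod_spec):
    print(problem)
```

Here the `api` container passes and the `log-shipper` sidecar is reported three times, once per missing field.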

19. Practical Sizing Method

Do not start with random requests.

Use measurement.

19.1 Minimum Viable Sizing Loop

19.2 CPU Sizing

For a stateless API:

CPU request = enough CPU for normal steady traffic without excessive queueing
CPU limit   = optional; if used, high enough to avoid artificial p99 latency collapse

Suggested signal set:

CPU usage p50/p90/p99,
request rate,
latency p95/p99,
queue depth,
GC pause,
thread pool saturation,
CPU throttling metric,
HPA replica count.

19.3 Memory Sizing

For a stateless API:

memory request = stable working set under normal traffic
memory limit   = p99 memory + burst + startup + safety margin

Suggested signal set:

RSS / working set,
heap usage,
non-heap usage,
direct buffer usage,
OOM events,
restart count,
GC frequency,
memory growth slope.
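One possible way to turn measurements into a first proposal is to take percentiles of observed working-set samples. The heuristic below (request at p50, limit at p99 plus a margin) is an assumption for illustration, not a rule from this lesson. The sample values are invented.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceiling division
    return ordered[max(rank, 1) - 1]


def suggest_memory_mi(working_set_mi, margin_pct=25):
    """Return a (request, limit) proposal in Mi from working-set samples."""
    request = percentile(working_set_mi, 50)
    limit = -(-percentile(working_set_mi, 99) * (100 + margin_pct) // 100)
    return request, limit


# Hypothetical working-set samples in Mi, e.g. one per hour.
samples = [300, 310, 320, 330, 340, 350, 360, 380, 420, 480]

request_mi, limit_mi = suggest_memory_mi(samples)
print(f"request={request_mi}Mi limit={limit_mi}Mi")  # request=340Mi limit=600Mi
```

A proposal like this is only a starting point. Startup peaks and rare bursts that the sampling window missed still need to be checked by hand.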

20. Resource Anti-Patterns

20.1 No Requests in Production

resources: {}

Why it is bad:

scheduler cannot place accurately,
QoS becomes BestEffort,
HPA CPU utilization targets cannot be computed without a CPU request,
eviction risk increases,
capacity planning becomes impossible.

20.2 Request Equals Limit Everywhere

requests:
  cpu: "1"
  memory: "1Gi"
limits:
  cpu: "1"
  memory: "1Gi"

This gives Guaranteed QoS, but if applied blindly:

CPU burst is blocked,
cluster utilization drops,
teams over-request to avoid throttling,
cost increases,
capacity becomes stranded.

Good for some workloads.

Bad as universal policy.

20.3 Tiny CPU Limit on JVM/API Services

limits:
  cpu: "200m"

Risk:

GC slows,
startup slows,
TLS handshakes slow,
request latency spikes,
probes fail,
HPA reacts late or oddly.

20.4 Memory Limit Too Close to Heap

limits:
  memory: "1Gi"

-Xmx1g

Risk:

off-heap memory causes OOM,
native memory not accounted for,
thread stacks exceed margin,
container restarts during traffic spike.

20.5 Missing Sidecar Resources

Injected sidecars consume real resources.

If you size only the main container, you understate Pod cost.

Common examples:

service mesh proxy,
log collector,
metrics exporter,
secrets agent,
security scanner.

20.6 Unbounded `emptyDir`

volumes:
  - name: tmp
    emptyDir: {}

Risk:

local disk pressure,
Pod eviction,
node instability.

Prefer explicit size limits when appropriate:

volumes:
  - name: tmp
    emptyDir:
      sizeLimit: "2Gi"

21. Debugging Resource Failures

21.1 Pending Pod

Command path:

kubectl describe pod <pod>
kubectl get events --sort-by=.lastTimestamp
kubectl describe node <node>
kubectl top nodes

Look for:

0/10 nodes are available: Insufficient cpu.
0/10 nodes are available: Insufficient memory.
node(s) had untolerated taint.
node(s) didn't match Pod's node affinity/selector.

Root causes:

requests too high,
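quota exhausted (the Pod is rejected at admission, so check ReplicaSet or Job events rather than looking for a Pending Pod),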
node selector too strict,
affinity impossible,
taint not tolerated,
cluster autoscaler unable to provision matching node,
PV zone constraints.

21.2 OOMKilled

Command path:

kubectl describe pod <pod>
kubectl logs <pod> --previous
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'

Look for:

Reason: OOMKilled
Exit Code: 137
Restart Count: increasing

Questions:

Did memory usage grow gradually?
Did startup spike exceed limit?
Did traffic spike increase in-flight objects?
Did a new release change heap/cache behavior?
Did sidecar memory increase?
Did request body size increase?

21.3 CPU Throttling

Look for metrics such as:

container_cpu_cfs_throttled_periods_total
container_cpu_cfs_periods_total
container_cpu_cfs_throttled_seconds_total

Questions:

Is CPU limit too low?
Is latency correlated with throttling?
Is HPA target based on request too high/low?
Are startup probes failing during cold start?
Is the service CPU-bound or I/O-bound?
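The first two counters are usually combined into a throttling ratio: the share of CFS periods in which the container was throttled during some window. The sketch below computes that ratio from two counter readings. The sample values are invented and the function name `throttle_ratio` is hypothetical.

```python
def throttle_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods that were throttled between two readings."""
    periods = periods_end - periods_start
    if periods <= 0:
        # No elapsed periods (or a counter reset): nothing to report.
        return 0.0
    return (throttled_end - throttled_start) / periods


# Hypothetical readings of the *_throttled_periods_total and
# *_periods_total counters taken five minutes apart.
ratio = throttle_ratio(
    throttled_start=500, throttled_end=1250,
    periods_start=10_000, periods_end=13_000,
)
print(f"throttled in {ratio:.0%} of periods")  # throttled in 25% of periods
```

A high ratio alongside low average CPU usage is the typical signature of a limit that is too tight for short bursts.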

21.4 Evicted Pod

Command path:

kubectl describe pod <pod>
kubectl describe node <node>
kubectl get events --field-selector involvedObject.kind=Pod

Look for:

The node was low on resource: memory.
The node was low on resource: ephemeral-storage.

Questions:

What QoS class was the Pod?
Was usage above request?
Were logs or temp files growing?
Did a batch job land on the same node?
Did node allocatable leave enough eviction margin?

22. Production Resource Design Patterns

22.1 API Service Default Pattern

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    memory: "1Gi"

Characteristics:

CPU request supports scheduling and HPA.
No CPU limit avoids throttling.
Memory limit protects the node.
Burstable QoS.

Use when:

service is latency-sensitive,
CPU burst is beneficial,
cluster has quota/governance controls,
platform monitors CPU noisy-neighbor behavior.

22.2 Critical Add-On Pattern

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"

Characteristics:

Guaranteed QoS.
Predictable resource envelope.
Less likely to be evicted.

Use when:

workload is platform-critical,
resource profile is predictable,
losing the component causes cluster-wide impact.

22.3 Worker Pattern

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    memory: "2Gi"

Characteristics:

Allows CPU burst.
Memory limit reflects max in-flight processing.
Concurrency should be tuned to resource budget.

Worker sizing must connect resource limits to queue concurrency.

max concurrency <= (memory limit - baseline memory) / worst-case memory per item
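The concurrency bound can be sketched with numbers. The version below subtracts a fixed baseline (runtime, libraries, idle buffers) before dividing, which is an assumption added here. The baseline and per-item figures are hypothetical.

```python
def max_concurrency(memory_limit_mi, baseline_mi, worst_case_item_mi):
    """Largest number of in-flight items that fits under the memory limit."""
    usable = memory_limit_mi - baseline_mi
    if usable <= 0 or worst_case_item_mi <= 0:
        return 0
    return usable // worst_case_item_mi


# 2Gi limit from the worker pattern above; other values are assumed.
print(max_concurrency(memory_limit_mi=2048, baseline_mi=400, worst_case_item_mi=150))  # 10
```

If the queue consumer is configured for more than this many concurrent items, the memory limit becomes a restart trigger rather than a safety boundary.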

22.4 Batch Job Pattern

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "4Gi"

Characteristics:

Explicit bounded execution.
Can tolerate throttling more than APIs.
Should use PriorityClass and quotas.

23. PriorityClass and Business Criticality

Resource management is not only technical.

It encodes business priority.

PriorityClass helps Kubernetes decide which Pods matter more during scheduling/preemption pressure.

Example:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-critical
value: 100000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Critical production workloads."

Pod usage:

spec:
  priorityClassName: production-critical

Use carefully.

Too many “critical” workloads means none are critical.

24. Resource Governance Architecture

A mature platform uses multiple layers.

Layers:

Layer	Purpose
CI validation	Catch obvious bad manifests before cluster admission.
Admission policy	Enforce hard platform invariants.
LimitRange	Provide namespace-level defaults and bounds.
ResourceQuota	Prevent aggregate overuse.
PriorityClass	Encode relative importance.
Metrics	Reveal real behavior.
VPA/recommendation	Improve resource requests over time.
FinOps reports	Connect resources to cost ownership.

25. Capacity Planning

Cluster capacity planning starts with requests, not usage.

Simplified model:

required nodes = max over CPU and memory of ceil(total requested / allocatable per node)

But production planning must include:

zone failure tolerance,
node upgrade surge,
PodDisruptionBudget constraints,
DaemonSet overhead,
system reserved resources,
bin-packing inefficiency,
topology spread constraints,
anti-affinity constraints,
buffer for autoscaler latency,
peak traffic expansion,
deployment surge capacity.
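The simplified model, with a single headroom percentage standing in for the factors listed above, can be sketched as follows. The cluster totals, per-node allocatable values, and the 20% headroom are assumptions for illustration.

```python
def required_nodes(total_requests, allocatable_per_node, headroom_pct=0):
    """Node count needed so every resource fits, rounded up."""
    needed = 0
    for resource, total in total_requests.items():
        padded = total * (100 + headroom_pct)
        per_node = allocatable_per_node[resource] * 100
        # Integer ceiling division; the tightest resource decides.
        needed = max(needed, -(-padded // per_node))
    return needed


# Hypothetical cluster: 60 CPU and 200Gi requested in total.
total_requests = {"cpu_m": 60_000, "memory_mi": 200 * 1024}
allocatable_per_node = {"cpu_m": 3800, "memory_mi": 14_748}

print(required_nodes(total_requests, allocatable_per_node))                   # 16
print(required_nodes(total_requests, allocatable_per_node, headroom_pct=20))  # 19
```

In this example CPU is the binding resource. Memory alone would need only 14 nodes, so two nodes' worth of memory is stranded unless workloads are rebalanced.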

25.1 Deployment Surge Capacity

Rolling updates can temporarily increase Pod count.

If a Deployment uses:

strategy:
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0

The cluster may need extra capacity during rollout.

A service that fits at steady state may fail to roll out if there is no surge headroom.
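The extra capacity can be estimated ahead of a rollout. Kubernetes rounds a percentage `maxSurge` up to a whole number of Pods. The replica count and per-Pod requests below are assumptions for illustration.

```python
def surge_pods(replicas, max_surge_pct):
    """Extra Pods allowed during a rollout; percentages round up."""
    return -(-replicas * max_surge_pct // 100)


def surge_requests(replicas, max_surge_pct, cpu_request_m, memory_request_mi):
    """Additional requests the cluster must fit while old Pods still run."""
    extra = surge_pods(replicas, max_surge_pct)
    return extra * cpu_request_m, extra * memory_request_mi


# Hypothetical Deployment: 20 replicas at 500m / 512Mi each, maxSurge 25%.
extra_cpu_m, extra_memory_mi = surge_requests(20, 25, 500, 512)
print(surge_pods(20, 25), extra_cpu_m, extra_memory_mi)  # 5 2500 2560
```

With `maxUnavailable: 0`, no old Pod is removed until a new one is ready, so all of this headroom must be schedulable at once.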

26. Failure Scenario: The Safe-Looking Deployment That Fails

Assume this service:

replicas: 20
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "200m"
    memory: "512Mi"

Symptoms after traffic growth:

p99 latency increases,
CPU usage never appears high,
HPA scales slowly,
liveness probes occasionally fail,
restart count rises,
users see timeouts.

Likely issues:

CPU limit throttles the app before it can use available node capacity.
CPU request is too low to represent actual per-Pod capacity.
HPA utilization math is distorted by incorrect request.
Probe configuration may treat slow response as dead process.
Memory limit may be too close to burst allocation.

Better investigation:

kubectl top pod
kubectl describe pod <pod>
kubectl get hpa
kubectl describe hpa <hpa>
kubectl logs <pod> --previous

Metrics to inspect:

CPU throttling,
request rate per Pod,
latency per Pod,
GC pause,
memory working set,
HPA desired replicas,
probe failure events.

27. Resource Review Checklist

Before promoting a service to production, ask:

Scheduling

Are CPU and memory requests set for every container?
Do requests represent measured behavior?
Can the workload schedule during rolling update surge?
Do topology constraints reduce effective capacity?

Runtime

Is memory limit safe under peak and startup?
Is CPU limit intentionally set or accidentally copied?
Are sidecars included in total Pod sizing?
Are emptyDir and ephemeral storage bounded?

Reliability

What happens under node memory pressure?
What QoS class is assigned?
Does PriorityClass match business criticality?
Are PodDisruptionBudgets aligned with capacity?

Autoscaling

Does HPA target use calibrated requests?
Does scaling react before SLO burn?
Does node autoscaler have headroom and compatible node groups?
Is startup time included in autoscaling math?

Governance

Does the namespace have quota?
Are defaults explicit and documented?
Are exceptions tracked?
Are reports available for cost and resource drift?

28. Top 1% Mental Models

28.1 Requests Are Scheduling Truth, Not Runtime Truth

Requests shape placement and autoscaling math.

They do not cap usage.

28.2 Limits Are Runtime Boundaries, Not Capacity Reservations

Limits protect nodes and enforce boundaries.

They can also create artificial failure.

28.3 CPU Is Delay; Memory Is Death

CPU pressure usually becomes latency.

Memory pressure often becomes restart or eviction.

28.4 QoS Is an Eviction Contract

QoS tells Kubernetes how expendable your Pod appears under pressure.

28.5 Resource Policy Is Product Policy

Criticality, quotas, priority, and capacity allocation encode business decisions.

28.6 Autoscaling Cannot Fix Bad Sizing

HPA, VPA, and node autoscaling are control loops.

Bad inputs create bad control behavior.

29. Practice Lab

Lab 1 — Observe QoS

Create three Pods:

no requests/limits,
requests lower than limits,
requests equal limits.

Run:

kubectl get pod <pod> -o jsonpath='{.status.qosClass}'

Expected:

BestEffort,
Burstable,
Guaranteed.

Lab 2 — Create a Pending Pod

Request impossible capacity:

resources:
  requests:
    cpu: "999"
    memory: "999Gi"

Observe:

kubectl describe pod <pod>

Learn how scheduling failure is reported.

Lab 3 — Trigger Memory OOM

Run a container with a low memory limit and allocate more memory.

Observe:

kubectl describe pod <pod>
kubectl logs <pod> --previous

Lab 4 — Observe CPU Throttling

Run CPU-bound work under a low CPU limit.

Observe latency and throttling metrics.

Lab 5 — Quota Failure

Create a namespace quota and deploy more replicas than allowed.

Observe admission failure.

30. Summary

Resource management is one of the deepest Kubernetes production skills because it sits between application behavior, scheduler placement, runtime enforcement, and business governance.

Key takeaways:

Requests drive scheduling and autoscaling calibration.
Limits drive runtime enforcement.
CPU pressure usually creates latency.
Memory pressure creates kills and evictions.
QoS influences eviction priority.
Quotas and LimitRanges govern namespace-level behavior.
CPU limits can damage latency-sensitive services when set too low.
Memory limits must include heap, off-heap, native, thread, and burst overhead.
Sidecars must be included in Pod-level resource design.
A mature platform treats resource policy as an explicit contract, not YAML decoration.

In the next part, we connect this foundation to autoscaling: HPA, VPA, Cluster Autoscaler, KEDA, metric selection, scaling lag, control-loop instability, and SLO-aware capacity design.

References

Kubernetes Documentation — Resource Management for Pods and Containers: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
Kubernetes Documentation — Pod Quality of Service Classes: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
Kubernetes Documentation — Node-pressure Eviction: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
Kubernetes Documentation — Resource Quotas: https://kubernetes.io/docs/concepts/policy/resource-quotas/
Kubernetes Documentation — Limit Ranges: https://kubernetes.io/docs/concepts/policy/limit-range/
Kubernetes Documentation — Assign CPU Resources to Containers and Pods: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
Kubernetes Documentation — Assign Memory Resources to Containers and Pods: https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/
