Production Debugging: Pods, Nodes, Network, DNS, Storage

Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering - Part 027

Production debugging for Kubernetes systems, including a systematic triage method across Pods, Nodes, Services, DNS, networking, storage, resources, rollout state, events, logs, ephemeral containers, and incident evidence collection.

2026-07-01 | 20 min read | 3913 words

Part 027 — Production Debugging: Pods, Nodes, Network, DNS, Storage

1. Why This Part Exists

A top-tier Kubernetes engineer is not defined by how quickly they can write YAML.

They are defined by how quickly and safely they can answer this question during production pressure:

What changed, where is the failure, what is the blast radius, and what is the safest next action?

Kubernetes gives you many moving parts:

API server
admission controllers
scheduler
controller manager
kubelet
container runtime
CNI
CSI
CoreDNS
Services
EndpointSlices
Ingress or Gateway controllers
autoscalers
policy engines
mesh sidecars
application containers

Because the system is distributed, a symptom rarely tells you the root cause directly.

A CrashLoopBackOff is not a root cause.

A Pending Pod is not a root cause.

A 503 from an Ingress is not a root cause.

A Forbidden error is not a root cause.

A missing endpoint is not always a networking problem.

Production debugging is the skill of turning symptoms into evidence, evidence into hypotheses, and hypotheses into safe interventions.

This part is deliberately practical. It teaches a repeatable debugging method that works across Kubernetes clusters, cloud providers, ingress controllers, service meshes, CSI drivers, and workload types.

2. Kaufman Skill Target

Following Josh Kaufman's The First 20 Hours, the target is not to memorize every possible failure.

The target is to acquire enough structure to self-correct under ambiguity.

After this part, you should be able to:

Triage a Kubernetes incident without randomly changing manifests.
Map symptoms to Kubernetes layers.
Read object state from spec, status, conditions, events, and logs.
Debug Pending, CrashLoopBackOff, ImagePullBackOff, Running but not Ready, Service has no endpoints, DNS failures, network blocks, and storage mount failures.
Distinguish workload failure from platform failure.
Use kubectl debug and ephemeral containers safely.
Produce an incident evidence timeline.
Decide when to rollback, scale, drain, patch, restart, or escalate.

The first 20 hours should be spent mostly on the debugging loop, not on memorizing commands.

3. The Core Mental Model

Kubernetes debugging has one invariant:

Never debug a Kubernetes symptom at only one layer.

A Kubernetes workload is an object graph.

When a user reports failure, you must locate the broken edge in the graph.

Examples:

Symptom	Possible Broken Edge
HTTP 503 from ingress	Ingress -> Service, Service -> EndpointSlice, readiness, backend crash, Gateway route attachment
Pod stuck `Pending`	Pod -> Node scheduling, resource requests, affinity, taints, PVC binding
Pod `Running` but not receiving traffic	Pod readiness -> EndpointSlice, Service selector mismatch, NetworkPolicy, mesh sidecar readiness
DNS timeout	Pod -> CoreDNS, NetworkPolicy egress, node DNS config, CoreDNS overload
PVC stuck `Pending`	PVC -> StorageClass, provisioner, topology, quota, unavailable zone
App slow	CPU throttling, memory pressure, dependency latency, node pressure, autoscaling lag

The object graph is the debugging map.
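
One way to make "locate the broken edge" concrete is to model a traffic path as an ordered list of edges and walk it until a check fails. The sketch below is illustrative only: `TRAFFIC_PATH`, `first_broken_edge`, and the `checks` results are hypothetical names and data, and in practice each check result would come from the kubectl evidence described in the following sections.

```python
from typing import Optional

# Ordered traffic path for an HTTP request, expressed as (from, to) edges.
TRAFFIC_PATH = [
    ("Ingress", "Service"),
    ("Service", "EndpointSlice"),
    ("EndpointSlice", "Pod"),
]


def first_broken_edge(path, checks) -> Optional[tuple]:
    """Return the first edge whose check failed, or None if all passed.

    `checks` maps an edge to True (healthy) or False (broken). An edge with
    no recorded check counts as unverified and is reported as well.
    """
    for edge in path:
        if not checks.get(edge, False):
            return edge
    return None


# Hypothetical evidence: the Service exists, but no ready endpoints back it.
checks = {
    ("Ingress", "Service"): True,
    ("Service", "EndpointSlice"): False,
    ("EndpointSlice", "Pod"): True,
}
print(first_broken_edge(TRAFFIC_PATH, checks))
```

Treating an unchecked edge as broken is a deliberate choice here: an edge nobody has looked at is not evidence of health.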

4. Debugging Is a Control Loop

Do not debug by jumping directly to a fix.

Debug like Kubernetes itself: observe, compare, act, verify.

A weak debugging loop looks like this:

Symptom -> random restart -> maybe works -> unknown cause -> repeats later

A strong debugging loop looks like this:

Symptom -> layer map -> evidence -> hypothesis -> minimal test -> safe action -> verified recovery -> post-incident hardening

The difference is not speed.

The difference is whether the organization learns.

5. First 10 Minutes of Production Triage

When the incident starts, do not begin with a deep dive.

First, answer four questions:

1. Who is affected?
2. What changed recently?
3. Which Kubernetes object graph owns the traffic path?
4. Is the system getting better, worse, or stable?

5.1 Minimal Triage Commands

# Identify namespace and workload
kubectl get deploy,sts,ds,job,cronjob -n <namespace>

# Get high-level Pod health
kubectl get pods -n <namespace> -o wide

# Inspect rollout state
kubectl rollout status deployment/<name> -n <namespace>
kubectl rollout history deployment/<name> -n <namespace>

# Inspect workload details
kubectl describe deployment/<name> -n <namespace>
kubectl describe pod/<pod> -n <namespace>

# Check recent events
kubectl get events -n <namespace> --sort-by=.lastTimestamp

# Check services and endpoints
kubectl get svc,endpointslice -n <namespace>

# Check logs (a deployment/ target reads from one Pod that kubectl picks, not every replica)
kubectl logs deployment/<name> -n <namespace> --all-containers --tail=200

# Check previous crashed container logs
kubectl logs pod/<pod> -n <namespace> -c <container> --previous --tail=200

5.2 What You Are Looking For

Evidence	Why It Matters
`RESTARTS` increasing	Runtime failure, probe failure, OOM, crash
`READY 0/1`	Pod exists but is not serving traffic
`Pending`	Scheduler, resources, taints, affinity, PVC
`ImagePullBackOff`	Registry, credentials, tag/digest, image not found
`OOMKilled`	Memory limit too low, leak, spike, wrong sizing
`FailedScheduling` event	Scheduler explains why no node was selected
Empty EndpointSlice	Service has no ready matching Pod
Rollout stuck	Deployment progress deadline, readiness, resource, image, admission
Node pressure	Platform capacity or noisy neighbor

Do not treat kubectl get pods as the full truth. It is only a summary.

The real evidence is in:

conditions
events
status fields
container states
controller status
metrics
logs
recent changes

6. The Debugging Stack

A useful production debugging stack is ordered from user-visible symptom down to infrastructure.

Each layer has a different evidence source.

Layer	Primary Evidence
User impact	SLI dashboards, error rate, latency, support signal
Traffic entry	Ingress/Gateway status, controller logs, LB health
Service discovery	Service, EndpointSlice, DNS lookup
Controller	Deployment/StatefulSet/DaemonSet/Job status
Pod	Pod phase, conditions, events
Container	state, lastState, exitCode, logs, probes
Node	node conditions, kubelet events, pressure signals
Network	NetworkPolicy, CNI logs, DNS behavior, connectivity tests
Storage	PVC/PV status, events, CSI logs, mount errors
Resource	metrics, throttling, OOM, eviction, quota

7. Debugging `Pending` Pods

A Pending Pod means Kubernetes accepted the object, but one or more of its containers has not yet been created and started. Time spent waiting for scheduling and pulling images counts as Pending.

There are two broad states:

Pod exists but is not scheduled.
Pod is scheduled but containers are not running.

7.1 First Commands

kubectl get pod <pod> -n <namespace> -o wide
kubectl describe pod <pod> -n <namespace>
kubectl get events -n <namespace> --sort-by=.lastTimestamp

Look for the FailedScheduling event reason and for these fragments in the event messages:

Event Reason or Message Fragment	Likely Cause
`FailedScheduling`	No node satisfies constraints
`Insufficient cpu`	Requests exceed available allocatable CPU
`Insufficient memory`	Requests exceed available allocatable memory
`node(s) had untolerated taint`	Missing toleration
`didn't match Pod's node affinity/selector`	Placement constraints too narrow
`pod has unbound immediate PersistentVolumeClaims`	PVC not bound before scheduling
`max node group size reached`	Cluster autoscaler cannot add capacity

7.2 Scheduling Debugging Tree
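
The scheduling checks can be sketched as a small classifier over the text of a FailedScheduling event. This is a minimal illustration, not a parser for real scheduler output: the fragments and causes are copied from the table in 7.1, the function name `scheduling_causes` is hypothetical, and the sample message only imitates the style of scheduler messages, whose exact wording varies by Kubernetes version.

```python
# Message fragments and likely causes, taken from the table in section 7.1.
SCHEDULING_HINTS = [
    ("Insufficient cpu", "requests exceed available allocatable CPU"),
    ("Insufficient memory", "requests exceed available allocatable memory"),
    ("untolerated taint", "missing toleration"),
    ("node affinity/selector", "placement constraints too narrow"),
    ("unbound immediate PersistentVolumeClaims", "PVC not bound before scheduling"),
    ("max node group size reached", "cluster autoscaler cannot add capacity"),
]


def scheduling_causes(message: str) -> list:
    """Return every likely cause whose fragment appears in the message."""
    lowered = message.lower()
    return [cause for fragment, cause in SCHEDULING_HINTS
            if fragment.lower() in lowered]


# Illustrative message only; real wording depends on the cluster version.
msg = ("0/3 nodes are available: 1 Insufficient cpu, "
       "2 node(s) had untolerated taint {dedicated: gpu}.")
for cause in scheduling_causes(msg):
    print(cause)
```

A single message often names several constraints at once, which is why the function returns a list instead of the first match.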

7.3 Common Root Causes

Too-large requests

resources:
  requests:
    cpu: "8"
    memory: "32Gi"

A Pod with huge requests may be unschedulable even if the cluster has enough total free capacity distributed across many nodes.

Kubernetes schedules a Pod to one node. It does not split a single Pod across nodes.
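
A short arithmetic sketch shows why total free capacity is misleading. The node figures are hypothetical, and "free" here means allocatable CPU minus the requests of Pods already on the node, which is what the scheduler compares against.

```python
def fits_on_some_node(request_cpu: float, free_cpu_per_node: list) -> bool:
    """A Pod is schedulable only if one single node can hold its whole request."""
    return any(free >= request_cpu for free in free_cpu_per_node)


free_cpu = [3.0, 3.0, 3.0, 3.0]  # hypothetical free cores on four nodes
request = 8.0                    # matches the cpu: "8" request above

print(sum(free_cpu) >= request)              # True: 12 cores free in total
print(fits_on_some_node(request, free_cpu))  # False: no node has 8 free cores
```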

Taints without tolerations

Dedicated nodes often use taints:

kubectl describe node <node> | grep -i taints

If the Pod does not tolerate the taint, it will not schedule there.

Over-constrained affinity

Node affinity, pod anti-affinity, and topology constraints can combine into impossible placement.

Example bad pattern:

Require zone A.
Require node pool GPU.
Require anti-affinity across hostname.
Require PVC bound in zone B.

No scheduler can satisfy contradictory constraints.

PVC topology mismatch

A PVC may bind to a volume in one zone while the Pod is constrained to another zone.

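This is why volumeBindingMode: WaitForFirstConsumer matters for topology-aware dynamic provisioning: it delays binding and provisioning until a Pod that uses the PVC is being scheduled, so the volume lands in a zone where the Pod can run.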

7.4 Safe Fixes

Cause	Safer Fix
Requests too large	Reduce request based on observed usage, or add node capacity
Missing toleration	Add explicit toleration only for intended workload class
Bad node selector	Relax selector, use node affinity with clear labels
PVC pending	Fix StorageClass/provisioner/quota before recreating workload
Topology impossible	Remove contradictory constraints

Do not simply delete and recreate Pods repeatedly. If the scheduler cannot place the Pod, recreation only creates new unschedulable Pods.

8. Debugging `ImagePullBackOff` and `ErrImagePull`

Image pull failures happen before application code runs.

8.1 Evidence

kubectl describe pod <pod> -n <namespace>
kubectl get secret -n <namespace>
kubectl get serviceaccount <sa> -n <namespace> -o yaml

Look at events:

Failed to pull image
manifest unknown
unauthorized
no basic auth credentials
i/o timeout
x509: certificate signed by unknown authority

8.2 Failure Taxonomy

Error Pattern	Meaning
`manifest unknown`	Tag/digest does not exist in registry
`unauthorized`	Missing or wrong imagePullSecret/workload identity
`no basic auth credentials`	Runtime cannot authenticate to registry
`x509`	Registry certificate trust issue
`i/o timeout`	Node cannot reach registry or DNS/proxy issue
`not found`	Repository path or registry hostname wrong

8.3 Production Advice

Use immutable image references for production:

image: registry.example.com/team/app@sha256:<digest>

Tags are convenient for humans.

Digests are safer for release evidence.

If an incident involves a bad image, the digest tells you exactly what binary content ran.

9. Debugging `CrashLoopBackOff`

CrashLoopBackOff means the container repeatedly starts and exits, and Kubernetes backs off restart attempts.

It does not tell you why.

9.1 First Commands

kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace> -c <container> --previous --tail=200
kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

9.2 Important Fields

Look for:

state:
  waiting:
    reason: CrashLoopBackOff
lastState:
  terminated:
    reason: Error
    exitCode: 1
    startedAt: ...
    finishedAt: ...
restartCount: 12

9.3 CrashLoop Decision Tree
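
The decision order can be sketched as a function over one entry of status.containerStatuses, using the fields shown in 9.2. This is a rough triage aid under stated assumptions: the function name `crashloop_hypothesis` and the returned wording are hypothetical, and each result is a hypothesis to verify against logs and events, not a diagnosis.

```python
def crashloop_hypothesis(container_status: dict) -> str:
    """Suggest where to look first, based on the last terminated state."""
    terminated = container_status.get("lastState", {}).get("terminated")
    if not terminated:
        return "no previous termination recorded; check events and current state"
    reason = terminated.get("reason", "")
    code = terminated.get("exitCode")
    if reason == "OOMKilled":
        return "memory: limit too low, leak, or spike"
    if code == 0:
        return "process completed; a Job or CronJob may fit better than a Deployment"
    if code in (126, 127):
        return "bad command/args: command not executable or not found"
    if code == 137:
        return "SIGKILL: check memory and whether a probe or grace period forced the kill"
    if code == 143:
        return "SIGTERM: check probe failures and who asked the container to stop"
    return "application error: read previous logs and config/secret references"


# The example status from section 9.2, as it appears in `-o json` output.
status = {
    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
    "lastState": {"terminated": {"reason": "Error", "exitCode": 1}},
    "restartCount": 12,
}
print(crashloop_hypothesis(status))
```

The reason field is checked before the exit code because an OOM kill also reports exit code 137, and the reason is the more specific signal.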

9.4 Common Root Causes

Root Cause	Evidence
Bad config	Logs show missing env/config file; ConfigMap/Secret mismatch
Missing secret	Mount error, env var absent, app auth failure
OOMKilled	`lastState.terminated.reason=OOMKilled`
Bad command/args	Immediate exit, shell error, file not found
Dependency unavailable	DB/cache/API timeout during startup
Liveness probe too aggressive	Container killed while still starting
Wrong controller	Batch process exits successfully but Deployment restarts it
Read-only filesystem issue	App tries to write to root filesystem

9.5 Exit Codes

Exit Code	Typical Meaning
0	Process completed successfully; wrong if managed by Deployment expecting long-running process
1	General application error
2	Shell/builtin misuse or application-specific error
126	Command found but not executable
127	Command not found
137	Killed, often SIGKILL/OOM
143	SIGTERM, often graceful termination path

Exit codes are not universal truth, but they narrow the hypothesis.
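
Codes above 128 usually follow the convention "128 plus the signal number", which is where 137 (128 + 9, SIGKILL) and 143 (128 + 15, SIGTERM) come from. The sketch below decodes that convention with Python's standard `signal` module; the function name `describe_exit_code` is hypothetical, and an application is free to return such codes for its own reasons.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode an exit code using the common 128 + signal-number convention."""
    if code == 0:
        return "exited successfully"
    if 128 < code < 128 + signal.NSIG:
        number = code - 128
        try:
            return f"terminated by {signal.Signals(number).name}"
        except ValueError:
            # The number is in range but has no named signal on this platform.
            return f"terminated by signal {number}"
    return f"exited with application status {code}"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```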

9.6 Safe Fixes

Cause	Fix
App exits after completing work	Use Job/CronJob instead of Deployment
OOMKilled	Increase memory limit, fix leak, reduce concurrency, tune JVM/runtime
Liveness kills during startup	Add `startupProbe`, relax liveness threshold
Config missing	Fix ConfigMap/Secret reference and rollout
Dependency required at startup	Add retry/backoff, avoid fatal boot dependency when possible

10. Debugging `Running` but Not `Ready`

A Pod can be running and still not serve traffic.

This is usually Kubernetes working as designed.

Running means the Pod is bound to a node and at least one container is running or starting.

Ready means the Pod should receive traffic through Services.

10.1 First Commands

kubectl get pod <pod> -n <namespace>
kubectl describe pod <pod> -n <namespace>
kubectl get endpointslice -n <namespace> -l kubernetes.io/service-name=<service>

10.2 Common Causes

Cause	Evidence
Readiness probe failing	Pod events show readiness failures
App listening on different port	Probe connection refused
Dependency check too strict	Readiness fails because DB/cache temporarily unavailable
Sidecar not ready	Multi-container Pod readiness blocked
Service selector mismatch	Pod ready but not in EndpointSlice
Port name mismatch	Service targetPort does not match Pod port name

10.3 Probe Semantics

Use probes intentionally:

Probe	Purpose	Bad Use
`startupProbe`	Protect slow startup from liveness kill	Using huge liveness delay instead
`livenessProbe`	Restart deadlocked process	Checking downstream dependency
`readinessProbe`	Remove unready Pod from traffic	Failing on optional dependency

A common outage pattern:

Database has short blip.
Readiness probe checks database strictly.
All Pods mark NotReady.
Service loses all endpoints.
Ingress returns 503.
Application could have degraded, but readiness removed all capacity.

Readiness should indicate whether the instance can serve useful traffic.

It should not blindly mirror every dependency state.
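
A minimal sketch of that idea: split dependency checks into required and optional, and let only the required ones decide readiness. The check names and the `readiness` function are hypothetical; a real handler would run the checks and expose the result on the probe endpoint.

```python
def readiness(checks: dict, required: set) -> tuple:
    """Return (ready, degraded).

    `checks` maps a check name to True/False. Only failures of required
    checks make the instance unready; other failures are reported as
    degraded so they stay visible without removing capacity.
    """
    failed = {name for name, ok in checks.items() if not ok}
    ready = not (failed & required)
    degraded = sorted(failed - required)
    return ready, degraded


checks = {"http_server": True, "database": False, "cache": True}

# Database treated as optional: the Pod stays in rotation, marked degraded.
print(readiness(checks, required={"http_server"}))              # (True, ['database'])

# Database treated as required: a short blip removes this Pod from traffic.
print(readiness(checks, required={"http_server", "database"}))  # (False, [])
```

Which dependencies belong in `required` is an application decision. The point of the split is that it is made explicitly instead of by default.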

11. Debugging Service and EndpointSlice Failures

A Service is a stable virtual access point.

It does not guarantee backends exist.

11.1 First Commands

kubectl get svc <service> -n <namespace> -o yaml
kubectl get endpointslice -n <namespace> -l kubernetes.io/service-name=<service> -o yaml
kubectl get pods -n <namespace> --show-labels

11.2 Key Questions

Does the Service selector match the intended Pods?
Are matching Pods Ready?
Does targetPort match containerPort or named port?
Are EndpointSlices created?
Are endpoints marked ready?
Is traffic blocked by NetworkPolicy or mesh policy?

11.3 Selector Debugging

Service selector:

selector:
  app.kubernetes.io/name: payments
  app.kubernetes.io/component: api

Pod labels:

labels:
  app.kubernetes.io/name: payment
  app.kubernetes.io/component: api

One missing s can remove all backends.

To test selector match:

kubectl get pods -n <namespace> -l app.kubernetes.io/name=payments,app.kubernetes.io/component=api
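
The matching rule behind that command is simple: every key/value pair in the selector must be present in the Pod's labels. The sketch below reproduces it for the example above; the helper names are hypothetical, and it covers only equality selectors as used by Services.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every selector pair is present in the labels (equality match)."""
    return all(labels.get(key) == value for key, value in selector.items())


def selector_mismatches(selector: dict, labels: dict) -> dict:
    """Map each mismatching key to (wanted value, actual value or None)."""
    return {key: (value, labels.get(key))
            for key, value in selector.items()
            if labels.get(key) != value}


service_selector = {
    "app.kubernetes.io/name": "payments",
    "app.kubernetes.io/component": "api",
}
pod_labels = {
    "app.kubernetes.io/name": "payment",
    "app.kubernetes.io/component": "api",
}

print(selector_matches(service_selector, pod_labels))     # False
print(selector_mismatches(service_selector, pod_labels))
```

Printing the mismatching keys, not just the boolean, makes a one-character typo visible immediately.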

11.4 Service Failure Taxonomy

Symptom	Likely Cause
Service exists, EndpointSlice empty or absent	Selector mismatch or no Pods
EndpointSlices exist, endpoints not ready	Readiness failure
Endpoints ready, connection refused	Wrong targetPort or app not listening
Endpoints ready, timeout	NetworkPolicy, CNI, app hang, node path
Works from same Pod, fails cross-namespace	NetworkPolicy or DNS naming
Works by Pod IP, fails by Service name	DNS or kube-proxy/data plane issue

12. Debugging DNS

Kubernetes DNS enables workloads to discover Services by name.

DNS failures often look like application failures.

12.1 Test DNS from Inside the Cluster

Run a temporary debug Pod:

kubectl run dns-debug -n <namespace> --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup kubernetes.default

Then test the Service itself from a shell inside a debug Pod:

nslookup <service>.<namespace>.svc.cluster.local
wget -qO- http://<service>.<namespace>.svc.cluster.local:<port>/health

12.2 Inspect CoreDNS

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200
kubectl get configmap coredns -n kube-system -o yaml

The k8s-app=kube-dns label is common but not universal. Check which labels your distribution puts on its DNS Pods.

12.3 DNS Failure Taxonomy

Symptom	Possible Cause
`NXDOMAIN`	Wrong service name/namespace, Service does not exist
Timeout	CoreDNS unreachable, NetworkPolicy egress block, CNI issue
Slow lookup	CoreDNS overload, upstream resolver latency, high search path expansion
Works by FQDN, fails by short name	Namespace/search path issue
Works in one namespace only	NetworkPolicy or namespace-specific DNS behavior

12.4 DNS Search Path Trap

Inside namespace payments, resolving orders usually searches:

orders.payments.svc.cluster.local
orders.svc.cluster.local
orders.cluster.local
...

If the real Service is in namespace commerce, use:

orders.commerce.svc.cluster.local

Do not rely on short names across namespaces.
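
The lookup order can be sketched as follows. This assumes the typical Pod resolver configuration for cluster DNS (three cluster search suffixes and ndots:5) and ignores any extra search domains inherited from the node; the function name `lookup_candidates` is hypothetical.

```python
def lookup_candidates(name: str, namespace: str,
                      cluster_domain: str = "cluster.local",
                      ndots: int = 5) -> list:
    """Return the names a resolver would try, in order, for `name`."""
    if name.endswith("."):
        return [name]  # trailing dot: fully qualified, no search expansion
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") < ndots:
        # Fewer dots than ndots: search suffixes are tried before the bare name.
        return expanded + [name]
    return [name] + expanded


for candidate in lookup_candidates("orders", "payments"):
    print(candidate)

# Even a full Service name has only four dots, so it is expanded first too.
for candidate in lookup_candidates("orders.commerce.svc.cluster.local", "payments"):
    print(candidate)
```

The second call shows where slow lookups from search path expansion come from: a name with fewer than five dots goes through every search suffix before it is tried as written. A trailing dot skips the expansion.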

13. Debugging NetworkPolicy and Connectivity

NetworkPolicy failures are often silent.

A denied packet usually does not produce a clean Kubernetes event.

13.1 Important Mental Model

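NetworkPolicy is additive: policies can only allow traffic, and the allowed set is the union of every policy that selects a Pod. A Pod that no policy selects accepts all traffic.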

Once a Pod is selected by an ingress or egress policy, only explicitly allowed traffic is permitted for that direction.

No matching allow rule means deny.
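
These rules can be sketched as a small evaluator for ingress traffic. It is a deliberately simplified model, not the real API: the dictionary shape (`pod_selector`, `policy_types`, `from_labels`, `ports`) is hypothetical, and namespaces, ipBlock peers, and protocols are left out.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Equality label match; an empty selector matches every Pod."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(policies, dest_labels, source_labels, port) -> bool:
    selecting = [p for p in policies
                 if "Ingress" in p["policy_types"]
                 and matches(p["pod_selector"], dest_labels)]
    if not selecting:
        return True  # no policy selects the Pod for ingress: all traffic allowed
    # Selected: allowed only if some rule in some selecting policy matches.
    return any(matches(rule["from_labels"], source_labels) and port in rule["ports"]
               for policy in selecting
               for rule in policy["ingress"])


policies = [{
    "pod_selector": {"app": "api"},
    "policy_types": ["Ingress"],
    "ingress": [{"from_labels": {"app": "web"}, "ports": {8080}}],
}]

print(ingress_allowed(policies, {"app": "api"}, {"app": "web"}, 8080))    # True
print(ingress_allowed(policies, {"app": "api"}, {"app": "batch"}, 8080))  # False
print(ingress_allowed(policies, {"app": "other"}, {"app": "batch"}, 80))  # True
```

The third call is the case that surprises people: the destination is not selected by any policy, so nothing is denied.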

13.2 First Commands

kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy> -n <namespace>
kubectl get pods -n <namespace> --show-labels

Connectivity test:

kubectl run net-debug -n <namespace> --rm -it --restart=Never \
  --image=nicolaka/netshoot -- bash

Inside the debug container:

curl -v http://<service>.<namespace>.svc.cluster.local:<port>
dig <service>.<namespace>.svc.cluster.local
nc -vz <host> <port>

13.3 Common NetworkPolicy Mistakes

Mistake	Effect
Default-deny egress without DNS allow	Apps cannot resolve names
Label mismatch in `podSelector`	Policy does not select intended Pods
Namespace selector missing	Cross-namespace traffic blocked
Allows Pod port but not dependency port	App timeout
CNI does not enforce NetworkPolicy	Policy object exists but has no effect
Service mesh also has authz policy	NetworkPolicy permits but mesh denies

13.4 DNS Allow Example

If you use default-deny egress, explicitly allow DNS to CoreDNS.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

This is a baseline pattern, not a universal manifest. Your DNS labels may differ.

14. Debugging Ingress and Gateway Failures

Ingress/Gateway failures sit between external traffic and internal Service discovery.

14.1 Debugging Path

Client -> DNS -> External LB -> Ingress/Gateway listener -> Route -> Service -> EndpointSlice -> Pod

14.2 Commands

kubectl get ingress -A
kubectl describe ingress <name> -n <namespace>

kubectl get gatewayclass,gateway,httproute -A
kubectl describe gateway <name> -n <namespace>
kubectl describe httproute <name> -n <namespace>

kubectl get svc,endpointslice -n <namespace>

Controller logs are critical:

kubectl logs -n <controller-namespace> deployment/<controller-deployment> --tail=200

14.3 Failure Taxonomy

Symptom	Possible Cause
External DNS resolves wrong IP	DNS/LB provisioning issue
TLS handshake failure	Wrong certificate, SNI, secret, listener config
404	Host/path route does not match
503	Route matches but backend unavailable
502	Backend connection error, protocol mismatch
Works internally but not externally	Gateway/Ingress/LB/firewall issue
HTTPRoute not attached	ParentRef/namespace policy/listener mismatch

14.4 Gateway API-Specific Checks

For Gateway API, inspect conditions.

kubectl get httproute <name> -n <namespace> -o yaml
kubectl get gateway <name> -n <namespace> -o yaml

Look for:

status:
  parents:
  - conditions:
    - type: Accepted
    - type: ResolvedRefs

If a route is not accepted, the traffic path is broken before it reaches the Service.
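
Reading conditions by eye is error-prone when a route has several parents. The sketch below checks the two condition types named above on an HTTPRoute loaded as a dictionary, for example from `-o json` output. The function name `route_problems` and the sample route are hypothetical.

```python
def route_problems(route: dict) -> list:
    """List parents whose Accepted or ResolvedRefs condition is not True."""
    parents = route.get("status", {}).get("parents", [])
    if not parents:
        return ["no parent status: no controller has reported on this route"]
    problems = []
    for parent in parents:
        name = parent.get("parentRef", {}).get("name", "unknown")
        conditions = {c["type"]: c for c in parent.get("conditions", [])}
        for wanted in ("Accepted", "ResolvedRefs"):
            cond = conditions.get(wanted)
            if cond is None:
                problems.append(f"{name}: {wanted} condition missing")
            elif cond.get("status") != "True":
                reason = cond.get("reason", "no reason")
                problems.append(f"{name}: {wanted}={cond.get('status')} ({reason})")
    return problems


# Hypothetical route: attached to the Gateway, but its backend does not resolve.
route = {"status": {"parents": [{
    "parentRef": {"name": "public-gw"},
    "conditions": [
        {"type": "Accepted", "status": "True", "reason": "Accepted"},
        {"type": "ResolvedRefs", "status": "False", "reason": "BackendNotFound"},
    ],
}]}}
print(route_problems(route))
```

An empty parents list is reported separately because it usually means no controller has picked up the route at all, which is a different problem from a rejected route.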

15. Debugging Storage

Storage failures are often stateful, slow, and expensive to fix incorrectly.

Be careful.

Never delete a PVC during an incident unless you fully understand the reclaim policy, backup status, and workload data model.

15.1 Debugging PVC Pending

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc> -n <namespace>
kubectl get storageclass
kubectl get pv
kubectl get events -n <namespace> --sort-by=.lastTimestamp

Possible causes:

Symptom	Cause
PVC `Pending`	No default StorageClass, wrong StorageClass, provisioner failure, quota
Pod pending with unbound PVC	Immediate binding or provisioning failure
Volume zone conflict	PV exists in different topology zone than Pod
Mount timeout	CSI/node plugin failure, cloud attach issue
Permission denied	`fsGroup`, UID/GID, read-only volume, filesystem ownership
Multi-attach error	RWO volume attached to another node

15.2 Attach/Mount Debugging

kubectl describe pod <pod> -n <namespace>
kubectl describe pvc <pvc> -n <namespace>
kubectl describe pv <pv>
kubectl get pods -n kube-system | grep -i csi
kubectl logs -n kube-system <csi-controller-or-node-pod> --tail=200

CSI component names vary by driver.

15.3 The Dangerous Shortcut

Bad incident reaction:

PVC mount failed.
Delete PVC.
Recreate workload.
Data lost.

Better reaction:

Identify whether failure is provisioning, attach, mount, permissions, or application data corruption.
Check reclaim policy.
Check snapshot/backup.
Check whether PV still exists.
Only mutate storage after evidence and owner approval.

15.4 Storage Debugging Tree
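
The stages from 15.3 (provisioning, attach, mount, permissions or data) can be sketched as an ordered check. This is a rough first cut under stated assumptions: the function name `storage_failure_stage` is hypothetical, the input is the PVC phase plus the reasons of the Pod's events, and the result only tells you which layer to investigate first.

```python
def storage_failure_stage(pvc_phase: str, pod_event_reasons: list) -> str:
    """Return the earliest storage stage that shows evidence of failure."""
    if pvc_phase != "Bound":
        # Nothing downstream can work until the claim is bound to a volume.
        return "provisioning/binding: check StorageClass, provisioner, quota, topology"
    reasons = set(pod_event_reasons)
    if "FailedAttachVolume" in reasons:
        return "attach: check multi-attach, cloud attach, CSI controller logs"
    if "FailedMount" in reasons:
        return "mount: check CSI node plugin and kubelet events"
    return "permissions or application data: check fsGroup, UID/GID, app logs"


print(storage_failure_stage("Pending", []))
print(storage_failure_stage("Bound", ["Scheduled", "FailedAttachVolume"]))
print(storage_failure_stage("Bound", ["Scheduled", "FailedMount"]))
print(storage_failure_stage("Bound", ["Scheduled", "Started"]))
```

None of these branches mutates anything. Deleting or recreating storage objects stays a separate, reviewed decision.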

16. Debugging Node-Level Problems

Node problems can create many Pod symptoms at once.

16.1 First Commands

kubectl get nodes -o wide
kubectl describe node <node>
kubectl top node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>

Look for node conditions:

Condition	Meaning
`Ready=False` or `Ready=Unknown`	Node cannot run Pods reliably, or the control plane has lost contact with the kubelet
`MemoryPressure=True`	Node memory pressure
`DiskPressure=True`	Node disk pressure
`PIDPressure=True`	Too many processes
`NetworkUnavailable=True`	Network not configured/available

16.2 Node Event Patterns

Event	Possible Meaning
Evicted Pods	Node pressure
FailedMount	CSI or volume path issue
ContainerGCFailed	Runtime garbage collection issue
ImageGCFailed	Disk pressure/image cleanup issue
NodeNotReady	kubelet/node/network problem

16.3 Node Remediation Options

Action	Use When	Risk
Cordon	Stop new Pods from scheduling	Existing Pods remain
Drain	Evict workloads for maintenance	May violate availability without PDB/capacity
Reboot	Node-level stuck condition	Disruptive
Replace node	Cloud/node image corruption	Requires capacity and automation
Scale node pool	Capacity pressure	Cost, placement shifts

Commands:

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node>

Do not drain blindly during a capacity incident. Draining reduces available capacity before the cluster is healthy.

17. Debugging Resource Pressure

Resource issues are not always visible as crashes.

They can appear as:

latency spikes
timeouts
slow startup
probe failures
OOM kills
evictions
noisy-neighbor behavior
autoscaling lag

17.1 CPU Throttling

A container can be throttled at its CPU limit without obviously failing.

Symptoms:

Latency increases.
Readiness probes time out.
Thread pools saturate.
Garbage collection takes longer.
Autoscaler sees average CPU too late.

Check usage (kubectl top reports consumption, not throttling):

kubectl top pod -n <namespace>
kubectl top pod -n <namespace> --containers

For deeper analysis, use Prometheus/container runtime metrics if available.

17.2 Memory Pressure

Memory failures are more abrupt.

Evidence:

kubectl describe pod <pod> -n <namespace>
# Look for OOMKilled

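Under node memory pressure, the kubelet may evict whole Pods, typically starting with BestEffort and Burstable Pods that use more than they requested, before individual containers are OOM-killed.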

17.3 Resource Debugging Questions

Are requests realistic?
Are limits too tight?
Did traffic increase?
Did deployment change runtime memory behavior?
Did autoscaling react?
Did node pressure affect unrelated workloads?
Is this workload in the right QoS class?
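
For the last question, the QoS class follows mechanically from the requests and limits of a Pod's containers. The sketch below applies the documented rules to a simplified container list; the function name `qos_class` and the input shape are hypothetical, and quantities are compared as literal strings, so "500m" and "0.5" would wrongly count as different.

```python
def qos_class(containers: list) -> str:
    """Classify a Pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit when a limit is present.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"  # no container sets any request or limit
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                  "limits": {"cpu": "500m", "memory": "256Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "100m"}}]))                   # Burstable
print(qos_class([{}]))                                              # BestEffort
```

To verify against a live Pod, compare the result with the status.qosClass field that Kubernetes sets on the Pod.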

18. Debugging Rollout Incidents

Many incidents start with a deployment.

18.1 Commands

kubectl rollout status deployment/<name> -n <namespace>
kubectl rollout history deployment/<name> -n <namespace>
kubectl describe deployment/<name> -n <namespace>
kubectl get rs -n <namespace> -l app.kubernetes.io/name=<name>
kubectl get pods -n <namespace> --show-labels

18.2 Rollout Failure Modes

Failure	Evidence
New Pods fail readiness	Deployment stuck, unavailable replicas
Image pull fails	New ReplicaSet Pods ImagePullBackOff
Bad config	New Pods CrashLoopBackOff
Capacity unavailable	New Pods Pending
`maxUnavailable` too high	Too much capacity removed
HPA conflict	Replica count changes during rollout
PDB too strict	Evictions blocked during concurrent node drains (PDBs do not gate the rolling update itself)
Dependency incompatible	New version errors under real traffic

18.3 Rollback Decision

Rollback is usually safer when:

The previous version is known good.
The failure is tied to a recent rollout.
The data/schema contract remains backward compatible.
The current version is causing user impact.
No irreversible migration has occurred.

Command:

kubectl rollout undo deployment/<name> -n <namespace>

Rollback is not magic.

It only changes workload version. It cannot undo data corruption, external side effects, or incompatible migrations.

19. Using Ephemeral Containers Safely

Ephemeral containers are useful when:

the app image has no shell or tools
kubectl exec is impossible
the target container crashed too quickly
you need network tools in the same Pod network namespace

Command:

kubectl debug -it pod/<pod> -n <namespace> \
  --image=nicolaka/netshoot \
  --target=<container> -- bash

19.1 What Ephemeral Containers Are Good For

Task	Example
Network testing	`curl`, `dig`, `nc`, `tcpdump` if permitted
File inspection	Check mounted config/secret paths
Process namespace inspection	If process namespace sharing is available
DNS testing	Validate resolver/search path
TLS testing	Verify certificate chain/SNI

19.2 What They Are Not For

Do not use ephemeral containers to:

hotfix production binaries
modify application state casually
bypass security controls
run long-lived admin processes
install untracked debugging agents

A debug container is evidence collection tooling, not a deployment mechanism.

20. Safe Production Debugging Rules

20.1 The Rule of Minimal Mutation

Prefer read-only actions first.

Order of safety:

Observe -> query -> describe -> logs -> metrics -> temporary debug Pod -> ephemeral container -> scale/rollback -> patch -> drain/delete

Mutating actions should have a hypothesis and rollback path.

20.2 Dangerous Commands

Command	Why Dangerous
`kubectl delete pvc`	Potential data loss
`kubectl delete namespace`	Massive destructive scope
`kubectl delete pod`	Can worsen capacity or hide evidence
`kubectl drain`	Can reduce capacity and trigger cascading failure
`kubectl scale --replicas=0`	Full outage if wrong workload
`kubectl apply -f random.yaml`	Unknown drift and ownership
`kubectl edit`	Unreviewed, hard-to-audit mutation

20.3 Evidence Preservation

Before destructive action, capture:

kubectl get all -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by=.lastTimestamp
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --all-containers --tail=500
kubectl get pod <pod> -n <namespace> -o yaml

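For severe incidents, export objects (drop gateway and httproute from the list if the Gateway API CRDs are not installed, otherwise the command fails):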

kubectl get deploy,rs,pod,svc,endpointslice,ingress,gateway,httproute,pvc -n <namespace> -o yaml > incident-state.yaml

21. Debugging Playbooks by Symptom

21.1 `Pod Pending`

1. describe Pod
2. inspect FailedScheduling events
3. check resource requests vs node allocatable
4. check taints/tolerations
5. check node selector/affinity/topology spread
6. check PVC binding
7. check cluster autoscaler events if available
8. fix constraints or capacity

21.2 `CrashLoopBackOff`

1. get previous logs
2. inspect lastState.terminated
3. check exit code and reason
4. check config/secret mounts/env
5. check liveness/startup probes
6. check OOMKilled/resource metrics
7. compare with previous ReplicaSet version
8. rollback or patch

21.3 `Service Returns 503`

1. identify ingress/gateway/backend service
2. inspect route conditions
3. inspect Service selector
4. inspect EndpointSlices
5. inspect Pod readiness
6. test from inside cluster
7. inspect NetworkPolicy/mesh policy
8. inspect ingress/gateway controller logs

21.4 `DNS Failure`

1. test FQDN from debug Pod
2. test kubernetes.default
3. check CoreDNS Pods and logs
4. check NetworkPolicy egress to DNS
5. check node resolver behavior
6. check service existence and namespace
7. inspect DNS search path assumptions

21.5 `PVC Mount Failure`

1. inspect Pod events
2. inspect PVC/PV status
3. inspect StorageClass
4. inspect CSI controller/node logs
5. check access mode and multi-attach
6. check topology/zone
7. check fsGroup/permissions
8. do not delete PVC without backup/reclaim review

21.6 `Latency Spike`

1. check user-facing latency and error SLIs
2. correlate deployment timeline
3. check CPU/memory/network/storage metrics
4. check throttling and GC behavior
5. check dependency latency
6. check HPA state and scaling lag
7. inspect node pressure and noisy neighbors
8. reduce load, rollback, scale, or degrade gracefully

22. Incident Timeline Template

During debugging, write a timeline.

Incident: <name>
Namespace: <namespace>
Workload: <workload>
Started: <timestamp>
Detected by: <alert/user/report>
User impact: <errors/latency/unavailable/data risk>

Timeline:
- HH:MM: Alert fired: <signal>
- HH:MM: Recent deployment detected: <version/digest>
- HH:MM: Pods in new ReplicaSet show CrashLoopBackOff
- HH:MM: Previous logs show missing env var PAYMENT_DB_URL
- HH:MM: Rolled back Deployment to revision 18
- HH:MM: Error rate returned to baseline

Evidence:
- kubectl rollout history: <summary>
- pod events: <summary>
- logs: <summary>
- metrics: <summary>

Root cause hypothesis:
- <hypothesis>

Confirmed root cause:
- <after validation>

Preventive actions:
- <tests/policy/alerts/runbook/platform guardrail>

The timeline turns debugging into organizational memory.

23. Anti-Patterns

23.1 Restart-Driven Debugging

It failed.
Restart it.
It works.
Done.

This hides root cause and destroys evidence.

Restarts are acceptable as mitigation, but not as analysis.

23.2 Debugging Only the Pod

Many incidents are outside the Pod:

bad Service selector
missing EndpointSlice
NetworkPolicy deny
CoreDNS overload
node pressure
CSI mount problem
Gateway route not accepted
admission mutation

The Pod is only one object in the graph.

23.3 Treating Readiness as Health of the Entire Universe

Readiness is a traffic eligibility signal.

If readiness checks every dependency strictly, a dependency blip can remove all application capacity even when partial service could continue.

23.4 Deleting Stateful Objects During Pressure

Deleting Pods may be fine.

Deleting PVCs, PVs, and namespaces is fundamentally different.

Storage actions require data ownership, backup awareness, and reclaim policy understanding.

23.5 Ignoring Recent Change

Most incidents have a trigger:

deployment
config change
secret rotation
node upgrade
policy change
autoscaler event
certificate renewal
dependency deploy
traffic spike

Always build the change timeline.

24. Production Debugging Rubric

Level	Behavior
Beginner	Reads Pod status and restarts Pods
Intermediate	Uses describe/logs/events and can fix common workload failures
Senior	Maps Service/EndpointSlice/Node/DNS/Storage/NetworkPolicy relationships
Staff	Builds incident hypotheses, minimizes mutation, protects evidence, manages blast radius
Principal	Improves platform guardrails so whole classes of incidents cannot recur

The goal is not to be the person who knows the most commands.

The goal is to be the person who finds the safest path from uncertainty to recovery.

25. Practice Labs

Lab 1 — Pending Pod

Create a Pod with an impossible nodeSelector.

Expected skill:

Use describe/events to identify scheduler constraint failure.
Fix the selector or labels.
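
A possible setup for this lab, as a sketch: the Pod name `lab1-pending`, the label key `lab.example.com/no-such-label`, and the function names are all made up for the exercise. Run it only in a scratch namespace.

```shell
# Lab 1 sketch: a Pod whose nodeSelector matches no node.
lab1_create() {
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: lab1-pending
spec:
  nodeSelector:
    lab.example.com/no-such-label: "true"   # hypothetical label, present on no node
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
EOF
}

# Read the scheduler's explanation from the Pod description and its events.
lab1_inspect() {
  kubectl describe pod lab1-pending
  kubectl get events --field-selector involvedObject.name=lab1-pending
}
```

The events should include a FailedScheduling entry from the scheduler; the exact message text varies by Kubernetes version.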

Lab 2 — CrashLoopBackOff

Deploy an app with a missing required env var.

Expected skill:

Use previous logs and container lastState to identify app startup failure.
Fix the ConfigMap/Secret and roll out the change.
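
One way to stage this lab, as a sketch: the Deployment name `lab2-crashloop`, the variable `REQUIRED_SETTING`, and the function names are hypothetical. The container simply exits with an error when the variable is unset.

```shell
# Lab 2 sketch: a container that exits when a required env var is missing.
lab2_create() {
  kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab2-crashloop
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lab2-crashloop
  template:
    metadata:
      labels:
        app: lab2-crashloop
    spec:
      containers:
      - name: app
        image: busybox:1.36
        command: ["sh", "-c"]
        args:
        - |
          if [ -z "$REQUIRED_SETTING" ]; then
            echo "REQUIRED_SETTING is not set" >&2
            exit 1
          fi
          sleep 3600
EOF
}

# Previous-instance logs plus the last recorded exit code per Pod.
lab2_inspect() {
  kubectl logs -l app=lab2-crashloop --previous
  kubectl get pods -l app=lab2-crashloop \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{end}'
}
```

The `--previous` flag matters here because the current container instance may not have logged anything yet when you look.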

Lab 3 — Empty EndpointSlice

Create a Service with a typo in its selector.

Expected skill:

Show that Pods are ready but not selected.
Fix selector/labels.
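
A sketch of the setup, with hypothetical names throughout (`lab3-web`, the deliberate typo `lab3-wbe`, and the function names). The Pods carry one label and the Service selects another.

```shell
# Lab 3 sketch: Pods labelled app=lab3-web, Service selecting app=lab3-wbe.
lab3_create() {
  kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab3-web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: lab3-web
  template:
    metadata:
      labels:
        app: lab3-web
    spec:
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: lab3-web
spec:
  selector:
    app: lab3-wbe   # deliberate typo
  ports:
  - port: 80
    targetPort: 80
EOF
}

# Compare what the Pods are labelled with against what the Service selects.
lab3_inspect() {
  kubectl get pods -l app=lab3-web --show-labels
  kubectl get service lab3-web -o jsonpath='{.spec.selector}'
  kubectl get endpointslices -l kubernetes.io/service-name=lab3-web
}
```

Ready Pods alongside a Service with no endpoints point at the selector rather than at the network.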

Lab 4 — DNS Blocked by NetworkPolicy

Apply a default-deny egress NetworkPolicy without a DNS allow rule.

Expected skill:

Test DNS from debug Pod.
Add explicit DNS egress rule.
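
A sketch of the deny policy and the DNS test, assuming a scratch namespace and a CNI plugin that actually enforces NetworkPolicy (without enforcement, the lookup will still succeed). The policy name, Pod name, and function names are hypothetical.

```shell
# Lab 4 sketch: deny all egress in a namespace, then test DNS from a debug Pod.
# Usage: lab4_deny_egress <namespace>
lab4_deny_egress() {
  kubectl apply -n "$1" -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: lab4-default-deny-egress
spec:
  podSelector: {}      # selects every Pod in the namespace
  policyTypes:
  - Egress             # no egress rules listed, so all egress is denied
EOF
}

# Usage: lab4_test_dns <namespace>
lab4_test_dns() {
  kubectl run lab4-dns-test -n "$1" --rm -i --restart=Never \
    --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
}
```

With the policy in place, the lookup should time out rather than return NXDOMAIN, which is how a blocked path to the DNS server differs from a missing record.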

Lab 5 — PVC Permission Error

Run an app as non-root that writes to a mounted volume without the correct permissions.

Expected skill:

Identify permission denied from logs.
Fix securityContext/fsGroup or storage ownership.
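
A sketch of the setup, assuming the cluster has a default StorageClass. Whether the write actually fails depends on how the storage driver sets ownership on new volumes, so treat this as a starting point; all names are hypothetical.

```shell
# Lab 5 sketch: a non-root Pod writing to a freshly provisioned volume.
lab5_create() {
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lab5-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: lab5-writer
spec:
  restartPolicy: Never
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    # fsGroup: 1000   # one possible fix, if the driver supports fsGroup
  containers:
  - name: writer
    image: busybox:1.36
    command: ["sh", "-c", "touch /data/probe && echo write-ok"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: lab5-data
EOF
}

# The container log shows either write-ok or a permission error from touch.
lab5_inspect() {
  kubectl logs lab5-writer
  kubectl get pod lab5-writer -o jsonpath='{.spec.securityContext}'
}
```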

Lab 6 — Rollout-Induced 503

Deploy a version with a failing readiness probe.

Expected skill:

Trace Ingress -> Service -> EndpointSlice -> Pod readiness.
Roll back safely.
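
A sketch of the break, trace, and rollback steps. It assumes an existing Deployment and Service both named `lab6-web`, Pods labelled `app=lab6-web`, and a container named `web` listening on port 80; those names and the function names are hypothetical. Whether clients actually see 503s depends on the rollout strategy and replica count.

```shell
# Lab 6 sketch: introduce a readiness probe that can never succeed.
lab6_break() {
  kubectl patch deployment lab6-web -p \
    '{"spec":{"template":{"spec":{"containers":[{"name":"web","readinessProbe":{"httpGet":{"path":"/does-not-exist","port":80}}}]}}}}'
}

# Follow the path from rollout state to endpoints to Pod readiness.
lab6_trace() {
  kubectl rollout status deployment/lab6-web --timeout=60s
  kubectl get endpointslices -l kubernetes.io/service-name=lab6-web
  kubectl get pods -l app=lab6-web -o wide
}

# Return to the previous revision and wait for it to settle.
lab6_rollback() {
  kubectl rollout undo deployment/lab6-web
  kubectl rollout status deployment/lab6-web --timeout=120s
}
```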

26. Field Checklist

Before applying a fix, answer:

What is the user impact?
What changed recently?
Which object graph is affected?
What evidence supports the hypothesis?
Is this a workload, platform, network, storage, or dependency failure?
What is the safest reversible action?
What evidence should be preserved first?
How will recovery be verified?
What guardrail prevents recurrence?

If you cannot answer these, you are not ready for a destructive action.

27. Key Takeaways

Kubernetes debugging is object-graph debugging.
Symptoms are not root causes.
describe, events, conditions, status, logs, and metrics must be read together.
Pending means the Pod is stuck before its containers run: check scheduling, image pull, and volume setup.
CrashLoopBackOff means repeated container exit, not the root cause.
Running does not mean ready for traffic.
Service failures often come from selectors, readiness, EndpointSlices, policy, or routes.
DNS and NetworkPolicy failures often masquerade as application failures.
Storage debugging requires extra caution because data loss is possible.
Production debugging should minimize mutation and preserve evidence.

28. References

Kubernetes Documentation — Debug Running Pods: https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/
Kubernetes Documentation — Debug Pods: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/
Kubernetes Documentation — Debug Services: https://kubernetes.io/docs/tasks/debug/debug-application/debug-service/
Kubernetes Documentation — Debugging DNS Resolution: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
Kubernetes Documentation — Ephemeral Containers: https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/
Kubernetes Documentation — kubectl debug: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_debug/
Kubernetes Documentation — Pod Lifecycle: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
Kubernetes Documentation — Services: https://kubernetes.io/docs/concepts/services-networking/service/
Kubernetes Documentation — Network Policies: https://kubernetes.io/docs/concepts/services-networking/network-policies/
Kubernetes Documentation — Persistent Volumes: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
