
Failure Models, Chaos Testing, and Debugging Playbooks

Learn Kubernetes Networking, Gateway API, Service Mesh, and Multi-Cluster Traffic Engineering - Part 033

Production-grade failure models, chaos testing, incident triage, packet-path debugging, Gateway API status analysis, mesh failure isolation, DNS/CNI/policy diagnosis, and reusable Kubernetes networking debugging playbooks.


Part 033 — Failure Models, Chaos Testing, and Debugging Playbooks

1. Goal of This Part

Parts 001 through 032 built the Kubernetes networking stack from the bottom up:

  • Linux networking and CNI.
  • Pod networking and Service VIP.
  • DNS and EndpointSlice.
  • Ingress and Gateway API.
  • TLS, ReferenceGrant, policy attachment, and controller portability.
  • North-south and east-west traffic.
  • Service mesh, mTLS, SPIFFE, traffic shaping, resilience, observability, NetworkPolicy, egress, and multi-cluster.

This part turns all of that understanding into a failure operating model.

Target for this part:

You can debug Kubernetes networking incidents systematically, separate symptoms from failure domains, prove hypotheses with evidence, run safe chaos experiments, and produce playbooks that platform, SRE, security, and application engineering teams can reuse.

What we want to avoid:

Incident occurs
  ↓
Everyone runs random kubectl commands
  ↓
Someone restarts CoreDNS
  ↓
Someone edits the Gateway
  ↓
Someone re-rolls the deployment
  ↓
The symptom changes
  ↓
The root cause is never identified
  ↓
The same incident returns next month

A top-tier engineer does not only ask:

“Is the YAML correct?”

They ask:

“In which failure domain does traffic stop, what evidence proves it, what is the blast radius, what is the safest mitigation, and which invariant must we add so this failure does not silently return?”


2. Source Anchors

This material draws on the following primary references:

  • Kubernetes Debug Services — https://kubernetes.io/docs/tasks/debug/debug-application/debug-service/
  • Kubernetes Debug Pods — https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/
  • Kubernetes Services, Load Balancing, and Networking — https://kubernetes.io/docs/concepts/services-networking/
  • Kubernetes Service — https://kubernetes.io/docs/concepts/services-networking/service/
  • Kubernetes DNS for Services and Pods — https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
  • Kubernetes EndpointSlices — https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/
  • Kubernetes Network Policies — https://kubernetes.io/docs/concepts/services-networking/network-policies/
  • Gateway API Troubleshooting and Status — https://gateway-api.sigs.k8s.io/docs/concepts/troubleshooting/
  • Gateway API API Reference — https://gateway-api.sigs.k8s.io/reference/api-spec/main/spec/
  • Istio Operations and Troubleshooting — https://istio.io/latest/docs/ops/diagnostic-tools/
  • Envoy Admin Interface — https://www.envoyproxy.io/docs/envoy/latest/operations/admin
  • Cilium Troubleshooting — https://docs.cilium.io/en/stable/operations/troubleshooting/
  • Chaos Mesh Overview — https://chaos-mesh.org/docs/
  • AWS Well-Architected Reliability Pillar — https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html

Key facts that serve as anchors:

  • Kubernetes provides an official task for debugging Services, including running commands from a Pod to see what the workload sees.
  • Gateway API emphasizes that troubleshooting should start from the status.conditions of Gateway API objects.
  • NetworkPolicy is only effective if the network plugin supports enforcement.
  • Service, DNS, EndpointSlice, and policy are separate objects; a valid object does not guarantee a healthy packet path.
  • Chaos engineering must validate a reliability hypothesis, not just “kill something” to look sophisticated.

3. Kaufman Framing: Debugging Is a Skill, Not a Heroic Mood

In Kaufman's framework, a complex skill must be deconstructed into sub-skills that can be practiced.

For Kubernetes networking, debugging can be broken down into six sub-skills:

| Sub-skill | Key Question | Output |
|---|---|---|
| Symptom framing | What does the user or system see? | Incident statement |
| Failure domain isolation | Which layer is broken? | Narrowed fault domain |
| Evidence collection | What evidence supports or refutes the hypothesis? | Evidence bundle |
| Safe mitigation | What change reduces impact without widening the blast radius? | Mitigation action |
| Root cause analysis | Why did the system allow the failure to happen? | Causal chain |
| Learning loop | Which guardrail prevents recurrence? | Test, policy, alert, runbook |

The goal of 20 hours of deliberate practice is not to memorize every command. The goal is to build a debugging reflex:

Symptom → Hypothesis → Evidence → Narrowing → Mitigation → Root cause → Guardrail

4. Core Mental Model: Packet Path + Control Path + Intent Path

Do not debug Kubernetes networking as one monolithic system. Split it into three paths.

4.1 Intent Path

The intent path is the declaration you write:

  • Service
  • EndpointSlice
  • Ingress
  • Gateway
  • HTTPRoute
  • GRPCRoute
  • TCPRoute
  • NetworkPolicy
  • AuthorizationPolicy
  • VirtualService
  • DestinationRule
  • ServiceExport
  • ServiceImport

The intent path answers:

“What do we want?”

4.2 Control Path

The control path is the process by which controllers read intent and turn it into runtime configuration:

  • kube-controller-manager creates EndpointSlices.
  • kube-proxy reads Services and EndpointSlices.
  • CoreDNS reads Service and endpoint data.
  • The Gateway controller reads Gateways and Routes.
  • The mesh control plane generates proxy config.
  • The CNI agent programs routes, policy, or eBPF maps.
  • The cloud controller manager provisions load balancers.

The control path answers:

“Has the intent been translated into state the dataplane can use?”

4.3 Packet Path

The packet path is reality:

  • DNS query.
  • TCP handshake.
  • TLS handshake.
  • HTTP/2 stream.
  • Proxy forwarding.
  • CNI routing.
  • Node NAT.
  • Conntrack.
  • Endpoint selection.
  • Application response.

The packet path answers:

“Did the packet actually arrive, get accepted, get processed, and get answered?”

4.4 Debugging Rule

Do not trust intent until the control path proves the intent was accepted.
Do not trust the control path until the packet path proves traffic flows.
Do not trust packet success until user-visible semantic success is proven.



5. Failure Taxonomy

Top-tier debugging starts from a taxonomy, not from a command.

5.1 Failure by Layer

| Layer | Example Symptom | Common Root Cause |
|---|---|---|
| DNS | could not resolve host | CoreDNS overload, wrong namespace, ndots, blocked DNS egress |
| Service selection | Service exists but no traffic | Selector mismatch, no ready endpoint |
| Endpoint lifecycle | Intermittent 503 during deploy | readiness too early, termination drain wrong |
| Node dataplane | Only some nodes fail | kube-proxy/eBPF map stale, conntrack issue, CNI agent broken |
| CNI routing | cross-node Pod traffic fails | overlay issue, MTU mismatch, route propagation failure |
| NetworkPolicy | timeout after policy rollout | missing DNS/health/mesh exceptions |
| Gateway attachment | Route not served | Accepted=False, wrong listener, namespace not allowed |
| TLS | handshake failure | SNI mismatch, expired cert, wrong trust bundle |
| Mesh identity | 403/503 from proxy | mTLS mode mismatch, policy deny, stale xDS |
| Resilience policy | cascading failure | retry storm, timeout mismatch, circuit breaker misconfigured |
| Multi-cluster | only remote region fails | MCS import stale, east-west gateway down, trust domain mismatch |

5.2 Failure by Plane

| Plane | What Fails | Debug Evidence |
|---|---|---|
| API plane | Object invalid, rejected, conflict | kubectl describe, status conditions, events |
| Control plane | Controller cannot reconcile | controller logs, status, metrics, leader election |
| Data plane | Runtime cannot forward | packet capture, proxy config, eBPF maps, iptables, conntrack |
| Identity plane | Workload cannot prove identity | cert chain, SPIFFE ID, trust bundle, mTLS mode |
| Policy plane | Traffic denied or allowed incorrectly | policy objects, audit logs, proxy deny logs, flow logs |
| Discovery plane | Client resolves wrong destination | DNS result, service import, EndpointSlice, client cache |
| Observability plane | Failure invisible | missing labels, high cardinality, sampled trace gap |

5.3 Failure by Time Pattern

| Pattern | Meaning | Example |
|---|---|---|
| Constant | Always fails | wrong selector, missing route, invalid cert |
| Intermittent | Some requests fail | endpoint readiness, node-specific dataplane, conntrack |
| Periodic | Fails at interval | cert rotation, DNS TTL, controller resync, autoscaler cycle |
| Load-dependent | Fails under traffic | CoreDNS saturation, NAT port exhaustion, proxy memory |
| Deploy-correlated | Fails during rollout | readiness/termination, connection drain, canary route |
| Region-correlated | Fails by geography | GSLB, cross-region latency, local gateway health |

6. First 10 Minutes of a Networking Incident

In a real incident, the early minutes are very expensive. The goal of the first 10 minutes is not a perfect root cause but impact containment and failure domain narrowing.

6.1 Incident Statement Template

Since: <time>
Who is affected: <users/services/regions/namespaces>
Symptom: <timeout/5xx/403/DNS/TLS/latency>
Entry point: <public gateway/internal service/mesh/multi-cluster>
Blast radius: <single route/single namespace/single node/single region/global>
Current mitigation: <none/rollback/scale/traffic shift/disable route>
Unknowns: <what we still need to prove>

Example:

Since: 10:07 UTC
Who is affected: mobile-api users in ap-southeast-1
Symptom: 40% HTTP 503 from checkout-api
Entry point: public Gateway api-gw / HTTPRoute checkout-route
Blast radius: route-level, not all services
Current mitigation: canary weight reduced from 20% to 0%
Unknowns: whether 503 is Gateway backend failure, mesh mTLS failure, or app readiness issue

6.2 Initial Triage Questions

Ask in order:

  1. Is it name resolution, connection, TLS, HTTP, or application semantic failure?
  2. Is it one client, one namespace, one service, one node, one zone, one cluster, or global?
  3. Did anything change recently?
  4. Is the route/control object accepted and programmed?
  5. Are there ready endpoints?
  6. Does traffic fail from inside the cluster too?
  7. Does traffic fail when bypassing Gateway/mesh?
  8. Is policy blocking it?
  9. Are only new pods affected?
  10. Are only remote clusters/regions affected?

6.3 Do Not Start With These Actions

Avoid reflex actions unless impact requires emergency mitigation:

  • Restarting CoreDNS without evidence.
  • Restarting all app pods.
  • Deleting EndpointSlices.
  • Editing multiple Gateway/Route objects simultaneously.
  • Disabling NetworkPolicy globally.
  • Turning off mTLS globally.
  • Scaling every component randomly.
  • Flushing conntrack on all nodes without blast-radius analysis.

These actions destroy evidence and may increase blast radius.


7. Evidence Bundle Standard

For regulated systems and serious production environments, every incident should produce an evidence bundle.

7.1 Minimum Evidence

incident/
  00-summary.md
  01-timeline.md
  02-impact.md
  03-recent-changes.md
  04-kubernetes-objects/
  05-status-conditions/
  06-events/
  07-controller-logs/
  08-dataplane-evidence/
  09-packet-captures/
  10-metrics-screenshots-or-exports/
  11-traces/
  12-mitigation-actions.md
  13-root-cause.md
  14-follow-ups.md
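
A small helper can create this layout at the start of an incident so that evidence lands in a predictable place. The sketch below is one possible shape, not a standard tool: the function name `init_evidence_bundle` and the decision to pre-create empty Markdown files are assumptions made for illustration.

```shell
#!/bin/sh
# Create the evidence bundle skeleton under a root directory.
# The function name and the per-incident root path are hypothetical.
init_evidence_bundle() (
  root="$1"
  [ -n "$root" ] || { echo "usage: init_evidence_bundle <dir>" >&2; exit 2; }
  mkdir -p "$root" || exit 1

  # Markdown files from the layout above; existing files are left untouched.
  for f in 00-summary 01-timeline 02-impact 03-recent-changes \
           12-mitigation-actions 13-root-cause 14-follow-ups; do
    [ -e "$root/$f.md" ] || : > "$root/$f.md"
  done

  # Evidence directories from the layout above.
  for d in 04-kubernetes-objects 05-status-conditions 06-events \
           07-controller-logs 08-dataplane-evidence 09-packet-captures \
           10-metrics-screenshots-or-exports 11-traces; do
    mkdir -p "$root/$d" || exit 1
  done
)
```

A possible invocation is `init_evidence_bundle "incident/$(date -u +%Y%m%d-%H%M)"`, after which command output can be redirected into the matching subdirectory.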

7.2 Commands to Capture Baseline

kubectl get ns
kubectl get nodes -o wide
kubectl get pods -A -o wide
kubectl get svc -A -o wide
kubectl get endpointslices -A
kubectl get events -A --sort-by=.lastTimestamp

Gateway API:

kubectl get gatewayclass -A
kubectl get gateway -A
kubectl get httproute -A
kubectl get grpcroute -A
kubectl get tcproute -A
kubectl get tlsroute -A
kubectl get referencegrant -A

Describe the precise object:

kubectl -n <namespace> describe gateway <gateway>
kubectl -n <namespace> describe httproute <route>
kubectl -n <namespace> describe svc <service>
kubectl -n <namespace> describe endpointslice -l kubernetes.io/service-name=<service>

Mesh:

istioctl proxy-status
istioctl analyze -A
istioctl proxy-config clusters <pod> -n <namespace>
istioctl proxy-config routes <pod> -n <namespace>
istioctl proxy-config listeners <pod> -n <namespace>
istioctl proxy-config endpoints <pod> -n <namespace>

Cilium examples:

cilium status
cilium connectivity test
cilium service list
cilium endpoint list
hubble observe --namespace <namespace>

Node-level examples:

ip route
ip addr
ss -antp
conntrack -S
iptables-save | head
nft list ruleset | head
tcpdump -ni any host <ip>

8. Debugging Decision Tree

A practical decision tree:
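
One way to encode such a tree is as an ordered walk over the layers named in triage question 1 of section 6.2 (name resolution, connection, TLS, HTTP, application semantics). The sketch below is illustrative only: the function names, the `ok`/`fail` inputs, and the mapping from layer to playbook section are assumptions made for this example.

```shell
#!/bin/sh
# Walk the layers in order and report the first one that is not proven healthy.
# Arguments are "ok" or "fail" for: dns tcp tls http semantic.
# A missing argument counts as unproven, so the walk stops there.
first_failing_layer() (
  for layer in dns tcp tls http semantic; do
    result="$1"
    [ $# -gt 0 ] && shift
    if [ "$result" != "ok" ]; then
      echo "$layer"
      exit 0
    fi
  done
  echo "none"
)

# Map the failing layer to the playbook sections of this part (hypothetical mapping).
next_playbook() (
  case "$1" in
    dns)      echo "DNS Failure playbook (section 9)" ;;
    tcp)      echo "Service, NetworkPolicy, or CNI playbooks (sections 10, 13, 15)" ;;
    tls)      echo "TLS or mTLS playbook (section 12)" ;;
    http)     echo "Gateway API or mesh proxy playbooks (sections 11, 14)" ;;
    semantic) echo "application owners; the network path is proven" ;;
    *)        echo "no failing layer observed; re-check the symptom framing" ;;
  esac
)
```

For example, `next_playbook "$(first_failing_layer ok ok fail)"` points at the TLS playbook, because DNS and TCP were proven before the TLS check failed.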


9. Playbook: DNS Failure

9.1 Symptoms

  • no such host
  • Temporary failure in name resolution
  • high latency before every outbound request
  • intermittent timeout with low app CPU
  • only some namespaces affected
  • only Java/Go/Node clients affected due to caching/resolver behavior

9.2 Hypotheses

| Hypothesis | Evidence |
|---|---|
| CoreDNS unavailable | CoreDNS pods not ready, errors, high CPU |
| DNS egress blocked | NetworkPolicy denies UDP/TCP 53 to kube-dns |
| wrong namespace resolution | query resolves unexpected service |
| ndots amplification | many search-domain queries per request |
| NodeLocal DNSCache issue | affected only on nodes with cache failure |
| client cache stale | DNS correct from debug pod but app still wrong |
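
To make the ndots amplification hypothesis concrete, the sketch below prints the names a resolver would try for a given lookup. It assumes common glibc-style behavior (search domains are tried first when the name has fewer dots than `ndots`); other resolver libraries may order or parallelize queries differently. The function name `dns_candidates` is hypothetical, and the `ndots` and search values should be read from the pod's `/etc/resolv.conf`.

```shell
#!/bin/sh
# Print the query names a resolver would try, in order.
# Usage: dns_candidates <name> <ndots> <search-domain>...
dns_candidates() (
  name="$1"; ndots="$2"; shift 2

  # A trailing dot marks the name as fully qualified: no search list is used.
  case "$name" in
    *.) echo "$name"; exit 0 ;;
  esac

  # Count the dots in the name.
  only_dots="$(printf '%s' "$name" | tr -cd '.')"
  dots="${#only_dots}"

  # Enough dots: the name is tried as-is before the search list.
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."
  fi
  for domain in "$@"; do
    echo "$name.$domain."
  done
  # Too few dots: the name is tried as-is only after the search list.
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name."
  fi
)
```

With `ndots` set to 5 and three search domains, `dns_candidates api.vendor.com 5 payments.svc.cluster.local svc.cluster.local cluster.local` lists four names, with the real one last. That is why appending a trailing dot or using the FQDN for external calls reduces query volume.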

9.3 Commands

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs deploy/coredns --tail=200
kubectl -n kube-system describe svc kube-dns
kubectl -n kube-system get endpointslices -l kubernetes.io/service-name=kube-dns

Run from same namespace:

kubectl run -n <namespace> dns-debug \
  --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/agnhost:2.45 \
  -- /bin/sh

Inside pod:

cat /etc/resolv.conf
nslookup <service>
nslookup <service>.<namespace>.svc.cluster.local
time nslookup <service>.<namespace>.svc.cluster.local

9.4 Packet Capture

tcpdump -ni any port 53

Look for:

  • query sent but no response
  • repeated search domain attempts
  • TCP fallback
  • high DNS response time
  • SERVFAIL/NXDOMAIN

9.5 Common Fixes

| Root Cause | Safer Fix |
|---|---|
| NetworkPolicy blocks DNS | Add explicit DNS egress to kube-dns/CoreDNS |
| CoreDNS saturated | scale CoreDNS, tune cache, deploy NodeLocal DNSCache |
| bad service name | use FQDN or correct namespace |
| client cache stale | restart only affected clients or lower TTL design |
| ndots amplification | use FQDN for external calls, tune resolver when justified |

9.6 Anti-pattern

Restarting CoreDNS every time DNS feels slow.

This hides whether the problem is query volume, policy, node-level cache, or client resolver behavior.


10. Playbook: Service Exists but Traffic Fails

10.1 Symptoms

  • curl service:port times out
  • Service has ClusterIP but no response
  • Gateway returns 503 no healthy upstream
  • some pods can reach service, others cannot
  • only cross-node traffic fails

10.2 Object Chain

Service selector
  → Pod labels
  → EndpointSlice addresses
  → Endpoint readiness
  → kube-proxy/eBPF service table
  → CNI route
  → Pod listener

10.3 Commands

kubectl -n <ns> get svc <svc> -o yaml
kubectl -n <ns> get pods --show-labels
kubectl -n <ns> get endpointslices -l kubernetes.io/service-name=<svc> -o yaml
kubectl -n <ns> describe svc <svc>

Check app listening:

kubectl -n <ns> exec <pod> -- ss -lntp
kubectl -n <ns> exec <pod> -- curl -v localhost:<targetPort>

From another pod:

kubectl -n <ns> run curl-debug --rm -it --restart=Never --image=curlimages/curl -- sh
curl -v http://<svc>:<port>/health
curl -v http://<pod-ip>:<targetPort>/health

10.4 Interpretation Matrix

| ClusterIP | Pod IP | Meaning |
|---|---|---|
| fails | succeeds | Service/kube-proxy/eBPF/EndpointSlice issue |
| succeeds | fails | unusual; maybe wrong target pod or network path |
| fails | fails | app listener, NetworkPolicy, CNI, pod network issue |
| succeeds sometimes | succeeds sometimes | readiness, node-specific dataplane, rollout, conntrack |
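
The matrix above can be encoded so that the two curl results from section 10.3 map directly to a next step. This is a minimal sketch: the function name and the `ok`/`fail`/`flaky` vocabulary are assumptions for illustration, and the output strings simply restate the matrix.

```shell
#!/bin/sh
# Classify a Service problem from two observations made in the same debug pod.
# Usage: classify_service_path <cluster-ip-result> <pod-ip-result>
# Each result is "ok", "fail", or "flaky" (succeeds only sometimes).
classify_service_path() (
  cluster_ip="$1"; pod_ip="$2"
  case "$cluster_ip/$pod_ip" in
    flaky/*|*/flaky) echo "intermittent: readiness, node-specific dataplane, rollout, or conntrack" ;;
    fail/ok)         echo "service-layer: Service, kube-proxy/eBPF, or EndpointSlice" ;;
    ok/fail)         echo "unusual: check target pod identity and network path" ;;
    fail/fail)       echo "pod-layer: app listener, NetworkPolicy, CNI, or pod network" ;;
    ok/ok)           echo "healthy from this vantage point" ;;
    *)               echo "unknown input" ;;
  esac
)
```

Running the same pair of checks from pods on different nodes, and comparing the classifications, helps separate a Service-level problem from a node-specific one.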

10.5 EndpointSlice Fields to Inspect

endpoints:
  - addresses:
      - 10.42.1.27
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: worker-1
    zone: ap-southeast-1a

Interpretation:

| Field | Meaning |
|---|---|
| ready | should receive normal traffic |
| serving | still serving during termination if supported |
| terminating | endpoint is being removed |
| nodeName | useful for node-specific failure |
| zone | useful for topology-aware routing |
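
The three conditions are easier to reason about when collapsed into one state per endpoint. The sketch below is a simplified reading of the fields above; the state names (`ready`, `draining`, `terminating`, `not-ready`) are labels invented for this example, not Kubernetes terminology.

```shell
#!/bin/sh
# Collapse EndpointSlice conditions into a single state label.
# Usage: endpoint_traffic_state <ready> <serving> <terminating>
# Each argument is "true" or "false", as shown in the endpoint's conditions.
endpoint_traffic_state() (
  ready="$1"; serving="$2"; terminating="$3"
  if [ "$ready" = "true" ]; then
    echo "ready"          # should receive normal traffic
  elif [ "$terminating" = "true" ] && [ "$serving" = "true" ]; then
    echo "draining"       # being removed, but still able to serve
  elif [ "$terminating" = "true" ]; then
    echo "terminating"    # being removed and no longer serving
  else
    echo "not-ready"      # not terminating, but failing readiness
  fi
)
```

During a rollout with intermittent 503s, a burst of `terminating` endpoints with no overlapping `ready` ones is a hint that readiness or drain timing is the failure domain.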

11. Playbook: Gateway API Route Not Working

11.1 Rule Zero

Gateway API troubleshooting starts with status.

kubectl -n <ns> describe gateway <gateway>
kubectl -n <ns> describe httproute <route>

Look for:

  • Accepted
  • Programmed
  • ResolvedRefs
  • listener conditions
  • route parent status
  • attached route count
  • reason/message fields

11.2 Status Meaning

| Condition | What It Means |
|---|---|
| Accepted=True | Object accepted by controller for this parent/listener |
| Accepted=False | Object rejected; route may be ignored |
| Programmed=True | Implementation has applied config to dataplane |
| ResolvedRefs=False | BackendRef, Secret, or cross-namespace ref problem |
| Conflicted=True | Hostname/listener/route conflict |

11.3 Debugging Attachment

Check Gateway:

kubectl -n platform get gateway public-gw -o yaml

Check listener:

listeners:
  - name: https
    hostname: "api.example.com"
    port: 443
    protocol: HTTPS
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            shared-gateway-access: "true"

Check namespace label:

kubectl get ns <app-ns> --show-labels

Check HTTPRoute:

parentRefs:
  - name: public-gw
    namespace: platform
    sectionName: https
hostnames:
  - api.example.com
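
When the listener and the route both set hostnames, attachment also depends on whether the two intersect. The sketch below is a simplified check of that intersection, assuming the Gateway API rule that a `*.` prefix is a suffix match (so `*.example.com` covers `api.example.com` but not `example.com`) and that an empty hostname matches everything. The function name is hypothetical, and controller behavior should still be confirmed from the route's status.

```shell
#!/bin/sh
# Return 0 if a listener hostname and a route hostname intersect, 1 otherwise.
# Usage: hostname_matches <listener-hostname> <route-hostname>
# Pass an empty string for an unset hostname.
hostname_matches() (
  listener="$1"; route="$2"

  # An unset hostname on either side matches everything.
  [ -z "$listener" ] && exit 0
  [ -z "$route" ] && exit 0
  [ "$listener" = "$route" ] && exit 0

  # Wildcard listener: the route must end with ".<suffix>".
  case "$listener" in
    \*.*) suffix="${listener#\*}"
          case "$route" in *"$suffix") exit 0 ;; esac ;;
  esac

  # Wildcard route: the listener must end with ".<suffix>".
  case "$route" in
    \*.*) suffix="${route#\*}"
          case "$listener" in *"$suffix") exit 0 ;; esac ;;
  esac

  exit 1
)
```

For the listener and route shown above, `hostname_matches "api.example.com" "api.example.com"` succeeds; a route for `shop.example.com` on the same listener would not, and the route would not be served on that listener.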

11.4 Common Failure Patterns

| Symptom | Likely Cause |
|---|---|
| Route ignored | namespace not allowed by listener |
| ResolvedRefs=False | backend cross-namespace reference missing ReferenceGrant |
| 404 | host/path did not match route |
| 503 | route matched but no healthy backend |
| TLS cert error | listener certificate problem |
| works in one controller not another | non-portable extension or conformance gap |

11.5 Safe Mitigation

  • Reduce canary weight to 0.
  • Detach broken route from listener.
  • Move hostname to fallback route.
  • Roll back route object only.
  • Avoid changing GatewayClass during incident.
  • Avoid replacing Gateway controller unless the controller is proven to be the fault domain.

12. Playbook: TLS or mTLS Failure

12.1 Symptom Classes

| Symptom | Layer |
|---|---|
| certificate has expired | TLS certificate lifecycle |
| x509: certificate signed by unknown authority | trust bundle |
| no required SSL certificate was sent | mTLS client cert |
| tls: handshake failure | TLS version/cipher/SNI/mTLS mismatch |
| HTTP 503 after enabling STRICT mTLS | mesh identity/policy mismatch |
| works with curl -k only | validation disabled hides trust problem |

12.2 Debug External TLS

openssl s_client -connect api.example.com:443 -servername api.example.com -showcerts

Inspect:

  • certificate chain
  • SAN
  • expiration
  • issuer
  • SNI behavior
  • ALPN

12.3 Debug Gateway TLS Secret

kubectl -n <gateway-ns> get secret <tls-secret> -o yaml
kubectl -n <gateway-ns> describe gateway <gateway>

If Secret is in another namespace, check ReferenceGrant.

12.4 Debug Mesh mTLS

Istio examples:

# Older Istio releases offered `istioctl authn tls-check`; recent releases removed it.
istioctl x describe pod <pod> -n <ns>
istioctl proxy-config secret <pod> -n <ns>
istioctl proxy-config cluster <pod> -n <ns> | grep tls

Check:

  • workload has sidecar/ambient enrollment
  • PeerAuthentication mode
  • DestinationRule TLS mode
  • AuthorizationPolicy
  • trust domain alias
  • certificate expiry

12.5 Common Root Cause

Plaintext client → STRICT mTLS workload

or:

Sidecar workload → non-mesh workload
DestinationRule assumes ISTIO_MUTUAL

13. Playbook: NetworkPolicy Regression

13.1 Symptoms

  • sudden timeout after policy rollout
  • DNS no longer works
  • health checks fail
  • mesh sidecar cannot talk to control plane
  • app can receive but cannot respond
  • only cross-namespace traffic fails

13.2 Isolation Model

NetworkPolicy is allow-list oriented once isolation applies. Multiple policies are additive.

Questions:

  1. Is the pod selected by any ingress policy?
  2. Is the pod selected by any egress policy?
  3. Are DNS paths allowed?
  4. Are health check paths allowed?
  5. Are mesh control plane paths allowed?
  6. Are node-local agents required?
  7. Are external dependencies allowed?
  8. Is the CNI enforcing Kubernetes NetworkPolicy or using extensions?

13.3 Commands

kubectl -n <ns> get networkpolicy
kubectl -n <ns> describe networkpolicy <policy>
kubectl -n <ns> get pod <pod> --show-labels
kubectl get ns --show-labels

Test from selected pod:

kubectl -n <ns> exec <pod> -- nslookup kubernetes.default.svc.cluster.local
kubectl -n <ns> exec <pod> -- curl -v http://<dependency>.<dep-ns>.svc.cluster.local:<port>/health

13.4 Minimum Default-Deny Exceptions

Usually you need explicit allows for:

  • DNS.
  • Metrics scraping.
  • Health checks.
  • Mesh control plane.
  • Identity/certificate agents.
  • Required intra-namespace calls.
  • Required cross-namespace calls.
  • Required egress gateway/proxy.

Example pattern:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Note: namespace label examples differ by cluster setup. Validate actual labels.


14. Playbook: Mesh Proxy Failure

14.1 Symptoms

  • app container healthy, but request fails
  • proxy returns 503/403/504
  • only mesh-enabled workloads affected
  • failure appears after mesh config change
  • curl pod-ip works from node but service-to-service fails

14.2 Proxy Failure Classes

| Class | Examples |
|---|---|
| Config distribution | stale xDS, control plane unavailable |
| Identity | certificate not issued, mTLS mismatch |
| Policy | AuthorizationPolicy deny, JWT validation fail |
| Routing | VirtualService/DestinationRule conflict |
| Resource | Envoy memory/cpu pressure |
| Interception | sidecar injection missed, iptables redirection broken |
| Protocol detection | port name wrong, HTTP treated as TCP |

14.3 Istio Debug Commands

istioctl analyze -A
istioctl proxy-status
istioctl proxy-config listeners <pod> -n <ns>
istioctl proxy-config routes <pod> -n <ns>
istioctl proxy-config clusters <pod> -n <ns>
istioctl proxy-config endpoints <pod> -n <ns>
istioctl proxy-config secret <pod> -n <ns>

Check sidecar injection:

kubectl -n <ns> get pod <pod> -o jsonpath='{.spec.containers[*].name}'
kubectl get ns <ns> --show-labels

Check proxy logs:

kubectl -n <ns> logs <pod> -c istio-proxy --tail=200

14.4 Envoy Admin Checks

If admin access is available:

curl localhost:15000/config_dump
curl localhost:15000/clusters
curl localhost:15000/listeners
curl localhost:15000/stats

14.5 Common Mistake

Seeing 503 and blaming the app.

In mesh, 503 can mean:

  • no healthy upstream
  • cluster not found
  • connection failure
  • mTLS failure
  • circuit breaker overflow
  • outlier ejection
  • upstream reset
  • route config issue

Always identify the response flag and proxy log context.


15. Playbook: CNI or Node Dataplane Failure

15.1 Symptoms

  • only pods on one node fail
  • pod-to-pod same node works, cross-node fails
  • ClusterIP fails but Pod IP works
  • new pods cannot get IP
  • packet loss under load
  • node restarts correlate with networking issue

15.2 Node-Specific Narrowing

kubectl get pods -A -o wide | grep <node>
kubectl get nodes -o wide
kubectl describe node <node>

Test same-node vs cross-node:

pod-a on node-1 → pod-b on node-1
pod-a on node-1 → pod-c on node-2
pod-d on node-2 → pod-c on node-2

Interpretation:

| Same Node | Cross Node | Meaning |
|---|---|---|
| works | fails | CNI routing/overlay/MTU/node route issue |
| fails | fails | local pod listener/policy/dataplane issue |
| works only on some nodes | mixed | node agent or route state issue |

15.3 Linux Checks

ip addr
ip route
ip rule
ss -s
conntrack -S
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
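
The two conntrack values are most useful as a ratio. The helper below is a small sketch that turns them into an integer percentage; the function name is hypothetical, and any alerting threshold is a local choice rather than a fixed rule.

```shell
#!/bin/sh
# Print conntrack table usage as an integer percentage (rounded down).
# Usage: conntrack_usage_percent <count> <max>
conntrack_usage_percent() (
  count="$1"; max="$2"
  [ "$max" -gt 0 ] || { echo "max must be greater than 0" >&2; exit 2; }
  echo $(( $count * 100 / $max ))
)
```

On a node this could be called as `conntrack_usage_percent "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"`. A value close to 100 on only the affected nodes supports a conntrack exhaustion hypothesis.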

Capture:

tcpdump -ni any host <pod-ip>
tcpdump -ni <cni-interface> host <pod-ip>

15.4 MTU Check

ping -M do -s 1472 <destination-ip>
ping -M do -s 1372 <destination-ip>

Adjust values based on network and encapsulation overhead.
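
The `-s` values come from simple arithmetic: the ICMP payload is the path MTU minus the 20-byte IPv4 header and the 8-byte ICMP header, so 1472 tests a 1500-byte path and 1372 tests a 1400-byte path. The sketch below computes the payload for a given MTU and an optional encapsulation overhead. The function name is hypothetical, the overhead value depends on the overlay in use, and IPv4 without options is assumed.

```shell
#!/bin/sh
# Compute the ICMP payload size to pass to `ping -M do -s`.
# Usage: icmp_payload_for_mtu <link-mtu> [encapsulation-overhead-bytes]
# Assumes IPv4 without options: 20-byte IP header + 8-byte ICMP header.
icmp_payload_for_mtu() (
  mtu="$1"; overhead="${2:-0}"
  echo $(( $mtu - $overhead - 20 - 8 ))
)
```

For example, `ping -M do -s "$(icmp_payload_for_mtu 1500 50)" <destination-ip>` probes a 1500-byte link carrying an overlay that is assumed to add 50 bytes per packet.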

15.5 eBPF/Cilium Checks

cilium status
cilium endpoint list
cilium service list
cilium bpf lb list
cilium monitor
hubble observe --from-pod <ns>/<pod>

15.6 Safe Mitigation

  • cordon affected node
  • drain only if safe for workload
  • restart CNI agent on affected node only
  • remove node from LB target group if needed
  • reduce traffic locality dependence
  • avoid cluster-wide CNI rollout during incident unless proven necessary

16. Playbook: External Load Balancer / NodePort Failure

16.1 Symptoms

  • public endpoint fails but internal Service works
  • only some client geographies affected
  • health check passes but request fails
  • source IP unexpectedly changed
  • only externalTrafficPolicy: Local services fail on some nodes

16.2 Debug Path

Client
  → DNS/GSLB/CDN
  → cloud/bare-metal load balancer
  → node target
  → NodePort / healthCheckNodePort
  → kube-proxy/eBPF
  → EndpointSlice
  → Pod

16.3 Commands

kubectl -n <ns> get svc <svc> -o yaml
kubectl -n <ns> describe svc <svc>
kubectl get nodes -o wide
kubectl -n <ns> get endpointslices -l kubernetes.io/service-name=<svc> -o wide

Check from outside:

curl -v https://api.example.com/health
curl -v --resolve api.example.com:443:<lb-ip> https://api.example.com/health

16.4 Key Fields

| Field | Risk |
|---|---|
| externalTrafficPolicy: Local | preserves source IP but only nodes with local endpoints should receive traffic |
| healthCheckNodePort | LB health may not equal app health |
| loadBalancerSourceRanges | may block clients or health checkers |
| sessionAffinity | can distort load distribution |
| internalTrafficPolicy | can create node-local blackhole if no local endpoint |

17. Playbook: Egress Failure

17.1 Symptoms

  • app cannot call external API
  • only external calls fail, internal calls work
  • external firewall sees unexpected source IP
  • NAT gateway port exhaustion
  • TLS works from debug pod but not app
  • proxy returns 407/502/504

17.2 Debug Path

Pod
  → NetworkPolicy/CNI egress policy
  → mesh egress policy / ServiceEntry
  → egress gateway or proxy
  → node SNAT
  → NAT gateway/firewall
  → external DNS/TLS endpoint

17.3 Commands

kubectl -n <ns> exec <pod> -- nslookup api.vendor.com
kubectl -n <ns> exec <pod> -- curl -v https://api.vendor.com/health
kubectl -n <ns> exec <pod> -- curl -v --connect-timeout 3 https://<ip>

Check source IP:

kubectl -n <ns> exec <pod> -- curl -s https://ifconfig.me

17.4 Interpret Failure Type

| Failure | Likely Domain |
|---|---|
| DNS fails | DNS policy/CoreDNS/external resolver |
| TCP timeout | NetworkPolicy/NAT/firewall/routing |
| TLS unknown authority | trust bundle/corporate proxy |
| HTTP 407 | proxy auth |
| HTTP 403 from vendor | source IP allowlist/API auth |
| intermittent timeout under load | NAT port exhaustion/proxy saturation |
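
When the checks in 17.3 are scripted, curl's exit code already separates several of these failure types. The sketch below maps a few well-known curl exit codes (6, 7, 28, 35, 60) onto the domains in the table. The function name and the wording of each label are assumptions for illustration, and codes not listed here should be looked up in the curl manual.

```shell
#!/bin/sh
# Map a curl exit code to the likely failure domain from the table above.
# Usage: curl_exit_to_domain <exit-code>
curl_exit_to_domain() (
  case "$1" in
    0)  echo "transport-ok: inspect HTTP status and response body" ;;
    6)  echo "dns: DNS policy, CoreDNS, or external resolver" ;;
    7)  echo "connect: NetworkPolicy, firewall, or routing" ;;
    28) echo "timeout: NetworkPolicy, NAT, firewall, or routing" ;;
    35) echo "tls-handshake: TLS version, cipher, SNI, or intercepting proxy" ;;
    60) echo "tls-trust: trust bundle or corporate proxy" ;;
    *)  echo "unclassified: see the curl manual for exit code $1" ;;
  esac
)
```

A typical use is to run the curl command from 17.3 inside the pod, capture `$?`, and record both the code and the classification in the evidence bundle.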

18. Playbook: Multi-Cluster Failure

18.1 Symptoms

  • local cluster works, remote cluster fails
  • service exists in one cluster but not imported elsewhere
  • failover does not happen
  • failover happens but worsens outage
  • cross-cluster mTLS fails
  • only one region sees stale backend endpoints

18.2 Debug Path

Cluster identity
  → namespace sameness
  → ServiceExport
  → ServiceImport
  → imported DNS
  → endpoint aggregation
  → global Gateway/GSLB/mesh route
  → east-west gateway
  → trust bundle
  → backend readiness

18.3 MCS Checks

kubectl -n <ns> get serviceexport
kubectl -n <ns> get serviceimport
kubectl -n <ns> describe serviceexport <svc>
kubectl -n <ns> describe serviceimport <svc>

DNS:

nslookup <svc>.<ns>.svc.clusterset.local

Endpoint evidence:

kubectl -n <ns> get endpointslices -l multicluster.kubernetes.io/service-name=<svc> -o yaml

Actual labels vary by implementation. Validate your provider.

18.4 Global Routing Checks

  • Which region did DNS/GSLB return?
  • Which Gateway received the request?
  • Which cluster handled the request?
  • Which backend version served the request?
  • Was failover manual or automatic?
  • Did control plane health differ from data plane health?
  • Are data dependencies also failed over?

18.5 Dangerous Pattern

Application traffic fails in region A.
Global traffic manager shifts all traffic to region B.
Region B does not have enough capacity or fresh data.
Incident becomes global.

Failover is a workload behavior, not only a networking behavior.


19. Chaos Engineering for Kubernetes Networking

Chaos engineering is not random destruction. It is controlled hypothesis testing.

19.1 Good Experiment Template

Title: <experiment name>
Hypothesis: <what should remain true under failure>
Scope: <namespace/service/route/node/cluster>
Blast radius: <expected maximum impact>
Steady state metric: <what proves system is healthy>
Fault: <what is injected>
Duration: <how long>
Abort condition: <when to stop>
Expected behavior: <fallback/degradation>
Evidence to capture: <metrics/logs/traces/events>
Rollback: <how to restore>
Owner: <team>

19.2 Bad Experiment

Let's kill random nodes in production and see what happens.

This is not engineering. This is gambling.

19.3 Networking Fault Types

| Fault | What It Tests |
|---|---|
| DNS latency | client resolver behavior, retry budget |
| DNS failure | fallback and caching |
| packet loss | TCP/gRPC resilience |
| network delay | timeout alignment |
| bandwidth limit | backpressure/load shedding |
| pod-to-service block | policy and dependency visibility |
| gateway unavailable | global routing/failover |
| endpoint termination | graceful drain |
| certificate expiry simulation | rotation alarms |
| control plane disconnect | proxy config staleness |

19.4 Chaos Experiment: DNS Latency

Hypothesis:

If DNS latency increases by 500 ms for checkout-api, request p95 should increase by less than 100 ms because clients reuse connections and do not resolve per request.

If p95 increases by 500ms, the client may be resolving too often.
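
A hypothesis like this is easier to hold to account when the pass condition is written down before the fault is injected. The sketch below compares a baseline p95 with the p95 measured during the experiment against a budget. The function name and the millisecond inputs are assumptions, and the measurements themselves must come from your own metrics system.

```shell
#!/bin/sh
# Evaluate a latency hypothesis: p95 may rise by less than the budget.
# Usage: evaluate_p95_delta <baseline-ms> <during-fault-ms> <budget-ms>
# Prints PASS or FAIL with the observed delta; returns 1 on FAIL.
evaluate_p95_delta() (
  baseline_ms="$1"; chaos_ms="$2"; budget_ms="$3"
  delta=$(( $chaos_ms - $baseline_ms ))
  if [ "$delta" -lt "$budget_ms" ]; then
    echo "PASS delta=${delta}ms"
  else
    echo "FAIL delta=${delta}ms"
    exit 1
  fi
)
```

With a 180 ms baseline, a 230 ms reading passes a 100 ms budget, while a 690 ms reading fails and suggests the client resolves on nearly every request.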

19.5 Chaos Experiment: Endpoint Drain

Hypothesis:

During rollout, terminating pods should stop receiving new requests within 10 seconds, while existing requests complete within 30 seconds.

Evidence:

  • EndpointSlice terminating state.
  • Gateway/backend access logs.
  • app graceful shutdown logs.
  • 5xx rate during rollout.

19.6 Chaos Experiment: Gateway Controller Down

Hypothesis:

If Gateway controller is unavailable for 10 minutes, already programmed routes keep serving traffic, but new route changes are not applied.

This separates control plane availability from data plane availability.

19.7 Chaos Experiment: Mesh Control Plane Down

Hypothesis:

Existing mesh traffic continues using cached proxy config, but certificate rotation and new service discovery may degrade after a defined interval.

Check:

  • xDS staleness.
  • cert expiration window.
  • new pod behavior.
  • telemetry impact.

20. Incident Debugging Anti-Patterns

20.1 YAML Staring

Reading YAML forever without packet path evidence.

Better:

YAML → status → runtime config → packet test → semantic test

20.2 Global Disablement

Turning off all policy/mTLS/routing because one service fails.

Better:

Narrow to route/service/namespace/source identity.
Apply minimal emergency exception with expiry.

20.3 Restart Therapy

Restarting components until symptom changes.

Better:

Capture state before restart.
Restart smallest proven failure domain.
Record exactly what changed.

20.4 Dashboard Anchoring

Believing green dashboards while users fail.

Better:

Use synthetic probes and user-path telemetry.
Dashboards are evidence, not truth.

20.5 Blaming the Last Change Without Proof

Recent change correlation is useful. It is not proof.

Better:

Recent change → hypothesis → reproducible evidence → mitigation.

21. Runbook Design

A runbook should encode decision-making, not only commands.

21.1 Runbook Sections

1. Scope
2. Symptoms
3. Severity mapping
4. Impact questions
5. Decision tree
6. Evidence commands
7. Safe mitigations
8. Dangerous mitigations
9. Escalation criteria
10. Recovery validation
11. Post-incident artifacts

21.2 Example: Gateway Route Runbook

# Gateway Route Failure Runbook

## Symptoms
- 404 from public Gateway
- 503 from public Gateway
- TLS handshake failure
- route accepted but not serving

## First checks
- describe Gateway
- describe HTTPRoute
- inspect status.conditions
- inspect listener attachedRoutes
- inspect backend Service and EndpointSlice

## Mitigation
- set canary backend weight to 0
- rollback last route change
- detach hostname from broken route
- shift DNS/GSLB if regional impact proven

## Do not
- replace GatewayClass during incident
- delete shared Gateway
- disable TLS validation globally
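The "inspect status.conditions" step in the runbook above can be automated so that on-call engineers get the same answer every time. This sketch assumes the HTTPRoute was fetched as JSON and parsed into a dict; it checks only the Accepted and ResolvedRefs condition types per parent, and the tuple shape it returns is an illustrative choice.

```python
def route_problems(http_route):
    """List (parent, condition_type, reason, message) for each unhealthy condition."""
    problems = []
    parents = (http_route.get("status") or {}).get("parents") or []
    if not parents:
        # No controller has written status: the route is not attached anywhere.
        problems.append(("<none>", "NoParentStatus", "", "no parent status reported"))
    for parent in parents:
        name = (parent.get("parentRef") or {}).get("name", "<unknown>")
        seen = {c.get("type"): c for c in parent.get("conditions") or []}
        for ctype in ("Accepted", "ResolvedRefs"):
            cond = seen.get(ctype)
            if cond is None:
                problems.append((name, ctype, "Missing", "condition not reported"))
            elif cond.get("status") != "True":
                problems.append(
                    (name, ctype, cond.get("reason", ""), cond.get("message", ""))
                )
    return problems
```

An empty list means the control plane accepted the route and resolved its references. It does not prove that traffic is served, so the backend Service and EndpointSlice checks still apply.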

22. Recovery Validation

A mitigation is not done when alerts become green. Validate multiple layers.

22.1 Validation Checklist

| Layer | Validation |
| --- | --- |
| DNS | correct name resolves from affected namespace/client |
| TCP | connection succeeds within expected time |
| TLS | cert valid, correct SNI, correct trust chain |
| HTTP/gRPC | status code correct, headers correct |
| App semantic | business operation succeeds |
| Observability | logs/metrics/traces show correct route/backend |
| Policy | no emergency allow remains open unintentionally |
| Rollout | bad version no longer receives traffic |
| Multi-cluster | traffic returns to intended locality/failover state |

22.2 Synthetic Probe Design

A good probe should include:

  • DNS name used by real clients.
  • TLS validation enabled.
  • production-like headers.
  • expected status code.
  • semantic body check.
  • route/backend label capture.
  • region/cluster identity capture.

Bad probe:

curl -k http://pod-ip:8080/health

That proves only that a process answers on the pod IP. It bypasses DNS, TLS, the Gateway, and route matching, so it does not prove the user path.
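For contrast, here is a sketch of a probe that follows the design list above, using only the Python standard library. The response header names x-backend and x-region are assumptions for illustration; substitute whatever your Gateway or mesh actually emits. TLS verification is left at the library default, which validates the certificate chain and hostname.

```python
import ssl
import urllib.error
import urllib.request


def fetch(url, headers=None, timeout=5.0):
    """GET the URL by its real DNS name with TLS validation enabled."""
    context = ssl.create_default_context()  # verifies chain and hostname
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
            body = resp.read().decode("utf-8", "replace")
            return resp.status, dict(resp.headers.items()), body
    except urllib.error.HTTPError as err:
        # Non-2xx responses are still probe results, not exceptions to hide.
        return err.code, dict(err.headers.items()), err.read().decode("utf-8", "replace")


def evaluate_probe(status, headers, body, expected_status=200, expected_text="",
                   capture=("x-backend", "x-region")):
    """Check status and body semantics, and capture route/region labels."""
    failures = []
    if status != expected_status:
        failures.append(f"status {status} != {expected_status}")
    if expected_text and expected_text not in body:
        failures.append("semantic body check failed")
    lowered = {k.lower(): v for k, v in headers.items()}
    labels = {name: lowered[name] for name in capture if name in lowered}
    return {"ok": not failures, "failures": failures, "labels": labels}
```

Keeping evaluation separate from fetching makes the pass/fail logic testable without a cluster, and the captured labels tell you which backend and region actually answered.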


23. Post-Incident Review Model

23.1 Causal Chain

Use a causal chain, not a single root cause.

Bad route change
  → controller accepted route
  → backend had no ready endpoints
  → Gateway returned 503
  → canary alert only watched app metrics, not Gateway 5xx
  → rollback required manual YAML edit
  → incident lasted 28 minutes

23.2 Better Remediation

| Weak Remediation | Strong Remediation |
| --- | --- |
| “Be more careful” | pre-merge route validation |
| “Watch dashboard” | alert on Gateway 5xx by route/backend |
| “Document rollback” | automated canary abort |
| “Ask platform team” | self-service status debugging guide |
| “Do training” | chaos experiment and regression test |

23.3 Regulatory Lens

For regulated systems, capture:

  • who changed what
  • when it changed
  • approval path
  • impact window
  • affected users/entities
  • evidence used for diagnosis
  • mitigation decision rationale
  • data integrity impact
  • residual risk
  • preventive controls

24. Practice Lab Sequence

Lab 1 — Selector Mismatch

Inject:

selector:
  app: wrong-name

Expected skill:

  • identify Service has no EndpointSlice addresses
  • distinguish app health from Service routing
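The skill in this lab reduces to comparing two label maps. A minimal sketch, assuming you have copied the Service's spec.selector and each pod's metadata.labels into plain dicts; the pod names are hypothetical.

```python
def selector_matches(selector, pod_labels):
    """True if every selector key/value is present in the pod's labels.

    An empty selector is treated as matching nothing, since a Service
    without a selector gets no automatically managed endpoints.
    """
    return bool(selector) and all(
        pod_labels.get(key) == value for key, value in selector.items()
    )


def matching_pods(selector, pods):
    """Names of pods (dict of name -> labels) that the selector would pick."""
    return sorted(
        name for name, labels in pods.items()
        if selector_matches(selector, labels)
    )
```

If matching_pods returns nothing while the pods themselves are Running and Ready, the application is healthy and the fault is in Service routing.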

Lab 2 — DNS Blocked by Policy

Inject default deny egress without DNS exception.

Expected skill:

  • prove DNS failure from same namespace
  • add minimal DNS egress rule
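One possible shape for the minimal DNS egress rule is sketched below as a Python dict that can be dumped to JSON and applied with kubectl. It assumes cluster DNS runs in kube-system with the conventional k8s-app: kube-dns pod label and listens on port 53; verify both in your cluster, because node-local DNS caches and some distributions differ. The policy name and the checkout namespace are hypothetical.

```python
import json


def dns_egress_policy(namespace, dns_namespace="kube-system", dns_pod_labels=None):
    """Build a NetworkPolicy allowing only DNS egress for all pods in a namespace."""
    # Assumption: conventional label for CoreDNS/kube-dns pods.
    labels = dns_pod_labels or {"k8s-app": "kube-dns"}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},  # applies to every pod in the namespace
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{
                    "namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": dns_namespace}
                    },
                    "podSelector": {"matchLabels": labels},
                }],
                "ports": [
                    {"protocol": "UDP", "port": 53},
                    {"protocol": "TCP", "port": 53},
                ],
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(dns_egress_policy("checkout"), indent=2))
```

Both UDP and TCP are allowed because large DNS responses fall back to TCP. Placing the namespaceSelector and podSelector in the same "to" entry requires both to match, which keeps the exception narrow.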

Lab 3 — Gateway Route Not Attached

Inject namespace without allowed label.

Expected skill:

  • inspect route parent status
  • fix namespace label or listener policy

Lab 4 — TLS SNI Mismatch

Inject wrong hostname/cert.

Expected skill:

  • use openssl s_client -servername
  • identify SNI/cert SAN mismatch
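Once openssl s_client has shown which certificate is served for a given SNI, the remaining question is whether any SAN covers the hostname. The sketch below is a simplified matcher for that comparison: it handles exact names and a single left-most wildcard label only, and it is not a substitute for a real TLS library's verification.

```python
def san_matches(hostname, san):
    """Simplified DNS SAN match: exact, or '*.' covering exactly one label."""
    host = hostname.lower().rstrip(".")
    pattern = san.lower().rstrip(".")
    if not pattern.startswith("*."):
        return host == pattern
    # A wildcard covers exactly one left-most label, never zero or several.
    head, _, rest = host.partition(".")
    return bool(head) and rest == pattern[2:]


def covering_sans(hostname, sans):
    """SAN entries from the served certificate that cover the hostname."""
    return [san for san in sans if san_matches(hostname, san)]
```

An empty result from covering_sans for the hostname the client sends is the SNI/SAN mismatch this lab asks you to identify.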

Lab 5 — mTLS Mode Mismatch

Enable STRICT mTLS for workload while caller is outside mesh.

Expected skill:

  • identify proxy-level failure
  • fix enrollment or policy, not app code

Lab 6 — Retry Storm

Configure retries without budget against slow backend.

Expected skill:

  • identify increased upstream request volume
  • tune timeout/retry/load shedding
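The increase in upstream request volume is multiplicative when several layers retry independently. As a worked sketch, assuming every attempt fails and each layer (client, Gateway, sidecar, and so on) retries up to its configured maximum, the worst-case number of attempts reaching the backend per user request is the product of (1 + retries) across layers.

```python
import math


def worst_case_amplification(retries_per_layer):
    """Backend attempts per user request if every attempt at every layer fails."""
    for retries in retries_per_layer:
        if retries < 0:
            raise ValueError("retry counts must be non-negative")
    return math.prod(1 + retries for retries in retries_per_layer)
```

Three layers with 3 retries each give 4 * 4 * 4 = 64 attempts against a backend that is already slow, which is why retry budgets and retrying at a single layer matter more than the exact retry count.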

Lab 7 — Cross-Node CNI Failure

Simulate node-specific network loss.

Expected skill:

  • compare same-node vs cross-node traffic
  • isolate node dataplane
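The same-node versus cross-node comparison can be summarized from a probe matrix. This sketch assumes you have run connectivity probes between pods on known nodes and recorded each as (source_node, destination_node, succeeded); how the probes are run is left to your tooling.

```python
from collections import defaultdict


def suspect_nodes(results):
    """Nodes for which every cross-node probe touching them failed.

    results: iterable of (src_node, dst_node, ok). Same-node probes are
    ignored here because they do not cross the node dataplane.
    """
    cross = defaultdict(list)
    for src, dst, ok in results:
        if src != dst:
            cross[src].append(ok)
            cross[dst].append(ok)
    return sorted(node for node, oks in cross.items() if oks and not any(oks))
```

A single suspect node with healthy same-node traffic points at that node's dataplane rather than at the application, Service, or policy.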

Lab 8 — Multi-Cluster Import Stale

Break service import propagation.

Expected skill:

  • inspect ServiceExport/ServiceImport
  • validate clusterset.local discovery
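Validating discovery starts with resolving the right name. A small sketch, assuming the Multi-Cluster Services naming convention of service.namespace.svc.clusterset.local; the service and namespace in the usage note are hypothetical, and resolves() only gives a meaningful answer when run from inside a pod in the importing cluster.

```python
import socket


def clusterset_name(service, namespace):
    """DNS name under which an imported multi-cluster Service is discovered."""
    return f"{service}.{namespace}.svc.clusterset.local"


def resolves(name, port=80):
    """True if the name currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(name, port)) > 0
    except socket.gaierror:
        return False
```

For example, resolves(clusterset_name("checkout-api", "payments")) returning False while the ServiceExport exists in the exporting cluster suggests the import has not propagated.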

25. Top 1% Debugging Checklist

Before changing anything, answer:

1. What is the exact symptom?
2. Who is affected?
3. What is the blast radius?
4. What changed recently?
5. What failure layer is most likely?
6. What evidence supports that?
7. What evidence disproves alternatives?
8. What is the smallest safe mitigation?
9. What evidence must be preserved?
10. What will prove recovery?

During incident:

- keep timeline
- isolate one variable at a time
- prefer read-only inspection before mutation
- mutate smallest proven scope
- verify from user path, not only pod path
- record exact commands and timestamps

After incident:

- produce causal chain
- add regression test
- add guardrail
- add alert if missing
- improve runbook
- remove emergency exceptions
- schedule game day

26. Mental Model Summary

Kubernetes networking debugging is not “knowing many commands”.

It is the discipline of mapping symptoms to layers:

User symptom
  → application semantic
  → HTTP/gRPC status
  → TLS/mTLS
  → route match
  → Gateway/mesh proxy
  → Service
  → EndpointSlice
  → Pod listener
  → CNI/node dataplane
  → DNS/discovery
  → policy/identity
  → external dependency

The most important invariant:

Every production networking system must be debuggable from intent, control-plane status, dataplane state, packet evidence, and user-visible semantics.

If you cannot produce evidence at each of those layers, the architecture is not mature yet.


27. Part 033 Completion Check

You have completed Part 033 if you can:

  • Write a clear incident statement.
  • Isolate the failure domain based on the symptom.
  • Debug DNS, Service, EndpointSlice, Gateway API, TLS, NetworkPolicy, mesh, CNI, egress, and multi-cluster failures.
  • Use Gateway API status.conditions as evidence, not decoration.
  • Separate the intent path, control path, and packet path.
  • Design a chaos experiment with a hypothesis, scope, abort condition, and evidence.
  • Write runbooks that encode decisions, not only commands.
  • Write a post-incident causal chain that produces guardrails.

The next part covers the Production Architecture Review and Decision Framework: how to evaluate CNI, Gateway controller, mesh, multi-cluster, policy, egress, observability, cost, ownership, migration, and regulatory defensibility within a single review framework.
