
title: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering - Part 015
description: Deep dive into Kubernetes Service discovery, Services, EndpointSlices, DNS, kube-proxy, service types, traffic semantics, and production networking failure modes.
series: learn-kubernetes-deployment-model
seriesTitle: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering
order: 15
partTitle: Service Discovery and Kubernetes Networking Model
tags:
  - kubernetes
  - deployment
  - networking
  - service-discovery
  - service
  - endpointslice
  - dns
  - platform-engineering
date: 2026-07-01

Part 015 — Service Discovery and Kubernetes Networking Model

Goal: understand how Kubernetes gives unstable Pods a stable network identity through Services, EndpointSlices, DNS, and data-plane programming, so you can design service-to-service communication that is resilient, debuggable, scalable, and safe under rollout.

A Pod is not a stable server.

A Pod is an ephemeral execution unit.

It can be replaced, rescheduled, restarted, evicted, or scaled away. Its IP can disappear. Its process can be healthy, unhealthy, terminating, or not yet ready. If other workloads connect directly to Pod IPs, they couple themselves to the least stable object in the workload model.

Kubernetes solves this with a service discovery layer.

At minimum, that layer provides:

a stable name;
a stable virtual endpoint;
backend selection by label;
readiness-aware endpoint membership;
load distribution across matching backends;
integration with DNS;
a decoupling point between clients and Pods.

The beginner thinks a Service is “a load balancer.”

The production engineer thinks a Service is a stable contract over a dynamic set of endpoints.

That distinction matters.

1. Kaufman Deconstruction

Service discovery can be decomposed into a small set of sub-skills.

Sub-skill	What You Must Be Able To Do
Service mental model	Explain why clients should depend on Services, not Pods.
Selector reasoning	Predict which Pods become backends for a Service.
EndpointSlice reasoning	Inspect the real backend set behind a Service.
DNS reasoning	Resolve service names across namespaces and understand DNS search paths.
Service type selection	Choose ClusterIP, Headless, NodePort, LoadBalancer, or ExternalName intentionally.
Data-plane reasoning	Understand what kube-proxy or an alternative data plane does with Service traffic.
Readiness routing	Explain why a Pod can exist but not receive Service traffic.
Debugging	Diagnose DNS failure, empty endpoints, blackholed traffic, cross-namespace mistakes, and port mismatch.
Governance	Define naming, labels, ports, and exposure rules for teams.

The highest-value skill is this:

Given a client request to a service name, trace the request from DNS name to Service to EndpointSlice to Pod to container port.

If you can do that, Kubernetes networking becomes much less mysterious.

2. The Core Problem: Stable Clients, Unstable Backends

Applications need stable dependencies.

But Kubernetes backends are unstable by design.

A Deployment may create new Pods during rollout. Old Pods terminate. New Pods become ready. The Service remains stable.

The stable part is not the Pod.

The stable part is the Service contract:

name + namespace + port + selector + traffic policy

This is why Service design is part of deployment design.

A rollout is not safe just because the new Pods start.

A rollout is safe only if:

new Pods become ready at the right time;
old Pods stop receiving traffic at the right time;
DNS names remain stable;
clients tolerate backend changes;
connections drain correctly;
traffic does not route to wrong versions;
endpoints reflect readiness accurately.

3. Kubernetes Networking Invariants

Before Services make sense, we need a few cluster networking invariants.

Kubernetes expects a cluster networking implementation where:

Pods receive IP addresses;
Pods can communicate with other Pods, subject to network policy and implementation details;
nodes can communicate with Pods;
Services provide stable access to a dynamic backend set;
DNS can provide stable names for Services and selected Pod records;
the actual packet forwarding is implemented by cluster components and networking plugins.

Kubernetes does not prescribe one universal network implementation.

A cluster may use Calico, Cilium, Flannel, Antrea, cloud-provider networking, or another CNI implementation. It may use kube-proxy with iptables or nftables, or an eBPF-based replacement. The API objects are portable; the packet path is implementation-specific.

The production implication:

Kubernetes networking should be understood at two layers: API contract and data-plane implementation.

Layer	Example	Portability
API contract	Service, EndpointSlice, DNS, NetworkPolicy, Gateway	Mostly portable, subject to feature support.
Data plane	iptables, nftables, eBPF, cloud load balancer, CNI routing	Implementation-specific.

Do not debug only the YAML.

Debug the object graph and the data plane.

4. Service Object Mental Model

A Service is an API object that defines a stable way to access a group of backend endpoints.

The common case is selector-based:

apiVersion: v1
kind: Service
metadata:
  name: orders-api
  namespace: commerce
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: orders-api
    app.kubernetes.io/component: api
  ports:
    - name: http
      port: 80
      targetPort: http
      protocol: TCP

This Service says:

create a stable internal service named orders-api;
in namespace commerce;
expose Service port 80;
route to selected backend Pods' named container port http;
select Pods with the matching labels.

This means the Service is not tied to a Deployment directly.

It is tied to Pods through labels.

That selector relationship is powerful, but dangerous.

If labels are wrong, traffic is wrong.

5. Service Is Not the Same as Deployment

A Deployment owns ReplicaSets.

ReplicaSets own Pods.

A Service selects Pods.

It does not own them.

This difference creates several production failure modes.

Failure	Cause	Symptom
Empty Service	Selector does not match Pods	DNS resolves, connection fails or times out.
Wrong backend	Selector too broad	Traffic goes to unrelated Pods.
Version leakage	Stable and canary Pods share selector accidentally	Clients see mixed versions unexpectedly.
No traffic after rollout	New Pods labels differ from Service selector	Deployment healthy, Service has no endpoints.
Old Pods still targeted	Old Pods still match and remain ready during drain	Requests hit terminating or incompatible backend.

A Service selector is a production routing rule.

Treat it with the same seriousness as an API gateway rule.

6. Anatomy of Service Ports

Service ports are a common source of confusion.

ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP

Field	Meaning
`port`	Port exposed by the Service. Clients connect to this.
`targetPort`	Port on the backend Pod/container. Traffic is forwarded here.
`name`	Logical port name. Useful for readability and named target ports.
`protocol`	Usually TCP or UDP.

The Service port and container port do not need to be the same.

A common production convention:

ports:
  - name: http
    port: 80
    targetPort: http

And in the container:

ports:
  - name: http
    containerPort: 8080

This makes the Service resilient to container port changes if the name remains stable.

However, named ports are not magic. If a selected Pod does not define a container port with that exact name, the target port cannot be resolved for that Pod, and the Pod is left out of the endpoints for that Service port even though it matches the selector.

7. Service Types

Kubernetes supports multiple Service types. Each type answers a different exposure question.

Type	Purpose	Typical Use
`ClusterIP`	Internal virtual IP reachable inside the cluster.	Service-to-service communication.
`Headless` (`type: ClusterIP` with `clusterIP: None`)	No cluster virtual IP; DNS returns backend records.	Stateful discovery, direct Pod addressing, client-side load balancing.
`NodePort`	Exposes a port on every node.	Low-level external exposure, often behind external LB.
`LoadBalancer`	Requests an external load balancer from cloud/provider integration.	Public/private external service exposure.
`ExternalName`	DNS CNAME-style alias to an external name.	Integrating external dependencies with Kubernetes naming.

The default should usually be ClusterIP.

Use more exposed types only when the boundary is intentional.

8. ClusterIP Service

A ClusterIP Service gives a stable virtual IP inside the cluster.

apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: payment-api
  ports:
    - name: http
      port: 80
      targetPort: http

Clients should normally call:

http://payment-api

or cross-namespace:

http://payment-api.payments.svc.cluster.local

The ClusterIP is not a process listening on a machine. It is a virtual service address programmed into the cluster data plane.

That is why debugging with only netstat on a node can mislead you.

For a Service, inspect:

kubectl get svc payment-api -n payments
kubectl get endpointslice -n payments -l kubernetes.io/service-name=payment-api
kubectl describe svc payment-api -n payments

The real question is not “is the Service running?”

A Service does not run.

The real question is:

Does the Service resolve to ready backend endpoints, and does the data plane route to them?

9. Headless Service

A headless Service sets clusterIP: None.

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: postgres
  ports:
    - name: postgres
      port: 5432
      targetPort: postgres

Headless Services are common for stateful systems because clients may need to know individual backend identities.

For a StatefulSet named postgres in namespace database whose serviceName points at a headless Service named postgres, Pods get stable DNS names such as:

postgres-0.postgres.database.svc.cluster.local
postgres-1.postgres.database.svc.cluster.local
postgres-2.postgres.database.svc.cluster.local

This matters for:

database replication;
quorum systems;
broker clusters;
sharded systems;
systems where identity is part of the protocol.

Headless Service means Kubernetes is not providing the same virtual-IP load balancing abstraction. Clients may receive backend records and make their own selection.

Use it intentionally.
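
The per-Pod DNS names above depend on the StatefulSet referencing the headless Service through serviceName. A minimal sketch of that pairing, assuming namespace database; the image tag and the Secret name postgres-credentials are illustrative assumptions, and storage is left out, so this is not a production database setup:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres            # names the headless Service that governs per-Pod DNS records
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: postgres
  template:
    metadata:
      labels:
        app.kubernetes.io/name: postgres   # also matched by the headless Service selector
    spec:
      containers:
        - name: postgres
          image: postgres:16       # assumed image tag
          ports:
            - name: postgres       # lets the Service use targetPort: postgres
              containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # hypothetical Secret, must exist beforehand
                  key: password
      # volumeClaimTemplates omitted for brevity; a real database needs persistent storage
```

With three replicas this yields postgres-0, postgres-1, and postgres-2, each resolvable under the headless Service name.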

10. NodePort Service

A NodePort exposes a port on each node.

apiVersion: v1
kind: Service
metadata:
  name: legacy-web
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: legacy-web
  ports:
    - name: http
      port: 80
      targetPort: http
      nodePort: 30080

Clients outside the cluster can connect to:

NODE_IP:30080

This is rarely the best high-level deployment model for modern production traffic.

It can be useful when:

building simple lab environments;
integrating with external load balancers;
exposing traffic in bare-metal clusters;
debugging specific network paths.

But direct NodePort exposure creates operational issues:

every node becomes part of the exposure surface;
firewall policy must be managed carefully;
node lifecycle affects external routing;
TLS and host/path routing are not handled by NodePort itself;
cloud-provider load balancer integration is usually more appropriate.

Treat NodePort as a low-level primitive, not a complete ingress strategy.

11. LoadBalancer Service

A LoadBalancer Service asks the infrastructure integration to provision an external or internal load balancer.

apiVersion: v1
kind: Service
metadata:
  name: public-api
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: public-api
  ports:
    - name: http
      port: 80
      targetPort: http

This is simple and useful, but has trade-offs.

Benefit	Cost
Easy external exposure	Can create one load balancer per Service.
Cloud-provider integration	Provider behavior differs.
Works for TCP/UDP	HTTP routing, TLS, auth, rate limits may require extra layers.
Good for dedicated service endpoints	Not ideal for many host/path routes.

For HTTP applications, Ingress or Gateway API usually provides a better routing abstraction.

For non-HTTP protocols, LoadBalancer Service may be appropriate, or Gateway API TCP/UDP routes may be better depending on implementation support.

12. ExternalName Service

An ExternalName Service maps a Kubernetes service name to an external DNS name.

apiVersion: v1
kind: Service
metadata:
  name: fraud-provider
  namespace: payments
spec:
  type: ExternalName
  externalName: api.fraud-provider.example.com

This allows applications to use a cluster-local name such as:

fraud-provider.payments.svc.cluster.local

while resolving to an external DNS target.

Use cases:

stable internal naming for external dependencies;
migrating an external service into the cluster later;
abstracting third-party endpoints.

Cautions:

no Kubernetes readiness semantics for external backend health;
no Pod endpoint membership;
DNS behavior may surprise HTTP clients if host headers, TLS SNI, or certificates expect the external name;
observability and policy may be less direct than with in-cluster Services.

ExternalName is a naming convenience, not a traffic-management system.

13. EndpointSlice Mental Model

A Service is the contract.

EndpointSlices are the backend inventory.

When a selector-based Service matches Pods, Kubernetes creates and updates EndpointSlices that describe the backend endpoints.

EndpointSlices exist because a single Endpoints object does not scale well for large backend sets. EndpointSlices partition endpoint information into multiple smaller API objects.

Inspect them directly:

kubectl get endpointslice -n commerce -l kubernetes.io/service-name=orders-api
kubectl describe endpointslice -n commerce -l kubernetes.io/service-name=orders-api

Useful fields include:

endpoint addresses;
endpoint conditions;
target references;
ports;
address type;
topology hints, depending on configuration and version;
service ownership labels.

When debugging, do not stop at kubectl get svc.

A Service can exist and still have no usable endpoints.

14. Readiness and Endpoint Membership

A Pod can be running but not ready.

A running-but-not-ready Pod should not normally receive Service traffic.

This is why readiness probes are part of networking design.

readinessProbe:
  httpGet:
    path: /ready
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3

The readiness signal influences whether the Pod appears as a ready backend endpoint.

This creates an important invariant:

A Service routes to readiness-qualified backends, not merely existing Pods.

Bad readiness probes create bad traffic decisions.

Probe Mistake	Consequence
Readiness always returns success	Broken Pods receive traffic.
Readiness checks only process liveness	App receives traffic before dependencies are usable.
Readiness depends on fragile downstream	Temporary downstream issue removes all Pods from Service.
Readiness has high latency	Endpoint state flaps under load.
Readiness ignores schema/cache warmup	New Pods get traffic too early.

Readiness is not a health-check checkbox.

It is the contract between workload state and traffic routing.

15. DNS for Services

Kubernetes clusters normally run a cluster-aware DNS service such as CoreDNS. It watches Services and creates DNS records for service discovery.

Common forms:

service-name
service-name.namespace
service-name.namespace.svc
service-name.namespace.svc.cluster.local

Example:

orders-api.commerce.svc.cluster.local

Inside the same namespace, a client can often use just:

orders-api

Across namespaces, use at least:

orders-api.commerce

or the fully qualified name:

orders-api.commerce.svc.cluster.local

Production recommendation:

use short names only for same-namespace calls;
use namespace-qualified names for cross-namespace calls;
use fully qualified names in platform-level configuration where ambiguity is unacceptable.

DNS ambiguity is a real failure mode in multi-team clusters.

16. DNS Search Path and Namespace Coupling

A Pod's DNS configuration includes search domains. This allows short names to resolve relative to the Pod's namespace.

Suppose a Pod runs in namespace checkout and calls:

orders-api

The resolver may try names like:

orders-api.checkout.svc.cluster.local
orders-api.svc.cluster.local
orders-api.cluster.local

The exact behavior depends on resolver configuration, but the lesson is stable:

Short service names couple clients to their own namespace.

This is fine for same-namespace components. It is risky for shared services.

For shared services, prefer explicit names:

orders-api.commerce.svc.cluster.local

This makes ownership and dependency clearer.
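
How many search-path candidates the resolver tries before treating a name as absolute is governed by the ndots resolver option. The sketch below shows a Pod that overrides it. The Pod name, image, and the value 2 are assumptions for illustration; with dnsPolicy: ClusterFirst the usual default is ndots:5.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-demo                 # hypothetical name
  namespace: checkout
spec:
  dnsPolicy: ClusterFirst        # use cluster DNS with the namespace-based search list
  dnsConfig:
    options:
      - name: ndots
        value: "2"               # names with 2 or more dots are tried as absolute first
  containers:
    - name: client
      image: busybox:1.36        # assumed image, used only to keep the Pod running
      command: ["sleep", "3600"]
```

With this setting, orders-api.commerce.svc.cluster.local is tried as an absolute name first, while orders-api and orders-api.commerce still go through the search list. Inspect /etc/resolv.conf inside the Pod to confirm what the resolver actually received.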

17. Service Environment Variables

Kubernetes can inject environment variables for Services into Pods.

Example pattern:

PAYMENT_API_SERVICE_HOST
PAYMENT_API_SERVICE_PORT

However, DNS is usually preferred for application-level service discovery.

Service environment variables have drawbacks:

they are generated only for Services existing before the Pod starts;
they can create many environment variables in large namespaces;
names can collide with application expectations;
they are less flexible than DNS.

For most modern workloads, use DNS and consider setting:

enableServiceLinks: false

when environment-variable injection is unwanted.
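
The field belongs to the Pod spec, so in a Deployment it sits under spec.template.spec. A fragment as a sketch; the container name, image, and the PAYMENT_API_URL variable are hypothetical and only show the application receiving a DNS name explicitly instead of relying on injected variables:

```yaml
# Pod template fragment, for example under a Deployment's spec.template
spec:
  enableServiceLinks: false        # skip *_SERVICE_HOST / *_SERVICE_PORT injection for Services
  containers:
    - name: checkout-api           # hypothetical container name
      image: registry.example.com/checkout-api:1.0.0   # hypothetical image
      env:
        - name: PAYMENT_API_URL    # hypothetical variable read by the application
          value: http://payment-api.payments.svc.cluster.local
```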

18. kube-proxy and Service Data Plane

The Kubernetes API defines Services.

The data plane makes them work.

Historically, kube-proxy has programmed node-level packet forwarding rules in modes such as iptables or IPVS. Modern clusters may instead use kube-proxy's nftables mode or CNI/eBPF-based service handling, depending on distribution and configuration.

The key concept:

A Service IP is not usually owned by a single process. It is a virtual IP implemented through network rules or equivalent data-plane logic.

This means traffic bugs can occur at several layers:

DNS returns no record;
Service exists but has no endpoints;
endpoint is not ready;
target port is wrong;
kube-proxy or replacement data plane is unhealthy;
CNI routing is broken;
NetworkPolicy denies the connection;
application accepts TCP but fails HTTP;
TLS/SNI/Host mismatch occurs above Kubernetes Service.

Production debugging must isolate the layer.

19. Packet Path: Pod to Service

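A simplified internal request path:

client Pod -> cluster DNS lookup -> Service ClusterIP -> data-plane rule picks a ready endpoint from EndpointSlices -> Pod IP and targetPort -> container listener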

Useful diagnostic questions:

Can the client resolve the service name?
Does the Service exist in the expected namespace?
Does the Service select the expected Pods?
Do EndpointSlices contain ready endpoints?
Does the target port match the application listener?
Is NetworkPolicy blocking traffic?
Does traffic fail only across nodes or also same-node?
Does traffic fail only by DNS name or also by ClusterIP?
Does traffic fail only at HTTP/TLS layer or also at TCP connect?

This order prevents random debugging.

20. Service Selector Design

Selector design is deployment design.

Good selectors are stable and intentional.

A typical selector:

selector:
  app.kubernetes.io/name: orders-api
  app.kubernetes.io/component: api

Avoid selectors that include volatile labels:

selector:
  app.kubernetes.io/version: v1.7.3

Version labels are useful for observability and progressive delivery, but they can accidentally break stable routing if used as the main Service selector.

Better pattern:

metadata:
  labels:
    app.kubernetes.io/name: orders-api
    app.kubernetes.io/component: api
    app.kubernetes.io/version: v1.7.3
    platform.example.com/release-track: stable

Stable Service:

selector:
  app.kubernetes.io/name: orders-api
  app.kubernetes.io/component: api
  platform.example.com/release-track: stable

Canary Service:

selector:
  app.kubernetes.io/name: orders-api
  app.kubernetes.io/component: api
  platform.example.com/release-track: canary

This makes release routing explicit.

21. Port Naming Governance

Port names become part of the operational contract.

Bad:

ports:
  - name: port1
    containerPort: 8080

Better:

ports:
  - name: http
    containerPort: 8080

For multi-port services:

ports:
  - name: http
    port: 80
    targetPort: http
  - name: metrics
    port: 9090
    targetPort: metrics

Do not expose metrics through the same Service used by application clients unless intentional.

Create separate Services when traffic classes have different consumers, security posture, or policies.

orders-api            -> application traffic
orders-api-metrics    -> observability scraping
orders-api-headless   -> direct backend discovery, if needed

Separate Services create clearer policy and fewer accidental exposures.
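
As a sketch of the split, a dedicated metrics Service can select the same Pods as orders-api while exposing only the metrics port. It assumes the Pods define a container port named metrics; the exposure label is taken from this series' example taxonomy:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-api-metrics
  namespace: commerce
  labels:
    platform.example.com/exposure: internal
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: orders-api
    app.kubernetes.io/component: api
  ports:
    - name: metrics
      port: 9090
      targetPort: metrics          # requires a container port named "metrics" on the Pods
```

The application-facing orders-api Service then lists only the http port, so NetworkPolicy and scraping configuration can target each Service separately.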

22. Internal Service Design Patterns

22.1 One Service per application API

Use when a workload exposes one stable application interface.

orders-api.commerce.svc.cluster.local

Good for stateless HTTP/gRPC services.

22.2 Separate public and private Services

Use when the same Pods expose different traffic classes.

orders-api-public
orders-api-internal
orders-api-admin
orders-api-metrics

This allows different policies, routes, and observability.

22.3 Headless plus normal Service

Use for stateful workloads.

postgres          -> stable client entrypoint
postgres-headless -> individual Pod identity

22.4 Service per release track

Use for manual or controller-driven canary releases.

orders-api-stable
orders-api-canary

Ingress, Gateway, or service mesh can split traffic between them.

22.5 ExternalName for migration bridge

Use to preserve service naming while the dependency remains external.

legacy-fraud-provider.payments.svc.cluster.local

23. Service Topology and Traffic Locality

In large clusters, cross-zone traffic can add latency and cost.

Kubernetes and cloud providers offer mechanisms to influence traffic locality, but behavior depends on feature state, provider implementation, and data plane.

The design question is not “can traffic stay local?”

The better question:

What happens when local endpoints are unavailable?

Possible strategies:

Strategy	Benefit	Risk
Prefer local endpoints	Lower latency and cross-zone cost	Uneven load if zones differ.
Require local endpoints	Strong locality	Outage if zone has no healthy endpoints.
Global balancing	Better availability	More cross-zone traffic.
Client-aware routing	More control	More application complexity.

For most services, availability beats strict locality.

For high-throughput data-plane services, locality can matter enough to justify complexity.
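
One way to express the "prefer local endpoints" strategy is the Service field trafficDistribution. This is a sketch under the assumption that the cluster version and data plane implement the field; older clusters or alternative data planes may not honor it, so verify behavior before relying on it:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-api
  namespace: commerce
spec:
  type: ClusterIP
  trafficDistribution: PreferClose   # a preference: prefer topologically close endpoints when available
  # Stricter alternative (node-local, not zone-local):
  #   internalTrafficPolicy: Local
  # With Local, in-cluster traffic from a node without a ready local endpoint is dropped.
  selector:
    app.kubernetes.io/name: orders-api
    app.kubernetes.io/component: api
  ports:
    - name: http
      port: 80
      targetPort: http
```

The preference form keeps a fallback to remote endpoints, which matches the point that availability usually beats strict locality.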

24. Session Affinity

Services can support client IP-based session affinity.

spec:
  sessionAffinity: ClientIP

This attempts to send traffic from the same client IP to the same backend for a period.

Use cautiously.

It can help legacy stateful applications, but it can also:

create uneven backend load;
hide session state problems;
complicate rollout behavior;
break when client IP is shared by many clients;
interact poorly with proxies and NAT.

Modern applications should prefer stateless request handling or explicit state stores.

Session affinity is a compatibility tool, not a primary scalability strategy.
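
If affinity is justified, set its timeout explicitly rather than inheriting the default. A sketch that reuses the legacy-web name from the NodePort example as a ClusterIP Service; the 1800-second value is an assumption chosen for illustration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: legacy-web
spec:
  type: ClusterIP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 1800         # assumed value; the API default is 10800 (3 hours)
  selector:
    app.kubernetes.io/name: legacy-web
  ports:
    - name: http
      port: 80
      targetPort: http
```

Affinity here keys on the source IP the data plane sees, which is why NAT and proxies in front of the Service weaken it.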

25. `externalTrafficPolicy` and Source IP Preservation

For externally exposed Services, externalTrafficPolicy can affect routing and source IP behavior.

Common values:

externalTrafficPolicy: Cluster

or:

externalTrafficPolicy: Local

High-level trade-off:

Policy	Effect
`Cluster`	Traffic can be distributed to endpoints across the cluster, but source IP may be obscured depending on path.
`Local`	Preserves client source IP in common scenarios, but a node without a ready local endpoint drops the traffic, so the load balancer health check must steer traffic only to nodes that have one.

This matters for:

audit logs;
rate limiting;
geo/security decisions;
external load balancer health checks;
uneven traffic when endpoints are not present on all nodes.

Do not set this casually.

It is a traffic semantics decision.
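
A sketch of the Local variant, reusing the public-api example from the LoadBalancer section. How the provisioned load balancer health-checks nodes is provider-specific and is assumed to work as the provider documents:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: public-api
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # keep external traffic on the receiving node's own endpoints
  selector:
    app.kubernetes.io/name: public-api
  ports:
    - name: http
      port: 80
      targetPort: http
```

With this policy, spreading replicas across nodes matters more, because nodes without a ready public-api Pod should receive no external traffic at all.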

26. Service Discovery and Rollouts

During a Deployment rollout, the Service selector usually stays constant.

The backend set changes.

For safe rollout:

readiness must only pass when the Pod can serve real traffic;
preStop and termination grace must allow connection drain;
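the rollout strategy (maxUnavailable and maxSurge) should bound how many backends disappear at once, and a PodDisruptionBudget should do the same for voluntary disruptions such as node drains;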
Service selector should not change accidentally;
clients should retry safely;
backend APIs should be compatible during mixed-version windows.

A rolling update almost always creates a mixed-version backend set.

Your API and data contracts must tolerate that.
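
The checklist above can be sketched as one Deployment plus a PodDisruptionBudget. The image, replica count, probe path, and timing values are assumptions for illustration, and the preStop command requires a sleep binary in the image. The PodDisruptionBudget applies to voluntary evictions such as node drains; the rollout itself is bounded by the Deployment strategy.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: commerce
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # add one new Pod at a time
      maxUnavailable: 0            # never drop below the desired ready count during rollout
  selector:
    matchLabels:
      app.kubernetes.io/name: orders-api
      app.kubernetes.io/component: api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: orders-api
        app.kubernetes.io/component: api
    spec:
      terminationGracePeriodSeconds: 45     # must exceed preStop delay plus in-flight drain time
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.7.3   # hypothetical image
          ports:
            - name: http
              containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]    # give endpoint removal time to propagate before SIGTERM
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
  namespace: commerce
spec:
  maxUnavailable: 1                # at most one Pod evicted voluntarily at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: orders-api
      app.kubernetes.io/component: api
```

Note that the Service selector is untouched by any of this: only the set of ready Pods behind it changes.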

27. NetworkPolicy Interaction

A Service being reachable by name does not mean traffic is allowed.

NetworkPolicy can restrict traffic between Pods and namespaces.

Important distinction:

Service discovery answers “where is the service?”
NetworkPolicy answers “is this traffic allowed?”

A common failure:

DNS works.
Service has endpoints.
TCP connection times out.

Possible cause: NetworkPolicy denies the traffic.

Debug separately:

kubectl get networkpolicy -A
kubectl describe networkpolicy -n target-namespace

In zero-trust clusters, default-deny policies are expected.

Service creation should not automatically imply reachability.
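
A sketch of an allow rule for the checkout-to-orders path used in this lesson. The policy name and the checkout-api Pod label are assumptions, and port 8080 assumes the container port behind the Service's named http target port. Enforcement also requires a CNI implementation that supports NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api-allow-checkout  # hypothetical name
  namespace: commerce
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: orders-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        # namespaceSelector and podSelector in one entry are ANDed together
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: checkout
          podSelector:
            matchLabels:
              app.kubernetes.io/name: checkout-api   # assumed client label
      ports:
        - protocol: TCP
          port: 8080               # the Pod port (targetPort), not the Service port 80
```

The port detail is a frequent source of timeouts: policies are evaluated against the Pod's port after Service translation, so allowing port 80 here would not admit the traffic.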

28. Observability for Service Discovery

A production platform should observe the service discovery layer.

Useful signals:

Signal	Meaning
DNS latency/error rate	Cluster DNS health and resolver behavior.
Service endpoint count	Whether Services have usable backends.
Endpoint readiness churn	Flapping readiness or unstable rollout.
Connection failure rate	Network or app-level failure.
Cross-zone traffic volume	Cost and locality behavior.
kube-proxy or data-plane errors	Node-level service routing health.
NetworkPolicy deny metrics	Policy-driven connectivity failures.

Basic alert examples:

critical Service has zero ready endpoints for more than N minutes;
CoreDNS error rate spikes;
DNS p95 latency exceeds threshold;
EndpointSlice churn increases sharply during rollout;
service-level 5xx increases after backend endpoint change.

Services are part of the reliability surface.

Observe them accordingly.

29. Debugging Playbook: Service Does Not Work

Use layered debugging.

Step 1: Confirm the Service exists

kubectl get svc -n commerce orders-api
kubectl describe svc -n commerce orders-api

Check:

namespace;
type;
ClusterIP;
ports;
selector;
annotations from controllers or cloud provider.

Step 2: Confirm selected Pods

kubectl get pods -n commerce -l app.kubernetes.io/name=orders-api,app.kubernetes.io/component=api --show-labels

If no Pods match, the Service cannot route.

Step 3: Inspect EndpointSlices

kubectl get endpointslice -n commerce -l kubernetes.io/service-name=orders-api -o wide
kubectl describe endpointslice -n commerce -l kubernetes.io/service-name=orders-api

Check endpoint readiness and ports.

Step 4: Test DNS from a Pod

kubectl exec -n checkout deploy/checkout-api -- nslookup orders-api.commerce.svc.cluster.local

or with a debug Pod:

kubectl run netshoot -n checkout --rm -it --image=nicolaka/netshoot -- /bin/bash

Step 5: Test TCP and HTTP connectivity from inside the client or debug Pod

curl -v http://orders-api.commerce.svc.cluster.local/health

or:

nc -vz orders-api.commerce.svc.cluster.local 80

Step 6: Test direct endpoint

If allowed, test a specific Pod IP and target port.

This isolates Service routing from application behavior.

Step 7: Check policies

kubectl get networkpolicy -n commerce
kubectl get networkpolicy -n checkout

Step 8: Check node-level data plane

Depending on cluster:

kubectl get pods -n kube-system -l k8s-app=kube-proxy
kubectl logs -n kube-system daemonset/kube-proxy

If using Cilium, Calico, or another implementation, use that implementation's tooling.

The point is not to memorize one command set.

The point is to isolate the layer.

30. Failure Mode Catalogue

Symptom	Likely Layer	Check
`NXDOMAIN`	DNS or wrong name	Fully qualified service name.
DNS resolves but connection refused	App not listening or wrong target port	Pod container ports and app listener.
DNS resolves but timeout	NetworkPolicy, routing, data plane, backend hang	NetworkPolicy, EndpointSlice, CNI.
Service has no endpoints	Selector or readiness	Labels, readiness probes, EndpointSlices.
Some requests fail	Mixed backend health	Endpoint readiness, rollout, app logs.
Only cross-node traffic fails	CNI routing	CNI health, node routes.
Only external traffic fails	LB/NodePort/Ingress path	Service type, health checks, firewall.
Wrong app responds	Selector collision	Label taxonomy and Service selector.
Canary receives too much traffic	Selector or routing layer	Stable/canary labels and Gateway/Ingress rules.
Metrics scraping fails	Wrong Service/port/policy	Separate metrics Service and NetworkPolicy.

Debugging skill is largely classification skill.

Classify the failure before changing YAML.

31. Anti-Patterns

31.1 Calling Pod IPs directly

Bad:

http://10.42.3.17:8080

Why bad:

Pod IP is ephemeral;
rollout breaks clients;
readiness is bypassed;
observability and policy become harder;
client stores stale backend state.

Use Service names.

31.2 Selector too broad

Bad:

selector:
  app: api

In a large namespace, this may match more than intended.

Prefer prefixed, standardized labels such as app.kubernetes.io/name and app.kubernetes.io/component.

31.3 One Service for everything

Bad:

same Service for user traffic, admin traffic, metrics, debug port

This creates policy ambiguity.

Separate traffic classes.

31.4 No readiness probe

Without a readiness probe, a Pod counts as ready as soon as its containers are running, so Kubernetes may route traffic to it before the application is useful.

This is a rollout safety bug.

31.5 Hiding all routing in annotations

For north-south traffic, excessive Service annotations can create provider lock-in and unclear ownership.

Prefer Gateway API or explicit ingress resources when appropriate.

32. Enterprise Naming and Label Standards

A platform should define standard labels for service discovery.

Example:

app.kubernetes.io/name: orders-api
app.kubernetes.io/instance: orders-api-prod
app.kubernetes.io/component: api
app.kubernetes.io/part-of: commerce-platform
app.kubernetes.io/version: v1.7.3
platform.example.com/tier: backend
platform.example.com/exposure: internal
platform.example.com/release-track: stable

Service selector:

selector:
  app.kubernetes.io/name: orders-api
  app.kubernetes.io/component: api
  platform.example.com/release-track: stable

Policy selectors:

podSelector:
  matchLabels:
    app.kubernetes.io/name: orders-api

Observability grouping:

service=orders-api
namespace=commerce
tier=backend
release_track=stable

The same taxonomy should support:

selection;
routing;
policy;
observability;
ownership;
cost allocation;
incident response.

Do not let every team invent their own labels.

33. Designing Service Boundaries

A Service should represent a stable interface, not just a deployment artifact.

Ask these questions:

Who owns this Service?
Which clients are allowed to call it?
Is it namespace-local, cluster-shared, or externally exposed?
Is the API stable across rolling updates?
Does it expose user traffic, admin traffic, metrics, or replication traffic?
Should it have a ClusterIP or be headless?
Does it require source IP preservation?
Does it require session affinity?
How is it observed?
What happens if it has zero endpoints?

This turns service design from YAML creation into interface design.

34. Java Service-to-Service Considerations

For Java workloads, Kubernetes Service discovery interacts with runtime behavior.

34.1 DNS caching

The JVM and libraries may cache DNS results. Long DNS caching can cause clients to hold stale information, especially when using headless Services or external DNS names.

For normal ClusterIP Services, DNS returns the Service IP, which is stable. For headless Services, DNS can return individual endpoints, so caching behavior matters more.
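
One way to bound JVM DNS caching from the manifest is to pass system properties through JAVA_TOOL_OPTIONS. This is a sketch with assumptions: the container name, image, and TTL values are illustrative, the sun.net.inetaddr.* properties are implementation-specific to OpenJDK-style runtimes, and the networkaddress.cache.ttl security property takes precedence if it is set. Libraries with their own resolvers or caches are not affected.

```yaml
# Container fragment for a Java workload, for example under spec.template.spec
containers:
  - name: checkout-api             # hypothetical container name
    image: registry.example.com/checkout-api:1.0.0   # hypothetical image
    env:
      - name: JAVA_TOOL_OPTIONS    # picked up by the JVM at startup
        # cache successful lookups for 10s and failed lookups for 5s (assumed values)
        value: "-Dsun.net.inetaddr.ttl=10 -Dsun.net.inetaddr.negative.ttl=5"
```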

34.2 Connection pooling

HTTP/gRPC clients often maintain long-lived connections.

During rollout:

old Pods may be terminating;
new Pods may enter endpoint sets;
existing pooled connections may continue until closed;
load distribution may be skewed.

Use:

graceful shutdown;
readiness fail-before-terminate pattern (report not ready first, then stop the process);
connection draining;
sensible client timeouts;
retry budgets;
circuit breakers where appropriate.
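The first three items in the list above can be sketched with the JDK's built-in HTTP server. This is a minimal illustration, not a production server: the class name DrainingServer, the /ready path, port 8080, and the 10-second drain window are all assumptions, and the readiness path would have to match whatever the Pod's readiness probe calls.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

final class DrainingServer {
    private final AtomicBoolean ready = new AtomicBoolean(true);
    private final HttpServer server;

    DrainingServer(int port) {
        try {
            server = HttpServer.create(new InetSocketAddress(port), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Hypothetical readiness path: 200 while serving, 503 while draining.
        server.createContext("/ready", exchange -> {
            int status = ready.get() ? 200 : 503;
            exchange.sendResponseHeaders(status, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
    }

    int port() { return server.getAddress().getPort(); }

    boolean isReady() { return ready.get(); }

    /** Step 1: fail readiness so the Pod drops out of Service routing. */
    void beginDrain() { ready.set(false); }

    /** Step 2: stop accepting new exchanges and wait for in-flight ones. */
    void stop(int maxWaitSeconds) { server.stop(maxWaitSeconds); }

    public static void main(String[] args) {
        DrainingServer app = new DrainingServer(8080);
        // The JVM runs shutdown hooks when it receives SIGTERM.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            app.beginDrain();
            try {
                Thread.sleep(10_000); // assumed drain window for endpoint updates to propagate
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            app.stop(5);
        }));
    }
}
```

The drain window plus the stop wait has to fit inside the Pod's termination grace period, otherwise the container is killed before draining completes.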

34.3 Timeout discipline

A Kubernetes Service does not add application-level timeouts. It forwards connections; how long a caller waits is decided entirely by the client.

Every client call should have:

connection timeout;
request timeout;
retry limit;
retry backoff;
idempotency rules;
observability tags.

Service discovery finds backends.

It does not make dependency calls safe.
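A minimal sketch of that discipline with the JDK HttpClient (Java 11 or later) might look like the following. The class name OrdersClient, the timeout values, the retry count, and the backoff base are assumptions for illustration; retrying is shown only for GET because the sketch assumes that call is idempotent.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class OrdersClient {
    static final int MAX_ATTEMPTS = 3;                          // retry limit
    static final Duration BASE_BACKOFF = Duration.ofMillis(100);

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))              // connection timeout
            .build();

    /** Exponential backoff: 100 ms, 200 ms, 400 ms for attempts 1, 2, 3. */
    static Duration backoff(int attempt) {
        return BASE_BACKOFF.multipliedBy(1L << (attempt - 1));
    }

    /** GET only: retrying is acceptable here because the call is assumed idempotent. */
    String get(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(2))                 // request timeout
                .GET()
                .build();
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() < 500) {
                    return response.body();                     // 4xx is returned, not retried
                }
                last = new IOException("server error " + response.statusCode());
            } catch (IOException e) {                           // includes timeouts and refused connections
                last = e;
            }
            if (attempt < MAX_ATTEMPTS) {
                Thread.sleep(backoff(attempt).toMillis());
            }
        }
        throw last;
    }
}
```

A caller would use it as `new OrdersClient().get(URI.create("http://orders-api.commerce.svc.cluster.local/"))`, where the path is a placeholder. A real client would also add jitter to the backoff and a retry budget shared across calls, which this sketch leaves out.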

35. Service Discovery Design Review Checklist

Use this before approving a production Service.

API Contract

Service name is stable and meaningful.
Namespace ownership is clear.
Service type is justified.
Ports are named consistently.
Service selector is precise and stable.
Metrics/admin ports are separated if needed.

Runtime Safety

Pods have readiness probes.
Shutdown and connection draining are designed.
Rolling updates tolerate mixed backend versions.
Clients have timeouts and retry limits.
Session affinity is avoided unless justified.

Security

NetworkPolicy allows only intended callers.
External exposure is intentional.
Source IP behavior is understood if externally exposed.
Sensitive ports are not reachable by broad clients.

Operations

Endpoint count is observable.
Zero-ready-endpoint alerts exist for critical Services.
DNS health is monitored.
Service ownership labels exist.
Runbook includes DNS, EndpointSlice, and policy checks.

36. Practice Lab

Lab 1: Build a stable internal Service

Deploy a simple HTTP app.
Create a ClusterIP Service.
Call it from another Pod using short name, namespace-qualified name, and FQDN.
Scale the Deployment.
Observe EndpointSlices change.

Commands:

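kubectl get svc
kubectl get endpointslice -l kubernetes.io/service-name=<service-name>
kubectl exec <client-pod> -- nslookup <service-name>
kubectl exec <client-pod> -- curl -v http://<service-name>
kubectl exec <client-pod> -- curl -v http://<service-name>.<namespace>
kubectl exec <client-pod> -- curl -v http://<service-name>.<namespace>.svc.cluster.local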

Lab 2: Break the selector

Change the Service selector to a non-matching label.
Observe that the Service still exists.
Observe that its EndpointSlices become empty.
Restore the selector.

Learning objective:

A Service object can be healthy as an API object while useless as a routing object.

Lab 3: Readiness and routing

Add a readiness probe.
Make the readiness endpoint fail.
Observe the endpoint being marked not ready in the EndpointSlice and dropping out of Service routing.
Restore readiness.
Observe endpoint return.

Learning objective:

Readiness controls Service membership.

Lab 4: Cross-namespace DNS

Create orders-api in namespace commerce.
Call orders-api from namespace checkout.
Observe that the short name fails to resolve, or resolves to a different orders-api if one exists in checkout.
Call orders-api.commerce.svc.cluster.local.

Learning objective:

Short names are namespace-relative.
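The behaviour in Lab 4 follows from how the resolver expands names using the Pod's DNS search list. The sketch below models that expansion under stated assumptions: the default Pod search list of three domains, a cluster domain of cluster.local, and ndots set to 5, which is the usual default for Pods. The class name SearchPath is hypothetical, and real clusters may append extra search domains inherited from the node.

```java
import java.util.ArrayList;
import java.util.List;

final class SearchPath {
    /** Returns the fully qualified names a resolver would try, in order. */
    static List<String> candidates(String name, String namespace, String clusterDomain, int ndots) {
        // Assumed default search list for a Pod in the given namespace.
        List<String> search = List.of(
                namespace + ".svc." + clusterDomain,
                "svc." + clusterDomain,
                clusterDomain);

        List<String> out = new ArrayList<>();
        if (name.endsWith(".")) {            // trailing dot: already absolute, no expansion
            out.add(name);
            return out;
        }
        long dots = name.chars().filter(c -> c == '.').count();
        if (dots >= ndots) {
            out.add(name + ".");             // enough dots: try the name as-is first
        }
        for (String domain : search) {
            out.add(name + "." + domain + ".");
        }
        if (dots < ndots) {
            out.add(name + ".");             // too few dots: as-is is tried last
        }
        return out;
    }

    public static void main(String[] args) {
        // From a Pod in namespace checkout, the short name expands into checkout first.
        candidates("orders-api", "checkout", "cluster.local", 5).forEach(System.out::println);
    }
}
```

For the short name, the first candidate is orders-api.checkout.svc.cluster.local., so a Service in commerce is never reached. The name orders-api.commerce works because its second candidate is orders-api.commerce.svc.cluster.local.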

37. Mental Model Summary

Kubernetes service discovery is a chain:

Client -> DNS name -> Service -> EndpointSlice -> ready Pod IP -> targetPort -> application

A failure at any link breaks communication.

The key invariants:

Pods are unstable; Services are stable contracts.
Services select Pods through labels, not ownership.
EndpointSlices show the real backend set.
DNS gives stable names, but namespace context matters.
Readiness controls whether Pods receive Service traffic.
The Service API and the packet data plane are different layers.
Service design is interface design, not just infrastructure YAML.

If you master this chain, you can reason about Kubernetes traffic without guessing.

38. References

Kubernetes Documentation — Services: https://kubernetes.io/docs/concepts/services-networking/service/
Kubernetes Documentation — EndpointSlices: https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/
Kubernetes Documentation — DNS for Services and Pods: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
Kubernetes Documentation — Virtual IPs and Service Proxies: https://kubernetes.io/docs/reference/networking/virtual-ips/
Kubernetes Documentation — Connecting Applications with Services: https://kubernetes.io/docs/tutorials/services/connect-applications-service/
Kubernetes Documentation — Network Policies: https://kubernetes.io/docs/concepts/services-networking/network-policies/
