Platform Communication Observability, Debugging, and Runbooks
Learn Java Microservices Communication - Part 088
Production-grade observability, debugging, and runbooks for platform-mediated communication: DNS, Kubernetes Services, ingress, gateways, service mesh, Envoy, egress, multi-cluster routing, telemetry correlation, dashboards, incident workflows, testing, and policy.
Platform-mediated communication has many layers:
Java client
DNS
Kubernetes Service
EndpointSlice
network policy
gateway
ingress
service mesh sidecar
Envoy route
mTLS
authorization policy
destination service
application handler
When communication fails, the symptom may be simple:
HTTP 503
timeout
connection refused
gRPC UNAVAILABLE
But the cause may live in any layer.
A top-tier engineer can debug the path systematically.
They do not randomly increase timeouts.
They ask:
Where exactly did the request fail?
Was it DNS?
Was it routing?
Was it mTLS?
Was it authz?
Was it no endpoints?
Was it the gateway?
Was it the sidecar?
Was it the app?
Was it the downstream dependency?
This part provides the operational model for debugging platform-mediated communication.
1. Communication Path Mental Model
A failure can occur at:
- name resolution,
- connection establishment,
- TLS handshake,
- mTLS peer auth,
- routing rule,
- authorization,
- load balancing,
- upstream connection pool,
- application readiness,
- application handler,
- downstream call,
- response path.
Observability must identify the failing segment.
2. The Four Questions
For any communication incident, ask:
2.1 Did the client send?
Evidence:
- client logs,
- client metrics,
- trace span,
- connection pool metrics,
- request ID.
2.2 Did the platform route?
Evidence:
- DNS resolution,
- gateway/mesh access logs,
- route metrics,
- Envoy response flags,
- Service endpoints.
2.3 Did the destination receive?
Evidence:
- server access logs,
- application metrics,
- server trace spans,
- inbound proxy logs.
2.4 Did the destination complete?
Evidence:
- application status,
- dependency calls,
- DB metrics,
- response code,
- error logs.
This separates client, platform, and server responsibilities.
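The four questions can be read as an ordered elimination. The following is a minimal sketch, assuming each question has already been answered yes or no from the evidence listed above; the class and segment names are illustrative and not part of any library.

```java
/** Maps the answers to the four questions onto the first segment that lacks evidence. */
final class FailureLocalizer {

    /** Hypothetical segment names used only for this sketch. */
    enum Segment { CLIENT, PLATFORM_ROUTING, DESTINATION_INBOUND, DESTINATION_PROCESSING, RESPONSE_PATH }

    static Segment localize(boolean clientSent,
                            boolean platformRouted,
                            boolean destinationReceived,
                            boolean destinationCompleted) {
        if (!clientSent) return Segment.CLIENT;                        // 2.1 failed
        if (!platformRouted) return Segment.PLATFORM_ROUTING;          // 2.2 failed
        if (!destinationReceived) return Segment.DESTINATION_INBOUND;  // 2.3 failed
        if (!destinationCompleted) return Segment.DESTINATION_PROCESSING; // 2.4 failed
        // Every question answered yes: look at the response path back to the client.
        return Segment.RESPONSE_PATH;
    }
}
```

For example, `FailureLocalizer.localize(true, true, false, false)` points at the inbound side of the destination (inbound proxy, readiness, port), not at the client or the application handler.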
3. Error Code Interpretation
Common symptoms:
| Symptom | Possible layer |
|---|---|
| DNS lookup failed | DNS/CoreDNS/search domain |
| connection refused | port/app not listening/proxy |
| connection timeout | network policy/routing/firewall |
| TLS handshake failed | cert/SNI/trust/mTLS |
| HTTP 401 | authn/gateway/app |
| HTTP 403 | authz/mesh/app |
| HTTP 404 | route/path/app |
| HTTP 408 | gateway/client timeout |
| HTTP 413 | gateway body limit |
| HTTP 429 | rate limit |
| HTTP 502 | gateway upstream connection/protocol |
| HTTP 503 | no healthy upstream/circuit/app |
| HTTP 504 | gateway/upstream timeout |
| gRPC UNAVAILABLE | upstream/proxy/network |
| gRPC DEADLINE_EXCEEDED | deadline/timeout |
Do not assume the source of a status code.
A 503 may be generated by the app, the gateway, or a proxy.
4. Source of Error
Always identify the source of the error.
Questions:
- Did response include gateway/server headers?
- Is there an Envoy response flag?
- Does app log show request?
- Does server access log show status?
- Does gateway access log show upstream status?
- Does trace include destination span?
If the gateway logs a 503 but the app has no record of the request, the failure is before the app.
If the app logs a 503 and the gateway shows an upstream 503, the app generated it.
Source matters.
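The two rules above can be written down as a small decision function. This is a sketch under stated assumptions: the three inputs are values you have already pulled from the gateway access log and the application log for one request ID, and the class and enum names are hypothetical.

```java
/** Decides where an error status was generated, using gateway and application evidence. */
final class ErrorSourceClassifier {

    enum Source { APPLICATION, BEFORE_APPLICATION, UNDETERMINED }

    /**
     * @param gatewayStatus  status the gateway returned to the caller
     * @param upstreamStatus status the gateway recorded from its upstream, or null if none
     * @param appStatus      status the app logged for this request, or null if the app never saw it
     */
    static Source classify(int gatewayStatus, Integer upstreamStatus, Integer appStatus) {
        if (appStatus == null) {
            // The app has no record of the request: the failure happened before the app.
            return Source.BEFORE_APPLICATION;
        }
        boolean upstreamMatchesApp =
                upstreamStatus != null && upstreamStatus.intValue() == appStatus.intValue();
        if (upstreamMatchesApp && appStatus.intValue() == gatewayStatus) {
            // Gateway passed through exactly what the app produced.
            return Source.APPLICATION;
        }
        // Evidence disagrees (for example the app logged 200 but the caller saw 503).
        return Source.UNDETERMINED;
    }
}
```

An UNDETERMINED result is itself useful: it says the logs disagree, so the next step is the proxy response flags and the trace, not the application code.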
5. Envoy Response Flags
Envoy access logs can include response flags.
They help identify proxy-level failures.
Examples of what the flags report:
- upstream connection failure,
- upstream reset,
- local reset,
- no healthy upstream,
- timeout,
- rate limited,
- downstream connection termination.
The exact flags depend on Envoy version/config.
Operational rule:
proxy response flags are first-class debugging signals
Do not ignore them.
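To make the flags usable in tooling, a lookup like the one below can translate the access log field into readable text. This is a sketch, not a complete list: it covers a handful of well-known Envoy flag codes and assumes the field is logged as a comma-separated list with "-" meaning no flags. Confirm the codes against the Envoy documentation for your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Translates an Envoy response flags field such as "UF,URX" into readable descriptions. */
final class EnvoyResponseFlags {

    // Partial table of common flags; extend it from the Envoy docs for your version.
    private static final Map<String, String> MEANINGS = Map.of(
            "UF", "upstream connection failure",
            "UH", "no healthy upstream",
            "UT", "upstream request timeout",
            "UR", "upstream remote reset",
            "LR", "connection local reset",
            "RL", "rate limited",
            "DC", "downstream connection termination",
            "NR", "no route found",
            "UO", "upstream overflow (circuit breaking)",
            "URX", "upstream retry limit exceeded");

    static List<String> describe(String responseFlags) {
        List<String> result = new ArrayList<>();
        if (responseFlags == null || responseFlags.isBlank() || responseFlags.trim().equals("-")) {
            return result; // no proxy-level flag recorded
        }
        for (String flag : responseFlags.split(",")) {
            String code = flag.trim();
            result.add(code + ": " + MEANINGS.getOrDefault(code, "unknown flag"));
        }
        return result;
    }
}
```

A 503 logged with `UH` and a 503 logged with no flag at all are different incidents: the first never reached a backend, the second most likely came from the application.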
6. DNS Debugging
When a service name fails to resolve, check from the client pod:
nslookup case-service.case.svc.cluster.local
or:
dig case-service.case.svc.cluster.local
Check:
- correct namespace,
- FQDN,
- CoreDNS health,
- search path,
- Service exists,
- headless vs ClusterIP,
- DNS policy,
- network policy blocking DNS,
- JVM DNS cache.
Application symptom:
UnknownHostException
or timeout during DNS lookup.
DNS is a dependency.
Monitor it.
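The same lookup can be reproduced from inside the JVM, which matters because the JVM has its own DNS cache on top of the pod resolver. Below is a minimal sketch using only the JDK; the class name `DnsProbe` is hypothetical, and the cache TTL is read from the standard `networkaddress.cache.ttl` security property, which may be unset.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

/** Resolves a hostname the way the application would and records timing and errors. */
final class DnsProbe {

    static final class Result {
        final String host;
        final List<String> addresses;
        final long elapsedMillis;
        final String error; // null when resolution succeeded

        Result(String host, List<String> addresses, long elapsedMillis, String error) {
            this.host = host;
            this.addresses = addresses;
            this.elapsedMillis = elapsedMillis;
            this.error = error;
        }

        boolean ok() { return error == null; }
    }

    static Result resolve(String host) {
        long start = System.nanoTime();
        List<String> addresses = new ArrayList<>();
        String error = null;
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                addresses.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // This is the exception the application symptom refers to.
            error = "UnknownHostException: " + e.getMessage();
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(host, addresses, elapsedMillis, error);
    }

    /** Positive DNS cache TTL configured for this JVM, if any. */
    static String positiveCacheTtl() {
        String ttl = Security.getProperty("networkaddress.cache.ttl");
        return ttl == null ? "unset (JVM default applies)" : ttl + "s";
    }
}
```

Logging `host`, `elapsedMillis`, and `error` from such a probe gives the DNS lookup its own latency and failure signal, separate from the HTTP call that follows it.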
7. Kubernetes Service Debugging
Commands:
kubectl get svc -n case case-service
kubectl describe svc -n case case-service
kubectl get endpointslices -n case -l kubernetes.io/service-name=case-service
kubectl get pods -n case -l app=case-service -o wide
Check:
- Service selector,
- port/targetPort,
- endpoints exist,
- pods ready,
- labels match,
- namespace,
- headless/ClusterIP,
- endpoint addresses.
No endpoints means routing cannot work.
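A frequent cause of zero endpoints is a selector that does not match the pod labels. The matching rule is simple: every key/value pair in the Service selector must be present on the pod. The sketch below illustrates that rule in plain Java; it does not call the Kubernetes API, and the label values are taken from the commands above plus a hypothetical `version` label.

```java
import java.util.Map;

/** Illustrates equality-based label matching between a Service selector and pod labels. */
final class ServiceSelectorCheck {

    static boolean selects(Map<String, String> selector, Map<String, String> podLabels) {
        if (selector.isEmpty()) {
            // A Service without a selector gets no automatically managed endpoints.
            return false;
        }
        for (Map.Entry<String, String> required : selector.entrySet()) {
            String actual = podLabels.get(required.getKey());
            if (!required.getValue().equals(actual)) {
                return false; // missing label or different value
            }
        }
        return true; // extra pod labels are fine
    }
}
```

With selector `app=case-service`, a pod labeled `app=case-service, version=v2` is selected, while a pod labeled `app=case-svc` is not, and the Service silently ends up with no endpoints.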
8. Pod Readiness Debugging
Check:
kubectl get pods -n case
kubectl describe pod -n case <pod>
kubectl logs -n case <pod>
Look for:
- readiness probe failures,
- liveness restarts,
- startup probe still running,
- container not listening,
- port mismatch,
- dependency readiness check broken,
- sidecar readiness issue,
- termination state.
During deploy, readiness issues are common.
9. Network Policy Debugging
Symptoms:
- connection timeout,
- one namespace cannot call another,
- DNS fails if DNS egress blocked,
- egress dependency unreachable.
Check:
kubectl get networkpolicy -A
kubectl describe networkpolicy -n <namespace>
Test from pod:
curl -v http://case-service.case.svc.cluster.local:8080/ready
Network policy bugs often look like service downtime.
Use allowed/denied connectivity tests in CI/staging.
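A connectivity test of this kind can be a plain TCP connect that distinguishes "refused" from "timed out", since the two point at different layers. The following is a sketch using only JDK sockets; the class and outcome names are hypothetical. Treat the mapping as a heuristic: many network policy implementations drop packets silently, which shows up as a timeout, but that behavior depends on the CNI plugin.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

/** Opens a TCP connection and reports how the attempt ended. */
final class ConnectivityProbe {

    enum Outcome { CONNECTED, REFUSED, TIMED_OUT, UNRESOLVED, OTHER_ERROR }

    static Outcome probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Outcome.CONNECTED;
        } catch (SocketTimeoutException e) {
            return Outcome.TIMED_OUT;   // often dropped traffic: policy, firewall, routing
        } catch (ConnectException e) {
            return Outcome.REFUSED;     // usually nothing listening on that port
        } catch (UnknownHostException e) {
            return Outcome.UNRESOLVED;  // DNS problem, not a connectivity problem
        } catch (IOException e) {
            return Outcome.OTHER_ERROR;
        }
    }
}
```

In CI or staging, assert both directions: an allowed caller gets CONNECTED, and a caller that policy should block does not.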
10. Gateway/Ingress Debugging
Check:
- route exists,
- host matches,
- path matches,
- TLS certificate valid,
- backend Service exists,
- backend port correct,
- auth/rate-limit policy,
- gateway logs,
- upstream status,
- gateway-generated errors.
Commands vary by gateway/controller.
For Ingress:
kubectl get ingress -A
kubectl describe ingress -n edge case-api
For Gateway API:
kubectl get gateway -A
kubectl get httproute -A
kubectl describe httproute -n edge case-route
Route config is production behavior.
11. Mesh Debugging
Check:
- sidecar injected?
- sidecar ready?
- mTLS policy?
- authorization policy?
- destination rule?
- virtual service?
- service entry?
- proxy config?
- Envoy clusters/listeners/routes?
- proxy logs?
Tooling depends on mesh.
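For Istio, common patterns include checking configuration sync state with istioctl proxy-status and inspecting the Envoy configuration of a specific pod with istioctl proxy-config (cluster, listener, route).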
Debug with actual source/destination identities.
Mesh issues often appear as:
- 403 denied,
- 503 no healthy upstream,
- TLS handshake failure,
- timeout,
- route to wrong subset.
12. Authorization Debugging
For 403/denied:
Ask:
- Who is source principal?
- What is destination workload?
- Which policy applied?
- Is namespace correct?
- Is service account correct?
- Is path/method matched?
- Is JWT required?
- Are claims present?
- Is request coming through gateway or direct?
- Are dry-run policies showing would-deny?
Authorization logs should include policy name and principal.
If not, debugging becomes painful.
13. mTLS Debugging
Symptoms:
- TLS handshake failures,
- connection reset,
- 503 upstream connect error,
- peer authentication failure.
Check:
- source has sidecar/proxy,
- destination expects STRICT/PERMISSIVE,
- certificates valid,
- trust domain compatible,
- destination policy,
- gateway TLS mode,
- service entry TLS mode,
- client/proxy SNI.
mTLS misconfiguration is common during migration.
Use permissive/dry-run phases.
14. Route Debugging
For wrong routing:
- match order,
- host,
- path,
- headers,
- method,
- subset,
- weight,
- gateway binding,
- namespace,
- export/import scope,
- labels.
Canary issues often come from:
- subset label mismatch,
- weight typo,
- route rule priority,
- header not injected,
- route applied in wrong namespace,
- stale proxy config.
Test route rules with real requests.
15. Java Client Debugging
Client-side metrics:
http.client.requests{dependency,status}
http.client.duration{dependency}
http.client.connection.acquire.duration
http.client.connection.active
http.client.connection.idle
http.client.errors{exception}
grpc.client.calls{method,status}
grpc.channel.state
Client logs should include:
- dependency name,
- method/operation,
- timeout budget,
- attempt count,
- error type,
- request ID,
- correlation ID,
- target host,
- not full payload.
Common Java issues:
- connection pool exhausted,
- DNS cached too long,
- connect timeout too high,
- response timeout missing,
- retry loop,
- request body not repeatable,
- gRPC channel stuck,
- TLS truststore issue.
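The "error type" field in client logs is easier to keep consistent if one function derives it from the exception. The sketch below assumes the JDK `java.net.http.HttpClient`, where a DNS failure typically surfaces as a `ConnectException` wrapping an `UnresolvedAddressException`; other HTTP clients throw different types, so the mapping would need adjusting. The class and enum names are hypothetical.

```java
import java.net.ConnectException;
import java.net.UnknownHostException;
import java.net.http.HttpConnectTimeoutException;
import java.net.http.HttpTimeoutException;
import java.nio.channels.UnresolvedAddressException;
import javax.net.ssl.SSLException;

/** Derives a stable error type label from a client-side exception and its causes. */
final class ClientErrorClassifier {

    enum ErrorType { DNS, CONNECT_TIMEOUT, RESPONSE_TIMEOUT, TLS, CONNECT_FAILED, OTHER }

    static ErrorType classify(Throwable error) {
        // First pass: the most specific signals anywhere in the cause chain.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof UnknownHostException || t instanceof UnresolvedAddressException) {
                return ErrorType.DNS;
            }
            if (t instanceof HttpConnectTimeoutException) {
                return ErrorType.CONNECT_TIMEOUT; // checked before its parent type
            }
            if (t instanceof HttpTimeoutException) {
                return ErrorType.RESPONSE_TIMEOUT;
            }
            if (t instanceof SSLException) {
                return ErrorType.TLS;
            }
        }
        // Second pass: generic connect failures such as connection refused.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectException) {
                return ErrorType.CONNECT_FAILED;
            }
        }
        return ErrorType.OTHER;
    }
}
```

Using the returned label as the `exception` tag on `http.client.errors` keeps metric cardinality bounded while still separating DNS, connect, TLS, and response timeouts.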
16. Timeout Debugging
When a timeout occurs:
Ask:
- which timeout fired?
- client connect timeout?
- client response timeout?
- gateway timeout?
- mesh route timeout?
- server execution timeout?
- DB timeout?
- downstream timeout?
- overall deadline?
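Example timeout configuration along the path:
client deadline 500ms
mesh per-try timeout 400ms
backend DB timeout 2s
This is inconsistent: the backend DB timeout (2s) is longer than the client deadline (500ms), so the database can keep working long after the client has given up.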
Timeout debugging requires timeline.
Use trace spans and logs with elapsed time.
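A budget check like the following can catch this kind of inconsistency before an incident. It is a minimal sketch, assuming the configured timeouts are listed from the outermost layer (the client) to the innermost; the class name and layer labels are illustrative.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Flags layers whose timeout is longer than the tightest timeout enclosing them. */
final class TimeoutBudgetCheck {

    /** @param outerToInner timeouts in call order, outermost layer first */
    static List<String> violations(LinkedHashMap<String, Duration> outerToInner) {
        List<String> result = new ArrayList<>();
        Duration tightestSoFar = null;
        for (Map.Entry<String, Duration> layer : outerToInner.entrySet()) {
            if (tightestSoFar != null && layer.getValue().compareTo(tightestSoFar) > 0) {
                // This layer may keep working after an outer layer has already given up.
                result.add(layer.getKey());
            } else {
                tightestSoFar = layer.getValue();
            }
        }
        return result;
    }
}
```

For the configuration above (client 500ms, mesh per-try 400ms, DB 2s) the check reports only the DB timeout, because the mesh timeout fits inside the client deadline while the DB timeout does not fit inside either.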
17. Retry Debugging
When traffic amplifies:
- client retries?
- gateway retries?
- mesh retries?
- server retries?
- async retry?
- external provider retry?
- load balancer retry?
Compute attempts:
client attempts × gateway attempts × mesh attempts × server attempts
If not controlled, a single request can become many upstream calls.
Retry metrics must be tagged by layer.
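The multiplication above is worth computing explicitly for each call path. A minimal sketch, assuming you know the maximum attempts configured at each layer (one attempt means no retries); the class name is hypothetical.

```java
/** Computes the worst-case number of upstream calls caused by one original request. */
final class RetryAmplification {

    /** @param attemptsPerLayer maximum attempts at each layer, where 1 means no retry */
    static long worstCaseUpstreamCalls(int... attemptsPerLayer) {
        long total = 1;
        for (int attempts : attemptsPerLayer) {
            if (attempts < 1) {
                throw new IllegalArgumentException("each layer makes at least one attempt");
            }
            total = Math.multiplyExact(total, (long) attempts); // fails loudly on overflow
        }
        return total;
    }
}
```

With 3 client attempts, 2 gateway attempts, 3 mesh attempts, and 1 server attempt, `worstCaseUpstreamCalls(3, 2, 3, 1)` returns 18: one user request can hit the backend 18 times while it is already struggling.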
18. Observability Correlation
Use consistent IDs:
- request ID,
- trace ID,
- correlation ID,
- causation ID,
- workflow ID,
- event ID if async side effect,
- region/cluster.
Headers:
X-Request-Id
traceparent
tracestate
X-Correlation-Id
X-Causation-Id
Do not rely on one ID for all purposes.
Request ID helps edge debugging.
Correlation ID helps business flow.
Trace ID helps distributed trace.
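To put the trace ID into logs next to the request and correlation IDs, the service has to extract it from `traceparent`. In practice a tracing library such as OpenTelemetry does this for you; the sketch below only illustrates the header layout for version 00 of W3C Trace Context (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). The class name is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the trace ID from a W3C traceparent header (version 00 only). */
final class TraceParent {

    private static final Pattern VERSION_00 =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    static Optional<String> traceId(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher matcher = VERSION_00.matcher(headerValue.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        String traceId = matcher.group(1);
        String parentId = matcher.group(2);
        // All-zero IDs are defined as invalid by the specification.
        if (isAllZeros(traceId) || isAllZeros(parentId)) {
            return Optional.empty();
        }
        return Optional.of(traceId);
    }

    private static boolean isAllZeros(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

A log line that carries the request ID, the correlation ID, and this trace ID can be joined with gateway access logs, business flow records, and distributed traces respectively.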
19. Golden Signals by Layer
Client
- request rate,
- error rate,
- latency,
- connection pool saturation,
- timeout/retry count.
Gateway
- route rate,
- auth failures,
- rate limits,
- upstream failures,
- gateway-generated 5xx.
Mesh/Proxy
- source/destination traffic,
- mTLS status,
- authz denies,
- retries/timeouts,
- outlier ejections,
- proxy resource usage.
Destination App
- inbound rate,
- handler latency,
- error classification,
- dependency errors,
- saturation.
Platform
- DNS health,
- endpoint count,
- pod readiness,
- network policy denies,
- node/network health.
Observe all layers.
20. Dashboard Design
Dashboard tabs:
- Dependency map.
- Gateway routes.
- Service-to-service mesh traffic.
- DNS/service endpoint health.
- Client dependency metrics.
- Destination service metrics.
- Auth/mTLS/security denies.
- Canary/version traffic.
- Egress dependencies.
- Multi-region traffic.
Each dashboard should answer:
where is traffic failing?
what changed recently?
which team owns the layer?
what is the business impact?
21. Dependency Graph
A dependency graph shows:
service A -> service B
service A -> external provider
gateway -> service C
Enhance with:
- rate,
- error rate,
- latency,
- mTLS,
- authz denies,
- retry count,
- version,
- region,
- owner.
Dependency graphs reveal hidden dependencies.
But they must be accurate.
Traffic observed by mesh/gateway can help build them.
22. Canary Debugging
When canary fails:
Check by version:
- error rate,
- p99 latency,
- gateway status,
- app status,
- business errors,
- DB errors,
- event DLQ,
- projection lag,
- auth denies,
- retries.
Questions:
- Is only v2 failing?
- Is route sending expected percentage?
- Are v2 pods ready?
- Are v2 labels correct?
- Is v2 schema compatible?
- Did v2 emit bad events?
- Is rollback data-safe?
Canary debugging must include async side effects too.
23. Egress Debugging
When an external call fails, check:
- DNS resolution,
- egress policy allows host,
- ServiceEntry exists,
- egress gateway route,
- TLS/SNI,
- client cert,
- API credential,
- provider status,
- rate limit,
- source IP allowlist,
- firewall/private link,
- circuit breaker,
- provider error body.
External failures often involve teams outside your platform.
Runbooks need contact paths.
24. Multi-Region Debugging
Check:
- source region,
- target region,
- global routing decision,
- owner region,
- replication lag,
- cross-region latency,
- failover status,
- data residency policy,
- mTLS trust domain,
- regional capacity,
- partial outage.
Do not debug a multi-region issue with global averages.
Always split by region/cluster.
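A small calculation shows why. The sketch below compares the global error rate with per-region rates; the region names and request counts in the usage note are made up for illustration, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Compares a global error rate with per-region error rates. */
final class RegionalErrorRates {

    /** Each value is a two-element array: {failedRequests, totalRequests}. */
    static double globalRate(Map<String, long[]> countsByRegion) {
        long failed = 0;
        long total = 0;
        for (long[] counts : countsByRegion.values()) {
            failed += counts[0];
            total += counts[1];
        }
        return rate(failed, total);
    }

    static Map<String, Double> ratePerRegion(Map<String, long[]> countsByRegion) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> entry : countsByRegion.entrySet()) {
            result.put(entry.getKey(), rate(entry.getValue()[0], entry.getValue()[1]));
        }
        return result;
    }

    private static double rate(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }
}
```

Suppose a large region serves 9,800 requests with no failures and a small region serves 200 requests that all fail. The global rate is 2%, which may sit below an alert threshold, while the small region is completely down.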
25. Runbook: HTTP 503
Steps:
- Identify response source: app, gateway, proxy.
- Check gateway/mesh response flags.
- Check destination app received request.
- Check Service endpoints.
- Check pod readiness.
- Check route/subset labels.
- Check outlier ejection/circuit breaking.
- Check recent deploy/config change.
- Check dependency saturation.
- Mitigate: rollback route, scale, pause canary, fix readiness.
503 is a symptom.
Find the generator.
26. Runbook: Timeout
Steps:
- Identify timeout layer.
- Check client timeout config.
- Check gateway/mesh timeout.
- Check backend latency.
- Check downstream dependency.
- Check retries increasing latency.
- Check connection pool acquisition.
- Check DNS/connect latency.
- Check saturation: CPU, DB, thread pool.
- Mitigate with fail-fast, rollback, throttle, circuit breaker.
Timeout without layer identification leads to bad fixes.
27. Runbook: 403 Denied
Steps:
- Identify whether 403 from gateway, mesh, or app.
- Capture source principal.
- Capture destination workload.
- Capture method/path.
- Check auth token/JWT.
- Check AuthorizationPolicy.
- Check trusted identity headers.
- Check direct vs gateway path.
- Check recent policy change.
- Mitigate with correct minimal allow rule or rollback.
Never fix a 403 with a broad wildcard allow unless it is emergency-approved and time-bounded.
28. Runbook: DNS Failure
Steps:
- Reproduce from pod.
- Check FQDN and namespace.
- Check CoreDNS health.
- Check DNS egress/network policy.
- Check Service exists.
- Check search domains.
- Check JVM cache if stale.
- Check node-level DNS issues.
- Check external DNS provider if external.
- Mitigate with rollback, local cache, or platform fix.
DNS failure can become cascading failure if clients retry aggressively.
29. Runbook: Bad Route/Canary
Steps:
- Set canary weight to 0 if users are impacted.
- Verify traffic returns to stable subset.
- Check subset labels.
- Check route match priority.
- Check metrics by version.
- Check bad version side effects/events.
- Check schema/database compatibility.
- Preserve logs/traces for analysis.
- Fix config in code.
- Add route test.
Traffic rollback is fast.
Data rollback may not be.
30. Runbook: Egress Provider Down
Steps:
- Confirm provider-specific metrics.
- Check DNS/TLS/auth/rate limit.
- Check egress gateway health.
- Check provider status.
- Open circuit/enable degradation.
- Queue async work if supported.
- Stop retry storm.
- Communicate feature degradation.
- Reconcile after provider recovers.
- Review timeout/retry/bulkhead.
External outages should not take down unrelated internal services.
31. Testing Debuggability
Test not only behavior, but debuggability.
Examples:
- route failure includes route name,
- authz deny logs policy name,
- timeout logs dependency and timeout type,
- proxy access logs include request ID,
- app logs include correlation ID,
- canary metrics include version,
- egress errors include provider name,
- DNS errors include hostname.
If incidents cannot be diagnosed quickly, observability is incomplete.
32. Observability Contract
Every service should expose:
communicationObservability:
  inbound:
    requestRate: true
    errorRate: true
    duration: true
    requestId: true
    traceId: true
  outbound:
    dependencyMetrics: true
    timeoutType: true
    retryCount: true
    connectionPool: true
  platform:
    routeName: true
    sourcePrincipal: true
    destinationPrincipal: true
    mTls: true
  logs:
    structured: true
    payloadLogging: disabled
This is a contract between app and platform teams.
33. Synthetic Probes
Use probes for:
- gateway route health,
- internal service path,
- egress provider sandbox,
- cross-region route,
- mTLS/authz test,
- canary route,
- DNS resolution.
Probes should test meaningful paths.
Not only:
GET /health
But:
authenticated GET /cases/{synthetic-id}
when safe.
Synthetic probes catch routing/auth failures before users do.
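A probe for the authenticated path might build its request as follows. This is a sketch using the JDK `java.net.http` API; the base URL, the synthetic case ID, the 2-second timeout, and the way the token is obtained are all assumptions, and the request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.UUID;

/** Builds a synthetic probe request that exercises routing and authentication. */
final class SyntheticProbe {

    static HttpRequest caseLookup(String baseUrl, String syntheticCaseId, String bearerToken) {
        // A recognizable request ID lets gateway and app logs be filtered for probe traffic.
        String requestId = "probe-" + UUID.randomUUID();
        return HttpRequest.newBuilder(URI.create(baseUrl + "/cases/" + syntheticCaseId))
                .timeout(Duration.ofSeconds(2)) // assumed probe budget
                .header("Authorization", "Bearer " + bearerToken)
                .header("X-Request-Id", requestId)
                .GET()
                .build();
    }
}
```

Sending this with `HttpClient.send` and recording the status, latency, and request ID exercises DNS, the gateway route, authentication, and the handler in one call, which `GET /health` does not.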
34. Change Correlation
Many communication incidents are caused by config changes.
Correlate alerts with:
- deployment,
- route change,
- mesh policy change,
- gateway config change,
- DNS change,
- certificate rotation,
- NetworkPolicy change,
- secret rotation,
- schema migration,
- HPA scaling,
- node upgrade.
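The correlation itself is a time-window filter over a change log. The sketch below assumes change events from the sources listed above have already been collected into one list; the `Change` type, its fields, and the lookback window are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Finds changes applied shortly before an alert started. */
final class ChangeCorrelation {

    static final class Change {
        final String kind;      // for example "route change" or "certificate rotation"
        final String target;    // what was changed
        final Instant appliedAt;

        Change(String kind, String target, Instant appliedAt) {
            this.kind = kind;
            this.target = target;
            this.appliedAt = appliedAt;
        }
    }

    static List<Change> candidates(List<Change> changes, Instant alertStart, Duration lookback) {
        Instant windowStart = alertStart.minus(lookback);
        List<Change> result = new ArrayList<>();
        for (Change change : changes) {
            boolean inWindow = !change.appliedAt.isBefore(windowStart)
                    && !change.appliedAt.isAfter(alertStart);
            if (inWindow) {
                result.add(change);
            }
        }
        // Most recent change first: it is usually the first one to examine.
        result.sort((a, b) -> b.appliedAt.compareTo(a.appliedAt));
        return result;
    }
}
```

The lookback should be generous for slow-acting changes: a certificate rotation or a DNS change can take effect well after it was applied.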
Dashboards should show recent changes.
"Nothing changed" is often false.
35. Production Policy Template
platformCommunicationObservability:
  requiredSignals:
    dns:
      - lookupFailures
      - corednsErrors
    kubernetesService:
      - endpointCount
      - readinessTransitions
    gateway:
      - routeMetrics
      - upstreamStatus
      - authFailures
      - rateLimits
      - timeouts
    mesh:
      - sourceDestinationMetrics
      - mtlsStatus
      - authzDenies
      - responseFlags
      - retries
      - outlierEjections
    javaClient:
      - dependencyLatency
      - connectionPoolMetrics
      - timeoutType
      - retryCount
    egress:
      - providerLatency
      - tlsFailures
      - rateLimits
      - circuitState
  logs:
    structured: true
    requestIdRequired: true
    correlationIdRequired: true
    payloadLogging: disabled
  runbooks:
    required:
      - http503
      - timeout
      - dnsFailure
      - authzDeny
      - badCanary
      - egressProviderDown
  testing:
    routeTestsRequired: true
    authzNegativeTestsRequired: true
    syntheticProbesRequired: true
    debuggabilityTestsRequired: true
If this policy is missing, incidents become archaeology.
36. Common Anti-Patterns
36.1 Debugging only app logs
Failure may be DNS/gateway/mesh.
36.2 No response source identification
503 origin unclear.
36.3 No route/version labels
Canary invisible.
36.4 No authz deny logs
403 impossible to debug.
36.5 No endpoint count metrics
Zero-endpoint Service discovered by users.
36.6 Payload logs for debugging
Security risk.
36.7 Global averages in multi-region
Regional outage hidden.
36.8 Retry count not visible
Amplification hidden.
36.9 No synthetic probes
Route/auth failures found by customers.
36.10 No runbooks
Incident response improvisation.
37. Decision Model
Debug by narrowing the failing segment.
38. Design Checklist
Before declaring platform communication operable:
- Can you identify error source?
- Are DNS failures visible?
- Are Service endpoints monitored?
- Are readiness transitions visible?
- Are gateway route metrics available?
- Are mesh response flags logged?
- Are authz denies explainable?
- Are mTLS failures visible?
- Are Java client connection pools monitored?
- Are timeout types distinguishable?
- Are retries counted by layer?
- Are egress providers named in metrics?
- Are region/cluster labels present?
- Are synthetic probes configured?
- Are runbooks written and tested?
- Are recent config changes visible?
- Are logs structured and redacted?
39. The Real Lesson
Platform-mediated communication adds power and layers.
Those layers must be observable.
A mature system can answer quickly:
where did the request go?
which policy applied?
which identity was used?
which route matched?
which backend received it?
which layer returned the error?
what changed recently?
who owns the fix?
If you cannot answer those, service discovery, gateways, and mesh have become hidden complexity.
Production readiness means every routing layer is debuggable.
References
- Kubernetes Services: https://kubernetes.io/docs/concepts/services-networking/service/
- Kubernetes DNS for Services and Pods: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
- Kubernetes Gateway API: https://kubernetes.io/docs/concepts/services-networking/gateway/
- Istio Operations and Deployment Architecture: https://istio.io/latest/docs/ops/deployment/architecture/
- Istio Security Concepts: https://istio.io/latest/docs/concepts/security/
- Envoy Access Logging: https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage
- Envoy Response Flags: https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage#response-flags