Production Architecture Review and Decision Framework

Learn Kubernetes Networking, Gateway API, Service Mesh, and Multi-Cluster Traffic Engineering - Part 034

Production architecture review and decision framework for Kubernetes networking platforms, including CNI, Gateway API, ingress, service mesh, mTLS, egress, multi-cluster, observability, cost, ownership, migration, risk, and regulatory defensibility.

2026-07-01 · 22 min read · 4317 words

Part 034 — Production Architecture Review and Decision Framework

1. Goals of This Part

Part 033 covered failure models, chaos testing, and debugging playbooks. This part answers a more strategic question:

How do we assess whether a Kubernetes networking architecture is production-ready, safe to operate, scalable, auditable, and able to evolve without turning into a pile of exceptions?

The target for this part:

You can lead an architecture review for a Kubernetes networking platform that covers CNI, Service, DNS, Gateway API, ingress, service mesh, mTLS, NetworkPolicy, egress, multi-cluster, observability, ownership, migration, cost, risk, and regulatory defensibility.

This part is not an ordinary checklist template. It is a decision framework.

We will build a way of thinking that answers:

Do we need a service mesh?
Is Gateway API enough, or do we still need an external API gateway?
Does multi-cluster solve reliability, or does it only add failure modes?
Which CNI fits our security and traffic model?
Where does TLS terminate?
How do we control egress without killing developer productivity?
How does the platform team provide guardrails without becoming a bottleneck?
How do we prove that this design is defensible in an audit or regulatory case?

2. Source Anchors

This material draws on the following primary references:

Kubernetes Services, Load Balancing, and Networking — https://kubernetes.io/docs/concepts/services-networking/
Kubernetes Gateway API — https://kubernetes.io/docs/concepts/services-networking/gateway/
Gateway API official docs — https://gateway-api.sigs.k8s.io/
Gateway API Conformance — https://gateway-api.sigs.k8s.io/concepts/conformance/
Gateway API Policy Attachment — https://gateway-api.sigs.k8s.io/reference/policy-attachment/
Kubernetes NetworkPolicy — https://kubernetes.io/docs/concepts/services-networking/network-policies/
Kubernetes EndpointSlices — https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/
SIG Multicluster MCS API — https://multicluster.sigs.k8s.io/concepts/multicluster-services-api/
Istio Traffic Management — https://istio.io/latest/docs/concepts/traffic-management/
Istio Security — https://istio.io/latest/docs/concepts/security/
Istio Deployment Models — https://istio.io/latest/docs/ops/deployment/deployment-models/
Linkerd Architecture — https://linkerd.io/2-edge/reference/architecture/
Cilium Service Mesh — https://docs.cilium.io/en/stable/network/servicemesh/
AWS Well-Architected Framework — https://aws.amazon.com/architecture/well-architected/
AWS Well-Architected Reliability Pillar — https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html

Key facts that serve as anchors:

Kubernetes networking primitives do not automatically determine the security, observability, or ownership model; those must be designed.
Gateway API is a role-oriented, extensible model for routing and infrastructure provisioning, but behavior still depends on the implementation/controller.
NetworkPolicy controls L3/L4 traffic only if the CNI supports enforcement.
A service mesh adds policy, traffic management, telemetry, and identity, but it also adds a control plane, a data plane, and operational cost.
Multi-cluster is a boundary design; it is not automatic high availability without data, identity, routing, and operational readiness.

3. Kaufman Framing: Architecture Review as Deliberate Practice

In Kaufman's framework, the goal of learning is not to accumulate passive knowledge. The goal is to reach a clearly defined level of performance.

For this part, the performance target is:

Given a Kubernetes networking proposal,
you can identify hidden assumptions,
separate mandatory requirements from nice-to-have features,
compare options using explicit criteria,
predict failure modes,
define guardrails,
and produce a decision record that another senior engineer can audit.

Architecture review is deliberate practice because it forces you to do four things:

Deconstruct the problem into boundaries, traffic classes, identity, policy, reliability, and ownership.
Compare options based on constraints, not hype.
Predict failure modes before a production incident.
Encode decisions into ADRs, runbooks, guardrails, and tests.

4. Review Model: Seven Planes

Use the seven-plane model to assess a design.

A design is weak if it optimizes only one plane.

Example:

“We use a mesh, so traffic is secure.”

This is incomplete. Mesh might help identity and policy, but you still need:

certificate lifecycle
enrollment boundaries
bypass prevention
observability
emergency break-glass
policy ownership
performance budget
upgrade model

5. Architecture Review Intake

Before comparing tools, collect requirements.

5.1 Workload Profile

Question	Why It Matters
Is traffic mostly HTTP/gRPC, TCP, UDP, or mixed?	Determines Gateway/mesh/L4 needs
Are connections short-lived or long-lived?	Affects draining, retries, load balancing
Are requests idempotent?	Affects retry safety
Is the traffic latency-sensitive?	Affects proxy overhead and cross-zone routing
Is traffic internal, external, partner, or regulated?	Affects trust boundaries
Does service call graph change often?	Affects policy automation
Are clients inside or outside mesh?	Affects mTLS and routing design
Are there legacy protocols?	Affects L7 routing and observability

5.2 Organizational Profile

Question	Why It Matters
Who owns cluster networking?	CNI and dataplane operations
Who owns public routes?	Gateway and DNS governance
Who owns service-to-service policy?	Security/app/platform split
Who approves egress?	Compliance and vendor dependency
Who handles incidents?	Runbook and access model
Who can create cross-namespace references?	Multi-tenant safety
Who can change global traffic weights?	Failover and blast radius

5.3 Non-Functional Requirements

Capture explicit targets:

availability: 99.9 / 99.95 / 99.99
p95 latency budget: <ms>
p99 latency budget: <ms>
RTO: <minutes>
RPO: <minutes>
max regional failover time: <minutes>
max route rollout blast radius: <% traffic>
source IP audit requirement: yes/no
mTLS requirement: none/internal/all/regulated-only
egress allowlist requirement: domain/IP/proxy/vendor-specific
log retention: <days>
trace sampling: <%>

Without numbers, architecture review becomes preference debate.
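
To make the availability targets concrete, it can help to translate each one into an allowed-downtime budget. The sketch below assumes a 30-day measurement window; the function name and the window length are illustrative choices, not part of any standard.

```python
# Minutes in a 30-day window. The window length is an assumption;
# use whatever period your SLO is actually measured over.
WINDOW_MINUTES = 30 * 24 * 60


def downtime_budget_minutes(availability_percent: float,
                            window_minutes: int = WINDOW_MINUTES) -> float:
    """Return the downtime (in minutes) that an availability target allows."""
    if not 0 < availability_percent < 100:
        raise ValueError("availability must be between 0 and 100 (exclusive)")
    return window_minutes * (100 - availability_percent) / 100


for target in (99.9, 99.95, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes of downtime per 30 days")
```

Under this assumption, 99.9% leaves roughly 43 minutes per 30 days and 99.99% leaves roughly 4 minutes, which is a useful sanity check against the stated RTO and maximum regional failover time.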

6. Decision 1: CNI and Dataplane

6.1 Core Question

What dataplane capabilities do we need, and who can operate them under failure?

6.2 Decision Criteria

Criteria	Why It Matters
Kubernetes NetworkPolicy support	baseline microsegmentation
egress policy support	external dependency control
L7 policy support	HTTP/gRPC/Kafka/DNS-aware controls
eBPF visibility	flow-level debugging
kube-proxy replacement	performance and dataplane simplification
cloud-native IPAM	VPC integration and routability
encryption	node-to-node/pod-to-pod traffic protection
BGP support	bare-metal/on-prem routing
multi-cluster support	cluster mesh and global service discovery
operational maturity	upgrade, debug, and incident response

6.3 Option Pattern

Option	Strength	Risk
Simple overlay CNI	easy to understand	limited policy/visibility/performance
Cloud VPC CNI	native cloud routability	IP exhaustion, cloud-specific coupling
Calico-style policy/routing	strong policy and BGP patterns	operational complexity
Cilium/eBPF	observability, policy, kube-proxy replacement	eBPF/kernel/debug maturity requirement

6.4 Review Questions

Can we enforce default-deny policies?
Can we debug dropped packets with evidence?
Can app teams understand failure messages?
Can we roll back CNI upgrades safely?
What kernel versions are supported?
What happens when CNI agent fails on one node?
Do we need source IP preservation?
Do we rely on implementation-specific CRDs?

6.5 Red Flags

- “We chose it because it is fastest” without operational evidence.
- No test for NetworkPolicy enforcement.
- No rollback plan for CNI upgrade.
- No node-level troubleshooting access.
- No IPAM exhaustion alert.
- No flow visibility for production incidents.

7. Decision 2: Ingress, Gateway API, and API Gateway

7.1 Core Question

Which layer owns external request admission, routing, security, and API product concerns?

Do not collapse these into one word: “gateway”.

There are at least three concerns:

Concern	Typical Tool
Kubernetes route programming	Gateway API / Ingress
Edge security and global traffic	CDN/WAF/cloud LB/GSLB
API product management	API gateway/developer portal/quota/API keys

7.2 Ingress vs Gateway API

Dimension	Ingress	Gateway API
API shape	simple HTTP ingress	role-oriented resource model
Extensibility	annotations	typed resources, policy attachment, extensions
Multi-tenancy	weaker	stronger listener/route/delegation model
Protocol support	mostly HTTP/HTTPS	HTTP, gRPC, TLS, TCP, UDP depending on support
Status model	controller-specific	more explicit conditions
Portability	annotation-dependent	conformance-oriented, still implementation-dependent

7.3 Gateway API Review Questions

Which GatewayClass is platform-approved?
Who can create Gateway objects?
Who can attach Route objects?
Is cross-namespace attachment controlled?
Are ReferenceGrant objects reviewed?
Are policies inherited or direct?
Which features are Core, Extended, or implementation-specific?
How is conformance tested before upgrades?
How do we detect conflicting routes?
How do we roll back route changes?
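
One way to approach the conflicting-routes question is a pre-merge lint over the route inventory. The sketch below is a simplified illustration: `RouteClaim` and `find_conflicts` are hypothetical names, the input is assumed to be extracted from HTTPRoute manifests beforehand, and it does not replicate the precedence rules a Gateway controller applies at runtime.

```python
from collections import defaultdict
from typing import NamedTuple


class RouteClaim(NamedTuple):
    """A flattened view of one (hostname, path prefix) claimed by a route."""
    namespace: str
    name: str
    hostname: str
    path_prefix: str


def find_conflicts(claims):
    """Return (hostname, path_prefix) keys claimed from more than one namespace."""
    owners = defaultdict(set)
    for claim in claims:
        owners[(claim.hostname.lower(), claim.path_prefix)].add(claim.namespace)
    return {key: sorted(ns) for key, ns in owners.items() if len(ns) > 1}


claims = [
    RouteClaim("payments", "pay-api", "api.example.com", "/pay"),
    RouteClaim("checkout", "pay-shadow", "API.example.com", "/pay"),
    RouteClaim("catalog", "catalog-api", "api.example.com", "/catalog"),
]
for (host, path), namespaces in find_conflicts(claims).items():
    print(f"conflict on {host}{path}: claimed by {namespaces}")
```

A check like this fits in CI before merge; runtime detection still needs route status conditions and alerts from the chosen controller.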

7.4 API Gateway Still Needed?

Gateway API may not replace an API gateway if you need:

developer portal
API product lifecycle
subscription plans
API keys
monetization
request/response transformation at product layer
partner onboarding workflows
detailed consumer analytics
schema validation as product contract
legacy auth integration

But avoid duplicating policy in too many layers:

CDN/WAF denies
API gateway rewrites
Gateway API routes
mesh authorizes
application authorizes

This can be valid, but only if each layer has clear ownership and evidence.

8. Decision 3: Service Mesh

8.1 Core Question

Do we need a service mesh, or do we need one mesh capability?

Common mesh capabilities:

mTLS
workload identity
authorization
traffic splitting
retries/timeouts
circuit breaking
telemetry
egress control
service discovery extension
multi-cluster connectivity

Do not adopt a full mesh only because one feature is attractive.

8.2 Mesh Fit Matrix

Requirement	Mesh Fit
all internal service-to-service traffic must use mTLS	strong fit
app teams need consistent retries/timeouts	medium/strong fit
only public ingress routing needed	weak fit
traffic is mostly raw TCP/UDP	depends on mesh implementation
strict identity-based auth between services	strong fit
team cannot operate proxy/control plane	weak fit
latency budget is extremely tight	requires benchmark
many legacy non-mesh clients	migration complexity

8.3 Sidecar vs Ambient vs Sidecarless

Model	Strength	Risk
Sidecar	mature L7 feature isolation per workload	resource overhead, injection complexity
Ambient/waypoint	reduced sidecar overhead, simpler enrollment	policy placement complexity, newer operations
eBPF/node-assisted	efficient dataplane, strong visibility potential	implementation-specific behavior, kernel coupling
Lightweight sidecar	low overhead, simpler semantics	fewer advanced L7 features

8.4 Mesh Review Questions

What exact mesh features are mandatory?
Which traffic remains outside mesh?
How is bypass prevented or detected?
How are certificates issued and rotated?
How do we handle non-mesh clients?
Who owns mesh policy?
Who owns mesh upgrades?
How are proxy resources budgeted?
How is xDS/config staleness detected?
What happens if the mesh control plane is unavailable?
How do we roll back a bad mesh policy?

8.5 Mesh Red Flags

- Mesh installed before ownership model exists.
- STRICT mTLS enabled without dependency inventory.
- App teams cannot read proxy errors.
- No resource budget for sidecars/waypoints.
- No emergency exception process.
- No plan for long-lived connections.
- No compatibility test for batch/cron jobs.

9. Decision 4: Identity, TLS, mTLS, and Trust Domains

9.1 Core Question

What identity does a workload present, who vouches for it, and who trusts it?

An IP address is a weak identity in dynamic Kubernetes environments, because Pod IPs are reassigned as workloads reschedule.

9.2 Identity Layers

Layer	Example
User identity	end-user JWT/session
Client app identity	OAuth client/service account
Workload identity	SPIFFE ID / mesh identity
Node identity	cloud instance identity / kubelet cert
Cluster identity	cluster trust domain
External partner identity	client certificate/API credential

9.3 TLS Termination Review

Termination Point	Benefit	Risk
CDN/WAF	edge protection	plaintext after edge unless re-encrypted
Cloud LB	managed cert/LB integration	limited app context
Gateway	Kubernetes route-level control	Secret ownership and controller blast radius
Mesh proxy	service-to-service identity	proxy dependency
App	end-to-end app control	duplicated TLS operations

9.4 Review Questions

Where does TLS terminate for public traffic?
Is backend traffic re-encrypted?
Is mTLS required internally?
What is the trust domain naming scheme?
How are trust bundles distributed?
How are cert expiry and rotation monitored?
How do we revoke a compromised workload identity?
How are cross-cluster identities federated?
What evidence proves traffic was encrypted?

9.5 Red Flags

- “TLS is handled by the load balancer” but backend is plaintext across untrusted networks.
- `curl -k` used in production checks.
- no certificate expiry alert.
- no owner for trust bundle rotation.
- mTLS exceptions undocumented.
- namespace name used as sole security identity.
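
The missing expiry alert is one of the easier red flags to close. As a minimal sketch, assuming certificate `notAfter` timestamps have already been collected into a dictionary (the certificate names and dates below are made up for illustration):

```python
from datetime import datetime, timedelta, timezone


def expiring_certs(not_after_by_name, now, warn_days=30):
    """Return (name, days_left) for certs expiring within warn_days, soonest first.

    Already-expired certificates are included with a negative days_left.
    """
    cutoff = now + timedelta(days=warn_days)
    hits = [
        (name, (not_after - now).days)
        for name, not_after in not_after_by_name.items()
        if not_after <= cutoff
    ]
    return sorted(hits, key=lambda item: item[1])


# Hypothetical inventory; real data would come from your cert telemetry.
now = datetime(2026, 7, 1, tzinfo=timezone.utc)
certs = {
    "public-gateway": datetime(2026, 7, 10, tzinfo=timezone.utc),
    "mesh-root-bundle": datetime(2027, 1, 1, tzinfo=timezone.utc),
    "partner-client-ca": datetime(2026, 6, 25, tzinfo=timezone.utc),
}
for name, days_left in expiring_certs(certs, now):
    print(f"{name}: {days_left} days until expiry")
```

The threshold of 30 days is an arbitrary default; it should be longer than the time your rotation runbook actually takes.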

10. Decision 5: NetworkPolicy and Microsegmentation

10.1 Core Question

What is the minimum network access each workload needs, and how do we prove policy is enforced?

10.2 Policy Maturity Levels

Level	Description
0	default allow, no visibility
1	default allow, observe flows
2	default deny for selected namespaces
3	default deny broadly, explicit service dependencies
4	automated policy generation/review with drift detection
5	identity-aware and L7-aware policy with audit evidence

10.3 Review Questions

Does the CNI enforce NetworkPolicy?
Are namespaces labeled consistently?
Is default-deny applied gradually?
Are DNS, metrics, health checks, and mesh dependencies allowed?
Are policies generated from observed flows or manually written?
How are unused allows removed?
How are emergency exceptions created and expired?
How are policy decisions audited?

10.4 Microsegmentation Design

Recommended rollout:

observe → model dependencies → simulate → default deny non-critical namespace → test → enforce → expand → audit drift

Do not start with all namespaces at once.
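
For the first default-deny namespace, a policy that denies everything except DNS is a common starting point. The sketch below builds such a manifest as a Python dictionary and prints it as JSON. The `k8s-app: kube-dns` Pod label and the `kube-system` namespace are assumptions about where cluster DNS runs; verify both in your cluster, and note that a NodeLocal DNSCache setup would need a different rule.

```python
import json


def default_deny_with_dns(namespace: str) -> dict:
    """Build a NetworkPolicy that denies all ingress and all egress except DNS."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-allow-dns", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every Pod in the namespace.
            "podSelector": {},
            # Listing both types with no ingress rules denies all ingress.
            "policyTypes": ["Ingress", "Egress"],
            "egress": [{
                "to": [{
                    # Assumption: cluster DNS runs in kube-system with this label.
                    "namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
                    },
                    "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                }],
                "ports": [
                    {"protocol": "UDP", "port": 53},
                    {"protocol": "TCP", "port": 53},
                ],
            }],
        },
    }


print(json.dumps(default_deny_with_dns("sandbox"), indent=2))
```

The policy only has an effect if the CNI enforces NetworkPolicy, so pair it with a connectivity test that proves a non-DNS connection is actually dropped.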

10.5 Red Flags

- default deny applied without DNS exception.
- selectors depend on unstable labels.
- namespace selectors too broad.
- policy YAML reviewed without traffic evidence.
- no way to tell whether a packet was denied by policy.

11. Decision 6: Egress Control

11.1 Core Question

How do workloads reach external dependencies, and how do we prevent uncontrolled data movement?

Egress is usually harder than ingress because external dependencies are less standardized.

11.2 Egress Options

Option	Strength	Risk
Node SNAT only	simple	weak audit, source IP drift
NAT gateway	stable-ish cloud path	port exhaustion, cost, limited identity
Static egress IP	vendor allowlist friendly	scaling and failover complexity
HTTP proxy	strong audit and policy	app compatibility, proxy bottleneck
Mesh egress gateway	identity-aware egress	mesh dependency and config complexity
Private connectivity	avoids public internet	provider-specific, route complexity
FQDN policy	developer-friendly	DNS drift and wildcard risk

11.3 Review Questions

Which workloads can access the internet?
Are external dependencies inventoried?
Is source IP stable where vendors require allowlisting?
Are domains validated beyond DNS names?
Is TLS inspection used? If yes, how is trust handled?
How is NAT port exhaustion monitored?
How are emergency egress exceptions approved?
Are egress logs retained for audit?
Does egress route through a single bottleneck?

11.4 Red Flags

- all pods can reach internet by default.
- vendor allowlist uses node IPs that autoscale unpredictably.
- wildcard FQDN allows broad exfiltration.
- no egress logs.
- no ownership for external dependency registry.
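
The wildcard FQDN red flag can be partly caught with a lint over the egress allowlist. This is a rough sketch: the `SHARED_SUFFIXES` set is a small illustrative sample of multi-tenant hosting domains, not a complete list, and the threshold of two fixed labels is an assumption to tune.

```python
# Illustrative sample of suffixes where anyone can register a subdomain.
SHARED_SUFFIXES = {"amazonaws.com", "github.io", "cloudfront.net"}


def risky_patterns(patterns, min_fixed_labels=2):
    """Map each risky wildcard pattern to a short reason string."""
    risky = {}
    for pattern in patterns:
        if "*" not in pattern:
            continue  # exact hostnames are out of scope for this lint
        labels = pattern.lower().strip(".").split(".")
        fixed = [label for label in labels if "*" not in label]
        suffix = ".".join(fixed)
        if len(fixed) < min_fixed_labels:
            risky[pattern] = "too few fixed labels"
        elif any(suffix == s or suffix.endswith("." + s) for s in SHARED_SUFFIXES):
            risky[pattern] = "wildcard over a shared multi-tenant suffix"
    return risky


allowlist = ["api.vendor.example", "*.vendor.example", "*.com", "*.s3.amazonaws.com"]
for pattern, reason in risky_patterns(allowlist).items():
    print(f"{pattern}: {reason}")
```

A lint like this narrows the review; it does not replace an owner for the external dependency registry.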

12. Decision 7: Multi-Cluster

12.1 Core Question

What boundary does multi-cluster create, and what failure does it actually solve?

Multi-cluster can solve:

regional availability
blast radius isolation
compliance/data residency
cluster upgrade isolation
team/environment separation
capacity scaling

Multi-cluster can create:

split-brain
inconsistent policy
stale service discovery
cross-region latency
data consistency conflict
certificate federation complexity
failover that overloads surviving region

12.2 Multi-Cluster Pattern Matrix

Pattern	Use Case	Risk
Active-passive	DR, strict primary data ownership	failover rehearsal required
Active-active stateless	global latency, availability	capacity and routing complexity
Active-active stateful	rare and hard	data consistency and conflict
Cluster per region	locality and isolation	global governance complexity
Cluster per tenant	isolation	operational sprawl
Cluster per lifecycle	upgrade safety	environment drift

12.3 Review Questions

Why do we need multiple clusters?
Are Pod/Service CIDRs non-overlapping?
Is namespace sameness required?
Is service discovery MCS-based, DNS-based, mesh-based, or custom?
How does failover work?
What health signal triggers failover?
Does data fail over too?
Is spare capacity reserved?
How are policies synchronized?
How are trust domains federated?
Can we test regional isolation safely?

12.4 Red Flags

- “Multi-cluster means HA” with no data/RTO/RPO plan.
- global DNS failover based only on Gateway health.
- no capacity in secondary region.
- cross-cluster mTLS not tested.
- overlapping CIDRs.
- manual failover runbook not rehearsed.
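
Overlapping CIDRs are cheap to detect before clusters are connected. A minimal sketch using the standard library `ipaddress` module follows; the cluster names and ranges are hypothetical.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cidrs):
    """Return sorted name pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(sorted(nets.items()), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan for two clusters.
plan = {
    "cluster-a/pods": "10.0.0.0/16",
    "cluster-a/services": "10.96.0.0/12",
    "cluster-b/pods": "10.1.0.0/16",
    "cluster-b/services": "10.96.0.0/12",
}
for a, b in overlapping_cidrs(plan):
    print(f"overlap: {a} and {b}")
```

Whether an overlap is acceptable depends on the connectivity model: it matters most when Pod or Service IPs are routed directly between clusters.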

13. Decision 8: Resilience Policies

13.1 Core Question

Where are timeouts, retries, circuit breakers, and load shedding defined, and are they consistent with application semantics?

13.2 Policy Ownership

Policy	Owner Candidates
global edge timeout	platform/SRE
route timeout	app/platform shared
service retry	app owner with platform guardrail
circuit breaker	app/SRE shared
rate limit	platform/API/security
load shedding	app/SRE
failover	platform/SRE/business owner

13.3 Review Questions

Are all retries idempotency-aware?
Is there a retry budget?
Are timeouts ordered correctly from client to backend?
Is circuit breaker configured by observed capacity?
Are load shedding responses explicit?
Are retry storms visible in metrics?
Are policies applied in app, mesh, gateway, or all three?
Is there a documented precedence model?
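
The reason the "app, mesh, gateway, or all three" question matters is that retries at stacked layers multiply. As a worked example, assuming every layer retries independently and each attempt fails (the layer names and retry counts are illustrative):

```python
from math import prod


def worst_case_attempts(retries_per_layer):
    """Worst-case backend attempts for one client request.

    Each layer makes 1 original attempt plus its retries, and every attempt
    at an outer layer can trigger the full set of attempts at inner layers.
    """
    if any(r < 0 for r in retries_per_layer.values()):
        raise ValueError("retry counts must be non-negative")
    return prod(1 + r for r in retries_per_layer.values())


layers = {"gateway": 2, "mesh": 2, "app client": 2}
print(f"worst case: {worst_case_attempts(layers)} backend attempts per request")
```

With two retries at each of three layers, a single request can become 27 backend attempts, which is why a retry budget and a documented precedence model belong in the review.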

13.4 Timeout Ladder

Example:

client request timeout: 10s
edge gateway timeout: 9s
internal gateway timeout: 8s
service mesh request timeout: 7s
application handler timeout: 6s
database query timeout: 5s

The exact values depend on workload. The invariant is:

Inner dependencies should fail before outer callers give up; otherwise capacity is wasted on work nobody is waiting for, and errors become ambiguous.
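
The ladder invariant is easy to check mechanically. The sketch below assumes the ladder is written down as an ordered list from the outermost caller to the innermost dependency; the function name is hypothetical and the values are the example ladder from this section.

```python
def ladder_violations(ladder):
    """Return a message for each inner timeout that is not shorter than its caller's.

    `ladder` is a list of (name, timeout_seconds), outermost caller first.
    """
    problems = []
    for (outer, t_outer), (inner, t_inner) in zip(ladder, ladder[1:]):
        if t_inner >= t_outer:
            problems.append(
                f"{inner} ({t_inner}s) does not fail before {outer} ({t_outer}s)"
            )
    return problems


ladder = [
    ("client request", 10),
    ("edge gateway", 9),
    ("internal gateway", 8),
    ("service mesh request", 7),
    ("application handler", 6),
    ("database query", 5),
]
print(ladder_violations(ladder) or "timeout ladder is ordered correctly")
```

The check only compares adjacent configured values; it does not account for retries, which can make the effective inner duration longer than a single timeout.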

14. Decision 9: Observability and Evidence

14.1 Core Question

Can we prove what happened to a request across route, service, workload, node, cluster, and policy boundary?

14.2 Required Dimensions

Every production request log/metric/trace should be able to answer:

source identity
source namespace
destination service
route name
Gateway name
backend version
cluster
region/zone
response code
response flag
latency
retry count
policy decision
mTLS mode
trace ID

Not every signal must include every field, but the observability model must allow correlation.
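
A small check over sample access-log records can show whether correlation is possible in practice. The field names below are hypothetical; map them to whatever your gateway and proxy logs actually emit.

```python
# Hypothetical minimum set of fields needed to correlate one request.
REQUIRED_FIELDS = (
    "trace_id",
    "source_identity",
    "route",
    "gateway",
    "backend_version",
    "cluster",
    "response_code",
)


def missing_dimensions(record):
    """Return required fields that are absent or empty in a log record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


sample = {
    "trace_id": "4bf92f3577b34da6",
    "source_identity": "spiffe://example.org/ns/web/sa/frontend",
    "route": "",
    "gateway": "public-gw",
    "cluster": "prod-a",
    "response_code": 503,
}
print("missing:", missing_dimensions(sample))
```

Running this against a sample from each log source (gateway, mesh proxy, application) shows which layer drops the fields needed for end-to-end debugging.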

14.3 Observability Review Questions

Can we identify which HTTPRoute served a request?
Can we identify backend version for canary traffic?
Can we identify policy denies?
Can we identify DNS latency separately from app latency?
Can we identify cross-zone/cross-region traffic?
Can we distinguish Gateway 503 from app 503?
Can we debug one user request end-to-end?
Can we export an incident evidence bundle?

14.4 Red Flags

- app metrics only, no gateway/proxy metrics.
- high-cardinality labels added without budget.
- traces sampled so aggressively that incidents disappear.
- no flow logs for policy denied traffic.
- no route/backend labels in access logs.

15. Decision 10: Ownership and Governance

15.1 Core Question

Who is allowed to change traffic behavior, and how is that change reviewed, limited, and audited?

15.2 Ownership Matrix

Resource	Platform	App Team	Security	SRE
CNI config	owner	consulted	consulted	consulted
GatewayClass	owner	consumer	consulted	consulted
Shared Gateway	owner	attach routes	consulted	consulted
HTTPRoute	guardrail	owner	consulted for public/sensitive	consulted
ReferenceGrant	approve/control	request	approve sensitive refs	consulted
NetworkPolicy	guardrail	define dependency	approve model	observe
AuthZ policy	platform/security	service owner input	owner	consulted
Egress allowlist	platform/security	request	owner	observe
Failover weights	platform/SRE	consulted	consulted	owner
Mesh config	platform	service owner input	consulted	consulted

15.3 Governance Controls

Use:

RBAC.
admission policy.
namespace labels.
GitOps review.
policy-as-code.
automated conformance tests.
route linting.
emergency exception expiry.
audit logs.
periodic drift review.

15.4 Red Flags

- any app team can attach to public Gateway.
- app team can reference Secrets in platform namespace.
- no approval for cross-namespace ReferenceGrant.
- route weights changed manually with no audit trail.
- emergency NetworkPolicy exception never expires.

16. Cost and Capacity Model

Networking architecture has cost beyond cloud bills.

16.1 Cost Categories

Category	Examples
Proxy compute	sidecars, waypoints, gateways, API gateway
Cross-zone traffic	topology-unaware load balancing
Cross-region traffic	active-active or failover testing
NAT cost	NAT gateway processing and hourly cost
Logging cost	access logs, flow logs, traces
Cardinality cost	metrics dimensions route/pod/user
Operational cost	upgrades, debugging, on-call load
Cognitive cost	multiple policy layers and hidden interactions

16.2 Capacity Questions

How many RPS per Gateway replica?
What is p99 latency added by proxy layers?
How many active connections per proxy?
What is CPU/memory per sidecar/waypoint?
What is NAT connection tracking limit?
What is DNS QPS under deploy/load test?
What is max EndpointSlice update rate during scale event?
What happens during regional failover when traffic doubles?
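
The failover question has a simple first-order answer. The worked example below assumes traffic is evenly balanced, all regions have equal capacity, and load redistributes evenly to survivors; real systems rarely satisfy all three, so treat the result as an optimistic estimate.

```python
def post_failover_utilization(regions, utilization, failed=1):
    """Utilization of each surviving region after `failed` regions are lost."""
    survivors = regions - failed
    if survivors <= 0:
        raise ValueError("at least one region must survive")
    return utilization * regions / survivors


def max_safe_utilization(regions, failed=1):
    """Highest steady-state utilization that still fits after failover."""
    if regions - failed <= 0:
        raise ValueError("at least one region must survive")
    return (regions - failed) / regions


for n in (2, 3):
    after = post_failover_utilization(n, 0.60)
    print(f"{n} regions at 60%: survivors run at {after:.0%}, "
          f"safe ceiling is {max_safe_utilization(n):.0%}")
```

Two regions running at 60% each cannot absorb the loss of one, because the survivor would need 120% of its capacity; the same load across three regions leaves survivors at about 90%.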

16.3 Red Flags

- no load test through real Gateway/mesh path.
- capacity test bypasses TLS/mTLS.
- log volume cost not estimated.
- cross-zone traffic not measured.
- failover capacity not reserved.

17. Security and Threat Model

17.1 Threats

Threat	Control
route hijacking	listener allowedRoutes, RBAC, admission
Secret reference abuse	ReferenceGrant review, namespace isolation
lateral movement	NetworkPolicy, mTLS, AuthorizationPolicy
egress exfiltration	egress gateway/proxy/FQDN policy/logging
plaintext internal traffic	mTLS or backend TLS
identity spoofing	SPIFFE/mTLS, workload attestation
policy bypass	sidecar/ambient enrollment validation, CNI policy
public accidental exposure	route admission and public/private Gateway split
stale cert	rotation monitoring and expiry alerts
debug access abuse	ephemeral access controls and audit

17.2 Security Review Questions

What is the trust boundary between namespaces?
What is the trust boundary between clusters?
Are public and private routes physically/logically separated?
Are app teams allowed to create public exposure directly?
How is Secret reference controlled?
Can a compromised pod reach metadata services or internet?
Can a compromised namespace attach to shared Gateway?
Can policy be bypassed by direct Pod IP?
Are emergency debug pods restricted?

18. Migration Framework

Most real platforms migrate from something already running.

18.1 Migration Principles

- Migrate behavior, not only YAML.
- Preserve rollback path.
- Move one traffic class at a time.
- Keep user-visible probes active.
- Compare old and new telemetry.
- Avoid changing routing, identity, and policy simultaneously.

18.2 Ingress to Gateway API

Phases:

1. Inventory Ingress objects and annotations.
2. Classify annotations: core routing, TLS, auth, rewrite, rate limit, controller-specific.
3. Select Gateway controller and conformance profile.
4. Create shared Gateway and listener model.
5. Migrate low-risk internal route.
6. Migrate public route with parallel hostname or weighted DNS.
7. Validate status, telemetry, rollback.
8. Deprecate old Ingress gradually.
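
Phases 1 and 2 can be bootstrapped with a small inventory script. The sketch below classifies annotation keys by keyword, which is a deliberately crude heuristic; the sample annotations are from ingress-nginx, and the input is assumed to be Ingress objects already loaded as dictionaries (for example from exported JSON).

```python
from collections import Counter

# Keyword heuristics per category; first match wins. Extend for your controller.
CATEGORIES = {
    "rewrite": ("rewrite",),
    "auth": ("auth",),
    "rate limit": ("limit",),
    "tls": ("ssl", "tls"),
}


def classify(annotation_key):
    """Classify one annotation key; unknown keys count as controller-specific."""
    suffix = annotation_key.rsplit("/", 1)[-1]
    for category, needles in CATEGORIES.items():
        if any(needle in suffix for needle in needles):
            return category
    return "controller-specific"


def inventory(ingresses):
    """Count annotation categories across a list of Ingress objects."""
    counts = Counter()
    for ingress in ingresses:
        annotations = ingress.get("metadata", {}).get("annotations") or {}
        counts.update(classify(key) for key in annotations)
    return counts


ingresses = [
    {"metadata": {"name": "shop", "annotations": {
        "nginx.ingress.kubernetes.io/rewrite-target": "/",
        "nginx.ingress.kubernetes.io/ssl-redirect": "true",
    }}},
    {"metadata": {"name": "admin", "annotations": {
        "nginx.ingress.kubernetes.io/auth-url": "https://auth.example.com/check",
        "nginx.ingress.kubernetes.io/limit-rps": "10",
        "nginx.ingress.kubernetes.io/proxy-body-size": "8m",
    }}},
]
print(dict(inventory(ingresses)))
```

The "controller-specific" bucket is the one to review by hand, since those behaviors are the least likely to have a portable Gateway API equivalent.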

18.3 Service Mesh Adoption

Phases:

1. Observe service graph.
2. Enroll non-critical namespace.
3. Enable permissive mTLS.
4. Validate telemetry and proxy overhead.
5. Add authorization policies for selected services.
6. Move to STRICT mTLS for known-good boundary.
7. Expand by domain.
8. Add egress/traffic shaping after base identity is stable.

18.4 Multi-Cluster Adoption

Phases:

1. Define why multi-cluster exists.
2. Ensure CIDR/IPAM compatibility.
3. Establish cluster identity and trust model.
4. Test service discovery across clusters.
5. Test non-critical failover.
6. Test capacity and data dependencies.
7. Add global routing guardrails.
8. Run game day before production dependency.

18.5 Migration Red Flags

- adopting mesh and multi-cluster simultaneously.
- enabling STRICT mTLS globally in first phase.
- replacing ingress controller and DNS/GSLB at same time.
- no rollback route.
- no telemetry comparison between old and new path.

19. Architecture Decision Record Template

Use this ADR format for every major networking decision.

# ADR-XXX: <Decision Title>

## Status
Proposed | Accepted | Deprecated | Superseded

## Context
What problem are we solving?
What constraints exist?
What is out of scope?

## Requirements
- functional requirements
- non-functional requirements
- security requirements
- compliance requirements
- operational requirements

## Options Considered
1. Option A
2. Option B
3. Option C

## Decision
What option did we choose?

## Rationale
Why this option?
Why not the alternatives?

## Consequences
Positive consequences.
Negative consequences.
Operational burden.
Cost implications.

## Failure Modes
What can go wrong?
How will we detect it?
How will we mitigate it?

## Rollout Plan
Phases.
Validation.
Rollback.

## Ownership
Who owns config, incident, upgrades, exceptions?

## Audit Evidence
What logs/events/metrics prove behavior?
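
If ADRs live in Git, a small lint can reject records that skip a section. This sketch assumes the ADR is Markdown using the `##` headings from the template above; the function name is hypothetical.

```python
import re

# Section headings taken from the ADR template.
REQUIRED_SECTIONS = [
    "Status", "Context", "Requirements", "Options Considered", "Decision",
    "Rationale", "Consequences", "Failure Modes", "Rollout Plan",
    "Ownership", "Audit Evidence",
]


def missing_sections(adr_text):
    """Return template sections that have no matching '## ' heading."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", adr_text, re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in found]


draft = """# ADR-XXX: Example decision

## Status
Proposed

## Context
Shared public Gateway for tenant routes.

## Decision
Use one Gateway per trust boundary.
"""
print("missing sections:", missing_sections(draft))
```

The lint only checks that headings exist; whether the Failure Modes and Audit Evidence sections say anything useful still needs a human reviewer.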

20. Production Review Checklist

20.1 Connectivity

[ ] Pod CIDR and Service CIDR planned.
[ ] IPAM exhaustion monitored.
[ ] CNI supports required policy.
[ ] Node-level dataplane debugging available.
[ ] MTU validated.
[ ] kube-proxy/eBPF mode documented.
[ ] cross-node traffic tested.

20.2 Service Discovery

[ ] CoreDNS capacity tested.
[ ] NodeLocal DNSCache decision documented.
[ ] DNS policy exceptions defined.
[ ] service FQDN conventions documented.
[ ] headless/stateful discovery reviewed.
[ ] external DNS dependency modeled.

20.3 Gateway and Routing

[ ] GatewayClass selected and approved.
[ ] public/private Gateway separated.
[ ] listener ownership documented.
[ ] AllowedRoutes configured.
[ ] ReferenceGrant reviewed.
[ ] route conflicts detected.
[ ] rollback strategy exists.
[ ] conformance tested.

20.4 TLS and Identity

[ ] TLS termination points documented.
[ ] backend encryption decision documented.
[ ] mTLS mode documented.
[ ] certificate rotation monitored.
[ ] trust domain defined.
[ ] cross-cluster trust reviewed.
[ ] emergency cert replacement runbook exists.

20.5 Policy

[ ] default-deny rollout plan exists.
[ ] DNS/health/metrics/mesh exceptions modeled.
[ ] policy enforcement tested.
[ ] policy deny observability exists.
[ ] emergency exception expiry enforced.
[ ] cross-namespace references controlled.

20.6 Resilience

[ ] timeout ladder defined.
[ ] retry budget defined.
[ ] circuit breaker/load shedding reviewed.
[ ] rollout/canary abort conditions defined.
[ ] failover behavior tested.
[ ] capacity under failover tested.

20.7 Observability

[ ] route/backend labels in logs/metrics.
[ ] Gateway metrics captured.
[ ] mesh/proxy metrics captured.
[ ] CNI/flow visibility available.
[ ] DNS latency visible.
[ ] policy deny visible.
[ ] trace correlation available.
[ ] evidence bundle process defined.

20.8 Multi-Cluster

[ ] explicit reason for multi-cluster documented.
[ ] CIDR non-overlap verified.
[ ] namespace sameness policy defined.
[ ] service export/import governance defined.
[ ] global routing health signal validated.
[ ] trust federation tested.
[ ] data dependency failover tested.
[ ] regional game day completed.

21. Risk Register Template

| Risk | Likelihood | Impact | Detection | Mitigation | Owner | Review Date |
|---|---:|---:|---|---|---|---|
| Gateway route conflict exposes wrong backend | Medium | High | admission + route status alert | hostname ownership policy | Platform | monthly |
| DNS saturation during deploy | Medium | High | CoreDNS QPS/latency alert | NodeLocal DNSCache + cache tuning | SRE | quarterly |
| NAT port exhaustion | Low | High | NAT conn metrics | egress gateway scaling + connection reuse | Platform | quarterly |
| mTLS cert rotation failure | Low | Critical | expiry alert + synthetic probe | rotation runbook | Security/Platform | monthly |

Risk review should be tied to an owner and a date. A risk without an owner is a wish.
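
A register kept as data can be ranked and checked for missing owners automatically. The sketch below uses the rows from the template above; the numeric scale (Low=1 up to Critical=4) and the likelihood-times-impact score are assumptions, not a standard.

```python
# Assumed ordinal scale for both likelihood and impact.
LEVEL = {"Low": 1, "Medium": 2, "High": 3, "Critical": 4}


def triage(risks):
    """Return (risks ranked by likelihood x impact, names of unowned risks)."""
    ranked = sorted(
        risks,
        key=lambda r: LEVEL[r["likelihood"]] * LEVEL[r["impact"]],
        reverse=True,
    )
    unowned = [r["risk"] for r in risks if not r.get("owner")]
    return ranked, unowned


register = [
    {"risk": "Gateway route conflict exposes wrong backend",
     "likelihood": "Medium", "impact": "High", "owner": "Platform"},
    {"risk": "DNS saturation during deploy",
     "likelihood": "Medium", "impact": "High", "owner": "SRE"},
    {"risk": "NAT port exhaustion",
     "likelihood": "Low", "impact": "High", "owner": "Platform"},
    {"risk": "mTLS cert rotation failure",
     "likelihood": "Low", "impact": "Critical", "owner": "Security/Platform"},
]
ranked, unowned = triage(register)
for r in ranked:
    print(LEVEL[r["likelihood"]] * LEVEL[r["impact"]], r["risk"])
print("unowned:", unowned)
```

A multiplicative score is only a sorting aid; a Low-likelihood, Critical-impact risk may still deserve the earliest review date.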

22. Regulatory Defensibility

For enforcement lifecycle, case management, financial, healthcare, or other regulated systems, the networking architecture must be explainable: every claim about how traffic is handled needs evidence that an auditor can inspect.

22.1 Defensible Claims

Weak claim:

Traffic is secure because we use Kubernetes and TLS.

Defensible claim:

Public traffic terminates at the edge WAF, is re-encrypted to the cluster Gateway, and service-to-service traffic for regulated namespaces uses mTLS with workload identity. Authorization policies restrict access by service identity. NetworkPolicy denies non-declared east-west traffic. Egress requires approved proxy path. Route, policy, and egress changes are GitOps-reviewed and auditable.

22.2 Evidence Needed

Claim	Evidence
traffic encrypted	TLS/mTLS config, cert telemetry, packet/proxy evidence
access restricted	AuthorizationPolicy/NetworkPolicy, deny logs
public exposure controlled	Gateway/Route inventory, RBAC, admission logs
egress controlled	egress proxy logs, allowlist, policy
changes reviewed	Git history, approval workflow
incidents traceable	logs, metrics, traces, timeline
failover tested	game day report, synthetic probe results

22.3 Audit Questions

Who can expose a new public endpoint?
Who approved the route?
What backend received traffic?
Was traffic encrypted at every required hop?
Which workloads could call this service?
Could this service call the internet?
Where are deny logs stored?
How long are access logs retained?
How was failover tested?
How were emergency changes approved and reverted?

23. Anti-Pattern Catalog

23.1 “One Gateway to Rule Them All”

A single shared Gateway handles every public/private/internal/partner route with weak delegation.

Risk:

route conflict
blast radius expansion
unclear ownership
policy coupling

Better:

Separate public, private, partner, and internal Gateway boundaries.
Use listener and namespace attachment controls.

23.2 “Mesh as Magic Security”

Installing a mesh without defining identity, bypass prevention, or policy ownership.

Better:

Define the mesh threat model and enrollment invariants first.

23.3 “Default Allow Forever”

Relying only on application-level auth to contain lateral movement, with no network-level deny.

Better:

Start with traffic observability, then roll out default-deny one domain at a time.

23.4 “Multi-Cluster as Disaster Recovery”

Deploying to two clusters but not testing data failover, identity, or capacity.

Better:

Define RTO/RPO and run regional game days.

23.5 “Policy in Five Places”

WAF, API gateway, Gateway controller, mesh, app, and CNI all enforce overlapping auth, rate-limit, and header rules with no precedence model.

Better:

Assign each layer a clear purpose.
Document precedence and evidence.
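One way to make "a clear purpose per layer" verifiable is to list what each layer enforces and flag any concern that appears more than once. This is a minimal sketch: the layer names follow the anti-pattern description, while the concern labels and the function name are hypothetical.

```python
from collections import defaultdict


def overlapping_concerns(layers: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return each concern enforced by more than one layer, with the enforcing layers."""
    owners = defaultdict(list)
    for layer, concerns in layers.items():
        for concern in concerns:
            owners[concern].append(layer)
    return {concern: names for concern, names in owners.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory of which layer enforces which concern.
    inventory = {
        "waf": {"rate-limit", "header-rewrite"},
        "api-gateway": {"authn", "rate-limit"},
        "mesh": {"authz", "mtls"},
        "app": {"authz"},
        "cni": {"l3-l4-deny"},
    }
    for concern, names in sorted(overlapping_concerns(inventory).items()):
        print(f"{concern}: enforced by {names}")
```

Overlap is not always wrong, for example defense in depth on authorization, but each overlap reported here should have a documented precedence rule and an owner.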

23.6 “Observability Later”

Shipping routing/security architecture before route/backend/policy visibility exists.

Better:

Observability is part of architecture readiness, not a backlog nice-to-have.

24. Example Architecture Review: Regulated Case Platform

24.1 Scenario

A regulated case management platform runs on Kubernetes. It has:

public citizen APIs
internal enforcement APIs
admin console
document processing workers
case event stream
external registry integrations
strict audit requirement
regional DR requirement

24.2 Proposed Architecture

24.3 Review Findings

| Area | Finding | Decision |
|---|---|---|
| Gateway | public and internal traffic separated | accept |
| mTLS | regulated services require mTLS | accept with staged rollout |
| egress | registry calls through egress gateway | accept |
| multi-cluster | active-passive for write path | require DR game day |
| observability | missing route/backend labels | block launch |
| policy | default-deny not yet tested | launch only for non-regulated namespace |
| failover | DB RPO unclear | block production DR claim |

24.4 Launch Gate

Must pass before production:
[ ] public route inventory reviewed
[ ] TLS chain validated
[ ] mTLS enabled for regulated service path
[ ] egress allowlist approved
[ ] NetworkPolicy deny logs visible
[ ] route/backend metrics present
[ ] failover game day completed
[ ] emergency rollback tested
[ ] ADR approved by platform/security/SRE/app owner
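The launch gate is an all-or-nothing check, which makes it easy to encode. The sketch below is illustrative only: the item text is copied from the checklist above, and the function name and the set-based input are assumptions of this example.

```python
# Gate items copied from the checklist, in the same order.
LAUNCH_GATE = [
    "public route inventory reviewed",
    "TLS chain validated",
    "mTLS enabled for regulated service path",
    "egress allowlist approved",
    "NetworkPolicy deny logs visible",
    "route/backend metrics present",
    "failover game day completed",
    "emergency rollback tested",
    "ADR approved by platform/security/SRE/app owner",
]


def launch_blockers(passed: set[str]) -> list[str]:
    """Return the gate items not yet passed, in checklist order."""
    unknown = passed - set(LAUNCH_GATE)
    if unknown:
        # Reject typos so a misspelled item cannot silently count as progress.
        raise ValueError(f"unknown gate items: {sorted(unknown)}")
    return [item for item in LAUNCH_GATE if item not in passed]


if __name__ == "__main__":
    # Hypothetical state: everything passed except the two observability items.
    done = set(LAUNCH_GATE) - {
        "NetworkPolicy deny logs visible",
        "route/backend metrics present",
    }
    blockers = launch_blockers(done)
    print("launch allowed" if not blockers else f"blocked by: {blockers}")
```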

25. Final Architecture Scorecard

Score each area from 0 to 3.

| Score | Meaning |
|---|---|
| 0 | absent or unknown |
| 1 | exists but manual/fragile |
| 2 | defined and partially automated |
| 3 | production-ready, tested, observable, owned |

Score areas:

Connectivity: __ / 3
Service discovery: __ / 3
Gateway/routing: __ / 3
TLS/identity: __ / 3
NetworkPolicy: __ / 3
Egress control: __ / 3
Service mesh: __ / 3
Resilience policy: __ / 3
Observability: __ / 3
Multi-cluster: __ / 3
Ownership/governance: __ / 3
Cost/capacity: __ / 3
Migration/rollback: __ / 3
Regulatory evidence: __ / 3

Interpretation:

| Total | Interpretation |
|---|---|
| 0–15 | not production-ready |
| 16–25 | usable for low-risk/internal workloads only |
| 26–34 | production candidate with known gaps |
| 35–42 | strong production posture |

The exact total is less important than the lowest-scoring critical area. One 0 in identity, egress, or observability can invalidate the design for regulated workloads.
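The scoring rule, including the override for a zero in a critical area, can be sketched as follows. The areas and bands come from the tables above. Mapping "identity, egress, or observability" to the `TLS/identity`, `Egress control`, and `Observability` rows is an assumption of this example, as are the function and constant names.

```python
AREAS = [
    "Connectivity", "Service discovery", "Gateway/routing", "TLS/identity",
    "NetworkPolicy", "Egress control", "Service mesh", "Resilience policy",
    "Observability", "Multi-cluster", "Ownership/governance", "Cost/capacity",
    "Migration/rollback", "Regulatory evidence",
]

# Assumed mapping of "identity, egress, or observability" onto scorecard rows.
CRITICAL_FOR_REGULATED = {"TLS/identity", "Egress control", "Observability"}

# (inclusive upper bound of the total, interpretation)
BANDS = [
    (15, "not production-ready"),
    (25, "usable for low-risk/internal workloads only"),
    (34, "production candidate with known gaps"),
    (42, "strong production posture"),
]


def interpret(scores: dict[str, int], regulated: bool = False) -> tuple[int, str]:
    """Return (total, interpretation); a 0 in a critical area overrides the total."""
    missing = [area for area in AREAS if area not in scores]
    if missing:
        raise ValueError(f"unscored areas: {missing}")
    if any(not 0 <= scores[area] <= 3 for area in AREAS):
        raise ValueError("each score must be between 0 and 3")

    total = sum(scores[area] for area in AREAS)
    zeros = sorted(a for a in CRITICAL_FOR_REGULATED if scores[a] == 0)
    if regulated and zeros:
        return total, "not production-ready: critical area scored 0 (" + ", ".join(zeros) + ")"
    for upper, label in BANDS:
        if total <= upper:
            return total, label
    raise AssertionError("unreachable: total cannot exceed 42")


if __name__ == "__main__":
    # Hypothetical scores: strong everywhere except observability.
    scores = {area: 3 for area in AREAS}
    scores["Observability"] = 0
    print(interpret(scores, regulated=False))
    print(interpret(scores, regulated=True))
```

In the usage example the same scores land in the top band for an unregulated workload but are rejected for a regulated one, which is the point of reading the lowest critical score before the total.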

26. Review Meeting Format

26.1 Participants

platform/networking owner
application owner
SRE/on-call owner
security representative
compliance/regulatory representative when applicable
data/storage owner for multi-cluster/failover decisions

26.2 Agenda

1. Problem statement and non-goals
2. Traffic classes and trust boundaries
3. Architecture diagram
4. Seven-plane review
5. Failure modes
6. Decision comparison
7. Operational readiness
8. Security/regulatory evidence
9. Migration and rollback
10. Open risks and launch gate

26.3 Output

The meeting must produce:

accepted decisions
rejected options and rationale
risk register
launch blockers
required tests
owner per follow-up
review date

If no decision artifact exists, the review did not happen in a durable way.

27. Part 034 Completion Check

You have completed Part 034 if you can:

Evaluate CNI, Gateway API, mesh, policy, egress, observability, and multi-cluster using explicit criteria.
Distinguish Gateway API, API gateway, CDN/WAF, cloud LB, and service mesh by responsibility.
Decide when a service mesh is justified, when it is too expensive, and when Gateway/API/CNI policy is enough.
Design an ownership model across platform, app team, security, and SRE.
Write an ADR for major networking decisions.
Assess multi-cluster risk honestly, including data, identity, capacity, and failover readiness.
Build a production readiness scorecard.
Link architecture decisions to evidence for audit and regulatory defensibility.

The next part is Part 035, Capstone Design: Top 1% Kubernetes Networking Handbook. It is the final part of this series and brings all the material together into an end-to-end design, diagrams, a policy model, runbooks, SLOs, a risk register, and a self-assessment.
