Case Study - Runtime and Deployment Topology

Learn Java Microservices Design and Architect - Part 097

A production-grade case study on runtime and deployment topology for regulatory case-management microservices, covering Kubernetes workloads, ingress, gateway, service mesh, node pools, availability zones, scaling profiles, data topology, failure isolation, and topology review.

2026-07-05 · 19 min read · 3632 words


Part 097 — Case Study: Runtime and Deployment Topology

Logical architecture answers “which services exist.” Runtime topology answers harder questions: where does each service live, how does traffic flow, how does failure spread, how is capacity calculated, and how does the system stay operable when part of the infrastructure is broken?

In the case-management domain, we shaped the service design in the previous parts:

Case Service,
Party Service,
Allegation Service,
Evidence Service,
Decision Service,
Workflow/Process Service,
Notification Service,
Audit Service,
Reporting/Projection Service,
Gateway/BFF,
Policy/Rules Service.

This part translates that design into a runtime/deployment topology that can actually be run, observed, scaled, and reviewed.

We will not repeat Kubernetes basics. The focus is runtime architecture: decisions about placement, isolation, capacity, failure boundaries, and topology evidence.

1. Target Mental Model

A microservice is not the same thing as one repository or one class. In production, a service is a combination of several boundaries:

Boundary	Question
Logical service	what capability does it own?
Runtime instance	how many replicas are running?
Deployment unit	how is the artifact released?
Network endpoint	how is the service discovered and called?
Security identity	what is its workload identity?
Data boundary	which database/schema/topic does it own?
Failure boundary	what happens when a dependency fails?
Scaling boundary	what are its scaling metric and bottleneck?
Operational boundary	who is on call and what is the runbook?

Runtime topology is the graph of these boundaries.

If the topology is not explicit, the system still has one, but it emerges unplanned from platform defaults, team habits, and ad-hoc decisions.

2. Case Study Runtime Goal

A regulatory case-management system has the following characteristics:

Audit-sensitive
- decisions must be reconstructible,
- important events must not be lost,
- actor attribution must be stable.
Workflow-heavy
- case processes run for days to months,
- there are timers, SLAs, human tasks, and escalations,
- there are retries and compensation.
Read-heavy for case dashboards
- investigators frequently open the case overview,
- supervisors read queues and risk summaries,
- regulators/auditors read history.
Write-sensitive for decisions
- decision commands must be idempotent,
- double-submit must be prevented,
- the policy version must be recorded.
External integration
- notification/email/SMS,
- external evidence store,
- identity provider,
- policy/rules system.
Data privacy
- evidence metadata is not the same as evidence content,
- PII must not spread into logs, traces, event payloads, or read models without a clear purpose.

The runtime topology must protect all of these characteristics.

3. High-Level Deployment Topology

We start from a topology that is simple but production-realistic.

Note: this diagram is not just a picture. Every edge is a potential source of latency, failure, authorization checks, observability needs, retries, and cost.

4. Topology Layer

Runtime topology is best viewed as several layers.

If any layer is unclear, the architecture review should stop.

Examples:

A service has an API but no owner: operational risk.
A service has a database but unclear data ownership: consistency risk.
A service has 10 replicas with a DB pool of 50 per replica (up to 500 connections): database overload risk.
A service has a public endpoint but no rate limit: abuse risk.
A service emits audit events but has no immutable store: defensibility risk.

5. Namespace Strategy

A namespace is not a perfect security boundary, but it is useful as an operational grouping.

For case-management:

Namespace	Contents	Purpose
`edge`	gateway, BFF	expose traffic, edge policy, client-specific composition
`case-core`	core domain services	high-value operational services
`async`	projection, audit, notification workers	async processing, consumer scaling
`platform`	OTel collector, mesh, agents	shared platform components
`data-system`	DB operator/broker operator, if any	platform-managed data infra

Practical rules:

Do not mix gateway, core domain services, and background workers without a reason.
Namespaces should make policy, observability, ownership, and blast-radius reasoning easier.
A namespace is not a substitute for service-level authorization.

6. Workload Profile per Service

Not every service has the same workload. Runtime topology must follow the shape of the workload.

Service	Workload type	Scaling driver	Criticality	Runtime concern
Gateway	synchronous edge	RPS, latency, connections	critical	rate limit, auth, routing
BFF	synchronous composition	concurrent users, fan-out latency	high	partial response, cache, timeout
Case Service	command + query	write rate, DB latency	critical	idempotency, DB pool, outbox
Decision Service	command-heavy	decision submissions	critical	policy call, audit, locking
Workflow Service	orchestration	active workflows, timer count, worker queue	critical	durable execution, activity retry
Evidence Metadata Service	metadata write/read	upload metadata, object reference lookup	high	object-store consistency, privacy
Projection Service	async consumer	consumer lag, event rate	high	idempotent projection, rebuild
Audit Service	append-only event consumer	audit event rate, store latency	critical	durability, no loss, retention
Notification Service	async external IO	delivery queue depth	medium	retry, DLQ, provider rate limit
Policy Service	low-latency decision	QPS, rule eval time	critical	versioning, cache, explainability

A top 1% engineer does not ask “how many replicas?” first. They ask:

What kind of work does this service do?
What is the bottleneck?
What is the failure mode?
What is the safe degradation mode?
What metric proves it is healthy?

7. Deployment Unit Design

Each service should have its own deployment unit if it is meant to be independently deployable.

Example Kubernetes Deployment for Case Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: case-service
  namespace: case-core
  labels:
    app.kubernetes.io/name: case-service
    app.kubernetes.io/part-of: regulatory-case-management
    app.kubernetes.io/component: domain-service
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: case-service
  template:
    metadata:
      labels:
        app.kubernetes.io/name: case-service
        app.kubernetes.io/version: "2026.07.05-001"
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: app
          image: registry.example.com/case-service@sha256:...
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: prod
          resources:
            requests:
              cpu: "500m"
              memory: "768Mi"
            limits:
              cpu: "2"
              memory: "1536Mi"
          startupProbe:
            httpGet:
              path: /actuator/health/startup
              port: http
            failureThreshold: 30
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: http
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: http
            periodSeconds: 10
            failureThreshold: 3

Architecture points:

maxUnavailable: 0 preserves capacity during a rolling deploy.
startupProbe prevents the liveness probe from killing a JVM that is still warming up.
readinessProbe removes the pod from traffic when it is not ready.
terminationGracePeriodSeconds gives time for draining.
The image is pinned by digest, not by a mutable tag.

8. Service Endpoint Topology

A Kubernetes Service gives a stable name to ephemeral pods.

apiVersion: v1
kind: Service
metadata:
  name: case-service
  namespace: case-core
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: case-service
  ports:
    - name: http
      port: 80
      targetPort: http

Policy:

Internal domain services default to ClusterIP, not public exposure.
Public exposure must go through the edge/gateway.
The DNS name becomes a dependency contract: case-service.case-core.svc.cluster.local.
Clients must have a timeout, retry policy, and circuit breaker; service discovery does not solve overload.

9. Edge Route and BFF Topology

The responsibilities of the gateway/BFF must be limited.

The gateway may own:

authentication enforcement,
coarse authorization,
rate limiting,
request size limit,
routing,
API version routing,
coarse request logging,
correlation ID injection.

The gateway must not become:

a hidden domain service,
a place for business workflow,
a home for regulatory rules,
a data authority,
a home for complex orchestration that belongs to the workflow service.

10. Service Mesh: Where It Helps, Where It Does Not

A service mesh can help with:

mTLS between workloads,
traffic splitting,
policy routing,
basic telemetry,
retry/timeout at the network level,
certain kinds of circuit breaking,
identity enforcement.

But the application must still own:

idempotency,
business timeout semantics,
command status,
compensation,
audit event,
domain authorization,
validation,
error taxonomy,
data consistency rules.

The runtime topology must mark which policies are owned by the mesh and which by the application.

The trap: placing retries in mesh, client library, and application code at the same time. That creates retry multiplication.
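The arithmetic behind retry multiplication can be sketched as follows. The layer order and the retry counts are illustrative assumptions, not values from this case study; the point is only that worst-case attempts multiply across layers instead of adding up.

```java
// Sketch: worst-case request amplification when several layers each retry.
// The retry counts used in main are hypothetical.
final class RetryAmplification {

    // Each layer makes 1 initial attempt plus `retries` retries for every
    // attempt issued by the layer above it, so the factors multiply.
    static long worstCaseAttempts(int... retriesPerLayer) {
        long attempts = 1;
        for (int retries : retriesPerLayer) {
            if (retries < 0) {
                throw new IllegalArgumentException("retries must be >= 0");
            }
            attempts *= (1L + retries);
        }
        return attempts;
    }

    public static void main(String[] args) {
        // application code, client library, and mesh each configured with 2 retries
        System.out.println(worstCaseAttempts(2, 2, 2)); // prints 27
        // retries owned by exactly one layer
        System.out.println(worstCaseAttempts(2, 0, 0)); // prints 3
    }
}
```

With three layers at two retries each, one user request can reach a struggling dependency up to 27 times, which is why the topology should name a single owner for retries on each edge.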

11. Availability Zone Placement

For production, topology must define zone strategy.

Rules:

Critical sync services should have replicas across zones.
Async workers should tolerate zone loss without duplicating side effects.
DB/broker topology must be understood; application replicas across zones do not help if the database is single-zone.
Pod anti-affinity or topology spread constraints can reduce co-location risk.

Example topology spread constraint:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: decision-service

Use DoNotSchedule only when capacity is guaranteed in every zone. Otherwise the spread constraint itself can become an availability problem: pods stay Pending when a zone is short on capacity or under node pressure.
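To make the maxSkew rule concrete, here is a minimal sketch of the check as Kubernetes documents it: the number of matching pods in a zone after placement, minus the global minimum across zones, must not exceed maxSkew. The zone names and pod counts are hypothetical, and the sketch ignores node capacity, taints, and every other scheduling constraint.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: which zones could take one more pod under a maxSkew limit.
final class ZoneSpread {

    static List<String> schedulableZones(Map<String, Integer> podsPerZone, int maxSkew) {
        if (podsPerZone.isEmpty()) {
            return List.of();
        }
        int globalMin = Collections.min(podsPerZone.values());
        List<String> allowed = new ArrayList<>();
        // TreeMap is used only to return zones in a stable, sorted order.
        for (Map.Entry<String, Integer> zone : new TreeMap<>(podsPerZone).entrySet()) {
            int skewAfterPlacement = zone.getValue() + 1 - globalMin;
            if (skewAfterPlacement <= maxSkew) {
                allowed.add(zone.getKey());
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        Map<String, Integer> pods = Map.of("zone-a", 2, "zone-b", 1, "zone-c", 1);
        System.out.println(schedulableZones(pods, 1)); // prints [zone-b, zone-c]
    }
}
```

If zone-b and zone-c have no free capacity in this example, DoNotSchedule leaves the new pod Pending even though zone-a has room, whereas ScheduleAnyway lets the scheduler accept the extra skew.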

12. Node Pool Strategy

Not all workloads should run on the same node pool.

Node pool	Workload	Reason
`edge-pool`	gateway, BFF	predictable ingress CPU/network
`core-pool`	case/decision/policy services	critical sync workload
`worker-pool`	projection/audit/notification workers	async workload, scale independently
`memory-pool`	read model / search-heavy service	memory/cache pressure
`platform-pool`	OTel collector, mesh control plane	platform isolation

Do not overdo node pools. Too many pools reduce cluster bin-packing efficiency and increase operational overhead.

The decision model:

separate when workload failure/resource pattern is materially different,
keep together when difference is only team preference,
measure utilization and pending pods before adding isolation.

13. Scaling Profile

Replica count should not be copied across services. Each service gets a scaling profile.

13.1 Sync service scaling

For Case Service:

Metric	Why
HTTP RPS	traffic volume
p95/p99 latency	user-visible performance
in-flight requests	concurrency pressure
DB connection pool active/waiting	downstream bottleneck
CPU saturation	compute pressure
JVM heap/native memory	memory pressure
error rate	overload/failure symptom

Example HPA-like intent:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: case-service
  namespace: case-core
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: case-service
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

A CPU-only HPA is not enough when the bottleneck is the database pool, an external API, or lock contention. HPA must be paired with downstream capacity budgets.

13.2 Async worker scaling

For Projection Service:

Metric	Why
consumer lag	backlog size
oldest event age	user-visible staleness
processing duration	per-message cost
DLQ rate	poison/failure rate
projection DB latency	downstream bottleneck

Workers should scale from lag/oldest-age, not only CPU.
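One way to turn lag into a replica target is sketched below. The formula, the per-replica throughput, and the drain-time target are assumptions for illustration; the sketch also ignores the incoming event rate, which a real autoscaler would have to account for.

```java
// Sketch: derive a worker replica count from consumer lag.
// All numeric inputs in main are hypothetical.
final class LagScaling {

    static int desiredReplicas(long consumerLag,
                               double eventsPerSecondPerReplica,
                               double targetDrainSeconds,
                               int minReplicas,
                               int maxReplicas) {
        if (eventsPerSecondPerReplica <= 0 || targetDrainSeconds <= 0) {
            throw new IllegalArgumentException("throughput and drain time must be positive");
        }
        if (minReplicas > maxReplicas) {
            throw new IllegalArgumentException("minReplicas must be <= maxReplicas");
        }
        // Backlog one replica can clear within the drain-time target.
        double backlogPerReplica = eventsPerSecondPerReplica * targetDrainSeconds;
        long needed = (long) Math.ceil(consumerLag / backlogPerReplica);
        // The upper bound should come from the downstream budget, not from the lag.
        return (int) Math.max(minReplicas, Math.min(maxReplicas, needed));
    }

    public static void main(String[] args) {
        // 60,000 events behind, 50 events/s per replica, drain within 120 s
        System.out.println(desiredReplicas(60_000, 50.0, 120.0, 2, 10)); // prints 10
    }
}
```

The clamp matters as much as the formula: the maximum is where the projection DB's connection budget enters the scaling decision.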

13.3 Workflow worker scaling

Workflow Service often has two layers:

workflow engine/control-plane,
activity workers.

Activity workers should scale by:

task queue backlog,
activity latency,
external dependency rate limit,
timer volume,
retry volume.

If you scale workers blindly, you can overload Case Service, Evidence Service, or external notification providers.

14. Capacity Envelope

Each service should have a capacity envelope.

Example: Decision Service.

Dimension	Value	Note
min replicas	3	zone spread
max replicas	10	DB pool budget
p95 target	< 300 ms	without policy cache miss
p99 target	< 900 ms	under normal traffic
per-pod DB pool	10	max 100 total at 10 replicas
policy timeout	150 ms	fail-closed for final decision
command timeout	800 ms	API deadline budget
retry attempts	0 for non-idempotent unknown outcome; 1 for safe transient	must use idempotency key
audit write	required via outbox	cannot accept command without audit evidence path
degradation	reject final decision if policy unavailable	no silent allow

Capacity envelope is more useful than “replicas: 3”. It tells future engineers what assumptions hold the service together.
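The relationship between the 800 ms command timeout and the 150 ms policy timeout can be sketched as a deadline budget: each downstream call gets the smaller of its configured timeout and whatever remains of the command deadline. The class and method names are hypothetical, and the clock is passed in explicitly so the behavior is easy to check; production code would typically read System.nanoTime().

```java
import java.time.Duration;

// Sketch: per-call timeouts derived from one command-level deadline.
final class DeadlineBudget {
    private final long deadlineNanos;

    private DeadlineBudget(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    static DeadlineBudget start(long nowNanos, Duration total) {
        return new DeadlineBudget(nowNanos + total.toNanos());
    }

    // Time left before the command deadline, never negative.
    Duration remaining(long nowNanos) {
        long left = deadlineNanos - nowNanos;
        return left > 0 ? Duration.ofNanos(left) : Duration.ZERO;
    }

    // Timeout to use for a downstream call: configured value capped by the budget.
    Duration timeoutFor(long nowNanos, Duration configured) {
        Duration left = remaining(nowNanos);
        return configured.compareTo(left) <= 0 ? configured : left;
    }

    public static void main(String[] args) {
        DeadlineBudget budget = DeadlineBudget.start(0L, Duration.ofMillis(800));
        Duration policyTimeout = Duration.ofMillis(150);
        long at700ms = Duration.ofMillis(700).toNanos();
        System.out.println(budget.timeoutFor(0L, policyTimeout).toMillis());      // prints 150
        System.out.println(budget.timeoutFor(at700ms, policyTimeout).toMillis()); // prints 100
    }
}
```

When the returned timeout is zero, the caller should reject the command instead of issuing a call that cannot finish inside the API deadline.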

15. Database Connection Budget

A classic production mistake: every replica opens a large pool.

total_db_connections = replicas * max_pool_size

If:

Case Service: max replicas = 12, max pool = 30,
Decision Service: max replicas = 10, max pool = 20,
Workflow workers: max replicas = 15, max pool = 10,

then total possible app connections:

12*30 + 10*20 + 15*10 = 710

If database safe connection budget is 300, your autoscaling design is already wrong.

Better:

Service	Max replicas	Pool per pod	Total
Case Service	12	12	144
Decision Service	10	10	100
Workflow workers	15	3	45
Projection workers	10	5	50
Total			339

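This revised plan still totals 339, above the 300-connection budget in the example, so keep adjusting replicas and pool sizes against actual DB capacity, query cost, and workload shape.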
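The pool math above is simple enough to keep as an executable check next to the scaling configuration. The sketch below uses the numbers from this section; the class, record, and service key names are hypothetical, and records require Java 16 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: worst-case DB connections implied by max replicas and pool size.
final class ConnectionBudget {

    record PoolPlan(int maxReplicas, int poolPerPod) {
        int worstCaseConnections() {
            return maxReplicas * poolPerPod;
        }
    }

    static int worstCaseTotal(Map<String, PoolPlan> plans) {
        int total = 0;
        for (PoolPlan plan : plans.values()) {
            total += plan.worstCaseConnections();
        }
        return total;
    }

    static boolean fitsBudget(Map<String, PoolPlan> plans, int safeBudget) {
        return worstCaseTotal(plans) <= safeBudget;
    }

    public static void main(String[] args) {
        Map<String, PoolPlan> original = new LinkedHashMap<>();
        original.put("case-service", new PoolPlan(12, 30));
        original.put("decision-service", new PoolPlan(10, 20));
        original.put("workflow-workers", new PoolPlan(15, 10));

        Map<String, PoolPlan> revised = new LinkedHashMap<>();
        revised.put("case-service", new PoolPlan(12, 12));
        revised.put("decision-service", new PoolPlan(10, 10));
        revised.put("workflow-workers", new PoolPlan(15, 3));
        revised.put("projection-workers", new PoolPlan(10, 5));

        System.out.println(worstCaseTotal(original)); // prints 710
        System.out.println(worstCaseTotal(revised));  // prints 339
        System.out.println(fitsBudget(revised, 300)); // prints false
    }
}
```

A check like this could run in CI so that raising maxReplicas or a pool size fails review until the database budget is revisited.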

16. Runtime Data Topology

Data topology should show ownership and access mode.

Rules:

No direct cross-service DB access.
Read model is not source of truth.
Audit store is not debug log store.
Object store access must go through evidence authority or signed URL boundary.
Projection rebuild must not mutate source of truth.

17. Failure Isolation Topology

A runtime topology must show how failure is contained.

Failure policy examples:

Failure	User/system behavior
Read model stale	UI shows watermark; command path still uses source service
Notification provider down	task retries; no rollback of regulatory decision
Policy service unavailable	final decision command rejects or queues, depending on policy type
Audit service consumer lag	source services continue publishing outbox; audit lag alert pages if evidence age exceeds threshold
Evidence object store unavailable	metadata read works; upload/download unavailable; case decision requiring evidence blocks
Projection DB unavailable	write path continues; dashboard may degrade; rebuild after recovery

This is more useful than “service A depends on service B”. It states what the business sees.

18. Traffic Class Separation

Not all traffic deserves equal priority.

Traffic class	Examples	Priority
Human interactive	dashboard, case search, decision submit	high
Critical workflow	SLA escalation, decision finalization	high
Audit ingestion	audit event persistence	critical
Projection rebuild	read model rebuild	low/controlled
Batch export	reporting export, data reconciliation	low
Notification retry	email/SMS retry	medium

Implement separation with:

separate worker deployments,
separate queues/topics,
concurrency limits,
priority-aware rate limiting,
bulkheads,
node pool isolation only when justified.

Do not let projection rebuild starve audit ingestion.
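A concurrency limit per traffic class is the smallest form of bulkhead. The sketch below is one possible shape, assuming an in-process worker that handles both classes; the class names and limits are hypothetical, and separate deployments or queues achieve the same isolation at a coarser level.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

// Sketch: each traffic class has its own permit pool, so a busy
// low-priority class cannot consume capacity reserved for another class.
final class TrafficClassBulkhead {

    enum TrafficClass { AUDIT_INGESTION, PROJECTION_REBUILD }

    private final Map<TrafficClass, Semaphore> permits = new EnumMap<>(TrafficClass.class);

    TrafficClassBulkhead(int auditLimit, int rebuildLimit) {
        permits.put(TrafficClass.AUDIT_INGESTION, new Semaphore(auditLimit));
        permits.put(TrafficClass.PROJECTION_REBUILD, new Semaphore(rebuildLimit));
    }

    // Runs the task only if its class has a free permit; otherwise rejects
    // immediately so the caller can back off or requeue.
    boolean tryRun(TrafficClass trafficClass, Runnable task) {
        Semaphore classPermits = permits.get(trafficClass);
        if (!classPermits.tryAcquire()) {
            return false;
        }
        try {
            task.run();
            return true;
        } finally {
            classPermits.release();
        }
    }
}
```

With a rebuild limit of 1, a second rebuild task is rejected while the first is running, and audit ingestion keeps its own permits regardless of how much rebuild work is waiting.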

19. Runtime Identity and Security Topology

Every workload needs identity.

Rules:

Service account per service, not shared cluster-wide account.
Database role per service.
Broker principal per publisher/consumer.
Secrets scoped by service.
Mesh identity should align with service catalog identity.
Audit service writer cannot mutate business source databases.

Example Kubernetes service account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: case-service
  namespace: case-core
  labels:
    app.kubernetes.io/name: case-service

Architecture review question:

If this workload token leaks, which data and actions become possible?

If the answer is “everything in the namespace”, identity topology is too coarse.

20. Runtime Observability Topology

Telemetry path is part of runtime topology.

Design rules:

Application emits telemetry to collector, not directly to many vendors when possible.
Trace sampling must not break auditability; audit event is separate.
High-cardinality labels are controlled.
Logs are structured and redacted before leaving workload boundary.
traceId, correlationId, and caseId linkage must be consistent.

Telemetry failure should not block critical business commands, except audit evidence path when business requirement says command cannot be accepted without evidence record.

21. Deployment Wave Design

For case-management, deployment order matters only when compatibility is violated. The goal is to avoid lockstep.

Safe release wave:

Runtime implication:

multiple versions may run during rollout,
events and APIs must tolerate version skew,
read model rebuild must handle old and new shapes,
workflow versioning must protect in-flight processes.

22. Topology Card Template

Each service should have a topology card.

service: decision-service
namespace: case-core
owner: enforcement-platform/decision-team
runtime:
  type: kubernetes-deployment
  minReplicas: 3
  maxReplicas: 10
  zoneSpread: required-preferred
  nodePool: core-pool
container:
  imagePolicy: digest-only
  cpuRequest: 500m
  memoryRequest: 768Mi
  memoryLimit: 1536Mi
network:
  exposure: internal
  inbound:
    - workflow-service
    - case-bff
  outbound:
    - policy-service
    - decision-db
    - event-broker
security:
  serviceAccount: decision-service
  databaseRole: decision_rw
  meshMtls: required
data:
  owns:
    - decision-db
  publishes:
    - DecisionProposed
    - DecisionApproved
    - DecisionRejected
reliability:
  timeoutMs: 800
  policyTimeoutMs: 150
  retries: safe-idempotent-only
  circuitBreaker: policy-service
  degradation: reject-final-decision-if-policy-unavailable
observability:
  slo:
    availability: 99.9
    p95LatencyMs: 300
  dashboards:
    - decision-service-overview
  alerts:
    - decision-command-error-budget-burn
    - policy-call-failure-rate
  runbook: runbooks/decision-service.md

A topology card should be version-controlled and linked from the service catalog.

23. Example: Decision Service Deployment Topology

Decision Service is critical because it creates regulatory decisions.

Runtime requirements

Internal-only service.
Minimum 3 replicas.
Spread across zones.
No final decision without policy decision.
Audit event path must be reliable through outbox.
Strong idempotency for decision commands.
DB pool bounded.
No direct calls to Evidence object store; only Evidence Service/metadata.
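The idempotency requirement can be illustrated with a minimal in-memory sketch. It assumes the client sends an idempotency key with each decision command; the class name is hypothetical, and a real Decision Service would persist the key together with the decision in its own database (for example behind a unique constraint) instead of holding it in memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: execute a command at most once per idempotency key.
final class IdempotentCommandHandler<R> {

    private final Map<String, R> resultsByKey = new ConcurrentHashMap<>();

    // A duplicate submission with the same key returns the first result
    // instead of running the command again. The command must return a
    // non-null result, otherwise nothing is recorded for the key.
    R handle(String idempotencyKey, Supplier<R> command) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> command.get());
    }
}
```

A double-submitted "finalize decision" request then yields one decision and one audit trail entry, and the second caller simply sees the stored outcome.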

Failure policy

Dependency	Failure handling
Decision DB	reject command; page if sustained
Policy Service	fail-closed for final decision; may queue draft validation, depending on command type
Event Broker	command commits with outbox pending; outbox lag alert
Audit Consumer	command continues; audit lag monitored; no deletion of outbox evidence
OTel Collector	continue; telemetry loss alert if sustained

Deployment skeleton

apiVersion: apps/v1
kind: Deployment
metadata:
  name: decision-service
  namespace: case-core
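spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app.kubernetes.io/name: decision-service
  template:
    metadata:
      labels:
        app.kubernetes.io/name: decision-service
    spec: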
      serviceAccountName: decision-service
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: registry.example.com/decision-service@sha256:...
          env:
            - name: JAVA_TOOL_OPTIONS
              value: >-
                -XX:MaxRAMPercentage=65
                -XX:+ExitOnOutOfMemoryError
            - name: DECISION_POLICY_TIMEOUT_MS
              value: "150"
            - name: DECISION_COMMAND_TIMEOUT_MS
              value: "800"
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "768Mi"
            limits:
              cpu: "2"
              memory: "1536Mi"

24. Example: Audit Service Deployment Topology

Audit Service is async but critical.

Misleading assumption:

“Audit is async, so it is less critical.”

Wrong. Audit ingestion can be async, but audit durability is critical.

Audit Service topology:

consumes audit-relevant events,
validates event envelope,
writes append-only audit store,
emits audit lag metric,
sends poison event to quarantine, not silent drop,
supports replay with idempotency.

Failure policy:

Failure	Behavior
audit store down	consumer pauses/fails; lag increases; alert pages
malformed audit event	quarantine with reason; alert if rate > threshold
duplicate event	idempotent no-op or link to original
event schema unknown	quarantine unless compatible fallback exists
backlog too old	page; regulatory evidence window at risk
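The duplicate and malformed-event rows of this table can be sketched as a small ingest routine. Everything here is an in-memory stand-in: the event fields, validation rules, and class names are assumptions, and a real audit store would be a durable append-only system. Records require Java 16 or later.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: idempotent audit ingestion with quarantine instead of silent drop.
final class AuditIngestor {

    enum Outcome { APPENDED, DUPLICATE_IGNORED, QUARANTINED }

    record AuditEvent(String eventId, String actor, String payload) {}

    record QuarantinedEvent(AuditEvent event, String reason) {}

    private final Set<String> seenEventIds = new HashSet<>();
    private final List<AuditEvent> appendOnlyStore = new ArrayList<>();
    private final List<QuarantinedEvent> quarantine = new ArrayList<>();

    Outcome ingest(AuditEvent event) {
        // Envelope validation: a malformed event is kept with a reason.
        if (event.eventId() == null || event.eventId().isBlank()) {
            quarantine.add(new QuarantinedEvent(event, "missing eventId"));
            return Outcome.QUARANTINED;
        }
        if (event.actor() == null || event.actor().isBlank()) {
            quarantine.add(new QuarantinedEvent(event, "missing actor"));
            return Outcome.QUARANTINED;
        }
        // Replay or redelivery: the same eventId is a no-op.
        if (!seenEventIds.add(event.eventId())) {
            return Outcome.DUPLICATE_IGNORED;
        }
        appendOnlyStore.add(event);
        return Outcome.APPENDED;
    }

    int storedCount() {
        return appendOnlyStore.size();
    }

    int quarantinedCount() {
        return quarantine.size();
    }
}
```

Because every outcome is explicit, the quarantine rate and the duplicate rate can be emitted as metrics next to audit lag.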

25. Example: Projection Service Deployment Topology

Projection Service is not source of truth, but user experience depends on it.

It owns:

read model DB,
projection checkpoint,
freshness watermark,
rebuild job,
query surface if exposed through Read Model Service.

Runtime rules:

projection update must be idempotent,
checkpoint update must be atomic with projection write,
rebuild uses separate traffic class,
UI must show freshness when relevant,
projection lag is an SLI, not a debug metric.
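The first two rules can be sketched together: the projection write and the checkpoint advance happen in one step, and anything at or below the checkpoint is treated as a redelivery. This is an in-memory sketch that assumes a single ordered stream with monotonically increasing offsets; the names are hypothetical, and a real projection would commit both writes in one database transaction.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: idempotent projection update guarded by a checkpoint.
final class CaseStatusProjection {

    record CaseEvent(long offset, String caseId, String status) {}

    private final Map<String, String> statusByCase = new HashMap<>();
    private long checkpoint = -1; // offset of the last applied event

    // Returns false for a redelivered event, so replays do not change the read model.
    synchronized boolean apply(CaseEvent event) {
        if (event.offset() <= checkpoint) {
            return false;
        }
        // Both updates happen under the same lock, standing in for one transaction.
        statusByCase.put(event.caseId(), event.status());
        checkpoint = event.offset();
        return true;
    }

    synchronized String statusOf(String caseId) {
        return statusByCase.get(caseId);
    }

    synchronized long checkpoint() {
        return checkpoint;
    }
}
```

A rebuild can reuse the same apply path against a fresh projection instance, which keeps replay behavior identical to live consumption.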

26. Runtime Topology Review Checklist

Use this before production readiness.

26.1 Workload identity

Does each service have its own service account?
Are DB/broker permissions scoped per service?
Are admin endpoints isolated?
Is service-to-service auth enforced beyond network location?

26.2 Network

Is public exposure limited to edge/gateway?
Are internal services ClusterIP only?
Are NetworkPolicies/service mesh policies defined?
Are timeouts and retries defined per edge?

26.3 Availability

Are critical services spread across zones?
Do PodDisruptionBudgets protect critical replicas?
Are readiness/liveness/startup probes correct?
Does rollout preserve minimum capacity?

26.4 Capacity

Are max replicas bounded by downstream capacity?
Is DB pool math reviewed?
Is async worker scaling based on lag/oldest-age?
Are traffic classes separated?

26.5 Data

Is data ownership enforced?
Are cross-service DB joins blocked?
Is read model staleness explicit?
Is audit store append-only and replay-safe?

26.6 Observability

Are trace/log/metric identities consistent?
Is telemetry routed through collector or known pipeline?
Are cardinality risks controlled?
Are topology dashboards linked from service catalog?

26.7 Failure handling

Does each dependency have a failure policy?
Is degradation visible to user/operator?
Are retry/circuit breaker policies consistent across app/mesh/client?
Are emergency levers documented?

27. Common Runtime Topology Anti-Patterns

27.1 Logical service exists, runtime identity shared

All services run under one Kubernetes service account.

Impact:

impossible least privilege,
audit attribution weak,
blast radius huge.

27.2 HPA hides downstream overload

Service scales up under latency pressure, opens more DB connections, and makes DB slower.

Impact:

autoscaling amplifies failure.

Fix:

cap max replicas,
bound DB pool,
use load shedding,
scale database/read model separately if justified.

27.3 Async worker competes with interactive traffic

Projection rebuild and audit ingestion share the same DB pool or node pool without limits.

Impact:

low-priority batch damages critical path.

Fix:

separate worker deployment,
separate concurrency limit,
priority queues,
traffic class metrics.

27.4 Gateway becomes domain monolith

Gateway starts deciding regulatory status.

Impact:

hidden domain coupling,
hard-to-audit decisions,
impossible reuse.

Fix:

move business decisions into domain services/workflow/policy service.

27.5 Readiness checks lie

Readiness returns UP while service cannot access required dependency or is overloaded.

Impact:

bad pods receive traffic,
false green dashboards.

Fix:

define shallow/deep readiness intentionally,
use overload-aware readiness carefully,
avoid liveness checks on dependencies, which restart a healthy process because of a dependency outage.

28. Topology Decision Record

Example ADR fragment.

# ADR: Runtime topology for Decision Service

## Context
Decision Service executes regulatory decisions that require policy evaluation, audit evidence, and durable event publication.

## Decision
Decision Service will run as an internal Kubernetes Deployment in `case-core`, minimum 3 replicas, spread across zones, behind ClusterIP Service, with service account `decision-service`, DB role `decision_rw`, and mTLS enabled via service mesh.

## Capacity constraints
- max replicas: 10
- DB pool per pod: 10
- policy timeout: 150 ms
- command timeout: 800 ms

## Failure policy
- Policy unavailable: fail closed for final decision.
- Broker unavailable: command writes outbox; publisher retries.
- Audit consumer lag: alert if audit event age > 5 minutes.

## Consequences
- Higher operational cost than single replica.
- Explicit DB connection budget required.
- Requires topology spread and PDB configuration.

29. Final Topology Diagram for Review Pack

This diagram should live with the architecture review pack. It should be updated when service topology changes.

30. Practical Exercise

Take one service from the case study and write its topology card:

Service name.
Namespace.
Owner.
Runtime type.
Min/max replicas.
Scaling metric.
Data ownership.
Inbound dependencies.
Outbound dependencies.
Security identity.
Timeout/retry/circuit breaker.
Failure policy.
SLO.
Alert/runbook links.
Deployment risk.

Then answer:

What happens if this service loses its database?
What happens if this service is slow but not down?
What happens if one zone disappears?
What happens if a duplicate command arrives?
What happens if telemetry pipeline is unavailable?
What happens if audit ingestion lags by 30 minutes?

If you cannot answer concretely, the topology is not production-ready.

31. Key Takeaways

Runtime topology is not infrastructure decoration; it is architecture in motion.
Every service needs a topology card, not only a code repository.
Scaling must be bounded by downstream capacity.
Namespace, node pool, service account, database role, broker principal, and telemetry pipeline are all architecture surfaces.
Readiness, liveness, rollout, shutdown, and autoscaling are part of service design.
Async does not mean less critical; audit and projection pipelines need explicit SLOs.
Gateway and mesh reduce platform burden but do not own business correctness.
A topology diagram without failure policy is only a map, not an architecture.

