Capstone Production Grade Platform
Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering - Part 035
Capstone design for a production-grade Kubernetes deployment platform, including architecture synthesis, invariants, workload onboarding, GitOps delivery, security, observability, reliability, governance, maturity assessment, and final review checklist.
Part 035 — Capstone: Designing a Production-Grade Kubernetes Deployment Platform
1. Why This Part Exists
This is the final part of the series.
The previous parts decomposed Kubernetes into focused subskills: API model, Pods, controllers, scheduling, deployment strategy, configuration, resources, networking, storage, security, policy, observability, debugging, reliability, upgrades, GitOps, packaging, CRDs, multi-tenancy, and platform engineering.
This part recomposes those pieces into one production-grade platform design.
That matters because Kubernetes knowledge is often learned in fragments:
I know Deployments.
I know Services.
I know Ingress.
I know Helm.
I know RBAC.
I know Prometheus.
That is useful, but insufficient.
A top engineer can connect those fragments into a coherent operating system for delivery:
A team can safely introduce a service, deploy it across environments, expose traffic, rotate secrets, observe behavior, respond to failure, recover from incidents, prove compliance, and evolve the platform without losing control.
A production platform is not a pile of YAML.
It is a set of enforced invariants, operating boundaries, feedback loops, and human workflows.
The goal of this capstone is to make that architecture visible.
2. Kaufman Skill Target
Using Josh Kaufman's learning model, this capstone targets the final synthesis skill:
Design, review, and evolve a Kubernetes-based deployment platform that is safe, observable, scalable, secure, governable, and usable by many product teams.
After this part, you should be able to:
- Design a production-grade Kubernetes deployment platform from first principles.
- Explain why each platform layer exists.
- Define platform invariants that prevent unsafe deployments.
- Classify workloads and map them to the correct Kubernetes primitives.
- Design a GitOps delivery model for multi-environment and multi-cluster deployment.
- Define security controls from image build to runtime and API access.
- Design observability around debugging and SLOs, not dashboards alone.
- Model failure domains and blast radius.
- Create review checklists for production readiness.
- Identify platform maturity gaps and plan the next improvement cycle.
The expected outcome is not memorization.
The expected outcome is architectural fluency.
3. The Capstone Scenario
Assume the following enterprise context.
You are designing a Kubernetes-based platform for an organization with:
- 80 engineering teams.
- 400 services.
- Several regulated workloads.
- Public APIs, internal APIs, async workers, scheduled jobs, and stateful workloads.
- Multiple environments: dev, test, staging, prod.
- Multiple regions for production.
- A mix of Java, Go, Node.js, Python, and frontend services.
- A compliance requirement for auditability, least privilege, vulnerability management, change traceability, and production incident evidence.
- A platform team that must enable delivery without becoming a manual ticket queue.
The business goal:
Enable teams to ship safely and independently while the organization retains operational, security, cost, and compliance control.
The engineering goal:
Create a platform where the safe path is the easiest path.
4. Production Platform Mental Model
A production Kubernetes platform has four major loops.
The platform is not simply the cluster.
It is the complete loop from developer intent to runtime feedback.
| Layer | Question It Answers |
|---|---|
| Product intent | What does the team want to run? |
| Platform API | What safe abstraction should the team use? |
| Git desired state | What versioned declaration represents that intent? |
| Kubernetes reconciliation | What actual state does the cluster converge toward? |
| Runtime signals | Is the system healthy, secure, efficient, and compliant? |
| Decision loop | What should humans or automation do next? |
A weak platform optimizes only the cluster layer.
A strong platform optimizes the full loop.
5. Non-Negotiable Platform Invariants
Before choosing tools, define invariants.
Invariants are rules that must hold even when teams move fast, people make mistakes, or systems partially fail.
| Invariant | Why It Exists | Example Enforcement |
|---|---|---|
| Every workload has an owner | Incident routing and accountability | Required owner, team, service labels |
| Every workload has resource requests | Scheduling and capacity safety | Admission policy rejects missing requests |
| Every public route has TLS | Traffic confidentiality | Gateway policy and certificate automation |
| Every production change is traceable | Audit and rollback | GitOps commit, PR, deployment metadata |
| Every workload has health semantics | Rollout safety | Readiness/startup/liveness probes where appropriate |
| Every production service has SLO metadata | Reliability management | Service catalog contract |
| Every secret has an external source of truth | Rotation and auditability | External secret integration |
| No workload runs privileged by default | Runtime hardening | Pod Security Admission / policy engine |
| No namespace is network-open by default | Blast-radius reduction | Default-deny NetworkPolicy baseline |
| Every deployment has rollback semantics | Incident recovery | Git revert, rollout rollback, traffic rollback |
| Every cluster emits standard signals | Operability | Metrics/logs/traces/events/audit pipeline |
| Every exception expires | Governance hygiene | Exception CRD or policy annotation with expiry |
A common mistake is to start with tool selection:
Should we use Argo CD, Flux, Helm, Kustomize, Kyverno, Gatekeeper, Istio, Linkerd, Backstage, Crossplane, Terraform, or something else?
That is backwards.
Start with invariants.
Then select the smallest toolset that can enforce and observe those invariants.
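As one illustration of enforcing invariants inside the cluster, the sketch below uses the built-in ValidatingAdmissionPolicy API (`admissionregistration.k8s.io/v1`, generally available from Kubernetes 1.30) to reject Deployments that lack an owner label or container resource requests. The policy and binding names, the `platform.example.com/team` label key, and the `platform.example.com/environment: prod` namespace label are assumptions made for this example. A policy engine such as Kyverno or Gatekeeper could express the same rules.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: workload-invariants.platform.example.com   # hypothetical name
spec:
  failurePolicy: Fail            # reject the request if the policy cannot be evaluated
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    # Invariant: every workload has an owner.
    - expression: >-
        has(object.metadata.labels) &&
        'platform.example.com/team' in object.metadata.labels
      message: "Deployment must carry a platform.example.com/team owner label."
    # Invariant: every workload has resource requests.
    - expression: >-
        object.spec.template.spec.containers.all(c,
        has(c.resources) && has(c.resources.requests) &&
        'cpu' in c.resources.requests &&
        'memory' in c.resources.requests)
      message: "Every container must declare CPU and memory requests."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: workload-invariants-prod                    # hypothetical name
spec:
  policyName: workload-invariants.platform.example.com
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        platform.example.com/environment: prod      # assumed namespace label
```

Binding the policy through a namespace selector lets the same rule run in a softer mode elsewhere, for example with `validationActions: ["Warn"]` in dev, before it is enforced in production.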
6. Reference Platform Architecture
A production-grade Kubernetes platform can be represented as layered capabilities.
Each layer must be owned.
A platform fails when ownership is vague.
For each capability, define:
- Who owns the default?
- Who can override it?
- How is override approved?
- How is behavior observed?
- How is failure handled?
- How is cost attributed?
- How is compliance evidence produced?
7. Platform Design From First Principles
A production Kubernetes deployment platform has to satisfy six forces.
These forces compete.
For example:
- More policy can improve safety but slow delivery.
- More abstraction can reduce cognitive load but hide important operational details.
- More shared infrastructure can reduce cost but increase blast radius.
- More self-service can improve autonomy but increase governance risk.
- More automation can reduce toil but amplify bad assumptions.
The architecture must balance these forces explicitly.
A top engineer does not say:
Let's automate everything.
A top engineer asks:
Which decisions are safe to automate, under which constraints, with what rollback and observability?
8. Workload Classification Model
Before onboarding a service, classify it.
Do not let every workload enter the platform as a generic Deployment.
| Workload Type | Kubernetes Primitive | Key Risk | Required Controls |
|---|---|---|---|
| Stateless HTTP API | Deployment + Service + Gateway/Ingress | Bad rollout affects users | Probes, PDB, HPA, canary, SLO alerts |
| Internal API | Deployment + Service | Dependency cascade | NetworkPolicy, retries/timeouts, SLO |
| Async queue worker | Deployment, optionally scaled by a KEDA ScaledObject | Backlog growth, duplicate processing | Idempotency, queue metrics, HPA/KEDA |
| Scheduled task | CronJob | Duplicate/missed execution | Concurrency policy, idempotency, alerting |
| Migration job | Job | Data corruption | Explicit approval, backup, one-shot semantics |
| Node agent | DaemonSet | Node instability | Toleration review, resource limits, priority |
| Stateful database | StatefulSet or operator | Data loss, split brain | Quorum model, backups, PDB, storage class |
| ML/batch compute | Job/Indexed Job | Cost explosion | Quota, priority, node pool isolation |
| Edge/gateway service | Deployment + Gateway | Public exposure risk | TLS, WAF integration, auth policy, rate limits |
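To make one row of the table concrete, here is a minimal sketch of the "Scheduled task" class with its duplicate and missed-execution controls expressed as CronJob fields. The job name, schedule, image, and resource values are assumptions for illustration only.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ledger-reconcile            # hypothetical job name
  namespace: prod-payments-api
spec:
  schedule: "15 2 * * *"            # 02:15 daily, in the controller's time zone unless spec.timeZone is set
  concurrencyPolicy: Forbid         # skip a new run while the previous one is still active
  startingDeadlineSeconds: 600      # treat the run as missed if it cannot start within 10 minutes
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 2               # retry a failed Pod at most twice
      activeDeadlineSeconds: 3600   # terminate a run that exceeds one hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: reconcile
              image: registry.example.com/ledger-reconcile@sha256:abc123   # placeholder digest
              resources:
                requests:
                  cpu: 250m
                  memory: 256Mi
                limits:
                  memory: 512Mi
```

`concurrencyPolicy: Forbid` reduces overlapping runs but does not guarantee exactly-once execution, which is why the table still lists idempotency as a required control.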
A good platform asks classification questions during onboarding.
Example onboarding form:
service:
  name: payments-api
  ownerTeam: payments-platform
  workloadType: stateless-http-api
  criticality: tier-1
  dataClassification: confidential
  exposure: public
  runtime: java
  expectedRps: 500
  p95LatencyTargetMs: 250
  availabilityTarget: 99.9
  dependencies:
    - postgres-payments
    - fraud-service
    - kafka-payments
  requiresPersistentStorage: false
  requiresPublicIngress: true
  requiresExternalSecrets: true
  deploymentStrategy: canary
This metadata should drive defaults.
A public tier-1 API should not receive the same deployment policy as an internal experimental worker.
9. Environment and Cluster Topology
A common enterprise topology is:
The topology should reflect failure domains and governance needs.
| Environment | Main Purpose | Design Bias |
|---|---|---|
| Dev | Fast feedback | Low friction, safe defaults, cheap capacity |
| Test | Integration validation | Representative dependencies and policy checks |
| Staging | Production rehearsal | Prod-like routing, policy, SLO, observability |
| Prod | User-facing operations | Strict governance, high availability, progressive delivery |
| DR | Business continuity | Recovery validation, backup restore, failover drills |
Avoid the trap of calling staging production-like when it lacks:
- Same admission policies.
- Same routing model.
- Same secret integration.
- Same observability signals.
- Same deployment controller behavior.
- Same resource constraints.
- Same failure injection or rollback practice.
A staging cluster that cannot detect production rollout failures is mostly theater.
10. Namespace and Tenant Model
Namespace design is one of the highest-leverage platform choices.
A practical model:
<environment>-<team>-<domain>
Examples:
dev-payments-api
staging-payments-api
prod-payments-api
prod-risk-engine
prod-observability
prod-platform-system
Each tenant namespace should receive a baseline pack.
baseline:
  rbac:
    team-admin: controlled
    deployer: gitops-only
    viewer: read-only
  quota:
    cpu: required
    memory: required
    objectCount: required
  security:
    podSecurity: restricted
    serviceAccountAutoMount: disabled-by-default
    privilegedPods: denied
  network:
    defaultDenyIngress: true
    defaultDenyEgress: true
    dnsEgress: allowed
  observability:
    metricsScrape: enabled
    logCollection: enabled
    tracePropagation: required-for-tier1
  cost:
    labels: required
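The security and network parts of such a baseline pack could translate into objects like the following. This is a sketch under stated assumptions: the namespace name and team label are examples, the cluster DNS service is assumed to run in `kube-system`, and the CNI plugin is assumed to enforce NetworkPolicy.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod-payments-api
  labels:
    # Pod Security Admission: enforce the "restricted" profile and surface violations.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    platform.example.com/team: payments     # assumed ownership label
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod-payments-api
spec:
  podSelector: {}            # selects every Pod in the namespace
  policyTypes:
    - Ingress
    - Egress                 # no rules listed, so all ingress and egress is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: prod-payments-api
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # assumes cluster DNS lives here
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so each service then adds narrow allow rules on top of the default deny rather than editing the baseline.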
A namespace is not a perfect security boundary.
For strong isolation, use separate clusters or node pools with stronger controls.
Decision guide:
| Requirement | Prefer Namespace | Prefer Dedicated Cluster |
|---|---|---|
| Same trust zone | Yes | No |
| Different regulatory boundary | No | Yes |
| Different admin group | Usually no | Yes |
| Noisy batch workload | Maybe | Often yes |
| Different upgrade cadence | No | Yes |
| Strong blast-radius isolation | No | Yes |
| Cost-sensitive shared services | Yes | Maybe |
11. GitOps Desired-State Model
GitOps should not be treated as a magic deployment button.
It is a control model:
Git stores desired state.
A controller compares desired state to live state.
Differences are surfaced or reconciled.
Human changes outside Git become drift.
Reference repo topology:
platform-gitops/
  clusters/
    dev/
      cluster-a/
        apps/
        platform-addons/
        policies/
    staging/
      cluster-a/
        apps/
        platform-addons/
        policies/
    prod/
      region-a/
        apps/
        platform-addons/
        policies/
      region-b/
        apps/
        platform-addons/
        policies/
  base/
    services/
      payments-api/
      fraud-service/
    platform/
      ingress/
      observability/
      policy/
  overlays/
    dev/
    staging/
    prod/
A simpler alternative is app-per-repo with environment folders.
There is no universal answer.
Use the topology that optimizes for ownership and review boundaries.
| Repo Model | Strength | Weakness |
|---|---|---|
| Monorepo GitOps | Global visibility, easier fleet review | Large blast radius, PR noise |
| App repo owns manifests | Team autonomy | Harder platform-wide governance |
| Environment repo | Clear promotion path | Cross-repo coordination |
| Generated manifests from platform API | Low cognitive load | Requires platform maturity |
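Whichever repo model is chosen, the reconciliation contract is declared to the GitOps controller. The sketch below assumes Argo CD is that controller. The repository URL, project name, and path are hypothetical, and a Flux `Kustomization` would express the same idea.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-prod-region-a
  namespace: argocd                      # assumes Argo CD runs in its conventional namespace
spec:
  project: payments                      # hypothetical Argo CD project
  source:
    repoURL: https://git.example.com/platform/platform-gitops.git   # hypothetical repository
    targetRevision: main
    path: clusters/prod/region-a/apps/payments-api
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: prod-payments-api
  syncPolicy:
    automated:
      prune: true       # delete live objects that were removed from Git
      selfHeal: true    # revert manual changes so that drift does not persist
```

Some teams leave `selfHeal` off in production and only surface drift. That is a governance choice rather than a technical limitation.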
A mature platform often evolves toward:
Developer submits intent through platform API.
Platform generates or updates desired state.
GitOps reconciles it.
Policy validates it.
Runtime signals verify it.
12. Deployment State Machine
Every production deployment should have an explicit state machine.
Each transition should have evidence.
| Transition | Evidence |
|---|---|
| Proposed → Built | CI logs, test report, commit SHA |
| Built → Scanned | vulnerability report, SBOM |
| Scanned → Signed | image signature, provenance attestation |
| Signed → DesiredStateUpdated | approved PR, change ticket if required |
| DesiredStateUpdated → Synced | GitOps sync status |
| Synced → Ready | Deployment conditions, Pod readiness |
| Ready → Canary | traffic split configuration |
| Canary → Promoted | metrics analysis, error rate, latency, saturation |
| Canary → RolledBack | rollback event, failing signal |
| Promoted → Stable | post-deploy observation window |
A deployment platform without evidence is hard to defend during incidents and audits.
13. Service Golden Path
The golden path is the default path from service creation to production operation.
It should not ask every team to rediscover platform architecture.
A good golden path includes:
- Repository template.
- Build pipeline.
- Container image rules.
- Service metadata contract.
- Default Kubernetes manifests or platform API object.
- Secrets integration.
- Observability setup.
- SLO template.
- Runbook template.
- Deployment strategy defaults.
- Security baseline.
- Cost attribution labels.
- Rollback instructions.
- Ownership and escalation metadata.
The developer should not need to know every cluster detail to deploy safely.
But the developer must know enough to operate responsibly.
14. Example Platform API
A platform API can expose workload intent rather than raw Kubernetes complexity.
Example custom resource:
apiVersion: platform.example.com/v1alpha1
kind: ServiceDeployment
metadata:
  name: payments-api
  namespace: prod-payments-api
  labels:
    platform.example.com/team: payments
    platform.example.com/tier: tier-1
spec:
  image:
    repository: registry.example.com/payments-api
    digest: sha256:abc123
  runtime:
    type: java
    port: 8080
  exposure:
    type: public-http
    host: payments.example.com
    tls: required
  reliability:
    availabilitySLO: "99.9"
    latencyP95Ms: 250
    errorRateThreshold: "1%"
  scaling:
    minReplicas: 4
    maxReplicas: 30
    metric: cpu-and-request-rate
  rollout:
    strategy: canary
    steps:
      - weight: 5
        duration: 5m
      - weight: 25
        duration: 10m
      - weight: 50
        duration: 10m
      - weight: 100
  resources:
    requests:
      cpu: "500m"
      memory: "768Mi"
    limits:
      memory: "1536Mi"
  secrets:
    - name: database-credentials
      source: external-secret-store
  dependencies:
    outbound:
      - fraud-service
      - kafka-payments
The platform controller or generator can translate this into:
- Deployment.
- Service.
- HTTPRoute.
- HPA/KEDA object.
- NetworkPolicy.
- ServiceMonitor or equivalent.
- Alert rules.
- Dashboard links.
- ExternalSecret.
- PodDisruptionBudget.
- Policy annotations.
This is how Kubernetes becomes a platform substrate.
Do not expose raw Kubernetes complexity unless the team needs it.
Do expose enough intent to allow safe governance.
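A platform API of this kind is typically registered as a CustomResourceDefinition, and the schema itself can carry some governance. The following is a partial sketch, not a complete schema: only `image`, `rollout`, and `resources` are declared, the enum values other than `canary` are assumptions, and a real definition would declare every field of the example resource, because undeclared fields are pruned by the API server.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: servicedeployments.platform.example.com   # must be <plural>.<group>
spec:
  group: platform.example.com
  scope: Namespaced
  names:
    kind: ServiceDeployment
    plural: servicedeployments
    singular: servicedeployment
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          required: ["spec"]
          properties:
            spec:
              type: object
              required: ["image", "resources"]
              properties:
                image:
                  type: object
                  required: ["repository", "digest"]
                  properties:
                    repository:
                      type: string
                    digest:
                      type: string
                      # Only full sha256 digests are accepted; mutable tags cannot be expressed.
                      pattern: "^sha256:[a-f0-9]{64}$"
                rollout:
                  type: object
                  properties:
                    strategy:
                      type: string
                      enum: ["rolling", "canary", "blue-green"]   # assumed set of strategies
                resources:
                  type: object
                  # Left open in this sketch; a real schema would spell out requests and limits.
                  x-kubernetes-preserve-unknown-fields: true
```

With this pattern in place, the placeholder digest `sha256:abc123` used in the example resource would be rejected, which is the intended effect for real deployments.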
15. Production Workload Baseline
For most stateless production services, baseline objects include:
Namespace
ResourceQuota
LimitRange
ServiceAccount
Role / RoleBinding
Secret reference or ExternalSecret
ConfigMap
Deployment
Service
HTTPRoute or Ingress
NetworkPolicy
HorizontalPodAutoscaler
PodDisruptionBudget
ServiceMonitor / PodMonitor / telemetry config
Alert rules
Dashboard metadata
Runbook link
The minimum Deployment contract:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  labels:
    app.kubernetes.io/name: payments-api
    app.kubernetes.io/part-of: payments
    app.kubernetes.io/managed-by: gitops
    platform.example.com/team: payments
    platform.example.com/tier: tier-1
spec:
  replicas: 4
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count during a rollout
      maxSurge: 25%
  selector:
    matchLabels:
      app.kubernetes.io/name: payments-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payments-api
        platform.example.com/team: payments
    spec:
      serviceAccountName: payments-api
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          # Pinned by digest (placeholder value), not by a mutable tag.
          image: registry.example.com/payments-api@sha256:abc123
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 500m
              memory: 768Mi
            limits:
              memory: 1536Mi
          # Allows up to 30 x 2s = 60s for startup before the other probes take over.
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
          # Controls whether the Pod receives traffic.
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            periodSeconds: 5
          # Restarts the container only when the process itself is unhealthy.
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            periodSeconds: 10
          # Gives the application a chance to drain in-flight requests before SIGTERM.
          lifecycle:
            preStop:
              httpGet:
                path: /shutdown/drain
                port: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
This is still not enough by itself.
The surrounding platform controls determine whether this service is production-grade.
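Two of those surrounding controls, the PodDisruptionBudget and the HorizontalPodAutoscaler from the baseline list, might look like this for the same service. The thresholds are illustrative assumptions, not recommendations.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
  namespace: prod-payments-api
spec:
  maxUnavailable: 1            # voluntary disruptions (drains, upgrades) may evict one Pod at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: payments-api
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
  namespace: prod-payments-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 4
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # assumed target, relative to the CPU request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # wait five minutes before scaling down to avoid flapping
```

When an HPA owns the replica count, it is common to omit `spec.replicas` from the Git-managed Deployment so that the GitOps controller and the autoscaler do not fight over the value.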
16. Traffic and Exposure Model
Traffic design should separate four concerns:
- Service identity.
- Routing.
- Security.
- Progressive delivery.
Guidelines:
| Concern | Platform Decision |
|---|---|
| Public HTTP route | Use Gateway API where available; Ingress only when simpler or already standardized |
| Internal service discovery | Use Service DNS name, not Pod IP |
| Canary | Prefer traffic splitting at route/mesh/progressive delivery layer |
| TLS | Automate certificates and make TLS default for public routes |
| East-west auth | Use workload identity or a mesh only when the requirement justifies the complexity |
| Egress | Govern with NetworkPolicy, egress gateway, firewall, or cloud-native controls |
| External dependencies | Make dependency ownership and failure behavior explicit |
Public exposure should require stricter review than internal-only services.
Example exposure policy:
publicExposurePolicy:
  required:
    - ownerTeam
    - dataClassification
    - tls
    - authenticationModel
    - rateLimitPlan
    - abuseMonitoring
    - accessLogRetention
    - incidentRunbook
    - rollbackPlan
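For the canary guideline above, traffic splitting at the route layer can be sketched with a Gateway API `HTTPRoute` that weights two backends. The Gateway name and namespace, the listener name, and the `-stable` and `-canary` Service names are assumptions. TLS is assumed to terminate on the referenced Gateway listener, and that listener must allow routes from this namespace.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payments-api
  namespace: prod-payments-api
spec:
  parentRefs:
    - name: public-gateway              # hypothetical shared Gateway
      namespace: prod-platform-system
      sectionName: https                # assumed name of the TLS listener
  hostnames:
    - payments.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: payments-api-stable     # hypothetical Service for the current version
          port: 8080
          weight: 95
        - name: payments-api-canary     # hypothetical Service for the new version
          port: 8080
          weight: 5
```

A progressive delivery controller would normally adjust these weights step by step. Editing them by hand is shown only to make the mechanism visible.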
17. Security Architecture
Security must span the full lifecycle.
Security controls by stage:
| Stage | Controls |
|---|---|
| Source | branch protection, dependency review, secret scanning |
| Build | isolated builder, reproducible pipeline, provenance, SBOM |
| Registry | private registry, digest references, retention, vulnerability scan |
| Admission | image signature verification, policy-as-code, Pod Security, quota |
| Runtime | non-root, seccomp, read-only root filesystem, NetworkPolicy, RBAC |
| Operations | audit logs, incident evidence, patching, key rotation, exception expiry |
A strong default-deny platform posture:
Deny privileged pods by default.
Deny hostPath by default.
Deny hostNetwork by default.
Deny containers without resource requests.
Deny mutable image tags in production.
Deny public ingress without TLS.
Deny missing owner labels.
Deny service account token auto-mount unless explicitly required.
Deny workloads in production without readiness semantics.
Exceptions must be explicit, reviewed, time-bounded, and observable.
18. Observability Architecture
Observability should be designed around questions.
Not around tools.
Core questions:
- Is the service available?
- Is latency within SLO?
- Is error rate increasing?
- Is saturation approaching a limit?
- Is a rollout causing regression?
- Which dependency is failing?
- Which version is affected?
- Which node, zone, cluster, tenant, or route is involved?
- What changed recently?
- What evidence is needed for incident review?
Signal model:
Minimum production telemetry contract:
| Signal | Required Fields |
|---|---|
| Metrics | service, namespace, cluster, version, status, route, dependency |
| Logs | timestamp, severity, trace ID, request ID, service, version |
| Traces | service graph, spans, dependency latency, errors |
| Events | object, reason, message, involved object, timestamp |
| Audit | user, verb, resource, namespace, decision, source IP |
| Deployment events | commit SHA, image digest, rollout ID, strategy, approver |
Alerting should be symptom-first.
Bad alert:
Pod restarted once.
Better alert:
payments-api availability burn rate is consuming the error budget too fast in prod-region-a.
Platform teams should still monitor platform internals, but product teams should be alerted primarily on user-impacting symptoms.
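A symptom-first alert of the "better" kind can be written as a multi-window burn-rate rule. The sketch below assumes the Prometheus Operator `PrometheusRule` CRD is installed, that the service exposes a counter named `http_requests_total` with `service` and `code` labels (both hypothetical here), and a 99.9% availability SLO, which leaves an error budget of 0.001. A burn rate of 14.4 corresponds to spending about 2% of a 30-day budget in one hour.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slo
  namespace: prod-payments-api
spec:
  groups:
    - name: payments-api.slo
      rules:
        - alert: PaymentsApiFastErrorBudgetBurn
          # Fires only when both the long (1h) and short (5m) windows exceed the threshold,
          # so the alert reacts quickly and also stops soon after recovery.
          expr: |
            (
              sum(rate(http_requests_total{service="payments-api",code=~"5.."}[1h]))
              /
              sum(rate(http_requests_total{service="payments-api"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{service="payments-api",code=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="payments-api"}[5m]))
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: page
            team: payments
          annotations:
            summary: "payments-api is burning its availability error budget too fast"
```

Adding `version` to the aggregation, for example `sum by (version) (...)`, would let the same rule indicate whether a specific rollout is responsible.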
19. Reliability and Failure Model
Reliability is not achieved by adding replicas randomly.
It is achieved by understanding failure domains.
Map controls to failure domains.
| Failure Domain | Failure Example | Control |
|---|---|---|
| Pod | crash loop, OOM, bad config | probes, limits, rollout rollback |
| Node | disk pressure, kubelet issue | replicas across nodes, eviction handling |
| Zone | zone outage | topology spread, multi-zone node pools |
| Cluster | API/control plane or CNI failure | multi-cluster strategy, DR plan |
| Region | regional outage | active-active or active-passive design |
| Dependency | database outage | circuit breaker, fallback, queue buffering |
| Human change | bad manifest, wrong secret | GitOps review, policy, progressive delivery |
| Supply chain | malicious or vulnerable image | signing, scanning, admission verification |
Reliability checklist for a tier-1 HTTP service:
reliability:
  replicas:
    minimum: 3
    spreadAcrossNodes: true
    spreadAcrossZones: true
  probes:
    startup: required
    readiness: required
    liveness: conservative
  rollout:
    progressive: required
    rollback: automatic-or-fast-manual
    maxUnavailable: 0
  disruption:
    podDisruptionBudget: required
    gracefulTermination: required
  autoscaling:
    hpa: required
    metric: cpu-plus-traffic-or-custom
    scaleDownStabilization: required
  observability:
    slo: required
    burnRateAlerts: required
    dashboard: required
    runbook: required
  dependencies:
    timeout: required
    retryBudget: required
    fallbackOrDegradedMode: reviewed
The important principle:
A platform should not only deploy workloads. It should preserve service behavior under expected failure.
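The node and zone spreading items in the checklist map onto `topologySpreadConstraints` in the Pod template. The fragment below shows only the scheduling-related fields of a Deployment. The label selector reuses the `payments-api` example, and the grace period value is an assumption.

```yaml
# Fragment of a Deployment manifest, not a complete object.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45        # assumed time needed to drain in-flight requests
      topologySpreadConstraints:
        # Hard requirement: keep replicas balanced across zones.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: payments-api
        # Soft preference: spread across nodes when capacity allows.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: payments-api
```

Making the zone constraint hard trades schedulability for isolation: if one zone runs out of capacity, new Pods stay Pending instead of piling into the remaining zones.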
20. Capacity and Cost Model
Capacity management in Kubernetes has two layers.
- Workload resource declarations.
- Cluster/node capacity supply.
Cost control requires both technical and organizational mechanisms.
Technical controls:
- Resource requests required.
- Memory limits required for most workloads.
- CPU limits used carefully, not blindly.
- Namespace ResourceQuota.
- LimitRange defaults only when safe.
- Dedicated node pools for expensive workload classes.
- Autoscaling with scale-down stabilization.
- Idle workload detection.
- Object count quotas.
- Storage lifecycle policy.
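Several of these technical controls are expressed per namespace with `ResourceQuota` and `LimitRange`. The numbers below are placeholders that a capacity review would replace.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: prod-payments-api
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.memory: 160Gi
    pods: "200"
    count/deployments.apps: "50"      # object count quota
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: prod-payments-api
spec:
  limits:
    - type: Container
      defaultRequest:                 # applied when a container declares no requests
        cpu: 100m
        memory: 128Mi
      default:                        # applied when a container declares no limits
        memory: 256Mi
```

Once a quota covers a resource such as `requests.cpu`, Pods that do not declare it are rejected unless a LimitRange supplies a default. Silent defaults can hide undersized workloads, which is the reason for using them only when safe.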
Organizational controls:
- Required cost labels.
- Team-level reporting.
- Environment-level budgets.
- Expensive workload approval.
- Capacity review for tier-1 systems.
- Regular rightsizing review.
Example cost labels:
metadata:
  labels:
    cost.example.com/team: payments
    cost.example.com/product: billing
    cost.example.com/environment: prod
    cost.example.com/criticality: tier-1
A platform that cannot attribute cost will eventually lose trust.
21. Stateful and Data-Aware Platform Design
Stateful workloads require stricter review.
A platform can support stateful systems, but should not pretend that StatefulSet alone solves data safety.
Stateful review questions:
- What is the consistency model?
- What is the quorum model?
- What happens during node drain?
- What happens during zone loss?
- What is the backup frequency?
- Has restore been tested?
- What is RPO?
- What is RTO?
- Can storage be expanded safely?
- Can the workload tolerate rescheduling?
- Does it require local SSD or topology-aware placement?
- Is the operator mature and understood?
- How are schema migrations performed?
- How is split brain prevented?
Data platform baseline:
statefulBaseline:
  backup:
    required: true
    restoreTested: true
    frequency: workload-specific
  disruption:
    pdb: required
    nodeDrainRunbook: required
  storage:
    storageClass: approved-only
    volumeExpansion: reviewed
    reclaimPolicy: explicit
  topology:
    zoneSpread: reviewed
    quorumAware: true
  security:
    encryptionAtRest: required
    secretRotation: required
  operations:
    runbook: required
    upgradePlan: required
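The storage part of this baseline usually starts with a small set of approved StorageClasses. A sketch of one such class follows. The class name and the `csi.example.com` provisioner are placeholders, and encryption settings are omitted because they are driver-specific parameters.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: stateful-ssd-retain                 # hypothetical approved class
provisioner: csi.example.com                # placeholder; use the cluster's actual CSI driver
reclaimPolicy: Retain                       # keep the volume when the claim is deleted
allowVolumeExpansion: true                  # permits resizing claims, subject to driver support
volumeBindingMode: WaitForFirstConsumer     # provision in the zone where the Pod is scheduled
```

`Retain` protects against accidental data loss but leaves released volumes behind, so it needs a matching cleanup process under the storage lifecycle policy.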
For many enterprises, the best platform decision is not to run every database directly on Kubernetes.
Sometimes the right answer is managed database plus Kubernetes application layer.
Top engineers optimize for system reliability, not tool purity.
22. Upgrade and Lifecycle Management
A platform is a living system.
Cluster upgrades, API deprecations, CRD changes, CNI upgrades, CSI upgrades, policy engine upgrades, GitOps controller upgrades, and node OS patching all affect safety.
Upgrade operating model:
Required upgrade controls:
| Control | Purpose |
|---|---|
| API deprecation scan | Detect manifests using removed APIs |
| Add-on compatibility matrix | Prevent CNI/CSI/ingress/GitOps breakage |
| Webhook inventory | Avoid admission outage during upgrade |
| CRD conversion review | Protect custom resources |
| Node drain simulation | Validate workload disruption controls |
| Backup validation | Recover etcd/config/platform state |
| Rollback reality check | Know what can and cannot be rolled back |
| Communication plan | Prevent surprise operational impact |
Never treat upgrades as purely infrastructure maintenance.
They are product-impacting changes.
23. Incident Response Model
A production platform should make incident response faster.
Incident flow:
Incident evidence checklist:
What changed?
Which commit/image/manifest version?
Which cluster/namespace/service/version?
Which user-facing SLO degraded?
Which dependency showed errors?
Which Kubernetes events appeared?
Were Pods rescheduled, killed, throttled, or OOMKilled?
Was traffic shifted?
Did readiness remove bad Pods?
Was autoscaling involved?
Was NetworkPolicy, DNS, storage, or admission involved?
Which rollback path was used?
What prevention should become policy or automation?
Runbook template:
# Runbook: <service-name>
## Ownership
- Team:
- Slack/channel:
- Escalation:
## Service Summary
- Workload type:
- Criticality:
- Public/internal:
- Dependencies:
## SLO
- Availability:
- Latency:
- Error budget policy:
## Dashboards
- Service dashboard:
- Deployment dashboard:
- Dependency dashboard:
## Common Failure Modes
- Rollout regression:
- Dependency outage:
- Database saturation:
- Queue backlog:
- DNS/network issue:
## First 10 Minutes
1. Check SLO dashboard.
2. Check recent deployments.
3. Check error rate and latency by version.
4. Check Kubernetes events.
5. Check dependency health.
6. Decide rollback, traffic shift, scale, or dependency mitigation.
## Rollback
- Command/process:
- Expected duration:
- Risks:
## Post-Incident Evidence
- Logs:
- Metrics:
- Traces:
- Events:
- Git commits:
A platform should generate much of this automatically from service metadata.
24. Governance Without Becoming a Bottleneck
Governance should be encoded into the platform where possible.
Manual review should be reserved for high-risk decisions.
| Decision | Automate | Require Review |
|---|---|---|
| Deploy patch version to dev | Yes | No |
| Deploy low-risk internal service to staging | Yes | Usually no |
| Expose public route | Partly | Yes |
| Add privileged container | No | Yes |
| Add hostPath mount | No | Yes |
| Increase production max replicas 10x | Partly | Yes |
| Create new stateful database | No | Yes |
| Rotate secret | Yes | No, if standard path |
| Rollback bad deployment | Yes | No, if approved strategy |
| Add new cluster-wide controller | No | Yes |
Governance anti-patterns:
Everything requires a ticket.
Every exception is permanent.
Policy is documented but not enforced.
Security approval happens after deployment.
Platform team manually edits YAML for teams.
Production access is broad because it is convenient.
Nobody owns cost after launch.
Runbooks are created once and never tested.
Better model:
Low-risk path is automated.
High-risk path is reviewable.
Unsafe path is blocked.
Exceptions expire.
Evidence is captured automatically.
25. Production Readiness Review
Use this checklist before approving a service for production.
25.1 Ownership
- Service has an owning team.
- Escalation channel exists.
- Service catalog entry exists.
- Criticality tier is declared.
- Runbook is linked.
25.2 Deployment
- Image uses digest, not mutable tag, for production.
- Deployment strategy is explicit.
- Rollback path is tested.
- Readiness semantics are correct.
- Startup behavior is understood.
- Graceful shutdown is implemented.
- Deployment metadata includes commit SHA and image digest.
25.3 Resource and Scaling
- CPU and memory requests are set.
- Memory limits are set.
- HPA/KEDA/VPA strategy is defined where needed.
- Scale-up and scale-down behavior are tested.
- Quota impact is understood.
- Cost labels are present.
25.4 Networking
- Service type is correct.
- Public exposure is reviewed.
- TLS is configured.
- NetworkPolicy baseline is applied.
- Egress dependencies are declared.
- DNS behavior is understood.
25.5 Security
- ServiceAccount is least privilege.
- Service account token is not auto-mounted unless needed.
- Pod runs as non-root.
- Privilege escalation is disabled.
- Capabilities are dropped.
- Root filesystem is read-only where possible.
- Secrets come from approved source.
- Image has SBOM/signature/provenance where required.
- Vulnerability threshold is accepted or exception is approved.
25.6 Observability
- Metrics are emitted.
- Logs are structured and correlated.
- Traces exist for tier-1 request paths.
- Kubernetes events are visible.
- Dashboard exists.
- Alerts map to user impact.
- Deployment regression can be detected by version.
25.7 Reliability
- SLO exists.
- Error budget policy exists for tier-1 service.
- PodDisruptionBudget is configured where needed.
- Topology spread is configured where needed.
- Dependency failure behavior is defined.
- Load test or capacity estimate exists.
- Backup/restore exists for stateful dependencies.
25.8 Operations
- On-call team understands runbook.
- Incident evidence can be collected.
- Rollback is practiced.
- Upgrade compatibility is understood.
- DR behavior is known for critical services.
Production readiness is not a one-time checklist.
It should become a recurring review cycle.
26. Platform Maturity Model
Use this model to assess the platform.
| Level | Description | Symptoms |
|---|---|---|
| 0 | Cluster-as-a-service | Teams get kubeconfig and write their own YAML |
| 1 | Standard manifests | Templates exist, but enforcement is weak |
| 2 | Guardrailed delivery | GitOps, RBAC, policy, observability baseline exist |
| 3 | Golden path platform | Self-service onboarding, service catalog, standard SLO/runbook patterns |
| 4 | Productized platform | Platform APIs, user research, measured adoption, exception governance |
| 5 | Adaptive platform | Automated remediation, SLO-aware rollout, predictive capacity, mature fleet governance |
Most organizations should not jump directly to level 5.
A practical improvement sequence:
1. Inventory what runs.
2. Standardize labels and ownership.
3. Require resources and security baseline.
4. Centralize GitOps delivery.
5. Add observability baseline.
6. Introduce production readiness review.
7. Add progressive delivery for critical services.
8. Build developer portal/golden path.
9. Introduce platform APIs for common workload classes.
10. Mature fleet, cost, and compliance automation.
Do not build advanced abstractions before you have operational visibility.
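The maturity levels are cumulative: a platform does not reach a level by adopting its features while skipping the levels below it. The sketch below encodes that rule. The capability names are hypothetical labels derived from the table above, not an established taxonomy.

```python
# Capabilities required at each level (hypothetical names based on the table).
LEVEL_REQUIREMENTS: dict[int, set[str]] = {
    1: {"standard_templates"},
    2: {"gitops", "rbac", "policy", "observability_baseline"},
    3: {"self_service_onboarding", "service_catalog", "standard_slo_runbooks"},
    4: {"platform_apis", "user_research", "adoption_metrics", "exception_governance"},
    5: {"automated_remediation", "slo_aware_rollout", "predictive_capacity",
        "fleet_governance"},
}


def assess_level(capabilities: set[str]) -> int:
    """Return the highest level whose requirements, and all lower ones, are met."""
    level = 0
    for candidate in sorted(LEVEL_REQUIREMENTS):
        if LEVEL_REQUIREMENTS[candidate] <= capabilities:
            level = candidate
        else:
            # Stop at the first gap: higher-level features do not count yet.
            break
    return level


have = LEVEL_REQUIREMENTS[1] | LEVEL_REQUIREMENTS[2] | LEVEL_REQUIREMENTS[4]
print(assess_level(have))  # level 3 is missing, so the platform is assessed at 2
```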
27. Architecture Decision Records
A production platform needs recorded decisions.
Example ADR topics:
ADR-001: Kubernetes environment and cluster topology
ADR-002: Namespace and tenant model
ADR-003: GitOps repository model
ADR-004: Standard label taxonomy
ADR-005: Default Pod security profile
ADR-006: Public traffic routing model
ADR-007: Service mesh adoption or non-adoption
ADR-008: Secret management architecture
ADR-009: Image signing and verification policy
ADR-010: Observability data retention
ADR-011: SLO and alerting standards
ADR-012: Stateful workload policy
ADR-013: Upgrade and version-skew process
ADR-014: Exception governance model
ADR-015: Platform API evolution strategy
ADR template:
# ADR: <title>
## Status
Proposed | Accepted | Deprecated | Superseded
## Context
What problem are we solving?
## Decision
What did we choose?
## Alternatives Considered
What else did we evaluate?
## Consequences
Positive and negative outcomes.
## Operational Impact
How does this affect deployment, security, reliability, cost, or incident response?
## Review Date
When should this decision be revisited?
Without ADRs, platform decisions become folklore.
Folklore does not scale.
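A template only helps if ADRs actually follow it. One option is a small check in CI that rejects ADR files with missing sections. The sketch below assumes ADRs are Markdown files using the `## ` headings from the template above. The function name and the sample draft are illustrative.

```python
REQUIRED_SECTIONS = (
    "Status",
    "Context",
    "Decision",
    "Alternatives Considered",
    "Consequences",
    "Operational Impact",
    "Review Date",
)


def missing_sections(adr_text: str) -> list[str]:
    """Return template sections that have no matching '## ' heading."""
    headings = {
        line[3:].strip()
        for line in adr_text.splitlines()
        if line.startswith("## ")
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]


draft = """# ADR: Namespace and tenant model
## Status
Proposed
## Context
Teams share clusters and need isolation.
## Decision
One namespace per team per environment.
"""
print(missing_sections(draft))
```

The check says nothing about the quality of a decision. It only guarantees that consequences, operational impact, and a review date were written down.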
28. Capstone Exercise
Design a production Kubernetes platform for this target state:
- 3 production clusters across 2 active regions and 1 DR region.
- 1 staging cluster.
- 2 non-prod shared clusters.
- 80 teams.
- 400 services.
- 30 tier-1 services.
- 60 public routes.
- 20 stateful workloads.
- Regulatory audit requirement.
Your deliverables:
28.1 Architecture Diagram
Create a Mermaid diagram showing:
- Developer portal.
- CI.
- Registry.
- SBOM/signing/provenance.
- GitOps desired-state repo.
- Policy engine.
- Clusters.
- Gateway layer.
- Observability stack.
- Incident response loop.
28.2 Workload Taxonomy
Classify at least five workload types:
- Public HTTP API.
- Internal API.
- Queue worker.
- CronJob.
- Stateful workload.
For each, define:
- Kubernetes primitives.
- Required policies.
- Observability contract.
- Rollback model.
- Failure modes.
28.3 GitOps Model
Define:
- Repository layout.
- Promotion process.
- Review rules.
- Drift handling.
- Secret handling.
- Rollback process.
28.4 Security Baseline
Define:
- RBAC model.
- Pod Security profile.
- NetworkPolicy baseline.
- Image policy.
- Secret management.
- Admission controls.
- Exception process.
28.5 Reliability Model
Define:
- SLO taxonomy.
- Alerting rules.
- PDB policy.
- Topology spread policy.
- Progressive delivery standard.
- Dependency failure behavior.
- Incident response process.
28.6 Platform Product Model
Define:
- Golden paths.
- Self-service flows.
- Platform API scope.
- Service catalog metadata.
- Developer documentation.
- Platform success metrics.
The exercise is complete when another senior engineer can review the design and understand:
What is safe by default?
What is self-service?
What requires review?
What is blocked?
What is observable?
What fails where?
Who owns each decision?
29. Final Integrated Review Template
Use this as a final platform review.
# Kubernetes Platform Review
## 1. Scope
- Clusters:
- Environments:
- Teams:
- Workload classes:
- Critical systems:
## 2. Architecture
- Cluster topology:
- Network topology:
- Traffic ingress:
- Delivery architecture:
- Observability architecture:
- Security architecture:
## 3. Invariants
- Ownership:
- Resource governance:
- Security baseline:
- Deployment traceability:
- SLO requirements:
- Secret handling:
- Exception expiry:
## 4. Workload Standards
- Stateless HTTP:
- Internal service:
- Worker:
- Batch:
- Stateful:
- Daemon:
## 5. Delivery
- CI:
- Registry:
- GitOps:
- Promotion:
- Rollback:
- Progressive delivery:
## 6. Security
- Identity:
- RBAC:
- Admission:
- Pod security:
- Network policy:
- Supply chain:
- Audit:
## 7. Reliability
- SLOs:
- Alerts:
- PDB:
- Topology:
- Capacity:
- DR:
- Incident response:
## 8. Operations
- Upgrade process:
- Add-on lifecycle:
- Backup and restore:
- Cost review:
- Runbooks:
- On-call:
## 9. Platform Product
- Golden paths:
- Developer portal:
- Service catalog:
- Feedback loops:
- Adoption metrics:
- Roadmap:
## 10. Risks and Next Actions
- Critical risks:
- Near-term improvements:
- Long-term improvements:
- Decision records required:
30. Common Production Anti-Patterns
30.1 YAML as Platform
Having many manifests is not the same as having a platform.
If every team copies YAML and edits it independently, you have distributed configuration drift.
30.2 Dashboards Without SLOs
Dashboards are useful for investigation.
They are not a reliability strategy.
Without SLOs and alert policy, dashboards become wall decoration.
30.3 Security by Documentation
A security standard that is only written in a wiki will eventually be bypassed.
Critical standards must become admission policy, CI checks, or runtime detection.
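To illustrate the difference between a documented standard and an enforced one, the sketch below expresses a small baseline as a check over a Pod manifest loaded as a dictionary. The field names follow the Kubernetes Pod spec. The choice of rules is an assumption for this example, and a real platform would express the same logic in its admission policy engine rather than in ad hoc Python.

```python
def baseline_violations(pod: dict) -> list[str]:
    """Return human-readable violations of a minimal security and resource baseline."""
    spec = pod.get("spec", {})
    pod_security = spec.get("securityContext", {})
    violations: list[str] = []
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext", {})
        # A container-level runAsNonRoot takes precedence over the Pod-level value.
        non_root = security.get("runAsNonRoot", pod_security.get("runAsNonRoot", False))
        if non_root is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        # Treat an unset field as a violation so the safe value is always explicit.
        if security.get("allowPrivilegeEscalation", True) is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        requests = container.get("resources", {}).get("requests", {})
        for resource in ("cpu", "memory"):
            if resource not in requests:
                violations.append(f"{name}: missing resources.requests.{resource}")
    return violations


pod = {"spec": {"containers": [{"name": "app"}]}}
for violation in baseline_violations(pod):
    print(violation)
```

The same function can run in CI against rendered manifests, which gives teams feedback before the admission layer rejects a deployment.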
30.4 GitOps Without Ownership
GitOps can make bad desired state converge faster.
It does not replace ownership, review, testing, or policy.
30.5 Platform Team as YAML Helpdesk
If the platform team spends most of its time editing manifests for application teams, the platform is not self-service.
30.6 Service Mesh by Default
Service mesh can be powerful.
It can also add operational load, latency, debugging complexity, and upgrade risk.
Adopt it for clear requirements, not because it looks mature.
30.7 StatefulSet as Database Strategy
StatefulSet gives stable identity and storage association.
It does not automatically solve backup, quorum, restore, consistency, failover, or schema migration.
30.8 Autoscaling Without Backpressure
Autoscaling can help with demand changes.
It cannot by itself fix unbounded queues, slow dependencies, retry storms, or inefficient code.
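Backpressure starts with a bound. The sketch below uses Python's standard `queue.Queue` with a `maxsize` so that producers get an immediate rejection when the worker pool is saturated, instead of growing an unbounded backlog. The helper name `try_enqueue` and the capacity of 3 are assumptions for illustration.

```python
import queue


def try_enqueue(work_queue: queue.Queue, item: object) -> bool:
    """Accept work only while the queue has room; otherwise signal the caller to back off."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False


jobs: queue.Queue = queue.Queue(maxsize=3)
accepted = [try_enqueue(jobs, n) for n in range(5)]
print(accepted)  # [True, True, True, False, False]
```

A rejected request can be mapped to a retryable error with a delay hint, which gives autoscaling time to add capacity without letting latency grow unbounded.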
30.9 Production and Staging Drift
If staging does not share production-like policy, routing, observability, and resource constraints, it cannot validate production behavior.
30.10 Permanent Exceptions
Exceptions that never expire become the real policy.
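One way to keep exceptions temporary is to store them as data with an owner and an expiry date, and to fail a scheduled check when any entry is expired or has no expiry at all. The record shape and field names below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PolicyException:
    """A recorded deviation from a platform policy (hypothetical shape)."""
    policy: str
    workload: str
    owner: str
    expires: Optional[date]  # None models the anti-pattern: no expiry


def invalid_exceptions(exceptions: list[PolicyException],
                       today: date) -> list[PolicyException]:
    """Return exceptions that have expired or that never expire."""
    return [e for e in exceptions if e.expires is None or e.expires < today]


registry = [
    PolicyException("run-as-non-root", "legacy-batch", "team-data", date(2025, 3, 31)),
    PolicyException("require-requests", "search-api", "team-search", date(2026, 1, 15)),
    PolicyException("no-host-network", "node-agent", "team-infra", None),
]
for entry in invalid_exceptions(registry, today=date(2025, 6, 1)):
    print(entry.workload, entry.policy)
```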
31. What a Top 1% Engineer Sees
A beginner sees Kubernetes as YAML.
An intermediate engineer sees Kubernetes as workloads and networking.
A strong senior engineer sees Kubernetes as reconciliation, scheduling, and runtime contracts.
A top engineer sees Kubernetes as an organizational operating system for software delivery.
They ask:
What is the desired state?
Who is allowed to change it?
What validates it?
What reconciles it?
What observes it?
What happens when it fails?
Who owns recovery?
How is the decision audited?
How does the platform evolve safely?
They do not over-index on tools.
They reason in invariants, failure domains, control loops, and human workflows.
32. Final Mental Model
The entire series can be compressed into this model:
Kubernetes is powerful because it separates intent from execution.
A platform is valuable because it makes that separation safe for humans, teams, and organizations.
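The separation of intent from execution can be reduced to a small function: given desired state and observed state, compute the actions that close the gap. The sketch below is a deliberately simplified illustration using replica counts. It is not how any Kubernetes controller is actually implemented, and the workload names are made up.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Compute the actions that move observed replica counts toward the desired ones."""
    actions: list[str] = []
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name, 0)  # absent from desired state means it should not run
        have = actual.get(name, 0)   # absent from actual state means nothing is running
        if want > have:
            actions.append(f"scale up {name}: {have} -> {want}")
        elif want < have:
            actions.append(f"scale down {name}: {have} -> {want}")
    return actions


desired = {"api": 3, "worker": 2}
actual = {"api": 1, "legacy": 1}
for action in reconcile(desired, actual):
    print(action)
```

The caller declares only `desired`. Everything a platform adds, such as review, policy, and observability, wraps around who may change that declaration and how the resulting actions are verified.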
33. Final Series Completion Note
This is the final part of the series:
learn-kubernetes-deployment-model-part-035-capstone-production-grade-platform.mdx
The series is now complete.
You should now have a full advanced map of Kubernetes and deployment model knowledge from first principles through production platform design.
The next step is not to read more Kubernetes concepts randomly.
The next step is to apply the model to a real platform design:
- Choose a representative service.
- Classify its workload type.
- Write its platform intent contract.
- Generate its Kubernetes desired state.
- Add policy, observability, SLO, rollback, and runbook.
- Simulate failure.
- Review what the platform made easy and what it failed to support.
That is where knowledge becomes engineering judgment.
34. References
- Kubernetes Documentation — Production Environment: https://kubernetes.io/docs/setup/production-environment/
- Kubernetes Documentation — Concepts Overview: https://kubernetes.io/docs/concepts/
- Kubernetes Documentation — Kubernetes Components: https://kubernetes.io/docs/concepts/overview/components/
- Kubernetes Documentation — Workloads: https://kubernetes.io/docs/concepts/workloads/
- Kubernetes Documentation — Services, Load Balancing, and Networking: https://kubernetes.io/docs/concepts/services-networking/
- Kubernetes Documentation — Storage: https://kubernetes.io/docs/concepts/storage/
- Kubernetes Documentation — Security: https://kubernetes.io/docs/concepts/security/
- Kubernetes Documentation — RBAC: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
- Kubernetes Documentation — Admission Controllers: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
- Kubernetes Documentation — Observability: https://kubernetes.io/docs/concepts/cluster-administration/observability/
- Kubernetes Documentation — Version Skew Policy: https://kubernetes.io/releases/version-skew-policy/
- OpenGitOps Principles: https://opengitops.dev/
- CNCF TAG App Delivery — Platforms White Paper: https://tag-app-delivery.cncf.io/whitepapers/platforms/
- SLSA Framework: https://slsa.dev/
- Sigstore Cosign Documentation: https://docs.sigstore.dev/cosign/
- OpenTelemetry Documentation: https://opentelemetry.io/docs/
- Prometheus Documentation: https://prometheus.io/docs/