
Learn Kubernetes Deployment Model Part 011 Progressive Delivery


title: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering - Part 011
description: Progressive delivery and rollout safety in Kubernetes: canary automation, metric gates, traffic shifting, rollback policy, blast radius, and production failure modelling.
series: learn-kubernetes-deployment-model
seriesTitle: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering
order: 11
partTitle: Progressive Delivery and Rollout Safety
tags:

  • kubernetes
  • progressive-delivery
  • canary
  • rollout-safety
  • release-engineering
  • sre
  • platform-engineering
  • series

date: 2026-07-01

Part 011 — Progressive Delivery and Rollout Safety

The goal of this part is to enable us to design deployments that advance in stages based on evidence rather than on courage. Kubernetes provides rollout primitives. Progressive delivery adds a risk-control system on top of those rollouts.

Progressive delivery is the practice of exposing a change to production gradually, measuring health signals, and then deciding whether the change is promoted, paused, or aborted. In Kubernetes, it usually combines several layers:

  • workload controller: Deployment, ReplicaSet, or a CRD such as Rollout;
  • traffic router: Service, Ingress, Gateway API, service mesh, or load balancer;
  • metric source: Prometheus, Datadog, CloudWatch, New Relic, or an OpenTelemetry backend;
  • policy engine: analysis template, automated gate, manual approval, change window;
  • rollback mechanism: traffic rollback, object rollback, feature flag disable, or compensating action.

A simple mental model:

Rolling update replaces pods gradually.
Progressive delivery exposes risk gradually.

A rolling update answers “how many old Pods are replaced by new Pods”. Progressive delivery answers “who or what is exposed to the change, when, by how much, based on what evidence, and how to stop safely”.


1. Kaufman Deconstruction: Skill yang Harus Dipraktikkan

To master progressive delivery, do not start from the tool. Start from the sub-skills.

Sub-skill | Operational question
Risk decomposition | What is the risk of this change: availability, correctness, latency, security, data, cost, compliance?
Exposure modelling | Which exposure unit is safe: request, user, tenant, region, shard, feature, queue consumer?
Traffic control | Which layer can split traffic precisely?
Metric design | Which signals prove the new version is healthy within a short window?
Analysis gate | What are the thresholds for promote, pause, or abort?
Rollback semantics | What gets rolled back: traffic, Pod, config, feature flag, migration, data?
Failure isolation | How do we prevent the canary from damaging shared state?
Automation boundary | What is automatic, and what needs human approval?
Auditability | What evidence is stored for post-incident review and compliance?

Kaufman-style target skill:

Within the first 20 hours of serious practice, we should be able to:
1. choose a progressive delivery strategy that fits a given workload;
2. write a rollout plan with metric gates and rollback rules;
3. read rollout status and tell application failure apart from routing failure;
4. design an explicit blast radius;
5. explain why a deployment is or is not safe for auto-promotion.

2. Progressive Delivery Bukan Sekadar Canary

Canary is one strategy. Progressive delivery is the broader discipline.

Pattern | Core idea | Good fit for | Main risk
Rolling update with gates | replace Pods gradually, pause if things go wrong | stateless low-risk services | traffic is not truly weighted
Canary | expose part of the traffic/users to the new version | high-change API, UI, backend service | metric noise, small sample bias
Blue-green with smoke gate | switch the active environment after validation | large releases, platform upgrades | double capacity, state divergence
Shadow traffic | send a copy of requests to the new version without returning its response to users | behavior comparison, perf test | side effects must be disabled
A/B | expose a cohort for product experiments | product experiment | not a pure safety mechanism
Feature-flag rollout | expose behavior gradually inside the app | business logic features | config drift, stale flags
Regional rollout | deploy per region/cluster | global systems | regional dependency mismatch
Tenant rollout | deploy per tenant/group | enterprise SaaS | schema/data compatibility across tenants

Practical rule:

Canary is about safe exposure.
A/B is about product learning.
Shadow is about observation without serving.
Blue-green is about fast switch.
Rolling update is about replacement mechanics.

Do not use A/B testing as a safety gate unless operational metrics remain the primary gate. Rising product conversion does not mean that error handling, latency, data integrity, or cost are safe.


3. Native Kubernetes Boundary

The native Kubernetes Deployment supports RollingUpdate, Recreate, rollout status, pause/resume, and rollback to a previous revision. However, native Kubernetes does not automatically provide:

  • weighted HTTP traffic split;
  • per-user cohort routing;
  • automated metric analysis;
  • baseline-vs-canary comparison;
  • analysis template;
  • traffic mirroring;
  • request-level rollback;
  • business metric gates;
  • automatic promotion based on a Prometheus query.

For that reason, progressive delivery usually needs additional tooling or an additional routing architecture.

In platform engineering, this boundary matters because it determines ownership:

Layer | Owner focus | Example decisions
Application team | release intent, metric semantics, business correctness | error budget, validation query, feature flag
Platform team | router, controller, policy, observability substrate | Argo Rollouts, Flagger, Gateway, Prometheus
SRE | SLO, incident rule, safe automation | abort threshold, burn rate, alert coupling
Security/compliance | approval, audit, policy guardrail | prod promotion policy, signed image gate

4. The Rollout Safety State Machine

Rollout safety should be modelled as a state machine, not as a linear script.

Minimal state that must be stored for audit:

  • artifact version: image digest, SBOM/provenance reference;
  • config version: ConfigMap/Secret revision or Git commit;
  • rollout strategy: canary/blue-green/etc;
  • exposure steps: 1%, 5%, 25%, 50%, 100% or equivalent;
  • gate metrics and thresholds;
  • gate result per step;
  • manual override decision;
  • rollback reason if an abort occurs;
  • incident/change record ID.
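
One way to picture the state machine idea is a small transition table plus an audit trail. The sketch below is illustrative only: the state names, the allowed transitions, and the `RolloutRecord` class are assumptions made for this example, not the model used by any particular controller.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    PROGRESSING = "progressing"
    ANALYZING = "analyzing"
    PAUSED = "paused"
    PROMOTED = "promoted"
    ABORTED = "aborted"


# Allowed transitions (hypothetical); anything else is rejected.
TRANSITIONS = {
    State.PENDING: {State.PROGRESSING, State.ABORTED},
    State.PROGRESSING: {State.ANALYZING, State.PAUSED, State.ABORTED},
    State.ANALYZING: {State.PROGRESSING, State.PAUSED, State.PROMOTED, State.ABORTED},
    State.PAUSED: {State.PROGRESSING, State.ABORTED},
    State.PROMOTED: set(),  # terminal
    State.ABORTED: set(),   # terminal
}


class RolloutRecord:
    """Tracks the current state and keeps an audit trail of every transition."""

    def __init__(self, image_digest, change_id):
        self.image_digest = image_digest
        self.change_id = change_id
        self.state = State.PENDING
        self.history = []  # list of (from_state, to_state, reason)

    def transition(self, new_state, reason):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.history.append((self.state.value, new_state.value, reason))
        self.state = new_state
```

Because every transition carries a reason, the `history` list doubles as the per-step gate result and rollback reason that the audit list above asks for.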

5. Blast Radius: Dimensi yang Harus Didesain

Many engineers design a canary only as a percentage of traffic. That is too narrow. Blast radius can be limited along many dimensions.

Dimension | Example | When to use
Request percentage | 1% HTTP traffic | stateless high-QPS service
User cohort | internal users, beta users | UI/API behavior visible
Tenant | non-critical tenants first | SaaS B2B
Region | ap-southeast-1 first | multi-region service
Cluster | staging-prod-edge cluster first | multi-cluster fleet
AZ/node pool | subset node pool | infra/runtime upgrade
API route | /v2/search only | route-specific logic
Message topic | one topic/partition/shard | event-driven systems
Feature surface | one feature flag | application-level rollout
Data shard | shard 01 only | stateful/data-heavy systems

Principle:

A traffic percentage is only safe if traffic is sufficiently homogeneous.
If the risk hides in a tenant, data shape, region, route, or dependency, a percentage-based canary can give false confidence.

A weak example:

1% of global traffic looks healthy,
but that 1% almost never touches the largest enterprise tenant,
the heaviest route,
or the data shape that triggers the bug.

A stronger example:

Canary step 1: internal users only.
Canary step 2: low-risk tenants on non-critical routes.
Canary step 3: representative high-volume route with strict metric gates.
Canary step 4: one region.
Canary step 5: global promotion.

6. Traffic Shifting Models

Traffic shifting must match the layer that holds the routing information.

6.1 Kubernetes Service Selector

A native Service selects Pods by selector. This works well for a stable endpoint, but it does not provide a precise weighted split.

apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  selector:
    app.kubernetes.io/name: payment-api
  ports:
    - name: http
      port: 80
      targetPort: 8080

If the selector matches v1 and v2 Pods at the same time, traffic is spread across the available endpoints, so the split roughly follows the ratio of ready Pods. That is not a rich canary policy: the Service knows nothing about user cohort, route, header, SLO, or a policy-based weighted percentage.

6.2 Ingress / Gateway / Mesh Weighted Routing

Weighted routing usually happens in:

  • ingress controller;
  • Gateway API implementation;
  • service mesh such as Istio/Linkerd/Kuma;
  • cloud load balancer;
  • API gateway;
  • progressive delivery controller that modifies the route object.

Design invariant:

The traffic split must live in a layer that can observe and control the exposure unit we need.

If you need header-based routing, a Service selector is not enough. If you need mTLS policy, a mesh may be relevant. If you only need low-risk Pod replacement, a Deployment rolling update is enough.
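
To make "exposure unit" concrete, here is a minimal sketch of cohort-stable routing: each unit (a user or tenant ID) is hashed into a fixed bucket, so the same unit keeps landing on the same version as the weight grows. The function names and the salt value are hypothetical; real routers and meshes implement this inside the data plane rather than in application code.

```python
import hashlib


def bucket(unit_id: str, salt: str = "payment-api-rollout") -> int:
    """Map an exposure unit (user, tenant, ...) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(unit_id: str, canary_weight: int) -> str:
    """Return "canary" when the unit's bucket falls below the canary weight."""
    return "canary" if bucket(unit_id) < canary_weight else "stable"
```

A useful property of this design is monotonic exposure: a unit that was on the canary at weight 10 stays on the canary at weight 50, so users do not flip between versions as the rollout advances.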


7. Metric Gates: Apa yang Layak Jadi Bukti?

The metric gate is the heart of progressive delivery. Without a gate, a canary is just “deploying slowly and hoping”.

7.1 Golden Signals

Minimum operational metrics:

Signal | Example query/indicator | Notes
Availability | success rate, 5xx rate, non-2xx by route | separate client errors from server errors
Latency | p50/p95/p99 histogram | p99 is often more sensitive to regressions
Traffic | request rate, active connections | sample size must be sufficient
Saturation | CPU, memory, queue depth, thread pool, DB pool | problems often show up as saturation first

7.2 Correctness Signals

Availability metrics are not enough. Many bugs return a 200 response that is wrong from a business point of view.

Add:

  • validation failure rate;
  • business rule rejection abnormal;
  • payment authorization mismatch;
  • order state transition invalid;
  • duplicate event emission;
  • reconciliation backlog;
  • data consistency violation;
  • consumer lag;
  • dead-letter queue growth;
  • cache miss explosion;
  • downstream contract error.

For regulatory/case management systems, the correctness gate matters more than HTTP 5xx alone:

Are state transitions still legal?
Is the audit trail complete?
Is the escalation SLA still intact?
Does the evidence lifecycle keep its causal chain?
Is action idempotency still correct?

7.3 Short-Window vs Long-Window Metrics

Progressive delivery needs metrics that can give a signal within a short time. A window that is too short, however, is noisy.

Window | Use | Risk
1–2 minutes | fast abort for crashes/large error spikes | noisy, small sample
5–15 minutes | typical canary step | can still miss low-frequency bugs
30–60 minutes | higher confidence | slow release
multi-day | behavior/product validation | not suitable for a short-lived automated rollout controller

Practical rule:

Use short-window metrics for safety.
Use long-window metrics for learning.
Do not mix the two in a single automated gate without a clear design.

8. Canary Analysis Design

Canary analysis compares signals from the new version against an absolute threshold or against a baseline.

8.1 Absolute Threshold

Example:

abort if 5xx_rate > 1%
abort if p99_latency > 750ms
abort if cpu_throttling > 10%
abort if dlq_events > 0

Strengths:

  • easy to understand;
  • well suited to hard invariants;
  • good for compliance and safety.

Weaknesses:

  • does not adapt when normal traffic conditions are already poor;
  • can fail when the baseline itself is degraded.

8.2 Baseline Comparison

Example:

abort if canary_5xx_rate > stable_5xx_rate + 0.5%
abort if canary_p99 > stable_p99 * 1.25

Strengths:

  • compares versions under the same production conditions;
  • fairer when the environment is noisy.

Weaknesses:

  • requires consistent metric labels;
  • the baseline can also be bad;
  • the canary sample size is often small.

8.3 Composite Gate

In mature production setups, the gate is usually composite:

Promote only if:
- canary pods Ready and Available;
- request volume >= minimum sample;
- 5xx rate <= threshold;
- p95/p99 latency within budget;
- saturation not increasing abnormally;
- business correctness metric healthy;
- no critical logs/events matched;
- no active high-severity alert for dependency.

Be careful with composite gates: the more metrics you add, the higher the chance that one noisy metric blocks or aborts a healthy release. Use only metrics that are genuinely decision-relevant.
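
A composite gate that combines a minimum sample, an absolute threshold, and a baseline comparison might be sketched as below. The thresholds reuse the illustrative numbers from this section (1% absolute 5xx, +0.5% over stable, 1.25x p99); the `VersionMetrics` type, the function name, and the three-way verdict are assumptions made for this example, not the behavior of a specific controller.

```python
from dataclasses import dataclass


@dataclass
class VersionMetrics:
    requests: int      # requests observed in the analysis window
    error_rate: float  # 5xx fraction, e.g. 0.004 for 0.4%
    p99_ms: float      # p99 latency in milliseconds


def evaluate_gate(canary, stable, min_requests=1000,
                  max_error_rate=0.01, max_error_delta=0.005,
                  max_p99_ratio=1.25):
    """Return (verdict, reason); verdict is "promote", "abort", or "inconclusive"."""
    # No sample, no signal: refuse to decide on too little traffic.
    if canary.requests < min_requests:
        return "inconclusive", "not enough canary traffic"
    # Hard invariant, independent of how the baseline is doing.
    if canary.error_rate > max_error_rate:
        return "abort", "absolute 5xx threshold exceeded"
    # Relative checks against the stable version under the same conditions.
    if canary.error_rate > stable.error_rate + max_error_delta:
        return "abort", "5xx rate worse than baseline"
    if canary.p99_ms > stable.p99_ms * max_p99_ratio:
        return "abort", "p99 latency worse than baseline"
    return "promote", "all gates passed"
```

Returning "inconclusive" instead of "promote" when the sample is too small keeps a silent canary from being mistaken for a healthy one.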


9. Sample Size Problem

A 1% canary is only useful if that 1% is large enough.

Example:

Service traffic | 1% canary | Interpretation
100,000 req/min | 1,000 req/min | fast signal, reasonably strong
1,000 req/min | 10 req/min | many bugs can slip through
100 req/hour | 1 req/hour | canary percentage is almost meaningless

For low-traffic services, use other strategies:

  • synthetic traffic;
  • internal cohort;
  • contract test against production-like data;
  • shadow traffic;
  • longer analysis window;
  • manual validation;
  • route-specific replay;
  • staged tenant rollout.

Invariant:

No sample, no signal.
No signal, no safe automation.
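
The sample-size argument can be made quantitative with a small back-of-the-envelope calculation. This sketch assumes requests fail independently with a fixed probability, which is a simplification; the function names are made up for this example.

```python
def canary_requests(total_rate_per_min: float, weight_pct: float,
                    window_min: float) -> float:
    """Expected number of requests the canary sees during one analysis window."""
    return total_rate_per_min * (weight_pct / 100.0) * window_min


def detection_probability(requests: float, failure_rate: float) -> float:
    """Probability of observing at least one failure, assuming independent requests."""
    return 1.0 - (1.0 - failure_rate) ** requests
```

For the 1,000 req/min service in the table, a 1% canary over a 5-minute window sees about 50 requests. If a bug affects 0.5% of requests, `detection_probability(50, 0.005)` is roughly 0.22, so the window would most likely show zero failures. At 100,000 req/min the same window sees about 5,000 requests and the probability is effectively 1.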

10. Rollback: Traffic Rollback vs Workload Rollback

Rollback is not a single thing.

Rollback type | What changes | Speed | When it fits
Traffic rollback | route traffic back to stable | very fast | canary/blue-green with router control
Workload rollback | Deployment/Rollout returns to the previous revision | medium | native rolling update
Feature rollback | flag disable | very fast | behavior behind flag
Config rollback | ConfigMap/Secret revision revert | medium | runtime config issue
Data rollback | revert/repair data | slow/risky | migration/data corruption
Compensating action | forward fix | varies | irreversible operation

Production principle:

Prefer the rollback that reduces exposure without changing more state.

If the new version has already run an irreversible database migration or written incompatible events, rolling back only the Pods may leave the system more broken. That is why progressive delivery must be preceded by compatibility design.


11. Data and Schema Compatibility Gate

A new deployment rarely changes compute alone. It often touches data.

Checklist before a canary:

Area | Question
Database schema | Can v1 and v2 run side by side?
Migration | Is the migration backward-compatible?
Event schema | Can old consumers read the new events?
API contract | Are old clients still valid?
Cache | Does the key format change?
Idempotency | Are retries from both versions still safe?
Audit | Does the audit format remain complete?
Authorization | Does the permission model change?
State machine | Are new state transitions compatible with old workers?

Safe deployment pattern:

Expand -> Deploy -> Migrate/Backfill -> Switch -> Contract -> Cleanup

Database example:

  1. Add a new nullable column.
  2. Deploy code that can read both the old and the new format.
  3. Backfill the data.
  4. Enable the new write path.
  5. Make sure every consumer is compatible.
  6. Once it is safe, remove the old field in a separate release.

Do not combine destructive changes with a compute canary in a single step.
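
Steps 2 and 4 of the database example can be sketched as a compatibility layer in application code. The column names (`full_name`, `first_name`, `last_name`) and both functions are hypothetical, chosen only to illustrate a reader that tolerates both formats and a writer whose new path sits behind a switch.

```python
def read_customer_name(row: dict) -> str:
    """Read a record written by either the old (single column) or new (split columns) format."""
    if row.get("first_name") is not None and row.get("last_name") is not None:
        return f"{row['first_name']} {row['last_name']}"
    # Fall back to the old column for rows that have not been backfilled yet.
    return row["full_name"]


def write_customer_name(row: dict, first: str, last: str,
                        new_write_path: bool) -> dict:
    """Keep the old column populated until the contract step; new columns are gated."""
    updated = dict(row)
    updated["full_name"] = f"{first} {last}"
    if new_write_path:
        updated["first_name"] = first
        updated["last_name"] = last
    return updated
```

Because the old column keeps being written, rolling the Pods back to the previous version stays safe at every point before the contract step.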


12. Progressive Delivery with Argo Rollouts

Argo Rollouts is a Kubernetes controller and a set of CRDs that provide advanced deployment capabilities such as blue-green, canary, canary analysis, experimentation, and progressive delivery.

Core resources:

Resource | Function
Rollout | replacement/alternative for Deployment with advanced strategies
AnalysisTemplate | metric gate template
AnalysisRun | an actual execution of the analysis
Experiment | runs several ReplicaSets for a controlled experiment
traffic routing integration | modifies the router/mesh/ingress to weight traffic

A conceptual canary structure:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: payment-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payment-api
    spec:
      containers:
        - name: app
          image: registry.example.com/payment-api:2.4.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: payment-api-slo-check
        - setWeight: 25
        - pause:
            duration: 10m
        - setWeight: 50
        - pause:
            duration: 10m

A conceptual AnalysisTemplate:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-api-slo-check
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.995
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
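          # Caveat: this query aggregates every payment-api Pod, stable and canary alike.
          # To judge only the canary, add a selector on whatever version label your metrics expose.
          query: |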
            sum(rate(http_requests_total{app="payment-api",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="payment-api"}[2m]))

Important design note:

Argo Rollouts can automate promotion, but it cannot invent correct metrics.
The hard engineering work is metric semantics, compatibility, and blast-radius design.

13. Progressive Delivery with Flagger

Flagger is a progressive delivery tool in the Flux ecosystem. It can perform canary, A/B testing, blue-green, traffic mirroring, automated analysis, promotion, and rollback by using an ingress controller or service mesh together with a metric backend.

A conceptual example:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m

Flagger-style thinking fits an organization that already uses GitOps and wants release automation that ties together workload, router, metrics, and alerting.


14. Manual Gate vs Automated Gate

Not every gate has to be automated. Not every approval has to be manual.

Gate type | Good fit for | Anti-pattern
Automated preflight | lint, policy, image signature, unit/contract test | human approval for deterministic checks
Automated canary metrics | clear SLO, high traffic, stable metric | auto-promoting even though the metric is unreliable
Manual approval | high-risk change, compliance, migration | approval without evidence
Time window gate | market hours, regulatory freeze, low-support hours | permanent freeze without a risk model
Incident-aware gate | block deploys while a SEV is active | ignoring a degraded dependency

A good pattern:

Machine checks what machines can verify.
Humans decide trade-offs when evidence is incomplete or risk is socio-technical.

15. Readiness, Liveness, Startup Probe as Rollout Gates

Probes are not progressive delivery, but they are an important input.

Probe | Function | Rollout implication
startupProbe | gives the application time to start without being considered dead | prevents premature restarts during cold start
readinessProbe | determines whether the Pod receives traffic | minimum gate before exposure
livenessProbe | restarts a stuck container | can worsen an incident if too aggressive

Readiness must represent the ability to serve requests, not merely a live process.

An example of readiness that is too shallow:

GET /healthz returns 200 if process alive.

Better:

GET /readyz returns 200 if:
- HTTP server ready;
- required config loaded;
- database pool initialized;
- critical dependency mode known;
- migration compatibility confirmed;
- app can accept traffic without corrupting state.

However, do not make readiness so dependent on every downstream that Pods keep dropping in and out of the endpoint list whenever a minor dependency flaps. Separate:

  • readiness for accepting traffic;
  • health endpoint for diagnostics;
  • dependency metrics for alerting;
  • circuit breaker for degradation.
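
The separation between readiness and diagnostics can be sketched as two functions over the same application state. Everything here is hypothetical (the `AppState` fields, the check names, the optional cache); the point is only that a non-critical dependency shows up in diagnostics without gating traffic.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    http_server_started: bool
    config_loaded: bool
    db_pool_initialized: bool
    optional_cache_reachable: bool  # non-critical dependency


def readyz(state: AppState):
    """Return (status_code, failed_checks) for a readiness endpoint."""
    required = {
        "http_server": state.http_server_started,
        "config": state.config_loaded,
        "db_pool": state.db_pool_initialized,
    }
    failed = [name for name, ok in required.items() if not ok]
    return (200 if not failed else 503), failed


def diagnostics(state: AppState) -> dict:
    """Richer health view for humans and alerting; it never gates traffic."""
    return {"cache": "ok" if state.optional_cache_reachable else "degraded"}
```

With this split, a flapping cache degrades the diagnostics view and can trigger an alert, while the Pod stays in the endpoint list as long as the required checks pass.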

16. Rollout Safety and HPA Interaction

HPA can interact with a canary in surprising ways.

Common problems:

Problem | Cause | Impact
canary under-sampled | HPA scale is small, traffic is small | metric is not significant
canary overloaded | weight increases faster than scaling | spurious latency/error failures
stable/canary imbalance | traffic split and replica split are not aligned | unfair comparison
metric contamination | metrics are not labeled by version | wrong analysis
cold start penalty | new canary is not warm yet | temporarily worse latency

Mitigations:

  • label metrics with version, pod-template-hash, or a stable release label;
  • make sure canary replicas are sufficient for the traffic step;
  • use a warm-up pause before analysis;
  • monitor CPU throttling and memory pressure;
  • avoid auto-promotion while the HPA has not stabilized;
  • design minReplicas for high-QPS canaries.
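
Checking whether canary replicas are sufficient for a traffic step is simple arithmetic. The sketch below assumes a known per-Pod throughput and a fixed headroom fraction, both of which you would have to measure for your own service; the function name is made up for this example.

```python
import math


def required_canary_replicas(total_rps: float, weight_pct: float,
                             rps_per_pod: float, headroom: float = 0.3) -> int:
    """Replicas needed so the canary can absorb its traffic share with spare capacity."""
    canary_rps = total_rps * weight_pct / 100.0
    # Reserve a fraction of each Pod's capacity for bursts and cold caches.
    usable_rps_per_pod = rps_per_pod * (1.0 - headroom)
    return max(1, math.ceil(canary_rps / usable_rps_per_pod))
```

For a service at 2,000 req/s where one Pod handles about 100 req/s, a 25% step needs `required_canary_replicas(2000, 25, 100)` = 8 replicas. Raising the weight before those replicas are Ready is how an otherwise healthy canary ends up failing its latency gate.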

17. Progressive Delivery for Async/Event-Driven Workloads

An HTTP canary is easier because traffic can be steered. Workers/consumers are harder because exposure happens through a queue/topic/partition.

Exposure model for async workloads:

Unit | Example
consumer group | separate v2 consumer group
partition | only a subset of partitions is consumed by v2
topic | canary/sandbox topic
message type | specific events only
tenant key | specific tenants are routed to the v2 worker
feature flag | new handler enabled for a subset

Async failure modes:

  • duplicate processing;
  • poison message;
  • reordering;
  • idempotency break;
  • offset committed too early;
  • dead-letter volume increases;
  • consumer lag increases;
  • event schema incompatible;
  • rollback is hard because messages have already been processed.

Canary gate for workers:

Promote worker v2 only if:
- consumer lag does not increase abnormally;
- processing error rate below threshold;
- DLQ count remains zero or within budget;
- duplicate detection remains normal;
- processing latency stays within SLA;
- downstream write failure rate normal;
- idempotency violation metric zero.
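
Several of these worker gates depend on the consumer being idempotent and on counters it exposes. A minimal in-memory sketch, assuming each message carries a unique `id` field; the class name and the in-memory set are illustrative, and a real worker would need a durable deduplication store.

```python
class IdempotentConsumer:
    """Processes each message ID at most once and tracks counters a gate could read."""

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # assumption: in production this is a durable store
        self.duplicates = 0
        self.dead_letter = []

    def consume(self, message: dict) -> str:
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            self.duplicates += 1
            return "duplicate"
        try:
            self.handler(message)
        except Exception:
            # Poison messages go to a dead-letter list instead of blocking the queue.
            self.dead_letter.append(message)
            return "dead-lettered"
        # Mark as processed only after the handler succeeded.
        self.processed_ids.add(msg_id)
        return "processed"
```

The `duplicates` and `dead_letter` counters map directly onto the duplicate-detection and DLQ conditions in the gate above.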



18. Shadow Traffic Safety

Shadow traffic sends a copy of each request to the new version without using its response for the user.

Good fit for:

  • performance comparison;
  • parser/validator behavior comparison;
  • ML inference comparison;
  • dependency compatibility;
  • observability before release.

The main danger is side effects.

The shadow target must prevent:

  • writes to the production database;
  • external calls that change state;
  • real payment/auth/email/SMS actions;
  • event emission to production topics;
  • audit log entries that look like real actions;
  • rate limiting against downstream services;
  • duplicate transactions.

Safety pattern:

Shadow mode must be read-only or side-effect isolated.

If that cannot be guaranteed, shadow traffic is dangerous.
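
One way to isolate side effects is to wrap every outbound gateway so that mirrored requests are recorded but never executed. The sketch below is an assumption-heavy illustration: the `x-shadow-request` header, the class names, and the gateway interface are all hypothetical, and a real setup also has to cover database writes and event publishing.

```python
class RecordingGateway:
    """Stand-in for an external side effect (payment, email, event publish)."""

    def __init__(self):
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)


class ShadowSafeGateway:
    """Wraps a gateway and suppresses real side effects for shadow requests."""

    def __init__(self, real_gateway):
        self.real_gateway = real_gateway
        self.suppressed = []  # kept only for comparison and debugging

    def send(self, payload, is_shadow: bool):
        if is_shadow:
            self.suppressed.append(payload)
            return "suppressed"
        self.real_gateway.send(payload)
        return "sent"


def handle_request(headers: dict, gateway: ShadowSafeGateway, payload: dict) -> str:
    # Assumption: the router marks mirrored requests with this hypothetical header.
    is_shadow = headers.get("x-shadow-request") == "true"
    return gateway.send(payload, is_shadow)
```

Keeping the suppressed payloads lets you compare what the new version would have done against what the stable version actually did.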


19. Blue-Green Safety

Blue-green is often considered safer because the switch is fast. But blue-green carries different risks:

Risk | Explanation
double capacity | the green environment needs extra resources
state divergence | blue and green may see different state
DNS/LB cache | the switch is not always instant on the client side
migration coupling | green needs the new schema, blue is not yet compatible
session stickiness | a user session can jump between versions
background jobs | both environments can run the same job twice

Checklist before the switch:

  • green smoke test passes;
  • green receives synthetic or shadow traffic;
  • DB schema is backward compatible;
  • cron/job duplicate guard is active;
  • only one side runs active queue consumers if they are not idempotent;
  • rollback route is clear;
  • old environment is kept until confidence is sufficient;
  • monitoring labels distinguish blue from green.

20. Deployment Guardrails for Platform Teams

A platform must not only provide tools; it must provide guardrails.

Example guardrails:

Guardrail | Purpose
default rollout steps | standardize exposure
required metric templates | prevent rollouts without a gate
max traffic jump | prevent jumping from 5% straight to 100% for critical services
image digest required | reproducibility
signed image policy | supply chain control
PDB requirement | protect availability during disruption
readiness probe required | prevent blind exposure
rollback window | define minimum observe time
owner label required | incident routing
change record annotation | auditability

Example annotation policy:

metadata:
  annotations:
    platform.example.com/change-id: CHG-2026-10422
    platform.example.com/risk-tier: high
    platform.example.com/rollback-plan: traffic-rollback
    platform.example.com/slo-template: payment-api-critical

21. Rollout Observability Dashboard

A progressive delivery dashboard must answer:

  1. which version is being released;
  2. what the current exposure is;
  3. how many stable vs canary replicas there are;
  4. how much stable vs canary traffic there is;
  5. which metric gates passed/failed;
  6. whether traffic is actually reaching the canary;
  7. whether the canary has enough sample;
  8. whether any dependency alert is firing;
  9. whether a rollback has already happened;
  10. who approved/paused/promoted.

Minimum panels:

Panel | Breakdown
Request rate | by version, route, status
Error rate | by version, route, error class
Latency | p50/p95/p99 by version
Saturation | CPU, memory, throttling, connection pool
Rollout state | step, weight, pause, analysis result
Kubernetes health | Pod readiness, restarts, events
Business correctness | domain-specific invariants
Dependency health | downstream error/latency

22. Failure Modes and Debugging

22.1 Canary Healthy But Full Rollout Fails

Possible causes:

  • canary sample not representative;
  • scale-dependent bug;
  • cache behavior changes at higher load;
  • downstream rate limit only hit at 100%;
  • rare tenant/data shape not in canary;
  • canary ran too briefly;
  • metric lacked route/tenant breakdown.

Debug:

kubectl get rollout -n prod
kubectl describe rollout payment-api -n prod
kubectl get rs -n prod -l app.kubernetes.io/name=payment-api
kubectl get pods -n prod -l app.kubernetes.io/name=payment-api --show-labels
kubectl get events -n prod --sort-by=.lastTimestamp

Then correlate:

  • router config;
  • traffic weights;
  • version labels in metrics;
  • HPA scaling events;
  • downstream incidents;
  • release timeline.

22.2 Rollout Stuck at Analysis

Possible causes:

  • metric query returns empty result;
  • Prometheus label mismatch;
  • canary has no traffic;
  • metric provider unreachable;
  • success condition syntax wrong;
  • analysis window too short;
  • threshold too strict.

Troubleshooting questions:

Is canary receiving traffic?
Is the metric query returning data?
Does the metric distinguish canary from stable?
Is the condition evaluating the expected numeric type?
Is the analysis interval aligned with scrape interval?

22.3 Rollout Aborts Too Often

Possible causes:

  • threshold unrealistic;
  • metric noisy;
  • dependency baseline bad;
  • canary cold start;
  • insufficient replicas;
  • false correlation with unrelated incident.

Fix pattern:

  • add minimum sample requirement;
  • compare against baseline;
  • add warm-up pause;
  • separate critical abort metric from warning metric;
  • add incident-aware gate;
  • improve labels.

23. Progressive Delivery Readiness Rubric

Level | Capability | Description
0 | Manual deploy | kubectl apply, no consistent rollout safety
1 | Native rolling | Deployment rolling update, probes, rollout status
2 | Manual canary | separate canary/stable service, manual traffic switch
3 | Automated canary | controller-managed steps, metric gates, rollback
4 | Risk-aware rollout | blast radius by tenant/region/route, domain metrics
5 | Platform-grade delivery | policy, audit, SLO integration, GitOps, standard templates
6 | Adaptive delivery | dynamic rollout based on risk, dependency status, error budget

Target top-tier engineer:

Not merely “can deploy with Argo Rollouts”.
Able to design when automation is safe, when manual judgment is required, and where rollback is impossible.

24. Hands-on Practice Plan

Exercise 1 — Native Deployment Safety

Create a Deployment with:

  • readiness probe;
  • liveness probe;
  • maxSurge and maxUnavailable;
  • progressDeadlineSeconds;
  • minReadySeconds;
  • rollout pause/resume.

Observe:

kubectl rollout status deployment/payment-api
kubectl rollout history deployment/payment-api
kubectl describe deployment payment-api
kubectl get events --sort-by=.lastTimestamp

Exercise 2 — Manual Canary

Run stable and canary Deployments:

payment-api-stable
payment-api-canary

Expose through separate Services:

payment-api-stable
payment-api-canary

Use ingress/router/gateway to direct small traffic to canary. Measure by version label.

Exercise 3 — Metric Gate Design

Write a gate spec:

rolloutGate:
  minSampleRequests: 1000
  window: 10m
  abortIf:
    http5xxRate: "> 0.5%"
    p99Latency: "> 750ms"
    dlqEvents: "> 0"
    stateTransitionInvalid: "> 0"
  promoteIf:
    successRate: ">= 99.5%"
    canaryTrafficReceived: true

Then ask:

  • can this be measured today?
  • are metrics labeled correctly?
  • what is the false positive risk?
  • what is the false negative risk?

Exercise 4 — Irreversible Change Review

Given this change:

v2 writes a new non-null column and emits a new required event field.

Design safe sequence:

  • expand schema;
  • deploy compatibility layer;
  • backfill;
  • enable write path gradually;
  • validate consumers;
  • contract cleanup later.

25. Anti-Patterns

Anti-pattern | Why it fails
Canary without traffic | metrics look healthy because nobody used it
Canary without version labels | cannot compare stable vs canary
Error rate only | correctness bug returns 200
1% canary for low traffic service | no meaningful sample
Auto-promote during dependency incident | false attribution
Rollback plan equals “kubectl rollout undo” | data/config/traffic may not roll back
Shadow traffic with side effects | duplicates writes/actions
Blue-green with active jobs on both sides | duplicate processing
Changing schema destructively with rollout | old version breaks
Long-running canary as experiment | operational complexity grows
Manual approval without evidence | bureaucracy, not safety
Tool-first rollout design | ignores domain-specific risk

26. Production Checklist

Before progressive rollout:

  • image digest pinned;
  • deployment manifest reviewed;
  • readiness probe meaningful;
  • rollback method defined;
  • traffic route controllable;
  • metrics labeled by version;
  • minimum sample size defined;
  • abort thresholds defined;
  • business correctness metric included;
  • data/schema compatibility reviewed;
  • downstream dependency risk reviewed;
  • alerts connected to rollout timeline;
  • owner and change ID annotated;
  • manual override path clear;
  • post-rollout cleanup defined.

27. Decision Framework


Default recommendation:

Start with simple rolling update for low-risk services.
Add progressive delivery only when risk, traffic, or compliance justifies the extra control plane.

Complexity is not free. Argo Rollouts, Flagger, service mesh, and Gateway integrations add operational surface. Use them because you need controlled exposure, not because they look mature.


28. Summary

Progressive delivery is a risk control system. The core question is not “can we deploy automatically?” but:

Can we expose change gradually, observe meaningful evidence, and reverse exposure before damage exceeds the risk budget?

Key conclusions:

  • Kubernetes Deployment rolling update is necessary but not sufficient for advanced release safety.
  • Progressive delivery requires traffic control, metric gates, rollback semantics, and blast radius design.
  • Canary is only meaningful when traffic is representative and measurable.
  • Rollback must account for traffic, workload, config, data, and external side effects.
  • Business correctness metrics matter as much as HTTP metrics in enterprise systems.
  • Platform teams should provide guardrails, not just tools.

Top 1% Kubernetes engineers think in state, risk, evidence, and reversibility.

