
Learn Kubernetes Deployment Model Part 021 Batch Event Workloads


title: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering - Part 021
description: Deep dive into Kubernetes batch, scheduled, and event-driven workload design using Job, CronJob, queue workers, idempotency, retry policy, failure semantics, and operational governance.
series: learn-kubernetes-deployment-model
seriesTitle: Learn Kubernetes, Deployment Model, and Cloud Native Platform Engineering
order: 21
partTitle: Batch, Event-Driven, and Scheduled Workloads
tags:

  • kubernetes
  • jobs
  • cronjobs
  • batch-processing
  • event-driven
  • reliability
  • platform-engineering

date: 2026-07-01

Part 021 — Batch, Event-Driven, and Scheduled Workloads

1. Learning Objectives

The previous part covered StatefulSet and data-aware deployment. This part moves to workloads that do not run forever: batch, scheduled, one-off, migration, reconciliation, maintenance, queue consumer, and event-driven processing.

Targets after this part:

  1. Understand the difference between long-running, finite, scheduled, and event-driven workloads.
  2. Choose between Job, CronJob, Deployment queue worker, workflow engine, or controller/operator.
  3. Understand the key Job semantics: completions, parallelism, completionMode, backoffLimit, activeDeadlineSeconds, ttlSecondsAfterFinished, podFailurePolicy, and Indexed Job.
  4. Design batch jobs that are safe under retry, duplicate execution, partial failure, and concurrent execution.
  5. Design CronJobs that do not break because of clock, missed schedule, overlap, timezone, and backlog.
  6. Build event-driven workloads that scale without creating cascade failure.
  7. Debug batch workloads from the object graph, status conditions, events, logs, and external side effects.

Kaufman lens:

  • Deconstruct: finite work = trigger + unit-of-work + idempotency + retry + completion + cleanup.
  • Self-correct: read Job/CronJob status, Pod failure, missed schedule, duplicate execution, and external effect.
  • Remove barriers: use a decision tree so that not every task is forced into a Deployment.
  • Practice subskills: run, retry, cancel, resume, inspect, cleanup, and protect side effects.

2. Mental Model: A Workload Is Not Always a Server

Banyak engineer mengasosiasikan Kubernetes dengan HTTP service. Itu salah satu use case, bukan keseluruhan model.

Kubernetes workload dapat berupa:

| Workload Type | Lifespan | Owner Object | Primary Risk |
| --- | --- | --- | --- |
| HTTP service | long-running | Deployment | rollout, traffic, latency |
| node-local agent | long-running per node | DaemonSet | node coverage, privilege |
| stateful replica | long-running with identity | StatefulSet | data correctness |
| one-off task | finite | Job | retry and duplicate side effect |
| scheduled task | repeated finite | CronJob | overlap and missed schedule |
| queue consumer | long-running event processor | Deployment + scaler | backlog and poison message |
| indexed batch | finite parallel shards | Indexed Job | partial index failure |
| workflow | multi-step finite graph | workflow engine/controller | orchestration complexity |

Top 1% mental model:

Batch workload correctness is not defined by “the Pod exited 0”. It is defined by whether the intended external side effect happened, and happened as safely as the business requires.

For example, a billing reconciliation Job may exit successfully while processing only half the input because the code swallowed errors. Kubernetes can observe process lifecycle. It cannot infer domain correctness unless you expose it through status, metrics, logs, external ledger, or explicit completion markers.
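
One way to keep a Job from reporting a false success is to count failed units and derive the exit code from that count. The following is a minimal sketch, not the reconciler from this lesson: `RunSummary`, `reconcile`, and `handle_row` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Domain-level result of one batch run (hypothetical structure)."""
    processed: int = 0
    failed: list = field(default_factory=list)

    @property
    def exit_code(self) -> int:
        # Non-zero when any unit failed, so Kubernetes does not see a success.
        return 0 if not self.failed else 1


def reconcile(rows, handle_row) -> RunSummary:
    """Process every row and record failures instead of swallowing them."""
    summary = RunSummary()
    for row in rows:
        try:
            handle_row(row)
            summary.processed += 1
        except Exception as exc:  # broad on purpose: count it, do not hide it
            summary.failed.append((row, repr(exc)))
    return summary
```

An entrypoint could then finish with `sys.exit(reconcile(rows, handle_row).exit_code)` and also publish `processed` and `len(failed)` as metrics, so that both Kubernetes and the domain ledger see the same outcome.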


3. Decision Tree: Which Workload Model Should Own This?

Practical rule:

  • Use Job when success/failure and completion matter.
  • Use CronJob when Kubernetes should create Jobs according to time.
  • Use Deployment for queue workers when work is continuous and scaling is backlog-driven.
  • Use Indexed Job when each parallel unit needs a stable index.
  • Use workflow engine when one task becomes a dependency graph.
  • Use controller/operator when the task is actually reconciliation, not batch execution.

Anti-pattern:

CronJob -> shell script -> kubectl apply -> random cleanup -> no status -> no idempotency -> no audit

That is not automation. It is an unaudited control plane mutation pipeline.


4. Job: Run-to-Completion Controller

A Kubernetes Job creates one or more Pods and tracks them until the specified completion condition is reached.

Minimal Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: ledger-reconciliation
  namespace: finance
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: reconcile
          image: registry.example.com/finance/reconciler:2026.07.01
          args:
            - "--date=2026-07-01"

Important fields:

| Field | Meaning | Design Question |
| --- | --- | --- |
| parallelism | how many Pods may run concurrently | How much concurrent pressure can downstream tolerate? |
| completions | how many successful completions are needed | How many units must finish? |
| completionMode | NonIndexed or Indexed | Does each unit need deterministic identity? |
| backoffLimit | retry limit before Job failure | Is failure transient or logical? |
| activeDeadlineSeconds | max total runtime | When should the work be considered stuck? |
| ttlSecondsAfterFinished | cleanup after finish | How long should objects remain inspectable? |
| podFailurePolicy | classify Pod failures | Which errors should fail fast vs retry? |

A Job is not just “a Pod with restart”. It is a controller-managed execution contract.


5. Restart Policy: Never vs OnFailure

Jobs support Pod restartPolicy values:

  • Never
  • OnFailure

For production debugging, Never is often clearer because each failed attempt becomes a failed Pod that can be inspected.

spec:
  template:
    spec:
      restartPolicy: Never

With OnFailure, the container may restart inside the same Pod. This can be cheaper but may hide attempt boundaries if logging and metrics are weak.

Decision matrix:

| Policy | Use When | Avoid When |
| --- | --- | --- |
| Never | you need clear attempt history | enormous number of short failed Pods would overload observability |
| OnFailure | retry inside same Pod is acceptable | debugging attempt-level state matters |

Top 1% rule:

Choose retry visibility intentionally. Hidden retries create hidden side effects.


6. Retry Semantics and Backoff

backoffLimit controls how many failures Kubernetes tolerates before marking the Job failed.

Example:

apiVersion: batch/v1
kind: Job
metadata:
  name: report-export
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: export
          image: registry.example.com/reports/exporter:1.4.2

A failed retry may mean:

  1. A transient infrastructure issue.
  2. A temporary downstream issue.
  3. Bad input.
  4. Bad code.
  5. Permission denied.
  6. External side effect partially happened.

Kubernetes cannot know which one unless you model failure.

Failure classification:

| Failure | Retry? | Reason |
| --- | --- | --- |
| network timeout | maybe | transient |
| HTTP 429 | yes, with backoff | downstream throttling |
| invalid argument | no | logical/config error |
| schema mismatch | no | deployment/data contract error |
| node eviction | yes | infrastructure disruption |
| duplicate key | depends | maybe idempotency success |
| permission denied | no | IAM/RBAC/config error |

Use podFailurePolicy when exit codes or Pod conditions should affect retry behavior.

Example pattern:

apiVersion: batch/v1
kind: Job
metadata:
  name: import-customer-ledger
spec:
  backoffLimit: 5
  podFailurePolicy:
    rules:
      - action: FailJob
        onExitCodes:
          containerName: importer
          operator: In
          values: [64, 65, 66]
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: importer
          image: registry.example.com/ledger/importer:2.8.0

Interpretation:

  • Exit code 64/65/66 means deterministic input/config error.
  • Disruption can be ignored so it does not consume logical retry budget.

This is where batch engineering becomes reliability engineering.
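
A podFailurePolicy rule on exit codes only helps if the application emits those codes deliberately. The sketch below shows one way an entrypoint might map exception types to exit codes. The exception classes and the meaning given to each code are assumptions made for illustration; the numbers 64, 65, and 66 are chosen to match the FailJob rule in the manifest above (they also coincide with EX_USAGE, EX_DATAERR, and EX_NOINPUT from BSD sysexits.h).

```python
class ConfigError(Exception):
    """Bad flags or configuration; retrying cannot help."""


class InvalidInput(Exception):
    """Input data is malformed; retrying cannot help."""


class MissingInput(Exception):
    """Expected input does not exist."""


# Hypothetical mapping. Anything not listed falls back to exit code 1,
# which stays retryable under backoffLimit.
EXIT_CODES = {
    ConfigError: 64,
    InvalidInput: 65,
    MissingInput: 66,
}


def classify(exc: BaseException) -> int:
    """Return the process exit code for a raised exception."""
    for exc_type, code in EXIT_CODES.items():
        if isinstance(exc, exc_type):
            return code
    return 1


def run_and_classify(task) -> int:
    """Run the task and translate its outcome into an exit code."""
    try:
        task()
        return 0
    except Exception as exc:
        return classify(exc)
```

A container entrypoint would end with `sys.exit(run_and_classify(main))`, so a deterministic input error fails the Job immediately while a timeout still consumes the normal retry budget.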


7. Idempotency: The Non-Negotiable Batch Invariant

Kubernetes documentation explicitly warns that a Job program may sometimes be started twice even with parallelism=1, completions=1, and restartPolicy=Never. Therefore, a Job must tolerate duplicate execution.

Idempotency means repeating the same operation does not corrupt the system.

Common strategies:

| Strategy | Example |
| --- | --- |
| natural key | invoice_id unique constraint |
| idempotency key | job_name + unit_id (stable across attempts, so no attempt_id) |
| external ledger | write STARTED, COMMITTED, FAILED states |
| compare-and-set | update only if current state is expected |
| atomic rename | write temp object then rename/promote |
| checkpoint | resume from last committed offset |
| lease/lock | only one worker owns shard for a time |

Bad pattern:

for row in input:
  charge_customer(row.card, row.amount)

Better pattern:

for row in input:
  idempotencyKey = "billing-cycle-2026-07:" + row.invoiceId
  if ledger.alreadyCommitted(idempotencyKey):
    continue
  result = payment.charge(row.card, row.amount, idempotencyKey)
  ledger.commit(idempotencyKey, result)

Top 1% rule:

Job retry is a platform concern. Idempotency is an application/domain concern. You need both.
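
The ledger check in the better pattern can be made concrete with a unique key in a durable store. This is a minimal sketch using an in-memory SQLite database as a stand-in for the real ledger; `open_ledger`, `charge_once`, and the `charge` callback are hypothetical names, and a production ledger would live in a shared database rather than in process memory.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger keyed by idempotency key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledger ("
        "idempotency_key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def charge_once(conn: sqlite3.Connection, key: str, charge) -> str:
    """Call `charge` only if the ledger has no committed result for `key`."""
    row = conn.execute(
        "SELECT result FROM ledger WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # already committed: return the recorded result
    # A crash between the next two steps leaves no ledger row, so the
    # payment provider must also deduplicate on the same key.
    result = charge(key)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO ledger (idempotency_key, result) VALUES (?, ?)",
            (key, result),
        )
    return result
```

Calling `charge_once` twice with the same key invokes `charge` once and returns the stored result the second time, which is the behavior a retried Job Pod needs.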


8. Parallel Jobs

A parallel Job runs multiple Pods concurrently.

apiVersion: batch/v1
kind: Job
metadata:
  name: image-thumbnail-backfill
spec:
  parallelism: 10
  completions: 100
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/media/thumbnailer:3.2.1

Questions before enabling parallelism:

  1. Is the input partitioned safely?
  2. Can downstream systems handle concurrent load?
  3. Does each unit have a unique idempotency key?
  4. Is the job CPU-bound, IO-bound, or API-bound?
  5. What happens if 30% of units fail?
  6. Can we resume without reprocessing everything?
  7. What metric tells us progress?

Concurrency is not free. It shifts bottlenecks.


9. Indexed Jobs

Indexed Jobs give each completion a stable index exposed to the Pod. This is useful for deterministic partitioning.

Example use cases:

  • shard 0..999 of a backfill,
  • data partition per date range,
  • ML batch segment,
  • static file generation chunk,
  • test suite partition.

Example:

apiVersion: batch/v1
kind: Job
metadata:
  name: partitioned-ledger-check
spec:
  completions: 20
  parallelism: 5
  completionMode: Indexed
  backoffLimitPerIndex: 2
  maxFailedIndexes: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: checker
          image: registry.example.com/finance/partition-checker:1.9.0
          env:
            - name: PARTITION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']

The application can use the index to determine its partition.

Pseudo-code:

index = int(env.PARTITION_INDEX)  # set from the completion-index annotation in the manifest above
range = partitionTable[index]
process(range)

Design invariant:

The index should map to stable work. Do not let each Pod randomly grab work if the reason you chose Indexed Job is deterministic partitioning.
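
A stable mapping from index to work can be as simple as arithmetic over the item count. The sketch below assumes the work is a contiguous list of items split as evenly as possible; `partition_bounds` and `index_from_env` are hypothetical helpers. Kubernetes sets the `JOB_COMPLETION_INDEX` environment variable in Pods of an Indexed Job, which is what `index_from_env` reads.

```python
import os


def partition_bounds(index: int, total_items: int, completions: int) -> range:
    """Return the half-open item range owned by one completion index."""
    if not 0 <= index < completions:
        raise ValueError(f"index {index} outside 0..{completions - 1}")
    base, extra = divmod(total_items, completions)
    # The first `extra` indexes each take one additional item.
    start = index * base + min(index, extra)
    size = base + (1 if index < extra else 0)
    return range(start, start + size)


def index_from_env(environ=os.environ) -> int:
    """Read the completion index that Kubernetes injects into the Pod."""
    return int(environ["JOB_COMPLETION_INDEX"])
```

Because the range depends only on the index and the totals, a retried Pod for index 7 recomputes exactly the same slice as the attempt that failed.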


10. Active Deadline and Stuck Work

A Job can fail because it exceeds activeDeadlineSeconds.

spec:
  activeDeadlineSeconds: 3600

Use it when a task has a maximum useful runtime.

Examples:

| Workload | Deadline Rationale |
| --- | --- |
| daily reconciliation | must finish before next business day window |
| report generation | stale after reporting cutoff |
| data migration | should not run indefinitely during release |
| external API sync | token/window may expire |

Danger:

  • Too short: false failures.
  • Too long: stuck jobs waste capacity and hold locks.

Better pattern:

  1. Application emits heartbeat/progress.
  2. Job has deadline.
  3. External ledger records partial progress.
  4. Retry resumes from checkpoint.
  5. Alert fires before deadline, not only after failure.
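
Steps 3 and 4 above (record partial progress, resume from checkpoint) might look like the following sketch. It stores the checkpoint in a local JSON file purely for illustration; in a real Job the checkpoint would need to live somewhere that survives Pod replacement, such as a database or object storage. All function names are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the next offset to process, or 0 when no checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as fh:
            return int(json.load(fh)["next_offset"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_offset: int) -> None:
    """Write atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"next_offset": next_offset}, fh)
    os.replace(tmp, path)


def run(items, path, handle, batch_size=100):
    """Process items in batches, committing the offset after each batch."""
    offset = load_checkpoint(path)
    while offset < len(items):
        for item in items[offset:offset + batch_size]:
            handle(item)
        offset = min(offset + batch_size, len(items))
        save_checkpoint(path, offset)
    return offset
```

Items handled in a batch that crashed before its checkpoint was written are handled again on retry, so `handle` still has to be idempotent.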

11. TTL and Object Cleanup

Completed Jobs and Pods can accumulate quickly.

Use TTL controller:

spec:
  ttlSecondsAfterFinished: 86400

Retention strategy:

| Environment | Successful Job TTL | Failed Job TTL |
| --- | --- | --- |
| dev | 1 hour | 1 day |
| staging | 1 day | 3 days |
| production | 1-7 days | 7-30 days or until archived |

But do not use Kubernetes object retention as your audit system.

Production audit should live in:

  • application logs,
  • metrics,
  • object storage artifacts,
  • database execution ledger,
  • SIEM/audit stream,
  • workflow metadata store.

Kubernetes object TTL is cleanup, not compliance.


12. CronJob: Time-Based Job Factory

A CronJob creates Jobs according to a schedule.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-ledger-reconciliation
  namespace: finance
spec:
  schedule: "15 1 * * *"
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 900
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 604800
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: reconcile
              image: registry.example.com/finance/reconciler:2026.07.01

Important fields:

| Field | Meaning |
| --- | --- |
| schedule | cron expression |
| timeZone | timezone for schedule interpretation |
| concurrencyPolicy | overlap behavior |
| startingDeadlineSeconds | how late a missed run can start |
| suspend | stop future scheduling |
| successfulJobsHistoryLimit | retained successful Jobs |
| failedJobsHistoryLimit | retained failed Jobs |
| jobTemplate | Job spec to create |

13. CronJob Concurrency Policy

concurrencyPolicy determines what happens if the previous run is still active.

| Policy | Behavior | Use Case | Risk |
| --- | --- | --- | --- |
| Allow | overlapping runs allowed | independent periodic tasks | duplicate pressure/side effect |
| Forbid | skip new run if previous active | reconciliation, report, cleanup | missed windows |
| Replace | replace current run with new one | only latest state matters | killing in-flight work |

Example:

spec:
  concurrencyPolicy: Forbid

Use Forbid for most maintenance/reconciliation workloads unless overlap is explicitly safe.

But understand the trade-off: Forbid can skip scheduled runs. That is often better than duplicate financial or data mutations, but it must be monitored.


14. CronJob Timezone and Missed Schedule

Always specify timeZone explicitly.

spec:
  schedule: "0 2 * * *"
  timeZone: "Asia/Jakarta"

But for cross-region or enterprise systems, prefer UTC unless the business domain truly needs local time.

Problems caused by vague time:

  • cluster controller manager timezone differs from expectation,
  • daylight saving changes,
  • regional holiday/time window assumptions,
  • operator confusion during incidents.

Use absolute business language in metadata:

metadata:
  annotations:
    platform.example.com/business-window: "Daily settlement after Jakarta close, 02:00 Asia/Jakarta"
    platform.example.com/owner: "finance-platform"
    platform.example.com/runbook: "https://runbooks.example.com/finance/daily-settlement"
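
When a schedule is expressed in local time, it helps to work out the UTC instant it corresponds to, because logs, metrics, and date arguments often use UTC. The sketch below does that conversion for the 02:00 Asia/Jakarta schedule. It models Jakarta time as a fixed UTC+7 offset with no daylight saving, which is an assumption of this example; `local_run_in_utc` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Jakarta (WIB) is a fixed UTC+7 offset with no DST.
WIB = timezone(timedelta(hours=7), name="WIB")


def local_run_in_utc(year, month, day, hour, minute, tz=WIB) -> datetime:
    """Convert a local scheduled run time to the equivalent UTC instant."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)


run = local_run_in_utc(2026, 7, 1, 2, 0)
print(run.isoformat())  # 2026-06-30T19:00:00+00:00
```

The run for 1 July local time happens on 30 June in UTC, so a job that derives its `--date` argument from the UTC clock would process the wrong day unless the business date is passed explicitly.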

15. Event-Driven Workloads

Not all event-driven workloads should be CronJobs or Jobs.

Two common models:

  1. Continuous queue worker: a Deployment consumes messages forever.
  2. Event-created Job: each event creates a Job or scales a workload from zero.

Decision matrix:

| Situation | Prefer |
| --- | --- |
| high-throughput stream | Deployment worker |
| low-frequency heavyweight task | Job per event |
| queue backlog drives scale | Deployment + KEDA/HPA |
| each event must be separately auditable | Job or workflow |
| multi-step process | workflow engine |
| event changes Kubernetes desired state | controller/operator |

Typical queue worker Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-event-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: payment-event-worker
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payment-event-worker
    spec:
      containers:
        - name: worker
          image: registry.example.com/payments/event-worker:4.1.0
          env:
            - name: QUEUE_NAME
              value: payment-events
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              memory: "1Gi"

This is not a Job because it does not finish. It is a service whose protocol is a queue.


16. Queue Processing Correctness

A queue worker must define message semantics.

| Concept | Question |
| --- | --- |
| delivery model | at-most-once, at-least-once, exactly-once illusion? |
| ack timing | before or after external side effect? |
| visibility timeout | can work finish before message reappears? |
| poison message | how many retries before dead-letter? |
| ordering | per key, global, partitioned, or irrelevant? |
| idempotency | what key prevents duplicate effects? |
| backpressure | how do we slow down safely? |

Canonical processing loop:

while true:
  msg = queue.receive()
  key = msg.idempotencyKey

  if ledger.committed(key):
    queue.ack(msg)
    continue

  try:
    result = process(msg)
    ledger.commit(key, result)
    queue.ack(msg)
  except TransientError:
    queue.nackWithDelay(msg)
  except PermanentError:
    ledger.fail(key)
    queue.deadLetter(msg)

Never ack before the durable side effect unless message loss is acceptable.
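
The loop above leaves one question open: how many transient retries are allowed before a message is treated as poison. The sketch below adds an attempt counter on top of the same structure, using an in-memory `deque` as a stand-in for a real broker. `drain`, the message shape, and `max_attempts` are assumptions for illustration, not the API of any particular queue client.

```python
from collections import deque


class TransientError(Exception):
    """Retryable failure such as a timeout or throttling response."""


def drain(queue: deque, process, committed: set, dead_letter: list,
          max_attempts: int = 3) -> None:
    """Process messages until the in-memory queue is empty."""
    while queue:
        msg = queue.popleft()
        key = msg["key"]
        if key in committed:
            continue  # duplicate delivery: drop without re-processing
        try:
            process(msg)
            committed.add(key)  # record the effect before dropping the message
        except TransientError:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= max_attempts:
                dead_letter.append(msg)  # poison message: stop retrying
            else:
                queue.append(msg)  # redeliver later
        except Exception:
            dead_letter.append(msg)  # permanent failure
```

With a real broker the attempt count usually comes from message metadata rather than a field the worker mutates, and the redelivery delay is handled by the broker.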


17. Event-Driven Scaling

Autoscaling event-driven workloads usually depends on external metrics:

  • queue depth,
  • oldest message age,
  • Kafka consumer lag,
  • stream backlog,
  • pending task count,
  • custom business latency.

Scaling by CPU alone is often wrong for queue consumers. A worker may be blocked on IO while backlog grows.

Better scaling signal:

replicas_needed = ceil(backlog / target_messages_per_replica)

But production scaling must include constraints:

  • downstream rate limits,
  • database connection limits,
  • max cost budget,
  • cold start time,
  • retry storm prevention,
  • poison message isolation.

Top 1% rule:

Scaling consumers without downstream capacity modelling converts backlog into outage amplification.
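
The backlog formula and the downstream constraints can be combined into a single calculation. The following sketch is one possible shape for it; the parameter names and the default replica bounds are assumptions for illustration, not values taken from KEDA or the HPA.

```python
import math


def desired_replicas(backlog: int, target_per_replica: int,
                     per_replica_qps: float, downstream_qps_limit: float,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Scale on backlog, then cap by downstream capacity and a hard maximum."""
    if target_per_replica <= 0 or per_replica_qps <= 0:
        raise ValueError("targets must be positive")
    by_backlog = math.ceil(backlog / target_per_replica)
    # Each replica is assumed to send at most per_replica_qps downstream.
    by_downstream = math.floor(downstream_qps_limit / per_replica_qps)
    capped = min(by_backlog, by_downstream, max_replicas)
    return max(min_replicas, capped)


print(desired_replicas(10_000, 100, per_replica_qps=5, downstream_qps_limit=50))  # 10
```

Backlog alone asks for 100 replicas here, but the downstream limit of 50 QPS allows only 10, so the backlog drains more slowly instead of overloading the dependency. Note that `min_replicas` wins when the downstream cap computes to zero.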


18. Batch Workload Resource Design

Batch workloads often create resource spikes.

Sizing questions:

  1. What is the per-unit CPU/memory profile?
  2. Does memory grow with input size?
  3. Is the work parallelizable?
  4. Is there a safe maximum parallelism?
  5. Should batch run on separate node pool?
  6. Does it compete with latency-sensitive workloads?
  7. Can it be preempted?
  8. What is the business deadline?

Pattern: dedicated batch node pool.

spec:
  template:
    spec:
      nodeSelector:
        workload-tier: batch
      tolerations:
        - key: workload-tier
          operator: Equal
          value: batch
          effect: NoSchedule

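This keeps heavy batch Pods off the nodes that run customer-facing services. It assumes the batch nodes carry the workload-tier=batch label; the matching NoSchedule taint keeps other workloads off the batch pool.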


19. Database Migration Jobs

Database migration is a special kind of Job and deserves stricter handling.

Bad pattern:

Application starts -> runs migrations -> multiple replicas race -> partial schema change -> outage

Safer approaches:

| Pattern | Use When |
| --- | --- |
| pre-deploy migration Job | schema change must happen before app rollout |
| expand/contract migration | zero-downtime schema evolution |
| migration controller | complex multi-step migration governance |
| manual approved workflow | high-risk regulated change |

Migration Job checklist:

  • single execution lock,
  • idempotent migration scripts,
  • versioned schema ledger,
  • backup/restore plan,
  • timeout and rollback/forward plan,
  • application compatibility matrix,
  • no destructive change in same deploy as code dependency,
  • clear owner and approval trail.

Example skeleton:

apiVersion: batch/v1
kind: Job
metadata:
  name: orders-schema-migration-20260701
  namespace: orders
  labels:
    app.kubernetes.io/name: orders
    platform.example.com/change-type: schema-migration
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 900
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: orders-migration
      containers:
        - name: migrate
          image: registry.example.com/orders/migrator:2026.07.01
          args:
            - "--target-version=2026_07_01_001"

For regulated systems, backoffLimit: 0 may be preferable: fail once, inspect, decide. Blind retries on DDL can be dangerous.


20. Maintenance and Cleanup Jobs

Cleanup Jobs are deceptively risky.

Examples:

  • delete expired sessions,
  • purge temporary files,
  • compact database records,
  • archive audit logs,
  • remove stale Kubernetes objects,
  • clean object storage prefixes.

Risk model:

| Risk | Mitigation |
| --- | --- |
| accidental broad delete | dry-run mode and scoped filters |
| race with active workload | leases and freshness checks |
| irreversible loss | retention and backup window |
| API overload | rate limit and pagination |
| hidden partial cleanup | progress ledger |
| no audit | structured deletion log |

Strong cleanup pattern:

1. Discover candidates.
2. Write candidate set to durable audit artifact.
3. Validate scope thresholds.
4. Delete in pages with rate limit.
5. Record each deletion.
6. Emit summary metric and artifact location.

Do not write cleanup jobs that silently delete unbounded resources.
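
The six-step cleanup pattern can be sketched as a single function with a dry-run default and a scope threshold. This is an illustrative outline only: `cleanup`, `ScopeTooLarge`, the `delete` callback, and the list used as an audit log are all hypothetical, and a real job would write the audit trail to durable storage.

```python
import time


class ScopeTooLarge(Exception):
    """Raised when the candidate set exceeds the allowed threshold."""


def cleanup(candidates, delete, audit_log, max_candidates=1000,
            page_size=100, pause_seconds=0.0, dry_run=True) -> int:
    """Delete candidates in pages; return the number actually deleted."""
    candidates = list(candidates)
    # Steps 1-2: record the full candidate set before touching anything.
    audit_log.append({"event": "candidates", "ids": candidates})
    # Step 3: refuse to continue when the scope looks wrong.
    if len(candidates) > max_candidates:
        raise ScopeTooLarge(
            f"{len(candidates)} candidates exceeds {max_candidates}"
        )
    if dry_run:
        return 0
    deleted = 0
    # Steps 4-5: delete page by page and record each deletion.
    for start in range(0, len(candidates), page_size):
        for item in candidates[start:start + page_size]:
            delete(item)
            audit_log.append({"event": "deleted", "id": item})
            deleted += 1
        time.sleep(pause_seconds)  # crude rate limit between pages
    # Step 6: summary for metrics or the final log line.
    audit_log.append({"event": "summary", "deleted": deleted})
    return deleted
```

Defaulting `dry_run` to true means a misconfigured invocation lists candidates instead of deleting them, which is the safer failure mode for this kind of job.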


21. Workflow Engines vs Native Jobs

Native Kubernetes Jobs are excellent for simple finite work. They become awkward when you need:

  • DAG dependencies,
  • artifact passing,
  • human approval,
  • retries per step,
  • compensation steps,
  • branch/merge logic,
  • long-running workflows,
  • visibility across many tasks,
  • domain-level status.

At that point, use a workflow system or build a controller.

Examples of workflow-style needs:

Do not encode complex workflow state in shell scripts and Kubernetes annotations unless you are intentionally building a workflow engine.


22. Observability for Batch and Event Workloads

Batch observability must answer:

  1. Did the work start?
  2. Which version/image ran?
  3. What input range was processed?
  4. How many units succeeded, failed, skipped, retried?
  5. What external side effects occurred?
  6. Was the result complete?
  7. Where is the artifact/report?
  8. Who owns the failure?

Minimum signals:

| Signal | Example |
| --- | --- |
| logs | structured per unit-of-work |
| metrics | processed count, failed count, retry count, duration |
| traces | external API/database call path |
| events | Job/CronJob lifecycle |
| status | Job condition, completed/failed indexes |
| artifact | reconciliation report, output file |
| audit | execution ledger |

Metric examples:

batch_job_duration_seconds{job="daily-ledger-reconciliation"}
batch_units_processed_total{job="daily-ledger-reconciliation",result="success"}
batch_units_failed_total{job="daily-ledger-reconciliation",reason="validation"}
batch_last_success_timestamp_seconds{job="daily-ledger-reconciliation"}
queue_oldest_message_age_seconds{queue="payment-events"}
queue_consumer_lag{consumer_group="payment-worker"}

Alert on business freshness, not only Pod failure.

Bad alert:

Job failed

Better alert:

Daily ledger reconciliation has not completed successfully by 03:00 UTC.
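
A freshness alert of this kind reduces to a comparison between the last successful completion and a daily deadline. The sketch below expresses that check in plain code; in practice it would more likely be an alert rule over a metric such as batch_last_success_timestamp_seconds. `is_overdue` is a hypothetical helper and expects timezone-aware datetimes.

```python
from datetime import datetime, time, timezone
from typing import Optional


def is_overdue(last_success: Optional[datetime], now: datetime,
               deadline: time = time(3, 0)) -> bool:
    """True when today's run has not succeeded by the UTC deadline."""
    now_utc = now.astimezone(timezone.utc)
    if now_utc.time() < deadline:
        return False  # still inside the allowed window
    if last_success is None:
        return True  # the job has never succeeded
    # A success counts only if it happened on the current UTC date.
    return last_success.astimezone(timezone.utc).date() < now_utc.date()
```

The check stays silent while a run is retrying before 03:00 UTC and fires once the business deadline passes, regardless of how many Pods failed along the way.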

23. Debugging Job and CronJob Failures

Debugging sequence:

kubectl get cronjob -n finance
kubectl describe cronjob daily-ledger-reconciliation -n finance
kubectl get job -n finance --sort-by=.metadata.creationTimestamp
kubectl describe job daily-ledger-reconciliation-28766520 -n finance
kubectl get pods -n finance -l job-name=daily-ledger-reconciliation-28766520
kubectl logs -n finance job/daily-ledger-reconciliation-28766520
kubectl get events -n finance --sort-by=.lastTimestamp

Common symptoms:

| Symptom | Likely Cause |
| --- | --- |
| CronJob did not create Job | suspended, missed deadline, controller issue, invalid schedule |
| Job active forever | stuck process, no deadline, blocked downstream |
| Job repeatedly fails | bad input, missing secret, permission, bug |
| Many failed Pods | backoff/retry storm |
| Multiple Jobs overlap | concurrencyPolicy: Allow |
| Job succeeded but result missing | app bug, swallowed error, weak domain validation |
| CronJob suddenly creates many Jobs | unsuspended with missed schedules and no starting deadline |

Important distinction:

Kubernetes status tells you execution state. Domain ledger tells you business completion.

You need both.


24. Governance for Enterprise Batch Workloads

Batch workloads should be governed because they mutate data, consume capacity, and often run with elevated permissions.

Required metadata:

metadata:
  labels:
    app.kubernetes.io/name: daily-ledger-reconciliation
    app.kubernetes.io/part-of: finance-platform
    app.kubernetes.io/managed-by: gitops
    platform.example.com/workload-class: batch
    platform.example.com/data-classification: restricted
  annotations:
    platform.example.com/owner: finance-platform
    platform.example.com/runbook: https://runbooks.example.com/finance/ledger-reconciliation
    platform.example.com/slo: "complete by 03:00 UTC daily"
    platform.example.com/max-downstream-qps: "50"

Policy examples:

  • CronJobs must specify timeZone.
  • CronJobs must specify concurrencyPolicy.
  • Jobs must specify resource requests.
  • Production Jobs must specify owner/runbook annotations.
  • Migration Jobs must use dedicated ServiceAccount.
  • Cleanup Jobs must support dry-run or threshold guard.
  • Failed Jobs must be retained long enough for debugging.
  • Workloads with broad API access must run on trusted node pools.
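
Several of these policies are mechanical enough to check in CI before a manifest reaches the cluster. The sketch below validates a CronJob manifest that has already been parsed into a dictionary. `cronjob_violations` is a hypothetical helper that covers only a few of the rules above; cluster-side enforcement would normally use an admission policy engine instead.

```python
REQUIRED_ANNOTATIONS = (
    "platform.example.com/owner",
    "platform.example.com/runbook",
)


def cronjob_violations(manifest: dict) -> list:
    """Return human-readable policy violations for one CronJob manifest."""
    spec = manifest.get("spec", {})
    annotations = manifest.get("metadata", {}).get("annotations") or {}
    violations = []
    if not spec.get("timeZone"):
        violations.append("spec.timeZone is required")
    if not spec.get("concurrencyPolicy"):
        violations.append("spec.concurrencyPolicy is required")
    for key in REQUIRED_ANNOTATIONS:
        if key not in annotations:
            violations.append(f"annotation {key} is required")
    # Walk down to the Pod spec inside the Job template.
    pod_spec = (
        spec.get("jobTemplate", {})
        .get("spec", {})
        .get("template", {})
        .get("spec", {})
    )
    for container in pod_spec.get("containers", []):
        if not container.get("resources", {}).get("requests"):
            violations.append(
                f"container {container.get('name')} has no resource requests"
            )
    return violations
```

An empty list means the manifest passes these particular checks; a pipeline step could fail the build whenever the list is non-empty.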

25. Production Checklist

Before approving a Job/CronJob/event workload:

  • Workload type is correct: Job, CronJob, Deployment worker, workflow, or controller.
  • Idempotency key is defined.
  • Retry policy distinguishes transient vs permanent failure.
  • backoffLimit is intentional.
  • activeDeadlineSeconds is set for bounded work.
  • ttlSecondsAfterFinished or retention strategy exists.
  • Resource requests/limits are defined.
  • Downstream rate limits are understood.
  • CronJob has explicit timeZone.
  • CronJob has explicit concurrencyPolicy.
  • Missed schedule behavior is understood.
  • Logs are structured with unit-of-work identifiers.
  • Metrics expose duration, success, failure, retry, and freshness.
  • Failed execution has a runbook.
  • Sensitive workloads use least-privilege ServiceAccount.
  • Domain completion is recorded outside Kubernetes object status.

26. Practical Exercises

Exercise 1: Design Review

Take one scheduled task from a real system. Answer:

  1. What is its trigger?
  2. What is its unit-of-work?
  3. What is its idempotency key?
  4. What happens if the task runs twice?
  5. What is its retry policy?
  6. What is its completion signal?
  7. What is its business freshness SLO?

Exercise 2: CronJob Hardening

Turn a CronJob that only has a schedule into a production-ready one with:

  • timeZone,
  • concurrencyPolicy,
  • startingDeadlineSeconds,
  • backoffLimit,
  • activeDeadlineSeconds,
  • history limits,
  • resource requests,
  • owner/runbook annotations.

Exercise 3: Queue Worker Scaling

For a queue worker, define:

  • queue depth target per replica,
  • maximum replicas,
  • downstream QPS limit,
  • poison message strategy,
  • oldest-message-age alert,
  • idempotency mechanism.

27. Summary

Batch and event-driven workloads are an area where Kubernetes provides the controller lifecycle, but correctness still has to be designed in the application/domain layer.

Key takeaways:

  1. A Job is a run-to-completion controller, not just a run-once Pod.
  2. A CronJob is a time-based Job factory, not a perfect scheduler.
  3. Finite workloads must be idempotent because duplicate execution can happen.
  4. Parallelism must be modelled against downstream capacity.
  5. Event-driven scaling must use backlog/freshness signals, not CPU alone.
  6. Migration and cleanup Jobs need stricter governance.
  7. Batch observability must measure domain completion, not only process exit.

A top 1% Kubernetes engineer does not first ask “What does the Job YAML look like?” They ask:

What is the unit-of-work, what are the retry semantics, what is the side effect, what is the completion proof, and what is the failure boundary?

