
Capstone: Production-Grade Container Platform from Laptop to Swarm

Learn Docker, Containerization, Docker Compose, Docker Swarm - Part 035

End-to-end capstone for building a production-grade container platform from laptop to Swarm: image, Compose dev/test, registry promotion, SBOM, secrets, stack deploy, rolling update, rollback, observability, incident drill, and a 20-hour deliberate practice checklist.


Part 035 — Capstone: Production-Grade Container Platform from Laptop to Swarm

This is the final part of the series.

We will not introduce major new concepts.

We will combine the whole series into one end-to-end system:

  • the application is built into a reproducible image;
  • developers run local dependencies with Compose;
  • integration tests run on an ephemeral Compose stack;
  • images are promoted to the registry with correct tags and digests;
  • SBOM, provenance, and security scans serve as evidence;
  • secrets and configs are kept separate from the image;
  • the stack is deployed to Docker Swarm;
  • rolling updates have a health gate and a rollback path;
  • observability is sufficient to debug real incidents;
  • failure drills are run deliberately;
  • skill is measured with a rubric, not by feel.

This capstone is written in the style of an internal engineering handbook: not a toy tutorial, but a blueprint you can adapt for small-to-medium platforms, internal tools, regulated systems, service labs, or a migration runway before Kubernetes.


1. Kaufman Lens: From Knowledge to Fluency

Kaufman teaches that learning fast does not mean learning shallowly.

For a complex skill, we need to:

  1. deconstruct the skill into small subskills;
  2. learn enough to self-correct;
  3. remove barriers to practice;
  4. practice deliberately on real scenarios;
  5. create fast feedback loops.

In the Docker context, the end goal is not to memorize commands.

The end goal is to be able to answer questions such as:

  • why this container is stuck in a restart loop;
  • why the build is slow;
  • why the image is bloated;
  • why a service can connect on the laptop but fails on Swarm;
  • how a secret can leak into an image layer;
  • why a rolling update caused an outage;
  • why a stateful service is not safe to move between nodes;
  • when Compose is enough;
  • when Swarm makes sense;
  • when another orchestrator is needed;
  • what evidence is required to make a deployment defensible.

This capstone forces us to work through all of these boundaries in one lifecycle.


2. Capstone System: Regulatory Case API Platform

We will use a reasonably realistic example domain:

Regulatory Case API Platform: a backend system that receives reports, creates enforcement cases, stores case state, publishes audit events, and exposes health/metrics endpoints.

The architecture is deliberately simple but rich enough:

| Component | Function | Runtime |
| --- | --- | --- |
| case-api | HTTP API for the case lifecycle | custom application image |
| worker | asynchronous worker for event processing | same image, different command |
| postgres | transactional database | official image for dev/test only; managed/external in production |
| redis | lightweight queue/cache | official image |
| reverse-proxy | local ingress / example production edge | nginx/traefik/caddy style |
| prometheus | metrics scrape lab | optional profile |
| grafana | dashboard lab | optional profile |
| migrate | one-shot migration job | same image, migration command |
| seed | one-shot fixture job | same image, seed command |

System targets:

  • build the image once;
  • run it in local Compose;
  • test it in CI Compose;
  • push it to the registry;
  • deploy it to Swarm as a stack;
  • update the image by digest;
  • roll back if health fails;
  • keep audit artifacts available.

3. End-to-End Mental Model

The flow runs from the developer laptop, through the build and the registry, to Swarm.

There are three important states:

  1. source state: Git commit, Dockerfile, Compose file, stack file;
  2. artifact state: image digest, SBOM, provenance, scan result;
  3. runtime state: containers/tasks/services, logs, metrics, health, events.

Engineering mistakes often come from mixing these three states.

Examples:

  • deploying by mutable tag instead of digest;
  • fixing a live container instead of fixing the source/build;
  • storing secret config in the image;
  • treating the dev Compose file as the production deployment spec;
  • treating a healthcheck as equivalent to application readiness;
  • assuming rollback is safe without considering migration/state.

4. Repository Layout

A tidy layout reduces practice friction and operational ambiguity.

reg-case-platform/
├── app/
│   ├── src/
│   ├── package.json | pom.xml | build.gradle | go.mod
│   └── migrations/
├── docker/
│   ├── case-api.Dockerfile
│   ├── nginx.conf
│   └── entrypoint.sh
├── compose.yml
├── compose.dev.yml
├── compose.test.yml
├── compose.observability.yml
├── stack.yml
├── env/
│   ├── dev.env.example
│   ├── test.env.example
│   └── prod.env.example
├── secrets/
│   └── README.md
├── scripts/
│   ├── dev-up.sh
│   ├── dev-reset.sh
│   ├── test-integration.sh
│   ├── build-image.sh
│   ├── push-image.sh
│   ├── deploy-stack.sh
│   ├── rollback-stack.sh
│   └── collect-diagnostics.sh
├── ops/
│   ├── release-checklist.md
│   ├── incident-runbook.md
│   ├── capacity-worksheet.md
│   └── threat-model.md
└── README.md

Principles:

  • compose.yml holds the base application model;
  • compose.dev.yml is only an override for the developer loop;
  • compose.test.yml is only an override for test isolation;
  • compose.observability.yml is an optional profile for metrics/log dashboards;
  • stack.yml is the deployment contract for Swarm;
  • secrets are never committed;
  • scripts only wrap commands that are already clear; they do not hide logic.

5. Production-Grade Dockerfile

The following example is generic for a modern compiled/runtime application.

If your application is Java, Node, Go, .NET, or Python, the builder/runtime details will differ, but the invariants are the same.

# syntax=docker/dockerfile:1.7

ARG RUNTIME_BASE=debian:bookworm-slim
ARG BUILD_BASE=debian:bookworm-slim

FROM ${BUILD_BASE} AS build
WORKDIR /src

# Install build dependencies in one layer.
# Replace this with Maven/Gradle/npm/go/dotnet specific dependency resolution.
RUN --mount=type=cache,target=/var/cache/apt \
    --mount=type=cache,target=/var/lib/apt \
    apt-get update && \
    apt-get install -y --no-install-recommends ca-certificates curl build-essential && \
    rm -rf /var/lib/apt/lists/*

COPY app/ ./app/

# Build artifact. Keep output deterministic.
RUN --mount=type=cache,target=/root/.cache \
    mkdir -p /out && \
    cp -r app /out/app

FROM ${RUNTIME_BASE} AS runtime

LABEL org.opencontainers.image.title="reg-case-platform" \
      org.opencontainers.image.description="Regulatory Case API Platform" \
      org.opencontainers.image.source="https://example.internal/reg-case-platform" \
      org.opencontainers.image.vendor="internal-platform"

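# curl is used by the HEALTHCHECK below; the slim base image does not ship it.
RUN apt-get update && \
    apt-get install -y --no-install-recommends ca-certificates curl && \
    rm -rf /var/lib/apt/lists/* && \
    groupadd --system app && \
    useradd --system --gid app --home-dir /app --shell /usr/sbin/nologin app && \
    mkdir -p /app /var/log/app /tmp/app && \
    chown -R app:app /app /var/log/app /tmp/app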

WORKDIR /app
COPY --from=build --chown=app:app /out/app/ ./
COPY --chown=app:app docker/entrypoint.sh /usr/local/bin/entrypoint.sh

RUN chmod 0555 /usr/local/bin/entrypoint.sh

USER app:app

ENV APP_PORT=8080 \
    APP_ENV=production \
    TMPDIR=/tmp/app

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=3s --start-period=20s --retries=3 \
  CMD curl -fsS http://127.0.0.1:${APP_PORT}/health || exit 1

ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
CMD ["api"]

Review invariants:

| Area | Expected invariant |
| --- | --- |
| Base image | explicit, reviewed, scanned |
| Build stage | contains build dependencies only |
| Runtime stage | contains runtime artifact only |
| User | non-root by default |
| Secrets | not copied, not passed via ARG |
| Cache | dependency/build cache isolated |
| Metadata | OCI labels present |
| Entrypoint | signal-safe, no shell trap if avoidable |
| Health | local, cheap, dependency-aware enough |
| Filesystem | writable paths explicit |

Do not turn the Dockerfile into the home of all operational logic.

The Dockerfile should answer:

  • what the runtime artifact is;
  • what the minimum dependencies are;
  • which user runs the process;
  • what the default command is;
  • what the minimum health signal is.

It should not answer:

  • how to deploy;
  • how to rotate secrets;
  • how to run production migrations without control;
  • how to patch a live container.

6. Entrypoint Contract

The entrypoint must be small and defensible.

#!/usr/bin/env sh
set -eu

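mode="${1:-api}"
# Guard the shift: with no arguments, a bare shift can abort a POSIX sh such as dash even when followed by "|| true".
if [ "$#" -gt 0 ]; then shift; fi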

case "$mode" in
  api)
    exec /app/bin/case-api "$@"
    ;;
  worker)
    exec /app/bin/case-worker "$@"
    ;;
  migrate)
    exec /app/bin/case-migrate "$@"
    ;;
  seed)
    exec /app/bin/case-seed "$@"
    ;;
  *)
    echo "unknown mode: $mode" >&2
    exit 64
    ;;
esac

Principles:

  • use exec so the application process becomes PID 1;
  • validate the mode explicitly;
  • do not run migrations automatically on every API startup;
  • do not write secrets to logs;
  • do not swallow signals;
  • do not create hidden infinite loops.
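
The exec principle can be checked without Docker. The sketch below assumes only a POSIX sh; the helper names pid_after_exec and pid_without_exec are made up for illustration. It prints the PID of a wrapper shell next to the PID of the command it launches, once with exec and once without.

```shell
#!/usr/bin/env sh
# Prints "<wrapper pid> <launched process pid>" on one line.
# With exec the wrapper is replaced in place, so both numbers are equal. In a
# container this is why the application, not the shell, ends up as PID 1 and
# receives SIGTERM directly.
pid_after_exec() {
  sh -c 'printf "%s " "$$"; exec sh -c "echo \$\$"'
}

# Same experiment without exec: the launched process is a child of the wrapper
# and gets its own PID. The trailing ":" keeps the wrapper shell alive.
pid_without_exec() {
  sh -c 'printf "%s " "$$"; sh -c "echo \$\$"; :'
}

pid_after_exec
pid_without_exec
```

The first line of output shows the same PID twice; the second shows two different PIDs. In the second case a wrapper shell would sit between Docker and the application and would have to forward signals itself.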

7. Compose Base Model

compose.yml is the application graph.

name: reg-case

services:
  case-api:
    image: ${APP_IMAGE:-reg-case-platform:local}
    build:
      context: .
      dockerfile: docker/case-api.Dockerfile
    command: ["api"]
    environment:
      APP_ENV: ${APP_ENV:-development}
      APP_PORT: "8080"
      DATABASE_URL: postgres://case:${POSTGRES_PASSWORD:-case_dev}@postgres:5432/case
      REDIS_URL: redis://redis:6379/0
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
      migrate:
        condition: service_completed_successfully
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://127.0.0.1:8080/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 5
      start_period: 20s
    networks:
      - app-net
    read_only: true
    tmpfs:
      - /tmp/app
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true

  worker:
    image: ${APP_IMAGE:-reg-case-platform:local}
    build:
      context: .
      dockerfile: docker/case-api.Dockerfile
    command: ["worker"]
    environment:
      APP_ENV: ${APP_ENV:-development}
      DATABASE_URL: postgres://case:${POSTGRES_PASSWORD:-case_dev}@postgres:5432/case
      REDIS_URL: redis://redis:6379/0
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
      migrate:
        condition: service_completed_successfully
    networks:
      - app-net
    read_only: true
    tmpfs:
      - /tmp/app
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true

  migrate:
    image: ${APP_IMAGE:-reg-case-platform:local}
    build:
      context: .
      dockerfile: docker/case-api.Dockerfile
    command: ["migrate"]
    environment:
      DATABASE_URL: postgres://case:${POSTGRES_PASSWORD:-case_dev}@postgres:5432/case
    depends_on:
      postgres:
        condition: service_healthy
    restart: "no"
    networks:
      - app-net

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: case
      POSTGRES_USER: case
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-case_dev}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U case -d case"]
      interval: 5s
      timeout: 3s
      retries: 10
    networks:
      - app-net

  redis:
    image: redis:7
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis-data:/data
    networks:
      - app-net

networks:
  app-net:
    driver: bridge

volumes:
  postgres-data:
  redis-data:

Note several things:

  • case-api and worker use the same image with different commands;
  • migrate is one-shot and does not restart;
  • the dev/test database lives in Compose, which does not mean the production database must live in Swarm;
  • depends_on provides minimum readiness, not a guarantee that the application is free of races;
  • read_only, tmpfs, cap_drop, and no-new-privileges are included from dev onward so that hardening is not a production surprise;
  • the database port is not published to the host by default.
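
Because depends_on only orders startup, the application or a wrapper script usually still needs its own retry around the first connection to a dependency. The following is a minimal sketch under that assumption, written for a POSIX sh; the function name retry and its argument order are illustrative and not part of Compose.

```shell
#!/usr/bin/env sh
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failure. Returns 0 on success, otherwise the exit status
# of the last failed attempt.
retry() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return "$status"
    fi
    echo "retry: attempt $attempt failed (exit $status), sleeping ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example call (not executed here; pg_isready must be installed in the image):
#   retry 5 1 pg_isready -h postgres -U case -d case
```

A bounded retry keeps the failure visible: if the dependency never becomes reachable, the process exits non-zero and the restart policy or the test harness reports it instead of hanging.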

8. Compose Dev Override

compose.dev.yml speeds up the inner loop.

services:
  case-api:
    build:
      target: runtime
    environment:
      APP_ENV: development
      LOG_LEVEL: debug
    volumes:
      - ./app:/app:ro
    develop:
      watch:
        - action: sync+restart
          path: ./app
          target: /app
          ignore:
            - node_modules/
            - target/
            - build/

  worker:
    environment:
      APP_ENV: development
      LOG_LEVEL: debug

  seed:
    image: ${APP_IMAGE:-reg-case-platform:local}
    command: ["seed"]
    profiles: ["seed"]
    environment:
      DATABASE_URL: postgres://case:${POSTGRES_PASSWORD:-case_dev}@postgres:5432/case
    depends_on:
      postgres:
        condition: service_healthy
      migrate:
        condition: service_completed_successfully
    networks:
      - app-net

  mailhog:
    image: mailhog/mailhog
    profiles: ["debug"]
    ports:
      - "8025:8025"
    networks:
      - app-net

How to run it:

docker compose -f compose.yml -f compose.dev.yml up --build

With debug tools:

docker compose -f compose.yml -f compose.dev.yml --profile debug up --build

With seed fixtures:

docker compose -f compose.yml -f compose.dev.yml --profile seed run --rm seed

Compose dev is not production.

But Compose dev must still teach production boundaries:

  • service discovery via service name;
  • dependency readiness via healthcheck;
  • secrets/config kept separate;
  • explicit writable paths;
  • the non-root user still works;
  • logs go to stdout/stderr;
  • no hidden host dependency.

9. Compose Test Harness

compose.test.yml must be deterministic, disposable, and easy to clean up.

name: reg-case-test-${BUILD_ID:-local}

services:
  case-api:
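    # Compose appends sequences when merging files, so a plain empty list would
    # keep the port published by compose.yml. The !reset tag (supported by recent
    # Compose releases) clears it.
    ports: !reset []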
    environment:
      APP_ENV: test
      LOG_LEVEL: info

  test-runner:
    image: ${APP_IMAGE:-reg-case-platform:local}
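    # Assumes entrypoint.sh is extended with a test-integration mode;
    # the version shown in section 6 would reject it with exit code 64.
    command: ["test-integration"]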
    environment:
      APP_ENV: test
      API_URL: http://case-api:8080
      DATABASE_URL: postgres://case:${POSTGRES_PASSWORD:-case_test}@postgres:5432/case
      REDIS_URL: redis://redis:6379/0
    depends_on:
      case-api:
        condition: service_healthy
    networks:
      - app-net

CI script:

#!/usr/bin/env bash
set -euo pipefail

export BUILD_ID="${BUILD_ID:-$(date +%s)}"
export APP_IMAGE="reg-case-platform:${GIT_SHA:-local}"

cleanup() {
  docker compose -f compose.yml -f compose.test.yml down -v --remove-orphans || true
}
trap cleanup EXIT

docker buildx build \
  --load \
  -t "$APP_IMAGE" \
  -f docker/case-api.Dockerfile \
  .

docker compose -f compose.yml -f compose.test.yml up \
  --abort-on-container-exit \
  --exit-code-from test-runner

Testing invariant:

| Invariant | Why it matters |
| --- | --- |
| unique project name | parallel CI isolation |
| down -v cleanup | no data contamination |
| health-gated test runner | avoids startup race |
| no host ports | avoids CI port collision |
| same image as deploy candidate | tests artifact, not source checkout illusion |
| diagnostics on failure | makes flakiness actionable |

Add diagnostics on failure:

docker compose -f compose.yml -f compose.test.yml ps
docker compose -f compose.yml -f compose.test.yml logs --no-color
# "events" streams until interrupted, so bound it to keep the CI job from hanging.
timeout 5 docker compose -f compose.yml -f compose.test.yml events --json || true
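
The unique-project-name invariant depends on the name being valid. Compose project names may contain only lowercase letters, digits, dashes, and underscores, and must start with a lowercase letter or digit, while CI build IDs and branch names often contain other characters. The helper below is a hedged sketch for deriving a safe name; compose_project_name is a hypothetical function, not a Compose command, and it assumes the prefix passed in is already valid.

```shell
#!/usr/bin/env sh
# compose_project_name PREFIX RAW_ID
# Lowercases RAW_ID, replaces every character outside [a-z0-9_-] with "-",
# and prints "<PREFIX>-<cleaned id>".
compose_project_name() {
  prefix="$1"
  raw_id="$2"
  clean_id="$(printf '%s' "$raw_id" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9_-' '-')"
  printf '%s-%s\n' "$prefix" "$clean_id"
}

compose_project_name reg-case-test 'Feature/ABC.42'
```

The result can be passed with `docker compose -p "$(compose_project_name reg-case-test "$BUILD_ID")" ...`. Note that -p takes precedence over the name: key in the Compose files.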

10. Build and Release Pipeline

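A minimal ideal pipeline builds the image, runs the integration tests against it, pushes it with SBOM and provenance attestations, reviews the scan result, records the digest, and deploys by digest.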

Build command example:

#!/usr/bin/env bash
set -euo pipefail

IMAGE_REPO="registry.example.internal/platform/reg-case-platform"
GIT_SHA="${GIT_SHA:?GIT_SHA required}"
VERSION="${VERSION:-0.0.0-$GIT_SHA}"

IMAGE_SHA_TAG="$IMAGE_REPO:$GIT_SHA"
IMAGE_VERSION_TAG="$IMAGE_REPO:$VERSION"

# Use docker-container builder for multi-platform builds and attestations.
docker buildx create --name reg-case-builder --use --bootstrap || docker buildx use reg-case-builder

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --file docker/case-api.Dockerfile \
  --tag "$IMAGE_SHA_TAG" \
  --tag "$IMAGE_VERSION_TAG" \
  --sbom=true \
  --provenance=true \
  --push \
  .

Deployment should consume digest, not only tag.

Resolve digest:

docker buildx imagetools inspect "$IMAGE_SHA_TAG"

Or store digest from CI output.

Release evidence should include:

  • Git commit;
  • image tags;
  • image digest;
  • Dockerfile path;
  • build command/pipeline run;
  • SBOM reference;
  • provenance reference;
  • vulnerability scan result;
  • integration test result;
  • stack file version;
  • deploy timestamp;
  • smoke test result;
  • rollback plan.
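
Part of this evidence can be captured mechanically at the end of a pipeline run. The sketch below writes a small key=value record. GIT_SHA and APP_IMAGE_DIGEST are the variables used by the scripts in this handbook; INTEGRATION_TEST_RESULT, the key names, and the function write_release_evidence are assumptions made for illustration.

```shell
#!/usr/bin/env sh
# write_release_evidence OUTPUT_FILE
# Writes a key=value evidence record to OUTPUT_FILE.
# Fails with status 1 if the commit or the image digest is unknown.
write_release_evidence() {
  out="$1"
  if [ -z "${GIT_SHA:-}" ] || [ -z "${APP_IMAGE_DIGEST:-}" ]; then
    echo "GIT_SHA and APP_IMAGE_DIGEST are required" >&2
    return 1
  fi
  {
    printf 'git_sha=%s\n' "$GIT_SHA"
    printf 'image_digest=%s\n' "$APP_IMAGE_DIGEST"
    printf 'integration_test=%s\n' "${INTEGRATION_TEST_RESULT:-unknown}"
    printf 'recorded_at=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  } > "$out"
}
```

A flat text record like this is easy to archive next to the pipeline logs; the remaining items on the list (SBOM, provenance, scan report) are referenced by the digest rather than copied into the record.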

11. Registry Promotion Model

Do not treat registry as a dumping ground.

Use promotion tiers:

| Tier | Tag example | Meaning |
| --- | --- | --- |
| CI candidate | :sha-abc123 | built from commit |
| Release candidate | :0.9.0-rc.1 | passed integration gate |
| Production release | :1.4.2 | approved immutable release |
| Environment marker | :prod | optional pointer, never sole source of truth |

Hard rule:

Deployment records must include digest.

Tag is human-friendly.

Digest is evidence-friendly.

registry.example.internal/platform/reg-case-platform@sha256:...

Operational anti-pattern:

image: registry.example.internal/platform/reg-case-platform:latest

Better:

image: registry.example.internal/platform/reg-case-platform@sha256:012345...

Or if tooling forces tags, store resolved digest beside the deployment record.
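
The hard rule can be enforced with a small guard before deployment. This is a sketch only: is_digest_pinned is a hypothetical helper, and it checks just the shape of the reference (an @sha256: suffix followed by 64 lowercase hex characters), not whether the digest exists in the registry.

```shell
#!/usr/bin/env sh
# is_digest_pinned IMAGE_REF
# Succeeds only when IMAGE_REF ends in @sha256:<64 lowercase hex characters>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Example guard for a deploy script (commented out so the sketch has no side effects):
#   is_digest_pinned "$APP_IMAGE_DIGEST" || { echo "refusing to deploy an unpinned image" >&2; exit 1; }
```

A reference such as repo:latest or a truncated digest fails the check, so a mutable tag cannot slip into a deployment record unnoticed.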


12. Swarm Stack File

stack.yml is not just Compose copied to production.

It is deployment intent.

version: "3.9"

services:
  case-api:
    image: ${APP_IMAGE_DIGEST}
    command: ["api"]
    environment:
      APP_ENV: production
      APP_PORT: "8080"
      DATABASE_URL_FILE: /run/secrets/database_url
      REDIS_URL: redis://redis:6379/0
    secrets:
      - database_url
    configs:
      - source: case_api_config_v1
        target: /app/config/application.yml
    networks:
      - public-net
      - app-net
    ports:
      - target: 8080
        published: 8080
        protocol: tcp
        mode: ingress
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://127.0.0.1:8080/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 30s
    deploy:
      mode: replicated
      replicas: 4
      endpoint_mode: vip
      placement:
        constraints:
          - node.labels.workload == app
      resources:
        reservations:
          cpus: "0.25"
          memory: 256M
        limits:
          cpus: "1.00"
          memory: 768M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 15s
        order: start-first
        failure_action: rollback
        monitor: 60s
        max_failure_ratio: 0
      rollback_config:
        parallelism: 1
        delay: 10s
        order: stop-first
        monitor: 60s
        failure_action: pause

  worker:
    image: ${APP_IMAGE_DIGEST}
    command: ["worker"]
    environment:
      APP_ENV: production
      DATABASE_URL_FILE: /run/secrets/database_url
      REDIS_URL: redis://redis:6379/0
    secrets:
      - database_url
    networks:
      - app-net
    deploy:
      mode: replicated
      replicas: 2
      placement:
        constraints:
          - node.labels.workload == app
      resources:
        reservations:
          cpus: "0.25"
          memory: 256M
        limits:
          cpus: "1.00"
          memory: 512M
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        delay: 15s
        failure_action: rollback
        monitor: 60s

  redis:
    image: redis:7
    command: ["redis-server", "--appendonly", "yes"]
    volumes:
      - redis-data:/data
    networks:
      - app-net
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.stateful == true
      restart_policy:
        condition: any

networks:
  public-net:
    driver: overlay
  app-net:
    driver: overlay
    attachable: false

volumes:
  redis-data:

secrets:
  database_url:
    external: true

configs:
  case_api_config_v1:
    external: true

Production notes:

  • database is represented as external URL secret;
  • Redis is shown as stateful example, but production Redis may also be external/managed;
  • configs are versioned because a Swarm config cannot be modified once created; a change means creating a new config and updating the service;
  • update_config.failure_action: rollback encodes release safety;
  • placement.constraints prevent accidental scheduling to wrong nodes;
  • resource reservations improve scheduling honesty;
  • limits create failure predictability;
  • start-first requires enough capacity to run old and new tasks during update.
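
The capacity requirement of start-first can be estimated with simple arithmetic: during an update, up to parallelism new tasks start before their old counterparts stop, so the scheduler has to place replicas + parallelism reservations at the same time. The helper below is an illustrative sketch; peak_reserved_mb is a made-up name, and the numbers come from the case-api service in stack.yml.

```shell
#!/usr/bin/env sh
# peak_reserved_mb REPLICAS PARALLELISM RESERVATION_MB
# Prints the memory (in MB) that must be reservable across eligible nodes
# while a start-first rolling update is in progress.
peak_reserved_mb() {
  echo $(( ($1 + $2) * $3 ))
}

# case-api: 4 replicas, update parallelism 1, 256M reservation each.
peak_reserved_mb 4 1 256
```

For case-api this prints 1280, compared with 1024 in steady state. The headroom must exist on nodes that satisfy the placement constraint (node.labels.workload == app), not just anywhere in the cluster.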

13. Swarm Deployment Script

#!/usr/bin/env bash
set -euo pipefail

STACK_NAME="reg-case"
APP_IMAGE_DIGEST="${APP_IMAGE_DIGEST:?APP_IMAGE_DIGEST required}"

export APP_IMAGE_DIGEST

# Validate final config after interpolation.
docker stack config -c stack.yml > /tmp/${STACK_NAME}.resolved.yml

# Optional: show image digest and deploy summary.
grep "image:" /tmp/${STACK_NAME}.resolved.yml

# Deploy/update stack. Must run on manager node.
docker stack deploy \
  --with-registry-auth \
  -c /tmp/${STACK_NAME}.resolved.yml \
  "$STACK_NAME"

# Observe convergence.
docker stack services "$STACK_NAME"
docker stack ps "$STACK_NAME" --no-trunc

Post-deploy smoke:

curl -fsS http://platform.example.internal:8080/health
curl -fsS http://platform.example.internal:8080/ready

Convergence check:

docker service ls --filter label=com.docker.stack.namespace=reg-case
docker service ps reg-case_case-api --no-trunc
docker service inspect reg-case_case-api --pretty

Rollback:

docker service rollback reg-case_case-api
docker service rollback reg-case_worker

Remember: rollback is safe only if data/schema changes are compatible.

Application compatibility is part of release engineering, not Docker magic.


14. Migration Strategy

Migration is the most dangerous part of a “simple” container deployment.

Bad pattern:

Every API replica runs migration on startup.

This creates:

  • lock contention;
  • duplicate migration attempts;
  • partial rollout hazards;
  • rollback impossibility;
  • unclear ownership.

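Better pattern: run the migration once, as a separate controlled step with a single owner, before the new tasks roll out.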

Migration rules:

| Rule | Reason |
| --- | --- |
| expand-contract | old and new versions can coexist |
| no destructive pre-deploy change | rollback remains possible |
| one migration owner | avoids concurrency race |
| migration evidence recorded | auditability |
| migration timeout bounded | prevents stuck deploy |
| migration logs preserved | incident diagnosis |

For Swarm, migration can be done as:

  1. external CI job with network access to DB;
  2. temporary one-shot service;
  3. manually controlled maintenance runbook.

Do not let every app task mutate schema on boot.


15. Secrets and Configs Runbook

Create secret:

printf '%s' 'postgres://case:REDACTED@db.internal:5432/case' | \
  docker secret create database_url -

Create versioned config:

docker config create case_api_config_v1 ./config/application.prod.yml

Rotate secret:

printf '%s' 'postgres://case:NEW_REDACTED@db.internal:5432/case' | \
  docker secret create database_url_v2 -

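Update every consuming service to use the new secret (in stack.yml both case-api and worker mount it):

docker service update \
  --secret-rm database_url \
  --secret-add source=database_url_v2,target=database_url \
  reg-case_case-api

docker service update \
  --secret-rm database_url \
  --secret-add source=database_url_v2,target=database_url \
  reg-case_worker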

Then remove old secret after all consumers are migrated:

docker secret rm database_url

Governance invariants:

  • secrets are not passed through ARG;
  • secrets are not stored in image;
  • secrets are not printed in logs;
  • secrets are injected at runtime;
  • config versions are named explicitly;
  • rotation has a rollback path;
  • secret consumers are known.
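
The stack file passes DATABASE_URL_FILE rather than the secret value itself. If the application cannot read the file directly, one common convention is to resolve the _FILE variable at startup. The following is a sketch under that assumption; resolve_file_env is a hypothetical helper, and exporting the value places it in the process environment, which is a trade-off to weigh against reading the file inside the application.

```shell
#!/usr/bin/env sh
# resolve_file_env NAME
# If NAME_FILE is set, exports NAME with the contents of that file.
# Does nothing when NAME_FILE is unset. Never prints the secret value.
resolve_file_env() {
  var_name="$1"
  eval "file_path=\${${var_name}_FILE:-}"
  if [ -n "$file_path" ]; then
    if [ ! -r "$file_path" ]; then
      echo "cannot read secret file for $var_name" >&2
      return 1
    fi
    value="$(cat "$file_path")"
    export "$var_name=$value"
  fi
}

# Example use inside an entrypoint, before the exec line:
#   resolve_file_env DATABASE_URL
```

The error message names the variable but not the path contents, which keeps the secret out of the logs even when resolution fails.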

16. Observability Design

A production-grade container platform needs enough signal to answer:

  • what version is running;
  • where it is running;
  • why it restarted;
  • whether it is healthy;
  • whether it is overloaded;
  • whether logs are flowing;
  • whether deployment is converging;
  • whether rollback happened;
  • what changed before incident.

Minimum telemetry:

| Signal | Source | Example question |
| --- | --- | --- |
| logs | app stdout/stderr, Docker logging driver | what failed? |
| health | container healthcheck, service ps | is the task healthy? |
| events | Docker events | did a task die/restart/update? |
| metrics | app metrics, daemon metrics, host metrics | is it saturated? |
| traces | app instrumentation | where does latency occur? |
| labels | image/service labels | what release is this? |

Useful labels:

deploy:
  labels:
    com.example.service: case-api
    com.example.owner: platform-enforcement
    com.example.release: "1.4.2"
    com.example.git_sha: "abc123"
    com.example.runtime: swarm

Diagnostics command set:

docker service ps reg-case_case-api --no-trunc
docker service logs --since 30m reg-case_case-api
docker service inspect reg-case_case-api --pretty
docker events --since 30m --until "$(date +%s)" --filter type=service --filter type=container
docker node ls
docker node ps <node-name> --no-trunc
docker stats --no-stream

Incident bundle:

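# Keep the directory name in a variable; "cd incident-*" breaks once a second bundle exists.
bundle_dir="incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$bundle_dir"
cd "$bundle_dir"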

docker node ls > nodes.txt
docker service ls > services.txt
docker stack ps reg-case --no-trunc > stack-ps.txt
docker service inspect reg-case_case-api > case-api.inspect.json
docker service logs --since 1h reg-case_case-api > case-api.logs.txt 2>&1
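# --until bounds the query; without it docker events keeps streaming and never returns.
docker events --since 1h --until "$(date +%s)" > docker-events.txt 2>&1 || true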

This is not fancy.

It is enough to start debugging under pressure.


17. Failure Drills

A top engineer does not wait for production to teach failure.

Practice controlled failure.

Drill 1 — Container Crash

Inject app crash.

Expected:

  • service detects task failure;
  • restart policy applies;
  • logs show cause;
  • alert triggers if repeated;
  • deployment not silently marked healthy.

Commands:

docker service ps reg-case_case-api --no-trunc
docker service logs reg-case_case-api --since 10m

Drill 2 — Bad Image Rollout

Deploy image with failing healthcheck.

Expected:

  • rolling update starts;
  • monitor window detects failure;
  • update pauses or rolls back depending on policy;
  • previous version remains serving;
  • evidence records failure.

Commands:

docker service update --image registry.example.internal/platform/reg-case-platform:bad reg-case_case-api
docker service ps reg-case_case-api --no-trunc
docker service inspect reg-case_case-api --pretty

Drill 3 — Node Drain

Drain a worker node.

Expected:

  • node receives no new tasks;
  • existing tasks are rescheduled if replicated;
  • capacity remains sufficient;
  • stateful workloads are handled intentionally.

Commands:

docker node update --availability drain worker-01
docker node ps worker-01 --no-trunc
docker service ps reg-case_case-api --no-trunc

Restore:

docker node update --availability active worker-01

Drill 4 — Secret Rotation

Rotate database URL secret to a test credential.

Expected:

  • only intended services updated;
  • old secret is not removed before successful rollout;
  • rollback can restore previous secret;
  • logs do not reveal secret.

Drill 5 — Resource Pressure

Lower memory limit temporarily.

Expected:

  • OOM/restart visible;
  • metrics show memory pressure;
  • capacity worksheet updated;
  • limit adjusted with evidence.

Drill 6 — Registry Unavailable

Simulate registry pull failure on a node.

Expected:

  • existing tasks keep running;
  • new task scheduling/pull fails visibly;
  • release process blocks until registry recovers;
  • runbook identifies image availability dependency.

18. Release Checklist

Use this before every production deployment.

Source

  • Git commit is known.
  • Dockerfile reviewed.
  • Compose/stack file reviewed.
  • Migration plan reviewed.
  • Rollback compatibility reviewed.
  • New secrets/configs identified.
  • Capacity impact estimated.

Build

  • Build uses BuildKit/buildx.
  • Image is tagged with commit SHA.
  • Image digest is recorded.
  • SBOM generated.
  • Provenance generated where available.
  • Vulnerability/policy scan reviewed.
  • Base image is approved.

Test

  • Unit tests pass.
  • Integration tests pass using image artifact.
  • Migration tests pass.
  • Healthcheck tested.
  • Container starts as non-root.
  • Read-only filesystem compatibility tested.
  • Compose test stack cleaned up.

Deploy

  • Swarm manager quorum healthy.
  • Nodes have enough capacity for start-first update.
  • Stack config rendered and reviewed.
  • Secrets/configs exist.
  • External dependencies reachable.
  • docker stack deploy executed from manager.
  • Convergence monitored.

Verify

  • Service replicas desired = running.
  • Healthchecks passing.
  • Smoke test passing.
  • Logs free from immediate errors.
  • Metrics normal.
  • No unexpected restarts.
  • Release evidence archived.

Rollback

  • Previous image digest known.
  • docker service rollback path known.
  • Data/schema rollback risk understood.
  • Secret/config rollback path known.
  • Incident owner assigned.

19. Architecture Decision Record Template

For container platform decisions, use ADRs.

# ADR: Deploy Regulatory Case API on Docker Swarm

## Status
Accepted

## Context
We need a production runtime for a small internal case-management platform.
The team already operates Docker Engine and requires simple orchestration,
rolling updates, secrets, overlay networking, and service scheduling.

## Decision
Use Docker Swarm with stack deploy for the initial production runtime.
Use Compose for local development and integration testing.
Use immutable image digests for deployment.
Use external managed PostgreSQL.

## Consequences
Positive:
- lower operational complexity than Kubernetes for this scope;
- native Docker workflow from dev to deploy;
- built-in service orchestration and secrets;
- rolling update/rollback supported.

Negative:
- smaller ecosystem than Kubernetes;
- stateful workloads need careful placement/externalization;
- advanced traffic management requires extra components;
- team must maintain Swarm manager quorum and backup.

## Invariants
- no production deployment by mutable tag only;
- all app containers run non-root;
- no Docker socket mounted into app containers;
- secrets injected at runtime;
- healthcheck required for rolling updates;
- rollback compatibility reviewed before deployment.

ADR quality matters because container decisions often look obvious until failure.

A good ADR preserves reasoning.


20. Platform Maturity Rubric

Use this rubric to measure your Docker maturity.

| Level | Description | Observable evidence |
| --- | --- | --- |
| 1: Command user | Can run containers manually | ad hoc docker run, few conventions |
| 2: Compose user | Can define local multi-service apps | compose.yml, volumes, networks, healthchecks |
| 3: Image engineer | Can build reproducible images | multi-stage, cache, non-root, labels, small runtime |
| 4: Runtime engineer | Can reason about lifecycle/resources/security | signals, restart, cgroups, mounts, capabilities |
| 5: Platform engineer | Can operate release lifecycle | CI build, registry, scan, deploy, rollback, observability |
| 6: Systems engineer | Can model failures and govern decisions | threat model, ADRs, incident drills, capacity model |

Top 1% practical target:

You can design, review, debug, and operate containerized systems under uncertainty, explain trade-offs clearly, and leave behind evidence that another engineer can trust.


21. Final 20-Hour Practice Plan

This is the Kaufman-style finishing plan.

Hour 1–2 — Rebuild the Mental Model

Deliverable:

  • one-page diagram of Docker Engine, image, container, network, volume, registry, Compose, Swarm;
  • explain image vs container vs service vs task without notes.

Practice:

docker image inspect <image>
docker container inspect <container>
docker network inspect <network>
docker volume inspect <volume>
docker service inspect <service>

Hour 3–4 — Dockerfile Mastery

Deliverable:

  • multi-stage Dockerfile;
  • build cache optimization;
  • non-root runtime;
  • read-only-compatible runtime.

Practice:

docker buildx build --progress=plain -t app:local .
docker history app:local
docker scout quickview app:local || true

Hour 5–6 — Runtime Lifecycle

Deliverable:

  • app handles SIGTERM;
  • healthcheck works;
  • restart behavior understood;
  • logs are clean.

Practice:

# no --rm here: the stopped container must still exist for inspect
docker run -d --name app-lab app:local
docker stop app-lab
docker inspect app-lab
docker events --since 10m
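
Understanding restart behavior usually starts with reading the exit code that `docker inspect` reports. The following is a minimal sketch, assuming the common convention that a process killed by signal N is reported as exit code 128 + N. The function name `explain_exit_code` is hypothetical and not part of Docker.

```shell
#!/bin/sh
# Hypothetical helper: translate a container exit code into a likely cause.
# Assumes death by signal N is reported as 128 + N.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "clean exit" ;;
    137) echo "killed by SIGKILL (128+9)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      # Non-numeric input fails the test quietly and falls through to "else".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error (exit $code)"
      fi
      ;;
  esac
}
```

A possible use is `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <container>)"`. A 137 after `docker stop` typically means the process did not exit within the stop timeout; checking `.State.OOMKilled` in the same inspect output helps separate that case from an out-of-memory kill.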

Hour 7–8 — Networking and Storage

Deliverable:

  • service-to-service DNS works;
  • no accidental host port exposure;
  • volumes and bind mounts understood;
  • backup/restore tested.

Practice:

docker network create lab-net
docker run --network lab-net --name a ...
docker run --network lab-net --name b ...
docker volume create lab-data

Hour 9–10 — Compose Dev Stack

Deliverable:

  • compose.yml base;
  • compose.dev.yml override;
  • health-gated startup;
  • profile for debug/seed tools.

Practice:

docker compose config
docker compose up --build
docker compose ps
docker compose logs -f

Hour 11–12 — Compose Test Harness

Deliverable:

  • ephemeral integration test stack;
  • unique project name;
  • cleanup trap;
  • diagnostics on failure.

Practice:

docker compose -f compose.yml -f compose.test.yml up \
  --abort-on-container-exit \
  --exit-code-from test-runner
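
The remaining deliverables (unique project name, cleanup trap, diagnostics on failure) can be wrapped around that command. A minimal sketch, assuming the file names and the `test-runner` service from the command above; the `it-` prefix, the run id argument, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of an ephemeral integration run with isolation, cleanup, and diagnostics.

# Compose project names allow only lowercase letters, digits, dashes, underscores.
project_name() {
  printf 'it-%s' "$1" | tr 'A-Z' 'a-z' | tr -c 'a-z0-9_-' '-'
}

# Usage: run_integration_tests <run-id>   (for example a CI build number)
run_integration_tests() {
  project="$(project_name "$1")"
  # Subshell: the EXIT trap is scoped to this test run only.
  (
    compose() {
      docker compose -p "$project" -f compose.yml -f compose.test.yml "$@"
    }
    # Always tear down containers, networks, and volumes of this project.
    trap 'compose down --volumes --remove-orphans' EXIT
    # --exit-code-from implies --abort-on-container-exit.
    if ! compose up --build --exit-code-from test-runner; then
      # Capture state before the trap removes everything.
      compose ps --all
      compose logs --no-color --tail 200
      exit 1
    fi
  )
}
```

Because every run gets its own project name, parallel CI jobs on the same host do not share networks or volumes, and the trap removes them even when the tests fail.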

Hour 13–14 — Supply Chain Evidence

Deliverable:

  • image pushed to registry;
  • digest captured;
  • SBOM/provenance generated;
  • vulnerability report reviewed.

Practice:

docker buildx build --sbom=true --provenance=true --push ...
docker buildx imagetools inspect <image>
docker scout sbom <image> || true
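
"Digest captured" means the release record holds an immutable `repo@sha256:...` reference rather than a tag. The sketch below assumes `jq` is installed and the registry is reachable; the function names and the example reference are hypothetical.

```shell
#!/bin/sh
# Sketch: turn a pushed tag into an immutable digest reference.

# Strip the tag (if any) from a reference and append the digest.
# registry:5000/team/app:1.4.2 + sha256:abc -> registry:5000/team/app@sha256:abc
pin_reference() {
  ref="$1"; digest="$2"
  last="${ref##*/}"            # final path component, the only place a tag can be
  case "$last" in
    *:*) repo="${ref%:*}" ;;   # drop ":tag"
    *)   repo="$ref" ;;
  esac
  echo "$repo@$digest"
}

# Ask the registry for the manifest digest behind a tag (requires jq).
resolve_digest() {
  docker buildx imagetools inspect "$1" --format '{{json .Manifest}}' | jq -r '.digest'
}
```

For example, `pin_reference "$IMAGE" "$(resolve_digest "$IMAGE")"` yields the reference to store in the release evidence and to pass to the deploy step, so the deployed artifact cannot change if the tag is moved later.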

Hour 15–16 — Swarm Stack Deploy

Deliverable:

  • swarm initialized;
  • overlay network works;
  • stack deployed;
  • services converge;
  • placement constraints tested.

Practice:

docker swarm init
docker stack deploy -c stack.yml reg-case
docker stack services reg-case
docker stack ps reg-case --no-trunc
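
"Services converge" can be checked in a script by comparing running and desired replica counts instead of reading the table by eye. A minimal sketch, assuming the `reg-case` stack name from the commands above; the function names, the 120-second default timeout, and the 2-second poll interval are hypothetical choices.

```shell
#!/bin/sh
# Sketch: wait until every service in a stack reports running == desired replicas.

# Accepts a Replicas value such as "3/3" or "1/1 (max 1 per node)".
replicas_converged() {
  counts="${1%% *}"            # drop any trailing annotation
  running="${counts%%/*}"
  desired="${counts##*/}"
  [ -n "$running" ] && [ "$running" = "$desired" ]
}

# Usage: wait_for_stack <stack> [timeout-seconds]
wait_for_stack() {
  stack="$1"
  deadline=$(( $(date +%s) + ${2:-120} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    total=0; pending=0
    for r in $(docker stack services "$stack" --format '{{.Replicas}}' | cut -d' ' -f1); do
      total=$((total + 1))
      replicas_converged "$r" || pending=$((pending + 1))
    done
    # Require at least one service so a missing stack is not reported as converged.
    if [ "$total" -gt 0 ] && [ "$pending" -eq 0 ]; then
      return 0
    fi
    sleep 2
  done
  return 1
}
```

`wait_for_stack reg-case 180` returns non-zero on timeout, which is the point to fall back to `docker stack ps reg-case --no-trunc` and read the task error column. Matching replica counts only show that tasks are running; application health still depends on the healthcheck.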

Hour 17 — Rolling Update and Rollback

Deliverable:

  • successful rolling update;
  • failed rollout triggers rollback/pause;
  • previous digest known.

Practice:

docker service update --image <new-image> reg-case_case-api
docker service ps reg-case_case-api --no-trunc
docker service rollback reg-case_case-api
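
To make a failed rollout trigger rollback automatically instead of relying on a manual `docker service rollback`, the update policy can be passed with the update itself. This is a sketch: the flags are standard `docker service update` options, while the tuning values (parallelism 1, 30s monitor window) and the function names are assumptions to adjust.

```shell
#!/bin/sh
# Sketch: health-gated rolling update that rolls back on failure.

# Print the docker CLI arguments so they can be reviewed before running.
update_args() {
  image="$1"; service="$2"
  echo "service update" \
    "--image $image" \
    "--update-parallelism 1" \
    "--update-order start-first" \
    "--update-monitor 30s" \
    "--update-failure-action rollback" \
    "$service"
}

# Usage: safe_update <image-reference> <service>
safe_update() {
  # Word splitting is intentional: update_args emits space-separated arguments.
  # shellcheck disable=SC2046
  docker $(update_args "$1" "$2")
}
```

`start-first` starts the new task before stopping the old one, so it needs spare capacity on the nodes. The rollback decision relies on the image or service healthcheck; without one, a task counts as healthy as soon as its process starts. Passing the image by digest keeps "previous digest known" true for the next rollback.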

Hour 18 — Security Hardening Review

Deliverable:

  • no root container unless justified;
  • no Docker socket mount;
  • caps dropped;
  • secrets externalized;
  • read-only filesystem tested.

Practice:

docker inspect <container> | jq '.[0].Config.User'
docker inspect <container> | jq '.[0].HostConfig.CapDrop'
docker inspect <container> | jq '.[0].HostConfig.ReadonlyRootfs'
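
The same checks can be looped over every running container on a node. A minimal sketch that covers only the non-root check, assuming the `Config.User` string is a sufficient signal; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: flag containers whose configured user resolves to root.

# An empty value means neither the image nor the run command set a user,
# so the process runs as root.
is_root_user() {
  user="${1%%:*}"     # drop an optional ":group" suffix
  case "$user" in
    ""|0|root) return 0 ;;
    *)         return 1 ;;
  esac
}

audit_running_containers() {
  for id in $(docker ps -q); do
    name="$(docker inspect --format '{{.Name}}' "$id")"
    user="$(docker inspect --format '{{.Config.User}}' "$id")"
    if is_root_user "$user"; then
      echo "ROOT  $name (user='${user}')"
    fi
  done
}
```

Running `audit_running_containers` on each node gives a short list to justify or fix. It does not detect user namespace remapping or a process that changes its user after start, so treat an empty result as a starting point rather than proof.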

Hour 19 — Observability and Incident Drill

Deliverable:

  • collect logs/events/inspect output;
  • identify restart reason;
  • correlate deployment to failure;
  • write incident note.

Practice:

docker service logs --since 30m reg-case_case-api
docker events --since 30m
docker stats --no-stream
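
During a drill it helps to snapshot all of this into one directory before anything is restarted. A sketch under stated assumptions: the directory layout and function names are hypothetical, and `--until` is added to `docker events` because without it the command keeps streaming instead of returning.

```shell
#!/bin/sh
# Sketch: snapshot diagnostics for one service into a timestamped directory.

# Hypothetical layout: <base>/<service>-<timestamp>
incident_dir() {
  echo "$1/$2-$3"
}

# Usage: collect_diagnostics <service> [since]
collect_diagnostics() {
  service="$1"; since="${2:-30m}"
  out="$(incident_dir "${TMPDIR:-/tmp}" "$service" "$(date -u +%Y%m%dT%H%M%SZ)")"
  mkdir -p "$out"
  # Each command is best-effort so one failure does not lose the rest.
  docker service ps "$service" --no-trunc > "$out/tasks.txt" 2>&1 || true
  docker service inspect "$service" > "$out/inspect.json" 2>&1 || true
  docker service logs --since "$since" --timestamps "$service" > "$out/logs.txt" 2>&1 || true
  docker events --since "$since" --until "$(date +%s)" > "$out/events.txt" 2>&1 || true
  docker stats --no-stream > "$out/stats.txt" 2>&1 || true
  echo "$out"
}
```

`collect_diagnostics reg-case_case-api 30m` prints the directory it wrote, which can be attached to the incident note. `docker stats` and `docker events` report on the node where they run, so on a multi-node swarm the node-level parts need to be collected on the node that hosted the failing task.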

Hour 20 — Final Review and Teachback

Deliverable:

  • explain platform lifecycle end-to-end;
  • produce ADR;
  • produce release checklist;
  • identify three remaining risks;
  • teach another engineer the design.

This final hour matters.

If you cannot teach it, you do not own it yet.


22. Common Capstone Failure Modes

| Failure | Typical cause | Better response |
| --- | --- | --- |
| Works on laptop, fails in Swarm | hidden host dependency | use service DNS, overlay networks, external configs |
| Rolling update causes outage | no health gate/capacity | use healthcheck, start-first, enough headroom |
| Rollback fails | destructive migration | expand-contract migration |
| Secret appears in image history | passed via ARG/COPY | use BuildKit secrets/runtime secrets |
| Image pull fails on worker | registry/auth issue | use registry auth, preflight, digest evidence |
| App cannot write temp files | read-only FS without tmpfs | declare writable path explicitly |
| CI tests flaky | startup race | health-gated dependency + diagnostics |
| Disk full | logs/images/volumes unbounded | log rotation, prune policy, monitoring |
| Container runs as root | default base image behavior | explicit USER, ownership, permission test |
| Stateful service moved incorrectly | volume locality ignored | placement constraints/external storage |

23. What “Done” Means for This Series

You are done with this series when you can:

  1. build a secure, reproducible image;
  2. explain every Dockerfile instruction you used;
  3. run a realistic local Compose stack;
  4. debug DNS, port, volume, permission, health, and restart issues;
  5. design a test harness with ephemeral Compose;
  6. promote images through registry using tags and digests;
  7. produce SBOM/provenance/scan evidence;
  8. deploy a Swarm stack;
  9. operate rolling update and rollback;
  10. model stateful workload risk;
  11. rotate secrets/configs safely;
  12. collect incident diagnostics;
  13. write an ADR defending the architecture;
  14. teach the entire lifecycle clearly.

That is the difference between “I know Docker” and “I can operate containerized systems”.


24. Final Mental Model

Containerization is not packaging.

Containerization is boundary design.

Dockerfile defines artifact boundary.

Image defines distribution boundary.

Container defines process boundary.

Network defines communication boundary.

Volume defines state boundary.

Compose defines local application boundary.

Swarm defines cluster desired-state boundary.

Secrets/configs define trust boundary.

Observability defines feedback boundary.

Release evidence defines accountability boundary.

A top engineer sees those boundaries, designs them intentionally, and verifies them under failure.


25. References

Primary references for this capstone:

  • Docker Docs — Get Docker / Docker overview: https://docs.docker.com/get-started/get-docker/
  • Docker Docs — Compose profiles: https://docs.docker.com/compose/how-tos/profiles/
  • Docker Docs — Compose file reference: https://docs.docker.com/reference/compose-file/
  • Docker Docs — Compose Deploy Specification: https://docs.docker.com/reference/compose-file/deploy/
  • Docker Docs — Deploy a stack to a swarm: https://docs.docker.com/engine/swarm/stack-deploy/
  • Docker Docs — Swarm services: https://docs.docker.com/engine/swarm/services/
  • Docker Docs — Swarm networking: https://docs.docker.com/engine/swarm/networking/
  • Docker Docs — Rolling updates: https://docs.docker.com/engine/swarm/swarm-tutorial/rolling-update/
  • Docker CLI — docker stack deploy: https://docs.docker.com/reference/cli/docker/stack/deploy/
  • Docker CLI — docker service rollback: https://docs.docker.com/reference/cli/docker/service/rollback/
  • Docker Build — SBOM/provenance attestations: https://docs.docker.com/build/ci/github-actions/attestations/
  • Docker Scout — SBOM: https://docs.docker.com/scout/how-tos/view-create-sboms/

Series Complete

This is Part 035, the final part of the series:

Learn Docker, Containerization, Docker Compose, Docker Swarm

Total completed: 35 / 35 parts.
