
Swarm Secrets, Configs, Volumes, and Stateful Service Design

Learn Docker, Containerization, Docker Compose, Docker Swarm - Part 029

Deep dive Docker Swarm secrets, configs, volumes, stateful services, data placement, backup, restore, and operational governance.


Part 029 — Swarm Secrets, Configs, Volumes, and Stateful Service Design

Target for this part: we can design Swarm services that correctly separate artifact, configuration, secret, and durable state. We do not merely know the syntax of secrets: or volumes:, we understand the failure modes: secret leak, config drift, node-local volume, misplaced task, failed backup, untested restore, and a database that appears to run but is not recoverable.

In Part 028 we covered the stack as a release unit. Now we dissect the most sensitive components in a stack: secret, config, and state.

Core mental model:

The image is the artifact. Config is behavior. A secret is a trust boundary. A volume is a durability boundary. A stateful service is the combination of placement, persistence, recovery, and ownership.


1. Kaufman Skill Deconstruction

To master this topic, break the skill into the following subskills:

| Subskill | What to Master | Evidence of Mastery |
|---|---|---|
| Secret model | Can explain secret lifecycle, mount path, permission, immutability, rotation | Never puts a password/API key into an image, Git, or a carelessly handled environment variable |
| Config model | Can separate non-sensitive config from image and secret | Can change config without rebuilding the image |
| Volume semantics | Can distinguish container writable layer, named volume, bind mount, external volume | Does not lose data when a task is rescheduled |
| Stateful scheduling | Can design placement constraints, node labels, drain behavior, and failure recovery | Can explain which node holds the data and how recovery happens |
| Backup/restore | Can write a runbook for backup, restore, integrity verification, RPO/RTO | Restore has been tested, not just a backup created |
| Governance | Can define naming, versioning, ownership, retention, and incident rules | No orphan secret/config/volume without an owner |

Kaufman-style focus:

  1. Deconstruct: separate image, config, secret, state, placement, backup.
  2. Learn enough to self-correct: understand the inspect and service ps commands, mount paths, and task reschedule behavior.
  3. Remove friction: build a stack template, secret rotation script, backup script, and failure lab.
  4. Practice deliberately: simulate secret rotation, config rollout, node drain, volume restore, and a failed database migration.

2. Four Runtime Data Categories

Containerized systems often become messy because all data is treated the same way. In a mature design, we distinguish four categories:

| Category | Example | Should Live In Image? | Should Live In Secret? | Should Live In Config? | Should Live In Volume? |
|---|---|---|---|---|---|
| Application artifact | JAR, binary, static asset | Yes | No | No | No |
| Non-sensitive config | log level, feature flag default, endpoint name | Sometimes, default only | No | Yes | Rarely |
| Sensitive material | DB password, private key, token | No | Yes | No | No, unless encrypted store with strict policy |
| Durable business state | database data, uploaded files, queue data | No | No | No | Yes / external state service |

Rule:

If data must change between environments, do not embed it as an image constant unless it is a safe default. If data proves identity or grants access, treat it as a secret. If data must survive task replacement, treat it as state.


3. Swarm Object Model for Secrets, Configs, and Volumes

Swarm has a different object model for each kind of data.

Key distinction:

| Object | Stored In Swarm Control Plane? | Mounted Into Container? | Sensitive? | Mutable? | Data Follows Task? |
|---|---|---|---|---|---|
| Secret | Yes | Yes, as file | Yes | Effectively immutable | Yes, secret follows service task |
| Config | Yes | Yes, as file | No | Effectively immutable | Yes, config follows service task |
| Volume | No, not data itself | Yes | Depends | Mutable | Only if storage backend supports it |

Important:

Secrets and configs follow the service: Swarm delivers them to whichever node runs the task. Volume data usually does not magically follow a task across nodes.

This is the most important stateful-service lesson in Swarm.


4. Swarm Secrets: Mental Model

Swarm secrets are designed for sensitive data needed by services at runtime.

Examples:

  • database password
  • TLS private key
  • API token
  • signing key
  • encryption key material
  • service credential

A secret should answer:

  1. Who owns it?
  2. Which services may read it?
  3. Where is it mounted?
  4. How is it rotated?
  5. How is access revoked?
  6. How do we know it is not leaked?

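Docker's model: a secret is sent to a manager, stored encrypted in the swarm's Raft log, and delivered only to nodes that run tasks of a service granted access to it. Inside a Linux container it appears as a file on an in-memory filesystem under /run/secrets/<secret_name>.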

Operational invariant:

A service only receives the secrets explicitly granted to that service.

Do not mount all secrets into all services. Secret access should be minimal and auditable.


5. Creating and Inspecting Secrets

Create secret from file:

printf '%s' 'super-secret-password' > db_password.txt

docker secret create orders_db_password db_password.txt

rm db_password.txt

Create secret from stdin:

printf '%s' 'super-secret-password' | docker secret create orders_db_password -

List:

docker secret ls

Inspect metadata:

docker secret inspect orders_db_password

The secret value is not shown by inspect. This is expected.


6. Mounting Secrets Into Services

CLI example:

docker service create \
  --name orders-api \
  --secret orders_db_password \
  registry.example.com/orders-api:2026.07.01

Inside the container:

cat /run/secrets/orders_db_password

Stack file example:

services:
  api:
    image: registry.example.com/orders-api:2026.07.01
    secrets:
      - orders_db_password

secrets:
  orders_db_password:
    external: true

Custom target:

services:
  api:
    image: registry.example.com/orders-api:2026.07.01
    secrets:
      - source: orders_db_password
        target: db_password
        uid: "10001"
        gid: "10001"
        mode: 0400

secrets:
  orders_db_password:
    external: true

The application should read:

/run/secrets/db_password

not:

DB_PASSWORD=plain-text-value

7. Why Environment Variables Are a Weak Secret Boundary

Environment variables are convenient but weak for secrets.

Potential leakage surfaces:

  • process inspection inside container
  • crash dumps
  • logs that print full environment
  • debug endpoints
  • support bundles
  • shell history
  • CI logs
  • Compose files committed to Git
  • application framework diagnostics

A pragmatic rule:

Use environment variables for non-sensitive runtime switches. Use secret files for sensitive material.

Example safe-ish env:

environment:
  LOG_LEVEL: "INFO"
  FEATURE_X_ENABLED: "false"

Example unsafe env:

environment:
  DB_PASSWORD: "prod-password"
  JWT_PRIVATE_KEY: "-----BEGIN PRIVATE KEY-----..."

8. Secret Immutability and Versioning

Secrets are effectively immutable. Treat them like versioned objects.

Bad naming:

db_password

Better naming:

orders_db_password_2026_07_v1
orders_db_password_2026_07_v2
orders_api_jwt_private_key_2026_q3

Why version names matter:

  • clear rotation history
  • rollback clarity
  • audit evidence
  • safe staged rollout
  • no ambiguity over which service uses which credential

Naming convention:

<system>_<purpose>_<scope>_<yyyy_mm>_v<n>

Examples:

orders_postgres_password_primary_2026_07_v1
orders_api_stripe_token_2026_07_v3
orders_tls_private_key_public_2026_q3_v1
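
A naming convention only helps if it is enforced. The following is a minimal sketch, assuming names are lowercase with underscore separators and that the date part is either a month (01 to 12) or a quarter (q1 to q4), as in the examples above. The function name is_valid_secret_name is hypothetical.

```shell
# Hypothetical helper: succeeds when a name follows the convention above.
# Accepts a month (01-12) or a quarter (q1-q4) in the date part.
is_valid_secret_name() {
  printf '%s\n' "$1" |
    grep -Eq '^[a-z0-9]+(_[a-z0-9]+)+_[0-9]{4}_(0[1-9]|1[0-2]|q[1-4])_v[1-9][0-9]*$'
}

# Usage: reject a name before it ever reaches `docker secret create`.
for name in orders_postgres_password_primary_2026_07_v1 db_password; do
  if is_valid_secret_name "$name"; then
    echo "ok:       $name"
  else
    echo "rejected: $name"
  fi
done
```

A check like this fits naturally in a CI step or a wrapper script, so unversioned names are caught before they become production objects.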

9. Secret Rotation Pattern

Rotation is not “edit secret”. Rotation is controlled replacement.

9.1 Simple Password Rotation

Create new secret:

printf '%s' 'new-password' | docker secret create orders_db_password_2026_07_v2 -

Update service:

docker service update \
  --secret-add source=orders_db_password_2026_07_v2,target=db_password_new \
  orders_api

After the database credential and application config are updated, detach the old secret from the service:

docker service update \
  --secret-rm orders_db_password_2026_07_v1 \
  orders_api

Delete the now-unused secret object (only after the rollback window has closed):

docker secret rm orders_db_password_2026_07_v1

9.2 Rotation With Zero-Downtime Requirement

For zero-downtime rotation, the dependent system must support overlap:

  • DB accepts old and new credential temporarily, or
  • application supports dual credential lookup, or
  • token issuer supports multiple active keys, or
  • load balancer drains old tasks before revocation.

Without overlap, rotation becomes a downtime event.

9.3 Rotation Failure Mode

Common broken sequence:

  1. Update app to use new secret.
  2. Remove old secret.
  3. Forget to update database password.
  4. All new tasks fail healthcheck.
  5. Rollback cannot restore because old secret was removed.

Safer sequence:

  1. Create new credential in backend.
  2. Create new Swarm secret.
  3. Update service to use new secret.
  4. Observe health and auth success.
  5. Revoke old credential.
  6. Remove old secret only after rollback window closes.
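
Steps 2 and 3 of the safer sequence can be sketched as a small script. This is an illustration under assumptions, not a complete rotation tool: the function names are hypothetical, the backend credential is assumed to exist already (step 1), and the new secret is swapped in under the same target file name so the application path does not change. The old secret object is deliberately left in place for rollback.

```shell
# next_secret_version NAME -> prints NAME with its trailing _v<n> incremented.
next_secret_version() {
  printf '%s\n' "$1" | grep -Eq '_v[1-9][0-9]*$' || {
    echo "name has no _v<n> suffix: $1" >&2
    return 1
  }
  local base="${1%_v*}" n="${1##*_v}"
  printf '%s_v%s\n' "$base" "$((n + 1))"
}

# rotate_service_secret SERVICE OLD_SECRET TARGET
# Reads the new value from stdin, creates the next secret version, and
# replaces the old secret on the service in a single update.
# It does NOT delete the old secret object: that waits for the rollback window.
rotate_service_secret() {
  local service="$1" old="$2" target="$3" new
  new="$(next_secret_version "$old")" || return 1
  docker secret create "$new" - || return 1
  docker service update \
    --secret-rm "$old" \
    --secret-add "source=$new,target=$target" \
    "$service"
}

# Usage (not executed here):
#   printf '%s' "$NEW_PASSWORD" |
#     rotate_service_secret orders_api orders_db_password_2026_07_v1 db_password
```

Health observation, credential revocation, and final cleanup (steps 4 to 6) stay manual on purpose, because they depend on evidence a script cannot judge on its own.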

10. Swarm Configs: Mental Model

Configs are for non-sensitive configuration data mounted into services.

Examples:

  • Nginx config
  • application YAML config without secret values
  • feature flag bootstrap file
  • logback/log4j config
  • Prometheus scrape config
  • static routing table
  • trusted public certificate bundle

Configs solve this problem:

We want generic immutable images, but runtime behavior differs by environment.

Docker config object keeps image generic without bind-mounting host files.


11. Creating and Mounting Configs

Create config:

docker config create orders_api_config_2026_07_v1 application-prod.yml

Inspect:

docker config inspect orders_api_config_2026_07_v1

Service usage:

services:
  api:
    image: registry.example.com/orders-api:2026.07.01
    configs:
      - source: orders_api_config_2026_07_v1
        target: /app/config/application.yml
        uid: "10001"
        gid: "10001"
        mode: 0444

configs:
  orders_api_config_2026_07_v1:
    external: true

Application launch:

services:
  api:
    command:
      - "java"
      - "-jar"
      - "/app/orders-api.jar"
      - "--spring.config.location=file:/app/config/application.yml"

12. Config Immutability and Rollout

Config should be versioned like code.

Bad:

application-prod.yml

Better:

orders_api_application_prod_2026_07_01_sha7f3a9c

When config changes:

  1. Create new config object.
  2. Update service to use new config.
  3. Allow rolling update.
  4. Observe health.
  5. Remove old config after rollback window.

Example:

docker config create orders_api_config_2026_07_v2 application-prod.yml

docker service update \
  --config-rm orders_api_config_2026_07_v1 \
  --config-add source=orders_api_config_2026_07_v2,target=/app/config/application.yml \
  orders_api

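Changing the config reference changes the service spec, so Swarm replaces the tasks. The service's update policy (parallelism, delay, failure action) controls how that replacement is rolled out.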
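
One way to get names in the style of orders_api_application_prod_2026_07_01_sha7f3a9c is to derive the suffix from the file content. The sketch below assumes GNU sha256sum is available (on macOS, shasum -a 256 is the usual substitute) and uses a hypothetical helper name, config_name_for.

```shell
# config_name_for PREFIX FILE -> prints PREFIX_sha<first 7 hex chars of SHA-256>.
# Same content always gives the same name; changed content gives a new name.
config_name_for() {
  local prefix="$1" file="$2" digest
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
  digest="$(sha256sum "$file" | cut -c1-7)"
  printf '%s_sha%s\n' "$prefix" "$digest"
}

# Usage (not executed here):
#   name="$(config_name_for orders_api_application_prod_2026_07_01 application-prod.yml)"
#   docker config create "$name" application-prod.yml
```

Because the name changes whenever the content changes, a stale reference in the stack file shows up as a missing config instead of silently mounting old settings.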


13. Config vs Secret Decision Matrix

| Data | Config | Secret | Reason |
|---|---|---|---|
| log level | Yes | No | Not sensitive |
| DB host | Yes | Usually no | Infrastructure metadata |
| DB password | No | Yes | Credential |
| public certificate | Yes | No | Public trust data |
| private key | No | Yes | Identity material |
| OAuth client id | Usually config | Depends | Public identifier in many systems |
| OAuth client secret | No | Yes | Credential |
| feature flag default | Yes | No | Runtime behavior |
| license key | No | Yes | Often confidential/commercial secret |

Rule:

If disclosure grants access, impersonation, privilege, or commercial loss, treat it as secret.


14. Volumes in Swarm: The Dangerous Mental Model

The dangerous assumption:

“A named volume in Swarm is cluster-wide.”

Usually false.

A named volume created with the default local driver is local to a node.

If a database task originally runs on Node A and writes to orders_pgdata, then later reschedules to Node B with the same volume name, it may get a different empty local volume.

This is the classic Swarm stateful-service trap.


15. Volume Types and Swarm Implications

| Storage Type | Scope | Good For | Risk |
|---|---|---|---|
| Container writable layer | task/container-local | ephemeral files | lost on task replacement |
| Local named volume | node-local | single-node durable state | task reschedule may lose access |
| Bind mount | node-local host path | explicit host integration | brittle path/permission/security |
| tmpfs | memory/ephemeral | cache, sockets, temp secret-like files | data lost on restart |
| External volume driver | backend-dependent | durable/movable state | driver complexity and split-brain risk |
| Managed external DB/storage | service-level | production durable state | operational dependency outside Swarm |

Top 1% engineer behavior:

Never say “we have a volume” as proof of durability. Ask: where is the data physically, how is it replicated, how is it backed up, and what happens during reschedule?


16. Stateful Service Categories

Not every stateful service has the same risk.

| Category | Example | Swarm Fit | Notes |
|---|---|---|---|
| Stateless | API, worker, web | Excellent | scale horizontally |
| Soft state | cache, local temp index | Good | rebuildable; no backup needed |
| Durable single-primary | PostgreSQL single node | Risky but possible | requires pinning, backup, restore runbook |
| Durable clustered | Kafka, Elasticsearch, database cluster | Advanced / risky | requires protocol-level clustering and storage design |
| External managed state | RDS, Cloud SQL, S3, managed queue | Often best | Swarm runs compute, external service owns durability |

Practical principle:

Use Swarm for compute orchestration. Use dedicated state systems for critical durable business data unless your team owns storage operations deeply.


17. Single-Primary Database on Swarm

Sometimes acceptable:

  • internal tool
  • low-to-medium criticality system
  • small deployment
  • strong backup discipline
  • one-node or pinned-node design
  • clear recovery expectation

Usually risky for:

  • high-volume transactional systems
  • strict RPO/RTO
  • multi-node HA requirement
  • regulated evidence systems without tested restore
  • workloads requiring automatic failover

A minimal pinned PostgreSQL example:

version: "3.9"

services:
  postgres:
    image: postgres:16.4
    environment:
      POSTGRES_DB: orders
      POSTGRES_USER: orders_app
      POSTGRES_PASSWORD_FILE: /run/secrets/orders_postgres_password_2026_07_v1
    secrets:
      - orders_postgres_password_2026_07_v1
    volumes:
      - orders_pgdata:/var/lib/postgresql/data
    networks:
      - app
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.orders.db == true
      restart_policy:
        condition: on-failure
      update_config:
        order: stop-first
        failure_action: rollback
      rollback_config:
        order: stop-first

secrets:
  orders_postgres_password_2026_07_v1:
    external: true

volumes:
  orders_pgdata:
    driver: local

networks:
  app:
    driver: overlay

Node label:

docker node update --label-add orders.db=true worker-1

Critical caveat:

This pins scheduling, not high availability. If worker-1 dies, the data is still on worker-1 unless storage/backend strategy says otherwise.


18. Why replicas: 3 Does Not Make a Database HA

Bad example:

services:
  postgres:
    image: postgres:16.4
    deploy:
      replicas: 3
    volumes:
      - pgdata:/var/lib/postgresql/data

This is wrong for a normal PostgreSQL image because three independent database processes do not automatically form a safe cluster.

Potential outcomes:

  • independent databases with divergent data
  • three nodes writing to separate local volumes
  • corrupted shared filesystem if using unsafe shared volume
  • client sees inconsistent behavior
  • backups are meaningless because there are multiple truths

Correct HA database requires a database-level clustering/replication protocol:

  • primary/replica replication
  • leader election
  • fencing
  • WAL/archive strategy
  • split-brain prevention
  • backup consistency
  • failover process

Swarm can schedule processes. It does not magically turn stateful software into a distributed database.


19. Placement Strategy for Stateful Services

Stateful service scheduling must be explicit.

19.1 Node Labels

docker node update --label-add storage=ssd worker-1
docker node update --label-add zone=az-a worker-1
docker node update --label-add orders.pgdata=true worker-1

Stack:

deploy:
  placement:
    constraints:
      - node.labels.orders.pgdata == true

19.2 Avoid Scheduling on Managers

deploy:
  placement:
    constraints:
      - node.role == worker

19.3 Resource Reservation

deploy:
  resources:
    reservations:
      cpus: "1.0"
      memory: 2G
    limits:
      cpus: "2.0"
      memory: 4G

For databases, memory limit needs careful tuning. If memory limit conflicts with DB buffer/cache expectations, performance and OOM behavior can become unstable.


20. Drain Behavior and Stateful Risk

docker node update --availability drain <node> tells Swarm to stop tasks on that node and reschedule them elsewhere.

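For a stateless service, this is normal.

For a stateful local-volume service, this can be dangerous: the replacement task may start on another node against a new, empty volume, or it may stay pending because no other node satisfies the placement constraint.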

Stateful maintenance runbook must include:

  1. Identify local-volume services on node.
  2. Confirm backup freshness.
  3. Stop or migrate service intentionally.
  4. Avoid accidental reschedule to node without data.
  5. Verify data after restart.

Command to inspect services on node:

docker node ps worker-1

Command to inspect service tasks:

docker service ps orders_postgres --no-trunc
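
Step 1 of the runbook above (identify local-volume services on the node) can be partly automated. This is a rough sketch, assuming task names have the usual service.slot or service.nodeid form; both function names are hypothetical. It lists services with volume-type mounts but does not check which volume driver backs them, so the output is a review list, not a verdict.

```shell
# service_of_task TASK_NAME -> strips the trailing ".<slot or node id>".
service_of_task() {
  printf '%s\n' "${1%.*}"
}

# list_volume_backed_services NODE
# Prints "service volume" pairs for running tasks on NODE whose service spec
# mounts a named volume. Run against a manager.
list_volume_backed_services() {
  local node="$1" task service
  docker node ps "$node" --filter desired-state=running --format '{{.Name}}' |
    while IFS= read -r task; do
      service="$(service_of_task "$task")"
      docker service inspect "$service" \
        --format '{{range .Spec.TaskTemplate.ContainerSpec.Mounts}}{{.Type}} {{.Source}}{{println}}{{end}}' |
        awk -v svc="$service" '$1 == "volume" { print svc, $2 }'
    done | sort -u
}

# Usage (not executed here):
#   list_volume_backed_services worker-1
```

Any service in that list needs a deliberate decision before the node is drained.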

21. External Volume Drivers and Shared Storage

External volume drivers can make volumes available across nodes depending on backend.

But they introduce their own risks:

  • network storage latency
  • filesystem semantics mismatch
  • lock/split-brain behavior
  • backup consistency
  • driver availability
  • credential management
  • mount failure during task start
  • IO performance unpredictability

Decision question:

Does the storage backend provide the consistency, locking, durability, and performance semantics required by the application?

Do not assume every shared filesystem is safe for every database.


22. Pattern: Swarm Compute + External Managed State

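For serious production systems, the clean pattern is often to run stateless services on Swarm and keep durable state in externally managed services (database, object storage, queue).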

Benefits:

  • Swarm handles stateless compute.
  • Managed DB handles durability, replication, backup, failover.
  • Object storage handles file durability.
  • Queue service handles message durability.
  • Platform team avoids reinventing storage operations.

Trade-off:

  • external service dependency
  • network/security configuration
  • cloud/vendor cost
  • cross-environment parity concerns

23. Pattern: Local Single-Node Stateful Service With Strong Runbook

Acceptable for small systems if documented:

services:
  minio:
    image: minio/minio:RELEASE.2026-01-01T00-00-00Z
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER_FILE: /run/secrets/minio_root_user
      MINIO_ROOT_PASSWORD_FILE: /run/secrets/minio_root_password
    secrets:
      - minio_root_user
      - minio_root_password
    volumes:
      - minio_data:/data
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.minio.data == true

volumes:
  minio_data:
    driver: local

secrets:
  minio_root_user:
    external: true
  minio_root_password:
    external: true

Runbook must state:

  • node label owner
  • backup schedule
  • restore test cadence
  • acceptable downtime
  • how to replace node
  • how to migrate data
  • how to rotate secrets

24. Backup and Restore Mental Model

Backup is not the goal. Restore is the goal.

Questions:

  1. What is backed up?
  2. When is it backed up?
  3. Where is it stored?
  4. Is it encrypted?
  5. Who can restore it?
  6. How long does restore take?
  7. How much data can be lost?
  8. When was restore last tested?
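
Question 7 (how much data can be lost) is the RPO, and it can be checked mechanically. A minimal sketch, assuming timestamps are Unix epoch seconds; the function name backup_within_rpo and the file name in the usage comment are hypothetical.

```shell
# backup_within_rpo BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
# Succeeds when the newest backup is no older than the RPO.
# A backup timestamp in the future is treated as a failure (clock problem).
backup_within_rpo() {
  local age=$(( $2 - $1 ))
  [ "$age" -ge 0 ] && [ "$age" -le "$3" ]
}

# Usage (not executed here; `stat -c %Y` is GNU stat):
#   newest="$(stat -c %Y backups/orders_latest.sql.gz)"
#   backup_within_rpo "$newest" "$(date +%s)" 3600 ||
#     echo "ALERT: newest backup is older than the 1h RPO" >&2
```

A freshness check answers only questions 2 and 7. It says nothing about whether the backup can actually be restored.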

25. PostgreSQL Backup Example

For Postgres in a container, do not copy raw data directory while database is running unless using a database-safe method.

Logical backup:

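# No -t here: a TTY rewrites line endings and can corrupt redirected output.
# Run on the node hosting the task; the name filter is a substring match.
docker exec "$(docker ps -q -f name=orders_postgres)" \
  pg_dump -U orders_app -d orders \
  > "orders_$(date +%Y%m%d_%H%M%S).sql"

Compressed:

docker exec "$(docker ps -q -f name=orders_postgres)" \
  pg_dump -U orders_app -d orders | gzip \
  > "orders_$(date +%Y%m%d_%H%M%S).sql.gz"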

Restore drill:

gunzip -c orders_20260701_010000.sql.gz | \
  docker exec -i "$(docker ps -q -f name=orders_postgres_restore)" \
  psql -v ON_ERROR_STOP=1 -U orders_app -d orders

For production-grade databases, prefer native backup strategy:

  • base backup
  • WAL archiving
  • point-in-time recovery
  • replica verification
  • backup catalog
  • retention policy

26. Volume Backup Example

For generic volume backup:

docker run --rm \
  -v orders_pgdata:/source:ro \
  -v "$PWD/backups:/backup" \
  busybox \
  tar czf /backup/orders_pgdata_$(date +%Y%m%d_%H%M%S).tar.gz -C /source .

Restore:

docker run --rm \
  -v orders_pgdata:/target \
  -v "$PWD/backups:/backup" \
  busybox \
  sh -c 'cd /target && tar xzf /backup/orders_pgdata_20260701_010000.tar.gz'

Warning:

For databases, filesystem-level backup must respect database consistency rules. Use database-aware backup unless you have quiesced the service or the database supports snapshot-safe backup procedure.
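
Whatever produces the archive, it is worth recording a checksum at backup time and verifying it before any restore. A small sketch, assuming sha256sum is available; the helper names write_checksum and verify_checksum are hypothetical.

```shell
# write_checksum ARCHIVE -> creates ARCHIVE.sha256 next to the archive.
# The checksum file stores the bare file name, so the pair can be moved together.
write_checksum() {
  ( cd "$(dirname "$1")" &&
    sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_checksum ARCHIVE -> succeeds only if the archive still matches.
verify_checksum() {
  ( cd "$(dirname "$1")" &&
    sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}

# Usage (not executed here):
#   write_checksum  backups/orders_pgdata_20260701_010000.tar.gz
#   verify_checksum backups/orders_pgdata_20260701_010000.tar.gz ||
#     { echo "archive is corrupt, refusing to restore" >&2; exit 1; }
```

A matching checksum shows that the archive was not damaged in storage or transfer. It does not show that the data inside was consistent when it was captured.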


27. Secrets and Backups

Backups can leak secrets indirectly.

Possible leakage:

  • database contains third-party credentials
  • config file stored in volume contains passwords
  • application log contains secret by mistake
  • backup archive contains mounted secret copied accidentally
  • support dump includes /run/secrets

Backup policy must include:

  • encryption at rest
  • access control
  • retention
  • deletion
  • audit logging
  • restore access governance
  • redaction rules for support bundles

28. Stack Example: API + Worker + Postgres With Secrets and Configs

version: "3.9"

services:
  api:
    image: registry.example.com/orders-api:2026.07.01
    networks:
      - app
      - public
    ports:
      - target: 8080
        published: 8080
        protocol: tcp
        mode: ingress
    secrets:
      - source: orders_db_password_2026_07_v1
        target: db_password
        uid: "10001"
        gid: "10001"
        mode: 0400
    configs:
      - source: orders_api_config_2026_07_v1
        target: /app/config/application.yml
        uid: "10001"
        gid: "10001"
        mode: 0444
    environment:
      DB_PASSWORD_FILE: /run/secrets/db_password
      SPRING_CONFIG_ADDITIONAL_LOCATION: file:/app/config/application.yml
    deploy:
      replicas: 3
      placement:
        constraints:
          - node.role == worker
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
        monitor: 30s
      rollback_config:
        parallelism: 1
        delay: 5s
        failure_action: pause

  worker:
    image: registry.example.com/orders-worker:2026.07.01
    networks:
      - app
    secrets:
      - source: orders_db_password_2026_07_v1
        target: db_password
    configs:
      - source: orders_worker_config_2026_07_v1
        target: /app/config/application.yml
    environment:
      DB_PASSWORD_FILE: /run/secrets/db_password
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.role == worker

  postgres:
    image: postgres:16.4
    networks:
      - app
    environment:
      POSTGRES_DB: orders
      POSTGRES_USER: orders_app
      POSTGRES_PASSWORD_FILE: /run/secrets/postgres_password
    secrets:
      - source: orders_db_password_2026_07_v1
        target: postgres_password
        mode: 0400
    volumes:
      - orders_pgdata:/var/lib/postgresql/data
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.orders.postgres == true
      restart_policy:
        condition: on-failure

networks:
  public:
    driver: overlay
  app:
    driver: overlay
    internal: true

volumes:
  orders_pgdata:
    driver: local

secrets:
  orders_db_password_2026_07_v1:
    external: true

configs:
  orders_api_config_2026_07_v1:
    external: true
  orders_worker_config_2026_07_v1:
    external: true

Review points:

  • API and worker share DB secret explicitly.
  • Postgres is pinned to labeled node.
  • App config is separate from image.
  • Internal network isolates DB from public ingress.
  • API is scalable; Postgres is intentionally single replica.
  • Production risk remains: local volume is node-bound.

29. Application Pattern: Reading Secret Files

Application should support _FILE style environment variables.

Pseudo-code:

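import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative holder class; the name is arbitrary.
final class ConfigValues {
    // Prefer KEY_FILE (path to a mounted secret file), then fall back to KEY.
    static String readConfigValue(String key) throws IOException {
        String fileKey = key + "_FILE";
        String path = System.getenv(fileKey);
        if (path != null && !path.isBlank()) {
            // trim() drops the trailing newline that secret files often carry
            return Files.readString(Path.of(path), StandardCharsets.UTF_8).trim();
        }

        String value = System.getenv(key);
        if (value != null) {
            return value;
        }

        throw new IllegalStateException("Missing required config: " + key + " or " + fileKey);
    }
}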

This makes the app work with:

DB_PASSWORD_FILE=/run/secrets/db_password

without forcing secret into environment value.
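
The same lookup order is useful in container entrypoint scripts. The following is a shell sketch of the pattern, not taken from any particular image: read_config_value is a hypothetical name, and the function prints the value instead of exporting it so the secret does not end up in the environment.

```shell
# read_config_value KEY -> prints the value to stdout.
# Looks at KEY_FILE first (path to a mounted secret file), then at KEY.
read_config_value() {
  local key="$1" file_key="${1}_FILE" path value
  path="$(printenv "$file_key")" || path=""
  if [ -n "$path" ]; then
    [ -r "$path" ] || { echo "Cannot read $file_key=$path" >&2; return 1; }
    # Command substitution strips trailing newlines only.
    value="$(cat "$path")"
    printf '%s' "$value"
    return 0
  fi
  if value="$(printenv "$key")"; then
    printf '%s' "$value"
    return 0
  fi
  echo "Missing required config: $key or $file_key" >&2
  return 1
}

# Usage (not executed here):
#   db_password="$(read_config_value DB_PASSWORD)" || exit 1
```

Keeping the value in a shell variable, and passing it on through a file or stdin, avoids re-creating the environment variable exposure described in section 7.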


30. Governance: Ownership and Inventory

Every secret/config/volume should have metadata outside Docker.

Example inventory table:

| Object | Owner | Environment | Services | Rotation | Backup | Last Review |
|---|---|---|---|---|---|---|
| orders_db_password_2026_07_v1 | Payments Platform | prod | api, worker, postgres | 90 days | n/a | 2026-07-01 |
| orders_api_config_2026_07_v1 | Orders Team | prod | api | per release | Git source | 2026-07-01 |
| orders_pgdata | DBA/Platform | prod | postgres | n/a | hourly logical + daily full | 2026-07-01 |

Labels can help but are not a full governance system.

deploy:
  labels:
    com.example.owner: "orders-platform"
    com.example.data-classification: "confidential"
    com.example.runbook: "https://internal/runbooks/orders"
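
An inventory is only useful if it is compared against what actually exists in the swarm. A minimal sketch of an orphan check, assuming the inventory is exported as a plain text file with one object name per line; the function name and file names are hypothetical.

```shell
# find_orphans INVENTORY_FILE ACTUAL_FILE
# Prints every name in ACTUAL_FILE that is not listed in INVENTORY_FILE
# (one name per line in both files). Exits 0 even when nothing is found.
find_orphans() {
  grep -Fxv -f "$1" "$2" || true
}

# Usage (not executed here):
#   docker secret ls --format '{{.Name}}' > actual_secrets.txt
#   find_orphans inventory_secrets.txt actual_secrets.txt
```

The same comparison works for configs and volumes by swapping in docker config ls or docker volume ls as the source of the actual list.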

31. Failure Mode Catalog

31.1 Secret Missing

Symptom:

secret not found

Likely causes:

  • secret not created on target swarm
  • stack references wrong name
  • external secret omitted
  • deployed to wrong Docker context

Checks:

docker secret ls
docker stack config -c stack.yml
docker context show

31.2 Secret Permission Error

Symptom:

Permission denied: /run/secrets/db_password

Likely causes:

  • app runs as non-root UID
  • secret mode too restrictive
  • wrong uid/gid in stack file

Fix:

secrets:
  - source: orders_db_password
    target: db_password
    uid: "10001"
    gid: "10001"
    mode: 0400

31.3 Config Changed But App Not Updated

Likely causes:

  • config object immutable; service still mounts old config
  • app reads config once at startup
  • stack deploy did not replace task because reference unchanged

Fix:

  • create new config name
  • update service to new config
  • trigger rolling update

31.4 Stateful Task Rescheduled to Empty Volume

Symptom:

  • database starts but data missing
  • service appears healthy but records gone
  • volume exists but on wrong node

Checks:

docker service ps orders_postgres --no-trunc
docker node ps worker-1
docker volume ls
docker volume inspect orders_pgdata

Fix:

  • stop service
  • locate original data node
  • restore from backup or move data intentionally
  • apply node placement constraints

31.5 Backup Exists But Restore Fails

Likely causes:

  • backup was inconsistent
  • missing secrets/credentials
  • wrong database version
  • backup not encrypted/decrypted correctly
  • restore process never tested

Fix:

  • implement restore drill
  • pin backup tool version
  • store metadata with backup
  • verify checksum
  • document restore dependencies

32. Operational Runbook: Secret Rotation

Template:

# Runbook: Rotate <secret-name>

## Preconditions
- Current secret: <name-v1>
- New secret: <name-v2>
- Services affected: <list>
- Backend credential created: yes/no
- Rollback window: <duration>

## Steps
1. Create new backend credential.
2. Create new Docker secret.
3. Add new secret to service.
4. Update app config to read new target if required.
5. Observe service health.
6. Validate authentication metrics/logs.
7. Revoke old backend credential after rollback window.
8. Remove old secret from service.
9. Remove old Docker secret.

## Rollback
- Re-add old credential if still active.
- Roll back service spec.
- Restore previous config if needed.

## Evidence
- command output
- service ps output
- health metrics
- audit ticket

33. Operational Runbook: Stateful Node Maintenance

Template:

# Runbook: Maintain Node <node-name> With Stateful Services

## Pre-check
- List services on node.
- Identify stateful services.
- Confirm latest backup timestamp.
- Confirm restore test status.
- Confirm placement constraints.

## Safe Procedure
1. Announce maintenance window.
2. Stop or migrate stateful service intentionally.
3. Backup before maintenance.
4. Drain node only after stateful risk is handled.
5. Perform maintenance.
6. Reactivate node.
7. Start service on intended node.
8. Validate data and app behavior.

## Do Not
- Blindly drain a node with local-volume database.
- Assume named volume data follows task.
- Remove old backup before new restore test succeeds.

34. Security Review Checklist

For each service:

  • Does it mount only secrets it needs?
  • Are secrets mounted as files, not environment values?
  • Are secret target names stable and app-friendly?
  • Are secret file permissions compatible with non-root UID?
  • Are configs free of sensitive values?
  • Are config names versioned?
  • Are old secrets/configs removed after rollback window?
  • Are logs checked for accidental secret output?
  • Are support bundles redacted?
  • Is Docker socket not mounted into app containers?

35. Stateful Design Checklist

For each stateful service:

  • What is the state?
  • Where is it physically stored?
  • Is it node-local or cluster-backed?
  • What happens when task restarts on same node?
  • What happens when task reschedules to another node?
  • What happens during node drain?
  • What happens during node loss?
  • Is backup database-aware?
  • Is restore tested?
  • What are RPO and RTO?
  • Who owns recovery?
  • Is there a migration path to external managed state?

36. Practice Lab

Lab 1 — Secret Lifecycle

  1. Create a secret lab_db_password_v1.
  2. Mount it into a service.
  3. Read it from /run/secrets.
  4. Rotate to lab_db_password_v2.
  5. Remove v1 after service becomes healthy.

Expected learning:

  • secret immutability
  • service update behavior
  • secret permission
  • rollback window

Lab 2 — Config Rollout

  1. Create nginx_conf_v1.
  2. Deploy Nginx service with config.
  3. Create nginx_conf_v2.
  4. Update service.
  5. Validate response changed.

Expected learning:

  • config versioning
  • task replacement
  • service convergence

Lab 3 — Local Volume Reschedule Trap

  1. Deploy a single-replica service with named volume and node constraint.
  2. Write data.
  3. Remove constraint or drain node.
  4. Observe behavior when task lands elsewhere.
  5. Restore intended placement.

Expected learning:

  • node-local volume risk
  • service ps diagnosis
  • placement discipline

Lab 4 — Backup Restore Drill

  1. Populate database.
  2. Take logical backup.
  3. Remove database volume in lab.
  4. Restore into fresh volume.
  5. Validate row count and application behavior.

Expected learning:

  • backup is not restore
  • backup metadata matters
  • database-aware procedure matters

37. Common Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach |
|---|---|---|
| Secret in image | Anyone with image can extract it | Runtime secret file |
| Secret in env file committed to Git | Plaintext leak | External secret management |
| Unversioned config | Rollback ambiguity | Immutable versioned config names |
| replicas: 3 for non-clustered DB | data divergence | DB-specific replication or single-primary |
| Local volume without placement | accidental empty data on reschedule | node label + backup or external storage |
| Blind node drain | stateful task moves unsafely | stateful maintenance runbook |
| Backup without restore test | false confidence | scheduled restore drill |
| Shared volume for unsafe multi-writer app | corruption/split-brain | app-level clustering protocol |
| All services mount all secrets | blast radius too large | least-privilege secret mapping |

38. Mental Model Summary

Final invariant:

Stateless services can be rescheduled freely. Stateful services can only be rescheduled safely when storage, placement, and recovery semantics are explicitly designed.


39. What Good Looks Like

A production-ready Swarm stateful design has:

  • immutable image artifact
  • versioned config object
  • least-privilege secret access
  • secret rotation path
  • explicit placement constraints for local state
  • external managed state for critical data where possible
  • database-aware backup
  • tested restore
  • node maintenance runbook
  • incident path for secret leak
  • evidence trail for changes

This is the difference between “it runs” and “it survives operational reality”.


40. References

  • Docker Docs — Manage sensitive data with Docker secrets: https://docs.docker.com/engine/swarm/secrets/
  • Docker Docs — Store configuration data using Docker Configs: https://docs.docker.com/engine/swarm/configs/
  • Docker Docs — Deploy services to a swarm: https://docs.docker.com/engine/swarm/services/
  • Docker Docs — Deploy a stack to a swarm: https://docs.docker.com/engine/swarm/stack-deploy/
  • Docker Docs — Drain a node on the swarm: https://docs.docker.com/engine/swarm/swarm-tutorial/drain-node/
  • Docker Docs — Administer and maintain a swarm of Docker Engines: https://docs.docker.com/engine/swarm/admin_guide/