
Production Case Study: Enterprise GitOps/IaC Platform

Learn State-of-the-Art GitOps/IaC Pipeline - Part 039

Production case study for building an enterprise GitOps/IaC platform from repository model to plan, policy, approval, apply, reconciliation, observability, audit evidence, and operating runbooks.


Part 039 — Production Case Study: Enterprise GitOps/IaC Platform

This part turns the whole series into a concrete production-grade implementation.

We are not building a toy pipeline. We are designing an internal platform for a multi-team engineering organization where infrastructure changes must be reviewable, policy-checked, auditable, recoverable, and continuously reconciled.

The important question is not:

“Which CI YAML do we use?”

The important question is:

“Can every infrastructure and deployment state transition be explained, authorized, reproduced, observed, and recovered?”

That is the bar for a state-of-the-art GitOps/IaC platform.


1. Case Study Context

Assume an engineering organization with the following characteristics:

  • 80 application teams.
  • 10 platform teams.
  • 6 production Kubernetes clusters.
  • 18 non-production clusters.
  • 4 cloud accounts for shared platform services.
  • 30+ application cloud accounts.
  • Multiple regions.
  • Compliance-sensitive workloads.
  • Strict production approval requirements.
  • A mix of application delivery and cloud infrastructure provisioning.

The platform must support:

  • Application deployment through GitOps.
  • Cloud infrastructure through Terraform/OpenTofu.
  • Policy-as-code before merge, before apply, and at Kubernetes admission.
  • Short-lived cloud credentials through OIDC or workload identity.
  • Secrets through external secret managers, not plain Git.
  • Progressive delivery for production workloads.
  • Drift detection and reconciliation.
  • Evidence generation for audit and incident review.
  • Emergency break-glass without destroying traceability.

The platform must avoid:

  • Static admin cloud keys in CI.
  • Direct manual production mutations as the normal path.
  • Unreviewed production apply.
  • GitOps controller running with cluster-admin everywhere.
  • One giant Terraform state for the company.
  • One giant repository that hides ownership boundaries.
  • Silent drift suppression.
  • Rollback procedures that only work in happy-path demos.

2. North Star Architecture

At the highest level, the platform has two related but different loops.

The first loop is the IaC mutation loop. It changes cloud resources and platform infrastructure.

The second loop is the GitOps reconciliation loop. It continuously makes Kubernetes runtime state match Git.

The architecture is intentionally split.

Cloud resources are usually mutated by an IaC engine with explicit plan/apply semantics. Kubernetes application state is usually reconciled by a controller such as Argo CD or Flux. Trying to force both into the same execution model often creates unclear ownership.

The platform therefore defines the following invariant:

Every state boundary must have exactly one mutation authority.

For example:

  • An AWS VPC is owned by an OpenTofu stack.
  • An EKS cluster add-on may be owned by either IaC or GitOps, but not both.
  • An application Deployment is owned by GitOps.
  • A Kubernetes Secret object may be owned by External Secrets Operator, while the secret value is owned by the external secret manager.
  • A database schema version may be owned by a migration runner, not by Argo CD blindly applying SQL from Git.
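One way to make this invariant checkable is to collect ownership claims from each tool's inventory and flag any boundary claimed by more than one authority. The following is a minimal sketch in Python; the boundary names and the `find_ownership_conflicts` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def find_ownership_conflicts(claims):
    """Return {boundary: [authorities]} for boundaries with more than one authority.

    `claims` is an iterable of (boundary, authority) pairs. How the pairs are
    collected (state files, GitOps inventories) is outside this sketch.
    """
    authorities = defaultdict(set)
    for boundary, authority in claims:
        authorities[boundary].add(authority)
    return {b: sorted(a) for b, a in authorities.items() if len(a) > 1}


# Hypothetical claims: the EKS add-on is claimed by both IaC and GitOps.
claims = [
    ("aws-vpc-payments-prod", "opentofu"),
    ("eks-addon-coredns", "opentofu"),
    ("eks-addon-coredns", "argocd"),
    ("orders-deployment", "argocd"),
]
conflicts = find_ownership_conflicts(claims)
print(conflicts)
```

A check like this could run in CI against the service catalog, so that a second owner for the same boundary fails review instead of surfacing later as a reconciliation fight.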

3. Repository Architecture

The platform uses multiple repositories because ownership boundaries matter more than folder elegance.

Recommended repository set:

| Repository | Primary owner | Purpose |
|---|---|---|
| app-<name> | Application team | Source code, tests, container build, app-level ownership metadata |
| app-<name>-deploy | Application team + platform guardrails | Helm/Kustomize/CUE config, image digest updates, environment desired state |
| infra-modules | Platform infra team | Versioned reusable infrastructure modules |
| infra-live | Platform infra team + service owners | Environment/account/region stack instantiation |
| cluster-addons | Platform Kubernetes team | Ingress, cert-manager, observability, policy engines, GitOps controllers, external-secrets |
| platform-policies | Security/platform governance team | OPA/Kyverno/Conftest/Checkov policies and tests |
| platform-bootstrap | Platform team | Bootstrap sequence for accounts, clusters, GitOps controllers, initial trust anchors |
| service-catalog | Platform product team | Ownership, service metadata, environment mapping, dependency metadata |
| change-evidence or evidence object store | Governance/platform | Immutable change evidence, run summaries, approvals, plan artifacts, attestations |

A common weak design is to put everything into one repository and call it GitOps. That often works for a small team, but it fails at scale because the repository stops expressing ownership.

The repository model must answer:

  • Who can propose a change?
  • Who must review it?
  • Which state backend is affected?
  • Which GitOps controller will reconcile it?
  • Which environment is targeted?
  • Which policy set applies?
  • Which evidence must be stored?
  • What is the blast radius if the merge is wrong?

4. Directory Layout for infra-live

A production-grade infra-live repository should make state boundaries visible.

Example:

infra-live/
  README.md
  catalog/
    accounts.yaml
    regions.yaml
    environments.yaml
  stacks/
    prod/
      aws/
        account-payments-prod/
          ap-southeast-1/
            network/
              terragrunt.hcl
              stack.yaml
              policy-context.yaml
            eks-cluster/
              terragrunt.hcl
              stack.yaml
              policy-context.yaml
            rds-orders/
              terragrunt.hcl
              stack.yaml
              policy-context.yaml
        account-shared-prod/
          ap-southeast-1/
            dns/
            observability/
    nonprod/
      aws/
        account-payments-dev/
          ap-southeast-1/
            network/
            eks-cluster/
            rds-orders/
  policies/
    bindings/
      prod.yaml
      nonprod.yaml
  .github/
    CODEOWNERS
    workflows/
      plan.yaml
      apply.yaml

Each stack directory is a separate unit of state. The path itself encodes environment, cloud, account, region, and capability.

A stack should be small enough that:

  • Its plan can be reviewed by a human.
  • Its blast radius is understandable.
  • Its lock contention is acceptable.
  • Its rollback/rollforward playbook is specific.
  • Its owner is clear.

A stack is too large when a reviewer cannot answer: “What will this change actually do?”
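Because the path encodes the stack's coordinates, tooling can derive them instead of asking humans to restate them. The sketch below assumes the `stacks/<environment>/<cloud>/<account>/<region>/<capability>` layout shown above; `StackRef` and `parse_stack_path` are illustrative names, not an existing library.

```python
from typing import NamedTuple


class StackRef(NamedTuple):
    environment: str
    cloud: str
    account: str
    region: str
    capability: str


def parse_stack_path(path: str) -> StackRef:
    """Parse 'stacks/<env>/<cloud>/<account>/<region>/<capability>' into a StackRef."""
    parts = path.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "stacks":
        raise ValueError(f"not a stack directory: {path!r}")
    return StackRef(*parts[1:])


ref = parse_stack_path("stacks/prod/aws/account-payments-prod/ap-southeast-1/rds-orders")
print(ref.environment, ref.capability)
```

Rejecting paths that do not match the layout keeps new stacks from silently escaping environment-specific rules.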


5. Directory Layout for Application GitOps

Example application deployment repository:

app-payments-deploy/
  README.md
  app.yaml
  base/
    deployment.yaml
    service.yaml
    serviceaccount.yaml
    hpa.yaml
    pdb.yaml
  overlays/
    dev/
      kustomization.yaml
      values.yaml
    stage/
      kustomization.yaml
      values.yaml
    prod-ap-southeast-1/
      kustomization.yaml
      values.yaml
      rollout.yaml
      external-secret.yaml
      policy-context.yaml
  release/
    promotion.yaml
    changelog.md
  evidence/
    evidence-contract.yaml

The repository should not hide production changes behind generic values files. A production overlay must make these visible:

  • image digest
  • replica and autoscaling bounds
  • resource requests/limits
  • rollout strategy
  • secret references
  • network exposure
  • service account permissions
  • dependency endpoints
  • policy exceptions
  • migration coupling

A dangerous application repository has this shape:

values.yaml
values-prod.yaml
values-real-prod-final.yaml
values-prod-temporary.yaml

That is not configuration management. That is archaeological evidence of missing boundaries.


6. Change Flow: From Pull Request to Production

The platform defines a standard state machine for every production change.

The important point: merge is not the only control. Merge is just one transition.

The platform also controls:

  • who can create a PR;
  • what validation must run;
  • what plan must be reviewed;
  • what policy decision must be accepted;
  • who may approve;
  • whether approval is still fresh;
  • which identity can apply;
  • whether the applied change matches the approved change;
  • whether post-change verification passed;
  • whether evidence exists.

7. Pull Request Contract

Every production-impacting PR must produce a machine-readable change contract.

Example:

change:
  id: CHG-2026-07-03-1421
  repository: infra-live
  pullRequest: 4812
  author: alice@example.com
  targetBranch: main
  environment: prod
  cloud: aws
  account: account-payments-prod
  region: ap-southeast-1
  stack: rds-orders
  mutationType: update
  riskClass: high
  requiresApprovalFrom:
    - platform-infra
    - database-owners
    - security
  evidenceRequired:
    - plan
    - policyDecision
    - approvalSnapshot
    - applyLog
    - postApplyVerification

This contract can be produced by CI from repository path, catalog metadata, CODEOWNERS, policy bindings, and PR metadata.

The contract prevents a common problem: humans infer too much from context.

A reviewer should not have to guess whether stacks/prod/aws/account-payments-prod/ap-southeast-1/rds-orders is production. The pipeline should know.
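As a rough illustration of how CI might derive part of that contract, the sketch below builds a contract dictionary from the stack path and PR metadata. The `approver_rules` mapping stands in for CODEOWNERS and policy bindings, and the reduced evidence list for non-production is an assumption made for this example.

```python
PROD_EVIDENCE = ["plan", "policyDecision", "approvalSnapshot", "applyLog", "postApplyVerification"]


def build_change_contract(stack_path, pull_request, author, approver_rules):
    """Derive a change contract from the stack path and PR metadata.

    `approver_rules` maps a stack name to its required approver groups
    (hypothetical input; a real pipeline would resolve this from
    CODEOWNERS and policy bindings).
    """
    parts = stack_path.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "stacks":
        raise ValueError(f"not a stack directory: {stack_path!r}")
    _, environment, cloud, account, region, stack = parts
    is_prod = environment == "prod"
    return {
        "repository": "infra-live",
        "pullRequest": pull_request,
        "author": author,
        "environment": environment,
        "cloud": cloud,
        "account": account,
        "region": region,
        "stack": stack,
        "requiresApprovalFrom": approver_rules.get(stack, []) if is_prod else [],
        # Assumption: non-production changes only need the plan as evidence.
        "evidenceRequired": PROD_EVIDENCE if is_prod else ["plan"],
    }


contract = build_change_contract(
    "stacks/prod/aws/account-payments-prod/ap-southeast-1/rds-orders",
    4812,
    "alice@example.com",
    {"rds-orders": ["platform-infra", "database-owners", "security"]},
)
print(contract["environment"], contract["requiresApprovalFrom"])
```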


8. CI Validation Stage

The first stage rejects malformed or obviously unsafe changes before expensive planning.

Typical checks:

  • formatting;
  • schema validation;
  • module source allowlist;
  • provider version constraints;
  • lockfile validation;
  • no plaintext secret patterns;
  • no direct production endpoint in non-production config;
  • no forbidden provider alias;
  • no unpinned container image tag in production;
  • no wildcard IAM action without exception metadata;
  • no deleted critical policy file;
  • no bypass of generated metadata;
  • no direct mutation of evidence files.

Example CI stage:

This stage should be fast and deterministic.

The goal is not to prove the change is safe. The goal is to prove the change is well-formed enough to enter deeper evaluation.
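One of these checks, rejecting unpinned container images in production, can be sketched in a few lines. This is a minimal example that assumes the image references have already been extracted from the rendered manifests; `unpinned_images` is a hypothetical helper name.

```python
import re

# An image counts as pinned only if it ends with a full sha256 digest.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(images):
    """Return the image references that are not pinned by digest."""
    return [image for image in images if not DIGEST_RE.search(image)]


pinned = "registry.example.com/payments/orders@sha256:" + "a" * 64
violations = unpinned_images([
    pinned,
    "registry.example.com/payments/orders:latest",
    "registry.example.com/payments/orders:1.4.2",
])
print(violations)
```

The check is deterministic and needs no network access, which fits the requirement that this stage stays fast.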


9. IaC Plan Stage

For Terraform/OpenTofu stacks, the plan stage must be treated as a privileged read operation. It may need access to state, provider schemas, and cloud APIs.

The plan stage must produce:

  • human-readable plan summary;
  • full machine-readable plan artifact;
  • normalized risk summary;
  • changed resource list;
  • action counts;
  • policy input document;
  • evidence envelope.

Example risk summary:

planSummary:
  stack: prod/aws/account-payments-prod/ap-southeast-1/rds-orders
  actions:
    create: 0
    update: 2
    delete: 0
    replace: 1
  highRiskResources:
    - aws_db_instance.orders
  destructive: false
  replacement:
    - aws_db_parameter_group.orders
  sensitiveChanges:
    - kms_key_id
  requiresManualApproval: true

The plan stage should not expose secrets in PR comments. A clean PR comment contains a summary and a link to restricted evidence artifacts.

A bad PR comment dumps raw plan output with sensitive values into a public or broadly visible repository.
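The action counts can be derived from the machine-readable plan. The sketch below assumes the JSON produced by `terraform show -json` (or `tofu show -json`), where each entry in `resource_changes` carries a `change.actions` list and a replacement appears as a delete/create pair. The alarm resource address is made up for the example, and treating only pure deletes as destructive mirrors the summary above rather than a universal rule.

```python
def summarize_plan(plan: dict) -> dict:
    """Count create/update/delete/replace actions in a plan JSON document."""
    counts = {"create": 0, "update": 0, "delete": 0, "replace": 0}
    replaced = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        if actions == {"create", "delete"}:
            # Replacement is reported as a delete/create pair in either order.
            counts["replace"] += 1
            replaced.append(rc["address"])
        elif len(actions) == 1:
            action = next(iter(actions))
            if action in counts:  # ignores "no-op" and "read"
                counts[action] += 1
    return {
        "actions": counts,
        "replacement": replaced,
        "destructive": counts["delete"] > 0,
    }


plan = {
    "resource_changes": [
        {"address": "aws_db_instance.orders", "change": {"actions": ["update"]}},
        {"address": "aws_cloudwatch_metric_alarm.orders_cpu", "change": {"actions": ["update"]}},
        {"address": "aws_db_parameter_group.orders", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_kms_key.orders", "change": {"actions": ["no-op"]}},
    ]
}
summary = summarize_plan(plan)
print(summary)
```

Only resource addresses and action names reach the summary, so it can be posted to the PR without exposing attribute values.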


10. Policy Gate Stage

Policy evaluation should run against enriched context, not raw plan alone.

Raw plan answers:

  • Which resources change?
  • Which fields change?
  • Which actions occur?

Enriched context answers:

  • Which environment is targeted?
  • Which service owns this stack?
  • Is this a regulated workload?
  • Is this an emergency change?
  • Which exception is active?
  • Which approval group is required?
  • Which data classification applies?

Policy input example:

input:
  change:
    environment: prod
    account: account-payments-prod
    region: ap-southeast-1
    service: orders
    dataClass: restricted
    riskClass: high
  plan:
    actions:
      replace:
        - type: aws_db_parameter_group
          name: orders
      update:
        - type: aws_db_instance
          name: orders
  approvals:
    current: []
  exceptions:
    active: []

Policy decisions should be structured:

decision:
  result: deny
  severity: high
  ruleId: iac.prod.database.replacement.requires-db-owner
  message: Production database-affecting replacement requires database owner approval.
  requiredApprovers:
    - database-owners
  evidence:
    - plan.action.replace.aws_db_parameter_group.orders

Do not design policy gates as arbitrary text logs. They must produce decisions that machines and auditors can understand.
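To show the shape of such a gate, here is a small Python stand-in for a single rule. In practice this logic would more likely live in OPA/Rego or Kyverno. The `aws_db_` prefix match and the structure of the entries in `approvals.current` (objects with a `groups` list) are assumptions for this sketch.

```python
def evaluate(policy_input: dict) -> dict:
    """Evaluate one rule: prod database replacement needs database-owner approval."""
    change = policy_input["change"]
    replaced = policy_input["plan"]["actions"].get("replace", [])
    approver_groups = {
        group
        for approval in policy_input["approvals"]["current"]
        for group in approval.get("groups", [])
    }
    db_replacements = [r for r in replaced if r["type"].startswith("aws_db_")]
    if change["environment"] == "prod" and db_replacements and "database-owners" not in approver_groups:
        return {
            "result": "deny",
            "severity": "high",
            "ruleId": "iac.prod.database.replacement.requires-db-owner",
            "requiredApprovers": ["database-owners"],
            "evidence": [f"plan.action.replace.{r['type']}.{r['name']}" for r in db_replacements],
        }
    return {"result": "allow"}


policy_input = {
    "change": {"environment": "prod", "service": "orders", "riskClass": "high"},
    "plan": {"actions": {"replace": [{"type": "aws_db_parameter_group", "name": "orders"}]}},
    "approvals": {"current": []},
    "exceptions": {"active": []},
}
decision = evaluate(policy_input)
print(decision["result"], decision["evidence"])
```

The decision is a plain data structure, so it can be hashed, stored as evidence, and rendered in a PR comment without parsing log text.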


11. Approval Binding

A production apply must prove that approval was granted for the same change that is being applied.

Approval binding should include:

  • repository;
  • PR number;
  • commit SHA;
  • plan hash;
  • policy decision hash;
  • approver identity;
  • approver group membership snapshot;
  • approval timestamp;
  • approval freshness window;
  • target environment;
  • risk class.

Example:

approvalBinding:
  pr: 4812
  commitSha: 8fb2d7e
  planSha256: 9e4a...
  policyDecisionSha256: 05cf...
  approvedBy:
    - user: bob@example.com
      groups:
        - platform-infra
      timestamp: 2026-07-03T09:14:22Z
    - user: carol@example.com
      groups:
        - database-owners
      timestamp: 2026-07-03T09:19:40Z
  expiresAt: 2026-07-03T13:19:40Z

Without approval binding, reviewers can approve one plan while the pipeline applies another.

That is not a process problem. That is a system design flaw.
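A minimal sketch of the check at apply time, assuming the binding was recorded with the field names from the example above and that the plan artifact is available as bytes. `binding_errors` is a hypothetical helper; a fuller version would also compare the policy decision hash and approver group membership.

```python
import hashlib
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def binding_errors(binding: dict, commit_sha: str, plan_bytes: bytes, now: datetime) -> list:
    """Return the reasons the approval does not cover what is about to be applied."""
    errors = []
    if binding["commitSha"] != commit_sha:
        errors.append("commit mismatch")
    if binding["planSha256"] != sha256_hex(plan_bytes):
        errors.append("plan hash mismatch")
    if now >= binding["expiresAt"]:
        errors.append("approval expired")
    return errors


approved_plan = b'{"resource_changes": []}'  # placeholder plan artifact
binding = {
    "commitSha": "8fb2d7e",
    "planSha256": sha256_hex(approved_plan),
    "expiresAt": datetime(2026, 7, 3, 13, 19, 40, tzinfo=timezone.utc),
}
apply_time = datetime(2026, 7, 3, 9, 33, 0, tzinfo=timezone.utc)

print(binding_errors(binding, "8fb2d7e", approved_plan, apply_time))
print(binding_errors(binding, "8fb2d7e", b'{"resource_changes": [1]}', apply_time))
```

An empty list means the apply may proceed. Any entry should block the apply and be recorded as evidence.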


12. Apply Stage

The apply stage is the most dangerous part of the IaC mutation loop.

It must re-check:

  • branch protection state;
  • commit SHA;
  • plan freshness;
  • approval freshness;
  • policy decision freshness;
  • state lock availability;
  • target backend;
  • execution identity;
  • environment freeze status;
  • break-glass flags;
  • destructive-change requirements.

Apply state machine:

The apply runner should be isolated per trust boundary. A runner that can mutate production networking should not also execute arbitrary pull request code from forks.

Minimum runner controls:

  • ephemeral runtime where possible;
  • no long-lived cloud keys;
  • OIDC federated identity;
  • network egress restrictions;
  • dependency allowlist;
  • restricted state backend access;
  • no interactive shell by default;
  • artifact upload to immutable evidence store;
  • least-privilege cloud role per stack class;
  • explicit emergency mode.
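The re-checks before apply can be modeled as a gate that fails closed: a check that is missing or not explicitly true blocks the apply. The check names below are illustrative labels for a subset of the list above, not a fixed schema.

```python
REQUIRED_CHECKS = (
    "commit_sha_matches",
    "plan_fresh",
    "approval_fresh",
    "policy_decision_fresh",
    "state_lock_available",
    "no_environment_freeze",
)


def apply_gate(results: dict) -> dict:
    """Allow apply only if every required check is present and exactly True."""
    failed = [name for name in REQUIRED_CHECKS if results.get(name) is not True]
    return {"proceed": not failed, "failed": failed}


results = {name: True for name in REQUIRED_CHECKS}
print(apply_gate(results))

# A missing result is treated the same as a failed one.
del results["approval_fresh"]
results["state_lock_available"] = False
print(apply_gate(results))
```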

13. GitOps Reconciliation Stage

For application and cluster manifests, merge to the desired-state branch triggers controller reconciliation.

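Example Argo CD flow: an Application resource points at the repository path, and the controller syncs the rendered manifests into the destination cluster and namespace permitted by its AppProject.

Example Flux flow: a GitRepository source is fetched by the source controller, and a Kustomization that references it applies the resulting manifests to the cluster.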

The GitOps controller should not be treated as a magical deploy button. It is a distributed control loop with permissions.

For each GitOps-managed object, define:

  • source repository;
  • source path;
  • controller identity;
  • target namespace;
  • allowed resource kinds;
  • allowed destination cluster;
  • sync policy;
  • prune policy;
  • self-heal policy;
  • ignored differences;
  • health criteria;
  • alert rules.

14. Progressive Delivery Implementation

Production application rollout should not be a binary “new version everywhere” event.

Example canary contract:

rollout:
  strategy: canary
  artifact:
    image: registry.example.com/payments/orders@sha256:abc123
  steps:
    - setWeight: 5
    - pause: 5m
    - analysis:
        successRate: ">= 99.5"
        p95Latency: "<= 300ms"
        errorBudgetBurn: "<= 2x"
    - setWeight: 25
    - pause: 10m
    - analysis:
        successRate: ">= 99.7"
        p95Latency: "<= 280ms"
    - setWeight: 100
  abort:
    onNoData: true
    onMetricError: true

Progressive delivery must define no-data behavior. If metrics are missing, the safest default for production should usually be to stop or fail closed, not promote blindly.

Rollout evidence should include:

  • artifact digest;
  • rollout object revision;
  • analysis template version;
  • metric queries;
  • metric results;
  • promotion timestamps;
  • abort reason if failed;
  • final service version.
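The fail-closed behavior for missing metrics can be sketched as a small decision function. This only illustrates the rule and is not how Argo Rollouts or Flagger implement analysis; the metric names and threshold encoding are assumptions.

```python
def analysis_decision(metrics: dict, thresholds: dict) -> tuple:
    """Return ("promote" | "abort", reason). Missing data aborts.

    `thresholds` maps a metric name to (operator, limit) with operator
    ">=" or "<=". `metrics` maps a metric name to a float, or None when
    the query returned no data.
    """
    for name, (op, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            return ("abort", f"no data for {name}")
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            return ("abort", f"{name}={value} violates {op} {limit}")
    return ("promote", "all checks passed")


thresholds = {"successRate": (">=", 99.5), "p95LatencyMs": ("<=", 300.0)}

print(analysis_decision({"successRate": 99.8, "p95LatencyMs": 240.0}, thresholds))
print(analysis_decision({"successRate": 99.8, "p95LatencyMs": None}, thresholds))
print(analysis_decision({"successRate": 99.1, "p95LatencyMs": 240.0}, thresholds))
```

Returning the reason alongside the decision gives the rollout evidence its "abort reason if failed" field directly.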

15. Secrets Flow

A production GitOps/IaC platform should avoid storing plaintext secrets in Git and avoid injecting long-lived secrets into general-purpose CI runners.

Reference pattern: Git contains references, not secret values.

A safe ExternalSecret declaration describes:

  • which external secret to read;
  • which Kubernetes Secret to produce;
  • which namespace to place it in;
  • which refresh interval to use;
  • which service account/identity is allowed;
  • which ownership policy applies.

Secret management invariants:

  • CI should not print secrets.
  • Git should not contain plaintext secrets.
  • Terraform/OpenTofu state must be treated as sensitive if it contains secret-derived values.
  • Secret rotation must have a delivery and verification path.
  • Secret deletion must account for running workloads.
  • Break-glass secret access must produce audit evidence.

16. Supply Chain Gates

Before GitOps accepts a new workload version, the platform should verify artifact integrity.

Minimum production artifact contract:

artifact:
  image: registry.example.com/payments/orders@sha256:abc123
  sourceRepository: github.com/example/orders
  sourceCommit: 9d4c...
  buildRun: 712991
  sbom: registry.example.com/attestations/orders-sbom@sha256:def456
  provenance: registry.example.com/attestations/orders-provenance@sha256:789abc
  signed: true
  signerIdentity: repo:example/orders:ref:refs/heads/main

Policy gates should reject:

  • mutable tags in production;
  • unsigned images;
  • images signed by unexpected identity;
  • missing provenance;
  • stale vulnerability exception;
  • artifact built from unprotected branch;
  • artifact built by untrusted builder;
  • artifact whose source commit does not match the release metadata.

This prevents GitOps from becoming a high-speed delivery mechanism for untrusted artifacts.
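A few of these rejections can be expressed as checks over the artifact contract. The sketch assumes signature and provenance verification already happened upstream (for example in an admission or CI step) and that the results are recorded in the contract fields. It does not verify signatures itself, and `artifact_violations` is a hypothetical helper.

```python
def artifact_violations(artifact: dict, trusted_signers: set, release_commit: str) -> list:
    """Return the reasons this artifact must not be deployed to production."""
    violations = []
    if "@sha256:" not in artifact.get("image", ""):
        violations.append("mutable tag")
    if not artifact.get("signed"):
        violations.append("unsigned image")
    elif artifact.get("signerIdentity") not in trusted_signers:
        violations.append("unexpected signer identity")
    if not artifact.get("provenance"):
        violations.append("missing provenance")
    if artifact.get("sourceCommit") != release_commit:
        violations.append("source commit does not match release metadata")
    return violations


artifact = {
    "image": "registry.example.com/payments/orders@sha256:abc123",
    "sourceCommit": "9d4c...",  # elided in the contract above; kept as-is
    "provenance": "registry.example.com/attestations/orders-provenance@sha256:789abc",
    "signed": True,
    "signerIdentity": "repo:example/orders:ref:refs/heads/main",
}
trusted = {"repo:example/orders:ref:refs/heads/main"}

print(artifact_violations(artifact, trusted, "9d4c..."))
print(artifact_violations({"image": "registry.example.com/payments/orders:latest"}, trusted, "9d4c..."))
```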


17. Multi-Tenancy Model

The platform supports many teams, but not by giving every team cluster-admin access.

Tenancy boundaries:

| Boundary | Mechanism |
|---|---|
| Repository write access | CODEOWNERS, branch protection, required checks |
| IaC stack mutation | stack-specific cloud role, state backend permission, apply gate |
| GitOps application scope | Argo CD Project or Flux namespace/RBAC boundary |
| Kubernetes runtime | namespace, service account, admission policy, network policy |
| Secrets | external secret manager path/role, secret operator identity |
| Observability | tenant-scoped dashboard and logs with platform-level aggregation |
| Evidence | immutable central store with tenant-specific views |

A tenant should be able to operate inside its assigned boundary without being able to escape it.

For example, an application team may change its deployment overlay but not create a cluster-wide privileged DaemonSet. That rule should be enforced by repository policy, GitOps project scope, Kubernetes RBAC, and admission policy.

Defense in depth matters because any single layer can be misconfigured.


18. Bootstrap Sequence

The hardest part of GitOps/IaC is often the beginning: who creates the thing that reconciles everything else?

A safe bootstrap sequence:

Bootstrap must be documented and evidence-producing because it has unusually high privilege.

Bootstrap artifacts should include:

  • bootstrap command version;
  • operator identity;
  • target account/cluster;
  • created roles;
  • state backend location;
  • GitOps controller version;
  • initial admin credentials handling;
  • policy baseline version;
  • break-glass procedure;
  • handoff record from bootstrap to GitOps.

After bootstrap, ongoing changes should flow through normal GitOps/IaC pipelines. Bootstrap should not remain the hidden alternate deployment channel.


19. Evidence Store Design

The evidence store is not a dumping ground for logs. It is a queryable record of state transitions.

Evidence object model:

evidence:
  id: ev-2026-07-03-4812
  changeId: CHG-2026-07-03-1421
  repository: infra-live
  pullRequest: 4812
  commitSha: 8fb2d7e
  actor: alice@example.com
  target:
    environment: prod
    account: account-payments-prod
    region: ap-southeast-1
    stack: rds-orders
  artifacts:
    plan: s3://evidence/prod/4812/plan.json
    policy: s3://evidence/prod/4812/policy.json
    approval: s3://evidence/prod/4812/approval.json
    applyLog: s3://evidence/prod/4812/apply.log
    postCheck: s3://evidence/prod/4812/post-check.json
  result:
    status: success
    startedAt: 2026-07-03T09:33:00Z
    completedAt: 2026-07-03T09:37:12Z

Audit queries should be easy:

  • “Who approved production database changes last month?”
  • “Which changes modified public ingress?”
  • “Which emergency changes bypassed normal review?”
  • “Which policies denied production changes?”
  • “Which deployments used artifacts from unprotected branches?”
  • “Which stacks had drift and how was it reconciled?”

If the evidence system cannot answer these questions, the platform has observability but not governance.
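A simple query of this kind is an evidence completeness check: which changes are missing a required artifact? The sketch below works on records shaped like the object model above, loaded as dictionaries. The required artifact list follows that example, and the second record is invented for illustration.

```python
REQUIRED_ARTIFACTS = ("plan", "policy", "approval", "applyLog", "postCheck")


def incomplete_changes(records):
    """Return (changeId, missing artifact names) for records lacking required evidence."""
    findings = []
    for record in records:
        artifacts = record.get("artifacts", {})
        missing = [name for name in REQUIRED_ARTIFACTS if not artifacts.get(name)]
        if missing:
            findings.append((record["changeId"], missing))
    return findings


records = [
    {
        "changeId": "CHG-2026-07-03-1421",
        "artifacts": {
            "plan": "s3://evidence/prod/4812/plan.json",
            "policy": "s3://evidence/prod/4812/policy.json",
            "approval": "s3://evidence/prod/4812/approval.json",
            "applyLog": "s3://evidence/prod/4812/apply.log",
            "postCheck": "s3://evidence/prod/4812/post-check.json",
        },
    },
    {
        # Hypothetical record with gaps, for illustration only.
        "changeId": "CHG-EXAMPLE-INCOMPLETE",
        "artifacts": {"plan": "s3://evidence/prod/example/plan.json"},
    },
]
print(incomplete_changes(records))
```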


20. Observability Dashboard

A platform dashboard should show the health of the delivery control plane, not just application CPU.

Recommended panels:

IaC Pipeline

  • plan success rate;
  • apply success rate;
  • average plan duration;
  • average apply duration;
  • lock wait time;
  • failed policy gates;
  • stale approvals;
  • failed post-apply checks;
  • state backend errors;
  • drift count by environment;
  • emergency changes.

GitOps

  • out-of-sync applications;
  • reconciliation latency;
  • sync failure count;
  • applications with degraded health;
  • ignored diff count;
  • prune events;
  • controller queue depth;
  • controller API errors;
  • suspended Flux resources;
  • Argo applications with auto-sync disabled.

Security and Policy

  • admission denies;
  • policy exceptions by age;
  • unsigned artifact denials;
  • stale SBOM/provenance failures;
  • secret sync failures;
  • break-glass access events.

Governance

  • evidence completeness rate;
  • production changes without required artifact;
  • approval SLA;
  • change failure rate;
  • rollback/rollforward count;
  • mean time to recovery for pipeline incidents.

21. Failure Scenario: Partial Apply on Production Database Stack

Scenario:

A PR updates an RDS parameter group and related monitoring alarms. The apply partially succeeds: the alarms are updated, but the parameter group replacement fails because of a provider or API constraint.

Bad response:

  • manually edit cloud console;
  • force unlock without investigation;
  • rerun apply repeatedly;
  • close incident without state reconciliation;
  • leave evidence incomplete.

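Correct response: treat the partial apply as an incident, investigate before touching the lock or rerunning apply, reconcile state against what actually exists, and recover through a reviewed recovery PR with complete evidence.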

Recovery PR content:

  • incident/change reference;
  • observed partial state;
  • chosen recovery direction;
  • updated desired state;
  • risk summary;
  • required approvals;
  • validation plan;
  • evidence links.

The key idea: recovery is also a state transition. It must not bypass all controls unless break-glass is truly required.


22. Failure Scenario: GitOps Controller Applies Bad Manifest

Scenario:

A production deployment manifest passes CI but fails at runtime because a new required secret key is missing.

Correct design:

  • admission policy checks secret reference shape;
  • pre-sync job verifies external secret availability;
  • rollout analysis catches startup failures;
  • GitOps marks app degraded;
  • alert fires;
  • production promotion halts;
  • rollback/rollforward PR is created;
  • evidence captures object revision and controller events.

Important distinction:

  • The Git commit is the desired state.
  • The cluster live state is the attempted actual state.
  • The missing secret means the desired state is incomplete.

Do not fix it by manually creating a Kubernetes Secret in production. That creates hidden drift and teaches the organization to bypass the control plane.


23. Policy Exception Workflow

Exceptions are inevitable. Unmanaged exceptions are how policy programs die.

A production exception must include:

exception:
  id: EXC-2026-071
  ruleId: iac.aws.s3.public-access-block.required
  scope:
    repository: infra-live
    stack: prod/aws/account-shared-prod/ap-southeast-1/public-artifacts
  justification: Public artifact bucket for signed release assets.
  compensatingControls:
    - CloudFront signed URLs
    - Object-level malware scan
    - Bucket write access restricted to release role
  approvedBy:
    - security@example.com
  expiresAt: 2026-10-01T00:00:00Z
  reviewCadence: monthly

Exception invariants:

  • every exception has an owner;
  • every exception expires;
  • every exception has scope;
  • every exception has justification;
  • every exception is visible in dashboards;
  • exception use appears in evidence;
  • expired exceptions fail closed.

An exception without expiry is not an exception. It is a new unmanaged standard.
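The "expired exceptions fail closed" invariant can be sketched as follows. The function treats a missing expiry the same as an expired one, and the exact-match scoping on rule and stack is a simplification assumed for this example.

```python
from datetime import datetime, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse timestamps of the form 2026-10-01T00:00:00Z as UTC."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def exception_active(exception: dict, rule_id: str, stack: str, now: datetime) -> bool:
    """True only if the exception is in scope and has a future expiry."""
    expires_at = exception.get("expiresAt")
    if not expires_at:
        return False  # no expiry means no valid exception
    in_scope = (
        exception.get("ruleId") == rule_id
        and exception.get("scope", {}).get("stack") == stack
    )
    return in_scope and now < parse_utc(expires_at)


exception = {
    "id": "EXC-2026-071",
    "ruleId": "iac.aws.s3.public-access-block.required",
    "scope": {"stack": "prod/aws/account-shared-prod/ap-southeast-1/public-artifacts"},
    "expiresAt": "2026-10-01T00:00:00Z",
}
rule = "iac.aws.s3.public-access-block.required"
stack = "prod/aws/account-shared-prod/ap-southeast-1/public-artifacts"

print(exception_active(exception, rule, stack, datetime(2026, 9, 1, tzinfo=timezone.utc)))
print(exception_active(exception, rule, stack, datetime(2026, 10, 2, tzinfo=timezone.utc)))
```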


24. Production Readiness Checklist

Before launching the platform to production teams, validate these controls.

Repository Controls

  • Protected branches enabled.
  • CODEOWNERS mapped to real ownership.
  • Required checks cannot be bypassed silently.
  • Merge queue or equivalent enabled for high-risk repos.
  • Production directories have stricter review rules.
  • No direct push to production branch.

IaC Controls

  • Remote state backend is encrypted and access-controlled.
  • State locking is enabled and monitored.
  • Stack boundaries are explicit.
  • Plan artifacts are stored securely.
  • Apply requires approval binding.
  • Destructive changes require special approval.
  • Force unlock requires incident/evidence record.

Identity Controls

  • CI uses OIDC or workload identity.
  • No long-lived cloud admin keys in CI.
  • Runner roles are scoped by environment and stack class.
  • Break-glass roles are separate and audited.
  • GitOps controller identities are restricted.

Policy Controls

  • Pre-merge policy runs.
  • Plan policy runs.
  • Admission policy runs.
  • Exceptions are scoped and expiring.
  • Policy decisions are stored as evidence.
  • Policy tests exist.

GitOps Controls

  • App/project boundaries configured.
  • Destination clusters/namespaces restricted.
  • Prune/self-heal policy intentional.
  • Diff ignores reviewed and documented.
  • Controller metrics and alerts enabled.
  • Controller upgrades tested.

Secrets Controls

  • Plaintext secret scanning enabled.
  • External secret manager integrated.
  • Secret sync failures alert.
  • Secret rotation has playbook.
  • IaC state sensitivity reviewed.

Evidence Controls

  • Every production change has evidence ID.
  • Plan, policy, approval, apply/sync, verification stored.
  • Evidence retention policy defined.
  • Evidence redaction policy defined.
  • Audit queries tested.

Recovery Controls

  • Stuck lock runbook tested.
  • Partial apply runbook tested.
  • Bad manifest rollback tested.
  • Secret failure runbook tested.
  • Controller outage runbook tested.
  • Emergency change path tested.

25. Reference Implementation Milestones

Do not attempt to implement everything at once.

Milestone 1 — Safe Repository and Plan Foundation

Deliver:

  • repo topology;
  • protected branches;
  • CODEOWNERS;
  • stack detection;
  • plan pipeline;
  • basic policy gate;
  • evidence envelope.

Exit criteria:

  • every infra PR shows affected stacks;
  • every stack can produce plan artifact;
  • every production plan has evidence ID;
  • direct production apply outside pipeline is documented as unsupported.

Milestone 2 — Controlled Apply

Deliver:

  • apply runner;
  • remote state backend;
  • locking visibility;
  • approval binding;
  • short-lived identity;
  • post-apply verification.

Exit criteria:

  • production apply cannot run without approved plan;
  • stale approval blocks apply;
  • apply logs and result are stored;
  • failed apply creates recovery workflow.

Milestone 3 — GitOps Runtime

Deliver:

  • Argo CD or Flux bootstrap;
  • app/project/tenant boundaries;
  • cluster-addons GitOps;
  • app deployment GitOps;
  • controller observability.

Exit criteria:

  • production workloads are reconciled from Git;
  • manual runtime drift is detected;
  • controller alerts are wired;
  • sync failure runbook exists.

Milestone 4 — Policy and Secrets Hardening

Deliver:

  • OPA/Kyverno/Conftest policies;
  • external secrets integration;
  • policy exception workflow;
  • admission enforcement;
  • secret rotation runbook.

Exit criteria:

  • unsafe resources denied pre-merge or admission;
  • plaintext secret PR fails;
  • expired exception fails;
  • secret sync failure alerts.

Milestone 5 — Progressive Delivery and Evidence Maturity

Deliver:

  • canary/blue-green rollout;
  • artifact signing/verification;
  • SBOM/provenance evidence;
  • compliance dashboard;
  • audit query pack.

Exit criteria:

  • production rollout can stop automatically;
  • unsigned artifact cannot deploy;
  • audit can answer who/what/when/why/how for changes;
  • recovery drills completed.

26. Common Anti-Patterns in the Case Study

Anti-Pattern: Git as Suggestion, Console as Reality

If engineers regularly fix production through console changes and only later, if at all, update Git, then Git is not the source of truth. It is documentation.

Anti-Pattern: One CI Admin Role

A single CI role with broad admin access is operationally convenient and architecturally indefensible.

Anti-Pattern: Policy Without Context

A policy that cannot distinguish production database replacement from dev sandbox replacement will either block too much or allow too much.

Anti-Pattern: Approval Without Binding

Human approval that is not bound to a plan hash and commit SHA is easy to invalidate accidentally.

Anti-Pattern: Diff Ignore as Permanent Fix

Ignoring diffs may be necessary. But every ignored diff should have owner, justification, and review date.

Anti-Pattern: Rollback Theater

A rollback document that says “revert the commit” is not enough for databases, cloud resources, external state, or irreversible migrations.

Anti-Pattern: Self-Service as Template Dump

A template without lifecycle ownership, status, policy, and recovery is not a platform API. It is a copy-paste accelerator.


27. What Good Looks Like

A mature GitOps/IaC platform has these properties:

  • Engineers propose changes through normal development workflow.
  • The platform classifies risk automatically.
  • Plans are reviewable and policy-checked.
  • Production approvals are bound to exact artifacts.
  • Apply uses short-lived least-privilege identity.
  • Kubernetes desired state is continuously reconciled.
  • Secrets are referenced, not leaked.
  • Drift is visible and triaged.
  • Rollouts are progressive for risky workloads.
  • Evidence is generated without manual audit scramble.
  • Recovery is practiced and encoded.
  • Exceptions expire.
  • The platform can explain every production state transition.

This is the difference between “we use GitOps tools” and “we operate a GitOps/IaC control plane.”


28. Mental Model Summary

Think of the production platform as five connected ledgers.

Each ledger answers a different question:

| Ledger | Question |
|---|---|
| Git | What do we want? |
| Plan | What will change? |
| Approval | Who allowed this exact change? |
| Runtime | What actually happened? |
| Evidence | Can we prove it later? |

A weak platform has gaps between these ledgers.

A strong platform makes transitions explicit.


29. Practical Exercise

Design your own production GitOps/IaC case study using this template.

Step 1 — Choose a Realistic Domain

Pick one:

  • payments platform;
  • regulatory case management platform;
  • marketplace platform;
  • internal developer platform;
  • analytics platform;
  • banking integration platform.

Step 2 — Define State Boundaries

List at least 10 state boundaries:

stateBoundaries:
  - name: network-prod
    owner: platform-network
    mutationAuthority: opentofu
  - name: orders-api-prod
    owner: orders-team
    mutationAuthority: argocd
  - name: orders-db-schema
    owner: orders-team
    mutationAuthority: migration-runner

Step 3 — Define Change Flow

Draw the state machine from PR to production.

Step 4 — Define Evidence Contract

Specify artifacts required for:

  • non-production app deployment;
  • production app deployment;
  • non-production infra change;
  • production infra change;
  • database migration;
  • emergency change.

Step 5 — Define Failure Playbooks

At minimum:

  • failed plan;
  • stuck lock;
  • partial apply;
  • failed GitOps sync;
  • bad secret;
  • failed canary;
  • manual production drift.

The exercise is complete only when every failure has a recovery owner, decision path, and evidence record.


30. Closing

This part assembled the previous parts into a production operating model.

The main lesson is simple but strict:

A modern GitOps/IaC platform is not a pipeline that runs commands. It is a controlled state-transition system with evidence.

In the final part, we convert this into a concise operating handbook: maturity model, checklists, review questions, anti-pattern catalog, and mastery path.
