Drift Detection and Reconciliation

Learn State-of-the-Art GitOps/IaC Pipeline - Part 029

Drift detection and reconciliation for production-grade GitOps/IaC platforms, covering state planes, drift taxonomy, detection points, auto-heal strategy, drift budgets, exceptions, and recovery playbooks.

2026-07-03

Drift is not merely "someone changed production manually".

That is the childish definition.

In a real GitOps/IaC platform, drift is any meaningful divergence between the system state you intend, the state your control plane remembers, and the state the runtime actually has.

The practical problem is not only detecting drift. The harder problem is deciding:

whether the drift is real;
whether it is harmful;
who owns it;
whether the system may self-heal it;
whether the drift should update Git, update state, trigger incident response, or be ignored;
how to produce evidence that the decision was correct.

A top-tier engineer does not treat drift as a binary alert. They treat drift as a state classification and reconciliation problem.

This part builds that model.

1. The Skill You Are Building

After this part, you should be able to design a drift system that can answer these questions without guesswork:

What is the source of truth for this resource?
Which controller owns this field?
Is this drift caused by manual mutation, provider read behavior, runtime controller mutation, emergency response, failed reconciliation, or stale desired state?
Is it safe to auto-reconcile?
Should reconciliation happen by reverting runtime, updating Git, importing state, refreshing state, or opening a PR?
Which team must approve the reconciliation?
Which evidence proves what happened?

That is the real capability.

Not "run terraform plan on a schedule".

2. Drift Is a Three-Plane Problem

Most engineers model drift as:

Git != Production

That is incomplete.

For IaC/GitOps, there are usually three important state planes:

Desired State     = what Git/config declares
Recorded State    = what the IaC engine/controller believes exists
Actual State      = what the external system currently contains

For Terraform/OpenTofu-style IaC:

desired state is configuration;
recorded state is the state file;
actual state is the cloud/provider resource inventory.

For Kubernetes GitOps:

desired state is rendered manifests from Git/Helm/Kustomize/etc.;
recorded state is controller cache/status/last applied metadata/managed fields;
actual state is live Kubernetes objects and their observed runtime status.

The moment you add these three planes, drift becomes much more precise.

A mature drift system asks which relationship is broken.

| Broken relationship | Meaning | Example |
| --- | --- | --- |
| Desired != Actual | Runtime does not match Git/config | Replica count manually changed from 3 to 5 |
| Recorded != Actual | IaC state/controller cache is stale | Security group was changed outside OpenTofu |
| Desired != Recorded | Config changed but not applied, or state still tracks old model | Module refactor not migrated |
| Desired == Recorded but Actual differs | State is lying or refresh has not observed change | Provider state stale |
| Desired differs by generated output only | Render noise or nondeterminism | Helm template emits timestamp |

This is why blind auto-heal is dangerous.

You cannot fix a three-plane problem with a two-plane mental model.
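
A minimal sketch of the three-plane comparison, assuming each plane has already been reduced to a comparable value (here a replica count). The `Planes` type and `broken_relationships` function are hypothetical names for illustration, not part of any tool:

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Planes:
    desired: Any   # what Git/config declares
    recorded: Any  # what the IaC engine/controller believes exists
    actual: Any    # what the external system currently contains


def broken_relationships(p: Planes) -> list[str]:
    """Return the names of the plane pairs that disagree."""
    broken = []
    if p.desired != p.actual:
        broken.append("desired!=actual")
    if p.recorded != p.actual:
        broken.append("recorded!=actual")
    if p.desired != p.recorded:
        broken.append("desired!=recorded")
    return broken


# Replica count manually changed from 3 to 5; recorded state still says 3.
print(broken_relationships(Planes(desired=3, recorded=3, actual=5)))
```

The point of returning a list rather than a boolean is that the response depends on which pair disagrees, not on whether anything disagrees.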

3. Drift Is Not Always Bad

Drift is often bad, but not always.

A healthy production platform must classify drift before reacting.

| Drift class | Meaning | Default response |
| --- | --- | --- |
| Unauthorized drift | Mutation outside approved flow | Alert, classify severity, reconcile or incident |
| Emergency drift | Manual change during incident | Preserve evidence, reconcile intentionally after incident |
| Controller-owned drift | Runtime controller mutates fields legitimately | Ignore or model ownership boundary |
| Provider-read drift | Provider reports equivalent data differently | Normalize, pin provider, suppress noisy fields carefully |
| Dependency drift | External dependency changed its observable state | Re-plan, validate compatibility, update state/config |
| Stale desired state | Git no longer represents reality because manual change became accepted | Open PR to Git, not blind revert |
| Failed reconciliation drift | Controller attempted but could not converge | Treat as control-loop failure |
| Policy drift | Runtime violates current policy, but was created before policy existed | Remediate through migration plan |

The important distinction:

Drift is a signal. It is not automatically an incident.

A production drift program needs classification, not only detection.

4. The Reconciliation Contract

Before designing detection, define the contract.

A reconciliation system must specify:

Owner — which system owns the resource or field?
Source of truth — which artifact is authoritative?
Observation method — how is actual state observed?
Diff method — how is drift computed?
Decision policy — which drift can self-heal?
Action — revert runtime, update Git, refresh state, import, ignore, or escalate?
Evidence — what is recorded for audit?

Without this contract, drift detection becomes an alert factory.

A simple rule:

Never build drift detection without a reconciliation decision table.

Detection without response design creates fatigue.
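
One way to make the decision table concrete is a plain lookup from drift class to default response. This is a sketch under assumptions: the class keys and action names below are illustrative labels loosely derived from the drift classes in section 3, not a standard vocabulary:

```python
# Hypothetical decision table: drift class -> default reconciliation response.
DECISION_TABLE = {
    "unauthorized": "escalate",
    "emergency": "reconcile_after_incident",
    "controller_owned": "ignore",
    "provider_read": "normalize",
    "stale_desired": "update_git",
    "failed_reconciliation": "escalate",
}


def decide(drift_class: str) -> str:
    """Look up the default response for a drift class.

    Unknown classes escalate to a human instead of silently self-healing.
    """
    return DECISION_TABLE.get(drift_class, "escalate")


print(decide("controller_owned"))
print(decide("never_seen_before"))
```

Defaulting unknown classes to escalation keeps a new kind of drift from being auto-reverted before anyone has decided who owns it.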

5. Drift Taxonomy for GitOps/IaC

Use this taxonomy when designing your own platform.

5.1 Configuration Drift

Configuration drift happens when runtime configuration differs from the declared desired state.

Examples:

Kubernetes Deployment image changed manually;
replica count patched with kubectl scale;
cloud security group ingress rule added from console;
database parameter group changed manually;
CDN cache rule changed in UI.

This is the classic drift class.

The main question:

Should runtime be reverted to Git, or should Git be updated to represent the accepted runtime state?

You cannot answer that from the diff alone. You need context.

5.2 State Drift

State drift happens when IaC state no longer represents the external system correctly.

Examples:

resource deleted manually but still exists in state;
resource imported incorrectly;
resource address changed during refactor without a moved block or state migration;
provider read behavior changed after provider upgrade;
state backend restored from old snapshot.

State drift is especially dangerous because the next plan may propose a misleading action.

A common failure pattern:

Manual cloud change -> state not refreshed -> stale plan -> approval based on wrong diff -> destructive apply

5.3 Runtime Controller Drift

In Kubernetes, many controllers mutate objects after they are applied.

Examples:

defaulting webhooks add fields;
service mesh injects sidecars;
HPA changes replica counts;
cert-manager writes certificate status;
Kubernetes controllers update status fields;
admission controllers mutate labels, annotations, tolerations, or security context.

This is not always a violation.

The solution is not "turn off drift detection".

The solution is field ownership.

Git owns desired spec fields.
Runtime controllers own status and selected generated fields.
Policy owns mandatory safety fields.
Humans own neither directly in production.

5.4 Dependency Drift

Dependency drift happens when an external resource changes in a way your configuration did not directly control.

Examples:

AMI resolved by a data source changes;
cloud provider default TLS policy changes;
managed Kubernetes version auto-upgrades;
SaaS provider changes default behavior;
Helm chart dependency version resolves differently;
container tag moves because tag was mutable.

This is why production systems should prefer immutable references:

image digests instead of mutable tags;
pinned provider versions;
pinned module versions;
pinned chart versions;
explicit region/account/cluster targets.

5.5 Identity and Access Drift

Identity drift is one of the highest-risk drift classes.

Examples:

runner role gains extra permissions;
GitHub OIDC trust policy becomes too broad;
Kubernetes service account receives new cluster role binding;
cloud admin manually grants user access;
break-glass role is not revoked after incident;
external secret operator gains access to wider path.

Identity drift should rarely auto-heal silently. It should usually trigger high-severity review.

5.6 Policy Drift

Policy drift means runtime was once allowed but is no longer allowed by current policy.

Examples:

old resources lack mandatory tags;
old buckets are not encrypted with current standard;
old workloads run without current Pod Security settings;
old IAM policies violate new least-privilege model.

This is not the same as unauthorized drift.

Policy drift often requires migration, not immediate deletion.

5.7 Secret Drift

Secret drift happens when declared references, external secret values, Kubernetes Secret objects, and consuming workloads are inconsistent.

Examples:

secret rotated in Vault but Kubernetes Secret not refreshed;
SOPS-encrypted file changed but controller failed to decrypt;
External Secrets Operator lost permission;
workload still uses old mounted secret until restart;
secret key renamed in external backend.

Secret drift requires extra care because logging diffs may expose sensitive values.

5.8 Cost and Capacity Drift

Cost and capacity drift happens when runtime cost or capacity deviates from the expected model.

Examples:

instance type manually upgraded;
autoscaler expands beyond budget;
logging retention changed;
expensive managed database feature enabled;
unused resources remain after failed destroy.

Cost drift is often detected outside GitOps tooling through billing, inventory, or cloud asset systems.

6. Drift Detection Points

Do not rely on one detection mechanism.

A serious platform has multiple observation points.

| Detection point | Detects | Weakness |
| --- | --- | --- |
| PR plan | Proposed desired-state changes | Does not detect runtime changes after merge |
| Scheduled IaC drift plan | Cloud/resource drift | Can be expensive and noisy |
| GitOps diff/sync status | Cluster desired-vs-live drift | Limited to resources managed by controller |
| Admission audit | Runtime mutation attempts | Does not detect old resources by itself |
| Cloud config/inventory scan | Broad cloud posture drift | May not know Git ownership |
| Kubernetes audit logs | Manual patch/apply events | Requires event correlation |
| Policy scanner | Compliance drift | May lack safe remediation path |
| Secret sync status | Secret delivery drift | Must avoid secret disclosure |
| Runtime synthetic probe | Behavioral drift | Does not map directly to config field |

The best systems correlate them.

For example:

Argo app OutOfSync + Kubernetes audit patch by human + no matching PR = likely unauthorized runtime drift

Or:

OpenTofu refresh-only plan detects changed security group + cloud audit log actor is break-glass role + incident ticket exists = emergency drift

The detection is stronger when signals are joined.
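
The two correlations above can be sketched as a small classifier. The `Signals` fields and the returned labels are assumptions made for this example; a real platform would populate them from its GitOps controller, audit logs, and ticketing system:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    out_of_sync: bool               # GitOps diff or IaC plan reports a difference
    audit_actor: Optional[str]      # who mutated the resource per audit logs, if known
    actor_is_human: bool            # True for a person or break-glass role, False for a controller
    matching_pr: bool               # a merged PR explains the change
    incident_ticket: Optional[str]  # incident or emergency change record, if any


def classify(s: Signals) -> str:
    """Join independent signals into one drift classification."""
    if not s.out_of_sync:
        return "NO_DRIFT"
    if s.audit_actor and s.incident_ticket:
        return "EMERGENCY_DRIFT"
    if s.actor_is_human and not s.matching_pr:
        return "LIKELY_UNAUTHORIZED_RUNTIME_DRIFT"
    return "UNCLASSIFIED"


print(classify(Signals(True, "alice@example.com", True, False, None)))
print(classify(Signals(True, "break-glass-role", True, False, "INC-9127")))
```

Anything that does not match a known pattern stays `UNCLASSIFIED` so that it is routed to a person rather than guessed at.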

7. IaC Drift Detection with Terraform/OpenTofu

Terraform/OpenTofu-style IaC is stateful. Drift detection must consider both config and state.

OpenTofu plan can be used to compare configuration with real infrastructure, and refresh-only mode is specifically used to update state and root outputs to match remote objects changed outside the normal workflow.

The simplified model:

configuration + recorded state -> refresh against remote objects -> diff -> proposed changes

7.1 Normal Plan vs Refresh-Only Plan

Use the distinction carefully.

| Mode | Purpose | Typical use |
| --- | --- | --- |
| Normal plan | Compare config + state + remote and propose changes to make remote match config | PR validation, pre-apply diff |
| Refresh-only plan | Update state to match remote changes without proposing config-driven infrastructure mutation | Reconciling known out-of-band changes into state |

A dangerous mistake is using refresh-only as if it were remediation.

Refresh-only does not make remote match Git. It makes recorded state match remote.

That can be correct after an approved emergency change. It can be wrong if the runtime drift should be reverted.

Decision rule:

If runtime is wrong -> do not refresh state as acceptance.
If state is stale but runtime is accepted -> refresh/import/migrate state intentionally.

7.2 Scheduled Drift Plan Pattern

A production drift job should not be a blind cron that runs every stack with admin credentials.

It needs:

stack registry;
ownership metadata;
risk tier;
credential boundary;
lock discipline;
rate limit;
evidence store;
alert routing;
suppression and exception policy;
stale result detection.

Example conceptual workflow:

name: iac-drift-detection

on:
  schedule:
    - cron: "17 * * * *"
  workflow_dispatch: {}

jobs:
  discover-stacks:
    runs-on: platform-runner
    outputs:
      matrix: ${{ steps.registry.outputs.matrix }}
    steps:
      - uses: actions/checkout@v4
      - id: registry
        run: ./platform/scripts/list-drift-eligible-stacks.sh

  drift:
    needs: discover-stacks
    strategy:
      fail-fast: false
      max-parallel: 4
      matrix: ${{ fromJson(needs.discover-stacks.outputs.matrix) }}
    runs-on: iac-runner-${{ matrix.riskTier }}
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Assume stack identity
        run: ./platform/scripts/assume-stack-role.sh "${{ matrix.stack }}"
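      - name: Run drift plan
        id: plan
        run: |
          tofu init -input=false
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
          # Capture the code so drift (2) is recorded as a finding, not a failed job.
          set +e
          tofu plan -input=false -detailed-exitcode -out=drift.tfplan
          code=$?
          set -e
          echo "exitcode=${code}" >> "$GITHUB_OUTPUT"
          if [ "${code}" -ne 0 ] && [ "${code}" -ne 2 ]; then exit "${code}"; fi
      - name: Export evidence
        if: always()
        run: |
          # drift.tfplan is missing when the plan errored, hence "|| true".
          tofu show -json drift.tfplan > drift-plan.json || true
          ./platform/scripts/publish-drift-evidence.sh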

This is not copy-paste production code. It is a shape.

In a real implementation, you need careful exit-code handling, secrets masking, backend locking, log scrubbing, and stack-specific credentials.

7.3 Drift Exit Codes

With -detailed-exitcode, Terraform/OpenTofu plan reports three distinct outcomes through its exit code:

0: plan succeeded, no changes;
1: error;
2: plan succeeded, changes present.

Your pipeline must not collapse these into pass/fail only.

A drift result should become a structured object:

{
  "stack": "prod/eu-west-1/payments/network",
  "result": "DRIFT_DETECTED",
  "severity": "HIGH",
  "owner": "platform-network",
  "resourceChanges": 3,
  "destructiveChanges": 0,
  "identityChanges": 1,
  "actorCorrelation": "cloudtrail:found-human-change",
  "recommendedAction": "OPEN_RECONCILIATION_PR",
  "evidenceUri": "s3://evidence/iac-drift/2026-07-03/..."
}

The exit code is not the product.

The drift decision is the product.
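
A hedged sketch of turning a plan run into such an object. It assumes the plan was exported with `tofu show -json` and relies only on the `resource_changes[].change.actions` field of that JSON. The function name `summarize_plan` and the output keys are illustrative, and owner, severity, and audit correlation would come from other systems:

```python
import json


def summarize_plan(exit_code: int, plan_json: str, stack: str) -> dict:
    """Map a -detailed-exitcode result plus plan JSON to a structured drift result."""
    if exit_code == 0:
        return {"stack": stack, "result": "NO_DRIFT",
                "resourceChanges": 0, "destructiveChanges": 0}
    if exit_code != 2:
        # Exit code 1 (or anything unexpected) is an error, never "no drift".
        return {"stack": stack, "result": "ERROR"}

    plan = json.loads(plan_json)
    # Keep only resources whose planned actions are something other than no-op.
    changes = [rc for rc in plan.get("resource_changes", [])
               if rc["change"]["actions"] != ["no-op"]]
    # Replacement appears as ["delete", "create"] or ["create", "delete"].
    destructive = [rc for rc in changes if "delete" in rc["change"]["actions"]]
    return {"stack": stack, "result": "DRIFT_DETECTED",
            "resourceChanges": len(changes),
            "destructiveChanges": len(destructive)}


sample = json.dumps({"resource_changes": [
    {"address": "aws_security_group.app_public", "change": {"actions": ["update"]}},
    {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
]})
print(summarize_plan(2, sample, "prod/eu-west-1/payments/network"))
```

Note that exit code 2 from a normal plan only says that changes are present. Whether they come from runtime drift or from unapplied configuration still has to be classified.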

8. GitOps Drift Detection in Kubernetes

GitOps controllers continuously compare desired and actual state.

OpenGitOps describes GitOps-managed systems as declarative, versioned/immutable, automatically pulled, and continuously reconciled. In practice, that means a controller like Argo CD or Flux observes actual cluster state and tries to converge it to the desired state from source.

8.1 Argo CD Drift Model

Argo CD exposes two concepts that are often confused:

| Concept | Meaning |
| --- | --- |
| Sync status | Whether live resources match desired manifests |
| Health status | Whether the application appears operational according to health checks |

An application can be:

Synced but unhealthy;
OutOfSync but still serving traffic;
Synced and healthy;
OutOfSync and degraded.

Do not treat sync status as availability.

Auto-sync and self-heal should be enabled intentionally, not globally by ideology.

For low-risk stateless app resources, self-heal may be appropriate.

For CRDs, database operators, external secrets, admission controllers, or cluster-critical networking, self-heal may need staged rollout or manual gates.

8.2 Flux Drift Model

Flux uses composable controllers. A Kustomization or HelmRelease has status, conditions, reconciliation intervals, dependencies, and applied revision information.

This means drift detection may be spread across:

GitRepository/OCIRepository/HelmRepository source status;
Kustomization status;
HelmRelease status;
Kubernetes events;
notification-controller events;
controller metrics.

Flux encourages thinking in controller graph terms:

source fetched -> manifests rendered -> secrets decrypted -> objects applied -> readiness observed

A Flux drift alert without source status is often incomplete.

You need to know whether the controller could fetch source, render manifests, decrypt secrets, apply objects, and observe readiness.

8.3 Desired-vs-Live Diff Is Not Enough

Kubernetes object diffs can be noisy.

Sources of noise include:

defaulted fields;
status fields;
managed fields;
admission mutation;
generated labels/annotations;
controller-injected sidecars;
unordered lists if tooling normalizes poorly;
timestamps and generated names;
server-side apply ownership differences.

A platform must define field ownership rules.

Example:

apiVersion: platform.example.com/v1
kind: DriftOwnershipRule
metadata:
  name: deployment-runtime-owned-fields
spec:
  resource:
    group: apps
    kind: Deployment
  gitOwned:
    - /spec/template/spec/containers
    - /spec/template/spec/securityContext
    - /spec/selector
  runtimeOwned:
    - /status
    - /metadata/managedFields
  controllerOwned:
    - path: /spec/replicas
      owner: horizontal-pod-autoscaler
      condition: hpa-enabled

This is a conceptual policy object. Your implementation may use Argo CD ignore differences, Kyverno policies, Gatekeeper constraints, server-side apply ownership, custom diff normalization, or controller-specific settings.

The key is explicitness.
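
As an illustration of how such ownership rules might be applied, the sketch below filters the changed paths of a Deployment diff down to those that still need a decision. The constants mirror the conceptual rule above; the function names are hypothetical:

```python
GIT_OWNED = [
    "/spec/template/spec/containers",
    "/spec/template/spec/securityContext",
    "/spec/selector",
]
RUNTIME_OWNED = ["/status", "/metadata/managedFields"]


def under(path: str, prefix: str) -> bool:
    """True if path is the prefix itself or a descendant of it."""
    return path == prefix or path.startswith(prefix + "/")


def owner_of(path: str, hpa_enabled: bool) -> str:
    if path == "/spec/replicas" and hpa_enabled:
        return "horizontal-pod-autoscaler"
    if any(under(path, p) for p in RUNTIME_OWNED):
        return "runtime"
    if any(under(path, p) for p in GIT_OWNED):
        return "git"
    return "unowned"


def actionable_drift(changed_paths: list[str], hpa_enabled: bool) -> list[str]:
    """Keep Git-owned and unowned paths; drop controller- and runtime-owned ones."""
    return [p for p in changed_paths
            if owner_of(p, hpa_enabled) in ("git", "unowned")]


changed = ["/status/readyReplicas", "/spec/replicas",
           "/spec/template/spec/containers/0/image"]
print(actionable_drift(changed, hpa_enabled=True))
```

Paths with no declared owner are deliberately kept in the result, so a gap in the ownership model shows up as a finding instead of being silently ignored.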

9. Auto-Heal Is a Privilege

Auto-heal sounds attractive:

If runtime differs from Git, revert it automatically.

That is sometimes correct.

It is also sometimes how you turn a live incident into a bigger incident.

Use this rule:

Auto-heal is safe only when ownership is unambiguous, the desired state is fresh, the action is reversible, and the blast radius is bounded.

9.1 Auto-Heal Decision Matrix

| Drift type | Auto-heal? | Reason |
| --- | --- | --- |
| Manual label change on stateless app | Usually yes | Low blast radius, Git owns field |
| Manual image patch in prod | Usually yes, plus alert | High integrity concern |
| HPA replica change | Usually no | HPA may own replica count |
| Deleted ConfigMap used by app | Maybe | Depends on rollout impact |
| Security group port opened manually | Usually no silent auto-heal; alert first | Potential incident/evidence concern |
| IAM policy widened manually | No silent auto-heal | Security-critical; preserve forensic context |
| Emergency database parameter change | No | May be intentionally stabilizing production |
| Sidecar injected by mesh | No | Controller-owned mutation |
| Provider default read noise | No | Normalize/suppress, do not churn |

Silent auto-heal should be rare for security-critical infra.

The more important the resource, the more you should prefer:

detect -> classify -> route -> approve -> reconcile -> verify

9.2 Bounded Auto-Heal

A mature platform often supports bounded auto-heal policies:

apiVersion: platform.example.com/v1
kind: AutoHealPolicy
metadata:
  name: stateless-app-safe-fields
spec:
  appliesTo:
    resourceKinds:
      - Deployment
      - Service
      - ConfigMap
    namespaces:
      matchLabels:
        platform.example.com/tier: standard
  allowedPaths:
    - /metadata/labels
    - /metadata/annotations
    - /spec/template/spec/containers/*/resources
  deniedPaths:
    - /spec/template/spec/containers/*/image
    - /spec/template/spec/securityContext
  maxObjectsPerReconcile: 5
  requireHealthyBeforeHeal: true
  evidenceRequired: true

Again, the object is conceptual. The principle is concrete.

Auto-heal must have scope.
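
A sketch of how the path and object-count limits of such a policy could be evaluated. It covers only `allowedPaths`, `deniedPaths`, and `maxObjectsPerReconcile`; namespace selection and health checks are left out, and the function names are assumptions:

```python
from fnmatch import fnmatchcase

ALLOWED = ["/metadata/labels", "/metadata/annotations",
           "/spec/template/spec/containers/*/resources"]
DENIED = ["/spec/template/spec/containers/*/image",
          "/spec/template/spec/securityContext"]
MAX_OBJECTS = 5


def covers(pattern: str, path: str) -> bool:
    """True if pattern matches path or one of its ancestors, segment by segment.

    Matching per segment keeps "*" from spanning across "/" boundaries.
    """
    pat, segs = pattern.split("/"), path.split("/")
    if len(segs) < len(pat):
        return False
    return all(fnmatchcase(s, p) for p, s in zip(pat, segs))


def may_auto_heal(changed_paths: list[str], object_count: int) -> bool:
    if object_count > MAX_OBJECTS:
        return False
    for path in changed_paths:
        if any(covers(d, path) for d in DENIED):
            return False  # an explicit deny always wins
        if not any(covers(a, path) for a in ALLOWED):
            return False  # anything not explicitly allowed is out of scope
    return True


print(may_auto_heal(["/metadata/labels/team"], object_count=1))
print(may_auto_heal(["/spec/template/spec/containers/0/image"], object_count=1))
```

The check is deny-first and default-deny: a single changed path outside the allowed set disqualifies the whole reconcile from automatic healing.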

10. Drift Budgets

A drift budget is the maximum tolerated amount or age of drift for a domain.

It works like an error budget, but for state consistency.

Examples:

| Domain | Drift budget |
| --- | --- |
| Production IAM | 0 unclassified high-risk drifts; detection under 15 minutes |
| Public network ingress | 0 unauthorized drifts; immediate alert |
| Stateless app metadata | Auto-heal within 10 minutes |
| Non-prod cost resources | Detect daily, remediate weekly |
| Legacy policy compliance | 30-day migration window |
| Kubernetes workload image digest | 0 mutable tag drift in production |
| Secret sync freshness | External secret reflected within 5 minutes |

Drift budgets force you to state what matters.

Without them, every drift alert appears equally urgent.

10.1 Drift SLO Examples

SLO: 99% of production GitOps applications reconcile desired state within 5 minutes of Git revision availability.

SLO: 100% of critical IAM drift findings are classified within 30 minutes.

SLO: 95% of non-critical IaC drift findings are either remediated or explicitly accepted within 7 days.

SLO: 0 production workloads run container images without digest pinning after promotion.

These SLOs are more useful than "we run drift detection hourly".

Frequency is an implementation detail.

Business risk is the target.
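
As a worked example, the classification SLO above ("classified within 30 minutes") can be measured from finding timestamps. This is a sketch: the `detectedAt` and `classifiedAt` keys and the `classification_slo` function are assumed names, and unclassified findings are counted as misses:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, TypedDict


class Finding(TypedDict):
    detectedAt: datetime
    classifiedAt: Optional[datetime]  # None while the finding is unclassified


def classification_slo(findings: list[Finding], limit: timedelta) -> float:
    """Fraction of findings classified within the limit."""
    if not findings:
        return 1.0  # nothing to classify, nothing missed
    met = sum(
        1 for f in findings
        if f["classifiedAt"] is not None
        and f["classifiedAt"] - f["detectedAt"] <= limit
    )
    return met / len(findings)


t0 = datetime(2026, 7, 3, 10, 0, tzinfo=timezone.utc)
findings: list[Finding] = [
    {"detectedAt": t0, "classifiedAt": t0 + timedelta(minutes=10)},
    {"detectedAt": t0, "classifiedAt": None},
]
print(classification_slo(findings, timedelta(minutes=30)))  # 0.5
```

Counting unclassified findings as misses matters: otherwise the metric improves when nobody looks at the queue.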

11. Drift Classification Engine

A practical platform should convert raw diffs into classified drift findings.

11.1 Required Context

Useful context includes:

resource owner;
environment;
data classification;
service criticality;
change freeze status;
matching PR/merge event;
matching incident ticket;
audit log actor;
controller ownership metadata;
policy version;
last successful reconciliation;
last successful apply;
last provider/controller upgrade;
known exception windows.

A raw diff without context cannot support good decisions.

11.2 Drift Finding Schema

A drift finding should be structured.

{
  "findingId": "drift-20260703-000184",
  "detectedAt": "2026-07-03T10:22:31Z",
  "domain": "cloud-network",
  "resourceRef": "aws_security_group.app_public",
  "environment": "prod",
  "owner": "platform-network",
  "sourceOfTruth": "git://org/infra-live/prod/eu-west-1/network",
  "stateBackend": "s3://tfstate-prod/network",
  "driftClass": "UNAUTHORIZED_RUNTIME_DRIFT",
  "risk": "CRITICAL",
  "changedFields": [
    "/ingress/0/cidr_blocks",
    "/ingress/0/from_port"
  ],
  "auditCorrelation": {
    "actor": "alice@example.com",
    "source": "cloudtrail",
    "eventTime": "2026-07-03T10:11:02Z"
  },
  "recommendedAction": "INCIDENT_AND_REVERT",
  "autoHealAllowed": false,
  "evidenceUri": "s3://platform-evidence/drift/2026/07/03/drift-20260703-000184"
}

This schema becomes the interface between drift detection, alerting, incident response, and audit.

12. Reconciliation Actions

There are several possible reconciliation actions. Choosing the wrong one can be worse than doing nothing.

| Action | What it means | Use when |
| --- | --- | --- |
| Runtime revert | Apply desired state to runtime | Runtime is wrong and Git is authoritative |
| Git update PR | Change desired state to match accepted reality | Runtime change was approved or should become standard |
| State refresh | Update recorded state to match accepted runtime | State is stale and runtime is accepted |
| Import | Bring existing resource under IaC control | Resource exists but not tracked |
| State move | Preserve identity across refactor | Module/resource address changed |
| Ignore/suppress | Declare field outside Git ownership | Runtime/controller legitimately owns field |
| Incident | Preserve forensic trail and coordinate response | Drift is security-critical or unknown |
| Destroy/recreate | Replace invalid resource | Safe only with strong blast-radius control |

The key rule:

Reconciliation should restore the correct control relationship, not merely remove the alert.

13. Manual Hotfix Reconciliation

Manual hotfixes happen.

Pretending they never happen is immature.

The mature question is: how does the system return to a governed state?

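13.1 Hotfix Lifecycle

incident or emergency change record -> time-bound break-glass access -> manual change -> drift classified as emergency drift -> reconciliation PR or state action within the defined window -> Git is the source of truth again -> evidence preserved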

13.2 Hotfix Rules

A production-ready rule set:

Hotfix must be tied to an incident or emergency change record.
Break-glass identity must be time-bound.
Drift detector must classify hotfix drift differently from unknown drift.
A reconciliation PR or state action must follow within a defined window.
Git must return to being the source of truth.
Evidence must preserve who changed what, when, and why.

Manual change is not the real problem.

Unreconciled manual change is the problem.
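
Two of these rules lend themselves to an automated check: the link to an incident record and the reconciliation window. A minimal sketch, assuming a hypothetical `Hotfix` record and a window chosen by the platform:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Hotfix:
    incident_id: Optional[str]         # incident or emergency change record
    applied_at: datetime               # when the manual change was made
    reconciled_at: Optional[datetime]  # when the PR or state action landed


def hotfix_violations(h: Hotfix, now: datetime, window: timedelta) -> list[str]:
    """Return the hotfix rules this record currently violates."""
    violations = []
    if not h.incident_id:
        violations.append("no incident or emergency change record")
    if h.reconciled_at is None and now - h.applied_at > window:
        violations.append("reconciliation window exceeded")
    return violations


applied = datetime(2026, 7, 3, 10, 11, tzinfo=timezone.utc)
now = applied + timedelta(days=3)
print(hotfix_violations(Hotfix("INC-9127", applied, None), now, timedelta(days=2)))
```

A check like this turns "unreconciled manual change" from a cultural expectation into a finding with an owner and a deadline.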

14. Ignore Rules Are Dangerous

Every GitOps/IaC platform eventually needs ignore rules.

Examples:

ignore Kubernetes status;
ignore HPA-controlled replica count;
ignore service mesh injected annotations;
ignore provider-computed values;
ignore cert-manager generated fields.

But ignore rules are a risk surface.

Bad ignore rule:

ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/template/spec/containers

This may hide image drift, resource drift, securityContext drift, and command drift.

Better rule:

ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /status
      - /metadata/managedFields

And if HPA owns replicas:

ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas

Even then, document why.

14.1 Ignore Rule Review Checklist

Before accepting an ignore rule, ask:

Which controller owns the ignored field?
Can an attacker abuse this ignored field?
Would the ignored field affect identity, image, network, privilege, persistence, or command execution?
Is the rule scoped by kind, name, namespace, or label?
Does the rule expire?
Is there another detector covering this field?
Is the exception visible in evidence?

Ignore rules must be governed like code.
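
Part of the checklist can be linted mechanically. The sketch below takes a rule shaped like the `ignoreDifferences` entries above. The `owner` and `expires` keys are hypothetical platform metadata added for this example, and the sensitive-path list is an assumption that a real platform would extend:

```python
SENSITIVE = [
    "/spec/template/spec/containers",
    "/spec/template/spec/securityContext",
    "/spec/template/spec/serviceAccountName",
]


def overlaps(a: str, b: str) -> bool:
    """True if one JSON pointer equals, contains, or is contained by the other."""
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


def review_ignore_rule(rule: dict) -> list[str]:
    problems = []
    for ptr in rule.get("jsonPointers", []):
        for s in SENSITIVE:
            if overlaps(ptr, s):
                problems.append(f"{ptr} overlaps sensitive path {s}")
    if not rule.get("name") and not rule.get("namespace"):
        problems.append("not scoped by name or namespace")
    if not rule.get("owner"):
        problems.append("no documented field owner")
    if not rule.get("expires"):
        problems.append("no expiry")
    return problems


bad = {"group": "apps", "kind": "Deployment",
       "jsonPointers": ["/spec/template/spec/containers"]}
for problem in review_ignore_rule(bad):
    print(problem)
```

Questions such as "can an attacker abuse this field?" still need a human reviewer; the lint only catches the rules that should never reach review in the first place.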

15. Drift and Provider Upgrades

Provider upgrades are a major source of drift noise.

A provider may change:

default value interpretation;
computed field representation;
diff suppression logic;
resource schema;
read behavior;
replacement conditions;
import behavior.

A serious platform does not upgrade providers directly in production stacks.

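Recommended flow:

upgrade provider in a non-production stack -> run plan with no config changes -> classify resulting diffs as provider-read noise or real changes -> normalize or suppress noise -> promote the pinned version to production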

Provider upgrade drift must be separated from unauthorized runtime drift.

If you do not separate them, engineers will learn to distrust drift alerts.

16. Drift and Immutable Artifacts

Mutable references cause dependency drift.

Bad:

image: ghcr.io/acme/payments:latest

Better:

image: ghcr.io/acme/payments@sha256:4f3c...

Bad:

module "vpc" {
  source = "git::https://github.com/acme/terraform-vpc.git"
}

Better:

module "vpc" {
  source = "git::https://github.com/acme/terraform-vpc.git?ref=v3.4.1"
}

Bad:

provider "aws" {
  region = var.region
}

Without a version constraint, provider resolution becomes an implicit dependency.

Better:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

A constraint such as ~> 5.0 still permits any 5.x release, so also commit the dependency lock file (.terraform.lock.hcl), which records the exact selected version and its checksums.

Immutable references reduce drift surface.

They do not eliminate drift, but they make it explainable.
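
Checks like these are easy to automate in CI. A small sketch, assuming image references and module sources are available as strings; the function names are illustrative, and the module check only confirms that a `ref` is present, not that it points at an immutable commit:

```python
import re

# A digest-pinned image ends with @sha256: followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    return bool(DIGEST_RE.search(image))


def has_git_ref(module_source: str) -> bool:
    """True if a git:: module source carries an explicit ?ref=... selector."""
    return re.search(r"[?&]ref=[^&]+", module_source) is not None


print(is_digest_pinned("ghcr.io/acme/payments:latest"))                  # False
print(is_digest_pinned("ghcr.io/acme/payments@sha256:" + "4f" * 32))     # True
print(has_git_ref("git::https://github.com/acme/terraform-vpc.git"))     # False
print(has_git_ref("git::https://github.com/acme/terraform-vpc.git?ref=v3.4.1"))  # True
```

A Git tag can still be moved by whoever controls the repository, so stricter platforms may require a full commit SHA as the ref.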

17. Drift Detection for Secrets

Secret drift cannot be handled like a normal diff.

You usually cannot print the value.

You need metadata-based detection:

secret version;
checksum/hash, if safe and non-reversible;
last refresh time;
external secret backend version;
controller condition;
workload restart status;
consuming pod generation;
expiration timestamp;
certificate not-before/not-after;
rotation policy compliance.

Example finding:

{
  "resource": "ExternalSecret/prod/payments/db-credentials",
  "driftClass": "SECRET_DELIVERY_DRIFT",
  "externalVersion": "vault:v42",
  "kubernetesSecretVersion": "vault:v40",
  "lastRefreshAgeMinutes": 37,
  "consumerRolloutGeneration": "stale",
  "risk": "HIGH",
  "recommendedAction": "RESTART_CONSUMERS_AFTER_SECRET_SYNC"
}

Do not log secret values to prove drift.

Prove freshness and lineage instead.
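
A sketch of metadata-only detection, assuming version identifiers and refresh age are available from the secret backend and the sync controller. The 5-minute default mirrors the freshness budget in section 10. The keyed fingerprint uses HMAC-SHA256 so that a logged fingerprint cannot be used to test guesses without the key; all names here are hypothetical:

```python
import hashlib
import hmac


def fingerprint(value: bytes, key: bytes) -> str:
    """Keyed, non-reversible fingerprint that is safe to compare and store."""
    return hmac.new(key, value, hashlib.sha256).hexdigest()


def secret_delivery_drift(external_version: str, synced_version: str,
                          last_refresh_age_minutes: int,
                          max_age_minutes: int = 5) -> list[str]:
    """Return reasons the delivered secret is considered drifted, without reading it."""
    reasons = []
    if external_version != synced_version:
        reasons.append("version mismatch")
    if last_refresh_age_minutes > max_age_minutes:
        reasons.append("stale refresh")
    return reasons


print(secret_delivery_drift("vault:v42", "vault:v40", 37))
```

The fingerprint key itself is a secret and should live with the detector, not in the evidence store.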

18. Drift Detection for Databases

Database drift is hard because schema and data have different semantics.

Examples:

table created manually;
index added manually;
column type changed outside migration;
extension enabled manually;
sequence altered;
privileged user granted access;
data migration partially completed.

Unlike with stateless config, a blind rollback can destroy data.

Database drift response should be migration-driven.

Detect -> classify -> generate migration PR -> verify backward compatibility -> apply through database change pipeline

Do not let a generic GitOps controller auto-heal database schema drift unless you deeply understand the blast radius.

19. Drift Alert Routing

Alerts should go to owners, not to a generic platform channel forever.

Routing dimensions:

service owner;
platform domain owner;
environment;
severity;
data classification;
resource kind;
drift class;
affected customer tier;
compliance scope;
incident correlation.

Example routing policy:

routes:
  - match:
      driftClass: UNAUTHORIZED_RUNTIME_DRIFT
      domain: iam
      environment: prod
    severity: critical
    notify:
      - security-oncall
      - platform-identity-oncall
    createIncident: true

  - match:
      driftClass: CONTROLLER_OWNED_DRIFT
    severity: info
    notify: []
    createIncident: false

  - match:
      driftClass: PROVIDER_READ_NOISE
    severity: low
    notify:
      - iac-platform-team
    createIssue: true

Alert routing is part of the drift product.
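
The routing policy above can be evaluated with a first-match rule walk. This sketch copies two of the routes into Python data; the catch-all `DEFAULT_ROUTE` and its `platform-triage` target are assumptions added so that unmatched findings are never dropped:

```python
ROUTES = [
    {"match": {"driftClass": "UNAUTHORIZED_RUNTIME_DRIFT",
               "domain": "iam", "environment": "prod"},
     "severity": "critical",
     "notify": ["security-oncall", "platform-identity-oncall"],
     "createIncident": True},
    {"match": {"driftClass": "CONTROLLER_OWNED_DRIFT"},
     "severity": "info", "notify": [], "createIncident": False},
]

# Hypothetical catch-all for findings no route matches.
DEFAULT_ROUTE = {"match": {}, "severity": "medium",
                 "notify": ["platform-triage"], "createIncident": False}


def route(finding: dict) -> dict:
    """Return the first route whose match keys all equal the finding's values."""
    for r in ROUTES:
        if all(finding.get(k) == v for k, v in r["match"].items()):
            return r
    return DEFAULT_ROUTE


finding = {"driftClass": "UNAUTHORIZED_RUNTIME_DRIFT",
           "domain": "iam", "environment": "prod"}
print(route(finding)["notify"])
```

Because the first match wins, more specific routes belong above more general ones.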

20. Drift Evidence

For regulated or high-trust systems, drift handling must produce evidence.

Evidence should answer:

What resource drifted?
What was the desired state?
What was the observed actual state?
When was drift detected?
How long did it exist?
Who/what caused it, if known?
Which policy classified it?
Who approved remediation?
What action was taken?
What verified closure?

Evidence artifacts may include:

plan JSON;
normalized diff;
rendered manifests;
controller status;
cloud audit log event reference;
Kubernetes audit event reference;
PR link;
incident/change ticket;
approval record;
post-reconciliation verification result.

A good evidence record is machine-readable and human-auditable.

21. Drift Dashboard Design

A drift dashboard should not be a wall of red.

Useful dashboard sections:

21.1 Executive View

open critical drift count;
mean time to classify drift;
mean time to reconcile drift;
drift by environment;
drift by domain;
drift budget burn;
recurring drift sources.

21.2 Platform View

stacks with stale drift checks;
failed drift jobs;
noisy provider versions;
lock contention;
longest-running unresolved drift;
top ignored fields;
auto-heal success/failure.

21.3 Service Owner View

drift affecting my services;
required actions;
reconciliation PRs waiting for review;
exceptions expiring soon;
last successful reconciliation.

21.4 Security View

identity drift;
network exposure drift;
secret freshness drift;
unsigned/unverified artifact drift;
break-glass changes not reconciled.

A dashboard is useful only if it drives decisions.

22. Implementation Pattern: Drift Registry

Create a registry of drift-managed units.

apiVersion: platform.example.com/v1
kind: DriftManagedUnit
metadata:
  name: payments-prod-network
spec:
  type: opentofu-stack
  owner: platform-network
  environment: prod
  riskTier: critical
  source:
    repo: github.com/acme/infra-live
    path: prod/eu-west-1/payments/network
  state:
    backend: s3
    key: prod/eu-west-1/payments/network.tfstate
  credentials:
    oidcRole: arn:aws:iam::123456789012:role/iac-drift-payments-network
  schedule:
    frequency: hourly
  policies:
    autoHeal: false
    requireIncidentForManualChange: true
  evidence:
    retention: 7y

This registry lets you reason about drift as a platform capability.

It also avoids scanning everything with one overpowered role.

23. Implementation Pattern: Reconciliation PR

When runtime drift should become desired state, generate a PR.

PR content should include:

drift finding ID;
resource reference;
detected fields;
audit correlation;
reason for accepting runtime as desired;
generated config changes;
policy evaluation;
risk assessment;
rollback plan;
reviewers.

Example PR body:

## Drift Reconciliation

Finding: drift-20260703-000184
Environment: prod
Owner: platform-network
Resource: aws_security_group.app_public

### Classification
Accepted emergency drift from INC-9127.
Runtime change restored customer traffic during regional LB incident.

### Proposed Desired-State Update
This PR updates the declared ingress rule to match the approved temporary production state.

### Expiration
This rule expires on 2026-07-10 and must be removed after vendor remediation.

### Evidence
- Cloud audit event: cloudtrail://event/...
- Incident: INC-9127
- Drift plan: s3://platform-evidence/...

This is much better than telling engineers to "fix drift".
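A PR body like the example can be generated from the drift finding instead of written by hand. The sketch below is one possible shape, assuming the finding arrives as a dictionary; every key name (`finding_id`, `expires_on`, `evidence`, and so on) is hypothetical. It refuses to render when required context is missing, so a reconciliation PR cannot be opened without evidence.

```python
# Hypothetical finding keys; a real finding schema will differ.
REQUIRED_FIELDS = (
    "finding_id", "environment", "owner", "resource", "classification", "evidence",
)


def render_pr_body(finding: dict) -> str:
    """Render a reconciliation PR body, or fail if required context is missing."""
    missing = [key for key in REQUIRED_FIELDS if not finding.get(key)]
    if missing:
        raise ValueError("incomplete finding, missing: " + ", ".join(missing))

    lines = [
        "## Drift Reconciliation",
        "",
        f"Finding: {finding['finding_id']}",
        f"Environment: {finding['environment']}",
        f"Owner: {finding['owner']}",
        f"Resource: {finding['resource']}",
        "",
        "### Classification",
        finding["classification"],
    ]
    # The expiry section is optional: only temporary acceptances carry one.
    if finding.get("expires_on"):
        lines += ["", "### Expiration", f"This change expires on {finding['expires_on']}."]
    lines += ["", "### Evidence"]
    # Evidence is a list of (label, reference) pairs, passed through unchanged.
    lines += [f"- {label}: {reference}" for label, reference in finding["evidence"]]
    return "\n".join(lines)
```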

24. Failure Playbooks

24.1 Drift Detector Fails

Symptoms:

scheduled drift jobs fail;
no drift results for stack;
backend lock timeout;
provider authentication failure;
controller metrics missing.

Response:

Mark drift result as stale, not green.
Alert platform owner if staleness exceeds SLO.
Check runner identity and credential federation.
Check backend lock and state access.
Check provider API quota/throttling.
Re-run with bounded scope.
Store failure evidence.

A missing result does not mean there is no drift.
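The rule "stale, not green" is easiest to enforce when the status model has a third value. A minimal sketch, assuming each stack records the time of its last successful drift check; the `DriftStatus` names and the `max_age` parameter are illustrative.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class DriftStatus(Enum):
    CLEAN = "clean"
    DRIFTED = "drifted"
    STALE = "stale"  # detection itself is unhealthy; the real state is unknown


def drift_status(
    last_success: Optional[datetime],
    drift_found: bool,
    now: datetime,
    max_age: timedelta,
) -> DriftStatus:
    """Classify a stack; a missing or old result is never reported as clean."""
    if last_success is None or now - last_success > max_age:
        return DriftStatus.STALE
    return DriftStatus.DRIFTED if drift_found else DriftStatus.CLEAN


# Usage: a stack whose last good check is three hours old, with a 2h staleness limit.
now = datetime(2026, 7, 3, 12, 0, tzinfo=timezone.utc)
status = drift_status(now - timedelta(hours=3), False, now, timedelta(hours=2))
```

Staleness is checked first on purpose: an old "no drift" result carries no information about the present.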

24.2 Drift Detected on Critical IAM

Response:

Preserve evidence before mutation.
Correlate with audit logs.
Identify actor and change path.
Check incident/change ticket.
Temporarily freeze related applies if necessary.
Decide revert vs accept vs incident.
Reconcile through approved path.
Verify role/policy effective permissions.
Review trust policy and runner permissions.

24.3 GitOps Auto-Heal Loop

Symptoms:

controller repeatedly applies;
runtime controller repeatedly mutates back;
app remains OutOfSync;
API server load increases;
alerts flap.

Response:

Identify field causing diff.
Determine field owner.
Pause/suspend auto-sync if blast radius exists.
Add scoped ignore/ownership rule if runtime-owned.
Move desired ownership to correct controller.
Remove broad ignores.
Re-enable reconciliation.
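The first response step, identifying the field that causes the diff, can be approximated by counting which field paths keep reappearing across consecutive sync attempts. This is a sketch only: it assumes you can already extract the differing field paths for each attempt, and the input shape and the `min_repeats` threshold are hypothetical.

```python
from collections import Counter


def recurring_diff_fields(sync_diffs, min_repeats=3):
    """Return field paths that differed in at least `min_repeats` sync attempts.

    `sync_diffs` holds one list of differing field paths per sync attempt.
    """
    # set() ensures a path is counted once per attempt, even if listed twice.
    counts = Counter(path for diff in sync_diffs for path in set(diff))
    return sorted(path for path, seen in counts.items() if seen >= min_repeats)


# Usage: spec.replicas differs on every attempt, the annotation only once.
attempts = [
    ["spec.replicas"],
    ["spec.replicas", "metadata.annotations.example"],
    ["spec.replicas"],
]
suspects = recurring_diff_fields(attempts)
```

A field that shows up in every attempt is a candidate for a field-ownership conflict, such as a Deployment's `spec.replicas` when a HorizontalPodAutoscaler also manages it. The output is a list of suspects for the "determine field owner" step, not a list of fields to ignore.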

24.4 State Corruption Suspected

Response:

Stop applies for affected stack.
Snapshot current state backend.
Export provider inventory.
Compare desired, state, and actual resources.
Decide import/move/remove-state path.
Perform state surgery with peer review.
Run plan and policy checks.
Store evidence.
Re-enable apply.

State surgery is production surgery.

Treat it accordingly.
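The comparison step can be sketched as a three-way set comparison that feeds the import/move/remove-state decision. This assumes resources can be matched across the three planes by one shared identifier, which in practice needs a mapping between configuration addresses and provider IDs. The label names are illustrative.

```python
def classify_resources(desired: set, recorded: set, actual: set) -> dict:
    """Label every resource seen in config (desired), state (recorded), or runtime (actual)."""
    labels = {}
    for resource in sorted(desired | recorded | actual):
        in_desired = resource in desired
        in_recorded = resource in recorded
        in_actual = resource in actual
        if in_desired and in_recorded and in_actual:
            labels[resource] = "tracked"
        elif in_recorded and not in_actual:
            # State remembers something the provider no longer reports.
            labels[resource] = "state-entry-without-resource"
        elif in_actual and not in_recorded:
            # Exists at runtime but state does not know it.
            labels[resource] = "import-candidate" if in_desired else "unmanaged"
        elif in_recorded and in_actual:
            # Known and running, but no longer declared: the next apply would destroy it.
            labels[resource] = "removed-from-config"
        else:
            # Declared only: nothing recorded, nothing running.
            labels[resource] = "pending-create"
    return labels
```

The labels only narrow the decision. Whether an `import-candidate` is actually imported, or a `state-entry-without-resource` is removed from state, remains a peer-reviewed choice.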

24.5 Provider Upgrade Causes Massive Drift

Response:

Stop production rollout of provider upgrade.
Classify diffs into real vs representation changes.
Pin previous provider version if needed.
Add migration notes.
Update normalization/suppression only when safe.
Re-test non-prod.
Communicate expected diff changes to reviewers.
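Separating real changes from representation changes usually means comparing values after normalization. The sketch below treats two values as a representation change when they differ only in whitespace or JSON formatting. The normalization rules are assumptions: list order is deliberately kept significant, because some attributes are order-sensitive, and provider-specific quirks are not covered.

```python
import json


def normalize(value):
    """Reduce a value to a canonical form for comparison."""
    if isinstance(value, str):
        text = value.strip()
        # Only attempt JSON decoding for strings that look like documents,
        # so plain strings such as "123" keep their type.
        if text[:1] in ("{", "["):
            try:
                return normalize(json.loads(text))
            except ValueError:
                return text
        return text
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item) for item in value]
    return value


def classify_diff(before, after) -> str:
    """Return 'none', 'representation', or 'real' for one attribute change."""
    if before == after:
        return "none"
    if normalize(before) == normalize(after):
        return "representation"
    return "real"
```

Anything classified as `representation` is still worth listing in the migration notes, so reviewers know which diffs to expect after the upgrade.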

25. Anti-Patterns

25.1 Auto-Heal Everything

This hides incidents and can revert emergency stabilization.

25.2 Ignore Everything Noisy

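Every ignore rule removes a field from reconciliation. Ignore enough of them and Git no longer describes what is running, which eliminates the value of GitOps.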

25.3 One Global Drift Role

This violates least privilege and creates a catastrophic credential.

25.4 No Drift Ownership

Findings without owners become platform-team garbage collection.

25.5 Treating Stale Detection as Healthy

If drift detection fails, the resource is unknown, not clean.

25.6 Refreshing State to Silence Drift

A refresh rewrites recorded state to match runtime. Done without review, it accepts runtime as truth by default.

25.7 Detecting Drift Without Reconciliation Path

This creates alerts that no one can close correctly.

25.8 Broad GitOps Ignore Rules

This can hide security-critical mutations.

25.9 Mutable Artifact References

Mutable tags/modules/charts make drift hard to explain, because the same reference can resolve to different content over time.

25.10 No Evidence Store

Without evidence, drift response becomes a story, not an audit trail.

26. Production Design Checklist

Use this checklist when reviewing a drift program.

State Model

Desired, recorded, and actual state are separately understood.
Each resource has an explicit source of truth.
State ownership is documented by domain and field where needed.
State backend access is least-privilege and auditable.

Detection

IaC drift checks run with stack-scoped credentials.
GitOps controller drift is monitored by sync and health signals.
Detection failures are treated as stale/unknown, not clean.
Drift checks produce machine-readable findings.
Detection frequency maps to risk tier.

Classification

Drift classes are defined.
Findings are enriched with owner, environment, audit, PR, and incident context.
Policy decides severity and action.
Known controller-owned fields are explicitly modeled.
Provider/read noise is tracked separately.

Reconciliation

Runtime revert, Git PR, state refresh, import, ignore, and incident paths are distinct.
Auto-heal is scoped and bounded.
Security-critical drift is not silently auto-healed.
Emergency drift has a reconciliation window.
State surgery requires peer review.

Evidence

Raw diff/plan is stored.
Normalized diff is stored.
Audit correlation is stored when available.
Approval and remediation action are recorded.
Closure verification is recorded.
Retention meets compliance needs.

Operations

Drift dashboards are owner-oriented.
Alerts route to accountable teams.
Drift budgets exist for critical domains.
Recurring drift sources are reviewed.
Ignore rules expire or are periodically reviewed.
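The last item is easy to automate. A minimal sketch, assuming ignore rules are kept in a reviewable list with an optional expiry date; the `path` and `expires` keys are hypothetical.

```python
from datetime import date


def rules_needing_review(rules, today):
    """Return (path, reason) pairs for ignore rules that are expired or open-ended."""
    flagged = []
    for rule in rules:
        expires = rule.get("expires")
        if expires is None:
            flagged.append((rule["path"], "no expiry set"))
        elif expires <= today:
            flagged.append((rule["path"], "expired"))
    return flagged


# Usage: one expired rule, one open-ended rule, one still valid.
rules = [
    {"path": "spec.replicas", "expires": date(2026, 6, 1)},
    {"path": "metadata.annotations", "expires": None},
    {"path": "status", "expires": date(2027, 1, 1)},
]
flagged = rules_needing_review(rules, date(2026, 7, 3))
```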

27. Mental Model Summary

The mature model is this:

Drift detection is not a tool.
It is a control-loop assurance system.

You are not only asking:

Does production match Git?

You are asking:

Is the correct controller owning the correct state through the correct path with the correct evidence?

That question is what separates a toy GitOps installation from a production-grade GitOps/IaC platform.

28. Practical Exercise

Design a drift program for this scenario:

Company: multi-region SaaS
Cloud: AWS
Kubernetes: 12 clusters
IaC: OpenTofu + Terragrunt
GitOps: Argo CD
Secrets: External Secrets Operator + AWS Secrets Manager
Compliance: SOC2 + internal production change controls

Produce:

drift taxonomy;
drift-managed unit registry schema;
risk-tiered detection schedule;
auto-heal policy for Kubernetes apps;
no-auto-heal policy for IAM/network/database;
reconciliation PR template;
critical IAM drift runbook;
dashboard sections;
evidence schema;
drift SLOs.

Do not write tool-specific commands first.

Write the operating model first.

29. References

OpenGitOps Principles — declarative desired state, versioned and immutable desired state, automatic pull, continuous reconciliation.
OpenTofu CLI plan documentation — normal planning and refresh-only mode.
Argo CD documentation — sync status, automated sync, diffing, metrics, and application reconciliation.
Flux documentation — Kustomization reconciliation, controller events, notification-controller, and Prometheus metrics.
Kubernetes documentation — object status, admission, managed fields, and controller reconciliation model.
Terraform/OpenTofu state documentation — state as resource mapping and source of recorded infrastructure knowledge.
