Failure Modeling and Recovery Playbooks
Learn State-of-the-Art GitOps/IaC Pipeline - Part 031
Failure modeling and recovery playbooks for production-grade GitOps and IaC pipelines, covering failed plans, failed applies, stuck locks, broken controllers, bad secrets, policy incidents, drift, and emergency recovery.
Production-grade GitOps/IaC is not proven when everything works. It is proven when a change fails halfway, the state backend is locked, the GitOps controller refuses to sync, a cloud API throttles the runner, a policy rule blocks production, a secret is rotated incorrectly, or a human applies an emergency mutation outside Git — and the team can still recover without guessing.
This part is about that discipline.
A state-of-the-art GitOps/IaC pipeline must be designed as a recoverable state-transition system. Every change has a source, actor, authorization decision, plan, approval, mutation attempt, observed result, and evidence trail. Failure modeling is the practice of enumerating where that state transition can break and defining safe recovery before the failure happens.
The mindset shift is simple:
A mature platform does not merely automate success paths. It constrains and explains failure paths.
We are not going to list generic “CI failed, retry it” advice. We will model GitOps/IaC failures the way a senior platform engineer, SRE, or regulated-systems engineer should: boundary by boundary, state by state, invariant by invariant.
1. The Skill You Are Actually Building
The visible skill is “knowing how to fix pipeline failures.”
The real skill is deeper:
You are learning to preserve control-plane integrity while recovering from incomplete, ambiguous, or unsafe infrastructure transitions.
That means you must be able to answer:
- What state was intended?
- What state was approved?
- What state was actually attempted?
- What state exists now?
- Which actor or controller owns the next transition?
- Is it safer to retry, revert, roll forward, pause reconciliation, import state, repair config, unlock state, or escalate?
- What evidence proves the recovery was correct?
A junior engineer usually asks:
“How do I make the pipeline green?”
A top-tier engineer asks:
“Which invariant failed, and what is the smallest safe state transition that restores the system without hiding evidence?”
2. The Core Failure Model
Any GitOps/IaC pipeline can be modeled as a chain of state transitions.
Every failure is a transition to one of these states:
| Failure State | Meaning | Typical Example |
|---|---|---|
| RejectedBeforePlan | Change cannot be converted into an execution plan | invalid syntax, missing module, provider init failure |
| PlanGeneratedButUnsafe | Tool can produce a plan, but policy/risk rejects it | public bucket, excessive IAM, destructive replacement |
| ApprovedButStale | Approval no longer matches the actual plan/input | base branch changed, state changed, artifact expired |
| MutationNotStarted | Apply/sync never reached provider/controller mutation | runner failure, credential failure, state lock unavailable |
| MutationPartiallySucceeded | Some resources changed before failure | cloud API timeout after creating resource |
| RecordedStateDiverged | IaC state no longer correctly represents real resources | failed import, manual change, corrupted state |
| LiveStateDiverged | Kubernetes/cloud live state differs from desired state | manual edit, controller mutation, drift |
| ReconciliationBlocked | GitOps controller wants to act but cannot | admission denial, missing CRD, bad secret, RBAC |
| ReconciliationUnsafe | Controller can act but should be paused | bad desired state, runaway prune, broken policy |
| EvidenceIncomplete | The system changed but evidence is missing | manual hotfix, log retention gap, artifact deletion |
The recovery action depends on which state you are in. Treating all failures as “retry” is dangerous because retries can amplify partial failure.
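As a rough illustration, the failure states can be encoded as an enumeration so that tooling has to name a state before it is allowed to retry. This is a minimal sketch: the `RETRY_CANDIDATES` set and the `retry_is_candidate` helper are hypothetical names, and which states count as retry candidates is an assumption for illustration, not a rule from any specific tool.

```python
from enum import Enum


class FailureState(Enum):
    """Failure states from the table above."""
    REJECTED_BEFORE_PLAN = "RejectedBeforePlan"
    PLAN_GENERATED_BUT_UNSAFE = "PlanGeneratedButUnsafe"
    APPROVED_BUT_STALE = "ApprovedButStale"
    MUTATION_NOT_STARTED = "MutationNotStarted"
    MUTATION_PARTIALLY_SUCCEEDED = "MutationPartiallySucceeded"
    RECORDED_STATE_DIVERGED = "RecordedStateDiverged"
    LIVE_STATE_DIVERGED = "LiveStateDiverged"
    RECONCILIATION_BLOCKED = "ReconciliationBlocked"
    RECONCILIATION_UNSAFE = "ReconciliationUnsafe"
    EVIDENCE_INCOMPLETE = "EvidenceIncomplete"


# Assumption: only states in which no mutation has happened are candidates
# for re-running the same step once the failed precondition is fixed.
RETRY_CANDIDATES = {
    FailureState.REJECTED_BEFORE_PLAN,
    FailureState.MUTATION_NOT_STARTED,
}


def retry_is_candidate(state: FailureState) -> bool:
    """Return True only when a re-run cannot amplify a partial failure."""
    return state in RETRY_CANDIDATES
```

Every other state needs a deliberate recovery choice (replan, repair state, pause a controller, reconcile Git) rather than a re-run.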
3. Failure Domains in a GitOps/IaC Platform
A modern pipeline has many control surfaces. You need to know which domain failed before you choose a recovery.
The major failure domains are:
| Domain | What Can Fail | Primary Recovery Owner |
|---|---|---|
| Source control | branch protection, merge queue, CODEOWNERS, wrong commit | repo owner / platform |
| CI validation | lint/test/render/scanner failure | author / platform |
| Plan engine | init, provider, module, state read, graph construction | infra owner / platform |
| Policy engine | rule regression, false positive, unavailable policy service | platform security / policy owner |
| Approval system | stale approval, wrong approver, missing evidence | change manager / platform |
| Runner | capacity, network, credentials, filesystem, image | platform runtime owner |
| State backend | lock, corruption, versioning, access, consistency | infra platform owner |
| Cloud APIs | throttling, quota, eventual consistency, provider bug | cloud platform owner |
| GitOps controller | diff, sync, RBAC, health check, prune, CRD issue | cluster platform owner |
| Kubernetes API | admission, RBAC, CRD, webhook, API server availability | cluster platform owner |
| Secrets | decryption, sync, rotation, missing permissions | security/platform owner |
| Artifact registry | unavailable image/chart, tag drift, digest mismatch | delivery/platform owner |
| Observability | missing logs, missing metrics, broken evidence | platform/SRE |
| Human process | emergency bypass, wrong command, unapproved mutation | incident commander / governance |
Recovery starts by narrowing the domain.
A strong question is:
“Which control surface is currently authoritative for the next safe transition?”
If the failure is in the IaC runner, the next owner may be the pipeline. If the failure is in GitOps desired state, the next owner may be Git. If the failure is a manual emergency change, the next owner may be a reconciliation PR.
4. Recovery Doctrine: The Non-Negotiable Rules
Before playbooks, define doctrine. Doctrine prevents panic.
Rule 1 — Preserve Evidence Before Changing State
Before retrying, reverting, unlocking, importing, deleting, or manually fixing, capture:
- commit SHA
- PR number
- run ID
- plan artifact ID
- policy result
- approver identity
- runner identity
- state backend version
- lock ID
- impacted workspace/unit
- provider error
- controller event
- cluster namespace/resource
- current live state snapshot where relevant
If evidence disappears, incident analysis becomes storytelling.
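One way to make this rule enforceable is to treat the checklist as a record that tooling can inspect before any recovery action runs. The sketch below is an assumption about how such a record might look; the `EvidenceRecord` type, its field names, and `missing_evidence` are hypothetical, and not every field applies to every failure.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvidenceRecord:
    """One field per item in the checklist above; None means not captured."""
    commit_sha: Optional[str] = None
    pr_number: Optional[str] = None
    run_id: Optional[str] = None
    plan_artifact_id: Optional[str] = None
    policy_result: Optional[str] = None
    approver_identity: Optional[str] = None
    runner_identity: Optional[str] = None
    state_backend_version: Optional[str] = None
    lock_id: Optional[str] = None
    impacted_unit: Optional[str] = None
    provider_error: Optional[str] = None
    controller_event: Optional[str] = None
    cluster_resource: Optional[str] = None
    live_state_snapshot: Optional[str] = None


def missing_evidence(record: EvidenceRecord) -> List[str]:
    """List the checklist items that have not been captured yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]
```

A recovery script could refuse to proceed, or require an explicit "not applicable" note, while `missing_evidence` still returns items that matter for the failure at hand.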
Rule 2 — Classify the Failure Before Acting
Do not run apply again until you know whether the previous run mutated anything.
There are two very different situations:
| Situation | Safe First Question |
|---|---|
| Apply failed before mutation | Can we fix the precondition and retry the same approved change? |
| Apply failed after partial mutation | What changed, what did state record, and what is the smallest repair? |
Rule 3 — Do Not Hide Drift by Refreshing State Blindly
A refresh can update recorded state to match reality. That may be correct after a legitimate manual change, but it can also hide unauthorized drift.
Use refresh/import/state repair only after you answer:
- why does actual state differ?
- who authorized it?
- does Git need to change?
- does policy need to approve the new state?
- should this be recorded as exception, incident, or normal reconciliation?
Rule 4 — Prefer Declarative Recovery
A recovery should ideally be expressed as a new desired-state transition:
- Git revert
- forward fix commit
- configuration patch
- module version pin
- policy exception PR
- environment promotion rollback
- controlled state import PR
Manual console fixes are sometimes necessary, but they should be treated as emergency mutations that require reconciliation evidence afterward.
Rule 5 — Pause the Right Controller, Not Everything
When state is unsafe, pause the smallest responsible control loop:
| Problem | Pause Candidate |
|---|---|
| Bad Kubernetes desired state constantly re-applies | specific Argo CD Application / Flux Kustomization |
| Bad Helm release reconciliation | specific Flux HelmRelease or Argo Application |
| Bad IaC unit changing infra | specific stack/workspace/project apply |
| Secret sync producing bad secret | specific ExternalSecret / secret store binding |
| Policy controller blocking everything | specific policy/webhook only if break-glass process allows |
Do not disable an entire platform when one bounded unit is bad.
Rule 6 — A Green Pipeline Is Not Proof of Recovery
Recovery is proven by post-conditions:
- desired state matches approved Git revision
- actual infrastructure matches desired state
- recorded state matches actual state
- workloads are healthy
- policy violations are resolved or explicitly accepted
- drift budget is back within threshold
- audit/evidence trail is complete
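These post-conditions can be checked explicitly instead of being inferred from pipeline status. The following is a minimal sketch under the assumption that each condition has already been evaluated to a boolean by some other tooling; `PostConditions`, `unmet_post_conditions`, and `recovery_proven` are hypothetical names.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class PostConditions:
    """One flag per post-condition listed above."""
    desired_matches_approved_revision: bool
    actual_matches_desired: bool
    recorded_matches_actual: bool
    workloads_healthy: bool
    policy_violations_resolved_or_accepted: bool
    drift_within_budget: bool
    evidence_trail_complete: bool


def unmet_post_conditions(pc: PostConditions) -> List[str]:
    """Return the names of post-conditions that do not hold."""
    return [name for name, ok in asdict(pc).items() if not ok]


def recovery_proven(pc: PostConditions, pipeline_green: bool) -> bool:
    """pipeline_green is accepted but deliberately ignored: it proves nothing."""
    return not unmet_post_conditions(pc)
```

The unused `pipeline_green` parameter is there to make the point visible in code review: a green run never changes the answer.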
5. Failure Severity Model
You need a common language for severity.
| Severity | Definition | Example | Default Action |
|---|---|---|---|
| SEV-4 | Localized non-prod failure, no user impact | dev plan fails | normal queue |
| SEV-3 | Production change blocked, no active impact | prod apply denied by policy | expedite owner review |
| SEV-2 | Production degraded or platform control loop impaired | Argo cannot sync critical app | incident process |
| SEV-1 | Broad outage, security exposure, destructive infra risk | network ACL locks prod, public data exposure | incident commander + break-glass |
Severity should be based on risk and impact, not emotional intensity.
A failed plan for a production IAM change may be SEV-3 even if nothing is down. A successful apply that silently opens public access may be SEV-1.
6. The Universal Triage Loop
Use this loop for every failure.
The key branch in the loop is “Did mutation start?”. That one question prevents many dangerous retries.
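The branching can be sketched as a small decision function. This is an assumption built from Rule 1 and Rule 2 of the doctrine rather than a reproduction of any specific diagram; `next_triage_step` and its return strings are hypothetical.

```python
from typing import Optional


def next_triage_step(evidence_captured: bool,
                     mutation_started: Optional[bool]) -> str:
    """Pick the next triage step.

    mutation_started is None when it is not yet known whether the failed
    run changed anything.
    """
    if not evidence_captured:
        # Rule 1: preserve evidence before changing state.
        return "capture evidence"
    if mutation_started is None:
        # Rule 2: classify the failure before acting.
        return "inspect audit logs, state version, and live resources"
    if mutation_started:
        return "freeze the unit and reconcile desired, recorded, and actual state"
    return "fix the precondition, then retry only if plan inputs are unchanged"
```

Treating "unknown" as its own case is the design choice that matters here: an apply timeout is not evidence that nothing changed.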
7. Playbook 1 — PR Validation Fails
Symptoms
- linting fails
- format check fails
- static analysis fails
- config rendering fails
- schema validation fails
- plan job never starts
Likely Causes
- invalid HCL/YAML/JSON
- wrong module input
- invalid Helm values
- Kustomize overlay references missing file
- generated config not committed
- policy schema mismatch
- tool version mismatch
First Response
Do not bypass validation because “it is just a small fix.” Validation is a boundary that prevents unplannable desired state from entering the system.
Diagnosis Checklist
- Is the failure deterministic locally?
- Did a tool version change?
- Did the base branch change?
- Is the error from parsing, rendering, schema validation, or policy input construction?
- Is this a repository issue or platform runner issue?
Safe Recovery
- Fix the source-level issue.
- Re-run validation.
- Confirm generated/rendered artifacts are deterministic.
- If the failure is due to platform tooling, open a platform incident/change.
- Do not manually skip CI unless the exception is documented and approved.
Anti-Pattern
Adding ignore_errors, disabling lint, or weakening validation globally because one PR is blocked.
8. Playbook 2 — IaC Init Fails
Symptoms
- backend init fails
- provider plugin download fails
- module source cannot be resolved
- version constraint conflict
- registry timeout
- authentication failure to backend
Likely Causes
- backend credentials missing
- remote backend unavailable
- provider registry unavailable
- provider checksum mismatch
- module version removed
- network egress blocked
- runner image missing required tools
Key Question
Did the failure happen before any plan or mutation?
Usually yes. That means recovery is usually precondition repair, not rollback.
Safe Recovery
- Capture run logs and dependency versions.
- Confirm backend access from runner identity.
- Confirm provider/module source integrity.
- Check whether lock files changed.
- Re-run init only after dependency source is trusted.
- If provider/module supply chain is suspect, stop and escalate.
Design Improvement
For production platforms:
- pin provider versions
- commit lock files where appropriate
- mirror providers/modules internally for critical workloads
- record runner image digest
- keep dependency download logs as evidence
9. Playbook 3 — Plan Fails Due to State Lock
Symptoms
- plan cannot acquire lock
- apply cannot acquire lock
- lock holder looks stale
- remote backend reports concurrent operation
Mental Model
The lock protects the state database. A stuck lock is annoying. A corrupted state file is worse.
First Response
Do not force-unlock until you know whether another operation is still alive.
Diagnosis Checklist
- Which workspace/unit/state file is locked?
- Who acquired the lock?
- Which run ID owns it?
- Is that run still executing?
- Did the runner crash after acquiring it?
- Did a previous apply partially mutate resources?
- Does backend versioning show a recent write?
Safe Recovery
- Locate owning run.
- Stop or confirm completion of owning run.
- Capture lock metadata.
- Inspect state version timestamp.
- If no active operation exists, perform documented force unlock.
- Immediately run a refresh-only or plan diagnosis depending on platform policy.
- Record the unlock as evidence.
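The preconditions for a force unlock can be written down as a check that returns the reasons not to unlock yet. This is a sketch under stated assumptions: `LockInfo`, the status strings, and `force_unlock_blockers` are hypothetical, and how you learn the owning run's status depends on your CI system and backend.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: these are the statuses your CI system reports for finished runs.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


@dataclass(frozen=True)
class LockInfo:
    state_unit: str
    lock_id: str
    owner_run_id: Optional[str]      # None if the owning run is not identified
    owner_run_status: Optional[str]  # None if the status could not be determined


def force_unlock_blockers(lock: LockInfo, metadata_captured: bool) -> List[str]:
    """Return the reasons a force unlock is not yet safe; empty means proceed."""
    blockers = []
    if lock.owner_run_id is None:
        blockers.append("owning run not identified")
    elif lock.owner_run_status not in TERMINAL_STATUSES:
        blockers.append("owning run not confirmed finished")
    if not metadata_captured:
        blockers.append("lock metadata not captured as evidence")
    return blockers
```

If the tool is Terraform, the unlock itself is `terraform force-unlock <LOCK_ID>`; the check above only decides whether running it is defensible.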
Anti-Pattern
Force-unlocking because a developer is impatient.
Control Improvement
Your platform should expose a “lock owner dashboard” showing:
- state unit
- lock ID
- owner run
- actor
- started time
- last heartbeat if available
- current pipeline status
10. Playbook 4 — Saved Plan Is Stale
Symptoms
- apply refuses saved plan
- plan artifact expired
- base branch changed after approval
- state changed since plan
- provider reads differ
- policy was evaluated against an old plan
Mental Model
A saved plan is not just a suggestion. It is a binding between:
- source revision
- variables
- provider versions
- state snapshot
- planned actions
- policy decision
- approval
If any of those change, the approval may no longer authorize the mutation.
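A simple way to detect staleness is to hash the bound inputs together at approval time and recompute the hash just before apply. The sketch below assumes each input can be reduced to a string or a flat dictionary; `binding_digest`, `plan_is_stale`, and the parameter names are hypothetical.

```python
import hashlib
import json
from typing import Dict


def binding_digest(source_revision: str,
                   variables: Dict[str, str],
                   provider_versions: Dict[str, str],
                   state_version: str,
                   plan_sha256: str,
                   policy_decision: str) -> str:
    """Hash everything the approval is supposed to cover into one digest."""
    payload = {
        "source_revision": source_revision,
        "variables": variables,
        "provider_versions": provider_versions,
        "state_version": state_version,
        "plan_sha256": plan_sha256,
        "policy_decision": policy_decision,
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_is_stale(approved_digest: str, current_digest: str) -> bool:
    """Any difference means the approval no longer covers this mutation."""
    return approved_digest != current_digest
```

The approval record then stores one digest, and the apply job recomputes it from current inputs and refuses to continue on a mismatch.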
Safe Recovery
- Mark previous plan as invalid.
- Generate a fresh plan from the current base and current state.
- Re-run policy.
- Require fresh approval if risk changed.
- Apply only the newly approved plan.
Anti-Pattern
Reusing approval from a stale plan because “the diff looks similar.”
11. Playbook 5 — Policy Gate Fails
Symptoms
- OPA/Conftest/Checkov/Sentinel/Kyverno validation fails
- plan contains denied resource
- exception required
- policy engine unavailable
- false positive suspected
Key Distinction
There are three different situations:
| Case | Meaning | Recovery |
|---|---|---|
| True violation | change is unsafe/non-compliant | fix design |
| Approved exception | violation is accepted with bounded risk | exception workflow |
| Policy defect | policy blocks valid change | policy fix with test |
Safe Recovery
- Capture policy input and decision output.
- Determine whether the rule is correct.
- If true violation, redesign the change.
- If exception, create an exception object with owner, reason, expiry, scope, and compensating controls.
- If policy defect, fix policy in policy repo and add regression test.
- Re-run policy before approval.
Anti-Pattern
Adding skip_check without expiry, owner, or evidence.
Production Principle
Policy exceptions should be data, not comments.
Bad:
# TODO: skip this for now
Better:
exception:
  id: EXC-2026-0712
  rule: iam.no-admin-wildcard
  scope: account/prod/security-audit-role
  owner: platform-security
  expires: 2026-08-01
  reason: temporary migration bridge
  compensatingControls:
    - cloudtrail-alert-admin-wildcard-use
    - daily-access-review
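Once exceptions are data, the policy gate can evaluate them mechanically. The following is a minimal sketch that mirrors the fields of the example above; `PolicyException` and `exception_applies` are hypothetical names, and treating the expiry date as inclusive is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class PolicyException:
    id: str
    rule: str
    scope: str
    owner: str
    expires: date
    reason: str
    compensating_controls: List[str] = field(default_factory=list)


def exception_applies(exc: PolicyException, rule: str, scope: str,
                      today: date) -> bool:
    """An exception covers a violation only if every bound field still holds."""
    return (
        exc.rule == rule
        and exc.scope == scope
        and today <= exc.expires          # assumption: expiry date is inclusive
        and bool(exc.owner)               # an unowned exception is not valid
        and bool(exc.compensating_controls)
    )
```

An expired or out-of-scope exception simply stops matching, so the original denial comes back without anyone having to remember to remove a comment.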
12. Playbook 6 — Approval Binding Fails
Symptoms
- approval missing
- approver not authorized
- approval happened before final plan
- CODEOWNERS mismatch
- approval was dismissed after rebase
- production apply blocked by change policy
Mental Model
Approval should authorize a specific transition, not a vague intention.
A valid approval binds:
approver + role + PR + commit SHA + plan digest + policy result + target environment + time window
Safe Recovery
- Identify which binding field failed.
- Regenerate plan if commit/state changed.
- Request approval from the correct owner.
- Preserve the rejected approval evidence.
- Do not manually trigger apply as admin unless break-glass is active.
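The first recovery step, identifying which binding field failed, lends itself to a direct check. This sketch assumes the approval system can export the bound fields; the `Approval` type, the `"pass"` policy value, and `failed_binding_fields` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Approval:
    approver: str
    role: str
    pr_number: int
    commit_sha: str
    plan_digest: str
    policy_result: str
    target_environment: str
    approved_at: datetime
    valid_for: timedelta


def failed_binding_fields(approval: Approval, *, required_role: str,
                          pr_number: int, commit_sha: str, plan_digest: str,
                          target_environment: str, now: datetime) -> List[str]:
    """Compare the approval with what is about to be applied."""
    expected = {
        "role": required_role,
        "pr_number": pr_number,
        "commit_sha": commit_sha,
        "plan_digest": plan_digest,
        "policy_result": "pass",  # assumption: the policy engine reports "pass"
        "target_environment": target_environment,
    }
    failed = [name for name, value in expected.items()
              if getattr(approval, name) != value]
    window_end = approval.approved_at + approval.valid_for
    if not (approval.approved_at <= now <= window_end):
        failed.append("time_window")
    return failed
```

An empty list means the approval authorizes this exact transition; anything else names the field to regenerate or re-request.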
Anti-Pattern
“Approved in Slack” without linking to exact plan artifact and commit.
13. Playbook 7 — Apply Fails Before Mutation
Symptoms
- runner cannot start
- credential exchange fails
- backend lock unavailable
- provider auth fails before creating/updating resources
- saved plan cannot be read
- policy recheck fails before apply
Diagnosis
Confirm no provider mutation happened.
Look for:
- no provider operation logs
- no cloud audit events for the target role
- no state write after run start
- no resource events in cluster/cloud
Safe Recovery
- Fix the precondition.
- If source/plan/state unchanged, retry may be safe.
- If any input changed, regenerate plan and approval.
- Record the failed attempt.
Anti-Pattern
Treating every failed apply as harmless. Some tools fail after partial mutation; do not assume.
14. Playbook 8 — Apply Fails After Partial Mutation
This is one of the most important playbooks in the entire series.
Symptoms
- some resources created, others failed
- apply output says partial completion
- cloud API timeout after resource creation
- state file contains some updates but not all expected resources
- next plan proposes confusing replacements/imports
Mental Model
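You now have three states: desired (what Git declares), recorded (what the IaC state says), and actual (what really exists in the provider).
Your job is to reconcile desired, recorded, and actual state without making the blast radius worse.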
First Response
Stop automatic retries.
Diagnosis Checklist
- Which resources were successfully changed?
- Which resources failed?
- Did state record the successful changes?
- Did state record resources that do not exist?
- Do real resources exist that state does not know about?
- Did provider return an eventually consistent response?
- Is the failed resource safe to retry?
- Would retry cause replacement, duplicate creation, or deletion?
Safe Recovery Options
| Situation | Recovery |
|---|---|
| Actual and state agree for completed resources | fix failed precondition and re-apply |
| Actual exists but state does not know it | import or state repair after approval |
| State says resource exists but cloud does not | remove/repair state after evidence capture |
| Provider timeout but resource eventually appears | wait, refresh/diagnose, then plan |
| Failed replacement left old and new resources | choose target, update config/state carefully |
| Failure was caused by unsafe design | roll forward with corrected design |
Recovery Procedure
- Freeze affected unit.
- Capture state version before repair.
- Capture actual resource inventory.
- Compare desired vs recorded vs actual.
- Choose repair strategy.
- Run a diagnostic plan.
- Get approval for state repair if production.
- Repair/import/remove state as needed.
- Run plan again.
- Apply only after plan is understandable and policy-approved.
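The comparison step in the procedure above can be sketched as a three-way set comparison over resource addresses. This is an illustration only: it assumes you can export the addresses from Git, from the state file, and from a provider inventory into three sets, it compares existence and not attributes, and `classify_resources` with its labels is hypothetical.

```python
from typing import Dict, Set


def classify_resources(desired: Set[str], recorded: Set[str],
                       actual: Set[str]) -> Dict[str, str]:
    """Label each resource address by where it does and does not appear."""
    result = {}
    for address in sorted(desired | recorded | actual):
        in_desired = address in desired
        in_recorded = address in recorded
        in_actual = address in actual
        if in_recorded and in_actual:
            status = "consistent" if in_desired else "tracked but no longer desired"
        elif in_actual:
            # Exists in the provider, unknown to state.
            status = "untracked: candidate for import"
        elif in_recorded:
            # State claims it exists, the provider does not have it.
            status = "missing: candidate for state repair"
        else:
            status = "desired only: not yet created"
        result[address] = status
    return result
```

The output is a worklist for the repair strategy, not a repair: every "candidate" still goes through evidence capture and approval before state is touched.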
Anti-Pattern
Deleting cloud resources manually until the plan “looks clean.”
15. Playbook 9 — State Corruption or State Mismatch
Symptoms
- state JSON unreadable
- state backend version bad
- resource address mismatch
- provider cannot decode state after upgrade
- resource moved but `moved` block missing
- state contains wrong resource mapping
- plan proposes mass recreation unexpectedly
Severity
Treat production state corruption as at least SEV-2. If destructive changes are possible, escalate to SEV-1.
Safe Recovery
- Disable applies for the affected state unit.
- Snapshot current state version.
- Retrieve previous backend versions if supported.
- Identify the last known good state.
- Compare with actual infrastructure inventory.
- Avoid applying until state is repaired.
- Use `moved` blocks for refactors when possible.
- Use import/remove state operations only under controlled procedure.
- Run plan and policy after repair.
Design Principle
State repair is a production change. It needs:
- owner
- review
- evidence
- rollback option where possible
- post-repair drift check
16. Playbook 10 — Provider/API Throttling or Eventual Consistency
Symptoms
- rate limit errors
- cloud API timeout
- resource not found immediately after creation
- dependency not ready
- repeated transient failure
- provider bug suspected
Mental Model
Cloud APIs are not always immediately consistent. IaC engines describe desired operations, but provider implementations must translate them through APIs with throttling, retries, and eventual consistency behavior.
Safe Recovery
- Determine whether mutation happened.
- Check cloud audit logs.
- Wait if provider/API documentation suggests eventual consistency.
- Re-plan after state refresh/diagnosis.
- Increase provider timeout/retry settings only if understood.
- Reduce parallelism for fragile services.
- Split overly large stacks if throttling is structural.
Anti-Pattern
Blindly rerunning high-parallelism applies against a throttled API.
Design Improvement
Use stack boundaries that reflect provider/API failure domains. For example, do not combine hundreds of unrelated IAM/network/storage mutations in one giant state unit.
17. Playbook 11 — Destructive Change Detected Late
Symptoms
- plan proposes destroy/recreate unexpectedly
- replacement appears after provider upgrade
- rename causes delete/create
- module refactor changes resource address
- production apply is about to delete critical resource
Safe Response
Stop the line.
Diagnosis Checklist
- Is destruction expected?
- Is it caused by resource address change?
- Is a `moved` block missing?
- Did a force-new attribute change?
- Did provider behavior change?
- Is lifecycle protection configured?
- Is state pointing to the correct object?
- Would deletion violate data retention or uptime constraints?
Recovery
- Do not apply.
- Add or correct `moved` blocks for refactor cases.
- Adjust module design to avoid forced replacement where possible.
- Split migration into create-before-destroy phases.
- Add policy rule if this class of destruction should be blocked.
- Require explicit destructive approval if destruction is intentional.
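Destruction is easier to catch early if the pipeline scans the plan itself rather than relying on a reviewer to spot it. The sketch below assumes Terraform and the JSON plan representation produced by `terraform show -json <planfile>`, where each entry in `resource_changes` carries a `change.actions` list; the `destructive_changes` helper is hypothetical.

```python
import json
from typing import List, Tuple


def destructive_changes(plan_json: str) -> List[Tuple[str, str]]:
    """Return (address, kind) for every resource the plan would delete.

    kind is "replace" when the resource is deleted and recreated,
    and "destroy" when it is only deleted.
    """
    plan = json.loads(plan_json)
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            kind = "replace" if "create" in actions else "destroy"
            findings.append((rc["address"], kind))
    return findings
```

A gate can then require explicit destructive approval whenever the returned list is non-empty for a protected environment.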
Anti-Pattern
Approving destruction because the person reviewing “trusts Terraform.”
18. Playbook 12 — GitOps Application OutOfSync
Symptoms
- Argo CD Application `OutOfSync`
- Flux Kustomization reports changes not applied
- diff shows live state differs from Git
- auto-sync disabled or failed
- self-heal disabled
First Question
Is the diff expected, harmful, or noise?
Common Causes
- manual cluster mutation
- controller defaulting fields
- mutating webhook changes object
- generated fields not ignored
- Helm chart output changed
- image tag moved
- secret changed out-of-band
- CRD schema conversion
- Git revision not reachable
Safe Recovery
- Inspect diff.
- Classify drift: authorized, unauthorized, controller-generated, noise, or dangerous.
- If unauthorized, revert live state through GitOps sync or corrective PR.
- If authorized emergency change, capture incident evidence and reconcile Git.
- If diff noise, configure ignore rules narrowly.
- If desired state is wrong, fix Git before syncing.
Anti-Pattern
Adding broad ignore rules for fields you do not understand.
19. Playbook 13 — GitOps Sync Fails
Symptoms
- Argo CD sync operation fails
- Flux reconciliation fails
- resource apply error
- health check never passes
- sync wave stuck
- post-sync hook fails
- HelmRelease not ready
Failure Classes
| Class | Example | Recovery |
|---|---|---|
| Render failure | invalid Helm/Kustomize output | fix desired config |
| Apply failure | invalid Kubernetes object | fix schema/spec/RBAC |
| Admission failure | policy denies object | fix object or policy exception |
| Dependency failure | CRD not installed before CR | fix ordering/waves/dependencies |
| Health failure | object applied but not healthy | diagnose workload/controller |
| Prune failure | finalizer blocks delete | resolve finalizer/resource owner |
| Secret failure | missing/decryption/sync problem | repair secret path |
Safe Recovery
- Read controller events/logs.
- Identify phase: render, apply, health, prune, hook, dependency.
- Pause only the affected application/resource if repeated reconciliation is harmful.
- Fix desired state or dependency ordering.
- Resume reconciliation.
- Verify health and evidence.
Argo CD Specific Notes
Argo CD applies resources in phases (pre-sync, sync, and post-sync), and sync waves order resources within a phase. Misusing hooks or waves can create stuck deployments if hook jobs are not idempotent or dependencies never become healthy.
Flux Specific Notes
Flux resources can be suspended to pause reconciliation and resumed after repair. Use suspension as a scalpel, not a blanket platform shutdown.
20. Playbook 14 — Admission Policy Blocks Sync
Symptoms
- Kubernetes API rejects object
- Kyverno/Gatekeeper/ValidatingAdmissionPolicy denial
- Argo/Flux stuck applying resource
- webhook unavailable causes fail-closed outage
Diagnosis Checklist
- Which policy denied the object?
- Is the policy correct?
- Is the object unsafe?
- Is the policy newly deployed?
- Did namespace labels change?
- Is this a fail-closed webhook availability issue?
- Are exceptions supported and scoped?
Safe Recovery
- Capture admission denial message.
- If object unsafe, fix desired state.
- If policy defective, fix policy and add regression test.
- If exception valid, create scoped exception with expiry.
- If webhook outage blocks critical operations, follow break-glass process.
- Reconcile after policy path is healthy.
Anti-Pattern
Disabling the entire admission controller for one workload without recording affected scope.
21. Playbook 15 — Bad Secret or Secret Sync Failure
Symptoms
- SOPS decryption fails
- ExternalSecret cannot sync
- Vault/cloud secret access denied
- workload enters crash loop after rotation
- Argo/Flux cannot render/apply secret-dependent config
- image pull secret invalid
Key Questions
- Did the secret fail to deliver?
- Did the wrong secret deliver successfully?
- Did rotation break compatibility?
- Is the secret value itself bad, or is the identity/RBAC path bad?
- Does rollback require the previous value, and is it still recoverable?
Safe Recovery
- Pause rollout if bad secret is causing cascading failure.
- Identify secret source of truth.
- Check identity permissions from secret operator/controller.
- Verify version/rotation timestamp.
- Restore previous secret version if safe and permitted.
- Reconcile Git references if path/key changed.
- Restart/reload workloads only according to application semantics.
- Add rotation test to prevent recurrence.
Design Improvement
A production secret rotation should have:
- dual-read compatibility when possible
- versioned secret history
- canary workload
- explicit rollback window
- telemetry for authentication failures
- post-rotation verification
22. Playbook 16 — Artifact Registry or Signature Failure
Symptoms
- image/chart not found
- tag points to unexpected digest
- Cosign verification fails
- SBOM/provenance missing
- registry unavailable
- admission policy rejects unsigned image
Recovery
- Confirm whether desired state references tag or digest.
- Resolve the expected digest from the promotion artifact.
- Verify signature/attestation identity.
- If artifact is missing, stop promotion and rebuild/re-promote from trusted source.
- If signature policy is wrong, fix policy with test.
- If registry outage, do not bypass verification unless emergency exception exists.
Anti-Pattern
Switching back to mutable tags during incident because digest references are “inconvenient.”
23. Playbook 17 — Broken GitOps Controller
Symptoms
- Argo CD application controller down
- Flux controller crash looping
- controller cannot access Git/registry/Kubernetes API
- reconciliation lag grows
- no events emitted
- controller upgrade breaks behavior
Severity
A broken controller may not immediately break applications, but it breaks your ability to change and heal them.
Safe Recovery
- Determine blast radius: one controller, namespace, cluster, or fleet.
- Check controller deployment health.
- Check credentials to Git/registry/API.
- Check recent controller upgrade/config change.
- Roll back controller version/config if needed.
- Avoid manual mass applies unless controller outage affects critical incident response.
- After recovery, check missed reconciliations and drift.
Design Improvement
Treat GitOps controller upgrades as platform changes with:
- canary cluster
- compatibility tests
- backup of controller config
- rollback plan
- metrics for reconciliation lag and error rate
24. Playbook 18 — Emergency Manual Mutation
Sometimes production must be fixed before the pipeline can safely execute. That may be acceptable. Pretending it did not happen is not.
Examples
- manually revoke public access
- manually scale workload during outage
- manually rotate compromised credential
- manually detach broken routing rule
- manually disable problematic admission rule
Emergency Rules
- Use named break-glass identity.
- Capture exact command/console action.
- Capture reason and incident ID.
- Limit scope and duration.
- Notify owner channel.
- Create reconciliation PR immediately after stabilization.
- Run drift detection.
- Close break-glass access.
Reconciliation After Manual Mutation
Anti-Pattern
Leaving Git wrong because “production is already fixed.”
That makes the next reconciliation dangerous.
25. Designing Recovery Artifacts
A recovery-oriented platform should produce durable artifacts.
| Artifact | Purpose |
|---|---|
| Plan artifact | proves intended mutation |
| Plan digest | binds approval to plan |
| Policy result | proves rules evaluated |
| Approval record | proves authorized actor accepted risk |
| Apply log | proves mutation attempt |
| Cloud audit event correlation | proves actual API calls |
| State version | proves recorded state before/after |
| GitOps sync event | proves controller action |
| Kubernetes event snapshot | proves cluster response |
| Exception object | proves accepted deviation |
| Incident link | connects emergency action to governance |
Evidence should be queryable by:
- service
- environment
- commit
- PR
- run ID
- actor
- resource address
- cloud account
- cluster
- policy rule
- exception ID
26. Recovery Command Center View
For serious platforms, build a view that answers:
- Which deployments are stuck?
- Which IaC runs are locked?
- Which applies partially failed?
- Which GitOps apps are degraded?
- Which policy rules are causing most denials?
- Which secrets failed rotation?
- Which drift findings are open?
- Which break-glass sessions are active?
- Which emergency manual changes are unreconciled?
- Which state files have recent repair operations?
This is not cosmetic. It changes incident response from archaeology to control.
27. Failure Injection Exercises
You do not know if your recovery works until you test it.
Run failure drills in non-production:
| Drill | Expected Learning |
|---|---|
| Kill runner during apply | detect partial mutation and lock behavior |
| Break state backend access | validate lock/error handling |
| Introduce policy false positive | test exception and policy rollback |
| Rotate secret to wrong value | test secret rollback and workload verification |
| Make Argo app unhealthy | test diff/sync/event diagnosis |
| Suspend Flux Kustomization | test detection of paused reconciliation |
| Add manual drift | test drift classification and reconciliation PR |
| Remove CRD before CR apply | test dependency ordering failure |
| Push unsigned image | test admission enforcement |
| Simulate registry outage | test artifact availability assumptions |
A mature platform runs these as game days.
28. The Recovery Decision Matrix
Use this as a default framework.
| Failure | Mutation Started? | State Consistent? | Preferred Action |
|---|---|---|---|
| syntax/render failure | no | yes | fix source |
| init/provider download failure | no | yes | fix dependency/platform |
| stale plan | no | yes | replan/reapprove |
| policy denial | no | yes | fix design or exception |
| credential failure before apply | no | yes | fix identity; retry if plan fresh |
| apply timeout | maybe | unknown | inspect audit/state/actual |
| partial create | yes | maybe | reconcile state/actual, then plan |
| stuck lock after crash | maybe | maybe | verify no active run, then unlock |
| state corruption | maybe | no | freeze, restore/repair state |
| Argo OutOfSync | maybe (controller may have attempted a sync) | no (live drift) | classify diff, then sync or reconcile Git |
| admission denial | usually no (object not persisted) | desired state invalid | fix object/policy |
| bad secret delivered | yes | no (live secret is bad) | restore or roll forward secret, then verify |
| manual hotfix | yes | no (live state drifted from Git) | reconciliation PR |
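The IaC rows of the matrix can be encoded as a lookup table so that tooling gives the same default answer a human would. The sketch below covers only the first nine rows and copies their values from the table; the names `RECOVERY_MATRIX`, `preferred_action`, and `must_inspect_before_retry` are hypothetical, and the rule that anything unclassified must be inspected first is an assumption layered on top of the matrix.

```python
from typing import NamedTuple


class Row(NamedTuple):
    mutation_started: str   # "no", "yes", or "maybe"
    state_consistent: str   # "yes", "no", "maybe", or "unknown"
    preferred_action: str


# Values copied from the IaC rows of the decision matrix.
RECOVERY_MATRIX = {
    "syntax/render failure": Row("no", "yes", "fix source"),
    "init/provider download failure": Row("no", "yes", "fix dependency/platform"),
    "stale plan": Row("no", "yes", "replan/reapprove"),
    "policy denial": Row("no", "yes", "fix design or exception"),
    "credential failure before apply": Row("no", "yes", "fix identity; retry if plan fresh"),
    "apply timeout": Row("maybe", "unknown", "inspect audit/state/actual"),
    "partial create": Row("yes", "maybe", "reconcile state/actual, then plan"),
    "stuck lock after crash": Row("maybe", "maybe", "verify no active run, then unlock"),
    "state corruption": Row("maybe", "no", "freeze, restore/repair state"),
}


def preferred_action(failure: str) -> str:
    """Look up the default action; unknown failures are never retried blindly."""
    row = RECOVERY_MATRIX.get(failure)
    if row is None:
        return "unclassified: stop and classify before acting"
    return row.preferred_action


def must_inspect_before_retry(failure: str) -> bool:
    """True unless we are sure nothing mutated and state is consistent."""
    row = RECOVERY_MATRIX.get(failure)
    if row is None:
        return True  # assumption: treat unknown failures as potentially mutating
    return row.mutation_started != "no" or row.state_consistent != "yes"
```

Encoding the matrix this way keeps the default conservative: only failures known to occur before any mutation are eligible for a fix-and-retry path.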
29. Engineering Patterns That Reduce Recovery Pain
Pattern 1 — Small State Units
Large state units make partial failure harder to reason about. Split by lifecycle and blast radius.
Pattern 2 — Immutable Artifacts
Use image digests, chart versions, module versions, provider locks, and plan digests. Mutable references make recovery ambiguous.
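As one small illustration of the immutable-artifact check, a pipeline could reject container image references that are not pinned to a content digest. This sketch only validates the textual form of a `sha256` digest suffix; the function name `is_digest_pinned` is hypothetical, and it does not verify that the digest exists in any registry.

```python
import re

# An image reference pinned by digest ends with "@sha256:" plus 64 hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the reference ends with a sha256 digest, False for tag-only refs."""
    return bool(_DIGEST_SUFFIX.search(image_ref))
```

A tag such as `registry.example.com/app:latest` fails this check because the tag can be repointed, which is exactly what makes recovery ambiguous.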
Pattern 3 — Explicit Ownership
Every state unit, app, policy, secret, and exception needs an owner.
Pattern 4 — Narrow Auto-Heal
Auto-heal is good for known-safe drift. It is dangerous for bad desired state.
Pattern 5 — Reconciliation PRs
Every manual change should become a PR that either accepts, reverses, or replaces the manual state.
Pattern 6 — Policy Regression Tests
Every policy incident should produce a test case.
Pattern 7 — State Versioning
Use state backends with versioning and access logs for production.
Pattern 8 — Controller Canaries
Upgrade GitOps controllers and policy controllers through canary clusters.
Pattern 9 — Precomputed Runbooks
The middle of an incident is the worst time to invent a force-unlock procedure.
30. Failure Modeling Template
Use this template for each pipeline component.
## Component
### Responsibility
What state transition does this component own?
### Inputs
What does it trust?
### Outputs
What does it produce?
### State
What persistent state does it read/write?
### Failure Modes
How can it fail before mutation, during mutation, after mutation?
### Detection
Which logs, metrics, events, and artifacts reveal failure?
### Safe Recovery
What is the smallest safe transition to recover?
### Unsafe Recovery
Which actions must be avoided?
### Evidence
What must be captured?
### Preventive Controls
How do we reduce recurrence?
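If failure models are kept as structured data instead of free text, a reviewer or CI job can flag incomplete ones. The sketch below mirrors the template sections as fields of a hypothetical `FailureModel` dataclass; the `missing_sections` helper is likewise an illustrative assumption, not part of any existing tool.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class FailureModel:
    """Structured form of the template; one field per template section."""
    component: str
    responsibility: str = ""
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    state: List[str] = field(default_factory=list)
    failure_modes: List[str] = field(default_factory=list)
    detection: List[str] = field(default_factory=list)
    safe_recovery: List[str] = field(default_factory=list)
    unsafe_recovery: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    preventive_controls: List[str] = field(default_factory=list)


def missing_sections(model: FailureModel) -> List[str]:
    """List template sections that are still empty, in template order."""
    return [f.name for f in fields(model) if not getattr(model, f.name)]
```

A model that lists failure modes but has an empty `safe_recovery` section is a useful signal: the failure has been imagined, but the recovery has not.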
31. Example Failure Model: IaC Apply Runner
## Component
IaC apply runner
### Responsibility
Execute approved plan against cloud APIs and update state.
### Inputs
- saved plan artifact
- source commit
- variables
- short-lived credentials
- backend credentials
- provider/plugin versions
### Outputs
- apply log
- state write
- cloud resource mutations
- post-apply verification
### State
- remote backend state
- lock record
- plan artifact
### Failure Modes
- cannot acquire credentials
- cannot acquire lock
- provider auth failure
- partial resource creation
- API timeout after mutation
- state write failure
- runner crash
### Detection
- runner logs
- backend lock metadata
- cloud audit events
- state version history
- post-apply plan
### Safe Recovery
- determine whether mutation started
- compare desired/recorded/actual
- repair state if needed
- replan/reapprove before further apply
### Unsafe Recovery
- force unlock without checking active run
- rerun blindly after timeout
- manually delete created resources without state review
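The "compare desired/recorded/actual" step can be sketched as a three-way set comparison over resource addresses. This is a deliberately simplified illustration: it checks presence only, not attribute drift, and the status labels and function names are hypothetical. In practice the three sets would come from configuration, the state backend, and the cloud inventory or audit log.

```python
from typing import Dict, Set


def classify_resources(
    desired: Set[str], recorded: Set[str], actual: Set[str]
) -> Dict[str, str]:
    """Classify each resource address by where it is present."""
    result: Dict[str, str] = {}
    for addr in sorted(desired | recorded | actual):
        d, r, a = addr in desired, addr in recorded, addr in actual
        if a and not r:
            # Exists in the cloud but state does not know it (e.g. partial apply).
            result[addr] = "unrecorded"
        elif r and not a:
            # State claims a resource that no longer exists.
            result[addr] = "missing"
        elif d and r and a:
            result[addr] = "converged"
        elif r and a:
            # Recorded and present, but no longer in desired config.
            result[addr] = "undesired"
        else:
            # Only in desired config: nothing has been created yet.
            result[addr] = "pending"
    return result


def needs_state_repair(classification: Dict[str, str]) -> bool:
    """State must be repaired before replanning if recorded and actual disagree."""
    return any(s in ("unrecorded", "missing") for s in classification.values())
```

After a runner crash, an `unrecorded` resource is the typical signature of a mutation that succeeded in the cloud but never reached the state write, which is why a blind rerun is listed as unsafe.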
32. Example Failure Model: GitOps Controller
## Component
GitOps controller
### Responsibility
Reconcile Kubernetes live state to Git desired state.
### Inputs
- Git revision
- rendered manifests
- cluster credentials
- policy/admission behavior
- registry/artifact availability
### Outputs
- applied resources
- sync status
- health status
- Kubernetes events
### State
- controller cache
- application custom resources
- cluster live objects
### Failure Modes
- cannot fetch Git
- render failure
- admission denial
- missing CRD
- unhealthy workload
- prune blocked by finalizer
- controller crash
- RBAC denied
### Detection
- Application/Kustomization status
- controller logs
- Kubernetes events
- reconciliation lag metrics
### Safe Recovery
- classify render/apply/health/prune failure
- pause affected application if repeated sync is unsafe
- fix desired state or dependency
- resume and verify
### Unsafe Recovery
- manually applying unrelated rendered output with kubectl apply
- deleting finalizers without owner review
- adding broad ignore-difference rules
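The first safe-recovery step, classifying the failure by stage, can be sketched as a lookup from the failure modes listed above to a stage. The render, apply, health, and prune stages come from the model itself; the extra `source` and `controller` stages, the specific assignments, and the function name are assumptions made for illustration.

```python
# Stage assignment for each failure mode in the model above.
# "source" and "controller" are added here as assumptions for modes that
# happen before rendering or outside any single application's sync.
FAILURE_STAGE = {
    "cannot fetch Git": "source",
    "render failure": "render",
    "admission denial": "apply",
    "missing CRD": "apply",
    "RBAC denied": "apply",
    "unhealthy workload": "health",
    "prune blocked by finalizer": "prune",
    "controller crash": "controller",
}


def classify_stage(failure_mode: str) -> str:
    """Return the reconciliation stage for a failure mode, or 'unclassified'."""
    return FAILURE_STAGE.get(failure_mode, "unclassified")
```

Knowing the stage narrows the next question: a render failure points at the Git source, an apply failure at the cluster boundary, and a health failure at the workload itself.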
33. Production Checklist
A GitOps/IaC platform is recovery-ready when:
- every state unit has an owner
- every apply has a saved evidence trail
- plan and approval are bound to commit and plan digest
- state backend supports versioning and locking
- force-unlock procedure is documented and audited
- partial apply recovery is practiced
- GitOps controller pause/resume is scoped and authorized
- admission policy exception workflow exists
- secret rollback process exists
- manual emergency mutation requires reconciliation PR
- drift findings are classified, not ignored
- controller upgrades have rollback plans
- policy changes have tests
- incident reviews produce platform improvements
34. Common Anti-Patterns
Anti-Pattern 1 — Retry as First Response
Retry is a recovery action only after classification.
Anti-Pattern 2 — Manual Console Fix Without Reconciliation
This creates future drift and undermines Git as the source of truth.
Anti-Pattern 3 — Force Unlock as Routine Operation
Frequent stuck locks mean the runner lifecycle or state design is broken.
Anti-Pattern 4 — Ignoring Diff Noise Globally
Diff ignore rules should be narrow, explained, and tested.
Anti-Pattern 5 — Treating Policy as a Blocker, Not a Control
Policy failure is information. Bypassing it without evidence weakens the platform.
Anti-Pattern 6 — Rolling Back State Without Understanding Reality
State rollback can make recorded state lie about actual infrastructure.
Anti-Pattern 7 — Controller-Wide Pause for Local Problem
Pause the smallest control loop possible.
Anti-Pattern 8 — Losing Evidence During Emergency
Emergency does not remove the need for auditability. It increases it.
35. The Senior Engineer’s Mental Model
When a failure happens, think in this order:
- Boundary — which subsystem failed?
- Mutation — did anything actually change?
- State — do desired, recorded, and actual state agree?
- Authority — which actor/controller owns the next transition?
- Safety — can retry amplify damage?
- Evidence — can we prove what happened?
- Recovery — what is the smallest safe corrective transition?
- Learning — what platform guardrail prevents recurrence?
That is the difference between operating a toolchain and engineering a control plane.
36. References
- OpenTofu documentation — plan, saved plan, and planning modes: https://opentofu.org/docs/cli/commands/plan/
- OpenTofu documentation — apply execution behavior: https://opentofu.org/docs/cli/commands/apply/
- OpenTofu documentation — state model: https://opentofu.org/docs/language/state/
- Argo CD documentation — sync phases and waves: https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/
- Argo CD documentation — sync options: https://argo-cd.readthedocs.io/en/latest/user-guide/sync-options/
- Argo CD documentation — automated sync policy: https://argo-cd.readthedocs.io/en/latest/user-guide/auto_sync/
- Flux documentation — suspend command: https://fluxcd.io/flux/cmd/flux_suspend/
- Kubernetes documentation — declarative object management: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/declarative-config/