AI CI/CD and Readiness Gates
Learn Python AI Application Engineer - Part 033
AI CI/CD and readiness gates for production AI systems: prompt/model/index/tool/workflow versioning, eval gates, security gates, cost gates, release trains, canary, shadow, rollback, and production readiness review.
1. Why This Part Matters
Classic CI/CD validates code.
AI CI/CD must validate code plus behavior.
A production AI system can change behavior when any of these changes:
- code;
- prompt;
- model route;
- model provider;
- output schema;
- retrieval index;
- embedding model;
- reranker;
- tool contract;
- tool description;
- agent workflow graph;
- memory policy;
- governance policy;
- eval dataset;
- safety threshold;
- cost/latency budget.
A code pipeline that only runs unit tests is not enough.
The central invariant:
AI release engineering must gate behavior, not only build artifacts.
This part turns the previous chapters into a practical release system.
2. Target Skill
After this part, you should be able to:
- design CI/CD pipelines for Python AI applications;
- version prompts, models, tools, indexes, workflows, and eval datasets;
- run deterministic tests and behavioral evals in separate tiers;
- define readiness gates for quality, safety, security, privacy, latency, and cost;
- use canary, shadow, blue-green, and progressive rollout for AI changes;
- roll back prompts, indexes, tools, and model routes independently;
- build release reports that senior engineers and risk owners can review;
- decide whether an AI feature is production-ready.
3. CI/CD for AI Is Multi-Artifact
A normal backend release might include:
source code -> build -> test -> deploy
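An AI release includes more artifacts: code, prompts, model routes, retrieval indexes, tool contracts, workflow graphs, eval datasets, configuration, and governance policy.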
The release unit must describe all relevant artifacts.
Otherwise, you cannot reproduce or roll back behavior.
4. Kaufman Deconstruction
Break AI CI/CD into trainable subskills.
Deliberate practice:
- change one prompt;
- run evals;
- inspect diff;
- canary rollout;
- detect a regression;
- roll back prompt only;
- add regression example;
- repeat with index/model/tool change.
5. AI Release Manifest
Every AI release should have a manifest.
```python
from typing import Literal

from pydantic import BaseModel


class AiReleaseManifest(BaseModel):
    # Identity of the release and where it runs.
    release_id: str
    application_version: str
    environment: Literal["dev", "staging", "production"]

    # Build artifacts.
    code_commit: str
    container_image: str

    # Behavior-changing artifacts, keyed by logical name -> version ID.
    prompt_versions: dict[str, str]
    model_routes: dict[str, str]
    index_versions: dict[str, str]
    tool_versions: dict[str, str]
    workflow_versions: dict[str, str]
    eval_dataset_versions: dict[str, str]

    # Configuration and policy in force for this release.
    config_version: str
    governance_policy_version: str

    # Provenance.
    created_at: str
    created_by: str
```
This manifest answers:
- what code ran?
- which prompt generated the answer?
- which model route was active?
- which index served retrieval?
- which tool contract was available?
- which workflow graph executed?
- which eval dataset approved the release?
No manifest, no reproducible release.
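One practical use of a manifest is comparing two releases to see exactly which artifacts changed. The sketch below is a minimal illustration that works on plain dictionaries rather than the Pydantic model; the `diff_manifests` helper and the sample values are hypothetical.

```python
from typing import Any


def diff_manifests(old: dict[str, Any], new: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {field_path: (old_value, new_value)} for every field that differs."""
    changes: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if isinstance(before, dict) and isinstance(after, dict):
            # Recurse into versioned maps such as prompt_versions.
            for sub_key, pair in diff_manifests(before, after).items():
                changes[f"{key}.{sub_key}"] = pair
        elif before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical manifests: only one prompt version differs.
previous = {
    "code_commit": "abc123",
    "prompt_versions": {"policy_answer": "v6", "grounded_judge": "v5"},
    "index_versions": {"policy": "policy-index-2026-06-28-v3"},
}
candidate = {
    "code_commit": "abc123",
    "prompt_versions": {"policy_answer": "v7", "grounded_judge": "v5"},
    "index_versions": {"policy": "policy-index-2026-06-28-v3"},
}

changed = diff_manifests(previous, candidate)
print(changed)  # {'prompt_versions.policy_answer': ('v6', 'v7')}
```

A diff like this tells the reviewer which eval suites are relevant and which single artifact to roll back if the release regresses.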
6. Version Everything That Changes Behavior
6.1 Code Version
Identify code the usual way: by git commit and container image.
6.2 Prompt Version
Prompts are behavior.
Use IDs:
prompt.policy_answer.v7
prompt.case_recommendation.v3
prompt.grounded_judge.v5
6.3 Model Route Version
A model route maps tasks to models.
model_route.high_risk_case_review.v2
6.4 Index Version
Retrieval indexes must be versioned.
policy-index-2026-06-28-v3
6.5 Tool Contract Version
Tool input/output and risk metadata must be versioned.
tool.update_case_status.v2
6.6 Workflow Version
Agent workflows must be versioned.
workflow.case_review.v4
6.7 Eval Dataset Version
The gate itself must be versioned.
eval.case_review_golden.v9
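A naming convention is easier to enforce when CI parses and validates it. The following is a minimal sketch, assuming the `kind.name.vN` pattern used above for prompts, model routes, tools, workflows, and eval datasets. `ArtifactId` and `parse_artifact_id` are illustrative names, and the date-based index ID would need its own rule.

```python
import re
from dataclasses import dataclass

# Assumed pattern: <kind>.<name>.v<integer>, lowercase with underscores.
_ARTIFACT_ID = re.compile(r"^(?P<kind>[a-z_]+)\.(?P<name>[a-z0-9_]+)\.v(?P<version>\d+)$")


@dataclass(frozen=True)
class ArtifactId:
    kind: str
    name: str
    version: int


def parse_artifact_id(raw: str) -> ArtifactId:
    """Parse an ID such as 'prompt.policy_answer.v7' or raise ValueError."""
    match = _ARTIFACT_ID.match(raw)
    if match is None:
        raise ValueError(f"not a valid artifact ID: {raw!r}")
    return ArtifactId(match["kind"], match["name"], int(match["version"]))


parsed = parse_artifact_id("prompt.policy_answer.v7")
print(parsed)  # ArtifactId(kind='prompt', name='policy_answer', version=7)
```

Rejecting malformed IDs at PR time keeps unversioned artifacts out of the release manifest.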
7. Pipeline Tiers
Use multiple CI/CD tiers.
| Tier | Trigger | Purpose | Cost |
|---|---|---|---|
| PR fast | every PR | unit, lint, type, fake model tests | low |
| PR medium | important PRs | small integration and smoke eval | medium |
| main/nightly | merge to main or nightly schedule | golden eval, retrieval eval | medium/high |
| release candidate | pre-prod | full eval + security + performance | high |
| production canary | partial traffic | live quality/ops monitoring | controlled |
| post-release | after deploy | regression monitoring, feedback review | ongoing |
Do not put every expensive eval in every PR.
But do not ship high-risk behavior without release gates.
8. CI Stages for Python AI Apps
Recommended PR checks:
- ruff or equivalent lint/format;
- mypy or pyright where applicable;
- unit tests;
- prompt rendering tests;
- schema tests;
- tool authorization tests;
- retrieval filter tests;
- fake model/agent workflow tests;
- dependency/security scan;
- small eval smoke test.
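Prompt rendering tests are deterministic and need no model call, so they fit the fast PR tier. Here is a minimal sketch using the standard library's `string.Template`; the prompt text, `render_policy_prompt`, and the evidence IDs are made up for illustration, and a real project would likely use its own templating layer.

```python
from string import Template

# Hypothetical prompt template with two placeholders.
POLICY_ANSWER_PROMPT = Template(
    "Answer using only the evidence below.\n"
    "Question: $question\n"
    "Evidence:\n$evidence\n"
    "Cite evidence IDs in square brackets."
)


def render_policy_prompt(question: str, evidence: list[tuple[str, str]]) -> str:
    lines = "\n".join(f"[{doc_id}] {text}" for doc_id, text in evidence)
    # substitute() raises KeyError if a placeholder is missing a value.
    return POLICY_ANSWER_PROMPT.substitute(question=question, evidence=lines)


def test_prompt_contains_all_evidence_ids() -> None:
    prompt = render_policy_prompt(
        "What is the refund window?",
        [
            ("POL-1", "Refunds are allowed within 30 days."),
            ("POL-2", "Gift cards are final sale."),
        ],
    )
    assert "[POL-1]" in prompt and "[POL-2]" in prompt
    assert "What is the refund window?" in prompt
    assert "$" not in prompt  # no unrendered placeholders remain
```

Under pytest the test function is collected automatically; it can also be called directly.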
9. Readiness Gates
A readiness gate is a release condition.
Gate categories:
- code quality;
- deterministic tests;
- behavioral eval;
- RAG quality;
- agent trajectory;
- security;
- privacy/governance;
- performance;
- cost;
- observability;
- rollback readiness;
- human approval.
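```python
from typing import Literal

from pydantic import BaseModel


class ReadinessGate(BaseModel):
    gate_id: str
    name: str
    category: str  # e.g. "security", "cost", "behavioral eval"
    severity: Literal["blocker", "warning", "informational"]
    passed: bool
    evidence_ref: str | None = None    # link to the report or eval run behind the result
    failure_reason: str | None = None  # filled in when passed is False
```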
Blocker gates should stop production rollout.
10. Quality Gates
Examples:
- critical eval pass rate >= 98%
- groundedness pass rate >= 95%
- citation support rate >= 98%
- unsupported critical claims == 0
- over-refusal rate within threshold
- under-refusal critical cases == 0
Quality gates must be sliced by risk.
A 95% overall pass rate is not acceptable if all failures are high-risk case recommendations.
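To make the slicing point concrete, the sketch below computes pass rates per risk slice and checks each slice against its own threshold. The field names (`risk`, `passed`), the helper names, and the thresholds are assumptions for illustration.

```python
from collections import defaultdict


def pass_rate_by_slice(results: list[dict]) -> dict[str, float]:
    """Group eval results by their 'risk' label and compute each slice's pass rate."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result["risk"]] += 1
        passes[result["risk"]] += int(result["passed"])
    return {risk: passes[risk] / totals[risk] for risk in totals}


def failing_slices(rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    # A slice with no results counts as 0.0, so missing coverage fails the gate.
    return sorted(risk for risk, minimum in thresholds.items() if rates.get(risk, 0.0) < minimum)


# Hypothetical run: 95 low-risk passes, 5 high-risk failures.
results = [{"risk": "low", "passed": True}] * 95 + [{"risk": "high", "passed": False}] * 5

overall = sum(r["passed"] for r in results) / len(results)
rates = pass_rate_by_slice(results)

print(overall)  # 0.95
print(rates)    # {'low': 1.0, 'high': 0.0}
print(failing_slices(rates, {"low": 0.90, "high": 0.98}))  # ['high']
```

The aggregate looks healthy at 95%, yet the high-risk slice fails completely, which is exactly the case an overall score hides.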
11. RAG Readiness Gates
RAG Gates:
- unauthorized retrieval count == 0
- stale active-policy failure count == 0
- recall@10 on critical policy queries >= 0.98
- citation support rate >= 0.98
- exact identifier hit rate >= 0.99
- no-result rate regression <= threshold
- index metadata completeness == 100% for mandatory fields
- active index has rollback version
RAG gate inputs:
- retrieval eval results;
- index manifest;
- metadata validation report;
- ACL validation report;
- latency report;
- citation eval report.
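The recall@10 gate relies on a standard retrieval metric: the fraction of known-relevant documents that appear in the top k results. A minimal sketch follows; the case structure (`retrieved`, `relevant`) and the document IDs are assumed for illustration.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of relevant documents found among the first k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(cases: list[dict], k: int = 10) -> float:
    """Average recall@k over a list of eval cases."""
    return sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / len(cases)


# Hypothetical eval cases.
cases = [
    {"retrieved": ["a", "b", "c"], "relevant": {"a", "c"}},  # recall 1.0
    {"retrieved": ["x", "y"], "relevant": {"y", "z"}},       # recall 0.5
]

score = mean_recall_at_k(cases, k=10)
print(score)          # 0.75
print(score >= 0.98)  # False, so this gate would fail
```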
12. Agent Readiness Gates
Agent Gates:
- approval bypass count == 0
- forbidden tool call count == 0
- unauthorized tool call count == 0
- max-step failure rate <= threshold
- required-node completion rate >= threshold
- idempotency violations == 0
- unsafe handoff count == 0
- long-running resume tests pass
Agent behavior must be evaluated by trajectory, not just final answer.
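A trajectory check inspects the sequence of steps instead of the final answer. The sketch below assumes a simplified trace format where each step is either a tool call or an approval for a named tool; `Step`, `trajectory_violations`, and the tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str  # "tool_call" or "approval"
    name: str  # tool name, or the tool that an approval covers


def trajectory_violations(
    steps: list[Step],
    forbidden_tools: set[str],
    approval_required: set[str],
    max_steps: int,
) -> list[str]:
    """Return a list of human-readable violations found in one agent run."""
    violations: list[str] = []
    approved: set[str] = set()
    if len(steps) > max_steps:
        violations.append(f"max steps exceeded: {len(steps)} > {max_steps}")
    for index, step in enumerate(steps):
        if step.kind == "approval":
            approved.add(step.name)
        elif step.kind == "tool_call":
            if step.name in forbidden_tools:
                violations.append(f"step {index}: forbidden tool {step.name}")
            elif step.name in approval_required and step.name not in approved:
                # The write happened before any approval step was recorded.
                violations.append(f"step {index}: approval bypass for {step.name}")
    return violations


bad_run = [Step("tool_call", "search_policy"), Step("tool_call", "update_case_status")]
good_run = [
    Step("tool_call", "search_policy"),
    Step("approval", "update_case_status"),
    Step("tool_call", "update_case_status"),
]

rules = dict(forbidden_tools={"delete_case"}, approval_required={"update_case_status"}, max_steps=10)
print(trajectory_violations(bad_run, **rules))   # ['step 1: approval bypass for update_case_status']
print(trajectory_violations(good_run, **rules))  # []
```

Both runs could end with the same final answer; only the trajectory reveals that one of them skipped approval.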
13. Tool Readiness Gates
For every tool change:
- schema compatibility verified;
- authorization tests pass;
- approval requirement configured;
- idempotency tested for writes;
- side-effect level reviewed;
- audit event emitted;
- output redaction tested;
- kill switch configured;
- tool description reviewed.
High-risk tool changes require explicit review.
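Schema compatibility can be checked mechanically. The sketch below assumes tool input schemas are JSON-Schema-like dictionaries with `properties` and `required` keys and flags three kinds of breaking change. It is a simplified illustration, not a full JSON Schema comparison, and `breaking_changes` is a hypothetical helper.

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """List changes in new_schema that could break callers written against old_schema."""
    problems: list[str] = []
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})

    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"removed field: {name}")
        elif new_props[name].get("type") != spec.get("type"):
            problems.append(f"type changed: {name}")

    old_required = set(old_schema.get("required", []))
    for name in new_schema.get("required", []):
        if name not in old_required:
            problems.append(f"new required field: {name}")
    return problems


# Hypothetical v1 and v2 input schemas for a case-status tool.
v1 = {
    "properties": {"case_id": {"type": "string"}, "status": {"type": "string"}},
    "required": ["case_id", "status"],
}
v2 = {
    "properties": {
        "case_id": {"type": "string"},
        "status": {"type": "string"},
        "reason": {"type": "string"},
    },
    "required": ["case_id", "status", "reason"],
}

print(breaking_changes(v1, v2))  # ['new required field: reason']
```

Adding `reason` as an optional field would pass this check; making it required should force a new tool version and an agent eval run.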
14. Security Gates
Security gates:
- prompt injection eval critical failures == 0
- unauthorized retrieval == 0
- forbidden tool proposals blocked == 100%
- cross-tenant tests pass
- secrets scan clean
- dependency critical vulnerabilities == 0 or accepted exception
- trace redaction tests pass
- memory poisoning tests pass
Security gates should fail closed.
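Failing closed means that missing evidence is treated as a failure, not a pass. A minimal sketch of that rule, assuming metrics arrive as a dictionary of counts; the function and metric names are illustrative.

```python
def evaluate_security_gate(
    metric_name: str,
    metrics: dict[str, int],
    max_allowed: int = 0,
) -> tuple[bool, str]:
    """Return (passed, reason). A missing metric fails the gate."""
    value = metrics.get(metric_name)
    if value is None:
        # No eval result was produced, so we cannot claim the release is safe.
        return False, f"{metric_name}: no evidence, failing closed"
    if value > max_allowed:
        return False, f"{metric_name}: {value} > {max_allowed}"
    return True, f"{metric_name}: ok"


print(evaluate_security_gate("prompt_injection_critical_failures", {}))
# (False, 'prompt_injection_critical_failures: no evidence, failing closed')
print(evaluate_security_gate("unauthorized_retrieval", {"unauthorized_retrieval": 0}))
# (True, 'unauthorized_retrieval: ok')
```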
15. Privacy and Governance Gates
Governance gates:
- provider approved for data classification;
- prompt manifest approved;
- tool governance record exists;
- eval dataset classification reviewed;
- audit event completeness tested;
- retention policy configured;
- deletion propagation tested where applicable;
- human approval path tested for high-risk workflows;
- release owner and rollback owner named.
Governance is release engineering, not only paperwork.
16. Performance and Cost Gates
Examples:
- p95 RAG latency <= 6s
- p95 API latency <= budget
- average input tokens <= baseline + 10%
- average output tokens <= baseline + 10%
- cost per successful task <= budget
- model retry rate <= threshold
- agent average steps <= baseline + 1
- no unbounded context growth
Cost gates should be tied to successful tasks, not only raw spend.
A system that spends less because it fails more is not better.
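A short worked example shows why the denominator matters. The numbers below are invented for illustration.

```python
def cost_per_successful_task(total_cost_usd: float, successes: int) -> float:
    """Total spend divided by the number of tasks that actually succeeded."""
    if successes == 0:
        return float("inf")  # spending anything for zero successes is unbounded cost
    return total_cost_usd / successes


# Hypothetical: both versions handled 100 tasks.
baseline = cost_per_successful_task(total_cost_usd=10.00, successes=95)
candidate = cost_per_successful_task(total_cost_usd=8.00, successes=70)

print(round(baseline, 4))    # 0.1053
print(round(candidate, 4))   # 0.1143
print(candidate > baseline)  # True: cheaper in raw spend, worse per success
```

The candidate spends 20% less overall but costs more for each task it completes, so a raw-spend gate would approve a regression.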
17. Observability Gates
Before release:
- trace includes request ID;
- prompt version is traced;
- model version is traced;
- index version is traced;
- selected evidence IDs are traced;
- tool calls are traced;
- approval IDs are traced;
- cost/tokens are traced;
- redaction tests pass;
- dashboard updated;
- alerts configured;
- runbook exists.
If you cannot observe the release, you cannot operate it.
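Several of these checks can run as an automated gate against sample traces from staging. The sketch below assumes traces are flat dictionaries; the field names are illustrative and should be mapped to whatever your tracing system actually emits.

```python
# Assumed field names, mirroring the checklist above.
REQUIRED_TRACE_FIELDS = (
    "request_id",
    "prompt_version",
    "model_version",
    "index_version",
    "evidence_ids",
    "tool_calls",
    "tokens",
    "cost_usd",
)


def missing_trace_fields(trace: dict) -> list[str]:
    """Return required fields that are absent or None. Empty lists count as present."""
    return [name for name in REQUIRED_TRACE_FIELDS if trace.get(name) is None]


sample_trace = {
    "request_id": "req-001",
    "prompt_version": "prompt.policy_answer.v7",
    "model_version": None,  # the model version was not recorded
    "index_version": "policy-index-2026-06-28-v3",
    "evidence_ids": [],
    "tool_calls": [],
    "tokens": 1200,
}

print(missing_trace_fields(sample_trace))  # ['model_version', 'cost_usd']
```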
18. Example GitHub Actions Pipeline
```yaml
name: ai-app-ci

on:
  pull_request:
  push:
    branches: [main]

jobs:
  fast-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-dev.txt
      - run: ruff check .
      - run: mypy src
      - run: pytest tests/unit tests/contract -q

  fake-model-integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-dev.txt
      - run: pytest tests/integration -q

  eval-smoke:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-dev.txt
      - run: python -m evals.run --suite smoke --output reports/eval-smoke.json
      - run: python -m release_gates.check --report reports/eval-smoke.json --profile pr
```
The exact tooling can vary.
The principle is stable:
Fast deterministic gates in PR, deeper behavioral gates before release.
19. Eval Pipeline
Eval runs must store:
- manifest;
- dataset version;
- outputs;
- traces;
- metrics;
- failures;
- gate decisions.
20. Readiness Gate Engine
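```python
from typing import Literal

from pydantic import BaseModel

# ReadinessGate is the model defined in section 9.


class GateDecision(BaseModel):
    release_id: str
    status: Literal["approved", "blocked", "approved_with_warnings"]
    gates: list[ReadinessGate]
    blockers: list[str]
    warnings: list[str]


def decide_release(gates: list[ReadinessGate], release_id: str) -> GateDecision:
    # Collect the names of failed gates, grouped by severity.
    blockers = [g.name for g in gates if not g.passed and g.severity == "blocker"]
    warnings = [g.name for g in gates if not g.passed and g.severity == "warning"]

    # Any failed blocker stops the release; failed warnings are recorded but do not block.
    status: Literal["approved", "blocked", "approved_with_warnings"]
    if blockers:
        status = "blocked"
    elif warnings:
        status = "approved_with_warnings"
    else:
        status = "approved"

    return GateDecision(
        release_id=release_id,
        status=status,
        gates=gates,
        blockers=blockers,
        warnings=warnings,
    )
```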
Make gate decisions deterministic and auditable.
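Gate results themselves are usually derived from metric thresholds. The sketch below shows one deterministic way to do that using only the standard library; `Threshold`, `evaluate_thresholds`, and the metric names are hypothetical, and the resulting dictionaries would be converted into gate objects by your own pipeline.

```python
import operator
from dataclasses import dataclass

_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


@dataclass(frozen=True)
class Threshold:
    gate_id: str
    metric: str
    op: str        # one of ">=", "<=", "=="
    limit: float
    severity: str  # "blocker" or "warning"


def evaluate_thresholds(thresholds: list[Threshold], metrics: dict[str, float]) -> list[dict]:
    """Compare each metric with its limit; a missing metric fails its gate."""
    results = []
    for t in thresholds:
        value = metrics.get(t.metric)
        passed = value is not None and _OPS[t.op](value, t.limit)
        results.append(
            {
                "gate_id": t.gate_id,
                "severity": t.severity,
                "passed": passed,
                "failure_reason": None if passed else f"{t.metric}={value} violates {t.op} {t.limit}",
            }
        )
    return results


thresholds = [
    Threshold("quality.critical", "critical_pass_rate", ">=", 0.98, "blocker"),
    Threshold("perf.rag_p95", "p95_rag_latency_s", "<=", 6.0, "warning"),
]
gate_results = evaluate_thresholds(thresholds, {"critical_pass_rate": 0.99, "p95_rag_latency_s": 7.2})

for result in gate_results:
    print(result["gate_id"], result["passed"], result["failure_reason"])
# quality.critical True None
# perf.rag_p95 False p95_rag_latency_s=7.2 violates <= 6.0
```

Keeping thresholds as data makes each gate change reviewable in a diff, which supports the auditability goal.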
21. Release Report
AI Release Report
Release:
- release_id:
- code_commit:
- container_image:
- prompt_versions:
- model_routes:
- index_versions:
- tool_versions:
- workflow_versions:
- eval_dataset_versions:
Quality:
- overall pass rate:
- critical pass rate:
- groundedness:
- citation support:
RAG:
- recall@10:
- MRR:
- unauthorized retrieval:
- stale source failures:
Agent:
- trajectory success:
- approval bypass:
- forbidden tool calls:
- max-step failures:
Security/Privacy:
- prompt injection pass:
- redaction tests:
- provider policy:
- audit completeness:
Performance/Cost:
- p95 latency:
- avg cost/task:
- token budget regression:
Decision:
- approved / approved with warnings / blocked
- blockers:
- warnings:
- rollback plan:
This report should be understandable by engineering, product, security, and governance stakeholders.
22. Canary Rollout
Canary releases expose a small percentage of traffic to a new version.
AI canary signals:
- error rate;
- latency;
- cost;
- user feedback;
- citation failure;
- refusal rate;
- tool error rate;
- safety alerts;
- model output schema failures;
- approval bypass;
- fallback rate.
Canary should support automatic rollback for critical metrics.
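An automatic rollback rule can be expressed as a comparison between baseline and canary metrics. The following sketch assumes two kinds of rule: zero-tolerance counters and a maximum relative increase over the baseline. The metric names and limits are illustrative.

```python
def rollback_reasons(
    baseline: dict[str, float],
    canary: dict[str, float],
    zero_tolerance: tuple[str, ...],
    max_relative_increase: dict[str, float],
) -> list[str]:
    """Return the reasons to roll back; an empty list means the canary may continue."""
    reasons: list[str] = []
    for metric in zero_tolerance:
        value = canary.get(metric)
        if value is None:
            reasons.append(f"{metric}: not reported")  # missing signal fails closed
        elif value > 0:
            reasons.append(f"{metric}: {value} > 0")
    for metric, allowed in max_relative_increase.items():
        limit = baseline[metric] * (1 + allowed)
        if canary[metric] > limit:
            reasons.append(f"{metric}: {canary[metric]} exceeds limit {limit}")
    return reasons


# Hypothetical metrics for a canary window.
baseline = {"error_rate": 0.010, "p95_latency_s": 4.0}
canary = {"error_rate": 0.012, "p95_latency_s": 5.5, "approval_bypass": 0}

reasons = rollback_reasons(
    baseline,
    canary,
    zero_tolerance=("approval_bypass",),
    max_relative_increase={"error_rate": 0.5, "p95_latency_s": 0.25},
)
print(reasons)  # ['p95_latency_s: 5.5 exceeds limit 5.0']
```

In practice the comparison also needs a minimum sample size before it is trusted; that is omitted here for brevity.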
23. Shadow Deployment
Shadow deployment runs a new version in parallel with production but does not show its output to users.
Useful for:
- new retrieval index;
- new model;
- new reranker;
- new prompt;
- new agent workflow;
- new tool selection policy.
Shadow comparison:
production answer vs shadow answer
production retrieval vs shadow retrieval
production cost vs shadow cost
production latency vs shadow latency
Do not use shadow output for side effects.
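The core of a shadow setup is that the user always receives the production result, while the shadow result is only logged for comparison. A minimal synchronous sketch is shown below; real systems usually run the shadow call asynchronously, and the shadow handler is assumed to be configured without write tools. All names are hypothetical.

```python
from typing import Callable


def run_with_shadow(
    request: str,
    production: Callable[[str], str],
    shadow: Callable[[str], str],
    comparison_log: list[dict],
) -> str:
    """Serve the production answer; record the shadow answer for offline comparison."""
    prod_answer = production(request)
    try:
        shadow_answer = shadow(request)
        comparison_log.append(
            {
                "request": request,
                "production": prod_answer,
                "shadow": shadow_answer,
                "match": prod_answer == shadow_answer,
            }
        )
    except Exception as exc:
        # A shadow failure must never affect what the user sees.
        comparison_log.append({"request": request, "production": prod_answer, "shadow_error": repr(exc)})
    return prod_answer


log: list[dict] = []
answer = run_with_shadow(
    "refund window?",
    production=lambda q: "30 days [POL-1]",
    shadow=lambda q: "30 days [POL-1, POL-4]",
    comparison_log=log,
)
print(answer)           # 30 days [POL-1]
print(log[0]["match"])  # False
```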
24. A/B Testing
A/B tests compare variants with real users.
Use carefully for AI.
Appropriate:
- answer style;
- UX presentation;
- low-risk summarization;
- retrieval ranking variants after offline safety passes.
Inappropriate without safeguards:
- high-risk policy interpretation;
- regulated decisions;
- tool side effects;
- approval workflows.
A/B testing is not a replacement for safety gates.
25. Rollback Strategy
Rollback targets:
| Artifact | Rollback Method |
|---|---|
| code | deploy previous image |
| prompt | switch prompt version |
| model route | switch route config |
| index | promote previous active index |
| tool | disable or revert tool version |
| workflow | route new runs to old version |
| eval dataset | usually not rolled back; review the gate change instead |
| config | revert config version |
Rollback should be tested.
If you cannot roll back a prompt quickly, prompt deployment is unsafe.
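Independent rollback is simplest when each artifact has its own "active version" pointer with history. The in-memory sketch below illustrates the idea; `ActiveVersionRegistry` is a hypothetical class, and a production system would back it with a config store and audit log.

```python
class ActiveVersionRegistry:
    """Tracks the promotion history of each artifact separately."""

    def __init__(self) -> None:
        self._history: dict[str, list[str]] = {}

    def promote(self, artifact: str, version: str) -> None:
        self._history.setdefault(artifact, []).append(version)

    def active(self, artifact: str) -> str:
        return self._history[artifact][-1]

    def roll_back(self, artifact: str) -> str:
        history = self._history[artifact]
        if len(history) < 2:
            raise RuntimeError(f"no previous version to roll back to for {artifact}")
        history.pop()  # drop the current version; the previous one becomes active
        return history[-1]


registry = ActiveVersionRegistry()
registry.promote("prompt.policy_answer", "v6")
registry.promote("prompt.policy_answer", "v7")
registry.promote("index.policy", "policy-index-2026-06-28-v3")

# Roll back only the prompt; the index is untouched.
print(registry.roll_back("prompt.policy_answer"))  # v6
print(registry.active("index.policy"))             # policy-index-2026-06-28-v3
```

Because the rollback is a pointer switch and not a redeploy, it can be exercised regularly as part of the rollback readiness gate.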
26. Feature Flags
Feature flags control exposure.
Use flags for:
- new prompt;
- new model route;
- new index;
- new agent workflow;
- high-risk tool;
- memory feature;
- judge/validator;
- streaming;
- canary rollout.
Flag dimensions:
- tenant;
- user;
- role;
- environment;
- risk level;
- percentage;
- feature;
- workflow.
Feature flags must be audited for high-risk capabilities.
27. Release Train for AI Systems
A mature AI release train may look like:
Daily:
- code PR checks
- small eval smoke
Nightly:
- golden RAG eval
- agent trajectory eval
- cost/latency eval
Weekly:
- full eval suite
- security eval
- human review sample
- index promotion review
Release:
- manifest
- readiness gates
- canary
- production monitoring
Cadence depends on risk and team maturity.
28. Handling Eval Dataset Changes
Eval dataset changes can alter pass rates.
Treat dataset changes like code:
- review examples;
- document reason;
- tag risk;
- avoid accidental removal of hard cases;
- keep history;
- compare old/new gate outcomes;
- require owner approval for critical examples.
A team can cheat by weakening evals.
Govern eval datasets.
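One simple guard against weakened evals is a CI check that compares the old and new dataset versions and fails if a critical example disappeared. The sketch assumes each example has `id` and `risk` fields; the sample data is invented.

```python
def removed_critical_examples(old: list[dict], new: list[dict]) -> list[str]:
    """IDs of critical examples present in the old dataset but missing from the new one."""
    new_ids = {example["id"] for example in new}
    return sorted(
        example["id"]
        for example in old
        if example["risk"] == "critical" and example["id"] not in new_ids
    )


# Hypothetical dataset versions v8 and v9.
v8 = [
    {"id": "case-001", "risk": "critical"},
    {"id": "case-002", "risk": "low"},
    {"id": "case-003", "risk": "critical"},
]
v9 = [
    {"id": "case-001", "risk": "critical"},
    {"id": "case-004", "risk": "critical"},
]

print(removed_critical_examples(v8, v9))  # ['case-003']
```

A non-empty result would require explicit owner approval before the new dataset version can be used as a gate.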
29. Handling Model Upgrades
A model upgrade should trigger:
- compatibility tests;
- structured output tests;
- golden evals;
- safety evals;
- latency/cost comparison;
- judge calibration if judge model changes;
- canary or shadow rollout;
- rollback plan.
Do not assume a newer model is better for your specific workflow.
30. Handling Index Upgrades
Index upgrade gates:
- metadata completeness;
- ACL validation;
- retrieval eval;
- stale/superseded check;
- latency/cost check;
- shadow comparison;
- rollback index retained.
Index changes can silently change answers.
Treat index promotion like deployment.
31. Handling Tool Changes
Tool changes require:
- contract tests;
- authorization tests;
- idempotency tests;
- audit tests;
- description review;
- agent evals;
- security review for high-risk tools;
- feature flag rollout.
A tool description change can change model behavior.
Review descriptions like prompts.
32. Handling Workflow Changes
Agent workflow changes require:
- state migration review;
- versioned runs;
- trajectory evals;
- max-step checks;
- approval gate tests;
- resume tests;
- rollback plan;
- active run strategy.
Do not change workflow state schema casually while runs are active.
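A common active-run strategy is to pin each run to the workflow version it started with and route only new runs to the new version. The sketch below illustrates the pinning idea in memory; `WorkflowRouter` and `Run` are hypothetical, and a real system would persist the version alongside the run state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: str
    workflow_version: str  # fixed at start time, never changed afterwards


class WorkflowRouter:
    def __init__(self, default_version: str) -> None:
        self.default_version = default_version
        self._runs: dict[str, Run] = {}

    def start(self, run_id: str) -> Run:
        # New runs pick up whatever version is the default right now.
        run = Run(run_id, self.default_version)
        self._runs[run_id] = run
        return run

    def version_for(self, run_id: str) -> str:
        # Resuming a run always uses its pinned version.
        return self._runs[run_id].workflow_version


router = WorkflowRouter(default_version="workflow.case_review.v3")
router.start("run-1")

router.default_version = "workflow.case_review.v4"  # release the new workflow
router.start("run-2")

print(router.version_for("run-1"))  # workflow.case_review.v3
print(router.version_for("run-2"))  # workflow.case_review.v4
```

Rolling back the workflow is then a matter of resetting `default_version`, which matches the "route new runs to old version" method in the rollback table.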
33. Production Readiness Review
Before launching a significant AI feature, hold a readiness review.
Checklist:
- product scope clear;
- risk level assigned;
- data classification complete;
- threat model complete;
- eval suite exists;
- tests pass;
- readiness gates pass;
- monitoring dashboards ready;
- alerts and runbooks ready;
- rollback plan tested;
- human review path ready;
- audit events verified;
- owner/on-call assigned;
- residual risks accepted.
Readiness review should produce a decision, not just discussion.
34. Case-Management Readiness Gates
For enterprise case-management AI:
Blocker Gates:
- unauthorized case retrieval == 0
- approval bypass == 0
- restricted trace redaction failures == 0
- active policy retrieval recall@10 >= 0.98
- unsupported high-risk recommendation == 0
- citation support for policy claims >= 0.99
- case write tools disabled unless approval exists
- audit event completeness == 100%
- rollback plan verified
Warnings:
- p95 latency slightly above target
- answer verbosity regression
- non-critical prior-decision retrieval miss
- human review queue backlog within tolerance
High-risk systems should block on safety and authorization, not on minor style issues.
35. Anti-Patterns
| Anti-Pattern | Why It Fails |
|---|---|
| CI only tests code | AI behavior changes untested |
| Prompt changes without eval | regressions ship |
| Model upgrade without shadow eval | behavior drift |
| Index overwrite in place | no rollback |
| Tool enabled globally | excessive blast radius |
| Eval dataset unreviewed | gate loses meaning |
| One overall score gate | critical failures hidden |
| No release manifest | cannot reproduce behavior |
| No canary/rollback | incidents last longer |
| Governance gate manual only | inconsistent enforcement |
| No cost gate | runaway spend |
| No trace gate | un-debuggable release |
36. Practice: Build an AI CI/CD Pipeline
For your practice RAG + agent app, create:
- release manifest;
- prompt manifest;
- index manifest;
- tool manifest;
- workflow manifest;
- PR pipeline;
- nightly eval pipeline;
- release readiness gates;
- canary rollout plan;
- rollback plan.
Test release scenarios:
- prompt improves style but breaks citation;
- new index improves recall but leaks stale policy;
- new model improves quality but increases cost;
- tool description change causes wrong tool choice;
- workflow change bypasses approval;
- eval dataset adds new critical case.
Deliverable:
AI CI/CD Review
1. Artifact inventory
2. Pipeline tiers
3. Readiness gates
4. Release manifest
5. Eval report template
6. Rollout strategy
7. Rollback strategy
8. Governance review
9. Case-management gates
10. Known gaps
37. Engineering Heuristics
- Gate behavior, not only code.
- Version prompts, models, indexes, tools, workflows, and eval datasets.
- Keep PR checks fast.
- Run deeper evals before release.
- Gate critical slices, not only aggregate score.
- Treat index promotion like deployment.
- Treat prompt changes like code changes.
- Treat tool descriptions as behavior-changing artifacts.
- Use canary and shadow for risky changes.
- Roll back artifacts independently.
- Store release manifests.
- Include cost, latency, security, and observability gates.
- Require human review for high-risk releases.
- Convert production incidents into evals.
- Make readiness review produce a decision.
38. Summary
AI CI/CD is release engineering for probabilistic systems.
The core invariant:
A release is ready only when its code, prompts, models, indexes, tools, workflows, policies, and eval results are known, versioned, tested, and rollback-capable.
This is how you move from experimentation to production discipline.
In the next part, we build the full Enterprise Case Management AI Capstone.