Part 032 — Deployment Architecture and Runtime Operations

1. Why This Part Matters

An AI application is not production-ready because it runs locally.

Production requires:

API deployment;
model gateway;
retrieval service;
worker processes;
queues;
vector/search infrastructure;
databases;
object storage;
secret management;
observability;
model/provider routing;
rollout and rollback;
scaling;
incident response;
data governance;
cost control.

AI systems are operationally more complex than ordinary CRUD services because they depend on probabilistic and remote components.

A production deployment must handle:

model provider failures;
vector index rollout;
prompt version rollout;
tool version rollout;
agent workflow migration;
long-running task recovery;
eval gates;
high cost;
data privacy;
security boundaries;
human approvals.

The central invariant:

Deployment architecture must preserve the same safety, reliability, observability, and governance guarantees that the application design assumes.

2. Target Skill

After this part, you should be able to:

design service topology for Python AI applications;
separate API, model gateway, retrieval, ingestion, and worker responsibilities;
deploy RAG indexes safely with shadow/promote/rollback flows;
operate agent workers and long-running tasks;
use queues and checkpoints for durable execution;
configure health checks and readiness gates;
manage secrets and provider credentials;
design rollout/rollback for prompts, models, tools, and indexes;
scale API and worker components;
define SLOs, alerts, and runbooks;
operate AI systems under production constraints.

3. Reference Deployment Topology

The reference topology separates concerns: each responsibility listed in the next section sits behind its own boundary.

Do not force every responsibility into one web process.

4. Service Responsibilities

Service	Responsibility
API service	user requests, auth context, response streaming
Model gateway	provider policy, routing, cost limits, tracing
Retrieval service	query planning, search, rerank, evidence package
Ingestion workers	parse, chunk, embed, index
Agent workers	long-running tasks, tools, approvals
Tool executor	authorization, side effects, audit
Checkpoint store	durable workflow state
Audit service/store	accountability records
Eval service/runner	offline/CI quality checks
Observability stack	traces, metrics, logs
Admin console	operations, review, incident tools

Small systems can combine some services.

But boundaries should remain clear in code.

5. Monolith vs Modular Services

5.1 Modular Monolith

Good starting point:

one deployable app
clear internal modules:
- api
- model_gateway
- retrieval
- ingestion
- tools
- workflows
- evals

Pros:

simpler operations;
fewer network calls;
easier development;
good for small teams.

Cons:

harder independent scaling;
larger blast radius;
background tasks need care.

5.2 Distributed Services

Use when:

ingestion load is heavy;
model gateway is shared;
retrieval needs independent scaling;
tools have strict security boundary;
agent workers need queues;
teams own different services.

Pros:

independent scaling;
clearer ownership;
stronger boundaries.

Cons:

distributed complexity;
more tracing needed;
more deployment coordination.

Start modular. Split when pressure appears.

6. Runtime Process Types

A production Python AI app often has multiple process types.

web:
  FastAPI / ASGI server

worker:
  queue consumer for long-running tasks

ingestion-worker:
  document parsing, chunking, embedding, indexing

scheduler:
  periodic jobs, re-index, cleanup

eval-runner:
  offline evals / CI evals

admin:
  management console or CLI

Each process type has different scaling and failure behavior.

7. API Runtime

API service should handle:

auth;
request validation;
lightweight orchestration;
streaming responses;
short RAG requests;
task creation for long-running work;
retrieving task status;
user feedback;
trace correlation.

Avoid putting heavy ingestion or long-running agent loops inside request handlers.

For long tasks:

POST /case-review -> returns task_id
GET /case-review/{task_id} -> status/result

This avoids HTTP timeouts and keeps a slow task from tying up a request handler.
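A minimal sketch of the create-then-poll pattern above, using only the standard library. The names `TaskRecord`, `TaskStore`, `create_task`, `get_status`, and `complete` are hypothetical, and the in-memory dict stands in for a durable task table; a real service would also hand the task to a queue.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskRecord:
    task_id: str
    status: str = "queued"            # queued -> succeeded (failure states omitted)
    result: Optional[dict] = None


@dataclass
class TaskStore:
    """In-memory stand-in for a durable task table (assumption)."""
    _tasks: dict = field(default_factory=dict)

    def create_task(self) -> TaskRecord:
        # Backs POST /case-review: record the task and return immediately.
        record = TaskRecord(task_id=str(uuid.uuid4()))
        self._tasks[record.task_id] = record
        return record

    def get_status(self, task_id: str) -> Optional[TaskRecord]:
        # Backs GET /case-review/{task_id}: never waits on the work itself.
        return self._tasks.get(task_id)

    def complete(self, task_id: str, result: dict) -> None:
        # Called by a worker process, not by the request handler.
        record = self._tasks[task_id]
        record.status = "succeeded"
        record.result = result
```

The request handler only ever touches `create_task` and `get_status`, so its latency does not depend on how long the review takes.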

8. Worker Runtime

Workers handle:

long-running agent workflows;
retryable tool operations;
batch summarization;
human approval resumes;
scheduled tasks;
ingestion tasks;
eval tasks.

Worker requirements:

idempotent processing;
checkpointing;
graceful shutdown;
concurrency limits;
queue visibility timeout;
dead-letter handling;
trace propagation.

Workers should checkpoint after each important workflow node.
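One way to picture per-node checkpointing is the sketch below. It assumes a workflow is an ordered list of named steps and uses a plain dict in place of a durable checkpoint store; `run_with_checkpoints` and its parameters are illustrative names, not part of any framework.

```python
from typing import Callable

# Hypothetical shape: a step takes the current state and returns the new state.
Step = Callable[[dict], dict]


def run_with_checkpoints(
    run_id: str,
    nodes: list[tuple[str, Step]],
    checkpoints: dict[str, dict],
) -> dict:
    """Run nodes in order, saving state after each one.

    `checkpoints` stands in for a durable checkpoint store (assumption).
    On restart, nodes already recorded as done are skipped.
    """
    saved = checkpoints.get(run_id, {"done": [], "state": {}})
    done, state = list(saved["done"]), dict(saved["state"])
    for name, step in nodes:
        if name in done:
            continue  # completed before a crash or redeploy
        state = step(state)
        done.append(name)
        # Persist after every node so a crash loses at most one step.
        checkpoints[run_id] = {"done": list(done), "state": dict(state)}
    return state
```

Because completed nodes are skipped on resume, each step runs at most once per successful checkpoint; steps with side effects still need their own idempotency keys in case a crash lands between the side effect and the checkpoint write.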

9. Model Gateway

The model gateway centralizes model usage.

Responsibilities:

provider allowlist;
model allowlist;
data classification policy;
routing;
fallback;
timeout;
retry;
token/cost tracking;
prompt version tracing;
structured output enforcement;
redaction;
audit.

Do not scatter direct provider calls across code.
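The sketch below illustrates two of the responsibilities listed above, the model allowlist and fallback, in one place. It assumes every provider client has been wrapped into a simple `prompt -> text` callable; `ModelGateway`, `PolicyError`, and the field names are hypothetical, and no specific provider SDK is implied.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed wrapper shape for any provider client.
ProviderCall = Callable[[str], str]


class PolicyError(Exception):
    """Raised when a request violates gateway policy."""


@dataclass
class ModelGateway:
    providers: dict[str, ProviderCall]   # model name -> callable
    allowed_models: set[str]             # model allowlist
    fallbacks: dict[str, str]            # primary model -> fallback model

    def complete(self, model: str, prompt: str) -> tuple[str, str]:
        """Return (model_used, text), trying one fallback on failure."""
        if model not in self.allowed_models:
            raise PolicyError(f"model not allowed: {model}")
        try:
            return model, self.providers[model](prompt)
        except Exception:
            fallback = self.fallbacks.get(model)
            # The fallback must pass the same allowlist as the primary.
            if fallback is None or fallback not in self.allowed_models:
                raise
            return fallback, self.providers[fallback](prompt)
```

Returning the model that actually served the call lets tracing and cost tracking attribute the request correctly when a fallback is used.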

10. Retrieval Service

Retrieval service owns:

query normalization;
query planning;
security filters;
index selection;
lexical/vector/hybrid search;
reranking;
context candidate packaging;
retrieval trace.

It should not own final answer generation unless you intentionally combine the two.

Separating retrieval makes RAG failures easier to debug and evaluate.

11. Ingestion Deployment

Ingestion is operationally heavy.

Stages:

fetch source;
canonicalize;
parse;
quality check;
chunk;
embed;
write chunk store;
write vector/search index;
run eval;
promote index.

Ingestion workers should be separate from API workers.

Why?

parsing/OCR can be CPU-heavy;
embedding can be rate-limited;
indexing can be slow;
failures need quarantine;
reprocessing should not affect user-facing latency.

12. Index Deployment

Indexes should have a lifecycle.

Deployment steps:

build shadow index;
validate metadata/ACL;
run retrieval evals;
compare against active index;
promote atomically;
monitor;
keep rollback index;
delete after retention.

Never overwrite the active index blindly.
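The shadow, promote, and rollback steps can be sketched as a small registry that only moves a pointer. `IndexRegistry` and its methods are hypothetical names, and the eval result is passed in as a boolean rather than computed here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexRegistry:
    """Tracks which index version serves traffic (hypothetical helper)."""
    active: Optional[str] = None
    previous: Optional[str] = None       # retained for rollback
    shadows: set = field(default_factory=set)

    def register_shadow(self, version: str) -> None:
        self.shadows.add(version)

    def promote(self, version: str, eval_passed: bool) -> None:
        # Promotion is a single pointer swap; the old index is not modified.
        if version not in self.shadows:
            raise ValueError(f"unknown shadow index: {version}")
        if not eval_passed:
            raise ValueError(f"eval gate failed for: {version}")
        self.shadows.discard(version)
        self.previous, self.active = self.active, version

    def rollback(self) -> None:
        if self.previous is None:
            raise ValueError("no rollback index retained")
        self.active, self.previous = self.previous, None
```

Since the previous index is never written to during promotion, rolling back is as cheap as promoting.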

13. Prompt Deployment

Prompts are deployable artifacts.

Prompt rollout should include:

version ID;
changelog;
eval result;
approved models;
rollout percentage;
rollback version;
owner.

Prompt rollout strategies:

all-at-once for low-risk;
canary for high-risk;
tenant-specific rollout;
shadow evaluation;
A/B test where appropriate.

A prompt change can be as risky as a code change.
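A canary rollout by percentage can be sketched with a stable hash, as below. The function names and the choice of SHA-256 are assumptions for illustration; the key could be a tenant ID or user ID.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a stable key (for example a tenant or user ID) to a bucket 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def select_prompt_version(
    key: str, stable: str, canary: str, canary_percent: int
) -> str:
    # The same key always lands in the same bucket, so a user does not flip
    # between versions from one request to the next.
    return canary if rollout_bucket(key) < canary_percent else stable
```

Setting `canary_percent` to 0 is then the rollback, and raising it gradually is the rollout.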

14. Tool Deployment

Tool deployment needs compatibility.

Before enabling a tool version:

schema validated;
authorization tests pass;
approval policy configured;
audit event emitted;
idempotency supported;
rollback/disable path exists;
model-facing description reviewed;
eval tests updated.

For high-risk tools, use feature flags.

tool.update_case_status.v2.enabled = false

Enable only after approval.
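A flag lookup in the naming style shown above might look like the following sketch. The helper name `is_tool_enabled` and the dict-based flag source are assumptions; the important choice is that a missing flag means disabled.

```python
def is_tool_enabled(flags: dict[str, bool], tool: str, version: str) -> bool:
    """Look up a flag such as tool.update_case_status.v2.enabled.

    Missing flags are treated as disabled, so a new tool version stays off
    until someone explicitly turns it on.
    """
    return flags.get(f"tool.{tool}.{version}.enabled", False)
```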

15. Agent Workflow Deployment

Agent workflows are stateful.

Changing a workflow while runs are active is tricky.

Strategies:

15.1 Versioned Workflow

Each run stores its workflow version.

from pydantic import BaseModel


class WorkflowRun(BaseModel):
    run_id: str
    workflow_name: str
    workflow_version: str  # pinned when the run is created
    state: dict[str, object]

Existing runs continue with the old version.

New runs use the new version.
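Version pinning can be sketched as a handler lookup keyed by the version stored on the run. This example uses plain dicts to stay dependency-free; `HANDLERS`, `CURRENT_VERSION`, `start_run`, `resume_run`, and the `case_review` workflow name are all hypothetical.

```python
from typing import Callable

# Hypothetical registry: (workflow_name, workflow_version) -> handler.
Handler = Callable[[dict], dict]
HANDLERS: dict[tuple[str, str], Handler] = {
    ("case_review", "v1"): lambda state: {**state, "handled_by": "v1"},
    ("case_review", "v2"): lambda state: {**state, "handled_by": "v2"},
}
CURRENT_VERSION = {"case_review": "v2"}


def start_run(workflow_name: str) -> dict:
    # New runs are pinned to whatever version is current at creation time.
    return {
        "workflow_name": workflow_name,
        "workflow_version": CURRENT_VERSION[workflow_name],
        "state": {},
    }


def resume_run(run: dict) -> dict:
    # Resuming uses the version stored on the run, not the current one.
    handler = HANDLERS[(run["workflow_name"], run["workflow_version"])]
    return handler(run["state"])
```

The old handler has to stay deployed until the last run pinned to it has finished, which is the cost of this strategy.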

15.2 Migration

Migrate old state to new schema if needed.

Use only when necessary.

15.3 Drain

Stop accepting new runs for the old version and let active runs finish.

For high-risk workflows, versioned runs are safest.

16. Configuration Management

Configuration includes:

model routing;
prompt versions;
index versions;
tool enablement;
feature flags;
timeouts;
retry policy;
token budgets;
cost budgets;
provider policy;
risk thresholds.

Configuration should be:

versioned;
environment-specific;
reviewed;
observable;
rollback-capable.

Avoid changing critical AI behavior through untracked environment variables.

17. Secret Management

Secrets include:

model provider API keys;
database credentials;
OAuth tokens;
tool credentials;
webhook secrets;
encryption keys;
signing keys.

Rules:

never put secrets in prompts;
never log secrets;
use secret manager;
rotate credentials;
use least privilege;
separate dev/staging/prod;
short-lived tokens where possible;
restrict worker access by need.

A model should never see API keys.

18. Environment Separation

Use separate environments:

local;
development;
staging;
production;
regulated/sensitive environment if needed.

Staging should have:

safe test data;
fake or approved model providers;
test indexes;
test tools;
sandbox side effects;
eval datasets.

Do not test destructive agent tools against production systems.

19. Health Checks

Use health checks carefully.

19.1 Liveness

Is the process alive?

GET /health/live

The liveness check should not depend on calls to external services.

19.2 Readiness

Can the service handle traffic?

GET /health/ready

May check:

database connectivity;
required config loaded;
model gateway policy loaded;
index version available;
queue connection.

19.3 Dependency Health

For dashboards, check:

model provider status;
vector DB;
search backend;
queue;
object storage;
tool APIs.

Do not make liveness fail because the model provider is temporarily down.

That can cause restart storms.
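The split between liveness and readiness can be sketched framework-free as two functions. The names and the `(status, body)` return shape are assumptions; in a web framework these would sit behind `/health/live` and `/health/ready`.

```python
from typing import Callable


def liveness() -> dict:
    # Liveness only proves the process can respond. No external calls.
    return {"status": "alive"}


def readiness(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run cheap dependency checks and return (http_status, body).

    `checks` maps a name such as "database" or "index_version" to a cheap
    callable. The model provider is deliberately not part of this set.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False   # a failing check counts as not ready
    status = 200 if all(results.values()) else 503
    return status, results
```

A failed readiness check only takes the instance out of rotation; it does not restart it, which is why dependency checks belong here and not in liveness.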

20. Graceful Shutdown

Workers need graceful shutdown.

On shutdown:

stop accepting new work;
finish current safe step or checkpoint;
release locks;
requeue unfinished work;
flush traces;
close clients.

API streaming endpoints should handle disconnects.

Long-running jobs should be resumable.
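A minimal sketch of the "stop accepting new work, finish the current step" behavior, using the standard library `queue` and `threading` modules. The function name `worker_loop` and the 0.1 second poll interval are illustrative choices.

```python
import queue
import threading
from typing import Callable


def worker_loop(
    jobs: queue.Queue,
    stop: threading.Event,
    handle: Callable[[str], None],
) -> int:
    """Process jobs until `stop` is set; return the number handled.

    The flag is checked between jobs, so the job in progress always finishes
    (or checkpoints) before the loop exits. Unstarted jobs stay in the queue.
    """
    handled = 0
    while not stop.is_set():
        try:
            job = jobs.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(job)
        handled += 1
    return handled
```

In a real process, a SIGTERM handler registered with `signal.signal` would call `stop.set()`, and the code after the loop would flush traces and close clients.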

21. Scaling

Scale components differently.

Component	Scaling Driver
API	requests/sec, streaming connections
retrieval	query volume, index latency
model gateway	model call concurrency
workers	queue depth, task duration
ingestion	document volume
eval runners	release/nightly workload
vector DB	corpus size, QPS
cache	hit rate, memory

Do not scale everything together.

Agent workers may need strict concurrency limits to avoid cost spikes.

22. Autoscaling Signals

Useful signals:

CPU/memory for parsing workers;
request rate for API;
queue depth for workers;
oldest message age;
model concurrency saturation;
p95 latency;
vector DB latency;
cost rate;
provider rate-limit events.

For AI workloads, CPU alone is often insufficient.

Queue depth and provider limits matter.
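As a rough illustration of scaling on queue depth instead of CPU, the sketch below computes how many workers would drain the current backlog within a target time. The function name, parameters, and default bounds are assumptions, not recommended values.

```python
import math


def desired_workers(
    queue_depth: int,
    avg_task_seconds: float,
    target_drain_seconds: float,
    min_workers: int = 1,
    max_workers: int = 20,
) -> int:
    """Workers needed to drain the current backlog within the target time.

    `max_workers` doubles as a cost and provider rate-limit ceiling.
    All thresholds here are illustrative assumptions.
    """
    needed = math.ceil(queue_depth * avg_task_seconds / target_drain_seconds)
    return max(min_workers, min(max_workers, needed))
```

For example, 100 queued tasks at 30 seconds each with a 600 second drain target gives `ceil(3000 / 600) = 5` workers, while a very large backlog is capped at `max_workers` so that provider limits and cost budgets still hold.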

23. Kubernetes Deployment Concepts

If using Kubernetes:

Deployment manages rollout of stateless API/worker pods;
rolling updates can update pods gradually;
rollout history enables rollback;
readiness probes prevent unready pods from receiving traffic;
ConfigMaps/Secrets hold configuration/secrets;
Horizontal Pod Autoscaler can scale pods based on metrics.

But Kubernetes does not solve AI correctness.

It only manages runtime infrastructure.

You still need:

eval gates;
prompt/index versioning;
model/provider policy;
trace/audit;
workflow checkpoints.

24. Rollout Strategies

24.1 Rolling Update

Gradually replace instances.

Good for low/medium-risk code changes.

24.2 Blue-Green

Run old and new environments side by side, then switch traffic.

Good for major changes.

24.3 Canary

Send a small percentage of traffic to the new version.

Good for model/prompt changes.

24.4 Shadow

Run the new system in parallel without user-visible output.

Good for retrieval/index/model evaluation.

AI-specific rollout should include quality metrics, not only error rate.

25. Rollback

Rollback units:

code version;
prompt version;
model route;
index version;
tool version;
workflow version;
configuration;
feature flag.

A good deployment can roll back each unit independently.

Example:

Problem:
New prompt causes citation failures.

Rollback:
prompt.policy_answer.v6 -> prompt.policy_answer.v5

No code rollback needed.

This is why versioning matters.
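Independent rollback can be sketched as one version pin per artifact, each with its own history. `VersionPins` and the artifact key `index.policy` are hypothetical; `prompt.policy_answer` follows the example above.

```python
from dataclasses import dataclass, field


@dataclass
class VersionPins:
    """Active version per artifact, with history for rollback (sketch)."""
    active: dict = field(default_factory=dict)
    history: dict = field(default_factory=dict)

    def deploy(self, artifact: str, version: str) -> None:
        # Remember the version being replaced so it can be restored later.
        if artifact in self.active:
            self.history.setdefault(artifact, []).append(self.active[artifact])
        self.active[artifact] = version

    def rollback(self, artifact: str) -> str:
        # Only the named artifact changes; every other pin is untouched.
        # Raises KeyError or IndexError if there is nothing to roll back to.
        previous = self.history[artifact].pop()
        self.active[artifact] = previous
        return previous
```

Rolling back `prompt.policy_answer` from v6 to v5 leaves the index pin and the deployed code exactly where they were.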

26. Database and Store Choices

Common stores:

Store	Purpose
Postgres	app state, workflows, metadata
Redis	cache, rate limits, ephemeral state
Object storage	source documents, artifacts
Vector DB/Search	retrieval indexes
Queue	async tasks
Audit store	immutable/auditable events
Time-series DB	metrics
Trace backend	observability

Choose based on access pattern and governance.

Do not keep audit records only in ephemeral logs.

Do not store source-of-truth documents only in the vector DB.

27. Queue Operations

Queue operational requirements:

visibility timeout;
retry count;
dead-letter queue;
priority;
idempotency;
monitoring;
poison-message handling;
backpressure.

Metrics:

queue depth;
oldest message age;
processing rate;
failure rate;
DLQ count;
worker concurrency;
retry count.

A queue is not a substitute for task state.

State should live in a durable store.
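Retry counting and dead-letter handling can be sketched with an in-memory deque, as below. Managed queues typically provide this through configuration; `drain_with_dlq`, the message shape, and the default of three attempts are assumptions for illustration.

```python
from collections import deque
from typing import Callable


def drain_with_dlq(
    messages: list[dict],
    handle: Callable[[dict], None],
    max_attempts: int = 3,
) -> list[dict]:
    """Process messages, retrying failures and returning the dead letters.

    Each message is a dict with a unique "id". A message that fails
    `max_attempts` times is moved aside instead of blocking the queue.
    """
    pending = deque(messages)
    attempts: dict[str, int] = {}
    dead_letters: list[dict] = []
    while pending:
        msg = pending.popleft()
        attempts[msg["id"]] = attempts.get(msg["id"], 0) + 1
        try:
            handle(msg)
        except Exception:
            if attempts[msg["id"]] >= max_attempts:
                dead_letters.append(msg)     # poison message: stop retrying
            else:
                pending.append(msg)          # requeue for another attempt
    return dead_letters
```

Because a message can be delivered more than once, `handle` has to be idempotent, which is why idempotency appears in the requirements list above.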

28. Runtime Security

Deployment security basics:

TLS;
authN/authZ;
network segmentation;
least-privilege service accounts;
secret management;
egress controls;
dependency scanning;
container image scanning;
runtime policy;
audit logs;
WAF/API gateway where appropriate;
tool network allowlist.

AI-specific additions:

model provider allowlist;
prompt injection monitoring;
tool kill switches;
vector index access controls;
trace redaction;
eval dataset access controls;
memory store policy.

29. Runtime Observability

Production dashboard should show:

API p95/p99 latency;
model latency/error/cost;
retrieval latency/no-result/stale;
tool success/failure;
worker queue depth;
agent max-step failures;
approval backlog;
eval gate status;
prompt/index/model versions;
cost by tenant/feature;
security alerts;
redaction failures.

Every dashboard should link to traces and runbooks.

30. SLOs

Define SLOs by feature.

Example RAG Q&A:

Availability: 99.5%
p95 latency: <= 6s
citation support failure: <= 2%
unauthorized retrieval: 0

Example case-review workflow:

Task creation p95: <= 500ms
Workflow completion within SLA: >= 98%
approval bypass: 0
duplicate side effects: 0
audit event completeness: 100%

SLOs should include quality/safety where relevant.

Pure uptime is not enough.
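An SLO check that treats quality and safety metrics the same way as latency and availability might look like this sketch. The `Slo` dataclass, the metric names, and the rule that a missing metric counts as a violation are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float
    higher_is_better: bool


def slo_violations(slos: list[Slo], observed: dict[str, float]) -> list[str]:
    """Return the names of SLOs whose observed value misses the target.

    A metric missing from `observed` counts as a violation, because an
    unmeasured guarantee cannot be claimed.
    """
    violations = []
    for slo in slos:
        value = observed.get(slo.name)
        if value is None:
            violations.append(slo.name)
        elif slo.higher_is_better and value < slo.target:
            violations.append(slo.name)
        elif not slo.higher_is_better and value > slo.target:
            violations.append(slo.name)
    return violations
```

With the RAG Q&A targets above, an observed p95 latency of 7.2s would be reported as a violation even if availability and citation support are within target.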

31. Incident Operations

Incident types:

provider outage;
vector DB outage;
index corruption;
prompt regression;
model behavior regression;
tool side-effect bug;
agent loop cost spike;
unauthorized retrieval;
stale policy answer;
queue backlog;
trace/audit outage.

Runbook should include:

owner;
detection signal;
immediate mitigation;
rollback option;
data/audit preservation;
communication path;
regression test requirement.

32. Operational Admin Console

A serious AI app may need admin tooling for:

active model routes;
prompt versions;
index versions;
tool enablement;
stuck workflows;
approval queue;
DLQ inspection;
trace lookup;
eval reports;
tenant budget;
memory records;
source ingestion status;
incident controls.

Admin actions must be audited.

33. Deployment Checklist

Before production:

API service deployed with health checks;
worker service deployed;
queue configured with DLQ;
checkpoint store configured;
model gateway policy configured;
provider credentials in secret manager;
retrieval index active and versioned;
prompt versions approved;
tool registry configured;
eval gates passed;
observability dashboards ready;
alerts configured;
rollback plan tested;
audit events verified;
redaction tested;
load test completed;
incident runbook written.

34. Case-Management Deployment Blueprint

For regulated case-management AI:

Rules:

read tools available to analyst;
write tools require workflow approval;
high-risk recommendations require supervisor approval;
policy index must be active/current;
audit events required for recommendations;
final case action is not model-autonomous.

35. Anti-Patterns

Anti-Pattern	Why It Fails
Everything in web request	timeouts and poor recovery
No model gateway	uncontrolled provider/model use
Direct tool calls from model output	unsafe side effects
Active index overwritten	no rollback
Prompt changes without version	no audit/rollback
Workers without checkpoints	lost long-running tasks
Queue without DLQ	invisible poison messages
Kubernetes liveness checks call model provider	restart storms
No feature flags	cannot disable bad tool/prompt
No staging eval	regressions reach production
Logs as audit	weak accountability
No admin tooling	operations become database surgery

36. Practice: Deployment Architecture Review

Design deployment for your practice RAG + agent system.

Include:

service topology;
process types;
model gateway;
retrieval service;
ingestion workers;
agent workers;
queue and checkpoint store;
vector/search index lifecycle;
prompt/tool/workflow versioning;
rollout/rollback plan;
health checks;
autoscaling signals;
secrets;
observability;
SLOs;
incident runbooks.

Deliverable:

Deployment Architecture Review

1. System topology
2. Service responsibilities
3. Runtime processes
4. Data stores
5. Queue design
6. Deployment strategy
7. Rollback strategy
8. Scaling plan
9. Security controls
10. SLOs and alerts
11. Operational runbooks

37. Engineering Heuristics

Separate API, workers, ingestion, retrieval, and model gateway boundaries.
Keep long-running work out of HTTP handlers.
Use queues and checkpoints for durable agent tasks.
Treat indexes, prompts, tools, and workflows as versioned deployable artifacts.
Roll out AI behavior changes with eval gates.
Keep rollback independent where possible.
Do not overwrite active indexes blindly.
Use model gateway for provider policy and cost control.
Use readiness checks carefully.
Do not make liveness depend on external AI providers.
Scale by workload type, not only CPU.
Trace every runtime boundary.
Keep secrets out of prompts/logs.
Build admin operations before production incidents.
Include quality and safety in SLOs.

38. Summary

Deployment architecture makes AI behavior operational.

The core invariant:

Runtime systems must enforce the same boundaries that the design, security model, and governance process require.

A production AI app needs:

service boundaries;
model gateway;
retrieval service;
workers;
queues;
checkpoints;
index lifecycle;
prompt/tool/workflow versioning;
secrets;
rollout/rollback;
scaling;
observability;
SLOs;
incident operations.

In the next part, we move to AI CI/CD and Readiness Gates.