Reliability Patterns for AI Systems
Learn Python AI Application Engineer - Part 028
Reliability patterns for AI systems: timeout budgets, retries with jitter, fallback, circuit breakers, bulkheads, rate limits, backpressure, idempotency, graceful degradation, chaos testing, and operational runbooks.
1. Why This Part Matters
AI applications are distributed systems.
They depend on:
- model providers;
- embedding providers;
- vector databases;
- rerankers;
- tool APIs;
- workflow stores;
- queues;
- databases;
- object storage;
- identity systems;
- observability systems.
Every dependency can fail.
AI adds extra failure modes:
- invalid structured output;
- prompt injection;
- tool hallucination;
- long context latency;
- token budget overflow;
- model rate limits;
- model behavior drift;
- agent loops;
- retry storms;
- runaway cost;
- partial evidence;
- unsafe fallback.
The central invariant:
AI reliability is not achieved by hoping the model behaves. It is achieved by bounding failure, cost, latency, and authority.
This part gives you reliability patterns for production AI apps.
2. Target Skill
After this part, you should be able to:
- define timeout budgets across AI pipeline stages;
- implement retries with backoff, jitter, and retry budgets;
- use circuit breakers for failing model/tool dependencies;
- design fallback and graceful degradation;
- apply bulkheads to isolate tenants, tools, models, and queues;
- use rate limits and backpressure to prevent overload;
- make tool side effects idempotent;
- handle partial failures in RAG and agent workflows;
- build failure-mode runbooks;
- test reliability through chaos scenarios.
3. AI Reliability Architecture
Reliability is not one pattern.
It is a system of controls.
4. Kaufman Deconstruction
Break AI reliability into subskills.
Practice loop:
- create a normal pipeline;
- make model provider timeout;
- observe behavior;
- add timeout;
- add fallback;
- add retry budget;
- add circuit breaker;
- add metric;
- add regression test.
5. Reliability Goals
Before patterns, define goals.
Examples:
RAG answer endpoint:
- p95 latency <= 5s
- timeout <= 10s
- availability >= 99.5%
- no unauthorized retrieval
- no high-risk action without approval
- degraded answer allowed when reranker unavailable
- no answer when evidence insufficient
Case-review agent:
- task creation latency <= 500ms
- long-running completion within SLA
- resume after worker crash
- no duplicate side effects
- high-risk actions approval-gated
Reliability is contextual.
A chat response and a case-closing workflow need different patterns.
6. Timeout Budgets
Timeouts prevent resource exhaustion.
A request needs a total budget and stage budgets.
| Stage | Budget |
|---|---|
| API/auth | 200ms |
| query planning | 300ms |
| embedding | 500ms |
| retrieval | 700ms |
| reranking | 1,000ms |
| generation | 5,000ms |
| validation | 800ms |
| response formatting | 100ms |
| total | 8,600ms |
Do not let every dependency use its own large default timeout.
from pydantic import BaseModel


class TimeoutBudget(BaseModel):
    total_ms: int
    query_planning_ms: int
    embedding_ms: int
    retrieval_ms: int
    rerank_ms: int
    generation_ms: int
    validation_ms: int
    tool_ms: int
Pass the timeout budget through the pipeline so that every stage draws from the same request-level allowance.
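A quick arithmetic check keeps the stage budgets honest. The sketch below uses the figures from the table above and the 10s timeout from the RAG endpoint goals; the names `STAGE_BUDGETS_MS`, `REQUEST_TIMEOUT_MS`, and `check_budget` are illustrative, not part of any library.

```python
# Stage budgets copied from the table above (milliseconds).
STAGE_BUDGETS_MS = {
    "api_auth": 200,
    "query_planning": 300,
    "embedding": 500,
    "retrieval": 700,
    "reranking": 1_000,
    "generation": 5_000,
    "validation": 800,
    "response_formatting": 100,
}
REQUEST_TIMEOUT_MS = 10_000  # hard timeout from the RAG endpoint goals


def check_budget(stage_budgets_ms: dict[str, int], total_ms: int) -> int:
    """Return the unallocated headroom, or raise if stages overrun the total."""
    allocated = sum(stage_budgets_ms.values())
    if allocated > total_ms:
        raise ValueError(
            f"Stage budgets ({allocated}ms) exceed total ({total_ms}ms)."
        )
    return total_ms - allocated
```

Running a check like this at configuration load time catches a stage budget that was raised without adjusting the total.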
7. Deadline Propagation
A deadline is better than independent timeouts.
from time import monotonic


class Deadline:
    def __init__(self, timeout_seconds: float) -> None:
        self.expires_at = monotonic() + timeout_seconds

    def remaining_seconds(self) -> float:
        return max(0.0, self.expires_at - monotonic())

    def expired(self) -> bool:
        return self.remaining_seconds() <= 0
Usage:
async def call_model_with_deadline(model: "ModelClient", prompt: str, deadline: Deadline) -> object:
    remaining = deadline.remaining_seconds()
    if remaining <= 0:
        raise TimeoutError("No time left for model call.")
    # Cap the call at the stage budget (5s) or the remaining request time, whichever is smaller.
    return await model.generate(prompt, timeout_seconds=min(remaining, 5.0))
Deadline propagation prevents late stages from consuming time the request no longer has.
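One way to combine stage budgets with a shared deadline is to cap every stage at the smaller of the two. The following is a minimal sketch under that assumption; `run_stage`, `fake_retrieval`, and `slow_generation` are hypothetical names, and the stages are simulated with `asyncio.sleep`.

```python
import asyncio
from time import monotonic


class Deadline:
    """Request-wide deadline based on a monotonic clock."""

    def __init__(self, timeout_seconds: float) -> None:
        self.expires_at = monotonic() + timeout_seconds

    def remaining_seconds(self) -> float:
        return max(0.0, self.expires_at - monotonic())


async def run_stage(name, coro_fn, stage_budget_s: float, deadline: Deadline):
    """Run one stage, capped by its own budget and by the request deadline."""
    timeout = min(stage_budget_s, deadline.remaining_seconds())
    if timeout <= 0:
        raise TimeoutError(f"No time left for stage {name!r}.")
    try:
        return await asyncio.wait_for(coro_fn(), timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise TimeoutError(f"Stage {name!r} exceeded {timeout:.3f}s.") from exc


async def fake_retrieval() -> list:
    await asyncio.sleep(0.01)  # simulated fast dependency
    return ["doc-1", "doc-2"]


async def slow_generation() -> str:
    await asyncio.sleep(1.0)  # simulated slow model call
    return "answer"


async def demo():
    deadline = Deadline(timeout_seconds=0.2)
    docs = await run_stage("retrieval", fake_retrieval, 0.1, deadline)
    try:
        # The stage budget is 5s, but the request only has about 0.19s left.
        answer = await run_stage("generation", slow_generation, 5.0, deadline)
    except TimeoutError:
        answer = "timed out"
    return docs, answer
```

In this sketch the generation stage is cut off by the request deadline, not by its own 5s budget, which is the behavior deadline propagation is meant to guarantee.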
8. Retry Policy
Retries can help with transient failures.
Retries can also amplify outages.
Use retries only when:
- failure is transient;
- operation is safe or idempotent;
- retry budget exists;
- backoff and jitter are applied;
- dependency can handle retry load.
Do not retry:
- authorization denied;
- validation error;
- insufficient evidence;
- high-risk side effect without idempotency;
- permanent failure;
- policy violation.
class RetryPolicy(BaseModel):
    max_attempts: int
    base_delay_ms: int
    max_delay_ms: int
    jitter: bool
    retryable_errors: list[str]
9. Exponential Backoff With Jitter
import asyncio
import random


def compute_backoff_ms(attempt: int, base_ms: int, max_ms: int, jitter: bool = True) -> int:
    # Exponential growth capped at max_ms: base, 2x base, 4x base, ...
    delay = min(max_ms, base_ms * (2 ** (attempt - 1)))
    if jitter:
        # Full jitter: pick uniformly between 0 and the capped delay.
        return random.randint(0, delay)
    return delay


async def retry_async(operation, policy: RetryPolicy):
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1.")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return await operation()
        except Exception as exc:
            # Non-retryable errors propagate immediately.
            if type(exc).__name__ not in policy.retryable_errors:
                raise
            # The final attempt propagates the last error.
            if attempt == policy.max_attempts:
                raise
            delay_ms = compute_backoff_ms(
                attempt=attempt,
                base_ms=policy.base_delay_ms,
                max_ms=policy.max_delay_ms,
                jitter=policy.jitter,
            )
            await asyncio.sleep(delay_ms / 1000)
Jitter prevents synchronized retry waves.
10. Retry Budgets
A retry budget limits retry volume.
Example:
Retries may not exceed 10% of original request volume per minute.
Without retry budgets, failing dependencies can receive more traffic during failure.
AI systems are especially vulnerable because model calls can be expensive.
Track:
- retry count by dependency;
- retry success rate;
- retry cost;
- retry latency;
- retry amplification factor.
If retries rarely succeed, reduce or disable them.
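A retry budget such as the 10% rule above can be approximated with a sliding window. This is a minimal in-process sketch; `RetryBudget` is a hypothetical class, the clock is injectable so the behavior can be tested, and a multi-worker deployment would need shared state instead.

```python
from collections import deque
from time import monotonic


class RetryBudget:
    """Allow retries only while they stay under a ratio of recent requests."""

    def __init__(self, ratio: float = 0.1, window_seconds: float = 60.0,
                 clock=monotonic) -> None:
        self.ratio = ratio
        self.window_seconds = window_seconds
        self.clock = clock
        self.requests = deque()  # timestamps of original requests
        self.retries = deque()   # timestamps of granted retries

    def _trim(self, events: deque) -> None:
        # Drop events that fell out of the sliding window.
        cutoff = self.clock() - self.window_seconds
        while events and events[0] < cutoff:
            events.popleft()

    def record_request(self) -> None:
        self.requests.append(self.clock())

    def try_acquire_retry(self) -> bool:
        """Return True and record the retry if the budget allows it."""
        self._trim(self.requests)
        self._trim(self.retries)
        if len(self.retries) + 1 > self.ratio * len(self.requests):
            return False
        self.retries.append(self.clock())
        return True
```

A caller would check `try_acquire_retry()` before sleeping for backoff and give up immediately when it returns `False`, so a failing dependency does not receive extra traffic.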
11. Circuit Breaker
A circuit breaker stops calling a failing dependency temporarily.
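States: closed (requests pass through), open (requests are rejected until the cooldown elapses), and half-open (trial requests are allowed; a success closes the circuit, a failure reopens it).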
Simple implementation:
from enum import Enum
from time import monotonic


class CircuitState(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, *, failure_threshold: int, cooldown_seconds: float) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.state = CircuitState.CLOSED
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # Compare against None explicitly: a timestamp of 0.0 is still valid.
            if self.opened_at is not None and monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        # HALF_OPEN: let trial requests through.
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = CircuitState.CLOSED
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.opened_at = monotonic()
Use circuit breakers for:
- model provider;
- embedding provider;
- vector database;
- reranker;
- external tools;
- notification systems.
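A breaker only helps if every call to the dependency goes through it. The sketch below shows one possible wrapper; `call_with_breaker` and `CircuitOpenError` are hypothetical names, and `MiniBreaker` is a condensed stand-in for the breaker above with an injectable clock so the example runs on its own.

```python
import asyncio
from time import monotonic


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without touching the dependency."""


class MiniBreaker:
    """Condensed stand-in for a circuit breaker, with an injectable clock."""

    def __init__(self, failure_threshold: int, cooldown_seconds: float,
                 clock=monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial request through (half-open behavior).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


async def call_with_breaker(breaker, operation):
    """Fail fast when the breaker is open; otherwise record the outcome."""
    if not breaker.allow_request():
        raise CircuitOpenError("Dependency circuit is open.")
    try:
        result = await operation()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result


async def scenario(breaker, clock_state):
    """Two failures open the circuit; a success after the cooldown closes it."""
    async def failing():
        raise ConnectionError("provider down")

    async def healthy():
        return "ok"

    outcomes = []
    for _ in range(3):
        try:
            await call_with_breaker(breaker, failing)
        except CircuitOpenError:
            outcomes.append("rejected")
        except ConnectionError:
            outcomes.append("failed")
    clock_state[0] = 31.0  # move the fake clock past the cooldown
    outcomes.append(await call_with_breaker(breaker, healthy))
    return outcomes
```

The third call is rejected without reaching the dependency, which is what protects a struggling provider from additional load.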
12. Fallback
Fallback provides alternative behavior when a component fails.
| Failure | Fallback |
|---|---|
| reranker down | use fused retrieval ranking |
| vector search down | lexical search only |
| embedding provider down | cached embedding or lexical fallback |
| primary model down | secondary model |
| large model timeout | smaller model with caveat |
| validation judge down | deterministic checks only + lower confidence |
| tool down | explain unavailable or queue task |
| agent worker down | resume later from checkpoint |
Fallback must be safe.
Do not fall back from "policy-grounded answer" to "model guesses from memory".
13. Safe Fallback Rules
A fallback is safe when it preserves core invariants.
For RAG:
- still uses authorized evidence;
- does not invent missing facts;
- lowers confidence if quality reduced;
- may refuse if evidence insufficient.
For tools:
- does not perform unapproved side effects;
- preserves idempotency;
- does not bypass authorization.
For models:
- supports required output schema;
- respects safety requirements;
- stays within data residency constraints.
Fallback policy:
class FallbackPolicy(BaseModel):
    component: str
    failure_type: str
    fallback_component: str | None = None
    degraded_status: str
    user_message: str
    allowed: bool
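Applying a fallback at call time might look like the following sketch, using the reranker-to-fused-ranking row from the fallback table. `StageResult` and `with_fallback` are hypothetical helpers; the point is that the result carries an explicit degraded flag instead of silently swapping components.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StageResult:
    value: object
    degraded: bool  # True when the fallback produced the value
    used: str       # name of the component that actually answered


async def with_fallback(primary, fallback, *, primary_name: str,
                        fallback_name: str) -> StageResult:
    """Try the primary; on failure use the fallback and mark the result degraded."""
    try:
        return StageResult(await primary(), degraded=False, used=primary_name)
    except Exception:
        # Assumes policy has already approved this fallback for this failure.
        return StageResult(await fallback(), degraded=True, used=fallback_name)


async def rerank_unavailable() -> list:
    raise TimeoutError("reranker timed out")  # simulated outage


async def fused_ranking() -> list:
    return ["doc-2", "doc-1"]  # simulated fused retrieval order


def demo() -> StageResult:
    return asyncio.run(
        with_fallback(
            rerank_unavailable,
            fused_ranking,
            primary_name="reranker",
            fallback_name="fused_ranking",
        )
    )
```

Downstream code can then lower confidence or attach a caveat whenever `degraded` is true.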
14. Graceful Degradation
Graceful degradation means reduced capability instead of total failure.
Examples:
- answer without reranker but with citations;
- provide summary but not final recommendation;
- create draft but do not send;
- queue long-running task instead of synchronous response;
- ask user to retry external lookup;
- return "insufficient evidence" instead of guessing;
- route high-risk case to human.
Good degraded response:
I can retrieve policy evidence, but the case record service is currently unavailable. I cannot determine whether this specific case should escalate until case facts are available.
Bad degraded response:
It probably does not require escalation.
15. Bulkheads
Bulkheads isolate failures.
Examples:
- separate queues per tenant;
- separate worker pools for high-risk workflows;
- separate concurrency limits per model provider;
- separate index resources for critical tenants;
- separate tool execution pools;
- separate budget pools for eval vs production;
- separate low-priority batch work from user-facing requests.
Without bulkheads, a batch eval run can starve production chat.
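Within a single process, a simple bulkhead is a separate concurrency limit per pool. The sketch below assumes two hypothetical pools, `production_chat` and `batch_eval`, and uses `asyncio.Semaphore`; the `Bulkhead` class and its `peak` counter are illustrative.

```python
import asyncio


class Bulkhead:
    """Cap concurrent work per pool so one pool cannot starve another."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._limits = limits
        self._semaphores = {}  # created lazily, inside the running event loop
        self._active = {name: 0 for name in limits}
        self.peak = {name: 0 for name in limits}  # highest concurrency observed

    async def run(self, pool: str, operation):
        if pool not in self._semaphores:
            self._semaphores[pool] = asyncio.Semaphore(self._limits[pool])
        async with self._semaphores[pool]:
            self._active[pool] += 1
            self.peak[pool] = max(self.peak[pool], self._active[pool])
            try:
                return await operation()
            finally:
                self._active[pool] -= 1


async def demo() -> dict:
    bulkhead = Bulkhead({"production_chat": 4, "batch_eval": 1})

    async def work() -> None:
        await asyncio.sleep(0.01)  # simulated model call

    tasks = [bulkhead.run("batch_eval", work) for _ in range(5)]
    tasks += [bulkhead.run("production_chat", work) for _ in range(5)]
    await asyncio.gather(*tasks)
    return bulkhead.peak
```

Here the batch eval pool never runs more than one call at a time, no matter how many eval tasks are queued, so production chat keeps its own four slots.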
16. Rate Limiting and Backpressure
Rate limits protect:
- model provider quota;
- vector DB;
- tool APIs;
- tenant fairness;
- cost budget;
- downstream systems.
Backpressure tells upstream to slow down.
Signals:
- queue depth high;
- worker saturation;
- model provider rate limit;
- p95 latency rising;
- memory pressure;
- cost budget near limit;
- circuit breaker open.
Actions:
- reject low-priority requests;
- queue async tasks;
- reduce candidate_k;
- skip optional rerank;
- use smaller model;
- reduce max output tokens;
- shed load for non-critical features.
Backpressure prevents cascading failures.
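A token bucket is one common way to enforce a rate limit and produce a backpressure signal. This is a minimal single-process sketch; `TokenBucket` is a hypothetical class and the clock is injectable for testing.

```python
from time import monotonic


class TokenBucket:
    """Refill tokens at a fixed rate; each request spends tokens or is refused."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock=monotonic) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should queue, degrade, or reject
```

A refused acquire is the point where the actions listed above apply: queue the task, skip the optional rerank, or reject a low-priority request.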
17. Admission Control
Admission control decides whether to accept work.
class AdmissionDecision(BaseModel):
    allowed: bool
    reason: str
    mode: str  # normal, degraded, queued, rejected
Policy:
def admit_request(load: dict[str, float], risk_level: str) -> AdmissionDecision:
    if load["queue_depth"] > 10_000 and risk_level == "low":
        return AdmissionDecision(
            allowed=False,
            reason="System overloaded for low-priority requests.",
            mode="rejected",
        )
    if load["reranker_latency_p95"] > 2_000:
        return AdmissionDecision(
            allowed=True,
            reason="Reranker degraded.",
            mode="degraded",
        )
    return AdmissionDecision(allowed=True, reason="ok", mode="normal")
Do not accept work you cannot complete responsibly.
18. Idempotency
Idempotency prevents duplicate side effects.
Required for:
- retrying tool calls;
- resuming workflows;
- handling client retries;
- queue redelivery;
- external API uncertainty.
Idempotency key:
def make_idempotency_key(
    *,
    request_id: str,
    workflow_run_id: str,
    node_name: str,
    operation: str,
) -> str:
    return f"{request_id}:{workflow_run_id}:{node_name}:{operation}"
For high-risk operations, idempotency is mandatory but not sufficient.
Approval and revalidation are still required.
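The key only helps if something checks it before the side effect runs. The following is a minimal in-memory sketch of that check; `IdempotentExecutor` and `IdempotencyConflict` are hypothetical names, and a real store would need to be durable, shared across workers, and updated atomically.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """The same key was reused with a different payload."""


class IdempotentExecutor:
    """In-memory sketch: run a side effect at most once per idempotency key."""

    def __init__(self) -> None:
        self._records = {}  # key -> (payload fingerprint, stored result)

    def execute(self, key: str, payload: dict, side_effect):
        fingerprint = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key in self._records:
            stored_fingerprint, stored_result = self._records[key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(
                    f"Key {key!r} reused with a different payload."
                )
            return stored_result  # replay: do not run the side effect again
        result = side_effect(payload)
        self._records[key] = (fingerprint, result)
        return result


def demo():
    sent = []

    def send_notice(payload: dict) -> dict:
        sent.append(payload["case_id"])  # stands in for an external write
        return {"status": "sent"}

    executor = IdempotentExecutor()
    key = "req-1:run-9:notify:send_notice"
    first = executor.execute(key, {"case_id": "C-42"}, send_notice)
    replay = executor.execute(key, {"case_id": "C-42"}, send_notice)
    try:
        executor.execute(key, {"case_id": "C-43"}, send_notice)
        conflict = False
    except IdempotencyConflict:
        conflict = True
    return first, replay, sent, conflict
```

The replayed call returns the stored result without sending a second notice, and reusing the key with a different payload is rejected instead of being treated as a duplicate.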
19. Revalidation Before Side Effects
Long-running agents must revalidate before action.
Check:
- user still authorized;
- approval still valid;
- case status unchanged;
- source policy still active;
- idempotency key not already used for different payload;
- tool still allowed;
- risk classification unchanged.
Example:
async def preflight_case_update(state: "CaseWorkflowState") -> None:
    await assert_user_authorized(state.user_id, state.case_id)
    await assert_approval_valid(state.approval_id)
    await assert_case_version_current(state.case_id, state.case_version)
    await assert_policy_version_active(state.policy_version)
Do not act on stale long-running assumptions.
20. Model Routing and Fallback
Model routing improves reliability and cost.
Routes can depend on:
- risk level;
- task type;
- latency budget;
- model availability;
- cost budget;
- context length;
- structured output need;
- data policy.
Fallback model contract:
class ModelFallbackRule(BaseModel):
    primary_model: str
    fallback_model: str
    allowed_features: list[str]
    disallowed_risk_levels: list[str]
    requires_additional_validation: bool
Do not fall back to an unvalidated model for high-risk regulated answers.
21. Structured Output Repair
Invalid output is a reliability issue.
Pattern:
- call model with schema;
- validate output;
- if invalid, run one repair attempt;
- if still invalid, fail safely or fall back;
- record repair metric.
class RepairPolicy(BaseModel):
    max_attempts: int = 1
    fallback_on_failure: bool = False
Do not run infinite repair loops.
Track structured output failure rate.
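The repair pattern above can be sketched as a bounded loop. In this example the schema check is a hypothetical stand-in (a JSON object with `answer` and `citations` keys), and `scripted_model` fakes a model client by returning canned outputs; neither reflects a specific provider API.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"answer", "citations"}  # hypothetical output schema


def parse_answer(raw: str) -> Optional[dict]:
    """Return the parsed object if it satisfies the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def generate_with_repair(generate: Callable[[str], str], prompt: str,
                         max_repairs: int = 1):
    """Return (parsed_output_or_None, repair_attempts_used)."""
    raw = generate(prompt)
    parsed = parse_answer(raw)
    repairs = 0
    # Bounded loop: never more than max_repairs extra model calls.
    while parsed is None and repairs < max_repairs:
        repairs += 1
        repair_prompt = (
            f"{prompt}\n\nPrevious output was invalid:\n{raw}\n"
            "Return valid JSON only."
        )
        raw = generate(repair_prompt)
        parsed = parse_answer(raw)
    return parsed, repairs


def scripted_model(outputs: list) -> Callable[[str], str]:
    """Hypothetical stand-in for a model client: returns canned outputs in order."""
    remaining = iter(outputs)
    return lambda prompt: next(remaining)
```

The returned repair count is the value to export as the repair metric; a `None` result means the caller must fail safely or apply its fallback policy.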
22. RAG Partial Failure
RAG components can partially fail.
Examples:
- vector search works, lexical fails;
- lexical works, vector fails;
- reranker times out;
- one index unavailable;
- metadata filter service slow;
- evidence store missing some documents.
Policy should define:
- which sources are required;
- which are optional;
- whether answer can proceed;
- confidence/caveat;
- user message.
from typing import Literal


class RetrievalSourceRequirement(BaseModel):
    source_name: str
    required: bool
    failure_behavior: Literal["fail", "degrade", "omit", "ask_human"]
A case-specific answer may require both case facts and policy. If either is missing, it should not proceed.
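Resolving several source failures into one outcome could be sketched as follows. The severity ordering and the rule that a required source can never be omitted or degraded are assumptions made for this example, as are the names `SourceRequirement` and `resolve_outcome` and the sample source names.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class SourceRequirement:
    source_name: str
    required: bool
    failure_behavior: Literal["fail", "degrade", "omit", "ask_human"]


# Assumed ordering: the most severe behavior wins when several sources fail.
SEVERITY = {"omit": 0, "degrade": 1, "ask_human": 2, "fail": 3}


def resolve_outcome(requirements: list, failed_sources: set) -> str:
    """Return "proceed", or the most severe failure behavior that applies."""
    outcome = "proceed"
    worst = -1
    for req in requirements:
        if req.source_name not in failed_sources:
            continue
        behavior = req.failure_behavior
        # Assumption: a required source is never silently omitted or degraded.
        if req.required and behavior in ("omit", "degrade"):
            behavior = "fail"
        if SEVERITY[behavior] > worst:
            worst = SEVERITY[behavior]
            outcome = behavior
    return outcome


REQUIREMENTS = [
    SourceRequirement("policy_index", required=True, failure_behavior="fail"),
    SourceRequirement("case_facts", required=True, failure_behavior="fail"),
    SourceRequirement("prior_decisions", required=False, failure_behavior="degrade"),
    SourceRequirement("style_examples", required=False, failure_behavior="omit"),
]
```

With these requirements, losing only `prior_decisions` yields a degraded answer, while losing `case_facts` stops the answer regardless of what else is available.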
23. Agent Partial Failure
Agent workflows can degrade.
Examples:
- prior-decision search fails but policy/case facts are available;
- evidence summarizer times out;
- approval service unavailable;
- drafting model fails;
- validation model unavailable.
Define per-node failure behavior:
class NodeFailurePolicy(BaseModel):
    node_name: str
    retry_policy: RetryPolicy
    fallback_node: str | None = None
    required_for_completion: bool
    human_handoff_on_failure: bool
High-risk workflows should prefer human handoff over unsafe fallback.
24. Queue-Based Reliability
Use queues for long-running or expensive work.
Benefits:
- decouple API from processing;
- absorb bursts;
- retry safely;
- control concurrency;
- support priority;
- support dead-letter queues.
Queue rules:
- messages are idempotent;
- state is persisted outside queue;
- dead-letter failures are reviewed;
- priority prevents critical work starvation;
- retries are bounded.
25. Dead Letter Queue
A dead letter queue stores tasks that failed too many times.
class DeadLetterTask(BaseModel):
    run_id: str
    task_type: str
    failed_node: str
    error_type: str
    error_message: str
    retry_count: int
    last_checkpoint_id: str
    created_at: str
Operators should inspect DLQ and decide:
- retry after fix;
- cancel;
- manually complete;
- escalate incident;
- add regression test.
26. Load Shedding
Load shedding intentionally rejects or degrades requests.
Prioritize, from highest to lowest:
- safety-critical operations;
- contractual priority;
- interactive user requests;
- background jobs;
- eval/batch jobs;
- optional enrichment.
Load shedding is better than total collapse.
Return clear status:
The system is currently under heavy load. I can start this as a background task or you can retry later.
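A priority-based shedding rule might look like the sketch below. The class names mirror the priority list above read from highest to lowest, and the utilization thresholds are made-up values for illustration, not recommendations.

```python
# (utilization at or above which the class is shed, request class)
# Threshold values are illustrative assumptions.
SHED_THRESHOLDS = [
    (0.70, "optional_enrichment"),
    (0.80, "eval_batch"),
    (0.90, "background"),
    (0.97, "interactive"),
]


def should_admit(request_class: str, utilization: float) -> bool:
    """Shed the lowest-priority classes first as utilization rises."""
    for threshold, shed_class in SHED_THRESHOLDS:
        if request_class == shed_class:
            return utilization < threshold
    # Classes without a threshold (for example safety-critical work) are never shed.
    return True
```

At 85% utilization this rule rejects eval and enrichment work while still admitting interactive requests.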
27. Caching and Staleness
Caching improves reliability and latency.
Cache:
- query embeddings;
- retrieval results for public/static data;
- prompt compilation;
- tool metadata;
- model responses for deterministic low-risk tasks;
- eval outputs;
- source metadata.
Be careful caching:
- user-specific answers;
- permission-sensitive retrieval;
- case-specific data;
- generated recommendations;
- tool outputs with sensitive data.
Cache key must include:
- tenant;
- security context hash;
- index version;
- prompt version;
- model version;
- query normalization version.
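One way to build such a key is to hash a canonical JSON document containing every component. In this sketch the security context is reduced to a sorted set of role names, which is a simplifying assumption; `make_cache_key` and its parameters are illustrative.

```python
import hashlib
import json


def make_cache_key(
    *,
    tenant_id: str,
    roles: list,
    index_version: str,
    prompt_version: str,
    model_version: str,
    normalization_version: str,
    normalized_query: str,
) -> str:
    """Build a deterministic cache key from every component listed above."""
    # Assumption: the security context is fully described by the role set.
    security_context_hash = hashlib.sha256(
        json.dumps(sorted(set(roles))).encode("utf-8")
    ).hexdigest()
    parts = {
        "tenant": tenant_id,
        "security_context_hash": security_context_hash,
        "index_version": index_version,
        "prompt_version": prompt_version,
        "model_version": model_version,
        "normalization_version": normalization_version,
        "query": normalized_query,
    }
    # sort_keys makes the serialization, and therefore the hash, stable.
    canonical = json.dumps(parts, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the tenant and security context are part of the key, two users with different permissions can never read each other's cached retrieval results.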
Caching and fallback introduce staleness.
Rules:
- show caveat when using stale cache;
- set TTL by data sensitivity;
- invalidate on source update;
- avoid stale cache for high-risk decisions;
- revalidate before side effects.
28. Chaos Testing
Chaos testing intentionally injects failures.
Scenarios:
- model timeout;
- embedding provider failure;
- vector DB unavailable;
- reranker slow;
- tool API returns 500;
- tool API times out after side effect;
- queue worker crash;
- checkpoint store unavailable;
- approval service unavailable;
- high latency spike;
- malformed model output;
- prompt injection in retrieved doc.
Expected:
- no unauthorized data leak;
- no duplicate side effects;
- bounded retries;
- graceful degradation;
- trace captures failure;
- alert fires where appropriate.
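Failure injection can start as a thin wrapper around a dependency call. The sketch below is a hypothetical `FaultInjector` with a seeded random generator for repeatable runs; the reranker and the degraded path are simulated.

```python
import asyncio
import random


class FaultInjector:
    """Wrap an async operation and raise an injected error at a given rate."""

    def __init__(self, failure_rate: float, seed: int = 0,
                 error=ConnectionError) -> None:
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded for repeatable chaos runs
        self.error = error
        self.injected = 0

    def wrap(self, operation):
        async def wrapped(*args, **kwargs):
            if self._rng.random() < self.failure_rate:
                self.injected += 1
                raise self.error("chaos: injected failure")
            return await operation(*args, **kwargs)

        return wrapped


async def answer_with_chaos(failure_rate: float) -> str:
    async def rerank(docs):
        return sorted(docs)  # simulated reranker

    chaotic_rerank = FaultInjector(failure_rate=failure_rate).wrap(rerank)
    docs = ["doc-b", "doc-a"]
    try:
        ranked = await chaotic_rerank(docs)
        return f"normal:{ranked[0]}"
    except ConnectionError:
        # Degraded path: keep the original retrieval order.
        return f"degraded:{docs[0]}"
```

A chaos test then asserts the expectations listed above, for example that a failure rate of 1.0 always produces the degraded path and never an unhandled exception.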
29. Reliability Test Matrix
| Failure | Expected Behavior |
|---|---|
| model timeout | fallback or fail safely |
| invalid model output | repair once, then fail |
| vector search down | lexical fallback if safe |
| reranker down | use fused ranking |
| tool rate limit | retry with backoff or queue |
| external write uncertainty | idempotency prevents duplicate |
| worker crash | resume from checkpoint |
| approval service down | pause, do not act |
| max steps exceeded | stop safely |
| cost budget exceeded | stop/degrade |
| cache stale | avoid high-risk action |
| provider outage | circuit breaker opens |
30. Observability for Reliability
Reliability patterns need metrics.
Track:
- timeout rate by dependency;
- retry attempts by dependency;
- retry success rate;
- circuit breaker state;
- fallback rate;
- degraded response rate;
- queue depth;
- DLQ count;
- idempotency conflict count;
- duplicate side effect count;
- max-step failures;
- cost budget stops;
- load shedding count.
Trace every fallback.
A silent fallback is a hidden quality change.
31. Reliability Runbook
AI Reliability Incident Runbook
1. Identify failing dependency.
2. Check circuit breaker state.
3. Check retry rate and amplification.
4. Check queue depth and worker health.
5. Check fallback/degraded response rate.
6. Check cost and token spikes.
7. Check safety metrics.
8. Disable unsafe feature path if needed.
9. Switch model/provider/index if approved.
10. Drain or pause queues if overloaded.
11. Review DLQ.
12. Add regression/chaos test.
Every reliability alert should link to a runbook.
32. Case-Management Reliability
For regulated case-management workflows:
Must Fail Closed
- authorization failure;
- approval service unavailable for high-risk action;
- policy source unavailable;
- case record unavailable;
- citation validation failure for final recommendation.
May Degrade
- prior decision search unavailable;
- optional explanation style judge unavailable;
- reranker unavailable if evidence is still sufficient;
- summary enhancement unavailable.
Must Not Do
- close case without approval;
- send external notice during degraded mode;
- use stale policy without labeling it;
- retry destructive action without idempotency;
- hide missing evidence.
Reliability is part of regulatory defensibility.
33. Anti-Patterns
| Anti-Pattern | Why It Fails |
|---|---|
| No timeout | resource exhaustion |
| Retry everything | retry storm |
| Retry non-idempotent writes | duplicate side effects |
| Fallback to hallucination | unsafe output |
| No circuit breaker | cascading failure |
| No bulkheads | one feature starves all |
| No queue limits | memory/resource collapse |
| Cache without auth key | data leak |
| No degraded status | users trust weaker answer |
| No failure metrics | invisible instability |
| No chaos tests | failure paths untested |
| No runbook | slow incident response |
34. Practice: Build Reliability Harness
Take your RAG + agent app and inject failures.
Implement:
- timeout budget;
- retry with jitter;
- circuit breaker around model provider;
- fallback from reranker to fused retrieval;
- idempotent tool write;
- queue for long-running agent task;
- DLQ for repeated failure;
- graceful degraded response;
- reliability metrics;
- chaos test scenarios.
Test cases:
- model timeout;
- reranker timeout;
- vector DB failure;
- invalid structured output;
- tool rate limit;
- tool side-effect uncertainty;
- worker crash;
- approval service down;
- max steps exceeded;
- cost budget exceeded.
Deliverable:
Reliability Report
1. Timeout budget
2. Retry policy
3. Circuit breaker policy
4. Fallback matrix
5. Bulkhead design
6. Queue/DLQ design
7. Idempotency strategy
8. Chaos test results
9. Metrics and alerts
10. Runbook
35. Engineering Heuristics
- Every remote call needs a timeout.
- Use propagated deadlines, not only isolated timeouts.
- Retry only safe transient failures.
- Add jitter to retries.
- Use retry budgets.
- Use circuit breakers for failing dependencies.
- Make side effects idempotent.
- Do not fall back to unsupported answers.
- Treat degraded mode as a visible status.
- Use bulkheads to isolate tenants/features/tools.
- Apply backpressure before collapse.
- Queue long-running work.
- Use DLQs for repeated failures.
- Revalidate before high-risk side effects.
- Chaos-test failure paths.
36. Summary
AI systems fail like distributed systems, plus they fail like probabilistic reasoning systems.
The core invariant:
Failures must be bounded, visible, recoverable, and safe.
Reliability patterns give you that control:
- timeout budgets;
- retries with jitter;
- circuit breakers;
- fallback;
- bulkheads;
- rate limits;
- backpressure;
- idempotency;
- queues;
- graceful degradation;
- chaos testing.
In the next part, we focus on Latency, Cost, and Throughput Engineering.