Part 033 — Reliability and Failure Modeling

Reliability for agentic AI is not only uptime.

A system can be available, fast, and still unreliable if it makes unsupported decisions, loops forever, calls the wrong tool, ignores policy, loses state, or spends thousands of dollars on useless retries.

Traditional reliability asks:

Is the service up?
Is latency acceptable?
Are errors low?
Is saturation controlled?

Agentic reliability adds:

Did the agent follow the right trajectory?
Did it retrieve the right evidence?
Did it avoid forbidden tools?
Did it stop when uncertain?
Did it avoid duplicate side effects?
Did it preserve state across resume?
Did it remain within budget?
Did it escalate when required?
Did it avoid hallucinated authority?
Did it degrade safely under dependency failure?

This part builds a reliability and failure-modeling framework for enterprise-grade stateful multi-agent AI systems.

1. Kaufman Framing

Using Kaufman's method, reliability work decomposes into these steps:

identify expected behavior;
identify failure modes;
classify failure impact;
define detection signals;
define mitigation controls;
design fallback and degradation paths;
set budgets and SLOs;
test failure scenarios;
monitor production;
feed incidents back into evaluation.

Target Performance

By the end of this part, you should be able to:

define agentic reliability beyond uptime;
create failure mode catalogs;
model loops, deadlocks, hallucinations, drift, and cost explosions;
design retry, timeout, budget, and circuit-breaker controls;
distinguish transient, deterministic, semantic, and policy failures;
define SLOs for agentic workflows;
design graceful degradation;
test chaos/failure scenarios;
create incident-to-eval loops;
reason about reliability at component, workflow, and system levels.

2. Reliability Definition for Agentic Systems

A reliable agentic system:

completes intended tasks within acceptable time/cost;
uses allowed data and tools;
produces grounded outputs;
respects policy and authority boundaries;
preserves state across long-running execution;
handles dependency failures safely;
avoids duplicate side effects;
stops or escalates under uncertainty;
provides auditability;
degrades gracefully.

Reliability Is Multi-Dimensional

Dimension	Example
availability	runtime accepts tasks
latency	workflow finishes within target
correctness	output matches expected behavior
grounding	claims supported by evidence
safety	forbidden actions avoided
policy compliance	approval gates respected
state durability	resume works after crash
cost reliability	budget respected
security reliability	injection/data leak controls work
operational reliability	failures observable and recoverable

Do not optimize for only one dimension.

3. Failure Taxonomy

Category	Example
input/context	missing evidence, poisoned context
reasoning/output	hallucinated claim, wrong risk
tool/integration	timeout, invalid arguments
state/workflow	lost checkpoint, invalid transition
policy/safety	approval bypass, forbidden action
operational/cost	infinite loop, budget explosion
human review	rubber-stamp, stale approval

4. Failure Mode Record

from enum import Enum
from pydantic import BaseModel, Field


class FailureSeverity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class FailureMode(BaseModel):
    failure_id: str
    name: str
    category: str
    description: str
    severity: FailureSeverity
    detection_signals: list[str]
    mitigations: list[str]
    fallback: str | None = None
    eval_cases: list[str] = Field(default_factory=list)

Failure modes should be tracked like risks and test cases.

5. Agent Loop Failures

Agent loops happen when the system keeps reasoning or calling tools without making progress.

Examples:

planner keeps revising plan;
executor repeatedly repairs invalid output;
agent keeps searching evidence;
critic keeps asking for more work;
supervisor keeps delegating;
tool failure triggers full-agent rerun;
retrieval returns noisy results and agent never stops.

Loop Signals

Signal	Meaning
repeated same tool call	no progress
repeated same node	loop
repeated validation failure	output repair stuck
high token growth	context/loop issue
no new artifacts	no productive progress
replan count high	unstable plan
same error repeated	deterministic failure

Controls

max steps;
max replans;
max tool calls;
max repair attempts;
progress detector;
loop detection by state hash;
budget limits;
human escalation;
safe stop.

6. Progress Detection

A workflow should know whether it is making progress.

class ProgressSnapshot(BaseModel):
    run_id: str
    step_count: int
    artifacts_created: int
    new_evidence_refs: int
    completed_objectives: list[str]
    open_blockers: list[str]
    state_hash: str

Simple loop detector:

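def repeated_state_hashes(history: list[ProgressSnapshot], window: int = 3) -> bool:
    """Return True when the last `window` snapshots all share one state hash."""
    # A window below 2 would flag every single snapshot as a loop.
    if window < 2 or len(history) < window:
        return False

    recent = history[-window:]
    return len({snapshot.state_hash for snapshot in recent}) == 1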

If state does not change after several steps, stop or escalate.
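The detector only works if `state_hash` changes whenever meaningful state changes and stays the same otherwise. One possible way to compute it, as a minimal sketch: hash a canonical JSON serialization of the progress-relevant fields. The function name `compute_state_hash` and the field list `PROGRESS_FIELDS` are assumptions for illustration, not part of the models above.

```python
import hashlib
import json

# Hypothetical choice of fields that represent real progress. Volatile fields
# such as timestamps or step counters are deliberately left out, otherwise the
# hash would change on every step and the loop detector would never fire.
PROGRESS_FIELDS = ("artifacts", "evidence_refs", "completed_objectives", "open_blockers")


def compute_state_hash(state: dict) -> str:
    """Return a stable SHA-256 digest of the progress-relevant part of the state."""
    relevant = {key: state.get(key) for key in PROGRESS_FIELDS}
    # sort_keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two snapshots that differ only in `step_count` produce the same hash, so a run that keeps stepping without creating artifacts or evidence is still detected as stuck.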

7. Deadlocks

A deadlock occurs when a workflow cannot continue because its components are waiting on each other.

Examples:

supervisor waits for worker that waits for supervisor clarification;
tool waits for approval that was never created;
human review waits for decision package missing required fields;
policy engine requires context that context builder cannot obtain;
two agents wait for each other's output.

Deadlock Detection

pending state exceeds SLA;
no runnable nodes;
unresolved dependency cycle;
review task missing assignee;
required artifact absent;
waiting reason unchanged.

Controls

explicit dependency graph;
timeout for pending states;
dead-letter queue;
escalation policy;
workflow invariant checks;
state machine validation.
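The "unresolved dependency cycle" signal can be checked mechanically once waiting relationships are recorded as an explicit graph. The following is a sketch using depth-first search; the function name `find_wait_cycle` and the `waits_on` mapping (component name to the components it is waiting for) are assumptions for illustration.

```python
from typing import Optional


def find_wait_cycle(waits_on: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one cycle in the wait-for graph as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GREY
        path.append(node)
        for dep in waits_on.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so the path from dep
                # back to dep is a cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For the first example in this section, `find_wait_cycle({"supervisor": ["worker"], "worker": ["supervisor"]})` returns `["supervisor", "worker", "supervisor"]`. A cycle check only covers one kind of deadlock; missing approvals or absent artifacts still need the timeout and invariant controls listed above.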

8. Hallucination Failures

Hallucination means the output contains unsupported or false information.

Agentic hallucination is dangerous when it influences tools, state, or humans.

Examples:

cites nonexistent evidence;
invents policy rule;
says notice was sent when only drafted;
claims human approval exists;
fabricates confidence;
invents tool result;
summarizes unverified memory as fact.

Controls

grounded output schema;
citation verifier;
source refs required;
model output is proposal, not fact;
policy/domain state source of truth;
verifier agent/service;
human review for high-impact claims;
output guardrails.
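The "source refs required" and "citation verifier" controls can start with a cheap structural check before any semantic verification. A minimal sketch follows, assuming each claim carries a tuple of evidence IDs; `Claim` and `unsupported_claims` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_refs: tuple[str, ...]


def unsupported_claims(claims: list[Claim], known_evidence_ids: set[str]) -> list[Claim]:
    """Return claims that cite nothing or cite evidence that was never retrieved."""
    return [
        claim
        for claim in claims
        if not claim.evidence_refs
        or any(ref not in known_evidence_ids for ref in claim.evidence_refs)
    ]
```

This only proves that the cited evidence exists. Whether the evidence actually supports the claim still needs a verifier service or human review.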

9. Drift

Drift means behavior changes over time.

Sources:

model version changes;
prompt changes;
tool changes;
RAG index changes;
memory accumulation;
policy changes;
data distribution shifts;
user behavior changes;
dependency changes.

Drift Signals

Signal	Meaning
eval regression	offline behavior changed
approval/rejection rate shift	quality/policy drift
citation failure increase	RAG drift
tool denial increase	tool-use drift
memory conflict increase	memory drift
latency/cost shift	runtime drift
human override increase	recommendation drift

Controls

version all components;
run manifests;
regression evals;
canary/shadow;
monitoring;
drift alerts;
rollback.
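A drift alert can be as simple as comparing current rates against a stored baseline. The sketch below flags metrics whose relative change exceeds a threshold; the function name `drift_alerts`, the metric names, and the 20% default are assumptions, and a production system would usually add statistical tests and minimum sample sizes.

```python
def drift_alerts(
    baseline: dict[str, float],
    current: dict[str, float],
    max_relative_change: float = 0.2,
) -> dict[str, float]:
    """Return metrics whose relative change from baseline exceeds the threshold."""
    alerts: dict[str, float] = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metric not reported in this window
        if base_value == 0:
            # Relative change is undefined; flag any move away from zero.
            change = float("inf") if current[name] != 0 else 0.0
        else:
            change = abs(current[name] - base_value) / abs(base_value)
        if change > max_relative_change:
            alerts[name] = change
    return alerts
```

For example, a citation failure rate moving from 0.02 to 0.05 is a 150% relative change and would be flagged, while a human override rate moving from 0.10 to 0.11 would not.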

10. Cost Explosion

Agentic systems can spend too much through:

loops;
excessive tool calls;
large context;
repeated retrieval;
model retries;
multi-agent fan-out;
unnecessary critics/judges;
reprocessing same artifacts;
no budget enforcement.

Cost Controls

class CostBudget(BaseModel):
    max_model_calls: int
    max_tool_calls: int
    max_tokens: int
    max_usd: float
    max_wall_clock_seconds: int

Budget enforcement:

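def check_budget(used: dict[str, float], budget: CostBudget) -> bool:
    """Return True while every usage counter is within its limit."""
    # `used` must carry all five counters; a missing key raises KeyError.
    return (
        used["model_calls"] <= budget.max_model_calls
        and used["tool_calls"] <= budget.max_tool_calls
        and used["tokens"] <= budget.max_tokens
        and used["usd"] <= budget.max_usd
        and used["seconds"] <= budget.max_wall_clock_seconds
    )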

Budgets should be checked during execution, not only at the end.
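One way to make the check happen during execution is to charge the budget at every model or tool call and fail fast when a limit is crossed. This is a sketch under assumptions: `BudgetTracker` and `BudgetExceeded` are hypothetical names, and limits are passed as a plain dict rather than the `CostBudget` model so the example runs without dependencies.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run crosses one or more budget limits."""


@dataclass
class BudgetTracker:
    limits: dict[str, float]  # for example {"model_calls": 20, "usd": 2.0}
    used: dict[str, float] = field(default_factory=dict)

    def charge(self, **amounts: float) -> None:
        """Record usage, then fail fast if any limit is now exceeded."""
        for name, amount in amounts.items():
            self.used[name] = self.used.get(name, 0.0) + amount
        exceeded = [
            name
            for name, limit in self.limits.items()
            if self.used.get(name, 0.0) > limit
        ]
        if exceeded:
            raise BudgetExceeded(f"budget exceeded: {', '.join(sorted(exceeded))}")
```

A workflow node would call `tracker.charge(model_calls=1, tokens=..., usd=...)` after each call and translate `BudgetExceeded` into a typed budget error so the run stops or escalates instead of continuing.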

11. Retry Storms

Retries can amplify outages.

Example:
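As a hypothetical illustration of amplification, assume three nested layers (agent rerun, tool wrapper, HTTP client) that each allow up to three attempts and retry independently. The layer names and counts are assumptions chosen for the arithmetic; the worst case is the product of the attempts at each layer.

```python
def worst_case_attempts(attempts_per_layer: list[int]) -> int:
    """Worst-case calls reaching the dependency when every layer retries independently."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total


# Agent rerun, tool wrapper, and HTTP client each allow 3 attempts.
print(worst_case_attempts([3, 3, 3]))  # 27
```

One user request can therefore turn into 27 calls against a dependency that is already failing, which is why retries belong at one layer with a shared retry budget.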

Controls

retry at semantic layer;
exponential backoff;
jitter;
retry budget;
circuit breaker;
idempotency;
timeout classification;
avoid retrying deterministic validation errors;
avoid rerunning entire agent for one tool failure.
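Exponential backoff and jitter from the list above can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the function name `backoff_delays` and the default values are assumptions.

```python
import random
from typing import Optional


def backoff_delays(
    max_attempts: int,
    base_seconds: float = 0.5,
    cap_seconds: float = 30.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Delays to sleep before each retry (the first attempt has no delay)."""
    rng = rng or random.Random()
    delays = []
    for retry_index in range(max_attempts - 1):
        ceiling = min(cap_seconds, base_seconds * (2 ** retry_index))
        # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Passing `max_attempts` explicitly doubles as a simple retry budget. Backoff only makes sense for errors classified as transient; deterministic validation errors should not enter this path at all.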

12. Circuit Breakers

A circuit breaker stops calls to an unhealthy dependency.

class CircuitState(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerStatus(BaseModel):
    dependency_name: str
    state: CircuitState
    failure_count: int
    opened_at: str | None = None

When open:

fail fast;
use cached/degraded response;
route to fallback;
escalate;
stop high-risk workflow.

Circuit breakers prevent agent systems from hammering failing dependencies.
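The three states can be driven by a small state machine. The sketch below is one possible shape, not a reference implementation: the class name `CircuitBreaker`, the thresholds, and the injectable `clock` are assumptions, and it does not limit how many probe requests pass while half-open.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Minimal closed -> open -> half-open breaker for one dependency."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.state = "closed"
        self.failure_count = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after_seconds:
                return False  # fail fast while the dependency cools down
            self.state = "half_open"  # cool-down elapsed: let probes through
        return True

    def record_success(self) -> None:
        self.state = "closed"
        self.failure_count = 0

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "half_open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers check `allow_request()` before each call and, when it returns `False`, take one of the open-state actions listed above rather than waiting on the dependency.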

13. Timeout Hierarchy

Timeouts should compose.

run deadline
  workflow node deadline
    model call timeout
    tool call timeout
      HTTP/database timeout

Rules:

child timeout < parent deadline;
timeout creates typed error;
side-effect timeout may require reconciliation;
expired run should stop gracefully;
cancellation should propagate.
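The first rule can be enforced by deriving every child timeout from the time the parent has left instead of using fixed constants. A sketch, with the class name `Deadline` and the `reserve` parameter as assumptions:

```python
import time
from typing import Callable


class Deadline:
    """An absolute deadline that child operations derive their timeouts from."""

    def __init__(self, seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def child_timeout(self, requested: float, reserve: float = 0.0) -> float:
        """Timeout for a child call: never more than what the parent has left.

        `reserve` holds time back so the parent can checkpoint or clean up.
        Raises TimeoutError when no time is left.
        """
        available = self.remaining() - reserve
        if available <= 0:
            raise TimeoutError("parent deadline exhausted")
        return min(requested, available)
```

With a positive `reserve`, the child timeout is always strictly shorter than the parent's remaining time, so the parent can still record a typed timeout error and stop gracefully.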

14. Graceful Degradation

Graceful degradation means providing a safe reduced capability.

Examples:

Failure	Degradation
RAG unavailable	ask for retry or use cached approved summary
policy service unavailable	fail closed for side effects
memory unavailable	continue without personalization
verifier unavailable	require human review
external notification unavailable	keep draft and retry later
model provider degraded	route to fallback model if eval-approved
evidence search partial	disclose missing evidence

Do not degrade from safe to unsafe.

15. Fallback Design

Fallbacks must be explicit.

class FallbackPolicy(BaseModel):
    failure_type: str
    fallback_action: str
    allowed_risk_tiers: list[str]
    requires_human_review: bool

Example:

If citation verifier fails:
- low-risk: ask model to include evidence refs and mark unverified
- high-risk: block decision package and require human review

Fallback is policy, not improvisation.
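Treating fallback as policy means the choice is a table lookup with a fail-closed default, not a decision the agent makes at runtime. The sketch below mirrors the citation-verifier example; `FallbackRule`, `RULES`, `FAIL_CLOSED`, and the action strings are hypothetical, and a dataclass stands in for the `FallbackPolicy` model so the example is dependency-free.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FallbackRule:
    failure_type: str
    fallback_action: str
    allowed_risk_tiers: frozenset[str]
    requires_human_review: bool


# Hypothetical policy table for the citation-verifier example.
RULES: tuple[FallbackRule, ...] = (
    FallbackRule("citation_verifier_unavailable", "mark_unverified", frozenset({"low"}), False),
    FallbackRule("citation_verifier_unavailable", "block_and_review", frozenset({"high"}), True),
)

# Used when no rule matches: fail closed rather than improvise.
FAIL_CLOSED = FallbackRule("*", "block_and_review", frozenset(), True)


def select_fallback(
    failure_type: str,
    risk_tier: str,
    rules: Sequence[FallbackRule] = RULES,
) -> FallbackRule:
    """Return the first rule matching the failure type and risk tier."""
    for rule in rules:
        if rule.failure_type == failure_type and risk_tier in rule.allowed_risk_tiers:
            return rule
    return FAIL_CLOSED
```

A risk tier or failure type that nobody planned for lands on `FAIL_CLOSED`, which keeps an unplanned failure from silently degrading to an unsafe path.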

16. Model Provider Failure

Model failures:

timeout;
rate limit;
degraded quality;
empty response;
malformed structured output;
provider outage;
model behavior drift;
cost spike.

Controls:

model gateway;
provider routing;
fallback models;
schema validation;
retry policy;
eval-approved model routes;
cost budget;
observability by model version.

Never switch to a fallback model for high-risk workflows without evaluation approval.

17. RAG Failure

RAG failures:

retrieval miss;
stale document;
unauthorized chunk;
irrelevant top-k;
index outage;
embedding drift;
poisoned corpus;
citation mismatch.

Controls:

hybrid retrieval;
metadata filters;
freshness checks;
retriever eval;
citation verifier;
index versioning;
fallback to human evidence search;
do not hallucinate when evidence missing.

18. Tool Failure

Tool failures:

validation error;
authorization denied;
timeout;
dependency failure;
side-effect ambiguous;
output schema invalid;
external API changed;
duplicate request;
rate limit.

Controls:

typed error taxonomy;
retry classification;
idempotency;
reconciliation;
circuit breakers;
output validation;
fallback;
escalation.

19. Memory Failure

Memory failures:

stale memory retrieved;
sensitive memory leaked;
poisoned memory accepted;
deleted memory still indexed;
wrong scope memory used;
memory conflicts with domain state;
memory store unavailable.

Controls:

memory governance;
expiry;
read/write policy;
conflict checks;
tombstones;
index deletion propagation;
memory-off degradation.

If memory fails, the agent should usually continue without memory rather than inventing it.

20. Human Review Failure

Human review failures:

task not assigned;
timeout ignored;
unauthorized reviewer;
stale package approved;
reviewer rubber-stamps;
reviewer lacks evidence;
duplicate approval;
approval not bound to action.

Controls:

review queue monitoring;
reviewer authorization;
package versioning;
approval expiry;
idempotency;
review metrics;
escalation.

21. State Durability Failure

State failures:

checkpoint not saved;
checkpoint corrupt;
state schema migration fails;
resume repeats side effect;
state/data mismatch;
lost human decision;
wrong thread_id;
cross-tenant state leak.

Controls:

durable checkpointer;
schema versioning;
state integrity checks;
idempotency records;
domain event reconciliation;
migration tests;
tenant partitioning;
resume audit event.
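The "resume repeats side effect" failure is what idempotency records guard against. A minimal sketch, assuming an in-memory dict in place of a durable table; `IdempotencyStore`, `run_once`, and the key format are hypothetical.

```python
from typing import Callable


class IdempotencyStore:
    """In-memory stand-in for a durable idempotency table (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def run_once(self, key: str, side_effect: Callable[[], str]) -> str:
        """Execute side_effect at most once per key; replay the stored result otherwise."""
        if key in self._results:
            return self._results[key]
        result = side_effect()
        self._results[key] = result
        return result
```

A resumed workflow that calls `run_once("run-42:send_notice:case-7", send_notice)` a second time gets the stored result back and sends nothing. The sketch does not cover a crash between the side effect and the record write; that window is the ambiguous side-effect case, which needs reconciliation against the external system.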

22. Reliability Patterns

Pattern	Use
timeout	prevent hanging
retry with backoff	transient failures
idempotency	safe retry
circuit breaker	failing dependency
bulkhead	isolate failure
outbox/inbox	reliable integration
checkpointing	resume state
saga	long-running process
compensation	mitigate side effect
graceful degradation	safe partial capability
human escalation	uncertainty/high risk
dead-letter queue	unprocessable work

Agentic reliability is a combination of distributed systems patterns and AI-specific controls.

23. Bulkheads

Bulkheads isolate failure.

Examples:

separate worker pools by tenant;
separate high-risk workflows;
separate tool executors for external side effects;
rate limits per agent/tool/tenant;
isolate experimental agents;
separate memory writes from core workflow;
separate model provider quotas.

Without bulkheads, one runaway agent can affect all users.
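A per-tenant concurrency cap is one of the simplest bulkheads. The sketch below uses one bounded semaphore per partition and rejects work instead of queueing it; `Bulkhead`, `BulkheadFull`, and the single-process scope are assumptions, and a distributed runtime would need a shared limiter.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a partition has no free slots."""


class Bulkhead:
    """Caps concurrent work per partition (for example per tenant)."""

    def __init__(self, max_concurrent_per_partition: int) -> None:
        self._semaphores = defaultdict(
            lambda: threading.BoundedSemaphore(max_concurrent_per_partition)
        )
        self._lock = threading.Lock()

    @contextmanager
    def slot(self, partition: str):
        with self._lock:  # guard lazy creation of the per-partition semaphore
            semaphore = self._semaphores[partition]
        # Non-blocking acquire: reject instead of queueing behind a runaway tenant.
        if not semaphore.acquire(blocking=False):
            raise BulkheadFull(partition)
        try:
            yield
        finally:
            semaphore.release()
```

Work runs inside `with bulkhead.slot(tenant_id): ...`. When one tenant exhausts its slots, only that tenant sees `BulkheadFull`; other tenants keep their own capacity.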

24. SLOs for Agentic Systems

Service Level Objectives should include more than latency.

Examples:

SLO	Example
availability	99.9% task intake availability
latency	95% low-risk tasks complete < 30s
workflow completion	98% valid tasks reach terminal state
policy compliance	0 critical false allows
grounding	95% citations verified
side-effect safety	0 duplicate external notices
cost	99% runs under budget
human review SLA	95% high-risk reviews assigned < 10m
eval regression	release blocked if critical failures > 0

Error Budget

For critical safety failures, the error budget may be zero.
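The arithmetic behind an error budget is short enough to write down. A sketch, with `allowed_failures` and `budget_remaining` as hypothetical helper names:

```python
def allowed_failures(slo_target: float, total_events: int) -> float:
    """Number of failing events the SLO tolerates over the measurement window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = allowed_failures(slo_target, total_events)
    if budget == 0:
        # Zero-budget SLOs (for example critical false allows): any failure is a breach.
        return 1.0 if failed_events == 0 else float("-inf")
    return 1.0 - failed_events / budget
```

With the 98% workflow-completion SLO from the table and 10,000 valid tasks in the window, about 200 tasks may fail to reach a terminal state; after 50 such failures roughly 75% of the budget remains. For a target of 100%, the first failure exhausts the budget.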

25. Reliability Metrics

Map classic golden signals to agentic signals.

Classic Signal	Agentic Extension
latency	run duration, node duration, approval latency
traffic	runs, tool calls, retrievals, approvals
errors	model/tool/policy/guardrail failures
saturation	worker queues, model rate limits, token budget, review backlog

Additional signals:

loop detections;
cost per run;
citation failures;
policy denials;
retries;
ambiguous side effects;
stuck workflows;
memory conflicts;
escalation rate.

26. Error Taxonomy

class AgentErrorType(str, Enum):
    VALIDATION = "validation"
    POLICY_DENIED = "policy_denied"
    AUTHORIZATION = "authorization"
    MODEL_TIMEOUT = "model_timeout"
    MODEL_MALFORMED_OUTPUT = "model_malformed_output"
    TOOL_TIMEOUT = "tool_timeout"
    TOOL_AMBIGUOUS_SIDE_EFFECT = "tool_ambiguous_side_effect"
    RETRIEVAL_MISS = "retrieval_miss"
    CITATION_UNSUPPORTED = "citation_unsupported"
    STATE_CONFLICT = "state_conflict"
    BUDGET_EXCEEDED = "budget_exceeded"
    HUMAN_REVIEW_TIMEOUT = "human_review_timeout"
    UNKNOWN = "unknown"

Typed errors let retry and fallback logic branch on the failure class instead of treating every failure alike.
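A retry policy keyed by error type could look like the following sketch. The mapping is an assumption, not a recommendation for every system: it treats timeouts as transient (which presumes read-only or idempotent calls), sends ambiguous side effects to reconciliation, and never retries deterministic or policy failures. `RetryDecision`, `RETRY_POLICY`, and `decide` are hypothetical names, and keys are the string values of `AgentErrorType`.

```python
from enum import Enum


class RetryDecision(str, Enum):
    RETRY = "retry"
    RECONCILE = "reconcile"  # find out what actually happened before acting again
    ESCALATE = "escalate"
    FAIL = "fail"


# Hypothetical mapping keyed by the string values of AgentErrorType.
RETRY_POLICY = {
    "model_timeout": RetryDecision.RETRY,
    "tool_timeout": RetryDecision.RETRY,
    "model_malformed_output": RetryDecision.RETRY,
    "tool_ambiguous_side_effect": RetryDecision.RECONCILE,
    "validation": RetryDecision.FAIL,
    "policy_denied": RetryDecision.FAIL,
    "authorization": RetryDecision.FAIL,
    "budget_exceeded": RetryDecision.FAIL,
    "human_review_timeout": RetryDecision.ESCALATE,
}


def decide(error_type: str, attempts_so_far: int, max_attempts: int = 3) -> RetryDecision:
    """Pick the next action for a typed error, respecting the attempt limit."""
    # Unmapped error types go to a human instead of being retried blindly.
    decision = RETRY_POLICY.get(error_type, RetryDecision.ESCALATE)
    if decision is RetryDecision.RETRY and attempts_so_far >= max_attempts:
        return RetryDecision.ESCALATE
    return decision
```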

27. Reliability Event

class ReliabilityEvent(BaseModel):
    event_id: str
    run_id: str
    tenant_id: str
    event_type: str
    severity: FailureSeverity
    component: str
    error_type: AgentErrorType | None = None
    message: str
    retryable: bool
    occurred_at: str

Events create a structured reliability trail.

28. Chaos Testing for Agents

Chaos testing intentionally injects failure.

Scenarios:

model timeout;
malformed model output;
retriever returns stale policy;
tool API times out after commit;
worker crashes after checkpoint;
human approval delayed;
policy engine unavailable;
memory store returns conflicting memory;
MCP server unavailable;
vector index returns malicious chunk.

Expected behavior should be safe, not necessarily successful.

29. Failure Injection Harness

class FailureInjectionRule(BaseModel):
    component: str
    failure_type: str
    probability: float = Field(ge=0.0, le=1.0)
    applies_to_tags: list[str] = Field(default_factory=list)


class FailureInjectionConfig(BaseModel):
    rules: list[FailureInjectionRule]

Use in simulation environments, not casually in production.
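How the rules are evaluated is left open above. One possible evaluator, as a sketch: match on component and tags, then draw against the rule's probability using a seeded random generator so that chaos runs are reproducible. `InjectionRule` is a dataclass stand-in for `FailureInjectionRule`, and `failures_to_inject` is a hypothetical name.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InjectionRule:
    component: str
    failure_type: str
    probability: float
    applies_to_tags: frozenset[str] = field(default_factory=frozenset)


def failures_to_inject(
    rules: list[InjectionRule],
    component: str,
    run_tags: set[str],
    rng: random.Random,
) -> list[str]:
    """Return the failure types to inject for this call."""
    selected = []
    for rule in rules:
        if rule.component != component:
            continue
        # An empty tag set means the rule applies to every run (an assumption of this sketch).
        if rule.applies_to_tags and not (rule.applies_to_tags & run_tags):
            continue
        if rng.random() < rule.probability:
            selected.append(rule.failure_type)
    return selected
```

Passing `random.Random(seed)` and recording the seed in the run manifest lets a failing chaos scenario be replayed exactly.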

30. Incident-to-Eval Loop

Every incident should produce:

root cause analysis;
failed control identification;
new eval case;
new guardrail/policy/test if needed;
monitoring update;
runbook update;
residual risk review.

This is how reliability improves.

31. Reliability Anti-Patterns

Anti-Pattern 1 — Retry Everything

Retries without idempotency and classification create duplicate side effects and storms.

Anti-Pattern 2 — No Stop Conditions

Agent loops until cost/deadline explodes.

Anti-Pattern 3 — Uptime-Only SLO

System is up but producing unsafe outputs.

Anti-Pattern 4 — Fallback to Unsafe Model

High-risk workflow switches to unvalidated model.

Anti-Pattern 5 — Degrade by Ignoring Policy

Dependency failure makes system permissive.

Anti-Pattern 6 — No Typed Errors

Everything is generic failure.

Anti-Pattern 7 — No Incident Regression

Same bug comes back.

Anti-Pattern 8 — Human Escalation as Black Hole

Escalated tasks never complete or time out.

32. Production Checklist

Before claiming reliability:

33. Practice Drill

Design a reliability plan for a multi-agent case assistant.

Capabilities:

RAG evidence retrieval;
policy mapping;
risk assessment;
drafting;
human approval;
external notice sending;
memory;
MCP tools.

Deliverables:

reliability definition;
top 20 failure modes;
severity classification;
detection signals;
mitigation controls;
fallback policy;
retry/circuit-breaker policy;
SLOs;
chaos scenarios;
incident-to-eval workflow;
dashboard metrics;
production checklist.

34. What Top 1% Engineers Pay Attention To

Top engineers ask:

What does reliability mean for this agent?
What can fail besides infrastructure?
What happens if the model loops?
What happens if the retriever misses evidence?
What happens if the policy engine is unavailable?
What happens if a tool times out after success?
What happens if memory is stale?
What happens if human review never happens?
What happens if fallback model behaves differently?
What happens if cost spikes?
Are failures typed?
Are retries safe?
Are SLOs measuring correctness and safety?
Did incidents become evals?

They design agents to fail safely, visibly, and recoverably.

35. Summary

In this part, we covered:

reliability definition for agentic systems;
failure taxonomy;
failure mode records;
loop failures;
progress detection;
deadlocks;
hallucination;
drift;
cost explosion;
retry storms;
circuit breakers;
timeout hierarchy;
graceful degradation;
fallback design;
model/RAG/tool/memory/human/state failures;
reliability patterns;
bulkheads;
SLOs;
metrics;
error taxonomy;
reliability events;
chaos testing;
failure injection;
incident-to-eval loop;
anti-patterns;
production checklist.

The key principle:

Reliable agentic systems do not merely keep running. They keep making controlled, grounded, policy-compliant progress—or stop safely.

The next part focuses on Observability and Runtime Forensics.

References

Google SRE Book: monitoring distributed systems and the four golden signals of latency, traffic, errors, and saturation.
OpenTelemetry documentation: traces, metrics, logs, context propagation, and observability concepts.
OpenAI Agents SDK tracing documentation: tracing LLM generations, tool calls, handoffs, guardrails, and custom events.
LangGraph documentation: persistence, checkpoints, durable state, and long-running stateful workflows.
