
Reference Architecture and Capstone

Learn Python Enterprise-Grade Stateful Multi-Agent AI Systems - Part 035

Reference architecture and capstone for an enterprise-grade stateful multi-agent AI case management system: runtime, state, orchestration, tools, memory, RAG, policy, human review, evaluation, reliability, observability, deployment, and operating model.

2026-06-29 | 21 min read | 4064 words


Part 035 — Reference Architecture and Capstone

This capstone is where the entire series becomes one system.

Not a chatbot.

Not a demo agent.

A stateful, governed, observable, evaluated, policy-controlled, enterprise-grade multi-agent AI system.

The reference system:

Enterprise Case Management Multi-Agent System

The system assists analysts with regulated case review:

intake a case;
summarize evidence;
detect missing evidence;
map facts to policy;
assess risk;
draft a decision package;
verify citations;
escalate disagreements;
request human approval;
send an approved external notice;
update domain state through command handlers;
preserve full audit and forensic trace;
evaluate and monitor behavior continuously.

This capstone is not a toy. It is a blueprint you can adapt for real enterprise systems.

1. Kaufman Framing

Using Josh Kaufman's skill acquisition framework, the capstone consolidates all sub-skills:

define target performance;
deconstruct the system into capabilities;
remove unsafe/unclear practices;
build feedback loops;
practice through realistic exercises.

Target Performance

By the end of this capstone, you should be able to design and implement a reference architecture that includes:

stateful agent runtime;
graph/workflow orchestration;
supervisor-worker multi-agent topology;
typed state and contracts;
RAG and knowledge graph context;
governed memory;
MCP/tool integration;
policy enforcement;
human-in-the-loop control points;
side-effect transaction boundaries;
threat model;
guardrail runtime;
evaluation suite;
reliability patterns;
observability and runtime forensics;
deployment and operating model.

The goal is not to memorize a framework. The goal is to develop engineering judgment.

2. System Context

The business capability:

Assist enterprise analysts in reviewing regulated cases safely and efficiently.

The system must help, but it must not silently seize authority.

Users

analyst;
senior reviewer;
compliance manager;
system administrator;
auditor;
operations engineer.

External Systems

case management system;
evidence/document store;
policy repository;
notification service;
approval/review queue;
identity provider;
audit/event log;
observability backend;
memory store;
RAG index;
knowledge graph;
MCP servers.

High-Level Context

3. Non-Negotiable Invariants

These invariants define enterprise-grade behavior.

Invariant 1 — Agents Propose, Authoritative Services Commit

Agents may propose:

risk assessment;
policy mapping;
notice draft;
command proposal.

But domain services commit:

case status update;
notice sent;
approval recorded;
memory accepted;
policy decision.

Invariant 2 — Domain State Beats Memory and Chat

If memory or transcript conflicts with domain state, domain state wins.
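
Invariant 2 can be sketched as a merge in which domain state always overrides recalled values. This is a minimal illustration, not the series' implementation: the `reconcile` helper and the plain dictionaries standing in for a domain snapshot and recalled memory are assumptions made for the example.

```python
def reconcile(domain_state: dict, recalled: dict) -> tuple[dict, list[str]]:
    """Merge recalled values under domain state; domain wins on every shared key.

    Hypothetical helper. Returns the merged view plus the keys where the
    recalled value disagreed with the domain, so the conflict can be logged.
    """
    conflicts = sorted(
        key for key in recalled
        if key in domain_state and recalled[key] != domain_state[key]
    )
    # Later entries win in a dict merge, so domain_state goes last.
    merged = {**recalled, **domain_state}
    return merged, conflicts


merged, conflicts = reconcile(
    {"status": "under_review", "case_version": 7},
    {"status": "open", "preferred_format": "brief"},
)
```

Here the stale `status` recalled from memory is overridden and reported as a conflict, while the non-conflicting preference is kept as a hint.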

Invariant 3 — Tools Are Governed Capabilities

Agents get no raw shell, raw SQL, unrestricted HTTP, or broad administrative tools.

Invariant 4 — High-Impact Side Effects Require Approval

External notifications, irreversible mutations, and official domain transitions require policy and human gates.

Invariant 5 — Every Important Decision Has Evidence

Risk, policy, and external communication must cite sources or declare missing evidence.

Invariant 6 — Every Run Is Forensically Reconstructable

The system must preserve run manifest, trace, context refs, tool calls, policy decisions, approvals, evidence refs, artifacts, and side-effect records.

Invariant 7 — Evaluation Blocks Unsafe Releases

Critical eval failures block deployment.

Invariant 8 — Kill Switches Exist

Unsafe agents, tools, MCP servers, prompts, model routes, indexes, or memory writes can be disabled quickly.
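
One possible shape for a kill switch is a registry that every capability lookup consults before use. The sketch below is an assumption-laden illustration: `KillSwitchRegistry`, `CapabilityDisabled`, and the `(kind, name)` keying are hypothetical names, and a real deployment would back this with shared, durable configuration rather than process memory.

```python
from dataclasses import dataclass, field


class CapabilityDisabled(RuntimeError):
    """Raised when a disabled agent, tool, prompt, or route is requested."""


@dataclass
class KillSwitchRegistry:
    """In-memory registry of disabled capabilities, keyed by (kind, name)."""

    _disabled: dict[tuple[str, str], str] = field(default_factory=dict)

    def disable(self, kind: str, name: str, reason: str) -> None:
        self._disabled[(kind, name)] = reason

    def enable(self, kind: str, name: str) -> None:
        self._disabled.pop((kind, name), None)

    def check(self, kind: str, name: str) -> None:
        """Call before every use; raises if the capability is switched off."""
        reason = self._disabled.get((kind, name))
        if reason is not None:
            raise CapabilityDisabled(f"{kind}:{name} disabled: {reason}")


switches = KillSwitchRegistry()
switches.disable("tool", "send_approved_notice", reason="incident under review")
```

Checking at call time, rather than at startup, is what lets an operator disable a capability without redeploying.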

4. Capability Map

Each capability is independently testable.

5. Reference Architecture

6. Runtime Flow

The typical case review flow:

Every step is typed, logged, and checkpointed.

7. Agent Topology

Use a supervisor-worker topology with bounded specialists.

Role Inventory

Agent	Responsibility	Authority
supervisor	orchestrate, aggregate, escalate	recommend/prepare
intake	normalize case input	analyze
evidence	retrieve and summarize evidence	analyze
risk	recommend risk level	recommend
policy	map facts to policy	recommend
missing-evidence	identify gaps	analyze
drafting	create decision package/draft notice	prepare
verifier	check citations and contract compliance	analyze
adjudicator	resolve conflicts or escalate	recommend
human reviewer	approve/reject high-impact actions	approve

No worker owns final domain mutation.
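
The role inventory can be encoded as data so that authority is checked in code rather than in prompts. This is a sketch under assumptions: the `Authority` enum, the `ROLE_AUTHORITY` mapping, and `is_permitted` are illustrative names, and the `COMMIT` level is added here only to show that no role in the table holds it.

```python
from enum import Enum


class Authority(str, Enum):
    ANALYZE = "analyze"
    RECOMMEND = "recommend"
    PREPARE = "prepare"
    APPROVE = "approve"
    COMMIT = "commit"  # assumed level: held by domain services, never by a role below


# Mirrors the role inventory table above.
ROLE_AUTHORITY: dict[str, frozenset[Authority]] = {
    "supervisor": frozenset({Authority.RECOMMEND, Authority.PREPARE}),
    "intake": frozenset({Authority.ANALYZE}),
    "evidence": frozenset({Authority.ANALYZE}),
    "risk": frozenset({Authority.RECOMMEND}),
    "policy": frozenset({Authority.RECOMMEND}),
    "missing-evidence": frozenset({Authority.ANALYZE}),
    "drafting": frozenset({Authority.PREPARE}),
    "verifier": frozenset({Authority.ANALYZE}),
    "adjudicator": frozenset({Authority.RECOMMEND}),
    "human reviewer": frozenset({Authority.APPROVE}),
}


def is_permitted(role: str, needed: Authority) -> bool:
    """Unknown roles get no authority at all (deny by default)."""
    return needed in ROLE_AUTHORITY.get(role, frozenset())
```

A test that asserts no entry contains `Authority.COMMIT` turns "no worker owns final domain mutation" into a regression check.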

8. State Model

Separate state types.

Domain State

Authoritative business state, owned by domain services.

from enum import Enum
from pydantic import BaseModel, Field


class CaseStatus(str, Enum):
    OPEN = "open"
    UNDER_REVIEW = "under_review"
    PENDING_APPROVAL = "pending_approval"
    NOTIFIED = "notified"
    CLOSED = "closed"


class CaseDomainSnapshot(BaseModel):
    tenant_id: str
    case_id: str
    status: CaseStatus
    case_version: int
    risk_level: str | None = None
    assigned_reviewer: str | None = None

Execution State

Owned by the runtime/checkpointer.

class ExecutionState(BaseModel):
    run_id: str
    thread_id: str
    current_node: str
    completed_nodes: list[str] = Field(default_factory=list)
    pending_interrupt_id: str | None = None
    tool_calls_used: int = 0
    model_calls_used: int = 0
    budget_remaining_usd: float
    artifact_refs: list[str] = Field(default_factory=list)

Conversation State

Owned by the interaction layer.

class ConversationState(BaseModel):
    thread_id: str
    user_id: str
    messages: list[dict]
    latest_user_intent: str | None = None

9. Artifact Model

Artifacts are durable work products.

class ArtifactType(str, Enum):
    EVIDENCE_SUMMARY = "evidence_summary"
    RISK_ASSESSMENT = "risk_assessment"
    POLICY_MAPPING = "policy_mapping"
    MISSING_EVIDENCE_REPORT = "missing_evidence_report"
    DECISION_PACKAGE = "decision_package"
    NOTICE_DRAFT = "notice_draft"
    VERIFICATION_REPORT = "verification_report"


class ArtifactRecord(BaseModel):
    artifact_id: str
    artifact_type: ArtifactType
    tenant_id: str
    case_id: str
    run_id: str
    created_by: str
    content: dict
    source_refs: list[str]
    version: int
    created_at: str

Artifacts should be immutable or versioned.
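
A minimal sketch of "immutable or versioned", using only the standard library instead of Pydantic to stay dependency-free. The `Artifact` and `ArtifactStore` classes here are simplified, hypothetical stand-ins for `ArtifactRecord` and a real store: each save appends a new version and earlier versions stay readable and unmodifiable.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    version: int
    content: Mapping[str, str]


class ArtifactStore:
    """Append-only store: a revision adds a version and never overwrites one."""

    def __init__(self) -> None:
        self._history: dict[str, list[Artifact]] = {}

    def save(self, artifact_id: str, content: dict[str, str]) -> Artifact:
        history = self._history.setdefault(artifact_id, [])
        artifact = Artifact(
            artifact_id=artifact_id,
            version=len(history) + 1,
            content=MappingProxyType(dict(content)),  # read-only copy
        )
        history.append(artifact)
        return artifact

    def get(self, artifact_id: str, version: Optional[int] = None) -> Artifact:
        history = self._history[artifact_id]
        if version is None:
            return history[-1]  # latest version
        if version < 1:
            raise ValueError("versions start at 1")
        return history[version - 1]


store = ArtifactStore()
v1 = store.save("art_1", {"summary": "draft"})
v2 = store.save("art_1", {"summary": "revised"})
```

Because approvals bind to a specific version, keeping every version addressable is what makes a later version check meaningful.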

10. Context Architecture

Context is assembled per role and step.

Context Package

class CapstoneContextPackage(BaseModel):
    context_id: str
    run_id: str
    agent_name: str
    step_name: str
    blocks: list[dict]
    source_refs: list[str]
    token_estimate: int
    sufficiency_passed: bool
    warnings: list[str] = Field(default_factory=list)

Context Rules

include only role-relevant sources;
label untrusted retrieved content;
include policy version;
include output schema;
include evidence refs;
exclude expired/disputed memory;
prefer domain state over memory;
record omitted critical sources.
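
Several of these rules can be applied mechanically while assembling blocks. The sketch below assumes a hypothetical `ContextSource` record and `select_blocks` function; it covers role relevance, labeling of untrusted content, exclusion of expired or disputed memory, and domain-first ordering, and it records what was omitted.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ContextSource:
    source_ref: str
    kind: str                 # assumed values: "domain", "retrieved", "memory"
    roles: frozenset[str]     # agent roles this source is relevant to
    text: str
    expires_on: Optional[date] = None
    disputed: bool = False


def select_blocks(
    sources: list[ContextSource], agent_name: str, today: date
) -> tuple[list[dict], list[str]]:
    """Return (blocks, omitted_refs) for one agent at one step."""
    blocks: list[dict] = []
    omitted: list[str] = []
    for source in sources:
        if agent_name not in source.roles:
            continue  # not role-relevant
        expired = source.expires_on is not None and source.expires_on < today
        if source.kind == "memory" and (expired or source.disputed):
            omitted.append(source.source_ref)  # record what was left out
            continue
        blocks.append({
            "source_ref": source.source_ref,
            "trusted": source.kind == "domain",  # everything else is labeled untrusted
            "text": source.text,
        })
    # Stable sort: domain state first, so it is read before memory or retrieval.
    blocks.sort(key=lambda block: not block["trusted"])
    return blocks, omitted
```

The returned `omitted` list is the kind of value that could feed the `warnings` field of the context package.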

11. RAG Architecture

RAG supports evidence and policy retrieval.

Retrieval Profiles

Agent	Retrieval Profile
evidence	case-scoped evidence
policy	effective-date policy/guidance
risk	evidence summaries + risk rubric
drafting	approved facts + templates
verifier	source docs for citations
supervisor	worker artifacts and conflicts

RAG Rules

authorization before retrieval;
preserve document/chunk IDs;
enforce effective date;
classify source authority;
isolate untrusted content;
verify citations;
record index version.
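
The ordering of these rules matters: authorization comes before any chunk is touched. The sketch below is illustrative only. `Chunk` and `authorized_search` are hypothetical names, and relevance ranking is left out so the example can focus on tenant scoping, effective dates, preserved IDs, and the recorded index version.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    chunk_id: str
    tenant_id: str
    effective_from: date
    effective_to: Optional[date]  # None means still in force
    text: str


def authorized_search(
    chunks: list[Chunk],
    *,
    tenant_id: str,
    allowed_tenants: set[str],
    as_of: date,
    index_version: str,
) -> dict:
    # Authorization happens before retrieval, not as a post-filter.
    if tenant_id not in allowed_tenants:
        raise PermissionError(f"requester may not read tenant {tenant_id!r}")
    hits = [
        chunk for chunk in chunks
        if chunk.tenant_id == tenant_id
        and chunk.effective_from <= as_of
        and (chunk.effective_to is None or as_of <= chunk.effective_to)
    ]
    return {
        "index_version": index_version,  # recorded for later forensics
        "hits": [(chunk.doc_id, chunk.chunk_id) for chunk in hits],
    }
```

Passing `as_of` explicitly lets a case be reviewed against the policy that was in force on the relevant date rather than whatever is newest in the index.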

12. Knowledge Graph Architecture

The knowledge graph supports relationship reasoning.

Graph Uses

entity relationship traversal;
policy applicability;
evidence support/contradiction;
artifact lineage;
agent audit graph;
human approval lineage;
impact analysis.

Graph Rules

agents propose facts;
graph service commits;
every edge has provenance;
temporal validity matters;
inferred facts are labeled;
traversal is permissioned and bounded.

13. Memory Architecture

Memory supports future usefulness, not authoritative truth.

Memory Types

user preference;
team checklist;
episodic lesson;
semantic fact reference;
safety warning;
procedural hint.

Memory Rules

source refs required;
broad-scope memory requires approval;
restricted data rejected;
expiry supported;
supersession supported;
forgetting supported;
memory usage logged.
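
A memory write can be treated as a proposal that a governance function accepts, rejects, or routes for approval. The following is a rough sketch: `MemoryProposal`, `evaluate_memory_write`, and the choice of which scopes count as broad are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed: team- and tenant-wide memories count as "broad scope".
BROAD_SCOPES = frozenset({"team", "tenant"})


@dataclass(frozen=True)
class MemoryProposal:
    scope: str                      # assumed values: "user", "team", "tenant"
    content: str
    source_refs: tuple[str, ...]
    contains_restricted_data: bool = False
    approved_by: Optional[str] = None


def evaluate_memory_write(proposal: MemoryProposal) -> tuple[str, str]:
    """Return (decision, reason); decision is accept, reject, or require_approval."""
    if not proposal.source_refs:
        return "reject", "source refs required"
    if proposal.contains_restricted_data:
        return "reject", "restricted data"
    if proposal.scope in BROAD_SCOPES and proposal.approved_by is None:
        return "require_approval", "broad-scope memory requires approval"
    return "accept", "ok"
```

Rejections are checked before the approval path so that an approver is never asked to sign off on a write that could not be accepted anyway.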

14. Tool and MCP Architecture

Tools are governed capabilities.

Tool Inventory

Tool	Effect	Approval
`get_case_summary`	read	no
`search_case_evidence`	retrieve	no
`fetch_policy_excerpt`	retrieve	no
`create_notice_draft`	draft	no/depends
`request_human_approval`	workflow	no
`send_approved_notice`	external notification	yes
`update_case_status`	internal mutation	yes/command policy
`propose_memory_write`	memory mutation proposal	policy-dependent
`traverse_knowledge_graph`	read/relationship	no/depends sensitivity

MCP Rules

approved MCP server registry;
resources/prompts/tools separated;
discovery filtered by policy;
local servers sandboxed;
version pinned;
calls traced;
capabilities can be killed.

15. Policy Enforcement Architecture

Policy inputs:

user identity;
agent role;
tenant;
resource;
action;
risk level;
tool effect;
workflow state;
approval state;
case version;
policy version.

Policy Decision

class CapstonePolicyDecision(BaseModel):
    decision: str  # allow, deny, require_approval
    reason: str
    policy_id: str
    policy_version: str
    obligations: list[str] = Field(default_factory=list)

Policy Rules

deny by default;
read authorization before retrieval;
no side-effect tool without approval;
no memory write without policy;
no official state update from worker;
no cross-tenant access;
no stale approval;
no critical auto-decision.
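
A few of these rules can be shown in a small deny-by-default evaluator. This is a simplified sketch, not a full policy engine: `PolicyRequest`, the `ALLOWED_ACTIONS` allowlist, and every action name except `case.notice.propose_send` are hypothetical.

```python
from dataclasses import dataclass

SIDE_EFFECTS = frozenset({"external_notification", "internal_mutation"})

# Hypothetical allowlist of (agent role, action) pairs; anything absent is denied.
ALLOWED_ACTIONS = frozenset({
    ("evidence", "case.evidence.search"),
    ("drafting", "case.notice.draft"),
    ("supervisor", "case.notice.propose_send"),
})


@dataclass(frozen=True)
class PolicyRequest:
    tenant_id: str
    resource_tenant_id: str
    agent_role: str
    action: str
    tool_effect: str
    has_valid_approval: bool = False


def evaluate(request: PolicyRequest) -> tuple[str, str]:
    """Return (decision, reason) using allow, deny, or require_approval."""
    if request.tenant_id != request.resource_tenant_id:
        return "deny", "cross-tenant access"
    if (request.agent_role, request.action) not in ALLOWED_ACTIONS:
        return "deny", "no matching allow rule"  # deny by default
    if request.tool_effect in SIDE_EFFECTS and not request.has_valid_approval:
        return "require_approval", "side effect needs approval"
    return "allow", "allow rule matched"
```

Because a worker role paired with a state-update action never appears in the allowlist, "no official state update from worker" falls out of the default rather than needing its own rule.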

16. Human-in-the-Loop Architecture

Decision Package

class CapstoneDecisionPackage(BaseModel):
    decision_package_id: str
    tenant_id: str
    case_id: str
    run_id: str
    proposed_action: str
    rationale: str
    evidence_refs: list[str]
    policy_basis: list[str]
    risk_level: str
    known_uncertainties: list[str]
    alternatives: list[str]
    side_effect_preview: dict
    version: int

Human Review Rules

reviewer authorization required;
separation of duties;
version check;
approval expiry;
decision event immutable;
approval separate from execution;
human sees evidence and uncertainty;
high-risk overrides require reason.
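
Four of these rules (authorization, separation of duties, version check, expiry) can be checked together just before execution. The sketch below assumes a hypothetical `Approval` record and `approval_problems` function; an empty result means the approval is usable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Approval:
    approval_id: str
    reviewer_id: str
    reviewer_roles: frozenset[str]
    package_id: str
    package_version: int
    expires_at: datetime


def approval_problems(
    approval: Approval,
    *,
    package_id: str,
    package_version: int,
    package_author_id: str,
    required_role: str,
    now: datetime,
) -> list[str]:
    """Return every reason the approval cannot be used; empty means valid."""
    problems: list[str] = []
    if required_role not in approval.reviewer_roles:
        problems.append("reviewer lacks required role")
    if approval.reviewer_id == package_author_id:
        problems.append("separation of duties violated")
    if (approval.package_id, approval.package_version) != (package_id, package_version):
        problems.append("approval is for a different package version")
    if now >= approval.expires_at:
        problems.append("approval expired")
    return problems


now = datetime(2026, 6, 29, 12, 0, tzinfo=timezone.utc)
approval = Approval(
    "appr_1", "reviewer_7", frozenset({"senior_reviewer"}),
    "pkg_1", 3, expires_at=now + timedelta(hours=24),
)
```

Returning all problems, rather than stopping at the first, gives the audit trail a complete reason for a rejected approval.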

17. Side-Effect Boundary

External notice sending flow:

Command

class SendApprovedNoticeCommand(BaseModel):
    command_id: str
    tenant_id: str
    case_id: str
    notice_draft_id: str
    approval_id: str
    recipient_id: str
    expected_case_version: int
    idempotency_key: str

Rules

command handler owns commit;
approval binds to draft version;
idempotency key stable;
external reference recorded;
ambiguous timeout triggers reconciliation;
outbox/inbox for integration events.
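
A stable idempotency key and an expected-version check can be sketched in a small handler. This is a simplified, in-memory illustration: `NoticeCommandHandler`, `StaleCaseVersion`, and `fake_send` are hypothetical, and a production handler would persist the key and the version atomically and reconcile ambiguous timeouts instead of trusting a single call.

```python
from typing import Callable


class StaleCaseVersion(Exception):
    """The case changed after the command was prepared."""


class NoticeCommandHandler:
    def __init__(self, case_version: int, send: Callable[[str], str]) -> None:
        self.case_version = case_version
        self._send = send
        self._completed: dict[str, str] = {}  # idempotency key -> external reference

    def handle(self, *, idempotency_key: str, recipient_id: str,
               expected_case_version: int) -> str:
        # A replayed command returns the recorded result without sending again.
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]
        if expected_case_version != self.case_version:
            raise StaleCaseVersion(
                f"expected v{expected_case_version}, case is at v{self.case_version}"
            )
        external_ref = self._send(recipient_id)
        self._completed[idempotency_key] = external_ref  # external reference recorded
        self.case_version += 1
        return external_ref


sent: list[str] = []


def fake_send(recipient_id: str) -> str:
    """Stand-in for the notification service; records each real send."""
    sent.append(recipient_id)
    return f"ext_{len(sent)}"


handler = NoticeCommandHandler(case_version=4, send=fake_send)
first = handler.handle(idempotency_key="k1", recipient_id="r1", expected_case_version=4)
retry = handler.handle(idempotency_key="k1", recipient_id="r1", expected_case_version=4)
```

The retry returns the same external reference and `sent` still holds one entry, which is the property the "duplicate side effect" eval case is meant to protect.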

18. Guardrail Runtime

Guardrails at boundaries:

Boundary	Guardrails
input	prompt injection, intent/risk
context	source authority, sensitivity, sufficiency
RAG	ACL, freshness, untrusted content
output	schema, citations, sensitive data
tool	schema, grants, effect policy
memory	source/scope/sensitivity
workflow	loop/deadlock/budget
state	transition/version/checkpoint
human review	package version/authorization
MCP	server/capability allowlist

Guardrails return typed decisions:

class CapstoneGuardrailResult(BaseModel):
    guardrail_id: str
    boundary: str
    decision: str
    reason: str
    version: str
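
When several guardrails run at one boundary, their typed decisions need a combination rule. The sketch below assumes the guardrail `decision` field uses the same three values as the policy decision (`allow`, `require_approval`, `deny`), which the lesson does not state, and picks the most restrictive one. `combine_decisions` is a hypothetical helper.

```python
# Assumed decision vocabulary, ordered from least to most restrictive.
_SEVERITY = {"allow": 0, "require_approval": 1, "deny": 2}


def combine_decisions(decisions: list[str]) -> str:
    """Return the most restrictive decision among the guardrails that ran."""
    if not decisions:
        raise ValueError("at least one guardrail decision is required")
    # An unrecognized decision string is treated as "deny" rather than ignored.
    normalized = [d if d in _SEVERITY else "deny" for d in decisions]
    return max(normalized, key=_SEVERITY.__getitem__)
```

Refusing an empty list keeps a misconfigured boundary, where no guardrail ran at all, from being read as an implicit allow.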

19. Evaluation Architecture

Eval Suites

Suite	Coverage
fast PR eval	schema/tool/policy basics
RAG eval	retrieval and grounding
tool eval	selection/arguments/forbidden tools
trajectory eval	multi-step workflow path
safety eval	injection/exfiltration/tool abuse
reliability eval	failure injection
human review eval	decision package quality
full release eval	all high-risk scenarios

Critical Failure Examples

notice sent without approval;
cross-tenant retrieval;
unsupported high-risk claim;
forbidden tool call;
memory poisoning accepted;
duplicate side effect;
policy false allow.

20. Reliability Architecture

Reliability controls:

checkpoint every durable step;
timeout hierarchy;
retry policy by error type;
idempotency for side effects;
circuit breakers for dependencies;
budgets for model/tool/token/cost;
loop detection;
deadlock detection;
fallback policy;
graceful degradation;
escalation path;
chaos/failure injection.
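
"Retry policy by error type" can be illustrated with a bounded retry that backs off on transient errors and lets everything else propagate immediately. The exception classes and `call_with_retry` below are hypothetical, the delays are arbitrary, and the pattern is only appropriate for reads or for operations protected by an idempotency key.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying, e.g. a timeout or rate limit."""


class PermanentError(Exception):
    """Not worth retrying, e.g. a validation or authorization failure."""


def call_with_retry(
    operation: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure for escalation
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
        # Any other exception, including PermanentError, propagates at once.
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps tests fast and makes the backoff schedule itself assertable.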

Reliability Flow

21. Observability Architecture

Forensic Questions

The system must answer:

what did the agent see?
what model/prompt/tool versions were active?
what evidence supported each claim?
what tool did the agent propose?
what did policy decide?
who approved?
what command committed?
what side effect occurred?
what changed after release?

Required Events

run started/completed;
context assembled;
model call completed;
tool requested/executed/denied;
retrieval performed;
memory retrieved/written/rejected;
policy decision;
guardrail decision;
checkpoint saved/resumed;
human decision recorded;
command committed;
side effect reconciled;
eval case failed.

22. Deployment Architecture

Deployable Units

agent runtime service;
worker service/pool;
tool executor;
MCP adapter;
RAG retrieval service;
memory service;
policy engine;
evaluation service;
observability pipeline;
UI/review queue.

Deploy units separately where their risk profiles or scaling needs differ.

23. Environment Strategy

Environment	Purpose
local	unit/contract tests
dev	integration with mocks
staging	full simulation with safe data
pre-prod	production-like load/evals
production canary	limited real traffic
production	governed release

High-impact tools should have sandbox/stub variants in non-prod.

24. Data and Tenant Isolation

Enterprise systems require tenant isolation.

Controls:

tenant ID on every state/artifact/tool request;
policy checks include tenant;
RAG indexes tenant-scoped or ACL-filtered;
memory tenant-scoped;
graph traversal tenant-scoped;
traces tagged by tenant but access-controlled;
eval data de-identified;
cross-tenant tests in CI.

Tenant Isolation Test

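def test_cross_tenant_evidence_denied():
    # RetrievalRequest and retrieval_service are assumed to come from the
    # project's RAG layer; they are not defined in this lesson.
    # The requester belongs to tenant_a but filters on a tenant_b case.
    request = RetrievalRequest(
        request_id="req_1",
        tenant_id="tenant_a",
        requester_id="user_a",
        run_id="run_1",
        query="case evidence",
        metadata_filters={"case_id": "tenant_b_case"},
        max_results=10,
    )

    result = retrieval_service.search(request)

    # Isolation holds if nothing from the other tenant is returned.
    assert result.chunks == []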

25. Security Architecture

Security controls:

identity provider integration;
service-to-service auth;
tenant isolation;
least-privilege tools;
prompt injection defense;
memory/RAG poisoning controls;
MCP server registry;
secret isolation;
egress control;
policy enforcement;
audit logs;
kill switches;
incident response.

Threat Control Map

Threat	Control
prompt injection	context isolation + tool policy
data exfiltration	auth before retrieval + redaction
excessive agency	tool grants + policy + approval
memory poisoning	memory write policy
RAG poisoning	ingestion validation + citation verification
MCP compromise	registry + sandbox + kill switch
duplicate side effect	idempotency + reconciliation
trace leakage	redaction + access controls

26. Governance Operating Model

Governance Artifacts

AI system inventory;
intended/prohibited use;
risk tier assessment;
risk register;
control catalog;
role/tool/prompt registry snapshots;
RAG index report;
memory policy report;
eval report;
threat model;
incident runbook;
evidence pack.

27. Implementation Skeleton

A simplified Python package layout:

case_ai_platform/
  app/
    api/
    config/
  runtime/
    orchestrator.py
    graph.py
    checkpoints.py
    state.py
  agents/
    supervisor.py
    evidence.py
    risk.py
    policy.py
    drafting.py
    verifier.py
    adjudicator.py
  context/
    builder.py
    blocks.py
    compression.py
    sufficiency.py
  tools/
    registry.py
    executor.py
    contracts.py
    mcp_adapter.py
  policy/
    engine.py
    requests.py
    decisions.py
  memory/
    service.py
    governance.py
  rag/
    retrieval.py
    ingestion.py
    citation_verifier.py
  graph/
    service.py
    schema.py
  human/
    review_queue.py
    approval.py
  evals/
    datasets.py
    graders.py
    scenarios.py
    gates.py
  observability/
    tracing.py
    events.py
    manifests.py
  security/
    threat_tests.py
    guardrails.py
  domain/
    commands.py
    events.py
    services.py

This layout is not mandatory, but it shows separation of concerns.

28. Orchestrator Sketch

class CaseReviewOrchestrator:
    def __init__(
        self,
        context_builder,
        supervisor,
        policy_engine,
        guardrails,
        checkpoint_store,
        artifact_store,
        tool_executor,
        tracer,
    ) -> None:
        self.context_builder = context_builder
        self.supervisor = supervisor
        self.policy_engine = policy_engine
        self.guardrails = guardrails
        self.checkpoint_store = checkpoint_store
        self.artifact_store = artifact_store
        self.tool_executor = tool_executor
        self.tracer = tracer

    async def run_case_review(self, *, tenant_id: str, user_id: str, case_id: str) -> str:
        run_id = new_id("run")

        async with self.tracer.span("case_review.run", {"run_id": run_id, "case_id": case_id}):
            state = ExecutionState(
                run_id=run_id,
                thread_id=new_id("thread"),
                current_node="supervisor",
                budget_remaining_usd=5.00,
            )

            await self.checkpoint_store.save(state)

            context = await self.context_builder.build_for_supervisor(
                tenant_id=tenant_id,
                user_id=user_id,
                case_id=case_id,
                run_id=run_id,
            )

            await self.guardrails.evaluate_context(context)

            supervisor_result = await self.supervisor.run(context)

            artifact_id = await self.artifact_store.save(supervisor_result.decision_package)

            policy_decision = await self.policy_engine.evaluate_action(
                tenant_id=tenant_id,
                user_id=user_id,
                run_id=run_id,
                action="case.notice.propose_send",
                resource_id=artifact_id,
            )

            if policy_decision.decision == "require_approval":
                interrupt_id = await create_human_review_interrupt(
                    run_id=run_id,
                    artifact_id=artifact_id,
                    required_role="senior_reviewer",
                )
                state.pending_interrupt_id = interrupt_id
                state.current_node = "awaiting_human_review"
                await self.checkpoint_store.save(state)
                return run_id

            if policy_decision.decision != "allow":
                # Deny by default: "deny" and any unrecognized decision stop here.
                await record_denial(run_id, policy_decision)
                return run_id

            await self._execute_allowed_action(supervisor_result)

            return run_id

This is intentionally incomplete but shows architecture boundaries.

29. Supervisor Sketch

class SupervisorAgent:
    def __init__(self, workers, adjudicator, model_client, artifact_store):
        self.workers = workers
        self.adjudicator = adjudicator
        self.model_client = model_client
        self.artifact_store = artifact_store

    async def run(self, context: CapstoneContextPackage):
        evidence = await self.workers.evidence.run(context)
        risk = await self.workers.risk.run(context)
        policy = await self.workers.policy.run(context)

        conflicts = detect_conflicts(evidence, risk, policy)

        if conflicts:
            adjudication = await self.adjudicator.run(conflicts)
            if adjudication.requires_human_review:
                return build_human_review_package(evidence, risk, policy, adjudication)

        draft = await self.workers.drafting.run(evidence, risk, policy)
        verification = await self.workers.verifier.run(draft)

        if not verification.passed:
            return build_revision_or_review_package(draft, verification)

        return build_decision_package(evidence, risk, policy, draft, verification)

The supervisor aggregates typed artifacts, not raw chatter.

30. Capstone End-to-End Scenario

Scenario:

Case C-123 contains evidence that Entity XYZ may have repeatedly submitted late filings.
Analyst asks the system to review the case and prepare an escalation package.

Expected Behavior

load case snapshot;
retrieve authorized evidence;
retrieve applicable policy by effective date;
build evidence summary;
assess risk;
identify missing evidence;
map policy;
draft decision package;
verify citations;
detect high-risk external notice requirement;
create human review interrupt;
wait for authorized reviewer;
if approved, execute command handler;
record notice sent event;
update case status;
preserve audit trail.

Forbidden Behavior

send notice without approval;
retrieve unrelated tenant evidence;
cite nonexistent documents;
store unverified memory as tenant policy;
update case status from worker agent;
ignore missing evidence;
hide policy uncertainty;
continue loop after budget exhausted.

31. Capstone Evaluation Cases

Create at least these eval categories:

Category	Example
happy path	evidence supports high-risk escalation
missing evidence	agent must request more evidence
conflicting evidence	adjudication/human review
stale policy	effective-date policy needed
prompt injection doc	ignore malicious retrieved instruction
cross-tenant access	retrieval denied
low-risk case	no unnecessary human review
high-risk notice	approval required
verifier failure	draft blocked/revised
tool timeout	safe retry/fallback
ambiguous send	reconciliation
memory poisoning	memory write rejected
duplicate approval	idempotent decision
cost budget	stop/escalate
model malformed output	repair bounded

32. Capstone Release Gate

Release gate:

Must pass:
- 0 critical policy false allows
- 0 external side effects without approval
- 0 cross-tenant retrievals
- citation accuracy >= 95%
- retrieval recall@10 >= 90%
- high-risk scenario pass rate >= 98%
- memory poisoning rejection >= 99%
- prompt injection side-effect prevention = 100%
- p95 latency within target for low-risk tasks
- cost p95 within budget

Critical failures block release.
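
The numeric thresholds above can be evaluated mechanically in CI. The sketch below is one possible encoding: the metric names, `GATES`, and `release_failures` are hypothetical, rates are expressed as fractions between 0 and 1, and the latency and cost gates are omitted because the lesson does not give their targets.

```python
import operator

# (metric name, comparison, threshold), taken from the release gate above.
GATES = [
    ("critical_policy_false_allows", "==", 0),
    ("side_effects_without_approval", "==", 0),
    ("cross_tenant_retrievals", "==", 0),
    ("citation_accuracy", ">=", 0.95),
    ("retrieval_recall_at_10", ">=", 0.90),
    ("high_risk_pass_rate", ">=", 0.98),
    ("memory_poisoning_rejection", ">=", 0.99),
    ("injection_side_effect_prevention", "==", 1.0),
]

_OPS = {"==": operator.eq, ">=": operator.ge}


def release_failures(metrics: dict[str, float]) -> list[str]:
    """Return one message per failed gate; an empty list means the gate passes."""
    failures: list[str] = []
    for name, op, threshold in GATES:
        value = metrics.get(name)
        # A missing metric fails the gate instead of being skipped.
        if value is None or not _OPS[op](value, threshold):
            failures.append(f"{name}: got {value}, need {op} {threshold}")
    return failures
```

Treating a missing metric as a failure matters: an eval suite that silently did not run should block the release just as a failing one does.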

33. Capstone Observability Dashboard

Dashboard sections:

Runtime

runs by status;
latency p50/p95;
workflow terminal state;
loop/deadlock detections;
model/tool error rate.

Quality

citation verification failure;
unsupported claim rate;
eval regression score;
human rejection/override rate.

Safety

policy denials;
guardrail tripwires;
unauthorized tool attempts;
memory rejection;
prompt injection detections.

Operations

cost per run;
token usage;
RAG latency;
tool latency;
review backlog;
side-effect ambiguity.

Governance

release versions;
active kill switches;
risk register open items;
eval suite status.

34. Capstone Incident Example

Incident:

A decision package cited doc_77, but doc_77 did not support the claim.

Forensic process:

identify run ID;
load run manifest;
inspect RAG retrieval event;
inspect selected chunk IDs;
inspect context package;
inspect model output;
inspect citation verifier result;
inspect why verifier passed/failed;
identify whether issue was retrieval, generation, or verifier;
create regression eval case;
update verifier/retrieval eval;
release fix through gate.

This is mature operational AI.

35. Top 1% Architecture Review Questions

Use these before shipping.

System

Is this a workflow, agent, or multi-agent system?
Why are multiple agents needed?
What is the autonomy budget?
What is the risk tier?

State

What is domain state?
What is execution state?
What is memory?
What is artifact state?
What is the source of truth?

Tools

Which tools have side effects?
Which tools require approval?
Are tool calls idempotent?
Can agents call raw infrastructure tools?

Context

What context does each agent see?
Are untrusted sources isolated?
Are source refs preserved?
What happens if context is insufficient?

Memory

What is stored?
Who can read/write?
How is it forgotten?
Can memory affect decisions?

RAG/Graph

What corpus is authoritative?
Is retrieval authorized?
Are citations verified?
Does the graph have provenance?

Policy

Where are the policy enforcement points (PEPs)?
What is denied by default?
What policy version applied?
Is approval enforced by code?

Human Review

What exactly is approved?
Is package version checked?
Is reviewer authorized?
What happens on timeout?

Reliability

What are top failure modes?
What retries are safe?
What is fallback?
What is the cost budget?

Observability

Can the run be reconstructed?
What did the model see?
What tools executed?
Which evidence supported claims?

Evaluation

What golden set exists?
What critical failures block release?
Are incidents turned into evals?

36. Implementation Roadmap

Phase 1 — Minimal Safe Skeleton

Build:

typed state;
runtime/checkpoints;
supervisor only;
one read-only tool;
policy engine stub;
trace/run manifest;
simple eval harness.

Avoid side effects.

Phase 2 — Evidence and Policy

Add:

RAG retrieval;
policy retrieval;
evidence summaries;
citation verifier;
context builder;
RAG evals.

Phase 3 — Multi-Agent Specialists

Add:

evidence agent;
risk agent;
policy agent;
drafting agent;
verifier;
supervisor aggregation.

Phase 4 — Human Review and Side Effects

Add:

decision package;
human interrupt;
approval service;
command handler;
idempotency;
outbox/inbox;
notification sandbox.

Phase 5 — Governance and Production Controls

Add:

tool registry;
memory governance;
guardrail runtime;
threat model;
release gates;
dashboards;
kill switches;
incident runbook.

Phase 6 — Production Hardening

Add:

load/cost controls;
circuit breakers;
chaos/failure injection;
canary/shadow eval;
full audit/evidence pack;
operational runbooks.

37. Capstone Practice Project

Build a simplified but production-shaped prototype.

Required Components

FastAPI or similar API layer.
Pydantic contracts for state/artifacts/tools.
Stateful graph/orchestrator.
Supervisor agent.
Two specialist agents: evidence and risk.
Mock RAG corpus.
Citation verifier.
Policy engine.
Human approval interrupt.
Tool executor with one side-effecting sandbox tool.
Checkpoint store.
Trace/event store.
Evaluation harness with at least 20 scenarios.

Required Safety Tests

prompt injection does not call send tool;
high-risk notice requires approval;
cross-tenant retrieval denied;
missing evidence produces escalation;
duplicate side effect prevented;
stale approval rejected;
memory poisoning rejected;
verifier blocks unsupported citation.

Required Observability

run manifest;
context event;
model call event;
tool call event;
policy decision;
guardrail decision;
checkpoint event;
approval event;
side-effect event;
artifact lineage.

38. Maturity Model

Level 0 — Demo

prompt + tool calls;
no durable state;
no policy;
no eval;
no audit.

Level 1 — Prototype

typed outputs;
simple tools;
basic traces;
manual testing.

Level 2 — Controlled Pilot

stateful runtime;
read-only tools;
RAG with citations;
basic evals;
human review.

Level 3 — Production-Ready

policy engine;
tool registry;
memory governance;
guardrails;
side-effect boundaries;
regression gates;
observability;
incident response.

Level 4 — Enterprise Platform

multi-tenant governance;
MCP registry;
model/tool/prompt registries;
advanced evals;
runtime forensics;
risk management;
self-service capability onboarding;
mature SRE/AI governance.

Level 5 — Strategic Capability

continuous evaluation;
adaptive but governed memory;
mature graph/RAG evidence infrastructure;
organization-wide AI control plane;
measurable business and risk outcomes.

39. Final Production Checklist

Before enterprise launch:

40. What Top 1% Engineers Internalize

Top engineers do not think:

How do I make the agent smarter?

They think:

How do I make the system correct, bounded, observable, governable, recoverable, and useful even when the model is uncertain or wrong?

They know:

state is architecture;
context is control;
tools are authority;
memory is liability unless governed;
RAG is evidence infrastructure;
policy must be enforced;
humans need decision packages;
side effects need transaction boundaries;
evals are release gates;
observability is forensics;
governance is runtime architecture;
reliability means safe progress or safe stop.

This is the mindset shift from AI demo builder to enterprise AI systems engineer.

41. Series Closure

Across the full series, we covered:

Kaufman skill map;
target performance and decomposition;
enterprise AI mental model;
agentic taxonomy;
state machines;
control plane vs data plane;
orchestration topologies;
determinism vs autonomy;
stateful runtime design;
Python runtime architecture;
domain/conversation/execution state;
typed agent contracts;
command/query/event model;
idempotency and retry;
agent roles;
planner-executor-critic;
supervisor-worker routing;
consensus/adjudication;
human-in-the-loop;
memory architecture;
context engineering;
RAG as system component;
knowledge graphs;
memory governance;
tool contracts;
MCP tooling;
permissioning/policy;
side effects/transactions;
threat modeling;
guardrails;
AI governance;
evaluation;
reliability;
observability;
capstone reference architecture.

The connective tissue:

Enterprise-grade stateful multi-agent AI is not a model problem. It is a distributed systems, product, security, governance, data, evaluation, and operations problem where the model is only one component.

42. Final Practice: Build the Capstone

Your final exercise:

Build a working prototype of the case-management reference architecture.

Minimum version:

one supervisor;
two workers;
one RAG store;
one policy engine;
one human review interrupt;
one side-effecting sandbox tool;
checkpointing;
eval harness;
trace events.

Then iterate:

add verifier;
add memory governance;
add graph traversal;
add guardrails;
add MCP adapter;
add failure injection;
add release gates;
add incident replay.

The purpose is not to build a perfect platform immediately.

The purpose is to practice the architecture until the design instincts become automatic.

43. Final Summary

This capstone presented a complete reference architecture for an enterprise-grade stateful multi-agent AI case management system.

We covered:

system context;
invariants;
capability map;
reference architecture;
runtime flow;
agent topology;
state model;
artifacts;
context architecture;
RAG;
knowledge graph;
memory;
tools and MCP;
policy enforcement;
human review;
side effects;
guardrails;
evaluation;
reliability;
observability;
deployment;
tenant isolation;
security;
governance;
implementation skeleton;
orchestrator and supervisor sketches;
end-to-end scenario;
eval cases;
release gate;
dashboard;
incident example;
review questions;
implementation roadmap;
practice project;
maturity model;
production checklist.

The final principle:

The top 1% engineer does not merely make agents act. They make agentic systems accountable.

References

LangGraph documentation: durable execution, persistence/checkpoints, interrupts, and long-running stateful workflows.
OpenAI Agents SDK documentation: agents, tools, guardrails, sessions, handoffs, and tracing.
Model Context Protocol specification: tools, resources, prompts, clients, servers, and authorization boundaries.
NIST AI Risk Management Framework: Govern, Map, Measure, and Manage functions for AI risk.
OWASP Top 10 for LLM Applications: prompt injection, excessive agency, sensitive information disclosure, insecure output handling, and supply-chain risks.
OpenTelemetry documentation: traces, spans, metrics, logs, context propagation, and observability signals.
