Orchestration Topologies
Learn Python Enterprise-Grade Stateful Multi-Agent AI Systems - Part 007
Orchestration topologies for enterprise-grade stateful multi-agent AI systems: router, supervisor, swarm, graph, pipeline, blackboard, handoff, and hierarchical control.
Top 1% engineers do not start a multi-agent system by asking: “Which framework should I use?”
They start by asking: what coordination topology makes the system correct, explainable, recoverable, and governable?
This part focuses on orchestration topology: the structural pattern used to coordinate multiple agents, tools, workflows, state transitions, human approvals, and side effects.
A topology is not a diagram decoration. It determines:
- who owns the decision;
- where state lives;
- where retries happen;
- where human approval is inserted;
- how failures are isolated;
- how audit trails are reconstructed;
- how expensive loops are prevented;
- how agent behavior is tested.
Frameworks such as LangGraph, OpenAI Agents SDK, Microsoft Agent Framework, CrewAI, AutoGen-style systems, or custom Python runtimes express these patterns differently. The important skill is not memorizing framework APIs. The important skill is knowing which topology fits the risk, state, latency, audit, and autonomy requirements.
1. Kaufman Framing
Josh Kaufman’s method emphasizes:
- Deconstruct the skill
- Learn enough to self-correct
- Remove barriers to practice
- Practice deliberately for at least 20 hours
For orchestration topology, the skill is not “knowing all patterns.” The skill is:
Given a business process, risk profile, and operational constraint, choose and implement the simplest topology that preserves correctness, visibility, and control.
Target Performance
By the end of this part, you should be able to:
- classify an AI workflow into router, pipeline, supervisor, graph, swarm, blackboard, handoff, or hierarchical topology;
- explain the state ownership model of each topology;
- identify where retries, compensation, approvals, and audit logging belong;
- recognize when a multi-agent design is over-engineered;
- design a topology for a case-management or enforcement workflow;
- sketch a Python runtime interface independent of any specific framework.
2. The Core Problem: Coordination Under Uncertainty
A traditional distributed system coordinates deterministic services.
An agentic system coordinates semi-deterministic actors:
- LLM calls may vary;
- tool selection may be probabilistic;
- context may be incomplete;
- memory may be stale;
- agents may disagree;
- output may require validation;
- side effects may be irreversible.
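So the orchestration layer must answer the questions a topology determines: who owns each decision, where state lives, where retries and approvals happen, and how failures are isolated and audited.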
In enterprise systems, the orchestration layer is usually more important than the model call itself.
3. Workflow vs Agent vs Multi-Agent
Before selecting a topology, separate these three ideas.
Workflow
A workflow has a mostly predetermined path.
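Example: a case intake that receives a complaint, normalizes it, extracts entities, and scores severity, always in that order.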
The system knows the order. The model may help inside steps, but the path is controlled.
Agent
An agent chooses some of its own path.
The runtime gives the agent capabilities, and the agent decides when to use them.
Multi-Agent System
A multi-agent system has multiple specialized actors.
The hard part is not creating agents. The hard part is defining coordination semantics.
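The difference between a workflow and an agent can be sketched in a few lines. This is a minimal illustration, not any framework's API: `run_workflow`, `run_agent`, and the `choose_next` callback are hypothetical names, and the "agent" is a plain function standing in for a model call.

```python
from typing import Callable, Optional

Step = Callable[[str], str]


def run_workflow(text: str, steps: list[Step]) -> str:
    # Workflow: the runtime fixes the order; a step cannot change the path.
    for step in steps:
        text = step(text)
    return text


def run_agent(
    text: str,
    tools: dict[str, Step],
    choose_next: Callable[[str, list[str]], Optional[str]],
    max_turns: int = 5,
) -> tuple[str, list[str]]:
    # Agent: a chooser (standing in for a model call) picks the next tool.
    # The runtime still bounds the loop and records the path taken.
    path: list[str] = []
    for _ in range(max_turns):
        tool_name = choose_next(text, path)
        if tool_name is None:
            break
        text = tools[tool_name](text)
        path.append(tool_name)
    return text, path


def lowercase(text: str) -> str:
    return text.lower()


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def choose_next(text: str, path: list[str]) -> Optional[str]:
    # Hypothetical policy: use each tool at most once, and only if it
    # would actually change the text.
    for name, tool in (("collapse_spaces", collapse_spaces), ("lowercase", lowercase)):
        if name not in path and tool(text) != text:
            return name
    return None
```

In the workflow, the caller lists the steps. In the agent loop, the same functions are offered as tools and the chooser decides which ones run, so the recorded `path` is the only way to know afterwards what happened.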
4. Orchestration Topology Overview
| Topology | Best For | State Ownership | Autonomy | Risk | Typical Failure |
|---|---|---|---|---|---|
| Router | Fast dispatch to known specialist | Central request state | Low | Low-Medium | Misclassification |
| Pipeline | Predictable step-by-step process | Stage-owned or shared state | Low | Low-Medium | Brittle sequence |
| Supervisor | Delegation with central control | Supervisor-owned state | Medium | Medium | Supervisor bottleneck |
| Handoff | Specialist ownership transfer | Current agent/session state | Medium | Medium | Lost context |
| Graph | Explicit stateful transitions | Shared typed graph state | Medium | Medium-High | State explosion |
| Blackboard | Shared evidence workspace | Shared artifact store | Medium-High | High | Conflicting writes |
| Swarm / Peer Network | Emergent collaboration | Distributed / local state | High | High | Unbounded loops |
| Hierarchical | Large enterprise decomposition | Layered ownership | Medium-High | High | Slow escalation |
| Hybrid | Real enterprise systems | Mixed | Depends | Depends | Inconsistent boundaries |
A serious enterprise implementation often combines several topologies. But it should combine them intentionally, not accidentally.
5. Router Topology
A router decides which specialist should handle a task.
When to Use
Use a router when:
- categories are known;
- routing decision is cheap;
- specialist behavior is isolated;
- state does not require complex cross-agent negotiation;
- wrong routing is recoverable.
Examples:
- classify support ticket into billing/refund/technical;
- route compliance case to domain-specific workflow;
- choose extraction prompt by document type;
- choose model tier by risk/cost.
Router Invariants
A router topology needs these invariants:
- Every input has a route or fallback
- Routing decision is recorded
- Routing confidence is explicit
- Low-confidence routes go to fallback
- Specialists do not silently reroute without emitting a handoff event
Python Sketch
```python
from enum import Enum

from pydantic import BaseModel, Field


class Route(str, Enum):
    BILLING = "billing"
    COMPLIANCE = "compliance"
    TECHNICAL = "technical"
    FALLBACK = "fallback"


class RoutingDecision(BaseModel):
    route: Route
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str
    required_evidence: list[str] = []


class RoutedRequest(BaseModel):
    request_id: str
    tenant_id: str
    text: str
    route: Route | None = None
    routing_decision: RoutingDecision | None = None


def enforce_routing_policy(decision: RoutingDecision, threshold: float = 0.75) -> Route:
    # Low-confidence decisions go to the fallback route, never to a specialist.
    if decision.confidence < threshold:
        return Route.FALLBACK
    return decision.route
```
The important detail is not the code. The important detail is that the router output is a typed decision artifact, not an unstructured string.
Router Failure Modes
| Failure | Cause | Mitigation |
|---|---|---|
| Misroute | Ambiguous input | Confidence threshold + fallback |
| Specialist mismatch | Taxonomy drift | Route taxonomy versioning |
| Hidden reroute | Specialist calls another path silently | Handoff event required |
| Over-routing | Too many categories | Collapse categories by operational owner |
| Prompt-only routing | No testable contract | Typed routing output |
Router Practice Drill
Take 30 real support/case-management examples. Define:
- route taxonomy;
- confidence threshold;
- fallback criteria;
- evaluation set;
- expected route;
- explanation quality rubric.
The goal is not 100% automation. The goal is knowing when not to automate.
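One way to run this drill is to score a router against a small labeled set and read accuracy and fallback rate together. The sketch below is dependency-free and purely illustrative: `LabeledExample`, `evaluate_router`, the keyword-based `toy_router`, and the four sample texts are hypothetical stand-ins for a real classifier and a real evaluation set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LabeledExample:
    text: str
    expected_route: str


def evaluate_router(
    examples: list[LabeledExample],
    router: Callable[[str], tuple[str, float]],
    threshold: float = 0.75,
) -> dict[str, float]:
    # The router returns (route, confidence); low confidence means fallback.
    automated = correct = fallbacks = 0
    for example in examples:
        route, confidence = router(example.text)
        if confidence < threshold:
            fallbacks += 1
            continue
        automated += 1
        if route == example.expected_route:
            correct += 1
    total = len(examples)
    return {
        "fallback_rate": fallbacks / total if total else 0.0,
        # Accuracy is measured only on requests the router chose to automate.
        "automated_accuracy": correct / automated if automated else 0.0,
    }


def toy_router(text: str) -> tuple[str, float]:
    # Hypothetical keyword router, used only to exercise the evaluator.
    if "invoice" in text:
        return "billing", 0.9
    if "error" in text:
        return "technical", 0.8
    return "billing", 0.3


examples = [
    LabeledExample("invoice is wrong", "billing"),
    LabeledExample("error on login", "technical"),
    LabeledExample("something odd happened", "compliance"),
    LabeledExample("invoice caused an error", "technical"),
]
report = evaluate_router(examples, toy_router)
```

Raising the threshold trades automated volume for accuracy, so the two numbers are only meaningful when reported together.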
6. Pipeline Topology
A pipeline executes steps in a known order.
When to Use
Use a pipeline when:
- the process is stable;
- steps are known;
- output of each step feeds the next;
- auditability matters;
- parallelism is not the main concern;
- deterministic control is more valuable than agent autonomy.
Examples:
- document ingestion;
- regulatory case intake;
- KYC evidence extraction;
- incident triage summary;
- report generation with human review.
Pipeline State Model
A pipeline can use stage-specific state:
```python
from pydantic import BaseModel


class IntakeState(BaseModel):
    raw_input: str
    normalized_text: str | None = None
    extracted_entities: dict[str, str] = {}
    validation_errors: list[str] = []
    risk_score: float | None = None
    draft_summary: str | None = None
    approved: bool = False
```
Each stage should be a pure-ish transformation where possible:
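```python
def extract_entities(state: IntakeState) -> IntakeState:
    # 1. Call an LLM or a deterministic parser.
    # 2. Validate the output.
    # 3. Return a new state snapshot rather than mutating the input.
    # Placeholder: this stub returns the state unchanged.
    return state
```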
Pipeline Invariants
- Each stage has typed input/output
- Each stage is idempotent or protected by idempotency key
- Each stage emits an event
- Failure at one stage does not corrupt previous state
- Human review points are explicit
- Retries are local unless state is invalidated
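Several of these invariants can be shown together in a small runner. The following is a minimal sketch under stated assumptions: `PipelineRunner`, `PipelineState`, and the two toy stages are hypothetical, state is an immutable dataclass rather than the pydantic model above, and the in-memory `completed` dictionary stands in for a durable store keyed by idempotency key.

```python
from dataclasses import dataclass, field, replace
from typing import Callable


@dataclass(frozen=True)
class PipelineState:
    raw_input: str
    normalized_text: str = ""
    word_count: int = 0


Stage = Callable[[PipelineState], PipelineState]


@dataclass
class PipelineRunner:
    stages: list[tuple[str, Stage]]
    # Completed (run_id, stage name) pairs act as idempotency keys.
    completed: dict[tuple[str, str], PipelineState] = field(default_factory=dict)
    events: list[dict[str, str]] = field(default_factory=list)

    def run(self, run_id: str, state: PipelineState) -> PipelineState:
        for name, stage in self.stages:
            key = (run_id, name)
            if key in self.completed:
                # The stage already ran for this run: reuse its snapshot.
                state = self.completed[key]
                self.events.append({"stage": name, "status": "skipped"})
                continue
            try:
                new_state = stage(state)
            except Exception as exc:
                # Earlier snapshots are untouched because state is frozen.
                self.events.append({"stage": name, "status": "failed", "error": str(exc)})
                raise
            self.completed[key] = new_state
            self.events.append({"stage": name, "status": "ok"})
            state = new_state
        return state


def normalize(state: PipelineState) -> PipelineState:
    return replace(state, normalized_text=" ".join(state.raw_input.split()).lower())


def count_words(state: PipelineState) -> PipelineState:
    return replace(state, word_count=len(state.normalized_text.split()))
```

Re-running the same `run_id` after a failure resumes from the first stage that has no recorded snapshot, which is what keeps retries local.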
Pipeline Anti-Pattern
A common anti-pattern is hiding a fully autonomous agent inside a pipeline step, so that one box in the diagram conceals open-ended planning and tool use.
This defeats the purpose of the pipeline. If a step has broad autonomy, make that autonomy visible in the topology.
Pipeline Practice Drill
Model a case intake workflow:
- receive complaint;
- normalize complaint;
- extract entities;
- map allegations to regulatory categories;
- score severity;
- identify missing evidence;
- draft analyst brief;
- wait for human approval.
For each step, define:
- input schema;
- output schema;
- validation rule;
- retry policy;
- escalation condition.
7. Supervisor Topology
A supervisor coordinates specialists while retaining central control.
When to Use
Use a supervisor when:
- multiple specialist agents are useful;
- central coordination is required;
- cross-agent consistency matters;
- state must remain coherent;
- the system needs explainable delegation;
- agent-to-agent chaos is unacceptable.
Examples:
- enforcement case analysis;
- incident investigation;
- complex claim review;
- enterprise support resolution;
- software engineering assistant with planner/reviewer/tester roles.
Supervisor State
The supervisor should own the canonical run state.
```python
from pydantic import BaseModel


class AgentFinding(BaseModel):
    agent_name: str
    finding_type: str
    summary: str
    evidence_refs: list[str]
    confidence: float
    blockers: list[str] = []


class SupervisorState(BaseModel):
    run_id: str
    objective: str
    findings: list[AgentFinding] = []
    open_questions: list[str] = []
    decisions: list[str] = []
    escalation_required: bool = False
```
Specialists produce findings. The supervisor decides how to integrate them.
Delegation Contract
A specialist should not be called with vague instructions such as:
“Analyze this case.”
A better delegation contract:
```python
from pydantic import BaseModel


class DelegatedTask(BaseModel):
    task_id: str
    agent_name: str
    objective: str
    allowed_tools: list[str]
    input_refs: list[str]
    output_schema: str
    deadline_ms: int
    stop_conditions: list[str]
```
Supervisor Invariants
- The supervisor owns the final decision
- Specialists own findings, not final authority
- Every delegation is recorded
- Every specialist output is validated
- Disagreements become explicit state
- The supervisor has a stop policy
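The invariants above can be sketched as a small supervisor loop. This is an illustration under assumptions, not a framework API: `supervise`, `Finding`, and `SupervisorRun` are hypothetical names, specialists are plain callables standing in for agents, and dataclasses replace the pydantic models to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    agent_name: str
    summary: str
    confidence: float


@dataclass
class SupervisorRun:
    objective: str
    findings: list[Finding] = field(default_factory=list)
    delegations: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)
    decision: str = "undecided"


Specialist = Callable[[str], Finding]


def supervise(
    objective: str,
    specialists: dict[str, Specialist],
    max_delegations: int = 4,
    min_confidence: float = 0.6,
) -> SupervisorRun:
    run = SupervisorRun(objective=objective)
    for name, specialist in specialists.items():
        if len(run.delegations) >= max_delegations:
            break  # Stop policy: a bounded number of delegations.
        run.delegations.append(name)  # Every delegation is recorded.
        finding = specialist(objective)
        # Validate specialist output before it enters canonical state.
        if not 0.0 <= finding.confidence <= 1.0 or not finding.summary:
            run.rejected.append(name)
            continue
        run.findings.append(finding)
    # Only the supervisor sets the decision; specialists just report.
    strong = [f for f in run.findings if f.confidence >= min_confidence]
    run.decision = "proceed" if strong and not run.rejected else "escalate"
    return run
```

A rejected specialist output forces escalation here. That choice is deliberately conservative; a real system might retry the specialist first.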
Supervisor Failure Modes
| Failure | Description | Mitigation |
|---|---|---|
| Bottleneck | Supervisor does too much | Parallel specialist calls + summarization |
| Rubber stamp | Supervisor accepts all outputs | Independent validation |
| Delegation loop | Supervisor keeps asking agents | Max turns + stop conditions |
| Context overload | Supervisor state grows too large | Evidence refs + summaries + memory policy |
| Authority confusion | Specialist makes final decision | Explicit decision rights |
Practical Rule
If the system has regulatory, financial, legal, or irreversible consequences, prefer supervisor with deterministic gates over free-form peer collaboration.
8. Handoff Topology
A handoff transfers control from one agent to another.
When to Use
Use handoff when:
- agents represent distinct operational domains;
- each domain can own a segment of the interaction;
- the user experience should remain conversational;
- control naturally transfers between specialists.
Examples:
- customer support agents;
- internal helpdesk;
- healthcare intake to specialist workflow;
- legal intake to document review;
- software assistant from triage to code-generation agent.
Handoff Payload
A handoff must carry structured context.
```python
from pydantic import BaseModel


class HandoffPayload(BaseModel):
    from_agent: str
    to_agent: str
    reason: str
    user_intent: str
    relevant_history_refs: list[str]
    current_state_summary: str
    unresolved_questions: list[str]
    allowed_next_actions: list[str]
```
Without this, the receiving agent reconstructs context from raw chat history, which is unreliable.
Handoff Invariants
- Handoff reason is explicit
- Receiving agent knows current objective
- Relevant context is summarized and referenced
- Authority transfers are clear
- Handoff is logged
- There is a fallback if receiving agent refuses or cannot proceed
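A handoff that honors these invariants might look like the sketch below. It assumes a trimmed-down payload and hypothetical names (`perform_handoff`, `HandoffLog`, and a `human_triage` fallback owner); receivers are plain callables that return whether they accept ownership.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Handoff:
    from_agent: str
    to_agent: str
    reason: str
    state_summary: str


@dataclass
class HandoffLog:
    entries: list[dict[str, str]] = field(default_factory=list)


# A receiver returns True if it accepts ownership of the conversation.
Receiver = Callable[[Handoff], bool]


def perform_handoff(
    handoff: Handoff,
    receivers: dict[str, Receiver],
    log: HandoffLog,
    fallback_owner: str = "human_triage",
) -> str:
    """Return the name of the owner after the handoff attempt."""
    if not handoff.reason.strip():
        raise ValueError("handoff reason must be explicit")
    receiver = receivers.get(handoff.to_agent)
    accepted = receiver is not None and receiver(handoff)
    owner = handoff.to_agent if accepted else fallback_owner
    # The attempt is logged whether or not the receiver accepted.
    log.entries.append({
        "from": handoff.from_agent,
        "to": handoff.to_agent,
        "reason": handoff.reason,
        "outcome": "accepted" if accepted else "fallback",
        "owner": owner,
    })
    return owner
```

Because the function always returns an owner, a refused or unknown target cannot leave the conversation unowned.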
Handoff Anti-Pattern
Do not use handoff to hide routing uncertainty.
Bad:
I do not know who owns this, so I will transfer it randomly.
Better:
Routing confidence is low. Ask for clarification or escalate to human triage.
9. Graph Topology
A graph topology represents execution as nodes and transitions over shared state.
When to Use
Use a graph when:
- execution paths branch based on state;
- loops are possible but must be controlled;
- human-in-the-loop points are needed;
- state must be persisted and resumed;
- nodes can be tested independently;
- different teams own different subgraphs.
Graph State
A graph topology should use a shared, typed state object.
```python
from pydantic import BaseModel


class CaseGraphState(BaseModel):
    case_id: str
    phase: str
    risk_level: str | None = None
    evidence_refs: list[str] = []
    draft_decision: str | None = None
    validation_errors: list[str] = []
    human_approval_required: bool = False
    completed_nodes: list[str] = []
```
Graph Invariants
- Every node declares input assumptions
- Every node declares state mutations
- Every edge has a transition condition
- Cycles have termination criteria
- State snapshots are persisted
- Resume behavior is deterministic
- Human interrupts are represented as state, not exceptions
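To make these invariants concrete, here is a toy graph runner with conditional edges, a bounded repair loop, and a human interrupt represented as state. It is a sketch, not LangGraph or any other library's API: `run_graph`, `GraphState`, and the node names are hypothetical, and checkpoint persistence is left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GraphState:
    draft: str = ""
    validation_errors: list[str] = field(default_factory=list)
    repair_attempts: int = 0
    human_approval_required: bool = False
    completed_nodes: list[str] = field(default_factory=list)


Node = Callable[[GraphState], None]
# An edge is (target node, condition over state); the first match wins.
Edge = tuple[str, Callable[[GraphState], bool]]


def run_graph(
    state: GraphState,
    nodes: dict[str, Node],
    edges: dict[str, list[Edge]],
    start: str,
    max_steps: int = 10,
) -> GraphState:
    current = start
    for _ in range(max_steps):
        nodes[current](state)
        state.completed_nodes.append(current)
        if state.human_approval_required:
            return state  # The interrupt is plain state; a caller can resume later.
        for target, condition in edges.get(current, []):
            if condition(state):
                current = target
                break
        else:
            return state  # No outgoing edge matched: the run is finished.
    raise RuntimeError("step budget exhausted; a cycle lacks a termination condition")


def write_draft(state: GraphState) -> None:
    state.draft = "short"


def validate(state: GraphState) -> None:
    state.validation_errors = [] if len(state.draft) >= 10 else ["draft too short"]


def repair(state: GraphState) -> None:
    state.draft += " plus more detail"
    state.repair_attempts += 1


def request_approval(state: GraphState) -> None:
    state.human_approval_required = True


NODES = {"write_draft": write_draft, "validate": validate,
         "repair": repair, "request_approval": request_approval}
EDGES = {
    "write_draft": [("validate", lambda s: True)],
    "validate": [
        # The repair loop terminates after two attempts.
        ("repair", lambda s: bool(s.validation_errors) and s.repair_attempts < 2),
        ("request_approval", lambda s: True),
    ],
    "repair": [("validate", lambda s: True)],
}
```

The loop has two independent brakes: the edge condition on `repair_attempts` and the runner's `max_steps`. The second exists for cycles whose authors forgot the first.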
Graph vs Pipeline
A pipeline is a graph with mostly linear edges.
A graph is better when the process contains:
- branches;
- loops;
- conditional escalation;
- repair paths;
- dynamic specialist invocation;
- checkpoint/resume requirements.
Graph Failure Modes
| Failure | Cause | Mitigation |
|---|---|---|
| State explosion | Too many mutable fields | Typed substate + ownership |
| Edge ambiguity | Multiple valid transitions | Priority rules |
| Infinite loop | Repair path lacks stop policy | Max iteration + loop reason |
| Hidden mutation | Node mutates unrelated state | Mutation contract |
| Resume bug | State cannot be hydrated cleanly | Snapshot schema versioning |
10. Blackboard Topology
In a blackboard topology, agents collaborate through a shared workspace.
Each agent contributes evidence, hypotheses, findings, and objections.
When to Use
Use blackboard topology when:
- multiple specialists need to contribute asynchronously;
- no single agent has full context;
- evidence evolves over time;
- findings may conflict;
- the final decision requires adjudication.
Examples:
- fraud investigation;
- regulatory enforcement lifecycle;
- security incident response;
- complex litigation support;
- large-scale research synthesis.
Blackboard Data Model
The blackboard is not a random shared dictionary. It needs typed artifacts.
```python
from pydantic import BaseModel


class BlackboardArtifact(BaseModel):
    artifact_id: str
    artifact_type: str
    produced_by: str
    summary: str
    evidence_refs: list[str]
    confidence: float
    created_at: str
    supersedes: list[str] = []
    disputed_by: list[str] = []


class BlackboardState(BaseModel):
    case_id: str
    artifacts: list[BlackboardArtifact] = []
    open_hypotheses: list[str] = []
    disputes: list[str] = []
```
Blackboard Invariants
- Agents append artifacts; they do not overwrite silently
- Contradictions are first-class artifacts
- Every artifact has provenance
- Evidence references are mandatory
- Adjudication is separate from contribution
- Obsolete artifacts are superseded, not deleted
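An append-only blackboard can enforce most of these rules at the write path. The sketch below is a simplified, in-memory illustration: `Blackboard` and `Artifact` are hypothetical names, the artifact carries fewer fields than the model above, and a real store would persist the log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    produced_by: str
    summary: str
    evidence_refs: tuple[str, ...]
    supersedes: tuple[str, ...] = ()


@dataclass
class Blackboard:
    log: list[Artifact] = field(default_factory=list)
    # Each dispute is (artifact_id, disputed_by, reason).
    disputes: list[tuple[str, str, str]] = field(default_factory=list)

    def _known_ids(self) -> set[str]:
        return {a.artifact_id for a in self.log}

    def append(self, artifact: Artifact) -> None:
        if not artifact.evidence_refs:
            raise ValueError("evidence references are mandatory")
        if artifact.artifact_id in self._known_ids():
            raise ValueError("artifact ids are write-once; supersede instead")
        if not set(artifact.supersedes) <= self._known_ids():
            raise ValueError("cannot supersede an unknown artifact")
        self.log.append(artifact)

    def dispute(self, artifact_id: str, disputed_by: str, reason: str) -> None:
        # A contradiction is recorded as its own entry, not as an edit.
        if artifact_id not in self._known_ids():
            raise KeyError(artifact_id)
        self.disputes.append((artifact_id, disputed_by, reason))

    def active(self) -> list[Artifact]:
        # Superseded artifacts stay in the log but leave the active view.
        superseded = {sid for a in self.log for sid in a.supersedes}
        return [a for a in self.log if a.artifact_id not in superseded]
```

Adjudication is intentionally absent: the blackboard only records contributions and disputes, and a separate stage reads `active()` and `disputes` to decide.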
Blackboard Failure Modes
| Failure | Description | Mitigation |
|---|---|---|
| Conflicting writes | Agents overwrite shared state | Append-only artifact log |
| Evidence pollution | Low-quality artifacts accumulate | Evidence quality scoring |
| No convergence | Agents keep adding findings | Adjudicator + stop criteria |
| Unclear authority | Contributors decide final outcome | Separate adjudication stage |
| Context bloat | Blackboard grows without bound | Summaries + archival policy |
11. Swarm / Peer Network Topology
A swarm topology allows agents to coordinate more freely.
This can be powerful but dangerous.
When to Use
Use swarm-like patterns only when:
- exploration matters more than determinism;
- cost is bounded;
- side effects are prohibited;
- agents operate in a sandbox;
- final output is reviewed by a deterministic or human gate.
Examples:
- brainstorming;
- research exploration;
- design alternatives;
- test-case generation;
- red-team simulation.
When Not to Use
Avoid swarm topology for:
- payment execution;
- compliance decisions;
- identity verification;
- regulatory notices;
- production data mutation;
- customer-facing irreversible actions.
Swarm Invariants
- No irreversible side effects
- Strict budget limit
- Max turn limit
- Final adjudication outside the swarm
- All messages logged
- No secret sharing between agents unless permitted
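The budget and turn limits are easiest to enforce in the loop that schedules the peers. Here is a minimal sketch, assuming round-robin scheduling and character count as a crude stand-in for token cost; `run_swarm`, `SwarmResult`, and the peer callables are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# A peer reads the transcript so far and returns its next message.
Peer = Callable[[list[str]], str]


@dataclass
class SwarmResult:
    transcript: list[str] = field(default_factory=list)
    stop_reason: str = ""
    cost: int = 0


def run_swarm(
    peers: dict[str, Peer],
    seed: str,
    max_turns: int = 6,
    cost_budget: int = 200,
) -> SwarmResult:
    if not peers:
        raise ValueError("at least one peer is required")
    result = SwarmResult(transcript=[f"seed: {seed}"])
    names = list(peers)
    for turn in range(max_turns):
        name = names[turn % len(names)]  # Simple round-robin scheduling.
        message = peers[name](result.transcript)
        cost = len(message)  # Character count stands in for token cost.
        if result.cost + cost > cost_budget:
            result.stop_reason = "budget_exhausted"
            return result
        result.cost += cost
        result.transcript.append(f"{name}: {message}")  # Every message is logged.
    result.stop_reason = "max_turns"
    return result
```

The function returns a transcript and a stop reason, not a decision. Whatever adjudicates the output sits outside this loop.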
Swarm Failure Modes
| Failure | Description | Mitigation |
|---|---|---|
| Infinite debate | Agents keep responding | Turn budget |
| Consensus illusion | Agents agree without evidence | Independent evidence requirement |
| Cost explosion | Peer loops multiply calls | Token/tool budgets |
| Authority collapse | No one owns decision | External adjudicator |
| Prompt contamination | Agents amplify bad context | Context filters |
The enterprise rule is simple:
Swarms are good for exploration, not authority.
12. Hierarchical Topology
A hierarchy decomposes work across layers.
When to Use
Use hierarchy when:
- the domain is large;
- teams own different subsystems;
- the workflow spans multiple bounded contexts;
- permissions differ by layer;
- audit and escalation are mandatory.
Examples:
- enterprise case management;
- banking operations;
- telecom BSS/OSS workflows;
- public sector regulatory enforcement;
- cross-domain compliance review.
Hierarchical Invariants
- Each layer has explicit authority
- Lower layers cannot bypass upper policy
- Escalation path is deterministic
- State ownership is partitioned
- Summaries move upward; detailed evidence remains referenced
- Cross-layer communication is typed
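Explicit authority and deterministic escalation can both be derived from one table. The sketch below assumes a three-layer hierarchy with invented layer and tool names; `AUTHORITY`, `ToolGateway`, and `escalate_to` are hypothetical and only illustrate placing the check at the tool boundary.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical authority matrix: which layer may invoke which tool.
AUTHORITY: dict[str, frozenset[str]] = {
    "worker": frozenset({"read_evidence"}),
    "team_lead": frozenset({"read_evidence", "draft_notice"}),
    "director": frozenset({"read_evidence", "draft_notice", "send_notice"}),
}
# Ordered from lowest to highest authority.
ESCALATION_PATH = ["worker", "team_lead", "director"]


@dataclass
class ToolGateway:
    audit: list[dict[str, str]] = field(default_factory=list)

    def call(self, layer: str, tool: str) -> str:
        allowed = tool in AUTHORITY.get(layer, frozenset())
        # Denied attempts are audited too, not just successful calls.
        self.audit.append(
            {"layer": layer, "tool": tool, "outcome": "allowed" if allowed else "denied"}
        )
        if not allowed:
            raise PermissionError(f"{layer} may not call {tool}")
        return f"{tool} executed by {layer}"


def escalate_to(tool: str) -> Optional[str]:
    # Deterministic escalation: the lowest layer that holds the authority.
    for layer in ESCALATION_PATH:
        if tool in AUTHORITY[layer]:
            return layer
    return None
```

Because the gateway, not the agent prompt, performs the check, a lower layer cannot bypass policy by deciding it is entitled to act.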
Hierarchy Failure Modes
| Failure | Description | Mitigation |
|---|---|---|
| Slow escalation | Too many layers | Risk-based fast path |
| Distorted summaries | Information lost upward | Evidence refs + audit log |
| Over-centralization | Top agent bottlenecks | Delegated authority matrix |
| Policy bypass | Lower agent takes action | Policy guard at tool boundary |
| Inconsistent state | Layers hold different truths | Canonical state store |
13. Hybrid Topology
Most real enterprise systems are hybrid.
Example: enforcement case management.
This hybrid combines:
- router for intake;
- pipeline for low-risk cases;
- supervisor for complex cases;
- blackboard for shared evidence;
- adjudicator for decision;
- human review for high-risk actions.
The skill is not avoiding hybrid systems. The skill is making hybrid systems explicit.
14. Decision Matrix
Use this matrix when selecting topology.
| Requirement | Prefer |
|---|---|
| Fast classification | Router |
| Stable business process | Pipeline |
| Branching stateful workflow | Graph |
| Specialist delegation with central authority | Supervisor |
| Conversational domain transfer | Handoff |
| Shared evolving evidence | Blackboard |
| Exploration / ideation | Swarm |
| Large regulated organization | Hierarchical |
| Mixed risk levels | Hybrid |
Risk-Based Shortcut
| Risk Level | Suggested Topology |
|---|---|
| Low risk, reversible | Router or pipeline |
| Medium risk, reviewable | Graph or supervisor |
| High risk, irreversible | Supervisor + deterministic gates + human review |
| Exploratory only | Swarm with no side effects |
| Regulated decision | Graph + blackboard + adjudicator + audit |
15. State Ownership by Topology
State ownership determines correctness.
| Topology | Canonical State Owner |
|---|---|
| Router | Request/session service |
| Pipeline | Workflow engine |
| Supervisor | Supervisor state |
| Handoff | Current owning agent/session |
| Graph | Graph runtime/checkpointer |
| Blackboard | Shared artifact store |
| Swarm | External coordinator |
| Hierarchical | Layered state stores |
| Hybrid | Explicit per-subsystem ownership |
A dangerous system is one where every agent believes it owns the truth.
16. Tool Ownership by Topology
Tool ownership should follow authority.
| Topology | Tool Rule |
|---|---|
| Router | Router usually has no side-effect tools |
| Pipeline | Tools are bound to stages |
| Supervisor | Supervisor grants tools to specialists |
| Handoff | Receiving agent gets domain tools |
| Graph | Node declares tools |
| Blackboard | Contributors may read/write artifacts only |
| Swarm | Read-only or sandbox tools |
| Hierarchical | Tools scoped by authority layer |
A top-level design principle:
Tools should be granted to execution units, not personalities.
Do not say: “The policy agent is trusted.”
Say: “The policy-review node can read policy documents, propose interpretations, and cannot mutate case state.”
17. Retry and Compensation by Topology
| Topology | Retry Strategy |
|---|---|
| Router | Retry classification only if input/context changes |
| Pipeline | Retry failed stage idempotently |
| Supervisor | Retry specialist task with same task ID |
| Handoff | Retry handoff only if receiving agent did not accept |
| Graph | Retry node with checkpointed state |
| Blackboard | Append correction artifact |
| Swarm | Usually do not retry; rerun exploration |
| Hierarchical | Retry at lowest safe layer |
Compensation is required when a step has side effects.
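Example: if a notice has already been sent to a regulated entity and a later step fails, retrying blindly is not enough; the system needs a compensating action such as a correction or retraction.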
Agentic systems must not pretend all operations are pure.
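A saga-style runner is one common way to pair each side effect with its compensation. The sketch below is illustrative only: `Step`, `run_with_compensation`, and the step names are hypothetical, and it does not handle a compensation that itself fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    # Undoes or corrects the side effect; None means nothing to undo.
    compensate: Optional[Callable[[], None]] = None


def run_with_compensation(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    trail: list[str] = []
    done: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            trail.append(f"failed:{step.name}")
            for finished in reversed(done):
                if finished.compensate is not None:
                    finished.compensate()
                    trail.append(f"compensated:{finished.name}")
            return trail
        done.append(step)
        trail.append(f"done:{step.name}")
    return trail
```

The returned trail doubles as an audit record: it shows which side effects happened and which were reversed.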
18. Observability by Topology
Every topology needs runtime forensics.
| Topology | Must Observe |
|---|---|
| Router | route, confidence, rationale, fallback rate |
| Pipeline | stage latency, validation failure, retry count |
| Supervisor | delegation graph, specialist outputs, disagreements |
| Handoff | from/to agent, handoff reason, context payload |
| Graph | node transitions, state snapshots, loop count |
| Blackboard | artifact provenance, disputes, supersession |
| Swarm | turn count, cost, convergence, final adjudication |
| Hierarchical | escalation path, layer decision, authority boundary |
Do not log only prompts and responses. Log decisions.
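Logging decisions means emitting a structured record per decision, from which metrics such as fallback rate can be computed. A minimal sketch, assuming JSON lines as the log format; `DecisionEvent`, `to_log_line`, and the field names are hypothetical rather than an OpenTelemetry schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DecisionEvent:
    run_id: str
    topology: str
    actor: str
    decision: str
    rationale: str
    policy_version: str
    attributes: dict[str, object] = field(default_factory=dict)


def to_log_line(event: DecisionEvent) -> str:
    # One JSON object per line, with sorted keys so lines diff cleanly.
    return json.dumps(asdict(event), sort_keys=True)


def fallback_rate(events: list[DecisionEvent]) -> float:
    # A metric derived from decisions rather than from prompt text.
    routed = [e for e in events if e.topology == "router"]
    if not routed:
        return 0.0
    return sum(e.decision == "fallback" for e in routed) / len(routed)
```

The same record shape works for every topology; only `decision` and `attributes` change, for example a route and confidence for a router or a transition and loop count for a graph.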
19. Enterprise Design Heuristics
Heuristic 1 — Prefer the Least Autonomous Topology That Works
If a pipeline solves the problem, do not use a swarm.
If a router solves the problem, do not use a supervisor.
If a deterministic function solves the problem, do not call an LLM.
Heuristic 2 — Separate Exploration from Authority
Use autonomous agents to explore.
Use deterministic gates, typed validators, policy engines, and humans to authorize.
Heuristic 3 — Make Every Boundary Typed
Agent boundaries are system boundaries.
Each boundary should have:
- input schema;
- output schema;
- tool permission;
- timeout;
- retry policy;
- validation rule;
- audit event.
Heuristic 4 — Design for Replay
A production incident will eventually require answering:
- What did the agent know?
- What did it do?
- Which tools did it call?
- Which policy version applied?
- Why did it choose that path?
- Who approved the action?
- Can we reproduce or replay the decision?
If your topology cannot answer these, it is not enterprise-grade.
20. Reference Python Interfaces
Below is a minimal topology-agnostic orchestration interface.
```python
from abc import ABC, abstractmethod
from typing import Any

from pydantic import BaseModel


class RunContext(BaseModel):
    run_id: str
    tenant_id: str
    user_id: str | None = None
    policy_version: str
    correlation_id: str


class StepResult(BaseModel):
    status: str
    output: dict[str, Any] = {}
    events: list[dict[str, Any]] = []
    next_steps: list[str] = []


class OrchestrationNode(ABC):
    name: str

    @abstractmethod
    async def run(self, context: RunContext, state: BaseModel) -> StepResult:
        pass


class Transition(BaseModel):
    from_node: str
    to_node: str
    condition: str


class TopologySpec(BaseModel):
    name: str
    nodes: list[str]
    transitions: list[Transition]
    state_schema: str
```
This abstraction is intentionally simple. You can map it to LangGraph nodes, a workflow engine, custom async tasks, Temporal-like orchestration, or another runtime.
The point is to keep architecture concepts separate from vendor APIs.
21. Practice: Design a Regulatory Case Topology
Design an AI-assisted regulatory case workflow.
Requirements:
- cases arrive from multiple channels;
- low-risk cases can be summarized automatically;
- high-risk cases require human approval;
- evidence can be incomplete;
- multiple specialist agents may disagree;
- every decision must be auditable;
- side effects include notifying regulated entities.
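Recommended topology: a hybrid along the lines of section 13, with a router for intake, a pipeline for low-risk cases, a supervisor with a shared blackboard for complex cases, an adjudicator for the decision, and human review before high-risk actions.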
Now define:
- canonical state owner;
- tool permissions by node;
- retry strategy;
- stop conditions;
- audit events;
- human approval gates;
- failure modes;
- evaluation scenarios.
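As a starting point for the exercise, the topology can be written down as data and checked mechanically. The sketch below is one possible answer, not the only one: the node names, edge conditions, and `check_topology` are hypothetical, and the only rule it enforces is that the notification side effect is reachable solely through the human approval gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    from_node: str
    to_node: str
    condition: str


NODES = [
    "intake_router", "low_risk_pipeline", "supervisor", "blackboard_review",
    "adjudicator", "human_approval", "notify_entity", "close_case",
]

EDGES = [
    Edge("intake_router", "low_risk_pipeline", "risk == low"),
    Edge("intake_router", "supervisor", "risk != low"),
    Edge("low_risk_pipeline", "close_case", "summary validated"),
    Edge("supervisor", "blackboard_review", "findings collected"),
    Edge("blackboard_review", "adjudicator", "disputes recorded"),
    Edge("adjudicator", "human_approval", "decision drafted"),
    Edge("human_approval", "notify_entity", "approved"),
    Edge("human_approval", "supervisor", "rejected"),
    Edge("notify_entity", "close_case", "notice sent"),
]


def check_topology(
    nodes: list[str],
    edges: list[Edge],
    side_effect_node: str,
    gate_node: str,
) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: list[str] = []
    known = set(nodes)
    for edge in edges:
        for name in (edge.from_node, edge.to_node):
            if name not in known:
                problems.append(f"unknown node: {name}")
        # Every edge into the side-effect node must come from the gate.
        if edge.to_node == side_effect_node and edge.from_node != gate_node:
            problems.append(
                f"{edge.from_node} reaches {side_effect_node} without {gate_node}"
            )
    return problems
```

A check like this can run in CI, so a later edit that wires a shortcut into `notify_entity` fails before it reaches production.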
22. What Top 1% Engineers Pay Attention To
Top engineers are not impressed by diagrams with many agents.
They ask:
- Is the topology simpler than the problem?
- Is authority explicit?
- Is state ownership explicit?
- Are side effects isolated?
- Are failures recoverable?
- Are decisions replayable?
- Are human gates placed before irreversible action?
- Is autonomy bounded by risk?
- Can each agent be tested independently?
- Can the system be operated by people who did not build it?
The best topology is rarely the most “agentic.” It is the one that gives the business enough intelligence without losing control.
23. Summary
In this part, we covered:
- router topology;
- pipeline topology;
- supervisor topology;
- handoff topology;
- graph topology;
- blackboard topology;
- swarm topology;
- hierarchical topology;
- hybrid enterprise topology;
- state ownership;
- tool ownership;
- retries and compensation;
- observability;
- enterprise heuristics.
The next part focuses on a deeper architectural tension:
How deterministic should an enterprise agentic system be, and where should autonomy be allowed?
That is the heart of production AI system design.
References
- LangGraph documentation: workflows, agents, durable execution, graph state, subgraphs, interrupts.
- OpenAI Agents SDK documentation: handoffs, tools, guardrails, tracing.
- Microsoft Agent Framework documentation: workflows, multi-agent orchestration, state, checkpointing, telemetry.
- Model Context Protocol specification: tools, resources, prompts, authorization.
- OpenTelemetry documentation: traces, metrics, logs.