Conversation State and Context Management
Learn Python AI Application Engineer - Part 009
Conversation state, context management, memory boundaries, summarization, context packing, and auditability for production-grade Python AI applications.
A production AI application rarely fails because the model cannot write fluent text.
It fails because the application gave the model the wrong context, too much context, stale context, unauthorized context, or no clear state boundary.
Conversation state is not "a list of chat messages". That is only one possible source of state.
In a serious AI application, state includes:
- current user intent,
- active task,
- conversation history,
- extracted durable facts,
- temporary working memory,
- retrieval results,
- tool outputs,
- policy decisions,
- authorization scope,
- output contract,
- unresolved decisions,
- human approvals,
- and audit metadata.
This part teaches how to design conversation state as an explicit engineering boundary, not as accidental prompt accumulation.
1. Kaufman Framing
The target skill:
Given a multi-turn AI application, preserve enough state for the model to behave correctly while preventing context bloat, privacy leakage, stale assumptions, and hidden non-determinism.
Decompose it into subskills.
| Subskill | Meaning | Failure If Ignored |
|---|---|---|
| State taxonomy | Classify what kind of state exists | Everything becomes "chat history" |
| Session modeling | Represent conversation turns and active task | Model forgets what workflow it is in |
| Context assembly | Decide what to include in the model call | Wrong or excessive context enters prompt |
| Token budgeting | Pack limited context window intentionally | Important facts get truncated |
| Summarization | Compress history without losing invariants | Summary lies or drops decisions |
| Memory boundary | Separate durable memory from temporary state | Private/stale facts leak across sessions |
| Auditability | Record what context was shown to the model | Failures cannot be reproduced |
| Privacy control | Enforce tenant/user/data boundaries | Unauthorized context exposure |
The first practice goal is simple:
Build a context builder that accepts structured state and returns a deterministic, budgeted, auditable context package.
2. Core Mental Model
Treat the model call as a pure-ish function over an explicit context package.
The important rule:
The model should never receive "whatever happened before". It should receive a deliberate context package built from typed state sources.
A good context package answers these questions:
- Who is the user and what authority do they have?
- What is the current task?
- What has already been decided?
- What facts are relevant now?
- What facts are unsafe, stale, or out of scope?
- What output contract must be followed?
- What should the model not assume?
3. State Is Not Context
State is the source of truth stored by your system.
Context is the selected, transformed, ordered subset shown to the model.
Never make the model context your only state store.
If you do, these problems appear quickly:
- You cannot reproduce failures.
- You cannot inspect what was persisted versus inferred.
- You cannot expire or redact sensitive data reliably.
- You cannot distinguish model-generated summary from user-confirmed fact.
- You cannot recover long-running tasks after process restart.
- You cannot enforce tenant isolation confidently.
4. State Taxonomy
A useful state taxonomy for AI apps:
| State Type | Lifetime | Source | Example | Should Enter Prompt? |
|---|---|---|---|---|
| Request state | One call | API request | current message, locale, request id | Yes |
| Session state | One conversation/thread | event log + snapshot | active workflow, recent decisions | Usually |
| Task state | Until task completion | workflow engine | investigation case id, current step | Yes |
| Tool state | One tool execution | tool executor | API response, status, error | Sometimes |
| Retrieval context | One model call | retriever | top chunks, citations | Yes, budgeted |
| Durable memory | Across sessions | user/system memory store | user preference, known project | Only if relevant and allowed |
| Policy state | Varies | policy engine | role permissions, data classification | Yes, often compact |
| Audit state | Permanent/retained | trace/event store | prompt version, model id, context ids | No, but logged |
A strong engineering habit:
Every state field must have an owner, lifetime, source, trust level, and deletion policy.
5. Conversation Event Log
The event log is the canonical history of what happened.
It should be append-only where possible.
Example events:
- user_message_received
- assistant_response_generated
- tool_call_requested
- tool_call_authorized
- tool_call_rejected
- tool_call_completed
- context_package_built
- summary_generated
- human_approval_requested
- human_approval_granted
- task_state_changed
The event log is not always what you send to the model. It is what you use to reconstruct what happened.
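A minimal sketch of an append-only log, assuming an in-memory list stands in for a real event store. The names `ConversationEvent` and `InMemoryEventLog` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class ConversationEvent:
    """One immutable record of something that happened in a session."""

    sequence: int
    session_id: str
    event_type: str  # e.g. "user_message_received"
    payload: dict[str, Any]
    recorded_at: datetime


class InMemoryEventLog:
    """Hypothetical append-only log; a real one would sit on durable storage."""

    def __init__(self) -> None:
        self._events: list[ConversationEvent] = []

    def append(
        self, session_id: str, event_type: str, payload: dict[str, Any]
    ) -> ConversationEvent:
        # Events are only ever added, never edited or removed.
        event = ConversationEvent(
            sequence=len(self._events) + 1,
            session_id=session_id,
            event_type=event_type,
            payload=dict(payload),  # copy so later caller mutations do not leak in
            recorded_at=datetime.now(timezone.utc),
        )
        self._events.append(event)
        return event

    def replay(self, session_id: str) -> list[ConversationEvent]:
        # Return a new list in append order so callers cannot alter the log.
        return [e for e in self._events if e.session_id == session_id]
```

Replaying one session returns only that session's events, in the order they were recorded, which is what reconstruction and audit need.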
6. Session Snapshot
A snapshot is a compact materialized view derived from the event log.
It exists so your application does not need to replay the full history for every request.
A session snapshot may contain:
- active task name,
- current workflow step,
- unresolved user asks,
- confirmed facts,
- rejected assumptions,
- latest summary,
- relevant entity ids,
- active constraints,
- last model response id,
- last tool results,
- pending approval ids.
Important distinction:
| Event Log | Snapshot |
|---|---|
| Canonical history | Current derived state |
| Append-only | Mutable/materialized |
| Useful for audit/replay | Useful for runtime speed |
| Verbose | Compact |
| Harder to query quickly | Easy to load per request |
Do not put everything into the snapshot. Put only what the runtime needs frequently.
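One way to derive a snapshot is to fold events into a small mutable record. The sketch below assumes events are plain dicts with a `sequence` and a `type` key; the `Snapshot` fields and the helper names are illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Snapshot:
    """Compact derived view; holds only what the runtime reads often."""

    active_task: Optional[str] = None
    current_step: Optional[str] = None
    pending_approvals: list[str] = field(default_factory=list)
    last_sequence: int = 0


def apply_event(snapshot: Snapshot, event: dict[str, Any]) -> None:
    """Update the snapshot in place for one event; unknown types are ignored."""
    kind = event["type"]
    if kind == "task_state_changed":
        snapshot.active_task = event["task"]
        snapshot.current_step = event["step"]
    elif kind == "human_approval_requested":
        snapshot.pending_approvals.append(event["approval_id"])
    elif kind == "human_approval_granted":
        if event["approval_id"] in snapshot.pending_approvals:
            snapshot.pending_approvals.remove(event["approval_id"])
    snapshot.last_sequence = event["sequence"]


def rebuild_snapshot(events: list[dict[str, Any]]) -> Snapshot:
    """Replay events in sequence order to rebuild the snapshot from scratch."""
    snapshot = Snapshot()
    for event in sorted(events, key=lambda e: e["sequence"]):
        apply_event(snapshot, event)
    return snapshot
```

Because the snapshot can always be rebuilt from the log, it is safe to treat it as a cache: if it is lost or suspected wrong, replay and compare.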
7. Minimal Python State Model
Use explicit types.
```python
from __future__ import annotations

from datetime import datetime, timezone
from enum import Enum
from typing import Any, Literal
from uuid import UUID, uuid4

from pydantic import BaseModel, Field


class Actor(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"
    TOOL = "tool"
    SYSTEM = "system"


class TrustLevel(str, Enum):
    USER_CLAIMED = "user_claimed"
    SYSTEM_VERIFIED = "system_verified"
    MODEL_INFERRED = "model_inferred"
    TOOL_OBSERVED = "tool_observed"


class ConversationTurn(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    session_id: UUID
    actor: Actor
    content: str
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict[str, Any] = Field(default_factory=dict)


class ConfirmedFact(BaseModel):
    key: str
    value: str
    source_turn_id: UUID | None = None
    trust_level: TrustLevel
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    expires_at: datetime | None = None


class TaskState(BaseModel):
    task_name: str
    current_step: str
    entity_ids: dict[str, str] = Field(default_factory=dict)
    pending_questions: list[str] = Field(default_factory=list)
    pending_approvals: list[str] = Field(default_factory=list)


class SessionSnapshot(BaseModel):
    session_id: UUID
    user_id: str
    tenant_id: str
    active_task: TaskState | None = None
    confirmed_facts: list[ConfirmedFact] = Field(default_factory=list)
    rejected_assumptions: list[str] = Field(default_factory=list)
    running_summary: str | None = None
    updated_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
```
Notice the explicit trust_level.
A fact extracted by a model is not the same as a fact verified by a backend system.
8. Context Items
Do not concatenate strings too early.
Represent context items as structured objects until the last possible moment.
```python
from enum import IntEnum


class ContextPriority(IntEnum):
    CRITICAL = 100
    HIGH = 80
    MEDIUM = 50
    LOW = 20


class ContextItem(BaseModel):
    id: str
    kind: Literal[
        "system_instruction",
        "task_state",
        "recent_turn",
        "summary",
        "confirmed_fact",
        "retrieved_document",
        "tool_result",
        "policy",
        "output_contract",
    ]
    content: str
    priority: ContextPriority
    estimated_tokens: int
    source: str
    trust_level: TrustLevel | None = None
    metadata: dict[str, Any] = Field(default_factory=dict)
```
This lets you:
- sort by priority,
- filter by authorization,
- log source ids,
- cap token usage,
- inspect what was included,
- and write deterministic tests.
9. Context Package
A context package is the final model input envelope.
```python
class ContextPackage(BaseModel):
    session_id: UUID
    user_id: str
    tenant_id: str
    model_profile: str
    prompt_version: str
    items: list[ContextItem]
    total_estimated_tokens: int
    excluded_items: list[ContextItem] = Field(default_factory=list)

    def render_messages(self) -> list[dict[str, str]]:
        system_parts: list[str] = []
        user_parts: list[str] = []

        for item in self.items:
            block = f"[{item.kind}:{item.id}]\n{item.content}"
            if item.kind in {"system_instruction", "policy", "output_contract"}:
                system_parts.append(block)
            else:
                user_parts.append(block)

        return [
            {"role": "system", "content": "\n\n".join(system_parts)},
            {"role": "user", "content": "\n\n".join(user_parts)},
        ]
```
In real implementations, rendering may target different provider message formats. The principle remains: context is built first, rendered second.
10. Context Budgeting
A model has a finite context window. Even when the window is large, attention, latency, and cost remain finite.
Context budgeting is not just about avoiding token overflow. It is about preserving decision quality.
A practical budget split:
| Context Segment | Example Budget | Why |
|---|---|---|
| System/developer instructions | 10-15% | Stable behavior |
| Current user request | 5-10% | Immediate intent |
| Task state | 10-20% | Workflow continuity |
| Relevant recent turns | 10-20% | Conversational continuity |
| Retrieved evidence | 30-50% | Grounding |
| Tool results | 5-15% | External facts |
| Output contract | 5-10% | Machine-readable result |
This is not a universal formula. It is a starting point.
For RAG-heavy tasks, retrieved evidence may dominate. For workflow tasks, task state and tool results may dominate. For customer support chat, recent turns may matter more.
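To make the split concrete, the sketch below turns percentage shares into per-segment token budgets. The function name `allocate_budget`, the segment keys, and the specific shares in `RAG_SHARES` are assumptions chosen from within the ranges in the table; integer percentages are used so the arithmetic stays exact.

```python
def allocate_budget(total_tokens: int, percent_shares: dict[str, int]) -> dict[str, int]:
    """Split a token budget by integer percentages that must sum to 100."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if sum(percent_shares.values()) != 100:
        raise ValueError("percent_shares must sum to 100")

    # Floor each share, then hand the rounding remainder to the largest segment
    # so the per-segment budgets add up to exactly total_tokens.
    budgets = {name: total_tokens * pct // 100 for name, pct in percent_shares.items()}
    leftover = total_tokens - sum(budgets.values())
    largest = max(percent_shares, key=lambda name: percent_shares[name])
    budgets[largest] += leftover
    return budgets


# Hypothetical profile for a retrieval-heavy task.
RAG_SHARES = {
    "system_instructions": 10,
    "current_request": 5,
    "task_state": 15,
    "recent_turns": 15,
    "retrieved_evidence": 40,
    "tool_results": 10,
    "output_contract": 5,
}
```

Keeping one share profile per task type (RAG, workflow, support chat) makes the budget a reviewable configuration value instead of a side effect of prompt assembly.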
11. Deterministic Context Packing
A simple context packer:
```python
class ContextBudgetExceeded(Exception):
    pass


class ContextPacker:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens

    def pack(self, items: list[ContextItem]) -> tuple[list[ContextItem], list[ContextItem]]:
        # Highest priority first; kind and id break ties so the order is stable.
        ordered = sorted(
            items,
            key=lambda item: (-int(item.priority), item.kind, item.id),
        )

        included: list[ContextItem] = []
        excluded: list[ContextItem] = []
        used = 0

        # Greedy fill: an item that does not fit is skipped, and smaller
        # lower-priority items may still be included after it.
        for item in ordered:
            if used + item.estimated_tokens <= self.max_tokens:
                included.append(item)
                used += item.estimated_tokens
            else:
                excluded.append(item)

        # Fail closed: never call the model without its critical context.
        critical_excluded = [
            item for item in excluded
            if item.priority == ContextPriority.CRITICAL
        ]
        if critical_excluded:
            raise ContextBudgetExceeded(
                f"Critical context did not fit: {[item.id for item in critical_excluded]}"
            )

        return included, excluded
```
This is intentionally boring. Boring is good for infrastructure.
Improve it later with:
- reserved budgets per category,
- recency scoring,
- retrieval relevance score,
- semantic compression,
- citation preservation,
- and adaptive packing by task type.
12. Context Builder
A context builder should be explicit about inputs.
```python
class BuildContextCommand(BaseModel):
    session_id: UUID
    user_id: str
    tenant_id: str
    current_user_message: str
    model_profile: str
    token_budget: int


class ContextBuilder:
    # The store and service types are referenced by name only; their
    # definitions are not shown in this part.
    def __init__(
        self,
        session_store: "SessionStore",
        memory_store: "MemoryStore",
        retrieval_service: "RetrievalService",
        policy_service: "PolicyService",
        token_estimator: "TokenEstimator",
    ) -> None:
        self.session_store = session_store
        self.memory_store = memory_store
        self.retrieval_service = retrieval_service
        self.policy_service = policy_service
        self.token_estimator = token_estimator

    async def build(self, command: BuildContextCommand) -> ContextPackage:
        snapshot = await self.session_store.load_snapshot(command.session_id)

        # Resolve the authorization scope first so every later lookup is filtered by it.
        authorization = await self.policy_service.scope_for_user(
            user_id=command.user_id,
            tenant_id=command.tenant_id,
        )

        recent_turns = await self.session_store.load_recent_turns(
            session_id=command.session_id,
            limit=12,
        )

        memories = await self.memory_store.search_relevant_memories(
            user_id=command.user_id,
            query=command.current_user_message,
            authorization=authorization,
            limit=5,
        )

        retrieved_docs = await self.retrieval_service.retrieve(
            tenant_id=command.tenant_id,
            query=command.current_user_message,
            authorization=authorization,
            limit=8,
        )

        # _to_context_items is not shown here; it is expected to map each
        # source to ContextItem objects.
        candidate_items = self._to_context_items(
            snapshot=snapshot,
            recent_turns=recent_turns,
            memories=memories,
            retrieved_docs=retrieved_docs,
            current_user_message=command.current_user_message,
            authorization=authorization,
        )

        packer = ContextPacker(max_tokens=command.token_budget)
        included, excluded = packer.pack(candidate_items)

        return ContextPackage(
            session_id=command.session_id,
            user_id=command.user_id,
            tenant_id=command.tenant_id,
            model_profile=command.model_profile,
            prompt_version="conversation-v3",
            items=included,
            excluded_items=excluded,
            total_estimated_tokens=sum(item.estimated_tokens for item in included),
        )
```
The builder is the policy enforcement point for context.
It should not be bypassed by ad hoc prompt construction in feature code.
13. Recent Turns Strategy
Naive strategy:
Always include the last N messages.
Better strategy:
- Always include the current user message.
- Include recent turns only up to a budget.
- Include turns related to active unresolved questions.
- Include decisions and corrections even if older.
- Exclude chit-chat if irrelevant.
- Exclude tool trace noise unless needed.
- Include human approvals and denials.
A recent turn is more valuable when it contains:
- a constraint,
- a correction,
- a decision,
- an entity reference,
- a refusal boundary,
- an unresolved task,
- or a preference relevant to the current request.
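The selection rules above can be sketched as a two-pass filter over prior turns. This assumes an upstream step has already tagged each turn; the `Turn` type, the tag names, and `select_turns` are hypothetical. The current user message is assumed to be handled separately, since it is always included.

```python
from dataclasses import dataclass

# Hypothetical tags marking turns worth keeping even when they are old.
HIGH_VALUE_TAGS = frozenset(
    {"constraint", "correction", "decision", "approval", "denial", "open_question"}
)


@dataclass(frozen=True)
class Turn:
    id: str
    text: str
    tokens: int
    tags: frozenset[str] = frozenset()


def select_turns(turns: list[Turn], budget: int) -> list[Turn]:
    """Pick prior turns within a token budget; `turns` is ordered oldest first."""
    newest_first = list(reversed(turns))

    # Pass 1: decisions, corrections, and similar turns, regardless of age.
    high_value = [t for t in newest_first if t.tags & HIGH_VALUE_TAGS]
    # Pass 2: everything else except chit-chat, newest first.
    ordinary = [
        t for t in newest_first
        if not (t.tags & HIGH_VALUE_TAGS) and "chit_chat" not in t.tags
    ]

    chosen: set[str] = set()
    used = 0
    for turn in high_value + ordinary:
        if used + turn.tokens <= budget:
            chosen.add(turn.id)
            used += turn.tokens

    # Preserve chronological order in the final context.
    return [t for t in turns if t.id in chosen]
```

With a tight budget, an old constraint survives while a newer but ordinary turn is dropped, which is the behavior "last N messages" cannot provide.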
14. Summarization Strategy
Summarization is dangerous when treated as compression only.
A summary becomes part of state. If it is wrong, the model may repeatedly act on wrong state.
Use structured summaries.
```python
class SessionSummary(BaseModel):
    user_goal: str | None = None
    active_task: str | None = None
    confirmed_decisions: list[str] = Field(default_factory=list)
    open_questions: list[str] = Field(default_factory=list)
    rejected_assumptions: list[str] = Field(default_factory=list)
    important_entities: dict[str, str] = Field(default_factory=dict)
    risk_notes: list[str] = Field(default_factory=list)
    last_updated_turn_id: UUID
```
Bad summary:
The user wants help with a case.
Good summary:
Active task: draft enforcement escalation analysis for Case C-1042. Confirmed: user wants regulatory defensibility and audit trail. Rejected assumption: do not assume violation intent. Open question: whether inspection evidence has been verified. Important entity: licensee ACME-77.
A good summary preserves state invariants, not prose style.
15. Summary Refresh Policy
Do not summarize every turn blindly.
Trigger summary refresh when:
- conversation exceeds a turn threshold,
- token budget pressure occurs,
- active task changes,
- a major decision is made,
- human approval/rejection occurs,
- model detects unresolved references,
- or the user corrects prior state.
The summary generator itself should have an output schema and eval tests.
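The refresh triggers can be expressed as an explicit, testable check. In this sketch the `RefreshSignals` fields, the function name, and the default thresholds (20 turns, 4000 tokens) are assumptions; they should be tuned per application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefreshSignals:
    """Inputs a hypothetical upstream step would compute for each request."""

    turns_since_summary: int
    estimated_history_tokens: int
    active_task_changed: bool = False
    decision_made: bool = False
    approval_event: bool = False
    unresolved_reference: bool = False
    user_correction: bool = False


def summary_refresh_reasons(
    signals: RefreshSignals,
    *,
    max_turns: int = 20,
    history_token_limit: int = 4000,
) -> list[str]:
    """Return every trigger that fired; an empty list means no refresh."""
    checks = [
        ("turn_threshold", signals.turns_since_summary >= max_turns),
        ("token_pressure", signals.estimated_history_tokens > history_token_limit),
        ("active_task_changed", signals.active_task_changed),
        ("decision_made", signals.decision_made),
        ("approval_event", signals.approval_event),
        ("unresolved_reference", signals.unresolved_reference),
        ("user_correction", signals.user_correction),
    ]
    return [name for name, triggered in checks if triggered]
```

Returning the reasons instead of a bare boolean lets the refresh event record why the summary was regenerated, which helps when a summary later turns out to be wrong.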
16. Memory Boundary
Memory is not the same as history.
History answers:
What happened in this conversation?
Memory answers:
What durable fact should be available in future conversations?
Most facts should not become durable memory.
A durable memory candidate must pass checks:
| Check | Question |
|---|---|
| Relevance | Will this be useful later? |
| Stability | Is it likely to remain true? |
| Consent | Is it appropriate to store? |
| Sensitivity | Does it contain private or regulated data? |
| Source | Was it user-confirmed or model-inferred? |
| Expiry | Should it expire? |
| Scope | User-level, tenant-level, team-level, or case-level? |
Never store model speculation as durable memory without clear labeling and review.
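The checks in the table can be written as a gate that a memory candidate must pass before it is stored. This is a sketch under strict assumptions: sensitive data is rejected outright, a positive time-to-live is always required, and only user-confirmed or system-verified sources are trusted. A real policy might route some failures to human review instead. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_SCOPES = frozenset({"user", "tenant", "team", "case"})
TRUSTED_SOURCES = frozenset({"user_confirmed", "system_verified"})


@dataclass(frozen=True)
class MemoryCandidate:
    text: str
    source: str  # e.g. "user_confirmed" or "model_inferred"
    scope: Optional[str]
    relevant_later: bool
    likely_stable: bool
    consent_given: bool
    contains_sensitive_data: bool
    ttl_days: Optional[int]


def failed_checks(candidate: MemoryCandidate) -> list[str]:
    """Return the names of failed checks; an empty list means it may be stored."""
    failures: list[str] = []
    if not candidate.relevant_later:
        failures.append("relevance")
    if not candidate.likely_stable:
        failures.append("stability")
    if not candidate.consent_given:
        failures.append("consent")
    if candidate.contains_sensitive_data:
        failures.append("sensitivity")
    if candidate.source not in TRUSTED_SOURCES:
        failures.append("source")
    if candidate.ttl_days is None or candidate.ttl_days <= 0:
        failures.append("expiry")
    if candidate.scope not in ALLOWED_SCOPES:
        failures.append("scope")
    return failures
```

A model-inferred guess about the user fails the source check even when everything else looks fine, so speculation cannot become durable memory by default.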
17. Working Memory vs Long-Term Memory
| Type | Lifetime | Example | Storage |
|---|---|---|---|
| Working memory | Current call/workflow | intermediate plan, candidate tool route | in-process/task state |
| Session memory | Current thread | summary, active task, unresolved questions | session store |
| Long-term memory | Across threads | stable user preference, project context | memory store |
| Domain memory | System knowledge | policy documents, cases, procedures | knowledge base/RAG |
Common mistake:
Saving domain knowledge into user memory.
If a rule belongs to the organization, put it in the knowledge base. If a fact belongs to a case, put it in the case store. If a preference belongs to the user, put it in user memory.
18. Context Freshness
State has age.
A model should be told when information may be stale.
Examples:
- "The last retrieved policy version is from 2026-04-15."
- "The case status in session summary may be stale; verify before action."
- "This tool result was observed 2 minutes ago."
- "This user preference was stored 9 months ago."
A practical pattern:
```python
class Freshness(BaseModel):
    observed_at: datetime
    expires_at: datetime | None = None
    stale_after_seconds: int | None = None

    def is_stale(self, now: datetime) -> bool:
        if self.expires_at and now >= self.expires_at:
            return True
        if self.stale_after_seconds is None:
            return False
        age = (now - self.observed_at).total_seconds()
        return age > self.stale_after_seconds
```
For high-stakes workflows, stale context should trigger verification before action.
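Freshness is only useful to the model if it is rendered into the context. The helper below is a hypothetical sketch that turns an observation time into a short note like the examples above. It insists on timezone-aware datetimes, because Python raises `TypeError` when a naive datetime is subtracted from an aware one.

```python
from datetime import datetime, timedelta


def freshness_note(
    label: str,
    observed_at: datetime,
    now: datetime,
    stale_after: timedelta,
) -> str:
    """Describe how old an observation is and flag it when past its limit."""
    if observed_at.tzinfo is None or now.tzinfo is None:
        raise ValueError("observed_at and now must be timezone-aware")

    age = now - observed_at
    minutes = int(age.total_seconds() // 60)
    if age > stale_after:
        return (
            f"{label} was observed {minutes} minutes ago and may be stale; "
            "verify before action."
        )
    return f"{label} was observed {minutes} minutes ago."
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the rendered context deterministic and easy to test.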
19. Cross-Turn Reference Resolution
Users say:
- "do that again"
- "use the second one"
- "same as before"
- "continue"
- "remove it"
These are not directly executable commands. They require reference resolution.
A reference resolver should use:
- recent turns,
- active task state,
- visible UI selection,
- entity ids,
- tool results,
- and disambiguation policy.
Do not let the model guess silently when multiple candidates exist.
```python
class ResolvedReference(BaseModel):
    phrase: str
    entity_type: str
    entity_id: str | None
    confidence: float
    requires_clarification: bool
    reason: str
```
For regulated systems, ambiguous references should often become clarification questions rather than actions.
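A clarification threshold can be sketched as follows, assuming a hypothetical upstream matcher has already scored candidate entities between 0 and 1. The `Candidate` and `Resolution` types, the `resolve` function, and the default thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    entity_type: str
    entity_id: str
    score: float  # 0..1, produced by a hypothetical upstream matcher


@dataclass(frozen=True)
class Resolution:
    phrase: str
    entity_id: Optional[str]
    confidence: float
    requires_clarification: bool
    reason: str


def resolve(
    phrase: str,
    candidates: list[Candidate],
    *,
    min_confidence: float = 0.8,
    min_margin: float = 0.2,
) -> Resolution:
    """Resolve a phrase only when one candidate is both strong and clearly ahead."""
    if not candidates:
        return Resolution(phrase, None, 0.0, True, "no candidate entities")

    # Sort by score, then id, so ties resolve the same way on every run.
    ranked = sorted(candidates, key=lambda c: (-c.score, c.entity_id))
    best = ranked[0]
    margin = best.score - ranked[1].score if len(ranked) > 1 else best.score

    if best.score < min_confidence:
        return Resolution(phrase, None, best.score, True, "best candidate below confidence threshold")
    if margin < min_margin:
        return Resolution(phrase, None, best.score, True, "multiple candidates are similarly likely")
    return Resolution(phrase, best.entity_id, best.score, False, "single clear candidate")
```

When two documents score 0.90 and 0.85, the resolver returns no entity id and asks for clarification instead of picking the slightly higher one.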
20. Context Ordering
Ordering affects model behavior.
A stable ordering pattern:
- Role and non-negotiable policies.
- User authority and data scope.
- Current task state.
- Current user request.
- Relevant confirmed facts.
- Retrieved evidence.
- Tool results.
- Recent conversation turns.
- Output contract.
Why current request before old turns?
Because old turns can create anchoring bias. The current user request should dominate unless a prior decision is explicitly still active.
Why output contract last?
Because the final visible instruction often strongly influences format. In some provider formats, structured output schema is passed separately; when rendered into prompt text, putting format constraints near the end can help reduce drift.
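The ordering pattern above can be made explicit as a rank table, so rendering order never depends on the order in which items were fetched. The kind names in `RENDER_ORDER` are hypothetical labels for the nine positions listed above; unknown kinds raise an error instead of landing in an arbitrary position.

```python
# Hypothetical kind names, one per position in the ordering pattern.
RENDER_ORDER = [
    "role_and_policy",
    "authority_scope",
    "task_state",
    "current_request",
    "confirmed_fact",
    "retrieved_evidence",
    "tool_result",
    "recent_turn",
    "output_contract",
]
RANK = {kind: position for position, kind in enumerate(RENDER_ORDER)}


def order_items(items: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (kind, item_id) pairs by fixed kind rank, then by id for stability."""
    unknown = sorted({kind for kind, _ in items if kind not in RANK})
    if unknown:
        raise ValueError(f"unknown context kinds: {unknown}")
    return sorted(items, key=lambda pair: (RANK[pair[0]], pair[1]))
```

Sorting by id within a kind is a simplification; retrieved evidence would more often be ordered by relevance score, with the id as a tiebreaker.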
21. State Update After Model Output
A model output should not mutate durable state directly.
Use a state update pipeline.
Examples of safe automatic updates:
- update session summary,
- record assistant response,
- store tool result in trace,
- update active step after successful deterministic tool execution.
Examples requiring caution:
- storing new durable memory,
- changing case status,
- assigning violation category,
- sending notification,
- deleting records,
- escalating enforcement stage.
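One possible first stage of such a pipeline is a triage step that sorts proposed updates into "apply", "needs approval", or "reject". The update kind names below mirror the two lists above but are hypothetical identifiers, as are `ProposedUpdate` and `triage`. Unknown kinds are rejected so that new update types must be classified deliberately.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class Disposition(str, Enum):
    APPLY = "apply"
    NEEDS_APPROVAL = "needs_approval"
    REJECT = "reject"


# Low-risk updates that can be applied automatically.
AUTO_APPLY = frozenset({
    "update_session_summary",
    "record_assistant_response",
    "store_tool_trace",
    "advance_step_after_tool_success",
})

# Durable or irreversible changes that need a human or policy decision.
APPROVAL_REQUIRED = frozenset({
    "store_durable_memory",
    "change_case_status",
    "assign_violation_category",
    "send_notification",
    "delete_records",
    "escalate_enforcement_stage",
})


@dataclass(frozen=True)
class ProposedUpdate:
    kind: str
    payload: dict[str, Any] = field(default_factory=dict)


def triage(update: ProposedUpdate) -> Disposition:
    """Classify an update proposed from model output; unknown kinds fail closed."""
    if update.kind in AUTO_APPLY:
        return Disposition.APPLY
    if update.kind in APPROVAL_REQUIRED:
        return Disposition.NEEDS_APPROVAL
    return Disposition.REJECT
```

The model only ever proposes updates; the application decides which ones take effect, and the decision itself can be logged as an event.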
22. State Invariants
Define invariants early.
Examples:
- A session belongs to exactly one tenant.
- A context package may only include documents authorized for the current user.
- Durable memory must not be created from model-inferred facts without explicit policy allowance.
- A tool result may be included only if its source tool execution succeeded and matches the current session/task.
- A stale case status cannot be used for irreversible actions.
- A summary must reference the last turn id it covers.
- A rejected assumption must not reappear as a confirmed fact without new evidence.
- A context package must be logged before model invocation.
These invariants are testable.
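As an illustration, three of these invariants can be checked with plain functions before a context package is used. The sketch assumes simplified inputs (document ids, tenant ids, and facts as strings); `PackagedDocument` and `check_invariants` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackagedDocument:
    doc_id: str
    tenant_id: str


def check_invariants(
    *,
    session_tenant_id: str,
    authorized_doc_ids: set[str],
    documents: list[PackagedDocument],
    confirmed_facts: set[str],
    rejected_assumptions: set[str],
) -> list[str]:
    """Return human-readable violations; an empty list means all checks passed."""
    violations: list[str] = []

    for doc in documents:
        # A session belongs to exactly one tenant, and so must its documents.
        if doc.tenant_id != session_tenant_id:
            violations.append(f"cross-tenant document: {doc.doc_id}")
        # Only documents authorized for the current user may be included.
        if doc.doc_id not in authorized_doc_ids:
            violations.append(f"unauthorized document: {doc.doc_id}")

    # A rejected assumption must not reappear as a confirmed fact.
    for claim in sorted(confirmed_facts & rejected_assumptions):
        violations.append(f"rejected assumption reappears as confirmed fact: {claim}")

    return violations
```

Running such checks both in unit tests and at runtime, just before model invocation, turns the invariants into a guard instead of a convention.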
23. Testing Context Builders
A context builder is infrastructure. Test it like infrastructure.
Test cases:
| Test | Expected Result |
|---|---|
| Critical policy always included | Context contains policy item |
| Unauthorized document filtered | Document absent from package |
| Token pressure excludes low priority items | Low priority items excluded |
| Critical item too large | Builder fails closed |
| Stale memory excluded | Memory absent or labeled stale |
| Summary references last covered turn | Summary metadata valid |
| Cross-tenant memory not included | No leakage |
| Same input creates same package | Deterministic output |
Example unit test:
```python
def test_context_packer_excludes_low_priority_items() -> None:
    items = [
        ContextItem(
            id="policy",
            kind="policy",
            content="Never expose unauthorized records.",
            priority=ContextPriority.CRITICAL,
            estimated_tokens=20,
            source="policy:v1",
        ),
        ContextItem(
            id="smalltalk",
            kind="recent_turn",
            content="The user said hello.",
            priority=ContextPriority.LOW,
            estimated_tokens=100,
            source="turn:1",
        ),
    ]

    packer = ContextPacker(max_tokens=40)
    included, excluded = packer.pack(items)

    assert [item.id for item in included] == ["policy"]
    assert [item.id for item in excluded] == ["smalltalk"]
```
24. Eval for Conversation State
State bugs often appear only after multiple turns.
Create multi-turn eval scenarios.
Example scenario:
```yaml
name: user_corrects_prior_assumption
turns:
  - user: "Case C-1042 is not urgent. Just summarize the evidence."
  - assistant_should: "acknowledge non-urgent summary task"
  - user: "Actually, new evidence makes it urgent. Escalate the risk analysis only, don't change the case status."
  - assistant_should:
      - "treat urgency as updated"
      - "not change official case status"
      - "ask for approval before any irreversible workflow mutation"
```
Evaluate:
- Was the old state superseded?
- Did the new constraint persist?
- Did the assistant avoid unauthorized action?
- Did the context include the correction?
- Did the summary update correctly?
25. Conversation State Failure Modes
| Failure Mode | Symptom | Root Cause | Mitigation |
|---|---|---|---|
| Context bloat | High latency, worse answer | Too many raw turns | Budgeting + summaries |
| Lost decision | Assistant repeats old question | Decision not captured in state | Decision extraction |
| Stale memory | Assistant uses old preference | No freshness policy | Expiry + relevance check |
| Cross-tenant leakage | Wrong customer data appears | Authorization skipped | Context builder as policy gate |
| Summary drift | Summary invents/changes facts | Unvalidated model summary | Structured summary + eval |
| Ambiguous reference | Wrong item modified | Resolver guessed silently | Clarification threshold |
| Hidden prompt mutation | Behavior changes unexpectedly | Unversioned context/prompt | Prompt/context versioning |
| Irreproducible bug | Cannot replay issue | Context not logged | Context package audit log |
26. Observability
Log enough to debug without leaking sensitive content unnecessarily.
Recommended trace fields:
- session_id
- tenant_id
- request_id
- context_package_id
- prompt_version
- model_profile
- included_context_item_ids
- excluded_context_item_ids
- token_budget
- estimated_tokens
- retrieval_query_id
- summary_version
- state_snapshot_version
- policy_scope_hash
For sensitive systems, store redacted content in general traces and full content only in restricted audit storage.
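A trace record along these lines can carry ids and counts without carrying content. The sketch below is an assumption-laden example: `scope_hash` and `build_trace_record` are hypothetical helpers, and the policy scope is assumed to be a JSON-serializable dict. Hashing a canonical JSON form lets two traces be compared for "same scope" without storing the scope itself.

```python
import hashlib
import json
from typing import Any


def scope_hash(policy_scope: dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(policy_scope, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_trace_record(
    *,
    session_id: str,
    tenant_id: str,
    request_id: str,
    prompt_version: str,
    model_profile: str,
    included_ids: list[str],
    excluded_ids: list[str],
    token_budget: int,
    estimated_tokens: int,
    policy_scope: dict[str, Any],
) -> dict[str, Any]:
    """Build a general-purpose trace entry that holds ids and counts, not content."""
    return {
        "session_id": session_id,
        "tenant_id": tenant_id,
        "request_id": request_id,
        "prompt_version": prompt_version,
        "model_profile": model_profile,
        "included_context_item_ids": list(included_ids),
        "excluded_context_item_ids": list(excluded_ids),
        "token_budget": token_budget,
        "estimated_tokens": estimated_tokens,
        "policy_scope_hash": scope_hash(policy_scope),
    }
```

The item ids in the record can then be joined against restricted audit storage when a full replay is needed.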
27. Privacy and Retention
Conversation state can contain sensitive data.
Design for deletion and minimization.
Questions to answer before production:
- What state is retained?
- For how long?
- Who can access it?
- Is it encrypted?
- Can a user request deletion?
- Are model prompts stored?
- Are retrieved documents copied into traces?
- Are tool results retained?
- Is PII redacted before analytics?
- Are summaries considered derived personal data?
A summary may still contain sensitive information. Do not treat it as safe just because it is shorter.
28. Regulated Case-Management Example
Imagine an AI assistant for enforcement case triage.
Bad context:
Here is the whole conversation and all case notes. Help the officer decide.
Good context:
Task: Draft a non-binding preliminary risk assessment.
Authority: User may view case C-1042 but may not change status.
Case status: Open, verified by case-service at 2026-06-28T04:17Z.
Relevant evidence: inspection report IR-77, complaint CP-12.
Rejected assumption: Do not infer intentional misconduct.
Required output: risk categories, evidence references, missing information, no final enforcement recommendation.
Policy: Escalation requires human approval.
The second context is not just shorter. It is safer, more auditable, and more actionable.
29. Implementation Checklist
Before shipping conversation state:
- There is a typed session model.
- Event log and snapshot are separate.
- Context builder is the only path to model context.
- Context package is logged or reproducible.
- Authorization filtering happens before context packing.
- Token budget is explicit.
- Critical context fails closed if it cannot fit.
- Summary has structured schema.
- Durable memory has consent/scope/freshness policy.
- Cross-turn references have resolver logic.
- State updates after model output are validated.
- Multi-turn eval scenarios exist.
- Sensitive state has retention/redaction rules.
30. Practice Exercise
Build a small context management module.
Requirements:
- Define ConversationTurn, SessionSnapshot, ContextItem, and ContextPackage.
- Implement ContextPacker with deterministic priority order.
- Implement a fake ContextBuilder that combines:
  - current user message,
  - session summary,
  - recent turns,
  - confirmed facts,
  - fake retrieved documents,
  - output contract.
- Add tests for:
  - unauthorized item exclusion,
  - token pressure,
  - deterministic ordering,
  - critical item fail-closed,
  - stale memory exclusion.
- Log included_context_item_ids and excluded_context_item_ids.
Stretch goal:
Create a multi-turn eval where the user corrects a prior assumption. Verify that the final context includes the correction and excludes the superseded assumption.
31. Key Takeaways
- Conversation state is an application-level responsibility, not a model feature.
- State and context are different.
- The context package should be deliberate, typed, budgeted, authorized, and auditable.
- Summaries are state transformations and must be validated.
- Memory needs scope, lifetime, trust level, and deletion policy.
- Multi-turn AI quality depends less on "chat history" and more on state invariants.
Next Part
Part 010 moves from state to runtime behavior: async execution, streaming, cancellation, timeout, queues, and backpressure.