Observability, Tracing, and Debugging
Learn Python AI Application Engineer - Part 027
Observability, tracing, and debugging for production AI applications: GenAI telemetry, prompt/model traces, retrieval traces, tool traces, agent traces, metrics, logs, replay, privacy, and incident diagnosis.
1. Why This Part Matters
An AI application can fail in ways ordinary logs do not explain.
A user says:
The assistant gave the wrong answer.
That sentence is not enough.
You need to know:
- what the user asked;
- how the query was normalized;
- which prompt version was used;
- which model was called;
- what parameters were used;
- what the model returned;
- which documents were retrieved;
- which chunks entered context;
- which tool calls were proposed;
- which tool calls were executed;
- which guardrails fired;
- which agent node ran;
- why a workflow stopped;
- how many tokens were used;
- how much it cost;
- what latency each stage added;
- whether sensitive data was redacted;
- whether the output was validated;
- whether citations supported claims.
Without observability, AI debugging becomes guesswork.
The central invariant:
Every AI output should be explainable through a trace of model, retrieval, tool, validation, and workflow events.
2. Target Skill
After this part, you should be able to:
- design trace schemas for LLM calls, RAG, tools, and agents;
- instrument Python AI apps with structured telemetry;
- separate traces, logs, metrics, audit events, and eval artifacts;
- debug bad answers from trace data;
- build replayable AI runs;
- track latency, cost, token usage, and failure modes;
- protect sensitive data in observability pipelines;
- map observability data into incident workflows;
- use trace data to improve eval datasets and reliability patterns.
3. Observability Vocabulary
| Term | Meaning | AI Example |
|---|---|---|
| Trace | A full request/run path across components | user query -> retrieval -> model -> tool -> answer |
| Span | Timed operation inside a trace | model call, vector search, rerank |
| Event | Point-in-time detail inside a span | tool_call_started, validation_failed |
| Metric | Aggregated numerical measurement | p95 model latency, token cost |
| Log | Structured record for debugging | schema validation error |
| Audit event | Accountability record | user accessed source S1 |
| Eval artifact | Offline quality record | groundedness score |
| Replay record | Data needed to reproduce run | prompt, evidence refs, model config |
Do not mix these without intent.
A log is not an audit trail.
A trace is not an eval.
A metric is not enough for debugging.
4. AI Observability Architecture
A single user request may produce:
- request trace;
- model spans;
- retrieval spans;
- tool spans;
- validation events;
- metrics;
- audit events;
- eval sampling events.
Each has different retention and access requirements.
5. Kaufman Deconstruction
Break AI observability into subskills.
Deliberate practice:
- run a RAG query;
- inspect trace;
- inject retrieval failure;
- inspect trace again;
- inject model invalid output;
- inspect trace again;
- add a metric;
- add a replay record;
- convert the failure into an eval.
6. What to Trace
Trace every important AI boundary.
6.1 Request Boundary
- request ID;
- user/tenant context;
- channel;
- feature;
- risk level;
- sanitized query hash;
- raw query if permitted;
- auth decision reference.
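The "sanitized query hash" item can be sketched as a keyed hash over a normalized query. This is one possible approach, not a prescribed one: the function name `sanitized_query_hash`, the normalization steps, and the per-tenant key are all assumptions made for illustration.

```python
import hashlib
import hmac
import unicodedata


def sanitized_query_hash(query: str, tenant_key: bytes) -> str:
    """Return a keyed hash of a normalized query, suitable for storing in a trace."""
    # Normalize so trivially different spellings of one query hash identically.
    normalized = unicodedata.normalize("NFKC", query).strip().lower()
    normalized = " ".join(normalized.split())
    # HMAC instead of a bare hash: short queries are easy to guess by brute force.
    return hmac.new(tenant_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"hypothetical-per-tenant-secret"  # assumption: loaded from a secret store
hash_a = sanitized_query_hash("  What is  our refund policy? ", key)
hash_b = sanitized_query_hash("what is our refund policy?", key)
```

Both calls produce the same digest, so repeated queries can be grouped in traces without storing the raw text.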
6.2 Prompt Boundary
- prompt template ID;
- prompt version;
- rendered prompt hash;
- input variables;
- system/developer/user/evidence sections;
- context token count.
6.3 Model Boundary
- provider;
- model name/version;
- parameters;
- input tokens;
- output tokens;
- latency;
- status;
- error type;
- finish reason;
- structured output validity.
6.4 Retrieval Boundary
- index version;
- embedding model;
- query text/hash;
- filters;
- candidate IDs;
- selected context IDs;
- scores/ranks;
- reranker model;
- latency.
6.5 Tool Boundary
- tool name/version;
- arguments hash/summary;
- trusted context;
- authorization decision;
- idempotency key;
- status;
- latency;
- output summary;
- side effect refs.
6.6 Agent Boundary
- run ID;
- node sequence;
- decisions;
- transitions;
- approvals;
- interrupts;
- retries;
- stop reason;
- max step/budget status.
6.7 Validation Boundary
- schema validation;
- citation validation;
- grounding validation;
- safety checks;
- unsupported claims;
- repair attempts.
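As a rough illustration of citation validation, the sketch below checks only referential integrity: every cited ID must belong to a chunk that actually entered context. It does not check whether a chunk supports the claim it is attached to. The names `CitationCheck` and `check_citations` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CitationCheck:
    valid: bool
    unknown_citations: list[str] = field(default_factory=list)
    uncited_context: list[str] = field(default_factory=list)


def check_citations(cited_ids: list[str], selected_context_ids: list[str]) -> CitationCheck:
    """Compare cited IDs against the chunk IDs that were placed in context."""
    context = set(selected_context_ids)
    cited = set(cited_ids)
    unknown = sorted(cited - context)  # cited but never shown to the model
    uncited = sorted(context - cited)  # shown to the model but never cited
    # Valid only if something was cited and every citation points into context.
    return CitationCheck(
        valid=(not unknown) and bool(cited),
        unknown_citations=unknown,
        uncited_context=uncited,
    )


result = check_citations(["c1", "c9"], ["c1", "c2"])
```

The result can be attached to the validation span as an event, so a citation failure is visible in the trace rather than only in the final answer.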
7. Trace Schema
```python
from typing import Literal

from pydantic import BaseModel


class AiTrace(BaseModel):
    trace_id: str
    request_id: str
    tenant_id: str
    user_id: str | None = None
    feature: str
    channel: str
    status: Literal["success", "failed", "partial", "refused", "escalated"]
    started_at: str
    ended_at: str | None = None
    # Pydantic copies mutable defaults per instance, so `= []` is safe here.
    spans: list["AiSpan"] = []
    total_tokens: int = 0
    total_cost_estimate: float | None = None
    error_summary: str | None = None
```
Span:
```python
class AiSpan(BaseModel):
    span_id: str
    parent_span_id: str | None = None
    trace_id: str
    name: str
    span_type: Literal[
        "api",
        "prompt",
        "model",
        "retrieval",
        "rerank",
        "tool",
        "agent_node",
        "validation",
        "storage",
    ]
    started_at: str
    ended_at: str | None = None
    status: Literal["ok", "error", "timeout", "cancelled"]
    attributes: dict[str, str | int | float | bool | None] = {}
    events: list["AiSpanEvent"] = []
```
Event:
```python
class AiSpanEvent(BaseModel):
    name: str
    timestamp: str
    attributes: dict[str, str | int | float | bool | None] = {}
```
Keep trace data structured.
Plain-text trace logs are hard to query.
8. OpenTelemetry Mental Model
OpenTelemetry provides a vendor-neutral way to generate, collect, and export traces, metrics, and logs.
For AI apps, you can model:
- each request as a trace;
- each model call as a span;
- each retrieval call as a span;
- each tool call as a span;
- each agent node as a span;
- validation failures as span events;
- token/cost/latency as attributes or metrics.
Do not make observability vendor-specific inside business logic.
Use adapters.
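One way to read "use adapters" is a small port that business logic depends on, with the telemetry backend hidden behind it. The sketch below uses an in-memory adapter so it runs without any SDK; `SpanRecorder`, `InMemoryRecorder`, `span`, and `answer` are hypothetical names, and a production adapter would forward the same data to whichever telemetry library the team has chosen.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from time import perf_counter
from typing import Iterator, Protocol


class SpanRecorder(Protocol):
    """Port used by business logic. Hypothetical name, not an OpenTelemetry type."""

    def record(self, name: str, attributes: dict[str, object]) -> None: ...


@dataclass
class InMemoryRecorder:
    """Test adapter. A production adapter would forward to a telemetry SDK instead."""

    spans: list[tuple[str, dict[str, object]]] = field(default_factory=list)

    def record(self, name: str, attributes: dict[str, object]) -> None:
        self.spans.append((name, attributes))


@contextmanager
def span(recorder: SpanRecorder, name: str, **attributes: object) -> Iterator[dict[str, object]]:
    attrs: dict[str, object] = dict(attributes)
    start = perf_counter()
    try:
        yield attrs  # callers may add attributes such as token counts
        attrs["status"] = "ok"
    except Exception as exc:
        attrs["status"] = "error"
        attrs["error_type"] = type(exc).__name__
        raise
    finally:
        attrs["duration_ms"] = round((perf_counter() - start) * 1000, 2)
        recorder.record(name, attrs)


def answer(question: str, recorder: SpanRecorder) -> str:
    """Business logic depends only on the port, never on a vendor SDK."""
    with span(recorder, "model.call", model="model-y") as attrs:
        reply = f"echo: {question}"  # stand-in for a real model call
        attrs["output_chars"] = len(reply)
    return reply


recorder = InMemoryRecorder()
answer("hello", recorder)
```

Swapping backends then means writing another class with a `record` method; `answer` does not change.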
9. Minimal Python Tracing Wrapper
This example uses a simple internal tracer interface.
```python
from contextlib import asynccontextmanager
from time import perf_counter
from typing import AsyncIterator, Protocol


class TraceSink(Protocol):
    """Internal tracer interface; concrete sinks forward to a telemetry backend."""

    async def start_span(self, name: str, span_type: str, attributes: dict[str, object]) -> str:
        ...

    async def end_span(self, span_id: str, status: str, attributes: dict[str, object]) -> None:
        ...

    async def add_event(self, span_id: str, name: str, attributes: dict[str, object]) -> None:
        ...


@asynccontextmanager
async def traced_span(
    *,
    sink: TraceSink,
    name: str,
    span_type: str,
    attributes: dict[str, object],
) -> AsyncIterator[str]:
    span_id = await sink.start_span(name, span_type, attributes)
    start = perf_counter()

    def elapsed_ms() -> float:
        return round((perf_counter() - start) * 1000, 2)

    try:
        yield span_id
    except TimeoutError:
        # asyncio.TimeoutError is an alias of TimeoutError from Python 3.11 onward.
        await sink.end_span(span_id, "timeout", {"duration_ms": elapsed_ms()})
        raise
    except Exception as exc:
        await sink.end_span(
            span_id,
            "error",
            {"duration_ms": elapsed_ms(), "error_type": type(exc).__name__},
        )
        raise
    else:
        # Closed outside the try body so a failing sink is not re-reported as a span error.
        await sink.end_span(span_id, "ok", {"duration_ms": elapsed_ms()})
```
Use this pattern around model, retrieval, and tool calls.
10. Model Call Trace
```python
from pydantic import BaseModel


class ModelCallTrace(BaseModel):
    provider: str
    model: str
    model_version: str | None = None
    prompt_template_id: str
    prompt_version: str
    rendered_prompt_hash: str
    temperature: float | None = None
    max_output_tokens: int | None = None
    input_tokens: int | None = None
    output_tokens: int | None = None
    total_tokens: int | None = None
    latency_ms: float
    finish_reason: str | None = None
    output_schema_id: str | None = None
    output_valid: bool | None = None
    repair_attempts: int = 0
    error_type: str | None = None
```
Do not store raw prompts everywhere by default.
Store hashes and references.
For debugging environments, raw prompt capture may be allowed with strict access and retention.
11. Prompt Trace
A prompt trace should answer:
- which template rendered this prompt?
- which variables were used?
- which evidence entered?
- how many tokens?
- which instructions were active?
- what changed between versions?
```python
from pydantic import BaseModel


class PromptTrace(BaseModel):
    prompt_template_id: str
    prompt_version: str
    variable_names: list[str]
    rendered_hash: str
    sections: list[str]
    evidence_ids: list[str] = []
    system_instruction_hash: str | None = None
    developer_instruction_hash: str | None = None
    estimated_tokens: int
```
Prompt bugs are easier to debug when the prompt is versioned and traceable.
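A prompt trace can be produced at render time rather than reconstructed afterwards. The sketch below assumes templates use the standard library's `string.Template` syntax, which is only one option; `render_with_trace` and the four-characters-per-token estimate are illustrative assumptions.

```python
import hashlib
from string import Template


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def render_with_trace(
    template_id: str,
    version: str,
    template: str,
    variables: dict[str, str],
) -> tuple[str, dict[str, object]]:
    """Render a prompt and return it together with trace fields describing it."""
    rendered = Template(template).substitute(variables)
    trace: dict[str, object] = {
        "prompt_template_id": template_id,
        "prompt_version": version,
        # Names only: variable values may contain sensitive text.
        "variable_names": sorted(variables),
        "rendered_hash": sha256_hex(rendered),
        # Crude estimate (about 4 characters per token); a real tokenizer will differ.
        "estimated_tokens": max(1, len(rendered) // 4),
    }
    return rendered, trace


rendered, trace = render_with_trace(
    "qa",
    "v3",
    "Answer using $evidence: $question",
    {"question": "Q?", "evidence": "E1"},
)
```

Two runs with the same template version and inputs yield the same `rendered_hash`, which makes "did the prompt change?" a simple comparison.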
12. Retrieval Trace
```python
from pydantic import BaseModel


class RetrievalTrace(BaseModel):
    query: str
    normalized_query: str
    query_type: str
    tenant_id: str
    filters: dict[str, object]
    retrieval_mode: str
    index_versions: list[str]
    embedding_model: str | None = None
    reranker_model: str | None = None
    lexical_candidate_ids: list[str] = []
    vector_candidate_ids: list[str] = []
    fused_candidate_ids: list[str] = []
    reranked_candidate_ids: list[str] = []
    selected_context_ids: list[str] = []
    # Per-stage latency, keyed by stage name.
    latency_ms: dict[str, float] = {}
```
This trace supports RAG debugging.
If the correct chunk was not among the candidates, debug retrieval.
If it was among the candidates but not selected, debug reranking or context selection.
If it was in context but the answer was still wrong, debug generation or validation.
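The three-way triage above can be written as a small function over trace data. It assumes you know which chunk should have been used, for example from an eval label; the function name and return values are hypothetical.

```python
def locate_rag_failure(
    expected_chunk_id: str,
    candidate_ids: list[str],
    selected_context_ids: list[str],
) -> str:
    """Return the pipeline stage to investigate first for a wrong RAG answer."""
    if expected_chunk_id not in candidate_ids:
        return "retrieval"
    if expected_chunk_id not in selected_context_ids:
        return "reranking_or_context"
    # The evidence reached the model, so look at generation and validation.
    return "generation_or_validation"


stage = locate_rag_failure("c7", candidate_ids=["c1", "c7"], selected_context_ids=["c1"])
```

Here the chunk was retrieved but dropped before context assembly, so `stage` points at reranking or context selection.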
13. Tool Trace
```python
from typing import Literal

from pydantic import BaseModel


class ToolCallTrace(BaseModel):
    tool_call_id: str
    tool_name: str
    tool_version: str
    side_effect_level: str
    risk_level: str
    input_hash: str
    input_summary: dict[str, object]
    trusted_context_hash: str
    authorization_status: Literal["allowed", "denied", "approval_required"]
    idempotency_key: str | None = None
    approval_id: str | None = None
    status: Literal["success", "failed", "timeout", "denied"]
    error_type: str | None = None
    output_hash: str | None = None
    output_summary: dict[str, object] | None = None
    latency_ms: float
```
Do not log full tool input/output if it contains sensitive data.
Use summaries and hashes.
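A minimal sketch of "summaries and hashes" for tool payloads follows. The helper names are hypothetical. Note that a plain hash of a low-entropy value (a short ID, a yes/no flag) can be guessed by brute force, so treat the hash as a correlation handle, not as protection.

```python
import hashlib
import json


def hash_payload(payload: dict[str, object]) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def summarize_payload(payload: dict[str, object]) -> dict[str, object]:
    """Keep field names and shapes, drop the values."""
    summary: dict[str, object] = {}
    for key, value in payload.items():
        if isinstance(value, (str, bytes, list, tuple, dict)):
            summary[key] = {"type": type(value).__name__, "length": len(value)}
        else:
            summary[key] = {"type": type(value).__name__}
    return summary


tool_input = {"email": "a@b.co", "amount": 5}
input_hash = hash_payload(tool_input)
input_summary = summarize_payload(tool_input)
```

The summary shows that an `email` string of length 6 and an integer `amount` were passed, without recording either value.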
14. Agent Trace
```python
from typing import Literal

from pydantic import BaseModel


class AgentNodeTrace(BaseModel):
    run_id: str
    node_name: str
    step_number: int
    decision_type: str | None = None
    selected_action: str | None = None
    rationale_summary: str | None = None
    state_version_before: int | None = None
    state_version_after: int | None = None
    tool_call_ids: list[str] = []
    approval_id: str | None = None
    interrupt_id: str | None = None
    status: Literal["ok", "waiting", "failed", "completed"]
    stop_reason: str | None = None
```
An agent trace should show the path taken and the state transitions along it.
A final answer is not enough.
15. Metrics
Metrics are aggregated signals.
15.1 Quality Metrics
- answer pass rate;
- unsupported claim rate;
- citation failure rate;
- insufficient evidence rate;
- refusal rate;
- over-refusal rate;
- user negative feedback rate;
- human escalation rate.
15.2 Retrieval Metrics
- recall@k in eval;
- no-result rate in production;
- stale source rate;
- unauthorized retrieval count;
- duplicate top-k rate;
- reranker timeout rate.
15.3 Model Metrics
- model latency p50/p95/p99;
- error rate;
- timeout rate;
- token usage;
- cost per request;
- structured output failure rate;
- repair attempt rate.
15.4 Agent Metrics
- average steps per run;
- max-step failures;
- tool call count;
- tool error rate;
- approval rate;
- approval rejection rate;
- stuck task count;
- resume success rate.
15.5 System Metrics
- API latency;
- queue depth;
- worker utilization;
- rate-limit events;
- cache hit rate;
- database latency;
- vector index latency.
Metrics should be sliced by feature, tenant, model, prompt version, index version, and query type.
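To make slicing concrete, here is a small sketch that computes p95 latency per slice from raw records. It uses the nearest-rank percentile definition, which is an assumption; metrics backends often interpolate and will report slightly different values. The record shape and function names are hypothetical.

```python
from collections import defaultdict
from math import ceil


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_slice(
    records: list[dict[str, object]],
    keys: tuple[str, ...],
) -> dict[tuple[object, ...], float]:
    """Group latency records by the given label keys and compute p95 per group."""
    groups: dict[tuple[object, ...], list[float]] = defaultdict(list)
    for record in records:
        slice_key = tuple(record[k] for k in keys)
        groups[slice_key].append(float(record["latency_ms"]))
    return {slice_key: percentile(vals, 95) for slice_key, vals in groups.items()}


records = [{"feature": "qa", "model": "m1", "latency_ms": i} for i in range(1, 21)]
records.append({"feature": "qa", "model": "m2", "latency_ms": 500})
p95 = p95_by_slice(records, ("feature", "model"))
```

An overall p95 across both models would hide the fact that `m2` is the slow one; the sliced view makes it visible.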
16. Logs
Logs should be structured.
Bad:
Model failed lol
Better:
```json
{
  "event": "model_call_failed",
  "trace_id": "tr-123",
  "request_id": "req-123",
  "provider": "provider-x",
  "model": "model-y",
  "error_type": "timeout",
  "retryable": true
}
```
Log rules:
- include trace/request/run IDs;
- avoid raw sensitive text;
- include error type;
- include retryability;
- include component;
- include version metadata.
Logs support local debugging.
Traces show request flow.
Metrics show system health.
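The log rules above can be applied with the standard `logging` module and a JSON formatter. This is a minimal sketch: `JsonFormatter`, `log_event`, and the logger name are hypothetical, and a `StringIO` stream stands in for stdout or a log shipper.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def log_event(logger: logging.Logger, level: int, event: str, **fields: object) -> None:
    # `extra` attaches the dict to the LogRecord as `record.fields`.
    logger.log(level, event, extra={"fields": fields})


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ai.model")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

log_event(
    logger,
    logging.ERROR,
    "model_call_failed",
    trace_id="tr-123",
    request_id="req-123",
    component="model_client",
    error_type="timeout",
    retryable=True,
)
logged = json.loads(stream.getvalue())
```

Because every field is a key in the JSON object, the log backend can filter on `trace_id` or `error_type` without parsing free text.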
17. Audit vs Observability
Audit events are for accountability.
Observability is for debugging and operations.
Do not assume debug logs satisfy audit requirements.
Example audit event:
```python
from pydantic import BaseModel


class AiAuditEvent(BaseModel):
    audit_event_id: str
    timestamp: str
    tenant_id: str
    user_id: str
    request_id: str
    trace_id: str
    action: str
    selected_source_ids: list[str]
    cited_source_ids: list[str]
    model_versions: list[str]
    index_versions: list[str]
    authorization_decision_id: str | None = None
    answer_status: str
```
Audit events often need stricter retention and access control.
18. Privacy and Redaction
AI observability can easily leak sensitive data.
Sensitive fields:
- PII;
- case details;
- legal privileged content;
- credentials;
- tool outputs;
- raw prompts;
- retrieved passages;
- user messages;
- model outputs;
- hidden instructions.
Redaction strategy:
- classify fields;
- hash where possible;
- store references instead of raw text;
- store summaries where safe;
- restrict raw capture to debug mode;
- expire raw captures quickly;
- audit access to traces;
- redact before export.
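```python
SENSITIVE_KEYS = {
    "password",
    "token",
    "access_token",
    "secret",
    "private_key",
    "ssn",
    "national_id",
}


def redact_value(value: object) -> object:
    """Recurse into nested dicts and lists; leave other values unchanged."""
    if isinstance(value, dict):
        return redact_dict(value)
    if isinstance(value, list):
        return [redact_value(item) for item in value]
    return value


def redact_dict(data: dict[str, object]) -> dict[str, object]:
    # Key-name matching only: secrets embedded in free-text values are not caught.
    result: dict[str, object] = {}
    for key, value in data.items():
        if key.lower() in SENSITIVE_KEYS:
            result[key] = "[REDACTED]"
        else:
            result[key] = redact_value(value)
    return result
```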
Do not send sensitive traces to unapproved vendors.
19. Replayability
A replay record helps reproduce behavior.
For AI, exact replay may be impossible if model provider behavior changes.
But you can still capture enough to diagnose.
Replay record:
```python
from pydantic import BaseModel


class ReplayRecord(BaseModel):
    trace_id: str
    request_id: str
    feature: str
    user_query: str | None = None
    user_query_hash: str
    prompt_template_id: str
    prompt_version: str
    prompt_inputs_ref: str
    model_provider: str
    model_name: str
    model_parameters: dict[str, object]
    retrieval_trace_ref: str | None = None
    selected_evidence_refs: list[str] = []
    tool_trace_refs: list[str] = []
    output_ref: str | None = None
    created_at: str
```
For deterministic replay, use:
- fake model with recorded output;
- recorded tool outputs;
- recorded retrieval results;
- same prompt version;
- same context package.
This lets you reproduce pipeline behavior even if model output changes later.
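The "fake model with recorded output" idea might look like the sketch below: outputs are stored under the hash of the rendered prompt and replayed on request. `RecordedModel` and its methods are hypothetical names, and a real replay harness would also need recorded retrieval and tool results.

```python
import hashlib
from dataclasses import dataclass, field


def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass
class RecordedModel:
    """Fake model that replays outputs captured from earlier runs."""

    recordings: dict[str, str] = field(default_factory=dict)

    def record(self, prompt: str, output: str) -> None:
        self.recordings[prompt_hash(prompt)] = output

    def generate(self, prompt: str) -> str:
        key = prompt_hash(prompt)
        if key not in self.recordings:
            # A miss means the rendered prompt changed, which is itself a finding.
            raise KeyError(f"no recording for prompt hash {key[:12]}")
        return self.recordings[key]


model = RecordedModel()
model.record("Answer using E1: Q?", "The policy allows refunds within 30 days.")
replayed = model.generate("Answer using E1: Q?")
```

Raising on an unknown prompt, instead of silently calling a live model, keeps the replay deterministic and surfaces prompt drift early.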
20. Debugging Bad Answers
Use this method.
Checklist:
- Was the query normalized correctly?
- Was the correct index used?
- Were the filters correct?
- Was the correct evidence among the candidates?
- Was it selected into context?
- Was the context rendered correctly?
- Did the model output unsupported claims?
- Did the validator catch them?
- Were the citations valid?
- Did the UI display caveats?
Do not start by changing the prompt.
21. Debugging Slow Answers
Latency trace should show stage timings.
```text
api:              120ms
query_plan:       80ms
embedding:        280ms
vector_search:    150ms
lexical_search:   90ms
rerank:           1,300ms
context_build:    40ms
model_generation: 4,800ms
validation:       600ms
```
Common fixes:
- parallelize retrieval;
- reduce candidate count;
- skip rerank for exact lookup;
- use smaller model for validation;
- reduce context tokens;
- cache query embeddings;
- stream generation after evidence is ready;
- add timeout and fallback.
Do not optimize without timing spans.
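Given stage timings like the ones listed above, a short helper can rank stages by their share of total time. The sketch assumes the stages ran sequentially; if some ran in parallel (for example vector and lexical search), the summed total overstates wall-clock latency. The function name is hypothetical.

```python
def latency_breakdown(stage_ms: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (stage, milliseconds, percent of total) rows, slowest first."""
    total = sum(stage_ms.values())
    rows = [(name, ms, round(100 * ms / total, 1)) for name, ms in stage_ms.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


stage_ms = {
    "api": 120,
    "query_plan": 80,
    "embedding": 280,
    "vector_search": 150,
    "lexical_search": 90,
    "rerank": 1300,
    "context_build": 40,
    "model_generation": 4800,
    "validation": 600,
}
breakdown = latency_breakdown(stage_ms)
```

For these numbers, model generation accounts for roughly 64% of the 7,460 ms total and reranking for about 17%, so those two stages are where optimization effort pays off first.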
22. Debugging Cost Spikes
Cost trace should include:
- input tokens;
- output tokens;
- model choice;
- number of model calls;
- reranker calls;
- eval/judge calls;
- retry count;
- agent step count;
- context size;
- tool calls with cost.
Common causes:
- agent loops;
- excessive context;
- too many rerank candidates;
- judge called on every request;
- retry storm;
- large prompt templates;
- unnecessary multi-agent handoffs;
- no caching.
Cost needs budget metrics and per-feature breakdown.
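A per-feature cost breakdown can be derived from token counts already present in model spans. In the sketch below the model names and prices are invented for illustration (USD per one million tokens); substitute your provider's actual price sheet.

```python
from dataclasses import dataclass

# Hypothetical prices: (input, output) in USD per 1M tokens.
PRICES_PER_MILLION = {
    "model-small": (0.50, 1.50),
    "model-large": (5.00, 15.00),
}


@dataclass(frozen=True)
class ModelCall:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(call: ModelCall) -> float:
    input_price, output_price = PRICES_PER_MILLION[call.model]
    return (call.input_tokens * input_price + call.output_tokens * output_price) / 1_000_000


def cost_by_feature(calls: list[ModelCall]) -> dict[str, float]:
    totals: dict[str, float] = {}
    for call in calls:
        totals[call.feature] = totals.get(call.feature, 0.0) + call_cost(call)
    return totals


calls = [
    ModelCall("qa", "model-large", input_tokens=2000, output_tokens=500),
    ModelCall("qa", "model-small", input_tokens=1000, output_tokens=1000),
    ModelCall("judge", "model-small", input_tokens=4000, output_tokens=0),
]
costs = cost_by_feature(calls)
```

Retries and judge calls should be recorded as their own `ModelCall` entries, otherwise a retry storm or per-request judging stays invisible in the breakdown.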
23. Debugging Tool Failures
Tool trace should answer:
- did the model select the correct tool?
- were the arguments valid?
- was the call authorized?
- did a required approval exist?
- did the tool time out?
- was a retry attempted?
- was an idempotency key used?
- did the output validate?
- did the state update happen?
Tool failure categories:
- wrong tool;
- bad arguments;
- authorization denied;
- approval missing;
- timeout;
- rate limit;
- external dependency failure;
- output schema mismatch;
- side effect conflict.
Each category has a different fix.
24. Debugging Agent Loops
Loop symptoms:
- the same node repeating;
- the same tool call repeating;
- step count increasing;
- no state progress;
- the same observation repeating;
- max steps exceeded.
Trace checks:
- completed nodes;
- state diffs;
- tool outputs;
- planner decisions;
- stop conditions;
- duplicate handoffs;
- repeated errors.
Fixes:
- max calls per tool;
- state progress guard;
- duplicate action detector;
- better observation summary;
- deterministic router;
- explicit stop condition;
- human interrupt.
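A duplicate action detector, combined with a state progress check, could be sketched as follows. The guard fingerprints each proposed tool call together with the current state version, so repeating a call after the state has changed still counts as progress. The class name and the `max_repeats` default are assumptions.

```python
import hashlib
import json
from collections import Counter


class DuplicateActionGuard:
    """Reject a tool call once it has been repeated too often on unchanged state."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter[str] = Counter()

    def allow(self, tool_name: str, arguments: dict[str, object], state_version: int) -> bool:
        # Same tool + same arguments + same state version means no progress was made.
        canonical = json.dumps([tool_name, arguments, state_version], sort_keys=True, default=str)
        fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        self._seen[fingerprint] += 1
        return self._seen[fingerprint] <= self.max_repeats


guard = DuplicateActionGuard(max_repeats=2)
decisions = [guard.allow("search", {"q": "refund policy"}, state_version=1) for _ in range(3)]
after_progress = guard.allow("search", {"q": "refund policy"}, state_version=2)
```

When `allow` returns `False`, the workflow can stop with an explicit stop reason or raise a human interrupt instead of consuming more steps.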
25. Dashboards
A useful AI app dashboard includes:
25.1 Product Quality
- answer pass rate;
- feedback rate;
- insufficient evidence rate;
- escalation rate;
- top failure types.
25.2 RAG
- no-result rate;
- selected context tokens;
- stale source rate;
- citation failure rate;
- retrieval latency.
25.3 Models
- latency by model;
- error rate;
- token usage;
- cost by feature;
- structured output failure rate.
25.4 Tools and Agents
- tool call rate;
- tool failure rate;
- approval rate;
- max-step failures;
- stuck tasks.
25.5 Safety
- unauthorized retrieval count;
- forbidden tool proposal count;
- prompt injection detections;
- redaction failures.
Dashboards should support drill-down from metric to trace.
26. Alerting
Alert on symptoms that require action.
Examples:
| Alert | Severity |
|---|---|
| unauthorized retrieval > 0 | critical |
| approval bypass > 0 | critical |
| redaction failure > 0 | critical |
| model timeout rate spike | high |
| cost per request spike | high |
| citation failure spike | high |
| retrieval no-result spike | medium |
| p95 latency exceeds SLO | medium/high |
| stuck tasks exceed threshold | medium |
| eval critical failure in release candidate | blocker |
Avoid noisy alerts.
Every alert should have an owner and runbook.
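Some rows of the table above can be expressed as threshold rules that carry a severity and a runbook reference. This is a sketch only: the metric names, runbook paths, and the stuck-task threshold are hypothetical, and spike-style alerts would need a baseline comparison that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    severity: str
    runbook: str  # hypothetical runbook identifier


RULES = [
    AlertRule("unauthorized_retrieval_count", 0, "critical", "runbooks/unauthorized-retrieval"),
    AlertRule("approval_bypass_count", 0, "critical", "runbooks/approval-bypass"),
    AlertRule("redaction_failure_count", 0, "critical", "runbooks/redaction-failure"),
    AlertRule("stuck_task_count", 25, "medium", "runbooks/stuck-tasks"),  # assumed threshold
]


def fired_alerts(metrics: dict[str, float], rules: list[AlertRule] = RULES) -> list[AlertRule]:
    """Return every rule whose metric is strictly above its threshold."""
    return [rule for rule in rules if metrics.get(rule.metric, 0) > rule.threshold]


fired = fired_alerts({"unauthorized_retrieval_count": 1, "stuck_task_count": 3})
```

Keeping the runbook reference on the rule itself means an alert cannot be defined without saying where the responder should start.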
27. Incident Debugging Runbook
AI Incident Runbook
1. Identify affected feature and time window.
2. Pull traces for affected requests.
3. Identify model/prompt/index/tool versions.
4. Classify failure type.
5. Check whether unauthorized data was exposed.
6. Check whether external side effects occurred.
7. Determine rollback option.
8. Add temporary mitigation.
9. Create regression eval.
10. Patch responsible component.
11. Run eval and release gate.
12. Document incident.
For regulated systems, preserve audit artifacts before modifying data.
28. Observability Testing
Test instrumentation.
```python
def test_model_span_has_required_attributes() -> None:
    # make_test_model_span is a test helper assumed to exist in your suite.
    span = make_test_model_span()
    assert span.attributes["model.name"]
    assert span.attributes["prompt.version"]
    assert "input_tokens" in span.attributes
    assert "output_tokens" in span.attributes
```
Test redaction.
```python
def test_trace_redacts_token() -> None:
    raw = {"access_token": "abc", "query": "hello"}
    redacted = redact_dict(raw)
    assert redacted["access_token"] == "[REDACTED]"
    assert redacted["query"] == "hello"
```
Observability can regress.
Treat trace schema as a contract.
29. Case-Management Observability
For case-management AI, traces must support defensibility.
A trace should capture:
- user identity and role;
- case ID;
- authorization decision;
- policy sources retrieved;
- case facts retrieved;
- evidence references;
- recommendation draft;
- citations;
- risk classification;
- approval requirement;
- approval decision;
- final output;
- workflow action status.
Do not rely on conversational transcript alone.
A regulator or auditor may ask:
Why did the system recommend escalation?
You need a traceable answer.
30. Anti-Patterns
| Anti-Pattern | Why It Fails |
|---|---|
| Only log final answer | Cannot debug evidence path |
| No prompt version | Prompt regressions invisible |
| No model version | Model changes untraceable |
| No retrieval candidate IDs | RAG failures undiagnosable |
| Raw sensitive traces everywhere | Privacy/security risk |
| Metrics without trace drill-down | Cannot find cause |
| Traces without eval link | Cannot improve quality |
| Audit and debug logs mixed | Wrong retention/access |
| No cost metrics | Runaway spend |
| No redaction tests | Accidental leaks |
| No runbook | Incidents become improvisation |
31. Practice: Instrument a RAG + Agent App
Add observability to your practice app.
Required traces:
- request trace;
- prompt trace;
- model span;
- retrieval span;
- context assembly span;
- validation span;
- tool span;
- agent node span.
Required metrics:
- p95 latency;
- token usage;
- cost estimate;
- retrieval no-result rate;
- citation failure rate;
- tool failure rate;
- max-step failure rate.
Required debugging scenarios:
- wrong answer due to retrieval miss;
- wrong answer due to hallucination;
- slow answer due to reranker;
- tool failure;
- agent loop;
- unauthorized retrieval attempt.
Deliverable:
Observability Report
1. Trace schema
2. Span list
3. Metrics list
4. Redaction policy
5. Dashboard plan
6. Alert plan
7. Debugging runbooks
8. Replay strategy
32. Engineering Heuristics
- Trace model, retrieval, tool, validation, and agent boundaries.
- Store versions for prompts, models, indexes, tools, and agents.
- Keep raw sensitive text out of default telemetry.
- Use hashes and references for replay.
- Make retrieval traces inspectable.
- Make context assembly inspectable.
- Track token and cost by feature.
- Trace approval and high-risk actions.
- Connect production failures to eval examples.
- Monitor both quality and operations.
- Alert on security and approval failures immediately.
- Test trace schema and redaction.
- Keep audit logs separate from debug logs.
- Build dashboards with trace drill-down.
- Debug from trace before changing prompts.
33. Summary
Observability makes AI systems operable.
The core invariant:
If the system generated an output, you should be able to reconstruct the path that produced it.
For AI applications, this path includes:
- prompt;
- model;
- retrieval;
- tools;
- agent state;
- validation;
- citations;
- approvals;
- metrics;
- audit.
Without that path, you cannot reliably debug, evaluate, improve, or defend the system.
In the next part, we move into Reliability Patterns for AI Systems.