Evaluation Foundations
Learn Python AI Application Engineer - Part 023
Evaluation foundations for production AI applications: eval-first mindset, datasets, rubrics, metrics, regression gates, scenario design, calibration, and release readiness.
Part 023 — Evaluation Foundations
1. Why This Part Matters
Traditional software engineering has tests.
AI application engineering needs tests plus evaluations.
A unit test usually answers:
Did this deterministic function return the expected value?
An AI eval answers:
Did this probabilistic system behave acceptably for this scenario, under this context, with this evidence, using this model/prompt/tool chain?
AI systems can fail even when the code is correct:
- the model chooses the wrong tool;
- a prompt change changes answer style;
- retrieval returns plausible but wrong evidence;
- an answer is fluent but unsupported;
- an agent loops or stops too early;
- a model upgrade improves average quality but breaks critical cases;
- a safety rule is followed in one scenario and ignored in another.
Without evaluation, you are shipping by vibes.
The central invariant:
An AI application is not production-ready until important behavior is represented as repeatable evaluation scenarios.
2. Target Skill
After this part, you should be able to:
- distinguish tests, evals, monitoring, and human review;
- design eval datasets from real product behavior;
- create rubrics for non-deterministic outputs;
- define acceptance criteria for AI features;
- separate component evals from end-to-end evals;
- build regression gates for prompts, models, retrieval, and agents;
- calibrate LLM-as-judge and human review;
- reason about precision, recall, faithfulness, groundedness, usefulness, and safety;
- turn production failures into eval examples;
- avoid misleading aggregate metrics.
3. The Evaluation Mindset
In normal software, we often write code and then tests.
For AI applications, reverse the default:
Define the behavior before optimizing the prompt, model, retrieval, or agent.
A practical sequence:
This is the Kaufman loop applied to AI engineering:
- deconstruct the skill;
- practice on small scenarios;
- get fast feedback;
- self-correct;
- increase complexity deliberately.
4. Test vs Eval vs Monitor
| Mechanism | Primary Purpose | Example |
|---|---|---|
| Unit test | deterministic correctness | schema parser rejects invalid JSON |
| Contract test | interface stability | model provider returns expected envelope |
| Integration test | components work together | retrieval service can query active index |
| Eval | behavior quality | answer is grounded and useful |
| Red-team eval | adversarial behavior | prompt injection does not cause tool misuse |
| Human review | expert judgment | legal reviewer checks regulatory answer |
| Monitoring | production health | hallucination complaints spike |
| Audit | accountability | answer trace cites source and user permission |
Do not use one mechanism for everything.
A retrieval recall eval is not a security audit.
A latency monitor is not a groundedness eval.
Human review is not a scalable regression gate unless it is sampled and structured.
5. Component Evals vs End-to-End Evals
An AI application has multiple layers.
You need evals at different levels.
| Level | Question |
|---|---|
| Planner eval | Did the system choose the right plan/query type? |
| Retrieval eval | Did it retrieve the right evidence? |
| Context eval | Did it package sufficient useful evidence? |
| Generation eval | Did it answer faithfully from evidence? |
| Citation eval | Did citations support claims? |
| Tool eval | Did it choose/call the right tool correctly? |
| Agent trajectory eval | Did the multi-step path make sense? |
| Safety eval | Did it avoid prohibited behavior? |
| E2E eval | Did the user-visible result satisfy the scenario? |
End-to-end evals are important, but they are not enough.
When E2E fails, component evals help locate the fault.
6. Eval Dataset Anatomy
A good eval example has more than input and expected output.
from typing import Literal

from pydantic import BaseModel, Field


class EvalExample(BaseModel):
    example_id: str
    feature: str
    scenario: str
    user_input: str
    user_context: dict[str, object] = {}
    expected_behavior: str
    expected_status: Literal[
        "answer",
        "clarify",
        "refuse",
        "escalate",
        "insufficient_evidence",
    ]
    expected_sources: list[str] = []
    forbidden_sources: list[str] = []
    rubric: dict[str, str] = {}
    risk_level: Literal["low", "medium", "high", "critical"] = "medium"
    tags: list[str] = []
    notes: str | None = None
For RAG and agents, include:
- tenant;
- role;
- source permissions;
- expected evidence;
- forbidden evidence;
- tool expectations;
- approval expectations;
- trace expectations.
The richer the example, the more diagnosable the failure.
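As a rough sketch, the RAG and agent fields listed above could live in a companion record with a small authoring check. The class name `RagAgentExpectations`, its field names, and the sample IDs are all hypothetical; plain dataclasses are used here only to keep the example dependency-free.

```python
from dataclasses import dataclass, field


@dataclass
class RagAgentExpectations:
    """Hypothetical companion record for a RAG/agent eval example."""

    tenant: str
    role: str
    permitted_sources: list[str] = field(default_factory=list)
    expected_evidence: list[str] = field(default_factory=list)
    forbidden_evidence: list[str] = field(default_factory=list)
    expected_tools: list[str] = field(default_factory=list)
    approval_required: bool = False
    expected_trace_events: list[str] = field(default_factory=list)

    def inconsistencies(self) -> list[str]:
        """Return authoring mistakes that would make a failure hard to diagnose."""
        problems: list[str] = []
        both = sorted(set(self.expected_evidence) & set(self.forbidden_evidence))
        if both:
            problems.append(f"expected and forbidden at once: {both}")
        unreadable = sorted(set(self.expected_evidence) - set(self.permitted_sources))
        if unreadable:
            problems.append(f"expected evidence the role cannot read: {unreadable}")
        return problems


expectations = RagAgentExpectations(
    tenant="acme",
    role="case_officer",
    permitted_sources=["policy-12", "case-881"],
    expected_evidence=["policy-12", "case-881"],
    forbidden_evidence=["hr-file-3"],
    expected_tools=["lookup_case"],
    approval_required=True,
    expected_trace_events=["retrieval", "approval_request"],
)
print(expectations.inconsistencies())  # []
```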
7. Scenario Design
Do not test only happy paths.
A production eval set should include:
- normal success;
- ambiguous input;
- missing information;
- conflicting evidence;
- stale source;
- unauthorized source;
- prompt injection;
- exact identifier lookup;
- semantic paraphrase;
- table lookup;
- long context;
- multilingual or mixed-language input where relevant;
- tool timeout;
- invalid tool input;
- high-risk action requiring approval;
- case requiring human handoff.
Scenario categories should reflect real product risk, not just model capability.
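A simple coverage check can show which scenario categories an eval set is missing. This is a minimal sketch that assumes each example carries scenario tags; the tag names in `REQUIRED_SCENARIOS` are made up for illustration and should be replaced by the categories that matter for the product.

```python
# Hypothetical scenario tags that this product's eval set must cover.
REQUIRED_SCENARIOS = {
    "normal_success",
    "ambiguous_input",
    "missing_information",
    "unauthorized_source",
    "prompt_injection",
    "tool_timeout",
}


def missing_scenarios(example_tags: list[list[str]], required: set[str]) -> list[str]:
    """Return the required scenario tags that no example covers, sorted by name."""
    covered = {tag for tags in example_tags for tag in tags}
    return sorted(required - covered)


dataset_tags = [
    ["normal_success"],
    ["ambiguous_input", "missing_information"],
    ["prompt_injection"],
]
print(missing_scenarios(dataset_tags, REQUIRED_SCENARIOS))
# ['tool_timeout', 'unauthorized_source']
```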
8. Golden Dataset
A golden dataset is a curated set of examples used to protect important behavior.
It should be:
- small enough to inspect;
- broad enough to represent risk;
- versioned;
- owned;
- reviewed;
- updated from production incidents;
- stable across releases unless intentionally changed.
Do not start with thousands of examples.
Start with 30-100 high-value cases.
Then grow by failure-driven additions.
8.1 Example Golden Dataset Structure
evals/
  golden/
    rag_policy_qa.yaml
    case_review_agent.yaml
    tool_calling.yaml
    prompt_injection.yaml
  rubrics/
    grounded_answer_rubric.md
    agent_trajectory_rubric.md
  runners/
    run_rag_eval.py
    run_agent_eval.py
  reports/
    2026-06-28-baseline.json
Version the dataset with the application.
9. Expected Behavior Is Not Always Expected Text
For LLM apps, exact string match is usually too brittle.
Instead of:
Expected output exactly equals:
"Escalation is required."
Use behavioral expectations:
The answer must:
- state that escalation is required;
- mention repeat non-compliance within 90 days;
- cite the active escalation policy;
- cite the case event showing the second breach;
- avoid claiming final case closure authority;
- recommend supervisor review.
This maps better to real quality.
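Behavioral expectations like these can be expressed as named checks rather than one expected string. The sketch below uses naive case-insensitive phrase matching, which is only an approximation: real answers can paraphrase, so the citation and closure-authority expectations in particular would usually need evidence-aware checks or a judge. The helper names and phrases are assumptions for illustration.

```python
from typing import Callable

Check = Callable[[str], bool]


def mentions(*phrases: str) -> Check:
    """Pass when every phrase appears in the answer (case-insensitive)."""
    def check(answer: str) -> bool:
        text = answer.lower()
        return all(phrase.lower() in text for phrase in phrases)
    return check


def avoids(*phrases: str) -> Check:
    """Pass when none of the phrases appears in the answer (case-insensitive)."""
    def check(answer: str) -> bool:
        text = answer.lower()
        return not any(phrase.lower() in text for phrase in phrases)
    return check


# Hypothetical phrase choices; each key is one behavioral expectation.
EXPECTATIONS: dict[str, Check] = {
    "states_escalation_required": mentions("escalation is required"),
    "mentions_90_day_window": mentions("90 days"),
    "recommends_supervisor_review": mentions("supervisor review"),
    "no_closure_authority_claim": avoids("case is closed", "i have closed"),
}


def evaluate_behavior(answer: str) -> dict[str, bool]:
    """Run every named check and report the result per expectation."""
    return {name: check(answer) for name, check in EXPECTATIONS.items()}


answer = (
    "Escalation is required because this is a repeat non-compliance within 90 days. "
    "I recommend supervisor review before any closure decision."
)
print(evaluate_behavior(answer))
```

Because each expectation reports separately, a failing answer shows which behavior was missed instead of a single mismatch.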
10. Rubrics
A rubric converts subjective quality into structured judgment.
Example grounded answer rubric:
| Criterion | Pass Condition |
|---|---|
| Relevance | Directly addresses the user question |
| Groundedness | Material claims are supported by evidence |
| Completeness | Covers required conditions/exceptions |
| Citation correctness | Citations support the claims |
| Uncertainty | States missing info or limitations |
| Safety | Avoids prohibited or unauthorized content |
| Actionability | Provides useful next step where appropriate |
| Tone | Clear, professional, not overconfident |
Rubric score:
class RubricScore(BaseModel):
    relevance: int = Field(ge=1, le=5)
    groundedness: int = Field(ge=1, le=5)
    completeness: int = Field(ge=1, le=5)
    citation_correctness: int = Field(ge=1, le=5)
    safety: int = Field(ge=1, le=5)
    actionability: int = Field(ge=1, le=5)
    overall: int = Field(ge=1, le=5)
    failure_reasons: list[str] = []
Rubrics should include examples of pass/fail.
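A rubric score becomes a pass/fail decision only once each criterion has a pass line. The following sketch assumes per-criterion minimums on the 1-5 scale; the specific thresholds in `MIN_SCORES` are invented for illustration and would need to be set from calibrated examples.

```python
# Hypothetical minimum score per criterion; safety and groundedness are held
# to a higher bar than actionability.
MIN_SCORES = {
    "relevance": 3,
    "groundedness": 4,
    "completeness": 3,
    "citation_correctness": 4,
    "safety": 5,
    "actionability": 3,
}


def rubric_failures(scores: dict[str, int]) -> list[str]:
    """List criteria scored below their minimum; a missing criterion counts as failing."""
    return [name for name, minimum in MIN_SCORES.items() if scores.get(name, 0) < minimum]


scores = {
    "relevance": 5,
    "groundedness": 3,
    "completeness": 4,
    "citation_correctness": 4,
    "safety": 5,
    "actionability": 4,
}
print(rubric_failures(scores))  # ['groundedness']
```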
11. Binary, Scalar, and Categorical Evals
Different questions need different outputs.
11.1 Binary
Did the answer cite an authorized source? yes/no
Good for gates.
11.2 Scalar
Groundedness score 1-5
Good for trend analysis.
11.3 Categorical
Failure type: retrieval_miss | hallucination | citation_error | unsafe_tool
Good for diagnosis.
11.4 Structured
class EvalJudgment(BaseModel):
    passed: bool
    score: float
    failure_type: str | None
    rationale: str
Structured eval results are easier to aggregate.
12. Deterministic Evaluators
Use deterministic evaluators whenever possible.
Examples:
- JSON schema validity;
- required field exists;
- status is one of allowed enum values;
- citation IDs exist in evidence package;
- forbidden source not retrieved;
- tool name is allowed;
- approval required for high-risk tool;
- latency under threshold;
- cost under threshold.
def eval_forbidden_sources(
    retrieved_source_ids: list[str],
    forbidden_source_ids: list[str],
) -> bool:
    return not set(retrieved_source_ids).intersection(forbidden_source_ids)
Deterministic checks are cheap, stable, and explainable.
Use model judges for what deterministic code cannot judge well.
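Two more checks from the list above, sketched in the same style: citation IDs must exist in the evidence package, and high-risk tools must carry an approval. The tool names in `HIGH_RISK_TOOLS` and the shape of a tool-call record (`name`, `approved`) are assumptions for this example.

```python
def eval_citations_exist(cited_ids: list[str], evidence_ids: list[str]) -> bool:
    """Pass when every cited ID is present in the evidence package."""
    return set(cited_ids).issubset(evidence_ids)


# Hypothetical set of tools that require approval before execution.
HIGH_RISK_TOOLS = {"close_case", "issue_refund"}


def eval_approval_compliance(tool_calls: list[dict]) -> bool:
    """Pass when every high-risk tool call was approved; other calls are ignored."""
    return all(
        call.get("approved", False)
        for call in tool_calls
        if call["name"] in HIGH_RISK_TOOLS
    )


print(eval_citations_exist(["policy-12"], ["policy-12", "case-881"]))  # True
print(eval_approval_compliance([{"name": "close_case", "approved": False}]))  # False
```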
13. LLM-as-Judge
LLM-as-judge can evaluate:
- answer relevance;
- faithfulness;
- completeness;
- helpfulness;
- rubric compliance;
- whether an answer follows evidence.
But it has risks:
- judge bias;
- inconsistency;
- prompt sensitivity;
- overconfidence;
- poor calibration;
- vulnerability to persuasive answers;
- hidden reasoning errors.
Use LLM judges as one tool, not as unquestioned truth.
13.1 Judge Prompt Principles
A judge prompt should include:
- task description;
- rubric;
- input;
- evidence;
- candidate answer;
- scoring schema;
- failure categories;
- examples if needed.
Require structured output.
class JudgeResult(BaseModel):
    passed: bool
    score: int = Field(ge=1, le=5)
    failure_types: list[str]
    rationale: str
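A minimal sketch of how the judge prompt parts and the structured reply might fit together. It makes no model call: `build_judge_prompt` only assembles text, and `parse_judge_result` validates a JSON reply with the standard library. The section labels, the allowed failure types, and both function names are assumptions, not a provider API.

```python
import json

# Hypothetical failure categories the judge is allowed to return.
ALLOWED_FAILURE_TYPES = {"hallucination", "citation_mismatch", "over_refusal", "under_refusal"}


def build_judge_prompt(*, task: str, rubric: str, user_input: str,
                       evidence: list[str], answer: str) -> str:
    """Assemble the judge prompt from the parts listed above."""
    evidence_block = "\n".join(f"[{i}] {text}" for i, text in enumerate(evidence, start=1))
    return "\n\n".join([
        f"TASK:\n{task}",
        f"RUBRIC:\n{rubric}",
        f"USER INPUT:\n{user_input}",
        f"EVIDENCE:\n{evidence_block}",
        f"CANDIDATE ANSWER:\n{answer}",
        "ALLOWED FAILURE TYPES:\n" + ", ".join(sorted(ALLOWED_FAILURE_TYPES)),
        'Reply with JSON only: {"passed": bool, "score": 1-5, '
        '"failure_types": [...], "rationale": "..."}',
    ])


def parse_judge_result(raw: str) -> dict:
    """Validate the judge's reply; raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is itself a ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if not isinstance(data.get("passed"), bool):
        raise ValueError("passed must be a boolean")
    score = data.get("score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    failure_types = data.get("failure_types")
    if not isinstance(failure_types, list) or not all(isinstance(f, str) for f in failure_types):
        raise ValueError("failure_types must be a list of strings")
    unknown = set(failure_types) - ALLOWED_FAILURE_TYPES
    if unknown:
        raise ValueError(f"unknown failure types: {sorted(unknown)}")
    if not isinstance(data.get("rationale"), str):
        raise ValueError("rationale must be a string")
    return data


prompt = build_judge_prompt(
    task="Judge whether the answer is grounded in the evidence.",
    rubric="Groundedness: material claims are supported by evidence.",
    user_input="Is escalation required?",
    evidence=["Repeat non-compliance within 90 days requires escalation."],
    answer="Yes, escalation is required.",
)
reply = '{"passed": true, "score": 5, "failure_types": [], "rationale": "Supported by [1]."}'
result = parse_judge_result(reply)
```

Rejecting malformed replies outright keeps a broken judge from silently counting as a pass.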
14. Human Review
Human review is needed for:
- high-risk legal/regulatory content;
- new rubric calibration;
- disputed judge results;
- production incidents;
- dataset curation;
- critical release gates;
- policy-sensitive behavior.
Human review should also be structured.
class HumanReviewResult(BaseModel):
    reviewer_id: str
    example_id: str
    passed: bool
    score: int
    failure_types: list[str]
    comments: str
    reviewed_at: str
Free-form review comments are useful, but structured labels make improvement measurable.
15. Calibration
Calibration aligns evaluators with desired judgment.
Calibrate:
- human reviewers with examples;
- LLM judges against human labels;
- thresholds against risk;
- component metrics against product outcomes.
Questions:
- Do reviewers agree?
- Does the judge match expert labels?
- Are false passes more dangerous than false fails?
- Which examples are borderline?
- Which rubric terms are ambiguous?
For high-risk systems, route uncertain cases to human review rather than letting them pass.
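To check whether a judge matches expert labels, compare its pass/fail decisions with human labels on the same examples. The sketch below reports raw agreement, false pass and false fail rates, and Cohen's kappa (agreement corrected for chance). The function name and the sample labels are illustrative only.

```python
def calibration_report(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge pass/fail decisions against human labels for the same examples."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    agreement = sum(h == j for h, j in zip(human, judge)) / n
    # False pass: the human failed the answer but the judge passed it.
    false_pass = sum((not h) and j for h, j in zip(human, judge)) / n
    # False fail: the human passed the answer but the judge failed it.
    false_fail = sum(h and (not j) for h, j in zip(human, judge)) / n
    # Cohen's kappa: observed agreement corrected for expected chance agreement.
    p_human, p_judge = sum(human) / n, sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    kappa = 1.0 if expected == 1 else (agreement - expected) / (1 - expected)
    return {
        "agreement": agreement,
        "false_pass_rate": false_pass,
        "false_fail_rate": false_fail,
        "kappa": kappa,
    }


human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]
report = calibration_report(human_labels, judge_labels)
print({name: round(value, 3) for name, value in report.items()})
```

In this sample the judge agrees on 6 of 8 examples, yet kappa is only about 0.47, which is why raw agreement alone can overstate how well a judge is calibrated.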
16. Eval Metrics
Common metrics:
| Metric | Useful For |
|---|---|
| pass rate | release gate |
| average score | trend |
| failure type distribution | diagnosis |
| recall@k | retrieval |
| MRR/nDCG | retrieval ranking |
| citation support rate | grounded answers |
| unsupported claim rate | generation |
| tool success rate | agents/tools |
| trajectory success rate | agents |
| approval compliance rate | risk workflows |
| latency p95 | operations |
| cost per task | economics |
Metrics must be sliced.
Aggregate pass rate can hide critical failures.
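For the retrieval metrics in the table, recall@k and MRR can be computed directly from ranked result IDs and a set of relevant IDs. This is a minimal sketch with made-up document IDs; nDCG is omitted because it needs graded relevance labels.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```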
17. Slicing
Slice eval results by:
- query type;
- user role;
- tenant;
- source type;
- document status;
- risk level;
- model version;
- prompt version;
- index version;
- retrieval mode;
- tool;
- agent workflow;
- language;
- jurisdiction;
- case type.
Example:
class EvalSlice(BaseModel):
    name: str
    filters: dict[str, str]
    pass_rate: float
    total_examples: int
    critical_failures: int
A release can pass overall and still fail a critical slice.
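The effect is easy to reproduce with a small grouping function. The sketch below assumes each result is a dict with a `tags` mapping and a `passed` flag; that record shape and the function name are assumptions for illustration.

```python
from collections import defaultdict


def pass_rate_by(results: list[dict], key: str) -> dict[str, dict[str, float]]:
    """Group results by one tag and compute total and pass rate per tag value."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for result in results:
        buckets[result["tags"].get(key, "unknown")].append(result["passed"])
    return {
        value: {"total": len(flags), "pass_rate": sum(flags) / len(flags)}
        for value, flags in buckets.items()
    }


# Eight low-risk passes hide the fact that half of the critical cases fail.
results = [{"tags": {"risk_level": "low"}, "passed": True} for _ in range(8)] + [
    {"tags": {"risk_level": "critical"}, "passed": True},
    {"tags": {"risk_level": "critical"}, "passed": False},
]
overall = sum(r["passed"] for r in results) / len(results)
by_risk = pass_rate_by(results, "risk_level")
print(overall)                             # 0.9
print(by_risk["critical"]["pass_rate"])    # 0.5
```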
18. Regression Gates
A regression gate blocks release if quality drops.
Examples:
- no unauthorized source retrieved;
- no high-risk approval bypass;
- groundedness pass rate >= 0.95 for critical cases;
- retrieval recall@10 >= baseline - 1%;
- citation support rate >= 0.98;
- p95 latency <= budget;
- cost per task <= budget;
- no new critical failure.
Gate policy:
class RegressionGate(BaseModel):
    metric_name: str
    slice_name: str | None = None
    comparator: Literal[">=", "<=", "=="]
    threshold: float
    severity: Literal["blocker", "warning"]
Use stricter gates for safety/security than for style.
19. Baselines
Always compare against a baseline.
Baselines can be:
- previous production version;
- simpler model;
- vector-only retrieval;
- no-rerank version;
- single-agent workflow;
- deterministic workflow;
- human-only reference.
A new architecture should justify itself.
If a multi-agent system costs 3x more and improves pass rate by 0.5% while adding latency, it may not be worth it.
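The trade-off in that example can be made explicit with a small comparison. The numbers below are invented to match the scenario (3x cost, 0.5 percentage points better pass rate, slower p95 latency); the metric keys and function name are assumptions.

```python
def compare_to_baseline(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Express the candidate relative to the baseline on quality, cost, and latency."""
    return {
        # Difference in pass rate, in percentage points.
        "pass_rate_delta_pp": (candidate["pass_rate"] - baseline["pass_rate"]) * 100,
        # Ratios above 1.0 mean the candidate is more expensive or slower.
        "cost_ratio": candidate["cost_per_task"] / baseline["cost_per_task"],
        "latency_ratio": candidate["p95_latency_s"] / baseline["p95_latency_s"],
    }


single_agent = {"pass_rate": 0.900, "cost_per_task": 0.02, "p95_latency_s": 2.0}
multi_agent = {"pass_rate": 0.905, "cost_per_task": 0.06, "p95_latency_s": 3.0}

comparison = compare_to_baseline(single_agent, multi_agent)
print({name: round(value, 2) for name, value in comparison.items()})
```

Putting the three numbers side by side in the eval report makes the cost of the extra complexity visible next to its benefit.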
20. Eval Report
A useful eval report includes:
Release Candidate: rag-agent-v0.12
Dataset Version: golden-v8
Prompt Version: policy-answer-v5
Model Version: model-x
Index Version: policy-index-2026-06-28
Overall:
- pass rate:
- critical failure count:
- p95 latency:
- cost per task:
By slice:
- exact lookup:
- policy interpretation:
- case decision support:
- prompt injection:
- unauthorized access:
Top failure types:
1. retrieval_miss
2. citation_mismatch
3. insufficient_evidence_false_pass
Recommendation:
- ship / do not ship
- required fixes
A release decision should be traceable to eval evidence.
21. Eval Data Lifecycle
Eval datasets evolve.
Lifecycle:
Each example should have:
- owner;
- reason for inclusion;
- source;
- review status;
- last updated;
- tags;
- risk level.
Do not allow random unreviewed examples to silently redefine release gates.
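One way to enforce this is an admission check that runs before an example may influence a gate. The sketch assumes examples are dicts with the metadata fields listed above and that `"approved"` is the review status meaning "reviewed and accepted"; both the key names and that status value are hypothetical.

```python
REQUIRED_METADATA = (
    "owner",
    "reason",
    "source",
    "review_status",
    "last_updated",
    "tags",
    "risk_level",
)


def gate_eligibility(example: dict) -> list[str]:
    """Return reasons this example must not count toward release gates (empty if eligible)."""
    problems = [f"missing {key}" for key in REQUIRED_METADATA if not example.get(key)]
    # "approved" is an assumed status value for reviewed and accepted examples.
    if example.get("review_status") and example["review_status"] != "approved":
        problems.append(f"review_status is {example['review_status']!r}, not 'approved'")
    return problems


candidate = {
    "owner": "policy-team",
    "reason": "incident follow-up",
    "source": "production",
    "review_status": "draft",
    "last_updated": "2026-06-28",
    "tags": ["escalation"],
    "risk_level": "critical",
}
print(gate_eligibility(candidate))
# ["review_status is 'draft', not 'approved'"]
```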
22. Failure Taxonomy
Use consistent failure labels.
Example:
| Failure Type | Meaning |
|---|---|
| retrieval_miss | correct evidence not retrieved |
| rerank_error | correct evidence retrieved but not selected |
| context_error | selected evidence poorly packaged |
| hallucination | unsupported claim generated |
| citation_mismatch | citation does not support claim |
| over_refusal | refused despite sufficient evidence |
| under_refusal | answered despite insufficient evidence |
| tool_misuse | wrong or unsafe tool selected |
| approval_bypass | risky action without approval |
| security_leak | unauthorized data exposed |
| stale_source | superseded/outdated source used |
| format_error | output schema/format invalid |
| latency_budget | too slow |
| cost_budget | too expensive |
Failure labels guide fixes.
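As a rough illustration of how labels guide fixes, failures can be counted per owning component so the busiest component is addressed first. The `OWNING_COMPONENT` mapping below is a guess at plausible ownership for a subset of the labels, not a rule from the taxonomy above.

```python
from collections import Counter

# Hypothetical mapping from failure label to the component most likely to own the fix.
OWNING_COMPONENT = {
    "retrieval_miss": "retrieval",
    "rerank_error": "retrieval",
    "stale_source": "retrieval",
    "context_error": "context",
    "hallucination": "generation",
    "citation_mismatch": "generation",
    "tool_misuse": "agent",
    "approval_bypass": "agent",
}


def failures_by_component(failure_types: list[str]) -> Counter:
    """Count failures per owning component; labels without an owner go to 'unmapped'."""
    return Counter(OWNING_COMPONENT.get(label, "unmapped") for label in failure_types)


observed = ["retrieval_miss", "retrieval_miss", "hallucination", "stale_source", "format_error"]
counts = failures_by_component(observed)
print(counts.most_common(1))  # [('retrieval', 3)]
```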
23. From Incident to Eval
Every serious production issue should become an eval.
Process:
Example:
Incident:
Agent recommended closing case without escalation.
Root cause:
Escalation policy was retrieved but context selector omitted exception clause.
New eval:
Query: "Can this repeat breach case be closed without escalation?"
Expected: Must mention repeat breach within 90 days requires escalation.
Gate: Critical case decision support must pass groundedness and completeness.
This is how production systems improve.
24. Eval Runner Architecture
The runner should store:
- input;
- system output;
- trace;
- judge outputs;
- deterministic metrics;
- timing;
- cost;
- versions.
Without version metadata, eval results are not reproducible.
25. Version Metadata
Every eval run should record:
class EvalRunMetadata(BaseModel):
    eval_run_id: str
    dataset_version: str
    application_version: str
    model_provider: str
    model_name: str
    model_version: str | None = None
    prompt_versions: dict[str, str]
    index_versions: dict[str, str]
    tool_versions: dict[str, str]
    agent_versions: dict[str, str]
    started_at: str
    completed_at: str | None = None
AI systems change behavior when any of these changes.
Track them.
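One lightweight way to tell whether two runs used the same configuration is to hash the version fields into a short fingerprint. This is a sketch of one possible approach, not something the metadata model above requires; the function name and the 12-character truncation are arbitrary choices.

```python
import hashlib
import json


def config_fingerprint(versions: dict[str, object]) -> str:
    """Hash version metadata into a short, order-independent fingerprint."""
    # sort_keys makes the JSON canonical, so key order does not change the hash.
    canonical = json.dumps(versions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


run_a = {
    "dataset_version": "golden-v8",
    "model_name": "model-x",
    "prompt_versions": {"policy_answer": "v5"},
    "index_versions": {"policy": "2026-06-28"},
}
# Same configuration, different key order.
run_b = {
    "index_versions": {"policy": "2026-06-28"},
    "prompt_versions": {"policy_answer": "v5"},
    "model_name": "model-x",
    "dataset_version": "golden-v8",
}
# Only the prompt version differs.
run_c = {**run_a, "prompt_versions": {"policy_answer": "v6"}}

print(config_fingerprint(run_a) == config_fingerprint(run_b))  # True
print(config_fingerprint(run_a) == config_fingerprint(run_c))  # False
```

Run-specific fields such as `eval_run_id` and timestamps should be left out of the hashed dict, otherwise every run gets a unique fingerprint.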
26. Minimal Python Eval Runner
class EvalResult(BaseModel):
    # Minimal result container (assumed shape); fields mirror what the runner records.
    example_id: str
    output: object
    judgments: list[EvalJudgment]


class EvalRunner:
    def __init__(
        self,
        *,
        system_under_test: "AiApplication",
        evaluators: list["Evaluator"],
    ) -> None:
        self.system_under_test = system_under_test
        self.evaluators = evaluators

    async def run(self, examples: list[EvalExample]) -> list[EvalResult]:
        results: list[EvalResult] = []
        for example in examples:
            # Run the application once, then apply every evaluator to the same output.
            output = await self.system_under_test.run(example)
            judgments = []
            for evaluator in self.evaluators:
                judgment = await evaluator.evaluate(example, output)
                judgments.append(judgment)
            results.append(
                EvalResult(
                    example_id=example.example_id,
                    output=output,
                    judgments=judgments,
                )
            )
        return results
Evaluator protocol:
from typing import Protocol


class Evaluator(Protocol):
    async def evaluate(self, example: EvalExample, output: object) -> "EvalJudgment":
        ...
Keep the runner generic.
Feature-specific evaluators can plug in.
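To illustrate the plug-in idea end to end, here is a self-contained sketch with two small evaluators and a fake application. It uses dataclass stand-ins (`Example`, `Output`, `Judgment`) instead of the pydantic models above so it runs without extra dependencies; `FakeApp` and all IDs are invented for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Example:
    example_id: str
    user_input: str
    forbidden_sources: list[str]


@dataclass
class Output:
    answer: str
    retrieved_source_ids: list[str]


@dataclass
class Judgment:
    evaluator: str
    passed: bool
    rationale: str


class Evaluator(Protocol):
    async def evaluate(self, example: Example, output: Output) -> Judgment: ...


class ForbiddenSourceEvaluator:
    """Deterministic check: no forbidden source may appear in the retrieved set."""

    async def evaluate(self, example: Example, output: Output) -> Judgment:
        leaked = sorted(set(output.retrieved_source_ids) & set(example.forbidden_sources))
        rationale = f"forbidden sources retrieved: {leaked}" if leaked else "none retrieved"
        return Judgment("forbidden_sources", not leaked, rationale)


class NonEmptyAnswerEvaluator:
    """Deterministic check: the answer must contain visible text."""

    async def evaluate(self, example: Example, output: Output) -> Judgment:
        ok = bool(output.answer.strip())
        return Judgment("non_empty_answer", ok, "answer present" if ok else "answer empty")


class FakeApp:
    """Stand-in for the system under test; returns a canned output."""

    async def run(self, example: Example) -> Output:
        return Output(answer="Escalation is required.", retrieved_source_ids=["policy-7", "draft-2"])


async def run_example(app: FakeApp, evaluators: list[Evaluator], example: Example) -> list[Judgment]:
    """Run the app once and apply every evaluator to the same output."""
    output = await app.run(example)
    return [await evaluator.evaluate(example, output) for evaluator in evaluators]


example = Example("ex-1", "Can this case be closed?", forbidden_sources=["draft-2"])
judgments = asyncio.run(
    run_example(FakeApp(), [ForbiddenSourceEvaluator(), NonEmptyAnswerEvaluator()], example)
)
for judgment in judgments:
    print(judgment.evaluator, judgment.passed, judgment.rationale)
```

Because `Evaluator` is a `Protocol`, the two evaluator classes need no shared base class; any object with a matching async `evaluate` method can be plugged in.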
27. Evaluation Anti-Patterns
| Anti-Pattern | Why It Fails |
|---|---|
| Only manual vibe checks | Not repeatable |
| Only exact string match | Too brittle |
| Only aggregate score | Hides critical failures |
| Eval set not versioned | Results not comparable |
| No failure labels | Hard to improve |
| Judge without calibration | False confidence |
| No traces captured | Failures not diagnosable |
| No negative cases | Unsafe behavior missed |
| Eval after release only | Regression discovered too late |
| No baseline | Cannot justify complexity |
| Prompt tuned to eval only | Overfitting |
| Ignoring latency/cost | Unusable system |
28. Practice: Build Evaluation Harness
Create eval files for a RAG/case assistant.
Datasets:
- policy_lookup.yaml;
- case_decision_support.yaml;
- prompt_injection.yaml;
- tool_calling.yaml;
- agent_trajectory.yaml.
Build evaluators:
- schema validity;
- forbidden source check;
- required citation check;
- groundedness judge;
- refusal correctness;
- tool authorization check;
- approval gate check;
- latency/cost check.
Create a report with:
- overall pass rate;
- pass rate by scenario;
- blocker failures;
- top failure types;
- examples to inspect;
- release recommendation.
29. Engineering Heuristics
- Define evals before tuning prompts.
- Use component and E2E evals.
- Start with a small high-quality golden dataset.
- Include negative and adversarial scenarios.
- Prefer deterministic checks where possible.
- Use LLM judges with calibration.
- Store traces for every eval example.
- Slice metrics by risk and query type.
- Gate on critical failures, not just averages.
- Compare against baselines.
- Turn incidents into evals.
- Version datasets, prompts, models, tools, and indexes.
- Review eval datasets like production assets.
- Avoid optimizing only for the eval set.
- Treat eval as an engineering system, not a one-off script.
30. Summary
Evaluation is the quality system for AI applications.
The core invariant:
Important behavior must be represented as repeatable, versioned, diagnosable scenarios.
Tests verify deterministic code.
Evals verify probabilistic behavior.
Monitoring detects production drift.
Human review calibrates judgment.
Together, they make AI applications shippable.
In the next part, we apply these foundations specifically to RAG and Agent Evaluation.