Part 029 — Latency, Cost, and Throughput Engineering

1. Why This Part Matters

AI applications are expensive distributed systems with user-facing latency.

A traditional API spends most of its time in:

database query;
cache lookup;
business logic;
network calls.

An AI application may spend most of its time and money in:

prompt construction;
embedding call;
vector search;
reranking;
model prefill;
model generation;
tool calls;
judge calls;
retries;
agent loops;
validation/repair.

A small design mistake can multiply cost.

Examples:

adding 20 retrieved chunks to every prompt;
using a frontier model for low-risk classification;
reranking 200 candidates for every query;
allowing an agent to call tools without a step budget;
running a judge on every request synchronously;
not caching stable prompt prefixes;
starting the stream late because retrieval is slow;
retrying failed model calls without a retry budget.

The central invariant:

Every AI feature should have an explicit latency, cost, and quality budget.

Without budgets, you cannot make engineering trade-offs.

2. Target Skill

After this part, you should be able to:

decompose AI latency into stage-level timings;
reason about time-to-first-token, time-to-last-token, and perceived latency;
estimate token cost per task;
control prompt/context growth;
choose models by task, risk, and budget;
design caching strategies safely;
use streaming without hiding backend inefficiency;
manage concurrency and throughput;
apply batching where appropriate;
use queues for long-running tasks;
build capacity plans and cost dashboards;
design release gates for latency and cost.

3. Latency Anatomy of an AI Request

A RAG request passes through several stages, and each one contributes latency:

Component	Typical Cause
API/auth	middleware, identity lookup
query planning	model/rule planner
embedding	remote embedding provider
search	vector DB + filters
reranking	cross-encoder/model judge
context assembly	token counting, formatting
prefill	model processes input tokens
generation	output token generation
validation	schema/grounding/citation checks
tool calls	external systems
retries	transient failure handling

If you only measure total latency, you cannot optimize intelligently.
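A minimal sketch of per-stage timing, assuming a single-threaded request path. `StageTimer` and the stage names are hypothetical, and `time.sleep` stands in for real calls; a production system would more likely emit these timings as tracing spans.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per named stage for one request."""

    def __init__(self) -> None:
        self.timings_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate so a stage that runs twice (e.g. a retry) is summed.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def dominant_stage(self) -> str:
        """Return the stage with the largest accumulated duration."""
        return max(self.timings_ms, key=self.timings_ms.get)


timer = StageTimer()
with timer.stage("search"):
    time.sleep(0.005)   # stand-in for a vector search call
with timer.stage("generation"):
    time.sleep(0.05)    # stand-in for a model call

print(timer.timings_ms, "dominant:", timer.dominant_stage())
```

The dominant stage is the first place to look; the total alone would not tell you whether to tune search or generation.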

4. Latency Metrics

Track several latency metrics.

Metric	Meaning
TTFT	time to first token
TTLT	time to last token
p50 latency	typical experience
p95 latency	tail user experience
p99 latency	worst normal operating experience
queue wait time	time before work starts
model latency	model call duration
retrieval latency	search + rerank duration
validation latency	post-generation checks
tool latency	external tool calls

For streaming apps, TTFT strongly affects perceived responsiveness.

For transactional workflows, TTLT and correctness matter more.
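The difference between TTFT and TTLT can be measured directly on a token stream. The sketch below assumes the provider exposes tokens as a Python iterable; `fake_token_stream` is a hypothetical stand-in with simulated delays, not a real client.

```python
import time
from typing import Iterable, Iterator, Optional


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(0.02)          # simulated prefill before the first token
    for token in ["Escalate", " the", " case", "."]:
        yield token
        time.sleep(0.005)     # simulated per-token decode time


def measure_stream(tokens: Iterable[str]) -> dict[str, object]:
    """Consume a stream and record when the first and last tokens arrived."""
    start = time.perf_counter()
    first_at: Optional[float] = None
    last_at: Optional[float] = None
    parts: list[str] = []
    for token in tokens:
        now = time.perf_counter()
        if first_at is None:
            first_at = now
        last_at = now
        parts.append(token)
    return {
        "text": "".join(parts),
        "ttft_s": None if first_at is None else first_at - start,
        "ttlt_s": None if last_at is None else last_at - start,
    }


result = measure_stream(fake_token_stream())
print(result)
```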

5. Cost Anatomy

AI cost is often driven by tokens and calls.

Cost dimensions:

input tokens;
output tokens;
cached input tokens;
embedding tokens;
reranker calls;
judge calls;
tool/API costs;
vector DB cost;
queue/worker cost;
storage cost;
observability cost;
human review cost.

Cost per task:

total_task_cost =
  model_generation_cost
+ embedding_cost
+ reranking_cost
+ judge_cost
+ tool_cost
+ infra_cost
+ human_review_cost

Do not optimize only model generation if reranking or judge calls dominate.
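As a rough sketch, part of the formula above can be turned into a per-component breakdown so the dominant term is visible. All prices in `Prices` are made-up illustrative numbers, not any provider's price list, and tool, infrastructure, and human review costs are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Illustrative USD prices (assumptions, not real provider pricing)."""
    input_per_1k_tokens: float = 0.003
    output_per_1k_tokens: float = 0.015
    embedding_per_1k_tokens: float = 0.0001
    rerank_per_call: float = 0.002
    judge_per_call: float = 0.01


def cost_breakdown(
    input_tokens: int,
    output_tokens: int,
    embedding_tokens: int,
    rerank_calls: int,
    judge_calls: int,
    prices: Prices = Prices(),
) -> dict[str, float]:
    """Return the cost of each component so the largest one can be found."""
    return {
        "model_generation": input_tokens / 1000 * prices.input_per_1k_tokens
        + output_tokens / 1000 * prices.output_per_1k_tokens,
        "embedding": embedding_tokens / 1000 * prices.embedding_per_1k_tokens,
        "reranking": rerank_calls * prices.rerank_per_call,
        "judge": judge_calls * prices.judge_per_call,
    }


breakdown = cost_breakdown(
    input_tokens=2000, output_tokens=200, embedding_tokens=100,
    rerank_calls=1, judge_calls=2,
)
total = sum(breakdown.values())
dominant = max(breakdown, key=breakdown.get)
print(f"total=${total:.5f} dominant={dominant}")
```

With these assumed prices the two judge calls cost more than generation, which is exactly the situation where shrinking the prompt would be the wrong first optimization.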

6. Token Economics

Input tokens come from:

system instructions;
developer instructions;
user message;
conversation history;
retrieved evidence;
tool definitions;
tool results;
output schema;
memory;
hidden context wrappers.

Output tokens come from:

answer;
intermediate reasoning output, if exposed;
structured JSON;
citations;
verbose explanations;
repair attempts.

Large prompts increase:

cost;
prefill latency;
memory pressure;
failure surface;
chance of irrelevant context;
prompt injection exposure.

A top-tier engineer treats the token budget like heap memory or a database query budget: a finite resource that is allocated deliberately and monitored.

7. Token Budget Design

Create budgets by feature.

Example:

Feature	Input Budget	Output Budget
classification	1k	100
short answer	4k	500
RAG policy answer	12k	1k
case review draft	24k	2k
long document summary	64k	4k
agent step decision	8k	300

Budgets should be enforced in code, not just documented.

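from pydantic import BaseModel


class TokenBudget(BaseModel):
    max_input_tokens: int          # hard cap on the whole prompt
    max_output_tokens: int         # cap passed to the model call
    max_context_tokens: int        # share reserved for retrieved evidence
    max_tool_result_tokens: int    # share reserved for tool outputs
    max_conversation_tokens: int   # share reserved for chat history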

A context builder should refuse or degrade when budget is exceeded.
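One way a context builder might refuse or degrade is sketched below. It assumes chunks arrive already ranked, and it uses a crude whitespace token count (`count_tokens` is a placeholder; a real system would use the target model's tokenizer). The names `build_context` and `BudgetExceeded` are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when even the mandatory prompt parts do not fit."""


def count_tokens(text: str) -> int:
    # Crude whitespace approximation, used only to keep the sketch runnable.
    return len(text.split())


def build_context(
    system: str,
    user: str,
    ranked_chunks: list[str],
    max_input_tokens: int,
    max_context_tokens: int,
) -> list[str]:
    """Return the ranked chunks that fit; refuse if the fixed parts do not."""
    fixed = count_tokens(system) + count_tokens(user)
    if fixed > max_input_tokens:
        # Refuse: there is nothing safe to drop.
        raise BudgetExceeded(f"fixed prompt needs {fixed} tokens")
    remaining = min(max_context_tokens, max_input_tokens - fixed)
    kept: list[str] = []
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if cost > remaining:
            break  # Degrade: drop this chunk and every lower-ranked one.
        kept.append(chunk)
        remaining -= cost
    return kept


system = "You answer policy questions."
user = "When must we escalate?"
chunks = [
    "repeat breaches require formal escalation",
    "escalation goes to supervisor",
    "unrelated archive note text",
]
kept = build_context(system, user, chunks, max_input_tokens=20, max_context_tokens=10)
print(kept)
```

Stopping at the first chunk that does not fit keeps the ranking intact; skipping it and trying smaller chunks is an alternative if filling the budget matters more than rank order.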

8. Context Compression

Techniques:

8.1 Select Less

The best compression is not adding irrelevant content in the first place.

reduce top-k;
use reranker;
filter by metadata;
prefer answer-bearing chunks;
avoid duplicate chunks.

8.2 Summarize

Summarize previous turns or tool outputs.

Cautions:

summaries are lossy;
source references must be preserved;
critical facts need validation against the original.

8.3 Structure

Structured evidence is more efficient than raw dumps.

Evidence E1:
- Source: Active Escalation Policy
- Rule: repeat breach within 90 days requires escalation
- Citation: policy-enf-4.2

8.4 Use Parent-Child Carefully

Retrieve child chunks and include the parent only when needed.

8.5 Cache Stable Prefixes

Stable system instructions and tool schemas can benefit from prompt caching where the provider supports it.

9. Prompt Caching

Prompt caching reduces cost/latency when repeated prompt prefixes are stable.

Good candidates:

system instructions;
developer instructions;
tool schemas;
stable rubric;
long static policy context;
repeated workflow instructions.

Poor candidates:

dynamic user data;
rapidly changing tool results;
per-user sensitive context;
random ordering of tools/evidence;
volatile timestamps at the top of prompt.

Design rule:

Put stable reusable content early and dynamic content later.

This improves cache hit likelihood for prefix-based caching systems.
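The ordering rule can be illustrated with a small prompt assembler. This is a sketch under assumptions: the prompt layout, `stable_prefix`, and `build_prompt` are hypothetical, and the shared character prefix is only a proxy for what a provider's token-level prefix cache would see.

```python
import os


def stable_prefix(system: str, tool_schemas: dict[str, str]) -> str:
    # Sort tools by name so dict insertion order cannot change the prefix.
    lines = [system]
    lines += [f"TOOL {name}: {tool_schemas[name]}" for name in sorted(tool_schemas)]
    return "\n".join(lines)


def build_prompt(
    system: str,
    tool_schemas: dict[str, str],
    evidence: list[str],
    user_message: str,
    timestamp: str,
) -> str:
    # Dynamic parts (evidence, timestamp, user text) go after the stable prefix.
    dynamic = [f"EVIDENCE: {item}" for item in evidence]
    dynamic += [f"NOW: {timestamp}", f"USER: {user_message}"]
    return stable_prefix(system, tool_schemas) + "\n" + "\n".join(dynamic)


SYSTEM = "You are a policy assistant."
p1 = build_prompt(SYSTEM, {"search": "query:str", "lookup": "id:str"},
                  ["E1"], "When to escalate?", "2024-01-01T10:00")
p2 = build_prompt(SYSTEM, {"lookup": "id:str", "search": "query:str"},
                  ["E7"], "Who approves notices?", "2024-01-01T10:05")

# The two requests differ in tool order, evidence, time, and question,
# yet they still share the whole stable prefix.
shared = os.path.commonprefix([p1, p2])
print(len(shared), "shared characters")
```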

But cache safety matters:

include tenant and security context in cache keys when caching at the application level;
do not share sensitive context unsafely;
understand provider caching policy;
trace cached token usage;
do not rely on caching for correctness.

10. Model Routing for Cost and Latency

Not every task needs the strongest model.

Route by:

task complexity;
risk level;
required schema reliability;
latency budget;
context length;
cost budget;
user tier;
data policy.

Example routing:

Task	Model Strategy
intent classification	small/fast model or deterministic rules
query rewrite	small model
low-risk summary	economical model
high-risk case recommendation	stronger model + validation
judge critical answer	calibrated judge model
fallback during outage	approved secondary model
extraction with strict schema	model with strong structured output reliability

Routing logic should be explicit and traced.

from pydantic import BaseModel


class ModelRoute(BaseModel):
    """One routing decision; log it with the request trace."""
    task_type: str
    risk_level: str
    model_name: str
    max_input_tokens: int
    max_output_tokens: int
    requires_validation: bool
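A routing table is one simple way to keep the logic explicit. The sketch below uses a standard-library dataclass to stay dependency-free; the model names, task names, and token limits are placeholders, and falling back to the strongest validated route for unknown tasks is a design assumption, not a rule from this part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_name: str
    max_input_tokens: int
    max_output_tokens: int
    requires_validation: bool


# Keys are (task_type, risk_level); every value here is illustrative.
ROUTES: dict[tuple[str, str], Route] = {
    ("intent_classification", "low"): Route("small-fast", 1_000, 100, False),
    ("low_risk_summary", "low"): Route("economy", 4_000, 500, False),
    ("case_recommendation", "high"): Route("strong", 24_000, 2_000, True),
}

# Unknown combinations fail closed: strong model plus validation.
DEFAULT_ROUTE = Route("strong", 12_000, 1_000, True)


def choose_route(task_type: str, risk_level: str) -> Route:
    """Look up the route for a task, falling back to the safe default."""
    return ROUTES.get((task_type, risk_level), DEFAULT_ROUTE)


print(choose_route("intent_classification", "low"))
print(choose_route("unknown_task", "medium"))
```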

11. Cost Guardrails

Use cost guardrails at several levels.

Level	Guardrail
request	max tokens, max model calls
user	daily/monthly quota
tenant	budget limit
feature	cost per task budget
agent run	max steps, max tool calls
eval	separate budget
provider	rate/cost alert

Example:

from pydantic import BaseModel


class CostBudget(BaseModel):
    """Limits checked before each model or judge call."""
    max_cost_per_request_usd: float
    max_cost_per_run_usd: float
    max_model_calls: int
    max_judge_calls: int

If a budget is exceeded, pick one of:

stop;
ask user to narrow scope;
queue for batch;
degrade to cheaper path;
require approval for costly run.
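A minimal sketch of a run-level guard, assuming the caller can estimate a call's cost before making it. `RunBudget` and `BudgetExceededError` are hypothetical names; what happens after the error (stop, queue, degrade, ask for approval) is left to the caller.

```python
from dataclasses import dataclass


class BudgetExceededError(RuntimeError):
    """Raised when the next call would break the run's budget."""


@dataclass
class RunBudget:
    max_cost_usd: float
    max_model_calls: int
    spent_usd: float = 0.0
    model_calls: int = 0

    def charge(self, estimated_cost_usd: float) -> None:
        """Check first, then record, so a rejected call is never counted."""
        if self.model_calls + 1 > self.max_model_calls:
            raise BudgetExceededError("model call limit reached")
        if self.spent_usd + estimated_cost_usd > self.max_cost_usd:
            raise BudgetExceededError("cost limit reached")
        self.model_calls += 1
        self.spent_usd += estimated_cost_usd


budget = RunBudget(max_cost_usd=0.05, max_model_calls=3)
budget.charge(0.02)
budget.charge(0.02)
try:
    budget.charge(0.02)   # would bring the run to 0.06 USD
    outcome = "allowed"
except BudgetExceededError as exc:
    outcome = f"blocked: {exc}"
print(outcome)
```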

12. Streaming

Streaming improves perceived latency.

It does not reduce total compute cost.

Streaming helps when:

generation is long;
retrieval is already done;
user benefits from incremental output;
answer can be safely streamed before final validation.

Streaming is risky when:

output requires full validation before display;
citations may be generated late;
safety checks happen after generation;
tool results may change answer;
answer status may become refusal after partial text.

For high-risk answers, consider:

stream progress events, not final answer text;
stream only after validation;
stream draft with clear status;
buffer until citation/grounding validation passes.

13. Streaming Event Design

Instead of streaming raw tokens only, stream structured events.

from pydantic import BaseModel


class StreamEvent(BaseModel):
    event_type: str
    payload: dict[str, object]

Examples:

retrieval_started
retrieval_completed
generation_started
token_delta
citation_ready
validation_started
final_answer
error

This improves UX and debuggability.

For case-management workflows, progress events can be more appropriate than unvalidated answer tokens.
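One possible shape for a progress-first stream is sketched below: tokens are buffered, progress events are emitted, and answer text is released only after validation. The `Event` dataclass mirrors the event model above without the pydantic dependency; `has_citation` is a deliberately naive, hypothetical validator.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class Event:
    event_type: str
    payload: dict[str, object] = field(default_factory=dict)


def stream_validated(
    tokens: Iterable[str], validate: Callable[[str], bool]
) -> Iterator[Event]:
    """Emit progress events; release answer text only if validation passes."""
    yield Event("generation_started")
    text = "".join(tokens)            # buffer instead of emitting token_delta
    yield Event("validation_started")
    if validate(text):
        yield Event("final_answer", {"text": text})
    else:
        yield Event("error", {"reason": "validation_failed"})


def has_citation(text: str) -> bool:
    # Naive check for illustration only: a bracketed citation must be present.
    return "[" in text and "]" in text


good = [e.event_type for e in
        stream_validated(["Escalate.", " [policy-enf-4.2]"], has_citation)]
bad = [e.event_type for e in
       stream_validated(["Escalate", " now."], has_citation)]
print(good)
print(bad)
```

The user sees activity immediately but never sees text that later fails validation, at the cost of a later first answer token.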

14. Time-to-First-Token vs Time-to-First-Useful-Token

TTFT can be misleading.

If the first token arrives quickly but the answer is wrong or later corrected, UX suffers.

Define:

TTFT = first generated token
TTFUT = first token that belongs to the final valid answer

For high-risk AI:

retrieval and validation may need to happen before answer streaming;
a progress stream may be safer;
final answer may appear later but be trustworthy.

Optimize perceived latency without compromising correctness.

15. Retrieval Performance

Retrieval latency contributors:

embedding call;
lexical search;
vector search;
metadata filters;
reranking;
network latency;
index size;
candidate count;
tenant partitioning;
cold caches.

Optimizations:

parallel lexical/vector search;
cache query embeddings;
reduce candidate_k;
skip reranking for exact lookup;
pre-filter by tenant/source;
partition by tenant or domain;
optimize metadata indexes;
use approximate nearest neighbor tuning;
avoid huge top-k;
use source diversity before rerank.

Trace before tuning.
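As an example of the first optimization, lexical and vector search can run concurrently with `asyncio.gather`. Both search functions here are hypothetical stubs with simulated latency, and merging by simple de-duplication is an assumption; real systems usually fuse scores.

```python
import asyncio
import time


async def lexical_search(query: str) -> list[str]:
    await asyncio.sleep(0.1)          # simulated keyword index latency
    return ["doc-1", "doc-2"]


async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.1)          # simulated vector DB latency
    return ["doc-2", "doc-3"]


async def hybrid_search(query: str) -> list[str]:
    # Both searches start together, so wall time is close to the slower one.
    lexical, vector = await asyncio.gather(
        lexical_search(query), vector_search(query)
    )
    # De-duplicate while preserving first-seen order.
    return list(dict.fromkeys(lexical + vector))


start = time.perf_counter()
results = asyncio.run(hybrid_search("repeat breach escalation"))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run sequentially, the two stubs would take about 0.2 s; run concurrently, the total stays near 0.1 s.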

16. Reranking Cost

Reranking is powerful but expensive.

Control it:

rerank only top N;
skip rerank for exact identifier queries;
use cheaper reranker for low-risk tasks;
use metadata boosts before rerank;
cache rerank for stable public queries;
route high-risk queries to stronger rerank.

Example adaptive policy:

def choose_rerank_k(query_type: str, risk_level: str) -> int:
    if query_type == "exact_lookup":
        return 10

    if risk_level in {"high", "critical"}:
        return 80

    return 40

17. Agent Cost Control

Agents can explode cost because they call models repeatedly.

Controls:

max steps;
max model calls;
max tool calls;
max cost per run;
planning budget;
no repeated calls to the same tool with the same arguments;
stop when evidence is sufficient;
use a deterministic router where possible;
summarize state carefully;
do not adopt a multi-agent design without a single-agent baseline.

Agent trace should show:

step_count
model_calls
tool_calls
tokens_by_step
cost_by_step

If you cannot attribute agent cost by step, you cannot optimize it.
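Two of the controls above, a step limit and duplicate-call detection, fit in a short loop. This is a sketch only: `plan_step` and `call_tool` are hypothetical callables standing in for the model planner and tool layer, and token and cost accounting per step are omitted.

```python
import json
from typing import Callable, Optional

Action = Optional[tuple[str, dict]]


def run_agent(
    plan_step: Callable[[list[dict]], Action],
    call_tool: Callable[[str, dict], str],
    max_steps: int,
) -> dict:
    """Run until the planner finishes, repeats itself, or hits max_steps."""
    seen: set[tuple[str, str]] = set()
    trace: list[dict] = []
    for step in range(1, max_steps + 1):
        action = plan_step(trace)
        if action is None:
            return {"status": "done", "trace": trace}
        tool, args = action
        # Canonical JSON makes {"a": 1, "b": 2} and {"b": 2, "a": 1} equal.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            return {"status": "stopped_duplicate_call", "trace": trace}
        seen.add(key)
        trace.append({"step": step, "tool": tool, "result": call_tool(tool, args)})
    return {"status": "stopped_max_steps", "trace": trace}


def echo_tool(tool: str, args: dict) -> str:
    return "ok"


# A planner that keeps asking for the same thing is cut off after one call.
looping = run_agent(lambda trace: ("search", {"q": "policy"}), echo_tool, max_steps=5)
# A planner that never finishes is cut off by the step budget.
wandering = run_agent(
    lambda trace: ("search", {"q": f"policy {len(trace)}"}), echo_tool, max_steps=3
)
print(looping["status"], wandering["status"])
```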

18. Batching

Batching can improve throughput.

Good for:

offline embedding jobs;
eval runs;
document ingestion;
batch classification;
human review pre-processing;
low-priority summarization.

Risky for:

interactive chat;
personalized permission-sensitive tasks;
high-priority workflows;
external side-effect tools.

Batching trade-off:

higher throughput;
lower per-item overhead;
increased waiting latency;
larger failure blast radius.

Use separate queues for batch workloads.

19. Concurrency

Concurrency must be bounded.

Use semaphores for model calls, embedding calls, reranker calls, and tools.

import asyncio


class BoundedModelClient:
    """Wraps a model client so at most max_concurrency calls run at once."""

    def __init__(self, client: "ModelClient", max_concurrency: int) -> None:
        self.client = client
        self.sem = asyncio.Semaphore(max_concurrency)

    async def generate(self, *args, **kwargs):
        # Callers beyond the limit wait here instead of hitting the provider.
        async with self.sem:
            return await self.client.generate(*args, **kwargs)

Concurrency limits should be per:

provider;
model;
tenant;
feature;
tool;
worker pool.

Unbounded concurrency creates rate-limit storms and cascading failures.

20. Throughput and Capacity Planning

Capacity planning asks:

How many requests/tasks can the system handle while meeting latency and cost goals?

Inputs:

requests per second;
average model calls per request;
p95 model latency;
token usage;
provider rate limits;
vector DB QPS;
tool QPS;
worker concurrency;
queue depth;
average cost per request;
peak multiplier.

Example:

100 user requests/min
average 1.5 model calls/request
average 8k input tokens + 800 output tokens
rerank on 60% of requests
p95 model latency 4s

From this, plan:

model concurrency;
retrieval capacity;
worker count;
budget;
fallback thresholds.
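The example numbers can be turned into first-pass capacity figures with Little's law (in-flight work = arrival rate x time in system). Two assumptions are made here that the example does not state: the token figures are per model call, and the peak multiplier is 2. Using p95 latency instead of the mean makes the concurrency estimate deliberately conservative.

```python
import math

requests_per_min = 100
model_calls_per_request = 1.5
input_tokens_per_call = 8_000      # assumption: token figures are per model call
output_tokens_per_call = 800
rerank_fraction = 0.6
p95_model_latency_s = 4.0
peak_multiplier = 2.0              # hypothetical; not given in the example

model_calls_per_min = requests_per_min * model_calls_per_request
model_calls_per_sec = model_calls_per_min / 60

# Little's law: concurrent calls = calls per second x seconds per call.
steady_concurrency = math.ceil(model_calls_per_sec * p95_model_latency_s)
peak_concurrency = math.ceil(steady_concurrency * peak_multiplier)

input_tokens_per_min = int(model_calls_per_min * input_tokens_per_call)
output_tokens_per_min = int(model_calls_per_min * output_tokens_per_call)
rerank_calls_per_min = round(requests_per_min * rerank_fraction)

print(f"model calls/min:        {model_calls_per_min:.0f}")
print(f"steady concurrency:     {steady_concurrency}")
print(f"peak concurrency:       {peak_concurrency}")
print(f"input tokens/min:       {input_tokens_per_min}")
print(f"output tokens/min:      {output_tokens_per_min}")
print(f"rerank calls/min:       {rerank_calls_per_min}")
```

Under these assumptions the system needs about 10 concurrent model calls in steady state and 20 at peak, and must fit 1.2M input tokens per minute within the provider's rate limits.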

21. Queueing

Use queues when work is:

long-running;
expensive;
batchable;
non-interactive;
retryable;
approval-dependent;
tool-heavy.

Queue metrics:

enqueue rate;
dequeue rate;
queue depth;
oldest message age;
worker utilization;
retry count;
DLQ count.

Queue design:

Interactive requests should not wait behind batch jobs.

Use priority queues or separate queues.
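A minimal in-process sketch of the priority approach, using `heapq` with a sequence counter so jobs of equal priority stay in arrival order. `WorkQueue` and the job names are hypothetical; a real deployment would typically use a message broker with separate queues.

```python
import heapq
import itertools

INTERACTIVE = 0   # lower number is served first
BATCH = 1


class WorkQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO within a priority

    def put(self, priority: int, job: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = WorkQueue()
queue.put(BATCH, "embed-docs")
queue.put(INTERACTIVE, "chat-1")
queue.put(BATCH, "eval-run")
queue.put(INTERACTIVE, "chat-2")

order = [queue.get() for _ in range(len(queue))]
print(order)
```

Strict priority like this can starve batch jobs under sustained interactive load, which is one reason separate queues with dedicated workers are often preferred.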

22. Caching Layers

Potential caches:

Cache	Key Must Include
prompt prefix cache	provider-dependent
query embedding cache	normalized query + embedding model
retrieval result cache	query + index + security context
source metadata cache	source ID + version
tool result cache	tool args + tenant + permission
generated answer cache	dangerous; include many versions/security fields
eval result cache	example + app/model/prompt/index versions

Generated answer caching is risky for permission-sensitive systems.

Prefer caching intermediate safe artifacts.
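For the retrieval result cache, the key fields in the table can be combined as sketched below. The field names and the normalization (lowercasing, whitespace collapsing) are assumptions for illustration; the essential point is that tenant and permission context are part of the key.

```python
import hashlib
import json


def retrieval_cache_key(
    query: str, index_version: str, tenant_id: str, roles: list[str]
) -> str:
    """Build a deterministic key that includes the security context."""
    fields = {
        "query": " ".join(query.lower().split()),   # normalize case and spacing
        "index_version": index_version,
        "tenant_id": tenant_id,
        "roles": sorted(set(roles)),                # order must not matter
    }
    encoded = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


key_a = retrieval_cache_key("Escalation  policy", "idx-v3", "tenant-a", ["analyst"])
key_a_again = retrieval_cache_key("escalation policy", "idx-v3", "tenant-a", ["analyst"])
key_b = retrieval_cache_key("escalation policy", "idx-v3", "tenant-b", ["analyst"])
key_new_index = retrieval_cache_key("escalation policy", "idx-v4", "tenant-a", ["analyst"])
print(key_a[:12], key_b[:12], key_new_index[:12])
```

Because the index version is in the key, a reindex naturally misses the old entries instead of serving stale results.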

23. Cache Invalidation

Cache invalidation triggers:

source document updated;
index version changed;
ACL changed;
user role changed;
prompt version changed;
model version changed;
tool output version changed;
case status changed;
policy superseded.

For regulated systems, stale cache can cause wrong decisions.

Use conservative TTLs for high-risk data.

24. Latency-Cost-Quality Trade-Off

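Most AI optimization is a three-way trade-off between latency, cost, and quality: improving one usually costs you on at least one of the others.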

Examples:

stronger model improves quality but costs more;
larger context improves recall but increases latency/cost;
reranker improves precision but adds latency;
judge improves safety but adds cost;
caching lowers latency/cost but creates staleness/privacy risks;
multi-agent improves decomposition but increases coordination cost.

Engineering judgment means choosing the correct trade-off for the use case.

25. Performance Budgets by Risk

Low-risk:

cheap model;
small context;
no judge;
fast fallback;
relaxed citation requirements if the output is not factual.

Medium-risk:

RAG with citations;
moderate model;
validation;
rerank if needed.

High-risk:

stronger model;
strict evidence;
claim validation;
human approval;
no unsafe fallback;
higher latency acceptable.

Risk should influence performance budget.

Do not optimize high-risk workflows into unsafe shortcuts.

26. Load Testing AI Apps

Load tests should simulate:

model latency;
provider rate limits;
retrieval latency;
tool latency;
streaming clients;
queue workloads;
failures and retries;
burst traffic;
tenant distribution.

Use fake/stub providers for high-volume load tests.

Measure:

p50/p95/p99 latency;
error rate;
timeout rate;
queue depth;
cost estimate;
token usage;
fallback rate;
circuit breaker events.

Do not accidentally run huge live-model load tests without budget controls.

27. Performance Regression Gates

Example gates:

Performance Gates

- p95 RAG latency <= 5s
- p99 API latency <= 12s
- average input tokens <= baseline + 10%
- average output tokens <= baseline + 10%
- cost per successful answer <= budget
- reranker timeout rate <= 1%
- model retry rate <= 2%
- agent average steps <= baseline + 1
- no unbounded context growth

Performance regressions are product regressions.
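A few of these gates can be checked mechanically in CI. The sketch below assumes metrics are available as plain dictionaries; the metric names and `check_gates` are hypothetical, while the thresholds are the ones listed above.

```python
def check_gates(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return a list of failed gates; an empty list means the release may proceed."""
    failures: list[str] = []
    if current["p95_rag_latency_s"] > 5.0:
        failures.append("p95 RAG latency > 5s")
    if current["avg_input_tokens"] > baseline["avg_input_tokens"] * 1.10:
        failures.append("average input tokens > baseline + 10%")
    if current["model_retry_rate"] > 0.02:
        failures.append("model retry rate > 2%")
    if current["agent_avg_steps"] > baseline["agent_avg_steps"] + 1:
        failures.append("agent average steps > baseline + 1")
    return failures


baseline = {"avg_input_tokens": 8000.0, "agent_avg_steps": 4.0}
current = {
    "p95_rag_latency_s": 4.6,
    "avg_input_tokens": 9200.0,   # 15% above baseline
    "model_retry_rate": 0.01,
    "agent_avg_steps": 4.5,
}
failures = check_gates(current, baseline)
print(failures or "all gates passed")
```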

28. Cost Dashboard

A cost dashboard should show:

cost by feature;
cost by tenant;
cost by model;
cost by prompt version;
cost by index version;
cost by agent workflow;
tokens by stage;
judge cost;
retry cost;
wasted cost from failed requests;
cost per successful task;
cost per human-approved recommendation.

Cost per successful task is more useful than raw spend.

If many requests fail after expensive model calls, quality problems become cost problems.
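The arithmetic is simple but worth making explicit. In this sketch each request is a hypothetical `(cost_usd, succeeded)` pair, and what counts as "succeeded" is left to the application.

```python
def cost_per_successful_task(requests: list[tuple[float, bool]]) -> dict[str, float]:
    """Spread total spend, including failed requests, over the successes."""
    total = sum(cost for cost, _ in requests)
    wasted = sum(cost for cost, ok in requests if not ok)
    successes = sum(1 for _, ok in requests if ok)
    if successes == 0:
        raise ValueError("no successful tasks to attribute cost to")
    return {
        "total_usd": total,
        "wasted_usd": wasted,
        "cost_per_request_usd": total / len(requests),
        "cost_per_success_usd": total / successes,
    }


# Ten requests at 0.04 USD each, two of which fail after the model call.
requests = [(0.04, True)] * 8 + [(0.04, False)] * 2
report = cost_per_successful_task(requests)
print({name: round(value, 4) for name, value in report.items()})
```

Raw spend per request is 0.04 USD here, but each successful task effectively costs 0.05 USD because the failures still consumed model calls.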

29. Optimization Method

Do not optimize randomly.

Use the sequence:

measure by stage;
identify dominant cost/latency;
classify whether it is necessary;
remove waste first;
optimize architecture second;
optimize model choice third;
optimize provider/backend last;
add regression gate.

Example:

Problem:
p95 latency 12s.

Trace:
generation 7s, rerank 3s, retrieval 1s, validation 1s.

Fix:
reduce output length and skip rerank for exact lookup.

Not first fix:
rewrite whole vector database layer.

30. Case-Management Performance Policy

For regulatory case-management AI:

Interactive Q&A

target p95: 5-8s;
citations required;
RAG required;
no final decision side effects.

Case Review Draft

can be async;
target completion: minutes not seconds;
stronger validation;
human approval;
full audit.

External Notice Draft

async acceptable;
strict policy and legal review;
no direct send without approval.

Workflow Action

deterministic preflight;
idempotency;
approval;
audit;
latency less important than correctness.

Not all AI work should be synchronous chat.

31. Anti-Patterns

Anti-Pattern	Why It Fails
No cost budget	surprise spend
No token tracking	context bloat invisible
Frontier model for every task	unnecessary cost
Judge every request synchronously	latency/cost explosion
Rerank huge candidate sets always	slow and expensive
Unbounded agent loops	runaway cost
Streaming unsafe drafts	bad partial answers
Caching without security key	data leakage
Batch jobs share interactive pool	user latency spikes
No per-stage timings	blind optimization
Only optimize average latency	tail users suffer
No performance gates	regressions ship

32. Practice: Performance and Cost Lab

Take your RAG + agent practice app.

Add:

per-stage latency trace;
token counters;
cost estimator;
model routing policy;
context token budget;
adaptive rerank policy;
query embedding cache;
bounded concurrency;
queue for long-running case review;
performance gates.

Run scenarios:

exact lookup;
semantic RAG;
long case review;
high-risk recommendation;
missing evidence;
prompt injection case;
multi-agent case review.

Report:

Performance Report

1. Latency by stage
2. Token usage by stage
3. Cost by feature
4. p95/p99 latency
5. Cost per successful task
6. Top waste sources
7. Optimization applied
8. Regression gates

33. Engineering Heuristics

Measure latency by stage.
Track TTFT and TTLT separately.
Track cost per successful task.
Enforce token budgets.
Use cheaper models for simple tasks.
Route high-risk tasks to stronger models and validation.
Use streaming carefully for high-risk answers.
Cache stable safe content.
Include security context in cache keys.
Bound concurrency per dependency.
Separate interactive and batch queues.
Avoid unbounded agent loops.
Skip expensive stages when query type does not need them.
Add performance regression gates.
Optimize after tracing, not before.

34. Summary

Latency, cost, and throughput are architecture concerns.

The core invariant:

AI quality must be delivered within explicit time, token, cost, and capacity budgets.

A top-tier AI application engineer can explain:

why a request is slow;
why a feature is expensive;
which stage dominates cost;
which optimization preserves correctness;
which fallback is safe;
which workload should be async;
which model should handle which task.

In the next part, we move into Security Threat Modeling for LLM Apps.