Learn Build From Scratch Recommendations System Part 073 Cost Capacity And Performance Engineering

title: Build From Scratch Recommendations System - Part 073
description: Designing cost, capacity, and performance engineering for a production-grade recommendation system: QPS, candidate scoring volume, feature store load, vector search capacity, model inference cost, batch pipeline cost, cache economics, autoscaling, load shedding, and optimization strategy.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 73
partTitle: Cost, Capacity, and Performance Engineering
tags:
- recommendation-system
- recsys
- cost-engineering
- capacity-planning
- performance
- scalability
- series
date: 2026-07-02

Part 073 — Cost, Capacity, and Performance Engineering

A large-scale recommendation system is not only a model-quality problem.

It is also an economics problem.

A single recommendation request can trigger:

candidate generation
vector search
feature lookup
profile lookup
eligibility batch check
model inference for hundreds or thousands of candidates
reranking
decision logging
event ingestion
observability

At high QPS, cost can explode.

A simple example:

2,000 QPS * 800 candidates/request = 1,600,000 candidate scores/second

If each candidate requires dozens of feature lookups and expensive model inference, the system can become very expensive even before traffic goes global.

This part covers cost, capacity, and performance engineering for a production-grade recommendation system: QPS, candidate scoring volume, vector search, feature store, model inference, cache economics, batch scoring, autoscaling, load shedding, cost attribution, and optimization strategy.

1. Mental Model: Optimize Cost per Useful Decision

RecSys cost should be viewed as:

cost per request
cost per candidate scored
cost per useful impression
cost per conversion/outcome
cost per tenant/surface
cost per model/source/policy

Not only as the total infrastructure bill.

Goal:

maximize product value per unit cost

A high-quality recommendation that is too expensive to serve in production is not a solution.

2. Cost Drivers

Major cost drivers:

QPS
candidate count per request
feature count per candidate
feature store latency/load
vector search topK/overfetch
model inference complexity
deep model/GPU usage
cache hit rate
batch scoring volume
embedding/index build frequency
event logging volume
debug trace sampling
LLM calls
multi-tenant isolation overhead

Know your cost drivers before optimizing.

3. Capacity Formula: Candidate Scores per Second

Most important formula:

candidate_scores_per_second = request_qps * candidates_ranked_per_request

Example:

QPS = 5,000
candidates/request = 1,000

candidate_scores/sec = 5,000 * 1,000 = 5,000,000

Ranking capacity should be planned in candidate scores/sec, not only requests/sec.

4. Feature Lookup Volume

Feature volume:

feature_values_per_second =
  QPS * candidates_per_request * item_features_per_candidate
  + QPS * user_features_per_request

Example:

QPS = 2,000
candidates = 800
item_features = 50

item feature values/sec = 2,000 * 800 * 50 = 80,000,000

Even if values are batched, payload and compute matter.

5. Vector Search Capacity

Vector search capacity depends on:

QPS
topK
overfetch factor
dimension
index algorithm
filtering
partition count
latency target
replication

Example:

2 vector sources/request
2,000 QPS
topK 1,000

This is 2 * 2,000 = 4,000 ANN queries/sec, each returning up to 1,000 neighbors.

If each query searches a large HNSW index with filters, capacity planning matters.

6. End-to-End Cost Model

Create cost model:

surface: home_feed
qps_peak: 2000
candidate_sources:
  two_tower:
    calls_per_request: 1
    cost_per_1k_calls: 0.02
  trending:
    calls_per_request: 1
    cost_per_1k_calls: 0.001
ranking:
  candidates_ranked: 800
  cost_per_million_scores: 0.50
feature_store:
  batch_calls_per_request: 3
  avg_payload_kb: 120
logging:
  decision_log_kb: 8
  impression_events_per_request: 20

Approximation is better than surprise.
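The cost model above can be turned into a rough per-second number. The following is a minimal sketch, assuming all prices in the config share one unspecified currency unit and that only the candidate sources and ranking are priced (the config gives no prices for the feature store or logging, so they are left out). The class and method names are hypothetical.

```java
// Hypothetical sketch: rough serving cost per second from the cost model above.
// Assumption: all prices share one (unspecified) currency unit.
final class SurfaceCostEstimate {
    private SurfaceCostEstimate() {}

    /** Cost per second of one candidate source priced per 1,000 calls. */
    static double sourceCostPerSecond(double qps, double callsPerRequest, double costPer1kCalls) {
        return qps * callsPerRequest * costPer1kCalls / 1_000.0;
    }

    /** Cost per second of ranking priced per 1,000,000 candidate scores. */
    static double rankingCostPerSecond(double qps, int candidatesRanked, double costPerMillionScores) {
        return qps * candidatesRanked * costPerMillionScores / 1_000_000.0;
    }

    /** home_feed at peak, using the numbers from the config above. */
    static double homeFeedCostPerSecond() {
        double qpsPeak = 2_000;
        double twoTower = sourceCostPerSecond(qpsPeak, 1, 0.02);   // 0.04 per second
        double trending = sourceCostPerSecond(qpsPeak, 1, 0.001);  // 0.002 per second
        double ranking = rankingCostPerSecond(qpsPeak, 800, 0.50); // 0.80 per second
        return twoTower + trending + ranking;
    }
}
```

With these inputs, ranking contributes about 0.80 of roughly 0.84 units per second at peak, so candidate count is the dominant lever in this particular model.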

7. Cost by Stage

Track cost by stage:

Stage	Cost Driver
Candidate generation	source calls, vector queries
Feature serving	lookups, payload, cache miss
Ranking	candidate scores, model complexity
Reranking	slate optimization complexity
Logging	event volume, storage retention
Offline training	dataset size, compute
Embeddings/index	item count, dimension, rebuild cadence
Batch scoring	subjects × candidates
LLM	tokens, model, retries

Without attribution, cost optimization becomes guesswork.

8. Surface-Level Cost

Different surfaces have different economics.

home feed: high QPS, low latency
PDP recommendations: moderate QPS, seed-based
email: batch-heavy, no online latency
push: low volume, high trust
enterprise actions: lower QPS, high correctness
search suggestions: strict latency

Optimize per surface.

Do not use the same expensive pipeline everywhere.

9. Tenant-Level Cost

For enterprise:

cost per tenant
cost per active user
cost per case
cost per recommended action
cost per LLM explanation
cost per batch scoring run

Tenant cost helps:

pricing,
quota,
capacity,
abuse detection,
SLA planning.

Large tenants can dominate platform cost.

10. Candidate Count Optimization

Candidate count affects:

ranking cost,
feature cost,
latency,
memory,
payload size.

Tune:

candidate source quota
dedup before ranking
eligibility before feature fetch
pre-rank/filter before expensive scoring
rank top M then rerank top N

Do not rank 10,000 candidates if 800 gives the same quality.

11. Candidate Funnel Metrics

Track:

generated candidates
after dedup
after eligibility
after prefilter
ranked
reranked
final slate
clicked/converted

If a source generates many candidates that never survive, reduce its quota or improve the source.

Cost follows candidate funnel.
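The funnel stages above can be compared with a simple survival rate per stage. This is a minimal sketch, assuming stage counts are already collected per request or per time window; the class name, method name, and stage keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: fraction of generated candidates still alive at each funnel stage.
final class CandidateFunnel {
    private CandidateFunnel() {}

    /**
     * Returns, for each stage, count(stage) / count(first stage).
     * The input must be ordered from the first stage (generated) to the last.
     */
    static Map<String, Double> survivalRates(LinkedHashMap<String, Long> stageCounts) {
        Map<String, Double> rates = new LinkedHashMap<>();
        long first = -1;
        for (Map.Entry<String, Long> stage : stageCounts.entrySet()) {
            long count = stage.getValue();
            if (first < 0) {
                first = count; // the first entry is the funnel entrance
            }
            rates.put(stage.getKey(), first == 0 ? 0.0 : (double) count / first);
        }
        return rates;
    }
}
```

Computing the same rates per candidate source shows which sources pay for candidates that never reach the slate.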

12. Two-Stage Ranking

Pattern:

candidate pool 5000
cheap pre-ranker selects 800
expensive ranker scores 800
reranker final 20

This reduces cost.

Pre-ranker can use:

source scores,
simple GBDT,
cached features,
heuristic quality filters.

Deep ranker should not score everything.
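The pattern above can be sketched as two top-K selections in sequence. This is a minimal illustration, assuming candidates are plain string ids and both rankers are supplied as scoring functions; `TwoStageRanker` and its methods are hypothetical names, and a real ranker would score in batches rather than one id at a time.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: pool -> cheap pre-ranker -> expensive ranker -> final slate.
final class TwoStageRanker {
    private TwoStageRanker() {}

    /** Candidate id paired with its score so each candidate is scored exactly once. */
    record Scored(String id, double score) {}

    /** Returns the ids of the top `limit` candidates by descending score. */
    static List<String> topBy(List<String> candidates, ToDoubleFunction<String> scorer, int limit) {
        return candidates.stream()
                .map(id -> new Scored(id, scorer.applyAsDouble(id)))
                .sorted((a, b) -> Double.compare(b.score(), a.score()))
                .limit(limit)
                .map(Scored::id)
                .toList();
    }

    /** The expensive scorer only ever sees `preRankSize` candidates. */
    static List<String> rank(List<String> pool,
                             ToDoubleFunction<String> cheapScore,
                             ToDoubleFunction<String> expensiveScore,
                             int preRankSize,
                             int slateSize) {
        List<String> shortlist = topBy(pool, cheapScore, preRankSize);
        return topBy(shortlist, expensiveScore, slateSize);
    }
}
```

With a pool of 5000, `preRankSize` 800, and `slateSize` 20, the expensive scorer runs 800 times instead of 5000.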

13. Feature Cost Optimization

Strategies:

reduce feature count,
batch lookup,
group features,
cache static item features,
compute cross features only for top candidates,
use missing/default policy,
remove unused features,
monitor feature importance,
avoid huge categorical payloads,
use compact encoding.

Feature cost is often hidden.

14. Feature Pruning

Remove features that have:

low importance
high serving cost
high missing rate
high privacy risk
unstable values
duplicated signal

Feature value should justify:

latency + compute + storage + governance cost

Do not keep a feature just because it is “maybe useful”.

15. Model Complexity Trade-Off

Model choices:

Model	Serving Cost	Notes
heuristic	very low	fallback/baseline
GBDT small	low	strong tabular baseline
GBDT large	medium	latency grows
deep ranker	high	powerful but expensive
cross-encoder/LLM reranker	very high	small K only

Choose model by marginal lift vs cost.

16. Model Distillation

Distillation:

train smaller model to mimic larger model

Use cases:

deep model offline teacher,
smaller online student,
LLM/expert reranker teacher,
expensive feature teacher.

Goal:

capture most quality at lower serving cost

Measure quality/cost frontier.

17. Batch vs Online Scoring

If online scoring is expensive and the context is stable:

batch score candidates
store precomputed list
online final-check/rerank lightly

Good for:

email,
digest,
fallback,
expensive deep model,
low-latency home.

But batch scoring can also be expensive at huge scale.

18. Batch Scoring Cost

Formula:

batch_scores = subjects * candidates_per_subject

Example:

20M users * 3000 candidates = 60B scores

Optimization:

score active users only,
incremental refresh,
smaller candidate pool,
segment-level lists,
batch ranker simpler,
store more than the final slate size, but cap the list length,
schedule off-peak.

19. Embedding and Index Cost

Cost drivers:

entity count
dimension
embedding model cost
recompute cadence
index algorithm overhead
replication
shadow/canary index
delta index
memory footprint

Index rollout may require old and new index loaded simultaneously.

Capacity plan for 2x memory during rollout.
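A back-of-envelope memory estimate makes the 2x rollout rule concrete. This is a minimal sketch, assuming float32 embeddings (4 bytes per dimension) and a caller-supplied index overhead factor; that factor is an assumption that has to be measured for the actual index algorithm, and all names here are hypothetical.

```java
// Hypothetical sketch: vector index memory per replica, including rollout headroom.
final class IndexMemoryEstimate {
    private IndexMemoryEstimate() {}

    static final int BYTES_PER_FLOAT32 = 4;

    /** Raw embedding bytes, before any index structure overhead. */
    static long rawVectorBytes(long entityCount, int dimension) {
        return Math.multiplyExact(Math.multiplyExact(entityCount, (long) dimension), BYTES_PER_FLOAT32);
    }

    /**
     * Bytes needed per replica while the old and the new index are both loaded.
     * indexOverheadFactor is an assumed multiplier (for example graph links); measure it.
     */
    static long rolloutBytesPerReplica(long entityCount, int dimension, double indexOverheadFactor) {
        double steadyState = rawVectorBytes(entityCount, dimension) * indexOverheadFactor;
        return (long) Math.ceil(steadyState * 2); // old index + new index
    }
}
```

For 100 million items at 128 dimensions, the raw vectors alone take 51.2 GB, and doubling the dimension doubles that figure.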

20. Vector Dimension Trade-Off

Higher dimension can improve quality but increases:

storage,
memory,
ANN latency,
network payload,
build time,
cache size.

Evaluate dimension:

64 vs 128 vs 256

using recall/latency/cost.

Do not choose dimension arbitrarily.

21. Overfetch Cost

Overfetch helps filtering but costs latency.

topK = desired_valid_candidates * overfetch_factor

If the filter rate is high, fix the partition/filter strategy.

Example:

desired 500
topK 5,000 (overfetch factor 10)

Consider partitioning the index by region/tenant instead.
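The relationship between filter pass rate and overfetch can be sketched as follows. This is a minimal illustration, assuming the filter pass rate is roughly stable and known from metrics, so that the result is an expected value rather than a guarantee; the class and method names are hypothetical.

```java
// Hypothetical sketch: how much to overfetch for a given post-filter pass rate.
final class OverfetchPlanner {
    private OverfetchPlanner() {}

    /** topK needed so that, on average, `desiredValid` candidates survive the filter. */
    static int requiredTopK(int desiredValid, double filterPassRate) {
        if (filterPassRate <= 0.0 || filterPassRate > 1.0) {
            throw new IllegalArgumentException("filterPassRate must be in (0, 1]");
        }
        return (int) Math.ceil(desiredValid / filterPassRate);
    }

    /** Overfetch factor implied by a given topK. */
    static double overfetchFactor(int desiredValid, int topK) {
        return (double) topK / desiredValid;
    }
}
```

An overfetch factor of 10 that is sized to just meet the target implies that only about 10% of ANN results survive filtering, which is the signal to partition the index instead of fetching more.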

22. Cache Economics

Cache is worth it if:

cache_hit_savings > cache_cost + staleness_risk

Measure:

hit rate,
miss cost,
stale rejection,
memory cost,
cache dependency cost.

High hit rate on cheap data may not matter. Low hit rate on expensive vector result may matter.
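The inequality above can be evaluated with rough numbers. This is a minimal sketch, assuming every quantity is expressed as a cost per second in the same unit and that staleness risk has been turned into an expected cost, which is itself a judgment call; the names are hypothetical.

```java
// Hypothetical sketch: is a cache worth its cost?
final class CacheEconomics {
    private CacheEconomics() {}

    /** Savings per second: every hit replaces a miss-path call with a cheaper cache read. */
    static double hitSavingsPerSecond(double lookupsPerSecond, double hitRate,
                                      double missCost, double hitCost) {
        return lookupsPerSecond * hitRate * (missCost - hitCost);
    }

    /** cache_hit_savings > cache_cost + staleness_risk, all as cost per second. */
    static boolean worthIt(double hitSavingsPerSecond, double cacheCostPerSecond,
                           double stalenessRiskCostPerSecond) {
        return hitSavingsPerSecond > cacheCostPerSecond + stalenessRiskCostPerSecond;
    }
}
```

At 1,000 lookups per second, a 95% hit rate on data that costs 0.00001 per miss saves far less than a 20% hit rate on a vector result that costs 0.001 per miss.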

23. Local Cache Economics

Local cache is useful for:

config,
model route,
rule bundle,
static metadata,
small fallback lists.

Cost:

memory,
stale risk,
per-instance warmup,
invalidation complexity.

Use size bounds.

24. Distributed Cache Economics

Distributed cache is useful for:

item features,
precomputed lists,
popular/trending lists,
profile snapshots.

Cost:

network latency,
cache cluster cost,
hot keys,
operational dependency.

A cache outage that overloads the backing source is dangerous.

25. Payload Optimization

Payload size affects:

network latency,
serialization CPU,
GC,
cache memory,
logging cost.

Reduce:

unnecessary fields,
verbose JSON in hot path,
huge debug payloads,
raw feature maps,
long candidate provenance for every request if not needed.

Use compact binary/internal DTO where appropriate.

26. Serialization and Java Performance

Java hot path considerations:

avoid excessive object allocation,
avoid reflection-heavy serialization in tight loops,
batch DTO conversion,
reuse immutable configs,
precompile model runtime structures,
keep feature matrix compact,
avoid boxed primitives for large candidate arrays,
watch GC.

A naive implementation of a ranking request with 1,000 candidates × 200 features can allocate on the order of 200,000 boxed values.

27. GC and Memory

Symptoms:

p99 latency spikes
GC pause
high allocation rate
large payloads
candidate arrays retained
debug traces too large

Optimization:

primitive arrays,
compact feature vectors,
bounded caches,
streaming logs,
avoid retaining full candidate objects after response,
sample debug traces.

Java performance is engineering, not magic.
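The "primitive arrays" and "compact feature vectors" points can be illustrated with a flat row-major matrix. This is a minimal sketch, assuming dense float features and a fixed feature count per candidate; `FeatureMatrix` is a hypothetical class, not part of any library.

```java
// Hypothetical sketch: one flat float[] instead of List<List<Double>>.
final class FeatureMatrix {
    private final int candidateCount;
    private final int featureCount;
    // Row-major: candidate i occupies [i * featureCount, (i + 1) * featureCount).
    private final float[] values;

    FeatureMatrix(int candidateCount, int featureCount) {
        this.candidateCount = candidateCount;
        this.featureCount = featureCount;
        this.values = new float[Math.multiplyExact(candidateCount, featureCount)];
    }

    void set(int candidate, int feature, float value) {
        values[index(candidate, feature)] = value;
    }

    float get(int candidate, int feature) {
        return values[index(candidate, feature)];
    }

    /** Dot product of one candidate row with a weight vector, without allocating. */
    float dot(int candidate, float[] weights) {
        if (weights.length != featureCount) {
            throw new IllegalArgumentException("weights length must equal featureCount");
        }
        int base = index(candidate, 0);
        float sum = 0f;
        for (int f = 0; f < featureCount; f++) {
            sum += values[base + f] * weights[f];
        }
        return sum;
    }

    private int index(int candidate, int feature) {
        if (candidate < 0 || candidate >= candidateCount || feature < 0 || feature >= featureCount) {
            throw new IndexOutOfBoundsException("candidate=" + candidate + ", feature=" + feature);
        }
        return candidate * featureCount + feature;
    }
}
```

For 1,000 candidates × 200 features this is a single array holding about 800 KB of float data, rather than 200,000 separately boxed values.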

28. Model Inference Batching

Batch candidates in one request.

Example:

score 800 candidates in one model call

Avoid per-candidate inference.

For remote model service, batch size affects:

latency,
throughput,
memory,
CPU/GPU utilization.

Tune batch size per model.

29. Concurrency Control

Limit:

max candidate source calls
max ranker concurrent requests
max feature store calls
max vector search requests
max batch jobs

Without limits, overload cascades.

Use bulkheads and backpressure.
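A bulkhead can be as small as a semaphore around a dependency call. This is a minimal sketch, assuming that rejecting immediately with a fallback is acceptable when the limit is reached; `Bulkhead` is a hypothetical class, and production code would typically add timeouts and metrics.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: cap concurrent calls to one dependency and shed the excess.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise returns the fallback without queueing. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // backpressure: do not pile up behind a slow dependency
        }
        try {
            return action.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

One instance per dependency (ranker, feature store, vector search) keeps a slow dependency from consuming every request thread.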

30. Autoscaling

Autoscale by relevant metrics.

Not only CPU.

Possible scaling signals:

QPS
in-flight requests
candidate_scores/sec
model inference queue depth
feature store p95 latency
vector search CPU/memory
cache hit/miss load
batch job backlog
stream lag

Choose signal per service.

31. Peak Traffic Planning

Plan for:

daily peaks,
campaign spikes,
holiday events,
email/push bursts,
tenant onboarding,
model rollout side-by-side,
cache cold start,
failover.

Capacity should include headroom.

normal peak + failure mode headroom
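One way to turn "normal peak + failure mode headroom" into an instance count is sketched below. This is a minimal illustration, assuming capacity is measured in candidate scores per second per instance and that the failure mode is the loss of a fixed number of instances; the names and the example numbers are hypothetical.

```java
// Hypothetical sketch: instance count for peak load plus failure headroom.
final class CapacityHeadroom {
    private CapacityHeadroom() {}

    /**
     * Instances needed so that peak load, inflated by headroomFraction,
     * still fits after losing `instancesLost` instances.
     */
    static int requiredInstances(double peakScoresPerSecond,
                                 double scoresPerSecondPerInstance,
                                 double headroomFraction,
                                 int instancesLost) {
        double target = peakScoresPerSecond * (1.0 + headroomFraction);
        int needed = (int) Math.ceil(target / scoresPerSecondPerInstance);
        return needed + instancesLost;
    }
}
```

For example, a peak of 1,600,000 scores/sec with 25% headroom on instances that each sustain 100,000 scores/sec needs 20 instances, or 22 if two may be lost at once.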

32. Multi-Region Capacity

If serving multiple regions:

data residency,
latency to feature store,
model/index replication,
regional cache,
failover,
regional traffic spikes.

Vector indexes and feature stores may need regional replicas.

Cross-region calls can kill latency.

33. Load Shedding

When overloaded, degrade intentionally:

disable optional sources
reduce candidate count
skip expensive feature groups
use fallback ranker
serve precomputed list
disable LLM explanation
turn off shadow traffic
limit debug
drop low-priority batch

Define degradation order.

34. Quality-Cost Modes

Create modes:

modes:
  full:
    candidates: 1000
    model: deep_ranker
  normal:
    candidates: 800
    model: gbdt_ranker
  degraded:
    candidates: 300
    model: fallback_ranker
  safe_fallback:
    source: trending_editorial

Switch based on load/SLO.
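Mode switching can be sketched as a small pure function over load and latency signals. This is a minimal illustration; the thresholds (0.60, 0.85, 0.95 utilization and 1x/2x the p99 SLO) are assumptions chosen only for the example, and the type names are hypothetical.

```java
// Hypothetical sketch: pick a quality-cost mode from load and SLO signals.
enum ServingMode { FULL, NORMAL, DEGRADED, SAFE_FALLBACK }

final class ModeSelector {
    private ModeSelector() {}

    /**
     * utilization is the fraction of capacity in use (0.0 to 1.0+).
     * All thresholds below are illustrative assumptions, not recommendations.
     */
    static ServingMode select(double utilization, double p99Millis, double sloP99Millis) {
        if (utilization >= 0.95 || p99Millis > 2.0 * sloP99Millis) {
            return ServingMode.SAFE_FALLBACK;
        }
        if (utilization >= 0.85 || p99Millis > sloP99Millis) {
            return ServingMode.DEGRADED;
        }
        if (utilization >= 0.60) {
            return ServingMode.NORMAL;
        }
        return ServingMode.FULL;
    }
}
```

In practice the selection usually needs hysteresis or a minimum dwell time so the service does not flap between modes.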

35. Shadow Traffic Cost

Shadow models/indexes cost money.

Track:

shadow scoring QPS
shadow candidate scoring volume
shadow index queries
shadow feature fetches

Limit shadow percentage.

Turn off shadow during incidents.

36. Experiment Cost

Experiments can increase cost:

treatment uses bigger model,
new source adds vector query,
more candidates,
LLM explanation,
extra logging.

Experiment spec should include cost estimate.

Guardrail:

cost per request must not increase by more than X%
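The guardrail can be checked mechanically once cost per request is measured for control and treatment. This is a minimal sketch, assuming both costs come from the same attribution pipeline and the same time window; the class and method names are hypothetical.

```java
// Hypothetical sketch: cost guardrail for an experiment.
final class CostGuardrail {
    private CostGuardrail() {}

    /** Relative increase of treatment over control, e.g. 0.04 for +4%. */
    static double relativeIncrease(double controlCostPerRequest, double treatmentCostPerRequest) {
        if (controlCostPerRequest <= 0.0) {
            throw new IllegalArgumentException("control cost must be positive");
        }
        return (treatmentCostPerRequest - controlCostPerRequest) / controlCostPerRequest;
    }

    /** True if the treatment stays within the allowed increase (maxIncreaseFraction = X / 100). */
    static boolean passes(double controlCostPerRequest, double treatmentCostPerRequest,
                          double maxIncreaseFraction) {
        return relativeIncrease(controlCostPerRequest, treatmentCostPerRequest) <= maxIncreaseFraction;
    }
}
```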

37. LLM Cost Control

LLM cost drivers:

tokens
model size
calls per request
retries
candidate count in context
conversation turns
prompt logging
validation calls

Controls:

use offline enrichment,
cache intent/explanation where safe,
smaller model for simple tasks,
template fallback,
cap candidates in prompt,
strict use-case gating,
tenant quotas.

LLM should not sit in high-QPS hot path unless justified.

38. Event Logging Cost

High-cardinality logs are expensive.

Manage:

log final slate always,
sample full candidate traces,
compress payloads,
retention policy,
separate debug logs,
avoid raw feature dump for every request,
aggregate metrics.

Logging is essential, but unbounded logs are expensive.

39. Observability Cost

Dashboards/metrics can be costly.

Avoid:

user_id/item_id as metric labels,
excessive high-cardinality tags,
per-feature metrics for thousands of features without sampling/aggregation,
full trace every request.

Use logs/traces for high-cardinality, metrics for bounded dimensions.

40. Training Cost

Training cost drivers:

dataset size
feature count
negative sampling ratio
model complexity
hyperparameter trials
retraining cadence
GPU/CPU requirements
data scan volume
backfill

Optimization:

incremental datasets,
feature pruning,
sample wisely,
early stopping,
reuse embeddings/features,
limit hyperparameter search,
train only when needed.

41. Index Build Cost

Index build costs:

embedding generation
index construction
validation
replication
shadow/canary
storage

Optimization:

delta index,
partition rebuild,
reuse unchanged embeddings,
build off-peak,
optimize dimension,
archive old versions.

Do not rebuild full index every hour if delta works.

42. Cost Attribution

Tag resources by:

service
surface
tenant
model version
candidate source
environment
batch job
experiment

Use cost allocation.

Without attribution, teams cannot optimize.

Cost should be visible to owners.

43. Unit Economics Dashboard

Dashboard:

cost/request
cost/1000 recommendations
cost/candidate score
cost/vector query
cost/model inference
cost/tenant
cost/surface
cost/conversion
cache savings
fallback cost
LLM cost

This connects engineering to business.

44. Performance Testing

Test:

normal load
peak load
cold cache
cache outage
feature store slow
candidate source timeout
large candidate pool
shadow model enabled
tenant burst
batch job overlap

Measure p95/p99, saturation, fallback, cost.

45. Profiling

Profile hot path:

CPU,
allocation,
serialization,
feature assembly,
model inference,
network waits,
cache misses,
GC.

Use flame graphs/profilers.

Performance optimization should be evidence-driven.

46. Common Failure Modes

46.1 Planning by QPS Only

Candidate scoring volume ignored.

46.2 Per-Candidate Remote Calls

Latency/cost explosion.

46.3 Huge Feature Payload

Network/GC bottleneck.

46.4 Deep Ranker Scores Too Many Candidates

Cost spike.

46.5 Cache Without Correctness Metrics

Stale bad recommendations.

46.6 Shadow Traffic Too Expensive

Unexpected bill.

46.7 Batch Scoring All Users Daily Unnecessarily

Waste.

46.8 No Tenant Cost Attribution

Enterprise margin unknown.

46.9 Logging Everything Forever

Storage explosion.

46.10 No Degradation Mode

Overload becomes outage.

47. Implementation Sketch: Capacity Estimate

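/** Back-of-envelope serving capacity estimate for one surface. */
public record ServingCapacityEstimate(
    long requestQps,
    int candidatesPerRequest,
    int featureValuesPerCandidate,
    int vectorQueriesPerRequest
) {
    /** Ranking load: the unit the ranker should be sized in. */
    public long candidateScoresPerSecond() {
        return requestQps * candidatesPerRequest;
    }

    /** Item-side feature values only; per-request user features are not included. */
    public long featureValuesPerSecond() {
        return requestQps * candidatesPerRequest * featureValuesPerCandidate;
    }

    /** ANN queries per second across all vector sources. */
    public long vectorQueriesPerSecond() {
        return requestQps * vectorQueriesPerRequest;
    }
}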

Use simple models early.

48. Implementation Sketch: Cost Attribution Tag

public record CostAttribution(
    String surface,
    String tenantId,
    String modelVersion,
    String candidatePolicyVersion,
    String experimentId,
    String serviceName
) {}

Attach to logs/metrics where feasible.

49. Minimal Production Cost/Capacity Plan

Start with:

capacity:
  request_qps_by_surface: true
  candidate_scores_per_sec: true
  vector_queries_per_sec: true
  feature_values_per_sec: true
performance:
  stage_latency_p95_p99: true
  candidate_count_caps: true
  batch_feature_fetch: true
  batch_model_inference: true
cost:
  cost_by_service: true
  cost_by_surface: true
  tenant_cost_attribution: true
optimization:
  cache_hit_rate: true
  final_filter_rejection_rate: true
  feature_pruning: quarterly
  candidate_funnel_metrics: true
resilience:
  degradation_modes: true
  load_shedding: true

50. Cost, Capacity, and Performance Readiness Checklist

[ ] QPS is tracked by surface/tenant.
[ ] Candidate scores/sec is tracked.
[ ] Feature values/sec and payload size are estimated.
[ ] Vector search QPS/topK/overfetch are tracked.
[ ] Candidate count caps exist.
[ ] Feature fetch and model inference are batched.
[ ] Cost is attributed by service/surface/tenant.
[ ] Cache hit/miss/stale metrics exist.
[ ] Model inference latency/cost is tracked by version.
[ ] Batch scoring volume and cost are tracked.
[ ] Embedding/index build cost is tracked.
[ ] Shadow/experiment cost is tracked.
[ ] LLM cost quotas exist if LLM is used.
[ ] Load tests include cold cache and dependency failures.
[ ] Degradation/load-shedding modes exist.
[ ] Performance profiling is evidence-driven.
[ ] Observability/logging cost is controlled.
[ ] Capacity plan includes rollout/failover headroom.

51. Conclusion

Cost, capacity, and performance engineering ensure that the recommendation platform can serve at large scale without poor latency or uncontrolled cost.

Key principles:

Optimize cost per useful decision, not just total bill.
Candidate scores/sec is more important than request QPS alone.
Feature lookup volume can dominate cost.
Vector search topK/overfetch/dimension directly affect latency and memory.
Two-stage ranking controls expensive model cost.
Caching must be evaluated by savings and correctness risk.
Batch scoring trades online latency for offline compute cost.
Shadow/experiments/LLM/debug logging can silently increase cost.
Cost attribution by surface/tenant/model/source enables ownership.
Degradation modes are part of performance engineering.

In Part 074, we will cover Operating Model and Team Topology: how to shape teams, ownership, review process, on-call, governance, roadmap, platform boundaries, and the collaboration model for running an enterprise-grade RecSys.
