title: Build From Scratch Recommendations System - Part 073
description: "Designing cost, capacity, and performance engineering for a production-grade recommendation system: QPS, candidate scoring volume, feature store load, vector search capacity, model inference cost, batch pipeline cost, cache economics, autoscaling, load shedding, and optimization strategy."
series: learn-build-from-scratch-recommendations-system
seriesTitle: "Build From Scratch: Enterprise Recommendations System"
order: 73
partTitle: Cost, Capacity, and Performance Engineering
tags:
- recommendation-system
- recsys
- cost-engineering
- capacity-planning
- performance
- scalability
- series
date: 2026-07-02
Part 073 — Cost, Capacity, and Performance Engineering
A large-scale recommendation system is not only a model quality problem.
It is also an economics problem.
A single recommendation request can trigger:
candidate generation
vector search
feature lookup
profile lookup
eligibility batch check
model inference for hundreds or thousands of candidates
reranking
decision logging
event ingestion
observability
At high QPS, cost can explode.
A simple example:
2,000 QPS * 800 candidates/request = 1,600,000 candidate scores/second
If every candidate needs dozens of feature lookups and expensive model inference, the system can become very expensive even before it serves global traffic.
This part covers cost, capacity, and performance engineering for a production-grade recommendation system: QPS, candidate scoring volume, vector search, feature store, model inference, cache economics, batch scoring, autoscaling, load shedding, cost attribution, and optimization strategy.
1. Mental Model: Optimize Cost per Useful Decision
RecSys cost should be viewed as:
cost per request
cost per candidate scored
cost per useful impression
cost per conversion/outcome
cost per tenant/surface
cost per model/source/policy
Not only as the total infrastructure bill.
Goal:
maximize product value per unit cost
A high-quality recommendation that is too expensive to serve in production is not a solution.
2. Cost Drivers
Major cost drivers:
QPS
candidate count per request
feature count per candidate
feature store latency/load
vector search topK/overfetch
model inference complexity
deep model/GPU usage
cache hit rate
batch scoring volume
embedding/index build frequency
event logging volume
debug trace sampling
LLM calls
multi-tenant isolation overhead
Know your cost drivers before optimizing.
3. Capacity Formula: Candidate Scores per Second
Most important formula:
candidate_scores_per_second = request_qps * candidates_ranked_per_request
Example:
QPS = 5,000
candidates/request = 1,000
candidate_scores/sec = 5,000 * 1,000 = 5,000,000
Ranking capacity should be planned in candidate scores/sec, not only requests/sec.
4. Feature Lookup Volume
Feature volume:
feature_values_per_second =
QPS * candidates_per_request * item_features_per_candidate
+ QPS * user_features_per_request
Example:
QPS = 2,000
candidates = 800
item_features = 50
item feature values/sec = 2,000 * 800 * 50 = 80,000,000
Even if values are batched, payload and compute matter.
5. Vector Search Capacity
Vector search capacity depends on:
QPS
topK
overfetch factor
dimension
index algorithm
filtering
partition count
latency target
replication
Example:
2 vector sources/request
2,000 QPS
topK 1,000
This is 4,000 ANN queries/sec, each returning up to 1,000 candidates.
If each query searches a large HNSW index with filters, capacity planning matters.
6. End-to-End Cost Model
Create cost model:
surface: home_feed
qps_peak: 2000
candidate_sources:
  two_tower:
    calls_per_request: 1
    cost_per_1k_calls: 0.02
  trending:
    calls_per_request: 1
    cost_per_1k_calls: 0.001
ranking:
  candidates_ranked: 800
  cost_per_million_scores: 0.50
feature_store:
  batch_calls_per_request: 3
  avg_payload_kb: 120
logging:
  decision_log_kb: 8
  impression_events_per_request: 20
Approximation is better than surprise.
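As a rough illustration, the cost model above can be turned into a cost rate at peak load. This is a minimal sketch, not a billing tool: the class name `HomeFeedCostEstimate` is hypothetical, the currency unit is an assumption (the model does not state one), and feature store and logging are left out because the model gives them no unit price.

```java
/** Hypothetical helper that prices the home_feed cost model at peak load. */
final class HomeFeedCostEstimate {
    // Numbers copied from the cost model above; the currency unit is an assumption.
    static final double QPS_PEAK = 2_000;
    static final double TWO_TOWER_COST_PER_1K_CALLS = 0.02;
    static final double TRENDING_COST_PER_1K_CALLS = 0.001;
    static final double CANDIDATES_RANKED = 800;
    static final double COST_PER_MILLION_SCORES = 0.50;

    /** Candidate source cost per second: each source is called once per request. */
    static double sourceCostPerSecond() {
        double costPerRequest =
                (TWO_TOWER_COST_PER_1K_CALLS + TRENDING_COST_PER_1K_CALLS) / 1_000.0;
        return QPS_PEAK * costPerRequest;
    }

    /** Ranking cost per second: QPS * candidates ranked * unit cost per score. */
    static double rankingCostPerSecond() {
        return QPS_PEAK * CANDIDATES_RANKED * COST_PER_MILLION_SCORES / 1_000_000.0;
    }

    /** Sum of the priced stages, scaled from one second to one hour at peak. */
    static double totalCostPerHour() {
        return (sourceCostPerSecond() + rankingCostPerSecond()) * 3_600.0;
    }

    public static void main(String[] args) {
        System.out.printf("sources/sec=%.3f ranking/sec=%.3f total/hour=%.1f%n",
                sourceCostPerSecond(), rankingCostPerSecond(), totalCostPerHour());
    }
}
```

With these inputs, ranking costs about 0.80 per second against about 0.042 for candidate sources, so the candidate count is the first lever to examine.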
7. Cost by Stage
Track cost by stage:
| Stage | Cost Driver |
|---|---|
| Candidate generation | source calls, vector queries |
| Feature serving | lookups, payload, cache miss |
| Ranking | candidate scores, model complexity |
| Reranking | slate optimization complexity |
| Logging | event volume, storage retention |
| Offline training | dataset size, compute |
| Embeddings/index | item count, dimension, rebuild cadence |
| Batch scoring | subjects × candidates |
| LLM | tokens, model, retries |
Without attribution, cost optimization becomes guesswork.
8. Surface-Level Cost
Different surfaces have different economics.
home feed: high QPS, low latency
PDP recommendations: moderate QPS, seed-based
email: batch-heavy, no online latency
push: low volume, high trust
enterprise actions: lower QPS, high correctness
search suggestions: strict latency
Optimize per surface.
Do not use the same expensive pipeline everywhere.
9. Tenant-Level Cost
For enterprise:
cost per tenant
cost per active user
cost per case
cost per recommended action
cost per LLM explanation
cost per batch scoring run
Tenant cost helps:
- pricing,
- quota,
- capacity,
- abuse detection,
- SLA planning.
Large tenants can dominate platform cost.
10. Candidate Count Optimization
Candidate count affects:
- ranking cost,
- feature cost,
- latency,
- memory,
- payload size.
Tune:
candidate source quota
dedup before ranking
eligibility before feature fetch
pre-rank/filter before expensive scoring
rank top M then rerank top N
Do not rank 10,000 candidates if 800 give the same quality.
11. Candidate Funnel Metrics
Track:
generated candidates
after dedup
after eligibility
after prefilter
ranked
reranked
final slate
clicked/converted
If a source generates many candidates that never survive the funnel, reduce its quota or improve the source.
Cost follows candidate funnel.
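The funnel counts become actionable once they are expressed as stage-to-stage survival rates. The sketch below is illustrative only: `CandidateFunnel`, the stage names, and the sample counts are assumptions, not measurements from a real system.

```java
/** Hypothetical funnel report: how many candidates survive each stage. */
final class CandidateFunnel {
    /**
     * counts[i] is the number of candidates left after stage i, in funnel order.
     * Returns survival[i] = counts[i] / counts[i - 1], with survival[0] = 1.0.
     */
    static double[] stageSurvival(long[] counts) {
        double[] survival = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            if (i == 0) {
                survival[i] = 1.0;
            } else {
                // An empty previous stage means nothing could survive.
                survival[i] = counts[i - 1] == 0 ? 0.0 : (double) counts[i] / counts[i - 1];
            }
        }
        return survival;
    }

    public static void main(String[] args) {
        // Sample numbers are made up for illustration.
        String[] stages = {"generated", "after dedup", "after eligibility", "ranked", "final slate"};
        long[] counts = {5_000, 4_000, 2_000, 800, 20};
        double[] survival = stageSurvival(counts);
        for (int i = 0; i < stages.length; i++) {
            System.out.printf("%-18s %6d  %5.1f%%%n", stages[i], counts[i], survival[i] * 100);
        }
    }
}
```

A stage with a very low survival rate is a hint that upstream work (and its cost) is being thrown away.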
12. Two-Stage Ranking
Pattern:
candidate pool 5000
cheap pre-ranker selects 800
expensive ranker scores 800
reranker final 20
This reduces cost.
Pre-ranker can use:
- source scores,
- simple GBDT,
- cached features,
- heuristic quality filters.
Deep ranker should not score everything.
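The pattern can be sketched as two top-K passes, where the expensive scorer only ever sees the shortlist. This is a simplified sketch under stated assumptions: candidates are plain `int` ids, both scorers are passed in as functions, and the names `TwoStageRanker`, `topK`, and `rank` are hypothetical.

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;

/** Hypothetical two-stage ranker: cheap pre-rank first, expensive rank on the shortlist. */
final class TwoStageRanker {
    /** Scores every candidate once and returns the k highest-scoring ids, best first. */
    static int[] topK(int[] candidateIds, IntToDoubleFunction scorer, int k) {
        double[] scores = new double[candidateIds.length];
        Integer[] order = new Integer[candidateIds.length];
        for (int i = 0; i < candidateIds.length; i++) {
            scores[i] = scorer.applyAsDouble(candidateIds[i]);
            order[i] = i;
        }
        // Sort positions by descending score; the sort is stable, so ties keep input order.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        int n = Math.min(k, candidateIds.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = candidateIds[order[i]];
        }
        return top;
    }

    /** The expensive scorer is called at most preRankK times, never on the full pool. */
    static int[] rank(int[] pool, IntToDoubleFunction cheapScorer,
                      IntToDoubleFunction expensiveScorer, int preRankK, int finalK) {
        int[] shortlist = topK(pool, cheapScorer, preRankK);
        return topK(shortlist, expensiveScorer, finalK);
    }
}
```

For the numbers above, `rank(pool, preRanker, deepRanker, 800, 20)` would call the deep ranker 800 times per request instead of 5000. A full sort is used for clarity; a bounded heap would avoid sorting the whole pool.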
13. Feature Cost Optimization
Strategies:
- reduce feature count,
- batch lookup,
- group features,
- cache static item features,
- compute cross features only for top candidates,
- use missing/default policy,
- remove unused features,
- monitor feature importance,
- avoid huge categorical payloads,
- use compact encoding.
Feature cost is often hidden.
14. Feature Pruning
Remove features with:
low importance
high serving cost
high missing rate
high privacy risk
unstable values
duplicative signal
Feature value should justify:
latency + compute + storage + governance cost
Do not keep a feature just because it is “maybe useful”.
15. Model Complexity Trade-Off
Model choices:
| Model | Serving Cost | Notes |
|---|---|---|
| heuristic | very low | fallback/baseline |
| GBDT small | low | strong tabular baseline |
| GBDT large | medium | latency grows |
| deep ranker | high | powerful but expensive |
| cross-encoder/LLM reranker | very high | small K only |
Choose model by marginal lift vs cost.
16. Model Distillation
Distillation:
train smaller model to mimic larger model
Use cases:
- deep model offline teacher,
- smaller online student,
- LLM/expert reranker teacher,
- expensive feature teacher.
Goal:
capture most quality at lower serving cost
Measure quality/cost frontier.
17. Batch vs Online Scoring
If online scoring is expensive and the context is stable:
batch score candidates
store precomputed list
online final-check/rerank lightly
Good for:
- email,
- digest,
- fallback,
- expensive deep model,
- low-latency home feed.
But batch scoring can also be expensive at huge scale.
18. Batch Scoring Cost
Formula:
batch_scores = subjects * candidates_per_subject
Example:
20M users * 3,000 candidates = 60B scores
Optimization:
- score active users only,
- incremental refresh,
- smaller candidate pool,
- segment-level lists,
- use a simpler batch ranker,
- store somewhat more than the served top-N, but not much more,
- schedule off-peak.
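The effect of the first and third optimizations can be estimated with the same formula. In this sketch the class name, the 25% active-user fraction, and the reduced pool of 1,000 candidates are illustrative assumptions, not figures from a real workload.

```java
/** Hypothetical batch scoring estimate based on batch_scores = subjects * candidates_per_subject. */
final class BatchScoringEstimate {
    /** Total scores for one batch run; fails loudly instead of overflowing silently. */
    static long batchScores(long subjects, long candidatesPerSubject) {
        return Math.multiplyExact(subjects, candidatesPerSubject);
    }

    /** Same estimate when only a fraction of subjects (for example, active users) is scored. */
    static long activeOnlyScores(long subjects, double activeFraction, long candidatesPerSubject) {
        if (activeFraction < 0.0 || activeFraction > 1.0) {
            throw new IllegalArgumentException("activeFraction must be in [0, 1]");
        }
        long activeSubjects = Math.round(subjects * activeFraction);
        return batchScores(activeSubjects, candidatesPerSubject);
    }

    public static void main(String[] args) {
        long full = batchScores(20_000_000L, 3_000);
        // Assumed: 25% of users are active and the pool shrinks to 1,000 candidates.
        long reduced = activeOnlyScores(20_000_000L, 0.25, 1_000);
        System.out.println("full=" + full + " reduced=" + reduced);
    }
}
```

Under those assumptions the run drops from 60B to 5B scores, a 12x reduction before any change to the model itself.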
19. Embedding and Index Cost
Cost drivers:
entity count
dimension
embedding model cost
recompute cadence
index algorithm overhead
replication
shadow/canary index
delta index
memory footprint
Index rollout may require old and new index loaded simultaneously.
Capacity plan for 2x memory during rollout.
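A back-of-the-envelope memory estimate makes the 2x rule concrete. This sketch assumes float32 vectors and models index structure overhead (graph links, metadata) as a single multiplier; `IndexMemoryEstimate` and `indexOverheadFactor` are hypothetical names, and the real overhead depends on the index algorithm and its parameters.

```java
/** Hypothetical memory estimate for a vector index, including rollout headroom. */
final class IndexMemoryEstimate {
    /** Bytes needed for the raw float32 vectors alone. */
    static long rawVectorBytes(long entityCount, int dimension) {
        long values = Math.multiplyExact(entityCount, (long) dimension);
        return Math.multiplyExact(values, (long) Float.BYTES);
    }

    /**
     * Peak bytes while the old and new index are loaded side by side.
     * indexOverheadFactor is an assumed multiplier (>= 1.0) on top of the raw vectors.
     */
    static long rolloutPeakBytes(long entityCount, int dimension, double indexOverheadFactor) {
        double steadyState = rawVectorBytes(entityCount, dimension) * indexOverheadFactor;
        return (long) Math.ceil(steadyState * 2);
    }

    public static void main(String[] args) {
        // Illustrative inputs: 10M items, 128 dimensions, 1.5x assumed index overhead.
        System.out.println(rawVectorBytes(10_000_000L, 128));
        System.out.println(rolloutPeakBytes(10_000_000L, 128, 1.5));
    }
}
```

The same function also shows how dimension scales the footprint: doubling the dimension doubles the raw vector bytes per replica.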
20. Vector Dimension Trade-Off
Higher dimension can improve quality but increases:
- storage,
- memory,
- ANN latency,
- network payload,
- build time,
- cache size.
Evaluate dimension:
64 vs 128 vs 256
using recall/latency/cost.
Do not choose dimension arbitrarily.
21. Overfetch Cost
Overfetch helps filtering but costs latency.
topK = desired_valid_candidates * overfetch_factor
If the filter rate is high, fix the partition/filter strategy.
Example:
desired 500 valid candidates
topK 5,000 (overfetch factor 10)
It may be cheaper to partition the index by region/tenant instead.
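The required topK follows from how many retrieved candidates pass the post-retrieval filter. The sketch below assumes the pass rate is measured and roughly stable; `OverfetchEstimate` and `requiredTopK` are hypothetical names.

```java
/** Hypothetical overfetch estimate: how large topK must be for a given filter pass rate. */
final class OverfetchEstimate {
    /**
     * filterPassRate is the assumed fraction of retrieved candidates that survive filtering.
     * The implied overfetch factor is 1 / filterPassRate.
     */
    static int requiredTopK(int desiredValidCandidates, double filterPassRate) {
        if (filterPassRate <= 0.0 || filterPassRate > 1.0) {
            throw new IllegalArgumentException("filterPassRate must be in (0, 1]");
        }
        return (int) Math.ceil(desiredValidCandidates / filterPassRate);
    }

    public static void main(String[] args) {
        // With a 25% pass rate, 500 valid candidates need a topK of 2000.
        System.out.println(requiredTopK(500, 0.25));
    }
}
```

A topK of 5,000 for 500 desired candidates corresponds to a pass rate of roughly 10%, which is the kind of ratio where partitioning the index usually deserves a look.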
22. Cache Economics
Cache is worth it if:
cache_hit_savings > cache_cost + staleness_risk
Measure:
- hit rate,
- miss cost,
- stale rejection,
- memory cost,
- cache dependency cost.
High hit rate on cheap data may not matter. Low hit rate on expensive vector result may matter.
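The inequality can be approximated numerically for the measurable part. In this sketch all parameter names are assumptions, costs share one arbitrary unit, and staleness risk is deliberately left out because it has to be judged separately (for example via stale rejection rate).

```java
/** Hypothetical cache economics estimate; staleness risk is not modeled here. */
final class CacheEconomics {
    /**
     * Net savings per second = lookups * hitRate * (missCost - hitCost) - cacheCostPerSecond.
     * A negative result means the cache costs more than it saves.
     */
    static double netSavingsPerSecond(double lookupsPerSecond, double hitRate,
                                      double missCost, double hitCost,
                                      double cacheCostPerSecond) {
        if (hitRate < 0.0 || hitRate > 1.0) {
            throw new IllegalArgumentException("hitRate must be in [0, 1]");
        }
        double savedBySourceAvoidance = lookupsPerSecond * hitRate * (missCost - hitCost);
        return savedBySourceAvoidance - cacheCostPerSecond;
    }

    public static void main(String[] args) {
        // Cheap data, high hit rate: 95% hits on a lookup that is barely cheaper than the source.
        System.out.println(netSavingsPerSecond(1_000, 0.95, 0.0002, 0.0001, 0.5));
        // Expensive vector result, low hit rate: 20% hits still pay for the cache.
        System.out.println(netSavingsPerSecond(1_000, 0.20, 0.01, 0.0001, 0.5));
    }
}
```

The two sample calls mirror the point above: the first yields a negative number despite the 95% hit rate, the second a positive one at only 20%.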
23. Local Cache Economics
Local cache is useful for:
- config,
- model route,
- rule bundle,
- static metadata,
- small fallback lists.
Cost:
- memory,
- stale risk,
- per-instance warmup,
- invalidation complexity.
Use size bounds.
24. Distributed Cache Economics
Distributed cache is useful for:
- item features,
- precomputed lists,
- popular/trending lists,
- profile snapshots.
Cost:
- network latency,
- cache cluster cost,
- hot keys,
- operational dependency.
If a cache outage would overload the source, the cache has become a dangerous dependency.
25. Payload Optimization
Payload size affects:
- network latency,
- serialization CPU,
- GC,
- cache memory,
- logging cost.
Reduce:
- unnecessary fields,
- verbose JSON in hot path,
- huge debug payloads,
- raw feature maps,
- long candidate provenance for every request if not needed.
Use compact binary/internal DTO where appropriate.
26. Serialization and Java Performance
Java hot path considerations:
- avoid excessive object allocation,
- avoid reflection-heavy serialization in tight loops,
- batch DTO conversion,
- reuse immutable configs,
- precompile model runtime structures,
- keep feature matrix compact,
- avoid boxed primitives for large candidate arrays,
- watch GC.
A ranking request with 1,000 candidates × 200 features touches 200,000 values and can allocate a very large number of short-lived objects if implemented naively.
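One common way to keep the feature matrix compact is a single flat primitive array in row-major order instead of a map of boxed values per candidate. This is a minimal sketch; `FeatureMatrix` and its methods are hypothetical, and it assumes all features are numeric and fit in a `float`.

```java
/** Hypothetical compact feature matrix: one float[] for all candidates, no boxing. */
final class FeatureMatrix {
    private final int candidateCount;
    private final int featureCount;
    // Row-major layout: candidate i occupies [i * featureCount, (i + 1) * featureCount).
    private final float[] values;

    FeatureMatrix(int candidateCount, int featureCount) {
        this.candidateCount = candidateCount;
        this.featureCount = featureCount;
        this.values = new float[Math.multiplyExact(candidateCount, featureCount)];
    }

    void set(int candidate, int feature, float value) {
        values[index(candidate, feature)] = value;
    }

    float get(int candidate, int feature) {
        return values[index(candidate, feature)];
    }

    /** Size of the value payload in bytes (array header not included). */
    long payloadBytes() {
        return (long) values.length * Float.BYTES;
    }

    private int index(int candidate, int feature) {
        if (candidate < 0 || candidate >= candidateCount
                || feature < 0 || feature >= featureCount) {
            throw new IndexOutOfBoundsException(candidate + "," + feature);
        }
        return candidate * featureCount + feature;
    }
}
```

For 1,000 candidates × 200 features this is one array allocation holding 800,000 bytes of values, rather than one boxed object per feature value.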
27. GC and Memory
Symptoms:
p99 latency spikes
GC pause
high allocation rate
large payloads
candidate arrays retained
debug traces too large
Optimization:
- primitive arrays,
- compact feature vectors,
- bounded caches,
- streaming logs,
- avoid retaining full candidate objects after response,
- sample debug traces.
Java performance is engineering, not magic.
28. Model Inference Batching
Batch candidates in one request.
Example:
score 800 candidates in one model call
Avoid per-candidate inference.
For remote model service, batch size affects:
- latency,
- throughput,
- memory,
- CPU/GPU utilization.
Tune batch size per model.
29. Concurrency Control
Limit:
max candidate source calls
max ranker concurrent requests
max feature store calls
max vector search requests
max batch jobs
Without limits, overload cascades.
Use bulkheads and backpressure.
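A bulkhead can be as small as a semaphore that rejects work once the limit is reached, so callers fall back instead of queueing. The sketch below uses `java.util.concurrent.Semaphore` from the standard library; the `Bulkhead` class and its `call` method are hypothetical, and production code would typically also add timeouts and metrics.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bulkhead: caps concurrent calls to one dependency and sheds the excess. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /**
     * Runs primary if a permit is free; otherwise returns fallback immediately.
     * tryAcquire() does not block, so an overloaded dependency cannot pile up waiting threads.
     */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return primary.get();
        } finally {
            permits.release();
        }
    }
}
```

One instance per dependency (ranker, feature store, vector search) keeps a slow dependency from consuming the capacity of the others.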
30. Autoscaling
Autoscale by relevant metrics.
Not only CPU.
Possible scaling signals:
QPS
in-flight requests
candidate_scores/sec
model inference queue depth
feature store p95 latency
vector search CPU/memory
cache hit/miss load
batch job backlog
stream lag
Choose signal per service.
31. Peak Traffic Planning
Plan for:
- daily peaks,
- campaign spikes,
- holiday events,
- email/push bursts,
- tenant onboarding,
- model rollout side-by-side,
- cache cold start,
- failover.
Capacity should include headroom.
required capacity = normal peak + failure-mode headroom
32. Multi-Region Capacity
If serving multiple regions, plan for:
- data residency,
- latency to feature store,
- model/index replication,
- regional cache,
- failover,
- regional traffic spikes.
Vector indexes and feature stores may need regional replicas.
Cross-region calls can kill latency.
33. Load Shedding
When overloaded, degrade intentionally:
disable optional sources
reduce candidate count
skip expensive feature groups
use fallback ranker
serve precomputed list
disable LLM explanation
turn off shadow traffic
limit debug
drop low-priority batch
Define degradation order.
34. Quality-Cost Modes
Create modes:
- mode: full
  candidates: 1000
  model: deep_ranker
- mode: normal
  candidates: 800
  model: gbdt_ranker
- mode: degraded
  candidates: 300
  model: fallback_ranker
- mode: safe_fallback
  source: trending_editorial
Switch based on load/SLO.
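Mode switching can be sketched as a pure function from load signals to a mode. The utilization thresholds (0.60, 0.85, 0.95) and the names `ServingMode` and `ServingModeSelector` are illustrative assumptions; real thresholds should come from load tests and SLOs.

```java
/** The four quality-cost modes described above. */
enum ServingMode { FULL, NORMAL, DEGRADED, SAFE_FALLBACK }

/** Hypothetical selector; thresholds are placeholders, not recommendations. */
final class ServingModeSelector {
    /**
     * utilization is the fraction of planned capacity in use (0.0 to 1.0 and above).
     * dependencyOutage forces the safest mode regardless of load.
     */
    static ServingMode select(double utilization, boolean dependencyOutage) {
        if (dependencyOutage || utilization >= 0.95) {
            return ServingMode.SAFE_FALLBACK;
        }
        if (utilization >= 0.85) {
            return ServingMode.DEGRADED;
        }
        if (utilization >= 0.60) {
            return ServingMode.NORMAL;
        }
        return ServingMode.FULL;
    }
}
```

Keeping the decision in one place makes the degradation order explicit and testable. In practice some hysteresis is usually added so the mode does not flap around a threshold.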
35. Shadow Traffic Cost
Shadow models/indexes cost money.
Track:
shadow scoring QPS
shadow candidate scoring volume
shadow index queries
shadow feature fetches
Limit shadow percentage.
Turn off shadow during incidents.
36. Experiment Cost
Experiments can increase cost:
- treatment uses bigger model,
- new source adds vector query,
- more candidates,
- LLM explanation,
- extra logging.
Experiment spec should include cost estimate.
Guardrail:
cost per request must not increase by more than X%
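The guardrail reduces to a relative-increase check between control and treatment. In this sketch `CostGuardrail` and its parameters are hypothetical, and the per-request costs are assumed to come from whatever cost attribution the platform already has.

```java
/** Hypothetical experiment cost guardrail based on relative cost per request. */
final class CostGuardrail {
    /** Relative increase of treatment over control, in percent (negative means cheaper). */
    static double increasePercent(double controlCostPerRequest, double treatmentCostPerRequest) {
        if (controlCostPerRequest <= 0.0) {
            throw new IllegalArgumentException("control cost must be positive");
        }
        return (treatmentCostPerRequest - controlCostPerRequest) / controlCostPerRequest * 100.0;
    }

    /** True when the treatment stays within the allowed increase of X percent. */
    static boolean withinBudget(double controlCostPerRequest, double treatmentCostPerRequest,
                                double maxIncreasePercent) {
        return increasePercent(controlCostPerRequest, treatmentCostPerRequest) <= maxIncreasePercent;
    }
}
```

For example, `withinBudget(1.0, 1.2, 5.0)` is false because the treatment costs 20% more per request than control.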
37. LLM Cost Control
LLM cost drivers:
tokens
model size
calls per request
retries
candidate count in context
conversation turns
prompt logging
validation calls
Controls:
- use offline enrichment,
- cache intent/explanation where safe,
- smaller model for simple tasks,
- template fallback,
- cap candidates in prompt,
- strict use-case gating,
- tenant quotas.
LLM should not sit in high-QPS hot path unless justified.
38. Event Logging Cost
High-cardinality logs are expensive.
Manage:
- log final slate always,
- sample full candidate traces,
- compress payloads,
- retention policy,
- separate debug logs,
- avoid raw feature dump for every request,
- aggregate metrics.
Logging is essential, but unbounded logs are expensive.
39. Observability Cost
Dashboards/metrics can be costly.
Avoid:
- user_id/item_id as metric labels,
- excessive high-cardinality tags,
- per-feature metrics for thousands of features without sampling/aggregation,
- full trace every request.
Use logs/traces for high-cardinality data and metrics for bounded dimensions.
40. Training Cost
Training cost drivers:
dataset size
feature count
negative sampling ratio
model complexity
hyperparameter trials
retraining cadence
GPU/CPU requirements
data scan volume
backfill
Optimization:
- incremental datasets,
- feature pruning,
- sample wisely,
- early stopping,
- reuse embeddings/features,
- limit hyperparameter search,
- train only when needed.
41. Index Build Cost
Index build costs:
embedding generation
index construction
validation
replication
shadow/canary
storage
Optimization:
- delta index,
- partition rebuild,
- reuse unchanged embeddings,
- build off-peak,
- optimize dimension,
- archive old versions.
Do not rebuild full index every hour if delta works.
42. Cost Attribution
Tag resources by:
service
surface
tenant
model version
candidate source
environment
batch job
experiment
Use cost allocation.
Without attribution, teams cannot optimize.
Cost should be visible to owners.
43. Unit Economics Dashboard
Dashboard:
cost/request
cost/1000 recommendations
cost/candidate score
cost/vector query
cost/model inference
cost/tenant
cost/surface
cost/conversion
cache savings
fallback cost
LLM cost
This connects engineering to business.
44. Performance Testing
Test:
normal load
peak load
cold cache
cache outage
feature store slow
candidate source timeout
large candidate pool
shadow model enabled
tenant burst
batch job overlap
Measure p95/p99, saturation, fallback, cost.
45. Profiling
Profile hot path:
- CPU,
- allocation,
- serialization,
- feature assembly,
- model inference,
- network waits,
- cache misses,
- GC.
Use flame graphs/profilers.
Performance optimization should be evidence-driven.
46. Common Failure Modes
46.1 Planning by QPS Only
Candidate scoring volume ignored.
46.2 Per-Candidate Remote Calls
Latency/cost explosion.
46.3 Huge Feature Payload
Network/GC bottleneck.
46.4 Deep Ranker Scores Too Many Candidates
Cost spike.
46.5 Cache Without Correctness Metrics
Stale bad recommendations.
46.6 Shadow Traffic Too Expensive
Unexpected bill.
46.7 Batch Scoring All Users Daily Unnecessarily
Waste.
46.8 No Tenant Cost Attribution
Enterprise margin unknown.
46.9 Logging Everything Forever
Storage explosion.
46.10 No Degradation Mode
Overload becomes outage.
47. Implementation Sketch: Capacity Estimate
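/** Back-of-the-envelope serving load derived from peak request QPS. */
public record ServingCapacityEstimate(
    long requestQps,
    int candidatesPerRequest,
    int featureValuesPerCandidate,
    int vectorQueriesPerRequest
) {
    /** Candidate scores the ranker must produce per second. */
    public long candidateScoresPerSecond() {
        return requestQps * candidatesPerRequest;
    }

    /** Per-candidate feature values fetched per second (user-level features excluded). */
    public long featureValuesPerSecond() {
        return requestQps * candidatesPerRequest * featureValuesPerCandidate;
    }

    /** ANN queries issued per second across all vector sources. */
    public long vectorQueriesPerSecond() {
        return requestQps * vectorQueriesPerRequest;
    }
}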
Use simple models early.
48. Implementation Sketch: Cost Attribution Tag
public record CostAttribution(
    String surface,
    String tenantId,
    String modelVersion,
    String candidatePolicyVersion,
    String experimentId,
    String serviceName
) {}
Attach to logs/metrics where feasible.
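Once cost records carry these tags, a rollup by any dimension is a simple aggregation. The sketch below uses a reduced, hypothetical `AttributedCost` record (surface, tenant, cost) rather than the full tag set, and assumes cost amounts are already available per record.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical cost record: a subset of the attribution tags plus a cost amount. */
record AttributedCost(String surface, String tenantId, double cost) {}

/** Hypothetical rollups used for per-surface and per-tenant cost views. */
final class CostRollup {
    /** Sums cost per surface; TreeMap keeps the output ordered by key. */
    static Map<String, Double> bySurface(List<AttributedCost> entries) {
        Map<String, Double> totals = new TreeMap<>();
        for (AttributedCost entry : entries) {
            totals.merge(entry.surface(), entry.cost(), Double::sum);
        }
        return totals;
    }

    /** Sums cost per tenant, the basis for pricing and quota discussions. */
    static Map<String, Double> byTenant(List<AttributedCost> entries) {
        Map<String, Double> totals = new TreeMap<>();
        for (AttributedCost entry : entries) {
            totals.merge(entry.tenantId(), entry.cost(), Double::sum);
        }
        return totals;
    }
}
```

The same pattern extends to model version, candidate source, or experiment, as long as the tag is attached when the cost is recorded.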
49. Minimal Production Cost/Capacity Plan
Start with:
capacity:
  request_qps_by_surface: true
  candidate_scores_per_sec: true
  vector_queries_per_sec: true
  feature_values_per_sec: true
performance:
  stage_latency_p95_p99: true
  candidate_count_caps: true
  batch_feature_fetch: true
  batch_model_inference: true
cost:
  cost_by_service: true
  cost_by_surface: true
  tenant_cost_attribution: true
optimization:
  cache_hit_rate: true
  final_filter_rejection_rate: true
  feature_pruning: quarterly
  candidate_funnel_metrics: true
resilience:
  degradation_modes: true
  load_shedding: true
50. Checklist: Cost, Capacity, and Performance Readiness
[ ] QPS is tracked by surface/tenant.
[ ] Candidate scores/sec is tracked.
[ ] Feature values/sec and payload size are estimated.
[ ] Vector search QPS/topK/overfetch are tracked.
[ ] Candidate count caps exist.
[ ] Feature fetch and model inference are batched.
[ ] Cost is attributed by service/surface/tenant.
[ ] Cache hit/miss/stale metrics exist.
[ ] Model inference latency/cost is tracked by version.
[ ] Batch scoring volume and cost are tracked.
[ ] Embedding/index build cost is tracked.
[ ] Shadow/experiment cost is tracked.
[ ] LLM cost quotas exist if LLM is used.
[ ] Load tests include cold cache and dependency failures.
[ ] Degradation/load-shedding modes exist.
[ ] Performance profiling is evidence-driven.
[ ] Observability/logging cost is controlled.
[ ] Capacity plan includes rollout/failover headroom.
51. Conclusion
Cost, capacity, and performance engineering ensure that a recommendation platform can serve at large scale without poor latency or uncontrolled cost.
Key principles:
- Optimize cost per useful decision, not just total bill.
- Candidate scores/sec is more important than request QPS alone.
- Feature lookup volume can dominate cost.
- Vector search topK/overfetch/dimension directly affect latency and memory.
- Two-stage ranking controls expensive model cost.
- Caching must be evaluated by savings and correctness risk.
- Batch scoring trades online latency for offline compute cost.
- Shadow/experiments/LLM/debug logging can silently increase cost.
- Cost attribution by surface/tenant/model/source enables ownership.
- Degradation modes are part of performance engineering.
In Part 074, we will cover Operating Model and Team Topology: how to shape teams, ownership, review process, on-call, governance, roadmap, platform boundaries, and the collaboration model needed to run an enterprise-grade RecSys.