
Learn Build From Scratch Recommendations System Part 063 Offline Evaluation Metrics


title: Build From Scratch Recommendations System - Part 063
description: Designing offline evaluation metrics for a production-grade recommendation system: retrieval recall, ranking metrics, NDCG/MAP/MRR, calibration, diversity, novelty, coverage, counterfactual limitations, segment analysis, leakage checks, and metric governance.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 63
partTitle: Offline Evaluation Metrics
tags:
  - recommendation-system
  - recsys
  - offline-evaluation
  - metrics
  - ranking
  - experimentation
  - series
date: 2026-07-02

Part 063 — Offline Evaluation Metrics

Starting with Part 063, we enter Module 8: Evaluation, Experimentation, and Observability.

A recommendation system must do more than "look like it works".

We must be able to answer:

Does each retrieval source have sufficient recall?
Does the ranker order the right items first?
Is the model calibrated?
Is the slate too repetitive?
Is cold-start improving?
Is any particular segment getting worse?
Does the offline metric actually correlate with online outcomes?
Is the evaluation data free of leakage?

Offline evaluation is the first filter before a model/policy is tested online.

But offline evaluation is also full of pitfalls:

historical bias,
missing counterfactual,
exposure bias,
random split leakage,
proxy metric,
candidate set mismatch,
label maturity,
metric overfitting,
segment harm hidden by global average.

This part covers production-grade offline evaluation metrics for recommendation systems: retrieval, ranking, calibration, slate-level quality, diversity/novelty, coverage, segment analysis, evaluation datasets, leakage control, and metric governance.

1. Mental Model: Offline Evaluation Is a Safety Gate, Not Final Truth

Offline evaluation answers:

Based on historical logged data, is this model/policy promising and safe enough to test online?

It does not fully answer:

Will this improve product metrics in production?

Why?

Because historical data comes from the old policy.

If an item was never shown, we do not know how the user would have reacted to it.

An offline metric is a necessary gate, not sufficient proof.

2. Evaluation Levels

Recommendation evaluation has several levels:

candidate generation
ranking
reranking/slate
calibration
business utility
fairness/exposure
latency/cost
segment robustness

Each level needs its own metrics.

Do not use one metric for all layers.

3. Evaluation Dataset Design

The evaluation dataset defines the truth boundary.

It should specify:

prediction time
candidate universe
labels
label windows
exposure policy
negative examples
temporal split
segment definitions
eligibility snapshot

Bad evaluation data creates misleading metrics.

Example dataset spec:

evaluation_dataset: home_ranker_eval_20260702
base: logged_impressions
surface: home_feed
prediction_time: impression_time
labels:
  click_30m: v3
  purchase_7d: v2
  hide_7d: v1
split:
  type: temporal
  test_window: 2026-06-29..2026-07-01
candidate_set:
  source: logged_candidate_pool_sample

4. Temporal Evaluation

Use temporal splits.

train: older data
validation: newer data
test: newest data

Why?

Production predicts future behavior.

Random split leaks:

user behavior,
item popularity,
session duplicates,
future catalog state,
future labels.

Offline metrics from a random split are often too optimistic.
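
A temporal split can be sketched as follows. This is a minimal illustration, assuming each logged event carries an epoch-millisecond timestamp; the `TemporalSplit` class, the `Event` shape, and the two cutoff parameters are hypothetical names, not part of the series codebase.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assigns logged events to splits by time, never at random. */
final class TemporalSplit {
    /** Minimal event shape assumed for this sketch. */
    static final class Event {
        final String userId;
        final String itemId;
        final long timestampMillis;

        Event(String userId, String itemId, long timestampMillis) {
            this.userId = userId;
            this.itemId = itemId;
            this.timestampMillis = timestampMillis;
        }
    }

    final List<Event> train = new ArrayList<>();
    final List<Event> validation = new ArrayList<>();
    final List<Event> test = new ArrayList<>();

    /**
     * Events before validationStart go to train, events in
     * [validationStart, testStart) go to validation, the rest go to test.
     */
    static TemporalSplit split(List<Event> events, long validationStart, long testStart) {
        if (validationStart > testStart) {
            throw new IllegalArgumentException("validationStart must not be after testStart");
        }
        TemporalSplit result = new TemporalSplit();
        for (Event e : events) {
            if (e.timestampMillis < validationStart) {
                result.train.add(e);
            } else if (e.timestampMillis < testStart) {
                result.validation.add(e);
            } else {
                result.test.add(e);
            }
        }
        return result;
    }
}
```

Splitting on the event timestamp alone does not cover every leak listed above: features and labels also need to be computed as of prediction time.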

5. Candidate Set for Offline Ranking Evaluation

Ranker evaluation needs candidate set.

Options:

Logged Impressions Only

Evaluate among items actually shown.

Pros:

labels available.

Cons:

limited to old policy.

Logged Candidate Pool

Evaluate candidates generated during serving, shown or not.

Pros:

closer to ranker decision.

Cons:

labels missing for unshown candidates.

Reconstructed Candidate Pool

Run candidate generation offline.

Pros:

evaluate new candidate sources.

Cons:

labels/counterfactual missing.

Be explicit about which one is used.

6. Retrieval Evaluation

Candidate generation goal:

include relevant items in candidate pool

Metric:

recall@K

Example:

Did candidate source retrieve item user clicked/purchased?

If positive item is not in candidate pool, ranker cannot rank it.

Retrieval recall is an upper bound on what the ranker can achieve.

7. Retrieval Recall@K

For each query/user/context with known positive item:

recall@K = 1 if positive item in top K candidates else 0

Aggregate average.

Example:

Recall@100 = 0.72
Recall@500 = 0.88
Recall@1000 = 0.93

Choose K based on downstream ranker capacity.
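
The per-query definition above can be sketched in a few lines. This is an illustrative sketch, assuming one known positive item per query and candidate lists ordered best-first; `RetrievalRecall` and its parameter names are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper for retrieval recall@K with one known positive per query. */
final class RetrievalRecall {
    /**
     * candidatesByQuery: query id -> candidates ordered best-first.
     * positiveByQuery: query id -> the single known positive item.
     * A query with no candidate list counts as a miss.
     */
    static double recallAtK(Map<String, List<String>> candidatesByQuery,
                            Map<String, String> positiveByQuery,
                            int k) {
        if (positiveByQuery.isEmpty() || k <= 0) {
            return 0.0;
        }
        int hits = 0;
        for (Map.Entry<String, String> e : positiveByQuery.entrySet()) {
            List<String> candidates = candidatesByQuery.getOrDefault(e.getKey(), List.of());
            List<String> topK = candidates.subList(0, Math.min(k, candidates.size()));
            if (topK.contains(e.getValue())) {
                hits++;
            }
        }
        // Average of the per-query 0/1 outcomes.
        return (double) hits / positiveByQuery.size();
    }
}
```

Computing the same function at several values of K gives a curve like the one in the example, which helps in picking the smallest K the ranker budget can afford.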

8. Retrieval Recall by Source

Evaluate source contribution.

Metrics:

recall@K by source
marginal recall by source
source overlap
unique positives found
source latency/cost

Example:

two_tower recall@500: 0.62
item_cf recall@500: 0.31
content recall@500: 0.22
combined recall@500: 0.81

Combined recall matters, but marginal recall tells which source adds unique value.
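
One way to make marginal recall concrete is to define it as combined recall with all sources minus combined recall without the source in question. The sketch below uses that definition as an assumption; `MarginalRecall` and the shape of `hitsBySource` (source name mapped to the query ids whose positive it retrieved within K) are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: marginal recall as the drop in combined recall when a source is removed. */
final class MarginalRecall {
    /** Share of queries whose positive was retrieved by at least one source. */
    static double combinedRecall(Map<String, Set<String>> hitsBySource, int totalQueries) {
        return combinedRecallExcluding(hitsBySource, totalQueries, null);
    }

    /** Combined recall with all sources minus combined recall without the given source. */
    static double marginalRecall(Map<String, Set<String>> hitsBySource, int totalQueries, String source) {
        return combinedRecall(hitsBySource, totalQueries)
            - combinedRecallExcluding(hitsBySource, totalQueries, source);
    }

    private static double combinedRecallExcluding(Map<String, Set<String>> hitsBySource,
                                                  int totalQueries,
                                                  String excludedSource) {
        if (totalQueries <= 0) {
            return 0.0;
        }
        Set<String> union = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : hitsBySource.entrySet()) {
            if (!e.getKey().equals(excludedSource)) {
                union.addAll(e.getValue());
            }
        }
        return (double) union.size() / totalQueries;
    }
}
```

Under this definition a source's marginal recall equals the share of queries whose positive only that source found, so it ties directly to the "unique positives found" metric.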

9. Candidate Pool Diagnostics

Measure:

candidate_count
unique_candidate_count
duplicate_rate
eligible_after_filter_rate
source_distribution
category_distribution
new_item_share
long_tail_share
empty_pool_rate

Candidate source may have good recall but bad diversity or high invalid rate.

10. Ranking Metrics Overview

Common ranking metrics:

NDCG@K
MAP@K
MRR@K
Recall@K
Precision@K
HitRate@K
AUC
Logloss
Calibration metrics

Choose by task.

NDCG for graded relevance and top-heavy ranking.
MRR for first relevant item.
MAP for multiple relevant items.
AUC/logloss for pointwise binary prediction.
Calibration for probability quality.

11. Precision@K

Precision@K = relevant items in top K / K

Good when:

multiple relevant items,
fixed slate size,
relevance labels available.

Weakness:

ignores order within top K,
binary relevance,
sensitive to incomplete labels.

For RecSys, labels are often incomplete because the relevance of unshown items is unknown.

12. Recall@K for Ranking

Recall@K = relevant items in top K / all relevant items

Useful when relevant set known.

Examples:

next-item prediction,
held-out purchases,
known consumed items.

But “all relevant items” often unknown.

Use carefully.

13. HitRate@K

HitRate@K = 1 if any relevant item is in the top K, else 0

Common for next-item evaluation.

Example:

Did held-out next item appear in top 20?

Simple but loses nuance.

If user has multiple valid outcomes, HitRate can underrepresent quality.
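
The three binary top-K metrics from sections 11 to 13 share one counting step, so they can be sketched together. This is a minimal sketch for a single ranked list, assuming binary relevance and no duplicate items in the ranking; `TopKMetrics` is a hypothetical name.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper for binary top-K metrics on a single ranked list. */
final class TopKMetrics {
    /** Number of relevant items among the first k ranked items. */
    static long relevantInTopK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0;
        }
        return ranked.stream().limit(k).filter(relevant::contains).count();
    }

    /** Relevant items in top K divided by K, even when the list is shorter than K. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        return k <= 0 ? 0.0 : (double) relevantInTopK(ranked, relevant, k) / k;
    }

    /** Relevant items in top K divided by all known relevant items. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        return relevant.isEmpty() ? 0.0 : (double) relevantInTopK(ranked, relevant, k) / relevant.size();
    }

    /** 1.0 if at least one relevant item is in the top K, else 0.0. */
    static double hitRateAtK(List<String> ranked, Set<String> relevant, int k) {
        return relevantInTopK(ranked, relevant, k) > 0 ? 1.0 : 0.0;
    }
}
```

The reported metric is the average of these per-list values over all evaluation groups.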

14. MRR@K

Mean Reciprocal Rank:

RR = 1 / rank_of_first_relevant_item (0 if no relevant item appears in the top K)
MRR@K = mean of RR over all queries

Good for tasks where first relevant result matters:

search,
support article lookup,
enterprise case recommendation,
“next best action” if one correct action.

Less useful when slate has many relevant items.
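
A minimal sketch of MRR@K, assuming binary relevance and one ranked list per query. `MrrAtK` and its method names are hypothetical; a query with no relevant item in the top K contributes 0 in this sketch.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper for MRR@K over a set of queries. */
final class MrrAtK {
    /** 1 / rank of the first relevant item within the top k, or 0.0 if there is none. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant, int k) {
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** Mean of the reciprocal ranks; rankedLists and relevantSets are aligned by index. */
    static double mrr(List<List<String>> rankedLists, List<Set<String>> relevantSets, int k) {
        if (rankedLists.size() != relevantSets.size()) {
            throw new IllegalArgumentException("rankedLists and relevantSets must have the same size");
        }
        if (rankedLists.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int q = 0; q < rankedLists.size(); q++) {
            sum += reciprocalRank(rankedLists.get(q), relevantSets.get(q), k);
        }
        return sum / rankedLists.size();
    }
}
```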

15. MAP@K

Mean Average Precision:

average precision over relevant items in top K

Good when multiple relevant items and order matters.

Needs reliable relevant set.

In recommendation, labels incomplete; MAP can be biased.
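
A minimal sketch of MAP@K with binary relevance. Normalizers differ between implementations; this sketch divides by min(K, number of relevant items), which is an assumption rather than the series' stated definition. `MapAtK` is a hypothetical name.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper for MAP@K; normalizes AP by min(k, number of relevant items). */
final class MapAtK {
    /** Sum of precision@i at each relevant position i <= k, divided by min(k, |relevant|). */
    static double averagePrecision(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty() || k <= 0) {
            return 0.0;
        }
        int limit = Math.min(k, ranked.size());
        int hits = 0;
        double precisionSum = 0.0;
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                precisionSum += (double) hits / (i + 1);
            }
        }
        return precisionSum / Math.min(k, relevant.size());
    }

    /** Mean of AP@K over queries; rankedLists and relevantSets are aligned by index. */
    static double meanAveragePrecision(List<List<String>> rankedLists, List<Set<String>> relevantSets, int k) {
        if (rankedLists.size() != relevantSets.size()) {
            throw new IllegalArgumentException("rankedLists and relevantSets must have the same size");
        }
        if (rankedLists.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int q = 0; q < rankedLists.size(); q++) {
            sum += averagePrecision(rankedLists.get(q), relevantSets.get(q), k);
        }
        return sum / rankedLists.size();
    }
}
```

Because the relevant set enters the denominator, any positives missing from the labels shift the score, which is the bias noted above.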

16. DCG and NDCG

DCG:

DCG@K = sum over i = 1..K of (2^rel_i - 1) / log2(i + 1), where rel_i is the relevance grade of the item at position i

NDCG normalizes by ideal ranking.

NDCG@K = DCG@K / IDCG@K

NDCG handles graded relevance and top-heavy value.

Very common for learning-to-rank.
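
As a worked example of the formulas above, take a hypothetical ranking of four items with grades 3, 2, 0, 1 in ranked order. The ideal ordering of the same grades is 3, 2, 1, 0. Values are rounded to three decimals.

```latex
\[
\mathrm{DCG@4}
  = \frac{2^{3}-1}{\log_2 2} + \frac{2^{2}-1}{\log_2 3}
  + \frac{2^{0}-1}{\log_2 4} + \frac{2^{1}-1}{\log_2 5}
  \approx 7 + 1.893 + 0 + 0.431 = 9.324
\]
\[
\mathrm{IDCG@4}
  = \frac{2^{3}-1}{\log_2 2} + \frac{2^{2}-1}{\log_2 3}
  + \frac{2^{1}-1}{\log_2 4} + \frac{2^{0}-1}{\log_2 5}
  \approx 7 + 1.893 + 0.5 + 0 = 9.393
\]
\[
\mathrm{NDCG@4} = \frac{\mathrm{DCG@4}}{\mathrm{IDCG@4}} \approx \frac{9.324}{9.393} \approx 0.993
\]
```

Swapping the last two items costs under 1% of NDCG here, while the same swap at positions 1 and 2 would cost far more. That is the top-heavy behavior the metric is chosen for.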

17. Graded Relevance

Instead of binary label:

hide/report: negative
impression no action: 0
click: 1
save/cart: 2
purchase/complete: 3

Example:

relevance_grade:
  report: -3
  hide: -1
  click: 1
  add_to_cart: 2
  purchase: 3

NDCG can use graded relevance.

Be careful mixing actions with different business meaning.

18. NDCG Caveats

NDCG can mislead if:

labels are biased by old positions,
unshown candidates treated as irrelevant,
relevance grades arbitrary,
evaluation set lacks hard negatives,
popularity dominates,
segment metrics ignored.

NDCG is useful but not truth.

19. AUC

AUC measures pairwise ordering of positives above negatives.

Useful for binary classifiers.

But:

not top-K specific,
can improve while top positions don't,
insensitive to calibration,
large easy negatives can inflate AUC.

For recommender ranking, AUC alone is insufficient.
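
AUC can be read as the share of (positive, negative) pairs that the model orders correctly, with ties counting as half. The sketch below computes it by direct pair counting, which is quadratic and only meant for illustration; `PairwiseAuc` is a hypothetical name and labels are assumed to be 0 or 1.

```java
/** Hypothetical helper: AUC by direct pair counting (O(positives * negatives)). */
final class PairwiseAuc {
    /** Returns NaN when there is no positive/negative pair to compare. */
    static double auc(double[] scores, int[] labels) {
        if (scores.length != labels.length) {
            throw new IllegalArgumentException("scores and labels must have the same length");
        }
        long pairs = 0;
        double wins = 0.0;
        for (int i = 0; i < scores.length; i++) {
            if (labels[i] != 1) {
                continue;
            }
            for (int j = 0; j < scores.length; j++) {
                if (labels[j] != 0) {
                    continue;
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;   // positive ranked above negative
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;   // tie counts as half a win
                }
            }
        }
        return pairs == 0 ? Double.NaN : wins / pairs;
    }
}
```

With scores {0.9, 0.8, 0.3, 0.1} and labels {1, 0, 1, 0} this gives 0.75. Appending four negatives scored 0.0 raises it to 11/12 even though the top of the list is unchanged, which illustrates how easy negatives inflate AUC.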

20. Logloss

Logloss evaluates probability prediction.

Good for:

click probability,
conversion probability,
hide/report risk.

If model outputs probabilities, logloss matters.

But lower logloss does not always mean better ranking/slate.

Use with ranking metrics.
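
A minimal sketch of binary logloss, assuming labels are 0 or 1 and predictions are probabilities. The `LogLoss` class and the clipping constant `EPS` are hypothetical choices; clipping only guards against log(0).

```java
/** Hypothetical helper for binary logloss (cross-entropy) with probability clipping. */
final class LogLoss {
    private static final double EPS = 1e-15;

    /** Mean of -log(p) for positives and -log(1 - p) for negatives. */
    static double logLoss(double[] predicted, int[] labels) {
        if (predicted.length != labels.length || predicted.length == 0) {
            throw new IllegalArgumentException("inputs must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            // Clip so that a prediction of exactly 0 or 1 does not produce infinity.
            double p = Math.min(Math.max(predicted[i], EPS), 1.0 - EPS);
            sum += labels[i] == 1 ? -Math.log(p) : -Math.log(1.0 - p);
        }
        return sum / predicted.length;
    }
}
```

A constant prediction of 0.5 scores ln 2 (about 0.693) regardless of the labels, which is a useful sanity baseline when reading logloss numbers.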

21. Calibration Metrics

Calibration asks:

When model predicts 10% click probability, does click happen around 10%?

Metrics:

calibration curve,
expected calibration error,
Brier score,
reliability diagram.

Calibration is important for:

utility composition,
multi-objective ranking,
thresholds,
business trade-offs.

22. Brier Score

For probability prediction:

Brier = mean((p - y)^2)

Lower is better.

It combines calibration and discrimination.

Useful for probability outputs but less top-K focused.
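
The Brier formula above and the expected calibration error from section 21 can be sketched side by side. This sketch assumes labels are 0 or 1, predictions lie in [0, 1], and ECE uses equal-width probability bins; `CalibrationMetrics` and the `bins` parameter are hypothetical, and other binning schemes exist.

```java
/** Hypothetical helper for Brier score and equal-width-bin expected calibration error. */
final class CalibrationMetrics {
    /** mean((p - y)^2) */
    static double brierScore(double[] predicted, int[] labels) {
        checkInputs(predicted, labels);
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - labels[i];
            sum += diff * diff;
        }
        return sum / predicted.length;
    }

    /** Sum over bins of (bin share) * |mean predicted probability - observed positive rate|. */
    static double expectedCalibrationError(double[] predicted, int[] labels, int bins) {
        checkInputs(predicted, labels);
        if (bins <= 0) {
            throw new IllegalArgumentException("bins must be positive");
        }
        double[] sumP = new double[bins];
        double[] sumY = new double[bins];
        int[] count = new int[bins];
        for (int i = 0; i < predicted.length; i++) {
            // A prediction of exactly 1.0 falls into the last bin.
            int b = Math.min((int) (predicted[i] * bins), bins - 1);
            sumP[b] += predicted[i];
            sumY[b] += labels[i];
            count[b]++;
        }
        double ece = 0.0;
        for (int b = 0; b < bins; b++) {
            if (count[b] == 0) {
                continue;
            }
            double gap = Math.abs(sumP[b] / count[b] - sumY[b] / count[b]);
            ece += ((double) count[b] / predicted.length) * gap;
        }
        return ece;
    }

    private static void checkInputs(double[] predicted, int[] labels) {
        if (predicted.length != labels.length || predicted.length == 0) {
            throw new IllegalArgumentException("inputs must be non-empty and of equal length");
        }
    }
}
```

Reporting both is useful because a low Brier score can come from good discrimination alone, while ECE isolates the calibration gap.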

23. Multi-Task Evaluation

If model predicts multiple tasks:

p_click
p_purchase
p_hide
p_report
p_satisfaction

Evaluate each:

AUC/logloss/calibration per task,
segment metrics,
utility simulation.

Do not only evaluate final composed score.

A model with good click and terrible hide risk can be dangerous.

24. Utility Simulation

If ranking uses utility:

utility = 1*click + 5*purchase - 3*hide - 50*report

Offline simulate expected utility.

But utility weights are product policy.

Need:

sensitivity analysis,
guardrails,
segment checks.

Utility simulation is only as good as prediction/label quality.
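
The utility formula above can be applied to predicted probabilities to get an expected utility per item and per slate. This is a sketch under stated assumptions: `UtilitySimulation` is a hypothetical name, probabilities are passed in the fixed order click, purchase, hide, report, and the default weights are the ones from the example formula.

```java
import java.util.List;

/** Hypothetical helper: expected utility from per-task predicted probabilities. */
final class UtilitySimulation {
    /** Weights in the order click, purchase, hide, report (taken from the example formula). */
    static final double[] DEFAULT_WEIGHTS = {1.0, 5.0, -3.0, -50.0};

    /** Weighted sum of the predicted probabilities for one item. */
    static double itemUtility(double[] probabilities, double[] weights) {
        if (probabilities.length != weights.length) {
            throw new IllegalArgumentException("probabilities and weights must have the same length");
        }
        double utility = 0.0;
        for (int i = 0; i < weights.length; i++) {
            utility += weights[i] * probabilities[i];
        }
        return utility;
    }

    /** Sum of item utilities over a slate; ignores position effects in this sketch. */
    static double slateUtility(List<double[]> slate, double[] weights) {
        double total = 0.0;
        for (double[] item : slate) {
            total += itemUtility(item, weights);
        }
        return total;
    }
}
```

A simple sensitivity check is to recompute the comparison with one weight changed, for example report at -100 instead of -50, and see whether the champion/challenger ordering flips.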

25. Slate-Level Offline Metrics

Reranking/slate evaluation includes:

slate predicted utility
category diversity
creator diversity
intra-list similarity
novelty
coverage
source mix
frequency cap violations
policy violations
sponsored count
repetition

User sees slate, not item scores.

Evaluate final slate after reranker.

26. Diversity Metrics

Metrics:

category_entropy
distinct_categories@K
distinct_creators@K
intra_list_similarity
max_same_creator_count
max_same_category_count

Evaluate by surface.

Search may need lower diversity than home feed.

Diversity without relevance is not success.
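
Three of the metrics above can be computed from the list of category labels in a slate. The sketch below is illustrative: `SlateDiversity` is a hypothetical name, and entropy is measured in bits (log base 2), which is a choice made here rather than something the series specifies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for category-based slate diversity metrics. */
final class SlateDiversity {
    private static Map<String, Integer> counts(List<String> categories) {
        Map<String, Integer> counts = new HashMap<>();
        for (String category : categories) {
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    /** Shannon entropy of the category distribution, in bits. 0.0 for an empty or single-category slate. */
    static double categoryEntropy(List<String> categories) {
        double entropy = 0.0;
        for (int count : counts(categories).values()) {
            double p = (double) count / categories.size();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    static int distinctCategories(List<String> categories) {
        return counts(categories).size();
    }

    static int maxSameCategoryCount(List<String> categories) {
        int max = 0;
        for (int count : counts(categories).values()) {
            max = Math.max(max, count);
        }
        return max;
    }
}
```

Pass only the first K categories of a slate to get the @K variants.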

27. Novelty Metrics

Metrics:

new_to_user_rate
not_seen_30d_rate
long_tail_share
new_item_share
topic_distance_from_profile
creator_not_seen_30d

Pair with relevance/engagement guardrails.

Novelty can degrade quality if unbounded.

28. Coverage Metrics

Coverage:

catalog_coverage
creator_coverage
seller_coverage
category_coverage
new_item_coverage
tenant_document_coverage

Use exposure-weighted coverage for offline slate simulation.

Coverage helps monitor ecosystem health, but it should not be optimized blindly.

29. Constraint Metrics

For reranking/policy:

hard_constraint_violation_count
soft_constraint_satisfaction_rate
dedup_violation_rate
max_sponsored_violation
frequency_cap_violation
policy_required_inclusion_rate
empty_slate_rate

Hard violation should be zero.

Constraint metrics are release gates.

30. Offline Latency/Cost Metrics

Evaluation should include serving feasibility.

Metrics:

candidate_count
feature_count
model_size
inference_latency_estimate
memory footprint
ANN query latency benchmark
cache payload size

Offline quality improvement that cannot serve within budget is not production-ready.

31. Segment Evaluation

Always slice metrics.

Segments:

new users
anonymous users
heavy users
new items
long-tail items
regions
languages
device types
surfaces
candidate sources
tenants
categories
privacy modes

Global NDCG can improve while important segment degrades.

Segment gates catch this.
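
A segment gate can be sketched as two steps: average a per-group metric by segment, then flag segments where the challenger falls below the champion by more than a tolerance. `SegmentReport`, its method names, and the tolerance parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: per-segment means and a simple regression gate. */
final class SegmentReport {
    /** segments.get(i) is the segment of evaluation group i, values.get(i) its metric value. */
    static Map<String, Double> meanBySegment(List<String> segments, List<Double> values) {
        if (segments.size() != values.size()) {
            throw new IllegalArgumentException("segments and values must have the same size");
        }
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (int i = 0; i < segments.size(); i++) {
            double[] acc = sumAndCount.computeIfAbsent(segments.get(i), s -> new double[2]);
            acc[0] += values.get(i);
            acc[1] += 1.0;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    /** Segments (sorted by name) where the challenger is missing or worse than champion - tolerance. */
    static List<String> regressedSegments(Map<String, Double> champion,
                                          Map<String, Double> challenger,
                                          double tolerance) {
        List<String> regressed = new ArrayList<>();
        for (Map.Entry<String, Double> e : new TreeMap<>(champion).entrySet()) {
            Double challengerValue = challenger.get(e.getKey());
            if (challengerValue == null || challengerValue < e.getValue() - tolerance) {
                regressed.add(e.getKey());
            }
        }
        return regressed;
    }
}
```

In a case where the champion scores 0.30 on new users and 0.70 on heavy users while the challenger scores 0.25 and 0.75, the unweighted average is 0.50 for both, yet the gate flags the new-user segment.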

32. Cold-Start Evaluation

For new items/users:

new_user_recall@K
new_user_ndcg@K
new_item_exposure_simulation
new_item_retrieval_recall
content-only performance
anonymous performance

Cold-start often hidden by warm-user metrics.

Evaluate separately.

33. Negative Feedback Metrics

Track:

hide@K
report@K
predicted_negative_utility
negative_rate_by_segment
negative in top positions

A model that improves CTR by showing controversial content may be bad.

Negative feedback should be first-class.

34. Offline Evaluation for Candidate Sources

For each source:

recall,
marginal recall,
latency,
cost,
freshness,
coverage,
invalid rate,
diversity,
cold-start contribution.

Source evaluation table:

Source      Recall@500   Marginal Recall   Invalid Rate   p95 Latency
two_tower   0.62         0.39              3%             30ms
item_cf     0.31         0.11              5%             25ms
trending    0.18         0.07              1%             5ms

35. Offline Evaluation for Ranking Models

Compare model versions:

champion vs challenger

Metrics:

NDCG@K,
AUC/logloss per task,
calibration,
utility simulation,
segment metrics,
feature missing sensitivity,
latency/model size.

Use same evaluation dataset.

36. Offline Evaluation for Reranking Policies

Compare:

topK baseline
diversity policy A
diversity policy B
frequency policy C
exploration policy D

Metrics:

predicted utility loss/gain,
diversity gain,
novelty gain,
constraint violations,
source/category distribution,
empty slate rate.

Reranking changes often trade relevance for diversity/fairness. Make trade-off explicit.

37. Offline Evaluation for Business Rules

Simulate rule impact:

candidate rejection rate
slate change rate
affected users
affected categories
empty slate rate
primary metric proxy impact
hard violation prevention

For a new rule, run a shadow evaluation before production.

A rule that rejects 40% of candidates may need fallback/source changes.

38. Offline Evaluation for LLM-Augmented Components

Metrics by use case:

Intent Parser

schema validity
constraint extraction accuracy
clarification precision

Metadata Enrichment

taxonomy precision/recall
human agreement
invalid output rate

Explanation

faithfulness
unsupported claim rate
policy violation rate

LLM components need their own evaluation, not only RecSys metrics.

39. Counterfactual Limitation

Offline data is biased because old policy chose what users saw.

If model A would show items never shown before, offline labels cannot evaluate them.

This affects:

long-tail,
new source,
exploration,
diversity,
serendipity.

Use offline evaluation to screen, then run an online experiment.

40. Position Bias

Items in higher position get more clicks.

Offline labels reflect position.

Mitigations:

position features,
debiasing,
propensity weighting,
randomized exploration data,
evaluate within comparable positions,
click models.

Do not treat click as pure relevance.

41. Selection Bias

Logged examples are selected by old candidate generation/ranking.

Unshown items lack labels.

If new ranker changes candidate distribution, offline metric may not predict online effect.

Record candidate generation policy and candidate sets.

Use exploration to improve support.

42. Label Incompleteness

No click does not mean irrelevant.

The user may not have seen the item, may be busy, may already know it, or may convert later.

Labels are weak signals.

Use:

viewability,
dwell,
delayed windows,
negative feedback,
multiple labels,
uncertainty/weights.

43. Metric Correlation with Online

Track whether offline metric changes predict online metrics.

Example:

+1% NDCG@20 historically correlates with +0.2% CTR

If offline metric stops predicting online, revisit metric/dataset.

Metric governance includes metric validity.
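
One simple way to track this is to keep, for each past launch, the offline metric delta and the online metric delta, and compute their Pearson correlation. The sketch below assumes such paired deltas are available; `MetricValidity` is a hypothetical name, and a correlation over a handful of launches is a monitoring signal, not causal evidence.

```java
/** Hypothetical helper: Pearson correlation between offline and online metric deltas. */
final class MetricValidity {
    /** Returns NaN when either series has zero variance. */
    static double pearson(double[] offlineDeltas, double[] onlineDeltas) {
        if (offlineDeltas.length != onlineDeltas.length || offlineDeltas.length < 2) {
            throw new IllegalArgumentException("need at least two paired observations");
        }
        int n = offlineDeltas.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += offlineDeltas[i];
            meanY += onlineDeltas[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = offlineDeltas[i] - meanX;
            double dy = onlineDeltas[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0.0 || varY == 0.0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

A correlation that drifts toward zero across recent launches is a prompt to review the metric, the labels, or the evaluation dataset.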

44. Offline Metric Overfitting

Teams can overfit to offline benchmark.

Symptoms:

offline improves, online flat/worse,
model exploits label artifact,
segment harm,
metric gaming.

Mitigate:

holdout test,
rolling windows,
online validation,
multiple metrics,
causal experiments,
periodic metric review.

45. Metric Governance

For each metric define:

name
formula
owner
dataset
label version
window
segments
known limitations
use in gates
last reviewed

Metric definitions should be versioned.

Do not let every team compute “NDCG” differently.

46. Evaluation Report Template

Candidate model/policy report:

summary
candidate/champion versions
evaluation dataset version
global metrics
segment metrics
calibration
retrieval/ranking/slate metrics
guardrails
latency/cost
known risks
recommendation
next online test plan

Attach to model/policy registry.

47. Common Failure Modes

47.1 Random Split Leakage

Offline metrics are inflated because the model has seen future behavior.

47.2 Treating Unshown Items as Negative

Metrics are biased toward the old policy: an unshown item is unlabeled, not irrelevant.

47.3 Only Global Metric

Segment regressions stay hidden.

47.4 NDCG Without Label Quality

The metric is meaningless.

47.5 Candidate Set Mismatch

The ranker is evaluated offline on an unrealistic pool.

47.6 No Calibration Evaluation

Utility composition becomes unreliable.

47.7 Diversity Metric Optimized Alone

Slates become more random, not more useful.

47.8 Offline Metric Treated as Final Proof

The change fails online.

47.9 Metric Definition Drift

Comparisons across versions or teams become invalid.

47.10 No Negative Feedback Metric

Trust degradation goes unnoticed.

48. Implementation Sketch: Metric Interface

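import java.util.List;
import java.util.Map;

// Minimal supporting types; their shapes are assumptions made for this sketch.
record ScoredItem(String itemId, double score) {}
record EvaluationGroup(String groupId, Map<String, Integer> relevance) {}
record EvaluationDataset(List<EvaluationGroup> groups) {}
interface RankingOutput { List<ScoredItem> rankedItems(String groupId); }

public interface OfflineMetric {
    String name();

    MetricResult compute(EvaluationDataset dataset, RankingOutput output);
}

public record MetricResult(
    String metricName,
    double value,
    Map<String, Double> segmentValues,
    Map<String, Object> diagnostics
) {}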

Metric computation should be versioned and tested.

49. Implementation Sketch: NDCG@K

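import java.util.Comparator;
import java.util.List;
import java.util.Map;

public final class NdcgAtK implements OfflineMetric {
    private final int k;

    public NdcgAtK(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    @Override
    public String name() {
        return "ndcg_at_" + k;
    }

    @Override
    public MetricResult compute(EvaluationDataset dataset, RankingOutput output) {
        double sum = 0.0;
        int groups = 0;

        for (EvaluationGroup group : dataset.groups()) {
            List<ScoredItem> ranked = output.rankedItems(group.groupId());
            double dcg = dcg(ranked.subList(0, Math.min(k, ranked.size())), group.relevance());
            double idcg = idealDcg(group.relevance(), k);

            // Skip groups whose ideal DCG is not positive (for example, no positive labels).
            if (idcg > 0.0) {
                sum += dcg / idcg;
                groups++;
            }
        }

        // Mean NDCG over scored groups; 0.0 when no group could be scored.
        return new MetricResult(name(), sum / Math.max(groups, 1), Map.of(), Map.of("groups", groups));
    }

    // DCG of the given ranking; items without a label count as grade 0.
    private double dcg(List<ScoredItem> ranked, Map<String, Integer> rel) {
        double value = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            int grade = rel.getOrDefault(ranked.get(i).itemId(), 0);
            value += discountedGain(grade, i);
        }
        return value;
    }

    // DCG of the best possible ordering: the k highest grades, descending.
    private double idealDcg(Map<String, Integer> rel, int k) {
        List<Integer> grades = rel.values().stream()
            .sorted(Comparator.reverseOrder())
            .limit(k)
            .toList();
        double value = 0.0;
        for (int i = 0; i < grades.size(); i++) {
            value += discountedGain(grades.get(i), i);
        }
        return value;
    }

    // (2^grade - 1) / log2(position + 1), where position = index + 1.
    private static double discountedGain(int grade, int index) {
        return (Math.pow(2, grade) - 1.0) / (Math.log(index + 2) / Math.log(2));
    }
}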

Production implementation must handle ties, negatives, missing labels, and segments.

50. Minimal Production Offline Evaluation Plan

Start with:

evaluation_dataset:
  temporal_split: true
  label_versions: explicit
  candidate_set: logged_candidate_pool
retrieval_metrics:
  - recall_at_100
  - recall_at_500
  - marginal_recall_by_source
ranking_metrics:
  - ndcg_at_20
  - mrr_at_20
  - auc_click
  - logloss_click
  - calibration_ece
slate_metrics:
  - category_entropy
  - distinct_creators
  - duplicate_rate
  - constraint_violation_rate
guardrails:
  - hide_rate_proxy
  - report_rate_proxy
  - cold_start_segment
  - latency_estimate
segments:
  - new_user
  - new_item
  - region
  - category
  - privacy_mode
governance:
  metric_definitions_versioned: true
  evaluation_report_required: true

51. Offline Evaluation Readiness Checklist

[ ] Evaluation dataset is versioned.
[ ] Temporal split is used.
[ ] Label definitions and maturity windows are explicit.
[ ] Candidate set definition is explicit.
[ ] Retrieval recall is measured.
[ ] Ranking metrics include NDCG/MRR/AUC/logloss as appropriate.
[ ] Calibration is evaluated for probability outputs.
[ ] Slate metrics include diversity/novelty/constraint metrics.
[ ] Negative feedback metrics are included.
[ ] Segment metrics are mandatory.
[ ] Cold-start metrics are separate.
[ ] Latency/cost feasibility is measured.
[ ] Counterfactual limitations are documented.
[ ] Metric definitions are versioned.
[ ] Evaluation reports compare champion vs challenger.
[ ] Offline metrics are validated against online outcomes over time.

52. Conclusion

Offline evaluation is an important gate that keeps bad models/policies out of online experiments.

Key principles:

Offline evaluation is a safety gate, not final truth.
Candidate generation, ranking, reranking, calibration, and slate need different metrics.
Temporal split and point-in-time safety are mandatory.
Retrieval recall limits downstream ranking quality.
NDCG/MAP/MRR are useful only with clear labels and candidate sets.
Calibration matters for utility composition.
Diversity/novelty/coverage metrics must be paired with relevance/guardrails.
Segment metrics are mandatory.
Historical logs are biased by old policy.
Online experiments are still required for causal product validation.

In Part 064, we will cover Online Experimentation and A/B Testing: how to run recommendation system experiments correctly, safely, with statistical validity, and operably in production.
