Learn Build From Scratch Recommendations System Part 041 Score Calibration And Score Composition

title: Build From Scratch Recommendations System - Part 041
description: Designing production-grade score calibration and score composition: probability calibration, source score normalization, utility composition, calibration by segment, drift, guardrails, score debugging, and governance.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 41
partTitle: Score Calibration and Score Composition
tags:
  - recommendation-system
  - recsys
  - ranking
  - calibration
  - score-composition
  - mlops
  - series
date: 2026-07-02

Part 041 — Score Calibration and Score Composition

A production recommendation system combines many scores:

retrieval score,
two-tower dot product,
content similarity,
item-to-item similarity,
graph PPR score,
popularity score,
GBDT score,
deep ranker predictions,
p_click,
p_purchase,
p_hide,
business value,
freshness boost,
quality penalty,
exploration propensity.

The problem: not all scores mean the same thing.

0.8 from cosine similarity is not the same as 0.8 from a calibrated click probability.
8.5 from a dot product is not the same as 8.5 from a GBDT raw margin.
Rank 1 from a trending source is not the same as rank 1 from personalized retrieval.
p_purchase=0.01 may be high for one category but low for another.

If scores are combined carelessly, ranking becomes unstable, cannot be audited, and often makes the wrong trade-offs.

This part covers production-grade score calibration and score composition: when calibration is needed, how to calibrate, how to normalize source scores, multi-task utility composition, segment calibration, drift, debugging, and governance.

1. Mental Model: Score Has Semantics

Every score must answer:

What does this number mean?

Example:

p_click = estimated probability user clicks candidate within 30m
p_purchase = estimated probability purchase within 7d
two_tower_score = inner product in retrieval embedding space
content_similarity = cosine similarity between query/session and item content
popularity_score = smoothed engagement rate in segment
utility_score = composed expected utility from predictions and policy weights

If the semantics are unclear, the score must not be combined with other scores.

Score contract:

score_name: p_purchase_7d
meaning: probability of purchase within 7 days after impression
range: 0..1
calibrated: true
calibration_version: purchase-calibration-v3
segment_scope:
  - surface
  - category
  - region

2. Ranking Score Types

Common score types:

Probability Score

P(click)
P(purchase)
P(hide)

Range 0..1. Ideally calibrated.

Similarity Score

cosine
dot product
BM25
graph proximity

Not probability.

Rank-Based Score

source rank
inverse rank
percentile rank

Relative within source/request.

Raw Model Margin

GBDT raw margin
neural logit

Not probability until transformed/calibrated.

Utility Score

weighted composition of predicted outcomes

May be any real value.

Constraint/Penalty Score

freshness penalty
repetition penalty
quality risk

Usually policy-defined.

3. Calibration Definition

A probability prediction is calibrated if:

Among candidates predicted 0.10 probability, about 10% actually happen.

Example:

predicted p_click = 0.05
actual click rate in that bucket ≈ 5%

Calibration is different from ranking quality.

A model can rank well but be poorly calibrated.

A model can be calibrated but not rank well.

For utility composition, calibration matters.

4. Why Calibration Matters

Calibration matters when score is used as expected value.

Example:

expected_margin = p_purchase * margin

If p_purchase is 2x overestimated for expensive items, ranking over-promotes expensive items.

Example negative objective:

expected_risk = p_report * report_cost

If p_report is underestimated for new items, unsafe or low-quality items may get too much exposure.

Multi-objective ranking depends on calibrated task predictions.
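
A small worked example may make the over-promotion effect concrete. This is only a sketch: the class name, the two items, and all probabilities and margins below are hypothetical numbers chosen for illustration.

```java
// Hypothetical worked example: how a 2x overestimate of p_purchase can flip a ranking.
final class ExpectedMarginExample {
    /** Expected margin per impression: p_purchase * margin. */
    static double expectedMargin(double pPurchase, double margin) {
        return pPurchase * margin;
    }

    /** True if item A outranks item B when both are ranked by expected margin. */
    static boolean aOutranksB(double pA, double marginA, double pB, double marginB) {
        return expectedMargin(pA, marginA) > expectedMargin(pB, marginB);
    }
}
```

With true probabilities, a cheap item (p_purchase 0.02, margin 10) has expected margin 0.20 and beats an expensive item (p_purchase 0.003, margin 50) at 0.15. If the expensive item's probability is overestimated 2x to 0.006, its expected margin becomes 0.30 and it wins the slot, even though nothing about the user changed.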

5. Calibration Is Segment-Specific

Global calibration can look good while segment calibration is bad.

Segments:

surface
category
region
locale
item age
user tenure
candidate source
device
price bucket
tenant
role/workflow

Example:

p_purchase calibrated overall
but overestimates new items
and underestimates repeat-purchase consumables

Monitor calibration by important segments.

6. Calibration Curves

Calibration curve:

Bucket predictions.
For each bucket, compare average predicted probability vs actual rate.

Example:

Prediction Bucket    Avg Predicted    Actual Rate
0.00-0.01            0.006            0.005
0.01-0.03            0.020            0.017
0.03-0.05            0.041            0.052
0.05-0.10            0.071            0.090

If actual > predicted, model underestimates.
If actual < predicted, model overestimates.

Metrics:

Expected Calibration Error
Brier Score
log loss
calibration slope/intercept
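
The first two metrics can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the class and method names are hypothetical, inputs are assumed to be non-empty with predictions in [0, 1], and the equal-width buckets are an assumption. For low-rate tasks, quantile or hand-picked buckets like the ones in the table above are often more informative.

```java
// Minimal sketch of Brier score and Expected Calibration Error (ECE).
// Assumes non-empty inputs, predictions in [0, 1], and binary 0/1 labels.
final class CalibrationMetrics {
    /** Brier score: mean squared difference between predicted probability and the 0/1 label. */
    static double brierScore(double[] predicted, int[] labels) {
        requireSameLength(predicted, labels);
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - labels[i];
            sum += diff * diff;
        }
        return sum / predicted.length;
    }

    /** ECE with equal-width buckets: support-weighted mean of |avg predicted - actual rate|. */
    static double expectedCalibrationError(double[] predicted, int[] labels, int numBuckets) {
        requireSameLength(predicted, labels);
        double[] sumPred = new double[numBuckets];
        double[] sumLabel = new double[numBuckets];
        int[] count = new int[numBuckets];
        for (int i = 0; i < predicted.length; i++) {
            // A prediction of exactly 1.0 falls into the last bucket.
            int b = Math.min(numBuckets - 1, (int) (predicted[i] * numBuckets));
            sumPred[b] += predicted[i];
            sumLabel[b] += labels[i];
            count[b]++;
        }
        double ece = 0.0;
        for (int b = 0; b < numBuckets; b++) {
            if (count[b] == 0) continue; // empty buckets carry no weight
            double gap = Math.abs(sumPred[b] / count[b] - sumLabel[b] / count[b]);
            ece += gap * count[b] / predicted.length;
        }
        return ece;
    }

    private static void requireSameLength(double[] predicted, int[] labels) {
        if (predicted.length != labels.length) {
            throw new IllegalArgumentException("predicted and labels must have the same length");
        }
    }
}
```

Running the same computation per segment (surface, category, source) is what turns this from a global health number into something that can catch the segment-level problems described earlier.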

7. Calibration Methods

Common methods:

Platt Scaling

Fit logistic regression on raw score/logit.

calibrated = sigmoid(a * raw_score + b)

Simple and stable.

Isotonic Regression

Non-parametric monotonic mapping.

More flexible, but can overfit when the calibration dataset is small.

Temperature Scaling

Mostly for neural logits.

calibrated = sigmoid(logit / T)

Segment Calibration

Separate calibration per segment or with segment features.

Beta Calibration / Other Methods

More advanced, use when needed.

Start simple: Platt or isotonic with validation data.
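
To show what "fit logistic regression on raw score" means for Platt scaling, here is a minimal sketch that fits a and b by batch gradient descent on log loss. The class name, starting point, and hyperparameters are assumptions for illustration. A production pipeline would more likely use an established solver, and this loop does not converge if the labels are perfectly separable by raw score.

```java
// Minimal Platt scaling fit: calibrated = sigmoid(a * raw + b).
// Sketch only: plain batch gradient descent, no regularization, no convergence check.
final class PlattFit {
    /** Returns {a, b} fitted on held-out (raw score, 0/1 label) pairs. */
    static double[] fit(double[] raw, int[] labels, int iterations, double learningRate) {
        if (raw.length == 0 || raw.length != labels.length) {
            throw new IllegalArgumentException("need non-empty inputs of equal length");
        }
        double a = 1.0; // start from the identity mapping on the logit
        double b = 0.0;
        int n = raw.length;
        for (int it = 0; it < iterations; it++) {
            double gradA = 0.0;
            double gradB = 0.0;
            for (int i = 0; i < n; i++) {
                // Gradient of log loss w.r.t. the logit is (prediction - label).
                double err = sigmoid(a * raw[i] + b) - labels[i];
                gradA += err * raw[i];
                gradB += err;
            }
            a -= learningRate * gradA / n;
            b -= learningRate * gradB / n;
        }
        return new double[] {a, b};
    }

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
}
```

A useful sanity check after fitting: on the calibration data, the mean calibrated prediction should be close to the observed positive rate, because the intercept b absorbs any global over- or underestimation.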

8. Calibration Data

Calibration must use held-out data not used for model training.

Use:

train model on train period
fit calibration on validation period
evaluate on later test period

Do not calibrate on future/test data.

Calibration should match serving distribution:

same candidate sources,
same eligibility filters,
same surface,
same label maturity,
same feature definitions.

If candidate distribution changes, calibration can drift.

9. Calibration by Source

Predictions may be differently calibrated by candidate source.

Example:

p_click for two_tower candidates
p_click for editorial candidates
p_click for exploration candidates

Same model score can mean different actual rate because candidate source changes prior distribution.

Features can include source flags, but calibration monitoring by source remains important.

If needed:

calibration model includes source

or:

source-specific calibration

10. Calibration by Item Age

New items are often poorly calibrated.

Warm items have behavior features. New items rely on priors/content.

Monitor:

item_age < 1d
item_age 1-7d
item_age 7-30d
item_age > 30d

If new-item p_click/p_purchase is overestimated, exploration may harm users.
If underestimated, new items starve.

Cold-start calibration is important.

11. Calibration by Position

Training labels are affected by position.

At serving time the model scores a candidate before its final position is known, but the actual click depends on that position.

Possible definitions:

Position-Independent Relevance Probability

Hard to observe.

Position-Conditional Click Probability

P(click | candidate, context, position)

If model includes position, it predicts UI outcome, not pure relevance.

For ranking before position assignment, avoid using final position as feature.

But for final slate simulation/calibration, position effects may be modeled separately.

12. Ranking Score vs Display Probability

If candidate at position 1 gets more clicks than at position 10, model score alone is not enough to estimate final slate clicks.

Need distinction:

relevance_score
examination_probability(position)
display_click_probability = relevance * examination

In many systems the ranker learns from historical, position-biased data, so its scores partly reflect where items were shown rather than pure relevance. Be aware of this.

Advanced systems use propensity/examination models. For now, log position and evaluate carefully.
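
The relevance-times-examination distinction can be sketched as a simple position-based estimate of slate clicks. This assumes examination depends only on position and is independent of the item, which is a simplification. The class name and the examination values used in the note below are hypothetical.

```java
// Sketch of a position-based click estimate:
// display_click_probability = relevance * examination_probability(position).
final class SlateClickEstimate {
    /**
     * Expected clicks for a slate: sum over positions of relevance[pos] * examination[pos].
     * relevanceInSlateOrder[pos] is the position-independent relevance of the item shown at pos.
     */
    static double expectedClicks(double[] relevanceInSlateOrder, double[] examinationByPosition) {
        if (relevanceInSlateOrder.length > examinationByPosition.length) {
            throw new IllegalArgumentException("need an examination probability for every position");
        }
        double total = 0.0;
        for (int pos = 0; pos < relevanceInSlateOrder.length; pos++) {
            total += relevanceInSlateOrder[pos] * examinationByPosition[pos];
        }
        return total;
    }
}
```

For example, with examination probabilities 1.0 and 0.5 for the first two positions, showing items with relevance 0.2 then 0.1 gives 0.25 expected clicks, while the reverse order gives 0.20. The same two model scores produce different slate outcomes depending on placement.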

13. Source Score Normalization

Candidate sources output incompatible scores.

Examples:

two_tower dot product: -5..15
content cosine: -1..1
BM25: 0..100
PPR: 0..0.02
popularity CTR: 0..0.5
editorial priority: 1..10

For ranker features, provide raw score plus normalized features.

Options:

source_rank
source_rank_inverse
source_score_percentile
source_score_zscore
source_score_minmax
source_score_bucket

A raw score can be useful, but the model needs context (source ID and normalized variants) to interpret it.

14. Rank-Based Normalization

Rank-based normalization is robust.

rank_inverse = 1 / log2(rank + 1)

Example:

rank 1 -> 1.0
rank 2 -> 0.63
rank 10 -> 0.29
rank 100 -> 0.15

Rank has stable meaning within source: earlier candidate is better according to source.

Use along with source ID.
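
As a sketch, the formula above translates directly to code. The class and method names are hypothetical, and ranks are assumed to be 1-based.

```java
// rank_inverse = 1 / log2(rank + 1), written as ln(2) / ln(rank + 1).
final class RankFeatures {
    /** Inverse-log rank feature for a 1-based rank within one candidate source. */
    static double rankInverse(int rank) {
        if (rank < 1) {
            throw new IllegalArgumentException("rank is 1-based, got " + rank);
        }
        return Math.log(2.0) / Math.log(rank + 1.0);
    }
}
```

The feature decays slowly: rank 100 still keeps about 15% of the rank-1 value, so deep candidates are discounted but not erased.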

15. Percentile Normalization

Within a source result set:

percentile = rank percentile or score percentile

Example:

candidate in top 5% of content source

This is more comparable across requests/sources than raw score.

But percentile loses absolute strength.

If source returns only weak candidates, top percentile may still be bad.

Include both raw and percentile.

16. Score Z-Score

Within source/request:

z = (score - mean_score) / std_score

Useful when score distribution varies.

Problems:

unstable if few candidates,
outlier sensitive,
assumes distribution meaningful.

Use cautiously.

17. Min-Max Normalization

norm = (score - min) / (max - min)

Simple but sensitive to outliers.

If all scores close, small differences get amplified.

Usually less robust than rank/percentile.
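
The percentile, z-score, and min-max variants can be sketched together, with guards for the failure modes listed above. All names are hypothetical, and the fallback values (0.0 for a degenerate z-score, 0.5 for a degenerate min-max) are assumptions that a real system should choose deliberately. The scores array is assumed to be the non-empty result set of one source for one request.

```java
// Sketch of per-source, per-request score normalization with degenerate-case guards.
final class SourceScoreNormalizer {
    /** Fraction of scores in the result set that are <= the given score (1.0 for the top score). */
    static double percentile(double[] scores, double score) {
        int atOrBelow = 0;
        for (double s : scores) {
            if (s <= score) atOrBelow++;
        }
        return (double) atOrBelow / scores.length;
    }

    /** Z-score within the result set; 0.0 when there are fewer than 2 candidates or no variance. */
    static double zScore(double[] scores, double score) {
        int n = scores.length;
        if (n < 2) return 0.0;
        double mean = 0.0;
        for (double s : scores) mean += s;
        mean /= n;
        double variance = 0.0;
        for (double s : scores) variance += (s - mean) * (s - mean);
        double std = Math.sqrt(variance / n);
        return std == 0.0 ? 0.0 : (score - mean) / std;
    }

    /** Min-max normalization; 0.5 when the spread is below minSpread, to avoid amplifying noise. */
    static double minMax(double[] scores, double score, double minSpread) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        if (max - min < minSpread) return 0.5;
        return (score - min) / (max - min);
    }
}
```

The minSpread guard addresses the "small differences get amplified" problem: two candidates scoring 5.0 and 5.0000001 would otherwise map to 0.0 and 1.0.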

18. Raw Model Logits

Neural/GBDT may output raw logits/margins.

Transform:

prob = sigmoid(logit)

But sigmoid output may still not be calibrated.

Calibration layer:

prob_calibrated = calibration(sigmoid_or_logit)

Store:

raw_score
calibrated_score
calibration_version

For debugging, both the raw and the calibrated score are useful.

19. Utility Composition

Utility composition combines calibrated predictions.

Example:

utility =
  0.5 * p_click
  + 20.0 * p_purchase
  - 5.0 * p_hide
  - 100.0 * p_report

This is expected value thinking.

But the utility score itself is not a probability.

It is a ranking decision score that is only meaningful under a specific policy version.

Score contract:

score_name: home_utility_score
inputs:
  - p_click_calibrated
  - p_purchase_calibrated
  - p_hide_calibrated
  - p_report_calibrated
policy_version: home-utility-v8
range: unbounded real
higher_is_better: true

20. Score Composition Layers

Recommended layers:

model prediction
-> calibration
-> utility composition
-> policy adjustments
-> reranking constraints

Keep layers separate for debugging.

21. Business Value Composition

For e-commerce:

expected_profit =
  p_purchase * margin
  - p_return * return_cost
  - p_support_contact * support_cost

But user value matters.

utility =
  expected_profit
  + user_satisfaction_value * p_satisfaction
  - trust_cost * p_hide

If margin dominates, recommendations degrade long-term trust.

Use guardrails and relevance floors.

22. Risk-Aware Composition

For high-risk negative outcomes:

risk_cost = p_report * report_cost

If report/severe harm is rare but costly, even small probability matters.

However, for true policy or safety violations, do not rely on a probability penalty. Apply a hard filter.

Use risk-aware composition for soft risk, not illegal/forbidden candidates.

23. Quality Adjustments

Quality can enter as:

Feature in model

Model learns effect.

Utility multiplier

utility *= quality_multiplier

Penalty

utility -= low_quality_penalty

Hard gate

if quality < min: filter

Which one depends on severity.

For a catalog quality threshold, a hard gate may be better.
For minor quality differences, use a ranking feature or a penalty.

24. Freshness Composition

Freshness can be:

positive for news/new arrivals,
negative for stale content,
neutral for evergreen items.

Freshness boost:

freshness_boost = alpha * exp(-item_age / tau)

But blind freshness boost can overexpose low-quality new items.

Use:

freshness * quality * eligibility * exploration policy

Freshness policy should be surface/category-specific.
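
A capped, quality-gated freshness boost might look like the sketch below. The method signature, the use of hours as the age unit, and the way quality and eligibility enter the formula are assumptions for illustration, not a prescribed policy.

```java
// Sketch: freshness_boost = alpha * exp(-item_age / tau), scaled by quality, gated, and capped.
final class FreshnessBoost {
    /**
     * @param itemAgeHours      age of the item; must use the same unit as tauHours
     * @param alpha             maximum raw boost for a brand-new item
     * @param tauHours          decay constant; after tau the raw boost has dropped to alpha / e
     * @param qualityMultiplier in [0, 1]; low-quality new items get little or no boost
     * @param eligible          false if policy says this item must not receive freshness boosts
     * @param maxBoost          hard cap on the boost
     */
    static double boost(double itemAgeHours, double alpha, double tauHours,
                        double qualityMultiplier, boolean eligible, double maxBoost) {
        if (!eligible) return 0.0;
        double raw = alpha * Math.exp(-itemAgeHours / tauHours) * qualityMultiplier;
        return Math.min(raw, maxBoost);
    }
}
```

Keeping alpha, tau, and the cap as per-surface or per-category policy values, rather than constants in code, is one way to make the freshness policy surface-specific as recommended above.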

25. Exploration and Propensity

Exploration may override pure utility slightly.

Example:

final_score = utility + exploration_bonus

But exploration should log propensity.

Candidate:

{
  "source": "new_item_exploration",
  "propensity": 0.02,
  "exploration_policy": "cold-item-v3"
}

Exploration score adjustment must be controlled, capped, and monitored.

Do not hide exploration as organic score.

26. Sponsored Score Composition

Sponsored candidate may have bid and predicted relevance.

Example:

ad_score = bid * p_click * quality

But organic recommendation and sponsored ranking need disclosure and constraints.

Do not mix sponsored and organic silently.

If system includes sponsored candidates, score contract must include:

campaign_id
bid
disclosure_required
relevance_floor
frequency_cap

27. Score Clipping

Extreme scores can dominate.

Use clipping:

p in [epsilon, 1-epsilon]
utility contribution capped
boost capped
penalty capped

Example:

freshness_boost <= 0.2
business_boost <= 0.1 * base_score_range

Unbounded boosts are dangerous.
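
The clipping rules can be sketched as two small helpers. The names are hypothetical. Rejecting NaN instead of silently clipping it is a design choice assumed here, since Math.min and Math.max would otherwise propagate NaN into the final score.

```java
// Sketch of probability clipping and symmetric capping for boosts and penalties.
final class ScoreClipping {
    /** Keeps p inside [epsilon, 1 - epsilon]; rejects NaN rather than passing it through. */
    static double clipProbability(double p, double epsilon) {
        if (Double.isNaN(p)) {
            throw new IllegalArgumentException("probability is NaN");
        }
        return Math.max(epsilon, Math.min(1.0 - epsilon, p));
    }

    /** Caps a contribution to [-maxAbs, +maxAbs]; works for both boosts and penalties. */
    static double cap(double value, double maxAbs) {
        if (Double.isNaN(value)) {
            throw new IllegalArgumentException("contribution is NaN");
        }
        return Math.max(-maxAbs, Math.min(maxAbs, value));
    }
}
```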

28. Score Monotonicity

Some signals should behave monotonically.

Examples:

higher p_report should not increase utility
higher item_quality should not decrease utility all else equal
higher seen_count should not increase utility unless the domain allows repetition

GBDT/deep models may violate intuitive monotonicity.

Options:

monotonic constraints if supported,
explicit composition layer,
post-score adjustments,
tests.

For policy-critical scoring, explicit composition is safer.

29. Relevance Floor

Business/freshness/exploration boosts should not promote irrelevant items.

Use relevance floor:

if p_relevance < min:
    candidate cannot be boosted into final slate

Or:

exploration candidates must pass quality and relevance threshold

This protects user experience.
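
One way to express a relevance floor in code is to suppress positive boosts for candidates below the floor while still applying penalties. This is a sketch under that assumption. The names are hypothetical, and a stricter policy could instead remove such candidates from the boosted pool entirely.

```java
// Sketch: candidates below the relevance floor keep their base score and any penalties,
// but cannot receive positive boosts.
final class RelevanceFloor {
    static double applyAdjustment(double baseScore, double adjustment,
                                  double pRelevance, double minRelevance) {
        if (pRelevance < minRelevance) {
            // Below the floor: drop boosts (positive), keep penalties (negative).
            return baseScore + Math.min(adjustment, 0.0);
        }
        return baseScore + adjustment;
    }
}
```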

30. Score Debugging

For a candidate, debug should show:

{
  "item_id": "item_123",
  "raw_predictions": {
    "click_logit": 1.2,
    "purchase_logit": -4.1
  },
  "calibrated_predictions": {
    "p_click": 0.071,
    "p_purchase": 0.004,
    "p_hide": 0.012
  },
  "utility_components": {
    "click": 0.028,
    "purchase": 0.080,
    "hide": -0.060
  },
  "policy_adjustments": {
    "freshness_boost": 0.020,
    "repetition_penalty": -0.030
  },
  "final_score": 0.038
}

Without component debugging, score composition becomes opaque.

31. Score Logging

Log:

raw model scores
calibrated scores
calibration version
utility policy version
utility components
policy boosts/penalties
final score
rank before/after rerank

Log these for a sampled candidate pool and for the final slate.

This enables:

calibration analysis,
utility debugging,
offline simulation,
policy audit,
incident review.

32. Calibration Drift

Calibration can drift due to:

UI change,
candidate source change,
seasonality,
catalog shift,
user behavior shift,
model retrain,
feature pipeline change,
policy change.

Monitor calibration over time.

If drift detected:

recalibrate,
retrain,
segment recalibration,
rollback model/source change.

The calibration layer can be refreshed more often than the model, as long as the model's ranking quality remains stable.

33. Score Distribution Monitoring

Monitor:

raw score distribution
calibrated prediction distribution
utility distribution
component contribution distribution
top score concentration
score by source
score by segment

Sudden shift indicates:

model bug,
feature drift,
calibration mismatch,
source distribution change.

Example alert:

p_click p95 doubles after deploy

Investigate before ramping.

34. Calibration After Candidate Source Change

If a new candidate source is added, calibration can change.

Why?

Candidate pool distribution changes.

Example:

old candidates mostly popularity/source A,
new source brings semantic long-tail,
model predictions for long-tail items are poorly calibrated.

Need:

shadow log new source,
calibrate with new distribution,
retrain ranker if needed,
monitor by source.

35. Calibration for Rare Events

Rare tasks:

report
return
unsubscribe
policy issue

Hard to calibrate due to low counts.

Strategies:

aggregate over longer window,
segment less granularly,
use Bayesian smoothing,
use risk tiers,
combine with rule/safety models,
treat as guardrail/hard filter when severe.

Do not overtrust tiny rare-event probabilities.
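
Bayesian smoothing for a rare-event rate can be sketched as the posterior mean under a Beta prior. The names are hypothetical, and the prior rate and prior strength (a pseudo-count of impressions) are assumptions that would normally come from a broader segment or a longer window.

```java
// Sketch: smoothed rate = (events + priorRate * priorStrength) / (impressions + priorStrength).
// Equivalent to the posterior mean under a Beta(priorRate * s, (1 - priorRate) * s) prior.
final class SmoothedRate {
    static double smoothed(long events, long impressions, double priorRate, double priorStrength) {
        if (priorStrength <= 0.0 || events < 0 || impressions < events) {
            throw new IllegalArgumentException("need priorStrength > 0 and 0 <= events <= impressions");
        }
        return (events + priorRate * priorStrength) / (impressions + priorStrength);
    }
}
```

With a prior rate of 0.001 and prior strength of 1000, a segment with 1 report in 100 impressions gets a smoothed rate of about 0.0018 instead of the raw 0.01. The estimate moves toward the raw rate only as evidence accumulates.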

36. Calibration for Enterprise

Enterprise outcomes are often sparse and delayed.

Examples:

case resolution success
rework
policy violation
SLA improvement

Calibration needs domain review.

For high-stakes action ranking, a predicted probability should not be the sole basis for the decision. Use:

hard rules,
expert validation,
calibrated risk tiers,
confidence intervals,
audit logs.

Score composition should be explainable.

37. Confidence and Uncertainty

Predicted score should ideally include uncertainty.

Useful for:

cold-start,
rare categories,
low-support items,
exploration,
risk management.

Approximate uncertainty features:

training support count
item impression count
feature missing count
model ensemble variance
prediction interval
calibration bucket support

Use uncertainty to:

avoid overconfident cold items,
guide exploration,
require fallback/human review.

38. Score Composition Testing

Test utility composer:

higher p_purchase increases score
higher p_hide decreases score
higher p_report decreases score strongly
freshness boost capped
business boost cannot override relevance floor
NaN predictions rejected
missing calibration version fails

Unit tests for score composition are essential.

Score bugs can alter product behavior drastically.
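
A few of the checks above can be sketched without any test framework. The composer below is a stand-in written for this example, with hypothetical signed weights. It is not the production composer, and a real test suite would exercise the actual class and its policy file.

```java
// Sketch of property-style checks for a utility composer, using plain Java assertions-by-exception.
final class UtilityComposerChecks {
    // Hypothetical signed weights: penalties are stored as negative numbers.
    static final double W_CLICK = 0.4, W_PURCHASE = 20.0, W_HIDE = -5.0, W_REPORT = -100.0;

    /** Stand-in composer: rejects NaN and out-of-range inputs, then sums signed contributions. */
    static double utility(double pClick, double pPurchase, double pHide, double pReport) {
        for (double p : new double[] {pClick, pPurchase, pHide, pReport}) {
            if (Double.isNaN(p) || p < 0.0 || p > 1.0) {
                throw new IllegalArgumentException("prediction must be a probability in [0, 1], got " + p);
            }
        }
        return W_CLICK * pClick + W_PURCHASE * pPurchase + W_HIDE * pHide + W_REPORT * pReport;
    }

    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    /** Runs all checks; throws AssertionError on the first failure. */
    static void runAll() {
        double base = utility(0.05, 0.01, 0.01, 0.001);
        check(utility(0.05, 0.02, 0.01, 0.001) > base, "higher p_purchase must increase score");
        check(utility(0.05, 0.01, 0.02, 0.001) < base, "higher p_hide must decrease score");
        check(utility(0.05, 0.01, 0.01, 0.002) < base, "higher p_report must decrease score");

        // "Strongly": the same +0.001 step must cost more on p_report than on p_hide.
        double reportDrop = base - utility(0.05, 0.01, 0.01, 0.002);
        double hideDrop = base - utility(0.05, 0.01, 0.011, 0.001);
        check(reportDrop > hideDrop, "p_report must be penalized more strongly than p_hide");

        boolean rejected = false;
        try {
            utility(Double.NaN, 0.01, 0.01, 0.001);
        } catch (IllegalArgumentException expected) {
            rejected = true;
        }
        check(rejected, "NaN predictions must be rejected");
    }
}
```

Checks like these are cheap to run on every policy change and catch sign mistakes before they reach an experiment.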

39. Policy Versioning

Utility policy must be versioned.

utility_policy: home-feed-utility-v9
weights:
  p_click: 0.4
  p_purchase: 20.0
  p_hide: -5.0
  p_report: -100.0
boosts:
  freshness_max: 0.2
  exploration_max: 0.1
constraints:
  relevance_floor: 0.01
owner: ranking-platform
approved_at: 2026-07-02

Log policy version with every response.

40. Offline Simulation of Score Policies

Before changing weights:

Replay logged candidate sets.
Apply new score composition.
Compare final order.
Measure proxy metrics.
Inspect source/category/new-item distribution.
Check guardrail predictions.
Run shadow/canary.

Offline simulation is not proof, but catches obvious issues.
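
The "compare final order" step can be sketched as a top-k overlap between the old and new scores for one logged candidate set. The class name and the choice of top-k overlap as the comparison metric are assumptions. Real replays usually add proxy metrics and distribution checks on top.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: how much of the old top-k survives when logged candidates are re-scored with a new policy.
final class PolicyReplay {
    /** Indices of the k highest-scoring candidates; ties keep original order (stable sort). */
    static Integer[] topK(double[] scores, int k) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a])); // descending by score
        return Arrays.copyOf(idx, Math.min(k, idx.length));
    }

    /** Fraction of the old top-k that is still in the new top-k for one logged request. */
    static double topKOverlap(double[] oldScores, double[] newScores, int k) {
        if (oldScores.length != newScores.length) {
            throw new IllegalArgumentException("both score arrays must cover the same candidates");
        }
        Set<Integer> oldTop = new HashSet<>(Arrays.asList(topK(oldScores, k)));
        if (oldTop.isEmpty()) return 1.0; // nothing to lose
        int kept = 0;
        for (Integer i : topK(newScores, k)) {
            if (oldTop.contains(i)) kept++;
        }
        return (double) kept / oldTop.size();
    }
}
```

Averaging this overlap across replayed requests, and breaking it down by source or category, gives a quick read on how disruptive a weight change is before any traffic sees it.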

41. A/B Testing Score Composition

Score composition change can be tested without retraining model.

Experiment:

same model predictions
different utility weights

This isolates policy effect.

Metrics:

primary objective,
negative feedback,
conversion,
retention,
diversity,
source mix,
latency (should be unchanged).

Useful for product tuning.

42. Common Anti-Patterns

42.1 Add Raw Scores from Different Sources

Cosine + dot product + CTR without normalization.

42.2 Treat Raw Logit as Probability

Wrong expected utility.

42.3 No Segment Calibration

Global calibrated, segment broken.

42.4 Hard Safety as Soft Penalty

Unsafe candidates can still win.

42.5 Hidden Business Weights

No governance/audit.

42.6 Unbounded Boosts

Freshness/business/exploration dominates relevance.

42.7 Calibration on Test/Future Data

Offline results inflated.

42.8 No Component Logging

Cannot debug score behavior.

42.9 Score Policy Changed Without Experiment

Product behavior shifts silently.

42.10 Rare Risk Prediction Overtrusted

High-stakes bad decisions.

43. Implementation Sketch: Calibration Layer

public interface Calibrator {
    double calibrate(String taskName, double rawScore, CalibrationContext context);
}

public record CalibrationContext(
    String surface,
    String category,
    String source,
    String segment,
    String calibrationVersion
) {}

Example Platt calibrator:

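import java.util.Map;

public final class PlattCalibrator implements Calibrator {
    /** Platt parameters: calibrated = sigmoid(a * rawScore + b). */
    public record Parameters(double a, double b) {}

    // Keyed by segment; must contain a "global" entry, which is the fallback.
    private final Map<String, Parameters> paramsBySegment;

    public PlattCalibrator(Map<String, Parameters> paramsBySegment) {
        this.paramsBySegment = paramsBySegment;
    }

    @Override
    public double calibrate(String taskName, double rawScore, CalibrationContext context) {
        // taskName is unused in this sketch; see the note on task-specific params below.
        Parameters p = paramsBySegment.getOrDefault(context.segment(), paramsBySegment.get("global"));
        return sigmoid(p.a() * rawScore + p.b());
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
}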

Production version needs task-specific params, versioning, and fallback.

44. Implementation Sketch: Utility Composer

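import java.util.LinkedHashMap;
import java.util.Map;

public final class UtilityComposer {
    private final UtilityPolicy policy;

    public UtilityComposer(UtilityPolicy policy) {
        this.policy = policy;
    }

    public UtilityResult compose(CalibratedPredictions p, CandidateContext c) {
        // LinkedHashMap keeps components in insertion order for readable debug output.
        Map<String, Double> components = new LinkedHashMap<>();

        // Policy weights are signed (p_hide: -5.0, p_report: -100.0 in the policy file),
        // so every component is added as-is. Negating a negative weight would reward hides and reports.
        components.put("click", policy.weight("p_click") * p.pClick());
        components.put("purchase", policy.weight("p_purchase") * p.pPurchase());
        components.put("hide", policy.weight("p_hide") * p.pHide());
        components.put("report", policy.weight("p_report") * p.pReport());

        double base = components.values().stream().mapToDouble(Double::doubleValue).sum();

        // Both helpers (not shown in this sketch) return non-negative magnitudes; the sign is applied here.
        double freshness = cappedFreshnessBoost(c);
        double repetition = repetitionPenalty(c);

        double finalScore = base + freshness - repetition;

        // Log the penalty with its sign so components and adjustments sum to the final score.
        return new UtilityResult(finalScore, components, Map.of(
            "freshness_boost", freshness,
            "repetition_penalty", -repetition
        ), policy.version());
    }
}

Pick one sign convention and use it everywhere. This sketch assumes the policy stores signed weights, as in the policy file in section 39, so the composer adds every component without negating it. Many bugs come from negative weight mistakes.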

45. Minimal Production Calibration Plan

Start with:

tasks:
  - click_30m
  - purchase_7d
  - hide_7d
calibration:
  method: platt_or_isotonic
  data: validation_period_after_training
  segments:
    - surface
    - major_category
    - candidate_source
monitoring:
  - calibration_curve
  - expected_calibration_error
  - score_distribution
utility:
  policy_versioned: true
  components_logged: true
guardrails:
  - report_rate
  - hide_rate
  - return_rate
  - latency

For rare report and safety outcomes, use guardrails and policy filters rather than relying only on a calibrated model.

46. Score Calibration and Composition Readiness Checklist

[ ] Every score has documented semantics.
[ ] Raw model scores are distinguished from probabilities.
[ ] Task predictions are calibrated if used as probabilities.
[ ] Calibration data is held-out and temporal.
[ ] Calibration is monitored globally and by segment.
[ ] Candidate source scores are normalized before broad use.
[ ] Utility composition policy is versioned.
[ ] Loss weights are not confused with utility weights.
[ ] Hard constraints remain filters.
[ ] Boosts and penalties are capped.
[ ] Relevance floors exist where needed.
[ ] Score components are logged.
[ ] Calibration version and utility policy version are logged.
[ ] Offline score-policy simulation exists.
[ ] A/B testing is used for major score policy changes.
[ ] Rare/high-risk tasks are handled conservatively.

47. Conclusion

Score calibration and composition are the layer that makes multi-signal ranking trustworthy.

Key principles:

Every score needs semantics.
Similarity, rank, logit, probability, and utility are different.
Calibration means predicted probability matches observed frequency.
Calibration matters for expected value and multi-objective ranking.
Calibration must be monitored by segment.
Source scores should not be mixed raw.
Utility composition should use calibrated predictions and versioned weights.
Hard safety/access/policy constraints must remain filters.
Boosts/penalties need caps and relevance floors.
Score components must be logged for debugging and governance.

In Part 042, we will cover Ranking Service Design: how to build a production-grade ranking service that accepts a candidate pool, fetches features, scores in batches, composes utility, exposes debug traces, meets latency SLOs, and is safe to operate.
