title: Build From Scratch Recommendations System - Part 033
description: "Formulating a production-grade ranking problem: candidate pool, objective, labels, utility, constraints, ranking context, pointwise vs listwise thinking, position bias, calibration, multi-objective trade-offs, and offline-online alignment."
series: learn-build-from-scratch-recommendations-system
seriesTitle: "Build From Scratch: Enterprise Recommendations System"
order: 33
partTitle: Ranking Problem Formulation
tags:
- recommendation-system
- recsys
- ranking
- learning-to-rank
- machine-learning
- evaluation
- series
date: 2026-07-02
Part 033 — Ranking Problem Formulation
Candidate generation answers:
Which items might be relevant?
Ranking answers:
Of these valid candidates, which are the most useful to show right now, and in what order?
The ranking layer is where many signals meet:
- user preference,
- item quality,
- context,
- source score,
- business objective,
- policy,
- diversity,
- novelty,
- freshness,
- long-term value,
- marketplace health,
- safety guardrails.
Ranking is not merely a model that predicts click probability.
Ranking is a decision formulation.
If the problem formulation is wrong, a model can be highly accurate on the wrong metric and still be bad for the product.
This part covers how to formulate a production-grade ranking problem: ranking unit, objective, labels, utility, candidate context, slate context, bias, calibration, multi-objective trade-offs, constraints, offline-online alignment, and failure modes.
1. Mental Model: Ranking Is Utility Ordering Under Constraints
The ranking system receives a candidate pool:
C = {candidate_1, candidate_2, ..., candidate_n}
For request context x, ranking produces an ordering:
ordered candidates = sort_by_utility(C, x)
But utility is not just:
P(click)
Utility can be:
expected user value
+ expected business value
+ expected long-term value
- expected harm/risk
- expected fatigue
subject to constraints
Production-grade ranking:
maximize expected utility of slate
while satisfying safety, eligibility, policy, latency, and product constraints
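The mental model above can be sketched in a few lines. This is a minimal illustration, not the series' actual implementation: `UtilityOrdering`, `rank`, `satisfiesConstraints`, and `utility` are hypothetical names, and the sketch scores each candidate independently, so it ignores slate-level effects that later sections discuss.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

/** Sketch only: hard constraints filter first, then survivors are ordered by utility. */
final class UtilityOrdering {
    private UtilityOrdering() {}

    static <C> List<C> rank(List<C> candidates,
                            Predicate<C> satisfiesConstraints, // hypothetical eligibility/policy check
                            ToDoubleFunction<C> utility,       // hypothetical per-candidate utility
                            int slateSize) {
        return candidates.stream()
                .filter(satisfiesConstraints)                           // constraints are a hard boundary
                .sorted(Comparator.comparingDouble(utility).reversed()) // highest utility first
                .limit(slateSize)
                .collect(Collectors.toList());
    }
}
```

The point of the shape is the order of operations: a candidate that violates a constraint never competes on utility, no matter how high its score is.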
2. Ranking Is Not Retrieval
Retrieval/candidate generation optimizes recall.
Ranking optimizes precision and utility among candidates.
Retrieval may use approximate similarity:
user embedding dot item embedding
Ranking can use richer features:
user-item cross features
context
item quality
source provenance
business value
freshness
negative feedback
availability
Retrieval can be imprecise and still be useful, as long as the good candidates are in the pool. Ranking refines the order.
Do not overload the retrieval model to solve everything. It cannot evaluate all cross features or slate constraints cheaply.
3. Ranking Unit
Define what is being ranked.
Examples:
Item Ranking
rank products/videos/articles
Offer Ranking
rank seller offers for same product
Action Ranking
rank next best actions for case
Document Ranking
rank knowledge articles
Creator Ranking
rank creators/channels
Bundle Ranking
rank groups of items
Slate Ranking
rank an ordered list, not individual items independently
The unit affects labels, features, constraints, and evaluation.
4. Candidate Pool Is Part of Ranking Problem
Ranker only sees candidates from retrieval.
If candidate distribution changes, ranking changes.
Training data must match production candidate pool.
Problems:
- ranker trained on old retrieval candidates,
- new two-tower source adds different items,
- exploration source adds cold items,
- ranker suppresses unknown source,
- ranker overfits source rank.
Ranking formulation must include candidate source provenance.
Candidate features:
source flags
source scores
source ranks
source count
retrieval model version
candidate generation policy
Ranker learns within a candidate ecosystem.
5. Ranking Context
Ranking depends on request context.
Context includes:
user identity/state
session intent
surface
device
time
region
locale
query
seed item
cart
candidate source policy
experiment variant
privacy mode
enterprise case state
actor role
Same candidate can have different utility in different context.
Example:
camera battery
High utility on a cart that contains a camera, low utility on the homepage.
Ranking model must receive enough context to distinguish.
6. Surface-Specific Ranking
Different surfaces have different intent and metrics.
Home Feed
Discovery, engagement, retention, diversity.
Product Detail
Alternatives, complements, comparison.
Cart/Checkout
Attach, conversion, low regret.
Search
Explicit query relevance, conversion.
Email/Push
Open, long-term trust, frequency fatigue.
Enterprise Case
Task success, correctness, compliance, auditability.
One universal ranker may underperform if surface semantics differ.
Options:
- separate ranker per surface,
- shared model with surface features,
- shared backbone with surface-specific heads,
- surface-specific calibration.
Start with surface-specific formulation if behavior is very different.
7. Label Choice
Ranking model usually predicts labels from user feedback.
Labels:
click
long dwell
add_to_cart
purchase
watch completion
save/like
hide/not interested
report
return/refund
case action accepted
case action success
article useful
Label choice defines what model optimizes.
If you train on click, model learns clickability.
If you train on purchase, model learns conversion but with sparse delayed labels.
If you train on watch completion, model learns deeper engagement.
If you train on accepted action, model may learn user convenience, not actual case outcome.
Choose label aligned with product objective.
8. Click Is Not Enough
Click is common because it is frequent and fast.
But click can be misleading:
- clickbait,
- accidental clicks,
- curiosity without satisfaction,
- UI bias,
- title/image attractiveness,
- short-term engagement over long-term trust.
Click can be useful as one task, but not the only objective.
Production ranking often uses multiple labels:
click
conversion
satisfaction
negative feedback
long-term retention
9. Conversion Labels
Conversion depends on domain.
E-commerce:
purchase
add_to_cart
checkout
repeat purchase
low return
Content:
watch complete
read complete
follow creator
return next day
Enterprise:
action executed
case resolved
SLA met
supervisor approved
no rework
article marked useful
Conversion labels are often delayed and sparse.
They need explicit label windows and maturity rules.
10. Negative Labels
Negative feedback is valuable.
Examples:
hide
not interested
dislike
report
refund
return
complaint
dismiss action as irrelevant
supervisor rejects action
But negative labels have different severity.
hide may indicate preference.
report may indicate safety/policy issue.
return may indicate post-purchase dissatisfaction, item quality, logistics, or expectation mismatch.
Do not collapse all negatives into one binary label blindly.
11. Non-Action Is Ambiguous
Candidate shown but not clicked is often used as negative.
But no-click could mean:
- not seen,
- user busy,
- bad position,
- item irrelevant,
- user already satisfied,
- UI issue,
- candidate below fold,
- user not ready.
No-click can be weak negative for CTR, but not strong dislike.
Ranking formulation must consider exposure, visibility, and position.
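One way to encode this ambiguity is to turn each logged impression into an optional, weighted training example. The sketch below is an assumption-laden illustration: `ImpressionLabeler`, the `visible` flag, and the 0.3 weak-negative weight are all hypothetical, and the weight maps onto the `example_weight` field used in the example row of the next section.

```java
import java.util.Optional;

/** Sketch only: clicked = positive, seen-but-not-clicked = weak negative, unseen = no example. */
final class ImpressionLabeler {
    private ImpressionLabeler() {}

    /** Hypothetical weight for a visible impression without a click; tune per surface. */
    static final double WEAK_NEGATIVE_WEIGHT = 0.3;

    record Impression(boolean clicked, boolean visible) {}

    record TrainingExample(int label, double weight) {}

    static Optional<TrainingExample> toExample(Impression impression) {
        if (impression.clicked()) {
            return Optional.of(new TrainingExample(1, 1.0));
        }
        if (!impression.visible()) {
            // Never seen (for example below the fold): no evidence either way, so drop it.
            return Optional.empty();
        }
        // Seen but not clicked: a weak negative for CTR, not a strong dislike.
        return Optional.of(new TrainingExample(0, WEAK_NEGATIVE_WEIGHT));
    }
}
```

Dropping unseen impressions requires visibility logging at serving time; without it, every no-click looks the same.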
12. Ranking Example Row
A pointwise ranking example:
{
  "request_id": "req_001",
  "candidate_id": "item_123",
  "prediction_time": "2026-07-02T10:00:00Z",
  "surface": "home_feed",
  "position_logged": 3,
  "features": {
    "user_category_affinity": 0.71,
    "item_quality": 0.84,
    "source_two_tower_score": 8.2,
    "source_trending_rank": 12,
    "user_has_seen_item_7d": false
  },
  "labels": {
    "clicked_30m": 1,
    "purchased_7d": 0,
    "hide_7d": 0
  },
  "example_weight": 1.0
}
For production, include:
- source provenance,
- experiment variant,
- feature version,
- label version,
- visibility/impression quality,
- position/layout.
13. Objective vs Label vs Metric
These are different.
Objective
What the system should optimize.
increase satisfied purchases without increasing returns
Label
Observed signal used for training.
purchase within 7d
return within 30d
Metric
How we evaluate.
online conversion, return rate, revenue, NDCG, retention
Objective may require multiple labels and guardrails.
Do not confuse label with objective.
14. Utility Function
Ranking can be framed as expected utility.
Example e-commerce:
utility =
P(purchase) * expected_margin
- P(return) * return_cost
- P(report) * safety_cost
+ P(long_term_satisfaction) * retention_value
Content:
utility =
P(watch_complete) * user_value
- P(hide) * trust_cost
- repetition_penalty
+ novelty_bonus
Enterprise:
utility =
P(action_success) * case_value
- P(rework) * operational_cost
- policy_risk
The model may predict components separately. Ranking policy composes them.
15. Single-Objective Ranking
Simple formulation:
score = P(click)
Good for:
- early baseline,
- high-volume engagement surfaces,
- simple objective.
Bad if:
- clickbait,
- conversion matters,
- negative feedback matters,
- long-term value matters,
- business constraints matter.
Single-objective is starting point, not final destination.
16. Multi-Objective Ranking
Predict multiple outcomes:
p_click
p_purchase
p_hide
p_report
p_return
p_satisfaction
Then compose:
score =
w1 * p_click
+ w2 * p_purchase
- w3 * p_hide
- w4 * p_report
- w5 * p_return
Weights are product decisions.
Important:
- predictions should be calibrated enough,
- weights should be reviewed,
- guardrails should exist,
- online experiments required.
Multi-objective ranking is common in production.
17. Guardrail Metrics
Some metrics should not be optimized directly but constrained.
Examples:
report rate must not increase
return rate must not increase
latency must stay below SLO
exposure fairness must not degrade
policy violations must be zero
creator concentration must be bounded
Guardrail:
improve primary metric while maintaining guardrails
The ranking formulation should state which metrics are optimized and which are guarded.
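A guardrail can be written down as a bound on how much a metric may move relative to control. The sketch below assumes "lower is better" metrics and compares raw point estimates; `GuardrailCheck`, `Guardrail`, and `maxRelativeIncrease` are hypothetical names, and a real experiment readout would also need confidence intervals rather than a plain comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch only: flags guardrail metrics where treatment exceeds the allowed increase over control. */
final class GuardrailCheck {
    private GuardrailCheck() {}

    /** A "lower is better" metric that may rise by at most maxRelativeIncrease (0.05 = 5%). */
    record Guardrail(String metric, double maxRelativeIncrease) {}

    /** Assumes both maps contain a value for every guardrail metric. */
    static List<String> violations(List<Guardrail> guardrails,
                                   Map<String, Double> control,
                                   Map<String, Double> treatment) {
        List<String> violated = new ArrayList<>();
        for (Guardrail g : guardrails) {
            double limit = control.get(g.metric()) * (1.0 + g.maxRelativeIncrease());
            if (treatment.get(g.metric()) > limit) {
                violated.add(g.metric());
            }
        }
        return violated;
    }
}
```

A variant that improves the primary metric but returns a non-empty violation list does not ship, which is what separates a guardrail from one more weighted objective.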
18. Position Bias
Historical labels depend on position.
Items at top get more attention.
If model trains on clicks without accounting for position, it may learn historical ranker bias.
Example:
item A clicked because position 1
item B not clicked because position 20
Mitigations:
- visibility logging,
- position features for debiasing/evaluation,
- propensity weighting,
- randomized exploration,
- train on candidate sets with position-aware labels,
- avoid using final position as serving feature for pre-position ranker.
Position is logging context, not always serving feature.
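Propensity weighting, one of the mitigations listed above, can be sketched as an inverse-propensity weight on clicked examples. The assumptions are loud here: the per-position examination propensities are taken as given (in practice they must be estimated, for example from randomized exploration), and `PositionDebiasing`, `clickWeight`, and `maxWeight` are hypothetical names.

```java
/** Sketch only: inverse-propensity weights for clicked examples, clipped to limit variance. */
final class PositionDebiasing {
    private PositionDebiasing() {}

    /**
     * @param positionLogged        1-based position at which the item was shown
     * @param examinationPropensity assumed probability that each position is examined;
     *                              index 0 holds position 1
     * @param maxWeight             clipping bound so rare deep-position clicks do not dominate
     */
    static double clickWeight(int positionLogged, double[] examinationPropensity, double maxWeight) {
        double propensity = examinationPropensity[positionLogged - 1];
        if (propensity <= 0.0) {
            throw new IllegalArgumentException("propensity must be positive");
        }
        return Math.min(1.0 / propensity, maxWeight);
    }
}
```

The weight is computed from the logged position at training time only. It is not a serving feature, which keeps the sketch consistent with the last mitigation in the list.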
19. Presentation Bias
Labels depend on UI.
Examples:
- image size,
- badge,
- sponsored label,
- price display,
- layout,
- autoplay,
- title truncation,
- notification copy.
Ranking data should log presentation context.
If UI changes, label distribution changes.
Ranker trained on old UI may not transfer.
20. Selection Bias
Training data only includes shown candidates.
Unshown candidates have unknown outcome.
Ranking model learns under historical candidate generation and ranking policy.
This means:
observed feedback is conditional on logging policy
Mitigations:
- exploration,
- propensity logging,
- counterfactual evaluation,
- source diversity,
- retraining after source changes,
- careful offline validation.
21. Candidate Set Context
Ranker often scores candidates independently, but user sees slate.
Candidate utility can depend on other candidates:
- duplicates,
- diversity,
- category balance,
- creator repetition,
- price range,
- sequence,
- complementarity,
- fairness exposure.
A pointwise ranker gives each item a score. A reranker or slate optimizer handles list context.
But the ranking formulation should still state that:
ranking is eventually slate-level
22. Pointwise Ranking Formulation
Pointwise:
predict label for each candidate independently
Example:
P(click | user, item, context)
Pros:
- simple,
- scalable,
- works with standard classifiers/regressors,
- easy to calibrate,
- easy to debug.
Cons:
- does not directly optimize ordering,
- ignores slate context,
- no pair/list objective,
- position bias issues.
Pointwise is common starting point.
23. Pairwise Ranking Formulation
Pairwise:
for same request/user, positive item should score higher than negative item
Example:
clicked item > skipped item
purchased item > viewed-only item
Pros:
- closer to ranking,
- handles relative preference,
- useful for learning-to-rank.
Cons:
- pair generation cost,
- negative sampling complexity,
- can lose calibration,
- pair labels can be noisy.
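A minimal sketch of the pairwise idea: build preference pairs inside one request group, then penalize pairs that are scored in the wrong order. The logistic loss used here is one common choice, not necessarily the one this series adopts, and `PairwiseSketch`, `Scored`, and `Pair` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: pair generation within one group plus a pairwise logistic loss. */
final class PairwiseSketch {
    private PairwiseSketch() {}

    record Scored(String candidateId, double score, int label) {}

    record Pair(Scored preferred, Scored other) {}

    /** All (higher-label, lower-label) pairs from candidates of the same request. */
    static List<Pair> pairs(List<Scored> group) {
        List<Pair> out = new ArrayList<>();
        for (Scored a : group) {
            for (Scored b : group) {
                if (a.label() > b.label()) {
                    out.add(new Pair(a, b));
                }
            }
        }
        return out;
    }

    /** log(1 + exp(-(s_preferred - s_other))): small when the preferred item scores higher. */
    static double logisticLoss(Pair pair) {
        double margin = pair.preferred().score() - pair.other().score();
        return Math.log1p(Math.exp(-margin));
    }
}
```

The nested loop is quadratic in group size, which is the "pair generation cost" in the cons list; production systems usually sample pairs instead of enumerating them.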
24. Listwise Ranking Formulation
Listwise:
optimize ordering of entire list/slate
Pros:
- closest to ranking metrics,
- can optimize NDCG-like objectives,
- understands group context better.
Cons:
- more complex,
- needs query/request groups,
- harder training/infrastructure,
- slate feedback sparse/noisy.
Part 034 will go deeper into pointwise/pairwise/listwise.
25. Ranking Group
For learning-to-rank, examples are grouped.
Group can be:
request_id
search query
session request
recommendation response
candidate pool snapshot
case context
Group contains candidates competing for same slots.
If the group is lost, pairwise/listwise training breaks.
Training dataset should keep:
group_id
candidate_id
features
labels
position
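Preserving the group can be as simple as keeping `group_id` on every row and regrouping before training. The sketch below is illustrative: `RankingGroups` and `Row` are hypothetical names, the feature payload is omitted for brevity, and the "ordering signal" check is an assumption about what pairwise/listwise training needs, not a rule from this series.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch only: regroup flat training rows by group_id and check each group is usable. */
final class RankingGroups {
    private RankingGroups() {}

    /** Features are left out here to keep the sketch short. */
    record Row(String groupId, String candidateId, int position, int label) {}

    /** Groups rows by group_id, keeping first-seen group order. */
    static Map<String, List<Row>> byGroup(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(Row::groupId, LinkedHashMap::new, Collectors.toList()));
    }

    /** A group with only positives or only negatives gives no relative-order signal. */
    static boolean hasOrderingSignal(List<Row> group) {
        boolean anyPositive = group.stream().anyMatch(r -> r.label() > 0);
        boolean anyNegative = group.stream().anyMatch(r -> r.label() == 0);
        return anyPositive && anyNegative;
    }
}
```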
26. Ranking and Calibration
If score is used as probability or combined with business value, calibration matters.
Example:
score = P(purchase) * margin
If P(purchase) is uncalibrated, the utility is wrong.
Calibration needed for:
- multi-objective score composition,
- thresholding,
- bid/sponsored systems,
- risk-sensitive enterprise decisions,
- expected value ranking.
Calibration matters less if the score is only used to order candidates within the same model/source.
Monitor calibration by segment.
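One simple way to monitor calibration is expected calibration error (ECE) over equal-width probability bins: in each bin, compare the mean predicted probability with the observed positive rate. This is a sketch under assumptions: predictions lie in [0, 1], labels are 0/1, and `CalibrationSketch` is a hypothetical name. To monitor by segment, call it once per segment's rows.

```java
/** Sketch only: expected calibration error with equal-width bins over [0, 1]. */
final class CalibrationSketch {
    private CalibrationSketch() {}

    static double expectedCalibrationError(double[] predicted, int[] labels, int bins) {
        double[] sumPredicted = new double[bins];
        double[] sumLabel = new double[bins];
        int[] count = new int[bins];
        for (int i = 0; i < predicted.length; i++) {
            // A prediction of exactly 1.0 falls into the last bin.
            int bin = Math.min((int) (predicted[i] * bins), bins - 1);
            sumPredicted[bin] += predicted[i];
            sumLabel[bin] += labels[i];
            count[bin]++;
        }
        double ece = 0.0;
        for (int b = 0; b < bins; b++) {
            if (count[b] == 0) {
                continue; // empty bins contribute nothing
            }
            double gap = Math.abs(sumPredicted[b] / count[b] - sumLabel[b] / count[b]);
            ece += gap * count[b] / predicted.length; // weight each bin by its share of examples
        }
        return ece;
    }
}
```

A value near zero means predicted probabilities match observed rates bin by bin. It says nothing about ordering quality, so it complements ranking metrics rather than replacing them.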
27. Score Composition
Ranking score can be:
model_score
or composed:
final_score =
model_score
+ business_boost
+ freshness_boost
- repetition_penalty
Prefer explicit composition layer instead of hiding everything inside labels.
Example:
ranker predicts p_purchase
policy layer applies margin, availability, frequency, diversity
This is more debuggable.
28. Ranking Features
Feature groups:
User
affinity, history, segment, consent, lifecycle
Item
category, quality, price, freshness, popularity
Context
surface, device, time, region, query, cart, case state
Cross
user-item affinity, embedding similarity, seen count, price fit
Source
retrieval source flags, scores, ranks, source count
System
experiment, model version, layout
Feature design will be Part 035.
29. Ranking Label Windows
Each label needs a window.
click within 30m
purchase within 7d
return within 30d after purchase
watch completion within session
article useful within case lifecycle
Longer windows:
- more complete,
- slower training,
- more delayed.
Shorter windows:
- faster,
- noisier/incomplete.
Use label maturity: only treat a label as final once its window has fully elapsed.
30. Delayed Outcomes
Some outcomes appear long after impression.
Examples:
- purchase after days,
- return after weeks,
- retention after months,
- case resolution after days/weeks.
Ranking model may need:
- fast proxy labels,
- delayed correction labels,
- multi-stage models,
- periodic retraining,
- long-term value model.
Do not label incomplete delayed outcomes as negative.
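The rule above can be made mechanical with a three-state label: positive, negative, or not yet known. The sketch below is an illustration with hypothetical names (`LabelMaturity`, `label`); it assumes the outcome timestamp is `null` until the outcome is observed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Sketch only: a delayed label is negative only after its window has fully elapsed. */
final class LabelMaturity {
    private LabelMaturity() {}

    /**
     * @param outcomeTime time of the outcome (e.g. purchase), or null if none observed so far
     * @return 1 if the outcome fell inside the window, 0 if the window elapsed without it,
     *         empty while the label is still immature
     */
    static Optional<Integer> label(Instant impressionTime, Instant outcomeTime,
                                   Duration window, Instant now) {
        Instant windowEnd = impressionTime.plus(window);
        if (outcomeTime != null
                && !outcomeTime.isBefore(impressionTime)
                && !outcomeTime.isAfter(windowEnd)) {
            return Optional.of(1);
        }
        if (now.isBefore(windowEnd)) {
            return Optional.empty(); // window still open: unknown, not negative
        }
        return Optional.of(0);
    }
}
```

Rows that come back empty are excluded from the training set for that label and picked up by a later retraining run, once the window has closed.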
31. Ranking for Enterprise Actions
Enterprise ranking often has stricter formulation.
Candidate universe:
valid actions/documents only
Labels:
action accepted
action completed
case progressed
SLA improved
supervisor approved
no rework
article marked useful
Constraints:
permission
case state
jurisdiction
policy
audit
explainability
Utility:
task success > engagement
A high click/accept rate may be bad if action causes rework.
Ranking must optimize outcome, not just user convenience.
32. Ranking vs Reranking
Ranking:
assign score to candidates
Reranking:
construct final slate with constraints
Reranking handles:
- diversity,
- novelty,
- frequency cap,
- fairness,
- source mix,
- business constraints,
- slate-level rules.
Do not force ranker to solve all slate constraints alone.
But ranker should output enough calibrated/meaningful scores for reranker.
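To make the split concrete, here is a minimal greedy reranker that consumes ranker scores and enforces one slate-level rule, a per-category cap. It is a sketch, not the series' reranker: `GreedyReranker`, `Scored`, and `maxPerCategory` are hypothetical, and real rerankers typically combine several constraints.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: greedy slate construction with a per-category cap as the diversity rule. */
final class GreedyReranker {
    private GreedyReranker() {}

    record Scored(String itemId, String category, double score) {}

    /** Walks candidates in descending score order and skips items whose category is full. */
    static List<Scored> rerank(List<Scored> scored, int slateSize, int maxPerCategory) {
        List<Scored> sorted = new ArrayList<>(scored);
        sorted.sort((a, b) -> Double.compare(b.score(), a.score()));
        Map<String, Integer> usedPerCategory = new HashMap<>();
        List<Scored> slate = new ArrayList<>();
        for (Scored candidate : sorted) {
            if (slate.size() == slateSize) {
                break;
            }
            int used = usedPerCategory.getOrDefault(candidate.category(), 0);
            if (used >= maxPerCategory) {
                continue; // slate-level rule overrides the item-level score
            }
            usedPerCategory.put(candidate.category(), used + 1);
            slate.add(candidate);
        }
        return slate;
    }
}
```

The ranker never needs to know about the cap. It only has to produce scores that are comparable across categories, which is why meaningful scores matter to the reranker.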
33. Offline Ranking Evaluation
Metrics:
- AUC,
- log loss,
- precision@K,
- recall@K,
- NDCG@K,
- MAP,
- calibration error,
- coverage,
- diversity,
- guardrail metrics.
Metric depends on formulation.
For ranking, request-grouped metrics like NDCG@K are often more relevant than global AUC.
But offline metrics are proxies. Online A/B remains necessary.
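As a reference point, NDCG@K for a single request group can be computed as below. The sketch uses the common gain 2^rel - 1 with a log2(rank + 1) discount, and it takes the ideal ordering from the same logged group. Both are conventions assumed here, not choices confirmed by this series, and `NdcgSketch` is a hypothetical name. The reported metric would be the mean over groups.

```java
import java.util.Arrays;

/** Sketch only: NDCG@K for one request group, given relevance labels in ranked order. */
final class NdcgSketch {
    private NdcgSketch() {}

    /** DCG@K with gain (2^rel - 1) and discount log2(rank + 1), rank starting at 1. */
    static double dcgAtK(int[] relevanceInRankedOrder, int k) {
        double dcg = 0.0;
        int limit = Math.min(k, relevanceInRankedOrder.length);
        for (int i = 0; i < limit; i++) {
            double gain = Math.pow(2.0, relevanceInRankedOrder[i]) - 1.0;
            double discount = Math.log(i + 2.0) / Math.log(2.0); // log2(rank + 1), rank = i + 1
            dcg += gain / discount;
        }
        return dcg;
    }

    /** Returns 0.0 for groups without any positive relevance. */
    static double ndcgAtK(int[] relevanceInRankedOrder, int k) {
        int[] ideal = relevanceInRankedOrder.clone();
        Arrays.sort(ideal); // ascending, so reverse in place to get the ideal order
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idealDcg = dcgAtK(ideal, k);
        return idealDcg == 0.0 ? 0.0 : dcgAtK(relevanceInRankedOrder, k) / idealDcg;
    }
}
```

Because the input is ordered per request, this metric depends on `group_id` being preserved in the training and evaluation data.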
34. Offline-Online Gap
Reasons offline ranking improvement may not convert online:
- leakage,
- biased labels,
- wrong candidate distribution,
- metric not aligned,
- position bias,
- ranker distribution shift,
- calibration issues,
- reranker changes,
- UI effects,
- delayed outcomes not captured,
- guardrail degradation.
Ranking formulation should include expected online experiment plan.
35. Ranking Baselines
Before complex model, compare against:
- source score ranking,
- popularity ranking,
- content similarity ranking,
- item quality ranking,
- simple logistic regression/GBDT,
- previous production ranker.
A complex ranker that cannot beat a strong baseline is not ready.
36. Logging for Ranking
At serving, log:
request context
candidate pool
candidate sources
features or feature snapshot refs
model version
scores
rank before rerank
final position
labels/outcomes later
experiment variant
filter decisions
Without ranking logs, training and debugging fail.
Full candidate logging may be expensive. At least log final slate and sampled non-final candidates.
37. Ranking Problem Spec
Define ranking problem as spec.
ranking_problem: home_feed_ranker_v1
surface: home_feed
unit: item_candidate
group_id: request_id
candidate_sources:
  - two_tower
  - item_cf
  - content_based
  - trending
features: home_ranker_features_v12
labels:
  primary:
    clicked_30m:
      weight: 1.0
  secondary:
    purchase_7d:
      weight: 3.0
    hide_7d:
      weight: -2.0
objective:
  type: multi_task_pointwise
evaluation:
  primary_offline: ndcg_at_20_click
  guardrails:
    - hide_rate
    - report_rate
    - latency
split: temporal
Spec creates clarity and reviewability.
38. Ranking Failure Modes
38.1 Optimizes Clickbait
Click-only objective.
38.2 Suppresses New Items
Training dominated by warm items.
38.3 Overweights Source Rank
Ranker learns old retrieval/ranker bias.
38.4 Ignores Negative Feedback
User fatigue and trust degrade.
38.5 Uses Position as Serving Feature
Leakage/invalid feature.
38.6 Candidate Distribution Shift
New source not handled.
38.7 Poor Calibration
Multi-objective score wrong.
38.8 No Slate Awareness
Duplicates/repetition.
38.9 Enterprise Invalid Actions
Eligibility failure mixed with ranking.
38.10 Offline Metric Overfit
A/B test fails.
39. Implementation Sketch: Ranking Request
public record RankingRequest(
    String requestId,
    String surface,
    Subject subject,
    RequestContext context,
    List<AggregatedCandidate> candidates,
    RankingPolicy policy,
    boolean debug
) {}
Each candidate carries its source provenance and refers to an item that has already passed eligibility.
public record RankableCandidate(
    String itemId,
    String itemType,
    List<CandidateSourceEvidence> sources,
    Map<String, Object> candidateMetadata
) {}
Ranking response:
public record RankingResult(
    List<ScoredCandidate> scoredCandidates,
    String modelVersion,
    String featureSetVersion,
    RankingDiagnostics diagnostics
) {}
40. Implementation Sketch: Utility Composition
public final class UtilityComposer {
    public double compose(Predictions p, BusinessContext context) {
        return context.clickWeight() * p.pClick()
            + context.purchaseWeight() * p.pPurchase()
            - context.hideWeight() * p.pHide()
            - context.reportWeight() * p.pReport()
            + context.qualityWeight() * p.expectedSatisfaction();
    }
}
In production, weights should be versioned config, not random constants hidden in code.
41. Minimal Production Ranking Formulation
Start with:
unit: item candidate
group: request_id
model: pointwise GBDT or neural ranker
primary_label: clicked_within_30m
secondary_labels:
  - purchase_7d
  - hide_7d
features:
  - user
  - item
  - context
  - user_item_cross
  - source_features
evaluation:
  - NDCG@K
  - Precision@K
  - calibration
  - hide/report guardrails
candidate_distribution:
  - production candidate logs
split:
  - temporal
Do not start with listwise deep slate optimizer before pointwise/pairwise formulation and logging are solid.
42. Ranking Problem Formulation Checklist
[ ] Ranking unit is explicit.
[ ] Surface-specific objective is defined.
[ ] Candidate source distribution is known.
[ ] Group ID is preserved.
[ ] Primary label is defined.
[ ] Secondary/negative labels are defined.
[ ] Label windows and maturity are defined.
[ ] Utility function or score composition is explicit.
[ ] Guardrail metrics are defined.
[ ] Position/presentation bias is considered.
[ ] Candidate source features are included.
[ ] Eligibility is separated from ranking.
[ ] Reranking/slate constraints are acknowledged.
[ ] Offline metrics align with objective.
[ ] Calibration need is identified.
[ ] Training logs include feature/model/source versions.
[ ] Enterprise constraints are handled as hard filters if applicable.
[ ] Ranking problem spec is versioned.
43. Conclusion
Ranking problem formulation determines what the model actually learns.
Key principles:
- Ranking is utility ordering under constraints.
- Retrieval optimizes recall; ranking optimizes utility among valid candidates.
- Label choice shapes model behavior.
- Click is useful but insufficient as sole objective.
- Multi-objective ranking is often necessary.
- Position, presentation, and selection bias affect labels.
- Ranking examples need group/request context.
- Candidate source provenance is part of ranking.
- Eligibility is a hard boundary, not ranker preference.
- Ranking problem should be specified and versioned before model training.
In Part 034, we will cover Learning to Rank: Pointwise, Pairwise, Listwise. These are the three main ways to train a ranking model, and we will look at when to use each in production.