title: "Build From Scratch Recommendations System - Part 034"
description: "Covers production-grade learning-to-rank: pointwise, pairwise, and listwise objectives, dataset grouping, pair construction, losses, metrics, calibration, bias, trade-offs, and application to recommendation ranking."
series: learn-build-from-scratch-recommendations-system
seriesTitle: "Build From Scratch: Enterprise Recommendations System"
order: 34
partTitle: "Learning to Rank: Pointwise, Pairwise, Listwise"
tags:
- recommendation-system
- recsys
- learning-to-rank
- ranking
- pointwise
- pairwise
- listwise
- series
date: 2026-07-02
Part 034 — Learning to Rank: Pointwise, Pairwise, Listwise
With the ranking problem formulated, the next question is:
How do we train a model so that it produces a good ordering of candidates?
Learning-to-rank has three main families:
- Pointwise: each candidate is predicted independently.
- Pairwise: the model learns that a better candidate should score higher than a worse one.
- Listwise: the model learns to optimize the ordering of the list as a whole.
All three have a place in production.
Pointwise is simple and easy to operate.
Pairwise is closer to the ranking task itself.
Listwise is closer to metrics such as NDCG, but more complex.
This part covers all three in practical terms: objective, dataset, loss, trade-offs, bias, calibration, evaluation, and application in a production-grade recommendation system.
1. Mental Model
A ranking model produces a score:
score = f(user, item, context, source_features)
Candidates are then sorted by score.
Learning-to-rank determines how f is trained.
All approaches need good data. Loss function cannot fix bad labels, leakage, or invalid candidates.
2. Training Data Shape
Learning-to-rank dataset should preserve groups.
Example:
group_id = request_id
candidates = [item_A, item_B, item_C, item_D]
labels = [0, 1, 0, 0]
Table:
| group_id | item_id | features | label | logged_position |
|---|---|---|---|---|
| req_1 | A | ... | 0 | 1 |
| req_1 | B | ... | 1 | 2 |
| req_1 | C | ... | 0 | 3 |
| req_1 | D | ... | 0 | 4 |
For pointwise, the group is less critical but still useful for metrics.
For pairwise/listwise, group is essential.
3. Pointwise Learning-to-Rank
Pointwise treats each candidate independently.
Example objective:
predict P(click | user, item, context)
Model sees rows:
(candidate features) -> label
Common losses:
- binary cross-entropy (equivalently, logistic loss),
- squared error,
- Poisson/count loss,
- regression loss for rating/utility.
Prediction:
score_i = f(x_i)
sort candidates by score_i
This is the most common starting point.
4. Pointwise Example
Training examples:
item A shown, not clicked -> 0
item B shown, clicked -> 1
item C shown, not clicked -> 0
Model learns:
features of clicked items -> higher probability
features of non-clicked items -> lower probability
At serving:
score each candidate independently
sort descending
Simple and scalable.
5. Pointwise Pros
Pointwise is popular because:
- easy dataset construction,
- works with many models,
- supports calibrated probabilities,
- easy multi-task setup,
- easy debugging,
- simple online serving,
- can use standard classification/regression tooling,
- handles large data.
Models:
- logistic regression,
- gradient boosted trees,
- random forests,
- neural nets,
- wide & deep,
- deep rankers.
For a first production ranker, pointwise is often the right choice.
6. Pointwise Cons
Limitations:
- does not directly optimize rank order,
- ignores candidate competition,
- treats examples across different requests similarly,
- class imbalance can dominate,
- no-click ambiguity,
- position bias,
- slate effects not modeled,
- predicted probability may not translate to best ordering if utility is complex.
Example:
A p_click=0.2 on low-intent request
B p_click=0.1 on high-intent request
Pointwise model sees both globally, but ranking only matters within each request.
Use group metrics for evaluation.
7. Pointwise with Utility Labels
Instead of predicting click only, target can be utility.
Example:
label_utility =
1.0 * click
+ 5.0 * purchase
- 2.0 * hide
- 10.0 * report
Then train regression.
Risk:
- arbitrary weights,
- scale instability,
- hard calibration,
- rare outcomes dominate,
- negative labels semantically mixed.
Often better:
multi-task predictions + explicit utility composition
rather than one handcrafted utility label.
8. Multi-Task Pointwise
Model predicts multiple labels:
p_click
p_purchase
p_hide
p_report
p_return
p_satisfaction
Loss:
L =
w_click * BCE(click)
+ w_purchase * BCE(purchase)
+ w_hide * BCE(hide)
+ ...
Serving utility:
score =
a * p_click
+ b * p_purchase
- c * p_hide
- d * p_report
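The serving utility above can be sketched as a small composer. This is a minimal illustration, not a prescribed implementation: the class name `UtilityComposer`, the prediction keys, and the weight values (reused from the utility-label example in section 7) are assumptions made for the example.

```java
import java.util.Map;

/** Sketch: compose per-task predictions into one serving utility score. */
final class UtilityComposer {
    // Hypothetical weights for a, b, c, d; real values are product decisions.
    static final double A_CLICK = 1.0;
    static final double B_PURCHASE = 5.0;
    static final double C_HIDE = 2.0;
    static final double D_REPORT = 10.0;

    /** A missing task prediction defaults to 0.0, so it contributes nothing. */
    static double compose(Map<String, Double> p) {
        return A_CLICK * p.getOrDefault("p_click", 0.0)
             + B_PURCHASE * p.getOrDefault("p_purchase", 0.0)
             - C_HIDE * p.getOrDefault("p_hide", 0.0)
             - D_REPORT * p.getOrDefault("p_report", 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> predictions = Map.of(
            "p_click", 0.10, "p_purchase", 0.02, "p_hide", 0.01, "p_report", 0.001);
        System.out.println(compose(predictions));
    }
}
```

Keeping the weights outside the model means they can be changed, versioned, and A/B tested without retraining.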
Pros:
- each label retains semantics,
- flexible score composition,
- guardrails easier,
- calibration per task possible.
Cons:
- task imbalance,
- delayed labels,
- missing labels,
- task conflict.
9. Pointwise Class Imbalance
Clicks/purchases are sparse.
Example:
CTR = 3%
purchase rate = 0.5%
If the model predicts all zeros, accuracy is high but the model is useless.
Use:
- class weights,
- downsample negatives,
- proper metrics,
- calibration correction,
- group-wise ranking metrics,
- hard negative sampling.
Do not optimize accuracy.
Use log loss, AUC, NDCG, or PR-AUC, depending on the objective.
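Downsampling negatives inflates predicted probabilities, which is why a calibration correction appears in the list above. A common correction, assuming negatives were kept uniformly at random with a known rate, rescales the odds back to the original distribution. The class and method names below are hypothetical.

```java
/** Sketch: undo the probability inflation caused by uniform negative downsampling. */
final class DownsamplingCalibration {
    /**
     * pSampled:  probability predicted by a model trained on downsampled data.
     * keepRate:  fraction of negatives kept during sampling, in (0, 1].
     * Derivation: sampled odds = true odds / keepRate, so
     * p = pSampled / (pSampled + (1 - pSampled) / keepRate).
     */
    static double correct(double pSampled, double keepRate) {
        if (keepRate <= 0.0 || keepRate > 1.0) {
            throw new IllegalArgumentException("keepRate must be in (0, 1]");
        }
        return pSampled / (pSampled + (1.0 - pSampled) / keepRate);
    }

    /** Positive rate observed after keeping only keepRate of the negatives. */
    static double sampledRate(double trueRate, double keepRate) {
        return trueRate / (trueRate + (1.0 - trueRate) * keepRate);
    }

    public static void main(String[] args) {
        double keepRate = 0.1;                        // keep 10% of negatives
        double inflated = sampledRate(0.03, keepRate); // CTR = 3% from the example
        System.out.println("rate seen in training data: " + inflated);
        System.out.println("after correction:           " + correct(inflated, keepRate));
    }
}
```

With a 3% CTR and 10% of negatives kept, the training data shows a positive rate of roughly 24%, and the correction maps it back to 3%. The correction does not apply as written if negatives were sampled non-uniformly, for example with hard negative sampling.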
10. Pairwise Learning-to-Rank
Pairwise trains on item pairs within same group.
If item A is better than item B:
score(A) > score(B)
Training pair:
(x_A, x_B, label = A preferred)
Loss penalizes wrong ordering.
Example:
clicked item > non-clicked item
purchased item > clicked-only item
not-hidden item > hidden item
Pairwise is closer to ranking than pointwise.
11. Pair Construction
Given group:
A label=0
B label=1
C label=0
D label=0
Pairs:
B > A
B > C
B > D
If graded labels:
purchase=3
click=1
no-click=0
hide=-2
Pairs:
purchase > click
purchase > no-click
click > no-click
no-click > hide
Pair construction can explode.
If group has m candidates, pairs can be O(m²).
Use sampling.
12. Pairwise Loss
Common form:
loss = log(1 + exp(-(score_pos - score_neg)))
If the positive score is much higher than the negative score, the loss is small.
If the negative score is higher, the loss is large.
Hinge version:
loss = max(0, margin - (score_pos - score_neg))
Pairwise loss focuses on relative ordering.
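Both forms can be written as small functions. The sketch below is illustrative only (the class name `PairwiseLosses` is hypothetical); the logistic version is rearranged so that `exp` never receives a large positive argument, which the direct formula would overflow on when the negative item scores far higher.

```java
/** Sketch: the two pairwise losses for one (positive, negative) pair. */
final class PairwiseLosses {
    /** log(1 + exp(-(scorePos - scoreNeg))), computed in a numerically stable way. */
    static double logistic(double scorePos, double scoreNeg) {
        double diff = scorePos - scoreNeg;
        // Identity: log(1 + exp(-d)) = max(-d, 0) + log(1 + exp(-|d|))
        return Math.max(-diff, 0.0) + Math.log1p(Math.exp(-Math.abs(diff)));
    }

    /** max(0, margin - (scorePos - scoreNeg)); zero once the gap reaches the margin. */
    static double hinge(double scorePos, double scoreNeg, double margin) {
        return Math.max(0.0, margin - (scorePos - scoreNeg));
    }

    public static void main(String[] args) {
        System.out.println(logistic(2.0, 0.0)); // correct order: small loss
        System.out.println(logistic(0.0, 2.0)); // wrong order: larger loss
        System.out.println(hinge(2.0, 0.0, 1.0));
        System.out.println(hinge(0.0, 2.0, 1.0));
    }
}
```

The logistic loss never reaches exactly zero, so it keeps pushing pairs apart; the hinge loss stops contributing once a pair is separated by the margin.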
13. Pairwise Pros
- optimizes ordering more directly,
- robust to global calibration issues,
- good for relative preference,
- can handle graded labels,
- focuses on hard comparisons,
- aligns with ranking within request.
Pairwise is useful when:
- ordering matters more than calibrated probability,
- labels are relative,
- candidate groups are available,
- negatives are meaningful within same request.
14. Pairwise Cons
- pair explosion,
- pair sampling complexity,
- less calibrated scores,
- noisy pairs from ambiguous negatives,
- harder online interpretation,
- position bias still matters,
- training slower,
- group construction required.
If no-click negatives are weak, many pairs may be noisy.
Example:
clicked item at position 1 > no-click item at position 20
The click may reflect position rather than relevance.
15. Pair Sampling
Strategies:
All Positive-Negative Pairs
High cost.
Sample Negatives Per Positive
for each positive, sample K negatives
Hard Negative Pairs
Use negatives with high model/source score.
Same-Position/Similar Exposure Pairs
Reduce position bias.
Label-Difference Weighted Pairs
Bigger label difference gets higher weight.
Example:
purchase > report has higher weight than click > no-click
Pair sampling policy should be versioned.
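The "sample K negatives per positive" strategy can be sketched as follows. This is a minimal example under simplifying assumptions: labels are binary, negatives are drawn uniformly without replacement, and the `Candidate` and `Pair` records are hypothetical stand-ins for real training rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: for each positive in a group, sample up to K negatives from the same group. */
final class PairSampler {
    record Candidate(String id, int label) {}
    record Pair(String preferredId, String lessPreferredId) {}

    static List<Pair> sample(List<Candidate> group, int k, Random rng) {
        List<Candidate> positives = new ArrayList<>();
        List<Candidate> negatives = new ArrayList<>();
        for (Candidate c : group) {
            (c.label() > 0 ? positives : negatives).add(c);
        }
        List<Pair> pairs = new ArrayList<>();
        for (Candidate pos : positives) {
            // Shuffle a copy so each positive gets its own random draw of negatives.
            List<Candidate> shuffled = new ArrayList<>(negatives);
            Collections.shuffle(shuffled, rng);
            int take = Math.min(k, shuffled.size());
            for (Candidate neg : shuffled.subList(0, take)) {
                pairs.add(new Pair(pos.id(), neg.id()));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Candidate> group = List.of(
            new Candidate("A", 0), new Candidate("B", 1),
            new Candidate("C", 0), new Candidate("D", 0));
        // A fixed seed keeps the sampled pairs reproducible across runs.
        System.out.println(sample(group, 2, new Random(42)));
    }
}
```

This caps the pair count at K per positive instead of growing with the number of negatives. Passing the seed and K in explicitly makes it straightforward to record them as part of the versioned sampling policy.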
16. Pairwise with Multiple Positive Levels
Graded labels:
purchase = 4
add_to_cart = 3
click = 1
no_click = 0
hide = -2
report = -5
Create pairs only if the label difference is meaningful.
if label_i - label_j >= threshold:
create pair i > j
This avoids noisy tiny differences.
Weight:
pair_weight = abs(label_i - label_j)
But graded labels must be carefully designed.
17. Pairwise and Position Bias
Pairwise can still learn bias.
Historical clicked item often appeared higher.
Mitigation:
- compare items with similar exposure probability,
- use propensity weights,
- use randomized/interleaved data,
- include examination model,
- exclude low-visibility negatives,
- pair within same slate with visibility known.
Do not assume pairwise solves bias automatically.
18. Listwise Learning-to-Rank
Listwise trains on entire candidate list/group.
Goal:
produce ordering that maximizes list metric
Metrics:
- NDCG,
- MAP,
- MRR,
- top-K utility.
Listwise objectives approximate ranking metrics directly.
Examples:
- LambdaRank/LambdaMART-style gradient weighting,
- softmax/list probability losses,
- differentiable NDCG approximations.
Listwise is more complex but powerful.
19. NDCG Intuition
NDCG rewards putting high-relevance items near top.
DCG = sum(gain_i / log2(position_i + 1))
High label at top gives more gain.
NDCG normalizes by ideal DCG.
Example:
purchase item at position 1 is much better than position 10
Listwise methods often weight errors by NDCG impact.
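The DCG formula above translates directly into code. The sketch below assumes positions are 1-based, that the gain is the label itself (another common choice is `2^label - 1`), and that labels are non-negative; the class name `Ndcg` is hypothetical.

```java
import java.util.Arrays;

/** Sketch: DCG@K and NDCG@K for one group, using the label as the gain. */
final class Ndcg {
    /** labelsInRankOrder[i] is the label of the item shown at position i + 1. */
    static double dcgAtK(double[] labelsInRankOrder, int k) {
        double dcg = 0.0;
        int n = Math.min(k, labelsInRankOrder.length);
        for (int i = 0; i < n; i++) {
            int position = i + 1;
            double log2 = Math.log(position + 1) / Math.log(2);
            dcg += labelsInRankOrder[i] / log2;
        }
        return dcg;
    }

    /** DCG divided by the DCG of the ideal (label-descending) ordering. */
    static double ndcgAtK(double[] labelsInRankOrder, int k) {
        double[] ideal = labelsInRankOrder.clone();
        Arrays.sort(ideal); // ascending
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            double tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp; // reverse into descending order
        }
        double idcg = dcgAtK(ideal, k);
        // A group with no relevant item has no ideal ordering; report 0 by convention.
        return idcg == 0.0 ? 0.0 : dcgAtK(labelsInRankOrder, k) / idcg;
    }

    public static void main(String[] args) {
        double[] logged = {0, 1, 0, 0}; // group req_1 from section 2, in logged order
        System.out.println(ndcgAtK(logged, 4));
    }
}
```

For the `req_1` group, the clicked item sits at position 2, so NDCG is `1 / log2(3)`, about 0.63. Moving it to position 1 would give 1.0.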
20. LambdaRank / LambdaMART Intuition
Lambda methods train by pairwise swaps weighted by metric impact.
If swapping item A and B would greatly improve NDCG, gradient is larger.
This combines pairwise comparison with listwise metric awareness.
LambdaMART with gradient boosted trees has been a strong LTR approach in many search/recommendation systems.
Practical idea:
not all wrong pairs matter equally
wrong order at top matters more
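The "metric impact" of a swap can be made concrete. The sketch below computes the absolute change in NDCG if two items exchanged positions, which is the quantity Lambda-style methods use to scale a pair's gradient. It assumes label-as-gain and untruncated NDCG, and it shows only the weighting term, not a full LambdaRank implementation; the class name `LambdaWeight` is hypothetical.

```java
import java.util.Arrays;

/** Sketch: |delta NDCG| of swapping two positions, used as a pair weight. */
final class LambdaWeight {
    /** Discount 1 / log2(position + 1) for a 0-based index. */
    static double discount(int zeroBasedIndex) {
        return 1.0 / (Math.log(zeroBasedIndex + 2) / Math.log(2));
    }

    static double idealDcg(double[] labels) {
        double[] sorted = labels.clone();
        Arrays.sort(sorted); // ascending, so read from the end
        double idcg = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            idcg += sorted[sorted.length - 1 - i] * discount(i);
        }
        return idcg;
    }

    /** Only the two swapped positions change, so the delta has a closed form. */
    static double deltaNdcg(double[] labelsInRankOrder, int i, int j) {
        double idcg = idealDcg(labelsInRankOrder);
        if (idcg == 0.0) {
            return 0.0;
        }
        double gainDiff = labelsInRankOrder[i] - labelsInRankOrder[j];
        double discountDiff = discount(i) - discount(j);
        return Math.abs(gainDiff * discountDiff) / idcg;
    }

    public static void main(String[] args) {
        double[] nearTop = {0, 1, 0, 0, 0, 0, 0, 0, 0, 0};
        double[] nearBottom = {0, 0, 0, 0, 0, 0, 0, 0, 0, 1};
        System.out.println(deltaNdcg(nearTop, 0, 1));    // fixing position 2 -> 1
        System.out.println(deltaNdcg(nearBottom, 8, 9)); // fixing position 10 -> 9
    }
}
```

Both swaps fix the same kind of mistake (a relevant item one slot too low), but the swap near the top changes NDCG by roughly 0.37 versus roughly 0.01 near the bottom, so it would receive a much larger gradient.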
21. Listwise Softmax Loss
For a group, model produces scores for candidates.
Softmax converts scores to distribution:
P(item_i) = exp(score_i) / sum(exp(score_j))
Target distribution comes from labels.
Loss compares predicted distribution vs target distribution.
Useful when group labels are meaningful.
Challenges:
- group size variance,
- large groups expensive,
- label noise,
- missing negatives.
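A minimal sketch of the softmax loss for one group follows. How the target distribution is derived from labels is a modeling choice; this example assumes non-negative labels normalized to sum to 1 (applying a softmax to the labels is another common option). The class name `ListwiseSoftmaxLoss` is hypothetical.

```java
/** Sketch: cross-entropy between softmax(scores) and a label-derived target distribution. */
final class ListwiseSoftmaxLoss {
    static double loss(double[] scores, double[] labels) {
        if (scores.length != labels.length) {
            throw new IllegalArgumentException("scores and labels must have equal length");
        }
        // log-sum-exp with the max subtracted, so exp() cannot overflow
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double sumExp = 0.0;
        for (double s : scores) {
            sumExp += Math.exp(s - max);
        }
        double logZ = max + Math.log(sumExp);

        double labelSum = 0.0;
        for (double y : labels) {
            labelSum += y;
        }
        if (labelSum <= 0.0) {
            return 0.0; // assumption: a group without any positive label is skipped
        }

        double loss = 0.0;
        for (int i = 0; i < scores.length; i++) {
            double target = labels[i] / labelSum;      // target probability of item i
            double logPredicted = scores[i] - logZ;    // log softmax(scores)[i]
            loss -= target * logPredicted;
        }
        return loss;
    }

    public static void main(String[] args) {
        double[] labels = {0, 1, 0, 0};
        System.out.println(loss(new double[]{0, 0, 0, 0}, labels)); // uninformed scores
        System.out.println(loss(new double[]{0, 5, 0, 0}, labels)); // clicked item on top
        System.out.println(loss(new double[]{5, 0, 0, 0}, labels)); // wrong item on top
    }
}
```

With uniform scores over four candidates the loss is `log(4)`; it falls when the labeled item gets the highest score and rises when a different item does. Because the normalizer sums over the whole group, cost grows with group size, which is the "large groups expensive" challenge listed above.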
22. Listwise Pros
- closest to ranking metrics,
- considers group/list context,
- emphasizes top positions,
- can optimize graded relevance,
- often strong for search/ranking tasks.
Useful when:
- request groups are well-formed,
- labels are graded,
- top-K ordering matters,
- infra supports group training.
23. Listwise Cons
- complex data pipeline,
- needs group completeness,
- expensive for large candidate pools,
- harder debugging,
- harder calibration,
- label noise can hurt,
- slate feedback still incomplete,
- online serving still scores individual candidates unless using slate model.
Listwise is not automatically better. It needs mature data.
24. Pointwise vs Pairwise vs Listwise Summary
| Approach | Trains On | Pros | Cons |
|---|---|---|---|
| Pointwise | individual candidate labels | simple, scalable, calibrated | indirect ranking |
| Pairwise | item comparisons | better ordering | pair sampling/noise |
| Listwise | full group/list | metric-aligned | complex, needs groups |
Recommended progression:
pointwise baseline -> pairwise/listwise if ranking metric/scale justifies
Do not skip pointwise unless you have mature LTR infra.
25. Choosing Approach by Situation
Early Production Ranker
Use pointwise.
Search Ranking with Query Groups
Pairwise/listwise can be strong.
E-commerce Product Ranking
Pointwise multi-task + reranking often works well.
Pairwise/listwise for top-K improvements.
Feed Ranking
Pointwise/multi-task plus slate reranking is common; listwise more complex due to feedback loops.
Enterprise Action Ranking
Pointwise/multi-task with strong constraints and explainability first.
Pairwise possible with expert-labeled preferences.
26. Calibration Considerations
Pointwise models can be calibrated to probabilities.
Pairwise/listwise scores are often less calibrated.
If utility composition needs probabilities:
P(purchase) * margin - P(return) * cost
pointwise/multi-task is useful.
If final rank only needs order, pairwise/listwise may be sufficient.
Hybrid approach:
- pointwise model predicts calibrated outcomes,
- pairwise/listwise model or reranker improves ordering,
- calibration layer applied separately.
27. Hybrid Objectives
Model can combine losses:
loss =
pointwise_click_loss
+ pairwise_order_loss
+ listwise_ndcg_loss
Or train separate models.
Be careful:
- loss weights need tuning,
- tasks can conflict,
- harder debugging.
Start simple. Add complexity when clear bottleneck exists.
28. Training Dataset Group Completeness
For pairwise/listwise, group should represent candidate competition.
If training group contains only final displayed items, it misses candidates ranker considered but did not show.
Better:
candidate pool before ranking
+ final positions
+ labels
But logging full candidate pool is expensive.
Options:
- log top N pre-rank candidates,
- sample non-shown candidates,
- use candidate source replay,
- train on displayed slate first but recognize bias.
29. Displayed Slate vs Candidate Pool
Displayed slate:
- has labels,
- position known,
- smaller.
Candidate pool:
- closer to ranker decision set,
- many unshown items have unknown labels,
- needs counterfactual handling.
Pointwise CTR usually uses displayed impressions.
Pairwise/listwise can use displayed slate labels, but it optimizes within what was shown, not all candidates.
Exploration data helps.
30. Negative Sampling for Ranking
Ranking negatives can be:
- shown but not clicked,
- visible no action,
- same request candidates not selected,
- candidate pool non-shown,
- hard negatives from high source score,
- explicit negative feedback.
Confidence differs.
For pointwise CTR:
valid visible no-click = weak negative
For pairwise:
clicked > visible no-click
For listwise:
labels reflect graded relevance
Do not treat unshown candidates as strong negative.
31. Weighting Examples
Pointwise weights:
click positive weight = 1
no-click visible top position = 0.2
no-click low visibility = 0.05
purchase positive = 5
hide negative = 3
Pairwise weights:
purchase > no-click: high
click > no-click: medium
no-click > hide: high
click > click: no pair
Listwise gains:
purchase = 7
add_to_cart = 4
click = 1
no-click = 0
hide = -3 or separate penalty
Weights are product/modeling decisions.
32. Handling Negative Feedback in LTR
Negative labels can be handled as:
Pointwise
Predict p_hide, p_report, p_return separately.
Pairwise
Ensure negative-feedback items rank below neutral/positive items.
Listwise
Use negative gain or exclusion.
Be careful with reports:
- may indicate safety/policy issue,
- should trigger filter/policy systems,
- not just rank lower.
Explicit user block/hide often becomes suppression.
33. Query/Request Grouping for Recommendation
Group by:
recommendation request
response slate
candidate generation run
search query
case context
session decision point
Do not group candidates from unrelated requests. Pair/list comparisons across different contexts are meaningless.
Bad:
clicked product from request A > no-click article from request B
unless using pointwise.
34. Feature Consistency Across Candidates
Within group, candidates should share request context.
Efficient representation:
group features: user/context
candidate features: item/source/cross
Serving can compute group features once.
Training should avoid duplicating huge context fields unnecessarily, but logical model sees them.
35. Models for LTR
Logistic Regression
Good baseline, interpretable.
Gradient Boosted Decision Trees
Strong for tabular ranking features.
Common for pointwise and LambdaMART.
Neural Rankers
Good for embeddings, sequences, high-dimensional features.
Factorization Machines / DeepFM
Good for sparse feature interactions.
Hybrid
GBDT + neural embeddings/features.
Model choice depends on feature type, scale, latency, and ops maturity.
36. Gradient Boosted Trees for Ranking
GBDTs are strong when features are structured:
- counts,
- affinities,
- source scores,
- item quality,
- cross features.
Pros:
- strong tabular performance,
- handles non-linearities,
- less feature scaling,
- interpretable-ish,
- fast inference if controlled.
Cons:
- large models can be heavy,
- harder with raw high-dimensional embeddings,
- incremental updates not natural.
LambdaMART is a tree-based LTR method.
37. Neural Rankers for LTR
Neural rankers handle:
- embeddings,
- sequences,
- text representations,
- multimodal features,
- complex interactions.
Pros:
- expressive,
- can use deep representations,
- supports multi-task.
Cons:
- more data needed,
- serving latency,
- calibration,
- harder debugging,
- training-serving skew.
Start neural when feature/model maturity supports it.
38. Evaluation Metrics by Approach
Pointwise:
- log loss,
- AUC,
- PR-AUC,
- calibration error,
- grouped NDCG.
Pairwise:
- pairwise accuracy,
- NDCG,
- MAP,
- top-K metrics.
Listwise:
- NDCG@K,
- MAP,
- MRR,
- top-K utility.
Always include production guardrails:
- hide/report,
- latency,
- coverage,
- diversity,
- business constraints.
39. Grouped Metrics Matter
Even pointwise models should be evaluated by grouped ranking metrics.
For each request group:
- score candidates,
- sort,
- compute NDCG@K/Precision@K/Recall@K.
Global AUC can improve while top-K ranking worsens.
Ranking is about order within request.
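The per-group procedure above can be sketched for Precision@K. The example assumes binary labels and divides by K even when a group has fewer than K candidates; the `Row` record and the class name `GroupedPrecision` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: mean Precision@K computed per request group, then averaged. */
final class GroupedPrecision {
    record Row(String groupId, double score, int label) {}

    static double meanPrecisionAtK(List<Row> rows, int k) {
        // 1. Bucket scored candidates by request group.
        Map<String, List<Row>> groups = new LinkedHashMap<>();
        for (Row r : rows) {
            groups.computeIfAbsent(r.groupId(), g -> new ArrayList<>()).add(r);
        }
        if (groups.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (List<Row> group : groups.values()) {
            // 2. Sort within the group by model score, highest first.
            group.sort(Comparator.comparingDouble(Row::score).reversed());
            // 3. Count positives among the top K.
            int hits = 0;
            for (int i = 0; i < Math.min(k, group.size()); i++) {
                if (group.get(i).label() > 0) {
                    hits++;
                }
            }
            total += (double) hits / k;
        }
        return total / groups.size();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("req_1", 0.9, 0), new Row("req_1", 0.8, 1), new Row("req_1", 0.1, 0),
            new Row("req_2", 0.3, 1), new Row("req_2", 0.2, 0), new Row("req_2", 0.1, 0));
        System.out.println(meanPrecisionAtK(rows, 1));
    }
}
```

In this toy data every `req_1` score is higher than every `req_2` score, yet only `req_2` ranks its positive first. Scores are compared only within a group, so the differing score scales across requests have no effect on the metric.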
40. Offline Evaluation Caveats
LTR offline evaluation can be misleading due to:
- position bias,
- selection bias,
- candidate generation mismatch,
- label noise,
- incomplete labels,
- temporal leakage,
- source changes,
- UI changes.
Use temporal splits and online A/B tests.
Offline LTR metric is screening, not final proof.
41. Online Testing
A/B test rankers with:
- primary product metric,
- guardrails,
- source contribution,
- latency,
- segment analysis,
- cold-start analysis,
- long-term metrics if possible.
If a pairwise/listwise model improves NDCG but increases hide/report rates, it is not acceptable.
42. Training-Serving Alignment
Ensure:
same features
same transforms
same candidate source distribution
same eligibility filters
same model version
same score composition
For listwise/pairwise, serving still usually scores candidates individually then sorts. Ensure training objective matches serving use.
43. Debugging LTR Models
Questions:
Did candidate appear in training distribution?
Are features missing/stale?
Did model overuse source rank?
Is label noisy?
Is item cold-start?
Is score calibrated?
Did reranker override?
Is metric segment-specific?
Debug views should show:
- feature values,
- model score,
- task predictions,
- source evidence,
- rank before/after rerank,
- reason for final order.
44. Production Progression
Recommended maturity path:
Stage 1
Pointwise CTR/CVR model with strong features.
Stage 2
Multi-task pointwise with negative/longer-term labels.
Stage 3
Pairwise loss or LambdaMART for top-K ordering.
Stage 4
Listwise/slate-aware objectives for mature surfaces.
Stage 5
Contextual bandits/long-term optimization.
Do not jump stages without logging/evaluation foundation.
45. Enterprise LTR
Enterprise ranking should start conservative.
Recommended:
- pointwise/multi-task model,
- valid candidates only,
- expert/rule features,
- success outcome labels,
- strong explanation,
- audit logs,
- human review for high-risk actions.
Pairwise can use expert preference pairs:
in this case state, action A is preferred over B
Listwise can optimize ordered checklist, but only after enough reliable feedback.
46. Implementation Sketch: Pointwise Dataset
public record PointwiseRankingExample(
    String groupId,
    String candidateId,
    FeatureVector features,
    Map<String, Double> labels,
    double exampleWeight
) {}
Training:
for each example:
    predict p_click, p_purchase, p_hide
    compute weighted BCE losses
Serving:
for each candidate:
    predictions = model.predict(features)
    utility = composer.compose(predictions)
sort by utility
47. Implementation Sketch: Pairwise Dataset
public record PairwiseRankingExample(
    String groupId,
    String preferredCandidateId,
    String lessPreferredCandidateId,
    FeatureVector preferredFeatures,
    FeatureVector lessPreferredFeatures,
    double pairWeight
) {}
Pair generation:
for (CandidateExample a : group.candidates()) {
    for (CandidateExample b : group.candidates()) {
        // Emit a pair only when a beats b by more than the threshold.
        if (a.label() > b.label() + threshold) {
            pairs.add(new PairwiseRankingExample(
                group.id(), a.id(), b.id(), a.features(), b.features(), weight(a, b)));
        }
    }
}
Sample pairs if too many.
48. Implementation Sketch: Pairwise Loss
Conceptual:
double scorePreferred = model.score(preferredFeatures);
double scoreLess = model.score(lessPreferredFeatures);
double margin = scorePreferred - scoreLess;
double loss = Math.log1p(Math.exp(-margin)) * pairWeight;
Training updates model to increase margin.
49. Implementation Sketch: Listwise Group
public record RankingGroup(
    String groupId,
    List<CandidateExample> candidates
) {}
public record CandidateExample(
    String candidateId,
    FeatureVector features,
    double relevanceLabel,
    int loggedPosition
) {}
Listwise training needs complete groups and relevance labels.
50. Minimal Production LTR Plan
Start:
approach: multi_task_pointwise
model: gradient_boosted_tree_or_neural
group_id: request_id
labels:
  click_30m: primary
  purchase_7d: secondary
  hide_7d: negative
features:
  - user
  - item
  - context
  - cross
  - source
loss:
  weighted_bce_per_task
evaluation:
  - grouped_ndcg_at_10
  - precision_at_10
  - calibration
  - hide_report_guardrails
next:
  - pairwise hard negative training
  - lambda/listwise for mature surface
This is practical and production-friendly.
51. Learning-to-Rank Readiness Checklist
[ ] Group ID is available.
[ ] Candidate features are point-in-time safe.
[ ] Labels and label windows are defined.
[ ] Position/visibility logging exists.
[ ] Pointwise baseline is built.
[ ] Grouped ranking metrics are computed.
[ ] Negative/no-click confidence is handled.
[ ] Pair construction policy is versioned if pairwise.
[ ] Pair sampling controls pair explosion.
[ ] Listwise groups are complete enough if listwise.
[ ] Calibration needs are identified.
[ ] Candidate source distribution matches production.
[ ] Source features are included.
[ ] Bias/propensity considerations are documented.
[ ] Guardrail metrics are evaluated.
[ ] Online A/B test plan exists.
[ ] Debug view shows feature/score/source/rank.
52. Conclusion
Learning-to-rank offers several ways to train a ranker, and each has trade-offs.
Key principles:
- Pointwise predicts candidate outcomes independently.
- Pairwise learns relative ordering within group.
- Listwise optimizes list-level ranking more directly.
- Pointwise is the best production starting point for most teams.
- Pairwise/listwise need good group data and careful sampling.
- Calibration is easier with pointwise.
- Pairwise/listwise scores are often less probabilistic.
- Bias, position, and selection effects still matter.
- Offline ranking metrics must be grouped by request.
- Choose LTR approach based on data maturity, latency, debugging, and product objective.
In Part 035, we will cover Feature Engineering for Ranking: how to design user, item, context, cross, source, sequence, graph, and embedding features for a strong, production-safe ranker.