title: Build From Scratch Recommendations System - Part 065
description: Designing counterfactual and off-policy evaluation for a production-grade recommendation system: logged policy, propensity, support/overlap, IPS, SNIPS, doubly robust intuition, slate OPE, variance, bias, diagnostics, and practical rollout.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 65
partTitle: Counterfactual and Off-Policy Evaluation
tags:
- recommendation-system
- recsys
- counterfactual-evaluation
- off-policy-evaluation
- causal-inference
- experimentation
- series
date: 2026-07-02
Part 065 — Counterfactual and Off-Policy Evaluation
Online A/B testing is the gold standard for proving the effect of a new policy.
But A/B tests are expensive, slow, and risky.
Before sending production traffic to a new policy, we want to ask:
Based on logged data from the old policy, roughly how would the new policy perform?
This is called off-policy evaluation or counterfactual evaluation.
The problem: logged data only contains outcomes for the actions/items/slates that the old policy actually showed.
We do not know the outcome for alternative actions that were not shown.
Counterfactual evaluation tries to estimate the new policy's results from old data with the help of:
- logged policy,
- propensity,
- randomization/exploration,
- support/overlap,
- importance weighting,
- reward models,
- diagnostics,
- variance control.
This part covers production-grade counterfactual and off-policy evaluation for recommendation systems: concepts, IPS/SNIPS, doubly robust intuition, the support problem, slate complexity, diagnostics, logging requirements, limitations, and practical rollout.
1. Mental Model: OPE Estimates “What If We Had Used Another Policy?”
Historical data:
context x
old policy chose action a
reward r observed
Question:
What would reward have been if new policy chose action a'?
If a' != a, reward is unobserved.
Off-policy evaluation uses logged events where new policy would have taken the same/similar action as old policy, weighted by selection probabilities.
OPE is not magic. It requires good logging and overlap.
2. Key Terms
Context
Request/user/session/surface state.
x
Action
Item, candidate, source, module, slot decision, or slate.
a
Reward
Observed outcome.
click, purchase, hide, report, utility
Logging Policy
Old policy that generated data.
pi_0(a | x)
Target Policy
New policy we want to evaluate.
pi_1(a | x)
Propensity
Probability logging policy chose action.
p = pi_0(a | x)
3. Why Propensity Is Required
If old policy chose item A with probability 0.9, reward from A is common.
If old policy chose item B with probability 0.01, reward from B is rarely observed, so each logged B event has to stand in for many impressions that never happened.
Importance weighting corrects for unequal sampling.
Without propensity, we cannot know whether action was shown because:
- model was confident,
- random exploration,
- business rule,
- position bias,
- deterministic default.
OPE without propensity is often unreliable.
4. Logging Requirements
For OPE, log:
context features used by policy
candidate/action set
chosen action/slate
reward/outcome
logging policy version
propensity of chosen action
randomization seed
position
experiment/exploration policy
model/policy versions
eligibility constraints
For slate recommendations, also log:
per-slot propensities if available
slate probability or approximation
candidate set order
If logging does not include candidate set and propensity, OPE will be weak.
5. Support / Overlap
OPE needs overlap.
If target policy chooses action never chosen by logging policy:
pi_0(a | x) = 0
then no logged reward exists.
Estimator cannot know reward.
This is called support problem.
Practical implication:
exploration creates support
Without exploration, OPE cannot evaluate radically different policies.
6. Deterministic Logging Policy Problem
If old policy always shows top item deterministically:
pi_0(top_item | x) = 1
pi_0(other_items | x) = 0
Then OPE can only evaluate policies choosing same top item.
This is why production systems add controlled randomization/exploration.
Even small exploration budget improves future evaluation.
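A minimal sketch of controlled randomization, assuming an epsilon-greedy logging policy over a known candidate list. The class `EpsilonGreedyLogger`, the record `Decision`, and the assumption that the top action is always part of the candidate list are illustrative and not taken from the original system. The point is that the propensity is computed by the same code that makes the choice, so it can be logged with the decision.

```java
import java.util.List;
import java.util.Random;

/** Hypothetical epsilon-greedy logging policy that records the propensity of its choice. */
final class EpsilonGreedyLogger {
    /** The chosen action together with the probability that it was chosen. */
    record Decision(String actionId, double propensity) {}

    private final double epsilon;
    private final Random random;

    EpsilonGreedyLogger(double epsilon, Random random) {
        this.epsilon = epsilon;
        this.random = random;
    }

    /** Probability that this policy picks actionId, given the greedy (top) action. */
    double propensity(String actionId, String topAction, List<String> candidates) {
        if (!candidates.contains(actionId)) {
            return 0.0; // outside the candidate set: no support
        }
        double explore = epsilon / candidates.size(); // uniform exploration share
        return actionId.equals(topAction) ? (1.0 - epsilon) + explore : explore;
    }

    /** Picks the top action with probability 1 - epsilon, otherwise a uniform candidate. */
    Decision decide(String topAction, List<String> candidates) {
        String chosen = random.nextDouble() < epsilon
                ? candidates.get(random.nextInt(candidates.size()))
                : topAction;
        return new Decision(chosen, propensity(chosen, topAction, candidates));
    }
}
```

With epsilon greater than zero, every candidate has a non-zero propensity, which is what gives a later target policy something to be evaluated against.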
7. Policy Value
Policy value:
V(pi) = E[reward when actions are chosen by pi]
In recommendation:
expected click
expected utility
expected purchase
expected satisfaction
expected case success
OPE estimates V(pi_1) using data generated by pi_0.
8. Direct Method
Direct Method trains reward model:
r_hat(x, a) = predicted reward
Then estimates:
V(pi_1) = average r_hat(x, pi_1(x))
Pros:
- low variance,
- can score unchosen actions if features available.
Cons:
- biased if reward model wrong,
- inherits logged data bias,
- cannot fully solve no-support problem.
This is similar to offline model evaluation.
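As an illustration, the Direct Method can be sketched as below for a deterministic target policy. The class name `DirectMethodEstimator` and the functional parameters are assumptions made for this example; the reward model is passed in as a plain function, so nothing here says how it was trained.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical Direct Method sketch: average predicted reward of the target policy's picks. */
final class DirectMethodEstimator {
    static <X, A> double estimate(List<X> contexts,
                                  Function<X, A> targetPolicy,
                                  ToDoubleBiFunction<X, A> rewardModel) {
        if (contexts.isEmpty()) {
            return Double.NaN; // nothing to average
        }
        double sum = 0.0;
        for (X x : contexts) {
            A action = targetPolicy.apply(x);            // what pi_1 would do in this context
            sum += rewardModel.applyAsDouble(x, action); // r_hat(x, pi_1(x))
        }
        return sum / contexts.size();
    }
}
```

The estimate is only as good as `rewardModel` on the actions the target policy picks, which is exactly where logged data tends to be thinnest.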
9. Inverse Propensity Scoring
IPS estimator:
IPS = mean( reward_i * pi_1(a_i | x_i) / pi_0(a_i | x_i) )
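If the target policy is more likely than the logging policy to choose the logged action, the weight is above 1.
If the target policy would never choose the logged action, the weight is zero.
IPS is unbiased when the logged propensities are correct and the logging policy covers every action the target policy can take, but its variance can be high.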
10. IPS Intuition
Example:
old policy showed item A with p=0.1
new policy would show item A with p=0.5
reward=1
weight=0.5/0.1=5
This logged reward counts more because target policy would choose it more often than old policy.
If p is tiny, weight becomes huge.
That creates variance.
11. Propensity Clipping
To reduce variance, clip weights.
weight = min(pi_1 / pi_0, max_weight)
Example:
max_weight = 10
Pros:
- lower variance.
Cons:
- introduces bias.
Use diagnostics to show clipped fraction.
Do not hide clipping.
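One way to keep clipping visible is to return the clipped fraction together with the estimate. The sketch below assumes each logged event has already been reduced to a reward, a target probability, and a logging propensity; `ClippedIps`, `Sample`, and `Result` are hypothetical names for this example.

```java
import java.util.List;

/** Hypothetical clipped-IPS sketch that reports how often the cap was hit. */
final class ClippedIps {
    /** One logged event reduced to what the estimator needs. */
    record Sample(double reward, double targetProb, double loggingPropensity) {}

    /** Estimate plus the share of used samples whose raw weight exceeded the cap. */
    record Result(double estimate, double clippedFraction) {}

    static Result estimate(List<Sample> samples, double maxWeight) {
        double sum = 0.0;
        int clipped = 0;
        int used = 0;
        for (Sample s : samples) {
            if (s.loggingPropensity() <= 0.0) {
                continue; // invalid propensity: treated as a logging error
            }
            double raw = s.targetProb() / s.loggingPropensity();
            if (raw > maxWeight) {
                clipped++;
            }
            sum += Math.min(raw, maxWeight) * s.reward();
            used++;
        }
        if (used == 0) {
            return new Result(Double.NaN, 0.0);
        }
        return new Result(sum / used, (double) clipped / used);
    }
}
```

A high `clippedFraction` is a sign that the cap, rather than the data, is driving the number.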
12. Self-Normalized IPS
SNIPS:
SNIPS = sum(w_i * r_i) / sum(w_i)
where:
w_i = pi_1(a_i|x_i) / pi_0(a_i|x_i)
Pros:
- lower variance,
- stable scale.
Cons:
- biased in finite samples, though the bias shrinks as the log grows.
Common in production OPE.
13. Doubly Robust Intuition
Doubly Robust combines:
- reward model,
- propensity weighting.
Idea:
estimate reward using model
then correct using logged residual weighted by propensity
If either the reward model or the propensities are correct, the estimate remains consistent; that is the sense in which the estimator is doubly robust.
Practical value:
- lower variance than IPS,
- less biased than direct method if propensities good.
But implementation complexity and assumptions matter.
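The idea can be sketched as follows, assuming each logged event has been pre-processed into four numbers: the observed reward, the importance weight of the logged action, the model's prediction for the logged action, and the model's expected prediction under the target policy. The names `DoublyRobustEstimator` and `Sample` are hypothetical.

```java
import java.util.List;

/** Hypothetical doubly robust sketch: model baseline plus a weighted residual correction. */
final class DoublyRobustEstimator {
    /**
     * reward               observed reward r
     * weight               pi_1(a | x) / pi_0(a | x) for the logged action a
     * predictedForLogged   r_hat(x, a) for the logged action
     * predictedUnderTarget expected r_hat(x, a') when a' is drawn from pi_1
     */
    record Sample(double reward, double weight,
                  double predictedForLogged, double predictedUnderTarget) {}

    static double estimate(List<Sample> samples) {
        if (samples.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (Sample s : samples) {
            double residual = s.reward() - s.predictedForLogged(); // where the model was wrong
            sum += s.predictedUnderTarget() + s.weight() * residual;
        }
        return sum / samples.size();
    }
}
```

Two sanity checks follow from the formula: with a reward model that always predicts zero the result equals plain IPS, and with a model whose residuals are all zero it equals the Direct Method.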
14. Reward Model for DR
Reward model predicts:
r_hat(x, a)
Train and evaluate the reward model carefully.
If reward model is trained on same biased data, it may be wrong for underexplored actions.
Use exploration data and calibration.
DR does not magically fix missing support.
15. Slate OPE Complexity
Recommendation often chooses slate:
[a1, a2, a3, ..., ak]
Reward depends on:
- positions,
- item interactions,
- diversity,
- user attention,
- substitution effects.
Slate probability can be tiny:
pi(slate | context)
Exact slate IPS has huge variance.
Practical systems often evaluate simpler action units:
- slot-level,
- top item,
- module selection,
- candidate source selection,
- limited slate policy variants.
16. Position Propensity
For item at position:
pi(item, position | context)
Click reward is position-biased.
If old policy randomized item positions, position propensity helps debias.
Without position randomization, click-based OPE remains biased.
Log position and examination-related signals.
17. Slot-Level OPE
Evaluate each slot separately.
For slot j:
action = item chosen for slot j
reward = click/purchase attributable to slot j
propensity = probability item chosen at slot j
This simplifies but ignores slate interactions.
Useful for early analysis.
18. Module-Level OPE
If page has modules:
Trending
Because you viewed
New arrivals
Recommended for your role
Evaluate module policy.
Action:
which module shown in slot
Reward:
module click/conversion
Propensity easier than item-level slate.
Module-level experiments/OPE are practical.
19. Candidate Source OPE
Evaluate source allocation.
Action:
candidate source quota/allocation
Reward:
downstream engagement from source candidates
Useful when testing source mix.
Need source provenance and exposure logging.
20. Counterfactual Replay
Counterfactual replay runs new policy on historical contexts.
Procedure:
- Take logged request contexts.
- Reconstruct candidate set/features as-of time.
- Run new policy.
- Compare chosen items with logged shown items.
- Use observed rewards only where overlap exists.
- Compute OPE/diagnostics.
Replay also reveals support gap.
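The procedure above can be sketched for a deterministic target policy as below. It only measures overlap and the raw reward on matched events; it does not apply propensity weights, so it is a support diagnostic rather than an unbiased estimate. `CounterfactualReplay`, `LoggedRequest`, and `Summary` are hypothetical names, and the context is simplified to a string.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical replay sketch: how often the target policy agrees with what was shown. */
final class CounterfactualReplay {
    record LoggedRequest(String context, String shownAction, double reward) {}

    record Summary(int total, int matched, double matchRate, double meanRewardOnMatches) {}

    static Summary replay(List<LoggedRequest> logs, Function<String, String> targetPolicy) {
        int matched = 0;
        double rewardSum = 0.0;
        for (LoggedRequest e : logs) {
            String targetAction = targetPolicy.apply(e.context());
            if (targetAction.equals(e.shownAction())) {
                matched++;               // overlap: the observed reward is usable
                rewardSum += e.reward();
            }
        }
        double matchRate = logs.isEmpty() ? 0.0 : (double) matched / logs.size();
        double meanReward = matched == 0 ? Double.NaN : rewardSum / matched;
        return new Summary(logs.size(), matched, matchRate, meanReward);
    }
}
```

A low `matchRate` is the support gap in concrete form: most of what the new policy wants to show was never shown.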
21. Support Diagnostics
Measure:
target_action_covered_rate
average_logging_propensity
min_propensity
weight_distribution
effective_sample_size
fraction_weight_clipped
candidate_overlap
segment_overlap
If support poor, OPE estimate unreliable.
Report support diagnostics with metric.
22. Effective Sample Size
Importance weights reduce effective sample size.
A rough ESS:
ESS = (sum w)^2 / sum(w^2)
If logged data has 1M rows but ESS is 2K, estimate is noisy.
ESS should be reported.
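The ESS formula above translates directly into code. This is a minimal sketch; the class name `EffectiveSampleSize` is an assumption for the example.

```java
/** Hypothetical helper computing ESS = (sum w)^2 / sum(w^2). */
final class EffectiveSampleSize {
    static double of(double[] weights) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double w : weights) {
            sum += w;
            sumSq += w * w;
        }
        // All-zero (or empty) weights carry no information at all.
        return sumSq == 0.0 ? 0.0 : (sum * sum) / sumSq;
    }
}
```

Equal weights give an ESS equal to the row count, while a single dominant weight pushes the ESS toward 1 regardless of how many rows were logged.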
23. Weight Distribution
Inspect:
p50 weight
p95 weight
p99 weight
max weight
Huge weights mean high variance.
If a few examples dominate, OPE unreliable.
Use clipping, better exploration, or do online test.
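A small sketch of these checks, assuming the importance weights are already available as an array. It uses the nearest-rank percentile definition, which is one of several common conventions, and adds the share of total weight held by the largest example as a simple dominance signal. `WeightDistribution` is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical weight diagnostics: nearest-rank percentiles and max-weight dominance. */
final class WeightDistribution {
    /** Nearest-rank percentile for q in (0, 1]. */
    static double percentile(double[] weights, double q) {
        if (weights.length == 0) {
            return Double.NaN;
        }
        double[] sorted = weights.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Share of the total weight carried by the single largest weight. */
    static double maxWeightShare(double[] weights) {
        double sum = 0.0;
        double max = 0.0;
        for (double w : weights) {
            sum += w;
            max = Math.max(max, w);
        }
        return sum == 0.0 ? 0.0 : max / sum;
    }
}
```

Calling `percentile(weights, 0.5)`, `percentile(weights, 0.95)`, `percentile(weights, 0.99)`, and `percentile(weights, 1.0)` yields the p50, p95, p99, and max values listed above.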
24. Reward Maturity
OPE reward must be mature.
If evaluating purchase_7d, wait 7 days.
If using return_30d, wait 30 days.
Immature labels bias reward down.
Same issue as offline evaluation.
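A minimal sketch of a maturity filter, assuming each event carries its decision time and that the attribution window (for example 7 days for purchase_7d) is known. `RewardMaturityFilter` and `Event` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical filter that drops events whose reward window is still open. */
final class RewardMaturityFilter {
    record Event(String requestId, Instant decisionTime) {}

    /** Keeps only events whose full attribution window has elapsed as of now. */
    static List<Event> mature(List<Event> events, Duration window, Instant now) {
        return events.stream()
                .filter(e -> !e.decisionTime().plus(window).isAfter(now))
                .toList();
    }
}
```

Passing `now` explicitly keeps the filter deterministic, which makes an OPE report reproducible for a given data window.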
25. Multiple Rewards
Estimate separate rewards:
click
purchase
hide
report
retention
Do not only estimate primary reward.
Target policy might improve click and worsen report.
Use OPE for guardrails too, if support/reward quality allows.
26. OPE for Exploration Policies
For contextual bandit policies, OPE is central.
If logging policy randomized with known propensity, you can evaluate alternative policies offline before online ramp.
Bandit logs should include:
- action set,
- chosen action,
- propensity,
- reward,
- context.
Without this, bandit learning/evaluation is compromised.
27. OPE for Ranking Model Changes
If new ranker mostly reorders same displayed candidate pool, OPE may help.
If new ranker chooses very different candidates, support weak.
For ranker changes:
- compare overlap with logged slates,
- use logged candidate pool if available,
- evaluate top-K overlap and supported reward,
- then A/B test.
OPE screens candidates; it does not replace the online test.
28. OPE for Business Rule Changes
New rule may reject candidates.
OPE can simulate:
how often rule would remove shown item
reward of removed items
replacement availability
segment impact
But if replacement items were not shown, their reward unknown.
Use OPE to estimate risk and candidate loss, not full outcome.
29. OPE for Reranking/Diversity
Diversity policy may choose items historically not shown.
Support often weak.
Offline replay can measure:
- predicted utility,
- overlap with logged items,
- diversity metrics,
- supported reward.
But online A/B needed.
30. Bias Sources
OPE assumptions can fail due to:
- wrong propensity,
- unlogged policy decision,
- hidden eligibility filters,
- position bias,
- unobserved confounders,
- policy changes during logging,
- nonstationarity,
- interference,
- reward attribution errors.
OPE result should include limitations.
31. Propensity Accuracy
If logged propensity wrong, IPS/SNIPS wrong.
Common errors:
- does not account for filtering,
- ignores slot constraints,
- ignores fallback,
- ignores deterministic business rule,
- wrong candidate set size,
- reused cached response,
- treatment not applied but logged as applied.
Validate propensity logging.
32. Logging Policy Versioning
Policy version must be logged.
If data spans multiple policy versions:
pi_0 is not one policy
Estimator must use correct propensity per event.
Group/stratify by logging policy version.
33. Nonstationarity
User/item behavior changes over time.
Logged data from last month may not represent today.
Use recent windows and compare across windows.
OPE is more reliable when environment stable.
34. Interference
If target policy changes exposure ecosystem, logged individual rewards may not estimate new equilibrium.
Examples:
- marketplace seller exposure,
- creator supply,
- social feed interactions.
OPE usually assumes no interference.
For marketplace changes, use experiments.
35. OPE Output Report
Report should include:
target policy version
logging policy versions
data window
reward definition
estimator type
estimated value
confidence interval/uncertainty
support diagnostics
weight diagnostics
clipping settings
segment results
limitations
recommendation
Never present OPE metric without diagnostics.
36. Confidence Intervals
Use:
- bootstrap,
- asymptotic variance,
- user-level clustering if unit is user.
Importance weighted estimators can have heavy tails.
Confidence intervals may be wide.
If CI wide, OPE not decisive.
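As one illustration of the bootstrap option, the sketch below computes a percentile bootstrap interval for the mean of per-event IPS terms (weight times reward). It resamples events independently, so it assumes events are the right unit; if the unit is the user, resample users instead. `BootstrapCi` and its parameters are hypothetical names.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical percentile-bootstrap sketch for the mean of per-event IPS terms. */
final class BootstrapCi {
    /** Returns {lower, upper} for a (1 - alpha) interval; seed makes the result reproducible. */
    static double[] interval(double[] terms, int resamples, double alpha, long seed) {
        if (terms.length == 0 || resamples <= 0) {
            throw new IllegalArgumentException("need at least one term and one resample");
        }
        Random random = new Random(seed);
        double[] means = new double[resamples];
        for (int b = 0; b < resamples; b++) {
            double sum = 0.0;
            for (int i = 0; i < terms.length; i++) {
                sum += terms[random.nextInt(terms.length)]; // draw with replacement
            }
            means[b] = sum / terms.length;
        }
        Arrays.sort(means);
        int lo = (int) Math.floor((alpha / 2.0) * (resamples - 1));
        int hi = (int) Math.ceil((1.0 - alpha / 2.0) * (resamples - 1));
        return new double[] {means[lo], means[hi]};
    }
}
```

With heavy-tailed weights, the interval can shift noticeably between seeds or data windows, which is itself a sign that the estimate is not decisive.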
37. User-Level Aggregation
If randomization/logging unit is user, aggregate by user to avoid overweighting heavy users.
For OPE, clustering by user may be needed.
High-activity users can dominate event-level estimates.
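One simple way to do this, shown here as a sketch rather than the only valid choice, is to average the per-event IPS terms within each user first and then average across users. `UserLevelAggregation` and `Term` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical user-level aggregation: every user counts once, however active. */
final class UserLevelAggregation {
    /** One per-event IPS term (weight * reward) tagged with its user. */
    record Term(String userId, double weightedReward) {}

    static double meanOfUserMeans(List<Term> terms) {
        Map<String, double[]> perUser = new LinkedHashMap<>(); // userId -> {sum, count}
        for (Term t : terms) {
            double[] acc = perUser.computeIfAbsent(t.userId(), k -> new double[2]);
            acc[0] += t.weightedReward();
            acc[1] += 1.0;
        }
        if (perUser.isEmpty()) {
            return Double.NaN;
        }
        double total = 0.0;
        for (double[] acc : perUser.values()) {
            total += acc[0] / acc[1]; // this user's own mean
        }
        return total / perUser.size();
    }
}
```

If one user contributes eight events with term 1.0 and another contributes two events with term 0.0, the event-level mean is 0.8 while the user-level mean is 0.5.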
38. Segment OPE
Estimate by segment:
- new users,
- regions,
- categories,
- item age,
- source,
- tenant,
- privacy mode.
Support may differ by segment.
A policy can be evaluable for warm users but not cold-start users.
39. Practical OPE Maturity Levels
Level 0
No propensity. Only offline replay/heuristics.
Level 1
Logged propensities for exploration slots.
Level 2
OPE for candidate/source/module policies.
Level 3
OPE integrated with bandit learning and experiment planning.
Level 4
Robust OPE with DR, support diagnostics, governance, and online correlation tracking.
Most teams start at Level 0/1.
40. When Not to Trust OPE
Do not trust OPE when:
propensity missing/wrong
support very low
ESS tiny
weights huge
target policy radically different
reward immature
strong interference
logging policy unknown
candidate set missing
segment support poor
Use OPE as warning signal, not launch approval.
41. Relationship with A/B Testing
OPE helps decide:
is policy safe/promising enough to A/B?
which variants to test?
what risks/segments to monitor?
A/B validates:
actual causal effect in production
OPE reduces risk and cost, but does not replace online experiments.
42. Minimal Logging for Future OPE
Add now:
request_id
context
candidate set sample/full
shown items
position
logging policy id/version
propensity if randomized
experiment assignment
source provenance
reward linkage
Even if you don't implement OPE now, logging enables future evaluation.
43. Practical Example: Exploration Slot OPE
Logged policy:
slot 5 explores one item from exploration pool with epsilon=0.05
chosen uniformly from pool of size n
Propensity:
epsilon * 1/n
Target policy:
choose item with highest UCB score
OPE uses logged events where target policy would choose logged item, weighted by probability ratio.
If target policy chooses unseen items mostly, support low.
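The arithmetic of this example can be sketched as below, assuming the target policy is deterministic for the slot (it picks exactly one item, so its probability for the logged item is either 1 or 0). `ExplorationSlotPropensity` and its method names are hypothetical.

```java
/** Hypothetical helper for the exploration-slot example: propensity and IPS weight. */
final class ExplorationSlotPropensity {
    /** Probability that the slot explored AND picked this particular item: epsilon * 1/n. */
    static double loggingPropensity(double epsilon, int poolSize) {
        if (poolSize <= 0) {
            throw new IllegalArgumentException("poolSize must be positive");
        }
        return epsilon / poolSize;
    }

    /** IPS weight for a deterministic target policy: 1 / propensity on a match, else 0. */
    static double weight(boolean targetPicksLoggedItem, double epsilon, int poolSize) {
        return targetPicksLoggedItem ? 1.0 / loggingPropensity(epsilon, poolSize) : 0.0;
    }
}
```

With epsilon=0.05 and a pool of 20 items, the logging propensity is 0.0025 and a matching event gets a weight of about 400, which shows why small exploration budgets and large pools lead quickly to heavy weights.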
44. Practical Example: Source Allocation
Logged source mix:
two_tower 60%
content 20%
trending 20%
Target:
two_tower 50%
content 30%
trending 20%
If source allocation randomized and propensity logged, OPE can estimate effect.
If not randomized, only replay source contribution.
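For the randomized case, the importance weight of each logged source is the ratio of the target share to the logged share. The sketch below assumes the source for each request was drawn at random with the logged shares as probabilities; `SourceAllocationWeights` is a hypothetical name.

```java
import java.util.Map;

/** Hypothetical helper: importance weight per candidate source for a source-mix change. */
final class SourceAllocationWeights {
    static double weight(String source, Map<String, Double> logged, Map<String, Double> target) {
        double p0 = logged.getOrDefault(source, 0.0);
        if (p0 <= 0.0) {
            // The logging mix never used this source, so its reward cannot be estimated.
            throw new IllegalArgumentException("no logging support for " + source);
        }
        return target.getOrDefault(source, 0.0) / p0;
    }
}
```

For the mixes above the weights come out at roughly 0.83 for two_tower, 1.5 for content, and 1.0 for trending, so the estimate leans more on logged content events and less on two_tower events.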
45. Common Failure Modes
45.1 No Propensity
IPS impossible.
45.2 Wrong Propensity
False confidence.
45.3 No Support
Target policy unevaluable.
45.4 Huge Weights
High variance.
45.5 No Reward Maturity
Delayed reward underestimated.
45.6 Slate Probability Ignored
Item-level estimate overconfident.
45.7 Fallback Not Logged
Policy mismatch.
45.8 Cached Response Breaks Propensity
Logged probability wrong.
45.9 Segment Support Ignored
Launch hurts under-supported segment.
45.10 OPE Treated as Final Proof
Online experiment fails.
46. Implementation Sketch: Logged Bandit Event
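import java.time.Instant;
import java.util.List;

// One logged decision of the logging policy, with the fields OPE needs to reweight it.
public record LoggedBanditEvent(
    String requestId,
    String userId,
    String surface,
    String policyId,
    String policyVersion,
    String actionId,
    int position,
    double loggingPropensity,       // pi_0(actionId | context) at decision time
    List<String> eligibleActionIds, // action set after eligibility filtering
    Instant decisionTime,
    Reward reward
) {}

// Observed outcome; maturedAt marks when the attribution window closed.
public record Reward(
    double value,
    String rewardType,
    Instant maturedAt
) {}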
Store action set if feasible; otherwise store enough to reconstruct.
47. Implementation Sketch: IPS Estimator
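import java.util.List;

// Assumed contract: probability that the target policy picks actionId in the event's context.
interface TargetPolicy {
    double probability(String actionId, LoggedBanditEvent event);
}

public final class IpsEstimator {
    public double estimate(List<LoggedBanditEvent> logs, TargetPolicy targetPolicy) {
        double sum = 0.0;
        int used = 0;
        for (LoggedBanditEvent e : logs) {
            // A logged action cannot have zero propensity; treat such rows as logging errors.
            if (e.loggingPropensity() <= 0.0) {
                continue;
            }
            double targetProb = targetPolicy.probability(e.actionId(), e);
            double weight = targetProb / e.loggingPropensity();
            sum += weight * e.reward().value();
            used++;
        }
        // Average over the rows actually used, so skipped rows do not pull the estimate toward zero.
        return sum / Math.max(used, 1);
    }
}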
Production implementation needs clipping, diagnostics, confidence intervals, and segment analysis.
48. Implementation Sketch: SNIPS Estimator
import java.util.List;

public final class SnipsEstimator {
    public double estimate(List<LoggedBanditEvent> logs, TargetPolicy targetPolicy) {
        double weightedReward = 0.0;
        double weightSum = 0.0;
        for (LoggedBanditEvent e : logs) {
            // Rows with an invalid propensity are skipped in both numerator and denominator.
            if (e.loggingPropensity() <= 0.0) {
                continue;
            }
            double targetProb = targetPolicy.probability(e.actionId(), e);
            double weight = targetProb / e.loggingPropensity();
            weightedReward += weight * e.reward().value();
            weightSum += weight;
        }
        // Normalize by the total weight; the floor avoids division by zero when nothing overlaps.
        return weightedReward / Math.max(weightSum, 1e-9);
    }
}
SNIPS is often more stable than raw IPS.
49. Minimal Production OPE Plan
Start with:
logging:
  policy_version: required
  candidate_set_sample: required
  position: required
  source_provenance: required
  reward_linkage: required
  propensity_for_exploration: required
ope_scope:
  - exploration_slots
  - source_allocation
  - module_selection
estimators:
  - direct_method_baseline
  - snips_for_randomized_logs
diagnostics:
  - support_rate
  - effective_sample_size
  - weight_distribution
  - clipping_fraction
  - segment_support
reporting:
  - limitations_required
  - online_ab_still_required
Do not attempt full slate OPE before logging maturity.
50. Checklist: Counterfactual and Off-Policy Evaluation Readiness
[ ] Logging policy version is logged.
[ ] Candidate/action set is logged or reconstructable.
[ ] Chosen action/slate is logged.
[ ] Position is logged.
[ ] Reward/outcome is linked and mature.
[ ] Propensity is logged for randomized decisions.
[ ] Fallback/treatment-applied status is logged.
[ ] Support/overlap diagnostics exist.
[ ] Weight distribution and ESS are reported.
[ ] Clipping settings are explicit.
[ ] Segment OPE is reported.
[ ] Estimator assumptions are documented.
[ ] OPE reports include limitations.
[ ] OPE is used to screen, not replace, A/B tests.
51. Conclusion
Counterfactual and off-policy evaluation help evaluate a new policy from logged data, but only if logging and randomization are adequate.
Key principles:
- OPE asks what would happen under a new policy using old logged data.
- Propensity is essential.
- Support/overlap determines whether policy is evaluable.
- IPS can be unbiased but high variance.
- SNIPS reduces variance but adds bias.
- Doubly robust combines reward model and propensity correction.
- Slate OPE is much harder than single-action OPE.
- Wrong propensities produce wrong estimates.
- OPE result must include support, weight, ESS, and limitation diagnostics.
- OPE screens candidates for A/B testing; it does not replace online experimentation.
In Part 066, we will cover Recommendation Observability: how to build end-to-end observability for the request path, data, features, models, candidates, ranking, slate, feedback, business metrics, and debugging.