title: Build From Scratch Recommendations System - Part 049
description: "Designing causal thinking and long-term value into a production-grade recommendation system: proxy trap, treatment effect, counterfactuals, long-term metrics, delayed outcomes, feedback loops, exploration, retention, trust, and decision governance."
series: learn-build-from-scratch-recommendations-system
seriesTitle: "Build From Scratch: Enterprise Recommendations System"
order: 49
partTitle: Causal Thinking and Long-Term Value
tags:
- recommendation-system
- recsys
- causal-inference
- long-term-value
- experimentation
- decision-policy
- series
date: 2026-07-02
Part 049 — Causal Thinking and Long-Term Value
A recommendation system can easily look good on short-term metrics while being bad for the business and for users in the long term.
A model can increase clicks but lower trust.
A model can increase watch time but raise hide/report rates.
A model can increase purchases but raise returns/refunds.
A model can increase action acceptance but increase rework.
A model can make users active today but cause them to churn next week.
A model can reinforce past popularity and kill discovery.
The core problem: many of the metrics we observe are proxies, not final outcomes.
Causal thinking helps us ask:
Does this recommendation cause a better outcome, or is it merely correlated with the outcome?
This part covers causal thinking and long-term value for production-grade recommendation systems: proxy trap, counterfactual, treatment effect, delayed outcome, retention, trust, feedback loop, experimental design, long-term reward, and governance.
1. Mental Model: Recommendation Is an Intervention
Every recommendation is a treatment.
user/context x
candidate item/action a
system chooses to show a
user/system outcome happens
The causal question:
What would have happened if we showed a different item?
Observed data only shows the outcome for the item that was displayed.
We do not see the alternative outcomes.
shown item A -> click
not shown item B -> unknown
A ranking dataset is full of unobserved counterfactuals.
2. Correlation vs Causation
The model sees:
popular item has high purchase
But does the popular item cause the purchase?
Maybe:
- the item really is good,
- the item is often shown in top positions,
- the item is promoted by a campaign,
- users who see the item already have high intent,
- the item appears after strong search intent.
Correlation:
shown item -> purchase
Causal effect:
showing item caused incremental purchase vs alternative
A ranker that only learns correlation can end up optimizing historical bias.
3. Proxy Trap
A proxy trap occurs when the model optimizes a signal that is easy to measure but does not fully represent the objective.
Examples:
Click
Proxy for interest, but can become clickbait.
Watch Time
Proxy for engagement, but can reward addictive/low-quality content.
Purchase
Proxy for value, but can ignore return/refund/satisfaction.
Action Acceptance
Proxy for usefulness, but can reward easy actions that do not improve case outcome.
Revenue
Proxy for business value, but can harm retention/trust.
Proxies are necessary, but they must be kept in check by guardrails and long-term metrics.
4. Long-Term Value
Long-term value includes:
retention
repeat purchase
trust
satisfaction
low regret
low return/refund
creator/seller ecosystem health
case resolution quality
SLA improvement
low rework
brand safety
user agency
Many long-term outcomes are delayed and hard to attribute.
But ignoring them causes system myopia.
5. Short-Term vs Long-Term Conflict
Example e-commerce:
high-pressure recommendation increases purchase now
but increases returns and lowers repeat purchase
Example content:
sensational content increases clicks
but increases hide/report and churn
Example enterprise:
quick action accepted by analyst
but creates downstream rework
Model needs:
- negative delayed labels,
- long-term guardrails,
- experiment measurement,
- objective governance.
6. Counterfactual Question
For each decision:
Observed: we showed item A, user clicked.
Counterfactual: if we showed item B instead, would user still click/purchase/stay?
This is impossible to know directly for one instance.
But we can estimate using:
- randomized experiments,
- exploration,
- propensity logging,
- causal modeling,
- off-policy evaluation,
- natural experiments,
- longitudinal metrics.
Causal thinking starts by acknowledging missing counterfactuals.
7. Treatment Effect
Treatment effect:
effect = outcome_if_shown_item - outcome_if_not_shown_item
For recommendation:
incremental effect of showing item/action/document
Example:
user would buy item anyway after searching
Then recommendation gets credit, but incremental effect may be low.
Attribution != causation.
8. Incrementality
Incrementality asks:
How much additional outcome did recommendation create?
Examples:
- incremental purchase vs purchase user would have made anyway,
- incremental watch satisfaction vs user would have watched similar video,
- incremental case resolution improvement vs normal workflow,
- incremental retention from diversity/novelty.
Recommendation systems often over-credit themselves if they only count last-click conversion.
9. Attribution Pitfall
Attribution rule:
if user clicked recommended item and purchased within 7d, recommendation caused purchase
This is often false.
Could be:
- user searched product already,
- item was in cart,
- recommendation appeared after intent formed,
- campaign drove purchase,
- price drop drove purchase.
Attribution is measurement convention, not causal proof.
Use incrementality experiments when possible.
10. Randomized Experiments
A/B tests estimate causal effect of policy change.
Example:
Control: old ranker
Treatment: new ranker
Random assignment creates comparable groups.
Measure:
- primary metric,
- guardrails,
- long-term metrics,
- segments.
A/B testing is the gold standard for online policy evaluation, but it is expensive and slow, and it is not always enough to estimate per-item causal effects.
11. Experiment Unit
Randomization unit matters.
Options:
user-level
session-level
request-level
tenant-level
item-level
creator-level
case-level
User-level keeps experience consistent.
Request-level has more power but can contaminate experience.
Tenant-level may be needed in enterprise.
Choose unit based on interference and product.
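Whichever unit is chosen, assignment should be deterministic for that unit so the same user (or tenant, or case) always lands in the same arm. A minimal sketch, assuming a hash-based split; the class name `ExperimentAssigner` and the arm labels are hypothetical and not part of the series codebase:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: deterministic arm assignment keyed by the randomization unit. */
final class ExperimentAssigner {
    private ExperimentAssigner() {}

    /**
     * Returns "treatment" or "control" for a unit id (user, tenant, session, case, ...).
     * The experiment name acts as a salt so different experiments split independently.
     */
    static String assign(String experiment, String unitId, double treatmentShare) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] h = md.digest((experiment + ":" + unitId).getBytes(StandardCharsets.UTF_8));
            // Interpret the first 4 bytes as an unsigned 32-bit integer and map it into [0, 1).
            long v = ((h[0] & 0xFFL) << 24) | ((h[1] & 0xFFL) << 16)
                   | ((h[2] & 0xFFL) << 8) | (h[3] & 0xFFL);
            double u = v / 4294967296.0;
            return u < treatmentShare ? "treatment" : "control";
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

Passing a user id gives user-level randomization; passing a tenant id gives tenant-level randomization with the same code. The unit id used here should also be written to the decision log so the analysis can group by it.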
12. Interference
Interference means one user's treatment affects another user's outcome.
Marketplace example:
treatment gives more exposure to seller A
seller B loses exposure
Content ecosystem:
creator exposure distribution changes supply incentives
Interference violates simple A/B assumptions.
Marketplace experiments may need cluster or switchback designs.
13. Switchback Experiments
Switchback alternates treatment over time.
Useful when:
- marketplace supply shared,
- delivery/logistics constraints,
- global ranking policy affects everyone.
Example:
hour 1 control
hour 2 treatment
hour 3 control
The design needs to handle time confounding and seasonality.
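The alternating schedule in the example can be sketched as a pure function of time. This is an illustration only: `SwitchbackSchedule` is a hypothetical name, and strict alternation is the simplest possible schedule. If the window length lines up with a daily pattern, the same arm always gets the same hours, so randomizing the arm per window is a common alternative.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: maps a timestamp to the arm of a strictly alternating switchback. */
final class SwitchbackSchedule {
    private SwitchbackSchedule() {}

    /** Window 0 is control, window 1 is treatment, window 2 is control, and so on. */
    static String armAt(Instant start, Duration window, Instant now) {
        if (window.isZero() || window.isNegative()) {
            throw new IllegalArgumentException("window must be positive");
        }
        if (now.isBefore(start)) {
            throw new IllegalArgumentException("experiment has not started yet");
        }
        long index = Duration.between(start, now).toMillis() / window.toMillis();
        return index % 2 == 0 ? "control" : "treatment";
    }
}
```

The window index should be logged next to each decision, because the analysis unit in a switchback is the window, not the individual request.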
14. Exploration Data for Causal Learning
Exploration creates randomized variation.
If policy sometimes shows uncertain candidates with known probability, we can estimate effects better.
Requirements:
propensity logged
candidate set logged
reward logged
position logged
policy version logged
Exploration is not only for cold-start; it is causal data infrastructure.
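To make "known probability" concrete, here is a minimal sketch that assumes a single-slot epsilon-greedy policy: with probability epsilon it picks a candidate uniformly at random, otherwise it picks the top-scored one. The names `EpsilonGreedy` and `Choice` are hypothetical. The point is that the propensity is computed at decision time and returned together with the choice, so it can be logged.

```java
import java.util.Random;

/** Hypothetical single-slot epsilon-greedy chooser that also reports its own propensity. */
final class EpsilonGreedy {
    private EpsilonGreedy() {}

    /** Chosen candidate index plus the probability this policy had of choosing it. */
    record Choice(int index, double propensity) {}

    static Choice choose(double[] scores, double epsilon, Random rng) {
        int n = scores.length;
        if (n == 0) {
            throw new IllegalArgumentException("need at least one candidate");
        }
        int best = 0;
        for (int i = 1; i < n; i++) {
            if (scores[i] > scores[best]) best = i;
        }
        // Explore uniformly with probability epsilon, otherwise exploit the best score.
        int chosen = rng.nextDouble() < epsilon ? rng.nextInt(n) : best;
        // Every candidate gets epsilon / n from exploration; the best one also gets 1 - epsilon.
        double propensity = epsilon / n + (chosen == best ? 1.0 - epsilon : 0.0);
        return new Choice(chosen, propensity);
    }
}
```

Real slates have several positions and more structured exploration, so the propensity computation is harder there. The logging requirement stays the same.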
15. Propensity and Selection Bias
Historical logging policy selects which items are shown.
Observed data is biased toward items old policy liked.
Propensity:
probability old policy showed candidate
With propensity, we can correct some bias.
Without propensity, counterfactual evaluation is much weaker.
16. Off-Policy Evaluation
Off-policy evaluation estimates new policy using old logged data.
Basic idea:
weight observed rewards by probability new policy would choose same action
divided by probability old policy chose it
Challenges:
- high variance,
- support mismatch,
- slate actions,
- delayed rewards,
- incorrect propensity,
- unobserved confounding.
Use OPE to screen, not to replace A/B for major launches.
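The weighting idea above is usually called inverse propensity scoring (IPS). A minimal sketch for the single-action case, assuming each logged event carries the reward, the logging propensity, and the probability the new policy would assign to the same action; the names `IpsEstimator` and `LoggedEvent` are hypothetical. Weight clipping is included because it is a common way to trade some bias for lower variance.

```java
import java.util.List;

/** Hypothetical clipped IPS estimator for single-action logged bandit feedback. */
final class IpsEstimator {
    private IpsEstimator() {}

    /** One logged decision: observed reward, old-policy propensity, new-policy probability. */
    record LoggedEvent(double reward, double loggingPropensity, double targetProbability) {}

    /** Average of reward * min(target / logging, maxWeight) over all logged events. */
    static double estimate(List<LoggedEvent> events, double maxWeight) {
        if (events.isEmpty()) {
            throw new IllegalArgumentException("no logged events");
        }
        double sum = 0.0;
        for (LoggedEvent e : events) {
            if (e.loggingPropensity() <= 0.0) {
                // A logged action with zero propensity means the propensity log is wrong.
                throw new IllegalArgumentException("logged action must have positive propensity");
            }
            double weight = Math.min(e.targetProbability() / e.loggingPropensity(), maxWeight);
            sum += weight * e.reward();
        }
        return sum / events.size();
    }
}
```

This sketch ignores slates, delayed rewards, and confounding, which are exactly the challenges listed above, so its output is a screening signal rather than a launch decision.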
17. Support Problem
If old policy never showed certain item/source, no data tells us how it would perform.
No overlap:
logging_policy_prob(a|x) = 0
Then off-policy evaluation cannot evaluate a new policy that chooses a.
This is why exploration matters.
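A simple pre-check before running OPE is to measure how much of the new policy's probability mass falls on actions the logging policy never takes in the same context. A minimal sketch, assuming both policies can be expressed as action-to-probability maps for one context; `SupportCheck` is a hypothetical name:

```java
import java.util.Map;

/** Hypothetical overlap check between a logging policy and a target policy for one context. */
final class SupportCheck {
    private SupportCheck() {}

    /** Fraction of target-policy probability placed on actions with zero logging probability. */
    static double unsupportedMass(Map<String, Double> loggingProbs, Map<String, Double> targetProbs) {
        double mass = 0.0;
        for (Map.Entry<String, Double> e : targetProbs.entrySet()) {
            if (loggingProbs.getOrDefault(e.getKey(), 0.0) <= 0.0) {
                mass += e.getValue();
            }
        }
        return mass;
    }
}
```

A value near zero means the logged data covers what the new policy wants to do. A large value means the OPE estimate silently ignores a large part of the new policy's behavior.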
18. Delayed Outcomes
Long-term outcomes arrive late.
Examples:
return/refund after 30d
retention after 14d
case resolution after 7d
creator churn after months
Delayed labels need a maturity window.
Training and evaluation should not treat immature outcomes as negative.
Long-term models may train less frequently.
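The rule "do not treat immature outcomes as negative" can be sketched as a three-state label. This is an illustration under simple assumptions (one exposure, one optional outcome timestamp, one window); `DelayedLabel` and its `State` enum are hypothetical names:

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical resolver that separates "not yet known" from "known negative". */
final class DelayedLabel {
    private DelayedLabel() {}

    enum State { POSITIVE, NEGATIVE, IMMATURE }

    /**
     * @param outcomeTime time of the outcome event, or null if none has been observed so far
     * @param now         the time at which the label is being built
     */
    static State resolve(Instant exposureTime, Instant outcomeTime, Duration window, Instant now) {
        Instant deadline = exposureTime.plus(window);
        boolean inWindow = outcomeTime != null
                && !outcomeTime.isBefore(exposureTime)
                && !outcomeTime.isAfter(deadline);
        if (inWindow) {
            return State.POSITIVE;
        }
        // No outcome yet: only call it negative once the window has fully closed.
        return now.isBefore(deadline) ? State.IMMATURE : State.NEGATIVE;
    }
}
```

Rows in the `IMMATURE` state would be excluded from training and evaluation until they mature, instead of being counted as negatives.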
19. Outcome Windows
Define windows.
click:
  window: 30m
purchase:
  window: 7d
return:
  window: 30d_after_purchase
retention:
  window: 14d_after_exposure
case_success:
  window: case_lifecycle_or_14d
Short windows are fast but incomplete.
Long windows are better but slow.
Use both proxy and delayed outcomes.
20. Long-Term Reward Modeling
Long-term reward can be:
short-term reward + delayed correction
Example:
reward =
    click * 1
  + purchase * 5
  - return * 8
  - hide * 3
  + retention_signal * 10
But a scalar reward is a simplification.
Multi-task prediction + utility composition is usually clearer.
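One way to read "multi-task prediction + utility composition": the model predicts each outcome separately, and a small, reviewable function combines the predictions with explicit weights. A minimal sketch that reuses the example weights from the reward formula above; `UtilityComposer` and the task keys are hypothetical, and the weights themselves are a product decision, not a recommendation:

```java
import java.util.Map;

/** Hypothetical utility composition over per-task predicted probabilities. */
final class UtilityComposer {
    private UtilityComposer() {}

    // Weights copied from the example reward; negative weights penalize harmful outcomes.
    private static final Map<String, Double> WEIGHTS = Map.of(
            "click", 1.0,
            "purchase", 5.0,
            "return", -8.0,
            "hide", -3.0,
            "retention", 10.0);

    /** Weighted sum of predictions; tasks missing from the input contribute zero. */
    static double utility(Map<String, Double> predictions) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : WEIGHTS.entrySet()) {
            total += w.getValue() * predictions.getOrDefault(w.getKey(), 0.0);
        }
        return total;
    }
}
```

Keeping the weights outside the model means a trade-off change (for example a heavier return penalty) does not require retraining, and the change can be reviewed like any other configuration.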
21. User Trust as Metric
Trust is hard to measure but important.
Signals:
hide/not interested
report
unsubscribe
session abandonment
reduced return frequency
complaints
low-quality engagement
survey feedback
reset recommendations
disable personalization
Trust degradation may appear slowly.
Recommendation systems should monitor trust proxies.
22. Satisfaction vs Engagement
Engagement is not always satisfaction.
Examples:
- doomscrolling,
- hate-clicking,
- clickbait,
- comparison shopping frustration,
- repeated invalid enterprise suggestions.
Satisfaction signals:
completion with positive follow-up
save/like/share
low negative feedback
return next day
survey helpful
low return/refund
case resolved
Use satisfaction-oriented labels when possible.
23. Causal Feature Leakage
Causal analysis can be corrupted by post-treatment features.
Post-treatment feature:
feature affected by recommendation itself
Example:
- current session click after recommendation,
- item popularity after exposure,
- user profile updated from treatment,
- rank position outcome,
- campaign response.
For causal evaluation, distinguish pre-treatment covariates from post-treatment outcomes.
24. Feedback Loops
Recommendation affects future training data.
Loop:
recommend -> exposure -> feedback -> model training -> recommend
If system overexposes one group, training data says that group is better because it has more feedback.
Mitigations:
- exploration,
- exposure-aware sampling,
- counterfactual evaluation,
- fairness/exposure monitoring,
- debiasing,
- candidate diversity.
25. Popularity Bias as Causal Problem
Popularity features may be both signal and bias.
High popularity may indicate quality.
But popularity is also caused by previous exposure.
Use:
- smoothed popularity,
- segment popularity,
- exposure-normalized rates,
- exploration data,
- decay,
- cap popularity contribution,
- long-tail guardrails.
Do not blindly use raw clicks as quality.
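As an illustration of "smoothed" and "exposure-normalized", the sketch below divides clicks by impressions and shrinks the result toward a prior rate when impressions are few. This is standard additive smoothing; `SmoothedRate`, `priorRate`, and `priorStrength` are hypothetical names, and the prior values are assumptions to be tuned.

```java
/** Hypothetical exposure-normalized click rate with additive smoothing toward a prior. */
final class SmoothedRate {
    private SmoothedRate() {}

    /**
     * @param priorRate     rate assumed for an item with no data, e.g. the segment average
     * @param priorStrength number of pseudo-impressions the prior is worth
     */
    static double smoothedCtr(long clicks, long impressions, double priorRate, double priorStrength) {
        if (clicks < 0 || impressions < clicks || priorStrength <= 0) {
            throw new IllegalArgumentException("invalid counts or prior strength");
        }
        return (clicks + priorRate * priorStrength) / (impressions + priorStrength);
    }
}
```

With a prior of 0.05 worth 100 impressions, an item with 2 clicks from 2 impressions scores about 0.069, while an item with 500 clicks from 5000 impressions scores about 0.099. The raw rates (1.0 versus 0.1) would rank them the other way round.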
26. Long-Term User State
Recommendation can change user state.
Examples:
- broaden interests,
- narrow interests,
- increase trust,
- cause fatigue,
- teach preferences,
- influence future demand.
This goes beyond a one-step bandit.
Full RL framing is tempting but complex. Start with long-term metrics and guardrails before deep RL.
27. When Reinforcement Learning Is Too Much
Full RL requires:
- state transitions,
- delayed rewards,
- sequential policy,
- exploration safety,
- simulation/offline evaluation,
- stable environment.
Most production teams should first build:
- strong supervised ranker,
- exploration/propensity logging,
- long-term metrics,
- reranking policies,
- A/B experimentation.
Do not jump to RL buzzwords without infrastructure.
28. Causal Metrics by Domain
E-commerce
incremental purchase
repeat purchase
return/refund
margin after returns
customer lifetime value
Content
retention
satisfaction
hide/report
diverse consumption
creator follow/unfollow
Enterprise
case resolution
SLA
rework
audit outcome
action correctness
operator productivity
Choose metrics that reflect real outcome.
29. Enterprise Causal Thinking
Enterprise example:
recommend action A
analyst accepts
case later reopens due to missing step
Acceptance was positive proxy, but long-term outcome negative.
Need labels:
- action accepted,
- action completed,
- supervisor approved,
- no rework,
- case progressed,
- SLA improved.
Causal question:
Did recommending action A improve case outcome vs standard workflow?
Use careful experiments/human review.
30. Long-Term Experiment Design
Long-term metrics need longer tests.
Challenges:
- delayed readout,
- seasonality,
- novelty effects,
- user learning,
- supply changes,
- interference.
Use:
- pre-defined readout windows,
- guardrail early stopping,
- cohort analysis,
- sequential monitoring carefully,
- post-experiment follow-up.
Do not declare success from day-1 click lift alone.
31. Novelty Effect
New model/policy can temporarily change behavior.
Users may click because slate looks different.
Short-term lift may fade.
Track:
day 1
day 7
day 14
cohort retention
repeat engagement
negative feedback trend
Causal long-term evaluation requires patience.
32. Heterogeneous Treatment Effects
Policy may help one segment and harm another.
Examples:
new users benefit from diversity
power users prefer focused recommendations
new items benefit from exploration
high-intent search users dislike broad exploration
Analyze treatment effects by segment.
Personalize the policy by segment only after there is enough evidence.
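Segment analysis can start as a plain difference in mean outcome between treatment and control, computed per segment. A minimal sketch, assuming randomized assignment and one outcome value per unit; `SegmentEffects` and `Observation` are hypothetical names, and the sketch deliberately omits confidence intervals and multiple-comparison corrections:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-segment difference in means between treatment and control. */
final class SegmentEffects {
    private SegmentEffects() {}

    record Observation(String segment, boolean treated, double outcome) {}

    /** Returns mean(treated) - mean(control) for each segment that has data in both arms. */
    static Map<String, Double> effectBySegment(List<Observation> rows) {
        // Per segment: [sum treated, count treated, sum control, count control]
        Map<String, double[]> acc = new TreeMap<>();
        for (Observation r : rows) {
            double[] a = acc.computeIfAbsent(r.segment(), k -> new double[4]);
            int base = r.treated() ? 0 : 2;
            a[base] += r.outcome();
            a[base + 1] += 1;
        }
        Map<String, Double> effects = new TreeMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            double[] a = e.getValue();
            if (a[1] > 0 && a[3] > 0) {
                effects.put(e.getKey(), a[0] / a[1] - a[2] / a[3]);
            }
        }
        return effects;
    }
}
```

Small segments produce noisy estimates, so a segment-level difference is a prompt for a closer look, not a finding on its own.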
33. Causal Guardrails
Guardrails:
report rate
hide rate
return rate
refund rate
unsubscribe
complaint
latency
creator/seller health
tenant error rate
policy violations
Guardrails prevent proxy optimization from harming important outcomes.
For critical metrics, fail the experiment if a guardrail worsens beyond its agreed threshold.
34. Incrementality Tests
For recommendation modules/campaigns, incrementality test:
holdout group does not receive module/campaign
treatment group receives it
compare outcome
This estimates added value.
Useful for:
- email recommendations,
- push recommendations,
- campaign modules,
- cross-sell,
- enterprise assistant panel.
Sometimes recommendation steals credit from organic behavior. Holdout reveals that.
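The comparison step can be sketched as a difference in conversion rates with a standard error, using the usual formula for two independent proportions. `IncrementalityTest` and `Result` are hypothetical names, and the sketch assumes users were randomly assigned to holdout and treatment:

```java
/** Hypothetical holdout comparison: incremental conversion rate and its standard error. */
final class IncrementalityTest {
    private IncrementalityTest() {}

    record Result(double lift, double standardError) {}

    static Result compare(long treatedConversions, long treatedUsers,
                          long holdoutConversions, long holdoutUsers) {
        if (treatedUsers <= 0 || holdoutUsers <= 0) {
            throw new IllegalArgumentException("both groups need at least one user");
        }
        double pT = (double) treatedConversions / treatedUsers;
        double pH = (double) holdoutConversions / holdoutUsers;
        // Standard error of a difference between two independent proportions.
        double se = Math.sqrt(pT * (1 - pT) / treatedUsers + pH * (1 - pH) / holdoutUsers);
        return new Result(pT - pH, se);
    }
}
```

For example, 120 conversions out of 1000 treated users against 100 out of 1000 holdout users gives a lift of 0.02 with a standard error of roughly 0.014. Last-click attribution would have credited the module with every purchase that followed a click on it, while the holdout shows that the measured increment is small and not clearly distinguishable from zero at this sample size.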
35. Ghost Ads / Ghost Recommendations
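A technique: determine which users would have been shown a candidate, randomly withhold it for a small share of them, and log the withheld ("ghost") impression so that their outcomes can be compared with users who actually saw it.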
Use carefully because withholding useful recommendation hurts users.
Useful for estimating incremental effect of exposure.
Needs ethical/product review.
36. Causal Decision Logging
Log:
request context
candidate set
policy scores
chosen items
randomization/exploration
propensity
position
model/policy version
outcomes
Without decision logging, causal analysis is weak.
Decision log is production data asset.
37. Causal DAG Thinking
Draw causal graph.
Example:
User intent -> Recommendation shown -> Click -> Purchase
User intent -> Purchase
Position -> Click
Campaign -> Recommendation shown
Item quality -> Click/Purchase
DAG helps identify confounders.
If user intent causes both recommendation and purchase, naive attribution overestimates recommendation effect.
38. Confounders
Common confounders:
user intent
item popularity
campaign exposure
position
price discount
inventory
seasonality
region
device
UI layout
seller quality
case severity
Causal analysis must account for them or use randomization.
39. Long-Term Value Model
A long-term value model predicts:
expected future value after showing candidate
Potential labels:
retention_7d
repeat_purchase_30d
satisfaction_survey
low_return
case_success
Use as auxiliary task or reranking signal.
Be cautious: long-term labels are sparse and confounded.
40. Reward Hacking
If a system optimizes a metric, it may exploit that metric's weaknesses.
Examples:
- clickbait for click,
- long videos for watch time,
- discount-heavy items for conversion,
- easy actions for acceptance,
- repetitive safe items for low report.
Mitigation:
- multi-objective metrics,
- causal experiments,
- guardrails,
- human review,
- metric audits,
- user feedback.
41. Metric Review
Periodically ask:
Is this metric still aligned with product value?
Has model found loopholes?
What negative behavior is increasing?
Are user complaints changing?
Are long-term metrics stable?
Does metric harm a segment?
Metrics are part of system design.
42. Decision Governance
Long-term value trade-offs require governance.
Questions:
Who decides click vs retention trade-off?
Who approves report/hide guardrail thresholds?
Who owns marketplace health?
Who reviews enterprise action outcome?
When do we stop experiment?
ML team should not make all value trade-offs alone.
43. Failure Modes
43.1 Click Proxy Trap
Click increases, satisfaction drops.
43.2 Attribution Mistaken for Causation
Recommendation gets credit for inevitable purchase.
43.3 Delayed Negative Ignored
Returns/rework appear later.
43.4 No Propensity Logging
Counterfactual evaluation becomes unreliable.
43.5 Feedback Loop Amplifies Bias
Popular gets more popular.
43.6 Segment Harm Hidden
Global metric improves while a segment is harmed.
43.7 Experiment Too Short
Novelty effect mistaken for durable lift.
43.8 Post-Treatment Feature Leakage
Causal model biased.
43.9 Long-Term Metric Not Owned
No one acts on degradation.
43.10 Full RL Attempt Without Infrastructure
Complexity without safety.
44. Implementation Sketch: Decision Log
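import java.time.Instant;
import java.util.List;
import java.util.Map;

// One entry per request/slate. In a real codebase each public record lives in its own file.
public record DecisionLog(
    String requestId,
    String slateId,
    String userId,
    String surface,
    Instant decisionTime,
    String modelVersion,
    String rankingPolicyVersion,
    String explorationPolicyVersion,
    List<LoggedCandidateDecision> candidates
) {}

// One entry per logged candidate, whether or not it was shown.
public record LoggedCandidateDecision(
    String itemId,
    boolean shown,
    int position,
    double score,
    double propensity,               // probability the serving policy showed this candidate
    List<String> sources,            // candidate sources
    Map<String, Double> predictions  // per-task model outputs
) {}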
This log supports experimentation, OPE, and debugging.
45. Implementation Sketch: Long-Term Label Spec
label: repeat_purchase_30d
entity: impression
positive:
  event: purchase
  condition:
    same_user: true
    within: 30d
negative:
  no_purchase_after_window: true
maturity:
  wait: 30d
exclusions:
  - user_deleted
  - item_unavailable_entire_window
  - fraud_flag
Every delayed label needs maturity and exclusions.
46. Minimal Production Causal Plan
Start with:
decision_logging:
  candidate_set: sampled
  final_slate: all
  scores: true
  model_policy_versions: true
  propensity_for_exploration: true
experimentation:
  user_level_ab_tests: true
  long_term_readout: 7d_14d_30d
  guardrails:
    - hide
    - report
    - return
    - retention
    - latency
long_term_labels:
  - return_refund_30d
  - repeat_purchase_30d
  - retention_14d
  - case_success_14d
analysis:
  segment_treatment_effects: true
  exposure_bias_monitoring: true
  offline_ope_screening: planned
Do this before attempting advanced RL.
47. Checklist: Causal Thinking and Long-Term Value Readiness
[ ] Primary objective and proxy labels are documented.
[ ] Long-term outcomes are defined.
[ ] Label maturity windows exist.
[ ] Decision logs include candidate set, scores, versions.
[ ] Exploration propensity is logged.
[ ] A/B tests use appropriate randomization unit.
[ ] Guardrails include negative and delayed outcomes.
[ ] Segment-level treatment effects are analyzed.
[ ] Exposure feedback loops are monitored.
[ ] Incrementality tests exist for major modules/campaigns.
[ ] Post-treatment features are avoided in causal analysis.
[ ] Long-term metric owner is defined.
[ ] Full RL is not attempted before logging/experimentation maturity.
48. Conclusion
Causal thinking helps a recommendation system escape the trap of proxy metrics and historical bias.
Key principles:
- Recommendation is an intervention.
- Observed feedback lacks counterfactuals.
- Attribution is not causation.
- Click/purchase/watch/action acceptance are proxies, not complete value.
- Long-term outcomes and negative delayed outcomes matter.
- Exploration and propensity logging create better causal data.
- A/B tests estimate policy-level causal effects.
- Segment analysis is mandatory.
- Feedback loops can amplify popularity and exposure bias.
- Do not jump to full RL before strong logging, exploration, and experimentation infrastructure.
In Part 050, we will cover LLM-Augmented Recommendation Systems: how LLMs can help with query understanding, semantic enrichment, explanation, metadata generation, agentic workflows, and enterprise recommendation, without replacing sound retrieval, ranking, and foundations.