Learn Build From Scratch Recommendations System Part 049 Causal Thinking And Long Term Value

title: Build From Scratch Recommendations System - Part 049
description: Designing causal thinking and long-term value into a production-grade recommendation system: proxy trap, treatment effect, counterfactuals, long-term metrics, delayed outcomes, feedback loops, exploration, retention, trust, and decision governance.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 49
partTitle: Causal Thinking and Long-Term Value
tags:
  - recommendation-system
  - recsys
  - causal-inference
  - long-term-value
  - experimentation
  - decision-policy
  - series
date: 2026-07-02

Part 049 — Causal Thinking and Long-Term Value

A recommendation system can easily look good on short-term metrics while being bad for the business and its users in the long run.

A model can increase clicks but erode trust.
A model can increase watch time but raise hide/report rates.
A model can increase purchases but raise returns/refunds.
A model can increase action acceptance but increase rework.
A model can make users active today but cause them to churn next week.
A model can reinforce past popularity and kill discovery.

The core problem: many of the metrics we observe are proxies, not final outcomes.

Causal thinking helps us ask:

Does this recommendation cause a better outcome, or is it merely correlated with the outcome?

This part covers causal thinking and long-term value for production-grade recommendation systems: proxy traps, counterfactuals, treatment effects, delayed outcomes, retention, trust, feedback loops, experimental design, long-term reward, and governance.

1. Mental Model: Recommendation Is an Intervention

Every recommendation is a treatment.

user/context x
candidate item/action a
system chooses to show a
user/system outcome happens

The causal question:

What would have happened if we showed a different item?

Observed data only shows the outcome for the item that was shown.

We never see the alternative outcomes.

shown item A -> click
not shown item B -> unknown

A ranking dataset is full of unobserved counterfactuals.

2. Correlation vs Causation

The model sees:

popular item has high purchase

But does the item's popularity cause the purchase?

Perhaps:

the item really is good,
the item is often shown in top positions,
the item is promoted by a campaign,
users who see the item already have high intent,
the item appears after strong search intent.

Correlation:

shown item -> purchase

Causal effect:

showing item caused incremental purchase vs alternative

A ranker that only learns correlation can end up optimizing historical bias.

3. Proxy Trap

A proxy trap occurs when the model optimizes a signal that is easy to measure but does not fully represent the objective.

Examples:

Click

Proxy for interest, but can become clickbait.

Watch Time

Proxy for engagement, but can reward addictive/low-quality content.

Purchase

Proxy for value, but can ignore return/refund/satisfaction.

Action Acceptance

Proxy for usefulness, but can reward easy actions that do not improve case outcome.

Revenue

Proxy for business value, but can harm retention/trust.

Proxies are necessary, but they must be controlled by guardrails and long-term metrics.

4. Long-Term Value

Long-term value includes:

retention
repeat purchase
trust
satisfaction
low regret
low return/refund
creator/seller ecosystem health
case resolution quality
SLA improvement
low rework
brand safety
user agency

Many long-term outcomes are delayed and hard to attribute.

But ignoring them causes system myopia.

5. Short-Term vs Long-Term Conflict

Example e-commerce:

high-pressure recommendation increases purchase now
but increases returns and lowers repeat purchase

Example content:

sensational content increases clicks
but increases hide/report and churn

Example enterprise:

quick action accepted by analyst
but creates downstream rework

Model needs:

negative delayed labels,
long-term guardrails,
experiment measurement,
objective governance.

6. Counterfactual Question

For each decision:

Observed: we showed item A, user clicked.
Counterfactual: if we showed item B instead, would user still click/purchase/stay?

This is impossible to know directly for one instance.

But we can estimate using:

randomized experiments,
exploration,
propensity logging,
causal modeling,
off-policy evaluation,
natural experiments,
longitudinal metrics.

Causal thinking starts by acknowledging missing counterfactuals.

7. Treatment Effect

Treatment effect:

effect = outcome_if_shown_item - outcome_if_not_shown_item

For recommendation:

incremental effect of showing item/action/document

Example:

user would buy item anyway after searching

Then recommendation gets credit, but incremental effect may be low.

Attribution != causation.

8. Incrementality

Incrementality asks:

How much additional outcome did recommendation create?

Examples:

incremental purchase vs purchase user would have made anyway,
incremental watch satisfaction vs user would have watched similar video,
incremental case resolution improvement vs normal workflow,
incremental retention from diversity/novelty.

Recommendation systems often over-credit themselves if they only count last-click conversion.

9. Attribution Pitfall

Attribution rule:

if user clicked recommended item and purchased within 7d, recommendation caused purchase

This is often false.

Could be:

user searched product already,
item was in cart,
recommendation appeared after intent formed,
campaign drove purchase,
price drop drove purchase.

Attribution is measurement convention, not causal proof.

Use incrementality experiments when possible.

10. Randomized Experiments

A/B tests estimate causal effect of policy change.

Example:

Control: old ranker
Treatment: new ranker

Random assignment creates comparable groups.

Measure:

primary metric,
guardrails,
long-term metrics,
segments.

A/B testing is the gold standard for online policy evaluation, but it is expensive and slow, and it does not always give per-item causal effects.

11. Experiment Unit

Randomization unit matters.

Options:

user-level
session-level
request-level
tenant-level
item-level
creator-level
case-level

User-level keeps experience consistent.
Request-level has more power but can contaminate experience.
Tenant-level may be needed in enterprise.

Choose unit based on interference and product.
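One way to keep a user-level experience consistent is to derive the arm deterministically from a hash of the experiment and unit identifiers. The sketch below is only an illustration under that assumption; `ExperimentAssigner`, the experiment ID format, and the 100-bucket split are hypothetical names and choices, not something this series prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: deterministic assignment for a chosen randomization unit. */
final class ExperimentAssigner {
    private ExperimentAssigner() {}

    /** Maps (experimentId, unitId) to a stable bucket in [0, 100). */
    static int bucket(String experimentId, String unitId) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] h = sha.digest((experimentId + ":" + unitId).getBytes(StandardCharsets.UTF_8));
            // Combine the first four hash bytes into an int, then reduce to a bucket.
            int v = ((h[0] & 0xFF) << 24) | ((h[1] & 0xFF) << 16) | ((h[2] & 0xFF) << 8) | (h[3] & 0xFF);
            return Math.floorMod(v, 100);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }

    /** Units whose bucket falls below treatmentPercent get the treatment arm. */
    static String arm(String experimentId, String unitId, int treatmentPercent) {
        return bucket(experimentId, unitId) < treatmentPercent ? "treatment" : "control";
    }
}
```

Passing a user ID as `unitId` gives user-level randomization; passing a tenant or session ID changes the unit without changing the code. Including the experiment ID in the hash keeps assignments in different experiments independent of each other.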

12. Interference

Interference means one user's treatment affects another user's outcome.

Marketplace example:

treatment gives more exposure to seller A
seller B loses exposure

Content ecosystem:

creator exposure distribution changes supply incentives

Interference violates simple A/B assumptions.

Marketplace experiments may need cluster or switchback designs.

13. Switchback Experiments

Switchback alternates treatment over time.

Useful when:

marketplace supply shared,
delivery/logistics constraints,
global ranking policy affects everyone.

Example:

hour 1 control
hour 2 treatment
hour 3 control

The design needs to handle time confounding and seasonality.
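The alternating schedule above can be sketched as a pure function of time. This is a minimal illustration, assuming fixed-length windows and strict alternation; `SwitchbackSchedule` and its parameters are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: maps a timestamp to a switchback arm. */
final class SwitchbackSchedule {
    private SwitchbackSchedule() {}

    /** Zero-based index of the window that contains t. */
    static long windowIndex(Instant start, Duration window, Instant t) {
        return Math.floorDiv(Duration.between(start, t).toMillis(), window.toMillis());
    }

    /** Even windows run control, odd windows run treatment. */
    static String arm(Instant start, Duration window, Instant t) {
        return Math.floorMod(windowIndex(start, window, t), 2L) == 0 ? "control" : "treatment";
    }
}
```

Strict alternation can line up with daily or weekly cycles, so a real design may randomize the arm per window instead and compare windows from matching times of day.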

14. Exploration Data for Causal Learning

Exploration creates randomized variation.

If policy sometimes shows uncertain candidates with known probability, we can estimate effects better.

Requirements:

propensity logged
candidate set logged
reward logged
position logged
policy version logged

Exploration is not only for cold-start; it is causal data infrastructure.
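To make "known probability" concrete, here is a minimal sketch that assumes a single-slot epsilon-greedy policy: with probability 1 - epsilon it shows the top-scored candidate, otherwise it picks uniformly at random. `EpsilonGreedyLogger` and `Choice` are hypothetical names, and real slates with several positions need a more careful propensity model.

```java
import java.util.Random;

/** Hypothetical sketch: single-slot epsilon-greedy selection with a logged propensity. */
final class EpsilonGreedyLogger {
    private EpsilonGreedyLogger() {}

    /** The chosen candidate index and the probability with which the policy chose it. */
    record Choice(int index, double propensity) {}

    /** Probability that epsilon-greedy selects `index` when `best` is the top-scored of n candidates. */
    static double propensity(int index, int best, int n, double epsilon) {
        double uniformShare = epsilon / n;
        return index == best ? uniformShare + (1.0 - epsilon) : uniformShare;
    }

    static Choice choose(double[] scores, double epsilon, Random rng) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) best = i;
        }
        // Explore uniformly with probability epsilon, otherwise exploit the best score.
        int chosen = rng.nextDouble() < epsilon ? rng.nextInt(scores.length) : best;
        return new Choice(chosen, propensity(chosen, best, scores.length, epsilon));
    }
}
```

The value returned in `Choice.propensity` is what would be written to the decision log next to the candidate set, reward, position, and policy version.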

15. Propensity and Selection Bias

Historical logging policy selects which items are shown.

Observed data is biased toward items old policy liked.

Propensity:

probability old policy showed candidate

With propensity, we can correct some bias.

Without propensity, counterfactual evaluation is much weaker.

16. Off-Policy Evaluation

Off-policy evaluation estimates new policy using old logged data.

Basic idea:

weight observed rewards by probability new policy would choose same action
divided by probability old policy chose it

Challenges:

high variance,
support mismatch,
slate actions,
delayed rewards,
incorrect propensity,
unobserved confounding.

Use OPE to screen, not to replace A/B for major launches.
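The weighting idea above is commonly called inverse propensity scoring (IPS). The sketch below shows IPS and its self-normalized variant (SNIPS) with weight clipping, assuming single-action decisions and correctly logged propensities. `OffPolicyEstimator`, `LoggedSample`, and the `clip` parameter are hypothetical names for illustration.

```java
import java.util.List;

/** Hypothetical sketch: IPS and self-normalized IPS over logged single-action decisions. */
final class OffPolicyEstimator {
    private OffPolicyEstimator() {}

    /** Observed reward, old-policy propensity, and new-policy probability for the logged action. */
    record LoggedSample(double reward, double loggingProb, double targetProb) {}

    /** Importance weight targetProb / loggingProb, clipped from above to limit variance. */
    static double weight(LoggedSample s, double clip) {
        if (s.loggingProb() <= 0.0) {
            throw new IllegalArgumentException("logging propensity must be > 0 (support problem)");
        }
        return Math.min(s.targetProb() / s.loggingProb(), clip);
    }

    /** IPS: mean of weighted rewards. Returns NaN for an empty list. */
    static double ips(List<LoggedSample> samples, double clip) {
        double sum = 0.0;
        for (LoggedSample s : samples) sum += weight(s, clip) * s.reward();
        return sum / samples.size();
    }

    /** SNIPS: weighted rewards divided by total weight; lower variance, slightly biased. */
    static double snips(List<LoggedSample> samples, double clip) {
        double num = 0.0, den = 0.0;
        for (LoggedSample s : samples) {
            double w = weight(s, clip);
            num += w * s.reward();
            den += w;
        }
        return den == 0.0 ? 0.0 : num / den;
    }
}
```

Passing `Double.POSITIVE_INFINITY` as `clip` disables clipping. Clipping trades a little bias for lower variance, which is one reason these estimates are better suited to screening candidates than to replacing an A/B test.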

17. Support Problem

If old policy never showed certain item/source, no data tells us how it would perform.

No overlap:

logging_policy_prob(a|x) = 0

Then off-policy cannot evaluate new policy choosing a.

This is why exploration matters.

18. Delayed Outcomes

Long-term outcomes arrive late.

Examples:

return/refund after 30d
retention after 14d
case resolution after 7d
creator churn after months

Labels need a maturity window before they are used.

Training and evaluation should not treat immature outcomes as negative.

Long-term models may train less frequently.

19. Outcome Windows

Define windows.

click:
  window: 30m
purchase:
  window: 7d
return:
  window: 30d_after_purchase
retention:
  window: 14d_after_exposure
case_success:
  window: case_lifecycle_or_14d

Short windows are fast but incomplete.
Long windows are better but slow.

Use both proxy and delayed outcomes.
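A window definition only helps if immature labels are kept apart from true negatives. The following sketch resolves one delayed label from an exposure time, an optional outcome time, and a window. `DelayedLabel` and its `State` values are hypothetical names, and the rule that an event counts only inside the window is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch: resolves a delayed label without treating immature rows as negative. */
final class DelayedLabel {
    private DelayedLabel() {}

    enum State { POSITIVE, NEGATIVE, IMMATURE }

    /**
     * @param exposure  when the recommendation was shown
     * @param eventTime when the outcome event happened, or null if none has been observed
     * @param window    outcome window measured from the exposure
     * @param now       evaluation time
     */
    static State resolve(Instant exposure, Instant eventTime, Duration window, Instant now) {
        Instant deadline = exposure.plus(window);
        boolean inWindow = eventTime != null
                && !eventTime.isBefore(exposure)
                && !eventTime.isAfter(deadline);
        if (inWindow) return State.POSITIVE;
        // No qualifying event yet: only call it negative once the window has closed.
        return now.isBefore(deadline) ? State.IMMATURE : State.NEGATIVE;
    }
}
```

Rows that resolve to `IMMATURE` would be excluded from training and evaluation until a later run, when the window has closed.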

20. Long-Term Reward Modeling

Long-term reward can be:

short-term reward + delayed correction

Example:

reward =
  click * 1
  + purchase * 5
  - return * 8
  - hide * 3
  + retention_signal * 10

But scalar reward is simplification.

Multi-task prediction + utility composition is usually clearer.
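A minimal sketch of utility composition, assuming separate per-task models that each output a per-impression probability and reusing the example weights above. `UtilityComposer`, `Predictions`, and `Weights` are hypothetical names, and the weights are illustrative rather than recommended values.

```java
/** Hypothetical sketch: combine per-task predictions into one utility with explicit weights. */
final class UtilityComposer {
    private UtilityComposer() {}

    /** Per-impression probabilities from separate task heads (assumed unconditional). */
    record Predictions(double pClick, double pPurchase, double pReturn, double pHide, double pRetention) {}

    /** Weights kept as configuration so product owners can review and change them. */
    record Weights(double click, double purchase, double returned, double hide, double retention) {}

    /** The example weights from the reward formula: 1, 5, 8, 3, 10. */
    static final Weights EXAMPLE_WEIGHTS = new Weights(1.0, 5.0, 8.0, 3.0, 10.0);

    /** Positive outcomes add to utility; returns and hides subtract from it. */
    static double utility(Predictions p, Weights w) {
        return w.click() * p.pClick()
                + w.purchase() * p.pPurchase()
                - w.returned() * p.pReturn()
                - w.hide() * p.pHide()
                + w.retention() * p.pRetention();
    }
}
```

With these weights, an impression that is clicked, purchased, and then returned scores 1 + 5 - 8 = -2, so a purchase that ends in a return counts as worse than no interaction at all.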

21. User Trust as Metric

Trust is hard to measure but important.

Signals:

hide/not interested
report
unsubscribe
session abandonment
reduced return frequency
complaints
low-quality engagement
survey feedback
reset recommendations
disable personalization

Trust degradation may appear slowly.

Recommendation systems should monitor trust proxies.

22. Satisfaction vs Engagement

Engagement is not always satisfaction.

Examples:

doomscrolling,
hate-clicking,
clickbait,
comparison shopping frustration,
repeated invalid enterprise suggestions.

Satisfaction signals:

completion with positive follow-up
save/like/share
low negative feedback
return next day
survey helpful
low return/refund
case resolved

Use satisfaction-oriented labels when possible.

23. Causal Feature Leakage

Causal analysis can be corrupted by post-treatment features.

Post-treatment feature:

feature affected by recommendation itself

Example:

current session click after recommendation,
item popularity after exposure,
user profile updated from treatment,
rank position outcome,
campaign response.

For causal evaluation, distinguish pre-treatment covariates from post-treatment outcomes.

24. Feedback Loops

Recommendation affects future training data.

Loop:

recommend -> exposure -> feedback -> model training -> recommend

If system overexposes one group, training data says that group is better because it has more feedback.

Mitigations:

exploration,
exposure-aware sampling,
counterfactual evaluation,
fairness/exposure monitoring,
debiasing,
candidate diversity.

25. Popularity Bias as Causal Problem

Popularity features may be both signal and bias.

High popularity may indicate quality.
But popularity is also caused by previous exposure.

Use:

smoothed popularity,
segment popularity,
exposure-normalized rates,
exploration data,
decay,
cap popularity contribution,
long-tail guardrails.

Do not blindly use raw clicks as quality.
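As one illustration of a smoothed, exposure-normalized rate, the sketch below shrinks an item's click rate toward a prior when it has few impressions. `PopularityRate`, `priorRate`, and `priorStrength` are hypothetical names, and the additive smoothing form is an assumption about how the smoothing is done.

```java
/** Hypothetical sketch: click rate normalized by exposure and shrunk toward a prior. */
final class PopularityRate {
    private PopularityRate() {}

    /**
     * @param clicks        observed clicks for the item
     * @param impressions   observed impressions (exposure) for the item
     * @param priorRate     baseline rate, for example the category average
     * @param priorStrength pseudo-impressions given to the prior
     */
    static double smoothedRate(long clicks, long impressions, double priorRate, double priorStrength) {
        if (impressions < 0 || clicks < 0 || priorStrength <= 0) {
            throw new IllegalArgumentException("counts must be >= 0 and priorStrength > 0");
        }
        return (clicks + priorRate * priorStrength) / (impressions + priorStrength);
    }
}
```

With a prior of 0.05 and strength 100, an item with 5 clicks on 10 impressions gets about 0.09, while an item with 400 clicks on 1000 impressions gets about 0.37. The raw rates (0.5 versus 0.4) would rank them the other way, on very little evidence.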

26. Long-Term User State

Recommendation can change user state.

Examples:

broaden interests,
narrow interests,
increase trust,
cause fatigue,
teach preferences,
influence future demand.

This is beyond one-step bandit.

Full RL framing is tempting but complex. Start with long-term metrics and guardrails before deep RL.

27. When Reinforcement Learning Is Too Much

Full RL requires:

state transitions,
delayed rewards,
sequential policy,
exploration safety,
simulation/offline evaluation,
stable environment.

Most production teams should first build:

strong supervised ranker,
exploration/propensity logging,
long-term metrics,
reranking policies,
A/B experimentation.

Do not jump to RL buzzwords without infrastructure.

28. Causal Metrics by Domain

E-commerce

incremental purchase
repeat purchase
return/refund
margin after returns
customer lifetime value

Content

retention
satisfaction
hide/report
diverse consumption
creator follow/unfollow

Enterprise

case resolution
SLA
rework
audit outcome
action correctness
operator productivity

Choose metrics that reflect real outcome.

29. Enterprise Causal Thinking

Enterprise example:

recommend action A
analyst accepts
case later reopens due to missing step

Acceptance was positive proxy, but long-term outcome negative.

Need labels:

action accepted,
action completed,
supervisor approved,
no rework,
case progressed,
SLA improved.

Causal question:

Did recommending action A improve case outcome vs standard workflow?

Use careful experiments/human review.

30. Long-Term Experiment Design

Long-term metrics need longer tests.

Challenges:

delayed readout,
seasonality,
novelty effects,
user learning,
supply changes,
interference.

Use:

pre-defined readout windows,
guardrail early stopping,
cohort analysis,
careful sequential monitoring,
post-experiment follow-up.

Do not declare success from day-1 click lift alone.

31. Novelty Effect

New model/policy can temporarily change behavior.

Users may click because slate looks different.

Short-term lift may fade.

Track:

day 1
day 7
day 14
cohort retention
repeat engagement
negative feedback trend

Causal long-term evaluation requires patience.

32. Heterogeneous Treatment Effects

Policy may help one segment and harm another.

Examples:

new users benefit from diversity
power users prefer focused recommendations
new items benefit from exploration
high-intent search users dislike broad exploration

Analyze treatment effects by segment.

Use personalization of policy only after enough evidence.
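A simple starting point for segment analysis is a per-segment difference in mean outcome between treatment and control. The sketch assumes randomized assignment within each segment and ignores confidence intervals; `SegmentEffects` and `Observation` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: difference in mean outcome (treatment minus control) per segment. */
final class SegmentEffects {
    private SegmentEffects() {}

    record Observation(String segment, boolean treated, double outcome) {}

    static Map<String, Double> effectBySegment(List<Observation> data) {
        // Accumulator per segment: [treatedSum, treatedCount, controlSum, controlCount].
        Map<String, double[]> acc = new TreeMap<>();
        for (Observation o : data) {
            double[] a = acc.computeIfAbsent(o.segment(), k -> new double[4]);
            int offset = o.treated() ? 0 : 2;
            a[offset] += o.outcome();
            a[offset + 1] += 1.0;
        }
        Map<String, Double> effects = new TreeMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            double[] a = e.getValue();
            // Skip segments that are missing one of the two arms.
            if (a[1] > 0 && a[3] > 0) {
                effects.put(e.getKey(), a[0] / a[1] - a[2] / a[3]);
            }
        }
        return effects;
    }
}
```

Small segments produce noisy estimates, and checking many segments inflates false positives, so a sign flip in one segment is a prompt for a follow-up experiment rather than proof of harm.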

33. Causal Guardrails

Guardrails:

report rate
hide rate
return rate
refund rate
unsubscribe
complaint
latency
creator/seller health
tenant error rate
policy violations

Guardrails prevent proxy optimization from harming important outcomes.

For critical metrics, fail experiment if guardrail worsens.
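A guardrail rule can be written down as an explicit check. The sketch below assumes a "lower is better" rate such as report or return rate and a pre-agreed maximum relative increase; `GuardrailCheck` and the threshold parameter are hypothetical, and a production check would also account for statistical noise.

```java
/** Hypothetical sketch: flags a guardrail when a "lower is better" rate worsens too much. */
final class GuardrailCheck {
    private GuardrailCheck() {}

    /** Relative change of the treatment rate versus control, e.g. 0.2 means 20% higher. */
    static double relativeChange(double controlRate, double treatmentRate) {
        if (controlRate <= 0.0) {
            throw new IllegalArgumentException("control rate must be > 0 to compute a relative change");
        }
        return (treatmentRate - controlRate) / controlRate;
    }

    /** True when the treatment rate exceeds control by more than the allowed relative increase. */
    static boolean violated(double controlRate, double treatmentRate, double maxRelativeIncrease) {
        return relativeChange(controlRate, treatmentRate) > maxRelativeIncrease;
    }
}
```

For example, a report rate moving from 1.0% to 1.2% is a 20% relative increase, which violates a 10% threshold but not a 25% one. Who sets that threshold is a governance question, not only a modeling one.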

34. Incrementality Tests

For recommendation modules/campaigns, incrementality test:

holdout group does not receive module/campaign
treatment group receives it
compare outcome

This estimates added value.

Useful for:

email recommendations,
push recommendations,
campaign modules,
cross-sell,
enterprise assistant panel.

Sometimes recommendation steals credit from organic behavior. Holdout reveals that.
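The holdout comparison reduces to a difference between two conversion rates. This sketch assumes random assignment to holdout and treatment and a binary conversion outcome, and uses a normal-approximation standard error; `IncrementalityTest` and `Result` are hypothetical names.

```java
/** Hypothetical sketch: incremental lift of a module from a randomized holdout. */
final class IncrementalityTest {
    private IncrementalityTest() {}

    /** Absolute lift in conversion rate, its standard error, and the resulting z-score. */
    record Result(double absoluteLift, double standardError, double zScore) {}

    static Result lift(long treatmentConversions, long treatmentUsers,
                       long holdoutConversions, long holdoutUsers) {
        if (treatmentUsers <= 0 || holdoutUsers <= 0) {
            throw new IllegalArgumentException("both groups need at least one user");
        }
        double pt = (double) treatmentConversions / treatmentUsers;
        double ph = (double) holdoutConversions / holdoutUsers;
        double lift = pt - ph;
        // Standard error of a difference between two independent proportions.
        double se = Math.sqrt(pt * (1 - pt) / treatmentUsers + ph * (1 - ph) / holdoutUsers);
        return new Result(lift, se, se == 0.0 ? 0.0 : lift / se);
    }
}
```

With 1200 conversions among 10000 treated users and 1000 among 10000 holdout users, the lift is 0.02. Click-based attribution could credit the module with up to 1200 conversions, while the holdout suggests only about 200 per 10000 users are incremental.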

35. Ghost Ads / Ghost Recommendations

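A technique: first determine whether a candidate would be shown, then randomly withhold it for a small holdout fraction while still logging the would-be exposure. Comparing shown and withheld units that were both eligible approximates the counterfactual.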

Use carefully because withholding useful recommendation hurts users.

Useful for estimating incremental effect of exposure.

Needs ethical/product review.

36. Causal Decision Logging

Log:

request context
candidate set
policy scores
chosen items
randomization/exploration
propensity
position
model/policy version
outcomes

Without decision logging, causal analysis is weak.

Decision log is production data asset.

37. Causal DAG Thinking

Draw causal graph.

Example:

User intent -> Recommendation shown -> Click -> Purchase
User intent -> Purchase
Position -> Click
Campaign -> Recommendation shown
Item quality -> Click/Purchase

DAG helps identify confounders.

If user intent causes both recommendation and purchase, naive attribution overestimates recommendation effect.
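The overestimate can be shown with made-up counts. The sketch compares a naive shown-versus-not-shown difference with an estimate stratified by user intent, assuming intent is observed and is the only confounder. `ConfounderAdjustment`, `Stratum`, and all numbers in the usage note are hypothetical.

```java
import java.util.List;

/** Hypothetical sketch: naive versus intent-stratified estimate of an exposure effect. */
final class ConfounderAdjustment {
    private ConfounderAdjustment() {}

    /** Counts for one intent level: users and purchases, split by whether the item was shown. */
    record Stratum(long shownUsers, long shownPurchases, long notShownUsers, long notShownPurchases) {
        long size() { return shownUsers + notShownUsers; }
    }

    /** Pools all users and compares purchase rates, ignoring intent. */
    static double naiveEffect(List<Stratum> strata) {
        long sn = 0, sp = 0, nn = 0, np = 0;
        for (Stratum s : strata) {
            sn += s.shownUsers(); sp += s.shownPurchases();
            nn += s.notShownUsers(); np += s.notShownPurchases();
        }
        return (double) sp / sn - (double) np / nn;
    }

    /** Averages within-stratum differences, weighted by each stratum's share of users. */
    static double stratifiedEffect(List<Stratum> strata) {
        long total = 0;
        for (Stratum s : strata) total += s.size();
        double effect = 0.0;
        for (Stratum s : strata) {
            double diff = (double) s.shownPurchases() / s.shownUsers()
                    - (double) s.notShownPurchases() / s.notShownUsers();
            effect += diff * s.size() / total;
        }
        return effect;
    }
}
```

Take a high-intent stratum `new Stratum(800, 400, 200, 90)` and a low-intent stratum `new Stratum(200, 10, 800, 32)`. The naive estimate is 0.41 - 0.122 = 0.288, while the stratified estimate is 0.5 * 0.05 + 0.5 * 0.01 = 0.03. Most of the naive effect comes from high-intent users being shown the item more often.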

38. Confounders

Common confounders:

user intent
item popularity
campaign exposure
position
price discount
inventory
seasonality
region
device
UI layout
seller quality
case severity

Causal analysis must account for them or use randomization.

39. Long-Term Value Model

A long-term value model predicts:

expected future value after showing candidate

Potential labels:

retention_7d
repeat_purchase_30d
satisfaction_survey
low_return
case_success

Use as auxiliary task or reranking signal.

Be cautious: long-term labels are sparse and confounded.

40. Reward Hacking

If system optimizes metric, it may exploit metric weakness.

Examples:

clickbait for click,
long videos for watch time,
discount-heavy items for conversion,
easy actions for acceptance,
repetitive safe items for low report.

Mitigation:

multi-objective metrics,
causal experiments,
guardrails,
human review,
metric audits,
user feedback.

41. Metric Review

Periodically ask:

Is this metric still aligned with product value?
Has model found loopholes?
What negative behavior is increasing?
Are user complaints changing?
Are long-term metrics stable?
Does metric harm a segment?

Metrics are part of system design.

42. Decision Governance

Long-term value trade-offs require governance.

Questions:

Who decides click vs retention trade-off?
Who approves report/hide guardrail thresholds?
Who owns marketplace health?
Who reviews enterprise action outcome?
When do we stop experiment?

ML team should not make all value trade-offs alone.

43. Failure Modes

43.1 Click Proxy Trap

Click increases, satisfaction drops.

43.2 Attribution Mistaken for Causation

Recommendation gets credit for inevitable purchase.

43.3 Delayed Negative Ignored

Returns/rework appear later.

43.4 No Propensity Logging

Counterfactual evaluation becomes unreliable.

43.5 Feedback Loop Amplifies Bias

Popular gets more popular.

43.6 Segment Harm Hidden

Global metric improves while a segment is harmed.

43.7 Experiment Too Short

Novelty effect mistaken for durable lift.

43.8 Post-Treatment Feature Leakage

Causal model biased.

43.9 Long-Term Metric Not Owned

No one acts on degradation.

43.10 Full RL Attempt Without Infrastructure

Complexity without safety.

44. Implementation Sketch: Decision Log

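import java.time.Instant;
import java.util.List;
import java.util.Map;

// One entry per ranking request: which versions decided, and over which candidates.
public record DecisionLog(
    String requestId,
    String slateId,
    String userId,
    String surface,
    Instant decisionTime,
    String modelVersion,
    String rankingPolicyVersion,
    String explorationPolicyVersion,
    List<LoggedCandidateDecision> candidates
) {}

// One entry per logged candidate, whether or not it was shown.
public record LoggedCandidateDecision(
    String itemId,
    boolean shown,
    int position,                    // slate position; only meaningful when shown is true
    double score,
    double propensity,               // probability that the logging policy showed this candidate
    List<String> sources,            // candidate sources that produced this item
    Map<String, Double> predictions  // named model outputs for this candidate
) {}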

This log supports experimentation, OPE, and debugging.

45. Implementation Sketch: Long-Term Label Spec

label: repeat_purchase_30d
entity: impression
positive:
  event: purchase
  condition:
    same_user: true
    within: 30d
negative:
  no_purchase_after_window: true
maturity:
  wait: 30d
exclusions:
  - user_deleted
  - item_unavailable_entire_window
  - fraud_flag

Every delayed label needs maturity and exclusions.

46. Minimal Production Causal Plan

Start with:

decision_logging:
  candidate_set: sampled
  final_slate: all
  scores: true
  model_policy_versions: true
  propensity_for_exploration: true

experimentation:
  user_level_ab_tests: true
  long_term_readout: 7d_14d_30d
  guardrails:
    - hide
    - report
    - return
    - retention
    - latency

long_term_labels:
  - return_refund_30d
  - repeat_purchase_30d
  - retention_14d
  - case_success_14d

analysis:
  segment_treatment_effects: true
  exposure_bias_monitoring: true
  offline_ope_screening: planned

Do this before attempting advanced RL.

47. Checklist: Causal Thinking and Long-Term Value Readiness

[ ] Primary objective and proxy labels are documented.
[ ] Long-term outcomes are defined.
[ ] Label maturity windows exist.
[ ] Decision logs include candidate set, scores, versions.
[ ] Exploration propensity is logged.
[ ] A/B tests use appropriate randomization unit.
[ ] Guardrails include negative and delayed outcomes.
[ ] Segment-level treatment effects are analyzed.
[ ] Exposure feedback loops are monitored.
[ ] Incrementality tests exist for major modules/campaigns.
[ ] Post-treatment features are avoided in causal analysis.
[ ] Long-term metric owner is defined.
[ ] Full RL is not attempted before logging/experimentation maturity.

48. Conclusion

Causal thinking helps a recommendation system escape the trap of proxy metrics and historical bias.

Key principles:

Recommendation is an intervention.
Observed feedback lacks counterfactuals.
Attribution is not causation.
Click/purchase/watch/action acceptance are proxies, not complete value.
Long-term outcomes and negative delayed outcomes matter.
Exploration and propensity logging create better causal data.
A/B tests estimate policy-level causal effects.
Segment analysis is mandatory.
Feedback loops can amplify popularity and exposure bias.
Do not jump to full RL before strong logging, exploration, and experimentation infrastructure.

In Part 050, we will cover LLM-Augmented Recommendation Systems: how LLMs can help with query understanding, semantic enrichment, explanation, metadata generation, agentic workflows, and enterprise recommendation, without replacing a sound retrieval/ranking foundation.
