Learn Build From Scratch Recommendations System Part 048 Contextual Bandits And Exploration

title: Build From Scratch Recommendations System - Part 048
description: Designing production-grade contextual bandits and exploration: explore-exploit trade-off, epsilon-greedy, Thompson sampling, UCB, contextual policies, propensity logging, off-policy evaluation, guardrails, cold-start, safety, and rollout.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 48
partTitle: Contextual Bandits and Exploration
tags:
  - recommendation-system
  - recsys
  - contextual-bandits
  - exploration
  - reinforcement-learning
  - experimentation
  - series
date: 2026-07-02

Part 048 — Contextual Bandits and Exploration

A recommendation system cannot only exploit what it already knows.

If the system always shows the candidates with the highest predicted score:

new items never get data,
new creators/sellers never get a chance,
the model does not learn new preferences,
users get trapped in a filter bubble,
the system never finds out whether alternative candidates are better,
popularity bias grows stronger,
offline evaluation becomes more biased.

The system needs exploration: trying candidates that are uncertain but safe, in order to learn.

However, poor exploration feels random, erodes trust, and can violate safety.

Contextual bandits are a framework for managing the explore-exploit trade-off in recommendation: choose an action/item based on context, receive a reward, and update the policy.

This part covers production-grade contextual bandits and exploration: epsilon-greedy, UCB, Thompson sampling, contextual policies, propensity logging, off-policy evaluation, guardrails, cold-start, safety, and rollout.

1. Mental Model: Explore vs Exploit

Exploit:

show what current model thinks is best

Explore:

show safe uncertain alternatives to learn

Trade-off:

exploit improves short-term metric,
explore gathers information for future improvement.

Without exploration, the system learns only from data generated by its past policy.

With too much exploration, user experience suffers.

Goal:

controlled exploration with measurable learning and bounded risk

2. Bandit Setup

Contextual bandit has:

context x
actions/items a
policy pi(a | x)
reward r

In recommendation:

context: user/session/surface/request,
action: candidate item/action/slate slot,
reward: click/purchase/watch/hide/satisfaction,
policy: ranking/reranking decision.

Unlike full reinforcement learning, a bandit usually treats each decision independently and focuses on immediate reward; it does not model how an action changes future states.

This is simpler and often practical.

3. Why Not Pure A/B Test?

A/B test compares fixed policies.

Bandit adapts allocation based on observed reward.

A/B:

50% policy A, 50% policy B

Bandit:

allocate more traffic to better action while still exploring uncertainty

Use bandits when:

there are many alternatives,
candidates are cold-start,
an exploration budget is available,
learning must be continuous,
opportunity cost matters.

Use A/B for major product/model changes with guardrails.

4. Exploration Use Cases

Recommendation exploration helps:

new item cold-start
new creator/seller
long-tail discovery
candidate source evaluation
new ranking policy
content category expansion
personal novelty tuning
enterprise action usefulness
campaign creative testing

Exploration can happen at:

candidate generation,
ranking score adjustment,
reranking slots,
module selection,
email/push timing.

5. Exploration Is Not Random Garbage

Bad exploration:

show random item from catalog

Good exploration:

show candidate that is eligible, safe, quality-approved, contextually plausible, and uncertain

Exploration candidate must pass:

hard eligibility,
policy/safety,
relevance floor,
quality floor,
frequency cap,
user suppression,
tenant/permission constraints.

Exploration happens inside safe candidate universe.

6. Exploration Budget

Define budget.

Examples:

max 5% traffic to exploration
max 1 exploration slot per slate
max 1000 impressions/day for new item
max 2 uncertain actions per tenant/day

Budget controls risk.

Budget can vary by surface:

home feed: more exploration,
checkout: less,
enterprise high-risk action: very little or human-reviewed,
email/push: conservative.

7. Epsilon-Greedy

Simple policy:

with probability epsilon:
    choose exploration candidate
else:
    choose best known candidate

Example:

epsilon = 0.05

Pros:

simple,
easy to implement,
useful baseline.

Cons:

explores uniformly,
inefficient,
can waste exposure on weak candidates,
not context-aware unless exploration pool is curated.

Use as starting point with guardrails.

8. Epsilon-Greedy in Reranking

Slate policy:

position 5 is exploration-eligible
with probability epsilon, choose candidate from exploration pool
otherwise use normal reranker

Need log:

epsilon
eligible exploration pool
chosen candidate
propensity
random seed

Propensity for chosen exploration action:

pi(a | x)

This is critical for off-policy evaluation.
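
The propensity for an epsilon-greedy slot can be sketched as follows. This is a minimal illustration, assuming the reranker's pick is deterministic and the exploration candidate is drawn uniformly from the pool; the class and method names are hypothetical and not part of the series code.

```java
import java.util.Set;

/** Hypothetical helper: probability that an epsilon-greedy slot shows a given item. */
final class EpsilonSlotPropensity {
    private EpsilonSlotPropensity() {}

    /**
     * pi(a | x) = (1 - epsilon) * [a is the reranker's pick]
     *           + epsilon * [a is in the pool] / |pool|
     * Assumes uniform sampling from the pool and a deterministic reranker.
     */
    static double propensity(String itemId, String greedyItemId, Set<String> pool, double epsilon) {
        if (epsilon < 0.0 || epsilon > 1.0) {
            throw new IllegalArgumentException("epsilon must be in [0, 1]");
        }
        if (pool.isEmpty()) {
            // With an empty pool the slot always falls back to the reranker.
            return itemId.equals(greedyItemId) ? 1.0 : 0.0;
        }
        double exploit = itemId.equals(greedyItemId) ? 1.0 - epsilon : 0.0;
        double explore = pool.contains(itemId) ? epsilon / pool.size() : 0.0;
        // An item that is both the reranker's pick and in the pool gets both terms.
        return exploit + explore;
    }
}
```

With epsilon = 0.05 and a pool of 80 items, an item that can only appear through exploration has propensity 0.05 / 80 = 0.000625.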

9. Decaying Epsilon

Epsilon can decay as evidence grows.

Example:

new item:
  impressions < 100: epsilon high within cap
  impressions 100-1000: epsilon lower
  enough evidence: normal ranking

This avoids over-exploring mature candidates.

Also useful per item/creator/source.
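
The tiers above could be expressed as a simple step schedule. This is only a sketch: the impression thresholds come from the example, while the epsilon values 0.10 and 0.03 and the class name are assumptions chosen for illustration.

```java
/** Hypothetical step schedule: exploration probability shrinks as an item gathers impressions. */
final class DecayingEpsilon {
    private DecayingEpsilon() {}

    // Illustrative values only; tune them per surface and keep them within the exploration cap.
    static final double HIGH_EPSILON = 0.10;
    static final double LOW_EPSILON = 0.03;

    static double epsilonFor(long itemImpressions) {
        if (itemImpressions < 0) {
            throw new IllegalArgumentException("impressions must be non-negative");
        }
        if (itemImpressions < 100) {
            return HIGH_EPSILON; // very little evidence yet
        }
        if (itemImpressions <= 1000) {
            return LOW_EPSILON;  // some evidence, explore less
        }
        return 0.0;              // enough evidence: the item competes in normal ranking
    }
}
```

A smooth schedule (for example, one proportional to 1 / sqrt(impressions)) would also work; the step form is shown because it is easy to audit and to log.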

10. Upper Confidence Bound

UCB chooses actions based on:

estimated_reward + uncertainty_bonus

If action has high uncertainty, it gets exploration boost.

Concept:

ucb_score = mean_reward + alpha * uncertainty

Pros:

explores uncertain promising actions,
less random than epsilon.

Cons:

needs uncertainty estimate,
can be sensitive to reward noise,
delayed rewards complicate updates.

11. UCB Example

For new item:

estimated_ctr = smoothed_ctr
uncertainty = sqrt(log(total_impressions) / item_impressions)

Score:

ucb = estimated_ctr + alpha * uncertainty

New item with few impressions gets bonus.

As impressions increase, uncertainty shrinks.

Use quality/relevance floors and caps.

12. Thompson Sampling

Thompson sampling draws a sample from each action's posterior reward distribution and acts on the sampled values.

For binary reward:

CTR ~ Beta(clicks + alpha, non_clicks + beta)
sample ctr
rank by sampled ctr

Pros:

elegant exploration,
naturally balances uncertainty,
works well for simple reward.

Cons:

modeling assumptions,
context handling more complex,
delayed/multi-objective reward harder.

Good for campaign/creative/new item exploration with simple rewards.
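
A minimal sketch of the Beta-Bernoulli case follows. The Java standard library has no Beta sampler, so this example builds one from two Gamma draws using the Marsaglia-Tsang method; the class name, the seeded constructor, and the array-based arm representation are assumptions made for illustration.

```java
import java.util.Random;

/** Sketch of Beta-Bernoulli Thompson sampling for binary (click-style) rewards. */
final class BetaThompsonSampler {
    private final Random rng;
    private final double priorAlpha;
    private final double priorBeta;

    BetaThompsonSampler(long seed, double priorAlpha, double priorBeta) {
        if (priorAlpha <= 0.0 || priorBeta <= 0.0) {
            throw new IllegalArgumentException("prior parameters must be positive");
        }
        this.rng = new Random(seed);
        this.priorAlpha = priorAlpha;
        this.priorBeta = priorBeta;
    }

    /** Draws one CTR sample from Beta(clicks + alpha, nonClicks + beta). */
    double sampleCtr(long clicks, long nonClicks) {
        // A Beta(a, b) draw is X / (X + Y) with X ~ Gamma(a) and Y ~ Gamma(b).
        double x = sampleGamma(clicks + priorAlpha);
        double y = sampleGamma(nonClicks + priorBeta);
        return x / (x + y);
    }

    /** Samples a CTR for every arm and returns the index of the largest sample. */
    int chooseArm(long[] clicks, long[] nonClicks) {
        int best = 0;
        double bestSample = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < clicks.length; i++) {
            double sample = sampleCtr(clicks[i], nonClicks[i]);
            if (sample > bestSample) {
                bestSample = sample;
                best = i;
            }
        }
        return best;
    }

    /** Marsaglia-Tsang sampler for Gamma(shape, 1). */
    private double sampleGamma(double shape) {
        if (shape < 1.0) {
            // Boosting trick: Gamma(a) = Gamma(a + 1) * U^(1/a).
            return sampleGamma(shape + 1.0) * Math.pow(rng.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x;
            double v;
            do {
                x = rng.nextGaussian();
                v = 1.0 + c * x;
            } while (v <= 0.0);
            v = v * v * v;
            double u = rng.nextDouble();
            if (u < 1.0 - 0.0331 * x * x * x * x) {
                return d * v;
            }
            if (Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) {
                return d * v;
            }
        }
    }
}
```

An arm with few impressions has a wide posterior, so its samples occasionally come out on top and it gets explored; as evidence accumulates the posterior narrows and the arm is chosen roughly in proportion to how likely it is to be the best.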

13. Contextual Bandits

Non-contextual bandit ignores user/request context.

Contextual bandit learns:

which action works for which context

Example:

new camera accessory works for users currently viewing cameras, not everyone

Policy uses features:

user/session/context/item features

Common approaches:

linear contextual bandit,
neural contextual bandit,
bandit on top of ranker scores,
contextual exploration bonus.

Production often starts with contextual heuristic + propensity logging.
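
As one illustration of the "linear contextual bandit" option, here is a sketch of a single LinUCB-style arm: ridge regression on context features plus a confidence bonus. It assumes one model per action, a ridge penalty of 1, and a dense feature vector; the class name and method names are hypothetical. The inverse matrix is maintained incrementally with the Sherman-Morrison identity so no matrix inversion is needed at serving time.

```java
/** Sketch of one LinUCB arm: ridge regression on context features plus a confidence bonus. */
final class LinUcbArm {
    private final double alpha;
    private final double[][] aInv; // inverse of A = I + sum(x x^T)
    private final double[] b;      // sum(reward * x)

    LinUcbArm(int dim, double alpha) {
        this.alpha = alpha;
        this.aInv = new double[dim][dim];
        this.b = new double[dim];
        for (int i = 0; i < dim; i++) {
            aInv[i][i] = 1.0; // identity start, equivalent to a ridge penalty of 1
        }
    }

    /** Estimated reward theta^T x, with theta = A^-1 b. */
    double expectedReward(double[] x) {
        return dot(multiply(aInv, b), x);
    }

    /** Confidence width sqrt(x^T A^-1 x); large for contexts the arm has rarely seen. */
    double uncertainty(double[] x) {
        return Math.sqrt(Math.max(0.0, dot(x, multiply(aInv, x))));
    }

    double score(double[] x) {
        return expectedReward(x) + alpha * uncertainty(x);
    }

    /** Sherman-Morrison rank-one update of A^-1, then b += reward * x. */
    void update(double[] x, double reward) {
        double[] aInvX = multiply(aInv, x);
        double denom = 1.0 + dot(x, aInvX);
        for (int i = 0; i < b.length; i++) {
            for (int j = 0; j < b.length; j++) {
                aInv[i][j] -= aInvX[i] * aInvX[j] / denom;
            }
            b[i] += reward * x[i];
        }
    }

    private static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = dot(m[i], v);
        }
        return out;
    }

    private static double dot(double[] u, double[] v) {
        double sum = 0.0;
        for (int i = 0; i < u.length; i++) {
            sum += u[i] * v[i];
        }
        return sum;
    }
}
```

In the camera accessory example, a feature such as "user is currently viewing cameras" would receive a positive weight after a few rewarded impressions, while contexts without that feature keep a large uncertainty term until they are explored.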

14. Reward Definition

Reward must match objective.

Examples:

click = 1
purchase = 5
hide = -3
report = -50
watch completion = 2
case action success = 10
rework = -5

But delayed/multi-objective rewards are hard.

Start with simple reward and guardrails.

Do not optimize exploration on clicks alone when clickbait risk is high.

15. Delayed Rewards

Rewards can arrive later:

purchase after days
return after weeks
case resolution after days

Bandit updates may use:

fast proxy rewards,
delayed correction,
reward maturity windows,
separate short-term and long-term metrics.

Avoid treating missing delayed reward as negative before maturity.
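
A reward maturity window could be sketched as a labeler that returns "pending" until the window has elapsed. The class, the three-state label, and the 7-day window used in the usage note are assumptions for illustration only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical labeler that withholds a negative label until the reward window has matured. */
final class DelayedRewardLabeler {
    enum Label { POSITIVE, NEGATIVE, PENDING }

    private final Duration maturityWindow;

    DelayedRewardLabeler(Duration maturityWindow) {
        this.maturityWindow = maturityWindow;
    }

    Label label(Instant impressionTime, Optional<Instant> conversionTime, Instant now) {
        Instant deadline = impressionTime.plus(maturityWindow);
        // A conversion inside the window is a positive regardless of when we look.
        if (conversionTime.isPresent() && !conversionTime.get().isAfter(deadline)) {
            return Label.POSITIVE;
        }
        // No conversion yet, but the window is still open: do not count it as a negative.
        if (now.isBefore(deadline)) {
            return Label.PENDING;
        }
        return Label.NEGATIVE;
    }
}
```

With a window of `Duration.ofDays(7)`, an impression from two days ago without a purchase stays `PENDING` and is excluded from bandit updates, so a slow-converting item is not penalized early.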

16. Negative Rewards

Exploration must consider negative feedback.

Reward example:

click: +1
purchase: +5
hide: -3
report: -50

If item gets high clicks but high reports, exploration should stop.

Guardrails can override reward.

For severe safety, use hard policy systems.

17. Propensity Logging

Propensity is the probability with which the serving policy chose the displayed action.

For each exploration decision, log:

pi(a | x)

Example event:

{
  "request_id": "req_001",
  "slot": 5,
  "policy_id": "epsilon-new-item-v2",
  "candidate_id": "item_123",
  "epsilon": 0.05,
  "propensity": 0.000625,
  "eligible_actions_count": 80,
  "random_seed": "abc",
  "exploration": true
}

Without propensity logging, off-policy evaluation becomes unreliable.

18. Why Propensity Matters

Historical data is biased by policy.

An item that was shown because the policy gave it high probability is over-represented in the logs, so its observed reward must be weighted differently from a reward observed under rare exploration.

Inverse propensity weighting:

reward / propensity

lets you estimate alternative policy performance.

If the propensity is missing or wrong, the estimates are biased.

19. Off-Policy Evaluation

Off-policy evaluation (OPE) estimates the performance of a new policy using data logged under an old policy.

Basic IPS:

IPS = mean(reward_i * new_policy_prob_i / logging_policy_prob_i)

Challenges:

high variance when propensity small,
support mismatch,
delayed rewards,
slate interactions,
hidden confounders.

Use OPE carefully.

Part 065 will go deeper into counterfactual/off-policy evaluation.
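
The basic IPS formula, together with two common variance-reduction variants (weight clipping and self-normalized IPS), might look like the sketch below. The `LoggedEvent` record and its field names are hypothetical; a real log would carry the propensity recorded at serving time and the probability the new policy assigns to the same action.

```java
import java.util.List;

/** Sketch of IPS-style off-policy estimators over logged bandit feedback. */
final class OffPolicyEstimators {
    private OffPolicyEstimators() {}

    /** One logged decision: observed reward, logging propensity, and new-policy probability. */
    record LoggedEvent(double reward, double loggingPropensity, double targetProbability) {
        LoggedEvent {
            if (loggingPropensity <= 0.0 || loggingPropensity > 1.0) {
                throw new IllegalArgumentException("logging propensity must be in (0, 1]");
            }
        }

        double weight() {
            return targetProbability / loggingPropensity;
        }
    }

    /** IPS = mean(reward_i * new_policy_prob_i / logging_policy_prob_i). */
    static double ips(List<LoggedEvent> events) {
        return clippedIps(events, Double.POSITIVE_INFINITY);
    }

    /** Caps each importance weight; trades a little bias for lower variance. */
    static double clippedIps(List<LoggedEvent> events, double maxWeight) {
        double sum = 0.0;
        for (LoggedEvent e : events) {
            sum += e.reward() * Math.min(e.weight(), maxWeight);
        }
        return events.isEmpty() ? 0.0 : sum / events.size();
    }

    /** Self-normalized IPS: divides by the sum of weights instead of the event count. */
    static double snips(List<LoggedEvent> events) {
        double weightedReward = 0.0;
        double totalWeight = 0.0;
        for (LoggedEvent e : events) {
            weightedReward += e.reward() * e.weight();
            totalWeight += e.weight();
        }
        return totalWeight == 0.0 ? 0.0 : weightedReward / totalWeight;
    }
}
```

Small logging propensities produce large weights, which is where the high variance in the list above comes from; clipping and self-normalization are two common ways to tame it.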

20. Support / Overlap

A new policy can only be evaluated on actions that the old policy chose with nonzero probability.

If the old policy never showed tail items, logged data cannot evaluate a tail-heavy policy well.

Exploration ensures support.

This is why randomization/propensity logging matters.

21. Exploration Slots

Implement exploration in specific slots.

Example:

exploration:
  eligible_positions: [5, 10]
  max_slots_per_slate: 1
  relevance_floor: 0.01
  quality_floor: 0.8

Top slots remain high confidence.

Lower/mid slots can explore with less risk.

22. Exploration Pool

Candidate pool for exploration should be curated.

Sources:

new_item_candidates
long_tail_quality_candidates
adjacent_topic_candidates
underexposed_creator_candidates
uncertain_high_potential_candidates
new_action_candidates

Filter:

eligible,
safe,
quality,
relevance,
not suppressed,
within exposure cap.

Do not explore from whole catalog blindly.

23. Exploration Priority

Exploration priority can use uncertainty and value.

exploration_priority =
  uncertainty
  * relevance_estimate
  * quality
  * learning_value

Learning value examples:

new item needs impressions,
candidate source needs evaluation,
uncertain segment,
underexposed but high quality.

Rank exploration candidates by priority before sampling.
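
The priority formula above can be sketched as a small ranking helper. The record fields mirror the four factors in the formula; how each factor is scaled (here assumed to be in [0, 1]) and the class name are assumptions.

```java
import java.util.List;

/** Sketch: rank an exploration pool by uncertainty * relevance * quality * learning value. */
final class ExplorationPriority {
    private ExplorationPriority() {}

    /** Hypothetical candidate view; all factors are assumed to be scaled to [0, 1]. */
    record ExplorationCandidate(
        String itemId,
        double uncertainty,
        double relevanceEstimate,
        double quality,
        double learningValue
    ) {
        double priority() {
            // Multiplicative form: a near-zero factor (e.g. relevance) suppresses the candidate.
            return uncertainty * relevanceEstimate * quality * learningValue;
        }
    }

    /** Returns a new list sorted by descending priority; the input list is not modified. */
    static List<ExplorationCandidate> rank(List<ExplorationCandidate> pool) {
        return pool.stream()
            .sorted((a, b) -> Double.compare(b.priority(), a.priority()))
            .toList();
    }
}
```

Because the factors multiply, a highly uncertain item with a very low relevance estimate still ranks low, which keeps exploration contextually plausible.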

24. Exposure Caps for Exploration

Caps:

max exploration impressions per item/day
max exploration impressions per user/item
max exploration slots per user/session
max exploration per creator/day

Exploration should not spam.

If candidate fails guardrails, stop.

25. Safety Guardrails

Exploration guardrails:

report rate
hide rate
return rate
complaint rate
policy violation
low quality threshold
creator trust
seller fraud
enterprise rework rate

If guardrail triggers:

disable exploration candidate/source

Exploration is controlled risk, not unrestricted testing.

26. Bandits and Ranking

Bandit can be applied in several ways.

Candidate-Level

Choose which item gets exploration slot.

Source-Level

Choose which candidate source gets quota.

Module-Level

Choose which module to show.

Policy-Level

Choose reranking policy variant.

Creative-Level

Choose title/thumbnail/copy.

Start with candidate/source/module exploration before full slate bandit.

27. Source-Level Bandit

Candidate sources compete:

two_tower
content_based
trending
new_item_exploration
graph
editorial

Bandit allocates source quotas based on reward by context.

Example:

for new users, trending performs well
for active users, session source performs well

Contextual source bandit can improve candidate mix.

28. Module-Level Bandit

Home page may choose modules:

Because you viewed X
Trending
New arrivals
Continue watching
Recommended for your role

Bandit selects module order/presence.

Reward:

module click,
downstream conversion,
session engagement,
negative feedback.

Need slot/module propensity logging.

29. Slate-Level Bandits

Slate-level bandit treats whole slate as action.

This is hard because the action space is combinatorially large.

Approaches:

choose among limited slate policies,
reranker parameter bandit,
module composition bandit,
constrained slate sampling.

Full slate exploration is advanced.

Start with controlled slot/source/module exploration.

30. Exploration and Ranking Model Training

Exploration creates better training data.

Log:

exploration flag
policy id
propensity
candidate source
position
context
reward

Training can use:

exploration data to reduce bias,
propensity weighting,
hard negatives,
cold-start labels.

But exploration distribution may differ from normal serving. Include policy features.

31. Exploration and User Trust

Exploration can hurt trust if irrelevant.

Controls:

low number of slots,
not top position initially,
relevance floor,
quality floor,
user negative feedback stop,
personalized exploration affinity,
surface-specific budget.

Users should not feel system is random.

32. Personalizing Exploration

Some users like discovery; some dislike it.

Features:

novel_item_click_rate
long_tail_engagement
category_breadth
hide_rate_on_novel_items
exploration_affinity

Policy:

if user exploration affinity high:
    allow more novelty
else:
    keep exploration conservative

Be careful not to deny discovery forever to narrow-profile users.

33. Cold-Start Item Bandit

For new item:

Assign prior based on category/creator/metadata.
Allocate small exploration budget.
Observe reward/negative feedback.
Update posterior/score.
Increase exposure if good.
Stop if bad.

Metrics:

time_to_first_impression
time_to_confidence
exploration_success_rate
negative_rate

This handles item cold-start better than pure popularity ranking, which gives new items little or no exposure.

34. Creator/Seller Bandit

For new creator/seller:

quality gate,
trust check,
category relevance,
small exploration,
monitor report/return/complaint,
ramp if good.

Protect against abuse:

bot filtering,
metadata spam detection,
exposure caps,
trust tiers.

35. Enterprise Exploration

Enterprise exploration must be conservative.

Use cases:

recommend potentially useful article,
suggest optional action,
test new workflow hint.

Constraints:

only valid actions,
low-risk context,
human can ignore,
audit,
no policy violation,
no high-stakes automation without approval.

Reward:

article useful
action completed
case progressed
no rework

Exploration in enterprise is often “assistive suggestion,” not autonomous decision.

36. Exploration Policy Spec

Example:

exploration_policy: new-item-epsilon-v3
scope:
  surface: home_feed
  region: ID
budget:
  max_slots_per_slate: 1
  epsilon: 0.05
eligible_positions:
  - 5
  - 10
candidate_pool:
  source: new_item_exploration
  requirements:
    min_quality: 0.8
    min_relevance: 0.01
    policy_state: approved
caps:
  max_item_exploration_impressions_day: 1000
guardrails:
  max_hide_rate: 0.10
  max_report_rate: 0.005
logging:
  propensity_required: true

Policy is versioned and auditable.

37. Exploration Diagnostics

Log per request:

exploration_policy_id
eligible_pool_size
exploration_decision
chosen_candidate
propensity
random_seed
relevance_floor_pass
quality_floor_pass
cap_state

Metrics:

exploration_slot_fill_rate
exploration_reward
exploration_negative_rate
guardrail_stop_count
propensity_missing_count

38. Random Seed and Reproducibility

Use deterministic seed.

seed = hash(request_id, policy_id, slot)

This helps replay.

If randomness cannot be reproduced, debugging and OPE suffer.
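
One way to derive such a seed is to hash the three identifiers with a stable digest. The sketch below uses SHA-256 from the standard library and takes the first 8 bytes as a `long`; the class name, the `|` separator, and the choice of digest are assumptions, and any stable hash would serve.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: derive a reproducible RNG seed from request, policy, and slot. */
final class ExplorationSeed {
    private ExplorationSeed() {}

    static long seedFor(String requestId, String policyId, int slot) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // The separator keeps ("ab", "c") and ("a", "bc") from colliding.
            String key = requestId + "|" + policyId + "|" + slot;
            byte[] hash = digest.digest(key.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the digest as the seed.
            return ByteBuffer.wrap(hash).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Replaying a request then means recomputing the seed from the logged identifiers and constructing `new Random(seed)` again, which yields the same sequence of draws as at serving time.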

39. Exploration Stop Conditions

Stop candidate/source if:

report rate > threshold
hide rate > threshold
conversion below floor after enough exposure
quality incident
policy state changes
exposure cap reached
confidence enough and candidate moves to normal serving

Exploration should have exit criteria.
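
The rate-based conditions above could be checked with a rule like the following sketch. The thresholds, the minimum-impressions requirement before rates are trusted, and all names are assumptions; severe safety or policy incidents are assumed to be handled by hard policy systems outside this rule.

```java
/** Sketch of exit criteria for one exploration candidate or source. */
final class ExplorationStopRule {
    enum Decision { CONTINUE, STOP_GUARDRAIL, STOP_LOW_CONVERSION, STOP_CAP_REACHED }

    /** Hypothetical aggregate counters for the candidate's exploration traffic. */
    record Stats(long impressions, long hides, long reports, long conversions) {}

    private final double maxHideRate;
    private final double maxReportRate;
    private final double minConversionRate;
    private final long minImpressions; // rates are only trusted after this much exposure
    private final long impressionCap;

    ExplorationStopRule(double maxHideRate, double maxReportRate, double minConversionRate,
                        long minImpressions, long impressionCap) {
        if (minImpressions < 1) {
            throw new IllegalArgumentException("minImpressions must be at least 1");
        }
        this.maxHideRate = maxHideRate;
        this.maxReportRate = maxReportRate;
        this.minConversionRate = minConversionRate;
        this.minImpressions = minImpressions;
        this.impressionCap = impressionCap;
    }

    Decision evaluate(Stats s) {
        if (s.impressions() >= minImpressions) {
            double n = s.impressions();
            // Negative-feedback guardrails take precedence over everything else.
            if (s.reports() / n > maxReportRate || s.hides() / n > maxHideRate) {
                return Decision.STOP_GUARDRAIL;
            }
            if (s.conversions() / n < minConversionRate) {
                return Decision.STOP_LOW_CONVERSION;
            }
        }
        if (s.impressions() >= impressionCap) {
            return Decision.STOP_CAP_REACHED;
        }
        return Decision.CONTINUE;
    }
}
```

Returning a reason code rather than a boolean makes it easier to populate a metric such as `guardrail_stop_count` and to audit why a candidate left exploration.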

40. Learning Updates

Bandit update cadence:

online immediate,
nearline every few minutes,
batch daily.

Online updates are powerful but risk reacting to noise.

Start with nearline/batch updates plus guardrails.

For high-risk systems, human-reviewed updates may be needed.

41. Delayed and Non-Stationary Rewards

User behavior changes.

Item quality changes.

Seasonality matters.

Bandit should handle non-stationarity:

decay old observations,
use sliding windows,
reset on major item change,
segment by context,
monitor drift.

If item thumbnail/title changes, old reward may not apply.
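
Decaying old observations can be sketched with exponentially weighted statistics. The per-observation decay factor, the `reset` hook for major item changes, and the class name are assumptions made for illustration.

```java
/** Sketch: reward mean in which each new observation discounts all older ones. */
final class DecayedRewardStats {
    private final double decay; // in (0, 1]; 1.0 means no forgetting
    private double weightedReward;
    private double weight;

    DecayedRewardStats(double decay) {
        if (decay <= 0.0 || decay > 1.0) {
            throw new IllegalArgumentException("decay must be in (0, 1]");
        }
        this.decay = decay;
    }

    void observe(double reward) {
        // Older evidence shrinks by the decay factor before the new reward is added.
        weightedReward = decay * weightedReward + reward;
        weight = decay * weight + 1.0;
    }

    /** Decayed mean reward; 0.0 when nothing has been observed. */
    double mean() {
        return weight == 0.0 ? 0.0 : weightedReward / weight;
    }

    /** Effective number of observations still influencing the mean. */
    double effectiveSampleSize() {
        return weight;
    }

    /** Call on a major item change (for example a new thumbnail or title). */
    void reset() {
        weightedReward = 0.0;
        weight = 0.0;
    }
}
```

The effective sample size can feed the uncertainty term of a UCB-style score, so an item whose evidence has aged out naturally becomes eligible for exploration again.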

42. Exploration with Multiple Objectives

Reward can combine:

click
purchase
hide
report
satisfaction

But bandit algorithms often assume scalar reward.

Use scalar reward carefully:

reward = click + 5*purchase - 3*hide - 50*report

or optimize primary with guardrails.

For safety/negative feedback, guardrails are safer than scalar reward alone.
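
The scalar reward formula above translates directly into code. This sketch uses the weights from the example; the `Feedback` record and class name are hypothetical, and the weights themselves are illustrative rather than recommended values.

```java
/** Sketch: collapse per-impression feedback into the scalar reward from the example. */
final class ScalarReward {
    private ScalarReward() {}

    /** Hypothetical per-impression feedback flags. */
    record Feedback(boolean click, boolean purchase, boolean hide, boolean report) {}

    /** reward = click + 5*purchase - 3*hide - 50*report */
    static double reward(Feedback f) {
        double value = 0.0;
        if (f.click()) {
            value += 1.0;
        }
        if (f.purchase()) {
            value += 5.0;
        }
        if (f.hide()) {
            value -= 3.0;
        }
        if (f.report()) {
            value -= 50.0;
        }
        return value;
    }
}
```

A single report outweighs many clicks in this scheme, but a popular item can still average out to a positive reward, which is why rate-based guardrails are kept as a separate check.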

43. Explore-Exploit in Reranking

Reranker can choose:

best utility candidate
vs
uncertain candidate with exploration bonus

Adjusted score:

bandit_score =
  expected_reward
  + exploration_bonus

This is UCB-style.

Then slate constraints apply.

44. Contextual Exploration Bonus

Example:

bonus =
  alpha * uncertainty(candidate, context)
  * relevance_gate
  * quality_gate

Candidate-specific uncertainty:

low impressions
high model disagreement
high prediction variance
new source
new segment

Context-specific:

new user segment
new region
new tenant

45. Model Uncertainty Approximations

Practical uncertainty signals:

item_impression_count
creator_impression_count
calibration_bucket_support
ensemble_prediction_variance
dropout variance
distance from training distribution
feature_missing_count
new_item_flag
source_recently_launched

You don't need perfect Bayesian uncertainty to start.

46. Exploration and Causal Data

Exploration data enables:

less biased evaluation,
better negative sampling,
counterfactual learning,
causal effect estimation,
improved cold-start.

But only if:

randomization is logged,
propensity is known,
candidate set is known,
reward is logged,
policy version is logged.

This is long-term infrastructure value.

47. Common Failure Modes

47.1 No Exploration

Cold-start and popularity bias persist.

47.2 Random Exploration

User sees irrelevant items.

47.3 No Propensity Logging

Cannot evaluate or learn correctly.

47.4 Exploration Too Aggressive

Trust/metrics drop.

47.5 Unsafe Candidate Explored

Policy incident.

47.6 Propensity Wrong

Off-policy estimates invalid.

47.7 Delayed Reward Mishandled

Good candidates killed too early.

47.8 Negative Feedback Ignored

Exploration harms users.

47.9 Exploration Budget Not Capped

One item/source overexposed.

47.10 No Stop Conditions

Bad exploration continues.

48. Implementation Sketch: Exploration Decision

public record ExplorationDecision(
    boolean explore,
    String policyId,
    String candidateId,
    double propensity,
    long randomSeed,
    String reason
) {}

Epsilon policy:

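import java.util.List;
import java.util.Random;

/** Epsilon-greedy decision for a single exploration-eligible slot. */
public final class EpsilonExplorationPolicy {
    private static final String POLICY_ID = "epsilon-v1";

    private final double epsilon;

    public EpsilonExplorationPolicy(double epsilon) {
        if (epsilon < 0.0 || epsilon > 1.0) {
            throw new IllegalArgumentException("epsilon must be in [0, 1]");
        }
        this.epsilon = epsilon;
    }

    public ExplorationDecision decide(
        RequestContext context,
        List<Candidate> explorationPool,
        long seed
    ) {
        // Nothing to explore: the slot deterministically falls back to the normal reranker.
        if (explorationPool.isEmpty()) {
            return new ExplorationDecision(false, POLICY_ID, null, 1.0, seed, "empty_pool");
        }

        // Seeded RNG so the decision can be replayed from the logged seed.
        Random random = new Random(seed);
        boolean explore = random.nextDouble() < epsilon;

        if (!explore) {
            // Probability of taking the exploit branch; the reranker picks the actual item.
            return new ExplorationDecision(false, POLICY_ID, null, 1.0 - epsilon, seed, "exploit");
        }

        // Uniform draw from the curated pool.
        int index = random.nextInt(explorationPool.size());
        Candidate chosen = explorationPool.get(index);

        // P(explore) * P(this item | explore).
        double propensity = epsilon / explorationPool.size();

        return new ExplorationDecision(true, POLICY_ID, chosen.itemId(), propensity, seed, "epsilon_sample");
    }
}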

Production should account for slot probability, eligibility, and candidate pool construction.

49. Implementation Sketch: UCB Score

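/** UCB-style score gated by relevance and quality. */
public final class UcbScorer {
    private final double alpha;

    public UcbScorer(double alpha) {
        this.alpha = alpha;
    }

    public double score(BanditStats stats, double relevanceEstimate, double quality) {
        double mean = stats.smoothedRewardMean();
        // Clamp the counts so the log stays positive and the division is safe for unseen items.
        double uncertainty = Math.sqrt(
            Math.log(Math.max(stats.totalExposure(), 2.0))
            / Math.max(stats.itemExposure(), 1.0)
        );

        // Low relevance or quality scales down both the mean and the exploration bonus.
        return relevanceEstimate * quality * (mean + alpha * uncertainty);
    }
}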

Use caps and guardrails.

50. Minimal Production Exploration Plan

Start with:

exploration:
  type: epsilon_greedy_slot
  surfaces:
    - home_feed
  eligible_positions:
    - 5
  epsilon: 0.03
candidate_pool:
  - new_item_quality_candidates
  - long_tail_quality_candidates
requirements:
  policy_approved: true
  min_quality: 0.8
  min_relevance: 0.01
caps:
  max_exploration_slots_per_slate: 1
  max_item_exploration_impressions_day: 1000
logging:
  propensity: required
  policy_id: required
  random_seed: required
guardrails:
  hide_rate_max: 0.10
  report_rate_max: 0.005
evaluation:
  offline_ope: planned
  ab_test: required

Then evolve to UCB/Thompson/contextual policies.

51. Checklist: Contextual Bandits and Exploration Readiness

[ ] Exploration objective is defined.
[ ] Exploration surface/slot scope is defined.
[ ] Exploration budget is capped.
[ ] Exploration candidate pool is curated.
[ ] Eligibility/safety/quality/relevance floors exist.
[ ] User suppression/frequency caps apply to exploration.
[ ] Policy ID is logged.
[ ] Propensity is logged.
[ ] Random seed is logged.
[ ] Reward definition is explicit.
[ ] Delayed reward maturity is handled.
[ ] Negative feedback guardrails exist.
[ ] Stop conditions exist.
[ ] Offline simulation/OPE plan exists.
[ ] A/B testing plan exists.
[ ] Exploration metrics are monitored.
[ ] Enterprise/high-risk exploration has additional controls.

52. Conclusion

Contextual bandits and exploration help a recommendation system break out of the bias of its past and learn about candidates that are still uncertain.

Key principles:

Exploitation uses known best; exploration learns uncertain alternatives.
Exploration must be controlled, not random garbage.
Candidate universe must remain safe and eligible.
Exploration budget and slots should be explicit.
Epsilon-greedy is a good starting baseline.
UCB and Thompson sampling explore based on uncertainty.
Contextual bandits adapt exploration to user/request context.
Propensity logging is mandatory for off-policy learning.
Guardrails and stop conditions protect user trust.
Exploration infrastructure is foundational for cold-start, fairness, and causal evaluation.

In Part 049, we will cover Causal Thinking and Long-Term Value: how to think about the effect of recommendations beyond the immediate click, avoid proxy traps, and design systems that optimize long-term outcomes.
