
Learn Build From Scratch Recommendations System Part 020 Item To Item Cooccurrence Recommendation


title: Build From Scratch Recommendations System - Part 020
description: Building production-grade item-to-item and co-occurrence recommendation: co-view, co-buy, session co-occurrence, lift, confidence, PMI, association scoring, dedup, complement vs substitute, batch pipeline, and serving.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 20
partTitle: Item-to-Item & Co-occurrence Recommendation
tags:
  - recommendation-system
  - recsys
  - item-to-item
  - co-occurrence
  - association-rules
  - collaborative-filtering
  - series
date: 2026-07-02

Part 020 — Item-to-Item & Co-occurrence Recommendation

One of the most useful recommendation patterns in production is item-to-item:

People also viewed
People also bought
Frequently bought together
Users who watched this also watched
Related articles
Similar cases
Knowledge articles used together
Actions often taken after this action

Unlike content-based recommendation, which looks at item content, item-to-item looks at interaction patterns.

If many users view A then B, buy A and B together, watch A then B, or use knowledge articles A and B on similar cases, then A and B have a behavioral relationship.

Item-to-item recommendation is very production-friendly because:

it can be precomputed,
serving is fast,
it is explainable,
it works well on product detail pages,
it works well for cross-sell,
it does not need a full user profile,
it can be a candidate source for a ranker,
it suits an early collaborative signal before more complex models.

However, naive co-occurrence also goes wrong easily:

popular items show up in every list,
substitutes and complements get mixed together,
bot/internal traffic contaminates the counts,
stock/policy is not checked,
rare item pairs are overpromoted,
temporal/session leakage,
duplicate item families dominate,
“also viewed” is used for “also bought” even though the two mean different things.

This part builds an item-to-item and co-occurrence recommender from scratch.

1. Mental Model: Relationship from Shared Behavior

Item-to-item asks:

“Which items have a strong relationship with this item, based on collective behavior?”

Input:

user/session/order/watch/case interaction histories

Output:

for each seed item, list related items with relationship score

The core idea:

If items appear together more often than expected, they are related.

2. Co-occurrence Units

Before counting pairs, define what “together” means.

Common units:

2.1 Session

Items interacted in same session.

view A, view B, click C

Good for:

browsing similarity,
next item,
content feed,
product discovery.

2.2 Order / Cart

Items bought together.

order contains A and B

Good for:

frequently bought together,
bundles,
accessories,
cross-sell.

2.3 User History Window

Items interacted by same user within 7/30/90 days.

Good for:

broad taste relation,
collaborative similarity.

Risk:

too broad,
user interests mixed.

2.4 Sequence Transition

Item B follows item A.

A -> B

Good for:

next video,
next article,
workflow next action,
session prediction.

2.5 Case / Task Context

Enterprise:

knowledge articles A and B used in same case type
actions A then B in similar cases

Good for:

next best action,
related knowledge,
case support.

Different units produce different relationships.

3. Relationship Semantics

Not all co-occurrence means the same thing.

Also Viewed

view A and view B in same session

Often means substitutes, alternatives, or comparison.

Also Bought

buy A and buy B in same order/window

Often means complements or bundle relation.

Bought After Viewing

view A then buy B

Can mean A led to B, or A was rejected in favor of B.

Watched Next

watch A then watch B

Can mean sequence continuation.

Used Together

Enterprise/document:

article A and B used in same case

Can mean procedural relationship.

Do not mix all co-occurrence types into one generic score without a source label.

4. Basket Construction

A basket is a set or ordered list of items considered together.

Example session basket:

{
  "basket_id": "sess_123",
  "basket_type": "session_views",
  "user_id": "u123",
  "event_time_start": "2026-07-02T10:00:00Z",
  "items": [
    {"item_id": "A", "event_type": "view", "time": "10:00"},
    {"item_id": "B", "event_type": "view", "time": "10:03"},
    {"item_id": "C", "event_type": "click", "time": "10:04"}
  ]
}

Order basket:

{
  "basket_id": "order_456",
  "basket_type": "purchase_order",
  "items": [
    {"item_id": "camera", "quantity": 1},
    {"item_id": "memory_card", "quantity": 1},
    {"item_id": "camera_bag", "quantity": 1}
  ]
}

Basket quality matters. Bad baskets produce bad pair counts.

5. Basket Cleaning

Before generating pairs:

remove duplicate item within basket if needed,
map SKU to product family if appropriate,
remove invalid/bot/internal traffic,
remove items not actually visible/valid,
remove policy-blocked items,
cap huge baskets,
handle repeated events,
exclude test orders,
filter refunds/fraud if needed.

Huge baskets are dangerous.

Example: enterprise procurement order with 500 items creates 124,750 pairs. It can dominate counts but may not reflect recommendation relation.

Cap or downweight large baskets.

basket_weight = 1 / log(2 + basket_size)

or ignore baskets above threshold.
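A minimal Java sketch of the downweighting rule above. It assumes the natural logarithm (the formula does not name a base) and a hypothetical `maxBasketSize` cutoff that you would tune per domain; the class and method names are illustrative only.

```java
/** Sketch: weight a basket by its size so huge baskets contribute less per pair. */
final class BasketWeight {
    /** Number of unordered pairs in a basket of n distinct items: n * (n - 1) / 2. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    /**
     * Returns 1 / ln(2 + basketSize), or 0.0 when the basket is larger than
     * maxBasketSize (a hypothetical threshold, not a value from the lesson).
     */
    static double weight(int basketSize, int maxBasketSize) {
        if (basketSize > maxBasketSize) {
            return 0.0; // ignore baskets above the threshold
        }
        return 1.0 / Math.log(2.0 + basketSize);
    }

    public static void main(String[] args) {
        System.out.println(pairCount(500));   // pairs produced by the 500-item order
        System.out.println(weight(3, 200));   // small basket keeps most of its weight
        System.out.println(weight(500, 200)); // oversized basket is dropped
    }
}
```

Whether to drop or only downweight oversized baskets is a design choice: dropping is simpler to reason about, while downweighting keeps some signal from bulk orders.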

6. Pair Generation

For unordered basket:

items = [A, B, C]
pairs = (A,B), (A,C), (B,C)

For recommendation from seed item, store directional pairs:

A -> B
B -> A
A -> C
C -> A

For sequence:

A -> B
B -> C

Or with window:

A -> B within next 3 interactions
A -> C within next 3 interactions

Directional relationships matter.

Example:

phone -> phone_case

is useful.

phone_case -> phone

may also be useful but has different semantics.
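The two generation modes above can be sketched in Java as follows. This is an illustration under assumptions: items are plain string IDs, repeated items in an unordered basket are collapsed, and self-transitions in a sequence are skipped. `PairGenerator` and `Pair` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class PairGenerator {
    /** Directed pair: seed -> candidate. */
    record Pair(String seed, String candidate) {}

    /** Unordered basket: emit both directions for every distinct item pair. */
    static List<Pair> fromBasket(List<String> items) {
        // Drop repeated items while keeping first-seen order.
        List<String> distinct = new ArrayList<>(new LinkedHashSet<>(items));
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < distinct.size(); i++) {
            for (int j = i + 1; j < distinct.size(); j++) {
                pairs.add(new Pair(distinct.get(i), distinct.get(j)));
                pairs.add(new Pair(distinct.get(j), distinct.get(i)));
            }
        }
        return pairs;
    }

    /** Ordered sequence: pair each item only with the next `window` interactions. */
    static List<Pair> fromSequence(List<String> sequence, int window) {
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < sequence.size(); i++) {
            int end = Math.min(sequence.size(), i + 1 + window);
            for (int j = i + 1; j < end; j++) {
                if (!sequence.get(i).equals(sequence.get(j))) { // skip self-transitions
                    pairs.add(new Pair(sequence.get(i), sequence.get(j)));
                }
            }
        }
        return pairs;
    }
}
```

With `window = 1` the sequence mode yields only adjacent transitions (A -> B, B -> C); with `window = 3` it also yields A -> C.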

7. Basic Co-occurrence Count

Simplest score:

co_count(A,B) = number of baskets containing both A and B

This is easy but biased toward popular items.

8. Support, Confidence, Lift

Association rule metrics.

Support

How often A and B occur together.

support(A,B) = count(A,B) / total_baskets

Confidence

Probability of B given A.

confidence(A -> B) = count(A,B) / count(A)

Useful for directional recommendation.

Problem: favors popular B.

Lift

How much more likely B is with A than generally.

lift(A -> B) =
  confidence(A -> B) / P(B)

Or:

lift(A,B) =
  P(A,B) / (P(A) * P(B))

Lift reduces popularity bias.

But lift can overpromote rare pairs with tiny counts. Use minimum support/smoothing.
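As a worked illustration of the three metrics, here is a small Java sketch. The counts in `main` are made-up illustrative values, and the class name is hypothetical; callers are assumed to pass non-zero denominators.

```java
/** Sketch: support, confidence, and lift from raw basket counts. */
final class AssociationMetrics {
    /** support(A,B) = count(A,B) / total_baskets */
    static double support(long pairCount, long totalBaskets) {
        return (double) pairCount / totalBaskets;
    }

    /** confidence(A -> B) = count(A,B) / count(A) */
    static double confidence(long pairCount, long seedCount) {
        return (double) pairCount / seedCount;
    }

    /** lift(A -> B) = confidence(A -> B) / P(B), with P(B) = count(B) / total_baskets */
    static double lift(long pairCount, long seedCount, long candidateCount, long totalBaskets) {
        double candidateProb = (double) candidateCount / totalBaskets;
        return confidence(pairCount, seedCount) / candidateProb;
    }

    public static void main(String[] args) {
        // Assumed counts: 120 joint baskets, 1,500 with A, 800 with B, 32,000 total.
        System.out.println("support    = " + support(120, 32_000));
        System.out.println("confidence = " + confidence(120, 1_500));
        System.out.println("lift       = " + lift(120, 1_500, 800, 32_000));
    }
}
```

With these counts, confidence is about 0.08 and lift about 3.2: B shows up with A roughly three times more often than its base rate would predict.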

9. PMI and PPMI

Pointwise Mutual Information:

PMI(A,B) = log( P(A,B) / (P(A) * P(B)) )

Positive PMI:

PPMI = max(PMI, 0)

PMI captures association beyond chance.

Problem:

unstable for rare pairs,
needs smoothing/min count.

Use:

if count(A,B) >= min_pair_count:
    score = PMI
else:
    ignore

Or combine with support/confidence.
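The thresholded PMI rule above could look like this in Java. It is a sketch that assumes the natural logarithm and returns an empty result for pairs below `minPairCount`; the names are hypothetical.

```java
import java.util.OptionalDouble;

final class Pmi {
    /**
     * PMI(A,B) = ln( P(A,B) / (P(A) * P(B)) ).
     * Returns empty when the pair is below the minimum count or any count is zero.
     */
    static OptionalDouble pmi(long pairCount, long countA, long countB,
                              long totalBaskets, long minPairCount) {
        if (pairCount < minPairCount || pairCount == 0
                || countA == 0 || countB == 0 || totalBaskets == 0) {
            return OptionalDouble.empty(); // too little evidence: ignore the pair
        }
        double pAB = (double) pairCount / totalBaskets;
        double pA = (double) countA / totalBaskets;
        double pB = (double) countB / totalBaskets;
        return OptionalDouble.of(Math.log(pAB / (pA * pB)));
    }

    /** PPMI = max(PMI, 0); ignored pairs also map to 0. */
    static double ppmi(long pairCount, long countA, long countB,
                       long totalBaskets, long minPairCount) {
        return Math.max(0.0, pmi(pairCount, countA, countB, totalBaskets, minPairCount).orElse(0.0));
    }
}
```

Returning an empty value instead of 0 for ignored pairs keeps "no evidence" distinguishable from "no association" in downstream code.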

10. Scoring Formula

Practical item-to-item score often blends metrics.

Example:

score(A -> B) =
  w_conf * smoothed_confidence(A -> B)
  + w_lift * log_lift(A,B)
  + w_quality * quality(B)
  + w_freshness * freshness(B)
  - w_pop_penalty * popularity_penalty(B)

Simpler:

score =
  smoothed_confidence(A -> B)
  * log(1 + lift(A,B))
  * quality_score(B)

With minimum thresholds:

count(A,B) >= 5
count(A) >= 20
item B eligible

Do not chase mathematical purity first. Use stable, explainable scoring.

11. Smoothing

Confidence:

conf(A -> B) = count(A,B) / count(A)

Smoothed:

smoothed_conf =
  (count(A,B) + prior_prob_B * prior_weight)
  / (count(A) + prior_weight)

This prevents items with tiny counts from getting extreme scores.

Alternatively:

score = count(A,B) / (count(A)^alpha * count(B)^beta)

With alpha/beta controlling popularity normalization.

Example:

score = count(A,B) / sqrt(count(A) * count(B))

This is cosine similarity for binary co-occurrence vectors.

12. Cosine Similarity from Co-occurrence

Represent each item as vector of users/sessions/baskets.

item A vector: baskets where A appears
item B vector: baskets where B appears

Cosine:

cosine(A,B) = count(A,B) / sqrt(count(A) * count(B))

This is a strong simple baseline.

It normalizes popularity better than raw count.

But it treats all baskets equally and ignores direction.

13. Jaccard Similarity

jaccard(A,B) = count(A,B) / count(A or B)

Good for set overlap.

Can be too harsh for popular items.

Useful for:

document co-usage,
small domains,
entity overlap.
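Both cosine (section 12) and Jaccard (section 13) can be computed directly from the sets of basket IDs in which each item appears. The Java sketch below assumes those sets fit in memory; `SetSimilarity` is a hypothetical helper name.

```java
import java.util.Set;

final class SetSimilarity {
    /** count(A,B): number of baskets that contain both items. */
    static int intersectionSize(Set<String> basketsA, Set<String> basketsB) {
        // Iterate over the smaller set for fewer lookups.
        Set<String> small = basketsA.size() <= basketsB.size() ? basketsA : basketsB;
        Set<String> large = small == basketsA ? basketsB : basketsA;
        int shared = 0;
        for (String basketId : small) {
            if (large.contains(basketId)) {
                shared++;
            }
        }
        return shared;
    }

    /** cosine(A,B) = count(A,B) / sqrt(count(A) * count(B)) */
    static double cosine(Set<String> basketsA, Set<String> basketsB) {
        if (basketsA.isEmpty() || basketsB.isEmpty()) {
            return 0.0;
        }
        return intersectionSize(basketsA, basketsB)
            / Math.sqrt((double) basketsA.size() * basketsB.size());
    }

    /** jaccard(A,B) = count(A,B) / count(A or B) */
    static double jaccard(Set<String> basketsA, Set<String> basketsB) {
        int shared = intersectionSize(basketsA, basketsB);
        int union = basketsA.size() + basketsB.size() - shared;
        return union == 0 ? 0.0 : (double) shared / union;
    }
}
```

For the same pair, Jaccard is never larger than cosine, which is one way to see why it penalizes popular items more heavily.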

14. Directional vs Symmetric Relations

Symmetric:

similar_to(A,B)

Directional:

A recommends B

Co-buy in order is often symmetric, but serving may need direction.

Example:

camera -> memory card
memory card -> camera

Both may be valid, but ranking differs.

Sequence is directional:

episode 1 -> episode 2

not:

episode 2 -> episode 1

Workflow actions are directional:

collect evidence -> escalate case

not always reverse.

Store relation type and direction.

15. Time-Aware Co-occurrence

Relationships change.

Use windows:

co_view_1d
co_view_7d
co_view_30d
co_buy_90d

Or decay:

pair_weight = exp(-lambda * age)

For stable domains, longer windows help.

For trends, shorter windows matter.

Hybrid:

score =
  0.5 * score_7d
  + 0.3 * score_30d
  + 0.2 * score_180d

If item pair was common last year but irrelevant now, recency decay helps.
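A short Java sketch of the decay and hybrid formulas in this section. Expressing `lambda` through a half-life is an assumption added here for readability, not something the lesson prescribes; the blend weights are the ones from the hybrid example.

```java
final class RecencyDecay {
    /** Converts a half-life in days into lambda (assumed parameterization). */
    static double lambdaForHalfLife(double halfLifeDays) {
        return Math.log(2.0) / halfLifeDays;
    }

    /** pair_weight = exp(-lambda * age) */
    static double pairWeight(double ageDays, double lambda) {
        return Math.exp(-lambda * ageDays);
    }

    /** Hybrid blend of window scores with the 0.5 / 0.3 / 0.2 weights from the example. */
    static double blend(double score7d, double score30d, double score180d) {
        return 0.5 * score7d + 0.3 * score30d + 0.2 * score180d;
    }

    public static void main(String[] args) {
        double lambda = lambdaForHalfLife(30.0);
        System.out.println(pairWeight(0.0, lambda));   // a pair seen today
        System.out.println(pairWeight(30.0, lambda));  // a pair seen one half-life ago
        System.out.println(pairWeight(365.0, lambda)); // a pair seen a year ago
    }
}
```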

16. Session-Aware Co-occurrence

For session co-view, not all pairs are equally strong.

If A and B occur close together, the relation is stronger.

weight(A,B) = exp(-lambda * interaction_distance)

Example:

A then B immediately: high
A and B 20 interactions apart: low

Also use time gap:

weight = exp(-lambda * minutes_between_events)

This helps distinguish session intent from broad user taste.

17. Event Weighting

Different event types carry different strength.

Example:

view-view pair: 1
click-click pair: 2
add_to_cart-add_to_cart pair: 4
purchase-purchase pair: 6
watch_complete-watch_complete pair: 5

Pair weight:

pair_weight = sqrt(weight(event_i) * weight(event_j))

But keep relation type separate.

Better:

build co-view model,
build co-buy model,
build co-watch-complete model,
blend at serving depending on the surface.

18. Co-view vs Co-buy

Co-view often finds alternatives/substitutes.

Example:

User views multiple cameras before choosing one.

Co-buy often finds complements.

Example:

Camera + memory card + camera bag.

For PDP:

similar alternatives -> co_view/content_similarity
frequently bought together -> co_buy/order
accessories -> co_buy + compatibility

Do not use co-view for “frequently bought together” unless validated.

19. Substitute vs Complement

This is critical.

Substitute:

items satisfy similar need; user chooses one

Complement:

items are useful together

Signals:

Signal	Likely relation
same session views, one purchased	substitute/alternative
same order purchase	complement
same cart	complement or comparison
same category + co-view	substitute
different category + co-buy	complement
compatibility graph	complement
same creator/topic + watch sequence	related/continuation

Classification:

if same category and co_view high: alternative
if different complementary category and co_buy high: complement
if sequence transition high: next

Relation type should be stored.

20. Item Granularity

Use correct item level:

product-level,
SKU-level,
variant-level,
offer-level,
article-level,
canonical document-level,
case template-level,
action-level.

Example e-commerce:

Order contains SKU red size 42. For recommendation, use product family for co-buy to avoid duplicate variants.

But for compatibility, SKU/variant might matter.

Rule:

choose granularity based on serving surface

Store mapping:

sku -> variant -> product -> product_family

21. Dedup and Canonicalization

Before pair generation:

canonicalize items,
map duplicates,
collapse variants if needed,
remove same dedup group pairs,
handle item merges/deletions.

Example:

A_variant1 and A_variant2 in same order

Do not produce pair:

A -> A_variant2

unless variant recommendation is intended.

Dedup group is mandatory.

22. Eligibility and Policy

Pair store may contain old items now invalid.

Serving must filter:

active,
available,
policy approved,
visible to actor,
region,
age,
tenant,
surface allowed,
not suppressed.

Batch generation can prefilter, but serving must final-check because state changes.

Pattern:

precompute relation candidates
final online eligibility check

23. Data Pipeline

Batch pipeline:

Inputs:

clean events,
catalog snapshots,
dedup mapping,
traffic filters,
item quality,
policy state.

Outputs:

seed item,
candidate item,
relation type,
score,
evidence counts,
version.

24. Pair Explosion

If basket size is n, unordered pairs:

n * (n - 1) / 2

Large baskets explode.

Example:

n = 1000 -> 499,500 pairs

Mitigation:

cap basket size,
sample pairs,
downweight large baskets,
ignore low-signal events,
split basket by category/time,
sequence window instead of full pair,
use frequent item mining thresholds.

For sessions, only pair nearby events.

window_size = 5 next items

25. Minimum Support

Rare pairs can have high lift by accident.

Use thresholds:

count(A) >= min_seed_count
count(B) >= min_candidate_count
count(A,B) >= min_pair_count
unique_users(A,B) >= min_unique_users

Example:

min_pair_count: 5
min_unique_users: 3
min_seed_count: 20

Use higher thresholds for high-traffic surfaces.

For enterprise low-data domains, thresholds may be lower but require expert/rule backing.

26. Unique User vs Event Count

If one user repeatedly views A and B, the raw pair count is inflated.

Use unique users/sessions/orders.

Metrics:

pair_event_count
pair_session_count
pair_user_count
pair_order_count

For robustness:

score uses unique_user_count or capped per-user contribution

Cap contribution:

max contribution per user per pair per day = 1

Prevents power users/bots from dominating.
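A minimal Java sketch of the cap above: each user contributes at most once per pair per day, and unique users are counted separately. The event shape (`PairEvent` with a date string) is an assumption for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CappedPairCounter {
    /** One observed pair event; `day` is a date string such as "2026-07-01". */
    record PairEvent(String userId, String seed, String candidate, String day) {}

    /** Dedup key: one contribution per user per day. */
    private record UserDay(String userId, String day) {}

    /** Pair count where each user counts at most once per day. */
    static long cappedCount(List<PairEvent> events, String seed, String candidate) {
        Set<UserDay> seen = new HashSet<>();
        for (PairEvent e : events) {
            if (e.seed().equals(seed) && e.candidate().equals(candidate)) {
                seen.add(new UserDay(e.userId(), e.day()));
            }
        }
        return seen.size();
    }

    /** Number of distinct users who produced the pair at least once. */
    static long uniqueUsers(List<PairEvent> events, String seed, String candidate) {
        Set<String> users = new HashSet<>();
        for (PairEvent e : events) {
            if (e.seed().equals(seed) && e.candidate().equals(candidate)) {
                users.add(e.userId());
            }
        }
        return users.size();
    }
}
```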

27. Bot and Fraud Protection

Co-occurrence is easy to manipulate.

Filters:

exclude bot/internal/test,
cap per-user contribution,
require unique users,
detect abnormal pair velocity,
seller/creator self-interaction filters,
suspicious account clusters,
low-quality traffic source filtering.

Trending co-occurrence needs even stronger protection.

28. Pair Store Schema

Example:

{
  "seed_item_id": "item_A",
  "candidate_item_id": "item_B",
  "relation_type": "co_buy",
  "score": 0.842,
  "rank": 1,
  "evidence": {
    "pair_count": 120,
    "seed_count": 1500,
    "candidate_count": 800,
    "unique_user_count": 110,
    "confidence": 0.08,
    "lift": 3.2,
    "pmi": 1.16
  },
  "metadata": {
    "window": "90d",
    "generated_at": "2026-07-02T02:00:00Z",
    "model_version": "i2i-cobuy-v4",
    "item_granularity": "product_family"
  }
}

Store evidence for debugging.

29. Serving Flow

Item-to-item should be low-latency because top lists are precomputed.

30. Fallbacks

If seed item has no co-occurrence list:

Fallback hierarchy:

same category popularity
content-based similar items
same creator/brand popular
global surface baseline
editorial list

If co-buy empty but co-view exists, use co-view only if surface allows alternatives.

Fallback reason must be logged.

31. Blending Multiple I2I Sources

For PDP:

sources:
  content_similar:
    weight: 0.3
  co_view_alternatives:
    weight: 0.3
  co_buy_complements:
    weight: 0.3
  editorial_accessories:
    weight: 0.1

But keep relation slots separate if UX needs clarity:

Similar items
Frequently bought together
Accessories
Customers also viewed

Mixing all into one list can confuse users.

32. Multi-Seed Recommendation

For cart:

Seed is multiple items.

Options:

32.1 Union

Fetch related for each cart item, merge scores.

score(candidate) = sum(score(seed_i -> candidate))

32.2 Max

score = max_i score(seed_i -> candidate)

32.3 Weighted by Cart Importance

score = sum(cart_item_weight_i * score(seed_i -> candidate))

Filter:

items already in cart,
incompatible items,
duplicates,
too expensive if not intended.

For enterprise case, seeds can be current case attributes, used articles, prior actions.
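The three merge options can be sketched in one Java helper. It assumes related lists arrive as a map from seed to candidate scores, treats a missing seed weight as 1.0, and filters only items already in the cart; compatibility and price filters are left out. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class MultiSeedMerger {
    enum Mode { SUM, MAX }

    /**
     * Merges per-seed related lists into one candidate score map.
     * relatedBySeed: seed item -> (candidate item -> score).
     * seedWeights: optional cart importance per seed; missing seeds default to 1.0.
     */
    static Map<String, Double> merge(Map<String, Map<String, Double>> relatedBySeed,
                                     Map<String, Double> seedWeights,
                                     Mode mode) {
        Map<String, Double> merged = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> seed : relatedBySeed.entrySet()) {
            double weight = seedWeights.getOrDefault(seed.getKey(), 1.0);
            for (Map.Entry<String, Double> candidate : seed.getValue().entrySet()) {
                if (relatedBySeed.containsKey(candidate.getKey())) {
                    continue; // candidate is itself a seed, so it is already in the cart
                }
                double score = weight * candidate.getValue();
                if (mode == Mode.SUM) {
                    merged.merge(candidate.getKey(), score, (a, b) -> a + b);
                } else {
                    merged.merge(candidate.getKey(), score, (a, b) -> Math.max(a, b));
                }
            }
        }
        return merged;
    }
}
```

Summing rewards candidates that relate to several cart items, while max avoids inflating a candidate just because the cart is large.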

33. Sequence Transitions

For next item/action:

Count transitions:

A -> B

within event sequence.

Metrics:

transition_count(A,B)
transition_confidence(A->B) = count(A->B) / count(A)

Use time gap and position.

For video:

watch_complete A then watch_start B within 10m

For workflow:

action A then action B in same case within valid state transition

Sequence relation is directional and context-dependent.
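A minimal Java sketch of the transition metrics above, counting only immediate successors. It assumes `count(A)` means every occurrence of A in any sequence, including when A is last; time-gap and state-validity checks are omitted. `TransitionCounter` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TransitionCounter {
    // transition_count(A,B): how often B immediately follows A.
    private final Map<String, Map<String, Long>> transitions = new HashMap<>();
    // count(A): how often A occurs at any position.
    private final Map<String, Long> itemCounts = new HashMap<>();

    /** Adds one ordered event sequence (for example, one session or one case). */
    void addSequence(List<String> sequence) {
        for (int i = 0; i < sequence.size(); i++) {
            String current = sequence.get(i);
            itemCounts.merge(current, 1L, Long::sum);
            if (i + 1 < sequence.size()) {
                transitions.computeIfAbsent(current, k -> new HashMap<>())
                    .merge(sequence.get(i + 1), 1L, Long::sum);
            }
        }
    }

    /** transition_confidence(A -> B) = count(A -> B) / count(A); 0.0 for unseen A. */
    double confidence(String a, String b) {
        long countA = itemCounts.getOrDefault(a, 0L);
        if (countA == 0) {
            return 0.0;
        }
        Map<String, Long> empty = Map.of();
        long countAB = transitions.getOrDefault(a, empty).getOrDefault(b, 0L);
        return (double) countAB / countA;
    }
}
```

The counts are directional by construction: `confidence("ep1", "ep2")` and `confidence("ep2", "ep1")` are tracked independently.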

34. Contextual Co-occurrence

Same item pair may mean different things by context.

Examples:

region,
category,
surface,
user segment,
tenant,
role,
case state.

Build segmented I2I if data sufficient:

co_buy_by_region
co_view_by_category
next_action_by_case_state_role
related_article_by_jurisdiction

Use fallback hierarchy if sparse.

Segmented co-occurrence can dramatically improve relevance.

35. Enterprise Co-occurrence

Examples:

Similar Cases

Pairs of cases sharing entities, risk indicators, actions, outcomes.

Knowledge Articles Used Together

If article A and B often used in same case type, recommend B after A.

Next Actions

If action B often follows action A for case state S, recommend B.

Evidence Checklist

Evidence item B often required when evidence A is present.

Constraints:

permission,
jurisdiction,
policy version,
case state validity,
auditability,
expert review for high-risk actions.

Co-occurrence is evidence, not authority. For high-stakes workflows, combine with rules/state machine.

36. Explanation

Item-to-item explanations:

Frequently bought together
Customers also viewed
Often watched next
Used in similar cases
Commonly referenced with this article
Usually follows this action in cases like this

Only use explanation matching relation type.

Do not label co-view as “bought together”.

Expose confidence carefully. User-facing explanation should be simple; internal debug can include counts/lift.

37. Evaluation

Offline:

Recall@K for next item/purchase,
HitRate@K,
NDCG@K,
coverage,
pair precision via human judgment,
complement/substitute correctness,
diversity,
cold item coverage,
long-tail exposure.

Online:

CTR,
add-to-cart,
attach rate,
bundle conversion,
watch next completion,
case action acceptance,
hide/report,
revenue/margin,
return/refund,
user satisfaction.

For co-buy, attach rate is more relevant than CTR alone.

38. Attach Rate

For cross-sell:

attach_rate = orders_with_seed_and_candidate / orders_with_seed

This is directional confidence for purchase.

But adjust for general popularity:

lift = attach_rate / overall_purchase_rate(candidate)

Use both.

High attach rate but low lift may just mean candidate is popular.

High lift but tiny count may be noise.

39. Guardrails

Monitor:

out_of_stock_filter_rate
policy_filter_rate
duplicate_filter_rate
same_seed_candidate_rate
cooccurrence_list_empty_rate
pair_count_distribution
top_candidate_popularity_concentration
hide/report rate
return/refund rate for co-buy
bot contribution

If one item appears in too many lists, check popularity bias.

Metric:

candidate_appears_in_seed_lists_count

Cap over-dominant candidates if needed.
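One way to compute `candidate_appears_in_seed_lists_count` and flag over-dominant candidates is sketched below in Java. The `maxShare` threshold (the fraction of seed lists a candidate may appear in) is a hypothetical knob, not a value from the lesson.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class CandidateConcentration {
    /** For each candidate, the number of seed lists it appears in. */
    static Map<String, Integer> seedListCounts(Map<String, List<String>> listsBySeed) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> candidates : listsBySeed.values()) {
            for (String candidate : new HashSet<>(candidates)) { // count once per list
                counts.merge(candidate, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Candidates that appear in more than `maxShare` of all seed lists. */
    static List<String> overDominant(Map<String, List<String>> listsBySeed, double maxShare) {
        List<String> flagged = new ArrayList<>();
        int totalLists = listsBySeed.size();
        for (Map.Entry<String, Integer> e : seedListCounts(listsBySeed).entrySet()) {
            if (e.getValue() > maxShare * totalLists) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }
}
```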

40. Versioning

Version:

event source,
basket definition,
window,
item granularity,
pair generation logic,
scoring formula,
thresholds,
filters,
output list.

Example:

i2i-cobuy-productfamily-90d-v4

Serving logs:

{
  "source": "item_to_item",
  "relation_type": "co_buy",
  "source_version": "i2i-cobuy-productfamily-90d-v4"
}

Without versioning, changed pair logic can’t be evaluated cleanly.

41. Implementation Sketch

Core model:

public record ItemPair(
    String seedItemId,
    String candidateItemId,
    String relationType,
    double score,
    long pairCount,
    long seedCount,
    long candidateCount,
    double confidence,
    double lift
) {}

Scoring:

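public final class CoOccurrenceScorer {
    // Pseudo-observations added to the seed count; larger values pull small seeds harder toward the prior.
    private static final double PRIOR_WEIGHT = 50.0;

    public double score(long pairCount, long seedCount, long candidateCount, long totalBaskets,
                        double candidateQuality) {
        // P(B): base rate of the candidate, used as the smoothing prior and as the lift denominator.
        double candidateProb = (double) candidateCount / Math.max(totalBaskets, 1L);

        // Smoothed confidence(A -> B): shrinks toward P(B) when the seed has little data.
        double smoothedConfidence =
            (pairCount + candidateProb * PRIOR_WEIGHT) / (seedCount + PRIOR_WEIGHT);

        // Lift from the smoothed confidence; the floor guards against division by zero.
        double lift = smoothedConfidence / Math.max(candidateProb, 1e-9);

        return smoothedConfidence * Math.log1p(lift) * candidateQuality;
    }
}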

This is simple, stable, and interpretable.

42. Batch Job Pseudocode

// Read baskets for the window and keep only those that pass quality checks.
List<Basket> baskets = basketReader.read(spec.timeWindow()).stream()
    .filter(basketQuality::isValid)
    .toList();
long totalBaskets = baskets.size();

// Aggregate pair and item counts over the same cleaned basket set so they stay consistent.
Map<PairKey, PairStats> pairStats =
    aggregatePairs(baskets.stream().flatMap(pairGenerator::generatePairs));
Map<String, Long> itemCounts = aggregateItemCounts(baskets);

// scorer.score here stands for an adapter that looks up counts and returns a scored ItemPair.
List<ItemPair> scoredPairs = pairStats.entrySet().stream()
    .filter(e -> e.getValue().pairCount() >= spec.minPairCount())
    .map(e -> scorer.score(e, itemCounts, totalBaskets))
    .filter(pair -> eligibilityPreFilter.isCandidateAllowed(pair.candidateItemId()))
    .collect(toTopNPerSeed(spec.topN()));

pairStore.write(scoredPairs, spec.version());

Production implementations may use Spark/Flink/Beam/SQL. The logical pipeline stays the same.

43. Anti-Patterns

43.1 Raw Co-count Ranking

Popular items dominate every list.

43.2 No Minimum Support

Rare accidental pairs overpromoted.

43.3 Mix Co-view and Co-buy Blindly

Alternatives and complements confused.

43.4 No Basket Cleaning

Bots, huge orders, duplicates contaminate pairs.

43.5 No Item Granularity Decision

SKU variants recommend each other uselessly.

43.6 No Online Eligibility Filter

Deleted/out-of-stock/banned items show up.

43.7 No Direction

Sequence and complement relationships get reversed incorrectly.

43.8 No Dedup

Same product family fills list.

43.9 No Evidence Stored

Pair cannot be debugged.

43.10 No Versioning

Score changes cannot be attributed.

44. Minimal Production I2I Plan

Build four relation stores:

44.1 Co-view Alternatives

basket: session views/clicks
window: 30d
granularity: product/article/item canonical
score: cosine or confidence*lift
use_case: similar/also viewed

44.2 Co-buy Complements

basket: order/cart purchases
window: 90d
granularity: product_family
score: smoothed attach rate * lift * quality
use_case: frequently bought together

44.3 Sequential Next

basket: ordered session events
window: 30d
pairing: next 1-3 events
score: transition confidence with smoothing
use_case: next video/article/action

44.4 Enterprise Co-usage

basket: case/task context
window: 180d
constraints: tenant/role/jurisdiction/policy
score: success-weighted co-usage
use_case: related knowledge/next action

Each relation store gets its own version, thresholds, and evaluation.

45. Item-to-Item Readiness Checklist

[ ] Co-occurrence unit is explicit.
[ ] Relation type is explicit.
[ ] Basket cleaning rules exist.
[ ] Bot/internal/test traffic excluded.
[ ] Huge baskets capped or downweighted.
[ ] Item granularity is chosen intentionally.
[ ] Dedup/canonicalization happens before pair generation.
[ ] Pair generation direction is correct.
[ ] Minimum support thresholds exist.
[ ] Score normalizes popularity.
[ ] Smoothing is applied.
[ ] Evidence counts are stored.
[ ] Relation store is versioned.
[ ] Online eligibility/policy/availability filters run.
[ ] User suppression runs.
[ ] Dedup/diversity constraints exist.
[ ] Fallback exists for sparse seeds.
[ ] Co-view and co-buy are not mixed blindly.
[ ] Enterprise access/state/jurisdiction constraints enforced if applicable.
[ ] Metrics include coverage, attach rate, hide/report, and filter rates.

46. Conclusion

Item-to-item and co-occurrence recommendation is one of the most useful, cheapest, and most powerful systems for production.

It provides:

“also viewed”,
“frequently bought together”,
“watched next”,
“related knowledge”,
“next action”,
a candidate source for the ranker,
fallback and explainability.

Key principles:

Define what “together” means.
Separate co-view, co-buy, sequence, and co-usage.
Raw co-count is not enough; normalize popularity.
Use support, confidence, lift, PMI/cosine, and smoothing.
Decide item granularity carefully.
Clean baskets before generating pairs.
Store evidence and version everything.
Apply online eligibility and suppression.
Distinguish substitute vs complement.
For enterprise, co-occurrence is evidence, not authority; combine with rules and permissions.

In Part 021, we will cover User-Item Collaborative Filtering: how to build recommendations from the user-item interaction matrix, neighborhood methods, similarity metrics, sparse data, cold-start limitations, and production trade-offs.
