title: Build From Scratch Recommendations System - Part 068
description: "Designing model quality monitoring and drift detection for a production-grade recommendation system: data drift, feature drift, prediction drift, calibration drift, label drift, candidate drift, segment drift, alerting, retraining triggers, rollback, and governance."
series: learn-build-from-scratch-recommendations-system
seriesTitle: "Build From Scratch: Enterprise Recommendations System"
order: 68
partTitle: Model Quality Monitoring and Drift
tags:
- recommendation-system
- recsys
- model-monitoring
- drift
- mlops
- observability
- series
date: 2026-07-02
Part 068 — Model Quality Monitoring and Drift
A recommendation model does not fail only when inference throws errors.
A model can stay technically healthy while its quality declines because:
- user behavior changes,
- the catalog changes,
- seasonality,
- feature pipeline drift,
- candidate sources change,
- the new item distribution changes,
- policies/rules change,
- the label distribution changes,
- the embedding/index version changes,
- an experiment treatment alters feedback,
- model calibration degrades,
- training data is no longer representative.
This is called drift.
Model quality monitoring is responsible for detecting quality degradation, explaining its source, and triggering an action:
recalibrate
retrain
rollback
disable source
fix feature
adjust policy
run experiment
investigate data incident
This part covers model quality monitoring and drift detection for a production-grade recommendation system: data drift, feature drift, prediction drift, calibration drift, label drift, candidate drift, segment drift, alerting, retraining triggers, rollback, and governance.
1. Mental Model: Model Quality Is a Moving Target
A model is trained on past data.
The production environment keeps changing.
training distribution != current serving distribution
A recommendation system is especially vulnerable because it influences the data it learns from:
model recommends -> user reacts -> data collected -> future model learns
Monitoring must look at:
- input drift,
- output drift,
- outcome drift,
- decision drift,
- segment drift,
- feedback loop drift.
2. Drift Types
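The drift types covered in the following sections (data, feature, prediction, label, calibration, candidate, embedding/index, segment, and feedback loop drift) each need a different metric and a different response.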
3. Data Drift
Data drift means input data distribution changes.
Examples:
new region traffic increases
mobile traffic doubles
new category launches
holiday season changes demand
anonymous user share rises
tenant mix changes
Data drift does not always mean the model is bad, but it does mean the training distribution may be less representative of current traffic.
Monitor request/context distribution.
4. Feature Drift
Feature drift means feature value distribution changes.
Examples:
item_ctr_7d all drops
user_profile_affinity all zeros
category unknown rate spikes
embedding norm shifts
stock feature missing increases
session_depth distribution changes
Feature drift can be caused by a real-world change or by a pipeline bug.
Distinguish natural drift from a data incident.
5. Prediction Drift
Prediction drift means the distribution of model outputs changes.
Examples:
p_click mean drops
p_hide p95 increases
rank_score distribution compresses
top scores become too high
utility score dominated by business boost
Prediction drift can result from feature/candidate changes.
Track by model version and segment.
6. Label Drift
Label drift means outcome distribution changes.
Examples:
CTR baseline changes
purchase conversion shifts
hide/report increases
return rate changes
case action completion rate drops
Causes:
- user behavior,
- seasonality,
- UI change,
- tracking bug,
- model behavior,
- external factors.
Label drift affects training and evaluation.
7. Calibration Drift
Calibration drift:
predicted probabilities no longer match observed probabilities
Example:
model predicts 10% click, actual click now 6%
Causes:
- user behavior shift,
- candidate distribution shift,
- feature drift,
- UI change,
- policy change.
Calibration drift is critical for utility composition.
8. Candidate Drift
Candidate pool distribution changes.
Examples:
two_tower source returns more new items
trending source dominates
content source candidate count drops
invalid candidate rate rises
candidate categories shift
A ranker trained on the old candidate distribution may perform poorly on the new one.
Monitor candidate source and pool distribution.
9. Embedding/Index Drift
Embedding/index changes alter retrieval.
Monitor:
embedding norm
coverage
nearest neighbor distribution
index recall benchmark
category distribution from ANN
empty vector result rate
index filter rate
new item time to index
A new index can change the candidate universe even if the ranker is unchanged.
10. Segment Drift
Global metrics may be stable while an individual segment degrades.
Segments:
new users
anonymous users
cold-start items
regions
languages
devices
tenants
categories
candidate sources
privacy modes
Monitor drift by segment.
11. Feedback Loop Drift
The recommender affects the future data it will be trained on.
Symptoms:
top items get more exposure
long-tail exposure shrinks
user interests narrow
creator concentration increases
exploration support decreases
negative feedback increases for repeated topics
This is not ordinary data drift. It is system-induced drift.
It calls for exposure and fairness monitoring.
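One way to quantify the "top items get more exposure, long-tail exposure shrinks" symptom is a concentration measure over per-item impression counts. The sketch below computes a Gini coefficient; the class name `ExposureConcentration` and the choice of Gini are assumptions for illustration, not something the series prescribes.

```java
import java.util.Arrays;

/** Hypothetical helper: Gini coefficient of exposure (0 = perfectly even, near 1 = concentrated). */
final class ExposureConcentration {

    static double gini(long[] exposures) {
        if (exposures.length == 0) {
            throw new IllegalArgumentException("At least one item is required");
        }
        long[] sorted = exposures.clone();
        Arrays.sort(sorted); // ascending order is required by the rank-weighted formula
        double total = 0.0;
        double rankWeighted = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] < 0) {
                throw new IllegalArgumentException("Exposure counts must be non-negative");
            }
            total += sorted[i];
            rankWeighted += (i + 1) * (double) sorted[i];
        }
        if (total == 0.0) {
            return 0.0; // no exposure at all: treat as not concentrated
        }
        int n = sorted.length;
        return (2.0 * rankWeighted) / (n * total) - (n + 1.0) / n;
    }
}
```

Tracked per day or per week, a steadily rising value suggests the feedback loop is narrowing exposure. With n items the maximum is (n - 1) / n, so compare values only across catalogs of similar size.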
12. Monitoring Windows
Use multiple windows:
real-time: minutes
nearline: hours
daily: 1d
weekly: 7d
long-term: 30d+
Different signals mature at different speeds.
Example:
- feature missing spike: minutes.
- click drift: hours.
- purchase drift: days.
- return drift: weeks.
- retention drift: weeks.
13. Baseline Selection
Compare current metrics to baseline.
Options:
previous hour/day/week
same day last week
training distribution
champion model baseline
control variant
seasonality-adjusted expected range
Choose the baseline that fits the context.
Comparing holiday traffic to a normal weekday, for example, may produce false alerts.
14. Feature Drift Metrics
Metrics:
mean/std change
quantile shift
null/missing rate
unknown enum rate
population stability index
KL divergence
JS divergence
Wasserstein distance
embedding norm drift
categorical top-K distribution shift
Start with simple distribution checks and missing/stale rates.
15. Population Stability Index
PSI compares distribution bins.
High-level:
PSI = sum((actual% - expected%) * ln(actual% / expected%))
Useful for feature drift dashboards.
But thresholds are heuristic.
Do not blindly alert on PSI without context.
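The formula above can be sketched directly over two pre-binned histograms. This is a minimal illustration: the class name `PopulationStabilityIndex` and the `EPSILON` floor for empty bins are assumptions, and both histograms are assumed to use the same bin edges.

```java
import java.util.Arrays;

/** Hypothetical helper: PSI over two histograms that share the same bins. */
final class PopulationStabilityIndex {
    // Assumed smoothing floor so an empty bin does not cause ln(0) or division by zero.
    private static final double EPSILON = 1e-6;

    static double compute(long[] expectedCounts, long[] actualCounts) {
        if (expectedCounts.length == 0 || expectedCounts.length != actualCounts.length) {
            throw new IllegalArgumentException("Histograms must share the same non-empty bins");
        }
        double expectedTotal = Arrays.stream(expectedCounts).sum();
        double actualTotal = Arrays.stream(actualCounts).sum();
        if (expectedTotal <= 0 || actualTotal <= 0) {
            throw new IllegalArgumentException("Both histograms need at least one observation");
        }
        double psi = 0.0;
        for (int i = 0; i < expectedCounts.length; i++) {
            double expected = Math.max(expectedCounts[i] / expectedTotal, EPSILON);
            double actual = Math.max(actualCounts[i] / actualTotal, EPSILON);
            // (actual% - expected%) * ln(actual% / expected%)
            psi += (actual - expected) * Math.log(actual / expected);
        }
        return psi;
    }
}
```

For example, an expected split of 50/50 against an actual split of 80/20 gives 0.3 * ln(4), roughly 0.416. Whether that value deserves an alert still depends on the feature and the context, as noted above.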
16. Embedding Drift Metrics
For embeddings:
norm distribution
dimension mean/std
zero vector rate
nearest neighbor category mix
cluster occupancy
average pairwise similarity sample
coverage
Embedding drift often appears as retrieval quality drift.
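Two of the cheaper checks from the list, norm distribution and zero vector rate, can be computed over a sample of embeddings. The sketch below is only an illustration; `EmbeddingHealth`, `Summary`, and the zero-norm tolerance are hypothetical names and values.

```java
/** Hypothetical helper: summarizes norm and zero-vector rate for a sample of embeddings. */
final class EmbeddingHealth {

    record Summary(double meanNorm, double zeroVectorRate) {}

    static Summary summarize(float[][] embeddings) {
        if (embeddings.length == 0) {
            throw new IllegalArgumentException("Sample must not be empty");
        }
        double normSum = 0.0;
        int zeroVectors = 0;
        for (float[] vector : embeddings) {
            double squared = 0.0;
            for (float value : vector) {
                squared += (double) value * value;
            }
            double norm = Math.sqrt(squared);
            normSum += norm;
            if (norm < 1e-12) { // assumed tolerance for "effectively zero"
                zeroVectors++;
            }
        }
        return new Summary(normSum / embeddings.length, (double) zeroVectors / embeddings.length);
    }
}
```

Comparing the summary of the current index sample against the summary recorded at training time gives a first signal before the more expensive nearest-neighbor checks are run.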
17. Prediction Drift Metrics
Track:
score mean/p50/p95/p99
entropy of ranking scores
top-K score gap
prediction bucket distribution
task prediction distribution
score component contribution
If the score distribution compresses, the ranker may be losing its ability to discriminate between candidates.
If the top scores spike, a feature bug or a calibration issue is a likely cause.
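The percentile and top-K gap metrics listed above can be sketched as follows. The class `ScoreDistribution` is hypothetical, and the nearest-rank percentile definition is one assumed choice among several common ones.

```java
import java.util.Arrays;

/** Hypothetical helper for score distribution summaries. */
final class ScoreDistribution {

    /** Nearest-rank percentile; p is in (0, 100]. */
    static double percentile(double[] scores, double p) {
        if (scores.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("Need scores and 0 < p <= 100");
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Gap between the best score and the k-th best score of one slate. */
    static double topKGap(double[] scores, int k) {
        if (k < 1 || k > scores.length) {
            throw new IllegalArgumentException("k must be between 1 and the slate size");
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted); // ascending, so the best score is last
        return sorted[sorted.length - 1] - sorted[sorted.length - k];
    }
}
```

A shrinking gap between p50 and p95, or a top-K gap that trends toward zero, is one way the compression described above shows up in numbers.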
18. Calibration Monitoring
Process:
- Group predictions into buckets.
- Wait for reward maturity.
- Compare predicted vs observed rate.
- Compute ECE/Brier/logloss by window.
- Slice by segment/model version.
Metrics:
ECE
Brier
logloss
bucket observed/predicted gap
Calibration alerts need delayed labels.
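The bucket process above maps to a small amount of code once mature labels are joined to predictions. The sketch below assumes equal-width probability buckets and binary labels; `CalibrationMetrics` and its method names are hypothetical.

```java
/** Hypothetical helper: ECE and Brier score over predictions with mature binary labels. */
final class CalibrationMetrics {

    /** Expected calibration error using equal-width probability buckets. */
    static double expectedCalibrationError(double[] predictions, int[] labels, int buckets) {
        if (predictions.length == 0 || predictions.length != labels.length || buckets < 1) {
            throw new IllegalArgumentException("Need matching non-empty inputs and buckets >= 1");
        }
        double[] predictionSum = new double[buckets];
        double[] labelSum = new double[buckets];
        long[] count = new long[buckets];
        for (int i = 0; i < predictions.length; i++) {
            if (predictions[i] < 0.0 || predictions[i] > 1.0) {
                throw new IllegalArgumentException("Predictions must be probabilities");
            }
            int bucket = Math.min((int) (predictions[i] * buckets), buckets - 1);
            predictionSum[bucket] += predictions[i];
            labelSum[bucket] += labels[i];
            count[bucket]++;
        }
        double ece = 0.0;
        for (int b = 0; b < buckets; b++) {
            if (count[b] == 0) {
                continue;
            }
            double averagePrediction = predictionSum[b] / count[b];
            double observedRate = labelSum[b] / count[b];
            // Weight each bucket's gap by its share of all predictions.
            ece += ((double) count[b] / predictions.length) * Math.abs(observedRate - averagePrediction);
        }
        return ece;
    }

    /** Mean squared difference between predicted probability and outcome. */
    static double brierScore(double[] predictions, int[] labels) {
        if (predictions.length == 0 || predictions.length != labels.length) {
            throw new IllegalArgumentException("Need matching non-empty inputs");
        }
        double sum = 0.0;
        for (int i = 0; i < predictions.length; i++) {
            double error = predictions[i] - labels[i];
            sum += error * error;
        }
        return sum / predictions.length;
    }
}
```

In practice both values would be computed per window, model version, and segment, and only over impressions whose label window has closed.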
19. Outcome Quality Monitoring
Track mature outcomes:
CTR
CVR
purchase
hide/report
retention
return/refund
case success
rework
SLA
Slice them by model version and segment.
Use control group/holdout when available.
Raw metric changes may be confounded by traffic mix.
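To illustrate the traffic-mix confounding mentioned above, the sketch below reweights current per-segment rates to the baseline traffic shares. This is a simple standardization, offered as an assumption about how one might separate mix shift from a real rate change; `MixAdjustedRate` and `SegmentStats` are hypothetical names.

```java
import java.util.Map;

/** Hypothetical helper: separates traffic-mix shift from per-segment rate change. */
final class MixAdjustedRate {

    record SegmentStats(long impressions, long clicks) {
        double rate() {
            return impressions == 0 ? 0.0 : (double) clicks / impressions;
        }
    }

    /** Plain CTR: total clicks divided by total impressions. */
    static double rawRate(Map<String, SegmentStats> segments) {
        long impressions = segments.values().stream().mapToLong(SegmentStats::impressions).sum();
        long clicks = segments.values().stream().mapToLong(SegmentStats::clicks).sum();
        return impressions == 0 ? 0.0 : (double) clicks / impressions;
    }

    /** Current per-segment rates weighted by the baseline traffic shares. */
    static double adjustedRate(Map<String, SegmentStats> baseline, Map<String, SegmentStats> current) {
        long baselineTotal = baseline.values().stream().mapToLong(SegmentStats::impressions).sum();
        if (baselineTotal == 0) {
            throw new IllegalArgumentException("Baseline has no impressions");
        }
        double adjusted = 0.0;
        for (Map.Entry<String, SegmentStats> entry : baseline.entrySet()) {
            SegmentStats now = current.get(entry.getKey());
            if (now == null) {
                throw new IllegalArgumentException("Missing segment in current data: " + entry.getKey());
            }
            double baselineShare = (double) entry.getValue().impressions() / baselineTotal;
            adjusted += baselineShare * now.rate();
        }
        return adjusted;
    }
}
```

If returning users click at 10% and new users at 5% in both periods, but the new-user share grows from 20% to 50%, the raw CTR falls from 9% to 7.5% while the adjusted rate stays at 9%. That pattern points to a mix change rather than a model regression.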
20. Proxy vs Long-Term Monitoring
Monitor both:
Fast Proxy
click
dwell
session continuation
hide/report
Long-Term
retention
repeat purchase
return/refund
case resolution
trust signals
Fast proxy alerts detect urgent issues.
Long-term metrics validate true value.
21. Candidate Quality Monitoring
Metrics:
candidate recall proxy
candidate count
empty candidate pool
source contribution
eligible rate
invalid rate
dedup rate
source overlap
new item share
long-tail share
If the candidate count drops, recommendation quality can drop no matter how good the ranker is.
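Source contribution drift can be summarized as a single number by comparing the source mix of two windows. The sketch below uses total variation distance, which is an assumed choice for illustration; `SourceMixShift` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: total variation distance between two candidate source mixes. */
final class SourceMixShift {

    static double totalVariation(Map<String, Long> baselineCounts, Map<String, Long> currentCounts) {
        double baselineTotal = baselineCounts.values().stream().mapToLong(Long::longValue).sum();
        double currentTotal = currentCounts.values().stream().mapToLong(Long::longValue).sum();
        if (baselineTotal <= 0 || currentTotal <= 0) {
            throw new IllegalArgumentException("Both windows need at least one candidate");
        }
        // Use the union of sources so a source that appears or disappears is counted.
        Set<String> sources = new HashSet<>(baselineCounts.keySet());
        sources.addAll(currentCounts.keySet());
        double distance = 0.0;
        for (String source : sources) {
            double baselineShare = baselineCounts.getOrDefault(source, 0L) / baselineTotal;
            double currentShare = currentCounts.getOrDefault(source, 0L) / currentTotal;
            distance += Math.abs(baselineShare - currentShare);
        }
        return distance / 2.0; // 0 = identical mix, 1 = completely disjoint sources
    }
}
```

A value of 0.3 means 30% of the candidate mass moved between sources, for instance when a trending source starts to dominate.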
22. Feature Importance-Aware Monitoring
Features with high model importance need tighter monitoring.
Example:
item_ctr_7d
user_category_affinity
source_two_tower_score
item_quality_score
seen_count
If a critical feature goes missing, raise the alert at a higher severity.
The model registry can provide feature importance for this purpose.
23. Model Version Monitoring
Always slice by model version.
Metrics:
traffic share
latency
error rate
score distribution
prediction distribution
feature missing
online outcomes
calibration
fallback rate
If multiple models are active, global metrics can hide an issue that affects only one of them.
24. Drift vs Incident
Drift can be a natural, gradual shift.
An incident is a sudden, unexpected change.
Examples:
Drift
holiday season increases gift category interest
Incident
category feature becomes null after deploy
The response differs for each.
Monitoring should help distinguish sudden changes from gradual ones.
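A rough heuristic for telling sudden from gradual change is to compare the largest single step in a metric series with the total range it moved. The sketch below is an assumption for illustration, not an established detector; `ChangeShape`, `minRange`, and `suddenShare` are hypothetical names and parameters that would need tuning.

```java
/** Hypothetical heuristic: classifies a metric series as stable, gradual, or sudden. */
final class ChangeShape {

    enum Shape { STABLE, GRADUAL, SUDDEN }

    /**
     * @param series      metric values in time order, e.g. hourly null rate of a feature
     * @param minRange    range below which the series is considered stable
     * @param suddenShare share of the range that one step must cover to count as sudden
     */
    static Shape classify(double[] series, double minRange, double suddenShare) {
        if (series.length < 2) {
            throw new IllegalArgumentException("Need at least two points");
        }
        double min = series[0];
        double max = series[0];
        double largestStep = 0.0;
        for (int i = 1; i < series.length; i++) {
            min = Math.min(min, series[i]);
            max = Math.max(max, series[i]);
            largestStep = Math.max(largestStep, Math.abs(series[i] - series[i - 1]));
        }
        double range = max - min;
        if (range < minRange) {
            return Shape.STABLE;
        }
        return largestStep / range >= suddenShare ? Shape.SUDDEN : Shape.GRADUAL;
    }
}
```

A null rate that jumps from 1% to 40% within one step is classified as sudden and points toward an incident such as a deploy. A rate that climbs by two points per step over several steps is classified as gradual.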
25. Drift Response Playbook
When drift is detected:
- Identify feature/model/source/segment.
- Check recent changes.
- Validate data pipeline.
- Compare control/previous version.
- Check outcome metrics.
- Decide: ignore, monitor, recalibrate, retrain, rollback, fix pipeline, adjust policy.
- Document.
Not every drift requires retraining.
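The decision step of the playbook can be made explicit as a mapping from diagnosed root cause to default response. The sketch below condenses the responses named in this part into a lookup; the enum values and the `DriftPlaybook` class are hypothetical, and a real playbook would leave room for human judgment.

```java
/** Hypothetical lookup: default response per diagnosed root cause. */
final class DriftPlaybook {

    enum RootCause {
        FEATURE_PIPELINE_BUG,
        PROBABILITIES_OFF_RANKING_OK,
        BAD_MODEL_DEPLOY,
        MATERIAL_DISTRIBUTION_SHIFT,
        EXPECTED_POLICY_CHANGE,
        UNKNOWN
    }

    enum Action { FIX_PIPELINE, RECALIBRATE, ROLLBACK, RETRAIN, MONITOR, INVESTIGATE }

    static Action recommend(RootCause cause) {
        return switch (cause) {
            // Rolling back the model does not repair broken features.
            case FEATURE_PIPELINE_BUG -> Action.FIX_PIPELINE;
            // Ranking order is fine, only the probability mapping is off.
            case PROBABILITIES_OFF_RANKING_OK -> Action.RECALIBRATE;
            case BAD_MODEL_DEPLOY -> Action.ROLLBACK;
            case MATERIAL_DISTRIBUTION_SHIFT -> Action.RETRAIN;
            // Drift that was annotated in deployment notes only needs watching.
            case EXPECTED_POLICY_CHANGE -> Action.MONITOR;
            case UNKNOWN -> Action.INVESTIGATE;
        };
    }
}
```

Keeping the mapping in code or config makes the default response reviewable and lets the alert payload carry a suggested action.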
26. Retraining Triggers
Trigger retrain when:
feature distribution shifted materially
calibration degraded
online metrics degrade
new category/region launched
catalog distribution changed
candidate source changed
label distribution changed
seasonality
model age exceeds threshold
Retraining should still pass validation gates.
27. Recalibration Triggers
Recalibrate when:
ranking order still okay but probabilities off
utility composition unstable
calibration ECE worsens
business thresholds misfire
Recalibration is cheaper than full retrain but only fixes probability mapping, not representation.
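As one illustration of fixing only the probability mapping, the sketch below fits a single logit offset so that the mean recalibrated prediction matches the observed rate. This intercept-only correction is an assumed, deliberately simple technique (it is not Platt scaling or isotonic regression); it is monotonic, so ranking order is preserved. `InterceptRecalibrator` is a hypothetical name.

```java
/** Hypothetical helper: intercept-only recalibration in logit space. */
final class InterceptRecalibrator {
    private static final double EPS = 1e-9; // keeps logit finite for predictions at 0 or 1

    /** Finds the logit offset that makes the mean shifted prediction equal the observed rate. */
    static double fitOffset(double[] predictions, double observedRate) {
        if (predictions.length == 0 || observedRate <= 0.0 || observedRate >= 1.0) {
            throw new IllegalArgumentException("Need predictions and an observed rate in (0, 1)");
        }
        double low = -20.0;
        double high = 20.0;
        // Bisection works because the mean shifted prediction increases with the offset.
        for (int i = 0; i < 100; i++) {
            double mid = (low + high) / 2.0;
            if (meanAfterShift(predictions, mid) < observedRate) {
                low = mid;
            } else {
                high = mid;
            }
        }
        return (low + high) / 2.0;
    }

    /** Applies the offset to one prediction. */
    static double apply(double prediction, double offset) {
        double p = Math.min(Math.max(prediction, EPS), 1.0 - EPS);
        double logit = Math.log(p / (1.0 - p));
        return 1.0 / (1.0 + Math.exp(-(logit + offset)));
    }

    private static double meanAfterShift(double[] predictions, double offset) {
        double sum = 0.0;
        for (double p : predictions) {
            sum += apply(p, offset);
        }
        return sum / predictions.length;
    }
}
```

If the model predicts 10% click everywhere and the observed rate is now 6%, the fitted offset is negative and `apply(0.10, offset)` returns roughly 0.06. When bucket-level gaps differ in sign across the score range, a single offset is not enough and a richer calibrator or a retrain is needed.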
28. Rollback Triggers
Rollback when:
model deploy causes score distribution anomaly
guardrail breach
latency spike
policy violation
severe segment regression
feature incompatibility
calibration severely broken
Rollback should be fast, done by switching the model route back to the previous version.
If the root cause is in the feature pipeline, rolling back the model may not help.
29. Feature Pipeline Fix vs Model Fix
If the root cause is a feature pipeline bug:
fix feature pipeline
backfill if needed
recompute affected features
possibly retrain
If the model overfits:
retrain with better data/objective
If a candidate source changed:
adjust candidate policy/ranker training distribution
Choose the fix based on the root cause.
30. Drift Caused by Policy Changes
Business rules and reranking policies can change distribution.
Example:
new diversity policy increases long-tail exposure
In that case, feature and outcome drift may be expected.
Deployment notes should annotate the expected drift.
Monitoring should compare against the experiment's control group.
31. Drift Caused by Experiment
Treatment changes feedback distribution.
If the training pipeline uses experiment data, record the variant with each example.
Questions:
Should treatment data be included?
Should model train with experiment feature?
Is treatment causing distribution shift?
Experiment-aware training avoids contamination.
32. Data Poisoning and Abuse
Bad actors can manipulate feedback.
Examples:
bot clicks
fake reviews
creator spamming metadata
coordinated engagement
seller gaming recommendation
Monitoring:
abnormal engagement bursts
creator/item anomaly
bot traffic
review/fraud signals
source-specific spike
Model quality monitoring should include abuse signals.
33. Monitoring New Item Quality
Metrics:
time_to_first_impression
new_item_embedding_coverage
new_item_exploration_reward
new_item_negative_rate
new_item_conversion
new_item_dropoff
If new items never get exposure, the system becomes stale.
If new items get too much exposure, quality may drop.
34. Monitoring User-Level Experience
Aggregate metrics can hide individual fatigue.
Track:
repeat rate per user
category concentration per user
hide/report per user
session abandonment
recommendation reset
do_not_personalize
Use privacy-aware aggregation.
35. Monitoring Marketplace Health
Metrics:
creator exposure concentration
seller revenue concentration
qualified exposure share
new creator time to first exposure
long-tail conversion
supply churn
Model quality includes ecosystem health for marketplaces.
36. Monitoring Enterprise Quality
Enterprise metrics:
recommended action acceptance
action completion
case resolution
SLA improvement
rework
supervisor override
audit issue
document helpfulness
tenant-specific quality
Long outcome windows matter.
Monitor by tenant and workflow type.
37. Alert Design
An alert should include:
what changed
where
since when
affected segment
current value
baseline
likely owner
dashboard link
runbook
Bad alert:
model drift high
Good alert:
home_feed home_ranker_v13 p_click p95 shifted +45% in ID mobile since 09:00 after feature_set v19 deploy
38. Alert Thresholds
Threshold options:
- static thresholds,
- relative change,
- anomaly detection,
- control chart,
- seasonality-aware threshold.
Start simple:
critical feature missing > 5%
fallback rate > 10%
empty slate > 1%
ECE > threshold after maturity
Then mature.
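The starting thresholds above can be evaluated with a very small static check. In this sketch the three limits come from the list above, while the metric key names and the class `StaticThresholdCheck` are assumptions; the ECE threshold is left out because it needs mature labels and a chosen value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: static threshold check for a handful of health metrics. */
final class StaticThresholdCheck {
    // LinkedHashMap keeps a stable order so breach lists are deterministic.
    private static final Map<String, Double> LIMITS = new LinkedHashMap<>();

    static {
        LIMITS.put("critical_feature_missing_rate", 0.05);
        LIMITS.put("fallback_rate", 0.10);
        LIMITS.put("empty_slate_rate", 0.01);
    }

    /** Returns the names of all metrics whose observed value exceeds its limit. */
    static List<String> breaches(Map<String, Double> observed) {
        List<String> breached = new ArrayList<>();
        for (Map.Entry<String, Double> limit : LIMITS.entrySet()) {
            Double value = observed.get(limit.getKey());
            if (value != null && value > limit.getValue()) {
                breached.add(limit.getKey());
            }
        }
        return breached;
    }
}
```

Metrics that are absent from the observed map are skipped here; in a real system a missing metric is itself worth an alert.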
39. False Positive Management
Too many false alerts cause alert fatigue.
Reduce noise by:
- segment priority,
- duration windows,
- combining signals,
- severity levels,
- ownership,
- suppress expected deployment windows,
- annotate experiments.
But do not suppress safety alerts.
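A duration window can be sketched as a gate that fires only after several consecutive breaches, while letting safety signals through immediately. `SustainedBreachGate` and its parameters are hypothetical; the number of required evaluations is something to tune per signal.

```java
/** Hypothetical helper: suppresses short blips but never delays safety alerts. */
final class SustainedBreachGate {
    private final int requiredConsecutive;
    private int streak = 0;

    SustainedBreachGate(int requiredConsecutive) {
        if (requiredConsecutive < 1) {
            throw new IllegalArgumentException("requiredConsecutive must be >= 1");
        }
        this.requiredConsecutive = requiredConsecutive;
    }

    /**
     * @param breached     whether the threshold was exceeded in this evaluation window
     * @param safetySignal whether the signal is safety-related and must not be suppressed
     * @return true if an alert should fire now
     */
    boolean observe(boolean breached, boolean safetySignal) {
        if (!breached) {
            streak = 0; // any healthy window resets the streak
            return false;
        }
        streak++;
        return safetySignal || streak >= requiredConsecutive;
    }
}
```

With a requirement of three windows, two isolated breaches stay silent and the third consecutive one fires, while a safety breach fires on its first observation.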
40. Drift Dashboard
Dashboard sections:
active model versions
traffic share
feature drift
prediction drift
calibration
outcome metrics
candidate source distribution
fallback/empty rates
segment health
recent changes
alerts/incidents
The dashboard should support diagnosis, not just charts.
41. Model Quality Report
Periodic report:
model version
age
training data window
online performance
calibration
drift summary
segment regressions
feature health
candidate source health
recommendation: keep/retrain/recalibrate/investigate
Run it daily or weekly, depending on the system.
42. Drift and Governance
Governance decisions:
Who decides retrain?
Who approves recalibration?
Who owns feature drift?
When is rollback mandatory?
What guardrails are non-negotiable?
Model quality monitoring must connect to owners and process.
43. Common Failure Modes
43.1 Only Monitor CTR
Feature/model degradation missed.
43.2 No Segment Drift
Minority segment broken.
43.3 No Feature Drift
Pipeline bug becomes model issue.
43.4 No Candidate Drift
Retrieval change blamed on ranker.
43.5 Calibration Drift Ignored
Utility composition wrong.
43.6 Alerts Without Owners
No action.
43.7 Retrain on Bad Data
Drift caused by event bug.
43.8 Rollback Wrong Layer
Model rollback does not fix feature bug.
43.9 No Experiment Awareness
Treatment data contaminates baseline.
43.10 Long-Term Metrics Ignored
Short-term lift hides trust loss.
44. Implementation Sketch: Drift Metric
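import java.util.Map;

// Distribution is a placeholder type for a binned or summarized distribution; it is not defined in this sketch.
public interface DriftMetric {
    String name();

    // Compares the current distribution against the baseline and returns a scored result.
    DriftResult compute(Distribution baseline, Distribution current);
}

public record DriftResult(
    String metricName,
    double value,
    DriftSeverity severity,
    Map<String, Object> diagnostics
) {}

public enum DriftSeverity {
    OK,
    WARN,
    CRITICAL
}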
Feature/model monitoring can plug in multiple drift metrics.
45. Implementation Sketch: Feature Drift Monitor
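import java.util.List;

public final class FeatureDriftMonitor {
    private final List<DriftMetric> metrics;

    // The final field needs a constructor; copy the list so the metric set stays fixed.
    public FeatureDriftMonitor(List<DriftMetric> metrics) {
        this.metrics = List.copyOf(metrics);
    }

    // featureName is not used in the computation; it identifies the feature when results are stored.
    public List<DriftResult> evaluate(
        String featureName,
        Distribution trainingDistribution,
        Distribution servingDistribution
    ) {
        return metrics.stream()
            .map(metric -> metric.compute(trainingDistribution, servingDistribution))
            .toList();
    }
}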
Store results by feature/model/segment/window.
46. Implementation Sketch: Calibration Bucket
public record CalibrationBucket(
    double lowerBound,
    double upperBound,
    long predictionCount,
    double averagePrediction,
    double observedRate
) {
    // Positive gap: the model under-predicts; negative gap: it over-predicts.
    public double gap() {
        return observedRate - averagePrediction;
    }
}
Calibration monitoring needs mature labels.
47. Implementation Sketch: Drift Alert
import java.time.Instant;

public record ModelQualityAlert(
    String alertId,
    String modelVersion,
    String surface,
    String segment,
    String signal,
    double currentValue,
    double baselineValue,
    DriftSeverity severity,
    Instant detectedAt,
    String owner,
    String runbookUrl
) {}
Alerts should be actionable.
48. Minimal Production Model Quality Monitoring Plan
Start with:
model_quality:
  slice_by:
    - model_version
    - surface
    - region
    - user_segment
  feature_monitoring:
    - missing_rate
    - stale_rate
    - distribution_shift_top_features
  prediction_monitoring:
    - score_distribution
    - task_prediction_distribution
  outcome_monitoring:
    - ctr
    - hide_rate
    - report_rate
    - conversion_if_available
  calibration:
    delayed_bucket_report: true
  candidate_monitoring:
    - candidate_count
    - source_distribution
    - empty_pool_rate
  triggers:
    - feature_bug_investigation
    - recalibration
    - retraining
    - rollback
Add long-term metrics and sophisticated drift methods after basics are reliable.
49. Checklist: Model Quality Monitoring and Drift Readiness
[ ] Metrics are sliceable by model version.
[ ] Feature missing/stale/distribution drift is monitored.
[ ] Top model features have tighter monitoring.
[ ] Prediction score distributions are monitored.
[ ] Calibration is monitored with mature labels.
[ ] Outcome metrics are monitored by segment.
[ ] Candidate source/pool drift is monitored.
[ ] Embedding/index drift is monitored.
[ ] Segment drift monitoring is mandatory.
[ ] Long-term metrics are included where available.
[ ] Alerts have owners and runbooks.
[ ] Drift response playbook exists.
[ ] Retraining triggers are defined.
[ ] Recalibration triggers are defined.
[ ] Rollback triggers are defined.
[ ] Experiment/treatment data is accounted for.
[ ] Data quality incidents block retraining.
[ ] Model quality reports are generated periodically.
50. Conclusion
Model quality monitoring and drift detection keep a recommendation system healthy after deployment.
Key principles:
- Model quality is a moving target.
- Monitor input, output, outcome, decision, and feedback loop drift.
- Feature drift often explains model quality drift.
- Candidate drift can break ranking without changing ranker.
- Calibration drift is critical for utility composition.
- Segment monitoring prevents global averages from hiding harm.
- Drift can be natural, incident-driven, or system-induced.
- Not every drift requires retraining; choose response based on root cause.
- Alerts must be actionable with owners/runbooks.
- Long-term metrics are needed to catch proxy optimization damage.
This part closes Module 8: Evaluation, Experimentation, and Observability.
In Part 069, we move into Module 9: Governance, Safety, Security, and Enterprise Constraints, starting with Privacy, Consent, and Data Minimization.