title: Build From Scratch Recommendations System - Part 062
description: "Designing fault tolerance and graceful degradation for a production-grade recommendation platform: failure taxonomy, fail-open vs fail-closed, fallback hierarchy, circuit breaker, bulkhead, timeout, safe defaults, model fallback, policy failure, event logging degradation, incident response, and resilience testing."
series: learn-build-from-scratch-recommendations-system
seriesTitle: "Build From Scratch: Enterprise Recommendations System"
order: 62
partTitle: Fault Tolerance and Graceful Degradation
tags:
- recommendation-system
- recsys
- fault-tolerance
- graceful-degradation
- resilience
- system-design
- series
date: 2026-07-02
Part 062 — Fault Tolerance and Graceful Degradation
A production recommendation platform will experience failures.
Candidate source timeout.
Feature store stale.
Vector index unavailable.
Ranking model fails to load.
Policy service slow.
Event logging backlog.
Experiment config wrong.
Cache cold.
User profile missing.
Embedding pipeline delayed.
ANN index contains invalid items.
New model produces strange scores.
Tenant policy changes while a request is in flight.
A production-grade system must not simply "error".
It must know:
which failures may degrade?
which failures must fail closed?
which fallback is safe?
what must be logged?
when to alert?
how to roll back?
what is the user impact?
This part covers fault tolerance and graceful degradation for a recommendation system: failure taxonomy, fail-open vs fail-closed, fallback hierarchy, circuit breakers, bulkheads, safe defaults, model fallback, policy failure, event logging degradation, incident response, resilience testing, and operational readiness.
1. Mental Model: Recommendation Must Fail Safely
The goal is not "never fail".
The goal:
when something fails, the system remains safe, observable, and as useful as possible
Failure response should be intentional.
Examples:
- optional candidate source fails -> continue with other sources,
- ranker timeout -> fallback ranker,
- user profile missing -> contextual recommendations,
- policy service unavailable -> fail closed for critical checks,
- event logging degraded -> buffer and alert,
- model artifact invalid -> keep previous model.
Safety first, usefulness second.
2. Failure Taxonomy
Common failures:
dependency timeout
dependency error
dependency stale data
partial response
empty response
schema mismatch
model load failure
model inference failure
feature missing
cache outage
data freshness violation
config error
policy conflict
index unavailable
event logging failure
latency overload
security/tenant failure
Classify by:
- severity,
- user impact,
- safety risk,
- fallback availability,
- recoverability.
3. Fail-Open vs Fail-Closed
Fail-Open
Continue despite failure.
Use when failure is low-risk.
Example:
optional editorial source unavailable -> skip it
minor freshness boost missing -> ignore boost
Fail-Closed
Do not proceed / reject.
Use when a failure could violate safety, privacy, access control, legal requirements, or tenant boundaries.
Example:
permission check unavailable -> do not recommend restricted document
consent unknown -> do not use personalized profile
policy banned state unknown for high-risk item -> reject or safe fallback
Choose per dependency/rule.
4. Failure Policy Matrix
Example:
| Component | Failure | Behavior |
|---|---|---|
| Two-tower source | timeout | skip source |
| Trending source | timeout | use cached trending |
| Feature store | partial missing | defaults + indicators |
| Ranker | timeout | fallback ranker |
| Policy access | unavailable | fail closed |
| Consent service | unavailable | non-personalized fallback |
| Event logger | unavailable | buffer + alert |
| Model registry | unavailable | use last loaded model |
| Index service | unavailable | use other sources/fallback |
| Suppression store | unavailable | conservative fallback or fail safe |
Document this matrix.
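One way to keep the matrix enforceable rather than purely documentary is to encode it as a lookup table that serving code consults. The following is a minimal sketch, not a definitive implementation: the names `FailurePolicyMatrix` and `Behavior`, and the `component:failure` key format, are hypothetical. Treating undocumented pairs as fail-closed is an assumption made here for safety; a real system might instead reject such a config at load time.

```java
import java.util.Map;

// Hypothetical behaviors, mirroring the "Behavior" column of the matrix.
enum Behavior {
    SKIP_SOURCE, USE_CACHED, USE_DEFAULTS, FALLBACK_RANKER,
    FAIL_CLOSED, NON_PERSONALIZED, BUFFER_AND_ALERT, USE_LAST_LOADED
}

final class FailurePolicyMatrix {
    // Key format "component:failure" is an assumption made for this sketch.
    private static final Map<String, Behavior> MATRIX = Map.of(
        "two_tower_source:timeout", Behavior.SKIP_SOURCE,
        "trending_source:timeout", Behavior.USE_CACHED,
        "feature_store:partial_missing", Behavior.USE_DEFAULTS,
        "ranker:timeout", Behavior.FALLBACK_RANKER,
        "policy_access:unavailable", Behavior.FAIL_CLOSED,
        "consent_service:unavailable", Behavior.NON_PERSONALIZED,
        "event_logger:unavailable", Behavior.BUFFER_AND_ALERT,
        "model_registry:unavailable", Behavior.USE_LAST_LOADED
    );

    // Any pair that nobody documented is treated as fail-closed (assumed default).
    static Behavior behaviorFor(String component, String failure) {
        return MATRIX.getOrDefault(component + ":" + failure, Behavior.FAIL_CLOSED);
    }
}
```

With this shape, `FailurePolicyMatrix.behaviorFor("ranker", "timeout")` resolves to `FALLBACK_RANKER`, while an unlisted pair resolves to `FAIL_CLOSED`.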
5. Fallback Hierarchy
Fallback should be layered.
Example personalized home:
full personalized pipeline
-> personalized candidates + fallback ranker
-> precomputed personalized list + final check
-> non-personalized regional trending + final check
-> editorial safe fallback
-> empty safe response
Each fallback must still respect critical policy.
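The layered hierarchy can be sketched as a loop over tiers, where every tier's output passes the same final check before it is served. This is an illustrative sketch under assumptions: `Tier`, `Slate`, and `FallbackChain` are hypothetical names, items are plain string IDs, and the final eligibility check is modeled as a simple predicate.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// One tier of the hierarchy: a name for logging and a supplier of item IDs.
record Tier(String name, Supplier<List<String>> source) {}

// The result carries the tier that actually served, for observability.
record Slate(String servedTier, List<String> items) {}

final class FallbackChain {
    static Slate serve(List<Tier> tiers, Predicate<String> finalCheck) {
        for (Tier tier : tiers) {
            List<String> items;
            try {
                items = tier.source().get();
            } catch (RuntimeException ex) {
                continue; // this tier failed; try the next one
            }
            // Critical policy is re-applied to every tier, including fallbacks.
            List<String> eligible = items.stream().filter(finalCheck).toList();
            if (!eligible.isEmpty()) {
                return new Slate(tier.name(), eligible);
            }
        }
        // No tier produced an eligible item: empty is the safe answer.
        return new Slate("safe_empty", List.of());
    }
}
```

In this sketch an exception thrown by the final check itself propagates to the caller, so an unavailable policy check is never silently skipped; the caller would map that case to the safe empty response.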
6. Candidate Source Degradation
Candidate sources should be classified by criticality, for example optional, preferred, required, or fallback-only.
Example:
candidate_sources:
  two_tower:
    criticality: preferred
    timeout_ms: 50
    fallback: none
  trending:
    criticality: fallback
    timeout_ms: 15
  editorial:
    criticality: fallback
    timeout_ms: 10
If personalized source fails, use non-personalized sources.
Candidate diversity may drop, but response continues.
7. Ranking Degradation
Ranking failure options:
fallback model
source score ranker
precomputed order
popularity order
editorial order
safe empty
Fallback ranker should be simple and local if possible.
Example fallback score:
score =
    source_priority
  + normalized_source_rank
  + item_quality
  - seen_penalty
Still apply eligibility and final validation.
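The fallback score above can be computed locally with no model call. A minimal sketch, assuming all inputs are already normalized to [0, 1], that a higher `normalizedSourceRank` means a better position in the source, and that the seen penalty is a fixed constant; the names `FallbackCandidate` and `FallbackRanker` and the value 0.5 are hypothetical.

```java
import java.util.List;

// Hypothetical candidate shape; inputs are assumed to be pre-normalized to [0, 1].
record FallbackCandidate(String itemId, double sourcePriority,
                         double normalizedSourceRank, double itemQuality, boolean seen) {}

final class FallbackRanker {
    private static final double SEEN_PENALTY = 0.5; // assumed value, tune per surface

    // score = source_priority + normalized_source_rank + item_quality - seen_penalty
    static double score(FallbackCandidate c) {
        return c.sourcePriority()
             + c.normalizedSourceRank()
             + c.itemQuality()
             - (c.seen() ? SEEN_PENALTY : 0.0);
    }

    // Highest score first; eligibility and final validation still run afterwards.
    static List<FallbackCandidate> rank(List<FallbackCandidate> candidates) {
        return candidates.stream()
            .sorted((a, b) -> Double.compare(score(b), score(a)))
            .toList();
    }
}
```

Because the ranker depends only on fields already attached to each candidate, it keeps working when the feature store or model service is the component that failed.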
8. Feature Store Degradation
Feature failures:
- missing feature,
- stale feature,
- feature store timeout,
- partial batch failure.
Policy per feature group:
user_long_term_profile:
  failure: use_default_contextual
item_quality:
  failure: fallback_ranker_or_reject_low_confidence
policy_state:
  failure: fail_closed
session_embedding:
  failure: use_long_term_profile
Feature criticality matters.
9. Safe Defaults
Defaults should be safe.
Examples:
unknown p_report risk -> conservative
unknown personalization consent -> non-personalized
unknown item quality -> do not boost
unknown availability for checkout -> reject
unknown user affinity -> zero with missing indicator
Unsafe default:
unknown policy state -> allow
For critical fields, unknown should not mean allowed.
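A common source of unsafe defaults is a boolean that silently collapses "unknown" into `true` or `false`. The sketch below models unknown as its own state so each decision has to handle it explicitly. The names `Known` and `SafeDefaults` are hypothetical, and the two-element feature layout (value, missing indicator) is an assumption for illustration.

```java
import java.util.OptionalDouble;

// Tri-state instead of boolean: "unknown" is a distinct value.
enum Known { YES, NO, UNKNOWN }

final class SafeDefaults {
    // Unknown consent is treated like "no": serve non-personalized.
    static boolean mayPersonalize(Known consent) {
        return consent == Known.YES;
    }

    // Unknown policy state never means allowed.
    static boolean mayServe(Known policyAllowed) {
        return policyAllowed == Known.YES;
    }

    // Unknown affinity becomes zero plus a missing indicator,
    // so a ranker can tell "zero" apart from "absent".
    static double[] affinityFeature(OptionalDouble affinity) {
        return affinity.isPresent()
            ? new double[] {affinity.getAsDouble(), 0.0}
            : new double[] {0.0, 1.0};
    }
}
```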
10. Policy Service Failure
Policy/access/consent failures are high risk.
If policy state unavailable:
- for low-risk public content, maybe use cached last known policy with short TTL,
- for restricted/enterprise data, fail closed,
- for sponsored/legal, fail closed or safe fallback.
Final response should never bypass critical policy because of availability pressure.
11. Consent Failure
If consent service unavailable:
do not use personalized behavioral features
Fallback:
contextual/non-personalized recommendations
Do not assume consent allowed.
Privacy failure should be conservative.
12. Suppression Store Failure
Suppression includes user blocks/hides.
If suppression unavailable:
- for ordinary home feed, maybe use cached suppression if fresh,
- if no suppression available, conservative fallback may be needed,
- for legally required opt-out/block, fail closed or non-personalized safe fallback.
User explicit controls are trust-critical.
13. Inventory/Availability Failure
For e-commerce checkout/PDP:
unknown stock -> reject or final availability check
For discovery home feed:
unknown stock -> may allow if the PDP performs a final availability check, but downranking or filtering is better
Surface matters.
The closer to purchase, the stricter availability must be.
14. Vector Index Failure
If ANN index unavailable:
- use other candidate sources,
- use cached similar items,
- use batch/precomputed candidates,
- use trending/editorial fallback,
- reduce personalization.
Index failure should not crash entire recsys if alternative sources exist.
But retrieval quality may drop.
15. Embedding Missing
If query/user embedding missing:
- use session/user profile alternatives,
- use non-vector sources,
- use content/popularity,
- fallback to contextual.
If item embedding missing:
- content/metadata candidate source,
- cold-start source,
- no vector retrieval for that item until delta pipeline catches up.
Missing embedding should be monitored.
16. Model Artifact Failure
Failures:
- model registry unavailable,
- artifact download fails,
- checksum mismatch,
- model load exception,
- warmup fails,
- runtime incompatible.
Behavior:
- keep last known good model,
- do not switch route,
- alert,
- block promotion,
- rollback if already active.
Serving should never load an unvalidated model.
17. Model Inference Failure
If inference fails at request time:
- retry only if safe within deadline,
- fallback model,
- source score ranker,
- precomputed order,
- safe fallback list.
Log model version and error.
If the failure rate spikes, open the circuit breaker for that model and serve from the fallback.
18. Calibration/Utility Policy Missing
If calibration artifact missing:
- do not compose expected utility using raw scores,
- fallback to previous compatible bundle,
- or use ranking raw order if explicitly validated,
- alert.
If utility policy missing:
- use last known good policy,
- fallback route,
- do not invent weights.
Score semantics matter.
19. Event Logging Failure
Event logging failures do not usually block response, but they are serious.
If decision/impression logging fails:
- buffer locally if possible,
- retry,
- send to dead-letter queue,
- alert,
- mark data quality incident if prolonged.
If events are lost, training and evaluation data are degraded.
For high-compliance enterprise decisions, logging failure may need stricter behavior.
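The buffer, retry, and dead-letter steps above can be sketched with a bounded in-memory queue. This is a simplified illustration, not a production logger: `BufferedEventLogger` is a hypothetical name, events are plain strings, the downstream sink is modeled as a predicate that returns `false` when unavailable, and the dead-letter queue is just a list. A real system would persist both and raise an alert when the buffer grows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

final class BufferedEventLogger {
    private final Predicate<String> sink; // returns false when the downstream logger is unavailable
    private final int capacity;
    private final Deque<String> buffer = new ArrayDeque<>();
    private final List<String> deadLetter = new ArrayList<>();

    BufferedEventLogger(Predicate<String> sink, int capacity) {
        this.sink = sink;
        this.capacity = capacity;
    }

    // Never blocks the response path: enqueue, try to drain, spill overflow.
    synchronized void log(String event) {
        buffer.addLast(event);
        flush();
        while (buffer.size() > capacity) {
            deadLetter.add(buffer.removeFirst()); // oldest events spill first
        }
    }

    // Drains in order; stops at the first failed send so ordering is preserved.
    synchronized void flush() {
        while (!buffer.isEmpty() && sink.test(buffer.peekFirst())) {
            buffer.removeFirst();
        }
    }

    synchronized int buffered() { return buffer.size(); }

    synchronized List<String> deadLettered() { return List.copyOf(deadLetter); }
}
```

The `buffered()` count is a natural input for the backlog alert, and a non-empty dead-letter list is a signal to mark the affected window as a data quality incident.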
20. Experiment Service Failure
If experiment assignment unavailable:
- use cached assignment/config if safe,
- default to control,
- log assignment failure,
- avoid mixing variants unpredictably.
Experiment consistency matters.
Do not randomly assign without persistent/deterministic logic.
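Deterministic assignment can be sketched as a stable hash of the experiment and user IDs, combined with a resolution order that falls back to control. The class name `DeterministicAssignment`, the use of CRC32 as the hash, and the 100-bucket split are assumptions for illustration; any stable hash with good distribution would do.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.zip.CRC32;

final class DeterministicAssignment {
    // The same (experimentId, userId) pair always maps to the same bucket in [0, 100).
    static int bucket(String experimentId, String userId) {
        CRC32 crc = new CRC32();
        crc.update((experimentId + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    // Resolution order when the assignment service may be down:
    // live assignment, then cached assignment, then control.
    static String resolve(Optional<String> fromService, Optional<String> cached) {
        return fromService.or(() -> cached).orElse("control");
    }
}
```

Because `bucket` has no hidden state, a user stays in the same variant across requests and across serving instances, even when the assignment service is unreachable.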
21. Config Service Failure
Config failures:
- cannot load surface config,
- invalid config,
- stale config,
- partial config.
Behavior:
- use last known good config,
- fail if no valid config,
- alert,
- reject invalid config before activation.
Config should be validated before production.
22. Cache Failure
If cache unavailable:
- fallback to source if capacity allows,
- use local stale cache,
- reduce load,
- use fallback list,
- open circuit to cache if slow.
Cache outage can overload backing stores.
Use rate limits and bulkheads.
23. Circuit Breaker
Circuit breaker prevents repeated calls to unhealthy dependency.
States:
closed: normal
open: skip calls, use fallback
half-open: test recovery
Use for:
- candidate source,
- feature store,
- vector search,
- model service,
- cache.
For a critical policy dependency, an open circuit should mean fail-closed mode, not a permissive fallback.
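The three states can be sketched as a small state machine. This is a minimal illustration, not a replacement for a hardened library: `SimpleCircuitBreaker` is a hypothetical name, the trip condition (a count of consecutive failures) and the single-probe half-open behavior are assumptions, and the clock is injected only so the transitions can be tested without sleeping.

```java
import java.util.function.LongSupplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // let exactly one probe through
            return true;
        }
        return state == State.CLOSED;
    }

    synchronized void recordSuccess() {
        if (state == State.OPEN) return; // ignore late results from before the trip
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        if (state == State.OPEN) return; // already open; do not extend the window
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() { return state; }
}
```

One limitation of this sketch: if the half-open probe never reports a result, the breaker stays half-open, so callers should always record an outcome, including on timeout.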
24. Bulkhead
Bulkhead isolates resources.
Examples:
separate thread pools per dependency
separate pool for optional sources
separate traffic lane for debug
separate executor for logging
separate model runtime resources
Without bulkheads, one slow source can exhaust all request threads.
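Besides separate thread pools, a lighter-weight way to get the same isolation is a semaphore that caps concurrent calls per dependency. The sketch below uses that semaphore variant, not the thread-pool variant listed above; `Bulkhead` is a hypothetical name, and rejecting immediately instead of queueing is a design assumption suited to a latency-sensitive path.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Rejects immediately when the compartment is full instead of queueing,
    // so a slow dependency can hold at most maxConcurrent caller threads.
    <T> T call(Supplier<T> task, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}
```

Each dependency gets its own `Bulkhead` instance, for example a small limit for optional candidate sources and a separate one for the model runtime.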
25. Timeout
Every downstream call must have a timeout.
Timeout should respect remaining deadline.
Example:
candidate source timeout 50ms
feature store timeout 25ms
ranker timeout 40ms
policy timeout 20ms
No infinite waits.
Timeouts are part of the API contract.
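"Respect the remaining deadline" means the effective timeout for a call is the smaller of its configured stage timeout and whatever is left of the request budget. A minimal sketch using `CompletableFuture.completeOnTimeout` from the JDK; `DeadlineBudget` is a hypothetical name, and running the call on the common pool is an assumption (a real service would pass a dedicated executor).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class DeadlineBudget {
    private final long deadlineNanos;

    DeadlineBudget(long totalMillis) {
        this.deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalMillis);
    }

    long remainingMillis() {
        return Math.max(0, TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime()));
    }

    // Effective timeout = min(stage timeout, remaining request budget).
    long timeoutFor(long stageTimeoutMillis) {
        return Math.min(stageTimeoutMillis, remainingMillis());
    }

    <T> T callWithTimeout(Supplier<T> call, long stageTimeoutMillis, T fallback) {
        long timeout = timeoutFor(stageTimeoutMillis);
        if (timeout <= 0) {
            return fallback; // budget already spent: do not even start the call
        }
        // Note: on timeout the underlying task is not cancelled; only the wait ends.
        return CompletableFuture.supplyAsync(call)
            .completeOnTimeout(fallback, timeout, TimeUnit.MILLISECONDS)
            .exceptionally(ex -> fallback)
            .join();
    }
}
```

One budget object is created per request and passed through the stages, so a slow candidate stage automatically shrinks the time the ranker is allowed to take.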
26. Retry
Retries can help transient failures.
But retries increase latency/load.
Rules:
- only retry idempotent operations,
- retry within deadline,
- use jitter/backoff,
- avoid retry storms,
- do not retry high-QPS hot path too aggressively.
For online recsys, a fallback is often better than a retry.
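The retry rules above can be sketched as a bounded loop with exponential backoff, full jitter, and a deadline check before every sleep. This is an illustrative sketch: `BoundedRetry` and its parameters are hypothetical names, the operation is assumed to be idempotent, and failures are modeled as unchecked exceptions.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class BoundedRetry {
    // Only use for idempotent operations. Falls back instead of retrying
    // when the remaining budget cannot cover the next backoff sleep.
    static <T> T call(Supplier<T> op, int maxAttempts, long baseDelayMillis,
                      long budgetMillis, Supplier<T> fallback) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException ex) {
                if (attempt == maxAttempts - 1) break; // no attempts left
                // Full jitter: sleep a random time in [0, base * 2^attempt].
                long cap = baseDelayMillis << attempt;
                long sleepMillis = ThreadLocalRandom.current().nextLong(cap + 1);
                long wakeAt = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(sleepMillis);
                if (wakeAt - deadline >= 0) break; // retrying would miss the deadline
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return fallback.get();
    }
}
```

Jitter spreads retries from many callers over time, which is what prevents a synchronized retry storm against a dependency that is just recovering.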
27. Hedging
Hedging sends a duplicate request to another replica after a short delay if the first request has not yet responded.
Useful for tail latency but increases load.
Use carefully for:
- read-only idempotent calls,
- critical low-latency dependency.
Do not hedge everything.
28. Load Shedding
When overloaded:
- skip optional candidate sources,
- reduce candidate count,
- disable shadow models,
- use fallback ranker,
- reduce feature groups,
- serve precomputed lists,
- reject low-priority debug requests.
Load shedding should preserve high-priority safe responses.
29. Graceful Quality Degradation
Quality tiers:
Tier 0: full personalized + fresh + deep ranker
Tier 1: personalized candidates + simpler ranker
Tier 2: precomputed personalized + final check
Tier 3: contextual trending/editorial
Tier 4: safe empty/generic response
Track which tier served each request.
Fallback tier is a key observability dimension.
30. Degradation by Surface
Different surfaces tolerate different degradation.
Home Feed
Can use trending fallback.
Checkout
Must preserve availability/compatibility.
Search
Must respect query; fallback to lexical search.
Email/Push
Better skip send than send bad/stale content.
Enterprise Actions
Better safe empty than invalid action.
Define the fallback policy per surface.
31. Safe Empty Response
Sometimes empty is correct.
Examples:
- no eligible actions,
- policy service failed closed,
- tenant permission uncertain,
- all candidates invalid,
- user opted out.
Empty response should include internal reason and user-safe handling.
Do not fill with unsafe content just to avoid empty.
32. Partial Response
If fewer than requested items available:
- return smaller list if UI supports,
- fill with safe fallback,
- hide module,
- show generic content.
Define product behavior.
Recommendation API should not assume exact N always possible.
33. Stale Data Policy
Stale data can be acceptable or not.
Examples:
item category stale 1h: okay
item stock stale 1h at checkout: not okay
policy state stale 1h: risky
user profile stale 1h: okay
suppression stale 1h: bad
Each data source needs a staleness policy.
34. Freshness Degradation
If fresh feature unavailable:
use stale feature if within hard TTL
otherwise default/fallback
Feature response should include staleness.
Ranking service can choose fallback if many critical features stale.
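The stale-within-hard-TTL rule can be sketched as a small resolver that also reports staleness back to the caller. Names here are hypothetical (`FeatureValue`, `ResolvedFeature`, `FreshnessPolicy`), and the soft TTL, the age after which a value is flagged stale but still usable, is an assumption added for illustration alongside the hard TTL mentioned above.

```java
// A feature as returned by the store, with its age; null means unavailable.
record FeatureValue(double value, long ageMillis) {}

// What the ranker receives: the value plus flags describing how it was obtained.
record ResolvedFeature(double value, boolean stale, boolean defaulted) {}

final class FreshnessPolicy {
    static ResolvedFeature resolve(FeatureValue v, long softTtlMillis,
                                   long hardTtlMillis, double safeDefault) {
        // Missing or older than the hard TTL: do not use it, fall back to the default.
        if (v == null || v.ageMillis() > hardTtlMillis) {
            return new ResolvedFeature(safeDefault, false, true);
        }
        // Within the hard TTL: usable, but flagged stale once past the soft TTL.
        return new ResolvedFeature(v.value(), v.ageMillis() > softTtlMillis, false);
    }
}
```

Counting the `stale` and `defaulted` flags across critical features gives the ranking service a concrete signal for deciding when to switch to its fallback.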
35. Fallback Lists Safety
Fallback lists must be safe:
- generated from approved items,
- versioned,
- TTL,
- final eligibility check,
- tombstone filter,
- no personalized leakage,
- region/tenant scoped.
A fallback list is not a random set of popular items from last month.
36. Observability for Degradation
Log and emit metrics for:
fallback_used
fallback_tier
fallback_reason
dependency_timeout
candidate_source_failures
feature_missing_rate
policy_fail_closed_count
empty_slate_reason
model_fallback_count
cache_stale_served
event_log_buffered
Break these down by surface, model, tenant, and region.
If fallback rate rises, quality likely drops.
37. Alerts
Alert on:
fallback rate spike
empty slate rate spike
policy fail-closed spike
ranker timeout spike
critical feature missing-rate spike
candidate count below threshold
event logging backlog
model inference error
index empty result rate
cache error rate
decision log loss
Use severity based on user/safety impact.
38. Runbooks
For each major failure:
symptom
dashboard
likely causes
immediate mitigation
rollback command
owner
communication
postmortem checklist
Examples:
- ranker outage,
- feature store stale,
- index failure,
- policy config bug,
- event logging loss,
- model bad deploy.
Runbooks make incident resolution faster.
39. Kill Switches
Need fast disable switches:
disable candidate source
disable model route
disable exploration policy
disable campaign
disable LLM explanation
disable sponsored source
force non-personalized fallback
block item/category/tenant
Kill switches must be audited and scoped.
Emergency changes should not require code deploy.
40. Resilience Testing
Test failures intentionally.
Scenarios:
- candidate source timeout,
- feature store unavailable,
- ranker throws exception,
- model artifact invalid,
- policy service slow,
- cache outage,
- index returns empty,
- event logger down,
- stale config,
- suppression store unavailable,
- tenant permission failure.
Automated chaos/resilience tests catch missing fallbacks.
41. GameDay
Run game days:
simulate ranker outage
simulate bad model deploy
simulate feature null spike
simulate policy service outage
simulate event logging lag
Measure:
- alert fired?
- fallback worked?
- user impact bounded?
- rollback fast?
- logs sufficient?
Practice before real incident.
42. Data Quality Fault Tolerance
If upstream event quality is bad:
- stop training,
- mark data window invalid,
- prevent model promotion,
- fallback to previous model,
- annotate metrics,
- backfill after fix.
Online serving may keep running, but the learning pipeline must not ingest poisoned data unnoticed.
43. Model Quality Fault Tolerance
If online metrics degrade:
- rollback model,
- reduce traffic,
- disable treatment,
- switch utility policy,
- disable source,
- use previous calibration.
This requires model route and policy versioning.
44. Enterprise High-Stakes Degradation
For enterprise/compliance:
- an invalid action is worse than no action,
- an unauthorized document is worse than an empty recommendation,
- audit log loss may be severe,
- a policy version mismatch is unacceptable.
Fallback should be conservative.
Human workflow may take over.
45. LLM Component Degradation
If LLM explanation/conversation fails:
- return deterministic recommendations,
- use template explanation,
- skip explanation,
- ask simple clarification,
- do not block core recsys unless LLM is core to user flow.
If LLM output validation fails, fallback to grounded template.
Never serve unvalidated hallucinated output.
46. Fault Containment
Contain blast radius:
- canary deployments,
- per-surface routing,
- per-tenant routing,
- feature flags,
- source-specific circuit breakers,
- model route rollback,
- policy bundle rollback.
Avoid global changes without staged rollout.
47. Postmortem Data
Incident analysis needs:
- request traces,
- dependency metrics,
- model/feature/policy versions,
- fallback tiers,
- decision logs,
- config changes,
- deploy history,
- data quality status.
Store enough information before the incident happens.
48. Common Failure Modes
48.1 No Fallback
Single dependency kills recommendations.
48.2 Unsafe Fallback
Banned/out-of-stock items served.
48.3 Fail-Open on Critical Policy
Security/compliance incident.
48.4 Retry Storm
Outage amplified.
48.5 No Bulkheads
One dependency exhausts thread pool.
48.6 Silent Degradation
Fallback rate high but no alert.
48.7 Empty Slate Not Handled
UI breaks.
48.8 Stale Suppression
User sees blocked item.
48.9 Event Loss Ignored
Future training corrupted.
48.10 Rollback Incomplete
Model changed back but feature/calibration mismatch remains.
49. Implementation Sketch: Fallback Policy
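import java.util.List;
import java.util.Map;

// Sketch only: in a real codebase each public type lives in its own file.

// Ordered fallback tiers plus the failure mode for each critical dependency.
public record FallbackPolicy(
    String policyName,
    String policyVersion,
    List<FallbackTier> tiers,
    Map<String, FailureMode> criticalFailureModes
) {}

// One tier of the hierarchy; finalEligibilityRequired forces the final check.
public record FallbackTier(
    int priority,
    String name,
    FallbackAction action,
    boolean finalEligibilityRequired
) {}

public enum FallbackAction {
    USE_FULL_PIPELINE,
    USE_FALLBACK_RANKER,
    USE_PRECOMPUTED_LIST,
    USE_TRENDING_LIST,
    USE_EDITORIAL_SAFE_LIST,
    RETURN_SAFE_EMPTY
}

// Assumed definition: the sketch references FailureMode without defining it.
public enum FailureMode { FAIL_OPEN, FAIL_CLOSED }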
Fallback policy should be surface-specific.
50. Implementation Sketch: Stage Failure Handling
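public final class RecommendationFailureHandler {
    // Assumed collaborator: the sketch uses this provider without declaring it.
    private final FallbackCandidateProvider fallbackCandidateProvider;

    public RecommendationFailureHandler(FallbackCandidateProvider fallbackCandidateProvider) {
        this.fallbackCandidateProvider = fallbackCandidateProvider;
    }

    public StageResult<CandidatePool> handleCandidateFailure(
        Throwable error,
        RequestContext context,
        FallbackPolicy policy
    ) {
        // Critical policy failures never degrade to a fallback pool.
        if (error instanceof RequiredPolicyException) {
            return StageResult.failure("critical_policy_failure");
        }
        // Everything else is treated as optional: serve fallback candidates and record why.
        CandidatePool fallback = fallbackCandidateProvider.get(context);
        return StageResult.fallback(
            fallback,
            "candidate_generation_failed",
            error.getClass().getSimpleName()
        );
    }
}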
Critical failures must be distinguished from optional failures.
51. Implementation Sketch: Circuit Breaker Decision
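import java.util.function.Supplier;

public final class DependencyGuard {
    // CircuitBreaker is assumed to expose allowRequest/recordSuccess/recordFailure.
    private final CircuitBreaker circuitBreaker;

    public DependencyGuard(CircuitBreaker circuitBreaker) {
        this.circuitBreaker = circuitBreaker;
    }

    public <T> T callOrFallback(Supplier<T> call, Supplier<T> fallbackCall) {
        // Open circuit: skip the unhealthy dependency entirely.
        if (!circuitBreaker.allowRequest()) {
            return fallbackCall.get();
        }
        try {
            T result = call.get();
            circuitBreaker.recordSuccess();
            return result;
        } catch (Exception ex) {
            // Count the failure toward tripping the breaker, then degrade.
            circuitBreaker.recordFailure(ex);
            return fallbackCall.get();
        }
    }
}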
In a real system, the fallback may fail closed depending on the dependency.
52. Minimal Production Fault Tolerance Plan
Start with:
failure_policy:
  matrix_documented: true
  fail_closed:
    - consent_unknown
    - permission_unknown
    - policy_critical_unknown
  fail_open_or_degrade:
    - optional_candidate_source
    - noncritical_feature_missing
    - llm_explanation_failure
fallbacks:
  ranking_fallback: source_score_ranker
  candidate_fallback: trending_editorial
  final_fallback: safe_empty
resilience:
  timeouts: all_dependencies
  circuit_breakers: critical_dependencies
  bulkheads: candidate_sources_and_model
  load_shedding: optional_sources_shadow_debug
observability:
  fallback_tier_metric: true
  fallback_reason_logs: true
  empty_slate_reason: true
  event_logging_backlog_alert: true
operations:
  kill_switches: true
  rollback_routes: true
  runbooks: true
  resilience_tests: true
53. Fault Tolerance and Graceful Degradation Readiness Checklist
[ ] Failure taxonomy is documented.
[ ] Fail-open vs fail-closed policy is defined per dependency.
[ ] Fallback hierarchy exists per surface.
[ ] Candidate source failures degrade gracefully.
[ ] Ranking/model fallback exists.
[ ] Feature missing/stale policy exists.
[ ] Critical policy/permission/consent failures fail safe.
[ ] Event logging degradation is buffered/alerted.
[ ] Circuit breakers exist for unstable dependencies.
[ ] Bulkheads isolate dependency resource pools.
[ ] Timeouts exist for every downstream call.
[ ] Retry policy avoids retry storms.
[ ] Load shedding strategy exists.
[ ] Fallback lists are safe, versioned, and final-checked.
[ ] Empty/partial slate behavior is defined.
[ ] Fallback tier and reason are logged.
[ ] Alerts cover fallback spikes and critical failures.
[ ] Kill switches exist for models/sources/rules/exploration.
[ ] Resilience tests/game days are run.
[ ] Runbooks exist for major failure modes.
54. Conclusion
Fault tolerance and graceful degradation determine whether a recommendation platform stays safe when the production world is not ideal.
Key principles:
- Recommendation must fail safely.
- Not all failures are equal.
- Fail-open only for low-risk optional components.
- Fail-closed for permission, consent, tenant, legal, safety, and critical policy.
- Fallback hierarchy should be surface-specific.
- Fallbacks must still pass final eligibility/safety checks.
- Timeouts, circuit breakers, bulkheads, and load shedding are mandatory.
- Degradation must be observable through fallback tier/reason metrics.
- Kill switches, rollback, and runbooks reduce incident duration.
- Resilience testing is the only way to know fallback works.
This part closes Module 7: Production Platform Architecture.
In Part 063, we enter Module 8: Evaluation, Experimentation, and Observability, starting with Offline Evaluation Metrics: how to evaluate retrieval/ranking/slate offline correctly before online experiments.