


title: Build From Scratch Recommendations System - Part 062
description: Designing fault tolerance and graceful degradation for a production-grade recommendation platform: failure taxonomy, fail-open vs fail-closed, fallback hierarchy, circuit breaker, bulkhead, timeout, safe defaults, model fallback, policy failure, event logging degradation, incident response, and resilience testing.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 62
partTitle: Fault Tolerance and Graceful Degradation
tags:

  • recommendation-system
  • recsys
  • fault-tolerance
  • graceful-degradation
  • resilience
  • system-design
  • series

date: 2026-07-02

Part 062 — Fault Tolerance and Graceful Degradation

A production recommendation platform will experience failures.

Candidate source timeout.
Feature store stale.
Vector index unavailable.
Ranking model fails to load.
Policy service slow.
Event logging backlog.
Experiment config wrong.
Cache cold.
User profile missing.
Embedding pipeline delayed.
ANN index contains invalid items.
New model produces strange scores.
Tenant policy changes while a request is in flight.

A production-grade system must not simply “error”.

It has to know:

which failures may degrade?
which failures must fail closed?
which fallback is safe?
what must be logged?
when to alert?
how to roll back?
what is the user impact?

This part covers fault tolerance and graceful degradation for a recommendation system: failure taxonomy, fail-open vs fail-closed, fallback hierarchy, circuit breakers, bulkheads, safe defaults, model fallback, policy failure, event logging degradation, incident response, resilience testing, and operational readiness.


1. Mental Model: Recommendation Must Fail Safely

The goal is not “never fail”.

Goal:

when something fails, system remains safe, observable, and as useful as possible

Failure response should be intentional.

Examples:

  • optional candidate source fails -> continue with other sources,
  • ranker timeout -> fallback ranker,
  • user profile missing -> contextual recommendations,
  • policy service unavailable -> fail closed for critical checks,
  • event logging degraded -> buffer and alert,
  • model artifact invalid -> keep previous model.

Safety first, usefulness second.


2. Failure Taxonomy

Common failures:

dependency timeout
dependency error
dependency stale data
partial response
empty response
schema mismatch
model load failure
model inference failure
feature missing
cache outage
data freshness violation
config error
policy conflict
index unavailable
event logging failure
latency overload
security/tenant failure

Classify by:

  • severity,
  • user impact,
  • safety risk,
  • fallback availability,
  • recoverability.

3. Fail-Open vs Fail-Closed

Fail-Open

Continue despite failure.

Use when failure is low-risk.

Example:

optional editorial source unavailable -> skip it
minor freshness boost missing -> ignore boost

Fail-Closed

Do not proceed / reject.

Use when proceeding could violate safety, privacy, access control, legal requirements, or tenant boundaries.

Example:

permission check unavailable -> do not recommend restricted document
consent unknown -> do not use personalized profile
policy banned state unknown for high-risk item -> reject or safe fallback

Choose per dependency/rule.
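The per-dependency choice above can be sketched as a small lookup. This is a minimal illustration, not the series' actual code: `FailureMode`, `DependencyPolicy`, and the dependency names are hypothetical. The one deliberate design choice is that a dependency missing from the map is treated as fail-closed.

```java
import java.util.Map;

// Hypothetical names: FailureMode and DependencyPolicy are illustrative only.
enum FailureMode { FAIL_OPEN, FAIL_CLOSED }

final class DependencyPolicy {
    private final Map<String, FailureMode> modes;

    DependencyPolicy(Map<String, FailureMode> modes) {
        this.modes = Map.copyOf(modes);
    }

    // Unlisted dependencies default to FAIL_CLOSED, so a missing entry never means "allow".
    FailureMode modeFor(String dependency) {
        return modes.getOrDefault(dependency, FailureMode.FAIL_CLOSED);
    }

    // True when the request may continue even though the dependency failed.
    boolean mayContinueAfterFailure(String dependency) {
        return modeFor(dependency) == FailureMode.FAIL_OPEN;
    }
}
```

With this shape, adding a new dependency without declaring its mode degrades toward safety rather than toward availability.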


4. Failure Policy Matrix

Example:

Component         | Failure         | Behavior
Two-tower source  | timeout         | skip source
Trending source   | timeout         | use cached trending
Feature store     | partial missing | defaults + indicators
Ranker            | timeout         | fallback ranker
Policy access     | unavailable     | fail closed
Consent service   | unavailable     | non-personalized fallback
Event logger      | unavailable     | buffer + alert
Model registry    | unavailable     | use last loaded model
Index service     | unavailable     | use other sources/fallback
Suppression store | unavailable     | conservative fallback or fail safe

Document this matrix.


5. Fallback Hierarchy

Fallback should be layered.

Example personalized home:

full personalized pipeline
-> personalized candidates + fallback ranker
-> precomputed personalized list + final check
-> non-personalized regional trending + final check
-> editorial safe fallback
-> empty safe response

Each fallback must still respect critical policy.
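One way to express the layered hierarchy is a loop over tiers that applies the same final eligibility filter to every tier's output. The sketch below assumes items are plain string IDs and that a tier signals failure by throwing; `Tier`, `FallbackChain`, and `Result` are hypothetical names.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical types: a tier is a name plus a supplier of item IDs.
record Tier(String name, Supplier<List<String>> source) {}

final class FallbackChain {
    record Result(String tierServed, List<String> items) {}

    // Tries tiers in order; every tier's output passes the same final eligibility filter.
    static Result serve(List<Tier> tiers, Predicate<String> eligible) {
        for (Tier tier : tiers) {
            try {
                List<String> items = tier.source().get().stream().filter(eligible).toList();
                if (!items.isEmpty()) {
                    return new Result(tier.name(), items);
                }
            } catch (RuntimeException ex) {
                // This tier failed: fall through to the next, cheaper tier.
            }
        }
        // Nothing safe to show: an empty response is the last tier.
        return new Result("safe_empty", List.of());
    }
}
```

Returning the tier name alongside the items keeps the served tier available for logging and metrics.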


6. Candidate Source Degradation

Each candidate source should be assigned a criticality, such as required, preferred, optional, or fallback.

Example:

candidate_sources:
  two_tower:
    criticality: preferred
    timeout_ms: 50
    fallback: none
  trending:
    criticality: fallback
    timeout_ms: 15
  editorial:
    criticality: fallback
    timeout_ms: 10

If personalized source fails, use non-personalized sources.

Candidate diversity may drop, but response continues.


7. Ranking Degradation

Ranking failure options:

fallback model
source score ranker
precomputed order
popularity order
editorial order
safe empty

Fallback ranker should be simple and local if possible.

Example fallback score:

score =
  source_priority
  + normalized_source_rank
  + item_quality
  - seen_penalty

Still apply eligibility and final validation.
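The fallback score above could be computed locally as follows. This is a sketch under assumptions: the rank is taken to be 0-based within its source, `normalized_source_rank` is mapped linearly to [0, 1], and the `SEEN_PENALTY` value is arbitrary. `ScoredCandidate` and `FallbackRanker` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical candidate shape; sourceRank is assumed 0-based within its source.
record ScoredCandidate(String itemId, double sourcePriority, int sourceRank,
                       int sourceSize, double itemQuality, boolean seen) {

    static final double SEEN_PENALTY = 0.5; // assumed value, tune per surface

    // 1.0 for the top item of a source, 0.0 for the last one.
    double normalizedSourceRank() {
        return sourceSize <= 1 ? 1.0 : 1.0 - (double) sourceRank / (sourceSize - 1);
    }

    // score = source_priority + normalized_source_rank + item_quality - seen_penalty
    double fallbackScore() {
        return sourcePriority + normalizedSourceRank() + itemQuality
            - (seen ? SEEN_PENALTY : 0.0);
    }
}

final class FallbackRanker {
    // Sorts by descending fallback score; no model or remote call involved.
    static List<ScoredCandidate> rank(List<ScoredCandidate> candidates) {
        return candidates.stream()
            .sorted(Comparator.comparingDouble(ScoredCandidate::fallbackScore).reversed())
            .toList();
    }
}
```

Because the ranker uses only fields already attached to the candidate, it keeps working when the feature store or model service is the component that failed.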


8. Feature Store Degradation

Feature failures:

  • missing feature,
  • stale feature,
  • feature store timeout,
  • partial batch failure.

Policy per feature group:

user_long_term_profile:
  failure: use_default_contextual
item_quality:
  failure: fallback_ranker_or_reject_low_confidence
policy_state:
  failure: fail_closed
session_embedding:
  failure: use_long_term_profile

Feature criticality matters.


9. Safe Defaults

Defaults should be safe.

Examples:

unknown p_report risk -> conservative
unknown personalization consent -> non-personalized
unknown item quality -> do not boost
unknown availability for checkout -> reject
unknown user affinity -> zero with missing indicator

Unsafe default:

unknown policy state -> allow

For critical fields, unknown should not mean allowed.
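Two of the defaults above can be made explicit in code by modelling "unknown" as an empty `Optional` instead of a silently chosen value. The names `Consent` and `SafeDefaults` are hypothetical, and the two-element feature layout is an assumption for illustration.

```java
import java.util.Optional;

// Hypothetical names; "unknown" is represented as Optional.empty().
enum Consent { GRANTED, DENIED }

final class SafeDefaults {
    // Unknown consent is treated exactly like DENIED: no personalization.
    static boolean mayPersonalize(Optional<Consent> consent) {
        return consent.map(c -> c == Consent.GRANTED).orElse(false);
    }

    // Unknown affinity becomes 0.0 plus an explicit missing indicator for the ranker:
    // element 0 is the value, element 1 is 1.0 when the value was missing.
    static double[] affinityFeature(Optional<Double> affinity) {
        return new double[] { affinity.orElse(0.0), affinity.isPresent() ? 0.0 : 1.0 };
    }
}
```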


10. Policy Service Failure

Policy/access/consent failures are high risk.

If policy state unavailable:

  • for low-risk public content, maybe use cached last known policy with short TTL,
  • for restricted/enterprise data, fail closed,
  • for sponsored/legal, fail closed or safe fallback.

Final response should never bypass critical policy because of availability pressure.


11. Consent Service Failure

If consent service unavailable:

do not use personalized behavioral features

Fallback:

contextual/non-personalized recommendations

Do not assume consent allowed.

Privacy failure should be conservative.


12. Suppression Store Failure

Suppression includes user blocks/hides.

If suppression unavailable:

  • for ordinary home feed, maybe use cached suppression if fresh,
  • if no suppression available, conservative fallback may be needed,
  • for legally required opt-out/block, fail closed or non-personalized safe fallback.

User explicit controls are trust-critical.


13. Inventory/Availability Failure

For e-commerce checkout/PDP:

unknown stock -> reject or final availability check

For discovery home feed:

unknown stock -> maybe allow if final PDP handles, but better downrank/filter

Surface matters.

The closer to purchase, the stricter availability must be.


14. Vector Index Failure

If ANN index unavailable:

  • use other candidate sources,
  • use cached similar items,
  • use batch/precomputed candidates,
  • use trending/editorial fallback,
  • reduce personalization.

Index failure should not crash entire recsys if alternative sources exist.

But retrieval quality may drop.


15. Embedding Missing

If query/user embedding missing:

  • use session/user profile alternatives,
  • use non-vector sources,
  • use content/popularity,
  • fallback to contextual.

If item embedding missing:

  • content/metadata candidate source,
  • cold-start source,
  • no vector retrieval for that item until delta pipeline catches up.

Missing embedding should be monitored.


16. Model Artifact Failure

Failures:

  • model registry unavailable,
  • artifact download fails,
  • checksum mismatch,
  • model load exception,
  • warmup fails,
  • runtime incompatible.

Behavior:

  • keep last known good model,
  • do not switch route,
  • alert,
  • block promotion,
  • rollback if already active.

Serving should not load unvalidated model.


17. Model Inference Failure

If inference fails at request time:

  • retry only if safe within deadline,
  • fallback model,
  • source score ranker,
  • precomputed order,
  • safe fallback list.

Log model version and error.

If failure rate spikes, circuit-break model and use fallback.


18. Calibration/Utility Policy Missing

If calibration artifact missing:

  • do not compose expected utility using raw scores,
  • fallback to previous compatible bundle,
  • or use ranking raw order if explicitly validated,
  • alert.

If utility policy missing:

  • use last known good policy,
  • fallback route,
  • do not invent weights.

Score semantics matter.


19. Event Logging Failure

Event logging failures do not usually block response, but they are serious.

If decision/impression logging fails:

  • buffer locally if possible,
  • retry,
  • send to dead-letter queue,
  • alert,
  • mark data quality incident if prolonged.

If events are lost, training and evaluation data degrade.

For high-compliance enterprise decisions, logging failure may need stricter behavior.
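Local buffering can be sketched with a bounded in-memory queue that never blocks the request thread and counts what it had to drop. This assumes events are already serialized to strings and that a separate worker drains the buffer; `BufferedEventLog` is a hypothetical name.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical bounded buffer placed in front of the real event logger.
final class BufferedEventLog {
    private final ArrayBlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();

    BufferedEventLog(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the request thread: a full buffer records a drop instead of waiting.
    boolean enqueue(String event) {
        boolean accepted = buffer.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Backlog size and drop count are the signals to alert on.
    int backlog() { return buffer.size(); }
    long droppedCount() { return dropped.get(); }
}
```

A non-zero drop count is the point at which a prolonged outage becomes a data quality incident rather than a transient delay.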


20. Experiment Service Failure

If experiment assignment unavailable:

  • use cached assignment/config if safe,
  • default to control,
  • log assignment failure,
  • avoid mixing variants unpredictably.

Experiment consistency matters.

Do not randomly assign without persistent/deterministic logic.
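Deterministic assignment can be as simple as hashing a stable key into a bucket. The sketch below is one possible approach, not the series' assignment service: it uses CRC32 as a stable hash, assumes a two-variant experiment, and represents "config unavailable" as a `null` treatment percentage. `DeterministicAssignment` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper; CRC32 is used only because it is stable across JVMs and restarts.
final class DeterministicAssignment {
    // The same (experiment, user) pair always maps to the same bucket in [0, 100).
    static int bucket(String experimentId, String userId) {
        CRC32 crc = new CRC32();
        crc.update((experimentId + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    // A null treatmentPercent models "experiment config unavailable": serve control.
    static String variant(String experimentId, String userId, Integer treatmentPercent) {
        if (treatmentPercent == null) {
            return "control";
        }
        return bucket(experimentId, userId) < treatmentPercent ? "treatment" : "control";
    }
}
```

Because the bucket depends only on the IDs, a user keeps the same variant across requests and replicas even when the assignment service is unreachable and only cached config is available.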


21. Config Service Failure

Config failures:

  • cannot load surface config,
  • invalid config,
  • stale config,
  • partial config.

Behavior:

  • use last known good config,
  • fail if no valid config,
  • alert,
  • reject invalid config before activation.

Config should be validated before production.


22. Cache Failure

If cache unavailable:

  • fallback to source if capacity allows,
  • use local stale cache,
  • reduce load,
  • use fallback list,
  • open circuit to cache if slow.

Cache outage can overload backing stores.

Use rate limits and bulkheads.


23. Circuit Breaker

Circuit breaker prevents repeated calls to unhealthy dependency.

States:

closed: normal
open: skip calls, use fallback
half-open: test recovery

Use for:

  • candidate source,
  • feature store,
  • vector search,
  • model service,
  • cache.

Critical policy dependency may open into fail-closed mode.
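The three states can be sketched as a small state machine. This is a simplified illustration that trips on consecutive failures and allows a single probe when half-open; production libraries typically use sliding windows and failure-rate thresholds instead. `SimpleCircuitBreaker` and its parameters are hypothetical, and the clock is injected so the behavior can be tested without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical minimal breaker: trips after N consecutive failures.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // let exactly one probe through to test recovery
            return true;
        }
        return state == State.CLOSED;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        // A failed probe reopens immediately; otherwise trip at the threshold.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() { return state; }
}
```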


24. Bulkhead

Bulkhead isolates resources.

Examples:

separate thread pools per dependency
separate pool for optional sources
separate traffic lane for debug
separate executor for logging
separate model runtime resources

Without bulkheads, one slow source can exhaust all request threads.
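Besides separate thread pools, a lighter variant of the same idea is a semaphore that caps concurrent calls per dependency. The sketch below shows that semaphore-based variant, not the thread-pool form listed above; `Bulkhead` is a hypothetical name.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore bulkhead: one instance per dependency.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Rejects immediately instead of queueing when the dependency's lane is full.
    <T> T call(Supplier<T> dependencyCall, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return dependencyCall.get();
        } finally {
            permits.release(); // always return the permit, even if the call throws
        }
    }
}
```

A slow dependency can then hold at most `maxConcurrentCalls` request threads; further callers take the fallback path instead of piling up behind it.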


25. Timeout

Every downstream call must have timeout.

Timeout should respect remaining deadline.

Example:

candidate source timeout 50ms
feature store timeout 25ms
ranker timeout 40ms
policy timeout 20ms

No infinite waits.

Timeouts are part of API contract.
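"Respect remaining deadline" means the effective timeout for a call is the smaller of its configured timeout and whatever is left of the request budget. A minimal sketch, assuming a monotonic nanosecond clock value is passed in by the caller; `Deadline` is a hypothetical type.

```java
import java.time.Duration;

// Hypothetical request deadline, expressed on a monotonic nanosecond clock.
record Deadline(long deadlineNanos) {
    static Deadline after(Duration budget, long nowNanos) {
        return new Deadline(nowNanos + budget.toNanos());
    }

    // Time left in the request budget, never negative.
    Duration remaining(long nowNanos) {
        return Duration.ofNanos(Math.max(0L, deadlineNanos - nowNanos));
    }

    // Effective timeout = min(configured per-dependency timeout, remaining budget).
    Duration timeoutFor(Duration configured, long nowNanos) {
        Duration left = remaining(nowNanos);
        return configured.compareTo(left) <= 0 ? configured : left;
    }
}
```

For example, with a 100 ms request budget and 75 ms already spent, a ranker configured for 40 ms gets only 25 ms; once the budget is gone the result is zero and the caller should go straight to fallback.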


26. Retry

Retries can help transient failures.

But retries increase latency/load.

Rules:

  • only retry idempotent operations,
  • retry within deadline,
  • use jitter/backoff,
  • avoid retry storms,
  • do not retry high-QPS hot path too aggressively.

For online recsys, fallback is often better than retry.
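The deadline and jitter rules can be sketched as three small functions. This uses exponential backoff with "full jitter" (a uniform delay between zero and the backoff cap), which is one common scheme among several; `RetryBudget` and all parameter values are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for deciding whether and when to retry.
final class RetryBudget {
    // Exponential backoff cap for the given attempt (attempt 0 = first retry).
    static long backoffCapMillis(long baseMillis, long maxMillis, int attempt) {
        long cap = baseMillis << Math.min(attempt, 20); // bounded shift
        return Math.min(cap, maxMillis);
    }

    // Full jitter: a uniform delay in [0, cap] so clients do not retry in lockstep.
    static long jitteredDelayMillis(long baseMillis, long maxMillis, int attempt) {
        long cap = backoffCapMillis(baseMillis, maxMillis, attempt);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }

    // Retry only if the delay plus one more call timeout still fits in the deadline.
    static boolean shouldRetry(long delayMillis, long callTimeoutMillis, long remainingMillis) {
        return delayMillis + callTimeoutMillis <= remainingMillis;
    }
}
```

On a hot path with a budget of a few tens of milliseconds, `shouldRetry` will usually return false after the first timeout, which is why falling back tends to win over retrying.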


27. Hedging

Hedging sends duplicate request to another replica after delay.

Useful for tail latency but increases load.

Use carefully for:

  • read-only idempotent calls,
  • critical low-latency dependency.

Do not hedge everything.


28. Load Shedding

When overloaded:

  • skip optional candidate sources,
  • reduce candidate count,
  • disable shadow models,
  • use fallback ranker,
  • reduce feature groups,
  • serve precomputed lists,
  • reject low-priority debug requests.

Load shedding should preserve high-priority safe responses.


29. Graceful Quality Degradation

Quality tiers:

Tier 0: full personalized + fresh + deep ranker
Tier 1: personalized candidates + simpler ranker
Tier 2: precomputed personalized + final check
Tier 3: contextual trending/editorial
Tier 4: safe empty/generic response

Track which tier served each request.

Fallback tier is key observability dimension.


30. Degradation by Surface

Different surfaces tolerate different degradation.

Home Feed

Can use trending fallback.

Checkout

Must preserve availability/compatibility.

Search

Must respect query; fallback to lexical search.

Email/Push

Better skip send than send bad/stale content.

Enterprise Actions

Better safe empty than invalid action.

Define the fallback policy per surface.


31. Safe Empty Response

Sometimes empty is correct.

Examples:

  • no eligible actions,
  • policy service failed closed,
  • tenant permission uncertain,
  • all candidates invalid,
  • user opted out.

Empty response should include internal reason and user-safe handling.

Do not fill with unsafe content just to avoid empty.


32. Partial Response

If fewer than requested items available:

  • return smaller list if UI supports,
  • fill with safe fallback,
  • hide module,
  • show generic content.

Define product behavior.

Recommendation API should not assume exact N always possible.


33. Stale Data Policy

Stale data can be acceptable or not.

Examples:

item category stale 1h: okay
item stock stale 1h at checkout: not okay
policy state stale 1h: risky
user profile stale 1h: okay
suppression stale 1h: bad

Each data source needs stale policy.


34. Freshness Degradation

If fresh feature unavailable:

use stale feature if within hard TTL
otherwise default/fallback

Feature response should include staleness.

Ranking service can choose fallback if many critical features stale.
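The hard-TTL rule can be sketched as a resolver that returns the value together with staleness flags. The soft TTL (the age beyond which a value counts as stale but still usable) is an assumption added for illustration; `FeatureValue`, `ResolvedFeature`, and `FreshnessPolicy` are hypothetical names.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical shapes: a stored value with its age, and the resolved result.
record FeatureValue(double value, Duration age) {}

record ResolvedFeature(double value, boolean stale, boolean defaulted) {}

final class FreshnessPolicy {
    // softTtl is an assumed threshold; hardTtl is the limit beyond which the value is unusable.
    static ResolvedFeature resolve(Optional<FeatureValue> stored, Duration softTtl,
                                   Duration hardTtl, double defaultValue) {
        if (stored.isEmpty() || stored.get().age().compareTo(hardTtl) > 0) {
            // Missing or beyond the hard TTL: use the default and say so.
            return new ResolvedFeature(defaultValue, false, true);
        }
        boolean stale = stored.get().age().compareTo(softTtl) > 0;
        return new ResolvedFeature(stored.get().value(), stale, false);
    }
}
```

The ranking service can count `stale` and `defaulted` flags across critical features and switch to the fallback ranker when too many are set.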


35. Fallback Lists Safety

Fallback lists must be safe:

  • generated from approved items,
  • versioned,
  • TTL,
  • final eligibility check,
  • tombstone filter,
  • no personalized leakage,
  • region/tenant scoped.

Fallback list is not random popular items from last month.


36. Observability for Degradation

Log and emit metrics for:

fallback_used
fallback_tier
fallback_reason
dependency_timeout
candidate_source_failures
feature_missing_rate
policy_fail_closed_count
empty_slate_reason
model_fallback_count
cache_stale_served
event_log_buffered

By surface/model/tenant/region.

If fallback rate rises, quality likely drops.


37. Alerts

Alert on:

fallback rate spike
empty slate rate spike
policy fail-closed spike
ranker timeout spike
feature missing critical spike
candidate count below threshold
event logging backlog
model inference error
index empty result rate
cache error rate
decision log loss

Use severity based on user/safety impact.


38. Runbooks

For each major failure:

symptom
dashboard
likely causes
immediate mitigation
rollback command
owner
communication
postmortem checklist

Examples:

  • ranker outage,
  • feature store stale,
  • index failure,
  • policy config bug,
  • event logging loss,
  • model bad deploy.

Runbooks make incident response faster.


39. Kill Switches

Need fast disable switches:

disable candidate source
disable model route
disable exploration policy
disable campaign
disable LLM explanation
disable sponsored source
force non-personalized fallback
block item/category/tenant

Kill switches must be audited and scoped.

Emergency changes should not require code deploy.


40. Resilience Testing

Test failures intentionally.

Scenarios:

  • candidate source timeout,
  • feature store unavailable,
  • ranker throws exception,
  • model artifact invalid,
  • policy service slow,
  • cache outage,
  • index returns empty,
  • event logger down,
  • stale config,
  • suppression store unavailable,
  • tenant permission failure.

Automated chaos/resilience tests catch missing fallback.


41. GameDay

Run game days:

simulate ranker outage
simulate bad model deploy
simulate feature null spike
simulate policy service outage
simulate event logging lag

Measure:

  • alert fired?
  • fallback worked?
  • user impact bounded?
  • rollback fast?
  • logs sufficient?

Practice before real incident.


42. Data Quality Fault Tolerance

If upstream event quality bad:

  • stop training,
  • mark data window invalid,
  • prevent model promotion,
  • fallback to previous model,
  • annotate metrics,
  • backfill after fix.

Online serving may keep running, but the learning pipeline must not silently ingest poisoned data.


43. Model Quality Fault Tolerance

If online metrics degrade:

  • rollback model,
  • reduce traffic,
  • disable treatment,
  • switch utility policy,
  • disable source,
  • use previous calibration.

This requires model route and policy versioning.


44. Enterprise High-Stakes Degradation

For enterprise/compliance:

  • invalid action worse than no action,
  • unauthorized document worse than empty recommendation,
  • audit log loss may be severe,
  • policy version mismatch unacceptable.

Fallback should be conservative.

Human workflow may take over.


45. LLM Component Degradation

If LLM explanation/conversation fails:

  • return deterministic recommendations,
  • use template explanation,
  • skip explanation,
  • ask simple clarification,
  • do not block core recsys unless LLM is core to user flow.

If LLM output validation fails, fallback to grounded template.

Never serve unvalidated hallucinated output.


46. Fault Containment

Contain blast radius:

  • canary deployments,
  • per-surface routing,
  • per-tenant routing,
  • feature flags,
  • source-specific circuit breakers,
  • model route rollback,
  • policy bundle rollback.

Avoid global changes without staged rollout.


47. Postmortem Data

Incident analysis needs:

  • request traces,
  • dependency metrics,
  • model/feature/policy versions,
  • fallback tiers,
  • decision logs,
  • config changes,
  • deploy history,
  • data quality status.

Store enough information before incident.


48. Common Failure Modes

48.1 No Fallback

Single dependency kills recommendations.

48.2 Unsafe Fallback

Banned/out-of-stock items served.

48.3 Fail-Open on Critical Policy

Security/compliance incident.

48.4 Retry Storm

Outage amplified.

48.5 No Bulkheads

One dependency exhausts thread pool.

48.6 Silent Degradation

Fallback rate high but no alert.

48.7 Empty Slate Not Handled

UI breaks.

48.8 Stale Suppression

User sees blocked item.

48.9 Event Loss Ignored

Future training corrupted.

48.10 Rollback Incomplete

Model changed back but feature/calibration mismatch remains.


49. Implementation Sketch: Fallback Policy

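import java.util.List;
import java.util.Map;

// Surface-level fallback policy: ordered tiers plus per-dependency failure modes.
// FailureMode is assumed to be defined elsewhere (for example FAIL_OPEN / FAIL_CLOSED).
public record FallbackPolicy(
    String policyName,
    String policyVersion,
    List<FallbackTier> tiers,
    Map<String, FailureMode> criticalFailureModes
) {}

// One step in the hierarchy; lower priority values are tried first.
public record FallbackTier(
    int priority,
    String name,
    FallbackAction action,
    boolean finalEligibilityRequired
) {}

public enum FallbackAction {
    USE_FULL_PIPELINE,
    USE_FALLBACK_RANKER,
    USE_PRECOMPUTED_LIST,
    USE_TRENDING_LIST,
    USE_EDITORIAL_SAFE_LIST,
    RETURN_SAFE_EMPTY
}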

Fallback policy should be surface-specific.


50. Implementation Sketch: Stage Failure Handling

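public final class RecommendationFailureHandler {
    // FallbackCandidateProvider is an assumed collaborator type; the original sketch
    // referenced the field without declaring it.
    private final FallbackCandidateProvider fallbackCandidateProvider;

    public RecommendationFailureHandler(FallbackCandidateProvider fallbackCandidateProvider) {
        this.fallbackCandidateProvider = fallbackCandidateProvider;
    }

    public StageResult<CandidatePool> handleCandidateFailure(
        Throwable error,
        RequestContext context,
        FallbackPolicy policy
    ) {
        // Critical policy failures are never degraded: fail closed.
        if (error instanceof RequiredPolicyException) {
            return StageResult.failure("critical_policy_failure");
        }

        // Any other candidate failure degrades to the fallback candidate pool.
        CandidatePool fallback = fallbackCandidateProvider.get(context);
        return StageResult.fallback(
            fallback,
            "candidate_generation_failed",
            error.getClass().getSimpleName()
        );
    }
}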

Critical failures must be distinguished from optional failures.


51. Implementation Sketch: Circuit Breaker Decision

import java.util.function.Supplier;

public final class DependencyGuard {
    private final CircuitBreaker circuitBreaker;

    public DependencyGuard(CircuitBreaker circuitBreaker) {
        this.circuitBreaker = circuitBreaker;
    }

    public <T> T callOrFallback(Supplier<T> call, Supplier<T> fallbackCall) {
        // Breaker open: skip the unhealthy dependency entirely.
        if (!circuitBreaker.allowRequest()) {
            return fallbackCall.get();
        }

        try {
            T result = call.get();
            circuitBreaker.recordSuccess();
            return result;
        } catch (Exception ex) {
            // Record the failure so the breaker can trip, then degrade.
            circuitBreaker.recordFailure(ex);
            return fallbackCall.get();
        }
    }
}

In a real system, the fallback may itself fail closed, depending on the dependency.


52. Minimal Production Fault Tolerance Plan

Start with:

failure_policy:
  matrix_documented: true
  fail_closed:
    - consent_unknown
    - permission_unknown
    - policy_critical_unknown
  fail_open_or_degrade:
    - optional_candidate_source
    - noncritical_feature_missing
    - llm_explanation_failure
fallbacks:
  ranking_fallback: source_score_ranker
  candidate_fallback: trending_editorial
  final_fallback: safe_empty
resilience:
  timeouts: all_dependencies
  circuit_breakers: critical_dependencies
  bulkheads: candidate_sources_and_model
  load_shedding: optional_sources_shadow_debug
observability:
  fallback_tier_metric: true
  fallback_reason_logs: true
  empty_slate_reason: true
  event_logging_backlog_alert: true
operations:
  kill_switches: true
  rollback_routes: true
  runbooks: true
  resilience_tests: true

53. Checklist: Fault Tolerance and Graceful Degradation Readiness

[ ] Failure taxonomy is documented.
[ ] Fail-open vs fail-closed policy is defined per dependency.
[ ] Fallback hierarchy exists per surface.
[ ] Candidate source failures degrade gracefully.
[ ] Ranking/model fallback exists.
[ ] Feature missing/stale policy exists.
[ ] Critical policy/permission/consent failures fail safe.
[ ] Event logging degradation is buffered/alerted.
[ ] Circuit breakers exist for unstable dependencies.
[ ] Bulkheads isolate dependency resource pools.
[ ] Timeouts exist for every downstream call.
[ ] Retry policy avoids retry storms.
[ ] Load shedding strategy exists.
[ ] Fallback lists are safe, versioned, and final-checked.
[ ] Empty/partial slate behavior is defined.
[ ] Fallback tier and reason are logged.
[ ] Alerts cover fallback spikes and critical failures.
[ ] Kill switches exist for models/sources/rules/exploration.
[ ] Resilience tests/game days are run.
[ ] Runbooks exist for major failure modes.

54. Conclusion

Fault tolerance and graceful degradation determine whether a recommendation platform stays safe when production conditions are not ideal.

Key principles:

  1. Recommendation must fail safely.
  2. Not all failures are equal.
  3. Fail-open only for low-risk optional components.
  4. Fail-closed for permission, consent, tenant, legal, safety, and critical policy.
  5. Fallback hierarchy should be surface-specific.
  6. Fallbacks must still pass final eligibility/safety checks.
  7. Timeouts, circuit breakers, bulkheads, and load shedding are mandatory.
  8. Degradation must be observable through fallback tier/reason metrics.
  9. Kill switches, rollback, and runbooks reduce incident duration.
  10. Resilience testing is the only way to know fallback works.

This part closes Module 7: Production Platform Architecture.

In Part 063, we move into Module 8: Evaluation, Experimentation, and Observability, starting with Offline Evaluation Metrics: how to evaluate retrieval, ranking, and slates correctly offline before running online experiments.
