Learn Build From Scratch Recommendations System Part 074 Operating Model And Team Topology

title: Build From Scratch Recommendations System - Part 074
description: "Designing the operating model and team topology for a production-grade recommendation platform: ownership, team boundaries, platform vs product teams, ML/data/platform collaboration, governance, on-call, incident process, model review, experimentation review, roadmap, and organizational scaling."
series: learn-build-from-scratch-recommendations-system
seriesTitle: "Build From Scratch: Enterprise Recommendations System"
order: 74
partTitle: Operating Model and Team Topology
tags:
  - recommendation-system
  - recsys
  - operating-model
  - team-topology
  - mlops
  - governance
  - series
date: 2026-07-02

Part 074 — Operating Model and Team Topology

A production-grade recommendation system cannot be run by a single “ML engineer hero”.

It requires collaboration across:

product,
ML/research,
backend/platform,
data engineering,
MLOps,
infra/SRE,
experimentation,
trust & safety,
privacy/security,
domain experts,
support,
enterprise/customer success,
analytics,
governance.

Without a clear operating model, the system fails organizationally:

no one knows who owns a feature,
models are deployed without review,
experiments run without guardrails,
data pipelines break with no one on call,
bad recommendations have no root-cause owner,
business rules change without an audit trail,
costs explode without accountability,
tenant configs drift,
safety incidents are handled slowly.

This part covers the operating model and team topology for an enterprise-grade recommendation platform: ownership, team boundaries, platform vs product responsibilities, model governance, experimentation process, on-call, incident response, roadmap, review rituals, and organizational scaling.

1. Mental Model: RecSys Is a Socio-Technical System

A recommendation platform consists of:

software
data
models
policies
experiments
metrics
people
processes
ownership
governance

Technology without an operating model becomes fragile.

Operating model questions:

Who owns each candidate source?
Who owns the ranking model?
Who owns the feature pipeline?
Who approves policy changes?
Who responds to incidents?
Who manages tenant configs?
Who pays for the cost?
Who decides objective trade-offs?

If the answers are unclear, production quality declines.

2. Team Topology Goals

Good topology should provide:

clear ownership,
fast iteration,
safe deployment,
reusable platform,
product alignment,
operational excellence,
governance,
scalable collaboration,
reduced cognitive load.

Avoid:

every product team rebuilding RecSys,
centralized bottleneck for every small change,
no owner for data/feature quality,
ML team shipping without platform support,
platform team owning product metrics alone.

3. Core Team Types

Typical teams:

Recommendation Product Team
Ranking / ML Team
Candidate Retrieval Team
RecSys Platform Team
Feature/Data Platform Team
Experimentation Platform Team
Trust & Safety / Policy Team
Privacy/Security Team
SRE / Infrastructure Team
Analytics / Measurement Team
Enterprise Configuration / Solutions Team

In a small org, one team may cover multiple roles, but the responsibilities still need to be explicit.

4. Platform vs Product Responsibilities

Product/SDS Team

Owns:

user experience,
surface objective,
product metrics,
launch decisions,
business trade-offs,
UI integration.

RecSys Platform

Owns:

serving infrastructure,
candidate/ranking framework,
model lifecycle,
feature integration,
observability,
reliability,
reusable components.

Both must collaborate.

5. ML/Ranking Team Responsibilities

Owns:

ranking objectives,
model training,
feature selection,
offline evaluation,
calibration,
model registry metadata,
model monitoring,
retraining strategy,
model quality incidents.

Should not alone own:

privacy policy,
business trade-off final decisions,
frontend experience,
all infrastructure.

6. Candidate Retrieval Team Responsibilities

Owns:

candidate source design,
vector retrieval,
item-to-item/graph/content sources,
ANN indexes,
source recall/latency/cost,
source contribution metrics,
source fallback.

Candidate source quality is not ranking team’s problem alone.

7. Feature/Data Team Responsibilities

Owns:

event tracking contracts,
clean/curated datasets,
feature pipelines,
feature registry,
feature quality/freshness,
dataset builder,
lineage/backfill,
data incident response.

Feature quality is production quality.

8. MLOps/Platform Responsibilities

Owns:

training orchestration,
model registry,
deployment pipeline,
shadow/canary,
rollback,
artifact storage,
monitoring frameworks,
serving runtime.

MLOps makes ML deployable.

9. Experimentation/Analytics Responsibilities

Owns:

experiment platform,
metric definitions,
power/sample guidance,
analysis pipeline,
guardrail dashboards,
experiment registry,
result review.

Recommendation experiments are complex; measurement needs ownership.

10. Trust & Safety / Policy Responsibilities

Owns:

policy taxonomy,
recommendability rules,
safety classifier thresholds,
moderation workflow,
appeals,
safety guardrails,
safety incident response,
policy rule approval.

ML/engineering implements, but policy owners define policy.

11. Privacy/Security Responsibilities

Owns:

privacy modes,
consent requirements,
data access policy,
tenant isolation review,
security review,
audit requirements,
deletion/retention governance,
sensitive feature review.

Engineering builds enforcement.

12. SRE/Infrastructure Responsibilities

Owns:

service reliability,
capacity planning,
autoscaling,
on-call practices,
incident management,
infrastructure cost visibility,
deployment reliability,
observability platform.

RecSys-specific SLOs need collaboration with RecSys teams.

13. Enterprise Solutions Team

For enterprise RecSys:

Owns:

tenant onboarding,
tenant configuration,
customer-specific constraints,
tenant rollout coordination,
support escalation,
feedback collection,
tenant SLA management.

Should not fork core platform per tenant.

14. RACI Matrix

Define RACI:

Responsible
Accountable
Consulted
Informed

Example:

Decision	Responsible	Accountable	Consulted
New ranker model	ML Team	RecSys Lead	Product, Experimentation, SRE
New safety rule	T&S	Policy Lead	ML, Platform, Legal
Feature pipeline change	Data Team	Data Lead	ML, Platform
Tenant config change	Enterprise Config Team	Tenant Owner	Security, Product
Experiment launch	Product/ML	Product Owner	Analytics, T&S
Model rollback	On-call RecSys	RecSys Owner	Product, SRE

Explicit RACI prevents confusion.
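A RACI matrix becomes easier to enforce once it is data rather than a slide. The following is a minimal sketch, assuming a hypothetical `RaciMatrix` type and decision keys such as `new_ranker_model`; the team names come from the example table, everything else is illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// One row of the RACI table. The type and field names are illustrative.
record RaciEntry(String responsible, String accountable, List<String> consulted) {}

final class RaciMatrix {
    private final Map<String, RaciEntry> byDecision;

    RaciMatrix(Map<String, RaciEntry> byDecision) {
        this.byDecision = Map.copyOf(byDecision);
    }

    // Returns the entry for a decision, or empty when nobody has been assigned yet.
    Optional<RaciEntry> lookup(String decision) {
        return Optional.ofNullable(byDecision.get(decision));
    }

    // A decision is covered only if an accountable party is named for it.
    boolean hasAccountableOwner(String decision) {
        return lookup(decision)
                .map(e -> e.accountable() != null && !e.accountable().isBlank())
                .orElse(false);
    }

    // Two rows copied from the example table; the decision keys are hypothetical.
    static RaciMatrix example() {
        return new RaciMatrix(Map.of(
                "new_ranker_model", new RaciEntry("ML Team", "RecSys Lead",
                        List.of("Product", "Experimentation", "SRE")),
                "model_rollback", new RaciEntry("On-call RecSys", "RecSys Owner",
                        List.of("Product", "SRE"))));
    }
}
```

A check like `hasAccountableOwner` could run in CI against the list of known decision types, so that a new kind of change cannot ship before someone is accountable for it.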

15. Ownership by Artifact

Every artifact should have an owner.

Artifacts:

event schema
feature
feature set
candidate source
ranker model
embedding/index
rule bundle
slate policy
utility policy
experiment
tenant config
fallback list
dashboard
alert
runbook

Owner is responsible for quality, monitoring, and lifecycle.

16. Model Review Process

Before a model goes to production:

Review:

objective,
training dataset,
feature set,
offline metrics,
segment metrics,
calibration,
guardrails,
latency/cost,
privacy/safety implications,
deployment plan,
rollback plan.

The review outcome is recorded in the model registry.

No review, no production.

17. Experiment Review Process

Before experiment:

Review:

hypothesis,
treatment details,
primary metric,
guardrails,
randomization unit,
sample size/duration,
exposure logging,
safety/privacy impact,
cost impact,
rollback plan.

High-risk experiments require stronger approval.
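One way to make "stronger approval" concrete is to derive the required approval level from declared risk flags. This is a sketch under assumptions: the flag names, the approval levels, and the escalation rule are all hypothetical, not a policy stated in this series.

```java
import java.util.Set;

// Risk signals declared in the experiment review. Names are illustrative.
enum RiskFlag { SAFETY_IMPACT, PRIVACY_IMPACT, HIGH_COST, NO_ROLLBACK_PLAN }

enum ApprovalLevel { TEAM_REVIEW, GOVERNANCE_REVIEW, BLOCKED }

final class ExperimentApproval {
    private ExperimentApproval() {}

    // Hypothetical policy: a missing rollback plan blocks the launch outright;
    // safety, privacy, or cost risk escalates to a governance-level review;
    // everything else only needs the regular team review.
    static ApprovalLevel requiredApproval(Set<RiskFlag> flags) {
        if (flags.contains(RiskFlag.NO_ROLLBACK_PLAN)) {
            return ApprovalLevel.BLOCKED;
        }
        boolean highRisk = flags.contains(RiskFlag.SAFETY_IMPACT)
                || flags.contains(RiskFlag.PRIVACY_IMPACT)
                || flags.contains(RiskFlag.HIGH_COST);
        return highRisk ? ApprovalLevel.GOVERNANCE_REVIEW : ApprovalLevel.TEAM_REVIEW;
    }
}
```

Keeping the rule in one function makes the escalation criteria reviewable and testable instead of living in individual reviewers' heads.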

18. Feature Review Process

Before new feature:

Review:

definition,
owner,
source data,
point-in-time safety,
freshness SLA,
default/missing policy,
privacy class,
online availability,
monitoring,
cost,
downstream models.

Feature review prevents feature debt.

19. Rule/Policy Review Process

Before rule change:

Review:

policy intent,
scope,
severity,
expected impact,
conflict with existing rules,
expiry,
test cases,
rollout plan,
rollback,
metrics.

Rules can be as impactful as model deploys.
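The `expiry` item in the review list is easy to forget after launch. A minimal sketch of an expiry audit follows, assuming a hypothetical `PolicyRule` record in which a `null` expiry means the rule is permanent.

```java
import java.time.Instant;
import java.util.List;

// A reviewed rule with an owner and an optional expiry. The shape is illustrative.
record PolicyRule(String id, String owner, Instant expiresAt) {
    // A null expiry means the rule is permanent (an assumption of this sketch).
    boolean isExpired(Instant now) {
        return expiresAt != null && !now.isBefore(expiresAt);
    }
}

final class RuleExpiryAudit {
    private RuleExpiryAudit() {}

    // Lists rule ids whose expiry has passed, so owners can renew or remove them.
    static List<String> expiredRuleIds(List<PolicyRule> rules, Instant now) {
        return rules.stream()
                .filter(r -> r.isExpired(now))
                .map(PolicyRule::id)
                .toList();
    }
}
```

Running such an audit on a schedule keeps temporary rules, such as campaign boosts, from silently becoming permanent behavior.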

20. Tenant Config Review Process

Before tenant config activation:

Review:

tenant scope,
data policy,
model route,
rule bundle,
feature availability,
fallback,
quotas,
LLM settings,
data residency,
approval.

A tenant config change should be treated as a deployment.

21. On-Call Model

RecSys needs on-call for:

online serving outage
ranking/model failure
feature/data pipeline failure
event logging incident
policy/safety incident
experiment issue
tenant config incident

On-call may be layered:

platform/SRE primary for infra,
RecSys engineer for decision quality,
data engineer for pipeline,
ML owner for model,
T&S/security escalation for critical policy.
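The layered on-call model can be sketched as a routing table from incident area to primary responder. The enum names below are hypothetical; the mappings for event logging, experiment, and tenant config incidents are assumptions, since the text does not assign them to a specific role.

```java
import java.util.Map;

// Incident areas from the on-call list above.
enum IncidentArea {
    SERVING_OUTAGE, MODEL_FAILURE, DATA_PIPELINE_FAILURE, EVENT_LOGGING,
    POLICY_SAFETY, EXPERIMENT, TENANT_CONFIG
}

// On-call layers from the list above.
enum OnCallRole { PLATFORM_SRE, RECSYS_ENGINEER, DATA_ENGINEER, ML_OWNER, TRUST_SAFETY_SECURITY }

final class OnCallRouter {
    private OnCallRouter() {}

    // Primary responder per area. The last three mappings are assumptions.
    private static final Map<IncidentArea, OnCallRole> PRIMARY = Map.of(
            IncidentArea.SERVING_OUTAGE, OnCallRole.PLATFORM_SRE,
            IncidentArea.MODEL_FAILURE, OnCallRole.ML_OWNER,
            IncidentArea.DATA_PIPELINE_FAILURE, OnCallRole.DATA_ENGINEER,
            IncidentArea.POLICY_SAFETY, OnCallRole.TRUST_SAFETY_SECURITY,
            IncidentArea.EVENT_LOGGING, OnCallRole.DATA_ENGINEER,
            IncidentArea.EXPERIMENT, OnCallRole.RECSYS_ENGINEER,
            IncidentArea.TENANT_CONFIG, OnCallRole.RECSYS_ENGINEER);

    static OnCallRole primaryFor(IncidentArea area) {
        OnCallRole role = PRIMARY.get(area);
        if (role == null) {
            throw new IllegalStateException("No on-call owner for " + area);
        }
        return role;
    }

    // Guards against adding a new incident area without assigning a responder.
    static boolean coversEveryArea() {
        for (IncidentArea area : IncidentArea.values()) {
            if (!PRIMARY.containsKey(area)) {
                return false;
            }
        }
        return true;
    }
}
```

The `coversEveryArea` check is the useful part: it turns "no on-call for quality" from a discovery during an incident into a failing test.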

22. Incident Severity

Severity categories:

SEV0/SEV1

cross-tenant leak
policy/safety violation at scale
major surface outage
privacy incident

SEV2

large quality regression
fallback spike
major segment broken
model deploy bad

SEV3

limited source failure
small tenant issue
batch delay

Severity drives response time and stakeholders.
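The severity categories above can be encoded so that triage is consistent across responders. This is a sketch with hypothetical enum names; the source groups SEV0 and SEV1 together, so the sketch does the same, and the paging rule is an assumption rather than a stated policy.

```java
// Severity tiers as listed above; SEV0 and SEV1 are kept as one tier.
enum Severity { SEV0_OR_SEV1, SEV2, SEV3 }

// Incident kinds taken from the severity lists above.
enum IncidentKind {
    CROSS_TENANT_LEAK, POLICY_VIOLATION_AT_SCALE, MAJOR_SURFACE_OUTAGE, PRIVACY_INCIDENT,
    LARGE_QUALITY_REGRESSION, FALLBACK_SPIKE, MAJOR_SEGMENT_BROKEN, BAD_MODEL_DEPLOY,
    LIMITED_SOURCE_FAILURE, SMALL_TENANT_ISSUE, BATCH_DELAY
}

final class SeverityPolicy {
    private SeverityPolicy() {}

    // The switch is exhaustive over the enum, so adding a new incident kind
    // without classifying it fails at compile time.
    static Severity classify(IncidentKind kind) {
        return switch (kind) {
            case CROSS_TENANT_LEAK, POLICY_VIOLATION_AT_SCALE,
                 MAJOR_SURFACE_OUTAGE, PRIVACY_INCIDENT -> Severity.SEV0_OR_SEV1;
            case LARGE_QUALITY_REGRESSION, FALLBACK_SPIKE,
                 MAJOR_SEGMENT_BROKEN, BAD_MODEL_DEPLOY -> Severity.SEV2;
            case LIMITED_SOURCE_FAILURE, SMALL_TENANT_ISSUE, BATCH_DELAY -> Severity.SEV3;
        };
    }

    // Assumption for illustration: everything above SEV3 pages a human immediately.
    static boolean pagesImmediately(Severity severity) {
        return severity != Severity.SEV3;
    }
}
```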

23. Incident Runbooks

Runbooks:

ranker outage
feature store stale
candidate source failure
bad model deploy
policy violation
event logging loss
tenant config issue
index build/publish failure
LLM hallucination incident
cost spike

Runbook includes:

detection,
dashboards,
immediate mitigations,
rollback/kill switches,
owners,
communication,
postmortem checklist.

24. Postmortem Culture

Postmortem should identify:

technical root cause
process gap
observability gap
test gap
ownership gap
preventive actions

Blameless does not mean accountability-free.

Actions should be tracked.

25. Release Management

Release types:

code deploy
model deploy
index publish
feature pipeline change
rule bundle change
tenant config change
experiment ramp
LLM prompt change
fallback list update

All are production changes.

Use release calendar/timeline.

26. Change Freeze / High-Risk Windows

During high-traffic periods:

freeze risky model/rule/config changes,
allow emergency safety fixes,
increase monitoring,
ensure on-call coverage.

Examples:

major sales event,
enterprise go-live,
campaign launch,
holiday peak.
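A freeze policy like the one above can be enforced in the release pipeline rather than by convention. The following sketch assumes hypothetical `FreezeWindow` and `ChangeRequest` types. It freezes model, rule, and tenant config changes inside a window and always lets emergency safety fixes through; treating other change types as unfrozen is an assumption of the sketch.

```java
import java.time.Instant;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

enum ChangeType { CODE_DEPLOY, MODEL_DEPLOY, RULE_BUNDLE_CHANGE, TENANT_CONFIG_CHANGE, EXPERIMENT_RAMP }

// A high-risk period, for example a major sales event. Start inclusive, end exclusive.
record FreezeWindow(String reason, Instant start, Instant end) {
    boolean contains(Instant t) {
        return !t.isBefore(start) && t.isBefore(end);
    }
}

record ChangeRequest(ChangeType type, boolean emergencySafetyFix, Instant plannedAt) {}

final class ChangeFreezePolicy {
    private ChangeFreezePolicy() {}

    // Change types treated as risky during a freeze (model/rule/config, per the list above).
    private static final Set<ChangeType> FROZEN_TYPES = EnumSet.of(
            ChangeType.MODEL_DEPLOY, ChangeType.RULE_BUNDLE_CHANGE, ChangeType.TENANT_CONFIG_CHANGE);

    static boolean isAllowed(ChangeRequest change, List<FreezeWindow> windows) {
        boolean insideFreeze = windows.stream().anyMatch(w -> w.contains(change.plannedAt()));
        if (!insideFreeze || change.emergencySafetyFix()) {
            return true; // outside any window, or an emergency safety fix
        }
        return !FROZEN_TYPES.contains(change.type());
    }
}
```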

27. Governance Board

For a mature org, create a RecSys governance forum.

Participants:

product,
ML,
platform,
data,
experimentation,
trust/safety,
privacy/security,
SRE,
enterprise.

Reviews:

major model launches,
objective changes,
safety trade-offs,
privacy-sensitive features,
tenant customizations,
long-term metrics.

Keep it lightweight but real.

28. Objective Governance

Who decides the trade-offs?

Examples:

CTR vs hide rate
purchase vs return
sponsored revenue vs trust
personalization vs privacy
exploration vs user satisfaction
creator exposure vs user relevance

These are business/product governance questions.

ML team provides evidence; product/governance decides.

29. Roadmap Structure

Roadmap categories:

quality improvements
platform reliability
data quality
evaluation/experimentation
safety/privacy
cost/performance
enterprise/tenant features
developer productivity
observability/debugging

Avoid a roadmap that consists only of “new models”.

Infrastructure investments are quality investments.

30. Prioritization Framework

Score initiatives by:

user/business impact
risk reduction
platform leverage
cost reduction
operational pain
regulatory/security need
customer commitment
engineering complexity

Example:

feature freshness monitoring may beat new neural model

Top engineers prioritize bottlenecks, not shiny models.
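To illustrate how the criteria above might combine, here is a sketch that sums the benefit criteria and divides by complexity. The formula, the 0 to 5 scale, and the example scores are all assumptions chosen for illustration, not a method prescribed by this series.

```java
import java.util.List;

// Each criterion is scored 0-5 by the team. The scale is an assumption.
record Initiative(String name, int impact, int riskReduction, int platformLeverage,
                  int costReduction, int operationalPain, int regulatoryNeed,
                  int customerCommitment, int complexity) {

    // Hypothetical formula: total benefit divided by engineering complexity.
    double score() {
        int benefit = impact + riskReduction + platformLeverage + costReduction
                + operationalPain + regulatoryNeed + customerCommitment;
        return (double) benefit / Math.max(1, complexity);
    }
}

final class Prioritizer {
    private Prioritizer() {}

    // Highest score first.
    static List<Initiative> rank(List<Initiative> initiatives) {
        return initiatives.stream()
                .sorted((a, b) -> Double.compare(b.score(), a.score()))
                .toList();
    }

    // Made-up scores that reproduce the example in the text.
    static List<Initiative> example() {
        return List.of(
                new Initiative("new neural model", 4, 0, 1, 0, 0, 0, 1, 5),
                new Initiative("feature freshness monitoring", 3, 5, 4, 1, 5, 1, 2, 2));
    }
}
```

With these illustrative numbers, `Prioritizer.rank(Prioritizer.example())` places feature freshness monitoring ahead of the new neural model, because it reduces more risk and operational pain at lower complexity.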

31. Platform as Product

RecSys platform has internal customers:

product teams,
ML teams,
enterprise teams,
trust/safety,
analytics.

Treat platform with:

APIs,
docs,
onboarding,
SLAs,
support channel,
roadmap,
versioning,
deprecation policy.

Internal platform quality affects speed.

32. Golden Paths

Provide golden paths:

add a candidate source
add a feature
train a ranker
launch an experiment
deploy a rule bundle
onboard a tenant
debug bad recommendation
rollback model

Golden paths reduce cognitive load and unsafe custom work.

33. Documentation System

Docs should include:

architecture,
service ownership,
data contracts,
feature registry,
model registry,
experiment registry,
runbooks,
onboarding guide,
privacy/security rules,
tenant config guide,
debugging playbook.

Docs must be maintained as artifacts change.

34. Training and Knowledge Sharing

RecSys is interdisciplinary.

Run:

architecture reviews,
model review sessions,
incident reviews,
metric literacy sessions,
feature design reviews,
safety/privacy training,
on-call shadowing,
internal workshops.

This builds a shared mental model.

35. Product-ML Collaboration

Product should understand:

offline vs online metrics,
proxy traps,
experiment design,
guardrails,
latency/cost trade-offs,
segment analysis.

ML should understand:

user journey,
business objective,
surface constraints,
trust/safety,
product strategy.

Weak product-ML collaboration leads to wrong objective.

36. Data Contract Ownership

Event schemas need owners.

If product changes UI but not event contract, training breaks.

Process:

event schema review,
instrumentation validation,
canary client events,
event quality dashboards,
data contract tests.

Product and data teams jointly own feedback loop quality.
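A data contract test can be as small as checking that required fields are present before an event is accepted. The sketch below assumes a hypothetical `EventContract` record and illustrative field names such as `request_id` and `slate_id`; a real contract would also check types and allowed values.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// An owned, versioned event schema reduced to its required fields. The shape is illustrative.
record EventContract(String eventName, int version, String owner, Set<String> requiredFields) {

    // Returns the required fields that are absent from the event, sorted by name.
    List<String> missingFields(Map<String, Object> event) {
        return requiredFields.stream()
                .filter(field -> event.get(field) == null)
                .sorted()
                .toList();
    }

    boolean accepts(Map<String, Object> event) {
        return missingFields(event).isEmpty();
    }
}
```

If a UI change drops `slate_id` from an impression event, `missingFields` reports it in a canary or CI run instead of weeks later as a silent drop in training data.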

37. Support Escalation

Support needs an escalation path for:

bad recommendation report
privacy concern
tenant-specific issue
unsafe recommendation
wrong explanation
enterprise permission issue

Support tool should collect request/slate IDs.

Escalation should route by severity/root cause area.

38. Customer/Tenant Feedback Loop

Enterprise tenants provide domain feedback.

Process:

structured feedback,
label collection,
expert review,
tenant metrics,
roadmap input,
config changes,
model improvement.

Do not rely only on implicit feedback.

39. Human-in-the-Loop Operations

For some domains:

editorial curation,
safety review,
policy approval,
enterprise expert validation,
appeal review.

Recommendation platform should integrate human decisions as data/artifacts, not side spreadsheets.

40. Ownership of Fallbacks

Fallbacks need owners.

Fallback list/policy should have:

owner,
generation pipeline,
validation,
TTL,
monitoring,
emergency update process.

Fallbacks often become visible during incidents; they must be maintained.
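The owner, TTL, and validation requirements above can be checked before a fallback list is served. This is a minimal sketch, assuming a hypothetical `FallbackList` record and a caller-supplied minimum item count.

```java
import java.time.Duration;
import java.time.Instant;

// Metadata for a precomputed fallback list. Field names are illustrative.
record FallbackList(String name, String owner, Instant generatedAt, Duration ttl, int itemCount) {

    // The list is stale once its TTL has elapsed since generation.
    boolean isStale(Instant now) {
        return now.isAfter(generatedAt.plus(ttl));
    }

    // Servable only if it has an owner, is fresh, and holds enough items.
    boolean isServable(Instant now, int minItems) {
        boolean hasOwner = owner != null && !owner.isBlank();
        return hasOwner && !isStale(now) && itemCount >= minItems;
    }
}
```

Alerting on `isStale` before an incident happens is the point: a fallback that was last generated months ago is usually discovered at the worst possible time.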

41. Cost Ownership

Cost should have an owner.

By:

service,
surface,
tenant,
model,
pipeline.

Cost review cadence:

monthly platform cost review
per-experiment cost estimate
tenant profitability review
batch job optimization review

Cost is a product constraint.

42. Quality Review Cadence

Regular reviews:

Daily/Weekly

online health,
experiment status,
incidents,
model quality drift.

Monthly

objective metrics,
cost,
feature/model debt,
safety/privacy review,
tenant health.

Quarterly

architecture roadmap,
platform reliability,
team ownership,
governance updates.

43. Technical Debt Management

RecSys debt examples:

unused features
stale candidate sources
old model routes
unowned dashboards
tenant overrides
manual backfills
duplicated filters
missing tests
unversioned configs

Debt should be tracked and retired.

Otherwise platform slows down.

44. Deprecation Process

Deprecate:

old feature,
old model,
old candidate source,
old config version,
old event schema,
old debug view.

Process:

identify consumers,
announce,
provide migration path,
monitor usage,
remove after window,
archive metadata.

Deprecation is governance.
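The last steps of the process, monitoring usage and removing after the window, reduce to a simple gate. The sketch assumes a hypothetical `Deprecation` record in which consumer counts come from usage monitoring.

```java
import java.time.Duration;
import java.time.Instant;

// Tracks one deprecated artifact through its migration window. Names are illustrative.
record Deprecation(String artifact, Instant announcedAt, Duration migrationWindow, int activeConsumers) {

    boolean windowElapsed(Instant now) {
        return !now.isBefore(announcedAt.plus(migrationWindow));
    }

    // Removal requires both conditions: the window has passed and nothing still reads the artifact.
    boolean safeToRemove(Instant now) {
        return windowElapsed(now) && activeConsumers == 0;
    }
}
```

Requiring both conditions avoids the two common failure cases: removing something on schedule while a model still depends on it, and keeping something forever because nobody checked that usage reached zero.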

45. Scaling Organization

Early stage:

one full-stack RecSys team

Growth:

separate platform + ML + data responsibilities

Mature:

platform teams enable multiple product/tenant teams

Avoid splitting too early, but do not keep one team as bottleneck forever.

46. Team Cognitive Load

RecSys spans many domains.

Reduce cognitive load with:

clear service boundaries,
golden paths,
self-service tools,
strong contracts,
docs,
platform abstractions,
templates,
automated validation.

If every change requires understanding the whole stack, the team cannot scale.

47. Team Topologies Lens

Useful team modes:

Stream-Aligned Teams

Own product surface/user outcome.

Platform Teams

Provide RecSys capabilities.

Complicated Subsystem Teams

Own ranking/retrieval/ML complexity.

Enabling Teams

Help teams adopt experimentation, observability, privacy, etc.

Use these modes deliberately.

48. Common Operating Failure Modes

48.1 No Clear Owner

Problems linger.

48.2 ML Throws Model Over Wall

Serving/ops breaks.

48.3 Platform Bottleneck

Every change waits.

48.4 Product Optimizes Proxy Alone

Trust/long-term harm.

48.5 No Data Owner

Feature/event quality degrades.

48.6 Experiments Without Guardrails

Bad launches.

48.7 Tenant Config Sprawl

Unmaintainable enterprise.

48.8 No On-Call for Quality

Bad recommendations persist.

48.9 Governance Too Heavy

Innovation stops.

48.10 Governance Too Light

Incidents increase.

49. Implementation Sketch: Artifact Ownership Registry

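import java.time.Instant;

// One registry entry per owned artifact (model, feature, candidate source, config, ...).
public record ArtifactOwnership(
    String artifactType,
    String artifactName,
    String version,
    String owningTeam,
    String primaryOwner,
    String escalationChannel,   // where incidents for this artifact are routed
    String runbookUrl,
    Instant lastReviewedAt      // when ownership was last confirmed
) {}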

Track ownership for models/features/sources/configs.
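A registry is only useful if it is audited. The sketch below repeats the record so that it is self-contained and adds a hypothetical `OwnershipAudit` helper that flags entries with no owning team, no runbook, or a review older than a chosen maximum age; the threshold is an assumption left to the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Same shape as the registry record above, repeated for a self-contained example.
record ArtifactOwnership(String artifactType, String artifactName, String version,
                         String owningTeam, String primaryOwner, String escalationChannel,
                         String runbookUrl, Instant lastReviewedAt) {}

final class OwnershipAudit {
    private OwnershipAudit() {}

    // Returns "type/name" for every entry that is unowned, lacks a runbook,
    // or has not been reviewed within maxReviewAge.
    static List<String> needsAttention(List<ArtifactOwnership> registry,
                                       Instant now, Duration maxReviewAge) {
        return registry.stream()
                .filter(a -> isBlank(a.owningTeam())
                        || isBlank(a.runbookUrl())
                        || a.lastReviewedAt() == null
                        || a.lastReviewedAt().plus(maxReviewAge).isBefore(now))
                .map(a -> a.artifactType() + "/" + a.artifactName())
                .toList();
    }

    private static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }
}
```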

50. Implementation Sketch: Launch Review Checklist

launch_review:
  artifact: home_ranker_v14
  owners:
    ml: recsys-ranking
    platform: recsys-serving
    product: home-feed
  required_checks:
    - offline_metrics_passed
    - segment_metrics_passed
    - feature_compatibility_passed
    - latency_budget_passed
    - privacy_review_passed
    - safety_guardrails_defined
    - experiment_plan_approved
    - rollback_plan_ready
    - dashboards_ready
    - oncall_notified

Make launch readiness explicit.
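A checklist like this can gate a deployment once it is evaluated by code. The following sketch assumes a hypothetical `LaunchReview` record in which `requiredChecks` comes from the checklist and `passedChecks` is reported by the review tooling.

```java
import java.util.List;
import java.util.Set;

// A launch review: which checks are required and which have passed so far.
record LaunchReview(String artifact, List<String> requiredChecks, Set<String> passedChecks) {

    // Required checks that have not passed yet, in checklist order.
    List<String> missingChecks() {
        return requiredChecks.stream()
                .filter(check -> !passedChecks.contains(check))
                .toList();
    }

    // The launch is ready only when nothing is missing.
    boolean isReady() {
        return missingChecks().isEmpty();
    }
}
```

A deployment pipeline could call `isReady()` and print `missingChecks()` on failure, so a blocked launch states exactly which review items remain open.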

51. Minimal Production Operating Model

Start with:

ownership:
  artifact_owner_required: true
  service_owner_required: true
reviews:
  model_launch_review: true
  experiment_review: true
  feature_review: true
  rule_config_review: true
operations:
  recsys_oncall: true
  incident_runbooks: true
  postmortem_process: true
governance:
  objective_review: monthly
  privacy_safety_review: required_for_sensitive_changes
platform:
  golden_paths:
    - add_feature
    - train_model
    - launch_experiment
    - debug_bad_recommendation
observability:
  dashboard_owners: true
  alert_owners: true
cost:
  surface_tenant_cost_review: monthly

This creates sustainable production discipline.

52. Checklist: Operating Model and Team Topology Readiness

[ ] Service ownership is documented.
[ ] Artifact ownership is documented.
[ ] Candidate/ranking/feature/policy/experiment owners are clear.
[ ] RACI exists for key decisions.
[ ] Model review process exists.
[ ] Experiment review process exists.
[ ] Feature review process exists.
[ ] Rule/config review process exists.
[ ] Tenant config review process exists.
[ ] On-call covers online/data/model/policy incidents.
[ ] Runbooks exist for major failure modes.
[ ] Postmortem process exists.
[ ] Objective trade-off governance exists.
[ ] Privacy/safety/security review is integrated.
[ ] Platform golden paths exist.
[ ] Support escalation path exists.
[ ] Cost ownership and review exist.
[ ] Deprecation process exists.
[ ] Documentation is maintained.
[ ] Team topology reduces cognitive load.

53. Conclusion

The operating model and team topology determine whether a recommendation platform can grow sustainably.

Key principles:

RecSys is a socio-technical system.
Every service/artifact/metric/config needs an owner.
Platform and product responsibilities must be explicit.
Model, feature, rule, experiment, and tenant config changes need review.
On-call must cover decision quality, not only uptime.
Governance should manage objective, privacy, safety, and cost trade-offs.
Golden paths reduce cognitive load and unsafe customization.
Enterprise operations need tenant onboarding/config/support process.
Cost and technical debt need ownership.
Organization design is part of system design.

This part closes Module 9: Governance, Safety, Security, and Enterprise Constraints.

In Part 075, we move into Module 10: Build From Scratch Implementation Tracks, starting with the Minimum Production Skeleton: a minimal project/service/pipeline blueprint that can serve as the foundation for a real implementation.
