title: Build From Scratch Recommendations System - Part 074
description: Designing the operating model and team topology for a production-grade recommendation platform: ownership, team boundaries, platform vs product teams, ML/data/platform collaboration, governance, on-call, incident process, model review, experimentation review, roadmap, and organizational scaling.
series: learn-build-from-scratch-recommendations-system
seriesTitle: Build From Scratch: Enterprise Recommendations System
order: 74
partTitle: Operating Model and Team Topology
tags:
- recommendation-system
- recsys
- operating-model
- team-topology
- mlops
- governance
- series
date: 2026-07-02
Part 074 — Operating Model and Team Topology
A production-grade recommendation system cannot be run by a single “ML engineer hero”.
It requires collaboration across:
- product,
- ML/research,
- backend/platform,
- data engineering,
- MLOps,
- infra/SRE,
- experimentation,
- trust & safety,
- privacy/security,
- domain experts,
- support,
- enterprise/customer success,
- analytics,
- governance.
Without a clear operating model, the system fails organizationally:
- nobody knows who owns a feature,
- models are deployed without review,
- experiments run without guardrails,
- data pipelines break with no one on call,
- bad recommendations have no root-cause owner,
- business rules change without audit,
- costs explode without accountability,
- tenant configs drift,
- safety incidents are handled slowly.
This part covers the operating model and team topology for an enterprise-grade recommendation platform: ownership, team boundaries, platform vs product responsibilities, model governance, experimentation process, on-call, incident response, roadmap, review rituals, and organizational scaling.
1. Mental Model: RecSys Is a Socio-Technical System
A recommendation platform consists of:
software
data
models
policies
experiments
metrics
people
processes
ownership
governance
Technology without an operating model becomes fragile.
Operating model questions:
Who owns candidate source?
Who owns ranking model?
Who owns feature pipeline?
Who approves policy changes?
Who responds to incidents?
Who manages tenant configs?
Who pays for cost?
Who decides objective trade-offs?
If the answers are unclear, production quality drops.
2. Team Topology Goals
Good topology should provide:
- clear ownership,
- fast iteration,
- safe deployment,
- reusable platform,
- product alignment,
- operational excellence,
- governance,
- scalable collaboration,
- reduced cognitive load.
Avoid:
- every product team rebuilding RecSys,
- centralized bottleneck for every small change,
- no owner for data/feature quality,
- ML team shipping without platform support,
- platform team owning product metrics alone.
3. Core Team Types
Typical teams:
Recommendation Product Team
Ranking / ML Team
Candidate Retrieval Team
RecSys Platform Team
Feature/Data Platform Team
Experimentation Platform Team
Trust & Safety / Policy Team
Privacy/Security Team
SRE / Infrastructure Team
Analytics / Measurement Team
Enterprise Configuration / Solutions Team
In a small org, one team may cover multiple roles, but the responsibilities still need to be clear.
4. Platform vs Product Responsibilities
Product/SDS Team
Owns:
- user experience,
- surface objective,
- product metrics,
- launch decisions,
- business trade-offs,
- UI integration.
RecSys Platform
Owns:
- serving infrastructure,
- candidate/ranking framework,
- model lifecycle,
- feature integration,
- observability,
- reliability,
- reusable components.
Both must collaborate.
5. ML/Ranking Team Responsibilities
Owns:
- ranking objectives,
- model training,
- feature selection,
- offline evaluation,
- calibration,
- model registry metadata,
- model monitoring,
- retraining strategy,
- model quality incidents.
Should not alone own:
- privacy policy,
- business trade-off final decisions,
- frontend experience,
- all infrastructure.
6. Candidate Retrieval Team Responsibilities
Owns:
- candidate source design,
- vector retrieval,
- item-to-item/graph/content sources,
- ANN indexes,
- source recall/latency/cost,
- source contribution metrics,
- source fallback.
Candidate source quality is not the ranking team’s problem alone.
7. Feature/Data Team Responsibilities
Owns:
- event tracking contracts,
- clean/curated datasets,
- feature pipelines,
- feature registry,
- feature quality/freshness,
- dataset builder,
- lineage/backfill,
- data incident response.
Feature quality is production quality.
8. MLOps/Platform Responsibilities
Owns:
- training orchestration,
- model registry,
- deployment pipeline,
- shadow/canary,
- rollback,
- artifact storage,
- monitoring frameworks,
- serving runtime.
MLOps makes ML deployable.
9. Experimentation/Analytics Responsibilities
Owns:
- experiment platform,
- metric definitions,
- power/sample guidance,
- analysis pipeline,
- guardrail dashboards,
- experiment registry,
- result review.
Recommendation experiments are complex; measurement needs ownership.
10. Trust & Safety / Policy Responsibilities
Owns:
- policy taxonomy,
- recommendability rules,
- safety classifier thresholds,
- moderation workflow,
- appeals,
- safety guardrails,
- safety incident response,
- policy rule approval.
ML/engineering implements, but policy owners define policy.
11. Privacy/Security Responsibilities
Owns:
- privacy modes,
- consent requirements,
- data access policy,
- tenant isolation review,
- security review,
- audit requirements,
- deletion/retention governance,
- sensitive feature review.
Engineering builds enforcement.
12. SRE/Infrastructure Responsibilities
Owns:
- service reliability,
- capacity planning,
- autoscaling,
- on-call practices,
- incident management,
- infrastructure cost visibility,
- deployment reliability,
- observability platform.
RecSys-specific SLOs need collaboration with RecSys teams.
13. Enterprise Solutions Team
For enterprise RecSys:
Owns:
- tenant onboarding,
- tenant configuration,
- customer-specific constraints,
- tenant rollout coordination,
- support escalation,
- feedback collection,
- tenant SLA management.
Should not fork core platform per tenant.
14. RACI Matrix
Define RACI:
Responsible
Accountable
Consulted
Informed
Example:
| Decision | Responsible | Accountable | Consulted |
|---|---|---|---|
| New ranker model | ML Team | RecSys Lead | Product, Experimentation, SRE |
| New safety rule | T&S | Policy Lead | ML, Platform, Legal |
| Feature pipeline change | Data Team | Data Lead | ML, Platform |
| Tenant config change | Enterprise Config Team | Tenant Owner | Security, Product |
| Experiment launch | Product/ML | Product Owner | Analytics, T&S |
| Model rollback | On-call RecSys | RecSys Owner | Product, SRE |
Explicit RACI prevents confusion.
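A RACI table is easier to enforce when it lives as data rather than only in a document. The sketch below is a minimal illustration, assuming a hypothetical `RaciRegistry` class and made-up decision keys such as `model_rollback`; the rows are copied from the example table above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a few rows of the RACI table expressed as data.
record RaciEntry(String responsible, String accountable, List<String> consulted) {}

final class RaciRegistry {
    // Decision keys are illustrative names, not an established convention.
    private static final Map<String, RaciEntry> ENTRIES = Map.of(
        "new_ranker_model",
        new RaciEntry("ML Team", "RecSys Lead", List.of("Product", "Experimentation", "SRE")),
        "new_safety_rule",
        new RaciEntry("T&S", "Policy Lead", List.of("ML", "Platform", "Legal")),
        "model_rollback",
        new RaciEntry("On-call RecSys", "RecSys Owner", List.of("Product", "SRE"))
    );

    // An empty result means the decision type has no RACI row yet,
    // which is itself an ownership gap worth fixing.
    static Optional<RaciEntry> lookup(String decision) {
        return Optional.ofNullable(ENTRIES.get(decision));
    }
}
```

A deployment tool could call `RaciRegistry.lookup("model_rollback")` to decide whom to page and whom to notify, and refuse to proceed when the lookup is empty.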
15. Ownership by Artifact
Every artifact should have an owner.
Artifacts:
event schema
feature
feature set
candidate source
ranker model
embedding/index
rule bundle
slate policy
utility policy
experiment
tenant config
fallback list
dashboard
alert
runbook
The owner is responsible for the artifact's quality, monitoring, and lifecycle.
16. Model Review Process
Before production model:
Review:
- objective,
- training dataset,
- feature set,
- offline metrics,
- segment metrics,
- calibration,
- guardrails,
- latency/cost,
- privacy/safety implications,
- deployment plan,
- rollback plan.
Record the review output in the model registry.
No review, no production.
17. Experiment Review Process
Before experiment:
Review:
- hypothesis,
- treatment details,
- primary metric,
- guardrails,
- randomization unit,
- sample size/duration,
- exposure logging,
- safety/privacy impact,
- cost impact,
- rollback plan.
High-risk experiments require stronger approval.
18. Feature Review Process
Before new feature:
Review:
- definition,
- owner,
- source data,
- point-in-time safety,
- freshness SLA,
- default/missing policy,
- privacy class,
- online availability,
- monitoring,
- cost,
- downstream models.
Feature review prevents feature debt.
19. Rule/Policy Review Process
Before rule change:
Review:
- policy intent,
- scope,
- severity,
- expected impact,
- conflict with existing rules,
- expiry,
- test cases,
- rollout plan,
- rollback,
- metrics.
Rules can be as impactful as model deploys.
20. Tenant Config Review Process
Before tenant config activation:
Review:
- tenant scope,
- data policy,
- model route,
- rule bundle,
- feature availability,
- fallback,
- quotas,
- LLM settings,
- data residency,
- approval.
Tenant config should be treated as deployment.
21. On-Call Model
RecSys needs on-call for:
online serving outage
ranking/model failure
feature/data pipeline failure
event logging incident
policy/safety incident
experiment issue
tenant config incident
On-call may be layered:
- platform/SRE primary for infra,
- RecSys engineer for decision quality,
- data engineer for pipeline,
- ML owner for model,
- T&S/security escalation for critical policy.
22. Incident Severity
Severity categories:
SEV0/SEV1
- cross-tenant leak,
- policy/safety violation at scale,
- major surface outage,
- privacy incident.
SEV2
- large quality regression,
- fallback spike,
- major segment broken,
- bad model deploy.
SEV3
- limited source failure,
- small tenant issue,
- batch delay.
Severity drives response time and stakeholders.
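The severity categories above can be encoded so that triage is consistent across on-call engineers. This is a minimal sketch, assuming hypothetical `IncidentType` and `SeverityClassifier` names; the mapping mirrors the list above, and the "take the most severe symptom" rule is an assumption rather than something the series prescribes.

```java
import java.util.Collection;
import java.util.Comparator;

// Declared from most to least severe, so enum order doubles as severity order.
enum Severity { SEV0_SEV1, SEV2, SEV3 }

enum IncidentType {
    CROSS_TENANT_LEAK(Severity.SEV0_SEV1),
    POLICY_VIOLATION_AT_SCALE(Severity.SEV0_SEV1),
    MAJOR_SURFACE_OUTAGE(Severity.SEV0_SEV1),
    PRIVACY_INCIDENT(Severity.SEV0_SEV1),
    LARGE_QUALITY_REGRESSION(Severity.SEV2),
    FALLBACK_SPIKE(Severity.SEV2),
    MAJOR_SEGMENT_BROKEN(Severity.SEV2),
    BAD_MODEL_DEPLOY(Severity.SEV2),
    LIMITED_SOURCE_FAILURE(Severity.SEV3),
    SMALL_TENANT_ISSUE(Severity.SEV3),
    BATCH_DELAY(Severity.SEV3);

    private final Severity severity;

    IncidentType(Severity severity) { this.severity = severity; }

    Severity severity() { return severity; }
}

final class SeverityClassifier {
    // Assumption: an incident with several symptoms takes the most severe level among them.
    static Severity classify(Collection<IncidentType> symptoms) {
        return symptoms.stream()
            .map(IncidentType::severity)
            .min(Comparator.naturalOrder())
            .orElseThrow(() -> new IllegalArgumentException("at least one symptom is required"));
    }
}
```

Response-time targets and stakeholder lists can then hang off `Severity` instead of being decided ad hoc during the incident.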
23. Incident Runbooks
Runbooks:
ranker outage
feature store stale
candidate source failure
bad model deploy
policy violation
event logging loss
tenant config issue
index build/publish failure
LLM hallucination incident
cost spike
Runbook includes:
- detection,
- dashboards,
- immediate mitigations,
- rollback/kill switches,
- owners,
- communication,
- postmortem checklist.
24. Postmortem Culture
Postmortem should identify:
technical root cause
process gap
observability gap
test gap
ownership gap
preventive actions
Blameless does not mean accountability-free.
Actions should be tracked.
25. Release Management
Release types:
code deploy
model deploy
index publish
feature pipeline change
rule bundle change
tenant config change
experiment ramp
LLM prompt change
fallback list update
All are production changes.
Use a release calendar or timeline so these changes are visible in one place.
26. Change Freeze / High-Risk Windows
During high-traffic periods:
- freeze risky model/rule/config changes,
- allow emergency safety fixes,
- increase monitoring,
- ensure on-call coverage.
Examples:
- major sales event,
- enterprise go-live,
- campaign launch,
- holiday peak.
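A freeze policy is most useful when the deployment pipeline can check it automatically. The sketch below is one possible shape, assuming hypothetical `ChangeKind`, `FreezeWindow`, and `ChangeFreezePolicy` types; it blocks risky changes inside a window and always lets emergency safety fixes through, as described above.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical change categories; EMERGENCY_SAFETY_FIX is the only one exempt from freezes.
enum ChangeKind { MODEL_DEPLOY, RULE_BUNDLE_CHANGE, TENANT_CONFIG_CHANGE, EMERGENCY_SAFETY_FIX }

record FreezeWindow(String name, Instant start, Instant end) {
    // Start is inclusive, end is exclusive.
    boolean contains(Instant t) {
        return !t.isBefore(start) && t.isBefore(end);
    }
}

final class ChangeFreezePolicy {
    private final List<FreezeWindow> windows;

    ChangeFreezePolicy(List<FreezeWindow> windows) {
        this.windows = List.copyOf(windows);
    }

    // Risky changes are rejected while any freeze window is active.
    boolean isAllowed(ChangeKind kind, Instant at) {
        if (kind == ChangeKind.EMERGENCY_SAFETY_FIX) {
            return true;
        }
        return windows.stream().noneMatch(w -> w.contains(at));
    }
}
```

Keeping the windows as data means a sales event or enterprise go-live can be added to the release calendar without touching pipeline code.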
27. Governance Board
For a mature org, create a RecSys governance forum.
Participants:
- product,
- ML,
- platform,
- data,
- experimentation,
- trust/safety,
- privacy/security,
- SRE,
- enterprise.
Reviews:
- major model launches,
- objective changes,
- safety trade-offs,
- privacy-sensitive features,
- tenant customizations,
- long-term metrics.
Keep it lightweight but real.
28. Objective Governance
Who decides the trade-offs?
Examples:
CTR vs hide rate
purchase vs return
sponsored revenue vs trust
personalization vs privacy
exploration vs user satisfaction
creator exposure vs user relevance
These are business/product governance questions.
ML team provides evidence; product/governance decides.
29. Roadmap Structure
Roadmap categories:
quality improvements
platform reliability
data quality
evaluation/experimentation
safety/privacy
cost/performance
enterprise/tenant features
developer productivity
observability/debugging
Avoid a roadmap that contains only “new models”.
Infrastructure investments are quality investments.
30. Prioritization Framework
Score initiatives by:
user/business impact
risk reduction
platform leverage
cost reduction
operational pain
regulatory/security need
customer commitment
engineering complexity
Example:
feature freshness monitoring may beat new neural model
Top engineers prioritize bottlenecks, not shiny models.
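One way to make this framework concrete is a simple score per initiative. The sketch below is an assumption-heavy illustration: the `Initiative` record, the 1 to 5 scales, the equal weighting, and the "benefit divided by complexity" formula are all hypothetical choices, and it uses only a subset of the criteria listed above.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical scoring inputs, each on a 1 to 5 scale.
record Initiative(String name, int impact, int riskReduction,
                  int platformLeverage, int operationalPain, int complexity) {
    Initiative {
        if (complexity < 1) {
            throw new IllegalArgumentException("complexity must be at least 1");
        }
    }

    // Assumed formula: summed benefit divided by engineering complexity.
    double score() {
        return (impact + riskReduction + platformLeverage + operationalPain) / (double) complexity;
    }
}

final class Prioritizer {
    // Highest score first.
    static List<Initiative> rank(List<Initiative> items) {
        return items.stream()
            .sorted(Comparator.<Initiative>comparingDouble(Initiative::score).reversed())
            .toList();
    }
}
```

With made-up inputs such as `new Initiative("feature_freshness_monitoring", 3, 5, 4, 5, 2)` and `new Initiative("new_neural_ranker", 5, 1, 2, 1, 5)`, the monitoring work ranks first, which matches the example in the text. The numbers are illustrative only; the value is in forcing the trade-off to be written down.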
31. Platform as Product
RecSys platform has internal customers:
- product teams,
- ML teams,
- enterprise teams,
- trust/safety,
- analytics.
Treat platform with:
- APIs,
- docs,
- onboarding,
- SLAs,
- support channel,
- roadmap,
- versioning,
- deprecation policy.
Internal platform quality affects speed.
32. Golden Paths
Provide golden paths:
add a candidate source
add a feature
train a ranker
launch an experiment
deploy a rule bundle
onboard a tenant
debug bad recommendation
rollback model
Golden paths reduce cognitive load and unsafe custom work.
33. Documentation System
Docs should include:
- architecture,
- service ownership,
- data contracts,
- feature registry,
- model registry,
- experiment registry,
- runbooks,
- onboarding guide,
- privacy/security rules,
- tenant config guide,
- debugging playbook.
Docs must be maintained as artifacts change.
34. Training and Knowledge Sharing
RecSys is interdisciplinary.
Run:
- architecture reviews,
- model review sessions,
- incident reviews,
- metric literacy sessions,
- feature design reviews,
- safety/privacy training,
- on-call shadowing,
- internal workshops.
This builds a shared mental model.
35. Product-ML Collaboration
Product should understand:
- offline vs online metrics,
- proxy traps,
- experiment design,
- guardrails,
- latency/cost trade-offs,
- segment analysis.
ML should understand:
- user journey,
- business objective,
- surface constraints,
- trust/safety,
- product strategy.
Weak product-ML collaboration leads to optimizing the wrong objective.
36. Data Contract Ownership
Event schemas need owners.
If product changes the UI without updating the event contract, training data silently breaks.
Process:
- event schema review,
- instrumentation validation,
- canary client events,
- event quality dashboards,
- data contract tests.
Product and data teams jointly own feedback loop quality.
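A data contract test can be as small as a function that checks required fields and types on a sample event. The sketch below assumes a hypothetical impression event with `request_id`, `slate_id`, `item_id`, and `position` fields; the field names and the `EventContract` class are illustrative, not the series' actual schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical contract entry: a field name and the Java type it must carry.
record RequiredField(String name, Class<?> type) {}

final class EventContract {
    // Assumed contract for an impression event.
    static final List<RequiredField> IMPRESSION_FIELDS = List.of(
        new RequiredField("request_id", String.class),
        new RequiredField("slate_id", String.class),
        new RequiredField("item_id", String.class),
        new RequiredField("position", Integer.class)
    );

    // Returns human-readable violations; an empty list means the event honors the contract.
    static List<String> violations(Map<String, Object> event) {
        List<String> out = new ArrayList<>();
        for (RequiredField f : IMPRESSION_FIELDS) {
            Object value = event.get(f.name());
            if (value == null) {
                out.add("missing field: " + f.name());
            } else if (!f.type().isInstance(value)) {
                out.add("wrong type for " + f.name() + ": expected " + f.type().getSimpleName());
            }
        }
        return out;
    }
}
```

Running such a check against canary client events before a UI release is one way to catch a broken contract before it reaches training data.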
37. Support Escalation
Support needs an escalation path for:
bad recommendation report
privacy concern
tenant-specific issue
unsafe recommendation
wrong explanation
enterprise permission issue
The support tool should collect request/slate IDs so engineers can reproduce the exact recommendation.
Escalation should route by severity and root-cause area.
38. Customer/Tenant Feedback Loop
Enterprise tenants provide domain feedback.
Process:
- structured feedback,
- label collection,
- expert review,
- tenant metrics,
- roadmap input,
- config changes,
- model improvement.
Do not rely only on implicit feedback.
39. Human-in-the-Loop Operations
For some domains:
- editorial curation,
- safety review,
- policy approval,
- enterprise expert validation,
- appeal review.
Recommendation platform should integrate human decisions as data/artifacts, not side spreadsheets.
40. Ownership of Fallbacks
Fallbacks need owners.
Fallback list/policy should have:
- owner,
- generation pipeline,
- validation,
- TTL,
- monitoring,
- emergency update process.
Fallbacks often become visible during incidents; they must be maintained.
41. Cost Ownership
Cost should have an owner.
Attribute cost by:
- service,
- surface,
- tenant,
- model,
- pipeline.
Cost review cadence:
monthly platform cost review
per-experiment cost estimate
tenant profitability review
batch job optimization review
Cost is a product constraint.
42. Quality Review Cadence
Regular reviews:
Daily/Weekly
- online health,
- experiment status,
- incidents,
- model quality drift.
Monthly
- objective metrics,
- cost,
- feature/model debt,
- safety/privacy review,
- tenant health.
Quarterly
- architecture roadmap,
- platform reliability,
- team ownership,
- governance updates.
43. Technical Debt Management
RecSys debt examples:
unused features
stale candidate sources
old model routes
unowned dashboards
tenant overrides
manual backfills
duplicated filters
missing tests
unversioned configs
Debt should be tracked and retired.
Otherwise the platform slows down.
44. Deprecation Process
Deprecate:
- old feature,
- old model,
- old candidate source,
- old config version,
- old event schema,
- old debug view.
Process:
- identify consumers,
- announce,
- provide migration path,
- monitor usage,
- remove after window,
- archive metadata.
Deprecation is governance.
45. Scaling Organization
Early stage:
one full-stack RecSys team
Growth:
separate platform + ML + data responsibilities
Mature:
platform teams enable multiple product/tenant teams
Avoid splitting too early, but do not keep one team as bottleneck forever.
46. Team Cognitive Load
RecSys spans many domains.
Reduce cognitive load with:
- clear service boundaries,
- golden paths,
- self-service tools,
- strong contracts,
- docs,
- platform abstractions,
- templates,
- automated validation.
If every change requires understanding the whole stack, the team cannot scale.
47. Team Topologies Lens
Useful team modes:
Stream-Aligned Teams
Own product surface/user outcome.
Platform Teams
Provide RecSys capabilities.
Complicated Subsystem Teams
Own ranking/retrieval/ML complexity.
Enabling Teams
Help teams adopt experimentation, observability, privacy, etc.
Choose these modes consciously rather than letting them emerge by accident.
48. Common Operating Failure Modes
48.1 No Clear Owner
Problems linger.
48.2 ML Throws Model Over Wall
Serving/ops breaks.
48.3 Platform Bottleneck
Every change waits.
48.4 Product Optimizes Proxy Alone
Trust/long-term harm.
48.5 No Data Owner
Feature/event quality degrades.
48.6 Experiments Without Guardrails
Bad launches.
48.7 Tenant Config Sprawl
Unmaintainable enterprise.
48.8 No On-Call for Quality
Bad recommendations persist.
48.9 Governance Too Heavy
Innovation stops.
48.10 Governance Too Light
Incidents increase.
49. Implementation Sketch: Artifact Ownership Registry
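```java
import java.time.Instant;

// One row per production artifact: who owns it, how to escalate, when it was last reviewed.
public record ArtifactOwnership(
    String artifactType,
    String artifactName,
    String version,
    String owningTeam,
    String primaryOwner,
    String escalationChannel,
    String runbookUrl,
    Instant lastReviewedAt
) {}
```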
Track ownership for models/features/sources/configs.
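An ownership registry only helps if gaps are detected. The sketch below is a hypothetical audit over such records: the `OwnershipAudit` class, the gap messages, and the review-age threshold are assumptions. The record is repeated (without `public`) so the example compiles on its own.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Same shape as the ownership record sketched above, repeated to keep this example self-contained.
record ArtifactOwnership(
    String artifactType,
    String artifactName,
    String version,
    String owningTeam,
    String primaryOwner,
    String escalationChannel,
    String runbookUrl,
    Instant lastReviewedAt
) {}

final class OwnershipAudit {
    private static boolean blank(String s) {
        return s == null || s.isBlank();
    }

    // Flags artifacts with no owning team, no runbook, or a review older than maxReviewAge.
    static List<String> findGaps(List<ArtifactOwnership> artifacts, Instant now, Duration maxReviewAge) {
        List<String> gaps = new ArrayList<>();
        for (ArtifactOwnership a : artifacts) {
            if (blank(a.owningTeam())) {
                gaps.add(a.artifactName() + ": no owning team");
            }
            if (blank(a.runbookUrl())) {
                gaps.add(a.artifactName() + ": no runbook");
            }
            if (a.lastReviewedAt() == null
                    || Duration.between(a.lastReviewedAt(), now).compareTo(maxReviewAge) > 0) {
                gaps.add(a.artifactName() + ": review overdue");
            }
        }
        return gaps;
    }
}
```

Running this audit on a schedule and posting the result to the owning teams turns the requirement that every artifact has an owner into something that is checked rather than hoped for.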
50. Implementation Sketch: Launch Review Checklist
```yaml
launch_review:
  artifact: home_ranker_v14
  owners:
    ml: recsys-ranking
    platform: recsys-serving
    product: home-feed
  required_checks:
    - offline_metrics_passed
    - segment_metrics_passed
    - feature_compatibility_passed
    - latency_budget_passed
    - privacy_review_passed
    - safety_guardrails_defined
    - experiment_plan_approved
    - rollback_plan_ready
    - dashboards_ready
    - oncall_notified
```
Make launch readiness explicit.
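A checklist like this can back an automated gate in the deployment pipeline. The sketch below is a minimal illustration, assuming a hypothetical `LaunchGate` class; the check names are taken from the checklist above, while how each check is marked as passed is left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class LaunchGate {
    // Check names copied from the launch review checklist.
    static final List<String> REQUIRED_CHECKS = List.of(
        "offline_metrics_passed",
        "segment_metrics_passed",
        "feature_compatibility_passed",
        "latency_budget_passed",
        "privacy_review_passed",
        "safety_guardrails_defined",
        "experiment_plan_approved",
        "rollback_plan_ready",
        "dashboards_ready",
        "oncall_notified"
    );

    // Returns the required checks that have not passed, in checklist order.
    static Set<String> missingChecks(Set<String> passed) {
        Set<String> missing = new LinkedHashSet<>(REQUIRED_CHECKS);
        missing.removeAll(passed);
        return missing;
    }

    // A launch is ready only when nothing is missing.
    static boolean readyToLaunch(Set<String> passed) {
        return missingChecks(passed).isEmpty();
    }
}
```

Returning the missing checks, not just a boolean, gives the launching team an actionable message instead of a bare rejection.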
51. Minimal Production Operating Model
Start with:
```yaml
ownership:
  artifact_owner_required: true
  service_owner_required: true
reviews:
  model_launch_review: true
  experiment_review: true
  feature_review: true
  rule_config_review: true
operations:
  recsys_oncall: true
  incident_runbooks: true
  postmortem_process: true
governance:
  objective_review: monthly
  privacy_safety_review: required_for_sensitive_changes
platform:
  golden_paths:
    - add_feature
    - train_model
    - launch_experiment
    - debug_bad_recommendation
observability:
  dashboard_owners: true
  alert_owners: true
cost:
  surface_tenant_cost_review: monthly
```
This creates sustainable production discipline.
52. Checklist: Operating Model and Team Topology Readiness
[ ] Service ownership is documented.
[ ] Artifact ownership is documented.
[ ] Candidate/ranking/feature/policy/experiment owners are clear.
[ ] RACI exists for key decisions.
[ ] Model review process exists.
[ ] Experiment review process exists.
[ ] Feature review process exists.
[ ] Rule/config review process exists.
[ ] Tenant config review process exists.
[ ] On-call covers online/data/model/policy incidents.
[ ] Runbooks exist for major failure modes.
[ ] Postmortem process exists.
[ ] Objective trade-off governance exists.
[ ] Privacy/safety/security review is integrated.
[ ] Platform golden paths exist.
[ ] Support escalation path exists.
[ ] Cost ownership and review exist.
[ ] Deprecation process exists.
[ ] Documentation is maintained.
[ ] Team topology reduces cognitive load.
53. Conclusion
The operating model and team topology determine whether a recommendation platform can grow sustainably.
Key principles:
- RecSys is a socio-technical system.
- Every service/artifact/metric/config needs owner.
- Platform and product responsibilities must be explicit.
- Model, feature, rule, experiment, and tenant config changes need review.
- On-call must cover decision quality, not only uptime.
- Governance should manage objective, privacy, safety, and cost trade-offs.
- Golden paths reduce cognitive load and unsafe customization.
- Enterprise operations need tenant onboarding/config/support process.
- Cost and technical debt need ownership.
- Organization design is part of system design.
This part closes Module 9: Governance, Safety, Security, and Enterprise Constraints.
In Part 075, we enter Module 10: Build From Scratch Implementation Tracks, starting with the Minimum Production Skeleton: a minimal project/service/pipeline blueprint that can serve as the foundation for a real implementation.