
title: Learn AWS Engineering Mastery - Part 032
description: Enterprise SaaS and multi-tenant architecture on AWS through tenant isolation, pooled/siloed/bridge models, tenant lifecycle, entitlement, metering, noisy-neighbor control, and cell-based architecture.
series: learn-aws
seriesTitle: Learn AWS Engineering Mastery
order: 32
partTitle: Enterprise SaaS, Multitenancy, and Cell-Based Architecture
tags:

  • aws
  • cloud
  • architecture
  • saas
  • multitenancy
  • tenant-isolation
  • cell-based-architecture
  • platform-engineering

date: 2026-07-01

Learn AWS Engineering Mastery - Part 032

Enterprise SaaS, Multitenancy, and Cell-Based Architecture

SaaS architecture is not just “one application used by many customers.”

A serious SaaS platform must solve tenant identity, tenant isolation, provisioning, entitlement, metering, billing, observability, deployment, noisy-neighbor control, compliance evidence, lifecycle management, and failure containment.

A senior AWS engineer does not begin with:

Should we put all tenants in one database or separate databases?

The better question is:

What tenant boundary must be enforced for security, operations, performance, cost, compliance, and product experience?

This part teaches how to design SaaS and multi-tenant systems on AWS using tenant isolation, pooled/siloed/bridge deployment models, and cell-based architecture.


1. Target Skill

After this part, you should be able to:

  • define tenant, tenant context, tenant isolation, tenant partitioning, tenant lifecycle, and tenant control plane;
  • distinguish silo, pool, and bridge SaaS deployment models;
  • design tenant isolation at identity, API, compute, data, network, observability, and operations layers;
  • choose when to use shared resources, dedicated resources, account-per-tenant, or cell-based segmentation;
  • prevent cross-tenant data leakage and noisy-neighbor effects;
  • design tenant onboarding, offboarding, suspension, entitlement, metering, and audit workflows;
  • reason about multi-tenant cost allocation and unit economics;
  • design cell-based architecture to limit blast radius;
  • explain trade-offs for regulated enterprise SaaS on AWS.

2. Kaufman Skill Decomposition

SaaS architecture decomposes into the following sub-skills:

First 20 Hours Focus

| Timebox | Focus | Practice Output |
| --- | --- | --- |
| 2h | SaaS vocabulary | Define tenant, isolation, partitioning, tier, cell |
| 3h | Isolation model | Draw isolation layers for one SaaS workload |
| 3h | Silo/pool/bridge trade-off | Select model for 5 tenant types |
| 3h | Tenant lifecycle | Design onboarding/offboarding workflows |
| 3h | Noisy-neighbor control | Define quotas, throttles, and metrics |
| 3h | Cell-based architecture | Design cell routing and blast-radius boundary |
| 3h | Regulated SaaS review | Create evidence and audit checklist |

3. Core Mental Model

SaaS is a control-plane problem and a data-plane problem.

The SaaS control plane manages tenants.
The SaaS data plane serves tenant workload traffic.

Control Plane Responsibilities

  • tenant registration;
  • tenant identity linkage;
  • provisioning;
  • plan/tier assignment;
  • entitlement configuration;
  • billing and metering;
  • tenant suspension/reactivation;
  • tenant placement into cells/environments;
  • tenant configuration;
  • audit and lifecycle evidence;
  • admin operations;
  • tenant-aware support tooling.

Data Plane Responsibilities

  • serve user/API traffic;
  • enforce tenant context;
  • isolate tenant data;
  • apply entitlement decisions;
  • emit tenant-tagged telemetry;
  • enforce quotas/throttles;
  • protect against noisy neighbors;
  • preserve tenant-level audit events.

Diagram

Key Rule

A multi-tenant application without a tenant control plane eventually becomes operationally unmanageable.

4. Tenant Vocabulary

| Term | Meaning |
| --- | --- |
| Tenant | A customer, organization, business unit, or logical consumer of the SaaS system |
| Tenant user | A human or machine principal associated with a tenant |
| Tenant context | The tenant identity attached to a request, event, job, or data object |
| Tenant isolation | Controls that prevent one tenant from accessing or impacting another tenant improperly |
| Tenant partitioning | The storage or placement model used to organize tenant data/resources |
| Tenant tier | Commercial/operational class such as Free, Standard, Enterprise, Regulated |
| Tenant lifecycle | Onboarding, activation, configuration, suspension, deletion/offboarding |
| Tenant entitlement | What a tenant is allowed to use |
| Tenant metering | Measurement of tenant usage for cost, billing, quota, or abuse detection |
| Tenant cell | Isolated workload slice serving a subset of tenants |

Tenant Context Invariant

Every request, event, job, log, metric, trace, audit record, object, row, and support action must answer:

Which tenant is this for?

If tenant context is missing, the system cannot reliably enforce isolation, troubleshoot tenant impact, allocate cost, or prove audit boundaries.


5. Silo, Pool, and Bridge Models

The AWS Well-Architected SaaS Lens describes three common SaaS deployment models: silo, pool, and bridge.

5.1 Silo Model

A tenant gets dedicated resources.

Strengths

  • strong isolation;
  • easier tenant-specific compliance;
  • simpler noisy-neighbor containment;
  • tenant-specific maintenance windows possible;
  • simpler per-tenant restore in many designs;
  • clearer cost allocation.

Weaknesses

  • higher cost;
  • more operational overhead;
  • harder fleet management;
  • slower tenant onboarding unless automated;
  • version drift risk;
  • inefficient utilization.

Use When

  • enterprise tenant requires dedicated deployment;
  • compliance requires stronger physical/logical separation;
  • tenant workload is large enough to justify dedicated capacity;
  • custom maintenance or data residency is required;
  • tenant contract demands strong isolation.

5.2 Pool Model

Multiple tenants share resources.

Strengths

  • cost efficiency;
  • high utilization;
  • simpler global deployment;
  • faster onboarding;
  • easier feature rollout;
  • less infrastructure sprawl.

Weaknesses

  • isolation must be enforced in software/policy;
  • noisy-neighbor risk;
  • more complex tenant-aware observability;
  • per-tenant restore/export/deletion is harder;
  • compliance evidence requires stronger controls;
  • blast radius can be larger.

Use When

  • tenants are small/medium;
  • cost efficiency matters;
  • product experience is standardized;
  • tenant-specific customization is limited;
  • compliance allows logical isolation.

5.3 Bridge Model

Some layers are pooled and others are siloed.

Strengths

  • balances cost and isolation;
  • supports tier-based offerings;
  • allows dedicated data with shared edge/control plane;
  • practical for regulated enterprise SaaS.

Weaknesses

  • operational complexity;
  • multiple deployment patterns;
  • entitlement and routing complexity;
  • requires strong automation.

Use When

  • some tenants require dedicated resources;
  • most tenants can share pooled infrastructure;
  • certain data classes need stronger isolation;
  • enterprise tiers differ meaningfully.

6. Tenant Isolation vs Data Partitioning

Tenant isolation and data partitioning are related but not the same.

Partitioning decides where tenant data is stored.
Isolation ensures one tenant cannot access another tenant's data or resources.

A shared table with tenant_id is partitioning. It is not sufficient isolation unless every access path enforces tenant constraints.

Isolation Layers

| Layer | Isolation Mechanism |
| --- | --- |
| Identity | Tenant claims in token, tenant-bound roles, federation mapping |
| API | Authorizer validates tenant context and entitlement |
| Application | Tenant-aware service logic and invariant checks |
| Data | Row/item/object partitioning, separate schema, separate DB, separate account |
| Compute | Shared worker pool, per-tenant workers, per-cell services |
| Network | Security groups, VPC segmentation, PrivateLink, account boundary |
| Operations | Tenant-scoped admin tools and support authorization |
| Observability | Tenant-tagged logs/metrics/traces without data leakage |
| Cost | Tenant allocation tags, metering, usage reports |
| Compliance | Audit evidence, retention, access review |

Isolation Failure Example

API validates tenant_id on normal user requests.
Background job queries all rows without tenant filter.
Support tool allows operator to search by case number globally.
Export job writes multiple tenants into same file.
Logs contain cross-tenant payload data.

Isolation must be system-wide, not endpoint-specific.


7. Tenant Identity and Context Propagation

Tenant context usually starts at identity.

Request Path

Tenant Claims

A token may include:

{
  "sub": "user-123",
  "tenant_id": "tenant-abc",
  "tenant_tier": "enterprise",
  "roles": ["case_manager", "reviewer"],
  "entitlements": ["case.write", "evidence.read"]
}

Rules

  • Do not trust tenant IDs supplied only in request body.
  • Bind tenant context to authenticated identity or machine identity.
  • Validate that requested tenant equals authorized tenant.
  • For cross-tenant admin, require explicit elevated workflow and audit.
  • Propagate tenant context to async events and background jobs.
  • Reject work items that lack tenant context unless explicitly system-scoped.
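The rules above can be sketched as a small resolver that derives tenant context from already-verified token claims and treats any tenant ID in the path or body as a request to be checked, not as a source of truth. The names `TenantContext`, `resolve_tenant_context`, and `build_event` are hypothetical, and token signature verification is assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import Optional


class TenantContextError(Exception):
    """Raised when tenant context is missing or does not match the caller."""


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    subject: str
    entitlements: frozenset


def resolve_tenant_context(claims: dict, requested_tenant_id: Optional[str] = None) -> TenantContext:
    """Derive tenant context from verified claims, never from the request body."""
    tenant_id = claims.get("tenant_id")
    if not tenant_id:
        raise TenantContextError("token carries no tenant_id claim")
    # A tenant ID in the path or body must equal the authorized tenant.
    if requested_tenant_id is not None and requested_tenant_id != tenant_id:
        raise TenantContextError("requested tenant does not match authorized tenant")
    return TenantContext(
        tenant_id=tenant_id,
        subject=claims.get("sub", ""),
        entitlements=frozenset(claims.get("entitlements", [])),
    )


def build_event(ctx: TenantContext, event_type: str, payload: dict) -> dict:
    """Wrap an async event so the tenant travels in the envelope (assumed shape)."""
    return {"tenant_id": ctx.tenant_id, "type": event_type, "payload": payload}
```

Cross-tenant admin access is deliberately not handled here; per the rules above it would go through a separate elevated and audited workflow.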

8. Tenant-Aware Data Models

8.1 Shared Table

CREATE TABLE cases (
  tenant_id varchar(64) NOT NULL,
  case_id varchar(64) NOT NULL,
  status varchar(32) NOT NULL,
  created_at timestamp NOT NULL,
  PRIMARY KEY (tenant_id, case_id)
);

Strengths

  • simple pooled model;
  • efficient for many small tenants;
  • easier global schema rollout;
  • good utilization.

Risks

  • every query must include tenant predicate;
  • accidental cross-tenant scans;
  • hard per-tenant restore;
  • noisy tenant can affect shared DB;
  • data deletion/export requires discipline.
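One way to reduce the "every query must include tenant predicate" risk is a repository that is constructed for exactly one tenant and adds the predicate itself. The sketch below uses an in-memory SQLite table shaped like the `cases` table above (without `created_at`, for brevity); the class name `TenantScopedCaseRepository` is hypothetical and this is an illustration under those assumptions, not a complete data layer.

```python
import sqlite3


class TenantScopedCaseRepository:
    """Every query is bound to one tenant; there is no unscoped query method."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, case_id: str, status: str) -> None:
        self._conn.execute(
            "INSERT INTO cases (tenant_id, case_id, status) VALUES (?, ?, ?)",
            (self._tenant_id, case_id, status),
        )

    def get(self, case_id: str):
        # The tenant predicate is added here, not by the caller.
        return self._conn.execute(
            "SELECT case_id, status FROM cases WHERE tenant_id = ? AND case_id = ?",
            (self._tenant_id, case_id),
        ).fetchone()  # None when the case belongs to another tenant


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (tenant_id TEXT NOT NULL, case_id TEXT NOT NULL, "
    "status TEXT NOT NULL, PRIMARY KEY (tenant_id, case_id))"
)
repo_a = TenantScopedCaseRepository(conn, "t-a")
repo_b = TenantScopedCaseRepository(conn, "t-b")
repo_a.add("c-1", "OPEN")
```

With this shape, `repo_b.get("c-1")` returns `None` even though the row exists, because the lookup is scoped to tenant `t-b`.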

8.2 Schema-per-Tenant

DB cluster
├── tenant_a_schema
├── tenant_b_schema
└── tenant_c_schema

Strengths

  • stronger logical separation;
  • easier per-tenant export in some cases;
  • familiar relational model.

Risks

  • schema migration fanout;
  • connection pool complexity;
  • operational overhead of managing many schemas;
  • risk of version drift.

8.3 Database-per-Tenant

Tenant A -> DB A
Tenant B -> DB B
Tenant C -> DB C

Strengths

  • strong isolation;
  • easier per-tenant backup/restore;
  • clearer noisy-neighbor boundary;
  • clearer cost allocation.

Risks

  • expensive for many small tenants;
  • provisioning complexity;
  • fleet patching and upgrades;
  • connection management;
  • cross-tenant analytics harder.

8.4 Account-per-Tenant

Strengths

  • strongest AWS account-level isolation;
  • SCP/CloudTrail/Config/account boundary per tenant;
  • strong compliance story;
  • easier tenant-specific networking.

Risks

  • AWS account quota/management complexity;
  • slower onboarding without account vending automation;
  • cross-account operations complexity;
  • higher platform engineering burden.

9. DynamoDB Multi-Tenant Modeling

DynamoDB is common for SaaS because partitioning can be explicit and high-scale, but tenant isolation must still be designed carefully.

Basic Pooled Key Pattern

PK = TENANT#<tenantId>
SK = CASE#<caseId>

Example:

{
  "PK": "TENANT#t-123",
  "SK": "CASE#c-456",
  "status": "PENDING_REVIEW",
  "createdAt": "2026-07-01T10:00:00Z"
}

Better for High-Volume Tenants

PK = TENANT#<tenantId>#BUCKET#<bucketId>
SK = CASE#<caseId>

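This spreads a single high-volume tenant's traffic across several partition keys instead of concentrating it on one. The trade-off is that reading all of a tenant's cases now requires querying every bucket.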
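A minimal sketch of the bucketed key, assuming the bucket is derived from a stable hash of the case ID and that the bucket count (8 here) is fixed per tenant. The function names and the bucket count are illustrative assumptions, not part of any AWS API.

```python
import hashlib


def bucketed_partition_key(tenant_id: str, case_id: str, bucket_count: int = 8) -> str:
    """Build PK = TENANT#<tenantId>#BUCKET#<bucketId> from a stable hash of the case ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % bucket_count
    return f"TENANT#{tenant_id}#BUCKET#{bucket}"


def all_partition_keys(tenant_id: str, bucket_count: int = 8) -> list:
    """Every partition key that must be queried to list all cases of one tenant."""
    return [f"TENANT#{tenant_id}#BUCKET#{b}" for b in range(bucket_count)]
```

Because the bucket is computed from the case ID, a point lookup can recompute the key directly; only tenant-wide listings need to fan out over `all_partition_keys`.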

Isolation Concerns

| Concern | Mitigation |
| --- | --- |
| Missing tenant key | Use repository/access layer that requires tenant context |
| Cross-tenant GSI query | Include tenant in GSI partition key where needed |
| Hot tenant | Shard/bucket tenant keys, throttle, or isolate tenant |
| Tenant export | Build tenant-scoped export workflow |
| Tenant deletion | Track all tenant item collections and purge them asynchronously |
| Audit | Emit tenant ID for all write events |

10. S3 Multi-Tenant Modeling

S3 can support pooled or siloed tenant storage.

Pooled Bucket Prefix Model

s3://case-platform-evidence/tenant=t-123/case=c-456/document=d-789.pdf

Dedicated Bucket Model

s3://tenant-t-123-evidence/case=c-456/document=d-789.pdf

Comparison

| Model | Strength | Risk |
| --- | --- | --- |
| Shared bucket, tenant prefix | Cost-efficient, manageable for many tenants | IAM/policy complexity, accidental broad access |
| Bucket per tenant | Stronger boundary, easier lifecycle/replication customization | Bucket count/operations/policy scale |
| Account per tenant with buckets | Strong isolation and compliance | More platform automation required |

S3 Tenant Isolation Controls

  • bucket policy;
  • IAM policy with tenant-scoped prefixes;
  • session tags / ABAC;
  • KMS key strategy;
  • object tags;
  • access points;
  • signed URLs with tenant validation;
  • CloudTrail data events for sensitive buckets;
  • Macie for sensitive data discovery where appropriate.
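For the pooled bucket prefix model, an IAM policy scoped to one tenant's prefix could be generated along these lines. This is a sketch under the assumption that objects live under `tenant=<tenantId>/` as in the example path above; the function name is hypothetical, and where the policy is attached (for example as a session policy when assuming a role) is left to the platform.

```python
def tenant_prefix_policy(bucket: str, tenant_id: str) -> dict:
    """Return an IAM policy document limited to one tenant's prefix in a shared bucket."""
    prefix = f"tenant={tenant_id}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access only under the tenant prefix.
                "Sid": "TenantObjectAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # Listing is a bucket-level action, so it is restricted by prefix condition.
                "Sid": "TenantPrefixListing",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
        ],
    }
```

Generating the policy from the tenant ID keeps the prefix convention in one place, which makes it easier to test that no statement grants access to the bucket root.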

11. Tenant Lifecycle

A SaaS system must manage tenants as first-class entities.

Lifecycle States

Lifecycle Operations

| Operation | Required Actions |
| --- | --- |
| Onboard | Create tenant record, assign tier, provision resources, configure identity, seed defaults |
| Activate | Enable access, verify health, emit audit event |
| Update tier | Change entitlement, quota, routing, capacity, billing plan |
| Suspend | Block access, preserve data, pause billing or change billing status, audit the action |
| Reactivate | Restore access and quotas safely |
| Offboard | Export data, enforce retention, revoke access, delete or archive resources |
| Delete | Purge data where legally allowed, retain required audit evidence |
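The operations above can be guarded by an explicit state machine so that, for example, a deleted tenant cannot be reactivated. The state names and allowed transitions below are assumptions made for illustration (only `ACTIVE` appears in this part), and `transition` is a hypothetical helper.

```python
# Assumed lifecycle states and transitions; adjust to the platform's real model.
ALLOWED_TRANSITIONS = {
    "PROVISIONING": {"ACTIVE", "FAILED"},
    "FAILED": {"PROVISIONING"},
    "ACTIVE": {"SUSPENDED", "OFFBOARDING"},
    "SUSPENDED": {"ACTIVE", "OFFBOARDING"},
    "OFFBOARDING": {"DELETED"},
    "DELETED": set(),
}


def transition(tenant: dict, new_status: str, actor: str, audit_log: list) -> dict:
    """Return an updated tenant record, or raise if the transition is not allowed."""
    current = tenant["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_status}")
    # Every lifecycle change leaves an audit record.
    audit_log.append(
        {"tenantId": tenant["tenantId"], "from": current, "to": new_status, "actor": actor}
    )
    return {**tenant, "status": new_status}
```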

Tenant Registry Example

{
  "tenantId": "t-123",
  "name": "Acme Regulation Group",
  "tier": "regulated-enterprise",
  "status": "ACTIVE",
  "cellId": "cell-02",
  "isolationModel": "BRIDGE",
  "dataRegion": "ap-southeast-3",
  "kmsKeyRef": "alias/tenant-t-123",
  "createdAt": "2026-07-01T09:00:00Z"
}

12. Tenant Provisioning Architecture

Tenant provisioning should be automated and idempotent.

Provisioning Invariants

  • safe to retry;
  • every step auditable;
  • partial failure recoverable;
  • resource names deterministic;
  • tenant ID immutable;
  • no tenant traffic before activation;
  • failure state visible to operators;
  • cleanup path exists.
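A minimal sketch of the "safe to retry" and "partial failure recoverable" invariants: each step is recorded once it succeeds, so a retry skips completed steps and resumes at the failed one. The step names, the `completed` set (which would be durable storage in practice), and the returned status strings are all assumptions for illustration.

```python
def provision_tenant(tenant_id: str, steps: list, completed: set, audit_log: list) -> str:
    """Run (name, action) steps in order, skipping any step that already completed."""
    for name, action in steps:
        if name in completed:
            continue  # already done on a previous attempt; retry is safe
        try:
            action(tenant_id)
        except Exception as exc:
            # Failure is visible to operators and the run can be retried later.
            audit_log.append((tenant_id, name, f"FAILED: {exc}"))
            return "PROVISIONING_FAILED"
        completed.add(name)
        audit_log.append((tenant_id, name, "OK"))
    return "READY_FOR_ACTIVATION"
```

Skipping by step name only works if each action is itself idempotent or its completion is recorded atomically; a step that succeeds but crashes before being recorded will run again on retry.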

13. Entitlement and Authorization

Authorization answers:

Who can do this?

Entitlement answers:

Is this tenant allowed to use this product capability at all?

Example

A user may have role case_manager, but their tenant may not have the advanced_evidence_analytics entitlement.

Entitlement Types

| Type | Example |
| --- | --- |
| Feature | Evidence analytics enabled |
| Capacity | Max 10,000 active cases |
| Rate | 100 API calls/sec |
| Region | Data must stay in ap-southeast-3 |
| Integration | SFTP partner export enabled |
| Compliance | Legal hold feature enabled |
| Support | Dedicated support SLA |

Common Mistake

Putting tenant entitlements only in frontend feature flags.

Entitlements must be enforced server-side.
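A server-side check therefore needs both answers. The sketch below is a deliberately small illustration with a hypothetical `can_invoke` helper; real systems would typically load tenant entitlements from the control plane rather than pass them in.

```python
def can_invoke(user_roles: set, tenant_entitlements: set,
               required_role: str, required_entitlement: str) -> bool:
    """Allow the call only if the user is authorized AND the tenant is entitled."""
    authorized = required_role in user_roles              # who can do this?
    entitled = required_entitlement in tenant_entitlements  # may this tenant use it at all?
    return authorized and entitled
```

In the example above, a `case_manager` whose tenant lacks `advanced_evidence_analytics` is denied even though the role check passes.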


14. Metering and Unit Economics

SaaS without metering is financially blind.

Metering Events

{
  "tenantId": "t-123",
  "eventType": "CASE_CREATED",
  "quantity": 1,
  "timestamp": "2026-07-01T10:00:00Z",
  "source": "case-service",
  "idempotencyKey": "case-created:c-456"
}

What to Meter

| Dimension | Example |
| --- | --- |
| API usage | Requests per tenant |
| Storage | GB per tenant, object count |
| Compute | Jobs executed, function invocations |
| Workflow | Cases opened/closed |
| Data transfer | Export volume |
| Search | Indexed documents, queries |
| AI usage | Tokens, model calls, embeddings |
| Support | Admin operations, SLA tier |

Uses of Metering

  • billing;
  • cost allocation;
  • quota enforcement;
  • abuse detection;
  • capacity planning;
  • tenant profitability;
  • tier design;
  • sustainability optimization.
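Because metering events are usually delivered at least once, aggregation has to tolerate duplicates. The following sketch assumes events shaped like the metering event example above and deduplicates on the tenant ID plus `idempotencyKey`; the in-memory `seen` set stands in for whatever durable store a real pipeline would use.

```python
from collections import defaultdict


def aggregate_usage(events: list) -> dict:
    """Sum quantities per (tenantId, eventType), counting each idempotency key once."""
    seen = set()
    totals = defaultdict(int)
    for event in events:
        dedupe_key = (event["tenantId"], event["idempotencyKey"])
        if dedupe_key in seen:
            continue  # replayed or duplicated delivery
        seen.add(dedupe_key)
        totals[(event["tenantId"], event["eventType"])] += event["quantity"]
    return dict(totals)
```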

Metering Pipeline


15. Noisy-Neighbor Control

A noisy neighbor is a tenant whose usage degrades other tenants.

Noisy-Neighbor Sources

| Source | Example |
| --- | --- |
| API traffic | Tenant floods API endpoints |
| Database | Tenant causes hot partition or heavy queries |
| Queue | Tenant fills shared queue backlog |
| Storage | Tenant uploads many large objects |
| Search | Tenant runs expensive queries |
| Batch | Tenant jobs consume worker pool |
| Reporting | Tenant scans large history repeatedly |
| AI inference | Tenant consumes expensive model quota |

Controls

| Control | AWS/Architecture Mechanism |
| --- | --- |
| Rate limit | API Gateway usage plans/throttling, WAF rate rules, app quota |
| Queue isolation | Per-tenant queue or priority queue |
| Worker isolation | Per-tier worker pools, reserved concurrency |
| DB isolation | Tenant partitioning, read replicas, dedicated DB for large tenant |
| Cache isolation | Tenant-aware keys, eviction strategy, separate cluster for large tenants |
| Search isolation | Per-tenant index or routing key for high-volume tenants |
| Cell isolation | Place groups of tenants into separate cells |
| Tiering | Stricter limits for lower tiers, dedicated capacity for enterprise |
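As an illustration of the application-level quota in the rate-limit row, a per-tenant token bucket keeps one tenant's burst from consuming another tenant's allowance. This is a single-process sketch; the class name, the injectable `clock`, and the in-memory dictionary are assumptions, and a shared service would need a distributed store.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` refill, capped at `burst` tokens."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets = {}  # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (float(self._burst), now))
        # Refill based on elapsed time, never above the burst size.
        tokens = min(float(self._burst), tokens + (now - last) * self._rate)
        if tokens >= 1.0:
            self._buckets[tenant_id] = (tokens - 1.0, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Per-tier limits can be modeled by creating one limiter per tier with different `rate_per_sec` and `burst` values.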

Tenant-Level SLO

Global SLOs can hide tenant pain.

Overall p95 latency: 180ms
Tenant A p95 latency: 120ms
Tenant B p95 latency: 4.8s

A SaaS platform must support tenant-level observability.
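The effect can be reproduced with a small nearest-rank percentile calculation. The sample data below is invented for illustration: one large tenant with fast requests and one small tenant with slow ones.

```python
import math
from collections import defaultdict


def p95(values: list) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def p95_by_tenant(samples: list) -> dict:
    """Group (tenant_id, latency_ms) samples and compute p95 per tenant."""
    grouped = defaultdict(list)
    for tenant_id, latency_ms in samples:
        grouped[tenant_id].append(latency_ms)
    return {tenant_id: p95(latencies) for tenant_id, latencies in grouped.items()}


# Invented sample: tenant A dominates traffic, tenant B is small but slow.
samples = [("tenant-a", 120)] * 97 + [("tenant-b", 4800)] * 3
global_p95 = p95([latency for _, latency in samples])  # 120
per_tenant = p95_by_tenant(samples)  # {"tenant-a": 120, "tenant-b": 4800}
```

The global figure looks healthy because tenant B contributes only 3% of requests, yet every one of tenant B's requests is slow.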


16. Tenant-Aware Observability

Telemetry must include tenant context while protecting sensitive data.

Required Dimensions

| Signal | Tenant Attributes |
| --- | --- |
| Metrics | tenant_id or tenant_tier on aggregated metrics, with carefully controlled cardinality |
| Logs | tenant_id, request_id, user_id/sub, operation, outcome |
| Traces | tenant_id as attribute/baggage where safe |
| Audit | tenant_id, actor, action, resource, decision, timestamp |
| Cost | tenant_id/tier/cell where available |

Cardinality Warning

Using raw tenant_id as a metric dimension creates one time series per tenant per metric, which can become expensive and operationally difficult as the tenant count grows. Use:

  • tenant tier metrics;
  • top-N tenant dashboards;
  • per-tenant logs/traces;
  • sampled tenant-level metrics;
  • dedicated metrics for enterprise tenants;
  • aggregate + drill-down strategy.

Tenant Dashboard

A useful tenant dashboard includes:

  • request rate;
  • error rate;
  • p95/p99 latency;
  • throttling count;
  • queue backlog;
  • workflow completion time;
  • storage usage;
  • quota usage;
  • recent incidents;
  • entitlement status;
  • deployment version/cell.

17. Cell-Based Architecture

AWS Well-Architected guidance describes cell-based architecture as multiple isolated instances of a workload, where each cell handles a subset of requests and does not share state with other cells.

The goal is to reduce blast radius.

Basic Cell Model

Cell Properties

A proper cell:

  • serves a bounded set of tenants or workload partitions;
  • has independent data state;
  • can fail without failing all tenants;
  • can be deployed/upgraded independently or progressively;
  • has cell-level observability;
  • has routing/placement control;
  • has capacity boundaries;
  • has operational runbooks;
  • avoids shared critical dependencies where possible.

What Is Not a Cell

| Not a Cell | Why |
| --- | --- |
| Multiple pods in one Kubernetes cluster sharing one DB | Shared blast radius persists |
| Multiple ECS services sharing same global database | Data dependency can fail all tenants |
| Multiple AZs in one app tier with one overloaded dependency | Failure not contained by workload subset |
| Shards with no operational isolation | Sharding alone is not cell-based architecture |

18. Cell Router and Placement

Cell architecture needs a routing and placement mechanism.

Tenant Placement Record

{
  "tenantId": "t-123",
  "cellId": "cell-apac-02",
  "region": "ap-southeast-3",
  "tier": "enterprise",
  "status": "ACTIVE"
}

Routing Flow

Routing Options

| Option | Use When | Concern |
| --- | --- | --- |
| Tenant subdomain | tenant.example.com maps to cell | DNS/cache complexity |
| Token claim | Tenant ID in JWT resolves to cell | Router must validate token |
| API key | Machine integrations | Key lifecycle/security |
| Header from trusted gateway | Internal routing | Must prevent spoofing |
| Control-plane lookup | Dynamic placement | Latency/cache/fallback |

Placement Strategy

| Strategy | Description |
| --- | --- |
| Hash-based | Tenant mapped by hash to cell |
| Capacity-aware | Place tenant where capacity exists |
| Tier-based | Enterprise tenants placed in stronger cells |
| Geography-based | Place tenant based on data residency/latency |
| Compliance-based | Regulated tenants placed in compliant cells |
| Dedicated | Large tenant gets own cell or stack |
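A sketch of hash-based placement combined with a placement registry. The assumption here is that the hash is only used for the first placement and the result is then persisted, because a plain modulo hash would move existing tenants whenever the number of cells changes. The function names and the dictionary standing in for the registry are hypothetical.

```python
import hashlib


def hash_placement(tenant_id: str, cells: list) -> str:
    """Map a tenant to a cell with a stable hash of the tenant ID."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]


def place_tenant(tenant_id: str, registry: dict, cells: list) -> str:
    """Return the recorded cell, or place the tenant once and record the decision."""
    if tenant_id in registry:
        return registry[tenant_id]  # existing placement always wins
    cell_id = hash_placement(tenant_id, cells)
    registry[tenant_id] = cell_id
    return cell_id
```

Capacity-aware, tier-based, or compliance-based strategies could replace `hash_placement` for first placement without changing how the router reads the registry.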

19. Cell Blast Radius

Cell design is about limiting impact.

Blast Radius Questions

  • How many tenants can one bad deployment affect?
  • How many tenants can one database failure affect?
  • How many tenants can one queue backlog affect?
  • How many tenants can one noisy tenant affect?
  • How many tenants can one IAM mistake affect?
  • How many tenants can one regional issue affect?
  • How many tenants can one operator action affect?

Cell Size Trade-Off

| Smaller Cells | Larger Cells |
| --- | --- |
| Lower blast radius | Better utilization |
| More operational overhead | Fewer deployments/resources |
| More routing complexity | Simpler management |
| Easier tenant evacuation per group | Larger failure impact |
| More expensive | More cost-efficient |

Rule

A cell is valuable only if its failure is meaningfully contained.

If every cell depends on the same write-critical global database, the cell boundary is weak.


20. Cell-Based SaaS on AWS

Example Architecture

Account Strategy Options

| Model | Strength | Risk |
| --- | --- | --- |
| All cells in one account | Simple for early-stage platforms | Weak account blast-radius boundary |
| Account per environment, cells inside | Moderate isolation | Shared account quota/policy risk |
| Account per cell | Stronger isolation and quota boundary | More automation needed |
| Account per large tenant | Strongest tenant boundary | Highest operational complexity |

21. Shared Services in Cell Architecture

Cells often need shared services, but shared services can become shared failure domains.

Shared Service Examples

  • identity provider;
  • tenant registry;
  • entitlement service;
  • billing/metering pipeline;
  • observability platform;
  • deployment pipeline;
  • support/admin tooling;
  • artifact registry;
  • public edge/router;
  • audit lake.

Classification

| Shared Service Type | Failure Impact | Design Guidance |
| --- | --- | --- |
| Read-mostly config | Can be cached | Cache locally in cell |
| Write-critical registry | Can block onboarding/routing | Replicate or degrade gracefully |
| Runtime authorization | Can affect all requests | Cache entitlements, fail closed/open by policy |
| Observability | Should not break serving path | Async telemetry, backpressure protection |
| Billing | Should not block core user action | Async metering with replay |
| Identity | High impact | Token validation cache, defined outage behavior |

Rule

A shared service must not silently erase the blast-radius benefit of cells.

22. Tenant Evacuation and Cell Rebalancing

A mature cell architecture may need to move tenants between cells.

Reasons:

  • noisy tenant isolation;
  • cell capacity pressure;
  • compliance/data residency change;
  • cell maintenance;
  • disaster recovery;
  • enterprise tenant upgrade;
  • shard/cell imbalance.

Tenant Evacuation Steps

Evacuation Requirements

  • tenant-scoped data ownership;
  • consistent export/import;
  • idempotent provisioning;
  • placement registry update;
  • routing cache invalidation;
  • reconciliation report;
  • rollback/roll-forward plan;
  • tenant communication plan;
  • audit evidence.

23. Multi-Tenant Security Failure Modes

| Failure Mode | Example | Prevention |
| --- | --- | --- |
| Tenant context spoofing | User passes another tenant ID in body | Derive tenant from trusted identity/token |
| Missing tenant filter | Query returns all tenant rows | Repository guardrails, tests, DB policies where available |
| Cross-tenant cache leak | Cache key lacks tenant ID | Tenant-aware cache keys |
| Shared S3 prefix leak | Policy allows broad prefix access | IAM conditions, access points, tests |
| Admin overreach | Support user sees all tenants unnecessarily | Tenant-scoped support tooling and just-in-time access |
| Log leakage | Logs contain another tenant's PII | Redaction and structured logging policy |
| Async job leak | Worker processes wrong tenant data | Tenant context in event envelope |
| Metrics leak | Customer dashboard exposes another tenant's stats | Dashboard access control and aggregation |
| Backup restore mistake | Restore tenant A over tenant B | Tenant-scoped restore runbooks and validation |

Tenant-Aware Cache Key

Bad:

case:123

Good:

tenant:t-123:case:123
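Building keys through a single helper, rather than by ad hoc string formatting, makes the tenant prefix hard to forget. The helper below is a hypothetical sketch that follows the "good" key format above.

```python
def tenant_cache_key(tenant_id: str, entity: str, entity_id: str) -> str:
    """Build a cache key that always carries the tenant ID."""
    if not tenant_id:
        raise ValueError("tenant_id is required for cache keys")
    return f"tenant:{tenant_id}:{entity}:{entity_id}"


# Two tenants with the same case number never collide in a shared cache.
cache = {
    tenant_cache_key("t-123", "case", "123"): {"status": "OPEN"},
    tenant_cache_key("t-999", "case", "123"): {"status": "CLOSED"},
}
```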

24. Multi-Tenant Testing

SaaS systems require isolation tests.

Test Categories

| Test | Purpose |
| --- | --- |
| Cross-tenant access test | Ensure tenant A cannot read/write tenant B |
| Tenant context missing test | Ensure request/event without tenant is rejected |
| Entitlement test | Ensure disabled feature cannot be invoked |
| Noisy-neighbor test | Ensure tenant A load does not break tenant B |
| Tenant deletion test | Ensure data purge/export behavior works |
| Admin access test | Ensure support tools are tenant-scoped |
| Migration test | Ensure tenant movement between cells works |
| Restore test | Ensure tenant-level restore does not affect others |
| Observability test | Ensure tenant telemetry exists without data leakage |

Example Isolation Test Cases

1. User from tenant A requests /tenants/B/cases/123 -> denied.
2. User from tenant A requests /cases/123 where case belongs to B -> denied or not found.
3. Worker receives event with tenant_id missing -> dead-letter or reject.
4. Cache lookup for tenant A cannot return tenant B object.
5. Support operator without tenant grant cannot view tenant data.
6. Tenant A export contains no tenant B records.
7. Tenant B high load does not violate tenant A SLO.
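Cases 3 and 6 from the list above can be expressed as plain test functions. The worker and export functions here are toy stand-ins (hypothetical names and shapes) so the tests are runnable on their own; in a real suite they would exercise the actual services.

```python
def handle_event(event: dict, dead_letter: list) -> bool:
    """Toy worker: reject events that carry no tenant context."""
    if not event.get("tenant_id"):
        dead_letter.append(event)
        return False
    return True


def export_tenant(records: list, tenant_id: str) -> list:
    """Toy export: include only the requested tenant's records."""
    return [r for r in records if r["tenant_id"] == tenant_id]


def test_event_without_tenant_is_dead_lettered():
    dead_letter = []
    assert handle_event({"type": "CASE_CREATED"}, dead_letter) is False
    assert len(dead_letter) == 1


def test_export_contains_only_requested_tenant():
    records = [
        {"tenant_id": "tenant-a", "case_id": "c-1"},
        {"tenant_id": "tenant-b", "case_id": "c-2"},
    ]
    exported = export_tenant(records, "tenant-a")
    assert all(r["tenant_id"] == "tenant-a" for r in exported)
    assert len(exported) == 1
```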

25. Regulated Enterprise SaaS

Regulated enterprise SaaS often needs stronger boundaries.

Typical Requirements

  • tenant-specific data residency;
  • legal hold and retention;
  • customer-managed keys or dedicated keys;
  • tenant-specific audit export;
  • dedicated support access approval;
  • stricter isolation for evidence data;
  • private connectivity;
  • compliance reports;
  • per-tenant backup/restore;
  • custom incident notification;
  • tenant-level DR commitments.

Architecture Direction

For regulated tenants, prefer bridge or siloed elements:

Evidence Requirements

| Evidence | Purpose |
| --- | --- |
| Tenant provisioning log | Prove lifecycle control |
| Access logs | Prove who accessed tenant data |
| Authorization decisions | Prove policy enforcement |
| Data export logs | Prove data handling |
| Backup/restore evidence | Prove recoverability |
| Key usage logs | Prove cryptographic control |
| Incident records | Prove response process |
| Change records | Prove deployment governance |
| Tenant deletion record | Prove retention/deletion handling |

26. SaaS on ECS, EKS, Lambda, and Serverless

ECS/Fargate

Good fit when:

  • services are containerized;
  • platform wants simpler operations than Kubernetes;
  • tenant isolation can be by service, task, cluster, or account;
  • predictable microservice runtime is needed.

Tenant isolation options:

  • pooled ECS service;
  • per-tenant task/service for large tenants;
  • per-tier cluster;
  • per-cell ECS cluster/account;
  • task role boundaries.

EKS

Good fit when:

  • Kubernetes ecosystem is required;
  • many teams deploy workloads;
  • platform has strong day-2 operations capability;
  • namespace/network policy/admission control are mature.

Tenant isolation options:

  • namespace per tenant/team;
  • node pool per tier;
  • cluster per cell;
  • cluster per regulated tenant;
  • Kubernetes RBAC + IAM + network policy.

Warning:

Kubernetes namespace isolation is not equivalent to tenant isolation for hostile or regulated tenants.

Lambda/Serverless

Good fit when:

  • workload is event-driven;
  • tenant traffic is bursty;
  • per-function scaling is useful;
  • operational overhead should be minimized.

Tenant isolation options:

  • pooled function with tenant-aware logic;
  • reserved concurrency per critical function;
  • separate functions per tier/tenant for stronger isolation;
  • per-tenant queues;
  • per-cell serverless stack.

API Gateway/AppSync

Good fit when:

  • API boundary needs auth/throttling;
  • tenant-specific usage plans/keys are useful;
  • GraphQL access control can be enforced carefully;
  • edge authorization and request validation matter.

27. Tenant Cost Allocation

Tenant cost allocation is rarely perfect, but it must be useful.

Allocation Methods

| Method | Accuracy | Cost |
| --- | --- | --- |
| Dedicated resources | High | High infrastructure cost |
| Tags | Medium/high where supported | Low/medium effort |
| Usage metering | High for application events | Requires pipeline |
| Proportional allocation | Approximate | Easy but less defensible |
| Hybrid | Practical | Requires governance |

Unit Economics Examples

| Product | Unit Metric |
| --- | --- |
| Case management | Cost per active case |
| Evidence platform | Cost per GB-month and document processed |
| API product | Cost per 1,000 API calls |
| Analytics | Cost per report/query/job |
| AI assistant | Cost per tenant token/model invocation |
| Workflow engine | Cost per state transition/process instance |

Tenant Profitability View

Tenant revenue
- allocated infrastructure cost
- support cost
- third-party/API cost
- AI/model cost
- data transfer cost
= gross margin estimate

A tenant can be high revenue and still unprofitable if the architecture gives it unbounded shared-resource consumption.
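The profitability view is simple arithmetic once costs are allocated per tenant. The figures below are invented purely to show a high-revenue tenant with a negative margin; `gross_margin` is a hypothetical helper.

```python
def gross_margin(revenue: float, costs: dict) -> tuple:
    """Return (margin, margin ratio) for one tenant from its allocated costs."""
    margin = revenue - sum(costs.values())
    ratio = margin / revenue if revenue else 0.0
    return margin, ratio


# Invented monthly figures for one tenant.
margin, ratio = gross_margin(
    10_000,
    {
        "infrastructure": 6_000,
        "support": 2_500,
        "third_party_api": 500,
        "ai_model": 1_500,
        "data_transfer": 500,
    },
)
# margin is -1000 and ratio is -0.1: the tenant costs more than it pays.
```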


28. Deployment Strategy for SaaS

SaaS deployment must account for tenant impact.

Deployment Options

| Option | Use When |
| --- | --- |
| All tenants at once | Small/simple product, low risk |
| Canary by percentage | Good telemetry, homogeneous tenants |
| Canary by tenant | Enterprise-safe validation |
| Canary by cell | Cell architecture exists |
| Tier-first rollout | Internal/free tier before enterprise |
| Dedicated tenant rollout | Regulated tenants need controlled windows |

Cell-Based Progressive Delivery

SaaS Deployment Safety Criteria

  • tenant impact prediction;
  • cell health before/after;
  • tenant-level SLOs;
  • rollback per cell;
  • database migration compatibility;
  • feature flag by tenant/tier;
  • entitlement compatibility;
  • support readiness;
  • release notes per tenant tier.
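"Rollback per cell" and "cell health before/after" can be combined into a wave-based rollout loop. The sketch assumes the caller supplies `deploy`, `is_healthy`, and `rollback` callables and an ordered list of waves; these names, the wave structure, and the result shape are illustrative assumptions rather than any specific deployment tool's API.

```python
def rollout_by_cell(waves: list, deploy, is_healthy, rollback) -> dict:
    """Deploy wave by wave; on an unhealthy wave, roll it back and stop."""
    completed = []
    for wave in waves:
        for cell_id in wave:
            deploy(cell_id)
        if not all(is_healthy(cell_id) for cell_id in wave):
            for cell_id in wave:
                rollback(cell_id)
            # Later waves are never touched, which bounds the blast radius.
            return {"status": "HALTED", "rolled_back": list(wave), "completed": completed}
        completed.extend(wave)
    return {"status": "COMPLETE", "rolled_back": [], "completed": completed}
```

A typical wave order would start with a canary cell and end with cells that host enterprise or regulated tenants.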

29. SaaS Admin and Support Tooling

Support tooling is a major isolation risk.

Required Controls

| Control | Why |
| --- | --- |
| Tenant-scoped access | Prevent broad customer data exposure |
| Just-in-time elevation | Reduce standing privilege |
| Approval workflow | Regulated access evidence |
| Session logging | Investigation and audit |
| Reason code | Explain why access occurred |
| Data masking | Reduce sensitive data exposure |
| Break-glass path | Emergency access with strong audit |
| Time-bound grants | Prevent privilege persistence |

Support Tool Invariant

An operator must not be able to accidentally operate on the wrong tenant.

Design support UI/API so tenant context is explicit, locked, visible, and logged.
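Several of the controls above (tenant-scoped access, reason code, approval, time-bound grants) can be captured in one grant record that the support tool checks on every action. The field names and helper functions are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def issue_grant(operator: str, tenant_id: str, reason_code: str,
                approved_by: str, ttl_minutes: int, now: datetime) -> dict:
    """Create a grant locked to one operator and one tenant, with an expiry."""
    if not reason_code:
        raise ValueError("a reason code is required")
    return {
        "operator": operator,
        "tenantId": tenant_id,
        "reasonCode": reason_code,
        "approvedBy": approved_by,
        "expiresAt": now + timedelta(minutes=ttl_minutes),
    }


def is_access_allowed(grant: dict, operator: str, tenant_id: str, now: datetime) -> bool:
    """Allow only the named operator, on the named tenant, before expiry."""
    return (
        grant["operator"] == operator
        and grant["tenantId"] == tenant_id
        and now < grant["expiresAt"]
    )
```

Because the tenant ID is part of the grant, an operator who switches to another tenant in the UI fails the check instead of silently operating on the wrong tenant.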


30. Common SaaS Anti-Patterns

Anti-Pattern 1: Tenant ID as Optional Field

Symptoms:

  • some tables have tenant ID, some do not;
  • events sometimes omit tenant ID;
  • logs cannot be filtered by tenant;
  • support tools search globally by default.

Fix:

  • tenant context is mandatory except for explicit system-scoped records;
  • schema/API/event contracts enforce tenant context;
  • tests reject missing tenant context.

Anti-Pattern 2: Shared Everything, No Guardrails

Symptoms:

  • all tenants share DB/queue/cache;
  • no quotas;
  • no tenant-level metrics;
  • enterprise tenant performance depends on free-tier load.

Fix:

  • quotas;
  • per-tier pools;
  • cell boundaries;
  • dedicated resources for large tenants.

Anti-Pattern 3: Silo Everything Too Early

Symptoms:

  • one full stack per tiny tenant;
  • onboarding slow;
  • cost high;
  • patching painful;
  • versions drift.

Fix:

  • use pooled model for small tenants;
  • automate dedicated stacks only for justified tiers;
  • keep common control plane.

Anti-Pattern 4: Billing Added Later

Symptoms:

  • no usage events;
  • no cost allocation;
  • expensive tenants hidden;
  • plan limits unenforceable.

Fix:

  • design metering as platform capability from the start;
  • connect metering to entitlement and quota.

Anti-Pattern 5: Cell Architecture With Shared Fate

Symptoms:

  • cells share one global write DB;
  • cells share one critical queue;
  • deployment still rolls everywhere at once;
  • routing cannot isolate unhealthy cell.

Fix:

  • make state and operations cell-local where possible;
  • deploy progressively by cell;
  • define cell evacuation/failover.

31. Design Review Checklist

Tenant Model

  • Is tenant defined clearly?
  • Is tenant context mandatory?
  • Are tenant IDs immutable?
  • Are tenant tiers explicit?
  • Is tenant lifecycle modeled?
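
A tenant model that answers these questions might look like the sketch below. The tier names, lifecycle states, and allowed transitions are assumptions chosen for illustration; adapt them to the lifecycle your platform defines.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Tier(Enum):
    FREE = "free"
    STANDARD = "standard"
    ENTERPRISE = "enterprise"


class LifecycleState(Enum):
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    SUSPENDED = "suspended"
    OFFBOARDING = "offboarding"
    DELETED = "deleted"


# Hypothetical transition table: anything not listed is rejected.
ALLOWED_TRANSITIONS = {
    LifecycleState.PROVISIONING: {LifecycleState.ACTIVE},
    LifecycleState.ACTIVE: {LifecycleState.SUSPENDED, LifecycleState.OFFBOARDING},
    LifecycleState.SUSPENDED: {LifecycleState.ACTIVE, LifecycleState.OFFBOARDING},
    LifecycleState.OFFBOARDING: {LifecycleState.DELETED},
    LifecycleState.DELETED: set(),
}


@dataclass(frozen=True)  # frozen: tenant_id cannot change after creation
class Tenant:
    tenant_id: str
    tier: Tier
    state: LifecycleState = LifecycleState.PROVISIONING

    def transition(self, new_state: LifecycleState) -> "Tenant":
        """Return a new Tenant in new_state, or raise if the move is not allowed."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        return replace(self, state=new_state)
```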

Isolation

  • Can tenant A access tenant B data through any path?
  • Are background jobs tenant-scoped?
  • Are support tools tenant-scoped?
  • Are cache keys tenant-aware?
  • Are exports tenant-scoped?
  • Are logs tagged with tenant ID and free of other tenants' data?
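
For the cache-key question, a thin wrapper can make tenant-aware keys the only way to reach a shared cache. This sketch uses a plain `dict` as a stand-in for the cache backend; `TenantScopedCache` and the key layout are hypothetical.

```python
from typing import Any, Dict, Optional


class TenantScopedCache:
    """Wraps a shared backend so every key is prefixed with one tenant's ID."""

    def __init__(self, tenant_id: str, backend: Dict[str, Any]) -> None:
        # Reject the delimiter so "a:b" + "c" can never collide with "a" + "b:c".
        if not tenant_id or ":" in tenant_id:
            raise ValueError("tenant_id must be non-empty and must not contain ':'")
        self._tenant_id = tenant_id
        self._backend = backend

    def _key(self, namespace: str, key: str) -> str:
        if not namespace or ":" in namespace:
            raise ValueError("namespace must be non-empty and must not contain ':'")
        return f"tenant:{self._tenant_id}:{namespace}:{key}"

    def set(self, namespace: str, key: str, value: Any) -> None:
        self._backend[self._key(namespace, key)] = value

    def get(self, namespace: str, key: str) -> Optional[Any]:
        return self._backend.get(self._key(namespace, key))
```

Application code receives a `TenantScopedCache` bound to the request's tenant and never sees the raw backend, so a forgotten prefix cannot happen by omission.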

Operations

  • Can we see tenant-level health?
  • Can we throttle one tenant?
  • Can we suspend one tenant?
  • Can we move one tenant?
  • Can we restore one tenant?
  • Can we notify impacted tenants?

Cost

  • Can we estimate tenant cost?
  • Can we detect unprofitable tenants?
  • Can quotas protect shared resources?
  • Can tenant tier drive capacity allocation?
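
A rough tenant cost estimate can come from allocating shared infrastructure cost in proportion to metered usage. The sketch below assumes a single usage measure per tenant and hypothetical function names; real allocation typically blends several dimensions (compute, storage, requests) and adds any dedicated-resource cost directly.

```python
from typing import Dict, List


def allocate_shared_cost(
    shared_cost: float, usage_by_tenant: Dict[str, float]
) -> Dict[str, float]:
    """Split shared_cost across tenants in proportion to their usage."""
    total = sum(usage_by_tenant.values())
    if total == 0:
        return {tenant: 0.0 for tenant in usage_by_tenant}
    return {
        tenant: shared_cost * usage / total
        for tenant, usage in usage_by_tenant.items()
    }


def unprofitable_tenants(
    revenue_by_tenant: Dict[str, float], cost_by_tenant: Dict[str, float]
) -> List[str]:
    """Return tenants whose allocated cost exceeds their revenue."""
    return sorted(
        tenant
        for tenant, cost in cost_by_tenant.items()
        if revenue_by_tenant.get(tenant, 0.0) < cost
    )
```

For example, splitting a shared cost of 1000 between a tenant with 10 usage units and one with 30 yields 250 and 750. If each pays 300 and 500 respectively, only the second is flagged.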

Cell Architecture

  • What is the cell boundary?
  • What dependencies are shared?
  • How many tenants can one cell failure impact?
  • Can routing isolate a bad cell?
  • Can deployment roll out by cell?
  • Can tenants be rebalanced?
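
The blast-radius question has a simple first-order answer: the fraction of tenants placed in each cell. The helper names below are hypothetical, and the result counts only tenant placement. Any dependency shared across cells pushes the true blast radius toward 100%, whatever the placement looks like.

```python
from collections import Counter
from typing import Dict, Tuple


def blast_radius(placements: Dict[str, str]) -> Dict[str, float]:
    """Map each cell to the fraction of tenants lost if that cell fails."""
    counts = Counter(placements.values())  # cell_id -> number of tenants
    total = len(placements)
    return {cell: n / total for cell, n in counts.items()}


def worst_case(placements: Dict[str, str]) -> Tuple[str, float]:
    """Return the cell whose failure impacts the largest share of tenants."""
    radius = blast_radius(placements)
    if not radius:
        raise ValueError("no tenants placed")
    cell = max(radius, key=radius.get)
    return cell, radius[cell]
```

With tenants spread evenly over N cells, each cell's share is 1/N, so going from 3 to 10 cells cuts the worst-case placement impact from about 33% to 10%.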

Compliance

  • Can we prove tenant isolation?
  • Can we export tenant audit logs?
  • Can we enforce data residency?
  • Can we support legal hold and retention?
  • Can we delete tenant data where legally required?

32. Deliberate Practice

Exercise 1: Choose Silo/Pool/Bridge

For each tenant profile, choose a model and justify it:

| Tenant | Profile |
|---|---|
| A | 20 users, low data volume, standard support |
| B | 5,000 users, high API usage, strict SLA |
| C | Government regulator, data residency, legal hold |
| D | Free-tier trial tenant |
| E | Enterprise tenant requiring private connectivity |

Exercise 2: Tenant Isolation Threat Model

Draw every access path where cross-tenant leakage could occur:

  • API;
  • background job;
  • cache;
  • logs;
  • support UI;
  • export;
  • data lake;
  • search index;
  • backup restore;
  • analytics dashboard.

For each path, define prevention, detection, and response.

Exercise 3: Design a Tenant Lifecycle Workflow

Create an onboarding workflow using Step Functions-style states:

  • create tenant record;
  • allocate cell;
  • provision resources;
  • configure identity;
  • configure entitlements;
  • run health checks;
  • activate tenant;
  • emit audit event.
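
As a possible starting point for this exercise, the steps above can be chained into an Amazon States Language definition. This is a skeleton only: the state names are one reading of the list, `placeholder_arn` returns a placeholder string rather than a real resource, and the happy path is all that is modeled. Retries, error handling, and rollback of partially provisioned tenants are left for you to design.

```python
import json
from typing import Callable, Dict, List

# State names derived from the onboarding steps listed in the exercise.
STEPS = [
    "CreateTenantRecord",
    "AllocateCell",
    "ProvisionResources",
    "ConfigureIdentity",
    "ConfigureEntitlements",
    "RunHealthChecks",
    "ActivateTenant",
    "EmitAuditEvent",
]


def placeholder_arn(step: str) -> str:
    # Hypothetical placeholder, not a real ARN; substitute your own task resources.
    return f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{step}"


def build_definition(steps: List[str], resource_for: Callable[[str], str]) -> Dict:
    """Build a linear state machine: each Task points to the next, the last ends."""
    states: Dict[str, Dict] = {}
    for index, name in enumerate(steps):
        state: Dict = {"Type": "Task", "Resource": resource_for(name)}
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {
        "Comment": "Tenant onboarding skeleton (happy path only)",
        "StartAt": steps[0],
        "States": states,
    }


definition_json = json.dumps(build_definition(STEPS, placeholder_arn), indent=2)
```

A useful next step is to decide, for each state, what happens on failure: which states are safe to retry, and which require compensating actions such as releasing the allocated cell.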

Exercise 4: Design a Cell Model

Design a three-cell SaaS architecture and define:

  • cell size;
  • routing strategy;
  • tenant placement;
  • shared services;
  • deployment order;
  • evacuation plan;
  • metrics per cell.

33. Self-Correction Checklist

You understand this part when you can answer:

  • What is a tenant?
  • Why is tenant context an invariant?
  • What is the difference between tenant isolation and data partitioning?
  • When should you choose silo, pool, or bridge?
  • How can pooled infrastructure still enforce isolation?
  • Why are background jobs a cross-tenant risk?
  • How do tenant entitlement and user authorization differ?
  • What is noisy-neighbor control?
  • Why does SaaS require tenant-level observability?
  • What is a cell in cell-based architecture?
  • What shared services can weaken cell isolation?
  • How do you move a tenant between cells?
  • How do you prove tenant isolation to an auditor?

34. Engineering Judgment Summary

Enterprise SaaS architecture is boundary engineering.

The senior posture is:

Make tenant context explicit.
Enforce isolation at every access path.
Use pooled resources where efficiency matters.
Use siloed resources where isolation matters.
Use bridge models when enterprise reality requires both.
Use cells to limit blast radius.
Meter usage so cost and behavior are visible.
Automate tenant lifecycle so operations scale.

A SaaS platform fails when tenancy is treated as merely a column in a database. It succeeds when tenancy becomes a first-class architectural, operational, security, and economic boundary.

