

title: Learn AWS Engineering Mastery - Part 028
description: AWS cost engineering, FinOps operating model, unit economics, tagging, Cost Explorer, CUR 2.0, Budgets, Cost Anomaly Detection, Savings Plans, rightsizing, and sustainability.
series: learn-aws
seriesTitle: Learn AWS Engineering Mastery
order: 28
partTitle: Cost Engineering: FinOps, Unit Economics, and Sustainability
tags:
  • aws
  • cost-engineering
  • finops
  • cost-optimization
  • sustainability
  • budgets
  • cost-explorer
  • cur
  • savings-plans
  • compute-optimizer
  • series
date: 2026-07-01

Cost Engineering: FinOps, Unit Economics, and Sustainability

Learning target: after this part, we can treat AWS cost as an architecture signal, not merely a billing concern. We will be able to design tagging, cost allocation, unit economics, guardrails, pricing models, rightsizing, and sustainability trade-offs for production-grade workloads.

The previous part covered compliance and evidence. This part covers the dimension that often separates an average engineer from a senior/top-tier engineer:

Can we explain why this architecture incurs a particular cost, which cost driver is the largest, how cost changes as traffic grows, and what the trade-offs are between reliability, performance, security, sustainability, and cost?

Cost engineering is not “making the system as cheap as possible”. Cost engineering is the ability to run a system that delivers business value with effective, transparent, and controlled resource use.


1. Kaufman Skill Map

Kaufman-style skill deconstruction:

Sub-skill | Core question | Output you must be able to produce
Cost visibility | Where does the money go? | Cost dashboard by account/service/workload
Cost allocation | Who owns the cost? | Tagging taxonomy + Cost Categories
Unit economics | What is the cost per business unit? | Cost per tenant/request/case/job
Pricing model | Which commitment is rational? | Savings Plans/RI/Spot decision
Rightsizing | Which resources are over/under-provisioned? | Rightsizing backlog with risk
Demand shaping | Can demand be smoothed or reduced? | Scaling, schedule, cache, lifecycle policies
Governance | How do we prevent runaway spend? | Budgets, anomaly detection, guardrails
Sustainability | Which resources are unnecessary? | Utilization and waste reduction plan

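The target skill is not just “being able to open Cost Explorer”. The target skill is being able to close the loop: measure, attribute ownership, explain cost drivers, optimize safely, validate impact, and prevent regression.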


2. Mental Model: Cost Is an Architecture Signal

AWS cost berasal dari keputusan arsitektur.

Architecture decision | Cost consequence
Multi-AZ database | Higher availability, higher cost
NAT Gateway for all private egress | Simpler private routing, data processing cost
High-cardinality logs | Better debugging, high ingestion/storage cost
Always-on EKS cluster | Platform flexibility, baseline cost
Lambda high memory | Faster execution, different GB-second cost profile
S3 lifecycle policy absent | Old data accumulates in expensive class
Cross-AZ chatty services | Higher data transfer cost and latency
Overprovisioned RDS | Lower performance risk, wasted capacity
Underprovisioned cache | Lower cache cost, higher database load

Cost is not a bill at the end of the month. Cost is telemetry about system design.

2.1 The Cost Triangle

Examples:

  • Increasing redundancy improves reliability but increases cost.
  • Encrypting and retaining logs supports security/compliance but increases storage/processing cost.
  • Caching improves latency and reduces database load but adds infrastructure and invalidation complexity.
  • Aggressive rightsizing reduces cost but can reduce headroom.
  • Removing unused resources improves cost and sustainability.

Top-tier judgment means not optimizing cost in isolation.

2.2 Cost Engineering vs Cost Cutting

Cost cutting | Cost engineering
“Reduce the AWS bill by 30%.” | “Reduce waste while preserving SLO and risk posture.”
Deletes resources reactively | Uses ownership, telemetry, and change control
Ignores unit economics | Optimizes cost per business outcome
Creates reliability regressions | Evaluates risk and rollback
One-time effort | Continuous operating model

Cost engineering is a product and platform discipline.


3. Cost Visibility Foundation

You cannot optimize what you cannot attribute.

Minimum cost visibility stack:

3.1 Account-Based Allocation

The first allocation boundary is the AWS account.

Good multi-account cost design:

Account type | Cost ownership
Shared network | Platform/network team, allocated by consuming workload if possible
Log archive | Security/platform baseline cost
Security tooling | Security/platform baseline cost
Dev workload | Product/team owner
Prod workload | Product/team owner
Sandbox | Individual/team owner with strict budget
Data platform | Data platform owner, sometimes allocated by dataset/consumer

Accounts are clean billing containers, but shared services still need an explicit allocation model.
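One common allocation model is proportional: split a shared bill by each workload's share of a usage driver. The sketch below assumes a hypothetical shared NAT bill of 3,000 and made-up per-workload gigabytes processed; the function name `allocate_shared_cost` and the workload names are illustrative, not part of any AWS API.

```python
def allocate_shared_cost(shared_cost: float, usage_by_workload: dict[str, float]) -> dict[str, float]:
    """Split a shared cost proportionally to each workload's usage driver."""
    if not usage_by_workload:
        return {}
    total_usage = sum(usage_by_workload.values())
    if total_usage <= 0:
        # No usage signal: fall back to an even split so the cost stays visible.
        even_share = shared_cost / len(usage_by_workload)
        return {name: even_share for name in usage_by_workload}
    return {
        name: shared_cost * usage / total_usage
        for name, usage in usage_by_workload.items()
    }


# Hypothetical inputs: GB processed through a shared NAT Gateway per workload.
nat_gb = {"case-api": 600.0, "search": 300.0, "reporting": 100.0}
allocation = allocate_shared_cost(3000.0, nat_gb)
```

The usage driver matters more than the arithmetic: bytes processed suits a NAT Gateway, while log volume or request count may suit other shared services.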

3.2 Tagging Taxonomy

Tags are the bridge between resource and business context.

Recommended core tags:

Tag | Purpose
Environment | prod, staging, dev, sandbox
Application | workload/application name
Service | component/service name
Owner | accountable team or group
CostCenter | financial allocation
DataClassification | public/internal/confidential/restricted
Criticality | tier-0/tier-1/tier-2/etc.
ManagedBy | terraform/cdk/cloudformation/manual
Lifecycle | permanent/ephemeral/experimental
TenantId | only if safe and cardinality is controlled

Tagging rules:

  1. Tags must be standardized.
  2. Tags must be enforced at creation time where possible.
  3. Tags must be activated for cost allocation when needed.
  4. Tags must not contain secrets or sensitive personal data.
  5. High-cardinality tags need careful review.
  6. Shared cost needs explicit allocation policy.

3.3 Cost Categories

Cost Categories let finance/platform map costs into business dimensions without requiring every resource to have perfect tags.

Examples:

  • map accounts to business units;
  • map services to platform domains;
  • group shared infrastructure;
  • split production vs non-production;
  • isolate experimental spend;
  • build showback/chargeback views.

Tags are resource-side metadata. Cost Categories are billing-side grouping rules. Use both.


4. Cost Explorer, Budgets, Anomaly Detection, CUR

4.1 Cost Explorer

Use Cost Explorer for interactive exploration:

  • service-level spend;
  • account-level trends;
  • daily/monthly breakdown;
  • amortized vs unblended cost;
  • usage type;
  • linked account;
  • tag/category grouping;
  • RI/Savings Plans coverage/utilization views.

Cost Explorer is good for asking: “What changed?”

But it is not always enough for detailed engineering analysis. For high-resolution analysis, use CUR/Data Exports.

4.2 AWS Budgets

Budgets are guardrails and feedback mechanisms.

Budget types:

Budget type | Use case
Cost budget | Monthly cost threshold
Usage budget | Specific service usage threshold
RI utilization/coverage | Commitment tracking
Savings Plans utilization/coverage | Commitment tracking

Budget design:

  • budget per environment;
  • budget per product/team;
  • budget for sandbox;
  • budget for experimental accounts;
  • alerts at multiple thresholds;
  • action playbooks, not only email noise.

Example budget policy:

Threshold | Action
50% forecast exceeded | Notify owner
80% actual exceeded | Notify owner + platform
100% forecast exceeded | Review top cost drivers
120% actual exceeded | Escalate and require mitigation plan
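The policy table above can be expressed as data so that alert routing and the documented policy stay in sync. This is a sketch only: `budget_actions` is a hypothetical helper, not an AWS Budgets API, and it assumes the actual and forecast amounts are already available.

```python
# (threshold percent, basis, action) rows mirroring the example policy table.
BUDGET_RULES = [
    (50, "forecast", "Notify owner"),
    (80, "actual", "Notify owner + platform"),
    (100, "forecast", "Review top cost drivers"),
    (120, "actual", "Escalate and require mitigation plan"),
]


def budget_actions(budget: float, actual: float, forecast: float) -> list[str]:
    """Return every action whose threshold is exceeded for the given spend."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent = {
        "actual": 100 * actual / budget,
        "forecast": 100 * forecast / budget,
    }
    return [
        action
        for threshold, basis, action in BUDGET_RULES
        if percent[basis] > threshold
    ]
```

For a budget of 1,000 with 850 actual and 1,100 forecast, the first three rules fire and the escalation rule does not.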

4.3 Cost Anomaly Detection

Cost anomaly detection catches unusual spend patterns that normal budget thresholds miss.

Common anomaly scenarios:

  • runaway logs;
  • misconfigured autoscaling;
  • data transfer spike;
  • accidental large instance family;
  • unbounded Kinesis shard scaling;
  • NAT Gateway egress surge;
  • forgotten load test;
  • snapshots accumulating;
  • unexpected Region usage;
  • public API abuse.

Design principle:

An anomaly alert without an owner and runbook is just billing noise.
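To make the idea of "unusual relative to recent history" concrete, here is a deliberately simple trailing-window check. It is not how AWS Cost Anomaly Detection works internally; it is a swapped-in z-score heuristic, and the function name, threshold, and 5% noise floor are all assumptions.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's cost if it sits far above the recent daily average."""
    if len(history) < 2:
        return False  # Not enough data to estimate normal variation.
    mu = mean(history)
    # Floor the spread at 5% of the mean so a very flat history
    # does not turn tiny increases into alerts.
    sigma = max(stdev(history), 0.05 * mu)
    if sigma == 0:
        return today > 0  # History was all zeros; any spend is new.
    return (today - mu) / sigma > z_threshold


# Hypothetical daily costs for one service.
recent_days = [100, 102, 98, 101, 99, 100, 103]
```

With this history, a day at 180 is flagged and a day at 104 is not. Seasonality, growth trends, and per-service baselines are the reasons a managed detector is usually preferable to a hand-rolled check.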

4.4 CUR 2.0 and Data Exports

CUR is the detailed cost dataset for serious analysis. CUR 2.0/Data Exports provide detailed cost and usage data that can be delivered to S3 and queried.

Use cases:

  • unit cost calculation;
  • cost per tenant/request/job;
  • shared cost allocation;
  • anomaly forensic analysis;
  • pricing model simulation;
  • historical trend modeling;
  • internal chargeback;
  • sustainability/waste analysis.

Example Athena-style questions:

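-- Pseudo-query: monthly cost by application and service.
-- Table and column names are illustrative; the actual names depend on the
-- export type (legacy CUR vs CUR 2.0) and on which cost allocation tags
-- have been activated.
SELECT
  month,
  resource_tags_application,
  product_product_name,
  SUM(line_item_unblended_cost) AS cost
FROM cur
WHERE month = '2026-06'
GROUP BY 1, 2, 3
ORDER BY cost DESC;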

CUR is where cost engineering becomes data engineering.


5. Unit Economics

A bill says how much you spent. Unit economics says whether the spend makes sense.

5.1 Choosing the Unit

Pick units that match product value.

System | Useful unit cost
API platform | cost per million requests
SaaS platform | cost per tenant per month
Case management | cost per case lifecycle
Document processing | cost per document processed
Streaming ingestion | cost per GB ingested
Search platform | cost per indexed document/query
AI inference | cost per successful inference/task
Data lake | cost per TB stored and scanned

For a regulatory case management platform, useful unit metrics might be:

  • cost per active case;
  • cost per investigation workflow;
  • cost per evidence document;
  • cost per enforcement decision;
  • cost per tenant/agency;
  • cost per audit query;
  • cost per notification/event.

5.2 Unit Cost Formula

unit_cost = allocated_total_cost / business_units_processed

But allocated total cost must be carefully defined:

allocated_total_cost = direct_service_cost
                     + allocated_shared_platform_cost
                     + allocated_observability_cost
                     + allocated_security_cost
                     + allocated_data_transfer_cost
                     + allocated_support_and_backup_cost

5.3 Example: Cost per Case

Suppose a case management platform has:

Cost component | Monthly cost
App compute | 4,000
Database | 7,000
Search | 2,000
S3 evidence storage | 1,500
Eventing/workflow | 1,200
Observability/logs | 2,300
Shared platform allocation | 3,000
Security/audit tooling allocation | 1,000
Total | 22,000

If 55,000 active case lifecycle steps were processed:

cost_per_case_step = 22,000 / 55,000 = 0.40
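The same arithmetic as a short script, using the component figures from the table above. The dictionary keys are illustrative names; only the numbers come from the example.

```python
# Monthly cost components from the example table.
components = {
    "app_compute": 4000,
    "database": 7000,
    "search": 2000,
    "s3_evidence_storage": 1500,
    "eventing_workflow": 1200,
    "observability_logs": 2300,
    "shared_platform_allocation": 3000,
    "security_audit_allocation": 1000,
}
case_steps = 55_000

allocated_total_cost = sum(components.values())
cost_per_case_step = allocated_total_cost / case_steps

# The largest component is the first place to look for optimization.
largest_driver = max(components, key=components.get)
largest_share = components[largest_driver] / allocated_total_cost
```

Here the database is the largest driver at roughly 32% of allocated cost, which is a more actionable finding than the total alone.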

This number becomes powerful when tracked over time:

Month | Cost | Case steps | Cost per case step | Interpretation
Jan | 20,000 | 50,000 | 0.40 | baseline
Feb | 24,000 | 75,000 | 0.32 | scale efficiency improved
Mar | 30,000 | 60,000 | 0.50 | regression or fixed-cost spike

A rising bill is not automatically bad. A rising unit cost may be bad.
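A small sketch of how the month-over-month interpretation could be automated. The 5% tolerance band and the labels are assumptions; the input rows are the Jan to Mar figures from the table.

```python
def unit_cost_trend(rows, tolerance=0.05):
    """Label each month's unit cost relative to the previous month.

    rows: iterable of (month, total_cost, units_processed).
    tolerance: relative change treated as noise (assumed 5%).
    """
    results = []
    previous = None
    for month, cost, units in rows:
        unit_cost = cost / units
        if previous is None:
            label = "baseline"
        elif unit_cost > previous * (1 + tolerance):
            label = "regression"
        elif unit_cost < previous * (1 - tolerance):
            label = "improved"
        else:
            label = "stable"
        results.append((month, round(unit_cost, 2), label))
        previous = unit_cost
    return results


monthly = [("Jan", 20_000, 50_000), ("Feb", 24_000, 75_000), ("Mar", 30_000, 60_000)]
trend = unit_cost_trend(monthly)
```

February's bill rose while its unit cost fell, so it is labeled improved; March is labeled a regression even though volume was still above January's.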

5.4 Unit Cost Failure Modes

Failure mode | Consequence | Mitigation
Wrong denominator | Misleading efficiency metric | Align with business value
Shared cost ignored | Unit cost underreported | Allocation model
Observability excluded | Debug cost hidden | Include telemetry cost
Tenant cost blended | Noisy-neighbor invisible | Tenant-aware metering
One-time migration cost included | False regression | Separate project vs run cost
Discounts ignored | Bad optimization decisions | Use amortized/effective cost where appropriate

6. Pricing Model Strategy

6.1 On-Demand

On-Demand is good for:

  • unpredictable workloads;
  • experimentation;
  • early-stage systems;
  • spiky workloads with no baseline;
  • workloads with short lifespan;
  • avoiding commitment risk.

It is expensive if you have stable baseline usage.

6.2 Savings Plans

Savings Plans provide discounts in exchange for consistent hourly spend commitment for one- or three-year terms.

Use Savings Plans when:

  • compute baseline is stable;
  • services include EC2/Fargate/Lambda, depending on the plan type;
  • commitment can be applied broadly;
  • organization has cost forecasting maturity.

Risks:

  • overcommitment;
  • wrong plan type;
  • purchasing before architecture stabilizes;
  • ignoring upcoming migration/modernization;
  • treating commitment as optimization before rightsizing.

Decision rule:

Rightsize first, commit second.
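The overcommitment risk can be illustrated with a simplified cost model. The assumptions here are loud ones: a single flat discount rate, usage expressed as hourly On-Demand-equivalent spend, commitment paid every hour whether used or not, and uncovered usage billed at On-Demand rates. The numbers and the 30% discount are hypothetical, not AWS pricing.

```python
def cost_with_commitment(hourly_on_demand_usage, hourly_commitment, discount):
    """Total cost over the period under a simplified Savings Plans model.

    hourly_on_demand_usage: usage per hour, priced at On-Demand rates.
    hourly_commitment: committed spend per hour (at discounted rates).
    discount: assumed flat discount fraction, e.g. 0.30.
    """
    # A commitment of C at discounted rates covers C / (1 - discount)
    # worth of usage measured at On-Demand rates.
    covered_on_demand = hourly_commitment / (1 - discount)
    return sum(
        hourly_commitment + max(0.0, usage - covered_on_demand)
        for usage in hourly_on_demand_usage
    )


# Hypothetical usage: three busy hours and one quiet hour.
usage = [10.0, 10.0, 10.0, 4.0]
no_commitment = cost_with_commitment(usage, 0.0, 0.30)     # pure On-Demand
sized_to_usage = cost_with_commitment(usage, 7.0, 0.30)    # covers 10/hour
overcommitted = cost_with_commitment(usage, 10.5, 0.30)    # covers 15/hour
```

In this toy data the overcommitted plan costs more than having no commitment at all, which is why the baseline should be measured after rightsizing rather than before.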

6.3 Reserved Instances

Reserved Instances still matter for services and scenarios where the RI model applies, especially databases and some service-specific commitments.

Use RIs when:

  • workload is stable;
  • instance family/Region/service is unlikely to change;
  • higher discount justifies reduced flexibility;
  • database baseline is well understood.

6.4 Spot

Spot is useful for interruptible compute:

  • batch jobs;
  • CI workers;
  • data processing;
  • rendering;
  • stateless workers;
  • async queues;
  • fault-tolerant Kubernetes/ECS workloads.

Spot is dangerous for:

  • stateful primary databases;
  • non-checkpointed long jobs;
  • latency-critical synchronous services without fallback;
  • workloads that cannot handle interruption.

6.5 Pricing Model Matrix

Workload pattern | Recommended pricing posture
Stable always-on API compute | Savings Plans after rightsizing
Production RDS stable DB | RI/commitment after capacity validation
Batch processing | Spot + checkpointing
Developer sandbox | Scheduled shutdown + On-Demand
New product experiment | On-Demand until baseline emerges
Lambda event burst | On-Demand; commit only after stable GB-second baseline
ECS/Fargate baseline | Compute Savings Plans if stable
EKS node groups | Blend Savings Plans + Spot for tolerant workloads

7. Rightsizing and Waste Reduction

7.1 Rightsizing Mental Model

Rightsizing is not “choose smaller instance”. It is matching provisioned capacity to observed and expected demand while preserving risk posture.

7.2 Compute Optimizer

Compute Optimizer analyzes resource configuration and utilization metrics to generate recommendations for resources such as EC2, Auto Scaling groups, EBS, Lambda, ECS on Fargate, RDS/Aurora, and idle resources where supported.

Use recommendations as decision input, not automatic truth.

Review:

  • observation period;
  • CPU/memory/disk/network signals;
  • peak vs average;
  • workload seasonality;
  • SLO requirements;
  • failover headroom;
  • deployment schedule;
  • commitment coverage.
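Part of that review is a headroom check: what would peak utilization look like on the smaller size? The sketch below assumes utilization scales linearly with capacity, which is a simplification, and uses a hypothetical 70% ceiling for peak CPU; both function names are illustrative.

```python
def projected_utilization(p95_util_pct: float, current_capacity: float, candidate_capacity: float) -> float:
    """Estimate peak utilization after resizing, assuming linear scaling."""
    return p95_util_pct * current_capacity / candidate_capacity


def is_safe_downsize(p95_util_pct: float, current_vcpus: int, candidate_vcpus: int,
                     max_util_pct: float = 70.0) -> bool:
    """True if the projected peak stays under the headroom ceiling."""
    projected = projected_utilization(p95_util_pct, current_vcpus, candidate_vcpus)
    return projected <= max_util_pct
```

A workload at 25% p95 CPU on 8 vCPUs projects to 50% on 4 vCPUs, which passes a 70% ceiling, and to 100% on 2 vCPUs, which does not. Memory, I/O, failover headroom, and seasonality need the same treatment before acting.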

7.3 Common Waste Sources

Waste source | Detection | Fix
Idle EC2 | low utilization, no traffic | stop/terminate/schedule
Oversized RDS | low CPU/memory/IO | resize, tune, proxy, read scaling
Old snapshots | age and ownership query | retention lifecycle
Unused EBS volumes | unattached volumes | delete after approval
Excess logs | ingestion/storage analysis | sampling, retention, structured logs
NAT egress | usage type analysis | endpoints, egress architecture
Over-sharded streams | low per-shard throughput | reshard/downscale
Over-provisioned OpenSearch | low JVM/CPU/storage pressure | resize/index lifecycle
Unused load balancers | no target/no traffic | delete
Nonprod always on | schedules missing | stop schedules/ephemeral envs

7.4 Rightsizing Risk Matrix

Change | Risk | Control
Reduce EC2 size | CPU/memory saturation | canary, ASG rollback
Reduce RDS class | DB bottleneck | performance test, maintenance window
Lower log retention | forensic evidence loss | compliance review
Move S3 to colder class | restore latency/cost | lifecycle by access pattern
Reduce provisioned concurrency | cold start impact | SLO validation
Reduce stream shards | consumer lag | lag monitoring
Delete snapshots | recovery loss | retention policy approval

Rightsizing should be safe, reversible where possible, and tied to telemetry.


8. Service-Specific Cost Drivers

8.1 EC2 and Auto Scaling

Cost drivers:

  • instance family/size;
  • OS/license;
  • EBS volumes;
  • data transfer;
  • load balancers;
  • idle capacity;
  • commitment coverage;
  • Spot interruption handling.

Optimization:

  • use ASG target tracking;
  • use Graviton where compatible;
  • separate baseline and burst capacity;
  • use Spot for fault-tolerant workers;
  • schedule nonprod;
  • monitor EBS waste.

8.2 ECS/Fargate

Cost drivers:

  • vCPU and memory requested;
  • task count;
  • always-on services;
  • image pull/deploy frequency;
  • logs;
  • data transfer;
  • load balancers;
  • NAT egress.

Optimization:

  • rightsize task CPU/memory;
  • autoscale on queue depth/request metrics;
  • use Savings Plans for stable baseline;
  • avoid excessive sidecars;
  • tune log volume;
  • prefer VPC endpoints where cost-effective.

8.3 EKS

Cost drivers:

  • cluster control plane;
  • node groups;
  • over-requested CPU/memory;
  • daemonsets;
  • load balancers;
  • NAT egress;
  • observability cardinality;
  • persistent volumes;
  • cross-AZ traffic.

Optimization:

  • request/limit discipline;
  • bin packing;
  • Karpenter/cluster autoscaling;
  • namespace/team chargeback;
  • spot nodes for tolerant workloads;
  • rightsize daemonsets;
  • watch metrics/log cardinality.

8.4 Lambda

Cost drivers:

  • invocation count;
  • duration;
  • memory;
  • provisioned concurrency;
  • logs;
  • downstream retries;
  • event source batch size;
  • architecture choice.

Optimization:

  • tune memory for duration/cost sweet spot;
  • make handlers idempotent to reduce retry waste;
  • batch where appropriate;
  • avoid chatty synchronous chains;
  • control log volume;
  • use provisioned concurrency only where needed.
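The memory "sweet spot" exists because compute cost scales with memory multiplied by duration: more memory costs more per millisecond but may finish sooner. A rough sketch of the comparison follows. The two rates are example values that should be checked against current pricing for your Region and architecture, and the measured durations are hypothetical.

```python
# Assumed example rates; verify against current AWS Lambda pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def lambda_monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Estimate monthly compute plus request cost for one function."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return gb_seconds * PRICE_PER_GB_SECOND + request_cost


# Hypothetical load-test results: memory setting -> average duration in ms.
measured_duration_ms = {512: 800, 1024: 380, 2048: 210}
invocations_per_month = 10_000_000

cost_by_memory = {
    memory: lambda_monthly_cost(invocations_per_month, duration, memory)
    for memory, duration in measured_duration_ms.items()
}
cheapest_memory = min(cost_by_memory, key=cost_by_memory.get)
```

With these made-up measurements the middle setting is cheapest, because doubling memory from 512 MB more than halves the duration while doubling again does not.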

8.5 RDS/Aurora

Cost drivers:

  • instance class;
  • storage;
  • I/O;
  • backup retention;
  • replicas;
  • Multi-AZ;
  • data transfer;
  • connection scaling;
  • query inefficiency.

Optimization:

  • tune queries before scaling vertically;
  • use read replicas only for real read load;
  • use RDS Proxy where connection storms exist;
  • align backup retention to requirement;
  • monitor I/O cost;
  • evaluate Aurora Serverless only if workload pattern fits.

8.6 DynamoDB

Cost drivers:

  • read/write capacity or request units;
  • item size;
  • GSI count;
  • Streams;
  • global tables replication;
  • TTL usage;
  • backup/export;
  • hot partitions causing overprovisioning.

Optimization:

  • access-pattern-first modeling;
  • keep item size reasonable;
  • avoid unnecessary GSIs;
  • use on-demand/provisioned appropriately;
  • model partitions to avoid hot keys;
  • use TTL for ephemeral data.

8.7 S3 and Data Lake

Cost drivers:

  • storage class;
  • request count;
  • data retrieval;
  • lifecycle transitions;
  • replication;
  • inventory;
  • Athena scan volume;
  • small files;
  • logs/evidence retention.

Optimization:

  • lifecycle policies;
  • partitioning;
  • columnar formats;
  • compaction;
  • avoid scanning raw huge datasets;
  • align retention with value/compliance;
  • use Intelligent-Tiering where pattern fits.

8.8 Observability

Cost drivers:

  • log ingestion;
  • log retention;
  • custom metrics;
  • high-cardinality metrics;
  • trace sampling;
  • dashboard/alarm scale;
  • repeated query patterns.

Optimization:

  • structured logs with controlled fields;
  • sample traces intentionally;
  • set retention by log class;
  • avoid logging full payloads;
  • reduce noisy debug logs;
  • use metric filters carefully;
  • define telemetry budget per service.

Observability cost is not waste by default. But unbounded telemetry is architectural debt.


9. Demand Management

Cost follows demand, but demand can be shaped.

9.1 Scaling to Demand

Good scaling signals:

Workload | Better scaling signal
API compute | request rate, CPU, latency, target response time
Queue workers | queue depth per worker, oldest message age
Stream consumers | consumer lag
Batch jobs | job backlog and deadline
Search | query latency, CPU/JVM, indexing backlog
Database | CPU, connections, IOPS, query latency

Bad scaling signals:

  • average CPU only for latency-sensitive service;
  • queue depth without processing rate;
  • memory without GC/runtime understanding;
  • request count without latency/SLO;
  • manual desired capacity forever.
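As an illustration of combining queue depth with processing rate, the sketch below sizes a worker pool so the current backlog drains within a target time. The function name, the drain target, and the min/max bounds are assumptions for illustration, not an Auto Scaling API.

```python
import math


def desired_workers(queue_depth: int, per_worker_rate_per_s: float,
                    target_drain_seconds: float,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Workers needed to drain the backlog within the target time, clamped."""
    capacity_per_worker = per_worker_rate_per_s * target_drain_seconds
    needed = math.ceil(queue_depth / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))
```

A backlog of 12,000 messages, with each worker handling 5 messages per second and a 120-second drain target, gives 20 workers. The clamp keeps a floor for availability and a ceiling that bounds spend during an abnormal spike.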

9.2 Scheduling Nonprod

Nonprod cost often hides waste.

Patterns:

  • stop dev/test databases outside office hours;
  • tear down preview environments;
  • use ephemeral test stacks;
  • schedule batch clusters;
  • limit sandbox account budgets;
  • expire experimental resources.
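The potential of scheduling is easy to estimate from running hours alone. A minimal sketch, assuming the resource is billed by running time and that a 12-hour, 5-day schedule is acceptable for the team:

```python
HOURS_PER_WEEK = 24 * 7  # 168


def scheduled_fraction_saved(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of weekly running hours avoided by a stop/start schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


office_hours_saving = scheduled_fraction_saved(12, 5)
```

Running 60 of 168 hours avoids roughly 64% of running hours. The saving applies only to the time-billed part of the resource; storage attached to a stopped instance or database typically continues to be billed.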

9.3 Caching and Work Avoidance

The cheapest work is work you do not perform.

Examples:

  • cache reference data;
  • precompute expensive reports;
  • use CDN for static assets;
  • deduplicate events;
  • compact small files;
  • avoid repeated full table scans;
  • use lifecycle rules;
  • drop unnecessary logs.

Cost optimization often starts in application behavior, not in the AWS console.


10. Sustainability

Sustainability in AWS is a shared responsibility. AWS optimizes the sustainability of the cloud infrastructure; customers optimize workloads in the cloud.

Practical sustainability overlaps heavily with cost and performance:

Sustainability principle | Engineering action
Maximize utilization | Rightsize, autoscale, consolidate idle resources
Reduce waste | Delete unused assets, lifecycle old data
Match demand | Scale down nonprod, avoid overprovisioning
Efficient software | Optimize hot code paths and database queries
Efficient data | Compress, partition, avoid repeated scans
Region choice | Consider business, latency, cost, and sustainability goals
Modern hardware | Evaluate efficient instance families such as Graviton where compatible

10.1 Sustainability Is Not Only “Choose a Green Region”

Region selection matters, but workload behavior matters too.

Examples:

  • keeping 50 idle dev databases running wastes resources in any Region;
  • scanning 100 TB daily because files are unpartitioned wastes compute;
  • logging full payloads forever wastes storage and query compute;
  • overprovisioning for peak all month wastes capacity;
  • inefficient code increases CPU time and energy use.

10.2 Sustainability and Cost Trade-Offs

Action | Cost impact | Sustainability impact | Caveat
Delete unused resources | Lower | Better | Need ownership approval
Autoscale down | Lower | Better | Must preserve availability
Use colder S3 class | Lower | Often better | Retrieval latency/cost
Use Graviton | Often lower | Often better efficiency | Compatibility testing
Reduce log retention | Lower | Better | Compliance/forensics risk
Compress data | Lower storage/scans | Better | CPU overhead
Multi-Region active-active | Higher | More resources | May be required for resilience

Sustainability is an architectural dimension, not a marketing label.


11. FinOps Operating Model

FinOps is the operating discipline that connects engineering, product, finance, and leadership around cloud value.

11.1 Ownership Model

Role | Responsibility
Workload team | Own direct cost, unit economics, optimization backlog
Platform team | Shared services cost, guardrails, golden paths
Security team | Security tooling cost and risk justification
Finance | Forecasting, allocation, budget process
Product owner | Business value and acceptable unit cost
Engineering leadership | Trade-off decisions and accountability

Cost without ownership is just a number.

11.2 Showback vs Chargeback

Model | Meaning | When useful
Showback | Show teams their cost, no internal billing transfer | Early maturity, education
Chargeback | Allocate/bill cost to teams/business units | Mature org, strong tagging/allocation
Hybrid | Direct cost charged, shared cost shown or allocated | Common enterprise pattern

Start with showback if allocation quality is weak. Chargeback with bad data creates political conflict.

11.3 Cost Review Cadence

Recommended cadence:

Cadence | Activity
Daily | anomaly alerts for runaway spend
Weekly | top cost drivers and new spikes
Monthly | budget vs actual, unit cost trends, optimization backlog
Quarterly | commitment planning, architecture review, shared cost allocation
Before major launch | cost model and load-test cost projection
After incident/load test | cost impact review

12. Guardrails

12.1 Preventive Guardrails

Examples:

  • restrict expensive instance families in sandbox;
  • deny unsupported Regions;
  • require tags on deploy through IaC;
  • restrict public data transfer paths;
  • limit creation of certain resources by account type.

Be careful: cost guardrails can block legitimate engineering work if too blunt.

12.2 Detective Guardrails

Examples:

  • untagged resources report;
  • idle resources report;
  • budget threshold alert;
  • anomaly detection alert;
  • unattached EBS alert;
  • old snapshot report;
  • top data transfer report;
  • high log ingestion report.

12.3 Corrective Guardrails

Examples:

  • stop sandbox instances at night;
  • delete expired preview environments;
  • reduce log retention for nonprod;
  • quarantine unowned resources;
  • notify owner before deletion.

Corrective guardrails need safety valves and owner communication.


13. Cost-Aware Architecture Review

Before production launch, ask:

13.1 Visibility

  • Are all resources tagged?
  • Is the workload mapped to account, application, owner, environment?
  • Are shared costs understood?
  • Is CUR/Data Export available for deeper analysis?

13.2 Forecasting

  • What is the expected monthly cost at baseline traffic?
  • What is the expected cost at 2x/5x/10x traffic?
  • What is fixed vs variable cost?
  • What service has nonlinear cost risk?
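A first-pass answer to these questions can come from a simple fixed-plus-variable model. The split below (9,000 fixed, 13,000 variable at baseline) is hypothetical, and the model assumes variable cost scales linearly with traffic, which is exactly the assumption that nonlinear services break.

```python
def projected_cost(fixed_cost: float, variable_cost_at_baseline: float,
                   traffic_multiplier: float) -> float:
    """Linear projection: fixed cost stays flat, variable cost scales with traffic."""
    return fixed_cost + variable_cost_at_baseline * traffic_multiplier


# Hypothetical baseline split of a 22,000 monthly bill.
FIXED = 9_000
VARIABLE = 13_000

projection = {
    multiplier: projected_cost(FIXED, VARIABLE, multiplier)
    for multiplier in (1, 2, 5, 10)
}
# Cost relative to traffic: falls as fixed cost is spread over more volume.
relative_unit_cost = {m: cost / m for m, cost in projection.items()}
```

In this model the bill grows to 139,000 at 10x traffic while cost per unit of traffic falls. Comparing the projection with load-test results is one way to find the services whose cost does not follow the line.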

13.3 Scaling

  • What metric drives autoscaling?
  • What is minimum capacity?
  • What happens during traffic spike?
  • Is there queue/backpressure?

13.4 Data

  • What is retention policy?
  • What data moves across AZ/Region/internet?
  • Are lifecycle rules defined?
  • Are analytics queries partitioned?

13.5 Observability

  • What is log volume per request?
  • What is trace sampling policy?
  • Are custom metrics bounded?
  • What is telemetry retention?

13.6 Commitment

  • Is usage stable enough for Savings Plans/RI?
  • Has rightsizing happened first?
  • Are migrations planned that may invalidate commitments?

14. Failure Modeling

14.1 Runaway Spend

Symptom: Cost spikes unexpectedly.

Causes:

  • load test left running;
  • autoscaling runaway;
  • recursive Lambda/event loop;
  • logs exploded;
  • public endpoint abused;
  • NAT egress spike;
  • large data scan;
  • backup/snapshot accumulation.

Mitigation:

  • anomaly detection;
  • budget alerts;
  • service quotas;
  • circuit breakers;
  • log sampling;
  • query limits;
  • owner routing;
  • emergency cost runbook.

14.2 Invisible Shared Cost

Symptom: Product teams look cheap while the platform account keeps growing.

Causes:

  • shared NAT;
  • shared observability;
  • centralized logging;
  • shared security scanning;
  • shared data platform;
  • shared EKS cluster.

Mitigation:

  • allocation rules;
  • per-tenant/service telemetry;
  • platform cost dashboard;
  • showback model.

14.3 Wrong Commitment

Symptom: Organization buys Savings Plans/RI but usage changes.

Causes:

  • bought before rightsizing;
  • migration to serverless/containers;
  • wrong Region/family;
  • overestimated baseline;
  • product decommissioned.

Mitigation:

  • commitment governance;
  • rolling purchase strategy;
  • coverage/utilization tracking;
  • architecture roadmap review.

14.4 Optimization Causes Outage

Symptom: Cost reduction breaks SLO.

Causes:

  • insufficient headroom;
  • no load test;
  • changed DB class too aggressively;
  • removed redundancy;
  • reduced cache too much;
  • shortened retention needed for incident.

Mitigation:

  • risk review;
  • canary;
  • rollback;
  • SLO validation;
  • workload owner sign-off.

15. Runbook: Investigating a Cost Spike

  1. Identify time window.
  2. Check Cost Explorer by service and linked account.
  3. Compare daily/hourly trend if available.
  4. Group by usage type.
  5. Group by tag/cost category.
  6. Check anomaly detection detail.
  7. Query CUR for top line items.
  8. Identify owner from tags/account mapping.
  9. Determine whether spike is expected business demand.
  10. If unexpected, classify:
    • runaway usage;
    • one-time event;
    • deployment regression;
    • abuse/security incident;
    • data transfer/logging issue.
  11. Mitigate safely.
  12. Record finding and prevention control.
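Steps 2 to 5 all amount to the same operation: compare two periods along some grouping and rank the differences. A minimal sketch, assuming cost per service has already been exported into dictionaries; the service figures are made up.

```python
def top_cost_changes(before: dict[str, float], after: dict[str, float], top_n: int = 3):
    """Rank groups by absolute cost increase between two periods."""
    groups = set(before) | set(after)
    deltas = {
        group: after.get(group, 0.0) - before.get(group, 0.0)
        for group in groups
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical daily cost by service, before and during the spike.
baseline_day = {"NAT Gateway": 100, "EC2": 500, "S3": 80}
spike_day = {"NAT Gateway": 470, "EC2": 510, "S3": 75, "CloudWatch": 20}
biggest_movers = top_cost_changes(baseline_day, spike_day)
```

The same function works for any grouping key, such as linked account, usage type, or tag value, so the investigation can narrow from service down to an owning workload.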

Example spike narrative:

On 2026-06-14, NAT Gateway data processing cost in prod-network increased 4.7x.
CUR showed source workload app-case-api after release 2026.06.14.2.
VPC Flow Logs and service metrics indicated large S3 traffic via NAT instead of Gateway Endpoint.
Mitigation: route S3 access through gateway endpoint, add CI check for private subnet S3 endpoint dependency, and add NAT usage anomaly alert.

This is engineering-grade cost analysis.


16. Optimization Backlog Template

id: COST-2026-001
title: Reduce NAT Gateway data processing from case-api to S3
owner: platform-network
workload: regulated-case-management
monthlySavingsEstimate: 1800
risk: low
riskNotes: requires route table update and endpoint policy validation
sloImpact: none expected
securityImpact: improves private access path
sustainabilityImpact: reduces unnecessary data path
implementation:
  - create S3 gateway endpoint
  - update route tables
  - validate endpoint policy
  - monitor NAT bytes
rollback:
  - restore previous route tables
validation:
  - NAT processed bytes reduced
  - case-api S3 calls successful
  - no increase in error rate
status: proposed

Good cost backlog items include risk, validation, and an owner, not just a savings estimate.


17. Deliberate Practice

Practice 1: Build a Cost Allocation Model

For a fictional AWS organization:

  • 5 workload accounts;
  • 1 network account;
  • 1 security account;
  • 1 log archive account;
  • 1 data platform account.

Deliverables:

  • account-to-owner mapping;
  • tag taxonomy;
  • Cost Categories proposal;
  • shared cost allocation rule;
  • showback dashboard layout.

Practice 2: Design Unit Economics

Pick one system:

  • API;
  • SaaS;
  • case management;
  • document processing;
  • data lake.

Define:

  • business unit metric;
  • direct cost;
  • shared cost;
  • telemetry cost;
  • formula;
  • dashboard fields;
  • interpretation rules.

Practice 3: Cost Spike Investigation

Simulate a spike caused by:

  • NAT egress;
  • log ingestion;
  • DynamoDB GSI;
  • Athena scans;
  • Lambda retry loop.

Create:

  • investigation steps;
  • likely source queries;
  • owner routing;
  • mitigation;
  • prevention guardrail.

Practice 4: Commitment Decision

Given 6 months of stable compute usage:

  • identify baseline;
  • rightsizing candidates;
  • migration risks;
  • recommended Savings Plans/RI posture;
  • what not to commit.

Practice 5: Sustainability Review

For a workload, identify:

  • idle resources;
  • overprovisioned capacity;
  • excessive retention;
  • inefficient queries;
  • unnecessary data movement;
  • nonprod always-on resources.

Produce a sustainability improvement plan with cost and risk notes.


18. Anti-Patterns

Anti-pattern | Better approach
Cost only reviewed by finance | Shared engineering/product/finance review
No tags or inconsistent tags | Standard taxonomy + enforcement
Optimizing only top-line bill | Track unit economics
Buying commitments before rightsizing | Rightsize first, commit second
Deleting resources without owner review | Safe optimization workflow
Treating logs as free | Telemetry budget and retention policy
Ignoring data transfer | Model network/data path cost
Chargeback with bad allocation data | Start showback, improve allocation quality
Cost alerts to nobody | Owner-routed alerts and runbooks
Cost optimization breaks reliability | SLO-aware optimization
Sustainability as separate initiative | Integrate with utilization/waste reviews

19. Self-Correction Checklist

Before saying an AWS workload is cost mature:

  • Can we attribute cost by account, application, owner, and environment?
  • Are cost allocation tags activated and enforced?
  • Are Cost Categories defined for business reporting?
  • Is CUR 2.0/Data Export available for detailed analysis?
  • Are budgets configured with meaningful owners?
  • Is Cost Anomaly Detection routed to a runbook?
  • Do we know fixed vs variable costs?
  • Do we know unit cost per business outcome?
  • Are rightsizing recommendations reviewed safely?
  • Are commitments purchased only after baseline validation?
  • Are nonprod environments scheduled/ephemeral where possible?
  • Are logs/traces/metrics cost-aware?
  • Are S3 lifecycle and data retention policies defined?
  • Are shared platform costs visible?
  • Are sustainability improvements part of architecture review?

20. Engineering Judgment Summary

Cost engineering in AWS is a control loop.

The strongest mental model:

AWS cost is the financial shadow of architecture, usage, reliability posture, observability choices, data movement, and organizational ownership. We manage it by measuring accurately, attributing ownership, explaining cost drivers, optimizing safely, validating impact, and preventing regression.

A top-tier engineer can say:

  • what the workload costs;
  • why it costs that much;
  • how cost changes with usage;
  • who owns each cost driver;
  • what can be optimized safely;
  • what should not be optimized because it protects reliability/security/compliance;
  • how unit economics trend over time;
  • how sustainability improves through less waste and better utilization.

Do not reduce cost by weakening the system blindly. Reduce waste, improve efficiency, and make trade-offs explicit.


