Learn AWS Part 024: Operations, Incident Management, SSM, and Runbooks

title: Learn AWS Engineering Mastery - Part 024
description: AWS operations engineering using Systems Manager, Session Manager, Run Command, Automation runbooks, OpsCenter, Incident Manager, patching, production access, and incident response.
series: learn-aws
seriesTitle: Learn AWS Engineering Mastery
order: 24
partTitle: Operations, Incident Management, SSM, and Runbooks
tags:
  - aws
  - systems-manager
  - ssm
  - incident-management
  - runbooks
  - opscenter
  - session-manager
  - patch-manager
  - operations
  - sre
  - series
date: 2026-07-01

Operations, Incident Management, SSM, and Runbooks

Learning target: after this part, we can design an AWS operating model that is secure, automated, auditable, and incident-ready, using Systems Manager, Session Manager, Automation, OpsCenter, Incident Manager, runbooks, playbooks, patching, and access control.

The previous part covered observability: how a system emits signals. This part covers the next question:

When a signal indicates a problem, who does what, with which access, following which procedure, and leaving what evidence?

Operations is not just "log in to the server and check the logs". In modern AWS, production-grade operations must:

Reduce direct manual access.
Replace ad-hoc actions with runbooks.
Connect alarms to response plans.
Maintain an audit trail.
Limit the operator's blast radius.
Automate remediation that is safe to automate.
Ensure patch/configuration compliance.
Run incident response with clear roles and a clear timeline.

AWS Systems Manager is one of the main foundations for this kind of operations. AWS positions Systems Manager as an operations hub and a secure end-to-end management solution for AWS, hybrid, and multicloud environments.

1. Kaufman Skill Map

Kaufman-style deconstruction:

Sub-skill	What to master	Self-correction measure
Operating model	Owner, severity, on-call, escalation, evidence	Incidents do not depend on a hero engineer
Production access	Session Manager, IAM, audit, no shared SSH	Can debug without opening inbound SSH
SSM managed nodes	Agent, instance profile, hybrid activation	Nodes can be managed consistently
Run Command	Execute command at scale safely	No manual shell loops needed
Automation	Repeatable remediation	Runbooks can be tested and constrained
OpsCenter	Centralized operational work item	Alarms produce actionable OpsItems
Incident Manager	Response plan, contacts, escalation	The right people are engaged automatically
Patch/compliance	Patch baselines, windows, inventory	Compliance is known, not assumed

2. Mental Model: Operations Is a Control System

Operations is a control loop.

A good AWS operating model has three planes:

Principle:

Do not give humans broad access to perform operations that could be expressed as a constrained runbook.

3. Systems Manager Overview

AWS Systems Manager includes many capabilities. In this series, we focus on the ones that matter for operations engineering.

Capability	Function
Session Manager	Secure interactive access to managed nodes without open inbound SSH
Run Command	Run administrative commands remotely and in a controlled way
Automation	Run runbooks for maintenance, deployment, and remediation
State Manager	Keep managed nodes/resources in the desired state
Patch Manager	Automate patching for security and other updates
Inventory	Collect software/configuration metadata from managed nodes
OpsCenter	Manage operational issues/OpsItems and run runbooks
Incident Manager	Response plans, contacts, escalation, and runbooks for incident response
Parameter Store	Simple configuration/secrets with IAM/KMS integration
Change Manager	Approval workflow for certain operational changes

Systems Manager relies on the concept of a managed node.

A managed node can be:

An EC2 instance.
An on-premises server.
A VM in another environment.
An edge/hybrid node configured for Systems Manager.

Key elements:

4. Production Access: No SSH by Default

Traditional pattern:

Open port 22 -> SSH key -> bastion -> server shell

Problems:

Inbound network exposure.
Long-lived SSH keys.
Weak session audit unless heavily customized.
Manual commands not repeatable.
Operator blast radius too large.
Difficult separation between read-only diagnosis and destructive action.

AWS-native safer baseline:

No inbound SSH
SSM Agent installed
IAM-controlled access
Session Manager session logging
CloudTrail API audit
Runbooks for common actions

4.1 Session Manager Guardrails

Recommended controls:

Control	Why
No inbound SSH security group rule	Reduces network attack surface
IAM least privilege	Limits who can start sessions
Tag-based access	Operators access only owned environment/service
Session logging	Supports audit/review
KMS encryption	Protects session/log data
MFA/SSO	Strengthens human identity
Time-bound access	Reduces standing privilege
Separate break-glass role	Emergency only, heavily monitored

Example IAM condition idea:

{
  "Condition": {
    "StringEquals": {
      "ssm:resourceTag/Environment": "prod",
      "ssm:resourceTag/Service": "case-workflow"
    }
  }
}
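To show where that condition sits in a complete policy, here is a minimal sketch that builds one as a Python dictionary. The function name `session_policy` and the `Sid` values are illustrative assumptions. The second statement follows the pattern of letting users manage only their own sessions, which assumes IAM users; federated identities may need a different policy variable, and a real policy may also need to list the session document it allows.

```python
import json


def session_policy(environment: str, service: str) -> dict:
    """Build an IAM policy that allows Session Manager sessions only on
    EC2 instances tagged with the given Environment and Service."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "StartSessionOnTaggedInstances",
                "Effect": "Allow",
                "Action": "ssm:StartSession",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # Both tag conditions must match: keys under one operator are ANDed.
                "Condition": {
                    "StringEquals": {
                        "ssm:resourceTag/Environment": environment,
                        "ssm:resourceTag/Service": service,
                    }
                },
            },
            {
                "Sid": "ManageOwnSessions",
                "Effect": "Allow",
                "Action": ["ssm:TerminateSession", "ssm:ResumeSession"],
                # Assumption: session IDs start with the IAM user name.
                "Resource": "arn:aws:ssm:*:*:session/${aws:username}-*",
            },
        ],
    }


policy = session_policy("prod", "case-workflow")
print(json.dumps(policy, indent=2))
```

Generating the policy from parameters keeps the tag values in one place, so the same template can be reused per team without hand-editing JSON.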

4.2 Read-Only First

Production access should be tiered:

Tier	Capability
Observer	View metrics/logs/dashboards only
Diagnoser	Start read-only sessions / run diagnostic commands
Operator	Execute approved runbooks
Maintainer	Perform changes with approval
Break-glass	Emergency wide access with high audit

Avoid giving every on-call engineer admin access. Most incidents need diagnosis and constrained mitigation, not a root shell.
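The tiers above can be read as an ordered scale. The sketch below models that idea in plain Python; the action names and the strictly hierarchical ordering are assumptions made for illustration, not an AWS feature. In practice the tiers would map to separate IAM roles or permission sets.

```python
from enum import IntEnum


class Tier(IntEnum):
    OBSERVER = 1
    DIAGNOSER = 2
    OPERATOR = 3
    MAINTAINER = 4
    BREAK_GLASS = 5


# Hypothetical action names mapped to the lowest tier allowed to perform them.
MINIMUM_TIER = {
    "view_dashboard": Tier.OBSERVER,
    "run_diagnostic_command": Tier.DIAGNOSER,
    "execute_approved_runbook": Tier.OPERATOR,
    "apply_change": Tier.MAINTAINER,
    "root_shell": Tier.BREAK_GLASS,
}


def is_allowed(tier: Tier, action: str, approved: bool = False) -> bool:
    """Return True if the tier may perform the action. Unknown actions are denied."""
    required = MINIMUM_TIER.get(action)
    if required is None:
        return False
    # Changes need an explicit approval unless the caller is using break-glass.
    if action == "apply_change" and tier < Tier.BREAK_GLASS and not approved:
        return False
    return tier >= required


print(is_allowed(Tier.DIAGNOSER, "run_diagnostic_command"))
print(is_allowed(Tier.DIAGNOSER, "execute_approved_runbook"))
```

Denying unknown actions by default keeps the model read-only first: a new capability has to be placed on the scale before anyone can use it.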

5. Run Command

Run Command allows remote, secure management of configuration and one-time administrative tasks at scale.

Use Run Command for:

Collecting diagnostics.
Restarting a known service.
Checking disk usage.
Rotating local agent configuration.
Triggering safe cache clear.
Running health probe scripts.
Verifying patch state.

Avoid Run Command for:

Unreviewed arbitrary production mutation.
Long-running unknown scripts.
Data repair without approval.
Passing secrets in command parameters.
Replacing proper deployment pipelines.

Pattern:

Target by tags, not manually enumerated instance IDs.

Example targeting:

Environment=prod
Service=case-workflow
Role=worker
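As a rough illustration of tag-based targeting, the sketch below builds the keyword arguments for the Systems Manager `SendCommand` API without calling AWS. The helper name `build_send_command_request` and the chosen rate-control values are assumptions; the resulting dictionary could be passed to a boto3 SSM client as `send_command(**request)`.

```python
def build_send_command_request(tags: dict, commands: list, comment: str) -> dict:
    """Build keyword arguments for the SSM SendCommand API, targeting by tags."""
    if not tags:
        raise ValueError("refusing to build a request without tag targets")
    return {
        "DocumentName": "AWS-RunShellScript",
        # One target filter per tag key; a node is expected to match all of them.
        "Targets": [
            {"Key": f"tag:{key}", "Values": [value]}
            for key, value in sorted(tags.items())
        ],
        "Parameters": {"commands": commands},
        "Comment": comment,
        # Rate controls: 10% of targets at a time, stop sending after the first error.
        "MaxConcurrency": "10%",
        "MaxErrors": "0",
        # Seconds to wait for the command to start on a node before giving up.
        "TimeoutSeconds": 600,
    }


request = build_send_command_request(
    {"Environment": "prod", "Service": "case-workflow", "Role": "worker"},
    ["df -h"],
    "disk usage check",
)
print(request["Targets"])
```

Refusing an empty tag set is a small guardrail: a request with no targets filter should never silently widen to the whole fleet.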

6. Automation Runbooks

Systems Manager Automation runbooks simplify maintenance, deployment, and remediation tasks across AWS services.

A runbook should encode:

Preconditions.
Parameters.
Safety checks.
Execution steps.
Rollback or stop condition.
Output/evidence.
Access boundary.

Bad runbook:

SSH into instance and try restarting things.

Good runbook:

Given service, environment, and deployment version:
1. Confirm active alarm.
2. Confirm current deployment marker.
3. Check target health.
4. If latest deployment caused issue, shift traffic back.
5. Verify SLO recovery.
6. Record incident note.

6.1 Runbook Structure

name: RollbackCaseWorkflowCanary
parameters:
  Environment:
    allowedValues: [prod, staging]
  Service:
    allowedPattern: "^[a-z0-9-]+$"
  DeploymentId:
    required: true
preconditions:
  - caller has production-operator role
  - active alarm exists
  - rollback target exists
steps:
  - fetch current deployment state
  - verify rollback candidate health
  - shift traffic to previous version
  - monitor error rate for 10 minutes
  - stop if error rate worsens
outputs:
  - previousVersion
  - finalTrafficState
  - alarmState
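The parameter rules in this structure can be checked before any step runs. The following is a minimal sketch of such a check in Python, assuming every declared parameter is mandatory; `PARAMETER_RULES` and `validate_parameters` are illustrative names, not part of Systems Manager.

```python
import re

# Parameter rules mirroring the runbook structure above.
PARAMETER_RULES = {
    "Environment": {"allowedValues": ["prod", "staging"]},
    "Service": {"allowedPattern": r"^[a-z0-9-]+$"},
    "DeploymentId": {"required": True},
}


def validate_parameters(params: dict) -> list:
    """Return a list of validation errors; an empty list means the input is acceptable."""
    errors = []
    for name in params:
        if name not in PARAMETER_RULES:
            errors.append(f"unknown parameter: {name}")
    for name, rule in PARAMETER_RULES.items():
        value = params.get(name)
        if value is None or value == "":
            # Assumption: every declared parameter is mandatory in this sketch.
            errors.append(f"missing parameter: {name}")
            continue
        allowed = rule.get("allowedValues")
        if allowed is not None and value not in allowed:
            errors.append(f"{name} must be one of {allowed}")
        pattern = rule.get("allowedPattern")
        if pattern is not None and not re.fullmatch(pattern, value):
            errors.append(f"{name} does not match {pattern}")
    return errors


print(validate_parameters(
    {"Environment": "prod", "Service": "case-workflow", "DeploymentId": "d-123"}
))
print(validate_parameters({"Environment": "dev", "Service": "Bad Name"}))
```

Rejecting unknown parameters as well as invalid values keeps the runbook's input surface narrow, which is the point of declaring `allowedValues` and `allowedPattern` in the first place.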

6.2 Safe Automation Principles

Principle	Explanation
Narrow parameter set	Prevent arbitrary target/action
Validate preconditions	Avoid running during wrong state
Idempotent steps	Safe retry on partial failure
Timeouts	Avoid hanging automation
Approval for destructive actions	Human gate for irreversible changes
Observable execution	Emit logs/events
Least privilege role	Automation can only do required actions
Dry-run mode where possible	Preview before mutation
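Four of these principles (idempotent steps, approval for destructive actions, observable execution, and dry-run mode) can be sketched in a few lines. The `Step` and `Execution` classes below are hypothetical and only model the control flow; they are not how Systems Manager Automation is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    is_done: Callable[[], bool]  # idempotency check: True if desired state already holds
    apply: Callable[[], None]    # the mutation itself
    destructive: bool = False


@dataclass
class Execution:
    dry_run: bool = True
    approved: bool = False
    log: List[str] = field(default_factory=list)

    def run(self, steps: List[Step]) -> List[str]:
        for step in steps:
            if step.is_done():
                self.log.append(f"skip {step.name}: already in desired state")
            elif step.destructive and not self.approved:
                self.log.append(f"stop {step.name}: approval required")
                break
            elif self.dry_run:
                self.log.append(f"would run {step.name}")
            else:
                step.apply()
                self.log.append(f"ran {step.name}")
        return self.log


state = {"traffic": "v2"}
steps = [
    Step(
        "shift-traffic-to-v1",
        is_done=lambda: state["traffic"] == "v1",
        apply=lambda: state.update(traffic="v1"),
    )
]
print(Execution(dry_run=True).run(steps))   # preview only, state unchanged
print(Execution(dry_run=False).run(steps))  # performs the shift
print(Execution(dry_run=False).run(steps))  # safe retry: step is skipped
```

Because each step checks its desired state first, re-running the same execution after a partial failure does not repeat mutations that already succeeded.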

7. Playbooks vs Runbooks

AWS Well-Architected describes playbooks as step-by-step guides for investigating incidents; runbooks are commonly used to mitigate known issues.

Type	Purpose	Example
Playbook	Investigate/scope/root cause	“API 5xx spike investigation”
Runbook	Execute known mitigation	“Rollback ECS service to previous task definition”

A mature team links them:

8. OpsCenter

OpsCenter centralizes operational work items called OpsItems. OpsItems can contain context, related resources, investigation data, and linked Automation runbooks.

Use OpsCenter for:

Non-page operational issues.
Repeated alarms that need owner action.
Compliance drift findings.
Patch failures.
Manual remediation tracking.
Linking operational evidence and resources.

OpsItem fields should include:

source: CloudWatchAlarm
severity: Sev3
service: case-workflow
environment: prod
resourceArn: arn:aws:...
correlationId: optional
alarmName: prod-case-workflow-queue-age-high
firstSeenAt: 2026-07-01T10:15:12Z
owner: case-platform-team
runbook: DiagnoseQueueBacklog

OpsCenter smell:

OpsItems have no owner.
OpsItems are never closed.
Alarms generate duplicate spam.
OpsItems do not link to resources or runbooks.
There is no severity taxonomy.
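One way to avoid ownerless, duplicated, or context-free OpsItems is to build them from a single template. The sketch below assembles keyword arguments for the Systems Manager `CreateOpsItem` API without calling AWS. The helper name, the custom operational data keys, and the example ARN are assumptions for illustration; `/aws/resources` and `/aws/dedup` are reserved operational data keys whose exact behavior is worth confirming in the OpsCenter documentation.

```python
import json


def build_ops_item(alarm: dict) -> dict:
    """Build keyword arguments for the SSM CreateOpsItem API from an alarm record."""

    def searchable(value: str) -> dict:
        return {"Value": value, "Type": "SearchableString"}

    return {
        "Title": f"{alarm['alarmName']} ({alarm['environment']})",
        "Description": f"Alarm {alarm['alarmName']} fired for {alarm['service']}.",
        "Source": alarm["source"],
        # The API expects severity as a string such as "3"; "Sev3" becomes "3".
        "Severity": alarm["severity"].replace("Sev", ""),
        "OperationalData": {
            "service": searchable(alarm["service"]),
            "owner": searchable(alarm["owner"]),
            "runbook": searchable(alarm["runbook"]),
            # Links the OpsItem to the affected resource.
            "/aws/resources": searchable(json.dumps([{"arn": alarm["resourceArn"]}])),
            # Same dedup string for the same alarm, to limit duplicate OpsItems.
            "/aws/dedup": searchable(json.dumps({"dedupString": alarm["alarmName"]})),
        },
    }


ops_item = build_ops_item({
    "source": "CloudWatchAlarm",
    "severity": "Sev3",
    "service": "case-workflow",
    "environment": "prod",
    # Hypothetical ARN used only for this example.
    "resourceArn": "arn:aws:sqs:us-east-1:111122223333:example-queue",
    "alarmName": "prod-case-workflow-queue-age-high",
    "owner": "case-platform-team",
    "runbook": "DiagnoseQueueBacklog",
})
print(json.dumps(ops_item, indent=2))
```

Requiring `owner`, `runbook`, and `resourceArn` as dictionary keys means a missing field fails loudly with a `KeyError` instead of producing an OpsItem nobody can act on.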

9. Incident Manager

Incident Manager helps define response plans, contacts, escalation plans, and runbooks for incidents.

Incident response preparation happens before the incident:

Configure contacts.
Configure escalation plans.
Configure chat channels.
Configure response plans.
Attach Automation runbooks.
Map alarms to incident creation.
Define severity and ownership.

9.1 Response Plan Content

A response plan should define:

Field	Example
Incident title template	`Prod API 5xx high - case-workflow`
Impact/severity	Sev1/Sev2/Sev3
Contacts	Primary on-call
Escalation	Team lead, platform, security, data
Chat channel	War-room channel
Runbook	Diagnose/mitigate known issue
Automation role	Least privilege response role
Tags	service, environment, owner, compliance

9.2 Incident Roles

Role	Responsibility
Incident Commander	Coordinates response, protects focus
Operations Lead	Executes/coordinates mitigation
Communications Lead	Updates stakeholders
Scribe	Maintains timeline/evidence
Subject Matter Expert	Diagnoses domain-specific issue
Approver	Approves risky/destructive mitigation

Small teams may combine roles, but responsibilities still need to exist.

9.3 Incident Lifecycle

Do not end an incident at “service is back”. End it when:

User impact is resolved.
Monitoring confirms stability.
Timeline is captured.
Follow-up owners are assigned.
Evidence is preserved.

10. Severity Model

Severity must be explicit.

Severity	Definition	Example	Response
Sev1	Critical broad impact or data integrity/security risk	Case transitions failing globally	Immediate page, incident commander
Sev2	Major partial impact	One region or critical workflow degraded	Page owning team
Sev3	Limited degradation or risk	Queue backlog increasing but within SLA	Ticket + timed response
Sev4	Hygiene/improvement	Patch drift in non-prod	Backlog

Severity should consider:

User impact.
Regulatory impact.
Data integrity risk.
Security risk.
Duration.
Blast radius.
Workaround availability.
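Making severity explicit usually means writing the decision down as rules. The sketch below is one possible encoding of the table and factors above; the `Impact` fields and every threshold (50%, 10%) are illustrative assumptions that each team would need to set for itself.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    production: bool
    users_affected_pct: float  # share of users or requests affected, 0-100
    data_integrity_risk: bool = False
    security_risk: bool = False
    workaround_available: bool = False


# Responses taken from the severity table above.
RESPONSE = {
    "Sev1": "Immediate page, incident commander",
    "Sev2": "Page owning team",
    "Sev3": "Ticket + timed response",
    "Sev4": "Backlog",
}


def classify(impact: Impact) -> str:
    """Map impact factors to a severity. All thresholds are illustrative assumptions."""
    if impact.data_integrity_risk or impact.security_risk:
        return "Sev1"
    if not impact.production:
        return "Sev4"
    if impact.users_affected_pct >= 50:
        return "Sev1"
    if impact.users_affected_pct >= 10:
        # A usable workaround lowers urgency by one level in this sketch.
        return "Sev3" if impact.workaround_available else "Sev2"
    if impact.users_affected_pct > 0:
        return "Sev3"
    return "Sev4"


severity = classify(Impact(production=True, users_affected_pct=32))
print(severity, "->", RESPONSE[severity])
```

The value of a function like this is less the exact numbers than the fact that two on-call engineers looking at the same impact arrive at the same severity.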

11. Patch Manager and Configuration Compliance

Patch Manager automates patching for managed nodes with security-related and other updates.

Patching is not merely “install latest patches”. It is a controlled risk process.

Patch strategy:

Environment	Policy
Dev	Early patch, detect breakage
Staging	Patch before prod, representative testing
Prod canary	Small subset first
Prod fleet	Wave rollout with monitoring
Critical emergency	Expedited path with approval

Patch baseline should define:

Approved patches.
Rejected patches.
Approval delay.
Operating system/product scope.
Compliance severity.
Maintenance window.
Reboot behavior.

Failure modes:

Failure	Prevention
Patch all prod at once	Wave/canary rollout
No rollback plan	AMI/snapshot strategy
Patches break workload	Staging parity and health checks
Unknown inventory	Systems Manager Inventory
Missed reboot	Explicit reboot policy and alarms
Compliance assumed	Centralized compliance dashboard
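The wave/canary idea can be sketched independently of Patch Manager. The functions below split a fleet into growing waves and stop the rollout when a wave is unhealthy; `plan_waves`, `rollout`, and the growth factor are assumptions for illustration, and `patch_wave` and `wave_is_healthy` stand in for real patching and health checks.

```python
from typing import Callable, List


def plan_waves(node_ids: List[str], canary_size: int = 1, growth: int = 3) -> List[List[str]]:
    """Split a fleet into patch waves: a small canary first, then waves that grow."""
    if canary_size < 1 or growth < 1:
        raise ValueError("canary_size and growth must be at least 1")
    waves, start, size = [], 0, canary_size
    while start < len(node_ids):
        waves.append(node_ids[start:start + size])
        start += size
        size *= growth
    return waves


def rollout(waves: List[List[str]],
            patch_wave: Callable[[List[str]], None],
            wave_is_healthy: Callable[[List[str]], bool]) -> int:
    """Patch wave by wave; return the number of waves that completed healthy."""
    completed = 0
    for wave in waves:
        patch_wave(wave)
        if not wave_is_healthy(wave):
            break  # stop condition: never continue past an unhealthy wave
        completed += 1
    return completed


fleet = [f"node-{n}" for n in range(10)]
waves = plan_waves(fleet)
print([len(wave) for wave in waves])  # [1, 3, 6]
```

With ten nodes, a failure in the canary wave affects one node instead of the whole fleet, which is the failure mode the table warns about.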

12. Inventory and Fleet Visibility

Inventory collects metadata from managed nodes.

Useful inventory questions:

Which instances run unsupported OS versions?
Which nodes are missing the required agent version?
Which hosts have vulnerable package versions?
Which nodes lack required configuration?
Which application versions are deployed?
Which nodes are not reporting to SSM?

For regulated environments, inventory is evidence:

At date X, these nodes existed, with these versions, under this patch baseline, with this compliance state.
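An evidence statement like this is easier to defend when it is a stored record rather than a recollection. The sketch below shows one possible shape for such a record; the input field names, the one-hour reporting window, and the baseline name are assumptions, and the input would normally come from Inventory and compliance data rather than a hand-written list.

```python
import json
from datetime import datetime, timedelta, timezone


def evidence_snapshot(nodes: list, baseline: str, now: datetime,
                      max_ping_age: timedelta = timedelta(hours=1)) -> dict:
    """Summarize fleet state at one point in time as an audit record."""
    records = []
    for node in sorted(nodes, key=lambda n: n["id"]):
        records.append({
            "id": node["id"],
            "osVersion": node["os_version"],
            "complianceState": node["compliance"],
            # A node that has not pinged recently is flagged, not silently trusted.
            "reporting": now - node["last_ping"] <= max_ping_age,
        })
    return {
        "capturedAt": now.isoformat(),
        "patchBaseline": baseline,
        "nodes": records,
        "nonCompliant": [r["id"] for r in records if r["complianceState"] != "COMPLIANT"],
        "notReporting": [r["id"] for r in records if not r["reporting"]],
    }


now = datetime(2026, 7, 1, 10, 0, tzinfo=timezone.utc)
fleet = [
    {"id": "node-b", "os_version": "22.04", "compliance": "NON_COMPLIANT",
     "last_ping": now - timedelta(minutes=5)},
    {"id": "node-a", "os_version": "22.04", "compliance": "COMPLIANT",
     "last_ping": now - timedelta(hours=3)},
]
snapshot = evidence_snapshot(fleet, "prod-linux-baseline", now)
print(json.dumps(snapshot, indent=2))
```

Note that `node-a` is compliant but not reporting: its compliance state is stale, so the snapshot surfaces it separately instead of counting it as healthy.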

13. State Manager

State Manager keeps managed nodes/resources in desired state.

Use cases:

Ensure agent installed/running.
Ensure security configuration exists.
Ensure log collector configuration exists.
Ensure file permissions match baseline.
Ensure compliance scanner runs.

Be careful: State Manager can fight deployment automation if responsibilities are unclear.

Ownership rule:

State type	Owner
OS/security baseline	Platform/security team
Application binaries	Deployment pipeline
Runtime config	Config platform / application team
Emergency mitigation	Incident runbook with expiry

14. Change Manager and Operational Approval

Change Manager can be used when operational changes require approval and controlled execution.

Use it for:

High-risk production maintenance.
Emergency changes needing evidence.
Data repair workflows.
Security baseline changes.
Cross-account operational actions.

Do not use heavyweight approval for every small change; that creates bypass behavior. Use risk-based approval.

Risk dimensions:

Dimension	Low risk	High risk
Scope	One non-prod resource	Multi-account prod
Reversibility	Easy rollback	Irreversible data mutation
User impact	None	Customer-facing downtime
Security impact	No privilege change	IAM/KMS/network change
Data impact	No data mutation	Production data repair

15. Operational Readiness Review

Before a service goes to production, require an operational readiness review.

Checklist:

[ ] Service owner defined.
[ ] On-call rotation exists.
[ ] Severity model mapped.
[ ] CloudWatch alarms defined.
[ ] SLO or critical SLIs defined.
[ ] Dashboards exist.
[ ] Logs are structured and retained.
[ ] Correlation IDs propagate.
[ ] Runbooks linked to alarms.
[ ] Session Manager access works.
[ ] No inbound SSH required.
[ ] Patch baseline applied if compute fleet exists.
[ ] Backup/restore runbook tested.
[ ] Rollback runbook tested.
[ ] Incident response plan exists.
[ ] Escalation contacts configured.
[ ] Security and compliance logs retained.
[ ] Cost alarms/budgets exist.

A workload without runbooks is not ready. A workload with runbooks that were never tested is also not ready.

16. Runbook Patterns

16.1 Diagnose API Error Spike

Input: service, environment, alarmName

1. Confirm alarm state and start time.
2. Check request count to rule out low-traffic noise.
3. Check recent deployment events.
4. Compare 4xx vs 5xx.
5. Check ALB/API Gateway integration latency.
6. Check service logs by correlation IDs.
7. Check dependency metrics.
8. Check regional/AZ distribution.
9. Decide: rollback, scale, degrade, or escalate.
10. Record finding in incident timeline.
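Steps 2 and 4 of this playbook are simple arithmetic that is easy to get wrong under pressure. The sketch below encodes them; the function name, the minimum request count, and the 5% threshold are illustrative assumptions, and the counts would come from ALB or API Gateway metrics for the alarm window.

```python
def assess_error_spike(requests: int, errors_4xx: int, errors_5xx: int,
                       min_requests: int = 100, threshold: float = 0.05) -> str:
    """Classify an error spike from raw counts in the alarm window."""
    if requests < min_requests:
        # Too few requests: a handful of errors can look like a large percentage.
        return "low-traffic-noise"
    if errors_5xx / requests >= threshold:
        return "server-side"  # look at deployments, dependencies, capacity
    if errors_4xx / requests >= threshold:
        return "client-side"  # look at client releases, auth, request validation
    return "within-threshold"


print(assess_error_spike(requests=20, errors_4xx=0, errors_5xx=5))
print(assess_error_spike(requests=5000, errors_4xx=40, errors_5xx=1600))
print(assess_error_spike(requests=5000, errors_4xx=900, errors_5xx=10))
```

Checking volume before rate prevents a page for 5 failures out of 20 requests, while 1600 failures out of 5000 is clearly a server-side problem worth the rest of the playbook.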

16.2 Drain Bad ECS Task Set

Input: cluster, service, taskSet/deploymentId

1. Confirm latest deployment correlates with error spike.
2. Confirm previous task set is healthy.
3. Stop traffic shift or rollback deployment.
4. Watch 5xx, latency, target health.
5. Confirm error budget burn returns to normal.
6. Record deployment ID and rollback result.

16.3 Replay DLQ Safely

Input: dlq, sourceQueue, maxMessages, timeWindow

1. Identify failure reason category.
2. Confirm bug/config causing failure is fixed.
3. Sample messages and validate schema.
4. Re-drive small batch.
5. Monitor target processing error rate.
6. Increase batch gradually.
7. Stop on repeated failure.
8. Record replay count and remaining DLQ depth.
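Steps 4 through 8 describe a loop with a growing batch size and a stop condition. A minimal sketch, assuming a hypothetical `redrive` callable that returns `True` when a message is processed successfully; the batch sizes and the limit of two consecutive failing batches are illustrative choices.

```python
from typing import Callable, List


def replay_dlq(messages: List[str], redrive: Callable[[str], bool], max_messages: int,
               start_batch: int = 1, max_batch: int = 50,
               max_failed_batches: int = 2) -> dict:
    """Re-drive messages in growing batches; stop after repeated failing batches."""
    pending = list(messages[:max_messages])
    untouched = max(0, len(messages) - max_messages)
    replayed, failed_batches, batch_size = 0, 0, start_batch
    while pending and failed_batches < max_failed_batches:
        batch, pending = pending[:batch_size], pending[batch_size:]
        failures = [m for m in batch if not redrive(m)]
        replayed += len(batch) - len(failures)
        if failures:
            failed_batches += 1
            pending = failures + pending  # failed messages remain to be handled
            batch_size = start_batch      # fall back to the smallest batch
        else:
            failed_batches = 0
            batch_size = min(batch_size * 2, max_batch)
    return {"replayed": replayed, "remaining": len(pending) + untouched}


result = replay_dlq(
    ["m1", "m2", "m3", "bad", "m5"],
    redrive=lambda message: message != "bad",
    max_messages=5,
)
print(result)  # {'replayed': 4, 'remaining': 1}
```

The poison message fails twice in a row, so the loop stops and reports it as remaining instead of retrying forever, which gives the operator the replay count and remaining depth that step 8 asks for.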

16.4 Emergency Read-Only Data Inspection

Input: caseId, environment, reason

1. Require approver if prod regulated data.
2. Use read-only role.
3. Query by internal ID, not broad scan.
4. Redact sensitive output in shared channels.
5. Record reason, actor, timestamp, query reference.
6. Close access session.

17. Incident Timeline

A useful timeline includes:

10:01 Alarm triggered: prod-case-workflow-5xx-high
10:02 Incident created by response plan
10:03 On-call acknowledged
10:05 Impact confirmed: 32% transition requests failing
10:07 Recent deployment identified: version 2026.07.01-1042
10:10 Rollback runbook started
10:14 Traffic shifted to previous version
10:18 Error rate returned below threshold
10:25 Monitoring stable for 7 minutes
10:30 Incident resolved
10:45 Follow-up action: add compatibility test for transition policy schema

Bad timeline:

Had issue, fixed by rollback.

That is not enough for learning or audit.
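A timestamped timeline also makes response metrics computable. The sketch below derives a few of them from the example timeline above; the event labels are shorthand introduced here for illustration, and all timestamps are assumed to fall on the same day.

```python
from datetime import datetime

# Clock times taken from the example timeline; labels are illustrative shorthand.
TIMELINE = [
    ("10:01", "alarm_triggered"),
    ("10:03", "acknowledged"),
    ("10:10", "mitigation_started"),
    ("10:18", "recovered"),
    ("10:30", "resolved"),
]


def minutes_between(timeline: list, start_event: str, end_event: str) -> int:
    """Return whole minutes between two named events on the same day."""
    times = {event: datetime.strptime(clock, "%H:%M") for clock, event in timeline}
    return int((times[end_event] - times[start_event]).total_seconds() // 60)


print("time to acknowledge:", minutes_between(TIMELINE, "alarm_triggered", "acknowledged"))
print("time to recover:", minutes_between(TIMELINE, "alarm_triggered", "recovered"))
print("time to resolve:", minutes_between(TIMELINE, "alarm_triggered", "resolved"))
```

For the example incident this gives 2, 17, and 29 minutes. None of these numbers can be recovered from "Had issue, fixed by rollback."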

18. Post-Incident Review

Post-incident review is not blame. It is system improvement.

Include:

What happened?
What was the user/business/regulatory impact?
How was it detected?
Why was it not prevented?
What made diagnosis slow?
What mitigated the issue?
What made mitigation risky?
Which alarms/runbooks worked?
Which assumptions were wrong?
What concrete action items reduce recurrence or impact?

Action items should be system-level:

Weak action	Better action
Be more careful	Add deployment compatibility gate
Improve monitoring	Add SLO burn-rate alarm for transition failures
Document better	Link alarm to tested runbook
Avoid mistakes	Add policy-as-code validation

19. Security Model for Operations

Operational capability is powerful. Treat it like production write access.

Controls:

Separate human identity from workload identity.
Use IAM Identity Center/SSO where possible.
Use permission sets by operational role.
Require MFA for production access.
Avoid long-lived access keys.
Use Session Manager instead of SSH where feasible.
Log sessions and API activity.
Restrict automation roles.
Use approvals for destructive runbooks.
Monitor break-glass role usage.

19.1 Break-Glass Access

Break-glass should be:

Rare.
Time-bound.
MFA-protected.
Heavily logged.
Alerted when used.
Reviewed after use.
Separated from normal on-call workflow.

Break-glass becoming daily workflow is a governance failure.
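"Alerted when used" can be approximated by scanning CloudTrail records for assumptions of the break-glass role. The sketch below filters already-parsed event records; the role ARN and sample events are hypothetical, and it only looks at `AssumeRole`, so setups that federate through SAML or IAM Identity Center would need to match other event names as well.

```python
def break_glass_uses(events: list, break_glass_role_arn: str) -> list:
    """Return one alert record per AssumeRole call on the break-glass role."""
    alerts = []
    for event in events:
        if event.get("eventName") != "AssumeRole":
            continue
        params = event.get("requestParameters") or {}
        if params.get("roleArn") == break_glass_role_arn:
            alerts.append({
                "actor": (event.get("userIdentity") or {}).get("arn"),
                "time": event.get("eventTime"),
                "sessionName": params.get("roleSessionName"),
            })
    return alerts


# Hypothetical role ARN and CloudTrail-style records for illustration.
ROLE = "arn:aws:iam::111122223333:role/break-glass"
sample_events = [
    {"eventName": "AssumeRole", "eventTime": "2026-07-01T10:06:00Z",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
     "requestParameters": {"roleArn": ROLE, "roleSessionName": "incident-42"}},
    {"eventName": "AssumeRole", "eventTime": "2026-07-01T10:07:00Z",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
     "requestParameters": {"roleArn": "arn:aws:iam::111122223333:role/observer"}},
    {"eventName": "StartSession", "eventTime": "2026-07-01T10:08:00Z",
     "requestParameters": None},
]
for alert in break_glass_uses(sample_events, ROLE):
    print(alert)
```

Counting these alerts per week is a cheap way to notice when break-glass is drifting from rare to routine.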

20. Cost and Operational Efficiency

Operations also has cost.

Cost drivers:

Idle overprovisioned fleets due to fear of incidents.
Manual on-call time.
Repeated incidents without root fix.
Excessive logs/metrics from diagnostic panic.
Delayed patching causing emergency work.
Overly complex incident tooling with no adoption.

Efficiency patterns:

Pattern	Benefit
Runbooks for common actions	Lower MTTR
Alarm deduplication	Less fatigue
SLO-based paging	Fewer useless pages
OpsItem automation	Better tracking
Patch waves	Lower change risk
Session Manager	Less network/security maintenance
Game days	Practice before real incidents

21. Failure Modes

Failure mode	Symptom	Prevention
SSH/bastion dependency	Access blocked or unaudited during incident	Session Manager baseline
Runbooks outdated	On-call follows wrong steps	Scheduled runbook tests
Alarm without owner	Nobody responds	Ownership registry
Too many pages	Alert fatigue	SLO/severity design
Automation too powerful	Wrong target mutated	Least privilege + parameters
No incident commander	Chaotic parallel actions	Explicit roles
No timeline	Weak postmortem	Scribe role and Incident Manager timeline
Patch all-at-once	Fleet outage	Waves/canary
Manual data repair	Integrity/audit risk	Approved data repair runbook
Break-glass normalized	Governance erosion	Monitor and review every use

22. Deliberate Practice

Exercise 1: Replace SSH with Session Manager

For one EC2 workload:

Remove inbound SSH from security group.
Ensure SSM Agent is available.
Attach minimal instance profile.
Configure session logging.
Create IAM policy for team access by tags.
Test read-only diagnostic session.

Self-check:

Can the operator connect without a public IP?
Is session logged?
Can operator access only the intended environment?

Exercise 2: Convert a Manual Fix into Runbook

Pick a common incident action:

restart worker
clear stuck deployment
re-drive DLQ
rollback service
increase queue consumers

Convert it into:

Inputs.
Preconditions.
Safety checks.
Execution steps.
Rollback/stop condition.
Output evidence.

Self-check:

Can a new on-call engineer run it safely?
Does it prevent wrong environment targeting?
Does it emit evidence?

Exercise 3: Create Response Plan

For one Sev2 alarm:

Define response plan.
Add contacts.
Add escalation.
Add linked runbook.
Define chat channel.
Simulate alarm.

Self-check:

Was the right person engaged?
Did the runbook appear in context?
Was timeline created?

23. Production Checklist

[ ] No inbound SSH required for standard operations.
[ ] Session Manager configured and tested.
[ ] Session logs retained and protected.
[ ] Run Command usage restricted by IAM and tags.
[ ] Automation runbooks exist for common mitigations.
[ ] Destructive runbooks require approval.
[ ] Alarms map to owner and runbook.
[ ] Response plans exist for Sev1/Sev2 scenarios.
[ ] Contacts and escalation plans are current.
[ ] OpsItems have owner/severity/resource context.
[ ] Patch baseline and patch windows are defined.
[ ] Inventory collection is enabled where compute fleet exists.
[ ] Break-glass role is monitored.
[ ] Incident timeline process is defined.
[ ] Post-incident review template exists.
[ ] Game days test operational readiness.

24. Summary

Operations engineering on AWS is the design of a control loop that is secure and auditable.

Key points of Part 024:

Operations is not heroics; operations is a system.
Systems Manager is an important foundation for production access and automation.
Session Manager reduces the need for inbound SSH.
Run Command executes administrative tasks safely and in a controlled way.
Automation runbooks turn mitigation into a repeatable procedure.
Playbooks support investigation; runbooks execute mitigation.
OpsCenter helps manage operational work items.
Incident Manager connects alarms, contacts, escalation, and runbooks.
Patch Manager and Inventory are important for compliance and fleet hygiene.
Break-glass access must be rare, time-bound, and heavily audited.

In Part 025, we move on to reliability engineering: HA, failover, chaos, graceful degradation, backup/restore, RTO/RPO, and failure-mode-driven architecture.

References

AWS Documentation — AWS Systems Manager: https://docs.aws.amazon.com/systems-manager/latest/APIReference/Welcome.html
AWS Documentation — Session Manager: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html
AWS Documentation — Run Command: https://docs.aws.amazon.com/systems-manager/latest/userguide/run-command.html
AWS Documentation — Systems Manager Automation: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html
AWS Documentation — OpsCenter: https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html
AWS Documentation — Patch Manager: https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager.html
AWS Documentation — Incident Manager response plans: https://docs.aws.amazon.com/incident-manager/latest/userguide/response-plans.html
AWS Documentation — Incident Manager runbooks: https://docs.aws.amazon.com/incident-manager/latest/userguide/runbooks.html
AWS Well-Architected — Use playbooks to investigate issues: https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_ready_to_support_use_playbooks.html
