Learn Aws Part 024 Operations Incident Management Ssm And Runbooks
title: Learn AWS Engineering Mastery - Part 024
description: AWS operations engineering using Systems Manager, Session Manager, Run Command, Automation runbooks, OpsCenter, Incident Manager, patching, production access, and incident response.
series: learn-aws
seriesTitle: Learn AWS Engineering Mastery
order: 24
partTitle: Operations, Incident Management, SSM, and Runbooks
tags:
- aws
- systems-manager
- ssm
- incident-management
- runbooks
- opscenter
- session-manager
- patch-manager
- operations
- sre
- series
date: 2026-07-01
Operations, Incident Management, SSM, and Runbooks
Learning target: after this part, we can design an AWS operating model that is secure, automated, auditable, and incident-ready, using Systems Manager, Session Manager, Automation, OpsCenter, Incident Manager, runbooks, playbooks, patching, and access control.
The previous part covered observability: how a system emits signals. This part addresses the next question:
When a signal indicates a problem, who does what, with what access, following which procedure, and leaving what evidence?
Operations is not just “log in to the server and check the logs”. In modern AWS, production-grade operations must:
- Reduce direct manual access.
- Replace ad-hoc actions with runbooks.
- Connect alarms to response plans.
- Maintain an audit trail.
- Limit the operator's blast radius.
- Automate remediation that is safe to automate.
- Ensure patch/configuration compliance.
- Run incident response with clear roles and a clear timeline.
AWS Systems Manager is one of the main foundations for this kind of operation. Systems Manager is positioned as an operations hub and a secure end-to-end management solution for AWS, hybrid, and multicloud environments.
1. Kaufman Skill Map
Kaufman-style deconstruction:
| Sub-skill | What to master | Self-correction measure |
|---|---|---|
| Operating model | Owner, severity, on-call, escalation, evidence | Incidents do not depend on a hero engineer |
| Production access | Session Manager, IAM, audit, no shared SSH | Can debug without opening inbound SSH |
| SSM managed nodes | Agent, instance profile, hybrid activation | Nodes can be managed consistently |
| Run Command | Execute command at scale safely | No manual shell loops needed |
| Automation | Repeatable remediation | Runbooks can be tested and constrained |
| OpsCenter | Centralized operational work item | Alarms produce actionable OpsItems |
| Incident Manager | Response plan, contacts, escalation | The right people are engaged automatically |
| Patch/compliance | Patch baselines, windows, inventory | Compliance is known, not assumed |
2. Mental Model: Operations Is a Control System
Operations is a control loop.
A good AWS operating model has three planes:
Principle:
Do not give humans broad access for operations that should be expressible as a constrained runbook.
3. Systems Manager Overview
AWS Systems Manager covers many capabilities. For this series, we focus on the capabilities that matter for operations engineering.
| Capability | Function |
|---|---|
| Session Manager | Secure interactive access to managed nodes without open inbound SSH |
| Run Command | Runs administrative commands remotely and in a controlled way |
| Automation | Runs runbooks for maintenance, deployment, and remediation |
| State Manager | Keeps managed nodes/resources in their desired state |
| Patch Manager | Automates patching for security and other updates |
| Inventory | Collects software/configuration metadata from managed nodes |
| OpsCenter | Manages operational issues/OpsItems and runs runbooks |
| Incident Manager | Response plans, contacts, escalation, and runbooks for incident response |
| Parameter Store | Simple configuration/secrets with IAM/KMS integration |
| Change Manager | Approval workflow for certain operational changes |
Systems Manager relies on the concept of a managed node.
A managed node can be:
- An EC2 instance.
- An on-premises server.
- A VM in another environment.
- An edge/hybrid node configured for Systems Manager.
Key elements:
4. Production Access: No SSH by Default
Traditional pattern:
Open port 22 -> SSH key -> bastion -> server shell
Problems:
- Inbound network exposure.
- Long-lived SSH keys.
- Weak session audit unless heavily customized.
- Manual commands not repeatable.
- Operator blast radius too large.
- Difficult separation between read-only diagnosis and destructive action.
AWS-native safer baseline:
No inbound SSH
SSM Agent installed
IAM-controlled access
Session Manager session logging
CloudTrail API audit
Runbooks for common actions
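The "No inbound SSH" item in the baseline above can be checked mechanically. The following is a minimal sketch, assuming rule dictionaries shaped like the `IpPermissions` entries returned by EC2 `DescribeSecurityGroups`; the helper name `allows_inbound_ssh` is hypothetical and not part of any AWS SDK.

```python
from typing import Any

SSH_PORT = 22


def allows_inbound_ssh(ip_permissions: list[dict[str, Any]]) -> bool:
    """Return True if any inbound rule covers TCP port 22."""
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports, which includes SSH.
            return True
        if protocol != "tcp":
            continue
        from_port = rule.get("FromPort")
        to_port = rule.get("ToPort")
        if from_port is None or to_port is None:
            continue
        if from_port <= SSH_PORT <= to_port:
            return True
    return False
```

A check like this could run in a pipeline or a scheduled compliance job, so that an accidentally reopened port 22 is detected rather than assumed absent.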
4.1 Session Manager Guardrails
Recommended controls:
| Control | Why |
|---|---|
| No inbound SSH security group rule | Reduces network attack surface |
| IAM least privilege | Limits who can start sessions |
| Tag-based access | Operators access only owned environment/service |
| Session logging | Supports audit/review |
| KMS encryption | Protects session/log data |
| MFA/SSO | Strengthens human identity |
| Time-bound access | Reduces standing privilege |
| Separate break-glass role | Emergency only, heavily monitored |
Example IAM condition idea:
{
  "Condition": {
    "StringEquals": {
      "ssm:resourceTag/Environment": "prod",
      "ssm:resourceTag/Service": "case-workflow"
    }
  }
}
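To show where the condition fragment above sits, here is a sketch that generates a policy document around it. The function name `session_policy` and the statement ID are illustrative assumptions. The policy is deliberately incomplete: a real one usually needs further statements, for example to let operators end their own sessions.

```python
import json


def session_policy(environment: str, service: str) -> dict:
    """Build an IAM policy allowing Session Manager sessions only on tagged nodes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "StartSessionOnTaggedNodes",  # illustrative name
                "Effect": "Allow",
                "Action": "ssm:StartSession",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringEquals": {
                        "ssm:resourceTag/Environment": environment,
                        "ssm:resourceTag/Service": service,
                    }
                },
            }
        ],
    }


# Serialize for review or for attaching to a role or permission set.
policy_json = json.dumps(session_policy("prod", "case-workflow"), indent=2)
```

Generating the policy from parameters keeps the tag values consistent across teams and makes the access boundary reviewable as code.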
4.2 Read-Only First
Production access should be tiered:
| Tier | Capability |
|---|---|
| Observer | View metrics/logs/dashboards only |
| Diagnoser | Start read-only sessions / run diagnostic commands |
| Operator | Execute approved runbooks |
| Maintainer | Perform changes with approval |
| Break-glass | Emergency wide access with high audit |
Avoid giving every on-call admin access. Most incidents need diagnosis and constrained mitigation, not root shell.
5. Run Command
Run Command allows remote, secure management of configuration and one-time administrative tasks at scale.
Use Run Command for:
- Collecting diagnostics.
- Restarting a known service.
- Checking disk usage.
- Rotating local agent configuration.
- Triggering safe cache clear.
- Running health probe scripts.
- Verifying patch state.
Avoid Run Command for:
- Unreviewed arbitrary production mutation.
- Long-running unknown scripts.
- Data repair without approval.
- Passing secrets in command parameters.
- Replacing proper deployment pipelines.
Pattern:
Target by tags, not manually enumerated instance IDs.
Example targeting:
Environment=prod
Service=case-workflow
Role=worker
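The tag-based targeting above could be expressed as a request for the SSM `SendCommand` API. This is a sketch only: the function name `build_run_command_request` is hypothetical, and the concurrency, error, and timeout values are assumptions to be tuned per fleet. It builds the arguments without calling AWS.

```python
def build_run_command_request(environment: str, service: str, role: str,
                              commands: list[str], comment: str) -> dict:
    """Build keyword arguments for SSM SendCommand, targeting nodes by tag."""
    return {
        "DocumentName": "AWS-RunShellScript",
        # Multiple tag targets are combined, so a node must match all three.
        "Targets": [
            {"Key": "tag:Environment", "Values": [environment]},
            {"Key": "tag:Service", "Values": [service]},
            {"Key": "tag:Role", "Values": [role]},
        ],
        "Parameters": {"commands": commands},
        "MaxConcurrency": "10%",  # assumed: touch a fraction of the fleet at a time
        "MaxErrors": "1",         # assumed: stop sending to new nodes once exceeded
        "TimeoutSeconds": 60,     # assumed value
        "Comment": comment,
    }


request = build_run_command_request(
    "prod", "case-workflow", "worker", ["df -h"], "disk usage diagnostic"
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("ssm").send_command(**request)
```

Keeping the request builder separate from the API call makes the targeting logic easy to review and unit test.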
6. Automation Runbooks
Systems Manager Automation runbooks simplify maintenance, deployment, and remediation tasks across AWS services.
A runbook should encode:
- Preconditions.
- Parameters.
- Safety checks.
- Execution steps.
- Rollback or stop condition.
- Output/evidence.
- Access boundary.
Bad runbook:
SSH into instance and try restarting things.
Good runbook:
Given service, environment, and deployment version:
1. Confirm active alarm.
2. Confirm current deployment marker.
3. Check target health.
4. If latest deployment caused issue, shift traffic back.
5. Verify SLO recovery.
6. Record incident note.
6.1 Runbook Structure
name: RollbackCaseWorkflowCanary
parameters:
  Environment:
    allowedValues: [prod, staging]
  Service:
    allowedPattern: "^[a-z0-9-]+$"
  DeploymentId:
    required: true
preconditions:
  - caller has production-operator role
  - active alarm exists
  - rollback target exists
steps:
  - fetch current deployment state
  - verify rollback candidate health
  - shift traffic to previous version
  - monitor error rate for 10 minutes
  - stop if error rate worsens
outputs:
  - previousVersion
  - finalTrafficState
  - alarmState
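The parameter constraints in the runbook outline above can be enforced before any step runs. The sketch below is not the Systems Manager document format; `SCHEMA` and `validate_parameters` are hypothetical names that mirror the outline, and the sketch treats every declared parameter as mandatory.

```python
import re

# Hypothetical schema mirroring the parameters of the runbook outline.
SCHEMA = {
    "Environment": {"allowedValues": ["prod", "staging"]},
    "Service": {"allowedPattern": r"^[a-z0-9-]+$"},
    "DeploymentId": {},
}


def validate_parameters(params: dict[str, str]) -> list[str]:
    """Return validation errors; an empty list means the input is acceptable."""
    errors = []
    for name in params:
        if name not in SCHEMA:
            errors.append(f"unknown parameter: {name}")
    for name, rules in SCHEMA.items():
        value = params.get(name)
        if value is None:
            # Assumption: every declared parameter must be supplied.
            errors.append(f"missing parameter: {name}")
            continue
        allowed = rules.get("allowedValues")
        if allowed is not None and value not in allowed:
            errors.append(f"{name} must be one of {allowed}")
        pattern = rules.get("allowedPattern")
        if pattern is not None and not re.fullmatch(pattern, value):
            errors.append(f"{name} does not match {pattern}")
    return errors
```

Rejecting unknown parameters as well as invalid values keeps the runbook from being steered toward a target or action it was never designed for.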
6.2 Safe Automation Principles
| Principle | Explanation |
|---|---|
| Narrow parameter set | Prevent arbitrary target/action |
| Validate preconditions | Avoid running during wrong state |
| Idempotent steps | Safe retry on partial failure |
| Timeouts | Avoid hanging automation |
| Approval for destructive actions | Human gate for irreversible changes |
| Observable execution | Emit logs/events |
| Least privilege role | Automation can only do required actions |
| Dry-run mode where possible | Preview before mutation |
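Three of the principles above (idempotent steps, timeouts, and dry-run mode) can be combined in one small step wrapper. This is a sketch under assumptions: `run_step` and its return labels are hypothetical, and `is_done` and `action` stand in for whatever check and mutation a real runbook step would perform.

```python
import time
from typing import Callable


def run_step(name: str,
             is_done: Callable[[], bool],
             action: Callable[[], None],
             *,
             dry_run: bool = False,
             timeout_seconds: float = 30.0,
             poll_seconds: float = 1.0) -> str:
    """Run one runbook step; returns 'skipped', 'planned', or 'done'."""
    if is_done():
        # Idempotency: the desired state already holds, so a retry is harmless.
        return "skipped"
    if dry_run:
        # Dry run: report what would happen without mutating anything.
        return "planned"
    action()
    deadline = time.monotonic() + timeout_seconds
    while not is_done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"step {name!r} did not converge in {timeout_seconds}s")
        time.sleep(poll_seconds)
    return "done"
```

Because the step checks the desired state first, re-running a partially failed runbook repeats only the steps that have not yet taken effect.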
7. Playbooks vs Runbooks
AWS Well-Architected describes playbooks as step-by-step guides for investigating incidents; runbooks are commonly used to mitigate known issues.
| Type | Purpose | Example |
|---|---|---|
| Playbook | Investigate/scope/root cause | “API 5xx spike investigation” |
| Runbook | Execute known mitigation | “Rollback ECS service to previous task definition” |
A mature team links them:
8. OpsCenter
OpsCenter centralizes operational work items called OpsItems. OpsItems can contain context, related resources, investigation data, and linked Automation runbooks.
Use OpsCenter for:
- Non-page operational issues.
- Repeated alarms that need owner action.
- Compliance drift findings.
- Patch failures.
- Manual remediation tracking.
- Linking operational evidence and resources.
OpsItem fields should include:
source: CloudWatchAlarm
severity: Sev3
service: case-workflow
environment: prod
resourceArn: arn:aws:...
correlationId: optional
alarmName: prod-case-workflow-queue-age-high
firstSeenAt: 2026-07-01T10:15:12Z
owner: case-platform-team
runbook: DiagnoseQueueBacklog
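As an illustration, the fields above could be mapped onto a request for the SSM `CreateOpsItem` API. The helper `build_ops_item_request` is hypothetical, and the mapping of `Sev3` to the API's numeric severity string is an assumption about how a team might align its taxonomy. The sketch only builds the arguments; it does not call AWS.

```python
import json


def build_ops_item_request(fields: dict[str, str]) -> dict:
    """Map OpsItem fields onto keyword arguments for SSM CreateOpsItem."""
    severity = fields["severity"].removeprefix("Sev")  # "Sev3" -> "3"
    searchable = ("service", "environment", "alarmName", "owner", "runbook", "firstSeenAt")
    operational_data = {
        key: {"Value": fields[key], "Type": "SearchableString"}
        for key in searchable if key in fields
    }
    # "/aws/resources" is the reserved key OpsCenter uses for related resources.
    operational_data["/aws/resources"] = {
        "Value": json.dumps([{"arn": fields["resourceArn"]}]),
        "Type": "SearchableString",
    }
    return {
        "Title": fields["alarmName"],
        "Description": (
            f"{fields['alarmName']} fired for {fields['service']} "
            f"in {fields['environment']}"
        ),
        "Source": fields["source"],
        "Severity": severity,
        "OperationalData": operational_data,
    }
```

Storing owner, service, and runbook as searchable operational data is one way to keep OpsItems filterable by team and linked to the next action.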
OpsCenter smell:
- OpsItems have no owner.
- OpsItems are never closed.
- Alarms generate duplicate spam.
- OpsItems do not link to resources or runbooks.
- There is no severity taxonomy.
9. Incident Manager
Incident Manager helps define response plans, contacts, escalation plans, and runbooks for incidents.
Incident response preparation happens before the incident:
- Configure contacts.
- Configure escalation plans.
- Configure chat channels.
- Configure response plans.
- Attach Automation runbooks.
- Map alarms to incident creation.
- Define severity and ownership.
9.1 Response Plan Content
A response plan should define:
| Field | Example |
|---|---|
| Incident title template | Prod API 5xx high - case-workflow |
| Impact/severity | Sev1/Sev2/Sev3 |
| Contacts | Primary on-call |
| Escalation | Team lead, platform, security, data |
| Chat channel | War-room channel |
| Runbook | Diagnose/mitigate known issue |
| Automation role | Least privilege response role |
| Tags | service, environment, owner, compliance |
9.2 Incident Roles
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates response, protects focus |
| Operations Lead | Executes/coordinates mitigation |
| Communications Lead | Updates stakeholders |
| Scribe | Maintains timeline/evidence |
| Subject Matter Expert | Diagnoses domain-specific issue |
| Approver | Approves risky/destructive mitigation |
Small teams may combine roles, but responsibilities still need to exist.
9.3 Incident Lifecycle
Do not end an incident at “service is back”. End it when:
- User impact is resolved.
- Monitoring confirms stability.
- Timeline is captured.
- Follow-up owners are assigned.
- Evidence is preserved.
10. Severity Model
Severity must be explicit.
| Severity | Definition | Example | Response |
|---|---|---|---|
| Sev1 | Critical broad impact or data integrity/security risk | Case transitions failing globally | Immediate page, incident commander |
| Sev2 | Major partial impact | One region or critical workflow degraded | Page owning team |
| Sev3 | Limited degradation or risk | Queue backlog increasing but within SLA | Ticket + timed response |
| Sev4 | Hygiene/improvement | Patch drift in non-prod | Backlog |
Severity should consider:
- User impact.
- Regulatory impact.
- Data integrity risk.
- Security risk.
- Duration.
- Blast radius.
- Workaround availability.
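An explicit severity model can be written down as a small classification function, so the same signal always gets the same label. The rules below are illustrative assumptions loosely based on the table above, not an AWS or industry standard; in particular, downgrading a partial impact when a workaround exists is a choice a team would need to agree on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    user_impact: str  # "none", "limited", "partial", or "broad"
    data_integrity_risk: bool = False
    security_risk: bool = False
    workaround_available: bool = False


def classify(signal: Signal) -> str:
    """Map an incident signal to a severity label using illustrative rules."""
    if signal.data_integrity_risk or signal.security_risk or signal.user_impact == "broad":
        return "Sev1"
    if signal.user_impact == "partial":
        # Assumption: a usable workaround softens a partial impact by one level.
        return "Sev3" if signal.workaround_available else "Sev2"
    if signal.user_impact == "limited":
        return "Sev3"
    return "Sev4"
```

Dimensions such as regulatory impact and duration are left out of this sketch and would need their own rules.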
11. Patch Manager and Configuration Compliance
Patch Manager automates patching for managed nodes with security-related and other updates.
Patching is not merely “install latest patches”. It is a controlled risk process.
Patch strategy:
| Environment | Policy |
|---|---|
| Dev | Early patch, detect breakage |
| Staging | Patch before prod, representative testing |
| Prod canary | Small subset first |
| Prod fleet | Wave rollout with monitoring |
| Critical emergency | Expedited path with approval |
Patch baseline should define:
- Approved patches.
- Rejected patches.
- Approval delay.
- Operating system/product scope.
- Compliance severity.
- Maintenance window.
- Reboot behavior.
Failure modes:
| Failure | Prevention |
|---|---|
| Patch all prod at once | Wave/canary rollout |
| No rollback plan | AMI/snapshot strategy |
| Patches break workload | Staging parity and health checks |
| Unknown inventory | Systems Manager Inventory |
| Missed reboot | Explicit reboot policy and alarms |
| Compliance assumed | Centralized compliance dashboard |
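The wave/canary prevention in the table can be sketched as a simple planner that splits a fleet into a small first wave followed by progressively larger ones. The function `plan_waves` and its default sizes are hypothetical; Patch Manager itself is typically driven through maintenance windows and target groups rather than a function like this.

```python
def plan_waves(node_ids: list[str], canary_size: int = 1, growth: int = 2) -> list[list[str]]:
    """Split a fleet into a canary wave followed by waves that grow by `growth`."""
    if canary_size < 1 or growth < 1:
        raise ValueError("canary_size and growth must be >= 1")
    waves: list[list[str]] = []
    start, size = 0, canary_size
    while start < len(node_ids):
        # Slicing past the end is safe and yields a shorter final wave.
        waves.append(node_ids[start:start + size])
        start += size
        size *= growth
    return waves
```

With eight nodes and the defaults, the waves contain 1, 2, 4, and 1 nodes. Health checks between waves decide whether the next wave proceeds.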
12. Inventory and Fleet Visibility
Inventory collects metadata from managed nodes.
Useful inventory questions:
- Which instances run unsupported OS versions?
- Which nodes are missing the required agent version?
- Which hosts have vulnerable package versions?
- Which nodes lack required configuration?
- Which application versions are deployed?
- Which nodes are not reporting to SSM?
For regulated environments, inventory is evidence:
At date X, these nodes existed, with these versions, under this patch baseline, with this compliance state.
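The question about nodes that are not reporting to SSM can be answered from last-contact timestamps. The sketch below assumes a mapping from node ID to last ping time, for example built from the `LastPingDateTime` field that SSM `DescribeInstanceInformation` returns; the helper `stale_nodes` and the one-hour threshold are hypothetical.

```python
from datetime import datetime, timedelta


def stale_nodes(last_ping: dict[str, datetime], now: datetime,
                max_age: timedelta = timedelta(hours=1)) -> list[str]:
    """Return node IDs whose last ping is older than max_age, sorted for stable output."""
    return sorted(node for node, seen in last_ping.items() if now - seen > max_age)
```

Passing `now` explicitly keeps the function deterministic, which also makes it possible to recompute the result for a past date when evidence is requested.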
13. State Manager
State Manager keeps managed nodes/resources in desired state.
Use cases:
- Ensure agent installed/running.
- Ensure security configuration exists.
- Ensure log collector configuration exists.
- Ensure file permissions match baseline.
- Ensure compliance scanner runs.
Be careful: State Manager can fight deployment automation if responsibilities are unclear.
Ownership rule:
| State type | Owner |
|---|---|
| OS/security baseline | Platform/security team |
| Application binaries | Deployment pipeline |
| Runtime config | Config platform / application team |
| Emergency mitigation | Incident runbook with expiry |
14. Change Manager and Operational Approval
Change Manager can be used when operational changes require approval and controlled execution.
Use it for:
- High-risk production maintenance.
- Emergency changes needing evidence.
- Data repair workflows.
- Security baseline changes.
- Cross-account operational actions.
Do not use heavyweight approval for every small change; that creates bypass behavior. Use risk-based approval.
Risk dimensions:
| Dimension | Low risk | High risk |
|---|---|---|
| Scope | One non-prod resource | Multi-account prod |
| Reversibility | Easy rollback | Irreversible data mutation |
| User impact | None | Customer-facing downtime |
| Security impact | No privilege change | IAM/KMS/network change |
| Data impact | No data mutation | Production data repair |
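Risk-based approval can be made explicit by scoring a change against the dimensions in the table. The sketch below is one possible encoding: the flag names, the approval labels, and the thresholds are all assumptions for illustration, not Change Manager features.

```python
# One flag per high-risk column entry in the table (names are hypothetical).
HIGH_RISK_FLAGS = (
    "multi_account_prod",
    "irreversible",
    "customer_facing_downtime",
    "privilege_or_network_change",
    "production_data_mutation",
)


def approval_level(change: dict[str, bool]) -> str:
    """Return 'none', 'peer', or 'change-board' from the number of high-risk flags set."""
    unknown = set(change) - set(HIGH_RISK_FLAGS)
    if unknown:
        # Fail closed on typos so a misspelled flag cannot lower the score.
        raise ValueError(f"unknown risk flags: {sorted(unknown)}")
    score = sum(1 for flag in HIGH_RISK_FLAGS if change.get(flag, False))
    if score == 0:
        return "none"
    return "peer" if score == 1 else "change-board"
```

A graded result like this keeps low-risk changes fast while reserving heavyweight approval for changes that score high on several dimensions.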
15. Operational Readiness Review
Before a service goes to production, require an operational readiness review.
Checklist:
[ ] Service owner defined.
[ ] On-call rotation exists.
[ ] Severity model mapped.
[ ] CloudWatch alarms defined.
[ ] SLO or critical SLIs defined.
[ ] Dashboards exist.
[ ] Logs are structured and retained.
[ ] Correlation IDs propagate.
[ ] Runbooks linked to alarms.
[ ] Session Manager access works.
[ ] No inbound SSH required.
[ ] Patch baseline applied if compute fleet exists.
[ ] Backup/restore runbook tested.
[ ] Rollback runbook tested.
[ ] Incident response plan exists.
[ ] Escalation contacts configured.
[ ] Security and compliance logs retained.
[ ] Cost alarms/budgets exist.
A workload without runbooks is not ready. A workload with runbooks that were never tested is also not ready.
16. Runbook Patterns
16.1 Diagnose API Error Spike
Input: service, environment, alarmName
1. Confirm alarm state and start time.
2. Check request count to rule out low-traffic noise.
3. Check recent deployment events.
4. Compare 4xx vs 5xx.
5. Check ALB/API Gateway integration latency.
6. Check service logs by correlation IDs.
7. Check dependency metrics.
8. Check regional/AZ distribution.
9. Decide: rollback, scale, degrade, or escalate.
10. Record finding in incident timeline.
16.2 Drain Bad ECS Task Set
Input: cluster, service, taskSet/deploymentId
1. Confirm latest deployment correlates with error spike.
2. Confirm previous task set is healthy.
3. Stop traffic shift or rollback deployment.
4. Watch 5xx, latency, target health.
5. Confirm error budget burn returns to normal.
6. Record deployment ID and rollback result.
16.3 Replay DLQ Safely
Input: dlq, sourceQueue, maxMessages, timeWindow
1. Identify failure reason category.
2. Confirm bug/config causing failure is fixed.
3. Sample messages and validate schema.
4. Re-drive small batch.
5. Monitor target processing error rate.
6. Increase batch gradually.
7. Stop on repeated failure.
8. Record replay count and remaining DLQ depth.
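Steps 4 to 8 of the replay runbook above can be sketched as a loop that re-drives messages in growing batches and stops once failures reach a limit. Everything here is hypothetical: `replay_gradually` works on an in-memory sequence, and `process` stands in for sending one message back to the source queue and reporting whether it was handled.

```python
from typing import Callable, Sequence


def replay_gradually(messages: Sequence[str],
                     process: Callable[[str], bool],
                     *,
                     initial_batch: int = 1,
                     max_batch: int = 100,
                     max_failures: int = 1) -> dict[str, int]:
    """Re-drive messages in doubling batches; stop after a batch if failures hit the limit."""
    index, batch, replayed, failed = 0, initial_batch, 0, 0
    while index < len(messages):
        chunk = messages[index:index + batch]
        for message in chunk:
            if process(message):
                replayed += 1
            else:
                failed += 1
        index += len(chunk)
        if failed >= max_failures:
            break  # stop on repeated failure; untouched messages stay in the DLQ
        batch = min(batch * 2, max_batch)
    return {"replayed": replayed, "failed": failed, "untouched": len(messages) - index}
```

The returned counts correspond to the evidence the runbook asks for: how many messages were replayed and how many remain.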
16.4 Emergency Read-Only Data Inspection
Input: caseId, environment, reason
1. Require approver if prod regulated data.
2. Use read-only role.
3. Query by internal ID, not broad scan.
4. Redact sensitive output in shared channels.
5. Record reason, actor, timestamp, query reference.
6. Close access session.
17. Incident Timeline
A useful timeline includes:
10:01 Alarm triggered: prod-case-workflow-5xx-high
10:02 Incident created by response plan
10:03 On-call acknowledged
10:05 Impact confirmed: 32% transition requests failing
10:07 Recent deployment identified: version 2026.07.01-1042
10:10 Rollback runbook started
10:14 Traffic shifted to previous version
10:18 Error rate returned below threshold
10:25 Monitoring stable for 7 minutes
10:30 Incident resolved
10:45 Follow-up action: add compatibility test for transition policy schema
Bad timeline:
Had issue, fixed by rollback.
That is not enough for learning or audit.
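A timestamped timeline also makes response metrics computable, which a one-line summary does not. The sketch below copies a few entries from the good example above; the event labels and the helper `minutes_between` are hypothetical, and all times are assumed to fall on the same day.

```python
from datetime import datetime

# Entries copied from the example timeline (same-day times, HH:MM).
TIMELINE = [
    ("10:01", "alarm_triggered"),
    ("10:03", "acknowledged"),
    ("10:10", "mitigation_started"),
    ("10:18", "error_rate_recovered"),
    ("10:30", "resolved"),
]


def minutes_between(timeline: list[tuple[str, str]], start_event: str, end_event: str) -> int:
    """Return whole minutes from start_event to end_event."""
    times = {event: datetime.strptime(clock, "%H:%M") for clock, event in timeline}
    return int((times[end_event] - times[start_event]).total_seconds() // 60)
```

For the example, time to acknowledge is 2 minutes, time to recovery is 17 minutes, and time to resolution is 29 minutes.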
18. Post-Incident Review
Post-incident review is not blame. It is system improvement.
Include:
- What happened?
- What was the user/business/regulatory impact?
- How was it detected?
- Why was it not prevented?
- What made diagnosis slow?
- What mitigated the issue?
- What made mitigation risky?
- Which alarms/runbooks worked?
- Which assumptions were wrong?
- What concrete action items reduce recurrence or impact?
Action items should be system-level:
| Weak action | Better action |
|---|---|
| Be more careful | Add deployment compatibility gate |
| Improve monitoring | Add SLO burn-rate alarm for transition failures |
| Document better | Link alarm to tested runbook |
| Avoid mistakes | Add policy-as-code validation |
19. Security Model for Operations
Operational capability is powerful. Treat it like production write access.
Controls:
- Separate human identity from workload identity.
- Use IAM Identity Center/SSO where possible.
- Use permission sets by operational role.
- Require MFA for production access.
- Avoid long-lived access keys.
- Use Session Manager instead of SSH where feasible.
- Log sessions and API activity.
- Restrict automation roles.
- Use approvals for destructive runbooks.
- Monitor break-glass role usage.
19.1 Break-Glass Access
Break-glass should be:
- Rare.
- Time-bound.
- MFA-protected.
- Heavily logged.
- Alerted when used.
- Reviewed after use.
- Separated from normal on-call workflow.
Break-glass becoming daily workflow is a governance failure.
20. Cost and Operational Efficiency
Operations also has cost.
Cost drivers:
- Idle overprovisioned fleets due to fear of incidents.
- Manual on-call time.
- Repeated incidents without root fix.
- Excessive logs/metrics from diagnostic panic.
- Delayed patching causing emergency work.
- Overly complex incident tooling with no adoption.
Efficiency patterns:
| Pattern | Benefit |
|---|---|
| Runbooks for common actions | Lower MTTR |
| Alarm deduplication | Less fatigue |
| SLO-based paging | Fewer useless pages |
| OpsItem automation | Better tracking |
| Patch waves | Lower change risk |
| Session Manager | Less network/security maintenance |
| Game days | Practice before real incidents |
21. Failure Modes
| Failure mode | Symptom | Prevention |
|---|---|---|
| SSH/bastion dependency | Access blocked or unaudited during incident | Session Manager baseline |
| Runbooks outdated | On-call follows wrong steps | Scheduled runbook tests |
| Alarm without owner | Nobody responds | Ownership registry |
| Too many pages | Alert fatigue | SLO/severity design |
| Automation too powerful | Wrong target mutated | Least privilege + parameters |
| No incident commander | Chaotic parallel actions | Explicit roles |
| No timeline | Weak postmortem | Scribe role and Incident Manager timeline |
| Patch all-at-once | Fleet outage | Waves/canary |
| Manual data repair | Integrity/audit risk | Approved data repair runbook |
| Break-glass normalized | Governance erosion | Monitor and review every use |
22. Deliberate Practice
Exercise 1: Replace SSH with Session Manager
For one EC2 workload:
- Remove inbound SSH from security group.
- Ensure SSM Agent is available.
- Attach minimal instance profile.
- Configure session logging.
- Create IAM policy for team access by tags.
- Test read-only diagnostic session.
Self-check:
- Can operator access without public IP?
- Is session logged?
- Can operator access only the intended environment?
Exercise 2: Convert a Manual Fix into Runbook
Pick a common incident action:
restart worker
clear stuck deployment
re-drive DLQ
rollback service
increase queue consumers
Convert it into:
- Inputs.
- Preconditions.
- Safety checks.
- Execution steps.
- Rollback/stop condition.
- Output evidence.
Self-check:
- Can a new on-call run it safely?
- Does it prevent wrong environment targeting?
- Does it emit evidence?
Exercise 3: Create Response Plan
For one Sev2 alarm:
- Define response plan.
- Add contacts.
- Add escalation.
- Add linked runbook.
- Define chat channel.
- Simulate alarm.
Self-check:
- Was the right person engaged?
- Did the runbook appear in context?
- Was timeline created?
23. Production Checklist
[ ] No inbound SSH required for standard operations.
[ ] Session Manager configured and tested.
[ ] Session logs retained and protected.
[ ] Run Command usage restricted by IAM and tags.
[ ] Automation runbooks exist for common mitigations.
[ ] Destructive runbooks require approval.
[ ] Alarms map to owner and runbook.
[ ] Response plans exist for Sev1/Sev2 scenarios.
[ ] Contacts and escalation plans are current.
[ ] OpsItems have owner/severity/resource context.
[ ] Patch baseline and patch windows are defined.
[ ] Inventory collection is enabled where compute fleet exists.
[ ] Break-glass role is monitored.
[ ] Incident timeline process is defined.
[ ] Post-incident review template exists.
[ ] Game days test operational readiness.
24. Summary
Operations engineering on AWS is the design of a control loop that is secure and auditable.
Key points of Part 024:
- Operations is not heroics; operations is a system.
- Systems Manager is an important foundation for production access and automation.
- Session Manager reduces the need for inbound SSH.
- Run Command runs administrative tasks securely and in a controlled way.
- Automation runbooks turn mitigation into repeatable procedures.
- Playbooks support investigation; runbooks execute mitigation.
- OpsCenter helps manage operational work items.
- Incident Manager connects alarms, contacts, escalation, and runbooks.
- Patch Manager and Inventory are important for compliance and fleet hygiene.
- Break-glass must be rare, time-bound, and heavily audited.
In Part 025, we move into reliability engineering: HA, failover, chaos, graceful degradation, backup/restore, RTO/RPO, and failure-mode-driven architecture.
References
- AWS Documentation — AWS Systems Manager: https://docs.aws.amazon.com/systems-manager/latest/APIReference/Welcome.html
- AWS Documentation — Session Manager: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html
- AWS Documentation — Run Command: https://docs.aws.amazon.com/systems-manager/latest/userguide/run-command.html
- AWS Documentation — Systems Manager Automation: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html
- AWS Documentation — OpsCenter: https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html
- AWS Documentation — Patch Manager: https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager.html
- AWS Documentation — Incident Manager response plans: https://docs.aws.amazon.com/incident-manager/latest/userguide/response-plans.html
- AWS Documentation — Incident Manager runbooks: https://docs.aws.amazon.com/incident-manager/latest/userguide/runbooks.html
- AWS Well-Architected — Use playbooks to investigate issues: https://docs.aws.amazon.com/wellarchitected/latest/framework/ops_ready_to_support_use_playbooks.html