Runbooks and Operational Playbooks

Learn Java Microservices Design and Architect - Part 053

Runbooks and operational playbooks for Java microservices: diagnosis trees, mitigation steps, escalation paths, known bad states, evidence capture, and incident-ready operations.

2026-07-05, 15 min read, 2871 words

Part 053 — Runbooks and Operational Playbooks

A runbook is the bridge between telemetry and action.

Dashboards provide signals.

Alerts summon a human.

The runbook answers: what should that human do right now?

In a microservices system, an engineer may be looking at hundreds of metrics, dozens of log streams, and thousands of spans. Without a runbook, incident response easily turns into unguided exploration: open a dashboard, click a trace, search the logs, ask in Slack, restart the service, and hope the problem goes away.

That is not production-grade operations.

Mature operations have playbooks that make incident response:

fast
consistent
safe
auditable
trainable
improvable after each incident

This part covers:

the differences between a runbook, playbook, SOP, checklist, and diagnostic tree
a runbook structure that holds up under high pressure
runbook-linked alerts
diagnosis trees for Java microservices
a known bad states catalog
mitigation vs remediation
escalation paths
evidence capture
operational command safety
runbooks for HTTP APIs, async consumers, databases, dependencies, and workflows
executable runbooks and the automation boundary
a runbook review checklist

1. Core Mental Model

A runbook is not lengthy documentation.

A runbook is a decision support system for bad conditions.

When an incident hits, an engineer's cognitive bandwidth drops. People who are normally very sharp can skip basic steps: check the latest deployment, check the scope of impact, check whether the alarm is a duplicate, check dependencies, check saturation, check whether a restart is safe.

A good runbook reduces that mental load.

Note the order of priorities.

A runbook does not start with root cause.

A runbook starts with stabilization.

In an incident, the first question is not:

Why did this happen?

The first question is:

How do we limit the impact now without making things worse?

Root cause can wait. User impact cannot always wait.

2. Runbook vs Playbook vs SOP vs Checklist

These terms are often used interchangeably. For a large engineering organization, separating them is useful.

Artifact	Purpose	Example
Checklist	ensure no step is forgotten	pre-deploy checklist, incident handoff checklist
SOP	standard procedure for routine work	rotate secret, drain pod, scale consumer
Runbook	action guide for a specific alert/symptom	`Case API high 5xx`, `Kafka lag increasing`
Playbook	response strategy for a class of incident	`regional dependency outage`, `database saturation`, `bad deployment`
Diagnostic tree	decision tree to narrow down the cause	whether latency comes from the DB, downstream, queue, JVM, or network

A runbook is usually more specific than a playbook.

A playbook answers:

For this class of problem, what is our strategy?

A runbook answers:

This alert is firing. What are the first, second, and third steps?

3. Runbook Is Part of the Service Contract

A production-grade service does not only have an API contract.

It also has an operational contract.

The operational contract answers:

what is the service's SLO?
which alerts are valid?
who is the owner?
what is the primary dashboard?
which dependencies are critical?
what are the safe mitigations?
what is the rollback path?
what are the known bad states?
when should we escalate?
what evidence must be preserved?

The runbook is the manifestation of the operational contract.

If a service has an alert but no runbook, that alert is not production-ready.

4. Anatomy of a Good Runbook

A good runbook must be usable by an engineer who:

is sleepy
has not memorized every service
did not write the original code
is handling a cross-team incident
needs to make a decision quickly

Minimal structure:

# Runbook: <Alert/Symptom Name>

## Purpose
What problem this runbook covers.

## Severity Guidance
When it is SEV1/SEV2/SEV3.

## Immediate Safety Notes
Things that must not be done carelessly.

## Confirm Impact
How to confirm whether users or the business are affected.

## Fast Mitigation
Safe steps to reduce impact.

## Diagnosis Tree
Telemetry-based order of investigation.

## Known Bad States
Bad conditions that have occurred or are anticipated.

## Escalation
When and to whom to escalate.

## Recovery Validation
How to confirm the system has recovered.

## Evidence to Capture
Log, trace, metric, config, deployment, command, timeline.

## Post-Incident Follow-up
Items that must be created after the incident.

A runbook is not an essay.

A runbook must be actionable.

A sentence like this is bad:

Check the logs and investigate.

A sentence like this is better:

Open dashboard case-api / server errors. Filter by route, status, exception_class, and deployment_version. If error rate is concentrated in one route and one version, jump to “Bad Deployment Suspected”. If error rate is across all routes and dependency latency is high, jump to “Dependency Saturation Suspected”.

5. Alert-Linked Runbook

An alert without a runbook makes the on-call engineer start from zero.

A good alert carries its context with it:

service
environment
region
symptom
SLO affected
dashboard link
trace search link
log query link
runbook link
owner
escalation channel

Example alert annotations:

alert: CaseApiHighErrorRate
expr: |
  sum(rate(http_server_requests_seconds_count{
    service="case-api",
    status=~"5.."
  }[5m]))
  /
  sum(rate(http_server_requests_seconds_count{
    service="case-api"
  }[5m])) > 0.02
for: 10m
labels:
  severity: page
  service: case-api
  team: case-platform
annotations:
  summary: "case-api 5xx rate above SLO threshold"
  impact: "Users may fail to submit or update cases"
  dashboard: "https://observability.example/d/case-api"
  traces: "https://tracing.example/search?service=case-api&error=true"
  runbook: "https://runbooks.example/case-api/high-5xx"
  escalation: "#inc-case-platform"

The runbook must know which symptom this alert comes from.

If the alert is SLO-based, the runbook must connect the telemetry to the user journey.
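One way to keep alerts and runbooks linked is to lint alert definitions before they ship. The sketch below assumes the annotations are already available as a plain map. The class name `AlertAnnotationCheck` and the required-key list are illustrative, taken from the example annotations in this lesson, and are not part of any alerting tool's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical lint step: reports which required annotations an alert is missing.
final class AlertAnnotationCheck {

    // Assumed minimum set, mirroring the example alert in this lesson.
    static final List<String> REQUIRED =
            List.of("summary", "impact", "dashboard", "runbook", "escalation");

    // Returns the required keys that are absent or blank, in REQUIRED order.
    static List<String> missing(Map<String, String> annotations) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = annotations.get(key);
            if (value == null || value.isBlank()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

A CI step could fail the build whenever `missing(...)` is non-empty for an alert with `severity: page`.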

6. Incident Response Loop

A runbook does not stand alone. It lives inside the incident response loop.

Every incident should improve at least one of these:

alert
dashboard
trace/log instrumentation
mitigation lever
runbook
architecture constraint
test/fault injection
deployment safety

If an incident is closed but the runbook has not changed, the organization has probably lost a lesson.

7. Roles During Incident

For a major incident, do not let one person do everything.

Minimum roles:

Role	Responsibility
Incident commander	maintains coordination, priorities, and decisions
Tech lead / investigator	diagnoses and chooses the technical mitigation
Scribe	records the timeline, commands, decisions, and evidence
Communication lead	updates stakeholders, customers, and internal channels
Service owner	provides domain and system knowledge

For a small incident, one person can hold several roles. The runbook should still state when an incident needs a role split.

Signs that a role split is needed:

broad user impact
many services or teams involved
customer- or regulator-facing impact
high-risk mitigation
external communication required
the incident is long-running

8. Severity Guidance

A runbook must help determine severity.

Severity is not “how panicked the team is”. Severity is a combination of:

breadth of impact
depth of impact
business criticality
duration
regulatory/customer obligation
workaround availability

Example severity levels for regulatory case management:

Severity	Condition
SEV1	No user can submit or process enforcement cases in production
SEV1	Decision issuance is systematically wrong, or the audit trail is lost
SEV2	Most case processing fails, with limited workarounds
SEV2	SLA escalations are widely delayed
SEV3	One capability is slow or failing, with a manual workaround
SEV4	Minor degradation with no user-visible impact

A runbook should provide guidance, not replace judgment.
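The severity table can be encoded as a small suggestion function so that triage starts from a consistent baseline. This is a minimal sketch: the type names and the five boolean inputs are assumptions chosen to match the table rows, and the result is only a starting point that the incident commander can override.

```java
// Hypothetical triage answers; each field corresponds to a row in the severity table.
record ImpactAssessment(
        boolean allUsersBlocked,
        boolean systemicWrongDecisionOrAuditLoss,
        boolean majorityOfProcessingFailing,
        boolean slaEscalationLateBroadly,
        boolean userVisible) {}

enum Severity { SEV1, SEV2, SEV3, SEV4 }

final class SeverityGuide {

    // Checks the most severe conditions first and returns the first match.
    static Severity suggest(ImpactAssessment a) {
        if (a.allUsersBlocked() || a.systemicWrongDecisionOrAuditLoss()) {
            return Severity.SEV1;
        }
        if (a.majorityOfProcessingFailing() || a.slaEscalationLateBroadly()) {
            return Severity.SEV2;
        }
        if (a.userVisible()) {
            return Severity.SEV3;
        }
        return Severity.SEV4;
    }
}
```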

9. Mitigation vs Remediation

This is an important distinction.

Mitigation reduces the impact now.

Remediation fixes the cause.

Example:

Symptom	Mitigation	Remediation
5xx setelah deploy	rollback/canary abort	fix bug dan tambah test
DB saturation	reduce traffic, shed noncritical query	optimize query/index/schema
Kafka lag	scale consumer, pause low-priority producer	fix slow handler/idempotency issue
Downstream timeout	degrade feature, open circuit	improve dependency reliability/contract
Memory leak	restart instances safely	fix leak and heap test

A runbook must prioritize safe mitigation before root cause.

However, mitigation carries its own risk. Restarting a service without understanding the queue backlog can worsen lag. Scaling consumers without looking at DB saturation can put more load on the database. Rolling back without understanding a schema migration can break compatibility.

10. Safe Mitigation Rules

A safe mitigation must meet these conditions:

reversible
bounded blast radius
does not violate data integrity
does not hide audit evidence
verifiable
has a clear owner
has a timeout or rollback plan

Example runbook note:

## Immediate Safety Notes

Do not:
- restart all pods at once
- purge queues without approval from service owner
- disable audit event publishing
- manually update case state in DB unless emergency data-fix procedure is approved
- increase retry count during downstream outage

Safe first actions:
- confirm deployment version
- compare impacted routes
- reduce traffic to bad version
- enable degraded read-only mode for noncritical dashboard widgets
- scale read-side instances only if DB saturation is below threshold

In regulated systems, “fast” must never mean “destroying evidence”.
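The safe-mitigation conditions can also be written down as an explicit checklist that a tool or a reviewer runs against a proposed action. The sketch below is illustrative only: `MitigationPlan` and its fields are hypothetical names, one per condition in the list above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical description of a proposed mitigation, one flag per safety condition.
record MitigationPlan(
        String name,
        boolean reversible,
        boolean boundedBlastRadius,
        boolean preservesDataIntegrity,
        boolean preservesAuditEvidence,
        boolean verifiable,
        String owner,
        boolean hasTimeoutOrRollbackPlan) {

    // Returns the violated conditions; an empty list means the plan passes the checklist.
    List<String> violations() {
        List<String> found = new ArrayList<>();
        if (!reversible) found.add("not reversible");
        if (!boundedBlastRadius) found.add("unbounded blast radius");
        if (!preservesDataIntegrity) found.add("risks data integrity");
        if (!preservesAuditEvidence) found.add("hides audit evidence");
        if (!verifiable) found.add("not verifiable");
        if (owner == null || owner.isBlank()) found.add("no clear owner");
        if (!hasTimeoutOrRollbackPlan) found.add("no timeout or rollback plan");
        return found;
    }
}
```

Passing the checklist does not make an action safe by itself; it only makes the missing guarantees visible before someone acts.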

11. Diagnosis Tree: The Shape

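A diagnosis tree prevents random investigation.

For microservices, a basic tree asks three questions in order: did a deployment just happen, is a dependency failing, is a resource saturated? Each "yes" branches to a mitigation. Only when all three are "no" does the tree fall through to deep investigation.

This tree must be adapted per service.

A command-heavy service differs from a query-heavy one. An HTTP API differs from an async worker. A workflow coordinator differs from an event projector.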
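A basic tree can be sketched as code, which makes the order of questions explicit and testable. The version below mirrors the flowchart in the minimal template in section 30 (deployment, then dependency, then saturation). The type names and boolean signals are assumptions; in practice each signal would come from a dashboard or a query.

```java
// Hypothetical triage signals gathered in the first minutes of an incident.
record TriageSignals(
        boolean recentDeployment,
        boolean dependencyDegraded,
        boolean resourceSaturated) {}

enum NextStep {
    ROLLBACK_OR_ROUTE_AWAY_IF_SAFE,
    DEGRADE_FAIL_FAST_ESCALATE,
    SHED_LIMIT_ISOLATE,
    DEEP_INVESTIGATION
}

final class DiagnosisTree {

    // Walks the questions in a fixed order and returns the first matching branch.
    static NextStep walk(TriageSignals s) {
        if (s.recentDeployment()) {
            return NextStep.ROLLBACK_OR_ROUTE_AWAY_IF_SAFE;
        }
        if (s.dependencyDegraded()) {
            return NextStep.DEGRADE_FAIL_FAST_ESCALATE;
        }
        if (s.resourceSaturated()) {
            return NextStep.SHED_LIMIT_ISOLATE;
        }
        return NextStep.DEEP_INVESTIGATION;
    }
}
```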

12. Known Bad States Catalog

A known bad state is a condition that:

has caused an incident before
is, in theory, very likely to occur
has a clear telemetry signature
has a known mitigation

This catalog turns individual experience into organizational knowledge.

Template:

## Known Bad State: <Name>

### Signature
- metric:
- log pattern:
- trace pattern:
- user symptom:

### Likely Causes
- cause 1
- cause 2

### Immediate Mitigation
- step 1
- step 2

### Unsafe Actions
- do not ...

### Validation
- metric returns to ...
- error stops ...

### Permanent Fix Ideas
- ...

Example:

## Known Bad State: Kafka Consumer Lag Caused by DB Pool Saturation

### Signature
- `kafka_consumer_records_lag_max` rising for `case-events`
- `hikaricp_connections_pending` > 0
- DB CPU > 80%
- consumer handler trace shows most time in `CaseProjectionRepository.upsert`
- no increase in handler exception rate

### Likely Causes
- projection upsert query regressed
- new event version triggers expensive enrichment
- DB pool too small relative to consumer concurrency
- read model index missing after migration

### Immediate Mitigation
1. Do not simply scale consumers.
2. Reduce consumer concurrency by 50% if DB is saturated.
3. Pause noncritical projection consumers if available.
4. If caused by latest deployment, rollback projector version.
5. If lag threatens SLA, escalate to data platform/database owner.

### Unsafe Actions
- do not purge topic
- do not increase retry count
- do not scale consumers while DB pending connections are high

### Validation
- DB pending connections returns to 0
- lag stops increasing
- oldest unprocessed event age decreases
- projection freshness SLI recovers

Known bad states are not blame records.

They are memory systems.
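A catalog entry becomes easier to reuse when its signature is machine-checkable. The following sketch encodes the signature of the example entry above as a predicate over a telemetry snapshot. `TelemetrySnapshot`, its fields, and the class names are hypothetical; the thresholds (pending connections above 0, DB CPU above 80%) come from the example.

```java
import java.util.function.Predicate;

// Hypothetical point-in-time view of the signals named in the example signature.
record TelemetrySnapshot(
        boolean lagRising,
        int dbPendingConnections,
        double dbCpuPercent,
        boolean handlerErrorRateRising) {}

// A named bad state with a signature that can be tested against a snapshot.
record KnownBadState(String name, Predicate<TelemetrySnapshot> signature) {
    boolean matches(TelemetrySnapshot snapshot) {
        return signature.test(snapshot);
    }
}

final class KnownBadStates {
    static final KnownBadState LAG_FROM_DB_POOL_SATURATION = new KnownBadState(
            "Kafka Consumer Lag Caused by DB Pool Saturation",
            s -> s.lagRising()
                    && s.dbPendingConnections() > 0
                    && s.dbCpuPercent() > 80.0
                    && !s.handlerErrorRateRising());
}
```

A match is a hint to open the corresponding catalog entry, not proof of the cause.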

13. Runbook for High 5xx in Java HTTP Service

A practical runbook must be specific enough.

Example:

# Runbook: case-api high 5xx

## Purpose
Use when `case-api` 5xx rate threatens SLO for submit/update/read case user journeys.

## Confirm Impact
1. Open SLO dashboard `case-api / user journeys`.
2. Check whether failures affect:
   - `POST /cases`
   - `PATCH /cases/{id}`
   - `POST /cases/{id}/submit`
   - `GET /cases/{id}`
3. Check if error budget burn is active.
4. Check support/customer/regulatory-facing channel for reports.

## Fast Mitigation
If error rate started after deployment:
- stop rollout
- route traffic away from new version
- rollback if schema/config compatible

If error rate caused by dependency timeout:
- open circuit for optional dependency
- enable degraded mode if available
- fail fast instead of waiting full timeout

If error rate caused by DB saturation:
- shed noncritical reads
- reduce expensive query traffic
- disable optional dashboard widgets
- do not scale API if DB is already saturated

## Diagnosis
1. Group 5xx by route.
2. Group by exception class.
3. Group by deployment version.
4. Check dependency spans.
5. Check DB pool metrics.
6. Check JVM/thread metrics.
7. Check config changes.

## Escalation
Escalate to:
- case-platform owner if route-specific
- database owner if DB saturation
- identity/platform if auth dependency
- workflow team if submit triggers workflow failures

## Recovery Validation
- 5xx rate below threshold for 15 minutes
- successful submit/update synthetic checks pass
- no growing async backlog caused by failed command side effects

14. Java/Spring Metrics Needed by the Runbook

A runbook depends on the shape of the telemetry. If telemetry is missing, the runbook becomes vague.

For HTTP service:

request count by route/method/status
latency histogram by route/method/status
exception class count
dependency latency/error by target
DB pool active/idle/pending
thread pool active/queued/rejected
JVM heap/nonheap/GC pause
CPU/process load
deployment version/build SHA
feature flag state
idempotency duplicate/replay count
outbox pending age

Example Micrometer custom counter:

package com.example.caseapi.observability;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public final class CaseCommandMetrics {
    private final Counter caseSubmitRejected;
    private final Counter caseSubmitAccepted;

    public CaseCommandMetrics(MeterRegistry registry) {
        this.caseSubmitRejected = Counter.builder("case_command_rejected_total")
                .tag("command", "submit_case")
                .description("Total rejected SubmitCase commands by business validation or safety gate")
                .register(registry);

        this.caseSubmitAccepted = Counter.builder("case_command_accepted_total")
                .tag("command", "submit_case")
                .description("Total accepted SubmitCase commands")
                .register(registry);
    }

    public void rejected(String reason) {
        // Do not create unbounded tags from raw validation messages.
        // Prefer stable reason taxonomy.
        caseSubmitRejected.increment();
    }

    public void accepted() {
        caseSubmitAccepted.increment();
    }
}

This example intentionally avoids a dynamic tag value for reason. A real implementation might use a controlled enum tag such as reason="invalid_state", but it must avoid arbitrary high-cardinality labels.
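A bounded reason taxonomy could look like the sketch below. To stay self-contained it uses plain JDK counters as a stand-in for a meter registry, so it is not Micrometer code. All reason names other than `invalid_state` are hypothetical.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Closed set of rejection reasons; the tag cardinality can never exceed values().length.
enum RejectionReason {
    INVALID_STATE, MISSING_FIELD, POLICY_VIOLATION, OTHER;

    // Stable lowercase value suitable for use as a metric tag.
    String tagValue() {
        return name().toLowerCase(Locale.ROOT);
    }
}

// JDK-only stand-in for one counter per reason tag.
final class RejectionCounters {
    private final Map<RejectionReason, LongAdder> counts = new EnumMap<>(RejectionReason.class);

    RejectionCounters() {
        for (RejectionReason reason : RejectionReason.values()) {
            counts.put(reason, new LongAdder());
        }
    }

    // Unknown (null) reasons fall into OTHER instead of creating a new tag value.
    void rejected(RejectionReason reason) {
        counts.get(reason == null ? RejectionReason.OTHER : reason).increment();
    }

    long count(RejectionReason reason) {
        return counts.get(reason).sum();
    }
}
```

Mapping raw validation messages to this enum happens at the call site, so free-text messages never reach the metric.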

15. Runbook for Latency Degradation

A latency incident needs different thinking from an error incident.

A request can succeed and still violate the user experience or the SLA.

# Runbook: case-api high latency

## Confirm Impact
- Is p95/p99 above SLO or only average latency?
- Which route/user journey is affected?
- Is latency correlated with traffic increase?
- Is latency correlated with deployment or config change?

## First Split
- server processing time high?
- downstream span high?
- DB span high?
- queue wait high?
- client/network high?

## Common Signatures

### DB pool saturation
- `hikaricp_connections_pending` > 0
- request latency rises across DB-heavy routes
- DB spans dominate trace

Mitigation:
- shed noncritical reads
- disable expensive filters/sorts
- reduce consumer concurrency if consumers share DB
- rollback query-changing deployment

### Thread pool saturation
- executor queue grows
- active threads at max
- rejected tasks increase
- CPU may or may not be high

Mitigation:
- reduce ingress concurrency
- shed low-priority traffic
- avoid increasing thread pool without checking DB/downstream saturation

### Downstream latency
- trace shows dependency span dominates
- local CPU/DB normal
- circuit breaker slow-call rate high

Mitigation:
- fail fast
- use cached/stale value if allowed
- disable optional enrichment
- escalate to dependency owner

Latency debugging must avoid one trap: increasing capacity at the wrong layer.

If DB is saturated, adding API pods can increase pressure and worsen latency.
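The "first split" step can be approximated by asking which component takes the largest share of time in a representative trace. The helper below is a minimal sketch under that assumption; the component names and the input map are hypothetical and would normally be derived from span durations.

```java
import java.util.Map;

final class LatencySplit {

    // Returns the component with the largest time share, or "unknown" when there is no data.
    static String dominant(Map<String, Long> millisByComponent) {
        return millisByComponent.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("unknown");
    }
}
```

If the dominant component is the database, capacity should be added or load shed at the database layer, not at the API layer.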

16. Runbook for Kafka/Event Consumer Lag

Async systems fail quietly. A user-facing API may look healthy while the business process is falling behind.

# Runbook: case-event-projector lag increasing

## Confirm Impact
- Which topic/consumer group?
- Is lag increasing or stable?
- What is oldest unprocessed event age?
- Which read model/user journey depends on this projector?
- Is freshness SLO violated?

## Diagnosis Tree
1. Check consumer error rate.
2. Check handler latency.
3. Check DLQ count.
4. Check DB pool and query latency.
5. Check poison message signature.
6. Check deployment version.
7. Check upstream event volume spike.

## Mitigation
If handler failing on specific event:
- isolate poison message according to DLQ policy
- do not skip silently
- record event id and reason

If handler slow due to DB:
- reduce concurrency if DB saturated
- scale consumers only if DB has headroom
- pause low-priority projections

If upstream spike:
- calculate catch-up time
- increase partitions/consumer only if architecture supports it
- adjust producer rate if possible

## Recovery Validation
- lag trend negative
- oldest unprocessed event age decreasing
- projection watermark advances
- read model freshness SLO recovered

The key metric is often not raw lag count. It is oldest unprocessed event age.

A lag of 10,000 events may be harmless if events are tiny and catch-up is fast. A lag of 50 events may be severe if they block legal escalation SLA.
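The "calculate catch-up time" step is simple arithmetic: backlog divided by the net drain rate (consume rate minus produce rate). The sketch below assumes both rates are roughly constant over the catch-up window, which is a simplification; the class and parameter names are illustrative.

```java
import java.time.Duration;
import java.util.Optional;

final class CatchUpEstimate {

    // Returns empty when consumers do not outpace producers, meaning the backlog never drains.
    static Optional<Duration> estimate(long backlogEvents,
                                       double consumePerSecond,
                                       double producePerSecond) {
        if (backlogEvents <= 0) {
            return Optional.of(Duration.ZERO);
        }
        double netPerSecond = consumePerSecond - producePerSecond;
        if (netPerSecond <= 0) {
            return Optional.empty();
        }
        long seconds = (long) Math.ceil(backlogEvents / netPerSecond);
        return Optional.of(Duration.ofSeconds(seconds));
    }
}
```

With a backlog of 10,000 events, 600 events/s consumed and 100 events/s produced, the estimate is 20 seconds. With consumption equal to production, the result is empty: the backlog will not drain without a change.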

17. Runbook for Workflow Stuck

A long-running workflow requires lifecycle observability.

# Runbook: enforcement workflow stuck

## Confirm Impact
- Which workflow type?
- How many instances stuck?
- Which state?
- How long in state?
- Is SLA/escalation deadline affected?
- Is this newly created or existing workflow version?

## Diagnosis
1. Group stuck instances by state.
2. Check timer jobs due but not executed.
3. Check external command/reply correlation.
4. Check worker availability.
5. Check workflow version/deployment.
6. Check failed task/event history.
7. Check if downstream dependency is rejecting commands.

## Mitigation
- restart worker only if job executor/worker is unhealthy
- pause new workflow starts if stuck state causes duplicate external commands
- replay/retry failed activity according to idempotency guarantees
- manually advance state only through approved operational command

## Unsafe Actions
- do not edit workflow DB directly
- do not replay non-idempotent external command without idempotency key
- do not delete workflow history

## Recovery Validation
- due timers are being consumed
- stuck state count decreases
- no duplicate decision/action emitted
- audit trail remains complete

Workflow incidents are dangerous because retrying can duplicate business side effects.

A runbook must know which activities are idempotent and which require manual approval.
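That knowledge can be captured as a small retry policy so responders do not have to decide per activity under pressure. This is a sketch under assumed names: `FailedActivity`, `RetryDecision`, and the three outcomes are hypothetical, and a real workflow engine would supply its own metadata.

```java
enum RetryDecision { AUTO_RETRY, RETRY_WITH_EXISTING_KEY, MANUAL_APPROVAL }

// Hypothetical view of a failed workflow activity.
record FailedActivity(String name, boolean idempotent, String idempotencyKey) {}

final class RetryPolicy {

    // Idempotent activities retry freely; others need an idempotency key or a human.
    static RetryDecision decide(FailedActivity activity) {
        if (activity.idempotent()) {
            return RetryDecision.AUTO_RETRY;
        }
        String key = activity.idempotencyKey();
        if (key != null && !key.isBlank()) {
            return RetryDecision.RETRY_WITH_EXISTING_KEY;
        }
        return RetryDecision.MANUAL_APPROVAL;
    }
}
```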

18. Operational Commands as Code

Many mitigation steps become repeated commands:

pause consumer
resume consumer
reduce concurrency
enable degraded mode
disable noncritical enrichment
route away from version
trigger projection rebuild
replay DLQ event
mark workflow for manual review

These commands should not be random shell snippets buried in Slack.

They should be operational APIs or controlled scripts with:

authentication
authorization
audit logging
dry-run mode
idempotency
input validation
blast radius limit
rollback path

Example operational command model:

package com.example.caseops;

import java.time.Instant;
import java.util.UUID;

public record OperationalCommand(
        UUID commandId,
        String commandType,
        String targetService,
        String targetScope,
        String requestedBy,
        String reason,
        Instant requestedAt,
        boolean dryRun
) {}

Example safety gate:

package com.example.caseops;

public final class OperationalSafetyGate {

    public void validate(OperationalCommand command) {
        if (command.reason() == null || command.reason().isBlank()) {
            throw new IllegalArgumentException("Operational command requires reason");
        }

        if ("all-regions".equals(command.targetScope()) && !isEmergencyApproved(command)) {
            throw new IllegalStateException("All-region operation requires emergency approval");
        }
    }

    private boolean isEmergencyApproved(OperationalCommand command) {
        // In production, check incident id, role, approval workflow, and audit policy.
        return command.reason().contains("INC-");
    }
}

Manual operations are part of the system.

If they bypass audit, they are hidden architecture.
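To make the audit and dry-run requirements concrete, here is a minimal sketch of an executor that records every request before acting and never executes a dry run. `OpsRequest` and `AuditedExecutor` are hypothetical names, and the in-memory list stands in for a durable audit store.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified request; a real one would carry more context.
record OpsRequest(String commandType, String target, String requestedBy,
                  String reason, boolean dryRun) {}

final class AuditedExecutor {
    // In-memory stand-in for a durable, append-only audit log.
    private final List<String> auditLog = new ArrayList<>();

    // Audits first, then runs the action unless this is a dry run. Returns true if executed.
    boolean execute(OpsRequest request, Runnable action) {
        auditLog.add(request.requestedBy() + " " + request.commandType()
                + " " + request.target()
                + " dryRun=" + request.dryRun()
                + " reason=" + request.reason());
        if (request.dryRun()) {
            return false;
        }
        action.run();
        return true;
    }

    List<String> auditLog() {
        return List.copyOf(auditLog);
    }
}
```

Writing the audit entry before running the action means that even a failed or aborted command leaves a trace.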

19. Runbook Evidence Capture

During an incident, capture evidence before it disappears.

Evidence types:

alert start time and condition
impacted service, route, region, tenant
deployment version/build SHA
config/feature flag state
dashboard snapshots
trace IDs
representative log lines
command executed
mitigation timestamp
owner decision
user/business impact estimate
recovery timestamp

Template:

## Evidence Log

Incident ID: INC-2026-xxxx
Service: case-api
Environment: prod
Region: ap-southeast-1
Started: 2026-07-05T02:14:00+07:00
Detected by: SLO burn-rate alert
Impacted journey: Submit Case

### Timeline
- 02:14 alert fired: case-api high 5xx
- 02:16 impact confirmed: POST /cases/{id}/submit failing 7%
- 02:19 deployment v2026.07.05.3 identified as likely cause
- 02:22 rollout stopped
- 02:24 traffic shifted away from bad version
- 02:31 5xx back below SLO threshold

### Evidence
- dashboard link:
- trace IDs:
- log query:
- deployment diff:
- config diff:

### Decisions
- rollback chosen over feature flag because failure occurs before flag evaluation
- no manual DB change performed

For regulated environments, this is not bureaucracy. It is defensibility.

20. Escalation Path

A runbook must define escalation by ownership boundary.

Escalation is not failure. It is correct ownership routing.

Condition	Escalate To
unknown data integrity risk	service/domain owner + incident commander
database saturation beyond service pool	database/platform owner
dependency SLO breach	dependency owner
auth/token/mTLS failure	identity/platform owner
workflow duplicate side effect risk	workflow owner + domain owner
audit trail inconsistency	compliance/audit owner + domain owner
customer/regulator impact	incident commander + communication lead

Bad escalation:

Anyone know why prod is broken?

Good escalation:

case-api submit journey SEV2. 5xx concentrated on SubmitCaseHandler in version 2026.07.05.3. DB and downstream normal. Suspect null policy mapping introduced in latest deploy. Traffic shifted to previous version; error rate recovering. Need case-platform owner to confirm rollback safety for command schema change.

21. Runbook as Executable Knowledge

Runbooks often decay because they are separate from the system.

Better maturity levels:

Level	Description
0	no runbook
1	static markdown with manual steps
2	linked to alerts and dashboards
3	contains copy-paste-safe queries and commands
4	has guarded automation for safe actions
5	continuously tested by drills or synthetic incidents

An executable runbook does not mean fully automated incident response.

It means repeated, safe, bounded operations are codified.

Examples:

generate incident context bundle
find latest deployment for impacted service
compare error by version
list top exception classes
check dependency health
pause/resume consumer with audit
flip degraded mode with expiration
open incident channel with template

22. Example Incident Context Bundle

A simple Java service can expose an internal diagnostic endpoint, but be careful with security.

Better: build a protected admin command or observability query bundle.

Example model:

package com.example.caseapi.diagnostics;

import java.time.Instant;
import java.util.List;
import java.util.Map;

public record IncidentContextBundle(
        String service,
        String environment,
        String region,
        String version,
        Instant generatedAt,
        Map<String, String> featureFlags,
        List<String> recentDeployments,
        List<String> topErrorClasses,
        Map<String, Double> dependencyP95Millis,
        Map<String, Long> queueOldestAgeSeconds
) {}

The goal is not to dump secrets or sensitive data.

The goal is to shorten the first 10 minutes of diagnosis.

23. Runbook Drift

Runbook drift happens when:

service changed but runbook did not
metric names changed
dashboard moved
owner changed
mitigation no longer safe
dependency topology changed
feature flag removed
alert threshold changed
deployment process changed

Drift turns a runbook into false confidence.

Prevention:

runbook review as part of service readiness
link runbook to service catalog
test runbook in GameDay
require runbook update after incident
validate links automatically
store metric/query snippets as code where possible
include runbook in architecture review

24. Runbook Versioning

A runbook should be versioned with its service or with the service catalog.

Useful metadata:

runbook:
  id: case-api-high-5xx
  service: case-api
  owner: case-platform
  version: 2026.07.05
  appliesTo:
    environments: [prod, staging]
    regions: [ap-southeast-1, ap-southeast-3]
  alerts:
    - CaseApiHighErrorRate
    - CaseSubmitSloBurn
  lastReviewed: 2026-07-05
  reviewers:
    - case-platform-tech-lead
    - sre-oncall-representative
  maturity: 3

Runbook metadata makes it discoverable and auditable.
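Metadata like `lastReviewed` and `owner` also enables an automated staleness check. The sketch below assumes a maximum review age chosen by the team; `RunbookMeta`, `RunbookFreshness`, and the 90-day figure in the usage note are illustrative, not a standard.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical subset of the runbook metadata shown above.
record RunbookMeta(String id, String owner, LocalDate lastReviewed) {}

final class RunbookFreshness {

    // A runbook is stale when it has no owner or was reviewed more than maxAgeDays ago.
    static boolean isStale(RunbookMeta meta, LocalDate today, long maxAgeDays) {
        if (meta.owner() == null || meta.owner().isBlank()) {
            return true;
        }
        return ChronoUnit.DAYS.between(meta.lastReviewed(), today) > maxAgeDays;
    }
}
```

With a 90-day limit, a runbook last reviewed on 2026-07-05 is still fresh on 2026-08-01 and stale on 2026-10-05.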

25. Runbook for Degraded Mode

Degraded mode must be designed before the incident.

Example:

# Runbook: Enable case dashboard degraded mode

## Purpose
Use when case dashboard read model or enrichment dependency is slow/unavailable, but core case command path remains healthy.

## Effect
- hides noncritical enrichment panels
- serves stale summary up to 15 minutes
- disables expensive filter facets
- keeps submit/update commands enabled

## Preconditions
- command API error rate below threshold
- audit read model available
- stale data warning banner enabled

## Command
opsctl feature set case-dashboard degraded-mode=true --ttl 2h --reason INC-xxxx

## Validation
- dashboard p95 below 1.5s
- no increase in command API errors
- user banner visible
- freshness metric clearly displayed

## Exit Criteria
- enrichment dependency recovered for 30 minutes
- projection freshness under 2 minutes
- no active burn-rate alert

Important: degraded mode must have a TTL or an explicit owner. Otherwise a temporary mitigation becomes permanent architecture.
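One way to enforce the TTL is to make it part of the flag itself, so a degraded-mode flag cannot exist without an expiry and a reason. This is a minimal sketch; `DegradedModeFlag` is a hypothetical type and does not describe how `opsctl` or any real flag system works.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Objects;

// Hypothetical flag value that cannot be created without a reason and a positive TTL.
record DegradedModeFlag(String name, String reason, Instant enabledAt, Duration ttl) {

    DegradedModeFlag {
        Objects.requireNonNull(enabledAt, "enabledAt");
        if (reason == null || reason.isBlank()) {
            throw new IllegalArgumentException("Degraded mode requires a reason");
        }
        if (ttl == null || ttl.isZero() || ttl.isNegative()) {
            throw new IllegalArgumentException("Degraded mode requires a positive TTL");
        }
    }

    Instant expiresAt() {
        return enabledAt.plus(ttl);
    }

    // The flag turns itself off once the TTL has elapsed.
    boolean isActive(Instant now) {
        return now.isBefore(expiresAt());
    }
}
```

Callers evaluate `isActive(now)` on each read, so an expired flag stops applying even if nobody remembers to remove it.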

26. Runbook for Data Repair

Data repair is high risk.

Never treat a manual DB update as a normal runbook step unless the organization has a mature approval and audit process.

A safer pattern:

identify inconsistent records
classify impact
generate repair proposal
dry-run validation
approval
execute through domain command
emit audit event
verify read model and downstream effects

Example repair proposal:

package com.example.caseops.repair;

import java.util.List;
import java.util.UUID;

public record RepairProposal(
        UUID proposalId,
        String incidentId,
        String repairType,
        List<String> affectedCaseIds,
        String reason,
        boolean requiresApproval,
        List<String> expectedDomainEvents
) {}

Data repair should preserve domain semantics. Direct SQL may fix one table while breaking invariants, projections, audit, and downstream read models.

27. Runbook Testing

A runbook that has never been used or tested is a hypothesis.

Ways to test:

tabletop exercise
GameDay
staging fault injection
synthetic alert drill
broken dashboard link check
dry-run operational command
new hire on-call simulation
post-incident replay

Runbook test questions:

Could someone unfamiliar with the service follow this?
Are dashboard links valid?
Are metric names still valid?
Are commands safe to copy?
Are unsafe actions explicit?
Are mitigation steps reversible?
Are escalation contacts current?
Does recovery validation include user-facing symptom?

28. Anti-Patterns

28.1 “Check the logs” Runbook

This is not a runbook. It is an admission that diagnosis knowledge is not encoded.

Better:

exact log query
expected fields
exception classes to group by
correlation ID usage
next decision based on result

28.2 Restart-First Culture

A restart can be useful, but as a reflex it hides the root cause and can worsen load.

Before restart:

is there a memory leak?
is queue backlog safe?
will restart trigger stampede?
are readiness/draining configured?
will in-flight commands duplicate?
is evidence captured?

28.3 Runbook Without Owner

No owner means no maintenance.

Every runbook needs an owner and a review date.

28.4 Dashboard-Only Runbook

Dashboard shows state. Runbook must explain action.

28.5 Unsafe Copy-Paste Command

Commands should include environment guard, dry-run, target scope, and reason.

Bad:

kubectl delete pod -l app=case-api

Better:

opsctl rollout restart case-api \
  --env prod \
  --region ap-southeast-1 \
  --max-unavailable 1 \
  --reason INC-2026-1234 \
  --dry-run

29. Architecture Review Questions

When reviewing a new Java microservice, ask:

What user journey alerts page humans?
Does each page alert link to a runbook?
Does the runbook include fast mitigation?
Are unsafe actions explicitly listed?
Does the service expose enough telemetry for the runbook?
Are operational commands audited?
Are data repair steps domain-safe?
Is degraded mode designed?
Are known bad states documented?
Is escalation by ownership boundary clear?
Can a new on-call follow this at 03:00?
Was runbook tested?

If the answer is mostly no, the service is not operationally ready.

30. Minimal Runbook Template for This Series

Use this as a baseline for each service.

# Runbook: <Service> / <Symptom>

## Summary
- Service:
- Owner:
- Environment:
- Alert:
- SLO/User journey:

## Immediate Safety Notes
Do not:
- ...

Safe first actions:
- ...

## Confirm Impact
1. ...
2. ...

## Fast Mitigation
If deployment-related:
- ...

If dependency-related:
- ...

If saturation-related:
- ...

## Diagnosis Tree
```mermaid
flowchart TD
  A["Alert"] --> B{"Deployment?"}
  B -- Yes --> C["Rollback/route away if safe"]
  B -- No --> D{"Dependency?"}
  D -- Yes --> E["Degrade/fail fast/escalate"]
  D -- No --> F{"Saturation?"}
  F -- Yes --> G["Shed/load limit/isolate"]
  F -- No --> H["Deep investigation"]
```

## Known Bad States
- ...

## Escalation
- ...

## Recovery Validation
- ...

## Evidence to Capture
- ...

## Follow-up
- ...

31. Final Mental Model

A runbook is socio-operational code.

It does not execute requests; it executes the organization when the system fails.

Top-level engineers do not only design services that work on the happy path. They design services that can:

fail clearly
be diagnosed quickly
be mitigated safely
be escalated correctly
be recovered measurably
be learned from after the incident

Without a runbook, observability stops at data.

With a runbook, observability becomes action.

32. Practical Exercise

Take one service you own, or imagine case-api.

Write a runbook for one of these alerts:

high 5xx
p99 latency high
DB pool saturation
Kafka lag increasing
workflow stuck
outbox pending age high

It must cover:

confirm impact
immediate safety notes
fast mitigation
diagnosis tree
known bad states
escalation path
recovery validation
evidence capture

Then ask:

Could a new engineer follow this without asking the service's author?

If not, the runbook is not good enough yet.
