title: Build From Scratch: Enterprise Java Microservices CPQ & Order Management Platform - Part 059
description: Runbook operations and production support for enterprise CPQ/OMS: stuck quote, stuck order, Camunda incident, Kafka lag, database lock, duplicate event, failed integration, failed compensation, manual repair, reconciliation, and the incident learning loop.
series: learn-enterprise-cpq-oms-glassfish-camunda8
seriesTitle: Build From Scratch: Enterprise Java Microservices CPQ & Order Management Platform
order: 59
partTitle: Runbook, Operations, and Production Support
tags:

java
microservices
cpq
oms
runbook
incident-response
production-support
camunda-8
kafka
postgresql
redis
glassfish
observability
operations

date: 2026-07-02

Runbook, Operations, and Production Support

An enterprise CPQ/OMS system is not finished when every test is green and the deployment succeeds. It is mature only when the team can answer these questions at 02:00 while a high-value order is stuck:

What is happening?
Which customers are affected?
Are price, quote, order, approval, fulfillment, billing trigger, and asset still consistent?
Is it safe to retry?
Is it safe to repair manually?
Does the repair leave evidence that can be audited?
Can the cause be prevented in the next release?

Production support is not reactive work. Production support is part of system design.

In this part we build runbooks for the CPQ/OMS platform designed throughout the series: JAX-RS/Jersey/GlassFish API, PostgreSQL/MyBatis source of truth, Camunda 8/Zeebe orchestration, Kafka event backbone, Redis acceleration layer, outbox/inbox, external adapters, audit, observability, CI/CD, and release safety.

1. Mental Model: Operations Is a Control System

Do not treat operations as a collection of manual procedures. Treat it as a control system.

A good operations loop has these properties:

detectable: every failure has a signal.
classifiable: the team can distinguish a technical incident, business fallout, data drift, and an expected pending state.
bounded: the blast radius can be calculated.
recoverable: there is a safe retry, replay, resume, compensation, or repair path.
auditable: every support action has evidence.
learnable: every incident produces an improvement to design, tests, alerts, or runbooks.

If any one of these is missing, the system may run, but it is not yet production-grade.

2. The Four Kinds of Operational Problems

In CPQ/OMS, nearly every production problem falls into one of four categories.

Category	Meaning	Example	Primary Tool
Technical failure	A technical component fails	Kafka lag, DB lock, Redis down, GlassFish unhealthy	Observability + infra runbook
Process failure	The workflow cannot advance	Camunda incident, worker exhausted retries	Operate + workflow ref table
Business fallout	The process runs but a business condition fails	Inventory unavailable, provisioning rejected, price approval expired	Fallout queue + manual repair
Data consistency drift	State across stores is out of sync	Order completed but asset not yet active	Reconciliation + repair command

Common mistake: everything is treated as a “technical error”. As a result, the team keeps retrying even though the domain state legitimately refuses to proceed.

Examples:

A Provisioning API 500 may be transient and retryable.
A Provisioning API 409 "service already exists" may be an ambiguous outcome.
A Provisioning API 422 "invalid service profile" is not a retry problem; it is fallout or a decomposition bug.
A Camunda incident is not always a Camunda problem; often the worker code failed and exhausted all retries.
Kafka lag does not always mean Kafka is slow; the consumer may be blocked by an external dependency.

3. Operational Source of Truth

Production support must know which store to read for each question.

Question	Source of Truth	Supporting Store
Is the quote valid?	PostgreSQL quote aggregate	API read model
Has the order been committed?	PostgreSQL order aggregate	Kafka events
Is the workflow running?	Camunda/Zeebe + workflow reference table	Operate UI
Is the fulfillment task finished?	PostgreSQL fulfillment task table	Camunda job state
Has the event been published?	PostgreSQL outbox + Kafka topic	producer metric
Has the consumer processed the event?	PostgreSQL inbox/projection state	Kafka consumer offset
Is the cache correct?	PostgreSQL + catalog/config/pricing version	Redis key
Was the external call ever made?	`external_call_attempt` table	external system log
Did support change data?	audit log + repair command table	ticket/change record

Important rule:

Redis is not a source of truth. Kafka is not the source of truth for current state. Camunda is not the source of truth for business aggregates. PostgreSQL is not the only observability source. Each has its own role.

4. Standard Runbook Format

Every runbook must follow a consistent format. Do not write a runbook as a long narrative. Operators need a procedure they can execute under pressure.

Template:

# Runbook: <Scenario>

## Symptom
Which signals are visible.

## Severity Decision
How to decide between P0/P1/P2/P3.

## First 5 Minutes
Initial containment steps.

## Diagnosis
Query, dashboard, log, trace, workflow, Kafka, DB, Redis checks.

## Safe Actions
Retry, replay, resume, cache invalidate, repair command.

## Unsafe Actions
What must not be done.

## Verification
How to prove the state has recovered.

## Escalation
When to escalate and to whom.

## Evidence
What must be recorded for audit/postmortem.

## Prevention
Test, alert, guardrail, schema, or design improvement.

This format matters because an incident is a bad time to design a process.

5. Severity Model for CPQ/OMS

Severity must not be based on error rate alone. In CPQ/OMS, a single high-value enterprise order can matter more than a thousand small quotes.

Severity	Definition	Example	Response
P0	Widespread business stop or legal/commercial risk	No orders can be submitted; pricing is wrong at scale	Incident commander, freeze release, executive comms
P1	Major customer or high-value flow blocked	Key account order stuck before activation	Immediate support + engineering
P2	Degraded capability with workaround	Approval reminder delayed, projection stale	Same-day fix/recovery
P3	Low impact operational issue	One dashboard metric missing	Backlog/support queue

Add business dimensions:

revenue impact
number of affected tenants
number of affected customers
quote/order value
regulatory/audit risk
SLA breach risk
irreversible external action risk
billing/customer-notification risk

Severity must be computable from telemetry and business metadata, not from gut feeling.
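A minimal sketch of how severity could be derived from business metadata instead of error rate alone. The `IncidentImpact` fields, the `SeverityCalculator` name, and both thresholds are illustrative assumptions, not values defined in this series.

```java
import java.math.BigDecimal;

// Hypothetical impact snapshot assembled from telemetry and business metadata.
record IncidentImpact(
    int affectedTenants,
    int affectedOrders,
    BigDecimal maxOrderValue,
    boolean workaroundAvailable,
    boolean irreversibleExternalActionRisk,
    boolean regulatoryRisk) {}

enum Severity { P0, P1, P2, P3 }

final class SeverityCalculator {
  // Assumed thresholds; real values belong in tenant or contract configuration.
  static final int WIDESPREAD_TENANTS = 5;
  static final BigDecimal HIGH_VALUE_ORDER = new BigDecimal("100000");

  static Severity classify(IncidentImpact impact) {
    // Widespread business stop or legal/commercial risk.
    if (impact.affectedTenants() >= WIDESPREAD_TENANTS || impact.regulatoryRisk()) {
      return Severity.P0;
    }
    // One high-value order or an irreversible side effect outranks raw volume.
    if (impact.maxOrderValue().compareTo(HIGH_VALUE_ORDER) >= 0
        || impact.irreversibleExternalActionRisk()) {
      return Severity.P1;
    }
    // Orders are affected: a workaround keeps it at P2, otherwise treat it as P1.
    if (impact.affectedOrders() > 0) {
      return impact.workaroundAvailable() ? Severity.P2 : Severity.P1;
    }
    // No order impact, for example a missing dashboard metric.
    return Severity.P3;
  }
}
```

With these assumed thresholds, one stuck order worth 250000 is classified higher than a thousand delayed low-value orders that have a workaround.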

6. First 5 Minutes Checklist

When an alert arrives:

Open an incident channel.
Assign an incident commander for P0/P1.
Freeze deployments for the affected components.
Identify the tenant/customer/order/quote scope.
Review the health dashboards: API, DB, Kafka, Camunda, Redis, external systems.
Check whether the issue first appeared after the most recent deployment.
Decide whether containment is needed: disable a feature flag, pause the relay, pause a worker, disable an external call, or hold order progression.
Do not repair data before the root state is known.

Containment examples:

pause outbox relay for a poison topic
disable quote-to-order conversion for one tenant
pause provisioning worker for adapter bug
set order intake to degraded mode
bypass cache and force source-of-truth reads
stop auto-retry for non-idempotent external calls

Containment is not a fix. It is a way to stop bleeding.

7. Core Dashboards

Minimum dashboard set:

7.1 Business Transaction Dashboard

Shows:

quote created/submitted/approved/accepted/converted per hour
order submitted/validated/decomposed/completed/failed per hour
fallout count by category
stuck quote count by state and age
stuck order count by state and age
order value impacted
tenant impact

7.2 Runtime Dashboard

Shows:

GlassFish request rate, latency, error rate
PostgreSQL connection pool, locks, slow queries, deadlocks
Kafka producer error, consumer lag, DLQ rate
Camunda job activation/completion/failure, incident count
Redis hit rate, latency, eviction, memory
external adapter latency/error/retry

7.3 Consistency Dashboard

Shows:

outbox backlog by age
inbox failure by consumer
projection drift count
workflow reference without process instance
process instance without valid order state
order completed but asset not activated
fulfillment task completed but order item not advanced
duplicate command rejected count

Dashboards are not decoration. Dashboards are runbook input.

8. Operational Tables You Should Have

Production support depends heavily on operational tables that are designed deliberately.

-- Simplified operational repair command table
create table support_repair_command (
  repair_command_id uuid primary key,
  tenant_id uuid not null,
  target_type text not null,
  target_id uuid not null,
  command_type text not null,
  requested_by text not null,
  approved_by text,
  reason text not null,
  precondition_json jsonb not null,
  execution_status text not null,
  execution_result_json jsonb,
  created_at timestamptz not null,
  executed_at timestamptz
);

-- Simplified reconciliation finding table
create table reconciliation_finding (
  finding_id uuid primary key,
  tenant_id uuid not null,
  finding_type text not null,
  severity text not null,
  entity_type text not null,
  entity_id uuid not null,
  expected_state_json jsonb not null,
  actual_state_json jsonb not null,
  status text not null,
  detected_at timestamptz not null,
  resolved_at timestamptz
);

-- Simplified external call attempt table
create table external_call_attempt (
  attempt_id uuid primary key,
  tenant_id uuid not null,
  adapter_name text not null,
  business_entity_type text not null,
  business_entity_id uuid not null,
  idempotency_key text not null,
  request_hash text not null,
  status text not null,
  retry_count int not null default 0, -- read by the failed-integration diagnosis query
  http_status int,
  external_reference text,
  error_code text,
  error_message text,
  created_at timestamptz not null,
  completed_at timestamptz,
  unique (tenant_id, adapter_name, idempotency_key)
);

These tables are not afterthoughts. They are part of the system design.

9. Runbook: Stuck Quote

Symptom

A quote remains in one of these states longer than expected:

DRAFT
CONFIGURED
PRICED
VALIDATED
APPROVAL_PENDING
ACCEPTED
CONVERTING_TO_ORDER

Severity Decision

Escalate severity if:

quote is close to expiration
quote value exceeds threshold
quote belongs to strategic customer
many quotes stuck in same state
approval or pricing rules recently changed
quote has already been accepted by customer

Diagnosis Query

select
  q.quote_id,
  q.quote_number,
  q.tenant_id,
  q.customer_id,
  q.state,
  q.revision,
  q.total_amount,
  q.currency,
  q.updated_at,
  now() - q.updated_at as age
from quote q
where q.state in ('APPROVAL_PENDING', 'CONVERTING_TO_ORDER')
  and q.updated_at < now() - interval '30 minutes'
order by q.updated_at asc;

Check Approval Case

select
  ac.approval_case_id,
  ac.quote_id,
  ac.status,
  ac.policy_version,
  ac.current_step,
  ac.created_at,
  ac.updated_at
from approval_case ac
where ac.quote_id = :quote_id;

Check whether:

approval case exists
approval case references correct quote revision
approver assignment exists
Camunda process instance exists
timer/escalation worker is healthy
quote was revised after approval started

Check Conversion

select
  idempotency_key,
  request_hash,
  command_type,
  status,
  response_json,
  created_at,
  completed_at
from idempotency_record
where tenant_id = :tenant_id
  and target_type = 'QUOTE'
  and target_id = :quote_id
order by created_at desc;

Then check order uniqueness:

select order_id, order_number, state, source_quote_id, source_quote_revision
from customer_order
where source_quote_id = :quote_id
order by created_at desc;

Safe Actions

Allowed:

resume approval reminder job if only reminder failed
retry failed approval workflow start if no process instance exists and quote revision still matches
replay quote accepted event if conversion command already succeeded but event projection failed
run idempotent convert quote command again with same idempotency key
cancel approval case only through domain command if quote revision invalidated it

Not allowed:

manually set quote state from APPROVAL_PENDING to APPROVED
manually insert order without conversion command
update quote total amount after approval without creating revision
bypass approval because “approver said yes in chat” unless evidence is attached through approved repair command

Verification

A quote is recovered only if:

quote state is valid
quote revision is consistent
approval evidence exists when required
order exists if quote was converted
quote-to-order conversion is idempotent
audit log explains state transition
customer-facing projection matches source state

10. Runbook: Stuck Order

Symptom

An order stays too long in:

SUBMITTED
VALIDATED
DECOMPOSED
IN_PROGRESS
PARTIAL
CANCELLING
COMPENSATING
FALLOUT

First Diagnosis

select
  o.order_id,
  o.order_number,
  o.tenant_id,
  o.customer_id,
  o.state,
  o.source_quote_id,
  o.created_at,
  o.updated_at,
  now() - o.updated_at as age
from customer_order o
where o.order_id = :order_id;

Check item distribution:

select state, count(*)
from order_item
where order_id = :order_id
group by state
order by state;

Check fulfillment task distribution:

select task_type, status, count(*)
from fulfillment_task
where order_id = :order_id
group by task_type, status
order by task_type, status;

Interpret the Result

Observation	Likely Cause
Order `VALIDATED`, no fulfillment plan	decomposition failed or not started
Plan exists, no Camunda instance	workflow start failed
Camunda active, task pending	worker unavailable or external dependency slow
Task failed with retry left	automated retry in progress
Task failed with no retry	incident/fallout likely required
Task completed but order item not advanced	worker transaction or projection failure
Order complete but asset absent	post-fulfillment asset update drift
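The table above can be sketched as a first-pass diagnosis function. The `OrderSnapshot` record and its field names are hypothetical; they stand for values an operator would read from the three diagnosis queries and the workflow reference. The check order is also an assumption.

```java
// Hypothetical snapshot built from the order, order_item, and fulfillment_task queries.
record OrderSnapshot(
    String orderState,
    boolean fulfillmentPlanExists,
    boolean processInstanceExists,
    int failedTasksWithRetriesLeft,
    int failedTasksWithoutRetries,
    int completedTasksWithUnadvancedItems,
    int pendingTasks,
    boolean assetActive) {}

final class StuckOrderDiagnosis {
  // Returns the "Likely Cause" cell for the first matching observation.
  static String likelyCause(OrderSnapshot s) {
    if ("VALIDATED".equals(s.orderState()) && !s.fulfillmentPlanExists()) {
      return "decomposition failed or not started";
    }
    if (s.fulfillmentPlanExists() && !s.processInstanceExists()) {
      return "workflow start failed";
    }
    if (s.failedTasksWithoutRetries() > 0) {
      return "incident/fallout likely required";
    }
    if (s.failedTasksWithRetriesLeft() > 0) {
      return "automated retry in progress";
    }
    if (s.completedTasksWithUnadvancedItems() > 0) {
      return "worker transaction or projection failure";
    }
    if (s.pendingTasks() > 0) {
      return "worker unavailable or external dependency slow";
    }
    if ("COMPLETED".equals(s.orderState()) && !s.assetActive()) {
      return "post-fulfillment asset update drift";
    }
    // Nothing matched: do not guess, hand over to engineering.
    return "no match; escalate to engineering";
  }
}
```

The function only suggests where to look next. It does not replace checking the workflow reference and the evidence behind each task.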

Check Workflow Reference

select
  workflow_ref_id,
  business_entity_type,
  business_entity_id,
  process_definition_id,
  process_instance_key,
  process_version,
  status,
  started_at,
  completed_at
from workflow_reference
where business_entity_type = 'ORDER'
  and business_entity_id = :order_id;

Safe Actions

Allowed:

retry decomposition if no fulfillment plan exists and order version unchanged
start workflow if durable workflow start request exists but no process instance exists
retry a failed fulfillment task if adapter call is idempotent
move task to manual fallout through command when business error is non-retryable
run reconciliation for order after workflow completion
resume order aggregation after child tasks are valid

Not allowed:

set order state to COMPLETED manually
mark fulfillment task completed without evidence
skip failed provisioning task and activate asset anyway
delete Camunda process instance to “unstick” order
update asset before order item completion evidence exists

Verification

Recovered order must have:

valid order state
valid order item states
valid fulfillment task states
no unresolved fallout for blocking tasks
workflow reference aligned with process status
outbox events published
asset/subscription state aligned if order completed
audit evidence for support action

11. Runbook: Camunda 8 Incident

Symptom

Operate shows an incident, or metrics show growing incident count.

A Camunda incident means process execution is stuck and requires intervention. In our architecture, it does not automatically mean business fallout, so classify it first.

Diagnosis

Collect:

process instance key
BPMN process ID
process version
element ID
job type
error message
retries exhausted
variables relevant to business entity
linked workflow_reference

select *
from workflow_reference
where process_instance_key = :process_instance_key;

Then locate domain entity:

select order_id, state, version, updated_at
from customer_order
where order_id = :business_entity_id;

Incident Classification

Cause	Classification	Action
transient external timeout	technical retry	reset retry if idempotent
validation failure from domain service	business fallout	create fallout case
worker bug	code defect	deploy fix then resolve incident
missing variable	workflow contract bug	migrate/recreate/rescue carefully
external already executed but response lost	ambiguous outcome	reconcile external system before retry
task no longer valid due to cancellation	process/domain mismatch	controlled compensation/cancel path

Safe Resolution Pattern

Verify business entity state.
Verify whether worker side effect already happened.
Verify idempotency key and external call attempt.
Decide retry, BPMN error, fallout, compensation, or migration.
Record support action.
Resolve incident only after state is safe.

Unsafe Actions

resolving incident without understanding side effect
increasing retries blindly for non-idempotent call
editing variables to arbitrary values
moving token forward when domain state says task failed
resolving process while order remains inconsistent

12. Runbook: Kafka Consumer Lag

Symptom

Consumer lag grows for one or more consumer groups.

Kafka lag diagnosis must separate:

producer surge
consumer slow processing
consumer crash loop
poison event
downstream dependency slow
rebalance instability
database contention
schema incompatibility

First Queries

Operational table:

select consumer_name, status, count(*), max(updated_at) as last_update
from inbox_message
where created_at > now() - interval '6 hours'
group by consumer_name, status
order by consumer_name, status;

Failed events:

select
  consumer_name,
  topic,
  partition_no,
  offset_no,
  event_type,
  error_code,
  retry_count,
  updated_at
from inbox_message
where status in ('FAILED', 'RETRYING')
order by updated_at desc
limit 100;

Decision Tree
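One way to order the causes listed under Symptom into a first-pass decision tree is sketched below. The `LagSignals` fields, the order of the checks, and the 0.9 pool-utilization threshold are assumptions for illustration; tune them against your own dashboards.

```java
// Hypothetical signals gathered from consumer logs, inbox_message, and runtime dashboards.
record LagSignals(
    boolean consumerCrashLooping,
    boolean sameOffsetFailingRepeatedly,
    boolean deserializationErrors,
    boolean frequentRebalances,
    double dbPoolUtilization,
    boolean downstreamLatencyHigh,
    boolean producerRateSpiking) {}

enum LagCause {
  SCHEMA_INCOMPATIBILITY,
  POISON_EVENT,
  CONSUMER_CRASH_LOOP,
  REBALANCE_INSTABILITY,
  DATABASE_CONTENTION,
  DOWNSTREAM_DEPENDENCY_SLOW,
  PRODUCER_SURGE,
  CONSUMER_SLOW_PROCESSING
}

final class LagDecisionTree {
  // Most specific signals are checked first; slow processing is the fallback.
  static LagCause classify(LagSignals s) {
    if (s.deserializationErrors()) return LagCause.SCHEMA_INCOMPATIBILITY;
    if (s.sameOffsetFailingRepeatedly()) return LagCause.POISON_EVENT;
    if (s.consumerCrashLooping()) return LagCause.CONSUMER_CRASH_LOOP;
    if (s.frequentRebalances()) return LagCause.REBALANCE_INSTABILITY;
    if (s.dbPoolUtilization() >= 0.9) return LagCause.DATABASE_CONTENTION;
    if (s.downstreamLatencyHigh()) return LagCause.DOWNSTREAM_DEPENDENCY_SLOW;
    if (s.producerRateSpiking()) return LagCause.PRODUCER_SURGE;
    return LagCause.CONSUMER_SLOW_PROCESSING;
  }
}
```

A poison event is checked before a crash loop because a single failing offset often causes the crash loop in the first place.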

Safe Actions

Allowed:

scale consumers up to partition limit
pause consumer for poison event containment
move poison event to DLQ after evidence and approval
fix consumer code and replay from inbox
replay projection from event log when consumer is idempotent
increase DB pool only after confirming DB can handle it

Not allowed:

skip offsets without storing skipped event evidence
delete inbox rows to reduce error count
reset consumer offset on production topic without replay plan
replay non-idempotent consumer blindly
scale consumers when partition key requires strict ordering per aggregate and consumer code is not safe

Verification

lag decreases
consumer error rate normalizes
inbox failed count stable or resolved
projection drift does not increase
order/quote timeline shows missing events recovered

13. Runbook: Outbox Backlog

Symptom

Outbox rows accumulate and events are not reaching Kafka.

Diagnosis Query

select
  status,
  event_type,
  count(*) as count,
  min(created_at) as oldest,
  max(created_at) as newest
from outbox_event
group by status, event_type
order by oldest asc;

Check locked/publishing rows:

select outbox_event_id, event_type, status, locked_by, locked_until, retry_count, updated_at
from outbox_event
where status in ('PUBLISHING', 'FAILED')
order by updated_at asc
limit 100;

Common Causes

Cause	Signal	Action
Kafka unavailable	producer failures	wait/retry, verify broker
schema incompatible	serialization failure	block release, fix schema
relay crashed	no lock updates	restart relay
poison event	same event fails repeatedly	quarantine event
DB lock	relay cannot claim rows	inspect lock/query
topic missing	unknown topic	provision topic-as-code

Safe Actions

Allowed:

restart relay if stateless and idempotent
release stale locks after confirming relay is dead
quarantine poison event with support command
replay outbox row after code/schema fix
pause producers for growing backlog during P0

Not allowed:

mark outbox PUBLISHED without Kafka evidence
delete outbox rows without event loss approval
change event payload manually without preserving evidence
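The rule "release stale locks after confirming relay is dead" can be sketched as a small guard. `OutboxLock`, `StaleLockPolicy`, and the two-minute grace period are hypothetical; only the `locked_by` and `locked_until` columns come from the query above.

```java
import java.time.Duration;
import java.time.Instant;

// Mirrors the locked_by and locked_until columns of outbox_event.
record OutboxLock(String lockedBy, Instant lockedUntil) {}

final class StaleLockPolicy {
  // Assumed grace period that absorbs clock skew between relay instances.
  static final Duration GRACE = Duration.ofMinutes(2);

  // A lock may be released only when it has expired past the grace period
  // AND an operator or health check has confirmed the owning relay is dead.
  static boolean mayRelease(OutboxLock lock, Instant now, boolean relayConfirmedDead) {
    boolean expired = lock.lockedUntil().plus(GRACE).isBefore(now);
    return expired && relayConfirmedDead;
  }
}
```

An expired lock alone is not enough. A relay that is slow but alive could still publish the row, and releasing its lock would produce a duplicate event.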

14. Runbook: Database Lock or Slow Query

Symptom

API latency spikes, command timeouts, worker timeouts, or PostgreSQL lock waits grow.

Diagnosis

Find active sessions:

select
  pid,
  state,
  wait_event_type,
  wait_event,
  now() - query_start as age,
  left(query, 200) as query
from pg_stat_activity
where state <> 'idle'
order by age desc;

Find blockers:

select
  blocked_locks.pid as blocked_pid,
  blocked_activity.query as blocked_query,
  blocking_locks.pid as blocking_pid,
  blocking_activity.query as blocking_query
from pg_catalog.pg_locks blocked_locks
join pg_catalog.pg_stat_activity blocked_activity
  on blocked_activity.pid = blocked_locks.pid
join pg_catalog.pg_locks blocking_locks
  on blocking_locks.locktype = blocked_locks.locktype
 and blocking_locks.database is not distinct from blocked_locks.database
 and blocking_locks.relation is not distinct from blocked_locks.relation
 and blocking_locks.page is not distinct from blocked_locks.page
 and blocking_locks.tuple is not distinct from blocked_locks.tuple
 and blocking_locks.virtualxid is not distinct from blocked_locks.virtualxid
 and blocking_locks.transactionid is not distinct from blocked_locks.transactionid
 and blocking_locks.classid is not distinct from blocked_locks.classid
 and blocking_locks.objid is not distinct from blocked_locks.objid
 and blocking_locks.objsubid is not distinct from blocked_locks.objsubid
 and blocking_locks.pid != blocked_locks.pid
join pg_catalog.pg_stat_activity blocking_activity
  on blocking_activity.pid = blocking_locks.pid
where not blocked_locks.granted;

Operational Decision

Situation	Action
Long read query blocks migration	cancel query if safe and approved
Migration blocks order writes	rollback/stop migration if possible
Command transaction hangs on external call	code bug; external call must not be inside DB transaction
Hot aggregate receives concurrent writes	inspect idempotency and optimistic locking
Missing index causes scan	emergency index only through controlled migration

Unsafe Actions

kill random DB sessions without knowing transaction effect
add index manually outside migration process
raise connection pool blindly
disable constraints to let commands pass
run repair update outside domain command

15. Runbook: Duplicate Event or Duplicate Command

Symptom

same order event processed twice
duplicate order created from same quote
external provisioning called twice
duplicate notification sent
idempotency conflict returned to client

First Check

select *
from idempotency_record
where tenant_id = :tenant_id
  and idempotency_key = :idempotency_key;

Check duplicate business uniqueness:

select source_quote_id, source_quote_revision, count(*)
from customer_order
where source_quote_id = :quote_id
group by source_quote_id, source_quote_revision;

Check external call attempts:

select adapter_name, idempotency_key, request_hash, status, external_reference, count(*)
from external_call_attempt
where business_entity_id = :entity_id
group by adapter_name, idempotency_key, request_hash, status, external_reference;

Diagnosis

Duplicate can come from:

client retry without idempotency key
same idempotency key with different request hash
DB unique constraint missing
outbox relay publishes twice after unknown outcome
Kafka consumer retry after crash
Camunda worker duplicate job execution
external API timeout after success

Safe Actions

if duplicate command rejected by idempotency table, no repair needed
if duplicate event processed but consumer idempotent, verify no side effect duplicated
if duplicate external side effect happened, reconcile external system and create compensation/fallout
if duplicate order created, do not delete; create cancellation/void path with audit

Unsafe Actions

deleting duplicate order row
manually merging two order rows
ignoring duplicate if external side effect exists
changing idempotency record hash to force pass
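A minimal sketch of the idempotency decision behind these rules: same key and same request hash replays the stored response, same key and a different hash is a conflict. The in-memory map stands in for the `idempotency_record` table, and `IdempotencyGuard` and `Outcome` are hypothetical names. In production the database unique constraint, not application memory, must be the final guard.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Subset of an idempotency_record row that matters for the decision.
record IdempotencyRecord(String requestHash, String responseJson) {}

enum Outcome { EXECUTE, REPLAY_STORED_RESPONSE, CONFLICT }

final class IdempotencyGuard {
  // In-memory stand-in for the idempotency_record table, keyed by tenant and key.
  private final Map<String, IdempotencyRecord> store = new ConcurrentHashMap<>();

  Outcome check(String tenantId, String idempotencyKey, String requestHash) {
    IdempotencyRecord existing = store.get(tenantId + ":" + idempotencyKey);
    if (existing == null) {
      return Outcome.EXECUTE; // first time this key is seen
    }
    // Same key with the same payload is a safe retry; a different payload is a conflict.
    return existing.requestHash().equals(requestHash)
        ? Outcome.REPLAY_STORED_RESPONSE
        : Outcome.CONFLICT;
  }

  void recordSuccess(String tenantId, String idempotencyKey, String requestHash, String responseJson) {
    // putIfAbsent keeps the first recorded result, so the stored hash is never overwritten.
    store.putIfAbsent(tenantId + ":" + idempotencyKey, new IdempotencyRecord(requestHash, responseJson));
  }
}
```

Because the stored hash is never rewritten, a `CONFLICT` stays visible as evidence instead of being forced through.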

16. Runbook: Failed External Integration

External integration failure is the most common source of OMS fallout.

Integration Types

Type	Example	Failure Mode
Sync request/response	credit check	timeout, 4xx, 5xx
Async callback	provisioning	callback lost, duplicate callback
Polling	shipment status	stale state, rate limit
Event-based	billing trigger	consumer lag, schema mismatch

Diagnosis

select
  attempt_id,
  adapter_name,
  business_entity_type,
  business_entity_id,
  idempotency_key,
  status,
  http_status,
  external_reference,
  error_code,
  retry_count,
  created_at,
  completed_at
from external_call_attempt
where business_entity_id = :entity_id
order by created_at desc;

Classify Error

Error	Meaning	Action
timeout no external reference	unknown outcome	reconcile before retry
429	rate limited	retry with backoff, throttle
500/503	transient maybe	retry if idempotent
400/422	request invalid	fallout/bug, do not retry blindly
409	conflict	inspect idempotency/external current state
duplicate callback	expected in async systems	dedupe by callback ID

Safe Action Pattern
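A hedged sketch of the pattern: classify the latest attempt using the error table above, and retry only where the classification allows it. `NextAction` and `ExternalErrorClassifier` are hypothetical names, and treating every unlisted status as an unknown outcome is an assumption chosen as the safest default.

```java
enum NextAction {
  RECONCILE_BEFORE_RETRY,
  RETRY_WITH_BACKOFF,
  RETRY_IF_IDEMPOTENT,
  FALLOUT_NO_RETRY,
  INSPECT_EXTERNAL_STATE
}

final class ExternalErrorClassifier {
  // httpStatus is null when the call timed out and no response was received.
  static NextAction classify(Integer httpStatus) {
    if (httpStatus == null) {
      return NextAction.RECONCILE_BEFORE_RETRY; // unknown outcome
    }
    int status = httpStatus;
    if (status == 429) return NextAction.RETRY_WITH_BACKOFF;      // rate limited
    if (status == 409) return NextAction.INSPECT_EXTERNAL_STATE;  // conflict
    if (status == 400 || status == 422) return NextAction.FALLOUT_NO_RETRY; // invalid request
    if (status == 500 || status == 503) return NextAction.RETRY_IF_IDEMPOTENT; // maybe transient
    // Assumption: anything not in the table is treated like an unknown outcome.
    return NextAction.RECONCILE_BEFORE_RETRY;
  }
}
```

`RETRY_IF_IDEMPOTENT` still requires the retry to reuse the original idempotency key recorded in `external_call_attempt`.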

Unsafe Actions

retry non-idempotent provisioning create call with new request ID
assume timeout means failure
assume HTTP 500 means no side effect
manually advance task without external reference

17. Runbook: Failed Compensation

Compensation failure is more dangerous than forward failure because a side effect may already exist.

Example

Order fulfillment did:

reserve resource: success
provision service: success
activate billing: failed
compensation starts
deprovision service: failed

Now the system cannot simply mark the order as failed. It must preserve the customer-impacting reality: the service is still provisioned.
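A small sketch of that rule: before closing the order, list every step whose forward action succeeded but whose compensation did not. The `Step` record, `StepStatus`, and `CompensationReview` are hypothetical names used only for illustration.

```java
import java.util.List;

enum StepStatus { SUCCESS, FAILED, NOT_RUN }

// One fulfillment step with the status of its forward action and its compensation.
record Step(String name, StepStatus forward, StepStatus compensation) {}

final class CompensationReview {
  // Steps whose side effect still exists in the external world.
  static List<String> residualSideEffects(List<Step> steps) {
    return steps.stream()
        .filter(s -> s.forward() == StepStatus.SUCCESS)
        .filter(s -> s.compensation() != StepStatus.SUCCESS)
        .map(Step::name)
        .toList();
  }

  // The order may be closed as failed only when nothing is left behind.
  static boolean mayCloseAsFailed(List<Step> steps) {
    return residualSideEffects(steps).isEmpty();
  }
}
```

For the example above, the provisioned service remains a residual side effect as long as deprovisioning has failed, so the order stays open until that is reconciled.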

Diagnosis

Check compensation plan:

select task_id, task_type, status, compensation_task_id, compensation_status
from fulfillment_task
where order_id = :order_id
order by sequence_no;

Check external references:

select adapter_name, external_reference, status, created_at, completed_at
from external_call_attempt
where business_entity_id in (
  select task_id from fulfillment_task where order_id = :order_id
)
order by created_at;

Safe Actions

retry compensation only if idempotent
reconcile external state before retrying ambiguous compensation
escalate to manual fallout if irreversible
freeze billing/customer notification until reality is known
record partial compensation evidence

Unsafe Actions

mark compensation success because order needs closing
delete external reference to hide partial side effect
retry deactivation with different idempotency key
complete cancellation while asset remains active

18. Runbook: Redis Cache Staleness or Corruption

Symptom

quote prices differ between users
configuration option availability inconsistent
catalog update not reflected
cache hit rate drops suddenly
stale catalog used after publish

Diagnosis

Check active catalog version in PostgreSQL:

select catalog_id, active_version, published_at
from catalog_publication
where tenant_id = :tenant_id;

Check Redis key version policy:

catalog:{tenantId}:activeVersion -> v42
catalog:{tenantId}:v42:offering:{offeringId}
pricing:{tenantId}:v42:priceList:{priceListId}
config-rule:{tenantId}:v42:offering:{offeringId}
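The key policy above can be sketched as a small key builder, so that no caller assembles cache keys by hand. `CacheKeys` and `isCurrent` are hypothetical names; only the key layout comes from the policy above.

```java
final class CacheKeys {
  // Pointer key that names the version currently served from cache.
  static String activeVersion(String tenantId) {
    return String.format("catalog:%s:activeVersion", tenantId);
  }

  static String offering(String tenantId, String version, String offeringId) {
    return String.format("catalog:%s:%s:offering:%s", tenantId, version, offeringId);
  }

  static String priceList(String tenantId, String version, String priceListId) {
    return String.format("pricing:%s:%s:priceList:%s", tenantId, version, priceListId);
  }

  static String configRule(String tenantId, String version, String offeringId) {
    return String.format("config-rule:%s:%s:offering:%s", tenantId, version, offeringId);
  }

  // The cache is trustworthy only when Redis and PostgreSQL agree on the active version.
  static boolean isCurrent(String redisActiveVersion, String postgresActiveVersion) {
    return postgresActiveVersion != null && postgresActiveVersion.equals(redisActiveVersion);
  }
}
```

Because the version is part of every key, publishing a new catalog version makes old entries unreachable without deleting all keys at once.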

Safe Actions

invalidate cache for tenant/version
switch service to source-of-truth read temporarily
rebuild cache from PostgreSQL
block quote submission if catalog version mismatch is detected

Unsafe Actions

change Redis value manually to “fix” product config
use stale cache to force quote conversion
delete all Redis keys during peak without stampede protection

19. Runbook: Projection Drift

Projection drift means read model says one thing while source aggregate says another.

Example

Order aggregate state is COMPLETED
Customer timeline projection still says IN_PROGRESS
Operational dashboard shows stuck order

Diagnosis

select order_id, state, updated_at
from customer_order
where order_id = :order_id;

select order_id, projected_state, last_event_id, updated_at
from order_search_projection
where order_id = :order_id;

select outbox_event_id, event_type, aggregate_id, status, published_at
from outbox_event
where aggregate_id = :order_id
order by created_at;

select consumer_name, event_id, status, processed_at
from inbox_message
where aggregate_id = :order_id
order by processed_at;

Safe Actions

replay projection consumer from event ID if idempotent
rebuild projection from aggregate table if projection is derived-only
mark projection stale and hide from customer-facing UI if necessary

Unsafe Actions

update projection manually without event/evidence
treat projection as source of truth for repair
create new event that lies about aggregate history

20. Safe Repair Command Design

Manual repair must go through command handlers, not direct SQL.

A repair command must define:

target entity
current expected state
allowed transition
reason
approval if required
evidence attachment/reference
before snapshot
after snapshot
audit record
emitted event if business-visible
verification rule

Example repair command:

import java.util.UUID;

// Immutable repair command. The record generates accessors such as tenantId() and reason().
public record MarkFulfillmentTaskAsManuallyResolvedCommand(
    UUID tenantId,
    UUID taskId,
    long expectedTaskVersion,
    String resolutionCode,
    String reason,
    String evidenceReference,
    String requestedBy,
    String approvedBy
) {}

Handler shape:

public RepairResult handle(MarkFulfillmentTaskAsManuallyResolvedCommand command) {
  return unitOfWork.inTransaction(() -> {
    // Lock the row so a concurrent worker cannot change the task mid-repair.
    FulfillmentTask task = taskRepository.loadForUpdate(command.tenantId(), command.taskId());

    // The policy is where expectedTaskVersion, approval, evidence, and the
    // allowed transition are expected to be checked before any mutation.
    repairPolicy.assertAllowed(command, task);
    task.markManuallyResolved(command.resolutionCode(), command.reason());

    // State change, audit record, and outbox event commit in one transaction.
    taskRepository.save(task);
    auditRepository.append(AuditRecord.fromRepair(command, task));
    outboxRepository.append(FulfillmentTaskManuallyResolved.from(task));

    return RepairResult.success(task.id(), task.state(), task.version());
  });
}

Important: repair command is not a backdoor. It is a domain command with stricter evidence rules.

21. Reconciliation Jobs

Reconciliation is how the system detects silent drift.

21.1 Reconciliation Types

Reconciliation	Checks
Quote approval reconciliation	quote state vs approval case vs process instance
Quote conversion reconciliation	accepted quote vs order existence
Order workflow reconciliation	order state vs workflow reference
Fulfillment reconciliation	task state vs external references
Asset reconciliation	completed order vs installed base
Billing reconciliation	completed billable action vs billing trigger
Projection reconciliation	aggregate state vs read models
Outbox reconciliation	domain state vs event publication

21.2 Reconciliation Output

Reconciliation must not silently mutate data by default. It should produce findings.
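A minimal sketch of a read-only reconciler: it compares states and returns findings, and it never writes to any store. `OrderView`, `Finding`, the finding type strings, and the `ACTIVE`/`INACTIVE` labels are assumptions for illustration; a real job would persist findings into `reconciliation_finding`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical read-only view joined from the aggregate, the projection, and the asset store.
record OrderView(UUID orderId, String orderState, String projectedState, boolean assetActive) {}

record Finding(UUID entityId, String findingType, String expected, String actual) {}

final class OrderReconciler {
  // Compares states and reports differences. It performs no updates.
  static List<Finding> reconcile(OrderView view) {
    List<Finding> findings = new ArrayList<>();
    // The aggregate is the source of truth; the projection must follow it.
    if (!view.orderState().equals(view.projectedState())) {
      findings.add(new Finding(
          view.orderId(), "PROJECTION_DRIFT", view.orderState(), view.projectedState()));
    }
    // A completed order is expected to have an active asset.
    if ("COMPLETED".equals(view.orderState()) && !view.assetActive()) {
      findings.add(new Finding(view.orderId(), "ASSET_NOT_ACTIVATED", "ACTIVE", "INACTIVE"));
    }
    return findings;
  }
}
```

Whether a finding is repaired automatically or routed to an operator is a separate decision made after the finding exists.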

21.3 Auto-Repair Rules

Auto-repair is allowed only when:

correction is deterministic
no external side effect is required
no business approval is required
before/after state is auditable
operation is idempotent
replay is safe

Example allowed:

rebuild stale search projection
mark outbox stale lock as available after relay death
regenerate customer timeline from durable events

Example not allowed:

mark provisioning success
activate asset
approve quote
cancel order after partial fulfillment
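The auto-repair rules can be sketched as a single predicate that every candidate must pass. `RepairCandidate` and `AutoRepairPolicy` are hypothetical names; the six flags mirror the six conditions listed above.

```java
// One flag per auto-repair condition.
record RepairCandidate(
    String name,
    boolean deterministic,
    boolean needsExternalSideEffect,
    boolean needsBusinessApproval,
    boolean auditable,
    boolean idempotent,
    boolean replaySafe) {}

final class AutoRepairPolicy {
  // All six conditions must hold; a single failing condition routes the case to a human.
  static boolean allowed(RepairCandidate c) {
    return c.deterministic()
        && !c.needsExternalSideEffect()
        && !c.needsBusinessApproval()
        && c.auditable()
        && c.idempotent()
        && c.replaySafe();
  }
}
```

Rebuilding a stale search projection passes all six checks. Activating an asset does not, because it requires an external side effect.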

22. Support Access Model

Support capability is dangerous. Design it like production code.

Roles:

Role	Capabilities
Viewer	inspect quote/order/workflow/logs
Operator	retry safe task, invalidate cache, replay projection
Senior Operator	execute approved repair commands
Incident Commander	containment decisions, severity, comms
Engineer	code-level diagnosis, hotfix
Business Approver	approve commercial/manual exception
Auditor	inspect evidence, no mutation

Every support action must include:

actor
role
tenant
target entity
command
reason
ticket/reference
before state
after state
timestamp
correlation ID

No shared admin account.
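
One way to make the mandatory fields hard to skip is to reject incomplete support actions at construction time. The record below is a sketch, not the series' actual audit model: `SupportAction`, `SupportRole`, and the field names are assumed for illustration, and the enum simply mirrors the roles listed above.

```java
import java.time.Instant;
import java.util.Objects;

/** Roles from the support access model. */
enum SupportRole {
  VIEWER, OPERATOR, SENIOR_OPERATOR, INCIDENT_COMMANDER, ENGINEER, BUSINESS_APPROVER, AUDITOR
}

/** Hypothetical audit record; construction fails if any required field is missing. */
record SupportAction(
    String actor, // an individual identity, never a shared account
    SupportRole role,
    String tenantId,
    String targetEntity,
    String command,
    String reason,
    String ticketReference,
    String beforeState,
    String afterState,
    Instant timestamp,
    String correlationId) {

  // Compact canonical constructor: validates every field before the record exists.
  SupportAction {
    requireText(actor, "actor");
    Objects.requireNonNull(role, "role is required");
    requireText(tenantId, "tenantId");
    requireText(targetEntity, "targetEntity");
    requireText(command, "command");
    requireText(reason, "reason");
    requireText(ticketReference, "ticketReference");
    requireText(beforeState, "beforeState");
    requireText(afterState, "afterState");
    Objects.requireNonNull(timestamp, "timestamp is required");
    requireText(correlationId, "correlationId");
  }

  private static void requireText(String value, String field) {
    if (value == null || value.isBlank()) {
      throw new IllegalArgumentException(field + " is required");
    }
  }
}
```

Validation in the constructor only guarantees completeness. Whether the actor's role is actually permitted to run the command still has to be enforced by the authorization layer.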

23. Communication Model

Incident communication should separate technical status from business impact.

Internal Status Format

Severity: P1
Status: Contained / Investigating / Mitigating / Monitoring / Resolved
Start Time: 2026-07-02T10:15:00+07:00
Affected Tenants: tenant-a, tenant-b
Affected Capabilities: order fulfillment, provisioning adapter
Customer Impact: 48 orders delayed, no duplicate billing detected
Current Hypothesis: provisioning adapter timeout causing worker retry exhaustion
Next Update: 30 minutes
Owner: Incident Commander

Customer/Business Status

Keep customer-facing communication factual:

what capability is impacted
whether submitted orders are safe
whether duplicate billing/order risk exists
current workaround
expected next update
final resolution summary

Do not expose internal stack details unless contractual/compliance context requires it.

24. Post-Incident Review

A post-incident review should not ask “who broke it?” It should ask:

What conditions allowed this failure to reach production?
Which signal detected it?
Which signal was missing?
Which runbook step worked?
Which runbook step was unclear?
Was recovery safe or lucky?
Did we need manual SQL?
Can this failure be converted into automated detection, test, or guardrail?
Should API/schema/workflow/event compatibility gates be improved?
Should domain invariant be moved earlier?

Output categories:

code fix
test addition
alert addition
dashboard improvement
runbook improvement
architecture change
data repair tool
release gate change
training item

25. Production Support Anti-Patterns

Avoid these:

Manual SQL as normal support tool
- It bypasses invariants, audit, events, and projections.
Retry everything
- Retry can duplicate external side effects.
Camunda as business state source
- Process state is not the same as domain state.
Kafka offset reset without replay model
- You can lose or duplicate business effects.
Cache flush as universal cure
- It can create stampede and hide root cause.
Projection mutation as repair
- Projection should be regenerated, not manually invented.
No evidence for commercial repair
- Quote/order decisions must be defensible.
Postmortem without action item
- That is documentation theater.
Support access without least privilege
- Production support can become insider-risk surface.
No tenant blast-radius control
- Multi-tenant incident response must isolate impact.

26. Minimal Runbook Set Before Go-Live

Before production launch, at minimum have runbooks for:

API high error rate
API high latency
PostgreSQL lock/deadlock
PostgreSQL connection exhaustion
failed database migration
Kafka producer failure
Kafka consumer lag
outbox backlog
inbox poison message
Camunda incident
worker crash loop
stuck quote approval
stuck quote conversion
stuck order fulfillment
failed provisioning
failed compensation
asset drift
billing trigger drift
Redis cache staleness
Redis outage
projection drift
duplicate command/event
data repair command
deployment rollback/roll-forward
tenant-specific containment

Each runbook should be tested through game day simulation.

27. Game Day Scenarios

Run game days before production.

Example scenarios:

Kafka unavailable for 20 minutes.
Outbox relay crashes after Kafka publish but before DB status update.
Provisioning API times out after creating service.
Camunda worker throws exception until retries exhausted.
Redis has stale catalog version.
Quote approval process uses old variable schema.
Database migration blocks order submission.
Duplicate quote-to-order conversion request arrives.
Billing trigger consumer processes event twice.
Order completed but asset projection failed.

For each scenario, measure:

detection time
triage time
containment time
recovery time
data consistency after recovery
customer impact clarity
runbook gaps
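
The time-based measurements are easier to compare across game days when they are derived from recorded timestamps rather than estimated afterwards. The sketch below assumes a hypothetical `GameDayTimeline` record. It also assumes one particular definition of each interval: detection is measured from fault injection, triage and containment from the end of the previous phase, and recovery from fault injection to full recovery. Teams may reasonably define these differently.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical timeline captured by the game day facilitator. */
record GameDayTimeline(
    String scenario,
    Instant faultInjected,
    Instant detected,
    Instant triaged,
    Instant contained,
    Instant recovered) {

  /** Fault injection until the first alert or human notice. */
  Duration detectionTime() {
    return Duration.between(faultInjected, detected);
  }

  /** Detection until impact and probable cause are understood. */
  Duration triageTime() {
    return Duration.between(detected, triaged);
  }

  /** End of triage until the blast radius stops growing. */
  Duration containmentTime() {
    return Duration.between(triaged, contained);
  }

  /** Fault injection until normal service is restored. */
  Duration recoveryTime() {
    return Duration.between(faultInjected, recovered);
  }
}
```

The remaining measurements (data consistency after recovery, customer impact clarity, runbook gaps) are qualitative and are better captured as notes or reconciliation findings than as durations.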

28. Production Readiness Checklist

A CPQ/OMS platform is operationally ready when:

every state transition has audit evidence
every external side effect has attempt record
every command has idempotency policy
every event has outbox/inbox path
every projection can be rebuilt
every workflow instance has business reference
every incident has runbook
every repair command is audited
every cache can be invalidated or bypassed
every critical dashboard has owner
every alert has action
every P0/P1 has post-incident review
every release can be correlated with incidents
every tenant impact can be scoped

Production readiness is not a checkbox. It is the ability to recover safely.

29. What You Should Be Able to Do After This Part

You should now be able to:

Classify CPQ/OMS production failures by technical/process/business/data category.
Diagnose stuck quote and stuck order scenarios safely.
Understand Camunda incident without confusing it with business state.
Handle Kafka lag, outbox backlog, and inbox failures without losing events.
Diagnose database locks and slow queries without randomly killing sessions.
Handle external integration ambiguity without duplicate side effects.
Design repair commands instead of manual SQL patches.
Build reconciliation jobs that detect drift safely.
Prepare support access model with audit and least privilege.
Convert incidents into preventive engineering work.

30. Closing Model

The strongest production systems are not the systems that never fail. They are systems where failure is:

visible
bounded
diagnosable
recoverable
auditable
learnable

For enterprise CPQ/OMS, this matters more than elegance. An elegant architecture that cannot explain a stuck order is not production-grade.

Next part is the final part of the series: architecture review and extension roadmap.
