Camunda 7 Operations and Incident Playbook

Learn Enterprise CPQ OMS Camunda 7 - Part 053

Production operations and incident playbook for Camunda 7 inside a large Java microservices CPQ and order management platform.

[2026-07-02] 20 min read, 3897 words

Part 053 — Camunda 7 Operations and Incident Playbook

A workflow engine is not production-ready because the BPMN diagram is valid.
It is production-ready when failed jobs, stuck process instances, duplicate callbacks, broken variables, bad deployments, and human recovery paths are controlled, observable, auditable, and reversible.

This part is about operating Camunda 7 as a production workflow runtime inside the CPQ/OMS platform we have been building.

We are not repeating BPMN basics. We are building the operational muscle that separates a toy workflow from an enterprise-grade order orchestration system.

The core question is simple:

When an order process fails at 02:00, can the platform tell an operator exactly what failed, why it failed, what can be retried safely, what must be investigated, and what action leaves an audit trail?

If the answer is no, Camunda is only moving complexity from code into diagrams.

1. What Camunda 7 Owns in This Architecture

In this CPQ/OMS platform, Camunda 7 owns process orchestration, not domain truth.

That means:

Quote Service owns quote state.
Pricing Service owns pricing evidence.
Order Service owns order state.
Inventory/Fulfillment integrations own external fulfillment facts.
Camunda owns orchestration progress, wait states, retries, user tasks, process variables, incidents, and process history.

Camunda can decide what step is next, but the domain service decides whether the business transition is valid.

Bad production systems invert this. They hide business state inside process variables and then try to reconstruct order truth from workflow tables. That path leads to brittle migration, impossible reporting, and unsafe manual fixes.

1.1 Correct Responsibility Split

1.2 The Golden Rule

Do not let Camunda process variables become the hidden database of the business.

Camunda variables should contain:

identifiers,
correlation keys,
compact routing facts,
process-local flags,
retry metadata,
decision result references,
small immutable snapshots required by the process.

They should not contain:

complete quote JSON,
complete order JSON,
mutable pricing breakdown,
large documents,
external system payloads,
secret data,
authority-critical state that bypasses domain services.

A process variable is operational state. A domain aggregate is business truth.
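
One way to make this rule enforceable is to check variables against a published contract before they reach the engine. The following is a minimal sketch, not a Camunda feature: the class name `ProcessVariableGuard`, the allowed variable names, and the 4000-character limit are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical guard (not part of Camunda): rejects variables that are not in the
// published contract or that are too large to be a compact routing fact.
final class ProcessVariableGuard {

    // Assumed limit for illustration; tune it to your own variable contract.
    static final int MAX_STRING_LENGTH = 4000;

    private final Set<String> allowedNames;

    ProcessVariableGuard(String... allowedNames) {
        this.allowedNames = new HashSet<>(Arrays.asList(allowedNames));
    }

    // Returns human-readable violations; an empty list means the map is acceptable.
    List<String> violations(Map<String, Object> variables) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Object> entry : variables.entrySet()) {
            String name = entry.getKey();
            Object value = entry.getValue();
            if (!allowedNames.contains(name)) {
                problems.add("variable not in contract: " + name);
            } else if (value instanceof String && ((String) value).length() > MAX_STRING_LENGTH) {
                problems.add("variable too large: " + name);
            } else if (!(value == null || value instanceof String
                    || value instanceof Number || value instanceof Boolean)) {
                // Complex objects tend to become hidden copies of domain aggregates.
                problems.add("variable is not a simple value: " + name);
            }
        }
        return problems;
    }
}
```

A service could run such a check just before starting or correlating a process and refuse the call when the list is not empty, so that a full quote or order payload never lands in the engine by accident.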

2. Operational Mental Model

Camunda 7 operations revolve around five runtime surfaces:

Process instance — one execution of a BPMN process.
Job — async work scheduled by the engine.
External task — pull-based work executed by external workers.
Incident — unresolved failed execution requiring attention.
History — persisted runtime/historic evidence for analysis and operations.

In CPQ/OMS, these surfaces map to business flows:

Camunda Runtime Surface	CPQ/OMS Meaning	Production Question
Process instance	Quote approval or order orchestration instance	Which quote/order is this about?
Business key	Business correlation identifier	Can an operator search by quote/order ID?
Job	Async continuation, timer, retryable work	Is failure technical or business?
External task	Work delegated to worker/integration adapter	Can it be retried safely?
User task	Human approval/fallout/review work	Who can act and what is stale?
Incident	Engine-visible unresolved failure	Should this become a fallout case?
History	Evidence of what happened	Can we reconstruct the timeline?

The platform must never require an operator to inspect raw engine tables to understand a customer-impacting failure.

3. Business Key Discipline

Every process instance must have a meaningful business key.

For this platform:

quote approval process business key = quote:{quoteId}:rev:{revisionNo}
order orchestration process business key = order:{orderId}
change order process business key = change-order:{changeOrderId}
fallout process business key = fallout:{falloutCaseId}

Why this matters:

Cockpit search becomes useful.
Logs can include the business key.
Metrics can tag the business flow.
Domain services can correlate workflows safely.
Support teams can speak in business language.
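
The key shapes above are easy to centralize so that every service builds and parses them the same way. This is a small sketch under the assumption that a shared helper exists; `BusinessKeys` and its method names are hypothetical.

```java
import java.util.Optional;

// Hypothetical helper: builds and parses the business key shapes listed above.
final class BusinessKeys {

    private BusinessKeys() {
    }

    static String forQuoteApproval(String quoteId, int revisionNo) {
        return "quote:" + quoteId + ":rev:" + revisionNo;
    }

    static String forOrder(String orderId) {
        return "order:" + orderId;
    }

    // Extracts the order ID from an order business key, or empty if the key has another shape.
    static Optional<String> orderIdOf(String businessKey) {
        String prefix = "order:";
        if (businessKey == null
                || !businessKey.startsWith(prefix)
                || businessKey.length() == prefix.length()) {
            return Optional.empty();
        }
        return Optional.of(businessKey.substring(prefix.length()));
    }
}
```

In Camunda 7 the resulting string would typically be passed as the business key argument when the process is started, for example through `RuntimeService.startProcessInstanceByKey(processDefinitionKey, businessKey, variables)`.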

3.1 Anti-Pattern: UUID-Only Business Key

This is operationally weak:

businessKey = 9c2e8af6-1b0b-4e95-978f-5d637c3ef8d7

The process engine can correlate it, but humans cannot reason about it. Use UUIDs internally if needed, but expose a business key shape that carries meaning.

3.2 Business Key Is Not Authorization

A business key is not a security boundary.

Never assume that because a user knows order:123, they can view or operate it. Authorization must still be checked in the domain/control-plane API.

4. Process Definition Deployment Strategy

Camunda process definitions are versioned. A new deployment creates a new version of the process definition. Existing running process instances normally continue with the version they started with, while new instances use the newest version unless started explicitly with another version.

This is a strength if planned. It is a production hazard if ignored.

4.1 Deployment Units

Recommended structure:

workflow-definitions/
  quote-approval/
    quote-approval.bpmn
    discount-policy.dmn
    approval-routing.dmn
    README.md
    scenarios/
  order-orchestration/
    order-orchestration.bpmn
    fulfillment-routing.dmn
    README.md
    scenarios/

A workflow deployment should include:

BPMN files,
related DMN files,
version notes,
migration notes,
compatibility notes,
expected variables,
expected external task topics,
expected domain commands/events,
rollback/roll-forward guidance.

4.2 Version Compatibility Checklist

Before deploying a new process definition:

Check	Question
Variable compatibility	Can old and new process versions handle existing variable contracts?
Worker compatibility	Are external task topics still supported by workers?
Domain API compatibility	Are called APIs backward-compatible?
DMN compatibility	Are decision output shapes compatible?
Incident behavior	Are new failure paths observable?
Migration need	Do running instances need migration or can they drain?
Monitoring	Do dashboards distinguish process definition versions?

4.3 Drain vs Migrate

Do not migrate process instances just because a new BPMN version exists.

Use this rule:

Situation	Preferred Action
Old version is safe and short-lived	Let instances drain.
Old version has non-critical improvement missing	Let instances drain.
Old version has business logic defect	Consider migration or controlled compensation.
Old version calls removed API	Migrate or keep compatibility adapter.
Old version has stuck wait state	Repair, migrate, or terminate with domain-safe recovery.

Migration is an operational intervention, not a normal deployment step.

5. Job Executor Operations

The Camunda job executor is the engine component that executes async continuations, timers, and other background jobs.

For CPQ/OMS, job executor behavior affects:

quote approval reminders,
approval SLA timers,
order orchestration async continuations,
retry timing,
process throughput,
incident generation,
database load.

5.1 Async Boundary Design

A BPMN async boundary is not a decoration. It defines a transaction boundary and retry boundary.

Use async before/after when:

the next step calls an external system,
a service task may fail transiently,
a long-running continuation should not roll back previous progress,
a retry should happen independently,
an operator may need to inspect failure at that point.

Avoid async boundaries when:

the task is pure in-memory routing,
the step is not meaningful operationally,
adding a wait state would create unnecessary job volume,
failure cannot be handled independently.

5.2 Job Executor Sizing Model

Do not tune the job executor by guessing.

Model it like this:

required throughput = business operations per second × workflow jobs per operation
worker capacity = active job threads × average jobs completed per second per thread
safety margin = 30% to 50% depending on traffic variability

Example:

200 submitted orders/minute
7 internal Camunda jobs per order
= 1400 jobs/minute
= about 23.3 jobs/second

If each job performs a quick domain command and averages 100 ms, one worker thread can theoretically handle 10 jobs/s. But production threads are not theoretical. DB locks, API calls, retries, serialization, and GC reduce capacity. You design for measured capacity, not ideal capacity.
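
The arithmetic above can be written down as a small calculator so that the inputs are explicit and can be replaced by measured values. This is a sketch only: `JobExecutorSizing` is a hypothetical helper, and the `efficiency` factor (the fraction of ideal per-thread throughput actually observed under load) is an assumption that has to come from your own measurements.

```java
// Hypothetical sizing helper for the model above; the efficiency factor is an assumed,
// measured derating of ideal per-thread throughput, not a Camunda setting.
final class JobExecutorSizing {

    private JobExecutorSizing() {
    }

    // Jobs per second the engine must sustain.
    static double requiredJobsPerSecond(double operationsPerMinute, int jobsPerOperation) {
        return operationsPerMinute * jobsPerOperation / 60.0;
    }

    // Threads needed once the safety margin and measured efficiency are applied.
    static int requiredThreads(double requiredJobsPerSecond, double avgJobMillis,
                               double efficiency, double safetyMargin) {
        double idealPerThread = 1000.0 / avgJobMillis;            // 100 ms -> 10 jobs/s
        double measuredPerThread = idealPerThread * efficiency;   // e.g. 0.6 -> 6 jobs/s
        double target = requiredJobsPerSecond * (1.0 + safetyMargin);
        return (int) Math.ceil(target / measuredPerThread);
    }
}
```

With the numbers from the example (200 orders per minute, 7 jobs per order, 100 ms per job), the ideal case needs 3 threads. If measurement showed only 60% of ideal throughput and a 50% margin is applied, the same load needs 6 threads; both the 60% and the 50% here are illustrative inputs, not recommendations.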

5.3 Job Executor Failure Modes

Failure Mode	Symptom	Likely Cause	First Response
Job backlog grows	Delayed timers, slow orchestration	Too few workers, slow tasks, DB contention	Check job acquisition, DB load, task duration
Many failed jobs	Incidents rising	Bad deployment, external outage, contract mismatch	Classify by exception and activity ID
Duplicate side effect	External system called twice	Non-idempotent delegate or retry	Stop retry, inspect domain idempotency
Lock timeout	Job retried while previous work may still run	Long task, timeout mismatch	Treat as unknown outcome
Database pressure	Engine tables hot	High job volume/history, poor indexing, noisy polling	Reduce job churn, tune DB, separate runtime and reporting load

6. External Task Worker Operations

External tasks are a good fit for microservices because workers pull work from Camunda and then explicitly complete the task, report a failure, or raise a BPMN error.

In CPQ/OMS, external tasks are useful for:

fulfillment adapter calls,
inventory reservation,
billing handoff,
document generation,
notification dispatch,
slow integration boundaries.

6.1 Worker Contract

Every external task topic must have a published worker contract:

topic: reserve-inventory
owner: fulfillment-adapter-service
inputVariables:
  - orderId
  - orderLineIds
  - reservationRequestId
outputVariables:
  - reservationOutcomeRef
failureSemantics:
  technicalFailure: retry via Camunda failure
  businessFailure: BPMN error INVENTORY_REJECTED
  unknownOutcome: create reconciliation case
idempotencyKey: orderId + reservationRequestId
maxLockDuration: 2 minutes
retryPolicy: R5/PT30S

If a topic has no contract, production support will reverse-engineer it during an outage. That is not operations. That is archaeology.

6.2 Complete vs Failure vs BPMN Error

Use the right completion path:

Worker Outcome	Camunda Action	Meaning
Work succeeded	`complete`	Process may continue.
Technical transient failure	`handleFailure`	Retry may be safe.
Business rejection	`handleBpmnError`	BPMN should route business path.
Unknown outcome	Usually failure + reconciliation/fallout	Do not blindly retry side effects.
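
The table can be expressed as a single decision function that every worker shares, so the choice of engine call is not re-invented per topic. This is a minimal sketch with hypothetical types; only the three call names correspond to the Camunda operations named in the table. It assumes that an unknown outcome is reported as a failure with zero retries, which in Camunda 7 raises an incident rather than scheduling another attempt.

```java
// Hypothetical decision model for the table above; only the three engine call names
// mirror the Camunda operations, everything else is illustrative.
enum WorkerOutcome { SUCCEEDED, TECHNICAL_TRANSIENT_FAILURE, BUSINESS_REJECTION, UNKNOWN_OUTCOME }

enum EngineCall { COMPLETE, HANDLE_FAILURE, HANDLE_BPMN_ERROR }

final class CompletionDecision {
    final EngineCall call;
    final int retries;                 // retries to report; only meaningful for HANDLE_FAILURE
    final boolean openReconciliation;  // true when the side effect must be checked before any retry

    private CompletionDecision(EngineCall call, int retries, boolean openReconciliation) {
        this.call = call;
        this.retries = retries;
        this.openReconciliation = openReconciliation;
    }

    // remainingRetries is the retry count the task had before this attempt failed.
    static CompletionDecision decide(WorkerOutcome outcome, int remainingRetries) {
        switch (outcome) {
            case SUCCEEDED:
                return new CompletionDecision(EngineCall.COMPLETE, 0, false);
            case BUSINESS_REJECTION:
                return new CompletionDecision(EngineCall.HANDLE_BPMN_ERROR, 0, false);
            case TECHNICAL_TRANSIENT_FAILURE:
                return new CompletionDecision(
                        EngineCall.HANDLE_FAILURE, Math.max(0, remainingRetries - 1), false);
            case UNKNOWN_OUTCOME:
                // Zero retries: surface an incident instead of re-running the side effect.
                return new CompletionDecision(EngineCall.HANDLE_FAILURE, 0, true);
            default:
                throw new IllegalArgumentException("unsupported outcome: " + outcome);
        }
    }
}
```

A worker would translate the returned decision into the matching call on its external task service and, when `openReconciliation` is set, also open a reconciliation or fallout case.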

6.3 External Task Lock Expiration

Lock expiration creates one of the most dangerous production states: the worker may still be doing work, but Camunda may make the task available again.

For side-effecting tasks:

use idempotency keys,
use external request IDs,
persist outbound request records,
handle duplicate complete attempts,
treat timeouts as unknown until reconciled,
monitor lock expiration separately from normal failure.
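
The idempotency-key and outbound-record items can be illustrated with a small at-most-once wrapper. This is a sketch under strong assumptions: `OutboundRequestLog` is a hypothetical class, and the in-memory map stands in for a durable outbound request table. A real implementation would persist the request before sending it, because a crash between send and record is exactly the unknown-outcome case.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical in-memory stand-in for a persisted outbound request table.
final class OutboundRequestLog {

    private final ConcurrentMap<String, String> resultByKey = new ConcurrentHashMap<>();

    // Runs the side effect at most once per idempotency key; later calls with the
    // same key return the recorded result instead of calling the external system again.
    String executeOnce(String idempotencyKey, Supplier<String> sideEffect) {
        return resultByKey.computeIfAbsent(idempotencyKey, key -> sideEffect.get());
    }
}
```

If a lock expires and a second worker picks up the same task, it arrives with the same key (for example `orderId + reservationRequestId` from the worker contract) and receives the first result rather than creating a second reservation.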

7. Incident Taxonomy

Do not treat all Camunda incidents the same. The engine sees an execution failure. Operations must classify it into a business-relevant category.

Recommended taxonomy:

Incident Type	Meaning	Example	Default Handling
Technical transient	Dependency or infra temporary issue	inventory timeout	retry with backoff
Technical persistent	Bug/config/schema issue	missing variable, class not found	stop retry, fix deployment/config
Business modeled	Expected business exception	credit check rejected	BPMN error path, no incident
Business unmodeled	Business condition missing BPMN path	unavailable product state	create fallout case
Data corruption	State violates invariant	order line missing required snapshot	block, investigate, manual correction
Unknown outcome	External side effect may or may not have happened	timeout after reserve request	reconcile before retry
Operational unsafe	Retry may duplicate money/inventory/contract side effect	duplicate billing handoff	freeze, require senior approval
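
A classifier for this taxonomy might look like the sketch below. All types are hypothetical, and both the input flags and their priority order are assumptions for illustration. "Business modeled" is left out because, per the table, it follows a BPMN error path and produces no incident.

```java
// Hypothetical incident classifier for the taxonomy above. The input flags and their
// priority order are assumptions for illustration, not engine behavior.
enum IncidentType {
    TECHNICAL_TRANSIENT("retry with backoff"),
    TECHNICAL_PERSISTENT("stop retry, fix deployment/config"),
    BUSINESS_UNMODELED("create fallout case"),
    DATA_CORRUPTION("block, investigate, manual correction"),
    UNKNOWN_OUTCOME("reconcile before retry"),
    OPERATIONAL_UNSAFE("freeze, require senior approval");

    final String defaultHandling;

    IncidentType(String defaultHandling) {
        this.defaultHandling = defaultHandling;
    }
}

final class FailureFacts {
    boolean retryCouldDuplicateCriticalEffect; // money, inventory, or contract side effect
    boolean sideEffectOutcomeUnknown;          // request sent, no confirmed result
    boolean invariantViolated;                 // domain state breaks a rule
    boolean businessConditionWithoutPath;      // BPMN has no route for this case
    boolean transientDependencyError;          // timeout, temporary unavailability
}

final class IncidentClassifier {

    private IncidentClassifier() {
    }

    // The most dangerous category wins, so an unsafe retry is never hidden behind "transient".
    static IncidentType classify(FailureFacts facts) {
        if (facts.retryCouldDuplicateCriticalEffect) {
            return IncidentType.OPERATIONAL_UNSAFE;
        }
        if (facts.sideEffectOutcomeUnknown) {
            return IncidentType.UNKNOWN_OUTCOME;
        }
        if (facts.invariantViolated) {
            return IncidentType.DATA_CORRUPTION;
        }
        if (facts.businessConditionWithoutPath) {
            return IncidentType.BUSINESS_UNMODELED;
        }
        if (facts.transientDependencyError) {
            return IncidentType.TECHNICAL_TRANSIENT;
        }
        // Nothing transient or business-related was reported: assume a bug or config issue.
        return IncidentType.TECHNICAL_PERSISTENT;
    }
}
```

The ordering matters more than the exact flags: a timeout that happened after a reservation request was sent is both transient and unknown, and it should be handled as unknown.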

7.1 Incident Lifecycle

The important point: incident handling should not be a raw Cockpit action first. It should become a controlled recovery workflow when business state is involved.

8. When to Retry

A retry is safe only when the operation is idempotent or has no side effect.

8.1 Retry Decision Matrix

Failed Step	Safe to Auto-Retry?	Why
Read product catalog	Yes	Read-only, cacheable.
Recalculate price preview	Usually yes	Pure computation if inputs stable.
Send email	Only with idempotency/dedup	Can duplicate customer message.
Reserve inventory	Only with external request ID	Can double reserve.
Create billing account	Usually no without reconciliation	Can create duplicate account.
Submit order to downstream OMS	No unless downstream idempotency exists	Can duplicate order.
Generate document	Usually yes with artifact idempotency	Same input should yield same artifact reference.
Complete human approval	No blind retry	Human action semantics must be preserved.

8.2 Retry Budget

Retries should have a budget:

max attempts × delay pattern × side-effect risk × operator visibility

Example:

inventory availability read:
  attempts: 5
  delay: 5s, 15s, 30s, 60s, 120s
  auto-fallout-after: exhausted

billing handoff:
  attempts: 1 automatic
  then: reconciliation case

A retry policy is not only technical. It encodes risk appetite.
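
The two example budgets can be modeled as a finite list of delays. This is a sketch with a hypothetical `RetryBudget` class, and it makes one interpretive assumption: each listed delay precedes one automatic retry, and a budget with no delays means the original call is the only automatic attempt. When the budget is exhausted the caller is expected to open a fallout or reconciliation case.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical retry budget: a fixed list of delays, one per automatic retry.
final class RetryBudget {

    private final List<Duration> delays;

    RetryBudget(Duration... delays) {
        this.delays = Collections.unmodifiableList(Arrays.asList(delays.clone()));
    }

    // Assumed reading of the example: five retries with growing delays.
    static RetryBudget inventoryAvailabilityRead() {
        return new RetryBudget(Duration.ofSeconds(5), Duration.ofSeconds(15),
                Duration.ofSeconds(30), Duration.ofSeconds(60), Duration.ofSeconds(120));
    }

    // Assumed reading of the example: no automatic retry after the first attempt.
    static RetryBudget billingHandoff() {
        return new RetryBudget();
    }

    // Delay before the next retry, or empty when the budget is exhausted.
    Optional<Duration> delayBeforeRetry(int failuresSoFar) {
        if (failuresSoFar < 1) {
            throw new IllegalArgumentException("failuresSoFar must be at least 1");
        }
        if (failuresSoFar > delays.size()) {
            return Optional.empty();
        }
        return Optional.of(delays.get(failuresSoFar - 1));
    }
}
```

An empty result is the signal to stop retrying: for the inventory read that means automatic fallout, for the billing handoff it means a reconciliation case after the very first failure.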

9. From Camunda Incident to Fallout Case

Camunda incidents are technical workflow incidents. CPQ/OMS needs business fallout cases.

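Recommended translation: an incident that touches business state becomes a fallout case, with a payload shaped like the one in 9.1.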

9.1 Fallout Case Payload

{
  "falloutCaseId": "FO-2026-000912",
  "source": "CAMUNDA_INCIDENT",
  "businessKey": "order:ORD-100928",
  "processDefinitionKey": "order-orchestration",
  "activityId": "reserveInventoryTask",
  "incidentType": "UNKNOWN_OUTCOME",
  "severity": "HIGH",
  "domainEntityType": "ORDER",
  "domainEntityId": "ORD-100928",
  "safeActions": [
    "CHECK_RESERVATION_STATUS",
    "MARK_RESERVATION_CONFIRMED",
    "MARK_RESERVATION_FAILED",
    "ESCALATE_TO_FULFILLMENT"
  ]
}

Do not expose arbitrary “set process variable” or “move token” operations to business users. Wrap them in domain-safe recovery commands.
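
The `safeActions` list in the payload can act as an allow-list that the recovery API enforces. This is a minimal sketch with hypothetical types; the two engine-level enum values exist only to show that they are rejected.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: operators may only trigger actions the fallout case lists as safe.
enum RecoveryAction {
    CHECK_RESERVATION_STATUS,
    MARK_RESERVATION_CONFIRMED,
    MARK_RESERVATION_FAILED,
    ESCALATE_TO_FULFILLMENT,
    SET_PROCESS_VARIABLE, // engine-level action, deliberately never listed as safe
    MOVE_TOKEN            // engine-level action, deliberately never listed as safe
}

final class FalloutCase {
    final String falloutCaseId;
    final String businessKey;
    private final Set<RecoveryAction> safeActions;

    FalloutCase(String falloutCaseId, String businessKey, Set<RecoveryAction> safeActions) {
        this.falloutCaseId = falloutCaseId;
        this.businessKey = businessKey;
        this.safeActions = Collections.unmodifiableSet(new HashSet<>(safeActions));
    }

    // Throws when the requested action is not on the case's allow-list.
    void requirePermitted(RecoveryAction action) {
        if (!safeActions.contains(action)) {
            throw new IllegalStateException(
                    action + " is not a safe action for " + falloutCaseId);
        }
    }
}
```

The design choice is that the case, not the operator's tool, decides what is possible: a user interface can only offer what `safeActions` contains.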

10. Manual Recovery Rules

Manual recovery is necessary in enterprise OMS. It is also dangerous.

The rule:

Operators should choose business recovery actions, not manipulate engine internals directly.

10.1 Good Recovery Actions

Retry reservation after reconciliation
Mark external order accepted with reference
Cancel downstream fulfillment request
Escalate to fulfillment operations
Recreate notification artifact
Re-drive billing handoff with same idempotency key

10.2 Bad Recovery Actions

Set variable "reserved" = true
Move token to next task because it seems stuck
Delete incident without domain confirmation
Terminate process instance without order cancellation
Update ACT_RU_* tables directly
Retry all failed jobs globally

10.3 Recovery Command Pattern

A recovery action should be implemented as a domain command:

POST /orders/{orderId}/recovery-actions/confirm-inventory-reservation

The command should:

authorize the operator,
validate order state,
validate fallout case state,
validate external evidence,
append audit evidence,
update domain state,
publish outbox event,
correlate or signal Camunda safely.
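
The ordering of those steps can be sketched as a handler in which every validation runs before any write. Everything here is hypothetical: the class name, the role `FULFILLMENT_OPS`, the fallout state `OPEN`, and the step labels. Real steps would call the domain service, audit store, outbox, and workflow engine; this sketch only records which steps would run.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a recovery command handler; side effects are represented
// by step labels instead of real calls.
final class ConfirmReservationRecovery {

    static final class Request {
        final String operatorRole;
        final String orderState;
        final String falloutCaseState;
        final String externalReservationRef;

        Request(String operatorRole, String orderState,
                String falloutCaseState, String externalReservationRef) {
            this.operatorRole = operatorRole;
            this.orderState = orderState;
            this.falloutCaseState = falloutCaseState;
            this.externalReservationRef = externalReservationRef;
        }
    }

    static final class Result {
        final boolean accepted;
        final String reason;
        final List<String> steps;

        Result(boolean accepted, String reason, List<String> steps) {
            this.accepted = accepted;
            this.reason = reason;
            this.steps = steps;
        }
    }

    private static Result rejected(String reason) {
        return new Result(false, reason, Collections.<String>emptyList());
    }

    static Result handle(Request request) {
        // All validations run before any write, so a rejected command changes nothing.
        if (!"FULFILLMENT_OPS".equals(request.operatorRole)) {
            return rejected("operator not authorized");
        }
        if (!"RESERVATION_PENDING".equals(request.orderState)) {
            return rejected("order is not in RESERVATION_PENDING");
        }
        if (!"OPEN".equals(request.falloutCaseState)) {
            return rejected("fallout case is not open");
        }
        if (request.externalReservationRef == null || request.externalReservationRef.trim().isEmpty()) {
            return rejected("external evidence is missing");
        }
        // Writes happen in a fixed order; the engine is notified last.
        return new Result(true, "ok", Arrays.asList(
                "AUDIT_APPENDED", "ORDER_STATE_UPDATED",
                "OUTBOX_EVENT_PUBLISHED", "WORKFLOW_CORRELATED"));
    }
}
```

In Camunda 7 the final step could be a message correlation by business key, for example `RuntimeService.createMessageCorrelation(messageName).processInstanceBusinessKey(businessKey).correlate()`, where the message name is whatever the BPMN model defines.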

11. History Retention and Cleanup

Camunda history is useful but can become expensive.

In CPQ/OMS, history is used for:

approval trace,
process diagnostics,
incident investigation,
SLA measurement,
audit support,
operational reporting.

But not all history must live forever in Camunda tables.

11.1 Retention Classes

Process Type	Retention Need	Strategy
Quote approval	Medium/high	Keep enough for audit window; export key evidence to audit store.
Order orchestration	High	Keep operational history; export critical timeline.
Notification workflow	Medium	Keep summary, archive detail.
Batch cleanup process	Low	Short retention.
Fallout process	High	Keep or archive with case record.

11.2 Evidence Export Pattern

Do not depend only on Camunda history tables for regulatory evidence.

Use an audit/evidence store for business-critical history.

Camunda history is operational history. The audit evidence store is business evidence.

11.3 Cleanup Safety

Before history cleanup:

confirm retention requirements,
confirm audit export completeness,
test on staging-sized data,
monitor query performance before/after,
exclude active diagnostics window,
avoid cleanup job competing with peak order traffic,
document what data will no longer be available in Cockpit.

12. Camunda Database Hygiene

Camunda 7 uses relational database tables for runtime and history. That means workflow performance is also database performance.

12.1 Common Table Pressure Areas

Area	Why It Grows
Runtime execution tables	active process instances, wait states
Runtime job tables	timers, async continuations, retries
Variable tables	too many/large variables
History process/activity tables	high process volume
History variable/detail tables	verbose variable updates
Incident tables	unresolved failures

12.2 Variable Hygiene

Bad variable design is one of the fastest ways to degrade Camunda operations.

Avoid:

large serialized Java objects,
huge JSON blobs,
frequently updated large variables,
sensitive data,
versionless complex payloads,
variables duplicated from domain aggregates.

Prefer:

IDs,
compact enums,
immutable references,
small routing facts,
external document/artifact references,
explicit versioned variable contracts.

12.3 Query Hygiene

Operational dashboards should not repeatedly execute heavy ad-hoc Camunda queries against production runtime tables.

Use projection/read model tables for:

order workflow status,
approval task queue,
fallout case dashboard,
SLA breach list,
process KPI dashboard.

Let Camunda be the engine. Do not turn it into your BI database.

13. Monitoring Model

A production Camunda installation needs both technical and business monitoring.

13.1 Technical Metrics

Track:

active process instances by definition/version,
job backlog,
failed jobs,
incident count by activity,
external task lock expirations,
job acquisition latency,
history cleanup duration,
DB connection pool usage,
slow engine queries,
worker success/failure rate,
worker latency by topic.

13.2 Business Metrics

Track:

quote approval cycle time,
quote approval SLA breaches,
order orchestration duration,
fulfillment fallout rate,
cancellation compensation success rate,
manual recovery count,
stuck orders by stage,
rework loop count,
retry-to-success ratio,
unknown outcome count.

13.3 Dashboard Shape

A good dashboard answers:

Which business flows are unhealthy?
Which technical component is the likely cause?
Which incidents need manual action?
Which actions are safe?
Which customers/orders are affected?

14. Alerting Strategy

Bad alerting wakes people up for noise. Good alerting points to customer or business impact.

14.1 Alert Classes

Alert	Trigger	Severity
Order orchestration backlog	Active orders stuck beyond stage budget	High
Incident spike	Incident rate above baseline	High
External task lock expiration spike	Workers timing out	High
Approval SLA breach	Business deadline exceeded	Medium/high
History cleanup failure	Cleanup repeatedly fails	Medium
Process deployment mismatch	New process version active without compatible worker	High
Failed job repeated at same activity	Likely code/config defect	High

14.2 Alert Payload

An alert should include:

business impact: orders affected / quote approvals delayed
process key/version
activity id
incident category
example business keys
first detected time
recent deployment correlation
safe first action
runbook link

Never send an alert that only says “Camunda incidents > 100”. That is a symptom without operational context.
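
A simple gate in the alerting pipeline can refuse to send payloads that lack this context. The sketch below is hypothetical: the class name and the field keys are assumptions that mirror the list above, not a schema from any alerting tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical check: an alert is only sendable when every context field is present.
final class AlertPayloadCheck {

    // Assumed field keys, one per item in the alert payload list.
    static final List<String> REQUIRED_FIELDS = Arrays.asList(
            "businessImpact", "processKeyAndVersion", "activityId", "incidentCategory",
            "exampleBusinessKeys", "firstDetectedAt", "recentDeployment",
            "safeFirstAction", "runbookLink");

    private AlertPayloadCheck() {
    }

    // Returns the required fields that are absent or blank, in declaration order.
    static List<String> missingFields(Map<String, String> payload) {
        List<String> missing = new ArrayList<>();
        for (String field : REQUIRED_FIELDS) {
            String value = payload.get(field);
            if (value == null || value.trim().isEmpty()) {
                missing.add(field);
            }
        }
        return missing;
    }
}
```

A payload that only carries an incident count would fail this check on all nine fields, which is the point: the count alone is not actionable.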

15. Runbooks

A runbook is not a wiki page full of vague advice. It is a deterministic procedure for a known failure class.

15.1 Runbook Template

# Runbook: Inventory Reservation Unknown Outcome

## Symptoms
- Camunda incident at activity reserveInventoryTask
- External task failure reason contains timeout after outbound request
- Order status is RESERVATION_PENDING

## Impact
- Order cannot progress to fulfillment
- Duplicate reservation possible if retried blindly

## Do Not
- Do not retry job before reconciliation
- Do not set process variable manually
- Do not cancel order unless customer/ops confirms

## Diagnosis
1. Open fallout case by orderId.
2. Check outbound request record by reservationRequestId.
3. Query fulfillment adapter status endpoint.
4. Compare external reservation state with order state.

## Safe Recovery
- If external reservation exists: run ConfirmReservationRecoveryCommand.
- If external reservation does not exist: run MarkReservationFailedAndRetryCommand.
- If external state unknown: escalate to fulfillment ops.

## Verification
- Order state changed from RESERVATION_PENDING.
- Camunda process moved past reserveInventoryTask or routed to fallout.
- Audit event recorded.
- Customer-facing status updated.

15.2 Required Runbooks

For CPQ/OMS, minimum runbooks:

quote approval task stuck,
approval SLA timer not firing,
quote approved but stale price detected,
order process not started after quote acceptance,
duplicate order submit attempt,
inventory reservation timeout,
billing handoff timeout,
document generation failure,
notification duplicate risk,
Camunda incident spike after deployment,
external task worker down,
process migration failure,
history cleanup failure,
engine database contention,
fallout case cannot be resolved.

16. Cockpit/Admin Usage Policy

Camunda Cockpit/Admin are powerful. In enterprise CPQ/OMS, raw operational tools must be governed.

16.1 Who Can Use What

Capability	Allowed Role	Notes
View process instance	Support/ops	Subject to tenant/business authorization.
View variables	Restricted ops	Sensitive variables must be avoided/masked.
Retry failed job	Technical ops	Only when runbook says retry is safe.
Modify variables	Senior technical ops only	Prefer domain recovery command.
Migrate process instance	Workflow platform owner	Requires migration plan.
Delete/terminate instance	Highly restricted	Must align with domain cancellation/closure.
Deploy process definition	Release pipeline only	No manual production deployment.

16.2 Never Do This in Production

update Camunda engine tables directly,
delete incidents without root cause classification,
retry all failed jobs globally,
modify business variables without audit,
terminate order process while order is still active,
deploy BPMN manually outside release governance,
expose Cockpit broadly as business workbench.

17. Process Instance Migration Playbook

Migration is needed when running process instances must move from an old process definition version to a new one.

17.1 Migration Preconditions

Before migration:

identify affected process definition versions,
classify active instance states,
map old activities to new activities,
confirm variable compatibility,
confirm worker/topic compatibility,
confirm DMN compatibility,
run migration simulation in staging using production-like instances,
define rollback/abort strategy,
create approval record.

17.2 Migration Plan Shape

source: order-orchestration:v12
target: order-orchestration:v13
reason: old process calls removed inventory topic
scope:
  instances at activities:
    - waitForReservationCallback
    - reserveInventoryTask
exclude:
  - completed instances
  - instances with active fallout case
validation:
  - variable orderId exists
  - variable fulfillmentPlanRef exists
  - domain order state in RESERVATION_PENDING or RESERVATION_REQUESTED
post-check:
  - no incident at old activity
  - process instance version updated
  - order projection still consistent
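
The scope, exclude, and validation rules of such a plan can be turned into a per-instance eligibility filter that runs before anything is migrated. This is a sketch with a hypothetical `Candidate` type that a platform service would populate from Camunda and the Order Service; it does not call the engine.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-migration filter mirroring the plan above.
final class MigrationEligibility {

    static final Set<String> IN_SCOPE_ACTIVITIES =
            new HashSet<>(Arrays.asList("waitForReservationCallback", "reserveInventoryTask"));
    static final Set<String> ALLOWED_ORDER_STATES =
            new HashSet<>(Arrays.asList("RESERVATION_PENDING", "RESERVATION_REQUESTED"));

    // Illustrative snapshot of one process instance plus its domain order state.
    static final class Candidate {
        String activityId;
        boolean completed;
        boolean hasActiveFalloutCase;
        String orderState;
        Map<String, Object> variables;
    }

    private MigrationEligibility() {
    }

    static boolean isEligible(Candidate c) {
        // Exclusions first: completed instances and instances under active fallout handling.
        if (c.completed || c.hasActiveFalloutCase) {
            return false;
        }
        if (!IN_SCOPE_ACTIVITIES.contains(c.activityId)) {
            return false;
        }
        // Validation: required variables must exist and the domain state must match.
        if (c.variables == null
                || c.variables.get("orderId") == null
                || c.variables.get("fulfillmentPlanRef") == null) {
            return false;
        }
        return ALLOWED_ORDER_STATES.contains(c.orderState);
    }
}
```

Only the instance IDs that pass this filter would then be handed to the engine's migration API (in Camunda 7, a plan built with `RuntimeService.createMigrationPlan(...)` and executed through `RuntimeService.newMigration(...)`).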

17.3 Migration Risk

Migration changes the execution path of live business flows. Treat it like a data migration and production operation, not a developer convenience.

18. Production Incident Scenarios

18.1 Scenario: Order Stuck After Quote Acceptance

Symptoms:

Quote is ACCEPTED.
Order exists in SUBMITTED.
No active order orchestration process found.

Likely causes:

workflow start failed after domain commit,
outbox publisher down,
process deployment missing,
process start idempotency conflict.

Safe response:

Check order workflow_correlation record.
Check outbox event OrderSubmitted.
Check Camunda process by business key order:{orderId}.
If no process exists and order state allows orchestration, run StartOrderWorkflowRecoveryCommand.
Verify process instance started and order projection updated.

Never create a process instance manually without a domain recovery command, because a duplicate workflow may be created.

18.2 Scenario: Inventory Reservation Timeout

Symptoms:

Incident at external task reserveInventoryTask.
External request sent.
No callback received.

Safe response:

Classify as unknown outcome.
Do not retry immediately.
Query external inventory by request ID.
If reservation exists, confirm reservation in Order Service and correlate Camunda.
If reservation does not exist, mark failed and allow retry or route fallout.
Append audit evidence.

18.3 Scenario: Incident Spike After Deployment

Symptoms:

Many incidents at same activity after release.
Errors mention missing variable, class, topic, or API 4xx.

Safe response:

Freeze automatic retry if side effects are involved.
Compare release manifest to active process version.
Check worker compatibility.
Check OpenAPI/schema compatibility.
Decide roll-forward adapter/fix vs process migration vs allow old instances to drain.
Create postmortem entry.

19. Operational Data Model Extensions

Do not rely only on Camunda internal tables. Build operational tables owned by platform services.

Example:

create table workflow_correlation (
  id uuid primary key,
  tenant_id varchar(64) not null,
  domain_type varchar(64) not null,
  domain_id varchar(128) not null,
  process_definition_key varchar(128) not null,
  process_instance_id varchar(128),
  business_key varchar(256) not null,
  status varchar(64) not null,
  started_at timestamptz,
  ended_at timestamptz,
  last_incident_at timestamptz,
  created_at timestamptz not null,
  updated_at timestamptz not null,
  unique (tenant_id, domain_type, domain_id, process_definition_key)
);

This table lets the domain/control-plane know which workflow exists for a business entity without querying raw Camunda tables for every operation.
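
The unique constraint is what makes a recovery start safe to repeat. The sketch below models that behavior in memory; `WorkflowCorrelationRegistry` is hypothetical, and the map key stands in for the database constraint. A real implementation would insert the row in the database first and start the process only if the insert succeeded.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical in-memory stand-in for the workflow_correlation table: the map key plays the
// role of the unique (tenant_id, domain_type, domain_id, process_definition_key) constraint.
final class WorkflowCorrelationRegistry {

    static final class StartResult {
        final String processInstanceId;
        final boolean newlyStarted;

        StartResult(String processInstanceId, boolean newlyStarted) {
            this.processInstanceId = processInstanceId;
            this.newlyStarted = newlyStarted;
        }
    }

    private final ConcurrentMap<String, String> instanceIdByKey = new ConcurrentHashMap<>();

    // Starts the workflow only if no correlation exists yet; otherwise returns the existing one.
    // startProcess is a caller-supplied function that starts the process and returns its ID.
    StartResult startOnce(String tenantId, String domainType, String domainId,
                          String processDefinitionKey, Supplier<String> startProcess) {
        String key = tenantId + "|" + domainType + "|" + domainId + "|" + processDefinitionKey;
        boolean[] started = {false};
        String instanceId = instanceIdByKey.computeIfAbsent(key, k -> {
            started[0] = true;
            return startProcess.get();
        });
        return new StartResult(instanceId, started[0]);
    }
}
```

A second recovery attempt for the same order and process definition returns the existing instance ID with `newlyStarted` set to false, so an operator who clicks twice does not create a duplicate workflow.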

20. CPQ/OMS Camunda Operations Checklist

Before production:

21. The Top 1% Lens

A common engineer says:

“The process failed. Retry the job.”

A production engineer asks:

“What kind of failure is this? Did the failed activity have an external side effect? Is retry idempotent? Does the domain state agree with the workflow state? What evidence do we need before moving the process forward?”

That shift is the difference between running workflows and operating a business-critical workflow platform.

Camunda 7 in CPQ/OMS is powerful precisely because it exposes long-running state. But every exposed wait state becomes an operational responsibility.

The mature architecture does not pretend failures will not happen. It designs failures into the platform:

technical failure,
business rejection,
unknown outcome,
manual recovery,
audit evidence,
safe retry,
controlled migration,
operational visibility.

That is what makes workflow orchestration production-grade.

22. What Comes Next

Camunda operations cannot be healthy if the database underneath is unhealthy.

The next part moves into PostgreSQL operations for CPQ/OMS: connection pools, vacuum, indexes, partitioning, transaction bloat, backups, restore drills, query hygiene, and operational data ownership.
