Camunda Versioning, Migration, and Incident Ops
Learn Production Grade Contract-First Java Orchestration Platform - Part 030
Camunda 7 process versioning, process instance migration, incident operations, failed job retry, history cleanup, and production runbook design.
A process model is code.
A running process instance is production state.
Changing a BPMN file is therefore not like changing a diagram. It is closer to changing database schema while transactions are still open.
In Camunda 7, process versioning, instance migration, failed job retry, incidents, history cleanup, and operational repair are not secondary topics. They are the difference between a workflow demo and a durable case-management platform.
This part is about how to operate Camunda 7 when there are thousands or millions of active process instances, long-running cases, human tasks, timers, service calls, Kafka correlations, and legal deadlines.
1. The Production Problem
In a regulatory enforcement platform, a case may run for months or years.
During that period:
- law may change
- SLA policy may change
- user task assignment policy may change
- process path may change
- subprocess may be split
- service task may be replaced
- Kafka event contract may evolve
- database schema may migrate
- bugs may be discovered in running process instances
- incidents may accumulate after an outage
- history tables may grow until performance degrades
The naive question is:
“How do we deploy the new BPMN?”
The real question is:
“What happens to every running process instance, every pending timer, every active task, every failed job, every historical audit record, and every downstream consumer when this process changes?”
That is the level this part targets.
2. Mental Model: Process Definition vs Process Instance
Camunda 7 separates process definitions from process instances.
Deploying a new process definition version does not automatically rewrite all running instances. Existing instances usually continue on the version they started with unless migrated or modified.
This is good. It prevents accidental mutation of in-flight legal work.
It is also dangerous. You may end up with many versions running at once.
Therefore, every BPMN change needs a versioning decision.
3. Four Types of BPMN Change
Not all BPMN changes are equal.
| Change Type | Example | Migration Risk | Typical Strategy |
|---|---|---|---|
| Cosmetic | task label wording, diagram layout | Low | deploy new version, no migration |
| Additive inactive path | add future task after current wait state | Medium | deploy + optional migration |
| Behavioral change | different gateway condition, changed retry, new escalation | High | versioned rollout + explicit migration plan |
| Destructive structural change | remove active task, rename task id, delete subprocess | Very high | avoid, or migrate with controlled mapping |
The most dangerous changes are not visual. They are semantic.
Examples:
- changing a task definition key
- changing variable names used by gateways
- changing message correlation names
- changing business key assumptions
- changing timer duration expression
- changing async boundaries
- changing error boundary behavior
- changing delegate expression target
- changing retry cycle
- deleting a wait state where instances currently sit
Process versioning is contract evolution.
4. Stable BPMN IDs Are a Contract
In BPMN, element IDs are not just internal XML details. They are operational handles.
Bad habit:
<bpmn:userTask id="Activity_1x2y3z" name="Supervisor Approval" />
Better:
<bpmn:userTask id="supervisorApproval" name="Supervisor Approval" />
A stable ID supports:
- task definition key lookup
- migration mapping
- incident triage
- audit interpretation
- dashboard grouping
- code references
- process tests
Rules:
- Never allow random modeler-generated IDs to remain in production BPMN.
- Use meaningful stable IDs.
- Treat IDs as part of the contract.
- Do not rename IDs casually.
- Document every ID rename as a breaking change.
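The first rule can be enforced mechanically in a build or review step. The following is a minimal sketch, assuming modeler-generated IDs look like `Activity_1x2y3z` (a type prefix, an underscore, and a short random suffix). The class name `BpmnIdLint` and the exact pattern are illustrative assumptions, not part of any Camunda tooling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical lint helper: flags element IDs that look modeler-generated.
final class BpmnIdLint {

    // Assumed shape of generated IDs, e.g. Activity_1x2y3z or Gateway_0abc123.
    private static final Pattern GENERATED_ID =
            Pattern.compile("^(Activity|Gateway|Event|Flow|Task)_[0-9a-z]{6,}$");

    /** Returns every element id in the BPMN XML that matches the generated-ID pattern. */
    static List<String> findGeneratedIds(String bpmnXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bpmnXml.getBytes(StandardCharsets.UTF_8)));

            List<String> offenders = new ArrayList<>();
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                String id = element.getAttribute("id"); // empty string when absent
                if (GENERATED_ID.matcher(id).matches()) {
                    offenders.add(id);
                }
            }
            return offenders;
        } catch (Exception e) {
            throw new IllegalArgumentException("BPMN XML could not be parsed", e);
        }
    }
}
```

A build step could fail when the returned list is not empty, so that random IDs never reach a production deployment.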
5. Process Definition Versioning Policy
Adopt a policy similar to API versioning.
| Version Change | Meaning | Example |
|---|---|---|
| Patch | no semantic change for running instances | label fix, documentation, non-executed extension property |
| Minor | backward-compatible new path | add optional review after current task |
| Major | behavior may change for active instances | changed decision logic, removed task, changed message contract |
Camunda itself increments process definition versions on deployment. Your platform should also track semantic release metadata.
Example table:
create table process_ops.process_release (
  release_id uuid primary key,
  process_key text not null,
  camunda_definition_id text not null,
  camunda_version integer not null,
  semantic_version text not null,
  release_type text not null,
  deployed_at timestamptz not null,
  deployed_by text not null,
  migration_required boolean not null,
  migration_plan_id uuid,
  notes text not null
);
This gives operators a business-facing view of process releases.
6. Deployment Is Not Migration
A deployment introduces a new process definition version.
A migration moves existing process instances from one definition version to another.
A modification changes token positions or variables within instances.
Do not mix these concepts.
A safe release may deploy v2 but not migrate any existing instance.
That is often the right choice.
7. Migration Decision Matrix
Before migrating, classify active instances.
| Instance Position | Should Migrate? | Why |
|---|---|---|
| Not yet reached changed area | Usually yes | safer to move before hitting old behavior |
| Sitting at unchanged user task | Maybe | depends on task key compatibility |
| Sitting at removed user task | High risk | requires mapping or manual completion |
| Sitting at service task incident | Usually no until repaired | migration may hide root cause |
| Waiting at message event with changed name | High risk | correlation contract changed |
| Waiting at timer event with changed duration | Policy decision | legal deadline implications |
| Completed process | No | history only |
| Suspended instance | Usually no until reviewed | suspension reason matters |
Migration should be data-driven, not global by default.
8. Migration Plan as Artifact
A migration plan should be a reviewed artifact.
It should answer:
- Source process definition version.
- Target process definition version.
- Which instances are eligible.
- Which instances are excluded.
- Activity mappings.
- Variable transformations.
- Task projection impact.
- SLA obligation impact.
- Audit event to append.
- Rollback/compensation plan.
- Operator approval.
- Monitoring after migration.
Example:
migrationPlanId: MP-2026-07-CASE-REVIEW-V3
processKey: enforcementCaseLifecycle
sourceVersion: 12
targetVersion: 13
reason: Add mandatory legal review for high-risk penalty recommendation
eligible:
  - currentActivity in [investigationReview, supervisorApproval]
  - riskBand = HIGH
excluded:
  - hasActiveIncident = true
  - caseStatus in [CLOSED, CANCELLED]
activityMappings:
  investigationReview: investigationReview
  supervisorApproval: supervisorApproval
variableTransformations:
  legalReviewRequired: true
projectionRepair: rebuildWorkQueueForMigratedInstances
auditEvent: PROCESS_INSTANCE_MIGRATED
approvalRequiredFrom:
  - workflow-platform-owner
  - enforcement-policy-owner
A migration is not merely a technical command. It is a change to active work.
9. Process Instance Migration Flow
Always support dry-run.
Dry-run should report:
- candidate count
- excluded count by reason
- current activities
- incident count
- active task count
- SLA impact count
- process variable compatibility warnings
- estimated batch size
- expected lock/DB pressure
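Part of such a report, the candidate count and the excluded count by reason, can be sketched in plain Java. The example below reuses the eligibility and exclusion rules from the sample plan in section 8. The `InstanceSnapshot` record, the reason codes, and the class name are hypothetical; in a real system the snapshots would be assembled from engine queries and domain data.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical dry-run summarizer: nothing here touches the engine.
final class MigrationDryRun {

    record InstanceSnapshot(String processInstanceId, String currentActivity,
                            String riskBand, boolean hasActiveIncident, String caseStatus) {}

    record Report(int candidateCount, Map<String, Integer> excludedByReason) {}

    private static final Set<String> MAPPED_ACTIVITIES =
            Set.of("investigationReview", "supervisorApproval");
    private static final Set<String> TERMINAL_STATUSES = Set.of("CLOSED", "CANCELLED");

    /** Empty result means the instance is a migration candidate. */
    static Optional<String> exclusionReason(InstanceSnapshot s) {
        if (s.hasActiveIncident()) return Optional.of("ACTIVE_INCIDENT");
        if (TERMINAL_STATUSES.contains(s.caseStatus())) return Optional.of("CASE_TERMINAL");
        if (!MAPPED_ACTIVITIES.contains(s.currentActivity())) return Optional.of("ACTIVITY_NOT_MAPPED");
        if (!"HIGH".equals(s.riskBand())) return Optional.of("RISK_BAND_NOT_HIGH");
        return Optional.empty();
    }

    /** Counts candidates and groups exclusions by reason without changing any instance. */
    static Report dryRun(List<InstanceSnapshot> snapshots) {
        int candidates = 0;
        Map<String, Integer> excluded = new TreeMap<>();
        for (InstanceSnapshot s : snapshots) {
            Optional<String> reason = exclusionReason(s);
            if (reason.isPresent()) {
                excluded.merge(reason.get(), 1, Integer::sum);
            } else {
                candidates++;
            }
        }
        return new Report(candidates, excluded);
    }
}
```

Because the exclusion check comes first and returns a named reason, the same function can later gate the real migration batch, which keeps the dry-run and the execution from drifting apart.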
10. Variable Migration
BPMN migration maps activities. It does not magically fix your variable contract.
If a gateway changes from:
${approved == true}
to:
${supervisorDecision == 'APPROVED'}
then existing instances may not have supervisorDecision.
You need variable transformation.
if (Boolean.TRUE.equals(vars.get("approved"))) {
    runtimeService.setVariable(processInstanceId, "supervisorDecision", "APPROVED");
} else if (Boolean.FALSE.equals(vars.get("approved"))) {
    runtimeService.setVariable(processInstanceId, "supervisorDecision", "RETURNED");
}
But do not scatter transformations in random scripts.
Make variable migration explicit, tested, and auditable.
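One way to make the rule testable is to separate the decision from the engine call. The sketch below is a pure function that computes which variables would need to be written for the `approved` to `supervisorDecision` change; the class name is hypothetical. The caller could then pass the result to `RuntimeService.setVariables` and record an audit event.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, engine-free transformation rule for one variable contract change.
final class SupervisorDecisionVariableMigration {

    /**
     * Returns only the variables that need to be written.
     * An empty map means the instance needs no change or cannot be decided automatically.
     */
    static Map<String, Object> variablesToSet(Map<String, Object> current) {
        Map<String, Object> delta = new LinkedHashMap<>();
        if (current.containsKey("supervisorDecision")) {
            return delta; // already on the new contract, never overwrite
        }
        Object approved = current.get("approved");
        if (Boolean.TRUE.equals(approved)) {
            delta.put("supervisorDecision", "APPROVED");
        } else if (Boolean.FALSE.equals(approved)) {
            delta.put("supervisorDecision", "RETURNED");
        }
        // approved missing or not a Boolean: leave the instance for manual review
        return delta;
    }
}
```

Running the function twice on the same instance yields an empty delta the second time, so the repair is safe to repeat after a partial batch failure.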
11. Active Task Migration
Human task migration is sensitive.
If an instance is sitting at supervisorApproval, and v2 still has supervisorApproval, migration may be straightforward.
If v2 splits that task into:
- legalSufficiencyReview
- supervisorApproval
then the current task cannot simply become both.
Options:
- Let existing instances finish old path.
- Migrate only instances before the split point.
- Complete old task then start new subprocess for legal review.
- Modify token position manually with explicit audit.
- Cancel old task and create new task projection after migration.
For regulatory work, the first option (let existing instances finish the old path) is often safest unless policy requires immediate change.
12. Message Correlation Compatibility
Message names and correlation keys are contracts.
Bad change:
<bpmn:message id="Message_EvidenceReceived" name="EvidenceReceived" />
changed to:
<bpmn:message id="Message_DocumentReceived" name="DocumentReceived" />
without compatibility bridge.
Consequence:
- old instances wait for EvidenceReceived
- new event publisher sends DocumentReceived
- old instances never wake
Compatibility strategies:
- publish both old and new messages during transition
- keep message name stable and evolve payload
- correlate through adapter that knows process version
- migrate waiting instances before changing publisher
- store waiting subscription metadata for monitoring
Never change message correlation casually.
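The first and third strategies can be sketched as a small naming bridge. The version number 13 as the first definition that listens for the new name is an assumption for illustration, as is the class name. The actual correlation would go through the engine, for example `runtimeService.createMessageCorrelation(name)`, which is left out here.

```java
import java.util.List;

// Hypothetical bridge between the old and new message names during a transition.
final class EvidenceMessageBridge {

    static final String LEGACY_NAME = "EvidenceReceived";
    static final String NEW_NAME = "DocumentReceived";

    // Assumption: definition version 13 is the first one that waits for the new name.
    static final int FIRST_VERSION_WITH_NEW_NAME = 13;

    /** Version-aware adapter: which name is an instance of this version waiting for? */
    static String messageNameFor(int definitionVersion) {
        return definitionVersion >= FIRST_VERSION_WITH_NEW_NAME ? NEW_NAME : LEGACY_NAME;
    }

    /** Dual publishing: names to try while both old and new instances are still active. */
    static List<String> namesDuringTransition() {
        return List.of(NEW_NAME, LEGACY_NAME);
    }
}
```

Once the count of active instances below the cut-over version reaches zero, the legacy name can be retired deliberately instead of by accident.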
13. Timer Compatibility
Timer changes can have legal consequences.
Changing a timer duration from P3D to P5D (three days to five days in ISO 8601 notation) is not just a technical change. It changes deadline behavior.
Questions before timer migration:
- Does the new timer apply to existing cases?
- Does law or policy allow recalculation?
- Should already-created timer jobs be replaced?
- Should existing SLA obligations be recalculated?
- How is the reason audited?
- Who approves deadline changes?
In many systems, Camunda timer controls workflow wake-up, while the domain SLA table controls legal deadline. If so, update SLA first, then align timer behavior.
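Before deciding, it helps to quantify what a duration change means for one existing timer. The sketch below compares the old and new due dates for a three-day versus five-day duration, written as `P3D` and `P5D`. It uses `java.time.Duration`, which treats a day as exactly 24 hours and does not accept month or year components, so it only approximates how the engine or a legal calendar would compute the date. The class and record names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for reviewing the effect of a timer duration change.
final class TimerChangeImpact {

    record Impact(Instant oldDue, Instant newDue, Duration shift) {}

    /** Compares due dates for a timer created at the given instant under both durations. */
    static Impact compare(Instant timerCreatedAt, String oldIsoDuration, String newIsoDuration) {
        Instant oldDue = timerCreatedAt.plus(Duration.parse(oldIsoDuration));
        Instant newDue = timerCreatedAt.plus(Duration.parse(newIsoDuration));
        return new Impact(oldDue, newDue, Duration.between(oldDue, newDue));
    }
}
```

For a timer created at `2026-07-01T00:00:00Z`, `compare(created, "P3D", "P5D")` reports a shift of 48 hours. Whether that shift may apply to an already running case is the policy question, not the code question.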
14. Incident Model
A Camunda incident means the engine could not proceed automatically and needs attention.
Typical causes:
- service task exception exhausted retries
- failed async continuation
- external system outage
- database timeout/deadlock
- expression error
- missing delegate bean
- invalid variable type
- message correlation mismatch
- history cleanup job failure
- serialization/deserialization failure
An incident is not the root cause. It is a symptom with a process position.
Your runbook should classify incidents by type and recovery action.
15. Failed Job Retry Semantics
Async service tasks and timers are executed by jobs. If a job fails, retries decrease. When retries reach zero, an incident is created.
Design retry policy intentionally.
| Failure Type | Retry? | Example |
|---|---|---|
| transient HTTP timeout | yes | downstream service slow |
| Kafka broker temporarily unavailable | yes | producer cannot publish |
| optimistic locking | yes, short retry | concurrent engine update |
| validation error | no | invalid command payload |
| missing delegate bean | no until deployment fixed | bad release |
| SQL syntax error | no until code fixed | mapper bug |
| authorization denied | no | policy failure |
| external 409 conflict | depends | idempotency conflict |
Do not retry non-retryable failures for hours. That hides bugs and creates noise.
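The table can be turned into an explicit classifier so that retryability is decided in one place. The sketch below maps a few JDK exception types to a decision; the mapping itself is an assumption for illustration (for example, treating `IllegalArgumentException` as a validation failure), and engine-specific exceptions such as optimistic locking are deliberately left out to keep the example free of dependencies.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

// Hypothetical classifier: the exception-to-decision mapping is illustrative only.
final class RetryClassifier {

    enum Decision { RETRY, DO_NOT_RETRY, NEEDS_REVIEW }

    static Decision classify(Throwable failure) {
        // Transient transport problems: the dependency may recover on its own.
        if (failure instanceof SocketTimeoutException
                || failure instanceof ConnectException
                || failure instanceof TimeoutException) {
            return Decision.RETRY;
        }
        // Deterministic failures: retrying repeats the same result.
        if (failure instanceof IllegalArgumentException   // assumed: validation error
                || failure instanceof SecurityException) { // assumed: authorization denied
            return Decision.DO_NOT_RETRY;
        }
        // Anything unknown goes to a human instead of an endless retry loop.
        return Decision.NEEDS_REVIEW;
    }
}
```

The `NEEDS_REVIEW` default matters most: an unclassified failure should surface quickly rather than be retried for hours.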
16. Retry Cycle Design
Retry cycle should reflect dependency behavior.
Example BPMN extension:
<camunda:failedJobRetryTimeCycle>R5/PT10M</camunda:failedJobRetryTimeCycle>
Meaning: retry five times, ten minutes apart.
But do not apply one retry policy everywhere.
| Task | Suggested Policy | Reason |
|---|---|---|
| call internal idempotent service | R5/PT2M | likely transient |
| publish outbox signal | R10/PT1M | should recover quickly |
| call external regulator registry | R8/PT15M | external outage possible |
| validate immutable input | no async retry | deterministic failure |
| send non-critical notification | worker-level retry + DLQ | do not block legal process |
A retry policy is part of the process contract.
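To review retry policies, it is useful to see how long a failing job stays quiet before it becomes an incident. The sketch below parses the repeating form `R<count>/<duration>` and computes a rough window as count times interval, following the reading above of "five times, ten minutes apart". It handles only this one form; any other syntax the engine may accept is rejected. The class name is hypothetical.

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for retry cycles of the form R<count>/<ISO-8601 duration>.
final class RetryCycle {

    private static final Pattern REPEATING = Pattern.compile("^R(\\d+)/(P.+)$");

    record Parsed(int retries, Duration interval) {
        /** Rough time until retries are exhausted, assuming every wait lasts the full interval. */
        Duration approximateWindow() {
            return interval.multipliedBy(retries);
        }
    }

    static Parsed parse(String expression) {
        Matcher m = REPEATING.matcher(expression.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported retry cycle: " + expression);
        }
        return new Parsed(Integer.parseInt(m.group(1)), Duration.parse(m.group(2)));
    }
}
```

With the policies from the table, `R5/PT2M` gives a window of about ten minutes, while `R8/PT15M` keeps a failing job out of the incident list for about two hours. That delay should be a deliberate choice for each dependency.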
17. Incident Triage Matrix
| Incident Class | Signal | First Action | Recovery |
|---|---|---|---|
| Dependency outage | many incidents same activity | check dependency health | fix dependency, bulk retry |
| Deployment bug | incidents after release | inspect logs, missing bean/class | rollback/fix deploy, retry |
| Data bug | few incidents with bad variables | inspect variables/domain data | repair data, retry |
| Model bug | gateway expression failure | inspect BPMN version | deploy fixed version, migrate/modify |
| Lock contention | optimistic locking/deadlock | inspect DB metrics | tune concurrency, retry |
| Poison instance | same instance repeatedly fails | isolate case | manual repair or business cancellation |
| History cleanup | cleanup incident | inspect TTL/window/table bloat | adjust cleanup config, retry cleanup |
Good operations begin with classification.
18. Operator Runbook: Dependency Outage Incident Storm
Scenario: external registry service is down for 40 minutes. Hundreds of service task jobs fail and become incidents.
Runbook:
- Confirm incident spike by activity id and process definition version.
- Confirm dependency outage from service metrics/logs.
- Stop manual retries while dependency is still down.
- Confirm no irreversible partial side effects occurred.
- Restore dependency or switch to fallback.
- Run a small retry sample.
- If sample succeeds, bulk retry by process/activity/incident reason.
- Monitor job executor load and DB pressure.
- Verify process catch-up rate.
- Append operational incident report.
Bad recovery is clicking retry repeatedly without fixing root cause.
19. Operator Runbook: Bad Deployment
Scenario: new service version deploys without a delegate bean required by BPMN.
Symptoms:
- incidents start immediately after deployment
- same activity id
- exception mentions missing bean/class/delegate expression
Runbook:
- Freeze further process deployments.
- Identify affected process definition version.
- Confirm whether old running instances or only new instances are affected.
- Roll back app image or deploy hotfix with missing bean.
- Do not migrate instances yet.
- Retry a single failed job.
- Retry remaining jobs in controlled batches.
- Create postmortem action: deployment smoke test must instantiate delegate expressions.
20. Operator Runbook: Bad Process Variable
Scenario: a gateway expression expects supervisorDecision, but instance has approved.
Runbook:
- Query incidents by activity id and exception message.
- Sample variables from affected instances.
- Confirm variable migration rule.
- Write tested repair script/API operation.
- Dry-run affected instances.
- Set missing variables with audit event.
- Retry jobs.
- Add compatibility test to process suite.
The repair should be traceable. A random database update is not acceptable.
21. Do Not Repair Camunda Runtime Tables Directly
Direct updates to Camunda runtime tables are tempting under pressure.
Avoid them.
Use Camunda APIs for:
- setting variables
- retrying jobs
- migrating instances
- modifying process instances
- suspending/resuming
- deleting/canceling when appropriate
Direct table mutation can bypass caches, history, incident handlers, authorization, and engine invariants.
If a vendor/support-approved database operation is unavoidable, treat it as an emergency procedure with backup, approval, and post-repair consistency checks.
22. Suspension Strategy
Suspension can be useful during incidents or policy holds.
You may suspend:
- process definition
- process instance
- job definition
But suspension is not a generic pause button.
Questions:
- Are timers supposed to stop?
- Should users still complete active tasks?
- Should Kafka correlations be rejected or stored?
- Should SLA continue or pause?
- Is suspension legal/business approved?
- How will resumption be audited?
For regulatory cases, domain case hold and process suspension must be aligned. Suspending Camunda but leaving domain SLA active may create false breaches. Pausing SLA without suspending process may allow illegal progress.
23. History Cleanup and Retention
Camunda history tables can grow heavily in long-running platforms.
History is useful for:
- debugging
- audit support
- process analytics
- incident investigation
- migration verification
But unlimited history growth affects:
- storage
- index bloat
- query latency
- backup/restore time
- cleanup job pressure
Retention design must distinguish:
| Data | Source | Retention Logic |
|---|---|---|
| Process execution history | Camunda history | engine cleanup policy |
| Legal audit | domain audit tables | legal/regulatory retention |
| Case facts | domain tables | case retention policy |
| Operational logs | logging platform | observability retention |
| Kafka events | Kafka topic retention/compaction | event contract policy |
Do not assume Camunda history is your legal audit store.
24. History Time To Live Discipline
Each deployed process should have a history time-to-live policy.
Questions:
- How long after process completion is engine history needed?
- Which audit facts are preserved elsewhere?
- Does deletion affect legal defensibility?
- Are batch operations also retained appropriately?
- Is cleanup window scheduled during low load?
- Is cleanup failure monitored?
Production checklist:
- set TTL intentionally
- avoid null TTL in regulated systems unless explicitly justified
- configure cleanup batch window
- monitor cleanup duration
- monitor history table growth
- test cleanup in staging with realistic volume
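The first two checklist items can be verified before deployment by inspecting the BPMN file itself. The sketch below lists every `bpmn:process` element that has no `camunda:historyTimeToLive` attribute. The class name is hypothetical, and the check only looks at the XML; a TTL configured elsewhere (for example through engine configuration) would not be seen by it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical pre-deployment check for missing history time-to-live metadata.
final class HistoryTtlCheck {

    static final String BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL";
    static final String CAMUNDA_NS = "http://camunda.org/schema/1.0/bpmn";

    /** Returns the ids of all process elements without a camunda:historyTimeToLive attribute. */
    static List<String> processesWithoutTtl(String bpmnXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bpmnXml.getBytes(StandardCharsets.UTF_8)));

            List<String> missing = new ArrayList<>();
            NodeList processes = doc.getElementsByTagNameNS(BPMN_NS, "process");
            for (int i = 0; i < processes.getLength(); i++) {
                Element process = (Element) processes.item(i);
                String ttl = process.getAttributeNS(CAMUNDA_NS, "historyTimeToLive");
                if (ttl.isBlank()) { // getAttributeNS returns "" when the attribute is absent
                    missing.add(process.getAttribute("id"));
                }
            }
            return missing;
        } catch (Exception e) {
            throw new IllegalArgumentException("BPMN XML could not be parsed", e);
        }
    }
}
```

A process test can assert that the returned list is empty for every BPMN resource in the release.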
25. Process Instance Modification
Modification changes the token state of a running process instance.
It is powerful and dangerous.
Use cases:
- skip a broken automated step after side effect already happened
- move token from obsolete task to replacement task
- cancel a stuck path
- re-enter a failed subprocess
Risks:
- bypasses domain validation
- bypasses task completion audit
- causes inconsistent projections
- invalidates SLA assumptions
- surprises downstream systems
Rules:
- Prefer normal business commands.
- Prefer migration over ad-hoc modification when changing versions.
- Use modification only with operator approval.
- Append domain audit event.
- Reconcile work queue and SLA after modification.
- Document reason and before/after token state.
26. Process Cancellation and Restart
Cancellation is a business action, not just an engine delete.
For an enforcement case, cancellation may mean:
- duplicate complaint
- outside jurisdiction
- withdrawn complaint
- legal invalidity
- merged case
- administrative error
The domain case should move to a meaningful terminal or merged state. Camunda cancellation should follow domain decision.
Restart is equally sensitive.
Questions before restart:
- Are prior side effects idempotent?
- Will duplicate notifications be sent?
- Will Kafka events be republished?
- Will tasks be duplicated?
- Is old audit preserved?
- Does new process use same business key?
A restart is not a rollback. It is a new execution attempt with history.
27. Version-Aware Code
Your Java delegates/workers must be version-aware without becoming messy.
Avoid this:
if (processVersion == 7) {
    // old behavior
} else if (processVersion == 8) {
    // new behavior
} else if (processVersion == 9) {
    // another behavior
}
Prefer stable command handlers and versioned adapters.
The delegate reads process variables and converts them into a domain command. If variable contract changes, version the adapter, not the domain service.
Example:
interface SupervisorApprovalVariableAdapter {
ApproveRecommendationCommand toCommand(DelegateExecution execution);
}
final class SupervisorApprovalV12Adapter implements SupervisorApprovalVariableAdapter { }
final class SupervisorApprovalV13Adapter implements SupervisorApprovalVariableAdapter { }
Keep version complexity near the boundary.
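A minimal sketch of how adapter selection might look, using a plain `Map` in place of `DelegateExecution` so the example runs without the engine. The registry is keyed by the first definition version that uses each variable contract; the version numbers, the `caseId` variable, and the class names are assumptions for illustration.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical registry: picks the variable adapter by process definition version.
final class SupervisorApprovalAdapters {

    record ApproveRecommendationCommand(String caseId, String decision) {}

    interface VariableAdapter {
        ApproveRecommendationCommand toCommand(Map<String, Object> variables);
    }

    // Key: first definition version that uses the contract (assumed values).
    private static final NavigableMap<Integer, VariableAdapter> BY_FIRST_VERSION = new TreeMap<>();

    static {
        // Old contract: Boolean variable "approved".
        BY_FIRST_VERSION.put(1, vars -> new ApproveRecommendationCommand(
                (String) vars.get("caseId"),
                Boolean.TRUE.equals(vars.get("approved")) ? "APPROVED" : "RETURNED"));
        // New contract from version 13: String variable "supervisorDecision".
        BY_FIRST_VERSION.put(13, vars -> new ApproveRecommendationCommand(
                (String) vars.get("caseId"),
                (String) vars.get("supervisorDecision")));
    }

    /** Returns the adapter for the newest contract that is not newer than the given version. */
    static VariableAdapter forVersion(int definitionVersion) {
        Map.Entry<Integer, VariableAdapter> entry = BY_FIRST_VERSION.floorEntry(definitionVersion);
        if (entry == null) {
            throw new IllegalArgumentException("No adapter for version " + definitionVersion);
        }
        return entry.getValue();
    }
}
```

The domain service only ever sees `ApproveRecommendationCommand`. Adding a new variable contract means one more registry entry instead of another branch in every delegate.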
28. Process Testing Before Deployment
Every BPMN version should pass tests before deployment.
Test categories:
| Test | Purpose |
|---|---|
| parse/deploy test | BPMN is deployable |
| delegate wiring test | delegate expressions resolve |
| happy path test | main lifecycle completes |
| gateway matrix test | each condition path tested |
| timer test | timer path behaves as expected |
| message correlation test | external event wakes process |
| incident test | technical error creates retry/incident |
| migration dry-run test | old version maps to new version |
| variable compatibility test | old variables still route safely |
| history TTL test | process has cleanup metadata |
A workflow release without process tests is a blind release.
29. Release Choreography with BPMN, Java, DB, and Kafka
Process changes often require coordinated release.
Example: add legal review step after supervisor approval for high-risk cases.
Changes:
- BPMN adds legalReview user task
- Java adds command handler for legal review completion
- OpenAPI adds legal review endpoint
- DB adds legal review decision table
- work queue projection supports new task type
- authorization adds legal reviewer permission
- SLA policy adds legal review deadline
- Kafka emits LegalReviewRequested
Safe sequence:
Do not deploy BPMN that calls Java code not deployed yet.
Do not remove DB columns used by old process versions.
Do not remove Kafka event handlers used by old instances.
30. Blue-Green and Canary for Process Releases
For application code, blue-green/canary is common.
For process definitions, canary means controlling which new process instances use the new definition.
Strategies:
- Start only internal test cases on new process version.
- Route low-risk jurisdiction to new version.
- Route small percentage of new cases.
- Keep old version for existing cases.
- Monitor incident rate and task completion rate.
- Gradually increase routing.
You need a process start policy:
create table process_ops.process_start_policy (
  policy_id uuid primary key,
  process_key text not null,
  target_definition_version integer not null,
  jurisdiction_code text,
  risk_band text,
  percentage integer not null default 100,
  enabled boolean not null,
  created_at timestamptz not null
);
Starting a process by key always selects the latest deployed version, which may be too blunt for production.
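A start policy like the one in the table could be evaluated as follows. This is a sketch under assumptions: the first matching enabled policy wins, a null jurisdiction or risk band matches anything, and the percentage is applied through a stable bucket derived from the business key so the same case always routes the same way. The record mirrors some of the table columns; the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical evaluator for process start policies (canary routing of new instances).
final class ProcessStartRouting {

    record StartPolicy(int targetDefinitionVersion, String jurisdictionCode,
                       String riskBand, int percentage, boolean enabled) {}

    /** Stable bucket in the range 0..99 derived from the business key. */
    static int bucket(String businessKey) {
        return Math.floorMod(businessKey.hashCode(), 100);
    }

    /** Returns the version of the first matching policy, or the default version if none matches. */
    static int chooseVersion(List<StartPolicy> policies, int defaultVersion,
                             String businessKey, String jurisdictionCode, String riskBand) {
        for (StartPolicy p : policies) {
            boolean matches = p.enabled()
                    && (p.jurisdictionCode() == null || p.jurisdictionCode().equals(jurisdictionCode))
                    && (p.riskBand() == null || p.riskBand().equals(riskBand))
                    && bucket(businessKey) < p.percentage();
            if (matches) {
                return p.targetDefinitionVersion();
            }
        }
        return defaultVersion;
    }
}
```

The chosen version can then be resolved to a definition id (for example through a process definition query filtered by key and version) and started with `startProcessInstanceById` instead of `startProcessInstanceByKey`.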
31. Monitoring Camunda Operations
Monitor at least:
Engine health
- job executor active
- acquired jobs per minute
- failed jobs
- incidents by process/activity
- job backlog
- due timers count
- deployment count
Process health
- active instances by version
- instances by current activity
- task age distribution
- SLA warning/breach count
- stuck wait states
- message correlation failures
Database health
- Camunda DB CPU/IO
- lock waits
- slow queries
- history table growth
- index bloat
- connection pool saturation
Release health
- incidents after deployment
- new version adoption
- migration success/failure
- old version drain rate
- rollback/hotfix count
Operator health
- incident time to acknowledge
- incident time to resolve
- repeated incident classes
- manual modifications count
- unauthorized repair attempts
32. Incident Dashboards
Group incidents by operational meaning.
Useful dashboard dimensions:
- process definition key
- process definition version
- activity id
- incident type
- exception class/message fingerprint
- tenant/jurisdiction
- first occurrence
- latest occurrence
- affected case count
- retry count
- deployment version
A dashboard showing only total incident count is almost useless.
A dashboard showing “83 incidents at registryValidationTask after deployment case-api:2026.07.03-2” is actionable.
33. Incident Fingerprinting
Store incident fingerprints to support grouping.
create table process_ops.incident_fingerprint (
  fingerprint_id uuid primary key,
  process_key text not null,
  process_version integer not null,
  activity_id text not null,
  exception_class text,
  message_hash text not null,
  first_seen_at timestamptz not null,
  last_seen_at timestamptz not null,
  occurrence_count bigint not null,
  status text not null,
  owner_team text
);
Fingerprint should ignore volatile values like UUIDs and timestamps.
This helps answer:
- Is this a known problem?
- Did it start after a release?
- Which team owns it?
- Is retry safe?
- Was there a previous runbook?
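A fingerprint that ignores volatile values can be sketched as a normalization step followed by a hash. The placeholder tokens, the regular expressions, and the choice of SHA-256 are assumptions for illustration; real exception messages usually need a few more normalization rules.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical fingerprint helper: masks volatile values, then hashes the result.
final class IncidentFingerprint {

    /** Replaces UUIDs, ISO timestamps, and remaining digit runs with fixed placeholders. */
    static String normalize(String message) {
        return message
                .replaceAll("[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", "<uuid>")
                .replaceAll("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?(Z|[+-]\\d{2}:\\d{2})?", "<ts>")
                .replaceAll("\\d+", "<n>")
                .trim();
    }

    /** Hex-encoded SHA-256 over activity id, exception class, and the normalized message. */
    static String messageHash(String activityId, String exceptionClass, String message) {
        String input = activityId + "|" + exceptionClass + "|" + normalize(message);
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Two timeouts at the same activity that differ only in case id, timestamp, and elapsed milliseconds produce the same `message_hash`, so they increment one fingerprint row instead of creating two.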
34. Bulk Retry Safety
Bulk retry is powerful.
Before bulk retry, verify:
- root cause is fixed
- operation is idempotent
- downstream dependency can handle catch-up load
- Kafka/outbox side effects are duplicate-safe
- DB locks will not spike
- job executor thread count is appropriate
- retry batch can be stopped
- monitoring is active
Use staged retry:
- retry 1 instance
- retry 10 instances
- retry 100 instances
- retry remaining in batches
Never bulk retry a poison incident class blindly.
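The staged sequence above can be planned up front so that each stage is a reviewable unit. The sketch below splits a list of failed job ids into warm-up stages of 1, 10, and 100, followed by steady batches of a chosen size. The class name is hypothetical; executing a stage (for example by resetting retries through `ManagementService.setJobRetries`) and checking the result before the next stage is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: produces retry stages, it does not execute them.
final class StagedRetryPlan {

    private static final int[] WARM_UP_SIZES = {1, 10, 100};

    /** Splits job ids into stages of 1, 10, 100, then batches of steadyBatchSize. */
    static List<List<String>> stages(List<String> jobIds, int steadyBatchSize) {
        if (steadyBatchSize <= 0) {
            throw new IllegalArgumentException("steadyBatchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        int from = 0;
        for (int size : WARM_UP_SIZES) {
            if (from >= jobIds.size()) {
                return result;
            }
            int to = Math.min(from + size, jobIds.size());
            result.add(List.copyOf(jobIds.subList(from, to)));
            from = to;
        }
        while (from < jobIds.size()) {
            int to = Math.min(from + steadyBatchSize, jobIds.size());
            result.add(List.copyOf(jobIds.subList(from, to)));
            from = to;
        }
        return result;
    }
}
```

For 300 failed jobs and a steady batch size of 100, the plan has five stages of 1, 10, 100, 100, and 89 jobs. An operator can stop after any stage if the incident rate does not drop.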
35. Migration and Incident Interaction
Do not migrate incidented instances casually.
If an instance has a failed service task incident, migrating it can:
- move it away from the failing activity
- hide the root cause
- break compensation assumptions
- leave domain side effects half-applied
- make retry impossible or confusing
Recommended policy:
| Incident Status | Migration Policy |
|---|---|
| no incident | eligible if mapping valid |
| transient dependency incident | fix and retry before migration |
| data bug incident | repair data then retry/migrate |
| model bug incident | may need migration/modification |
| unknown incident | exclude from migration |
Migration should reduce risk, not bury evidence.
36. Audit for Migration and Ops
Every operational intervention should be auditable.
Events:
| Event | Meaning |
|---|---|
| PROCESS_DEFINITION_DEPLOYED | new BPMN version deployed |
| PROCESS_INSTANCE_MIGRATION_DRY_RUN | dry-run generated |
| PROCESS_INSTANCE_MIGRATION_APPROVED | human approved migration |
| PROCESS_INSTANCE_MIGRATED | instance migrated |
| PROCESS_INSTANCE_MODIFIED | token/variable modified |
| JOB_RETRIED | operator retried failed job |
| INCIDENT_CLASSIFIED | incident assigned class/owner |
| INCIDENT_RESOLVED | root cause and repair recorded |
| PROCESS_INSTANCE_SUSPENDED | instance suspended |
| PROCESS_INSTANCE_RESUMED | instance resumed |
Operational audit should include:
- actor
- reason
- affected instances
- old definition/version
- new definition/version
- activity mappings
- variable changes
- approval reference
- timestamp
- correlation id
37. Camunda 7 Lifecycle Risk
Camunda 7 is still encountered widely in production, but new platform design must acknowledge lifecycle risk.
Practical implications:
- Wrap Camunda-specific APIs behind internal gateways.
- Keep BPMN model portable where possible.
- Avoid spreading Camunda variable assumptions across domain services.
- Keep task projection and SLA outside engine internals.
- Keep domain audit independent from Camunda history.
- Document process semantics separately from Camunda implementation.
- Prepare future migration strategy without prematurely rebuilding everything.
The right stance is not panic. It is containment.
Treat Camunda 7 as a powerful stateful engine behind a boundary.
38. Operational Database Discipline
Camunda 7 uses a relational database heavily. Operational behavior depends on database health.
Watch for:
- slow history queries
- job acquisition contention
- high lock wait
- large runtime tables
- huge historic variable tables
- long-running transactions
- missing cleanup windows
- excessive serialized variables
- oversized process variables
General rules:
- Keep variables small.
- Avoid large object serialization in process variables.
- Use domain DB for domain facts.
- Tune history level intentionally.
- Configure cleanup.
- Separate reporting workloads from engine runtime queries when needed.
- Do not let ad-hoc Cockpit queries become production reporting APIs.
39. End-to-End Failure Drill
Drill: bad BPMN release introduces gateway expression error.
Setup
- deploy process v15
- new gateway uses ${legalReviewRequired}
- variable missing for some migrated instances
Expected system behavior
- Process tests should catch missing variable path before deployment.
- If missed, incidents appear at gateway activity.
- Incident fingerprint groups failures.
- Alert routes to workflow platform owner.
- Operator freezes migration.
- Dry-run identifies affected instances.
- Variable repair script is prepared and tested.
- Variables are set through API/RuntimeService with audit.
- Jobs retried in small batch.
- Process version test suite updated.
- Postmortem updates release checklist.
This is the production loop: detect, classify, repair, audit, prevent recurrence.
40. Production Checklist
Before releasing BPMN changes:
- BPMN element IDs are stable and meaningful.
- Process version release note exists.
- Semantic change type is classified.
- Java delegates/workers are deployed before BPMN uses them.
- DB expand migration is deployed before process needs new tables/columns.
- OpenAPI/API endpoints exist for new human tasks.
- Authorization rules exist for new task commands.
- Work queue projection supports new task type.
- SLA policy supports new path.
- Kafka event contracts are compatible.
- Process tests pass.
- Migration dry-run exists if needed.
- Active instances are classified by current activity.
- Incidents are excluded or handled deliberately.
- Rollback/forward-fix plan exists.
- Monitoring dashboard is ready.
- Operator runbook is updated.
- History TTL/cleanup implications are reviewed.
Before resolving incidents:
- Incident class is identified.
- Root cause is fixed or isolated.
- Retry safety is confirmed.
- Side effects are idempotent.
- Small retry sample succeeds.
- Bulk retry is throttled.
- Audit event is recorded.
- Dashboard confirms recovery.
Before modifying/migrating process instances:
- Business owner approval exists.
- Candidate instances are listed.
- Exclusion criteria are explicit.
- Activity mapping is reviewed.
- Variable transformation is tested.
- Work queue/SLA projection repair is planned.
- Audit event captures before/after.
- Reconciliation runs after execution.
41. Anti-Patterns
Anti-pattern 1 — always migrate everything to latest
Consequence:
- unnecessary risk
- legal behavior changes for active cases
- hidden data incompatibility
Better:
- new instances use new version
- old instances finish old version unless migration is justified
Anti-pattern 2 — random BPMN IDs
Consequence:
- migration mapping painful
- incident triage unreadable
- tests brittle
Better:
- stable semantic IDs
Anti-pattern 3 — process variables as hidden database
Consequence:
- huge runtime/history tables
- poor reporting
- fragile migration
Better:
- variables for routing, domain DB for facts
Anti-pattern 4 — retry everything
Consequence:
- repeated side effects
- dependency overload
- hidden bugs
Better:
- classify retryability
- fix root cause
- retry safely
Anti-pattern 5 — direct runtime table repair
Consequence:
- corrupted engine state
- bypassed history/incident logic
- unsupported recovery
Better:
- use engine APIs and audited repair operations
Anti-pattern 6 — Camunda history as legal audit
Consequence:
- cleanup conflicts with retention
- audit semantics tied to engine internals
Better:
- separate domain audit store
42. Final Mental Model
Camunda 7 operations require three separations:
Do not confuse them.
- A new process definition is not a migrated instance.
- A migrated instance is not a domain decision.
- A retried job is not a fixed root cause.
- A resolved incident is not erased history.
- A cleaned history table is not a deleted legal audit.
The production-grade stance is simple:
Treat BPMN as executable code, process instances as durable state, incidents as operational signals, and every operator intervention as an auditable business event.
That mindset is what lets a workflow platform survive real regulatory work.