Observability, Operations, and Supportability

Learn Java Large Scale ERP - Part 030

Observability, operations, supportability, incident response, business telemetry, and production control-plane design for large-scale ERP systems built with Java.

Part 030 — Observability, Operations, and Supportability

Core idea: A large-scale ERP is not production-ready when it works on the happy path. It is production-ready when engineers and support teams can detect, explain, contain, repair, and prove what happened when business processes fail.

ERP operations are different from normal web application operations. A failed page load is annoying. A failed posting batch, duplicated payment, stuck approval, missed shipment, broken period close, or incorrect report can affect financial statements, customer commitments, inventory availability, and audit evidence.

Observability in ERP must cover both:

  1. technical system health — latency, errors, CPU, memory, DB, queues, traces;
  2. business process health — stuck workflows, unreconciled ledgers, failed postings, duplicate attempts, aging exceptions, projection lag, settlement mismatches.

This part focuses on building a supportable ERP: one where failure is visible, diagnosable, recoverable, and defensible.


1. Kaufman Skill Deconstruction

To become effective at ERP observability and operations, decompose the skill into these sub-skills:

| Sub-skill | What top engineers can do |
| --- | --- |
| Signal design | Decide which logs, metrics, traces, events, and business counters matter. |
| Business telemetry | Measure process health, not only server health. |
| Correlation modelling | Trace a document, user action, batch run, and integration message across components. |
| Failure classification | Distinguish validation error, transient failure, control rejection, data corruption, and unknown outcome. |
| Operational control | Provide safe retry, replay, cancel, reverse, reassign, and reconcile tools. |
| Support workflow | Convert technical failure into support cases with owner, SLA, evidence, and next action. |
| Incident response | Detect blast radius, freeze risky operations, communicate, remediate, and produce postmortem. |
| Audit-aware repair | Fix production data through controlled, evidenced, reversible procedures. |
| Capacity management | Understand batch windows, month-end spikes, report bursts, and queue backlogs. |
| Production readiness | Define SLOs, dashboards, runbooks, alerts, and operational tests before launch. |

The goal is not to collect every possible signal. The goal is to make important business failures visible early and explainable quickly.


2. Observability vs Monitoring

Monitoring asks: is the system behaving as expected?

Observability asks: can we understand why the system is behaving this way from external signals?

ERP needs both.

2.1 Technical monitoring examples

| Signal | Example |
| --- | --- |
| Latency | POST /vendor-invoices p95 latency. |
| Error rate | payment API 5xx count. |
| Saturation | database connection pool usage. |
| Queue depth | outbound invoice event backlog. |
| JVM health | heap, GC pause, thread count. |
| Database health | lock wait, deadlocks, slow queries. |

2.2 Business observability examples

| Signal | Example |
| --- | --- |
| Stuck workflow | POs waiting approval longer than SLA. |
| Failed posting | accounting events in FAILED_RETRYABLE or FAILED_TERMINAL. |
| Reconciliation gap | AP subledger does not match GL control account. |
| Duplicate attempt | duplicate idempotency key rejected. |
| Projection lag | AR aging read model behind event stream by 25 minutes. |
| Period close blocker | unposted documents in closing period. |
| Integration exception | bank statement lines unmatched after import. |
| Manual override | emergency approval/posting actions in last 24 hours. |
| Inventory anomaly | negative stock attempt blocked or allowed by policy. |
| Batch backlog | MRP or posting batch exceeds processing window. |

A large ERP can have green servers and still be failing the business.


3. ERP Observability Model

ERP observability should connect technical execution with business lifecycle.

The key design principle:

Every important business command should leave a trail that connects request, decision, state transition, side effect, integration, report projection, and audit evidence.


4. Correlation IDs and Business Identifiers

Technical trace IDs are necessary but not enough. ERP support teams often search by business identifiers.

4.1 Identifier taxonomy

| Identifier | Purpose | Example |
| --- | --- | --- |
| Trace ID | distributed technical request tracing. | 4bf92f3577b34da6a3ce929d0e0e4736 |
| Span ID | operation-level trace segment. | DB call, HTTP call, message publish. |
| Correlation ID | logical flow across async boundaries. | corr-p2p-20260701-00091 |
| Causation ID | event/message that caused this action. | evt-goods-receipt-posted-001 |
| Command ID | user/batch/integration command identity. | cmd-approve-po-001 |
| Idempotency key | duplicate prevention key. | bank-confirmation:TXN9988 |
| Document ID | immutable technical document identity. | po_01J... |
| Document number | business/legal reference. | PO-2026-000128 |
| Batch run ID | operation batch identity. | posting-run-20260701-01 |
| Tenant/company | scope and isolation. | tenant=A, company=ID01 |

4.2 Logging context

Every meaningful log line should include enough context to reconstruct the flow.

import org.slf4j.MDC;

// Each putCloseable entry is removed from the MDC when the try block exits,
// so context does not leak into the next request handled by this thread.
try (MDC.MDCCloseable ignored1 = MDC.putCloseable("tenantId", tenantId);
     MDC.MDCCloseable ignored2 = MDC.putCloseable("companyCode", companyCode);
     MDC.MDCCloseable ignored3 = MDC.putCloseable("correlationId", correlationId);
     MDC.MDCCloseable ignored4 = MDC.putCloseable("documentId", documentId.toString())) {

    purchaseOrderService.approve(command);
}

For asynchronous processing, correlation context must be propagated explicitly through message headers.

headers:
  traceparent: 00-...
  correlation-id: corr-p2p-20260701-00091
  causation-id: evt-po-approved-000128
  tenant-id: tenant-a
  company-code: ID01
  source-document-id: po_01J...
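
On the consumer side, one way to apply this is to copy the flow-level headers onto every message emitted while handling another message, and to set the causation ID to the message being handled. The following is a minimal sketch under assumptions: the class `CorrelationHeaders` and the method `forChildMessage` are hypothetical names, headers are modelled as a plain `Map`, and `traceparent` is deliberately left out because a tracing library would normally rewrite it for each span.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds headers for a message emitted while handling another message. */
final class CorrelationHeaders {
    // Headers that describe the whole flow and are copied unchanged (assumed set).
    private static final List<String> FLOW_HEADERS =
            List.of("correlation-id", "tenant-id", "company-code", "source-document-id");

    private CorrelationHeaders() {}

    static Map<String, String> forChildMessage(Map<String, String> incoming, String handledEventId) {
        Map<String, String> outgoing = new LinkedHashMap<>();
        for (String name : FLOW_HEADERS) {
            String value = incoming.get(name);
            if (value != null) {
                outgoing.put(name, value);
            }
        }
        // The message being handled is the direct cause of whatever is emitted next.
        outgoing.put("causation-id", handledEventId);
        return outgoing;
    }
}
```

A consumer handling `evt-po-approved-000128` would therefore emit messages with the same `correlation-id` and with `causation-id` set to `evt-po-approved-000128`, so the chain of causes can be walked backwards one message at a time.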

5. Structured Logging for ERP

Free-text logs are useful for humans, but structured logs are necessary for reliable querying, alerting, and investigation.

5.1 ERP log event fields

| Field | Example |
| --- | --- |
| timestamp | 2026-07-01T09:15:31.123+07:00 |
| level | INFO, WARN, ERROR |
| service | erp-purchasing-service |
| environment | prod-id |
| tenantId | tenant-a |
| companyCode | ID01 |
| branchCode | JKT |
| userId | buyer-01 |
| action | PURCHASE_ORDER_APPROVE |
| documentType | PURCHASE_ORDER |
| documentId | po_01J... |
| documentNumber | PO-2026-000128 |
| lifecycleState | SUBMITTED → APPROVED |
| correlationId | corr-p2p-20260701-00091 |
| outcome | SUCCESS, REJECTED_BY_CONTROL, FAILED_RETRYABLE |
| reasonCode | MAKER_CHECKER_VIOLATION, PERIOD_CLOSED |
| errorClass | exception class or domain error code. |

5.2 What not to log

ERP logs often become privacy and security risks.

Do not log:

  • passwords, tokens, API keys;
  • full bank account numbers;
  • full tax IDs if not required;
  • unnecessary personal data;
  • raw payment card data;
  • full document attachments;
  • complete salary or sensitive HR payloads;
  • unmasked customer identity in high-volume traces;
  • secrets embedded in integration payloads.

Log identifiers and evidence references, not uncontrolled sensitive payloads.
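
When a sensitive value has to appear in a log for lookup purposes, a common approach is to log only a masked form. The sketch below is illustrative only: `SensitiveValueMasker` and `maskAllButLast` are hypothetical names, and how many characters may stay visible is an assumption that a real system would take from its privacy policy.

```java
/** Masks sensitive values before they are written to logs. */
final class SensitiveValueMasker {
    private SensitiveValueMasker() {}

    /** Keeps only the last visibleChars characters; short values are masked completely. */
    static String maskAllButLast(String value, int visibleChars) {
        if (value == null || value.isEmpty()) {
            return "";
        }
        if (visibleChars <= 0 || value.length() <= visibleChars) {
            // Too short to reveal anything safely.
            return "*".repeat(value.length());
        }
        int maskedLength = value.length() - visibleChars;
        return "*".repeat(maskedLength) + value.substring(maskedLength);
    }
}
```

With this helper, `maskAllButLast("1234567890", 4)` yields `******7890`, which is usually enough for support to confirm they are looking at the right account without exposing the full number.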


6. Metrics: Technical and Business

Metrics should be low-cardinality, aggregatable, and actionable.

6.1 Technical metric examples

| Metric | Type | Meaning |
| --- | --- | --- |
| http.server.requests | timer | endpoint latency and error count. |
| jdbc.connections.active | gauge | DB connection pool pressure. |
| erp.outbox.pending.count | gauge | messages waiting to publish. |
| erp.consumer.processing.duration | timer | consumer processing latency. |
| erp.batch.run.duration | timer | batch execution time. |
| erp.db.deadlocks.count | counter | database deadlocks. |
| erp.lock.wait.duration | timer | lock wait time on critical resources. |

6.2 Business metric examples

| Metric | Type | Meaning |
| --- | --- | --- |
| erp.workflow.stuck.count | gauge | workflows beyond SLA. |
| erp.posting.failed.count | counter | failed accounting postings. |
| erp.posting.pending.count | gauge | accounting events not posted yet. |
| erp.reconciliation.gap.count | gauge | open reconciliation differences. |
| erp.payment.duplicate_ignored.count | counter | duplicate payment confirmations suppressed. |
| erp.invoice.matching_exception.count | gauge | unresolved invoice matching issues. |
| erp.period_close.blocker.count | gauge | blockers for close process. |
| erp.report.projection_lag.seconds | gauge | read model freshness lag. |
| erp.integration.exception.count | counter | integration exceptions by partner/type. |
| erp.override.emergency.count | counter | emergency control overrides. |

6.3 Metric cardinality rule

Do not put high-cardinality values into metric labels.

Bad:

erp.posting.failed.count{documentNumber="INV-2026-001929"}

Better:

erp.posting.failed.count{company="ID01", documentType="SALES_INVOICE", reason="PERIOD_CLOSED"}

Use logs/traces for document-level details. Use metrics for aggregate health.
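
The effect of the rule can be illustrated with a small counter keyed only by bounded labels. This is a sketch using the JDK alone, not a replacement for a metrics library; `FailedPostingCounter`, `Labels`, and the method names are hypothetical. The point is that the label type has no field for a document number, so the number of time series is bounded by company × document type × reason.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Counts failed postings per low-cardinality label set. */
final class FailedPostingCounter {
    /** Bounded values only; there is intentionally no document number or user ID here. */
    record Labels(String company, String documentType, String reason) {}

    private final Map<Labels, LongAdder> counts = new ConcurrentHashMap<>();

    void recordFailure(String company, String documentType, String reason) {
        counts.computeIfAbsent(new Labels(company, documentType, reason), key -> new LongAdder())
              .increment();
    }

    long count(String company, String documentType, String reason) {
        LongAdder adder = counts.get(new Labels(company, documentType, reason));
        return adder == null ? 0 : adder.sum();
    }

    /** Number of distinct time series this counter has created. */
    int seriesCount() {
        return counts.size();
    }
}
```

Three failed sales invoices rejected for `PERIOD_CLOSED` in company `ID01` produce one series with value 3, while the individual document numbers go to the structured log.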


7. Tracing for ERP Workflows

Distributed tracing is useful for ERP, but only if spans map to meaningful operations.

7.1 Trace shape for purchase order approval

A useful trace should show:

  • command handling;
  • authorization check;
  • workflow step;
  • database transaction;
  • audit event write;
  • outbox write;
  • downstream consumer processing.

7.2 Span naming

Bad span names:

process
handle
run
execute

Better span names:

PurchaseOrder.Approve
Authorization.EvaluateApprovalAuthority
Workflow.CompleteTask
Audit.WriteBusinessEvent
Outbox.EnqueueBusinessEvent

Span names should help explain the business operation.


8. Business Health Dashboards

ERP dashboards should not only show infrastructure.

8.1 Dashboard categories

| Dashboard | Users | Purpose |
| --- | --- | --- |
| Executive process health | business/platform leadership | high-level flow health and risk. |
| Finance operations | finance controllers/accounting ops | posting, reconciliation, close blockers. |
| Supply chain operations | warehouse/planning ops | stock anomalies, MRP status, shipment exceptions. |
| Integration operations | platform/integration team | queues, partner errors, retries, duplicate suppression. |
| Workflow operations | process owners/support | stuck approvals, SLA breaches, delegation issues. |
| Support desk | L1/L2 support | document lookup, user impact, known errors, next action. |
| Engineering SRE | engineers | latency, errors, saturation, DB locks, JVM, traces. |

8.2 Finance operations dashboard

| Widget | Question answered |
| --- | --- |
| Failed postings by reason | What is preventing accounting events from posting? |
| Pending postings by age | Is financial truth lagging behind operations? |
| Subledger-GL reconciliation gaps | Where do balances disagree? |
| Period close blockers | What must be resolved before close? |
| Manual journal overrides | Which risky corrections happened recently? |
| Legal numbering exceptions | Are there voids, retries, or reservation failures? |
| Payment settlement mismatches | Which bank confirmations are unresolved? |

8.3 Workflow dashboard

| Widget | Question answered |
| --- | --- |
| Tasks beyond SLA | Which approvals are stuck? |
| Approval queue by role | Which role is overloaded? |
| Delegated approvals | Which decisions used delegation? |
| Rejections by control | Which requests violate policy? |
| Escalations by process | Where is the process design failing? |
| Reassignment count | Which teams are routing work manually? |

9. Alert Design

ERP alerts should be actionable. A bad alert says “something failed.” A good alert says “business risk exists, here is owner and next action.”

9.1 Alert severity

| Severity | Meaning | Example |
| --- | --- | --- |
| SEV1 | Business-critical, broad impact, financial/legal risk. | Posting pipeline down during close; payment duplicate risk. |
| SEV2 | Important process degraded or blocked. | WMS integration backlog blocks shipping. |
| SEV3 | Localized issue with workaround. | One vendor invoice stuck due to validation error. |
| SEV4 | Informational or trend. | Approval SLA slowly worsening. |

9.2 Alert examples

alert: ERPPostingPipelineBacklog
condition: erp.posting.pending.count{company="ID01"} > 1000 for 15m
severity: SEV2
owner: finance-platform-oncall
businessImpact: "Operational documents are not reflected in financial ledger."
runbook: "posting-pipeline-backlog.md"

alert: APSubledgerGLReconciliationGap
condition: erp.reconciliation.gap.amount{ledger="AP", company="ID01"} != 0 for 30m
severity: SEV1
owner: finance-control-oncall
businessImpact: "AP control account does not reconcile to AP subledger."
runbook: "ap-gl-reconciliation-gap.md"
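
The `for 15m` and `for 30m` clauses in these conditions mean the alert fires only when the breach is sustained, which keeps short spikes from paging anyone. A minimal sketch of that evaluation follows; the class `SustainedThresholdAlert` is hypothetical, it assumes samples arrive in time order, and a real alerting system would provide this behaviour itself.

```java
import java.time.Duration;
import java.time.Instant;

/** Fires only when a value stays above a threshold for a minimum duration. */
final class SustainedThresholdAlert {
    private final double threshold;
    private final Duration holdFor;
    private Instant breachedSince; // null while the value is at or below the threshold

    SustainedThresholdAlert(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Feeds one sample; returns true once the breach has lasted at least holdFor. */
    boolean observe(Instant at, double value) {
        if (value <= threshold) {
            breachedSince = null; // recovery resets the clock
            return false;
        }
        if (breachedSince == null) {
            breachedSince = at;
        }
        return Duration.between(breachedSince, at).compareTo(holdFor) >= 0;
    }
}
```

For the posting backlog alert, this would be constructed with a threshold of 1000 and a hold time of 15 minutes; a backlog that rises above 1000 and drains again within ten minutes never fires.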

9.3 Alert anti-patterns

| Anti-pattern | Problem |
| --- | --- |
| Alert on every exception | Creates noise and alert fatigue. |
| No business context | On-call cannot judge impact. |
| No owner | Nobody acts. |
| No runbook | Investigation starts from scratch. |
| Threshold copied from another system | Does not reflect ERP workload. |
| High-cardinality alert labels | Monitoring system becomes expensive/noisy. |
| Alert without suppression policy | Known maintenance or batch windows page people unnecessarily. |

10. Runbooks

Runbooks convert observability into action.

10.1 ERP runbook template

# Runbook: Posting Pipeline Backlog

## Symptoms
- `erp.posting.pending.count` rising for company X.
- Accounting event consumers lagging.
- Finance reports show projection lag.

## Business Impact
- Operational documents may not be reflected in GL.
- Close process may be delayed.

## First Checks
1. Check consumer error rate.
2. Check DB lock wait and deadlocks.
3. Check outbox publish lag.
4. Check failed accounting events by reason code.
5. Check whether period is closed or config changed.

## Safe Actions
- Pause non-critical batch reports.
- Increase consumer workers if DB is not saturated.
- Retry events in `FAILED_RETRYABLE` state.
- Route `FAILED_TERMINAL` to finance exception queue.

## Unsafe Actions
- Do not manually mark accounting events as posted.
- Do not delete outbox records.
- Do not reopen closed period without finance approval.

## Escalation
- Finance platform on-call.
- Finance controller.
- Database on-call if lock wait > threshold.

## Evidence to Capture
- Batch run ID.
- Event IDs.
- Error reason codes.
- Reconciliation state before and after remediation.

10.2 Good runbook properties

| Property | Explanation |
| --- | --- |
| Business impact | Tells responder why this matters. |
| Diagnostic path | Gives ordered checks. |
| Safe actions | Lists allowed operations. |
| Unsafe actions | Prevents destructive shortcuts. |
| Escalation path | Identifies owner. |
| Evidence capture | Supports postmortem and audit. |
| Rollback/repair | Explains recovery boundaries. |

11. Operational Control Plane

A large ERP needs admin tools, but admin tools are dangerous. They must be designed as controlled operations, not ad-hoc database access.

11.1 Required operational tools

| Tool | Purpose |
| --- | --- |
| Document timeline viewer | Reconstruct lifecycle, commands, events, workflow, postings. |
| Posting exception console | Review, retry, reject, or route failed accounting events. |
| Integration message console | Inspect outbox/inbox, retries, partner errors, idempotency status. |
| Workflow operations console | Reassign, escalate, cancel, or repair stuck tasks. |
| Reconciliation dashboard | Show differences and drill into source documents. |
| Batch run console | View run status, checkpoint, restart, failure reason, output summary. |
| Audit evidence viewer | Query who did what, when, under which authority/config. |
| Config publication history | See effective config version and deployment/change evidence. |
| Report projection monitor | Read model checkpoint, lag, refresh status, source offset. |

11.2 Control-plane design rules

| Rule | Explanation |
| --- | --- |
| No silent mutation | Every admin action creates audit evidence. |
| Least privilege | Support users get scoped operational permissions. |
| Reason required | Risky actions require reason code and comment. |
| Approval for high-risk repair | Data repair and period reopen require workflow approval. |
| Idempotent repair | Retrying repair must not create duplicate effect. |
| Dry-run support | Show impact before executing risky actions. |
| Reconciliation after repair | Repair is incomplete until reconciliation passes. |
| Immutable history | Never rewrite business history without correction event. |

11.3 Posting retry console example

Support tools must encode the legal repair path, not bypass it.
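
A posting retry action that follows the rules in 11.2 might look like the sketch below. Everything here is an assumption made for illustration: the type names, the outcome strings, and the audit note format are hypothetical. What it tries to show is that only `FAILED_RETRYABLE` events can be retried, a reason code is mandatory, retrying an already posted event has no effect, and every attempt yields audit evidence whether or not it is accepted.

```java
import java.time.Instant;

enum PostingStatus { PENDING, POSTED, FAILED_RETRYABLE, FAILED_TERMINAL }

/** A support user's request to retry one failed accounting event. */
record RetryRequest(String eventId, PostingStatus currentStatus, String actor, String reasonCode) {}

/** Outcome of the request; auditNote is produced for accepted and rejected attempts alike. */
record RetryDecision(boolean accepted, String outcome, String auditNote) {}

final class PostingRetryConsole {
    private PostingRetryConsole() {}

    static RetryDecision evaluate(RetryRequest request, Instant now) {
        if (request.reasonCode() == null || request.reasonCode().isBlank()) {
            return decision(false, "REASON_REQUIRED", request, now);
        }
        return switch (request.currentStatus()) {
            case FAILED_RETRYABLE -> decision(true, "RETRY_SCHEDULED", request, now);
            // Retrying a posted event must not create a second posting.
            case POSTED -> decision(false, "ALREADY_POSTED", request, now);
            case FAILED_TERMINAL -> decision(false, "ROUTE_TO_EXCEPTION_QUEUE", request, now);
            case PENDING -> decision(false, "STILL_PENDING", request, now);
        };
    }

    private static RetryDecision decision(boolean accepted, String outcome,
                                          RetryRequest request, Instant now) {
        String auditNote = now + " actor=" + request.actor() + " event=" + request.eventId()
                + " outcome=" + outcome + " reason=" + request.reasonCode();
        return new RetryDecision(accepted, outcome, auditNote);
    }
}
```

The console never offers a "mark as posted" action; the only paths out of a failed state are a retry through the normal posting pipeline or a handover to the finance exception queue.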


12. Error Taxonomy

Not all errors should be handled the same way.

12.1 ERP error classes

| Class | Meaning | Example | Response |
| --- | --- | --- | --- |
| Validation error | Input violates business rule. | Missing tax code. | Reject with clear message. |
| Authorization/control error | User/action not allowed. | Maker approves own PO. | Deny and audit. |
| Configuration error | Required config missing/ambiguous. | No posting rule for transaction type. | Route to config owner. |
| Transient technical error | Temporary infrastructure failure. | DB timeout, broker unavailable. | Retry with backoff. |
| External dependency error | Partner/system failed. | Tax API unavailable. | Retry, fallback, or exception queue. |
| Concurrency conflict | Simultaneous update collision. | Optimistic lock failure. | Retry or ask user to refresh. |
| Unknown outcome | Side effect may have happened. | Payment timeout after bank accepted request. | Reconcile, do not blindly retry. |
| Data integrity breach | Impossible state detected. | Posted journal unbalanced. | Freeze, escalate, investigate. |
| Security incident | Suspicious or unauthorized behavior. | Privilege escalation attempt. | Incident response. |
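
In code, the taxonomy can be made explicit so that retry logic does not have to guess from exception types. The enum below is a sketch: apart from `CONFIGURATION_ERROR`, the constant names are hypothetical, and the `autoRetryAllowed` flag is an interpretation of the Response column rather than a rule stated in it.

```java
/** ERP error classes with their default handling, following the table above. */
enum ErrorClass {
    VALIDATION_ERROR(false, "Reject with clear message"),
    AUTHORIZATION_CONTROL_ERROR(false, "Deny and audit"),
    CONFIGURATION_ERROR(false, "Route to config owner"),
    TRANSIENT_TECHNICAL_ERROR(true, "Retry with backoff"),
    EXTERNAL_DEPENDENCY_ERROR(true, "Retry, fallback, or exception queue"),
    CONCURRENCY_CONFLICT(true, "Retry or ask user to refresh"),
    // The side effect may already have happened, so an automatic retry could duplicate it.
    UNKNOWN_OUTCOME(false, "Reconcile, do not blindly retry"),
    DATA_INTEGRITY_BREACH(false, "Freeze, escalate, investigate"),
    SECURITY_INCIDENT(false, "Incident response");

    private final boolean autoRetryAllowed;
    private final String response;

    ErrorClass(boolean autoRetryAllowed, String response) {
        this.autoRetryAllowed = autoRetryAllowed;
        this.response = response;
    }

    /** Whether an automated retry is a reasonable default for this class (assumed mapping). */
    boolean autoRetryAllowed() {
        return autoRetryAllowed;
    }

    /** Default response taken from the Response column. */
    String response() {
        return response;
    }
}
```

A consumer can then check `errorClass.autoRetryAllowed()` before scheduling a retry, which keeps `UNKNOWN_OUTCOME` cases such as a timed-out payment out of the automatic retry loop.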

12.2 Error envelope

{
  "errorCode": "ERP-POSTING-0031",
  "errorClass": "CONFIGURATION_ERROR",
  "reasonCode": "MISSING_POSTING_RULE",
  "message": "No posting rule configured for GOODS_RECEIPT in company ID01.",
  "correlationId": "corr-p2p-20260701-00091",
  "documentId": "grn_01J...",
  "documentNumber": "GRN-2026-000128",
  "supportAction": "Route to finance configuration owner and retry after approved config publication."
}

A good error message supports users, support teams, engineers, and auditors.


13. Incident Response for ERP

ERP incidents should be handled with both technical and business discipline.

13.1 Incident phases

An ERP incident moves through detection, triage, containment, repair, reconciliation, and postmortem.

13.2 ERP incident triage questions

| Question | Why it matters |
| --- | --- |
| Which business process is affected? | P2P, O2C, inventory, GL, payroll-adjacent, reporting. |
| Which companies/tenants are affected? | Determines blast radius and legal impact. |
| Is financial posting wrong, delayed, or duplicated? | Defines severity. |
| Is stock/customer/vendor state wrong? | Determines operational impact. |
| Are external systems involved? | Need partner coordination. |
| Is the period close affected? | Finance deadline risk. |
| Is there privacy/security exposure? | Regulatory incident path. |
| Is the outcome known or unknown? | Determines whether retry is safe. |
| Is there a safe containment switch? | Stop harm without destroying evidence. |
| What evidence must be preserved? | Audit and postmortem. |

13.3 Containment actions

| Incident | Possible containment |
| --- | --- |
| Duplicate payment risk | Pause payment outbound integration. |
| Wrong posting rule | Disable affected transaction type or route to exception. |
| Bad tax calculation | Freeze invoice posting for affected tax jurisdiction. |
| Stock oversell risk | Disable automatic allocation for affected warehouse/item. |
| Read model corruption | Mark report as stale and rebuild projection. |
| Workflow routing bug | Pause auto-escalation and route affected tasks to manual review. |
| Migration import issue | Stop import, preserve staging, reconcile imported subset. |

Containment should stop further harm while preserving investigation evidence.


14. Supportability by Design

Supportability is not something added after launch. It is designed into every feature.

14.1 Feature supportability checklist

For every new ERP feature, ask:

  • How will support find a document by business number?
  • How will they see the full lifecycle timeline?
  • How will they know which config version was used?
  • How will they see emitted events and downstream status?
  • How will they distinguish validation error from system error?
  • How will they safely retry or repair?
  • How will they know whether report/read model caught up?
  • What dashboard metric will show this feature is healthy?
  • What alert will fire when it is unhealthy?
  • What audit evidence proves the action was valid?

14.2 Support case model

Support cases should link to evidence, not copy uncontrolled data into tickets.


15. Batch Operations

ERP batch jobs are operationally critical: posting, MRP, depreciation, dunning, report refresh, exchange rate import, bank statement import, migration, and period close.

15.1 Batch observability

| Signal | Meaning |
| --- | --- |
| batch run ID | identity of one execution. |
| job name/version | what logic ran. |
| input scope | company, period, document type, cutoff. |
| rows read/written/skipped | progress and anomaly detection. |
| checkpoint | restart position. |
| failure reason | retryable or terminal. |
| duration | performance and SLA. |
| reconciliation result | business correctness after run. |
| output artifacts | reports, exception files, posting summaries. |

15.2 Batch run state machine

A batch job is not complete merely because the process exits. It is complete when its business output is reconciled.
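
One possible shape for such a state machine is sketched below. The state names and transitions are assumptions chosen for illustration; the property that matters is that `COMPLETED` is reachable only through a reconciliation step, never directly from a finished process.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical batch run states; COMPLETED requires passing through RECONCILING. */
enum BatchRunState {
    CREATED, RUNNING, FAILED_RETRYABLE, FAILED_TERMINAL,
    PROCESSED, RECONCILING, RECONCILIATION_FAILED, COMPLETED;

    Set<BatchRunState> allowedNext() {
        return switch (this) {
            case CREATED -> EnumSet.of(RUNNING);
            case RUNNING -> EnumSet.of(PROCESSED, FAILED_RETRYABLE, FAILED_TERMINAL);
            // A retryable failure restarts from the last checkpoint.
            case FAILED_RETRYABLE -> EnumSet.of(RUNNING);
            // The process exited cleanly, but the business output is not yet verified.
            case PROCESSED -> EnumSet.of(RECONCILING);
            case RECONCILING -> EnumSet.of(COMPLETED, RECONCILIATION_FAILED);
            case FAILED_TERMINAL, RECONCILIATION_FAILED, COMPLETED ->
                    EnumSet.noneOf(BatchRunState.class);
        };
    }

    boolean canMoveTo(BatchRunState next) {
        return allowedNext().contains(next);
    }
}
```

With these transitions, `PROCESSED.canMoveTo(COMPLETED)` is false, so a run whose reconciliation never happened stays visibly open on the batch run console.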


16. Read Model and Projection Operations

ERP read models and reports need operational visibility.

16.1 Projection signals

| Signal | Question |
| --- | --- |
| source offset/checkpoint | How far has the projection consumed? |
| lag seconds | How stale is the read model? |
| failed event count | Which events cannot project? |
| rebuild status | Is full rebuild running? |
| last reconciliation | Does projection match source ledger? |
| schema version | Is read model compatible with current report code? |
| query latency | Can users access reports within SLA? |

16.2 Projection failure handling

| Failure | Response |
| --- | --- |
| transient DB error | retry projection event. |
| invalid event payload | park event and alert owner. |
| schema mismatch | stop projection, run compatibility migration. |
| wrong projection logic | mark report stale, rebuild from source. |
| backlog too large | scale consumers or throttle non-critical producers. |
| source ledger corrected | replay affected range or full rebuild. |

Do not let reports silently serve incorrect or stale data without metadata.
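
Freshness metadata can travel with every report response. The record below is a sketch with hypothetical names; it approximates lag as the time between the last projected event and report generation, which overstates staleness when the source has simply been quiet, so a real implementation would more likely compare source and projection offsets.

```java
import java.time.Duration;
import java.time.Instant;

/** Freshness metadata attached to a report built from a read model. */
record ReportFreshness(Instant lastProjectedEventTime, Instant generatedAt, boolean rebuildInProgress) {

    /** Approximate lag: time between the last projected event and report generation. */
    Duration lag() {
        return Duration.between(lastProjectedEventTime, generatedAt);
    }

    /** A report is treated as stale during a rebuild or when lag exceeds the allowed maximum. */
    boolean isStale(Duration maxLag) {
        return rebuildInProgress || lag().compareTo(maxLag) > 0;
    }

    /** Short text a report header could display to users. */
    String banner(Duration maxLag) {
        if (isStale(maxLag)) {
            return "STALE: data as of " + lastProjectedEventTime + ", lag " + lag().toMinutes() + " min";
        }
        return "CURRENT: data as of " + lastProjectedEventTime;
    }
}
```

An AR aging report whose projection is 25 minutes behind, checked against a 2 minute limit, would carry a `STALE` banner instead of presenting old balances as current.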


17. Production Data Repair

ERP production repair must be controlled. Direct SQL updates are sometimes tempting, but they are rarely defensible unless wrapped in strict process and evidence.

17.1 Repair principles

| Principle | Explanation |
| --- | --- |
| Prefer business correction | Use reversal, adjustment, cancellation, or approved amendment. |
| Preserve history | Do not overwrite posted facts silently. |
| Require approval | High-risk repair needs maker-checker. |
| Dry-run first | Show affected rows/documents before mutation. |
| Record evidence | Capture reason, actor, approver, before/after, script checksum. |
| Reconcile after | Verify ledger/report/process state after repair. |
| Automate repeatable repair | Convert repeated manual fix into controlled tool. |
| Prevent recurrence | Add control, test, alert, or design change. |

17.2 Repair request model

A repair without reconciliation is only a mutation, not a completed remediation.
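
A repair request that encodes the principles from 17.1 could be modelled roughly as follows. The record and its field names are hypothetical, and a real model would also carry evidence such as before/after snapshots and a script checksum; the sketch covers only the gating logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical production repair request with its approval and verification state. */
record RepairRequest(String requester, String approver, String reasonCode,
                     boolean dryRunReviewed, boolean executed, boolean reconciliationPassed) {

    /** Reasons the repair may not be executed yet; an empty list means it may proceed. */
    List<String> executionBlockers() {
        List<String> blockers = new ArrayList<>();
        if (reasonCode == null || reasonCode.isBlank()) {
            blockers.add("REASON_REQUIRED");
        }
        if (!dryRunReviewed) {
            blockers.add("DRY_RUN_NOT_REVIEWED");
        }
        if (approver == null || approver.isBlank()) {
            blockers.add("APPROVAL_MISSING");
        } else if (approver.equals(requester)) {
            // Maker-checker: the person requesting the repair cannot approve it.
            blockers.add("MAKER_CHECKER_VIOLATION");
        }
        return blockers;
    }

    /** An executed repair stays open until reconciliation has passed. */
    boolean isCompleted() {
        return executed && reconciliationPassed;
    }
}
```

A request that has been executed but not reconciled reports `isCompleted()` as false, so it keeps appearing in open-repair views until someone verifies the ledger state.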


18. Capacity and Workload Operations

ERP traffic is not uniform. It has strong business cycles.

18.1 ERP workload spikes

| Spike | Example |
| --- | --- |
| Daily | warehouse receiving/shipping peaks. |
| Weekly | payment runs, replenishment planning. |
| Monthly | period close, depreciation, reports, statements. |
| Quarterly | compliance reporting, tax reporting. |
| Yearly | fiscal year close, audit, inventory count. |
| Event-driven | promotion, recall, migration, acquisition, regulatory deadline. |

18.2 Capacity signals

| Area | Signals |
| --- | --- |
| Application | request latency, error rate, thread pool, queue wait. |
| JVM | heap, GC pause, allocation rate, CPU. |
| Database | CPU, IOPS, lock wait, slow query, deadlock, bloat, replication lag. |
| Messaging | lag, redelivery count, dead-letter count, publish latency. |
| Batch | duration, throughput, checkpoint delay. |
| Reports | query latency, refresh lag, export queue. |
| Storage | attachment growth, audit log volume, retention backlog. |

Capacity planning must model business calendar, not just average traffic.


19. SLOs for ERP

Service Level Objectives should reflect business impact.

19.1 Example ERP SLOs

| Capability | SLO |
| --- | --- |
| Order entry | 99.9% of order submissions complete under 2 seconds excluding external tax call. |
| Stock availability | p95 availability query under 300 ms for active items. |
| Posting pipeline | 99% of accounting events posted within 5 minutes during business hours. |
| Payment confirmation | 99% of bank confirmations processed or exceptioned within 10 minutes. |
| Workflow routing | 99% of approval tasks created within 30 seconds after submission. |
| Report freshness | Operational read models lag source by less than 2 minutes p95. |
| Month-end close batch | Close simulation completes within agreed batch window. |
| Support lookup | Document timeline available for 99.9% of posted documents. |
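
As a worked example, the posting pipeline SLO can be checked by computing the share of events posted within the target delay over a measurement window. The helper below is a sketch with hypothetical names; it assumes the posting delay of each event (time from event creation to posting) has already been collected.

```java
import java.time.Duration;
import java.util.List;

/** Computes compliance with a "posted within target delay" objective. */
final class PostingSlo {
    private PostingSlo() {}

    /** Fraction of events posted within the target delay; 1.0 for an empty window. */
    static double compliance(List<Duration> postingDelays, Duration target) {
        if (postingDelays.isEmpty()) {
            return 1.0;
        }
        long withinTarget = postingDelays.stream()
                .filter(delay -> delay.compareTo(target) <= 0)
                .count();
        return (double) withinTarget / postingDelays.size();
    }

    /** True when measured compliance reaches the objective, e.g. 0.99 for 99%. */
    static boolean meets(List<Duration> postingDelays, Duration target, double objective) {
        return compliance(postingDelays, target) >= objective;
    }
}
```

With delays of 1, 2, 4, and 9 minutes against a 5 minute target, three of four events are within target, so compliance is 0.75 and a 99% objective is missed even though no request ever returned an error.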

19.2 Error budgets for ERP

For ERP, error budget is not only downtime. It can include:

  • delayed postings;
  • unreconciled differences;
  • stuck workflows;
  • stale critical reports;
  • suppressed duplicate attempts;
  • manual repair volume;
  • failed batch retries;
  • integration exception backlog.

A system can be “available” but still consume its business error budget.


20. Deployment and Release Observability

ERP release risk is high because small changes in rules, config, or reports can have large downstream effects.

20.1 Release telemetry

| Signal | Purpose |
| --- | --- |
| deployment version | correlate behavior change to release. |
| config publication version | correlate business behavior change to config. |
| migration version | identify schema/data changes. |
| feature flag state | explain conditional behavior. |
| error rate by version | detect regression. |
| business metric diff | detect silent business impact. |
| report reconciliation diff | detect reporting regression. |

20.2 Canary for ERP

A good ERP canary checks not only HTTP health but business paths:

  • create draft document in test tenant;
  • route approval task;
  • post synthetic accounting event in sandbox scope;
  • process outbox message;
  • update read model;
  • verify audit event;
  • verify report/checkpoint freshness.

Never use production financial documents as destructive canaries.


21. Multi-Tenant Operations

Multi-tenant ERP operations require tenant-aware observability and containment.

21.1 Tenant-aware signals

| Signal | Why it matters |
| --- | --- |
| tenant-specific error rate | one tenant can fail due to config/localization. |
| tenant-specific queue lag | noisy tenant can starve others. |
| tenant-specific report lag | large tenant can overload read model. |
| tenant-specific batch window | tenant calendar and close process differ. |
| tenant-specific feature flag | behavior may differ intentionally. |
| tenant-specific data residency | logs/support access may have restrictions. |

21.2 Isolation operations

| Operation | Use case |
| --- | --- |
| pause tenant integration | external partner outage for one tenant. |
| throttle tenant reports | one tenant launches massive export burst. |
| isolate batch run | large tenant close should not block smaller tenants. |
| disable feature for tenant | tenant-specific config defect. |
| tenant-level maintenance mode | planned migration or localization update. |

Tenant containment is an operational feature, not merely an architectural statement.


22. Security Observability

ERP security monitoring must include business controls.

22.1 Security signals

| Signal | Example |
| --- | --- |
| failed login anomaly | brute force or credential issue. |
| privilege escalation | new admin role assignment. |
| SoD violation attempt | user tries to approve own transaction. |
| emergency access | break-glass used. |
| report/export spike | unusually large export of sensitive data. |
| role change before approval | suspicious permission timing. |
| disabled control | approval matrix bypassed or emergency config. |
| API token misuse | partner integration sends unusual volume. |

22.2 Control observability

Every important control should produce metrics:

erp.control.denied.count{control="MAKER_CHECKER", process="P2P"}
erp.control.override.count{control="CREDIT_LIMIT", company="ID01"}
erp.emergency_access.active.count{tenant="tenant-a"}
erp.sensitive_export.rows.count{report="VENDOR_BANK_ACCOUNTS"}

Security observability should detect business abuse, not only infrastructure attacks.


23. Audit and Evidence Operations

Audit evidence must be available, queryable, and trustworthy.

23.1 Evidence lookup questions

Support and audit teams should be able to answer:

  • Who created this document?
  • Who approved it?
  • What authority did they have at the time?
  • Which config version was used?
  • Which state transitions occurred?
  • Which accounting entries were generated?
  • Which integration messages were sent?
  • Which report included this transaction?
  • Was the document corrected, reversed, voided, or amended?
  • Which support/admin actions touched it?

23.2 Evidence timeline

The timeline should be reconstructable without querying random logs manually.
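
Conceptually, the timeline is a merge of entries from several evidence stores, ordered by time. The sketch below assumes each store can already return its entries for one document; `TimelineEntry`, `DocumentTimeline`, and the source labels are hypothetical names.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

/** One piece of evidence about a document, from any source (workflow, audit, outbox, ...). */
record TimelineEntry(Instant at, String source, String actor, String description) {}

final class DocumentTimeline {
    private DocumentTimeline() {}

    /** Merges per-source entries into one chronologically ordered timeline. */
    @SafeVarargs
    static List<TimelineEntry> merge(List<TimelineEntry>... sources) {
        return Stream.of(sources)
                .flatMap(List::stream)
                .sorted(Comparator.comparing(TimelineEntry::at))
                .toList();
    }
}
```

Support would call `merge` with the workflow, audit, posting, and integration entries for one document ID and read the result top to bottom, instead of correlating timestamps across separate log searches.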


24. Production Readiness Review

Before launching an ERP capability, perform a production readiness review.

24.1 Review areas

| Area | Questions |
| --- | --- |
| Business health | What business metrics indicate success/failure? |
| Technical health | What latency/error/saturation metrics exist? |
| Logs | Are logs structured and safe? |
| Traces | Can a command be followed across async boundaries? |
| Dashboards | Do process owners and engineers have useful views? |
| Alerts | Are alerts actionable with owners/runbooks? |
| Support tools | Can support look up, retry, repair, and escalate safely? |
| Failure modes | Are retry, duplicate, timeout, unknown outcome, and partial failure handled? |
| Reconciliation | How do we detect and resolve drift? |
| Security | Are sensitive actions audited and monitored? |
| Capacity | Have peak workloads and batch windows been tested? |
| Rollback | What can be rolled back, disabled, paused, or contained? |

24.2 Launch gate

A capability should not launch unless:

  • dashboards exist;
  • critical alerts exist;
  • runbooks exist;
  • support lookup exists;
  • audit evidence exists;
  • retry/repair path exists;
  • reconciliation path exists where applicable;
  • business owner knows what “healthy” means;
  • on-call knows what to do when unhealthy.

25. Postmortems and Learning Loops

ERP incidents should improve the system.

25.1 Postmortem questions

| Question | Purpose |
| --- | --- |
| What business invariant failed or almost failed? | Connect incident to domain risk. |
| Why was it not detected earlier? | Improve monitoring/tests. |
| Why did the system allow or fail to prevent it? | Improve controls. |
| Was the outcome known or unknown? | Improve idempotency/reconciliation. |
| How long did detection take? | Improve alerts/business telemetry. |
| How long did diagnosis take? | Improve logs/traces/evidence. |
| How long did repair take? | Improve operational tooling. |
| Was audit evidence sufficient? | Improve defensibility. |
| What regression test now protects this? | Prevent recurrence. |
| What runbook changed? | Improve response. |

25.2 Learning loop

A postmortem without test, alert, runbook, or design improvement is incomplete.


26. Source Notes

This material combines ERP-specific operational design with modern Java observability practices.

Relevant baseline references:

  • OpenTelemetry Java documentation introduces generating and collecting telemetry such as metrics, logs, and traces using OpenTelemetry Java APIs and SDKs.
  • OpenTelemetry Java instrumentation includes zero-code and agent-based instrumentation options for Java applications.
  • Spring Boot Actuator observability documents logs, metrics, and traces and integrates with Micrometer Observation.
  • Spring Boot tracing auto-configures Micrometer Tracing and supports OpenTelemetry with OTLP.
  • Java Flight Recorder is useful for Java performance and production diagnostics.
  • OWASP logging guidance remains relevant for designing safe, useful application logs.

The ERP-specific layer is the business observability model: document timeline, workflow health, posting health, reconciliation gaps, control overrides, projection lag, and support repair evidence.


27. Kaufman 20-Hour Practice Plan

Hour 1-3: Define business health signals

Pick one ERP process, such as order-to-cash.

Define metrics for:

  • orders submitted;
  • orders blocked by credit control;
  • shipments pending;
  • invoices posted;
  • AR settlements;
  • failed postings;
  • reconciliation gaps;
  • report lag.

Hour 4-6: Add correlation model

Design headers/fields for:

  • trace ID;
  • correlation ID;
  • causation ID;
  • command ID;
  • document ID/number;
  • tenant/company;
  • batch run ID.

Hour 7-9: Build structured logs

Implement structured logs for:

  • command received;
  • authorization decision;
  • lifecycle transition;
  • audit write;
  • outbox publish;
  • consumer processing;
  • retry/failure.

Hour 10-12: Build dashboards

Create dashboard sketches for:

  • process owner;
  • finance operations;
  • integration operations;
  • engineering on-call;
  • support desk.

Hour 13-15: Write runbooks

Write runbooks for:

  • failed posting backlog;
  • stuck workflow;
  • duplicate payment confirmation;
  • stale report projection;
  • reconciliation gap.

Hour 16-18: Design operational tools

Specify admin console actions:

  • retry posting;
  • park message;
  • reassign approval;
  • rebuild projection;
  • open support case;
  • capture repair evidence.
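
To make the specification concrete, one admin action can be sketched end to end. The example below is a hedged illustration of "retry posting" with a dry-run mode and captured repair evidence; `RetryPostingAction`, `RepairEvidence`, the outcome strings, and the approval-reference rule are hypothetical, and the actual posting call is replaced by a state change.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

enum PostingStatus { FAILED, POSTED }

/** Evidence row written for every repair attempt, including dry runs and skips. */
record RepairEvidence(String action, String documentNumber, String operator,
                      String approvalRef, boolean dryRun, String outcome, Instant at) {}

final class RetryPostingAction {
    private final List<RepairEvidence> evidenceLog = new ArrayList<>();

    /** Retries a failed posting. A dry run reports the outcome without changing state. */
    PostingStatus execute(String documentNumber, PostingStatus current, String operator,
                          String approvalRef, boolean dryRun) {
        if (approvalRef == null || approvalRef.isBlank()) {
            throw new IllegalArgumentException("approval reference required");
        }
        PostingStatus result = current;
        String outcome;
        if (current != PostingStatus.FAILED) {
            outcome = "SKIPPED_NOT_FAILED";   // never re-post a document that is already posted
        } else if (dryRun) {
            outcome = "WOULD_RETRY";
        } else {
            result = PostingStatus.POSTED;    // stand-in for the real posting call
            outcome = "RETRIED";
        }
        evidenceLog.add(new RepairEvidence("retry-posting", documentNumber, operator,
                approvalRef, dryRun, outcome, Instant.now()));
        return result;
    }

    List<RepairEvidence> evidence() {
        return List.copyOf(evidenceLog);
    }
}

final class RetryPostingDemo {
    public static void main(String[] args) {
        RetryPostingAction action = new RetryPostingAction();
        action.execute("INV-1001", PostingStatus.FAILED, "support.anna", "CHG-42", true);
        action.execute("INV-1001", PostingStatus.FAILED, "support.anna", "CHG-42", false);
        action.evidence().forEach(System.out::println);
    }
}
```

Writing evidence for dry runs and skipped attempts as well as real retries means the audit trail shows what support considered, not only what support changed.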

Hour 19-20: Simulate incident

Run a tabletop exercise:

  • create duplicate payment confirmation;
  • break posting rule config;
  • delay read model projection;
  • produce reconciliation gap;
  • walk through detection, triage, containment, repair, reconciliation, postmortem.

28. Design Review Checklist

Use this checklist for ERP observability and operations review.

Signals

  • Are business health metrics defined for the capability?
  • Are technical metrics sufficient for latency, errors, saturation, and backlog?
  • Are logs structured and safe from sensitive data leakage?
  • Are traces meaningful at business operation boundaries?
  • Are correlation IDs propagated across async messages?

Dashboards and alerts

  • Do dashboards exist for business owner, support, integration, finance, and engineering?
  • Are alerts actionable with severity, owner, business impact, and runbook?
  • Are alert thresholds based on ERP workload and calendar?
  • Are stale reports and projection lag visible?
  • Are reconciliation gaps visible?
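
The questions about actionable alerts and calendar-aware thresholds can be checked mechanically if alert rules are modelled as data. The sketch below is illustrative only: the record `AlertRule`, its field names, and the assumed close window (the last three days of a month plus the first two days of the next) are hypothetical, and a real ERP would read the close calendar from its fiscal configuration.

```java
import java.time.LocalDate;

/** Alert rule carrying severity, owner, business impact, runbook, and two thresholds. */
record AlertRule(String name, String severity, String owner, String businessImpact,
                 String runbook, long normalThreshold, long closeWindowThreshold) {

    AlertRule {
        // An alert without an owner or a runbook is not actionable, so reject it early.
        if (owner == null || owner.isBlank()) {
            throw new IllegalArgumentException("owner required");
        }
        if (runbook == null || runbook.isBlank()) {
            throw new IllegalArgumentException("runbook required");
        }
    }

    /** Assumed close window: last 3 days of the month and first 2 days of the next. */
    static boolean inCloseWindow(LocalDate date) {
        int day = date.getDayOfMonth();
        return day > date.lengthOfMonth() - 3 || day <= 2;
    }

    /** Fires at or above the threshold that applies on the given date. */
    boolean shouldFire(long observedValue, LocalDate date) {
        long threshold = inCloseWindow(date) ? closeWindowThreshold : normalThreshold;
        return observedValue >= threshold;
    }
}

final class AlertRuleDemo {
    public static void main(String[] args) {
        AlertRule rule = new AlertRule("failed-posting-backlog", "SEV2", "finance-ops",
                "Invoices are not reaching the ledger", "runbooks/failed-posting-backlog",
                50, 5);
        System.out.println(rule.shouldFire(10, LocalDate.of(2024, 1, 15)));
        System.out.println(rule.shouldFire(10, LocalDate.of(2024, 1, 30)));
    }
}
```

The same backlog of ten failed postings is tolerable mid-month and urgent during period close, which is why a single static threshold tends to be either noisy or blind.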

Supportability

  • Can support search by document number, user, batch run, and correlation ID?
  • Is the document lifecycle timeline visible?
  • Are workflow tasks, postings, integrations, and audit events connected?
  • Are errors classified with reason codes and support actions?
  • Are safe retry/repair tools available?

Operations

  • Are batch jobs observable with run ID, checkpoint, progress, and reconciliation result?
  • Can failed messages be parked, retried, or routed to an exception queue safely?
  • Can tenant-level containment be applied?
  • Are production repairs approved, evidenced, dry-run-capable, and reconciled?
  • Are runbooks tested through tabletop exercises?

Incident response

  • Can the blast radius be identified by tenant/company/process/document type?
  • Are containment switches available for risky integrations/processes?
  • Is an unknown outcome handled differently from a safe retry?
  • Does the postmortem feed tests, metrics, alerts, runbooks, and design improvements?
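
The distinction between an unknown outcome and a safe retry is easier to review when it is explicit in code. The sketch below uses the failure classes named in the skill breakdown at the start of this part; the `SupportAction` values and the mapping in `FailurePolicy` are hypothetical and would differ per process.

```java
/** Failure classes from the skill breakdown at the start of this part. */
enum FailureClass {
    VALIDATION_ERROR, TRANSIENT_FAILURE, CONTROL_REJECTION, DATA_CORRUPTION, UNKNOWN_OUTCOME
}

/** Hypothetical support actions, one per failure class. */
enum SupportAction {
    RETURN_TO_USER, RETRY_WITH_BACKOFF, ROUTE_TO_APPROVER,
    QUARANTINE_AND_ESCALATE, RECONCILE_BEFORE_RETRY
}

final class FailurePolicy {
    private FailurePolicy() {}

    /** Maps each failure class to the first support action to take. */
    static SupportAction actionFor(FailureClass failure) {
        return switch (failure) {
            case VALIDATION_ERROR -> SupportAction.RETURN_TO_USER;
            case TRANSIENT_FAILURE -> SupportAction.RETRY_WITH_BACKOFF;
            case CONTROL_REJECTION -> SupportAction.ROUTE_TO_APPROVER;
            case DATA_CORRUPTION -> SupportAction.QUARANTINE_AND_ESCALATE;
            // The remote side may or may not have applied the change, so check first.
            case UNKNOWN_OUTCOME -> SupportAction.RECONCILE_BEFORE_RETRY;
        };
    }

    /** Only a transient failure is assumed safe to retry without checking state. */
    static boolean safeToRetryBlindly(FailureClass failure) {
        return failure == FailureClass.TRANSIENT_FAILURE;
    }
}

final class FailurePolicyDemo {
    public static void main(String[] args) {
        for (FailureClass failure : FailureClass.values()) {
            System.out.println(failure + " -> " + FailurePolicy.actionFor(failure)
                    + " (blind retry: " + FailurePolicy.safeToRetryBlindly(failure) + ")");
        }
    }
}
```

A payment confirmation that timed out is the typical unknown outcome: retrying it blindly risks a duplicate payment, so the policy forces reconciliation against the counterparty before any retry.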

29. Summary

ERP observability is not only logs, metrics, and traces. It is the ability to understand business truth under failure.

The central ideas:

  • A green server dashboard does not mean the ERP business process is healthy.
  • Every important command needs correlation across request, workflow, ledger, integration, report, and audit.
  • Metrics should include business health: stuck workflows, failed postings, reconciliation gaps, projection lag, duplicate suppression, and close blockers.
  • Logs must be structured, safe, and searchable by business identifiers.
  • Traces must represent meaningful ERP operations.
  • Alerts need owner, severity, business impact, and runbook.
  • Support tools must repair through controlled, audited operations, not ad-hoc database mutation.
  • Batch, read models, and integrations require first-class operational visibility.
  • Incidents should improve invariants, tests, metrics, alerts, runbooks, and design.

A top ERP engineer asks:

When this fails in production, how will we know, how will we explain it, how will we fix it safely, and how will we prove what happened?

That question separates demo-ready ERP from enterprise-grade ERP.
