Observability: Logs, Metrics, Traces
Learn Enterprise CPQ OMS Camunda 7 - Part 042
Designing observability with logs, metrics, traces, correlation identifiers, and business telemetry for a production-grade Java microservices CPQ and order management platform.
A CPQ/OMS platform can fail in many ways.
The price engine is slow.
The quote API times out.
A Kafka consumer lags.
A Camunda external task worker retries forever.
A downstream inventory system accepts a reservation but never sends a callback.
A customer sees stale catalog data.
A quote is approved but order creation does not happen.
The database is healthy, but the business process is stuck.
Observability is the difference between guessing and knowing.
But enterprise observability is not just “add logs.”
It is the ability to ask a business/technical question and follow the answer across HTTP, Java services, PostgreSQL transactions, Kafka messages, Redis cache, Camunda workflow, external task workers, and downstream systems.
1. Core Mental Model
Observability has three classic signals:
- logs;
- metrics;
- traces.
For CPQ/OMS, add a fourth dimension:
- business state.
A trace can show that POST /quotes/{id}/submit took 700 ms.
A metric can show quote submission latency p95.
A log can show a validation failure.
But the operator still needs to know:
- Which quote?
- Which revision?
- Which tenant?
- Which process instance?
- Which order?
- Which Kafka event?
- Which external fulfillment step?
- Was the business outcome successful, pending, rejected, or unknown?
If telemetry cannot be joined to business identity, it is only half useful.
2. The Four IDs You Must Carry
A production CPQ/OMS platform should consistently propagate these identifiers.
| Identifier | Purpose | Example |
|---|---|---|
| trace_id | Distributed tracing identity | 0af7651916cd43dd8448eb211c80319c |
| correlation_id | User/business operation correlation | corr-quote-submit-7d8e |
| business_key | Workflow/domain lifecycle identity | quote:Q-1001:rev:4 |
| aggregate_id | Domain object identity | quoteId=Q-1001, orderId=O-991 |
Do not collapse them into one field.
They answer different questions.
trace_id answers:
Which technical calls were part of this request path?
correlation_id answers:
Which logs/events belong to the same user or command operation?
business_key answers:
Which long-running business process does this belong to?
aggregate_id answers:
Which domain object changed?
A synchronous HTTP request may have one trace.
A long-running order workflow may span many traces over several days.
That is why Camunda business key and domain aggregate id matter.
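As a minimal sketch of keeping the four identifiers apart, the class below carries them as separate fields. The class name `OperationIds`, the method names, and the rule that a revision starts at 1 are assumptions made for illustration; only the `quote:<id>:rev:<n>` key shape comes from the table above.

```java
import java.util.Objects;

/** Keeps the four identifiers in separate fields so that none is overloaded. */
final class OperationIds {
    final String traceId;        // technical request path
    final String correlationId;  // user or command operation
    final String businessKey;    // long-running workflow lifecycle
    final String aggregateId;    // domain object that changed

    OperationIds(String traceId, String correlationId, String businessKey, String aggregateId) {
        this.traceId = Objects.requireNonNull(traceId, "traceId");
        this.correlationId = Objects.requireNonNull(correlationId, "correlationId");
        this.businessKey = Objects.requireNonNull(businessKey, "businessKey");
        this.aggregateId = Objects.requireNonNull(aggregateId, "aggregateId");
    }

    /** Builds a key in the quote:<id>:rev:<n> shape shown in the table above. */
    static String quoteBusinessKey(String quoteId, int revision) {
        if (quoteId == null || quoteId.isBlank()) {
            throw new IllegalArgumentException("quoteId is required");
        }
        if (revision < 1) { // assumption: revisions are numbered from 1
            throw new IllegalArgumentException("revision must be >= 1");
        }
        return "quote:" + quoteId + ":rev:" + revision;
    }

    /** A later request in the same workflow: new trace and correlation, same business identity. */
    OperationIds forLaterRequest(String newTraceId, String newCorrelationId) {
        return new OperationIds(newTraceId, newCorrelationId, businessKey, aggregateId);
    }
}
```

The `forLaterRequest` method illustrates the point made above: an approval that happens days after submission gets a new trace, but it can still be joined to the submission through the business key and aggregate id.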
3. Propagation Map
A clean propagation model states, for each boundary, which context must cross it:
| Boundary | Propagate |
|---|---|
| HTTP inbound | traceparent, correlation_id, actor, tenant |
| HTTP outbound | traceparent, correlation_id, tenant-safe headers |
| Kafka event | trace context (traceparent), correlation_id, aggregate id, event id, causation id |
| PostgreSQL audit | trace_id, correlation_id, aggregate id, workflow key |
| Camunda process | business key, domain ids, minimal variables |
| Redis key/value | do not store trace context as authoritative data; attach trace and correlation ids to the logs around cache calls only |
| External task worker | fetch task, create new span, log process/task ids |
The aim is not to put every context value everywhere.
The aim is to make every diagnostic path joinable.
4. Logs: Structured, Sparse, Useful
A log line should be a small event with context.
Bad log:
Error processing request
Better log:
{
  "level": "ERROR",
  "message": "quote submission failed",
  "service": "quote-service",
  "tenantId": "tenant-a",
  "quoteId": "Q-1001",
  "quoteRevision": 4,
  "actorId": "u-771",
  "correlationId": "corr-7d8e",
  "traceId": "0af7651916cd43dd8448eb211c80319c",
  "errorCode": "QUOTE_STALE_PRICE",
  "lifecycleState": "PRICED",
  "command": "SubmitQuote",
  "retryable": false
}
Use structured logs.
Use stable field names.
Use domain language.
Do not log secrets.
Do not log full payloads by default.
Do not log the same failure ten times in the same call stack.
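To make "structured logs with stable field names" concrete, here is a small sketch that renders a field map as one JSON line using only the JDK. The class `StructuredLog` and its helper are hypothetical; a real service would more likely rely on a JSON encoder provided by its logging framework, and the hand-written escaping below exists only to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StructuredLog {

    /** Builds the stable field set for a failed quote submission; no payload, no secrets. */
    static Map<String, Object> quoteSubmissionFailed(String tenantId, String quoteId, int revision,
                                                     String correlationId, String traceId,
                                                     String errorCode) {
        Map<String, Object> fields = new LinkedHashMap<>(); // keeps field order stable
        fields.put("level", "ERROR");
        fields.put("message", "quote submission failed");
        fields.put("service", "quote-service");
        fields.put("tenantId", tenantId);
        fields.put("quoteId", quoteId);
        fields.put("quoteRevision", revision);
        fields.put("correlationId", correlationId);
        fields.put("traceId", traceId);
        fields.put("errorCode", errorCode);
        fields.put("retryable", false);
        return fields;
    }

    /** Renders a flat map as a single-line JSON object. */
    static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v == null || v instanceof Number || v instanceof Boolean) {
                sb.append(v); // null, numbers, and booleans are written unquoted
            } else {
                sb.append('"').append(escape(v.toString())).append('"');
            }
        }
        return sb.append('}').toString();
    }

    /** Escapes quotes, backslashes, and control characters so one event stays on one line. */
    private static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }
}
```

Building the event from named parameters rather than from the request body is one way to keep full payloads out of the log by construction.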
Recommended Log Event Classes
| Class | Example |
|---|---|
| Command accepted | quote.submit.command.accepted |
| Command rejected | quote.submit.command.rejected |
| Domain transition | quote.lifecycle.transitioned |
| External call | inventory.reservation.requested |
| External result | inventory.reservation.accepted |
| Unknown outcome | inventory.reservation.outcome_unknown |
| Workflow action | camunda.external_task.completed |
| Retry exhausted | fulfillment.retry_exhausted |
| Fallout opened | fallout.case.opened |
| Security denial | authorization.denied |
The log event name should be stable enough to use in dashboards and alerts.
5. Metrics: Measure Systems and Business Flow
Metrics are for trends, alerts, and SLOs.
A metric should be cheap to aggregate and safe to store at high volume.
Technical Metrics
| Metric | Type | Labels |
|---|---|---|
| http_server_request_duration_seconds | histogram | service, method, route, status |
| db_transaction_duration_seconds | histogram | service, operation |
| kafka_consumer_lag | gauge | consumer_group, topic, partition |
| redis_operation_duration_seconds | histogram | service, operation |
| camunda_external_task_duration_seconds | histogram | topic, worker |
| camunda_external_task_failures_total | counter | topic, error_code |
Business Metrics
| Metric | Type | Labels |
|---|---|---|
| quote_created_total | counter | tenant_segment, channel |
| quote_submitted_total | counter | tenant_segment, channel |
| quote_approval_required_total | counter | policy_version, reason_code |
| quote_approved_total | counter | approval_level |
| quote_rejected_total | counter | reason_code |
| order_created_total | counter | order_type |
| order_fallout_open_total | counter | fallout_type |
| order_fulfillment_duration_seconds | histogram | product_family |
| manual_recovery_total | counter | recovery_action |
Avoid High-Cardinality Labels
Do not use these as metric labels:
- quote id;
- order id;
- user id;
- customer id;
- process instance id;
- full error message;
- raw product SKU if cardinality is large;
- tenant id if there are many tenants and the backend cannot handle it.
Put high-cardinality identifiers in logs and traces.
Put low-cardinality dimensions in metrics.
This one rule prevents many observability backends from becoming expensive and slow.
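One way to enforce the rule mechanically is a guard that rejects known high-cardinality label names before a metric is recorded. The sketch below assumes a hypothetical `MetricLabels` helper and snake_case label names; the deny list mirrors part of the list above, and the tenant-to-segment lookup is an illustrative assumption about how a bounded dimension might be derived.

```java
import java.util.Map;
import java.util.Set;

final class MetricLabels {
    // Label names taken from the "do not use" list above (snake_case is an assumption).
    private static final Set<String> HIGH_CARDINALITY = Set.of(
            "quote_id", "order_id", "user_id", "customer_id",
            "process_instance_id", "error_message");

    /** Returns the labels unchanged, or fails fast if a forbidden label name is present. */
    static Map<String, String> requireLowCardinality(Map<String, String> labels) {
        for (String key : labels.keySet()) {
            if (HIGH_CARDINALITY.contains(key)) {
                throw new IllegalArgumentException("high-cardinality metric label: " + key);
            }
        }
        return labels;
    }

    /** Maps an unbounded tenant id to a bounded segment; unknown tenants share one bucket. */
    static String tenantSegment(Map<String, String> segmentByTenant, String tenantId) {
        return segmentByTenant.getOrDefault(tenantId, "unknown");
    }
}
```

A guard like this checks label names only. It cannot detect an unbounded value hidden behind an allowed name, so code review of label values still matters.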
6. Traces: Follow Causality Across Boundaries
A trace tells the story of a request path.
For CPQ/OMS, trace spans should map to meaningful operations.
Example span tree:
POST /quotes/{quoteId}/submit
  QuoteResource.submitQuote
    QuoteApplicationService.submit
      QuoteRepository.load
      PricingFreshnessChecker.verify
      ApprovalPolicy.evaluate
      QuoteRepository.save
      AuditRepository.append
      OutboxRepository.append
      CamundaRuntime.startProcessInstanceByKey
For an external task worker:
camunda.external_task.fulfillment.reserve_inventory
  OrderRepository.load
  InventoryClient.reserve
  OrderRepository.markReservationRequested
  AuditRepository.append
  OutboxRepository.append
Good span names are neither too generic nor too specific.
Bad:
process
Too specific:
process quote Q-1001 revision 4 for tenant-a
Good:
QuoteApplicationService.submit
Use span attributes for identifiers:
cpq.quote_id=Q-1001
cpq.quote_revision=4
cpq.tenant_segment=enterprise
workflow.business_key=quote:Q-1001:rev:4
camunda.process_definition_key=quote-approval
Do not put sensitive data in trace attributes.
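The split between a constant span name and per-request attributes can be sketched without any tracing library. `SpanDescriptor` below is a hypothetical holder, and the allow list of attribute keys is an assumption based on the attribute names shown above. With the OpenTelemetry Java API the same split would typically be expressed through `Tracer.spanBuilder(name)` and `Span.setAttribute(key, value)`; that wiring is left out so the sketch runs on the JDK alone.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SpanDescriptor {
    // Only these keys may be attached; anything else (for example personal data) is dropped.
    private static final Set<String> ALLOWED_KEYS = Set.of(
            "cpq.quote_id", "cpq.quote_revision", "cpq.tenant_segment",
            "workflow.business_key", "camunda.process_definition_key");

    final String name;                     // constant per operation, never contains ids
    final Map<String, String> attributes;  // identifiers live here

    SpanDescriptor(String name, Map<String, String> candidateAttributes) {
        this.name = name;
        Map<String, String> kept = new TreeMap<>();
        for (Map.Entry<String, String> e : candidateAttributes.entrySet()) {
            if (ALLOWED_KEYS.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        this.attributes = Collections.unmodifiableMap(kept);
    }

    /** Span for the quote submit use case: fixed name, identifiers as attributes. */
    static SpanDescriptor quoteSubmit(String quoteId, int revision, Map<String, String> extra) {
        Map<String, String> attrs = new TreeMap<>(extra);
        attrs.put("cpq.quote_id", quoteId);
        attrs.put("cpq.quote_revision", Integer.toString(revision));
        attrs.put("workflow.business_key", "quote:" + quoteId + ":rev:" + revision);
        return new SpanDescriptor("QuoteApplicationService.submit", attrs);
    }
}
```

Silently dropping unknown keys is a design choice made for this sketch; a stricter variant could fail tests when an unlisted attribute appears.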
7. OpenTelemetry Positioning
OpenTelemetry should be treated as the instrumentation standard layer.
The application emits telemetry through OpenTelemetry APIs/SDKs/instrumentation.
The collector receives, processes, and exports telemetry to the chosen backend.
Do not couple domain code directly to a vendor backend.
Vendor-neutral instrumentation gives you leverage.
The implementation detail may change.
The semantic model should not.
8. W3C Trace Context
For HTTP boundaries, propagate W3C Trace Context headers:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendorkey=opaque-vendor-value
traceparent carries portable trace identity.
tracestate carries vendor-specific state.
Add a separate correlation id header, for example:
x-correlation-id: corr-7d8e
Do not use correlation id as a replacement for trace context.
Correlation id is business/debug grouping.
Trace context is distributed tracing identity.
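For readers who want to see what the `traceparent` value contains, the following sketch parses the version `00` layout: version, 32 hex characters of trace id, 16 hex characters of parent span id, and 2 hex characters of flags. The class name `TraceParent` is hypothetical, and the sketch deliberately rejects any other version. In practice the tracing library's propagator does this work, so application code rarely needs its own parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParent {
    // Version 00 only: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>, lowercase hex.
    private static final Pattern VERSION_00 =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");
    private static final String ZERO_TRACE = "0".repeat(32);
    private static final String ZERO_PARENT = "0".repeat(16);

    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or empty if it is missing or malformed. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = VERSION_00.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        // All-zero ids are defined as invalid by the specification.
        if (traceId.equals(ZERO_TRACE) || parentId.equals(ZERO_PARENT)) {
            return Optional.empty();
        }
        // The lowest flag bit marks the trace as sampled.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 0x01;
        return Optional.of(new TraceParent(traceId, parentId, sampled));
    }
}
```

Note that a correlation id such as `corr-7d8e` does not parse as a `traceparent`, which is one more reason to keep the two headers separate.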
9. JAX-RS/Jersey Instrumentation Pattern
Use a request filter to establish context.
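import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.container.ContainerResponseContext;
import jakarta.ws.rs.container.ContainerResponseFilter;
import jakarta.ws.rs.ext.Provider;
import org.slf4j.MDC;

// On Jersey 2.x, import the same types from javax.ws.rs.* instead.
// getOrCreateCorrelationId, resolveTenant, and resolveActor are helpers defined elsewhere.
@Provider
public class RequestContextFilter implements ContainerRequestFilter, ContainerResponseFilter {

    @Override
    public void filter(ContainerRequestContext request) {
        // Resolve the request-scoped context once, at the edge.
        String correlationId = getOrCreateCorrelationId(request);
        String tenantId = resolveTenant(request);
        Actor actor = resolveActor(request);
        RequestContextHolder.set(new RequestContext(correlationId, tenantId, actor));

        // Make the same values available to every log line written on this thread.
        MDC.put("correlationId", correlationId);
        MDC.put("tenantId", tenantId);
        MDC.put("actorId", actor.id());
    }

    @Override
    public void filter(ContainerRequestContext request, ContainerResponseContext response) {
        try {
            // Assumes current() returns null if the request filter failed before set().
            RequestContext context = RequestContextHolder.current();
            if (context != null) {
                response.getHeaders().putSingle("x-correlation-id", context.correlationId());
            }
        } finally {
            // Always clear thread-bound state so pooled threads do not leak context.
            MDC.clear();
            RequestContextHolder.clear();
        }
    }
}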
This filter should not decide business observability.
It only creates the request context.
Command handlers and workflow workers should add domain-specific events and span attributes.
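The filter above relies on a `getOrCreateCorrelationId` helper that is not shown. One possible core for it is sketched below, taking the raw `x-correlation-id` header value as a string. The class name `CorrelationIds`, the 64-character limit, the allowed character set, and the `corr-` prefix for generated values are all assumptions. The validation matters because the header is untrusted input that ends up in every log line.

```java
import java.util.UUID;
import java.util.regex.Pattern;

final class CorrelationIds {
    // Assumed rule: short, printable, no whitespace or control characters.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9._-]{1,64}");

    /** Reuses a well-formed inbound id; otherwise generates a fresh one. */
    static String getOrCreate(String headerValue) {
        if (headerValue != null && SAFE.matcher(headerValue).matches()) {
            return headerValue;
        }
        // A missing or suspicious value is replaced rather than sanitized.
        return "corr-" + UUID.randomUUID();
    }
}
```

Replacing a malformed value instead of trying to clean it keeps line breaks and other injected characters out of the log stream entirely.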
10. Kafka Observability
Kafka breaks synchronous trace continuity unless you deliberately propagate context.
Each event should include or carry headers for:
- event id;
- event type;
- aggregate type;
- aggregate id;
- aggregate version;
- causation id;
- correlation id;
- trace id or trace context;
- producer service;
- schema version;
- occurred at.
Example Kafka headers:
event_id=evt-991
event_type=QuoteAccepted
aggregate_type=QUOTE
aggregate_id=Q-1001
correlation_id=corr-7d8e
traceparent=00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
producer_service=quote-service
schema_version=1.2.0
Consumer logs should include:
- topic;
- partition;
- offset;
- consumer group;
- event id;
- aggregate id;
- correlation id;
- processing result;
- retry count;
- dead-letter status.
Kafka observability must answer:
- Was the event produced?
- Was the event visible in the topic?
- Did the intended consumer read it?
- Did the consumer process it?
- Did processing mutate local state?
- Did the consumer publish follow-up events?
- If it failed, where is the failure parked?
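A producer can make the header contract explicit by validating and encoding the header set in one place. The sketch below is an assumption-laden illustration: the class `EventHeaders` is hypothetical, and the choice of which headers are required (here, the ones that appear in the example headers above, minus `traceparent`) is a policy decision, not something the platform prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EventHeaders {
    // Assumed minimum contract; traceparent and causation_id are treated as optional here.
    private static final List<String> REQUIRED = List.of(
            "event_id", "event_type", "aggregate_type", "aggregate_id",
            "correlation_id", "producer_service", "schema_version");

    /** Validates the required headers and encodes every value as UTF-8 bytes. */
    static Map<String, byte[]> encode(Map<String, String> fields) {
        for (String key : REQUIRED) {
            String value = fields.get(key);
            if (value == null || value.isBlank()) {
                throw new IllegalArgumentException("missing required event header: " + key);
            }
        }
        Map<String, byte[]> headers = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue() != null) {
                headers.put(e.getKey(), e.getValue().getBytes(StandardCharsets.UTF_8));
            }
        }
        return headers;
    }
}
```

With the Kafka Java client, each entry could then be attached to a record through `ProducerRecord.headers().add(key, value)`. Failing at publish time when a required header is missing is cheaper than discovering an uncorrelatable event during an incident.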
11. Camunda 7 Observability
Camunda observability has two layers.
Engine/Technical Layer
Track:
- job executor health;
- failed jobs;
- incident count;
- external task lock duration;
- external task retries;
- external task failures;
- process instance count;
- history cleanup;
- database query latency;
- process deployment version.
Business Process Layer
Track:
- quote approvals waiting by age;
- approval SLA breach count;
- orders stuck in fulfillment by step;
- fallout cases opened by type;
- process instances with unknown external outcome;
- compensation required count;
- manual recovery actions;
- average quote approval time;
- average order fulfillment time.
Do not stop at engine metrics.
A Camunda engine can be technically healthy while the business is stuck.
The business dashboard must not rely only on process-engine health.
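As an illustration of a business-layer signal, the sketch below groups waiting approval tasks by age. The class `ApprovalAging` and the bucket boundaries (1 hour and 24 hours) are assumptions. In a real deployment the creation times might come from a Camunda task query (each `Task` exposes `getCreateTime()`); that wiring is omitted so the example stays runnable on the JDK alone.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ApprovalAging {
    private static final Duration ONE_HOUR = Duration.ofHours(1);
    private static final Duration ONE_DAY = Duration.ofHours(24);

    /** Counts waiting approvals per age bucket, relative to the given clock reading. */
    static Map<String, Integer> bucketByAge(List<Instant> taskCreatedAt, Instant now) {
        Map<String, Integer> buckets = new LinkedHashMap<>();
        buckets.put("under_1h", 0);
        buckets.put("1h_to_24h", 0);
        buckets.put("over_24h", 0);
        for (Instant createdAt : taskCreatedAt) {
            Duration age = Duration.between(createdAt, now);
            String bucket;
            if (age.compareTo(ONE_HOUR) < 0) {
                bucket = "under_1h";
            } else if (age.compareTo(ONE_DAY) < 0) {
                bucket = "1h_to_24h";
            } else {
                bucket = "over_24h";
            }
            buckets.merge(bucket, 1, Integer::sum);
        }
        return buckets;
    }
}
```

The bucket names are low-cardinality, so the counts can be exported as a gauge per bucket without violating the label rule from the metrics section.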
12. PostgreSQL Observability
For CPQ/OMS, database observability should cover:
- connection pool usage;
- query latency by operation;
- transaction duration;
- lock waits;
- deadlocks;
- row update conflicts;
- optimistic lock failures;
- migration duration;
- table/index growth;
- outbox backlog;
- audit table growth;
- projection lag;
- slow queries for search/reporting.
Domain-specific database metrics matter:
| Metric | Meaning |
|---|---|
| outbox_pending_records | Event publication backlog |
| idempotency_conflict_total | Duplicate or conflicting command attempts |
| quote_optimistic_lock_failure_total | Concurrent quote edit pressure |
| audit_append_failure_total | Critical audit health issue |
| projection_lag_seconds | Read model freshness |
If outbox_pending_records grows, the system may still accept commands but downstream systems are becoming stale.
If projection_lag_seconds grows, the UI may show old state.
If audit_append_failure_total is nonzero, treat it as severe.
13. Redis Observability
Redis observability in this platform should answer:
- Are hot keys forming?
- Are cache hit rates healthy?
- Are keys evicted unexpectedly?
- Are TTLs missing?
- Is memory pressure rising?
- Are lock-like keys expiring too late or too early?
- Are rate-limit keys exploding in cardinality?
- Are stale catalog/price caches being invalidated?
Useful metrics:
| Metric | Meaning |
|---|---|
| redis_cache_hit_ratio | Cache effectiveness |
| redis_operation_duration_seconds | Redis latency |
| redis_key_evictions_total | Memory pressure / policy effect |
| catalog_cache_stale_served_total | Intentional stale read served |
| price_preview_cache_miss_total | Pricing preview recomputation pressure |
| idempotency_fast_path_hit_total | Duplicate command short-circuit |
The important CPQ-specific rule:
Redis health is not business correctness. Redis only accelerates or protects. PostgreSQL remains the authority.
14. Dashboards by Question
Do not build dashboards by technology only.
Build dashboards by question.
Quote Conversion Dashboard
Answers:
- Are quotes moving through lifecycle?
- Where are they stuck?
- Is pricing slow?
- Are approvals delayed?
- Are documents generating successfully?
Signals:
- quote created/submitted/approved/accepted counts;
- pricing duration histogram;
- approval waiting age;
- quote document generation failures;
- quote rejection reason distribution.
Order Fulfillment Dashboard
Answers:
- Are orders being created and fulfilled?
- Which fulfillment steps are failing?
- Are external systems slow or unreliable?
- Is compensation increasing?
Signals:
- order created count;
- order fulfillment duration;
- fulfillment step failure count;
- unknown outcome count;
- fallout open count;
- external API latency/error rate.
Workflow Health Dashboard
Answers:
- Are Camunda jobs failing?
- Are external tasks locked too long?
- Are incidents increasing?
- Are approvals breaching SLA?
Signals:
- failed jobs;
- incidents;
- external task failure count;
- task age histogram;
- process instance age by process definition.
Integration Event Dashboard
Answers:
- Is outbox draining?
- Are Kafka consumers keeping up?
- Are events being dead-lettered?
- Are projections stale?
Signals:
- outbox pending count;
- event publish latency;
- consumer lag;
- DLQ count;
- projection lag.
15. Alerting Rules
Alert on user/business impact, not just CPU.
Examples:
| Alert | Severity | Meaning |
|---|---|---|
| Quote submit error rate above threshold | High | Customers/sales cannot progress quotes |
| Pricing p95 above SLO | High | CPQ UX degraded |
| Audit append failures | Critical | Defensibility broken |
| Outbox pending age above threshold | High | Downstream systems stale |
| Order fallout spike | High | Fulfillment reliability issue |
| Unknown outcome count rising | Critical | External consistency risk |
| Approval task SLA breach | Medium/High | Revenue workflow blocked |
| Kafka DLQ nonzero for critical event | High | Integration processing failure |
| Camunda incident count rising | High | Workflow execution failure |
| Projection lag above stale-data budget | Medium | UI/search may mislead users |
Avoid alert spam.
Each alert should have:
- owner;
- runbook link;
- severity;
- business impact statement;
- first diagnostic query;
- escalation rule.
An alert without an action is noise.
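The checklist above can be turned into a review-time check on alert definitions. The sketch below assumes alerts are described as simple key-value maps with snake_case field names; both the `AlertChecks` class and that representation are hypothetical and would need adapting to whatever format the alerting tool actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class AlertChecks {
    // One entry per item in the "each alert should have" list above.
    private static final List<String> REQUIRED = List.of(
            "owner", "runbook_link", "severity", "business_impact",
            "first_diagnostic_query", "escalation_rule");

    /** Returns the required fields that are absent or blank, in checklist order. */
    static List<String> missingFields(Map<String, String> alert) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = alert.get(key);
            if (value == null || value.isBlank()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

Running a check like this in the pipeline that deploys alert rules is one way to keep ownerless or runbook-less alerts from reaching production.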
16. SLO Thinking
A CPQ/OMS SLO should express user/business expectations.
Examples:
| Capability | SLO Candidate |
|---|---|
| Create quote | 99.9% successful under valid input within 1s |
| Price quote | 99% completed within 2s for standard catalog size |
| Submit quote | 99.9% accepted/rejected deterministically within 1s |
| Approval task visibility | 99% visible in worklist within 10s after submission |
| Accept quote | 99.9% creates exactly one order intent or deterministic rejection |
| Order event publication | 99.9% critical events published within 30s of DB commit |
| Projection freshness | 99% read model lag below 15s |
| Audit append | 100% for lifecycle-changing successful commands |
Be careful with 100% SLOs.
Use them only for invariants where failure should stop the command, such as audit append for lifecycle-changing actions.
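The arithmetic behind an SLO target is worth making explicit: a target of 99.9% over one million requests leaves room for 1,000 failures, while a 100% target leaves room for none. The sketch below computes that error budget with `BigDecimal` to avoid floating-point rounding. The class `ErrorBudget` and its method names are hypothetical, and the request counts are illustrative only.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class ErrorBudget {

    /** Failures permitted by a target such as "0.999" over the given number of requests. */
    static long allowedFailures(long totalRequests, String sloTarget) {
        BigDecimal target = new BigDecimal(sloTarget);
        if (target.compareTo(BigDecimal.ZERO) <= 0 || target.compareTo(BigDecimal.ONE) > 0) {
            throw new IllegalArgumentException("SLO target must be in (0, 1]: " + sloTarget);
        }
        return BigDecimal.ONE.subtract(target)
                .multiply(BigDecimal.valueOf(totalRequests))
                .setScale(0, RoundingMode.FLOOR) // partial failures do not count as budget
                .longValueExact();
    }

    /** Budget left after the observed failures; negative means the SLO is already missed. */
    static long remaining(long totalRequests, long observedFailures, String sloTarget) {
        return allowedFailures(totalRequests, sloTarget) - observedFailures;
    }
}
```

For example, `ErrorBudget.remaining(1_000_000, 250, "0.999")` leaves 750 failures of budget, whereas any target of `"1"` yields a budget of zero, which is why a 100% SLO only makes sense where a failure must stop the command.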
17. Runbook-Oriented Observability
When an operator gets an alert, the telemetry should support a path.
Example: QuoteAccepted event published but order not created.
Runbook path:
1. Search logs by correlation_id.
2. Find quote acceptance audit record.
3. Verify outbox record for QuoteAccepted.
4. Verify Kafka topic/partition/offset.
5. Check order consumer lag.
6. Check inbox/idempotency record in Order Service.
7. Check order creation transaction result.
8. Check Camunda process start command.
9. Check fallout case or DLQ.
10. Execute reconciliation if needed.
The telemetry model should make each step possible.
If step 4 and step 5 cannot be connected, Kafka observability is incomplete.
If step 8 cannot be connected to order id/business key, workflow observability is incomplete.
18. Anti-Patterns
Anti-Pattern 1 — Logging Full Payloads
This leaks sensitive data and creates noise.
Log identifiers, outcome, reason code, and safe summaries.
Anti-Pattern 2 — Metrics With High-Cardinality Labels
quote_id as a metric label will eventually hurt the metrics backend.
Use logs/traces for high-cardinality investigation.
Anti-Pattern 3 — Only Technical Dashboards
CPU, memory, and HTTP 500s are necessary but insufficient.
You also need quote/order/workflow health.
Anti-Pattern 4 — No Correlation Across Async Boundaries
If Kafka messages and Camunda tasks cannot be correlated to the original command, investigation becomes archaeology.
Anti-Pattern 5 — Treating Traces as Audit
Traces are not durable business evidence.
Use audit for accountability.
Use traces for diagnostics.
Anti-Pattern 6 — No Runbooks
Dashboards without runbooks produce anxiety, not operations.
Every critical alert needs a next action.
19. Production Readiness Checklist
Before go-live, verify:
- every inbound request gets or propagates a correlation id;
- W3C trace context is propagated across HTTP boundaries;
- Kafka events carry correlation/causation/event identifiers;
- Camunda process instances use meaningful business keys;
- external task logs include task id, topic, process instance id, business key, and retry count;
- audit records include trace/correlation/business identifiers;
- logs are structured and redacted;
- metrics avoid high-cardinality labels;
- dashboards answer business questions;
- alerts have owners and runbooks;
- outbox, inbox, DLQ, projection lag, and fallout are visible;
- observability works in failure drills, not only in happy-path demos.
20. The Design Standard
For this series, the observability rule is:
Every important business operation must be traceable from HTTP request to domain mutation, audit record, outbox event, Kafka processing, Camunda workflow, external task execution, downstream handoff, and final business state.
This does not mean every operation is synchronous.
It means every operation is explainable.
In a small app, observability helps developers debug.
In an enterprise CPQ/OMS platform, observability helps the organization operate, recover, defend, and improve.
References
- OpenTelemetry Documentation: https://opentelemetry.io/docs/
- OpenTelemetry Signals: https://opentelemetry.io/docs/concepts/signals/
- W3C Trace Context Recommendation: https://www.w3.org/TR/trace-context/
- W3C Trace Context Level 2: https://www.w3.org/TR/trace-context-2/
- OWASP Logging Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html