Contract Observability and Incident Response: Detect, Diagnose, Recover, and Improve
Learn Java API Contract Engineering, Event Contract Engineering & Schema Governance - Part 031
Contract observability and incident response for Java API and event platforms: metrics, logs, traces, violation dashboards, schema incidents, compatibility incidents, DLQ analysis, consumer lag, and governance feedback loops.
Learning Objectives
Contract governance does not end at CI, the registry, or the catalog. Contracts must be observable in production.
Production questions you must be able to answer quickly:
- Which API is sending responses that do not conform to its OpenAPI specification?
- Which events are failing schema validation?
- Which consumers crashed after a new enum value was introduced?
- Which schema version is the producer currently using?
- Are deprecated events still being consumed?
- Did the Kafka key change and cause out-of-order projections?
- Is the DLQ spike caused by a producer bug or a consumer bug?
- Is a registry outage blocking publishing?
- Is replay failing because an old schema is missing?
- Does the contract incident call for rollback, quarantine, an upcaster, a correction event, or a migration?
After this part, you should be able to:
- define metrics, logs, traces, dashboards, and alerts for contract conformance;
- build a contract violation taxonomy;
- detect schema, compatibility, registry, API response, event, and DLQ incidents;
- analyze consumer lag, unknown event types, invalid schemas, replay failures, and runtime drift;
- design incident runbooks;
- connect postmortems to governance improvements;
- define SLIs/SLOs for the contract platform;
- avoid observability anti-patterns such as high-cardinality metrics and noisy alerts.
1. Contract Observability Mental Model
Contract observability is the ability to see whether runtime behavior matches the declared contract.
Observability goals:
- detect violations quickly;
- identify artifact, version, producer, consumer, and owner;
- preserve evidence;
- reduce diagnosis time;
- support rollback, redrive, replay, and correction;
- feed governance rules, tests, and runbooks.
2. Contract Violation Taxonomy
Stable violation codes are foundational. Do not just log "validation failed".
2.1 API Violations
API_REQUEST_SCHEMA_INVALID
API_RESPONSE_SCHEMA_INVALID
API_UNDOCUMENTED_STATUS_CODE
API_UNDOCUMENTED_ERROR_CODE
API_ERROR_SHAPE_INVALID
API_IDEMPOTENCY_KEY_MISSING
API_IDEMPOTENCY_CONFLICT
API_REQUIRED_HEADER_MISSING
API_SECURITY_SCOPE_MISMATCH
API_DEPRECATED_OPERATION_USED
API_OPERATION_RETIRED_BUT_CALLED
2.2 Event Violations
EVENT_SCHEMA_INVALID
EVENT_ENVELOPE_INVALID
EVENT_TYPE_UNKNOWN
EVENT_TYPE_UNREGISTERED
EVENT_TOPIC_MISMATCH
EVENT_KEY_MISMATCH
EVENT_SOURCE_INVALID
EVENT_SEQUENCE_GAP
EVENT_OUT_OF_ORDER
EVENT_DUPLICATE
EVENT_REPLAY_UNSAFE
EVENT_CLASSIFICATION_VIOLATION
EVENT_DEPRECATED_CONSUMED
EVENT_RETIRED_PRODUCED
EVENT_SCHEMA_VERSION_UNSUPPORTED
2.3 Registry Violations
SCHEMA_NOT_FOUND
SCHEMA_ID_RESOLUTION_FAILED
SCHEMA_COMPATIBILITY_REJECTED
SCHEMA_REGISTRY_UNAVAILABLE
SCHEMA_REFERENCE_UNRESOLVED
SCHEMA_UNREGISTERED_PRODUCER
SCHEMA_VERSION_UNSUPPORTED
2.4 Governance Violations
CONTRACT_OWNER_MISSING
CONTRACT_LIFECYCLE_MISSING
CONTRACT_EXCEPTION_EXPIRED
CONTRACT_DEPRECATED_PAST_SUNSET
CONTRACT_RUNTIME_DRIFT
CONTRACT_CATALOG_METADATA_MISMATCH
CONTRACT_UNKNOWN_CONSUMER_DETECTED
Stable codes enable dashboards, alerts, runbooks, and postmortem automation.
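One way to keep the codes stable is to define them once in a shared type that every service reuses. The following is a minimal sketch, not a prescribed design: the `ViolationCode` enum, its `Category` grouping, and the `byCategory` helper are hypothetical names, and only a handful of codes from the taxonomy above are listed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical shared enum: the constant name is the stable code that
// appears in metrics, logs, alerts, and runbooks.
enum ViolationCode {
    API_RESPONSE_SCHEMA_INVALID(Category.API),
    EVENT_SCHEMA_INVALID(Category.EVENT),
    EVENT_SEQUENCE_GAP(Category.EVENT),
    SCHEMA_REGISTRY_UNAVAILABLE(Category.REGISTRY),
    CONTRACT_RUNTIME_DRIFT(Category.GOVERNANCE);

    // Mirrors the four taxonomy sections: API, event, registry, governance.
    enum Category { API, EVENT, REGISTRY, GOVERNANCE }

    private final Category category;

    ViolationCode(Category category) {
        this.category = category;
    }

    Category category() {
        return category;
    }

    // Lists all codes of one category, for example to build a dashboard filter.
    static List<ViolationCode> byCategory(Category category) {
        return Arrays.stream(values())
            .filter(code -> code.category == category)
            .collect(Collectors.toList());
    }
}
```

Because the metric label, the log field, and the alert text would all use `name()`, renaming a constant is a breaking change and deserves the same review as a schema change.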
3. Metrics Design
3.1 API Metrics
api_contract_request_validation_total
api_contract_response_validation_total
api_contract_violation_total
api_contract_deprecated_usage_total
api_contract_error_code_total
api_contract_undocumented_status_total
api_contract_idempotency_conflict_total
Recommended bounded labels:
service
api
operationId
status
violationCode
environment
Avoid high-cardinality labels:
requestId
customerId
caseId
userId
rawPathWithIds
correlationId
3.2 Event Metrics
event_contract_published_total
event_contract_consumed_total
event_contract_violation_total
event_contract_dlq_total
event_contract_quarantine_total
event_contract_duplicate_total
event_contract_sequence_gap_total
event_contract_unknown_type_total
event_contract_schema_validation_total
event_contract_deprecated_usage_total
Recommended labels:
service
topic
eventType
schemaRef
consumerGroup
violationCode
environment
Avoid:
eventId
aggregateId
customerId
caseId
Use logs/traces for those IDs.
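A guard in front of the metrics library can enforce the bounded label set mechanically. This is a sketch under assumptions: `MetricLabelGuard` and its `bounded` method are hypothetical, and the allow-list simply copies the recommended event labels above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical guard that keeps only allow-listed label keys before they
// reach the metrics backend.
final class MetricLabelGuard {
    // Bounded labels taken from the recommended event label list.
    private static final Set<String> ALLOWED = Set.of(
        "service", "topic", "eventType", "schemaRef",
        "consumerGroup", "violationCode", "environment");

    private MetricLabelGuard() {}

    // Returns a copy containing only bounded labels; IDs such as eventId
    // or customerId are dropped and belong in logs or traces instead.
    static Map<String, String> bounded(Map<String, String> candidate) {
        Map<String, String> result = new LinkedHashMap<>();
        candidate.forEach((key, value) -> {
            if (ALLOWED.contains(key)) {
                result.put(key, value);
            }
        });
        return result;
    }
}
```

Calling `MetricLabelGuard.bounded(Map.of("topic", "case-events", "eventId", "evt_1"))` keeps `topic` and silently drops `eventId`. A stricter variant could throw in test environments so the mistake is caught before deployment.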
3.3 Registry Metrics
schema_registry_lookup_total
schema_registry_lookup_latency_seconds
schema_registry_lookup_failure_total
schema_registry_compatibility_check_total
schema_registry_registration_total
schema_registry_cache_hit_total
schema_registry_cache_miss_total
Registry is part of platform runtime. Treat it like critical infrastructure.
4. Logs
Logs capture high-cardinality diagnosis details.
Structured contract violation log:
{
"level": "WARN",
"event": "contract_violation",
"violationCode": "EVENT_SEQUENCE_GAP",
"service": "case-projection-service",
"topic": "case-events",
"eventType": "CaseApproved",
"schemaRef": "case.CaseApproved:7",
"eventId": "evt_01J...",
"aggregateId": "case_01J...",
"expectedVersion": 18,
"actualVersion": 20,
"correlationId": "corr_01J...",
"action": "quarantined"
}
Rules:
- include eventId/requestId/correlationId in logs;
- never log raw sensitive payload by default;
- include action taken: ignored, quarantined, retried, rejected, redriven;
- include schemaRef/version;
- include topic/operationId/consumerGroup;
- include DLQ/quarantine reference.
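The rules above can be encoded in a small helper so that every service emits the same fields. A minimal sketch, assuming a hypothetical `ViolationLog` class; the `quarantineRef` field name is also an assumption, and the raw payload is deliberately not a parameter.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper that builds the fields of a contract violation log entry.
final class ViolationLog {
    // The actions named in the rules above.
    enum Action { IGNORED, QUARANTINED, RETRIED, REJECTED, REDRIVEN }

    private ViolationLog() {}

    // High-cardinality IDs are welcome here because this is a log, not a metric.
    // There is no payload parameter, so sensitive data cannot leak by default.
    static Map<String, Object> fields(String violationCode, String topic, String schemaRef,
                                      String eventId, String correlationId,
                                      Action action, String quarantineRef) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("event", "contract_violation");
        fields.put("violationCode", violationCode);
        fields.put("topic", topic);
        fields.put("schemaRef", schemaRef);
        fields.put("eventId", eventId);
        fields.put("correlationId", correlationId);
        fields.put("action", action.name().toLowerCase(Locale.ROOT));
        fields.put("quarantineRef", quarantineRef);
        return fields;
    }
}
```

The resulting map can be handed to whatever structured logging encoder the service already uses.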
5. Traces
Trace contract boundaries:
- HTTP request enters gateway;
- service validates request;
- service persists state;
- outbox event is created;
- event is published;
- consumer receives event;
- consumer validates envelope/schema;
- consumer updates projection or triggers side effect;
- DLQ/quarantine if failed.
Useful trace attributes:
contract.api.operation_id
contract.event.type
contract.schema.ref
contract.violation.code
messaging.destination.name
messaging.kafka.message.key
6. Dashboards
6.1 Contract Health Dashboard
Show:
- total contract violations by service;
- top violation codes;
- API response validation failures;
- event schema validation failures;
- DLQ/quarantine trend;
- schema registry failures;
- deprecated usage;
- runtime drift findings;
- high-risk topics/events;
- open incidents.
6.2 Event Stream Dashboard
For each topic/event:
- publish rate;
- consume rate by group;
- lag;
- DLQ count;
- schema versions observed;
- unknown event type count;
- duplicate count;
- sequence gap count;
- replay activity;
- deprecated consumption.
6.3 API Contract Dashboard
For each operation:
- request validation failures;
- response validation failures;
- undocumented status codes;
- error code distribution;
- deprecated endpoint usage;
- client IDs;
- SDK versions if available.
7. Alerting
Alert only on actionable signals.
Critical alerts:
- invalid event published to tier-1 topic;
- schema registry unavailable in production;
- DLQ spike on payment/ledger topic;
- response validation failure for public API above threshold;
- retired event still produced;
- Kafka key mismatch detected;
- schema ID resolution failure during replay.
Warning alerts:
- deprecated event still consumed after sunset;
- unknown consumer group appears;
- new schema version observed before catalog update;
- response validation shadow failure increasing;
- unknown enum value seen by old consumer.
Bad alert:
Contract violation happened.
Good alert:
EVENT_SCHEMA_INVALID increased to 8% for PaymentCaptured on payment-events.
Owner: payment-platform.
Top producer version: payment-service 4.17.2.
Action: check recent schema change and DLQ sample.
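The difference between the bad and the good alert is that the good one carries a rate, a scope, and an owner. A sketch of that evaluation, assuming a hypothetical `ViolationAlert` helper and a threshold expressed as a fraction; in practice this logic usually lives in the alerting system's rule language rather than in application code.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical evaluator that only produces an alert when it is actionable.
final class ViolationAlert {
    private ViolationAlert() {}

    // Returns an alert message only when violations / total reaches the
    // threshold (a fraction such as 0.05 for 5%); otherwise stays silent.
    static Optional<String> evaluate(String violationCode, String eventType, String topic,
                                     String owner, long violations, long total, double threshold) {
        if (total <= 0) {
            return Optional.empty(); // no traffic, nothing to judge
        }
        double rate = (double) violations / total;
        if (rate < threshold) {
            return Optional.empty();
        }
        return Optional.of(String.format(Locale.ROOT,
            "%s increased to %.0f%% for %s on %s. Owner: %s.",
            violationCode, rate * 100, eventType, topic, owner));
    }
}
```

With 8 violations out of 100 events and a 5% threshold, the message names the code, the event type, the topic, and the owner; with 1 violation out of 100 no alert is raised.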
8. Incident Types and Diagnosis
8.1 Producer Published Invalid Event
Symptoms:
- consumer deserialization failures;
- DLQ spike;
- schema validation failures;
- unknown schema ID;
- projection gaps.
Response:
- stop or rollback producer;
- identify bad event range by offset/time/schema;
- quarantine invalid events;
- publish correction events or replay fixed events;
- notify consumers;
- update producer tests/gates.
8.2 Consumer Breaks on New Compatible Schema
Symptoms:
- one consumer group DLQ spike;
- schema registry passed;
- other consumers okay.
Likely causes:
- strict parser in the consumer;
- enum switch without a default branch;
- rejection of unknown fields;
- outdated generated model.
Response:
- pause or patch consumer;
- temporarily suppress new value if producer can;
- redrive DLQ after fix;
- add consumer contract tests.
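The enum case is the most common form of this incident, and a tolerant reader avoids it. A minimal sketch, assuming a hypothetical `CaseStatus` enum with an explicit `UNKNOWN` constant; the status values themselves are illustrative.

```java
// Hypothetical consumer-side enum with an explicit fallback constant.
enum CaseStatus {
    OPEN, APPROVED, REJECTED, UNKNOWN;

    // Maps unrecognized wire values to UNKNOWN instead of throwing, so a
    // value added by the producer does not crash an older consumer.
    static CaseStatus fromWire(String value) {
        if (value == null) {
            return UNKNOWN;
        }
        try {
            return valueOf(value);
        } catch (IllegalArgumentException unrecognized) {
            return UNKNOWN;
        }
    }
}

final class CaseStatusHandler {
    private CaseStatusHandler() {}

    // The default branch is the contract test target: every switch over a
    // wire enum needs a safe path for values the consumer does not know.
    static String route(CaseStatus status) {
        switch (status) {
            case OPEN:
                return "keep-open";
            case APPROVED:
                return "apply-approval";
            case REJECTED:
                return "apply-rejection";
            default:
                return "skip-and-count-unknown";
        }
    }
}
```

A consumer contract test can then feed a value such as `"ESCALATED"` through `fromWire` and assert that `route` takes the default path rather than throwing.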
8.3 Schema Registry Outage
Symptoms:
- serializers/deserializers fail lookup;
- producer publish failure;
- consumer parse failure for uncached schemas.
Response:
- check registry SLO;
- confirm cache behavior;
- failover/restore registry;
- pause deployments requiring new schemas;
- avoid auto-register retry storms.
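Whether an outage is survivable depends largely on cache behavior. The sketch below models the idea only: `CachingSchemaResolver` is hypothetical, and the registry client is reduced to a plain function from schema ID to schema text rather than any real registry API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical resolver: cached schemas keep working while the registry is down.
final class CachingSchemaResolver {
    private final Function<Integer, String> registryLookup;
    private final Map<Integer, String> cache = new ConcurrentHashMap<>();

    CachingSchemaResolver(Function<Integer, String> registryLookup) {
        this.registryLookup = registryLookup;
    }

    // Serves cached schemas without touching the registry. A failed lookup
    // returns empty once instead of retrying in a tight loop; the caller
    // decides whether to record SCHEMA_REGISTRY_UNAVAILABLE and back off.
    Optional<String> resolve(int schemaId) {
        String cached = cache.get(schemaId);
        if (cached != null) {
            return Optional.of(cached);
        }
        try {
            String schema = registryLookup.apply(schemaId);
            if (schema == null) {
                return Optional.empty();
            }
            cache.put(schemaId, schema);
            return Optional.of(schema);
        } catch (RuntimeException registryFailure) {
            return Optional.empty();
        }
    }
}
```

This explains the symptom pattern in the list above: consumers keep parsing events whose schemas were already cached, while events with uncached schema IDs fail until the registry is restored.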
8.4 Deprecated Contract Removed Too Early
Symptoms:
- clients get 404/410;
- event stops arriving;
- consumer workflow starves.
Response:
- restore old endpoint/event if possible;
- notify impacted consumers;
- update catalog usage detection;
- revise retirement gate.
8.5 Kafka Key Change Breaks Projection
Symptoms:
- out-of-order events;
- sequence gaps;
- projection corruption;
- partition distribution changed.
Response:
- rollback key change;
- quarantine affected offsets;
- rebuild projection from safe point;
- add key contract tests and diff gate.
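The ordering symptoms above can be detected per aggregate on the consumer side. A sketch, assuming each event carries a per-aggregate version that increases by one (as the expectedVersion and actualVersion fields in the log example suggest); `SequenceChecker` is a hypothetical name, and real state would live in the projection store rather than in memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical per-aggregate sequence check that returns a stable violation code.
final class SequenceChecker {
    private final Map<String, Long> lastVersionByAggregate = new HashMap<>();

    // Returns empty when the event is the expected next version, otherwise
    // the violation code. State only advances on accepted events.
    Optional<String> check(String aggregateId, long version) {
        Long last = lastVersionByAggregate.get(aggregateId);
        if (last == null || version == last + 1) {
            // First event seen for the aggregate is taken as the baseline.
            lastVersionByAggregate.put(aggregateId, version);
            return Optional.empty();
        }
        if (version == last) {
            return Optional.of("EVENT_DUPLICATE");
        }
        if (version < last) {
            return Optional.of("EVENT_OUT_OF_ORDER");
        }
        return Optional.of("EVENT_SEQUENCE_GAP");
    }
}
```

After a key change, events of one aggregate can land on different partitions, so a consumer that previously saw versions 17 and 18 may receive 20 before 19; the checker reports that as `EVENT_SEQUENCE_GAP` instead of letting the projection apply it.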
9. Incident Runbook Template
# Contract Incident Runbook
## 1. Identify
- violation code:
- artifact:
- API/event/schema/topic:
- owner:
- severity:
- start time:
## 2. Scope
- affected producers:
- affected consumers:
- schema versions:
- topic offsets/time range:
- request IDs/correlation IDs:
- data classification:
## 3. Stop the Bleeding
- rollback producer?
- disable feature flag?
- pause consumer?
- fail-open/shadow?
- block bad schema?
- quarantine invalid events?
## 4. Preserve Evidence
- DLQ samples:
- logs:
- traces:
- schema versions:
- contract diff:
- deployment version:
## 5. Recover
- redrive DLQ?
- replay events?
- publish correction?
- rebuild projection?
- restore schema?
- notify consumers?
## 6. Prevent
- new test:
- new lint rule:
- new diff rule:
- new policy:
- catalog metadata update:
- owner/action items:
10. DLQ Analysis
DLQ is evidence, not trash.
Group DLQ by:
- failure code;
- event type;
- producer version;
- schemaRef;
- consumer group;
- topic/partition;
- time window.
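Grouping is easiest to reason about with a concrete signature per record. A minimal sketch, assuming a hypothetical `DlqRecord` shape (Java 16 or later for records) that carries a subset of the dimensions listed above:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical DLQ record with the context needed for grouping.
record DlqRecord(String failureCode, String eventType, String producerVersion,
                 String schemaRef, String consumerGroup) {}

final class DlqAnalysis {
    private DlqAnalysis() {}

    // Counts records per (failureCode | schemaRef | producerVersion) so that
    // a shared cause, such as one producer release, stands out immediately.
    static Map<String, Long> countBySignature(List<DlqRecord> records) {
        return records.stream().collect(Collectors.groupingBy(
            r -> r.failureCode() + "|" + r.schemaRef() + "|" + r.producerVersion(),
            TreeMap::new,
            Collectors.counting()));
    }
}
```

If nearly all records share one signature across several consumer groups, the producer is the likely culprit; if a single consumer group dominates while the signature varies, look at the consumer first.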
DLQ spike questions:
- Did producer deploy recently?
- Did schema version change?
- Did consumer deploy recently?
- Is failure deserialization, schema validation, or domain state?
- Are all consumers affected or one group?
- Does invalid event share producer version?
- Is event out-of-order or duplicate?
- Is registry lookup failing?
- Is DLQ itself failing?
Redrive only after:
- root cause fixed;
- idempotency verified;
- ordering implications considered;
- side effects safe;
- consumer can handle event;
- redrive rate controlled;
- audit record created.
11. Consumer Lag and Contract Incidents
Consumer lag can hide compatibility incidents.
Scenario:
- producer publishes new event version at 10:00;
- consumer is lagging 6 hours;
- failure starts at 16:00 when it reaches new events.
Do not correlate incident only with wall-clock deploy time. Use:
- event publishedAt;
- Kafka offset;
- schemaRef;
- producer version;
- consumer lag;
- consumer processed timestamp.
Metrics:
consumer_lag
event_published_at_to_processed_at_latency
schema_version_processed_by_consumer
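The 10:00 and 16:00 scenario becomes clear once the investigation uses event time rather than failure time. A sketch with hypothetical helper names, assuming the event envelope carries a `publishedAt` timestamp:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helpers for building a lag-aware incident timeline.
final class LagAwareTimeline {
    private LagAwareTimeline() {}

    // Event-time latency: how long the event waited between publish and processing.
    static Duration publishToProcess(Instant publishedAt, Instant processedAt) {
        return Duration.between(publishedAt, processedAt);
    }

    // True when the failing event was published at or after the producer
    // deploy, even if the failure surfaced hours later in wall-clock time.
    static boolean publishedAfterDeploy(Instant publishedAt, Instant producerDeployedAt) {
        return !publishedAt.isBefore(producerDeployedAt);
    }
}
```

For a first failing event published at 10:00 and processed at 16:00, `publishToProcess` yields six hours, and `publishedAfterDeploy` points back at the 10:00 producer deploy even though nothing was deployed around 16:00.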
12. Replay Observability
Replay should be observable separately from live processing.
Metrics:
event_replay_started_total
event_replay_completed_total
event_replay_failed_total
event_replay_events_processed_total
event_replay_side_effect_skipped_total
event_replay_schema_unsupported_total
Replay logs must include:
- replay ID;
- source topic/offset/time range;
- event types;
- schema versions;
- side-effect mode;
- result.
Replay failure is often a compatibility failure discovered late.
13. Schema Version Observability
Track schema versions observed in runtime.
Producer metric:
event_schema_version_published_total{eventType="CaseApproved", schemaRef="case.CaseApproved:7"}
Consumer metric:
event_schema_version_consumed_total{consumerGroup="case-projection", schemaRef="case.CaseApproved:7"}
Use for:
- migration tracking;
- old schema retirement;
- replay readiness;
- incident diagnosis;
- consumer lag understanding.
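For old schema retirement, the consumed-version signal answers a concrete question: which groups still process this schemaRef? A sketch, assuming a hypothetical in-memory `SchemaVersionTracker`; in practice the same answer would normally come from querying the consumer metric above.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker of schema versions observed per consumer group.
final class SchemaVersionTracker {
    private final Map<String, Set<String>> consumedByGroup = new ConcurrentHashMap<>();

    // Called wherever the consumer metric would be incremented.
    void recordConsumed(String consumerGroup, String schemaRef) {
        consumedByGroup
            .computeIfAbsent(consumerGroup, group -> ConcurrentHashMap.newKeySet())
            .add(schemaRef);
    }

    // Groups that processed the given schemaRef during the observation window.
    // An empty result is necessary, but not sufficient, for retirement:
    // lagging or replaying consumers may still need the schema later.
    Set<String> groupsStillConsuming(String schemaRef) {
        Set<String> groups = new TreeSet<>();
        consumedByGroup.forEach((group, refs) -> {
            if (refs.contains(schemaRef)) {
                groups.add(group);
            }
        });
        return groups;
    }
}
```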
14. Contract SLOs
Possible platform SLOs:
14.1 Registry SLO
99.9% schema lookup success over 30 days.
p95 schema lookup latency < 50ms.
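A 99.9% success target translates into an error budget of one failed lookup per thousand. The arithmetic can be sketched as follows; `RegistrySlo` is a hypothetical helper, and integer math is used to avoid floating-point rounding at the boundary.

```java
// Hypothetical error-budget arithmetic for the 99.9% lookup success SLO.
final class RegistrySlo {
    // 99.9% success leaves a budget of 1 failure per 1000 lookups.
    private static final long LOOKUPS_PER_ALLOWED_FAILURE = 1000;

    private RegistrySlo() {}

    // Number of failed lookups the 30-day window can absorb.
    static long allowedFailures(long totalLookups) {
        return totalLookups / LOOKUPS_PER_ALLOWED_FAILURE;
    }

    // failures / total <= 0.001 is equivalent to failures * 1000 <= total.
    static boolean met(long totalLookups, long failedLookups) {
        return failedLookups * LOOKUPS_PER_ALLOWED_FAILURE <= totalLookups;
    }

    // Remaining budget; negative means the SLO is already breached.
    static long budgetRemaining(long totalLookups, long failedLookups) {
        return allowedFailures(totalLookups) - failedLookups;
    }
}
```

With 1,000,000 lookups in the window, the budget is 1,000 failures: 1,000 failed lookups still meets the SLO, while 1,001 breaches it.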
14.2 Contract Validation SLO
99.99% of produced tier-1 events validate against registered schema.
14.3 Catalog Freshness SLO
95% of contract metadata changes visible in catalog within 10 minutes.
14.4 Deprecation SLO
0 retired artifacts with active runtime usage.
14.5 Incident Detection SLO
Tier-1 contract violations alert within 5 minutes.
SLOs must be tied to business risk.
15. Java Observability Implementation
15.1 Metrics Wrapper
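import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public final class ContractMetrics {
    private final MeterRegistry registry;

    public ContractMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // register(...) returns the already-registered counter when the same
    // name and tags were used before, so this is safe to call per violation.
    public void recordViolation(String service, String artifact, String violationCode) {
        Counter.builder("contract_violation_total")
            .tag("service", service)
            .tag("artifact", artifact)
            .tag("violationCode", violationCode)
            .register(registry)
            .increment();
    }
}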
Keep tags bounded.
15.2 Structured Logging
// kv(...) is assumed here to be StructuredArguments.kv from logstash-logback-encoder (static import).
log.warn("contract violation",
    kv("violationCode", violation.code()),
    kv("eventType", event.metadata().eventType()),
    kv("eventId", event.metadata().eventId()),
    kv("correlationId", event.metadata().correlationId()),
    kv("action", "quarantined")
);
15.3 Trace Attributes
Span.current().setAttribute("contract.event.type", event.metadata().eventType());
Span.current().setAttribute("contract.schema.ref", event.metadata().schemaRef());
Span.current().setAttribute("contract.violation.code", violation.code());
16. Postmortem Questions
For every contract incident:
- What contract promise was violated?
- Was artifact wrong, implementation wrong, or consumer assumption wrong?
- Did CI catch it?
- Did the registry compatibility check pass? If so, why was it insufficient?
- Did catalog know impacted consumers?
- Did runtime observability detect quickly?
- Was DLQ evidence sufficient?
- Could rollback/replay recover safely?
- What test/lint/diff/policy should be added?
- What ownership/lifecycle metadata was missing?
Postmortem output should update the governance system.
17. Anti-Patterns
- No stable violation codes.
- Metrics with an eventId label.
- DLQ without original context.
- Logs with sensitive payload.
- Alerts without owner.
- Catalog not linked from alert.
- Runtime signals not feeding governance.
- No replay metrics.
- Only producer metrics, no consumer visibility.
- No schema version tracking.
- Deprecated usage invisible.
- Contract incident postmortems produce reminders, not rules/tests.
18. Practice Lab
Lab 1 — Metrics Design
Define metrics for event PaymentCaptured including publish, consume, validation failure, DLQ, duplicate, schema version, and deprecated usage.
Lab 2 — Incident Runbook
Write runbook for EVENT_SCHEMA_INVALID spike on payment-events.
Lab 3 — DLQ Analysis
Given DLQ samples all have schemaRef PaymentCaptured:9 and producer version 4.17.2, design diagnosis steps.
Lab 4 — Consumer Lag
Consumer fails 6 hours after producer deploy. Explain how lag changes investigation.
Lab 5 — Postmortem Action
Incident: consumer crashed on new enum value. Propose tests, lint, diff, and policy updates.
Lab 6 — Dashboard
Design dashboard for deprecated event usage and schema version migration.
19. Senior Engineer Heuristics
- Contract conformance must be observable in production.
- Stable violation codes are the foundation.
- Metrics show pattern; logs show details; traces show flow.
- DLQ is evidence, not trash.
- Consumer lag changes incident timelines.
- Track schema versions published and consumed.
- Deprecated usage must be visible.
- Registry health is platform health.
- Avoid high-cardinality metrics.
- Runtime drift detection closes the governance loop.
- Incident runbooks should start from artifact ownership.
- Redrive requires idempotency and ordering analysis.
- Postmortems should produce tests/rules, not only reminders.
- Observability without action ownership is noise.
- A top-tier platform learns from every contract incident.
20. Summary
Contract observability and incident response ensure that API/event/schema contracts are not only designed and tested, but monitored and improved in production. Metrics, logs, traces, DLQ analysis, schema version telemetry, consumer lag, catalog context, and runbooks work together.
Main takeaways:
- define stable violation taxonomy;
- instrument API, event, registry, DLQ, replay, and governance signals;
- build dashboards for contract health, deprecation, schema migration, and DLQ;
- alert only on actionable risk;
- preserve evidence for diagnosis;
- use catalog for ownership and impact analysis;
- track schema versions and consumer lag;
- redrive/replay carefully;
- postmortems must update tests, lint rules, diff rules, policies, and runbooks;
- contract engineering becomes mature when production incidents improve the governance system.
The next part is the final capstone: building a Java Contract Governance Platform for API and event contracts end to end.