
Observability, Metrics, Logs, and Traces

Learn Java Kafka in Action - Part 029

Observability handbook for production Kafka systems: metrics, logs, traces, lag, SLOs, alerting, dashboards, and incident runbooks.


Part 029 — Observability, Metrics, Logs, and Traces

Observability in Kafka is not "having Grafana dashboards". Observability is the ability to answer operational questions quickly and correctly:

  • Is the platform accepting writes?
  • Are records durable?
  • Are consumers keeping up?
  • Are specific business workflows stuck?
  • Is lag caused by producer volume, consumer slowness, broker saturation, rebalance, partition skew, or downstream failure?
  • Can we prove what happened for an individual event?
  • Can we replay safely?

Kafka systems fail in layers. A Java service may be healthy while its consumer group is stuck. A broker may be alive while a partition is under-replicated. A Kafka Streams app may be running while one task is endlessly restoring state. A DLQ may be receiving records quietly while business processing is effectively degraded.

This part builds a production observability model for Kafka-based systems.

Learning Goals

After this part, you should be able to:

  1. Design observability across broker, topic, partition, producer, consumer, Kafka Streams, Kafka Connect, ksqlDB, and domain workflow layers.
  2. Distinguish metric symptoms from root causes.
  3. Build lag alerts that do not page unnecessarily during normal bursts.
  4. Design structured logs and trace propagation for event-driven systems.
  5. Build incident runbooks for lag, rebalance storm, under-replication, DLQ spike, processing errors, and state restore.
  6. Define SLOs that reflect user/business impact, not only infrastructure availability.

Mental Model: Kafka Observability Is Multi-Layer Causality

In a typical Kafka incident, the visible symptom is consumer lag, but the root cause may be any of the following:

  • producer traffic increase;
  • partition skew;
  • broker disk/network saturation;
  • consumer CPU saturation;
  • external database latency;
  • retry storm;
  • rebalance loop;
  • schema deserialization errors;
  • Kafka Streams state restore;
  • DLQ replay overload;
  • overloaded Connect sink;
  • ksqlDB query repartitioning unexpectedly.

The observability goal is not only to see that lag exists. The goal is to classify the cause fast enough to choose the correct action.

Observability Stack

A production Kafka platform needs five signal layers.

A common weak design is to monitor only L2 and L3. That tells you Kafka is alive, but not whether a regulatory case, quote, fulfillment order, or payment settlement is delayed.

A senior Kafka engineer defines domain-level SLOs above Kafka infrastructure metrics.

Observability vs Monitoring

Term | Meaning | Kafka Example
Monitoring | Known checks for known failure modes | Alert when under-replicated partitions > 0
Observability | Ability to investigate unknown failure modes | Trace one event through producer, topic, stream app, sink DB
Telemetry | Raw emitted data | JMX metric, log line, span, audit event
SLI | Quantified service behavior | 99% of order events processed within 2 minutes
SLO | Target for SLI | Order projection freshness p99 < 120s
Alert | Human-actionable notification | Page when freshness SLO burn rate is high
Runbook | Decision procedure during incident | Lag triage flow with remediation steps

Signal 1: Broker and Controller Metrics

Kafka brokers expose many metrics. The point is not to alert on all of them. The point is to select signals that represent durability, availability, throughput, saturation, and control-plane health.

Broker Health Metrics

Category | Metric Concept | Why It Matters | Typical Alert Strategy
Availability | Broker process up | Basic liveness | Page if broker loss reduces redundancy or availability
Durability | Under-replicated partitions | Replicas are not fully caught up | Page if sustained
Durability | Offline partitions | Partition unavailable | Immediate page
Replication | ISR shrink/expand rate | Replica instability | Alert if frequent or correlated with network/disk issue
Controller | Active controller count | Exactly one active controller expected | Alert if none or unstable
Request latency | Produce/fetch request p95/p99 | Client-visible latency | Alert based on SLO impact
Network | Request queue / network processor idle | Saturation signal | Alert if sustained saturation
Disk | Log flush/write latency, disk usage | Broker throughput and retention safety | Alert before disk-full
JVM | GC pause, heap, CPU | Broker process stability | Alert if correlated with latency
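
The "alert before disk-full" strategy is easier to act on as a time-to-full estimate than as a raw usage percentage. The sketch below is one way to compute it, assuming you already collect free bytes and a net growth rate (ingest minus what retention and compaction free up). The class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.Optional;

/** Sketch: estimates how long until a broker log volume fills up. */
final class DiskTimeToFull {

    private DiskTimeToFull() {}

    /**
     * @param freeBytes            bytes still free on the log volume
     * @param netGrowthBytesPerSec ingest rate minus the rate freed by retention/compaction
     * @return time until full, or empty when usage is flat or shrinking
     */
    static Optional<Duration> estimate(long freeBytes, double netGrowthBytesPerSec) {
        if (freeBytes < 0) {
            throw new IllegalArgumentException("freeBytes must be >= 0");
        }
        if (netGrowthBytesPerSec <= 0.0) {
            return Optional.empty();
        }
        long seconds = (long) (freeBytes / netGrowthBytesPerSec);
        return Optional.of(Duration.ofSeconds(seconds));
    }

    /** True when the estimated time-to-full is shorter than the alert threshold. */
    static boolean shouldAlert(long freeBytes, double netGrowthBytesPerSec, Duration threshold) {
        return estimate(freeBytes, netGrowthBytesPerSec)
                .map(timeToFull -> timeToFull.compareTo(threshold) < 0)
                .orElse(false);
    }
}
```

The threshold should leave enough time for the slowest remediation you rely on, such as adding brokers or reassigning partitions.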

Topic/Partition Metrics

Topic-level metrics help answer whether the incident is global or localized.

Question | Metric Direction
Is one topic generating abnormal volume? | bytes/messages in per topic
Is one partition hot? | per-partition bytes/messages, leader request rate
Is retention at risk? | disk usage by topic, retention bytes/time
Is compaction keeping up? | log cleaner backlog, dirty ratio
Is replication healthy? | ISR size, under-replicated partition count

A hot partition is often invisible if you only look at topic aggregate metrics.

A topic can look healthy in aggregate while one partition dominates end-to-end latency.
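
One simple way to surface a hot partition is to compare the busiest partition against the mean across all partitions of the topic. The sketch below assumes you already have per-partition byte rates from your metrics backend. The class name and the idea of a single "skew ratio" are illustrative, not a Kafka-provided metric.

```java
import java.util.Map;

/** Sketch: quantifies partition skew from per-partition throughput samples. */
final class PartitionSkew {

    private PartitionSkew() {}

    /** Ratio of the busiest partition's rate to the mean rate; 1.0 means perfectly even. */
    static double skewRatio(Map<Integer, Double> bytesInPerSecByPartition) {
        requireNonEmpty(bytesInPerSecByPartition);
        double max = 0.0;
        double sum = 0.0;
        for (double rate : bytesInPerSecByPartition.values()) {
            max = Math.max(max, rate);
            sum += rate;
        }
        double mean = sum / bytesInPerSecByPartition.size();
        return mean == 0.0 ? 1.0 : max / mean;
    }

    /** Partition id with the highest rate. */
    static int hottestPartition(Map<Integer, Double> bytesInPerSecByPartition) {
        requireNonEmpty(bytesInPerSecByPartition);
        int hottest = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> entry : bytesInPerSecByPartition.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                hottest = entry.getKey();
            }
        }
        return hottest;
    }

    private static void requireNonEmpty(Map<Integer, Double> rates) {
        if (rates.isEmpty()) {
            throw new IllegalArgumentException("at least one partition is required");
        }
    }
}
```

With three partitions at 100 B/s and one at 700 B/s, the topic total looks modest, but the skew ratio is 2.8 and points straight at the outlier.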

Signal 2: Producer Metrics

A producer incident is often mistaken for broker trouble. Java producers buffer, batch, compress, retry, and block under pressure. You need producer-side metrics to see this.

Metric Concept | Meaning | Failure Signal
record-send-rate | How many records the producer sends | Drop may mean upstream issue or producer block
record-error-rate | Failed sends | Broker/auth/schema/timeout issue
record-retry-rate | Retried sends | Broker latency, transient network, throttling
request-latency-avg/p99 | Broker request latency observed by producer | Rising before application timeout
batch-size-avg | Actual batch size | Too low means poor batching; too high may imply latency trade-off
compression-rate | Compression effectiveness | Helps understand network/disk pressure
buffer-available-bytes | Remaining producer buffer | Low value means backpressure risk
bufferpool-wait-time | Time waiting for buffer memory | Strong producer-side saturation signal
record-queue-time | Time records wait before send | High value means local backlog
outgoing-byte-rate | Network egress | Throughput capacity planning

Producer Latency Breakdown

Producer latency can come from:

  1. application serialization;
  2. buffer wait;
  3. batching linger;
  4. network request;
  5. broker append/replication;
  6. retry and timeout;
  7. callback execution.

A good dashboard separates these instead of showing one generic “publish latency”.

Signal 3: Consumer Metrics

Consumer observability must answer two different questions:

  1. Offset lag: how many records behind is the consumer?
  2. Time lag / freshness: how old is the oldest unprocessed business event?

Offset lag alone is insufficient. A lag of 10 records may be serious if each record represents a 2 GB file processing request. A lag of 100,000 records may be harmless if each record is tiny and the consumer catches up in seconds.
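
The two kinds of lag are computed from different inputs, which is why they can disagree. The sketch below assumes you can obtain the log end offset, the committed offset, and the timestamp of the oldest record that has not yet been processed. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch: offset lag and time lag are separate measurements. */
final class LagView {

    private LagView() {}

    /** Offset lag: number of records between the committed position and the log end. */
    static long offsetLag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Time lag: age of the oldest record that has not been processed yet. */
    static Duration timeLag(Instant oldestUnprocessedRecordTimestamp, Instant now) {
        Duration age = Duration.between(oldestUnprocessedRecordTimestamp, now);
        // Clock skew between producer and consumer hosts can yield a negative age.
        return age.isNegative() ? Duration.ZERO : age;
    }
}
```

A partition can show an offset lag of 10 and a time lag of five minutes at the same moment. The second number is usually the one that maps to user impact.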

Consumer Metric Catalog

Metric Concept | Meaning | Diagnostic Use
records-consumed-rate | Records consumed per second | Compare with production rate
bytes-consumed-rate | Input throughput | Detect payload size changes
records-lag-max | Max offset lag among assigned partitions | Identify worst partition
poll-latency | Time spent in poll | Broker/fetch behavior
poll-idle-ratio | Whether consumer is idle | Low idle + high lag = processing bottleneck
commit-latency | Offset commit latency | Coordinator/broker issue
rebalance count/rate | Group instability | Rebalance storm detection
assigned partitions | Work distribution | Partition/consumer mismatch
processing latency | Business handler duration | Downstream bottleneck
end-to-end latency | Event time to completed side-effect | User/business impact

Lag Is Not One Number

For high-confidence operations, track:

  • produced high watermark;
  • fetched offset;
  • processing-completed offset;
  • committed offset;
  • oldest unprocessed event age;
  • oldest retry event age;
  • DLQ age.

Signal 4: Kafka Streams Metrics

Kafka Streams adds more moving parts:

  • topology;
  • tasks;
  • stream threads;
  • local state stores;
  • changelog topics;
  • repartition topics;
  • state restore;
  • standby replicas;
  • punctuators;
  • window retention;
  • record caches;
  • commit interval;
  • rebalance and task migration.

Kafka Streams Observability Questions

Question | Signal
Is the app processing? | process rate, poll rate, commit rate
Is one task stuck? | per-task process latency/rate
Is state restore happening? | restore rate, restore remaining records
Is state store growing unexpectedly? | store size, changelog size
Is repartition creating pressure? | internal topic throughput and lag
Are windows dropping late events? | dropped records, skipped records, late event count
Are joins missing data? | join result rate, table freshness, key mismatch count
Is EOS causing transaction pressure? | commit latency, transaction abort rate

Stream Task Hotspot

Aggregate app metrics may hide a single hot task. Always break down by thread/task/topic/partition when possible.

Signal 5: Kafka Connect Metrics

Connect observability is about connector lifecycle, task execution, external system health, and data correctness.

Area | Signal | Meaning
Worker | worker up, REST available | Connect cluster health
Connector | connector state | RUNNING, PAUSED, FAILED
Task | task state | Individual task failure may partially degrade pipeline
Source | source record poll rate | External source extraction health
Sink | sink record send rate | External sink write health
Error | total errors, DLQ writes | Data quality and integration failures
Offset | source offsets | Backfill/restart behavior
External | DB/API latency | Often the actual bottleneck

A connector can be “RUNNING” while one task is failed. Alerting must inspect tasks, not just connector names.
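
A task-aware health check might look like the sketch below. It assumes the connector state and per-task states have already been read, for example from the Connect REST status endpoint (`GET /connectors/{name}/status`). The `Verdict` categories and the class name are hypothetical, and the mapping rules are one reasonable choice rather than a standard.

```java
import java.util.List;

/** Sketch: derives a connector verdict from connector state AND task states. */
final class ConnectorHealth {

    enum Verdict { HEALTHY, DEGRADED, DOWN, INACTIVE }

    private ConnectorHealth() {}

    static Verdict evaluate(String connectorState, List<String> taskStates) {
        long failedTasks = taskStates.stream().filter("FAILED"::equals).count();
        boolean allTasksFailed = !taskStates.isEmpty() && failedTasks == taskStates.size();

        if ("FAILED".equals(connectorState) || allTasksFailed) {
            return Verdict.DOWN;
        }
        if (failedTasks > 0) {
            // The connector reports RUNNING, yet part of the pipeline has stopped.
            return Verdict.DEGRADED;
        }
        if (!"RUNNING".equals(connectorState)) {
            // For example PAUSED or UNASSIGNED: not failing, but not moving data either.
            return Verdict.INACTIVE;
        }
        return Verdict.HEALTHY;
    }
}
```

Alerting on `DEGRADED` as well as `DOWN` is what catches the "RUNNING with one failed task" case.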

Signal 6: ksqlDB Metrics

ksqlDB observability focuses on queries and materialized views.

Question | Signal
Is the persistent query running? | query status
Is the query processing input? | rows consumed/produced
Is a materialized view fresh? | source lag and query lag
Is the query repartitioning heavily? | internal topic throughput
Are pull queries slow? | pull query latency
Are schemas or keys mismatched? | processing log/errors

ksqlDB is friendly at the SQL layer, but operationally it still produces Kafka Streams applications and internal Kafka topics.

Logs: Make Events Investigable

Kafka logs must not become free-text dumping grounds. Event-driven systems require structured logs that allow reconstruction.

Required Log Fields for Kafka Applications

Field | Purpose
service | Which service emitted the log
env | prod/staging/dev
clientId | Kafka client identity
consumerGroupId | Consumer group identity
topic | Kafka topic
partition | Partition number
offset | Kafka offset
eventId | Business/event identity
eventType | Event semantic type
entityId | Aggregate/entity identity
correlationId | End-to-end request/workflow correlation
causationId | Previous event/command that caused this event
schemaVersion | Event contract version
attempt | Retry attempt
errorCode | Stable error classification
durationMs | Handler duration
outcome | success/retry/dlq/skip

Example Java Structured Log

// Assumes an SLF4J logger and a static import of
// net.logstash.logback.argument.StructuredArguments.kv (logstash-logback-encoder).
log.info("kafka_event_processed",
    kv("topic", record.topic()),
    kv("partition", record.partition()),
    kv("offset", record.offset()),
    kv("eventId", envelope.eventId()),
    kv("eventType", envelope.eventType()),
    kv("entityId", envelope.entityId()),
    kv("correlationId", envelope.correlationId()),
    kv("schemaVersion", envelope.schemaVersion()),
    kv("durationMs", elapsedMillis),
    kv("outcome", "success"));

The exact logging API depends on your stack, but the invariant is stable: every processing decision should be reconstructable.

Tracing: Propagating Context Through Kafka

Distributed tracing is harder with Kafka than synchronous HTTP because producer and consumer are decoupled by time and storage.

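A trace must cross an asynchronous boundary: the producer writes the trace context into record headers, and the consumer restores it when it processes the record, possibly much later.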

Header Propagation

Common headers:

Header | Purpose
traceparent | W3C trace context
tracestate | Vendor-specific trace context
correlation-id | Business workflow correlation
causation-id | Causal event/command identity
event-id | Event identity
schema-id | Optional schema/debug metadata

Trace context is useful for latency and causal investigation. Business correlation ID is still required because traces may be sampled or expire.
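
The sketch below shows the header handling in isolation, without a tracing library. It uses a plain `Map<String, byte[]>` as a stand-in for Kafka record headers (Kafka header values are byte arrays), and validates `traceparent` against the W3C version `00` layout of a 32-hex trace ID, a 16-hex parent ID, and 2-hex flags. The `TraceHeaders` class is hypothetical; in practice a tracing agent or instrumentation library normally does this for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: builds and reads propagation headers; the Map stands in for Kafka Headers. */
final class TraceHeaders {

    private static final Pattern TRACEPARENT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    private TraceHeaders() {}

    static boolean isValidTraceparent(String value) {
        if (value == null) {
            return false;
        }
        Matcher m = TRACEPARENT.matcher(value);
        // All-zero trace and parent IDs are invalid in W3C Trace Context.
        return m.matches()
                && m.group(1).chars().anyMatch(c -> c != '0')
                && m.group(2).chars().anyMatch(c -> c != '0');
    }

    static Map<String, byte[]> outgoing(
            String traceparent, String correlationId, String causationId, String eventId) {
        if (!isValidTraceparent(traceparent)) {
            throw new IllegalArgumentException("malformed traceparent: " + traceparent);
        }
        Map<String, byte[]> headers = new LinkedHashMap<>();
        headers.put("traceparent", utf8(traceparent));
        headers.put("correlation-id", utf8(correlationId));
        headers.put("causation-id", utf8(causationId));
        headers.put("event-id", utf8(eventId));
        return headers;
    }

    /** Returns the header value as text, or null when the header is absent. */
    static String read(Map<String, byte[]> headers, String name) {
        byte[] raw = headers.get(name);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    private static byte[] utf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
}
```

On the producer side, each entry would be copied into the outgoing record's headers. On the consumer side, `read` gives the values to put into the logging context before the handler runs.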

Domain-Level Observability

Infrastructure metrics tell you whether Kafka is functioning. Domain metrics tell you whether the business process is functioning.

For a CPQ/order-management/regulatory-style system, domain observability may include:

Workflow | Domain SLI
Quote generation | p95 time from QuoteRequested to QuoteCalculated
Order capture | p99 time from OrderSubmitted to OrderAccepted
Fulfillment | count of orders stuck in WAITING_FOR_INVENTORY older than threshold
Case escalation | overdue escalation events by severity
Pricing update | propagation freshness of price catalog projection
Compliance audit | percentage of events with valid actor, reason, and correlation metadata

Kafka lag does not automatically equal business delay. Domain SLOs close that gap.

SLI/SLO Design

SLI | Why It Is Good
End-to-end processing latency from event timestamp to durable projection | Measures user-visible freshness
Oldest unprocessed event age per workflow | Detects stuck processing even with low offset lag
DLQ event rate by error class | Measures data/integration correctness
Replay completion time | Measures recovery capability
Consumer catch-up ratio | Shows whether backlog is decreasing
Under-replicated partition duration | Measures durability risk window
Schema compatibility failure rate | Measures contract governance health
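
A latency SLI becomes alertable once it is expressed as a fraction of "bad" events and compared with the error budget. The sketch below uses the common burn-rate definition, bad fraction divided by (1 - SLO target), applied to a freshness SLO such as "99% of events projected within 120 seconds". The class name and the choice of inputs are assumptions for illustration.

```java
/** Sketch: turns end-to-end latencies into a freshness SLI and an SLO burn rate. */
final class FreshnessSlo {

    private FreshnessSlo() {}

    /** Fraction of events whose end-to-end latency exceeded the freshness threshold. */
    static double badFraction(long[] latenciesMillis, long thresholdMillis) {
        if (latenciesMillis.length == 0) {
            return 0.0;
        }
        long bad = 0;
        for (long latency : latenciesMillis) {
            if (latency > thresholdMillis) {
                bad++;
            }
        }
        return (double) bad / latenciesMillis.length;
    }

    /**
     * Burn rate: how fast the error budget is being consumed.
     * 1.0 means exactly on budget; 5.0 means the budget burns five times too fast.
     */
    static double burnRate(double badFraction, double sloTarget) {
        if (sloTarget <= 0.0 || sloTarget >= 1.0) {
            throw new IllegalArgumentException("sloTarget must be strictly between 0 and 1");
        }
        return badFraction / (1.0 - sloTarget);
    }
}
```

For a 99% target, a window in which 5% of events are late gives a burn rate of about 5.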

Weak SLIs

Weak Metric | Why It Is Weak
Broker process up | Necessary but not sufficient
Average consumer lag | Hides hot partitions
Total message rate | No correctness signal
CPU only | Many Kafka incidents are I/O, skew, or downstream related
Error log count | No denominator or user impact

Alert Design

A good alert has four properties:

  1. It represents user/business risk.
  2. It requires human action.
  3. It includes enough context to start diagnosis.
  4. It has a runbook.

Alert Matrix

Alert | Severity | Trigger Concept | Immediate Question
Offline partitions | Critical | Any sustained offline partition | Which topic/partition? Can clients read/write?
Under-replicated partitions | High | Sustained non-zero | Which brokers/replicas? Disk/network issue?
Consumer freshness SLO burn | High | Oldest unprocessed event age rising | Is consumer slow, stuck, or rebalancing?
DLQ spike | High/Medium | Error-class-specific rate | Is schema, data, or downstream failing?
Rebalance storm | High | Frequent rebalances | Deploy loop? session timeout? max poll breach?
Producer error/retry spike | High | Retry/error rate increase | Broker, auth, schema, network, timeout?
Disk capacity risk | High | Time-to-full below threshold | Retention, traffic, compaction, cleanup?
State restore stuck | High | Restore not progressing | Changelog size? local disk? version issue?

Avoid Alerting on Raw Lag Alone

Lag should usually be paired with:

  • age of oldest unprocessed event;
  • rate of incoming events;
  • rate of processing;
  • catch-up estimate;
  • workflow criticality;
  • partition distribution.

A better alert:

Page when oldest_unprocessed_order_event_age_seconds is above 300 seconds for 10 minutes and consumer catch-up ratio is below 1.0.

A weaker alert:

Page when consumer lag > 10,000.
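
The better alert above can be sketched as a predicate over recent samples. This version assumes one sample per minute and defines the catch-up ratio as processing rate divided by input rate, so a value below 1.0 means the backlog is growing. That definition, the class name, and the sampling interval are assumptions; in practice the rule would usually live in your alerting system rather than in application code.

```java
/** Sketch: "age > 300 s for 10 minutes AND catch-up ratio < 1.0" as a predicate. */
final class FreshnessAlert {

    static final double AGE_THRESHOLD_SECONDS = 300.0;
    static final int SUSTAINED_SAMPLES = 10; // assumes one sample per minute

    private FreshnessAlert() {}

    /** Assumed definition: processing rate over input rate. */
    static double catchUpRatio(double processedPerSec, double producedPerSec) {
        return producedPerSec == 0.0
                ? Double.POSITIVE_INFINITY
                : processedPerSec / producedPerSec;
    }

    /**
     * Both arrays hold samples ordered oldest first. Pages only when every one of the
     * last SUSTAINED_SAMPLES samples breaches both conditions.
     */
    static boolean shouldPage(double[] oldestEventAgeSeconds, double[] catchUpRatios) {
        if (oldestEventAgeSeconds.length != catchUpRatios.length) {
            throw new IllegalArgumentException("sample arrays must have the same length");
        }
        int n = oldestEventAgeSeconds.length;
        if (n < SUSTAINED_SAMPLES) {
            return false;
        }
        for (int i = n - SUSTAINED_SAMPLES; i < n; i++) {
            boolean stale = oldestEventAgeSeconds[i] > AGE_THRESHOLD_SECONDS;
            boolean fallingBehind = catchUpRatios[i] < 1.0;
            if (!stale || !fallingBehind) {
                return false;
            }
        }
        return true;
    }
}
```

A burst that the consumer is already draining (ratio at or above 1.0) does not page, even when the event age is temporarily above the threshold.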

Incident Triage: Consumer Lag

Lag Runbook

  1. Identify affected consumer group, topic, partition.
  2. Compare input rate vs processing rate.
  3. Check oldest unprocessed event age.
  4. Check whether lag is global or partition-local.
  5. Check consumer rebalance count and assignment changes.
  6. Check processing latency and downstream dependency latency.
  7. Check DLQ/error/retry rate.
  8. Check broker produce/fetch latency and under-replication.
  9. Estimate catch-up time.
  10. Choose action: scale consumers, pause replay, fix poison pill, increase downstream capacity, rollback deploy, adjust partitioning, or initiate incident comms.
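
Step 9 is simple arithmetic: the backlog drains at the difference between processing rate and input rate. A minimal sketch, assuming both rates are measured in records per second over the same window (the class name is hypothetical):

```java
import java.time.Duration;
import java.util.Optional;

/** Sketch: estimates how long a consumer group needs to drain its backlog. */
final class CatchUpEstimate {

    private CatchUpEstimate() {}

    /**
     * @return estimated time to reach zero lag, or empty when the consumer is not
     *         gaining on the producers and will never catch up at current rates
     */
    static Optional<Duration> estimate(long lagRecords, double inputPerSec, double processedPerSec) {
        if (lagRecords <= 0) {
            return Optional.of(Duration.ZERO);
        }
        double drainPerSec = processedPerSec - inputPerSec;
        if (drainPerSec <= 0.0) {
            return Optional.empty();
        }
        long seconds = (long) Math.ceil(lagRecords / drainPerSec);
        return Optional.of(Duration.ofSeconds(seconds));
    }
}
```

An empty result is the signal that waiting will not help and one of the actions in step 10 is needed.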

Incident Triage: DLQ Spike

DLQ without observability is just delayed data loss.

Each DLQ record should include:

  • original topic/partition/offset;
  • original key;
  • event ID;
  • event type;
  • schema version;
  • failure timestamp;
  • consumer group;
  • exception class;
  • stable error code;
  • retry attempt count;
  • stack trace hash;
  • replay eligibility;
  • operator notes if manually handled.
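
The stack trace hash in the list above is what lets a dashboard group thousands of DLQ records into a handful of failure shapes. One possible sketch hashes the exception class and the class/method of each frame, deliberately ignoring messages and line numbers so that the hash is not affected by record-specific text or small code shifts. The scheme and the class name are assumptions, not a standard.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: stable, low-cardinality fingerprint of where a failure happened. */
final class StackTraceHash {

    private StackTraceHash() {}

    static String of(Throwable failure) {
        StringBuilder shape = new StringBuilder(failure.getClass().getName());
        for (StackTraceElement frame : failure.getStackTrace()) {
            // Message text and line numbers are excluded on purpose.
            shape.append('|').append(frame.getClassName())
                 .append('#').append(frame.getMethodName());
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(shape.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) { // first 8 bytes give 16 hex characters
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }
}
```

The resulting value fits in a DLQ header and in a log field. Per the cardinality guidance in this part, it is better suited to logs and DLQ metadata than to metric labels.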

Incident Triage: Rebalance Storm

Common causes:

  • consumer process crash loop;
  • deployment rolling too aggressively;
  • max.poll.interval.ms exceeded by slow processing;
  • session.timeout.ms too low for environment;
  • network instability;
  • autoscaler oscillation;
  • long GC pause;
  • partition assignment protocol mismatch;
  • large state restore in Kafka Streams;
  • Connect task restart loop.

Immediate checks:

  1. Did deployment start recently?
  2. Are members joining/leaving repeatedly?
  3. Did processing latency exceed max.poll.interval.ms?
  4. Is CPU/GC/network unstable?
  5. Are consumers using static membership where appropriate?
  6. Are stateful stream tasks being migrated repeatedly?
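
Check 3 can be approximated before an incident by comparing a worst-case batch time against `max.poll.interval.ms`. The sketch below assumes records are handled sequentially within one `poll()` batch and pessimistically charges every record the p99 handler latency. The class name is hypothetical.

```java
/** Sketch: does a full poll() batch fit inside max.poll.interval.ms? */
final class PollBudget {

    private PollBudget() {}

    /** Pessimistic batch time: every record in the batch takes the p99 handler latency. */
    static long worstCaseBatchMillis(int maxPollRecords, long p99HandlerMillis) {
        return Math.multiplyExact((long) maxPollRecords, p99HandlerMillis);
    }

    /** True when a slow batch could exceed the poll interval and trigger a rebalance. */
    static boolean risksRebalance(int maxPollRecords, long p99HandlerMillis, long maxPollIntervalMillis) {
        return worstCaseBatchMillis(maxPollRecords, p99HandlerMillis) >= maxPollIntervalMillis;
    }
}
```

With the consumer defaults of `max.poll.records=500` and `max.poll.interval.ms=300000`, a 700 ms p99 handler puts the worst case at 350 seconds, which is over budget. Lowering `max.poll.records` or speeding up the handler brings it back under.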

Dashboard Design

Executive Kafka Health Dashboard

This is for quick incident awareness.

Panel | Purpose
Cluster write/read throughput | Traffic shape
Offline partitions | Availability
Under-replicated partitions | Durability risk
Request latency p95/p99 | Client-visible broker health
Disk usage/time-to-full | Capacity risk
Top consumer groups by freshness breach | Business impact
DLQ rate by domain/error class | Correctness impact
Deployment/change markers | Correlation with incidents

Consumer Group Dashboard

Panel | Purpose
Lag by partition | Skew detection
Oldest event age | Freshness impact
Input rate vs processing rate | Catch-up diagnosis
Rebalance count | Group stability
Handler latency | Processing bottleneck
Downstream dependency latency | External bottleneck
Retry/DLQ rate | Error pressure
Assigned partitions per instance | Work distribution

Streams Dashboard

Panel | Purpose
Process rate by task | Hot task detection
Commit latency | State/transaction health
State restore progress | Recovery behavior
Rebalance count | Task migration instability
Repartition topic throughput | Hidden topology cost
Changelog topic throughput | State durability cost
Dropped/skipped records | Data correctness issue
RocksDB/store size | Local state capacity

Observability in Java Applications

Producer Wrapper Pattern

Avoid scattering raw producer calls everywhere. Wrap publishing with telemetry.

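import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.header.Headers;

public final class ObservableEventPublisher<K, V> {
    private final KafkaProducer<K, V> producer;
    private final MeterRegistry metrics;

    public ObservableEventPublisher(KafkaProducer<K, V> producer, MeterRegistry metrics) {
        this.producer = producer;
        this.metrics = metrics;
    }

    public CompletableFuture<RecordMetadata> publish(
            String topic,
            K key,
            V value,
            Headers headers,
            String eventType) {

        long started = System.nanoTime();
        CompletableFuture<RecordMetadata> result = new CompletableFuture<>();

        // A null partition lets the configured partitioner choose based on the key.
        ProducerRecord<K, V> record = new ProducerRecord<>(topic, null, key, value, headers);

        try {
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    // Latency covers buffering, batching, the network round trip, and the broker ack.
                    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - started);
                    metrics.timer("kafka.publish.latency",
                            "topic", topic,
                            "eventType", eventType).record(elapsedMs, TimeUnit.MILLISECONDS);
                    metrics.counter("kafka.publish.success",
                            "topic", topic,
                            "eventType", eventType).increment();
                    result.complete(metadata);
                } else {
                    fail(result, topic, eventType, exception);
                }
            });
        } catch (RuntimeException e) {
            // send() can also fail synchronously, for example on serialization errors.
            // Count it and surface it through the same future as asynchronous failures.
            fail(result, topic, eventType, e);
        }

        return result;
    }

    private void fail(CompletableFuture<RecordMetadata> result,
                      String topic, String eventType, Exception exception) {
        metrics.counter("kafka.publish.error",
                "topic", topic,
                "eventType", eventType,
                "exception", exception.getClass().getSimpleName()).increment();
        result.completeExceptionally(exception);
    }
}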

The wrapper should not hide Kafka semantics. It should standardize telemetry, headers, logging, and error classification.

Consumer Handler Telemetry

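import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// DomainHandler, RetryableDomainException, and NonRetryableDomainException are
// application-defined types; errorCode() is expected to return a stable, bounded String.
public final class ObservableRecordHandler<K, V> {
    private final MeterRegistry metrics;
    private final DomainHandler<V> delegate;

    public ObservableRecordHandler(MeterRegistry metrics, DomainHandler<V> delegate) {
        this.metrics = metrics;
        this.delegate = delegate;
    }

    public void handle(ConsumerRecord<K, V> record) {
        long started = System.nanoTime();
        String topic = record.topic();

        try {
            delegate.handle(record.value());
            metrics.counter("kafka.consumer.record.success", "topic", topic).increment();
        } catch (RetryableDomainException e) {
            metrics.counter("kafka.consumer.record.retryable_error",
                    "topic", topic,
                    "errorCode", e.errorCode()).increment();
            throw e; // the caller decides between retry and backoff
        } catch (NonRetryableDomainException e) {
            metrics.counter("kafka.consumer.record.non_retryable_error",
                    "topic", topic,
                    "errorCode", e.errorCode()).increment();
            throw e; // the caller routes the record to the DLQ
        } finally {
            // Recorded for successes and failures alike, so slow failures stay visible.
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - started);
            metrics.timer("kafka.consumer.record.processing.latency", "topic", topic)
                    .record(elapsedMs, TimeUnit.MILLISECONDS);
        }
    }
}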

Expose domain error classification. Do not rely only on exception class names.
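
A stable classification can be sketched as a small mapping from exceptions to a closed set of codes. The codes and the mapping rules below are purely illustrative; a real taxonomy would come from your own domain and dependencies.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

/** Sketch: maps exceptions to a closed set of stable error codes. */
final class ErrorClassifier {

    /** Hypothetical taxonomy; each code also records whether a retry makes sense. */
    enum Code {
        DOWNSTREAM_TIMEOUT(true),
        VALIDATION_FAILED(false),
        UNCLASSIFIED(false);

        private final boolean retryable;

        Code(boolean retryable) {
            this.retryable = retryable;
        }

        boolean retryable() {
            return retryable;
        }
    }

    private ErrorClassifier() {}

    /** Walks the cause chain so that wrapped exceptions are still classified. */
    static Code classify(Throwable failure) {
        for (Throwable current = failure; current != null; current = current.getCause()) {
            if (current instanceof SocketTimeoutException || current instanceof TimeoutException) {
                return Code.DOWNSTREAM_TIMEOUT;
            }
            if (current instanceof IllegalArgumentException) {
                return Code.VALIDATION_FAILED;
            }
        }
        return Code.UNCLASSIFIED;
    }
}
```

Because the code set is closed, it is safe as a metric label, and it stays stable when a library upgrade renames or rewraps an exception class.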

Cardinality Discipline

High-cardinality metrics can destroy observability systems.

Avoid labels like:

  • eventId;
  • userId;
  • orderId;
  • raw exception message;
  • full topic name if topics are dynamic per tenant/user;
  • stack trace;
  • request ID;
  • unbounded tenant IDs if tenant count is large.

Prefer bounded labels:

  • topic group;
  • service;
  • environment;
  • event type;
  • error code;
  • exception class;
  • consumer group;
  • deployment version;
  • criticality tier.

Use logs/traces for high-cardinality lookup. Use metrics for aggregation.
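
One way to enforce bounded labels in code is an allowlist with a fallback bucket, applied before a value is ever used as a metric tag. The `BoundedLabel` class below is a hypothetical helper, not part of any metrics library.

```java
import java.util.Set;

/** Sketch: collapses unexpected label values into a single fallback bucket. */
final class BoundedLabel {

    private final Set<String> allowed;
    private final String fallback;

    BoundedLabel(Set<String> allowed, String fallback) {
        this.allowed = Set.copyOf(allowed);
        this.fallback = fallback;
    }

    /** Returns the raw value when it is on the allowlist, otherwise the fallback. */
    String of(String raw) {
        return raw != null && allowed.contains(raw) ? raw : fallback;
    }
}
```

A publisher or handler wrapper would pass `eventType` through such a helper, so a bug that puts an order ID into the event type field produces one "other" series instead of millions of new ones.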

Event Audit Trail vs Observability Logs

Do not confuse audit with logs.

Aspect | Observability Log | Audit Event
Purpose | Diagnose system behavior | Prove business/legal action history
Retention | Operational | Regulatory/business policy
Mutability | Usually append-only in log store but operational | Strong append-only requirement
Audience | Engineers/SRE | Business, compliance, auditors
Schema | Logging schema | Domain event schema
Query | Incident investigation | Case reconstruction

A regulatory-grade system often needs both.

Common Kafka Observability Anti-Patterns

Anti-Pattern 1: “Lag Alert = Kafka Alert”

Lag is a symptom. It may be caused by a slow downstream DB, a hot key, a bad deploy, or a deliberate replay.

Better: alert on freshness SLO and attach diagnostic dimensions.

Anti-Pattern 2: Average Lag

Average lag hides the worst partition.

Better: max lag and oldest event age by partition.

Anti-Pattern 3: Dashboards Without Runbooks

A dashboard that requires a senior engineer to interpret it at 3 a.m. is incomplete.

Better: every alert links to a triage tree.

Anti-Pattern 4: No Domain Correlation ID

Without correlation ID, event-driven debugging becomes archaeology.

Better: propagate correlation and causation IDs in headers and payload envelope.

Anti-Pattern 5: Metrics With Unbounded Labels

High-cardinality labels can make monitoring more expensive and less reliable than the system being monitored.

Better: bounded metrics + high-cardinality logs/traces.

Anti-Pattern 6: DLQ Without Replay Strategy

A DLQ is not a trash bin. It is a quarantine queue.

Better: every DLQ class has owner, retention, replay policy, and dashboard.

Observability Review Checklist

Use this checklist for every Kafka application or platform component.

Broker/Platform

  • Are offline partitions alerted immediately?
  • Are under-replicated partitions monitored with duration?
  • Is controller instability visible?
  • Are produce/fetch request latencies tracked?
  • Is disk time-to-full tracked?
  • Are network and request queue saturation visible?
  • Are JMX endpoints secured?

Producer

  • Is publish success/error rate measured?
  • Is producer latency split from business request latency?
  • Are retries and timeouts visible?
  • Is buffer exhaustion visible?
  • Are schema serialization errors classified?
  • Is client.id meaningful and stable?

Consumer

  • Is lag tracked by partition?
  • Is oldest unprocessed event age tracked?
  • Is processing latency tracked separately from poll latency?
  • Are offset commits tracked?
  • Are rebalances visible?
  • Are retry and DLQ rates classified?
  • Are downstream dependency latencies correlated?

Kafka Streams

  • Are task-level metrics visible?
  • Is state restore progress visible?
  • Are dropped/skipped records visible?
  • Are internal repartition/changelog topics monitored?
  • Are commit/transaction metrics visible?
  • Are topology version changes marked?

Connect/ksqlDB

  • Are connector/task states monitored?
  • Are failed tasks alerted?
  • Are source/sink rates visible?
  • Are DLQ/error records classified?
  • Are persistent query statuses monitored?
  • Are materialized view freshness SLIs defined?

Domain

  • Are workflow freshness SLIs defined?
  • Are stuck-state counts visible?
  • Are correlation IDs propagated?
  • Can one event be traced end-to-end?
  • Can audit reconstruction be performed without application logs?

Practice Lab

Lab 1: Build a Consumer Lag Triage Dashboard

Create dashboard panels for one consumer group:

  1. lag by partition;
  2. oldest event age by partition;
  3. input rate vs processed rate;
  4. handler latency p95/p99;
  5. downstream DB latency;
  6. rebalance rate;
  7. DLQ rate by error code.

Then answer:

  • Which signal tells you whether backlog is getting worse?
  • Which signal tells you whether a single key is hot?
  • Which signal tells you whether downstream dependency is the bottleneck?

Lab 2: Add Correlation Headers

Implement producer and consumer logic that propagates:

  • traceparent;
  • correlation-id;
  • causation-id;
  • event-id.

Validate that one event can be followed from API request to Kafka topic to consumer handler to database update.

Lab 3: DLQ Spike Simulation

Inject three errors:

  1. schema deserialization error;
  2. non-retryable business validation error;
  3. downstream timeout.

Verify that the DLQ dashboard separates them by stable error code.

Architecture Decision Record Template

# ADR: Kafka Observability Model for <System>

## Context
<Which services, topics, consumer groups, stream apps, connectors, and workflows are involved?>

## Business SLOs
<What user/domain freshness or correctness targets matter?>

## Infrastructure Signals
<Broker, topic, partition, replication, disk, network, controller metrics.>

## Application Signals
<Producer, consumer, Kafka Streams, Connect, ksqlDB metrics.>

## Domain Signals
<Workflow age, stuck state, DLQ by business error, projection freshness.>

## Logging Standard
<Required structured fields and error taxonomy.>

## Trace Propagation
<Headers and sampling strategy.>

## Alerts
<Which alerts page, which create tickets, which are dashboard-only?>

## Runbooks
<Links to triage flows.>

## Cardinality Controls
<Metric labels allowed and disallowed.>

## Trade-Offs
<Cost, retention, sampling, operational complexity.>

Key Takeaways

  • Kafka observability must connect infrastructure state to domain workflow health.
  • Lag is a symptom, not a root cause.
  • Offset lag and time/freshness lag are different.
  • Consumer, producer, broker, Streams, Connect, and ksqlDB each expose different failure surfaces.
  • Structured logs and trace headers are mandatory for event-driven debugging.
  • DLQ without replay strategy is delayed data loss.
  • Good alerts encode actionability and runbook context.
