Learn Java Messaging Event Streaming Part 027 Ksqldb Operations Scaling State Metrics
title: Learn Java Messaging and Event Streaming - Part 027 description: ksqlDB operations for deployment, scaling, state management, lag, metrics, high availability, recovery, rollout discipline, and production runbooks. series: learn-java-messaging-event-streaming seriesTitle: Learn Java Messaging and Event Streaming order: 27 partTitle: ksqlDB Operations, Scaling, State, Lag, and Metrics tags:
- java
- kafka
- ksqldb
- operations
- observability
- state-store
- consumer-lag
- scaling
- production
- stream-processing date: 2026-06-28
Part 027 — ksqlDB Operations: Deployment, Scaling, State, Lag, and Metrics
1. What We Are Solving
In Part 025 and Part 026, we treated ksqlDB as a way to express stream processing with SQL.
In this part, we treat ksqlDB as a production runtime.
That means the questions change.
Instead of asking:
Can this SQL produce the right result?
we ask:
Can this query keep producing the right result under load, schema change, rebalance, node failure, replay, and operational rollout?
That shift matters because a ksqlDB persistent query is not just SQL.
It is a deployed stream-processing application with:
- Kafka source topics
- Kafka sink topics
- consumer groups
- internal repartition topics
- changelog topics
- local state stores
- JVM memory
- RocksDB/native memory
- network usage
- disk usage
- schema contracts
- operational lifecycle
- owner accountability
A weak ksqlDB operation model turns elegant SQL into an opaque incident generator.
A strong model makes ksqlDB safe enough for case-management, alerting, monitoring, enrichment, and current-state projections.
2. The ksqlDB Runtime Mental Model
A ksqlDB server is not a database server in the traditional relational sense.
It is closer to a managed Kafka Streams runtime with a REST interface and SQL language.
A persistent query behaves like this:
The important consequence:
ksqlDB state is not only inside the ksqlDB server. It is distributed across Kafka topics, local disks, and query metadata.
So operational review must include:
- Kafka topic configuration
- query topology
- state-store size
- local disk capacity
- internal-topic lifecycle
- schema compatibility
- consumer lag
- failure recovery time
- query owner
- deployment mechanism
3. Core Operational Objects
In production, track these as first-class objects.
| Object | Why It Matters |
|---|---|
| ksqlDB cluster | Hosts query execution and exposes REST API |
| Persistent query | Long-running stream-processing job |
| Source topic | Input contract and throughput source |
| Sink topic | Output contract for downstream consumers |
| Internal topic | Repartition/changelog mechanics used by query |
| State store | Local materialized state used for tables, joins, windows, aggregations |
| Consumer group | Used to track query consumption and lag |
| Schema subject | Governs key/value compatibility |
| Query ID | Unit of explainability, troubleshooting, termination, and ownership |
| Application ID | Kafka Streams identity underneath query execution |
The operational mistake is to monitor only the ksqlDB process.
The correct approach is to monitor the whole query execution chain.
4. Deployment Topology
A minimal production ksqlDB deployment has multiple ksqlDB server instances behind a stable access path.
A production deployment should define:
- number of ksqlDB nodes
- CPU and memory allocation
- local state directory
- persistent volume strategy
- service discovery strategy
- TLS/authentication to Kafka
- TLS/authentication to ksqlDB REST endpoint
- schema registry connectivity
- Kafka ACLs
- query deployment process
- monitoring and alert routing
ksqlDB is easy to start locally.
It is not automatically easy to operate safely.
5. Deployment Modes
5.1 Interactive Development Mode
Useful for:
- learning
- prototyping
- query exploration
- validating stream/table semantics
Risk:
- ad-hoc queries become production behavior
- query ownership is unclear
- rollback is manual
- naming is inconsistent
Use this mode only outside production or in tightly controlled admin environments.
5.2 GitOps / Migration-Based Production Mode
Preferred for production.
Treat ksqlDB statements like database migrations or service deployment manifests.
A production change should include:
- SQL file
- query name
- topic contract
- schema references
- owner
- rollout plan
- rollback plan
- expected lag impact
- expected state size
- data replay behavior
- test evidence
Example directory:
ksqldb/
001-create-source-streams.sql
002-create-case-current-state.sql
003-create-escalation-candidates-v1.sql
004-create-escalation-candidates-v2.sql
rollback/
003-terminate-escalation-candidates-v1.sql
tests/
case-current-state-inputs.json
case-current-state-expected.json
The goal is not ceremony.
The goal is reproducibility.
5.3 Platform-Managed Mode
In larger organizations, a platform team may operate ksqlDB clusters and product teams submit query definitions.
This requires clear boundaries:
| Concern | Owner |
|---|---|
| ksqlDB cluster uptime | Platform team |
| Kafka cluster health | Data platform team |
| Query semantics | Product/domain team |
| Schema compatibility | Producer + consumer owners |
| Sink topic contract | Query owner |
| Alert response | Query owner + platform on-call |
| Data governance | Domain/data governance team |
Without ownership, ksqlDB becomes shared mutable infrastructure.
Shared mutable infrastructure becomes operational debt.
6. Query Lifecycle
A persistent query has lifecycle stages.
6.1 Designed
Define:
- source collections
- output collection
- key semantics
- join semantics
- window semantics
- schema compatibility
- expected throughput
- expected state size
- failure behavior
6.2 Reviewed
Review should include both SQL and runtime effects.
Ask:
- Does this create an internal repartition topic?
- Does this create a changelog topic?
- Does this maintain unbounded state?
- Does this depend on event time or processing time?
- Is the output idempotent for downstream consumers?
- Is the query safe to replay?
6.3 Tested
Test with:
- normal events
- duplicate events
- out-of-order events
- late events
- schema version changes
- missing fields
- null keys
- tombstones
- high-cardinality keys
- skewed keys
6.4 Deployed
Deploy using versioned SQL.
Avoid manually pasting production SQL unless this is explicitly an emergency operation.
6.5 Running
Monitor:
- query status
- consumer lag
- throughput
- error rate
- processing latency
- state-store growth
- JVM memory
- disk utilization
- broker-side topic health
6.6 Deprecated
Deprecation must include downstream migration.
A query is not safe to terminate just because the SQL is old.
It may still own a sink topic used by downstream services.
7. Capacity Planning Mental Model
Capacity planning starts with the shape of the query.
7.1 Stateless Queries
Example:
CREATE STREAM high_priority_case_events AS
SELECT *
FROM case_events
WHERE priority = 'HIGH';
Main cost drivers:
- input rate
- output rate
- serialization/deserialization
- predicate cost
- network throughput
- sink topic partitions
Stateless queries are usually easier to scale.
7.2 Stateful Aggregations
Example:
CREATE TABLE case_event_counts AS
SELECT case_id, COUNT(*) AS event_count
FROM case_events
GROUP BY case_id;
Main cost drivers:
- key cardinality
- update frequency per key
- state-store size
- changelog traffic
- restore time
- compaction behavior
- local disk
Stateful queries need disk and restore planning.
7.3 Windowed Aggregations
Example:
CREATE TABLE hourly_case_alert_counts AS
SELECT case_id, COUNT(*) AS alert_count
FROM case_alerts
WINDOW TUMBLING (SIZE 1 HOUR, GRACE PERIOD 10 MINUTES)
GROUP BY case_id;
Main cost drivers:
- key cardinality
- window count
- grace period
- retention
- late events
- suppression behavior
- state cleanup
Windowed state can be much larger than expected.
A query with 1 million cases and 24 active hourly windows can create a very different state profile than a simple current-state table.
7.4 Joins
Main cost drivers:
- co-partitioning
- table materialization
- repartitioning
- join window size
- table update rate
- stream event rate
- skewed keys
A join is not just a lookup.
A join creates operational load on state stores, changelog topics, and possibly repartition topics.
8. Scaling Model
ksqlDB scales through Kafka Streams partition assignment.
The simplified rule:
Parallelism is bounded by the number of input partitions and the topology shape.
If a source topic has 6 partitions, running 20 ksqlDB servers does not automatically create 20-way parallelism for that query.
8.1 Horizontal Scaling
Add ksqlDB servers when:
- CPU is saturated
- consumer lag grows steadily
- query tasks can be redistributed
- partitions allow additional parallelism
- state restore time is acceptable
Do not add servers blindly if the actual bottleneck is:
- one hot partition
- slow sink topic
- broker throttling
- schema registry latency
- network saturation
- disk pressure
- state-store compaction
8.2 Partition Count as Scaling Boundary
For a query that consumes a topic with 12 partitions:
| ksqlDB Servers | Possible Effect |
|---|---|
| 1 | One process owns all tasks |
| 2 | Tasks may split across two processes |
| 3 | Better distribution if partitions are balanced |
| 12 | Upper-bound one active task per partition shape |
| 20 | Extra instances may be idle or underused |
The exact task distribution depends on topology, query shape, and assignment.
But the invariant remains:
You cannot scale beyond the partitioning model without changing the data model or topology.
8.3 Vertical Scaling
Scale vertically when:
- state store needs more disk throughput
- RocksDB/native memory needs more headroom
- query CPU is high but partition count is low
- rebalancing cost is too high with too many small nodes
Vertical scaling is not failure-proof.
A very large stateful task on one machine can increase recovery time.
8.4 Scaling Stateful Queries
Stateful scaling is harder than stateless scaling.
When a stateful query task moves to another node, state may need to be restored from changelog topics.
That means scaling can temporarily increase:
- broker read load
- network load
- local disk writes
- processing latency
- consumer lag
Scaling itself can become an incident if performed during peak load without capacity headroom.
9. High Availability and Pull Queries
ksqlDB can serve pull queries from materialized state.
For availability, consider:
- multiple ksqlDB nodes
- state replication through changelog topics
- standby replicas where supported/configured
- Kafka topic replication
- load balancer behavior
- query routing behavior
- read staleness tolerance
A pull query reads current materialized state, not a relational database table.
So define:
- expected freshness
- allowed staleness
- behavior during restore
- behavior during rebalance
- behavior when key is missing
- timeout behavior
- retry behavior
For a regulatory dashboard, stale state may be acceptable for a few seconds.
For enforcement deadline decisions, stale state may be unacceptable.
Do not use the same availability model for both.
10. Consumer Lag in ksqlDB
Consumer lag is the most important early warning signal for many ksqlDB incidents.
But lag must be interpreted carefully.
10.1 What Lag Means
Lag means the query is behind the source topic.
A simple mental model:
lag = latest topic offset - query consumed offset
If lag keeps growing, the query is not keeping up.
If lag spikes and then falls, the query may be recovering after a burst.
If lag is low but output is wrong, the problem is not throughput; it is logic, schema, keying, or data quality.
10.2 Types of Lag
| Lag Type | Meaning |
|---|---|
| Source lag | Query behind input topic |
| Repartition lag | Query behind internal repartition topic |
| Changelog restore lag | State store being rebuilt from changelog |
| Sink consumer lag | Downstream service behind ksqlDB output topic |
| End-to-end lag | Delay from source event creation to final derived output |
Do not collapse all lag into one number.
10.3 Getting Query Consumer Group
In many workflows, use EXPLAIN to inspect query metadata and locate the underlying consumer group.
Example:
EXPLAIN CSAS_ESCALATION_CANDIDATES_17;
Then monitor the consumer group lag with your Kafka monitoring system.
This matters because a persistent ksqlDB query is a Kafka consumer application.
It is not magic outside Kafka operational mechanics.
10.4 Lag Diagnosis Tree
11. Metrics That Matter
Good ksqlDB monitoring combines:
- ksqlDB metrics
- Kafka client metrics
- Kafka broker metrics
- topic metrics
- JVM metrics
- OS/disk metrics
- downstream consumer metrics
11.1 Query Health Metrics
Track:
- query status
- query error count
- processing rate
- input records/sec
- output records/sec
- skipped records
- deserialization errors
- production errors
- processing latency
- commit latency
11.2 Lag Metrics
Track:
- consumer lag per query
- lag trend
- lag by partition
- maximum partition lag
- end-to-end event latency
Averages hide hot partitions.
Always inspect lag by partition when troubleshooting.
11.3 State Metrics
Track:
- state-store size
- changelog topic size
- restore progress
- restore duration
- RocksDB memory/disk signals where available
- local disk free space
- compaction pressure
11.4 JVM Metrics
Track:
- heap usage
- GC pause time
- thread count
- direct/native memory symptoms
- CPU usage
- open file descriptors
Stateful stream processing often stresses disk and native memory, not just heap.
11.5 Kafka Topic Metrics
Track for source, sink, repartition, and changelog topics:
- partition count
- replication factor
- under-replicated partitions
- offline partitions
- produce rate
- consume rate
- retained bytes
- compaction health
- min ISR risk
A ksqlDB query can look unhealthy because Kafka topics underneath are unhealthy.
12. Alerting Rules
Good alerts describe symptoms that require action.
Poor alerts describe numbers that are merely interesting.
12.1 Useful Alerts
| Alert | Why It Matters |
|---|---|
| Lag increasing for N minutes | Query cannot keep up |
| Max partition lag much higher than average | Hot partition or skew |
| Query not running | Output may stop |
| Deserialization errors > 0 | Schema/data contract broken |
| State dir disk usage high | Node may fail or corrupt local availability |
| Restore time exceeds threshold | Recovery objective at risk |
| Sink topic produce failures | Output contract broken |
| Internal topic missing | Query cannot recover correctly |
| DLQ/quarantine output spike | Upstream data quality or logic failure |
12.2 Avoid Noisy Alerts
Avoid alerting only on:
- instantaneous lag spike
- CPU above 70% for 1 minute
- one-time rebalance
- low-volume topic no output
- query restart during planned deploy
Noise trains engineers to ignore the system.
13. State Management
State is the core operational difference between simple filtering and serious stream processing.
ksqlDB state can exist as:
- local state store
- Kafka changelog topic
- sink compacted topic
- internal repartition topic
- materialized table for pull queries
13.1 Local State Directory
The local state directory must be treated as important runtime storage.
It should have:
- enough disk capacity
- predictable I/O performance
- monitoring
- safe cleanup procedure
- recovery strategy
Do not place large state stores on ephemeral disks unless restore time is acceptable.
13.2 Changelog Topics
Changelog topics allow state reconstruction.
But they also add:
- write amplification
- broker storage usage
- replication traffic
- compaction requirements
- restore-time dependency
A very large changelog topic can make node recovery slow.
13.3 Repartition Topics
Repartition topics appear when ksqlDB must reorganize data by key.
They are not harmless temporary details.
They can affect:
- throughput
- broker storage
- latency
- query recovery
- access control
- observability
If a query creates repartition topics unexpectedly, revisit key design.
13.4 State Size Estimation
Estimate state size before production.
For table aggregation:
state size ≈ number_of_keys × average_state_per_key × overhead_factor
For windowed aggregation:
state size ≈ number_of_keys × active_windows_per_key × average_state_per_window × overhead_factor
The overhead factor accounts for serialization, RocksDB metadata, changelog compaction, and implementation overhead.
Use measured load tests to refine the estimate.
14. Internal Topics Are Production Topics
ksqlDB internal topics are easy to ignore because they are not part of the domain language.
That is dangerous.
Internal topics may include:
- repartition topics
- changelog topics
- command topics
- processing log topics
They require:
- replication factor
- retention/compaction correctness
- ACLs
- monitoring
- backup/DR consideration
- naming policy
Never delete internal topics casually.
Deleting the wrong internal topic can force state rebuild, break query execution, or lose recovery history.
15. Query Deployment Patterns
15.1 Immutable Derived Topic Pattern
Instead of mutating a query output contract in place, create a versioned output.
CREATE STREAM escalation_candidates_v2 AS
SELECT ...
FROM case_events
...
EMIT CHANGES;
Then migrate downstream consumers from v1 to v2.
Benefits:
- safer rollback
- parallel validation
- no sudden downstream contract break
- easier audit trail
Cost:
- temporary duplicate compute
- extra topics
- migration coordination
15.2 Shadow Query Pattern
Run a new query alongside the old one.
Compare outputs before switching consumers.
Use this for high-risk logic changes.
15.3 Blue/Green Output Pattern
Useful when downstream consumers can switch topic names or aliases.
- Blue query produces current production output.
- Green query produces next output.
- Consumers switch after validation.
15.4 Terminate and Recreate Pattern
Use carefully.
Terminating and recreating a query may be acceptable for:
- stateless derived streams
- non-critical projections
- development environments
It is risky for:
- stateful queries
- materialized views used by applications
- long-retention output topics
- audit-sensitive projections
16. Rollback Strategy
A rollback plan must be written before deployment.
Rollback options:
| Strategy | Use When |
|---|---|
| Re-point consumers to old topic | Versioned output exists |
| Terminate new query | New query has no required downstream dependency |
| Pause downstream consumer | Output is unsafe but source data can be replayed |
| Recreate from earlier offset | Replay is deterministic and source retained |
| Deploy compensating query | Bad output needs correction stream |
A rollback that destroys forensic evidence is not acceptable in regulated systems.
Preserve:
- bad outputs
- source offsets
- query version
- schema version
- deployment timestamp
- operator action
- correction event
17. Processing Log and Bad Records
ksqlDB can surface processing errors through logs and processing-log mechanisms depending on deployment configuration.
Operationally, classify bad records into:
| Bad Record Type | Example | Response |
|---|---|---|
| Deserialization failure | Invalid Avro/JSON/Protobuf payload | Quarantine producer/schema issue |
| Null key | Aggregation requires key | Fix producer or filter/rekey explicitly |
| Invalid semantic value | Negative SLA duration | Quarantine and domain validation |
| Late event | Event outside grace period | Review event-time model |
| Unknown enum | New status not recognized | Schema compatibility/process issue |
| Tombstone surprise | Table receives delete marker | Validate table semantics |
Do not treat all bad records as query bugs.
Some are upstream contract violations.
Some are valid domain exceptions.
Some are signs the query model is too strict.
18. Security and Access Control
ksqlDB needs access to:
- source topics
- sink topics
- internal topics
- schema registry
- Kafka consumer groups
- Kafka transactional/idempotent behavior where relevant
- REST endpoint users
Use least privilege.
For a query that reads case-events and writes case-current-state, it should not have broad access to unrelated evidence, payment, identity, or enforcement topics.
Security review should include:
- topic ACLs
- consumer group ACLs
- schema registry ACLs
- service account rotation
- TLS configuration
- REST endpoint authentication
- audit logs for query changes
- PII handling
- retention policy
ksqlDB makes data transformation easy.
That also makes accidental data propagation easy.
19. Multi-Tenancy and Isolation
There are two common models.
19.1 Shared ksqlDB Cluster
Multiple teams run queries in one cluster.
Pros:
- efficient infrastructure usage
- simpler platform management
- centralized governance
Cons:
- noisy neighbor risk
- complex ownership
- shared blast radius
- harder performance isolation
19.2 Dedicated ksqlDB Cluster per Domain
Each domain or product area gets its own cluster.
Pros:
- clearer ownership
- better isolation
- easier chargeback
- smaller blast radius
Cons:
- more infrastructure
- more operational overhead
- possible duplication
For regulatory systems, prefer isolation around data sensitivity and operational criticality.
For example:
- enforcement-critical projections
- public dashboard projections
- analytics enrichment
- experimentation
should not necessarily share the same runtime blast radius.
20. Disaster Recovery and Replay
Replay is only safe if the full chain is replay-safe.
A ksqlDB replay plan must define:
- source topic retention
- source schema availability
- query SQL version
- sink topic handling
- offset reset behavior
- state-store cleanup
- internal topic handling
- downstream consumer behavior
- external side effects
- audit evidence
20.1 Replay Modes
| Replay Mode | Meaning |
|---|---|
| Rebuild projection | Recreate materialized view from source topics |
| Reprocess into new topic | Produce corrected output to versioned sink |
| Backfill missing period | Process selected historical data into output |
| Compensating stream | Emit correction events for prior bad output |
20.2 Replay Risk
Replay can create duplicates downstream.
Downstream consumers must be idempotent if replayed outputs represent commands or trigger side effects.
Better:
- derived facts can be replayed
- materialized views can be rebuilt
- candidate events can be recomputed
Riskier:
- email notifications
- enforcement task creation
- payment operations
- irreversible external API calls
Do not let ksqlDB directly produce irreversible commands unless the downstream boundary is explicitly idempotent and audited.
21. Upgrade Discipline
ksqlDB upgrades may involve:
- ksqlDB server version
- Kafka broker version
- Kafka Streams version
- Java runtime version
- schema registry version
- connector ecosystem version
- SQL syntax behavior
- serialization behavior
Before upgrade:
- Export query definitions.
- Inventory persistent queries.
- Inventory internal topics.
- Check compatibility notes.
- Run staging replay.
- Validate materialized outputs.
- Measure restore time.
- Test rollback.
- Verify ACLs.
- Verify metrics and alerts.
Never upgrade a stateful stream-processing runtime as if it were a stateless web server.
22. Performance Testing
A useful ksqlDB performance test resembles production.
It should include:
- realistic payload sizes
- realistic key distribution
- realistic partition count
- realistic schema format
- realistic window/grace configuration
- realistic retention/compaction
- realistic downstream consumers
- failure injection
- restart/rebalance test
- state restore measurement
22.1 What to Measure
Measure:
- input records/sec
- output records/sec
- p50/p95/p99 processing latency
- consumer lag trend
- CPU
- heap
- GC pauses
- disk throughput
- state-store size
- changelog topic growth
- restore time
- rebalance duration
22.2 Skew Test
Always test skew.
Example:
- 90% of events belong to 10% of cases
- one
case_idreceives a burst of 100,000 events - one party appears in thousands of cases
- one region produces most alerts
Average throughput benchmarks are misleading if production has skew.
23. Common Failure Modes
23.1 Query Cannot Keep Up
Symptoms:
- increasing consumer lag
- delayed sink output
- downstream stale views
Causes:
- insufficient CPU
- insufficient partitions
- hot partition
- expensive join/aggregation
- slow broker writes
- state-store pressure
Response:
- Identify whether lag is uniform or partition-specific.
- Check CPU and disk.
- Inspect query plan.
- Check repartition topics.
- Scale only if partitioning allows it.
- Redesign key if skew is structural.
23.2 State Restore Takes Too Long
Symptoms:
- node restart causes long unavailability
- high changelog consumption
- query remains degraded
Causes:
- huge state store
- insufficient standby replicas
- slow disk
- changelog topic too large
- broker read throttling
Response:
- Measure restore duration.
- Compare to RTO.
- Tune state and retention.
- Consider standby replicas.
- Reduce state size or split query.
23.3 Internal Topic Explosion
Symptoms:
- many internal topics
- unexpected broker storage
- hard-to-debug query plan
Causes:
- accidental repartitioning
- too many intermediate queries
- poor key design
- experimentation in shared cluster
Response:
- Inventory internal topics by query.
- Review
EXPLAINoutput. - Consolidate or version intentionally.
- Clean only after verifying ownership.
23.4 Schema Compatibility Failure
Symptoms:
- deserialization errors
- query stops or skips records
- output missing fields
Causes:
- incompatible producer change
- wrong subject naming
- default missing
- enum evolution mistake
Response:
- Freeze producer rollout if needed.
- Identify offending schema ID/version.
- Quarantine bad records if possible.
- Deploy compatible query or schema fix.
- Backfill/replay if output gap matters.
23.5 Wrong Key, Correct SQL
Symptoms:
- join misses valid records
- aggregation splits state incorrectly
- duplicate rows per entity
- lag in repartition topic
Cause:
- SQL appears semantically correct, but Kafka key semantics are wrong.
Response:
- Inspect key format and value fields.
- Verify co-partitioning.
- Re-key explicitly.
- Version output topic if contract changes.
This is one of the most common senior-level mistakes.
24. Regulatory Case-Management Example
Assume we maintain current escalation candidates.
Inputs:
case-eventscase-assignment-eventscase-risk-scorescase-deadline-events
Outputs:
case-current-statecase-escalation-candidatescase-escalation-dashboard-view
Operational questions:
| Question | Why It Matters |
|---|---|
| Can the projection be rebuilt from retained source topics? | Audit and disaster recovery |
| Is output a fact or a candidate? | Prevent accidental enforcement action |
| Is query logic policy-versioned? | Regulatory defensibility |
| Can late evidence change prior escalation state? | Temporal correctness |
| Are deadlines event-time or processing-time based? | Legal correctness |
| Does replay create duplicate tasks? | Side-effect safety |
| Are PII fields minimized? | Governance |
| Is lag alert tied to SLA impact? | Operational relevance |
A strong design separates:
- facts: durable domain events
- views: rebuildable projections
- candidates: computed suggestions
- commands: explicit intent to act
- side effects: performed by idempotent services, not hidden inside SQL
25. Production Readiness Checklist
Before deploying a ksqlDB query, answer these.
25.1 Semantics
- What business fact or view does the query represent?
- Is the output stream/table named according to meaning?
- Is the key correct?
- Are null keys handled?
- Are tombstones expected?
- Is event time handled correctly?
- Is late data handled explicitly?
25.2 Runtime
- What source topics are read?
- What sink topic is written?
- What internal topics are created?
- What state stores are created?
- What is the expected state size?
- What is the expected restore time?
- What is the expected throughput?
- What is the expected lag under peak load?
25.3 Operations
- Who owns the query?
- How is it deployed?
- How is it rolled back?
- What alerts exist?
- What dashboards exist?
- How are processing errors surfaced?
- How are bad records quarantined?
- What happens during node failure?
25.4 Governance
- Are schemas registered?
- Is compatibility mode correct?
- Are ACLs least privilege?
- Is PII minimized?
- Is retention correct?
- Is output used for decisions or merely visibility?
- Can the result be explained to auditors?
26. Anti-Patterns
26.1 Production SQL by Console
Manually pasting SQL into production without version control creates invisible infrastructure.
Avoid it.
26.2 No Query Owner
If a query has no owner, it has no reliable responder during incidents.
26.3 Treating ksqlDB as a Stateless SQL Engine
Stateful queries require state-store, changelog, restore, and disk planning.
26.4 Ignoring Internal Topics
Internal topics are part of the query runtime.
Monitor them.
26.5 Using ksqlDB for Irreversible Side Effects
ksqlDB should produce derived streams/views/candidates.
External side effects should be executed by idempotent services.
26.6 Scaling Before Understanding Skew
Adding nodes does not fix a hot key.
26.7 No Replay Test
A stream-processing system that has never been replay-tested is not truly understood.
27. A Minimal On-Call Runbook
When ksqlDB output is delayed:
- Identify affected query ID.
- Check query status.
- Run or inspect
EXPLAINmetadata. - Identify consumer group.
- Check lag by partition.
- Check source topic produce rate.
- Check sink topic produce errors.
- Check ksqlDB server CPU/heap/GC.
- Check local disk/state directory.
- Check internal topic health.
- Check schema/processing errors.
- Check recent deployments.
- Decide: wait, scale, rollback, pause downstream, or replay.
When query output is wrong:
- Preserve source offsets and output records.
- Identify query version.
- Identify schema version.
- Reproduce with sample records.
- Determine whether error is source data, SQL logic, keying, schema, or timing.
- Stop unsafe downstream side effects.
- Deploy corrected versioned output.
- Backfill or emit compensating correction.
- Document audit trail.
28. Summary
ksqlDB operations require thinking like a stream-processing operator, not like a SQL user.
The core model is:
- persistent queries are deployed applications
- streams and tables are backed by Kafka topics
- stateful queries create state stores and changelog topics
- repartitioning is a runtime cost
- lag is the primary throughput health signal
- internal topics are production assets
- deployment must be versioned
- rollback must preserve auditability
- replay must be designed, not improvised
ksqlDB is powerful because it compresses a lot of stream-processing work into SQL.
That compression is safe only when the runtime mechanics remain visible.
Top-tier engineers keep both views in mind:
- the declarative SQL model
- the distributed runtime model underneath
References
- Confluent Documentation — ksqlDB Overview: https://docs.confluent.io/platform/current/ksqldb/overview.html
- Confluent Documentation — How ksqlDB Works: https://docs.confluent.io/platform/current/ksqldb/operate-and-deploy/how-it-works.html
- Confluent Documentation — Capacity Planning with ksqlDB: https://docs.confluent.io/platform/current/ksqldb/operate-and-deploy/capacity-planning.html
- Confluent Documentation — High Availability in ksqlDB: https://docs.confluent.io/platform/current/ksqldb/operate-and-deploy/high-availability.html
- Confluent Documentation — Monitoring and Metrics in ksqlDB: https://docs.confluent.io/platform/current/ksqldb/operate-and-deploy/monitoring.html
- Confluent Documentation — Monitor ksqlDB Persistent Queries: https://docs.confluent.io/cloud/current/ksqldb/monitoring-ksqldb.html
- Apache Kafka Documentation — Monitoring: https://kafka.apache.org/41/operations/monitoring/
You just completed lesson 27 in deepen practice. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.