Runbooks
This page contains procedures for common incidents. Each runbook is designed to be followed during an active incident. Start with the alert you are seeing, work through diagnosis, then follow the resolution steps.
Keep this page bookmarked. When an alert fires, open it and follow the relevant procedure.
What a runbook is (and isn’t)
Section titled “What a runbook is (and isn’t)”A runbook is a checklist for your brain during an outage. When production is down, you don’t want to be figuring out which metrics matter or what the recovery command is. You want a procedure you can follow that leads to a known good state.
Each runbook in this page follows the same structure:
- Alert: What you see in your monitoring (the trigger)
- Diagnosis: How to confirm the root cause (the investigation)
- Resolve: The steps to fix it (the action)
- Prevent: How to stop it happening again (the learning)
A runbook is not a troubleshooting guide for development. If you’re debugging why your topology behaves unexpectedly in the test driver, that’s a different process. Runbooks are for when the service is already deployed and something has gone wrong in production.
Rebalance storm
Section titled “Rebalance storm”Alert: task-restart-total rate spike across multiple
instances, throughput drops, state-transition: RUNNING -> REBALANCING -> RUNNING flapping in logs.
Diagnosis
Section titled “Diagnosis”A “storm” is rebalances triggered faster than the group can settle, usually within minutes of each other. Common causes:
- Liveness flaps. One instance is hitting GC pauses, network blips, or saturated CPU. Its heartbeats time out, the group kicks it, it rejoins, repeat.
- Probing-rebalance loop.
probingRebalanceIntervalMsis set short andacceptableRecoveryLagis set tight; warmups pass the threshold, the probing rebalance fires, the new ownership is fragile and triggers another probe immediately. - Memory pressure. RocksDB compaction stalling the instance long enough to miss heartbeats.
- Misconfigured
session.timeout.mson the consumer side relative toheartbeat.interval.ms.
Resolve
Section titled “Resolve”- Identify the flapping instance(s).
setRebalanceListenerlogs should show which instances repeatedly join and leave. - For liveness flaps: look at OS-level metrics for that instance (CPU, GC, network). Fix the underlying cause; the rebalance loop stops itself.
- For probing-rebalance loops: temporarily raise
probingRebalanceIntervalMsto 600_000 (10 min default). Restart the affected instances with the new config. - For RocksDB compaction pressure: check
storePutTotalversus underlying disk write throughput; the ratio tells you whether you’re write-throttled. Mitigations include moving the state directory to faster storage, reducingnumStandbyReplicastemporarily, or shedding load viapauseKafkaStreamson the worst-affected instance until compaction catches up. - For consumer-config skew: confirm
session.timeout.ms > 3 × heartbeat.interval.msand that neither has changed recently.
Prevent
Section titled “Prevent”- Standardise
probingRebalanceIntervalMsandacceptableRecoveryLagso all instances agree. - Monitor GC pauses; alert at p99 > 200 ms.
- Provision local disk IOPS for RocksDB’s compaction worst case, not its steady state. See Running in containers for how compaction interacts with container memory limits and the OOM-killer.
CommitFatal after commit2PC
Section titled “CommitFatal after commit2PC”Alert: commit-cycle-fatal counter > 0 with reason
commit2PC: ….
Diagnosis
Section titled “Diagnosis”The producer transaction committed (records are durable on
Kafka), but the external 2PC sink failed to finalise. The
in-flight SinkTxnId is stranded in the prepared state on the
external system. The runtime is killed on this outcome.
Resolve
Section titled “Resolve”- Confirm the runtime is dead. The supervisor should already
be restarting it. Don’t
unpauseor otherwise interfere before the restart. - On restart, the runtime calls
tpsRecoverfor every configured sink. The sink returns a list ofSinkTxnIds currently in the prepared state. - For each token, the runtime calls the sink’s recovery logic
:
CommitFromToken(finish the half-committed txn),AbortFromToken(roll it back), orUnknownLeaveAsIs(log and leave). - Verify the recovery decision is correct. Cross-reference
the consumer offsets at
applicationId-<task>with the prepared-txn list on the external system. If a prepared txn’s producer cycle did commit (offsets advanced past it), the correct action isCommitFromToken. - If
UnknownLeaveAsIscame back, escalate. Manual review: inspect the external system’s prepared state, confirm what was supposed to happen, finish or roll back by hand.
Prevent
Section titled “Prevent”- Ensure
tpsRecoveris implemented correctly for every production sink: not just returning[]. - Make
tpsCommitandtpsAbortstrictly idempotent so a duplicate call after a partial recovery is a no-op. - Alert on
commit-cycle-fatalseparately fromcommit-cycle-aborted; the latter is normal noise, the former is operator-required.
See Exactly-once across Kafka and other systems for the contract details.
Standby never promotes; lag stays high
Section titled “Standby never promotes; lag stays high”Alert: Per-task warmup lag for a standby has been above
acceptableRecoveryLag for longer than
2 × probingRebalanceIntervalMs.
Diagnosis
Section titled “Diagnosis”The standby’s changelog-replay loop can’t keep up with the active’s production rate. Common causes:
- Network bandwidth saturation. The standby is reading the changelog at a rate lower than the active is writing it.
- Local disk throughput. State writes during replay are slower than the changelog read.
numStandbyReplicasis too high for the cluster’s write capacity. Each replica triples the changelog read fan- out.- The active’s commit-cycle is faster than the standby can catch up between cycles. Symptom: lag oscillates around a floor but never crosses it.
Resolve
Section titled “Resolve”- Identify the affected standby.
LagListenersnapshots tell you which task and which store. - Compare standby-replay throughput to active-write throughput. If the gap is structural (active produces faster than standby can drain even at idle), the standby will never catch up.
- For network / disk saturation: move the standby to an
instance with more headroom, or reduce
numStandbyReplicasfor this task. Both require config changes + restart. - For commit-cycle skew: raise
commitIntervalMson the active so batches are larger; the standby’s replay cost per commit goes down. - As a last resort: use Riffle’s
SnapshotPointerstandby mode. The standby stops replicating bytes and instead tracks(snapshotId, advancedTo). Promotion fetches the snapshot blob + replays the tail. Trade-off: promotion takes longer (snapshot fetch time) but steady-state cost is zero.
Prevent
Section titled “Prevent”- Provision the standby instances with at least equal disk and network capacity to the active.
- Right-size
numStandbyReplicasfor your write capacity, not just your storage capacity.
Orphan internal topics detected
Section titled “Orphan internal topics detected”Alert: Startup log line orphan internal topic: <name> or
the orphan-detector metric > 0.
Diagnosis
Section titled “Diagnosis”A previous deploy renamed an operator or removed a store. The broker is keeping the old internal topic (changelog or repartition) and its disk usage continues.
Resolve
Section titled “Resolve”- Confirm it is genuinely orphaned. Run
Kafka.Streams.Observability.OrphanTopics.detectOrphansagainst the current production topology and the current broker topic list. The output should match the alert. - Check for in-flight rolling deploys. If a v1 instance is still running and v2 introduced the rename, the “orphan” is actually still in use by v1. Don’t delete until v1 is fully drained.
- Settle. Wait at least one full commit-cycle multiple beyond the slowest instance’s drain time.
- Delete via the AdminClient. Use a real broker-side
deleteTopicscall. The runtime will not do this for you. - Confirm clean. Re-run the detector; expect zero output.
Prevent
Section titled “Prevent”- Pin every stateful operator’s name with
Named. - Run the orphan detector in CI against the deployment-shape golden file.
- Treat any topology change that affects internal-topic names as a stateful migration (see Topology evolution).
Producer fenced / INVALID_PRODUCER_EPOCH
Section titled “Producer fenced / INVALID_PRODUCER_EPOCH”Alert: Runtime log line containing
InvalidProducerEpochException or ProducerFencedException.
Diagnosis
Section titled “Diagnosis”Under EOS-V2 the broker fences a producer when a newer producer
with the same transactional.id has been observed. This usually
means:
- Two instances of the same
(applicationId, taskId)are somehow alive. Almost always a misconfigured rollout or a zombie from a previous deploy. - A network partition resolved with both sides thinking they own the task. KIP-848’s incremental reconciliation makes this very rare but not impossible.
transactional.id.expiration.mson the broker has elapsed for an idle producer; the broker forgets it; the instance tries to commit and gets fenced.
Resolve
Section titled “Resolve”- The runtime will already have transitioned the affected task
to
ERRORand (depending onsetUncaughtExceptionHandler) either restarted the thread, shut down the client, or shut down the application. Default per the runtime is to log and try to recover the task. - Identify the zombie.
metadataForLocalThreadson every instance shows what they think they own. The instance whose ownership doesn’t match the rebalance log is the zombie. - Kill the zombie. SIGTERM the OS process. The group rebalance will reassign cleanly.
- If the cause was broker-side TXN expiration: raise
transactional.id.expiration.mson the broker for this workload, or shorten the EOS commit interval so the producer isn’t ever idle long enough to expire.
Prevent
Section titled “Prevent”- Use process supervisors that kill old generations before starting new ones (no overlapping lifecycle).
- Set sensible
instance.idforstatic membershipso rebalances don’t churn during normal restarts.
Async I/O backpressure causes stream-thread stall
Section titled “Async I/O backpressure causes stream-thread stall”Alert: Async-operator aio-deposit-rate near zero while
aio-enqueue-rate also near zero (the queue is full and the
stream thread is blocked).
Diagnosis
Section titled “Diagnosis”The external system the async operator is calling has slowed
down or is failing. The in-flight queue
(aioBufferCapacity) fills, the stream thread blocks on
enqueue, and the entire downstream pipeline stalls.
This is working as intended: it’s the backpressure signal - but it should not last long.
Resolve
Section titled “Resolve”- Check the external system. If it’s down, the right action is at the external system, not at the streams app.
- If the external system is slow but up: check the async
operator’s
aioRetryandaioTimeout. A long timeout with retries means each in-flight slot is occupied for(attempts + 1) × timeoutworst case. The queue can never drain faster than that. - If the queue is structurally too small: raise
aioBufferCapacity(requires restart). The trade-off is more in-flight memory and a longer pre-commit drain. - If the failures are partial: consider switching
aioOnFailurefromFailTasktoLogAndContinueso a minority of failures don’t shed the whole pipeline. - If the external system is overwhelmed: the async
operator is doing what you asked: flooding it. Drop
aioWorkers(restart required) to reduce concurrent calls.
Prevent
Section titled “Prevent”- Size
aioBufferCapacity ≈ 4 × aioWorkersso brief stalls don’t immediately propagate. - Set
aioTimeoutto the external system’s p99 + a buffer, not its average. - Monitor
aio-failure-rateseparately; sustained high failure rate is its own incident.
Schema deserialisation flood
Section titled “Schema deserialisation flood”Alert: droppedRecordsTotal rate spike, paired with a
matching rate on the DeserializationException log channel.
Diagnosis
Section titled “Diagnosis”An upstream producer started writing records that the current
deserialiser can’t parse. (The railway-oriented programming page explains where this routing decision lives and how a DLQ wired through the DeserializationHandler gives you reprocessability.) Three usual causes:
- Schema Registry compat policy was bypassed: a new schema
was published without compatibility checks, and your
registrySerdeCheckedwrapper rejects every record. (This means the wrapper is doing its job; the producer is the problem.) - A new producer service rolled out with a different wire format and skipped Schema Registry entirely.
- A new field with a default value was added correctly, but your generated decoder doesn’t have it.
Resolve
Section titled “Resolve”- Identify the offending source. Look at the
DeserializationExceptionpayload to see the bad record; trace it to its producer. - Stop the bleeding. If you’re losing important records,
switch the deserialiser handler from
logAndContinuetofailFastso the stream stops processing. The records remain on the topic and can be re-processed once the bug is fixed. (Be careful: withfailFast, your consumer group stops making progress until the bad records are dealt with.) - Fix the producer. Either roll the producer back, or re-publish through Schema Registry with the correct compatibility check.
- Re-process the dropped records. Use the
processingException.handlerDEAD_LETTERpolicy (KIP-1033) if you have one configured; otherwise re-consume from the offending offset range with a one-shot consumer.
Prevent
Section titled “Prevent”- Use
registrySerdeCheckedfor every Schema Registry-backed serde. It probes the per-subject compatibility mode and fails fast at construction. - Alert on
droppedRecordsTotalrate, not just absolute count. - Enforce Schema Registry compatibility at the producer side too: don’t rely on the consumer being the only check.
Local state directory grows without bound
Section titled “Local state directory grows without bound”Alert: stateDir disk usage exceeds expected ceiling.
Diagnosis
Section titled “Diagnosis”State stores grow with the number of unique keys. Common causes of unbounded growth:
- The source topic is not compacted (or has very long retention) and you’re materialising every key ever seen.
- A windowed store has
withGracePeriodlonger than your retention budget assumes. - A KTable is built off a topic with very high unique-key cardinality and you didn’t realise.
- Old standby task directories from a prior deploy were never cleaned up.
Resolve
Section titled “Resolve”- Identify which store is growing. RocksDB has per-store
subdirectories under
stateDir; sizing them is direct. - For (1) and (3): add a TTL via
Kafka.Streams.State.KeyValue.TTL(wall-clock) orEventTimeTTL(driven off the coordinated watermark). This actively expires entries on every read; pair with a punctuator that callsexpireBeforefor active sweeping. - For (2): confirm the grace period matches the window retention. A 1-hour window with a 7-day grace materialises 7 days of state.
- For (4):
cleanUpwipes the local directory; the runtime re-warms from the changelog or snapshot on next start. Safe when standbys exist; loses work otherwise.
Prevent
Section titled “Prevent”- Always pick a
KeyValueStorebackend that fits the cardinality budget. In-memory is fine for ≤10⁶ keys; for more, use RocksDB (+rocksdbflag), the Riffle tiered (hot + cold S3) backend, or the remote-KV backend. - Apply TTLs proactively for any topology where the key cardinality is unbounded.
- Monitor
stateDirsize as a first-class metric, not just disk usage. Running in containers → disk sizing has the budgeting formula.
EOS commit cycle taking longer than commitIntervalMs
Section titled “EOS commit cycle taking longer than commitIntervalMs”Alert: commit-duration p99 approaches commitIntervalMs.
Diagnosis
Section titled “Diagnosis”The commit cycle (runCommitCycle) walks six steps:
beginTxn → flush → commitOffsets → preCommit2PC → commitTxn → commit2PC → storeCommit. Any of them can take time. Most likely:
flushis slow because there are many records in the transactional buffer (high commit interval, large per-record processing cost).commitTxnround-trip to the broker is slow (network latency, broker load).commit2PCis slow because the external sink’s commit operation is expensive (e.g. Iceberg manifest commit on a large dataset).storeCommitis slow because the KIP-892 transactional buffer has accumulated many writes.
Resolve
Section titled “Resolve”- Per-step timing.
Kafka.Streams.Metricsexposes per-step counters. Identify which step dominates. - For (1): consider lowering
commitIntervalMsso each cycle has fewer records. Trade-off: more commit overhead per record, but better tail latency. - For (2): check broker health; this is rarely the bottleneck if the broker is healthy.
- For (3): raise
commitIntervalMsso 2PC commits are amortised over more records. Be aware this widens the reprocessing window on a fault. - For (4): tune cache size (
cacheMaxBytesBuffering) so writes coalesce more aggressively before they hit the store.
Prevent
Section titled “Prevent”- Benchmark the commit cycle under your expected throughput at
load-test time; size
commitIntervalMsaccordingly. - For 2PC sinks, the commit is structurally per-cycle; size cycles for the sink’s commit cost, not the per-record cost.
Interactive query returns StoreNotFound during a rebalance
Section titled “Interactive query returns StoreNotFound during a rebalance”Alert: Spikes of 404s on an IQ-fronted endpoint during a rolling deploy.
Diagnosis
Section titled “Diagnosis”During a rebalance, a partition (and its state store) is
transiently unowned. Queries routed to the previous owner get
StoreNotFound; queries routed to the new owner get the same
until the store has been re-bound (instant with standby + KIP-848,
slow without).
Resolve
Section titled “Resolve”- In the query layer, retry with backoff and refresh
metadata.
Kafka.Streams.Discovery.StreamsMetadataupdates as the rebalance completes; a retry after~1susually succeeds. - Optionally fall through to a standby for the duration of
the rebalance.
KeyQueryMetadata.standbyHostsreturns every live standby; a stale read is usually better than a 404. - If the rebalance is taking minutes, see “Standby never promotes; lag stays high” above: that’s actually the underlying issue.
Prevent
Section titled “Prevent”- Build the query layer with the rebalance-window assumption baked in. A naïve “one shot, fail on 404” client will be brittle.
- Use
numStandbyReplicas >= 1so the rebalance is metadata-only and the unavailability window is sub-second.
Reading the metrics during an incident
Section titled “Reading the metrics during an incident”When you don’t know where to start: Every incident produces symptoms, but the same symptom (“things are slow”) can have different root causes (CPU saturation, disk I/O, network latency, or a poison-pill record). This table gives you a diagnostic path. Start with the symptom you’re seeing, check the “First metric” column, and use that reading to decide what to check next.
Quick reference for which metric to look at first:
| Symptom | First metric | Then |
|---|---|---|
| ”Things are slow” | process-latency (p50, p99) per node | Drill into the slowest node |
| ”Throughput dropped” | processTotal rate per node | If a single node, check its process-latency; if global, check commit-cycle-aborted |
| ”Records seem to disappear” | droppedRecordsTotal | Check DeserializationException log channel |
| ”Rebalance loop” | task-restart-total | Check setRebalanceListener log |
| ”EOS issues” | commit-cycle-aborted and -fatal | The reason field tells you which step failed |
| ”Standby not promoting” | per-task warmup lag from LagListener | Check standby replay throughput vs active write rate |
Related reading
Section titled “Related reading”- Observability: the metric surface this page leans on.
- Topology evolution: the deployment procedures these runbooks reference.
- Exactly-once across Kafka and other systems - the EOS internals behind the commit-cycle runbooks.