Skip to content

Tutorial 1: What is Kafka Streams?

Part one of five. This part is conceptual; the code starts in the next one.

Most web services follow request/response: receive a request, do work, send a response. That model breaks down when data arrives continuously (logs, events, sensor readings, user actions) and when processing depends on past data. Streaming systems handle unending sequences of records and deal with the hard parts: failures without data loss, state that survives restarts, coordination across processing steps.

Kafka Streams is a library, not a cluster. You compile your topology into your application binary, start it, and the framework handles partition assignment, state management, offset tracking, fault tolerance, and scaling. Your service is the runtime.

A Kafka topic is an append-only log. Each entry is a record with a key, value, and timestamp. Records are ordered within a partition and kept for a configurable retention window.

The log is partitioned (split into independent streams) for parallelism. Anyone can append; readers track their own position and read forward.

A topology is a chain of processing steps. Each step is an operator that transforms the stream:

  • Source: Reads from a topic
  • map: Transform each record (like map on lists)
  • filter: Keep only matching records
  • groupBy: Reorganize by a different key
  • count/aggregate: Compute running totals
  • Sink: Write to a topic
flowchart LR
  In[Kafka topic] --> Filter[filter suspicious IPs]
  Filter --> Group[groupBy user]
  Group --> Count[count login attempts]
  Count --> Out[Kafka topic]
  Group -.->|maintains| Store[(state: attempts per user)]

This example reads login events, filters to suspicious IPs, groups by user, counts attempts per user, and outputs counts to another topic. The state store (attempts per user) is maintained automatically.

Stateless operators see each record in isolation. A filter or map doesn’t need to remember anything between records.

Stateful operators need memory. A count needs to remember previous counts. A windowed average needs to remember recent values. Kafka Streams handles state recovery by writing state changes to a changelog topic (a hidden Kafka topic), replaying the changelog on restart to rebuild state, and maintaining standby replicas on other instances for fast failover.

How Kafka Streams differs from other approaches

Section titled “How Kafka Streams differs from other approaches”

You could write a loop that polls Kafka and processes records:

-- Conceptual code
forever $ do
records <- poll consumer
forM_ records process

This works for simple cases. But you must handle failures (if process throws, what happens to the batch?), state (where do you store intermediate results?), scaling (how do you coordinate multiple instances?), and exactly-once (how do you avoid double-processing on restart?). Kafka Streams handles all of this.

Flink is a full streaming platform. You submit jobs to a cluster that manages them. This is suited for large-scale analytics but adds operational complexity: separate cluster to maintain, jobs isolated from your service code, deployment is “submit a JAR to the cluster.” Kafka Streams keeps the processing in your service. Same binary, same deployment, same monitoring.

The library builds on three guarantees from Kafka itself:

  1. Durability. Records are replicated across brokers. If your service dies mid-processing, the records are still there when you restart.

  2. Replay. Each consumer tracks its position. Restart from where you left off, or rewind to reprocess old data.

  3. Ordering within partitions. Records with the same key always land on the same partition and are consumed in order. This makes stateful processing predictable.

The library uses (1) for state recovery (state changes go to a changelog topic), and (2) + (3) to make processing consistent across restarts.

CapabilityWhat you get
State storesLocal data structures (key-value, windowed, session) backed by Kafka
JoinsCombine streams and tables: stream-stream, stream-table, table-table
WindowsTime-based grouping: tumbling (fixed intervals), hopping (overlapping), session (activity gaps)
Exactly-onceAtomic processing: input, output, and state update together
Standby tasksHot replicas for fast failover
Interactive queriesRead your state stores directly from your service

For advanced use cases, optional extensions are available:

  • Async I/O: Call external APIs without blocking processing
  • Snapshot stores: Fast recovery from large state
  • Two-phase commit sinks: Exactly-once writes to databases, S3, etc.

Start with the base library; add extensions when you need them.

TermPlain English meaning
TopologyYour processing pipeline: a graph of operators that data flows through
KStreamA stream of records (append-only, like a log: new events keep getting added)
KTableA table derived from a stream (each key keeps only its latest value)
State storeLocal storage for stateful operators (in-memory or on disk)
PartitionOne shard of a topic; the unit of parallelism
TaskOne instance of your topology processing one partition
Consumer groupMultiple instances sharing the work
Changelog topicHidden Kafka topic that backs a state store

The remaining four parts build a pipe topology, a word counter with queryable state, a page-view enricher with joins, and a production checklist. Each is self-contained and runs without a Kafka broker.

Continue to Tutorial 2: Your first topology →