Tutorial 1: What is Kafka Streams?
Part one of five. This part is conceptual; the code starts in the next one.
Why a streaming library
Section titled “Why a streaming library”Most web services follow request/response: receive a request, do work, send a response. That model breaks down when data arrives continuously (logs, events, sensor readings, user actions) and when processing depends on past data. Streaming systems handle unending sequences of records and deal with the hard parts: failures without data loss, state that survives restarts, coordination across processing steps.
Kafka Streams is a library, not a cluster. You compile your topology into your application binary, start it, and the framework handles partition assignment, state management, offset tracking, fault tolerance, and scaling. Your service is the runtime.
Core concepts
Section titled “Core concepts”Topics: the log of events
Section titled “Topics: the log of events”A Kafka topic is an append-only log. Each entry is a record with a key, value, and timestamp. Records are ordered within a partition and kept for a configurable retention window.
The log is partitioned (split into independent streams) for parallelism. Anyone can append; readers track their own position and read forward.
Processing pipelines (topologies)
Section titled “Processing pipelines (topologies)”A topology is a chain of processing steps. Each step is an operator that transforms the stream:
- Source: Reads from a topic
- map: Transform each record (like
mapon lists) - filter: Keep only matching records
- groupBy: Reorganize by a different key
- count/aggregate: Compute running totals
- Sink: Write to a topic
flowchart LR In[Kafka topic] --> Filter[filter suspicious IPs] Filter --> Group[groupBy user] Group --> Count[count login attempts] Count --> Out[Kafka topic] Group -.->|maintains| Store[(state: attempts per user)]
This example reads login events, filters to suspicious IPs, groups by user, counts attempts per user, and outputs counts to another topic. The state store (attempts per user) is maintained automatically.
Stateful vs stateless processing
Section titled “Stateful vs stateless processing”Stateless operators see each record in isolation. A filter or map doesn’t need to remember anything between records.
Stateful operators need memory. A count needs to remember previous counts. A windowed average needs to remember recent values. Kafka Streams handles state recovery by writing state changes to a changelog topic (a hidden Kafka topic), replaying the changelog on restart to rebuild state, and maintaining standby replicas on other instances for fast failover.
How Kafka Streams differs from other approaches
Section titled “How Kafka Streams differs from other approaches”Plain Kafka consumer
Section titled “Plain Kafka consumer”You could write a loop that polls Kafka and processes records:
-- Conceptual codeforever $ do records <- poll consumer forM_ records processThis works for simple cases. But you must handle failures (if process throws, what happens to the batch?), state (where do you store intermediate results?), scaling (how do you coordinate multiple instances?), and exactly-once (how do you avoid double-processing on restart?). Kafka Streams handles all of this.
Flink is a full streaming platform. You submit jobs to a cluster that manages them. This is suited for large-scale analytics but adds operational complexity: separate cluster to maintain, jobs isolated from your service code, deployment is “submit a JAR to the cluster.” Kafka Streams keeps the processing in your service. Same binary, same deployment, same monitoring.
What Kafka provides
Section titled “What Kafka provides”The library builds on three guarantees from Kafka itself:
-
Durability. Records are replicated across brokers. If your service dies mid-processing, the records are still there when you restart.
-
Replay. Each consumer tracks its position. Restart from where you left off, or rewind to reprocess old data.
-
Ordering within partitions. Records with the same key always land on the same partition and are consumed in order. This makes stateful processing predictable.
The library uses (1) for state recovery (state changes go to a changelog topic), and (2) + (3) to make processing consistent across restarts.
What the library adds
Section titled “What the library adds”| Capability | What you get |
|---|---|
| State stores | Local data structures (key-value, windowed, session) backed by Kafka |
| Joins | Combine streams and tables: stream-stream, stream-table, table-table |
| Windows | Time-based grouping: tumbling (fixed intervals), hopping (overlapping), session (activity gaps) |
| Exactly-once | Atomic processing: input, output, and state update together |
| Standby tasks | Hot replicas for fast failover |
| Interactive queries | Read your state stores directly from your service |
Extended features
Section titled “Extended features”For advanced use cases, optional extensions are available:
- Async I/O: Call external APIs without blocking processing
- Snapshot stores: Fast recovery from large state
- Two-phase commit sinks: Exactly-once writes to databases, S3, etc.
Start with the base library; add extensions when you need them.
Quick vocabulary
Section titled “Quick vocabulary”| Term | Plain English meaning |
|---|---|
| Topology | Your processing pipeline: a graph of operators that data flows through |
| KStream | A stream of records (append-only, like a log: new events keep getting added) |
| KTable | A table derived from a stream (each key keeps only its latest value) |
| State store | Local storage for stateful operators (in-memory or on disk) |
| Partition | One shard of a topic; the unit of parallelism |
| Task | One instance of your topology processing one partition |
| Consumer group | Multiple instances sharing the work |
| Changelog topic | Hidden Kafka topic that backs a state store |
The remaining four parts build a pipe topology, a word counter with queryable state, a page-view enricher with joins, and a production checklist. Each is self-contained and runs without a Kafka broker.