Kafka Streams
A stream processing library that ships inside your JVM app, not a cluster you babysit
How It Works Internally
The core idea behind Kafka Streams is the processing topology, a directed acyclic graph of stream processors. Source processors pull from Kafka topics. Intermediate processors do the work: map, filter, join, aggregate. Sink processors write results back to Kafka topics. The topology is built using either the high-level DSL (KStream, KTable, GlobalKTable) or the low-level Processor API, with custom process() callbacks and direct state store access.
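A minimal sketch of such a topology built with the DSL: one source processor, a couple of intermediate processors, and a sink. The topic names, serdes, and application id are made up for illustration.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class OrderCountsTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: read from the input topic (hypothetical name)
        KStream<String, String> orders = builder.stream("orders");

        // Intermediate processors: filter, then aggregate per key
        KTable<String, Long> countsPerCustomer = orders
                .filter((customerId, order) -> order != null)
                .groupByKey()
                .count();

        // Sink processor: write results back to Kafka
        countsPerCustomer.toStream()
                .to("order-counts-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```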
What makes Kafka Streams interesting is how it parallelizes that topology through stream tasks. It creates one task per input topic partition. If the topology reads from a topic with 12 partitions, that produces 12 tasks. Those tasks get distributed across application instances using Kafka's consumer group protocol. Four instances? Each picks up 3 tasks. Scaling is just launching more instances, up to the number of input partitions. Nothing to reconfigure. That simplicity is the main reason teams pick Kafka Streams over Flink for straightforward use cases.
State stores handle everything stateful: aggregations, joins, windowing. Under the hood, each state store is backed by RocksDB, an embedded LSM-tree key-value store. RocksDB writes to an in-memory MemTable, flushes to immutable SST files on disk, and compacts them in the background. Here is where people get surprised: each state store uses off-heap memory for the block cache (default 50MB), bloom filters, and index blocks. A topology with 5 state stores will quietly consume 250-500MB of off-heap memory per instance. This does not show up in JVM heap metrics, which is exactly why it catches people off guard. Tune block_cache_size and write_buffer_size per state store when running more than a handful of them.
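To make that tuning concrete, here is a sketch of a custom RocksDBConfigSetter that caps the block cache and write buffer per store. The sizes are illustrative, not recommendations, and the cast follows the pattern shown in the Kafka Streams memory-management docs.

```java
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

import java.util.Map;

// Applied to every state store when registered via rocksdb.config.setter.
public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        // Shrink the per-store block cache (Kafka Streams defaults to 50MB per store).
        tableConfig.setBlockCacheSize(16 * 1024 * 1024L);   // illustrative: 16MB
        options.setTableFormatConfig(tableConfig);

        // Shrink the in-memory write buffer (MemTable) and cap how many can accumulate.
        options.setWriteBufferSize(8 * 1024 * 1024L);       // illustrative: 8MB
        options.setMaxWriteBufferNumber(2);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing allocated above, so nothing to release here.
    }
}
```

Register it with props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class) so it applies to every store in the topology.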
Changelog topics are the fault tolerance layer for state stores. Every write to a state store also gets produced to a compacted Kafka changelog topic. When an instance dies and another instance picks up its tasks, the new instance rebuilds the state store by consuming the changelog from the beginning. This works, but it can be slow. The fix is standby replicas: set num.standby.replicas to maintain hot-standby copies of state stores on other instances, which cuts failover time from minutes to seconds.
Production Architecture
Deploy Kafka Streams as a normal JVM application. Put it behind a load balancer, run it in Kubernetes, whatever the team does for JVM services. Each instance is stateless from a deployment perspective. State lives in RocksDB locally and gets backed up to changelog topics. One strong recommendation: use Kubernetes StatefulSets with persistent volumes for state store directories. Without persistent volumes, every pod reschedule triggers a full state restoration, and that can take a long time.
Set num.stream.threads to roughly match the number of CPU cores per instance. Each thread processes a subset of tasks. For critical applications, set num.standby.replicas=1 so warm state replicas are ready to go. For exactly-once semantics, use processing.guarantee=exactly_once_v2 (requires Kafka 2.5+). The v2 mode uses one producer per thread instead of one per task, which dramatically cuts down the number of transactional producers in play.
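Pulling those settings together, a sketch of the relevant configuration. The application id, broker address, thread count, and state directory are illustrative assumptions.

```java
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public final class StreamsProps {
    // Sketch of the settings discussed above; pass the result to new KafkaStreams(topology, props).
    public static Properties production() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");        // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);                 // ~ CPU cores per instance
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);               // warm state replicas
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // brokers 2.5+
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");   // put this on a persistent volume
        return props;
    }
}
```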
Interactive queries are an underused feature. Expose a REST API that queries local state stores directly. Each instance knows which instance owns which key (via KafkaStreams.metadataForKey()), so a query router can fan out requests to the right instance. This turns the Kafka Streams app into a queryable materialized view. It is not a replacement for a real database, but for lookups on data already being processed, it saves writing that data to an external store and querying it there.
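A sketch of that routing step, using queryMetadataForKey() (the newer replacement for metadataForKey(), with post-2.7 accessor names). The store name, key type, and HTTP layer are assumed, and each instance must advertise itself via application.server for the host metadata to be meaningful.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.HostInfo;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CountLookup {
    private final KafkaStreams streams;
    private final HostInfo self;   // this instance's application.server host:port

    public CountLookup(KafkaStreams streams, HostInfo self) {
        this.streams = streams;
        this.self = self;
    }

    // Answer locally if this instance owns the key's partition; otherwise tell the
    // caller which instance the HTTP request should be proxied to.
    public String lookup(String storeName, String key) {
        KeyQueryMetadata meta = streams.queryMetadataForKey(storeName, key, Serdes.String().serializer());
        if (!self.equals(meta.activeHost())) {
            return "forward to " + meta.activeHost().host() + ":" + meta.activeHost().port();
        }
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(storeName, QueryableStoreTypes.keyValueStore()));
        Long count = store.get(key);
        return count == null ? "not found" : count.toString();
    }
}
```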
Monitor these metrics: process-rate (records processed per second), process-latency (per-record processing time), commit-rate (transaction commits per second), poll-rate (consumer poll frequency), state-store-put/get-rate, and rebalance-latency. The one to alert on is consumer-lag per input partition. If it exceeds the SLA threshold, the application is falling behind.
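Most teams export these through JMX or a metrics reporter, but they can also be read straight off the running instance via KafkaStreams.metrics(). A small sketch; exact metric names and tags vary somewhat by version, so treat the watched set as illustrative.

```java
import org.apache.kafka.streams.KafkaStreams;

import java.util.Set;

public class MetricsProbe {
    // Illustrative subset of the metric names discussed above.
    private static final Set<String> WATCHED =
            Set.of("process-rate", "process-latency-avg", "commit-rate", "records-lag");

    // Print the watched metrics with their tags (thread, task, topic-partition, ...).
    public static void dump(KafkaStreams streams) {
        streams.metrics().forEach((name, metric) -> {
            if (WATCHED.contains(name.name())) {
                System.out.printf("%s %s = %s%n", name.name(), name.tags(), metric.metricValue());
            }
        });
    }
}
```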
Capacity Planning
Each stream task processes records sequentially from one partition. If a single partition produces 10,000 records/sec and processing takes 1ms per record, one task handles 1,000 records/sec. That means 10 partitions (and 10 tasks) are needed to keep up. The math: required_partitions = throughput / (1000 / processing_latency_ms). Simple, but people still get it wrong by underestimating processing latency under load.
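The same formula as a tiny helper, useful for sanity-checking against latency measured under load rather than on an idle system:

```java
public final class PartitionSizing {
    // required_partitions = throughput / (1000 / processing_latency_ms), rounded up.
    public static int requiredPartitions(double recordsPerSecond, double processingLatencyMs) {
        double perTaskThroughput = 1000.0 / processingLatencyMs;   // records/sec one task can sustain
        return (int) Math.ceil(recordsPerSecond / perTaskThroughput);
    }

    public static void main(String[] args) {
        // 10,000 records/sec at 1ms per record -> 10 partitions (and 10 tasks)
        System.out.println(requiredPartitions(10_000, 1.0));
    }
}
```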
Memory planning is where teams mess up most often. The requirements include JVM heap (2-8GB is typical) plus RocksDB off-heap (block_cache_size * num_state_stores * num_tasks_per_instance). For an instance running 4 tasks, each with 3 state stores at 64MB block cache: 4 * 3 * 64MB = 768MB off-heap. Total memory: 4GB heap + 768MB off-heap + 500MB overhead = roughly 5.3GB. Budget more than expected. RocksDB memory usage is hard to observe through standard JVM tooling, and OOM kills from off-heap growth are a common production surprise.
State store sizing follows from the data. The changelog topic retains the latest value per key (it is a compacted topic). If a KTable holds 100 million keys with 500-byte values, that is about 50GB of total state. Spread across 12 instances, each holds about 4.2GB. Restoration time: at 50MB/s consumer throughput, restoring 4.2GB takes roughly 84 seconds. With num.standby.replicas=1, failover happens in under 5 seconds because the standby is already caught up. The tradeoff is double the disk usage across the cluster, which is usually worth it.
Failure Scenarios
Scenario 1: Instance restart triggers 30-minute state restoration. An instance with 50GB of accumulated state store data crashes. The replacement must replay the entire changelog topic to rebuild RocksDB. At 30MB/s consumer throughput (bottlenecked by disk writes), restoration takes 28 minutes. During that time, the tasks assigned to this instance make no progress and consumer lag grows. Detection: watch restore-rate and restore-remaining metrics, then alert when restoration exceeds the SLA. Fix: configure num.standby.replicas=1 to keep a warm copy. When the primary fails, the standby takes over in seconds with no restoration needed. Alternatively, use persistent volumes in Kubernetes so the RocksDB data survives pod restarts. For any state store larger than 1GB, always run at least 1 standby replica.
Scenario 2: Co-partitioning violation causes silent incorrect join results. Someone adds a Kafka Streams join between an orders topic (24 partitions) and a users topic (12 partitions). Kafka Streams validates co-partitioning at startup and throws a TopologyException, so that gets caught quickly. The subtle version is worse: both topics have 24 partitions, but one was produced with a custom partitioner while the other uses the default. Records for the same key land on different partitions in each topic, so the join never matches them. The application runs without errors, but join results are empty or incomplete. I have seen this take weeks to catch. Detection: write integration tests that verify join output for known key pairs, and track join-hit-rate as a custom metric. Fix: force one side through a repartition() operation (through() on older versions) so Kafka Streams rewrites it with a consistent partitioner and matching keys end up on the same partition number. This kind of bug often traces back to a legacy topic using a hash-modulo partitioner that does not match Kafka's default Murmur2 partitioner.
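A sketch of that fix with repartition(): the orders stream is forced through an internal repartition topic written by Kafka Streams itself, so its keys line up with the default-partitioned users topic before the join. Topic names, serdes, and the partition count are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Repartitioned;

public final class RealignedJoin {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // "orders" was written by a legacy producer with a custom partitioner, so its keys do not
        // line up with the default-partitioned "users" topic even at equal partition counts.
        KStream<String, String> orders = builder.stream("orders");
        KTable<String, String> users = builder.table("users");

        // Force orders through an internal repartition topic. Kafka Streams writes it with its
        // default (murmur2-based) partitioning, so matching keys now share partition numbers.
        KStream<String, String> realignedOrders = orders.repartition(
                Repartitioned.<String, String>with(Serdes.String(), Serdes.String())
                        .withName("orders-realigned")
                        .withNumberOfPartitions(24));

        realignedOrders
                .join(users,
                        (order, user) -> order + "|" + user,
                        Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
                .to("orders-enriched");

        return builder.build();
    }
}
```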
Pros
- • No separate cluster needed. It runs as a library inside your app.
- • Exactly-once processing semantics
- • Elastic scaling via Kafka consumer groups
- • Interactive queries on local state stores
- • Simple deployment (just a JVM application)
Cons
- • Locked to Kafka for both input and output
- • JVM only (Java/Kotlin/Scala)
- • Limited to Kafka's partitioning model
- • State stores can grow large on disk
- • Less capable than Flink for complex processing
When to use
- • You already run Kafka and need straightforward stream processing
- • You want to skip managing a separate processing cluster
- • Building event-driven microservices
- • You need exactly-once guarantees inside the Kafka ecosystem
When NOT to use
- • Processing data from non-Kafka sources
- • You need advanced CEP or event-time processing
- • Your team runs Python or another non-JVM stack
- • Complex multi-stream joins and windowing beyond what the Kafka Streams DSL handles well
Key Points
- •Library, not a cluster. Kafka Streams runs inside the JVM process and scales through Kafka consumer groups. There is zero infrastructure to manage beyond the app itself.
- •Exactly-once processing works through Kafka transactions. The read-process-write cycle is atomic: producer transactions and consumer offset commits happen in a single transaction.
- •RocksDB-backed state stores provide fast local key-value access for aggregations, joins, and windowing. Data sits on the local disk of each instance.
- •Changelog topics replicate every state store mutation back to Kafka. When an instance restarts, it replays the changelog to rebuild state. That is the fault tolerance mechanism.
- •KTable and KStream duality matters. A KStream is a record stream (insert semantics). A KTable is a changelog stream (upsert semantics). Each can be converted into the other, as the sketch after this list shows.
- •Co-partitioning is mandatory for joins. Both input topics must share the same partition count and the same partitioning strategy. A count mismatch fails fast at startup; a partitioner mismatch silently produces wrong results.
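A sketch of that KStream/KTable duality in DSL terms (topic names are made up): counting a stream yields a table, a table's updates can be streamed back out, and a stream can be re-read as a table.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public final class DualityExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // KStream: insert semantics, every record is an independent event.
        KStream<String, String> clicks = builder.stream("page-clicks");

        // Aggregating a stream produces a KTable: upsert semantics, latest count per key wins.
        KTable<String, Long> clicksPerUser = clicks.groupByKey().count();

        // A KTable's changes can be read back out as a stream of updates...
        KStream<String, Long> updates = clicksPerUser.toStream();

        // ...and a KStream can be interpreted as a table holding the latest value per key.
        KTable<String, String> latestClickPerUser = clicks.toTable();

        System.out.println(builder.build().describe());
    }
}
```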
Common Mistakes
- ✗Not sizing RocksDB state stores. Each one uses off-heap memory for block cache and memtable. Ten state stores with the default 50MB block cache each will eat 500MB of off-heap memory before counting write buffers.
- ✗Rebalancing delays with many tasks. Each partition creates a stream task. With 500 partitions across 3 topics, that means 500 tasks, and rebalancing takes minutes while state stores restore.
- ✗Ignoring the co-partitioning requirement for joins. Kafka Streams catches mismatched partition counts at startup, but mismatched partitioners slip through and silently produce wrong results.
- ✗Forgetting about restore time after restart. State store restoration replays the entire changelog topic. With 100GB of state, expect 30+ minutes of the instance sitting idle.
- ✗Leaving the default commit interval at 30s without thinking about it. A crash loses up to 30 seconds of uncommitted progress. Drop it to 100ms for low-latency workloads, but expect lower throughput.