Kafka Streams
A stream processing library that ships inside your JVM app, not a cluster you babysit
How It Works Internally
The core idea behind Kafka Streams is the processing topology, a directed acyclic graph of stream processors. Source processors pull from Kafka topics. Intermediate processors do the work: map, filter, join, aggregate. Sink processors write results back to Kafka topics. The topology is built using either the high-level DSL (KStream, KTable, GlobalKTable) or the low-level Processor API, with custom process() callbacks and direct state store access.
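A minimal sketch of such a topology built with the DSL: one source processor, a couple of intermediate processors, and a sink. The topic names, serdes, and application id are made up for illustration.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class OrderCountsTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: read from the input topic (hypothetical name)
        KStream<String, String> orders = builder.stream("orders");

        // Intermediate processors: filter, then aggregate per key
        KTable<String, Long> countsPerCustomer = orders
                .filter((customerId, order) -> order != null)
                .groupByKey()
                .count();

        // Sink processor: write results back to Kafka
        countsPerCustomer.toStream()
                .to("order-counts-per-customer", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```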
What makes Kafka Streams interesting is how it parallelizes that topology through stream tasks. It creates one task per input topic partition. If the topology reads from a topic with 12 partitions, that produces 12 tasks. Those tasks get distributed across application instances using Kafka's consumer group protocol. Four instances? Each picks up 3 tasks. Scaling is just launching more instances, up to the number of input partitions. Nothing to reconfigure. That simplicity is the main reason teams pick Kafka Streams over Flink for straightforward use cases.
State stores handle everything stateful: aggregations, joins, windowing. Under the hood, each state store is backed by RocksDB, an embedded LSM-tree key-value store. RocksDB writes to an in-memory MemTable, flushes to immutable SST files on disk, and compacts them in the background. Here is where people get surprised: each state store uses off-heap memory for the block cache (default 50MB), bloom filters, and index blocks. A topology with 5 state stores will quietly consume 250-500MB of off-heap memory per instance. This does not show up in JVM heap metrics, which is exactly why it catches people off guard. Tune block_cache_size and write_buffer_size per state store when running more than a handful of them.
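To make that tuning concrete, here is a sketch of a custom RocksDBConfigSetter that caps the block cache and write buffer per store. The sizes are illustrative, not recommendations, and the cast follows the pattern shown in the Kafka Streams memory-management docs.

```java
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

import java.util.Map;

// Applied to every state store when registered via rocksdb.config.setter.
public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        // Shrink the per-store block cache (Kafka Streams defaults to 50MB per store).
        tableConfig.setBlockCacheSize(16 * 1024 * 1024L);   // illustrative: 16MB
        options.setTableFormatConfig(tableConfig);

        // Shrink the in-memory write buffer (MemTable) and cap how many can accumulate.
        options.setWriteBufferSize(8 * 1024 * 1024L);       // illustrative: 8MB
        options.setMaxWriteBufferNumber(2);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing allocated above, so nothing to release here.
    }
}
```

Register it with props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class) so it applies to every store in the topology.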
Changelog topics are the fault tolerance layer for state stores. Every write to a state store also gets produced to a compacted Kafka changelog topic. When an instance dies and another instance picks up its tasks, the new instance rebuilds the state store by consuming the changelog from the beginning. This works, but it can be slow. The fix is standby replicas: set num.standby.replicas to maintain hot-standby copies of state stores on other instances, which cuts failover time from minutes to seconds.
Production Architecture
Deploy Kafka Streams as a normal JVM application. Put it behind a load balancer, run it in Kubernetes, whatever the team does for JVM services. Each instance is stateless from a deployment perspective. State lives in RocksDB locally and gets backed up to changelog topics. One strong recommendation: use Kubernetes StatefulSets with persistent volumes for state store directories. Without persistent volumes, every pod reschedule triggers a full state restoration, and that can take a long time.
Set num.stream.threads to roughly match the number of CPU cores per instance. Each thread processes a subset of tasks. For critical applications, set num.standby.replicas=1 so warm state replicas are ready to go. For exactly-once semantics, use processing.guarantee=exactly_once_v2 (requires Kafka 2.5+). The v2 mode uses one producer per thread instead of one per task, which dramatically cuts down the number of transactional producers in play.
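Pulling those settings together, a sketch of the relevant configuration. The application id, broker address, thread count, and state directory are illustrative assumptions.

```java
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public final class StreamsProps {
    // Sketch of the settings discussed above; pass the result to new KafkaStreams(topology, props).
    public static Properties production() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");        // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);                 // ~ CPU cores per instance
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);               // warm state replicas
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // brokers 2.5+
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");   // put this on a persistent volume
        return props;
    }
}
```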
Interactive queries are an underused feature. Expose a REST API that queries local state stores directly. Each instance knows which instance owns which key (via KafkaStreams.metadataForKey()), so a query router can fan out requests to the right instance. This turns the Kafka Streams app into a queryable materialized view. It is not a replacement for a real database, but for lookups on data already being processed, it saves writing that data to an external store and querying it there.
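A sketch of that routing step, using queryMetadataForKey() (the newer replacement for metadataForKey(), with post-2.7 accessor names). The store name, key type, and HTTP layer are assumed, and each instance must advertise itself via application.server for the host metadata to be meaningful.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.HostInfo;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CountLookup {
    private final KafkaStreams streams;
    private final HostInfo self;   // this instance's application.server host:port

    public CountLookup(KafkaStreams streams, HostInfo self) {
        this.streams = streams;
        this.self = self;
    }

    // Answer locally if this instance owns the key's partition; otherwise tell the
    // caller which instance the HTTP request should be proxied to.
    public String lookup(String storeName, String key) {
        KeyQueryMetadata meta = streams.queryMetadataForKey(storeName, key, Serdes.String().serializer());
        if (!self.equals(meta.activeHost())) {
            return "forward to " + meta.activeHost().host() + ":" + meta.activeHost().port();
        }
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(storeName, QueryableStoreTypes.keyValueStore()));
        Long count = store.get(key);
        return count == null ? "not found" : count.toString();
    }
}
```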
Monitor these metrics: process-rate (records processed per second), process-latency (per-record processing time), commit-rate (transaction commits per second), poll-rate (consumer poll frequency), state-store-put/get-rate, and rebalance-latency. The one to alert on is consumer-lag per input partition. If it exceeds the SLA threshold, the application is falling behind.
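Most teams export these through JMX or a metrics reporter, but they can also be read straight off the running instance via KafkaStreams.metrics(). A small sketch; exact metric names and tags vary somewhat by version, so treat the watched set as illustrative.

```java
import org.apache.kafka.streams.KafkaStreams;

import java.util.Set;

public class MetricsProbe {
    // Illustrative subset of the metric names discussed above.
    private static final Set<String> WATCHED =
            Set.of("process-rate", "process-latency-avg", "commit-rate", "records-lag");

    // Print the watched metrics with their tags (thread, task, topic-partition, ...).
    public static void dump(KafkaStreams streams) {
        streams.metrics().forEach((name, metric) -> {
            if (WATCHED.contains(name.name())) {
                System.out.printf("%s %s = %s%n", name.name(), name.tags(), metric.metricValue());
            }
        });
    }
}
```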
Capacity Planning
Each stream task processes records sequentially from one partition. If a single partition produces 10,000 records/sec and processing takes 1ms per record, one task handles 1,000 records/sec. That means 10 partitions (and 10 tasks) are needed to keep up. The math: required_partitions = throughput / (1000 / processing_latency_ms). Simple, but people still get it wrong by underestimating processing latency under load.
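The same formula as a tiny helper, useful for sanity-checking against latency measured under load rather than on an idle system:

```java
public final class PartitionSizing {
    // required_partitions = throughput / (1000 / processing_latency_ms), rounded up.
    public static int requiredPartitions(double recordsPerSecond, double processingLatencyMs) {
        double perTaskThroughput = 1000.0 / processingLatencyMs;   // records/sec one task can sustain
        return (int) Math.ceil(recordsPerSecond / perTaskThroughput);
    }

    public static void main(String[] args) {
        // 10,000 records/sec at 1ms per record -> 10 partitions (and 10 tasks)
        System.out.println(requiredPartitions(10_000, 1.0));
    }
}
```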
Memory planning is where teams mess up most often. The requirements include JVM heap (2-8GB is typical) plus RocksDB off-heap (block_cache_size * num_state_stores * num_tasks_per_instance). For an instance running 4 tasks, each with 3 state stores at 64MB block cache: 4 * 3 * 64MB = 768MB off-heap. Total memory: 4GB heap + 768MB off-heap + 500MB overhead = roughly 5.3GB. Budget more than expected. RocksDB memory usage is hard to observe through standard JVM tooling, and OOM kills from off-heap growth are a common production surprise.
State store sizing follows from the data. The changelog topic retains the latest value per key (it is a compacted topic). If a KTable holds 100 million keys with 500-byte values, that is about 50GB of total state. Spread across 12 instances, each holds about 4.2GB. Restoration time: at 50MB/s consumer throughput, restoring 4.2GB takes roughly 84 seconds. With num.standby.replicas=1, failover happens in under 5 seconds because the standby is already caught up. The tradeoff is double the disk usage across the cluster, which is usually worth it.
Failure Scenarios
Scenario 1: Instance restart triggers 30-minute state restoration. An instance with 50GB of accumulated state store data crashes. The replacement must replay the entire changelog topic to rebuild RocksDB. At 30MB/s consumer throughput (bottlenecked by disk writes), restoration takes 28 minutes. During that time, the tasks assigned to this instance make no progress and consumer lag grows. Detection: watch restore-rate and restore-remaining metrics, then alert when restoration exceeds the SLA. Fix: configure num.standby.replicas=1 to keep a warm copy. When the primary fails, the standby takes over in seconds with no restoration needed. Alternatively, use persistent volumes in Kubernetes so the RocksDB data survives pod restarts. For any state store larger than 1GB, always run at least 1 standby replica.
Scenario 2: Co-partitioning violation causes silent incorrect join results. Someone adds a Kafka Streams join between an orders topic (24 partitions) and a users topic (12 partitions). Kafka Streams validates co-partitioning at startup and throws a TopologyException, so that gets caught quickly. The subtle version is worse: both topics have 24 partitions, but one was produced with a custom partitioner while the other uses the default. Records for the same key land on different partitions in each topic, so the join never matches them. The application runs without errors, but join results are empty or incomplete. I have seen this take weeks to catch. Detection: write integration tests that verify join output for known key pairs, and track join-hit-rate as a custom metric. Fix: force one side through a repartition() operation (through() on older versions) so Kafka Streams rewrites it with a consistent partitioner and matching keys end up on the same partition number. This kind of bug often traces back to a legacy topic using a hash-modulo partitioner that does not match Kafka's default Murmur2 partitioner.
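A sketch of that fix with repartition(): the orders stream is forced through an internal repartition topic written by Kafka Streams itself, so its keys line up with the default-partitioned users topic before the join. Topic names, serdes, and the partition count are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Repartitioned;

public final class RealignedJoin {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // "orders" was written by a legacy producer with a custom partitioner, so its keys do not
        // line up with the default-partitioned "users" topic even at equal partition counts.
        KStream<String, String> orders = builder.stream("orders");
        KTable<String, String> users = builder.table("users");

        // Force orders through an internal repartition topic. Kafka Streams writes it with its
        // default (murmur2-based) partitioning, so matching keys now share partition numbers.
        KStream<String, String> realignedOrders = orders.repartition(
                Repartitioned.<String, String>with(Serdes.String(), Serdes.String())
                        .withName("orders-realigned")
                        .withNumberOfPartitions(24));

        realignedOrders
                .join(users,
                        (order, user) -> order + "|" + user,
                        Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
                .to("orders-enriched");

        return builder.build();
    }
}
```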
Pros
- • No separate cluster needed. It runs as a library inside your app.
- • Exactly-once processing semantics
- • Elastic scaling via Kafka consumer groups
- • Interactive queries on local state stores
- • Simple deployment (just a JVM application)
Cons
- • Locked to Kafka for both input and output
- • JVM only (Java/Kotlin/Scala)
- • Limited to Kafka's partitioning model
- • State stores can grow large on disk
- • Less capable than Flink for complex processing
When to use
- • You already run Kafka and need straightforward stream processing
- • You want to skip managing a separate processing cluster
- • Building event-driven microservices
- • You need exactly-once guarantees inside the Kafka ecosystem
When NOT to use
- • Processing data from non-Kafka sources
- • You need advanced CEP or event-time processing
- • Your team runs Python or another non-JVM stack
- • Complex multi-stream joins and windowing beyond what the Kafka Streams DSL handles well
Key Points
- •Library, not a cluster. Kafka Streams runs inside the JVM process and scales through Kafka consumer groups. There is zero infrastructure to manage beyond the app itself.
- •Exactly-once processing works through Kafka transactions. The read-process-write cycle is atomic: producer transactions and consumer offset commits happen in a single transaction.
- •RocksDB-backed state stores provide fast local key-value access for aggregations, joins, and windowing. Data sits on the local disk of each instance.
- •Changelog topics replicate every state store mutation back to Kafka. When an instance restarts, it replays the changelog to rebuild state. That is the fault tolerance mechanism.
- •KTable and KStream duality matters. A KStream is a record stream (insert semantics). A KTable is a changelog stream (upsert semantics). Each can be converted into the other, as the sketch after this list shows.
- •Co-partitioning is mandatory for joins. Both input topics must share the same partition count and the same partitioning strategy. A count mismatch fails fast at startup; a partitioner mismatch silently produces wrong results.
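A sketch of that KStream/KTable duality in DSL terms (topic names are made up): counting a stream yields a table, a table's updates can be streamed back out, and a stream can be re-read as a table.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public final class DualityExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // KStream: insert semantics, every record is an independent event.
        KStream<String, String> clicks = builder.stream("page-clicks");

        // Aggregating a stream produces a KTable: upsert semantics, latest count per key wins.
        KTable<String, Long> clicksPerUser = clicks.groupByKey().count();

        // A KTable's changes can be read back out as a stream of updates...
        KStream<String, Long> updates = clicksPerUser.toStream();

        // ...and a KStream can be interpreted as a table holding the latest value per key.
        KTable<String, String> latestClickPerUser = clicks.toTable();

        System.out.println(builder.build().describe());
    }
}
```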
Common Mistakes
- ✗Not sizing RocksDB state stores. Each one uses off-heap memory for block cache and memtable. Ten state stores with the default 50MB block cache each will eat 500MB of off-heap memory before counting write buffers.
- ✗Rebalancing delays with many tasks. Each partition creates a stream task. With 500 partitions across 3 topics, that means 500 tasks, and rebalancing takes minutes while state stores restore.
- ✗Ignoring the co-partitioning requirement for joins. Kafka Streams catches mismatched partition counts at startup, but mismatched partitioners slip through and silently produce wrong results.
- ✗Forgetting about restore time after restart. State store restoration replays the entire changelog topic. With 100GB of state, expect 30+ minutes of the instance sitting idle.
- ✗Leaving the default commit interval at 30s without thinking about it. A crash loses up to 30 seconds of uncommitted progress. Drop it to 100ms for low-latency workloads, but expect lower throughput.