Kafka
The distributed commit log that became the backbone of event streaming
How It Works Internally
Kafka is a distributed append-only commit log. That single idea explains almost every design decision in the system.
Each topic splits into partitions. Each partition is an ordered, immutable sequence of records stored on disk. Producers append records to the tail, and every record gets a monotonically increasing 64-bit offset. On disk, a partition is a series of segment files (1GB each by default), with a sparse index file that maps offsets to physical file positions, so locating any offset is a cheap index lookup plus a short scan rather than a read from the beginning of the log. There's also a time index that maps timestamps to offsets, which is how consumers can seek to "give me everything from the last 3 hours" without scanning from the beginning.
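The time index is exposed directly through the consumer API. A minimal sketch of seeking every partition to a point three hours back, assuming a placeholder topic, group id, and broker address:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SeekByTime {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "replay-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            long threeHoursAgo = Instant.now().minus(Duration.ofHours(3)).toEpochMilli();

            // One timestamp query per partition of the topic.
            Map<TopicPartition, Long> query = new HashMap<>();
            for (PartitionInfo p : consumer.partitionsFor("events")) {
                query.put(new TopicPartition(p.topic(), p.partition()), threeHoursAgo);
            }

            // The broker answers from the time index: for each partition, the
            // earliest offset whose timestamp is >= the requested timestamp.
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);

            consumer.assign(offsets.keySet());
            offsets.forEach((tp, oat) -> {
                if (oat != null) consumer.seek(tp, oat.offset());
            });
            // poll() from here on returns only records from the last ~3 hours.
        }
    }
}
```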
The Write Path
The entire write path is optimized for sequential I/O. Producers batch records by partition and send them in a single network request. Brokers append the batch to the segment file as-is: no deserialization, no re-serialization, no per-record processing. On the read path, zero-copy transfer via sendfile() moves data from the OS page cache straight to the network socket without copying between kernel and user space, and the page cache itself handles read buffering. This is how Kafka hits 800MB/s+ per broker on modern hardware with commodity SSDs.
The producer batching configuration matters more than people realize. linger.ms controls how long the producer waits to accumulate records before sending a batch (default 0, meaning send immediately). batch.size sets the maximum batch size in bytes (default 16KB). For throughput-sensitive workloads, bump linger.ms to 5-20ms and batch.size to 256KB-1MB. The latency cost is bounded by the linger time itself, a few milliseconds to tens of milliseconds. The throughput gain can be 5-10x.
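As a starting point, a throughput-oriented producer might look like the sketch below. The broker address, topic, and exact values are placeholders to benchmark against the actual workload, not recommendations:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ThroughputProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Batching: wait up to 10ms to fill batches of up to 512KB.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");       // default 0
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "524288");  // default 16384

        // Compression multiplies effective network and storage capacity.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "user-42", "payload"));
        }
    }
}
```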
ISR and Durability
The ISR (In-Sync Replica) mechanism governs durability. Each partition has a leader and N-1 followers. A follower stays in the ISR as long as it has fetched all messages within replica.lag.time.max.ms (default 30 seconds). When a follower falls behind, it gets kicked out of the ISR. When it catches up, it's added back.
The producer acks setting controls the durability guarantees: acks=0 (fire-and-forget, producer doesn't wait for any acknowledgment), acks=1 (leader writes to its local log and acknowledges, followers might not have it yet), or acks=all (all ISR replicas must acknowledge before the producer gets a success response). With acks=all and min.insync.replicas=2, strong durability holds even if one replica goes down. If the ISR shrinks below min.insync.replicas, the broker rejects writes with a NotEnoughReplicas exception. This is by design. It's better to reject writes than to silently lose data.
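A sketch of that durable-by-default combination, using the AdminClient to create a topic with RF=3 and min.insync.replicas=2; the topic name, partition count, and broker address are placeholders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class DurableTopic {
    public static void main(String[] args) throws Exception {
        Properties admin = new Properties();
        admin.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (Admin client = Admin.create(admin)) {
            // RF=3 plus min.insync.replicas=2: tolerates one replica loss.
            NewTopic topic = new NewTopic("payments", 30, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            client.createTopics(List.of(topic)).all().get();
        }

        // Producer side: wait for all in-sync replicas. With idempotence
        // (the default since 3.0), retried batches are also deduplicated.
        Properties producer = new Properties();
        producer.put(ProducerConfig.ACKS_CONFIG, "all");
        producer.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    }
}
```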
KRaft: ZooKeeper Is Gone
Kafka 4.0 removed ZooKeeper entirely. There is no migration path in the 4.x line itself: ZooKeeper-based deployments must migrate to KRaft on a 3.x release before upgrading.
KRaft (Kafka Raft) moves all metadata management into the Kafka process itself. A subset of brokers (or dedicated nodes) form a controller quorum that uses the Raft consensus protocol to elect a leader and replicate metadata. The metadata includes topic configurations, partition assignments, broker registrations, ACLs, and quotas. All of this used to live in ZooKeeper znodes. Now it lives in an internal __cluster_metadata topic that the controllers manage.
The practical benefits are significant. There is no longer a need to deploy, monitor, and upgrade a separate ZooKeeper ensemble. Partition leader elections are faster (milliseconds, instead of the seconds a ZooKeeper session timeout could add). Metadata propagation to brokers happens through a push model instead of ZooKeeper watches, which scales better as the cluster grows. Controller failover is faster because the new controller already has the full metadata log in memory.
For small clusters (under 10 brokers), run combined-mode where every broker is also a controller. For larger clusters, dedicate 3 or 5 nodes as controllers and keep them off the produce/fetch hot path. Controller nodes need fast disks for the metadata log but don't need much CPU or memory.
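A hedged sketch of a combined-mode node's server.properties; host names, IDs, and paths are placeholders, and dedicated controllers would set process.roles=controller instead:

```properties
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka1:9093,2@kafka2:9093,3@kafka3:9093
listeners=PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT
log.dirs=/var/lib/kafka/data
```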
Tiered Storage
Before tiered storage, retention was limited by how much disk could be attached to each broker. Want 90 days of retention at 500MB/s ingest? That's roughly 3.9 petabytes of raw data, and nearly 12 petabytes of disk at replication factor 3. Expensive.
Tiered storage (KIP-405, early access in Kafka 3.6 and GA since 3.9) splits storage into two tiers. The local tier keeps recent segments on broker disks for low-latency reads. The remote tier offloads completed (sealed) segments to object storage like S3, GCS, or Azure Blob Storage. When a consumer reads old data, the broker fetches it from object storage transparently. The consumer doesn't know or care which tier served the response.
Configuration is per-topic. Set remote.storage.enable=true and local.retention.ms to control how long data stays on local disk before becoming remote-only. A common pattern: 24 hours local, 90 days remote. The brokers need enough local disk for the hot window, and the object storage bill covers the rest.
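Assuming the brokers already run a RemoteStorageManager plugin and have remote.log.storage.system.enable=true set, enabling tiering on one topic is a single config change. A sketch with a placeholder topic name (24 hours local, 90 days total retention):

```bash
bin/kafka-configs.sh --bootstrap-server broker1:9092 \
  --entity-type topics --entity-name events --alter \
  --add-config remote.storage.enable=true,local.retention.ms=86400000,retention.ms=7776000000
```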
The catch: remote reads are slower. S3 first-byte latency is 50-200ms compared to sub-millisecond for local NVMe. If consumers routinely read data older than the local retention window (backfills, reprocessing jobs, new consumer groups starting from earliest), factor in the latency hit and the S3 GET request costs.
Consumer Groups and Rebalancing
Consumer groups handle partition assignment. A group coordinator broker manages membership. When consumers join or leave, a rebalance kicks in. The Eager protocol revokes all partitions and reassigns them (a stop-the-world pause), while CooperativeSticky performs incremental rebalancing, only moving partitions that actually need to change.
For any service where availability matters, use CooperativeSticky. The Eager protocol will cause problems in production. A rolling deployment of 10 consumer instances triggers 10 full rebalances. With CooperativeSticky, each restart only reassigns the partitions that belonged to the restarting instance. The rest keep processing without interruption.
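Opting in on the classic protocol is one consumer setting; a minimal sketch with placeholder group and broker values:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import java.util.Properties;

public class CooperativeConsumerConfig {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "checkout-service");
        // Incremental rebalancing: only partitions that actually move are revoked.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return props;
    }
}
```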
Kafka 4.0 also introduced a new server-side consumer group protocol (KIP-848) that moves partition assignment logic from the consumer client to the group coordinator broker. This makes rebalancing faster and more predictable because the broker has a global view of all members and partitions. On a 4.0+ cluster, an up-to-date client opts in by setting group.protocol=consumer; no assignor configuration is needed because assignment happens broker-side.
Share Groups
Share groups (KIP-932) are one of the bigger additions to Kafka's consumption model. Traditional consumer groups assign each partition to exactly one consumer. That's great for ordered processing, but it's awkward for queue-style semantics where any available worker picks up the next message.
Share groups let multiple consumers read from the same partition concurrently. Records are delivered to individual consumers, acknowledged independently, and automatically redelivered if a consumer fails to acknowledge within the timeout. This is conceptually similar to how SQS or RabbitMQ work, but built on top of Kafka's durable log.
When to use share groups: workloads where processing order doesn't matter and the goal is to scale consumers independently of partition count. Think notification dispatch, image processing pipelines, or batch job distribution. When not to use them: anything requiring strict per-key ordering. Consumer groups are still the right model for that.
Exactly-Once Semantics
Exactly-once in Kafka requires three pieces:
Idempotent producers (enable.idempotence=true, which is the default since Kafka 3.0). Each producer gets a unique ID and sequence numbers per partition. The broker deduplicates retried batches by checking the sequence number. This prevents duplicates from network retries but doesn't help with application-level retries.
Transactions extend idempotence across multiple partitions and topics. A producer can atomically write to partitions A, B, and C, and either all writes commit or none do. The transaction coordinator (a broker-side component) manages the two-phase commit protocol. Common use case: read from input topic, process, write to output topic, and commit the consumer offset, all in one atomic operation. This is the foundation of exactly-once stream processing in Kafka Streams.
Consumer isolation (isolation.level=read_committed). Consumers with this setting skip records that are part of uncommitted or aborted transactions. Without it, consumers see everything including records that will eventually be rolled back.
The performance cost of transactions is real but manageable. Expect 5-15% throughput reduction compared to non-transactional produces, mostly from the extra round trips for transaction coordination. Worth it when correctness matters.
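Putting the three pieces together, a sketch of the transactional read-process-write loop. Topic names, group id, and the transactional.id are placeholders, and the "processing" is a stand-in:

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ExactlyOnceBridge {
    public static void main(String[] args) {
        Properties cprops = new Properties();
        cprops.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        cprops.put(ConsumerConfig.GROUP_ID_CONFIG, "eos-bridge");
        cprops.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cprops.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cprops.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        cprops.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pprops = new Properties();
        pprops.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // A stable transactional.id per producer instance enables fencing.
        pprops.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "eos-bridge-1");
        pprops.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        pprops.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pprops)) {
            consumer.subscribe(List.of("input"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> rec : records) {
                        // Process the record here, then write the result.
                        producer.send(new ProducerRecord<>("output", rec.key(), rec.value()));
                        offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                                new OffsetAndMetadata(rec.offset() + 1));
                    }
                    // Consumer offsets commit inside the same transaction as the writes.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // fatal errors would require closing instead
                }
            }
        }
    }
}
```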
Schema Registry
Schema Registry is not part of Kafka itself, but it's essential for any serious deployment. It stores Avro, Protobuf, or JSON Schema definitions and enforces compatibility rules (backward, forward, full) when producers register new schema versions.
Without Schema Registry, producers and consumers negotiate schema changes through hope and Slack messages. Producer team adds a field, consumer team doesn't know about it, deserialization breaks at 2am. With Schema Registry, the compatibility check happens at registration time. A breaking change gets rejected before it reaches a single consumer.
The wire format prepends a 5-byte header (1 magic byte + 4-byte schema ID) to every message. The consumer fetches the schema by ID from the registry, caches it locally, and deserializes. Overhead is negligible.
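For illustration, decoding that header is a few lines. This sketch assumes the standard framing described above:

```java
import java.nio.ByteBuffer;

public class WireFormat {
    // 1 magic byte (0) + 4-byte big-endian schema ID, then the payload.
    static int schemaId(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        byte magic = buf.get();
        if (magic != 0) throw new IllegalArgumentException("Unknown magic byte: " + magic);
        return buf.getInt(); // ID to look up (and cache) from the registry
    }
}
```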
Production Architecture
Broker Topology
Start with a minimum of 3 brokers, replication factor 3, and min.insync.replicas=2. This setup tolerates one broker failure without data loss or downtime. For KRaft mode, run 3 or 5 controller nodes. In smaller clusters, controllers and brokers can share nodes (combined mode). In larger clusters (20+ brokers), dedicate separate controller nodes.
Place brokers across availability zones with rack-awareness enabled (broker.rack). This ensures replicas span failure domains. A full AZ outage won't cause data loss if the replication factor is 3 across 3 AZs.
Security
Kafka security has three layers.
- Authentication: SASL/SCRAM, SASL/OAUTHBEARER (for token-based auth), or mTLS client certificates. mTLS is the strongest option and avoids managing passwords, but certificate rotation adds operational complexity.
- Encryption: TLS for data in transit. Configure ssl.protocol=TLSv1.3 and disable older versions.
- Authorization: ACLs (Access Control Lists) define which principals can produce, consume, or administer which topics. Use prefix-based ACLs (--resource-pattern-type prefixed) to avoid maintaining individual ACLs for hundreds of topics. For more complex policies, integrate with a policy engine like OPA.
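On the client side, SCRAM over TLS comes down to a handful of properties. A hedged sketch with placeholder credentials and paths:

```properties
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="svc-checkout" password="change-me";
ssl.truststore.location=/etc/kafka/truststore.jks
ssl.truststore.password=change-me
```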
Multi-Datacenter with MirrorMaker 2
MirrorMaker 2 (MM2) replicates topics between Kafka clusters across datacenters or cloud regions. It runs as a set of Kafka Connect connectors, which means it inherits Connect's fault tolerance and scaling model, and it can use Connect's exactly-once source support where that is enabled.
MM2 preserves consumer offsets across clusters through offset translation. A consumer failing over from the primary to the DR cluster can resume from approximately where it left off, not from the beginning.
Key configuration: set replication.factor=3 on the MM2 internal topics, tune producer.buffer.memory and producer.batch.size for the WAN bandwidth, and monitor replication lag through the kafka.connect.mirror:type=MirrorSourceConnector MBeans (replication-latency-ms and friends). Acceptable replication lag depends on the RPO, but most teams target under 30 seconds for disaster recovery.
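A hedged mm2.properties sketch for one-way primary-to-DR replication; cluster aliases, bootstrap servers, and topic patterns are placeholders:

```properties
clusters = primary, dr
primary.bootstrap.servers = kafka-primary:9092
dr.bootstrap.servers = kafka-dr:9092

primary->dr.enabled = true
primary->dr.topics = orders.*, payments.*
replication.factor = 3

# Offset translation for consumer failover.
emit.checkpoints.enabled = true
sync.group.offsets.enabled = true
```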
Kafka Connect
Kafka Connect is the integration framework for getting data in and out of Kafka. Source connectors ingest data from external systems (Debezium for CDC, JDBC source for databases, S3 source for files). Sink connectors push data to external systems (Elasticsearch, S3, BigQuery, Snowflake).
Run Connect in distributed mode with at least 3 workers. Use the exactly.once.source.support=enabled setting (available since Kafka 3.3) for source connectors that support it. This prevents duplicate records when a Connect worker fails and tasks get rebalanced.
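A distributed-worker sketch (connect-distributed.properties) with placeholder topic names and group id:

```properties
bootstrap.servers=broker1:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
# Opt into exactly-once for source connectors that support it (Kafka 3.3+).
exactly.once.source.support=enabled
```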
The metrics that matter: source-record-poll-rate, sink-record-send-rate, total-record-errors (in the task-error-metrics group), and offset-commit-completion-rate. A failing connector that nobody notices is worse than no connector at all.
Capacity Planning
Partitions
Each partition consumes roughly 10MB of memory on the broker for index caches and in-flight produce/fetch buffers. A broker can handle 4,000 to 10,000 partitions, but recovery time after a broker failure scales linearly with partition count. At 10K partitions, leader election takes 10 to 30 seconds. For most workloads, target 30 to 50 partitions per topic. Start with fewer and scale up. Partitions can be added to an existing topic, but they cannot be reduced without creating a new topic and migrating.
The right number of partitions depends on the target throughput. A single partition can sustain roughly 10-50MB/s of produce throughput (hardware dependent). If a topic needs 200MB/s, that means at least 4-20 partitions just for throughput. Factor in consumer parallelism: there cannot be more active consumers in a group than partitions, so partition count also sets the maximum consumer concurrency.
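That reasoning is easy to encode. A tiny helper, where the per-partition throughput constant is an assumption to benchmark, not a Kafka limit:

```java
public class PartitionSizing {
    static int partitionsFor(double targetMBps, double perPartitionMBps, int maxConsumers) {
        // Enough partitions for throughput, but never fewer than the
        // number of consumer instances that must run in parallel.
        int forThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(forThroughput, maxConsumers);
    }

    public static void main(String[] args) {
        // 200MB/s target at ~10MB/s per partition, 16 consumer instances:
        System.out.println(partitionsFor(200, 10, 16)); // 20
    }
}
```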
Storage
Storage math is straightforward: throughput_MB/s * retention_seconds * replication_factor. A topic ingesting 100MB/s with 7-day retention and RF=3 needs about 181TB of disk. That number surprises people.
With tiered storage, the math changes dramatically, partly because the remote tier stores a single copy of each segment and delegates durability to the object store. With 24 hours local and 90 days remote, the broker disks only need 100MB/s * 86400s * 3 = ~26TB. The 90-day remote window is roughly 778TB stored once, about $18,000/month at S3 Standard's $0.023/GB/month, versus the $180,000+ that keeping the same 90 days on replicated EBS gp3 volumes (~2.3PB) would cost.
Network
Network bandwidth is often the real bottleneck, not disk. A 10Gbps NIC saturates at roughly 1GB/s, and with replication, the effective producer throughput is NIC_bandwidth / replication_factor. With RF=3, one 10Gbps NIC provides about 330MB/s of producer throughput. At that point, either add NICs, move to 25Gbps networking, or use compression.
Use compression. lz4 for speed, zstd for better ratios. Compression happens at the producer and decompresses at the consumer. Brokers keep data compressed on disk and during replication (zero-copy). In my experience, lz4 is the right default; switch to zstd when the storage bill is significant and the producers can tolerate slightly higher CPU usage. Compression ratios of 4-8x are typical for JSON payloads, which effectively multiplies network and storage capacity.
Broker Sizing
A reasonable starting point for a production broker: 8-16 vCPUs, 32-64GB RAM (Kafka uses the OS page cache aggressively, so more RAM means more cached data), 2-4TB NVMe or gp3 SSD per broker, and 10Gbps+ networking. Allocate no more than 6GB to the JVM heap. The rest goes to page cache, which is what actually makes reads fast.
Failure Scenarios
Scenario 1: Broker dies with acks=1 configured. The producer gets an acknowledgment from the leader, but the record hasn't replicated to followers yet. If the leader dies before replication completes, that record is gone. Detect this by monitoring UnderReplicatedPartitions and producer error rates. The fix: switch to acks=all with min.insync.replicas=2. This is the only sane configuration for any data that matters. If the small throughput hit from acks=all seems concerning, benchmark it first. In practice, the difference is smaller than most people expect, because follower fetches are pipelined and the bottleneck is usually network or disk, not the extra acknowledgment wait.
Scenario 2: Consumer group enters an infinite rebalance loop. A slow consumer exceeds max.poll.interval.ms (default 5 minutes), gets kicked from the group, rejoins, and the cycle repeats. The group makes zero progress. The join-rate and sync-rate metrics on the consumers will spike, and consumer lag will grow monotonically. To fix it, increase max.poll.interval.ms, reduce max.poll.records, or (most likely) fix the slow processing path, as sketched below. Nine times out of ten, the root cause is a downstream call that occasionally hangs. Add per-record processing timeouts and use async offset commits, and the problem goes away. On Kafka 4.0+, the new server-side consumer protocol also reduces the blast radius of individual consumer slowdowns because the broker can reassign partitions without triggering a full group rebalance.
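A minimal sketch of those mitigations together; the timeouts, group, and topic are placeholders and process() is a stand-in for the downstream call:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.*;

public class BoundedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "notifications");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // default 500
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // 10 min
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService worker = Executors.newSingleThreadExecutor();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    Future<?> f = worker.submit(() -> process(rec));
                    try {
                        f.get(5, TimeUnit.SECONDS); // per-record processing timeout
                    } catch (TimeoutException e) {
                        f.cancel(true); // skip or dead-letter instead of stalling poll()
                    } catch (Exception e) {
                        // handle or dead-letter the failed record
                    }
                }
                consumer.commitAsync(); // don't block the poll loop on commits
            }
        }
    }

    static void process(ConsumerRecord<String, String> rec) { /* downstream call */ }
}
```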
Scenario 3: Tiered storage retrieval failures during consumer backfill. A new consumer group starts reading from the beginning of a topic with 60 days of retention, but only 24 hours live on local disk. The remaining 59 days are in S3. The backfill triggers thousands of S3 GET requests per second. If the bucket is in a different region, latency is 100-200ms per request, and the consumer's fetch throughput drops from 500MB/s (local) to 20-50MB/s (remote). Meanwhile, S3 request costs add up: at 10,000 GET requests per second and $0.0004 per 1,000 requests, that's roughly $14/hour.

Detection: monitor remote-fetch-bytes-per-sec, remote-fetch-requests-per-sec, and remote-fetch-errors-per-sec on the broker, and watch the cloud provider's object storage metrics for throttling (S3 returns 503 SlowDown if per-prefix request rates are exceeded).

Recovery: if a backfill is coming, increase local.retention.ms before the data ages out so more of the window stays on local disk (raising it won't pull already-tiered segments back). Use the S3 request rate guidelines (3,500 PUT/5,500 GET per prefix per second) to estimate whether the object storage prefix structure needs sharding. For recurring backfill patterns, consider a separate "backfill" consumer group with rate limiting configured at the application level to avoid hammering remote storage.
Partition Scaling and Repartitioning
This is one of those topics that catches people off guard because the operation itself is trivial. Running kafka-topics.sh --alter --partitions 60 takes a second. Dealing with the consequences can take weeks.
Adding Partitions
When the partition count on a topic increases, Kafka creates the new partitions empty. No data moves. Existing records stay in their original partitions. New records get assigned to partitions using the updated partition count in the hash function (murmur2(key) % new_partition_count).
Here's the problem. If the topic uses keyed messages (and most important topics do), the key-to-partition mapping changes for a significant percentage of keys. A key that previously mapped to partition 3 might now map to partition 47. Any consumer that relies on all events for a given key arriving in order on the same partition will start seeing events split across old and new partitions.
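To see the remapping concretely, here's a small demonstration using Kafka's own hash utilities; the keys and partition counts are arbitrary:

```java
import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class KeyRemap {
    // The default partitioner's key-to-partition mapping.
    static int partition(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // Most keys land on a different partition after the count changes.
        for (String key : new String[]{"user-1", "user-2", "user-3"}) {
            System.out.printf("%s: %d -> %d%n", key, partition(key, 30), partition(key, 60));
        }
    }
}
```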
For unkeyed topics (round-robin distribution, like log aggregation), adding partitions is straightforward. No ordering contract exists, so nothing breaks. The result is simply more parallelism.
For keyed topics, adding partitions is effectively a repartitioning event. The impact depends on what sits downstream:
- Stateless consumers usually handle it fine after a rebalance. They don't care about ordering across the boundary.
- Kafka Streams applications are the worst hit. Streams builds local state stores (RocksDB) keyed by the partition assignment. When partitions change, the state stores become invalid because the key-to-partition mapping no longer matches. The application has to rebuild state from the changelog topic, which can take hours for large state stores. I've seen teams lose an entire day of processing because someone bumped partitions on a core topic without understanding the Streams dependency.
- Consumers doing session aggregation or windowed processing will produce incorrect results for any sessions that span the repartition boundary, because events for the same key are now split across partitions.
Decreasing Partitions
Not possible. Kafka does not support reducing the partition count on an existing topic. The data model makes it impossible without losing or rewriting data.
If fewer partitions are genuinely needed (maybe the topic was over-provisioned and the metadata overhead is a problem), the process is:
- Create a new topic with the desired partition count
- Start a mirror process (MirrorMaker 2, Kafka Streams app, or a simple consumer-producer bridge) that reads from the old topic and writes to the new one
- Cut over producers to the new topic
- Wait for consumers to drain the old topic
- Cut over consumers to the new topic
- Delete the old topic
This is disruptive. Plan for it carefully and schedule it during a maintenance window.
How to Avoid the Pain
The best repartitioning is the one that never has to happen. Some strategies:
Over-provision slightly from the start. A topic with 60 partitions costs about 600MB of extra broker memory compared to 6 partitions. That's nothing. But going from 6 to 60 partitions on a keyed topic in production is painful. I generally start with 3x the number of partitions I think I'll need for the next 12 months, with a floor of 12 for any topic that carries keyed data.
Use the throughput formula to project forward. partitions = max(target_throughput / per_partition_throughput, max_consumer_instances). If a topic does 50MB/s today and 4x growth is expected over two years, size for 200MB/s now. At 10MB/s per partition, that's 20 partitions. Round up to 30 for headroom.
Design partition keys for stability. Use identifiers that don't change (user ID, account ID, device ID), not derived attributes (region, plan tier) that might shift. If the partition key can change, there is already an ordering problem regardless of whether partitions get added.
Consider a repartitioning topic pattern. For Kafka Streams applications, introduce an intermediate topic with the new partition count and use repartition() (which replaced the now-deprecated through()) to explicitly redirect data. This gives control over the migration instead of having it happen implicitly.
Monitor partition count across the cluster. Track total partitions per broker (kafka.server:type=ReplicaManager,name=PartitionCount). Set alerts at 4,000 per broker. When approaching the limit, it is time to add brokers, not squeeze more partitions onto existing ones.
Pros
- • Absurdly high throughput (millions of messages/sec on modest hardware) thanks to sequential I/O and zero-copy transfer
- • Durable message storage with configurable retention, from hours to forever
- • Scales horizontally by adding brokers and partitions without downtime
- • Strong ordering guarantee within partitions, which is enough for most use cases
- • Rich ecosystem: Connect for integrations, Streams for processing, Schema Registry for governance
- • KRaft mode eliminates ZooKeeper entirely, simplifying operations and reducing the component count
- • Tiered storage offloads old data to object storage, decoupling compute from long-term retention costs
Cons
- • Operationally heavy even with KRaft. You still need to understand brokers, partitions, consumer groups, ISR, and replication.
- • Wrong tool for low-latency request-reply patterns. If you need sub-millisecond RPC, look elsewhere.
- • Consumer group rebalancing can stall your entire pipeline if not configured carefully
- • No built-in message routing or filtering. Every consumer reads from a partition and filters client-side.
- • Steep learning curve, especially around offset management, exactly-once semantics, and partition key design
- • Partition count is hard to change after the fact. Repartitioning means re-keying all your data.
When to use
- • You need durable, ordered event streaming at scale
- • You're building event-driven, CQRS, or event-sourcing architectures
- • High-throughput log or data pipeline ingestion (100K+ events/sec)
- • Decoupling producers and consumers where replay capability matters
- • CDC pipelines that capture database changes and fan them out to downstream systems
When NOT to use
- • Simple task queues with low volume (use SQS or Redis streams instead)
- • You need complex message routing, priority queues, or dead-letter exchanges (RabbitMQ is better suited)
- • Request-reply messaging patterns where you need synchronous responses
- • Small team without the bandwidth to operate distributed infrastructure (consider a managed service or a simpler queue)
- • Message ordering across multiple partitions is a hard requirement (Kafka only guarantees order within a partition)
Key Points
- • Kafka is an append-only distributed commit log. Messages are immutable and ordered by offset within each partition. This single design choice is why Kafka can sustain millions of writes per second.
- • Ordering is guaranteed only within a single partition, not across partitions. The partition key design determines how data is distributed and whether related events stay together. Get this wrong and consumers will process events out of order.
- • KRaft mode (Kafka 4.0+) replaced ZooKeeper with an internal Raft-based metadata quorum. ZooKeeper is gone for good. The controller quorum runs inside the Kafka process itself, which cuts the operational footprint in half.
- • Tiered storage (early access in Kafka 3.6, GA since 3.9) offloads completed log segments to object storage like S3 or GCS. Retaining months or years of data becomes possible without paying for broker-local disk. Consumers seamlessly read from both local and remote tiers.
- • Exactly-once semantics require three things working together: idempotent producers (enable.idempotence=true), transactional APIs for atomic multi-partition writes, and read_committed isolation on consumers.
- • The ISR (In-Sync Replica) set determines durability. A broker falls out of ISR if it lags behind by replica.lag.time.max.ms (default 30s). With acks=all and min.insync.replicas=2, the cluster survives any single broker failure without data loss.
- • Share groups (KIP-932) bring queue-style consumption to Kafka. Unlike consumer groups where each partition is assigned to exactly one consumer, share groups let multiple consumers process records from the same partition independently. This bridges the gap between Kafka and traditional message queues.
- • Partitions can be added to a topic but cannot be removed. Adding partitions to a keyed topic changes which partition each key maps to, silently breaking ordering guarantees. Plan the partition count carefully before going to production.
Common Mistakes
- ✗ Creating too many partitions (>10K per broker). This increases end-to-end latency, metadata overhead, and recovery time after a broker failure. Start with fewer partitions and scale up, not the other way around.
- ✗ Not making consumers idempotent. At-least-once delivery means duplicates will happen during rebalances and retries. Design for it from day one, not as an afterthought.
- ✗ Using Kafka for request-reply patterns. Kafka is built for async streaming, not synchronous RPC. If the solution involves polling a response topic, it is the wrong tool.
- ✗ Ignoring consumer lag monitoring. Undetected lag leads to processing delays and, eventually, a dead consumer group nobody notices until the business reports stale data.
- ✗ Getting retention wrong. Unbounded retention fills disks (unless tiered storage is in play). Too-short retention drops data before slow consumers finish reading it.
- ✗ Not using Schema Registry for schema evolution. Without it, producers and consumers negotiate schema changes through hope and Slack messages. That breaks eventually.
- ✗ Treating tiered storage as free. Remote reads from S3 are 10-50x slower than local disk reads. If consumers routinely read old data, size the local retention to cover the hot read window.
- ✗ Running KRaft controllers on the same nodes as high-throughput brokers in large clusters. Controller metadata operations compete for I/O with produce/fetch requests. Dedicate controller nodes once the cluster passes 20 brokers.
- ✗ Adding partitions to a keyed topic without understanding the consequences. Existing keys will hash to different partitions, breaking ordering guarantees and corrupting Kafka Streams state stores. Plan the partition count upfront.