Message Queues & Event Streaming
Why It Exists
When service A calls service B synchronously, their fates are chained together. If B is slow, A is slow. If B goes down, A breaks. That's a bad deal.
Message queues and event streams stick a buffer in between. Producers throw messages into that buffer and move on with their lives. Consumers pull from it at their own pace. This one change buys a lot: it absorbs traffic spikes, lets services fail independently, and frees teams to deploy and scale their components on their own schedule. Anyone building distributed systems ends up here sooner or later.
How It Works
Queues vs Streams
Message queues (RabbitMQ, SQS) use a competing consumers model. Each message goes to exactly one consumer. Once acknowledged, it's gone. This is the right model for task distribution: job processing, email sending, order fulfillment. One task, one worker.
Event streams (Kafka, Pulsar) work differently. They append messages to an immutable, ordered log. Multiple consumer groups can read the same stream independently, each tracking their own offset. Messages stick around for a configurable period (hours, days, or forever), which enables replay. That's a big deal. It makes it possible to rebuild a search index, backfill a new data store, or replay production events to debug a tricky issue. Once a team has had replay, going back is unthinkable.
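To make the independent-consumer model concrete, here's a minimal sketch using the confluent-kafka Python client. The broker address, topic, and group names are placeholders, not anything these systems mandate:

```python
# Two consumer groups subscribed to the same topic. Kafka tracks a
# separate committed offset per group.id, so the search indexer and
# the analytics job each see every message, at their own pace.
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker address
        "group.id": group_id,                   # offsets are tracked per group
        "auto.offset.reset": "earliest",        # a new group replays from the start
    })

search_indexer = make_consumer("search-indexer")
analytics = make_consumer("analytics")
for consumer in (search_indexer, analytics):
    consumer.subscribe(["orders"])  # same topic, independent progress
```

Adding a third consumer group later (say, to backfill a new data store) is just another `group.id` reading from offset zero; nothing about the existing consumers changes.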
Delivery Semantics
| Guarantee | Behavior | Implementation |
|---|---|---|
| At-most-once | Message may be lost, never duplicated | Fire and forget, no ack |
| At-least-once | Message is never lost, may be duplicated | Ack after processing, retry on failure |
| Exactly-once | Message processed exactly once | Idempotent consumers + transactional produce/consume |
People love to debate exactly-once. In practice, getting there means combining idempotent producers (Kafka assigns sequence numbers to catch duplicates), transactional writes (atomic produce + offset commit), and idempotent consumers (deduplication by message ID or idempotent database operations like upserts). It works, but it's not free. Every layer adds latency and complexity. Most systems do just fine with at-least-once plus idempotent consumers.
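Here's a sketch of the at-least-once-plus-idempotent-consumer approach, using SQLite as a stand-in dedup store; `apply_business_logic` is a hypothetical handler. In a real system the dedup insert and the business writes should share one transaction:

```python
# Idempotent consumer: deduplicate by message ID so redelivered
# messages (inevitable under at-least-once) are processed only once.
import sqlite3

db = sqlite3.connect("processed.db")
db.execute("CREATE TABLE IF NOT EXISTS processed (msg_id TEXT PRIMARY KEY)")

def handle(msg_id: str, payload: dict) -> None:
    # INSERT OR IGNORE succeeds once per msg_id; rowcount == 0 means
    # we've seen this message before and can skip it safely.
    cur = db.execute("INSERT OR IGNORE INTO processed VALUES (?)", (msg_id,))
    if cur.rowcount == 0:
        return  # duplicate delivery, already handled
    apply_business_logic(payload)  # hypothetical handler; ideally same transaction
    db.commit()  # persist the dedup record together with the work
```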
Partitioning and Ordering
Kafka topics split into partitions. This is where people get tripped up: ordering is only guaranteed within a single partition, not across them.
Producers assign messages to partitions by key. All messages with the same key land in the same partition, so ordering is preserved for that entity. Within a consumer group, each partition is assigned to exactly one consumer (a consumer can own several partitions). That's the parallelism lever. But here's the catch: with 12 partitions and 15 consumers, 3 of those consumers sit idle doing nothing. Plan partition counts carefully: increasing the count later is possible (though it reshuffles which keys map to which partitions, breaking per-key ordering across the change), while decreasing it isn't supported at all.
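A minimal keyed-producer sketch with the confluent-kafka Python client; the broker address and topic name are assumptions:

```python
# Keyed produce: every event for user-42 hashes to the same partition,
# so that user's events are consumed in the order they were produced.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder address

def publish_user_event(user_id: str, event: bytes) -> None:
    # Kafka hashes the key to choose a partition; same key, same partition.
    producer.produce("user-events", key=user_id, value=event)

publish_user_event("user-42", b'{"action": "login"}')
publish_user_event("user-42", b'{"action": "checkout"}')  # ordered after login
producer.flush()  # block until the broker acknowledges delivery
```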
Backpressure Strategies
When consumers can't keep up with producers, there are a few options:
- Scale consumers. Add more of them, up to the partition count. Start here.
- Rate-limit producers. Throttle ingestion when consumer lag crosses a threshold. This works but pushes the problem upstream.
- Batch processing. Have consumers process messages in batches to reduce per-message overhead. Simple and effective.
- Priority queues. Route high-priority messages to a separate queue with dedicated consumers. Useful when not all messages are equal.
- Dead letter queues (DLQ). After N failed attempts, park the message in a DLQ and move on. Don't let one bad message block everything behind it.
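Here's what the DLQ pattern from the last bullet can look like with the confluent-kafka Python client; `process` is a hypothetical handler, and the topic and group names are placeholders:

```python
# DLQ pattern: retry a message a bounded number of times, then park it
# in a dead-letter topic and commit the offset so the partition moves on.
from confluent_kafka import Consumer, Producer

MAX_ATTEMPTS = 3
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "order-processor",
    "enable.auto.commit": False,  # commit only after handling or parking
})
consumer.subscribe(["orders"])
dlq = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(msg.value())  # hypothetical business logic
            break
        except Exception:
            if attempt == MAX_ATTEMPTS:  # give up: park it, don't block the partition
                dlq.produce("orders.dlq", key=msg.key(), value=msg.value())
                dlq.flush()
    consumer.commit(message=msg, asynchronous=False)  # advance past it either way
```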
Production Considerations
- Idempotency is not optional. Every consumer must handle duplicate messages gracefully. Use unique message IDs. Use upserts. Test this explicitly. Duplicates will happen, and "it shouldn't happen" is not an engineering plan.
- Watch consumer lag. In Kafka, track consumer_lag per partition (a lag-check sketch follows this list). What counts as "too much" depends on the SLA. For real-time systems, lag above a few thousand messages means something is wrong.
- Schema evolution matters. Use a schema registry (Confluent, Apicurio) with Avro or Protobuf. Make schema changes forward and backward compatible. Otherwise, consumers break on every deploy and weekends get spent doing rollbacks.
- Retention and compaction. Kafka supports time-based retention (delete after 7 days) and log compaction (keep the latest value per key). Compacted topics are great for current-state snapshots, like a user profile store.
- Broker sizing. Kafka throughput scales with disk I/O, not CPU. Use dedicated disks for commit logs. Size hardware based on peak write throughput multiplied by the replication factor. Don't plan for average load; plan for the spike.
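The lag check referenced above, sketched with the confluent-kafka Python client: committed offsets per partition versus the broker's high watermark. The topic name and partition count are assumptions:

```python
# Per-partition lag: the broker's high watermark minus the group's
# committed offset. One frozen partition while others advance is the
# classic signature of a poison message (see Failure Scenarios).
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "order-processor",
})
partitions = [TopicPartition("orders", p) for p in range(12)]  # assumed 12 partitions
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # A group with no committed offset reports a negative sentinel value.
    lag = high - tp.offset if tp.offset >= 0 else high - low
    print(f"partition {tp.partition}: lag={lag}")
```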
Failure Scenarios
Scenario 1: Consumer Group Rebalance Storm. A Kafka consumer group has 50 consumers. One consumer slows down, misses its heartbeat, and the group coordinator triggers a rebalance. During that rebalance (30-60 seconds), all 50 consumers stop processing. The slow consumer rejoins, fails again, triggers another rebalance. Now it's a loop. This cascading storm can cause 20+ minutes of effective downtime while the queue keeps growing. Detection: Monitor rebalance_rate and rebalance_latency. Alert when rebalances exceed 2 per hour. Track max.poll.interval.ms violations per consumer. Recovery: Switch to Kafka's cooperative sticky assignor so rebalances are incremental, not stop-the-world. Set session.timeout.ms to 30s and max.poll.interval.ms to 5 minutes for batch workloads. Use static group membership (group.instance.id) to skip rebalances during rolling deployments. Uber did this and cut rebalance-related downtime by 90%.
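A consumer configuration in the spirit of that recovery, sketched for the confluent-kafka Python client (librdkafka-based clients expose the same settings as the Java client); the instance id would come from the pod name in practice:

```python
# Settings aimed at taming rebalance storms: incremental (cooperative)
# rebalancing plus static membership for rolling deployments.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "order-processor",
    # Cooperative rebalancing: only reassigned partitions pause,
    # instead of the whole group stopping the world.
    "partition.assignment.strategy": "cooperative-sticky",
    "session.timeout.ms": 30_000,     # tolerate brief stalls before eviction
    "max.poll.interval.ms": 300_000,  # 5 minutes for slow batch handlers
    # Static membership: a restarted instance with the same id reclaims
    # its partitions without triggering a rebalance during deploys.
    "group.instance.id": "order-processor-0",  # derive from pod name in practice
})
```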
Scenario 2: Poison Message Blocking. A malformed message lands in a Kafka topic. The consumer tries to deserialize it, throws an exception, and retries forever because it's configured for at-least-once with no error handling. That consumer is now stuck on one message, and every message behind it piles up. Lag grows to millions. Detection: Monitor per-partition consumer offset progress. If one partition's offset is frozen while others advance, there's likely a poison message. Alert when lag for a single partition exceeds 10x the average. Recovery: Build a dead-letter-queue pattern from day one. After 3 failed attempts, publish the message to a DLQ topic and advance the offset. Stripe follows this pattern for their billions of daily events: every consumer gets a DLQ, and a separate pipeline reprocesses those messages after the bug is fixed.
Scenario 3: Kafka Broker Disk Full. The topic's retention is set to 30 days, but traffic spikes 5x during a product launch. The commit log fills the broker's disk. Kafka stops accepting writes to all partitions on that broker, not just the topic that caused the problem. Producers start buffering and eventually drop messages. Detection: Alert on disk utilization above 75% per broker. Monitor log.retention.bytes versus actual disk usage. Track daily ingestion growth rate. Recovery: For the immediate fire, add size-based retention (e.g., 500GB cap per topic) alongside time-based retention. Long-term, implement per-topic quotas and consider tiered storage (offloading old segments to S3) to keep disk pressure manageable.
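One way to apply that size cap, sketched with the confluent-kafka AdminClient; the topic name and values are illustrative. Note that retention.bytes is enforced per partition, so divide the per-topic budget by the partition count:

```python
# Adding a size cap alongside time-based retention so a traffic spike
# deletes old segments instead of filling the disk.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder address
resource = ConfigResource(
    ConfigResource.Type.TOPIC, "orders",
    set_config={
        "retention.ms": str(30 * 24 * 60 * 60 * 1000),  # keep 30 days...
        "retention.bytes": str(50 * 1024**3),  # ...or 50GB per partition, whichever first
    },
)
futures = admin.alter_configs([resource])
futures[resource].result()  # raises if the broker rejected the change
```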
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Consumer lag | > 100K messages (real-time) | Scale consumers or investigate slow processing |
| Broker disk utilization | > 70% | Add brokers or reduce retention |
| Partition count per broker | > 4,000 | Add brokers (metadata overhead degrades performance beyond this) |
| Producer throughput | > 80% of broker network capacity | Add brokers or partitions |
| Replication factor overhead | RF * write_throughput > disk I/O capacity | Upgrade disks or add brokers |
Scale references for calibration: LinkedIn's Kafka deployment handles 7+ trillion messages per day across thousands of brokers in multiple data centers. A single Kafka broker on modern NVMe hardware can sustain roughly 600MB/s write throughput. These are useful reference points, but actual numbers depend entirely on message size, replication factor, and consumer patterns.
Capacity formula: Required brokers = (peak_write_MB/s * replication_factor) / per_broker_write_throughput_MB/s. For partitions: partition_count = max(peak_consumer_throughput / single_consumer_throughput, desired_parallelism). The practical ceiling is around 10,000 partitions per broker. Retention storage: storage_per_broker = daily_ingest_GB * retention_days * replication_factor / broker_count. Always add 40% headroom. That headroom pays for itself when Black Friday hits.
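Here are those formulas run with illustrative numbers: 300MB/s peak writes, replication factor 3, and the ~600MB/s per-broker NVMe figure from the scale references above:

```python
# Worked example of the broker-count and storage formulas.
import math

peak_write_mb_s = 300
replication_factor = 3
per_broker_write_mb_s = 600

brokers_for_throughput = math.ceil(
    peak_write_mb_s * replication_factor / per_broker_write_mb_s
)  # (300 * 3) / 600 = 1.5 -> round up to 2 brokers, before headroom

daily_ingest_gb = 2_000
retention_days = 7
broker_count = max(brokers_for_throughput, replication_factor)  # at least RF brokers
storage_per_broker_gb = (
    daily_ingest_gb * retention_days * replication_factor / broker_count
)  # 2000 * 7 * 3 / 3 = 14,000 GB per broker
storage_with_headroom_gb = storage_per_broker_gb * 1.4  # the 40% headroom

print(brokers_for_throughput, storage_per_broker_gb, storage_with_headroom_gb)
```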
Architecture Decision Record
Messaging Technology Decision Matrix
| Criteria | Apache Kafka | RabbitMQ | AWS SQS/SNS | Apache Pulsar |
|---|---|---|---|---|
| Throughput need | > 100K msg/s | < 50K msg/s | < 100K msg/s | > 100K msg/s |
| Message replay needed | Yes (log-based) | No (messages deleted after ack) | No (once consumed) | Yes (log-based) |
| Ordering guarantee | Per-partition | Per-queue | FIFO queues (limited) | Per-partition |
| Operational complexity | High (ZK/KRaft, brokers, schema registry) | Medium (Erlang cluster) | None (managed) | High (BookKeeper + brokers) |
| Multi-tenancy | Poor (topic-level isolation) | Good (vhosts) | Good (per-queue) | Excellent (native) |
How to pick:
- Need event sourcing, replay, or multiple independent consumers reading the same data? Kafka or Pulsar. The immutable log is the whole point. There's no faking it with a queue.
- Need complex routing (topic exchanges, header-based routing, priority queues)? RabbitMQ. Its exchange and binding model is far more flexible than Kafka's simple topic-partition structure.
- Want zero operational overhead and throughput stays under 50K messages/second? Go with AWS SQS + SNS. Running a self-managed Kafka cluster costs more in engineering time than most teams realize at that scale.
- Multi-tenancy is a first-class requirement (SaaS platform, shared infrastructure team)? Pulsar's native tenant and namespace isolation with per-tenant quotas beats Kafka's bolted-on alternatives.
- Team under 10 engineers? Do not run Kafka in-house. Use a managed service (Confluent Cloud, MSK, Redpanda Cloud) or start with SQS. Kafka operations (broker tuning, partition rebalancing, monitoring, upgrades) will eat the team's time. I've seen small teams lose months to it. The technology is great, but the operational tax is real.
Key Points
- Decouples producers from consumers, allowing asynchronous processing and better fault tolerance
- Message queues (RabbitMQ, SQS) deliver each message to one consumer. Event streams (Kafka) let multiple consumers read the same data independently
- At-least-once vs exactly-once vs at-most-once: the delivery guarantee choice shapes the entire application design
- Consumer groups provide parallel processing while keeping order within each partition
- Dead letter queues catch failed messages for debugging and reprocessing later
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Apache Kafka | Open Source | Event streaming, high throughput, replay capability | Large-Enterprise |
| RabbitMQ | Open Source | Traditional messaging, routing patterns, AMQP | Small-Large |
| AWS SQS/SNS | Managed | Serverless, zero ops, fan-out patterns | Small-Enterprise |
| Apache Pulsar | Open Source | Multi-tenancy, tiered storage, geo-replication | Large-Enterprise |
Common Mistakes
- Not making consumers idempotent. At-least-once delivery means messages can and will be processed more than once
- Letting queues grow without bounds. Producers outpace consumers, the queue fills the disk, and the broker dies
- Reaching for a message queue when a plain function call would do the job. Don't add unnecessary complexity
- Ignoring consumer lag. If consumers fall behind, there's a scaling or performance problem, and it's better to know about it before the users do
- Assuming ordering across partitions. Kafka only guarantees order within a single partition, not across them