Message Queues & Event Streaming
Why It Exists
When service A calls service B synchronously, their fates are chained together. If B is slow, A is slow. If B goes down, A breaks. That's a bad deal.
Message queues and event streams stick a buffer in between. Producers throw messages into that buffer and move on with their lives. Consumers pull from it at their own pace. This one change buys a lot: it absorbs traffic spikes, lets services fail independently, and frees teams to deploy and scale their components on their own schedule. Anyone building distributed systems ends up here sooner or later.
How It Works
Queues vs Streams
Message queues (RabbitMQ, SQS) use a competing consumers model. Each message goes to exactly one consumer. Once acknowledged, it's gone. This is the right model for task distribution: job processing, email sending, order fulfillment. One task, one worker.
Event streams (Kafka, Pulsar) work differently. They append messages to an immutable, ordered log. Multiple consumer groups can read the same stream independently, each tracking their own offset. Messages stick around for a configurable period (hours, days, or forever), which enables replay. That's a big deal. It makes it possible to rebuild a search index, backfill a new data store, or replay production events to debug a tricky issue. Once a team has had replay, going back is unthinkable.
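To make the independent-consumer model concrete, here's a minimal sketch using the confluent-kafka Python client. The broker address, topic, and group names are placeholders, not anything these systems mandate:

```python
# Two consumer groups subscribed to the same topic. Kafka tracks a
# separate committed offset per group.id, so the search indexer and
# the analytics job each see every message, at their own pace.
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker address
        "group.id": group_id,                   # offsets are tracked per group
        "auto.offset.reset": "earliest",        # a new group replays from the start
    })

search_indexer = make_consumer("search-indexer")
analytics = make_consumer("analytics")
for consumer in (search_indexer, analytics):
    consumer.subscribe(["orders"])  # same topic, independent progress
```

Adding a third consumer group later (say, to backfill a new data store) is just another `group.id` reading from offset zero; nothing about the existing consumers changes.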
Delivery Semantics
| Guarantee | Behavior | Implementation |
|---|---|---|
| At-most-once | Message may be lost, never duplicated | Fire and forget, no ack |
| At-least-once | Message is never lost, may be duplicated | Ack after processing, retry on failure |
| Exactly-once | Message processed exactly once | Idempotent consumers + transactional produce/consume |
People love to debate exactly-once. In practice, getting there means combining idempotent producers (Kafka assigns sequence numbers to catch duplicates), transactional writes (atomic produce + offset commit), and idempotent consumers (deduplication by message ID or idempotent database operations like upserts). It works, but it's not free. Every layer adds latency and complexity. Most systems do just fine with at-least-once plus idempotent consumers.
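Here's a sketch of the at-least-once-plus-idempotent-consumer approach, using SQLite as a stand-in dedup store; `apply_business_logic` is a hypothetical handler. In a real system the dedup insert and the business writes should share one transaction:

```python
# Idempotent consumer: deduplicate by message ID so redelivered
# messages (inevitable under at-least-once) are processed only once.
import sqlite3

db = sqlite3.connect("processed.db")
db.execute("CREATE TABLE IF NOT EXISTS processed (msg_id TEXT PRIMARY KEY)")

def handle(msg_id: str, payload: dict) -> None:
    # INSERT OR IGNORE succeeds once per msg_id; rowcount == 0 means
    # we've seen this message before and can skip it safely.
    cur = db.execute("INSERT OR IGNORE INTO processed VALUES (?)", (msg_id,))
    if cur.rowcount == 0:
        return  # duplicate delivery, already handled
    apply_business_logic(payload)  # hypothetical handler; ideally same transaction
    db.commit()  # persist the dedup record together with the work
```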
Partitioning and Ordering
Kafka topics split into partitions. This is where people get tripped up: ordering is only guaranteed within a single partition, not across them.
Producers assign messages to partitions by key. All messages with the same key land in the same partition, so ordering is preserved for that entity. Within a consumer group, each partition is assigned to exactly one consumer (a consumer can own several partitions). That's the parallelism lever. But here's the catch: with 12 partitions and 15 consumers, 3 of those consumers sit idle doing nothing. Plan partition counts carefully: increasing the count later is possible (though it reshuffles which keys map to which partitions, breaking per-key ordering across the change), while decreasing it isn't supported at all.
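A minimal keyed-producer sketch with the confluent-kafka Python client; the broker address and topic name are assumptions:

```python
# Keyed produce: every event for user-42 hashes to the same partition,
# so that user's events are consumed in the order they were produced.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder address

def publish_user_event(user_id: str, event: bytes) -> None:
    # Kafka hashes the key to choose a partition; same key, same partition.
    producer.produce("user-events", key=user_id, value=event)

publish_user_event("user-42", b'{"action": "login"}')
publish_user_event("user-42", b'{"action": "checkout"}')  # ordered after login
producer.flush()  # block until the broker acknowledges delivery
```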
Backpressure Strategies
When consumers can't keep up with producers, there are a few options:
- Scale consumers. Add more of them, up to the partition count. Start here.
- Rate-limit producers. Throttle ingestion when consumer lag crosses a threshold. This works but pushes the problem upstream.
- Batch processing. Have consumers process messages in batches to reduce per-message overhead. Simple and effective.
- Priority queues. Route high-priority messages to a separate queue with dedicated consumers. Useful when not all messages are equal.
- Dead letter queues (DLQ). After N failed attempts, park the message in a DLQ and move on. Don't let one bad message block everything behind it.
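Here's what the DLQ pattern from the last bullet can look like with the confluent-kafka Python client; `process` is a hypothetical handler, and the topic and group names are placeholders:

```python
# DLQ pattern: retry a message a bounded number of times, then park it
# in a dead-letter topic and commit the offset so the partition moves on.
from confluent_kafka import Consumer, Producer

MAX_ATTEMPTS = 3
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "order-processor",
    "enable.auto.commit": False,  # commit only after handling or parking
})
consumer.subscribe(["orders"])
dlq = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(msg.value())  # hypothetical business logic
            break
        except Exception:
            if attempt == MAX_ATTEMPTS:  # give up: park it, don't block the partition
                dlq.produce("orders.dlq", key=msg.key(), value=msg.value())
                dlq.flush()
    consumer.commit(message=msg, asynchronous=False)  # advance past it either way
```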
Production Considerations
- Idempotency is not optional. Every consumer must handle duplicate messages gracefully. Use unique message IDs. Use upserts. Test this explicitly. Duplicates will happen, and "it shouldn't happen" is not an engineering plan.
- Watch consumer lag. In Kafka, track consumer_lag per partition (a lag-check sketch follows this list). What counts as "too much" depends on the SLA. For real-time systems, lag above a few thousand messages means something is wrong.
- Schema evolution matters. Use a schema registry (Confluent, Apicurio) with Avro or Protobuf. Make schema changes forward and backward compatible. Otherwise, consumers break on every deploy and weekends get spent doing rollbacks.
- Retention and compaction. Kafka supports time-based retention (delete after 7 days) and log compaction (keep the latest value per key). Compacted topics are great for current-state snapshots, like a user profile store.
- Broker sizing. Kafka throughput scales with disk I/O, not CPU. Use dedicated disks for commit logs. Size hardware based on peak write throughput multiplied by the replication factor. Don't plan for average load; plan for the spike.
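The lag check referenced above, sketched with the confluent-kafka Python client: committed offsets per partition versus the broker's high watermark. The topic name and partition count are assumptions:

```python
# Per-partition lag: the broker's high watermark minus the group's
# committed offset. One frozen partition while others advance is the
# classic signature of a poison message (see Failure Scenarios).
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "order-processor",
})
partitions = [TopicPartition("orders", p) for p in range(12)]  # assumed 12 partitions
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # A group with no committed offset reports a negative sentinel value.
    lag = high - tp.offset if tp.offset >= 0 else high - low
    print(f"partition {tp.partition}: lag={lag}")
```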
Failure Scenarios
Scenario 1: Consumer Group Rebalance Storm. A Kafka consumer group has 50 consumers. One consumer slows down, misses its heartbeat, and the group coordinator triggers a rebalance. During that rebalance (30-60 seconds), all 50 consumers stop processing. The slow consumer rejoins, fails again, triggers another rebalance. Now it's a loop. This cascading storm can cause 20+ minutes of effective downtime while the queue keeps growing. Detection: Monitor rebalance_rate and rebalance_latency. Alert when rebalances exceed 2 per hour. Track max.poll.interval.ms violations per consumer. Recovery: Switch to Kafka's cooperative sticky assignor so rebalances are incremental, not stop-the-world. Set session.timeout.ms to 30s and max.poll.interval.ms to 5 minutes for batch workloads. Use static group membership (group.instance.id) to skip rebalances during rolling deployments. Uber did this and cut rebalance-related downtime by 90%.
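A consumer configuration in the spirit of that recovery, sketched for the confluent-kafka Python client (librdkafka-based clients expose the same settings as the Java client); the instance id would come from the pod name in practice:

```python
# Settings aimed at taming rebalance storms: incremental (cooperative)
# rebalancing plus static membership for rolling deployments.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "order-processor",
    # Cooperative rebalancing: only reassigned partitions pause,
    # instead of the whole group stopping the world.
    "partition.assignment.strategy": "cooperative-sticky",
    "session.timeout.ms": 30_000,     # tolerate brief stalls before eviction
    "max.poll.interval.ms": 300_000,  # 5 minutes for slow batch handlers
    # Static membership: a restarted instance with the same id reclaims
    # its partitions without triggering a rebalance during deploys.
    "group.instance.id": "order-processor-0",  # derive from pod name in practice
})
```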
Scenario 2: Poison Message Blocking. A malformed message lands in a Kafka topic. The consumer tries to deserialize it, throws an exception, and retries forever because it's configured for at-least-once with no error handling. That consumer is now stuck on one message, and every message behind it piles up. Lag grows to millions. Detection: Monitor per-partition consumer offset progress. If one partition's offset is frozen while others advance, there's likely a poison message. Alert when lag for a single partition exceeds 10x the average. Recovery: Build a dead-letter-queue pattern from day one. After 3 failed attempts, publish the message to a DLQ topic and advance the offset. Stripe follows this pattern for their billions of daily events: every consumer gets a DLQ, and a separate pipeline reprocesses those messages after the bug is fixed.
Scenario 3: Kafka Broker Disk Full. The topic's retention is set to 30 days, but traffic spikes 5x during a product launch. The commit log fills the broker's disk. Kafka stops accepting writes to all partitions on that broker, not just the topic that caused the problem. Producers start buffering and eventually drop messages. Detection: Alert on disk utilization above 75% per broker. Monitor log.retention.bytes versus actual disk usage. Track daily ingestion growth rate. Recovery: For the immediate fire, add size-based retention (e.g., 500GB cap per topic) alongside time-based retention. Long-term, implement per-topic quotas and consider tiered storage (offloading old segments to S3) to keep disk pressure manageable.
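One way to apply that size cap, sketched with the confluent-kafka AdminClient; the topic name and values are illustrative. Note that retention.bytes is enforced per partition, so divide the per-topic budget by the partition count:

```python
# Adding a size cap alongside time-based retention so a traffic spike
# deletes old segments instead of filling the disk.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder address
resource = ConfigResource(
    ConfigResource.Type.TOPIC, "orders",
    set_config={
        "retention.ms": str(30 * 24 * 60 * 60 * 1000),  # keep 30 days...
        "retention.bytes": str(50 * 1024**3),  # ...or 50GB per partition, whichever first
    },
)
futures = admin.alter_configs([resource])
futures[resource].result()  # raises if the broker rejected the change
```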
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Consumer lag | > 100K messages (real-time) | Scale consumers or investigate slow processing |
| Broker disk utilization | > 70% | Add brokers or reduce retention |
| Partition count per broker | > 4,000 | Add brokers (metadata overhead degrades performance beyond this) |
| Producer throughput | > 80% of broker network capacity | Add brokers or partitions |
| Replication factor overhead | RF * write_throughput > disk I/O capacity | Upgrade disks or add brokers |
Scale references for calibration: LinkedIn's Kafka deployment handles 7+ trillion messages per day across thousands of brokers in multiple data centers. A single Kafka broker on modern NVMe hardware can sustain roughly 600MB/s write throughput. These are useful reference points, but actual numbers depend entirely on message size, replication factor, and consumer patterns.
Capacity formula: Required brokers = (peak_write_MB/s * replication_factor) / per_broker_write_throughput_MB/s. For partitions: partition_count = max(peak_consumer_throughput / single_consumer_throughput, desired_parallelism). The practical ceiling is around 10,000 partitions per broker. Retention storage: storage_per_broker = daily_ingest_GB * retention_days * replication_factor / broker_count. Always add 40% headroom. That headroom pays for itself when Black Friday hits.
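Here are those formulas run with illustrative numbers: 300MB/s peak writes, replication factor 3, and the ~600MB/s per-broker NVMe figure from the scale references above:

```python
# Worked example of the broker-count and storage formulas.
import math

peak_write_mb_s = 300
replication_factor = 3
per_broker_write_mb_s = 600

brokers_for_throughput = math.ceil(
    peak_write_mb_s * replication_factor / per_broker_write_mb_s
)  # (300 * 3) / 600 = 1.5 -> round up to 2 brokers, before headroom

daily_ingest_gb = 2_000
retention_days = 7
broker_count = max(brokers_for_throughput, replication_factor)  # at least RF brokers
storage_per_broker_gb = (
    daily_ingest_gb * retention_days * replication_factor / broker_count
)  # 2000 * 7 * 3 / 3 = 14,000 GB per broker
storage_with_headroom_gb = storage_per_broker_gb * 1.4  # the 40% headroom

print(brokers_for_throughput, storage_per_broker_gb, storage_with_headroom_gb)
```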
Architecture Decision Record
Messaging Technology Decision Matrix
| Criteria | Apache Kafka | RabbitMQ | AWS SQS/SNS | Apache Pulsar |
|---|---|---|---|---|
| Throughput need | > 100K msg/s | < 50K msg/s | < 100K msg/s | > 100K msg/s |
| Message replay needed | Yes (log-based) | No (messages deleted after ack) | No (once consumed) | Yes (log-based) |
| Ordering guarantee | Per-partition | Per-queue | FIFO queues (limited) | Per-partition |
| Operational complexity | High (ZK/KRaft, brokers, schema registry) | Medium (Erlang cluster) | None (managed) | High (BookKeeper + brokers) |
| Multi-tenancy | Poor (topic-level isolation) | Good (vhosts) | Good (per-queue) | Excellent (native) |
How to pick:
- Need event sourcing, replay, or multiple independent consumers reading the same data? Kafka or Pulsar. The immutable log is the whole point. There's no faking it with a queue.
- Need complex routing (topic exchanges, header-based routing, priority queues)? RabbitMQ. Its exchange and binding model is far more flexible than Kafka's simple topic-partition structure.
- Want zero operational overhead and throughput stays under 50K messages/second? Go with AWS SQS + SNS. Running a self-managed Kafka cluster costs more in engineering time than most teams realize at that scale.
- Multi-tenancy is a first-class requirement (SaaS platform, shared infrastructure team)? Pulsar's native tenant and namespace isolation with per-tenant quotas beats Kafka's bolted-on alternatives.
- Team under 10 engineers? Do not run Kafka in-house. Use a managed service (Confluent Cloud, MSK, Redpanda Cloud) or start with SQS. Kafka operations (broker tuning, partition rebalancing, monitoring, upgrades) will eat the team's time. I've seen small teams lose months to it. The technology is great, but the operational tax is real.
Key Points
- Decouples producers from consumers, allowing asynchronous processing and better fault tolerance
- Message queues (RabbitMQ, SQS) deliver each message to one consumer. Event streams (Kafka) let multiple consumers read the same data independently
- At-least-once vs exactly-once vs at-most-once: the delivery guarantee choice shapes the entire application design
- Consumer groups provide parallel processing while keeping order within each partition
- Dead letter queues catch failed messages for debugging and reprocessing later
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Apache Kafka | Open Source | Event streaming, high throughput, replay capability | Large-Enterprise |
| RabbitMQ | Open Source | Traditional messaging, routing patterns, AMQP | Small-Large |
| AWS SQS/SNS | Managed | Serverless, zero ops, fan-out patterns | Small-Enterprise |
| Apache Pulsar | Open Source | Multi-tenancy, tiered storage, geo-replication | Large-Enterprise |
Common Mistakes
- Not making consumers idempotent. At-least-once delivery means messages can and will be processed more than once
- Letting queues grow without bounds. Producers outpace consumers, the queue fills the disk, and the broker dies
- Reaching for a message queue when a plain function call would do the job. Don't add unnecessary complexity
- Ignoring consumer lag. If consumers fall behind, there's a scaling or performance problem, and it's better to know about it before the users do
- Assuming ordering across partitions. Kafka only guarantees order within a single partition, not across them