Elixir / BEAM
The runtime that handles millions of concurrent connections where Java and Go need hundreds of servers
Why BEAM Exists
Ericsson built Erlang in 1986 to run telephone switches. The requirements: handle millions of simultaneous phone calls, each call isolated so one crash does not affect others, and the system must never go down (five nines uptime). The BEAM virtual machine was designed specifically for this: massive concurrency through lightweight processes, fault isolation through no shared memory, and distribution across multiple machines as a first-class feature.
Elixir (created by José Valim in 2012) runs on the same BEAM VM but with modern syntax, better tooling, and the Phoenix web framework. Everything that made Erlang good at phone calls makes Elixir good at WebSockets, chat, notifications, and any workload that holds millions of concurrent connections.
Processes Are Not Threads
In Java, each platform thread reserves ~1MB of stack by default. At 1 million concurrent connections, that's roughly 1TB of reserved stack space alone. Go is leaner with goroutines (initial stacks of ~2KB that grow on demand), but even Go needs external coordination (Redis) to track which user is on which server.
BEAM takes a different approach. Each process uses ~2KB of memory. These are not OS threads. The BEAM VM runs its own scheduler that preemptively schedules millions of processes across all CPU cores. Processes share no memory. Communication happens strictly through message passing (the actor model).
A single Elixir server with 64GB RAM can hold 2-5 million processes. Each WebSocket connection is one process. One server does what takes 100+ Java servers.
Java: 1M connections = 100 servers (10K connections each) + Redis for routing
Go: 1M connections = 20 servers (50K connections each) + Redis for routing
Elixir: 1M connections = 1-2 servers (500K-1M connections each), no Redis
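A minimal sketch of the primitive behind those numbers: spawning processes and passing messages. (Plain spawn/send/receive for illustration; production code wraps this in GenServer.)

# Spawn one lightweight process per "connection" — here 100,000 of them.
# Each costs about 2KB of memory and starts in microseconds.
pids =
  for id <- 1..100_000 do
    spawn(fn ->
      receive do
        {:notification, body} -> IO.puts("user #{id} got: #{body}")
      end
    end)
  end

# Processes share no memory; message passing is the only communication.
send(Enum.random(pids), {:notification, "hello"})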
The scheduler is preemptive, not cooperative. A slow process cannot block other processes: every process gets a fair share of CPU time through reduction counting. (Go relied on cooperative scheduling for years; since Go 1.14 the runtime preempts goroutines asynchronously, but BEAM's reduction-based preemption is finer-grained and has been the model from the start.)
Built-In Distribution
BEAM nodes find each other and connect automatically. When two nodes connect, their global name registries sync. A message sent by registered name reaches the target process on any node in the cluster. BEAM routes it to the correct node internally.
# On any node in the cluster (assumes the WebSocket process registered
# itself globally as {:user, 456} when the user connected):
:global.send({:user, 456}, %{type: "notification", body: "New message"})

# What BEAM does internally:
# 1. Look up {:user, 456} in the replicated :global name table
# 2. Find the pid on node2@10.0.1.5
# 3. Send the message over the TCP link between nodes
# 4. The process on node2 receives it and pushes to its WebSocket
This replaces the entire Redis connection registry pattern. No external service to maintain, no extra network hop for lookups, no TTL expiry headaches. The registry updates automatically when a node joins, leaves, or a process crashes.
Supervision Trees and Fault Tolerance
In an Elixir application, every process runs under a supervisor. If a process crashes (bad input, timeout, network error), the supervisor restarts it automatically. Restart happens in milliseconds. No other process is affected because there is no shared memory.
This is the "let it crash" philosophy. Instead of wrapping every function call in try/catch and handling every possible error, let the process crash and restart clean. The supervisor handles recovery. The code stays simple.
A supervision tree is hierarchical. The top-level supervisor watches subsystem supervisors, which watch individual worker processes. A crash at the leaf level restarts only that leaf. A crash in a subsystem supervisor restarts only that subsystem.
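A minimal sketch of what the top level of such a tree looks like in practice (the child module names here are hypothetical):

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Subsystem supervisor: a crash below it restarts only that subtree
      MyApp.ConnectionSupervisor,
      # Singleton worker under the top-level supervisor
      MyApp.NotificationStore
    ]

    # :one_for_one — a crashed child is restarted alone; siblings
    # keep running because no memory is shared between processes
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end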
WhatsApp used this model to serve roughly 900 million users (circa 2015) with a team of about 50 engineers. Their Erlang servers achieved 99.999% uptime (about five minutes of downtime per year) because individual process crashes never cascaded.
Phoenix Framework for WebSockets
Phoenix, the web framework for Elixir, handles WebSocket connections through its Channels abstraction with built-in lifecycle management: connect, join a topic (chat room, notification stream), send and receive messages, handle disconnection and reconnection.
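A minimal Channel sketch covering those lifecycle hooks (module name and topic are illustrative):

defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel

  # Client joins the topic "room:<id>"
  def join("room:" <> _room_id, _params, socket) do
    {:ok, socket}
  end

  # Client pushed a "new_msg" event; fan it out to everyone on the topic
  def handle_in("new_msg", %{"body" => body}, socket) do
    broadcast!(socket, "new_msg", %{body: body})
    {:noreply, socket}
  end
end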
Phoenix PubSub provides publish/subscribe across all nodes in the cluster. When a message is published to a topic, every subscriber on every node receives it. No external message broker needed for the pub/sub layer.
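The API is two calls, assuming a PubSub server named MyApp.PubSub started under the application supervisor:

# Subscribe the current process to a topic (on any node):
Phoenix.PubSub.subscribe(MyApp.PubSub, "notifications:456")

# Publish from any node; every subscriber on every node receives it:
Phoenix.PubSub.broadcast(MyApp.PubSub, "notifications:456", {:notification, "hello"})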
Phoenix Presence tracks which users are currently online across all nodes. It uses CRDTs internally to merge presence state without conflicts. "Who is online in this chat room" comes for free without building a separate presence service.
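A minimal Presence setup, assuming the same MyApp.PubSub server (module names illustrative):

defmodule MyAppWeb.Presence do
  use Phoenix.Presence, otp_app: :my_app, pubsub_server: MyApp.PubSub
end

# In a channel, after a successful join:
#   MyAppWeb.Presence.track(socket, user_id, %{online_at: System.system_time(:second)})
# MyAppWeb.Presence.list(socket) then returns everyone on the topic,
# merged across all nodes via CRDTs.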
Discord runs Phoenix Channels for millions of concurrent users. A single Discord "guild" (server) with thousands of members runs as a set of Phoenix Channel processes across multiple BEAM nodes. Message delivery, typing indicators, presence updates, and voice state all flow through Channels.
Hybrid Architecture: Elixir + Java/Go
The idea is not to rewrite the backend in Elixir. The production pattern is to use Elixir only for the connection and delivery layer while the business logic stays in Java/Go.
The Java/Go backend decides who to notify, builds the payload, applies user preferences, and publishes to Kafka. The Elixir layer consumes from Kafka and delivers notifications through WebSocket connections. The two systems communicate through Kafka (or gRPC). The Java service doesn't need to know Elixir exists.
This is not theoretical. Discord uses Elixir for WebSocket connections with Rust and Python for backend services. Pinterest uses Elixir for notification delivery with Java for notification logic. Bleacher Report uses Elixir for real-time push with Ruby for the main application.
Scaling the Kafka-to-Elixir Bridge
The scaling question: 3 Elixir nodes consuming from a 30-partition Kafka topic. A Kafka message for user 456 lands on Node 3, but user 456's WebSocket is on Node 1. How does the message get from Node 3 to user 456?
Step 1: Elixir nodes form a Kafka consumer group. All Elixir nodes join the same consumer group using Broadway (an Elixir library for Kafka/RabbitMQ consumption). Kafka assigns partitions across nodes automatically: Node 1 gets partitions 0-9, Node 2 gets 10-19, Node 3 gets 20-29.
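A sketch of that consumer group setup with Broadway and the broadway_kafka producer (topic, group id, and concurrency values are illustrative):

defmodule NotificationPipeline do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module: {BroadwayKafka.Producer,
          hosts: [localhost: 9092],
          group_id: "notification-delivery",  # every Elixir node joins this group
          topics: ["notifications"]},
        concurrency: 10
      ],
      processors: [
        default: [concurrency: 50]
      ]
    )
  end

  # The required handle_message/3 callback is shown in Step 3 below.
end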
Step 2: Java/Go publishes a notification to Kafka. The Java backend decides user 456 should receive a notification and publishes {user_id: 456, title: "New message", body: "..."} to the notifications topic. Kafka routes the message to a partition based on the key.
Step 3: An Elixir node consumes the message and routes it. The consuming node looks up user 456 in the distributed process registry and sends the message directly to that process. This is NOT a broadcast to all nodes. It is a direct point-to-point send to one specific process.
# Broadway consumer on whichever Elixir node receives the Kafka message.
# Assumes a prior pipeline step decoded the Kafka payload into a map
# with atom keys.
def handle_message(_processor, %Broadway.Message{data: data} = message, _context) do
  user_id = data.user_id  # 456
  payload = data          # the notification

  # Look up user 456's process in the cluster-wide :pg registry
  case :pg.get_members(:users, user_id) do
    [pid | _] ->
      # Found the process. send/2 works across nodes automatically.
      # If pid is on Node 1 and we are on Node 3, BEAM routes the
      # message over the internal TCP link. Direct send, not broadcast.
      send(pid, {:notification, payload})

    [] ->
      # User 456 is offline (no WebSocket process exists).
      # Store the notification for delivery when they reconnect.
      store_for_later(user_id, payload)
  end

  # Broadway expects the (possibly updated) message back
  message
end
Step 4: The WebSocket process receives the message and pushes to the user.
# This callback runs in the Channel process on whatever node user 456
# connected to. The process was spawned when user 456 opened the app
# and registered itself under {:user, 456}.
def handle_info({:notification, payload}, socket) do
  # Push through the WebSocket to the user's phone/browser
  push(socket, "notification", payload)
  {:noreply, socket}
end
This is not broadcast. When Node 3 calls send(pid, message) and pid lives on Node 1, BEAM sends the message directly to Node 1 over the internal TCP connection between nodes. Node 2 never sees it. No node receives messages it does not need.
NOT this (broadcast):
Node 3 -> send to ALL nodes -> "does anyone have user 456?"
THIS (direct send):
Node 3 -> registry lookup: "user 456 is pid #xyz on Node 1"
-> send directly to pid #xyz on Node 1
-> only Node 1 receives it
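For completeness, a sketch of the registration that makes the lookup in Step 3 work. It assumes a :pg scope named :users was started under the application supervisor (:pg.start_link(:users)):

# In the Channel's join callback, when user 456 opens a WebSocket:
def join("user:" <> user_id, _params, socket) do
  # Add this process to the cluster-wide group for this user.
  # :pg replicates membership to every node and removes the entry
  # automatically when the process or its node dies.
  :ok = :pg.join(:users, String.to_integer(user_id), self())
  {:ok, socket}
end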
Two approaches for partition assignment:
Approach A: Hash-aligned partitioning. Partition Kafka by user_id hash and route WebSocket connections using the same hash. The message always lands on the node holding the connection. No cross-node forwarding needed. Simple but rigid: if a node dies, both Kafka consumers and WebSocket connections must rebalance simultaneously.
Approach B: Any-node consumption with BEAM forwarding (what Discord does). Any node consumes from any Kafka partition. Cross-node forwarding happens through the distributed registry as shown above. This is better because Kafka scaling and WebSocket scaling are independent. Add more Elixir nodes for more connections. Kafka rebalances consumers automatically. BEAM handles cross-node message routing without any external service.
Performance Characteristics
Process creation: ~1-2 microseconds. Context switch between processes: ~0.1 microseconds. Message send between processes on the same node: ~0.1 microseconds. Message send between processes on different nodes (same datacenter): ~100-500 microseconds.
A single BEAM node can handle 2-5 million concurrent WebSocket connections depending on message rate and hardware. At 10,000 messages per second per node (aggregate across all connections), CPU usage stays under 30% on a 16-core machine. The bottleneck is almost always network bandwidth, not CPU or memory.
Phoenix Channels benchmark: a single server has handled 2 million simultaneous WebSocket connections with response times under 1ms per message. This was on a 40-core machine with 128GB RAM.
Pros
- • Millions of lightweight processes per machine. Each process is ~2KB vs ~1MB for an OS thread. One server handles what takes hundreds of servers in Java/Go
- • Built-in distributed process registry across nodes. No Redis or external coordination needed to track which user is on which server
- • Fault tolerance via supervision trees. A crashed process is restarted automatically without affecting other processes. WhatsApp achieved 99.999% uptime with this
- • Hot code upgrades. Deploy new code without dropping connections. Telecom systems ran for years without restart
- • Battle-tested runtime. Erlang/BEAM has powered telecom switches since 1986. Elixir (2012) adds modern syntax and tooling on the same VM
Cons
- • Smaller ecosystem than Java/Go/Node. Fewer libraries, fewer framework choices, fewer Stack Overflow answers
- • Limited hiring pool. Finding experienced Elixir developers is harder than Java or Go developers
- • Not suited for CPU-heavy work. Number crunching, ML inference, image processing are all better in Go/Rust/C++. BEAM is optimized for I/O concurrency, not compute
- • Learning curve for OTP patterns. Supervisors, GenServers, and the 'let it crash' philosophy require a mental model shift from try/catch thinking
- • Distributed Erlang uses a fully connected mesh topology. Past ~50-100 nodes the mesh becomes expensive. Large deployments partition the cluster or replace the distribution layer (e.g. with Partisan); libcluster automates node discovery but keeps the full mesh
When to use
- • The system needs to hold millions of concurrent WebSocket or TCP connections with minimal infrastructure
- • Real-time messaging or notification delivery where connection routing is the bottleneck
- • Fault tolerance matters and process-level isolation is preferred over container restarts
- • The project is a connection/delivery layer while business logic stays in Java/Go/Python
When NOT to use
- • CPU-bound workloads like ML inference, video encoding, or heavy computation
- • The team has zero Erlang/Elixir experience and the project timeline is tight
- • Simple request-response APIs where Go or Java already meet the latency and throughput needs
- • A massive library ecosystem is needed for third-party integrations (payment SDKs, cloud provider clients)
Key Points
- • BEAM processes are not OS threads. Each process uses ~2KB of memory, is preemptively scheduled by the BEAM VM across all CPU cores, and shares no memory with other processes. Communication is strictly through message passing. One machine can run 2-5 million processes simultaneously. This is why a single Elixir server replaces hundreds of Java/Go servers for connection-heavy workloads.
- • The distributed process registry replaces Redis for connection tracking. When a user connects, Elixir spawns a process and registers it under the user's id in a cluster-wide registry (:global or a :pg group). To notify user 456, look up the pid and send/2 to it; BEAM delivers the message to the right node anywhere in the cluster. No external lookup service, no Redis, no connection registry table.
- • Supervision trees provide automatic fault recovery. Every process has a supervisor. If a process crashes (bad message, timeout, bug), the supervisor restarts it in milliseconds. Other processes are unaffected because there is no shared memory. This is the 'let it crash' philosophy: instead of defensive programming with try/catch everywhere, let the process crash and restart clean. WhatsApp used this to achieve 99.999% uptime.
- • Phoenix Channels handle WebSocket connections at scale. Phoenix is the web framework for Elixir. Its Channels abstraction manages the WebSocket lifecycle (connect, join room, send/receive messages, disconnect) with built-in PubSub and Presence tracking (who is online right now). Discord runs Phoenix Channels for millions of concurrent users.
- • Hybrid architecture with Java/Go is the production pattern. The idea is not to rewrite the backend in Elixir. Business logic stays in Java/Go. Elixir handles only the connection and delivery layer. Java/Go publishes notifications to Kafka. Elixir consumes from Kafka and pushes to the correct WebSocket. The two systems communicate via Kafka or gRPC. The Java service doesn't need to know Elixir exists. Discord (Elixir + Rust), Pinterest (Elixir + Java), and Bleacher Report (Elixir + Ruby) all use this pattern.
- • Cross-node message routing scales without Redis. When Elixir Node 3 consumes a Kafka message for user 456 but user 456's WebSocket is on Node 1, BEAM handles it automatically. Node 3 looks up {:user, 456} in the distributed registry, finds it on Node 1, and forwards the message internally in microseconds. No external hop, no Redis lookup. This is the key advantage over the Java/Go + Redis approach, where every notification requires a Redis round trip to find the correct server.
Common Mistakes
- ✗ Trying to use Elixir for CPU-heavy work. BEAM is optimized for I/O concurrency (waiting on sockets, databases, external APIs). Pure Elixir is slow at number crunching, ML inference, and image processing, and long-running NIFs block a scheduler. Use Rust NIFs on dirty schedulers or offload to a Go/Python service for CPU work.
- ✗ Treating BEAM processes like OS threads. Developers from Java/Go backgrounds try to minimize process count. In BEAM, spawning a process costs microseconds and ~2KB. Spawn one per connection, one per request, one per task. Millions of processes is normal and expected.
- ✗ Ignoring the fully connected mesh limit. Distributed Erlang connects every node to every other node. At 50+ nodes, the mesh overhead (heartbeats, connection maintenance) becomes significant. Partition the cluster into subclusters or swap the distribution layer (e.g. Partisan); libcluster automates node discovery and cluster formation but keeps the full mesh.
- ✗ Not using an existing framework. Building raw GenServer WebSocket handlers from scratch when Phoenix Channels already handles connection lifecycle, heartbeats, reconnection, PubSub, and presence. Phoenix is not optional overhead; it saves months of work.
- ✗ Assuming Kafka partition alignment with WebSocket routing. Partitioning Kafka by user_id and routing WebSocket connections by the same hash couples the two systems tightly. A node failure requires rebalancing both Kafka consumers and WebSocket connections simultaneously. The better approach (what Discord does): let any node consume from Kafka and use BEAM's distributed registry to forward messages to the correct node internally.