System V & POSIX Message Queues
Mental Model
Post office with sorting mailboxes. Every letter gets an urgency stamp -- red, yellow, green. The box automatically floats red-stamped letters to the top. Open the box, and the most urgent letter comes out first no matter when it was dropped off. Letters are sealed in envelopes (message boundaries): no tearing, no merging. The mailbox stays bolted to the wall even when nobody is home. Move away without canceling the box and it sits there forever, full of unread mail, eating building space.
The Problem
A monitoring system pushes 10,000 events per second, 99% routine metrics and 1% critical alerts. In a pipe, alerts sit behind thousands of metrics in the FIFO stream -- seconds of delay on time-sensitive events. Crashed processes leave kernel-persistent queues leaking memory; after a week of restarts, 50 orphaned queues exhaust the RLIMIT_MSGQUEUE budget. The defaults make it worse: msg_max=10 fills instantly under load, and two unrelated services that both open a queue named /commands silently corrupt each other's messages.
Architecture
Pipes are great until priorities matter -- and pipes have no way to express priority.
With a pipe, messages arrive in order. First in, first out. There's no way to say "this alert is urgent -- deliver it before the log entries that were queued first." And there's no way to pull out only messages of a specific type while leaving others in the queue.
Message queues solve exactly this. They're the kernel's built-in priority mailbox: every message has a priority level, and mq_receive() always delivers the most important one first.
What Actually Happens
There are two APIs. POSIX message queues are the right choice for new code. System V exists only for legacy compatibility.
POSIX message queues (mq_open, mq_send, mq_receive) store messages in a kernel red-black tree keyed by priority. On each mq_send() call, the kernel copies the message into kernel memory and inserts it at the right priority level. mq_receive() always returns the oldest message at the highest priority.
Messages have boundaries. Each mq_send() becomes exactly one mq_receive(). No parsing byte streams, no framing protocols, no "did I get the whole message?" checks.
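A minimal round trip showing both properties -- the queue name /demo and the sizes are illustrative, and older glibc needs linking with -lrt. Two sends at different priorities; the receive returns the urgent message first, in one whole piece:

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>

int main(void) {
    /* Request 10 message slots of 128 bytes each (subject to /proc limits). */
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 128 };

    /* "/demo" is an illustrative name; POSIX MQ names must start with '/'. */
    mqd_t mq = mq_open("/demo", O_CREAT | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    mq_send(mq, "routine metric", 15, 0);   /* priority 0: lowest            */
    mq_send(mq, "CRITICAL alert", 15, 31);  /* 31: highest portable priority */

    char buf[128];           /* receive buffer must be >= mq_msgsize */
    unsigned int prio;
    ssize_t n = mq_receive(mq, buf, sizeof buf, &prio);
    if (n >= 0)
        printf("got [prio %u]: %s\n", prio, buf); /* alert first, despite send order */

    mq_close(mq);
    mq_unlink("/demo");      /* without this the queue outlives the process */
    return 0;
}
```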
System V message queues (msgget, msgsnd, msgrcv) predate POSIX and use a different model. Messages have a long type field instead of a priority. The receiver can request messages of a specific type, any type up to a value, or just the first available. This is flexible for multiplexing, but the API is clunky and the identifiers don't work with epoll.
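A comparable sketch for the System V API, using IPC_PRIVATE so the demo sidesteps the global-key collision problem. Note how the receiver selects by type rather than getting the highest priority:

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* SysV messages start with a mandatory long type field. */
struct msgbuf { long mtype; char mtext[64]; };

int main(void) {
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1) { perror("msgget"); return 1; }

    struct msgbuf m1 = { .mtype = 1 }; strcpy(m1.mtext, "log line");
    struct msgbuf m2 = { .mtype = 2 }; strcpy(m2.mtext, "alert");
    msgsnd(qid, &m1, sizeof m1.mtext, 0);   /* msgsz excludes the mtype field */
    msgsnd(qid, &m2, sizeof m2.mtext, 0);

    struct msgbuf out;
    /* msgtyp = 2: skip the type-1 message and pull the alert first. */
    msgrcv(qid, &out, sizeof out.mtext, 2, 0);
    printf("received type %ld: %s\n", out.mtype, out.mtext);

    msgctl(qid, IPC_RMID, NULL);            /* SysV queues persist unless removed */
    return 0;
}
```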
Both support blocking and non-blocking modes. Both persist until explicitly removed.
Under the Hood
The killer feature: POSIX MQ descriptors are file descriptors. On Linux, mqd_t is just an int fd. This means POSIX message queues work with select(), poll(), and epoll(). MQ events can be multiplexed with socket I/O in a single event loop. The fd becomes readable when messages are available and writable when there's space.
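A sketch of that integration, assuming an illustrative queue name /events: the descriptor goes straight into epoll alongside any sockets.

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <sys/epoll.h>

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 128 };
    mqd_t mq = mq_open("/events", O_CREAT | O_RDONLY | O_NONBLOCK, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = mq };
    /* On Linux mqd_t is a plain fd, so it registers like any socket would. */
    if (epoll_ctl(ep, EPOLL_CTL_ADD, mq, &ev) == -1) { perror("epoll_ctl"); return 1; }

    struct epoll_event ready;
    int n = epoll_wait(ep, &ready, 1, 5000);   /* wait up to 5 s */
    if (n > 0) {
        char buf[128];
        unsigned int prio;
        ssize_t len = mq_receive(mq, buf, sizeof buf, &prio);
        if (len >= 0)
            printf("[prio %u] %.*s\n", prio, (int)len, buf);
    }
    return 0;
}
```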
System V MQs return integer identifiers that are not file descriptors. No epoll support. No integration with event loops. This alone makes POSIX MQs the right choice for any event-driven system.
mq_notify() is one-shot and subtle. A process can register for async notification when a message arrives on a previously empty queue. The notification is delivered either as a signal (SIGEV_SIGNAL) or as a callback run on a new thread (SIGEV_THREAD). But the notification fires exactly once. Re-registration is required after each delivery. And it only fires when the queue transitions from empty to non-empty -- if messages are already queued, no notification.
Internally, glibc implements SIGEV_THREAD with a helper thread that blocks reading from a raw netlink socket; when a message arrives on the queue, the kernel writes a notification to that socket and the helper invokes the registered callback.
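Here is a sketch of the re-arm pattern the mq_notify(3) man page recommends: re-register first, then drain with non-blocking reads. The queue name /notify-demo is illustrative.

```c
#include <fcntl.h>
#include <mqueue.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static mqd_t mq;
static void on_message(union sigval sv);

static void arm(void) {
    struct sigevent sev;
    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = on_message;
    mq_notify(mq, &sev);    /* one-shot: must be repeated after each delivery */
}

static void on_message(union sigval sv) {
    (void)sv;
    arm();  /* re-arm BEFORE draining; a message landing in the gap would
               otherwise leave the queue non-empty with no notification set */
    char buf[128];
    unsigned int prio;
    ssize_t n;
    while ((n = mq_receive(mq, buf, sizeof buf, &prio)) >= 0)  /* O_NONBLOCK ends loop */
        printf("[prio %u] %.*s\n", prio, (int)n, buf);
}

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 128 };
    mq = mq_open("/notify-demo", O_CREAT | O_RDONLY | O_NONBLOCK, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }
    arm();
    for (;;) pause();   /* callbacks run on a glibc-managed thread */
}
```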
Resource limits are tighter than expected. Default POSIX MQ limits: msg_max=10 messages, msgsize_max=8192 bytes, queues_max=256. A queue with 10 slots fills instantly under load. Always check and tune /proc/sys/fs/mqueue/ for production use. RLIMIT_MSGQUEUE caps total bytes across all queues owned by a user (default 819200 bytes).
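A quick way to read that per-user budget from C (RLIMIT_MSGQUEUE is Linux-specific):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* RLIMIT_MSGQUEUE caps the bytes a user may commit to POSIX message
       queues, summed across all of their queues. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_MSGQUEUE, &rl) == -1) { perror("getrlimit"); return 1; }
    printf("POSIX MQ byte budget: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}
```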
The persistence trap. Both POSIX and System V queues persist after the creating process exits. mq_close() closes the descriptor but the queue stays on /dev/mqueue with all its messages. mq_unlink() removes the name. System V queues persist until msgctl(IPC_RMID) or reboot. If a process crashes without cleanup, it leaks kernel memory. Monitor /dev/mqueue and ipcs -q regularly in production.
IPC namespace isolation. Both MQ types are isolated by IPC namespaces (CLONE_NEWIPC). Docker and Kubernetes containers get separate namespaces, so queues don't leak between containers. systemd's PrivateIPC= directive gives individual services their own IPC namespace.
Common Questions
When to use MQs vs pipes vs sockets?
Pipes: simple parent-child byte streams. Message queues: message boundaries, priority ordering, or unrelated process communication without the overhead of setting up sockets. Unix domain sockets: bidirectional, connection-oriented communication with fd passing. TCP sockets: when network transparency is needed.
What happens to a POSIX MQ when all processes close it but nobody unlinks it?
It sits on /dev/mqueue consuming kernel memory indefinitely. All messages remain intact. It stays there until mq_unlink() or reboot. Best practice: the creating process should unlink in an atexit handler, and ops should monitor /dev/mqueue for orphans.
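A sketch of that pattern, with an illustrative queue name; note that atexit() only covers orderly exits, which is exactly why ops monitoring still matters:

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdlib.h>

#define QUEUE_NAME "/worker-inbox"   /* illustrative name */

static void cleanup(void) {
    mq_unlink(QUEUE_NAME);   /* removes the name; storage is freed once every fd closes */
}

int main(void) {
    mqd_t mq = mq_open(QUEUE_NAME, O_CREAT | O_RDWR, 0600, NULL);
    if (mq == (mqd_t)-1) return 1;
    atexit(cleanup);   /* covers normal exits; a SIGKILL or crash still leaks the queue */

    /* ... main work loop ... */

    mq_close(mq);
    return 0;          /* cleanup() runs here via atexit */
}
```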
How does mq_notify(SIGEV_THREAD) work internally?
The kernel doesn't create threads directly. glibc creates a helper thread that blocks reading from a raw netlink socket; when the kernel writes the notification to that socket, the helper calls the registered callback. This is a glibc implementation detail, not something the standard mandates.
Can POSIX MQs be used across containers?
Only if they share the same IPC namespace. POSIX MQs are isolated by IPC namespaces (CLONE_NEWIPC), not mount namespaces. Two containers with different IPC namespaces can't see each other's queues, even if they both mount /dev/mqueue.
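A sketch that demonstrates the isolation directly, using unshare(CLONE_NEWIPC); it needs CAP_SYS_ADMIN (e.g. run as root), and the queue name is illustrative:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <mqueue.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Create a queue in the current IPC namespace. */
    mq_open("/ns-demo", O_CREAT | O_RDWR, 0600, NULL);

    /* Enter a fresh IPC namespace. */
    if (unshare(CLONE_NEWIPC) == -1) { perror("unshare"); return 1; }

    /* The same name no longer resolves: each namespace has its own queue set. */
    if (mq_open("/ns-demo", O_RDONLY) == (mqd_t)-1)
        perror("mq_open after unshare");   /* expected: No such file or directory */

    /* Note: /ns-demo still exists in the original namespace -- unlink it there. */
    return 0;
}
```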
How Technologies Use This
Dozens of PostgreSQL backend processes need to coordinate access to shared buffer pools and lock tables. Funneling every shared buffer access through socket round-trips would add per-lookup latency that destroys query performance. Running two PostgreSQL instances on the same host risks collisions on System V IPC keys, causing mysterious startup failures.
The postmaster creates System V shared memory segments via shmget() and semaphore arrays via semget() at startup, and every forked backend inherits direct access to these IPC resources. This eliminates the need for socket communication entirely. The IPC key collision risk arises because System V IPC uses global integer keys, so two instances choosing the same key would corrupt each other's shared memory.
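A sketch of the collision mechanics -- the key value here is illustrative, and PostgreSQL derives its real keys differently:

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    key_t key = 5432001;   /* illustrative: a key derived from a port-style number */

    /* First "instance" creates a segment under the global key. */
    int a = shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600);
    printf("instance A: shmid=%d\n", a);

    /* A second "instance" picking the same key cannot create its own segment.
       IPC_EXCL turns the silent collision into an explicit EEXIST error. */
    if (shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600) == -1)
        perror("instance B shmget");

    shmctl(a, IPC_RMID, NULL);   /* remove the segment so it doesn't leak */
    return 0;
}
```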
Docker solves the collision problem by giving each container its own IPC namespace via CLONE_NEWIPC, so the SysV keys inside container A are completely invisible to container B. This is why PostgreSQL in Docker just works without manual IPC key configuration, even when running 10+ instances per host.
A service using POSIX message queues crashes without calling mq_unlink(). The queue persists on /dev/mqueue, consuming kernel memory indefinitely. Multiply this by dozens of services restarting after failures and the result is a slow memory leak that takes weeks to notice. Two services using the same queue name /commands also interfere with each other's messages.
POSIX and System V IPC resources are kernel-persistent by design. They survive process exit and only disappear on explicit removal or reboot. Without namespace isolation, all services on a host share one IPC keyspace, so crashed services leak resources and name collisions cause silent cross-service message corruption.
systemd 248+ ships the PrivateIPC= directive, which gives each service its own IPC namespace via CLONE_NEWIPC. When the service stops or crashes, the namespace is destroyed and all POSIX and SysV IPC resources inside it are automatically cleaned up. Name collisions are also eliminated: two services can both use /commands without interfering, since each sees only its own isolated IPC namespace with its own /dev/mqueue mount.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Message passing | N/A (use network between containers) | JMS / BlockingQueue (in-process) | process.send() for child IPC | channels (in-process), net for IPC | N/A (use gRPC/NATS between pods) |
| Priority delivery | N/A | PriorityBlockingQueue | N/A (manual sorting) | heap-based priority queue | PriorityClass (pod scheduling, not messaging) |
| IPC namespace isolation | --ipc=private (default) | N/A | N/A | N/A | containers in a pod share one IPC namespace; hostIPC opts into the host's |
| Persistence after crash | MQ persists until container namespace destroyed | In-process queues lost on JVM exit | Lost on process exit | Lost on process exit | Persistent volumes for durable messaging |
| Async notification | mq_notify (kernel-level) | CompletableFuture / Observer | EventEmitter | select on channel | Watch API for K8s resources |
Stack Layer Mapping
| Layer | Message Queue Mechanism |
|---|---|
| Hardware | N/A (MQs are pure kernel abstractions) |
| Kernel | mqueue filesystem (POSIX) or msg_queue linked list (SysV) |
| IPC namespace | Isolates MQ keyspace per container/service |
| System calls | mq_open/mq_send/mq_receive (POSIX), msgget/msgsnd/msgrcv (SysV) |
| /dev/mqueue | Virtual filesystem exposing POSIX MQ state |
| Application | Opens queue by name, sends/receives with priority |
Design Rationale
System V MQs return integer IDs that cannot integrate with event loops -- a dealbreaker for event-driven code. POSIX MQs fixed this by returning real file descriptors that work with select, poll, and epoll. The rb-tree priority structure keeps both insertion and retrieval of the most urgent message at O(log n). Kernel persistence was intentional: queues survive crashes so messages are not lost. The tradeoff is a cleanup burden pushed to application code via mq_unlink(), and crashed processes that skip it leak kernel memory indefinitely.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| mq_send blocks or returns EAGAIN | Queue full (default msg_max=10) | cat /proc/sys/fs/mqueue/msg_max and increase |
| mq_open fails with ENOSPC | Too many queues (default queues_max=256) | cat /proc/sys/fs/mqueue/queues_max and count active queues |
| Queue persists after process exit | mq_close called but mq_unlink missing | ls /dev/mqueue/ for orphaned queues |
| mq_notify callback never fires | Queue was not empty when notification registered | Drain queue first, then register mq_notify |
| Two services corrupting each other's messages | Shared IPC namespace with same queue name | ls -la /dev/mqueue/ and add IPC namespace isolation |
| RLIMIT_MSGQUEUE exceeded | Too many queues or large messages per user | ulimit -q and clean up leaked queues |
When to Use / Avoid
Use when:
- Delivering priority-ordered messages between processes on the same host
- Building event systems where urgent alerts must jump ahead of routine events
- Integrating message delivery into an epoll-based event loop (POSIX MQs only)
- Communicating between unrelated processes that cannot share pipes
Avoid when:
- Communication needs to cross machine boundaries (use TCP/Unix sockets instead)
- Throughput exceeds kernel MQ limits (use shared memory ring buffers or user-space message brokers)
- Simple parent-child byte streaming suffices (use pipes)
- The application is event-driven and needs bidirectional communication (use Unix domain sockets)
Try It Yourself
# Mount the mqueue filesystem (usually auto-mounted)
mount -t mqueue none /dev/mqueue 2>/dev/null; ls -la /dev/mqueue/

# View POSIX MQ limits
cat /proc/sys/fs/mqueue/msg_max && cat /proc/sys/fs/mqueue/msgsize_max && cat /proc/sys/fs/mqueue/queues_max

# List System V message queues
ipcs -q

# Remove ALL System V message queues by ID (careful on shared hosts)
ipcs -q | awk 'NR>3 && $2 ~ /[0-9]/ {print $2}' | xargs -I{} ipcrm -q {}

# Inspect a POSIX MQ's current state
cat /dev/mqueue/myqueue 2>/dev/null || echo 'No queue named myqueue'

# Check mqueue resource usage
find /dev/mqueue -type f | wc -l && echo 'queues active'

Debug Checklist
1. ls -la /dev/mqueue/ -- list active POSIX message queues
2. cat /dev/mqueue/<name> -- show queue state (current/max messages, size)
3. ipcs -q -- list System V message queues
4. cat /proc/sys/fs/mqueue/msg_max -- check POSIX MQ message count limit
5. cat /proc/sys/fs/mqueue/msgsize_max -- check POSIX MQ message size limit
6. ipcrm -q <msqid> -- remove a leaked System V message queue
Key Takeaways
- ✓ POSIX MQs deliver by priority: mq_receive() always returns the highest-priority message first (0 to MQ_PRIO_MAX-1, at least 32 levels). System V MQs have a 'type' field for selective retrieval but no strict priority ordering.
- ✓ mq_notify() fires once when a message arrives on an empty queue -- then you must re-register. It won't fire if the queue already has messages. Only one process can be registered per queue. This design prevents thundering herds but demands careful coding.
- ✓ On Linux, POSIX MQ descriptors are real file descriptors. They work with select(), poll(), and epoll(). You can multiplex MQ events with socket I/O in one event loop. System V MQs return integer IDs that don't work with epoll -- a dealbreaker for event-driven code.
- ✓ Default POSIX MQ limits are surprisingly low: msg_max=10, msgsize_max=8192, queues_max=256. Production systems almost always need to tune /proc/sys/fs/mqueue/ values.
- ✓ Both MQ types persist until explicitly removed or reboot. POSIX MQs survive mq_close() -- you must call mq_unlink(). System V MQs survive until msgctl(IPC_RMID). Crashed processes leak queues. This is a real operational hazard.
Common Pitfalls
- ✗ Mistake: calling mq_close() and assuming the queue is gone. Reality: mq_close() closes the descriptor but the queue persists on /dev/mqueue. You must call mq_unlink() to actually remove it. Leaked queues eat kernel memory until reboot.
- ✗ Mistake: not tuning msg_max. Reality: with the default of 10, mq_send() blocks (or returns EAGAIN) almost immediately under any real load. Always check and raise /proc/sys/fs/mqueue/msg_max.
- ✗ Mistake: using System V MQs in new code. Reality: SysV IPC uses integer keys (not fds), doesn't work with epoll, has awkward permissions, and the API is inconsistent. POSIX MQs are superior in every way. SysV exists only for legacy compatibility.
- ✗ Mistake: expecting mq_notify to re-arm automatically. Reality: notification fires once. If messages arrive between the notification callback and your re-registration call, they're silently available but no new notification fires.
Reference
In One Line
POSIX MQs over System V -- they work with epoll and deliver by priority -- but always mq_unlink() on shutdown or leaked queues eat kernel memory forever.