System V & POSIX Message Queues
Mental Model
Post office with sorting mailboxes. Every letter gets an urgency stamp -- red, yellow, green. The box automatically floats red-stamped letters to the top. Open the box, and the most urgent letter comes out first no matter when it was dropped off. Letters are sealed in envelopes (message boundaries): no tearing, no merging. The mailbox stays bolted to the wall even when nobody is home. Move away without canceling the box and it sits there forever, full of unread mail, eating building space.
The Problem
A monitoring system pushes 10,000 events per second, 99% routine metrics and 1% critical alerts. In a pipe, alerts sit behind thousands of metrics in the FIFO stream -- seconds of delay on time-sensitive events. Crashed processes leave kernel-persistent queues leaking memory; after a week of restarts, 50 orphaned queues exhaust the RLIMIT_MSGQUEUE budget. The defaults make it worse: msg_max=10 fills instantly under load, and two unrelated services that both open a queue named /commands silently corrupt each other's messages.
Architecture
Pipes are great until priorities matter -- and pipes have no way to express priority.
With a pipe, messages arrive in order. First in, first out. There's no way to say "this alert is urgent -- deliver it before the log entries that were queued first." And there's no way to pull out only messages of a specific type while leaving others in the queue.
Message queues solve exactly this. They're the kernel's built-in priority mailbox: every message has a priority level, and mq_receive() always delivers the most important one first.
What Actually Happens
There are two APIs. POSIX message queues are the right choice for new code. System V exists only for legacy compatibility.
POSIX message queues (mq_open, mq_send, mq_receive) store messages in a kernel red-black tree keyed by priority. On each mq_send() call, the kernel copies the message into kernel memory and inserts it at the right priority level. mq_receive() always returns the oldest message at the highest priority.
Messages have boundaries. Each mq_send() becomes exactly one mq_receive(). No parsing byte streams, no framing protocols, no "did I get the whole message?" checks.
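A minimal round trip showing both properties -- the queue name /demo and the sizes are illustrative, and older glibc needs linking with -lrt. Two sends at different priorities; the receive returns the urgent message first, in one whole piece:

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>

int main(void) {
    /* Request 10 message slots of 128 bytes each (subject to /proc limits). */
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 128 };

    /* "/demo" is an illustrative name; POSIX MQ names must start with '/'. */
    mqd_t mq = mq_open("/demo", O_CREAT | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    mq_send(mq, "routine metric", 15, 0);   /* priority 0: lowest            */
    mq_send(mq, "CRITICAL alert", 15, 31);  /* 31: highest portable priority */

    char buf[128];           /* receive buffer must be >= mq_msgsize */
    unsigned int prio;
    ssize_t n = mq_receive(mq, buf, sizeof buf, &prio);
    if (n >= 0)
        printf("got [prio %u]: %s\n", prio, buf); /* alert first, despite send order */

    mq_close(mq);
    mq_unlink("/demo");      /* without this the queue outlives the process */
    return 0;
}
```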
System V message queues (msgget, msgsnd, msgrcv) predate POSIX and use a different model. Messages have a long type field instead of a priority. The receiver can request messages of a specific type, any type up to a value, or just the first available. This is flexible for multiplexing, but the API is clunky and the identifiers don't work with epoll.
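A comparable sketch for the System V API, using IPC_PRIVATE so the demo sidesteps the global-key collision problem. Note how the receiver selects by type rather than getting the highest priority:

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* SysV messages start with a mandatory long type field. */
struct msgbuf { long mtype; char mtext[64]; };

int main(void) {
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1) { perror("msgget"); return 1; }

    struct msgbuf m1 = { .mtype = 1 }; strcpy(m1.mtext, "log line");
    struct msgbuf m2 = { .mtype = 2 }; strcpy(m2.mtext, "alert");
    msgsnd(qid, &m1, sizeof m1.mtext, 0);   /* msgsz excludes the mtype field */
    msgsnd(qid, &m2, sizeof m2.mtext, 0);

    struct msgbuf out;
    /* msgtyp = 2: skip the type-1 message and pull the alert first. */
    msgrcv(qid, &out, sizeof out.mtext, 2, 0);
    printf("received type %ld: %s\n", out.mtype, out.mtext);

    msgctl(qid, IPC_RMID, NULL);            /* SysV queues persist unless removed */
    return 0;
}
```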
Both support blocking and non-blocking modes. Both persist until explicitly removed.
Under the Hood
The killer feature: POSIX MQ descriptors are file descriptors. On Linux, mqd_t is just an int fd. This means POSIX message queues work with select(), poll(), and epoll(). MQ events can be multiplexed with socket I/O in a single event loop. The fd becomes readable when messages are available and writable when there's space.
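A sketch of that integration, assuming an illustrative queue name /events: the descriptor goes straight into epoll alongside any sockets.

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <sys/epoll.h>

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 128 };
    mqd_t mq = mq_open("/events", O_CREAT | O_RDONLY | O_NONBLOCK, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = mq };
    /* On Linux mqd_t is a plain fd, so it registers like any socket would. */
    if (epoll_ctl(ep, EPOLL_CTL_ADD, mq, &ev) == -1) { perror("epoll_ctl"); return 1; }

    struct epoll_event ready;
    int n = epoll_wait(ep, &ready, 1, 5000);   /* wait up to 5 s */
    if (n > 0) {
        char buf[128];
        unsigned int prio;
        ssize_t len = mq_receive(mq, buf, sizeof buf, &prio);
        if (len >= 0)
            printf("[prio %u] %.*s\n", prio, (int)len, buf);
    }
    return 0;
}
```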
System V MQs return integer identifiers that are not file descriptors. No epoll support. No integration with event loops. This alone makes POSIX MQs the right choice for any event-driven system.
mq_notify() is one-shot and subtle. A process can register for async notification when a message arrives on a previously empty queue. The notification is delivered either as a signal (SIGEV_SIGNAL) or as a callback run on a new thread (SIGEV_THREAD). But the notification fires exactly once. Re-registration is required after each delivery. And it only fires when the queue transitions from empty to non-empty -- if messages are already queued, no notification.
Internally, glibc implements SIGEV_THREAD with a helper thread that blocks reading from a raw netlink socket; when a message arrives on the queue, the kernel writes a notification to that socket and the helper invokes the registered callback.
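Here is a sketch of the re-arm pattern the mq_notify(3) man page recommends: re-register first, then drain with non-blocking reads. The queue name /notify-demo is illustrative.

```c
#include <fcntl.h>
#include <mqueue.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static mqd_t mq;
static void on_message(union sigval sv);

static void arm(void) {
    struct sigevent sev;
    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = on_message;
    mq_notify(mq, &sev);    /* one-shot: must be repeated after each delivery */
}

static void on_message(union sigval sv) {
    (void)sv;
    arm();  /* re-arm BEFORE draining; a message landing in the gap would
               otherwise leave the queue non-empty with no notification set */
    char buf[128];
    unsigned int prio;
    ssize_t n;
    while ((n = mq_receive(mq, buf, sizeof buf, &prio)) >= 0)  /* O_NONBLOCK ends loop */
        printf("[prio %u] %.*s\n", prio, (int)n, buf);
}

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 128 };
    mq = mq_open("/notify-demo", O_CREAT | O_RDONLY | O_NONBLOCK, 0600, &attr);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }
    arm();
    for (;;) pause();   /* callbacks run on a glibc-managed thread */
}
```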
Resource limits are tighter than expected. Default POSIX MQ limits: msg_max=10 messages, msgsize_max=8192 bytes, queues_max=256. A queue with 10 slots fills instantly under load. Always check and tune /proc/sys/fs/mqueue/ for production use. RLIMIT_MSGQUEUE caps total bytes across all queues owned by a user (default 819200 bytes).
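A quick way to read that per-user budget from C (RLIMIT_MSGQUEUE is Linux-specific):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* RLIMIT_MSGQUEUE caps the bytes a user may commit to POSIX message
       queues, summed across all of their queues. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_MSGQUEUE, &rl) == -1) { perror("getrlimit"); return 1; }
    printf("POSIX MQ byte budget: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}
```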
The persistence trap. Both POSIX and System V queues persist after the creating process exits. mq_close() closes the descriptor but the queue stays on /dev/mqueue with all its messages. mq_unlink() removes the name. System V queues persist until msgctl(IPC_RMID) or reboot. If a process crashes without cleanup, it leaks kernel memory. Monitor /dev/mqueue and ipcs -q regularly in production.
IPC namespace isolation. Both MQ types are isolated by IPC namespaces (CLONE_NEWIPC). Docker and Kubernetes containers get separate namespaces, so queues don't leak between containers. systemd's PrivateIPC= directive gives individual services their own IPC namespace.
Common Questions
When to use MQs vs pipes vs sockets?
Pipes: simple parent-child byte streams. Message queues: message boundaries, priority ordering, or unrelated process communication without the overhead of setting up sockets. Unix domain sockets: bidirectional, connection-oriented communication with fd passing. TCP sockets: when network transparency is needed.
What happens to a POSIX MQ when all processes close it but nobody unlinks it?
It sits on /dev/mqueue consuming kernel memory indefinitely. All messages remain intact. It stays there until mq_unlink() or reboot. Best practice: the creating process should unlink in an atexit handler, and ops should monitor /dev/mqueue for orphans.
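A sketch of that pattern, with an illustrative queue name; note that atexit() only covers orderly exits, which is exactly why ops monitoring still matters:

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdlib.h>

#define QUEUE_NAME "/worker-inbox"   /* illustrative name */

static void cleanup(void) {
    mq_unlink(QUEUE_NAME);   /* removes the name; storage is freed once every fd closes */
}

int main(void) {
    mqd_t mq = mq_open(QUEUE_NAME, O_CREAT | O_RDWR, 0600, NULL);
    if (mq == (mqd_t)-1) return 1;
    atexit(cleanup);   /* covers normal exits; a SIGKILL or crash still leaks the queue */

    /* ... main work loop ... */

    mq_close(mq);
    return 0;          /* cleanup() runs here via atexit */
}
```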
How does mq_notify(SIGEV_THREAD) work internally?
The kernel doesn't create threads directly. glibc creates a helper thread that blocks reading from a raw netlink socket; when the kernel writes the notification to that socket, the helper calls the registered callback. This is a glibc implementation detail, not something the standard mandates.
Can POSIX MQs be used across containers?
Only if they share the same IPC namespace. POSIX MQs are isolated by IPC namespaces (CLONE_NEWIPC), not mount namespaces. Two containers with different IPC namespaces can't see each other's queues, even if they both mount /dev/mqueue.
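A sketch that demonstrates the isolation directly, using unshare(CLONE_NEWIPC); it needs CAP_SYS_ADMIN (e.g. run as root), and the queue name is illustrative:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <mqueue.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    /* Create a queue in the current IPC namespace. */
    mq_open("/ns-demo", O_CREAT | O_RDWR, 0600, NULL);

    /* Enter a fresh IPC namespace. */
    if (unshare(CLONE_NEWIPC) == -1) { perror("unshare"); return 1; }

    /* The same name no longer resolves: each namespace has its own queue set. */
    if (mq_open("/ns-demo", O_RDONLY) == (mqd_t)-1)
        perror("mq_open after unshare");   /* expected: No such file or directory */

    /* Note: /ns-demo still exists in the original namespace -- unlink it there. */
    return 0;
}
```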
How Technologies Use This
Dozens of PostgreSQL backend processes need to coordinate access to shared buffer pools and lock tables. Funneling every shared buffer access through socket round-trips would add per-lookup latency that destroys query performance. Running two PostgreSQL instances on the same host risks collisions on System V IPC keys, causing mysterious startup failures.
The postmaster creates System V shared memory segments via shmget() and semaphore arrays via semget() at startup, and every forked backend inherits direct access to these IPC resources. This eliminates the need for socket communication entirely. The IPC key collision risk arises because System V IPC uses global integer keys, so two instances choosing the same key would corrupt each other's shared memory.
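A sketch of the collision mechanics -- the key value here is illustrative, and PostgreSQL derives its real keys differently:

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    key_t key = 5432001;   /* illustrative: a key derived from a port-style number */

    /* First "instance" creates a segment under the global key. */
    int a = shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600);
    printf("instance A: shmid=%d\n", a);

    /* A second "instance" picking the same key cannot create its own segment.
       IPC_EXCL turns the silent collision into an explicit EEXIST error. */
    if (shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600) == -1)
        perror("instance B shmget");

    shmctl(a, IPC_RMID, NULL);   /* remove the segment so it doesn't leak */
    return 0;
}
```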
Docker solves the collision problem by giving each container its own IPC namespace via CLONE_NEWIPC, so the SysV keys inside container A are completely invisible to container B. This is why PostgreSQL in Docker just works without manual IPC key configuration, even when running 10+ instances per host.
A service using POSIX message queues crashes without calling mq_unlink(). The queue persists on /dev/mqueue, consuming kernel memory indefinitely. Multiply this by dozens of services restarting after failures and the result is a slow memory leak that takes weeks to notice. Two services using the same queue name /commands also interfere with each other's messages.
POSIX and System V IPC resources are kernel-persistent by design. They survive process exit and only disappear on explicit removal or reboot. Without namespace isolation, all services on a host share one IPC keyspace, so crashed services leak resources and name collisions cause silent cross-service message corruption.
systemd 248+ ships the PrivateIPC= directive, which gives each service its own IPC namespace via CLONE_NEWIPC. When the service stops or crashes, the namespace is destroyed and all POSIX and SysV IPC resources inside it are automatically cleaned up. Name collisions are also eliminated: two services can both use /commands without interfering, since each sees only its own isolated IPC namespace with its own /dev/mqueue mount.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Message passing | N/A (use network between containers) | JMS / BlockingQueue (in-process) | process.send() for child IPC | channels (in-process), net for IPC | N/A (use gRPC/NATS between pods) |
| Priority delivery | N/A | PriorityBlockingQueue | N/A (manual sorting) | heap-based priority queue | PriorityClass (pod scheduling, not messaging) |
| IPC namespace isolation | --ipc=private (default) | N/A | N/A | N/A | containers in a pod share one IPC namespace; hostIPC opts into the host's |
| Persistence after crash | MQ persists until container namespace destroyed | In-process queues lost on JVM exit | Lost on process exit | Lost on process exit | Persistent volumes for durable messaging |
| Async notification | mq_notify (kernel-level) | CompletableFuture / Observer | EventEmitter | select on channel | Watch API for K8s resources |
Stack Layer Mapping
| Layer | Message Queue Mechanism |
|---|---|
| Hardware | N/A (MQs are pure kernel abstractions) |
| Kernel | mqueue filesystem (POSIX) or msg_queue linked list (SysV) |
| IPC namespace | Isolates MQ keyspace per container/service |
| System calls | mq_open/mq_send/mq_receive (POSIX), msgget/msgsnd/msgrcv (SysV) |
| /dev/mqueue | Virtual filesystem exposing POSIX MQ state |
| Application | Opens queue by name, sends/receives with priority |
Design Rationale
System V MQs return integer IDs that cannot integrate with event loops -- a dealbreaker for event-driven code. POSIX MQs fixed this by returning real file descriptors that work with select, poll, and epoll. The rb-tree priority structure keeps both insertion and retrieval of the most urgent message at O(log n). Kernel persistence was intentional: queues survive crashes so messages are not lost. The tradeoff is a cleanup burden pushed to application code via mq_unlink(), and crashed processes that skip it leak kernel memory indefinitely.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| mq_send blocks or returns EAGAIN | Queue full (default msg_max=10) | cat /proc/sys/fs/mqueue/msg_max and increase |
| mq_open fails with ENOSPC | Too many queues (default queues_max=256) | cat /proc/sys/fs/mqueue/queues_max and count active queues |
| Queue persists after process exit | mq_close called but mq_unlink missing | ls /dev/mqueue/ for orphaned queues |
| mq_notify callback never fires | Queue was not empty when notification registered | Drain queue first, then register mq_notify |
| Two services corrupting each other's messages | Shared IPC namespace with same queue name | ls -la /dev/mqueue/ and add IPC namespace isolation |
| RLIMIT_MSGQUEUE exceeded | Too many queues or large messages per user | ulimit -q and clean up leaked queues |
When to Use / Avoid
Use when:
- Delivering priority-ordered messages between processes on the same host
- Building event systems where urgent alerts must jump ahead of routine events
- Integrating message delivery into an epoll-based event loop (POSIX MQs only)
- Communicating between unrelated processes that cannot share pipes
Avoid when:
- Communication needs to cross machine boundaries (use TCP/Unix sockets instead)
- Throughput exceeds kernel MQ limits (use shared memory ring buffers or user-space message brokers)
- Simple parent-child byte streaming suffices (use pipes)
- The application is event-driven and needs bidirectional communication (use Unix domain sockets)
Try It Yourself
# Mount the mqueue filesystem (usually auto-mounted)
mount -t mqueue none /dev/mqueue 2>/dev/null; ls -la /dev/mqueue/

# View POSIX MQ limits
cat /proc/sys/fs/mqueue/msg_max && cat /proc/sys/fs/mqueue/msgsize_max && cat /proc/sys/fs/mqueue/queues_max

# List System V message queues
ipcs -q

# Remove ALL System V message queues by ID (careful on shared hosts)
ipcs -q | awk 'NR>3 && $2 ~ /[0-9]/ {print $2}' | xargs -I{} ipcrm -q {}

# Inspect a POSIX MQ's current state
cat /dev/mqueue/myqueue 2>/dev/null || echo 'No queue named myqueue'

# Check mqueue resource usage
find /dev/mqueue -type f | wc -l && echo 'queues active'

Debug Checklist
1. ls -la /dev/mqueue/ -- list active POSIX message queues
2. cat /dev/mqueue/<name> -- show queue state (current/max messages, size)
3. ipcs -q -- list System V message queues
4. cat /proc/sys/fs/mqueue/msg_max -- check POSIX MQ message count limit
5. cat /proc/sys/fs/mqueue/msgsize_max -- check POSIX MQ message size limit
6. ipcrm -q <msqid> -- remove a leaked System V message queue
Key Takeaways
- ✓ POSIX MQs deliver by priority: mq_receive() always returns the highest-priority message first (0 to MQ_PRIO_MAX-1, at least 32 levels). System V MQs have a 'type' field for selective retrieval but no strict priority ordering.
- ✓ mq_notify() fires once when a message arrives on an empty queue -- then you must re-register. It won't fire if the queue already has messages. Only one process can be registered per queue. This design prevents thundering herds but demands careful coding.
- ✓ On Linux, POSIX MQ descriptors are real file descriptors. They work with select(), poll(), and epoll(). You can multiplex MQ events with socket I/O in one event loop. System V MQs return integer IDs that don't work with epoll -- a dealbreaker for event-driven code.
- ✓ Default POSIX MQ limits are surprisingly low: msg_max=10, msgsize_max=8192, queues_max=256. Production systems almost always need to tune /proc/sys/fs/mqueue/ values.
- ✓ Both MQ types persist until explicitly removed or reboot. POSIX MQs survive mq_close() -- you must call mq_unlink(). System V MQs survive until msgctl(IPC_RMID). Crashed processes leak queues. This is a real operational hazard.
Common Pitfalls
- ✗ Mistake: calling mq_close() and assuming the queue is gone. Reality: mq_close() closes the descriptor but the queue persists on /dev/mqueue. You must call mq_unlink() to actually remove it. Leaked queues eat kernel memory until reboot.
- ✗ Mistake: not tuning msg_max. Reality: with the default of 10, mq_send() blocks (or returns EAGAIN) almost immediately under any real load. Always check and raise /proc/sys/fs/mqueue/msg_max.
- ✗ Mistake: using System V MQs in new code. Reality: SysV IPC uses integer keys (not fds), doesn't work with epoll, has awkward permissions, and the API is inconsistent. POSIX MQs are superior in every way. SysV exists only for legacy compatibility.
- ✗ Mistake: expecting mq_notify to re-arm automatically. Reality: notification fires once. If messages arrive between the notification callback and your re-registration call, they're silently available but no new notification fires.
Reference
In One Line
POSIX MQs over System V -- they work with epoll and deliver by priority -- but always mq_unlink() on shutdown or leaked queues eat kernel memory forever.