Signals & Signal Handling
Mental Model
Someone taps a shoulder while a person is mid-sentence at a desk. The sentence stops, an urgent note gets read, and then the original sentence picks up exactly where it left off. The danger: if the note says "go pour coffee" but the interrupted task was already holding the coffee pot, both hands reach for the same pot and everything freezes. Safe rule: glance at the note, scribble a reminder on a sticky pad, and deal with it properly once the current sentence is done.
The Problem
docker stop sends SIGTERM to PID 1, but bash is sitting there as PID 1 and silently ignores it. The app never hears about the shutdown. Ten seconds later, SIGKILL destroys everything -- in-flight data gone, transactions left half-written. Elsewhere, five child processes exit before the parent handles SIGCHLD, but standard signals collapse into a single pending bit. Four zombies pile up silently, leaking PIDs toward the 32,768 limit. And then the subtle one: a SIGTERM handler calls printf(), which grabs an internal lock, but the signal interrupted printf() mid-operation in the main thread. Deadlock. It reproduces once per 100,000 shutdowns and takes weeks to track down.
Architecture
A server is running. It needs to reload its config. There is no socket to connect to. No file it polls. The only option is to tap it on the shoulder and say "hey, re-read the configuration."
That tap is a signal. It is the oldest form of inter-process communication in Unix, and it is still everywhere. Nginx uses SIGHUP for config reload. Redis uses SIGTERM for graceful shutdown. Go uses SIGURG internally to preempt goroutines. The JVM turns SIGSEGV into NullPointerException.
But here is the catch. Signals are asynchronous. They can interrupt code at any point -- mid-syscall, mid-malloc, mid-printf. If a signal handler calls the wrong function at the wrong time, the result is corruption or deadlock. And the bug only reproduces when the signal arrives at exactly the wrong instruction. These are some of the hardest bugs in systems programming.
What Actually Happens
Signal delivery is a two-phase process.
Phase 1: Sending. When a signal is sent -- via kill(), tgkill(), or a kernel event like SIGSEGV -- the kernel adds the signal to the target's pending set and sets the TIF_SIGPENDING flag on the target thread. The signal is NOT delivered yet. It sits in the pending queue.
Phase 2: Delivery. The signal is delivered when the target thread transitions from kernel mode to user mode (returning from a syscall, interrupt, or page fault). At that transition point, do_signal() checks for pending unblocked signals and delivers them.
Delivery means the kernel manipulates the thread's saved register state:
- Push a signal frame onto the thread's stack (or onto an alternate stack registered via sigaltstack()). The frame contains the saved registers, the signal number, an optional siginfo_t, and a hidden return address pointing to a sigreturn() trampoline.
- Set the instruction pointer to the handler function.
- The handler runs. When it returns, sigreturn() restores the original registers.
- Execution resumes at the exact instruction that was interrupted.
The signal mask (sigprocmask()) controls which signals are blocked. Blocked signals stay pending until unblocked. During handler execution, the kernel automatically blocks the signal being handled (preventing recursive delivery) plus any extra signals specified in sa_mask.
Under the Hood
Standard vs real-time signals. Signals 1-31 are standard signals (SIGTERM, SIGCHLD, etc.) and are NOT queued: two pending SIGCHLDs collapse into one pending bit. Signals 32-64 are real-time signals (SIGRTMIN to SIGRTMAX) and ARE queued, with delivery in priority order (lowest number first). Real-time signals also carry data via siginfo_t.si_value. Note that glibc reserves the first few real-time numbers for its threading implementation, so SIGRTMIN is typically 34 -- always write SIGRTMIN+n rather than a raw number.
This has a practical consequence. If five child processes exit before the parent handles SIGCHLD, only one SIGCHLD may be delivered. The handler must call waitpid(-1, &status, WNOHANG) in a loop to reap ALL children. Handling exactly one child per SIGCHLD is a classic bug.
Async-signal-safety. When a signal interrupts a function, the handler runs in the same thread context. If the interrupted function holds an internal lock (like malloc's arena lock), calling that function from the handler deadlocks. The POSIX-guaranteed safe functions include: _exit, write, read, open, close, signal, kill, fork, exec*, waitpid, sem_post, and a few others. Notably absent: printf, malloc, free, syslog, pthread_mutex_lock.
The practical rule: do almost nothing in the handler. Set a volatile sig_atomic_t flag and return. Do the real work in the main loop.
signalfd() and the self-pipe trick. Event-driven servers cannot use traditional signal handlers because they disrupt the event loop. The classic solution is the "self-pipe trick": the handler writes a byte to a pipe that is in the epoll set. Linux's signalfd() is cleaner -- block the signal with sigprocmask(), then read signalfd_siginfo structs from a file descriptor in the event loop. No handler, no async-safety concerns. This is what systemd and many modern daemons use.
Process-directed vs thread-directed. kill() sends a signal to the process (thread group). The kernel picks an arbitrary thread that does not have the signal blocked. tgkill() / pthread_kill() targets a specific thread. Synchronous signals (SIGSEGV, SIGFPE, SIGBUS) are always delivered to the faulting thread -- anything else would not make sense.
Common Questions
How does SIGCHLD handling interact with waitpid() in a concurrent server?
Since standard signals are not queued, a single SIGCHLD may represent multiple child exits. The handler (or main loop) must call waitpid(-1, &status, WNOHANG) in a loop until it returns 0 or -1. Calling waitpid() exactly once per SIGCHLD is the classic mistake -- children get missed and zombies accumulate.
What happens when a signal hits a process blocked in a slow syscall?
If the signal has a handler: with SA_RESTART, the kernel automatically restarts the syscall after the handler returns (for most blocking calls). Without SA_RESTART, read() returns -1 with errno=EINTR. But some syscalls are NEVER restarted regardless of SA_RESTART -- connect(), poll(), nanosleep(), sem_wait() always return EINTR. If the signal's disposition is SIG_IGN or default-ignore, the syscall is not interrupted at all.
Why does SIGSEGV sometimes cause a stack overflow instead of a clean crash?
If a SIGSEGV handler itself accesses invalid memory, another SIGSEGV fires. If the handler runs on the main stack (no sigaltstack()) and SIGSEGV is not blocked during its own delivery, recursive faults exhaust the stack. The kernel detects this spiral and kills the process. Production crash handlers (like Google Breakpad) use sigaltstack() with a dedicated 64KB alternate stack to avoid this.
Can a process in D (uninterruptible sleep) state be killed?
No. Not even SIGKILL works. The D state means the process is waiting for I/O completion (typically disk or NFS) and the kernel cannot safely interrupt it without corrupting data structures. That is why hung NFS mounts and broken FUSE filesystems create unkillable processes. Kernel 2.6.25 introduced TASK_KILLABLE (a D-like state that responds to fatal signals), and many code paths have since been converted to use it, but not all.
How Technologies Use This
Running docker stop on a container should trigger a clean shutdown. Ten seconds later, Docker force-kills it with SIGKILL. Data in flight is lost, connections are severed, and database transactions are left inconsistent. The app never received the SIGTERM at all.
If the Dockerfile uses CMD ["bash", "-c", "python app.py"], bash runs as PID 1 inside the container. Two things go wrong. First, the kernel silently discards any signal sent to PID 1 whose disposition is the default action -- and bash installs no SIGTERM handler -- so the SIGTERM from Docker is dropped outright. Second, even when bash does receive a signal, it does not forward it to the foreground child it is waiting on. Either way, the app never knows about the shutdown. After the 10-second grace period, SIGKILL kills everything uncleanly.
Fix: use exec-form CMD ["python", "app.py"] so the app runs as PID 1 and receives SIGTERM directly. Or use Docker's --init flag, which injects tini as PID 1 -- tini forwards signals to all children, waits for them to exit, and then exits itself. Either approach enables graceful shutdown.
A production Nginx instance handles 20,000 active connections. The operator needs to rotate logs, reload config, and eventually shut down -- all without dropping a single request. There is no management socket, no REST API, no control plane. Just a running daemon process.
Signals provide the entire control interface. SIGHUP tells the master to re-read nginx.conf, fork new workers, and drain old ones. SIGUSR1 reopens all log files so logrotate works -- without it, Nginx keeps writing to the deleted old log file since it holds an open fd. SIGQUIT triggers graceful shutdown where workers finish in-flight requests (30+ seconds for streaming responses) before exiting.
Three signals control the entire lifecycle: SIGHUP for reload, SIGUSR1 for log rotation, SIGQUIT for graceful stop. Each triggers a completely different code path in the master process, and the result is zero-downtime config changes and log rotation with no management overhead.
A production PostgreSQL server needs to shut down, but a long-running analytics query has been active for 3 hours. A gentle stop waits for it to finish -- that could be hours more. A hard stop kills everything and requires WAL recovery on restart. Something in between is needed.
PostgreSQL maps three signals to three shutdown modes. SIGTERM is smart shutdown: stop accepting new connections but wait for existing clients to disconnect on their own. SIGINT is fast shutdown: forcibly disconnect all clients, roll back in-progress transactions, and exit cleanly with a checkpoint -- no recovery needed on restart. SIGQUIT is immediate: kill all backends without checkpoint, requiring WAL replay (30 seconds to several minutes) on next startup.
Choose SIGTERM for planned maintenance with flexible timing, SIGINT for urgent restarts requiring a clean state, and SIGQUIT only for emergencies. The postmaster also relies on SIGCHLD with a waitpid(-1, WNOHANG) loop to reap crashed backends, since standard signals do not queue.
Redis is single-threaded. If the SIGTERM handler tries to save the dataset before exiting, it blocks the event loop for 5-30 seconds. Every client stalls. If the handler calls malloc or writes to the log, it risks deadlock or corruption when the signal interrupts those same functions mid-operation.
Redis uses the flag-and-check pattern. The SIGTERM handler sets a single variable (server.shutdown_asap = 1) and returns immediately. No file I/O, no malloc, no logging -- fully async-signal-safe. The main event loop checks this flag on every iteration and performs the actual RDB/AOF save safely outside signal context.
For crashes, Redis installs a SIGSEGV/SIGBUS handler that uses only write() to stderr (async-signal-safe) to dump a stack trace, memory stats, client info, and the last few commands processed. This crash report is often enough to diagnose the bug without a multi-gigabyte core dump, saving hours of debugging time.
Before Go 1.14, a goroutine in a tight CPU loop with no function calls could never be preempted. GC pauses stretched from milliseconds to seconds as the runtime waited for the rogue goroutine to yield. The scheduler was powerless to interrupt it.
Go now uses SIGURG for asynchronous preemption. When the garbage collector or scheduler needs to stop a goroutine, it sends SIGURG to the OS thread running it. The signal handler saves the goroutine's register state, inserts an async preemption point, and allows the scheduler to switch immediately. SIGSEGV and SIGBUS are converted into Go panics with full stack traces, so nil pointer dereferences produce readable errors instead of cryptic core dumps.
For application-level signals, signal.Notify() delivers them to a channel, converting async interrupts into synchronous channel reads that fit Go's concurrency model. This lets a server drain 10,000 in-flight requests on SIGTERM using standard select/channel patterns instead of unsafe signal handlers.
Java code can hit a null dereference at almost any object access on the unhappy path. Adding an explicit if-null check before every pointer access would cost 5-15% of CPU on pointer-heavy code. Yet NullPointerException appears with a clean stack trace and no performance penalty on the happy path.
The JVM registers a SIGSEGV handler at startup. When a null dereference triggers a hardware fault at address 0x0, the handler inspects the faulting address, recognizes the pattern as a managed-object null access, constructs a NullPointerException, and unwinds through the exception framework. SIGFPE becomes ArithmeticException for division by zero. The hardware does the null check for free -- no branch needed.
SIGQUIT (kill -3 or Ctrl+backslash) dumps all thread stacks and lock states to stderr -- invaluable for diagnosing production deadlocks without attaching a debugger. SIGTERM and SIGINT trigger Java's shutdown hooks, giving applications a window to close database connections, flush buffers, and deregister from service discovery before exit.
Kubernetes sends SIGTERM to a Node.js server handling 5,000 in-flight HTTP requests. Without a signal handler, Node exits immediately. All 5,000 clients receive connection-reset errors, triggering retries that cascade into upstream services and amplify the outage.
Signals are asynchronous but Node's event loop is single-threaded. libuv bridges this gap with the self-pipe trick: the signal handler writes a byte to an internal pipe polled by the event loop, converting the async interrupt into a safe callback on the next iteration. No async-signal-safety concerns in user code -- the handler runs as a normal event loop callback.
Use process.on('SIGTERM', handler) to stop accepting new connections, wait for in-flight requests to complete (5-30 seconds with a timeout), close database pools, then exit cleanly. Kubernetes gives pods 30 seconds (terminationGracePeriodSeconds) between SIGTERM and SIGKILL specifically to allow this drain.
Stopping a service that has forked children and spawned background helpers. The main process exits cleanly, but orphaned children keep running -- holding ports, consuming memory, and writing to log files. Sending SIGTERM to just the main PID missed everything else.
systemd sends SIGTERM to every process in the service's cgroup (KillMode=control-group), not just the main PID. This reaches workers, helpers, and grandchildren that setsid() and double-fork cannot hide. After SIGTERM, systemd waits TimeoutStopSec (default 90 seconds) for graceful exit. If any process survives, SIGKILL finishes it off.
Internally, systemd avoids traditional signal handlers entirely. It uses signalfd() to receive signals as readable events on a file descriptor, integrated into its epoll-based event loop. This sidesteps every async-signal-safety concern and lets systemd process SIGCHLD, SIGTERM, and SIGHUP in the same unified dispatch that handles socket activation, timer events, and D-Bus messages.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Graceful shutdown | docker stop sends SIGTERM to PID 1; use exec-form CMD or --init (tini) | Runtime.addShutdownHook runs on SIGTERM/SIGINT | process.on('SIGTERM') to stop accepting, drain, exit | signal.Notify(ch, syscall.SIGTERM) to channel-based drain | terminationGracePeriodSeconds (default 30s) between SIGTERM and SIGKILL |
| Config reload | docker kill -s HUP $container sends SIGHUP to PID 1 | No standard signal for reload; use JMX or REST endpoint | process.on('SIGHUP') for custom reload logic | signal.Notify for SIGHUP; viper.WatchConfig() for file-based | kubectl rollout restart for pod-level; SIGHUP for in-pod daemons |
| Crash handling | Container restart policy handles crashes; tini reaps zombies | SIGSEGV becomes NullPointerException; SIGQUIT dumps all thread stacks | uncaughtException and unhandledRejection handlers | SIGSEGV/SIGBUS become Go panics with stack traces | restartPolicy + liveness probes for crash recovery |
| Child reaping | PID 1 must reap zombies; tini or --init handles this | N/A -- JVM threads are not child processes | child_process.on('exit') must be handled to avoid zombies | cmd.Wait() must be called for every exec.Command | Container init (PID 1) must handle SIGCHLD |
| Internal signal use | N/A | SIGSEGV for NullPointerException; SIGFPE for ArithmeticException | libuv self-pipe trick converts signals to event loop callbacks | SIGURG for goroutine preemption (Go 1.14+) | N/A |
| Stack Layer | Mechanism |
|---|---|
| Application | sigaction() installs handlers; signalfd() for event-loop integration |
| Language runtime | Go: SIGURG for preemption; JVM: SIGSEGV for NPE; Node: libuv self-pipe trick |
| Kernel signal subsystem | Pending queue (sigpending), delivery on kernel-to-user transition, TIF_SIGPENDING flag |
| Process lifecycle | SIGCHLD on child exit; SIGHUP on terminal hangup; SIGKILL/SIGSTOP uncatchable |
| Hardware | SIGSEGV from MMU page fault; SIGFPE from FPU exception; SIGBUS from alignment fault |
Design rationale: Signals solve a narrow problem -- notifying a process asynchronously without any pre-established communication channel -- and they solve it with almost no overhead. But running code in an interrupted context is inherently dangerous, since the handler shares locks and state with whatever it just interrupted. signalfd() and the self-pipe trick both exist to convert that dangerous async model into the synchronous event-loop model where everything is safe to call.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| docker stop takes exactly 10 seconds then force-kills | PID 1 (shell) not forwarding SIGTERM to app | Check Dockerfile CMD form; use exec-form or --init |
| Zombie processes accumulating (Z state in ps) | SIGCHLD handler not looping waitpid() with WNOHANG | grep SigCgt /proc/$PARENT_PID/status to verify SIGCHLD is caught |
| Intermittent deadlock on graceful shutdown | Signal handler calls non-async-signal-safe function (printf, malloc) | strace the hang; check handler code for unsafe calls |
| Process ignores SIGKILL (stuck in D state) | Uninterruptible sleep waiting for I/O (NFS, FUSE, disk) | cat /proc/$PID/wchan to see what kernel function it is blocked in |
| Slow syscalls return EINTR unexpectedly | SA_RESTART not set on sigaction, or non-restartable syscall interrupted | Check sigaction flags; wrap non-restartable calls in EINTR retry loop |
| Only one child reaped despite multiple exits | Standard SIGCHLD not queued; handler calls waitpid once | Change handler to waitpid(-1, &status, WNOHANG) in a while loop |
When to Use / Avoid
- Use when implementing graceful shutdown -- catch SIGTERM to drain connections and flush buffers before exit
- Use when building daemon lifecycle control -- SIGHUP for config reload, SIGUSR1 for log rotation
- Use when reaping child processes -- SIGCHLD handler with waitpid() loop prevents zombie accumulation
- Use signalfd() in event-driven servers to avoid async-signal-safety concerns entirely
- Avoid calling non-async-signal-safe functions (printf, malloc, syslog) inside signal handlers
- Avoid relying on signal queuing for standard signals (1-31) -- use real-time signals if delivery guarantee is needed
Try It Yourself
# List all signals with numbers
kill -l

# Send SIGUSR1 to a process
kill -USR1 $(pidof nginx)

# View signal masks for a process (pending, blocked, ignored, caught)
grep Sig /proc/$$/status

# Decode signal mask hex to binary (e.g., SigCgt)
python3 -c "mask=0x$(grep SigCgt /proc/$$/status | awk '{print $2}'); [print(f' Signal {i}: caught') for i in range(1,65) if mask & (1<<(i-1))]"

# Trace signal delivery to a running process
strace -p $(pidof sleep) -e trace=signal 2>&1

# Send signal to all processes in a process group
kill -TERM -$(ps -o pgid= -p $PID | tr -d ' ')

Debug Checklist
1. grep Sig /proc/$PID/status -- view pending, blocked, ignored, and caught signal bitmasks
2. strace -e trace=signal -p $PID -- trace signal delivery and mask changes in real time
3. kill -l -- list all signal names and numbers
4. kill -0 $PID -- test if a process exists without sending a signal
5. cat /proc/$PID/status | grep -E 'SigPnd|SigBlk' -- check for stuck pending or over-blocked signals
6. strace -e trace=rt_sigaction -p $PID 2>&1 | head -20 -- see which signals a process has handlers for
Key Takeaways
- ✓ Signal handlers literally hijack your thread's control flow. The kernel saves registers to a ucontext_t on the stack, redirects execution to the handler, and on return a hidden sigreturn() trampoline restores the original context. Your code resumes as if nothing happened.
- ✓ Only a small set of functions is async-signal-safe. printf(), malloc(), and mutex operations are NOT among them. In a handler, restrict yourself to setting a volatile sig_atomic_t flag, calling write() on a pipe, or sem_post(). Everything else risks deadlock or corruption.
- ✓ SA_RESTART makes some syscalls auto-restart after signal delivery. read(), write(), wait() are restartable. But connect(), poll(), sem_wait(), and nanosleep() are never restarted -- they always return EINTR. Memorize which ones restart and which do not.
- ✓ signalfd() turns signals into file descriptor events. Block the signal with sigprocmask(), then read signalfd_siginfo structs from the fd in your event loop. No async handler needed. This is the modern Linux way to handle signals in event-driven servers.
- ✓ SIGKILL and SIGSTOP cannot be caught, blocked, or ignored. Period. But here is the catch: a process in uninterruptible sleep (D state) will not respond even to SIGKILL until it leaves that state. That is why hung NFS mounts create unkillable processes.
Common Pitfalls
- ✗ Using signal() instead of sigaction(). Mistake: thinking they are equivalent. Reality: signal() has undefined behavior regarding handler reset (SA_RESETHAND semantics vary across systems) and does not let you control SA_RESTART. Always use sigaction().
- ✗ Calling printf, malloc, or syslog in a signal handler. This corrupts internal data structures when the handler interrupts those same functions mid-operation. The fix: set a flag in the handler, do the real work in the main loop.
- ✗ Not blocking signals during critical sections. If a handler fires between a check and an update of shared data, the handler sees inconsistent state. Use sigprocmask() to block signals around critical sections.
- ✗ Assuming signals are queued. They are not (for standard signals 1-31). If 5 children exit before the parent handles SIGCHLD, only one SIGCHLD may be delivered. The handler must call waitpid() in a WNOHANG loop to reap ALL exited children, not just one.
Reference
In One Line
Signal handlers should do almost nothing -- set a flag and get out, or skip handlers entirely with signalfd() -- and never forget that SIGCHLD does not queue, so waitpid() must loop.