Interrupt Handling & Softirqs
Mental Model
Mid-surgery, an emergency patient arrives. The surgeon cannot ignore it but also cannot abandon the open patient. Two seconds to stabilize the newcomer (top half), hand them off to a nurse for prep (bottom half), then back to the original operation. The nurse works through the slower tasks in the background. When emergencies arrive faster than the nurse can prep, the whole hospital falls behind.
The Problem
Six million packets per second on a 40 Gbps NIC, each one firing a hardware interrupt. Run every handler to completion and the CPU does nothing but answer doorbells -- the system is technically alive but frozen for applications. Worse, all interrupts land on CPU 0 because nobody configured IRQ affinity, so 15 other cores sit idle while CPU 0 pegs at 100% in softirq time.
Architecture
The CPU does not poll hardware for events. It gets forcibly interrupted.
A packet arrives on the NIC. A disk completes a DMA transfer. A key is pressed. In each case, the hardware yanks the CPU out of whatever it was doing -- mid-instruction if necessary -- and says "deal with this now."
That sounds fine for a keyboard. But a server receiving 10 million packets per second? If every packet triggers a full interrupt, the machine spends all its time answering the doorbell and none of it doing actual work.
What Actually Happens
When a device needs CPU attention, it signals via MSI-X -- writing a small message to a memory address that the CPU's Local APIC interprets as an interrupt vector.
The CPU immediately suspends its current execution. Saves registers. Looks up the vector in the IDT. Jumps to the registered handler.
This handler is the top-half (hardirq). It runs with the current interrupt line masked, acknowledges the hardware, grabs essential data (like a DMA completion status), and raises a softirq. The whole thing should take microseconds, not milliseconds.
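To make the division of labor concrete, here is a minimal sketch of a top-half for a hypothetical NIC driver. The mynic_* names and register offsets are invented for illustration; only the shape (ack, mask, defer) is the point:

```c
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/netdevice.h>

/* Hypothetical NIC driver state; names and register offsets invented. */
#define MYNIC_REG_ISR 0x00          /* interrupt status register */
#define MYNIC_REG_IER 0x04          /* interrupt enable register */

struct mynic_priv {
    void __iomem *regs;
    struct napi_struct napi;
};

/* Top-half: runs with the line masked; microseconds, not milliseconds. */
static irqreturn_t mynic_hardirq(int irq, void *dev_id)
{
    struct mynic_priv *priv = dev_id;
    u32 status = readl(priv->regs + MYNIC_REG_ISR);

    if (!status)
        return IRQ_NONE;                        /* shared line, not ours */

    writel(status, priv->regs + MYNIC_REG_ISR); /* ack the hardware */
    writel(0, priv->regs + MYNIC_REG_IER);      /* mask device interrupts */

    napi_schedule(&priv->napi);                 /* raises the NET_RX softirq */
    return IRQ_HANDLED;
}
```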
The real processing happens in the bottom-half. When the hardirq returns, the kernel checks the softirq pending bitmask and runs all pending handlers inline -- before returning to whatever user-space task was interrupted. There are 10 fixed softirq types: NET_RX, NET_TX, BLOCK, TIMER, SCHED, and others.
Here is the catch: if softirq processing takes too long (more than ~2ms, 10 restart rounds, or a pending reschedule), the kernel stops and defers the remaining work to ksoftirqd -- a per-CPU kernel thread that competes for CPU time like an ordinary task (older kernels niced it to 19). This prevents softirqs from starving applications.
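A simplified restatement of that budget logic, modeled on __do_softirq() in kernel/softirq.c. The two constants are the kernel's; the helper functions are invented stand-ins, so this is a sketch of the control flow, not the literal source:

```c
#include <linux/jiffies.h>
#include <linux/sched.h>

#define MAX_SOFTIRQ_TIME    msecs_to_jiffies(2)   /* ~2 ms budget */
#define MAX_SOFTIRQ_RESTART 10                    /* restart rounds */

static void softirq_budget_sketch(void)
{
    unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
    int restart = MAX_SOFTIRQ_RESTART;

    while (softirq_pending()) {           /* invented helper */
        run_pending_handlers();           /* invented helper */

        if (!softirq_pending())
            break;                        /* all done inline */
        if (time_after(jiffies, end) || !--restart || need_resched()) {
            wakeup_ksoftirqd();           /* hand the rest to ksoftirqd/N */
            break;
        }
    }
}
```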
For work that needs process context -- sleeping, taking mutexes, performing I/O -- the kernel provides workqueues. A driver calls schedule_work() to enqueue a work item that executes later in a kworker thread. Unlike softirqs, workqueues can sleep.
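A minimal workqueue sketch for a hypothetical driver with a deferred buffer-refill job; mydev and mydev_refill are invented names:

```c
#include <linux/workqueue.h>
#include <linux/mutex.h>

/* Hypothetical driver with a deferred buffer-refill job. */
struct mydev {
    struct work_struct refill_work;
    struct mutex lock;
};

/* Runs later in a kworker thread: process context, sleeping allowed. */
static void mydev_refill(struct work_struct *work)
{
    struct mydev *dev = container_of(work, struct mydev, refill_work);

    mutex_lock(&dev->lock);          /* would deadlock in softirq context */
    /* ... replenish RX buffers, talk to firmware, do I/O ... */
    mutex_unlock(&dev->lock);
}

/* Setup (probe):           INIT_WORK(&dev->refill_work, mydev_refill);
 * From interrupt context:  schedule_work(&dev->refill_work);          */
```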
Under the Hood
NAPI changes the game for networking. At millions of packets per second, the interrupt overhead alone can consume entire CPU cores. NAPI solves this by switching to a poll model: after the first packet interrupt, the driver disables further NIC interrupts and registers a poll function. The kernel's NET_RX softirq polls the device for batches of packets (up to 64) without interrupts. When the queue drains, interrupts are re-enabled. This is why ksoftirqd often shows high CPU on network-heavy boxes -- it is NAPI doing its job.
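A sketch of a NAPI poll callback, continuing the hypothetical mynic driver from above; mynic_rx_one() and mynic_irq_enable() are invented stand-ins for real receive and interrupt-enable code:

```c
#include <linux/netdevice.h>

/* mynic_rx_one() and mynic_irq_enable() are invented stand-ins. */
static int mynic_poll(struct napi_struct *napi, int budget)
{
    struct mynic_priv *priv = container_of(napi, struct mynic_priv, napi);
    int done = 0;

    while (done < budget && mynic_rx_one(priv))   /* pull one packet */
        done++;

    /* Queue drained before the budget ran out: leave poll mode and
     * turn device interrupts back on. */
    if (done < budget && napi_complete_done(napi, done))
        mynic_irq_enable(priv);

    return done;   /* done == budget tells NAPI to poll us again */
}
```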
Interrupt coalescing trades latency for throughput. NICs can be configured (ethtool -C eth0 rx-usecs 50) to delay interrupts and batch completions: waiting 50 microseconds before signaling groups on the order of 100 packets per interrupt at high rates. This cuts the interrupt rate 10-100x but adds latency. Adaptive coalescing (ethtool -C eth0 adaptive-rx on) adjusts dynamically based on load.
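Under the hood, ethtool -C is (on the classic path) a SIOCETHTOOL ioctl carrying a struct ethtool_coalesce; newer kernels also expose this over netlink. A minimal userspace sketch of the ioctl route:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Roughly what `ethtool -C <ifname> rx-usecs <usecs>` performs. */
int set_rx_coalesce(const char *ifname, unsigned int usecs)
{
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    snprintf(ifr.ifr_name, IFNAMSIZ, "%s", ifname);
    ifr.ifr_data = (void *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {   /* read current settings */
        close(fd);
        return -1;
    }

    ec.cmd = ETHTOOL_SCOALESCE;
    ec.rx_coalesce_usecs = usecs;             /* e.g. 50 */
    int rc = ioctl(fd, SIOCETHTOOL, &ifr);    /* write them back */
    close(fd);
    return rc;
}
```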
IRQ affinity and RSS. Multi-queue NICs have one IRQ per receive queue. RSS (Receive Side Scaling) hashes incoming packets to distribute flows across queues. Setting /proc/irq/N/smp_affinity_list pins each queue's IRQ to a specific CPU core. The goal: the application processing packets from queue N runs on the same CPU as IRQ N, keeping data in L1/L2 cache.
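A sketch of the co-location recipe from userspace: pin the worker thread with pthread_setaffinity_np() and point the queue's IRQ at the same CPU. The IRQ number would come from /proc/interrupts, and writing the proc file requires root:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread and a NIC queue's IRQ to the same CPU. */
int pin_thread_and_irq(int cpu, int irq)
{
    cpu_set_t set;
    char path[64];
    FILE *f;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;                       /* pin this worker thread */

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
    f = fopen(path, "w");                /* needs root */
    if (!f)
        return -1;
    fprintf(f, "%d\n", cpu);             /* pin the IRQ to the same CPU */
    fclose(f);
    return 0;
}
```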
Threaded IRQs are increasingly the norm. With request_threaded_irq(), the hardirq handler is minimal -- it just wakes a dedicated kernel thread that runs the real handler in process context -- and the threadirqs boot parameter forces this model for most handlers. This makes the heavy part of interrupt handling preemptible and improves latency predictability.
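A sketch of the threaded-IRQ pattern; the mydev_* helpers are invented:

```c
#include <linux/interrupt.h>

/* Quick check in hardirq context: is it ours, and quiet the device. */
static irqreturn_t mydev_quick_check(int irq, void *dev_id)
{
    if (!mydev_irq_is_ours(dev_id))
        return IRQ_NONE;
    mydev_mask_irq(dev_id);         /* silence device until the thread runs */
    return IRQ_WAKE_THREAD;         /* hand off to the handler thread */
}

/* Real handler in a dedicated kernel thread. */
static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
{
    mydev_process_events(dev_id);   /* process context: may sleep */
    mydev_unmask_irq(dev_id);
    return IRQ_HANDLED;
}

/* Registration, e.g. in probe():
 * err = request_threaded_irq(irq, mydev_quick_check, mydev_thread_fn,
 *                            IRQF_ONESHOT, "mydev", dev);
 */
```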
Common Questions
What happens if a hardirq handler takes too long?
That interrupt line stays masked, and other interrupts on the same CPU are delayed. If the handler runs for milliseconds, NIC ring buffers overflow (packet drops), keystrokes are lost, and timers drift. The kernel's hard-lockup detector fires if a CPU stays stuck with interrupts blocked for more than 10 seconds (the watchdog_thresh default). In RT kernels, handlers exceeding their budget generate latency traces.
How do softirqs differ from tasklets?
Softirqs are statically allocated (only 10 types) and can run concurrently on multiple CPUs -- the same handler can execute simultaneously on different cores. Tasklets are dynamically allocatable and serialized -- a given tasklet runs on only one CPU at a time. Since kernel 5.9, tasklets are deprecated for new code in favor of threaded IRQs and workqueues.
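For reference, the post-5.9 tasklet API looks roughly like this; mydev is an invented name, and new code should prefer the alternatives just mentioned:

```c
#include <linux/interrupt.h>

struct mydev {
    struct tasklet_struct tl;
};

static void mydev_tasklet_fn(struct tasklet_struct *t)
{
    struct mydev *dev = from_tasklet(dev, t, tl);

    /* Atomic context: no sleeping. This tasklet never runs on two
     * CPUs at once, unlike a softirq handler. */
    (void)dev;
}

/* Setup:        tasklet_setup(&dev->tl, mydev_tasklet_fn);
 * From hardirq: tasklet_schedule(&dev->tl); */
```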
Why does ksoftirqd sometimes cause latency problems?
When softirq work is deferred to ksoftirqd, it waits in the run queue like any other thread. If the system has CPU-bound tasks, ksoftirqd may not get scheduled promptly, adding milliseconds of delay to network packet processing. Mitigations: raise ksoftirqd's scheduling priority (chrt or renice), enable RPS to distribute the work across CPUs, or enable busy polling so the application thread polls the NIC directly, as sketched below.
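A minimal sketch of opting a socket into busy polling via SO_BUSY_POLL; the 50-microsecond figure is an arbitrary example, and some kernel versions require CAP_NET_ADMIN to set it:

```c
#include <sys/socket.h>

/* Blocking reads on this socket spin on the NIC's receive queue for up
 * to `usecs` before falling back to the interrupt -> softirq path.
 * See also the net.core.busy_poll sysctl. */
int enable_busy_poll(int sockfd, int usecs)
{
    return setsockopt(sockfd, SOL_SOCKET, SO_BUSY_POLL,
                      &usecs, sizeof(usecs));
}

/* Usage: enable_busy_poll(fd, 50);  /* spin up to 50 us per receive */
```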
How Technologies Use This
An Nginx server handling 50K connections per second shows mysterious latency spikes despite low average CPU usage. The p99 latency is 10x the median, and no amount of application-level tuning makes a difference.
The hidden cause is IRQ affinity mismatch. A packet arrives on the NIC and triggers an interrupt on CPU 3, but the Nginx worker reading from epoll runs on CPU 7. The packet data lands warm in CPU 3's private caches and must be pulled across the interconnect to CPU 7, adding on the order of 100 ns of memory latency per packet -- more if the cores sit on different NUMA nodes. Under load, these cross-cache transfers accumulate into the latency spikes that monitoring tools cannot explain.
Pinning each Nginx worker and its NIC queue IRQ to the same physical core via /proc/irq/N/smp_affinity_list keeps packet data in L1 cache. This single tuning step can double throughput from 500K to over 1M requests per second on a multi-queue 10GbE NIC.
A Kafka broker handling 2 million messages per second suddenly pins entire CPU cores at 100% on ksoftirqd. Broker threads are starved of CPU time, consumer lag spikes, and the cluster starts falling behind with no obvious application-level cause.
The culprit is uncoalesced interrupts. Every incoming packet triggers a separate hardware interrupt, and at millions of packets per second the overhead of context-switching into the interrupt handler thousands of times per millisecond consumes all available CPU cycles. The broker threads never get a chance to run because the kernel is perpetually servicing NIC interrupts.
Kafka deployments fix this with ethtool -C eth0 rx-usecs 50, which batches roughly 100 packets per interrupt and lets NAPI polling mode process them in bulk without further interrupts. This drops the interrupt rate by ~100x and frees 15-20% CPU per core for actual message processing.
Same Concept Across Tech
| Technology | How interrupts affect it | Key tuning |
|---|---|---|
| Nginx | Worker per CPU core. IRQ affinity should map NIC queues to the same cores as workers | Set worker_cpu_affinity + IRQ smp_affinity to match |
| Redis | Single-threaded. If NIC interrupts go to a different core than Redis, every packet crosses NUMA | Pin Redis and its NIC IRQ to the same core |
| DPDK | Bypasses kernel interrupts entirely. Polls NIC directly from user space | Eliminates interrupt overhead at the cost of dedicating CPU cores |
| Kafka | High packet rate from many producers. IRQ imbalance causes one core to bottleneck | Enable RSS (Receive Side Scaling) on the NIC for multi-queue |
| Kubernetes | Node-level IRQ imbalance affects all pods on that node | irqbalance should be enabled on all nodes |
Stack layer mapping (one CPU core at 100% si):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app CPU-bound or is it waiting on I/O? | Application profiler |
| Network | Packet rate? NIC multi-queue enabled? | ethtool -l, ethtool -S |
| IRQ | Are all interrupts going to one core? | cat /proc/interrupts, check distribution |
| Kernel | Is ksoftirqd consuming CPU? Is NAPI polling active? | top (look for ksoftirqd), /proc/softirqs |
| Hardware | NIC RSS (Receive Side Scaling) configured? IRQ coalescing? | ethtool -c, ethtool -x |
Design Rationale
A monolithic handler either holds interrupts disabled too long (packet drops, missed events) or runs too little code (hardware left in a bad state). Pushing everything into a kernel thread would add scheduling latency to every hardware event. The top-half/bottom-half split threads the needle: acknowledge hardware with interrupts masked in microseconds, then do the real work in softirqs with interrupts re-enabled. NAPI takes it further -- once the first packet triggers an interrupt, the driver switches to polling mode so millions of packets per second get processed in batches, dodging the livelock that per-packet interrupts would cause.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| One CPU at 100% si, others idle | All NIC interrupts going to one core | cat /proc/interrupts, check IRQ affinity |
| High si% across all CPUs | Massive packet rate overwhelming softirq processing | Check packet rate with ethtool -S |
| ksoftirqd at high CPU in top | Sustained interrupt load; softirq work exceeding its inline budget | watch -n1 cat /proc/softirqs, correlate with packet rate |
| Network latency spikes at high throughput | Interrupt coalescing too aggressive, or NAPI budget too low | ethtool -c, adjust rx-usecs |
| Application latency jitter despite low CPU | Interrupts preempting application code at unpredictable times | Consider CPU isolation (isolcpus) for latency-critical processes |
| NIC dropping packets at high rate | Ring buffer overflow because interrupts not processed fast enough | ethtool -S (rx_dropped), increase ring buffer size |
When to Use / Avoid
Relevant when:
- Debugging high CPU in si (softirq) or hi (hardware interrupt) in top/mpstat
- Tuning network performance on multi-queue NICs (IRQ affinity)
- Understanding why one CPU core is at 100% while others are idle (IRQ imbalance)
- Working with real-time or latency-sensitive workloads where interrupt jitter matters
Watch out for:
- All interrupts going to CPU 0 by default (set IRQ affinity with irqbalance or /proc/irq/N/smp_affinity)
- NAPI polling mode on NICs (disables interrupts during high load, polls instead)
- Softirq processing can be delayed if the CPU is busy in user space (check ksoftirqd)
Try It Yourself
```bash
# View per-CPU interrupt counts
cat /proc/interrupts | head -20

# View softirq statistics per CPU
cat /proc/softirqs

# Watch interrupt rate changes in real time
watch -n 1 -d 'cat /proc/interrupts | grep -E "(CPU|eth|nvme)"'

# Check IRQ affinity for a specific interrupt
IRQ=$(grep eth /proc/interrupts 2>/dev/null | head -1 | awk -F: '{print $1}' | tr -d ' '); [ -n "$IRQ" ] && cat /proc/irq/$IRQ/smp_affinity_list || echo 'No eth IRQ found'

# Monitor ksoftirqd CPU usage
ps -eo pid,comm,%cpu | grep ksoftirqd

# Show interrupt handler timing via ftrace
cat /sys/kernel/debug/tracing/available_events 2>/dev/null | grep irq | head -10
```
Debug Checklist
1. View per-CPU interrupt counts: cat /proc/interrupts
2. Check softirq activity: cat /proc/softirqs
3. Monitor interrupt rate: watch -d -n1 cat /proc/interrupts
4. Check IRQ affinity: cat /proc/irq/<irq_num>/smp_affinity_list
5. Check if irqbalance is running: systemctl status irqbalance
6. Monitor si% per CPU: mpstat -P ALL 1
Key Takeaways
- ✓ IRQ affinity is a major tuning knob. MSI/MSI-X interrupts are delivered straight to a core's Local APIC, and /proc/irq/N/smp_affinity controls which CPUs handle each device. Pinning NIC interrupts to the same core running your application keeps packet data in L1 cache.
- ✓ Many modern drivers use threaded IRQs via request_threaded_irq(). The hardirq handler is minimal -- it just wakes a kernel thread that runs the main handler in process context. This improves latency predictability dramatically.
- ✓ NAPI flips from interrupts to polling under load. After the first packet, the driver disables NIC interrupts and polls for batches of packets. This prevents interrupt livelock at high packet rates. It is why ksoftirqd often shows high CPU on network-heavy servers.
- ✓ Softirqs are re-entrant across CPUs (same type can run on different cores simultaneously) but non-preemptible on a single CPU. If they take too long, the kernel defers remaining work to ksoftirqd to prevent user-space starvation.
- ✓ /proc/interrupts shows per-CPU interrupt counts for every IRQ line. A sudden spike means a device is generating excessive interrupts. Unbalanced columns mean poor IRQ affinity. irqbalance tries to distribute load automatically.
Common Pitfalls
- ✗ Mistake: Doing heavy processing in the hardirq handler. Reality: The hardirq runs with the interrupt line masked. Spending too long here delays other devices and causes packet drops. Move all non-essential work to softirqs or workqueues.
- ✗ Mistake: Calling sleeping functions from softirq or hardirq context. Reality: These contexts have no process to schedule away from. Use GFP_ATOMIC for allocation and spinlocks for synchronization; kmalloc with GFP_KERNEL or mutex_lock can sleep, which in atomic context means "scheduling while atomic" bugs, deadlocks, or a panic (see the sketch after this list).
- ✗ Mistake: Ignoring ksoftirqd CPU usage. Reality: When ksoftirqd threads eat significant CPU, softirq processing is exceeding its inline budget (~2ms). This is common on network-heavy systems and signals the need for RSS, RPS, or interrupt affinity tuning.
- ✗ Mistake: Setting IRQ affinity without considering cache topology. Reality: Pinning a NIC IRQ and the application to the same CPU socket (same L3 cache) cuts memory latency. Cross-NUMA IRQ handling adds 100+ ns per packet in cache misses.
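To make the second pitfall concrete, a sketch of atomic-context discipline in a hypothetical driver path; the mydev_* names are invented:

```c
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(mydev_lock);        /* hypothetical driver lock */

/* Called from softirq context: only non-sleeping primitives. */
static void mydev_softirq_path(size_t len)
{
    unsigned long flags;
    void *buf = kmalloc(len, GFP_ATOMIC);  /* GFP_KERNEL could sleep */

    if (!buf)
        return;                 /* GFP_ATOMIC can fail; must handle it */

    spin_lock_irqsave(&mydev_lock, flags); /* mutex_lock() could sleep */
    /* ... hand buf off to a queue; ownership passes with it ... */
    spin_unlock_irqrestore(&mydev_lock, flags);
}
```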
Reference
In One Line
Pin NIC IRQs to the same cores as the application threads, let NAPI and interrupt coalescing batch work under load, and check /proc/interrupts the moment one core pegs at 100% si while the rest sit idle.