Workqueues & Tasklets
Mental Model
A restaurant kitchen during dinner rush. The head chef (hardirq handler) stands at the pass and receives order tickets in real time. The chef cannot leave the pass, so each ticket gets clipped to the order rail (queue_work). Line cooks (kworker threads) pull tickets and prepare dishes at their own stations. If a cook is stuck waiting for the oven (blocked on I/O), the kitchen manager (pool manager) pulls in another cook so other orders keep moving. An ordered workqueue is like a single sushi chef who must plate one roll at a time in sequence. Tasklets are the bartender: fast drinks only, no cooking, and one bartender never works two bars simultaneously.
The Problem
A storage driver sees high latency because work items are queuing behind each other on a single-threaded workqueue. Each I/O completion work item takes 50 microseconds, but under load the queue backs up and the 99th percentile latency reaches 8 milliseconds. The driver allocated its workqueue with alloc_ordered_workqueue(), which limits execution to one work item at a time. Switching to a multi-threaded workqueue with alloc_workqueue("drv_io", WQ_UNBOUND | WQ_MEM_RECLAIM, 0) allows CMWQ to scale kworker threads dynamically, dropping p99 completion latency from 8ms to 120 microseconds.
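In code, the fix is a one-line change at allocation time. A minimal sketch of the before/after (the dev->io_wq field and error handling are illustrative, not from a specific driver):
/* Before: every completion waits its turn behind the previous one */
dev->io_wq = alloc_ordered_workqueue("drv_io", WQ_MEM_RECLAIM);

/* After: CMWQ runs independent completions concurrently, spawning
 * extra kworkers as existing ones block on I/O */
dev->io_wq = alloc_workqueue("drv_io", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
if (!dev->io_wq)
        return -ENOMEM;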
Architecture
A device interrupt fires. The handler needs to update a data structure that requires a mutex. It needs to allocate memory with GFP_KERNEL. It needs to call a function that might sleep for 100 microseconds waiting on firmware. None of this is legal in interrupt context.
This is the fundamental problem workqueues solve. They bridge the gap between the fast, restrictive world of interrupt handlers and the full-featured process context where sleeping, memory allocation, and complex synchronization are all permitted.
The Bottom-Half Spectrum
Linux offers three mechanisms for deferring work out of hardirq context, each with different tradeoffs:
Softirqs are statically allocated at compile time (only 10 exist: HI, TIMER, NET_TX, NET_RX, BLOCK, IRQ_POLL, TASKLET, SCHED, HRTIMER, RCU). They run immediately after the hardirq with interrupts enabled but in atomic context. They cannot sleep. They can run on multiple CPUs simultaneously, which means softirq handlers must be fully reentrant. Adding a new softirq requires modifying kernel source.
Tasklets are built on top of softirqs (TASKLET_SOFTIRQ and HI_SOFTIRQ). They provide a simpler programming model: a single tasklet instance is serialized (never runs on two CPUs at the same time), but different tasklet instances can run in parallel. They still cannot sleep. Tasklets are the legacy mechanism, and the kernel community is actively converting them to workqueues or threaded IRQs.
Workqueues run in full process context on kworker threads. They can sleep, take mutexes, allocate memory, and perform I/O. They are the right choice for anything that cannot guarantee completion without blocking.
| Mechanism | Context | Can sleep? | Concurrency model |
|---|---|---|---|
| Softirq | Atomic/softirq | No | Same softirq can run on all CPUs simultaneously |
| Tasklet | Atomic/softirq | No | Per-instance serialized |
| Workqueue | Process | Yes | CMWQ manages pool of kworker threads |
Concurrency Managed Workqueues (CMWQ)
Before kernel 2.6.36, each workqueue created dedicated kernel threads per CPU. A system with 30 workqueues on a 64-core machine spawned 1920 threads, most idle. CMWQ redesigned the entire subsystem around shared thread pools.
The key insight: workqueues do not own threads. They are routing labels. When code calls queue_work(my_wq, &my_work), the workqueue routes the work item to the appropriate worker pool, and a kworker thread from that pool executes it.
Two kinds of pools exist:
Per-CPU bound pools. Each CPU has two pools: one normal priority and one high priority (for WQ_HIGHPRI workqueues). Work queued on a bound workqueue runs on the CPU where it was queued. This preserves cache locality.
Unbound pools. Shared across CPUs, keyed by NUMA node and nice level. Work queued on a WQ_UNBOUND workqueue can execute on any CPU in the same NUMA node. The scheduler picks the least-loaded CPU. Better for latency-sensitive work that should not wait behind CPU-bound tasks on the originating core.
The pool manager watches every pool. When all kworker threads in a pool are blocked (sleeping on I/O, waiting for a mutex), the manager immediately spawns a new kworker. When blocked workers wake up and the pool has excess idle threads, extra threads are reaped after a timeout. Thread count tracks actual concurrency demand, not the number of registered workqueues.
/* Creating a workqueue: the modern API */

/* Concurrent, unbound, with memory reclaim safety */
struct workqueue_struct *io_wq;

io_wq = alloc_workqueue("driver_io",
                        WQ_UNBOUND | WQ_MEM_RECLAIM,
                        0); /* 0 = default max_active (256) */

/* Ordered: strict FIFO, one item at a time */
struct workqueue_struct *cmd_wq;

cmd_wq = alloc_ordered_workqueue("driver_cmd",
                                 WQ_MEM_RECLAIM);

/* High priority: uses the high-pri per-CPU pool */
struct workqueue_struct *hp_wq;

hp_wq = alloc_workqueue("driver_hp",
                        WQ_HIGHPRI | WQ_MEM_RECLAIM,
                        0);
The Workqueue Flags
WQ_UNBOUND detaches work from the queuing CPU. Useful when work items are latency-sensitive and should not wait behind CPU-bound tasks on a specific core. The scheduler picks the best CPU. Tradeoff: potential L1/L2 cache misses.
WQ_MEM_RECLAIM pre-allocates a rescue worker thread. Under memory pressure, if the system cannot allocate a new kworker, the rescue worker runs the work items. Without this flag, a workqueue used in the I/O completion or memory reclaim path can deadlock. Every block driver and filesystem workqueue needs this flag.
WQ_HIGHPRI routes work to the high-priority per-CPU pool. Workers in this pool run at a lower nice value, getting preferential scheduling. Used for latency-critical completion handlers.
WQ_FREEZABLE allows the workqueue to be frozen during system suspend. Work items are paused and resumed on wake. Used for non-critical periodic tasks that should not interfere with suspend/resume.
WQ_CPU_INTENSIVE tells CMWQ that work items on this workqueue may be CPU-intensive. The pool manager does not count these workers as "blocked" when deciding whether to spawn new threads, preventing a thundering herd of kworkers for intentionally CPU-bound work.
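The allocation examples above cover WQ_UNBOUND, WQ_MEM_RECLAIM, and WQ_HIGHPRI. A minimal sketch of the remaining two flags, using a hypothetical polling workqueue:
/* Hypothetical: periodic polling that should pause across suspend and
 * whose work functions are deliberately CPU-heavy (e.g. checksumming).
 * WQ_CPU_INTENSIVE keeps these workers out of the pool's concurrency
 * accounting so they do not trigger extra kworker spawning. */
struct workqueue_struct *poll_wq;

poll_wq = alloc_workqueue("driver_poll",
                          WQ_FREEZABLE | WQ_CPU_INTENSIVE,
                          0);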
Work Item Lifecycle
/* 1. Define the work function */
static void my_work_func(struct work_struct *work)
{
        struct my_device *dev = container_of(work,
                                             struct my_device,
                                             completion_work);

        /* Process context: mutexes, GFP_KERNEL, msleep all legal */
        mutex_lock(&dev->lock);
        process_completed_io(dev);
        mutex_unlock(&dev->lock);
}

/* 2. Initialize (usually in probe/init) */
struct my_device {
        struct workqueue_struct *io_wq;
        struct work_struct completion_work;
        struct delayed_work watchdog;
        /* ... */
};

INIT_WORK(&dev->completion_work, my_work_func);
INIT_DELAYED_WORK(&dev->watchdog, watchdog_func);

/* 3. Queue from interrupt handler */
static irqreturn_t my_irq_handler(int irq, void *data)
{
        struct my_device *dev = data;
        u32 status = readl(dev->regs + STATUS);

        /* Acknowledge hardware interrupt (top half) */
        writel(status, dev->regs + STATUS_ACK);

        /* Defer heavy processing to workqueue (bottom half) */
        queue_work(dev->io_wq, &dev->completion_work);
        return IRQ_HANDLED;
}

/* 4. Delayed work for periodic polling */
static void watchdog_func(struct work_struct *work)
{
        struct my_device *dev = container_of(to_delayed_work(work),
                                             struct my_device, watchdog);

        check_device_health(dev);
        /* Re-arm: run again in 1 second */
        queue_delayed_work(dev->io_wq, &dev->watchdog, HZ);
}

/* 5. Cleanup in remove/exit */
cancel_delayed_work_sync(&dev->watchdog);
cancel_work_sync(&dev->completion_work);
destroy_workqueue(dev->io_wq);
The container_of macro is the key pattern here. The work_struct is embedded directly in the driver's data structure. When the work function executes, it recovers the enclosing structure via container_of, giving access to all device state without global variables.
Tasklets: The Legacy Path
Tasklets still exist in the kernel and understanding them matters for reading older driver code:
/* Tasklet declaration and handler */
static void my_tasklet_handler(struct tasklet_struct *t)
{
        /* Runs in softirq context. Cannot sleep. */
        struct my_device *dev = from_tasklet(dev, t, my_tasklet);
        u32 status = readl(dev->regs + STATUS);

        process_rx_packets(dev, status);
}

/* In device structure */
struct my_device {
        struct tasklet_struct my_tasklet;
        /* ... */
};

/* Initialization */
tasklet_setup(&dev->my_tasklet, my_tasklet_handler);

/* Schedule from hardirq */
tasklet_schedule(&dev->my_tasklet);

/* Teardown */
tasklet_kill(&dev->my_tasklet);
The critical limitation: tasklet_schedule() on an already-scheduled tasklet is a no-op. If the hardirq fires twice before the tasklet runs, the second event is effectively merged into the first. Drivers must handle this by re-reading hardware status inside the tasklet handler rather than relying on one invocation per interrupt.
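One way to cope with the merging is to drain everything the hardware has pending on each invocation. A sketch, reusing the hypothetical registers above and assuming an RX_PENDING status bit:
static void my_tasklet_handler(struct tasklet_struct *t)
{
        struct my_device *dev = from_tasklet(dev, t, my_tasklet);
        u32 status;

        /* Events that arrived while the tasklet was already scheduled
         * (and whose tasklet_schedule() was a no-op) are picked up here. */
        while ((status = readl(dev->regs + STATUS)) & RX_PENDING) {
                process_rx_packets(dev, status);
                writel(status, dev->regs + STATUS_ACK);
        }
}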
Diagnosing Workqueue Problems
When kworker threads appear at the top of CPU usage, the first question is: which function is running?
# Identify the function a kworker is stuck in
cat /proc/$(pgrep -f 'kworker/0:1' | head -1)/stack
# Trace all work item executions with function names
trace-cmd record -e workqueue:workqueue_execute_start \
-e workqueue:workqueue_execute_end sleep 10
trace-cmd report | head -40
# Count work items per workqueue over 5 seconds
perf stat -e workqueue:workqueue_queue_work -a sleep 5
# Watch for workqueue stall warnings in kernel log
dmesg -T | grep -i "workqueue.*stall"
A common pattern: high p99 I/O latency traced to an ordered workqueue. The trace-cmd output shows work items waiting milliseconds in the queue while a single item executes. The fix is almost always switching from alloc_ordered_workqueue() to alloc_workqueue() with appropriate concurrency, after verifying that work items have no ordering dependencies.
Common Questions
When should a driver use a workqueue instead of a tasklet or threaded IRQ?
Use a workqueue when the deferred work needs to sleep, take a mutex, allocate memory with GFP_KERNEL, or perform any operation that might block. Threaded IRQs (request_threaded_irq) are the right choice when the bottom half naturally pairs with a specific interrupt and does not need to be shared across multiple event sources. Workqueues are better when work is generated from multiple sources or needs to be cancelled, flushed, or delayed independently.
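For comparison, a minimal threaded-IRQ sketch (the register layout and MY_IRQ_PENDING bit are hypothetical). The thread function runs in its own kernel thread, so it may sleep just like a work function:
static irqreturn_t quick_check(int irq, void *data)
{
        struct my_device *dev = data;

        /* Hardirq context: only confirm the interrupt is ours */
        if (!(readl(dev->regs + STATUS) & MY_IRQ_PENDING))
                return IRQ_NONE;
        return IRQ_WAKE_THREAD;         /* run heavy_handler() */
}

static irqreturn_t heavy_handler(int irq, void *data)
{
        struct my_device *dev = data;

        mutex_lock(&dev->lock);         /* process context: sleeping is fine */
        process_completed_io(dev);
        mutex_unlock(&dev->lock);
        return IRQ_HANDLED;
}

/* In probe: IRQF_ONESHOT keeps the line masked until the thread finishes */
ret = request_threaded_irq(dev->irq, quick_check, heavy_handler,
                           IRQF_ONESHOT, "my_device", dev);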
What happens when queue_work() is called on an already-queued work item?
It returns false and does nothing. The work item remains in the queue exactly once. This is a deliberate design: it prevents unbounded queue growth when interrupts fire faster than work items execute. The work function must re-read current state rather than assuming it corresponds to a single event. If distinct events must each trigger separate processing, use separate work_struct instances or a dedicated data queue consumed by a single work item.
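A sketch of the dedicated-data-queue pattern mentioned above. The struct my_event type and the event_list/event_lock/event_work fields are hypothetical additions to struct my_device; the point is that events survive even when queue_work() returns false, because they live on the list rather than in the work item:
struct my_event {
        struct list_head node;
        u32 payload;
};

/* From the hardirq handler: append the event, then kick the work item.
 * IRQs are already disabled here, so a plain spin_lock() is enough. */
static void post_event(struct my_device *dev, u32 payload)
{
        struct my_event *ev = kmalloc(sizeof(*ev), GFP_ATOMIC);

        if (!ev)
                return;
        ev->payload = payload;

        spin_lock(&dev->event_lock);
        list_add_tail(&ev->node, &dev->event_list);
        spin_unlock(&dev->event_lock);

        queue_work(dev->io_wq, &dev->event_work);
}

/* The single work item drains the whole list in one pass. */
static void event_work_func(struct work_struct *work)
{
        struct my_device *dev = container_of(work, struct my_device,
                                             event_work);
        struct my_event *ev, *tmp;
        LIST_HEAD(todo);

        spin_lock_irq(&dev->event_lock);
        list_splice_init(&dev->event_list, &todo);
        spin_unlock_irq(&dev->event_lock);

        list_for_each_entry_safe(ev, tmp, &todo, node) {
                handle_event(dev, ev->payload); /* hypothetical consumer */
                kfree(ev);
        }
}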
Why does the system_wq exist and when is it safe to use?
The system_wq is a pre-allocated bound workqueue for lightweight, non-blocking work items. It is safe for short tasks that complete in microseconds and never sleep. Using it for anything that might block delays all other system_wq users across the kernel. If there is any doubt about whether the work function blocks, allocate a private workqueue.
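schedule_work() and schedule_delayed_work() are the convenience wrappers that queue onto system_wq. A short sketch with hypothetical work items illustrating what belongs there and what does not:
/* Fine on system_wq: toggles a register bit and returns immediately */
INIT_WORK(&dev->led_work, led_toggle_func);
schedule_work(&dev->led_work);

/* Not fine on system_wq: a firmware reset that can block for hundreds
 * of milliseconds belongs on a private workqueue instead */
queue_work(dev->fw_wq, &dev->fw_reset_work);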
How does WQ_MEM_RECLAIM prevent deadlock?
When a workqueue has WQ_MEM_RECLAIM, the kernel creates a rescue worker thread at workqueue allocation time. Under memory pressure, if the normal kworker allocation path fails, the rescue worker executes pending work items. This is critical for I/O completion paths: the memory reclaim subsystem writes dirty pages to disk, which triggers block I/O completions, which must run on a workqueue. Without the rescue worker, the system deadlocks because freeing memory requires completing I/O, which requires a worker thread, which requires memory.
What is the difference between cancel_work_sync() and flush_work()?
cancel_work_sync() cancels pending work and waits for any executing instance to finish. After it returns, the work item is guaranteed not to be running or pending. flush_work() does not cancel; it waits for the currently pending or executing instance to complete. If the work item re-queues itself (common for periodic tasks), flush_work() only waits for the current iteration. Use cancel_delayed_work_sync() for delayed work to also cancel the pending timer.
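A sketch of where each call fits, assuming the completion_work item from the lifecycle example:
/* Remove path: guarantee nothing is pending or running afterwards */
cancel_work_sync(&dev->completion_work);

/* Synchronization point (e.g. before reading results): wait for the
 * currently queued/executing instance, but leave future queuing alone */
flush_work(&dev->completion_work);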
How Technologies Use This
A Docker host running 40 containers on overlayfs reports write latency spikes of 15-50ms on files that were never modified before. Normal writes to previously modified files complete in under 1ms. The spikes occur only on first writes to files inherited from lower image layers, and the containers share a single NVMe-backed storage pool doing 80K IOPS.
Overlayfs implements copy-up when a container first modifies a file from a read-only lower layer. The copy-up operation reads the entire file from the lower layer, writes it to the upper layer, sets extended attributes, and updates directory entries. This multi-step process involves blocking I/O and metadata updates that cannot run in atomic context, so overlayfs defers portions of this work to the kernel's workqueue subsystem. A kworker thread picks up the copy-up work item from the per-CPU workqueue pool, and the process that triggered the write blocks until the work item completes.
The latency spike duration depends directly on the file size being copied up and the current workqueue backlog. A 4 MB file copy-up takes 3-5ms on a fast NVMe drive, but if 10 containers trigger copy-ups simultaneously, their work items queue behind each other on the same kworker thread pool. Monitoring workqueue depth with `cat /sys/kernel/debug/workqueue` and checking for backed-up items in the writeback workqueue reveals whether the spikes correlate with copy-up contention or underlying storage saturation.
A PostgreSQL instance on a 16-core server commits WAL (write-ahead log) records with fsync() after every transaction, producing 8,000 fsync calls per second. Each fsync() triggers a journal commit in the ext4 filesystem's jbd2 layer. The jbd2 commit involves writing metadata blocks, waiting for I/O completion, and updating the superblock, all operations that require sleeping and cannot run in interrupt context.
The jbd2 journal thread schedules commit work items on a dedicated workqueue. When PostgreSQL calls fsync(), the filesystem posts a journal commit work item and the calling process sleeps until the kworker thread completes it. Under heavy WAL pressure, dozens of PostgreSQL backends may call fsync() simultaneously. The workqueue serializes these commits efficiently, batching multiple fsync requests into a single journal transaction. This batching is what prevents 8,000 separate fsync calls from becoming 8,000 separate disk flushes.
The latency impact surfaces when the workqueue backs up. If a single journal commit takes 5ms because the storage controller is busy, all waiting fsync callers are blocked behind it. Monitoring with `perf record -e workqueue:workqueue_execute_start -a -- sleep 10` shows how long each work item spends on the jbd2 workqueue. If p99 WAL fsync latency exceeds 10ms, the workqueue execution trace often reveals that commit batching is too aggressive or the underlying storage cannot sustain the flush rate.
An Nginx server handling 50,000 requests per second on a 25 Gbps NIC (Mellanox ConnectX-5) experiences occasional 2ms latency spikes that correlate with network link state changes and MTU renegotiation events. The NIC driver (mlx5) handles fast-path packet processing through NAPI poll in softirq context, but link events, firmware commands, and error recovery require sleeping operations that cannot execute in softirq.
The mlx5 driver schedules these slow-path operations on a private workqueue created with alloc_workqueue("mlx5_health", WQ_MEM_RECLAIM, 0). When the hardirq handler detects a link state change, it calls queue_work() to post the event to this workqueue. A kworker thread picks up the work item and runs the handler in full process context, where it can take rtnl_lock, call GFP_KERNEL allocations for ring buffer reallocation, and wait for firmware response ACKs that take 200-500 microseconds.
The latency spikes affecting Nginx occur when the driver's workqueue handler takes rtnl_lock for a link renegotiation, and a concurrent Nginx configuration reload also needs rtnl_lock to set up new listening sockets. The reload blocks on the lock until the firmware command completes. Diagnosing this requires checking workqueue activity with `cat /proc/sched_debug | grep kworker` and correlating the kworker execution times with Nginx reload timestamps in the access log.
Same Concept Across Tech
| Technology | How it uses workqueues | Key gotcha |
|---|---|---|
| NVMe / Block layer | I/O completion callbacks run on kworker threads via blk_mq_complete_request | Ordered workqueues create head-of-line blocking at high IOPS; use concurrent WQ for independent completions |
| mlx5 / ixgbe (NIC drivers) | Link state changes, firmware commands, and error recovery on private workqueues | Using system_wq for firmware resets blocks unrelated kernel subsystems for hundreds of milliseconds |
| ext4 / XFS (Filesystems) | Journal commits, metadata writeback, inode reclaim, and unwritten extent conversion | Missing WQ_MEM_RECLAIM causes deadlock when the reclaim path depends on block I/O completion |
| USB subsystem | URB completion handling, hub port status changes, and device enumeration | USB operations inherently sleep; cannot use tasklets or softirqs for the actual processing |
| RCU | Callback processing after grace period completion runs on kworker threads | High RCU callback rates can saturate the workqueue; monitor with /sys/kernel/debug/rcu |
Stack layer mapping (high kworker CPU usage):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the workload generating excessive I/O completions or events? | Application-level I/O metrics |
| Block layer | Are work items backed up on an ordered workqueue? | trace-cmd -e workqueue, blktrace |
| Workqueue | Which function is consuming CPU in kworker threads? | cat /proc/<kworker_pid>/stack |
| Pool manager | Are workers being spawned due to blocked work items? | pgrep -c kworker, sampled over time |
| CPU | Is the kworker thread competing with user-space for CPU time? | perf top, mpstat -P ALL |
Design Rationale
The original Linux workqueue implementation (pre-2.6.36) created dedicated kernel threads per workqueue per CPU. A system with 30 workqueues on a 64-core machine spawned 1920 threads, most sitting idle. CMWQ (merged in 2010) solved this by introducing shared worker pools. All workqueues on a given CPU feed into the same pool of kworker threads. The pool manager watches for blocked workers and spawns replacements on demand. This keeps thread count proportional to actual concurrency rather than registered workqueues, cutting idle thread overhead by 10-50x on large systems.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| kworker threads consuming 100% of a CPU core | A work function is CPU-intensive or stuck in a tight loop | cat /proc/<pid>/stack to identify the function |
| I/O completion latency spikes at high IOPS | Ordered workqueue serializing independent completions | trace-cmd -e workqueue; check if alloc_ordered_workqueue is used |
| Deadlock under memory pressure with block I/O | Workqueue missing WQ_MEM_RECLAIM flag, no rescue worker available | Check workqueue allocation flags in driver source |
| "workqueue: ... hogging CPU" in dmesg | A work function running for more than the watchdog threshold (default 30s) | dmesg, then cat /proc/<kworker_pid>/stack |
| Hundreds of kworker threads visible in ps | Many workers blocked on I/O or mutexes, causing pool manager to spawn replacements | Check what workers are waiting on via /proc/<pid>/wchan |
| Work items silently dropped | Calling queue_work() on an already-pending work_struct, which returns false | Add logging around queue_work() return value |
When to Use / Avoid
Relevant when:
- Writing a kernel driver that needs to defer work from interrupt context to process context
- Diagnosing kworker threads consuming unexpected CPU or causing latency
- Understanding why a storage driver's completion latency is high under load
- Choosing between tasklets, softirqs, and workqueues for bottom-half processing
Watch out for:
- Ordered workqueues serialize all items. Independent work should use a concurrent workqueue
- Forgetting WQ_MEM_RECLAIM on I/O path workqueues causes deadlocks under memory pressure
- Work items on system_wq can block unrelated kernel subsystems
- Tasklets cannot sleep and are being phased out in favor of threaded IRQs and workqueues
Try It Yourself
# Count kworker threads per CPU
ps -eo comm | grep kworker | sed 's|kworker/\([0-9]*\):.*|\1|' | sort | uniq -c | sort -rn

# Show the top 10 kworker threads by CPU usage
ps -eo pid,comm,%cpu --sort=-%cpu | grep kworker | head -10

# Trace workqueue events for 10 seconds
trace-cmd record -e workqueue -o /tmp/wq.dat sleep 10 && trace-cmd report /tmp/wq.dat | head -50

# Identify which function a kworker is executing right now
for pid in $(pgrep -f kworker); do echo "=== $pid ==="; cat /proc/$pid/stack 2>/dev/null | head -5; done

# Check workqueue max_active settings
find /sys/devices/virtual/workqueue -name max_active -exec sh -c "echo {}; cat {}" \;

# Monitor workqueue work item rate with ftrace
echo 1 > /sys/kernel/debug/tracing/events/workqueue/workqueue_queue_work/enable && sleep 5 && cat /sys/kernel/debug/tracing/trace_pipe | head -20

Debug Checklist
1. List kworker threads and their CPU time: ps -eo pid,comm,%cpu | grep kworker | sort -k3 -rn | head -20
2. Trace work item execution latency: trace-cmd record -e workqueue:workqueue_execute_start -e workqueue:workqueue_execute_end
3. Count work items queued per second: perf stat -e workqueue:workqueue_queue_work -a sleep 5
4. Check for workqueue stalls: dmesg | grep -i 'workqueue.*stall'
5. Identify which function a busy kworker is executing: cat /proc/<pid>/stack
6. Check per-workqueue concurrency settings: cat /sys/devices/virtual/workqueue/*/max_active
Key Takeaways
- ✓ Concurrency Managed Workqueues (CMWQ) replaced the old create_workqueue() API. Instead of each workqueue owning dedicated threads, all workqueues share per-CPU kworker pools. The pool manager monitors how many workers are sleeping. If all workers for a CPU are blocked (waiting on I/O, mutexes), it spawns a new kworker to keep other work items flowing. This keeps kworker thread count proportional to actual concurrency needs, not the number of registered workqueues.
- ✓ WQ_UNBOUND workqueues do not pin work to the CPU that queued it. The scheduler is free to run the kworker on any CPU, which helps latency-sensitive work avoid head-of-line blocking behind CPU-bound tasks on the originating core. The tradeoff is potential cache misses when the work item accesses data that was hot on the queuing CPU.
- ✓ WQ_MEM_RECLAIM guarantees forward progress under memory pressure. Without this flag, a workqueue that needs to allocate memory during low-memory conditions can deadlock if the memory reclaim path itself depends on workqueue execution. Filesystems and block drivers must set this flag. Internally, the kernel pre-allocates a rescue worker thread for each WQ_MEM_RECLAIM workqueue.
- ✓ Tasklets are the older bottom-half mechanism. They run in softirq context (TASKLET_SOFTIRQ / HI_SOFTIRQ), cannot sleep, and are serialized per tasklet instance (the same tasklet never runs on two CPUs simultaneously). Different tasklet instances can run in parallel on different CPUs. The kernel community is gradually converting tasklets to workqueues because workqueues provide better concurrency control and debugging.
- ✓ alloc_ordered_workqueue() creates a workqueue that processes items strictly one at a time, in FIFO order. This is useful when work items have ordering dependencies (journal commits, firmware command sequences), but creates a bottleneck if items are independent. Always verify whether ordering is actually required before choosing ordered execution.
Common Pitfalls
- ✗ Using system_wq for long-running or potentially blocking work. The shared system workqueue has limited concurrency, and one slow work item blocks unrelated subsystems. A firmware reset taking 500ms on system_wq delays timer callbacks, RCU processing, and driver state machines across the entire kernel. Allocate a private workqueue for anything that may block for more than a few milliseconds.
- ✗ Calling flush_workqueue() or flush_work() from within a work item on the same workqueue. This deadlocks because the flushing work item is waiting for the target work item to complete, but the target is queued behind the flushing item (or the workqueue is ordered). Use separate workqueues for work items that need to wait on each other.
- ✗ Queuing a work_struct that is already pending. queue_work() returns false in this case and the new request is silently merged with the pending one. Code that needs to ensure a function runs again after the current execution must re-queue from within the work function itself, or use a flag to signal that re-execution is needed.
- ✗ Forgetting WQ_MEM_RECLAIM on block or filesystem workqueues. Under memory pressure, the kernel reclaims pages by flushing dirty data through the block layer. If the block driver's completion workqueue cannot make progress because kworker allocation fails, the system deadlocks. The rescue worker mechanism exists specifically to prevent this.
- ✗ Assuming tasklets provide parallelism. A single tasklet instance is strictly serialized. Scheduling the same tasklet on multiple CPUs does not make it run in parallel; the second CPU simply re-schedules the tasklet until the first CPU finishes. For parallel execution of the same function, use per-CPU work items instead.
Reference
In One Line
Workqueues move heavy lifting out of interrupt handlers into sleeping-capable kernel threads, and CMWQ's pool manager ensures one blocked work item never starves the rest.