BIO & Request Queues
Mental Model
Think of a shipping warehouse. Applications hand packages (BIOs) to the receiving dock. Each package has a label (sector address) and contents (bio_vec pages). The warehouse sorter (plug list) groups packages heading to nearby addresses into the same pallet (merged request). Pallets move to staging lanes (software queues), one per loading bay worker (CPU). Each worker pushes pallets onto the truck assigned to their bay (hardware dispatch queue). If there is only one truck bay but 64 workers, everyone queues behind a single door. Adding 64 truck bays (blk-mq hardware queues) lets all workers load simultaneously.
The Problem
An NVMe drive rated for 1 million random 4K read IOPS sits behind a storage application that measures only 200K IOPS. The CPU is 40% idle. The NVMe device queue depth never exceeds 8. The drive has 64 hardware queues, but blktrace shows all I/O flowing through a single software queue. The application uses synchronous read() calls from 16 threads, each waiting for one I/O to complete before issuing the next. Merging is not the problem here -- the requests are random. The queue topology is wrong, and the submission pattern cannot saturate the hardware.
Architecture
Every byte read from or written to a block device in Linux passes through the BIO and request queue subsystem. The path is deceptively simple when laid out layer by layer: a filesystem builds a struct bio describing which pages go to which sectors, the block layer merges adjacent BIOs into struct requests for efficiency, and the multi-queue framework (blk-mq) fans those requests across per-CPU hardware dispatch queues so the device driver can consume them in parallel.
Getting this path wrong means leaving 80% of NVMe performance on the table. Getting it right means saturating devices capable of a million IOPS.
What Actually Happens
Here is the sequence when a filesystem submits a block write:
- The filesystem (ext4, XFS, btrfs) determines which disk sectors need writing and which pages hold the data.
- It calls bio_alloc() to allocate a struct bio from a mempool. The mempool guarantees allocation succeeds even under memory pressure -- critical because failing a write that exists to free dirty pages would deadlock reclaim.
- It attaches memory pages via bio_add_page(), which populates bio_vec entries. Each bio_vec is a (page, offset, length) tuple. A single bio can hold up to BIO_MAX_VECS (256) entries, covering roughly 1 MB with 4 KB pages.
- It sets bi_iter.bi_sector (the target disk offset), the operation (REQ_OP_WRITE), and any flags (REQ_SYNC, REQ_FUA for write-through).
- It calls submit_bio(), which hands the bio to the block layer. (A minimal sketch of this sequence follows below.)
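A minimal kernel-side sketch of that sequence, assuming a 5.18+ kernel where bio_alloc() takes the target device and operation directly; the function names, the single-page payload, and the completion callback are illustrative placeholders rather than real kernel code:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical completion callback, invoked once the device reports the write done. */
static void example_end_io(struct bio *bio)
{
	/* bio->bi_status is nonzero on error; real callers would propagate it. */
	bio_put(bio);                       /* drop the reference bio_alloc() returned */
}

/* Write one page to `sector` on `bdev` -- the steps listed above, in order. */
static void example_submit_write(struct block_device *bdev,
				 struct page *page, sector_t sector)
{
	/* Backed by a mempool: with GFP_NOIO this will not fail permanently
	 * even under memory pressure. */
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_SYNC, GFP_NOIO);

	bio->bi_iter.bi_sector = sector;        /* target offset, in 512-byte sectors */
	bio->bi_end_io = example_end_io;        /* called on completion */
	bio_add_page(bio, page, PAGE_SIZE, 0);  /* one bio_vec: (page, offset 0, 4 KB) */

	submit_bio(bio);                        /* hand off to the block layer */
}
```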
From submit_bio(), the bio enters the blk-mq path:
- blk_mq_submit_bio() checks the current task's plug list. If the task called blk_start_plug() earlier (most filesystem code does), BIOs accumulate in a per-task list. Adjacent BIOs targeting consecutive sectors get merged here via blk_attempt_plug_merge().
- If no merge is possible, a new struct request is allocated. This request receives a tag from the hardware context's tag bitmap -- an integer that uniquely identifies this in-flight I/O.
- The request enters the per-CPU software staging queue (blk_mq_ctx). If an I/O scheduler is configured (mq-deadline, bfq, kyber), the request passes through it for reordering or priority handling.
- The request is dispatched to the hardware dispatch queue (blk_mq_hw_ctx). The hctx is chosen based on the submitting CPU's mapping.
- The driver's queue_rq() callback fires. For NVMe, this is nvme_queue_rq(), which builds a submission queue entry and writes the doorbell register.
- The device processes the I/O and signals completion via interrupt (MSI-X for NVMe, routed to the submitting CPU).
- blk_mq_complete_request() finds the request by its tag and calls the bio's bi_end_io callback. The filesystem learns the I/O completed.
When blk_finish_plug() is called (typically at the end of a syscall), any remaining BIOs in the plug list are flushed to the scheduler in one batch. This batching is what makes plug merging effective -- without it, each bio would be dispatched individually with no merge window.
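A sketch of what that plug window looks like from the submitter's side; blk_start_plug() and blk_finish_plug() are the real kernel calls, while the surrounding helper and its arguments are placeholders:

```c
#include <linux/blkdev.h>

/* Writeback-style batch: open a plug, queue many BIOs, flush once.
 * While the plug is open, submitted BIOs sit on the per-task list
 * where adjacent ones can back-merge before dispatch. */
static void example_writeback_batch(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);           /* BIOs now accumulate per task */
	for (i = 0; i < nr; i++)
		submit_bio(bios[i]);     /* lands on the plug list, may merge */
	blk_finish_plug(&plug);          /* flush the whole batch downward */
}
```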
Under the Hood
struct bio internals. The bio is allocated from a bio_set, which wraps a mempool and a slab cache. The mempool preallocates a reserve of BIOs so that critical I/O paths (writeback, swap) can always make progress. bio->bi_iter tracks the current position within the bio: bi_sector is the disk offset, bi_size is bytes remaining, bi_idx is the current index into bi_io_vec, and bi_bvec_done tracks partial progress within a bio_vec. This iterator design means a bio can be advanced without modifying the underlying bio_vec array, which is critical for bio splitting.
bio_vec and scatter-gather. Each bio_vec points to a single page (or compound page). The DMA layer maps these into scatter-gather list entries for the device. Modern NVMe devices support up to 256 scatter-gather entries per command (matching BIO_MAX_VECS). If the page layout is physically contiguous, the DMA layer can coalesce adjacent entries, reducing the number of PRP/SGL entries the device must process.
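For reference, the fields discussed above, paraphrased and trimmed from include/linux/bvec.h (comments added here):

```c
/* One segment of a bio's payload: a (page, offset, length) tuple. */
struct bio_vec {
	struct page	*bv_page;	/* page holding the data */
	unsigned int	bv_len;		/* bytes in this segment */
	unsigned int	bv_offset;	/* starting offset within the page */
};

/* Iterator over a bio's bio_vec array; advancing it never modifies the array. */
struct bvec_iter {
	sector_t	bi_sector;	/* device address, in 512-byte sectors */
	unsigned int	bi_size;	/* bytes remaining in the I/O */
	unsigned int	bi_idx;		/* current index into the bio_vec array */
	unsigned int	bi_bvec_done;	/* bytes completed within the current bio_vec */
};
```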
Request merging levels. Merging happens at three points: the plug list (per-task, within a single syscall), the I/O scheduler (global within a queue), and the driver (some drivers do last-chance merging). Back-merge (appending a new bio to an existing request's tail) is the overwhelmingly common case for sequential I/O. Front-merge (prepending) requires the scheduler's red-black tree lookup and is rarer. The merge check is O(1) for back-merge (compare against the plug list's last request) and O(log n) for scheduler-level merging.
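The contiguity test behind those merge decisions is a sector comparison; the following is a trimmed paraphrase of blk_try_merge() from block/blk-merge.c (the discard case is omitted):

```c
/* Can `bio` be glued onto existing request `rq`, and on which end? */
enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
{
	/* rq ends exactly where bio begins: append (back-merge). */
	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
		return ELEVATOR_BACK_MERGE;
	/* bio ends exactly where rq begins: prepend (front-merge). */
	if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
		return ELEVATOR_FRONT_MERGE;
	return ELEVATOR_NO_MERGE;
}
```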
blk-mq queue topology. The mapping from software queues (ctx) to hardware queues (hctx) is established at device registration. blk_mq_map_queues() assigns each CPU to an hctx. For NVMe, the mapping is 1:1 (one hctx per CPU, one NVMe SQ/CQ pair per hctx). For SATA (AHCI), there is typically one hctx for the single hardware queue, with all CPUs funneling into it. This mismatch explains why SATA devices see no benefit from adding CPUs beyond a certain point.
Tag allocation. Each hctx manages a bitmap of tags (blk_mq_tags). When a request is allocated, it receives the next available tag via blk_mq_get_tag(). If all tags are in use, the caller sleeps on a wait queue. The tag serves as an index into a request array, enabling O(1) completion lookup. NVMe devices return the tag in the completion queue entry. This eliminates the legacy block layer's linear search through the request list on completion.
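The completion-side lookup really is just an array index. Roughly, paraphrased from the kernel's blk_mq_tag_to_rq() helper (details vary by kernel version):

```c
/* Map a completed tag back to its in-flight request in O(1). */
struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
	if (tag < tags->nr_tags)
		return tags->rqs[tag];	/* the tag doubles as the array index */
	return NULL;
}
```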
The "none" scheduler. When the scheduler is set to "none", requests bypass scheduling entirely and go straight from the software staging queue to the hardware dispatch queue. For NVMe, this is almost always correct: the device has its own internal scheduler, and adding a kernel-side scheduler just adds latency. For rotational drives, mq-deadline groups requests by direction (read/write) and deadline, reducing head seeks.
Common Questions
Why does a single-threaded application fail to saturate NVMe?
Synchronous I/O (read/write syscalls) blocks the thread until the I/O completes. With one thread, one I/O is in flight at a time. NVMe device latency for a 4 KB random read is roughly 10-20 microseconds. At 15us per I/O, one thread achieves 66K IOPS. The device can handle 1M IOPS, but 93% of its capacity sits unused. The fix is either massive thread counts (hundreds of threads, each blocking on one I/O) or asynchronous submission via io_uring with SQPOLL.
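A userspace sketch of the asynchronous alternative using liburing: one thread keeps 64 reads in flight instead of blocking on one at a time. The device path, queue depth, offsets, and block size are arbitrary choices for illustration; SQPOLL (mentioned above) would additionally remove the submit syscall but needs extra setup:

```c
/* build: gcc -O2 qd_demo.c -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>

#define QD     64            /* reads kept in flight by a single thread */
#define BLOCK  4096

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	void *bufs[QD];
	int fd, i;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* example device */
	if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
		return 1;

	/* Queue QD reads before waiting for any completion. */
	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (posix_memalign(&bufs[i], BLOCK, BLOCK))  /* O_DIRECT alignment */
			return 1;
		io_uring_prep_read(sqe, fd, bufs[i], BLOCK,
				   (__u64)i * 1024 * BLOCK); /* spread offsets */
	}
	io_uring_submit(&ring);        /* one syscall submits all QD reads */

	/* Reap completions: the device now sees queue depth ~QD, not 1. */
	for (i = 0; i < QD; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		if (cqe->res < 0)
			fprintf(stderr, "read failed: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```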
How does the kernel decide which hardware queue receives a BIO?
The submitting CPU determines the queue. blk_mq_map_queues() builds a CPU-to-hctx mapping at device registration. When blk_mq_submit_bio() runs, it looks up the current CPU's assigned hctx. This means a workload running on 4 of 64 CPUs will only use 4 of 64 hardware queues. To use all queues, I/O must originate from all CPUs. Thread pinning, io_uring with per-CPU rings, or SO_INCOMING_CPU-style affinity helps distribute the load.
What happens when a BIO is too large for the device?
Every block device advertises limits: max_sectors_kb (maximum I/O size per request, in KiB), max_segments (max scatter-gather entries), and max_segment_size. If a bio exceeds these, bio_split() divides it at the boundary. The split produces two BIOs sharing the same underlying pages -- only the bi_iter state differs. Device mapper relies on this heavily: dm_accept_partial_bio() splits at stripe, chunk, and device boundaries. The split is invisible to the filesystem; it sees a single completion when all splits finish (via the bio chain mechanism).
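A kernel-side sketch of that split-and-chain pattern, loosely modeled on what the block layer and device-mapper targets do; the wrapper function and the max_sectors parameter are placeholders, while bio_split(), bio_chain(), and submit_bio_noacct() are the real interfaces:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Split off the first max_sectors of `bio` and submit both pieces.
 * No data is copied: the child shares the parent's bio_vec array and
 * only the bi_iter state differs. */
static void example_split_and_submit(struct bio *bio, unsigned int max_sectors)
{
	if (bio_sectors(bio) > max_sectors) {
		/* Child covers the first max_sectors; the parent's bi_iter is
		 * advanced past them. (&fs_bio_set is a shared bio_set; real
		 * code usually allocates its own.) */
		struct bio *split = bio_split(bio, max_sectors, GFP_NOIO,
					      &fs_bio_set);

		/* Parent's bi_end_io fires only after both pieces complete. */
		bio_chain(split, bio);
		submit_bio_noacct(split);
	}
	submit_bio_noacct(bio);	/* remainder (or the whole bio if no split) */
}
```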
What is the difference between plugging and the I/O scheduler?
Plugging is per-task and short-lived: it batches BIOs within a single syscall to give the block layer a brief merge window. The I/O scheduler is per-device and persistent: it holds requests across syscalls, reordering them for seek optimization (mq-deadline) or fairness (bfq). Plugging is always active. The I/O scheduler is optional and selectable per device. For NVMe, plugging provides the only useful merge window since the scheduler is "none".
How does blk-mq handle devices with fewer queues than CPUs?
If a device has 4 hardware queues but the system has 64 CPUs, blk-mq maps 16 CPUs to each hctx. Requests from those 16 CPUs contend for the same hctx tag bitmap and dispatch lock. This is still far better than the legacy single-queue design, but there is measurable contention. NVMe devices typically expose at least as many queues as CPUs (up to 65535 queues per the spec), avoiding this issue entirely.
How Technologies Use This
A PostgreSQL 16 instance on a 64-core server writes WAL (Write-Ahead Log) to a dedicated NVMe SSD (Intel P5800X) that exposes 64 hardware submission queues. During a peak OLTP workload of 80,000 transactions per second, PostgreSQL generates approximately 400 MB/s of sequential WAL writes. Each WAL write produces a struct bio that the kernel submits to the block layer. The critical question is whether the block layer can dispatch these BIOs to the NVMe device fast enough to avoid becoming a bottleneck.
The blk-mq (multi-queue) layer creates one hardware dispatch queue (hctx) per NVMe submission queue, mapped 1:1 to CPU cores. When a PostgreSQL backend on CPU 17 issues a WAL write, the kernel builds a struct bio, checks for merge opportunities with pending requests in the per-task plug list, and dispatches the resulting struct request directly to hctx 17. The NVMe driver (drivers/nvme/host/pci.c) writes the command to submission queue 17 and rings the corresponding doorbell register. Completion interrupts arrive on CPU 17, and the io_comp_batch mechanism batches multiple completions into a single softirq invocation. No shared, contended lock serializes submissions, and no cross-core cache-line bouncing occurs because each CPU exclusively owns its queue pair.
Tuning queue depth matters for WAL write throughput. With nr_requests set to the default 256 per hctx, PostgreSQL's WAL writer can have 256 outstanding I/Os per CPU before the blk-mq layer begins throttling. On the P5800X, which sustains 900,000 random 4K read IOPS across all queues, the WAL write workload (sequential, mostly 8 KB to 64 KB chunks) never approaches this limit. Monitoring /sys/kernel/debug/block/nvme0n1/hctx*/dispatched reveals per-queue dispatch counts, and any imbalance indicates NUMA misconfiguration where WAL writer processes are scheduled on CPUs whose hctx maps to a remote NUMA node's NVMe controller.
A Docker host runs 30 containers using the overlay2 storage driver on an ext4 filesystem backed by a SATA SSD. When a container modifies a file inherited from a lower layer, overlay2 performs a copy-up operation: the kernel reads the entire file from the lower directory into page cache, then writes it to the upper directory. For a container that modifies a 50 MB log file, this copy-up generates two sets of BIOs, one for reading 50 MB and one for writing 50 MB, even though only a few bytes changed. Across 30 containers performing copy-ups simultaneously, the SATA SSD is asked to absorb roughly 3 GB of amplified I/O in a burst, far more than its 550 MB/s sequential write bandwidth can drain quickly.
Each copy-up write creates a struct bio with bio_vec entries pointing to the page cache pages holding the file data. These BIOs enter the blk-mq software staging queue, where the mq-deadline or BFQ scheduler attempts to merge adjacent BIOs into larger struct request objects. The copy-up pattern is sequential within a single file but interleaved across containers, so the scheduler's merge rate drops from 90% (single-stream sequential) to approximately 30% (multi-stream interleaved). The resulting flood of small, non-mergeable requests saturates the SATA command queue (depth 32), and iowait on the host climbs above 60%. Containers not performing copy-ups still experience degraded I/O because their read and write BIOs compete in the same single hardware queue that SATA exposes.
Monitoring with iostat -x (whose figures derive from /proc/diskstats) reveals the amplification: avgqu-sz (average queue size) exceeds the device queue depth, and await (average I/O wait time) spikes from 0.5 ms to 40 ms during copy-up storms. The mitigation is to pre-copy frequently modified files into the container's writable layer at build time (COPY in the Dockerfile), to use volumes for write-heavy paths (which bypass overlay2 entirely), or to move to an NVMe device where blk-mq's per-CPU hardware queues eliminate the single-queue bottleneck that SATA imposes.
A RocksDB instance running inside a cgroup-v2 container handles 50,000 point reads per second while background compaction rewrites 2 GB of SST files per minute. Both workloads share the same NVMe device. During heavy compaction, foreground read latency spikes from 0.2 ms to 5 ms because compaction BIOs flood the blk-mq dispatch queues, and the NVMe device's internal scheduling favors the large sequential writes over the small random reads.
At the BIO layer, foreground reads generate 4 KB struct bio objects flagged with REQ_SYNC, while compaction writes produce 64 to 256 KB BIOs flagged with REQ_BACKGROUND. The blk-mq scheduler (mq-deadline on most production configurations) separates reads and writes into distinct internal queues and guarantees read starvation cannot exceed a configurable deadline (default 500 ms for reads, 5 seconds for writes). However, the NVMe device itself has no awareness of this priority, and its internal write-back cache can stall read completions during large sequential write bursts. The cgroup-v2 I/O controller applies throttling at the BIO submission layer: writing "<major>:<minor> rbps=max wbps=209715200" (a 200 MB/s write cap) to the container's io.max limits the rate at which compaction BIOs enter the blk-mq queues.
With io.max throttling compaction writes to 200 MB/s, foreground read BIOs no longer compete with a burst of compaction traffic in the NVMe submission queues. Read P99 latency drops from 5 ms back to 0.3 ms. The tradeoff is that compaction falls behind during write-heavy workloads, causing LSM tree depth to grow and increasing read amplification. Monitoring /sys/fs/cgroup/<container>/io.stat shows the bytes and I/Os actually dispatched per device (the rbytes, wbytes, rios, and wios fields), enabling operators to tune the io.max write cap based on the ratio of compaction throughput to foreground read latency.
Same Concept Across Tech
| Technology | How it uses the BIO / blk-mq path | Key consideration |
|---|---|---|
| NVMe | 1:1 mapping of blk-mq hctx to NVMe SQ/CQ pairs. nvme_queue_rq() builds SQE from struct request. MSI-X routes completions to the issuing CPU | Set scheduler to "none". Ensure application spreads I/O across CPUs to use all queues |
| SCSI / SAS | Shared tag set across LUNs. scsi_queue_rq() converts request to SCSI CDB. Typically fewer hardware queues than NVMe | mq-deadline often beneficial for rotational SCSI drives. Monitor per-LUN queue depth |
| device mapper (LVM) | Intercepts BIOs at the dm target layer. dm-stripe splits BIOs at chunk boundaries. dm-crypt encrypts bio pages in-place | BIO splitting adds CPU overhead proportional to stripe count. Align I/O size to stripe chunk size |
| ext4 / XFS | Filesystems build multi-page BIOs from page cache dirty pages. Extent-aligned writeback maximizes merge potential | Fragmented files produce small, non-mergeable BIOs. Defragment or preallocate extents |
| io_uring | SQPOLL mode submits BIOs from a kernel thread without syscall overhead. Batches submissions for plug merging | One SQPOLL kernel thread per ring submits from a single CPU. Use multiple rings for multi-queue saturation |
Stack layer mapping (NVMe not reaching rated IOPS):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is I/O submission synchronous and single-threaded? | strace -c to count syscalls/sec |
| Block layer | Are BIOs landing on a single CPU/queue? | bpftrace block:block_bio_queue by CPU |
| blk-mq | How many hctx queues are receiving dispatches? | /sys/kernel/debug/block/<dev>/hctx*/dispatched |
| I/O Scheduler | Is an unnecessary scheduler adding latency? | cat /sys/block/*/queue/scheduler |
| Driver | Is the NVMe queue depth being reached? | iostat -x avgqu-sz per device |
| Hardware | Are all NVMe queues initialized? | dmesg, nvme smart-log |
Design Rationale
The legacy block layer used a single request queue per device, protected by a spinlock. This design predated SSDs and NVMe. When devices could handle 200 IOPS (rotational drives), a single lock was irrelevant. When devices reached 100K+ IOPS (early SSDs), lock contention consumed 30-40% of CPU. When NVMe devices reached 1M+ IOPS, the single queue became a hard ceiling. blk-mq (merged in Linux 3.13 by Jens Axboe) redesigned the entire path: per-CPU software queues eliminate lock contention on the submission side, per-device hardware queues map directly to device capabilities, and tag-based allocation replaces the linear scan of the old request pool. The result is that block layer overhead dropped from ~20 microseconds per I/O to under 1 microsecond.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| NVMe IOPS far below rated spec | Single-threaded or single-CPU I/O submission, only one hctx active | bpftrace block_bio_queue by CPU, check hctx dispatched counts |
| High CPU usage at moderate IOPS | I/O scheduler overhead on NVMe, or excessive BIO splitting at DM layer | Switch scheduler to "none", check bio_split tracepoints |
| iostat shows zero read/write merges on sequential workload | Plug merging disabled, O_DIRECT with tiny writes, or nomerges is set | cat /sys/block/<dev>/queue/nomerges, check blktrace for M events |
| Latency spikes at high queue depth | Tag exhaustion in hctx, all tags allocated, new I/O must wait | Check nr_requests vs actual in-flight, increase nr_requests if device supports it |
| Device mapper LVM shows higher latency than raw device | BIO splitting overhead at stripe boundaries | Align I/O size to LVM stripe chunk size, reduce stripe count |
| One CPU at 100% while others idle during I/O workload | All I/O funneling through one software queue | Spread I/O threads across CPUs, use io_uring with multiple rings |
| avgqu-sz stuck at 1 despite NVMe queue depth of 1024 | Synchronous I/O pattern, each request waits for completion before next submission | Switch to async I/O (io_uring, libaio) or increase thread count significantly |
When to Use / Avoid
Relevant when:
- Diagnosing why an NVMe device is not reaching advertised IOPS
- Tuning I/O scheduler selection for different device types (NVMe vs SATA vs HDD)
- Understanding device mapper and LVM bio splitting overhead
- Profiling block I/O latency with blktrace or bpftrace
- Working on kernel drivers or filesystem code that submits BIOs
Watch out for:
- Single-threaded synchronous I/O will never saturate a multi-queue device regardless of queue configuration
- I/O schedulers add CPU overhead and latency; NVMe devices generally perform best with "none"
- BIO splitting at device mapper boundaries is invisible to applications but adds kernel CPU time per I/O
- Per-CPU queue mapping means NUMA-unaware thread placement causes cross-node memory access for I/O buffers
Try It Yourself
# Trace block I/O events on an NVMe device (Q=queue, M=merge, D=dispatch, C=complete)
blktrace -d /dev/nvme0n1 -o - | blkparse -i - | head -50

# Shorthand: live block trace stream
btrace /dev/nvme0n1

# Check current I/O scheduler and available options
cat /sys/block/nvme0n1/queue/scheduler

# Set scheduler to none for NVMe (no reordering overhead)
echo none > /sys/block/nvme0n1/queue/scheduler

# Check number of hardware dispatch queues
ls -d /sys/kernel/debug/block/nvme0n1/hctx* | wc -l

# Check per-hctx dispatch statistics
for h in /sys/kernel/debug/block/nvme0n1/hctx*/dispatched; do echo "$h:"; cat "$h"; done

# Check queue depth, max transfer size, and merge settings
echo "nr_requests: $(cat /sys/block/nvme0n1/queue/nr_requests)" && echo "max_sectors_kb: $(cat /sys/block/nvme0n1/queue/max_sectors_kb)" && echo "nomerges: $(cat /sys/block/nvme0n1/queue/nomerges)"

# Watch merge rates with iostat (rrqm/s and wrqm/s columns)
iostat -xm 1 /dev/nvme0n1

# Trace BIO submission by CPU using bpftrace
bpftrace -e 'tracepoint:block:block_bio_queue { @[cpu] = count(); }'

# Histogram of request sizes at dispatch time
bpftrace -e 'tracepoint:block:block_rq_insert { @bytes = hist(args->bytes); }'

# Count bio splits (indicates DM or alignment splitting)
bpftrace -e 'tracepoint:block:block_split { @splits = count(); }'

# Check device stat counters (reads, read_merges, read_sectors, read_ms, writes, ...)
cat /sys/block/nvme0n1/stat

Debug Checklist
1. Check current I/O scheduler: cat /sys/block/<dev>/queue/scheduler
2. Check hardware queue count: ls /sys/kernel/debug/block/<dev>/hctx* | wc -l
3. Check queue depth: cat /sys/block/<dev>/queue/nr_requests
4. Check max transfer size: cat /sys/block/<dev>/queue/max_sectors_kb
5. Check merge statistics: cat /sys/block/<dev>/stat (field 2 = reads merged, field 6 = writes merged)
6. Check per-hctx dispatch counts: cat /sys/kernel/debug/block/<dev>/hctx*/dispatched
7. Trace BIO submission by CPU: bpftrace -e 'tracepoint:block:block_bio_queue { @[cpu] = count(); }'
8. Trace request sizes: bpftrace -e 'tracepoint:block:block_rq_insert { @bytes = hist(args->bytes); }'
9. Check merge disable: cat /sys/block/<dev>/queue/nomerges (0 = merging enabled)
10. Check NVMe queue utilization: nvme list and dmesg | grep nvme for queue setup messages
Key Takeaways
- ✓A struct bio describes a single contiguous I/O operation on disk but can reference scattered pages in memory through bio_vec entries. A struct request aggregates multiple contiguous BIOs into a single unit for the driver. The bio is the filesystem-to-block interface; the request is the block-to-driver interface.
- ✓The blk-mq layer eliminates the single-queue bottleneck that capped legacy block I/O at roughly 500K IOPS regardless of device capability. By mapping per-CPU software queues to per-device hardware queues, it scales linearly with core count. A 64-core server with an NVMe device goes from 500K IOPS (single queue, lock-bound) to 1M+ IOPS (multi-queue, lock-free).
- ✓BIO merging happens at two levels. First, the plug list: within a single syscall, the kernel accumulates BIOs in a per-task plug and merges adjacent ones before releasing them. Second, the I/O scheduler: if enabled, it reorders and merges requests in the software staging queue. For random I/O workloads, merging provides no benefit, and the "none" scheduler avoids the overhead entirely.
- ✓The bio split mechanism (bio_split()) is critical for device mapper and RAID. When a bio crosses a stripe boundary or chunk size limit, the block layer splits it into two BIOs at the boundary. The split bio shares the original's pages via bio_vec references -- no data copying occurs. The original bio's bi_iter is adjusted to cover only the remaining range.
- ✓Tag-based completion in blk-mq assigns each in-flight request a unique integer tag from the hctx tag bitmap. When the device signals completion, it returns the tag, and the kernel looks up the request directly by tag index. No scanning of a completion list is needed. This is O(1) per completion, essential at 1M+ IOPS.
Common Pitfalls
- ✗Using synchronous I/O from too few threads against NVMe. Each synchronous read() or write() blocks the thread until the single I/O completes. With 16 threads and 200us device latency, the maximum throughput is 16 / 0.0002 = 80K IOPS, regardless of device capability. Either increase thread count to hundreds or switch to io_uring / libaio for asynchronous submission.
- ✗Running an I/O scheduler on NVMe devices. mq-deadline or bfq add latency and CPU overhead for reordering that NVMe firmware handles internally. For NVMe, set the scheduler to "none" via echo none > /sys/block/nvme0n1/queue/scheduler. Reserve mq-deadline for rotational drives where seek optimization matters.
- ✗Assuming a single submission thread can saturate a multi-queue device. blk-mq maps software queues to CPUs. If all I/O originates from one CPU, only one hardware queue receives work. The other 63 queues sit idle. Spread I/O across CPUs using multiple threads, io_uring with SQPOLL, or multiple file descriptors with separate aio contexts.
- ✗Ignoring the max_sectors_kb and max_segments limits. If a bio exceeds the device's maximum transfer size, the block layer splits it. Frequent splitting adds overhead. Aligning application I/O size to /sys/block/<dev>/queue/max_sectors_kb avoids unnecessary splits.
- ✗Disabling plug merging by calling blk_finish_plug() too early or issuing O_DIRECT writes one page at a time. The plug batches BIOs from a single syscall, giving the block layer a window to merge. Issuing tiny, unplugged writes defeats this optimization and inflates IOPS unnecessarily.
Reference
In One Line
struct bio carries each I/O from filesystem to block layer, requests merge adjacent BIOs into efficient batches, and blk-mq fans them out across per-CPU hardware queues so NVMe devices can actually hit a million IOPS.