BIO & Request Queues
Mental Model
Think of a shipping warehouse. Applications hand packages (BIOs) to the receiving dock. Each package has a label (sector address) and contents (bio_vec pages). The warehouse sorter (plug list) groups packages heading to nearby addresses into the same pallet (merged request). Pallets move to staging lanes (software queues), one per loading bay worker (CPU). Each worker pushes pallets onto the truck assigned to their bay (hardware dispatch queue). If there is only one truck bay but 64 workers, everyone queues behind a single door. Adding 64 truck bays (blk-mq hardware queues) lets all workers load simultaneously.
The Problem
An NVMe drive rated for 1 million random 4K read IOPS sits behind a storage application that measures only 200K IOPS. The CPU is 40% idle. The NVMe device queue depth never exceeds 8. The drive has 64 hardware queues, but blktrace shows all I/O flowing through a single software queue. The application uses synchronous read() calls from 16 threads, each waiting for one I/O to complete before issuing the next. Merging is not the problem here -- the requests are random. The queue topology is wrong, and the submission pattern cannot saturate the hardware.
Architecture
Every byte read from or written to a block device in Linux passes through the BIO and request queue subsystem. The path is deceptively simple when laid out layer by layer: a filesystem builds a struct bio describing which pages go to which sectors, the block layer merges adjacent BIOs into struct requests for efficiency, and the multi-queue framework (blk-mq) fans those requests across per-CPU hardware dispatch queues so the device driver can consume them in parallel.
Getting this path wrong means leaving 80% of NVMe performance on the table. Getting it right means saturating devices capable of a million IOPS.
What Actually Happens
Here is the sequence when a filesystem submits a block write:
- The filesystem (ext4, XFS, btrfs) determines which disk sectors need writing and which pages hold the data.
- It calls bio_alloc() to allocate a struct bio from a mempool. The mempool guarantees allocation succeeds even under memory pressure -- critical because failing a write that exists to free dirty pages would deadlock reclaim.
- It attaches memory pages via bio_add_page(), which populates bio_vec entries. Each bio_vec is a (page, offset, length) tuple. A single bio can hold up to BIO_MAX_VECS (256) entries, covering roughly 1 MB with 4 KB pages.
- It sets bi_iter.bi_sector (the target disk offset), the operation (REQ_OP_WRITE), and any flags (REQ_SYNC, REQ_FUA for write-through).
- It calls submit_bio(), which hands the bio to the block layer. (A minimal sketch of this sequence follows below.)
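A minimal kernel-side sketch of that sequence, assuming a 5.18+ kernel where bio_alloc() takes the target device and operation directly; the function names, the single-page payload, and the completion callback are illustrative placeholders rather than real kernel code:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical completion callback, invoked once the device reports the write done. */
static void example_end_io(struct bio *bio)
{
	/* bio->bi_status is nonzero on error; real callers would propagate it. */
	bio_put(bio);                       /* drop the reference bio_alloc() returned */
}

/* Write one page to `sector` on `bdev` -- the steps listed above, in order. */
static void example_submit_write(struct block_device *bdev,
				 struct page *page, sector_t sector)
{
	/* Backed by a mempool: with GFP_NOIO this will not fail permanently
	 * even under memory pressure. */
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_SYNC, GFP_NOIO);

	bio->bi_iter.bi_sector = sector;        /* target offset, in 512-byte sectors */
	bio->bi_end_io = example_end_io;        /* called on completion */
	bio_add_page(bio, page, PAGE_SIZE, 0);  /* one bio_vec: (page, offset 0, 4 KB) */

	submit_bio(bio);                        /* hand off to the block layer */
}
```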
From submit_bio(), the bio enters the blk-mq path:
- blk_mq_submit_bio() checks the current task's plug list. If the task called blk_start_plug() earlier (most filesystem code does), BIOs accumulate in a per-task list. Adjacent BIOs targeting consecutive sectors get merged here via blk_attempt_plug_merge().
- If no merge is possible, a new struct request is allocated. This request receives a tag from the hardware context's tag bitmap -- an integer that uniquely identifies this in-flight I/O.
- The request enters the per-CPU software staging queue (blk_mq_ctx). If an I/O scheduler is configured (mq-deadline, bfq, kyber), the request passes through it for reordering or priority handling.
- The request is dispatched to the hardware dispatch queue (blk_mq_hw_ctx). The hctx is chosen based on the submitting CPU's mapping.
- The driver's queue_rq() callback fires. For NVMe, this is nvme_queue_rq(), which builds a submission queue entry and writes the doorbell register.
- The device processes the I/O and signals completion via interrupt (MSI-X for NVMe, routed to the submitting CPU).
- blk_mq_complete_request() finds the request by its tag and calls the bio's bi_end_io callback. The filesystem learns the I/O completed.
When blk_finish_plug() is called (typically at the end of a syscall), any remaining BIOs in the plug list are flushed to the scheduler in one batch. This batching is what makes plug merging effective -- without it, each bio would be dispatched individually with no merge window.
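A sketch of what that plug window looks like from the submitter's side; blk_start_plug() and blk_finish_plug() are the real kernel calls, while the surrounding helper and its arguments are placeholders:

```c
#include <linux/blkdev.h>

/* Writeback-style batch: open a plug, queue many BIOs, flush once.
 * While the plug is open, submitted BIOs sit on the per-task list
 * where adjacent ones can back-merge before dispatch. */
static void example_writeback_batch(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);           /* BIOs now accumulate per task */
	for (i = 0; i < nr; i++)
		submit_bio(bios[i]);     /* lands on the plug list, may merge */
	blk_finish_plug(&plug);          /* flush the whole batch downward */
}
```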
Under the Hood
struct bio internals. The bio is allocated from a bio_set, which wraps a mempool and a slab cache. The mempool preallocates a reserve of BIOs so that critical I/O paths (writeback, swap) can always make progress. bio->bi_iter tracks the current position within the bio: bi_sector is the disk offset, bi_size is bytes remaining, bi_idx is the current index into bi_io_vec, and bi_bvec_done tracks partial progress within a bio_vec. This iterator design means a bio can be advanced without modifying the underlying bio_vec array, which is critical for bio splitting.
bio_vec and scatter-gather. Each bio_vec points to a single page (or compound page). The DMA layer maps these into scatter-gather list entries for the device. Modern NVMe devices support up to 256 scatter-gather entries per command (matching BIO_MAX_VECS). If the page layout is physically contiguous, the DMA layer can coalesce adjacent entries, reducing the number of PRP/SGL entries the device must process.
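For reference, the fields discussed above, paraphrased and trimmed from include/linux/bvec.h (comments added here):

```c
/* One segment of a bio's payload: a (page, offset, length) tuple. */
struct bio_vec {
	struct page	*bv_page;	/* page holding the data */
	unsigned int	bv_len;		/* bytes in this segment */
	unsigned int	bv_offset;	/* starting offset within the page */
};

/* Iterator over a bio's bio_vec array; advancing it never modifies the array. */
struct bvec_iter {
	sector_t	bi_sector;	/* device address, in 512-byte sectors */
	unsigned int	bi_size;	/* bytes remaining in the I/O */
	unsigned int	bi_idx;		/* current index into the bio_vec array */
	unsigned int	bi_bvec_done;	/* bytes completed within the current bio_vec */
};
```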
Request merging levels. Merging happens at three points: the plug list (per-task, within a single syscall), the I/O scheduler (global within a queue), and the driver (some drivers do last-chance merging). Back-merge (appending a new bio to an existing request's tail) is the overwhelmingly common case for sequential I/O. Front-merge (prepending) requires the scheduler's red-black tree lookup and is rarer. The merge check is O(1) for back-merge (compare against the plug list's last request) and O(log n) for scheduler-level merging.
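The contiguity test behind those merge decisions is a sector comparison; the following is a trimmed paraphrase of blk_try_merge() from block/blk-merge.c (the discard case is omitted):

```c
/* Can `bio` be glued onto existing request `rq`, and on which end? */
enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
{
	/* rq ends exactly where bio begins: append (back-merge). */
	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
		return ELEVATOR_BACK_MERGE;
	/* bio ends exactly where rq begins: prepend (front-merge). */
	if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
		return ELEVATOR_FRONT_MERGE;
	return ELEVATOR_NO_MERGE;
}
```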
blk-mq queue topology. The mapping from software queues (ctx) to hardware queues (hctx) is established at device registration. blk_mq_map_queues() assigns each CPU to an hctx. For NVMe, the mapping is 1:1 (one hctx per CPU, one NVMe SQ/CQ pair per hctx). For SATA (AHCI), there is typically one hctx for the single hardware queue, with all CPUs funneling into it. This mismatch explains why SATA devices see no benefit from adding CPUs beyond a certain point.
Tag allocation. Each hctx manages a bitmap of tags (blk_mq_tags). When a request is allocated, it receives the next available tag via blk_mq_get_tag(). If all tags are in use, the caller sleeps on a wait queue. The tag serves as an index into a request array, enabling O(1) completion lookup. NVMe devices return the tag in the completion queue entry. This eliminates the legacy block layer's linear search through the request list on completion.
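The completion-side lookup really is just an array index. Roughly, paraphrased from the kernel's blk_mq_tag_to_rq() helper (details vary by kernel version):

```c
/* Map a completed tag back to its in-flight request in O(1). */
struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
	if (tag < tags->nr_tags)
		return tags->rqs[tag];	/* the tag doubles as the array index */
	return NULL;
}
```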
The "none" scheduler. When the scheduler is set to "none", requests bypass scheduling entirely and go straight from the software staging queue to the hardware dispatch queue. For NVMe, this is almost always correct: the device has its own internal scheduler, and adding a kernel-side scheduler just adds latency. For rotational drives, mq-deadline groups requests by direction (read/write) and deadline, reducing head seeks.
Common Questions
Why does a single-threaded application fail to saturate NVMe?
Synchronous I/O (read/write syscalls) blocks the thread until the I/O completes. With one thread, one I/O is in flight at a time. NVMe device latency for a 4 KB random read is roughly 10-20 microseconds. At 15us per I/O, one thread achieves 66K IOPS. The device can handle 1M IOPS, but 93% of its capacity sits unused. The fix is either massive thread counts (hundreds of threads, each blocking on one I/O) or asynchronous submission via io_uring with SQPOLL.
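A userspace sketch of the asynchronous alternative using liburing: one thread keeps 64 reads in flight instead of blocking on one at a time. The device path, queue depth, offsets, and block size are arbitrary choices for illustration; SQPOLL (mentioned above) would additionally remove the submit syscall but needs extra setup:

```c
/* build: gcc -O2 qd_demo.c -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>

#define QD     64            /* reads kept in flight by a single thread */
#define BLOCK  4096

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	void *bufs[QD];
	int fd, i;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* example device */
	if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
		return 1;

	/* Queue QD reads before waiting for any completion. */
	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (posix_memalign(&bufs[i], BLOCK, BLOCK))  /* O_DIRECT alignment */
			return 1;
		io_uring_prep_read(sqe, fd, bufs[i], BLOCK,
				   (__u64)i * 1024 * BLOCK); /* spread offsets */
	}
	io_uring_submit(&ring);        /* one syscall submits all QD reads */

	/* Reap completions: the device now sees queue depth ~QD, not 1. */
	for (i = 0; i < QD; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		if (cqe->res < 0)
			fprintf(stderr, "read failed: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```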
How does the kernel decide which hardware queue receives a BIO?
The submitting CPU determines the queue. blk_mq_map_queues() builds a CPU-to-hctx mapping at device registration. When blk_mq_submit_bio() runs, it looks up the current CPU's assigned hctx. This means a workload running on 4 of 64 CPUs will only use 4 of 64 hardware queues. To use all queues, I/O must originate from all CPUs. Thread pinning, io_uring with per-CPU rings, or SO_INCOMING_CPU-style affinity helps distribute the load.
What happens when a BIO is too large for the device?
Every block device advertises limits: max_sectors_kb (maximum I/O size per request, in KiB), max_segments (max scatter-gather entries), and max_segment_size. If a bio exceeds these, bio_split() divides it at the boundary. The split produces two BIOs sharing the same underlying pages -- only the bi_iter state differs. Device mapper relies on this heavily: dm_accept_partial_bio() splits at stripe, chunk, and device boundaries. The split is invisible to the filesystem; it sees a single completion when all splits finish (via the bio chain mechanism).
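A kernel-side sketch of that split-and-chain pattern, loosely modeled on what the block layer and device-mapper targets do; the wrapper function and the max_sectors parameter are placeholders, while bio_split(), bio_chain(), and submit_bio_noacct() are the real interfaces:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Split off the first max_sectors of `bio` and submit both pieces.
 * No data is copied: the child shares the parent's bio_vec array and
 * only the bi_iter state differs. */
static void example_split_and_submit(struct bio *bio, unsigned int max_sectors)
{
	if (bio_sectors(bio) > max_sectors) {
		/* Child covers the first max_sectors; the parent's bi_iter is
		 * advanced past them. (&fs_bio_set is a shared bio_set; real
		 * code usually allocates its own.) */
		struct bio *split = bio_split(bio, max_sectors, GFP_NOIO,
					      &fs_bio_set);

		/* Parent's bi_end_io fires only after both pieces complete. */
		bio_chain(split, bio);
		submit_bio_noacct(split);
	}
	submit_bio_noacct(bio);	/* remainder (or the whole bio if no split) */
}
```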
What is the difference between plugging and the I/O scheduler?
Plugging is per-task and short-lived: it batches BIOs within a single syscall to give the block layer a brief merge window. The I/O scheduler is per-device and persistent: it holds requests across syscalls, reordering them for seek optimization (mq-deadline) or fairness (bfq). Plugging is always active. The I/O scheduler is optional and selectable per device. For NVMe, plugging provides the only useful merge window since the scheduler is "none".
How does blk-mq handle devices with fewer queues than CPUs?
If a device has 4 hardware queues but the system has 64 CPUs, blk-mq maps 16 CPUs to each hctx. Requests from those 16 CPUs contend for the same hctx tag bitmap and dispatch lock. This is still far better than the legacy single-queue design, but there is measurable contention. NVMe devices typically expose at least as many queues as CPUs (up to 65535 queues per the spec), avoiding this issue entirely.
How Technologies Use This
A PostgreSQL 16 instance on a 64-core server writes WAL (Write-Ahead Log) to a dedicated NVMe SSD (Intel P5800X) that exposes 64 hardware submission queues. During a peak OLTP workload of 80,000 transactions per second, PostgreSQL generates approximately 400 MB/s of sequential WAL writes. Each WAL write produces a struct bio that the kernel submits to the block layer. The critical question is whether the block layer can dispatch these BIOs to the NVMe device fast enough to avoid becoming a bottleneck.
The blk-mq (multi-queue) layer creates one hardware dispatch queue (hctx) per NVMe submission queue, mapped 1:1 to CPU cores. When a PostgreSQL backend on CPU 17 issues a WAL write, the kernel builds a struct bio, checks for merge opportunities with pending requests in the per-task plug list, and dispatches the resulting struct request directly to hctx 17. The NVMe driver (drivers/nvme/host/pci.c) writes the command to submission queue 17 and rings the corresponding doorbell register. Completion interrupts arrive on CPU 17, and the io_comp_batch mechanism batches multiple completions into a single softirq invocation. No shared, contended lock serializes submissions, and no cross-core cache-line bouncing occurs because each CPU exclusively owns its queue pair.
Tuning queue depth matters for WAL write throughput. With nr_requests set to the default 256 per hctx, PostgreSQL's WAL writer can have 256 outstanding I/Os per CPU before the blk-mq layer begins throttling. On the P5800X, which sustains 900,000 random 4K read IOPS across all queues, the WAL write workload (sequential, mostly 8 KB to 64 KB chunks) never approaches this limit. Monitoring /sys/kernel/debug/block/nvme0n1/hctx*/dispatched reveals per-queue dispatch counts, and any imbalance indicates NUMA misconfiguration where WAL writer processes are scheduled on CPUs whose hctx maps to a remote NUMA node's NVMe controller.
A Docker host runs 30 containers using the overlay2 storage driver on an ext4 filesystem backed by a SATA SSD. When a container modifies a file inherited from a lower layer, overlay2 performs a copy-up operation: the kernel reads the entire file from the lower directory into page cache, then writes it to the upper directory. For a container that modifies a 50 MB log file, this copy-up generates two sets of BIOs, one for reading 50 MB and one for writing 50 MB, even though only a few bytes changed. Across 30 containers performing copy-ups simultaneously, the SATA SSD is asked to absorb roughly 3 GB of amplified I/O in a burst, far more than its 550 MB/s sequential write bandwidth can drain quickly.
Each copy-up write creates a struct bio with bio_vec entries pointing to the page cache pages holding the file data. These BIOs enter the blk-mq software staging queue, where the mq-deadline or BFQ scheduler attempts to merge adjacent BIOs into larger struct request objects. The copy-up pattern is sequential within a single file but interleaved across containers, so the scheduler's merge rate drops from 90% (single-stream sequential) to approximately 30% (multi-stream interleaved). The resulting flood of small, non-mergeable requests saturates the SATA command queue (depth 32), and iowait on the host climbs above 60%. Containers not performing copy-ups still experience degraded I/O because their read and write BIOs compete in the same single hardware queue that SATA exposes.
Monitoring with iostat -x (whose figures derive from /proc/diskstats) reveals the amplification: avgqu-sz (average queue size) exceeds the device queue depth, and await (average I/O wait time) spikes from 0.5 ms to 40 ms during copy-up storms. The mitigation is to pre-copy frequently modified files into the container's writable layer at build time (COPY in the Dockerfile), to use volumes for write-heavy paths (which bypass overlay2 entirely), or to move to an NVMe device where blk-mq's per-CPU hardware queues eliminate the single-queue bottleneck that SATA imposes.
A RocksDB instance running inside a cgroup-v2 container handles 50,000 point reads per second while background compaction rewrites 2 GB of SST files per minute. Both workloads share the same NVMe device. During heavy compaction, foreground read latency spikes from 0.2 ms to 5 ms because compaction BIOs flood the blk-mq dispatch queues, and the NVMe device's internal scheduling favors the large sequential writes over the small random reads.
At the BIO layer, foreground reads generate 4 KB struct bio objects flagged with REQ_SYNC, while compaction writes produce 64 to 256 KB BIOs flagged with REQ_BACKGROUND. The blk-mq scheduler (mq-deadline on most production configurations) separates reads and writes into distinct internal queues and guarantees read starvation cannot exceed a configurable deadline (default 500 ms for reads, 5 seconds for writes). However, the NVMe device itself has no awareness of this priority, and its internal write-back cache can stall read completions during large sequential write bursts. The cgroup-v2 I/O controller applies throttling at the BIO submission layer: writing "<major>:<minor> rbps=max wbps=209715200" (a 200 MB/s write cap) to the container's io.max limits the rate at which compaction BIOs enter the blk-mq queues.
With io.max throttling compaction writes to 200 MB/s, foreground read BIOs no longer compete with a burst of compaction traffic in the NVMe submission queues. Read P99 latency drops from 5 ms back to 0.3 ms. The tradeoff is that compaction falls behind during write-heavy workloads, causing LSM tree depth to grow and increasing read amplification. Monitoring /sys/fs/cgroup/<container>/io.stat shows the bytes and I/Os actually dispatched per device (the rbytes, wbytes, rios, and wios fields), enabling operators to tune the io.max write cap based on the ratio of compaction throughput to foreground read latency.
Same Concept Across Tech
| Technology | How it uses the BIO / blk-mq path | Key consideration |
|---|---|---|
| NVMe | 1:1 mapping of blk-mq hctx to NVMe SQ/CQ pairs. nvme_queue_rq() builds SQE from struct request. MSI-X routes completions to the issuing CPU | Set scheduler to "none". Ensure application spreads I/O across CPUs to use all queues |
| SCSI / SAS | Shared tag set across LUNs. scsi_queue_rq() converts request to SCSI CDB. Typically fewer hardware queues than NVMe | mq-deadline often beneficial for rotational SCSI drives. Monitor per-LUN queue depth |
| device mapper (LVM) | Intercepts BIOs at the dm target layer. dm-stripe splits BIOs at chunk boundaries. dm-crypt encrypts bio pages in-place | BIO splitting adds CPU overhead proportional to stripe count. Align I/O size to stripe chunk size |
| ext4 / XFS | Filesystems build multi-page BIOs from page cache dirty pages. Extent-aligned writeback maximizes merge potential | Fragmented files produce small, non-mergeable BIOs. Defragment or preallocate extents |
| io_uring | SQPOLL mode submits BIOs from a kernel thread without syscall overhead. Batches submissions for plug merging | One SQPOLL kernel thread per ring submits from a single CPU. Use multiple rings for multi-queue saturation |
Stack layer mapping (NVMe not reaching rated IOPS):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is I/O submission synchronous and single-threaded? | strace -c to count syscalls/sec |
| Block layer | Are BIOs landing on a single CPU/queue? | bpftrace block:block_bio_queue by CPU |
| blk-mq | How many hctx queues are receiving dispatches? | /sys/kernel/debug/block/<dev>/hctx*/dispatched |
| I/O Scheduler | Is an unnecessary scheduler adding latency? | cat /sys/block/*/queue/scheduler |
| Driver | Is the NVMe queue depth being reached? | iostat -x avgqu-sz per device |
| Hardware | Are all NVMe queues initialized? | dmesg, nvme smart-log |
Design Rationale
The legacy block layer used a single request queue per device, protected by a spinlock. This design predated SSDs and NVMe. When devices could handle 200 IOPS (rotational drives), a single lock was irrelevant. When devices reached 100K+ IOPS (early SSDs), lock contention consumed 30-40% of CPU. When NVMe devices reached 1M+ IOPS, the single queue became a hard ceiling. blk-mq (merged in Linux 3.13 by Jens Axboe) redesigned the entire path: per-CPU software queues eliminate lock contention on the submission side, per-device hardware queues map directly to device capabilities, and tag-based allocation replaces the linear scan of the old request pool. The result is that block layer overhead dropped from ~20 microseconds per I/O to under 1 microsecond.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| NVMe IOPS far below rated spec | Single-threaded or single-CPU I/O submission, only one hctx active | bpftrace block_bio_queue by CPU, check hctx dispatched counts |
| High CPU usage at moderate IOPS | I/O scheduler overhead on NVMe, or excessive BIO splitting at DM layer | Switch scheduler to "none", check bio_split tracepoints |
| iostat shows zero read/write merges on sequential workload | Plug merging disabled, O_DIRECT with tiny writes, or nomerges is set | cat /sys/block/<dev>/queue/nomerges, check blktrace for M events |
| Latency spikes at high queue depth | Tag exhaustion in hctx, all tags allocated, new I/O must wait | Check nr_requests vs actual in-flight, increase nr_requests if device supports it |
| Device mapper LVM shows higher latency than raw device | BIO splitting overhead at stripe boundaries | Align I/O size to LVM stripe chunk size, reduce stripe count |
| One CPU at 100% while others idle during I/O workload | All I/O funneling through one software queue | Spread I/O threads across CPUs, use io_uring with multiple rings |
| avgqu-sz stuck at 1 despite NVMe queue depth of 1024 | Synchronous I/O pattern, each request waits for completion before next submission | Switch to async I/O (io_uring, libaio) or increase thread count significantly |
When to Use / Avoid
Relevant when:
- Diagnosing why an NVMe device is not reaching advertised IOPS
- Tuning I/O scheduler selection for different device types (NVMe vs SATA vs HDD)
- Understanding device mapper and LVM bio splitting overhead
- Profiling block I/O latency with blktrace or bpftrace
- Working on kernel drivers or filesystem code that submits BIOs
Watch out for:
- Single-threaded synchronous I/O will never saturate a multi-queue device regardless of queue configuration
- I/O schedulers add CPU overhead and latency; NVMe devices generally perform best with "none"
- BIO splitting at device mapper boundaries is invisible to applications but adds kernel CPU time per I/O
- Per-CPU queue mapping means NUMA-unaware thread placement causes cross-node memory access for I/O buffers
Try It Yourself
# Trace block I/O events on an NVMe device (Q=queue, M=merge, D=dispatch, C=complete)
blktrace -d /dev/nvme0n1 -o - | blkparse -i - | head -50

# Shorthand: live block trace stream
btrace /dev/nvme0n1

# Check current I/O scheduler and available options
cat /sys/block/nvme0n1/queue/scheduler

# Set scheduler to none for NVMe (no reordering overhead)
echo none > /sys/block/nvme0n1/queue/scheduler

# Check number of hardware dispatch queues
ls -d /sys/kernel/debug/block/nvme0n1/hctx* | wc -l

# Check per-hctx dispatch statistics
for h in /sys/kernel/debug/block/nvme0n1/hctx*/dispatched; do echo "$h:"; cat "$h"; done

# Check queue depth, max transfer size, and merge settings
echo "nr_requests: $(cat /sys/block/nvme0n1/queue/nr_requests)" && echo "max_sectors_kb: $(cat /sys/block/nvme0n1/queue/max_sectors_kb)" && echo "nomerges: $(cat /sys/block/nvme0n1/queue/nomerges)"

# Watch merge rates with iostat (rrqm/s and wrqm/s columns)
iostat -xm 1 /dev/nvme0n1

# Trace BIO submission by CPU using bpftrace
bpftrace -e 'tracepoint:block:block_bio_queue { @[cpu] = count(); }'

# Histogram of request sizes at dispatch time
bpftrace -e 'tracepoint:block:block_rq_insert { @bytes = hist(args->bytes); }'

# Count bio splits (indicates DM or alignment splitting)
bpftrace -e 'tracepoint:block:block_split { @splits = count(); }'

# Check device stat counters (reads, read_merges, read_sectors, read_ms, writes, ...)
cat /sys/block/nvme0n1/stat

Debug Checklist
1. Check current I/O scheduler: cat /sys/block/<dev>/queue/scheduler
2. Check hardware queue count: ls /sys/kernel/debug/block/<dev>/hctx* | wc -l
3. Check queue depth: cat /sys/block/<dev>/queue/nr_requests
4. Check max transfer size: cat /sys/block/<dev>/queue/max_sectors_kb
5. Check merge statistics: cat /sys/block/<dev>/stat (field 2 = reads merged, field 6 = writes merged)
6. Check per-hctx dispatch counts: cat /sys/kernel/debug/block/<dev>/hctx*/dispatched
7. Trace BIO submission by CPU: bpftrace -e 'tracepoint:block:block_bio_queue { @[cpu] = count(); }'
8. Trace request sizes: bpftrace -e 'tracepoint:block:block_rq_insert { @bytes = hist(args->bytes); }'
9. Check merge disable: cat /sys/block/<dev>/queue/nomerges (0 = merging enabled)
10. Check NVMe queue utilization: nvme list and dmesg | grep nvme for queue setup messages
Key Takeaways
- ✓A struct bio describes a single contiguous I/O operation on disk but can reference scattered pages in memory through bio_vec entries. A struct request aggregates multiple contiguous BIOs into a single unit for the driver. The bio is the filesystem-to-block interface; the request is the block-to-driver interface.
- ✓The blk-mq layer eliminates the single-queue bottleneck that capped legacy block I/O at roughly 500K IOPS regardless of device capability. By mapping per-CPU software queues to per-device hardware queues, it scales linearly with core count. A 64-core server with an NVMe device goes from 500K IOPS (single queue, lock-bound) to 1M+ IOPS (multi-queue, lock-free).
- ✓BIO merging happens at two levels. First, the plug list: within a single syscall, the kernel accumulates BIOs in a per-task plug and merges adjacent ones before releasing them. Second, the I/O scheduler: if enabled, it reorders and merges requests in the software staging queue. For random I/O workloads, merging provides no benefit, and the "none" scheduler avoids the overhead entirely.
- ✓The bio split mechanism (bio_split()) is critical for device mapper and RAID. When a bio crosses a stripe boundary or chunk size limit, the block layer splits it into two BIOs at the boundary. The split bio shares the original's pages via bio_vec references -- no data copying occurs. The original bio's bi_iter is adjusted to cover only the remaining range.
- ✓Tag-based completion in blk-mq assigns each in-flight request a unique integer tag from the hctx tag bitmap. When the device signals completion, it returns the tag, and the kernel looks up the request directly by tag index. No scanning of a completion list is needed. This is O(1) per completion, essential at 1M+ IOPS.
Common Pitfalls
- ✗Using synchronous I/O from too few threads against NVMe. Each synchronous read() or write() blocks the thread until the single I/O completes. With 16 threads and 200us device latency, the maximum throughput is 16 / 0.0002 = 80K IOPS, regardless of device capability. Either increase thread count to hundreds or switch to io_uring / libaio for asynchronous submission.
- ✗Running an I/O scheduler on NVMe devices. mq-deadline or bfq add latency and CPU overhead for reordering that NVMe firmware handles internally. For NVMe, set the scheduler to "none" via echo none > /sys/block/nvme0n1/queue/scheduler. Reserve mq-deadline for rotational drives where seek optimization matters.
- ✗Assuming a single submission thread can saturate a multi-queue device. blk-mq maps software queues to CPUs. If all I/O originates from one CPU, only one hardware queue receives work. The other 63 queues sit idle. Spread I/O across CPUs using multiple threads, io_uring with SQPOLL, or multiple file descriptors with separate aio contexts.
- ✗Ignoring the max_sectors_kb and max_segments limits. If a bio exceeds the device's maximum transfer size, the block layer splits it. Frequent splitting adds overhead. Aligning application I/O size to /sys/block/<dev>/queue/max_sectors_kb avoids unnecessary splits.
- ✗Disabling plug merging by calling blk_finish_plug() too early or issuing O_DIRECT writes one page at a time. The plug batches BIOs from a single syscall, giving the block layer a window to merge. Issuing tiny, unplugged writes defeats this optimization and inflates IOPS unnecessarily.
Reference
In One Line
struct bio carries each I/O from filesystem to block layer, requests merge adjacent BIOs into efficient batches, and blk-mq fans them out across per-CPU hardware queues so NVMe devices can actually hit a million IOPS.