Disk I/O Scheduling
Mental Model
An elevator in a tall building. People press buttons on different floors. A naive elevator answers calls in the order they arrive, zig-zagging across 30 floors. A smarter one batches by direction -- everyone going up first, then everyone going down. The I/O scheduler does the same with disk requests: reordering to minimize head travel on HDDs, or guaranteeing fairness between competing workloads on shared storage.
The Problem
PostgreSQL queries normally return in 2ms. Every night at 2 AM, response times spike to 20ms -- no code changes, no deployment. VACUUM is running, flooding the disk with sequential background reads. Query I/O sits behind VACUUM in the same queue, same priority, same treatment. Background maintenance eats 90% of disk bandwidth while latency-sensitive foreground queries starve.
Architecture
The database is fast. Queries return in milliseconds. Then a maintenance job kicks in and everything slows to a crawl.
The disk did not get slower. The I/O scheduler just stopped prioritizing the queries. Background writes are now sitting in the same queue, getting the same treatment, consuming the same bandwidth. Latency-sensitive reads are stuck behind a wall of batch writes.
This is the I/O scheduling problem. And fixing it starts with understanding what sits between the application and the storage hardware.
What Actually Happens
The I/O scheduler sits between the filesystem/page cache layer and the block device driver. When a filesystem issues a block I/O request (a bio), it enters a per-CPU software queue. The scheduler picks requests from these queues and dispatches them to hardware queues.
The blk-mq architecture (the only block layer since Linux 5.0) uses multiple queues. Each CPU core gets its own software staging queue. These map to one or more hardware dispatch queues. When an NVMe device exposes as many hardware queues as there are CPUs, each CPU maps to its own hardware queue, eliminating cross-CPU contention.
This replaced the legacy single-queue block layer, which serialized all I/O through one spinlock. That lock was the bottleneck preventing Linux from scaling beyond ~1M IOPS on fast storage.
Four schedulers are available in blk-mq:
none -- requests pass straight through with no reordering. Zero overhead. Best for NVMe, where the device's internal controller is smarter than any software scheduler. The controller sees the flash translation layer, wear leveling, and internal parallelism. No software scheduler can beat it from the outside.
mq-deadline -- maintains sector-sorted queues for sequential optimization plus FIFO queues with per-request deadlines. A read waiting longer than read_expire (500ms default) jumps the queue. Writes get 5s. The writes_starved parameter (default 2) limits read batches before writes get a turn. Best for SATA SSDs and HDDs in server workloads.
bfq (Budget Fair Queueing) -- assigns I/O bandwidth budgets to processes based on weight. The only scheduler that honors I/O priority classes. Best for desktops where foreground apps need to stay responsive while background tasks run, or servers that need strict I/O isolation between processes.
kyber -- targets latency rather than fairness. Uses tokens to control queue depth, auto-tuning to meet latency targets. Lightweight, but less commonly deployed than the others.
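When you read the sysfs scheduler file, the active scheduler is shown in brackets (e.g. mq-deadline kyber [none] bfq). A minimal helper for extracting it, assuming that bracketed format (the function name is illustrative, not part of any standard tool):

```shell
# Pull the active scheduler out of the bracketed sysfs format,
# e.g. "mq-deadline kyber [none] bfq" -> "none".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# On a live system you would feed it the sysfs file:
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
active_scheduler "mq-deadline kyber [none] bfq"   # prints: none
```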
Under the Hood
I/O priority only matters with bfq. The ioprio_set() syscall sets per-process I/O priority: real-time (8 levels, highest), best-effort (8 levels, default), or idle (served only when the device is idle). These priorities are encoded in the I/O request and examined by the scheduler.
Here is the catch that trips up everyone: mq-deadline and none completely ignore I/O priorities. If the system uses mq-deadline (the most common server configuration), running ionice -c3 pg_dump does absolutely nothing. The pg_dump process gets the same treatment as production queries. This is one of the most common misconfigurations in database operations.
If ionice needs to work, the block device must be switched to bfq. But bfq has higher CPU overhead at high IOPS. That is the tradeoff.
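Under the hood, ioprio_set(2) packs the class and level into one integer: the class in the high bits (shifted left by 13, per IOPRIO_CLASS_SHIFT in linux/ioprio.h) and the level in the low bits. A sketch of that encoding (the helper function is illustrative):

```shell
# Encoding used by ioprio_set(2) and ionice: value = (class << 13) | level.
# Classes per linux/ioprio.h: 1 = IOPRIO_CLASS_RT, 2 = IOPRIO_CLASS_BE,
# 3 = IOPRIO_CLASS_IDLE. Levels run 0 (highest) to 7 (lowest).
ioprio_value() {
    local class=$1 level=$2
    echo $(( (class << 13) | level ))
}

ioprio_value 2 4   # best-effort, level 4 -> 16388
ioprio_value 3 0   # idle class (what ionice -c3 sets) -> 24576
```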
cgroup v2 I/O control works independently of the scheduler. io.max sets absolute limits (max IOPS and bytes/sec per device). io.weight sets proportional sharing (weight 1-10000). These are enforced at the blk-cgroup layer before requests reach the scheduler, so they work with any scheduler including none. Docker's --device-read-bps and --blkio-weight flags use these.
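A hedged sketch of setting those knobs directly; the cgroup name is hypothetical, and the device numbers (8:0 for sda here) are assumptions you would confirm with lsblk:

```shell
# Assumes cgroup v2 mounted at /sys/fs/cgroup and the io controller
# enabled in the parent's cgroup.subtree_control.
sudo mkdir -p /sys/fs/cgroup/batch

# Absolute limits on device 8:0: 10 MB/s reads, 1000 write IOPS
echo "8:0 rbps=10485760 wiops=1000" | sudo tee /sys/fs/cgroup/batch/io.max

# Proportional weight (default 100, range 1-10000)
echo 50 | sudo tee /sys/fs/cgroup/batch/io.weight

# Move the current shell into the group
echo $$ | sudo tee /sys/fs/cgroup/batch/cgroup.procs
```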
Common Questions
Kafka on HDD is getting high read latency during heavy producer writes. What is the fix?
Classic write starvation. Check cat /sys/block/sda/queue/scheduler. If it is none, switch to mq-deadline. The read_expire deadline (500ms default) guarantees reads are served within half a second. Increasing writes_starved to 4-8 further prioritizes reads over Kafka's sequential writes. Also verify read_ahead_kb is tuned for Kafka's sequential consumer reads (128-256 for HDDs).
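A sketch of that tuning for an HDD broker; the device name sda and the specific values are assumptions to adjust for your hardware:

```shell
# Switch the HDD to mq-deadline
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# Let reads pre-empt more write batches (default 2)
echo 8 | sudo tee /sys/block/sda/queue/iosched/writes_starved

# Larger readahead for Kafka's sequential consumer reads
echo 256 | sudo tee /sys/block/sda/queue/read_ahead_kb
```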
Why does NVMe not need an I/O scheduler?
NVMe devices have internal firmware with deep hardware queues (up to 64K entries per queue, up to 64K queues). Each queue has its own submission and completion pair with independent doorbells -- no head-of-line blocking. The controller has full visibility into the flash translation layer, wear leveling, and internal parallelism. Any software scheduler sitting above this adds latency and CPU overhead while making worse ordering decisions. none simply passes requests through.
How does Docker limit container I/O?
Docker uses the cgroup v2 io controller. --device-read-bps and --device-write-bps set absolute limits via io.max. --blkio-weight sets proportional weight via io.weight. These work independently of the I/O scheduler. One caveat: io.max throttles by sleeping the submitting process. For buffered I/O (non-O_DIRECT), throttling happens at writeback time, not at write() time, which can cause confusing behavior where writes appear fast but dirty page writeback is throttled.
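A sketch of those flags in practice; the image, container names, and device path are illustrative:

```shell
# Hard caps: 50 MB/s reads and 1000 write IOPS on /dev/sda
docker run -d --name throttled-batch \
  --device-read-bps /dev/sda:50mb \
  --device-write-iops /dev/sda:1000 \
  ubuntu sleep infinity

# Proportional weight instead of a hard cap (this flag accepts 10-1000)
docker run -d --name low-prio --blkio-weight 100 ubuntu sleep infinity
```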
What replaced CFQ?
CFQ (Completely Fair Queuing) was removed in Linux 5.0 with the entire single-queue block layer. Its replacement is bfq in the blk-mq framework. bfq provides the same fairness and priority features but works with multi-queue hardware. However, bfq has higher CPU overhead. For workloads that do not need fairness (most servers), mq-deadline or none beats both CFQ and bfq.
How Technologies Use This
Kafka consumer reads spike to 500ms+ latency during heavy producer bursts on HDD-backed brokers. Consumers fall behind, lag grows, and alerts fire even though the disk hardware is healthy.
The problem is write starvation. Producers flood the I/O queue with sequential writes, and consumer reads get stuck behind them with no priority distinction. Without a deadline scheduler, reads have no way to jump the queue regardless of how long they have been waiting.
Set the scheduler to mq-deadline for HDD brokers. Its read_expire parameter (default 500ms) guarantees that any read waiting longer than the deadline is served immediately. For SSD-backed brokers, use none instead, because the NVMe controller's internal 64K-deep queues handle ordering better than any software scheduler, and removing it saves 2-5 microseconds of latency per request.
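Writing to sysfs does not survive a reboot; a udev rule is the usual way to make the choice persistent. A sketch, assuming the conventional rules path (file name and match patterns may need adjusting per distro):

```shell
cat <<'EOF' | sudo tee /etc/udev/rules.d/60-ioschedulers.rules
# Rotational disks (HDDs): mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
# NVMe devices: none
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
EOF
sudo udevadm control --reload
sudo udevadm trigger
```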
Query response times spike 10x every night when VACUUM runs. Users experience multi-second latency on queries that normally return in milliseconds. The DBA runs ionice -c3 on the VACUUM workers, but nothing improves.
Both VACUUM and CHECKPOINT generate massive sequential disk writes that fill the I/O queue, and foreground queries get stuck behind them. The critical gotcha is that ionice has absolutely zero effect with mq-deadline or none schedulers, which are the defaults on most servers. Running ionice on an mq-deadline system gives a false sense of priority isolation while queries still starve.
Switch the block device to bfq and then set VACUUM workers to IOPRIO_CLASS_IDLE via ionice -c3 so they only consume I/O bandwidth when no queries are waiting. This keeps p99 query latency under 5ms even during aggressive autovacuum. Always verify the active scheduler before relying on ionice.
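A hedged sequence for that fix; the device name and the pgrep pattern for autovacuum workers are assumptions to adapt to your system:

```shell
# 1. Verify what is active -- ionice is a no-op unless this shows [bfq]
cat /sys/block/sda/queue/scheduler

# 2. Switch the device to bfq
echo bfq | sudo tee /sys/block/sda/queue/scheduler

# 3. Demote running autovacuum workers to the idle class
for pid in $(pgrep -f 'autovacuum worker'); do
    sudo ionice -c3 -p "$pid"
done

# 4. Confirm the new priority of one worker
ionice -p "$(pgrep -f 'autovacuum worker' | head -1)"
```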
Same Concept Across Tech
| Technology | I/O scheduling impact | Recommended scheduler |
|---|---|---|
| PostgreSQL | VACUUM competes with queries for disk. mq-deadline prevents starvation | mq-deadline on SSD, bfq with ionice for VACUUM |
| Kafka | Sequential log writes benefit from none on NVMe. Consumers do random reads | none for NVMe log disks |
| MySQL (InnoDB) | InnoDB has its own I/O scheduling (innodb_io_capacity). Kernel scheduler should not interfere | none or mq-deadline, avoid bfq overhead |
| Docker | Container I/O goes through host scheduler. blkio cgroup limits bandwidth per container | Set scheduler at host level, per-container limits via cgroups |
| Kubernetes | Pod I/O competes with other pods on the same node. No per-pod scheduler config | Node-level scheduler, cgroup v2 io.max for per-pod limits |
Stack layer mapping (I/O latency spike):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is a background task flooding disk I/O? | Application logs, iotop |
| I/O scheduler | Which scheduler is active? Is it the right one for this hardware? | cat /sys/block/*/queue/scheduler |
| Block device | Queue depth? Average wait time? | iostat -x, /sys/block/*/queue/nr_requests |
| Cgroup | Is there an I/O limit set? Is it being hit? | cat /sys/fs/cgroup/.../io.stat |
| Hardware | Is the disk actually slow? SMART errors? | smartctl -a, hdparm -t |
Design Rationale
Raw request ordering from applications is pathologically bad for rotational media -- random seeks dominate latency, and a single sequential writer can starve every reader indefinitely. That is why a scheduler sits between filesystem and device. The shift from a single-queue block layer to blk-mq was forced by NVMe hardware capable of millions of IOPS but bottlenecked by one spinlock serializing all submissions. Multiple scheduler options (none, mq-deadline, bfq, kyber) exist because the right strategy depends entirely on the hardware: NVMe controllers with 64K-deep internal queues gain nothing from software reordering, while HDDs with 10ms seek times desperately need it.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Query latency spikes during background jobs | Background I/O starving foreground requests, no priority separation | iostat -x, check await. Consider bfq with ionice |
| NVMe performing worse with bfq scheduler | bfq adds overhead that NVMe does not need (NVMe has internal scheduler) | Switch to none: echo none > /sys/block/nvme*/queue/scheduler |
| ionice has no effect | ionice only works with bfq scheduler | Check active scheduler |
| High await in iostat but low disk utilization | I/O requests queued waiting for scheduler to dispatch | Check queue depth, consider reducing nr_requests |
| One container starving another for disk I/O | No per-container I/O limits, first-come-first-served | Set io.max in cgroup v2 |
| HDD random read latency 100x worse than sequential | I/O scheduler not reordering requests to minimize seek | Use mq-deadline or bfq, not none, for HDDs |
When to Use / Avoid
Choose scheduler based on hardware:
- none (noop): NVMe SSDs with internal scheduling. The kernel adds nothing. Lowest overhead.
- mq-deadline: HDDs or SSDs where latency guarantees matter. Ensures every request completes within a deadline.
- bfq (Budget Fair Queueing): When fairness between processes matters and ionice priority is needed. Higher overhead.
- kyber: Auto-tuned for fast devices. Low overhead, good defaults. Good when mq-deadline is too simple but bfq is too expensive.
Watch out for:
- ionice only works with bfq (has no effect with other schedulers)
- NVMe devices default to none, which is correct. Do not change it.
- Changing schedulers at runtime is possible: echo mq-deadline > /sys/block/sda/queue/scheduler
Try It Yourself
# Check the current I/O scheduler for a block device
cat /sys/block/sda/queue/scheduler

# Change the scheduler to mq-deadline at runtime
echo 'mq-deadline' | sudo tee /sys/block/sda/queue/scheduler

# List available schedulers for all block devices
for dev in /sys/block/*/queue/scheduler; do echo "$dev: $(cat $dev)"; done

# Set a process to idle I/O priority (only served when device is idle)
ionice -c3 -p $(pgrep pg_dump)

# Run a command with real-time I/O priority (highest class)
sudo ionice -c1 -n0 dd if=/dev/sda of=/dev/null bs=1M count=100

# Check I/O priority of running processes
ionice -p $(pgrep -d' -p ' postgres)

# Tune mq-deadline parameters
echo 250 | sudo tee /sys/block/sda/queue/iosched/read_expire
echo 1000 | sudo tee /sys/block/sda/queue/iosched/write_expire

# Monitor I/O latency distribution with BPF
sudo biolatency-bpfcc -D 10

# Show per-process I/O stats
sudo iotop -oPa

# Check blk-mq queue mapping (software queues to hardware queues)
cat /sys/block/nvme0n1/mq/*/cpu_list

Debug Checklist
1. Check current and available schedulers: cat /sys/block/sda/queue/scheduler (brackets mark the active one)
2. Check I/O latency: iostat -x 1 (look at the await column)
3. Check per-process I/O: iotop -o
4. Set I/O priority: ionice -c2 -n7 -p <pid> (only works with bfq)
5. Check disk queue depth: cat /sys/block/sda/queue/nr_requests
Key Takeaways
- ✓NVMe SSDs should use 'none' -- no scheduler at all. The device's internal controller with 64K+ queue entries handles ordering better than any software. Adding mq-deadline or bfq to NVMe wastes CPU cycles for zero I/O benefit.
- ✓blk-mq uses per-CPU software queues mapped to hardware dispatch queues. This eliminated the single spinlock that serialized all I/O in the legacy block layer -- the bottleneck that prevented Linux from scaling beyond ~1M IOPS on fast storage.
- ✓ionice is useless with mq-deadline and none schedulers. I/O priority classes only work with bfq (and the removed cfq). Running 'ionice -c3 pg_dump' does nothing on mq-deadline -- a common misconfiguration that gives a false sense of priority isolation.
- ✓mq-deadline prevents read starvation with a 500ms deadline: if a read has waited longer than read_expire, it jumps the queue regardless of other optimizations. Writes get 5s. The writes_starved parameter (default 2) limits how many read batches run before writes get a turn.
- ✓The legacy CFQ scheduler was removed in Linux 5.0 with the entire single-queue block layer. Any tuning guide referencing CFQ, noop, or the old elevator parameter is outdated. Modern tuning targets blk-mq schedulers exclusively.
Common Pitfalls
- ✗Mistake: Using bfq on high-throughput servers. Reality: bfq's per-process budget tracking and internal B-tree operations add significant CPU overhead at high IOPS. For databases and storage nodes, mq-deadline or none provides better throughput with less CPU burn.
- ✗Mistake: Running ionice on processes when using mq-deadline or none. Reality: I/O priorities are completely ignored by these schedulers. You get zero priority isolation. If you need ionice to work, switch to bfq.
- ✗Mistake: Applying HDD-era tuning (large nr_requests, high read_ahead_kb) to NVMe. Reality: NVMe has deep hardware queues and microsecond latencies. Large software queue depths add memory overhead and latency. Default values are usually optimal.
- ✗Mistake: Assuming all block devices use the same scheduler. Reality: Linux allows per-device scheduler selection. Your NVMe boot drive can use 'none' while your HDD data drive uses mq-deadline. Check /sys/block/<dev>/queue/scheduler for each device.
Reference
In One Line
Match the scheduler to the hardware -- none for NVMe, mq-deadline for databases on SSD, bfq when ionice actually needs to work -- and verify with cat /sys/block/*/queue/scheduler before trusting ionice to do anything.