Disk I/O Scheduling
Mental Model
An elevator in a tall building. People press buttons on different floors. A naive elevator answers calls in the order they arrive, zig-zagging across 30 floors. A smarter one batches by direction -- everyone going up first, then everyone going down. The I/O scheduler does the same with disk requests: reordering to minimize head travel on HDDs, or guaranteeing fairness between competing workloads on shared storage.
The Problem
PostgreSQL queries normally return in 2ms. Every night at 2 AM, response times spike to 20ms -- no code changes, no deployment. VACUUM is running, flooding the disk with sequential background reads. Query I/O sits behind VACUUM in the same queue, same priority, same treatment. Background maintenance eats 90% of disk bandwidth while latency-sensitive foreground queries starve.
Architecture
The database is fast. Queries return in milliseconds. Then a maintenance job kicks in and everything slows to a crawl.
The disk did not get slower. The I/O scheduler just stopped prioritizing the queries. Background writes are now sitting in the same queue, getting the same treatment, consuming the same bandwidth. Latency-sensitive reads are stuck behind a wall of batch writes.
This is the I/O scheduling problem. And fixing it starts with understanding what sits between the application and the storage hardware.
What Actually Happens
The I/O scheduler sits between the filesystem/page cache layer and the block device driver. When a filesystem issues a block I/O request (a bio), it enters a per-CPU software queue. The scheduler picks requests from these queues and dispatches them to hardware queues.
The blk-mq architecture (the only block layer since Linux 5.0) uses multiple queues. Each CPU core gets its own software staging queue. These map to one or more hardware dispatch queues. When an NVMe device exposes as many hardware queues as there are CPUs, each CPU maps to its own hardware queue, eliminating cross-CPU contention.
This replaced the legacy single-queue block layer, which serialized all I/O through one spinlock. That lock was the bottleneck preventing Linux from scaling beyond ~1M IOPS on fast storage.
Four schedulers are available in blk-mq:
none -- requests pass straight through with no reordering. Zero overhead. Best for NVMe, where the device's internal controller is smarter than any software scheduler. The controller sees the flash translation layer, wear leveling, and internal parallelism. No software scheduler can beat it from the outside.
mq-deadline -- maintains sector-sorted queues for sequential optimization plus FIFO queues with per-request deadlines. A read waiting longer than read_expire (500ms default) jumps the queue. Writes get 5s. The writes_starved parameter (default 2) limits read batches before writes get a turn. Best for SATA SSDs and HDDs in server workloads.
bfq (Budget Fair Queueing) -- assigns I/O bandwidth budgets to processes based on weight. The only scheduler that honors I/O priority classes. Best for desktops where foreground apps need to stay responsive while background tasks run, or servers that need strict I/O isolation between processes.
kyber -- targets latency rather than fairness. Uses tokens to control queue depth, auto-tuning to meet latency targets. Lightweight, but less commonly deployed than the others.
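When you read the sysfs scheduler file, the active scheduler is shown in brackets (e.g. mq-deadline kyber [none] bfq). A minimal helper for extracting it, assuming that bracketed format (the function name is illustrative, not part of any standard tool):

```shell
# Pull the active scheduler out of the bracketed sysfs format,
# e.g. "mq-deadline kyber [none] bfq" -> "none".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# On a live system you would feed it the sysfs file:
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
active_scheduler "mq-deadline kyber [none] bfq"   # prints: none
```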
Under the Hood
I/O priority only matters with bfq. The ioprio_set() syscall sets per-process I/O priority: real-time (8 levels, highest), best-effort (8 levels, default), or idle (served only when the device is idle). These priorities are encoded in the I/O request and examined by the scheduler.
Here is the catch that trips up everyone: mq-deadline and none completely ignore I/O priorities. If the system uses mq-deadline (the most common server configuration), running ionice -c3 pg_dump does absolutely nothing. The pg_dump process gets the same treatment as production queries. This is one of the most common misconfigurations in database operations.
If ionice needs to work, the block device must be switched to bfq. But bfq has higher CPU overhead at high IOPS. That is the tradeoff.
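Under the hood, ioprio_set(2) packs the class and level into one integer: the class in the high bits (shifted left by 13, per IOPRIO_CLASS_SHIFT in linux/ioprio.h) and the level in the low bits. A sketch of that encoding (the helper function is illustrative):

```shell
# Encoding used by ioprio_set(2) and ionice: value = (class << 13) | level.
# Classes per linux/ioprio.h: 1 = IOPRIO_CLASS_RT, 2 = IOPRIO_CLASS_BE,
# 3 = IOPRIO_CLASS_IDLE. Levels run 0 (highest) to 7 (lowest).
ioprio_value() {
    local class=$1 level=$2
    echo $(( (class << 13) | level ))
}

ioprio_value 2 4   # best-effort, level 4 -> 16388
ioprio_value 3 0   # idle class (what ionice -c3 sets) -> 24576
```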
cgroup v2 I/O control works independently of the scheduler. io.max sets absolute limits (max IOPS and bytes/sec per device). io.weight sets proportional sharing (weight 1-10000). These are enforced at the blk-cgroup layer before requests reach the scheduler, so they work with any scheduler including none. Docker's --device-read-bps and --blkio-weight flags use these.
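A hedged sketch of setting those knobs directly; the cgroup name is hypothetical, and the device numbers (8:0 for sda here) are assumptions you would confirm with lsblk:

```shell
# Assumes cgroup v2 mounted at /sys/fs/cgroup and the io controller
# enabled in the parent's cgroup.subtree_control.
sudo mkdir -p /sys/fs/cgroup/batch

# Absolute limits on device 8:0: 10 MB/s reads, 1000 write IOPS
echo "8:0 rbps=10485760 wiops=1000" | sudo tee /sys/fs/cgroup/batch/io.max

# Proportional weight (default 100, range 1-10000)
echo 50 | sudo tee /sys/fs/cgroup/batch/io.weight

# Move the current shell into the group
echo $$ | sudo tee /sys/fs/cgroup/batch/cgroup.procs
```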
Common Questions
Kafka on HDD is getting high read latency during heavy producer writes. What is the fix?
Classic write starvation. Check cat /sys/block/sda/queue/scheduler. If it is none, switch to mq-deadline. The read_expire deadline (500ms default) guarantees reads are served within half a second. Increasing writes_starved to 4-8 further prioritizes reads over Kafka's sequential writes. Also verify read_ahead_kb is tuned for Kafka's sequential consumer reads (128-256 for HDDs).
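A sketch of that tuning for an HDD broker; the device name sda and the specific values are assumptions to adjust for your hardware:

```shell
# Switch the HDD to mq-deadline
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# Let reads pre-empt more write batches (default 2)
echo 8 | sudo tee /sys/block/sda/queue/iosched/writes_starved

# Larger readahead for Kafka's sequential consumer reads
echo 256 | sudo tee /sys/block/sda/queue/read_ahead_kb
```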
Why does NVMe not need an I/O scheduler?
NVMe devices have internal firmware with deep hardware queues (up to 64K entries per queue, up to 64K queues). Each queue has its own submission and completion pair with independent doorbells -- no head-of-line blocking. The controller has full visibility into the flash translation layer, wear leveling, and internal parallelism. Any software scheduler sitting above this adds latency and CPU overhead while making worse ordering decisions. none simply passes requests through.
How does Docker limit container I/O?
Docker uses the cgroup v2 io controller. --device-read-bps and --device-write-bps set absolute limits via io.max. --blkio-weight sets proportional weight via io.weight. These work independently of the I/O scheduler. One caveat: io.max throttles by sleeping the submitting process. For buffered I/O (non-O_DIRECT), throttling happens at writeback time, not at write() time, which can cause confusing behavior where writes appear fast but dirty page writeback is throttled.
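A sketch of those flags in practice; the image, container names, and device path are illustrative:

```shell
# Hard caps: 50 MB/s reads and 1000 write IOPS on /dev/sda
docker run -d --name throttled-batch \
  --device-read-bps /dev/sda:50mb \
  --device-write-iops /dev/sda:1000 \
  ubuntu sleep infinity

# Proportional weight instead of a hard cap (this flag accepts 10-1000)
docker run -d --name low-prio --blkio-weight 100 ubuntu sleep infinity
```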
What replaced CFQ?
CFQ (Completely Fair Queuing) was removed in Linux 5.0 with the entire single-queue block layer. Its replacement is bfq in the blk-mq framework. bfq provides the same fairness and priority features but works with multi-queue hardware. However, bfq has higher CPU overhead. For workloads that do not need fairness (most servers), mq-deadline or none beats both CFQ and bfq.
How Technologies Use This
Kafka consumer reads spike to 500ms+ latency during heavy producer bursts on HDD-backed brokers. Consumers fall behind, lag grows, and alerts fire even though the disk hardware is healthy.
The problem is write starvation. Producers flood the I/O queue with sequential writes, and consumer reads get stuck behind them with no priority distinction. Without a deadline scheduler, reads have no way to jump the queue regardless of how long they have been waiting.
Set the scheduler to mq-deadline for HDD brokers. Its read_expire parameter (default 500ms) guarantees that any read waiting longer than the deadline is served immediately. For SSD-backed brokers, use none instead, because the NVMe controller's internal 64K-deep queues handle ordering better than any software scheduler, and removing it saves 2-5 microseconds of latency per request.
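Writing to sysfs does not survive a reboot; a udev rule is the usual way to make the choice persistent. A sketch, assuming the conventional rules path (file name and match patterns may need adjusting per distro):

```shell
cat <<'EOF' | sudo tee /etc/udev/rules.d/60-ioschedulers.rules
# Rotational disks (HDDs): mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
# NVMe devices: none
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
EOF
sudo udevadm control --reload
sudo udevadm trigger
```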
Query response times spike 10x every night when VACUUM runs. Users experience multi-second latency on queries that normally return in milliseconds. The DBA runs ionice -c3 on the VACUUM workers, but nothing improves.
Both VACUUM and CHECKPOINT generate massive sequential disk writes that fill the I/O queue, and foreground queries get stuck behind them. The critical gotcha is that ionice has absolutely zero effect with mq-deadline or none schedulers, which are the defaults on most servers. Running ionice on an mq-deadline system gives a false sense of priority isolation while queries still starve.
Switch the block device to bfq and then set VACUUM workers to IOPRIO_CLASS_IDLE via ionice -c3 so they only consume I/O bandwidth when no queries are waiting. This keeps p99 query latency under 5ms even during aggressive autovacuum. Always verify the active scheduler before relying on ionice.
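A hedged sequence for that fix; the device name and the pgrep pattern for autovacuum workers are assumptions to adapt to your system:

```shell
# 1. Verify what is active -- ionice is a no-op unless this shows [bfq]
cat /sys/block/sda/queue/scheduler

# 2. Switch the device to bfq
echo bfq | sudo tee /sys/block/sda/queue/scheduler

# 3. Demote running autovacuum workers to the idle class
for pid in $(pgrep -f 'autovacuum worker'); do
    sudo ionice -c3 -p "$pid"
done

# 4. Confirm the new priority of one worker
ionice -p "$(pgrep -f 'autovacuum worker' | head -1)"
```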
Same Concept Across Tech
| Technology | I/O scheduling impact | Recommended scheduler |
|---|---|---|
| PostgreSQL | VACUUM competes with queries for disk. mq-deadline prevents starvation | mq-deadline on SSD, bfq with ionice for VACUUM |
| Kafka | Sequential log writes benefit from none on NVMe. Consumers do random reads | none for NVMe log disks |
| MySQL (InnoDB) | InnoDB has its own I/O scheduling (innodb_io_capacity). Kernel scheduler should not interfere | none or mq-deadline, avoid bfq overhead |
| Docker | Container I/O goes through host scheduler. blkio cgroup limits bandwidth per container | Set scheduler at host level, per-container limits via cgroups |
| Kubernetes | Pod I/O competes with other pods on the same node. No per-pod scheduler config | Node-level scheduler, cgroup v2 io.max for per-pod limits |
Stack layer mapping (I/O latency spike):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is a background task flooding disk I/O? | Application logs, iotop |
| I/O scheduler | Which scheduler is active? Is it the right one for this hardware? | cat /sys/block/*/queue/scheduler |
| Block device | Queue depth? Average wait time? | iostat -x, /sys/block/*/queue/nr_requests |
| Cgroup | Is there an I/O limit set? Is it being hit? | cat /sys/fs/cgroup/.../io.stat |
| Hardware | Is the disk actually slow? SMART errors? | smartctl -a, hdparm -t |
Design Rationale
Raw request ordering from applications is pathologically bad for rotational media -- random seeks dominate latency, and a single sequential writer can starve every reader indefinitely. That is why a scheduler sits between filesystem and device. The shift from a single-queue block layer to blk-mq was forced by NVMe hardware capable of millions of IOPS but bottlenecked by one spinlock serializing all submissions. Multiple scheduler options (none, mq-deadline, bfq, kyber) exist because the right strategy depends entirely on the hardware: NVMe controllers with 64K-deep internal queues gain nothing from software reordering, while HDDs with 10ms seek times desperately need it.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Query latency spikes during background jobs | Background I/O starving foreground requests, no priority separation | iostat -x, check await. Consider bfq with ionice |
| NVMe performing worse with bfq scheduler | bfq adds overhead that NVMe does not need (NVMe has internal scheduler) | Switch to none: echo none > /sys/block/nvme*/queue/scheduler |
| ionice has no effect | ionice only works with bfq scheduler | Check active scheduler |
| High await in iostat but low disk utilization | I/O requests queued waiting for scheduler to dispatch | Check queue depth, consider reducing nr_requests |
| One container starving another for disk I/O | No per-container I/O limits, first-come-first-served | Set io.max in cgroup v2 |
| HDD random read latency 100x worse than sequential | I/O scheduler not reordering requests to minimize seek | Use mq-deadline or bfq, not none, for HDDs |
When to Use / Avoid
Choose scheduler based on hardware:
- none (noop): NVMe SSDs with internal scheduling. The kernel adds nothing. Lowest overhead.
- mq-deadline: HDDs or SSDs where latency guarantees matter. Ensures every request completes within a deadline.
- bfq (Budget Fair Queueing): When fairness between processes matters and ionice priority is needed. Higher overhead.
- kyber: Auto-tuned for fast devices. Low overhead, good defaults. Good when mq-deadline is too simple but bfq is too expensive.
Watch out for:
- ionice only works with bfq (has no effect with other schedulers)
- NVMe devices default to none, which is correct. Do not change it.
- Changing schedulers at runtime is possible: echo mq-deadline > /sys/block/sda/queue/scheduler
Try It Yourself
# Check the current I/O scheduler for a block device
cat /sys/block/sda/queue/scheduler

# Change the scheduler to mq-deadline at runtime
echo 'mq-deadline' | sudo tee /sys/block/sda/queue/scheduler

# List available schedulers for all block devices
for dev in /sys/block/*/queue/scheduler; do echo "$dev: $(cat $dev)"; done

# Set a process to idle I/O priority (only served when device is idle)
ionice -c3 -p $(pgrep pg_dump)

# Run a command with real-time I/O priority (highest class)
sudo ionice -c1 -n0 dd if=/dev/sda of=/dev/null bs=1M count=100

# Check I/O priority of running processes
ionice -p $(pgrep -d' -p ' postgres)

# Tune mq-deadline parameters
echo 250 | sudo tee /sys/block/sda/queue/iosched/read_expire
echo 1000 | sudo tee /sys/block/sda/queue/iosched/write_expire

# Monitor I/O latency distribution with BPF
sudo biolatency-bpfcc -D 10

# Show per-process I/O stats
sudo iotop -oPa

# Check blk-mq queue mapping (software queues to hardware queues)
cat /sys/block/nvme0n1/mq/*/cpu_list

Debug Checklist
1. Check current and available schedulers: cat /sys/block/sda/queue/scheduler (brackets mark the active one)
2. Check I/O latency: iostat -x 1 (look at the await column)
3. Check per-process I/O: iotop -o
4. Set I/O priority: ionice -c2 -n7 -p <pid> (only works with bfq)
5. Check disk queue depth: cat /sys/block/sda/queue/nr_requests
Key Takeaways
- ✓NVMe SSDs should use 'none' -- no scheduler at all. The device's internal controller with 64K+ queue entries handles ordering better than any software. Adding mq-deadline or bfq to NVMe wastes CPU cycles for zero I/O benefit.
- ✓blk-mq uses per-CPU software queues mapped to hardware dispatch queues. This eliminated the single spinlock that serialized all I/O in the legacy block layer -- the bottleneck that prevented Linux from scaling beyond ~1M IOPS on fast storage.
- ✓ionice is useless with mq-deadline and none schedulers. I/O priority classes only work with bfq (and the removed cfq). Running 'ionice -c3 pg_dump' does nothing on mq-deadline -- a common misconfiguration that gives a false sense of priority isolation.
- ✓mq-deadline prevents read starvation with a 500ms deadline: if a read has waited longer than read_expire, it jumps the queue regardless of other optimizations. Writes get 5s. The writes_starved parameter (default 2) limits how many read batches run before writes get a turn.
- ✓The legacy CFQ scheduler was removed in Linux 5.0 with the entire single-queue block layer. Any tuning guide referencing CFQ, noop, or the old elevator parameter is outdated. Modern tuning targets blk-mq schedulers exclusively.
Common Pitfalls
- ✗Mistake: Using bfq on high-throughput servers. Reality: bfq's per-process budget tracking and internal B-tree operations add significant CPU overhead at high IOPS. For databases and storage nodes, mq-deadline or none provides better throughput with less CPU burn.
- ✗Mistake: Running ionice on processes when using mq-deadline or none. Reality: I/O priorities are completely ignored by these schedulers. You get zero priority isolation. If you need ionice to work, switch to bfq.
- ✗Mistake: Applying HDD-era tuning (large nr_requests, high read_ahead_kb) to NVMe. Reality: NVMe has deep hardware queues and microsecond latencies. Large software queue depths add memory overhead and latency. Default values are usually optimal.
- ✗Mistake: Assuming all block devices use the same scheduler. Reality: Linux allows per-device scheduler selection. Your NVMe boot drive can use 'none' while your HDD data drive uses mq-deadline. Check /sys/block/<dev>/queue/scheduler for each device.
Reference
In One Line
Match the scheduler to the hardware -- none for NVMe, mq-deadline for databases on SSD, bfq when ionice actually needs to work -- and verify with cat /sys/block/*/queue/scheduler before trusting ionice to do anything.