Process Scheduling (CFS)
Mental Model
A meeting with one microphone. Everyone gets a turn, but not equal time -- some speakers carry more weight and get longer at the mic. A running tally tracks how much each person has spoken; whoever has spoken the least gets the mic next. Hog the mic and your tally races ahead, so you lose your turn sooner. The tally is vruntime.
The Problem
Eight cores, 200 processes that want to run. A Kubernetes pod is delivering half the expected throughput. The interactive shell freezes for seconds while a background build hogs every core. A container with cpu.cfs_quota_us set to 100 ms watches nr_throttled climb in cpu.stat even though host cores sit idle. These are all scheduling problems -- and cpu.stat plus /proc/schedstat expose exactly what is going wrong.
Architecture
Eight CPU cores and two hundred processes that all want to run. Right now. The kernel has to pick one process per core, let it run for a while, yank it off, and give someone else a turn. It does this thousands of times per second, and most of the time nobody notices.
But then one day the interactive shell freezes for two seconds while a runaway make -j64 hogs every core. Or a Kubernetes pod runs at half the expected speed because it is being throttled by a cgroup limit nobody knew existed. That is when the scheduler stops being invisible and starts being the most important thing to understand.
What Actually Happens
When schedule() is called -- either because a timer tick fires, a task blocks, or a higher-priority task wakes up -- the scheduler does the following:
- Checks scheduling classes in strict priority order: stop (kernel-internal), deadline, realtime, fair (CFS), idle.
- For CFS, it picks the task with the lowest vruntime from a red-black tree. The leftmost node is cached, so this is O(1).
- The selected task runs until one of three things happens: the timer tick updates its vruntime and finds someone else has lower vruntime, a newly woken task has lower vruntime, or the task blocks voluntarily.
- The preempted task gets reinserted into the rb-tree at its new vruntime position.
How vruntime works. When a task runs for wall-clock time dt, its vruntime increases by dt * (weight_of_nice_0 / task_weight). A nice 0 task (weight 1024) running for 1ms gets 1ms of vruntime. A nice -5 task (weight ~3121) gets only ~0.33ms of vruntime for the same wall-clock time. It accumulates vruntime three times slower, so it stays near the left of the tree and gets three times more CPU.
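A quick sanity check of that formula: since the weight table is built around weight(nice) ≈ 1024 / 1.25^nice, the vruntime delta reduces to dt * 1.25^nice. A minimal sketch with awk:

# Sketch: vruntime gained per 1 ms of wall-clock CPU at various nice values
awk 'BEGIN { for (n = -5; n <= 10; n += 5) printf "nice %3d -> +%.3f ms vruntime per 1 ms of CPU\n", n, 1.25^n }'
# nice -5 prints ~0.328 ms, matching the ~0.33 ms figure above; nice 0 prints exactly 1.000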
The timer tick. scheduler_tick() fires at HZ frequency (typically 250Hz, meaning every 4ms). It updates the current task's vruntime and checks if another task should preempt. Between ticks, a newly woken task with lower vruntime can also trigger preemption.
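To see what HZ the running kernel was built with (the config file location varies by distribution; some expose /proc/config.gz instead):

# Check the tick frequency compiled into the running kernel
grep 'CONFIG_HZ=' /boot/config-$(uname -r)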
Load balancing. CFS maintains per-CPU runqueues. Periodically (every ~4ms when idle, ~64ms when busy), the load balancer compares load across CPUs and migrates tasks from the busiest to the idlest. It considers NUMA topology, cache affinity, and power management.
Under the Hood
Scheduling classes hierarchy. Linux supports multiple scheduling classes, checked in strict priority order: stop_sched_class (kernel-internal, highest priority), dl_sched_class (SCHED_DEADLINE), rt_sched_class (SCHED_FIFO and SCHED_RR), fair_sched_class (CFS, for SCHED_NORMAL/SCHED_BATCH), and idle_sched_class (lowest). A runnable SCHED_FIFO task always preempts any CFS task. Always.
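chrt can list the policies the running kernel supports, along with their valid priority ranges:

# Show supported scheduling policies and their min/max priorities
chrt -m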
SCHED_DEADLINE. The most sophisticated scheduling policy, and the one most people have never heard of. It implements Earliest Deadline First (EDF) with Constant Bandwidth Server (CBS). A task declares three parameters: runtime (CPU time needed per period), period (how often), and deadline (when it must complete). The kernel performs an admission test to ensure all SCHED_DEADLINE tasks can meet their deadlines. Used for audio processing, robotics, and anything that needs hard real-time guarantees.
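util-linux chrt can launch a task under SCHED_DEADLINE. A sketch, assuming a placeholder binary ./my_task: it requests 10 ms of runtime per 100 ms period (values in nanoseconds), and needs privileges for the kernel's admission test to accept it. The priority argument must be 0 for this policy.

# Sketch: 10 ms of guaranteed CPU per 100 ms period (./my_task is a placeholder)
sudo chrt --deadline \
  --sched-runtime 10000000 \
  --sched-deadline 100000000 \
  --sched-period 100000000 \
  0 ./my_task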
Autogroup scheduling. This is why the desktop stays responsive during make -j64. Added in kernel 2.6.38 and enabled by default on most distributions, autogroups place each TTY session in a separate CFS task group. The build gets one group's fair share. The interactive terminal gets another. The 64 compiler processes compete among themselves, not with the shell.
EEVDF (Earliest Eligible Virtual Deadline First). Starting with kernel 6.6, CFS was replaced by EEVDF. It adds a "virtual deadline" concept to improve latency fairness. Tasks are eligible to run when their vruntime is at or below the queue's average, and among eligible tasks, the one with the earliest virtual deadline runs first. This reduces tail latency compared to pure vruntime-based scheduling.
Common Questions
How does CFS handle a newly forked task's vruntime?
A new task's vruntime is set to max(parent_vruntime, cfs_rq->min_vruntime). Setting it to min_vruntime prevents starvation of existing tasks -- a freshly forked process cannot jump to the front of the queue. The sched_child_runs_first sysctl can bias the parent to be preempted by the child, which is useful when the child immediately calls exec (avoids COW faults from the parent writing to shared pages).
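You can inspect the knob where it still exists (it was dropped during the EEVDF rework on recent kernels, so the command may report an unknown key):

# 1 = child runs first after fork; absent on kernels that removed the knob
sysctl kernel.sched_child_runs_first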
What is the difference between voluntary and involuntary context switches?
Voluntary switches happen when a task blocks on I/O, a mutex, or sleep. Involuntary switches happen when the scheduler preempts a running task because another task with lower vruntime (or a higher priority class) is ready. A high involuntary switch count means CPU contention. Check with /proc/[pid]/status (voluntary_ctxt_switches, nonvoluntary_ctxt_switches) or perf stat.
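A minimal check, assuming sysstat's pidstat is installed and 1234 is a placeholder PID:

# cswch/s = voluntary, nvcswch/s = involuntary, sampled every second
pidstat -w -p 1234 1
# Or read the lifetime counters directly:
grep ctxt /proc/1234/status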
How do cgroups interact with CFS?
With cgroups v2, each cgroup has a cpu.weight (1-10000, default 100) that determines its proportional share. The cgroup's sched_entity competes in the parent's CFS tree. cpu.max (quota/period) sets a hard bandwidth limit -- the cgroup gets throttled when it exceeds its quota. Throttling events show up in cpu.stat as nr_throttled and throttled_usec. This is exactly what Kubernetes uses when CPU limits are set.
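A minimal cgroup v2 sketch, assuming the v2 hierarchy is mounted at /sys/fs/cgroup and the cpu controller is enabled in the parent's cgroup.subtree_control (the demo group name is arbitrary):

# Create a group capped at half a CPU with double the default weight
sudo mkdir /sys/fs/cgroup/demo
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max     # 50 ms quota per 100 ms period
echo 200 | sudo tee /sys/fs/cgroup/demo/cpu.weight             # 2x the default proportional share
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs            # move this shell into the group
cat /sys/fs/cgroup/demo/cpu.stat                               # watch nr_throttled / throttled_usec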
Why might a SCHED_FIFO process not preempt a kernel thread?
Kernel threads running in critical sections with preemption disabled (preempt_disable()) cannot be preempted, even by SCHED_FIFO. Kernel threads handling softirqs or running in interrupt context also have implicit priority over all userspace tasks. The PREEMPT_RT patchset (being merged into mainline gradually since 5.x) converts most spinlocks to sleeping locks and makes most of the kernel fully preemptible.
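To check whether the running kernel is a PREEMPT_RT build:

# The build string includes PREEMPT (or PREEMPT_RT) on preemptible kernels
uname -v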
How Technologies Use This
The pod spec says cpu: 500m. The application handles bursts fine for the first 50 ms of each period, then tail latencies spike 10-50x for the remaining 50 ms. Cores are sitting idle on the node, but the container is being throttled to zero CPU.
The kubelet translates cpu: 500m into CFS bandwidth control: cfs_quota_us=50000 and cfs_period_us=100000. The container gets 50 ms of CPU per 100 ms period. A burst of requests exhausts the quota early, and the kernel throttles every thread in the cgroup until the next period -- even with idle cores. CPU requests are different: they map to cpu.shares (v1) or cpu.weight (v2) for proportional scheduling that only matters under contention.
Check cpu.stat for nr_throttled and throttled_usec to confirm throttling. If throttling appears, either raise CPU limits or switch to requests-only (no limits) so the pod can burst on idle cores. Understanding the quota-vs-shares distinction is the difference between diagnosing this in minutes versus days.
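From inside the container, the check looks like this (path assumes cgroup v2; on v1 the file lives under /sys/fs/cgroup/cpu/ instead):

# nr_throttled = periods in which the quota ran out; throttled_usec = time spent frozen
grep -E 'nr_periods|nr_throttled|throttled_usec' /sys/fs/cgroup/cpu.stat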
A Java application runs in a 4-CPU container, but GC pauses are 300% worse than on bare metal. The GC logs show 64 parallel GC threads fighting for CPU time. Every collection triggers massive CFS throttling that stalls the entire application for hundreds of milliseconds.
The JVM calls sysconf(_SC_NPROCESSORS_ONLN) to detect available CPUs and uses that count to size the parallel GC thread pool, ForkJoinPool, and JIT compilation threads. Inside a cgroup with cpu.max set to 4 CPUs, the JVM still sees the host's 64 physical cores. Those 64 GC threads burn through the cfs_quota_us budget in milliseconds, leaving the application throttled for the rest of each period.
Fix: set -XX:ActiveProcessorCount=4 to override CPU detection. This sizes all thread pools to match the container's actual allocation. Also note that where the JVM maps Thread.setPriority() to nice values, each level is a ~1.25x ratio, so MIN_PRIORITY to MAX_PRIORITY is roughly a 5:1 CPU ratio, not the 10:1 most developers expect.
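The flag is a standard HotSpot option; app.jar is a placeholder. You can confirm what the runtime actually detected with -XX:+PrintFlagsFinal:

# Size GC / JIT / ForkJoin pools for 4 CPUs regardless of what the host shows
java -XX:ActiveProcessorCount=4 -jar app.jar
# Confirm the value the JVM picked up
java -XX:ActiveProcessorCount=4 -XX:+PrintFlagsFinal -version | grep ActiveProcessorCount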
Same Concept Across Tech
| Technology | How scheduling affects it | Key config |
|---|---|---|
| Kubernetes | CPU requests map to cgroup shares, CPU limits map to CFS quota. Throttling shows as nr_throttled in cpu.stat | resources.requests.cpu, resources.limits.cpu |
| Docker | --cpus flag sets CFS quota/period ratio. --cpu-shares sets relative weight | --cpus=2.0 means 200ms quota per 100ms period |
| JVM | GC threads count against container CPU quota. A 4-thread GC in a 2-CPU container gets throttled | -XX:ActiveProcessorCount to override CPU detection |
| Node.js | Single-threaded, but worker_threads and libuv threadpool consume quota | UV_THREADPOOL_SIZE affects total CPU usage |
| Go | GOMAXPROCS defaults to available CPUs. In a container, this may be wrong | runtime.GOMAXPROCS() or GOMAXPROCS env var |
Stack layer mapping (slow container debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app CPU-bound or I/O-bound? | Application profiler, top |
| Runtime | Is GC or threadpool consuming unexpected CPU? | JVM GC logs, Go pprof |
| Cgroup | Is the container being throttled? | cat cpu.stat, look for nr_throttled |
| Scheduler | High context switch count? Runqueue length? | pidstat -w, /proc/schedstat |
| Hardware | NUMA placement? CPU frequency scaling? | numactl, cpufreq-info |
Design Rationale
Fixed timeslices are fundamentally unfair: a process that blocks frequently ends up with less CPU than one that runs continuously, even at the same priority. Virtual runtime fixes this by tracking weighted CPU time and always picking the task with the least. Interactive processes naturally stay responsive because they block often and accumulate vruntime slowly -- no special interactivity heuristic needed. The multiplicative nice scale (1.25x per level) exists because additive differences lose meaning under load; the ratio between nice 0 and nice 1 should be the same whether 5 tasks or 500 are competing. Bandwidth control (quota/period) came later for containers, because proportional shares cannot enforce a hard CPU budget -- sometimes a cgroup genuinely must not exceed a specific amount.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Container at 50% expected throughput | CFS quota too low for thread count | cat cpu.stat, check nr_throttled |
| P99 latency spikes every 100ms | CFS quota exhausted mid-period, tasks queued until next period | Correlate spikes with cfs_period_us boundaries; raise quota or shorten the period |
| Interactive shell freezes during build | Build process at nice 0 competing equally with shell | Renice build process to 10+ |
| High voluntary context switches | I/O-bound process yielding CPU while waiting | pidstat -w, check I/O wait |
| High involuntary context switches | Process preempted because it used its time slice | pidstat -w, check CPU quota |
| JVM GC pauses longer in container than bare metal | GC threads throttled by CFS quota | Set -XX:ActiveProcessorCount to match CPU limit |
When to Use / Avoid
Relevant when:
- Kubernetes pods are throttled and nr_throttled keeps climbing in cpu.stat
- A process gets less CPU than expected despite available cores (check nice values and cgroup shares)
- Tuning latency-sensitive or real-time workloads -- audio, trading, robotics
- Latency spikes correlate with scheduling delays visible in perf sched latency
Watch out for:
- Quota set too low for the thread count -- every thread draws from the same budget
- Confusing shares (proportional, soft, only matters under contention) with quota (absolute hard cap)
- SCHED_FIFO starves all CFS tasks on that core -- always pair it with a bandwidth limit (see the check below)
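The global safety net is worth knowing: by default the kernel reserves 5% of each second for non-realtime tasks.

# 950 ms of every 1000 ms period may go to RT classes; the remainder is reserved for CFS
sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us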
Try It Yourself
# Show scheduling policy and priority of a process
chrt -p $(pidof nginx | awk '{print $1}')

# View CFS tuning knobs (on kernels >= 5.13 these moved to /sys/kernel/debug/sched/)
sysctl kernel.sched_latency_ns kernel.sched_min_granularity_ns kernel.sched_wakeup_granularity_ns

# See per-task CFS stats (vruntime, switches, runtime)
cat /proc/$$/sched | head -20

# Record and analyze scheduling latency with perf
sudo perf sched record -- sleep 5 && sudo perf sched latency --sort max

# Count voluntary vs involuntary context switches
grep ctxt /proc/$$/status

# View CPU affinity mask of a process
taskset -p $$

Debug Checklist
1. Check throttling: cat /sys/fs/cgroup/cpu/container-id/cpu.stat
2. View scheduler stats: cat /proc/<pid>/sched | grep nr_switches
3. Check scheduling policy: chrt -p <pid>
4. Monitor context switches: pidstat -w -p <pid> 1
5. Check CPU shares vs quota: cat /sys/fs/cgroup/cpu/.../cpu.shares and cpu.cfs_quota_us
6. View per-CPU runqueue length: cat /proc/schedstat
Key Takeaways
- ✓CFS has no fixed timeslice. It divides a 'target latency' (6ms base, scaled up with CPU count) among runnable tasks, weighted by their nice values. Each task gets at least 0.75ms (sched_min_granularity). With 100 runnable tasks, the effective period stretches automatically so no slice drops below that floor.
- ✓vruntime is the key insight. It advances slower for high-priority tasks. A nice -20 process accumulates vruntime ~88x slower than a nice +19 process, so it stays at the left of the red-black tree and gets proportionally more CPU. No fixed timeslices needed.
- ✓Nice values are multiplicative, not linear. Most people miss this. Nice 10 vs nice 0 is roughly a 10:1 CPU ratio, not 2:1. Each nice level is a ~1.25x multiplier, and that compounds across 10 levels.
- ✓SCHED_DEADLINE is the most powerful scheduling policy most people have never heard of. Tasks declare runtime/period/deadline parameters and the kernel guarantees the CPU time via an admission test. Setting it requires privileges (CAP_SYS_NICE or root).
- ✓Picking the next task is O(1): the leftmost node of the rb-tree is cached. Insertion and removal are O(log n). scheduler_tick() fires every 4ms (at 250Hz), updates vruntime, and checks whether preemption is needed.
Common Pitfalls
- ✗Calling sched_yield() in a busy loop thinking it helps other threads. Reality: CFS puts the yielding task at the rightmost position of the rb-tree, but it is immediately eligible to run again. sched_yield is only meaningful for real-time scheduling classes and spinlocks.
- ✗Setting a process to SCHED_FIFO at max priority without a safety net. A CPU-bound SCHED_FIFO task starves ALL normal processes on that CPU. Always use SCHED_DEADLINE or cpu.rt_runtime_us cgroup limits to prevent this.
- ✗Assuming nice values are linear. Thinking nice 10 gets 'half' the CPU of nice 0. The 1.25x-per-level multiplicative scheme means nice 10 vs nice 0 is about a 10:1 CPU ratio. The scale is exponential, not linear.
- ✗Ignoring NUMA topology. CFS has per-CPU runqueues and periodically load-balances across them. Migrating a task to a remote NUMA node means slower memory access. Use numactl or cgroup's cpuset to pin latency-sensitive tasks.
Reference
In One Line
Most container CPU problems are quota misconfiguration (hard cap), not shares (soft proportion) -- check nr_throttled in cpu.stat first.