Pressure Stall Information (PSI)
Mental Model
A highway with three lanes. Utilization says "80% of lanes have cars in them." Pressure says "cars are spending 30% of their time sitting in traffic instead of moving." A highway at 80% utilization with zero pressure is flowing freely. A highway at 60% utilization with high pressure has a bottleneck somewhere causing backups. PSI is the traffic delay metric, not the lane occupancy metric.
The Problem
A Kubernetes node running 40 pods at 90% memory utilization shows no warnings. Thirty seconds later, the OOM killer terminates a critical database pod with no prior signal. RSS-based monitoring showed plenty of headroom because cached pages inflated "available" memory. But /proc/pressure/memory told a different story: some avg10 had been climbing for 30 seconds, reaching 45% (tasks stalled on memory nearly half the time). PSI metrics would have given a 30-second early warning window to evict low-priority pods before the OOM kill happened.
Architecture
Return to the node from the problem statement: 40 pods, memory utilization at 90%, a dashboard showing green. Thirty seconds later, the OOM killer terminates a critical database pod. No warning, no graceful shutdown, no chance to reschedule.
The monitoring was watching the wrong metric. It tracked how much memory was used. It did not track how much work was delayed because memory was scarce. That is the difference between utilization and pressure, and Pressure Stall Information (PSI) exists to expose exactly that gap.
What PSI Actually Measures
Every time a task in the Linux kernel cannot make progress because it is waiting for a resource, the kernel records that stall. PSI aggregates these stall times across all CPUs and exposes them through three files:
/proc/pressure/cpu
/proc/pressure/memory
/proc/pressure/io
Each file contains one or two lines:
$ cat /proc/pressure/memory
some avg10=4.67 avg60=2.15 avg300=1.08 total=287263028
full avg10=0.30 avg60=0.12 avg300=0.05 total=18392648
The numbers break down as follows:
- avg10, avg60, avg300: Exponential moving averages of the percentage of time tasks spent stalled, over 10-second, 60-second, and 300-second windows
- total: Cumulative stall time in microseconds since boot
- some: At least one task was stalled during this time, but other tasks could still run
- full: All non-idle tasks were stalled simultaneously -- the system made zero productive progress
For system-wide CPU pressure, only the "some" line is meaningful, because at least one task is always able to run on each busy CPU (the task that is currently running). Newer kernels print a "full" line in /proc/pressure/cpu as well, but it stays at zero; per-cgroup cpu.pressure can report real "full" values, since every task in a cgroup can be waiting while tasks outside it run. Memory and I/O have meaningful "some" and "full" lines.
A memory pressure "some" of avg10=4.67 means that over the last 10 seconds, tasks spent 4.67% of wall-clock time stalled on memory operations (direct reclaim, page fault I/O, swap I/O). A "full" of avg10=0.30 means all tasks were stalled simultaneously for 0.30% of the time.
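Pulling these fields apart in a script is straightforward. A minimal sketch (the field positions match the file format shown above; the variable names are arbitrary):
# Extract "some" avg10 (percent) and total (microseconds stalled since boot)
psi_line=$(grep '^some' /proc/pressure/memory)
# e.g. "some avg10=4.67 avg60=2.15 avg300=1.08 total=287263028"
avg10=$(echo "$psi_line" | awk '{split($2, a, "="); print a[2]}')
total=$(echo "$psi_line" | awk '{split($5, t, "="); print t[2]}')
echo "memory some: avg10=${avg10}%  total=${total} us"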
The Kernel Machinery
PSI hooks into the scheduler and the memory reclaim path. The tracking works at the per-CPU level:
- When a task transitions to a stalled state (waiting for memory reclaim, blocked on I/O, or runnable but not running), the kernel records the timestamp.
- When the task resumes, the kernel calculates the stall duration and updates per-CPU counters.
- Every 2 seconds, a periodic aggregation pass computes the exponential moving averages from the per-CPU counters.
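This cadence is visible from userspace. As a small observational sketch, sample the "some" line twice a second: the averages only step when the aggregation pass runs, while the total counter accrues continuously.
# Sample memory pressure every 0.5s for ~4s
for i in $(seq 8); do
    echo "$(date +%T) $(awk '/^some/ {print $2, $5}' /proc/pressure/memory)"
    sleep 0.5
done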
For memory specifically, stall events are triggered by:
- Direct reclaim: The page allocator cannot find free pages and must synchronously reclaim memory
- Swap read: A task faults on a page that was swapped out
- Thrashing refault: A page that was recently evicted is needed again (refault distance tracking)
- Memory cgroup limit: A cgroup hits its memory.max and the kernel stalls tasks while reclaiming within the cgroup
The "full" metric requires careful accounting. The kernel tracks when all non-idle tasks on a CPU are stalled simultaneously. If even one task is making progress, it counts toward "some" but not "full." The "full" metric is the stronger signal: nonzero "full" pressure means the machine is wasting wall-clock time with zero productive work.
PSI Triggers: Event-Driven Monitoring
Polling /proc/pressure files on a timer works but is wasteful and introduces latency. PSI triggers provide an event-driven alternative through the standard poll()/epoll() interface.
To register a trigger:
# Open the pressure file and write a trigger specification
exec 3<>/proc/pressure/memory
echo "some 150000 1000000" >&3
# Now poll() on fd 3 will return POLLPRI when the threshold is breached
The trigger format is "some|full STALL_US WINDOW_US":
- some 150000 1000000: Fire when tasks are stalled for 150ms (150,000 microseconds) within any 1-second (1,000,000 microsecond) window
- full 50000 1000000: Fire when all tasks are stalled for 50ms per 1-second window
The kernel evaluates triggers internally against its PSI counters and wakes the monitoring process through poll()/epoll() only when the threshold is crossed. No periodic polling needed. The monitoring process sleeps until pressure actually occurs.
Multiple triggers can be registered simultaneously with different thresholds for graduated response:
# Each trigger needs its own open file descriptor on the pressure file;
# a trigger stays registered for as long as that descriptor remains open.
exec 4<>/proc/pressure/memory    # moderate pressure: shed optional work
echo "some 100000 1000000" >&4
exec 5<>/proc/pressure/memory    # high pressure: evict low-priority pods
echo "some 300000 1000000" >&5
exec 6<>/proc/pressure/memory    # critical: all tasks stalled, emergency action
echo "full 100000 1000000" >&6
Per-Cgroup PSI (cgroup v2)
System-wide PSI is useful but cannot pinpoint which workload is causing pressure. With cgroup v2, every cgroup exposes its own pressure files:
/sys/fs/cgroup/system.slice/nginx.service/cpu.pressure
/sys/fs/cgroup/system.slice/nginx.service/memory.pressure
/sys/fs/cgroup/system.slice/nginx.service/io.pressure
This is what makes PSI actionable for container orchestrators. Kubernetes can read the pressure files for each pod's cgroup and evict the specific pods causing pressure, rather than relying on the system-wide OOM killer to pick a victim.
# Check memory pressure for all cgroups under kubepods
find /sys/fs/cgroup/kubepods.slice -name memory.pressure \
-exec sh -c 'echo "--- $1 ---" && cat "$1"' _ {} \;
Per-cgroup PSI triggers work the same way as system-wide triggers. A daemon can epoll() on dozens of per-cgroup pressure files and react to whichever cgroup breaches its threshold first.
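A production daemon would register per-cgroup triggers and epoll() them as described above. A much simpler polling sketch that just flags cgroups currently over a threshold (the 10% value is an arbitrary example) looks like this:
# Flag every cgroup whose memory "some" avg10 exceeds the threshold
THRESHOLD=10.0
find /sys/fs/cgroup -name memory.pressure 2>/dev/null | while read -r f; do
    avg10=$(awk '/^some/ {split($2, a, "="); print a[2]}' "$f" 2>/dev/null)
    [ -z "$avg10" ] && continue
    if awk -v v="$avg10" -v t="$THRESHOLD" 'BEGIN {exit !(v > t)}'; then
        echo "PRESSURE ${avg10}% in ${f%/memory.pressure}"
    fi
done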
Why Utilization Fails and Pressure Works
Consider two scenarios on the same 64 GB node:
Scenario 1: 58 GB used. 50 GB is page cache from a read-heavy workload. The kernel can reclaim cache instantly without stalling any tasks.
- Memory utilization: 90%
- Memory pressure (some avg10): 0.01%
- Status: Healthy. No action needed.
Scenario 2: 45 GB used. 40 GB is anonymous memory (heap allocations, mmap'd data) from a memory-hungry application. Free memory is 2 GB and shrinking.
- Memory utilization: 70%
- Memory pressure (some avg10): 35%
- Status: In trouble. Tasks are stalled on direct reclaim 35% of the time. OOM kill likely within 30-60 seconds.
Utilization says Scenario 1 is worse. Pressure says Scenario 2 is worse. Pressure is correct.
The reason is that utilization conflates "used" with "needed." Page cache memory is used but not needed -- the kernel can drop it at zero cost. Anonymous memory is used and needed -- reclaiming it requires swapping, and if there is no swap, the OOM killer is the only recourse.
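One way to tell which situation a node is in is to read the composition of used memory next to the pressure numbers. A rough sketch using standard /proc/meminfo fields (the interpretation in the comments is a heuristic, not a hard rule):
# How much memory is cheaply reclaimable (Cached) vs pinned (AnonPages)?
awk '/^MemTotal|^MemAvailable|^Cached:|^AnonPages/ {printf "%-14s %8.1f GiB\n", $1, $2/1048576}' /proc/meminfo
awk '/^some/ {print "memory pressure (some):", $2}' /proc/pressure/memory
# Large Cached and near-zero pressure  -> Scenario 1: used but reclaimable
# Large AnonPages and rising pressure  -> Scenario 2: used, needed, contended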
Practical Setup
Check if PSI is available:
# PSI requires Linux 4.20+ with CONFIG_PSI=y
cat /proc/pressure/cpu
# If this file exists, PSI is active
# Check kernel config
grep CONFIG_PSI /boot/config-$(uname -r)
# CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
Monitor pressure in real time:
# All three resources at once
watch -n2 'for r in cpu memory io; do echo "=== $r ===" && cat /proc/pressure/$r; done'
Compute exact pressure over an interval using the total counter:
# Read total stall microseconds, wait 10 seconds, read again
T1=$(awk '/^some/{print $5}' /proc/pressure/memory | cut -d= -f2)
sleep 10
T2=$(awk '/^some/{print $5}' /proc/pressure/memory | cut -d= -f2)
# Percentage of time stalled in the interval
echo "scale=2; ($T2 - $T1) / 100000" | bc
Set up a PSI trigger from a shell script (bash cannot wait for POLLPRI on its own, so this sketch assumes python3 is available and delegates the wait to an inline poll() call):
#!/bin/bash
# psi_watch.sh -- Alert when memory pressure exceeds threshold
THRESHOLD="some 150000 1000000" # 150ms stall per 1s window
exec 3<>/proc/pressure/memory
echo "$THRESHOLD" >&3           # register the trigger on fd 3
echo "Monitoring memory pressure (threshold: $THRESHOLD)..."
while true; do
    # Wait up to 30s for POLLPRI on a duplicate of fd 3 (the trigger firing)
    if python3 -c 'import select,sys; p=select.poll(); p.register(0, select.POLLPRI); sys.exit(0 if p.poll(30000) else 1)' <&3; then
        echo "$(date): Memory pressure threshold breached!"
        cat /proc/pressure/memory
        # Take action: log, alert, evict, etc.
    else
        echo "$(date): No pressure in the last 30s (healthy)"
    fi
done
Common Questions
What is the difference between PSI memory pressure and memory utilization?
Utilization measures how much memory is allocated. Pressure measures how often tasks are stalled waiting for memory. A system can be at 95% utilization with 0% pressure (most memory is reclaimable cache) or at 50% utilization with 40% pressure (the kernel is thrashing trying to reclaim active anonymous pages). Pressure correlates with application latency; utilization does not.
Can PSI work with cgroup v1?
System-wide PSI (/proc/pressure/) works regardless of cgroup version. Per-cgroup PSI (memory.pressure, cpu.pressure, io.pressure in each cgroup directory) requires cgroup v2. Most modern distributions have migrated to cgroup v2 by default. Check with mount | grep cgroup2.
What PSI thresholds should be used for alerting?
There is no universal answer, but reasonable starting points for memory pressure on a production server: avg10 > 10% for warning, avg10 > 40% for critical, "full" avg10 > 0 for emergency. These depend on workload tolerance for latency. Latency-sensitive services should alert at lower thresholds than batch workloads. Start with the defaults and tune based on correlation with actual incident data.
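Turned into a quick check, those starting points look like the sketch below (the thresholds are the suggested defaults from this answer, not universal constants):
# Classify current memory pressure against the starting thresholds
SOME=$(awk '/^some/ {split($2, a, "="); print a[2]}' /proc/pressure/memory)
FULL=$(awk '/^full/ {split($2, a, "="); print a[2]}' /proc/pressure/memory)
awk -v some="$SOME" -v full="$FULL" 'BEGIN {
    if (full > 0)       print "EMERGENCY: full avg10 =", full "%"
    else if (some > 40) print "CRITICAL:  some avg10 =", some "%"
    else if (some > 10) print "WARNING:   some avg10 =", some "%"
    else                print "OK:        some avg10 =", some "%"
}'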
How does PSI interact with the OOM killer?
PSI does not trigger the OOM killer. PSI is purely a measurement and notification system. The OOM killer fires when the kernel physically cannot allocate memory. PSI's value is giving advance warning: memory pressure typically rises 10-60 seconds before the OOM killer fires. Systems like oomd and kubelet use that window to take graceful action (evicting pods, killing low-priority cgroups) instead of letting the OOM killer make an uninformed choice.
What is the overhead of PSI?
PSI adds negligible overhead in normal operation. The per-task state tracking adds a few instructions to scheduler transitions. The periodic averaging pass runs every 2 seconds and takes microseconds. PSI triggers add a check on the averaging path but only wake userspace when thresholds are actually breached. Meta runs PSI on every server in its fleet with no measurable performance impact.
How Technologies Use This
A 64 GB Kubernetes node runs 40 pods with a combined memory request of 48 GB. The kubelet's default eviction policy monitors memory.available (free RAM plus reclaimable caches) and triggers pod eviction when available memory drops below 100 MB. On this node, memory.available reports 6 GB, so no eviction occurs. However, 30 seconds later the OOM killer terminates a critical database pod because the kernel spent the preceding interval in aggressive direct reclaim, thrashing pages between active and inactive LRU lists at a rate that left tasks stalled 40% of the time.
PSI-based eviction replaces the utilization-threshold approach with a contention-based signal. The kubelet reads /proc/pressure/memory and monitors the some avg10 metric, which reports the percentage of time over the last 10 seconds that at least one task was stalled waiting for memory. When avg10 crosses a configured threshold (for example, 10%), the kubelet begins evicting BestEffort and Burstable pods in priority order. The eviction happens before physical memory is exhausted, preventing the OOM killer from making an uninformed victim selection based solely on oom_score_adj values.
The distinction PSI captures is between memory that is used and memory that is contended. A node at 90% memory utilization with zero reclaim stalls (avg10 = 0%) is healthy because all allocations are satisfied from free lists or clean page reclaim. A node at 70% utilization with avg10 at 40% is in crisis because the kernel spends 40% of wall-clock time in the reclaim path, stalling application threads. The kubelet cannot distinguish these two states using memory.available alone, but PSI surfaces the difference directly.
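The decision logic is simple enough to sketch as a node-local watcher. The following is illustrative only, not the kubelet's implementation; the 10% threshold is a placeholder and the eviction step is left as a stub:
# Watch system-wide memory pressure and signal when eviction should begin
EVICT_THRESHOLD=10.0   # percent of time stalled, placeholder value
while sleep 2; do
    avg10=$(awk '/^some/ {split($2, a, "="); print a[2]}' /proc/pressure/memory)
    if awk -v v="$avg10" -v t="$EVICT_THRESHOLD" 'BEGIN {exit !(v >= t)}'; then
        echo "$(date): some avg10=${avg10}% >= ${EVICT_THRESHOLD}% -- evict low-priority pods"
        # Stub: real eviction would go through the Kubernetes eviction API
    fi
done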
A bare-metal server runs 15 systemd services including a batch ETL job that allocates 6 GB of anonymous memory in bursts. The server has 32 GB of RAM, and the ETL service's cgroup has memory.max set to 8 GB. During peak allocation, the ETL service reaches 7.5 GB, still under its hard limit. Yet the server's overall responsiveness degrades because the kernel enters direct reclaim for the ETL cgroup, stalling both the ETL threads and unrelated processes competing for the same NUMA node's free pages.
systemd handles this through systemd-oomd. A unit opts in with ManagedOOMMemoryPressure=kill, which tells systemd-oomd to monitor the PSI memory pressure metrics for that unit's cgroup subtree. (The separate MemoryPressureWatch= setting, added in systemd 254, is a different mechanism that delivers pressure events to the service's own event loop rather than to systemd-oomd.) The systemd-oomd daemon periodically samples the some avg10 value from /sys/fs/cgroup/<unit>/memory.pressure. When pressure stays above the configured limit (ManagedOOMMemoryPressureLimit=, with oomd.conf defaults of roughly 60% sustained for 30 seconds), systemd-oomd kills the highest-pressure cgroup under its watch by delivering SIGKILL to all processes in the offending cgroup.
This mechanism reacts to contention rather than absolute usage. The ETL service at 7.5 GB out of an 8 GB limit might be perfectly healthy if its pages are cached file data that the kernel can reclaim cheaply. But the same 7.5 GB composed entirely of dirty anonymous pages triggers constant kswapd scanning and direct reclaim, which PSI reports immediately. systemd-oomd terminates the ETL service before the hard limit triggers the cgroup OOM path, allowing the administrator to configure a restart policy (Restart=on-failure) that brings the service back with a clean memory state rather than leaving it in a half-killed state that the cgroup OOM killer sometimes produces.
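Expressed as configuration, the opt-in looks roughly like the drop-in below. The unit name etl-batch.service and the 20% limit are illustrative assumptions, and systemd-oomd itself must be enabled on the host.
# Hypothetical drop-in enabling pressure-based killing for the ETL unit
sudo mkdir -p /etc/systemd/system/etl-batch.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/etl-batch.service.d/oomd.conf
[Service]
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=20%
EOF
sudo systemctl daemon-reload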
An Android phone with 4 GB of RAM runs 80 processes: the foreground app, 15 visible or perceptible background services, and 60+ cached background processes. The old in-kernel lowmemorykiller driver used fixed RSS watermarks (for example, kill cached processes when free memory drops below 80 MB) that required per-device tuning. A phone with 4 GB needed different thresholds than one with 12 GB, and the static thresholds could not distinguish between free memory consumed by useful page cache and free memory lost to anonymous page thrashing.
The userspace lmkd (low-memory killer daemon) replaces the kernel driver by registering PSI triggers on /proc/pressure/memory. It configures multiple stall thresholds: a moderate trigger (some 70000 1000000, meaning 70 ms of partial stall per 1-second window) initiates killing of cached background processes in adj-score order, while a severe trigger (full 700000 1000000, meaning tasks are completely stalled 70% of the time) escalates to killing perceptible background services. The kernel delivers a notification to lmkd through a poll()-able file descriptor whenever the stall window crosses the registered threshold.
PSI normalizes memory pressure as a percentage of time tasks spend stalled, making the same thresholds work across devices with different RAM capacities. A 4 GB phone and a 12 GB phone both experience "70 ms of stall per second" as the same severity of contention, even though their absolute free memory values differ by 3x. The result is 15 to 20% fewer unnecessary process kills on low-RAM devices (measured in Android 10 internal testing) and faster response to genuine memory emergencies, because lmkd reacts within one PSI polling window (1 second) rather than waiting for a watermark threshold that may trail actual contention by several seconds.
Same Concept Across Tech
| Technology | How it uses PSI | Key detail |
|---|---|---|
| Kubernetes | kubelet reads /proc/pressure/memory for eviction decisions | PSI eviction supplements memory.available thresholds with actual stall data |
| systemd | systemd-oomd monitors per-unit cgroup pressure (ManagedOOMMemoryPressure=kill) | Kills the unit's cgroup when memory.pressure stays above the configured limit |
| Android lmkd | Registers PSI triggers via poll() on /proc/pressure/memory | Multiple thresholds for graduated killing (cached apps first, then perceptible) |
| Meta oomd | Watches memory.pressure across cgroup v2 hierarchy | Kills pressure-causing cgroups 10-60 seconds before OOM killer would fire |
| Facebook/Meta fleet | PSI was created to solve fleet-wide OOM unpredictability | Replaced ad-hoc memory watermarks with a single normalized pressure metric |
Stack layer mapping (unexpected OOM kill on a "healthy" node):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is a specific process allocating aggressively or leaking memory? | /proc/PID/status (VmRSS), /proc/PID/smaps_rollup |
| Cgroup | Which cgroup is under memory pressure? | cat /sys/fs/cgroup/<path>/memory.pressure |
| System | Is system-wide memory pressure elevated? | cat /proc/pressure/memory (check "some" and "full" avg10) |
| Kernel | Did the OOM killer fire? What was the state? | dmesg |
| Hardware | Is physical RAM sufficient for the workload mix? | free -h, check for swap activity with vmstat 1 |
Design Rationale
Utilization metrics answer "how much of a resource is consumed" but not "is the resource a bottleneck." A system at 95% memory utilization might be perfectly healthy if most of that memory is reclaimable page cache. A system at 60% memory utilization might be thrashing if most of that memory is anonymous pages under active use and direct reclaim is running constantly. PSI captures what utilization cannot: the fraction of time that work is delayed because a resource is scarce. This makes PSI the correct signal for autoscaling, eviction, and capacity planning decisions.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| OOM kill with no prior warning in monitoring | Monitoring tracks utilization, not pressure | cat /proc/pressure/memory (would have shown rising avg10) |
| Application latency spikes not explained by CPU or network | Memory pressure causing direct reclaim stalls | Check memory.pressure "some" avg10 for the application's cgroup |
| System responsive overall but specific pods are slow | Per-cgroup memory or I/O pressure in those pods | cat /sys/fs/cgroup/<pod-cgroup>/memory.pressure |
| PSI memory "full" is nonzero | All tasks stalled on memory simultaneously (thrashing) | Immediate action needed: reduce workload or add memory |
| PSI cpu "some" is high but CPU utilization is moderate | Too many runnable tasks for available CPUs (scheduling contention) | Check number of runnable threads vs available cores with vmstat |
| PSI io "some" is consistently above 20% | I/O subsystem is a bottleneck for at least some tasks | iostat -x 1 to identify which device is saturated |
When to Use / Avoid
Relevant when:
- Predicting OOM kills before they happen (memory pressure rising while utilization looks normal)
- Deciding when to scale horizontally (CPU pressure indicates tasks are queuing, not just that CPUs are busy)
- Implementing graceful degradation (shed optional work when I/O pressure is high)
- Monitoring container health beyond raw resource limits (per-cgroup PSI in cgroup v2)
Watch out for:
- PSI requires Linux 4.20+ and CONFIG_PSI=y (most modern distributions enable it by default)
- Per-cgroup PSI requires cgroup v2 (cgroup v1 does not support per-cgroup pressure files)
- The kernel restricts PSI trigger windows to the 500ms-10s range; very short windows would cause excessive wakeups and CPU overhead from the monitoring itself
Try It Yourself
# Read current pressure metrics for all resources
for r in cpu memory io; do echo "=== $r ===" && cat /proc/pressure/$r; done

# Monitor memory pressure in real time (updates every 2 seconds)
watch -n2 cat /proc/pressure/memory

# Read per-cgroup memory pressure for a systemd service
cat /sys/fs/cgroup/system.slice/nginx.service/memory.pressure

# Calculate exact pressure (as a percentage) over a 10-second interval using the total counter
T1=$(awk '/^some/{print $5}' /proc/pressure/memory | cut -d= -f2); sleep 10; T2=$(awk '/^some/{print $5}' /proc/pressure/memory | cut -d= -f2); echo "scale=2; ($T2-$T1)/100000" | bc

# Check if PSI is enabled in the running kernel
test -f /proc/pressure/cpu && echo "PSI enabled" || echo "PSI not available"

# Find cgroups with highest memory pressure
find /sys/fs/cgroup -name memory.pressure -exec sh -c 'echo "$(cat "$1" | head -1) $1"' _ {} \; 2>/dev/null | sort -t= -k2 -rn | head -10

# Use PSI trigger from shell (150ms stall per 1s window; python3 waits for POLLPRI)
exec 3<>/proc/pressure/memory; echo "some 150000 1000000" >&3; echo "Waiting for memory pressure..."; python3 -c 'import select,sys; p=select.poll(); p.register(0, select.POLLPRI); sys.exit(0 if p.poll(60000) else 1)' <&3 && echo "PRESSURE DETECTED" || echo "No pressure in 60s"; exec 3>&-

Debug Checklist
1. Check if PSI is enabled: cat /proc/pressure/cpu (if the file exists, PSI is active)
2. Read current memory pressure: cat /proc/pressure/memory
3. Read current I/O pressure: cat /proc/pressure/io
4. Check per-cgroup pressure: cat /sys/fs/cgroup/<path>/memory.pressure
5. Verify cgroup v2 is active: mount | grep cgroup2
6. Check kernel config: grep CONFIG_PSI /boot/config-$(uname -r)
7. Monitor pressure over time: watch -n2 cat /proc/pressure/memory
Key Takeaways
- ✓PSI measures stall time, not utilization. A system at 95% CPU utilization with 0% CPU pressure means all tasks are running and none are waiting. A system at 60% CPU utilization with 30% CPU pressure means tasks are frequently queued behind others. The distinction matters for capacity planning because utilization alone cannot tell whether adding more work will cause latency degradation.
- ✓The "some" vs "full" distinction is critical. "some" means at least one task is stalled but others are making progress. "full" means every non-idle task is stalled simultaneously -- the machine is doing zero productive work during that time. For memory, "some" pressure triggers page reclaim. "full" pressure means the system is thrashing.
- ✓PSI averages (avg10, avg60, avg300) are exponential moving averages, not simple averages. avg10 reacts to pressure spikes within seconds. avg300 smooths out transient bursts and shows sustained pressure. The "total" field is a cumulative microsecond counter that allows computing exact pressure over any arbitrary time window by taking two readings and dividing the delta.
- ✓PSI triggers use the kernel's internal PSI tracking and deliver notifications through poll()/epoll(). The trigger format is "some|full STALL_US WINDOW_US", meaning "notify when stall time exceeds STALL_US microseconds within a WINDOW_US microsecond window." This is far more efficient than polling /proc/pressure files from userspace on a timer.
- ✓PSI was added in Linux 4.20 (December 2018) by Facebook (now Meta) engineers. It was designed specifically because utilization-based metrics failed to predict OOM events and I/O stalls in Meta's fleet. The kernel already tracked scheduling delays and memory reclaim stalls internally -- PSI simply exposed these existing counters to userspace.
Common Pitfalls
- ✗Using utilization thresholds (percent of RAM used, percent of CPU busy) to predict resource exhaustion. A node at 90% memory usage with most of that in reclaimable page cache is healthy. A node at 70% with most of that in anonymous pages under active use can be on the edge of thrashing. PSI captures the actual contention, not just the raw usage numbers.
- ✗Reading PSI avg10 values too infrequently. avg10 is a 10-second exponential moving average that updates every 2 seconds. Polling once per minute misses short pressure spikes entirely. For responsive eviction or scaling decisions, use PSI triggers (poll/epoll) instead of periodic reads. The trigger mechanism delivers sub-second notifications without any polling overhead.
- ✗Ignoring per-cgroup PSI and relying only on system-wide /proc/pressure files. System-wide PSI aggregates all tasks. A single misbehaving container can cause system-wide memory pressure while 49 other containers are fine. Per-cgroup memory.pressure pinpoints exactly which workload is causing stalls, enabling targeted eviction instead of random OOM kills.
- ✗Setting PSI trigger thresholds too low, causing constant alerting. A healthy system under normal load will occasionally show brief memory pressure spikes during page reclaim. Start with moderate thresholds (e.g., some 150000 1000000 for memory) and tune based on observed baseline pressure. Zero pressure at all times is not a realistic goal on a system doing real work.
Reference
In One Line
PSI turns invisible kernel stall counters into actionable metrics, giving 30 seconds of warning before OOM kills that utilization-based monitoring completely misses.