Virtual File System (VFS)
Mental Model
A hotel front desk that handles every guest request the same way. Newspaper? The desk calls the gift shop. Room service? The kitchen. Taxi? The dispatcher. Guests fill out the same form every time, never knowing which department fulfills it. Behind the desk, a lookup card for each room lists which department handles which request type. A new service joins the hotel by handing the front desk a card with its phone numbers -- nothing else changes.
The Problem
A typical server has 70+ registered filesystem types. Without a common abstraction, every application would carry separate I/O code for ext4, NFS, procfs, FUSE, overlayfs, and whatever comes next -- and adding a new storage backend would mean patching every userspace program on the system. A node with 50 Docker containers needs 50 independent mount trees routing I/O through overlayfs layers. A Kubernetes pod migrating from EBS to Ceph would need storage-specific code changes on every move. And path resolution without dentry caching would hit disk for every component of every path: a shell searching 8 $PATH directories for a single command would trigger 8 disk lookups per execution.
Architecture
Calling read() on /etc/hostname returns the machine's name. Calling read() on /proc/cpuinfo returns CPU details. Calling read() on an NFS-mounted spreadsheet returns last quarter's numbers.
Same function. Same arguments. Completely different storage backends. One is a file on the local SSD. One does not exist on any disk -- the kernel fabricates it on the fly. One lives on a server three timezones away.
How does one syscall work for all of them? The answer is the Virtual File System.
What Actually Happens
Here is the path when read(fd, buf, 4096) is called:
- The syscall lands in vfs_read()
- The kernel looks at the struct file for the fd
- It follows file->f_op->read_iter() -- a function pointer
- That pointer was set when the file was opened, based on the filesystem type
- For ext4, it points to ext4_file_read_iter(); for NFS, nfs_file_read(); for procfs, proc_reg_read()
That's it. No switch statement. No filesystem detection at read time. Just one pointer dereference.
This indirection is set up at open() time. When the VFS resolves the path and finds the inode, the filesystem driver has already attached its function pointer table. Every subsequent operation on that fd goes through those pointers.
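The pattern is easy to reproduce outside the kernel. Below is a minimal userspace sketch of the same idea -- the struct and field names echo the kernel's, but the toy backends and the simplified vfs_read() signature are invented for illustration:

```c
#include <stdio.h>
#include <sys/types.h>

/* Toy stand-ins for the kernel's dispatch structures. */
struct file;

struct file_operations {
    ssize_t (*read_iter)(struct file *f, char *buf, size_t len);
};

struct file {
    const struct file_operations *f_op;  /* set once, at "open" time */
    const char *backing;                 /* toy backend state */
};

/* Two "filesystem drivers", each supplying its own implementation. */
static ssize_t ext4_like_read(struct file *f, char *buf, size_t len)
{
    return snprintf(buf, len, "[ext4-like] %s", f->backing);
}

static ssize_t proc_like_read(struct file *f, char *buf, size_t len)
{
    (void)f;
    return snprintf(buf, len, "[proc-like] fabricated on the fly");
}

static const struct file_operations ext4_like_fops = { .read_iter = ext4_like_read };
static const struct file_operations proc_like_fops = { .read_iter = proc_like_read };

/* The "VFS layer": one function, one pointer dereference, no switch statement. */
static ssize_t vfs_read(struct file *f, char *buf, size_t len)
{
    return f->f_op->read_iter(f, buf, len);
}

int main(void)
{
    char buf[64];
    struct file a = { .f_op = &ext4_like_fops, .backing = "data from disk" };
    struct file b = { .f_op = &proc_like_fops };

    vfs_read(&a, buf, sizeof buf); puts(buf);
    vfs_read(&b, buf, sizeof buf); puts(buf);
    return 0;
}
```

Add a third file_operations table and vfs_read() does not change -- that is the entire trick.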
Under the Hood
VFS defines four core objects, each with an associated operations table.
The superblock (struct super_block + super_operations) represents a mounted filesystem instance. It holds the block size, root dentry, device reference, and operations like alloc_inode(), write_inode(), and statfs(). Every mount() call creates one.
The inode (struct inode + inode_operations) represents a file's identity and metadata — mode, uid, gid, size — plus operations like lookup(), create(), mkdir(), and permission(). This is the filesystem-independent version; the driver populates it from whatever on-disk format it uses.
The dentry (struct dentry + dentry_operations) represents a path component in the directory tree. Dentries form a cached tree used for path resolution. Their operations handle hash comparison and revalidation — critical for NFS, where server-side changes must be detected.
The file (struct file + file_operations) represents an open file instance. It holds the current offset, flags, and the actual I/O operations: read_iter(), write_iter(), mmap(), fsync(), and ioctl(). This is where the dispatch happens.
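To make the relationships concrete, here is a heavily simplified sketch of how the four objects point at each other. Field names follow the kernel's, but the real definitions in include/linux/fs.h and include/linux/dcache.h carry dozens more fields, locks, and list heads:

```c
/* Heavily simplified -- an illustration of the shape, not the kernel's definitions. */
struct super_operations;  struct inode_operations;
struct dentry_operations; struct file_operations;
struct dentry;            struct inode;

struct super_block {                        /* one per mounted filesystem */
    unsigned long                   s_blocksize;
    struct dentry                  *s_root;      /* root dentry of this mount */
    const struct super_operations  *s_op;        /* alloc_inode, write_inode, statfs */
};

struct inode {                              /* a file's identity and metadata */
    unsigned int                    i_mode;      /* type + permission bits */
    long long                       i_size;      /* loff_t in the kernel */
    struct super_block             *i_sb;
    const struct inode_operations  *i_op;        /* lookup, create, mkdir, permission */
    const struct file_operations   *i_fop;       /* copied into file->f_op at open */
};

struct dentry {                             /* one cached path component */
    struct dentry                  *d_parent;
    struct inode                   *d_inode;     /* NULL for a negative dentry */
    const struct dentry_operations *d_op;        /* hash, compare, revalidate */
};

struct file {                               /* one open(2) instance */
    long long                       f_pos;       /* current offset */
    struct inode                   *f_inode;
    const struct file_operations   *f_op;        /* read_iter, write_iter, mmap, fsync */
};
```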
How filesystems plug in. Each driver registers via register_filesystem(), providing a file_system_type struct with a mount() callback. When mount() is called, the kernel finds the registered type in a global linked list, calls its mount callback to create the in-memory superblock, then grafts it into the mount tree at the specified mount point.
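A skeletal sketch of that registration, using the older .mount-callback style that many in-tree filesystems still use (newer drivers register through the fs_context API instead). The fill_super body is elided; a real driver would allocate a root inode there and set sb->s_root and sb->s_op:

```c
#include <linux/module.h>
#include <linux/fs.h>

/* Populate the in-memory superblock: block size, s_op, root inode/dentry.
   Elided here -- this sketch only shows the plumbing. */
static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
    return 0;
}

static struct dentry *myfs_mount(struct file_system_type *fs_type,
                                 int flags, const char *dev_name, void *data)
{
    /* No backing block device, like tmpfs/procfs; a disk filesystem
       would call mount_bdev() instead. */
    return mount_nodev(fs_type, flags, data, myfs_fill_super);
}

static struct file_system_type myfs_type = {
    .owner   = THIS_MODULE,
    .name    = "myfs",               /* what `mount -t myfs` refers to */
    .mount   = myfs_mount,
    .kill_sb = kill_anon_super,
};

static int __init myfs_init(void)
{
    /* Adds myfs to the kernel's global list of filesystem types;
       it shows up in /proc/filesystems afterwards. */
    return register_filesystem(&myfs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```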
Path walking across mount points. The path walk algorithm uses the dentry cache to resolve each component. At each step, it checks if the current dentry is a mountpoint. If so, it transparently crosses into the mounted filesystem's root dentry. The boundary is invisible.
The dentry cache (dcache). This is one of the most performance-critical structures in the kernel. It's a hash table indexed by (parent dentry, name hash). A cache hit avoids disk I/O entirely. The kernel implements two walk modes: RCU-walk (lockless, no reference counts, blazing fast) for the common case, and ref-walk (locking, can sleep) as a fallback when the cache misses or a permission check needs to block. Hot path lookups are lock-free.
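A toy model of that indexing scheme helps: a userspace hash table keyed by (parent pointer, name), where a NULL inode pointer stands in for a negative entry. The names mirror the kernel's, but the real dcache is a global, RCU-protected structure with far more machinery:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct inode;                        /* opaque in this toy */

struct dentry {
    struct dentry *d_parent;
    const char    *d_name;
    struct inode  *d_inode;          /* NULL => negative dentry: "name does not exist" */
    struct dentry *next;             /* hash-bucket chain */
};

#define NBUCKETS 256
static struct dentry *buckets[NBUCKETS];

static unsigned hash(const struct dentry *parent, const char *name)
{
    unsigned h = (unsigned)(unsigned long)parent;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % NBUCKETS;
}

static struct dentry *dcache_lookup(struct dentry *parent, const char *name)
{
    for (struct dentry *d = buckets[hash(parent, name)]; d; d = d->next)
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;                /* hit -- positive or negative, no disk I/O */
    return NULL;                     /* miss -- caller must ask the filesystem driver */
}

static void dcache_insert(struct dentry *parent, const char *name, struct inode *inode)
{
    struct dentry *d = calloc(1, sizeof *d);
    unsigned b = hash(parent, name);
    d->d_parent = parent;
    d->d_name   = name;
    d->d_inode  = inode;
    d->next     = buckets[b];
    buckets[b]  = d;
}

int main(void)
{
    struct dentry root = { .d_name = "/" };
    dcache_insert(&root, "etc", (struct inode *)1);   /* pretend inode pointer */
    dcache_insert(&root, "missing.conf", NULL);       /* cached ENOENT */

    struct dentry *d = dcache_lookup(&root, "missing.conf");
    printf("hit=%d negative=%d\n", d != NULL, d && d->d_inode == NULL);
    return 0;
}
```

The cache can answer "that name does not exist" without ever calling into the filesystem driver -- the negative-dentry idea that comes up again in the questions below.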
The unified page cache. Here's something elegant: the page cache is filesystem-agnostic. When vfs_read() is called, it typically goes through generic_file_read_iter(), which checks a per-inode page cache (an xarray of pages). Cache misses invoke the filesystem's readpage() or readahead() to populate pages. Because the cache is unified, memory pressure, dirty writeback, and sync/fsync work the same way whether the backend is ext4, NFS, or FUSE.
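The split -- generic layer owns the cache, per-filesystem callback only fills missing pages -- can be sketched in a few lines. Everything here (the toy inode, generic_read_page(), the ext4-like backend) is a stand-in for illustration, not the kernel code:

```c
#include <stdio.h>

#define PAGE_SZ 16
#define NPAGES  8

struct inode {
    /* the per-filesystem "readpage"-style callback that fills one page */
    void (*fill_page)(struct inode *ino, unsigned index, char *page);
    char  cached[NPAGES][PAGE_SZ];   /* toy per-inode page cache */
    int   present[NPAGES];
};

/* Filesystem-agnostic, like generic_file_read_iter(): check cache, fill on miss. */
static const char *generic_read_page(struct inode *ino, unsigned index)
{
    if (!ino->present[index]) {
        ino->fill_page(ino, index, ino->cached[index]);   /* backend ("disk") I/O */
        ino->present[index] = 1;
    }
    return ino->cached[index];                            /* hit: no backend call */
}

/* One possible backend; an NFS- or FUSE-like backend would slot in identically. */
static void ext4_like_fill(struct inode *ino, unsigned index, char *page)
{
    (void)ino;
    snprintf(page, PAGE_SZ, "block %u", index);
    puts("backend I/O issued");
}

int main(void)
{
    struct inode ino = { .fill_page = ext4_like_fill };
    puts(generic_read_page(&ino, 3));   /* miss: backend is called */
    puts(generic_read_page(&ino, 3));   /* hit: served from the cache */
    return 0;
}
```

Because the cache sits above the callback, eviction and writeback policy can be written once and applied to every backend.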
Common Questions
How does the kernel decide which filesystem driver handles an open() call?
It doesn't decide at open() time — the decision was made at mount() time. The kernel resolves the path component by component through the dentry cache. At each mount point, it transitions to the mounted filesystem's root dentry. Once the final component is reached, the inode's i_fop pointer (set during inode initialization by the filesystem driver) determines the file_operations table. The dispatch is baked into the inode, not computed per-call.
What happens when a filesystem is mounted on a non-empty directory?
The existing contents become hidden, not deleted. The dentry for the mount point gets a d_mounted flag, and path resolution switches to the mounted filesystem's root dentry instead of continuing into the original directory. Unmounting reveals the original contents. It is like placing a book on top of another -- the bottom book is still there, just not visible.
How does FUSE work from VFS's perspective?
FUSE registers as a regular filesystem type. Its inode and file operations point to FUSE kernel module functions. These functions translate VFS operations into FUSE protocol messages, write them to /dev/fuse, and block until the userspace daemon reads the message, processes it, and writes back a response. Every operation requires at least two context switches. That's the cost of running filesystem logic in userspace — and why FUSE is inherently slower than in-kernel filesystems.
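Both halves of that round trip fit in a few dozen lines of userspace code. The following is a trimmed variant of libfuse's classic "hello" example, written against the libfuse 3 high-level API (signatures differ under libfuse 2); readdir is omitted for brevity, so the file will not show up in ls but can still be opened by name:

```c
/* build: gcc hellofs.c $(pkg-config --cflags --libs fuse3) -o hellofs */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

static const char *contents = "hello from userspace\n";

/* Each callback runs in this process: the kernel's FUSE module forwards the
   VFS operation over /dev/fuse, we answer, and the caller's syscall returns. */

static int hello_getattr(const char *path, struct stat *st, struct fuse_file_info *fi)
{
    (void)fi;
    memset(st, 0, sizeof *st);
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (strcmp(path, "/hello") == 0) {
        st->st_mode  = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size  = strlen(contents);
        return 0;
    }
    return -ENOENT;
}

static int hello_open(const char *path, struct fuse_file_info *fi)
{
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t off,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(contents);
    (void)fi;
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if ((size_t)off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, contents + off, size);
    return (int)size;
}

static const struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .open    = hello_open,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* e.g. mkdir /tmp/hellofs && ./hellofs -f /tmp/hellofs, then cat /tmp/hellofs/hello */
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```

Run it in the foreground (-f) and strace a cat of the file: the reader sees an ordinary read(), while every operation detours through this process and back.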
Why does the dentry cache store entries for files that don't exist?
Because "does this file exist?" is one of the most common questions the kernel answers. The shell searches every directory in $PATH for each command. A build system checks dozens of include paths for each header. Without negative dentry caching, each failed lookup would hit disk. Negative dentries are cached with a timeout and evicted under memory pressure. They're a small investment that prevents enormous I/O waste.
How Technologies Use This
A container starts, opens /etc/hostname, and reads its own unique value. Another container on the same host opens the same path and reads something completely different. Both run on one kernel, one disk, yet neither sees the other's files. Without VFS, achieving this isolation would require separate kernels or full disk images per container.
The trick is function pointer indirection at the VFS layer. Docker's overlay2 driver registers overlayfs as a VFS filesystem type, stacking read-only image layers with a writable upper directory. Mount namespaces (CLONE_NEWNS) give each container its own mount tree, so open() and read() dispatch through VFS to overlayfs without ever seeing the host's mounts.
On a node running 50 containers, VFS dispatches I/O across 50 independent mount trees with under 2% CPU overhead. The application code never changes and never needs to know which storage backend sits underneath.
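The mount itself is a single syscall. A hedged sketch of what the runtime effectively does -- the /tmp/overlay-demo paths are invented for this example, and real runtimes drive the same call through their snapshotter code:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Assumes /tmp/overlay-demo/{lower,upper,work,merged} already exist
       and that this runs as root. */
    const char *opts =
        "lowerdir=/tmp/overlay-demo/lower,"
        "upperdir=/tmp/overlay-demo/upper,"
        "workdir=/tmp/overlay-demo/work";

    if (mount("overlay", "/tmp/overlay-demo/merged", "overlay", 0, opts) != 0) {
        perror("mount");
        return 1;
    }

    /* From here on, open()/read() under merged/ dispatch through VFS to
       overlayfs, which resolves each path against the writable upper layer
       first, then the read-only lower (image) layers. */
    puts("overlay mounted");
    return 0;
}
```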
A pod mounts an EBS volume on Monday, migrates to a node with Ceph on Tuesday, and the application code never changes. Without VFS, every pod would need storage-specific I/O code for every backend, and adding a new storage system would mean patching every workload.
VFS makes this invisible. CSI drivers call mount() with the appropriate filesystem type, and VFS dispatches all I/O through function pointer tables. The kubelet mounts the backend into the pod's mount namespace, so the application's read() and write() calls flow through VFS to the correct driver without any awareness of the underlying storage.
ConfigMap volumes push this further, using tmpfs-backed mounts with atomic symlink swaps so pods pick up config changes in under 1 second without restarting. In-flight reads never see a torn file. One abstraction layer handles every storage backend Kubernetes supports.
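The atomicity comes from rename(2), not from anything Kubernetes-specific. Here is a standalone sketch of the symlink-swap trick -- the file and directory names are invented, not kubelet's actual on-disk layout:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* 1. Materialize the new config version in a fresh directory. */
    mkdir("config.v2", 0755);
    FILE *f = fopen("config.v2/app.conf", "w");
    if (!f) { perror("fopen"); return 1; }
    fputs("timeout=30\n", f);
    fclose(f);

    /* 2. Point a temporary symlink at the new directory. */
    unlink("current.tmp");                        /* ignore error if absent */
    if (symlink("config.v2", "current.tmp") != 0) { perror("symlink"); return 1; }

    /* 3. rename(2) atomically replaces the "current" symlink. A reader that
       opens current/app.conf sees either the old file or the new one,
       never a half-written mix of the two. */
    if (rename("current.tmp", "current") != 0) { perror("rename"); return 1; }
    return 0;
}
```

A process that already holds current/app.conf open keeps reading the old inode; new opens resolve through the swapped symlink. That is what "in-flight reads never see a torn file" means in practice.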
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Filesystem abstraction | overlay2 stacks layers via VFS overlayfs | java.nio.file.FileSystem SPI mimics VFS pattern | fs module calls map to VFS syscalls | os.Open dispatches through VFS | CSI drivers mount backends via VFS mount() |
| Mount isolation | Mount namespace (CLONE_NEWNS) per container | N/A (JVM shares host mount tree) | N/A (shares host mount tree) | N/A (shares host mount tree) | Pod mount namespace via container runtime |
| Pseudo-filesystem | /proc, /sys bind-mounted into container | /proc/self for JVM introspection | /proc/self/fd for fd enumeration | /proc/self/maps for memory inspection | ConfigMap uses tmpfs-backed atomic symlink swap |
| Page cache | Shared host page cache across all containers | JVM file I/O benefits from unified page cache | fs.readFile benefits from page cache | os.ReadFile benefits from page cache | EmptyDir with Memory medium is tmpfs -- data lives in the page cache, never written back to disk |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Hardware | Block devices, NIC (for NFS), RAM (for tmpfs/procfs) |
| Block I/O | Request queue, I/O scheduler, device driver |
| Filesystem driver | ext4, XFS, NFS, overlayfs, FUSE -- implements VFS operations |
| VFS | super_block, inode, dentry, file + operations tables |
| Syscall | open(), read(), write(), mount(), stat() |
| Userspace | Application code -- unaware of underlying filesystem type |
Design Rationale: With 70+ filesystem types, a switch/case at read time would be absurd. Function pointer dispatch resolves in one indirection, set once at mount time. The dentry cache with RCU-walk makes the common case -- looking up a path that was recently resolved -- completely lock-free. Negative dentries exist because "does this file exist?" is asked millions of times for paths that do not exist, and hitting disk on every miss would be catastrophic for shell $PATH lookups and build systems. The unified page cache means dirty tracking, writeback, and eviction all work consistently whether data sits on ext4, NFS, or FUSE.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| EOPNOTSUPP on fallocate or xattr calls | Underlying filesystem does not support the operation | findmnt -T <path> to check filesystem type |
| Mounts inside container visible on host | Mount propagation set to shared instead of private | cat /proc/self/mountinfo to check propagation flags |
| FUSE mount 10x slower than ext4 for metadata ops | Every VFS op requires context switch to userspace daemon | strace -c -p <fuse_pid> to measure round-trip overhead |
| Dentry cache consuming gigabytes of slab memory | Many unique paths accessed (build systems, package managers) | slabtop -o \| grep dentry |
| stat() on NFS returns stale metadata | Dentry/attribute cache timeout not expired | mount \| grep nfs to check actimeo/acregmin settings |
| Path resolution slow on deeply nested paths | Dentry cache miss causing disk I/O per component | perf trace -e 'fs:*' <command> |
When to Use / Avoid
- Use VFS understanding when debugging filesystem-specific behavior differences (e.g., fallocate works on ext4 but not NFS v3)
- Use when designing container storage strategies -- overlayfs, bind mounts, and tmpfs all dispatch through VFS
- Use when implementing or debugging FUSE filesystems to understand the kernel-userspace round-trip cost
- Use when diagnosing mount propagation issues in containerized environments
- Avoid assuming all filesystems support the same operations -- check for EOPNOTSUPP returns
- Avoid bypassing VFS (direct block I/O) unless building a database engine with specific alignment requirements
Try It Yourself
```bash
# List all filesystem types registered with the kernel; 'nodev' means no block device required (pseudo-fs)
cat /proc/filesystems

# Display the complete mount hierarchy with filesystem types and mount options
findmnt --tree -o TARGET,SOURCE,FSTYPE,OPTIONS

# Mount a tmpfs (purely VFS/memory-backed filesystem); no block device involved
mount -t tmpfs -o size=256m tmpfs /mnt/ramdisk

# Trace the mount syscall to see how the kernel attaches a filesystem to the VFS tree
strace -e mount,umount2 mount /dev/sdb1 /mnt/data

# Per-process mount info showing mount IDs, parent IDs, device, root, mount point, and options; richer than /proc/mounts
cat /proc/self/mountinfo

# Count dentry cache events system-wide over 5 seconds to measure path resolution pressure
sudo perf stat -e 'dentry:*' -a sleep 5
```

Debug Checklist
1. findmnt --tree -o TARGET,SOURCE,FSTYPE,OPTIONS
2. cat /proc/filesystems
3. cat /proc/self/mountinfo
4. slabtop -o | grep -i 'dentry\|inode'
5. perf stat -e 'dentry:*' -a -- sleep 5 2>&1
6. strace -e open,openat,mount -p <PID> 2>&1 | head -20
Key Takeaways
- ✓ There's no giant switch/case on filesystem type. When you call read(), the kernel invokes file->f_op->read_iter(), which is ext4_file_read_iter() for ext4, nfs_file_read() for NFS, etc. Pure function-pointer indirection — one level of dispatch, zero branching
- ✓ The dentry cache (dcache) caches "file not found" too. Negative dentries prevent repeated disk reads for names that don't exist — critical when your shell searches $PATH or a build system checks dozens of include directories
- ✓ Plugging in a new filesystem is registering a struct. register_filesystem() adds a file_system_type to a global linked list; mount() walks the list, finds the right driver, and calls its mount() method to create a superblock
- ✓ 'Everything is a file' isn't a metaphor — it's an architecture. Pseudo-filesystems (procfs, sysfs, tmpfs) implement VFS operations purely in kernel memory, never touching a block device
- ✓ One page cache rules them all. The VFS page cache is unified across ext4, NFS, and even FUSE — so eviction, dirty writeback, and readahead work consistently regardless of backend
Common Pitfalls
- ✗ Assuming all filesystems support the same features — VFS operations can return -EOPNOTSUPP for unsupported ops (e.g., fallocate on NFS v3, xattrs on FAT32). The abstraction is uniform; the capabilities are not
- ✗ Confusing the VFS inode with the on-disk inode — the VFS inode is an in-memory, filesystem-independent structure populated by the driver's read_inode. It may contain fields that don't exist on disk at all
- ✗ Ignoring mount propagation (shared, slave, private) — this controls how mount events flow across namespaces. Get it wrong and mounts inside a container leak to the host, or vice versa
- ✗ Expecting FUSE to perform like an in-kernel filesystem — every VFS operation on FUSE requires a context switch to userspace and back. That round-trip is the price of flexibility
Reference
In One Line
One read() call, one function pointer dereference, and the right filesystem driver runs -- VFS makes ext4, NFS, procfs, and FUSE all look the same from userspace.