Virtual File System (VFS)
Mental Model
A hotel front desk that handles every guest request the same way. Newspaper? The desk calls the gift shop. Room service? The kitchen. Taxi? The dispatcher. Guests fill out the same form every time, never knowing which department fulfills it. Behind the desk, a lookup card for each room lists which department handles which request type. A new service joins the hotel by handing the front desk a card with its phone numbers -- nothing else changes.
The Problem
A typical server has 70+ registered filesystem types. Without a common abstraction, every application would carry separate I/O code for ext4, NFS, procfs, FUSE, overlayfs, and whatever comes next -- and adding a new storage backend would mean patching every userspace program on the system. A node with 50 Docker containers needs 50 independent mount trees routing I/O through overlayfs layers. A Kubernetes pod migrating from EBS to Ceph would need storage-specific code changes on every move. And path resolution without dentry caching would hit disk for every component of every path: a shell searching 8 $PATH directories for a single command would trigger 8 disk lookups per execution.
Architecture
Calling read() on /etc/hostname returns the machine's name. Calling read() on /proc/cpuinfo returns CPU details. Calling read() on an NFS-mounted spreadsheet returns last quarter's numbers.
Same function. Same arguments. Completely different storage backends. One is a file on the local SSD. One does not exist on any disk -- the kernel fabricates it on the fly. One lives on a server three timezones away.
How does one syscall work for all of them? The answer is the Virtual File System.
What Actually Happens
Here is the path when read(fd, buf, 4096) is called:
- The syscall lands in vfs_read()
- The kernel looks at the struct file for the fd
- It follows file->f_op->read_iter() -- a function pointer
- That pointer was set when the file was opened, based on the filesystem type
- For ext4, it points to ext4_file_read_iter(); for NFS, nfs_file_read(); for procfs, proc_reg_read()
That's it. No switch statement. No filesystem detection at read time. Just one pointer dereference.
This indirection is set up at open() time. When the VFS resolves the path and finds the inode, the filesystem driver has already attached its function pointer table. Every subsequent operation on that fd goes through those pointers.
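The pattern is easy to reproduce outside the kernel. Below is a minimal userspace sketch of the same idea -- the struct and field names echo the kernel's, but the toy backends and the simplified vfs_read() signature are invented for illustration:

```c
#include <stdio.h>
#include <sys/types.h>

/* Toy stand-ins for the kernel's dispatch structures. */
struct file;

struct file_operations {
    ssize_t (*read_iter)(struct file *f, char *buf, size_t len);
};

struct file {
    const struct file_operations *f_op;  /* set once, at "open" time */
    const char *backing;                 /* toy backend state */
};

/* Two "filesystem drivers", each supplying its own implementation. */
static ssize_t ext4_like_read(struct file *f, char *buf, size_t len)
{
    return snprintf(buf, len, "[ext4-like] %s", f->backing);
}

static ssize_t proc_like_read(struct file *f, char *buf, size_t len)
{
    (void)f;
    return snprintf(buf, len, "[proc-like] fabricated on the fly");
}

static const struct file_operations ext4_like_fops = { .read_iter = ext4_like_read };
static const struct file_operations proc_like_fops = { .read_iter = proc_like_read };

/* The "VFS layer": one function, one pointer dereference, no switch statement. */
static ssize_t vfs_read(struct file *f, char *buf, size_t len)
{
    return f->f_op->read_iter(f, buf, len);
}

int main(void)
{
    char buf[64];
    struct file a = { .f_op = &ext4_like_fops, .backing = "data from disk" };
    struct file b = { .f_op = &proc_like_fops };

    vfs_read(&a, buf, sizeof buf); puts(buf);
    vfs_read(&b, buf, sizeof buf); puts(buf);
    return 0;
}
```

Add a third file_operations table and vfs_read() does not change -- that is the entire trick.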
Under the Hood
VFS defines four core objects, each with an associated operations table.
The superblock (struct super_block + super_operations) represents a mounted filesystem instance. It holds the block size, root dentry, device reference, and operations like alloc_inode(), write_inode(), and statfs(). Every mount() call creates one.
The inode (struct inode + inode_operations) represents a file's identity and metadata — mode, uid, gid, size — plus operations like lookup(), create(), mkdir(), and permission(). This is the filesystem-independent version; the driver populates it from whatever on-disk format it uses.
The dentry (struct dentry + dentry_operations) represents a path component in the directory tree. Dentries form a cached tree used for path resolution. Their operations handle hash comparison and revalidation — critical for NFS, where server-side changes must be detected.
The file (struct file + file_operations) represents an open file instance. It holds the current offset, flags, and the actual I/O operations: read_iter(), write_iter(), mmap(), fsync(), and ioctl(). This is where the dispatch happens.
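To make the relationships concrete, here is a heavily simplified sketch of how the four objects point at each other. Field names follow the kernel's, but the real definitions in include/linux/fs.h and include/linux/dcache.h carry dozens more fields, locks, and list heads:

```c
/* Heavily simplified -- an illustration of the shape, not the kernel's definitions. */
struct super_operations;  struct inode_operations;
struct dentry_operations; struct file_operations;
struct dentry;            struct inode;

struct super_block {                        /* one per mounted filesystem */
    unsigned long                   s_blocksize;
    struct dentry                  *s_root;      /* root dentry of this mount */
    const struct super_operations  *s_op;        /* alloc_inode, write_inode, statfs */
};

struct inode {                              /* a file's identity and metadata */
    unsigned int                    i_mode;      /* type + permission bits */
    long long                       i_size;      /* loff_t in the kernel */
    struct super_block             *i_sb;
    const struct inode_operations  *i_op;        /* lookup, create, mkdir, permission */
    const struct file_operations   *i_fop;       /* copied into file->f_op at open */
};

struct dentry {                             /* one cached path component */
    struct dentry                  *d_parent;
    struct inode                   *d_inode;     /* NULL for a negative dentry */
    const struct dentry_operations *d_op;        /* hash, compare, revalidate */
};

struct file {                               /* one open(2) instance */
    long long                       f_pos;       /* current offset */
    struct inode                   *f_inode;
    const struct file_operations   *f_op;        /* read_iter, write_iter, mmap, fsync */
};
```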
How filesystems plug in. Each driver registers via register_filesystem(), providing a file_system_type struct with a mount() callback. When mount() is called, the kernel finds the registered type in a global linked list, calls its mount callback to create the in-memory superblock, then grafts it into the mount tree at the specified mount point.
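A skeletal sketch of that registration, using the older .mount-callback style that many in-tree filesystems still use (newer drivers register through the fs_context API instead). The fill_super body is elided; a real driver would allocate a root inode there and set sb->s_root and sb->s_op:

```c
#include <linux/module.h>
#include <linux/fs.h>

/* Populate the in-memory superblock: block size, s_op, root inode/dentry.
   Elided here -- this sketch only shows the plumbing. */
static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
    return 0;
}

static struct dentry *myfs_mount(struct file_system_type *fs_type,
                                 int flags, const char *dev_name, void *data)
{
    /* No backing block device, like tmpfs/procfs; a disk filesystem
       would call mount_bdev() instead. */
    return mount_nodev(fs_type, flags, data, myfs_fill_super);
}

static struct file_system_type myfs_type = {
    .owner   = THIS_MODULE,
    .name    = "myfs",               /* what `mount -t myfs` refers to */
    .mount   = myfs_mount,
    .kill_sb = kill_anon_super,
};

static int __init myfs_init(void)
{
    /* Adds myfs to the kernel's global list of filesystem types;
       it shows up in /proc/filesystems afterwards. */
    return register_filesystem(&myfs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```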
Path walking across mount points. The path walk algorithm uses the dentry cache to resolve each component. At each step, it checks if the current dentry is a mountpoint. If so, it transparently crosses into the mounted filesystem's root dentry. The boundary is invisible.
The dentry cache (dcache). This is one of the most performance-critical structures in the kernel. It's a hash table indexed by (parent dentry, name hash). A cache hit avoids disk I/O entirely. The kernel implements two walk modes: RCU-walk (lockless, no reference counts, blazing fast) for the common case, and ref-walk (locking, can sleep) as a fallback when the cache misses or a permission check needs to block. Hot path lookups are lock-free.
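A toy model of that indexing scheme helps: a userspace hash table keyed by (parent pointer, name), where a NULL inode pointer stands in for a negative entry. The names mirror the kernel's, but the real dcache is a global, RCU-protected structure with far more machinery:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct inode;                        /* opaque in this toy */

struct dentry {
    struct dentry *d_parent;
    const char    *d_name;
    struct inode  *d_inode;          /* NULL => negative dentry: "name does not exist" */
    struct dentry *next;             /* hash-bucket chain */
};

#define NBUCKETS 256
static struct dentry *buckets[NBUCKETS];

static unsigned hash(const struct dentry *parent, const char *name)
{
    unsigned h = (unsigned)(unsigned long)parent;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % NBUCKETS;
}

static struct dentry *dcache_lookup(struct dentry *parent, const char *name)
{
    for (struct dentry *d = buckets[hash(parent, name)]; d; d = d->next)
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;                /* hit -- positive or negative, no disk I/O */
    return NULL;                     /* miss -- caller must ask the filesystem driver */
}

static void dcache_insert(struct dentry *parent, const char *name, struct inode *inode)
{
    struct dentry *d = calloc(1, sizeof *d);
    unsigned b = hash(parent, name);
    d->d_parent = parent;
    d->d_name   = name;
    d->d_inode  = inode;
    d->next     = buckets[b];
    buckets[b]  = d;
}

int main(void)
{
    struct dentry root = { .d_name = "/" };
    dcache_insert(&root, "etc", (struct inode *)1);   /* pretend inode pointer */
    dcache_insert(&root, "missing.conf", NULL);       /* cached ENOENT */

    struct dentry *d = dcache_lookup(&root, "missing.conf");
    printf("hit=%d negative=%d\n", d != NULL, d && d->d_inode == NULL);
    return 0;
}
```

The cache can answer "that name does not exist" without ever calling into the filesystem driver -- the negative-dentry idea that comes up again in the questions below.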
The unified page cache. Here's something elegant: the page cache is filesystem-agnostic. When vfs_read() is called, it typically goes through generic_file_read_iter(), which checks a per-inode page cache (an xarray of pages). Cache misses invoke the filesystem's readpage() or readahead() to populate pages. Because the cache is unified, memory pressure, dirty writeback, and sync/fsync work the same way whether the backend is ext4, NFS, or FUSE.
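The split -- generic layer owns the cache, per-filesystem callback only fills missing pages -- can be sketched in a few lines. Everything here (the toy inode, generic_read_page(), the ext4-like backend) is a stand-in for illustration, not the kernel code:

```c
#include <stdio.h>

#define PAGE_SZ 16
#define NPAGES  8

struct inode {
    /* the per-filesystem "readpage"-style callback that fills one page */
    void (*fill_page)(struct inode *ino, unsigned index, char *page);
    char  cached[NPAGES][PAGE_SZ];   /* toy per-inode page cache */
    int   present[NPAGES];
};

/* Filesystem-agnostic, like generic_file_read_iter(): check cache, fill on miss. */
static const char *generic_read_page(struct inode *ino, unsigned index)
{
    if (!ino->present[index]) {
        ino->fill_page(ino, index, ino->cached[index]);   /* backend ("disk") I/O */
        ino->present[index] = 1;
    }
    return ino->cached[index];                            /* hit: no backend call */
}

/* One possible backend; an NFS- or FUSE-like backend would slot in identically. */
static void ext4_like_fill(struct inode *ino, unsigned index, char *page)
{
    (void)ino;
    snprintf(page, PAGE_SZ, "block %u", index);
    puts("backend I/O issued");
}

int main(void)
{
    struct inode ino = { .fill_page = ext4_like_fill };
    puts(generic_read_page(&ino, 3));   /* miss: backend is called */
    puts(generic_read_page(&ino, 3));   /* hit: served from the cache */
    return 0;
}
```

Because the cache sits above the callback, eviction and writeback policy can be written once and applied to every backend.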
Common Questions
How does the kernel decide which filesystem driver handles an open() call?
It doesn't decide at open() time — the decision was made at mount() time. The kernel resolves the path component by component through the dentry cache. At each mount point, it transitions to the mounted filesystem's root dentry. Once the final component is reached, the inode's i_fop pointer (set during inode initialization by the filesystem driver) determines the file_operations table. The dispatch is baked into the inode, not computed per-call.
What happens when a filesystem is mounted on a non-empty directory?
The existing contents become hidden, not deleted. The dentry for the mount point gets a d_mounted flag, and path resolution switches to the mounted filesystem's root dentry instead of continuing into the original directory. Unmounting reveals the original contents. It is like placing a book on top of another -- the bottom book is still there, just not visible.
How does FUSE work from VFS's perspective?
FUSE registers as a regular filesystem type. Its inode and file operations point to FUSE kernel module functions. These functions translate VFS operations into FUSE protocol messages, write them to /dev/fuse, and block until the userspace daemon reads the message, processes it, and writes back a response. Every operation requires at least two context switches. That's the cost of running filesystem logic in userspace — and why FUSE is inherently slower than in-kernel filesystems.
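Both halves of that round trip fit in a few dozen lines of userspace code. The following is a trimmed variant of libfuse's classic "hello" example, written against the libfuse 3 high-level API (signatures differ under libfuse 2); readdir is omitted for brevity, so the file will not show up in ls but can still be opened by name:

```c
/* build: gcc hellofs.c $(pkg-config --cflags --libs fuse3) -o hellofs */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

static const char *contents = "hello from userspace\n";

/* Each callback runs in this process: the kernel's FUSE module forwards the
   VFS operation over /dev/fuse, we answer, and the caller's syscall returns. */

static int hello_getattr(const char *path, struct stat *st, struct fuse_file_info *fi)
{
    (void)fi;
    memset(st, 0, sizeof *st);
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (strcmp(path, "/hello") == 0) {
        st->st_mode  = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size  = strlen(contents);
        return 0;
    }
    return -ENOENT;
}

static int hello_open(const char *path, struct fuse_file_info *fi)
{
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t off,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(contents);
    (void)fi;
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if ((size_t)off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, contents + off, size);
    return (int)size;
}

static const struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .open    = hello_open,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* e.g. mkdir /tmp/hellofs && ./hellofs -f /tmp/hellofs, then cat /tmp/hellofs/hello */
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```

Run it in the foreground (-f) and strace a cat of the file: the reader sees an ordinary read(), while every operation detours through this process and back.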
Why does the dentry cache store entries for files that don't exist?
Because "does this file exist?" is one of the most common questions the kernel answers. The shell searches every directory in $PATH for each command. A build system checks dozens of include paths for each header. Without negative dentry caching, each failed lookup would hit disk. Negative dentries are cached with a timeout and evicted under memory pressure. They're a small investment that prevents enormous I/O waste.
How Technologies Use This
A container starts, opens /etc/hostname, and reads its own unique value. Another container on the same host opens the same path and reads something completely different. Both run on one kernel, one disk, yet neither sees the other's files. Without VFS, achieving this isolation would require separate kernels or full disk images per container.
The trick is function pointer indirection at the VFS layer. Docker's overlay2 driver registers overlayfs as a VFS filesystem type, stacking read-only image layers with a writable upper directory. Mount namespaces (CLONE_NEWNS) give each container its own mount tree, so open() and read() dispatch through VFS to overlayfs without ever seeing the host's mounts.
On a node running 50 containers, VFS dispatches I/O across 50 independent mount trees with under 2% CPU overhead. The application code never changes and never needs to know which storage backend sits underneath.
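The mount itself is a single syscall. A hedged sketch of what the runtime effectively does -- the /tmp/overlay-demo paths are invented for this example, and real runtimes drive the same call through their snapshotter code:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Assumes /tmp/overlay-demo/{lower,upper,work,merged} already exist
       and that this runs as root. */
    const char *opts =
        "lowerdir=/tmp/overlay-demo/lower,"
        "upperdir=/tmp/overlay-demo/upper,"
        "workdir=/tmp/overlay-demo/work";

    if (mount("overlay", "/tmp/overlay-demo/merged", "overlay", 0, opts) != 0) {
        perror("mount");
        return 1;
    }

    /* From here on, open()/read() under merged/ dispatch through VFS to
       overlayfs, which resolves each path against the writable upper layer
       first, then the read-only lower (image) layers. */
    puts("overlay mounted");
    return 0;
}
```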
A pod mounts an EBS volume on Monday, migrates to a node with Ceph on Tuesday, and the application code never changes. Without VFS, every pod would need storage-specific I/O code for every backend, and adding a new storage system would mean patching every workload.
VFS makes this invisible. CSI drivers call mount() with the appropriate filesystem type, and VFS dispatches all I/O through function pointer tables. The kubelet mounts the backend into the pod's mount namespace, so the application's read() and write() calls flow through VFS to the correct driver without any awareness of the underlying storage.
ConfigMap volumes push this further, using tmpfs-backed mounts with atomic symlink swaps so pods pick up config changes in under 1 second without restarting. In-flight reads never see a torn file. One abstraction layer handles every storage backend Kubernetes supports.
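The atomicity comes from rename(2), not from anything Kubernetes-specific. Here is a standalone sketch of the symlink-swap trick -- the file and directory names are invented, not kubelet's actual on-disk layout:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* 1. Materialize the new config version in a fresh directory. */
    mkdir("config.v2", 0755);
    FILE *f = fopen("config.v2/app.conf", "w");
    if (!f) { perror("fopen"); return 1; }
    fputs("timeout=30\n", f);
    fclose(f);

    /* 2. Point a temporary symlink at the new directory. */
    unlink("current.tmp");                        /* ignore error if absent */
    if (symlink("config.v2", "current.tmp") != 0) { perror("symlink"); return 1; }

    /* 3. rename(2) atomically replaces the "current" symlink. A reader that
       opens current/app.conf sees either the old file or the new one,
       never a half-written mix of the two. */
    if (rename("current.tmp", "current") != 0) { perror("rename"); return 1; }
    return 0;
}
```

A process that already holds current/app.conf open keeps reading the old inode; new opens resolve through the swapped symlink. That is what "in-flight reads never see a torn file" means in practice.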
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Filesystem abstraction | overlay2 stacks layers via VFS overlayfs | java.nio.file.FileSystem SPI mimics VFS pattern | fs module calls map to VFS syscalls | os.Open dispatches through VFS | CSI drivers mount backends via VFS mount() |
| Mount isolation | Mount namespace (CLONE_NEWNS) per container | N/A (JVM shares host mount tree) | N/A (shares host mount tree) | N/A (shares host mount tree) | Pod mount namespace via container runtime |
| Pseudo-filesystem | /proc, /sys bind-mounted into container | /proc/self for JVM introspection | /proc/self/fd for fd enumeration | /proc/self/maps for memory inspection | ConfigMap uses tmpfs-backed atomic symlink swap |
| Page cache | Shared host page cache across all containers | JVM file I/O benefits from unified page cache | fs.readFile benefits from page cache | os.ReadFile benefits from page cache | EmptyDir with Memory medium is tmpfs -- data lives in the page cache, never written back to disk |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Hardware | Block devices, NIC (for NFS), RAM (for tmpfs/procfs) |
| Block I/O | Request queue, I/O scheduler, device driver |
| Filesystem driver | ext4, XFS, NFS, overlayfs, FUSE -- implements VFS operations |
| VFS | super_block, inode, dentry, file + operations tables |
| Syscall | open(), read(), write(), mount(), stat() |
| Userspace | Application code -- unaware of underlying filesystem type |
Design Rationale: With 70+ filesystem types, a switch/case at read time would be absurd. Function pointer dispatch resolves in one indirection, set once at mount time. The dentry cache with RCU-walk makes the common case -- looking up a path that was recently resolved -- completely lock-free. Negative dentries exist because "does this file exist?" is asked millions of times for paths that do not exist, and hitting disk on every miss would be catastrophic for shell $PATH lookups and build systems. The unified page cache means dirty tracking, writeback, and eviction all work consistently whether data sits on ext4, NFS, or FUSE.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| EOPNOTSUPP on fallocate or xattr calls | Underlying filesystem does not support the operation | findmnt -T <path> to check filesystem type |
| Mounts inside container visible on host | Mount propagation set to shared instead of private | cat /proc/self/mountinfo to check propagation flags |
| FUSE mount 10x slower than ext4 for metadata ops | Every VFS op requires context switch to userspace daemon | strace -c -p <fuse_pid> to measure round-trip overhead |
| Dentry cache consuming gigabytes of slab memory | Many unique paths accessed (build systems, package managers) | slabtop -o \| grep dentry |
| stat() on NFS returns stale metadata | Dentry/attribute cache timeout not expired | mount \| grep nfs to check actimeo/acregmin settings |
| Path resolution slow on deeply nested paths | Dentry cache miss causing disk I/O per component | perf trace -e 'fs:*' <command> |
When to Use / Avoid
- Use VFS understanding when debugging filesystem-specific behavior differences (e.g., fallocate works on ext4 but not NFS v3)
- Use when designing container storage strategies -- overlayfs, bind mounts, and tmpfs all dispatch through VFS
- Use when implementing or debugging FUSE filesystems to understand the kernel-userspace round-trip cost
- Use when diagnosing mount propagation issues in containerized environments
- Avoid assuming all filesystems support the same operations -- check for EOPNOTSUPP returns
- Avoid bypassing VFS (direct block I/O) unless building a database engine with specific alignment requirements
Try It Yourself
```bash
# List all filesystem types registered with the kernel; 'nodev' means no block device required (pseudo-fs)
cat /proc/filesystems

# Display the complete mount hierarchy with filesystem types and mount options
findmnt --tree -o TARGET,SOURCE,FSTYPE,OPTIONS

# Mount a tmpfs (purely VFS/memory-backed filesystem); no block device involved
mount -t tmpfs -o size=256m tmpfs /mnt/ramdisk

# Trace the mount syscall to see how the kernel attaches a filesystem to the VFS tree
strace -e mount,umount2 mount /dev/sdb1 /mnt/data

# Per-process mount info showing mount IDs, parent IDs, device, root, mount point, and options; richer than /proc/mounts
cat /proc/self/mountinfo

# Count dentry cache events system-wide over 5 seconds to measure path resolution pressure
sudo perf stat -e 'dentry:*' -a sleep 5
```

Debug Checklist
1. findmnt --tree -o TARGET,SOURCE,FSTYPE,OPTIONS
2. cat /proc/filesystems
3. cat /proc/self/mountinfo
4. slabtop -o | grep -i 'dentry\|inode'
5. perf stat -e 'dentry:*' -a -- sleep 5 2>&1
6. strace -e open,openat,mount -p <PID> 2>&1 | head -20
Key Takeaways
- ✓ There's no giant switch/case on filesystem type. When you call read(), the kernel invokes file->f_op->read_iter(), which is ext4_file_read_iter() for ext4, nfs_file_read() for NFS, etc. Pure function-pointer indirection — one level of dispatch, zero branching
- ✓ The dentry cache (dcache) caches "file not found" too. Negative dentries prevent repeated disk reads for names that don't exist — critical when your shell searches $PATH or a build system checks dozens of include directories
- ✓ Plugging in a new filesystem is registering a struct. register_filesystem() adds a file_system_type to a global linked list; mount() walks the list, finds the right driver, and calls its mount() method to create a superblock
- ✓ 'Everything is a file' isn't a metaphor — it's an architecture. Pseudo-filesystems (procfs, sysfs, tmpfs) implement VFS operations purely in kernel memory, never touching a block device
- ✓ One page cache rules them all. The VFS page cache is unified across ext4, NFS, and even FUSE — so eviction, dirty writeback, and readahead work consistently regardless of backend
Common Pitfalls
- ✗ Assuming all filesystems support the same features — VFS operations can return -EOPNOTSUPP for unsupported ops (e.g., fallocate on NFS v3, xattrs on FAT32). The abstraction is uniform; the capabilities are not
- ✗ Confusing the VFS inode with the on-disk inode — the VFS inode is an in-memory, filesystem-independent structure populated by the driver's read_inode. It may contain fields that don't exist on disk at all
- ✗ Ignoring mount propagation (shared, slave, private) — this controls how mount events flow across namespaces. Get it wrong and mounts inside a container leak to the host, or vice versa
- ✗ Expecting FUSE to perform like an in-kernel filesystem — every VFS operation on FUSE requires a context switch to userspace and back. That round-trip is the price of flexibility
Reference
In One Line
One read() call, one function pointer dereference, and the right filesystem driver runs -- VFS makes ext4, NFS, procfs, and FUSE all look the same from userspace.