Inodes & File Metadata

🧠

Mental Model

Buildings and street addresses in a city. The building holds what matters -- who lives there, how many rooms, a map to the storage units inside. A street address is just a label pointing at the building. Two addresses can point to the same building (hard links). Tear down the sign, the building stays. The building only gets demolished when the last sign is removed and nobody is inside. The catch: the city issued a fixed number of building permits at founding. Empty land everywhere, but zero permits means no new buildings.

💡

The Problem

The mail server dashboard shows 200 GB free, but new emails fail with "No space left on device." Months of tiny session files in /tmp consumed every inode the filesystem had. Each file -- no matter how small -- takes one inode. Plenty of room for data, zero slots left to register new files.

Architecture

Run ls -la on any file and the output shows a filename, permissions, an owner, a size, a timestamp. It feels like all of that is stored together — one tidy bundle called "a file."

It's not. The filename is stored in one place. Everything else — permissions, ownership, timestamps, the location of the actual data on disk — is stored in a completely separate structure called an inode. The filename just points to it.

This split is not an implementation detail. It's the reason hard links work, the reason a file can be deleted while a process is still reading it, and the reason a server can die with "No space left on device" while sitting on 200 GB of free disk.

What Actually Happens

When the kernel needs to access a file's metadata, the sequence is:

Resolve the pathname through the dentry cache to get an inode number
Check the inode cache — a kernel hash table keyed by (superblock, inode number)
If it's cached, return it immediately
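If not, call into the filesystem to load the on-disk inode (e.g., ext4_inode) into a VFS struct inode. Older kernels did this through a read_inode() superblock operation; current kernels leave it to each filesystem's own lookup routine (ext4_iget() on ext4)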

The VFS inode is the filesystem-independent version. Every filesystem driver must populate it, which is how the kernel provides a uniform interface across ext4, XFS, Btrfs, NFS, and everything else.

Here's what lives inside the inode — and what doesn't:

Stored in the inode: permissions (mode), owner (uid/gid), size, hard link count, timestamps (atime/mtime/ctime), and pointers to the data blocks on disk.

NOT in the inode: the filename. That lives in directory entries, which are separate structures that map name strings to inode numbers.

This separation is what makes hard links possible: multiple names, one inode, one set of data blocks.
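To make the name/inode split concrete, here is a minimal Python sketch. It assumes a POSIX system and a temporary directory that supports hard links; the helper names describe() and hard_link_demo() and the file names are made up for illustration. It creates one file, gives it a second name, and shows that both names report the same inode fields.

```python
import os
import tempfile

def describe(path):
    """Return the inode fields that stat() exposes for one path."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "links": st.st_nlink,
        "mode": oct(st.st_mode & 0o7777),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
    }

def hard_link_demo():
    """Create one file with two names and describe it through each name."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "report.txt")   # hypothetical file names
        second = os.path.join(d, "alias.txt")
        with open(first, "w") as f:
            f.write("hello")
        os.link(first, second)  # adds a directory entry, not a new inode
        return describe(first), describe(second)

a, b = hard_link_demo()
print(a)
print("same inode through both names:", a == b)
```

Note that the filename never appears in the dictionary: nothing that stat() returns knows which name was used to reach the inode.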

Under the Hood

The link count and unlink-while-open. The st_nlink field tracks how many directory entries point to this inode. When unlink() removes a name, it decrements st_nlink. But the kernel doesn't free the inode and its data blocks until two conditions are met: st_nlink reaches zero AND no process has the file open. This means a file's last name can be deleted while a process is still reading it — the process keeps its fd, keeps reading, and the disk space is only reclaimed when it closes the fd. This is foundational for atomic file replacement and safe log rotation.
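The unlink-while-open behaviour can be observed directly. The sketch below assumes Linux or another POSIX system; read_after_unlink() is an illustrative name. It deletes a file's only name while holding a descriptor, then checks the link count and keeps reading.

```python
import os
import tempfile

def read_after_unlink():
    """Unlink a file's only name while holding it open, then keep reading."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                   # the name is gone, st_nlink drops to 0
        links = os.fstat(fd).st_nlink     # the inode is still reachable via the fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)
        return links, os.path.exists(path), data
    finally:
        os.close(fd)                      # last reference closed: space can be reclaimed

print(read_after_unlink())
```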

Timestamps: atime, mtime, ctime. mtime updates when the data is modified. ctime updates when anything about the inode changes: permissions, ownership, link count, or data (a write updates mtime and usually size, both of which are inode fields). atime updates on read, but the relatime mount option (default since Linux 2.6.30) only updates atime if it is not newer than mtime or ctime, or if it is more than 24 hours old, dramatically reducing write I/O. noatime disables atime updates entirely, which is common for database workloads where every read would otherwise generate a write.
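A small sketch of the mtime/ctime distinction, assuming a POSIX system. The function name and the chosen old timestamp are illustrative. It backdates mtime with os.utime(), makes a metadata-only change with chmod, and returns both timestamps.

```python
import os
import tempfile

def chmod_touches_ctime_only():
    """Backdate mtime, then chmod: mtime stays put while ctime reflects 'now'."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        old = 1_000_000_000            # arbitrary timestamp in September 2001
        os.utime(path, (old, old))     # sets atime and mtime; ctime cannot be set this way
        os.chmod(path, 0o600)          # metadata-only change
        st = os.stat(path)
        return st.st_mtime, st.st_ctime
    finally:
        os.unlink(path)

mtime, ctime = chmod_touches_ctime_only()
print("mtime:", mtime, "ctime:", ctime)
```

The returned mtime is still the backdated value, while ctime is the current time, because the kernel stamps ctime itself on every inode change.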

Inode size and allocation. ext4 uses 256-byte inodes by default (ext3 defaulted to 128). The extra space stores extended attributes (xattrs), nanosecond timestamps, and, when the inline_data feature is enabled, the contents of tiny files. The ratio of inodes to disk space is set at mkfs time with the -i (bytes-per-inode) option. The default of one inode per 16 KB works for most workloads. But a mail server storing millions of tiny files needs more inodes (smaller bytes-per-inode), and a video storage server with few large files needs fewer.
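The bytes-per-inode arithmetic is easy to work through. The sketch below is a back-of-the-envelope estimate only: it ignores how mke2fs rounds per block group and other reserved space, and the function names are made up. It compares the 16 KB default with a denser 4 KB ratio on a 200 GiB volume and shows what the inode tables would cost at 256 bytes per inode.

```python
GiB = 1024 ** 3

def inode_count(fs_bytes, bytes_per_inode=16384):
    """Approximate number of inodes created for a filesystem of fs_bytes."""
    return fs_bytes // bytes_per_inode

def inode_table_bytes(fs_bytes, bytes_per_inode=16384, inode_size=256):
    """Approximate space the preallocated inode tables occupy."""
    return inode_count(fs_bytes, bytes_per_inode) * inode_size

for ratio in (16384, 4096):
    n = inode_count(200 * GiB, ratio)
    table = inode_table_bytes(200 * GiB, ratio) / GiB
    print(f"-i {ratio}: ~{n:,} inodes, ~{table:.3f} GiB of inode tables")
```

Under these assumptions the default ratio yields roughly 13.1 million inodes, and quartering the ratio quadruples both the inode count and the space spent on inode tables, which is the trade-off the -i option controls.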

Extents, not block pointers. Old-school filesystems (ext2, ext3) stored block pointers — one pointer per data block. For a 1 GB file with 4 KB blocks, that's 262,144 pointers. Modern filesystems (ext4, XFS, Btrfs) use extents: a single (start block, length) pair describes a contiguous run. One extent can replace thousands of pointers. ext4 stores up to 4 extents directly in the inode. If the file is fragmented beyond that, it builds an extent tree — a B-tree of extents rooted in the inode.
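To illustrate the idea (not ext4's actual on-disk code), here is a sketch that coalesces a file's physical block numbers into (start, length) runs. The function name is hypothetical, and the max_len default of 32768 blocks reflects my understanding of ext4's per-extent length limit for initialized extents; treat it as an assumption.

```python
def to_extents(blocks, max_len=32768):
    """Coalesce physical block numbers (in file order) into (start, length) runs.

    max_len caps the length of a single extent; pass None for no cap.
    """
    extents = []
    for b in blocks:
        if extents:
            start, length = extents[-1]
            contiguous = (b == start + length)
            has_room = max_len is None or length < max_len
            if contiguous and has_room:
                extents[-1] = (start, length + 1)
                continue
        extents.append((b, 1))
    return extents

# A contiguous 1 GB file with 4 KB blocks: 262,144 block pointers in ext2/ext3 terms.
one_gb = range(1000, 1000 + 262_144)
print(len(to_extents(one_gb)))                 # extents needed with the length cap
print(len(to_extents(one_gb, max_len=None)))   # extents needed with no cap
print(to_extents([5, 6, 7, 20, 21]))           # a fragmented file
```

With the cap, the contiguous file still needs a handful of extents rather than one, so it would spill past the four in-inode slots into an extent tree, but that remains tiny next to a quarter of a million pointers.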

Common Questions

Why can't hard links cross filesystem boundaries?

Because inode numbers are only unique within a single filesystem (identified by st_dev). Inode 42 on /dev/sda1 is a completely different file than inode 42 on /dev/sdb1. A directory entry on one filesystem can't reference an inode on another — there's no cross-device lookup mechanism. Symlinks solve this by storing a pathname string, which the VFS resolves through path walk at access time, potentially crossing mount points.
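In practice the kernel reports a cross-filesystem hard link attempt as errno EXDEV. A minimal sketch of handling it, assuming a POSIX system (link_or_symlink() is an illustrative helper, not a standard function):

```python
import errno
import os

def link_or_symlink(src, dst):
    """Try a hard link; fall back to a symlink if dst is on another filesystem."""
    try:
        os.link(src, dst)
        return "hard"
    except OSError as e:
        if e.errno != errno.EXDEV:       # only handle the cross-device case
            raise
        os.symlink(os.path.abspath(src), dst)  # a stored path can cross mounts
        return "symlink"
```

When both paths sit on the same filesystem this returns "hard" and the two names share one (st_dev, st_ino) pair, which is also what os.path.samefile() compares.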

A process opens a file, then someone deletes it. What happens to disk space?

The directory entry is removed and st_nlink drops to zero, but the inode and data blocks are NOT freed until the last fd referencing it is closed. This is why du and df can disagree: du walks directory entries (which are gone), while df checks actual block allocation (still in use). Use lsof +L1 to find deleted files still held open by processes — it's one of the most common causes of "where did my disk space go?"
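On Linux, the same information lsof +L1 reports can be approximated by reading /proc. The sketch below is Linux-only and assumes permission to read the target process's fd directory; deleted_open_files() is an illustrative name. It relies on the kernel appending " (deleted)" to the link target of descriptors whose file has been unlinked.

```python
import os

def deleted_open_files(pid="self"):
    """List (fd, target) pairs for open files whose name has been unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return found

for fd, target in deleted_open_files():
    print(fd, target)
```

Passing another process ID instead of "self" inspects that process, which is how a deleted-but-still-growing log file is usually tracked down.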

What's the difference between ctime and mtime?

mtime records when the file's data was last modified (writes, truncates). ctime records when the inode metadata was last changed: permission changes (chmod), ownership changes (chown), link count changes (link/unlink), and also data changes (because a write updates mtime and usually size/blocks, which live in the inode). ctime cannot be set through the normal timestamp APIs such as utimensat(); the kernel always maintains it. This makes ctime useful for spotting metadata-only changes that mtime would miss, though it is not tamper-proof: anyone who can change the system clock or write to the raw device can still forge it.

How does statx() improve on stat()?

Five ways: (1) birth time (actual file creation time) via STATX_BTIME. (2) Selective field requests via a mask, so the filesystem can skip work for fields the caller does not need; separately, the AT_STATX_DONT_SYNC flag lets callers accept cached attributes instead of forcing an NFS attribute refresh. (3) Attribute flags like STATX_ATTR_COMPRESSED and STATX_ATTR_ENCRYPTED. (4) Uniform timestamps: 64-bit seconds plus nanoseconds on every architecture. (5) Extensibility: new metadata fields can be added without inventing new syscalls. For new Linux-specific code that calls stat(), statx() is usually the better choice.

How Technologies Use This

Git

Running git status on a 100,000-file monorepo returns in under 500 milliseconds. Without inode metadata, Git would need to hash every file's contents on every status check, turning that quick command into minutes of disk I/O.

Git caches each file's st_ino, st_mtime, st_ctime, st_size, and st_dev in the .git/index file. On each git status run, it calls lstat() on each tracked file and compares the result to the cached inode fields. If they all match, the file is skipped entirely, no content hashing needed.

This turns an O(n * filesize) content-hashing operation into O(n * single-syscall). The lesson: inode metadata is not just bookkeeping, it is the performance shortcut that makes large-repo workflows viable.
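The shortcut can be sketched in a few lines. This is a simplified illustration of the stat-cache idea, not Git's actual index code: the StatCache class and its method names are invented, SHA-1 stands in for the content hash, and edge cases Git handles (such as files modified within the timestamp granularity) are ignored.

```python
import hashlib
import os

def stat_key(path):
    """The inode fields compared on each check, taken without following symlinks."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns, st.st_ctime_ns)

def content_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()

class StatCache:
    def __init__(self):
        self.entries = {}          # path -> (stat_key, content_hash)
        self.hashes_computed = 0   # counts how often we fell back to hashing

    def add(self, path):
        """Start tracking a file: remember its metadata and content hash."""
        self.entries[path] = (stat_key(path), content_hash(path))

    def is_modified(self, path):
        """Cheap path: one lstat(). Expensive path: rehash only if metadata moved."""
        cached_key, cached_digest = self.entries[path]
        if stat_key(path) == cached_key:
            return False           # metadata identical, skip reading the file
        self.hashes_computed += 1
        return content_hash(path) != cached_digest
```

A status check over many tracked files then costs one lstat() per file in the common case, and file contents are read only for the few entries whose metadata no longer matches.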

Docker

A container host shows 40% disk free, yet file creation fails with "No space left on device." The dashboard says plenty of room. The ops team is baffled.

Disk space and inodes are two separate resources, and containers can exhaust one while the other looks fine. overlay2 layers do not have inode tables of their own: every file in every layer is an ordinary file on the host's backing filesystem and draws from that filesystem's inode pool, and every copy-up operation allocates one more inode in the writable upper layer. A host running 200 containers with Node.js apps can burn through millions of inodes from node_modules alone while barely denting block usage. The default ext4 ratio of one inode per 16 KB can be too low for container workloads like this.

Monitor with df -i, and when the backing filesystem is ext4, format container storage volumes with mkfs.ext4 -i 4096 to quadruple inode density relative to the 16 KB default. Always check inode exhaustion before disk space when debugging container file creation failures.
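For monitoring from code rather than a shell, the numbers behind df -i are available through statvfs. A minimal sketch, assuming a POSIX system; the function names and the 90% threshold are illustrative choices, not values from any particular tool.

```python
import os

def inode_usage(path="/"):
    """Return total/used/free inode counts for the filesystem containing path."""
    vfs = os.statvfs(path)
    total, free = vfs.f_files, vfs.f_ffree
    used = total - free
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    pct = 100.0 * used / total if total else None
    return {"total": total, "used": used, "free": free, "percent_used": pct}

def inode_pressure(path="/", threshold=90.0):
    """True when inode usage is at or above threshold percent."""
    pct = inode_usage(path)["percent_used"]
    return pct is not None and pct >= threshold

print(inode_usage("/"))
```

Pointing this at the directory that backs container storage, rather than at /, matters when that directory lives on its own volume.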

Same Concept Across Tech

Technology	How inodes affect it	Key detail
Docker	Overlay2 layers create separate inodes for every file in the upper layer	Many containers with many small files can exhaust host inodes
Kubernetes	Kubelet monitors inode usage. Evicts pods when inode pressure threshold is hit	imagefs.inodesFree eviction threshold in kubelet config
Git	Loose objects are individual files (one inode each). Pack files consolidate them	git gc reduces inode usage dramatically
Nginx	Each cached response is a separate file with its own inode	Proxy cache with millions of items can exhaust inodes
Postfix/Dovecot	Maildir format stores each email as a separate file	Mail servers with millions of messages are the classic inode exhaustion case

Stack layer mapping (cannot create file despite free space):

Layer	What to check	Tool
Application	Is the app creating many tiny files instead of fewer large ones?	Application logic review
Filesystem	Are inodes exhausted? Which directory has the most files?	df -i, find with file count
Kernel	Are deleted files still held open by processes?	lsof +L1
Storage	Which filesystem type? ext4 has fixed inodes, XFS/btrfs are dynamic	df -T
Hardware	Is the partition too small for the expected file count?	Reformat with more inodes (mke2fs -i bytes-per-inode)

Design Rationale

Separating file identity (inode) from file naming (directory entry) reflects a simple insight: a file's metadata and data are intrinsic, but its name is just a relationship imposed by the directory. That separation makes hard links work -- multiple names, one inode -- and lets a file survive after its last name is deleted, so long as a process still holds it open. Fixing the inode count at mkfs time (ext4) was a space-efficiency bet: preallocated inode tables at known disk offsets give O(1) lookups without an index, but the inode-to-block ratio has to be guessed upfront. XFS and Btrfs allocate inodes dynamically, trading that lookup simplicity for flexibility -- which is why inode exhaustion is mostly an ext4 problem.

If You See This, Think This

Symptom	Likely cause	First check
ENOSPC but df shows free space	Inode exhaustion, not disk space	df -i to check inode usage
File deleted but disk space not freed	Another process has the file open (link count 0 but fd still open)	lsof +L1 to find deleted-but-open files
Hard link count unexpected	Multiple directory entries pointing to the same inode	stat file, check Links field
mv is instant but cp is slow	mv just changes a directory entry (same inode). cp creates new inode + copies data	Expected behavior
Container evicted by kubelet with no OOM	Inode pressure threshold exceeded on node	Check kubelet eviction logs for inode pressure
Deleting millions of files is slow	Each unlink is a separate inode + directory entry operation	Batch delete, or truncate + unlink

When to Use / Avoid

Relevant when:

Debugging "No space left on device" with plenty of free disk space (inode exhaustion)
Understanding hard links vs symlinks (hard links share an inode, symlinks are separate inodes)
Understanding why deleting a file does not free space if another process has it open (link count > 0 or open fd)
Monitoring filesystem health on mail servers, build systems, or /tmp directories with many tiny files

Watch out for:

Filesystems with fixed inode count (ext4). Cannot increase without reformatting
XFS and btrfs allocate inodes dynamically, so inode exhaustion is rare
A file with 0 links is not deleted until all open file descriptors are closed (unlink but fd still open)

Try It Yourself

# Show full inode metadata: inode number, link count, permissions, size, all three timestamps
stat /etc/passwd

# Display inode usage for root filesystem. IUsed, IFree, IUse%
df -i /

# Show just the inode number alongside the filename
ls -i /etc/passwd

# Find all filenames (hard links) pointing to inode 12345 on the same filesystem
find / -xdev -inum 12345

# Show filesystem-level info: block size, total/free inodes, filesystem type
stat -f /etc/passwd

# Show file birth time (creation time) on supported filesystems; recent GNU stat gets it via statx() and prints - if unknown
stat -c '%w' /etc/passwd

Debug Checklist

1. Check inode usage: df -i (shows used/available inodes per filesystem)
2. Find directory with most files: find / -xdev -type d -exec sh -c 'echo $(ls -A "$1" | wc -l) "$1"' _ {} \; | sort -rn | head
3. Check inode of a file: stat <file> (shows Inode: number)
4. Find all hard links to same inode: find / -xdev -inum <inode_number> (-xdev keeps the search on one filesystem, since inode numbers repeat across filesystems)
5. Check if deleted files are still held open: lsof +L1
6. Check filesystem type (fixed vs dynamic inodes): df -T
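The find pipeline in step 2 spawns a shell per directory, which is slow on exactly the filesystems where it is needed. As an alternative, here is a rough Python sketch of the same idea; the function names are invented, and unreadable directories are silently skipped by os.walk(). It counts entries per directory and stays on one filesystem, like -xdev.

```python
import os

def _device(path):
    try:
        return os.lstat(path).st_dev
    except OSError:
        return None  # vanished or unreadable: treat as not on this filesystem

def busiest_directories(root, top=10):
    """Return the top (entry_count, directory) pairs under root, largest first."""
    root_dev = _device(root)
    counts = []
    for dirpath, dirnames, filenames in os.walk(root):
        counts.append((len(dirnames) + len(filenames), dirpath))
        # Prune subdirectories on other filesystems so the walk never crosses mounts.
        dirnames[:] = [d for d in dirnames
                       if _device(os.path.join(dirpath, d)) == root_dev]
    counts.sort(reverse=True)
    return counts[:top]

for count, path in busiest_directories("/tmp", top=5):
    print(count, path)
```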

Key Takeaways

✓Inode numbers are only unique within a single filesystem — two files on different mounts can share the same number. This is exactly why hard links can't cross filesystem boundaries: there's no cross-device inode lookup
✓Run out of inodes and you can't create files, period — even with terabytes free. The inode count is fixed at mkfs time (unless you're on XFS/Btrfs which allocate dynamically). Check with df -i before it's too late
✓statx() is the modern stat(): it takes a mask of the fields you want, can skip NFS round trips with the AT_STATX_DONT_SYNC flag, adds birth time (btime), and is extensible without needing new syscalls
✓Delete a file while a process has it open and the data survives — the inode's link count drops to 0 but the file persists until the last fd closes. This is how atomic file replacement and safe log rotation work
✓ext4 replaced indirect block pointers with an extent tree — one (start, length) pair describes a contiguous run of blocks, collapsing massive metadata overhead for large files compared to ext3's triple-indirect scheme

Common Pitfalls

✗Seeing "No space left on device" and only checking df -h — the real culprit might be inode exhaustion. Always check df -i too. They're different failure modes with the same error message
✗Using stat() on a symlink and getting the target's metadata — stat() follows symlinks by default. Use lstat() to inspect the symlink itself, or you'll think the symlink is a regular file
✗Expecting birth time (file creation time) to be available everywhere — ext4 stores it, but stat() doesn't expose it. You need statx() with STATX_BTIME, and even then not all filesystems record it
✗Relying on inode numbers for file identity across reboots on tmpfs/procfs — pseudo-filesystems allocate inodes dynamically and numbers are not stable between reboots
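The stat()/lstat() pitfall is easy to reproduce. A small sketch, assuming a POSIX system where symlinks can be created in the temporary directory; the function and file names are illustrative.

```python
import os
import stat
import tempfile

def compare_stat_lstat():
    """Stat a symlink both ways and report what each call actually describes."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target.txt")
        link = os.path.join(d, "link")
        with open(target, "w") as f:
            f.write("data")
        os.symlink(target, link)
        followed = os.stat(link)    # follows the symlink: the target's inode
        itself = os.lstat(link)     # the symlink's own inode
        return {
            "stat_is_regular": stat.S_ISREG(followed.st_mode),
            "lstat_is_symlink": stat.S_ISLNK(itself.st_mode),
            "same_inode": followed.st_ino == itself.st_ino,
            "stat_matches_target": followed.st_ino == os.stat(target).st_ino,
        }

print(compare_stat_lstat())
```

The two calls return different inode numbers because a symlink is its own inode whose content is a path string.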

Reference

System Calls

stat, fstat, lstat, statx, chmod, chown

Tools

stat / stat -f, df -i, debugfs

📌

In One Line

Filenames are cheap labels; the inode is the real identity -- and when inodes run out, the disk looks empty while refusing to create anything.
