Inodes & File Metadata
Mental Model
Buildings and street addresses in a city. The building holds what matters -- who lives there, how many rooms, a map to the storage units inside. A street address is just a label pointing at the building. Two addresses can point to the same building (hard links). Tear down the sign, the building stays. The building only gets demolished when the last sign is removed and nobody is inside. The catch: the city issued a fixed number of building permits at founding. Empty land everywhere, but zero permits means no new buildings.
The Problem
The mail server dashboard shows 200 GB free, but new emails fail with "No space left on device." Months of tiny session files in /tmp consumed every inode the filesystem had. Each file -- no matter how small -- takes one inode. Plenty of room for data, zero slots left to register new files.
Architecture
Run ls -la on any file and the output shows a filename, permissions, an owner, a size, a timestamp. It feels like all of that is stored together — one tidy bundle called "a file."
It's not. The filename is stored in one place. Everything else — permissions, ownership, timestamps, the location of the actual data on disk — is stored in a completely separate structure called an inode. The filename just points to it.
This split is not an implementation detail. It's the reason hard links work, the reason a file can be deleted while a process is still reading it, and the reason a server can die with "No space left on device" while sitting on 200 GB of free disk.
What Actually Happens
When the kernel needs to access a file's metadata, the sequence is:
- Resolve the pathname through the dentry cache to get an inode number
- Check the inode cache — a kernel hash table keyed by (superblock, inode number)
- If it's cached, return it immediately
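- If not, call into the filesystem driver to load the on-disk inode (e.g., struct ext4_inode) into a VFS struct inode (historically via the read_inode() superblock operation; in current kernels via a filesystem-specific routine such as ext4_iget())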
The VFS inode is the filesystem-independent version. Every filesystem driver must populate it, which is how the kernel provides a uniform interface across ext4, XFS, Btrfs, NFS, and everything else.
Here's what lives inside the inode — and what doesn't:
Stored in the inode: permissions (mode), owner (uid/gid), size, hard link count, timestamps (atime/mtime/ctime), and pointers to the data blocks on disk.
NOT in the inode: the filename. That lives in directory entries, which are separate structures that map name strings to inode numbers.
This separation is what makes hard links possible: multiple names, one inode, one set of data blocks.
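The name-to-inode mapping can be observed from user space. The following is a minimal sketch, assuming a local POSIX filesystem that supports hard links; the file names and the helper names_for_inode are hypothetical and exist only for illustration.

```python
import os
import tempfile


def names_for_inode(directory, inode_number):
    """Return every name in `directory` whose directory entry maps to `inode_number`."""
    return sorted(e.name for e in os.scandir(directory) if e.inode() == inode_number)


def hard_link_demo():
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "report.txt")
        alias = os.path.join(d, "report-alias.txt")
        with open(original, "w") as f:
            f.write("same bytes, two names\n")
        # Adds a second directory entry for the same inode; no data is copied.
        os.link(original, alias)
        st = os.stat(original)
        return {
            "same_inode": st.st_ino == os.stat(alias).st_ino,
            "link_count": st.st_nlink,
            "names": names_for_inode(d, st.st_ino),
        }


if __name__ == "__main__":
    print(hard_link_demo())
```

Both names report the same inode number and a link count of 2, because the names live in the directory while everything else lives in the one shared inode.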
Under the Hood
The link count and unlink-while-open. The st_nlink field tracks how many directory entries point to this inode. When unlink() removes a name, it decrements st_nlink. But the kernel doesn't free the inode and its data blocks until two conditions are met: st_nlink reaches zero AND no process has the file open. This means a file's last name can be deleted while a process is still reading it — the process keeps its fd, keeps reading, and the disk space is only reclaimed when it closes the fd. This is foundational for atomic file replacement and safe log rotation.
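The unlink-while-open behavior is easy to reproduce. This sketch assumes a local Linux filesystem (network filesystems such as NFS may behave differently); the function name is hypothetical.

```python
import os
import tempfile


def unlink_while_open():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here\n")
        os.unlink(path)                  # remove the file's only name
        st = os.fstat(fd)                # the inode is still reachable through the fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)
        return {
            "path_exists": os.path.exists(path),
            "link_count": st.st_nlink,
            "data": data,
        }
    finally:
        # Only after this close can the kernel free the inode and its blocks.
        os.close(fd)


if __name__ == "__main__":
    print(unlink_while_open())
```

The name is gone and the link count is 0, yet the data is still readable until the descriptor is closed.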
Timestamps: atime, mtime, ctime. mtime updates when the data is modified. ctime updates when anything about the inode changes: permissions, ownership, link count, or data (a write updates mtime and usually the size, both of which live in the inode). atime updates on read, but the relatime mount option (default since Linux 2.6.30) only updates atime if it is not newer than mtime or ctime, or if it is more than 24 hours old, dramatically reducing write I/O. noatime disables atime updates entirely, which is common for database workloads where every read would otherwise generate a write.
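The relatime decision can be written as a small predicate. This is a sketch of the rule as described in mount(8), not the kernel's code: atime is refreshed when it is earlier than or equal to mtime or ctime, or when it is at least a day old. The function name and the plain integer timestamps (seconds) are assumptions for illustration.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_should_update(atime, mtime, ctime, now):
    """Return True if a read at time `now` would rewrite atime under relatime."""
    if mtime >= atime:      # file was modified since it was last read
        return True
    if ctime >= atime:      # inode metadata changed since it was last read
        return True
    # Otherwise refresh at most once a day so atime never goes completely stale.
    return now - atime >= DAY_SECONDS


if __name__ == "__main__":
    print(relatime_should_update(atime=300, mtime=200, ctime=200, now=400))
```

A file that is read repeatedly without being modified therefore causes at most one atime write per day.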
Inode size and allocation. ext4 uses 256-byte inodes (ext3 used 128). The extra space stores extended attributes (xattrs), nanosecond timestamps, and inline data for tiny files. The ratio of inodes to disk space is set at mkfs time with the -i (bytes-per-inode) option. The default of one inode per 16 KB works for most workloads. But a mail server storing millions of tiny files needs more inodes (smaller bytes-per-inode), and a video storage server with few large files needs fewer.
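To see what the bytes-per-inode ratio means in practice, here is a rough calculation. It is an approximation only: real mke2fs rounds inode counts per block group, and the function name is hypothetical.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def estimate_inodes(fs_bytes, bytes_per_inode=16384, inode_size=256):
    """Approximate inode count and inode-table size for an ext4-style fixed ratio."""
    inodes = fs_bytes // bytes_per_inode
    table_bytes = inodes * inode_size
    return inodes, table_bytes


if __name__ == "__main__":
    for ratio in (16384, 4096):
        inodes, table = estimate_inodes(TIB, ratio)
        print(f"-i {ratio}: ~{inodes:,} inodes, ~{table // GIB} GiB of inode tables")
```

On a 1 TiB volume the default ratio gives roughly 67 million inodes backed by about 16 GiB of inode tables. Dropping to 4096 bytes per inode quadruples both numbers, which is the trade-off: more files can be registered, and more space is reserved for metadata up front.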
Extents, not block pointers. Old-school filesystems (ext2, ext3) stored block pointers — one pointer per data block. For a 1 GB file with 4 KB blocks, that's 262,144 pointers. Modern filesystems (ext4, XFS, Btrfs) use extents: a single (start block, length) pair describes a contiguous run. One extent can replace thousands of pointers. ext4 stores up to 4 extents directly in the inode. If the file is fragmented beyond that, it builds an extent tree — a B-tree of extents rooted in the inode.
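The pointer-versus-extent arithmetic can be checked directly. The sketch below assumes 4 KB blocks and a maximum of 32,768 blocks per ext4 extent (an assumption not stated in the text above), and it ignores the indirect blocks that the old scheme needs to hold the pointers themselves.

```python
def blocks_for(file_bytes, block_size=4096):
    """Number of data blocks, rounding up to a whole block."""
    return -(-file_bytes // block_size)


def block_pointers_needed(file_bytes, block_size=4096):
    """ext2/ext3 style: one pointer per data block."""
    return blocks_for(file_bytes, block_size)


def min_extents_needed(file_bytes, block_size=4096, max_blocks_per_extent=32768):
    """Best case for extents: the file is laid out as contiguously as possible."""
    return -(-blocks_for(file_bytes, block_size) // max_blocks_per_extent)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(block_pointers_needed(one_gib), min_extents_needed(one_gib))
```

Under these assumptions a fully contiguous 1 GB file needs 262,144 block pointers but only 8 extents. Eight is still more than the 4 that fit in the inode, so a file of that size uses a shallow extent tree even in the best case.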
Common Questions
Why can't hard links cross filesystem boundaries?
Because inode numbers are only unique within a single filesystem (identified by st_dev). Inode 42 on /dev/sda1 is a completely different file than inode 42 on /dev/sdb1. A directory entry on one filesystem can't reference an inode on another — there's no cross-device lookup mechanism. Symlinks solve this by storing a pathname string, which the VFS resolves through path walk at access time, potentially crossing mount points.
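Because an inode number is only meaningful together with its device, a file's identity is the pair (st_dev, st_ino). A minimal sketch, with hypothetical file names, comparing a hard link against a copy:

```python
import os
import shutil
import tempfile


def file_identity(path):
    """Identity of the underlying file: device plus inode number."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def identity_demo():
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "a.txt")
        hard = os.path.join(d, "a-hardlink.txt")
        copy = os.path.join(d, "a-copy.txt")
        with open(original, "w") as f:
            f.write("data\n")
        os.link(original, hard)        # second name, same inode
        shutil.copy(original, copy)    # new inode, duplicated data
        same_as_link = file_identity(original) == file_identity(hard)
        same_as_copy = file_identity(original) == file_identity(copy)
        return same_as_link, same_as_copy


if __name__ == "__main__":
    print(identity_demo())
```

Python's os.path.samefile() performs the same device-and-inode comparison.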
A process opens a file, then someone deletes it. What happens to disk space?
The directory entry is removed and st_nlink drops to zero, but the inode and data blocks are NOT freed until the last fd referencing it is closed. This is why du and df can disagree: du walks directory entries (which are gone), while df checks actual block allocation (still in use). Use lsof +L1 to find deleted files still held open by processes — it's one of the most common causes of "where did my disk space go?"
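For a single process, a rough equivalent of the lsof check can be built from /proc. This sketch is Linux-specific and relies on the kernel appending " (deleted)" to the link target of an unlinked file; it is a heuristic (a file whose real name ends that way would be a false positive), and the function name is hypothetical.

```python
import os


def deleted_open_files(pid="self"):
    """Return (fd, target) pairs for open descriptors whose file has been unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        return found                    # no /proc, no such pid, or no permission
    for name in entries:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue                    # descriptor closed while we were scanning
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return found


if __name__ == "__main__":
    for fd, target in deleted_open_files():
        print(fd, target)
```

Pass a numeric pid instead of "self" to inspect another process, subject to permissions.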
What's the difference between ctime and mtime?
mtime records when the file's data was last modified: writes, truncates. ctime records when the inode metadata was last changed. That includes permission changes (chmod), ownership changes (chown), link count changes (link/unlink), and also data changes (because they update mtime, size, or blocks in the inode). ctime cannot be set through utime()-style calls; the kernel always maintains it. This makes ctime useful for detecting changes that mtime would miss, including metadata-only ones, although it is not tamper-proof: anyone who can change the system clock or write to the raw device can still forge it.
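The difference shows up with a backdated file. In this sketch (function name hypothetical, local filesystem assumed), mtime is set to a past value with os.utime(), then the mode is changed: mtime stays where it was put, while ctime reflects the kernel's own clock.

```python
import os
import tempfile


def chmod_touches_ctime_only():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        past_ns = 1_000_000_000 * 10 ** 9         # September 2001, in nanoseconds
        os.utime(path, ns=(past_ns, past_ns))     # sets atime and mtime, never ctime
        before = os.stat(path)
        os.chmod(path, 0o644)                     # metadata-only change
        after = os.stat(path)
        return {
            "mtime_unchanged": after.st_mtime_ns == before.st_mtime_ns,
            "ctime_not_backdated": after.st_ctime_ns > past_ns,
            "ctime_not_older": after.st_ctime_ns >= before.st_ctime_ns,
        }
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(chmod_touches_ctime_only())
```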
How does statx() improve on stat()?
Five ways: (1) birth time, the actual file creation time, via STATX_BTIME. (2) Selective field requests via a mask, plus flags such as AT_STATX_DONT_SYNC that let callers skip expensive operations like an NFS attribute refresh. (3) Attribute flags like STATX_ATTR_COMPRESSED and STATX_ATTR_ENCRYPTED. (4) Timestamps in a dedicated struct statx_timestamp with 64-bit seconds plus nanoseconds on every architecture. (5) Extensibility: new metadata fields can be added without inventing new syscalls. For new Linux-specific code that would otherwise call stat(), statx() is usually the better choice.
How Technologies Use This
Running git status on a 100,000-file monorepo returns in under 500 milliseconds. Without inode metadata, Git would need to hash every file's contents on every status check, turning that quick command into minutes of disk I/O.
Git caches each file's st_ino, st_mtime, st_ctime, st_size, and st_dev in the .git/index file. On each git status run, it calls lstat() on each tracked file and compares the result to the cached inode fields. If they all match, the file is skipped entirely, no content hashing needed.
This turns an O(n * filesize) content-hashing operation into O(n * single-syscall). The lesson: inode metadata is not just bookkeeping, it is the performance shortcut that makes large-repo workflows viable.
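The idea can be sketched in a few lines. This is not Git's code or its index format; the dictionary-based index, the function names, and the choice of SHA-1 over whole-file contents are all assumptions made for illustration. Real Git also has to handle edge cases such as files modified within the timestamp granularity.

```python
import hashlib
import os

# Stat fields compared on every check, mirroring the ones named above.
STAT_FIELDS = ("st_dev", "st_ino", "st_size", "st_mtime_ns", "st_ctime_ns")


def stat_signature(path):
    st = os.lstat(path)
    return tuple(getattr(st, field) for field in STAT_FIELDS)


def content_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def refresh(index, path):
    """Update `index` for `path`. Return True if contents had to be rehashed."""
    signature = stat_signature(path)
    cached = index.get(path)
    if cached is not None and cached[0] == signature:
        return False                    # cache hit: one lstat(), no file read
    index[path] = (signature, content_hash(path))
    return True
```

The first call for a path hashes the file; later calls return False without reading it until any of the cached stat fields changes.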
A container host shows 40% disk free, yet file creation fails with "No space left on device." The dashboard says plenty of room. The ops team is baffled.
Disk space and inodes are two separate resources, and containers exhaust one while the other looks fine. Each overlay2 layer is a directory tree on the host's backing filesystem, so every file in every layer consumes an inode there, and every copy-up operation allocates a new inode in the writable upper layer. A host running 200 containers with Node.js apps can burn through millions of inodes from node_modules alone while barely denting block usage. The default ext4 ratio of one inode per 16 KB can be far too low for container workloads dominated by small files.
Monitor with df -i, and format container storage volumes with mkfs.ext4 -i 4096 to quadruple inode density relative to the 16 KB default. When debugging container file creation failures, check inode usage before disk space.
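The same numbers that df -i prints are available to a monitoring script through statvfs(). A minimal sketch, with a hypothetical function name; the percentage may differ slightly from df's rounding, and filesystems that allocate inodes dynamically may report a total of 0.

```python
import os


def inode_usage(path="/"):
    """Report inode totals for the filesystem that contains `path`."""
    vfs = os.statvfs(path)
    total, free = vfs.f_files, vfs.f_ffree
    used = total - free
    percent = None if total == 0 else round(100 * used / total, 1)
    return {"total": total, "used": used, "free": free, "percent_used": percent}


if __name__ == "__main__":
    print(inode_usage("/"))
```

Alerting on percent_used alongside block usage catches inode exhaustion before file creation starts failing.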
Same Concept Across Tech
| Technology | How inodes affect it | Key detail |
|---|---|---|
| Docker | Overlay2 layers create separate inodes for every file in the upper layer | Many containers with many small files can exhaust host inodes |
| Kubernetes | Kubelet monitors inode usage. Evicts pods when inode pressure threshold is hit | imagefs.inodesFree eviction threshold in kubelet config |
| Git | Loose objects are individual files (one inode each). Pack files consolidate them | git gc reduces inode usage dramatically |
| Nginx | Each cached response is a separate file with its own inode | Proxy cache with millions of items can exhaust inodes |
| Postfix/Dovecot | Maildir format stores each email as a separate file | Mail servers with millions of messages are the classic inode exhaustion case |
Stack layer mapping (cannot create file despite free space):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app creating many tiny files instead of fewer large ones? | Application logic review |
| Filesystem | Are inodes exhausted? Which directory has the most files? | df -i, find with file count |
| Kernel | Are deleted files still held open by processes? | lsof +L1 |
| Storage | Which filesystem type? ext4 has fixed inodes, XFS/btrfs are dynamic | df -T |
| Hardware | Is the partition too small for the expected file count? | Reformat with more inodes (mke2fs -i bytes-per-inode) |
Design Rationale
Separating file identity (inode) from file naming (directory entry) reflects a simple insight: a file's metadata and data are intrinsic, but its name is just a relationship imposed by the directory. That separation makes hard links work -- multiple names, one inode -- and lets a file survive after its last name is deleted, so long as a process still holds it open. Fixing the inode count at mkfs time (ext4) was a space-efficiency bet: preallocated inode tables at known disk offsets give O(1) lookups without an index, but the inode-to-block ratio has to be guessed upfront. XFS and Btrfs allocate inodes dynamically, trading that lookup simplicity for flexibility -- which is why inode exhaustion is mostly an ext4 problem.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| ENOSPC but df shows free space | Inode exhaustion, not disk space | df -i to check inode usage |
| File deleted but disk space not freed | Another process has the file open (link count 0 but fd still open) | lsof +L1 to find deleted-but-open files |
| Hard link count unexpected | Multiple directory entries pointing to the same inode | stat file, check Links field |
| mv is instant but cp is slow | Within one filesystem, mv just rewrites a directory entry (same inode). cp creates a new inode and copies the data | Expected behavior |
| Container evicted by kubelet with no OOM | Inode pressure threshold exceeded on node | Check kubelet eviction logs for inode pressure |
| Deleting millions of files is slow | Each unlink is a separate inode + directory entry operation | Batch delete, or truncate + unlink |
When to Use / Avoid
Relevant when:
- Debugging "No space left on device" with plenty of free disk space (inode exhaustion)
- Understanding hard links vs symlinks (hard links share an inode, symlinks are separate inodes)
- Understanding why deleting a file does not free space if another process has it open (link count > 0 or open fd)
- Monitoring filesystem health on mail servers, build systems, or /tmp directories with many tiny files
Watch out for:
- Filesystems with fixed inode count (ext4). Cannot increase without reformatting
- XFS and btrfs allocate inodes dynamically, so inode exhaustion is rare
- A file with 0 links is not deleted until all open file descriptors are closed (unlink but fd still open)
Try It Yourself
    # Show full inode metadata: inode number, link count, permissions, size, all three timestamps
    stat /etc/passwd

    # Display inode usage for the root filesystem: IUsed, IFree, IUse%
    df -i /

    # Show just the inode number alongside the filename
    ls -i /etc/passwd

    # Find all filenames (hard links) pointing to inode 12345 on the same filesystem
    find / -xdev -inum 12345

    # Show filesystem-level info: block size, total/free inodes, filesystem type
    stat -f /etc/passwd

    # Show file birth time (creation time) on supported filesystems.
    # Recent GNU stat obtains it via statx() and prints "-" when it is unavailable
    stat -c '%w' /etc/passwd
Debug Checklist
1. Check inode usage: df -i (shows used/available inodes per filesystem)
2. Find directory with most files: find / -xdev -type d -exec sh -c 'echo $(ls -A "$1" | wc -l) "$1"' _ {} \; | sort -rn | head
3. Check inode of a file: stat <file> (shows Inode: number)
4. Find all hard links to same inode: find / -xdev -inum <inode_number> (run it on the filesystem that holds the file, since inode numbers repeat across filesystems)
5. Check if deleted files are still held open: lsof +L1
6. Check filesystem type (fixed vs dynamic inodes): df -T
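The "directory with most files" step can also be scripted without spawning a shell per directory. This is a sketch with a hypothetical function name; it stays on one filesystem, like find -xdev, by comparing st_dev.

```python
import os


def _device_of(path):
    try:
        return os.lstat(path).st_dev
    except OSError:
        return None                     # vanished or unreadable; treat as foreign


def busiest_directories(root, top=10):
    """Return up to `top` (entry_count, directory) pairs, largest first."""
    root_dev = _device_of(root)
    counts = []
    for dirpath, dirnames, filenames in os.walk(root):
        counts.append((len(dirnames) + len(filenames), dirpath))
        # Prune in place so os.walk does not descend into other filesystems.
        dirnames[:] = [
            d for d in dirnames
            if _device_of(os.path.join(dirpath, d)) == root_dev
        ]
    counts.sort(reverse=True)
    return counts[:top]


if __name__ == "__main__":
    for count, directory in busiest_directories("/tmp"):
        print(count, directory)
```

Each entry counted here corresponds to one name, so a directory at the top of the list is usually where the inodes went.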
Key Takeaways
- Inode numbers are only unique within a single filesystem, so two files on different mounts can share the same number. This is exactly why hard links can't cross filesystem boundaries: there's no cross-device inode lookup
- Run out of inodes and you can't create files, period, even with terabytes free. The inode count is fixed at mkfs time (unless you're on XFS/Btrfs, which allocate dynamically). Check with df -i before it's too late
- statx() is the modern stat(): callers pass a mask naming the fields they need, AT_STATX_DONT_SYNC can skip NFS round trips, birth time (btime) is available, and the interface is extensible without new syscalls
- Delete a file while a process has it open and the data survives: the inode's link count drops to 0 but the file persists until the last fd closes. This is how atomic file replacement and safe log rotation work
- ext4 replaced indirect block pointers with an extent tree: one (start, length) pair describes a contiguous run of blocks, collapsing massive metadata overhead for large files compared to ext3's triple-indirect scheme
Common Pitfalls
- Seeing "No space left on device" and only checking df -h. The real culprit might be inode exhaustion. Always check df -i too. They're different failure modes with the same error message
- Using stat() on a symlink and getting the target's metadata. stat() follows symlinks by default. Use lstat() to inspect the symlink itself, or you'll think the symlink is a regular file
- Expecting birth time (file creation time) to be available everywhere. ext4 stores it, but stat() doesn't expose it. You need statx() with STATX_BTIME, and even then not all filesystems record it
- Relying on inode numbers for file identity across reboots on tmpfs/procfs. Pseudo-filesystems allocate inodes dynamically and numbers are not stable between reboots
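The stat() versus lstat() pitfall is quick to demonstrate. A minimal sketch with hypothetical file names, assuming a filesystem that supports symlinks:

```python
import os
import stat
import tempfile


def stat_vs_lstat():
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target.txt")
        link = os.path.join(d, "link")
        with open(target, "w") as f:
            f.write("x")
        os.symlink(target, link)
        followed = os.stat(link)     # follows the symlink to the target's inode
        itself = os.lstat(link)      # describes the symlink's own inode
        return {
            "stat_sees_regular_file": stat.S_ISREG(followed.st_mode),
            "lstat_sees_symlink": stat.S_ISLNK(itself.st_mode),
            "different_inodes": followed.st_ino != itself.st_ino,
        }


if __name__ == "__main__":
    print(stat_vs_lstat())
```

The symlink has its own inode, distinct from the target's, which is also why a symlink consumes an inode while a hard link does not.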
Reference
In One Line
Filenames are cheap labels; the inode is the real identity -- and when inodes run out, the disk looks empty while refusing to create anything.