ext4 & XFS On-Disk Internals
Mental Model
A vinyl record pressing plant. The master disc (superblock) holds the track listing -- how many songs, total runtime, where each track starts. The stamper splits the vinyl into concentric grooves (block groups in ext4, allocation groups in XFS). ext4 is a plant with one stamper -- every groove gets pressed in sequence, and the track listing is carved into the master at the start and never changes. If a groove is damaged, the plant shuts down to inspect the entire disc (offline fsck). XFS runs multiple stampers in parallel, each pressing its own section of the record independently. A bad groove in one section does not stop the other stampers. The pressing log (journal) records what the stamper intended before the needle touches wax. After a power cut, the plant reads the log and finishes only the incomplete grooves instead of re-inspecting every groove on the disc.
The Problem
A production database loses data after a power failure despite running on a journaling filesystem. The ext4 volume was mounted with data=writeback for performance, and the application assumed atomic 8KB writes. The journal replayed metadata correctly -- directory entries, inode timestamps, allocation trees all intact -- but the actual file contents contained a mix of old and new data. Half-written pages, torn across 4KB block boundaries. Meanwhile, a Kafka cluster on ext4 hits 100% iowait at 500 partitions because extent tree lookups serialize all metadata access through a single block group. And a Kubernetes node running Docker on XFS formatted without ftype=1 silently falls back to a degraded storage driver, burning 3x the expected CPU on layer merges. Three different failures, three different symptoms, all caused by the same gap: not understanding what the filesystem actually guarantees about data placement, crash recovery, and metadata structure.
Architecture
How does a file actually survive a power failure?
The answer depends entirely on what the filesystem promised to write, in what order, and whether a journal recorded the intent before the data moved. Most engineers treat ext4 and XFS as interchangeable black boxes -- mount, write, read, done. That assumption holds until a crash reveals what the filesystem was actually doing behind the scenes.
ext4: On-Disk Layout
ext4 organizes a block device into a linear sequence of block groups. With the default 4KB block size, each block group spans 128MB (32,768 blocks). A 1TB filesystem has roughly 8,192 block groups.
The first 1024 bytes of the device are reserved (boot sector). The superblock starts at byte 1024 and contains the filesystem's identity: magic number 0xEF53, total block and inode counts, block size, feature flags, UUID, and journal parameters. Backup superblocks exist at block group boundaries following the sparse_super pattern (groups 0, 1, 3, 5, 7, and powers of 3, 5, 7).
Each block group contains, in order:
- Block bitmap -- one bit per block in the group. A 128MB group with 4KB blocks has 32,768 bits = 4KB bitmap.
- Inode bitmap -- one bit per inode slot. Tracks which inodes in this group are allocated.
- Inode table -- contiguous array of 256-byte inode structures. The default allocation ratio of one inode per 16KB of disk means a 128MB group holds 8,192 inodes, consuming 2MB of the group.
- Data blocks -- the remaining blocks hold actual file content.
The group descriptor table near the start of the filesystem maps every group's bitmap and inode table locations. Together with the superblock, it forms the root of all metadata lookups.
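To see this layout on a real volume, dumpe2fs prints the superblock followed by a descriptor for every block group; a read-only sketch (the device path is a placeholder):

```bash
# Superblock summary: block size, block/inode counts, feature flags (read-only, safe).
dumpe2fs -h /dev/sda1 2>/dev/null | head -40

# Layout of block group 0: bitmap locations, inode table range, free block/inode counts.
dumpe2fs /dev/sda1 2>/dev/null | grep -A 6 '^Group 0:'
```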
Extent Trees
Before ext4, the ext2/ext3 inode stored block pointers: 12 direct pointers, one indirect, one double-indirect, one triple-indirect. A 1GB file needed 262,144 block pointers scattered across multiple indirect blocks -- each requiring a separate disk read during random access.
ext4 replaced this with extent trees. An extent is a triple: (logical block number, physical block number, length). A perfectly contiguous 1GB file needs exactly one extent. The inode itself holds 60 bytes of extent data -- enough for 4 extents directly. When a file is fragmented beyond 4 extents, the tree grows deeper with index nodes pointing to leaf blocks full of extents. The maximum tree depth is 5 levels, supporting files up to 16TB.
The practical impact: sequential reads on a contiguous file require one extent lookup instead of thousands of indirect block reads. Random reads hit at most 2-3 tree levels. Even fragmented files with thousands of extents outperform triple-indirect mapping because each extent covers a range rather than a single block.
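filefrag exposes the extent map directly, which makes this difference easy to observe on any file (the path is just an example):

```bash
# Each row is one extent: logical offset, physical offset, length in blocks.
# A contiguous file shows a handful of rows; a badly fragmented one shows thousands.
filefrag -v /var/log/syslog

# Summary form: just the extent count.
filefrag /var/log/syslog
```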
Journaling via JBD2
ext4 uses JBD2 (Journaling Block Device 2) for crash recovery. The journal is typically stored in a hidden inode (inode 8) as a circular buffer, defaulting to 128MB on large filesystems.
Three journaling modes exist, and the choice determines what survives a crash:
journal mode (data=journal): Both metadata and file data are written to the journal first, then checkpointed to their final locations. This provides the strongest consistency -- a crash at any point either preserves the old state or completes the new state. The cost is roughly 50% write throughput penalty because every byte of data is written twice (journal + final location).
ordered mode (data=ordered, the default): Only metadata is journaled, but the kernel forces all data blocks belonging to a transaction to be flushed to their final locations before the metadata journal entry commits. This guarantees that after recovery, metadata never points at stale data. The ordering constraint costs roughly 5-10% throughput versus writeback.
writeback mode (data=writeback): Only metadata is journaled. Data blocks may be written before or after the metadata commit, in any order. After a crash, it is possible for a newly extended file to contain blocks from a previously deleted file (stale data exposure) or for a committed metadata entry to reference blocks with old content (torn writes). This mode is fastest but provides the weakest guarantees.
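The mode is chosen at mount time, so it is worth confirming what a running system actually uses; a sketch with placeholder device and mount point (the kernel refuses to change data= on a plain remount, so set it at mount time or in fstab):

```bash
# Show the data= option for the root filesystem; if absent, ordered is in effect.
findmnt -no OPTIONS / | tr ',' '\n' | grep '^data=' || echo 'data= not set -> ordered (default)'

# Mount a test filesystem with an explicit journaling mode.
mount -o data=journal /dev/sdb1 /mnt/dbtest

# Equivalent fstab entry:
# /dev/sdb1  /mnt/dbtest  ext4  defaults,data=journal  0 2
```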
What Survives a Crash
In ordered mode, data blocks flush to their final locations before the journal records the metadata commit. Power failure before the data flush means the metadata transaction never committed -- recovery ignores it and the file reverts to its previous state. Power failure after the journal commit means data was already on disk. Either way, the filesystem is consistent.
In writeback mode, metadata may commit before data blocks are written. The result after a crash: directory entries, inodes, and extent trees all claim the file is complete, but the actual data blocks contain whatever was previously at those physical locations. This is how databases experience silent corruption on writeback-mounted filesystems.
XFS: On-Disk Layout
XFS approaches the problem differently. Instead of a linear sequence of block groups, XFS splits the device into allocation groups (AGs). mkfs.xfs defaults to four AGs on a single disk (more on striped or multi-terabyte devices), each up to 1TB in size, so a 1TB volume typically gets four 256GB AGs -- each functioning as a semi-independent mini-filesystem.
Each AG has its own header containing:
- AG Free Space B+ Tree (by block number): indexes free extents sorted by starting block number. Finding a free extent near a target block is O(log n).
- AG Free Space B+ Tree (by size): indexes the same free extents sorted by size. Finding the smallest extent that fits a request is also O(log n).
- AG Inode B+ Tree: tracks allocated inode chunks within the AG. Unlike ext4, XFS does not pre-allocate a fixed inode table -- inodes are allocated from free space on demand, in 64-inode chunks.
The critical difference is parallelism. When multiple threads allocate files simultaneously, XFS assigns each allocation to a different AG based on the parent directory's AG affinity. Two threads in different AGs never contend on the same metadata lock. This is the primary reason XFS outperforms ext4 on parallel workloads with 16+ concurrent writers.
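xfs_info on any mounted XFS filesystem shows the AG geometry that governs this parallelism; a sketch (mount point and device are placeholders):

```bash
# agcount= and agsize= in the meta-data line show how many independent AGs exist.
xfs_info /data

# At format time the AG count can be raised for highly parallel workloads
# (destroys existing data; tune only with evidence of AG contention):
mkfs.xfs -f -d agcount=32 /dev/sdb1
```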
Delayed Allocation
When an application calls write(), XFS does not immediately assign physical blocks. The data goes to the page cache with extents marked as "delayed." Physical allocation happens at writeback time, when XFS knows the full extent of the write and can allocate the largest possible contiguous run. A 100MB file written in 1MB chunks gets one 100MB extent instead of 100 separate 1MB extents. At writeback time, XFS can also pick the AG with the best-fitting free extent. The tradeoff: a crash between write() and writeback loses data that the application believed was written. Applications needing durability must call fsync() explicitly.
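The effect is visible with xfs_bmap: right after the write the extents are still delayed, and after writeback the allocator has committed to one large run. A minimal sketch assuming an XFS mount at /data:

```bash
# Write 100MB without fsync -- the data sits in the page cache as delayed extents.
dd if=/dev/zero of=/data/demo bs=1M count=100 status=none

# Before writeback the map may still show delayed/unallocated space; after sync it
# typically collapses to a single contiguous extent.
xfs_bmap -v /data/demo
sync
xfs_bmap -v /data/demo
```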
Reflink and Copy-on-Write
XFS supports reflink (shared extents) since Linux 4.9. cp --reflink=always source dest creates a file sharing the same physical blocks as the source -- only the extent map is duplicated. Writes to either file trigger COW for just the modified extents. This enables instant snapshots, efficient file cloning, and container layer deduplication at the cost of slightly more metadata complexity and occasional write amplification.
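A quick demonstration, assuming an XFS filesystem formatted with reflink support (the default in recent xfsprogs) and mounted at /data:

```bash
# Create a 1GB file, then clone it instantly by sharing its extents.
dd if=/dev/zero of=/data/base.img bs=1M count=1024 status=none
cp --reflink=always /data/base.img /data/clone.img   # returns in milliseconds

# du reports each file's full size (it cannot see sharing); df shows that actual
# space consumption barely moves until one copy is modified.
du -sh /data/base.img /data/clone.img
df -h /data
```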
XFS Log (Journal) Structure
XFS maintains a circular write-ahead log for metadata transactions only -- no data journaling mode exists. This makes XFS equivalent to ext4 data=writeback from a data ordering perspective. XFS compensates with log space reservation: before a transaction begins, it reserves enough log space to guarantee completion. If the log is full, new transactions block until checkpointing reclaims space. Recovery replays the log in a few seconds regardless of filesystem size -- the contrast is with a full offline fsck, which must scan the entire filesystem.
fsck vs xfs_repair
ext4 fsck (e2fsck) operates offline -- the filesystem must be unmounted. It walks every block group, inode, extent tree, and directory entry. On a 10TB filesystem with 100 million files, this takes 30-90 minutes.
xfs_repair traditionally required an unmounted filesystem, but XFS has been gaining online repair capabilities since Linux 5.x. xfs_scrub performs online checking, and recent kernels (6.x+) support online repair of many metadata structures.
For both filesystems, the journal handles 99% of unclean shutdowns without fsck. Journal replay reconstructs incomplete transactions in seconds. Full fsck is needed only when the journal itself is corrupted or latent corruption has accumulated.
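Both tools have read-only modes, which are the safe first step before touching anything; devices below are placeholders and the filesystem must be unmounted for the offline checks:

```bash
# ext4: forced read-only check, reports problems without fixing them.
e2fsck -fn /dev/sdb1

# XFS: read-only repair pass (exits non-zero if a real repair would be needed).
xfs_repair -n /dev/sdb1

# Recent kernels: scrub a *mounted* XFS filesystem online, check-only.
xfs_scrub -n /data
```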
Docker and overlay2
Docker overlay2 requires the underlying filesystem to store directory entry types (d_type). Without d_type, overlay2 must stat() every directory entry during layer merges -- prohibitively slow.
ext4 requires the filetype feature flag (default since ext3). XFS requires ftype=1 at format time (default since xfsprogs 4.2 / 2015). Neither can be enabled retroactively. Docker checks at startup and refuses overlay2 if d_type is missing.
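Verifying d_type support before pointing Docker at a volume takes three commands; paths and devices below are placeholders:

```bash
# XFS: ftype=1 means directory entries carry type information.
xfs_info /var/lib/docker | grep ftype

# ext4: the filetype feature provides the same guarantee.
tune2fs -l /dev/sdb1 | grep -o filetype && echo 'filetype enabled'

# What Docker concluded at startup ("Supports d_type: true" is the goal).
docker info 2>/dev/null | grep -iE 'storage driver|d_type'
```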
Kubernetes PersistentVolumes
Kubernetes CSI drivers typically format PersistentVolumes with ext4 by default. Use ext4 for general-purpose workloads (databases, application state) where data=ordered crash consistency matters and the option to shrink the volume is valuable. Use XFS for large-file workloads (Kafka, Elasticsearch, ML datasets) where allocation groups eliminate metadata contention and delayed allocation maximizes sequential throughput. The StorageClass can specify fsType: xfs and pass mkfs parameters. For XFS, ensure ftype=1. For ext4 on high-inode workloads, pass -i 4096 to increase inode count.
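A hedged sketch of a StorageClass requesting XFS from a CSI provisioner -- the provisioner name and parameter keys vary by driver (ebs.csi.aws.com and csi.storage.k8s.io/fstype are examples), so check your driver's documentation:

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-xfs
provisioner: ebs.csi.aws.com          # example CSI driver
parameters:
  csi.storage.k8s.io/fstype: xfs      # ask the driver to mkfs.xfs the volume
allowVolumeExpansion: true            # xfs_growfs works online; shrink does not exist
EOF
```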
Kafka and XFS
Kafka brokers store 1GB log segments written and read sequentially -- the ideal XFS workload. Delayed allocation produces contiguous 1GB extents, allocation groups allow 500 partitions to allocate concurrently without metadata contention, and the noatime mount option eliminates inode writes on every read (reducing metadata writes by ~90% on a 500-partition broker).
On ext4, the same workload suffers from deeper extent trees, block group contention, and higher fragmentation. ext4 has a simpler delayed allocator that is less effective for large sequential writes.
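The filesystem-level setup for a broker data volume is short; device and path are placeholders:

```bash
# Dedicated XFS volume for log segments, mounted without access-time updates.
mkfs.xfs -f /dev/nvme1n1
mount -o noatime /dev/nvme1n1 /var/kafka-logs

# Confirm what is actually in effect.
findmnt -no FSTYPE,OPTIONS /var/kafka-logs

# fstab entry:
# /dev/nvme1n1  /var/kafka-logs  xfs  noatime  0 0
```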
PostgreSQL and Crash Safety
PostgreSQL WAL (Write-Ahead Log) assumes that either:
- The filesystem provides ordered writes (data before metadata), OR
- PostgreSQL writes full page images to WAL before modifying data files (full_page_writes=on).
On ext4 with data=ordered, the filesystem guarantees data block ordering. PostgreSQL still benefits from full_page_writes=on as defense-in-depth -- an 8KB PostgreSQL page spans two 4KB filesystem blocks, and ordered mode does not guarantee atomicity within a single write() call.
On XFS (metadata-only journal, equivalent to writeback), full_page_writes=on is mandatory. Without it, a crash produces torn pages that WAL replay cannot fix. The performance cost is approximately 10-25% more WAL volume -- almost always worth the protection.
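Checking both halves of the contract -- the mount's journaling behavior and PostgreSQL's page protection -- is a two-command audit; a sketch assuming local psql access:

```bash
# Filesystem side: type and mount options of the volume holding the data directory.
findmnt -no FSTYPE,OPTIONS -T "$(psql -Atc 'SHOW data_directory;')"

# Database side: full_page_writes should be on unless the storage stack
# genuinely guarantees atomic 8KB writes.
psql -Atc 'SHOW full_page_writes;'
```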
Inode Exhaustion
ext4 fixes its inode count at mkfs time. The default ratio is one inode per 16KB of disk space. A 100GB volume gets approximately 6.5 million inodes. Once consumed, no new files can be created even with 90% free disk space. The kernel returns ENOSPC while df -h shows plenty of free blocks.
XFS allocates inodes dynamically in 64-inode chunks from free space. Inodes are never exhausted while blocks remain. Monitor both df -h (blocks) and df -i (inodes). On ext4, if inode usage approaches 80% with low block usage, reformat with mkfs.ext4 -i 4096 (one inode per 4KB) or switch to XFS.
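The check and the fix in command form (device and path are placeholders; reformatting destroys data, so migrate first):

```bash
# Compare block usage vs inode usage -- ENOSPC with free blocks means IUse% is the culprit.
df -h /srv/metrics
df -i /srv/metrics

# Quadruple the inode density: one inode per 4KB instead of per 16KB.
mkfs.ext4 -i 4096 /dev/sdb1

# Or switch to XFS and stop counting inodes at all.
mkfs.xfs -f /dev/sdb1
```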
Common Questions
Can ext4 be converted to XFS in place?
No. The on-disk formats are completely different. Migration requires creating a new XFS filesystem and copying data via rsync or tar. Plan for downtime or use a secondary volume.
Why does XFS not support filesystem shrink?
XFS inodes encode their allocation group number in the inode number itself. Shrinking would require renumbering AGs and rewriting every inode reference -- prohibitively expensive. ext4 inodes are numbered within block groups, and resize2fs can relocate blocks from the tail to free groups near the front.
What happens if the ext4 superblock is corrupted?
The filesystem will not mount. Recovery involves using a backup superblock: e2fsck -b 32768 /dev/sdX (the backup at block group 1). The dumpe2fs tool lists all backup superblock locations. XFS keeps a secondary superblock in each allocation group header, and xfs_repair automatically tries secondary copies if the primary is damaged.
How does ext4 handle files larger than the 60-byte inline extent limit?
The first 4 extents fit directly in the inode's 60-byte data area. When a 5th extent is needed, ext4 allocates a leaf block and converts the inode to hold an index node pointing to this leaf. The tree can grow to 5 levels deep, supporting approximately 340 million extents per file.
Does XFS delayed allocation lose data on crash?
Delayed allocation means data sitting in the page cache without physical block assignment is lost on crash -- the same as any filesystem would lose unflushed dirty pages. XFS delays allocation longer than ext4 to improve contiguity. Applications that need durability must call fsync() after critical writes.
How Technologies Use This
A container host whose ext4 filesystem lacks directory entry type support silently breaks overlay2. Docker overlay2 requires the kernel to store file type information inside directory entries (d_type). On ext4, this requires the filetype feature flag -- without it, overlay2 falls back to the vfs driver or refuses to start entirely. The symptom is mysterious "invalid argument" errors during docker pull.
The root cause is historical. Early ext4 formatted without filetype support stored only inode numbers in directory entries. The kernel had to read the inode itself to determine whether an entry was a file, directory, or symlink. overlay2 performs millions of these lookups during layer merges and cannot afford the extra inode reads.
On XFS, the equivalent requirement is ftype=1, which has been the default since xfsprogs 4.2 (2015). Docker probes d_type support on the backing directory at startup and refuses to use overlay2 if ftype=0. Reformatting with mkfs.xfs -n ftype=1 fixes the issue, but requires data migration. Production clusters running RHEL 7 with older XFS defaults hit this constantly during Docker adoption -- the filesystem was technically functional, but the missing feature flag made container layer resolution 10-50x slower on the fallback path before it was caught.
A PersistentVolume backed by ext4 runs out of inodes with 40% free disk space. The 200GB volume holds 12 million small log files from a metrics pipeline, and every inode allocated at mkfs time is consumed. No more files can be created despite ample block space. The pod enters CrashLoopBackOff because the application cannot write its PID file.
ext4 fixes its inode count at filesystem creation. The default ratio is one inode per 16KB of disk -- a 200GB volume gets roughly 12.8 million inodes. Workloads that create many small files exhaust inodes long before filling blocks. The only fix is reformatting with mkfs.ext4 -i 4096 (one inode per 4KB) or switching to XFS.
XFS allocates inodes dynamically from free space. It never runs out of inodes while blocks remain available. For Kubernetes workloads with unpredictable file counts -- log aggregators, cache directories, spool queues -- XFS eliminates an entire class of outage. The tradeoff is that XFS cannot reclaim inode space from deleted files as aggressively, and XFS volumes cannot be shrunk (only grown). Production clusters typically use ext4 for general-purpose volumes (databases, application state) and XFS for high-file-count workloads (logging, object storage, ML training datasets).
A PostgreSQL instance experiences data corruption after a power failure despite having WAL enabled. The database had full_page_writes=off for performance, and the underlying ext4 filesystem was mounted with data=writeback. The WAL contained a partial 8KB page write -- the first 4KB block was the new version, the second 4KB block was stale. PostgreSQL replayed this torn page during recovery and silently corrupted an index.
The problem is that ext4 in writeback mode journals only metadata, not data block ordering. A write to an 8KB PostgreSQL page crosses two 4KB filesystem blocks. If power fails between the two block writes, one block is new and one is old. The WAL replay applies the logical change on top of this half-written page, producing garbage.
The solution is ext4 with data=ordered (the default), which guarantees data blocks hit disk before the metadata journal commits. Combined with full_page_writes=on, PostgreSQL writes a complete copy of each page to WAL before modifying it, so recovery can restore the full page regardless of what the filesystem did. This costs roughly 10-25% more WAL volume but eliminates torn page corruption entirely. On XFS, the same principle applies -- XFS journals metadata only (equivalent to data=writeback), so PostgreSQL must always run with full_page_writes=on on XFS.
A Kafka broker serving 800MB/s of log segment reads shows 40% CPU in iowait. The broker stores 2TB of log segments across 500 partitions on ext4. Each log segment is 1GB. Sequential read throughput is 200MB/s instead of the expected 600MB/s from the underlying NVMe array.
The root cause traces back to ext4 extent tree depth and metadata contention. Even though ext4 replaced the old triple-indirect block mapping with extent trees, a 1GB file with 4KB blocks needs at least eight extents (a single extent covers at most 128MB), so the map spills out of the inode into external tree blocks -- and fragmentation deepens the tree further. Each level is a separate metadata read before the kernel can locate the data blocks. For Kafka's sequential access pattern, readahead masks most of this latency -- but under high concurrency with 500 partitions, the extent tree lookups compete for I/O bandwidth and push the hot metadata out of the page cache working set.
XFS handles this better for two reasons. First, allocation groups allow parallel metadata operations -- 500 concurrent reads can proceed without contending on a single metadata tree. Second, XFS delayed allocation produces longer contiguous extents, reducing tree depth. Combined with the noatime mount option (eliminating inode writes on every read), XFS typically delivers 2-3x higher throughput for Kafka log segment workloads. Production Kafka deployments almost universally run XFS with noatime,nobarrier (nobarrier only when the storage controller has battery-backed cache).
Same Concept Across Tech
| Concept | Docker | Kubernetes | PostgreSQL | Kafka |
|---|---|---|---|---|
| Filesystem requirement | overlay2 needs d_type (ext4 filetype, XFS ftype=1) | PV provisioner sets mkfs options | data=ordered for crash safety | XFS + noatime for log segments |
| Inode concern | Layer explosion can exhaust ext4 inodes | Small-file PVs risk inode exhaustion | Moderate inode usage (large files) | Minimal (large log segments) |
| Journal mode | Default ordered is fine | Default ordered is fine | ordered + full_page_writes=on | XFS metadata-only journal is sufficient |
| Fragmentation | Reflink COW on XFS reduces layer duplication | Online defrag via e4defrag or xfs_fsr | Periodic VACUUM keeps files contiguous | Log segment rotation prevents fragmentation |
| Resize | Rarely needed (ephemeral storage) | PV expansion via resize2fs (ext4) or xfs_growfs (XFS) | Rare (tablespace sizing) | Add disks, rebalance partitions |
Stack Layer Mapping
| Layer | Filesystem Mechanism |
|---|---|
| Hardware | Block device (HDD/SSD/NVMe) provides raw block addressing |
| Block layer | I/O scheduler + block device driver handle request ordering |
| Filesystem | ext4/XFS translate file operations to block reads/writes |
| VFS | Virtual File System provides uniform API above ext4/XFS |
| Page cache | Buffers reads/writes in memory, triggers writeback |
| Application | open/read/write/fsync syscalls, unaware of on-disk layout |
Design Rationale
ext4 evolved from ext3 by adding extent trees, delayed allocation, and multiblock allocation while preserving backward compatibility with ext2/ext3 on-disk structures. The priority was stability and broad compatibility over raw performance. XFS was designed from scratch at SGI for large parallel file servers, treating the disk as an array of independent allocation groups. The B+ tree metadata and delayed allocation were not afterthoughts -- they were the core design. The tradeoff is complexity: XFS has no filesystem shrink support, its repair tools are more specialized, and its on-disk format is less tolerant of partial feature adoption. ext4 trades peak performance for simplicity and resilience.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| "No space left on device" with plenty of free blocks | Inode exhaustion on ext4 | df -i to compare inode usage vs block usage |
| Docker overlay2 refuses to start | Missing d_type support (ext4 without filetype, or XFS with ftype=0) | xfs_info for ftype or tune2fs -l for the filetype feature |
| Database corruption after power failure | data=writeback on ext4 or full_page_writes=off on any FS | grep ext4 /proc/mounts for the data= option; SHOW full_page_writes in psql |
| High iowait with sequential reads | Extent fragmentation or deep extent trees | filefrag -v <file> to check extent count |
| Filesystem mount fails with "bad magic number" | Corrupted superblock | e2fsck -n /dev/sdX or xfs_repair -n /dev/sdX to check without modifying |
| Slow file creation under concurrency | ext4 block group contention (single inode bitmap lock) | perf top for ext4_* or switch to XFS for parallel allocation |
| Journal recovery loops on boot | Corrupted journal or mismatched journal device | dumpe2fs -h /dev/sdX to inspect the journal superblock fields |
| XFS mount fails after unexpected shutdown | Dirty log that xfs_repair cannot replay | xfs_repair -L /dev/sdX (force log zero -- last resort, may lose data) |
When to Use / Avoid
Use ext4 when:
- General-purpose server workloads with moderate file counts
- Databases that rely on data=ordered crash consistency (PostgreSQL, MySQL)
- Smaller volumes (under 16TB) where offline fsck time is acceptable
- Workloads needing filesystem shrink capability (ext4 supports resize2fs shrink, XFS does not)
- Boot partitions and root filesystems (broader bootloader support)
Use XFS when:
- Large files and high-throughput sequential I/O (Kafka, video processing, scientific data)
- High parallelism with many threads creating/writing files simultaneously
- Large volumes (16TB+) where offline fsck is impractical -- XFS online repair is faster
- Workloads with unpredictable inode counts (dynamic inode allocation)
- Container hosts where reflink COW provides efficient layer deduplication
Avoid both when:
- The workload involves millions of concurrent small random writes with strict latency requirements (a purpose-built key-value store or log-structured engine is a better fit; COW filesystems such as btrfs add write amplification for this pattern)
- ZFS-level checksumming and self-healing are required (neither ext4 nor XFS checksum data blocks by default)
Try It Yourself
# Show filesystem type and superblock info for ext4
tune2fs -l /dev/sda1 2>/dev/null | head -30 || echo 'tune2fs not available'

# Show XFS geometry and feature flags
xfs_info / 2>/dev/null || echo 'xfs_info not available (not an XFS mount)'

# Check ext4 inode usage vs block usage
df -hi / && echo "---" && df -h /

# Dump extent map for a file (shows fragmentation)
filefrag -v /var/log/syslog 2>/dev/null || echo 'filefrag not available'

# Read ext4 superblock with debugfs
debugfs -R 'show_super_stats' /dev/sda1 2>/dev/null | head -20 || echo 'debugfs not available'

# Check ext4 journal info
dumpe2fs -h /dev/sda1 2>/dev/null | grep -i journal || echo 'dumpe2fs not available'

# XFS free space fragmentation histogram
xfs_spaceman -c 'freesp -s' / 2>/dev/null || echo 'xfs_spaceman not available'

# Check XFS allocation group headers
xfs_db -r -c 'agf 0' -c 'p' /dev/sda1 2>/dev/null || echo 'xfs_db not available'

# Verify ext4 d_type support (needed for Docker overlay2)
tune2fs -l /dev/sda1 2>/dev/null | grep -i filetype || echo 'tune2fs not available'

# Show ext4 mount options including journal mode
mount | grep 'type ext4' || echo 'no ext4 mounts found'

Debug Checklist
1. df -hT -- show filesystem type, size, and free space for all mounts
2. df -i -- show inode usage (ext4 inode exhaustion check)
3. tune2fs -l /dev/sdX -- dump ext4 superblock: block size, inode count, feature flags, journal info
4. xfs_info /mount/point -- show XFS geometry: AG count, block size, log size, naming version
5. dumpe2fs -h /dev/sdX 2>/dev/null | grep -i journal -- ext4 journal type and size
6. debugfs -R 'stat <8>' /dev/sdX -- read ext4 journal inode details
7. xfs_db -r -c 'sb 0' -c 'p' /dev/sdX -- dump raw XFS superblock fields
8. filefrag -v /path/to/file -- show extent map (fragmentation check)
9. cat /sys/fs/ext4/sdX/session_write_kbytes -- ext4 write throughput counter since mount
10. xfs_spaceman -c 'freesp' /mount/point -- XFS free space histogram
Key Takeaways
- ✓ext4 block groups divide the disk into fixed-size chunks (typically 128MB with 4KB blocks). Each group has its own block bitmap (tracks free blocks), inode bitmap (tracks free inodes), inode table (256-byte inode entries), and data blocks. The group descriptor table at the start of the filesystem maps all groups. This structure means a 1TB ext4 filesystem has roughly 8,192 block groups.
- ✓XFS delayed allocation does not assign physical blocks when write() is called. Blocks stay in the page cache as "delayed" until writeback, when XFS knows the full extent of the write and can allocate the largest possible contiguous run. This produces fewer, larger extents and dramatically reduces fragmentation for sequential write workloads like log files and database WAL.
- ✓ext4 journaling mode determines crash survival. In ordered mode (default), the kernel flushes data blocks to their final locations, then commits the metadata journal entry. If power fails before the journal commit, the metadata change is abandoned and the data blocks are orphaned (harmless). If power fails after the journal commit, the data was already on disk. Writeback mode offers no such guarantee -- metadata may reference data blocks that contain stale content.
- ✓XFS reflink and copy-on-write (COW) allow instant file copies that share physical blocks. cp --reflink=always on XFS completes in milliseconds regardless of file size because it copies only the extent map, not the data. Writes to either copy trigger COW for just the modified extents. This enables fast file-level clones (XFS has no whole-volume snapshots) and can be used by container runtimes and backup tools for efficient deduplication.
- ✓The ext4 inode is 256 bytes by default (configurable at mkfs). The first 128 bytes match the classic ext2 layout -- mode, size, timestamps, and the 60-byte block-mapping area that now holds the extent tree root. The extra 128 bytes hold nanosecond timestamps and inline extended attributes. With the inline_data feature, very small files (roughly 60 bytes, more if xattr space is free) can store their content directly in the inode, avoiding any block allocation.
Common Pitfalls
- ✗Mistake: Running ext4 with data=writeback for database workloads because benchmarks show 20% better throughput. Reality: writeback mode does not order data writes before metadata commits. A crash can leave committed metadata pointing at blocks containing old or zero data. PostgreSQL, MySQL, and similar databases that assume write ordering will experience silent data corruption. Use data=ordered (the default) and let the database handle its own write ordering through WAL.
- ✗Mistake: Formatting XFS without ftype=1 and then running Docker. Reality: overlay2 requires directory entry type information to function correctly. XFS must be formatted with -n ftype=1 (default since xfsprogs 4.2 / 2015, but older RHEL 7 systems may not have it). Docker checks this at startup and either refuses to use overlay2 or falls back to a slower driver. The fix requires reformatting -- ftype cannot be enabled on an existing filesystem.
- ✗Mistake: Ignoring inode exhaustion on ext4 because disk space monitoring shows plenty of free blocks. Reality: ext4 fixes the inode count at mkfs time (default: 1 inode per 16KB of disk). A 100GB volume gets ~6.5 million inodes. Workloads creating millions of small files (log rotation, mail spools, container layers) exhaust inodes while showing 50%+ free disk space. Monitor both df -h (blocks) and df -i (inodes). XFS allocates inodes dynamically and does not have this problem.
- ✗Mistake: Using nobarrier mount option on ext4/XFS for performance without understanding the write cache implications. Reality: nobarrier tells the filesystem not to issue FLUSH/FUA commands to the disk. If the disk has a volatile write cache (no battery backup), a power failure can lose writes that the journal believed were committed. Only use nobarrier when the storage controller has battery-backed or capacitor-backed write cache (enterprise RAID controllers, most cloud block storage).
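Before considering nobarrier at all, check what kind of write cache the device actually reports; a sketch with placeholder device names (note that recent kernels have removed the nobarrier mount option from XFS entirely, so this mainly applies to ext4 and older kernels):

```bash
# "write back" means a volatile cache is present -- power loss can drop acknowledged writes.
cat /sys/block/sda/queue/write_cache

# For SATA/SAS drives, hdparm reports the on-drive cache setting directly.
hdparm -W /dev/sda
```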
Reference
In One Line
ext4 journals metadata in block groups with fixed inodes; XFS journals metadata across parallel allocation groups with dynamic inodes -- pick based on file size, parallelism, and crash recovery needs.