The Linux Boot Process from Power-On to Userspace
Mental Model
A relay race with five runners. The firmware runner lights the torch (CPU and RAM), passes it to the bootloader runner who picks the right lane (kernel image), passes it to the kernel runner who sets up the track (memory, drivers), passes it to the initramfs runner who unlocks the gate (mounts root), and finally passes it to the init runner who starts all the events (services). If any runner fumbles, the race stops at that stage. Boot debugging is figuring out which runner dropped the baton.
The Problem
A cloud VM takes 45 seconds to boot when the target is under 5 seconds. systemd-analyze shows 8 seconds in firmware, 6 seconds in the bootloader, 4 seconds in the kernel, and 27 seconds in userspace. The initramfs is 80 MB and includes drivers for hardware that does not exist in the virtual environment. Systemd starts 94 units, but only 12 are needed for the application. Every second of boot delay during an auto-scaling event means requests are dropping.
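For reference, the summary line from systemd-analyze for a boot like this would look roughly as follows (illustrative numbers matching the scenario above, not captured from a real machine):

```
Startup finished in 8.102s (firmware) + 6.034s (loader) + 4.211s (kernel) + 27.480s (userspace) = 45.827s
multi-user.target reached after 27.311s in userspace
```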
Architecture
A server reboots. The screen stays dark for a moment, then text scrolls, then a login prompt appears. What happened in those few seconds between power-on and a working system?
The answer is a precise chain of five stages, each one solving a problem that the previous stage could not handle on its own. Each handoff is deterministic. Each stage has a specific job and a specific failure mode.
What Actually Happens
Here is the sequence from the moment electricity hits the CPU:
Stage 1: Firmware (UEFI or BIOS). The CPU starts executing from a hardcoded address in flash memory. The firmware initializes the CPU itself (cache, TLB, branch predictor), trains the memory controller to communicate with the installed DIMMs, and enumerates the PCI bus to discover storage controllers, network cards, and GPUs. It runs the Power-On Self-Test (POST). On UEFI systems, it reads the EFI System Partition (a small FAT32 partition) to find the bootloader binary. On legacy BIOS systems, it reads the first 512 bytes (MBR) of the boot disk.
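On a UEFI machine the firmware side of this handoff is visible from a running system: the boot entries stored in NVRAM name the bootloader binaries on the EFI System Partition. A quick look (requires root on an EFI-booted system):

```sh
# List the UEFI boot entries in the order the firmware will try them;
# each entry points at a .efi binary on the EFI System Partition
efibootmgr -v

# Confirm the ESP is the small FAT32 (vfat) partition the firmware reads
lsblk -f | grep -i fat
```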
Stage 2: Bootloader (GRUB). GRUB's first stage is tiny -- just enough code to find and load the second stage from /boot/grub/. The second stage reads grub.cfg, which specifies the kernel image path, the initramfs path, and the kernel command line. GRUB loads both vmlinuz (the compressed kernel) and the initramfs archive into memory at specific addresses. It sets up a minimal environment (boot parameters, framebuffer info) and transfers execution to the kernel's entry point.
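A grub.cfg menuentry ties these pieces together. A minimal sketch, with placeholder kernel version and UUID (entries generated by grub-mkconfig are much longer):

```
menuentry 'Linux 6.1' {
    # kernel image plus command line: root device, read-only mount, quiet console
    linux  /vmlinuz-6.1.0 root=UUID=1234-abcd ro quiet
    # initramfs archive loaded into memory alongside the kernel
    initrd /initramfs-6.1.0.img
}
```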
Stage 3: Kernel. The kernel begins in assembly code that decompresses itself, sets up initial page tables for virtual memory, and jumps to start_kernel() in C. This function initializes the memory allocator, the scheduler, the interrupt descriptor table, and the virtual filesystem layer (VFS). It probes hardware using ACPI tables (or device tree on ARM). It unpacks the initramfs cpio archive into a tmpfs mounted at /. Then it executes /init from that temporary root filesystem.
Stage 4: initramfs. The /init script (or systemd running in initramfs mode) loads kernel modules needed to access the real root filesystem. On a server with an NVMe drive, it loads the nvme module. On a cloud VM, it loads virtio_blk. If the root filesystem is on an encrypted volume, it runs cryptsetup to unlock it. If it is on LVM, it runs lvm vgchange. Once the real root is accessible and mounted, it calls switch_root to replace the tmpfs root with the real filesystem and exec the real init process.
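Conceptually, /init is a short script. A simplified sketch for a virtio-based cloud VM (real generators like dracut or initramfs-tools produce something far more elaborate, and the device path here is only an example):

```sh
#!/bin/sh
# Minimal sketch of an initramfs /init for a virtio-backed VM
mount -t proc     proc  /proc
mount -t sysfs    sysfs /sys
mount -t devtmpfs dev   /dev

modprobe virtio_blk            # driver for the virtual disk
modprobe ext4                  # filesystem of the real root

mkdir -p /sysroot
mount -o ro /dev/vda1 /sysroot # example root device

# Replace the tmpfs root with the real root and hand off to the real init
exec switch_root /sysroot /sbin/init
```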
Stage 5: systemd (PID 1). The first persistent userspace process. systemd reads its unit files and builds a dependency graph of services, mounts, sockets, and targets. It mounts filesystems listed in /etc/fstab. It starts services in parallel where dependencies allow, using socket activation and D-Bus activation to defer starting services until they are actually needed. When the default target (multi-user.target or graphical.target) and all its dependencies are satisfied, the boot is complete.
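This is easy to inspect on a running system: the default target and the units it pulls in are exactly what systemd works toward during boot.

```sh
# Which target is boot aiming for?
systemctl get-default                              # e.g. multi-user.target

# What does that target pull in, directly and transitively?
systemctl list-dependencies multi-user.target | head -30
```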
Under the Hood
UEFI vs BIOS: the firmware gap. Legacy BIOS is a 16-bit real-mode interface from 1981. It can only address 1 MB of memory, cannot read GPT partition tables natively, and has no concept of drivers beyond INT 13h disk calls. UEFI replaced all of this with a 64-bit execution environment, its own driver model, a shell, and Secure Boot. UEFI firmware can read FAT32 filesystems directly, execute PE/COFF binaries from the EFI System Partition, and verify cryptographic signatures before running the bootloader. The tradeoff: UEFI firmware is far more complex, and its initialization (device enumeration, option ROM execution) often dominates boot time.
Secure Boot chain of trust. UEFI Secure Boot verifies that every stage is signed by a trusted key. The firmware contains Microsoft's and the distro's signing keys in its key database. It verifies GRUB's signature before executing it. GRUB, if built with Secure Boot support, verifies the kernel's signature. The kernel can verify module signatures at load time. A rootkit that modifies the bootloader or kernel on disk cannot pass these signature checks.
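Whether Secure Boot is actually being enforced on a given machine can be checked from userspace; either of these works on most EFI-booted distributions:

```sh
# Prints "SecureBoot enabled" or "SecureBoot disabled" (needs the mokutil package)
mokutil --sb-state

# systemd's boot inspector reports the same state
bootctl status | grep -i 'secure boot'
```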
The kernel command line is more powerful than it looks. Parameters like root=/dev/vda1 tell the kernel where the root filesystem is. init=/bin/bash overrides the default init process (useful for emergency recovery). nomodeset disables kernel mode setting for GPU drivers. rd.break drops to a shell inside the initramfs before mounting root. systemd.unit=rescue.target boots into single-user mode. These parameters are the primary debugging tool for boot failures.
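Persistent parameters live in /etc/default/grub; one-off parameters can be added by pressing e at the GRUB menu and editing the linux line. A sketch with example values:

```sh
# /etc/default/grub -- persistent kernel command line (example values)
GRUB_CMDLINE_LINUX="root=UUID=1234-abcd ro quiet"

# After editing, regenerate the GRUB config:
#   update-grub                                # Debian/Ubuntu
#   grub2-mkconfig -o /boot/grub2/grub.cfg     # RHEL/Fedora family

# For one-off debugging, append at the GRUB edit screen instead, e.g.:
#   systemd.unit=rescue.target    rd.break    init=/bin/bash
```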
systemd parallelization model. SysVinit started services sequentially based on numbered scripts (S01, S02, ...). systemd changed this by declaring dependencies explicitly and starting independent services in parallel. Socket activation takes this further: systemd creates the listening socket for a service before the service starts. If another service connects to that socket, the connection queues until the real service is ready. This eliminates ordering dependencies that only existed because "service B connects to service A's port at startup."
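A hypothetical myapp.socket / myapp.service pair shows the mechanism: systemd owns the listening socket from early boot, and the service is only started when the first connection arrives (unit names, port, and binary path are illustrative):

```ini
# /etc/systemd/system/myapp.socket
# systemd opens this socket during boot; connections queue until the service is up
[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target

# /etc/systemd/system/myapp.service
# Started on the first connection; receives the already-open socket via sd_listen_fds()
[Service]
ExecStart=/usr/local/bin/myapp
```

Enabling the socket unit (systemctl enable --now myapp.socket) makes the port available from boot without paying the service's startup cost until something actually connects.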
Why initramfs and not just compile drivers into the kernel. A monolithic kernel with every possible driver compiled in would be enormous and waste memory loading drivers for hardware that is not present. Modules solve this, but modules live on the root filesystem. The initramfs is the escape hatch: a tiny filesystem loaded into RAM by the bootloader, containing just enough modules to access the real root. Distribution kernels ship a generic initramfs with drivers for every common storage controller. Optimized deployments strip it to only the needed drivers.
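On dracut-based distributions, a host-only rebuild is the usual way to strip the image down to the drivers the running machine actually uses (a sketch; mkinitcpio and initramfs-tools have equivalent options):

```sh
# Rebuild the initramfs for the running kernel with only this host's drivers
sudo dracut --force --hostonly /boot/initramfs-$(uname -r).img $(uname -r)

# Compare the size and confirm the needed driver is still present
ls -lh /boot/initramfs-$(uname -r).img
lsinitrd /boot/initramfs-$(uname -r).img | grep -i virtio
```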
Common Questions
What happens when the kernel says "Unable to mount root fs on unknown-block(0,0)"?
The kernel finished initializing but could not find the root filesystem. The numbers in the parentheses are the major and minor device numbers -- (0,0) means no device matched. Three common causes: the root= parameter points to a device that does not exist (wrong UUID, wrong device path), the storage driver for the disk controller is not in the initramfs, or the filesystem module (ext4, xfs) is not loaded. Fix by booting an older kernel, checking /proc/cmdline for the root= value, and verifying the initramfs includes the correct drivers with lsinitrd.
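A quick triage sequence from the older working kernel or a rescue environment might look like this (the 6.1.0 image name is a placeholder):

```sh
# What root= did the failing boot actually use?
cat /proc/cmdline

# Does a block device with that UUID or path actually exist?
blkid
lsblk -f

# Does the new kernel's initramfs contain the storage and filesystem drivers?
lsinitrd /boot/initramfs-6.1.0.img | grep -iE 'nvme|virtio_blk|ahci|ext4|xfs'
```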
How does systemd know what order to start services in?
Each unit file declares its dependencies. After=network.target means "start this after the network is up." Requires=postgresql.service means "this unit needs PostgreSQL running." Wants= is a softer dependency that does not fail the dependent unit if the wanted unit fails. systemd builds a directed acyclic graph (DAG) from all units and their dependencies, then starts units in topological order, parallelizing where the graph allows. Cycles are detected and broken with a warning.
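A hypothetical unit file showing the three directives side by side. Note that Requires= expresses "needs it running" but not ordering, so After= is still needed to control start order:

```ini
# /etc/systemd/system/myapp.service (hypothetical)
[Unit]
Description=Example app illustrating dependency directives
# Ordering: do not start until these have started
After=network-online.target postgresql.service
# Hard dependency: if postgresql fails, this unit fails too
Requires=postgresql.service
# Soft dependency: pull it in, but keep going if it fails
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/myapp

[Install]
WantedBy=multi-user.target
```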
What is the difference between initramfs and initrd?
The old initrd (initial RAM disk) was a compressed filesystem image (ext2) loaded into a ramdisk block device. The kernel mounted it as a real block device, which meant it needed the ext2 driver compiled in and wasted memory on the block device layer. initramfs replaced it: the cpio archive is unpacked directly into a tmpfs (a memory-backed filesystem), skipping the block device layer entirely. initramfs is simpler, more flexible, and does not require any compiled-in filesystem driver. The name "initrd" persists in some config files and kernel parameters, but modern distributions use initramfs.
How do micro-VMs like Firecracker boot so fast?
Firecracker eliminates the two slowest stages. There is no UEFI firmware -- the VMM loads the kernel directly into guest memory at the correct address, sets up the boot parameters, and points the vCPU at the kernel entry point. The initramfs is either minimal (virtio drivers only) or skipped entirely if the root filesystem driver is compiled into the kernel. The kernel boots in a paravirtualized environment where device discovery is instant (virtio devices, no PCI enumeration). The result is under 125 milliseconds from API call to the init process running.
How Technologies Use This
AWS Firecracker microVMs power Lambda functions that must cold-start in under 125 milliseconds. A traditional VM boot takes 45 seconds because UEFI firmware spends 5-8 seconds enumerating PCI devices, USB controllers, and option ROMs before the bootloader even loads. Firecracker eliminates the firmware and bootloader stages entirely by loading the Linux kernel image directly into guest memory from the VMM process and pointing the vCPU instruction pointer at the kernel entry point.
The kernel boots in a paravirtualized environment where device discovery is instantaneous. Instead of probing ACPI tables and scanning PCI buses, Firecracker presents only virtio-mmio devices that the kernel detects in microseconds. The initramfs is either stripped to a few hundred kilobytes containing only the virtio_blk and virtio_net modules, or skipped entirely by compiling those drivers directly into the kernel. With no firmware, no GRUB menu timeout, and no initramfs driver scanning, the kernel reaches start_kernel() within milliseconds of the API call.
The combination of these boot stage eliminations produces a total boot time under 125 milliseconds from API call to the init process running. Scaling this to 10,000 concurrent microVMs during a traffic spike means all instances reach readiness in roughly the same 125ms window, compared to the 45-second stagger of traditional UEFI-booted VMs.
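The missing stages show up directly in Firecracker's API: the kernel image and command line are handed straight to the VMM before the microVM starts, with no bootloader in between. A sketch following the shape of the Firecracker getting-started example (socket and file paths are placeholders):

```sh
# Point Firecracker at an uncompressed kernel and a boot command line directly.
# There is no firmware stage and no GRUB: the VMM loads this image into guest
# memory and jumps to its entry point.
curl --unix-socket /tmp/firecracker.socket -X PUT 'http://localhost/boot-source' \
  -H 'Content-Type: application/json' \
  -d '{
        "kernel_image_path": "./vmlinux",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
      }'
```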
A 200-node Kubernetes cluster rolls out a kernel update from 5.15 to 6.1. After rebooting, 30 nodes fail with "Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)." The kubelet never starts, the nodes never rejoin the cluster, and workloads pile up on the remaining 170 nodes. The old kernel booted these same machines without issue.
The panic occurs at Stage 4 of the boot process: the kernel has finished initializing the scheduler, memory manager, and VFS, and has mounted the initramfs. The /init script inside the initramfs attempts to load the storage controller driver (nvme or ahci) to mount the real root filesystem, but the driver is missing. The initramfs was generated for the old kernel and never regenerated after the update, so the modules it bundles were built for 5.15 and cannot be loaded by the 6.1 kernel, and the /lib/modules/6.1 tree it would need inside the image does not exist.
Recovery involves booting the old kernel from GRUB's advanced options menu, running `dracut --regenerate-all` or `update-initramfs -u -k all` to rebuild initramfs images for all installed kernels, and verifying with `lsinitrd` that the new image includes the correct storage driver. Preventing this in Kubernetes clusters requires automation that regenerates the initramfs as part of the kernel update playbook and validates the image contents before scheduling the reboot.
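A minimal sketch of the validation step such a playbook could run before scheduling the reboot (driver names and image paths depend on the fleet's hardware and distro):

```sh
# Rebuild initramfs images for every installed kernel (dracut-based distros;
# use `update-initramfs -u -k all` on Debian/Ubuntu instead)
sudo dracut --regenerate-all --force

# Fail the playbook if any image is missing the storage driver these nodes need
for img in /boot/initramfs-*.img; do
    lsinitrd "$img" | grep -qE 'nvme|virtio_blk' \
        || { echo "ERROR: $img lacks the storage driver"; exit 1; }
done
```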
A Docker container runs a Node.js application as PID 1, handling 2,000 requests per second. Under sustained load, `docker top` reveals 150 zombie processes accumulating over 6 hours. Sending SIGTERM to stop the container does nothing for 10 seconds until Docker force-kills it with SIGKILL. The application never drains its 500 active WebSocket connections gracefully.
The root cause is that PID 1 inside a container inherits two responsibilities from the boot process design. First, it must reap orphaned child processes by calling wait() on zombies. Node.js spawns child processes for image processing and shell commands but does not call wait() on grandchildren that get re-parented to PID 1 when their parent exits. Second, the kernel gives PID 1 special signal handling: SIGTERM and SIGINT are ignored by default unless the process explicitly installs a signal handler. Node.js does not register a SIGTERM handler by default, so the signal is silently dropped until Docker's 10-second grace period expires and SIGKILL forces termination.
The fix is to run a lightweight init process like tini or dumb-init as PID 1. Tini (8 KB binary) calls wait() in a loop to reap any zombie, and it forwards signals to the child application process. Docker's `--init` flag injects tini automatically. In Kubernetes, setting `shareProcessNamespace: true` in the pod spec makes the pause container handle PID 1 duties, including zombie reaping and signal forwarding across all containers in the pod.
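The simplest form is Docker's --init flag; baking tini into the image achieves the same thing. A sketch (image and package names are examples and vary by base image):

```sh
# Inject tini as PID 1 at run time; the application becomes its child
docker run --init -d --name web myapp:latest

# Or bake it into the image -- Dockerfile sketch:
#   FROM node:20-slim
#   RUN apt-get update && apt-get install -y --no-install-recommends tini
#   ENTRYPOINT ["/usr/bin/tini", "--"]
#   CMD ["node", "server.js"]
```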
Same Concept Across Tech
| Technology | How it relates to boot | Key gotcha |
|---|---|---|
| AWS Firecracker | Skips UEFI entirely, loads kernel directly from VMM. Boot under 125ms | Requires a custom kernel and minimal initramfs. No GRUB, no firmware stage |
| Docker | Container entrypoint runs as PID 1 inside a PID namespace. No firmware or bootloader involved | PID 1 ignores SIGTERM by default. Use --init flag for tini as PID 1 |
| Kubernetes | Node boot includes cloud-init for cluster join. Pod "boot" is container image pull + entrypoint | Prebaked node images eliminate cloud-init delay. initContainers run before the main container |
| systemd-nspawn | Boots a full systemd tree inside a container namespace. Closest thing to a VM boot without firmware | Shares the host kernel. No initramfs or bootloader phase |
| U-Boot | Embedded bootloader replacing GRUB. Supports kernel XIP from flash | Different config format (boot.scr). Device tree instead of ACPI for hardware discovery |
Stack layer mapping (slow boot diagnosis):
| Layer | What to check | Tool |
|---|---|---|
| Firmware | UEFI enumeration time, USB/network option ROM delays | systemd-analyze (firmware field), UEFI setup menu |
| Bootloader | GRUB timeout, config search across multiple disks | grep GRUB_TIMEOUT /etc/default/grub |
| Kernel | Driver probing delays, missing ACPI tables, decompression time | dmesg timestamps, kernel command line parameters |
| initramfs | Oversized image, unnecessary driver loading, LUKS unlock wait | lsinitrd, dracut --list-modules |
| Userspace | Slow services, deep dependency chains, network-wait targets | systemd-analyze blame, systemd-analyze critical-chain |
Design Rationale
The staged boot design exists because each stage operates with increasing knowledge of the system. Firmware knows about CPU and memory but not filesystems. The bootloader knows about filesystems but not about RAID or encryption. The initramfs knows about storage but runs from a temporary root. Only after the real root is mounted does the full init system have access to all configuration and services. Each stage exists to bridge a specific knowledge gap for the next stage.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| "Kernel panic - VFS: Unable to mount root fs" | Missing storage driver in initramfs or wrong root= parameter | lsinitrd to verify driver presence, check /proc/cmdline for root= |
| Boot hangs after GRUB with black screen | Kernel fails to initialize display driver or panics silently | Add "nomodeset" to kernel command line, remove "quiet" and "splash" |
| systemd-analyze shows 30+ seconds in userspace | Slow service units or deep dependency chains | systemd-analyze blame, disable unnecessary units |
| systemd-analyze shows 8+ seconds in firmware | UEFI probing USB, network option ROMs, or storage controllers | Disable unused boot devices in UEFI setup, disable USB enumeration |
| Container ignores SIGTERM, killed after timeout | Application running as PID 1 without signal handler | Use tini/dumb-init as PID 1, or install SIGTERM handler in application |
| Zombie processes accumulate in container | PID 1 process not calling wait() on orphaned children | Use --init flag in Docker, or shareProcessNamespace in Kubernetes |
When to Use / Avoid
Relevant when:
- Diagnosing why a server or VM takes too long to boot
- Debugging kernel panics that occur before userspace (missing initramfs drivers, wrong root= parameter)
- Understanding why containers need a proper init process as PID 1
- Optimizing cloud VM boot times for auto-scaling responsiveness
Watch out for:
- Most boot time is spent in firmware and userspace, not the kernel itself
- An initramfs built for generic hardware includes drivers the target system does not need
- PID 1 has special signal handling -- applications running as PID 1 in containers ignore SIGTERM by default
Try It Yourself
```sh
# Show boot time breakdown by stage (firmware, loader, kernel, userspace)
systemd-analyze

# List the 20 slowest services during boot
systemd-analyze blame | head -20

# Show the critical chain -- the longest dependency path
systemd-analyze critical-chain

# Generate an SVG timeline of the entire boot
systemd-analyze plot > boot-timeline.svg

# Check kernel messages from current boot for errors
dmesg -T | grep -iE 'error|fail|warn' | head -20

# Show full boot log including systemd messages
journalctl -b --no-pager | head -100

# List contents of the current initramfs
lsinitrd /boot/initramfs-$(uname -r).img | head -30

# Check which kernel command line parameters were used
cat /proc/cmdline

# List all systemd units and their states
systemctl list-units --type=service --state=running

# Check GRUB configuration for timeout and default entry
grep -E 'GRUB_TIMEOUT|GRUB_DEFAULT' /etc/default/grub
```
Debug Checklist
1. Break down boot time by stage: systemd-analyze
2. Find the slowest systemd units: systemd-analyze blame | head -20
3. Show the critical dependency chain: systemd-analyze critical-chain
4. Check kernel boot messages for errors: dmesg | grep -iE 'error|fail|panic'
5. Verify initramfs contents: lsinitrd /boot/initramfs-$(uname -r).img | grep -i virtio
6. Check GRUB timeout wasting seconds: grep GRUB_TIMEOUT /etc/default/grub
7. Generate an SVG boot timeline: systemd-analyze plot > boot.svg
Key Takeaways
- ✓The boot sequence is a chain of handoffs: firmware loads the bootloader, the bootloader loads the kernel and initramfs, the kernel mounts the initramfs and runs /init, and /init mounts the real root and execs the real init (systemd). Each stage trusts the output of the previous one.
- ✓UEFI Secure Boot adds cryptographic verification to this chain. The firmware verifies the bootloader's signature, the bootloader verifies the kernel's signature, and the kernel can verify module signatures. A compromised bootloader cannot load a tampered kernel if Secure Boot is enforced.
- ✓The initramfs exists because the kernel needs storage drivers to read the root filesystem, but those drivers live on the root filesystem. The initramfs breaks this chicken-and-egg problem by bundling the necessary drivers into a cpio archive that the bootloader loads alongside the kernel.
- ✓systemd-analyze blame shows exactly which services are slow. systemd-analyze critical-chain shows the longest dependency chain. These two commands reveal whether boot time is spent waiting on a single slow service or blocked by a deep dependency graph.
- ✓PID 1 has special kernel treatment. It cannot be killed by signals unless it explicitly installs handlers. If PID 1 exits, the kernel panics. This is why containers need a proper init process -- the application as PID 1 misses zombie reaping and signal forwarding.
Common Pitfalls
- ✗Assuming a slow boot means the kernel is slow. In most cases, the kernel initializes in 1-3 seconds. The actual bottlenecks are UEFI firmware enumeration (especially USB and network option ROMs), oversized initramfs images loading unnecessary drivers, and systemd units with long startup times or deep dependency chains.
- ✗Including every possible driver in the initramfs for "compatibility." A generic distro initramfs can reach 80-100 MB because it bundles drivers for every storage controller, filesystem, and encryption scheme. On a cloud VM that only needs virtio_blk, virtio_net, and ext4, a tailored initramfs is under 10 MB and loads in a fraction of the time.
- ✗Running application processes as PID 1 inside containers without a proper init. The application will not reap zombie children, will not receive SIGTERM properly (PID 1 ignores signals by default unless a handler is installed), and cannot perform graceful shutdown. Use tini, dumb-init, or Docker's --init flag.
- ✗Ignoring the GRUB timeout. A default GRUB_TIMEOUT of 5 seconds means every single boot waits 5 seconds for menu input that nobody provides on a headless server. Set GRUB_TIMEOUT=0 in /etc/default/grub for servers and cloud VMs.
Reference
In One Line
Five stages -- firmware, bootloader, kernel, initramfs, systemd -- form a handoff chain from power-on to login prompt, and 90% of boot optimization is cutting waste from the first and last stages.