kdump & Crash Analysis
Mental Model
A building has a fireproof safe room built into its foundation during construction. The safe room has its own power supply and communication equipment. During normal operations, nobody enters the safe room -- it sits unused. When a fire breaks out, people move into the safe room, which remains intact because it was isolated from the rest of the building. From inside, they document exactly what happened -- where the fire started, what was burning, who was where. After the fire is out, investigators use those records to determine the cause. Without the safe room, everything would have been destroyed and the cause would remain unknown.
The Problem
A production server rebooted unexpectedly. No logs survived. The journal was not flushed to disk before the panic, and the serial console was not connected. The on-call team has a timestamp and nothing else. With kdump configured, the vmcore would have captured the exact kernel state at the moment of the crash -- the faulting instruction, the full backtrace, every process's state, and the dmesg ring buffer that never made it to persistent storage.
Architecture
Consider the earlier scenario in more detail. A production server running a critical workload reboots at 3:47 AM. The on-call engineer logs in, checks journalctl, and finds nothing: the journal was not flushed before the crash, and dmesg shows only messages from the current (post-reboot) boot. The serial console was not connected. All that exists is a timestamp in wtmp and an alert that says "unexpected reboot."
This is the default outcome of a kernel panic on a system without crash dump infrastructure. The kernel's entire state -- every register, every stack frame, every data structure -- existed in RAM for the brief moment between the panic and the power cycle. Then it was gone.
kdump exists to solve exactly this problem. It captures the crashed kernel's memory before it disappears, preserving a complete snapshot for post-mortem analysis.
How kdump Works
The core insight behind kdump is that a crashed kernel cannot be trusted to save its own state. If the memory allocator is corrupted, or a spinlock is held, or the interrupt controller is wedged, dump code running inside the crashed kernel will fail.
kdump solves this with a two-kernel design:
1. Boot time: The bootloader passes crashkernel=256M to the primary kernel. The kernel reserves 256 MB of physical memory that it will never use for normal operations.
2. Service startup: The kdump service calls kexec_load() (or kexec_file_load() on Secure Boot systems) to preload a secondary kernel and a minimal initramfs into the reserved memory. This secondary kernel is ready to boot at any moment.
3. Panic: When the primary kernel panics, instead of rebooting through BIOS, it calls into the kexec path. kexec transfers execution directly to the preloaded secondary kernel. No firmware re-initialization, no memory clearing -- the crashed kernel's memory remains intact.
4. Capture: The secondary kernel boots into a minimal environment. It sees the crashed kernel's memory via /proc/vmcore, an ELF-formatted view of physical RAM. makedumpfile reads /proc/vmcore, strips free pages and cache pages, compresses the remainder, and writes the result to persistent storage.
5. Reboot: After the vmcore is saved, the system reboots normally into the primary kernel. The vmcore sits in /var/crash/<timestamp>/ waiting for analysis.
The key to the design: the secondary kernel runs entirely within the pre-reserved memory. It never touches the crashed kernel's memory except to read it. This is what makes kdump reliable where previous approaches (diskdump, netdump, LKCD) were not.
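You can verify the preloaded state on a live system. The primary kernel exposes it through two sysfs files (a quick sanity check, assuming a kernel built with kexec support):
# 1 means a capture kernel is loaded and armed, 0 means it is not
cat /sys/kernel/kexec_crash_loaded
# Size in bytes of the crashkernel reservation
cat /sys/kernel/kexec_crash_size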
Setting Up kdump
On RHEL/CentOS systems, kdump is typically enabled during installation. For manual setup:
# Ensure crashkernel is in the boot parameters
grubby --update-kernel=ALL --args="crashkernel=256M"
# Install kexec-tools (provides kdump service, makedumpfile, kdumpctl)
yum install kexec-tools
# Enable and start the kdump service
systemctl enable --now kdump
# Verify the service loaded the capture kernel
kdumpctl status
The /etc/kdump.conf file controls where the vmcore is written:
# Write to local filesystem (default)
path /var/crash
# Write to a dedicated partition
ext4 /dev/sdb1
# Write to NFS for centralized collection
nfs nfs-server.example.com:/crash-dumps
# Write to SSH target
ssh crash-collector@analysis-server.example.com
# Compression and filtering
core_collector makedumpfile -l --message-level 7 -d 31
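Note that the target directives above are alternatives; a real /etc/kdump.conf enables exactly one. A minimal local-disk configuration might look like this sketch (failure_action exists in recent kexec-tools; older releases use the default directive instead):
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
# Reboot without dumping if capture fails, rather than hanging
failure_action reboot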
After any configuration change:
# Rebuild the kdump initramfs with new config
kdumpctl rebuild
# Verify everything is still operational
kdumpctl status
Analyzing a vmcore with crash
Once a vmcore exists, the crash utility opens it for interactive analysis. It requires the vmcore file and the matching vmlinux with debug symbols:
# Install debug symbols for the running kernel
debuginfo-install kernel-$(uname -r)
# Open the vmcore
crash /var/crash/127.0.0.1-2026-04-01-03:47/vmcore \
/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
Inside the crash shell, a standard investigation follows a predictable sequence:
crash> log
This recovers the complete dmesg ring buffer from the crashed kernel's memory, including messages that were never flushed to disk. The panic message, the stack trace, and any preceding warnings are all here.
crash> bt
Shows the backtrace of the task that was running when the panic occurred. This reveals the exact code path -- from the syscall entry point through each function call to the faulting instruction.
crash> bt -a
Shows backtraces for all CPUs. On a multi-core system, this reveals what every CPU was doing at the moment of the crash. Lock contention, interrupt handling, or related activity on other CPUs often provides context that the panicking CPU's backtrace alone does not.
crash> ps
Lists every process that existed at the time of the crash, including their state (running, sleeping, zombie). Identifying which process triggered the faulting syscall is often the first step in root-cause analysis.
crash> struct task_struct.comm,pid,state ffff8881a3c40000
Inspects individual fields of a kernel structure at a specific address. The crash tool understands every kernel structure and can dereference pointers, follow linked lists, and display nested structures.
crash> rd ffff8881a3c40000 64
Reads raw memory at any address. Useful for examining buffers, looking at memory around a faulting address, or verifying that a structure contains expected values.
crash> mod
Lists all loaded kernel modules with their addresses and sizes. If the panic occurred in a module, this shows which module and where the faulting instruction sits relative to the module's base address.
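From there, two more crash commands (shown with a hypothetical address) typically pinpoint the faulting instruction:
# Resolve an address to symbol+offset (address is illustrative)
crash> sym ffffffffc0a812f4
# Disassemble around it; -l interleaves source file and line when debuginfo is loaded
crash> dis -l ffffffffc0a812f4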
makedumpfile Filtering
A raw vmcore from a 256 GB server is 256 GB. Most of that is free pages, filesystem cache, and user-space memory that is irrelevant to a kernel panic investigation. makedumpfile filters the dump:
| Dump Level | Excludes | Typical Result |
|---|---|---|
| 0 | Nothing (raw dump) | 100% of RAM |
| 1 | Zero-filled pages | 80-95% of RAM |
| 17 | Zero + free pages | 20-40% of RAM |
| 31 | Zero + cache + user + free | 2-5% of RAM |
The filtering happens during the capture phase inside the kdump initramfs. The core_collector line in /etc/kdump.conf controls which flags makedumpfile uses:
core_collector makedumpfile -l -d 31 --message-level 7
The -l flag uses lzo compression (faster than zlib's -c). The --message-level 7 flag logs progress, which is visible on the console during the dump phase.
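The -d value itself is a bitmask, which is why 31 appears so often, as sketched below:
# makedumpfile dump-level bits (31 = 1+2+4+8+16)
#   1   zero-filled pages
#   2   non-private cache pages
#   4   private cache pages
#   8   user-space pages
#   16  free pages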
Under the Hood
kexec internals. The kexec_load() syscall takes an entry point, a list of segments (kernel image, initramfs, command line, purgatory code), and flags. The kernel copies these segments into the reserved memory region. The purgatory is a small stub that runs between the two kernels -- it sets up the CPU state, switches page tables, and jumps to the secondary kernel's entry point. On x86-64, purgatory also handles the transition from 64-bit mode through real mode (if needed) back to the secondary kernel's startup code.
The /proc/vmcore implementation. The capture kernel discovers the crashed kernel's memory layout from the ELF headers passed via the elfcorehdr= boot parameter. Each physical memory region becomes a program header in the ELF file. When makedumpfile reads /proc/vmcore, the kernel's vmcore_read() function translates the ELF file offset to a physical address and uses copy_oldmem_page() to read from the crashed kernel's memory. On x86-64, this uses ioremap_cache() to map the old memory into the capture kernel's address space.
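Because /proc/vmcore is a genuine ELF core file, ordinary binutils can inspect it. If you get a shell in the capture environment during a test dump, this illustrative check lists the program headers covering the crashed kernel's memory:
# Run inside the capture kernel's environment, not the rebooted system
readelf -l /proc/vmcore | head -40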
panic() call chain. When the kernel calls panic(), the sequence is: panic() -> crash_kexec() -> machine_kexec() -> purgatory -> secondary kernel entry. The crash_kexec() function checks whether a crash kernel is loaded (kexec_crash_image is non-NULL). If so, it disables interrupts on all other CPUs via NMI, saves register state for each CPU into per-CPU crash_notes (which the crash tool later reads for the bt -a output), and then calls machine_kexec() to transfer control.
crash_notes and per-CPU state. Before kexec transfers control, each CPU's register state is saved into a designated memory area called crash_notes. The physical addresses of these per-CPU note buffers are exported at /sys/devices/system/cpu/cpu*/crash_notes, which kexec-tools uses when building the ELF headers for the capture kernel. The crash tool uses these saved registers to reconstruct the backtrace for each CPU, showing exactly what every core was executing at the instant of the panic.
Why not just use a hardware watchdog? A hardware watchdog can reset a hung system, but it provides no information about why the system hung. kdump captures the full state: which CPU was spinning, which lock was held, which process was waiting. The watchdog and kdump are complementary -- the watchdog triggers the panic (via NMI watchdog or hard lockup detector), and kdump captures the state.
Common Questions
How much memory does crashkernel= actually reserve, and is it wasted?
The crashkernel=auto setting on RHEL reserves 256 MB for systems with 4-64 GB of RAM and scales up for larger systems. This memory is unavailable to the primary kernel -- it does not appear in the available memory pool. For a 64 GB server, 256 MB is 0.4% of total RAM. On memory-constrained systems, this reservation can be reduced, but going below 160 MB risks the capture kernel failing to boot. The kdumpctl estimate command reports the minimum required reservation.
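The parameter also accepts a range syntax that scales the reservation with installed RAM; the sizes below are illustrative, not a recommendation:
# Reserve 192M on 1-4 GB hosts, 256M on 4-64 GB, 512M above 64 GB
grubby --update-kernel=ALL --args="crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M"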
Can kdump capture a hang (not a panic)?
Not directly. A hang means the kernel is still running (or stuck), not panicking. But the NMI watchdog detects soft lockups (CPU stuck in kernel mode for > 20 seconds) and hard lockups (CPU not responding to timer interrupts for > 10 seconds). When the lockup detector fires, it can be configured to call panic(), which triggers kdump. Set kernel.softlockup_panic=1 and kernel.hardlockup_panic=1 via sysctl. The kernel.hung_task_panic=1 setting does the same for tasks stuck in D state (uninterruptible sleep) for over 120 seconds.
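A sketch of enabling all three lockup-to-panic paths persistently (values taken from the settings above):
# Write a sysctl drop-in, then apply it
cat <<'EOF' > /etc/sysctl.d/99-lockup-panic.conf
kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1
kernel.hung_task_panic = 1
EOF
sysctl --system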
What if the panic happens before kdump is loaded?
Early boot panics -- before the kdump service starts -- cannot be captured by kdump because the capture kernel has not been loaded yet. For these cases, pstore (persistent store) can write panic messages to UEFI variables, platform NVRAM, or a reserved RAM region that survives reboot. Check /sys/fs/pstore/ after a crash for any recovered messages. Some platforms also support ramoops, which reserves a memory region at boot specifically for storing panic logs.
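After such a crash, the check is a directory listing; file names depend on the backend (dmesg-efi-* for UEFI variables, dmesg-ramoops-* for ramoops), so treat these patterns as indicative:
# Any recovered panic records appear here after reboot
ls -l /sys/fs/pstore/
cat /sys/fs/pstore/dmesg-* 2>/dev/null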
How does kdump handle encrypted disks or network targets?
The kdump initramfs must contain everything needed to write the vmcore. For LUKS-encrypted dump targets, the initramfs must include the encryption tools and keys. For network targets (NFS, SSH), it must include network drivers, DHCP or static IP configuration, and authentication credentials. The kdumpctl rebuild command pulls these into the initramfs based on /etc/kdump.conf settings. Testing with echo c > /proc/sysrq-trigger validates the entire chain.
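For the SSH case specifically, a plausible /etc/kdump.conf fragment follows; remote targets need makedumpfile's flattened output (-F), and the key path here is only an example:
ssh crash-collector@analysis-server.example.com
sshkey /root/.ssh/kdump_id_rsa
core_collector makedumpfile -F -l -d 31
After editing, kdumpctl propagate installs the key on the collector, and kdumpctl rebuild bakes the settings into the initramfs.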
Is there a performance impact from having kdump configured?
The runtime performance impact is negligible. The only cost is the reserved memory (256-512 MB unavailable to the primary kernel) and the one-time overhead of kexec_load() during boot. There is no ongoing CPU cost, no periodic memory scanning, and no impact on interrupt latency. The capture kernel is completely dormant until a panic occurs.
How Technologies Use This
A Docker host running 240 containers on RHEL 8 panics at 3:47 AM. The kernel oops trace on the serial console shows a null pointer dereference inside the overlay2 storage driver, triggered when a container attempted to write to a layered mount during a cgroup memory reclaim event. Without a vmcore, that single-line oops is the only evidence, and it is not enough to identify which container or which cgroup configuration caused the fault.
kdump captures the full kernel memory state at the moment of panic. The crash utility opens the vmcore and the matching vmlinux with debug symbols. Running "bt" shows the complete call stack: the overlay2 ovl_write_iter() function called into the memory cgroup charge path, which hit a race condition in mem_cgroup_charge() when the container's cgroup was being removed mid-write. The "files" command in crash reveals which overlay2 mount was active, and "task" identifies the container PID whose I/O triggered the panic. The "log" command recovers dmesg entries that never reached journald.
The root cause turns out to be a known cgroup v1 race condition when a container exits while another container writes through a shared overlay2 layer. The vmcore provides the exact memory cgroup pointer, the overlay dentry, and the faulting instruction, giving the kernel team enough data to match it against an upstream patch. Without kdump, this bug would recur as a mystery reboot every few weeks.
A 150-node Kubernetes cluster experiences a node crash during a peak scheduling burst. The kubelet on the failed node was scheduling 40 pods simultaneously when the kernel panicked. The node comes back up, but the scheduler has already redistributed workloads and the original failure context is gone. The only clue is a "NodeNotReady" event in the Kubernetes API and a brief oops fragment in the serial console log.
The vmcore captured by kdump tells the full story. Opening it with crash shows the backtrace originating inside the cgroup cpu controller. The kernel was executing sched_cfs_period_timer() when it encountered a corrupted rbtree node in the CFS runqueue. The "ps" command in crash lists all 40 pod processes that were in TASK_RUNNING or TASK_UNINTERRUPTIBLE state, and "struct cfs_rq" on the faulting CPU reveals the exact scheduler state at the moment of panic. The cgroup hierarchy dump shows which pod's cpu.max setting interacted with the timer to trigger the corruption.
This crash points to a scheduler bug that only surfaces when dozens of cgroups with strict CPU bandwidth limits contend on the same physical core. The vmcore provides the cfs_bandwidth struct, the timer callstack, and the corrupted node pointer, which together match a patch merged in kernel 5.19. Without the crash dump, the team would only know that a node rebooted during high scheduling activity.
A PostgreSQL 15 server handling 8,000 transactions per second panics the host kernel during a WAL (Write-Ahead Log) flush to an NVMe SSD. The database was mid-checkpoint, writing 12 GB of dirty buffers. The system reboots, PostgreSQL recovers from WAL, but the crash repeats every 4 to 6 hours under sustained write load. Application logs show only that the database connection was lost.
The vmcore from kdump reveals a bug in the NVMe driver. The backtrace shows PostgreSQL's wal_writer process calling fdatasync(), which enters the block layer via submit_bio(), then hits a use-after-free inside nvme_queue_rq() on the NVMe command completion path. The "struct nvme_command" dump in crash shows a corrupted completion queue entry. The "dev" command identifies the exact NVMe controller and namespace. The "rd" command on the faulting address confirms that the NVMe driver reused a command slot before the previous I/O completion was fully processed.
The fix is a firmware update for the NVMe controller combined with a kernel patch that adds a completion fence to the NVMe driver's command recycling path. The vmcore data, specifically the NVMe command structure, the bio chain state, and the completion queue pointer, allowed the storage vendor to reproduce the issue in their lab within two days. Without the crash dump, correlating "PostgreSQL loses connection every few hours" to an NVMe driver race condition would have taken weeks of guesswork.
Same Concept Across Tech
| Technology | How it uses kdump/crash analysis | Key consideration |
|---|---|---|
| RHEL/CentOS | kdump enabled by default. systemctl manages the service. vmcore lands in /var/crash/ | Install kernel-debuginfo for crash analysis. Use sosreport to collect kdump config |
| Ubuntu/Debian | linux-crashdump metapackage installs kdump-tools, kexec-tools, crash | Uses /etc/default/kdump-tools for configuration instead of /etc/kdump.conf |
| AWS EC2 | kdump writes to EBS persistent volume. Ephemeral instance store is wiped on restart | Ensure the kdump initramfs includes nvme drivers for EBS on Nitro instances |
| Kubernetes nodes | kdump on the host OS captures panics from any workload. Node-level, not pod-level | Reserve crashkernel memory in node allocatable calculations to avoid pod eviction |
| Kernel module development | Load module, trigger bug, collect vmcore, analyze in crash | Use panic_on_oops=1 to ensure every oops produces a full dump, not just a warning |
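The panic_on_oops setting from the last row is a single sysctl; a minimal sketch of turning it on:
# Make every oops escalate to a panic (and thus a kdump capture)
sysctl kernel.panic_on_oops=1
# Persist across reboots
echo 'kernel.panic_on_oops = 1' > /etc/sysctl.d/99-panic-on-oops.conf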
Stack layer mapping (investigating an unexplained server reboot):
| Layer | What to check | Tool |
|---|---|---|
| Hardware | Was it a hardware-initiated reset (BMC, watchdog, power)? | ipmitool sel list, BMC event log |
| Firmware | Did UEFI record an error before the OS booted? | efivar, UEFI error records |
| Kernel | Is there a vmcore in /var/crash/? | ls /var/crash/, crash tool |
| Kernel log | Did dmesg capture anything before the panic? | crash> log (from vmcore), journalctl -b -1 |
| Application | Did an application trigger a kernel bug? | crash> bt (backtrace shows triggering syscall) |
Design Rationale
Traditional crash dump mechanisms (like diskdump or netdump) ran dump code within the crashed kernel itself. This was fundamentally unreliable -- if the kernel's memory allocator was corrupted, or if a spinlock was held, the dump code could deadlock or produce a corrupted dump. The two-kernel approach solves this by running the dump code in a completely fresh kernel that has its own memory allocator, its own drivers, and its own scheduler. The crashed kernel's memory is treated as read-only data, not as a running system. This separation is what makes kdump the first truly reliable Linux crash dump mechanism.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Server rebooted, no logs, no vmcore | kdump not enabled or capture kernel failed to boot | kdumpctl status, check crashkernel= in /proc/cmdline |
| vmcore is 0 bytes or missing | Dump target full, or makedumpfile failed | Check disk space on /var/crash/, examine capture kernel console output |
| crash tool refuses to open vmcore | Mismatched vmlinux (wrong kernel version or missing debug symbols) | rpm -q kernel-debuginfo-$(uname -r), match exact build |
| kdumpctl reports "not operational" | Capture kernel failed to load. Possible Secure Boot or memory reservation issue | journalctl -u kdump, check kexec_file_load support |
| Dump takes 10+ minutes on large memory system | Raw dump too large, makedumpfile filtering not aggressive enough | Increase dump level (-d 31), switch to lzo compression (-l) |
| Capture kernel panics during dump | Insufficient crashkernel= memory for the capture kernel and its drivers | Increase reservation (crashkernel=512M or crashkernel=1G) |
When to Use / Avoid
Relevant when:
- Investigating production kernel panics where no logs survived the crash
- Setting up post-mortem analysis infrastructure for servers that must have root-cause explanations for every unplanned reboot
- Debugging kernel modules or drivers that cause intermittent panics
- Satisfying compliance requirements that mandate crash dump retention and analysis
Watch out for:
- The crashkernel= reservation reduces available RAM for applications (typically 256-512 MB)
- Network-based dump targets require the kdump initramfs to include network drivers and authentication
- Secure Boot environments require kexec_file_load() with signed kernels, not the older kexec_load()
- Very large memory systems (1 TB+) may need extended dump times even with makedumpfile filtering
Try It Yourself
# Check if kdump is configured and operational
kdumpctl status

# View crashkernel reservation from boot parameters
cat /proc/cmdline | grep -o "crashkernel=[^ ]*"

# Check reserved memory regions for the crash kernel
cat /proc/iomem | grep "Crash kernel"

# Estimate memory needed for the capture kernel
kdumpctl estimate

# Rebuild kdump initramfs after configuration changes
kdumpctl rebuild

# Open a vmcore for analysis with the crash tool
crash /var/crash/127.0.0.1-2026-04-01-03:47/vmcore /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

# Inside crash: get backtrace of the panicking task
crash> bt

# Inside crash: show backtraces for all CPUs
crash> bt -a

# Inside crash: recover the full dmesg log from the crashed kernel
crash> log

# Inside crash: list all processes at the time of the crash
crash> ps

# Inside crash: inspect a kernel structure by address
crash> struct task_struct.comm,pid,state ffff8881a3c40000

# Inside crash: read raw memory at an address
crash> rd ffff8881a3c40000 64

# Inside crash: show loaded kernel modules
crash> mod

# Inside crash: display mount points at crash time
crash> mount

# Trigger a test panic to validate kdump (TEST SYSTEMS ONLY)
echo c > /proc/sysrq-trigger

# Verify vmcore was created after a test crash
ls -lh /var/crash/*/vmcore

# Extract just the dmesg from a vmcore without full crash analysis
vmcore-dmesg /var/crash/127.0.0.1-2026-04-01-03:47/vmcore

# Create a filtered and compressed dump manually
makedumpfile -c -d 31 /proc/vmcore /var/crash/vmcore

Debug Checklist
1. Verify kdump is operational: kdumpctl status
2. Check crashkernel reservation: cat /proc/cmdline | grep crashkernel
3. Verify reserved memory: cat /proc/iomem | grep 'Crash kernel'
4. Check kdump configuration: cat /etc/kdump.conf
5. Estimate memory needs: kdumpctl estimate
6. Rebuild initramfs after config changes: kdumpctl rebuild
7. Test with controlled panic: echo c > /proc/sysrq-trigger (test systems only)
8. Verify vmcore after test: ls -lh /var/crash/*/vmcore
9. Check debug symbols: rpm -q kernel-debuginfo-$(uname -r)
Key Takeaways
- ✓kdump works because of a two-kernel design. The primary kernel reserves memory at boot for a secondary kernel. On panic, kexec boots the secondary kernel into that reserved memory. The secondary kernel can read the crashed kernel's memory via /proc/vmcore because that memory was never overwritten -- the secondary kernel runs entirely within the reserved region.
- ✓The kexec_load() syscall preloads the capture kernel and initramfs into reserved memory. The kexec_file_load() variant is newer and supports signed kernels (required when Secure Boot is enabled). Both store the kernel image in reserved memory so that the panic path has zero disk I/O -- it just jumps to the preloaded kernel.
- ✓makedumpfile's dump levels control what gets excluded from the vmcore. The level is a bitmask: 1 excludes zero pages, 2 non-private cache pages, 4 private cache pages, 8 user-space pages, and 16 free pages. Level 31 combines all five -- typically reducing a 64 GB dump to 1-3 GB. The trade-off: higher dump levels lose more potentially useful data. For kernel debugging, level 31 is standard. For user-space memory forensics, a lower level such as 1 preserves more.
- ✓The crash tool is not just a memory viewer. It reconstructs kernel state by parsing the vmcore with knowledge of kernel data structures. It reads the task_struct list to show all processes, walks page tables, decodes lock states, and traces the exact code path that led to the panic. It even recovers the dmesg ring buffer from the crashed kernel's memory.
- ✓Testing kdump before a real crash is essential. The command "echo c > /proc/sysrq-trigger" forces an immediate kernel panic. If kdump is properly configured, the system reboots into the capture kernel, writes the vmcore, and reboots again into the normal kernel. The vmcore appears in /var/crash/ with a timestamp directory. If this test fails, kdump will also fail during a real crash.
Common Pitfalls
- ✗Not reserving enough memory for the capture kernel. If the crashkernel= parameter is too small, the capture kernel fails to boot and the vmcore is lost. Systems with many kernel modules, network-based dump targets, or complex initramfs configurations need more than the minimum 256 MB. Run kdumpctl estimate to check the actual memory requirement.
- ✗Forgetting to rebuild the kdump initramfs after configuration changes. Changing the dump target (e.g., from local disk to NFS) requires running kdumpctl rebuild to regenerate the capture kernel's initramfs. Without this, the capture kernel boots with the old configuration and may fail to write the vmcore.
- ✗Analyzing a vmcore without the matching vmlinux. The crash tool needs the exact vmlinux binary (with CONFIG_DEBUG_INFO) that was running when the crash occurred. A vmlinux from a different kernel build, even the same version, has different symbol addresses. Install the kernel-debuginfo package matching the exact kernel version and release.
- ✗Assuming kdump works on first boot without testing. UEFI Secure Boot can block kexec_load(). SELinux policies may prevent writing to the dump target. Network-based targets may fail if the kdump initramfs lacks the correct network driver. Always validate with a controlled panic via sysrq-trigger after initial setup.
- ✗Running out of disk space for the vmcore. A 256 GB server with dump level 31 and lzo compression may still produce a 5-10 GB vmcore. If /var/crash/ is on a small root partition, the dump fails silently. Configure a dedicated dump target with sufficient space, and keep makedumpfile's --message-level high enough that the capture console shows progress, so a failed write is visible rather than silent.
Reference
In One Line
kdump preloads a secondary kernel into reserved memory at boot so that when the primary kernel panics, it can capture the entire crashed state to disk -- the only reliable way to get a full post-mortem when no logs survive.