Device Mapper & Block Layer Internals
Mental Model
A postal sorting facility with multiple processing stations. Letters (BIOs) arrive from upstairs offices (filesystems) addressed to specific mailboxes (sectors). Before reaching the delivery trucks (disk drivers), each letter passes through one or more processing stations. One station encrypts the contents (dm-crypt). Another redirects letters to different mailboxes (dm-linear). A third station photocopies letters before forwarding them so an archive copy exists (dm-snapshot). The letter itself does not know or care which stations it passes through. Each station only sees an envelope with a destination and a payload.
The Problem
A PostgreSQL database running on an LVM logical volume experiences a sudden spike in I/O latency every time an LVM snapshot is created for nightly backups. Normal write latency is 0.5ms. During and after snapshot creation, write latency jumps to 8-15ms and stays elevated for the duration of the snapshot's existence. The database itself has not changed. The disks are not saturated. iostat shows increased write amplification on the underlying physical volume. The latency disappears the moment the snapshot is removed with lvremove.
Architecture
A database is running on LVM. Performance is solid: consistent 0.5ms write latency, steady throughput. Then the nightly backup script runs lvcreate --snapshot, and within seconds write latency jumps to 8-15ms. The disks are not saturated. The database configuration has not changed. Removing the snapshot with lvremove brings latency back to normal instantly.
The problem is not the database or the disk. The problem is in the block layer, specifically in the device mapper subsystem that sits between the filesystem and the physical storage. Understanding device mapper and the block layer is essential for anyone running LVM, LUKS encryption, or container storage.
What Device Mapper Actually Does
Device mapper is a kernel framework that creates virtual block devices by intercepting and transforming I/O. It sits in the block layer between the filesystem (which submits I/O) and the device driver (which executes I/O on hardware).
The architecture is built around three concepts:
- Mapped devices (/dev/dm-0, /dev/dm-1, etc.) are virtual block devices that applications and filesystems use. LVM creates symlinks like /dev/vg0/lv_data pointing to these dm devices.
- Mapping tables define how sectors on the virtual device translate to sectors on real devices. Each table entry specifies a target type and its parameters.
- Targets are loadable kernel modules that implement specific transformations. dm-linear remaps sectors (used by LVM for concatenation). dm-crypt encrypts and decrypts block I/O (used by LUKS). dm-thin provides thin provisioning with on-demand block allocation. dm-snapshot provides copy-on-write snapshots.
When a filesystem calls submit_bio() to write a block, the BIO (Block I/O descriptor) hits the device mapper layer. The DM core looks up which target owns the BIO's sector range and calls that target's map() function. The target transforms the BIO -- remapping sectors, encrypting data, allocating blocks, or copying old data -- and either remaps it to the underlying device or submits entirely new BIOs.
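The dispatch described above can be modeled in a few lines. This is a toy sketch, not kernel code: the names Bio, LinearTarget, and dm_map are ours, standing in for struct bio, a dm target, and the DM core's table lookup plus map() call.

```python
# Toy model of the DM core: a mapping table of sector ranges, each owned by
# a target whose map() transforms the BIO before it goes to the real device.
from dataclasses import dataclass

@dataclass
class Bio:
    sector: int      # starting sector on the virtual (dm) device
    size: int        # I/O size in sectors
    device: str      # device the BIO is currently addressed to

class LinearTarget:
    """Remaps a sector range onto an underlying device at an offset (dm-linear)."""
    def __init__(self, start, length, dest_dev, dest_offset):
        self.start, self.length = start, length
        self.dest_dev, self.dest_offset = dest_dev, dest_offset

    def map(self, bio):
        # Remap: subtract the range start, add the destination offset
        bio.sector = bio.sector - self.start + self.dest_offset
        bio.device = self.dest_dev
        return bio

def dm_map(table, bio):
    """Find the target owning the BIO's starting sector and call its map()."""
    for tgt in table:
        if tgt.start <= bio.sector < tgt.start + tgt.length:
            return tgt.map(bio)
    raise ValueError("sector outside mapped range")

# Table equivalent to: 0 2097152 linear /dev/sda2 2048
table = [LinearTarget(0, 2097152, "/dev/sda2", 2048)]
bio = dm_map(table, Bio(sector=1000, size=8, device="/dev/dm-0"))
print(bio.device, bio.sector)  # /dev/sda2 3048
```

A multi-entry table (as in the LVM concatenation example later) is just more LinearTarget entries in the same list; the lookup loop picks the owning range.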
The Block Layer: BIOs and Request Queues
The block layer is the plumbing that moves I/O from filesystems to device drivers. Two key structures drive everything:
struct bio is the fundamental I/O descriptor. It contains:
- bi_bdev: the target block device
- bi_iter.bi_sector: the starting sector
- bi_iter.bi_size: the I/O size in bytes
- bi_opf: the operation (READ, WRITE, FLUSH, DISCARD)
- bi_io_vec: a scatter-gather list of memory pages holding the data
A single read() syscall can generate multiple BIOs if the data spans non-contiguous blocks on disk.
struct request_queue (managed by blk-mq, the multi-queue block layer) handles BIO merging and scheduling. The path is:
1. Filesystem calls submit_bio().
2. The block layer checks whether the BIO can be merged with an adjacent pending request (front-merge or back-merge). Merging is critical for sequential I/O performance.
3. If no merge is possible, a new struct request is created.
4. The I/O scheduler (mq-deadline, bfq, kyber, or none) reorders requests.
5. Requests are dispatched to the device driver via hardware dispatch queues.
For device mapper devices, step 1 lands in dm_make_request() instead of going directly to the scheduler. The DM target processes the BIO and resubmits it to the underlying device, where it enters the real device's request queue.
LVM and dm-linear
LVM (Logical Volume Manager) is the most common device mapper consumer. At its core, LVM uses dm-linear targets to map contiguous ranges of logical volume sectors to physical volume sectors.
# View the dm-linear mapping for an LVM logical volume
dmsetup table /dev/vg0/lv_data
# Output: 0 2097152 linear /dev/sda2 2048
# ^^^^^^^^^^ ^^^^^^ ^^^^
# sector range target underlying dev + offset
This says: sectors 0 through 2097151 of lv_data map to /dev/sda2 starting at sector 2048. When the filesystem writes to sector 1000 of lv_data, device mapper adds 2048 and issues the write to sector 3048 of /dev/sda2. The remapping is a simple addition -- negligible overhead.
When an LV spans multiple physical extents or PVs, the table has multiple entries:
dmsetup table /dev/vg0/lv_large
# 0 1048576 linear /dev/sda2 2048
# 1048576 1048576 linear /dev/sdb1 0
The first 512 MB maps to sda2, the next 512 MB to sdb1: plain LVM concatenation.
dm-crypt: Transparent Block Encryption
dm-crypt intercepts every BIO and runs it through the kernel crypto API. For writes, it encrypts plaintext pages into ciphertext before forwarding to the underlying device. For reads, it decrypts ciphertext returned by the device before passing pages back to the filesystem.
# View dm-crypt mapping (LUKS device)
dmsetup table /dev/mapper/root_crypt
# 0 976560128 crypt aes-xts-plain64 <key> 0 /dev/sda2 4096
# ^^^^^^^^^^^^^^^^
# cipher-chainmode-ivmode
The per-sector IV mode (plain64) means each sector gets a unique IV derived from its sector number. This allows random reads without decrypting preceding sectors. AES-XTS uses two AES keys: one for the block cipher, one for the tweak (sector-dependent whitening).
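The plain64 scheme is simple enough to sketch directly: the IV is the sector number as a 64-bit little-endian integer, zero-padded to the cipher's IV size. The function name here is ours, but the byte layout follows dm-crypt's plain64 convention.

```python
# Sketch of the plain64 IV scheme: each sector's IV is its sector number,
# little-endian, zero-padded. Unique per sector, so any sector can be
# decrypted independently of the sectors before it.
import struct

def plain64_iv(sector: int, iv_size: int = 16) -> bytes:
    """IV for a given sector under the plain64 convention."""
    return struct.pack("<Q", sector & (2**64 - 1)).ljust(iv_size, b"\x00")

print(plain64_iv(0).hex())     # 00000000000000000000000000000000
print(plain64_iv(3048).hex())  # e80b0000000000000000000000000000
```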
# Check encryption throughput with hardware acceleration
cryptsetup benchmark
# Output with AES-NI:
# aes-xts 256b 2891.5 MiB/s 2889.3 MiB/s
# Output without AES-NI:
# aes-xts 256b 389.2 MiB/s 392.1 MiB/s
The 7x difference between hardware-accelerated and software-only AES explains why grep aes /proc/cpuinfo is the first check when dm-crypt performance is underwhelming.
dm-thin: Thin Provisioning and Efficient Snapshots
dm-thin is the modern alternative to classic dm-snapshot. It manages a shared block pool and allocates physical blocks on demand as thin volumes write data.
# Create a thin pool and thin volumes
lvcreate -L 100G -T vg0/thinpool
lvcreate -V 50G -T vg0/thinpool -n thin_vol1
lvcreate -V 50G -T vg0/thinpool -n thin_vol2
# Both volumes claim 50 GB but share the 100 GB pool
# Actual usage depends on data written
lvs -o+data_percent vg0/thinpool
dm-thin maintains a B-tree that maps (thin device ID, virtual block) to physical block. When a thin volume writes to an unmapped virtual block, dm-thin allocates a physical block from the pool, updates the B-tree, and completes the write.
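The allocate-on-first-write behavior can be modeled with a dict standing in for the on-disk B-tree (the class and method names here are illustrative, not kernel APIs):

```python
# Toy model of dm-thin: physical blocks are allocated from a shared pool the
# first time a (device, virtual block) pair is written; later writes reuse
# the existing mapping. A full pool fails every new allocation, which is why
# pool exhaustion freezes all thin volumes at once.
class ThinPool:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # free physical blocks
        self.btree = {}                        # (dev_id, vblock) -> pblock

    def write(self, dev_id, vblock):
        key = (dev_id, vblock)
        if key not in self.btree:              # first write: allocate
            if not self.free:
                raise IOError("thin pool exhausted: writes now fail")
            self.btree[key] = self.free.pop(0)
        return self.btree[key]                 # physical block to write

pool = ThinPool(total_blocks=4)
print(pool.write(dev_id=1, vblock=100))  # 0  (allocated on demand)
print(pool.write(dev_id=1, vblock=100))  # 0  (already mapped, no allocation)
print(pool.write(dev_id=2, vblock=100))  # 1  (different thin device, new block)
```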
Thin snapshots are fundamentally different from classic LVM snapshots:
# Create a thin snapshot (instant, O(1) operation)
lvcreate --snapshot --name snap1 vg0/thin_vol1
# The snapshot shares the B-tree with the origin
# Writing to the origin allocates NEW blocks for the origin
# The snapshot keeps pointing to the OLD blocks
# No read-copy-write overhead on the origin
This is the key advantage: thin snapshots do not penalize origin writes. The origin gets new blocks; the snapshot retains references to old blocks via the shared B-tree with reference counting. Compare this to classic dm-snapshot where every origin write triggers a read-copy-write sequence.
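The sharing mechanism can be sketched with refcounted blocks. This illustrative model (not kernel code) shows the crucial property: a write to a shared block gives the writer a fresh block, so the origin never performs a read-copy-write.

```python
# Sketch of thin snapshots: the snapshot copies the origin's mapping and
# bumps refcounts. Writing to a shared block allocates a NEW block for the
# writer; the other volume keeps pointing at the old one.
class Pool:
    def __init__(self, n):
        self.free, self.refs, self.data = list(range(n)), {}, {}
    def alloc(self):
        pb = self.free.pop(0)
        self.refs[pb] = 1
        return pb

class ThinVolume:
    def __init__(self, pool):
        self.pool, self.map = pool, {}

    def snapshot(self):
        snap = ThinVolume(self.pool)
        snap.map = dict(self.map)              # share mappings...
        for pb in self.map.values():
            self.pool.refs[pb] += 1            # ...via refcounts
        return snap

    def write(self, vblock, data):
        pb = self.map.get(vblock)
        if pb is None or self.pool.refs[pb] > 1:   # unmapped or shared
            if pb is not None:
                self.pool.refs[pb] -= 1            # drop our share of the old block
            pb = self.pool.alloc()                 # writer gets a NEW block
            self.map[vblock] = pb
        self.pool.data[pb] = data

pool = Pool(8)
origin = ThinVolume(pool)
origin.write(0, "v1")
snap = origin.snapshot()
origin.write(0, "v2")                  # origin moves to a new block
print(pool.data[snap.map[0]])          # v1  (snapshot still sees old data)
print(pool.data[origin.map[0]])        # v2
```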
The Snapshot Latency Problem
Back to the original problem: why does creating a classic LVM snapshot cause write latency to spike on the origin volume?
The dm-snapshot target uses a copy-on-write strategy. When a snapshot exists, the kernel must preserve the snapshot's view of the original data. Before any write to the origin can proceed, the old data at that location must be saved.
The sequence for every origin write:
- Read the original block from the origin device.
- Write the original block to the snapshot's COW area.
- Write the new data to the origin device.
Three I/O operations instead of one. Write amplification of 3x. The read in step 1 and the write in step 2 are synchronous -- the origin write in step 3 cannot proceed until the old data is safely stored in the COW area.
For a database doing 5,000 write IOPS, the snapshot adds 10,000 extra IOPS on the underlying device: 5,000 reads of old blocks and 5,000 writes to the COW area, for 15,000 total. If the underlying SSD can handle 20,000 IOPS, its ceiling for application writes drops from 20,000 to about 6,600 (20,000 / 3), so the 5,000 IOPS workload now runs at three quarters of the device's capacity instead of one quarter. Latency rises accordingly as the device approaches saturation.
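The arithmetic above, as a function (the 3x factor is the read-copy-write sequence: read old block, write it to the COW area, write new data):

```python
# With a classic dm-snapshot active, every origin write costs 3 I/Os on the
# underlying device, so the device's IOPS budget caps application write
# throughput at one third of its raw capability.
def snapshot_iops(app_write_iops, device_max_iops):
    underlying = app_write_iops * 3   # read old + write COW + write new
    ceiling = device_max_iops // 3    # max sustainable app write IOPS
    return underlying, ceiling

underlying, ceiling = snapshot_iops(app_write_iops=5000, device_max_iops=20000)
print(underlying)  # 15000 I/Os hitting the physical device
print(ceiling)     # 6666  app write IOPS ceiling while the snapshot exists
```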
# Observe the overhead directly
# Before snapshot:
iostat -xm 1 /dev/dm-0
# w/s: 5000, wMB/s: 40, w_await: 0.5ms
# Create snapshot:
lvcreate --snapshot --size 10G --name snap_backup /dev/vg0/lv_data
# After snapshot:
iostat -xm 1 /dev/dm-0
# w/s: 5000 (app writes), but underlying device shows:
iostat -xm 1 /dev/sda
# r/s: 5000 (COW reads), w/s: 10000 (COW writes + origin writes)
# w_await: 8-15ms (device approaching saturation)
The fix is to minimize snapshot lifetime (create, backup, remove immediately) or migrate to thin snapshots that avoid the read-copy-write penalty entirely.
Practical Debugging
When I/O latency is unexplained on an LVM setup, the debugging path follows the device mapper stack from top to bottom:
# Step 1: Map the full dm stack
dmsetup ls --tree
lsblk -f
# Step 2: Check for active snapshots (the most common hidden culprit)
lvs -a -o+snap_percent,origin
dmsetup status | grep snapshot
# Step 3: Compare I/O stats between dm device and physical device
# If physical device shows 3x the IOPS of the dm device, a snapshot is active
iostat -xm 1
# Step 4: Check thin pool health (if using thin provisioning)
lvs -o+data_percent,metadata_percent
# data_percent > 90% is a critical alert
# Step 5: Trace individual BIOs through the stack
blktrace -d /dev/dm-0 -o trace
blkparse -i trace -d trace.bin
btt -i trace.bin
# btt output shows per-stage latency: Q2C (total), D2C (device), Q2D (queuing)
Common Questions
Why not just use thin snapshots for everything?
dm-thin has its own costs. The B-tree metadata updates add latency to first-writes. The thin pool itself is a single-threaded bottleneck for metadata operations. For simple one-off snapshots on volumes with low write rates, classic dm-snapshot is simpler to set up and reason about. But for any workload with sustained writes during the snapshot lifetime, thin snapshots are strictly better.
How does pvmove work without unmounting?
pvmove creates a temporary dm-mirror target. The origin segments are mirrored to the destination PV. As the mirror syncs, new writes go to both locations. Once synced, the mapping table is atomically swapped to point to the new location. The filesystem never notices because the dm device number stays the same -- only the underlying mapping changes. This is device mapper's killer feature: live table swapping.
What happens when dm-crypt and dm-thin are stacked?
Each layer adds one BIO transformation. A write to an encrypted thin volume goes through: filesystem -> dm-thin (block allocation) -> dm-crypt (encryption) -> physical device. The BIO is cloned and transformed at each layer. Stacking order matters: encrypt-under-thin means the thin pool stores ciphertext and metadata operations are fast. Encrypt-over-thin means the thin pool stores plaintext and metadata itself is unencrypted. Most setups use LUKS on the physical device with LVM/thin on top (encrypt-under-thin).
How does the block layer decide when to merge BIOs?
BIO merging happens in blk_mq_submit_bio(). The block layer checks the plug list (a per-task batch of pending BIOs) and the scheduler queues for requests adjacent to the new BIO. If the new BIO's sector range is contiguous with an existing request and the combined size does not exceed the device's max_sectors_kb, the BIOs are merged. Sequential writes from a database commit are almost always merged. Random 4KB writes from a COW snapshot are almost never merged, which is another reason snapshot overhead is disproportionately expensive.
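The merge test reduces to two conditions, sketched below. This is a simplified model with illustrative names; the real logic lives in blk_mq_submit_bio() and the merge helpers it calls.

```python
# A new BIO back-merges into a pending request when it starts exactly where
# the request ends and the combined size stays within the device limit.
def can_back_merge(req_sector, req_sectors, bio_sector, bio_sectors,
                   max_sectors=2560):  # e.g. 1280 KiB at 512 B/sector
    contiguous = bio_sector == req_sector + req_sectors
    within_limit = req_sectors + bio_sectors <= max_sectors
    return contiguous and within_limit

# Sequential 128 KiB writes (256 sectors each) merge:
print(can_back_merge(0, 256, 256, 256))    # True
# Random 4 KiB COW writes land nowhere near each other and do not:
print(can_back_merge(0, 8, 4096, 8))       # False
```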
How Technologies Use This
A container host running Docker with the devicemapper storage driver stores 80 container images and 40 running containers on a single 500 GB block device. Each container gets an independent writable layer without copying the full image. The dm-thin (thin provisioning) device mapper target makes this possible by mapping virtual block addresses to a shared pool of physical blocks using a B-tree metadata structure.
When Docker pulls a 2 GB image, dm-thin creates a thin volume backed by the shared pool and writes the image layers into it. When a container starts, dm-thin creates a snapshot of that thin volume using block-level copy-on-write. The snapshot shares all physical blocks with the base image. Only when the container writes to a block does dm-thin allocate a new physical block from the pool and copy the original data before applying the write. This means 40 containers based on the same 2 GB image consume only the 2 GB base plus whatever each container has actually modified, typically 50-200 MB per container.
The operational risk is pool exhaustion. If the shared pool fills to 100%, all thin volumes freeze with I/O errors and every container on the host stops responding. Monitoring pool utilization with `dmsetup status` and setting up thin_check alerts at 80% usage is essential. Docker has largely moved to overlayfs as the default storage driver, but the devicemapper backend remains in use on hosts running RHEL 7 or older kernels without overlayfs support.
A DBA needs to back up a 500 GB PostgreSQL database without downtime. Running pg_dump holds locks and takes 4 hours on this dataset. LVM snapshots provide a block-level alternative: freeze the filesystem for a fraction of a second, snapshot the logical volume, unfreeze, then run the backup from the snapshot while the live database continues serving writes.
Running `lvcreate --snapshot --size 20G --name pg_snap /dev/vg0/pgdata` creates a new logical volume backed by a COW (copy-on-write) device mapper target. The snapshot initially shares every physical block with the origin volume. When PostgreSQL writes to a block on the origin, the device mapper intercepts the BIO, reads the old data from the origin, copies it into the snapshot's COW reservation area, and then allows the original write to proceed. Reads from the snapshot for unchanged blocks pass through directly to the origin without any copy.
The COW overhead is proportional to the origin's write rate, not the total volume size. A 500 GB database experiencing 5 GB of writes during a 2-hour backup window needs roughly 5 GB of snapshot COW space. The critical operational concern is that if the COW area fills completely, the kernel invalidates and drops the snapshot automatically. Allocating 2-3x the expected write volume and monitoring usage with `lvs -o+snap_percent` prevents data loss during long-running backups.
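The sizing rule above can be captured in a small helper. This is a hypothetical convenience function, not an LVM tool; the 3x default mirrors the 2-3x safety margin recommended here.

```python
# COW space scales with write volume during the backup window, not with the
# volume size. Pad by a safety factor so the snapshot is never invalidated.
def cow_size_gb(write_rate_gb_per_hour, backup_hours, safety_factor=3):
    return write_rate_gb_per_hour * backup_hours * safety_factor

# 2.5 GB/h of writes over a 2-hour backup window, 3x safety margin:
print(cow_size_gb(2.5, 2))  # 15.0  -> round up, e.g. lvcreate --size 16G
```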
A Redis instance holding 20 GB of data runs on a server with LUKS full-disk encryption. Every 60 seconds, Redis triggers a BGSAVE that forks a child process to write the RDB snapshot file. The child writes the entire 20 GB dataset sequentially to disk. Every byte of that write passes through dm-crypt, the device mapper target that handles LUKS encryption, before reaching the NVMe controller.
dm-crypt intercepts each write BIO from the filesystem, encrypts the data using AES-XTS-256 via the kernel crypto API, and submits the encrypted BIO to the underlying block device. On CPUs with AES-NI hardware acceleration, encryption throughput exceeds 2 GB/s, so the 20 GB RDB write adds roughly 10 seconds of CPU overhead for the encryption pass. Without AES-NI, software AES drops throughput to 300-500 MB/s, and the same RDB persistence takes 40-65 seconds of additional CPU time purely for encryption.
The latency impact is most visible on read-heavy workloads during BGSAVE. The child process performing sequential writes through dm-crypt competes for CPU cycles on the encryption workqueue, potentially increasing p99 read latency on the parent Redis process by 1-3ms during the snapshot window. dm-crypt's workqueue placement can be tuned with cryptsetup's performance flags (--perf-same_cpu_crypt and --perf-submit_from_crypt_cpus, available in cryptsetup 2.x) to reduce this interference. Checking `cryptsetup status root_crypt` shows the active cipher and key size; grep aes /proc/cpuinfo confirms whether AES-NI is available.
Same Concept Across Tech
| Technology | How it uses Device Mapper | Key gotcha |
|---|---|---|
| LVM | dm-linear and dm-striped map logical extents to physical extents across one or more PVs | Snapshot COW overhead is 3x write amplification. Prefer thin snapshots for write-heavy workloads |
| LUKS/dm-crypt | dm-crypt target encrypts/decrypts every BIO with AES-XTS-256 via the kernel crypto API | Without AES-NI hardware acceleration, throughput drops by 5-10x. Check /proc/cpuinfo for aes flag |
| Docker (devicemapper) | dm-thin provides COW layers for images and containers. Each layer is a thin snapshot | Loopback mode is catastrophically slow. Direct-lvm mode is required for production. overlay2 is now preferred |
| Kubernetes (CSI) | LVM-based CSI drivers (TopoLVM, OpenEBS) create LVs as persistent volumes | Thin pool monitoring is critical. Pod I/O failures from pool exhaustion are hard to diagnose without LVM metrics |
| Database backups | LVM snapshots provide point-in-time block-level copies for filesystem backups | Snapshot lifetime must be minimized. Every second the snapshot exists, origin writes pay the COW penalty |
Stack layer mapping (I/O latency spike during LVM snapshot):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the database write rate unusually high during the backup window? | Database write throughput metrics |
| Filesystem | Is the filesystem doing excessive journaling or metadata writes? | iostat -x on the dm device |
| Device Mapper | Is a snapshot active, causing COW write amplification? | dmsetup status, lvs -o+snap_percent |
| Block Layer | Are I/O requests being merged efficiently, or is COW creating small random I/O? | /sys/block/dm-X/stat, blktrace |
| Physical Device | Is the underlying disk saturated from the 3x write amplification? | iostat -x on the physical device |
Design Rationale
The block layer needed a generic mechanism to transform I/O between filesystem and disk without either layer knowing about the transformation. Device mapper solved this by introducing a pluggable target framework: each target registers map and end_io callbacks that intercept BIOs in flight. dm-linear remaps sectors for LVM. dm-crypt encrypts payloads for LUKS. dm-thin allocates blocks on demand. The framework is composable -- targets stack on top of each other, each seeing only BIOs from the layer above. This separation of concerns is why a single pvmove command can migrate data between physical disks while the filesystem stays mounted and applications keep running.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Write latency doubles or triples after snapshot creation | Classic LVM snapshot COW overhead (read-copy-write per origin write) | dmsetup status to confirm snapshot exists, lvs -o+snap_percent |
| Thin volume hangs with I/O errors | Thin pool is 100% full, no free blocks to allocate | lvs -o+data_percent, extend pool with lvextend |
| dm-crypt throughput below 500 MB/s | Missing AES-NI hardware acceleration | grep aes /proc/cpuinfo, check VM CPU flags |
| Snapshot silently disappears | COW area filled up, kernel invalidated the snapshot | Check dmesg for snapshot overflow messages, increase COW size |
| Unexpected disk I/O on idle system | dm-thin metadata commit or LVM polling daemon (lvmpolld) | blktrace on the thin-pool metadata device |
| lvcreate --snapshot takes seconds | Large origin volume requires chunk mapping initialization | Use thin snapshots (lvcreate --thin) for instant O(1) snapshots |
When to Use / Avoid
Relevant when:
- Setting up encrypted storage (LUKS/dm-crypt) and need to understand the performance implications
- Managing LVM logical volumes and need to understand snapshot overhead
- Debugging unexpected I/O latency on LVM volumes, especially after snapshot creation
- Working with container storage drivers that use devicemapper (legacy Docker setups)
Watch out for:
- Classic LVM snapshots cause 3x write amplification on the origin volume for the snapshot's entire lifetime
- Thin pool exhaustion freezes all thin volumes with no graceful degradation
- Device mapper stacking adds per-layer latency (5-20 microseconds per layer)
Try It Yourself
# Display the full device mapper tree
dmsetup ls --tree

# Show mapping table for all dm devices
dmsetup table

# Show runtime status including snapshot validity and thin pool usage
dmsetup status

# List block devices with topology info
lsblk -f -t

# Check LVM snapshot and thin pool utilization
lvs -a -o+devices,data_percent,snap_percent,pool_lv

# Benchmark dm-crypt cipher performance
cryptsetup benchmark

# Monitor block device I/O in real time (dm and physical)
iostat -xm 1 /dev/dm-0 /dev/sda

# Trace block I/O events on a device mapper device
blktrace -d /dev/dm-0 -o - | blkparse -i -

# Check thin pool autoextend configuration
grep -E "thin_pool_autoextend" /etc/lvm/lvm.conf

# Show dm device dependencies (stacking order)
dmsetup deps
Debug Checklist
1. Check device mapper tree: dmsetup ls --tree
2. Show mapping tables: dmsetup table
3. Check snapshot COW usage: lvs -o+snap_percent
4. Monitor thin pool utilization: lvs -o+data_percent
5. Check dm-crypt performance: cryptsetup benchmark
6. Watch block layer latency: iostat -x 1 on both the dm device and the underlying physical device
7. Verify AES-NI support: grep aes /proc/cpuinfo
8. Check snapshot validity: dmsetup status | grep snapshot
Key Takeaways
- ✓Device mapper operates at the BIO level, not the filesystem level. It sees sector ranges and raw bytes, not files or directories. This is why dm-crypt can encrypt any filesystem (ext4, XFS, Btrfs) without knowing anything about the filesystem's internal structure.
- ✓The block layer merges adjacent BIOs into larger requests before submitting them to the device driver. This merging is critical for spinning disks (fewer seeks) and still beneficial for SSDs (fewer NVMe commands). The I/O scheduler sits between BIO submission and driver dispatch, reordering for locality or fairness.
- ✓LVM snapshots use a copy-on-write mechanism at the block level. Every write to the origin triggers a read-copy-write sequence to preserve the old data in the snapshot COW area. This write amplification is 3x at minimum. For write-heavy workloads, this overhead is severe enough that LVM thin snapshots (dm-thin) are the preferred alternative because they handle COW more efficiently with B-tree metadata.
- ✓dm-thin maintains a B-tree mapping from virtual blocks to physical blocks. Snapshots in dm-thin share the B-tree and use reference counting on physical blocks. A write to a shared block allocates a new physical block and updates only the writing volume's mapping. No read-copy-write sequence is needed for the origin. This is fundamentally more efficient than classic dm-snapshot.
- ✓Device mapper tables can be stacked. A typical encrypted LVM setup stacks dm-linear (LVM striping or concatenation) on top of dm-crypt on top of the physical device. Each layer adds a BIO transformation. dmsetup deps shows the dependency tree. Deep stacking adds latency per layer, typically 5-20 microseconds each.
Common Pitfalls
- ✗Running an LVM thin pool to 100% utilization. When the pool has no free blocks, all thin volumes backed by that pool freeze with I/O errors. Unlike a full filesystem that returns ENOSPC, a full thin pool causes every write BIO to hang or error. Set up monitoring with lvs -o+data_percent and configure autoextend in /etc/lvm/lvm.conf (thin_pool_autoextend_threshold and thin_pool_autoextend_percent).
- ✗Using classic LVM snapshots (dm-snapshot) on write-heavy databases. Every origin write pays a 3x I/O penalty (read old block, write old block to COW, write new block). On a database doing 10,000 IOPS, the snapshot adds 20,000 extra IOPS to the underlying device. Migrate to LVM thin snapshots or use filesystem-level snapshots (Btrfs, ZFS) that handle COW more efficiently.
- ✗Forgetting to size the snapshot COW area properly. If the COW area fills up, the snapshot is silently invalidated and dropped. Any backup process reading from the snapshot gets I/O errors. Always allocate 2-3x the expected write volume during the snapshot's lifetime and monitor usage with lvs -o+snap_percent.
- ✗Assuming dm-crypt has negligible overhead on all hardware. Without AES-NI (hardware AES acceleration), dm-crypt throughput drops from 2+ GB/s to 200-400 MB/s. Check for AES-NI support with grep aes /proc/cpuinfo. On VMs, ensure the hypervisor exposes AES-NI to guests. Older ARM servers without crypto extensions also suffer significant dm-crypt overhead.
Reference
In One Line
Device mapper intercepts block I/O between filesystem and disk, enabling LVM remapping, LUKS encryption, and thin provisioning -- but classic LVM snapshots impose a 3x write penalty that silently degrades database performance.