Linux Security Modules (LSM) Framework

Mental Model

A courthouse where every visitor passes through a series of security checkpoints arranged in a fixed corridor. The first checkpoint (DAC) checks a government-issued ID. If the ID is invalid, the visitor is turned away immediately and never reaches the next checkpoint. If the ID passes, the visitor proceeds to a line of additional guards (LSMs), each operating independently with a different rulebook. One guard checks badge labels (SELinux). Another checks a list of approved corridors (AppArmor). A third runs a custom screening program that was loaded that morning (BPF LSM). The visitor only reaches the courtroom if every single guard waves them through. A single raised hand from any guard sends them back.

The Problem

A containerized application receives "Permission denied" errors when reading configuration files mounted from the host. Standard file permissions are correct (644, owned by the right UID). ACLs are not in play. The container runs as the expected user. Yet open() returns EACCES. The standard debugging path -- checking ownership, group membership, and permission bits -- finds nothing wrong. The actual cause is an LSM hook firing after the DAC check passes, with a security module denying access based on a label mismatch, path restriction, or BPF policy that never appears in ls -l output. Without understanding that the kernel runs a second, independent authorization layer at over 200 hook points, the denial looks impossible.

Architecture

A containerized process calls open("/etc/app/config.yaml", O_RDONLY). The file exists, permissions are 644, the process UID matches the owner. DAC says allow. But open() returns -EACCES.

The administrator runs ls -l, confirms the permissions, checks group membership, verifies no ACL is in play. Everything looks correct. The standard Unix permission model has no explanation for this denial.

The answer is on a different layer entirely. After DAC passes, the kernel hits an LSM hook -- security_file_open() -- which iterates through the registered Linux Security Modules. One of them returns -EACCES. The file never opens. The application sees only "Permission denied."

This is the LSM framework. It is the mechanism that makes SELinux, AppArmor, Smack, TOMOYO, and BPF LSM possible. And understanding it is the difference between staring at correct file permissions for hours and checking cat /sys/kernel/security/lsm in 5 seconds.

What Actually Happens

Every security-sensitive kernel operation passes through an LSM hook. There are over 200 of them, embedded in the VFS layer, network stack, process management, IPC, and capability checks. Here is the sequence for a file open:

Userspace calls open("/etc/app/config.yaml", O_RDONLY).
The kernel enters sys_openat(), which calls into the VFS path resolution code.
VFS calls inode_permission() to run the DAC check -- uid, gid, mode bits, ACLs.
If DAC denies, return -EACCES immediately. LSM hooks never fire.
If DAC passes, VFS calls security_file_open().
security_file_open() walks the security_hook_list for this hook point.
Each registered LSM's callback is invoked in registration order.
If any LSM returns a non-zero value (deny), the operation is blocked.
Only if every LSM returns zero (allow) does the file actually open.

The critical insight: LSMs can only deny. They cannot grant access that DAC rejected. They are a second gate, not an alternative path.
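The ordering above can be sketched in user-space C. This is a toy model, not kernel code: simulated_open(), dac_check(), and the lsm_allow/lsm_deny callbacks are hypothetical names, and the DAC check is reduced to a single mode bit. The counter exists only to show that no LSM callback runs when DAC has already denied.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one LSM's file_open callback. */
typedef int (*lsm_file_open_fn)(const char *path);

/* Counts LSM callbacks actually invoked, to show when the chain is skipped. */
static int lsm_calls;

static int lsm_allow(const char *path) { (void)path; lsm_calls++; return 0; }
static int lsm_deny(const char *path)  { (void)path; lsm_calls++; return -EACCES; }

/* Simplified DAC: only the "other" read bit of the mode is consulted. */
static int dac_check(unsigned mode)
{
    return (mode & 0004) ? 0 : -EACCES;
}

/* Model of the open path: DAC first, then each LSM until one denies. */
static int simulated_open(const char *path, unsigned mode,
                          const lsm_file_open_fn *lsms, size_t n)
{
    int rc = dac_check(mode);

    if (rc)
        return rc;              /* DAC denied: LSM hooks never fire */
    for (size_t i = 0; i < n; i++) {
        rc = lsms[i](path);
        if (rc)
            return rc;          /* first LSM denial blocks the open */
    }
    return 0;                   /* DAC and every LSM allowed */
}
```

Calling simulated_open() with mode 0644 and a chain containing lsm_deny reproduces the scenario from the introduction: DAC passes and the call still fails with -EACCES.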

The Hook Architecture

LSM hooks are defined in include/linux/lsm_hook_defs.h. Each hook is declared with a macro that expands to a function pointer type. The framework maintains a security_hook_heads structure where each member is a list head (an hlist_head) pointing to a chain of hook callbacks from registered modules.

When the kernel boots, each LSM (SELinux, AppArmor, etc.) calls security_add_hooks() during initialization, appending its callbacks to the relevant lists. At runtime, a hook invocation like security_inode_permission() calls call_int_hook(inode_permission, inode, mask), which iterates the list and returns the first non-zero result (or zero if all allow).
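A minimal user-space sketch of the registration and dispatch pattern described here. The names add_hook(), call_hooks(), struct hook_entry, and the two toy policies are illustrative assumptions; the kernel's own structures and macros differ in detail. What the sketch preserves is the behavior: callbacks are appended in registration order, and dispatch returns the first non-zero result.

```c
#include <errno.h>
#include <stddef.h>

/* One entry in a hook chain: callback plus the name of the owning LSM. */
struct hook_entry {
    int (*fn)(int mask);
    const char *lsm;
    struct hook_entry *next;
};

/* Head of the chain for a single hook point (a stand-in for one member
 * of security_hook_heads). */
struct hook_head {
    struct hook_entry *first, *last;
};

/* Append at the tail so that call order equals registration order. */
static void add_hook(struct hook_head *head, struct hook_entry *e)
{
    e->next = NULL;
    if (head->last)
        head->last->next = e;
    else
        head->first = e;
    head->last = e;
}

/* Return the first non-zero result, or 0 if every callback allows.
 * *denied_by, if non-NULL, receives the name of the denying LSM. */
static int call_hooks(const struct hook_head *head, int mask,
                      const char **denied_by)
{
    for (const struct hook_entry *e = head->first; e; e = e->next) {
        int rc = e->fn(mask);
        if (rc) {
            if (denied_by)
                *denied_by = e->lsm;
            return rc;
        }
    }
    return 0;
}

/* Two toy policies: one denies write requests, one allows everything. */
#define TOY_MAY_WRITE 2
static int toy_readonly(int mask) { return (mask & TOY_MAY_WRITE) ? -EACCES : 0; }
static int toy_permit(int mask)   { (void)mask; return 0; }
```

Recording which entry denied is not something the kernel dispatch does; it is added here because identifying the denying module is the first debugging step.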

The major hook categories:

VFS hooks -- Gate every file system operation.

security_inode_permission() -- permission check on inode access
security_file_open() -- called during file open, after path resolution
security_inode_create() -- before creating a new inode
security_inode_unlink() -- before deleting a file
security_inode_rename() -- before renaming
security_mmap_file() -- before memory-mapping a file (named security_file_mmap() in older kernels)

Network hooks -- Gate socket operations.

security_socket_create() -- before creating a socket
security_socket_connect() -- before connect()
security_socket_bind() -- before bind()
security_socket_sendmsg() -- before sending data

Process lifecycle hooks -- Gate process creation and signaling.

security_task_alloc() -- during fork/clone, before the new task runs
security_bprm_check() -- during exec, before the new binary runs
security_task_kill() -- before delivering a signal
security_cred_prepare() -- before preparing new credentials

BPF hooks -- Gate BPF operations themselves.

security_bpf() -- before bpf() syscall operations
security_bpf_map() -- before accessing a BPF map
security_bpf_prog() -- before loading a BPF program

The security_struct Blob

Every major kernel object needs to store per-LSM state. SELinux needs a pointer to the security context. AppArmor needs a reference to the confining profile. Smack needs a label string.

Before kernel 5.1, each object had a single void *security pointer, managed by whichever major LSM was active. Only one LSM could use it. This was the fundamental barrier to stacking.

Kernel 5.1 introduced lsm_blob_sizes. During initialization, each LSM declares how many bytes it needs per object type:

/* From SELinux's initialization */
static struct lsm_blob_sizes selinux_blob_sizes = {
    .lbs_cred   = sizeof(struct task_security_struct),
    .lbs_file   = sizeof(struct file_security_struct),
    .lbs_inode  = sizeof(struct inode_security_struct),
    .lbs_ipc    = sizeof(struct ipc_security_struct),
    .lbs_msg_msg = sizeof(struct msg_security_struct),
};

The framework sums the sizes from all registered LSMs and allocates a single contiguous blob per object. Each LSM gets a fixed offset. SELinux accesses bytes 0-47, AppArmor accesses bytes 48-63, BPF LSM accesses bytes 64-71 (example offsets). No pointer chasing, no dynamic dispatch -- just a fixed offset into a flat buffer.
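The offset arithmetic can be illustrated with a short user-space sketch. The struct blob_request type and the functions assign_offsets() and lsm_slice() are hypothetical, and pointer-size alignment is an assumption made for the example. With requests of 48, 16, and 8 bytes it produces the example offsets quoted above (0, 48, 64) and a 72-byte blob.

```c
#include <stddef.h>

/* Per-LSM request for one object type. On input, `size` is the number of
 * bytes wanted; `offset` is filled in by assign_offsets(). */
struct blob_request {
    const char *lsm;
    size_t size;
    size_t offset;
};

/* Round n up to a multiple of the pointer size (an assumed alignment). */
static size_t align_up(size_t n)
{
    size_t a = sizeof(void *);

    return (n + a - 1) / a * a;
}

/* Give each LSM a fixed offset; return the total blob size to allocate. */
static size_t assign_offsets(struct blob_request *reqs, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        reqs[i].offset = total;
        total += align_up(reqs[i].size);
    }
    return total;
}

/* An LSM finds its slice by adding its offset to the object's blob. */
static void *lsm_slice(void *blob, const struct blob_request *req)
{
    return (char *)blob + req->offset;
}
```

Because offsets are fixed once at initialization, every later access is a single pointer addition, which is the property the paragraph above describes.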

BPF LSM: Runtime-Programmable Security

Traditional LSMs are compiled into the kernel image and initialized at boot; only their policy can change afterwards. BPF LSM (kernel 5.7+, CONFIG_BPF_LSM=y) changes this by allowing eBPF programs to attach to any LSM hook at runtime.

A BPF LSM program:

Is written in C, compiled to BPF bytecode with clang.
Declares SEC("lsm/hook_name") to specify which hook to attach to.
Receives the same arguments as the kernel hook function.
Returns 0 to allow or a negative errno to deny.
Can read kernel data structures via BPF CO-RE (Compile Once, Run Everywhere).
Can use BPF maps for configuration, state tracking, and communication with userspace.

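The BPF verifier rejects any program it cannot prove safe: unbounded loops and out-of-bounds memory accesses fail at load time, which is intended to keep a buggy policy from crashing the kernel. Where the BPF JIT is enabled, the program is compiled to native machine code for performance.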

To enable BPF LSM, the lsm= boot parameter must include bpf:

# Check current LSM list
cat /sys/kernel/security/lsm
# Output: lockdown,capability,yama,selinux

# Add bpf to the list via GRUB
# In /etc/default/grub:
# GRUB_CMDLINE_LINUX="lsm=lockdown,capability,yama,selinux,bpf"
# Then: grub2-mkconfig -o /boot/grub2/grub.cfg && reboot

After reboot, verify:

cat /sys/kernel/security/lsm
# Output: lockdown,capability,yama,selinux,bpf
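The same check can be done programmatically. The sketch below is one possible approach, with hypothetical helper names lsm_list_contains() and bpf_lsm_active(); it matches whole comma-separated entries so that a name merely containing "bpf" does not count.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `name` appears as a whole entry in a comma-separated LSM list. */
static bool lsm_list_contains(const char *list, const char *name)
{
    size_t len = strlen(name);
    const char *p = list;

    while (*p) {
        size_t tok = strcspn(p, ",\n");     /* length of current entry */

        if (tok == len && strncmp(p, name, len) == 0)
            return true;
        p += tok;
        if (*p)                             /* skip the separator */
            p++;
    }
    return false;
}

/* Read the active list from securityfs.
 * Returns 1 if "bpf" is listed, 0 if not, -1 if the file is unreadable. */
static int bpf_lsm_active(void)
{
    char buf[256];
    FILE *f = fopen("/sys/kernel/security/lsm", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return lsm_list_contains(buf, "bpf") ? 1 : 0;
}
```

A loader could call bpf_lsm_active() before attaching programs and print a hint about the lsm= parameter when it returns 0.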

LSM Stacking

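Before kernel 5.1, only one "major" LSM could be active (SELinux OR AppArmor, not both). Minor LSMs (capability, yama, loadpin, lockdown) could always stack because they used no per-object storage.

From 5.1 onward, the blob mechanism lets LSMs that keep per-object state stack with each other. In mainline kernels SELinux, AppArmor, and Smack are still flagged as mutually exclusive, so the practical combination is one of them plus additional LSMs such as BPF LSM. The lsm= boot parameter controls which LSMs load and in what order: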

# RHEL with SELinux + BPF LSM
lsm=lockdown,capability,yama,selinux,bpf

# Ubuntu with AppArmor + BPF LSM
lsm=lockdown,capability,yama,apparmor,bpf

When a hook fires, the LSMs in the list are called in order until one of them denies; if none does, the operation proceeds. The first deny wins. This means:

SELinux can provide baseline MAC enforcement.
BPF LSM can add application-specific rules on top.
Both evaluate independently on the same hook invocation.
Neither LSM needs awareness of the other.

Under the Hood

Hook evaluation cost. Each hook invocation walks a linked list. With three LSMs registered for security_file_open, that is three function pointer calls. SELinux's AVC cache resolves most checks in roughly 100 nanoseconds, and AppArmor's path matching is in the same range. BPF LSM overhead depends on program complexity but is typically under 500 nanoseconds. These are order-of-magnitude figures, not benchmarks: at a few hundred nanoseconds per open, a workload doing one million opens per second spends a few tenths of a CPU-second per second in LSM hooks, which amounts to a few percent of a multi-core host.

The call_int_hook macro. This is the core dispatch mechanism in security/security.c. It iterates security_hook_heads.hook_name, calling each callback. On the first non-zero return, it short-circuits and returns that value. If all return zero, the final result is zero (allow).

Ordering matters. The lsm= parameter determines evaluation order. Placing a fast-failing LSM first (one that denies most checks early) reduces total hook evaluation time. In practice, the capability LSM goes first because it is the lightest, followed by the MAC LSM, followed by BPF LSM.

init_lsmid and registration. Each LSM has an lsm_id used for ordering and blob offset calculation. The framework calls ordered_lsm_init() during boot, which walks the lsm= parameter and initializes each LSM in order. LSMs that are not in the list are skipped entirely.

Common Questions

How does the kernel know which LSMs to activate?

The lsm= kernel command-line parameter (set in the bootloader configuration) lists the LSMs to load, in order. If lsm= is not specified, the kernel uses CONFIG_LSM from the build configuration as the default. During boot, ordered_lsm_init() iterates the list, calls each LSM's init function, and registers its hooks.
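A toy model of that selection step is sketched below. toy_ordered_init() and struct toy_lsm are hypothetical; the real ordered_lsm_init() also handles details this sketch ignores, such as LSMs that are always placed first and mutually exclusive LSMs.

```c
#include <stddef.h>
#include <string.h>

/* A built-in LSM known to this toy kernel. `enabled` records whether it
 * was selected, and `order` its position (1-based) in the init sequence. */
struct toy_lsm {
    const char *name;
    int enabled;
    int order;
};

/* Walk a comma-separated list and enable matching built-in LSMs in list
 * order. Unknown names and repeated names are ignored; LSMs that are not
 * listed stay disabled. Returns the number of LSMs enabled. */
static int toy_ordered_init(const char *list, struct toy_lsm *known, size_t n)
{
    int next = 0;

    for (size_t i = 0; i < n; i++)
        known[i].enabled = known[i].order = 0;

    for (const char *p = list; *p; ) {
        size_t tok = strcspn(p, ",");

        for (size_t i = 0; i < n; i++) {
            if (strlen(known[i].name) == tok &&
                strncmp(known[i].name, p, tok) == 0 && !known[i].enabled) {
                known[i].enabled = 1;
                known[i].order = ++next;
            }
        }
        p += tok;
        if (*p == ',')
            p++;
    }
    return next;
}
```

Passing "capability,selinux,bpf" to a table that also contains apparmor leaves apparmor disabled, mirroring how an LSM compiled into the kernel but absent from the list never registers hooks.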

Can an LSM be loaded or unloaded at runtime?

Traditional LSMs (SELinux, AppArmor) cannot be unloaded after boot. They register hooks during kernel initialization, and the framework provides no mechanism to safely remove them. BPF LSM programs are the exception: they can be attached and detached at runtime because the BPF subsystem manages their lifecycle separately. Detaching a BPF LSM program removes it from the hook chain immediately.

What happens when two stacked LSMs conflict?

There is no conflict resolution. If SELinux allows and BPF LSM denies, the operation is denied. If BPF LSM allows and SELinux denies, the operation is denied. The semantics are strictly conjunctive: every LSM must allow for the operation to proceed. This means stacking can only make a system more restrictive, never more permissive.

How does BPF LSM interact with SELinux on the same hook?

They are independent. On a security_file_open call, SELinux checks its AVC based on type enforcement labels. BPF LSM runs its eBPF program, which might check cgroup ID, PID, or file properties. Neither knows the other exists. They run in lsm= order, so with SELinux listed first the BPF program runs only if SELinux allowed. Both must allow.

How Technologies Use This

Docker

A production Docker host runs 120 containers across 15 microservices. Without mandatory access control, a container escape exploit that gains root inside a container also gains root-level access to the host filesystem, network interfaces, and other containers. In 2019, the CVE-2019-5736 runc vulnerability demonstrated exactly this scenario, allowing a malicious container to overwrite the host runc binary.

When SELinux is enabled on the Docker host, the container runtime assigns each container process the svirt_lxc_net_t SELinux type. The LSM framework intercepts every file open, socket connect, and process execution through hooks like security_file_open() and security_socket_connect(). SELinux policy rules restrict svirt_lxc_net_t processes to accessing only files labeled with the svirt_sandbox_file_t type, blocking reads of host files labeled etc_t, var_log_t, or any other host-specific label. Even if a process escapes the container namespace with root privileges, the LSM deny on the type mismatch prevents access to host resources.

Each container also receives a unique Multi-Category Security (MCS) label, such as s0:c123,c456. Two containers both running as svirt_lxc_net_t cannot read each other's files because their MCS categories differ. The LSM hook checks both the type and the MCS label on every access. This provides container-to-container isolation at the kernel level, independent of namespace boundaries, and functions even if the namespace isolation is compromised.
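The combined type-plus-category check can be modeled in a few lines. This is a simplified sketch under stated assumptions: struct toy_context and toy_mcs_allows() are hypothetical, the policy is reduced to a single permitted (subject type, object type) pair, and the category rule is "the subject must hold every category on the object". Real SELinux policy evaluation is considerably richer.

```c
#include <stdbool.h>
#include <string.h>

#define MCS_CATEGORIES 1024
#define MCS_WORDS (MCS_CATEGORIES / 32)

/* Toy security context: a type name plus a bitmap of MCS categories. */
struct toy_context {
    const char *type;
    unsigned int cats[MCS_WORDS];
};

/* Mark category c (0..1023) as present in the context. */
static void ctx_add_category(struct toy_context *ctx, unsigned int c)
{
    ctx->cats[c / 32] |= 1u << (c % 32);
}

/* Allow only if the policy pairs the two types AND the subject holds
 * every category carried by the object. */
static bool toy_mcs_allows(const struct toy_context *subj,
                           const struct toy_context *obj,
                           const char *allowed_subj_type,
                           const char *allowed_obj_type)
{
    if (strcmp(subj->type, allowed_subj_type) != 0 ||
        strcmp(obj->type, allowed_obj_type) != 0)
        return false;               /* type enforcement fails */
    for (int i = 0; i < MCS_WORDS; i++)
        if (obj->cats[i] & ~subj->cats[i])
            return false;           /* object has a category the subject lacks */
    return true;
}
```

A process labeled with c123,c456 passes the check for a file with the same categories and fails for a file labeled with another container's categories, even though both files share the svirt_sandbox_file_t type.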

Kubernetes

A Kubernetes cluster runs 200 pods on nodes using Ubuntu with AppArmor as the primary LSM. A pod running an internal API server should only read files under /app and /etc/ssl, write to /tmp, and never execute binaries outside /app/bin. Without mandatory access control, a remote code execution vulnerability in the API server could allow an attacker to read /etc/shadow, write to /var, or execute arbitrary downloaded binaries.

Kubernetes supports AppArmor profiles through pod annotations (e.g., container.apparmor.security.beta.kubernetes.io/api-server: localhost/k8s-api-restricted). The named profile must be loaded into the kernel on the node before the container starts. AppArmor registers itself in the LSM framework and intercepts file operations through hooks such as security_file_open(), security_file_permission(), and the path-based security_path_*() family. The profile specifies path-based rules: allow read on /app/**, allow read on /etc/ssl/**, allow write on /tmp/**, deny everything else. Each intercepted operation is checked against the loaded profile, and violations return EACCES to the calling process.
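The default-deny evaluation of such a profile can be sketched as follows. This is not AppArmor's matcher: toy_profile_check() and the rule table are hypothetical, and plain prefix comparison stands in for the /** glob. It also assumes the path has already been resolved, as the kernel's would be.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_READ  1
#define TOY_WRITE 2

/* One rule: paths under `prefix` may be accessed with the bits in `perms`. */
struct toy_path_rule {
    const char *prefix;
    int perms;
};

/* Rules from the example profile; prefix matching stands in for the glob. */
static const struct toy_path_rule toy_profile[] = {
    { "/app/",     TOY_READ  },
    { "/etc/ssl/", TOY_READ  },
    { "/tmp/",     TOY_WRITE },
};

/* Return 0 if some rule grants every requested bit, else -EACCES. */
static int toy_profile_check(const char *path, int requested)
{
    size_t n = sizeof(toy_profile) / sizeof(toy_profile[0]);

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(toy_profile[i].prefix);

        if (strncmp(path, toy_profile[i].prefix, len) == 0 &&
            (toy_profile[i].perms & requested) == requested)
            return 0;
    }
    return -EACCES;     /* default deny: nothing in the profile matched */
}
```

With this table, a read of /etc/shadow falls through every rule and is denied, which is the outcome the scenario above relies on.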

AppArmor profiles operate in two modes relevant to Kubernetes deployments. In enforce mode, violations are blocked and logged to the audit subsystem. In complain mode, violations are logged but permitted, enabling operators to develop profiles by observing normal application behavior before enforcing restrictions. On a cluster processing 10,000 requests per second per pod, the per-operation overhead of AppArmor LSM hooks adds approximately 200 nanoseconds per checked operation, which is negligible compared to application-level latencies.

Android

An Android phone runs 150 to 200 processes simultaneously, including system services (system_server, surfaceflinger, installd), third-party applications, and hardware abstraction layers. A single malicious or compromised app could, without mandatory access control, read SMS databases belonging to the messaging app, access microphone hardware, or modify system configuration files.

Since Android 5.0, SELinux runs in enforcing mode as a mandatory component of the Android security model. Every process is assigned a specific SELinux domain: third-party apps run as untrusted_app, the system server runs as system_server, and the camera HAL runs as hal_camera_default. The LSM framework checks every inter-process communication through security_binder_transaction(), every file access through security_inode_permission(), and every socket operation through security_socket_connect(). Over 200 hook points enforce that untrusted_app can only access files labeled app_data_file with a matching user ID, cannot open raw sockets, and cannot send binder transactions to most system services without going through the permission-checking intermediary.

The Android SELinux policy contains over 50,000 rules compiled into a binary policy file loaded at boot. Google's Compatibility Test Suite (CTS) verifies that device manufacturers do not weaken the policy by adding overly permissive rules. The neverallow rules in the policy act as compile-time assertions: a rule like neverallow untrusted_app system_data_file:file write guarantees that no allow rule in the entire policy grants write access from third-party app processes to system data files. This prevents both accidental policy misconfigurations and intentional weakening by device vendors.
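The "compile-time assertion" idea can be shown with a small checker. The sketch assumes exact string matches between rules; struct toy_rule and toy_find_violation() are hypothetical, and real neverallow processing also expands attributes and permission sets, which is omitted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One policy statement: source domain, target type, object class, permission. */
struct toy_rule {
    const char *source, *target, *class_, *perm;
};

static bool toy_rule_equal(const struct toy_rule *a, const struct toy_rule *b)
{
    return strcmp(a->source, b->source) == 0 &&
           strcmp(a->target, b->target) == 0 &&
           strcmp(a->class_, b->class_) == 0 &&
           strcmp(a->perm, b->perm) == 0;
}

/* Return the index of the first allow rule that a neverallow forbids,
 * or -1 if the policy is clean. A policy build would be expected to
 * fail on any non-negative result. */
static int toy_find_violation(const struct toy_rule *allow, size_t n_allow,
                              const struct toy_rule *never, size_t n_never)
{
    for (size_t i = 0; i < n_allow; i++)
        for (size_t j = 0; j < n_never; j++)
            if (toy_rule_equal(&allow[i], &never[j]))
                return (int)i;
    return -1;
}
```

The check runs over the policy text before anything is loaded, so a vendor rule granting untrusted_app write access to system_data_file is caught at build time and never reaches a device.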

Same Concept Across Tech

Concept	SELinux	AppArmor	Smack	TOMOYO	BPF LSM
Policy model	Type Enforcement with labels on every object	Path-based profiles with glob patterns	Simplified label-based (subject/object labels)	Path-based with learning mode for auto-policy	eBPF programs attached to hook points
Per-object data	security context pointer in security_struct blob	Profile reference in security_struct blob	Smack label in security_struct blob	Domain info in security_struct blob	BPF map lookups keyed by object properties
Policy update	Compile .te/.fc/.if → load binary policy module	Edit text profile → apparmor_parser -r	Write label rules to /smack/load2	Edit /etc/tomoyo/ policy files	Load/detach eBPF programs at runtime
Stacking role	Primary MAC on RHEL/Fedora	Primary MAC on Ubuntu/Debian	Lightweight MAC for embedded/IoT	Learning-focused MAC for auditing access patterns	Supplementary runtime policy on any distro

Stack Layer	LSM Component
Syscall entry	sys_openat(), sys_connect(), sys_kill() invoke VFS/net/signal code
VFS / networking / process	LSM hooks embedded at security-critical points call security_hook_heads
LSM framework	Iterates security_hook_list, calls each registered module, aggregates deny/allow
LSM modules	SELinux, AppArmor, Smack, TOMOYO, BPF LSM each implement hook callbacks
Per-object state	security_struct blob on inodes, tasks, creds, sockets stores per-LSM data

Design rationale: The kernel needed a way to support multiple, fundamentally different security models without hardcoding any one of them into core subsystems. The hook-based architecture means VFS, networking, and process management code contains a single call_int_hook() invocation at each security decision point. Which modules respond, and what logic they apply, is entirely decoupled from the code that triggers the hook. This separation is why SELinux labels and AppArmor paths and BPF programs can all coexist on the same hook point without modifying a single line in the VFS.

If You See This, Think This

Symptom	Likely Cause	First Check
EACCES with correct file permissions and ownership	LSM hook denying after DAC passes	cat /sys/kernel/security/lsm to identify active modules; then check module-specific logs
Container gets "Permission denied" on mounted host volume	SELinux label on host files does not match container domain	ls -Z on the host path; compare with ps -eZ on the container process
Application works outside container but fails inside	AppArmor or SELinux confining the container runtime profile	Run in permissive/complain mode to confirm; check dmesg or audit.log
BPF program loads but custom security policy has no effect	BPF LSM not in the lsm= boot parameter list	cat /sys/kernel/security/lsm must include "bpf"
Custom LSM module blocks nothing after loading	Hooks not registered via security_add_hooks() or module loaded after boot	Verify module init calls security_add_hooks(); check dmesg for registration messages
Operation blocked with no audit trail	BPF LSM program denying without logging; or audit subsystem rate-limiting	bpftool prog list

When to Use / Avoid

Understanding why EACCES appears when DAC permissions are correct
Debugging container isolation failures caused by SELinux label mismatches or AppArmor path denials
Implementing custom runtime security policies via BPF LSM without kernel recompilation
Auditing which security modules are active and in what order on a production system
Designing defense-in-depth strategies that layer multiple LSMs (MAC + BPF LSM)
Skip when the system is single-user, non-networked, and physical security is the primary control
Skip when rapid kernel development iteration requires disabling LSM overhead temporarily

Try It Yourself

# Check which LSMs are active and their order
cat /sys/kernel/security/lsm

# Check if BPF LSM is available in the kernel config
grep CONFIG_BPF_LSM /boot/config-$(uname -r) 2>/dev/null || zcat /proc/config.gz 2>/dev/null | grep BPF_LSM

# List all BPF programs, filter for LSM type
bpftool prog list 2>/dev/null | grep -A2 lsm || echo 'No BPF LSM programs loaded'

# Count LSM hooks defined in the kernel headers
grep -c 'LSM_HOOK' /usr/src/linux-headers-$(uname -r)/include/linux/lsm_hook_defs.h 2>/dev/null || echo 'Kernel headers not installed'

# View SELinux denials caused by LSM hooks
# (the fallback is grouped so it still fires when the output is piped to head)
{ ausearch -m AVC -ts today 2>/dev/null || echo 'No AVC records found or ausearch not available'; } | head -20

# View AppArmor denials from LSM hooks
{ dmesg 2>/dev/null | grep -i apparmor || echo 'No AppArmor messages'; } | tail -10

# Show the LSM security blob sizes allocated for each module
# (typically printed only when the kernel was booted with lsm.debug)
{ dmesg 2>/dev/null | grep -i 'lsm.*blob\|security.*blob' || echo 'No blob size messages in dmesg'; } | head -5

# Load and pin a BPF LSM program using bpftool (requires root and BPF LSM enabled)
# Note: loading and pinning alone does not attach the program to its hook
bpftool prog load ./lsm_deny_unlink.o /sys/fs/bpf/lsm_deny_unlink type lsm 2>/dev/null || echo 'BPF LSM load example (requires compiled .o)'

Debug Checklist

1. cat /sys/kernel/security/lsm -- verify which LSMs are active and their evaluation order
2. ausearch -m AVC -ts today | head -20 -- find recent SELinux denials with source/target context
3. dmesg | grep -i apparmor -- find AppArmor denial messages
4. bpftool prog list | grep lsm -- check for BPF LSM programs attached to hooks
5. ls -Z /path/to/file -- view SELinux security context on the file
6. ps -eZ | grep <process> -- view the SELinux domain of the running process
7. aa-status 2>/dev/null -- list loaded AppArmor profiles and their mode
8. grep denied /var/log/audit/audit.log | tail -10 -- raw audit log search for LSM denials

Key Takeaways

✓ LSM hooks fire AFTER DAC checks pass. If standard Unix permissions deny access, the LSM hook is never reached. This means LSMs can only further restrict access, never grant access that DAC denied. The design is intentionally restrictive: LSMs are an additional gate, not a bypass.
✓ Since kernel 5.1, LSMs that keep per-object state can stack. The lsm= boot parameter specifies the order, for example lsm=lockdown,capability,selinux,bpf. Each hook walks the registered modules in that order, and a single deny from any module blocks the operation. This enables layered security policies where SELinux provides baseline MAC and BPF LSM adds application-specific rules.
✓ BPF LSM (kernel 5.7+) allows attaching eBPF programs to any of the 200+ LSM hooks at runtime. Once bpf is in the active LSM list, adding or changing a policy needs no kernel recompilation and no reboot. The BPF verifier checks the program before it is allowed to run. This transforms LSM from a boot-time-only framework into a runtime-programmable security layer.
✓ The security_struct blob mechanism is what makes stacking possible. Before 5.1, each kernel object (inode, task) had a single void* security pointer, so only one LSM could store per-object data. The blob mechanism allocates a contiguous chunk partitioned among all active LSMs, with each LSM accessing its portion via a fixed offset.
✓ Hook placement is deliberate and follows the principle of complete mediation. Every path from a syscall to a security-sensitive kernel operation must pass through at least one LSM hook. The VFS layer alone has hooks at inode lookup, permission check, file open, read, write, mmap, and attribute changes. Missing a hook would create a bypass.

Common Pitfalls

✗ Assuming "Permission denied" always means DAC. When file permissions, ownership, and ACLs all check out but open() still returns EACCES, an LSM is the most likely cause. Check cat /sys/kernel/security/lsm to see which modules are active, then consult the appropriate audit log (ausearch -m AVC for SELinux, dmesg | grep -i apparmor for AppArmor).
✗ Believing LSMs can grant access. LSMs are restrictive-only hooks. They cannot override a DAC denial or grant permissions that the standard permission model rejects. If DAC denies, the LSM hook never runs. If DAC allows, the LSM gets a veto but cannot add further permissions.
✗ Switching off enforcement system-wide (setenforce 0, or unloading AppArmor profiles wholesale) to debug one denial. This suspends mandatory access control for every process, not just the one being debugged. A narrower approach: make only the affected SELinux domain permissive (semanage permissive -a <domain>) or put only the affected AppArmor profile in complain mode (aa-complain), reproduce the issue, read the audit log, and fix the specific rule.
✗ Ignoring BPF LSM programs during debugging. On systems with BPF LSM enabled, eBPF programs attached to LSM hooks can deny operations without leaving traditional audit log entries. Use bpftool prog list to check for attached BPF LSM programs and bpftool prog dump xlated to inspect their logic.

Reference

Kernel Hook Functions

security_file_open, security_inode_permission, security_task_alloc, security_socket_connect, security_bpf

Tools

cat /sys/kernel/security/lsm, bpftool prog list / bpftool prog show, ausearch -m AVC, /sys/kernel/security/

In One Line

Over 200 kernel hook points funnel every security-sensitive operation through a chain of registered modules, and a single deny from any LSM in the stack blocks the operation regardless of what DAC or other LSMs decided.

Linux Security Modules (LSM) Framework

DockerKubernetesAndroid

🧠

Mental Model

💡

The Problem

Architecture

A containerized process calls open("/etc/app/config.yaml", O_RDONLY). The file exists, permissions are 644, the process UID matches the owner. DAC says allow. But open() returns -EACCES.

What Actually Happens

Userspace calls open("/etc/app/config.yaml", O_RDONLY).
The kernel enters sys_openat(), which calls into the VFS path resolution code.
VFS calls inode_permission() to run the DAC check -- uid, gid, mode bits, ACLs.
If DAC denies, return -EACCES immediately. LSM hooks never fire.
If DAC passes, VFS calls security_file_open().
security_file_open() walks the security_hook_list for this hook point.
Each registered LSM's callback is invoked in registration order.
If any LSM returns a non-zero value (deny), the operation is blocked.
Only if every LSM returns zero (allow) does the file actually open.

The critical insight: LSMs can only deny. They cannot grant access that DAC rejected. They are a second gate, not an alternative path.

The Hook Architecture

The major hook categories:

VFS hooks -- Gate every file system operation.

security_inode_permission() -- permission check on inode access
security_file_open() -- called during file open, after path resolution
security_inode_create() -- before creating a new inode
security_inode_unlink() -- before deleting a file
security_inode_rename() -- before renaming
security_file_mmap() -- before memory-mapping a file

Network hooks -- Gate socket operations.

security_socket_create() -- before creating a socket
security_socket_connect() -- before connect()
security_socket_bind() -- before bind()
security_socket_sendmsg() -- before sending data

Process lifecycle hooks -- Gate process creation and signaling.

security_task_alloc() -- during fork/clone, before the new task runs
security_bprm_check() -- during exec, before the new binary runs
security_task_kill() -- before delivering a signal
security_cred_prepare() -- before preparing new credentials

BPF hooks -- Gate BPF operations themselves.

security_bpf() -- before bpf() syscall operations
security_bpf_map() -- before accessing a BPF map
security_bpf_prog() -- before loading a BPF program

The security_struct Blob

Every major kernel object needs to store per-LSM state. SELinux needs a pointer to the security context. AppArmor needs a reference to the confining profile. Smack needs a label string.

Before kernel 5.4, each object had a single void *security pointer. Only one LSM could use it. This was the fundamental barrier to stacking.

Kernel 5.4 introduced lsm_blob_sizes. During initialization, each LSM declares how many bytes it needs per object type:

/* From SELinux's initialization */
static struct lsm_blob_sizes selinux_blob_sizes = {
    .lbs_cred   = sizeof(struct task_security_struct),
    .lbs_file   = sizeof(struct file_security_struct),
    .lbs_inode  = sizeof(struct inode_security_struct),
    .lbs_ipc    = sizeof(struct ipc_security_struct),
    .lbs_msg_msg = sizeof(struct msg_security_struct),
};

BPF LSM: Runtime-Programmable Security

Traditional LSMs require compiling policy into kernel image or loading at boot. BPF LSM (kernel 5.7+, CONFIG_BPF_LSM=y) changes this by allowing eBPF programs to attach to any LSM hook at runtime.

A BPF LSM program:

Is written in C, compiled to BPF bytecode with clang.
Declares SEC("lsm/hook_name") to specify which hook to attach to.
Receives the same arguments as the kernel hook function.
Returns 0 to allow or a negative errno to deny.
Can read kernel data structures via BPF CO-RE (Compile Once, Run Everywhere).
Can use BPF maps for configuration, state tracking, and communication with userspace.

The BPF verifier guarantees safety: no infinite loops, no out-of-bounds memory access, no kernel crashes. The program is JIT-compiled to native machine code for performance.

To enable BPF LSM, the lsm= boot parameter must include bpf:

# Check current LSM list
cat /sys/kernel/security/lsm
# Output: lockdown,capability,yama,selinux

# Add bpf to the list via GRUB
# In /etc/default/grub:
# GRUB_CMDLINE_LINUX="lsm=lockdown,capability,yama,selinux,bpf"
# Then: grub2-mkconfig -o /boot/grub2/grub.cfg && reboot

After reboot, verify:

cat /sys/kernel/security/lsm
# Output: lockdown,capability,yama,selinux,bpf

LSM Stacking

Before kernel 5.4, only one "major" LSM could be active (SELinux OR AppArmor, not both). Minor LSMs (capability, yama, loadpin, lockdown) could always stack because they used no per-object storage.

After 5.4, the blob mechanism enables true stacking. The lsm= boot parameter controls which LSMs load and in what order:

# RHEL with SELinux + BPF LSM
lsm=lockdown,capability,yama,selinux,bpf

# Ubuntu with AppArmor + BPF LSM
lsm=lockdown,capability,yama,apparmor,bpf

When a hook fires, every LSM in the list gets called. The first deny wins. This means:

SELinux can provide baseline MAC enforcement.
BPF LSM can add application-specific rules on top.
Both evaluate independently on the same hook invocation.
Neither LSM needs awareness of the other.

Under the Hood

Common Questions

How does the kernel know which LSMs to activate?

Can an LSM be loaded or unloaded at runtime?

What happens when two stacked LSMs conflict?

How does BPF LSM interact with SELinux on the same hook?

How Technologies Use This

Docker

Kubernetes

Android

Same Concept Across Tech

Concept	SELinux	AppArmor	Smack	TOMOYO	BPF LSM
Policy model	Type Enforcement with labels on every object	Path-based profiles with glob patterns	Simplified label-based (subject/object labels)	Path-based with learning mode for auto-policy	eBPF programs attached to hook points
Per-object data	security context pointer in security_struct blob	Profile reference in security_struct blob	Smack label in security_struct blob	Domain info in security_struct blob	BPF map lookups keyed by object properties
Policy update	Compile .te/.fc/.if → load binary policy module	Edit text profile → apparmor_parser -r	Write label rules to /smack/load2	Edit /etc/tomoyo/ policy files	Load/detach eBPF programs at runtime
Stacking role	Primary MAC on RHEL/Fedora	Primary MAC on Ubuntu/Debian	Lightweight MAC for embedded/IoT	Learning-focused MAC for auditing access patterns	Supplementary runtime policy on any distro

Stack Layer	LSM Component
Syscall entry	sys_openat(), sys_connect(), sys_kill() invoke VFS/net/signal code
VFS / networking / process	LSM hooks embedded at security-critical points call security_hook_heads
LSM framework	Iterates security_hook_list, calls each registered module, aggregates deny/allow
LSM modules	SELinux, AppArmor, Smack, TOMOYO, BPF LSM each implement hook callbacks
Per-object state	security_struct blob on inodes, tasks, creds, sockets stores per-LSM data

If You See This, Think This

Symptom	Likely Cause	First Check
EACCES with correct file permissions and ownership	LSM hook denying after DAC passes	cat /sys/kernel/security/lsm to identify active modules; then check module-specific logs
Container gets "Permission denied" on mounted host volume	SELinux label on host files does not match container domain	ls -Z on the host path; compare with ps -eZ on the container process
Application works outside container but fails inside	AppArmor or SELinux confining the container runtime profile	Run in permissive/complain mode to confirm; check dmesg or audit.log
BPF program loads but custom security policy has no effect	BPF LSM not in the lsm= boot parameter list	cat /sys/kernel/security/lsm must include "bpf"
Custom LSM module blocks nothing after loading	Hooks not registered via security_add_hooks() or module loaded after boot	Verify module init calls security_add_hooks(); check dmesg for registration messages
Operation blocked with no audit trail	BPF LSM program denying without logging; or audit subsystem rate-limiting	bpftool prog list
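The first-check column can be turned into a small triage helper. A minimal sketch, assuming the comma-separated format of /sys/kernel/security/lsm; the function names and the `FIRST_CHECK` mapping are hypothetical, and the commands are the ones used elsewhere in this article.

```python
from typing import Dict, List

# Hypothetical mapping from LSM name to the first place to look for denials.
FIRST_CHECK: Dict[str, str] = {
    "selinux": "ausearch -m AVC -ts today",
    "apparmor": "dmesg | grep -i apparmor",
    "bpf": "bpftool prog list",
}


def parse_lsm_list(text: str) -> List[str]:
    """Split the comma-separated contents of /sys/kernel/security/lsm."""
    return [name for name in text.strip().split(",") if name]


def first_checks(text: str) -> List[str]:
    """Return the commands worth running for the active modules, in stack order."""
    return [FIRST_CHECK[name] for name in parse_lsm_list(text) if name in FIRST_CHECK]


def read_active_lsms(path: str = "/sys/kernel/security/lsm") -> str:
    """Read the live list; only works on Linux with securityfs mounted."""
    with open(path) as handle:
        return handle.read()
```

On a live system, `first_checks(read_active_lsms())` gives a starting list of commands. If "bpf" is missing from the parsed list, BPF LSM programs cannot be enforcing anything, which matches the fourth row of the table.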

When to Use / Avoid

Understanding why EACCES appears when DAC permissions are correct
Debugging container isolation failures caused by SELinux label mismatches or AppArmor path denials
Implementing custom runtime security policies via BPF LSM without kernel recompilation
Auditing which security modules are active and in what order on a production system
Designing defense-in-depth strategies that layer multiple LSMs (MAC + BPF LSM)
Skip when the system is single-user, non-networked, and physical security is the primary control
Skip when rapid kernel development iteration requires disabling LSM overhead temporarily

Try It Yourself

# Check which LSMs are active and their order
cat /sys/kernel/security/lsm

# Check if BPF LSM is available in the kernel config
grep CONFIG_BPF_LSM /boot/config-$(uname -r) 2>/dev/null || zcat /proc/config.gz 2>/dev/null | grep BPF_LSM

# List all BPF programs, filter for LSM type
bpftool prog list 2>/dev/null | grep -A2 lsm || echo 'No BPF LSM programs loaded'

# Count LSM hooks defined in the kernel headers
grep -c 'LSM_HOOK' /usr/src/linux-headers-$(uname -r)/include/linux/lsm_hook_defs.h 2>/dev/null || echo 'Kernel headers not installed'

# View SELinux denials recorded by the audit subsystem
# (the fallback is grouped so it fires when ausearch fails, not when head fails)
{ ausearch -m AVC -ts today 2>/dev/null || echo 'No AVC records found or ausearch not available'; } | head -20

# View AppArmor denials (kernel log entries use lowercase apparmor="DENIED")
{ dmesg 2>/dev/null | grep -i apparmor || echo 'No AppArmor messages'; } | tail -10

# Show the LSM security blob sizes (printed only when booted with lsm.debug)
{ dmesg 2>/dev/null | grep -i 'lsm.*blob\|security.*blob' || echo 'No blob size messages in dmesg'; } | head -5

# Load and pin a BPF LSM program with bpftool (requires root, BPF LSM enabled, and a compiled .o)
# Loading alone does not attach the program to a hook
bpftool prog load ./lsm_deny_unlink.o /sys/fs/bpf/lsm_deny_unlink type lsm 2>/dev/null || echo 'BPF LSM load example (requires compiled .o)'

Debug Checklist

1. cat /sys/kernel/security/lsm -- verify which LSMs are active and their evaluation order
2. ausearch -m AVC -ts today | head -20 -- find recent SELinux denials with source/target context
3. dmesg | grep -i apparmor -- find AppArmor denial messages
4. bpftool prog list | grep lsm -- check for BPF LSM programs attached to hooks
5. ls -Z /path/to/file -- view SELinux security context on the file
6. ps -eZ | grep <process> -- view the SELinux domain of the running process
7. aa-status 2>/dev/null -- list loaded AppArmor profiles and their mode
8. grep denied /var/log/audit/audit.log | tail -10 -- raw audit log search for LSM denials
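Steps 2, 5, and 6 all come down to comparing the source and target contexts in an AVC record. The sketch below pulls those fields out of one line. The `SAMPLE` record is hand-written to follow the usual AVC layout and is not captured output; the PID, inode, and labels in it are invented, and `parse_avc` and `selinux_type` are hypothetical helper names.

```python
import re
from typing import Dict, Optional

# Hand-written sample in the usual AVC layout; every value in it is invented.
SAMPLE = (
    'type=AVC msg=audit(1700000000.123:456): avc:  denied  { read } for  '
    'pid=1234 comm="app" name="config.yaml" dev="sda1" ino=5678 '
    'scontext=system_u:system_r:container_t:s0:c1,c2 '
    'tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=0'
)

AVC_RE = re.compile(
    r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}.*?"
    r"scontext=(?P<scontext>\S+)\s+tcontext=(?P<tcontext>\S+)\s+tclass=(?P<tclass>\S+)"
)


def parse_avc(line: str) -> Optional[Dict[str, str]]:
    """Return perms, scontext, tcontext, and tclass, or None for non-AVC lines."""
    match = AVC_RE.search(line)
    return match.groupdict() if match else None


def selinux_type(context: str) -> str:
    """Extract the type field (third colon-separated part) of a context."""
    return context.split(":")[2]
```

For the sample line, the source type is `container_t` and the target type is `user_home_t`. That is the kind of mismatch to confirm with `ls -Z` on the file and `ps -eZ` on the process.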

Key Takeaways

✓ LSM hooks fire AFTER DAC checks pass. If standard Unix permissions deny access, the LSM hook is never reached. This means LSMs can only further restrict access, never grant access that DAC denied. The design is intentionally restrictive: LSMs are an additional gate, not a bypass.
✓ Since kernel 5.4, multiple LSMs can stack: one exclusive MAC module (SELinux, AppArmor, or Smack) alongside non-exclusive modules such as lockdown, Yama, and BPF LSM. The lsm= boot parameter specifies the order, for example lsm=lockdown,capability,selinux,bpf. Each hook walks the registered modules in that order, and a single deny from any module blocks the operation. This enables layered security policies where SELinux provides baseline MAC and BPF LSM adds application-specific rules.
✓ BPF LSM (kernel 5.7+) allows attaching eBPF programs to any of the 200+ LSM hooks at runtime. No kernel recompilation, no reboot. The BPF verifier ensures the program is safe. This transforms LSM from a boot-time-only framework into a runtime-programmable security layer.
✓ The security_struct blob mechanism is what makes stacking possible. Before 5.4, each kernel object (inode, task) had a single void* security pointer, so only one LSM could store per-object data. The blob mechanism allocates a contiguous chunk partitioned among all active LSMs, with each LSM accessing its portion via a fixed offset.
✓ Hook placement is deliberate and follows the principle of complete mediation. Every path from a syscall to a security-sensitive kernel operation must pass through at least one LSM hook. The VFS layer alone has hooks at inode lookup, permission check, file open, read, write, mmap, and attribute changes. Missing a hook would create a bypass.
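The partitioning described for the blob mechanism can be sketched as a layout calculation. This is a user-space model under stated assumptions: the per-module sizes are hypothetical, the 8-byte alignment assumes a 64-bit kernel, and `layout_blob` is an illustrative name rather than a kernel function.

```python
from typing import Dict, List, Tuple

POINTER_SIZE = 8  # assumption: 64-bit kernel, offsets aligned to pointer size


def align(value: int, boundary: int) -> int:
    """Round value up to the next multiple of boundary."""
    return (value + boundary - 1) // boundary * boundary


def layout_blob(requests: List[Tuple[str, int]]) -> Tuple[Dict[str, int], int]:
    """Assign each LSM a fixed offset inside one shared per-object blob.

    requests lists (module name, bytes requested) in stack order.
    Returns the offset per module and the total blob size.
    """
    offsets: Dict[str, int] = {}
    total = 0
    for name, need in requests:
        if need <= 0:
            continue  # module stores nothing on this object type
        offset = align(total, POINTER_SIZE)
        offsets[name] = offset
        total = offset + need
    return offsets, total


# Hypothetical per-inode sizes, chosen only to show the alignment step.
INODE_REQUESTS = [("selinux", 20), ("bpf", 8), ("yama", 0)]
```

With these made-up sizes, `layout_blob(INODE_REQUESTS)` places the first module at offset 0, rounds the second up to offset 24, skips the module that requested nothing, and reports a 32-byte blob. Each module then reaches its data by adding its fixed offset to the object's blob pointer.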

Common Pitfalls

✗ Assuming "Permission denied" always means DAC. When file permissions, ownership, and ACLs all check out but open() still returns EACCES, an LSM is the most likely cause. Check cat /sys/kernel/security/lsm to see which modules are active, then consult the appropriate audit log (ausearch -m AVC for SELinux, dmesg | grep -i apparmor for AppArmor).
✗ Believing LSMs can grant access. LSMs are restrictive-only hooks. They cannot override a DAC denial or grant permissions that the standard permission model rejects. If DAC denies, the LSM hook never runs. If DAC allows, the LSM gets a veto but cannot add further permissions.
✗ Switching off enforcement system-wide (setenforce 0, booting with selinux=0, or unloading every AppArmor profile) to debug one denial. This drops mandatory access control for everything, not just the offending rule. The narrower approach: make only the affected domain permissive (semanage permissive -a <domain> for SELinux) or put only the affected profile in complain mode (aa-complain for AppArmor), reproduce the issue, read the audit log, and fix the specific rule.
✗ Ignoring BPF LSM programs during debugging. On systems with BPF LSM enabled, eBPF programs attached to LSM hooks can deny operations without leaving traditional audit log entries. Use bpftool prog list to check for attached BPF LSM programs and bpftool prog dump to inspect their logic.

Reference

System Calls

security_file_open, security_inode_permission, security_task_alloc, security_socket_connect, security_bpf

Tools

cat /sys/kernel/security/lsm, bpftool prog list / bpftool prog show, ausearch -m AVC, /sys/kernel/security/

